curl-users

Re: How to deal with special characters / character encoding?

From: Peter Wullinger <peter.wullinger_at_gmail.com>
Date: Thu, 15 Nov 2007 11:27:51 +0100

Warning: Long reply.

2007/11/15, RM <rastaboymarley_at_gmail.com>:
>
> Hi all,
>
> Sorry if this question has an easy answer, but I couldn't find anything on
> Google or in the archives.
>
> Running this on the command line (as an example):
> curl http://www.msnbc.msn.com/id/21773401/ | grep '<title>'

This yields
  <title>Wow! Japan's moon probe updates Earthrise - Space.com - MSNBC.com</title>
here.

> So, is there any way I can get curl to bring back the same results as my
> browser?

cURL does indeed bring back the same results as your browser. But the output
is not rendered correctly, and this is not cURL's fault.

Welcome to the world of character encodings, a brief introduction:

Your terminal is most likely set to a character encoding that is different
from the one used by the retrieved HTML document. In that case it is only to
be expected that your terminal cannot render the text correctly, because the
terminal is not aware of the particular character set used inside the
document:

curl -i http://www.msnbc.msn.com/id/21773401/

gives

[...]
Content-Type: text/html; charset=utf-8
[...]

This means your terminal's character set is set to something other than
utf-8 (which one depends on the operating system and your current user's
settings), and the terminal interprets the byte 0xe2 according to the
character set it has been set up to display.
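
If you want to compare the two sides yourself, here is a rough Python
sketch. It pulls the charset out of a Content-Type value like the one above
and prints it next to the encoding your locale reports. The helper name
declared_charset is made up for this example, and the locale encoding is only
an approximation of what your terminal actually uses.

```python
import locale
from email.message import Message
from typing import Optional


def declared_charset(content_type: str) -> Optional[str]:
    """Return the charset parameter of a Content-Type value, or None."""
    # Hypothetical helper: reuse the stdlib header parser instead of
    # splitting the string by hand.
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


# Header value taken from the curl -i output above.
document_charset = declared_charset("text/html; charset=utf-8")

# Encoding the current locale asks for; terminals usually follow it,
# but that is an assumption, not a guarantee.
terminal_charset = locale.getpreferredencoding(False)

print("document:", document_charset)
print("terminal:", terminal_charset)
```

If the two lines differ, text printed straight to the terminal can come out
garbled.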

Since you get "â", your character set is most likely ISO 8859-1 (see e.g.
http://de.wikipedia.org/wiki/ISO_8859-1), since "â" is the correct character
for the code 0xe2 there. But what you are being sent is a three-byte
character encoded in utf-8, namely the apostrophe ('), and 0xe2 is only its
first byte, so it does not get displayed correctly.
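
To make the mix-up concrete, here is a minimal Python sketch. It assumes the
character in the title is the typographic apostrophe U+2019 (I have not
checked the page byte by byte), encodes it as utf-8, and then decodes the
same bytes the way an ISO 8859-1 terminal would.

```python
# Assumption: the apostrophe in "Japan's" is U+2019 (right single quote).
apostrophe = "\u2019"

# What the server sends: the utf-8 encoding of that one character.
utf8_bytes = apostrophe.encode("utf-8")
print(utf8_bytes.hex(" "))  # e2 80 99

# What an ISO 8859-1 terminal makes of it: one character per byte.
misread = utf8_bytes.decode("iso-8859-1")
print(repr(misread))  # starts with "â", followed by two control characters

# Decoding with the charset the server declared gives the character back.
print(utf8_bytes.decode("utf-8") == apostrophe)
```

The first byte, 0xe2, is the "â" you see. The remaining two bytes fall into
a range that ISO 8859-1 reserves for control characters, so they are
normally not visible.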

Peter
Received on 2007-11-15