Re: curl and Japanese web content

From: Dan Fandrich <dan_at_coneharvesters.com>
Date: Wed, 10 Sep 2008 14:56:30 -0700

On Wed, Sep 10, 2008 at 05:31:40PM -0400, denis.papathanasiou_at_gmail.com wrote:
> Can anyone explain why this command returns Japanese text properly:
>
> (1) curl "http://rss.asahi.com/f/asahi_international"
>
> While this command returns a series of "????????????" strings where the
> Japanese text should be:
>
> (2) curl "http://www.asahi.com/politics/update/0911/TKY200809100296.html"
>
> From what I can tell from the two urls, the first is an rss feed
> (managed by pheedo.com) whose content-type is utf-8.
>
> The second is an html page directly on asahi.com whose content-type is
> EUC-JP (n.b. the http reply header did not specify a content-type, but
> looking at the html which came in reply, one of the meta tags identified
> the encoding as EUC-JP).

That's exactly why you're seeing differences in the output.

> I retried (2) with an explicit header to accept EUC-JP (i.e., curl -H
> "Accept-Encoding: charset=EUC-JP"), but that didn't change the result.

For one thing, you want Accept-Charset here, not Accept-Encoding. For
another, this is just a hint to the server as to your preference. There's
no guarantee the server will give you what you want.
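
As a rough sketch of the point above, one way to see what the server actually declares is to send the Accept-Charset hint and then look at the Content-Type header of the reply. The function names charset_from_headers and check_charset are made up for this illustration, and the parsing assumes an unquoted charset= parameter; a server is free to ignore the hint or to omit the charset entirely, as the page in (2) apparently did.

```shell
# Hypothetical helper: read HTTP response headers on stdin and print the
# charset parameter of the Content-Type header, or nothing if none is given.
charset_from_headers() {
    tr -d '\r' \
        | grep -i '^content-type:' \
        | sed -n 's/.*charset=\([^;[:space:]]*\).*/\1/p' \
        | head -n 1
}

# Hypothetical wrapper: request EUC-JP (only a hint to the server) and
# report the charset the server declares in its reply.
# -s: no progress meter, -D -: write response headers to stdout,
# -o /dev/null: discard the body.
check_charset() {
    curl -s -D - -o /dev/null -H "Accept-Charset: EUC-JP" "$1" | charset_from_headers
}
```

Calling check_charset with the URL from (2) would print the declared charset if there is one, and nothing if the server leaves it out of the header.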

> Is there anything else I can do?

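You can convert the text into your desired character set once you
download it. Try piping the output of curl through iconv, i.e. append
"| iconv -f EUC-JP" to the curl command. Without a -t option, iconv
converts to the character set of your current locale.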
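
A minimal sketch of that pipeline, assuming the terminal expects UTF-8 (hence the explicit -t UTF-8). The function name fetch_eucjp_as_utf8 and the file name sample_utf8.txt are invented for this example. The last command exercises only the conversion step on a few locally generated bytes, so it can be tried without network access.

```shell
# Hypothetical helper: download a page and convert it from EUC-JP to UTF-8.
# -s silences curl's progress meter so only the page body reaches iconv.
fetch_eucjp_as_utf8() {
    curl -s "$1" | iconv -f EUC-JP -t UTF-8
}

# Offline check of the conversion step alone: these six bytes are the
# EUC-JP encoding of the three characters 日本語.
printf '\306\374\313\334\270\354' | iconv -f EUC-JP -t UTF-8 > sample_utf8.txt
```

With the helper defined, fetch_eucjp_as_utf8 "http://www.asahi.com/politics/update/0911/TKY200809100296.html" would be the equivalent of command (2) followed by the conversion.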

>>> Dan

-- 
http://www.MoveAnnouncer.com              The web change of address service
          Let webmasters know that your web site has moved
Received on 2008-09-10