curl and Japanese web content

From: <denis.papathanasiou_at_gmail.com>
Date: Wed, 10 Sep 2008 17:31:40 -0400

Can anyone explain why this command returns Japanese text properly:

(1) curl "http://rss.asahi.com/f/asahi_international"

While this command returns a series of "????????????" strings where the
Japanese text should be:

(2) curl "http://www.asahi.com/politics/update/0911/TKY200809100296.html"

From what I can tell, the first URL is an RSS feed (managed by
pheedo.com) whose Content-Type header declares the charset as utf-8.

The second is an HTML page served directly by asahi.com and encoded as
EUC-JP. (N.B. the HTTP response header gave a Content-Type of text/html
with no charset, but one of the meta tags in the returned HTML
identified the encoding as EUC-JP.)
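
To make the mismatch concrete, here is a minimal Python sketch of reading the charset from the meta tag and converting the bytes to UTF-8 locally. The function names and the sample markup are hypothetical, and it assumes the page really is encoded the way its meta tag claims; it is an illustration, not something curl does itself.

```python
import re


def sniff_meta_charset(raw: bytes, default: str = "utf-8") -> str:
    """Return the charset named in an HTML meta tag, or `default` if none is found."""
    # Scan only the first 2 KiB; the declaration normally sits in <head>.
    # Non-ASCII bytes are dropped because only the ASCII markup matters here.
    head = raw[:2048].decode("ascii", errors="ignore")
    match = re.search(r'charset\s*=\s*["\']?([A-Za-z0-9_\-]+)', head, re.IGNORECASE)
    return match.group(1) if match else default


def to_utf8(raw: bytes) -> bytes:
    """Decode `raw` using the sniffed charset and re-encode it as UTF-8."""
    charset = sniff_meta_charset(raw)
    # Raises LookupError if Python has no codec for the declared charset.
    return raw.decode(charset, errors="replace").encode("utf-8")


# Hypothetical sample standing in for the asahi.com page body.
sample = (
    '<html><head><meta http-equiv="Content-Type" '
    'content="text/html; charset=EUC-JP"></head>'
    "<body>日本語</body></html>"
).encode("euc_jp")

converted = to_utf8(sample)
```

With a page saved to disk (for example via curl -o page.html), the same two functions could be applied to the bytes read from that file.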

I retried (2) with an explicit header to accept EUC-JP (i.e., curl -H
"Accept-Encoding: charset=EUC-JP"), but that didn't change the result.

Is there anything else I can do?

For the record, here is the verbose output from each request:

(1) curl -v "http://rss.asahi.com/f/asahi_international"
* About to connect() to rss.asahi.com port 80
* Trying 59.106.75.65... connected
* Connected to rss.asahi.com (59.106.75.65) port 80
> GET /f/asahi_international HTTP/1.1
> User-Agent: curl/7.15.5 (i486-pc-linux-gnu) libcurl/7.15.5 OpenSSL/0.9.8c zlib/1.2.3 libidn/0.6.5
> Host: rss.asahi.com
> Accept: */*
>
< HTTP/1.1 200 OK
< Date: Wed, 10 Sep 2008 21:24:59 GMT
< Server: Apache
< X-Powered-By: PHP/5.2.3-1ubuntu6.4
< ETag: "275cec-3d1e-cf6ab980"
< Last-Modified: Wed, 10 Sep 2008 20:51:02 GMT
< Content-Type: text/xml; charset="utf-8"
< Set-Cookie: phdo=1-tst%7Cv2%3A079bd0e8c5cfff220f49cc0dbdf6c5ea%3AYJD%2F%2FAiDEeeQrBFshWpktjbocsHe%2F%2Be6kpYGxZpEHJ3w0UigV07yP68S98ttcnmb; expires=Thu, 11-Sep-2008 21:24:59 GMT; path=/; domain=www.pheedo.jp
< Transfer-Encoding: chunked
< Connection: close
< Via: 1.1 AN-0003011042114406
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 42769    0 42769    0     0  35508      0 --:--:--  0:00:01 --:--:-- 59708
* Closing connection #0

(2) curl -v "http://www.asahi.com/politics/update/0911/TKY200809100296.html"
* About to connect() to www.asahi.com port 80
* Trying 96.17.10.17... connected
* Connected to www.asahi.com (96.17.10.17) port 80
> GET /politics/update/0911/TKY200809100296.html HTTP/1.1
> User-Agent: curl/7.15.5 (i486-pc-linux-gnu) libcurl/7.15.5 OpenSSL/0.9.8c zlib/1.2.3 libidn/0.6.5
> Host: www.asahi.com
> Accept: */*
>
< HTTP/1.1 200 OK
< Server: Apache/2
< ETag: "1e346b-4a94-81de0ac0"
< Last-Modified: Wed, 10 Sep 2008 18:06:18 GMT
< Content-Type: text/html
< Cache-Control: max-age=1
< Expires: Wed, 10 Sep 2008 21:26:37 GMT
< Date: Wed, 10 Sep 2008 21:26:36 GMT
< Content-Length: 19092
< Connection: keep-alive
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 19092  100 19092    0     0  41435      0 --:--:-- --:--:-- --:--:--  838k
* Connection #0 to host www.asahi.com left intact
* Closing connection #0
Received on 2008-09-10