curl-library

Re: encoding expectations

From: Michael Kilburn <crusader.mike_at_gmail.com>
Date: Mon, 23 Apr 2018 22:08:46 -0500

 On Fri, 20 Apr 2018, Daniel Stenberg wrote:

>> - if HTTP protocol defines headers as US-ASCII-encoded string -- libcurl has
>> to convert current encoding to US-ASCII
>
> I disagree and we also can't do that without breaking compatibility.

Then the only way this can work is if libcurl demands the following:
- the execution character set (ECS) libcurl is compiled with should be
US-ASCII compatible, at least with respect to the symbols ('@', '.', '/',
etc.) used for interpreting user input (e.g. parsing URLs)
- user-passed strings should be in a US-ASCII-compatible encoding (as
required by the HTTP standard)

i.e. libcurl uses this approach:

>> Another option is to put conversion burden on the user -- basically
>> declare that each string passed to (received from) libcurl is in specific
>> predetermined encoding (US-ASCII/UTF8/etc)
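
To illustrate what that burden could look like on the caller's side, here is
a minimal sketch. It assumes the application wants to reject anything outside
7-bit US-ASCII before handing a string to libcurl. The helper name
is_us_ascii is hypothetical and not part of libcurl.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not a libcurl function): returns 1 if every byte of
   the NUL-terminated string is within the 7-bit US-ASCII range, else 0. */
static int is_us_ascii(const char *s)
{
  assert(s != NULL);
  for(; *s; s++) {
    /* Cast first: plain char may be signed, so bytes >= 0x80 could
       otherwise compare as negative values. */
    if((unsigned char)*s > 0x7F)
      return 0;
  }
  return 1;
}
```

A caller would run this check on a URL or header string and convert or
reject the input itself when it returns 0, rather than relying on libcurl to
do so.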

>> - if DNS protocol expects bytes of US-ASCII-encoded strings on the wire --
>> libcurl has to do the conversion
>
> That is done for the host name part now if curl is built with IDN support.

So, hostname (once extracted from url) is treated as opaque bytes unless
curl is built with IDN support -- in which case it'll assume it is in
"current" encoding and try to convert it into punycode before sending to
DNS server. Am I correct?

> We could possibly error out if non-ascii symbols are used in the name without
> an IDN library, but that's also not how ping or telnet and other tools work
> without IDN support and again not really in the spirit of curl: it uses what
> you pass it. If when you pass it crap, it tries to make use of it.

Got it. :-) I guess I'll have to link libcurl statically to avoid problems
if the client's libcurl is built the wrong way.

> For IDN, the host name is expected to use your current locale. (I'm not sure
> for winidn.)

idn_win32.c suggests that on Windows libcurl expects the hostname to be
encoded in UTF-8. Can you confirm, please?

> The rest of the URL is assumed to be "raw" single bytes where
> each byte is a letter and everything else is suitably URL encoded with
> percent encoding %20-style. (I should avoid using the word ASCII here since
> it works for EBCDIC as well.)

How does it work for EBCDIC without converting it to ASCII? These parts of
the URL should end up in an HTTP header somewhere, and the HTTP standard
requires headers to be US-ASCII encoded (afaik).
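
To make the concern concrete, here is a rough sketch of %20-style encoding
that treats its input as opaque bytes. It is not libcurl's implementation;
percent_encode and is_unreserved_ascii are hypothetical names. The unreserved
set (letters, digits, '-', '.', '_', '~') is tested by ASCII code point, and
the sketch assumes an ASCII-compatible execution character set for the bytes
it emits. On an EBCDIC build the same letters have different byte values,
which is why I would expect a conversion step somewhere.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: 1 if the byte is an RFC 3986 "unreserved" character,
   tested by ASCII code point rather than by character literal. */
static int is_unreserved_ascii(unsigned char c)
{
  return (c >= 0x41 && c <= 0x5A) ||   /* A-Z */
         (c >= 0x61 && c <= 0x7A) ||   /* a-z */
         (c >= 0x30 && c <= 0x39) ||   /* 0-9 */
         c == 0x2D || c == 0x2E ||     /* '-' '.' */
         c == 0x5F || c == 0x7E;       /* '_' '~' */
}

/* Hypothetical helper: percent-encode len bytes of src into dst as a
   NUL-terminated string. Returns the number of characters written (not
   counting the NUL), or (size_t)-1 if dst is too small. */
static size_t percent_encode(const unsigned char *src, size_t len,
                             char *dst, size_t dstlen)
{
  static const char hex[] = "0123456789ABCDEF";
  size_t i, o = 0;

  assert(src != NULL && dst != NULL);
  for(i = 0; i < len; i++) {
    unsigned char c = src[i];
    if(is_unreserved_ascii(c)) {
      if(o + 1 >= dstlen)              /* 1 char + room for the NUL */
        return (size_t)-1;
      dst[o++] = (char)c;
    }
    else {
      if(o + 3 >= dstlen)              /* 3 chars + room for the NUL */
        return (size_t)-1;
      dst[o++] = '%';
      dst[o++] = hex[c >> 4];
      dst[o++] = hex[c & 0x0F];
    }
  }
  if(o >= dstlen)                      /* no room for the NUL */
    return (size_t)-1;
  dst[o] = '\0';
  return o;
}
```

With this sketch the six bytes 'a', ' ', 'b', '/', 0xC3, 0xA4 become
"a%20b%2F%C3%A4". The hex digits after '%' describe the raw byte value, so
the result depends on which encoding produced those bytes in the first place.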

> IDNA 2003 vs IDNA 2008

Sigh... I didn't know punycode can be so complex.

Received on 2018-04-24