
curl-library

Re: encoding expectations

From: Michael Kilburn <crusader.mike_at_gmail.com>
Date: Wed, 2 May 2018 01:34:05 -0500

 Daniel Stenberg wrote:

>> - CURLOPT_URL string is expected to be in US-ASCII-compatible encoding where
>> certain symbols (slash, dot, @, etc) can't appear as part of another
>> (multibyte) character

> I don't understand what "can't appear as part of another (multibyte)
> character" means.

On Linux, when curl is built with IDN support, the hostname is assumed to be
in the "current" encoding, which could be multi-byte (e.g. UTF-8). Some
multi-byte encodings allow the byte value for a dot, slash, etc. to occur as
part of a multi-byte character -- if that happens, libcurl won't be able to
parse the URL correctly (to figure out which hostname to resolve).
Luckily, UTF-8 is built in such a way that no US-ASCII byte can occur as part
of a multi-byte character -- that is why it is safe to look for slashes and
dots in a UTF-8 string. But not every multi-byte encoding is so forgiving...
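
To make the point concrete, here is a small Python sketch (not libcurl code;
the helper name and the sample character are my own choices for illustration).
It searches the raw bytes of one Japanese character for '@'. The character
contains no '@', yet its Shift_JIS encoding has 0x40 as its second byte, so a
naive byte scan gets a false hit. The UTF-8 encoding of the same character
consists only of bytes >= 0x80, so the scan is safe there.

```python
# Illustration only: byte-level search for an ASCII delimiter in text
# stored in two different encodings. The helper name and the sample
# character are assumptions made for this example.

def contains_ascii_byte(data: bytes, delimiter: str) -> bool:
    """Return True if the delimiter's ASCII byte occurs anywhere in data."""
    return delimiter.encode("ascii") in data

text = "\u30a1"  # KATAKANA LETTER SMALL A; the text itself has no '@'

utf8 = text.encode("utf-8")      # three bytes, all >= 0x80
sjis = text.encode("shift_jis")  # two bytes, the second one is 0x40 ('@')

utf8_hit = contains_ascii_byte(utf8, "@")  # no false match
sjis_hit = contains_ascii_byte(sjis, "@")  # false match on the trail byte

print("UTF-8 bytes:", utf8.hex(), "-> '@' found:", utf8_hit)
print("Shift_JIS bytes:", sjis.hex(), "-> '@' found:", sjis_hit)
```

The same kind of false match would hit any parser that looks for delimiters
byte by byte without knowing the encoding.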

> If you're providing a string with an "ASCII-compatible encoding", why mention
> multibyte at all? It confuses me and it feels like it would confuse others
> too. ASCII is always single-byte.

The US-ASCII charset defines the first 128 characters (codes 0 through 127);
being "compatible" here means the charset is a superset of US-ASCII. You may
relax this requirement if you want -- just specify which characters libcurl
uses to find the IDN in the URL.

All this complexity is the reason why it probably makes sense to
unconditionally require the URL to be in US-ASCII and its hostname part to be
in UTF-8. That makes everything simple... assuming libidn2 (or whatever libcurl
uses on *nix) can take UTF-8.
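
As a rough sketch of what that rule would mean in practice: the caller hands
over the hostname as UTF-8 bytes and the IDN step turns it into an ASCII-only
name. Here Python's built-in "idna" codec (IDNA 2003) stands in for libidn2,
and the function name and sample hostname are hypothetical. This is not how
libcurl does it, just an illustration of the conversion.

```python
# Sketch only: Python's built-in "idna" codec is used as a stand-in for
# libidn2. The function name and the sample hostnames are hypothetical.

def hostname_to_ascii(utf8_host: bytes) -> str:
    """Decode a UTF-8 hostname and return its ASCII (punycode) form."""
    return utf8_host.decode("utf-8").encode("idna").decode("ascii")

idn_host = hostname_to_ascii("b\u00fccher.example".encode("utf-8"))
plain_host = hostname_to_ascii(b"example.com")

print(idn_host)    # ASCII-only form of the internationalized name
print(plain_host)  # plain ASCII names pass through unchanged
```

Because the input encoding is fixed to UTF-8, no locale lookup is needed
before the conversion.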

Received on 2018-05-02