Re: a URL API ?

From: Dan Fandrich via curl-library
Date: Mon, 13 Aug 2018 01:25:50 +0200

On Sun, Aug 12, 2018 at 06:45:27PM +0200, Daniel Stenberg wrote:
> The current code for the API doesn't offer URL decoding at all when you ask
> for the full URL - since a returned URL is still supposed to be a URL so it
> can't really be "decoded" then. We can of course document that bit to mean
> "canonicalization" when used in combination with getting the URL.
> Canonicalization can probably be done by always URL decoding *and* URL
> encoding each individual part before they're put together to the end
> result... Wouldn't that work?

I think you're right, it should work. Documenting
(CURLU_URLDECODE|CURLU_URLENCODE) as performing canonicalization is probably
all you'd need, besides ensuring decode and encode happen in the correct order.
Actually, does CURLU_URLDECODE do anything on the curl_url_get call? It sounds
like a flag that should only have an effect on the curl_url_set call.
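
To make the decode-then-encode idea concrete, here is a minimal sketch for a
single URL part. It is not code from libcurl: canonicalize_part() and its
helpers are hypothetical names, and the sketch assumes the part is pure data
with no delimiters left in it, so everything outside the RFC 3986 unreserved
set gets percent-encoded with uppercase hex.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* RFC 3986 "unreserved" set: ALPHA / DIGIT / "-" / "." / "_" / "~" */
static int is_unreserved(unsigned char c)
{
  return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
         (c >= '0' && c <= '9') ||
         c == '-' || c == '.' || c == '_' || c == '~';
}

/* Value of one hex digit, or -1 if c is not a hex digit. */
static int hexval(unsigned char c)
{
  if(c >= '0' && c <= '9')
    return c - '0';
  if(c >= 'a' && c <= 'f')
    return c - 'a' + 10;
  if(c >= 'A' && c <= 'F')
    return c - 'A' + 10;
  return -1;
}

/* Hypothetical helper, not a libcurl function. Decodes every valid %XX
   triplet in `in`, then re-encodes each octet that is not unreserved.
   Returns a malloc()ed string the caller must free(), or NULL on OOM. */
static char *canonicalize_part(const char *in)
{
  static const char hex[] = "0123456789ABCDEF";
  size_t len, i;
  char *out, *o;

  assert(in != NULL);
  len = strlen(in);
  out = malloc(len * 3 + 1); /* worst case: every octet becomes %XX */
  if(!out)
    return NULL;
  o = out;

  for(i = 0; i < len; i++) {
    unsigned char c = (unsigned char)in[i];

    /* Decode step: fold a valid %XX triplet into the octet it represents.
       A stray '%' is kept as a literal octet (an assumption of this sketch). */
    if(c == '%' && i + 2 < len &&
       hexval((unsigned char)in[i + 1]) >= 0 &&
       hexval((unsigned char)in[i + 2]) >= 0) {
      c = (unsigned char)(hexval((unsigned char)in[i + 1]) * 16 +
                          hexval((unsigned char)in[i + 2]));
      i += 2;
    }

    /* Encode step: only unreserved octets stay literal. */
    if(is_unreserved(c))
      *o++ = (char)c;
    else {
      *o++ = '%';
      *o++ = hex[c >> 4];
      *o++ = hex[c & 0x0F];
    }
  }
  *o = '\0';
  return out;
}
```

With this sketch, "%7euser" comes back as "~user" and "a%2fb c" comes back as
"a%2Fb%20c", so differently spelled inputs for the same data converge on one
form. Doing the decode before the encode is what makes that work.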

I'm a bit concerned by this paragraph of RFC 3986, though, with respect to
canonicalization in the curl API:

   URI producing applications should percent-encode data octets that
   correspond to characters in the reserved set unless these characters
   are specifically allowed by the URI scheme to represent data in that
   component. If a reserved character is found in a URI component and
   no delimiting role is known for that character, then it must be
   interpreted as representing the data octet corresponding to that
   character's encoding in US-ASCII.

This means that the preferred form of a URI differs depending on the scheme. Do
we want to build in knowledge of the preferred encoding sets for all the
different URI schemes out there today, or even just the ones curl supports?
This implies that the canonical form could change if curl adds support for a
new scheme in the future. If so, then I think there should be a separate new
option for this scheme-aware kind of encoding, so that the canonical form stays
canonical for every URI scheme. Programs that would merely prefer a fairly
consistent human-readable form, using an encoding set optimized for the scheme
in use, could use the new CURLU_URLENCODE_OPTIMIZED (or whatever it's called)
option instead.
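
A small sketch of why the two forms can differ: the encoder below takes the
set of reserved characters to leave alone as a parameter. Everything here is
an assumption for illustration. encode_part() is a hypothetical helper, not a
libcurl function, and the "@:" set used in the example is made up rather than
taken from any scheme's specification.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* RFC 3986 "unreserved" set: ALPHA / DIGIT / "-" / "." / "_" / "~" */
static int is_unreserved(unsigned char c)
{
  return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
         (c >= '0' && c <= '9') ||
         c == '-' || c == '.' || c == '_' || c == '~';
}

/* Hypothetical helper: percent-encode `in`, leaving literal only the
   unreserved characters plus whatever is listed in `keep`.
   keep == ""      gives a scheme-independent form
   keep == "<set>" models a per-scheme "optimized" encoding set
   Returns a malloc()ed string the caller must free(), or NULL on OOM. */
static char *encode_part(const char *in, const char *keep)
{
  static const char hex[] = "0123456789ABCDEF";
  size_t len, i;
  char *out, *o;

  assert(in != NULL && keep != NULL);
  len = strlen(in);
  out = malloc(len * 3 + 1); /* worst case: every octet becomes %XX */
  if(!out)
    return NULL;
  o = out;

  for(i = 0; i < len; i++) {
    unsigned char c = (unsigned char)in[i];

    if(is_unreserved(c) || strchr(keep, c))
      *o++ = (char)c;
    else {
      *o++ = '%';
      *o++ = hex[c >> 4];
      *o++ = hex[c & 0x0F];
    }
  }
  *o = '\0';
  return out;
}
```

Encoding "user@host:1" with an empty keep set yields "user%40host%3A1", while
the made-up keep set "@:" leaves it as "user@host:1". The first result does
not depend on any scheme table. The second would change whenever the table
for a scheme is added or revised.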
Received on 2018-08-13