Re: should curl_url_get "normalize" URLs?
From: Timothe Litt <litt_at_acm.org>
Date: Fri, 28 Mar 2025 16:05:23 -0400
On 27-Mar-25 04:10, Daniel Stenberg via curl-library wrote:
> Hi team,
>
> The curl_url_get man page [3] says it *normalizes* retrieved URLs.
> Normalizing in this context means that curl would do its best to
> return a single consistent representation of a URL even if you would
> provide different variations as input.
>
> Normalizing helps apps to, for example, compare URLs or otherwise be
> more consistent.
>
> This claim turned out to be false [1], as there are multiple details
> not normalized in the latest libcurl version, and I am working on a
> PR [2] to address the shortcomings.
>
> Normalizing URLs is less straightforward than it may sound. A
> naive version would decode every URL part, then encode them again and
> put together a full URL using all the re-encoded pieces.
>
> This however would break URLs in multiple ways, as for example '/'
> would be encoded to %2F in the path part and '=' would be encoded into
> %3D in the query part - so it can't be done that simply. Every part
> more or less has its own set of properties and characters to take into
> account and treat specially. Not to mention that it is simply more
> work that requires several more memory allocations to get done etc.
>
> Also, a user might not need or want this normalization done. Maybe
> we need a flag to enable/disable it?
>
> Before I complete this work and risk wasting time going down the wrong
> rabbit hole, let me know if you have any thoughts, opinions or feedback
> on this area.
>
> [1] = https://github.com/curl/curl/issues/16829
> [2] = https://github.com/curl/curl/pull/16841
> [3] = https://curl.se/libcurl/c/curl_url_get.html
>
Be careful. %2F is not the same as / in all cases. The rules are
messy enough that I won't restate them here, but refer to the RFCs,
e.g. RFC 3986 to start.
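To make the point concrete, here is a minimal sketch using libcurl's own
URL API (the URL is made up; error checking omitted). Decoding the path
collapses an escaped %2F into a literal separator, so a naive
decode-and-re-encode round trip loses information:

    #include <stdio.h>
    #include <curl/curl.h>

    int main(void)
    {
      /* "a%2Fb" is ONE path segment containing a slash;
         "a/b" would be TWO segments */
      CURLU *h = curl_url();
      curl_url_set(h, CURLUPART_URL,
                   "https://example.com/a%2Fb?k=v%3D1", 0);

      char *raw, *dec;
      curl_url_get(h, CURLUPART_PATH, &raw, 0);  /* "/a%2Fb" */
      curl_url_get(h, CURLUPART_PATH, &dec, CURLU_URLDECODE); /* "/a/b" */
      printf("raw:     %s\ndecoded: %s\n", raw, dec);

      curl_free(raw);
      curl_free(dec);
      curl_url_cleanup(h);
      return 0;
    }

Once decoded, the data slash and the delimiter slash are
indistinguishable, which is exactly what the RFC passage quoted next
warns about.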
Note in section 2.4
<https://datatracker.ietf.org/doc/html/rfc3986#section-2.4>:
> When a URI is dereferenced, the components and subcomponents
> significant to the scheme-specific dereferencing process (if any)
> must be parsed and separated before the percent-encoded octets within
> those components can be safely decoded, *as otherwise the data may be
> mistaken for component delimiters*.
Section 6.1 <https://datatracker.ietf.org/doc/html/rfc3986#section-6.1>
discusses Equivalence (and normalization) in depth.
There are both generic and scheme-specific rules and considerations.
The scheme RFCs have details.
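As an illustrative sketch of the equivalence problem (using only the
documented curl_url_get behavior; whether the two strings compare equal
depends on how much normalization the libcurl version actually applies -
the gap reported in [1]):

    #include <stdio.h>
    #include <string.h>
    #include <curl/curl.h>

    int main(void)
    {
      /* two spellings of arguably the same URL (RFC 3986, section 6) */
      const char *in[2] = { "HTTP://EXAMPLE.com/%7euser",
                            "http://example.com/%7Euser" };
      char *out[2];

      for(int i = 0; i < 2; i++) {
        CURLU *h = curl_url();
        curl_url_set(h, CURLUPART_URL, in[i], 0);
        curl_url_get(h, CURLUPART_URL, &out[i], 0);
        curl_url_cleanup(h);
      }
      /* equal only if scheme/host case AND percent-encoding hex
         case were all normalized */
      printf("%s\n%s\n%s\n", out[0], out[1],
             strcmp(out[0], out[1]) ? "DIFFER" : "EQUAL");
      curl_free(out[0]);
      curl_free(out[1]);
      return 0;
    }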
About the only things that can easily and safely be normalized are the
authority (when it's known to be a DNS name, and comparing embedded
authorization to separate data), IP addresses, and hexadecimal a-f case.
More can be done with scheme-specific knowledge. And even more if you
have knowledge of the server (e.g. file systems that are
case-insensitive/case-preserving can have case aliases, but
case-sensitive file systems will not). I won't mention the server-side
aliasing of links (hard and symbolic - and context-dependent)...
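Of that safe subset, the percent-encoding hex case is the one piece that
applies to every component; a hand-rolled sketch (not libcurl code) of
that single normalization, per RFC 3986 section 6.2.2.1:

    #include <ctype.h>

    /* Uppercase the two hex digits after each '%' in place. This never
       adds or removes an escape, only fixes its case, so it is safe on
       any already-encoded URL component. */
    static void normalize_pct_case(char *s)
    {
      for(; *s; s++) {
        if(s[0] == '%' && isxdigit((unsigned char)s[1]) &&
           isxdigit((unsigned char)s[2])) {
          s[1] = (char)toupper((unsigned char)s[1]);
          s[2] = (char)toupper((unsigned char)s[2]);
          s += 2;
        }
      }
    }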
While "Do What I Mean" has it's place, there does need to be a mechanism
for "Do exactly what I say". Even if the latter means that the user is
outsmarting herself. Better that, than the library outsmarting the user
with incorrect results.
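In API terms that could be an opt-out flag. To be clear, the
CURLU_NO_NORMALIZE below is purely hypothetical - no such flag exists in
libcurl - it is sketched only to show the shape of a "do exactly what I
say" mechanism:

    #include <curl/curl.h>

    /* HYPOTHETICAL - defined here only so the sketch compiles; a real
       libcurl would not recognize this bit */
    #define CURLU_NO_NORMALIZE (1u << 30)

    int main(void)
    {
      CURLU *h = curl_url();
      curl_url_set(h, CURLUPART_URL, "https://Example.COM/a%2fb", 0);

      char *url;
      curl_url_get(h, CURLUPART_URL, &url, 0); /* default: normalized */
      curl_free(url);
      /* opt-out: hand back the bytes exactly as given (hypothetical) */
      curl_url_get(h, CURLUPART_URL, &url, CURLU_NO_NORMALIZE);
      curl_free(url);
      curl_url_cleanup(h);
      return 0;
    }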
I don't have the time to look into the man page, the current code, or
your PRs. But you asked about "rabbit holes"; these are some of the
entrances.
HTH.
Timothe Litt
ACM Distinguished Engineer
--------------------------
This communication may not represent the ACM or my employer's views,
if any, on the matters discussed.
Received on 2025-03-28