Subject: should curl_url_get "normalize" URLs?
From: Daniel Stenberg via curl-library <curl-library_at_lists.haxx.se>
Date: Thu, 27 Mar 2025 09:10:58 +0100 (CET)
Hi team,
The curl_url_get man page [3] says it *normalizes* retrieved URLs. Normalizing
in this context means that curl does its best to return a single consistent
representation of a URL, even if you provide different variations as input.
Normalizing helps applications to, for example, compare URLs or otherwise be
more consistent.
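To make the intent concrete, here is a minimal sketch using the curl_url API
(curl_url, curl_url_set, curl_url_get). The two input spellings are made-up
illustrations; with full normalization both would come back identical, and how
close current libcurl gets to that is exactly the open question here:

#include <stdio.h>
#include <curl/curl.h>

/* Sketch: feed two spellings of (conceptually) the same URL through the
   curl_url API and print what curl_url_get returns. With complete
   normalization the two outputs would match. */
int main(void)
{
  const char *variants[] = {
    "HTTP://EXAMPLE.com/%7educk/",   /* upper case, percent-encoded tilde */
    "http://example.com/~duck/"      /* lower case, literal tilde */
  };
  for(int i = 0; i < 2; i++) {
    CURLU *u = curl_url();
    char *out = NULL;
    if(!curl_url_set(u, CURLUPART_URL, variants[i], 0) &&
       !curl_url_get(u, CURLUPART_URL, &out, 0)) {
      printf("%s => %s\n", variants[i], out);
      curl_free(out);
    }
    curl_url_cleanup(u);
  }
  return 0;
}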
This claim turned out to be false [1], as there are multiple details not
normalized in the latest libcurl version, and I am working on a PR [2] to
address the shortcomings.
Normalizing URLs is less straightforward than it may sound. A naive version
would decode every URL part, then encode them again and put together a full
URL using all the re-encoded pieces.
This however would break URLs in multiple ways, as for example '/' would be
encoded to %2F in the path part and '=' would be encoded into %3D in the query
part - so it can't be done that simply. Every part more or less has its own
set of properties and characters to take into account and treat specially. Not
to mention that it is simply more work, requiring several more memory
allocations to get done, etc.
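Here is a short sketch of where the naive round-trip loses information, using
the real CURLU_URLDECODE flag (the example URL is made up for illustration):

#include <stdio.h>
#include <curl/curl.h>

/* Sketch of the information loss in a naive decode-then-re-encode pass.
   The input URL is invented for this demo. */
int main(void)
{
  CURLU *u = curl_url();
  char *path = NULL;

  curl_url_set(u, CURLUPART_URL, "http://example.com/a%2Fb/c", 0);

  /* Fully decoding the path turns "/a%2Fb/c" into "/a/b/c". At this point
     the slash that was data (%2F) and the slashes that separate path
     segments are indistinguishable, so no later re-encoding step can
     reconstruct the original URL. */
  curl_url_get(u, CURLUPART_PATH, &path, CURLU_URLDECODE);
  printf("decoded path: %s\n", path);

  curl_free(path);
  curl_url_cleanup(u);
  return 0;
}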
Also, a user might not need or want this normalization to be done. Maybe we
need a flag to enable/disable it?
Before I complete this work and risk wasting time going down the wrong rabbit
hole, let me know if you have any thoughts, opinions or feedback on this area.
[1] = https://github.com/curl/curl/issues/16829
[2] = https://github.com/curl/curl/pull/16841
[3] = https://curl.se/libcurl/c/curl_url_get.html
--
/ daniel.haxx.se || https://rock-solid.curl.dev