A canonical URL host name dilemma
From: Daniel Stenberg via curl-library <curl-library_at_lists.haxx.se>
Date: Sat, 9 Oct 2021 11:42:44 +0200 (CEST)
Hello friends.
Let me take you through a bug, my current work and the little dilemma I'm
facing in regards to how to "canonicalize" host names in URLs! I'll end the
mail with a question about a possible solution I've thought of.
# Not parsing percent-encoded host names in URLs
$ curl https://%63url.se/
curl: (6) Could not resolve host: %63url.se
instead of:
$ curl https://%63url.se/
[content from https://curl.se]
Issue: https://github.com/curl/curl/issues/7830
PR: https://github.com/curl/curl/pull/7834
## Obvious first take
Make sure that the URL parser **decodes** percent-encoded host names. %41
becomes `A` etc.
The parser rejects "control codes" while decoding. %00, %0a and %0d make the
host name illegal.
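As a rough illustration of the decoding step (not the code from the PR), here is a minimal Python sketch. The function name `decode_host` is hypothetical, and treating every byte below 0x20 plus 0x7f as a "control code" is an assumption; the mail only names %00, %0a and %0d.

```python
HEX_DIGITS = b"0123456789abcdefABCDEF"

def decode_host(host: str) -> bytes:
    """Percent-decode a host name and reject control codes (sketch)."""
    raw = host.encode("utf-8")
    out = bytearray()
    i = 0
    while i < len(raw):
        if raw[i:i + 1] == b"%":
            pair = raw[i + 1:i + 3]
            if len(pair) != 2 or not all(c in HEX_DIGITS for c in pair):
                raise ValueError("malformed percent-encoding")
            byte = int(pair, 16)
            i += 3
        else:
            byte = raw[i]
            i += 1
        # Assumption: "control codes" means 0x00-0x1f and 0x7f.
        if byte < 0x20 or byte == 0x7F:
            raise ValueError("control code in host name")
        out.append(byte)
    return bytes(out)

print(decode_host("%63url.se"))
```

The result is kept as bytes because a decoded host name is not guaranteed to be valid UTF-8.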
## Canonical host name
The URL API can also *extract* the full URL, so it needs to be able to reverse
the process, and this is where the challenges begin.
My first simplistic (or maybe *naive*) approach works like this:
Setting `https://%63url.se/` is extracted again as `https://curl.se/` but
setting `https://%c0.se/` is extracted as `https://%c0.se/` (since anything
non-ASCII is not "URL compliant").
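The round trip described above could be sketched like this in Python. `naive_extract` is a hypothetical name, the lowercase hex digits are an arbitrary choice, and the sketch only deals with the non-ASCII aspect: it does not re-encode ASCII characters that would be invalid in a host name.

```python
from urllib.parse import unquote_to_bytes

def naive_extract(host: str) -> str:
    """Decode the host, then percent-encode every non-ASCII byte again."""
    raw = unquote_to_bytes(host)
    return "".join(chr(b) if b < 0x80 else "%%%02x" % b for b in raw)

for name in ("%63url.se", "%c0.se"):
    print(name, "->", naive_extract(name))
```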
## IDN input
Enter IDN: Internationalized Domain Names. They are specified outside of the
regular URL spec (RFC 3986) and they are specified using non-ASCII byte
codes.
Example name: `räksmörgås.se` (clients puny-encode this name to
`xn--rksmrgs-5wao1o.se` for DNS etc).
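To make the puny-encoding step concrete, here is a small Python sketch using the standard library's built-in `idna` codec. That codec implements the older IDNA 2003 rules, so it is only a stand-in for whatever IDN library a given libcurl build uses; for this particular name the result matches the example above.

```python
name = "räksmörgås.se"

# Unicode host name -> ASCII-compatible ("puny-encoded") form for DNS.
ace = name.encode("idna").decode("ascii")
print(ace)

# And back again to the Unicode form.
back = ace.encode("ascii").decode("idna")
print(back)
```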
Since this host/URL uses non-ASCII letters, the naive approach mentioned above
would then, when the URL API is used to extract this again, use a sequence of
percent-encoded UTF-8: `r%C3%A4ksm%C3%B6rg%C3%A5s.se`.
It would **not** extract back to `räksmörgås.se`, which probably is what a
user will expect.
Next-level complication: mix in percent-encoding to the IDN name:
`r%c3%a4ksmörgås.se`
The two percent-encoded bytes are the UTF-8 sequence for `ä`, which makes this
host name work the same way.
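A quick way to see that these spellings name the same host is to compare the decoded bytes. The Python sketch below uses `urllib.parse` purely for illustration; it is not how libcurl does it.

```python
from urllib.parse import quote, unquote_to_bytes

plain = "räksmörgås.se"
mixed = "r%c3%a4ksmörgås.se"
full = quote(plain, safe="")  # percent-encode every non-ASCII byte

# All three spellings decode to the same UTF-8 byte sequence.
same = unquote_to_bytes(mixed) == plain.encode("utf-8") == unquote_to_bytes(full)
print(full, same)
```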
## IDN output
How do we know how to encode the host name when the user wants to extract it?
Alternatives I can think of:
### A) Don't
Store the originally provided name and use that for retrieval as well. This
is bad as then the same URL with differently encoded host names will appear
as two different ones. Users probably will not expect nor appreciate this.
### B) Always
Always percent-encode (this is what the PR currently does). It makes the host
name canonical and it still works IDN wise, but the retrieved URL is ugly and
user hostile.
### C) Puny-encode
Return the **puny-encoded** version of the name if it was an IDN name,
otherwise percent-encode. Makes the host name canonical, it still works IDN
wise, but the retrieved URL is ugly and user hostile. Just possibly a little
less hostile than version B. An upside could be that a puny-code version of
the host name works even with clients that don't speak IDN. This method then
works differently if libcurl was built with or without IDN support.
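A minimal Python sketch of alternative (C), under a few assumptions: `extract_puny` is a hypothetical name, "was an IDN name" is taken to mean "decodes as UTF-8 and is accepted by Python's IDNA 2003 `idna` codec", and anything else falls back to percent-encoding the non-ASCII bytes.

```python
from urllib.parse import unquote_to_bytes

def extract_puny(host: str) -> str:
    """Return the puny-encoded name for IDN input, else percent-encode."""
    raw = unquote_to_bytes(host)
    if all(b < 0x80 for b in raw):
        return raw.decode("ascii")  # plain ASCII name, nothing to encode
    try:
        return raw.decode("utf-8").encode("idna").decode("ascii")
    except UnicodeError:
        # Not valid UTF-8 or not a valid IDN name: percent-encode instead.
        return "".join(chr(b) if b < 0x80 else "%%%02x" % b for b in raw)

for name in ("r%c3%a4ksmörgås.se", "%c0.se"):
    print(name, "->", extract_puny(name))
```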
### D) Heuristics
If the host name was a valid IDN name, then return that name without
encoding, otherwise percent-encode. This makes `r%c3%a4ksmörgås.se` as input
generate `räksmörgås.se` as output. This method then works differently if
libcurl was built with or without IDN support.
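One possible shape for the heuristic in (D), again as a hedged Python sketch: `extract_heuristic` is a hypothetical name, and Python's `idna` codec (IDNA 2003) stands in for the validity check. A build without IDN support would have no such check to call, which is where the behaviour would diverge.

```python
from urllib.parse import unquote_to_bytes

def extract_heuristic(host: str) -> str:
    """Return the readable Unicode name if it is a valid IDN name."""
    raw = unquote_to_bytes(host)
    try:
        name = raw.decode("utf-8")
        name.encode("idna")  # used only as a validity check
        return name
    except UnicodeError:
        # Not valid UTF-8 or not a valid IDN name: percent-encode instead.
        return "".join(chr(b) if b < 0x80 else "%%%02x" % b for b in raw)

for name in ("r%c3%a4ksmörgås.se", "%c0.se"):
    print(name, "->", extract_heuristic(name))
```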
Can we make version (D) work and would that be preferred?
--
 / daniel.haxx.se
Received on 2021-10-09