
A canonical URL host name dilemma

From: Daniel Stenberg via curl-library <curl-library_at_lists.haxx.se>
Date: Sat, 9 Oct 2021 11:42:44 +0200 (CEST)

Hello friends.

Let me take you through a bug, my current work and the little dilemma I'm
facing with regard to how to "canonicalize" host names in URLs! I'll end the
mail with a question about a possible solution I've thought of.

# Not parsing percent-encoded host names in URLs

     $ curl https://%63url.se/
     curl: (6) Could not resolve host: %63url.se

instead of:

     $ curl https://%63url.se/
     [content from https://curl.se]

Issue: https://github.com/curl/curl/issues/7830
PR: https://github.com/curl/curl/pull/7834

## Obvious first take

  Make sure that the URL parser **decodes** percent-encoded host names. %41
  becomes `A` etc.

  The parser rejects "control codes" while decoding. %00, %0a and %0d make the
  host name illegal.
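
  As a rough illustration of that decode step, here is a minimal Python
  sketch. It is not the code from the PR: `decode_host` is a hypothetical
  helper, and treating every byte below 0x20 plus 0x7f as a "control code" is
  an assumption, since only %00, %0a and %0d are named above.

```python
from urllib.parse import unquote_to_bytes


def decode_host(host: str) -> bytes:
    """Percent-decode a host name and reject control codes (hypothetical helper)."""
    raw = unquote_to_bytes(host)  # "%41" becomes b"A"
    # Assumption: "control code" means any byte below 0x20, or 0x7f (DEL).
    if any(b < 0x20 or b == 0x7F for b in raw):
        raise ValueError("control code in host name")
    return raw


print(decode_host("%63url.se"))  # b'curl.se'
```

  With this sketch, `decode_host("a%0a.se")` raises `ValueError` instead of
  returning a host name.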

## Canonical host name

  The URL API can also *extract* the full URL, so it needs to be able to
  reverse the process, and this is where the challenges begin.

  My first simplistic (or maybe *naive*) approach works like this:

  Setting `https://%63url.se/` is extracted again as `https://curl.se/` but
  setting `https://%c0.se/` is extracted as `https://%c0.se/` (since anything
  non-ASCII is not "URL compliant").
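
  The naive round trip can be sketched in Python as follows. This is an
  illustration only, not the PR's code: `naive_canonical_host` is a
  hypothetical name, control code rejection is left out for brevity, and
  lower-case hex digits are an arbitrary choice.

```python
from urllib.parse import unquote_to_bytes


def naive_canonical_host(host: str) -> str:
    """Decode a host name, then re-encode every byte that is not plain ASCII."""
    raw = unquote_to_bytes(host)  # "%63url.se" -> b"curl.se"
    out = []
    for b in raw:
        if 0x20 < b < 0x7F and b != ord("%"):
            out.append(chr(b))       # printable ASCII is kept as-is
        else:
            out.append(f"%{b:02x}")  # everything else is percent-encoded
    return "".join(out)


print(naive_canonical_host("%63url.se"))  # curl.se
print(naive_canonical_host("%c0.se"))     # %c0.se
```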

## IDN input

  Enter IDN. Internationalized Domain Names. They are specified outside of the
  regular URL spec (RFC 3986) and they are specified using non-ASCII byte
  codes.

  Example name: `räksmörgås.se` (clients puny-encode this name to
  `xn--rksmrgs-5wao1o.se` for DNS etc).

  Since this host/URL uses non-ASCII letters, the naive approach mentioned
  above would then, when the URL API is used to extract this again, use a
  sequence of percent-encoded UTF-8: `r%C3%A4ksm%C3%B6rg%C3%A5s.se`.

  It would **not** extract back to `räksmörgås.se`, which probably is what a
  user will expect.

  Next-level complication: mix in percent-encoding to the IDN name:

  `r%c3%a4ksmörgås.se`

  The two percent-encoded bytes are the UTF-8 sequence for `ä`, which makes
  this host name work the same way.
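
  That equivalence is easy to check: after percent-decoding, both spellings
  are the same byte sequence. A small Python sketch (`host_bytes` is a
  hypothetical helper, and the input is assumed to be UTF-8):

```python
from urllib.parse import unquote_to_bytes


def host_bytes(host: str) -> bytes:
    """Return the raw bytes of a host name after percent-decoding."""
    return unquote_to_bytes(host)  # non-ASCII characters are taken as UTF-8


plain = host_bytes("r\u00e4ksm\u00f6rg\u00e5s.se")      # räksmörgås.se
mixed = host_bytes("r%c3%a4ksm\u00f6rg\u00e5s.se")      # r%c3%a4ksmörgås.se
print(plain == mixed)  # True
```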

## IDN output

  How do we know how to encode the host name when the user wants to extract it?

  Alternatives I can think of:

### A) Don't

  Store the originally provided name and use that for retrieval as well. This
  is bad as then the same URL with differently encoded host names will appear
  as two different ones. Users will probably neither expect nor appreciate
  this.

### B) Always

  Always percent-encode (this is what the PR currently does). It makes the host
  name canonical and it still works IDN wise, but the retrieved URL is ugly and
  user hostile.

### C) Puny-encode

  Return the **puny-encoded** version of the name if it was an IDN name,
  otherwise percent-encode. Makes the host name canonical, it still works IDN
  wise, but the retrieved URL is ugly and user hostile. Just possibly a little
  less hostile than version B. An upside could be that a puny-code version of
  the host name works even with clients that don't speak IDN. This method then
  works differently if libcurl was built with or without IDN support.
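
  A minimal sketch of what (C) could look like, written in Python for
  brevity. It assumes the host bytes are UTF-8 and leans on Python's built-in
  `idna` codec (IDNA 2003), which may not behave exactly like the IDN library
  a given libcurl build uses. `canonical_host_puny` and `_pct` are
  hypothetical names.

```python
from urllib.parse import unquote_to_bytes


def _pct(raw: bytes) -> str:
    """Percent-encode every byte that is not printable ASCII (or is '%')."""
    return "".join(
        chr(b) if 0x20 < b < 0x7F and b != ord("%") else f"%{b:02x}"
        for b in raw
    )


def canonical_host_puny(host: str) -> str:
    """Option C sketch: puny-encode IDN names, percent-encode the rest."""
    raw = unquote_to_bytes(host)
    if raw.isascii():
        return _pct(raw)  # nothing IDN about it
    try:
        # Assumption: a valid IDN name is UTF-8 that the idna codec accepts.
        return raw.decode("utf-8").encode("idna").decode("ascii")
    except UnicodeError:
        return _pct(raw)  # not a usable IDN name, fall back


print(canonical_host_puny("r%c3%a4ksm\u00f6rg\u00e5s.se"))  # xn--rksmrgs-5wao1o.se
print(canonical_host_puny("%c0.se"))                        # %c0.se
```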

### D) Heuristics

  If the host name was a valid IDN name, then return that name without
  encoding, otherwise percent-encode. This makes `r%c3%a4ksmörgås.se` as input
  generate `räksmörgås.se` as output. This method then works differently if
  libcurl was built with or without IDN support.
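
  To make the heuristic concrete, here is a minimal Python sketch. It is only
  an approximation: "valid IDN name" is taken to mean "valid UTF-8 that
  Python's built-in `idna` codec (IDNA 2003) can encode", which is an
  assumption and not necessarily what libcurl's IDN backend would decide.
  `canonical_host_heuristic` and `_pct` are hypothetical names.

```python
from urllib.parse import unquote_to_bytes


def _pct(raw: bytes) -> str:
    """Percent-encode every byte that is not printable ASCII (or is '%')."""
    return "".join(
        chr(b) if 0x20 < b < 0x7F and b != ord("%") else f"%{b:02x}"
        for b in raw
    )


def canonical_host_heuristic(host: str) -> str:
    """Option D sketch: return valid IDN names unencoded, else percent-encode."""
    raw = unquote_to_bytes(host)
    if raw.isascii():
        return _pct(raw)
    try:
        name = raw.decode("utf-8")
        name.encode("idna")  # used only as a validity check
        return name          # valid IDN: hand back the readable form
    except UnicodeError:
        return _pct(raw)     # not valid IDN: percent-encode the bytes


print(canonical_host_heuristic("r%c3%a4ksm\u00f6rg\u00e5s.se") == "r\u00e4ksm\u00f6rg\u00e5s.se")  # True
print(canonical_host_heuristic("%c0.se"))  # %c0.se
```

  A build without IDN support has no such validity check to call, which is
  where the build-dependent behavior mentioned above comes from.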



Can we make version (D) work and would that be preferred?

-- 
  / daniel.haxx.se
  | Commercial curl support up to 24x7 is available!
  | Private help, bug fixes, support, ports, new features
  | https://curl.se/support.html


Received on 2021-10-09