feature proposal: flag for mirroring an HTTP resource
From: Jannis R via curl-users <curl-users_at_lists.haxx.se>
Date: Sat, 7 Jan 2023 15:40:29 +0100
Hey,
thank you Daniel and all others, for building this awesome tool!
For my proposal, consider a use case that I'll call "mirroring" or "syncing" here:
- I want a local file to reflect an HTTP resource, whenever I have run the mirroring command. Without changes on the remote side, mirroring should effectively be an idempotent operation.
- I want this mirroring process to work as time- and bandwidth-efficiently as possible: It should not redownload any bytes that it has already downloaded, as long as the server provides the means for this.
- If the server *does not* provide the means for efficient mirroring, I want to be sure to have a correct & up-to-date copy, at the expense of time & bandwidth.
In particular, I'm talking about compressible files which are also served un-compressed to provide a low entry barrier. A common example would be large text files (e.g. CSV datasets) served using a standard web server (e.g. nginx), which either has access to pre-compressed variants of them or does on-the-fly compression.
AFAICT, implementing this behavior using the curl CLI is very hard currently, and not with a single call, because there are a few edge cases that the curl CLI doesn't provide config flags for.
## problems
Let me explain some prerequisites first:
Because the HTTP RFCs define [Content-Encoding] (CE) as being a property of the entity, [Range requests] *do not* "make sense" on CE-coded files. Therefore continuing an interrupted download is only possible with a *non-CE-coded* representation of the resource. [Transfer-Encoding] would cleanly solve this problem, but unfortunately it is not widely supported in web servers and has no equivalent in HTTP/2 and HTTP/3 (yet?).
Also, because a CE-coded entity has a different [ETag] than its un-CE-coded equivalent, we *cannot* re-use the CE-coded ETag when continuing the download from the un-CE-coded entity, so we cannot use it to make sure we're still downloading the same "version" of the resource!
more details:
- https://github.com/golang/go/issues/30829#issuecomment-476694405
- https://github.com/httpwg/http2-spec/issues/445
Thus, we can only use CE-coding when downloading in one go (and start over after an interruption), and support continuation *for non-CE-coded entities only*.
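To make the continuation case concrete, here is a minimal sketch of the request headers a resumed download of a non-CE-coded entity would send. The function name `resume_headers` and both file arguments are made up for illustration; it assumes the ETag file holds the ETag exactly as the server sent it, including the quotes. Weak ETags (`W/...`) are skipped because If-Range needs a strong validator.

```shell
# Print the extra request headers for resuming a download, one per line.
# $1: path to the partial local file (hypothetical name)
# $2: path to a file holding the previously saved ETag (hypothetical name)
resume_headers() {
  partial=$1
  etag_file=$2
  # Without a non-empty partial file and a saved ETag, print nothing:
  # the caller then sends a plain request and receives the full body.
  [ -s "$partial" ] && [ -s "$etag_file" ] || return 0
  etag=$(cat "$etag_file")
  # A weak ETag cannot be used with If-Range, so fall back to a full download.
  case $etag in W/*) return 0 ;; esac
  # Number of bytes we already have; tr strips padding some wc versions add.
  offset=$(wc -c < "$partial" | tr -d ' ')
  printf 'Range: bytes=%s-\n' "$offset"
  printf 'If-Range: %s\n' "$etag"
}
```

Each printed line can be passed to curl with `-H`.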
## workaround
A workaround would be to use curl's --raw flag to download the "opaque" maybe-CE-coded-maybe-not entity with continuation (`-C -`) support, and then manually decode it if the server responded with a Content-Encoding header. As of curl 7.79.1, this doesn't work because
- curl doesn't provide a mechanism to use --etag-compare (to avoid creating an invalid copy if the remote file has changed) with an unfinished/unstarted download;
- when using the [If-Range header] manually instead of --etag-compare, curl *does* continue upon a 206 (If-Range matched, server responds with [Content-Range]), but *doesn't* overwrite the file on a 200 (If-Range didn't match, server responds with full body).
This may seem like a niche use case, but it actually prevents curl from downloading a large compressible file in the most traffic- & time-efficient manner!
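One way around the second point, sketched here under assumptions, is to not let curl write into the partial file at all: fetch the response into a separate file, look at the status code, and only then append or overwrite. The function names `fetch_rest` and `merge_body` and all file paths are hypothetical; the curl options used (`--raw`, `-sS`, `-o`, `-D`, `-w '%{http_code}'`, `-H`) are existing ones.

```shell
# Fetch the remaining bytes into a separate body file; print the status code.
# $1: URL, $2: byte offset already downloaded, $3: saved ETag (with quotes),
# $4: file to write the response body to, $5: file to dump response headers to
fetch_rest() {
  curl --raw -sS -o "$4" -D "$5" -w '%{http_code}' \
    -H "Range: bytes=$2-" -H "If-Range: $3" "$1"
}

# Merge the fetched body into the partial file depending on the status code.
# $1: status code, $2: partial file, $3: freshly fetched body file
merge_body() {
  case $1 in
    206) cat "$3" >> "$2" ;;  # If-Range matched: body is the missing tail
    200) cat "$3" > "$2" ;;   # If-Range didn't match: body is a full new copy
    *)   echo "unexpected status $1" >&2; return 1 ;;
  esac
}
```

A call could then look like `status=$(fetch_rest "$url" "$offset" "$etag" body.tmp headers.tmp) && merge_body "$status" data.csv.part body.tmp`, with decoding done afterwards if `headers.tmp` contains a Content-Encoding header.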
## proposed solution
Therefore, I would like to propose a new flag (or a set of flags, not sure how to split the described functionality into multiple orthogonal flags) that configures curl to
- use If-Range instead of [If-None-Match] to download an entity with continuation (sort of a mixture between `-C -` and --etag-{compare,save});
- still store the previous ETag and compare it with the current one, like with --etag-{compare,save};
- fail if the server responds with Content-Encoding (because then a) byte ranges wouldn't match and b) the ETag is different), *except* if --raw is used in addition;
- overwrite the partially downloaded local file if the server responds with 200 (If-Range didn't match);
- either implicitly enable `-z -` (Last-Modified comparison), or at least be compatible with it.
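The rules in this list can be read as a small decision table. The following is only a sketch of that reading, not of how curl would implement it; the function name `mirror_action` and the action names are invented here. It assumes that a 304 can only show up when a Last-Modified comparison such as `-z` is sent along.

```shell
# Print the action for one response: append, overwrite, keep or fail.
# $1: HTTP status code
# $2: value of the Content-Encoding response header ("" if absent)
# $3: "raw" if --raw is in use, anything else otherwise
mirror_action() {
  status=$1
  encoding=$2
  mode=$3
  # A CE-coded response breaks byte ranges and ETag comparison,
  # so it is only acceptable when the entity is stored undecoded.
  if [ -n "$encoding" ] && [ "$mode" != raw ]; then
    echo fail
    return
  fi
  case $status in
    206) echo append ;;     # If-Range matched, continue the partial file
    200) echo overwrite ;;  # If-Range didn't match, start over with the new body
    304) echo keep ;;       # not modified, the local copy is already current
    *)   echo fail ;;
  esac
}
```

For example, `mirror_action 200 gzip raw` prints `overwrite`, while `mirror_action 206 gzip ""` prints `fail`.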
With the proposed --mirror-tmp-file flag (I'm very open to a better name!) and a "cache"/temp file path, mirroring a file could look as follows:
```
curl -f -o data.csv --mirror-tmp-file data.csv.gz 'http://example.org/data.csv'
```
What do you think?
## demo implementation
I have replicated this behaviour by wrapping curl into [a script that parses its output and acts upon it](https://gist.github.com/derhuerst/745cf09fe5f3ea2569948dd215bbfe1a). By reading it (~200 lines), you should be able to deduce the necessary changes the --mirror-tmp-file flag would have to implement. The attached readme also contains instructions on how to set up Caddy or nginx in order to test this scenario.
A cleaner implementation would probably use libcurl, but I'd argue that – even though wget exists – downloading a file using the curl CLI is common enough that there should be a way to do it "properly".
– Jannis
[Content-Encoding](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Content-Encoding)
[Range requests](https://developer.mozilla.org/en-US/docs/Web/HTTP/Range_requests)
[Transfer-Encoding](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Transfer-Encoding)
[ETag](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/ETag)
[If-Range header](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/If-Range)
[Content-Range](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Content-Range)
[If-None-Match](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/If-None-Match)
--
Unsubscribe: https://lists.haxx.se/listinfo/curl-users
Etiquette: https://curl.se/mail/etiquette.html
Received on 2023-01-07