Re: feature proposal: flag for mirroring an HTTP resource
From: Hans Henrik Bergan via curl-users <curl-users_at_lists.haxx.se>
Date: Sun, 8 Jan 2023 00:01:23 +0100
>What do you think?
Both the rsync and wget projects are more appropriate for this
functionality than curl.
On Sat, 7 Jan 2023 at 15:43, Jannis R via curl-users
<curl-users_at_lists.haxx.se> wrote:
>
> Hey,
>
> thank you Daniel and all others, for building this awesome tool!
>
> For my proposal, consider a use case that I'll call "mirroring" or "syncing" here:
>
> - I want a local file to reflect an HTTP resource, whenever I have run the mirroring command. Without changes on the remote side, mirroring should effectively be an idempotent operation.
> - I want this mirroring process to work as time- and bandwidth-efficiently as possible: It should not redownload any bytes that it has already downloaded, as long as the server provides the means for this.
> - If the server *does not* provide the means for efficient mirroring, I want to be sure to have a correct & up-to-date copy, at the expense of time & bandwidth.
>
> In particular, I'm talking about compressible files which are also served un-compressed to provide a low entry barrier. A common example would be large text files (e.g. CSV datasets) served using a standard web server (e.g. nginx), which either has access to pre-compressed variants of them or does on-the-fly compression.
>
> AFAICT, implementing this behavior with the curl CLI is currently very hard, and not possible in a single invocation, because there are a few edge cases that the curl CLI doesn't provide config flags for.
>
> ## problems
>
> Let me explain some prerequisites first:
>
> Because the HTTP RFCs define [Content-Encoding] (CE) as a property of the entity, [Range requests] *do not* "make sense" on CE-coded files. Therefore, continuing an interrupted download is only possible with a *non-CE-coded* representation of the resource. [Transfer-Encoding] would cleanly solve this problem, but unfortunately it is not widely supported by web servers and has no equivalent in HTTP/2 and HTTP/3 (yet?).
>
> Also, because a CE-coded entity has a different [ETag] than its un-CE-coded equivalent, we *cannot* re-use the CE-coded ETag to continue downloading from the un-CE-coded entity, in order to make sure we're still downloading the same "version" of the resource!
>
> more details:
> - https://github.com/golang/go/issues/30829#issuecomment-476694405
> - https://github.com/httpwg/http2-spec/issues/445
>
> Thus, we can only use CE-coding when downloading in one go (and start over after an interruption), and support continuation *for non-CE-coded entities only*.
>
> ## workaround
>
> A workaround would be to use curl's --raw flag to download the "opaque" maybe-CE-coded-maybe-not entity with continuation (`-C -`) support, and then manually decode it if the server responded with a Content-Encoding header. As of curl 7.79.1, this doesn't work because
> - curl doesn't provide a mechanism to use --etag-compare (to avoid creating an invalid copy if the remote file has changed) with an unfinished/unstarted download;
> - when using the [If-Range header] manually instead of --etag-compare, curl *does* continue upon a 206 (If-Range matched, server responds with [Content-Range]), but *doesn't* overwrite the file on a 200 (If-Range didn't match, server responds with the full body).
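>
> The decode step of this workaround could be sketched roughly like this (a minimal, hypothetical helper; it detects gzip by its magic bytes instead of inspecting the recorded Content-Encoding header, and assumes the downloaded file name ends in .gz):
>
> ```shell
> # decode_if_gzipped: if the downloaded "opaque" file starts with the gzip
> # magic bytes (1f 8b), decompress it; otherwise just copy it over.
> decode_if_gzipped() {
>   f="$1"
>   if [ "$(head -c 2 "$f" | od -An -tx1 | tr -d ' \n')" = "1f8b" ]; then
>     gunzip -c "$f" > "${f%.gz}"
>   else
>     cp "$f" "${f%.gz}"
>   fi
> }
>
> # demo with a locally created file standing in for a real download
> printf 'hello\n' | gzip -c > /tmp/data.csv.gz
> decode_if_gzipped /tmp/data.csv.gz
> cat /tmp/data.csv
> ```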
>
> This may seem like a niche use case, but it actually prevents curl from downloading a large compressible file in the most traffic- & time-efficient manner!
>
> ## proposed solution
>
> Therefore, I would like to propose a new flag (or a set of flags; I'm not sure how to split the described functionality into multiple orthogonal flags) that configures curl to
> - use If-Range instead of [If-None-Match] to download an entity with continuation (sort of a mixture between `-C -` and --etag-{compare,save});
> - still store the previous ETag and compare it with the current one, like with --etag-{compare,save};
> - fail if the server responds with Content-Encoding (because then a) byte ranges wouldn't match and b) the ETag is different), *except* if --raw is used in addition;
> - overwrite the partially downloaded local file if the server responds with 200 (If-Range didn't match);
> - either implicitly enable `-z -` (Last-Modified comparison), or at least be compatible with it.
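>
> The core overwrite-vs-append decision from the list above could be sketched as follows (the function name is made up, and the HTTP exchange is simulated with local files so there is no real server involved):
>
> ```shell
> # apply_response: hypothetical core of the proposed behaviour, driven by
> # the status code of the If-Range response.
> apply_response() {
>   status="$1"; partial="$2"; body="$3"
>   if [ "$status" = 206 ]; then
>     cat "$body" >> "$partial"   # 206: If-Range matched, append the range
>   else
>     cat "$body" > "$partial"    # 200: If-Range failed, replace with full body
>   fi
> }
>
> # simulated: a partial local file, a 206 range response, then a 200 full body
> printf 'AAA' > /tmp/partial
> printf 'BBB' > /tmp/response-body
> apply_response 206 /tmp/partial /tmp/response-body   # /tmp/partial is now AAABBB
> apply_response 200 /tmp/partial /tmp/response-body   # /tmp/partial is now BBB
> cat /tmp/partial
> ```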
>
> With the proposed --mirror-tmp-file flag (I'm very open to a better name!) and a "cache"/temp file path, mirroring a file could look as follows:
>
> ```
> curl -f -o data.csv --mirror-tmp-file data.csv.gz 'http://example.org/data.csv'
> ```
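>
> For comparison, here is a hedged sketch of what a manual attempt with today's flags might look like (built as a dry run so nothing is actually fetched; the ETag value and URL are made up):
>
> ```shell
> etag='"abc123"'   # hypothetical ETag saved from a previous run
> # -C - resumes, --raw keeps the body opaque, If-Range guards the resume;
> # on a 200 this is where curl today doesn't overwrite the local file.
> set -- curl -f --raw -C - -H "If-Range: $etag" \
>   -o data.csv.gz 'http://example.org/data.csv'
> printf '%s ' "$@"; echo
> ```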
>
> What do you think?
>
> ## demo implementation
>
> I have replicated this behaviour by wrapping curl in [a script that parses its output and acts upon it](https://gist.github.com/derhuerst/745cf09fe5f3ea2569948dd215bbfe1a). By reading it (~200 lines), you should be able to deduce the necessary changes the --mirror-tmp-file flag would have to implement. The attached readme also contains instructions on how to set up Caddy or nginx in order to test this scenario.
>
> A cleaner implementation would probably use libcurl, but I'd argue that – even though wget exists – downloading a file using the curl CLI is common enough that there should be a way to do it "properly".
>
> – Jannis
>
> [Content-Encoding](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Content-Encoding)
> [Range requests](https://developer.mozilla.org/en-US/docs/Web/HTTP/Range_requests)
> [Transfer-Encoding](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Transfer-Encoding)
> [ETag](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/ETag)
> [If-Range header](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/If-Range)
> [Content-Range](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Content-Range)
> [If-None-Match](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/If-None-Match)
>
> --
> Unsubscribe: https://lists.haxx.se/listinfo/curl-users
> Etiquette: https://curl.se/mail/etiquette.html
--
Unsubscribe: https://lists.haxx.se/listinfo/curl-users
Etiquette: https://curl.se/mail/etiquette.html
Received on 2023-01-08