
curl-library

Re: Hashing while downloading

From: Leon Winter <winter-curl_at_bfw-online.de>
Date: Tue, 20 Jan 2015 07:49:50 +0100

Hi,

> Right. I'm generally very careful with adding new APIs, especially such that
> aren't strictly transfer-related and I would say MD5 isn't about transfers.
>
> All new functions take their share of added maintenance and work.

Well, I was just pointing to the abstraction layers already implemented
in curl. So in essence I was just hoping for the symbols to be exported,
so the internal functions would become part of the API. But I see your
point that this is not really related to curl's job.

> Well yes, but those two are in the library and in the tool, pretty much for
> the same reason you bring up here!

Exactly. So if the hashing functions were part of the curl API, the
tool could just use them instead of duplicating the same code again.

> >While looking into this I also noticed that the metalink code does the
> >verification _after_ the download, which Daniel also mentions [0]. In the
> >mentioned RFCs about the headers and XML format I found no mention of the
> >time of the hash processing. Why not do it while downloading?
>
> I don't think there's any good reason other than it hasn't been done.
> Possibly because nobody has cared enough to actually do the work.

So, it turns out, things are not as simple as I thought. For complicated
cases like resuming a download, one would need to take special care when
"hashing while downloading", since the hash of the already-downloaded
part of the file, which curl is about to extend, has not yet been
computed.
Also, I was told the hashing functions inside Apt are not slow as such;
the problem is rather that Apt applies multiple hash functions one after
another to each received chunk. Since Apt is single-threaded, this really
slows down the download.
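
For the simple, non-resumed case, "hashing while downloading" boils down
to feeding every received chunk into a digest context from the write
callback. Here is a minimal sketch of that, assuming OpenSSL's EVP API
for the MD5 part (my choice for illustration, since curl's internal MD5
wrappers are not exported) and leaving out most error handling:

#include <stdio.h>
#include <curl/curl.h>
#include <openssl/evp.h>

struct dl_state {
  FILE *out;        /* where the downloaded data is written */
  EVP_MD_CTX *md;   /* digest context updated as data arrives */
};

/* write callback: store the chunk and feed it to the hash in one pass */
static size_t write_cb(char *ptr, size_t size, size_t nmemb, void *userdata)
{
  struct dl_state *st = userdata;
  size_t n = size * nmemb;

  if(fwrite(ptr, 1, n, st->out) != n)
    return 0;                         /* abort the transfer on write error */
  EVP_DigestUpdate(st->md, ptr, n);   /* hash while downloading */
  return n;
}

int main(void)
{
  struct dl_state st;
  unsigned char digest[EVP_MAX_MD_SIZE];
  unsigned int dlen;
  CURL *curl = curl_easy_init();

  st.out = fopen("file.bin", "wb");
  st.md = EVP_MD_CTX_new();
  EVP_DigestInit_ex(st.md, EVP_md5(), NULL);

  curl_easy_setopt(curl, CURLOPT_URL, "http://example.org/file.bin");
  curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, write_cb);
  curl_easy_setopt(curl, CURLOPT_WRITEDATA, &st);
  curl_easy_perform(curl);

  EVP_DigestFinal_ex(st.md, digest, &dlen);
  /* digest[0..dlen) would now be compared against the expected hash */

  EVP_MD_CTX_free(st.md);
  fclose(st.out);
  curl_easy_cleanup(curl);
  return 0;
}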
When resuming a download for which a hash shall be computed, as in the
Metalink case, one can either use the existing code and hash at the end,
or launch a new thread which hashes the file up to the current position
while the download continues. Once that thread reaches the current
position (which may have moved on since curl was started, because the
download keeps appending data), it would be joined in, or rather be fed
with the new data on-the-fly from then on; a rough sketch of this follows
further below.
Unfortunately this introduces new complexity. It is not really nice to
start reading a file from offset 0 while appending to it at the same
time. In practice this would probably not cause problems today, since
writes first land in the writeback buffer in main memory and the file
likely sits on an SSD with good random seek performance. On a classic
hard disk, however, such a strategy would trigger competing seeks between
the start and the end of the file for reading and writing, wearing out
the poor disk and, depending on the speed of the wire, maybe even slowing
down the download.
Instead of relying on the filesystem to cache the incoming data in main
memory while we are still catching up on the hash, one could of course
let curl repeatedly allocate dynamic memory (again like the example from
the curl sources) to buffer the incoming data until the catch-up hashing
is finished. Once it is done, we fall back to our normal algorithm of
hashing on-the-fly.
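
To make that catch-up idea a bit more concrete, here is a rough sketch,
again assuming OpenSSL's EVP API plus POSIX threads (all names are made
up for illustration and the synchronization is reduced to the bare
minimum): a helper thread hashes the existing prefix of the file while
the write callback buffers newly received chunks in memory; once the
prefix is done, the buffered chunks are drained into the digest and the
callback switches to hashing directly:

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <curl/curl.h>
#include <openssl/evp.h>

struct catchup {
  EVP_MD_CTX *md;        /* single digest context, updated strictly in order */
  const char *path;      /* the partially downloaded file */
  long resume_from;      /* size of the already-downloaded part */
  pthread_mutex_t lock;
  int caught_up;         /* set once the prefix has been hashed */
  char *pending;         /* chunks received while still catching up */
  size_t pending_len;
};

/* helper thread: hash the existing prefix, then drain the buffered chunks */
static void *prefix_hasher(void *arg)
{
  struct catchup *c = arg;
  char block[64 * 1024];
  long left = c->resume_from;
  FILE *f = fopen(c->path, "rb");

  while(left > 0) {
    size_t n = fread(block, 1, sizeof(block), f);
    if(!n)
      break;
    EVP_DigestUpdate(c->md, block, n);
    left -= (long)n;
  }
  fclose(f);

  pthread_mutex_lock(&c->lock);
  if(c->pending_len)
    EVP_DigestUpdate(c->md, c->pending, c->pending_len);
  free(c->pending);
  c->pending = NULL;
  c->caught_up = 1;        /* from now on the callback hashes directly */
  pthread_mutex_unlock(&c->lock);
  return NULL;
}

/* libcurl write callback: hash the chunk directly or buffer it in memory
   (appending the chunk to the output file is omitted for brevity) */
static size_t write_cb(char *ptr, size_t size, size_t nmemb, void *userdata)
{
  struct catchup *c = userdata;
  size_t n = size * nmemb;

  pthread_mutex_lock(&c->lock);
  if(c->caught_up)
    EVP_DigestUpdate(c->md, ptr, n);
  else {
    c->pending = realloc(c->pending, c->pending_len + n);
    memcpy(c->pending + c->pending_len, ptr, n);
    c->pending_len += n;
  }
  pthread_mutex_unlock(&c->lock);
  return n;
}

The caller would zero-initialize the struct, set up the mutex and digest
context, set CURLOPT_RESUME_FROM_LARGE to the existing file size, start
prefix_hasher with pthread_create(), run curl_easy_perform() with this
callback, and finally pthread_join() the thread before calling
EVP_DigestFinal_ex().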
As you can see, this is rather unpleasant, so I would vote for plain
"hashing at the end" when resuming a download.
Also, depending on the processing performance, hashing while downloading
comes with a computational penalty which *should* fit nicely into the
gaps of "waiting for I/O"; however, as we saw with the Apt example, once
the computation takes too long it outlasts the wire and slows down the
download. In the Apt case it seems reasonable to go multi-threaded, since
all the hash functions operate on the same input data. The OS could then
figure out the best time slices to interleave with I/O, or possibly move
the computational workload to another core.
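
For illustration, a minimal sketch of that multi-threaded idea, again
with POSIX threads and OpenSSL's EVP contexts (a real implementation
would keep worker threads alive and hand them chunks, rather than
spawning threads per chunk as this toy version does):

#include <pthread.h>
#include <openssl/evp.h>

struct hash_job {
  EVP_MD_CTX *md;     /* one context per algorithm, e.g. MD5/SHA-1/SHA-256 */
  const char *data;   /* the chunk just received from curl */
  size_t len;
};

static void *hash_worker(void *arg)
{
  struct hash_job *job = arg;
  EVP_DigestUpdate(job->md, job->data, job->len);
  return NULL;
}

/* update all digests over one received chunk in parallel;
   assumes nctx <= 8 for the sake of the example */
static void hash_chunk(EVP_MD_CTX *ctxs[], size_t nctx,
                       const char *data, size_t len)
{
  pthread_t tid[8];
  struct hash_job job[8];
  size_t i;

  for(i = 0; i < nctx; i++) {
    job[i].md = ctxs[i];
    job[i].data = data;
    job[i].len = len;
    pthread_create(&tid[i], NULL, hash_worker, &job[i]);
  }
  for(i = 0; i < nctx; i++)
    pthread_join(tid[i], NULL);
}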
What do you think (about possible changes for the metalink download
stuff)?

> I would prefer to have the entire VTLS part of libcurl turned into a library
> of its own that libcurl could use (although it hasn't happen because there's
> just not enough desire from anywhere to drive such a change). I don't think
> it is libcurl's job to offer neither crypto nor hashing functionality
> outside of transfers.

I see your point about not exposing internal crypto functions to the
outside, and I like the idea of an independent abstraction library.
From my understanding the separation of VTLS in curl is already pretty
decent, so most of the work is done. Of course, when abstracting there is
always a tension between unifying the feature set and exposing the full
functionality of a specific library, but from my impression VTLS
definitely provides what most people will need.

> The Metalink code is not in libcurl.

Sorry, I was merely referring to the glue in curl, as I saw nearly
identical MD5 code in the two places (both the library and the tool).

Last but not least: unfortunately there seems to be little ambition among
the Apt folks to change anything; the Debian bug tracker is full of bugs
regarding dependency resolution or the HTTP handling. To speed up
updating a system, many people work around this with tools like apt-fast
or apt-metalink, which ask apt for the raw URLs and download them by
other means. The metalink folks even use multiple mirrors and feed the
metalink file to aria2, which then downloads from multiple servers at the
same time.
In any case, since the HTTPS handling in Apt already uses curl, it makes
a lot of sense to also use curl for plain HTTP. Since Apt has other
download methods which are probably also hand-written, it would also make
sense to use curl for /everything/. For our objective, I think we would
stick with HTTP and HTTPS via curl and do the hashing while downloading
in concurrent threads. We would then try to upstream these changes, but I
am not too hopeful about seeing them upstream anytime soon (also, Apt is
currently not multi-threaded due to some bugs ten years ago).

Best regards,
Leon