
re: libcurl read-like interface

From: XSLT2.0 via curl-library <curl-library_at_cool.haxx.se>
Date: Thu, 17 Dec 2020 23:40:01 +0100

> Oh right, that's a fun way to try stuff.

My first intention was to use fcurl with the very classic code (error
checking omitted):

    while(!fcurl_eof(fcurl)) {
        sz = fcurl_read(fcurl, buf, 1, BUF_SIZE);
        write(1, buf, sz);
    }

But since fcurl seems bugged, it was quicker to just add the memcpy,
which is sort of what fcurl would do more or less.
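
For reference, the extra copy discussed here can be sketched as a write
callback that copies each delivered chunk into a caller-owned buffer. The
signature is the one libcurl expects for CURLOPT_WRITEFUNCTION; `struct sink`
and `copy_cb` are hypothetical names for illustration, and this is only an
approximation of the idea, not what fcurl actually does internally.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical caller-owned destination buffer. */
struct sink {
    char  *buf;   /* storage provided by the caller */
    size_t cap;   /* total capacity of buf */
    size_t len;   /* bytes stored so far */
};

/* Same signature as a libcurl write callback (CURLOPT_WRITEFUNCTION).
 * Copies the delivered chunk into the sink: this memcpy is the extra
 * pass over the data that a read-like wrapper pays for. */
size_t copy_cb(char *ptr, size_t size, size_t nmemb, void *userdata)
{
    struct sink *s = userdata;
    size_t n = size * nmemb;

    assert(s != NULL && s->len <= s->cap);
    if (n > s->cap - s->len)
        return 0;   /* a return value different from n makes libcurl abort */
    memcpy(s->buf + s->len, ptr, n);
    s->len += n;
    return n;
}
```

With real libcurl this would be registered with curl_easy_setopt() using
CURLOPT_WRITEFUNCTION for `copy_cb` and CURLOPT_WRITEDATA for a pointer to
the `struct sink`.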

So maybe you see the link I made between "read-like" interface and "zero
copy": what is the point of having the code above that is debatably
simpler than the callback code, if it adds a memcpy of the full transfer
in the process?

So that is why I assumed (wrongly) that there might also have been some
work in progress towards "caller provided buffers" to make a read-like
interface on par with the callback performance-wise, and that we could
also use this feature even without fcurl.


> But why measure user time? We want to see the performance impact as a
> whole, right? How much extra time does that memcpy() delay a 10GB
> transfer. Then we measure wall clock time.

I beg to differ, Daniel: that is the wrong tool... if the goal is to
measure the impact of a change on libcurl + callback (like adding an
extra memcpy).

Are you familiar with PERT diagrams used in project management and
the critical path? What you are measuring is the length of the critical
path in YOUR situation, and checking that the added code does not change
the critical path. You are not measuring the individual task itself.

In such a test you have 3 tasks: input, compute, output (that's very
general to all programs in fact!) and they all run pretty much in
parallel in modern OSes (cache, readahead, writeback...).

So what you measure with "real/elapsed" is that in your case "compute"
(libcurl + callback) is not on the critical path, in either the original
or the SLOW version.

That's an interesting measure (user perceived time in your situation),
but it tells nothing about the "cost" of a memcpy until you hit the
critical path. In the measures we both did so far, the critical path is
most probably "input".

But there are cases that are not at all "exotic" where the critical path
is NOT input, and becomes "compute".

It happens on my Raspberry Pi 4 when I run any transfer over https,
because it cannot do the crypto fast enough to sustain 1Gbps (ethernet
speed). Now that libcurl + callback is on the critical path, any code
you add, either in libcurl or in the callback, not only shows up in
"user time" but also directly impacts real/elapsed.
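
To make the critical path argument concrete, here is a toy model. It
assumes (a simplification of mine) that input, compute and output overlap
perfectly, so elapsed time is just the longest of the three. The function
names and the numbers in the note below are made up for illustration and
are not measurements.

```c
#include <assert.h>

/* Simplified model (an assumption): input, compute and output overlap
 * perfectly, so elapsed time is the duration of the longest stage. */
double elapsed_model(double input_s, double compute_s, double output_s)
{
    double m = input_s;

    assert(input_s >= 0 && compute_s >= 0 && output_s >= 0);
    if (compute_s > m)
        m = compute_s;
    if (output_s > m)
        m = output_s;
    return m;
}

/* How much elapsed time grows when extra_s seconds of CPU work
 * (for example a memcpy) are added to the compute stage. */
double elapsed_delta(double input_s, double compute_s,
                     double output_s, double extra_s)
{
    return elapsed_model(input_s, compute_s + extra_s, output_s)
         - elapsed_model(input_s, compute_s, output_s);
}
```

In this model, elapsed_delta(100, 10, 20, 2) is 0: input dominates and the
2 extra seconds are invisible in wall clock time. With
elapsed_delta(80, 100, 20, 2) the result is 2: compute is the critical
path, so every added second shows up in elapsed.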

I am currently running the tests and will add them to the "benchmark".

So yes, if you want a measure that makes sense for checking whether the
library works as well as before after adding some code, you have to use
"user time". Otherwise you might or might not see a difference,
depending on whether you are on the critical path!

With a real network measure (not localhost!), "user time" is also more
stable because it depends less on possible server slowdown or other
traffic on your PC or internet link.
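
A minimal sketch of sampling both clocks from inside a program, assuming
a POSIX system: getrusage() gives the user CPU time of the process and
clock_gettime() gives wall clock time. The helper names `user_seconds`
and `wall_seconds` are hypothetical; this is not the exact method behind
the figures in this thread.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <sys/resource.h>
#include <time.h>

/* User CPU time consumed by this process so far, in seconds. */
double user_seconds(void)
{
    struct rusage ru;
    int rc = getrusage(RUSAGE_SELF, &ru);

    assert(rc == 0);
    (void)rc;
    return (double)ru.ru_utime.tv_sec + (double)ru.ru_utime.tv_usec / 1e6;
}

/* Monotonic wall clock time, in seconds. */
double wall_seconds(void)
{
    struct timespec ts;
    int rc = clock_gettime(CLOCK_MONOTONIC, &ts);

    assert(rc == 0);
    (void)rc;
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}
```

Calling both helpers before and after curl_easy_perform() and subtracting
gives the "user" and "real" figures for the transfer alone. The user delta
reflects the CPU work done by libcurl + callback, while the wall delta
also includes time spent waiting on the network.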


And the answer is: a single memcpy adds between 1.4 and 2 seconds for a
10GB transfer on a RPi4 (direct impact from user time to elapsed when
using TLS).
To give a sense of scale, that is more than 15% of the time taken by the
library + callback to do a simple GET (no TLS) on the same file, so
quite a huge impact in fact!



> what's the performance gain for the user who'd want to use this, and
> can the API be done in a way that makes this feature practical/attractive.

For the sake of brevity (this response is already too long!) I suggest
doing another post: "What a fuse-driver programmer would like from libcurl".

Unlike "application" programs using libcurl, a "fuse driver", although
technically "userland", is quasi-kernel code. So here performance is
quite critical and you can badly mess up the system when the driver
becomes sluggish.

Being OK with "we are not in the critical path" is really not what you
want in a fuse driver! ;-)


May I have some time to come up with a relevant response to that question?

I have in mind some possible proposals that could help without
"breaking" the API as much as "caller provided buffers" would. I
still have to refine them a bit.


Rest assured, I also have plenty of bigger optimisations than 'memcpy' to
do on my own driver... and even bugs, like a known race condition I should fix!


Cheers

Alain




Received on 2020-12-17