Re: relative performance
From: XSLT2.0 via curl-library <curl-library_at_cool.haxx.se>
Date: Thu, 26 Aug 2021 23:15:24 +0200
> Slow enough for what? I agree that if we run the tests over an actual physical
> network, a system with a fast enough CPU will saturate the network and then it
> mostly won't matter how fast or not curl is.
>
> For this kind of test to be sensible, we need to make sure to either have a
> faster pipe than can be saturated by a single CPU core or do a test setup that
> can't do it due to complexity. It seems easiest to accomplish this by doing
> transfers on localhost.
Exactly, and much better worded than what I wrote!
> When doing transfers on localhost I don't think it matters much exactly how
> fast the CPU is, and I'm convinced we will see deltas between versions
> whichever CPU we use. In fact, I believe I've already spotted some. I'm just
> not ready yet to draw the conclusions nor to start working on figuring out why
> they exist.
Doing full transfers makes sense if you want to test several parts of
the library code.
It didn't make sense in my case: I already know the TLS handshake is
(terribly) slow and that it is needed for every transfer (the server
closes the connection even with Connection: keep-alive!), so I was
testing transfer speed only.
> I've provided scripts in the curl/relative directory now that can:
>
> 1. build 'sprinter' the test tool
> 2. build (lib)curl for a number of versions and install them locally
> 3. run sprinter with each of those built versions
>
> It seems most interesting to do A LOT of smaller transfers with a fairly huge
> concurrency. I've played with doing 100,000 4K transfers at 100 at a time and
> with 6 8GB transfers at 2 at a time, and the latter will mostly just saturate
> the memory bandwidth in the machine.
>
> The output for the sprinter runs is not easily "comparable" yet and there's no
> machine help to detect regressions etc but it's a decent start I think.
>
> I'm curious if others will see the same thing I seem to see right now...
You probably won't fall into the same *pitfall* I fell into, because even
if you have 100 transfers in parallel in a "multi", you are still
"single threaded".
What explains why, in my case, I got times varying from 7:00 to 11:30
for the exact same transfer is a side effect of multi-threading.
It happens that my test laptop runs with the "conservative" governor. By
default, fuse multi-threads its requests, which were then served by the
same curl handle, which had to jump from one core to another. Not only
is that bad for the caches (core bouncing), but with the conservative
governor it did not generate enough load on any single core for the
frequencies to be stepped up. Hence the whole transfer was done at the
lowest frequency! Running fuse "single threaded" (the -s option), or my
old algorithm that uses a "worker stream", works better because the load
on the "worker" or single thread is enough for the governor to push the
frequency to the maximum allowed. But the measurements only became
really steady once I set all the cores to a single fixed frequency.
For the same reason, on my more recent desktop, the minimum frequency
being enough to cope with the workload, the measurements were super
steady, and none of the four very different algorithms made any
difference (looking at the wall time only).
Possibly, one of the reasons you find it better to have 100 transfers in
parallel is that it generates some load, so the core is bumped to max
frequency very early and stays there.
Better safe than sorry: to be sure you eliminate this bias, I suggest
you lock your machine at a given frequency before running the test
script. If you don't, and kernel+ACPI decides to change the frequency in
the middle of a test, the standard deviation of your measurements may
increase to the point where they become quite hard to make any sense of!
For instance, 1600MHz being the max on my laptop; adapt the frequency to
your machine, the loop to your number of cores, and the governor to
those allowed on your machine. Any governor is OK as long as you pin a
single frequency (note that cpufreq-set reads a bare number as kHz, so
give the unit explicitly, and select the userspace governor before
setting the frequency):

for core in $( seq 0 7 ); do
    sudo cpufreq-set -c $core -g userspace
    sudo cpufreq-set -c $core -f 1600MHz
done

Check with:

for core in $( seq 0 7 ); do sudo cpufreq-info -c $core -p; done
(There is no such locking in your script at the moment, and it is hard
to code generically since it depends on the processor... and you
probably don't want to run a curl test as sudo, or even have 'sudo'
calls in your script!)
I also run my tests under 'perf'; it gives interesting figures about
cache hits/misses... (off topic) the famous effect of "memcpy" on which
we disagreed is very real on a "slow" machine, believe me, and it shows
in the wall time! It is no mystery why the kernel folks tried to
eliminate it (especially for fuse) with the "splicing" technique.
Copying the standard 16k curl buffer not only takes instruction time, it
flushes most of the L1 data cache on my laptop's processor ('only' 32k
of data cache per core); 'perf' shows it. This probably accounts for
most of the difference between "curl with pause" and raw curl_easy_recv,
since with the latter the copy is not necessary: the caller passes in
the address where the data should be placed.
Nevertheless, I agree with you: it is hard to interpret all these
figures, and mathematically impossible to optimise for every case,
especially in a library. For instance, I saw you did a great job
optimising the size of the main curl structure over time. For some of my
algorithms, further 'nano-optimisation' could be obtained by clever
structure placement, to minimise the number of cache lines the core has
to refresh. But another person, using another protocol, would need a
different placement...
Rest assured, as I said in the summary of my previous e-mail: on a
"recent enough desktop" (at 5.5 years old, mine still qualifies), none
of this makes any difference. But like you, I keep looking for the
"right algorithm", since it does make a difference on my antique laptop,
and also on the very recent Raspberry Pi 4!
Cheers, and keep up the good job.
Alain
-------------------------------------------------------------------
Unsubscribe: https://cool.haxx.se/list/listinfo/curl-library
Etiquette: https://curl.se/mail/etiquette.html
Received on 2021-08-26