curl / Mailing Lists / curl-library / Single Mail
Buy commercial curl support from WolfSSL. We help you work out your issues, debug your libcurl applications, use the API, port to new platforms, add new features and more. With a team lead by the curl founder himself.

Total http/2 concurrency for multiplexed multi-handle

From: Jeroen Ooms via curl-library <curl-library_at_lists.haxx.se>
Date: Wed, 8 Feb 2023 18:02:40 +0100

I have a CRON job scraping some content from GitHub every night (about
20k small files). It worked well for a year, but recently something
changed such that after a few minutes GitHub stars giving a lot of 403
and then after another minute I start getting thousands of these:

HTTP/2 stream 20135 was not closed cleanly before end of the underlying stream
HTTP/2 stream 20137 was not closed cleanly before end of the underlying stream
HTTP/2 stream 20139 was not closed cleanly before end of the underlying stream

So either they introduced a server bug, or perhaps GitHub is
deliberately blocking abusive behavior due to high concurrency.

I am using a multi handle with CURLPIPE_MULTIPLEX and otherwise
default settings. Am I correct that this means libcurl starts 100
concurrent streams (CURLMOPT_MAX_CONCURRENT_STREAMS), and still make 6
concurrent connections (CURLMOPT_MAX_HOST_CONNECTIONS) per host, i.e.
download 600 files in parallel? I can imagine that could be considered
abusive.

Should I set CURLMOPT_MAX_HOST_CONNECTIONS to 1 in case of http/2
multiplexing? Or is CURLMOPT_MAX_HOST_CONNECTIONS ignored in case of
multiplexing?

One other thing I noticed is that GitHub does not seem to set any
MAX_CONCURRENT_STREAMS, or at least I am not seeing any. For example
on httpbin I see this:

   curl -v 'https://httpbin.org/get' --http2
   * Connection state changed (MAX_CONCURRENT_STREAMS == 128)!

However for GitHub I don't see such a thing:

    curl -v 'https://raw.githubusercontent.com/curl/curl/master/README' --http2

So does this mean libcurl will assume 100 streams is OK?

Is there a way to debug this, and monitor how many active downloads a
multi-handle is making in total (summed over all connections)? Afaik,
the value 'running_handles' from curl_multi_perform() gives me the
total uncompleted requests, including those that have not started yet,
so that does not tell me how many are in progress?
-- 
Unsubscribe: https://lists.haxx.se/listinfo/curl-library
Etiquette:   https://curl.se/mail/etiquette.html
Received on 2023-02-08