Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Curl is not handling uploads with large number of URLs (100,000+) #1959

Closed
arainchik opened this issue Oct 6, 2017 · 7 comments
Closed

Curl is not handling uploads with large number of URLs (100,000+) #1959

arainchik opened this issue Oct 6, 2017 · 7 comments

Comments

@arainchik
Copy link

I did this

I'm trying to upload large number of small files (100,000 or 1,000,000) using single HTTPS connection

for i in {1..100000}; do   echo "upload-file=/tmp/file${i}";   echo "url=https://server/path/file${i}"; done > /tmp/a.cfg
curl -vvv -K /tmp/a.cfg

I expected the following

I expected curl to start uploading process in a reasonable amount of time, but upload process never starts.

If I'm trying to download files instead of uploading - curl starts processing right away, see below.

for i in {1..100000}; do echo "url=https://server/path/file${i}"; done > /tmp/b.cfg
curl -vvv -K /tmp/b.cfg

It looks like curl is not processing large number of uploads/URLs in optimal way.

curl/libcurl version

[curl -V output]
curl 7.52.1 (x86_64-apple-darwin13.4.0) libcurl/7.52.1 OpenSSL/1.0.2l zlib/1.2.8
Protocols: dict file ftp ftps gopher http https imap imaps pop3 pop3s rtsp smb smbs smtp smtps telnet tftp
Features: IPv6 Largefile NTLM NTLM_WB SSL libz TLS-SRP UnixSockets HTTPS-proxy

operating system

macOS Sierra 10.12.6

@jay
Copy link
Member

jay commented Oct 6, 2017

I tried this in Windows with several recent builds and it takes about 20 seconds to parse the config with the upload options (200,000 lines) and 10 seconds to parse the config with the download options (100,000 lines). If you build with -DDEBUG_CONFIG it will show you the parsing in real time, for example I see this repeating without delay:

GOT: upload-file
PARAM: "/tmp/poa09580"
GOT: url
PARAM: "http://httpbin.org/put"

Check for a delay in parsing the entries.

edit: note I'm using debug builds of curl/libcurl, release builds would obviously be faster.

@arainchik
Copy link
Author

I'm not sure if Windows curl would operate any different. But curl on macOS (comes with the OS) and the latest build of curl from GitHub I tried on debian is processing config, but it takes forever and it never finishes in reasonable time to start uploading. I do see that it's processing entries (with strace), but it's doing it kind of slow.

read(4, "st/upload/34235\nupload-file /tmp"..., 4096) = 4096
read(4, "/localhost/upload/34307\nupload-f"..., 4096) = 4096
read(4, "l http://localhost/upload/34379\n"..., 4096) = 4096
read(4, "34451\nurl http://localhost/uploa"..., 4096) = 4096
cread(4, "le /tmp/34523\nurl http://localho"..., 4096) = 4096
read(4, "pload-file /tmp/34595\nurl http:/"..., 4096) = 4096
read(4, "/34666\nupload-file /tmp/34667\nur"..., 4096) = 4096
read(4, "t/upload/34738\nupload-file /tmp/"..., 4096) = 4096
read(4, "localhost/upload/34810\nupload-fi"..., 4096) = 4096
read(4, " http://localhost/upload/34882\nu"..., 4096) = 4096
read(4, "4954\nurl http://localhost/upload"..., 4096) = 4096
brk(0x2082000) = 0x2082000
read(4, "e /tmp/35026\nurl http://localhos"..., 4096) = 4096
read(4, "load-file /tmp/35098\nurl http://"..., 4096) = 4096
read(4, "35169\nupload-file /tmp/35170\nurl"..., 4096) = 4096
read(4, "/upload/35241\nupload-file /tmp/3"..., 4096) = 4096
read(4, "ocalhost/upload/35313\nupload-fil"..., 4096) = 4096

@arainchik
Copy link
Author

I've also compiled it with "-pg" option on Linux and I see that it spends most of time in getparameter function.

this is tool_getparam.c part where I think it spends all this time. If I get it correctly - it constantly scans list of URLs (from beginning to end) and more URLs get added - the longer it takes.

  /* we are uploading */
{
  struct getout *url;
  if(!config->url_out)
    config->url_out = config->url_list;
  if(config->url_out) {
    /* there's a node here, if it already is filled-in continue to find
       an "empty" node */
    while(config->url_out && (config->url_out->flags & GETOUT_UPLOAD))
      config->url_out = config->url_out->next;
  }

@bagder
Copy link
Member

bagder commented Oct 6, 2017

Yeps, that's exactly what I suspected and spotted as well. We should make the logic store the last used pointer so that it can start at that point the next time to avoid a lot of looping...

@arainchik
Copy link
Author

arainchik commented Oct 6, 2017

So I tried this: replacing tool_getparam.c:1916

from

config->url_out = config->url_list;

to

config->url_out = config->url_last;

and it seem to fixed the issue :) I'm not a developer and I don't understand the logic with GETOUT_UPLOAD flags at all, but hey - it worked :)

@bagder
Copy link
Member

bagder commented Oct 12, 2017

Lovely @arainchik and thanks! Someone just needs to verify that it actually is a proper fix and we can land that. This is a pretty tricky issue to add a test case for so - I'd rather not have a hundred thousand transfers as a test. I think we can skip it this time.

@arainchik
Copy link
Author

@Badger - you don't have to test actual 100,000 transfers. If you test and use some random, non-existing domain name or incorrect IP address like http://999.999.999.999/upload - you'll be able to test parsing config file without actual transfers.

bagder added a commit that referenced this issue Nov 4, 2017
By properly keeping track of the last entry in the list of URLs/uploads
to handle, curl now avoids many meaningless traverses of the list which
speeds up many-URL handling *MASSIVELY* (several magnitudes on 100K
URLs).

Added test 1291, to verify that it doesn't take ages - but we don't have
any detection of "too slow" command in the test suite.

Reported-by: arainchik on github
Fixes #1959
@bagder bagder removed the help wanted label Nov 4, 2017
@bagder bagder closed this as completed in ee8016b Nov 4, 2017
@lock lock bot locked as resolved and limited conversation to collaborators May 6, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Development

No branches or pull requests

3 participants