curl-library
Re: performance at scale
Date: Fri, 2 Oct 2009 14:33:48 -0700
On Fri, Oct 02, 2009 at 11:19:14AM -0700, Nick Gerner wrote:
> I'm curious if anyone has any tips about performance of libcurl at scale. I
> have some pretty good crawling code that I'm always trying to tune. I'm
> running curl_multi with poll and between 500 and 1000 curl handles.
Just curious, but what led you to use that many easy handles in your
multi-handle? I recently wrote an application that has to download lots
of different files. Instead of a 1:1 mapping between URLs and
easy-handles, I built a queue for the requests and configured the
system to use a fixed number of easy-handles in the multi-handle (about
20, I think). Once a transaction finishes, its easy-handle is
reconfigured to service the next request in the queue. I've had good
performance with this design.
> And more interestingly:
>
> 977123 38.6471 url.c:0 ConnectionExists
> 477781 18.8972 (no location information) Curl_raw_equal
> 344057 13.6081 hostip.c:0 hostcache_timestamp_remove
> 230962 9.1350 rawstr.c:0 my_toupper
> 184067 7.2802 (no location information) Curl_hash_clean_with_criterium
> 67392 2.6655 (no location information) curl_multi_remove_handle
> 65826 2.6035 (no location information) Curl_hash_pick
> 35846 1.4178 (no location information) Curl_hash_add
>
> That ConnectionExists call seems to take a lot of time! Looking at the
> code, it looks like ConnectionExists should not get called if I set
> curl_easy_setopt(curl[i]->curl, CURLOPT_FRESH_CONNECT, (long)1);
>
> So I did that and got much better performance. But I still see basically
> the same oprofile report (basically 40% of my CPU time is in libcurl and 40%
> of libcurl's time is spent in ConnectionExists). So... any thoughts on:
>
> 1) why ConnectionExists takes so long? (I'm guessing it does an expensive
> traversal of a really big list of maybe 4k cached connections)
It looks like ConnectionExists walks the entries in the connection
cache when it looks for a match. If there is no matching connection,
every lookup ends up as a linear scan of the entire table. The
connection cache is kept in the multi-handle when the multi interface
is used.
> 2) why I'm still getting all this time spent in ConnectionExists
Not sure. Do you have a test program that demonstrates this behavior?
> 3) any other general perf tips (e.g. other curl_easy_setopt or
> curl_multi_setopt settings, or maybe compile time options)
You might try setting a different value for CURLMOPT_MAXCONNECTS, but a
smaller cache would limit your ability to re-use cached connections. By
default the cache holds 10 connections and is enlarged as easy handles
are added so that it can hold 4 * n connections, where n is the number
of easy handles in the multi-handle. With 1000 easy handles that is
4000 entries, which matches your guess of about 4k cached connections.
(http://curl.haxx.se/libcurl/c/curl_multi_setopt.html)
That said, connection caching provides a substantial performance
benefit, if you expect your transactions to connect to the same host
multiple times.
I don't know what problem you're trying to solve. This means my advice
is more generic, and less useful, than it would be with more detail;
however, if you're interested in scaling your application up to multiple
cpus/threads, you might want to consider the following different
approaches.
1. Multiple threads, each with an easy-handle, where the work is pulled
from a queue.
2. A queue with a multi-handle, where work is processed by a fixed
number of easy-handles.
3. One or more queues, multiple threads, each thread with a multi-handle
and a fixed number of easy-handles, where each thread schedules work
from the queue and runs its multi-handle.
I'm not sure if any of these is ideal for your project, but they might
be worthwhile starting points.
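As a rough illustration of approach 1, here is a sketch of worker threads pulling URLs from a shared queue. All of the names (work_queue, run_workers, fetch_stub, MAX_THREADS) are made up, and the libcurl part is left as a callback so that the queue logic stands on its own. In a real program each thread would own one easy handle and the callback would set CURLOPT_URL on it and call curl_easy_perform(). An easy handle must not be used by two threads at the same time.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define MAX_THREADS 64  /* arbitrary cap for this sketch */

/* Hypothetical per-URL callback; worker_id identifies the calling thread. */
typedef void (*work_fn)(const char *url, size_t worker_id);

/* Shared queue: a mutex-protected cursor over a fixed array of URLs. */
struct work_queue {
    const char **urls;
    size_t count;
    size_t next;
    pthread_mutex_t lock;
};

struct worker {
    struct work_queue *queue;
    work_fn fn;
    size_t id;
    size_t done;  /* number of URLs this worker handled */
};

/* Take the next URL off the queue. Returns 0 once the queue is drained. */
static int queue_pop(struct work_queue *q, const char **url)
{
    int got = 0;
    pthread_mutex_lock(&q->lock);
    if (q->next < q->count) {
        *url = q->urls[q->next++];
        got = 1;
    }
    pthread_mutex_unlock(&q->lock);
    return got;
}

static void *worker_main(void *arg)
{
    struct worker *w = arg;
    const char *url;
    while (queue_pop(w->queue, &url)) {
        w->fn(url, w->id);
        w->done++;
    }
    return NULL;
}

/* Stand-in callback that does nothing. A libcurl version would look up the
   easy handle owned by worker_id and run the transfer on it. */
void fetch_stub(const char *url, size_t worker_id)
{
    (void)url;
    (void)worker_id;
}

/* Process all URLs with up to nthreads workers. Returns how many URLs were
   handled in total. */
size_t run_workers(const char **urls, size_t count, size_t nthreads, work_fn fn)
{
    struct work_queue q;
    struct worker w[MAX_THREADS];
    pthread_t tid[MAX_THREADS];
    size_t started = 0, total = 0;

    assert(fn != NULL && nthreads > 0 && nthreads <= MAX_THREADS);

    q.urls = urls;
    q.count = count;
    q.next = 0;
    pthread_mutex_init(&q.lock, NULL);

    while (started < nthreads) {
        w[started].queue = &q;
        w[started].fn = fn;
        w[started].id = started;
        w[started].done = 0;
        if (pthread_create(&tid[started], NULL, worker_main, &w[started]) != 0)
            break;
        started++;
    }

    for (size_t i = 0; i < started; i++) {
        pthread_join(tid[i], NULL);
        total += w[i].done;
    }
    pthread_mutex_destroy(&q.lock);
    return total;
}
```

Approach 3 has the same shape, except that each worker would drive its own multi-handle with a fixed pool of easy handles instead of making one blocking transfer at a time.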
-j
Received on 2009-10-02