Re: curl-library Digest, Vol 50, Issue 5

From: Nick Gerner <nick_at_seomoz.org>
Date: Fri, 2 Oct 2009 17:51:22 -0700

>
> On Fri, Oct 02, 2009 at 11:19:14AM -0700, Nick Gerner wrote:
> > I'm curious if anyone has any tips about performance of libcurl at scale.
> > I have some pretty good crawling code that I'm always trying to tune. I'm
> > running curl_multi with poll and between 500 and 1000 curl handles.
>
> Just curious, but what led you to decide to use that many easy handles
> in your multi-handle?
>

This is a web crawler, so we're using curl_multi to get parallelism without
threads. For us, that works much better than threads. I believe that's why
curl_multi was developed in the first place. Then again, I wasn't there, so
what do I know :) We don't want to reuse connections at all
(CURLOPT_FORBID_REUSE and CURLOPT_FRESH_CONNECT made a big difference; see
my later message with an updated oprofile dump).

Currently we're getting about 1k pages per second on a single 2.4GHz core.
Does that sound like the right ballpark to others?

> > 2) why I'm still getting all this time spent in ConnectionExists
>
> Not sure. Do you have a test program that demonstrates this behavior?
>

See my later message; sorry for the confusion.

> > 3) any other general perf tips (e.g. other curl_easy_setopt or
> > curl_multi_setopt settings, or maybe compile time options)
>
> I don't know what problem you're trying to solve.

Yeah, I can see how that would make advice hard :)
I guess I should keep it more focused; see my later message with the
updated oprofile dump.

hostcache_timestamp_remove and Curl_hash_clean_with_criterium seem to be
where most of my CPU time is spent now. Does anyone know what these are doing,
and whether I can avoid this work? It seems like the real bottlenecks should be:

1) my app code, surely I'm not as good a dev as you guys ;)
2) the network
3) handling HTTP, parsing strings, etc.

It seems as if neither of the above functions falls into these categories;
they might be supporting neat features that don't apply to my scenario. Can
I turn these off, or somehow make them less expensive?

--Nick

-------------------------------------------------------------------
Received on 2009-10-03