curl-library

Re: Problems in using curl library in multithread application.

From: Kord Campbell <kord_at_grub.org>
Date: Wed, 13 Feb 2002 18:22:54 -0600 (CST)

Rick,

We create a separate instance of cURL per thread. We do batch
crawling, usually 500 URLs at a time, so if we have 10 threads,
or crawlers as we call them, then each thread gets used about
50 times, which means that each instance of cURL is used just
as many times as its controlling thread.
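The arrangement above can be sketched roughly as follows. This is only an illustration, not the grub source: `struct crawler`, `split_batch`, `crawl` and `run_batch` are made-up names, and the static split of the batch is a simplification (in the real client the coordinator hands out URLs). The fetch itself is left as a comment; with libcurl, each thread would call curl_easy_init() once, reuse that handle for every URL in its share, and call curl_easy_cleanup() when done.

```c
#include <pthread.h>
#include <stddef.h>

#define NUM_CRAWLERS 10   /* thread count used in the example above */
#define BATCH_SIZE   500  /* URLs per batch used in the example above */

/* Hypothetical per-thread state. In a real crawler this would also hold
 * the thread's private CURL * handle, never shared with other threads. */
struct crawler {
    int    id;
    size_t first;    /* index of the first URL in this crawler's share */
    size_t count;    /* number of URLs in this crawler's share */
    size_t fetched;  /* how many times the private handle was used */
};

/* Split `total` URLs as evenly as possible across `n` crawlers. */
static void split_batch(struct crawler *c, size_t n, size_t total)
{
    size_t base = total / n, extra = total % n, next = 0;
    for (size_t i = 0; i < n; i++) {
        c[i].id = (int)i;
        c[i].first = next;
        c[i].count = base + (i < extra ? 1 : 0);
        c[i].fetched = 0;
        next += c[i].count;
    }
}

/* Thread body: each crawler touches only its own struct, so no locking. */
static void *crawl(void *arg)
{
    struct crawler *self = arg;
    for (size_t i = 0; i < self->count; i++) {
        /* Fetch URL number self->first + i with the private handle here. */
        self->fetched++;
    }
    return NULL;
}

/* Run one batch. Returns 0 on success, -1 if n is out of range or a
 * thread could not be started. */
static int run_batch(struct crawler *c, size_t n, size_t total)
{
    pthread_t tid[NUM_CRAWLERS];
    size_t started = 0;
    int rc = 0;

    if (n == 0 || n > NUM_CRAWLERS)
        return -1;
    split_batch(c, n, total);
    while (started < n) {
        if (pthread_create(&tid[started], NULL, crawl, &c[started]) != 0) {
            rc = -1;
            break;
        }
        started++;
    }
    /* Wait for every thread that did start before returning. */
    for (size_t i = 0; i < started; i++)
        pthread_join(tid[i], NULL);
    return rc;
}
```

Calling run_batch(crawlers, NUM_CRAWLERS, BATCH_SIZE) on an array of ten `struct crawler` leaves each one with 50 fetches, which matches the numbers above.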

To manage the results from the threads, I pass back the address
of a struct that is written into by a callback function from cURL.
The address of the struct is used by a coordinating thread that
then writes the results out into a custom database which is then
written to the hard drive. The coordinator stops, starts and
pauses the crawlers as needed and is in charge of handing out
URLs for them to crawl. I should mention that our "database"
isn't reentrant, which is why we have a single thread
doing all the writing and coordinating.
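For illustration, a callback that fills in a struct might look something like the sketch below. The struct layout and the names `fetch_result` and `collect_body` are assumptions made up for this example, not the actual grub code; only the function signature is the one libcurl expects for a write callback.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical result record; the real struct is not shown in this post. */
struct fetch_result {
    char  *data;  /* bytes received so far, kept NUL-terminated */
    size_t len;   /* number of bytes in data, not counting the NUL */
};

/* Has the signature libcurl expects for CURLOPT_WRITEFUNCTION. libcurl
 * may call it several times per transfer, so every call appends. */
static size_t collect_body(char *ptr, size_t size, size_t nmemb, void *userdata)
{
    struct fetch_result *res = userdata;
    size_t chunk = size * nmemb;
    char *grown = realloc(res->data, res->len + chunk + 1);

    if (grown == NULL)
        return 0;  /* a count smaller than chunk makes libcurl stop */
    memcpy(grown + res->len, ptr, chunk);
    res->data = grown;
    res->len += chunk;
    res->data[res->len] = '\0';
    return chunk;  /* report every byte as handled */
}
```

With libcurl, the function would be registered through curl_easy_setopt() with CURLOPT_WRITEFUNCTION, and the address of the struct passed with CURLOPT_WRITEDATA. Once the transfer finishes, the crawler can hand that same address to the coordinating thread, so only the coordinator ever touches the database.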

For the most part the bottleneck of the system would have to be
the archiver. The coordinator thread just pushes data around,
so it doesn't really take that long to move data from point A
to point B.

I've had beta testers report that a single instance of the grub
client, running 50 clients, is capable of slamming a 4Mbps link
to the net. This would imply that the threads are capable of
crawling millions of sites a day, from a single machine.
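As a rough check on that estimate, here is the arithmetic as a small function. The average page size is an assumption of mine (no measured figure is given here), and `pages_per_day` is just an illustrative name. It also assumes the whole link carries page bodies, ignoring protocol overhead.

```c
#include <stdint.h>

/* Pages per day a link could carry if used entirely for page bodies.
 * link_bps:       link speed in bits per second (4 Mbps = 4000000).
 * avg_page_bytes: assumed mean page size in bytes; not a measured value. */
static uint64_t pages_per_day(uint64_t link_bps, uint64_t avg_page_bytes)
{
    const uint64_t seconds_per_day = 86400;
    uint64_t bytes_per_day = link_bps / 8 * seconds_per_day;

    if (avg_page_bytes == 0)
        return 0;  /* avoid dividing by zero on a bad input */
    return bytes_per_day / avg_page_bytes;
}
```

A 4Mbps link moves 500,000 bytes a second, or 43.2 GB a day. At an assumed 10,000 bytes per page, pages_per_day(4000000, 10000) gives 4,320,000, so "millions a day" holds as long as pages average no more than a few tens of kilobytes.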

Oh, BTW, our code is GPL'd, so if you want to, grab it from SF.

Hope this helps!

Kord
--------------------------------------------------------------
Kord Campbell                          Grub.Org Inc.
President                              6051 N. Brookline #118
                                       Oklahoma City, OK 73112
kord_at_grub.org                       Voice: (405) 843-6336
http://www.grub.org                    Fax: (405) 848-5477
--------------------------------------------------------------

On Wed, 13 Feb 2002, rick vaillancourt wrote:

> Just curious Kord, do you create a new curl instance per thread, or do you
> cache curl instances?
>
> I would be interested in hearing how you manage these threads and what your
> performance is like as I have a similar need. (Right now I am caching curl
> instances.)
>
> Thanks.
>
> -Rick
>
>
> >From: Kord Campbell <kord_at_grub.org>
> >To: chicco <chicco_at_gammasite.com>
> >CC: curl-library_at_lists.sourceforge.net
> >Subject: Re: Problems in using curl library in multithread application.
> >Date: Wed, 13 Feb 2002 10:58:41 -0600 (CST)
> >
> >On Wed, 13 Feb 2002, chicco wrote:
> >
> > > I'm using Curl lib in my multithread application to fetch pages from the
> > > net.
> >So are we. What platform are you running/developing your application on?
> >
> > > It is working all right until the number of fetching threads is more
> > > than 8.
> > > When the number of threads exceeds 8 the application uses almost 100% of
> > > CPU and works very, very slowly.
> > > Each thread is using its own Curl object.
> > > This application used to operate O.K. using Wget (200 threads and more).
> >How are you communicating with cURL? We used wget as well before
> >converting to cURL, but it was by calling wget with a system() call.
> >We use the cURL libraries now, as you seem to be doing, and have zero
> >problems with running 100s of threads at the same time on Linux.
> >
> >I could probably help out more if you could supply more information
> >about your current setup.
> >
> >Kord
> >
> >http://grub.org
> >
Received on 2002-02-14