Re: Project hiper - High Performance libcurl
Date: Wed, 9 Nov 2005 13:29:57 -0800
On 11/9/05, Jamie Lokier <jamie_at_shareable.org> wrote:
> Daniel Stenberg wrote:
> > >The current multi interface of having the user wait for events is not an
> > >efficient design when it comes to IOCP.
> > >
> > >IOCP uses threading to get full scalability on SMP systems.
> > But does it really scale that well on the ordinary plain non-SMP system? Is
> > your normal Windows box up to serving 10,000 threads? (if you have the RAM
> > for it). And I would expect that using a thread for each connection will be
> > more memory consuming than using a single thread for all transfers (due to
> > stack and thread context overhead).
You misunderstood me, I do not want a separate thread for each
request. I want a small pool of threads that would be used by all
easy handles. It scales well on non-SMP systems.
> As I understand it, the most efficient models are based on events,
> like your design, but with a bounded pool of threads to provide a
> certain amount of concurrency. This is efficient even on a single
> CPU, because it provides concurrency when there is blocking I/O,
> e.g. due to swapping or disk I/O. It also has potential advantages in
> managing average latency and fairness, when some event operations take
> unusually long to be processed.
Correct. IOCP (by default) only allows as many threads to run as there
are CPUs. If one of your threads blocks, another will be awakened.
This makes sure you are optimally using your time.
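The bookkeeping behind that cap can be sketched as a toy model. This is
not the Windows API: the names `gate`, `gate_try_release`, `gate_block`
and `gate_unblock` are made up for illustration, and the model assumes
there is always queued work for a released thread to pick up.

```c
#include <assert.h>

/* Toy model of a completion port's concurrency cap (hypothetical names). */
struct gate {
    int limit;    /* target number of running threads, e.g. the CPU count */
    int running;  /* threads currently executing handlers */
    int waiting;  /* threads parked, waiting to be handed work */
};

/* A parked thread is released only while fewer than `limit` are running.
 * Returns 1 if a thread was released, 0 otherwise. */
static int gate_try_release(struct gate *g)
{
    if (g->waiting > 0 && g->running < g->limit) {
        g->waiting--;
        g->running++;
        return 1;
    }
    return 0;
}

/* A running thread blocks (disk I/O, page fault, ...): its slot frees up,
 * so another parked thread may now be released. */
static void gate_block(struct gate *g)
{
    assert(g->running > 0);
    g->running--;
}

/* The blocked thread resumes. It runs again even if that pushes the
 * count above `limit` for a while. */
static void gate_unblock(struct gate *g)
{
    g->running++;
}
```

With a limit of 2, a third thread stays parked until one of the two
running threads blocks; when the blocked thread resumes, three are
briefly running.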
> The idea is that you have the usual event callbacks and state
> machines, but the callbacks for different events can run concurrently.
> It's up to the thread/event scheduler to optimise the concurrency,
> priorities etc. to strike the best balance. This means that on a
> single CPU system, it would use only one or a few threads. On a
> larger system, it would use more threads. And, if some threads are
> tending to block on I/O or thread-specific swapping (e.g. looking
> things up in an in-memory hash table, which is paged out), the number
> of threads increases to provide greater throughput. Multiple CPUs,
> blocking, and slow event handlers are the reason for having concurrency.
> The event driven state machines are there to remove the overhead that
> comes with large numbers of threads. Particularly, thread stack
> contexts take a lot of memory, but there are many other overheads too.
> Another significant advantage of single-threaded, event state machines
> is that no data locking is needed, due to lack of concurrency.
> Locking is not cheap.
> To retain that advantage, it is best to encode rules which ensure some
> groups of event handlers will not be run concurrently. In effect,
> it's like saying that every event handler is associated with a lock,
> and all the handlers in a group share the same lock - but with the
> scheduler optimising away the actual locking in that case. This means
> the event handlers in a group can access the same data, without
> needing finer-grained locks.
> For a library like Curl, a likely rule would be to put all the
> handlers for a specific file descriptor, or for a specific request, in
> a group. _Some_ locking is then needed because file descriptors
> service multiple requests, and requests may access multiple file
> descriptors. But the locking can be quite coarse-grained.
> The above sounds quite complicated and rarely done, and it is.
It would be trivial to add a lock to every easy handle, and only allow
single-threaded access to the easy handle. You'd still get a benefit
of concurrency when running multiple handles, and contention would be
very rare. I don't think any effort beyond that would be needed,
unless curl uses TLS (thread-local storage) somewhere.
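A minimal sketch of what one lock per handle could look like, assuming
POSIX threads. The `locked_handle` wrapper and its functions are
hypothetical and are not part of libcurl; the `easy` pointer only stands
in for a real easy handle.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical wrapper: one mutex per transfer handle. */
struct locked_handle {
    pthread_mutex_t lock;  /* serialises all access to this handle */
    void *easy;            /* stand-in for the easy handle; opaque here */
    long events_handled;   /* per-handle state, touched only under the lock */
};

static void locked_handle_init(struct locked_handle *h, void *easy)
{
    int rc = pthread_mutex_init(&h->lock, NULL);
    assert(rc == 0);
    (void)rc;
    h->easy = easy;
    h->events_handled = 0;
}

/* Called by whichever pool thread picked up an event for this handle.
 * Threads working on different handles never contend with each other. */
static void locked_handle_on_event(struct locked_handle *h)
{
    pthread_mutex_lock(&h->lock);
    h->events_handled++;  /* stand-in for driving the transfer's state machine */
    pthread_mutex_unlock(&h->lock);
}

static void locked_handle_destroy(struct locked_handle *h)
{
    pthread_mutex_destroy(&h->lock);
}
```

Contention arises only when two events for the same handle are picked
up by different threads at the same time.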
> A good approximation is done, very simply, by starting a small pool of
> threads, and each thread _independently_ handling its own large set of
> requests etc. You can do this with Curl today, and with Curl's
> enhancements to use large numbers of file descriptors that we've
> already talked about.
> > >It could certainly be done from one thread, but then you would be losing
> > >many of the benefits. This is aimed at being high performance so libcurl
> > >(imho) should definitely add the minimal threading awareness for IOCP, if
> > >only for this hiper API.
> > >
> > >The API I gave was complete. You open a hiper handle and push your easy
> > >handles into it. The hiper handle would open one or more threads in the
> > >background and callback when something significant happens to an easy
> > >handle. Simple as that. If you wanted to wait until all the easy handles
> > >are finished, you can call the wait function.
> > (I see a problem to merge a Windows-optimized concept with the currently
> > planned event-optimized concept...)
> I don't think it's a Windows-optimised concept, particularly. Maybe
> Windows provides an implementation? The same concept is applicable to
> any modern OS which has threads.
> > If that is what you want (using a thread for each transfer), won't it
> > suffice to simply start a new thread and fire off a separate
> > curl_easy_perform() in there? In what way would this suggested interface be
> > an improvement to that?
> I'd be very surprised if a thread per request was intended. That
> would indeed be very resource-intensive.
This would use at most as many threads as there are CPUs, and maybe
slightly more if blocking occurs in one of the threads.
> > HTTP pipelining might be hard to add nicely for such a use case.
> > >If hiper is meant to abstract away all the stuff needed for high
> > >performance http, it should also be in charge of threading efficiently as
> > >needed.
> > I want libcurl to abstract away all protocol and transfer related matters.
> > I want it to know as little as possible about event systems and threading
> > models.
> I agree. That's a good plan.
> I might have misunderstood the grandparent-post's proposals, adding
> instead my own flavour. :-) (I am writing a library of the type
> described above - I'm allowed to be biased!).
> We've discussed before what kind of API is needed to wait on large
> numbers of file descriptors - a "scalable" method. We did only look
> at the case of a single thread.
> Such applications can use multiple threads, simply by submitting
> requests in different threads, and each thread runs independently.
> That would work fine, even on Windows.
> But I think, off the top of my head, that the essential API feature
> we're now talking about is OS-supported methods of _automatically_
> distributing work across threads - instead of requiring the Curl-using
> application to distribute requests itself.
> That's really an advanced feature that would rarely be used, I think.
> However, for the biggest data-moving applications, I suspect it would
> be a performance enhancement. It's very hard to know, without trying
> it. The theoretical corner-case gains might be swamped, in all
> practical scenarios, by the overheads of additional locking that come
> with it.
> I think such fancy event/thread scheduler should not be part of Curl;
> it should be a project in its own right. (One I've heard of for unix
> is called libasync-smp. I'm slowly working on one myself, too).
It is easy to do (on Windows, at least), and would help users a lot.
If it's not in Curl, we lose the ability to easily write
cross-platform applications that work on all the major OSes.
Not having it in Curl would also mean we lose efficiency. By not
being able to delegate which type of thread (I/O or non-I/O) to begin
a request in, Curl would be forcing the user to never destroy any of
the threads that Curl has touched, for fear of canceling I/O. We
would no longer be able to intelligently pool threads by how heavy the
CPUs aren't getting any faster, but cores are being added on. The
concept of multi-threaded coding needs to get a lot more popular, and
this is a prime example of something that would really benefit from
> But to support it, the question is whether Curl's API would be better
> adapted so that it _could_ work with something like that - something
> where events that are handled in one thread may migrate to other
> threads.
> I think that comes down to thinking about what locking and shared data
> structures are used in Curl, and then specifying rules that say which
> of the file descriptor event handlers can be called in different
> threads than the requests which originated them, and what the
> non-concurrency groups are.
> libasync-smp's approach to that is for each event handler to have a
> "colour", which is a single integer. Handlers with the same colour
> will not be run concurrently, and all handlers are assigned a default
> colour of zero - or alternatively a colour corresponding to the thread
> that originated the request. The event handlers can then change their
> own colours, e.g. to a colour corresponding to a file descriptor or
> something, if they implement sufficient locking that running them on
> non-home threads is safe.
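The colour rule quoted above can be sketched as a small dispatcher.
This is not libasync-smp's actual API: every name here is hypothetical,
and the model is single-threaded bookkeeping only. A real scheduler
would guard this state with its own lock and remove finished events
from the queue.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_EVENTS  16
#define MAX_COLOURS 8

struct event {
    int id;      /* illustrative event identifier */
    int colour;  /* handlers sharing a colour must not overlap */
    bool taken;  /* already handed to a worker */
};

struct colour_sched {
    struct event queue[MAX_EVENTS];
    size_t count;
    bool busy[MAX_COLOURS];  /* busy[c]: a colour-c handler is running */
};

static void sched_init(struct colour_sched *s)
{
    memset(s, 0, sizeof *s);
}

/* Queue an event; returns false if the queue is full or the colour is
 * out of range. */
static bool sched_post(struct colour_sched *s, int id, int colour)
{
    if (s->count == MAX_EVENTS || colour < 0 || colour >= MAX_COLOURS)
        return false;
    s->queue[s->count].id = id;
    s->queue[s->count].colour = colour;
    s->queue[s->count].taken = false;
    s->count++;
    return true;
}

/* A worker asks for work: hand out the oldest pending event whose
 * colour is idle, or NULL if every pending colour is busy. */
static struct event *sched_take(struct colour_sched *s)
{
    for (size_t i = 0; i < s->count; i++) {
        struct event *e = &s->queue[i];
        if (!e->taken && !s->busy[e->colour]) {
            e->taken = true;
            s->busy[e->colour] = true;
            return e;
        }
    }
    return NULL;
}

/* The worker finished the handler: its colour becomes available again. */
static void sched_done(struct colour_sched *s, const struct event *e)
{
    assert(s->busy[e->colour]);
    s->busy[e->colour] = false;
}
```

Giving every handler for one request the same colour has the same
effect as a per-request lock, except that a worker skips to other work
rather than waiting.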
> There is also something called "gang scheduling" which I'll not
> describe in detail. This is where event handlers indicate their
> similarity of code/data cache usage, so they can be scheduled in a way
> which groups similar uses in sequence.
As I said above, it would be trivial to add a single lock to each easy
handle. No fancy scheduling needed.
> This is all complex and great stuff for high performance web servers
> that do lots of complicated processing.
> But is it worth exploring this level of detail for Curl, or would
> simply implementing scalable file descriptor events within each thread
> be quite enough for realistic applications of Curl?
This is simple to do for Windows. There really is no reason not to.
I'm not familiar with kqueue etc but I doubt it would be very complex
for them either.
> -- Jamie
-- Cory Nelson
http://www.int64.org
Received on 2005-11-09