curl-library
Avoiding creation of multiple connections to an http capable server before CURLMOPT_MAX_PIPELINE_LENGTH is reached.
Date: Fri, 31 Oct 2014 02:32:11 +0100
I know this is a long mail, but Daniel requested me to start
a discussion on this mailinglist, so that a consensus can be reached on
how to tackle an existing problem with http pipelining.
So, if you have anything to do with http pipelining, please bear with
me ;).
Note that English is not my first language. If the text below sounds
awkward at times, then that is the reason. I assure you that I'm a very
capable coder with decades of networking experience, despite that
fact, and hope you are willing to overlook this and still take my post
seriously.
DESCRIPTION OF THE PROBLEM
--------------------------
Consider the following (real life) scenario:
- An application is using a single multi handle and makes many
connections to many servers. Some servers are capable of http
pipelining, others are not.
- When connecting to a pipelining capable server, the application
wants to use a CURLMOPT_MAX_PIPELINE_LENGTH of (say) 32. It is very
undesirable however to create more than (say) 2 parallel connections.
Under these circumstances, the multi handle option
CURLMOPT_MAX_HOST_CONNECTIONS cannot be used to limit the number
of connections to the http pipelining capable server to 2, because
the many other servers (the ones that do not support http pipelining)
require more than 2 connections.
Under these circumstances it seems logical that libcurl would provide
the following strategy:
If a new request is added and CURLMOPT_MAX_PIPELINE_LENGTH has not
been reached yet, then the new request is added to the existing
connection. Only when CURLMOPT_MAX_PIPELINE_LENGTH requests are
already in the pipeline is a new connection created.
Under that algorithm, the application still has good control over
the number of connections. If, say, it wants the full benefit of
pipelining, which means using only a single socket (connection),
then all it has to take care of is to never add more than
CURLMOPT_MAX_PIPELINE_LENGTH easy handles to the multi handle
at the same time.
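
The strategy above can be sketched as a small stand-alone C model. This
is only an illustration of the proposed behaviour, not libcurl code: the
types `struct conn` and `struct host`, the function `add_request` and the
cap `MAX_CONNS` are all hypothetical names made up for this sketch, and
`max_pipeline_length` merely plays the role of
CURLMOPT_MAX_PIPELINE_LENGTH.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CONNS 8  /* arbitrary cap, only for this sketch */

/* Hypothetical model of one connection to a pipelining capable host. */
struct conn {
    size_t in_flight;            /* requests currently in this pipeline */
};

/* Hypothetical model of all connections to a single host. */
struct host {
    struct conn conns[MAX_CONNS];
    size_t num_conns;
    size_t max_pipeline_length;  /* role of CURLMOPT_MAX_PIPELINE_LENGTH */
};

/* Put a new request on the first connection whose pipeline is not full.
 * Open a new connection only when every existing pipeline is full.
 * Returns the index of the connection used, or -1 if MAX_CONNS is hit. */
int add_request(struct host *h)
{
    assert(h != NULL && h->max_pipeline_length > 0);

    for (size_t i = 0; i < h->num_conns; i++) {
        if (h->conns[i].in_flight < h->max_pipeline_length) {
            h->conns[i].in_flight++;
            return (int)i;
        }
    }

    if (h->num_conns == MAX_CONNS)
        return -1;

    /* All pipelines are full (or none exists yet): open a new one. */
    size_t idx = h->num_conns;
    h->conns[idx].in_flight = 1;
    h->num_conns = idx + 1;
    return (int)idx;
}
```

With `max_pipeline_length` set to 32, the first 32 calls all land on
connection 0 and only the 33rd call opens connection 1, which is the
control over the connection count that the application wants.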
It can be considered a bug, or at the very least a serious
inconvenience for the user, that an attempt to limit the number of
connections this way currently DOES NOT WORK.
The internal http pipelining algorithm does exactly this, however, and
hence it DOES work, but only as long as libcurl *knows* that a certain
connection is one that supports http pipelining.
Hence, there are basically two types of cases where this fails:
1) Upon the first, initial request, libcurl does not know if the server
supports http pipelining or not, and its current algorithm is to
create a NEW connection for every request: while the user is able to
specify that they want to use keep-alive, they are not able to specify
that they want pipelining.
For this there exists a workaround, so the problem is not that severe:
an application can limit the number of added easy handles to one until
it has received a reply from the server. At that point libcurl should know
if the server supports http pipelining or not and its algorithm will
change into the one described above. Hence, after receiving one reply
the application can switch to adding a maximum of
CURLMOPT_MAX_PIPELINE_LENGTH active easy handles to the multi handle, as
described above.
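
A minimal sketch of that application-side workaround, in plain C
without any libcurl calls. The names `struct throttle`,
`throttle_try_add` and `throttle_on_reply` are hypothetical, and the
sketch assumes that the first completed reply is enough for libcurl to
know whether the server pipelines. It does not handle case 2) below,
where libcurl forgets that knowledge again.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-server bookkeeping kept by the application. */
struct throttle {
    size_t active;         /* easy handles currently added for this server */
    size_t max_pipeline;   /* mirrors CURLMOPT_MAX_PIPELINE_LENGTH */
    bool got_first_reply;  /* set once any reply from the server arrived */
};

/* How many easy handles may be active right now: one until the first
 * reply has been received, max_pipeline afterwards. */
size_t throttle_limit(const struct throttle *t)
{
    return t->got_first_reply ? t->max_pipeline : 1;
}

/* Returns true if the application may add another easy handle now;
 * in that case the handle is counted as active. */
bool throttle_try_add(struct throttle *t)
{
    if (t->active >= throttle_limit(t))
        return false;
    t->active++;
    return true;
}

/* Call when a transfer for this server finished with a reply. */
void throttle_on_reply(struct throttle *t)
{
    assert(t->active > 0);
    t->active--;
    t->got_first_reply = true;
}
```

The application would call `throttle_try_add` before adding an easy
handle to the multi handle, and keep the request in its own queue when
the call returns false.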
2) For several different reasons, it can happen that an existing
connection, which libcurl knows to support pipelining, is
closed. So far I know of three reasons why this happens: a) the connection
is reset by the peer (resulting in a "retry"), b) a request times out (due
to CURLOPT_TIMEOUT), or c) the connection cache is full and a pipeline
connection is picked to be closed.
In all cases, since there is only a single connection to that server
(as desired), closing such a connection currently causes the bundle for
that site to be deleted, and hence libcurl forgets that the
site is http pipeline capable. After this, since in all cases the "pipe
broke", all requests in the pipe that have not timed out are internally
retried, all at once and before the server has replied to the first.
Each retry therefore results in a new connection, and the length of the
pipe (CURLMOPT_MAX_PIPELINE_LENGTH) is converted into
CURLMOPT_MAX_PIPELINE_LENGTH parallel connections!
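
To make the effect concrete, here is a deliberately simplified model of
the retry behaviour described above. It is not taken from libcurl; the
function `connections_after_retry` and its parameters are hypothetical
and only count how many connections the retried requests would open.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Number of connections opened when `pending` requests are retried at
 * once. If the pipelining capability of the site has been forgotten,
 * every retried request gets its own connection. If it is still known,
 * the requests are packed into pipelines of at most `max_pipeline`. */
size_t connections_after_retry(size_t pending, size_t max_pipeline,
                               bool capability_known)
{
    assert(max_pipeline > 0);

    if (!capability_known)
        return pending;  /* one new connection per retried request */

    /* Round up: a partially filled pipeline still needs a connection. */
    return (pending + max_pipeline - 1) / max_pipeline;
}
```

With a full pipe of 32 requests, the model gives 32 connections when
the bundle was deleted and 1 connection when the capability is still
known.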
For me this is very undesirable: the result is that the server
that I connect to gets very mad at me and stops replying completely
without closing the sockets... I suppose it's a defense against DoS or
whatever. The result in any case is that all transfers halt completely.
ANALYSIS OF OPTIONS AND PROPOSAL
--------------------------------
When a connection to a server is made and it is not yet known whether
that connection supports pipelining, there are the following
possibilities:
1) Libcurl creates a new connection for every new request (this
is the current behavior).
2) Libcurl creates a single connection, and queues all other
requests until it has received all headers for that first request
and decided if the server supports pipelining or not.
If it does, it uses the same connection for the queued
requests. If it does not, then it creates new connections
for each request as under 1).
3) Libcurl assumes every connection can do pipelining and just
sends all requests over the first connection, until it finds
out that this server can't do pipelining - in which case the
pipe 'breaks' and already existing code will cause all requests
to be retried - now knowing that the server does not support
pipelining.
4) Libcurl doesn't wait for headers but uses information from
the user to decide if the server is supposed to support
http pipelining; meaning that it picks strategy 2) or 3) based
on a flag set on the easy handle by the user.
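
The four possibilities can be compared side by side in a small decision
function. This is a sketch with hypothetical names (`enum option`,
`enum capability`, `enum action`, `decide`), not an existing libcurl
interface. It ignores the pipeline length limit, and for option 4 it
assumes that a set user flag selects strategy 3) and a cleared flag
selects strategy 2).

```c
#include <assert.h>
#include <stdbool.h>

/* What is known about the server at the time a request is added. */
enum capability { CAP_UNKNOWN, CAP_PIPELINING, CAP_NO_PIPELINING };

/* The four possibilities from the list above. */
enum option {
    OPT_1_ALWAYS_NEW = 1,  /* new connection for every request */
    OPT_2_QUEUE,           /* queue until the first headers arrived */
    OPT_3_ASSUME,          /* assume pipelining, retry if the pipe breaks */
    OPT_4_USER_FLAG        /* pick 2) or 3) based on a user flag */
};

enum action { ACT_NEW_CONNECTION, ACT_USE_EXISTING, ACT_QUEUE };

/* Decide what to do with a newly added request. */
enum action decide(enum option opt, enum capability cap,
                   bool have_connection, bool user_expects_pipelining)
{
    /* Without a connection, or without pipelining, open a new one. */
    if (cap == CAP_NO_PIPELINING || !have_connection)
        return ACT_NEW_CONNECTION;

    /* Known to pipeline: reuse the existing connection. */
    if (cap == CAP_PIPELINING)
        return ACT_USE_EXISTING;

    /* Capability unknown while a first connection already exists. */
    switch (opt) {
    case OPT_1_ALWAYS_NEW:
        return ACT_NEW_CONNECTION;
    case OPT_2_QUEUE:
        return ACT_QUEUE;
    case OPT_3_ASSUME:
        return ACT_USE_EXISTING;
    case OPT_4_USER_FLAG:
        return user_expects_pipelining ? ACT_USE_EXISTING : ACT_QUEUE;
    }

    assert(!"unknown option");
    return ACT_NEW_CONNECTION;
}
```

The options only differ in the "capability unknown" branch; once the
capability is known, all four behave the same.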
I think that option 4 is by far the best experience for the user;
however, it is the hardest to implement because it requires
implementing both 2) and 3) as well as adding the code for a new flag.
The easiest solution is probably to always pick 3), but that is
almost unacceptably 'dirty' for those connections that do not
support pipelining. My current approach (for the user application)
is to add only a single request and let that finish, so the application
itself can detect if pipelining is supported or not, and if not,
add that site to the blacklist. Doing that, option 3) would not
lead to 'dirty' failures, but I can hardly say that that seems like
a good general solution; it would be far more favorable to add
a flag for easy handles than to "abuse" the blacklist like I do now,
which is merely a kludge to work around the problem that I can't
specify on a per easy handle basis if a server is expected to be
http pipeline capable or not.
Hence, IF the choice is to not add a new flag to easy handles (one that
lets the user specify preferences on a per easy handle basis, as opposed
to being forced to use a dedicated multi handle for pipelining), then
option 2) seems the only reasonable choice.
Note that just option 2) (instead of 4) isn't that bad: if pipelining
is supported and the first request would cause a stall of, say,
10 seconds, then also in that case all requests added behind it
in the pipeline would be stalled ("queued" on the server, instead
of client-side in the application). The main difference is that if
pipelining is NOT supported then you don't create a lot of parallel
connections right away, and hence there is an extra delay more or
less equal to the response time of the first request (typically less
than a second) for the additional requests. This delay then occurs
every time that libcurl destroys the bundle structure (i.e., when ALL
connections are closed). I don't see that as something bad: if all
connections are closed then apparently we're done with downloading
the bulk, and getting an extra sub-second delay on the next burst seems
very acceptable to me. Of course we only need to do that when the
multi handle is flagged to support pipelining in the first place.
If there are no objections then I will continue to implement
option 2), after which I'll decide if I want to go ahead and do 4 (and 3)
as well. I'd prefer to get some feedback from the appropriate devs
before I do all this work, however, in case it turns out that people
have better ideas ;).
Please let me know asap - as I work on this full time every day :p
The sooner I get a reply the less time I will be wasting :/
-- Carlo Wood <carlo_at_alinoe.com>
-------------------------------------------------------------------
List admin: http://cool.haxx.se/list/listinfo/curl-library
Etiquette: http://curl.haxx.se/mail/etiquette.html
Received on 2014-10-31