Re: Avoiding creation of multiple connections to an http capable server before CURLMOPT_MAX_PIPELINE_LENGTH is reached.

From: Daniel Stenberg <daniel_at_haxx.se>
Date: Fri, 31 Oct 2014 12:01:09 +0100 (CET)

On Fri, 31 Oct 2014, Carlo Wood wrote:

> - An application is using a single multi handle and makes many connections
> to many servers. Some servers are capable of http pipelining, others are
> not.

Pipelining is a mess, so yes: unless you're working with a controlled set of
servers that you know, there will be servers that don't support it properly. It
also reminds me that not everyone reading this might be aware of this
document: https://tools.ietf.org/html/draft-nottingham-http-pipeline-01

Pipelining in libcurl is only used by a small fraction of users, which explains
why there are still nasty bugs lurking in there and why there are pipelining
use case setups that may not be possible to do using the existing options.

The future of pipelining, technology wise, really is HTTP/2's multiplexed
streams, which will be a much better way to fully utilize a few connections to
get many concurrent objects and streams, with smaller latency penalties and no
head of line blocking. We don't have any support for multiplexed HTTP/2 in
libcurl yet though; it is still only on my TODO list.

> - When connecting to a pipelining capable server, the application wants to
> use a CURLMOPT_MAX_PIPELINE_LENGTH of (say) 32. It is very undesirable
> however to create more than (say) 2 parallel connections.

Let me just insert here that this describes your use case. Other users will
have different priorities. With fewer connections you suffer more from head of
line blocking, and the initial time to receive data can be longer. That's also
why you see browsers with pipelining enabled still use more than two
connections per host.

> Under these circumstances, the multihandle option
> CURLMOPT_MAX_HOST_CONNECTIONS can not be used to limit the number of
> connections to the http pipelining capable server to 2 because many other
> connections (to the non-http pipelining servers) require to make more than 2
> connections.

Ah yes. This setting was introduced when someone wanted a more fixed limit,
independent of whether pipelining works against the servers or not. Limiting
connections per host is also a way to follow the HTTP spec and to keep traffic
to a single host from "starving out" communication with other hosts by its
sheer volume.

Already in the past I've pondered whether it would be better to somehow have a
callback or similar that gets a lot of info and decides the action in several
of these cases, as just adding more and more options also makes things very
complicated and hard to use - and tricky to understand how the options work
together.

CURLMOPT_MAX_HOST_PIPELINED_CONNECTIONS feels a bit... long. =) It is also a
bit hard to explain exactly what it limits and when each option is in use.

> If a new request is added, and CURLMOPT_MAX_PIPELINE_LENGTH has not been
> reached yet then the new request is added to the existing connection. Only
> when more than CURLMOPT_MAX_PIPELINE_LENGTH requests are already in the
> pipeline, a new connection is created.

If depth goes before breadth, yes.

> Under that algorithm, the application still has a good control over the
> number of connections.

It already had pretty good control of the number of connections, as it could
set a max (both total and per host). You're not increasing the control here,
you're changing the ways to control it.

> If -say- it wants to use the benefit of pipelining, which means only using a
> single socket (connection), by limiting the number of connections to only
> ONE connection, then all it has to take care of is to never add more than
> CURLMOPT_MAX_PIPELINE_LENGTH easy handles to the multi handle at the same
> time.

That would be very inconvenient for most applications though, as they normally
don't even care or know which URLs may end up on the same pipeline (or not).
Also, that would force them to drain the pipeline first before they can add
more transfers, which is inefficient.

> It can be considered a bug, or at the very least is very inconvenient for
> the user and therefore undesirable fact that an attempt to limit the number
> of connections this way currently DOES NOT WORK.

The application can control the maximum number of connections to each host and
the maximum number of connections used in total. That's on purpose and by
design. If they don't work then we have a bug.

If you're saying that you can't explicitly force transfers to a specific host
to all use a single connection (unless you use
CURLMOPT_MAX_HOST_CONNECTIONS == 1), then that is correct too, and that is
because libcurl is transfer-based and does things exactly when you ask it to -
so if you ask for a transfer to start, it tries to do the transfer now rather
than potentially waiting for an existing connection that could be reused
later.

> The internal http pipelining algorithm does exactly this however; and hence
> it DOES work-- but only as long as libcurl *knows* that a certain connection
> is one that supports http pipelining.

Yes, queueing up requests to be done on a connection that may not support
pipelining is a (potential) performance drop.

> 2) For several different reasons, it can happen that an existing connection,
> which libcurl knows that supports pipelining, is closed. So far I know of
> three reasons why this happens: a) a connection is reset by peer (resulting
> in a "retry"),

This is only retried if that reset happens immediately after a request has
been issued, since that is then usually the result of a persistent re-used
connection having been closed by the other end. A connection that just gets
reset somewhere in the middle will not be "retried" in any way but is a mere
transfer failure.

> In all cases, since there is only a single connection to that server (as
> desired)

As desired in your case. Pipelining users I've talked to before used
pipelining to maximize HTTP transfer performance, and that usually means more
than one connection.

> closing such a connection currently causes the bundle for that site to be
> deleted, and hence for libcurl to forget that that site is http pipeline
> capable. After this, since in all cases the "pipe broke", all not-timed-out
> requests in the pipe are being internally retried, all instantly-- before
> the server replies to the first-- hence each resulting in a new connection
> and the length of the pipe (CURLMOPT_MAX_PIPELINE_LENGTH) is converted into
> CURLMOPT_MAX_PIPELINE_LENGTH parallel connections!

Oh right. With the minor note that CURLMOPT_MAX_HOST_CONNECTIONS could still
limit the damage somewhat.

Having the knowledge of a host's pipelining capability dumped at the same time
we kick out the connection is a pretty severe blow. It should really be kept
in a separate cache with a much longer lifetime, so that repeated connections
to the same hosts would have that knowledge immediately.
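Such a capability cache could be as simple as a host-keyed table with its own
expiry, independent of the connection cache. A rough sketch (all names, the
table size and the TTL are made up for illustration; this is not libcurl
internals):

```c
#include <string.h>
#include <time.h>

#define CACHE_SIZE 64
#define CACHE_TTL  (30 * 60)  /* keep the knowledge for 30 minutes */

struct pipe_cap {
    char   host[256];
    int    supports_pipelining;  /* 1 = yes, 0 = no */
    time_t learned_at;
};

static struct pipe_cap cache[CACHE_SIZE];

static unsigned host_hash(const char *host)
{
    unsigned h = 0;
    for (; *host; host++)
        h = h * 31 + (unsigned char)*host;
    return h % CACHE_SIZE;
}

/* record what we learned from a server's response */
void cap_store(const char *host, int supported)
{
    struct pipe_cap *e = &cache[host_hash(host)];
    strncpy(e->host, host, sizeof(e->host) - 1);
    e->host[sizeof(e->host) - 1] = '\0';
    e->supports_pipelining = supported;
    e->learned_at = time(NULL);
}

/* -1 = unknown; otherwise the cached answer, if still fresh */
int cap_lookup(const char *host)
{
    struct pipe_cap *e = &cache[host_hash(host)];
    if (strcmp(e->host, host) != 0)
        return -1;
    if (time(NULL) - e->learned_at > CACHE_TTL)
        return -1;  /* stale: forget and re-probe */
    return e->supports_pipelining;
}
```

The key point is only the lifetime: the entry survives any individual
connection being closed, so a broken pipe doesn't erase what we learned about
the host.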

> When a server connection is made and it is not yet known if that
> connection supports pipelining or not then there are the following
> possibilities:
>
> 1) Libcurl creates a new connection for every new request (this
> is the current behavior).

At least until it hits a max limit.

> 2) Libcurl creates a single connection, and queues all other requests until
> it received all headers for that first request and decided if the server
> supports pipelining or not. If it does, it uses the same connection for the
> the queued requests. If it does not, then it creates new connections for
> each request as under 1).

So: If there's a transfer going on for the same host but we don't know
pipelining capability for the connection yet, queue up the transfer until we
get the answer?
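That queue-until-we-know behavior could be sketched as a small per-host state
machine. Everything below is illustrative, not actual libcurl logic:

```c
/* What we currently know about a host's pipelining support. */
enum host_state { UNKNOWN, PROBING, PIPELINE_OK, PIPELINE_NO };

/* What to do with a newly added transfer to that host. */
enum action { SEND_NOW, QUEUE, NEW_CONNECTION };

/* Decide what to do with a new request, given the host's state. */
enum action dispatch(enum host_state *state)
{
    switch (*state) {
    case UNKNOWN:
        /* the first request becomes the probe */
        *state = PROBING;
        return SEND_NOW;
    case PROBING:
        /* hold everything else until the probe's headers arrive */
        return QUEUE;
    case PIPELINE_OK:
        /* add to the existing connection's pipeline */
        return SEND_NOW;
    case PIPELINE_NO:
    default:
        /* fall back to option 1: a new connection per request */
        return NEW_CONNECTION;
    }
}

/* Called once the probe's response headers have been parsed. */
void probe_done(enum host_state *state, int server_pipelines)
{
    *state = server_pipelines ? PIPELINE_OK : PIPELINE_NO;
}
```

The cost is visible in the PROBING state: every transfer added before the
first response is held back, which is exactly the latency trade-off discussed
below.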

> 3) Libcurl assumes every connection can do pipelining and just sends all
> requests over the first connection, until it finds out that this server
> can't do pipelining - in which case the pipe 'breaks' and already existing
> code will cause all requests to be retried - now knowing that the server
> does not support pipelining.

That could mean a rather hefty performance blow in some situations. So again
it boils down to what your priorities are: fewer connections, or faster (lower
latency) transfers.

> 4) Libcurl doesn't wait for headers but uses information from the user to
> decide if the server is supposed to support http pipelining; meaning that it
> picks strategy 2) or 3) based on a flag set on the easy handle by the user.

But what are the odds of the application knowing that on a per-URL basis? Or a
per-easy-handle basis? It feels more like a policy that you want to set
globally: assume pipelining works, or assume it does not.

> I think that option 4 is by far the best experience for the user; however it
> is the hardest to implement because it requires to implement both 2) and 3)
> as well as add the code to add a new flag.

Yes. And ideally test cases too, for stuff that is already hard to test even
before this, since the nature of pipelining is so timing sensitive.

> The easiest solution is probably to always pick 3)

...

> Hence, IF the choice is to not add a new flag to easy handles, so the user
> can specify preferences on a per easy handle case (as opposed to being
> forced to use a dedicated multi handle for pipelining) then option 2) seems
> the only reasonable choice.

Option 2 cannot be made the single static behavior, no. That's completely out
of the question.

> Note that just option 2) (instead of 4) isn't that bad:

For your use case, with working pipelining, no.

Assuming pipelining by default and queuing up will be (potentially much) worse
in the cases where pipelining is in fact not supported by the server since
then you've added quite some time until you start trying transfer 2.

It is certainly good for the case where pipelining works on the host and you
insist on a single connection. If you would like more than one pipeline to the
host, you probably prefer to have that set up for transfer 2 anyway, and then
option 2) doesn't seem ideal either.

> Please let me know asap - as I work on this full time every day :p The
> sooner I get a reply the less time I will be wasting :/

Lovely, but I don't do this full-time and these are rather tricky questions so
I'm not always very fast to respond, but I will certainly try my best to not
be a road block!

-- 
  / daniel.haxx.se
-------------------------------------------------------------------
List admin: http://cool.haxx.se/list/listinfo/curl-library
Etiquette:  http://curl.haxx.se/mail/etiquette.html