
LRU connection pooling can result in uneven load

From: Jeffrey Tolar via curl-library <curl-library_at_lists.haxx.se>
Date: Thu, 16 Sep 2021 16:52:45 -0700

Hi! Thanks for writing and maintaining curl.

I think this falls in between a bug report and a feature request; we're
currently using 7.72.0.

Our application uses the multi interface to regularly send batches of
(HTTP/1.1) requests, all to the same hostname. These are routed (via a
network load balancer) to one of many backend servers. The backend
servers don't close connections, and we don't explicitly limit the
number of connections either (although we do limit the number of
concurrent requests, which effectively caps the maximum number of open
connections).

This has worked well for years; recently, however, we've noticed an
interesting behavior when one of the backend servers starts exhibiting
consistently higher latencies.

The basic skeleton of our application looks something like (error
checking and option-setting elided):

   multi = curl_multi_init();
   while (true) {
     // submit requests
     for (i = 0; i < MAX_CONCURRENT_REQUEST; i++) {
       curl_multi_add_handle(multi, get_request());
     }

     do {
       select(...);
       curl_multi_perform(multi, &running_handles);

       while ((m = curl_multi_info_read(multi, ...))) {
         if (m->msg == CURLMSG_DONE) {
           // process the response, then:
           curl_multi_remove_handle(multi, m->easy_handle);
         }
       }
     } while (running_handles > 0);
   }

Namely, we queue up a bunch of requests, start the transfers at the same
time, and then remove the easy handles as responses come back.

Removing the easy handle causes the number of handles registered with
the multi to decrease; since we don't set CURLMOPT_MAXCONNECTS, the size
of the shared connection pool (4 times the number of easy handles) will
also shrink.

Say MAX_CONCURRENT_REQUEST is 10:

     1. 10 requests are fired; the connection pool can contain up to 40
        connections, although only 10 are used.
     2. Requests 1-8 come back; they cause the capacity of the connection
        pool to reduce to 36, 32, ..., 12.
     3. Request 9 comes back; it reduces the capacity to 8, so a
        connection is closed: there are 9 connections open now.
     4. Request 10 comes back; it reduces the capacity to 4, so another
        connection is closed; there are 8 connections now.

Say there are 10 backend instances, and on the first iteration, each
connection ends up getting routed to a different backend. If one backend
has higher latency than the others, it will be the last request to
complete, which means the corresponding connection will be the
most-recently-used.

When the pool exceeds capacity and curl is deciding which connection to
close, it picks the least-recently-used connection. In the example
above, that will be the connection that had the best latency.

On the second iteration, it will reuse the 8 (1 slow and 7 fast)
connections and open 2 new ones. There's a 1/10 chance for each of those
to get routed to the slow backend. If they are, then those requests will
finish later and be considered more recently used, and so escape the
pruning. If they aren't, they'll potentially be closed when the pool
shrinks at the end of the batch.

After a few iterations of this, most (or all) of the 8 reused
connections will be to the slow host, since the slow requests make the
connections more recently used.
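
To make that concrete, my mental model of the pruning step is roughly
the following (just an illustration of the LRU selection I'm
describing, not libcurl's actual code; the struct and field names are
made up):

   #include <stddef.h>

   /* Hypothetical pool entry: just enough state to show the selection. */
   struct pool_entry {
     int in_use;       /* currently attached to an active transfer? */
     double lastused;  /* when its last response finished */
   };

   /* LRU pruning: the idle entry with the oldest lastused is closed.
    * In our batches the slow backend's connection always finishes last,
    * so it always has the newest lastused and is never the one picked. */
   static struct pool_entry *pick_victim(struct pool_entry *pool, size_t n)
   {
     struct pool_entry *oldest = NULL;
     for (size_t i = 0; i < n; i++) {
       if (!pool[i].in_use &&
           (!oldest || pool[i].lastused < oldest->lastused))
         oldest = &pool[i];
     }
     return oldest;
   }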


There are a few solutions we've considered:

1. The ideal solution would be for the server to help balance the load
    by periodically closing connections based on some criteria; however,
    we don't control the backend in this case, so we'd like to find a
    client-side solution.
2. Set CURLMOPT_MAXCONNECTS >= MAX_CONCURRENT_REQUEST (see the sketch
    after this list); this means we won't ever close a connection. That
    may be workable, but it means we're stuck with the load-balancing
    that happened during startup.
3. Fine-tune the timeouts so that requests to a slower backend are
    aborted; this potentially requires continual tuning of that value.
    (In our experiments, this behavior presented itself even with
    subsecond latencies - I believe the only triggering condition is a
    backend with consistently higher latency than the others.)
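
For reference, option 2 would amount to something like this (a minimal
sketch, assuming it's acceptable to keep MAX_CONCURRENT_REQUEST
connections open indefinitely):

   /* Pin the pool size so it never shrinks below the number of
    * concurrent transfers; libcurl then stops closing connections as
    * handles are removed - but an unlucky slow connection also never
    * gets replaced. */
   multi = curl_multi_init();
   curl_multi_setopt(multi, CURLMOPT_MAXCONNECTS,
                     (long)MAX_CONCURRENT_REQUEST);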

I think ideally, we'd be able to ask curl to close the connections based
on the creation time (or perhaps request count). It looks like curl
keeps track of the connection establishment time, but it's only
referenced a few times in telnet.c and hostip.c.

I skimmed the changelog since 7.72 and the TODO, and didn't see anything
that looked related. "1.15 Monitor connections in the connection pool"
and "1.23 Offer API to flush the connection pool" are close, but I think
solve slightly different problems.

Any thoughts on a new option to tweak the pooling behavior? Perhaps a
CURLOPT_MAXLIFETIME_CONN setting, or a parameter for TODO 1.23 to be
able to selectively flush connections from the pool?
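
To sketch the usage I'm imagining (CURLOPT_MAXLIFETIME_CONN is the
hypothetical option proposed above, not something that exists today;
the URL and lifetime are placeholders):

   /* Hypothetical: stop reusing a connection once it has been open for
    * 300 seconds, so every connection eventually gets re-established
    * and re-balanced by the load balancer. */
   easy = curl_easy_init();
   curl_easy_setopt(easy, CURLOPT_URL, "https://backend.example/");
   curl_easy_setopt(easy, CURLOPT_MAXLIFETIME_CONN, 300L); /* proposed */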

This was a long mail - thanks for making it through to the bottom!

--
Jeffrey Tolar