
parallel h2 requests can occasionally result in timeouts #4043

Closed
TvdW opened this issue Jun 19, 2019 · 49 comments


TvdW commented Jun 19, 2019

It's not entirely clear to me what's happening here, and I haven't yet been able to reproduce this outside of my application, though I can currently reproduce it by running a loop that does two identical requests at a time. Eventually (after ~100 requests), one of them times out.

I'm filing this issue in the hope that someone recognizes the symptoms or the cause, even though I know it can't be reproduced in its current form.

* Found bundle for host 127.0.0.1: 0x7f60e2fba4f0 [can multiplex]
* Re-using existing connection! (#0) with host 127.0.0.1
* Connected to 127.0.0.1 (127.0.0.1) port 1234 (#0)
* Using Stream ID: 125 (easy handle 0x7f60e03ce800)
> POST /retrieve HTTP/2
[snip, irrelevant headers]
Content-Length: 115

* We are completely uploaded and fine
* Found bundle for host 127.0.0.1: 0x7f60e2fba4f0 [can multiplex]
* Multiplexed connection found!
* Found connection 0, with 1 requests on it
* Re-using existing connection! (#0) with host 127.0.0.1
* Using Stream ID: 127 (easy handle 0x7f60e03cd000)
> POST / HTTP/2
[snip, irrelevant headers]
Content-Length: 115

* We are completely uploaded and fine
< HTTP/2 200
< server: [snip]
< date: Wed, 19 Jun 2019 09:48:27 GMT
< strict-transport-security: max-age=17280000
< x-[snip]: [snip]
<
< HTTP/2 200
< server: [snip]
< date: Wed, 19 Jun 2019 09:48:27 GMT
< strict-transport-security: max-age=17280000
< x-[snip]: [snip]
<
* Operation timed out after 1000 milliseconds with 806 bytes received
* stopped the pause stream!
* Connection #0 to host 127.0.0.1 left intact

What stands out to me is the "with 806 bytes received", because that's exactly how many bytes are in the response, so the response is complete. There's no Content-Length header, and I suspect that's related: maybe curl is not correctly handling the end of a stream.

Running the exact same requests over http/1.1 makes the issue go away.

CURLOPT_HTTP_VERSION => CURL_HTTP_VERSION_2_PRIOR_KNOWLEDGE
CURLOPT_HTTPHEADER => [ snip, nothing relevant here ]
CURLOPT_POST => 1
CURLOPT_POSTFIELDS => "[ snip, 115 bytes of body to make the server respond with something useful ]"
CURLOPT_TIMEOUT_MS => 1000
CURLOPT_URL => "http://127.0.0.1:1234/"
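
For reference, here's roughly the same option set as a standalone libcurl C program (a minimal sketch: the body and URL are placeholders rather than the real values, and a single transfer like this won't trigger the bug, which needs two requests in parallel):

#include <curl/curl.h>

int main(void)
{
  /* minimal sketch of the reported option set; body and URL are
     placeholders, not the reporter's real values */
  curl_global_init(CURL_GLOBAL_DEFAULT);
  CURL *easy = curl_easy_init();
  curl_easy_setopt(easy, CURLOPT_HTTP_VERSION,
                   (long)CURL_HTTP_VERSION_2_PRIOR_KNOWLEDGE);
  curl_easy_setopt(easy, CURLOPT_POST, 1L);
  curl_easy_setopt(easy, CURLOPT_POSTFIELDS, "placeholder-body");
  curl_easy_setopt(easy, CURLOPT_TIMEOUT_MS, 1000L);
  curl_easy_setopt(easy, CURLOPT_URL, "http://127.0.0.1:1234/");
  CURLcode rc = curl_easy_perform(easy);
  curl_easy_cleanup(easy);
  curl_global_cleanup();
  return (int)rc;
}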

This is happening on a recent build from master. Worked fine in 7.64.1, so something must have changed since then. Would bisect, but the segfaults make that impossible 😞


TvdW commented Jun 19, 2019

From a curl build with --enable-debug:

* STATE: DO => DO_DONE handle 0x7f1451543808; line 1619 (connection #0)
* STATE: DO_DONE => PERFORM handle 0x7f1451543808; line 1741 (connection #0)
* We are completely uploaded and fine
* HTTP/2 found, allow multiplexing
< HTTP/2 200
< server: [snip]
< date: Wed, 19 Jun 2019 10:12:11 GMT
< strict-transport-security: max-age=17280000
< x-[snip]: [snip]
<
* Curl_readwrite: forcibly told to drain data
* HTTP/2 found, allow multiplexing
< HTTP/2 200
< server: [snip]
< date: Wed, 19 Jun 2019 10:12:11 GMT
< strict-transport-security: max-age=17280000
< x-[snip]: [snip]
<
* nread <= 0, server closed connection, bailing
* STATE: PERFORM => DONE handle 0x7f1451542008; line 1931 (connection #0)
* multi_done
* Connection still in use 1, no more multi_done now!
* Expire cleared (transfer 0x7f1451542008)
* Operation timed out after 1000 milliseconds with 806 bytes received
* Kill stream: Disconnected with pending data
* multi_done
* stopped the pause stream!
* Connection #0 to host 127.0.0.1 left intact

@kunalekawde

I remember hitting this issue very early in our libcurl integration, where we weren't handling things correctly on the app side, but since you say the same program runs fine with 7.64.1, it could be something else. Could you upload your sample program for this?


TvdW commented Jun 19, 2019

@kunalekawde

use v5.14;
use warnings;
use AnyEvent::YACurl qw/:constants/;
use AnyEvent;

my $auth= <>;
chomp $auth;

my $request= {
    CURLOPT_HTTP_VERSION => CURL_HTTP_VERSION_2_PRIOR_KNOWLEDGE,
    CURLOPT_HTTPHEADER => [ "Host: [snip]", "[snip]: $auth" ],
    CURLOPT_POST => 1,
    CURLOPT_PIPEWAIT => 1,
    CURLOPT_POSTFIELDS => "[snip]",
    CURLOPT_TIMEOUT_MS => 1000,
    CURLOPT_URL => "http://127.0.0.1:1234/",
    CURLOPT_WRITEFUNCTION => sub { },
    CURLOPT_VERBOSE => 1,
};
my $client= AnyEvent::YACurl->new({
    CURLMOPT_PIPELINING => CURLPIPE_MULTIPLEX,
    CURLMOPT_MAX_HOST_CONNECTIONS => 4,
    CURLMOPT_MAXCONNECTS => 128,
});
for (1..1000) {
    my $cv1= AE::cv;
    my $cv2= AE::cv;
    $client->request($cv1, $request);
    $client->request($cv2, $request);
    my ($res1, $err1)= $cv1->recv;
    my ($res2, $err2)= $cv2->recv;
    die $err1 if $err1;
    die $err2 if $err2;
}

(99.9% sure this isn't a bindings problem, so I haven't bothered to rewrite this in C; I'd have to be able to reproduce it against outside servers before I do that.)

One thing I did just notice is that I can only reproduce the issue when I explicitly set CURLMOPT_PIPELINING, even though its default is already supposed to be CURLPIPE_MULTIPLEX.
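
For reference, setting those multi options explicitly from C looks like this (a minimal sketch of just the multi-handle side of the Perl setup above):

#include <curl/curl.h>

int main(void)
{
  /* minimal sketch: the multi-level options from the script above,
     set explicitly; easy handles would be added and driven separately */
  curl_global_init(CURL_GLOBAL_DEFAULT);
  CURLM *multi = curl_multi_init();
  curl_multi_setopt(multi, CURLMOPT_PIPELINING, CURLPIPE_MULTIPLEX);
  curl_multi_setopt(multi, CURLMOPT_MAX_HOST_CONNECTIONS, 4L);
  curl_multi_setopt(multi, CURLMOPT_MAXCONNECTS, 128L);
  /* ... curl_multi_add_handle() two easy handles per round, then
     drive them with curl_multi_perform()/curl_multi_wait() ... */
  curl_multi_cleanup(multi);
  curl_global_cleanup();
  return 0;
}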

@kunalekawde

What's the nghttp2 version used?


TvdW commented Jun 19, 2019

What's the nghttp2 version used?

1.39.1

@kunalekawde

What's the nghttp2 version used?

1.39.1

And 7.64.1 with 1.39.1 works fine, right?


TvdW commented Jun 19, 2019

What's the nghttp2 version used?

1.39.1

And 7.64.1 with 1.39.1 works fine, right?

Yes.

Like I said in the initial description: the only change is curl going from 7.64.1 to latest master. Same application code, same libraries, etc.


jay commented Jun 19, 2019

Would bisect, but the segfaults make that impossible

How's that?

@jay jay added the HTTP/2 label Jun 19, 2019

TvdW commented Jun 19, 2019

How's that?

Hm, thanks for questioning that. I had all the segfaults I've reported these last few weeks in mind, but it turns out none of the code paths triggering those are relevant for the bug in this issue.

$ git bisect bad
2f44e94efb3df8e50bb2ddbc4ec6b569a6424517 is the first bad commit
commit 2f44e94efb3df8e50bb2ddbc4ec6b569a6424517
Author: Daniel Stenberg <daniel@haxx.se>
Date:   Fri Apr 5 16:38:36 2019 +0200

  pipelining: removed

  As previously planned and documented in DEPRECATE.md, all pipelining
  code is removed.

  Closes #3651


bagder commented Jun 19, 2019

Ugh, unfortunately that's a rather large and complicated commit. I'll dig into this issue soon.

@bagder bagder self-assigned this Jun 19, 2019

bagder commented Jun 20, 2019

I can only reproduce the issue when I explicitly set CURLMOPT_PIPELINING, even though its default is already supposed to be CURLPIPE_MULTIPLEX.

Independent bug, triggered this PR: #4051


bagder commented Jun 20, 2019

I wrote up a C example to try to reproduce and I immediately fell into an issue that #4055 fixes. I haven't quite seen the original issue yet but I figured this was worth fixing and mentioning first, as it might be related.


TvdW commented Jun 20, 2019

Doesn't look like that's it. Tried it, still times out.


bagder commented Jun 20, 2019

here's 4043.c, which is my attempt to reproduce, but it doesn't.


kunalekawde commented Jun 20, 2019

I'm also hitting this: I initiated 50 HTTP/2 prior-knowledge transfers in a loop, all 50 requests were sent on the same connection, I got 1 response, and then all the rest timed out.

STATE: PERFORM => DONE handle 0x18895a8
multi_done
Connection still in use 49, no more multi_done now!
Expire cleared (transfer 0x18895a8)

Does "Expire cleared" remove all timeouts, including those related to pending transfers on that connection?


bagder commented Jun 20, 2019

@kunalekawde that sounds like a different bug. If I understand @TvdW's description, he gets some timeouts out of a large number of transfers; you describe something where almost all of them time out. Please provide an example that can reproduce that (feel free to take my example above and tweak it so that it fails like your app does) in a separate issue, because it sounds like a serious one!


kunalekawde commented Jun 20, 2019

https://gist.github.com/kunalekawde/55361016249976573e2fb2bb89248eb3 with 50 transfers passed as an argument to the program; also try with 100, which produces more timeouts.

@bagder, I could reproduce this issue with the above example, but here it performs quite a few transfers before a timeout. I still need to look into my application, where most of them time out.


TvdW commented Jun 20, 2019

@bagder This reproduces the bug for me https://gist.github.com/TvdW/af53833b6112f0a7e2e21bff3b1f3e8a (ignore all the memory leaks etc, not relevant here)


TvdW commented Jun 20, 2019

As for "all" vs "some": I just tried running 1000 requests (in rounds of two) from the same multi, and 36 requests failed. The h2 stream ID of the final request was 7cf (1999), confirming that all 1000 requests were executed over the same connection: client-initiated HTTP/2 streams get odd IDs, so the 1000th stream on a connection is stream 2×1000−1 = 1999.

@kunalekawde

Another point here: the example I posted uses multiple connections. I had made a temporary hack to use the same connection, as discussed a while back on the mailing list.

lib/url.c, around line 3714:

      if(!(IsMultiplexingPossible(data, conn_temp) & CURLPIPE_MULTIPLEX)) { /*TEMP CHECK*/
        if(Curl_conncache_bundle_size(conn_temp) < max_host_connections &&
           Curl_conncache_size(data) < max_total_connections) {
          /* We want a new connection anyway */
          reuse = FALSE;

          infof(data, "We can reuse, but we want a new connection anyway\n");
          Curl_conncache_return_conn(conn_temp);
        }
      }

The issue seen by my app could be slightly different because of this.


TvdW commented Jun 20, 2019

@kunalekawde Please don't hijack my timeout issue then, create a new one ;)

@kunalekawde

@kunalekawde Please don't hijack my timeout issue then, create a new one ;)

:) sure


TvdW commented Jun 20, 2019

There's definitely something strange happening here, though. My "localhost:1234" server runs Envoy, which is basically an HTTP/2 proxy (nghttp2-based). The way it works is that I tell curl to connect to it (localhost:1234, via non-TLS HTTP/2) and send it a Host header indicating where it should route, and it takes care of everything for me.

Luckily the server that Envoy is communicating with also runs nginx, so we can configure it to speak HTTP/2 the same way Envoy does. In my code all I changed was CURLOPT_URL; the requests are the same and the responses are very similar as well (different Server header, and some Envoy metadata is gone).

However, when I do that, the issue disappears.

So I thought, what if it's because the requests arrive out of order sometimes, because of the nature of this proxy? Well, no, the timeouts seem to happen on both the first and second requests, randomly.

So I thought, what if it's not related to parallelism, and curl just gets into a bad state? I should be able to reproduce that with non-parallel requests, right? So I ran the requests at a randomly chosen parallelism of 1 or 2, but the issue only occurs in rounds where parallelism is greater than 1, so it's probably not caused by curl getting into a bad state during previous requests.


bagder commented Jun 20, 2019

Curious. When I try your sample code (only slightly adapted for my circumstances) against my local nginx, it runs through just fine maybe 9 out of 10 times. On the occasional run where it doesn't, it gets stuck in a loop where select() returns true for a condition that libcurl doesn't seem to get out of or change!


TvdW commented Jun 20, 2019

Excellent, so you've reproduced the issue! (I get the same behavior)

The loop it gets into is the result of my sloppy coding in this example (you can get rid of it by setting CURLOPT_TIMEOUT, but a low value there may make it look like it's working, since I'm not handling errors at all); it's not the cause of the issue. 😃

[edit] wait, no, I don't get "true", I get a timeout (because the timeval passed to select is {0,0}).


bagder commented Jun 20, 2019

As a test, if you don't do any POST there and only do plain GETs, can you make the same problem trigger?


TvdW commented Jun 20, 2019

As a test, if you don't do any POST there and only do plain GETs, can you make the same problem trigger?

Nope, server gives a 405 😄


bagder commented Jun 20, 2019

Well, that doesn't bother curl... =)


TvdW commented Jun 20, 2019

  • Can't reproduce when removing CURLOPT_POST / CURLOPT_POSTFIELDS (server returns 405)
  • Can't reproduce when changing the body to something the server doesn't like (server returns 500)

What happens seems to be tied to the server's response, not the request. [edit] Well, both, of course; it could well be some 200-OK silliness.

@kunalekawde

In the example I shared, the server is google.com; with 100 GET requests there were quite a few timeouts.

bagder added a commit that referenced this issue Jun 24, 2019
To make sure a HTTP/2 stream registers the end of stream.

Reported-by: Tom van der Woerdt
Fixes #4043

TvdW commented Jun 24, 2019

(Moved back from #4068 to avoid having the history in two places)

I was a bit afraid of that. I get a 405 immediately from my test server when I send the POST, and a very short response. The 405 usually aborts the sending of the data, so for a 200 POST the code path is probably a little different.

I fixed my server config to return 200 for POSTs, and I tried with 13-byte, 4 KB, and 17 MB responses. I can't seem to reproduce this problem. 😞

Hm, for my tests to reproduce the issue they need to upload the body all the way and get a 200 from the server. The only notable difference between a 500 response and a 200 response is the absence of a Content-Length header in the 200 case. It looks like the server is returning a 200 and then streaming the body in a different h2 frame.

I did manage to do a similar GET request to the same server, still supplying a body etc., that also results in a 200. In this case there's no timeout, which I strongly suspect is related to the presence of the Content-Length header in this particular response (it has one).

So:

  • This happens when reading from the server, not when writing the request (your patch seems related to the upload, which finishes fine in my case)
  • curl seems to not pick up the end-of-stream flag on incoming data
  • So far, only reproducible when there's no Content-Length header (see the sketch below for how end-of-stream is normally signaled in that case)
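
For context on that last point, here's how an nghttp2-based client normally learns about end-of-stream when there's no Content-Length: the server sets END_STREAM on the final DATA (or HEADERS) frame, and nghttp2 reports it through callbacks. A generic sketch, not curl's actual code:

#include <stdio.h>
#include <nghttp2/nghttp2.h>

/* generic nghttp2 client callbacks, not curl's code: without a
   Content-Length, the END_STREAM flag on the last frame is the only
   signal that the response body is complete */
static int on_frame_recv(nghttp2_session *session,
                         const nghttp2_frame *frame, void *user_data)
{
  (void)session; (void)user_data;
  if((frame->hd.type == NGHTTP2_DATA || frame->hd.type == NGHTTP2_HEADERS) &&
     (frame->hd.flags & NGHTTP2_FLAG_END_STREAM))
    printf("stream %d: END_STREAM received\n", (int)frame->hd.stream_id);
  return 0;
}

static int on_stream_close(nghttp2_session *session, int32_t stream_id,
                           uint32_t error_code, void *user_data)
{
  (void)session; (void)user_data;
  /* fires once per stream; if the client only records this in a flag
     and never acts on the flag, the transfer never completes */
  printf("stream %d closed, error %u\n", (int)stream_id, (unsigned)error_code);
  return 0;
}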


bagder commented Jun 24, 2019

In the problematic case, roughly how big are the responses you get?


bagder commented Jun 24, 2019

And yes, my #4068 fix is related to the sending of the request, so it's a different flaw that I just happened to run into while trying to repro this.


TvdW commented Jun 24, 2019

  • Operation timed out after 1000 milliseconds with 806 bytes received

The response is 806 bytes, and fully received, but curl does not seem to realize the stream has ended.


TvdW commented Jun 24, 2019

Here's an excerpt from the pcap

[two screenshots: pcap excerpts showing the response DATA and END_STREAM frames]

On the TCP level, the DATA and END_STREAM are delivered in the same TCP segment but in different frames; maybe that's somehow related? idk


TvdW commented Jun 24, 2019

Recompiled curl with DEBUG_HTTP2 for even more debug output. https://gist.github.com/TvdW/894aea6caf0e51c5b76c4a1b63b0871e


bagder commented Jun 24, 2019

I managed to occasionally get some hanging problems with that original sample code, but I then fixed the sample code to properly end the loop immediately when there are no transfers left in the multi handle, and now I don't seem to be able to reproduce anymore, or perhaps it just got a lot harder. See 4043-end-properly.c.
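
The shape of that change, as a sketch (this is not the literal 4043-end-properly.c, just the loop structure it describes):

#include <curl/curl.h>

/* sketch: drive a multi handle and leave the loop as soon as no
   transfers remain, instead of waiting again on a stale socket set */
static void drive(CURLM *multi)
{
  int running = 1;
  while(running) {
    if(curl_multi_perform(multi, &running) != CURLM_OK)
      break;
    if(!running)
      break; /* all transfers done: end the loop immediately */
    if(curl_multi_wait(multi, NULL, 0, 1000, NULL) != CURLM_OK)
      break;
  }
}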


TvdW commented Jun 24, 2019

With your updated code I get this (after swapping the curl_easy_setopt back to fit my case):

< 
* http2_recv: easy 0x118b218 (stream 0)
* h2_process_pending_input: All data in connection buffer processed
* http2_recv returns 0, http2_handle_stream_close
* nread <= 0, server closed connection, bailing
* STATE: PERFORM => DONE handle 0x118b218; line 1933 (connection #0)
* multi_done
* Connection still in use 1, no more multi_done now!
=== still NOW?
Client 0x118b218 ended, code: 0
=== run now!
select() on socket 3
=== still NOW?

Some other relevant excerpts (also visible in the full log, see gist):

* on_stream_close(), NO_ERROR (err 0), stream 5
* Removed stream 5 hash!
[snip]
* on_stream_close(), NO_ERROR (err 0), stream 7
* Removed stream 7 hash!

curl knows both streams are complete, but for some reason doesn't actually finish them...


TvdW commented Jun 24, 2019

I don't think you'll be able to reproduce this; we seem to be stuck, since we don't even know what the problem is... Are there any debug statements you would recommend I add to find out what's happening?


TvdW commented Jun 24, 2019

This seems to fix it:

diff --git a/lib/http2.c b/lib/http2.c
index b016bac92..8976adb68 100644
--- a/lib/http2.c
+++ b/lib/http2.c
@@ -1570,6 +1570,9 @@ static ssize_t http2_recv(struct connectdata *conn, int sockindex,
     return -1;
   }
 
+  if(stream->closed)
+    return http2_handle_stream_close(conn, data, stream, err);
+
   /* Nullify here because we call nghttp2_session_send() and they
      might refer to the old buffer. */
   stream->upload_mem = NULL;

There's probably a better fix, but based on my analysis of the code here (I'm not at all an expert on the curl code base, so correct me if I'm wrong), curl doesn't act on the callbacks given to it by nghttp2 but instead sets a bunch of flags and acts on them later. Whenever a connection has data, curl runs through each of the 'easy' objects, and these each invoke the nghttp2 loop, etc. The callbacks then set flags on the easy objects, and since these all get processed in turn, the flags get read and handled. However, not all flags get read on every http2_recv call, and it looks like in this particular case it's possible for a stream->closed to go unactioned.

The proper fix here may be to honor nghttp2's callbacks instead of using a lot of flags, but that's probably a lot of work and I'm sure there's a reason it was not written like that in the first place. 👍
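
In sketch form, the flag-based pattern described above looks something like this (the struct and function names are illustrative, not curl's actual internals):

#include <sys/types.h>

/* illustrative names only, not curl's actual internals */
struct stream_ctx {
  int closed;     /* set from nghttp2's on_stream_close callback */
  size_t pending; /* buffered response bytes not yet handed to the app */
};

static ssize_t recv_sketch(struct stream_ctx *stream)
{
  /* the essence of the patch above: consult the flag the callback set
     *before* pumping nghttp2 again, so a stream that closed during an
     earlier pass still gets its end-of-body reported as a 0 return */
  if(stream->closed && !stream->pending)
    return 0; /* end of stream */
  /* ... otherwise run the nghttp2 receive path and return data ... */
  return -1; /* placeholder for the data path in this sketch */
}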


TvdW commented Jun 24, 2019

What seems to happen here, in broad strokes, is that on_stream_close gets called, which removes the stream from the hash and sets stream->closed, but the end of the stream is not signaled by http2_recv returning 0, since the stream does have incoming data and returns its length at http2.c:1759. The other stream then gets processed normally and is also removed from the hash. At that point the hash is empty, but the end of the stream was never signaled properly.


bagder commented Jun 24, 2019

Just days ago I landed c0c40ab, which basically took away such a function call, because the http2_handle_stream_close() invocation there would occasionally close the wrong transfer.

bagder added a commit that referenced this issue Jun 24, 2019
To make sure a HTTP/2 stream registers the end of stream.

Bug #4043 made me find this problem but this fix doesn't correct the
reported issue.

Closes #4068

TvdW commented Jun 28, 2019

@bagder Found a way to reproduce it properly.

https://gist.githubusercontent.com/TvdW/97fc8a5e0fd989dd3590980d22ecae04/raw/c88ef1f9824cecface3fee023c4375a5f9f9cca4/test.c

🕺


bagder commented Jun 28, 2019

Lovely. Please just be patient, as I'm off on vacation for a few weeks.


TvdW commented Jun 28, 2019

Enjoy your vacation!


TvdW commented Jul 23, 2019

Great news! Now that parallel transfers have landed, this is reproducible with the curl tool itself 😃

./src/curl -Z --parallel-max 2 -m1 --http2-prior-knowledge -v http://domisc-01.server.tvdw.eu/{0..3}

bagder added a commit that referenced this issue Jul 29, 2019
... so that end-of-stream is detected properly.

Reported-by: Tom van der Woerdt
Fixes #4043

bagder commented Jul 29, 2019

Time to nail this bugger. The case I saw consistently with the curl -Z command line above no longer happens for me after #4160. It'll be interesting to learn how it runs for you. (This fix makes sense anyway, so it will probably be merged even if it doesn't totally fix this issue.)


TvdW commented Jul 29, 2019

🥳 Testing now, but judging by the diff that's exactly the problem I was having


TvdW commented Jul 29, 2019

@bagder Confirmed! 🎉

@bagder bagder closed this as completed in fc6045f Jul 29, 2019
@lock lock bot locked as resolved and limited conversation to collaborators Oct 27, 2019