parallel h2 requests can occasionally result in timeouts #4043
From a curl build with --enable-debug:
I remember hitting this issue in a very early part of our libcurl integration, where we weren't handling it correctly on the app side, but since you say the same program runs on 7.64.1, it could be something else. Could you upload your sample program for this?
(99.9% sure this isn't a bindings problem, so I haven't bothered to rewrite this in C; I have to be able to reproduce it with outside servers before I do that.) One thing I did just notice is that I can only reproduce the issue when I explicitly set
What's the nghttp2 version used?
1.39.1
And 7.64.1 with 1.39.1 works fine, right?
Yes. Like I said in the initial description: the only change is curl going from 7.64.1 to latest master. Same application code, same libraries, etc.
How's that?
Hm, thanks for questioning that. I had all the segfaults I've reported these last few weeks in mind, but it turns out none of the code paths triggering those are relevant for the bug in this issue.
Ugh, unfortunately that's a rather large and complicated commit. I'll dig into this issue soon.
Independent bug, triggered this PR: #4051
I wrote up a C example to try to reproduce, and I immediately fell into an issue that #4055 fixes. I haven't quite seen the original issue yet, but I figured this was worth fixing and mentioning first, as it might be related.
Doesn't look like that's it. Tried it, still times out.
Here's 4043.c, which is my attempt to reproduce, but it doesn't.
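(The 4043.c file itself isn't reproduced above, but a reproduction attempt of this shape would look roughly like the sketch below. The URL, the round count, and the use of CURLOPT_PIPEWAIT to force a single connection are assumptions for illustration, not details from the actual file.)

```c
#include <stdio.h>
#include <curl/curl.h>

/* Sketch: run rounds of two parallel HTTP/2 requests, ideally
   multiplexed over one connection, until the bug shows up. */
int main(void)
{
  CURLM *multi;
  int round;

  curl_global_init(CURL_GLOBAL_DEFAULT);
  multi = curl_multi_init();

  for(round = 0; round < 100; round++) {
    CURL *easy[2];
    int i, still_running;

    for(i = 0; i < 2; i++) {
      easy[i] = curl_easy_init();
      curl_easy_setopt(easy[i], CURLOPT_URL, "https://example.com/");
      curl_easy_setopt(easy[i], CURLOPT_HTTP_VERSION,
                       (long)CURL_HTTP_VERSION_2_0);
      /* assumption: wait for multiplexing so both share one connection */
      curl_easy_setopt(easy[i], CURLOPT_PIPEWAIT, 1L);
      curl_multi_add_handle(multi, easy[i]);
    }

    do {
      int numfds;
      curl_multi_perform(multi, &still_running);
      curl_multi_wait(multi, NULL, 0, 1000, &numfds);
    } while(still_running);

    for(i = 0; i < 2; i++) {
      curl_multi_remove_handle(multi, easy[i]);
      curl_easy_cleanup(easy[i]);
    }
  }

  curl_multi_cleanup(multi);
  curl_global_cleanup();
  return 0;
}
```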
I'm also hitting this. I initiated 50 HTTP/2 prior-knowledge transfers in a loop; all 50 requests got sent on the same connection, I got 1 response, followed by timeouts for all the rest. STATE: PERFORM => DONE handle 0x18895a8. Does the expire clearing remove all timeouts, including those related to pending transfers on that connection?
@kunalekawde that sounds like a different bug. If I understand @TvdW's description, he gets some timeouts out of a large number of transfers. You describe something where almost all of them time out. Please provide an example that can reproduce that (feel free to take my example above and tweak it so that it fails like your app does) in a separate issue, because it sounds like a serious issue!
https://gist.github.com/kunalekawde/55361016249976573e2fb2bb89248eb3 with 50 transfers as an arg to the program; also try with 100, which times out more. @bagder, I could reproduce this issue with the above example, but here it performs quite a few transfers before timing out. For my application, in which most of them time out, I need to look into it.
@bagder This reproduces the bug for me: https://gist.github.com/TvdW/af53833b6112f0a7e2e21bff3b1f3e8a (ignore all the memory leaks etc, not relevant here)
As for "all" vs "some": I just tried running 1000 requests (in rounds of two) from the same multi, and 36 requests failed. The h2 stream ID of the final request was 0x7cf (1999), confirming that all 1000 requests were executed over the same connection.
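(For anyone following along: client-initiated HTTP/2 streams use odd IDs, so the N-th request on a connection gets stream ID 2N − 1. For N = 1000 that's 1999, i.e. 0x7cf in hex, which is why the final stream ID confirms that a single connection carried all 1000 requests.)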
Another point here is that the example I posted uses multiple connections; I had made a temporary hack in lib/url.c to use the same connection, as discussed long back on the mailing list.
The issue seen by my app could be slightly different because of this.
@kunalekawde Please don't hijack my timeout issue then, create a new one ;)
:) sure
There's definitely something strange happening here, though. My "localhost:1234" server runs Envoy, which is basically an HTTP/2 proxy (nghttp2-based). The way it works is I tell curl to connect to it (localhost:1234, via non-TLS HTTP/2), send it a Host header indicating where it should route, and it takes care of everything for me.

Luckily the server that Envoy is communicating with also runs nginx, so we can configure it to speak HTTP/2 the same way Envoy does. In my code all I changed is CURLOPT_URL; the requests are the same and the responses very similar as well (different Server header, and some Envoy metadata is gone). However, when I do that, the issue disappears.

So I thought, what if it's because the requests arrive out of order sometimes, because of the nature of this proxy? Well, no, the timeouts seem to happen on both the first and second requests, randomly.

So I thought, what if it's not related to parallelism, and curl just gets into a bad state? I should be able to reproduce that for non-parallel requests, right? So I ran the requests at a randomly chosen parallelism of 1 or 2, but the issue only occurs in rounds where parallelism > 1, so it's maybe not caused by curl getting into a bad state in previous requests.
Curious. When I try your sample code (only slightly adapted for my circumstances) against my local nginx, it runs through just fine maybe 9 out of 10 times. The occasional time it doesn't work fine, it gets stuck in a loop where select() returns true for a condition that libcurl doesn't seem to get out of or change!
Excellent, so you've reproduced the issue! (I get the same behavior.) The loop it gets into is the result of my sloppy coding in this example case (you can get rid of it by setting CURLOPT_TIMEOUT, but a low value there may make it look like it's working since I'm not handling errors at all); it's not the cause of the issue. 😃 [edit] Wait, no, I don't get "true", I get a timeout (because the timeval passed to select is {0,0}).
As a test, if you don't do any POST there and only do plain GETs, can you make the same problem trigger?
Nope, server gives a 405 😄
Well, that doesn't bother curl... =)
What happens seems to be tied to the server's response, not the request. [edit] Well, both, of course; it could well be some 200-OK silliness.
In the example I shared, the server is google.com; on 100 GET requests, there were quite a few timeouts.
To make sure an HTTP/2 stream registers the end of stream. Reported-by: Tom van der Woerdt Fixes #4043
(Moved back from #4068 to avoid having the history in two places)
Hm, for my tests to reproduce the issue, they need to upload the body all the way and get a 200 from the server. The only notable difference between a 500 response and a 200 response is the absence of a Content-Length header in the 200 case: it looks like the server is returning a 200 and then streaming the body in a separate h2 frame. I did manage to do a similar GET request to the same server, still supplying a body etc., that also results in a 200. In this case there's no timeout, which I strongly suspect is related to the presence of the Content-Length header in this particular response (it has one). So:
In the problematic case, roughly how big are the responses you get?
And yes, my #4068 fix is related to the sending of the request, so it's a different flaw that I just happened to run into when trying to repro this.
The response is 806 bytes, and fully received, but curl does not seem to realize the stream has ended.
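(In HTTP/2 the end of a response is not signaled by Content-Length but by the END_STREAM flag on the last DATA or HEADERS frame, which is the mechanism that appears to be getting missed here. A minimal sketch of how an nghttp2 client can observe that flag; a standalone illustration, not curl's actual callback:)

```c
#include <stdio.h>
#include <nghttp2/nghttp2.h>

/* Called by nghttp2 for every frame received on the session. The
   END_STREAM flag, not Content-Length, marks the end of the response. */
static int on_frame_recv(nghttp2_session *session,
                         const nghttp2_frame *frame, void *user_data)
{
  if((frame->hd.type == NGHTTP2_DATA || frame->hd.type == NGHTTP2_HEADERS) &&
     (frame->hd.flags & NGHTTP2_FLAG_END_STREAM)) {
    printf("stream %d ended\n", (int)frame->hd.stream_id);
  }
  return 0;
}

/* Registration (the rest of the session setup is omitted): */
static void setup(nghttp2_session_callbacks *cbs)
{
  nghttp2_session_callbacks_set_on_frame_recv_callback(cbs, on_frame_recv);
}
```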
Recompiled curl with
I managed to get some hanging problems occasionally with that original sample code, but I then fixed the sample code to more properly end the loop immediately when there are no transfers left in the multi handle, and now I can't seem to reproduce anymore, or perhaps it just got a lot harder. See 4043-end-properly.c.
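(A sketch of the kind of loop-ending fix described here, not the actual 4043-end-properly.c: leave the loop as soon as curl_multi_perform reports zero running handles, draining completion messages as they arrive rather than afterwards.)

```c
#include <stdio.h>
#include <curl/curl.h>

/* Drive all transfers on 'multi' and return as soon as none remain. */
static void run_until_done(CURLM *multi)
{
  int still_running;
  do {
    CURLMsg *msg;
    int queued, numfds;

    curl_multi_perform(multi, &still_running);

    /* reap finished transfers right away instead of after the loop */
    while((msg = curl_multi_info_read(multi, &queued))) {
      if(msg->msg == CURLMSG_DONE)
        fprintf(stderr, "transfer done: %s\n",
                curl_easy_strerror(msg->data.result));
    }

    if(still_running)
      curl_multi_wait(multi, NULL, 0, 1000, &numfds);
  } while(still_running); /* ends immediately when nothing is left */
}
```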
With your updated code I get this (after swapping the curl_easy_setopt back to fit my case):
Some other relevant excerpts (also visible in the full log, see gist):
curl knows both streams complete, but for some reason doesn't actually finish the streams...
I don't think you'll be able to reproduce this; we seem to be stuck since we don't even know what the problem is... Are there any debug statements you would recommend that I add to find out what's happening?
This seems to fix it:
There's probably a better fix, but based on my analysis of the code here (I'm not at all an expert on the curl code base, so correct me if I'm wrong), curl doesn't act on the callbacks given to it by nghttp2 but instead sets a bunch of flags and acts on them later. Whenever a connection has data, it will run through each of the 'easy' objects and these will each invoke the nghttp2 loop, etc. The callbacks then set flags on the easy objects, and since these all get processed in turn, the flags get read and handled. However, not all flags get read on every http2_recv call, and it looks like in this particular case it's possible for a stream's close flag to be set without ever being acted on.

The proper fix here may be to honor nghttp2's callbacks instead of using a lot of flags, but that's probably a lot of work and I'm sure there's a reason it was not written like that in the first place. 👍
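(To make that failure mode concrete, here is a small self-contained toy of the flag pattern being described. All names are hypothetical, and this is not curl's actual code: a close callback only records end-of-stream in a flag, and if the consuming side returns buffered data without re-checking that flag, the close is only noticed on a later call that may never come.)

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical per-stream state mimicking the described flag pattern. */
struct stream {
  char buf[256];
  size_t buflen;
  int closed; /* set by the close callback, read later by stream_recv */
};

/* A close callback in this pattern just sets a flag; nothing is acted
   upon at this point. */
static void on_close_cb(struct stream *s)
{
  s->closed = 1;
}

/* Consumer side: buffered data is returned first, and end-of-stream is
   only reported once the buffer is empty AND the flag gets read. A
   caller that drains the buffer and then blocks waiting for more socket
   activity (which never comes) never observes 'closed': the hang. */
static int stream_recv(struct stream *s, char *out, size_t outlen)
{
  if(s->buflen) {
    size_t n = s->buflen < outlen ? s->buflen : outlen;
    memcpy(out, s->buf, n);
    memmove(s->buf, s->buf + n, s->buflen - n);
    s->buflen -= n;
    return (int)n;   /* data returned; 'closed' is not checked here */
  }
  if(s->closed)
    return 0;        /* end of stream, only seen on a later call */
  return -1;         /* no data yet; caller waits on the socket */
}

int main(void)
{
  struct stream s = { "806 bytes of response", 21, 0 };
  char out[256];
  on_close_cb(&s);  /* END_STREAM arrived together with the last data */
  printf("recv: %d\n", stream_recv(&s, out, sizeof(out))); /* 21 */
  /* a caller that now blocks on the socket instead of calling
     stream_recv() again never learns the stream is closed */
  printf("recv: %d\n", stream_recv(&s, out, sizeof(out))); /* 0 */
  return 0;
}
```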
What seems to happen here, in broad lines, is that
Just days ago I landed c0c40ab, which basically took away such a function invoke because the http2_handle_stream_close() call there would occasionally close the wrong transfer.
@bagder Found a way to reproduce it properly. 🕺
Lovely. Please just be patient as I'm off on vacation for a few weeks.
Enjoy your vacation!
Great news! Now that parallel transfers have landed, this is reproducible with the curl tool itself 😃
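(The exact invocation wasn't captured above, but a command along these lines exercises the same pattern with the stock curl tool; the URLs are placeholders:)

```
curl --http2 --parallel --parallel-max 2 https://example.com/a https://example.com/b
```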
... so that end-of-stream is detected properly. Reported-by: Tom van der Woerdt Fixes #4043
Time to nail this bugger. The case I see all the time with the
🥳 Testing now, but judging by the diff that's exactly the problem I was having
@bagder Confirmed! 🎉
It's not entirely clear to me what's happening here, and I haven't yet been able to reproduce this outside of my application, though I can currently reproduce it by running a loop that does two identical requests at a time. Eventually (after ~100 requests), one of them times out.
I'm logging this issue so that hopefully someone recognizes the symptoms or the cause, despite knowing it cannot be reproduced in its current form.
What stands out, to me, is the "with 806 bytes received", because that's how many bytes are in the response, so the response is complete. There's no Content-Length header, and I suspect that that's related: maybe curl is not correctly handling the end of a stream.
Running the exact same requests over http/1.1 makes the issue go away.
This is happening on a recent build from master. Worked fine in 7.64.1, so something must have changed since then. Would bisect, but the segfaults make that impossible 😞