Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HTTP2 connections stuck when paused and resumed multiple times #10988

Closed
lemourin opened this issue Apr 17, 2023 · 27 comments
Closed

HTTP2 connections stuck when paused and resumed multiple times #10988

lemourin opened this issue Apr 17, 2023 · 27 comments
Assignees
Labels

Comments

@lemourin
Copy link
Contributor

lemourin commented Apr 17, 2023

I'd reopen #10971 if I could. The fix for that bug introduced a new regression, please see #10971 (comment).

I did more testing today with my real app coro-cloudstorage and I noticed that the downloads are now getting stuck. I can provide more context on what the app does if necessary. I changed the test program a bit, it now resumes some transfers; with the memory leak fix one transfer gets stuck and returns "Stream error in the HTTP/2 framing layer" after ~5 minutes.

Without #10972 it leaks, with #10972 the connections get stuck.

@bagder bagder added the HTTP/2 label Apr 17, 2023
@bagder
Copy link
Member

bagder commented Apr 17, 2023

The title of this issue says "connections stuck when paused and resumed multiple times" and then you draw a connection to #10972. But that PR is about when a paused transfer is aborted, it is not about being "paused/unpaused" at all. It makes me curious. Which one is it that is actually necessary to trigger this problem?

Without #10972 it leaks, with #10972 the connections get stuck.

Are you 100% certain of this? Because the only thing my PR there did was make sure data was freed that otherwise was not freed at all. That data was already ready off the stream and connection and freeing or not does not change the state of the connection. To me, it seems super-weird and unlikely that the PR #10972 changes the state of the connection in any way.

@lemourin
Copy link
Contributor Author

The title might be a bit imprecise. The test program does a combination of abort / pause / resume:

  1. fire 5 requests
  2. pause them after 16mb received
  3. fire 5 more requests after 3s
  4. abort requests started in 3.
  5. resume requests started in 1.
  6. one of those 5 requests will get stuck

All requests are fired to the same URL. The same code works fine with HTTP 1.1.

@bagder
Copy link
Member

bagder commented Apr 17, 2023

one of those 5 requests will get stuck

Just one? Will the others go on?

Are these h2 requests all done over the same connection?

@lemourin
Copy link
Contributor Author

one of those 5 requests will get stuck

Just one? Will the others go on?

It's not deterministic, anything from 0 to 5 will be stuck; it probably depends on timing.

Are these h2 requests all done over the same connection?

They should, no? All requests use the same multi handle.

@lemourin
Copy link
Contributor Author

I'm looking at the multi.c code. Shouldn't data->state.tempwrite be freed when the connection is closed instead of when a single transfer using that connection is aborted?

@bagder
Copy link
Member

bagder commented Apr 17, 2023

data is the handle for a single transfer. data->state.tempwrite is data cached for that specific single transfer. When multi_done() is called, that single transfer is done, it has completed. There's no point in keeping that data around anymore at that point.

The freeing of that data is completely unrelated to the connection's lifetime or when it gets closed or not.

@lemourin
Copy link
Contributor Author

Yea you're right. I cherry-picked #10972 to 8.0.1 and it works fine. The regression must have been introduced between 8.0.1 and 81b2b57. I can do a bisect.

@bagder
Copy link
Member

bagder commented Apr 17, 2023

I can do a bisect.

What would be most helpful!

@lemourin
Copy link
Contributor Author

744dcf2 is the culprit.

@bagder
Copy link
Member

bagder commented Apr 17, 2023

/cc @icing ^^

@icing icing self-assigned this Apr 18, 2023
icing added a commit to icing/curl that referenced this issue Apr 18, 2023
- curl#10988 reports transfer stalling when pausing/unpausing
- if the servers send window is exhausted before the pause,
  it will not send more on unpausing until the transfer
  consumes its data.
- make sure unpaused transfers get drained and expire, so
  any data already buffered will be processed. Which triggers
  a WINDOW_UPDATE to the server and it will start sending
  again.
@icing
Copy link
Contributor

icing commented Apr 18, 2023

Could you check if #10998 resolves your issue?

@lemourin
Copy link
Contributor Author

Tested, #10998 doesn't help.

@icing
Copy link
Contributor

icing commented Apr 19, 2023

Thanks for checking. We definitely need to add a test case with this sort of lib use.

icing added a commit to icing/curl that referenced this issue Apr 20, 2023
- curl#10988 reports transfer stalling when pausing/unpausing
- if the servers send window is exhausted before the pause,
  it will not send more on unpausing until the transfer
  consumes its data.
- make sure unpaused transfers get drained and expire, so
  any data already buffered will be processed. Which triggers
  a WINDOW_UPDATE to the server and it will start sending
  again.
@icing
Copy link
Contributor

icing commented Apr 21, 2023

@lemourin I added a test with pause/unpausing of transfers and could reproduce a situation where transfers got stuck. That I fixed in #11005.

Now, I do not know if this exactly matches what you observe, but I know it has one bug less. If you could find the time to run that in your environment, that'd be helpful. Thanks.

@lemourin
Copy link
Contributor Author

It's still the same for me, unfortunately.

For context, in case it's useful: I have an HTTP server which generates thumbnails for videos stored in cloud storage (e.g. OneDrive). It renders html that looks like this. The video data is fetched using curl and then it is passed through to FFmpeg (through C API). When FFmpeg says: please read n bytes from the stream then I tell curl to fire a request and then as soon as n bytes are fetched I pause the connection, to avoid buffering. When FFmpeg asks for m more bytes I resume the connection.

With curl at head the first time I try to load the file list it works fine, very rarely it gets stuck on the last thumbnail. When I do a hard reload (to force thumbnail regeneration) it loads around 3 thumbnails and the rest are stuck; it's 100% reproducible.

@icing
Copy link
Contributor

icing commented Apr 24, 2023

I tried to find your curl usage in coro-cloudstorage, but unsuccessful. The actual code is somewhere else, I assume?

I think we need a standalone program that reproduces your issue.

@lemourin
Copy link
Contributor Author

My curl usage is in coro-http library. It wraps curl into a coroutine interface. Compiling coro-cloudstorage with vcpkg will pull in all dependencies and build them from source:

mkdir build 
cd build 
cmake .. -DCMAKE_TOOLCHAIN_FILE=<vcpkg-root>/scripts/buildsystems/vcpkg.cmake -G Ninja
ninja
./examples/cloudstorage

The program will start a HTTP server on port 12345. Go to localhost:12345 in the browser to use the app. Then you can use some throwaway account for whatever cloud and give the app access to it. Upload some video files there (through the official web ui of that cloud provider) and see the thumbnails being generated in the app when going to the file list. You can disable the usage of cloud provided thumbnails by removing the code here. It will force using ffmpeg to generate the thumbnails.

The app can also be built without vcpkg. Then coro-http would need to be installed manually. vcpkg will pull in its own curl so you'd probably need to do that.

The app can be compiled with clang or msvc; gcc is not supported because of broken coroutine support.

icing added a commit to icing/curl that referenced this issue Apr 24, 2023
- curl#10988 reports transfer stalling when pausing/unpausing
- if the servers send window is exhausted before the pause,
  it will not send more on unpausing until the transfer
  consumes its data.
- make sure unpaused transfers get drained and expire, so
  any data already buffered will be processed. Which triggers
  a WINDOW_UPDATE to the server and it will start sending
  again.
@icing
Copy link
Contributor

icing commented Apr 24, 2023

I updated #11005 with a fix to my current theory of what is going wrong in your case.

With 81b2b57 we introduced more restrictive HTTP/2 flow control windows for transfers. This has many benefits. But, in the case if pause/unpause transfers, it needs extra care.

In an event based app like yours, transfers are only run when socket events happen, or when curl timeouts trigger. The unpausing of a http/2 transfer did not trigger a timeout, but just relied on new DATA arriving and generating socket events.

This is not always the case. On a pause, we restrict the flow window to 0 and on unpause, we increase it again. But the server will not always send more DATA. It can happen that, before it got the paused 0, it already send enough DATA to exhaust the window. Increasing the window again, the server will see that it still exhausts the newly enlarged window and not send anything.

This means no socket events and, without the curl timeout set, no events at all and the transfer stalls.

The fix is to always generate a timeout on unpausing, so the receive on the transfer will be made.

I hope this works for you and catches the underlying issue.

@lemourin
Copy link
Contributor Author

Thanks for the investigation!

With 81b2b57 we introduced more restrictive HTTP/2 ...

You mean 744dcf2 I suppose.

I tried the updated PR. Still no luck. The behavior is the same.

@icing
Copy link
Contributor

icing commented Apr 25, 2023

Seems we need a real trace of your app's libcurl use. How to do that:

  • build a libcurl with --enable-debug
  • install your own file-writing log function via CURLOPT_DEBUGFUNCTION. There is sample implementation in the documentation of that option
  • Run your application with the environment variable CURL_DEBUG=http/2.
  • sends us the generated log file

@lemourin
Copy link
Contributor Author

Logs on Google Drive. On shutdown all stalled requests are aborted. curl built at #11005.

@icing
Copy link
Contributor

icing commented Apr 26, 2023

Thanks! I miss 2 things in the log:

  1. If curl is built with debug enabled, there should be a DEBUG notification like
    * !!! WARNING !!! * This is a debug build of libcurl, do not use in production. at the start of the log
  2. In a debug build, when the environment variable CURL_DEBUG=http/2 is set for the process running libcurl, e.g. your server, there should be log messages like the one below to be found.
* [CONN-0-HTTP/2] [h2sid=1] cf_send(len=96) submit https://one.http.curl.se:56649/data.json

which will give us all the details of how the HTTP/2 implementation is called and what streams it opens etc.

Could you try and get a log with that information?

@lemourin
Copy link
Contributor Author

Sorry, I'm using cmake to build curl. I updated the log file.

@icing
Copy link
Contributor

icing commented Apr 27, 2023

Thanks! There is more interesting things to see in that file:

  • 31 requests that your app fully read the response for (appear as on_stream_close(), NO_ERROR)
  • 42 request that were cancelled by the app before the response was fully read (appear as premature DATA_DONE)
  • 3 requests that were RST by the server, e.g. it aborted them (appear as recv RST)

So, there is a little bit more ongoing. The 3 RST requests are the problem and those you see probably as stalled.

The cause is that HTTP/2 not only has flow control for individual streams, but also for the overall connection. I set that - unwisely - to 16 times the stream window. That would limit the amount of memory buffered in total and still allow for good performance. If all stream were indeed being read.

I took not into account a case like yours where, and that is what is happening in your app, the PAUSED streams would bring the overall connection window to 0. This is what is happening and why the server cannot send you any more response data. This makes transfer seem stalled. But it is the whole connection that is.

In #11043 I increase the overall window to 100 * window size of each stream. So one may PAUSE 99/100 and the last one will still find enough space to continue.

Please give that PR a try.

@bagder bagder closed this as completed in de4b016 Apr 27, 2023
@lemourin
Copy link
Contributor Author

Thanks @icing, looks like it works. So, if I understand correctly, if I try to pause 100/101 transfers then the 101st transfer will stall? If yes, then maybe this behavior should be surfaced in the documentation or maybe the window size should be made configurable? I'm just concerned that if I set the parallelism too high for my app then it would break.

@icing
Copy link
Contributor

icing commented Apr 28, 2023

No, this counts the active transfers on a single connection. And, commonly, servers and clients like curl only allow up to 100 simultaneous transfers on one connection.

When you app starts the 101th transfer against the same server, curl will open another connection.

@icing
Copy link
Contributor

icing commented Apr 28, 2023

FYI: just added #11052 that fixes "one last thing" in regard to flow control handling of cancelled transfers.

bch pushed a commit to bch/curl that referenced this issue Jul 19, 2023
- fixes stalled connections

- Make the connection window large enough, so that there is
  some room left should 99/100 streams be PAUSED by the application

Reported-by: Paweł Wegner
Fixes curl#10988
Closes curl#11043
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Development

No branches or pull requests

3 participants