Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sockfilt: fix race-condition between threads and event-processing #5156

Closed
wants to merge 3 commits into from

Conversation

mback2k
Copy link
Member

@mback2k mback2k commented Mar 27, 2020

Fix race-condition of waiting threads finishing while events are
already being processed which lead to invalid or skipped events.

@mback2k mback2k added tests Windows Windows-specific labels Mar 27, 2020
@mback2k mback2k self-assigned this Mar 27, 2020
@mback2k mback2k force-pushed the fix-sockfilt-race-condition branch from f9b6799 to e409acd Compare March 28, 2020 18:40
@mback2k
Copy link
Member Author

mback2k commented Mar 29, 2020

Once ready, I will squash the commits into logical chunks for rebase and merge.

@mback2k mback2k force-pushed the fix-sockfilt-race-condition branch 2 times, most recently from 0d15f25 to 299b26d Compare March 29, 2020 18:17
mback2k added a commit to mback2k/curl that referenced this pull request Mar 30, 2020
Fix race-condition of waiting threads finishing while events are
already being processed which lead to invalid or skipped events.

Use mutex to check for one event at a time or do post-processing.
In addition to mutex-based locking use specific event as signal.

Closes curl#5156
@mback2k mback2k force-pushed the fix-sockfilt-race-condition branch from 299b26d to 8d0ff97 Compare March 30, 2020 16:22
@mback2k
Copy link
Member Author

mback2k commented Mar 30, 2020

@bagder I am seeing a random issue with test 715, but I do not think that sockfilt is causing the error here. Would you mind taking a look and giving me your insights about this proxy related test?
https://ci.appveyor.com/project/curlorg/curl/builds/31826331/job/ffwe3km7j2h8wj3s?fullLog=true#L2527

@mback2k
Copy link
Member Author

mback2k commented Mar 30, 2020

I have been running test 715 in an endless loop for nearly 1 hour now and cannot reproduce the failure:
while ./runtests.pl -d -k 715; do :; done

Maybe the failure happens only on a slow CI machine? I will have to figure out how to make my machine "slower"...

@bagder
Copy link
Member

bagder commented Mar 30, 2020

It seems to be a problem in sws when it runs as a CONNECT proxy. See the http2_server.log after the failure:

 16:41:38.046000 [CTRL] SENT "150 Binary data connection for 715 () (14 bytes).%0d%0a"
 16:41:38.046000 [DATA] READ 14 bytes from server
 16:41:38.046000 [DATA] READ "silly content%0a"
 16:41:38.046000 [DATA] SENT 14 bytes to client
 16:41:38.046000 [DATA] SENT "silly content%0a"
 16:41:44.093000 CONNECT proxy timeout after 6 idle seconds
 16:41:44.093000 [DATA] CLOSING server socket (cleanup)
 16:41:44.093000 [DATA] CLOSING client socket (cleanup)

It gets the "silly content" and passes that on and then it does nothing more until it times out. It doesn't seem to detect the end if that transfer.

Compare that to a functioning test case (on my Linux host):

20:00:37.187877 [CTRL] SENT "150 Binary data connection for 715 () (14 bytes).%0d%0a"
20:00:37.187907 [DATA] READ 14 bytes from server
20:00:37.187931 [DATA] READ "silly content%0a"
20:00:37.187958 [DATA] got 0, STOP READING server
20:00:37.188028 [DATA] SENT 14 bytes to client
20:00:37.188055 [DATA] SENT "silly content%0a"
20:00:37.188078 [DATA] DISABLED WRITING client
20:00:37.438417 [CTRL] READ 28 bytes from server
20:00:37.438495 [CTRL] READ "226 File transfer complete%0d%0a"
20:00:37.438524 [DATA] got 0, STOP READING client
20:00:37.438548 [DATA] DISABLED WRITING server
20:00:37.688894 [DATA] CLOSING client socket
20:00:37.688991 [DATA] CLOSING server socket
20:00:37.689020 [DATA] ENDING

@mback2k
Copy link
Member Author

mback2k commented Mar 30, 2020

It doesn't seem to detect the end if that transfer.

Thanks, can you tell me how that end of transfer should normally be signaled?

Edit: I can see that it is checking for sread returning <= 0, so I guess the socket should be closed. But by whom? Is it a socket managed by sockfilt on the other end in this case? Is there some difference in the logs regarding that?

@bagder
Copy link
Member

bagder commented Mar 30, 2020

This:

20:00:37.187958 [DATA] got 0, STOP READING server

Ie it should get signaled to read and then read zero bytes.

@mback2k
Copy link
Member Author

mback2k commented Mar 30, 2020

I mean on the other side of the connection. What is on the other side? sockfilt? And if yes, what makes sockfilt close the connection?

@mback2k mback2k force-pushed the fix-sockfilt-race-condition branch 3 times, most recently from 1ff4da8 to 03b3e43 Compare March 31, 2020 17:51
mback2k added a commit to mback2k/curl that referenced this pull request Apr 2, 2020
Fix race-condition of waiting threads finishing while events are
already being processed which lead to invalid or skipped events.

Use mutex to check for one event at a time or do post-processing.
In addition to mutex-based locking use specific event as signal.

Closes curl#5156
@mback2k mback2k force-pushed the fix-sockfilt-race-condition branch from 03b3e43 to 17389fb Compare April 2, 2020 16:36
@mback2k
Copy link
Member Author

mback2k commented Apr 2, 2020

I will create another PR to handle the remaining issues regarding waitpid on Windows.

For now my focus is to get this PR ready for merge to avoid the race-condition in sockfilt.

@mback2k mback2k marked this pull request as ready for review April 2, 2020 16:37
Fix race-condition of waiting threads finishing while events are
already being processed which lead to invalid or skipped events.

Use mutex to check for one event at a time or do post-processing.
In addition to mutex-based locking use specific event as signal.

Closes curl#5156
Replace the incomplete workaround regarding FD_CLOSE
only signalling once by instead doing a pre-check with
standard select and storing the result for later use.

select keeps triggering on closed sockets on Windows while
WSAEventSelect fires only once with data still available.
By doing the pre-check we do not run in a deadlock
due to waiting forever for another FD_CLOSE event.
And update other logmsg output in select_ws on Windows.
@mback2k mback2k force-pushed the fix-sockfilt-race-condition branch from 17389fb to c330831 Compare April 2, 2020 18:04
Copy link
Member

@MarcelRaad MarcelRaad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@bagder
Copy link
Member

bagder commented Apr 2, 2020

Maybe remove [WIP] from the title to better signal this is ready?

@bagder bagder removed their request for review April 2, 2020 21:53
@bagder
Copy link
Member

bagder commented Apr 2, 2020

This is all Windows code. I'm not qualified to review it...

@mback2k mback2k changed the title [WIP] sockfilt: fix race-condition between threads and event-processing sockfilt: fix race-condition between threads and event-processing Apr 2, 2020
@mback2k
Copy link
Member Author

mback2k commented Apr 2, 2020

Maybe remove [WIP] from the title to better signal this is ready?

Good idea. Forgot that after pressing the "Ready for review" button.

This is all Windows code. I'm not qualified to review it...

Maybe, but I still think you are the right person to review code style and C code in general. 😄

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
tests Windows Windows-specific
Development

Successfully merging this pull request may close these issues.

None yet

3 participants