Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Data race in ftp test server #12002

Closed
dfandrich opened this issue Oct 1, 2023 · 3 comments
Closed

Data race in ftp test server #12002

dfandrich opened this issue Oct 1, 2023 · 3 comments
Assignees

Comments

@dfandrich
Copy link
Contributor

I did this

I've seen many FTP tests failing that have appear to result in a control message intended for one FTP client to be given to the next client. This results in a test log like this one (test 1091). The client errors out with curl: (8) Got a 226 ftp-server response when 220 was expected in this case.

What appears to happen (with timestamps from the sample ftp_server.log in the log above) is as follows:

  1. 06:55:04.575219 the test server verification runs to perform a dummy FTP transfer (downloading "verifiedserver") as a client and start requesting a transfer.
  2. 06:55:04.616714 client connects to data socket
  3. 06:55:21.341029 client downloads "verifiedserver" and server closes the data socket
  4. (speculation) client immediately closes control socket, since verification is complete
  5. 06:55:21.341158 server sends the 226 File transfer complete status code on the control connection (which was just closed—race is somewhere here)
  6. server is delayed due to overloaded server (?)
  7. test harness starts running the new curl FTP client which connects to the test server control port
  8. 06:55:21.341270 test server finally gets scheduled to run and sends the 226 response on the control socket, but it's a new socket so it goes to the new client. Is it then immediately notified that the client has DISConnected.
  9. client expects the 220 welcome banner message as the first thing it gets and not 226 so it errors out and exits.
  10. 06:55:21.344454 test server is notified that client has disconnected

There are some problems with the simple explanation of this scenario:

  • The timestamps make it unlikely to be completely accurate. Step 6 (server delayed) happens between two log messages only 112 microseconds apart. It's more likely an issue with bad buffering or reusing sockets.
  • Step 4. could only be true if curl does not wait for a 226 response after a transfer (I don't know offhand if it does or not).
  • The "verifiedserver" check takes 17 seconds (!) between the client requesting EPSV and the 226 response from the server. This is pretty suspicious, and leads credence to the theory that there's actually some bad connection lifetime management going on in the test server.

The ftp_sockctrl.log is probably the most explicit in showing what's going on. Some excerpts:

  • 06:55:17.140000 Client disconnect
  • 06:55:17.203000 Client connect
  • 06:55:20.890000 '226 File transfer complete\r\n'

I'm not sure if the client here is actually the test server or curl, but it doesn't really matter. It's showing that it's sending a 226 code (response to a completed file transfer) as the first thing to a new client that just connected.

I expected the following

Test servers should not have race conditions.

curl/libcurl version

8.4.0-DEV

operating system

The sample log comes from mingw-w64, and this problem likely happens more often on Windows, but it may have happen on other OSes as well.

@dfandrich
Copy link
Contributor Author

Here's a similar log from another failed test (test 932, SMTP CRAM-MD5 graceful cancellation). In this one, the client receives the SMTP banner message 220 code twice in a row. The server log shows the server sending the 220, then

   13:10:18.765251 MAIN sockfilt said DISC
   13:10:18.765358 ====> Client disconnected
   13:10:18.765629 Awaiting input
   13:10:18.766160 ====> Client connect
   13:10:18.766493 FTPD: Getting commands from log/server.cmd
   13:10:18.766593 FTPD: instructed to support AUTHENTICATION command

i.e. the client supposedly disconnected then immediately reconnected. The curl log shows no such disconnecte/reconnect. The sockfilt log shows the a connect/disconnect/connect happening before the first 220 is sent, then the second 220 is sent without any intervening disconnect. So, the problem looks to be in the server and not the sockfilt.

This one is for a different build (still Windows) but CMake, VS2010 this time, but I can't say for certain that it's Windows-only.

@dfandrich
Copy link
Contributor Author

Here's my current working theory for what's happening in the test 932 case above. This matches the logs nicely, although I can't match up the timing because hi-res timing isn't available in the Windows sockfilt.

  1. client1 connects, sockfilt sends CNCT to server
  2. client1 disconnects, sockfilt sends DISC to server †
  3. new client2 connects, sockfilt sends CNCT to server
  4. server wakes up, sees the first CNCT, sends 220 response to the wrong client2
  5. sockfilt passes on 220 to client2
  6. client2 sees 220, then responds with its next command, EHLO
  7. server then gets the DISC, resets itself
  8. server then sees the second CNCT, sends 220 response to the right client2 this time but they're now out of sync
  9. client sees the 220 response, thinks that's for its EHLO command. Client is out of sync with the server. Client responds based on its 220 instead of the 250 it should have received and the test protocol section is broken.

† The only open question is why in step 2 the client disconnects. Could it be this same race condition with the test server's verifyftp() call? In any case, it doesn't really matter because an unexpected disconnect could happen for any reason and the sockfilt+server would get out of sync and potentially stay out of sync.

I can think of two easy(ish) ways to fix this:

  1. Require an acknowledgement step after one side sends a DISC. Have both sides throw everything away until an ACKD is received.
  2. Send a connection number along with every command, e.g. DISC 123 or DATA 123\nfe84\n to tie the command to a particular socket. The number would be monotonically increasing to avoid duplicates. If one side receives a command for a closed socket, it's thrown out.

The first one is probably the simplest.

@dfandrich dfandrich self-assigned this Oct 7, 2023
dfandrich added a commit that referenced this issue Oct 7, 2023
If a client disconnected and reconnected quickly, before the ftp server
has a chance to respond, the protocol message/ack (ping/pong) sequence
got out of sync, causing messages sent to the old client to be delivered
to the new.  A disconnect must now be acknowledged and intermediate
requests thrown out until it is, which ensures that such synchronization
problems can't occur. This problem could affect ftp, pop3, imap and smtp
tests.

Fixes #12002
Closes #12049
@mback2k
Copy link
Member

mback2k commented Oct 7, 2023

Thanks a lot for debugging and fixing this. 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging a pull request may close this issue.

2 participants