Data race in ftp test server #12002

dfandrich · 2023-10-01T04:32:20Z

I did this

I've seen many FTP tests failing that have appear to result in a control message intended for one FTP client to be given to the next client. This results in a test log like this one (test 1091). The client errors out with curl: (8) Got a 226 ftp-server response when 220 was expected in this case.

What appears to happen (with timestamps from the sample ftp_server.log in the log above) is as follows:

06:55:04.575219 the test server verification runs to perform a dummy FTP transfer (downloading "verifiedserver") as a client and start requesting a transfer.
06:55:04.616714 client connects to data socket
06:55:21.341029 client downloads "verifiedserver" and server closes the data socket
(speculation) client immediately closes control socket, since verification is complete
06:55:21.341158 server sends the 226 File transfer complete status code on the control connection (which was just closed—race is somewhere here)
server is delayed due to overloaded server (?)
test harness starts running the new curl FTP client which connects to the test server control port
06:55:21.341270 test server finally gets scheduled to run and sends the 226 response on the control socket, but it's a new socket so it goes to the new client. Is it then immediately notified that the client has DISConnected.
client expects the 220 welcome banner message as the first thing it gets and not 226 so it errors out and exits.
06:55:21.344454 test server is notified that client has disconnected

There are some problems with the simple explanation of this scenario:

The timestamps make it unlikely to be completely accurate. Step 6 (server delayed) happens between two log messages only 112 microseconds apart. It's more likely an issue with bad buffering or reusing sockets.
Step 4. could only be true if curl does not wait for a 226 response after a transfer (I don't know offhand if it does or not).
The "verifiedserver" check takes 17 seconds (!) between the client requesting EPSV and the 226 response from the server. This is pretty suspicious, and leads credence to the theory that there's actually some bad connection lifetime management going on in the test server.

The ftp_sockctrl.log is probably the most explicit in showing what's going on. Some excerpts:

06:55:17.140000 Client disconnect
06:55:17.203000 Client connect
06:55:20.890000 '226 File transfer complete\r\n'

I'm not sure if the client here is actually the test server or curl, but it doesn't really matter. It's showing that it's sending a 226 code (response to a completed file transfer) as the first thing to a new client that just connected.

I expected the following

Test servers should not have race conditions.

curl/libcurl version

8.4.0-DEV

operating system

The sample log comes from mingw-w64, and this problem likely happens more often on Windows, but it may have happen on other OSes as well.

The text was updated successfully, but these errors were encountered:

dfandrich · 2023-10-06T20:14:29Z

Here's a similar log from another failed test (test 932, SMTP CRAM-MD5 graceful cancellation). In this one, the client receives the SMTP banner message 220 code twice in a row. The server log shows the server sending the 220, then

   13:10:18.765251 MAIN sockfilt said DISC
   13:10:18.765358 ====> Client disconnected
   13:10:18.765629 Awaiting input
   13:10:18.766160 ====> Client connect
   13:10:18.766493 FTPD: Getting commands from log/server.cmd
   13:10:18.766593 FTPD: instructed to support AUTHENTICATION command

i.e. the client supposedly disconnected then immediately reconnected. The curl log shows no such disconnecte/reconnect. The sockfilt log shows the a connect/disconnect/connect happening before the first 220 is sent, then the second 220 is sent without any intervening disconnect. So, the problem looks to be in the server and not the sockfilt.

This one is for a different build (still Windows) but CMake, VS2010 this time, but I can't say for certain that it's Windows-only.

dfandrich · 2023-10-06T21:59:31Z

Here's my current working theory for what's happening in the test 932 case above. This matches the logs nicely, although I can't match up the timing because hi-res timing isn't available in the Windows sockfilt.

client1 connects, sockfilt sends CNCT to server
client1 disconnects, sockfilt sends DISC to server †
new client2 connects, sockfilt sends CNCT to server
server wakes up, sees the first CNCT, sends 220 response to the wrong client2
sockfilt passes on 220 to client2
client2 sees 220, then responds with its next command, EHLO
server then gets the DISC, resets itself
server then sees the second CNCT, sends 220 response to the right client2 this time but they're now out of sync
client sees the 220 response, thinks that's for its EHLO command. Client is out of sync with the server. Client responds based on its 220 instead of the 250 it should have received and the test protocol section is broken.

† The only open question is why in step 2 the client disconnects. Could it be this same race condition with the test server's verifyftp() call? In any case, it doesn't really matter because an unexpected disconnect could happen for any reason and the sockfilt+server would get out of sync and potentially stay out of sync.

I can think of two easy(ish) ways to fix this:

Require an acknowledgement step after one side sends a DISC. Have both sides throw everything away until an ACKD is received.
Send a connection number along with every command, e.g. DISC 123 or DATA 123\nfe84\n to tie the command to a particular socket. The number would be monotonically increasing to avoid duplicates. If one side receives a command for a closed socket, it's thrown out.

The first one is probably the simplest.

If a client disconnected and reconnected quickly, before the ftp server has a chance to respond, the protocol message/ack (ping/pong) sequence got out of sync, causing messages sent to the old client to be delivered to the new. A disconnect must now be acknowledged and intermediate requests thrown out until it is, which ensures that such synchronization problems can't occur. This problem could affect ftp, pop3, imap and smtp tests. Fixes #12002 Closes #12049

mback2k · 2023-10-07T18:57:07Z

Thanks a lot for debugging and fixing this. 👍

dfandrich added FTP tests labels Oct 1, 2023

jay mentioned this issue Oct 3, 2023

cmake: fix unity with Windows Unicode + TrackMemory #11928

Closed

dfandrich self-assigned this Oct 7, 2023

dfandrich mentioned this issue Oct 7, 2023

tests: fix a race condition in ftp server disconnect #12049

Closed

dfandrich closed this as completed in f6513b9 Oct 7, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Data race in ftp test server #12002

Data race in ftp test server #12002

dfandrich commented Oct 1, 2023

dfandrich commented Oct 6, 2023

dfandrich commented Oct 6, 2023

mback2k commented Oct 7, 2023

Data race in ftp test server #12002

Data race in ftp test server #12002

Comments

dfandrich commented Oct 1, 2023

I did this

I expected the following

curl/libcurl version

operating system

dfandrich commented Oct 6, 2023

dfandrich commented Oct 6, 2023

mback2k commented Oct 7, 2023