Two test servers, one lock #11231

dfandrich · 2023-05-31T19:19:16Z

The test servers use a single file serverlogs.lock to indicate when a test server is still writing its logs to prevent the test harness from looking at them too soon. However, many tests use more than one server running in parallel (e.g. http and http-proxy) and the first one to finish clears the lock for both of them. This means that the test harness can look at the logs before they are all ready. I believe this is what causes many of the flaky test failures we often see in the CI tests.

For example, in one particular test failure I looked at in test 1904, the proxy.input file started with the line [DISCONNECT]. This should never happen because there is no client connected when the test starts. Furthermore, test 1904 does NOT set connection-monitor without which [DISCONNECT] should not be printed. What must have happened was that a previous test set connection-monitor but the other server cleared the log lock before it had a chance to write that final message and only write it after the log files were deleted and the next test (1904) had already started.

Many other flaky test failures occur because a final QUIT command is not logged. This may also be explained if another server clears the log lock (causing the test harness to look at all the log files too soon) before that server gets a chance to write the final command.

More evidence for this is the server log line:

Error removing lock file log9/serverlogs.lock error: 2 No such file or directory

indicating that the lock file is expected to exist but was somehow deleted without the server doing it. The most likely explanation is that the other server deleted it.

One problem with this theory in the general case is that this kind of problem also happens on tests that only have a single server (e.g. ftp test 135). Some of these use sockfilt in addition to the main test server, but it doesn't appear that sockfilt engages in the log file locking scheme.

One solution to this problem is to allow multiple lock files, probably one fixed lock file per server, and have the test harness look for and wait for them all. Another is to have all servers set fcntl(1) locks on a single file, but this may not be portable enough and is probably why the file approach was chosen from the start.

The text was updated successfully, but these errors were encountered:

bagder · 2023-06-01T12:56:46Z

The fcntl() method does not seem to exist on Windows for example.

Maybe the easiest (?) approach is to have all test servers create their own "lock file" in a dedicated per-runner lock-directory and have runtests.pl verify that the entire locks-directory is empty before it proceeds?

bagder · 2023-06-01T21:53:31Z

@dfandrich do you want to take a stab at converting over to a multi-file system or should I?

dfandrich · 2023-06-01T23:24:10Z

Parallel testing changes the test timing so much that this issue causes many more flaky tests. Since this is basically blocking proper use of that feature, I'll tackle this one to get that unblocked. I'll still make a PR with parallel testing available shortly since it's still useful in this state, even if imperfect.

Parallel testing is enabled by using a nonzero value for the -j option to runtests.pl. Performant values seem to be about 7*num CPU cores, or 1.3*num CPU cores if Valgrind is in use. Flaky tests due to improper log locking (bug #11231) are exacerbated while parallel testing, so it is not enabled by default yet. Ref: #10818

Parallel testing is enabled by using a nonzero value for the -j option to runtests.pl. Performant values seem to be about 7*num CPU cores, or 1.3*num CPU cores if Valgrind is in use. Flaky tests due to improper log locking (bug #11231) are exacerbated while parallel testing, so it is not enabled by default yet. Ref: #10818 Closes #11246

Parallel testing is enabled by using a nonzero value for the -j option to runtests.pl. Performant values seem to be about 7*num CPU cores, or 1.3*num CPU cores if Valgrind is in use. Flaky tests due to improper log locking (bug #11231) are exacerbated while parallel testing, so it is not enabled by default yet. Fixes #10818 Closes #11246

Logs are written by several servers and all of them must be finished writing before the test results can be determined. This means each server must have its own lock file rather than sharing a single one which is how it was done up to now. Previously, the first server to complete a test would clear the lock which caused occasional flaky tests. Lock files are now all found in their own directory, so counting locks equals counting the files in that directory. Fixes #11231 Closes #XXX

Logs are written by several servers and all of them must be finished writing before the test results can be determined. This means each server must have its own lock file rather than sharing a single one, which is how it was done up to now. Previously, the first server to complete a test would clear the lock before the other server was done, which caused flaky tests. Lock files are now all found in their own directory, so counting locks equals counting the files in that directory. The result is that the proxy logs are now reliably written which actually changes the expected output for two tests. Fixes #11231 Closes #11259

Parallel testing is enabled by using a nonzero value for the -j option to runtests.pl. Performant values seem to be about 7*num CPU cores, or 1.3*num CPU cores if Valgrind is in use. Flaky tests due to improper log locking (bug curl#11231) are exacerbated while parallel testing, so it is not enabled by default yet. Fixes curl#10818 Closes curl#11246

Logs are written by several servers and all of them must be finished writing before the test results can be determined. This means each server must have its own lock file rather than sharing a single one, which is how it was done up to now. Previously, the first server to complete a test would clear the lock before the other server was done, which caused flaky tests. Lock files are now all found in their own directory, so counting locks equals counting the files in that directory. The result is that the proxy logs are now reliably written which actually changes the expected output for two tests. Fixes curl#11231 Closes curl#11259

Parallel testing is enabled by using a nonzero value for the -j option to runtests.pl. Performant values seem to be about 7*num CPU cores, or 1.3*num CPU cores if Valgrind is in use. Flaky tests due to improper log locking (bug curl#11231) are exacerbated while parallel testing, so it is not enabled by default yet. Fixes curl#10818 Closes curl#11246

Logs are written by several servers and all of them must be finished writing before the test results can be determined. This means each server must have its own lock file rather than sharing a single one, which is how it was done up to now. Previously, the first server to complete a test would clear the lock before the other server was done, which caused flaky tests. Lock files are now all found in their own directory, so counting locks equals counting the files in that directory. The result is that the proxy logs are now reliably written which actually changes the expected output for two tests. Fixes curl#11231 Closes curl#11259

dfandrich added the tests label May 31, 2023

dfandrich mentioned this issue Jun 4, 2023

runtests: create multiple test runners when requested #11246

Closed

dfandrich mentioned this issue Jun 5, 2023

Tests take a long time to run #10818

Closed

dfandrich self-assigned this Jun 5, 2023

dfandrich mentioned this issue Jun 7, 2023

runtests; give each server a unique log lock file #11259

Closed

dfandrich closed this as completed in 7d62f0d Jun 8, 2023

dfandrich mentioned this issue Jun 17, 2023

Test reliability needs improvement #11328

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Two test servers, one lock #11231

Two test servers, one lock #11231

dfandrich commented May 31, 2023

bagder commented Jun 1, 2023

bagder commented Jun 1, 2023

dfandrich commented Jun 1, 2023 via email

Two test servers, one lock #11231

Two test servers, one lock #11231

Comments

dfandrich commented May 31, 2023

bagder commented Jun 1, 2023

bagder commented Jun 1, 2023

dfandrich commented Jun 1, 2023 via email