Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Two test servers, one lock #11231

Closed
dfandrich opened this issue May 31, 2023 · 3 comments
Closed

Two test servers, one lock #11231

dfandrich opened this issue May 31, 2023 · 3 comments
Assignees
Labels

Comments

@dfandrich
Copy link
Contributor

The test servers use a single file serverlogs.lock to indicate when a test server is still writing its logs to prevent the test harness from looking at them too soon. However, many tests use more than one server running in parallel (e.g. http and http-proxy) and the first one to finish clears the lock for both of them. This means that the test harness can look at the logs before they are all ready. I believe this is what causes many of the flaky test failures we often see in the CI tests.

For example, in one particular test failure I looked at in test 1904, the proxy.input file started with the line [DISCONNECT]. This should never happen because there is no client connected when the test starts. Furthermore, test 1904 does NOT set connection-monitor without which [DISCONNECT] should not be printed. What must have happened was that a previous test set connection-monitor but the other server cleared the log lock before it had a chance to write that final message and only write it after the log files were deleted and the next test (1904) had already started.

Many other flaky test failures occur because a final QUIT command is not logged. This may also be explained if another server clears the log lock (causing the test harness to look at all the log files too soon) before that server gets a chance to write the final command.

More evidence for this is the server log line:

Error removing lock file log9/serverlogs.lock error: 2 No such file or directory

indicating that the lock file is expected to exist but was somehow deleted without the server doing it. The most likely explanation is that the other server deleted it.

One problem with this theory in the general case is that this kind of problem also happens on tests that only have a single server (e.g. ftp test 135). Some of these use sockfilt in addition to the main test server, but it doesn't appear that sockfilt engages in the log file locking scheme.

One solution to this problem is to allow multiple lock files, probably one fixed lock file per server, and have the test harness look for and wait for them all. Another is to have all servers set fcntl(1) locks on a single file, but this may not be portable enough and is probably why the file approach was chosen from the start.

@bagder
Copy link
Member

bagder commented Jun 1, 2023

The fcntl() method does not seem to exist on Windows for example.

Maybe the easiest (?) approach is to have all test servers create their own "lock file" in a dedicated per-runner lock-directory and have runtests.pl verify that the entire locks-directory is empty before it proceeds?

@bagder
Copy link
Member

bagder commented Jun 1, 2023

@dfandrich do you want to take a stab at converting over to a multi-file system or should I?

@dfandrich
Copy link
Contributor Author

dfandrich commented Jun 1, 2023 via email

dfandrich added a commit that referenced this issue Jun 4, 2023
Parallel testing is enabled by using a nonzero value for the -j option
to runtests.pl. Performant values seem to be about 7*num CPU cores, or
1.3*num CPU cores if Valgrind is in use.

Flaky tests due to improper log locking (bug #11231) are exacerbated
while parallel testing, so it is not enabled by default yet.

Ref: #10818
dfandrich added a commit that referenced this issue Jun 4, 2023
Parallel testing is enabled by using a nonzero value for the -j option
to runtests.pl. Performant values seem to be about 7*num CPU cores, or
1.3*num CPU cores if Valgrind is in use.

Flaky tests due to improper log locking (bug #11231) are exacerbated
while parallel testing, so it is not enabled by default yet.

Ref: #10818
Closes #11246
dfandrich added a commit that referenced this issue Jun 5, 2023
Parallel testing is enabled by using a nonzero value for the -j option
to runtests.pl. Performant values seem to be about 7*num CPU cores, or
1.3*num CPU cores if Valgrind is in use.

Flaky tests due to improper log locking (bug #11231) are exacerbated
while parallel testing, so it is not enabled by default yet.

Fixes #10818
Closes #11246
@dfandrich dfandrich self-assigned this Jun 5, 2023
dfandrich added a commit that referenced this issue Jun 5, 2023
Logs are written by several servers and all of them must be finished
writing before the test results can be determined. This means each
server must have its own lock file rather than sharing a single one
which is how it was done up to now. Previously, the first server to
complete a test would clear the lock which caused occasional flaky
tests. Lock files are now all found in their own directory, so counting
locks equals counting the files in that directory.

Fixes #11231
Closes #XXX
dfandrich added a commit that referenced this issue Jun 6, 2023
Logs are written by several servers and all of them must be finished
writing before the test results can be determined. This means each
server must have its own lock file rather than sharing a single one
which is how it was done up to now. Previously, the first server to
complete a test would clear the lock which caused occasional flaky
tests. Lock files are now all found in their own directory, so counting
locks equals counting the files in that directory.

Fixes #11231
Closes #XXX
dfandrich added a commit that referenced this issue Jun 7, 2023
Logs are written by several servers and all of them must be finished
writing before the test results can be determined. This means each
server must have its own lock file rather than sharing a single one,
which is how it was done up to now. Previously, the first server to
complete a test would clear the lock before the other server was done,
which caused flaky tests.

Lock files are now all found in their own directory, so counting locks
equals counting the files in that directory.  The result is that the
proxy logs are now reliably written which actually changes the expected
output for two tests.

Fixes #11231
Closes #11259
bch pushed a commit to bch/curl that referenced this issue Jul 19, 2023
Parallel testing is enabled by using a nonzero value for the -j option
to runtests.pl. Performant values seem to be about 7*num CPU cores, or
1.3*num CPU cores if Valgrind is in use.

Flaky tests due to improper log locking (bug curl#11231) are exacerbated
while parallel testing, so it is not enabled by default yet.

Fixes curl#10818
Closes curl#11246
bch pushed a commit to bch/curl that referenced this issue Jul 19, 2023
Logs are written by several servers and all of them must be finished
writing before the test results can be determined. This means each
server must have its own lock file rather than sharing a single one,
which is how it was done up to now. Previously, the first server to
complete a test would clear the lock before the other server was done,
which caused flaky tests.

Lock files are now all found in their own directory, so counting locks
equals counting the files in that directory.  The result is that the
proxy logs are now reliably written which actually changes the expected
output for two tests.

Fixes curl#11231
Closes curl#11259
ptitSeb pushed a commit to wasix-org/curl that referenced this issue Sep 25, 2023
Parallel testing is enabled by using a nonzero value for the -j option
to runtests.pl. Performant values seem to be about 7*num CPU cores, or
1.3*num CPU cores if Valgrind is in use.

Flaky tests due to improper log locking (bug curl#11231) are exacerbated
while parallel testing, so it is not enabled by default yet.

Fixes curl#10818
Closes curl#11246
ptitSeb pushed a commit to wasix-org/curl that referenced this issue Sep 25, 2023
Logs are written by several servers and all of them must be finished
writing before the test results can be determined. This means each
server must have its own lock file rather than sharing a single one,
which is how it was done up to now. Previously, the first server to
complete a test would clear the lock before the other server was done,
which caused flaky tests.

Lock files are now all found in their own directory, so counting locks
equals counting the files in that directory.  The result is that the
proxy logs are now reliably written which actually changes the expected
output for two tests.

Fixes curl#11231
Closes curl#11259
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Development

Successfully merging a pull request may close this issue.

2 participants