
Alter timeout ordering with connections #7178

Closed

Conversation

tomwatson1024

@richardwhiuk made this fix in the @Metaswitch fork of curl at https://github.com/Metaswitch/curl a few years ago - see Metaswitch#2. Since then it's had a great deal of soak testing in our production code. As part of getting ourselves off the fork, we'd like to upstream the fix.

This change:

  • Checks whether a connection has succeeded before checking whether it has
    timed out.

    This means that if we connected quickly, but were subsequently descheduled,
    we still allow the connection to succeed. Note that if we have timed out,
    but the connection succeeds between the timeout check and the success
    check, we allow it to go ahead. This is viewed as an acceptable trade-off.

  • Adds additional failf logging around failed connection attempts to
    propagate the cause up to the caller.

More info in the linked PR.
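For illustration, a minimal standalone sketch of the reordering (this is not the actual Curl_is_connected() code; the function names and the getsockopt-based success probe are just stand-ins):

#include <errno.h>
#include <stdbool.h>
#include <sys/socket.h>

/* Illustrative only: has the non-blocking connect on sockfd completed? */
bool connect_completed(int sockfd, int *err)
{
  socklen_t len = sizeof(*err);
  if(getsockopt(sockfd, SOL_SOCKET, SO_ERROR, err, &len) != 0)
    return false;
  return *err == 0;
}

/* New order: ask about success first, so a connect that finished while the
   process was de-scheduled is accepted; the timeout only applies to
   connections that are genuinely still pending. */
int check_connection(int sockfd, long elapsed_ms, long timeout_ms)
{
  int err = 0;
  if(connect_completed(sockfd, &err))
    return 0;            /* connected - accept it even if elapsed > timeout */
  if(elapsed_ms > timeout_ms)
    return ETIMEDOUT;    /* still pending and out of time */
  return EINPROGRESS;    /* still pending, keep waiting */
}

The old order performed the elapsed_ms > timeout_ms check before ever asking whether the connect had already completed.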

@bagder
Member

bagder commented Jun 2, 2021

build failures. example

connect.c: In function 'Curl_is_connected':
connect.c:989:17: error: 'allow' may be used uninitialized in this function [-Werror=maybe-uninitialized]
  988 |         conn->timeoutms_per_addr[i] = conn->tempaddr[i]->ai_next == NULL ?
      |                                       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  989 |           allow : allow / 2;
      |           ~~~~~~^~~~~~~~~~~
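For context, a reduced, hypothetical illustration of what gcc is objecting to (the names below are made up, not the real connect.c variables): allow is only written on one branch before the ternary reads it, so -Werror=maybe-uninitialized fails the build; assigning it on every path avoids the error.

#include <stdbool.h>

long pick_timeout(bool have_deadline, bool last_address, long time_left_ms)
{
  long allow;                     /* only assigned if have_deadline is true */
  if(have_deadline)
    allow = time_left_ms;
  /* gcc: 'allow' may be used uninitialized in this function */
  return last_address ? allow : allow / 2;
}

long pick_timeout_fixed(bool have_deadline, bool last_address, long time_left_ms)
{
  long allow = have_deadline ? time_left_ms : 0;  /* written on every path */
  return last_address ? allow : allow / 2;
}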

@bagder
Member

bagder commented Jun 2, 2021

I would like to learn more specifics of the exact benefits you see from using this. You're adjusting the connect timeout check, but how big a difference does this actually make, and what timeouts do you use? Couldn't you also just increase your connection timeout a few milliseconds to avoid the problems?

This code somewhat obfuscates the timeout handling, so I think it had better be worth it - and then I need you to explain to us how it is!

@MartinHowarth

Thanks for your patience - taking over from Tom on this.

This issue has a particular impact on heavily loaded systems. We were seeing that the kernel had successfully set up the connection within the timeout (in ~100 microseconds), but the process that asked for it (via curl) was de-scheduled for longer than the connection timeout because the system was heavily loaded. We set a relatively aggressive timeout (50ms) so we can quickly notice that the remote is dead and switch to a different one. However, it's common to see a process being de-scheduled for >100ms. When the process is scheduled again, curl sees that 100ms > 50ms and marks the connection as timed out, even though the connection was set up within the timeout.

The alternative of setting the timeout to be longer to account for this means we have an unacceptably long time to spot that a peer is down.

This PR changes the order of checking for connection success: instead of checking for timeout before success, check for success before timeout.

In terms of stability/testing, we've been using this change on our fork with high-load customers for years.
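For reference, this is the kind of configuration being described; a minimal sketch using the public libcurl API (the 50ms value and the URL are illustrative, not our actual production settings):

#include <curl/curl.h>

int main(void)
{
  CURL *curl;
  CURLcode res = CURLE_FAILED_INIT;

  curl_global_init(CURL_GLOBAL_DEFAULT);
  curl = curl_easy_init();
  if(curl) {
    curl_easy_setopt(curl, CURLOPT_URL, "http://example.com/");
    /* aggressive connect timeout so a dead peer is spotted quickly */
    curl_easy_setopt(curl, CURLOPT_CONNECTTIMEOUT_MS, 50L);
    res = curl_easy_perform(curl);
    curl_easy_cleanup(curl);
  }
  curl_global_cleanup();
  return (int)res;
}

With the old ordering, a transfer like this could be failed with a connect timeout even though the kernel completed the handshake well within 50ms, simply because the process was de-scheduled before curl got to observe the result.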

I'm starting to look into fixing the CI complaints now :)

@MartinHowarth

Looks like 2 tests still fail (2100, 575) - I can't work out how these could relate to these changes :(

Are they known to be flaky tests (they didn't fail on all platforms), or could you give me some pointers? Thanks!

@bagder
Member

bagder commented Jun 21, 2021

Looks like 2 tests still fail (2100, 575) - I can't work out how these could relate to these changes

A few of the Windows builds are unfortunately flaky, so these failures are not because of your changes.

lib/multi.c Outdated
@@ -1700,47 +1751,13 @@ static CURLMcode multi_runsingle(struct Curl_multi *multi,
if(data->conn &&
(data->mstate >= MSTATE_CONNECT) &&
(data->mstate < MSTATE_COMPLETED)) {
/* We defer handling the connection timeout to later, to see if the
* connection has actually succeeded.
* See https://github.com/Metaswitch/curl/pull/2 for original changes */
Member

Please don't refer to an external URL (that will vanish at some random point in time). If there's an additional explanation available that helps here, write it in the comment too.

I've updated this comment to be more useful

multi_ischanged(multi, false)) {
/* We now handle stream timeouts if and only if this will be the last
* loop iteration */
multi_handle_timeout(data, nowp, &stream_error, &result, TRUE);
Member

The TRUE in there will make the function check the timeout for the connection phase, and yet you also call this function in the MSTATE_DO state, which has already connected. Surely that makes this too trigger-happy?

This is preserving the existing behaviour. The previous check at line 1706 included checking the connect timeout in the MSTATE_DO state.

Member

Ah yes, I see. That's certainly a bug that surely in itself could've contributed to the problems you've had! The DO state is past the connect, so it is just wrong to check the connect timeout in that state.

Thanks - I've changed the check to data->mstate < MSTATE_DO to exclude the DO state
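Roughly, the distinction being made (a self-contained sketch with a stand-in enum, not the real state list from lib/multi.c): < MSTATE_DO restricts the connect-timeout check to states before DO, whereas <= MSTATE_DO still applied it in the DO state.

#include <stdbool.h>
#include <stdio.h>

/* stand-in for (part of) the multi-handle state machine */
typedef enum {
  MSTATE_CONNECT,      /* start connecting */
  MSTATE_CONNECTING,   /* waiting for the connect to finish */
  MSTATE_DO,           /* connected, issuing the request */
  MSTATE_COMPLETED
} mstate;

/* the connect timeout should only be enforced while still connecting */
bool connect_timeout_applies(mstate s)
{
  return s >= MSTATE_CONNECT && s < MSTATE_DO;   /* excludes MSTATE_DO */
}

int main(void)
{
  printf("CONNECTING: %d\n", connect_timeout_applies(MSTATE_CONNECTING)); /* 1 */
  printf("DO:         %d\n", connect_timeout_applies(MSTATE_DO));         /* 0 */
  return 0;
}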

lib/multi.c Outdated
data->mstate >= MSTATE_CONNECT &&
data->mstate <= MSTATE_DO &&
rc != CURLM_CALL_MULTI_PERFORM &&
multi_ischanged(multi, false)) {
Member

Why is the multi_ischanged a requirement for this check? For most "normal" cases that's never true, and then you don't check the timeout here at all...

Good question! I agree with your assessment here, and I've changed this to !multi_ischanged(multi, false) because I think it was supposed to match the opposite of the conditions for exiting the while loop, based on the comment right below.

I've also expanded the comment to explain the last two conditions in this check.

@bagder
Member

bagder commented Jun 21, 2021

./multi.c:1754:73: warning: Trailing whitespace (TRAILINGSPACE)
       /* Check for overall operation timeout here but defer handling the 
                                                                         ^
./multi.c:1756:71: warning: Trailing whitespace (TRAILINGSPACE)
        * in the window since we last checked timeout. This prevents us 
                                                                       ^
checksrc: 0 errors and 2 warnings

@MartinHowarth

I think the remaining failures are also flaky Windows tests - is there anything else I need to do to get this ready to merge?

@bagder
Member

bagder commented Jun 24, 2021

Let's just have the rest of the CI run, then I'll prepare for merge

@bagder bagder closed this in 0842175 Jun 24, 2021
@bagder
Member

bagder commented Jun 24, 2021

Thanks!

@bagder bagder removed the needs-info label Jun 24, 2021