connect: stop halving the remaining timeout when less than N ms left #11693

bagder · 2023-08-18T11:49:54Z

When curl wants to connect to a host, it always has a TIMEOUT. The maximum time it is allowed to spend until a connect is confirmed.

curl will try to connect to each of the IP adresses returned for the host. Two loops, one for each IP family.

During the connect loop, while curl has more than one IP address left to try within a single address family, curl has traditionally allowed (TIME-LEFT/2) for this connect attempt. This, to not get stuck on the initial addresses in case the timeout but still allow later addresses to get attempted.

This has the downside that when users set a very short timeout and the host has a large number of IP addresses, the effective result might be that every attempt gets a little too short time.

This change proposes that we stop doing the divided-by-two if the total time left is below a threshold. This threshold is 600 milliseconds in this initial commit.

To discuss:

A) a good/bad idea?
B) what's the threshold?
C) is there instead a better fix to do?

bagder · 2023-08-18T11:54:09Z

Alternative take: only do /2 for the first attempt and then go all the way for the rest

bagder · 2023-08-18T12:09:03Z

A more advanced and aggressive take: fire off two attempts in parallel, per IP family and allow both to use the full timeout period...

icing · 2023-08-18T12:39:59Z

First, I agree that the /2 way is non-optimal when more than 2 addresses are available. So, what would be a good approach to partition the time for attempts?

If we believe there is a meaningful, minimal duration for an attempt, assigning less time is not productive. A ping to nghttp2.org from my place is 260ms for me. Statistics here and here list maximum ping times of 300ms, in extreme 360ms. Testing this site with eissing.org results in pings <300ms, except Seoul listed with >600ms.

Which I interpret as 300ms being a good limit for most parts of the net to succeed. 1000ms seems a safe limit.

For easy calculation, say someone starts a transfer with connect_timeout of 3sec. ipv4 and v6 are handled in parallel, so lets focus just on one family. What we wan then is:

addr found	duration 1st	2nd	3rd	4th	5th	10th	11th	20th
1	3000	-	-	-	-	-	-	-
2	1000	2000	-	-	-	-	-	-
3	1000	1000	1000	-	-	-	-	-
4	750	750	750	750	-	-	-	-
5	600	600	600	600	600	-	-	-
10	300	300	300	300	300	300	-	-
20	300	300	300	300	300	300	x	x

Read: if an attempt does not succeed after 1sec, and we have more to try, we move on. The last address tried takes all the remaining time.

We think 300ms is the minimum viable wait time. If we have less overall time than (addr_count *300ms), some will never be tried. But that should be ok. The connection timeout needs to be raised then.

Does that make sense? (Maybe 1200ms and 600ms are better values, but I think the distribution should work as outlined in the table.)

bagder · 2023-08-18T12:49:58Z

I was in Cornwall UK on a summer holiday maybe ten years ago. In my rental cabin there, I experienced 60000 millisecond pings to my mail server at home. Meaning: there are architectures that introduce substantial delays that cannot be explained simply by physical (cable) distance.

Maybe some users still want to be able to connect over that network - assuming that TCP allows it. I don't think we can assume that 1000ms will be enough to get through...

icing · 2023-08-18T12:55:09Z

I was in Cornwall UK on a summer holiday maybe ten years ago. In my rental cabin there, I experienced 60000 millisecond pings to my mail server at home. Meaning: there are architectures that introduce substantial delays that cannot be explained simply by physical (cable) distance.

Maybe some users still want to be able to connect over that network - assuming that TCP allows it. I don't think we can assume that 1000ms will be enough to get through...

The last address will always get the remaining time. If you have a single address for your mail server, this proposal would wait the full connect timeout.

Shortening the first attempts and giving the remaining time to the last address should give a shorter average connect time. If the probability of success is the same for all addresses. AFAIK, there is no priority for A records in DNS, so one cannot assume that the first addresses are "better".

bagder · 2023-08-18T13:05:06Z

The last address will always get the remaining time. If you have a single address for your mail server, this proposal would wait the full connect timeout.

If the server needs 2000 ms to connect, and I set a 3000 ms timeout and that works when the host has a single IP. Then one day they instead announce 10 addresses and then curl can't connect anymore. Annoying, since 3000ms could be enough if we only knew how to spend the time...

icing · 2023-08-18T13:10:19Z

The last address will always get the remaining time. If you have a single address for your mail server, this proposal would wait the full connect timeout.

If the server needs 2000 ms to connect, and I set a 3000 ms timeout and that works when the host has a single IP. Then one day they instead announce 10 addresses and then curl can't connect anymore. Annoying, since 3000ms could be enough if we only knew how to spend the time...

The proposed implementation would not work with 2 addresses, since each one would get 1500ms and this would fail in your example. So...I believe there is no time distribution working for all cases one can dream up, unless we go parallel for each address found.

bagder · 2023-08-18T13:50:25Z

Right, I think what I primarily want is to make sure that not all the time is wasted on the first address if there are more than one. The current /2 method works for this. If all addresses timeout however, the /2 method makes it next to impossible to connect to address number N (5? 6?) unless the timeout is set very large from the beginning.

My idea in this PR would be to avoid /2-shrinking the timeout down to silly levels.

bagder · 2023-08-19T17:55:53Z

This is probably not the ideal approach, but I think it improves the current situation at least a little.

I'm in favor of merging this so I made it a "real" PR now.

And we can continue the search and think about how this can be improved further in the future.

When curl wants to connect to a host, it always has a TIMEOUT. The maximum time it is allowed to spend until a connect is confirmed. curl will try to connect to each of the IP adresses returned for the host. Two loops, one for each IP family. During the connect loop, while curl has more than one IP address left to try within a single address family, curl has traditionally allowed (TIME-LEFT/2) for *this* connect attempt. This, to not get stuck on the initial addresses in case the timeout but still allow later addresses to get attempted. This has the downside that when users set a very short timeout and the host has a large number of IP addresses, the effective result might be that every attempt gets a little too short time. This change stop doing the divided-by-two if the total time left is below a threshold. This threshold is 600 milliseconds. Closes #11693

jay · 2023-08-21T17:40:15Z

I wouldn't get caught up in the ping times. Some transfers are taking place over local networks where there's hardly any latency.

jay

I suggest have a macro like this in ALLCAPS to signal a variable can be evaluated multiple times, even if it's unlikely here it's still best practice. Like which is easier to spot the bug isspace(*foo++) ISSPACE(*foo++)

When curl wants to connect to a host, it always has a TIMEOUT. The maximum time it is allowed to spend until a connect is confirmed. curl will try to connect to each of the IP adresses returned for the host. Two loops, one for each IP family. During the connect loop, while curl has more than one IP address left to try within a single address family, curl has traditionally allowed (TIME-LEFT/2) for *this* connect attempt. This, to not get stuck on the initial addresses in case the timeout but still allow later addresses to get attempted. This has the downside that when users set a very short timeout and the host has a large number of IP addresses, the effective result might be that every attempt gets a little too short time. This change stop doing the divided-by-two if the total time left is below a threshold. This threshold is 600 milliseconds. Closes #11693

bagder · 2023-08-21T17:55:26Z

I suggest have a macro like this in ALLCAPS

Good call. I agree.

When curl wants to connect to a host, it always has a TIMEOUT. The maximum time it is allowed to spend until a connect is confirmed. curl will try to connect to each of the IP adresses returned for the host. Two loops, one for each IP family. During the connect loop, while curl has more than one IP address left to try within a single address family, curl has traditionally allowed (time left/2) for *this* connect attempt. This, to not get stuck on the initial addresses in case the timeout but still allow later addresses to get attempted. This has the downside that when users set a very short timeout and the host has a large number of IP addresses, the effective result might be that every attempt gets a little too short time. This change stop doing the divided-by-two if the total time left is below a threshold. This threshold is 600 milliseconds. Closes curl#11693

github-actions bot added the connecting & proxies label Aug 18, 2023

bagder marked this pull request as ready for review August 19, 2023 17:54

bagder force-pushed the bagder/timeout-tweak branch from b3e7c2c to 7522a72 Compare August 21, 2023 15:28

jay approved these changes Aug 21, 2023

View reviewed changes

bagder force-pushed the bagder/timeout-tweak branch from 7522a72 to 0c802f4 Compare August 21, 2023 17:55

bagder closed this in 748da39 Aug 29, 2023

bagder deleted the bagder/timeout-tweak branch August 29, 2023 08:44

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

connect: stop halving the remaining timeout when less than N ms left #11693

connect: stop halving the remaining timeout when less than N ms left #11693

bagder commented Aug 18, 2023

bagder commented Aug 18, 2023

bagder commented Aug 18, 2023 •

edited

icing commented Aug 18, 2023 •

edited

bagder commented Aug 18, 2023

icing commented Aug 18, 2023 •

edited

bagder commented Aug 18, 2023 •

edited

icing commented Aug 18, 2023

bagder commented Aug 18, 2023

bagder commented Aug 19, 2023

jay commented Aug 21, 2023

jay left a comment

bagder commented Aug 21, 2023

connect: stop halving the remaining timeout when less than N ms left #11693

connect: stop halving the remaining timeout when less than N ms left #11693

Conversation

bagder commented Aug 18, 2023

bagder commented Aug 18, 2023

bagder commented Aug 18, 2023 • edited

icing commented Aug 18, 2023 • edited

bagder commented Aug 18, 2023

icing commented Aug 18, 2023 • edited

bagder commented Aug 18, 2023 • edited

icing commented Aug 18, 2023

bagder commented Aug 18, 2023

bagder commented Aug 19, 2023

jay commented Aug 21, 2023

jay left a comment

Choose a reason for hiding this comment

bagder commented Aug 21, 2023

bagder commented Aug 18, 2023 •

edited

icing commented Aug 18, 2023 •

edited

icing commented Aug 18, 2023 •

edited

bagder commented Aug 18, 2023 •

edited