Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

connect: stop halving the remaining timeout when less than N ms left #11693

Closed
wants to merge 1 commit into from

Conversation

bagder
Copy link
Member

@bagder bagder commented Aug 18, 2023

When curl wants to connect to a host, it always has a TIMEOUT. The maximum time it is allowed to spend until a connect is confirmed.

curl will try to connect to each of the IP adresses returned for the host. Two loops, one for each IP family.

During the connect loop, while curl has more than one IP address left to try within a single address family, curl has traditionally allowed (TIME-LEFT/2) for this connect attempt. This, to not get stuck on the initial addresses in case the timeout but still allow later addresses to get attempted.

This has the downside that when users set a very short timeout and the host has a large number of IP addresses, the effective result might be that every attempt gets a little too short time.

This change proposes that we stop doing the divided-by-two if the total time left is below a threshold. This threshold is 600 milliseconds in this initial commit.

To discuss:

A) a good/bad idea?
B) what's the threshold?
C) is there instead a better fix to do?

@bagder
Copy link
Member Author

bagder commented Aug 18, 2023

Alternative take: only do /2 for the first attempt and then go all the way for the rest

@bagder
Copy link
Member Author

bagder commented Aug 18, 2023

A more advanced and aggressive take: fire off two attempts in parallel, per IP family and allow both to use the full timeout period...

@icing
Copy link
Contributor

icing commented Aug 18, 2023

First, I agree that the /2 way is non-optimal when more than 2 addresses are available. So, what would be a good approach to partition the time for attempts?

If we believe there is a meaningful, minimal duration for an attempt, assigning less time is not productive. A ping to nghttp2.org from my place is 260ms for me. Statistics here and here list maximum ping times of 300ms, in extreme 360ms. Testing this site with eissing.org results in pings <300ms, except Seoul listed with >600ms.

Which I interpret as 300ms being a good limit for most parts of the net to succeed. 1000ms seems a safe limit.

For easy calculation, say someone starts a transfer with connect_timeout of 3sec. ipv4 and v6 are handled in parallel, so lets focus just on one family. What we wan then is:

addr found  duration 1st 2nd 3rd 4th 5th ... 10th 11th ... 20th
1 3000 - - - - - - -
2 1000 2000 - - - - - -
3 1000 1000 1000 - - - - -
4 750 750 750 750 - - - -
5 600 600 600 600 600 - - -
10 300 300 300 300 300 300 - -
20 300 300 300 300 300 300 x x

Read: if an attempt does not succeed after 1sec, and we have more to try, we move on. The last address tried takes all the remaining time.

We think 300ms is the minimum viable wait time. If we have less overall time than (addr_count *300ms), some will never be tried. But that should be ok. The connection timeout needs to be raised then.

Does that make sense? (Maybe 1200ms and 600ms are better values, but I think the distribution should work as outlined in the table.)

@bagder
Copy link
Member Author

bagder commented Aug 18, 2023

I was in Cornwall UK on a summer holiday maybe ten years ago. In my rental cabin there, I experienced 60000 millisecond pings to my mail server at home. Meaning: there are architectures that introduce substantial delays that cannot be explained simply by physical (cable) distance.

Maybe some users still want to be able to connect over that network - assuming that TCP allows it. I don't think we can assume that 1000ms will be enough to get through...

@icing
Copy link
Contributor

icing commented Aug 18, 2023

I was in Cornwall UK on a summer holiday maybe ten years ago. In my rental cabin there, I experienced 60000 millisecond pings to my mail server at home. Meaning: there are architectures that introduce substantial delays that cannot be explained simply by physical (cable) distance.

Maybe some users still want to be able to connect over that network - assuming that TCP allows it. I don't think we can assume that 1000ms will be enough to get through...

The last address will always get the remaining time. If you have a single address for your mail server, this proposal would wait the full connect timeout.

Shortening the first attempts and giving the remaining time to the last address should give a shorter average connect time. If the probability of success is the same for all addresses. AFAIK, there is no priority for A records in DNS, so one cannot assume that the first addresses are "better".

@bagder
Copy link
Member Author

bagder commented Aug 18, 2023

The last address will always get the remaining time. If you have a single address for your mail server, this proposal would wait the full connect timeout.

If the server needs 2000 ms to connect, and I set a 3000 ms timeout and that works when the host has a single IP. Then one day they instead announce 10 addresses and then curl can't connect anymore. Annoying, since 3000ms could be enough if we only knew how to spend the time...

@icing
Copy link
Contributor

icing commented Aug 18, 2023

The last address will always get the remaining time. If you have a single address for your mail server, this proposal would wait the full connect timeout.

If the server needs 2000 ms to connect, and I set a 3000 ms timeout and that works when the host has a single IP. Then one day they instead announce 10 addresses and then curl can't connect anymore. Annoying, since 3000ms could be enough if we only knew how to spend the time...

The proposed implementation would not work with 2 addresses, since each one would get 1500ms and this would fail in your example. So...I believe there is no time distribution working for all cases one can dream up, unless we go parallel for each address found.

@bagder
Copy link
Member Author

bagder commented Aug 18, 2023

Right, I think what I primarily want is to make sure that not all the time is wasted on the first address if there are more than one. The current /2 method works for this. If all addresses timeout however, the /2 method makes it next to impossible to connect to address number N (5? 6?) unless the timeout is set very large from the beginning.

My idea in this PR would be to avoid /2-shrinking the timeout down to silly levels.

@bagder bagder marked this pull request as ready for review August 19, 2023 17:54
@bagder
Copy link
Member Author

bagder commented Aug 19, 2023

This is probably not the ideal approach, but I think it improves the current situation at least a little.

I'm in favor of merging this so I made it a "real" PR now.

And we can continue the search and think about how this can be improved further in the future.

bagder added a commit that referenced this pull request Aug 21, 2023
When curl wants to connect to a host, it always has a TIMEOUT. The
maximum time it is allowed to spend until a connect is confirmed.

curl will try to connect to each of the IP adresses returned for the
host. Two loops, one for each IP family.

During the connect loop, while curl has more than one IP address left to
try within a single address family, curl has traditionally allowed
(TIME-LEFT/2) for *this* connect attempt. This, to not get stuck on the
initial addresses in case the timeout but still allow later addresses to
get attempted.

This has the downside that when users set a very short timeout and the
host has a large number of IP addresses, the effective result might be
that every attempt gets a little too short time.

This change stop doing the divided-by-two if the total time left is
below a threshold. This threshold is 600 milliseconds.

Closes #11693
@jay
Copy link
Member

jay commented Aug 21, 2023

I wouldn't get caught up in the ping times. Some transfers are taking place over local networks where there's hardly any latency.

Copy link
Member

@jay jay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest have a macro like this in ALLCAPS to signal a variable can be evaluated multiple times, even if it's unlikely here it's still best practice. Like which is easier to spot the bug isspace(*foo++) ISSPACE(*foo++)

When curl wants to connect to a host, it always has a TIMEOUT. The
maximum time it is allowed to spend until a connect is confirmed.

curl will try to connect to each of the IP adresses returned for the
host. Two loops, one for each IP family.

During the connect loop, while curl has more than one IP address left to
try within a single address family, curl has traditionally allowed
(TIME-LEFT/2) for *this* connect attempt. This, to not get stuck on the
initial addresses in case the timeout but still allow later addresses to
get attempted.

This has the downside that when users set a very short timeout and the
host has a large number of IP addresses, the effective result might be
that every attempt gets a little too short time.

This change stop doing the divided-by-two if the total time left is
below a threshold. This threshold is 600 milliseconds.

Closes #11693
@bagder
Copy link
Member Author

bagder commented Aug 21, 2023

I suggest have a macro like this in ALLCAPS

Good call. I agree.

@bagder bagder closed this in 748da39 Aug 29, 2023
@bagder bagder deleted the bagder/timeout-tweak branch August 29, 2023 08:44
ptitSeb pushed a commit to wasix-org/curl that referenced this pull request Sep 25, 2023
When curl wants to connect to a host, it always has a TIMEOUT. The
maximum time it is allowed to spend until a connect is confirmed.

curl will try to connect to each of the IP adresses returned for the
host. Two loops, one for each IP family.

During the connect loop, while curl has more than one IP address left to
try within a single address family, curl has traditionally allowed (time
left/2) for *this* connect attempt. This, to not get stuck on the
initial addresses in case the timeout but still allow later addresses to
get attempted.

This has the downside that when users set a very short timeout and the
host has a large number of IP addresses, the effective result might be
that every attempt gets a little too short time.

This change stop doing the divided-by-two if the total time left is
below a threshold. This threshold is 600 milliseconds.

Closes curl#11693
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

None yet

3 participants