PATCH: Fix request timer management
Date: Wed, 12 Jul 2017 12:17:11 -0300
While upgrading to the latest cURL 7.54.1, one of our own internal tests
was failing while exercising the new version. It seems that there are
some bugs in how timers are managed for a single easy handle that causes
the wrong "next timeout" value to be reported to the application when a
new minimum needs to be recomputed and that new minimum should be an
existing timer that isn't currently set for the easy handle. So, when
the application is driving a set of easy handles via the
`curl_multi_socket_action()` API (for example), it gets told to wait the
wrong amount of time before the next call, which causes requests to
linger for a long time (or, it is my guess, possibly forever).
From its symptoms, this issue is very similar to an issue I previously
reported on an older version:
https://curl.haxx.se/mail/lib-2014-02/0038.html The symptoms are
similar enough that the test program that demonstrates the bug
(attached) is almost identical. However, the root cause is different.
Note that the symptoms are very timing sensitive: if events and timers
are just so, the bug doesn't manifest itself. The faster your computer
and connection, the less likely you are to notice the problem, it seems.
From inspection of the code, it seems that cURL tries to keep a sorted
linked list of the outstanding (or expired) timers for an easy handle,
and then additionally a splay tree in the multi of the minimum timer
from each such linked list. Thus, it can easily find the soonest next
timer expiry time point and use it to tell the application when to
schedule the next timer callback into cURL. This approach makes sense to me.
The trouble comes in how cURL maintains the invariants of the list and
splay tree. Although undocumented, it seems that the current code
(7.54.1 and master) tries to keep the "most recent" timer for each easy
handle _only_ in the splay tree and not in the linked list.
Unfortunately, the splay tree entry for such a timer does not contain
enough information to properly recompute the minimum timer for an easy
handle in all circumstances. And, the splay tree entry doesn't seem to
be properly considered when adding new timer instances. There seem to
be a few subtle bugs in the code that manages the relationship between
the splay tree and linked lists.
The attached patch changes the invariant such that every non-expired
timer must always exist in the easy handle's linked list. The minimum
such timer value still exists in the splay tree. The code to maintain
this invariant seems to be simpler, and seems to avoid all the bugs in
the original. In summary:
* When a timer is seen to expire by `Curl_splaygetbest()`, the new splay
entry is now left in the linked list by `add_next_timeout()`.
* When a timer is scheduled by `Curl_expire()`, an entry is always left
in the linked list.
* Also, in `Curl_expire()`, the old code used to assume the splay tree
entry value (as present in `expiretime`) was for the same `id` as was
being updated and would sometimes copy it into the linked list as if it
were, even though this assumption does not seem to be true. (The call
to `multi_addtimeout()` was passing `nowp` instead of `set` and
presuming `nowp` had something to do with `id`.) This seems like it
could lead to confusion of what timers are set for what values and loss
of scheduled timers, or worse. The patch avoids this.
There seem to be no bad side effects to keeping all these timers in the
linked lists. Doing so makes the computation of the next minimum timer
trivial. The expired timer values continue to be removed in the same
way as before.
The `output-unpatched.txt` and `output-patched.txt` attachments show the
output of the `test.c` program that demonstrates the bug. In the
unpatched output, note how the 2000 ms timeout is ignored (because it's
been forgotten since it isn't kept in the linked list) and so the
program is told to wait far too long. (Inspecting the program state
here reveals that `Curl_timeleft()` has meanwhile properly computed the
remaining time for the request; it's just the splay tree and linked list
giving the wrong answer, so cURL doesn't tell application the correct
timeout value in the `CURLMOPT_TIMERFUNCTION`.)
The patch applies against 7.54.1. From inspection (only), it seems the
bug is present in older versions and in the current master version on
I hope this patch or something derived from it can be applied.
I also have a few observations and questions that came up during the
investigation of this bug which may or may help. Please take them as
they are intended as helpful ideas (and not criticisms):
1. After this patch, it seems like it might be worth replacing the easy
handle's `expiretime` value with a boolean indicating whether or not the
easy is represented in the splay tree. (Each easy's representative time
should always be that of the minimum timer in the list, I think.) This
patch does not attempt to make this change.
2. It seems odd that once an easy handle completes, it keeps timers in
its list and can contribute to the splay tree. Perhaps it would be best
to clear all timers from the list and splay tree when the easy handle
completes. This patch does not attempt to make this change.
3. It seems odd that once a DNS request completes (for example), its
timer (as scheduled by `Curl_expire()`) remains in the linked list.
(With the c-ares resolver, a non-configurable 300 second timer is
scheduled in what appears to be a safety mechanism, but it never gets
removed even when the DNS result arrives.) With this change (and debug
dumps of the linked list while experimenting) it became much more
obvious that this timer stays around. (This is, in fact, the source of
the ~299 second timer in the unpatched output from the test program.
Interestingly, if this timer _had_ been cleared, the test program would
likely never have terminated.) This patch does not attempt to change this.
-- Brad Spencer
- text/plain attachment: 10-timer-recompute-fix.diff
- text/plain attachment: output-patched.txt
- text/plain attachment: output-unpatched.txt
- text/plain attachment: test.c