Navigation Menu

Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Avoid using free'd easy handles referenced from connection cache #2669

Conversation

dtzWill
Copy link
Contributor

@dtzWill dtzWill commented Jun 18, 2018

For a while now I've been trying to sort out a problem with Nix's usage of libcurl,
one that is especially apparent when building with address sanitizer enabled
or building against musl.


The commits in this PR appear to "resolve" the issue but I don't believe are really the right answer,
as they only mask the real problem: free'd handles end up being referred to from the connection cache.


The usage is this:

  • we create a multi handle that lives more or less forever
  • single thread handles everything curl-related, processes
    a queue of requests populated by other threads and for each
    creates an easy handle and adds to multi.
  • A mapping of easy handle to our own datastructure is maintained, AFAIK this works
    (similar to the ASIO example or one of those, as I recall)
  • Loop is basically this:
    • curl_multi_perform
    • Read all available messages from curl_multi_info_read, freeing the handles as we go
    • curl_multi_wait (we add an extra FD that's a wake-up pipe when new requests are added)
    • Create handles for any new requests and add to the multi handle
    • (minor: check if thread has been asked to quit, otherwise keep looping)

We make many many requests that use http2/openssl and enable multiplexing (but not http1 pipelining).

If it matters we also enable PIPEWAIT and MAXCONNECTS is set to 25.


Eventually, and almost immediately with ASAN or quickly with musl, it appears the memory for a deallocated handle
is used for something else and everything goes south when curl attempts to "check if the connection is dead"
which involves -- for http2+ssl at least-- using data-structures and buffers that are no longer present.

I guess glibc is either less quick to recycle free'd memory or its use of thread-local arenas mostly
mitigates the problem since the thread creating/free'ing curl handles really does nothing else and so
is less likely to run into problems.

These commits workaround this by checking if the magic bit is unset, but that of course only works sometimes
and still requires accessing free'd memory and hoping for the best.


Here's an example report from address sanitizer: https://gist.github.com/dtzWill/4b02c8943ec3be4875ab54186e2cc176


I see that in general it's known to be problematic to remove handles possibly involved in pipelines,
but I thought I'd double-check that this particular problem is known/expected and to perhaps ask
if there's a recommendation for avoiding these problems.

As far as I can tell the easy handles added to a multi handle are expected to outlive the multi-handle
(although this is not mentioned elsewhere, so I'd appreciate confirmation) since they may be referenced
from the multi handle's connection cache indefinitely and there is no way to clear or query this cache.

I tried using a pool of ~100 connection handles --and basically never free them (curl_easy_cleanup) until we're entirely done.
This seems to mitigate the issue but I don't know that this can be relied upon as it may only appear to work
by making it less problematic when multi handle attempts to deference easy handles that have been removed
(since the pool ensures this memory will likely be unused for at least a while) but in the worst case
might break if the cache refers to a handle long enough for us to cycle through the pool, in which case
the connection check might do something silly like attempt to read into the buffer of a new handle
that may or may not really be part of the cached connection referencing it.

Anyway-- I suppose this is a bug report of sorts, but also a request for insights into what we're doing wrong
or how to restructure things to use the libcurl API in a more expected/supported manner.

Thank you for your time!


FWIW this issue seems particularly problematic in 7.60 (and still in latest git), it does not happen in 7.59 or the handful of previous versions I tested.

Ideally such a pointer wouldn't still be present,
but seems to happen in some circumstances.

This papers over the issue--if a new handle (or other object!)
is now using this memory Bad Things (tm) will happen.
These checks are redundant on the http2 path,
but might be better to check here to be more "general".
@bagder
Copy link
Member

bagder commented Jun 18, 2018

connections are kept in the connection cache, not easy handles. Once you remove an easy handle from the multi handle it can be killed right away. That's a common pattern and one that we've always said works. When the multi interface is used, re-using easy handles is mostly just avoiding free() / malloc() since the connections, the DNS cache etc are then all owned by the multi handle anyway.

in general it's known to be problematic to remove handles possibly involved in pipelines

Yes, if that handle uses a live connection involved in a HTTP/1.1 pipeline - not when HTTP/2 multiplexing is used.

I would urge you to try to reproduce the problem with a stand-alone example that mimics how your "real" application works but with as much as possible of the application logic removed and the libcurl logic left that causes the problems.

given your log output, it looks like a struct connectdata or parts of that ends up freed while apparently still in use. I have not seen this happen.

@dtzWill
Copy link
Contributor Author

dtzWill commented Jun 18, 2018

Okay, will do thank you. I say easy handles are referenced from connection cache since the connection structs have a data field that refers to an easy handle--I'm not sure if this is allowed to be NULL or the most-recently-used-handle or what, but I'm seeing these refer to handles that have been removed (curl_multi_remove_handle) and then free'd via curl_easy_cleanup; it sounds like this process certainly should be ensuring the handle is no longer referenced by the cached connection entries?

That's encouraging to hear! I'll see about creating a standalone example although it may take some time.

Thanks for the quick response!

@bagder
Copy link
Member

bagder commented Jun 18, 2018

the conn->data pointer should point to a legitimate easy handle whenever that connection is in use, otherwise it is the "most-recently-used-handle" in some cases. A cleaner approach would be to make sure to NULLify that pointer properly when the easy handle that it pointed to is gone or when the connection is detached from it.

@dtzWill
Copy link
Contributor Author

dtzWill commented Jun 19, 2018

This example reliably produces the crash for me, although unsure why it's not happening on travis.

But log of such a crash is included, let me know if you can reproduce or if more information is helpful.

https://github.com/dtzWill/curl-repro

EDIT: oh hooray travis did reproduce it at least one time, now linked from README

@bagder
Copy link
Member

bagder commented Jun 19, 2018

I think it is a stale ->data pointer, much like we discussed earlier. I seem to have a hard time to reproduce this, can you check if this little patch makes any difference?

diff --git a/lib/url.c b/lib/url.c
index d29eddaea..0cab0a303 100644
--- a/lib/url.c
+++ b/lib/url.c
@@ -963,10 +963,11 @@ static bool extract_if_dead(struct connectdata *conn,
     /* The check for a dead socket makes sense only if there are no
        handles in pipeline and the connection isn't already marked in
        use */
     bool dead;
 
+    conn->data = data;
     if(conn->handler->connection_check) {
       /* The protocol has a special method for checking the state of the
          connection. Use it to check if the connection is dead. */
       unsigned int state;
 
@@ -977,11 +978,10 @@ static bool extract_if_dead(struct connectdata *conn,
       /* Use the general method for determining the death of a connection */
       dead = SocketIsDead(conn->sock[FIRSTSOCKET]);
     }
 
     if(dead) {
-      conn->data = data;
       infof(data, "Connection %ld seems to be dead!\n", conn->connection_id);
       Curl_conncache_remove_conn(conn, FALSE);
       return TRUE;
     }
   }

Possible extra thing to make the pointer not point to a dead handle:

diff --git a/lib/conncache.c b/lib/conncache.c
index 6bd06582a..066542915 100644
--- a/lib/conncache.c
+++ b/lib/conncache.c
@@ -449,10 +449,11 @@ bool Curl_conncache_return_conn(struct connectdata *conn)
       (void)Curl_disconnect(conn_candidate, /* dead_connection */ FALSE);
     }
   }
   CONN_LOCK(data);
   conn->inuse = FALSE; /* Mark the connection unused */
+  conn->data = NULL; /* no owner */
   CONN_UNLOCK(data);
 
   return (conn_candidate == conn) ? FALSE : TRUE;
 
 }

@dtzWill
Copy link
Contributor Author

dtzWill commented Jun 20, 2018

After some testing, I tentatively report the first patch (didn't test with/without second) seems to fix the issue!

Sorry for this being a PR and not an issue as seems more appropriate-- please close this when committing your suggested patch.

@bagder
Copy link
Member

bagder commented Jun 20, 2018

That's just awesome! Thanks for confirming. I'll make a PR with my patches, just to make sure the CI doesn't find anything bad there and then merge.

bagder added a commit that referenced this pull request Jun 20, 2018
... and make sure to NULLify the ->data pointer when the connection is put
into the cache to make this mistake easier to detect in the future.

Reported-by: Will Dietz
Closes #2669
bagder added a commit that referenced this pull request Jun 21, 2018
... and make sure to NULLify the ->data pointer when the connection is put
into the cache to make this mistake easier to detect in the future.

Reported-by: Will Dietz
Fixes #2669
Closes #2672
@bagder bagder closed this in 2c15693 Jun 21, 2018
@dtzWill dtzWill deleted the experimental/dont-use-invalid-easy-handles-referenced-from-conn-cache branch June 22, 2018 13:48
@lock lock bot locked as resolved and limited conversation to collaborators Sep 20, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Development

Successfully merging this pull request may close these issues.

None yet

2 participants