Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Race condition in the schannel SSL sessionid cache #815

Closed
w23 opened this issue May 18, 2016 · 12 comments
Closed

Race condition in the schannel SSL sessionid cache #815

w23 opened this issue May 18, 2016 · 12 comments
Labels

Comments

@w23
Copy link
Contributor

w23 commented May 18, 2016

Core issue

Windows SSPI engine (https://github.com/curl/curl/blob/master/lib/vtls/schannel.c) has trouble managing curl_schannel_cred lifetime properly:

  1. refcount is modified not atomically and outside of CURL_LOCK_DATA_SSL_SESSION scope
  2. refcount is modified inconsistently: Curl_ssl_{get,add}sessionid() in https://github.com/curl/curl/blob/master/lib/vtls/vtls.c don't lead to refcount being immediately incremented, as it should be semantically.

This leads to sporadic memory-related crashes (accessing freed memory, double free, etc) when downloading https:// URLs concurrently in separate threads.

I also believe that the same set of issues is present for OpenSSL engine in https://github.com/curl/curl/blob/master/lib/vtls/openssl.c, but the race gap is way more narrow there and I couldn't crash it in a reasonable amount of time.

And there are also a couple of somewhat related notes on schannel.c:

How to reproduce

To have a chance at reproducing this one needs to start several threads and use them to fetch several https:// URLs concurrently.
Unfortunately, I don't have a clean standalone utility for this purpose, and there is ideally a server part that is completely out of scope.

However, I at least can give you more details about my setup:

  • There are many (>8, as per max_ssl_sessions*) HTTPS servers/virtual hosts that can stand several random requests per second for hours without anyone complaining or throttling. I use trivial python http server for generating several kilobytes of random payload, and nginx as an ssl-wrapping reverse proxy to it. I use valid CA-signed SSL certificates, but I think that it could also be reproduced using self-signed certs (allowing libcurl to accept them, of course)
  • libcurl 7.43.0 on Windows with SSPI SSL engine is used. The offending code hasn't changed since that rather old version, so I believe the issue is still present.
  • CURLSH handle is created and set with proper locking functions. This handle is used for all requests.
  • Several threads are created. Each thread generates a stream of random https:// URLs to random servers.
  • curl_easy API is used to perform each request.

After a few minutes and a few thousands requests a crash is usually observed.

(*interestingly, this variable seems to be intended to be set by user, but there is no API for it, only a hardcoded value of 8)

Ways to fix

I could come up with two feasibly-looking options. I ask maintainers for their opinion, as I'm not familiar with curl codebase.

General ideas of these options are:

  • A. Along with Curl_ssl_kill_session() make a Curl_ssl_retain_session() that is called by Curl_ssl_{get,add}sessionid() on appropriate occasions. Amend schannel.c accordingly. Note that if this issue is indeed present for OpenSSL engine, then this gets a bit trickier. OpenSSL has no separate increment-refcount function for SSL_SESSION, and I see no clean way to fake it.
  • B. Increase CURL_LOCK_DATA_SSL_SESSION scope, e.g. by making it taken explicitly by a user of Curl_ssl_*sessionid() API, rather than implicitly by these functions themselves. Although this is less clean in terms of who manages what lifetime, and looks somewhat ugly for multiple-returns, it is a very straightforward change. I have already implemented it for SSPI and OpenSSL engines locally to discover that I'm not getting crashes anymore.

Note that I haven't looked at other SSL engines, and have no idea what's going on there. And, unfortunately, it is likely I won't have resources to do that. I'd appreciate any feedback from informed people on this.

Thanks!

@bagder bagder added the TLS label May 19, 2016
@bagder
Copy link
Member

bagder commented May 19, 2016

just want to make sure @mback2k sees this...

@mback2k
Copy link
Member

mback2k commented May 19, 2016

Thanks Daniel, I will try to take a look at it this weekend.

@bagder bagder changed the title Race condition in SSL sessionid cache in libcurl Race condition in the schannel SSL sessionid cache May 20, 2016
@mback2k
Copy link
Member

mback2k commented May 22, 2016

@w23 In the meantime, please take a look at the following thread to understand the reasoning for the cached-flag and the need to free sessions in two places: http://thread.gmane.org/gmane.comp.web.curl.library/44652

@mback2k
Copy link
Member

mback2k commented May 22, 2016

It seems like OpenSSL does not have the same issues, because the OpenSSL library itself does the reference counting of session IDs. There the refcount is increased by each call to SSL_get1_session and decremented by calls to SSL_SESSION_free until it reaches zero and the session is finally freed.

The OpenSSL backend in libcurl only calls SSL_get1_session and SSL_SESSION_free (in order to keep the refcount of an already cached session at 1) during the initial SSL handshake. Since it never decrements the refcount on connection shutdown, to me it seems like one last SSL session per host is never freed until the backend itself is freed. The OpenSSL backend basically just makes sure that there is just one cached session with refcount = 1 per host.

The SChannel backend actually does the refcounting itself, because Windows does not provide such a mechanism. And the SChannel backend also tries to decrement the refcount and free stale sessions during connection shutdown. Also the session (actually the Windows credential handle) must be kept alive during the whole SSL connection. This means that instead of freeing "stale" sessions while creating a new connection, we are only allowed to replace the session in the cache (remove it from the cache and if refcount is zero instead of freeing it set cached to FALSE). Then in order to free SSL sessions that are no longer cached, the SSL connection shutdown code needs to take care of it.

@w23
Copy link
Contributor Author

w23 commented May 23, 2016

Thanks @bagder, @mback2k for a quick feedback!

There are several things I'd like to elaborate.

Race condition in OpenSSL

I have looked at OpenSSL codepath more closely, and I still believe that the race does exist.
(I will use curl-7_49_0 tag to demonstrate this as it is less volatile than master)

Let's examine three lines at https://github.com/curl/curl/blob/curl-7_49_0/lib/vtls/openssl.c#L2075:

:2075  if(!Curl_ssl_getsessionid(conn, &ssl_sessionid, NULL)) {
:2076    /* we got a session id, use it! */
:2077    if(!SSL_set_session(connssl->handle, ssl_sessionid)) {

The problem here is that the Curl_ssl_getsessionid() function, while itself being effectively atomic due to internal CURL_LOCK_DATA_SSL_SESSION lock, does not give ownership of the sessionid object that it returns. ssl_sessionid internal refcount is incremented only on line 2077. So, between lines 2075 and 2077 (this gap is of course a bit wider than that, but that detail is irrelevant here) there's an opportunity for another thread to interfere and actually destroy this sessionid object, leaving its reference invalid.

The question is, can this gap be actually exploited by some other session-handling code.
Turns out, it can. I looked through the code further and I can devise a history where this does happen.

Let me put the relevant excerpt as a quick reference here (https://github.com/curl/curl/blob/curl-7_49_0/lib/vtls/openssl.c#L2811)

:2805  our_ssl_sessionid = SSL_get1_session(connssl->handle);
:2806
:2807  /* SSL_get1_session() will increment the reference count and the session
:2808     will stay in memory until explicitly freed with SSL_SESSION_free(3),
:2809     regardless of its state. */
:2810
:2811  incache = !(Curl_ssl_getsessionid(conn, &old_ssl_sessionid, NULL));
:2812  if(incache) {
:2813    if(old_ssl_sessionid != our_ssl_sessionid) {
:2814      infof(data, "old SSL session ID is stale, removing\n");
:2815      Curl_ssl_delsessionid(conn, old_ssl_sessionid);
:2816      incache = FALSE;
:2817    }
:2818  }
:2819
:2820  if(!incache) {
:2821    result = Curl_ssl_addsessionid(conn, our_ssl_sessionid,
:2822                                   0 /* unknown size */);

The timeline:
0. Curl has been just initialized, the cache is empty.

  1. Thread_1 creates Connection_1 to a host Host_1. It goes through ossl_connect_step1() function, Curl_ssl_getsessionid() on line 2075 returns nothing, so it proceeds with its own implicit Session_1(refcount==1). This thread is preempted just before it calls the ossl_connect_step3() function.
  2. Thread_2 creates Connection_2 to a host Host_1. It also sees the empty cache, so it proceeds with its own Session_2(refcount==1). And it is also preempted just before it can get to the ossl_connect_step3() function.
  3. Thread_1 enters the ossl_connect_step3(), increments its Session_1 refcount on line 2805. There is no cached session for this host yet (line 2811), so it puts its own session into cache (line 2821). Cache now contains a Session_1(refcount==2) for Host_1. At this point this thread can go through with its job and complete downloading using its connection. Connection_1 dies, and its session refcount is decremented by OpenSSL internally. Session_1(refcount==1) is now in cache for Host_1.
  4. Thread_3 creates Connection_3 to a host Host_1. It enters ossl_connect_step1() and gets Session_1(refcount==1) on line 2075, but is preempted before it can get to line 2077.
  5. Thread_2 gets to continue. It retrieves Session_1(refcount==1) on line 2811, notices that it is not equal to Session_2 on line 2813, and goes through to line 2815 to delete Session_1. This eventually results in SSL_SESSION_free() called on Session_1(refcount==1), decrementing the refcount to 0 and destroying the Session_1 object. Whether this thread continues further is irrelevant.
  6. Thread_3 calls SSL_set_session(), passing it a pointer to freed memory of Session_1.
  7. 💥

This is not the only scenario. Sessionid can also be indirectly destroyed by being pushed it out if limited cache space due to old age. Imagine some thread getting a valid sessionid at line 2075, then getting interrupted before line 2077, and then >=8 connections to >=8 different hosts slipping through in the mean time before the first thread gets to continue with long-dead sessionid. However unlikely, this unfortunate scheduling pattern could happen in practice, given there's enough aggregate time.

SChannel and cached flag

I've read your reference, but I still don't understand why it is necessary to maintain a separate flag.
Let's assume the following use of refcounts (this differs from how they're used currently):

  • when a session is created, its refcount is set to 1.
  • when a session is put into cache, its refcount is incremented (instead of setting cached flag).
  • when a session is removed from cache, its refcount is decremented.
  • when a session is retrieved from cache, its refcount is incremented (immediately!).
  • when session's refcount reaches zero, the session is destroyed (by calling appropriate SSPI calls, etc.)

I believe this would produce exactly the desired lifetime:

  • If an SSL connection that is being destroyed is the last owner of a session, the session is destroyed.
  • If a session is removed from cache (e.g. by either explicit delsessionid, old-age, or curl deinit) and the session is not used by anyone else, it is destroyed.

What am I missing here?

How to proceed with a fix

Would you be willing to look at and review a push request of a fix B I proposed in an original message? I could go through other SSL engines and blindly change the affected lines, without actually compiling and checking everything myself for every engine there is.

Thank you!

@bagder
Copy link
Member

bagder commented May 23, 2016

between lines 2075 and 2077 (this gap is of course a bit wider than that, but that detail is irrelevant here) there's an opportunity for another thread to interfere

Really? The easy handles are not shareable between threads so I don't understand that scenario. Each handle is used single-threaded so there shouldn't be any risk for another thread there, not using the same handle/data.

@w23
Copy link
Contributor Author

w23 commented May 23, 2016

Easy handles themselves are not shareable, indeed. But the session cache part is shared, if you do curl_easy_setopt(... CURLOPT_SHARE ...) and curl_share_setopt(... CURLSHOPT_SHARE, CURL_LOCK_DATA_SSL_SESSION), see https://curl.haxx.se/libcurl/c/curl_share_setopt.html. (I probably should've stated the exact setup explicitly in original message)
This is not transparent from reading the vtls code, because the session cache is accessed by (struct connectdata*)conn->data->state.session, which seems to be handle-specific. However, the session pointer is actually set to point to shared conn->data->share->sslsession, see https://github.com/curl/curl/blob/curl-7_49_0/lib/url.c#L2207

@w23
Copy link
Contributor Author

w23 commented May 26, 2016

I've been able to trigger this race in OpenSSL engine. To do so in a reasonable amount of time, I had to make a couple of adjustments to libcurl sources (note that these don't change any logic):

  1. Widen the gap by putting a usleep(rand()) there.
  2. Lower the number of cached sessions to just one.

These changes and the tool used to reproduce the issue can be found here: w23@2f33977.

However, to detect this corruption with a confidence, an instrumentation is needed. I used clang -fsanitize=address. Without that the issue manifests in an undefined way. On my machine I saw curl_easy_perform() returning CURLE_SSL_CONNECT_ERROR due to SSL_set_session() failing to find some methods for session. It is not trivial to explain this in terms of the memory corruption.
With clang ASAN I get nice reports like this:

==28218==ERROR: AddressSanitizer: heap-use-after-free on address 0x61300031abc0 at pc 0x0000008f231c bp 0x7f2c81ffb980 sp 0x7f2c81ffb978
READ of size 4 at 0x61300031abc0 thread T2
    #0 0x8f231b in SSL_set_session (/home/w23/src/curl/session_race+0x8f231b)
    #1 0x7d3b4d in ossl_connect_step1 /home/w23/src/curl/lib/vtls/openssl.c:2087:9
    #2 0x7cad4f in ossl_connect_common /home/w23/src/curl/lib/vtls/openssl.c:2898:14
    #3 0x7ca233 in Curl_ossl_connect_nonblocking /home/w23/src/curl/lib/vtls/openssl.c:2985:10
    #4 0x576c21 in Curl_ssl_connect_nonblocking /home/w23/src/curl/lib/vtls/vtls.c:322:12
    #5 0x5c2118 in https_connecting /home/w23/src/curl/lib/http.c:1385:12
    #6 0x5c1870 in Curl_http_connect /home/w23/src/curl/lib/http.c:1355:14
    #7 0x630344 in Curl_protocol_connect /home/w23/src/curl/lib/url.c:3727:16
    #8 0x5366d9 in multi_runsingle /home/w23/src/curl/lib/multi.c:1573:16
    #9 0x530ad3 in curl_multi_perform /home/w23/src/curl/lib/multi.c:2122:14
    #10 0x4fae16 in easy_transfer /home/w23/src/curl/lib/easy.c:726:15
    #11 0x4f471b in easy_perform /home/w23/src/curl/lib/easy.c:813:42
    #12 0x4f3b30 in curl_easy_perform /home/w23/src/curl/lib/easy.c:832:10
    #13 0x4f1075 in fetchThread /home/w23/src/curl/session_race.c:83:12
    #14 0x7f2c852ed42b in start_thread (/lib64/libpthread.so.0+0x742b)
    #15 0x7f2c84714b8c in clone (/lib64/libc.so.6+0xe6b8c)

0x61300031abc0 is located 0 bytes inside of 352-byte region [0x61300031abc0,0x61300031ad20)
freed by thread T1 here:
    #0 0x4d27bb in __interceptor_free (/home/w23/src/curl/session_race+0x4d27bb)
    #1 0x91cf1f in CRYPTO_free (/home/w23/src/curl/session_race+0x91cf1f)
    #2 0x578f41 in Curl_ssl_kill_session /home/w23/src/curl/lib/vtls/vtls.c:414:5
    #3 0x57962b in Curl_ssl_delsessionid /home/w23/src/curl/lib/vtls/vtls.c:438:7
    #4 0x7d63f5 in ossl_connect_step3 /home/w23/src/curl/lib/vtls/openssl.c:2828:7
    #5 0x7cc14c in ossl_connect_common /home/w23/src/curl/lib/vtls/openssl.c:2961:14
    #6 0x7ca233 in Curl_ossl_connect_nonblocking /home/w23/src/curl/lib/vtls/openssl.c:2985:10
    #7 0x576c21 in Curl_ssl_connect_nonblocking /home/w23/src/curl/lib/vtls/vtls.c:322:12
    #8 0x5c2118 in https_connecting /home/w23/src/curl/lib/http.c:1385:12
    #9 0x62f15d in Curl_protocol_connecting /home/w23/src/curl/lib/url.c:3659:14
    #10 0x536b8f in multi_runsingle /home/w23/src/curl/lib/multi.c:1593:16
    #11 0x530ad3 in curl_multi_perform /home/w23/src/curl/lib/multi.c:2122:14
    #12 0x4fae16 in easy_transfer /home/w23/src/curl/lib/easy.c:726:15
    #13 0x4f471b in easy_perform /home/w23/src/curl/lib/easy.c:813:42
    #14 0x4f3b30 in curl_easy_perform /home/w23/src/curl/lib/easy.c:832:10
    #15 0x4f1075 in fetchThread /home/w23/src/curl/session_race.c:83:12
    #16 0x7f2c852ed42b in start_thread (/lib64/libpthread.so.0+0x742b)

Which exactly confirms the scenario I described a few comments above.

To get these you'll need to compile both openssl and curl with ASAN.
Here's a short transcrips of commands I used:

# for freshly unpacked openssl do:
./Configure -fsanitize=address --prefix=/tmp/openssl-sanitized linux-x86_64-clang && make depend && make && make install

# for curl do:
CC=clang CFLAGS=-fsanitize=address ./configure --enable-debug --disable-shared --enable-static --with-ssl=/tmp/openssl-sanitized && make

# to compile the test do:
clang -pthreads -fsanitize=address -g -O0 -I. session_race.c ./lib/.libs/libcurl.a /tmp/openssl-sanitized/lib/libssl.a /tmp/openssl-sanitized/lib/libcrypto.a -lz -lrtmp -lidn -o session_race

# to run the test pass a few https urls to it
./session_race https://facebook.com/ https://twitter.com/

Two URLs are enough. I usually get corruption just after 10-20 seconds.

@w23
Copy link
Contributor Author

w23 commented May 26, 2016

And the fix that I'm proposing looks like this: w23@c89969b.

I've also looked at a couple of other engines (axtls, cyassl, darwinssl), although briefly, and I couldn't find any reference counting at all. I am confused as to how they manage sessions lifetimes.

@nickzman
Copy link
Member

The Secure Transport (darwinssl) engine simply creates a session ID string and shares it with the Security framework. The framework handles the rest internally.

@mback2k
Copy link
Member

mback2k commented May 27, 2016

@w23 Thanks for your in-depth investigation. To me your proposed fix looks good and very straightforward, but since this is a change that affects all TLS backends, Daniel should probably have the final word. Please take a look at the comments I made on w23@c89969b.

w23 added a commit to w23/curl that referenced this issue May 31, 2016
Sessionid cache management is inseparable from managing individual
session lifetimes. E.g. for reference-counted sessions (like those in
SChannel and OpenSSL engines) every session addition and removal
should be accompanied with refcount increment and decrement
respectively. Failing to do so synchronously leads to a race condition
that causes symptoms like use-after-free and memory corruption.
This commit:
 - makes existing session cache locking explicit, thus allowing
   individual engines to manage lock's scope.
 - fixes OpenSSL and SChannel engines by putting refcount management
   inside this lock's scope in relevant places.
 - adds these explicit locking calls to other engines that use
   sessionid cache to accommodate for this change. Note, however,
   that it is unknown whether any of these engines could also have
   this race.

Bug: curl#815
@w23
Copy link
Contributor Author

w23 commented May 31, 2016

@mback2k Thanks for the review! I've addressed your comments, added locking to other engines and created a pull request with these changes: #847
I'd appreciate if you (or anyone else; @bagder maybe?) could take a look.
Thanks!

@bagder bagder closed this as completed in 31c521b Jun 1, 2016
@lock lock bot locked as resolved and limited conversation to collaborators May 7, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Development

No branches or pull requests

4 participants