
socket: segfault when requesting google.com twice #3986

Closed
TvdW opened this issue Jun 4, 2019 · 10 comments

Comments


TvdW commented Jun 4, 2019

Thought I'd give curl master a try before 7.65.1 gets released to see if it passes my own unit tests... it did not 😩

I did this

https://gist.github.com/TvdW/d310abd835550dbc3c45649b04373186

Basically just two requests to google.com via the socket api
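
(The gist isn't inlined here; as a rough illustration only, a socket-API reproduction looks something like the following minimal sketch. The epoll loop, callback names sock_cb/timer_cb, and the exact options are assumptions, not necessarily what the gist does.)

/* Minimal sketch of a multi-socket-API reproduction; illustrative only. */
#include <curl/curl.h>
#include <sys/epoll.h>
#include <string.h>

static int epfd;              /* epoll instance driving the event loop */
static long timeout_ms = -1;  /* most recent timeout requested by libcurl */

/* CURLMOPT_SOCKETFUNCTION: mirror libcurl's socket interest into epoll */
static int sock_cb(CURL *easy, curl_socket_t s, int what, void *userp,
                   void *socketp)
{
  struct epoll_event ev;
  memset(&ev, 0, sizeof(ev));
  ev.data.fd = (int)s;
  if(what == CURL_POLL_REMOVE)
    epoll_ctl(epfd, EPOLL_CTL_DEL, (int)s, NULL);
  else {
    if(what & CURL_POLL_IN)
      ev.events |= EPOLLIN;
    if(what & CURL_POLL_OUT)
      ev.events |= EPOLLOUT;
    if(epoll_ctl(epfd, EPOLL_CTL_MOD, (int)s, &ev))
      epoll_ctl(epfd, EPOLL_CTL_ADD, (int)s, &ev);
  }
  return 0;
}

/* CURLMOPT_TIMERFUNCTION: remember the timeout libcurl asks for */
static int timer_cb(CURLM *multi, long t_ms, void *userp)
{
  timeout_ms = t_ms;
  return 0;
}

int main(void)
{
  CURLM *multi;
  int running = 0;
  int i;

  curl_global_init(CURL_GLOBAL_DEFAULT);
  multi = curl_multi_init();
  epfd = epoll_create1(0);
  curl_multi_setopt(multi, CURLMOPT_SOCKETFUNCTION, sock_cb);
  curl_multi_setopt(multi, CURLMOPT_TIMERFUNCTION, timer_cb);

  /* two transfers to the same host so they share a connection bundle
     (easy handles are leaked here for brevity) */
  for(i = 0; i < 2; i++) {
    CURL *easy = curl_easy_init();
    curl_easy_setopt(easy, CURLOPT_URL, "https://www.google.com/");
    curl_easy_setopt(easy, CURLOPT_VERBOSE, 1L);
    curl_multi_add_handle(multi, easy);
  }

  /* kick things off, then drive everything via curl_multi_socket_action */
  curl_multi_socket_action(multi, CURL_SOCKET_TIMEOUT, 0, &running);
  while(running) {
    struct epoll_event events[8];
    int n = epoll_wait(epfd, events, 8, (int)timeout_ms);
    if(n <= 0)
      curl_multi_socket_action(multi, CURL_SOCKET_TIMEOUT, 0, &running);
    for(i = 0; i < n; i++) {
      int bits = 0;
      if(events[i].events & EPOLLIN)
        bits |= CURL_CSELECT_IN;
      if(events[i].events & EPOLLOUT)
        bits |= CURL_CSELECT_OUT;
      curl_multi_socket_action(multi, events[i].data.fd, bits, &running);
    }
  }
  curl_multi_cleanup(multi);
  curl_global_cleanup();
  return 0;
}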

I expected the following

No segfault 😄

curl/libcurl version

curl 7.65.0 (x86_64-unknown-linux-gnu) libcurl/7.65.0 OpenSSL/1.0.2k-fips zlib/1.2.7 brotli/1.0.1 c-ares/1.15.0 nghttp2/1.38.0
Release-Date: 2019-06-04
Protocols: dict file ftp ftps gopher http https imap imaps pop3 pop3s rtsp smb smbs smtp smtps telnet tftp 
Features: AsynchDNS brotli HTTP2 HTTPS-proxy IPv6 Largefile libz NTLM NTLM_WB SSL UnixSockets

(reports 7.65.0 because I packaged it via maketgz first)


TvdW commented Jun 4, 2019

* Found bundle for host www.google.com: 0x26792f0 [serially]
* Server doesn't support multiplex (yet)
* Connection #0 is still name resolving, can't reuse
*   Trying 1.2.3.4:3128...
* TCP_NODELAY set
* Hostname 'my-proxy-server' was found in DNS cache
*   Trying 1.2.3.4:3128...
* TCP_NODELAY set
* Connected to my-proxy-server (1.2.3.4) port 3128 (#0)
* allocate connect buffer!
* Establish HTTP proxy tunnel to www.google.com:443
> CONNECT www.google.com:443 HTTP/1.1
Host: www.google.com:443
Proxy-Connection: Keep-Alive

Segmentation fault


bagder commented Jun 4, 2019

This doesn't reproduce for me after having run it dozens of times. Also tried without proxy.

I suppose this is a new crash? Can you git bisect which commit causes it?

Or perhaps first: build with debug symbols, run with gdb and do 'bt full' on the crash and show us that dump?


TvdW commented Jun 4, 2019

Did some dancing with curl configure flags and it seems this is only reproducible when I add --enable-ares=../c-ares-lib --disable-pthreads --disable-threaded-resolver


TvdW commented Jun 4, 2019

> CONNECT www.google.com:443 HTTP/1.1
Host: www.google.com:443
Proxy-Connection: Keep-Alive

a.out: multi.c:2511: multi_socket: Assertion `data' failed.

Program received signal SIGABRT, Aborted.
0x00007ffff71631f7 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install glibc-2.17-196.el7_4.2.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.15.1-34.el7.x86_64 libcom_err-1.42.9-13.el7.x86_64 libselinux-2.5-14.1.el7.x86_64 openssl-libs-1.0.2k-16.el7_6.1.x86_64 pcre-8.32-17.el7.x86_64 zlib-1.2.7-18.el7.x86_64
(gdb) bt full
#0  0x00007ffff71631f7 in raise () from /lib64/libc.so.6
No symbol table info available.
#1  0x00007ffff71648e8 in abort () from /lib64/libc.so.6
No symbol table info available.
#2  0x00007ffff715c266 in __assert_fail_base () from /lib64/libc.so.6
No symbol table info available.
#3  0x00007ffff715c312 in __assert_fail () from /lib64/libc.so.6
No symbol table info available.
#4  0x0000000000410301 in multi_socket (multi=0x70fcd8, checkall=false, s=9, ev_bitmask=2, running_handles=0x7fffffffdda0) at multi.c:2511                                                                                                   
        list = 0x760928
        e = 0x747938
        enext = 0x0
        pipe_st = {old_pipe_act = {__sigaction_handler = {sa_handler = 0x0, sa_sigaction = 0x0}, sa_mask = {__val = {0, 7637864, 0, 4220256, 140737488346816, 140737488346064, 140737339695309, 140737488345712, 4385956, 14594605, 269611,  
                14594605, 468619, 140737488345744, 0, 140737488346128}}, sa_flags = 4263064, sa_restorer = 0x7fffffffdc38}, no_signal = 216}                                                                                                 
        entry = 0x760928
        result = CURLM_OK
        data = 0x0
        t = 0xc7
        now = {tv_sec = 14594605, tv_usec = 269612}
        __PRETTY_FUNCTION__ = "multi_socket"
#5  0x0000000000410b38 in curl_multi_socket_action (multi=0x70fcd8, s=9, ev_bitmask=2, running_handles=0x7fffffffdda0) at multi.c:2677                                                                                                       
        result = CURLM_OK
#6  0x0000000000406d8a in main ()
No symbol table info available.


TvdW commented Jun 4, 2019

bisect has my name on it...

8581e1928ea8e125021408ec071fcedb0276c7fe is the first bad commit
commit 8581e1928ea8e125021408ec071fcedb0276c7fe
Author: Daniel Stenberg <daniel@haxx.se>
Date:   Tue May 28 08:23:43 2019 +0200

    multi: track users of a socket better
    
    They need to be removed from the socket hash linked list with more care.
    
    When sh_delentry() is called to remove a sockethash entry, remove all
    individual transfers from the list first. To enable this, each Curl_easy struct
    now stores a pointer to the sockethash entry to know how to remove itself.
    
    Reported-by: Tom van der Woerdt and Kunal Ekawde
    
    Fixes #3952
    Fixes #3904
    Closes #3953

 lib/multi.c   | 38 ++++++++++++++++++++++++++------------
 lib/urldata.h |  1 +
 2 files changed, 27 insertions(+), 12 deletions(-)


bagder commented Jun 4, 2019

It is certainly related to that fix. The assert you see happens when it traverses the list of users of a socket and a pointer is NULL where it isn't supposed to be. The commit above (8581e19) was meant to make sure there couldn't be any such NULL entries in that list at that point in time. The problem most likely happens when a connection is closed with more than one stream using it.

I could easily repro such a problem up until I landed 8581e19, but now I cannot, and I've tried tuning your recipe in several ways with no luck...


TvdW commented Jun 5, 2019

Well, it's indeed a bit hard to fix a bug that can't be reproduced...

Maybe this log with debug enabled makes it easier to figure out how to reproduce it (idk...)

* STATE: INIT => CONNECT handle 0x1e5fc38; line 1346 (connection #-5000)
* Added connection 0. The cache now contains 1 members
* STATE: CONNECT => WAITRESOLVE handle 0x1e5fc38; line 1387 (connection #0)
* STATE: INIT => CONNECT handle 0x1e77908; line 1346 (connection #-5000)
* Found bundle for host www.google.com: 0x1e90158 [serially]
* Server doesn't support multiplex (yet)
* Connection #0 is still name resolving, can't reuse
* Added connection 1. The cache now contains 2 members
* STATE: CONNECT => WAITRESOLVE handle 0x1e77908; line 1387 (connection #1)
*   Trying 1.2.3.4:3128...
* TCP_NODELAY set
* STATE: WAITRESOLVE => WAITCONNECT handle 0x1e5fc38; line 1466 (connection #0)
* Hostname 'my-proxy-server' was found in DNS cache
*   Trying 1.2.3.4:3128...
* TCP_NODELAY set
* STATE: WAITRESOLVE => WAITCONNECT handle 0x1e77908; line 1466 (connection #1)
* Connected to my-proxy-server (1.2.3.4) port 3128 (#0)
* STATE: WAITCONNECT => WAITPROXYCONNECT handle 0x1e5fc38; line 1522 (connection #0)
* Marked for [keep alive]: HTTP default
* allocate connect buffer!
* Establish HTTP proxy tunnel to www.google.com:443
> CONNECT www.google.com:443 HTTP/1.1
Host: www.google.com:443
Proxy-Connection: Keep-Alive

a.out: multi.c:2511: multi_socket: Assertion `data' failed.
Aborted

It only seems to happen with c-ares enabled, and only when the proxy server needs to be resolved via DNS. Any chance you're attempting it via a 'localhost' proxy? That may not hit DNS.

It hits the assertion on every run; it does not seem to be a race-condition-style bug.


TvdW commented Jun 5, 2019

I take it back, it does seem to be a race condition with DNS resolution. Changing the hostname from 'my-proxy-server' to '1.2.3.4.xip.io' (resolves to same IP address, but slower because external DNS) makes the behavior slightly less predictable (but still crashes on every run).

If you were using a localhost proxy, try curl_easy_setopt(easy, CURLOPT_PROXY, "http://127.0.0.1.xip.io:3128");


bagder commented Jun 5, 2019

Thanks for that extra detail, I just made the assert trigger here...


bagder commented Jun 5, 2019

I believe this is the situation (with some certainty).

The problem

We have a hash table where we add all sockets in use: the "sockhash". In each sockhash entry there's a linked list. That list contains all the transfers that are using the same socket, so that when there's activity on the socket we can iterate over all the relevant transfers, since one or more of them might be affected.

Turns out that I was a bit silly previously, since each transfer (struct Curl_easy) can currently only be added to one of these sockhash linked lists, while in reality a single transfer can use more than one socket at a time and, perhaps more importantly, it can get added to a new socket before it gets removed from the previous one.

In the problematic case here, we can see that due to the race, the transfer is sometimes added to a new socket and only after that removed from the previous one. But since there's just a single storage area, the "removal from the previous" clears the current association and badness ensues (sketched below).
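
(Editorial sketch of that flawed layout in C; field names are hypothetical, not the verbatim lib/urldata.h declarations. Each transfer embeds exactly one list node, so joining a second socket's list clobbers the membership in the first.)

struct sh_entry;                 /* one sockhash entry per socket */

struct easy_sketch {
  /* ... transfer state ... */
  struct sh_entry *sh_entry;     /* the ONE entry this transfer
                                    thinks it is listed in */
  struct easy_sketch *sh_next;   /* its single embedded list link */
};

/* Adding the transfer to socket B's list while it is still on socket
   A's list overwrites sh_entry/sh_next, so the later "remove from A"
   unlinks the wrong node and leaves a dangling pointer behind. */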

The fix

(Current thinking)

We need to make sure that each sockhash entry can have a linked list of transfers such that each transfer can exist in multiple lists at the same time. We need to give up the single sh_queue storage area within the Curl_easy struct and instead dynamically allocate a list node each time we add a transfer to a list, roughly like the sketch below.
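
(A minimal sketch of that idea with hypothetical names; the real fix lives in lib/multi.c. Membership nodes are heap-allocated per (socket, transfer) pair, so one transfer can sit on several lists at once.)

#include <stdlib.h>

struct easy_sketch;               /* stands in for struct Curl_easy */

struct sh_node {                  /* one per (socket, transfer) pair */
  struct easy_sketch *easy;
  struct sh_node *next;
};

struct sh_entry_sketch {          /* one sockhash entry per socket */
  struct sh_node *users;          /* transfers using this socket */
};

static int sh_add_user(struct sh_entry_sketch *e, struct easy_sketch *easy)
{
  struct sh_node *n = malloc(sizeof(*n));
  if(!n)
    return 1;                     /* out of memory */
  n->easy = easy;
  n->next = e->users;             /* push onto this entry's own list */
  e->users = n;
  return 0;
}

static void sh_del_user(struct sh_entry_sketch *e, struct easy_sketch *easy)
{
  struct sh_node **pp = &e->users;
  while(*pp) {
    if((*pp)->easy == easy) {
      struct sh_node *dead = *pp;
      *pp = dead->next;           /* unlink only this entry's node;
                                     other entries' nodes are untouched */
      free(dead);
      return;
    }
    pp = &(*pp)->next;
  }
}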

bagder added a commit that referenced this issue Jun 7, 2019
Since more than one socket can be used by each handle at a given time.

TODO: change to use a separate hash table in each Curl_sh_entry object
instead so that removing a handle from it doesn't have to iterate. A
single socket can be used by hundreds or even thousands of different
transfers at the same time.

Reported-by: Tom van der Woerdt
Fixes #3986
bagder added a commit that referenced this issue Jun 8, 2019
Since more than one socket can be used by each transfer at a given time,
each sockhash entry now has its own hash table with transfers using that
socket.

In addition, the sockhash entry can now be marked 'blocked = TRUE',
which then makes the delete function just set 'removed = TRUE' instead
of removing it "for real", as a way to not rip out the carpet from under
the feet of a parent function that iterates over the transfers of that
same sockhash entry.

Reported-by: Tom van der Woerdt
Fixes #3961
Fixes #3986
Fixes #3995
Closes #3997
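
(The 'blocked'/'removed' flags in the commit above describe a classic deferred-deletion guard. A rough sketch of the pattern with illustrative names, not the actual lib/multi.c symbols:)

#include <stdbool.h>
#include <stdlib.h>

struct sh_entry_guard {
  bool blocked;                   /* an iterator is walking this entry */
  bool removed;                   /* deletion requested while blocked */
  /* ... per-socket state, hash of transfers ... */
};

static void entry_destroy(struct sh_entry_guard *e)
{
  free(e);                        /* remove from the hash for real */
}

static void sh_delentry(struct sh_entry_guard *e)
{
  if(e->blocked) {
    e->removed = true;            /* defer: the iterator cleans up */
    return;
  }
  entry_destroy(e);
}

static void iterate_users(struct sh_entry_guard *e)
{
  e->blocked = true;
  /* ... invoke per-transfer handlers; any of them may call
     sh_delentry() on this very entry ... */
  e->blocked = false;
  if(e->removed)
    entry_destroy(e);             /* now safe to remove for real */
}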
bagder closed this as completed in 094b5f3 on Jun 10, 2019
lock bot locked as resolved and limited conversation to collaborators on Sep 8, 2019