debugging a crash in Curl_pgrsTime/checkPendPipeline?

From: <johansen_at_sun.com>
Date: Fri, 17 Jul 2009 17:30:11 -0700

Folks,
The OpenSolaris packaging project recently switched to using PyCurl as
the framework for performing network operations. In general, the
transition has gone well; however, I've recently run across a segfault
that a few users are seeing.

It's with libcurl-7.19.4 and PyCurl-7.19.0. A couple of patches have
been applied to PyCurl so that a reset() of an easy handle works
properly.

I've yet to reproduce the crash myself, but it appears to occur while
performing downloads with a large number of files.

I've made some progress with the cores that have come in from users, but
I'm hardly a libcurl expert. I don't have an intuitive sense about
where to look to find the underlying problem.

If anyone has suggestions about how to investigate this further, I'd be
much obliged.

Here's what I've been able to figure out so far:

---- >8 cut here 8< ----

python2.4 died with SIGSEGV

> ::status
debugging core file of pkg (32-bit) from mcescher
file: /usr/bin/i86/isapython2.4
initial argv: /usr/bin/python2.4 /usr/bin/pkg install --no-index redistributable
threading model: native threads
status: process terminated by SIGSEGV (Segmentation Fault), addr=40c

This is the stack from the only LWP running in the process:

> $C
08045f24 libcurl.so.3.0.0`Curl_pgrsTime+0xb3(10, 4, 853acd8, 0)
08045f74 libcurl.so.3.0.0`checkPendPipeline+0xa4(25b384b0, 853ac90, fecc2a00, fd5546fe)
08045fc4 libcurl.so.3.0.0`multi_runsingle+0x18f(ac850f0, 25a2cf00, 8363f08, fd5553d6)
08046014 libcurl.so.3.0.0`curl_multi_perform+0x4c(ac850f0, 804603c)
08046048 pycurl.so`do_multi_perform+0xb3(852c9bc, 0, 831ff54, fd5b16a0)
08046098 libpython2.4.so.1.0`call_function+0x315(804610c, 0, da8b7165, 1)

Stack shows that we died in Curl_pgrsTime()

> libcurl.so.3.0.0`Curl_pgrsTime+0xb3::dis
libcurl.so.3.0.0`Curl_pgrsTime+0x92: addl $0x10,%esp
libcurl.so.3.0.0`Curl_pgrsTime+0x95: fstpl 0x3e0(%esi)
libcurl.so.3.0.0`Curl_pgrsTime+0x9b: fwait
libcurl.so.3.0.0`Curl_pgrsTime+0x9c:
jmp +0x102 <libcurl.so.3.0.0`Curl_pgrsTime+0x1a3>
libcurl.so.3.0.0`Curl_pgrsTime+0xa1: leal -0x28(%ebp),%eax
libcurl.so.3.0.0`Curl_pgrsTime+0xa4: subl $0xc,%esp
libcurl.so.3.0.0`Curl_pgrsTime+0xa7: pushl %eax
libcurl.so.3.0.0`Curl_pgrsTime+0xa8:
call -0x3ca5 <PLT=libcurl.so.3.0.0`curlx_tvnow>
libcurl.so.3.0.0`Curl_pgrsTime+0xad: addl $0xc,%esp
libcurl.so.3.0.0`Curl_pgrsTime+0xb0: movl 0x8(%ebp),%esi
libcurl.so.3.0.0`Curl_pgrsTime+0xb3: pushl 0x3fc(%esi)

The disassembly shows that we're dying just after the call to curlx_tvnow,
a leaf function that Curl_pgrsTime invokes like this:

        Curl_tvdiff_secs(Curl_tvnow(), data->progress.t_startsingle);

> ::regs
%fs = 0x0000 %esi = 0x00000010

%esi = 0x10

0x3fc + 0x10 is 0x40c, which matches our fault address. That address is
unmapped, which is clearly what caused the segfault.
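
For illustration, here's a tiny standalone program (the struct is a
made-up stand-in, not libcurl's real SessionHandle; only the 0x3fc
offset is taken from the disassembly above) showing how a bogus data
pointer of 0x10 turns that member access into a fault at 0x40c:

  #include <stdio.h>
  #include <stddef.h>

  /* Stand-in for struct SessionHandle: pad the struct so that the
     member of interest lands at offset 0x3fc, as in the disassembly.
     A 4-byte member keeps the offset the same on any build. */
  struct fake_session {
    char pad[0x3fc];
    unsigned int t_startsingle;  /* stands in for progress.t_startsingle */
  };

  int main(void)
  {
    struct fake_session *data = (struct fake_session *)0x10; /* bad ptr */

    printf("member offset:  0x%lx\n",
           (unsigned long)offsetof(struct fake_session, t_startsingle));
    /* No load happens here, just pointer arithmetic, so this prints
       the address the real code faulted on: 0x40c. */
    printf("would fault at: %p\n", (void *)&data->t_startsingle);
    return 0;
  }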

%ebp = 0x08045f24

> 0x08045f24+0x8/X
0x8045f2c: 10

So far, this makes sense. With the IA-32 cdecl calling convention the
first argument sits at %ebp + 0x8, just above the return address and the
saved frame pointer, so the frame confirms that Curl_pgrsTime really was
handed 0x10. We're moving timevalues from the arguments and return
values of one function to another, as per the nested invocation above.

If the Curl_pgrsTime call frame is to be believed, we were called with
0x10 as the value of the SessionHandle pointer. The second argument
is 4. This corresponds to a timerid of TIMER_PRETRANSFER, which is
verified by inspection of checkPendPipeline(). It only makes one call
to Curl_pgrsTime and that's with TIMER_PRETRANSFER as an argument.

Working backwards a bit, it's possible to see that the second argument
to multi_runsingle() is a pointer to struct Curl_one_easy. The stack
says that our pointer to Curl_one_easy is 0x25a2cf00. multi_runsingle
calls checkPendPipeline() with a pointer to connectdata at offset 0xc in
the Curl_one_easy structure.
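
For reference, the relevant leading members look roughly like this
(paraphrased from memory of the 7.19.x sources, so names and trailing
fields may not match exactly; the offsets assume a 32-bit build like
this one):

  /* lib/multi.c, roughly */
  struct Curl_one_easy {
    struct Curl_one_easy *next;        /* offset 0x0 */
    struct Curl_one_easy *prev;        /* offset 0x4 */
    struct SessionHandle *easy_handle; /* offset 0x8 */
    struct connectdata *easy_conn;     /* offset 0xc - passed to
                                          checkPendPipeline() */
    CURLMstate state;                  /* e.g. CURLM_STATE_DONE */
    /* ... */
  };

  /* lib/urldata.h, roughly */
  struct connectdata {
    struct SessionHandle *data;        /* offset 0x0 - the handle
                                          currently using this connection */
    /* ... */
  };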

Just for the sake of sanity, it's possible to work backwards even
further by looking at the Curl_multi that was supplied in
curl_multi_perform. It contains a list of Curl_one_easy structs that
are linked together and traversed like this:

  easy=multi->easy.next;
  while(easy != &multi->easy) {
    CURLMcode result;

    result = multi_runsingle(multi, easy);
    if(result)
      returncode = result;

    easy = easy->next; /* operate on next handle */
  }

In both curl_multi_perform() and multi_runsingle() the first argument
is a pointer to a Curl_multi struct. Walking that struct's list of easy
handles turns up the following Curl_one_easy structs:

0x256f3208, 0x1e987c58, 0x2569de48, 0x25a2cf00, 0x2672e0d8, 0xbced6b0,
0xb3dfae8, 0x1dc4aa70, 0xc1082a0, 0xc106c78, and 0xac850f4. The last
entry is 0xac850f0 + 0x4, i.e. presumably &multi->easy itself, the list
head embedded in the Curl_multi, and it points back to 0x256f3208,
completing the loop. We seem to have gotten hung up processing
0x25a2cf00, at least according to the call stack of multi_runsingle().

Based upon this information 0x25a2cf00 + 0xc should be the connectdata:

> 0x25a2cf00 + 0xc/K
0x25a2cf0c: 25b384b0

The connectdata has a SessionHandle pointer as its first member:

> 25b384b0/K
0x25b384b0: ad0cc48

So this should be the session handle where we ran into trouble.

The session handle keeps a magic number in its final member. This
allows us to identify that it's actually a SessionHandle:

> ad0cc48+0x85ec/K
0xad15234: 0

In this case the magic number is zero, which means that the handle has
already been closed, or isn't a SessionHandle object. However, the
SessionHandle keeps a backpointer to the Multi object at offset 0x8:

> ad0cc48+0x8/K
0xad0cc50: ac850f0

This points back to the correct multi handle, which means we should be
looking at a SessionHandle object here.
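
For context, libcurl stamps each easy handle with a magic value and the
code checks it with a macro along these lines (again paraphrased from
memory; the exact definitions in the 7.19.x headers may differ
slightly):

  /* lib/urldata.h */
  #define CURLEASY_MAGIC_NUMBER 0xc0dedbad

  struct SessionHandle {
    /* ... everything else ... */
    unsigned int magic;  /* CURLEASY_MAGIC_NUMBER while the handle is live */
  };

  /* used before trusting a handle passed in from the outside */
  #define GOOD_EASY_HANDLE(x) \
    ((x) && (((struct SessionHandle *)(x))->magic == CURLEASY_MAGIC_NUMBER))

If I'm reading the source right, closing an easy handle zeroes the magic
before the memory is freed, so a zero magic sitting next to an otherwise
plausible multi backpointer smells like a handle that has already been
torn down rather than random garbage.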

Getting back to the task at hand, we need to fish out the easy_conn from
our Curl_one_easy in order to double-check the sanity of our stack
frame:

> 0x25a2cf00+0xc/K
0x25a2cf0c: 25b384b0

This seems to match what we saw before.

checkPendPipeline() looks at the head of the pend_pipe list and moves
each pending object to the send_pipe, after calling Curl_pgrsTime() to
perform some accounting.
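
For anyone following along without the source handy, the relevant part
looks roughly like this (paraphrased and trimmed from lib/url.c in
7.19.x from memory, so the pipeline-length checks and exact ordering may
not match the shipped code):

  static void checkPendPipeline(struct connectdata *conn)
  {
    struct curl_llist_element *curr = conn->pend_pipe->head;

    while(curr /* && there is still room in the send pipeline */) {
      /* This is the Curl_pgrsTime(TIMER_PRETRANSFER) call we died in.
         Note that the SessionHandle comes out of the pend_pipe element
         itself (curr->ptr), not out of conn->data. */
      Curl_pgrsTime(curr->ptr, TIMER_PRETRANSFER);

      /* promote the pending easy handle into the send pipeline */
      Curl_llist_move(conn->pend_pipe, curr,
                      conn->send_pipe, conn->send_pipe->tail);

      curr = conn->pend_pipe->head;
    }
  }

If that's right, the bogus 0x10 handle came out of a pend_pipe element
rather than from conn->data, which would explain why the SessionHandle
that conn->data points to (0xad0cc48) isn't the pointer we actually
crashed on.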

Unfortunately for us, the connectdata's pipes appear to be NULL. From
code inspection, it's not clear how this might happen. The recv_pipe,
send_pipe, and pend_pipe are all pointers that live near 0x25b38640
(0x25b384b0 + 0x190, or specifically 0x194 for send_pipe and 0x19c for
pend_pipe). Dumping out that part of the structure shows a large range
of nothing:

            0 1 2 3 \/ 5 6 7 8 9 a b c d e f
25b38630: 00000000 0a644660 00000000 00000000
                     send recv pend
25b38640: 00000000 00000000 00000000 00000000
25b38650: 00000000 00000000 00000000 00000000
25b38660: 00000000 00000000 00000000 00000000
25b38670: 00000000 00000000 00000000 00000000
25b38680: 00000000 00000000 00000000 0853acd8

So, we can't figure it out by working backwards. We'll have to go
fishing in the stack to see if we can find the object that caused all of
the trouble.

> 0x08045f24,64::dump -qe
8045f24: 08045f74 fd555ff8 00000010 00000004
8045f34: 0853acd8 00000000 0021ea91 fee10000
8045f44: 0853ac88 00000000 00000001 25b384b0
8045f54: fef1de6f 27258b8c 00000001 080a659c
8045f64: 00000000 25a2cf00 25a2cf00 fd5774fc
8045f74: 08045fc4 fd55487f 25b384b0 0853ac90
8045f84: fecc2a00 fd5546fe 08045fe0 fed7fcfd

Digging around in the stack turned out to be fruitful, as unlikely as
that might seem. At 0x08045f44 there's a pointer to 0x0853ac88, which
when dereferenced returns 0x10. 0x10 has shown up in every dump that
we've seen. It's not yet clear why this is always the first, faulting
argument to Curl_pgrsTime().

pmap says the following about the heap:

08061000 512936K rwx-- [ heap ]

That puts every address from 0x08061000 to 0x2754b000 within the heap,
which means that 0x0853ac88 was allocated from the heap rather early on
in the life of the process.
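
As a quick sanity check on that arithmetic:

  #include <stdio.h>

  int main(void)
  {
    unsigned long base = 0x08061000UL;        /* heap base from pmap */
    unsigned long size = 512936UL * 1024;     /* 512936K from pmap */

    printf("heap end: 0x%lx\n", base + size); /* prints 0x2754b000 */
    return 0;
  }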

---- >8 cut here 8< ----

If anyone has further suggestions about ways to investigate the problem,
I'd appreciate the help. I'm especially curious why the library might
be manipulating a SessionHandle that has no magic number, and how the
recv_pipe, send_pipe, and pend_pipe members got set to NULL. The
Curl_one_easy appears to indicate that we're in CURLM_STATE_DONE state,
and disassembly shows we called checkPendPipeline() from the case
statement for handling that multi state. If anyone can postulate a
reasonable theory as to why this happened, I'd certainly be
interested.

Thanks,

-j