cURL / Mailing Lists / curl-library / Single Mail


Re: Infinite loop

From: Ray Satiro via curl-library <>
Date: Tue, 12 Jan 2016 13:32:55 -0500

On 1/12/2016 6:18 AM, Jakub Zakrzewski wrote:
> Hi All,
> I have a very weird problem and I'm having hard time trying to find the cause.
> My program makes frequent calls to a sftp server to get the list of files (like once a minute or so). The sftp server is quite slow and overloaded. Every once in a while (according to my colleage once or twice a month) it goes into infinite loop. Weird thing is it affects all the instances that have tried to connect to this server at around this time so I suspect that a weird (or simply unexpected) answer from the server is causing this. The bad thing is this is a production server with release versions of everything, so debugging is a nightmare. The frequency of the problem occuring doesn't help also.
> This is what I know so far:
> sftp software: something that runs on linux, I'm trying to get this info
> $ curl.exe --version
> curl 7.44.0 (Windows) libcurl/7.44.0 OpenSSL/1.0.2d zlib/ libssh2/1.6.0
> Protocols: dict file ftp ftps gopher http https imap imaps ldap ldaps pop3 pop3s rtsp scp sftp smtp smtps telnet tftp
> Features: SSPI Kerberos SPNEGO NTLM SSL libz
> The callstacks from two dumps made from hanging processes (unfortunately, the variable values are mangled and make no sense; probably due to this being an optimezed release build):
> 1)
> ntdll.dll!NtDeviceIoControlFile() Unknown
> mswsock.dll!WSPSelect() Unknown
> ws2_32.dll!select() Unknown
> libcurl.dll!Curl_socket_check(unsigned __int64 readfd0=1477468750437493, unsigned __int64 readfd1=634792935808, unsigned __int64 writefd=0, long timeout_ms=-865151808) Line 328 C
> libcurl.dll!ssh_block_statemach(connectdata * conn=0x00000093cc9b8101, bool duringconnect=128) Line 2713 C
> libcurl.dll!Curl_disconnect(connectdata * conn=0x00000093cc6ed4c0, bool dead_connection=128) Line 2764 C
> libcurl.dll!Curl_done(connectdata * * connp=0x0000000000000000, CURLcode status=-865151808, bool premature=72) Line 6088 C
> libcurl.dll!multi_runsingle(Curl_multi * multi=0x00000093cc6ed4c0, timeval now={...}, SessionHandle * data=0x00000093cc860a00) Line 1079 C
> libcurl.dll!curl_multi_perform(void * multi_handle=0x0005b8d800026521, int * running_handles=0x0000000000000000) Line 1811 C
> libcurl.dll!easy_transfer(void * multi=0x00000093cc6ed4c0) Line 715 C
> libcurl.dll!easy_perform(SessionHandle * data=0x0000000000000001, bool events=240) Line 803 C
> 2)
> Network.dll!CURLProgressCallback(void * clientp=0x000a7f8000095c83, double dltotal=1.976262583365e-322#DEN, double dlnow=2.149869607130e-314#DEN, double ultotal=2.149869607130e-314#DEN, double ulnow=0.00000000000000000) Line 253 C++
> libcurl.dll!Curl_pgrsUpdate(connectdata * conn=0x000a410000095c83) Line 388 C
> libcurl.dll!ssh_block_statemach(connectdata * conn=0x0000000105787e01, bool duringconnect=false) Line 2721 C
> libcurl.dll!Curl_disconnect(connectdata * conn=0x000000010577a380, bool dead_connection=false) Line 2764 C
> libcurl.dll!Curl_done(connectdata * * connp=0x0000000000000000, CURLcode status=91726720, bool premature=120) Line 6088 C
> libcurl.dll!multi_runsingle(Curl_multi * multi=0x000000010577a380, timeval now={...}, SessionHandle * data=0x00000001056bae00) Line 1079 C
> libcurl.dll!curl_multi_perform(void * multi_handle=0x0007280800027fd9, int * running_handles=0x0000000000000000) Line 1811 C
> libcurl.dll!easy_transfer(void * multi=0x000000010577a380) Line 715 C
> libcurl.dll!easy_perform(SessionHandle * data=0x0000000000000001, bool events=240) Line 803 C
> In our code there's no retry whatsoever - we call the easy_perform and return the result.
> The affected processes take as much CPU as they can, so it's an active loop.
> There's a timeout set (via CURLOPT_TIMEOUT) but it apparently has no effect.
> There's no network communication between my program and the sftp server - it just loops without contacting the server.
> I have the two affected processes running at the moment. I have no idea how long I can keep that state (they already run for a couple of days).
> Does anyone have an idea what can I do next?

If you take enough samples of the stack you may be able to get an idea
of what function is the source of the loop even if the variable values
do not show correctly. Just from the two samples above it looks like
there's a disconnect and ssh_block_statemach [1] is called, which is
where there's a loop waiting for the SSH_STOP state.

You can see in ssh_block_statemach that Curl_timeleft is called every
iteration to check for operation timeout. Curl_timeleft should in this
case use data->set.timeout (CURLOPT_TIMEOUT val x 1000) for its
calculation [2][3]. I think it would be interesting to see what's in
data->set.timeout and what the result of Curl_timeleft is. The only way
I see that part going awry is if whatever it is you're passing to
CURLOPT_TIMEOUT is either not type long, contains a value <=0, or
contains a value > LONG_MAX/1000.

Since SSH_STOP state is not achieved, it would be of value to know the
current state. In at least one iteration I can tell ssh_statemach_act
enables blocking [4] due to LIBSSH2_ERROR_EAGAIN, which will cause
Curl_socket_ready (Curl_socket_check) to be called [5]. However that
really isn't illuminating because EAGAIN happens all the time so it
could mean anything.

My advice would be take some more samples of the stack (what are you
using to do that btw) and then make dumps of both processes before
terminating. Use process explorer or task manager, right-click on each
process and create dump file. It is likely possible to search through
the memory to get the state and other information, however I would go
the route of make a debug build of libcurl and walk through it when it
happens, but beforehand add some fprintfs to record the states in
ssh_statemach_act as they change to get an idea of what's happening
leading up to it.


List admin:
Received on 2016-01-12