curl-library
Re: "100% CPU usage during SFTP transfer" bugfix
Date: Tue, 18 Aug 2009 15:49:23 +0400
On 08/18/2009 01:05 PM, Daniel Stenberg wrote:
> For example, he was using an older libssh2 than you. Are you also on
> Windows?
I use Fedora Linux and RHEL. It is reproducible issue.
> It does give me the feeling it might be a libssh2 bug, but without more
> details we can't tell for sure.
>
>> Infinite loop (invoking of clock_gettime, poll, recv functions) loads
>> CPU.
>
> But why does it loop? When the EAGAIN situation is returned from
> libssh2, libcurl checks for what to wait for socket-wise and that
> shouldn't happen immediately. Is libssh2 telling us to wait for the
> wrong direction in this situation? Can you dig up more specific details
> on what exactly is happening?
Looks like it is not a libssh2 bug.
I've compiled 7.19.5 version of curl with the latest version of libssh2
(1.2). The test is simple (to upload a file to a remote host by SFTP):
src/curl -T ./big.iso -u anyuser:anypass sftp://my_host.com/~/
after that curl sends .iso file to the host. And gets 99% of CPU.
libssh2 opens a socket for sending the file:
socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 4
file/socket descriptor is 4.
After that:
send(4,
"D\320kE\33\10\253:\241~O\261\2407l\346\313\2275^\313X\305\303RY\264Y\201foH|"...,
16452, MSG_NOSIGNAL) = 16452 //sending data from the file
recv(4, 0x827c7e4, 16384, MSG_NOSIGNAL) = -1 EAGAIN (Resource temporarily
unavailable) //there is no socket data
clock_gettime(CLOCK_MONOTONIC, {14965, 269959641}) = 0
clock_gettime(CLOCK_MONOTONIC, {14965, 270097791}) = 0
clock_gettime(CLOCK_MONOTONIC, {14965, 270215983}) = 0
poll([{fd=4, events=POLLOUT}], 1, 1000) = 1 ([{fd=4, revents=POLLOUT}])
poll([{fd=4, events=POLLOUT}], 1, 0) = 1 ([{fd=4, revents=POLLOUT}])
recv(4, 0x827c7e4, 16384, MSG_NOSIGNAL) = -1 EAGAIN (Resource temporarily
unavailable) //check again, but still nothing
clock_gettime(CLOCK_MONOTONIC, {14965, 270749652}) = 0
clock_gettime(CLOCK_MONOTONIC, {14965, 270877781}) = 0
clock_gettime(CLOCK_MONOTONIC, {14965, 270972324}) = 0
poll([{fd=4, events=POLLOUT}], 1, 1000) = 1 ([{fd=4, revents=POLLOUT}])
poll([{fd=4, events=POLLOUT}], 1, 0) = 1 ([{fd=4, revents=POLLOUT}])
recv(4, 0x827c7e4, 16384, MSG_NOSIGNAL) = -1 EAGAIN (Resource temporarily
unavailable) //check again, nothing
clock_gettime(CLOCK_MONOTONIC, {14965, 271424763}) = 0
. . .
and many calls of "clock_gettime" and "poll" functions until:
recv(4, "\207<{\317@\t\256\36|\34\33\24P{\0A\212q
\346\34_\323\270V\322\225\222-\213\243\237\237"..., 16384, MSG_NOSIGNAL) =
68 //the data is received from the socket
Then we read the file and send the next portion of data:
read(3, "\1CD001\1\0LINUX "..., 16384) = 16384 //read
file data
send(4,
"D\320kE\33\10\253:\241~O\261\2407l\346\313\2275^\313X\305\303RY\264Y\201foH|"...,
16452, MSG_NOSIGNAL) = 16452 //send
invoke clock_gettime, poll & recv again and again until the data is received
from the socket.
Looks like these functions: clock_gettime, poll & recv are invoked in
infinite loop, because the socket is non-blocking and “recv” syscall does
not wait for data from the socket but exits immediately.
Here is the result of the test "Who does use the most of CPU time":
% time calls errors syscall
------ ----------- ----------- --------- ---------
39.82 70527 clock_gettime
34.33 46339 poll
20.92 23251 23199 recv
1.10 38 send
clock_gettime, poll are invoked very often.
We can see the following backtraces:
clock_gettime is invoked:
#0 0x00a2ecf4 in clock_gettime () from /lib/librt.so.1
#1 0x0074cf69 in curlx_tvnow () from
/home/nortel/curl-7.19.5/lib/.libs/libcurl.so.4
#2 0x00777842 in Curl_socket_ready () from
/home/nortel/curl-7.19.5/lib/.libs/libcurl.so.4
#3 0x0076e8cf in Curl_perform () from
/home/nortel/curl-7.19.5/lib/.libs/libcurl.so.4
#4 0x0076f58d in curl_easy_perform () from
/home/nortel/curl-7.19.5/lib/.libs/libcurl.so.4
#5 0x0804fc74 in main ()
clock_gettime is invoked:
#0 0x00a2ecc6 in clock_gettime () from /lib/librt.so.1
#1 0x002f6f69 in curlx_tvnow () from
/home/nortel/curl-7.19.5/lib/.libs/libcurl.so.4
#2 0x003176bd in Curl_readwrite () from
/home/nortel/curl-7.19.5/lib/.libs/libcurl.so.4
#3 0x003188ea in Curl_perform () from
/home/nortel/curl-7.19.5/lib/.libs/libcurl.so.4
#4 0x0031958d in curl_easy_perform () from
/home/nortel/curl-7.19.5/lib/.libs/libcurl.so.4
#5 0x0804fc74 in main ()
poll is invoked:
#0 0x00e25402 in __kernel_vsyscall ()
#1 0x003fc93b in poll () from /lib/libc.so.6
#2 0x003218ce in Curl_socket_ready () from
/home/nortel/curl-7.19.5/lib/.libs/libcurl.so.4
#3 0x003188cf in Curl_perform () from
/home/nortel/curl-7.19.5/lib/.libs/libcurl.so.4
#4 0x0031958d in curl_easy_perform () from
/home/nortel/curl-7.19.5/lib/.libs/libcurl.so.4
#5 0x0804fc74 in main ()
So, poll and clock_gettime are invoked from the libcurl.
Looks like these backtraces are performed by the following code in infinite
loop (lib/transfer.c file, Transfer() function):
switch (Curl_socket_ready(fd_read, fd_write, timeout_ms)) {
case -1: /* select() error, stop reading */
#ifdef EINTR
/* The EINTR is not serious, and it seems you might get this more
often when using the lib in a multi-threaded environment! */
if(SOCKERRNO == EINTR)
continue;
#endif
return CURLE_RECV_ERROR; /* indicate a network problem */
case 0: /* timeout */
default: /* readable descriptors */
result = Curl_readwrite(conn, &done);
/* "done" signals to us if the transfer(s) are ready */
break;
}
> libcurl is supposed to do that blocking magic on its own
> using the same mechanism that libssh2 uses internally when blocking is
> selected. It would then rather indicate that libssh2 _is_ right and that
> libcurl gets confused for some reason.
Could you please provide info, how to enable this mechanism on libcurl
level? "--disable-nonblock" option for the configure does not help and curl
still loads CPU to 100%.
>> Could you please tell me, can these changes cause any issues in
>> libcurl work?
>
> Yes. It'll make libcurl work less like it is supposed to, as libcurl
> wants to work non-blocking with the transfer-layer. It'll be most
> evident if you use the multi interface.
Blocking mode in the libssh2 is performed by non-blocked socket and the
select syscall. So, libcurl just waits for return from a libssh2 call
("libssh2_sftp_write", probably). After that it will invoke the same code.
Could you please tell me, what kind of issues can this cause?
> I'd like us to fix all bugs, yes. But fixing bugs is a lot of hard work
> and we're all busy people. I hope we can help you track this down when
> you tell us more details from what you experience.
I understand. You're doing very good job. Thanks a lot.
Please, see details above and feel free to comment.
Thanks,
Pavel
Received on 2009-08-18