curl-library
Re: Stalls when PUT-ing to Amazon S3
Date: Thu, 18 Dec 2008 11:28:27 -0500
> > We've been encountering some strange stall behavior, where (lib)curl seems
> > to simply stop sending data. The errors are reproducible, but sporadic.
> > We're happy to try debugging suggestions as well as possible solutions.
>
> Can any (other) client PUT to that server?
Yes -- earlier versions of Amanda (2.6.0) are in the wild uploading to
Amazon S3 with no trouble, using the same version of libcurl. Several
beta testers have reported these RequestTimeout errors in 2.6.1b1.
> I'm looking at the 7.19.2-excerpt pcap and I think the TCP flow looks very
> strange. Starting at frame #1803 there are several hundred duplicated TCP
> acks!
Yes, exactly, and my initial response was the same as yours: how can
this weird TCP behavior, apparently on the part of Amazon, be caused
by curl? Yet the one variable that reliably controls the presence of
these failures is the revision of the code *using* libcurl (Amanda).
Just to throw a little extra fun on the fire: when the "400 Bad
Request"/RequestTimeout occurs, Amanda retries 14 times, as
recommended by Amazon. These retries use new TCP connections (but not
new curl handles) to new Amazon S3 endpoints (a different IP, anyway),
but result in another RequestTimeout after 30 seconds.
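To make the retry behavior described above concrete, here is a minimal sketch of such a loop. It is not Amanda's actual code: `put_with_retries`, `put_once`, and the string result values are hypothetical names chosen for illustration, and the only detail taken from the description is the count of 14 retries after the initial attempt.

```python
from typing import Callable, Tuple

MAX_RETRIES = 14  # retry count mentioned in the message above


def put_with_retries(put_once: Callable[[], str],
                     max_retries: int = MAX_RETRIES) -> Tuple[str, int]:
    """Call put_once() until it returns something other than
    "RequestTimeout", or until the retries are used up.

    put_once is a hypothetical callable standing in for one upload
    attempt on a fresh connection. Returns (last_result, attempts_made).
    """
    attempts = 0
    result = ""
    # One initial attempt plus up to max_retries retries.
    while attempts <= max_retries:
        attempts += 1
        result = put_once()
        if result != "RequestTimeout":
            break
    return result, attempts
```

With an upload that always times out, this makes 15 attempts in total (the original plus 14 retries) before giving up.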
I've narrowed the failures down; the following revision fails, while
its parent does not:
http://github.com/nikolasco/amanda/commit/0beedc9eb238592c6e34444c6a79b9d0f8c3acdb
Since the failures take several hours and saturate my 'net connection,
I can't be 100% confident in that identification. Nonetheless, I've
looked through this patch carefully, and there are no changes that
would affect sending data. We added CURLOPT_HEADERFUNCTION/DATA and
CURLOPT_PROGRESSFUNCTION/DATA callbacks, and removed the MAXFILESIZE stuff.
What "state" could be maintained on the client across connections that
would have some manifestation in the TCP conversation, and that might
be affected by the different flags applied to the curl connection? I
wouldn't be too surprised to find that Amazon rejects packets with the
high bit of their TCP sequence number set, or something equally
ridiculous, but with ISN randomization, this circumstance wouldn't
repeat 14 times in a row.
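As a rough check on that last point: if each connection's ISN were an independent, uniformly random 32-bit value (an idealization, since real TCP stacks generate ISNs differently), the chance of the high bit being set on 14 consecutive connections would be 0.5^14, or 1 in 16384. A small sketch of the arithmetic, with a hypothetical function name:

```python
def prob_high_bit_every_time(connections: int) -> float:
    """Probability that the top bit of the ISN is set on every one of
    `connections` connections, ASSUMING each ISN is an independent,
    uniformly random 32-bit value (so the top bit is set half the time).
    """
    if connections < 0:
        raise ValueError("connections must be non-negative")
    return 0.5 ** connections


if __name__ == "__main__":
    p = prob_high_bit_every_time(14)
    print(f"14 in a row: {p:.3e} (1 in {round(1 / p)})")
```

Under that assumption, a purely sequence-number-dependent rejection would be very unlikely to repeat across all 14 attempts.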
We're grasping at straws here, and interested in any ideas, no matter
how harebrained!
Dustin
-- Storage Software Engineer http://www.zmanda.com
Received on 2008-12-18