curl-and-python

Using pycurl with streaming python interfaces?

From: <johansen_at_sun.com>
Date: Wed, 3 Dec 2008 15:53:04 -0800

Hi,
I searched around on the pycurl and libcurl archives, but I couldn't
find anything that describes my current puzzle. OpenSolaris has been
using python to implement our new package manager. In a number of
situations we've found urllib/httplib to be unsatisfactory. We're
considering switching to PycURL instead.

For some operations, the server from which a packaging client downloads
content may need to receive a large request. In these cases, we don't
want to grow the heap of the client or server unnecessarily. We opted
to use a streaming approach, sacrificing a bit of performance for less
memory usage.

If I've understood correctly, the canonical solution is to use the
WRITEFUNCTION callback to write the data as it arrives. This should be
sufficient for the majority of the cases we've implemented; however, I
have a few cases where such an approach doesn't fit with the Python
idioms we're using.
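
For reference, the callback approach might look roughly like the sketch
below. The DigestingSink class and the fetch_to_sink helper are
hypothetical names invented for illustration; the sketch assumes pycurl
is installed and only relies on WRITEFUNCTION treating a None return
value as "the whole chunk was consumed".

```python
import hashlib


class DigestingSink:
    """Hypothetical sink: hashes and stores each chunk as it arrives."""

    def __init__(self, fileobj):
        self.fileobj = fileobj
        self.sha1 = hashlib.sha1()
        self.nbytes = 0

    def write(self, chunk):
        # Called once per chunk; nothing is accumulated in memory here.
        self.sha1.update(chunk)
        self.fileobj.write(chunk)
        self.nbytes += len(chunk)
        # Returning None tells pycurl that the whole chunk was consumed.


def fetch_to_sink(url, sink):
    """Stream the body at `url` into `sink` (assumes pycurl is installed)."""
    import pycurl  # third-party; imported lazily so the sink works without it

    curl = pycurl.Curl()
    curl.setopt(pycurl.URL, url)
    curl.setopt(pycurl.WRITEFUNCTION, sink.write)
    try:
        curl.perform()  # blocks; sink.write runs for every chunk received
    finally:
        curl.close()
```

The callback runs inside perform(), so the caller only regains control
once the whole transfer has finished. That is the property that makes
the two cases below awkward.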

I'd like to offer two examples, one practical, the other theoretical,
and solicit some advice from other programmers on this list.

In each case, we'd like to perform some kind of additional
transformation on the data, pass it to another layer, and then continue
to receive some more data on that connection. The particulars are what
make this a bit challenging.

Case #1
-------

        res = urllib2.urlopen(...)

        for line in res:
                fields = line.split(None, 3)
                if len(fields) < 4:
                        yield fields[:2] + [ "", "" ]
                else:
                        yield fields[:4]

The above case reads lines from the file-object returned by urlopen,
and then yields the split fields of each line to the caller of this
function. Obviously, I can't use yield in the write callback. It
occurred to me that if I used the multi interface and stored the lines
in a list, I could yield these lines in between calls to multi_perform.
Is there a way to do this with the easy interface, or another, better
approach that I'm missing?
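
One possible shape for the multi-interface idea, as a minimal sketch
rather than a tested implementation: the write callback only buffers
complete lines, and a generator yields them between calls to
CurlMulti.perform(). LineBuffer and iter_lines are hypothetical names,
pycurl is assumed to be installed, and lines are handled as bytes.

```python
class LineBuffer:
    """Hypothetical helper: collects chunks and splits them into lines."""

    def __init__(self):
        self._partial = b""
        self.lines = []

    def feed(self, chunk):
        # Usable as a WRITEFUNCTION callback; returns None (all consumed).
        data = self._partial + chunk
        *complete, self._partial = data.split(b"\n")
        self.lines.extend(line + b"\n" for line in complete)

    def flush(self):
        # Emit a final line that had no trailing newline, if any.
        if self._partial:
            self.lines.append(self._partial)
            self._partial = b""

    def drain(self):
        lines, self.lines = self.lines, []
        return lines


def iter_lines(url, timeout=1.0):
    """Yield lines from `url` as they arrive (assumes pycurl is installed)."""
    import pycurl  # third-party; imported lazily

    buf = LineBuffer()
    easy = pycurl.Curl()
    easy.setopt(pycurl.URL, url)
    easy.setopt(pycurl.WRITEFUNCTION, buf.feed)
    multi = pycurl.CurlMulti()
    multi.add_handle(easy)
    try:
        running = 1
        while running:
            while True:
                ret, running = multi.perform()
                if ret != pycurl.E_CALL_MULTI_PERFORM:
                    break
            yield from buf.drain()     # hand buffered lines to the caller
            if running:
                multi.select(timeout)  # wait for more socket activity
        _, _, failed = multi.info_read()
        for _handle, errno, errmsg in failed:
            raise pycurl.error(errno, errmsg)
        buf.flush()
        yield from buf.drain()
    finally:
        multi.remove_handle(easy)
        multi.close()
        easy.close()
```

With something like this, the loop from Case #1 could stay as it is,
iterating over iter_lines(...) instead of the urlopen result. Memory use
is bounded by whatever arrives during one round of perform() calls.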

Case #2
-------

        f = urlopen(...)

        tar_stream = TarFile.open(mode = "r|", fileobj = f)
        for info in tar_stream:
                tar_stream.extract(info, ...)

This is the theoretical case, since we're actually going to rip this
code out. However, it occurred to me that it would still be interesting
to figure out how to implement this, if, for example, we need backwards
compatibility.

The TarFile class can take a file-object, or an object that implements
the same interfaces that a file does, and will read from that as data
becomes available.

I looked in libcurl and pycurl, but I didn't see any interface that
would let me access data in the response body as a file-object, without
first downloading the entire response.

If one were to try to accomplish this today, is there a way to read the
data in the response, a bit at a time, so that it may be streamed into
the file-like interface that the TarFile and gzip classes expect?
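
One way this could be approached, sketched under assumptions: run the
transfer in a background thread and let the write callback push chunks
into a bounded queue, while a small object with a read() method pulls
from that queue. CallbackReader and curl_producer are hypothetical
names; nothing like them is provided by pycurl itself, and pycurl is
assumed to be installed for curl_producer.

```python
import queue
import threading

_EOF = object()  # sentinel marking the end of the response body


class CallbackReader:
    """File-like reader fed by a write callback running in another thread.

    `producer` is any callable that takes a write callback and calls it
    with successive byte chunks until the body is complete.
    """

    def __init__(self, producer, max_chunks=16):
        self._queue = queue.Queue(maxsize=max_chunks)  # bounds memory use
        self._buf = b""
        self._done = False
        self._error = None
        self._thread = threading.Thread(
            target=self._run, args=(producer,), daemon=True)
        self._thread.start()

    def _run(self, producer):
        try:
            producer(self._queue.put)  # put() blocks while the queue is full
        except Exception as exc:
            self._error = exc
        finally:
            self._queue.put(_EOF)

    def read(self, size=-1):
        # Pull chunks until the request is satisfied or the body ends.
        while not self._done and (size < 0 or len(self._buf) < size):
            chunk = self._queue.get()
            if chunk is _EOF:
                self._done = True
                if self._error is not None:
                    raise self._error
            else:
                self._buf += chunk
        if size < 0:
            size = len(self._buf)
        data, self._buf = self._buf[:size], self._buf[size:]
        return data


def curl_producer(url):
    """Build a producer for CallbackReader (assumes pycurl is installed)."""
    def produce(write):
        import pycurl  # third-party; imported lazily
        curl = pycurl.Curl()
        curl.setopt(pycurl.URL, url)
        curl.setopt(pycurl.WRITEFUNCTION, write)
        try:
            curl.perform()
        finally:
            curl.close()
    return produce
```

An instance of CallbackReader(curl_producer(...)) could then be passed
as fileobj to the TarFile.open(mode = "r|", ...) call above, since
stream mode only needs read(). One caveat of this design: if the
consumer stops reading early, the transfer thread stays blocked on the
full queue until the process exits.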

Thanks,

-j
Received on 2008-12-04