curl-and-python

Re: postfields and unicode objects

From: Toshio Kuratomi <a.badger_at_gmail.com>
Date: Thu, 13 Aug 2009 11:41:55 -0700

On 08/13/2009 09:27 AM, johansen_at_sun.com wrote:
> It looks like unicode should work.
>
> Based upon this description, either the object isn't a unicode object
> that's a subclass of string, or you've found a bug in the Python C API.
>
> Here's the relevant bit of pycurl.c (1635-1647):
>
>     case CURLOPT_POSTFIELDS:
>         if (PyString_AsStringAndSize(obj, &str, &len) != 0)
>             return NULL;
>         /* automatically set POSTFIELDSIZE */
>         if (len <= INT_MAX) {
>             res = curl_easy_setopt(self->handle, CURLOPT_POSTFIELDSIZE, (lon
>         } else {
>             res = curl_easy_setopt(self->handle, CURLOPT_POSTFIELDSIZE_LARGE
>         }
>         if (res != CURLE_OK) {
>             CURLERROR_RETVAL();
>         }
>         break;
>
> The only place this code is checking types is in the if statement.
>

The code that's causing this is earlier around line 1587:

    /* Handle the case of string arguments */
    if (PyString_Check(obj)) {
         [...]

PyString_Check only checks for StringType, not UnicodeType. I don't see
a way to check for BaseStringType, which is what you might expect to do
here, but you can change that to:

    if (PyString_Check(obj) || PyUnicode_Check(obj)) {

which achieves much the same thing as checking for BaseStringType.

However, if we just fix this, we will throw UnicodeEncodeErrors later
because python-2 tries to encode to ASCII by default. There are a couple
of ways around this. Probably the best for your use case is to check for
unicode and convert to a byte string using utf-8 or another encoding
that covers the full range of unicode. That would look like this:

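    /* Convert any Unicode types to utf-8 */
    if (PyUnicode_Check(obj)) {
        /* Returns a new reference (to be released later), or NULL on error */
        obj = PyUnicode_AsUTF8String(obj);
        if (obj == NULL)
            return NULL;
    }

    /* Handle the case of string arguments */
    if (PyString_Check(obj)) {
         [...]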

That's a little funky in that the user doesn't have the chance to
convert to anything but utf-8. However, since the API accepts byte
strings as well, the user can get a different encoding (say latin-1) by
doing the conversion in their code::

>>> import pycurl
>>> c = pycurl.Curl()
>>> data = u'data=café'
>>> # Accept the default of utf-8 encoded data
>>> c.setopt(c.POSTFIELDS, data)
>>> # Explicitly use a different encoding
>>> encoded_data = data.encode('latin-1')
>>> c.setopt(c.POSTFIELDS, encoded_data)

If you go this route you want to document that any unicode strings will
be converted to utf-8 before being passed to curl. Any byte strings
will be passed through as is.
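
The rule described above (unicode is encoded to utf-8, byte strings pass
through untouched) can be sketched in pure Python. This is only an
illustration under stated assumptions: it is written for Python 3, where
`str` plays the part of python-2's `unicode` and `bytes` plays the part
of `str`, and the helper name `to_postfields_bytes` is made up here; it
is not part of pycurl.

```python
def to_postfields_bytes(value):
    """Hypothetical helper: normalise a POSTFIELDS value to a byte string.

    Assumption: Python 3 semantics (str is text, bytes is a byte string).
    """
    if isinstance(value, str):
        # Text is always encoded as UTF-8; the caller gets no other choice.
        return value.encode("utf-8")
    if isinstance(value, bytes):
        # Byte strings are passed through as is, whatever their encoding.
        return value
    raise TypeError("expected str or bytes, got %s" % type(value).__name__)


# Default path: text goes out as UTF-8.
utf8_body = to_postfields_bytes("data=café")

# Caller-chosen encoding: encode first, then the bytes pass through unchanged.
latin1_body = to_postfields_bytes("data=café".encode("latin-1"))
```

The two results differ only in how the `é` is represented on the wire,
which is why the behaviour needs to be documented.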

If you want a clean separation between unicode and byte strings so the
user always has to do the same thing, you'd want to accept either bytes
or unicode but not both. If you settle on bytes, the user will always
have to encode unicode types to a byte string in their code (this is the
current situation). If you settle on unicode, you'd need to add two
arguments to setopt so the user can specify what encoding to use and
whether to throw an exception if the unicode string can't be encoded
with that encoding. These would be passed on to
PyUnicode_AsEncodedString() to do the conversion to a byte string.
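
For comparison, here is a rough Python sketch of the unicode-only
variant with the two extra arguments. Again the assumptions are mine:
Python 3 semantics (`str` standing in for python-2's `unicode`), and a
hypothetical helper name `encode_postfields` that does not exist in
pycurl. The `encoding` and `errors` parameters mirror the ones that
PyUnicode_AsEncodedString() takes.

```python
def encode_postfields(value, encoding="utf-8", errors="strict"):
    """Hypothetical helper: accept text only and encode it as requested.

    Assumption: Python 3 semantics (str is text, bytes is a byte string).
    """
    if not isinstance(value, str):
        # Byte strings are rejected so callers always follow the same path.
        raise TypeError("expected text, got %s" % type(value).__name__)
    # errors="strict" raises UnicodeEncodeError on unencodable characters;
    # other handlers such as "replace" substitute a placeholder instead.
    return value.encode(encoding, errors)


latin1_body = encode_postfields("data=café", encoding="latin-1")
lossy_body = encode_postfields("data=café", encoding="ascii", errors="replace")
```

With this design the caller never encodes by hand, but every call site
that needs something other than utf-8 has to say so explicitly.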

-Toshio

_______________________________________________
http://cool.haxx.se/cgi-bin/mailman/listinfo/curl-and-python

Received on 2009-08-13