
curl-library

Re: libcurl - windows / unicode filenames support...

From: Sergei Nikulov <sergey.nikulov_at_gmail.com>
Date: Tue, 9 Aug 2016 11:59:25 +0300

2016-08-05 22:27 GMT+03:00 Ray Satiro via curl-library
<curl-library_at_cool.haxx.se>:
> On 8/5/2016 6:16 AM, Sergei Nikulov wrote:
>
> 2016-08-05 12:11 GMT+03:00 Rod Widdowson <rdw_at_steadingsoftware.com>:
>
>> Aside, but curious minds need to know.
>>
>> As a newcomer here - can someone help me understand what "Unicode for
>> Windows" means? I have to assume it is in URL handling, not files? The
>> word UTF8 has to be the give-away, since UTF8 is a pretty alien concept
>> for Windows at the k-mode interface (where I mostly hang out).
>
> +1
>
> UTF-16 (wide character) encoding, which is the most common encoding of
> Unicode and the one used for native Unicode encoding on Windows
> operating systems.
>
> So I am also wondering how it can encode UTF-8 in file names.
>
>
> Supporting Unicode in Windows has been discussed in #345 [1]. While I
> acknowledge UTF-16 is the native choice I thought it would be easier to pass
> around UTF-8 in the library, that way we wouldn't have to implement a bunch
> of sister libcurl functions for wide characters. The problem with that is
> that UTF-8 is not properly supported as a locale (except maybe on Cygwin)
> by the underlying MS C runtime (CRT), so it won't do the conversions
> automatically. For example, before we call a function like fopen with a
> UTF-8 filename we'd have to convert it to UTF-16 stored in wide chars and
> instead call _wfopen [2]

All the conversion is handled automatically by defining UNICODE and _UNICODE
for Windows builds.
So my idea is simple - typedef some kind of CURL_CHAR and use it instead of
plain char.
This typedef would be a plain char on Unix/Linux variants and TCHAR on Windows.

This will also save some typing, for example in ldap.c where I see a lot of
#ifdef WIN32 ...
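
A minimal sketch of that typedef idea (CURL_CHAR is a hypothetical name here,
not an existing libcurl type):

/* Hypothetical CURL_CHAR: maps to TCHAR on Windows, plain char elsewhere. */
#ifdef _WIN32
#include <tchar.h>
typedef TCHAR CURL_CHAR;  /* char in ANSI builds, wchar_t with -DUNICODE -D_UNICODE */
#else
typedef char CURL_CHAR;   /* plain char on Unix/Linux variants */
#endif

Code using CURL_CHAR together with the tchar.h _t*() function macros would
then build as either narrow or wide on Windows, depending on whether UNICODE
and _UNICODE are defined.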

If you need UTF-8 on Windows, you should build with Windows Unicode
(-DUNICODE -D_UNICODE) and use a WCHAR -> UTF-8 code page conversion.
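
A minimal sketch of that WCHAR -> UTF-8 conversion, assuming a Windows
Unicode build (wc_to_utf8 is a hypothetical helper name):

#include <windows.h>
#include <stdlib.h>

/* Convert a wide-character string to a newly allocated UTF-8 string. */
static char *wc_to_utf8(const wchar_t *in)
{
  /* First call computes the required buffer size, including the NUL. */
  int len = WideCharToMultiByte(CP_UTF8, 0, in, -1, NULL, 0, NULL, NULL);
  char *out = len ? malloc(len) : NULL;
  if(out)
    WideCharToMultiByte(CP_UTF8, 0, in, -1, out, len, NULL, NULL);
  return out;  /* caller frees */
}

The reverse direction (UTF-8 -> WCHAR) works the same way with
MultiByteToWideChar and CP_UTF8.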

> since there is no way to set the locale to UTF-8. We'd have to handle that
> for a lot of CRT functions, basically making a layer over the CRT, which
> would also be painful. It seems like either way we'd have to create a bunch
> of functions, but I suspected the latter would be easier to maintain since
> they're essentially just wrappers. But how do we know in many of our library
> functions whether a string we're passed is UTF-8 or just ANSI? That's
> another problem. And another one is displaying Unicode characters in the
> console, which didn't always work well, although with Consolas it has gotten
> better.
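
A minimal sketch of the kind of wrapper described above: convert a UTF-8
filename to UTF-16 and call _wfopen instead of fopen (fopen_utf8 is a
hypothetical name, not an existing libcurl function):

#include <windows.h>
#include <stdio.h>
#include <stdlib.h>

/* Open a file whose name is given as UTF-8, via _wfopen. */
static FILE *fopen_utf8(const char *filename, const char *mode)
{
  FILE *fp = NULL;
  int flen = MultiByteToWideChar(CP_UTF8, 0, filename, -1, NULL, 0);
  int mlen = MultiByteToWideChar(CP_UTF8, 0, mode, -1, NULL, 0);
  wchar_t *wfile = flen ? malloc(flen * sizeof(wchar_t)) : NULL;
  wchar_t *wmode = mlen ? malloc(mlen * sizeof(wchar_t)) : NULL;
  if(wfile && wmode) {
    MultiByteToWideChar(CP_UTF8, 0, filename, -1, wfile, flen);
    MultiByteToWideChar(CP_UTF8, 0, mode, -1, wmode, mlen);
    fp = _wfopen(wfile, wmode);
  }
  free(wfile);
  free(wmode);
  return fp;
}

Every CRT function that takes a filename or other string would need a similar
wrapper, which is the "layer over the CRT" mentioned in the quoted paragraph.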
>
> A few people have shown interest in this but it waned. Make no mistake it
> will take a lot of time to implement properly in a way that is maintainable,
> which is very important. The issues have essentially been abandoned because
> nobody has the time, but feel free to resurrect them if you want to do the
> work. Going forward, I think it is important that we all have some consensus
> on a design before any other work is put in. It could be done in a way that
> is piecemeal, like only for filenames first, but we should agree on some
> sort of ultimate plan first.
>
>
> [1]: https://github.com/curl/curl/issues/345
> [2]: https://msdn.microsoft.com/en-us/library/yeby3zcb.aspx
>

-- 
Best Regards,
Sergei Nikulov
-------------------------------------------------------------------
List admin: https://cool.haxx.se/list/listinfo/curl-library
Etiquette:  https://curl.haxx.se/mail/etiquette.html
Received on 2016-08-09