Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

broken UTF-8 encoded content terminal output on Windows #9841

Closed
OSPanel opened this issue Nov 1, 2022 · 35 comments
Closed

broken UTF-8 encoded content terminal output on Windows #9841

OSPanel opened this issue Nov 1, 2022 · 35 comments
Assignees
Labels
cmdline tool unicode Unicode, code page, character encoding Windows Windows-specific

Comments

@OSPanel
Copy link

OSPanel commented Nov 1, 2022

Latest version (x86 & x64 tested) with cmd.exe (Windows 10).

When requesting data using curl.exe output to the console occurs with errors if the response is in UTF 8 encoding.

The output failure occurs in about 1 attempt out of 5 and usually in the same arbitrary place regardless of the text. Most likely, it's all about the incorrect gluing of chunks.

Wget has no such problem (tested).

TEST (do a lot of attempts, the output will sometimes be broken):
curl.exe https://raw.githubusercontent.com/OSPanel/OpenServerPanel/main/system/lang/Russian.txt

Incorrect results: https://raw.githubusercontent.com/OSPanel/OpenServerPanel/main/resources/error.png

@vszakats
Copy link
Member

vszakats commented Nov 1, 2022

Does this issue happen only with curl-for-win builds, or with any 7.86.0 Windows build?

@OSPanel
Copy link
Author

OSPanel commented Nov 1, 2022

All these versions have this bug 7.86, 7.80, 7.79 with Windows 10 and Windows 7 (other builds not tested, other Windows version not tested):

x64
https://curl.se/windows/dl-7.86.0/curl-7.86.0-win64-mingw.zip
https://curl.se/windows/dl-7.80.0/curl-7.80.0-win64-mingw.zip
...etc
x32
https://curl.se/windows/dl-7.86.0/curl-7.86.0-win32-mingw.zip
https://curl.se/windows/dl-7.80.0/curl-7.80.0-win32-mingw.zip
...etc

Other tests:

Curl version 7.85 for Linux (Arch Linux) does not have this error (Kitty/Putty terminal).

@vszakats
Copy link
Member

vszakats commented Nov 1, 2022

Thanks. So this is a long time issue. Can you try with an MSYS2 / Git for Windows / Windows built-in curl.exe to see how those behave and what their -V output is?

[ This may be related to the Unicode build option. curl-for-win had it enabled in 7.71.0–7.77.0. ]

@OSPanel
Copy link
Author

OSPanel commented Nov 1, 2022

MSYS2 (Bug detected):

curl -V
curl 7.85.0 (x86_64-w64-mingw32) libcurl/7.85.0 OpenSSL/1.1.1q (Schannel) zlib/1.2.12 brotli/1.0.9 z
std/1.5.2 libidn2/2.3.3 libpsl/0.21.1 (+libidn2/2.3.1) libssh2/1.10.0 nghttp2/1.48.0
Release-Date: 2022-08-31
Protocols: dict file ftp ftps gopher gophers http https imap imaps ldap ldaps mqtt pop3 pop3s rtsp s
cp sftp smb smbs smtp smtps telnet tftp
Features: alt-svc AsynchDNS brotli HSTS HTTP2 HTTPS-proxy IDN IPv6 Kerberos Largefile libz MultiSSL
NTLM PSL SPNEGO SSL SSPI threadsafe TLS-SRP zstd

Git for Windows (Bug detected):

curl -V
curl 7.79.1 (Windows) libcurl/7.79.1 Schannel
Release-Date: 2021-09-22
Protocols: dict file ftp ftps http https imap imaps pop3 pop3s smtp smtps telnet tftp
Features: AsynchDNS HSTS IPv6 Kerberos Largefile NTLM SPNEGO SSL SSPI UnixSockets

Windows 10 built-in (Bug NOT detected):

curl -V
curl 7.55.1 (Windows) libcurl/7.55.1 WinSSL
Release-Date: 2017-11-14, security patched: 2019-11-05
Protocols: dict file ftp ftps http https imap imaps pop3 pop3s smtp smtps telnet tftp
Features: AsynchDNS IPv6 Largefile SSPI Kerberos SPNEGO NTLM SSL

@vszakats
Copy link
Member

vszakats commented Nov 1, 2022

Thanks @OSPanel. I'm transferring this to the curl project, as the Issue affects all (recent) Windows builds.

If you are prepared to make your own curl build, you may try enabling the Unicode feature to see if it resolves this issue.

@vszakats vszakats transferred this issue from curl/curl-for-win Nov 1, 2022
@vszakats vszakats changed the title BUG - broken UTF 8 encoded content output BUG - broken UTF-8 encoded content output on Windows Nov 1, 2022
@vszakats vszakats added the Windows Windows-specific label Nov 1, 2022
@OSPanel
Copy link
Author

OSPanel commented Nov 1, 2022

I can't compile the binary, but I can check the test binary if needed.

@vszakats
Copy link
Member

vszakats commented Nov 1, 2022

@OSPanel: You can find fresh Unicode builds here: https://ci.appveyor.com/project/curlorg/curl-for-win/builds/45248995/artifacts

@OSPanel
Copy link
Author

OSPanel commented Nov 2, 2022

Thanks. I downloaded and tested these x86+x64 builds (dated 10/26/2022), and this bug was found there :-(

@vszakats
Copy link
Member

vszakats commented Nov 2, 2022

I freshly made these builds today, right before posting here. They have Unicode enabled (check curl -V).

[ For reproducibility the date is always the same for a given curl version number. ]

@OSPanel
Copy link
Author

OSPanel commented Nov 2, 2022

I used this link: https://ci.appveyor.com/project/curlorg/curl-for-win/builds/45248995/artifacts
Release-Date: 2022-10-26 (Bug detected)

curl -V
curl 7.86.0 (x86_64-w64-mingw32) libcurl/7.86.0 OpenSSL/3.0.5 (Schannel) zlib/1.2.13 brotli/1.0.9 zstd/1.5.2 WinIDN libssh2/1.10.0 nghttp2/1.50.0 ngtcp2/0.10.0 nghttp3/0.7.1 libgsasl/2.2.0
Release-Date: 2022-10-26
Protocols: dict file ftp ftps gopher gophers http https imap imaps ldap ldaps mqtt pop3 pop3s rtsp scp sftp smb smbs smtp smtps telnet tftp ws wss
Features: alt-svc AsynchDNS brotli gsasl HSTS HTTP2 HTTP3 HTTPS-proxy IDN IPv6 Kerberos Largefile libz MultiSSL NTLM SPNEGO SSL SSPI threadsafe TLS-SRP Unicode UnixSockets zstd

@vszakats
Copy link
Member

vszakats commented Nov 2, 2022

@OSPanel: That's the correct one indeed, and thanks for testing it!

Based on your tests, this issue is present in both Unicode and non-Unicode builds.

@bagder
Copy link
Member

bagder commented Nov 2, 2022

I don't see an explanation exactly what the symptoms of this error is?

Is it perhaps code from this commit that causes it? 5bfaa86

@vszakats
Copy link
Member

vszakats commented Nov 2, 2022

I'm making an attempt to revert this patch in the test build here: https://ci.appveyor.com/project/curlorg/curl-for-win/builds/45254038

Even though the notion of the patch seems correct to me and changing terminal codepage on the fly was fragile/broken at least before Windows 10.

Binaries here: https://ci.appveyor.com/project/curlorg/curl-for-win/builds/45254038/artifacts

@OSPanel
Copy link
Author

OSPanel commented Nov 2, 2022

I'm making an attempt to revert this patch in the test build here: https://ci.appveyor.com/project/curlorg/curl-for-win/builds/45254038

Even though the notion of the patch seems correct to me and changing terminal codepage on the fly was fragile/broken at least before Windows 10.

Thanks. This test build works well, the bug we are discussing has not been detected.

curl -V
curl 7.86.0 (x86_64-w64-mingw32) libcurl/7.86.0 OpenSSL/3.0.5 (Schannel) zlib/1.2.13 brotli/1.0.9 zstd/1.5.2 WinIDN libssh2/1.10.0 nghttp2/1.50.0 ngtcp2/0.10.0 nghttp3/0.7.1 libgsasl/2.2.0
Release-Date: 2022-10-26
Protocols: dict file ftp ftps gopher gophers http https imap imaps ldap ldaps mqtt pop3 pop3s rtsp scp sftp smb smbs smtp smtps telnet tftp ws wss
Features: alt-svc AsynchDNS brotli gsasl HSTS HTTP2 HTTP3 HTTPS-proxy IDN IPv6 Kerberos Largefile libz MultiSSL NTLM SPNEGO SSL SSPI threadsafe TLS-SRP UnixSockets zstd

@vszakats
Copy link
Member

vszakats commented Nov 2, 2022

It means that 5bfaa86 is causing the difference.

According to past experience fixing this exact issue in a different project, and this earlier Python Issue, the WriteConsoleW() call should be correct here. Maybe the CP/encoding translation before it has issues? E.g. when the input stream has a multibyte character which is processed in two passes, breaking the char into two halves?

@jay
Copy link
Member

jay commented Nov 3, 2022

Yes, I think the OP called it correctly. We are passing an incomplete UTF-8 encoding to MultiByteToWideChar more than one time (up to 4 times since there are up to 4 bytes in UTF-8 encoded characters). That results in an invalid character � (U+FFFD) output for each incomplete UTF-8 encoded character.

I cannot reproduce with the OP's URL but I have no doubt this issue is valid. Let's take the first mangled output for example:

Команда доступна только для пассивных (не подлежащих запуску) модулей!
                                    ^
Команда доступна только для пассивны�� (не подлежащих запуску) модулей!

A caret is pointing to the х that becomes mangled. Though it looks like the regular ascii x (U+0078) in the screenshot it's actually a cyrillic х (U+0445) which is represented in UTF-8 encoding as D1 85. It looks likely that MultiByteToWideChar received D1 at the end of one buffer and then 85 at the start of the next buffer.

I think that can be fixed by delaying the conversion and output for incomplete Unicode UTF-8 encoded characters until they are complete. I had at one point written a UTF-8 correctness function for libcurl for other reasons but I don't think we added it. I will see if maybe some of that logic applies here.

Also, the comments in those python threads about WriteConsole long output failing concern me. The MS documentation currently says "If the total size of the specified number of characters exceeds the available heap, the function fails with ERROR_NOT_ENOUGH_MEMORY" but according to the python threads it used to reference a specific shared 64k heap that may have an arbitrary size available. That is something we'd need to actually take into consideration though I wonder why they changed the documentation. I will look into this as well.

@jay jay self-assigned this Nov 3, 2022
@bagder
Copy link
Member

bagder commented Nov 16, 2022

Any progress on this @jay? It feels like a rather complicated dilemma.

@jay
Copy link
Member

jay commented Nov 17, 2022

I think it's possible. I was focused on other bugs but I will take a look at it tomorrow.

@bagder bagder changed the title BUG - broken UTF-8 encoded content output on Windows broken UTF-8 encoded content terminal output on Windows Nov 28, 2022
@OSPanel
Copy link
Author

OSPanel commented Dec 14, 2022

5bfaa86 - I assume that this is a bad fix. You cannot output UTF 8 to the console if the current encoding is different from 65001. Therefore, the old code worked well, it cached the current encoding in the console and returned it after the output was completed. This is a normal approach. And if someone has low-quality fonts installed that do not support UTF 8, and because of this their console looks bad, so it's not a CURL problem, it's a problem of the user and his chosen font. Just return the old code by canceling this commit.
Or is there any news on the fix?

@jay
Copy link
Member

jay commented Dec 15, 2022

I've been busy and haven't had a chance to work on it. curl outputs UTF-16 to the console, it is the conversion UTF-8 => UTF-16 which is failing in this case. The user's font may not be able to display some valid Unicode characters that are being output but IMO that is not a curl issue.

@OSPanel
Copy link
Author

OSPanel commented Dec 15, 2022

Yes, before the commit 5bfaa86 was applied, curl's work in the console depended only on the user's font, which is why I suggest returning the old code. Now, after that commit was accepted, the output to the console is completely broken. Maybe you will return the old code until you get around to changing something?

@vszakats
Copy link
Member

The old code was broken in a different way, e.g. it could leave the console in an unstable state (leading to a console crash).

@NativeDevMan
Copy link

Any news on fixes yet?

@OSPanel
Copy link
Author

OSPanel commented Feb 9, 2023

curl https://wttr.in/London?lang=nl
As it turned out, CURL distorts not only the text output, but even the output of simple graphic lines. Reproducibility of the problem - every first or second request (~70% of requests).
er2

@dfandrich
Copy link
Contributor

That last problem doesn't seem related to me. Also, how do you know it's a curl problem and not a problem with the server, network, network stack or terminal?

@OSPanel
Copy link
Author

OSPanel commented Feb 10, 2023

That last problem doesn't seem related to me. Also, how do you know it's a curl problem and not a problem with the server, network, network stack or terminal?

The problem is related to encoding, so it is not always reproduced, it depends on the output data. When using the city of London, the problem is no longer visible, but from the 10th time I was able to pick up a location where the reproducibility of the breach is 100% right now (the weather will change during your viewing and you may have to pick up another location):

curl.exe https://wttr.in/Malta?lang=nl
curl.exe https://wttr.in/pitsburg?lang=nl

er1
er2

@dfandrich
Copy link
Contributor

dfandrich commented Feb 10, 2023 via email

@OSPanel
Copy link
Author

OSPanel commented Feb 10, 2023

Could it be an issue with output when curl detects a terminal and not when it's directed to a file?

So it is, only the output to the console is broken. Read the posts above here.

@dfandrich
Copy link
Contributor

dfandrich commented Feb 10, 2023 via email

@dfandrich
Copy link
Contributor

Jay's Nov. 3 comment seems to confirm that this is basically what's happening and it's the same issue.

@OSPanel
Copy link
Author

OSPanel commented Feb 10, 2023

Jay's Nov. 3 comment seems to confirm that this is basically what's happening and it's the same issue.

Maybe curl is better to just stop outputting data to the console until the entire response is received, because this leads to distortion when gluing bytes...

@dfandrich
Copy link
Contributor

dfandrich commented Feb 10, 2023 via email

@vszakats vszakats added the unicode Unicode, code page, character encoding label Feb 16, 2023
@OSPanel
Copy link
Author

OSPanel commented Apr 3, 2023

Can anyone help with fixing this error, please. Jay hasn't had time for this in 5 months.

jay added a commit to jay/curl that referenced this issue Apr 4, 2023
@jay
Copy link
Member

jay commented Apr 4, 2023

I have some changes for this but I'm still working on it.
https://github.com/curl/curl/compare/master...jay:curl:utf8_win_console_fix?expand=1

jay added a commit to jay/curl that referenced this issue Apr 5, 2023
- Suppress an incomplete UTF-8 sequence at the end of the buffer.

- Attempt to reconstruct incomplete UTF-8 sequence from prior call(s)
  in current call.

Prior to this change, in Windows console, UTF-8 sequences split between
two or more calls to the write callback would cause invalid "replacement
characters" U+FFFD to be printed instead of the actual Unicode
character. This is because in Windows only UTF-16 encoded characters are
printed to the console, therefore we convert the UTF-8 contents to
UTF-16, which cannot be done with partial UTF-8 sequences.

Reported-by: Maksim Arhipov

Fixes curl#9841
Closes #xxx
jay added a commit to jay/curl that referenced this issue Apr 29, 2023
- Suppress an incomplete UTF-8 sequence at the end of the buffer.

- Attempt to reconstruct incomplete UTF-8 sequence from prior call(s)
  in current call.

Prior to this change, in Windows console, UTF-8 sequences split between
two or more calls to the write callback would cause invalid "replacement
characters" U+FFFD to be printed instead of the actual Unicode
character. This is because in Windows only UTF-16 encoded characters are
printed to the console, therefore we convert the UTF-8 contents to
UTF-16, which cannot be done with partial UTF-8 sequences.

Reported-by: Maksim Arhipov

Fixes curl#9841
Closes #xxx
@NativeDevMan
Copy link

Please tell me, if you have information, how soon will #10890 be accepted into the main branch?

jay added a commit to jay/curl that referenced this issue Jun 6, 2023
- Suppress an incomplete UTF-8 sequence at the end of the buffer.

- Attempt to reconstruct incomplete UTF-8 sequence from prior call(s)
  in current call.

Prior to this change, in Windows console, UTF-8 sequences split between
two or more calls to the write callback would cause invalid "replacement
characters" U+FFFD to be printed instead of the actual Unicode
character. This is because in Windows only UTF-16 encoded characters are
printed to the console, therefore we convert the UTF-8 contents to
UTF-16, which cannot be done with partial UTF-8 sequences.

Reported-by: Maksim Arhipov

Fixes curl#9841
Closes #xxxx
@jay jay closed this as completed in af3f4e4 Aug 1, 2023
ptitSeb pushed a commit to wasix-org/curl that referenced this issue Sep 25, 2023
- Suppress an incomplete UTF-8 sequence at the end of the buffer.

- Attempt to reconstruct incomplete UTF-8 sequence from prior call(s)
  in current call.

Prior to this change, in Windows console UTF-8 sequences split between
two or more calls to the write callback would cause invalid "replacement
characters" U+FFFD to be printed instead of the actual Unicode
character. This is because in Windows only UTF-16 encoded characters are
printed to the console, therefore we convert the UTF-8 contents to
UTF-16, which cannot be done with partial UTF-8 sequences.

Reported-by: Maksim Arhipov

Fixes curl#9841
Closes curl#10890
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cmdline tool unicode Unicode, code page, character encoding Windows Windows-specific
Development

Successfully merging a pull request may close this issue.

6 participants