Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inconsistent/Incompatible handling of filename escaping in multipart/form-data compared to RFC 7578 and browsers #7789

Closed
sleevi opened this issue Sep 29, 2021 · 15 comments
Labels

Comments

@sleevi
Copy link

sleevi commented Sep 29, 2021

I did this

  1. Create a file of the form "foo.jpg (note the leading quote)
  2. Submit as a form using curl -i -F encoded_image=@\"foo.jpg https://www.google.com/searchbyimage/upload

I expected the following

Within the multipart/form-data that is generated, the filename parameter of the request will be in the following form: filename="\"foo.jpg" - using the quoted-string production of RFC 822.

However, I would expect that the filename would of the form filename="%22foo.jpg", using the Percent-Encoding Option of RFC 7578

curl/libcurl version

ToT

operating system

Further Details

The context here is that multipart/form-data encoding is handled by lib/formdata.c. The form-data support is implemented using the generic MIME encoder, in the family of curl_mime_* functions. For example, in the above example of attaching a file, this is handled by

result = curl_mime_filedata(part, file->contents);
calling curl_mime_filedata

curl_mime_filedata passes the filename onwards using curl_mime_filename, passing only the basename of the filename.

Later, when compiling the multipart message, the field name and filename for the Content-Disposition header are escaped using the escape_string function, as shown at

curl/lib/mime.c

Lines 1867 to 1886 in 52fab72

if(part->name) {
name = escape_string(part->name);
if(!name)
ret = CURLE_OUT_OF_MEMORY;
}
if(!ret && part->filename) {
filename = escape_string(part->filename);
if(!filename)
ret = CURLE_OUT_OF_MEMORY;
}
if(!ret)
ret = Curl_mime_add_header(&part->curlheaders,
"Content-Disposition: %s%s%s%s%s%s%s",
disposition,
name? "; name=\"": "",
name? name: "",
name? "\"": "",
filename? "; filename=\"": "",
filename? filename: "",
filename? "\"": "");

These is a complex area due to issues of character sets and filenames. Within this space, RFC 6266, Section 4.3 is relevant, and further expanded upon in RFC 6266, Appendix C.2. RFC 7578, Section 4.2 attempts to provide guidance here with respect to the filename handling inter-operably within a Content-Disposition: form-data part.

The HTML Living Standard Multipart form data specification places a normative dependency on RFC 7578, providing further guidance with respect to the encoding and escaping of field names and filenames for file fields, the two areas escaped via escape_string, to encode to the appropriate encoding and then replace { 0x0A, 0X0D, 0X22 } with { "%0A", "%0D", "%22" }, respectively. This is compatible with the guidance of RFC 7578, and reflects widespread use within browsers (e.g. Chromium/Chrome or WebKit/Safari)

This is an area where MIME attachments differ in practice than form data. It seems that it may be useful to at least align the form encoder to match the approach mentioned within RFC 7578, for greater harmonization.

@bagder bagder added the HTTP label Sep 29, 2021
@bagder
Copy link
Member

bagder commented Sep 29, 2021

I agree. @monnerat, do you want to grab this?

@monnerat
Copy link
Contributor

monnerat commented Sep 29, 2021

This is a picky area: we have something that seems to work for forms and featuring mime did not change this escaping that was already present in original formdata code.

In addition, there are contradictory informations in the mentioned documents:

In https://html.spec.whatwg.org/multipage/form-control-infrastructure.html#multipart-form-data:

For field names and filenames for file fields, the result of the encoding in the previous bullet point must be escaped by replacing any 0x0A (LF) bytes with the byte sequence %0A, 0x0D (CR) with %0D and 0x22 (") with %22. The user agent must not perform any other escapes.

In https://datatracker.ietf.org/doc/html/rfc6266#section-4.3:

Note: Many user agents do not properly handle the escape character
"\" when using the quoted-string form. Furthermore, some user
agents erroneously try to perform unescaping of "percent" escapes
(see Appendix C.2), and thus might misinterpret filenames
containing the percent character followed by two hex digits.

So: %-escape or not, and what about the backslash ???

I fear this change may break a lot of things: in particular, tests 39 and 1158.
I'm not convinced it is wise to introduce problems where there is no bug, just because of an inconsistency with third party applications.

As a consequence, I can do it, but things have to be clarified first.

@bagder
Copy link
Member

bagder commented Sep 29, 2021

This is rather tricky. It seems the three specs have very different ideas on how to encode some basic bytes:

ascii RFC 6266 RFC 7578 WHATWG
0x0d unclear %0d %0d
0x22 (") \" %22 %22
0x5c (\) \\ %5c \

Or did I get any of them wrong?

I fear this change may break a lot of things: in particular, tests 39 and 1158.

Those tests verify curl's behavior. If we change behavior, we of course also need to update those tests accordingly!

When specs are hard to follow (like here) I think we need to check what other tools and libraries use and what widely deployed server-side applications accept/support. I think unifying with browser behavior on this seems reasonable as they are probably the major producers of formdata bodies.

@sleevi
Copy link
Author

sleevi commented Sep 29, 2021

So: %-escape or not, and what about the backslash ???

So this isn’t quite the problem with URLs, but similar.

In order of precedence, and recognizing that cURL may ignore the WHATWG work due to the Living Standard nature:

  • WHATWG HTML Living Standard
  • RFC 7578
  • RFC 6266 (ish)

I say “ish”, because virtually no one is using the paramname* construction to extend paramname with character set details.

(Edited as previous comment was edited)

@bagder
Copy link
Member

bagder commented Sep 29, 2021

Yes sorry, I realized my mistake and edited the table. WHATWG and RFC 7578 agree on 0x0a, 0x0d and 0x22.

I think I'm still wrong with the backslash there as 7578 doesn't say it should be encoded I think...

@sleevi
Copy link
Author

sleevi commented Sep 29, 2021

@bagder I think your table is correct now, in that Section 2 of RFC 7578 says you can percent encode the disallowed characters - which naively would include such.

However, I don’t think \ is necessarily an interesting case, primarily because I understand browsers treat it as a path separator, and thus it’s elided from the filename before encoding, similar to how cURL grabs the basename of the file. That said, I could see it being entirely reasonable for WHATWG to make that a bit more explicit, either in processing or making sure it’s addressed for encoding (to avoid mis-escaped sequences, similar to the handling)

@monnerat
Copy link
Contributor

Or did I get any of them wrong?

Seems not. But you know http better than I do.

Those tests verify curl's behavior. If we change behavior, we of course also need to update those tests accordingly!

Sure! But this shows we have no bug, but perhaps a misinterpretation. That said, if tests break, applications running perfectly for more that 13 years now may fail too.

... and what widely deployed server-side applications accept/support.

I think it's the most important. Servers themselves may parse it too (Apache modules?) As we only generate these and do not parse them, It would be good to know how widespread CGI-like apps and servers (i.e.: PHP) understand them.

I say “ish”, because virtually no one is using the paramname* construction to extend paramname with character set details.

Curl does not support charset encoding in http/mime header fields. This would be rather cumbersome to implement.

I think I'm still wrong with the backslash there as 7578 doesn't say it should be encoded I think...

AFAICS, you're as puzzled as I am :-)

As the various documents are not OK with each other, this introduces a huge spelunking work for us in third-party products. Even the latter may not all stick to the same behavior. I'm not sure I'm ready for thst :-(

@sleevi
Copy link
Author

sleevi commented Sep 29, 2021

this introduces a huge spelunking work for us in third-party products.

Indeed! I mentioned separately to @bagder that this began as an “Oh, that’s odd” when someone noticed a webpage behaving differently if browsers sent the request vs cURL, and the deeper I went, the more boxes of spiders I found myself opening.

I thought it useful to write this down and try to capture the references. cURL’s ubiquity suggests this would be the main place people run into it.

From a browser evolution space, there’s been a wide variety of behaviours before converging on the current approach. For example, one was just to replace the with a ? (as with any other high-bit character), another is to ignore the input filename and generate a sequential or random filename if it couldn’t be represented.

Although ultimately this is only intended as a hint for servers (for the reasons and risks noted in the various specs), and to that end, many behaviors are valid, I figured I would at least file this so it’s on your radar, even if the risk/complexity/ineffability is too much :)

@monnerat
Copy link
Contributor

monnerat commented Sep 29, 2021

Although ultimately this is only intended as a hint for servers ...

Before speaking semantics, we should care about lexical and syntax. If we do not encode a backslash, would it be correct or would it be interpreted as a quoted-pair (http uses non-terminal quoted-string while it is defined in RFC 822!). Furthermore, from what I understand from the documents, this should be applied to field names too (although non-ascii is discouraged for them). And those are not just hints...

@bagder
Copy link
Member

bagder commented Sep 29, 2021

Test form:

<form action="http://localhost:8080/submit.cgi" method="post" enctype="multipart/form-data">
  Name: <input type="text" name="person+/"><br>
  File: <input type="file" name="secret"><br>
  <input type="submit" value="Submit">
</form> 

Firefox 88 on Linux for the file name "foo:

-----------------------------104053578414215370022244339964
Content-Disposition: form-data; name="secret"; filename="\"foo"
Content-Type: application/octet-stream

contents

Chrome 94 on the same box same name:

------WebKitFormBoundaryApcVGpeqb4k55wB8
Content-Disposition: form-data; name="secret"; filename="%22foo"
Content-Type: application/octet-stream

contents

@bagder
Copy link
Member

bagder commented Sep 29, 2021

It's even worse. I edited my html for the form to be:

  File: <input type="file" name="secret
"><br>

(the newline there is intended)

The results from the jury:

Firefox 88:

Content-Disposition: form-data; name="secret "; filename="\"foo"

Chrome 94:

Content-Disposition: form-data; name="secret%0D%0A"; filename="%22foo"

@monnerat
Copy link
Contributor

I tried with firefox 92.0 and it uses percent encoding without backslash escaping.

@monnerat
Copy link
Contributor

Maybe some M$ tester could check with IE and edge?

@bagder
Copy link
Member

bagder commented Sep 29, 2021

It still means that receivers of multipart formposts today have to deal with both the \-escaped and percent encoded fields for now and the near-term future. Not sure that helps us with this issue, I'm just stating the obvious.

@monnerat
Copy link
Contributor

Escaping-style detection (if some) is a server-side problem, no change is then required in curl for that. And sooner or later, if major clients move to %-escaping, receivers will have to adapt :-(
The way detection should be done is unfortunately not specified: this will guarantee here again that, even if some server-side escaping-style detection is implemented somewhere, there's very few chance that another application will have the same algorithm.

I checked PHP 7.4.24: it uses raw name values as $_POST and $_FILES keys and raw filename values as $_FILES[]['name']. The PHP scripts are thus responsible for unescaping them.

To be versatile on our side, I can only suggest we have an additional CURLOPT_ for that (with which default?).

monnerat added a commit to monnerat/curl that referenced this issue Oct 1, 2021
Until now, form field and file names where escaped using the
backslash-escaping algorithm defined for multipart mails. This commit
replaces this with the percent-escaping method for URLs.

As this may introduce incompatibilities with server-side applications,
a libcurl option CURLOPT_FORM_ESCAPE_AS_MIME is introduced to revert to
legacy use of backslash-escaping. This is controlled by new cli tool
option --mime-escape.

New tests and documentation are provided for this feature.

Reported by: Ryan Sleevi
Fixes curl#7789
monnerat added a commit to monnerat/curl that referenced this issue Oct 1, 2021
Until now, form field and file names where escaped using the
backslash-escaping algorithm defined for multipart mails. This commit
replaces this with the percent-escaping method for URLs.

As this may introduce incompatibilities with server-side applications,
a libcurl option CURLOPT_FORM_ESCAPE_AS_MIME is introduced to revert to
legacy use of backslash-escaping. This is controlled by new cli tool
option --mime-escape.

New tests and documentation are provided for this feature.

Reported by: Ryan Sleevi
Fixes curl#7789
monnerat added a commit to monnerat/curl that referenced this issue Oct 2, 2021
Until now, form field and file names where escaped using the
backslash-escaping algorithm defined for multipart mails. This commit
replaces this with the percent-escaping method for URLs.

As this may introduce incompatibilities with server-side applications,
a new libcurl option CURLOPT_MIME_OPTIONS with bitmask
CURLMIMEOPT_FORMESCAPE is introduced to revert to legacy use of
backslash-escaping. This is controlled by new cli tool option --form-escape.

New tests and documentation are provided for this feature.

Reported by: Ryan Sleevi
Fixes curl#7789
monnerat added a commit to monnerat/curl that referenced this issue Oct 2, 2021
Until now, form field and file names where escaped using the
backslash-escaping algorithm defined for multipart mails. This commit
replaces this with the percent-escaping method for URLs.

As this may introduce incompatibilities with server-side applications,
a new libcurl option CURLOPT_MIME_OPTIONS with bitmask
CURLMIMEOPT_FORMESCAPE is introduced to revert to legacy use of
backslash-escaping. This is controlled by new cli tool option --form-escape.

New tests and documentation are provided for this feature.

Reported by: Ryan Sleevi
Fixes curl#7789
monnerat added a commit to monnerat/curl that referenced this issue Oct 5, 2021
Until now, form field and file names where escaped using the
backslash-escaping algorithm defined for multipart mails. This commit
replaces this with the percent-escaping method for URLs.

As this may introduce incompatibilities with server-side applications,
a new libcurl option CURLOPT_MIME_OPTIONS with bitmask
CURLMIMEOPT_FORMESCAPE is introduced to revert to legacy use of
backslash-escaping. This is controlled by new cli tool option --form-escape.

New tests and documentation are provided for this feature.

Reported by: Ryan Sleevi
Fixes curl#7789
monnerat added a commit to monnerat/curl that referenced this issue Oct 7, 2021
Until now, form field and file names where escaped using the
backslash-escaping algorithm defined for multipart mails. This commit
replaces this with the percent-escaping method for URLs.

As this may introduce incompatibilities with server-side applications,
a new libcurl option CURLOPT_MIME_OPTIONS with bitmask
CURLMIMEOPT_FORMESCAPE is introduced to revert to legacy use of
backslash-escaping. This is controlled by new cli tool option --form-escape.

New tests and documentation are provided for this feature.

Reported by: Ryan Sleevi
Fixes curl#7789
monnerat added a commit to monnerat/curl that referenced this issue Oct 8, 2021
Until now, form field and file names where escaped using the
backslash-escaping algorithm defined for multipart mails. This commit
replaces this with the percent-escaping method for URLs.

As this may introduce incompatibilities with server-side applications,
a new libcurl option CURLOPT_MIME_OPTIONS with bitmask
CURLMIMEOPT_FORMESCAPE is introduced to revert to legacy use of
backslash-escaping. This is controlled by new cli tool option --form-escape.

New tests and documentation are provided for this feature.

Reported by: Ryan Sleevi
Fixes curl#7789
monnerat added a commit to monnerat/curl that referenced this issue Oct 8, 2021
 names

Until now, form field and file names where escaped using the
backslash-escaping algorithm defined for multipart mails. This commit
replaces this with the percent-escaping method for URLs.

As this may introduce incompatibilities with server-side applications,
a new libcurl option CURLOPT_MIME_OPTIONS with bitmask
CURLMIMEOPT_FORMESCAPE is introduced to revert to legacy use of
backslash-escaping. This is controlled by new cli tool option --form-escape.

New tests and documentation are provided for this feature.

Reported by: Ryan Sleevi
Fixes curl#7789
monnerat added a commit to monnerat/curl that referenced this issue Oct 15, 2021
 names

Until now, form field and file names where escaped using the
backslash-escaping algorithm defined for multipart mails. This commit
replaces this with the percent-escaping method for URLs.

As this may introduce incompatibilities with server-side applications,
a new libcurl option CURLOPT_MIME_OPTIONS with bitmask
CURLMIMEOPT_FORMESCAPE is introduced to revert to legacy use of
backslash-escaping. This is controlled by new cli tool option --form-escape.

New tests and documentation are provided for this feature.

Reported by: Ryan Sleevi
Fixes curl#7789
monnerat added a commit to monnerat/curl that referenced this issue Oct 21, 2021
Until now, form field and file names where escaped using the
backslash-escaping algorithm defined for multipart mails. This commit
replaces this with the percent-escaping method for URLs.

As this may introduce incompatibilities with server-side applications,
a new libcurl option CURLOPT_MIME_OPTIONS with bitmask
CURLMIMEOPT_FORMESCAPE is introduced to revert to legacy use of
backslash-escaping. This is controlled by new cli tool option --form-escape.

New tests and documentation are provided for this feature.

Reported by: Ryan Sleevi
Fixes curl#7789
monnerat added a commit to monnerat/curl that referenced this issue Oct 25, 2021
Until now, form field and file names where escaped using the
backslash-escaping algorithm defined for multipart mails. This commit
replaces this with the percent-escaping method for URLs.

As this may introduce incompatibilities with server-side applications,
a new libcurl option CURLOPT_MIME_OPTIONS with bitmask
CURLMIMEOPT_FORMESCAPE is introduced to revert to legacy use of
backslash-escaping. This is controlled by new cli tool option --form-escape.

New tests and documentation are provided for this feature.

Reported by: Ryan Sleevi
Fixes curl#7789
monnerat added a commit to monnerat/curl that referenced this issue Oct 25, 2021
Until now, form field and file names where escaped using the
backslash-escaping algorithm defined for multipart mails. This commit
replaces this with the percent-escaping method for URLs.

As this may introduce incompatibilities with server-side applications,
a new libcurl option CURLOPT_MIME_OPTIONS with bitmask
CURLMIMEOPT_FORMESCAPE is introduced to revert to legacy use of
backslash-escaping. This is controlled by new cli tool option --form-escape.

New tests and documentation are provided for this feature.

Reported by: Ryan Sleevi
Fixes curl#7789
@bagder bagder closed this as completed in b20b364 Nov 15, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Development

Successfully merging a pull request may close this issue.

3 participants