Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Follow REUSE best practices for licensing/copyright #8869

Closed
wants to merge 2 commits into from

Conversation

mxmehl
Copy link
Contributor

@mxmehl mxmehl commented May 17, 2022

This PR starts with adding some of the copyright/licensing information proposed on the mailing list, with the goal of making curl REUSE compliant and thereby following established community best practices and ISO standards.

The commits are very detailed and contain additional notes with the aim to a) individually showcase the various options to adding copyright/licensing info and b) to narrow down the possibilities how to treat existing copyright headers (specifically 8bf401f, db6f607 and 79a9239).

As requested by @bagder, I used the year ranges everywhere (either keeping the existing ones or, for newly covered files, using the year of the first commit).

Commit 8ae81c2 also shows how the BSD-3-Clause licensed file is handled, and 3fea8f4 how a continuous check in the future may be added.

Potential next steps

  1. Define the style of the headers
    1. Same for newly covered, rather insignificant files: shall we use a more minimalistic style there?
  2. Find more directories which you want to bulk-cover in .reuse/dep5.
  3. Do the actual work.
  4. Check whether all makes sense, e.g. for test data that may come from a third party.
  5. Decide how to maintain this clean status (inclusion in CI etc).

Current REUSE status

When running reuse lint over this branch, the result is as follows:

* Bad licenses:
* Deprecated licenses:
* Licenses without file extension:
* Missing licenses:
* Unused licenses:
* Used licenses: BSD-3-Clause, curl
* Read errors: 0
* Files with copyright information: 2999 / 3487
* Files with license information: 1519 / 3487

Before that we were at:

* Files with copyright information: 1488 / 3486
* Files with license information: 0 / 3486

.gitignore Outdated Show resolved Hide resolved
src/makefile.amiga Outdated Show resolved Hide resolved
.gitignore Outdated Show resolved Hide resolved
@bagder
Copy link
Member

bagder commented May 17, 2022

Maybe a better approach is to simply generate a separate "DEP5" file at release time (without modifying existing files) and bundle that in the tarball?

@danielgustafsson
Copy link
Member

Maybe a better approach is to simply generate a separate "DEP5" file at release time (without modifying existing files) and bundle that in the tarball?

That's my preferred approach.

@mxmehl
Copy link
Contributor Author

mxmehl commented May 18, 2022

Maybe a better approach is to simply generate a separate "DEP5" file at release time (without modifying existing files) and bundle that in the tarball?

The problem with this is that then the correctness of the licensing information is preserved in a script. It proved to be much more efficient and stable to put this information as close to the files as possible. Plus, one can assume that many re-users look at a clone of the repo or its Github UI and not always at the release tarball.

Doing both is ideal, but depending on the build system it is quite tricky to reach 100% REUSE compliance for the built package. It's much simpler to fix this in the source code directly.

@bagder
Copy link
Member

bagder commented May 18, 2022

The problem with this is that then the correctness of the licensing information is preserved in a script

We are always going to leave "the correctness" to a script anyway. How else are we going to verify and update?

It proved to be much more efficient and stable to put this information as close to the files as possible.

Why so? Wouldn't compliance be tested in a test case and even in CI? Wouldn't then any breakage be as likely to be fixed no matter the exact way the information is stored?

one can assume that many re-users look at a clone of the repo or its Github UI and not always at the release tarball.

My thinking is that the REUSE data is not for "users". It's there to be machine detected and parsed, so why not put that data away from the immediate human consumption. The human readable texts are the ones we already have.

it is quite tricky to reach 100% REUSE compliance for the built package

What exactly does "100% REUSE compliance" mean?

@Tachi107
Copy link
Contributor

Maybe a better approach is to simply generate a separate "DEP5" file at release time (without modifying existing files) and bundle that in the tarball?

Worth noting that the REUSE team plans to deprecate the dep5 format in the future, see fsfe/reuse-tool#244 (comment)

@danielgustafsson
Copy link
Member

Maybe a better approach is to simply generate a separate "DEP5" file at release time (without modifying existing files) and bundle that in the tarball?

Worth noting that the REUSE team plans to deprecate the dep5 format in the future, see fsfe/reuse-tool#244 (comment)

But AFAICT only to replace it with a different format, so if we can make the tool emit one format we can change it to emit another format.

@mxmehl
Copy link
Contributor Author

mxmehl commented May 23, 2022

It proved to be much more efficient and stable to put this information as close to the files as possible.

Why so? Wouldn't compliance be tested in a test case and even in CI? Wouldn't then any breakage be as likely to be fixed no matter the exact way the information is stored?

Perhaps I misunderstood your idea with the DEP5 file. I think it's not beneficial to the complete transparency of the repos licensing/copyright situation of this DEP5 file is only present in the tarball. It should be in the repo that serves as the basis for many packagers, users and individual/corporate re-users.

The information is most accessible if it's in the file header. You already solved that quite well for a good percentage of the files. In those where there already is the comment header, adding one line with the License Identifier would be the most elegant solution. For the majority of missing files, especially test data, we could just add the .reuse/dep5 file.

one can assume that many re-users look at a clone of the repo or its Github UI and not always at the release tarball.

My thinking is that the REUSE data is not for "users". It's there to be machine detected and parsed, so why not put that data away from the immediate human consumption. The human readable texts are the ones we already have.

SPDX-License-Identifier: curl is also useful information for humans. Of course, the license notice can also be read and understood by humans, but if I as a re-user wanted to be sure that I understand all conditions correctly, and that they are equal across multiple files I may want to re-use, it would require more brain power and time.

Also, there are some files that don't contain these headers already which someone may also want to use. REUSE would close this gap.

it is quite tricky to reach 100% REUSE compliance for the built package

What exactly does "100% REUSE compliance" mean?

It means that for all in a repository, humans and machines can find copyright and licensensing information. This is in either of these three places:

  1. Directly in the file as a comment header
  2. As a adjacent .license file
  3. In .reuse/dep5

(and in the future in another place that more flexible, as some here have noted)

@bagder
Copy link
Member

bagder commented May 23, 2022

So how about adding SPDX-License-Identifier: curl last in the headers, and then put the rest of the information in .reuse/dep5 ?

@mxmehl
Copy link
Contributor Author

mxmehl commented May 24, 2022

Sure, that would be possible. So to clarify:

  • All files that already have the nice comment header shall be extended by the SPDX-License-Identifier line, as outlined in 79a9239
  • All other files shall be covered in the DEP5 file

Do I understand correctly that you don't want commentable files, e.g. Markdown files in /docs or YAML files in /.github, to be directly supplemented with copyright and licensing information?

@bagder
Copy link
Member

bagder commented May 24, 2022

We also need to update the copyright year ranges in all files we update to please our copyright check script!

you don't want commentable files, e.g. Markdown files in /docs to be directly supplemented with copyright and licensing information?

I'm worried that it will make them less readable (ie be user hostile) plus we would need to handle that for the website too if so, as we generate web pages from many of the markdown files and such headers would not go well there.

The yaml files should probably have headers.

@mxmehl
Copy link
Contributor Author

mxmehl commented May 24, 2022

We also need to update the copyright year ranges in all files we update to please our copyright check script!

Ah, alright, will check that out then.

I'm worried that it will make them less readable (ie be user hostile) plus we would need to handle that for the website too if so, as we generate web pages from many of the markdown files and such headers would not go well there.

Two lines of comments would suffice for them for the bare minimum. As someone who works with REUSE-compliant repos a lot, I personally don't find it too distracting. But again, just listing options ;)

Regarding the websites, the markdown comment style would be HTML so that should™ not be a problem. Is there some way to test it?

@mxmehl mxmehl changed the title WIP: Follow REUSE best practices for licensing/copyright Follow REUSE best practices for licensing/copyright May 27, 2022
@mxmehl
Copy link
Contributor Author

mxmehl commented May 27, 2022

The PR is ready to be reviewed. I did this today:

  • Reverted the earlier suggestions that were the basis for discussion. We agreed that we only want to use in-file-comments (where the standard header already exists) or easily editable files.
  • Applied a license identifier to all files, using the correct year range.
  • Added bulk info to the DEP5 file, or for files that we don't want to or cannot edit directly. (Note: with the <filename>.license files it would have been easier)
  • I found a number of files with different license notices, and also under different licenses. I don't see a compliance issue, but a look into the new LICENSES/ directory shows them.
  • I also made copyright.pl happy again. In this file, I updated the ignore lists a bit.
  • Lastly, resolved conflicts and fixed new files from master.

Please have a look whether the files and the licenses make sense to you, thanks!

Before merging, I can also clean up the reverted commits if wished. I tried to avoid force-pushes to keep my changes transparent for now.

We should probably also discuss whether to add some documentation or so.

@danielgustafsson
Copy link
Member

Before merging, I can also clean up the reverted commits if wished. I tried to avoid force-pushes to keep my changes transparent for now.

This should be squashed into a single commit before going into master and the resuling commit-sha added to a .git-blame-ignore-revs file to keep it from showing up when working. As it's not re-indenting but merely adding stuff it shouldn't be too invasive to code archaeology but we might as well tuck it away.

Copy link
Member

@danielgustafsson danielgustafsson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs checksrc fixes:
./vtls/gskit.h:22:3: warning: Trailing whitespace (TRAILINGSPACE)
*
^

@bagder
Copy link
Member

bagder commented May 30, 2022

Please also rebase it (and force-push). "This branch cannot be rebased due to conflicts"

@bagder
Copy link
Member

bagder commented May 30, 2022

I also made copyright.pl happy again.

But after you apply this patch, the script will warn on every file you've updated that does not also say - 2022 in the copyright range. Ideally I guess copyright.pl should ignore only this particular change but that might be too tricky, so I would be fine with automatically bumping all copyright ranges to -2022 for that reason.

@mxmehl
Copy link
Contributor Author

mxmehl commented May 30, 2022

Thank you for the useful feedback. Sorry for the trailing whitespaces, the script I was using was falsely producing these. I will also squash and put them to the blame ignore file.

Please also rebase it (and force-push). "This branch cannot be rebased due to conflicts"

I would fix these again shortly before the merge. Every change on master will most likely introduce new conflicts given the large number of edits I made here.

But after you apply this patch, the script will warn on every file you've updated that does not also say - 2022 in the copyright range. Ideally I guess copyright.pl should ignore only this particular change but that might be too tricky, so I would be fine with automatically bumping all copyright ranges to -2022 for that reason.

I am not sure I understand, sorry. I ran copyright.pl in my branch and fixed the newly introduced warnings caused by my edits so shouldn't we be fine then? In the changed files you will see in many files that not only the License Identifer has been added but also the copyright ranged changed, starting from the year of the first commit to 2022. Note here that some files had headers starting from much earlier, probably because they have just been copied from other files.

@mxmehl mxmehl force-pushed the reuse branch 2 times, most recently from 09914b4 to e076f04 Compare June 1, 2022 09:27
@vexxhost-ci
Copy link

vexxhost-ci bot commented Jun 1, 2022

Starting check jobs.
Warning:
Failed to create check run curl/check: 500 Server Error: Internal Server Error for url: https://api.github.com/repos/curl/curl/check-runs

@mxmehl
Copy link
Contributor Author

mxmehl commented Jun 1, 2022

I just rebased and squashed all commits and added the large commit to the .git-blame-ignore-revs file. Thanks for the review!

From my perspective this would be good to go. Note that with any merged PR, the likelihood of a renewed need to resolve conflicts again increases.

FYI, here's the output of reuse lint:

# SUMMARY

* Bad licenses:
* Deprecated licenses:
* Licenses without file extension:
* Missing licenses:
* Unused licenses:
* Used licenses: BSD-3-Clause, BSD-4-Clause-UC, GPL-3.0-or-later, ISC, LicenseRef-BSD-3-Clause-Hollinger, LicenseRef-BSD-3-Clause-Simtec, LicenseRef-OpenEvidence, MIT, curl
* Read errors: 0
* Files with copyright information: 3498 / 3498
* Files with license information: 3498 / 3498

Congratulations! Your project is compliant with version 3.0 of the REUSE Specification :-)

@bagder
Copy link
Member

bagder commented Jun 2, 2022

such a huge patch is hard to review!

  1. The new files don't seem to be included in releases (maketgz builds them) ?
  2. I don't think you should change the start year in the copyright ranges. Most of the files are derived or downright cloned from another file at creation so the year of the first commit does not accurate reflect the year of the origins of that file.
  3. the man pages in docs/libcurl/opts/ seem to not have had their copyright year ranges updated? (I would presume -/scripts/copyright.pl warns about that?)
  4. The PR has a merge conflict again

Separately from this work: we should probably consider removing the examples with the two separate BSD licenses (fopen.c and rtsp.c) and the GPL-3 licensed m4 macro. In the latter case, ax_compile_check_sizeof.m4 is a 23 line function that brings in the (scary) 200+ lines GPL-3 license text. The macro being slightly harder to replace since it would require a rewrite.

bagder added a commit that referenced this pull request Jun 2, 2022
To simplify the license situation, as they were the only files in the
source tree using these specific BSD-3 clause licenses.

For an fopen style API, we recommend instead going
https://github.com/curl/fcurl

Ref: #8869
bagder added a commit that referenced this pull request Jun 2, 2022
To simplify the license situation, as they were the only files in the
source tree using these specific BSD-3 clause licenses.

For an fopen style API, we recommend instead going
https://github.com/curl/fcurl

Ref: #8869
Closes #8949
@bagder
Copy link
Member

bagder commented Jun 2, 2022

I've emailed @jeroen to ask about relicensing docs/examples/crawler.c to the curl license. The two BSD licensed examples have now been removed.

@jeroen
Copy link
Contributor

jeroen commented Jun 2, 2022

OK by me; you have my permission to change the license.

bagder added a commit that referenced this pull request Jun 2, 2022
With permission from Jeroen Ooms

URL: #8869 (comment)
@mxmehl
Copy link
Contributor Author

mxmehl commented Jun 2, 2022

such a huge patch is hard to review!

I know, but I think it's worth to tackle the issues in one run even it's a PITA (and trust me, I hate resolving the conflicts with each rebase as well ;) )

1. The new files don't seem to be included in releases (maketgz builds them) ?

Yes, not in the resulting tarball. If you also wanted to have the released tar file being REUSE compliant (so that one could extract it and run reuse lint over it successfully), there are three options:

  1. Make the new files also carry the copyright/licensing information somehow (depends on the build system and file type whether that's possible)
  2. Preemptively add the newly created files to the dep5 file
  3. Add a wildcard in dep5 as a catch-all last resort with the standard copyright holder and the curl license.
2. I don't think you should change the start year in the copyright ranges. Most of the files are derived or downright cloned from another file at creation so the year of the first commit does not accurate reflect the year of the origins of that file.

OK, got it. I will try to revert to the original dates.

3. the man pages in docs/libcurl/opts/ seem to not have had their copyright year ranges updated? (I would presume `-/scripts/copyright.pl` warns about that?)

Ah no, there was no warning. These files seem to be ignored here:

'(\/|^)[A-Z0-9_.-]+$', # all uppercase file name, possibly with dot and dash

I can fix those manually still.

4. The PR has a merge conflict again

True. And as mentioned, each rebase costs quite some time to resolve conflicts. How about I fix this once after you and the team are fine with the general changes?

Separately from this work: we should probably consider removing the examples with the two separate BSD licenses (fopen.c and rtsp.c) and the GPL-3 licensed m4 macro. In the latter case, ax_compile_check_sizeof.m4 is a 23 line function that brings in the (scary) 200+ lines GPL-3 license text. The macro being slightly harder to replace since it would require a rewrite.

I rebased on master so fopen.c and rtsp.c are gone (I also deleted their license text files). I also relicensed crawler.c. The current output is:

# SUMMARY

* Bad licenses:
* Deprecated licenses:
* Licenses without file extension:
* Missing licenses:
* Unused licenses:
* Used licenses: BSD-3-Clause, BSD-4-Clause-UC, GPL-3.0-or-later, ISC, LicenseRef-OpenEvidence, curl
* Read errors: 0
* Files with copyright information: 3498 / 3498
* Files with license information: 3498 / 3498

Congratulations! Your project is compliant with version 3.0 of the REUSE Specification :-)

mxmehl added a commit to mxmehl/curl that referenced this pull request Jun 2, 2022
bagder added a commit that referenced this pull request Jun 2, 2022
bagder added a commit that referenced this pull request Jun 2, 2022
With permission from Jeroen Ooms

URL: #8869 (comment)
Closes #8950
Add licensing and copyright information for all files in this repository. This
either happens in the file itself as a comment header or in the file
`.reuse/dep5`.

This commit also adds a Github workflow to check pull requests and adapts
copyright.pl to the changes.
@mxmehl
Copy link
Contributor Author

mxmehl commented Jun 2, 2022

Thanks for the commits outside this branch!

  • I rebased again to incorporate them
  • Fixed the warnings triggered by the modified copyright.pl
  • Changed the copyright starting years back to as they are on master

All that's left would be the tar.gz question (point 1 in my earlier comment).

@mxmehl
Copy link
Contributor Author

mxmehl commented Jun 2, 2022

All that's left would be the tar.gz question (point 1 in my earlier comment).

Alternatively, we could merge this PR and handle the release tarball in another step? It won't change much with the existing files I guess.

@bagder
Copy link
Member

bagder commented Jun 13, 2022

Thanks!

@bagder bagder closed this in ad9bc59 Jun 13, 2022
@bagder
Copy link
Member

bagder commented Jun 13, 2022

@mxmehl can you tell us how we can run this check locally? Isn't there a script/tool that does the check without having to run a docker thing?

@mxmehl mxmehl deleted the reuse branch June 13, 2022 12:56
@mxmehl
Copy link
Contributor Author

mxmehl commented Jun 13, 2022

Many thanks for merging, and congratulations for successfully adopting the REUSE best practices!

@mxmehl can you tell us how we can run this check locally? Isn't there a script/tool that does the check without having to run a docker thing?

Sure. There is the REUSE helper tool, a lightweight Python tool that allows you to run the check locally (reuse lint) and assists with making/keeping a repo REUSE compliant. For starters, here's a tutorial with the most basic tool commands.

For curl however, I scripted most things as the year-range logic as well as the custom header are quite specific. We have some improvements regarding this in the milestone for the next release(s) though.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

None yet

6 participants