Disclaimer: I am NOT a lawyer. I am a developer with an internet connection, which is NOT even close to the same thing. This is NOT legal advice; it is one software engineer’s interpretation. Still, let’s talk about licenses.
GNU General Public License (GPL)
The GNU General Public License (GPL) is an open-source license that requires all derived work to also be released under the GPL. It is very clear about dependencies: if your project depends on a package under the GPL, your project has to be under the GPL too. The FAQ states it very clearly.
This does NOT apply to the GNU Lesser General Public License (LGPL).
So if your package has a GPL dependency anywhere in its dependency tree, it needs the GPL; not MIT, BSD, Apache, or LGPL. A Stack Overflow discussion seems to come to the same conclusion.
This means that if we can find all Python packages that have a GPL dependency but are not under the GPL themselves, we have a list of possibly mislicensed packages.
The GPL has a sister license for server software, the GNU Affero General Public License (AGPL). I will treat it as a synonym of the GPL.
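To make the rule concrete, here is a minimal Python sketch of how I classify a license string: GPL and Affero GPL count as GPL, while LGPL does not. The function name `is_gpl` and the example strings are my own illustration, and lowercasing the text before matching is an assumption on my side rather than something the dataset guarantees.

```python
# Substrings that mark a license as GPL-family for this analysis.
# Assumption: license names are free-text strings, as in package metadata.
GPL_MARKERS = (
    "gnu general public license",
    "gnu affero general public license",
)


def is_gpl(license_name: str) -> bool:
    """Return True for GPL or AGPL, False for LGPL and everything else."""
    normalized = license_name.lower()
    return any(marker in normalized for marker in GPL_MARKERS)


# Illustrative license strings, not taken from the database.
for name in (
    "GNU General Public License v3 (GPLv3)",
    "GNU Affero General Public License v3",
    "GNU Lesser General Public License v3 (LGPLv3)",
    "MIT License",
):
    print(f"{name}: {is_gpl(name)}")
```

Matching on the full phrase matters here: a looser check on "general public license" alone would also catch the LGPL, which is exactly the license that should be left out.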
Data: Neo4j graph database
In a previous blog post, I built a graph database with Python packages and their dependencies. The original database has the 5000 most downloaded packages and their dependencies (totaling roughly 5300 packages).
I also made a new database with all packages that did not give me an error, for a total of 404,975 packages.
I use both in this post.
Finding all potentially mislicensed packages
A graph database lets you easily traverse the many dependencies. If we do this on the two datasets, we find 8757 packages that have a GPL dependency but are not GPL-licensed themselves. The GPL dependency can be several packages deep in the dependency chain.
// Find all GPL packages
MATCH (n:package)
WHERE n.license CONTAINS "GNU general public license" OR n.license CONTAINS "GNU Affero general public license"
// Find all packages depending on the GPL packages
// One or more hops away
MATCH (n)<-[:DEPENDS_ON*1..]-(m)
// Filter out all packages that already have a GPL license
WHERE NOT m.license CONTAINS "GNU general public license" AND NOT m.license CONTAINS "GNU Affero general public license"
RETURN m.name AS name, m.license AS license, collect(DISTINCT n.name) AS GPL_dependencies
ORDER BY license
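For readers who do not have a graph database at hand, the same idea can be sketched in plain Python. This is only a rough in-memory equivalent of the query above, not how I produced the numbers: the package names, licenses, and the `depends_on` mapping are hypothetical toy data, and the helper names are made up for this example.

```python
from collections import deque

# Hypothetical toy data: package -> license, and package -> direct dependencies.
licenses = {
    "app": "MIT license",
    "plugin": "Apache license",
    "db-driver": "GNU general public license v2 (gplv2)",
    "gpl-tool": "GNU general public license v3 (gplv3)",
    "helper": "BSD license",
}
depends_on = {
    "app": ["plugin"],
    "plugin": ["db-driver"],
    "gpl-tool": ["db-driver"],
    "helper": [],
    "db-driver": [],
}


def is_gpl(license_name):
    """GPL and Affero GPL count as GPL; LGPL does not match either phrase."""
    text = license_name.lower()
    return (
        "gnu general public license" in text
        or "gnu affero general public license" in text
    )


def gpl_dependencies(package):
    """Collect every GPL package reachable from `package` in one or more hops."""
    seen, found = set(), set()
    queue = deque(depends_on.get(package, []))
    while queue:
        dep = queue.popleft()
        if dep in seen:
            continue  # also protects against dependency cycles
        seen.add(dep)
        if is_gpl(licenses.get(dep, "unknown")):
            found.add(dep)
        queue.extend(depends_on.get(dep, []))
    return found


def potentially_mislicensed():
    """Map each non-GPL package to the GPL packages it depends on."""
    result = {}
    for package, license_name in licenses.items():
        if is_gpl(license_name):
            continue  # already GPL, nothing to flag
        deps = gpl_dependencies(package)
        if deps:
            result[package] = sorted(deps)
    return result


print(potentially_mislicensed())
```

In this toy graph, "app" is flagged even though it only depends on "plugin" directly, because the GPL package sits two hops away.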
If we look at the most common licenses among packages with a GPL dependency, we see that most have an MIT, Apache, or BSD license. This matches the licenses that are most popular in general. A lot of packages have an unknown license.
We also see the GNU **lesser** general public license. This might be an honest mix-up between the two GNU licenses.
MIT license: 4020
unknown: 2217
Apache license: 755
BSD license: 510
GNU lesser general public license v3 (lgplv3): 262
Apache license 2.0: 220
other/proprietary license: 175
gnu lesser general public license: 73
BSD 3-clause license: 40
Mozilla public license 2.0 (mpl 2.0): 39
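A tally like the one above can be produced from the query result with a few lines of Python. This is a small sketch under the assumption that the result is available as (package, license) pairs; the rows below are hypothetical placeholders, not real query output.

```python
from collections import Counter

# Hypothetical rows shaped like the query result: (package name, license).
rows = [
    ("pkg-a", "MIT license"),
    ("pkg-b", "unknown"),
    ("pkg-c", "MIT license"),
    ("pkg-d", "Apache license"),
    ("pkg-e", "MIT license"),
]

# Count how often each license string occurs among the flagged packages.
license_counts = Counter(license_name for _, license_name in rows)

# Print the ten most common licenses, most frequent first.
for license_name, count in license_counts.most_common(10):
    print(f"{license_name}: {count}")
```

Note that this counts raw license strings, which is why near-duplicates such as "Apache license" and "Apache license 2.0" show up as separate entries.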
If I do the same for the 5000 most popular packages, we find 54 packages with potential issues. If we look at apache-airflow-providers-mysql (one of the biggest packages), we find a discussion about GPL. It was closed with the comment “Closing this for now — Since these are not in default requirements. We can reopen if needed”. Whether this is correct or not, I cannot tell you.
Looking at the PyPI page of apache-airflow-providers-mysql, we see that the GPL packages (mysql-connector-python and mysqlclient) carry an extra condition: “platform_machine != "aarch64"”. This means they are only installed when you are NOT on an aarch64 platform; if you are on aarch64, you can install without GPL. I do not know how aware the end user is of this.
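To see what such a condition does, here is a simplified Python sketch that decides whether a dependency is pulled in on a given machine. It only understands conditions of the form platform_machine == or != a value, and the requirement string is a shortened, hypothetical version of the real one (no version pins). For real work, I would expect the `Marker` class from the `packaging` library to be the better tool, since it handles the full marker syntax.

```python
import platform


def split_requirement(requirement: str):
    """Split 'name; marker' into (name, marker or None)."""
    name, _, marker = requirement.partition(";")
    return name.strip(), marker.strip() or None


def applies_on(marker, machine: str) -> bool:
    """Evaluate only markers of the form: platform_machine ==/!= "value"."""
    if marker is None:
        return True  # no marker: the dependency is always installed
    variable, operator, value = marker.split()
    if variable != "platform_machine" or operator not in ("==", "!="):
        raise ValueError(f"unsupported marker: {marker}")
    value = value.strip("\"'")
    return (machine == value) if operator == "==" else (machine != value)


# Simplified, hypothetical requirement string for illustration.
requirement = 'mysqlclient; platform_machine != "aarch64"'
name, marker = split_requirement(requirement)

# Check two fixed machine types plus the machine this script runs on.
for machine in ("x86_64", "aarch64", platform.machine()):
    status = "installed" if applies_on(marker, machine) else "skipped"
    print(f"{name} on {machine}: {status}")
```

With this condition the dependency is installed on x86_64 and skipped on aarch64, so whether the GPL package ends up in your environment depends on the hardware you install on.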
Searching the documentation gave only two hits for GPL or GNU: one in the release notes and one in a printout of an output.
Conclusion
Pay attention to the license you use and the licenses your dependencies use. But don’t feel too bad if you make a mistake: you are not alone.