
AI licensing conundrum
How do we license AI?
Jiri Podivin
7/21/2023 · 4 min read
It strikes me that the current situation of AI models bears a similarity to that of software in the early 80s, before open source or free software were much more than an idea: the technology was only slowly spreading among the general populace, and big monetary interests had suddenly intersected with what was formerly a seemingly academic and niche area.
Projects like llama.cpp have brought language models, and AI in general, within the reach of average computing hardware. What used to require several GPUs can now run on a single-board computer. Retraining models for new uses is now within reach of individuals; there is very little stopping anyone from making a custom LLM to generate text based on the works of their favorite author, audio from long-deceased actors, or any number of other applications.
And together with that we have seen growing concern, fear, even signs of panic among those who may be impacted by the changed landscape. There are calls for regulation, or even outright bans. Politicians, artists, lawyers, celebrities: all want to pitch their take.
I don’t presume to address their worries; I’m uniquely unsuited for that. Instead I want to focus on one small and relatively niche subset of the issues brought up by the recent surge of AI: licensing.
Everyone who has ever opened an open source project knows about the LICENSE file. The magic thing telling the world what the project can, and can not, be used for. I’m not a lawyer, and I never will be, but suffice to say that this file isn’t always the final source of truth. It is, however, a good starting point on the quest for it. And after many decades of use we can say that software licenses have proven to be a decent, if sometimes convoluted, concept across legal systems and oceans.
But AI makes things a bit muddy. Many of the released models are ostensibly covered by an existing license like Apache 2.0 or GPL. However, it isn’t clear how their concepts apply to the world of AI, mostly because they were not designed for it. This is still better than the releasing party making up a license of their own. For those who work with open source, a custom license is a known pitfall and sometimes a red flag.
The problem is twofold. First, a new license may contain disturbing clauses that can cause significant harm to those who use the covered software. Second, and maybe more importantly, it hasn’t been tested in court, or in arbitration. Simply put, if you use something covered by a new, unorthodox license, you risk litigation with an unknown outcome.
Not an appealing prospect, that’s for sure. But that is the world of AI models.
So what should change? Again, I’m not a lawyer. I can’t design a license, much less convince people that they should use it. But I can imagine what a license I would like to use should look like, and make a sort of wish list for it.
In order to make my approach clear, I’ll go with the following assumptions:
training data ~= source
weights ~= compiled software
training pipeline ~= compiler
None of these pairs is an exact equivalent, but conceptually they feel reasonably similar.
The source is perhaps the most contentious, so I will focus on that. Without training data there is no model, just a large and expensive random number generator. Source code is created, written, by someone at some point, just like training data. There can be a degree of originality, or it can be completely generic. It can even be generated automatically. But it is still the basis of the software.
Weights, just like compiled software, are almost unintelligible without specialized software. They are the form of software most immediately useful in production. But if you want to improve them, you need to resort to reverse engineering. This is something we have already seen in the LLM community with llama.
The training pipeline is a bit nebulous, but I felt it must be included. Even if you have the training data, it’s how the data was applied that makes the model: the pre-processing, the cross-validation protocol, model selection and hyperparameter optimization. All of these things make the difference between a useful model and one that is fundamentally broken. Even if you know what the model was trained on, it doesn’t matter if you don’t know how it was trained. Like a person in a gym: even if they use the equipment, it’s how they use it that makes the difference.
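To make the point concrete, here is a minimal sketch of such a pipeline, using scikit-learn on synthetic data (my choice of illustration, not tied to any particular model release). The same training data would yield a different model if any of these steps changed:

```python
# A toy training pipeline: pre-processing, cross-validation and
# hyperparameter optimization are all part of "how it was trained".
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training data.
X, y = make_classification(n_samples=200, random_state=0)

pipeline = Pipeline([
    ("scale", StandardScaler()),               # pre-processing step
    ("clf", LogisticRegression(max_iter=1000)),  # model selection
])

# Hyperparameter optimization with 5-fold cross-validation.
search = GridSearchCV(pipeline, {"clf__C": [0.01, 0.1, 1.0]}, cv=5)
search.fit(X, y)
print(search.best_params_)
```

Swap the scaler, the fold count, or the parameter grid and you get different weights from the same data, which is why a description of the pipeline belongs alongside the data itself.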
With these concepts in mind, I’ve arrived at the following requirements I would have for an open source license in the world of AI.
The licensee may use the model in whatever way they want to.
All training data must be available to the licensee, either packed together with the model or on demand without undue delay.
The licensee must receive the model weights and general architecture in a standard format.
The licensee must receive a description of the entire training pipeline used to obtain the model.
The licensee can use the model, the training data, or the training pipeline to make a derivative model, covered under the same license.
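The requirements above can be sketched as a compliance check against a release manifest. This is purely hypothetical: the field names below are my own invention, not any existing standard.

```python
# Hypothetical manifest for a model release. A release satisfying the
# wish-list above would ship (or point to) all of these artifacts.
release = {
    "weights": "model.safetensors",      # weights in a standard format
    "architecture": "config.json",       # general architecture
    "training_data": "data/",            # the full training data
    "training_pipeline": "PIPELINE.md",  # how the model was trained
}

REQUIRED = {"weights", "architecture", "training_data", "training_pipeline"}

def is_open_release(manifest: dict) -> bool:
    """Check that a release includes every artifact the wish-list demands."""
    return REQUIRED.issubset(manifest)

print(is_open_release(release))  # → True
print(is_open_release({"weights": "model.safetensors"}))  # → False
```

Weights alone, as the second check shows, would not be enough; that is roughly the difference between a freely downloadable binary and open source.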
Now these are just the basics, and as I’ve said, I’m not a lawyer. But I believe these basic tenets should be followed by any reasonable license for AI models. Some existing licenses already cover the same ground and should be a good fit, but I’m not entirely sure how they would adapt to the question of training data.
So, what would you like in that license?
Do you have a suggestion for an existing one that would do the trick and has already been tested in practice?