Last week in Vienna at Open Source Summit Europe, the Open Source Initiative presented their current draft of the Open Source AI Definition (OSAID) to a packed room. Now, I’ve written about the topic and my concerns with the proposal before, so I was interested in gauging the temperature of the audience. Unfortunately, most of the session was not interactive, and attendees were not provided with a microphone for their questions.

You can watch the full session on YouTube, but I thought I’d pull out the questions and responses as they are so difficult to hear in the video. I’ll admit that there may be transcription errors, though I tried to be as accurate as possible. Where I know the names of the attendees, I have provided them.

Question 1

Attendee unknown: First of all, is there a fixed definition of “sufficiently detailed information” and a fixed definition of a “skilled person”? I mean, how do you define that? How do you know that this is the person that is skilled, and this is the information that is sufficient in order for you to [make modifications]?

Stefano Maffulli: Yeah, I mean, this is a lawyer question. You can probably argue around that. But “skilled person” is definitely a technical term in legal literature; it’s recognized in legal practice. I think the general idea is that if you give people the information, they can get similar data sets. An example that comes to mind: if you’re doing some kind of classifier for x-rays of a wrist, maybe you can’t share the actual x-rays, but you could say, I trained it on 10,000 x-rays of this type of demographic, taken from this angle, in this way. That would probably be enough to train a system with similar capabilities.

Question 2

Aeva Black: You talk a lot about the preferred form to make modifications of the program. When a vulnerability is found in a program, even if the original author is no longer maintaining it, it is possible to develop a fix if you have the source code and a compiler or whatever to build it again. Without the original data or access to it, how are researchers or those who have received a model – (let’s assume that the original creator of the model is no longer maintaining it) – if the data is not available, how else can someone build a patch to predictably modify that system, change the weights, to fix the vulnerability?

Stefano Maffulli: I guess, I mean, we have experts in this field too. I can imagine ways, because I… I think it may be necessary for us to really rethink the whole problem space and drop the picture of us fixing a model the same way that we would fix software. One of the challenges that I’ve had from the very beginning of this process was to shed all the knowledge about software and really try to immerse into the new domain, because it’s completely different. All of the paradigms and similarities that I had in my mind between compiling software and running training led to dead ends and bad outcomes. So my recommendation is really to think about that: there is a problem in the model, you don’t train it from scratch; you have all the instructions about building another, similar data set and model, and you move it. I talked to other people who have demonstrated that they removed biases from proprietary models by retraining a significant chunk of the network with different data, not the same data set they had. In this specific case, they didn’t even know how the system was trained; they only had the research paper and the trained weights, and they removed all the biases by using, they told me, probably 20% of the original data. This space is different; it’s not software.

Question 3

julia ferraioli (me): As somebody who’s been working in open source and AI for decades, this is not a new space. This is very much not a new space. These are challenges, these are problems of scale, but the idea that people do not retrain models from scratch is absolutely false. If you identify a problem in your model or its performance, the transparent thing to do is modify it in its original state. You cannot verify that you’ve removed all of the biases by retraining or fine tuning. You can verify that maybe you have mitigated them, but you have not removed them. And that’s a very important distinction when we’re talking about systems that are inherently opaque.

The presenters from the Open Source Initiative and Microsoft provided no meaningful answer to this, perhaps because it was phrased as a concern and not a question. However, they did provide this response to questions 2 and 3.

Justin Colannino: So I think what those last two questions were getting at is a tension between full transparency in the data set and the idea that there might be some data that, like what Stef was saying earlier, is hard to share. And so, where this definition is going: if it’s public data, you should be able to list that out; that’s not an issue. Maybe you’re not able to reshare it because of various laws. But then there’s the question of what you do about data that can’t be shared or otherwise couldn’t be accessed. Coming back to the health care point, I think the way folks have been thinking about that is: is it even possible to have open source health care if you can’t see the underlying data? I think that’s the underlying tension that we’re addressing.

Stefano Maffulli: Maybe I can add one more thing before we move on, because there is a position that needed to be made here. There is a tension with groups that are saying the pipeline to build a machine learning system is to start from the data, massage it and create a data set, run the training, and get the model weights at the end, so all of this needs to be open. Intuitively, all of us go to the same place: the whole thing needs to be open. But when you start looking at the data space, if you want to have only open data, you are immediately reducing the quantity of data that you have accessible to train systems to a very tiny, tiny set. To give you an example: I want to train my system only on movies in the public domain. It is almost impossible to calculate which movies have gone into the public domain, because every legislation has a different rule. There is a nice article posted on the OSI blog that explains all of this. [Editorial eliding of irrelevant chatter.] Then on the other side of the spectrum, there are companies who are saying, we’re never going to tell you how we built the data set. Not only because there’s some secret in the data, but because the secret is how we massaged it; we’re not going to tell you how we did it. In many specific cases, more data makes a better system, so by sharing the instructions, they inherently give more power to companies that have more data available. And those are two: Meta and Google. And maybe Microsoft. And maybe Amazon too. [Editorial disclosure: I work for Amazon.] So there is this tension here, and we’re trying to strike a middle ground, because we want to protect the groups, the organizations like LLM360, TII, EleutherAI, these groups that are doing research; they want and need large amounts of data in order to build these systems. That’s where this position on data information comes from.

Question 4

Attendee unknown: In every model you have a compressed version of the original, which is a big problem. There was an exploitation of patient data with this: pictures from patients with cancer. You can do mitigations by trying to have a knowledge graph that covers this, but if I have an idea about the data set or a piece of the picture I want to see, I can extract the full picture. And looking at it both ways, we also see there are companies stealing a lot of copyrighted data, scientific papers and so on, and they don’t want to reveal that they stole data. But if you have at least an idea of how the text in the model is, then you can reveal the whole document or something very similar. So this is a big problem. Another problem is that there is not only a software supply chain, but there’s also the human supply chain. What about the conditions of the people who are training these models? This is not a software problem itself, but by European law, we have this supply chain in a way that it must be clean, without exploiting people in foreign countries. And this also has to be []. Every model has been created by volunteers, but there are people under economic pressure who do this work, and they also have to be protected by law.

Stefano Maffulli: Absolutely. I mean, your last words, “by law,” are what needs to happen. And in fact, the requirements of the definition, also to the first part of your comment, are requirements on transparency. Forcing compliance by revealing the instructions to build the data set is actually where transparency comes in, and you can discover abuse; you should be able to discover abuse and exploitation.

Question 5

Aeva Black: The tension that I highlighted isn’t about data per se, but our principles, and the principal value exchange of open source upon which all of this has been built for 25-30 years. I will not speak on behalf of my European colleagues, but I will speak on behalf of the agency that I work at. The value exchange that makes open source trustworthy is the transparency, and also the exchange of responsibility: being able to modify it and to study it. So I hear you that you’re trying to strike a middle ground on the spectrum of data and stakeholders. But how are you preserving the principle that enables a recipient of a system to study and modify it on their own, to change it, fix it, maybe do different things, in a predictable and controllable way?

Stefano Maffulli: So we think we have it here; that is safeguarded. And I’m not saying that it’s going to be like this forever. The definition has versions, and depending on how the science evolves, I think we will adapt. There is no other way to put it.

Question 6

Cailean Osborne: First thing, have you considered expanding the first sentence to say “skilled person with similar resources or access to resources”? And the second is, do you intend to suggest best practices for what is sufficient information [Cailean cut off by Stefano Maffulli].

Stefano Maffulli: Below this definition there used to be another component, another piece, called the checklist, and we were going to be using that; it’s now split into a separate file. So basically, the Open Source AI Definition sets the general principles, and how those are going to be implemented in practice now lives in a separate document.

Takeaways

I’m glad this session happened, because it gave people with concerns about the proposed Open Source AI Definition a visible space to raise those concerns in a way that could not easily be shut down, which is what I have observed on other platforms. It is a shame that there were so many contradictions in the responses to questions and concerns (especially in the response to questions 2 and 3), and a clear misunderstanding of machine learning and how to protect the principles of open source within it. I continue to hope that the OSI chooses to take the feedback to heart and be the champion of openness that they claim to be.

If you were one of the attendees who asked a question and would like your name added or removed, please email me to ask and I’ll be more than happy to do so.