Everything in this post is my opinion and experience, and does not necessarily reflect the views of my employer.
There’s a niche corner of the world where a group of volunteers, a small number of staff, and staunch supporters defend the integrity of what it means to be open source software. By and large, this group, known as the Open Source Initiative (OSI), does its work quietly, and oftentimes without thanks or recognition. We owe a large part of modern technology to their efforts, as open source software can be found in everything from power grids to social media apps to the special effects in our TV shows.
The responsibility that they take on is immense, fraught, and nuanced.
I believe in the mission of the OSI, which is why I have been a member since 2018 (and honestly, should have been a member long before that). They take on important policy issues, corral opinionated volunteers, crunchy-granola open source hippies and IP lawyers alike, and navigate complicated geopolitical spaces. For the past two years, they have been attempting to steer us through the complicated waters of what it means to be open source in an age of artificial intelligence.
Two years into this “multi-stakeholder process”, we are facing a number of issues that are immense, fraught, and nuanced, just like open source itself.
Buckle up, kids, this is a long one.
Missing data for missing reasons
The current draft of the Open Source AI Definition (OSAID) does not require the data used to train an AI system to be disclosed or distributed. At the moment, the only aspects of the data that a system seeking the “open source AI” label would need to publish are:
- Training methodologies and techniques, under an OSD-compliant license
- Training data scope and characteristics, under an OSD-compliant license
- Training data provenance (including how data was obtained and selected), under an OSD-compliant license
- Training data labeling procedures, if used, under an OSD-compliant license
- Training data cleaning methodology, under an OSD-compliant license
None of the information above gives the prospective adopter of the AI system insight into the data that were used to train the system, though I give them credit for the last three points.
Despite the fact that you cannot train a machine learning model without data, the OSI takes the position that requiring the disclosure of training data is unnecessary. Given that the OSI’s mission states:
Open source enables a development method for software that harnesses the power of distributed peer review and transparency of process
I find it counterintuitive that the OSI is advocating for forgoing distributed peer review and transparency in the name of making it more achievable for existing models to be labeled as “open source”. Data are part of the “source” for AI systems. If the data aren’t open, then neither is the system.
Data are to AI systems as ___ is to software
There are no great analogies for data, but here are some of the ones we’ve heard.
“Data are to AI systems as source code is for software”
This isn’t perfect, but it’s not entirely wrong either. Data are dependencies of machine learning, just not static ones. Without the data that go into training, testing, and validating systems, it is nearly impossible to assess them for Things That You Care About. What you care about is no one else’s business.
People adopting AI systems deserve to understand the foundations of the systems, which lies in the data used to train them. They deserve to be able to assess that the system accounts for the Things That They Care About at those foundations.
“Data are to AI systems as hardware is to software”
Again, not entirely wrong. If we take the foundational argument from above to an extreme, yes, you can’t compile software without hardware (easily, at least; I have great confidence in the abilities of determined nerds). But software does not yield drastically different functionality on different hardware. You may have to tweak it to get it to compile for your architecture, but you are still able to verify that it works the way you expect based on a shared understanding of the source code.
This just isn’t true when it comes to data and machine learning.
Remember, we’re not just talking about large models for this definition. For the definition to be valid, it must hold true for small models as well, where each dimension of each data point carries a large amount of weight.
The open source definition (OSD) applies the same to single-purpose libraries as it does to giant frameworks.
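To make that concrete, here’s a minimal sketch of how much a single data point can move a small model. Everything here is synthetic and purely illustrative; the point is that without the training data, an adopter has no way of knowing whether anything comparable happened upstream.

```python
import numpy as np

# Toy example: fit a line to ten points, then change ONE of them.
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
y = 2 * x + rng.normal(0, 0.05, 10)        # roughly y = 2x

slope_original = np.polyfit(x, y, 1)[0]

y_tampered = y.copy()
y_tampered[-1] += 5                        # a single altered data point
slope_tampered = np.polyfit(x, y_tampered, 1)[0]

print(f"slope, original data: {slope_original:.2f}")   # ~2.0
print(f"slope, one point off: {slope_tampered:.2f}")   # ~4.5, more than double
```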
Inability to exercise the four freedoms
The OSAID aims to give adopters the same “four freedoms” that open source does: the ability to use, study, modify, and distribute the AI system. Without the inclusion of data, the current draft of the OSAID only fulfills two of the four freedoms. Specifically, people picking up an “open source AI system” would only be able to use and distribute said system. They would be able to build on top of it, through methods such as transfer learning and fine-tuning, but that’s it.
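To illustrate what that leaves an adopter able to do, here’s a minimal PyTorch sketch. The architecture, the weights file, and the data are all hypothetical stand-ins; the shape of the workflow is the point.

```python
import torch
import torch.nn as nn

# Hypothetical release: weights only, no training data.
body = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 8))
# body.load_state_dict(torch.load("open_weights.pt"))  # the released artifact

# "Modify", as the current draft allows it: freeze the released weights and
# train a new head on *our own* data.
for param in body.parameters():
    param.requires_grad = False
head = nn.Linear(8, 2)

optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)
x, y = torch.randn(64, 16), torch.randint(0, 2, (64,))  # our data, not theirs
for _ in range(100):
    optimizer.zero_grad()
    loss = nn.functional.cross_entropy(head(body(x)), y)
    loss.backward()
    optimizer.step()

# What remains out of reach: auditing or recreating the data that produced
# body's weights. The "study" freedom stops at the surface.
```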
As stated above, training data form the basis of the systems. As my colleague Tom Callaway states:
“Without requiring the data be open, it is not possible for anyone without the data to fully study or modify the LLM, or distribute all of its source code. You can only use it, tune/tweak it a bit, but you can’t dive deep into it to understand why it does what it does.” LinkedIn
He makes a good point on the distribution aspect, which makes me deduct half a point from the OSAID’s fulfillment of two of the freedoms. Sure, you could distribute the system, but AI systems recycle…lots of things: existing data, bias, and even a lack of openness.
As Ben Cotton writes:
“But if we don’t know what an AI model is trained on, we don’t know what sort of biases it’s reproducing. This is a data problem, not a model weights problem. The most advanced AI in the world is still going to produce biased output if trained on biased sources.” funnelfiasco.com
Ben’s point about biases is not fearmongering. Popular inference tools in use today, which are much less complicated than large models, have been shown to exacerbate societal biases in unexpected ways, and data collection itself inherits biases that many thought were purely in the realm of history. I highly recommend reading up on how these biases have come to light; you can read more on how data collection methodologies themselves have the potential to perpetuate harm in my exploration of the What Gets Counted Counts chapter from Data Feminism.
Demographic inference tools have been used uncritically by very serious companies, research institutions, journalists, and yes, even open source projects. These aren’t hypothetical concerns.
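To see how directly a corpus’s skew becomes a tool’s output, here’s a deliberately tiny sketch of a frequency-based inference “tool”. The corpus, name, and labels are synthetic and hypothetical; the mechanism is what matters.

```python
from collections import Counter

# A skewed, synthetic "training corpus" of (name, label) observations.
corpus = [("jordan", "m")] * 90 + [("jordan", "f")] * 10

counts = Counter(corpus)

def infer(name: str) -> str:
    # Predict the majority label seen for this name during training.
    labels = {label: n for (nm, label), n in counts.items() if nm == name}
    return max(labels, key=labels.get)

print(infer("jordan"))  # "m" -- an artifact of the sample, not a fact
# Ship only infer() and the skew is invisible; ship the corpus and it's obvious.
```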
Now, while I care very much about biases in AI, highlighting biases isn’t in scope for a definition of open source AI. But being able to examine and identify biases in AI systems absolutely is. We can’t do that without the data.
Missing definitions in the definition
Let’s go back to the draft of the OSAID (archive.org, retrieved on 13-06-2024).
There’s ambiguity and vagueness, intentionally so, according to the Executive Director of the OSI.
Such squishy statements
To achieve its goal of modifiability, the OSAID states that the information made available about the data must include:
“Sufficiently detailed information about the data used to train the system, so that a skilled person can recreate a substantially equivalent system using the same or similar data.”
This is well-meaning. It is also woefully misguided. A good definition leaves as little to interpretation as possible. This phrasing leaves a giant loophole for gatekeeping.
“Oh, you can’t reproduce the system? You must not be skilled enough.”
“You’re getting different results? You’re probably not using the right data. All the information is there; are you sure you’re reading it right?”
I find it fascinating that all the words that water down a well-meaning statement start with the letter “s”: sufficiently, skilled, substantially, similar.
What would a stronger statement look like?
“Disclosure of the data used to train the system, so that an adopter of the system can retrain it.”
Even so, there’s a key issue with that statement: it does not assert that the disclosed data be open and available. That means that even with disclosure of data sources, you may not have permission to use the data, and/or you may have to pay to license them.
A matter of scope
Open source software can include dependencies on non-open libraries. That does not disqualify the software from being open source, as long as the software in question has an open source license attached to it. There’s an argument to be made that the data an AI system depends on should enjoy the same treatment as those proprietary dependencies.
This is where we get back to a larger problem of definitions. The definition that the OSI has chosen for this…definition (things are getting meta here) is scoped to “AI systems”, which they define (sigh) as conforming to the one established by the Organisation for Economic Co-operation and Development:
An AI system is a machine-based system that, for explicit or implicit objectives, infers, from the input it receives, how to generate outputs such as predictions, content, recommendations, or decisions that can influence physical or virtual environments. Different AI systems vary in their levels of autonomy and adaptiveness after deployment.
Okay, fine, that’s pretty generic. As someone with a background in AI, I find it overly narrow and hyperfocused on machine learning, but I’ll set that aside.
However, the OSI has determined (archive.org, retrieved on 19-06-2024) that the system explicitly includes the data upon which it depends. That’s akin to saying that open source software is only open source if all its dependencies are also open source.
This scope was a choice. The complexity in crafting a definition is an effect of that choice. The unwillingness to navigate the resulting ambiguity in the context of prior decisions is concerning.
Missing focus
To be honest, I’m not entirely clear on why we need a separate definition for “open source AI”. In fact, when I was first allowed to participate in the definition process (more on that, perhaps another time), I raised the point that artificial intelligence is a field of study and/or practice, not an artifact. Trying to define what constitutes “open source AI” is closer to attempting to define “open source computer science” or “open source psychology” than “open source software”.
Note to self and everyone else: don’t try to define open source psychology please; that was not an invitation.
This initiative is meant to push back on everyone who’s calling themselves “open source AI” (or even “open AI”!) without the blessing of the OSI. Generously, it is meant to provide clarity to those looking to adopt ✨open source AI✨ and give them the same certainty that the OSI did for open source software. However, between the intersection of fields, the amalgamation of heterogeneous artifacts, and a clear-as-mud understanding of the very domain it is looking to define, the effort has fallen quite short of the mark.
The issues with the text of the draft OSAID largely revolve around a lack of focus and a lack of understanding. Given the money that has been poured into this effort (again, more on that, perhaps another time) and industry pressures both real and perceived, it seems far too late now to change anything. Still, I wish that the OSI had taken a piecemeal approach.
I think they would have had far more success tackling the individual elements separately (data being one of those elements) and then composing those elements. Or they could have continued to limit the definition’s scope to software. Either approach would have resulted in a more realistic definition, one that applies to more AI systems in a far more comprehensible fashion.
A definition that needs its own dictionary and compass to navigate is likely to be either too complex or too toothless to provide the clarity that was its goal.
Missing data, revisited
I actually get why the OSI is pushing so hard against requiring data in the OSAID.
They don’t want to publish a definition for which few existing systems qualify. The Executive Director of the OSI has been speculating and expressing concerns about the legal feasibility of requiring the inclusion of data. I am not a lawyer, and this piece has not been cleared by one, so I won’t weigh in on the legality or lack thereof of open data. The OSI doesn’t want to introduce a threat to its own legitimacy by producing a definition that may get challenged in court.
However, I do know that change isn’t made without someone deciding to take up the banner.
Open source software was a radical concept in the beginning. We didn’t get to where we are today by abiding by the status quo. We need to carry that forward with us into new domains, into new (or renewed, in the case of AI) technologies. We need to be bold and brave. We need to fight for openness and transparency.
You can read more about the overarching concerns as well as specific discussions about the need for data on the forums. Note that the second post is locked in favor of a new post that omits much of the nuance in the discussion.