New study: Copyright challenges in open-source AI development in the European Union

As the European Union actively pursues digital sovereignty, a fundamental structural misalignment has emerged between its innovation objectives and its existing legal frameworks. Creating a European, open, and public AI will be obstructed without targeted policy interventions.

Image: Military air show in Ławica, 1929 (public). Author: Koncern Ilustrowany Kurier Codzienny – Archiwum Ilustracji / Narodowe Archiwum Cyfrowe

Across Europe, academic research consortia are using public funding to build open-source, public-interest Large Language Models (LLMs). However, as our newest study shows, European open-source and public AI developers find themselves constrained by a complex web of copyright restrictions and legal uncertainties.

The report, “Copyright Challenges in Open-Source AI Development in the European Union”, was commissioned by the COMMUNIA Association and developed in collaboration with Open Future. Its release was a stepping stone towards providing comprehensive feedback to the European Commission’s recently concluded consultation on the evaluation of the 2019 Copyright in the Digital Single Market (CDSM) Directive.

In our study, we provide empirical evidence on how the current text and data mining (TDM) exceptions established by the CDSM Directive operate in practice. Based on eight in-depth interviews with technical leads and data experts from prominent European AI projects, including OpenEuroLLM, Pleias, PLLUM, and SOOFI, the study maps the operational bottlenecks that hinder public-interest technology.

The CDSM Directive contains two mandatory exceptions for TDM. Article 3 permits research organisations and cultural heritage institutions to conduct TDM for scientific research. Article 4, meanwhile, permits TDM for any purpose but allows rightsholders to reserve their rights through an “opt-out” mechanism. Our study highlights that this distinction creates structural friction. Because publicly funded projects are often mandated to open-source their models, allowing for broad reuse that includes commercial applications, university legal teams frequently advise against relying on Article 3, which is limited to scientific research. As one interviewee noted:

“The fact that we are a scientific institution didn’t give us the right to operate solely under Article 3, because purpose matters.” (IDI)

Consequently, public-interest teams routinely shift to the more restrictive Article 4 framework. Under this regime, developers must invest computational and financial resources to parse unstandardised opt-outs. As the report points out:

“There’s no standard formula – unlike, say, a pharmaceutical disclaimer with a fixed legal form that everyone recognises. Restriction notices appear in all kinds of forms. This raises a key question: which form is sufficient to consider a reservation valid, and which is not?” (IDI)
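To illustrate why this parsing is costly, the following is a minimal sketch of how a team might combine several opt-out signals for a single page. It is not taken from the report or from any of the interviewed projects. The crawler name, the notice phrases, and the three-way outcome are hypothetical, and the `tdm-reservation` header name is an assumption based on the TDM Reservation Protocol proposal. Whether any of these signals amounts to a legally valid reservation is exactly the open question raised above.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical crawler name and notice phrases, chosen for illustration only.
USER_AGENT = "example-research-crawler"
NOTICE_PHRASES = ("text and data mining", "data mining", "ai training")


def robots_allows(robots_txt: str, url: str, user_agent: str = USER_AGENT) -> bool:
    """Check a robots.txt body (already downloaded) for a crawl restriction."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def header_reserves_rights(headers: dict) -> bool:
    """Assumption: a 'tdm-reservation: 1' response header signals an opt-out."""
    normalised = {key.lower(): value.strip() for key, value in headers.items()}
    return normalised.get("tdm-reservation") == "1"


def text_mentions_reservation(page_text: str) -> bool:
    """Crude phrase match on free-text notices; cannot judge legal validity."""
    lowered = page_text.lower()
    return any(phrase in lowered for phrase in NOTICE_PHRASES)


def classify(robots_txt: str, url: str, headers: dict, page_text: str) -> str:
    """Combine the signals conservatively into one of three outcomes."""
    if not robots_allows(robots_txt, url) or header_reserves_rights(headers):
        return "opted_out"
    if text_mentions_reservation(page_text):
        # Free-text notices have no fixed form, so a human has to decide.
        return "needs_review"
    return "no_reservation_found"


if __name__ == "__main__":
    robots = "User-agent: *\nDisallow: /private/\n"
    notice = "All rights reserved. Text and data mining is prohibited."
    print(classify(robots, "https://example.org/public/a", {}, notice))
```

Even in this simplified form, only the first two signals are machine-readable. Any notice written in free text falls into the review queue, which is where the financial and staffing costs accumulate.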


Key policy recommendations: Securing the future of Open Science

To ensure that European public-interest AI can scale, we outlined a series of targeted policy interventions aimed at lowering these barriers:

  1. Clarify the scope of TDM exceptions: EU law must explicitly confirm that the training and development of AI systems constitute legitimate text and data mining activities protected under both Article 3 and Article 4. In addition, the EU legislator must clarify that open-sourcing LLMs and their components is already fully permitted under Article 3 of the CDSM Directive and does not remove an institution from the scope of the scientific research TDM exception.
  2. Introduce a statutory right to share: policymakers should strengthen Article 3 by introducing a clear right permitting scientific research institutions to host, share, and republish curated training datasets for validation and peer review.
  3. Establish good-faith “safe harbors”: researchers and public-interest intermediaries who follow standard compliance protocols and act in good faith must be legally protected from statutory copyright claims and liability.
  4. Develop a European public training corpus: Europe needs to invest in digital infrastructure by building shared, high-quality public training corpora. This would bring the data-sharing goals of the CDSM Directive into practical operation.

For European digital sovereignty to move from rhetoric to reality, the copyright framework must actively protect public-interest innovation rather than penalising it with legal exposure. A balanced digital ecosystem requires robust rights that treat shared knowledge, Open Science and Open Culture as vital common goods that make up the core strength of our societies.

See the full report on the COMMUNIA Association’s website.