

Tuesday, March 11, 2025

Detailed source documentation for generative AI is readily achievable from a technical standpoint.

GenAI + transparency: Technical solutions for the documentation of training data and sources

Prof. Dr. Sebastian Stober shows in a new article that providers of generative AI systems could provide much more detailed information about their sources than claimed - a finding with potentially far-reaching consequences for the copyright debate.

In an article dated February 28, 2025, entitled “Possibilities of source documentation and information for generative AI systems”, computer scientist and AI expert Prof. Dr. Sebastian Stober from the Otto von Guericke University Magdeburg examines whether and how providers of generative AI systems can document their training and reference sources in detail.

What are the findings?

"Training generative AI models requires large amounts of training data, a significant portion of which is obtained through web scraping from the internet. Additionally, AI systems sometimes access web sources during operation to answer specific queries. This has led to a broad debate about copyright and usage rights. Undoubtedly, rights are affected here. Regardless of the extent to which legal claims exist, the question arises whether and how these can be asserted. A basic prerequisite for this is a sufficiently detailed source documentation and an adequate means for rights holders to obtain information about the sources. Is this technically possible and feasible with reasonable effort? The short answer is: Yes, it is technically possible and in many cases – especially for web sources – trivial to document sources and make them available for disclosure. This paper describes in detail what pragmatic solutions could look like."
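To illustrate why documenting web sources can be considered technically simple, here is a minimal sketch in Python. It is not taken from Stober's article. The record schema (`SourceRecord`), the helper names (`make_record`, `find_by_url_prefix`, `to_jsonl`) and the example URLs are assumptions made purely for illustration. The idea is that a scraper could store the URL, the retrieval time and a content hash for every document it collects, and later answer a rights holder's request by filtering that log.

```python
import hashlib
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class SourceRecord:
    """One provenance entry for a scraped document (hypothetical schema)."""
    url: str
    retrieved_at: str  # ISO 8601 timestamp
    sha256: str        # hash of the raw bytes exactly as retrieved
    num_bytes: int


def make_record(url: str, content: bytes,
                retrieved_at: Optional[datetime] = None) -> SourceRecord:
    """Build a log entry at the moment a document is scraped."""
    when = retrieved_at or datetime.now(timezone.utc)
    return SourceRecord(
        url=url,
        retrieved_at=when.isoformat(),
        sha256=hashlib.sha256(content).hexdigest(),
        num_bytes=len(content),
    )


def find_by_url_prefix(records: Iterable[SourceRecord],
                       prefix: str) -> List[SourceRecord]:
    """Answer a disclosure request: which logged sources start with this URL prefix?"""
    return [r for r in records if r.url.startswith(prefix)]


def to_jsonl(records: Iterable[SourceRecord]) -> str:
    """Serialize the log as JSON Lines, one record per line."""
    return "\n".join(json.dumps(asdict(r), sort_keys=True) for r in records)
```

In this sketch, a rights holder's query for a domain reduces to a call such as `find_by_url_prefix(log, "https://example.org/")`. A production system would need persistent storage and indexing, which are omitted here.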

Matthias Hornschuh, spokesman for the IU, puts the findings into context:

“Our struggle for transparency and information is not limited to AI. Our demands in this regard run like a common thread through all the copyright debates of the last two decades. Whether it is labels, YouTube or now AI providers, we are always told: we can't do that and, incidentally, it would jeopardize our trade secrets. Sebastian Stober shows how ‘trivial’ it would actually be to obtain information about the use of our works and services. Politicians will now have to make a simple decision, namely whether they want to give greater weight to the intellectual property of US corporations than to that of the people being stolen from in their own jurisdiction. Our answer is obvious.”

Katharina Uppenbrink adds:

“Sebastian Stober's recommendations will help to finally change the discussion about the Code of Practice and the templates. In line with the AI Act, AI providers must finally be obliged to provide sufficiently detailed information about the training data.”

You can find the article here:

(DE) https://papers.ssrn.com/abstract=5165182

(EN) https://papers.ssrn.com/abstract=5165118

DOI: http://dx.doi.org/10.2139/ssrn.5165118

Press contact: info@urheber.info