The photographer brought copyright infringement proceedings, alleging that the non-profit had made an infringing copy of his images during the creation of their ‘LAION-5B’ dataset. Kneschke also alleged that this copying was in violation of the terms and conditions on the website where the images were hosted, which prohibited their scraping or downloading by automated programs. The LAION non-profit did not dispute that as part of the process of compiling their dataset they had temporarily copied one of Kneschke’s photographs, but claimed that their actions in copying and analysing the image were permitted under German law, including on the basis of text and data mining exceptions. As we have previously discussed, text and data mining (TDM) is an essential tool in developing AI models, and TDM exceptions have been thrust into the spotlight recently as being a central issue concerning the legality of the development of generative AI models.
While the decision by the court focuses on an analysis of the German legislation, it provides useful guidance on how the EU copyright exceptions for TDM, which are related to the UK’s TDM copyright exception, might be interpreted in the context of AI training. This is significant, as the new EU AI Act includes provisions that require technology companies to comply with EU copyright law when developing AI models, including respecting the right of copyright owners to ‘opt out’ of the use of their works in commercial TDM. Where a copyright owner has effectively opted out of the use of their works for commercial TDM, they are still able to enter into copyright licensing arrangements with AI developers, so the right to opt out can also function as a signal that the copyright owner requires payment before their works can be used for AI training. The decision of the German Court controversially suggests that it may be possible for rights owners to opt out of commercial text and data mining of their content through terms and conditions on a website. But will this approach be accepted on appeal or elsewhere in the EU or UK?
How was the LAION dataset created?
In order to understand the significance of this recent decision, it is necessary to have an overview of what the defendant was alleged to have done. LAION (which stands for Large-scale Artificial Intelligence Open Network) is a non-profit organisation that ‘provides datasets, tools and models to liberate machine learning research’. They developed the LAION-5B dataset, which consists of links to over 5 billion images available on the internet ‘scraped’ from the Common Crawl web archive, along with a description of the content of those images (taken from the ‘alt text’ associated with the images, which is an accessibility feature intended to be used for short and concise descriptions of images posted online).
It is important to note that the published LAION-5B dataset does not itself contain any images or photographs, but only links to locations on the internet where the photos can be viewed. In order to create and validate the dataset, the developers did, however, temporarily download a copy of each image and analyse it using an AI model from OpenAI to confirm that the text description of the image was accurate. LAION claimed that each picture was automatically deleted once this analysis had been carried out.
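The process described above (download a temporary copy, check the description, keep only the link) can be sketched as follows. This is a minimal illustration under stated assumptions, not LAION's actual code: the names `Entry`, `build_link_dataset`, `fetch`, `score` and `threshold` are hypothetical, the `score` callable merely stands in for the AI model check, and the idea of a numeric cut-off is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class Entry:
    """One dataset row: a link and a description, never the image itself."""
    url: str
    caption: str


def build_link_dataset(
    candidates: Iterable[Tuple[str, str]],
    fetch: Callable[[str], Optional[bytes]],   # hypothetical downloader
    score: Callable[[bytes, str], float],      # hypothetical image/text check
    threshold: float,                          # assumed cut-off, for illustration
) -> List[Entry]:
    kept: List[Entry] = []
    for url, caption in candidates:
        image = fetch(url)  # temporary copy held in memory only
        if image is None:
            continue  # image could not be retrieved
        try:
            if score(image, caption) >= threshold:
                kept.append(Entry(url, caption))  # store the link, not the bytes
        finally:
            del image  # discard the temporary copy after the analysis
    return kept


# Toy stand-ins so the sketch runs without network access or a real model.
fake_web = {
    "https://example.org/cat.jpg": b"cat",
    "https://example.org/dog.jpg": b"dog",
}


def fake_fetch(url: str) -> Optional[bytes]:
    return fake_web.get(url)


def fake_score(image: bytes, caption: str) -> float:
    return 1.0 if image.decode() in caption.lower() else 0.0


dataset = build_link_dataset(
    [
        ("https://example.org/cat.jpg", "A cat on a sofa"),
        ("https://example.org/dog.jpg", "Sunset over hills"),
        ("https://example.org/missing.jpg", "Unknown"),
    ],
    fake_fetch,
    fake_score,
    threshold=0.5,
)
```

The point of the sketch is the distinction that mattered to the Court: a copy of each image exists only transiently during validation, while the published output holds nothing but links and descriptions.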
After the LAION-5B dataset was created and publicly released it could be freely used by third parties who were developing their own AI models. For example, the Stable Diffusion generative AI image tool was trained using a subset of the LAION-5B dataset. When AI developers use the LAION dataset to train an AI model they will use the image links within the dataset to download images, which are then used in the generative AI training process. In this German case the Court only considered the legality of the actions of LAION, and not whether any subsequent use of the dataset by a third party AI developer would be permitted under German copyright law.
Was LAION engaging in non-commercial text and data mining research?
The German Court considered that LAION was permitted to take advantage of the scientific research TDM limitation in §60d of the German Copyright and Related Rights Act, which implements the exception for TDM for the purposes of scientific research in Article 3 of the Copyright in the Digital Single Market Directive. The German and EU exceptions permit copies of works to be made for TDM for scientific research purposes by research organisations, and operate alongside an additional narrower exception for commercial TDM activities. The German Court considered that scientific research is not limited to activities directly associated with the acquisition of new insights, but also includes earlier foundational steps which may be used later for the creation of new knowledge. In this case the dataset itself did not create new knowledge, but its subsequent use for training AI systems had the potential to give rise to new knowledge. As a result, LAION was able to take advantage of the §60d TDM exception.
The Court also confirmed that LAION did not pursue commercial purposes when developing the dataset, as they provided it to the public for free. The Court considered it irrelevant that commercial entities (such as generative AI model developers) might subsequently use the dataset for their own research, that LAION had collaborated closely with commercial AI providers who had at least partially financed the creation of the dataset, or that members of the LAION team were employed by commercial AI providers.
When can commercial organisations engage in text and data mining to develop AI models?
As the German court concluded that LAION was able to take advantage of the scientific research TDM limitation, it was not necessary to reach a concluded decision about whether the commercial TDM exception in § 44b of the German Act and Article 4 of the Copyright in the Digital Single Market Directive applied. This commercial TDM exception is the exception within EU law that commercial generative AI model developers may seek to use to argue that the development of their models is consistent with copyright law. This commercial TDM exception differs from the scientific TDM exception, as it requires those undertaking TDM to comply with any ‘opt-out’ request by copyright owners who do not wish their works to be used without a licence for TDM or AI model development. While the German Court was not required to analyse the commercial TDM exception, the Court did provide insight into how copyright owner opt-out might take place.
Under the German and EU rules, in order to prevent their works from being used in commercial text or data mining, copyright owners must ‘opt out’ by expressly reserving their rights. Where the copyright owner’s content has been made publicly available online, the German legislation (and also arguably the EU Directive) requires any opt out to be made using ‘machine-readable means’. There is currently no universally accepted technical standard for implementing ‘machine-readable’ opt outs, although the W3C does have a group working on developing such a standard. For website content it is common for copyright owners to upload a file called ‘robots.txt’ to their website, which provides instructions as to whether crawlers (such as Google’s search crawlers) are permitted to view and copy material, and which is likely to be considered an effective machine-readable opt out from commercial TDM. Another method of opting out of commercial TDM which is likely to be considered effective would be to include rights reservation metadata within the content itself.
One area of debate, however, is whether merely inserting ‘natural language’ opt outs into a website’s terms and conditions would be ‘machine readable’ and therefore effective. The German Court addressed this issue, holding that ‘machine-readable’ should be interpreted as asking whether the opt out could be understood by a machine, which would depend on the technical developments existing at the time the work was accessed. They considered that modern technologies include AI tools that can capture and analyse the content of text written in natural language, and that as a result it would be possible for a website’s terms and conditions to operate as an effective opt out of TDM. The German Court’s interpretation of ‘machine readable’ is inconsistent with the way that the term has been defined in other unrelated pieces of EU legislation, but is consistent with the approach suggested in the (non-legally binding) Recitals to the Copyright in the Digital Single Market Directive. Requiring AI developers to respect opt outs expressed in natural language in website terms and conditions is also consistent with the obligations on the providers of general-purpose AI models in the EU AI Act, which requires them to use ‘state-of-the-art’ technologies to identify and comply with opt outs. It will be interesting to watch whether other Courts across the EU adopt a similar approach when considering their own copyright exceptions for commercial TDM.
How can copyright owners opt out of text and data mining in the EU?
Where a copyright owner wishes to ensure that their content is not used without permission for commercial text and data mining or used to train generative AI systems in the EU they should consider taking a number of steps:
- For material published on their website, they should include in their robots.txt file an appropriate statement that the website operator reserves any copyright in works on the site and does not consent to those works being scraped by generative AI model providers. That robots.txt file should also be updated regularly, to ensure that it blocks the scraping tools of new and emerging AI developers. Where more selective restrictions or permissions in relation to commercial AI training are required, the similar and more recently developed ‘ai.txt’ standard could be an alternative, although it may not yet be as widely respected or understood by generative AI developers.
- Where possible, opt outs or copyright information should also be included in the metadata of files containing copyright protected content (such as images, video and audio). This can help ensure that the TDM opt out can be notified to AI developers if the content is downloaded from the copyright owners’ website and shared elsewhere online.
- Copyright owners should review their website terms and conditions to ensure that they restrict the scraping or commercial text and data mining of website content. This approach should not, however, be relied upon in isolation, as it is not certain that Courts in other EU jurisdictions will follow the recent German decision in interpreting such natural language opt outs as being machine readable.
- Copyright owners can also notify individual AI developers that they do not grant permission for their content to be used in a training set. This could be done by letter, as some representative bodies have done, or through opt-out tools developed by individual AI developers. For example, OpenAI is developing a tool called ‘Media Manager’ to assist copyright owners to exclude their works from machine learning research and training.
Copyright owners may also wish to use tools such as ‘Have I been trained?’ to help check whether their work has already been included in publicly disclosed AI training datasets.
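The robots.txt step in the list above can be checked with a short script. The following is a minimal sketch using Python's standard `urllib.robotparser` module to test which crawler names a given robots.txt file would block. The robots.txt content, the example URL and the helper name `blocked_agents` are illustrative assumptions, and the crawler names shown are examples only; a site operator would need to confirm the current user-agent strings published by each AI developer.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt: blocks two example AI-related crawlers,
# while leaving the site open to all other crawlers.
ROBOTS_TXT = """\
# Rights in works on this site are reserved; no consent to AI training.
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def blocked_agents(robots_txt: str, agents: list[str], url: str) -> list[str]:
    """Return the crawler names that this robots.txt forbids from fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, url)]


blocked = blocked_agents(
    ROBOTS_TXT,
    ["GPTBot", "CCBot", "Googlebot"],
    "https://example.org/photos/cat.jpg",
)
print(blocked)
```

A check of this kind only confirms what the file instructs; whether a particular crawler honours those instructions, and whether a court treats them as an effective rights reservation, are separate questions.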
What comes next?
The recent German decision, and the expected appeal, will be watched with interest. The decision may inform the UK Government’s soon-to-be-announced approach to resolving the long-running UK policy dispute between copyright owners and AI developers on TDM. As we discussed early last year, the LAION-5B dataset at issue in the German court decision is also central to the ongoing UK litigation between Getty Images and Stability AI. In the UK litigation it is claimed that Stability AI supported the development of the relevant LAION-5B dataset and subsequently used the dataset to train and develop its Stable Diffusion AI generator model. Further, a key element of Stability AI’s defence in the UK proceedings is that the training of its generative AI image model took place abroad and therefore did not infringe UK copyright law.