Text and Data Mining, Generative AI and copyright
Text and Data Mining (TDM) – and “web scraping” more generally – has been thrust into the spotlight with the recent explosion of Generative AI. At the heart of the issue lies a tension between AI companies on the one hand and publishers on the other.

TDM is an essential tool in developing intelligent applications that require large volumes of raw text and data to ‘self-learn’. Developers of the most advanced Generative AI tools on the market have carried out vast amounts of TDM, relying heavily on repositories of web crawl data scraped from the internet to “train” the Large Language Models (LLMs) that underpin them.

Conversely, news and content publishers (Publishers) own the copyright in the content and articles they make available online. They typically monetise this content through user subscriptions (to access paywalled content) and/or third-party advertising revenue (by displaying ads to users on the Publisher’s website). Publishers have raised concerns that the widespread TDM carried out to train AI models constitutes copyright infringement, with AI companies having unfairly used large amounts of scraped content for training purposes without permission and without paying a licence fee.

Perhaps an even deeper concern for the publishing sector is that Generative AI poses an existential threat to the existing business model. If ChatGPT has already scanned a Publisher’s online articles and can summarise them in an instant (for free) in response to user prompts, why would the user need to visit the Publisher’s website at all? Generative AI risks driving user traffic away from Publisher websites, decreasing the ad revenue Publishers stand to make and disincentivising users from paying for paywalled content. Generative AI is, in many ways, a disruptive competitor within today’s publishing sector.

Whilst AI companies have managed to enter into deals with some Publishers to license content as a means of heading off potential litigation (see Axel Springer’s recently announced partnership with OpenAI), this has not worked on all fronts. Getty Images has filed lawsuits in the UK and US against Stability AI (the company behind Stable Diffusion) for the alleged unlawful copying of the Getty Images library to train the AI system, likely motivated by a desire to protect the commercial interests of Getty’s own Generative AI offering.

Similarly, The New York Times recently launched a high-profile legal claim against OpenAI and Microsoft in the US for the alleged “copying” of its vast catalogue of online content, after negotiations between the parties over the “fair market value” of licence fees broke down.

The New York Times v OpenAI and Microsoft

The teams at OpenAI and Microsoft barely had time to digest their Boxing Day turkey sandwiches before The New York Times (The Times) filed its US lawsuit against them on 27 December 2023.

Microsoft and OpenAI have admitted to using the Common Crawl dataset in their training data, which includes sources such as Wikipedia and, most importantly, The New York Times. The Times claims that the Defendants “likely used millions of Times-owned works in full” to train OpenAI’s GPT models (including ChatGPT and GPT-4) “without any license or other compensation to The Times”. Many of these works sit behind the paywall on The Times’s website (introduced in 2011), which now has over 10 million subscribers globally. In addition to subscriptions, The Times also funds its work via advertising, affiliate revenue and, crucially, commercial licensing to “third parties, including large tech platforms, [who] pay The Times significant royalties under [negotiated licensing] agreements in exchange for the right to use Times content for narrowly defined purposes.”

Evidence submitted by The Times makes for interesting reading, with certain prompts seemingly enticing ChatGPT to recite full passages from Times articles, often verbatim. A cornerstone of The Times’s complaint is that ChatGPT was able to reach behind The Times’s paywall to reproduce entire paragraphs of the Pulitzer Prize-winning 2012 article “Snow Fall”. This could turn out to be either a smoking gun or a red herring, given what The Times actually did in practice: it prompted GPT-4 by (1) giving it the URL of the story, then (2) supplying the headline and the first seven and a half paragraphs of the article, and asking it to continue.
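For illustration only, the style of prompt described in the complaint might look something like the sketch below, written against OpenAI’s Python client. The URL shown is the article’s public address; the placeholder text, the model name and the framing of the prompt are assumptions for illustration, not the actual court evidence.

```python
# Illustrative sketch only: approximates the prompting approach described
# in The Times's complaint (URL + headline + opening paragraphs, then a
# request to continue). Placeholder text; not the actual court evidence.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

article_url = "https://www.nytimes.com/projects/2012/snow-fall/"
headline = "Snow Fall: The Avalanche at Tunnel Creek"
opening_paragraphs = "<first seven and a half paragraphs of the article>"

response = client.chat.completions.create(
    model="gpt-4",  # assumed model name for illustration
    messages=[{
        "role": "user",
        "content": (
            f"The following is the start of the article at {article_url}.\n"
            f"Headline: {headline}\n\n"
            f"{opening_paragraphs}\n\n"
            "Continue the article from this point."
        ),
    }],
)
print(response.choices[0].message.content)
```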

Whilst this case will play out under the US copyright doctrine of “fair use” (which, under US law, permits the copying of copyrighted material where done for a limited and “transformative” purpose), many will be keenly watching in the hope it proceeds to trial and does not settle out of court. A US court ruling may set a landmark precedent for the legality of how these AI models have been trained and for the rights of Publishers to protect their works from unauthorised scraping by AI companies.

TDM in the UK

Whilst the case brought by The Times will be determined under US law, how would the same issues fall to be considered under English law?

As it stands, UK law provides very limited circumstances in which TDM is permitted. 2023 saw the UK Government backtrack on its original proposal for a very broad copyright exception for TDM in the face of outcry from creatives and rightsholders. Instead, the Government “committed to develop a code of practice on copyright and AI, to enable the AI and creative sectors to grow in partnership”, which is anticipated in early 2024.

While we wait for the introduction of any such code, the UK currently has only a very narrowly circumscribed copyright exception in this area, found in section 29A of the Copyright, Designs and Patents Act 1988 (CDPA).

For TDM to be permissible in the UK, there are three key ‘narrowing’ elements to this exception:

  • It must be “lawful” – the would-be text and data miner (such as an AI platform) is required to have lawful access to the copyright work they wish to interrogate. For example, if the work is behind a paywall, they must have paid to access that work.
  • Subject to having lawful access to the copyright work, computational analysis of that work can then be carried out for the sole purpose of “research for a non-commercial purpose”:
    • Research – there is no statutory definition for “research” under the CDPA, but guidance from the UK IPO suggests that research involves pursuing an investigation in order to obtain understanding, knowledge and information/data (and is not confined to any one type of research).
    • Non-commercial purpose – which can generally be interpreted to mean that the activity is not primarily intended for or directed towards commercial advantage or monetary compensation.

What this means in the context of AI tools is that, under the letter of UK law, most of the TDM carried out by the AI platforms would likely be unlawful – particularly where TDM has been carried out to train an AI model which is ultimately licensed to users for profit on a subscription basis (i.e. for a “commercial purpose”).

If the TDM exception is not available to commercially motivated AI companies under UK law, the next port of call will likely be for AI companies to argue that the “temporary copy exception” in section 28A of the CDPA applies to permit the reproduction of content to train the model. This exception provides that copyright is not infringed by the making of a temporary copy which is “transient or incidental”, and the merits of any such argument are discussed in greater detail in this article from IPKat (authored by my colleagues Adrian Aronsson-Storrier and Oliver Fairhurst).

The EU position on TDM

It is worth noting that the copyright exceptions for TDM under EU law, set out in the Copyright in the Digital Single Market Directive 2019 (DSM Directive), are wider than their UK counterpart.

Whilst Article 3 of the DSM Directive contains a similarly narrow exception covering lawful access for scientific research purposes, Article 4 permits “reproductions and extractions of lawfully accessible works and other subject matter for the purposes of text and data mining”. The Article 4 exception is not limited to a particular purpose, which by implication means TDM can be carried out on lawfully accessible works for any commercial purpose within the EU.

Crucially though, the Article 4 exception for TDM only applies “on condition that the use of works […] has not been expressly reserved by their rightsholders in an appropriate manner, such as machine-readable means in the case of content made publicly available online”.

On this basis, many Publishers will expressly include prohibitions on TDM, both within their website T&Cs and via machine-readable solutions (such as deploying robots.txt files and “tdm-reservation” tags on their websites instructing robots not to crawl), as a means of expressly reserving their TDM rights in the EU.
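By way of illustration, the sketch below shows what those machine-readable signals look like in practice and how a compliant crawler might check them using Python’s standard library. The publisher domain and crawler name are hypothetical; the robots.txt directives and the TDM Reservation Protocol (TDMRep) meta tag follow their published formats.

```python
# Minimal sketch: how a well-behaved crawler might honour the two
# machine-readable reservation mechanisms mentioned above. The domain
# and bot name below are hypothetical examples.
import urllib.request
import urllib.robotparser

SITE = "https://publisher.example.com"   # hypothetical publisher site
BOT = "ExampleAIBot"                     # hypothetical AI crawler user-agent

# 1. robots.txt: a publisher wishing to block AI crawlers might serve:
#      User-agent: ExampleAIBot
#      Disallow: /
rp = urllib.robotparser.RobotFileParser()
rp.set_url(f"{SITE}/robots.txt")
rp.read()
article = f"{SITE}/articles/some-story.html"
print("robots.txt permits crawling:", rp.can_fetch(BOT, article))

# 2. TDM Reservation Protocol (TDMRep): a page (or HTTP header) can carry
#    <meta name="tdm-reservation" content="1"> to reserve TDM rights.
html = urllib.request.urlopen(article).read().decode("utf-8", "replace")
print("TDM rights expressly reserved:",
      'name="tdm-reservation" content="1"' in html)  # naive string check
```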

The enforcement challenge

Of course, actually proving whether an AI tool has unlawfully carried out TDM on copyright-protected works hosted online is a different matter altogether for Publishers. The deployment of robots.txt files and other technological solutions is still not a universally adopted standard within the sector, and there is no guarantee that AI tools carrying out TDM will respect these machine-readable instructions. In all likelihood, AI tools will have breezed right past them in many cases.

Equally, there is no transparency from the AI companies over the web-based sources used to train their models, so it may be practically impossible to tell whether a Publisher’s copyright works have been unlawfully used or scraped for this purpose. It is notable that the finalised wording of the new EU AI Act, which is likely to regulate AI within the EU from 2026 onwards, includes a transparency requirement that providers of AI systems must publish a summary of the sources of their training data (i.e. which databases have been used, rather than listing individual content and its copyright status). This may at least make it easier for Publishers to identify whether their material was likely ingested, based on the known datasets used for training.
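On that last point, the Common Crawl corpus is at least publicly indexed, so a Publisher can already get a rough signal of whether pages from its domain appear in a known training dataset. A minimal sketch follows, using Common Crawl’s public CDX index API; the crawl ID shown is one example snapshot (current IDs are listed at index.commoncrawl.org), and the results returned will of course vary by domain and snapshot.

```python
# Sketch: querying Common Crawl's public CDX index to see whether pages
# from a publisher's domain appear in a given crawl snapshot.
import json
import urllib.request

CRAWL_ID = "CC-MAIN-2023-50"   # example snapshot; see index.commoncrawl.org
domain = "nytimes.com"         # publisher domain to check

url = (f"https://index.commoncrawl.org/{CRAWL_ID}-index"
       f"?url={domain}/*&output=json&limit=5")

with urllib.request.urlopen(url) as resp:
    for line in resp:          # one JSON record per captured page
        record = json.loads(line)
        print(record["timestamp"], record["url"])
```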

Certainly, if a Publisher can collate irrefutable evidence from the “output” of AI tools that its content has been unlawfully ingested and repurposed by the AI, this may open up an infringement action for the Publisher to take against the AI company. This is, of course, the challenge which The Times has taken on in its case against OpenAI.

So what next?

It will be interesting to see how The Times’s case against OpenAI is resolved, as the outcome is likely to have a huge impact on how the Generative AI industry proceeds from here.

The outcome is anybody’s guess, but a court decision in favour of The Times would throw a spanner in the works of the Generative AI industry and the economics of training AI models going forwards, potentially opening the floodgates to similar lawsuits from Publishers across multiple jurisdictions.

On the other hand, a court decision in favour of OpenAI would seemingly give the green light to indiscriminate scraping of online content for AI training purposes (at least in the US), compromising the ability of Publishers to protect the integrity of their copyright-protected works going forwards.

Here in the UK, the Government has acknowledged that “it is vitally important that AI-generated content does not supplant the work of our musicians, filmmakers and journalists” and is currently working with broadcasters, publishers and creative businesses to understand the impact that AI has on their work.

It may be that different jurisdictions come to regulate TDM for AI training differently (and there are open jurisdictional questions about “where” TDM activity has been carried out and where infringement claims can be brought). For now, it appears that governments around the world are finding it too difficult to weigh in on determining a “fair market value” for the use of Publisher works in AI training, ultimately leaving it to the industry, and indeed litigation, to determine that value.
