By Barbara Rasin, J.D. Candidate, 2027
The recent boom in generative AI technology has been hampered by accusations that AI training sets violate intellectual property laws. Various rightsholders, including the New York Times and Universal Music Group, have sued the companies behind these algorithms for training on their protected IP without a license. Until this litigation is resolved, the parties remain categorically opposed: defendants seek to maximize the training data available to their algorithms, while plaintiffs’ livelihoods depend on exclusive ownership and control of their IP. In the meantime, tech companies have been promoting an ‘opt-out’ approach to assembling training data sets as a workable compromise. Under such an ‘opt-out’ scheme, AI algorithms may train on any available data, regardless of existing IP protections, so long as the rightsholder has not affirmatively opted out of being included in the dataset. The EU’s DSM Directive has already adopted such an approach, and the UK’s proposed Data Bill advocates for this solution, although the House of Lords recently rejected that provision. However, opt-out schemes are an empty promise to creators, enabling tech companies to continue their mass exploitation of unlicensed data while maintaining a veneer of compliance with IP laws.
If opt-out schemes were enforced in the US, generative algorithms would be free from liability for copyright infringement at the input level as long as they respected rightsholders’ opt-out requests. This may seem like a compromise: an opportunity to object without unduly hindering technological advancement. However, in their current iterations, ‘opt-out’ schemes do not truly allow rightsholders to opt out.
The primary assumption of an opt-out approach to AI training sets is that content owners default to being opted in. However, once an algorithm has been trained on certain data, the influence of that data cannot be removed from the resulting model. This means that by the time rightsholders are given the opportunity to opt out of an algorithm’s training data, it’s already too late. Truly giving rightsholders the ability to opt out would require AI companies to retrain their algorithms every time such an opt-out occurred. Further, it would prevent AI companies from using synthetic data: data generated by algorithms and used to train these same algorithms to ensure a limitless pool of training data and theoretically endless algorithmic optimization. Because synthetic data is made from the outputs of other generative algorithms, it will continue to carry traces of ‘opted-out’ data unless a new synthetic data set is generated every time a rightsholder opts out. These are not concessions that the big tech companies behind the most popular AI algorithms are likely to make, especially as they continue their effective lobbying efforts.
What’s more, most opt-out schemes in practice are location-based, meaning they control only the inclusion of entire domains or URLs in AI training sets without discriminating between the individual pieces of content available on each site. Such location-based approaches do very little to prevent downstream copying. Although rightsholders can opt out of web crawling on the domain that they own, they have no control over how their data is used if it ends up on third-party websites. For example, an artist may opt out from including their own website in AI training data, but the artist’s works may still end up in these datasets if they are posted to Instagram or any other third-party website that has not opted out. Given the inherent fluidity of content on the internet, this makes it essentially impossible for rightsholders to effectively consent to the use of their data within a location-based opt-out scheme.
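To illustrate, a location-based opt-out typically takes the form of a plain-text exclusion file placed at the root of a domain, in the style of robots.txt. A minimal sketch, using the published crawler names of OpenAI’s GPTBot and Common Crawl’s CCBot as examples (the domain is hypothetical):

```
# https://artist-portfolio.example/robots.txt
# Asks AI training crawlers not to fetch anything on THIS domain.
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /
```

Note that the file is domain-scoped and voluntary: a copy of the same artwork uploaded to instagram.com is governed by Instagram’s exclusion file, not the artist’s, and a crawler that chooses to ignore the file faces no technical barrier.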
The alternative is a unit-based approach, which manages opt-outs at the individual content level. However, unit-based approaches generally rely on metadata embedded in each piece of content, and metadata is inherently unreliable: it can be easily removed, often fails to survive format conversions and software updates, and is incompatible with plain text content, which has no place to embed it. So while in theory, under a unit-based approach, an artist could ensure that no digital copies of their artwork are included in AI training sets if they’ve opted out, a bad actor could easily bypass this protection, and there would be no recourse at all for excluding any copyrighted textual material.
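The fragility of a metadata-based opt-out can be sketched in a few lines of Python. The ‘no-ai’ flag and the dictionary layout below are hypothetical stand-ins for embedded fields like EXIF or XMP, not any real standard’s schema:

```python
# Sketch of a unit-based opt-out carried in embedded metadata (hypothetical schema).

def is_opted_out(asset: dict) -> bool:
    """A compliant crawler checks the embedded flag before ingesting the asset."""
    return asset.get("metadata", {}).get("no-ai") == "true"

def reupload(asset: dict) -> dict:
    """Many routine operations -- screenshots, re-encodes, social-media
    uploads -- discard embedded metadata, and the opt-out flag with it."""
    return {"pixels": asset["pixels"]}  # metadata silently dropped

artwork = {"pixels": b"...", "metadata": {"no-ai": "true"}}
assert is_opted_out(artwork)            # flag present at the source
assert not is_opted_out(reupload(artwork))  # opt-out lost after one re-save
```

The point of the sketch is that the opt-out travels with the file only as long as every intermediary preserves it, which nothing on the open web guarantees.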
Rightsholders may be forgiven for having little faith that opt-out schemes will effectively shield them from inclusion in AI training sets. For instance, Wired discovered that robots.txt, a location-based opt-out scheme that’s been a programmer staple since the 1990s, was being evaded by Perplexity AI just last year. The continued failure of a scheme with decades of opportunity for R&D does not engender confidence in the future of ‘opting out’ as a viable compromise between AI developers and rightsholders.
It’s worth mentioning that, in addition to being ineffective on its face, opting out is a fundamentally impracticable burden to place on individual rightsholders, especially as generative AI becomes increasingly commonplace and the available algorithms more numerous. When a musician seeks to interpolate the work of another songwriter, they must first obtain a license from the original rightsholder. Under an analogous opt-out approach to interpolation, a songwriter would be tasked with hunting down all uses of their work or else risk the distribution of unauthorized interpolations.
Opt-out schemes have also received pushback in the UK from both rightsholders and tech companies. Tech companies, for their part, are reluctant to give up their claim to fair use by conceding that some material may be off-limits to AI training in some instances. Yet the EU already codified an opt-out approach to AI training in 2024, as part of its AI Act, via the extension of an opt-out provision originally intended for text and data mining. While the EU’s AI Act is not yet fully enforceable, its passage may lead other jurisdictions (including the US and UK) to follow suit. Ultimately, while proponents of opt-out schemes advocate for this approach under the guise of compromise, it’s clear that this is nothing more than a concession to the tech sector at the expense of creative industries.