For the first time, OpenAI will provide access to its training data to verify whether copyrighted works were used to power its technology.
In a filing Tuesday, the authors suing OpenAI said they had reached an agreement with the Sam Altman-led startup on protocols for inspecting its information. They will seek details about how their works were incorporated into training datasets, an issue likely to become a battleground in a case that could help establish guardrails for building automated chatbots.
The agreement stems from a trio of lawsuits filed by prominent authors, including Sarah Silverman, Paul Tremblay, and Ta-Nehisi Coates, who accuse OpenAI of harvesting vast quantities of books from the internet and using them to produce infringing responses from ChatGPT. It comes after the court in July dismissed a claim that the company had engaged in unfair trade practices by using their works without consent or compensation. Earlier, U.S. District Judge Araceli Martínez-Olguín had also dismissed claims of negligence, unjust enrichment, and vicarious copyright infringement, though the authors’ claim of direct copyright infringement remains intact.
In other cases, AI companies have denied wholesale copying of works. Instead, they have argued that training their models involves developing parameters, derived from those works, that define how things look and how they are constructed. OpenAI may advance that defense in a later phase of the authors’ case, along with the argument that using published works to train its system constitutes fair use, which protects the use of copyrighted material to create a secondary work as long as it is “transformative.”
OpenAI said it trains its model on “large publicly available datasets that include copyrighted works.” Last year, it stopped disclosing which materials it uses, in an effort to maintain an edge over competitors and limit its legal exposure. While it is not known which works were used in training, the authors noted that ChatGPT generates summaries and in-depth analyses of the themes in their novels. They said the company downloaded hundreds of thousands of books from shadow library sites to train its AI system.
Under the agreement, the training datasets will be made available at OpenAI’s San Francisco office on a secure computer without internet or network access. Anyone reviewing the information will be required to sign a nondisclosure agreement, sign a visitor’s log, and provide proof of identity.
The use of any technology will be severely limited. No recording devices, including computers, cell phones, or cameras, will be allowed in the inspection room, according to the joint stipulation. OpenAI may provide limited use of a computer for note-taking, with the authors' attorneys copying those notes to another device under the supervision of company representatives at the end of each day. No copies of any part of the training data will be permitted.
“The consultant and/or inspector experts may take handwritten or electronic notes on the provided note-taking computer in scrap files, but may not copy training data into any notes,” the document reads.
Attorneys at the Joseph Saveri Law Firm are leading the case. They also represent authors in identical copyright lawsuits against Meta. Discovery in those cases is scheduled to close on Sept. 30, although a request for an extension has been filed. At a hearing Friday, U.S. District Judge Vince Chhabria questioned whether the attorneys could adequately represent the authors.
“It’s very clear to me from the documents, the case file, and the interview with the trial judge that you’ve brought this case and you haven’t done your job to move it forward,” Chhabria said, according to Politico. “You and your team have barely moved the case forward. It’s obvious… This is not your typical proposed class action. This is an important case. This is an important social issue. This is important to your clients.”
The concern arose in part from the fact that the lawyers had not taken any depositions in the case.
“It is sometimes said that timing is everything. Well, it appears that is true of bad timing, too,” wrote U.S. Magistrate Judge Thomas Hixson. “Plaintiffs are asking that the Court allow them to take 35 party depositions, excluding third-party depositions, or alternatively a total of 180 hours of testimony. And they made that request … 18 days before the current close of discovery.”
The judge added: “Since the plaintiffs have not taken any depositions, the 35 party depositions (plus third-party depositions), or alternatively the 180 hours of witness testimony, would all have to take place in the second half of September, which is obviously impossible.”