(CTN News) – According to a recent filing in a copyright infringement lawsuit, Meta Platforms was warned by its lawyers about the legal risks of using pirated books to train its AI models.
However, the company went ahead despite the warning. The filing, which consolidates two lawsuits brought against Meta by notable authors such as Sarah Silverman and Michael Chabon, alleges that Meta used their works without permission to train its AI language model, Llama.
A California judge dismissed part of the Silverman lawsuit and indicated that the authors would be allowed to amend their claims.
Meta has not yet responded to the allegations. The new complaint, filed on Monday, includes chat logs of a Meta-affiliated researcher discussing the acquisition of the dataset in a Discord server.
This evidence suggests that Meta was aware that its use of the books may not be protected by US copyright law.
In the chat logs, researcher Tim Dettmers discusses his correspondence with Meta’s legal department regarding the use of book files as training data.
Dettmers states that using The Pile at Facebook is currently not feasible for legal reasons. Meta has acknowledged using The Pile to train its initial version of Llama, but Dettmers notes that Meta’s lawyers informed him that the data cannot be used, or that models cannot be published if they are trained on that data.
The concerns likely stem from books with active copyrights. Dettmers, when approached by Reuters, was unable to comment on the allegations.
This year, tech companies have faced lawsuits for using copyrighted works without permission to develop generative AI models, a technology that has attracted widespread attention and investment.
If successful, these cases could reduce enthusiasm for generative AI by increasing expenses for AI companies, which may have to compensate artists and authors for using their works.
Additionally, new regulations in Europe may require companies to disclose the data they use to train their models, exposing them to legal risks.
Meta released the initial version of its Llama large language model in February, disclosing the datasets used for training. However, it did not disclose the training data for its latest model, Llama 2, which was released this summer.
Llama 2 is free to use for companies with fewer than 700 million monthly active users, which poses a threat to dominant players like OpenAI and Google, who charge for model usage.