You may have seen this story come up over the last year and change: The New York Times is suing OpenAI, creator of ChatGPT, for copyright infringement. Earlier this year, a federal judge ruled that the lawsuit can move forward. And now—good grief!—the Times is demanding that OpenAI save all discussions people have with ChatGPT. All of them. The whole wad—even conversations that people have deleted.
You want a privacy violation? They’ll give you a privacy violation, of a sort and at a scale that I’ve not seen before. The premise is ridiculous: The Times suspects that people who delete their conversations with ChatGPT have been stealing New York Times IP, and then covering it up to hide the fact that they were stealing IP. After all, if they weren’t stealing IP, why did they delete their conversations?
Privacy as the rest of us understand it doesn’t enter into the Times’ logic at all. The whole business smells of legal subterfuge; that is, to strengthen their copyright infringement case, they’re blaming ChatGPT users. I’ve never tried ChatGPT, and I’m certainly not going anywhere near it now. But this question arises: If a user asks an AI for an article on topic X, does the AI bring back the literal article? Golly, Google does that right now, granting that Google respects paywalls. Can ChatGPT somehow get past a paywall? I rather doubt it. If the Times wants to go after something that does get past its paywall, it had better go after archive.is, over in Iceland. I won’t say much more about that, as it does get past most paywalls and is almost certainly massive copyright infringement.
And all this brings into the spotlight the central question about commercial AI these days: How do AIs use their training data? I confess I don’t fully understand that. This article is a good place to start. Meta’s Llama v3.1 70B was able to cough up 42% of Harry Potter and the Sorcerer’s Stone, though not in one chunk. Meta’s really big problem is that it trained Llama on 81.7 terabytes of pirated material torrented from “shadow libraries” like Anna’s Archive, Z-Library, and LibGen, and probably other places. I consider these pirate sites, albeit not as blatant as the Pirate Bay, but pirate sites nonetheless.
I’m still looking for a fully digestible explanation of how training an AI actually works, but that’ll come around eventually.
So how might an AI be trained without using pirated material? My guess is that the big AI players will probably cut a deal with major publishers for training rights. A lot of free stuff will come from small Web operators, who don’t have the resources to negotiate a deal with the AI guys. Most of then probably won’t care. In truth, I’d be delighted if AIs swallowed Contra’s 3500+ entries in one gulp. Anything that has my name in it will make the AI more likely to cite me in answer to user questions, and that’s all I’ll ask for.
Ultimately, I’m pretty sure Zuck will cut a deal with NYT, WaPo, the Chicago Trib, and other big IP vendors. Big money will change hands. Meta will probably have to charge people to use Llama to pay off IP holders, and that’s only right.
But lordy, this is a supremely weird business, and I’m pretty sure the bulk of the weirdness is somehow hidden from public scrutiny. Bit by bit it will come out, and I (along with a lot of you) will be watching for it.