NY Times is asking that ALL LLMs trained on Times data be destroyed

@haxor · 6 months ago

Lvxferre · 6 months ago

Threads like this are why I discuss this shit in Lemmy, not in HN itself. The idiocy in the comments there is facepalm-worthy.

Plenty of users there are trapping themselves in the “learning” metaphor, as if LLMs were actually “learning” shit like humans would. It’s a fucking tool dammit, and it is being legally treated as such.

The legal matter here boils down to: OpenAI is picking content online, feeding it into a tool, the tool transforms it into derivative content, and the derivative content is served to users. Is the transformation deep enough to put that usage beyond the reach of copyright? A: nobody has decided yet.

Pennomi · 6 months ago

The other part of the controversy is that in certain cases where the benefit to society is strong enough, copyright can be ignored.

It’s not impossible that the Feds will step in and explicitly allow scraping for AI use, because falling behind China in LLM development is a national security issue.

Lvxferre · 6 months ago

That sounds reasonable.

Sonori · 6 months ago

The transformation doesn’t matter if it was illegally obtained in the first place. They published their material for human consumption, not as AI feedstock. The problem for Microsoft is that the only thing special about ChatGPT 4 is the amount of feedstock, and if they have to buy the rights to it, suddenly putting half the internet into Siri and selling the result doesn’t seem so brilliant anymore.

Lvxferre · 6 months ago

I don’t think that the content was illegally obtained, unless there’s some law out there that prevents you from using a crawler to automatically retrieve text. And probably there’s no law against using data intended for human consumption to feed AI.

As such, I might be wrong, but I think that the only issue is that this data is being used to output derivative works, and according to NYT, in some instances the model “can generate output that recites Times content verbatim, closely summarizes it, and mimics its expressive style.”

Sonori · 6 months ago

They’ve generally been pretty open about using pirated books and data to train their product on.

https://shkspr.mobi/blog/2023/07/fruit-of-the-poisonous-llama/

There is also no law stating that copyright doesn’t apply to training AI in the same way it applies to every other use. Even this comment technically has a copyright, in the same way that people who write long original stories on forums like Spacebattles and Sufficient Velocity post by post still have a copyright on that story.

There is a carve-out in copyright for academic research, but that protection disappears the second you start using it for a commercial purpose.

Lvxferre · edit-2 6 months ago

Now I get it. And yes, now I agree with you; it would give them a bit more merit to claim that the data being used in the input was obtained illegally. (Unless Meta has right of use to ThePile.)

The link does not mention GPT (OpenAI, Microsoft) or LaMDA/Bard (Google, Alphabet), but if Meta is doing it, odds are that the others are doing it too.

Sadly this would be up to the copyright holders of this data. It does not apply to NYT content that you can freely access online; for NYT it has got to be about the output, not the input.

@pivot_root@lemmy.world · 6 months ago

Good luck with that, NYT.

that guy · edit-2 6 months ago

And if they are, 2 seconds later someone can train a new one. Maybe they should learn to code like those coal miners they pitied.

Lvxferre · 6 months ago

> 2 seconds later someone can train a new one

“Training” datasets:

Does this look like the amount of content that you’d get in two seconds???

> Maybe they should learn to code like those coal miners they pitied.

And maybe you should go back to Reddit.