Hacker News

I do think that distilling a model from another is much less impressive than distilling one from raw text. However, it is hard to say if it is really illegal or even immoral, perhaps just one step further in the evolution of the space.


It's about as illegal as the billions, if not trillions, of copyrighted works that ClosedAI infringed to train their own models without consent. Not that they're alone, and I personally don't mind that AI companies do it, but it's still amusing when they get this annoyed at others doing the same thing to them.


I think they had the advantage of being ahead of the law in this regard. To my knowledge, reading copyrighted material isn't (or at least wasn't) illegal, and it remains a legal grey area.

Distilling weights from prompts and responses is even more of a legal grey area. The legal system cannot respond quickly to such technological advancements so things necessarily remain a wild west until technology reaches the asymptotic portion of the curve.

In my view the most interesting thing is, do we really need vast data centers and innumerable GPUs for AGI? In other words, if intelligence is ultimately a function of power input, what is the shape of the curve?


The main issue is that there have been plenty of instances where the LLM outputted copyrighted content verbatim, as happened with the New York Times and some book authors. And then there's DALL-E, which is baked into ChatGPT and which, before all the guardrails came up, was clearly trained on copyrighted content, to the point that it reproduced people's watermarks as well as their styles, just like Stable Diffusion mixes can (if you don't prompt it out).

Like you've put it, it's still a somewhat grey area, and I personally have nothing against them (or anyone else) using copyrighted content to train models.

I do find it annoying that they're so closed-off about their tech when it's built on the shoulders of openness and other people's hard work. And then they turn around and throw hissy fits when someone copies their homework, allegedly.


> Distilling weights from prompts and responses is even more of a legal grey area.

Actually, unless the law changes, this is pretty settled territory in US law: AI output is not copyrightable, and is therefore in the public domain. The only legal avenue of attack OpenAI has is a Terms of Service violation, which is a much weaker claim than copyright infringement, if it even holds.


> if intelligence is ultimately a function of power input, what is the shape of the curve?

According to a quick google search, the human body consumes ~145W of power over 24h (eating 3000kcals/day). The brain needs ~20% of that so 29W/day. Much less than our current designs of software & (especially) hardware for AI.


I think you mean the brain uses 29W (i.e. not 29W/day). Also, I suspect that burgers are a higher entropy energy source than electricity so perhaps it is even less than that.
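The unit conversion being discussed is easy to check. A minimal sketch, using the thread's own rough figures (3000 kcal/day intake, brain at ~20% of the body's budget):

```python
# Convert a daily calorie intake into a continuous power draw,
# then take the brain's ~20% share. The inputs (3000 kcal/day, 20%)
# are the thread's own rough estimates, not precise physiology.

KCAL_TO_JOULES = 4184          # 1 kilocalorie = 4184 joules
SECONDS_PER_DAY = 24 * 60 * 60

def body_power_watts(kcal_per_day: float) -> float:
    """Average power in watts implied by a given daily energy intake."""
    return kcal_per_day * KCAL_TO_JOULES / SECONDS_PER_DAY

body_w = body_power_watts(3000)   # ~145 W
brain_w = body_w * 0.20           # ~29 W, a steady draw in watts (not "W/day")
```

Watts are already joules per second, so the result is a rate; "29 W/day" double-counts the time dimension.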


Illegally acquiring copyrighted material has always been highly illegal in France, and I'm sure in most other countries. Disney is another example of how it's not grey at all.


Is the question of whether training AI on data is fair use settled yet? Because if it's not, it looks like fair use to me.


Isn't it more impressive, given that training on model output usually leads to a worse model?

If they actually figured out how to use the output of existing models to build a model that outperforms them, then that brings us closer to the singularity than any other development so far.



