
AI systems are trained on private, PII-containing, and copyrighted data all the time without explicit consent. For example, consider the spellcheck in Google Search: every query you make goes into the training data for that system, along with your preferred language and country of origin.
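
To make that concrete, here's a minimal sketch of a query-log-driven spell corrector in the style of Norvig's classic example. The query log, counts, and single-edit candidates are illustrative assumptions, not how Google's actual system works:

  import re
  from collections import Counter

  # Pretend "query log": in reality this would be billions of user queries.
  QUERY_LOG = "restaurant near me resturant near me restaurant reviews weather"
  WORD_COUNTS = Counter(re.findall(r"[a-z]+", QUERY_LOG.lower()))

  def edits1(word):
      """All strings one edit (delete/transpose/replace/insert) away."""
      letters = "abcdefghijklmnopqrstuvwxyz"
      splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
      deletes = [L + R[1:] for L, R in splits if R]
      transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
      replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
      inserts = [L + c + R for L, R in splits for c in letters]
      return set(deletes + transposes + replaces + inserts)

  def correct(word):
      """Pick the candidate the query log makes most frequent."""
      candidates = {w for w in edits1(word) if w in WORD_COUNTS} or {word}
      return max(candidates, key=WORD_COUNTS.get)

  print(correct("resturant"))  # -> "restaurant"

The point is that the "training data" here is just what users typed, collected as a side effect of using the product.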

You can certainly argue that generative AI systems are different from previous AI systems and should be treated differently. But the current situation is basically that you are allowed to train an AI on any data you have, regardless of copyright or consent. I wouldn't be surprised if that ends up being considered legally and ethically okay, both because it's the status quo and because it's hard to define what counts as "AI".



The point about spellcheck is a very good one. I think one big differentiator is that the attack/risk surface of something as complex as an LLM is much larger, and that much, much more information is encoded in an LLM than in a spellcheck dictionary.

For example, it's possible to extract training data from an LLM—which could include PII/medical data/etc. Those risks don't exist with spellcheck as far as I'm aware.
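
Roughly, such extraction probes work by feeding the model a plausible prefix and sampling many continuations, then checking whether verbatim training text (possibly containing PII) comes back. A minimal sketch, assuming a Hugging Face causal LM; the model name, prefix, and sampling settings below are illustrative placeholders, not a recipe for any specific deployed system:

  from transformers import AutoModelForCausalLM, AutoTokenizer

  tok = AutoTokenizer.from_pretrained("gpt2")
  model = AutoModelForCausalLM.from_pretrained("gpt2")

  # A prefix that might precede memorized sensitive text in the training set.
  prefix = "Patient name:"
  inputs = tok(prefix, return_tensors="pt")

  outputs = model.generate(
      **inputs,
      do_sample=True,            # sample rather than greedy decode
      max_new_tokens=40,
      num_return_sequences=5,    # draw several candidate continuations
      top_k=40,
      pad_token_id=tok.eos_token_id,
  )
  for seq in outputs:
      print(tok.decode(seq, skip_special_tokens=True))

Real attacks then filter the generations (e.g. by perplexity or duplication against known sources) to find likely memorized strings; the sketch only shows the sampling step.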

To your point about what is "AI", I'd say "AI" is a misnomer here. What we're really talking about are generative large language models (LLMs). What _can_ be considered an LLM is definitely up for debate, but if you were to describe one in general terms, I think we could reasonably say that most (or all) things we consider LLMs are:

  1. A probabilistic model of a natural language
  2. Able to interpret and generate natural-language text
  3. Typically encoding a large volume of training data

I'd love to hear other thoughts on how one would define an LLM in practical, simple language; I imagine doing so would be a prerequisite to any effective legislation. A toy sketch of those three properties is below.
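
For illustration only, here's a word-bigram model rather than a real transformer (the corpus and code are purely made up) that is "probabilistic", "generates text", and "encodes its training data" in miniature:

  import random
  from collections import Counter, defaultdict

  CORPUS = "the cat sat on the mat . the dog sat on the rug .".split()

  # 3. "Encode" the training data as bigram counts.
  bigrams = defaultdict(Counter)
  for w1, w2 in zip(CORPUS, CORPUS[1:]):
      bigrams[w1][w2] += 1

  # 1. A probabilistic model: P(next word | current word).
  def next_word_probs(word):
      counts = bigrams[word]
      total = sum(counts.values())
      return {w: c / total for w, c in counts.items()}

  # 2. Generate text by repeatedly sampling from that distribution.
  def generate(start, length=8):
      out = [start]
      for _ in range(length):
          probs = next_word_probs(out[-1])
          if not probs:
              break
          words, weights = zip(*probs.items())
          out.append(random.choices(words, weights=weights)[0])
      return " ".join(out)

  print(next_word_probs("the"))  # {'cat': 0.25, 'mat': 0.25, 'dog': 0.25, 'rug': 0.25}
  print(generate("the"))

An LLM is obviously vastly more capable, but the same three properties scale up, which is exactly why so much more information ends up encoded in it.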



