Hacker News | brisky's comments

Just a personal anecdote from my life: I set up a YouTube account for my kid with the correct age restrictions (he is 11). The account is also under a family plan, so there are no ads.

My kid logs out of this account so he can watch restricted content. I wonder: what is the PG rating for the logged-out experience?


Yes, it was vibe coded


Homeless people get the most sun and spend a lot of time outside. I hope that this is not where we are headed with AI.


Does any current open source license address the question of AI/LLM training at all? Some OSS developers have a clear sentiment against it, but currently they cannot even pick a standard OSS license that aligns with their worldview.


One of these things is true:

1. Training AI on copyrighted works is fair use, so it's allowed no matter what the license says.

2. Training AI on copyrighted works is not fair use, so since pretty much every open source license requires attribution (even ones as lax as MIT do; it's only ones that are pretty much PD-equivalent like CC0, WTFPL, and Unlicense that don't) and AI doesn't give attribution, it's already disallowed by all of them.

So in either case, having a license mention AI explicitly wouldn't do any good, and would only make the license fail to comply with the OSD.


Point 2 misses the distinction between AI models and their outputs.

Let's assume for a moment that training AI (or, in other words, creating an AI model) is not fair use. That means that all of the license restrictions must be adhered to.

For the MIT license, the requirement is to include the copyright notice and permission notice "in all copies or substantial portions of the Software". If we're going to argue that the model is a substantial portion of the software, then only the model would need to carry the notices. And it's already settled that accessing software over a server doesn't trigger these clauses.

Something like the AGPL is more interesting. Again, if we accept that the model is a derivative work of the content it was trained on, then the AGPL's viral nature would require that the model be released under an appropriate license. However, it still says nothing about the output. In fact, the GPL family licenses don't require the output of software under one of those licenses to be open, so I suspect that would also be true for content.

So far, though, in the US, it seems courts are beginning to recognize AI model training as fair use. Honestly, I'm not surprised, given that it was seen as fair use to build a searchable database of copyright-protected text. The AI model is an even more transformative use, since (from my understanding) you can't reverse engineer the training data out of a model.

But there is still the ethical question of disclosing the training material. Plagiarism still exists, even for content in the public domain. So attributing the complete set of training material would probably fall into this form of ethical question, rather than the legal questions around intellectual property and licensing agreements. How you go about obtaining the training material is also a relevant discussion, since even fair use doesn't allow you to pirate material, and you must still legally obtain it - fair use only allows you to use it once you've obtained it.

There are still questions for output, but those are, in my opinion, less interesting. If you have a searchable copy of your training material, you can do a fuzzy search of that material to return potential cases where the model returned something close to the original content. GitHub already does something similar with GitHub Copilot and finding public code that matches AI responses, but there are still questions there, too. It's more around matches that may not be in the training data or how much duplicated code needs to be attributed. But once you find the original content, working with licensing becomes easier. There are also questions about guardrails and how much is necessary to prevent exact reproduction of copyright protected material that, even if licensed for training, isn't licensed for redistribution.
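To make the fuzzy-search idea concrete, here is a minimal sketch in Python. It assumes the training material is available as a small in-memory mapping of document IDs to text; the `find_near_matches` name, the `corpus` structure, and the 0.8 threshold are all made up for illustration. It uses the standard library's `difflib.SequenceMatcher`, which would be far too slow for a real corpus (you would want an index), and it is not how GitHub's matching works.

```python
from difflib import SequenceMatcher


def find_near_matches(output, corpus, threshold=0.8):
    """Return (doc_id, ratio) pairs for corpus entries that closely resemble `output`.

    `corpus` is a hypothetical dict mapping a document ID to its full text.
    `threshold` is an arbitrary similarity cutoff between 0.0 and 1.0.
    """
    matches = []
    for doc_id, text in corpus.items():
        # ratio() returns 1.0 for identical strings and lower values
        # as the two strings diverge.
        ratio = SequenceMatcher(None, output, text).ratio()
        if ratio >= threshold:
            matches.append((doc_id, ratio))
    # Most similar documents first.
    return sorted(matches, key=lambda m: m[1], reverse=True)
```

Calling it with a model's output and the corpus returns the candidate originals, which is the point where you can look up each one's license. Picking the threshold is exactly the open "how much duplication needs attribution" question, so the number here is a placeholder, not a recommendation.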


> The AI model is an even more transformative use, since (from my understanding) you can't reverse engineer the training data out of a model.

You absolutely can; the model is quite capable of reproducing works it was trained on, if not perfectly then at least close enough to infringe copyright. The only thing stopping it from doing so is filters put in place by services to attempt to dodge the question.

> In fact, the GPL family licenses don't require the output of software under one of those licenses to be open, so I suspect that would also be true for content.

It does if the software copies portions of itself into the output, which seems close enough to what LLMs do. The neuron weights are essentially derived from all the training data.

> There are also questions about guardrails and how much is necessary to prevent exact reproduction of copyright protected material that, even if licensed for training, isn't licensed for redistribution.

That's not something you can handle via guardrails. If you read a piece of code, and then produce something substantially similar in expression (not just in algorithm and comparable functional details), you've still created a derivative work. There is no well-defined threshold for "how similar", the fundamental question is whether you derived from the other code or not.

The only way to not violate the license on the training data is to treat all output as potentially derived from all training data.


> You absolutely can; the model is quite capable of reproducing works it was trained on, if not perfectly then at least close enough to infringe copyright. The only thing stopping it from doing so is filters put in place by services to attempt to dodge the question.

The model doesn't reproduce anything. It's a mathematical representation of the training data. Software that uses the model generates the output. The same model can be used across multiple software applications for different purposes. If I were to go to https://huggingface.co/deepseek-ai/DeepSeek-V3.2/tree/main (for example) and download those files, I wouldn't be able to reverse-engineer the training data without building more software.

Compare that to a search database, which needs the full text in an indexable format, directly associated with the document it came from. Although you can encrypt the database, at some point, it needs to have the text mapped to documents, which would make it much easier to reconstruct the complete original documents.

> That's not something you can handle via guardrails. If you read a piece of code, and then produce something substantially similar in expression (not just in algorithm and comparable functional details), you've still created a derivative work. There is no well-defined threshold for "how similar", the fundamental question is whether you derived from the other code or not.

The threshold of originality defines whether something can be protected by copyright. There are plenty of small snippets of code that can't be protected. But there are still questions about these small snippets when they were consumed in the context of a larger, protected work, especially when there are only so many ways to express the same concept in a given language. It's definitely easier to reason about in written text than in code.


> The model doesn't reproduce anything. It's a mathematical representation of the training data. Software that uses the model generates the output.

By that argument, a compressed copy of the Internet doesn't reproduce the Internet, the decompression software does. That's not a useful semantic distinction; the compressed file is the derived work, not the decompression software.


Testing your app/website when it has different behaviour depending on locale


Real, but pretty minimal usage.


Great points. With Meta glasses and other similar gadgets, I think manual consent is not enough. There should be a 'protocol' to announce that you don't allow your image to be included in social media. I propose a QR code that would signify that you don't want to be filmed. We need to push for legislation granting (or restoring) this liberty. Once such automated consent is legally recognized, it will be up to social media platforms to blur and anonymize individuals with such preferences. Finally we will have a job where AI could be put to good use!


You might want to check out Really Simple Decentralized Syndication (RSDS) https://writer.did-1.com/


I think I have it as well. But my theory is that we might have imagination, only it is accessible just to the subconscious, as if it were blocked from consciousness. I have ADHD as well; it might be that this is a protection mechanism that allows my kind of brain to survive in the world better (otherwise it would be too entertaining to get lost in your own imagination). As a kid I used to daydream a lot.


It would be very useful for AI platform customers. You could run prompts with temperature 0 and check whether the results are the same, making sure that the AI provider is not swapping the PRO model in the background for a cheap one and ripping you off.
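A rough sketch of what that check could look like, in Python. It assumes outputs really are reproducible at temperature 0, which is the premise here and not something every provider guarantees today. `call_model` is a hypothetical callable you would write around your provider's client (taking a prompt, returning the text, with temperature fixed at 0); `fingerprint` and `check_for_drift` are illustrative names, not any real API.

```python
import hashlib
from typing import Callable, Dict, List


def fingerprint(text: str) -> str:
    """Hash a model response so the baseline can be stored compactly."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def check_for_drift(
    call_model: Callable[[str], str],
    baseline: Dict[str, str],
) -> List[str]:
    """Return the prompts whose current response no longer matches the baseline.

    `call_model` is a hypothetical wrapper around a provider client.
    `baseline` maps each prompt to the fingerprint recorded earlier.
    """
    drifted = []
    for prompt, expected_hash in baseline.items():
        if fingerprint(call_model(prompt)) != expected_hash:
            drifted.append(prompt)
    return drifted
```

You would record the baseline once with a set of canary prompts, then rerun `check_for_drift` on a schedule. A non-empty result only tells you that something changed (model, quantization, serving stack), not what changed.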


Similar situation: I was an independent app publisher on the app store, but I don't feel comfortable publishing my phone number next to my apps. I don't do customer support. This punishes indie app devs. After I saw this requirement, I decided to remove my app from the app store.

