25% of the top 1000 websites are blocking OpenAI from crawling: https://original...

jpablo · 2024-12-23T15:03:10 1734966190

They aren't blocking anything. They are just asking nicely not to be crawled. Given that AI companies haven't cared a single bit about ripping of other's peoples data I don't see why they would care now.

wing-_-nuts · 2024-12-23T15:24:09 1734967449

A number of sites have started outright blocking any traffic that looks remotely suspicious. This has made browsing with a vpn a bit of a pain.

pixl97 · 2024-12-23T17:25:32 1734974732

This has been ever increasing for years now. Bots, attacks, scrapers, AI, all these things seem to be the majority of traffic on most sites.

superluserdo · 2024-12-23T19:29:29 1734982169

I wish I could go back to the days of doing almost anything at all without having to tell a server what a motorbike or traffic light is.

wing-_-nuts · 2024-12-23T21:59:37 1734991177

LPT: switch to the audio captcha. Yes, it takes a bit longer than if you did one grid captcha perfectly, but I never have to sit there and wonder if a square really has a crosswalk or not, and I never wind up doing more than one.

EVa5I7bHFq9mnYK · 2024-12-23T15:59:44 1734969584

In their attempt to block OpenAI, they block me. Many sites that were accessible just 2 years ago, require login/captchas/rectal exam now just to read the content.

ammanley · 2024-12-23T17:36:05 1734975365

Im looking forward to the life experience that is content I want to read badly enough to endure a rectal exam.

EVa5I7bHFq9mnYK · 2024-12-23T19:44:09 1734983049

It's not that bad ...

fennecbutt · 2024-12-30T21:56:17 1735595777

Not sure why you're being downvoted. Watching str8 bois react with shock and horror at the idea of anything near their butt is hilarious.

Prostate and rectal cancer is real, boys. Grow tf up about it.

josu · 2024-12-23T18:47:15 1734979635

> captchas

I suspect that AIs are already more effective than humans at passing captchas.

EVa5I7bHFq9mnYK · 2024-12-23T19:42:18 1734982938

That would be an example of AI providing real value that I would pay for.

heavyset_go · 2024-12-23T19:50:47 1734983447

These exist for a fee if you want to use them

EVa5I7bHFq9mnYK · 2024-12-24T05:02:25 1735016545

I used 2captcha, for a fee ... it doesn't work

kjkjadksj · 2024-12-23T18:19:24 1734977964

They block plenty and they do it crudely. I get suspicious traffic bans from reddit all the time. Trivial enough to route around by switching user agent however. Which goes to show any crawling bot writer worth their salt already routes around reddit and most other sites bs by now. I’m just the one getting the occasional headache because I use firefox and block ads and site tracking I guess.

njovin · 2024-12-23T16:29:13 1734971353

Wouldn't it be somewhat trivial to set up honeypots?

jaybna · 2024-12-23T15:05:57 1734966357

Yeah, probably right. If you want a great rabbit hole, look up "Common Crawl" and see how a great academic project was absolutely hijacked for pennies on the dollar to grab training data - the foundation for every LLM out there right now.

CamperBob2 · 2024-12-23T18:51:13 1734979873

It's hard to envision a greater success for the "great academic project" than what happened. I mean, what else were they trying to accomplish?

jaybna · 2024-12-23T20:02:55 1734984175

It was meant to be an open-source compilation of the crawled internet so that research could be done on web search given how opaque Google's process is. It was NOT meant to be a cheap source of data for for-profit LLM's to train on.

*edit: added "for-profit"

CamperBob2 · 2024-12-23T23:23:59 1734996239

(Shrug) Multiple not-for-profit LLMs have trained on it as well.

If something I worked on turned out to play a significant part in something that turned out to be that big a deal, I'd be OK with it. And nobody's stopping people from doing web-search studies with it, to this day.

cshores · 2024-12-23T16:45:07 1734972307

It ultimately doesn't matter because a fairly current snapshot of all of the world's information is already housed in their data lakes. The next stage for AI training is to generate synthetic data either by other AI or by simulations to further train on as human generated content can only go so far.

pphysch · 2024-12-23T17:34:32 1734975272

How is synthetic data supposed to work? Broadly speaking, ML is about extracting signal from noisy data and learning the subtle patterns.

If there is untapped signal in existing datasets, then learning processes should be improved. It does not follow that there should be a separate economic step where someone produces "synthetic data" from the real data, and then we treat the fake data as real data. From a scientific perspective, that last part sounds really bad.

Creating derivative data from real data sounds, for the purpose of machine learning, like a scam by the data broker industry. What is the theory behind it, if not fleecing unsophisticated "AI" companies? Is it just myopia, Goodhart's Law applied to LLM scaling curves? Some MBA took the "data is the new oil" comment a little too seriously and inferred that data is as fungible as refined petroleum?

joshribakoff · 2024-12-23T20:06:49 1734984409

I tried to train an AI to guess the weight and reps from my exercise log but it would produce nonsense results for rep ranges I didn’t have enough training data for, as if it didn’t understand that more weight means less reps. I used synthetic training data and interpolated and imputed data for rep ranges I didn’t have data for using estimation formulas, the network then predicted better, but it also made me realize i basically made the model learn the prediction formula and AI was not actually needed and im better off using the prediction formula. But it also illustrates that the model can learn from a calculation or estimation the same way it learns from the real world, without necessarily needing to train exclusively in the real world. An ai car driving in a simulation may actually learn some of the formulas that apply both in the simulation and in the real world. The same simulations and synthetic data can also be just as useful for validation not just training. It’s not hard to imagine scenarios that are impractical, illegal or unethical to test in real life. Also, as AI becomes more advanced, synthetic data can be useful for generating superhuman examples. It’s not hard to imagine you could improve upon data from a human driver by synthetically altering it to be even safer.

pphysch · 2024-12-24T18:19:49 1735064389

Thanks, I now can see synthetic data being used to patch up holes and deal with ethical issues.

I still don't see how it could address the volume problem, like needing 10x or 100x of current data to train GPT5.

cshores · 2024-12-23T19:30:16 1734982216

As others have mentioned, Tesla is already implementing similar advancements. More broadly, a new AI framework called Genesis has emerged, capable of training robots in just minutes using purely synthetic data. It generates a virtual environment for the robot to "perceive" and train within, even though this environment doesn't physically exist. This is just one example. Another could involve an AI specifically trained to diagnose illnesses based on genetic information in DNA. The insights gained from this virtual scientist could then cross-pollinate with other AIs, enhancing their training and capabilities as well.

Nevermark · 2024-12-23T21:01:39 1734987699

Competition between AI’s to solve problems better or faster than each other, but learning from each other, is another way to start with simple problems and naturally bootstrap increasing difficulty.

elfly · 2024-12-23T20:57:18 1734987438

Synthetic data works as long as it is directed towards a clear objective and curated.

At one point someone generated a Python teaching book from a LLM, took that, trained a second LLM with that, and the new LLM knew Python.

If you are just dragging random content from the web and you don't know what's synthetic and what's human, that data may be contaminated and a lot less useful, but if someone wanted to whitewash their training data by replacing a part of it with synthetic data, it can be done.

RationPhantoms · 2024-12-23T17:53:22 1734976402

Would you trust a ML self-driving algorithm trained on a "digital twin" of a city? I would. I view synthetic training data like a digital twin in which it can provider further control or specified noise to understand from.

scottLobster · 2024-12-23T20:20:20 1734985220

No, because right now I'm working closely with some EEs to troubleshoot electrical issues on some prototype boards (I wrote the firmware). They're prototypes precisely because we know the limits of our models and simulations and need real world boards to test our electronics design and firmware on.

You're suggesting the new, untested models in a new, untested technological field are sufficient for deployment in real world applications even with a lack of real world data to supplement them. That's magical thinking given what we've experienced in every other field of engineering (and finance for that matter).

Why is AI/ML any different? Because highly anthropomorphized words like "learning" and "intelligence" are in the name? These models are some of the most complex machines humanity has ever produced. Replace "learning" and "intelligence" with "calibrated probability calculators". Then detail the sheer complexity of the calibrations needed, and tell me with a straight face that simulations are good enough.

Nevermark · 2024-12-23T21:09:25 1734988165

Both are likely to be much better.

Simulations may not be good enough alone, but still provide a significant boost.

Simulations can cheaply include scenarios that would be costly or dangerous to actually perform in the real world. And cover many combinations of scenario factors to improve combinatorial coverage.

Another way is to separate models into highly real world dependent (sensory interpretation) and more independent (kinematics based on sensory interpretation) parts. The latter being more suited to training in simulation. Obviously full real world testing is still necessary to validate the results.

fennecbutt · 2024-12-30T21:59:22 1735595962

Hey, let's shut down humanity because human behaviour can't be perfectly simulated.

kjkjadksj · 2024-12-23T18:15:21 1734977721

What makes you assume your digital twin is actually capturing the factors that contribute to variation in the real data? This is a big issue in simulation design but for ml researchers its hand-waved off seemingly.

fragmede · 2024-12-23T20:25:51 1734985551

Probably due to reports like these where the digital twin is credited with gains in factory efficiency.

https://www.forbes.com/sites/carolynschwaar/2024/12/09/schae...

joshribakoff · 2024-12-23T20:16:34 1734984994

It either improves the results or it does not, i don’t think i see the problem.

Corrado · 2024-12-23T18:03:06 1734976986

Isn’t this what Tesla does for their driving data? However it would fall apart if they didn’t have real world days to feed into it, right?

heavyset_go · 2024-12-23T19:49:06 1734983346

> Would you trust a ML self-driving algorithm trained on a "digital twin" of a city? I would.

No, just as I wouldn't trust a surgeon who studied medicine by playing Operation. A gross approximation is not a substitute for real life.

fragmede · 2024-12-23T20:22:28 1734985348

Hope you don't need surgery then! Suture training kits like these are quite popular for surgeons to train on. https://a.co/d/3cAotZ0 I don't know about you, but I'm not a rubbery rectangular slab of plastic, so obviously this kit can't help them learn.

heavyset_go · 2024-12-23T20:31:43 1734985903

This is a reason I opted to have a plastic surgeon come in when I went to the ER with an injury.

I could've had the nurse close me up and leave me with a scar, which she admitted would happen with her practice, or I could have someone with extensive experience treating wounds so that they'd heal in cosmetically appealing way do it. I opted for the latter.

scottLobster · 2024-12-23T20:33:18 1734985998

The difference being that you have to do a little more than that to become a board-certified surgeon. If a VC gives you a billion dollars to buy and practice on every available surgery practice kit in the world, you will still fail to become a surgeon. And we enforce such standards because if we don't then people die needlessly.

Nevermark · 2024-12-23T22:05:43 1734991543

How a model learns doesn’t really matter. What works works.

How it is tested and validated is what matters.

There are lots of ways to train on synthetic data, and synthetic data can have advantages as well as disadvantages over natural data.

Creative use of synthetic data is going to lead to many cases where we find it is good enough. Or even better than natural data.

joshribakoff · 2024-12-23T20:08:31 1734984511

What about a doctor who used a mix of training both on live patients as well as cadavers and models?

heavyset_go · 2024-12-23T20:22:37 1734985357

Is this doctor able to learn new information and work through novel problems on the fly, or will their actions always be based on the studying they did in the past on old information?

Similarly, when this doctor sees something new, will they just write it off as something they've seen before and confidently work from that assumption?

phyalow · 2024-12-24T02:00:31 1735005631

Um, augmentation (i.e. the generation of synthetic data) is a very very well known technique for improving learning.

Also whats with the hate for MBA’s?

Your comment is off kilter with the rules here.

pphysch · 2024-12-24T18:25:10 1735064710

Synthetic data is being proposed here as a solution to extrapolate ML scaling.

Augmentation, interpolation, smoothing are different concepts.

phyalow · 2024-12-29T04:42:21 1735447341

I think you're drawing an artificial distinction here. Synthetic data generation is fundamentally an extension of augmentation. When OpenAI uses expert generated examples and curriculum based approaches, that's literally textbook augmentation methodology. The goal of augmentation has always been to improve model fit, and scaling is just one aspect of that.

Your concern about extrapolation is interesting but misses something key when we generate synthetic data through expert demonstration or guided curriculum, we're not trying to magically create capabilities beyond the training distribution. Instead, we're trying to better sample the actual distribution of problemsolving approaches humans use. This isn't extrapolation rather, better sampling of an existing, complex distribution!

i.e. if you think about the manifold hypothesis then we know real data lives on a lowerdimensional manifold, and good synthetic data helps fill those gaps. This naturally leads to better extrapolation, it's pretty well established at this point.

TBH I think you are characterizing this as some kind of blind data multiplication scheme, but it's much closer to curriculum learning you start with basic synthetic examples and gradually ramp up complexity. So it isn't whether synthetic data is "real" or not, but if it effectively helps map the underlying distribution and reasoning patterns.

Funny enough, your oil analogy actually supports the case for synthetic data refined petroleum is more useful than crude for specific purposes, just like well designed synthetic data can be more effective than raw internet text for certain learning objectives.

jaybna · 2024-12-23T19:57:41 1734983861

https://www.nature.com/articles/s41586-024-07566-y

cshores · 2024-12-24T02:23:44 1735007024

I understand the concept of AI model collapse caused by recursion. What I’m proposing goes beyond a basic feedback loop, like repeatedly running Stable Diffusion. Instead, I envision an AI system with specialized expertise, akin to a scientist making a breakthrough based on inputs from a researcher—or even autonomously. This specialized AI could then train other, less specialized models in its area of expertise. For example, it might generate a discovery that is as straightforward as producing a white paper for interpretation. If there is an virtual "scientist" that is trained on DNA for instance, could come up with a discovery for a treatment. This gets published, circulated and trained in to other models. This isn't the kind of inbreeding that you suggest as the answer is valid.

aftbit · 2024-12-23T16:03:01 1734969781

IMO this is an underappreciated advantage for Google. Nobody wants to block the GoogleBot, so they can continue to scrape for AI data long after AI-specific companies get blocked.

Gemini is currently embarrassingly bad given it came from the shop that:

1. invented the Transformer architecture

2. has (one of) the largest compute clusters on the planet

3. can scrape every website thanks to a long-standing whitelist

Art9681 · 2024-12-23T16:17:08 1734970628

The new Gemini Experimental models are the best general purpose models out right now. I have been comparing with o1 Pro and I prefer Gemini Experimental 1206 due to its context, speed, and accuracy. Google came out with a lot of new stuff last week if you havent been following. They seem to have the best models across the board, including image and video.

HaZeust · 2024-12-23T19:55:02 1734983702

Omnimodal and code/writing output still has a ways to go for Gemini - I have been following and their benchmarks are not impressive compared to the competition, let alone my anecdotal experience in using Claude for coding, GPT for spec-writing, and Gemini for... Occasional cautious optimism to see if it can replace either.

kibwen · 2024-12-23T16:17:54 1734970674

> Nobody wants to block the GoogleBot

This only remains true as long as website operators think that Google Search is useful as a driver of traffic. In tech circles Google Search is already considered a flaming dumpster heap, so let's take bets on when that sentiment percolates out into the mainstream.

dageshi · 2024-12-23T16:45:05 1734972305

If it reaches the point where google is no longer a useful driver of traffic then there's probably little point in having a website at all any more.

5h · 2024-12-23T17:13:30 1734974010

Strange take ... I seem to remember websites having a lot of point before google.

dageshi · 2024-12-23T17:23:43 1734974623

They had a point back then because no alternatives existed.

How many websites back then would be youtube channels, podcasts or social media accounts if they had existed back then?

Nowadays most sites survive via traffic from google, if it goes away then most of those sites go away as well.

pixl97 · 2024-12-23T17:58:57 1734976737

They had a lot or point because....

1. They were a major site that was an initial starting point for traffic

2. Search engines pointed to them and people could locate them.

---

That was all a long time ago. Now people tend to go to a few 'all in one sites'. Google, reddit, '$big social media'. Other than Google most of those places optimize you to stay on that particular site rather than go to other people's content. The 'web' was a web of interconnectedness. Now it's more like a singularity. Once you pass the event horizon of their domain you can never escape again.

jameslk · 2024-12-23T16:44:32 1734972272

For OpenAI, they could lean on their relationship with Microsoft for Bing crawler access

Websites won’t be blocking the search engine crawlers until they stop sending back traffic, even if they’re sending back less and less traffic

tartuffe78 · 2024-12-23T16:16:47 1734970607

Wonder if OpenAI is considering building a search engine for this reason... Imagine if we get a functional search engine again from some company just trying to feeding their model generation...

thiagowfx · 2024-12-23T16:37:58 1734971878

There are two to distinguish: "Googlebot" and "Google-Extended".

lxgr · 2024-12-23T17:00:20 1734973220

That seems to be more like a courtesy that Google could stop extending at any point than a requirement grounded in law or legal precedent.

Palmik · 2024-12-24T07:34:36 1735025676

Same goes for OpenAI ignoring these "blocks".

heavyset_go · 2024-12-23T20:09:07 1734984547

> I am betting hundreds of thousands, rising to millions more little sites, will start blocking/gating this year. AI companies might license from big sources (you can see the blocking percentage went down), but they will be missing the long tail, where a lot of great novel training data lives.

This is where I'm at. I write content when I run into problems that I don't see solved anywhere else, so my sites host novel content and niche solutions to problems that don't exist elsewhere, and if they do, they are cited as sources in other publications, or are outright plagiarized.

Right now, LLMs can't answer questions that my content addresses.

If it ever gets to the point where LLMs are sufficiently trained on my data, I'm done writing and publishing content online for good.

zifpanachr23 · 2024-12-23T20:21:01 1734985261

I don't think it is at all selfish to want to get some credit for going to the trouble of publishing novel content and not have it all stolen via an AI scraping your site. I'm totally on your side and I think people that don't see this as a problem are massively out of touch.

I work in a pretty niche field and feel the same way. I don't mind sharing my writing with individuals (even if they don't directly cite me) because then they see my name and know who came up with it, so I still get some credit. You could call this "clout farming" or something derogatory, but this is how a lot of experts genuinely get work...by being known as "the <something> guy who gave us that great tip on a blog once".

With AI snooping around, I feel like becoming one of those old mathematicians that would hold back publicizing new results to keep them all for themselves. That doesn't seem selfish to me, humans have a right to protect ourselves and survive and maintain the value of our expertise when OpenAI isn't offering any money.

I honestly think we should just be done with writing content online now, before it's too late. I've thought a lot about it lately and I'm leaning more towards that option.

heavyset_go · 2024-12-23T21:17:24 1734988644

Agree with your assessment. I enjoy the little networks of people that develop as others use and share content. I enjoy the personal messages of thanks, the insights that are shared with me and seeing how my work influences others and the work they do. It's really cool to learn that something I made is the jumping off point for something bigger than I ever foresaw. Hell, just being reached out to help out or answer questions is... nice? I guess.

It's the little bits of humanity that I enjoy, and divorcing content from its creators is alienating in that way.

I'm not a musician, but I imagine there are similar motivations and appreciations artists have when sharing their work.

> I work in a pretty niche field and feel the same way. I don't mind sharing my writing with individuals (even if they don't directly cite me) because then they see my name and know who came up with it, so I still get some credit. You could call this "clout farming" or something derogatory, but this is how a lot of experts genuinely get work...by being known as "the <something> guy who gave us that great tip on a blog once".

Yup, my writing has netted me clients who pointed at my sites as being a deciding factor in working with me.

> I honestly think we should just be done with writing content online now, before it's too late. I've thought a lot about it lately and I'm leaning more towards that option.

The rational side of me agrees with you, and has for a while now, but the human side of me still wants to write.

glenstein · 2024-12-23T16:30:04 1734971404

>Bill Gross correctly calls this phase of AI shoplifting. I call it the Napster-of-Everything (because I am old). I am also betting that the courts won't buy the "fair use" interpretation of scraping, given the revenues AI companies generate. That means a potential stalling of new models until some mechanism is worked out to pay knowledge creators.

To your point, I have wondered whatever became of that massive initiative from Google to scan books, and whether that might be looked at as a potential training source, giving that Google has run into legal limitations on other forms of usage.

ben_w · 2024-12-23T17:00:01 1734973201

> To your point, I have wondered whatever became of that massive initiative from Google to scan books, and whether that might be looked at as a potential training source, giving that Google has run into legal limitations on other forms of usage.

Still around, doing fine: https://en.wikipedia.org/wiki/Google_Books and https://books.google.com/intl/en/googlebooks/about/index.htm...

Given the timing, I suspect it was started as simple indexing, in keeping with the mission statement "Organize the world's information and make it universally accessible and useful".

There was also reCAPTCHA v1 (books) and v2 (street view), which each improved OCR AI until the state of the art AI were able to defeat them in the role of CAPTCHA systems.

glenstein · 2024-12-23T17:33:30 1734975210

I don't know what you mean by timing (relative to what?) or "simple indexing" (they scanned the complete contents of books), but I am, and was already aware, of the wiki article and the role of recaptcha.

Maybe I wasn't clear, but I was interested in the consequences of the legal stuff. It's not clear from the wiki article what any of this means with respect to the suitability of scans for AI training.

ben_w · 2024-12-23T18:57:36 1734980256

> I don't know what you mean by timing (relative to what?) or "simple indexing" (they scanned the complete contents of books), but I am, and was already aware, of the wiki article and the role of recaptcha.

Timing as in: it started in 2004, when the most advanced AI most people used was a spam filter, so it wasn't seen as a training issue (in the way that LLMs are) *at the time*.

As for training rights, I agree with you, there's no clarity for how such data could be used *today* by the people who have it. Especially as the arguments in favour of LLM training are often by comparison to search engine indexing.

fragmede · 2024-12-23T20:43:20 1734986600

Until such time as a lawsuit declares otherwise, Google's position is obviously that scanning books, OCRing them, saving that text in a database, and using that to allow searching is no different, legally, than scanning books, OCRing them, saving that text in to a database, and using that to train LLMs. Book publishers already went up against Google for the practice of scanning in the first place, we'll see if they try again with LLM training.

pncnmnp · 2024-12-23T17:46:44 1734976004

> I have wondered whatever became of that massive initiative from Google to scan books, and whether that might be looked at as a potential training source, giving that Google has run into legal limitations on other forms of usage.

A few months ago, there was an interesting submission on HN about this - The Tragedy of Google Books (2017) (https://news.ycombinator.com/item?id=41917016).

Kostchei · 2024-12-23T15:34:41 1734968081

Using the real world- as in vision, 3d orientation, physical sensors and building training regimes that augment the language models to be multidimensional and check that perception, that is the next step.

And there is very little shortage of data and experience in the actual world, as opposed to just the text internet. Can the current AI companies pivot to that? Or do you need to be worldlabs, or v2 of worldlabs?

shanusmagnus · 2024-12-23T15:37:05 1734968225

Ironically, if it plays out this way, it will be the biggest boon to actual AGI development there could be -- the intelligence via text tokenization will be a limiting factor otherwise, imo.

Tossrock · 2024-12-23T18:37:27 1734979047

Some can. Google owns Waymo and runs Streetview, they're collecting massive amounts of spatial data all the time. It would be harder for the MS/OpenAI centaur.

code51 · 2024-12-23T16:22:54 1734970974

With current state of legal, a real challenge can happen only around 10 years from now. By then AI players will gather immense power over the law.

lxgr · 2024-12-23T17:04:55 1734973495

If you're willing to believe the narrative that there's some sort of existential "race to AGI" going on at the moment (I'm ambivalent myself, but my opinion doesn't really matter; if enough people believe it to be true, it becomes true), I don't think that'll realistically stop anyone.

Not sure how exactly the Library of Congress is structured, but the equivalent in several countries can request a free copy of everything published.

Extending that to the web (if it's not already legally, if not practically, the case) and then allowing US companies to crawl the resulting dataset as a matter of national security, seems like a step I could see within the next few years.

zifpanachr23 · 2024-12-23T20:26:52 1734985612

I agree with you about the fair use argument. Seems like it doesn't meet a lot of the criteria for fair use based on my lay understanding of how those factors are generally applied.

See https://fairuse.stanford.edu/overview/fair-use/four-factors/

I think in particular it fails the "Amount and substantiality of the portion taken" and "Effect of the use on the potential market" extremely egregiously.

cedws · 2024-12-23T14:57:18 1734965838

Cloudflare has a toggle for blocking AI scrapers. I don’t think it’s default, but it’s there.

kyledrake · 2024-12-23T17:40:15 1734975615

This just feels like mystery meat to me. My guess is that a lot of legitimate users and VPNs are being blocked from viewing sites, which numerous users in this discussion have confirmed.

This seems like a very bad way to approach this, and ironically their model quite possible also uses some sort of machine learning to work.

A few web hosting platforms are using the cloudflare blocker and I think it's incredibly unethical. They're inevitably blocking millions of legitimate users from viewing content on other people's sites and then pretending it's "anti AI". To paraphrase Theo Deraadt, they saw something on the shelf, and it has all sorts of pretty colours, and they bought it.

pixl97 · 2024-12-23T18:01:32 1734976892

> I think it's incredibly unethical.

The internet isn't built on ethical behavior, unfortunately.

kyledrake · 2024-12-23T21:32:50 1734989570

I get that a lot of people are opposed to AI, but blocking random IP ranges seems like a really inappropriate way to do this, the friendly fire is going to be massive. The robots.txt approach is fine, but it would be nice if it could get standardized so that you don't have to change it a lot based on new companies (like a generic no llm crawling directive for example).

input_sh · 2024-12-23T19:14:01 1734981241

It's not much smarter than just adding user agents to robots.txt manually.

jaybna · 2024-12-23T15:04:09 1734966249

They might get into the micro-licensing game too. More power to them.

1vuio0pswjnm7 · 2024-12-23T19:42:41 1734982961

Bill Gross:

https://twitter.com/Bill_Gross/status/1859999138836025808

https://pdl-iphone-cnbc-com.akamaized.net/VCPS/Y2024/M11D20/...

He appears to be criticising "AI" only to solicit support for his own company.

jasondigitized · 2024-12-23T18:54:00 1734980040

The amount of content coming off of YouTube every minute puts Google in a very enviable position.

vidarh · 2024-12-23T15:59:20 1734969560

All the big players are pouring a fortune into manually curated and created training data.

As it stands, OpenAI has a market cap large enough to buy a major international media conglomerate or two. They'll get data no matter how blocked they get.

Workaccount2 · 2024-12-23T15:22:56 1734967376

Doing basic copyright analyses on model outputs is all that is needed. Check if the output contains copyright, block it if it does.

Transformers aren't zettabyte sized archives with a smart searching algo, running around the web stuffing everything they can into their datacenter sized storage. They are typically a few dozen GB in size, if that. They don't copy data, they move vectors in a high dimensional space based on data.

Sometimes (note: sometimes) they can recreate copyrighted work, never perfectly, but close enough to raise alarm and in a way that a court would rule as violation of copyright. Thankfully though we have a simple fix for this developed over the 30 years of people sharing content on the internet: automatic copyright filters.

parineum · 2024-12-23T17:31:22 1734975082

It's not even close to that simple. Nobody is really questioning if the data contains the copyrighted information, we know that to be true in enough cases to bankrupt open ai, the question is what analogy should the courts be using as a basis to determine if it's infringement.

It read many works but can't duplicate them exactly sounds a lot like what I've done, to be honest. I can give you a few memorable lines to a few songs but only really can come close to reciting my favorites completely. The LLMs are similar but their favorites are the favorites of the training data. A line in a pop song mentioned a billion times is likely reproducible, the lyrics to the next track on the album, not so much.

IMO, any infringement that might have happened would be acquiring data in the first place but copy protection cares more about illegal reproduction than illegal acquisition.

webmaven · 2024-12-24T10:10:59 1735035059

You're correct, as long as you include the understanding that "reproduction" also encompasses "sufficiently similar derivative works."

Fair use provides exceptions for some such works, but not all, and it is possible for generative models to produce clearly infringing (on either copyright or trademark basis) outputs both deliberately (IMO this is the responsibility of the user) and, much less commonly, inadvertently ( ?).

This is likely to be a problem even if you (reasonably) assume that the generative models themselves are not infringing derivative works.

EricMausler · 2024-12-23T16:01:33 1734969693

No comment on if output analysis is all that is needed, though it makes sense to me. Just wanted to note that using file size differences as an argument may simply imply transformers could be a form of (either very lossy or very efficient) compression.

Workaccount2 · 2024-12-23T17:52:23 1734976343

You can argue any form of data is an arbitrarily lossy compression of any other form of data.

I get your point, but nobody is archiving their companies 50 years of R&D data with and LLM so they can get it down to 10GB.

They may have traits of data compression, but they are not at all in the class of data compression software.

jaybna · 2024-12-23T15:31:39 1734967899

So then copyrighted content scraped is not needed for training? Guess I missed AGI suddenly appearing that reasoned things out all by itself.

Workaccount2 · 2024-12-23T15:38:38 1734968318

Nothing builds a better strawman than a foundation started with "So".

cma · 2024-12-23T16:58:53 1734973133

People upload lots from those sites to chatgpt asking to summarize.

devsda · 2024-12-23T19:00:57 1734980457

That's still manual and minuscule compared to the amount they can gather by scraping.

If blocking really becomes a problem, they can take a page out of Google's playbook[1] and develop a browser extension to scrape page content and in exchange offer some free credits for Chat-GPT or a summarizer type of tool(s). There won't be shortage of users.

1. https://en.wikipedia.org/wiki/Google_Toolbar

cma · 2024-12-23T19:57:45 1734983865

Before long people will also continuously use it to watch their screen and act as an assistant, so it can slurp up everything people actually read. People could poison it though with faked browsing of e. G. foreign propaganda stuff made to look like being read from CNN.