Hacker News | james2doyle's comments

Still not "9 times faster", and still seems disingenuous, but here is one comparison that is at least given with some more context: https://x.com/ShieldCrush/status/1943516032674537958


You call it extortion of the AI companies, but isn’t stealing/crawling/hammering a site to scrape their content to resell just as nefarious? I would say Cloudflare is giving these site owners an option to protect their content and as a byproduct, reduce their own costs of subsidizing their thieves. They can choose to turn off the crawl protection. If they aren't, that tells you that they want it, doesn’t it?

>You call it extortion of the AI companies, but isn’t stealing/crawling/hammering a site to scrape their content to resell just as nefarious?

You can easily block ChatGPT and most other AI scrapers if you want:

https://habeasdata.neocities.org/ai-bots
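For the cooperative crawlers, blocking usually comes down to a robots.txt along these lines (a sketch; the user agents shown are ones these companies publish, but check the current lists before relying on them):

```
# robots.txt — illustrative excerpt denying common AI crawlers
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

# Everyone else may crawl
User-agent: *
Allow: /
```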


This is just using robots.txt and asking "pretty please, don’t scrape me".

Here is an article (from TODAY) about the case where Perplexity is being accused of ignoring robots.txt: https://www.theverge.com/news/839006/new-york-times-perplexi...

If you think a robots.txt is the answer to stopping the billion-dollar AI machine from scraping you, I don’t know what to say.


If someone has a robots.txt, and I want to request their page, but I want to do that in an automated way, should I open the browser to do it instead of issuing a curl request? How about if I am going to ask Claude to fetch the page for me?

Respect the robots.txt and don’t do it?
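A well-behaved automated client, whether a curl script or an agent, can check robots.txt itself before fetching; Python ships a parser for exactly this. A minimal sketch, using a made-up robots.txt rather than a live fetch:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content blocking one AI crawler
robots_txt = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("GPTBot", "https://example.com/article"))    # False
print(rp.can_fetch("curl/8.0", "https://example.com/article"))  # True
```

In a real script you would point `rp.set_url()` at the site's robots.txt and call `rp.read()` before deciding whether to fetch.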

Yes, I was referring to legitimate companies, and Perplexity doesn't seem to be one of those.

Oh for sure. When he wrote of the AI companies that are "stealing/crawling/hammering", you thought he meant the legitimate ones that do honor robots.txt. That makes sense.

Actually, it looks like all the major ones do honour robots.txt, including Perplexity. They seemingly get around it using Google SERPs, so they're not actually crawling or hammering the site's servers (or even Cloudflare).

https://www.ailawandpolicy.com/2025/10/anti-circumvention-re...


I'm guessing you don't manage any production web servers?

robots.txt isn't even respected by all of the American companies. Chinese ones (which often also use what are essentially botnets in Latin American and the rest of the world to evade detection) certainly don't care about anything short of dropping their packets.


I have been managing production commercial web servers for 28 years.

Yes, there are various bots, and some of the large US companies such as Perplexity do indeed seem to be ignoring robots.txt.

Is that a problem? It's certainly not a problem of CPU or network bandwidth (the load is very minimal). Yes, it may be an issue if you are concerned with scraping (which I'm not).

Cloudflare's "solution" is a much bigger problem that affects me multiple times daily (as a user of sites that use it), and those sites don't seem to need protection against scraping.


It is rather disingenuous to backpedal from "you can easily block them" to "is that a problem? who even cares" when someone points out that you cannot in fact easily block them.

I was referring to legitimate ones, which you can easily block. Obviously there are scammy ones as well, and yes it is an issue, but for most sites I would say the cloudflare cure is worse than the problem it's trying to cure.

"No true Scotsman needs Cloudflare, because any true Scotsman can block AI bots himself" is not a strong argument.

But is there any actual evidence that any major AI bots are bypassing robots.txt? It looked as if Perplexity was doing this, but after looking into it further it seems that likely isn't the case. Quite often people believe single source news stories without doing any due diligence or fact checking.

I haven't been in the weeds in a few months, but last time I was there we did have a lot of traffic from bots that didn't care about robots. Bytedance is one that comes to mind.

Security almost always brings inconvenience (to everyone involved, including end users). That is part of its cost.

What security issue is actually being solved here though?

No you cannot! I blocked all of the user agents on a community wiki I run, and the traffic came back hours later masquerading as Firefox and Chrome. They just fucking lie to you and continue vacuuming your CPU.
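For reference, the kind of user-agent blocking described here is typically an nginx `map` along these lines (a sketch; the bot names are illustrative, and as the comment notes, it only catches crawlers that identify themselves honestly):

```nginx
# In the http {} context: flag known AI crawler user agents.
# Bots that lie about their user agent sail straight past this.
map $http_user_agent $block_ai_bot {
    default         0;
    ~*GPTBot        1;
    ~*ClaudeBot     1;
    ~*Bytespider    1;
    ~*PerplexityBot 1;
}

server {
    listen 80;

    if ($block_ai_bot) {
        return 403;
    }
}
```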

There shouldn't be any noticeable CPU hit from bots on a site like that. Are you sure it's not a DDoS?

Obviously it depends on the bot, and you can't block the scammy ones. I was really just referring to the major legitimate companies (which might not include Perplexity).


There is a noticeable hit, there's also a noticeable cost, and it's not a DDoS.

Not all sites can have full caching, we've tried.


I was referring to the community wiki.

How are you this naive? Do you really think scrapers give a damn about your robots.txt?

The legitimate ones do, which is what I was referring to. Obviously there are bastard ones as well.

This is the equivalent of asking people not to speed on your street.

Tell me you don't run a site without telling me you don't run a site

Tell me you make incorrect assumptions without specifically saying so. (Yes, you're incorrect).

How often do you buy the first result on an Amazon search? Because that's delegating your labour, isn't it? Surely the best products are getting to the top, right? Well no, they're being paid to get to the top. An LLM that has in-app shopping is gonna be the same thing


Do you buy the first item that pops up on Amazon for a search that you've made? Because that's letting the robot do it for you.

If the answer is "no because that's an ad", well, how do you know that the output from ChatGPT isn't all just products that have bought their rank in the results?


You get the sources, you click through to them to see what they are.

EDIT: Like, have you actually tried this? If you ask it to summarise what Reddit is saying with sources, that’s pretty much exactly what you get.


There seem to be quite a few projects like this these days:

* https://github.com/Scythe-Technology/zune
* https://github.com/lune-org/lune
* https://github.com/luau-lang/lute
* https://github.com/seal-runtime/seal

All slightly different. Personally, I like Lune and Zune. I've used both to play around and find them fun to use.


I think luvit [1] by Tim Caswell was the first one I saw that got me excited many years ago. I love to see passion for Lua/Lua derivatives.

[1] https://github.com/luvit/luvit


I was playing with the new IBM Granite models. They are quick/small and they do seem accurate. You can even try them online in the browser because they are small enough to be loaded via the filesystem: https://huggingface.co/spaces/ibm-granite/Granite-4.0-Nano-W...

Not only are they a lot more recent than Gemma, they seem really good at tool calling, so probably good for coding tools. I haven't personally tried them myself for that.

The actual page is here: https://huggingface.co/ibm-granite/granite-4.0-h-1b


Not the person you replied to, but thanks for this recommendation. These look neat! I'm definitely going to give them a try.


Interesting. Is there a way to load this into Ollama? Doing things in browser is a cool flex, but my interest is specifically in privacy respecting LLMs -- my goal is to run the most powerful one I can on my personal machine, with the end goal being those little queries I used to send to "the cloud" can be done offline, privately.


> Is there a way to load this into Ollama?

Yes, the granite 4 models are on ollama:

https://ollama.com/library/granite4

> but my interest is specifically in privacy respecting LLMs -- my goal is to run the most powerful one I can on my personal machine

The HF Spaces demo for granite 4 nano does run on your local machine, using Transformers.js and ONNX. After downloading the model weights you can disconnect from the internet and things should still work. It's all happening in your browser, locally.

Of course ollama is preferable for your own dev environment. But ONNX and transformers.js is amazingly useful for edge deployment and easily sharing things with non-technical users. When I want to bundle up a little demo for something I typically just do that instead of the old way I did things (bundle it all up on a server and eat the inference cost).


Thanks for this pointer and explanation, I appreciate it.

Also my "dev environment" is vi -- I come from infosec (so basically a glorified sysadmin) so I'm mostly making little bash and python scripts, so I'm learning a lot of new things about software engineering as I explore this space :-)

Edit: Hey which of the models on that page were you referring to? I'm grabbing one now that's apparently double digit GB? Or were you saying they're not CPU/ram intensive but still a bit big?


> Edit: Hey which of the models on that page were you referring to?

I was referring to the smaller ones -- `granite4:micro`, `granite4:latest`, `granite4:350m`.

> I'm grabbing one now that's apparently double digit GB?

You are probably downloading one of these two ids: `granite4:small-h` or `granite4:32b-a9b-h`.

The "small" model _is_ small in relative terms, but is also the largest of the currently released granite models! At 32B parameters (19GB download) it's runnable locally but not in the same "run on your laptop with acceptable performance" category of the nano/micro models.

> Also my "dev environment" is vi -- I come from infosec (so basically a glorified sysadmin) so I'm mostly making little bash and python scripts, so I'm learning a lot of new things about software engineering as I explore this space :-)

Shameless plug: if you're writing Python scripts to automate things using small locally hosted models, consider trying out https://github.com/generative-computing/mellea

Mellea tries to nudge toward good software engineering practices -- breaking down big tasks into smaller parts, checking outputs after nondeterministic steps, thinking in terms of data structures and invariants rather than flow charts, etc. We built it with "actual fully automated robust workflows" in mind. You can use it with big models or small models, but it really shines when used with small models.
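The "check outputs after nondeterministic steps" idea, independent of any particular library, boils down to a validate-and-retry loop. A minimal sketch (note: `generate` here is a stand-in for a model call, not Mellea's actual API):

```python
from itertools import chain, repeat

# Stand-in for a nondeterministic small-model call: for this demo
# it fails once, then keeps returning a well-formed answer.
_fake_outputs = chain(["not sure"], repeat("42"))

def generate(prompt: str) -> str:
    return next(_fake_outputs)

def is_valid(output: str) -> bool:
    # The invariant we require of the model's output
    return output.isdigit()

def generate_checked(prompt: str, retries: int = 5) -> str:
    """Re-sample until the output satisfies the invariant."""
    for _ in range(retries):
        out = generate(prompt)
        if is_valid(out):
            return out
    raise RuntimeError(f"no valid output after {retries} attempts")

print(generate_checked("What is 6 * 7?"))  # 42
```

Swapping the stub for a real client call gives you the same structure: small step, explicit check, bounded retries.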


* Comprehensive API: Includes fully featured APIs for filesystem operations, networking, and standard I/O.
* Rich Standard Library: A rich standard library with utilities for basic needs, reducing the need for external dependencies.
* Cross-Platform Compatibility: Fully compatible with Linux, macOS, and Windows, ensuring broad usability across different operating systems.
* High Performance: Built with Zig, designed for high performance and low memory usage.


Check this out: https://github.com/gemini-cli-extensions/security

This one seems to showcase a bunch of the "extension" features, including a custom MCP for dealing with file line numbers.


There's a fork of the repo you posted: https://github.com/QuanZhang-William/gemini-cli-security that's listed on Google's extensions gallery page: https://geminicli.com/extensions/browse/

EDIT:

I've posted about it on GitHub: https://github.com/gemini-cli-extensions/security/issues/81

Hopefully the relevant team will see it there.


Hey! I'm the PM for that extension — thanks for sharing it. Give it a try and let us know what you think about it.

Feedback, bug reports, and ideas are all welcome on the GitHub repo's issues tab. Happy to answer any questions here too.


There is also Nelua (https://nelua.io/), which can be compiled to WASM, allowing its usage in the browser: https://github.com/edubart/nelua-game2048/


Yeah, Hammerspoon seems a lot more capable.

I wrote two articles on using global hotkeys with Hammerspoon:

https://ohdoylerules.com/tricks/hammerspoon-hyper-key/
https://ohdoylerules.com/tricks/hammerspoon-number-pad-short...

One thing you can even do is detect which devices are being used and handle shortcuts differently. You can write a full-on workflow that can be triggered with a keyboard shortcut if you're using Hammerspoon.

I recently switched from a home-cooked keyboard "expansion" plugin to Espanso, but it can do that as well!

