An important Cursor feature that no one else seems to have implemented yet is documentation indexing. You give it a base URL and it crawls and generates embeddings for API documentation, guides, tutorials, specifications, RFCs, etc. in a very language-agnostic way. That plus an agent tool to do fuzzy or full-text search on those same docs would also be nice. Referring to those @docs in the context works really well to ground the LLMs and eliminate API hallucinations.
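For concreteness, a minimal sketch of that crawl-embed-search loop might look like the following. It assumes the requests, beautifulsoup4, and sentence-transformers packages and a hypothetical docs URL; it is an illustration of the idea, not Cursor's implementation.

```python
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup
from sentence_transformers import SentenceTransformer, util

def crawl(base_url: str, max_pages: int = 50) -> dict[str, str]:
    """BFS over same-host links, collecting visible page text."""
    seen, queue, pages = set(), [base_url], {}
    host = urlparse(base_url).netloc
    while queue and len(pages) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        try:
            soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
        except requests.RequestException:
            continue
        pages[url] = soup.get_text(" ", strip=True)
        for a in soup.find_all("a", href=True):
            link = urljoin(url, a["href"]).split("#")[0]
            if urlparse(link).netloc == host:  # stay on the docs site
                queue.append(link)
    return pages

model = SentenceTransformer("all-MiniLM-L6-v2")
pages = crawl("https://docs.example.com/")  # hypothetical docs site
urls = list(pages)
embeddings = model.encode([pages[u] for u in urls])  # one vector per page

def search(query: str, k: int = 3) -> list[str]:
    """Return the k pages most semantically similar to the query."""
    scores = util.cos_sim(model.encode(query), embeddings)[0]
    return [urls[int(i)] for i in scores.argsort(descending=True)[:k]]
```

A real indexer would chunk pages rather than embed them whole, and would pair this with the full-text search tool mentioned above.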

Back in 2023 one of the Cursor devs mentioned [1] that they first convert the HTML to markdown, then do n-gram deduplication to remove nav, headers, and footers. The state of the art for chunking has probably gotten a lot better since then, though.

[1] https://forum.cursor.com/t/how-does-docs-crawling-work/264/3
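A rough sketch of what that n-gram dedup step could look like, operating on plain text for simplicity: shingles that recur across most pages are treated as nav/header/footer boilerplate, and lines made entirely of them are dropped. Again, an illustration, not Cursor's actual code.

```python
from collections import Counter

def ngrams(text: str, n: int = 5) -> set[tuple[str, ...]]:
    """All word-level n-grams in a piece of text."""
    words = text.split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def strip_boilerplate(pages: list[str], threshold: float = 0.8) -> list[str]:
    """Drop lines whose n-grams all recur on >= threshold of the pages."""
    # Each n-gram is counted once per page it appears on.
    counts = Counter(g for page in pages for g in ngrams(page))
    common = {g for g, c in counts.items() if c >= threshold * len(pages)}
    cleaned = []
    for page in pages:
        kept = [line for line in page.splitlines()
                if not (ngrams(line) and ngrams(line) <= common)]
        cleaned.append("\n".join(kept))
    return cleaned
```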



The continue.dev plugin for Visual Studio Code provides documentation indexing. You provide a base URL and a tag; the plugin then scrapes the documentation and builds a RAG index. This allows you to use the documentation as context within chat. For example, you could ask "@godotengine what is a sprite?"
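If I remember the Continue config right, registering a docs source is roughly this in config.json (field names from memory, so check the current schema before relying on them):

```json
{
  "docs": [
    {
      "title": "godotengine",
      "startUrl": "https://docs.godotengine.org/en/stable/"
    }
  ]
}
```

The title becomes the @-tag you reference in chat.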


So this is why everything is going behind Anubis then?


Nah, Anubis combats systematic scraping of the web by data scrapers, not actual user agents.


In this case the scraper is the agent of the user. That doesn't make it not a scraper, and it can and will get trapped.


Cursor’s doc indexing is actually one of the few AI coding features that feels like it saves time. Embedding full doc sites, deduping nav/header junk, then letting me reference @docs inline genuinely improves context grounding instead of leaving the model to guess APIs.


Just use the Context7 MCP? Admittedly I'm assuming Void supports MCP.


Context7 is missing lots of information from the repos it indexes, and it's getting bloated with similar-sounding repos, which is becoming confusing for LLMs.


Can you elaborate on how Context7 handles document indexing or web crawling? If I connect to the MCP server, will it be able to crawl websites fed to it?


Agreed - this is one of the better solutions today.


This is a good point. We've stayed away from documentation, assuming that it's more of a browser-agent task, and I agree with other commenters that this would make a good MCP integration.

I wonder if the next round of models trained on tool-use will be good at looking at documentation. That might solve the problem completely, although OSS and offline models will need another solution. We're definitely open to trying things out here, and will likely add a browser-using docs scraper before exiting Beta.
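For illustration, the simplest version of such a docs tool is just fetch-and-convert. A hedged sketch, assuming the requests and html2text packages (a hypothetical tool name, and a real implementation would need a headless browser for JS-rendered sites):

```python
import requests
import html2text

def fetch_docs(url: str, max_chars: int = 8000) -> str:
    """Hypothetical agent tool: fetch a docs page and return it as markdown."""
    html = requests.get(url, timeout=10).text
    markdown = html2text.html2text(html)  # strip tags, keep structure
    return markdown[:max_chars]           # stay within the model's context
```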


I agree that on the face of it this is extremely useful. When I tried using it for multiple libraries it was a complete failure, though: it failed to crawl fairly standard MkDocs and Sphinx sites. I guess it's better for the 'built-in' ones that they've pre-indexed.


I use it mostly to index stuff like Rust docs on docs.rs and rendered mdbooks. The RAG is hit or miss but I haven’t had trouble getting things indexed.



