As someone with workloads that can benefit from these techniques, but limited resources to put them into practice, my working thesis has been:
* Use a multi-threaded tokio runtime that's allocated a thread per core (a rough sketch follows after this list)
* Focus on application development, so that tasks are well scoped and evenly sized, and don't _need_ stealing in the typical case
* Over time, the smart people working on Tokio will apply research to minimize the cost of work-stealing that's not actually needed.
* At the limit, where long-lived tasks can be distributed across cores and all cores are busy, the performance will be near-optimal as compared with a true thread-per-core model.
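To make the first bullet concrete, here's a minimal sketch of that runtime setup. It assumes the `tokio` crate with the `full` feature; the task body is a placeholder, and nothing here pins threads to cores (tokio only sizes the worker pool):

    use std::thread::available_parallelism;

    fn main() -> std::io::Result<()> {
        // One worker thread per logical core; tokio does not pin workers to
        // cores by itself, so this is "thread-per-core" only in count.
        let cores = available_parallelism()?.get();
        let rt = tokio::runtime::Builder::new_multi_thread()
            .worker_threads(cores)
            .enable_all() // I/O and timer drivers
            .build()?;

        rt.block_on(async {
            // Spawn long-lived, well-scoped tasks here. If they stay roughly
            // evenly loaded, the work-stealing scheduler rarely needs to steal.
        });
        Ok(())
    }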
What's your hot take? Are there fundamental optimizations to a modern thread-per-core architecture which seem _impossible_ to capture in a work-stealing architecture like Tokio's?
A core assumption underlying thread-per-core architecture is that you will be designing a custom I/O and execution scheduler that is purpose-built for your software and workload at a very granular level. Most expectations of large performance benefits follow from this assumption.
At some point, people started using the thread-per-core style while delegating scheduling to a third-party runtime, which almost completely defeats the purpose. If you let tokio et al. do the scheduling for you, you are leaving a lot of performance and scale on the table. Scheduling is an NP-hard problem; the point of solving it at compile time, by baking the schedule into the design, is that it is computationally intractable for generic code to create a good schedule at runtime except in trivial cases. We need schedulers to consistently make excellent decisions extremely efficiently, and I think this point is often lost in discussions of thread-per-core. In the old days we didn’t have runtimes; it was just assumed you would be designing an exotic scheduler. The lack of discussion around this may have led people to believe it wasn’t a critical aspect.
The reality is that designing excellent workload-optimized I/O and execution schedulers is an esoteric, high-skill endeavor. It requires enormous amounts of patience and craft, and it doesn’t lend itself to quick-and-dirty prototypes. If you aren’t willing to spend months designing the many touch points for the scheduler throughout your software, the algorithms for how events across those touch points interact, and analyzing the scheduler at a systems level for equilibria and boundary conditions, then thread-per-core might not be worth the effort.
That said, it isn’t rocket science to design a reasonable schedule for software that is e.g. just taking data off the wire and doing something with it. Most systems are not nearly as complex as e.g. a full-featured database kernel.
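To ground that simpler case, here is a bare skeleton of the "take data off the wire and do something with it" layout: one share-nothing worker per core, with work routed by key so nothing needs to be stolen. This is illustrative only; there's no pinning, no real networking, and the workload-specific scheduling discussed above is exactly what's elided:

    use std::sync::mpsc;
    use std::thread;

    fn main() {
        let cores = thread::available_parallelism().map(|n| n.get()).unwrap_or(1);

        // One share-nothing worker per core, each owning its shard of state.
        let mut handles = Vec::new();
        let senders: Vec<_> = (0..cores)
            .map(|core| {
                let (tx, rx) = mpsc::channel::<Vec<u8>>();
                handles.push(thread::spawn(move || {
                    // Per-core event loop: in real software, this is where the
                    // workload-specific scheduling decisions live.
                    for msg in rx {
                        println!("core {core}: handling {} bytes", msg.len());
                    }
                }));
                tx
            })
            .collect();

        // "Taking data off the wire": route each item to a core by key, so
        // related work stays on one core.
        for key in 0u64..8 {
            let shard = (key as usize) % cores;
            senders[shard].send(key.to_le_bytes().to_vec()).unwrap();
        }

        drop(senders); // closing the channels ends each worker's loop
        for h in handles {
            h.join().unwrap();
        }
    }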
It still doesn't make sense. Cursor undoubtedly has smart engineers who could implement the Anthropic text editing tool interface in their IDE. Why not just do that for one of your most important LLM integrations?
I agree it doesn't make sense. I'd think they could alias their own tools to match Anthropic's, but my guess is they don't want to customize too heavily on any given model.
That's true if you're just evaluating the final answer. However, wouldn't you evaluate the context -- including internal tokens -- built by the LLM under test?
In essence, the evaluator's job isn't to do separate fact-finding, but to evaluate whether the under-test LLM made good decisions given the facts at hand.
I would if I was the developer, but if I'm the user being sold the product, or a third-party benchmarker, I don't think I'd have full access to that if most of that is happening in the vendor's internal services.
But that’s not good. You don’t want Bob to be the gatekeeper for why a process is the way it is.
In my experience working with agents helps eliminate that crap, because you have to bring the agent along as it reads your code (or process or whatever) for it to be effective. Just like human co-workers need to be brought along, so it’s not all on poor Bob.
I would recommend the `anyhow` crate and use of anyhow::Context to annotate errors on the return path within applications, like:
    fallible_func().context("failed to frob the peanut")?
Combine that with the `thiserror` crate for implementing errors within a library context. `thiserror` makes it easy to implement structured errors which embed other errors, and plays well with `anyhow`.
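A minimal sketch of how the two fit together (the error type and function names here are made up for illustration):

    use anyhow::Context;
    use thiserror::Error;

    // A made-up library error type built with `thiserror`. It can embed
    // other errors (here std::io::Error) via #[from].
    #[derive(Debug, Error)]
    enum PeanutError {
        #[error("the peanut jar is empty")]
        Empty,
        #[error("I/O failure while frobbing")]
        Io(#[from] std::io::Error),
    }

    // Library-style function returning the structured error.
    fn fallible_func() -> Result<(), PeanutError> {
        Err(PeanutError::Empty)
    }

    // Application-style caller annotating the error on the return path.
    fn main() -> anyhow::Result<()> {
        fallible_func().context("failed to frob the peanut")?;
        Ok(())
    }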
Yeah, I found `anyhow`'s `Context` to be a great way of annotating bubbled up
errors. The only problem is that using the lazy `with_context` can get
somewhat unwieldy. For all the grief people give to Go's `if err != nil`,
Rust's method chaining can get out of hand too. One particular offender I
wrote:
    match operator.propose(py).with_context(|| {
        anyhow!(
            "Operator {} failed while generating a proposal",
            operator.repr(py).unwrap()
        )
    })? {
Which is a combination of `rustfmt` giving up on long lines and also not
formatting macros as well as it formats functions.
> Microsoft owns GitHub and VSCode yet cursor was able to out execute them
Really? My startup is under 30 people. We develop in the open (source available) and are extremely willing to try new process or tooling if it'll gain us an edge -- but we're also subject to SOC2.
Our own evaluation was that Cursor et al. isn't worth the headache of the compliance paperwork. Copilot + VSCode is playing rapid catch-up and is a far easier "yes".
How large is the intersection of companies who a) believe Cursor has a substantive edge in capability, and b) have willingness to send Cursor their code (and go through the headaches of various vendor reviews and declarations)?
Windsurf was acquired for $3B by OAI and it's clearly the worse of the two. Cursor is trying to raise at a $10B valuation and has $300MM in ARR in less than two years.
So in short, yes, companies do appear to be showing some willingness to send Cursor their code, even with all the headache associated with getting a new vendor.
Yep. At the scope of a single table, append-only history is nice but you're often after a clone of your source table within Iceberg, materialized from insert/update/delete events with bounded latency.
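At its simplest, that materialization is a keyed fold over the change stream. A toy sketch with made-up types (no actual Iceberg or Postgres API involved), just to show the shape of it:

    use std::collections::HashMap;

    // Toy change-event shape; real CDC rows would be typed, not strings.
    #[derive(Debug)]
    enum CdcEvent {
        Insert { key: String, row: String },
        Update { key: String, row: String },
        Delete { key: String },
    }

    // Fold the change stream into the latest row per key -- the essence of
    // materializing a "clone" of the source table.
    fn apply(events: &[CdcEvent]) -> HashMap<String, String> {
        let mut table = HashMap::new();
        for ev in events {
            match ev {
                CdcEvent::Insert { key, row } | CdcEvent::Update { key, row } => {
                    table.insert(key.clone(), row.clone());
                }
                CdcEvent::Delete { key } => {
                    table.remove(key);
                }
            }
        }
        table
    }

    fn main() {
        let events = vec![
            CdcEvent::Insert { key: "1".into(), row: "alice".into() },
            CdcEvent::Update { key: "1".into(), row: "alice v2".into() },
            CdcEvent::Insert { key: "2".into(), row: "bob".into() },
            CdcEvent::Delete { key: "2".into() },
        ];
        // Leaves only key "1" at its latest value.
        println!("{:?}", apply(&events));
    }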
There are also nuances like Postgres REPLICA IDENTITY and TOAST columns. Enabling REPLICA IDENTITY FULL amplifies your source DB's WAL volume, but not having it means your CDC updates will clobber your unchanged TOAST values.
If you're moving multiple tables, ideally your multi-table source transactions map into corresponding Iceberg transactions.
Zooming out, there's the orchestration concern of propagating changes to table schema over time, or handling tables that come and go at the source DB, or adding new data sources, or handling sources without trivially mapped schema (legacy lakes / NoSQL / SaaS).
As an on-topic plug, my company tackles this problem. Postgres => Iceberg is a common use case.
> On the other hand, one could argue that AI is just another abstraction
I, as a user of a library abstraction, get a well-defined boundary and interface contract -- plus assurance it’s been put through its paces by others. I can be pretty confident it will honor that contract, freeing me up to not have to know the details myself or second-guess the author.
> To me the best solution seem like combining storing writes on EBS (or even NVMe) initially to minimize the time until writes can be acknowledged, and creating a chunk on S3 standard every second or so.
Yep, this is approximately Gazette's architecture (https://github.com/gazette/core). It buys the latency profile of flash storage, with the unbounded storage and durability of S3.
An addendum is that there's no need to flush to S3 quite that frequently, if readers instead tail ACK'd content from local disk. Another neat thing you can do is hand bulk historical readers pre-signed URLs to files in cloud storage, so those bytes don't need to proxy through brokers.
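A toy sketch of that write path (made-up names and trait, not Gazette's actual API): acknowledge once the bytes hit a fast local append buffer, and roll the buffer into an object-store chunk on an interval:

    use std::collections::HashMap;
    use std::time::{Duration, Instant};

    // Stand-in for S3 or another object store; names here are invented.
    trait ChunkStore {
        fn put_chunk(&mut self, name: &str, bytes: &[u8]);
    }

    struct InMemoryStore(HashMap<String, Vec<u8>>);

    impl ChunkStore for InMemoryStore {
        fn put_chunk(&mut self, name: &str, bytes: &[u8]) {
            self.0.insert(name.to_string(), bytes.to_vec());
        }
    }

    struct Broker<S: ChunkStore> {
        local_buffer: Vec<u8>, // fast local (EBS / NVMe) append buffer
        store: S,
        last_roll: Instant,
        roll_interval: Duration,
        next_chunk: u64,
    }

    impl<S: ChunkStore> Broker<S> {
        fn append(&mut self, record: &[u8]) {
            self.local_buffer.extend_from_slice(record);
            // The write can be acknowledged here, after the local append,
            // without waiting for the slower object-store upload. Readers
            // can tail this buffer directly for low-latency reads.
            if self.last_roll.elapsed() >= self.roll_interval {
                self.roll();
            }
        }

        fn roll(&mut self) {
            if !self.local_buffer.is_empty() {
                let name = format!("chunk-{:08}", self.next_chunk);
                self.store.put_chunk(&name, &self.local_buffer);
                self.next_chunk += 1;
                self.local_buffer.clear();
            }
            self.last_roll = Instant::now();
        }
    }

    fn main() {
        let mut broker = Broker {
            local_buffer: Vec::new(),
            store: InMemoryStore(HashMap::new()),
            last_roll: Instant::now(),
            roll_interval: Duration::from_secs(1),
            next_chunk: 0,
        };
        for _ in 0..3 {
            broker.append(b"hello"); // ack'd as soon as the local append lands
        }
        broker.roll(); // final flush of whatever is still buffered
        println!("chunks persisted: {}", broker.store.0.len());
    }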