The post says this in other words: in Lean, Rocq, or any other theorem prover, you get a formally-verified proof, but you do NOT get a formally verified theorem statement.
So, even if the proof is correct, you need to determine if the theorem is what you want. Making that determination requires expertise. Since you cannot "run the theorem", you cannot vibe-code your way through it. E.g., there is no equivalent of "web app seems to be working!" You have to actually understand what the theorems are saying in a deep way.
Even then this seems much more promising to me compared to other areas. Writing theorem statements is much much easier than coming up with proofs so it's not a big deal if a human has to do that. And once that's done getting a correct proof out of an LLM/AI model can be done fully automatically (assuming you do get a proof out of it though!)
> Even then this seems much more promising to me compared to other areas. Writing theorem statements is much much easier than coming up with proofs so it's not a big deal if a human has to do that.
Not at all. Theorem crafting is hard. What's easy is standing up and proving a mathematical strawman which may or may not have any relation to the original problem.
I'm not sure this is true. Encoding theorems in dependent types takes a lot of expertise.
Even without the Lean technical details, a lot of math theorems just don't mean anything to most people. For example, I have no idea what the Navier-Stokes theorem is saying. So, I would not be able to tell you if a Lean encoding of the theorem is correct. (Unless, of course, it is trivially broken, such as by assuming False.)
There is no "Navier-Stokes theorem". There is a famous class of open problems whether Navier-Stokes equations are well-posed (have solutions that don't explode in finite time) for various initial data, but that type of question is completely irrelevant for any practical purposes.
I do share the feeling though. As a non-expert, I have no idea what the existing, allegedly practically relevant, formalizations of distributed algorithms actually guarantee.
Yeah, if you have a formal proof of something but you don't know exactly what then you might as well vibe code the system itself. I've been wondering if it would be possible to formally define a property (like deadlock freedom or linearizability) and then vibe-code the spec (and associated implementation).
I don't know if that's possible but it seems like that's where the value would be.
It is but we're not there yet. You can use a refinement style system like Verus in which your spec is a very high level statement of system behavior, like "this system provides the properties that paxos is supposed to guarantee"(1), and then be guaranteed that the things that implement it are preserving that spec.
But this is still double black diamond expert territory - it remains a challenging thing to identify ways to state and refine those properties to enable the proof. But it's getting easier by the year and the LLMs are getting more useful as part of it.
(1) You would have to state the consistency and conditional liveness properties directly, of course, as Verus doesn't understand "what paxos does". That would look like TLA+-style statements, "at some time in the future, a command submitted now will be executed, etc."
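As a rough sketch of the shape such a statement could take (made-up names, written as a plain Lean proposition over traces rather than actual Verus syntax):

    -- Hypothetical sketch: a TLA+-style conditional liveness property written as
    -- a plain Lean proposition over execution traces. All names are placeholders,
    -- not Verus syntax or any real spec.
    variable (State Command : Type)
    variable (submitted executed : State → Command → Prop)

    -- "Any command submitted at time t is executed at some later time t'."
    def eventuallyExecuted (trace : Nat → State) : Prop :=
      ∀ (c : Command) (t : Nat),
        submitted (trace t) c → ∃ t', t' ≥ t ∧ executed (trace t') c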
In the case of Lean, propositions are encoded as dependent types; the user typically has to encode the statement themselves and then use, e.g., tactics to derive a term of that type.
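As a toy illustration (a throwaway example, not from any real development):

    -- The statement is itself a type; `by` enters tactic mode to construct
    -- a term of that type.
    theorem and_swap (P Q : Prop) : P ∧ Q → Q ∧ P := by
      intro h
      exact ⟨h.2, h.1⟩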
Writing a statement you don't understand then later getting a proof from an LLM doesn't seem all that useless to me; in my mind, it could still be useful as exploration. Worst case scenario: you encoded a tautology and the LLM gave you a trivial proof. Best case scenario: the proposition ends up being a lemma for something you want to prove.
I do think there is a kernel of truth in what you've stated: if the user does not actually understand the statement of a proposition, then the proof is not very useful to them, since they don't know what the statement's truthiness implies. As someone who used to do mathematics, I still find immense value in vibe-coding away mundane computations, similar to what Lean's `simp` tactic already does, but much more broad.
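To be concrete about "mundane computations" (another throwaway example), I mean facts like this, which automation already closes without any thought from me:

    -- A mundane computational fact of the kind I'd happily delegate;
    -- `decide` discharges it automatically.
    example : 2 ^ 10 = 1024 := by decide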
The worst case is that you vibe code a theorem that reads:
False => P
Then you vibe code a proof of this theorem. Then you get excited that you’ve proven P.
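In Lean the whole worst case fits in two lines (a made-up toy, but it type-checks and reads like a theorem about P):

    -- Looks like a theorem about P, but the False hypothesis makes it vacuous:
    -- it is provable for every P and tells you nothing about P itself.
    theorem looks_impressive (P : Prop) (h : False) : P :=
      h.elim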
Some of the X discussion that prompted the OP was quite close to this. There are screenshots on X of Lean code that doesn’t compile, with Lean being blamed.
Right, the risk is encoding a vacuously true statement. I consider "false implies Q" and tautologies to be vacuously true statements, since both are truths that fail to convey any meaningful information.
In any case, the worst case scenario is having a vacuously true statement. I picked tautologies as an example since those and statements about the empty set are the first things that come to mind.
I teach at a university, and spend plenty of time programming for research and for fun. Like many others, I spent some time on the holidays trying to push the current generation of Cursor, Claude Code, and Codex as far as I could. (They're all very good.)
I had an idea for something that I wanted, and in five scattered hours, I got it good enough to use. I'm thinking about it in a few different ways:
1. I estimate I could have done it without AI with 2 weeks full-time effort. (Full-time defined as >> 40 hours / week.)
2. I have too many other things to do that are purportedly more important than programming. I really can't dedicate two weeks full-time to a "nice to have" project. So, without AI, I wouldn't have done it at all.
3. I could hire someone to do it for me. At the university, those are students. From experience with lots of advising, a top-tier undergraduate student could have achieved the same thing, had they worked full tilt for a semester (before LLMs). This of course assumes that I'm meeting them every week.
This is where LLM coding shines, in my opinion. There's a list of things they do very well:
- single scripts. Anything which can be reduced to a single script.
- starting greenfield projects from scratch
- code maintenance (package upgrades, old code...)
- tasks which have a very clear and single definition. This isn't linked to complexity, some tasks can be both very complex but with a single definition.
If your work falls into this list, they will do some amazing work (and yours clearly fits that); if it doesn't, though, prepare yourself, because it will be painful.
I'm trying to determine what programming tasks are not in this list. :) I think it is trying to exclude adding new features and fixing bugs in existing code. I've done enough of that with LLMs, though not in large codebases.
I should say I'm hardly ever vibe-coding, unlike the original article. If I think I want code that will last, I'll steer the models in ways that lean on years of non-LLM experience. E.g., I'll reject results that might work if they violate my taste in code.
It also helps that I can read code very fast. I estimate I can read code 100x faster than most students. I'm not sure there is any way to teach that other than the old-fashioned way, which involves reading (and writing) a lot of code.
> I'm trying to determine what programming tasks are not in this list. :) I think it is trying to exclude adding new features and fixing bugs in existing code
Yes indeed. These, on the other hand, are the things that aren't working well in my opinion:
- large codebase
- complex domain knowledge
- creating any feature where you need product insights
- tasks requiring choices (again, complexity doesn't matter here, the task may be simple but require some choices)
- anything unclear where you don't know where you are going first
While you don't experience any of these when teaching or doing side projects, they are very common in any enterprise context.
How do you compare Claude Code to Cursor? I'm a Cursor user quietly watching the CC parade with curiosity. Personally, I haven't been able to give up the IDE experience.
I'm so sold on the CLI tools that I think IDEs are basically dead to me. I only have an IDE open so I can read the code, but most often I'm just changing configs (like switching a bool, or bumping up a limit, or something like that).
Seriously, I have 3+ claude code windows open at a time. Most days I don't even look at the IDE. It's still there running in the background, but I don't need to touch it.
When I'm using Claude Code, I usually have a text editor open as well. The CC plugin works well enough to achieve most of what Cursor was doing for me in showing real-time diffs, but in my experience, the output is better and faster. YMMV
I use CC for so much more than just writing code that I cannot imagine being constrained within an IDE. Why would I want to launch an IDE to have CC update the *arr stack on my NAS to the latest versions for example? Last week I pointed CC at some media files that weren't playing correctly on my Apple TV. It detected what the problem formats were and updated my *arr download rules to prefer other releases and then configured tdarr to re-encode problem files in my existing library.
I was where you are a few weeks ago, but I'm now on the CC train. The challenge is that the terminal is quite counterintuitive. But if you put on the Linux-terminal lens from a few years ago and start using it, it starts to make sense. The form factor of the terminal isn't intuitive for programming, but it's the ultimate.
FYI, I still use cursor for small edits and reviews.
I don't think I can scientifically compare the agents. As it is, you can use Opus / Codex in Cursor. The speed of Cursor composer-1 is phenomenal -- you can use it interactively for many tasks. There are also tasks that aren't easy to describe in English, but you can tab through them.
The PyTorch 2.9 wheels do work. You can pip install torch --index-url <whatever-it-is> and it just works. You do need to build flash attention from source, which takes an hour or so.
I have H100s to myself, and access to more GPUs than I know what to do with in national clusters.
The Spark is much more fun. And I’m more productive. With two of them, you can debug shallow NCCL/MPI problems before hitting a real cluster. I sincerely love Slurm, but nothing like a personal computer.
Your complaint sounds more like the way that you have to access the HPC (via slurm), not the compute itself. After having now tried slurm myself, I don't understand the love for it at all.
As for debugging, that's where you should be allowed to spin up a small testing cluster on-demand. Why can't you do that with your slurm access?
I think that personal computing is more fun than time-shared computing. :)
It's remarkable what can now be done on a whisper-quiet little box. I hope the Strix Halos will be just as much fun, and they should be, so long as Flash Attention works.
I really don't understand the difference. Either way, it is just a window into a computer. ¯\_(ツ)_/¯
We rent bare metal on-demand, and our whole business is to be able to offer compute that you probably wouldn't be able to host in your house ($), as if you owned it yourself.
So, we made it so that users can get access into the BMC and modify the box however they want. When they are done, we've automated the reset as well. Fully self-service.
($) These boxes are very expensive, weigh 350 lbs, sound like a jet engine, and consume ~10kW.
100% - slurm is aimed at job maintenance and resource management on HPC clusters, which makes it a pain in the ass for the kind of fast, ad hoc iteration and testing that AI/ML requires.
Unless you can submit an interactive slurm job and get exclusive access to an H100 for a few hours of dedicated time. If the cluster is overloaded, it’s hard to get those to run when you’d like, but there are still ways. But you do have to be patient.
But it’s still not quite like exclusive access to resources when you want them. So I can see it from both ways.
I don't have a math or computer science background so the more academic publications are almost always quite unreadable to me. I learn a lot by exploring the source code of existing virtual machines instead, and most are written in systems languages.
Sometimes much smarter people than I randomly decide to write articles that democratize access to very complex areas of programming language development. Examples:
Lots of discussion about choice of programming language in the comments below.
- In principle, it should not matter at all, but there are practical reasons why one PL may be better than another in a particular school or context.
- But, all this "choice of PL" discussion is really a discussion about CS1. A CS degree has at least seven other courses -- assuming 1 CS course per semester -- and in practice many more than that. So, if you're going to ask questions about CS1, the question to ask is, "Does CS1 setup students to succeed in the advanced courses?" Classically, these were courses in compilers, operating systems, networking, and so on. These days, you can add distributed computing, machine learning, etc. (but don't subtract the classics).
I think the average American today, including the average admissions officer, has a negative view of technology. So, an application that is unequivocally optimistic about technology is unlikely to be well received. I think that that is what happened here. We also have no visibility into letters of recommendation, which are likely a big factor.
This is an incredibly poor take - curious as to how you just completely made up that ridiculous claim? Ironic as well that you think letters of recommendation matter for college admissions when they are perfunctory for probably > 95% of them. Maybe you shouldn't espouse your opinions on this.
What a PhD student hopefully gets from an advisor is more targeted advice than the type of boilerplate generic advice in this article and others like it.
An advisor who knows what the student wants to accomplish, and is capable of accomplishing it, should be able to determine when building a working system is more valuable than pushing out yet another paper, when reforming science in a small way is likely to succeed, and so on.