With a sufficiently advanced type system, you could lay out high level building blocks, such as function definitions, and let the LLM go wild, and then if things compile there's a high chance it's correct. Or maybe even a formal proof things are correct.
I was blown away when I realized some Haskell functions have only one possible definition, for example. I think most people haven't worked with type systems like this, and there are type systems far more powerful than Haskell's, such as dependent types.
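A minimal sketch of what I mean (the names here are made up, but any total implementation of these signatures is forced by parametricity):

    -- This signature knows nothing about the type `a`, so the only total,
    -- non-crashing implementation is to return the argument unchanged.
    pick :: a -> a
    pick x = x

    -- Likewise, the only total function of this type is the first projection.
    firstOf :: (a, b) -> a
    firstOf (x, _) = x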
There's not much reason to worry about low-level quality standards so long as you know it's correct from a high level. I don't think we've seen what a deep integration between an LLM and a programming language can do, where the type system helps validate the LLM output, and the LLM has a type checker integrated into its training process.
if there is only one valid translation of the type constraints into executable code then what you have is a slow, expensive, and occasionally argumentative compiler
it merely remains to build a debugger for your Turing-complete type system, and the toolchain will be ready for production
> Is this a brother or a cousin of the "sufficiently advanced compiler"? :-)
I believe the claim was that a sufficiently advanced compiler could do a lot of optimization and greatly improve performance. Maybe my claim here will turn out the same.
Because we're using languages that give us flexibility but no correctness guarantees. The vast performance advantage AI programming has over us could instead be used to manage the formal proofs for a verified toolchain.
We're not quite there yet, but while regular programming is quite tough for AI due to how fuzzy it is, formal proofs are something AI is already very good at.
The functions for which there's only one implementation are trivial examples. It's not going to work for anything even slightly more complicated, like a function that returns a float.
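To make the contrast concrete, a small sketch (names are mine): once the types are concrete rather than polymorphic, the signature rules out almost nothing.

    -- Infinitely many definitions satisfy this signature; the type
    -- checker can't tell the intended one from any of the others.
    scale :: Double -> Double
    scale x = x * 2
    -- scale x = x + 2        -- also type-checks
    -- scale x = sqrt (abs x) -- so does this
    -- scale _ = 0 / 0        -- and even NaN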
Even if you could, you probably wouldn't want to make any change a breaking change by exposing implementation details.
* built a codegen for Idris2 and a Rust RT (a parallel stack "typed" VM)
* a full application in Elm, while asking it to borrow ideas from dependent types to make it "correct-by-construction", use zippers for some data structures… etc. And it worked!
* While I was at it, I rebuilt Elm in Idris2, improving on the rendering part (this is WIP)
* data collators and iterators to handle some ML training runs, with pausing features so that I can just Ctrl-C and resume if needed/possible/it makes sense.
* etc.
In the end I had to completely rewrite some parts, but I would say 90% of the boring work was done correctly and I only had to focus on the interesting bits.
However, it didn’t deliver the kind of thorough prep work a painter would do before painting a house. It simply did exactly what I asked, meaning it did the painting and no more.
I keep seeing folks who say they’ve built a “full application” using or deeply collaborating with an LLM but, aside from code that is only purported to be LLM-generated, I’ve yet to see any evidence that I can consider non-trivial. Show me the chat sessions that produced these “full applications”.
An application powered by a single source file consisting of only 400 lines of code is, by my definition, trivial. My needs are more complex than that, and I’d expect that the vast majority of folks who are trying to build or maintain production-quality, revenue-generating code have the same or similar needs.
Again, please don’t be offended; what you’re doing is great and I dearly appreciate you sharing your experience! Just be aware that the stuff you’re demonstrating isn’t (hasn’t been, for me at least) capable of producing the kind of complexity I need while using the languages and tooling required in my environment. In other words, while everything of yours I’ve seen has intellectual and perhaps even monetary value, that doesn’t mean your examples or strategies work for all use-cases.
LLMs are restricted by their output limits. Most LLMs can output 4096 tokens - the new Claude 3.5 Sonnet release this week brings that up to 8192.
As such, for one-shot apps like these there's a strict limit to how much you can get done purely through prompting in a single session.
I work on plenty of larger projects with lots of LLM assistance, but for those I'm using LLMs to write individual functions or classes or templates - not for larger chunks of functionality.
> As such, for one-shot apps like these there's a strict limit to how much you can get done purely through prompting in a single session.
That’s an important detail that is (intentionally?) overlooked by the marketing of these tools. With a human collaborator, I don’t have to worry much about keeping collab sessions short—and humans are dramatically better at remembering the context of our previous sessions.
> I work on plenty of larger projects with lots of LLM assistance, but for those I'm using LLMs to write individual functions or classes or templates - not for larger chunks of functionality.
Good to know. For the larger projects where you use the models as an assistant only, do the models “know” about the rest of the project’s code/design through some sort of RAG or do you just ask a model to write a given function and then manually (or through continued prompting in a given session) modify the resulting code to fit correctly within the project?
There are systems that can do RAG for you - GitHub Copilot and Cursor for example - but I mostly just paste exactly what I want the model to know into a prompt.
In my experience, effective LLM usage mostly comes down to carefully designing the contents of the context.
My experience with Copilot (which is admittedly a few months outdated; I never tried Cursor but will soon) shows that it’s really good at inline completion and producing boilerplate for me but pretty bad at understanding or even recognizing the existence of scaffolding and business logic already present in my projects.
> but I mostly just paste exactly what I want the model to know into a prompt.
Does this include the work you do on your larger projects? Do those larger projects fit entirely within the context window? If not, without RAG, how do you effectively prompt a model to recognize or know about the relevant context of larger projects?
For example, say I have a class file that includes dozens of imports from other parts of the project. If I ask the model to add a method that should rely upon other components of the project, how does the model know what’s important without RAG? Do I just enumerate every possible relevant import and include a summary of their purpose? That seems excessively burdensome given the purported capabilities of these models. It also seems unlikely to result in reasonable code unless I explicitly document each callable method’s signature and purpose.
For what it’s worth, I know I’ve been pretty skeptical during our conversations but I really appreciate your feedback and the work you’ve been doing; it’s helping me recognize both the limitations of my own knowledge and the limitations of what I should reasonably expect from the models. Thank you, again.
Yes, I paste stuff in from larger projects all the time.
I'm very selective about what I give them. For example, if I'm working on a Django project I'll paste in just the Django ORM models for the part of the codebase I'm working on - that's enough for it to spit out forms and views and templates, it doesn't need to know about other parts of the codebase.
Another trick I sometimes use is Claude Projects, which allow you to paste up to 200,000 tokens into persistent context for a model. That's enough to fit a LOT of code, so I've occasionally dumped my entire codebase (using my https://github.com/simonw/files-to-prompt/ tool) in there, or selected pieces that are important like the model and URL definitions.
That's the approach I'm taking with https://www.flyde.dev.
Build the higher-level abstraction that allows LLMs to shine and us humans to enjoy a more effective way to view and control their output.
I know it's not what you meant, but tbh I don't think going deeper than Haskell on type-system power is the route to achieve mass adoption.
> With a sufficiently advanced type system, you could lay out high level building blocks, such as function definitions, and let the LLM go wild, and then if things compile there's a high chance it's correct. Or maybe even a formal proof things are correct.
A formal proof is simply a type that allows only the correct implementation at the term level. And specifying this takes ages, as in hundreds of times longer than just writing it out. So you want to use an LLM to save this 1% of time after you've done all the hard work of specifying what the only correct implementation is?
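As a rough illustration of that cost, here's a hedged Haskell sketch (GADTs and type families, names my own): even just forcing append to preserve list lengths needs more type-level scaffolding than the two-line function it constrains, and a spec strong enough to pin down the one correct implementation would need far more.

    {-# LANGUAGE DataKinds, GADTs, KindSignatures, TypeFamilies #-}

    -- Type-level naturals and length-indexed vectors.
    data Nat = Z | S Nat

    data Vec (n :: Nat) a where
      VNil  :: Vec 'Z a
      VCons :: a -> Vec n a -> Vec ('S n) a

    type family Add (n :: Nat) (m :: Nat) :: Nat where
      Add 'Z     m = m
      Add ('S n) m = 'S (Add n m)

    -- The length index rules out implementations that drop or duplicate
    -- elements, but all the machinery above exists only to state that property.
    append :: Vec n a -> Vec m a -> Vec (Add n m) a
    append VNil         ys = ys
    append (VCons x xs) ys = VCons x (append xs ys)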
I'm not sure how much things have changed, but I tried to use GPT-4 when it first came out to build a library on top of Scala + Shapeless and it utterly failed. Unless you can somehow wire in an LSP per language as an agent and have it work through the type errors as it tries to create code, I can't see how we'll ever get to a place where LLMs can work with strongly typed languages and produce compliant code.
Even with the aforementioned "have an LSP agent work through type errors", it may be faster to just do it yourself than to wait for an LLM to spit out what may be correct.
Oh, I’ve found Claude 3.5 to be better but still pointless. To be specific, it generates code that does what I ask (roughly) but…
- Has obvious bugs, many of which only show up at runtime.
- Has subtle bugs.
- Is inefficient.
All of which it will generally fix when asked. But how useful is it if I need to know all the problems with its code beforehand? Then it responds with the same-ish wrong answer the next time.
Claude 3.5 is good, however its "type comprehension" is really basic. That was my experience with it when using Rust. It still can't create an internal mental model of the types and how to link them together. It'll also, at some point, start to heavily hallucinate functions and the like.
Glad to know I’m not the only one who has had difficulties getting reasonable Scala code from an LLM. While I agree that LSP integration would likely help with generating or debugging production-quality code, I’d also suggest that the need for an LSP demonstrates these models’ inability to reason—at least about code. Perhaps they were trained more deeply and carefully on the marketing or “getting started” pages for the libraries and languages I need to use and not as much on examples of production-quality code repositories that make use of those languages/libs/tools.