Did Semgrep Just Get a Lot More Interesting? (fly.io)
190 points by ghuntley 5 months ago | 86 comments


How are people collaborating on code when using AI tools to generate patches?

We hold code review dear as a tool to make sure more than one set of eyeballs has been over a change before it goes into production, and more than one person has the context behind the code to be able to fix it in future.

As model-generated code becomes the norm, I'm seeing code from junior engineers that they haven't read and possibly don't understand. For example, one Python script calling another using exec instead of importing it as a module, or writing code that is already available as a very common part of the standard library.
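
To make the exec example concrete, this is roughly the shape of the thing I keep seeing (script and function names are made up):

    # What the generated code did: run another script as raw text.
    with open("cleanup.py") as f:
        exec(f.read())  # runs in the current namespace; untestable, nothing reusable

    # What importing it as a module looks like, assuming cleanup.py exposes a function.
    import cleanup
    cleanup.run()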

In such cases, are we asking people to mark their code as auto generated? Should we review their prompts instead of the code? Should we require the prompt to code step be deterministic? Should we see their entire prompt context and not just the prompt they used to build the finished patch?

I feel like a lot of the value of code review is to bring junior engineers up to higher levels. To that extent each review feels like an end of week school test, and I’m getting handed plagiarised AI slop to mark instead of something that maps properly to what the student does or does not know.

Pair programming is another great teaching tool. Soon, it might be the only one left.


AI is a tool, not a solution. If someone uses it to write code they don't understand and that is flawed, that should never pass code review.

If your AI-generated code passes code review without any questions asked and without any hints that it's AI generated (or that AI was used in some way), then it doesn't matter that it was used.

The person submitting the code is still responsible, and the reviewer is equally responsible. You typically have tests to make sure it behaved correctly, too.


That's beside the point; we all know that one terrible engineer who is incredibly productive and completely lacks any self-critical instincts. With LLMs this type of output becomes much easier to produce, so what are you going to do if you're asked to review 10x the amount of code that you're reviewing right now?


Something similar happened at my job. I reported an over-engineered, completely senseless solution after the "dev" got really angry at me for simply reviewing his code. He was gone the next week. If you have a boss, tell them when this happens. If you don't, fire the person yourself. If the amount of complexity introduced isn't worth the use cases it covers, in a business sense, then that alone is grounds for the review to reject the code.


Fire the bad engineer.


If they're a net negative to the team, their manager needs to have a performance conversation with them.


That level of submission sounds like a "reject on sight; go back and do it again without AI, and if you need help then ask" to me.


They can pull the code out of a magical oracle somewhere in the misty mountains for all I care, as long as they can tell me what the code does and why it was placed there. If they can't, it's rejected. Use whatever tool you like, but the buck stops with you.


My problem is a co-worker has put an OK-ish wooden toy train in front of me, presenting it as their own work, so I take some time to show them how to hold the plane for a better finish on the grain, how to pick timber so that the knots face inward, and how contrasting stock takes up wood stain in different ways, highlighting the elements of the piece.

They nod politely, go back to Ali Express and pick a slightly different toy train to present as their own, commit it, and move on.

Too romantic? Quite probably.


At some point that employee has to be shown the door, because you can order from Ali Express yourself; you don't need a well-paid employee who provides no additional value to do it for you.


Someone has yet to discover how offshoring works in practice.

This is only one step further.

Management not only doesn't care, they welcome the employee doing their Ali Express purchase, while saving their own operational costs.


Then you get pushback from management: we don't have time for the perfect solution, if it works, it's good.


Pathologically short-term thinking from management is a problem that far predates AI (computers, even), and the solution hasn't changed: go somewhere better and watch the old place burn.


Can we replace management with AI?


I've heard of multiple "let's build an AI to show who to fire to increase efficiency" projects getting canceled after the AI inevitably points towards upper management, which then cans the project because they can't spin it to have themselves excluded from the analysis.


We all keep making this joke, and one day someone’s going to try it. I have a feeling it’ll be moderately successful too.


Not yet.

Give it another year.


I do believe my code review of a junior developer submitting "AI slop without reviewing or understanding it" would be fairly blunt:

"If you are just taking AI generated content verbatim without code reviewing it and understanding it, you are providing no value over just having the AI do the work directly. If you are providing no value over AI doing your job, the company can replace you with AI."

I'm wondering if the junior employee just isn't educated enough to be using the AI, like they don't know Python at all so they don't understand the difference between "exec" and "import"? But I'm also inclined to think that they should be using it to learn, if the goal of the company and the employee is to move from junior to mid to senior. Like ask another model "can you improve this code", "does this code make sense", "can you explain this code to me", and anything in the code you don't understand, research and learn.

But there are employees who are just going to grind it out, doing "just enough" to get something working, and it can at times be hard to figure out why they are there and why they aren't working towards more. It could be as complex as physical or mental struggles that put "just enough to get it working" at the far edge of their abilities; maybe that literally is the most they can do.

How do you learn that and adjust so they can excel within their constraints?


They are collaborating by making sure they’ve read and understood the code that was generated, and usually edited it, too.

Sometimes, when e.g. making an ad-hoc visualization or debugging tool, one may emphasize that they just had AI generate it and didn’t read into the details. Occasionally it makes sense.

But if someone is making PRs without disclosing their lack of understanding of them because of generating most of it and not making sure everything is correct, that seems like a cultural issue, primarily.

But I suppose you can start having such people walk you through their PRs, which should at least reveal their lack of understanding, if it’s in fact the case.

Point is, this is imo not an experience inherent to LLM usage, and there’s also not much point reviewing LLM prompts because of the strong non-determinism involved.


Anecdotally, I frequently dig into source code where the stack trace points me, look up functions, debug in local envs, etc. meanwhile my coworker is working on the same problem and talking to an LLM and I often get to the solution before he does. I don't think I've had my John Henry moment quite yet.


> But I just checked and, unsurprisingly, 4o seems to do reasonably well at generating Semgrep rules? Like: I have no idea if this rule is actually any good. But it looks like a Semgrep rule?

I don't know about semgrep syntax, but the chat it generated is bad in at least a couple other ways. E.g. their "how to fix" instruction is wrong:

    if let Some(Load::Local(load)) = self.load.read().get(...) {
        // do a bunch of stuff with `load`
    } else {
        drop(self.load.read()); // Explicitly drop before taking write lock
        let mut w = self.load.write();
        self.init_for(&w);
    }
That actually acquires and then drops a second read lock. It doesn't solve the problem that the first read lock (the temporary guard from the `if let` scrutinee, which lives through the whole if/else) is still active, and thus the write lock will deadlock.

Speaking of which, acquiring two read locks from the same thread can also deadlock, as shown in the "Potential deadlock example" at <https://doc.rust-lang.org/std/sync/struct.RwLock.html>. It can happen in the code above (one line before the other deadlock). It can also slip through their rule because they're incorrectly looking for just a write lock in the else block.

I've been playing with AI code generation tools like everyone else, and they are okay as autocomplete, but I don't see them as trustworthy. For a while I thought I just wasn't prompting well enough, but when other people show me their AI output, I can see it's wrong, so maybe I'm just looking more closely?


> But I just checked and, unsurprisingly, 4o seems to do reasonably well at generating Semgrep rules? Like: I have no idea if this rule is actually any good. But it looks like a Semgrep rule?

This is the thing with LLMs. When you’re not an expert, the output always looks incredible.

It’s similar to the fluency paradox — if you’re not native in a language, anyone you hear speak it at a higher level than yourself appears to be fluent to you. Even if for example they’re actually just a beginner.

The problem with LLMs is that they’re very good at appearing to speak “a language” at a higher level than you, even if they totally aren’t.


Hold on, hold on. You're missing a step here.

I agree completely that an LLM's first attempt to write a Semgrep rule is likely as not to be horseshit. That's true of everything an LLM generates. But I'm talking about closed-loop LLM code generation. Unlike legal arguments and medical diagnoses, you can hook an LLM up to an execution environment and let it see what happens when the code it generates runs. It then iterates, until it has something that works.

Which, when you think about it, is how a lot of human-generated code gets written too.

So my thesis here does not depend on LLMs getting things right the first time, or without assistance.


The problem is what one means by "works". Is it just that it runs without triggering exceptions here and there?

One has to know, and understand, what the code is supposed to be doing, to evaluate it. Or use tests.

But LLMs love to lie so they can't be trusted to write the tests, or even to report how the code they wrote passed the tests.

In my experience the way to use LLMs for coding is exactly the opposite: the user should already have very good knowledge of the problem domain as well as the language used, and just needs to have a conversation with someone on how to approach a specific implementation detail (or help with an obscure syntax quirk). Then LLMs can be very useful.

But having them directly output code for things one doesn't know, in a language one doesn't know either, hoping they will magically solve the problem by iterating in "closed loops", will result in chaos.


It clearly does not result in chaos. This is an "I believe my lying eyes" situation, where I can just see that I can get an agent-y LLM codegen setup to generate a sane-looking working app in a language I'm not fluent in.

The thing everyone thinks about with LLM codegen is hallucination. The biggest problem for LLMs with hallucination is that there are no guardrails; it can just say whatever. But an execution environment provides a ground truth: code works or it doesn't, a handler path generates an exception or it doesn't, a lint rule either compiles and generates workable output or it doesn't.


> code works or it doesn't

It seems you're deliberately confusing "works" with "runs". They're different things.


That's also the problem with these conversations. Some people evaluate zero-shot prompted code oozing out of gpt-3.5; others plug Sonnet into an IDE with access to terminal, LSP, diagnostics, etc., crunching through a problem in an agentic self-improvement loop. Those two approaches will generate very different quality levels of code.


An LLM, though, doesn't truly understand the goal, AND it frequently gets into circular loops it can't get out of when the solution escapes its capability, rather than asking for help. Hopefully that will get fixed, but some of this stuff is an architectural problem rather than something solved by just iterating on the transformer idea.


That's totally true, but it's also a small amount of Python code in the agent scaffolding to ensure that it bails on those kinds of loops. Meanwhile, for something like Semgrep, the status quo ante was essentially no Semgrep rules getting written at all (I believe the modal Semgrep user just subscribes to existing rule repositories). If a closed-loop LLM setup can successfully generate Semgrep rules for bug patterns even 5% of the time, that is a material win, and a win that comes at very little cost.
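
For illustration, that scaffolding is little more than a loop like the one below. `generate` is a stand-in for whatever model call you're making, and I'm running a Python candidate for concreteness; for the Semgrep case, the run step would invoke semgrep against test code instead.

    # Sketch of a closed loop with a bail-out; not any particular framework's API.
    import subprocess
    import tempfile

    MAX_ATTEMPTS = 5

    def run_candidate(code: str) -> tuple[bool, str]:
        """Execute a generated candidate and report (success, combined output)."""
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(code)
            path = f.name
        proc = subprocess.run(["python3", path], capture_output=True, text=True, timeout=30)
        return proc.returncode == 0, proc.stdout + proc.stderr

    def closed_loop(generate, task: str) -> str | None:
        feedback = ""
        for _ in range(MAX_ATTEMPTS):  # bail instead of circling forever
            code = generate(task + "\n" + feedback)
            ok, output = run_candidate(code)
            if ok:
                return code
            feedback = "The previous attempt failed with:\n" + output
        return None  # give up and hand it back to a human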


Yeah, I more or less agree about the closed loop part and the overall broader point the article was making in this context — that it may be a useful use case. I think it’s likely that process creates a lot of horseshit that passes through the process, but that might still be better than nothing for semgrep rules.

I only came down hard on that quote out of context because it felt somewhat standalone and I want to broadcast this “fluency paradox” point a bit louder because I keep running into people who really need to hear it.

I know you know what’s up.


DSLs like Semgrep are one of my top use-cases for LLMs generally.

It used to be that tools like Semgrep and jq and Tree Sitter and zsh all required you to learn quite a bit of syntax before you could start using them productively.

Thanks to LLMs you can focus on learning what they can do for you without also having to learn the fiddly syntax.


Over and over again I have witnessed people just drive themselves in circles downstream of a refusal to just step out of the “make the LLM fix the issue for me” loop.

At some point the syntax and the specifics matter! * and + have different meanings in a regex. Overly-specified LLM output is worth trimming down to what your problem actually needed.

I appreciate LLM output being able to draft out sketches of results (and yes, a lot of the time, getting exactly the right result). And it's great as a learning tool (especially if you're diligent in the trust + verify department). But I worry that people are not taking opportunities to sit down and actually use the output as the sketch, and to insert the sort of precision that comes from the "infinite context" of the human working on the problem. Devs can't just decide to opt out of getting into the details, IMO.


> Over and over again I have witnessed people just drive themselves in circles downstream of a refusal to just step out of the “make the LLM fix the issue for me” loop.

You're reminding me of a 1997 talk by an American Airlines trainer about automation dependency and when dropping down a level of automation reduces workload: <https://www.youtube.com/watch?v=WITLR_qSPXk>


Cool talk; respect to him for making the point in 20 minutes. There are a lot of levels of applying this when programming, not just with LLM usage.


> Over and over again I have witnessed people just drive themselves in circles downstream of a refusal to just step out of the “make the LLM fix the issue for me” loop.

I think that's an amazing encapsulation of my experience with LLMs.

They can be great tools, but you need to be extremely willing to throw away their output and start over instead of trying to incrementally fix what they give you.


I kind of agree. I’ve had very mixed experiences with LLMs and DSLs.

I was writing an NRQL query (New Relic’s log query language) and wanted to essentially do a GROUP BY date_trunc. It kept giving me options that I was eager for, and then the functions it gave me just didn’t exist. After like four back and forths of me telling it that the functions it was giving me didn’t exist - it worked.

Then I needed it to split on the second forward slash of a string and just give me the first piece. It gave me the foundation to fill in the gaps of the function, but the LLM never got it.

In that case, I assume it’s a lack of training data since NRQL is pretty niche.

I catch myself swinging from “holy shit this is impressive” to “wow this sucks” and back regularly for code.


This is similar to my experience with LLMs and DSLs. They tend to hallucinate functions that will magically work in the situation you are describing. My pet theory here is that they are fooled by the many forum posts/issues asking "why doesn't a function exist in this DSL called ABC that does this?"


Did you try asking the LLM to write the missing functions? It might work in some cases.


In that case putting NRQL reference into context would work wonders


Oh goodness I’m a fool to not have tried that. I’ll absolutely give that a go next time!


NRQL's abandonment of GROUP BY for FACET always throws me.


LLM context windows are quite large now. This is very likely a simple matter of including NRQL docs or specification in the prompt.


Yes, the same applies to many niche syntaxes: Influx's Flux language (I was finally able to design my dream Grafana dashboards!) and AutoHotkey (AHK) for Windows automation are just two examples.


I love using LLMs to generate PlantUML diagrams


What other tool do you use? I guess you want an svg or png output at some point?


If you mean "I have a .puml and need .svg or .png", the tool does that natively, as do most online versions (this one is their example https://www.plantuml.com/plantuml/uml/SyfFKj2rKt3CoKnELR1Io4... and there are PNG and SVG links under the textarea). It also works offline https://plantuml.com/command-line#:~:text=Types%20of%20Outpu...


I didn’t know plantuml.com existed. Thanks!


I had an absolutely terrible time today trying to get chatgpt to write me a very simple awk oneliner. It “thought” I was specifying a much more complicated requirement than I actually was.


"ChatGPT" can mean many things... If you meant the free 4o-mini model, then yes, this outcome is not surprising, since it is bad at basically anything related to code.

If you meant the more powerful o1 or o3-mini models (great naming, openai...), then that outcome would be surprising.


O1 Reasoning. You know, the feature you used to have to pay $200/mo for.


o1 never cost $200/mo. That was o1-pro, which still isn’t available to Plus or Free users.


Whatever it is, it's the latest model, and the hype is just not real.


Exactly the same for me: the major breakout success was having ChatGPT teach me how to use pandas. It really shone as an interactive manual with worked examples.

Taking the training wheels off isn't something I've really nailed though: for example, I keep coming back with the same questions about how to melt and pivot. I can self-diagnose that as showing I didn't really spend enough time understanding the answers the first time around.
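
For my own future reference as much as anything, the melt/pivot round trip I keep re-asking about is roughly this (column names invented):

    import pandas as pd

    wide = pd.DataFrame({"city": ["Oslo", "Lima"], "jan": [3, 22], "feb": [4, 23]})

    # Wide -> long: one row per (city, month) pair.
    long = wide.melt(id_vars="city", var_name="month", value_name="temp")

    # Long -> wide again: months back out into columns.
    back = long.pivot(index="city", columns="month", values="temp").reset_index()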


To be fair, that also could be because the pandas API is…not great. When I used to write lots of Python and pandas I basically always had the pandas docs open; intuitive and memorable it is not.


I am reminded of this IMO timeless classic:

https://news.ycombinator.com/item?id=5397797

A short snippet (the whole thing is very funny and, interestingly, was written in 2013, long before the modern AI craze):

By now I had started moving on to doing my own consulting work, but I never disabled the hill-climbing algorithm. I'd closed and forgotten about the Amazon account, had no idea what the password to the free vps was anymore, and simply appreciated the free money.

But there was a time bomb. That hill climbing algorithm would fudge variables left and right. To avoid local maxima, it would sometimes try something very different.

One day it decided to stop paying me.

Its reviews did not suffer. It's balance increased. So it said, great change, let's keep it. It now has over $28,000 of my money, is not answering my mail, and we have been locked in an equity battle over the past 18 months.

The worst part is that I still have to clean up all its answers to protect our reputation. Who's running who anyway?


I think an even more interesting use case for semgrep, and also LSP or something like LSP, is querying for exactly what an AI needs to know to fix a specific problem.

Unlike humans, LLMs have no memory, so they can't just learn where things are in the code by remembering the work they did in the past. In a way, they need to re-learn the relevant parts of your codebase from scratch on every change, always keeping context window limitations in mind.

Humans learn by scrolling and clicking around and remembering what's important to them; LLMs can't do that. We try to give them autogenerated codebase maps and tools that can inject specific files into the context window, but that doesn't seem to be nearly enough. Semantic queries look like a much better idea.

I thought you couldn't really teach an LLM how to use something like that effectively, as that's not how humans work and there's no data to train on, but the recent breakthroughs with RL made me change my mind.
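
To make "semantic queries" concrete, what I'm imagining is not much more than a thin wrapper over semgrep's JSON output that drops matches into the context window. The CLI flags below are real; the JSON field names are from memory, so treat them as assumptions to check against actual `semgrep --json` output.

    import json
    import subprocess

    def semgrep_matches(pattern: str, lang: str, path: str) -> list[dict]:
        """Run one semgrep pattern and return lightweight match records for a prompt."""
        proc = subprocess.run(
            ["semgrep", "--json", "-e", pattern, "--lang", lang, path],
            capture_output=True, text=True, check=True,
        )
        results = json.loads(proc.stdout).get("results", [])
        return [
            {
                "file": r["path"],  # field names assumed, verify locally
                "line": r["start"]["line"],
                "snippet": r.get("extra", {}).get("lines", ""),
            }
            for r in results
        ]

    # e.g. pull every read() call on a lock into the model's context:
    # semgrep_matches("$LOCK.read()", "rust", "src/")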


> LLMs have no memory, so they can't just learn where things are in the code by remembering the work they did in the past.

Isn't this what Google Titan is trying to achieve?


OK, hear me out. The future isn’t o4 or whatever. The future is when everyone, every language, every tool, every single library and codebase can train their own custom model tailored to their needs and acting as a smart documentation which you can tell what you want to do and it will tell you how to do it.

People have been trying with fine tuning, RAG, using the context window. That’s not enough. The model needs to be trained on countless examples of question-answer for this particular area of knowledge starting from a base model aware of comp sci concepts and language (just English is fine). This implies that such examples have to be created by humans - each such community will need its own „Stack Overflow”.

Smaller, specialized models are the future of productivity. But of course that can’t be monetized, right? Well, the technology just needs to get cheaper so that people can just afford to train such models themselves. That’s the next major breakthrough. Could be anyway.


Love the illustrator. And love linking out and supporting her.


That's Annie Ruygt. She predates me at Fly.io.


Yeah I just spent some time going through the blog and looking at them all, she's really great. Need art.fly.io plz, would love a spot to look at them all together. :)


In the age of generative AI, handmade art stands out all the more.


I've built a solution that takes you most of the way there, using Semgrep's SARIF output and prompted LLMs to help prioritize triage.

We’ve used this for the past year at Microsoft to help prioritize the “most likely interesting” 5% of a large set of results for human triage. It works quite well…

https://github.com/247arjun/ai-secure-code-review


LOL, but I can't help but think about the comment from tptacek [1].

>"We wrote all sorts of stuff this week and this is what gets to the front page. :P"

And how they write content specifically for HN [2].

[1] https://news.ycombinator.com/item?id=43053985

[2] https://fly.io/blog/a-blog-if-kept/


I've been trying to do something similar to create CodeQL queries recently, and found that ChatGPT is completely unable to create even simple queries. I assume it's because the training data is based on an old version of the query language or is just missing entirely, but being able to feed it the rules and the errors they produce when run has been a complete failure for me.


Take a large-context frontier model. Upload 200k tokens of code for each query. Ask about what code pattern you want it to highlight for you. Works better than any other system, but costs tokens on API services.


So the idea is that LLM1 looks at the output of LLM0 and builds a new set of constraints, and then LLM0 has to try again, rinse and repeat? (LLM0 could be the same as LLM1, and I think it is in the article?)


That's Devin / replit agent

Not there yet but it is inevitable


I think the author is missing one part about cursor, aider, etc.

Out of the box it is decent.

Just watching the basic optimizations that developers on YouTube are doing prior to starting a project puts the experience and consistency at a far higher level.

Maybe this casual, surface-level testing, if I'm not misreading, is why so many tech people are missing what tools like Cursor, Aider, etc. are doing.


> What interests me is this: it seems obvious that we’re going to do more and more “closed-loop” LLM agent code generation stuff. By “closed loop”, I mean that the thingy that generates code is going to get to run the code and watch what happens when it’s interacted with.

Well, at least we have a credible pathway into the Terminator or Matrix universes now...


I'm quite surprised "autofix" functionality wasn't mentioned.

https://semgrep.dev/docs/writing-rules/autofix

Seems like the natural thing to do for cases that support it.


'closed loop' concept in here is important

the point that a unit of code is a thing that is maintained, rather than a thing that is generated once, is where codegen has always lost me

(both AI codegen and ruby-on-rails boilerplate generators)

iterative improvement, including factoring useful things out to standard libraries, is where it's at


I just tried my latest task with it and o1 readily hallucinated non-existent semgrep functions.


I wrote a tool for rewriting semgrep matches using an LLM https://github.com/icholy/semgrepx


"Generate patterns for language X and framework Y which can lead to vulnerability V, generate Semgrep/Joern rule for it" longest chats with ChatGPT.


I have a closed loop coding agent working here, you can try it out: https://mycoder.ai


r2c / semgrep has truly come a long way since its incubation at Facebook: https://github.com/facebookarchive/pfff

I remember using Soot, kythe.io, & pfff to find the exact CTS (compatibility test suite) tests to run given a code diff between two AOSP builds.


I have to ask if this Semgrep rule for relock bugs is public, because the first Google hit for me is this blog.


> But I’m burying the lead.

It's "lede". There are a few other typos too.

I'm not sure I like the "This one trick they don't want you to know about!" writing style of these (e.g. the Cursor/malpractice hot take, that sort of thing).


I like it! It's provocative and opinionated, just what I want out of a blog. Blogs are casual, personal, subjective. I read a lot of blogs like this and appreciate being able to take in many broad opinions, many of which yeah I think are dumb, but I like to have them all bounce around my head. Gives me a window into how people are thinking across the community


Oh yeah, the provocative opinion "If you're not using AI you're missing out. It is the future!"

Never heard that one before.


It's "lede" if you're showing off or writing for a publication, where they deliberately search for "lede" and "hed" and "tk", which exist specifically to stick out from ordinary text.


I love how fickle HN is. This is the etymological equivalent of saying "it's expresso, not espresso" (espresso -> expresso, lead -> lede both happened in the 50s) and yet claiming that would be highly controversial.


> makes me think that more of the future of our field belongs to people who figure out how to use these weird bags of model weights than any of us are comfortable with.

Until you find a way to improve self-guided training, no, this will never happen. New things get invented and need to be implemented before your "bag of weights" has any idea how to approach them, which it does, of course, by simply stealing something that already existed.

People who think this way blow my mind. Is it that you don't actually like your day job and dream about having a machine do it for you while, somehow, still earning the salary you currently command?

Laughable.


I'd put a $1k long bet that a 3B param model, cleverly orchestrated, will achieve AGI* in the next ten years. These are the sorts of ideas that would help get us there.

Any takers?

*AGI defined as smarter than a FAANG staff engineer on similar tasks.


I liked the original article - since I've looked at semgrep, and I'm also hoping "closing the loop" can fix some of the downsides of LLMs

I'm also willing to bet money, and I'd even thought of a challenge for 10x or 20x that amount

But if you want to bet, then you have to have something well-defined and interesting to bet on:

- leave out the term "AGI" - this only confuses things, because everyone has a different definition of it.

Just say what the problem is, precisely

- leave out "FAANG staff engineer". Because computers are already better than staff engineers on dozens and dozens of tasks, like adding two 32 bit numbers, or compiling C++ code, or running Python code. Not to mention certain things involving LLMs.

i.e. it's extremely obvious that LLMs are better than engineers at certain things -- the ones they choose to use LLMs for. That doesn't mean LLMs can replace them, which is often what people mean by "AGI".





