The KPI problem is systemic and bigger than just Gen-AI; it's in everything these days. Actual governance starts with being explicit about business value.
If you can’t state what a thing is supposed to deliver (and how it will be measured) you don’t have a strategy, only a bunch of activity.
For some reason, over the last decade or so, we have confused activity with productivity.
(and words/claims with company value - but that's another topic)
I find it interesting that this thread is full of pragmatic posts that seem to honestly reflect the real limits of current Gen-AI.
Versus other threads (here on HN, and especially on places like LinkedIn) where it's "I set up a pipeline and some agents and now I type two sentences and amazing technology comes out in 5 minutes that would have taken 3 devs 6 months to do".
I actually enjoy writing specifications. So much so that I made it a large part of my consulting work for most of my career. So it makes sense that working with Gen-AI that way is enjoyable for me.
The more detailed I am in breaking down chunks, the easier it is for me to verify, and the more likely I am to get output that isn't 30% wrong.
> "The biggest issue I see is Microsoft's entire mentality around AI adoption that focuses more on "getting the numbers up" then actually delivering a product people want to use."
That succinctly describes 90% of the economy right now if you just change a word and remove a couple:
The biggest issue I see is the entire mentality that focuses more on "getting the numbers up" than actually delivering a product people want to use.
KPI infection. You see projects whose goal is, say, "repos with AI code review turned on" vs "code review suggestions that were accepted". And then if you do get adoption (like, say, a Claude Code trial), VPs balk at the price. If it's expensive now, it's because people are actually using it all the time!
The same kind of logic that led companies to migrate from Slack to Teams. Metrics that never look at actual, positive impact: nobody picks a risky KPI when they can pick a useless one that can't miss.
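To make that contrast concrete, here's a hypothetical sketch (all names and numbers invented) of the two kinds of KPI:

```rust
// Hypothetical sketch: an adoption KPI that can't miss vs. an outcome KPI that can.
struct Repo {
    ai_review_enabled: bool,
    suggestions_made: u32,
    suggestions_accepted: u32,
}

// "Repos with AI code review turned on": climbs the moment someone flips a flag,
// whether or not anyone reads the output.
fn adoption_rate(repos: &[Repo]) -> f64 {
    let enabled = repos.iter().filter(|r| r.ai_review_enabled).count();
    enabled as f64 / repos.len() as f64
}

// "Code review suggestions that were accepted": only moves if the tool is useful,
// which is exactly why nobody volunteers to be measured on it.
fn acceptance_rate(repos: &[Repo]) -> f64 {
    let (made, accepted) = repos
        .iter()
        .fold((0u32, 0u32), |(m, a), r| (m + r.suggestions_made, a + r.suggestions_accepted));
    if made == 0 { 0.0 } else { accepted as f64 / made as f64 }
}

fn main() {
    let repos = vec![
        Repo { ai_review_enabled: true, suggestions_made: 100, suggestions_accepted: 5 },
        Repo { ai_review_enabled: true, suggestions_made: 40, suggestions_accepted: 30 },
        Repo { ai_review_enabled: false, suggestions_made: 0, suggestions_accepted: 0 },
    ];
    println!("adoption: {:.0}%", 100.0 * adoption_rate(&repos));     // 67%, looks great
    println!("acceptance: {:.0}%", 100.0 * acceptance_rate(&repos)); // 25%, the real story
}
```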
"The company can be held vicariously liable" means that in this analogy, the company represents the human who used AI inappropriately, and the employee represents the AI model that did something it wasn't directly told to do.
Nobody tries to jail the automobile when it hits a pedestrian while on cruise control. The driver is responsible for knowing the limits of the tool and adjusting accordingly.
Can you help me understand where you are coming from? Is it that you think the benchmark is flawed or overly harsh? Or that you interpret the tone as blaming AI for failing a task that is inherently tricky or poorly specified?
My takeaway was more "maybe AI coding assistants today aren’t yet good at this specific, realistic engineering task"....
In my experience many OTEL libraries are awful to use, and most of the "official" ones are the worst offenders, as they are largely codegened. That typically makes them feel clunky, and they exhibit code patterns that are non-native to the language in question, which would be one explanation of why AI systems struggle with the benchmark.
I think you would see similar results if tasking an AI to e.g. write gRPC/Protobuf systems using only the built-in/official protobuf codegen for each language.
Where I think the benchmark is quite fair is in the solutions. It looks like for each of the languages (at least the ones I'm familiar with), the "better" options were chosen, e.g. using `tracing-opentelemetry` rather than `opentelemetry-sdk` directly in Rust.
However, the one-shot nature of the benchmark also isn't that reflective of actual utility. In my experience, if you have the initial framework setup done in your repo plus a handful of examples, these tools do a great job of applying OTEL tracing to the majority of your project.
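For example, a rough sketch of the `tracing-opentelemetry` route in Rust; the opentelemetry crates churn their APIs between versions, so treat the exact names here (taken from recent `opentelemetry-sdk` / `opentelemetry-stdout` releases) as approximate:

```rust
// Rough sketch, modulo version churn: bridge idiomatic `tracing` spans into OTEL
// instead of driving opentelemetry-sdk by hand.
use opentelemetry::trace::TracerProvider as _; // trait for provider.tracer(...)
use tracing_subscriber::layer::SubscriberExt;  // for Registry::with(...)

fn main() {
    // Stdout exporter just for illustration; a real setup would use OTLP.
    // (Named plain `TracerProvider` in older opentelemetry-sdk releases.)
    let provider = opentelemetry_sdk::trace::SdkTracerProvider::builder()
        .with_simple_exporter(opentelemetry_stdout::SpanExporter::default())
        .build();
    let tracer = provider.tracer("example");

    // The bridge layer: `tracing` macros now emit OTEL spans.
    let telemetry = tracing_opentelemetry::layer().with_tracer(tracer);
    let subscriber = tracing_subscriber::Registry::default().with(telemetry);
    tracing::subscriber::set_global_default(subscriber).expect("set subscriber");

    // Plain, language-native instrumentation; spans flow to the exporter.
    let span = tracing::info_span!("do_work", answer = 42);
    let _guard = span.enter();
}
```

The payoff is that once this setup exists, day-to-day instrumentation is just ordinary `tracing` macros rather than the codegen-flavored SDK surface, which is presumably why the assistants do so much better with a framework and examples already in the repo.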
Where I work, we are looking at the parts of our documentation and implementations that AI has a hard time with.
This almost always correlates with customers having similar issues in getting things working.
This has led us to rewrite a lot of documentation to be more consistent and clear. In addition, we set out a series of examples from simple to complex. This shows up as fewer tickets later, and more complex implementations being set up by customers without the need for support.
I did similar for about 25 years. I had one injury from overtraining (I basically ran 20 miles every Sunday morning for 6 months, in addition to two shorter runs each week) that ended up as plantar fasciitis, and I had to take 4-5 months off.
I stopped doing that sort of weekly long run after that and did a lot more runs in the 6-10 mile range.
Then during and immediately post-COVID shutdowns, I just started running every time I felt stressed about something, and I started to neglect all the other holistic movements that complement running.
This ended up leading to a weird twinge in my hip that 2 years of focused strength training hasn't eliminated. The doctor says there is nothing structural, but I don't run anymore, and I miss it often. There is a flow state I seem to get into somewhere just under to just over an hour into a run.
The only other time I ever get into that wonderful flow state is every once in a while when playing guitar, but it's rare.
There's a great analog with this in chess as well.
~1200 - omg chess is so amazing and hard. this is great.
~1500 - i'm really starting to get it! i can beat most people i know easily. i love studying this complex game!
~1800 - this game really isn't that hard. i can beat most people at the club without trying. really I think the only thing separating me from Kasparov is just a lot of opening prep and study
~2300 - omg this game is so friggin hard. 2600s are on an entirely different plane, let alone a Kasparov or a Carlsen.
Magnus Carlsen - "Wow, I really have no understanding of chess." - Said without irony after playing some game and going over it with a computer on stream. A fairly frequent happening.
IMO both perspectives have their place. Sometimes what's missing is the information, sometimes what's lacking is the ability to communicate it and/or the willingness to understand it. So in different circumstances either viewpoint may be appropriate.
What's missing more often than not, across fields of study as well as levels of education, is the overall commitment to conceptual integrity. From this we observe people's habitual inability or unwillingness to be definite about what their words mean - and their consequent fear of abstraction.
If one is in the habit of using one's set of concepts in the manner of bludgeons, one will find many ways and many reasons to bludgeon another with them - such as if a person turned out to be using concepts as something more akin to clockwork.
Simple counterexample: chess. The rules are simple enough we regularly teach them to young children. There's basically no randomness involved. And yet, the rules taken together form a game complex enough that no human alive can fully comprehend their consequences.
This is actually insightful: we usually don't know the question we are trying to answer. The idea that you can "just" find the right question is naive.
Sure, you can put it this way, with the caveat that reality at large isn't strongly definable.
You can sort of see this with good engineering: half of it is strongly defining a system simple enough to be reasoned about and built up, the other half is making damn sure that the rest of reality can't intrude, violate your assumptions and ruin it all.