SREBench Competition

guessmyname · 2024-08-16T06:15:08 1723788908

For the user registration (if you want to get the Amazon gift card) you send a request like this:

  POST /api/trpc/submitUser?batch=1 HTTP/2.0
  Host: sreben.ch
  Cookie: <COOKIE>
  Trpc-Accept: application/jsonl
  Content-Type: application/json
  Referer: https://sreben.ch/race
  
  {"0":{"name":"<USERNAME>","email":"<EMAIL>","role":"<JOB_TITLE>","company":"<COMPANY>"}}

Then, requests like this will grade your answers:

  POST /api/trpc/gradeOutput?batch=1 HTTP/2
  Host: sreben.ch
  Cookie: <COOKIE>
  Trpc-Accept: application/jsonl
  Content-Type: application/json
  Referer: https://sreben.ch/race
  
  {"0":{"userRootCause":"<YOUR_ANSWER_HERE>","testNumber":<TEST_NUMBER>}}

Specially useful if the “Submit Root Cause” button doesn’t work for you either.

Also, make sure to type the entire error message, e.g. “ERROR Application performance degraded due to CPU throttling” instead of simply “CPU throttling”, otherwise, you’ll get a "partially_correct" grade.

dgl · 2024-08-16T02:27:43 1723775263

Not sure whether there's lots of people trying out commands right now (is it backed by a real k8s cluster?), but some commands are taking over 10 seconds to run. Not really a fair "benchmark" when the system's speed is variable.

I also only got "partially_correct" for some, not sure whether it wanted more detail or just didn't like how I phrased things. Neat though.

                      Success Rate        MTTR (Mean time to Resolution)
  YOU:                50.00 %              1.80 min
  PARITY AI SRE:      70 %                 2 min

At least I'm faster than an AI?

jtsaw · 2024-08-16T02:32:43 1723775563

hmmm, not sure where the latency is coming from, might need to up the resources.

dgl · 2024-08-16T02:38:45 1723775925

Maybe the AI can work it out?

More seriously usually issues where the observed behaviour is "the system is slow" are harder to root cause than complete outages. It depends partly how good your capacity planning is obviously, but maybe an AI could help with that too.

thinkmassive · 2024-08-16T13:22:03 1723814523

If only there were a software engineering discipline that focused on these types of issues /s

whynotkeithberg · 2024-08-16T03:54:20 1723780460

What about your AI not being able to understand answers exactly identical to its own? Or the 3 strawberry user you had to remove that had a 33333% success rate you didn't remove until I said something.

But I think the bigger thing is... Your supposed AI not understanding answers that are literally identical to the ones it considers correct. Pretty weak.

jtsaw · 2024-08-16T04:03:17 1723780997

We're not using our AI Agent to determine if your answers are correct or not. We're just using something off the shelf.

yjftsjthsd-h · 2024-08-16T02:19:51 1723774791

What am I missing? Of course AI is faster than a human; the problem is that I don't trust it to not break things itself.

jtsaw · 2024-08-16T02:27:25 1723775245

yeah that makes sense. While the tech develops, our focus is on building an AI agent that can determine the root cause of an issue, which itself is an import step in fixing things.

freeplay · 2024-08-16T03:08:20 1723777700

I'm wondering how they're determining a correct answer. I know, for sure, that one of my answers was correct but it was marked incorrect. I'm wondering if I need to include specific keywords in my answer? How detailed do I need to be?

whynotkeithberg · 2024-08-16T03:40:06 1723779606

I'm pretty sure that's why some user marking themselves as 3 strawberry emojis accomplished a 3333333.33% success rate.

jtsaw · 2024-08-16T03:09:02 1723777742

try to be as detailed as possible. With text sometimes it's hard to determine how close your answer was to the correct one.

whynotkeithberg · 2024-08-16T03:40:28 1723779628

How did a user with 3 strawberries as their name achieve a 3333333.33% success rate?

EDIT: After asking this instead of responding back to me they just removed it. But it obviously shows how lacking their system is.

wilson090 · 2024-08-16T03:42:59 1723779779

must be the arrival of agi

wilson090 · 2024-08-16T03:59:11 1723780751

the user competition is meant to be a fun side-project that we threw together today, I think it's cool that people hack things like that so quickly :)

nsagent · 2024-08-16T04:11:57 1723781517

> the user competition is meant to be a fun side-project that we threw together today, I think it's cool that people hack things like that so quickly :)

The website comes off as a marketing strategy rather than a fun one-day hackathon project. I think that's why it's getting the reaction you're seeing.

lagichikool · 2024-08-16T02:25:06 1723775106

Should be called k8sbench?

Kubernetes is so bad and the questions asked here are such a good example of why.

  Response: sudo apt-get purge kube*

jtsaw · 2024-08-16T02:30:37 1723775437

We're hoping to expand the dataset into other SRE tasks that aren't kubernetes related. It's just what we've focused on for right now.

andrewguenther · 2024-08-16T05:25:06 1723785906

Hardly fair when most commands don't work and you can't copy paste...

agilob · 2024-08-16T05:43:37 1723787017

and can't ctrl+r...

writtenAnswer · 2024-08-16T03:58:04 1723780684

I've never used Kubernetes in my life before, and I was able to beat the AI benchmark. Neat tool/game, made me learn some super basic kubernetes cli lol

gtirloni · 2024-08-16T09:45:40 1723801540

Free training data?

deathanatos · 2024-08-16T04:56:56 1723784216

~So like, what am I missing?~ (edit: I'm not missing anything; an AI still can't do my job.)

  Pod is stuck in 'ContainerCreating' state and never starts.
  $ kubectl get po -A
  ```
  NAMESPACE     NAME                                      READY   STATUS    RESTARTS   AGE
  default       my-app-5d8d6f6d6f-abcde                   1/1     Running   0          2d
  default       my-app-5d8d6f6d6f-fghij                   1/1     Running   0          2d
  kube-system   coredns-558bd4d5db-xyz12                  1/1     Running   0          5d
  kube-system   coredns-558bd4d5db-xyz34                  1/1     Running   0          5d
  kube-system   etcd-minikube                             1/1     Running   0          5d
  kube-system   kube-apiserver-minikube                   1/1     Running   0          5d
  kube-system   kube-controller-manager-minikube          1/1     Running   0          5d
  kube-system   kube-proxy-abcde                          1/1     Running   0          5d
  kube-system   kube-scheduler-minikube                   1/1     Running   0          5d
  kube-system   storage-provisioner                       1/1     Running   0          5d
  ```

  Your root cause: no pod is stuck in ContainerCreating?
  Grade: incorrect

My other problems were similarly confounding¹. One was "one machine seems loaded, but not others." All the pods had a node affinity to a single node tacked onto their specs, but that's only "partially correct"? And the last one is "Application components in different pods cannot communicate", but nothing is running except nginx, which would never communicate with itself.

We're generating the problems, and answers, with an AI, aren't we?

I've thrown a few real-world problems at LLMs, and they have floundered on them, to the point of not even being able to emit coherent output. I've had utterly incoherent responses, "add this label to the pod label is in Chinese", etc.

Edit: played again. Got the same node affinity problem. Same answer, but this time it was correct. Oh yeah, AI comin' for my job /s.

Also no alias k=kubectl and no up/down to repeat/edit commands, the site restricts you from copy/pasting pod names (or anything else), no tab complete, no common shortcuts… — like yeah, if this is the condition your SREs are working in then I bet an AI can beat them? Might as well tie their hands behind their backs while we're at it.

¹I suppose it matches real life, in that the reported problem is often utterly divorced from reality, and it takes 2–3 rounds with the reporter to make sense of what it is they're trying to report in the first place. But I can't interrogate the problem statement in this "simulator".

jtsaw · 2024-08-16T05:09:02 1723784942

yea, we'd like to actually create these issues on a real cluster, but we couldn't figure out a good way of doing it at scale. The best alternative that we could think of was using an LLM that knows the root cause and could hopefully simulate outputs of commands consistently. Let us know if you have other ideas, we're always looking for ways to improve it.

mjlee · 2024-08-16T12:45:24 1723812324

Would Kubernetes in Docker help? https://kind.sigs.k8s.io/