A score of 50 is coin-toss odds. The dataset is 195,000 Reddit jokes with scores; the model is presented with pairs of jokes (one highly upvoted, one poorly rated).
Example prompt:
Which joke from reddit is funnier? Reply only "A" or "B". Do not be conversational.
<Joke A><setup>Son: "Dad, Am I adopted"?</setup>
<punchline>Dad: "Not yet. We still haven't found anyone who wants you."</punchline></Joke A>
<Joke B><setup>Knock Knock</setup>
<punchline>Who's there?
Me.
Me who?
I didn't know you had a cat.</punchline></Joke B>
This is my first crack at evals. I'm open to improvements.
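For anyone who wants to try something similar, here is a minimal sketch of how a pairwise eval like this could be scored. It is not the harness used for the results; the function names, the `ask_model` callable, and the random A/B ordering (to keep position bias from inflating the score) are all assumptions for illustration.

```python
import random

PROMPT_HEADER = (
    'Which joke from reddit is funnier? Reply only "A" or "B". '
    "Do not be conversational."
)


def format_joke(label, setup, punchline):
    # Mirrors the tag layout of the example prompt above.
    return (
        f"<Joke {label}><setup>{setup}</setup>\n"
        f"<punchline>{punchline}</punchline></Joke {label}>"
    )


def build_prompt(high, low, rng):
    """Return (prompt, correct_label) for one (setup, punchline) pair.

    Assumption: the highly upvoted joke is placed in slot A or B at random.
    """
    if rng.random() < 0.5:
        a, b, answer = high, low, "A"
    else:
        a, b, answer = low, high, "B"
    prompt = "\n".join(
        [PROMPT_HEADER, format_joke("A", *a), format_joke("B", *b)]
    )
    return prompt, answer


def parse_reply(reply):
    # Accept only a bare "A" or "B" (ignoring case, quotes, and a period).
    text = reply.strip().strip("\"'.").upper()
    return text if text in ("A", "B") else None


def accuracy(pairs, ask_model, seed=0):
    """Percent of pairs where the model picks the highly upvoted joke.

    `ask_model` is a hypothetical callable: prompt string in, reply string out.
    Unparseable replies count as wrong.
    """
    if not pairs:
        raise ValueError("need at least one joke pair")
    rng = random.Random(seed)
    correct = 0
    for high, low in pairs:
        prompt, answer = build_prompt(high, low, rng)
        if parse_reply(ask_model(prompt)) == answer:
            correct += 1
    return 100.0 * correct / len(pairs)
```

With the order randomized, a model that always answers "A" should land near 50, which is the coin-toss baseline mentioned above.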
There can be "cynical greedy bastards" in many places. If you optimize against them in one regard and one place, will you also handle them well elsewhere? And calls for change can be abused by some of them to open new opportunities for exploitation, this time benefiting a different group of them.
You need to have an alternative, and it needs to be a credible and reliable one, to ensure that it does not end up being the case that one scam is replaced with another scam.
I really think that criminal theory needs to progress. We differentiate between, say, consensual intimacy and rape, and we don't let the existence of sexually abusive people set the terms for our romantic encounters.
We have carved out a class of engagements, labeled it deeply asocial, criminalized it and now we pursue people who engage in it through legal means.
Business really doesn't have this. Personal example - last week I was at a place where the business owner tried to overcharge me by an order of magnitude and then verbally attacked me when I caught him and backed out of the transaction.
His Google and Yelp reviews are full of people claiming false charges and all kinds of fraud, refusal to correct, and repeated abuse until they closed their cards. It's wildly obvious what's going on here and I was on the ball enough to catch it.
I contacted the police and they said "well you should call the BBB or something". There are dozens of reviews describing clear credit card fraud, and for some reason, because he's a merchant, it doesn't seem to hit the radar.
These are purely criminal matters - people acting habitually in bad faith with ill intent in a brazenly dishonest manner.
Whether it's plundering the commons, polluting the public discourse, or breaking other types of social compacts, these should be treated the same as any other crime.
Does your country allow suing him for a large monetary amount? Have you talked to the media? A lawyer? Maybe together with others? Made it as easy as possible for the police to get him, paper trail, receipts and all?
You do have points, though, but there might at least be some actions that you and others can take in this case. Maybe a medium change like changing the law on this specific point might make sense.
I'm not law enforcement. This shouldn't be my job. If I see someone robbing a store with a mask on and a gun I should be able to call the police, report it, and hand it off.
If there's an accumulation of complaints against this merchant then that should warrant an investigation.
The police have like half the local city budget, can't they do their job?
I call it the day50 problem; I coined that about a year ago. I've been building tools to address it since then. I quit the day job 7 months ago and have been doing it full time since.
Essentially there's a delta between what the human does and what the computer produces. In a classic compiler setting this is a known, stable quantity throughout the development life cycle.
However, in the world of AI coding this distance increases.
There are various barriers, with labels like "code debt", that the line can cross. There are three mitigations now: start the lines closer together (PRD is the current en vogue method), push out the frontier of how many shits someone gives (this is the TDD agent method), or try to bend the curve so it doesn't fly out so much (this is the coworker/colleague method).
Unfortunately I'm just a one-man show so the fact that I was ahead and have working models to explain this has no rewards because you know, good software is hard...
I've explained this in person at SF events so many times (probably about 40-50) that someone reading this might have actually heard it from me...
Many appear to be proxies. I'm familiar with some "serverless" architectures that do things like this https://www.shodan.io/host/34.255.41.58 ... you can see this has a bunch of ollama ports running really really old versions
You can pull down "new" manifests but very few ollamas are new enough for decent modern models like glm-4.7-flash. The free tier for kimi-k2.5:cloud is going to be far more useful than pasting these into your OLLAMA_HOST variable.
I think the real headline is: "thousands of slow machines running mediocre small models from last year. Totally open..."
Anyways, if codellama:13b is your jam, go wild I guess.
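If you want to check whether an Ollama server you run yourself is new enough before pulling a newer model, a rough sketch follows. It assumes the server exposes Ollama's `GET /api/version` endpoint and defaults to a local instance; the helper names are hypothetical, and the minimum version any given model actually needs is something you would have to look up, not something stated here.

```python
import json
import re
import urllib.request


def parse_version(text):
    """Turn '0.5.7' or 'v0.1.32-rc1' into a tuple of ints like (0, 5, 7)."""
    core = text.strip().lstrip("v").split("-")[0].split("+")[0]
    parts = []
    for piece in core.split("."):
        match = re.match(r"\d+", piece)
        parts.append(int(match.group()) if match else 0)
    return tuple(parts)


def is_new_enough(version, minimum):
    """Compare two version strings, padding the shorter one with zeros."""
    have, need = parse_version(version), parse_version(minimum)
    width = max(len(have), len(need))
    have += (0,) * (width - len(have))
    need += (0,) * (width - len(need))
    return have >= need


def fetch_version(host="http://localhost:11434", timeout=5.0):
    """Ask an Ollama server for its version string.

    Assumption: `host` is your own server and includes the scheme and port.
    """
    with urllib.request.urlopen(f"{host}/api/version", timeout=timeout) as resp:
        return json.load(resp)["version"]
```

Usage would look like `is_new_enough(fetch_version(), "0.5.0")`, where `"0.5.0"` is only a placeholder threshold.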
Arcee AI is currently free on openrouter with some really great speeds and, from what I can tell, no logs/training. It's completely free till the end of Feb and it's a 500B model.
There are tons of free inference models. I tried to use gemini flash in aistudio + devstral free for agentic tasks; it's now deprecated, but when it wasn't, it was a really good setup imo. Now I can use arcee, but personally I ended up buying a cheap 1-month subscription to kimi after haggling it down from $19.99 to $1.49 for the first month (could've haggled more, down to $0.99, but yeaaa).
The question is "why do people need fainting couches for this project and why are they pretending like 3 year old features of apis that already exist in thousands of projects are brand new innovations exclusive to this?"
The answer is: "the author is a celebrity and some people are delusional screaming fanboys"
My response is: "that's bullshit. let's be adults"
Here are results for 34 models (testing a few more right now). So far gemini-3-flash-preview is in the lead.
https://docs.google.com/spreadsheets/d/1wLqHA0ohxukgPLpSgklz...