sweezyjeezy's comments | Hacker News

I would argue that what LLMs are capable of doing right now is already pretty extraordinary, and would fulfil your extraordinary evidence request. To turn it on its head - given the rather astonishing success of the recent LLM training approaches, what evidence do you have that these models are going to plateau short of your own abilities?

What they do is extraordinary, but it's not just a claim - they actually do it, and their doing so is the evidence.

Here someone just claimed that it is "entirely clear" LLMs will become super-human, without any evidence.

https://en.wikipedia.org/wiki/Extraordinary_claims_require_e...


Again - I'd argue that the extraordinary success of LLMs, in a relatively short amount of time, using a fairly unsophisticated training approach, is strong evidence that coding models are going to get a lot better than they are right now. Will it definitely surpass every human? I don't know, but I wouldn't say we're lacking extraordinary evidence for that claim either.

The way you've framed it seems like the only evidence you will accept is after it's actually happened.


Well, predicting the future is always hard. But if someone claims some extraordinary future event is going to happen, you at least ask for their reasons for claiming so, don't you?

In my mind, at this point we either need (a) some previously "hidden" super-massive source of training data, or (b) another architectural breakthrough. Without either, this is a game of optimization, and the scaling curves are going to plateau really fast.


A couple of comments

a) it hasn't even been a year since the last big breakthrough - the reasoning models (o1, then o3) only started coming out last September, and we don't know how far those will go yet. I'd wait a second before assuming the low-hanging fruit is done.

b) I think coding is a really good environment for agents / reinforcement learning. Rather than requiring a continual supply of new training data, we give the model coding tasks to execute (writing / maintaining / modifying) and then test its code for correctness. We could for example take the entire history of a code-base and just give the model its changing unit + integration tests to implement. My hunch (with no extraordinary evidence) is that this is how coding agents start to nail some of the higher-level abilities.
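To make that hunch a bit more concrete, here's a minimal sketch of what such a reward loop could look like - the model call is stubbed out and the file names are made up, so treat it as an illustration rather than anyone's actual training setup:

    import subprocess

    def propose_patch(task_prompt: str) -> str:
        """Stand-in for the model: returns a candidate implementation as source text.
        In a real setup this would be an LLM call."""
        return "def add(a, b):\n    return a + b\n"

    def reward(candidate_source: str, test_file: str) -> float:
        """Write the candidate out, run the existing test suite, and score it.
        Historical unit/integration tests act as the verifier - no new labels needed."""
        with open("candidate.py", "w") as f:
            f.write(candidate_source)
        result = subprocess.run(["pytest", test_file, "-q"], capture_output=True, text=True)
        return 1.0 if result.returncode == 0 else 0.0   # crude pass/fail reward

    # One rollout: take a commit's tests, ask the model to satisfy them, score it,
    # and feed the score back as the reinforcement signal for that rollout.
    r = reward(propose_patch("implement add() so that test_add.py passes"), "test_add.py")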


the "reasoning" models are already optimization, not a breakthrough.

They are not reasoning in any real sense, they are writing pages and pages of text before giving you the answer. This is not super-unlike the "ever bigger training data" method, just applied to output instead of input.


This is like Disco Stu's chart for disco sales on The Simpsons, or the people who were guaranteeing Bitcoin would be $1 million each in 2020.

I'm not betting any money here - extrapolation is always hard. But just drawing a mental line from here that tapers to somewhere below one's own abilities - I'm not seeing a lot of justification for that either.

I agree that they can do extraordinary things already, but I have a different impression of the trajectory. I don't think it's possible for me to provide hard evidence, but between GPT-2 and 3.5 I felt that there was an incredible improvement, and I probably would have agreed with you at that time.

GPT-4 was another big improvement, and was the first time I found it useful for non-trivial queries. 4o was nice, and there was a decent bump with the reasoning models, especially for coding. However, since o1 it's felt a lot more like optimization than systematic improvement, and I don't see a way for current reasoning models to advance to the point of designing and implementing medium+ coding projects without the assistance of a human.

Like the other commenter mentioned, I'm sure it will happen eventually with architectural improvements, but I wouldn't bet on 1-5 years.


On Limitations of the Transformer Architecture https://arxiv.org/abs/2402.08164

Theoretical limitations of multi-layer Transformer https://arxiv.org/abs/2412.02975


Only skimmed, but both seem to be referring to what transformers can do in a single forward pass; reasoning models would clearly be a way around that limitation.

o4 has no problem with the examples from the first paper (appendix A). You can see its reasoning here is also sound: https://chatgpt.com/share/681b468c-3e80-8002-bafe-279bbe9e18.... Not conclusive unfortunately, since this is within the date range of its training data. Reasoning models killed off a large class of "easy logic errors" people discovered in the earlier generations, though.


I think it’s glorified copying of existing libraries/code. The number of resources already dedicated to the field and the amount of hype around the technology make me wary that it will get better at more comprehensive code design.

o4-mini got this right 4 times out of 4.


o4 got this wrong multiple times. Claude 3.7 got it right the first time.


Let U ~ Uniform(0,1), and let the target value being measured be x, so A ~ (x + U) / 2 (the average of the truth and the noise) and B ~ x or U, each with probability 0.5. We draw a from A and b from B, and we want the estimator to minimise mean absolute error - the Bayes-optimal rule is the posterior median of x given (a, b).

Note that if a = 0 and b = 1 we KNOW b != x, because a is too small - there is no u with (u + 1) / 2 = 0. I'll skip the full calculation here, but basically if b could feasibly be correct, its "atomic weight" (the point mass the posterior puts on x = b) ends up being at least 0.5, so it is the posterior median; otherwise we know b is just noise, and the median is just a. So our estimator is

b if a in range [b/2, (b+1)/2]; a otherwise

This appears to do better than OP's solution in an experiment of 1M trials (MAE ~ 0.104 vs 0.116; I can verify OP's numbers). The estimator that minimises the mean squared error (the posterior mean) is more interesting - on the range a in [b/2, (b+1)/2] it becomes a nonlinear function of a of the form 1 / (1 + piecewise_linear(a)).
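In case anyone wants to poke at it, here's a minimal sketch of the 1M-trial experiment under the setup above (A is the average of x and fresh noise, B is x or independent noise with probability 0.5) - the baseline it prints is just "use a alone", not OP's actual estimator:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 1_000_000

    x  = rng.uniform(0, 1, n)                   # true value
    u1 = rng.uniform(0, 1, n)                   # noise averaged into sensor A
    u2 = rng.uniform(0, 1, n)                   # noise sensor B may report instead of x

    a = (x + u1) / 2                            # sensor A
    b = np.where(rng.random(n) < 0.5, x, u2)    # sensor B

    # Trust b when it is consistent with a, i.e. a could be (b + u)/2 for some
    # u in (0, 1); otherwise b must be noise and a is the posterior median.
    est = np.where((a >= b / 2) & (a <= (b + 1) / 2), b, a)

    print("MAE of this estimator:", np.abs(est - x).mean())
    print("MAE of using a alone: ", np.abs(a - x).mean())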


I was not able to replicate OP's work, I must be misunderstanding something. Based on these two lines:

> U is uniform random noise over the same domain as P

> samples of P taken uniformly from [0, 1)

I have concluded that U ~ Uniform(0,1) and X ~ Uniform(0,1). i.e., U and X are i.i.d. Once I have that, then there is never any way to break the symmetry between X and U, and B always has a 50% chance of being either X or U.


There are two i.i.d. Uniform noise variables - the U in A and the U in B are independent draws.


Your comments feel a bit incoherent - just extend your reasoning for why you think Europe should want to fund this back to the US again.


The GP sounds like one of these people who describe themselves as self made, or libertarian, where history begins where you like it and coalitions are only worthy when you’re the biggest benefactor. Best to ignore and let the leopards find them.


haha, sage advice


Can you extend your reasoning for why you think the US should want to continue to fund this for the EU?


"for"? You realise this is a homeland security matter for the US as well as the EU?


Great. That's why the EU should fund it for the US. It's a security matter for them!


It’s a security matter for the US.

It’s a security matter for the EU.

Both parties should pay for the security matter, as they were previously. Stop twisting the other poster’s words.


I agree that it's a little hard to care about this author's situation as much as other stories I've heard in the past couple of years. But that said, losing a job like this is never a nice place to be, and I don't hate this person for having those emotions. People are allowed to feel things; shaming them for that is not nice.

But I would caution people against writing public statements like this when they are still in shock - you might regret them later; better to try and regain some balance first.


I agree, but also agree with the author's statement "It's very difficult to decide which module to put an individual function in".

Quite often coders optimise for searchability, so there will be a constants file, a dataclasses file, a "readers" file, a "writers" file, etc. This is great if you are trying to hunt down a single module or line of code quickly. But it can become absolute misery to actually read the 'flow' of the codebase, because every file has a million dependencies and the logic jumps in and out of each file for a few lines at a time. I'm a big fan of the "proximity principle" [1] for this reason - don't divide code to optimise 'searchability'; put things together that actually depend on each other, as they will also need to be read / modified together (a rough sketch of the contrast is below).

[1] https://kula.blog/posts/proximity_principle/
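To make the contrast concrete (file and class names made up), the "searchable" layout scatters one feature across type buckets, whereas a proximity layout keeps everything the feature needs in one module:

    # Layout optimised for "searchability": one feature scattered across type buckets.
    #   constants.py   -> CSV_DELIMITER = ","
    #   dataclasses.py -> class TradeRecord: ...
    #   readers.py     -> def read_trades(path): ...
    #
    # Proximity layout: everything the trade-import feature needs lives together.
    from dataclasses import dataclass
    from pathlib import Path

    CSV_DELIMITER = ","                 # only this feature uses it, so it lives here

    @dataclass
    class TradeRecord:
        symbol: str
        quantity: int

    def read_trades(path: Path) -> list[TradeRecord]:
        rows = path.read_text().splitlines()
        return [TradeRecord(sym, int(qty))
                for sym, qty in (line.split(CSV_DELIMITER) for line in rows)]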


> It's very difficult to decide which module to put an individual function in

It's difficult because it is a core part of software engineering; part of the fundamental value that software developers are being paid for. Just like a major part of a journalist's job is to first understand a story and then lay it out clearly in text for their readers, a major part of a software developer's job is to first understand their domain and then organize it clearly in code for other software developers (including themselves). So the act of deciding which modules different functions go in is the act of software development. Therefore, these people:

> Quite often coders optimise for searchability, so like there will be a constants file, a dataclasses file, a "reader"s file, a "writer"s file etc etc.

Those people are shirking their duty. I disdain those people. Some of us software developers actually take our jobs seriously.


One thing I experimented with was writing a tag-based filesystem for that sort of thing. Imagine, e.g., using an entity component system and being able to choose a view that does a refactor across all entities or one that hones in on some cohesive slice of functionality.

In practice, it wound up not quite being worth it. The concept requires the same file to "exist" in multiple locations for it to work with all your other tools in a way that actually exploits tags, but then any reference to a given file (e.g., to import it) needs some sort of canonical name in the TFS, so that on `cd`-esque operations you can still reach the "right" one -- doable, but not agnostic of the file format, which is the point where I saw this causing more problems than it was solving.

I still think there's something there though, especially if the editing environment, programming language, and/or representation of the programming language could be brought on board (e.g., for any concrete language with a good LSP, you can re-write import statements dynamically).
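For a flavour of the idea (the names here are made up for illustration), the core of it is just an index from tags to canonical paths, with "directories" computed as tag queries:

    from collections import defaultdict
    from pathlib import Path

    class TagIndex:
        """Each file has one canonical path plus any number of tags; 'directories'
        are just tag queries, so the same file can appear in many views."""
        def __init__(self):
            self._tags = defaultdict(set)           # tag -> set of canonical paths

        def add(self, path: str, *tags: str):
            for tag in tags:
                self._tags[tag].add(Path(path))

        def view(self, *tags: str) -> set:
            """All files carrying every requested tag - one cohesive slice of the codebase."""
            sets = [self._tags[t] for t in tags]
            return set.intersection(*sets) if sets else set()

    idx = TagIndex()
    idx.add("src/physics/collision.py", "entity:ball", "component:physics")
    idx.add("src/render/sprite.py",     "entity:ball", "component:render")
    print(idx.view("entity:ball"))      # the slice of functionality for one entity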




Indeed! The traditional name for the proximity principle is "cohesion" [1].

[1] https://en.wikipedia.org/wiki/Cohesion_(computer_science)


Not to pick on Rails, but sorting files into "models / views / controllers" seems to be our first instinct. My pantry is organized that way: baking stuff goes here, oils go there, etc.

A directory hierarchy feels more pleasant when it maps to features, instead. Less clutter.

Most programmers do not care about OO design, but "connascence" has some persuasive arguments.

https://randycoulman.com/blog/2013/08/27/connascence/

https://practicingruby.com/articles/connascence

https://connascence.io/

> Knowing the various kinds of connascence gives us a metric for determining the characteristics and severity of the coupling in our systems. The idea is simple: The more remote the connection between two clusters of code, the weaker the connascence between them should be.

> Good design principles encourages us to move from tight coupling to looser coupling where possible. But connascence allows us to be much more specific about what kinds of problems we’re dealing with, which makes it easier to reason about the types of refactorings that can be used to weaken the connascence between components.


Base-10 is just our chosen way of writing numbers; it doesn't need to have any deep relationship with the arithmetic properties of sequences like the powers of 2. For most sequences (Fibonacci numbers, factorials, etc.), the digits of large members will be essentially random and won't obey any pattern - the two things are just unconnected. It seems extremely likely that 2048 is the highest, but there might not be a good reason that could lead to a proof - it's just that larger and larger random numbers have less and less chance of satisfying the condition (with a tiny probability that they do, meaning we can't prove it).
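To put a rough number on that heuristic: a minimal sketch, assuming for illustration that the condition is something like "every decimal digit of 2^n is even", and treating the digits as if they were independent random digits (which is exactly the "no deep relationship" point):

    import math

    # Heuristic only: pretend the decimal digits of 2**n are i.i.d. uniform digits.
    # If each digit satisfies the condition with probability p (p = 0.5 for "even"),
    # a d-digit number qualifies with probability ~ p**d.  Summing over all larger
    # powers of 2 gives the expected number of further examples - a rapidly
    # shrinking, convergent tail.
    p = 0.5
    expected_further = sum(
        p ** (math.floor(n * math.log10(2)) + 1)   # number of digits of 2**n
        for n in range(12, 10_000)                 # everything beyond 2**11 = 2048
    )
    print(expected_further)   # small but nonzero: strong suspicion, not a proof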

Interestingly, there are results in the other direction. Fields medalist James Maynard had an amazing result that there are infinitely many primes that have no 7s (or any other chosen digit) in their decimal expansion. This actually _exploits_ the fact that there is no strong interaction between digits and primes to show that such primes must exist with some density. That kind of approach can't work for proving finiteness, though.


Yes, I find math problems that depend on base 10 to be unsatisfying because they rely on arbitrary cultural factors of how we represent numbers. "Real" mathematics should be universal, rather than just solving a puzzle.

Of course, such a problem could yield deep insight into number theory blah blah blah, but it's unlikely.


> but it also possible to train AIs to approach them without going through the same process as the human scientists

With chess the answer was more or less to brute-force the problem space, but will that work with math / science? Is there a way to widely explore the problem space with AI, especially in a way that goes above or even against the contents of its training data? I don't know the answer, but that seems to be the crucial question here.


They did specify "molecule". Certainly not every pharmaceutical occurs naturally, and other materials such as Teflon don't either.


This makes sense, but I do like the idea presented before you, that humans are natural, and our evolution is natural, therefore the things that we make are natural too. But then where do you draw the line? If an alien drops a new element on earth, would this then be not ‘natural’? Or is it impossible for anything to not be natural?


Except "natural" is itself a human-made concept, which means roughly "anything not made by humans / not a consequence of human action" ;)


We're part of nature and so everything we do is occurring there. Is the real distinction things that wouldn't exist without human intervention? So, like Teflon and plumcots wouldn't be natural but water and plutonium are.

Things that aren't naturally occurring would be supernatural, no?


That's the original meaning of "artificial": something that comes about through art or skill. Humans have always distinguished themselves from the rest of nature; this is part of that. It's not a useless distinction, but it breaks down when 'natural' is assumed to be healthy or otherwise good, and 'artificial' unhealthy or otherwise bad - that's a non sequitur.


Now we're arguing semantics. "Naturally occurring" in English means not synthetically produced / found naturally in the environment outside of human influences.


> Statistical significance is bullshit. Learning about it is as useful as learning about phlogiston.

Ok, that's where I draw the line - statistical significance is not "bullshit". However, as you say, leaning on it too hard can cause things to break quite badly. Scientists misusing it do not negate all the medical advances we have made from moving to a significance-based system. It is an absolutely essential tool for anyone using statistics to understand, but its limitations must be emphasised when taught, and it must be understood that it is a tool, not a conclusion. Also, other alternatives should be taught more widely (e.g. Bayesian inference).
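As a toy illustration of the difference (the numbers are invented): for 61 heads in 100 coin flips, the significance test asks "how surprising is this if the coin were fair?", while a Bayesian posterior asks "what do we now believe about the bias?":

    from math import comb

    k, n = 61, 100                      # say, 61 heads in 100 flips
    # Frequentist: one-sided p-value under H0 "the coin is fair"
    p_value = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n

    # Bayesian: with a flat Beta(1, 1) prior the posterior on the bias is
    # Beta(1 + k, 1 + n - k); its mean is a statement about the bias itself,
    # not about how the data would look under a null hypothesis.
    posterior_mean = (1 + k) / (2 + n)

    print(f"p-value = {p_value:.4f}, posterior mean bias = {posterior_mean:.3f}")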


We would have made better progress if we had skipped NHST. Mind you, neither Fisher nor Pearson was in favor of it. It is a tool that should be taught after better ways, and then only to understand the past. Like phlogiston.

