To me, academia has always been about developing proofs-of-concept for techniques that industry adopts years later. These proofs-of-concept are intricate but small, so they require a lot of ingenuity but not many resources or man-hours (a company can also hire far more employees than a professor can).
AI is no exception. All evidence suggests that current LLMs don't scale (EDIT: they do scale, but at a certain point somewhere around GPT4 the scaling slows down very quickly), so we need new techniques to fix their (many) flaws. A proof-of-concept for a new technique doesn't need hundreds of millions of dollars to demonstrate huge potential: such a demonstration only needs a small model using the new technique, a small model using the current SOTA, and some evidence that it scales (like slightly larger models that show the new model isn't slowing down its scaling vs SOTA).
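To make that concrete, here is a minimal sketch of the kind of small-scale evidence I mean: fit a power law to the losses of a handful of model sizes for the new technique and for a SOTA baseline, then compare the fitted exponents. All numbers below are made up purely for illustration.

```python
# Sketch: compare scaling trends of a new technique vs a SOTA baseline
# using only small models. Loss ≈ a * N^(-b); a larger fitted b means
# the curve is bending down faster with model size N.
import numpy as np

sizes = np.array([10e6, 30e6, 100e6, 300e6])            # parameter counts
loss_baseline = np.array([4.10, 3.80, 3.52, 3.27])      # hypothetical eval losses
loss_new      = np.array([4.00, 3.65, 3.33, 3.03])

def fit_power_law(N, L):
    # log L = log a - b * log N  ->  ordinary least squares in log space
    slope, intercept = np.polyfit(np.log(N), np.log(L), 1)
    return np.exp(intercept), -slope

for name, L in [("baseline", loss_baseline), ("new technique", loss_new)]:
    a, b = fit_power_law(sizes, L)
    print(f"{name}: loss ~ {a:.2f} * N^(-{b:.3f})")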
Academia is also about creating better explanations for things we already have. So researchers don't even need to develop new models, they can simply create small existing models and demonstrate some new property or explanation to get a worthwhile result. We probably need better explanations for how and why current LLMs work in order to create the better models mentioned above.
EDIT: At least how it's supposed to work, you don't even need to show success either. Academics merely need to show a new technique with the potential to improve the SOTA. Even if the technique is a huge failure, the work still advances the future of AI, by contributing to our understanding of models (what works and what doesn't) as well as removing one potential option.
I have seen no evidence that LLMs don't scale. What evidence did you have in mind? Do you mean something specific, and perhaps different from what I would call scaling? It seems that in the last two years people got better at squeezing more performance out of smaller (but still large) language models, but the continued scaling of large models to multimodal applications has been rather impressive.
They didn't say LLMs don't scale. They said current LLMs don't scale, which appears to be true. LLM reasoning ability appears to be plateauing pretty hard, no matter how much we spend.
I'm glad things are plateauing, because when I suggested this could happen I was written off as an idiot. People just kept pointing to the leap from GPT-2 to GPT-3 and GPT-3 to GPT-4 and extrapolating that clearly progress will be exponential forever; some of those people were on this very site.
The visible difference between answers that are broadly plausible and reliably exactly correct is going to be relatively small, probably only distinguishable by experts in a field. But the economic value and technological progress in achieving that leap will be colossal.
That GPT, Gemini, and Claude are all about the level of GPT-4 is evidence that things are plateauing -- not conclusive evidence, but evidence nonetheless.
Not really. When GPT-3 was released, it took well over a year before top competitors/research institutions with the compute hit >= GPT-3 levels.
When you're in uncharted territory, you hit the levels you know are possible first and then you try to scale up. You waste less money this way. And scaling is only getting more expensive.
Within the year, we can probably expect the release of Gemini 2, Claude 4, GPT-5, etc. If those are still small incremental improvements then we can talk about a slowdown. Right now, it just seems par for the course.
In all the fast moving/low barrier technologies I've tracked over the years, competitors tend to leap frog each other. In this case, it's not happening despite scaled compute resources. Again, not conclusive but consistent with a slowdown.
You've completely ignored what I just said. Competitors can leapfrog each other but it's not happening instantly.
Gathering personnel, performing the necessary research, training, post-training red teaming: each of these things takes months of time and millions of dollars. That competitors are reaching GPT-4 level a year from its release is perfectly normal and indicates nothing. Google released the paper for Gopher a year and a half after GPT-3, and that was never intended to be a product.
>In all the fast moving/low barrier
GPT-4 level compute is not low barrier. That's literally the main point of this paper.
Google, Amazon, Microsoft, Meta, Nvidia, Oracle... if 6 companies can accomplish something in 6 months, that's low barrier. Expensive, but that's not the same as high barrier.
Scaling in this broader context can have many meanings, and all of them disagree with the original statement and with yours. If you assume that OpenAI spent more on GPT-3.5 than on GPT-3, and more on GPT-4 than on GPT-3.5, which is certainly plausible, how do I understand your statement in that view? What is even meant by scaling stopping at GPT-4 when the next model by OpenAI is not yet public? Of course it makes no sense to compare against yet another company throwing money at a problem, but within each company, scaling to more money (training/modalities/tunings) as a function of time has improved reasoning abilities. I am probably misunderstanding something simple, so any citation would help.
This isn't necessarily just about the models themselves, as there are also laws — and protestors dropping banners on Altman's head[0] — basically saying "can you not".
It doesn’t scale because it’s inherently a flawed system. Predicting the next output sequentially over and over doesn’t allow for coherent decision making and planning.
It also lacks the ability to learn. You feed it a trillion pages of books, then if you want it to learn something new you have to train it again and risk destroying the old knowledge. A human brain doesn’t need a trillion pages.
We need a system that can learn in real time. If I’m talking to it, it should change its structure. It can never truly be intelligent if it’s stuck in 2022.
>All evidence suggests that current LLMs don't scale (EDIT: they do scale, but at a certain point somewhere around GPT4 the scaling slows down very quickly)
There seems to actually be no evidence that LLMs don't scale. All data from LLMs leading up to GPT-4 indicates they actually scale incredibly well.
Do you have a single paper you can point to that says otherwise?
No OP, but no amount of data will ever enable an LLM to know when it doesn’t know something, for instance. That’s a fundamental limitation with the current architecture.
When we have an external source of information that is not self-accountable, it is different from when we are both the learner and functor - when we notice mistakes we learn from them. We're always looking out for errors, and if they happen then (usually) we alone feel responsible for their effects.
Again, you're talking about something different than scaling. There are many different aspects of emotions, life, or even learning. But the comment was about scaling learning, which can be unaffected by other aspects, such as whether you learned something incorrect.
More succinctly:
I can still teach Simpson's Rule in a Calculus class to a young man or woman who subscribes to Creationism, or the Flat Earth theory. None of their false knowledge affects their innate ability to take in new knowledge.
But what is the purpose of "scaling?" Broadly speaking, it means that there is a marked and obtainable improvement on an objective function in response to additional data. But for certain criteria, like the possession of knowledge, it appears attention-based LLMs aren't capable of that, regardless the amount of data. So they don't "scale" if you use that as an objective function.
Negation in the training data, exhaustion, or unknowable unknowns like future events are about all an LLM will ever be able to say it "doesn't know" about reliably.
LLMs have no concept of truth or correctness, really; they just probabilistically append to the output to match learned ideal patterns.
They're trained (and possibly scripted) to not "know" things or not be able to do things.
But at their core, they're going to generate the most plausible continuation, even if that isn't very plausible (like non-existent functions). I think this can be improved (e.g. by playing with the temperature in self-hosted models depending on the nature of the task), but it doesn't seem fully solvable with current LLMs.
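For instance, lowering the sampling temperature makes the continuation more conservative when you care about precision over creativity. A minimal sketch with Hugging Face Transformers; the model name is just a placeholder for whatever you self-host:

```python
# Lower temperature -> sharper distribution over next tokens, fewer "creative"
# continuations; higher temperature -> more diverse but more error-prone output.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"   # placeholder small model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "def parse_config(path):"
inputs = tokenizer(prompt, return_tensors="pt")

out = model.generate(
    **inputs,
    max_new_tokens=64,
    do_sample=True,
    temperature=0.2,     # near-greedy for code; try 0.8+ for brainstorming tasks
    top_p=0.95,
)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

It doesn't fix the underlying "most plausible continuation" behavior, but it does change how often the implausible continuations surface.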
Do you know much about how this all works? I've been thinking about how a human child ingests (maybe) petabytes of tokens every day through video and action. They initiate movement and observe the consequences through video, audio, proprioception, etc. The decades of schooling afterwards seems like fine tuning on a base model.
The base model is likely to have lots of genetic determined stuff that is there before any training: some animals like horses can walk and perceive/navigate very soon after being born. There's also an innate fear of snake shapes in most mammals.
Another one that's bothered me is how our models so often start from scratch (aside from fine tuning or transfer learning) - I wish we could just start with say a smaller model with low resolution and grow it into a higher resolution one, a small LLM that can be grown into different specialisations (which we do with fine tuning - but do they all need the same number of layers, size of each layer, etc).
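One existing line of work on exactly this is Net2Net-style growing, where a trained small network is widened (or deepened) in a function-preserving way and then training continues. A minimal numpy sketch of the widening step, assuming a plain two-layer MLP; the helper name is mine:

```python
# Net2Net-style width expansion (Chen et al., 2015): duplicate hidden units and
# split their outgoing weights so the grown network computes the same function.
import numpy as np

def widen(W1, b1, W2, new_width, rng=np.random.default_rng(0)):
    """Widen a hidden layer: W1 is (in, h), b1 is (h,), W2 is (h, out)."""
    h = W1.shape[1]
    assert new_width >= h
    # Keep all original units, then pick existing units to copy for the extra slots.
    mapping = np.concatenate([np.arange(h), rng.integers(0, h, new_width - h)])
    counts = np.bincount(mapping, minlength=h)

    W1_new = W1[:, mapping]                             # copy incoming weights
    b1_new = b1[mapping]
    W2_new = W2[mapping, :] / counts[mapping][:, None]  # split outgoing weights so sums are unchanged
    return W1_new, b1_new, W2_new
```

Whether this kind of growth pays off for transformer LLMs at scale is, as far as I know, still an open question.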
I don't understand what you mean. To improve performance using the same basic architecture, the model needs to scale both compute and training data. Where are we going to find another 10x the web-sized text training corpus?
And that is just web-scraped data. There are trillions of valuable tokens' worth of text from the likes of PDFs/ebooks, academic journals and other documents that essentially have no web presence otherwise.
>To improve performance using the same basic architecture, the model needs to scale both compute and training data.
What you are trying to say here is that you need to scale both parameters and data as scaling data increases compute.
That said, it's not really true. It would be best if you scaled both data and parameters but you can get increased performance just scaling one or the other.
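One way to make that precise is the parametric loss fit popularized by the Chinchilla paper (Hoffmann et al., 2022): loss falls if you scale either parameters N or training tokens D, and fastest if you scale both. A small sketch; the constants are roughly the ones reported in that paper and should be treated as illustrative, since they vary by model family and dataset:

```python
# Parametric scaling law from Hoffmann et al. 2022: L(N, D) = E + A/N^alpha + B/D^beta
def predicted_loss(N, D, E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28):
    """N = parameter count, D = training tokens."""
    return E + A / N**alpha + B / D**beta

# Holding data fixed and growing the model still lowers predicted loss,
# just with diminishing returns relative to scaling both together:
print(predicted_loss(7e9, 1.4e12))    # a Chinchilla-ish point
print(predicted_loss(70e9, 1.4e12))   # 10x parameters, same data
```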
The 6T token dataset is surely a high quality subset / refined extract from much larger public datasets. It's misleading to compare their sizes directly.
We don't know what the dataset is. "high quality subset from much larger public datasets" is not just inherently speculation, it's flat out wrong as no such public datasets existed when GPT-4 was trained.
I think the main argument is based on diminishing returns. They expected larger improvement in performance given the 5x increase in number of parameters.
> To study how the parallel structure of transformers might limit their capabilities, the pair considered the case where transformers didn’t feed their output back into their input — instead, their first output would have to be the final answer. They proved that the transformers in this theoretical framework couldn’t solve any computational problems that lie outside a specific complexity class. And many math problems, including relatively simple ones like solving linear equations, are thought to lie outside this class.
> But no matter how a prompt is phrased, as long as it causes a language model to output step-by-step solutions, the model can in principle reuse the results of intermediate steps on subsequent passes through the transformer. That could provide a way to evade the limits of parallel computation... They quantified how that extra computational power depends on the number of intermediate steps a transformer is allowed to use before it must spit out a final answer. In general, researchers expect the appropriate number of intermediate steps for solving any problem to depend on the size of the input to the problem. ...Indeed, Merrill and Sabharwal proved that chain of thought only really begins to help when the number of intermediate steps grows in proportion to the size of the input, and many problems require the number of intermediate steps to grow much larger still.
So there’s a large space of O(n) and O(constant) problems that LLMs seem to scale very well on, but if they need O(n^2) examples in their training data to learn how to solve O(n^2) problems (even with prompt engineering!) then that’s a hard limit.
ETA: there’s also good (IMO irrefutable) evidence that AI researchers have deluded themselves about the scalability of LLMs because of motivated statistical reasoning: https://arxiv.org/abs/2304.15004
In fact it seems to me that “LLMs scale well” means “LLMs can solve O(n) problems if you give them O(n) examples in their training data,” which is actually not very interesting. The interesting thing is that the internet + Moore’s law allows us to build (and compute on) O(n) training data for an enormous variety of problems.
>> To study how the parallel structure of transformers might limit their capabilities, the pair considered the case where transformers didn’t feed their output back into their input — instead, their first output would have to be the final answer. They proved that the transformers in this theoretical framework couldn’t solve any computational problems that lie outside a specific complexity class. And many math problems, including relatively simple ones like solving linear equations, are thought to lie outside this class.
This seems like somewhat of an artificial constraint? E.g. we can explicitly ask GPT4 to solve problems using code which I wouldn't think has the same limits. The model might not be able to solve a problem directly but can in code.
>ETA: there’s also good (IMO irrefutable) evidence that AI researchers have deluded themselves about the scalability of LLMs because of motivated statistical reasoning: https://arxiv.org/abs/2304.15004
Curious to hear how this relates to scaling. I think a lot of the emergence dialog is just semantics. I.e. if GPT-3 can only score a 50% on a 3-digit arithmetic test but GPT-4 scores 80%, this 'ability' might not literally be emergent, but functionally it is: at a larger scale it 'gained' the ability to do 3-digit arithmetic. It's not clear how this indicates issues with scalability though. For example, in the paper the authors state: "However, if one changes from nonlinear Accuracy to linear Token Edit Distance while keeping the models' outputs fixed, the family's performance smoothly, continuously and predictably improves with increasing scale"
> This seems like somewhat of an artificial constraint?...The model might not be able to solve a problem directly but can in code.
First of all, I would say that it's not an "artificial constraint" so much as what you're saying is an artificial mitigation. When it comes to human programmers, we expect them to be able to execute small O(n^2) and O(2^n) algorithms on paper, and if they're not able to we rightfully question whether or not they actually understand the algorithm versus just having a bunch of code memorized.
But the bigger problem is that the O(n) limitation doesn't just apply to formal math/CS problems; many natural language queries have a computational complexity, and I believe this result puts a limit on those that GPT4 can (easily) solve. I'll give a concrete example from my testing GPT-4 last year[1]. Naively GPT-4 struggled to solve two simple problems:
1) Given a pair of sentences, choose the one that has the most words.
2) Given a list of n numbers, choose the largest.
Without chain-of-thought, GPT-4 was not able to solve these problems reliably because it was clearly using spurious statistical cues and "guessing," e.g. seeming to think "antidisestablishmentarian" was always in a sentence with many words. Using chain-of-thought prompting ("let's think step-by-step") GPT-4 was able to solve these problems easily.
However, these are both O(n) problems. If you asked GPT-4 "given a list of n sentences, choose the one with the most words. Let's think step by step:" it could not reliably solve the problem! It either counted the words in each sentence step-by-step but then guessed which count was largest, or it guessed the counts and then compared them step-by-step. Its ability to solve two separate O(n) problems did not extend to an O(mn) problem - you would need to tailor the chain-of-thought prompt and basically hold GPT's hand to get the right answer. I think this theoretical result puts my experiment on firmer ground: transformer LLMs can be trained on a variety of linear complexity problems, but they cannot natively extend those results to quadratic complexity.
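For reference, the composite task itself is a trivial program, which is what makes the failure interesting (the function name is mine):

```python
# Reference implementation of the composite O(mn) task: m sentences, each
# requiring an O(n) word count, followed by a max over the counts.
def sentence_with_most_words(sentences):
    return max(sentences, key=lambda s: len(s.split()))

print(sentence_with_most_words([
    "Antidisestablishmentarianism is long.",
    "This sentence has quite a few more short words in it.",
]))
```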
And yes, GPT could put this in code and get the right answer. But you'd have to train it to write a Python function instead of using its own abilities. I don't think "if the problem seems too complex, try writing a Python function" will work as a prompting strategy: it's GPT-4's inability to recognize quadratic complexity that's the root cause of the problem. (It's also why chain-of-thought isn't a cure-all - humans need to decide whether CoT is appropriate! If you use CoT prompting on a simple factual question you'll get a bunch of useless "steps.")
Going back to the scaling - the promise of scaling is that eventually big data pays off and the LLM gains some sort of generalizable understanding of the problem. But if the performance varies smoothly and predictably, then that's probably not happening - LLMs never "grok" subjects, they just screw things up less and less. E.g. if GPT-3 got a 50% on 3-digit arithmetic because its training data include 30% of all 3-digit arithmetic problems, and GPT-4 got an 80% because its training data includes 50% of all such problems, then that's a vacuous and uninteresting form of scaling - there are only ~2,000,000 such problems so this isn't as ridiculous as it might sound. Likewise if GPT-4 had the same training data but doubled its performance because it doubled the resources spent on arithmetic training. The point isn't to deny that scaling exists, it's to question whether it's worth all the money and excitement. I really don't think it is - it's not even interesting.
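(The "~2,000,000" is a rough back-of-the-envelope count, assuming two 3-digit operands per problem and a handful of operations:)

```python
pairs = 900 * 900        # 3-digit operands run from 100 to 999
print(pairs * 2)         # + and -      -> 1,620,000 problems
print(pairs * 3)         # +, - and x   -> 2,430,000 problems
```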
[1] GPT-4 can solve these examples correctly now, but that's mostly a testament to the impossibility of doing reproducible investigation on closed LLMs. I strongly suspect OpenAI later trained GPT-4 on this and other "dumb" problems: certainly you could write a Python script to generate appropriate training data. The problem is you would need to generate an ~O(mn) amount of data for every pair of O(m) and O(n) problems that someone might want to ask in conjunction. I don't actually use LLMs for work, so maybe you could see how performance varies between Claude, Mixtral, etc.
I wonder if part of my disagreement is in thinking of GPT-4 as a system rather than just as an LLM. If the system has the ability to solve problems in code, it feels like intentionally handicapping it by forcing it to solve problems in natural language. So it's not super interesting for me if you say that GPT-4 the LLM can solve only O(n) problems when GPT-4 the system can solve more complex ones. Though I do admit there's an issue where the prompter needs to preemptively identify problems that should be solved in code, and that's a significant shortfall.
> the promise of scaling is that eventually big data pays off and the LLM gains some sort of generalizable understanding of the problem. But if the performance varies smoothly and predictably, then that's probably not happening - LLMs never "grok" subjects
Is this any different from humans? Even a Nobel Prize winning Physicist will make some mistakes, does this indicate they don't grok Physics?
>if GPT-3 got a 50% on 3-digit arithmetic because its training data include 30% of all 3-digit arithmetic problems, and GPT-4 got an 80% because its training data includes 50% of all such problems, then that's a vacuous and uninteresting form of scaling
Genuinely curious, is there any indication this is actually what's happening? It seems pretty trivial to generate novel questions in other domains for example and GPT4's performance seems to generalize to those questions. I.e. GPT4 can do well on unreleased GRE problems for example. Or do you think GPT4 is actually just memorizing everything and has 0 ability to generalize?
> Academics merely need to show a new technique with the potential to improve the SOTA.
Not sure which field you are talking about, but I've had more than one machine learning paper rejected because it didn't improve SOTA enough. Not to contradict your point directly, but current publication practices don't seem to scale very well with the progress in AI
I am an AI researcher working for a small company. When ChatGPT came out and people started solving all sorts of problems with prompting, I questioned my role and reasoned that in the future mid-level researchers will lose their jobs, and that if I want to stay in this business I have to upgrade my skills, possibly start publishing papers, and get a PhD. In the future only big tech companies will do research in AI, and many automation problems at small companies will be solved with foundation models, so the competition for research jobs at big companies will be very tight.
Prompting is yet another layer of abstraction. Untrained teenagers obsessed with ChatGPT likely rival OP in prompting.
OP can likely write some random forest algorithm; it's just not needed anymore.
I'm not an AI person, just a programmer, and I've found my time has gone into learning parameters to fine-tune + learning prompting. Maybe the parameters would have helped if I knew AI, but these are all layers that need to be learned.
I recently saw an article arguing that waiting for overall LLM improvements actually beats fine-tuning in most cases as well. What are your feeling on this? (apologies that I can’t remember the source)
> OP can likely write some random forest algorithm; it's just not needed anymore.
Not sure if that's completely true. What if the prediction problem is based on company data e.g. predicting probability of click? Not sure how you could use ChatGPT for that. Also no one actually writes a random forest algorithm, you just import it from a library.
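For what it's worth, "just import it from a library" is about this much code; the synthetic data here is a stand-in for whatever click logs a company actually has:

```python
# Minimal sketch of a library-provided random forest for click probability.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(clf.predict_proba(X[:5])[:, 1])   # predicted click probabilities
```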
What's your job as an AI researcher at a small company?
When you mentioned prompting, I imagined you using Vertex AI or similar 'slightly' lower-level AI tools, but that's more MLOps to me than AI research or AI integration.
We develop information extraction models and solve Document AI problems. But many NLP and Vision tasks can be done with foundation models by simple fine-tuning, for which you only need to know very basic definitions and do not need a post-graduate degree.
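A rough sketch of what that "simple fine-tuning" looks like in practice with Hugging Face Transformers; the model and dataset names here are placeholders rather than what we actually use:

```python
# Minimal fine-tuning sketch: small pretrained model + Trainer on a toy subset.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "distilbert-base-uncased"          # any small foundation model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

dataset = load_dataset("imdb")                   # stand-in for your document data
def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length")
dataset = dataset.map(tokenize, batched=True)

args = TrainingArguments(output_dir="out", num_train_epochs=1,
                         per_device_train_batch_size=8)
trainer = Trainer(model=model, args=args,
                  train_dataset=dataset["train"].shuffle(seed=0).select(range(2000)))
trainer.train()
```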
darkoob12: > I have to upgrade my skills and possibly start publishing papers and get a PhD.
Have you seen this "oldie but goodie" from Philip Greenspun's website (esp. the graph beneath "Not So Very Serious Stuff"):
https://www.philip.greenspun.com/careers/
If there is an "AI Winter" it is unlikely that a Ph.D., new or old, would keep you employable. Look to other more predictable but related fields: math esp. statistics and engineering.
FWIW, in the background a great revolution in energy generation is looming: fission is now possible.
There was a related article [0] about this in the WSJ a while back. Here are a couple of relevant quotes from that article:
"Despite this progress, we remain concerned that there is a disproportionate amount of interest by policy makers in the voices of industry leaders rather than those in academia and civil society."
"Furthermore, to truly understand this technology, including its sometimes unpredictable emergent capabilities and behaviors, public-sector researchers urgently need to replicate and examine the under-the-hood architecture of these models. That’s why government research labs need to take a larger role in AI."
> Obviously there are limits to what a small model is capable of doing
This is not obvious to me. I mean yes, limits exist, but I seriously doubt that we are anywhere close to them. This is exactly the kind of question that needs academic research!
I feel that our current models waste tons of compute and memory and there ought to be room to optimize them by 10x or even much more with new techniques.
By leaning on inductive biases, we've historically been able to get pretty far with relatively small models and little data.
The trouble is, applying inductive biases requires that you make assumptions about the problem, which requires that you understand the problem. The stronger the inductive bias, the stronger the assumptions.
Creating fantastically large models allows us to eschew the inductive bias in favour of model capacity. That capacity would lead to useless, fragile models which overfit without mountains of data.
It's great to explore the maxima of either approach so we can define some kind of loose Pareto frontier of model size and performance.
I do think there's tonnes of room to benefit from the pissing contest the tech giants are currently engaged in and learn from the massive models what can be done for smaller models.
The RAG pipeline is kind of an approach to add some inductive bias to such large models. Currently people tend to use the same giant causal LLMs for RAG as they do for non-RAG Chat purposes, but I'm certain you can get much smaller models performing very well for RAG. In fact, I'm sure plenty of folks are doing it already considering the number of eyeballs on LLM and RAG atm.
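A minimal sketch of what I mean, with a small retrieval model doing most of the work before any LLM is involved; the embedding model and documents here are just placeholders:

```python
# Minimal RAG sketch: embed documents, retrieve by cosine similarity, and
# build a grounded prompt for a (small) generator of your choice.
import numpy as np
from sentence_transformers import SentenceTransformer

documents = [
    "Our refund policy allows returns within 30 days.",
    "Support is available by email, Monday through Friday.",
    "Premium plans include priority phone support.",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")     # small retrieval model
doc_vecs = embedder.encode(documents, normalize_embeddings=True)

def retrieve(query, k=2):
    q = embedder.encode([query], normalize_embeddings=True)[0]
    scores = doc_vecs @ q                               # cosine similarity
    return [documents[i] for i in np.argsort(-scores)[:k]]

query = "How do I get my money back?"
context = "\n".join(retrieve(query))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}\nAnswer:"
# `prompt` would now go to a small instruction-tuned model of your choice.
print(prompt)
```

Because retrieval narrows the job to "read this context and answer", the generator can be far smaller than a general-purpose chat model.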
I have the same feeling. I often wonder about the difference between learning and reasoning, in neural networks. If we could train a small network on a small amount of code and it figures out how to write code that's as good as what GPT-4 can write, that's what I'd call reasoning (and that's what's currently a massive open research question).
On the other hand, it's definitely a lot easier to just show a gigantic model a gigantic amount of code until it becomes familiar with most types of applications a user would likely ask it about.
So it feels like right now we're papering over our lack of research success by (essentially) throwing money at the problem, and it seems to be working well enough. (At least, well enough to be useful to people.)
I know a handful of computer science academics personally, and each of them has founded at least one startup. They can go on sabbatical easily, do the startup, then come back once they sell it or it fails. It's a pretty sweet safety net.
If I was an AI professor, I would go get a $1 million comp package on sabbatical then come back. The bubble will likely burst in 1 to 3 years, so you need to get paid now.
I have witnessed the same excitement and "never ending summer" psychology many times before. The cold winter always comes.
We even had a big bubble recently with AI itself regarding self driving cars. It's amazing how every tech luminary went on stage and declared "This Year For Sure" to be when self driving cars would pick us all up... in 2019. It's perpetually 5 years away from mass deployment.
Training LLMs has huge capital expenditure (capex) and is not producing revenues. As an anecdote, my brother in law works for a large 15k person Microsoft software shop, and they got a discount on their license renewal (yes, a discount) for including Github Copilot for every software engineer. We surmised this is to goose the number of "paying" licenses to help metrics.
Extremely high salaries - extremely high capex - extremely high hype and gushing press - extremely nonexistent revenues (except for the pick and shovel sellers like NVDA) - this is a quintessential bubble in the making!
Waymo can get there, it’s tightening screws at Waymo. But it’s the exception that proves the rule: no one else ever gets there on today’s model. You’re basically right across the board.
Cruise gets written down, Tesla still makes cars but you have to drive them, yeah…I think that covers everyone else.
I wouldn't be so sure. Even if it worked 100%, I don't see a way that Waymo could ever be profitable enough to justify the investment. There just isn't enough margin. Alphabet can afford to subsidize it indefinitely but it's not going to become a moneymaker for anyone. In fact, as you point out, their serious competition is basically out of the game now. I expect within the next five years, Google will spin it off and call it a win, and Waymo itself will peter out a few years later.
That's fair, in a game as crazy hard and complicated as self-driving, I shouldn't have implied it's a lock, I don't think anything is a lock when that much cutting-edge technology, complicated institutional incentives, regulators, and a topic guaranteed to interest the public all come together.
I meant to say "I've ridden in Waymo vehicles, a number of times, in non-trivial situations, and it was flawless, so while I'm sure there's a long tail of crazy hard scenarios, it can drive you to and from a restaurant in heavy traffic that involves getting on something that sure looked big enough to be a freeway, getting back off it without trouble, and parking better than any human I know".
How much they spent to get there, and if they're ever seeing that back? I wouldn't know where to begin estimating that. I'm basically sure from first-hand experience that Waymo can do the vast majority of the driving I've ever needed. Maybe it can't cope with insane weather or something, but most people can't really either.
“OpenAI's annual run rate—a measure of one month's revenue multiplied by 12—hit the $2 billion revenue mark in December 2023, two people familiar with the knowledge told the Financial Times.”
Several well-known professional services firms have already deployed AI for their workforces. Many software engineers, esp younger ones, use ChatGPT or its equivalent every day at work.
Yeah, but dolphin via ollama is a fine code assistant and didn't require a zillion bucks or a Larry Summers-level shady insider Washington power play gone afoul.
The revenue numbers are pricing in that someone makes Linux illegal, i.e. not real or a real tragic loss for humanity: pick one.
Depends on how you count. Meta/FAIR realizing de-facto that open LLaMa was in their advantage is debatably causally related to everything that happened after, but there's a case people would have figured it out anyways.
In this narrow instance it's not true at all: Mistral (pre-MSFT Mistral, we'll see going forward) released "Mixtral" 8x7B with de-facto "available weight" licensing and enough arch literature to make it the go-to for a long time.
Dolphin is an Orca-style "non-operator de-alignment / operator-alignment" of that model, based on open work on this out of MSFT.
And they did it on a very modest budget by the standards of foundation models in any modality but especially natural language and adjacent things.
It's basically established at this point at least to my satisfaction that too much money hurts AI research. It's too easy to just throw money at it, and if you have that kind of backing, you have to show results on a very specific schedule (usually).
There was a need for some form of funding from somewhere to allow LLM research certainly, and there was probably a need for someone to subsidize a few expensive runs that didn't pan out before it became common enough knowledge how to train a foundation LLM. But the cat's out of the bag on that now, you might not get Opus, but that capability band between e.g. GPT-3.5 and GPT-4 (and it's various previews) is crowded with stuff that is well-explained in significant detail in purely public sources.
So IDK, yeah, maybe we needed big giant companies to kickstart this thing, maybe not, the government used to do effective public/private partnerships or even useful stuff directly with academia and in places still does: CERN did the LHC without like, crazy direct corporate sponsorship written all over it (there may be indirect stuff that I don't know about not being involved in the LHC project).
And anyone who thinks AI is more complicated or should be more capital-intensive than high-energy particle physics done at CERN and places like it is getting high on their own supply.
The root of the problem is that the methods used by LLMs (even deep learning, for that matter) are considered by many academics to be "the brute force approach", with researchers just throwing GPUs at the problem. Yet speech recognition with deep neural networks beat almost every "intelligent" approach and broke through a performance barrier that had stood for 30 years before it was even considered good enough.
They’re considered by many people whose experience is entirely industrial and zero academic to be trivially wasteful via the efficacy of quantization strategies directly or indirectly based on Hessian local geometry, to name but one example.
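To make the "trivially wasteful" point concrete: even the crudest round-to-nearest int8 quantization (nothing Hessian-aware like GPTQ) barely moves the outputs of a random weight matrix. A toy sketch:

```python
# Per-channel round-to-nearest int8 quantization of a stand-in weight matrix.
# This is the simplest possible flavor, not the Hessian-based methods alluded to.
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4096, 4096)).astype(np.float32)    # stand-in weight matrix

scale = np.abs(W).max(axis=1, keepdims=True) / 127.0    # one scale per output channel
W_int8 = np.clip(np.round(W / scale), -127, 127).astype(np.int8)
W_deq = W_int8.astype(np.float32) * scale

x = rng.normal(size=(4096,)).astype(np.float32)
rel_err = np.linalg.norm(W @ x - W_deq @ x) / np.linalg.norm(W @ x)
print(f"4x smaller weights, relative output error ~ {rel_err:.4f}")
```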
It's unclear to me what the widely observed empirical fact that modern LLMs and their immediate offshoots (ViTs, all of it) are wildly entropically redundant (for too many reasons to succinctly enumerate) has to do with my argument about the comparative efficacy of e.g. CERN and the way we do AI research in the large.
As for academics, many academics regard calling AI scaling in an industrial or quasi-industrial setting "research" as generous in the extreme.
It’s eye-popping engineering but we’re a long way from anyone so much as whispering about any Fields-level math coming out of “AI”.
I believe in this technology too. It is classic for technologies to enter the hype cycle, with a bubble forming followed by the "trough of disillusionment".
Huge potential with LLMs, but that doesn't mean pain isn't going to come.
- Hollywood disrupted
- Music industry disrupted
- Gaming industry disrupted
- Search disrupted
- New forms of media and social media
- Molecular and chemical discovery enhanced
Things that will happen with one or two possibly low hanging fruit discoveries:
- AI agents
- AI reasoning
Things that will happen on a longer time horizon:
- Self-driving
- Robots and automated manufacturing
- Software automation
It doesn't have to be perfect right away in order to strike at over a trillion dollars of (growing) TAM. And the field will keep getting better with more eyeballs, more practitioners, and more investment dollars.
It's definitely possible, but there is a lot of hand waving going on with terms like "Music industry disrupted". Where is the AI that is top of the charts? What movie did AI create that I am forking over my hard earned money for?
In the 1980s and 1990s there were many prognostications about how the internet would disrupt everything. And it basically did, to varying degrees, 10 to 25 years later...
Perhaps some kind of "generative listening" too, taking existing music and adapting it to the listener. Like how a lot of AI art is some kind of "remix" of existing things.
Eh, until we have a "this is AGI" moment, self-driving cars are going to be a pipe dream, so I'm not sure they're a great example.
And, yeah, put your money where your mouth is with your investments... I think you'll quickly end up relearning what "The market can remain irrational far longer than you can remain solvent" actually means.
Pinning discounts on license renewals on LLMs not selling is missing what the hell is going on in the market outside of AI. It is rough out there and business spending in general is way down. Trying to lump that in with lack of AI adoption may be a mistake.
Because they always do. Because if you're looking, you'll see all the weaknesses in the current generation of LLM tech. But even if everything were great about LLM tech and the promise was as unlimited as Sam Altman promises, the mere existence of a skyrocketing field attracts scammer founders and investors who drive valuations higher until they are unsustainable.
When the CEO of Microsoft starts claiming insane things like “AI is a bigger invention than fire and electricity” it’s a huge red flag that investors are rabid and not based in reality.
It has arguably already burst a few times. At this point IT is just a reactionary part of the broader economy. IT, being a leading technology, experiences bubbles within itself as new flavors come around, just like any other industry. Like in 2017 there was a huge bubble in tile work with Schluter all-in-one systems: any reasonable tile layer could take two weekends of classes and start making about $1000 a day installing these new tile systems in commercial buildings. Now the hype is dead and most tile jobs are just as they were. There was a bubble in insurance a few years ago with Medicare Advantage plans as well.
Perhaps IT has more bubbles, but on the whole salaries are comparatively falling in line with other industries.
I am acquaintances with a professor who is on multiple company boards. But really he is on meaningless committees, like the "Compensation Committee", and they just use him to broadcast the fact that an accomplished professor from a world-renowned university is "on their board". He just gets a check for them using him for marketing!
I have taken, and continue to take, approaches IV and VIII[0], and it's worked very well for me. Especially now that my research is investigating LLMs[1], scaling down and using architectures that focus on performance is paying dividends.
Working in a niche application (molecular biology) is also a great way to remain sane (although it comes with its own problems).
As someone in an adjacent field who tried to keep up with the state of the art until the late 2010s, I really relate to this article. I'm reminded of another interesting personal story about AI-induced academic depression https://arxiv.org/abs/2003.01870. I think it's valuable to record things like these.
The premise is bizarre. Academia is a sanctuary for people who can't get paid to do research because it's not economically exploitable, but is considered socially valuable. Academics are deeply lucky to have this option. Most of us can't get subsidies for our passion work.
There's no reason to be "depressed" about corporations throwing mountains of money at you to come out of the sanctuary.
It’s not irrational that we’ve forgotten how to funnel an American core value of “being best” via government spending on “defense” into “socialism” (the public good) by way of smart people who insist on comfortable and esteemed as their minimum job perk level, but care more about knowledge than power deep down.