Lots of problems with this paper, including the fact that, even if you accept their claim that internal board state is equivalent to a world model, they don't appear to do the obvious thing, which is to display the reconstructed "internal" board state. More fundamentally though, reifying the internal board as a "world model" is absurd: otherwise a (trivial) autoencoder would also be building a "world model".
>More fundamentally though, reifying the internal board as a "world model" is absurd: otherwise a (trivial) autoencoder would also be building a "world model".
The point is that they aren't directly training the model to output the grid state, like you would an autoencoder. It's trained to predict the next action and learning the state of the 'world' happens incidentally.
It's like how LLMs learn to build world models without directly being trained to do so, just in order to predict the next token.
By the same reasoning, if you train a neural net to output the next action from the output of the autoencoder, then the whole system also has a "world model"; but if you accept that definition of "world model" then it is extremely weak and not the intelligence-like capability that is being implied.
And as I said in my original comment they are probably not even able to extract the board state very well, otherwise they would depict some kind of direct representation of the state, not all of the other figures about board-move causality, etc.
Note also that the board state is not directly encoded in the neural network: they train another neural network (a probe) to find weights that approximate the board state when given the internal activations of the Othello network. It's a bit of fishing for the answer you want.
> And as I said in my original comment they are probably not even able to extract the board state very well,
They do measure and report on this, both in summary in the blog post and in more detail in the paper.
> otherwise they would depict some kind of direct representation of the state
If you can perfectly accurately extract the state the result would be pretty boring to show right? It'd just be a picture of a board state and next to it the same board state with "these are the same".
> Note also that the board state is not directly encoded in the neural network: they train another neural network (a probe) to find weights that approximate the board state when given the internal activations of the Othello network.
If you can extract them, they are encoded in the activations. That's pretty much by definition surely.
> It's a bit of fishing for the answer you want.
How so?
Given a sequence of moves, they can accurately identify which state most of the positions of the board are in just by looking at the network. In order for that to work, the network must be turning a sequence of moves into some representation of a current board state. Assume for the moment they can accurately identify them: do you agree with that conclusion?
> They do measure and report on this, both in summary in the blog post and in more detail in the paper.
I didn't see this in the blog post, where is it? Presumably they omitted it from the blog post because the results are bad as I describe below, which is precisely why I cited it as a red flag.
> If you can perfectly accurately extract the state the result would be pretty boring to show right? It'd just be a picture of a board state and next to it the same board state with "these are the same".
But they don't! They have nearly 10% tile-level error on the human-data-trained model. That's nearly 100% board-level error. It's difficult to appreciate how bad this is, but if you visualize it somehow (for example by sampling random boards) it becomes obvious that it is really bad. On average about 6 or 7 tiles are going to be wrong, and with nearly 100% probability you get an incorrect board.
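To put rough numbers on that (a back-of-the-envelope sketch using the ~10% per-tile figure and assuming independent errors across tiles):

    # back-of-the-envelope: per-tile error -> whole-board error,
    # assuming ~10% per-tile error, independent across 64 tiles
    p_tile_error = 0.10
    n_tiles = 64

    p_board_wrong = 1 - (1 - p_tile_error) ** n_tiles
    expected_wrong_tiles = p_tile_error * n_tiles

    print(f"P(at least one tile wrong): {p_board_wrong:.4f}")            # ~0.9988
    print(f"expected wrong tiles per board: {expected_wrong_tiles:.1f}") # ~6.4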
> If you can extract them, they are encoded in the activations. That's pretty much by definition surely.
No, that's silly. For example, you can cycle through every algorithm/transformation imaginable until you hit one that extracts the satanic verses of the Bible. As I said in another comment, although this is in theory mitigated somewhat by doing test/validation splits, in practice you keep trying different neural network hyperparameters to finesse your validation performance.
> How so? Given a sequence of moves, they can accurately identify which state most of the positions of the board are in just by looking at the network. In order for that to work, the network must be turning a sequence of moves into some representation of a current board state. Assume for the moment they can accurately identify them do you agree with that conclusion?
What conclusion? I believe you can probably train a neural network to take in board moves and output board state with some level of board error. So what?
> they don't appear to do the obvious thing, which is to display the reconstructed "internal" board state.
I'm very confused by this, because they do. Then they manipulate the internal board state and see what move it makes. That's the entire point of the paper. Figure 4 is literally displaying the reconstructed board state.
I replied to a similar comment elsewhere: They aren't comparing the reconstructed board state with the actual board state which is the obvious thing to do.
Unless I'm misunderstanding something, they are not comparing the reconstructed board state to the actual state, which is the straightforward thing you would show. Instead they are manipulating the internal state to show that it yields a different next-action, which is a bizarre, indirect way to show what could be shown in the obvious direct way.
Figure 4 is showing both things. Yes, there is manipulation of the state, but they also clearly show what the predicted board state is before any manipulations (alongside the actual board state).
The point is not to show only a single example; it is to show how well the recovered internal state reflects the actual state in general, i.e. to analyze the performance (this is particularly tricky due to the discrete nature of board positions). That’s ignoring all the other more serious issues I raised.
I haven’t read the paper in some time so it’s possible I’m forgetting something but I don’t think so.
>That’s ignoring all the other more serious issues I raised.
The only other issue you raised doesn't make any sense. A world model is a representation/model of your environment that you use for predictions. Yes, an autoencoder learns to model the data to some degree. To what degree is not well known. If we found out that it learned things like 'city x in country a is approximately distance b from city y, so just learn where y is and unpack everything else when the need arises', then that would certainly qualify as a world model.
Linear regression also learns to model data to some degree. Using the term “world model” that expansively is intentionally misleading.
Besides that and the big red flag of not directly analyzing the performance of the predicted board state I also said training a neural network to return a specific result is fishy, but that is a more minor point than the other two.
The degree matters. If we find autoencoders learning surprisingly deep models then I have no problem saying they have a world model. It's not the gotcha you think it is.
>the big red flag of not directly analyzing the performance of the predicted board state I also said training a neural network to return a specific result is fishy
The idea that probes are some red flag is ridiculous. There are some things to take into account but statistics is not magic. There's nothing fishy about training probes to inspect a model's internals. If the internals don't represent the state of the board then the probe won't be able to learn to reconstruct the state of the board. The probe only has access to internals. You can't squeeze blood out of a rock.
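For concreteness, a probe is nothing exotic; it's essentially this (a minimal sketch with made-up arrays, not the paper's actual code):

    # minimal sketch of a probe: map internal activations -> state of one tile
    # (arrays here are random placeholders; the paper uses real Othello-GPT activations)
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    n_positions, d_model = 10_000, 512
    activations = rng.normal(size=(n_positions, d_model))  # one row per game position
    tile_state = rng.integers(0, 3, size=n_positions)      # 0=empty, 1=mine, 2=theirs

    split = 8_000
    probe = LogisticRegression(max_iter=1000)
    probe.fit(activations[:split], tile_state[:split])
    print("held-out accuracy:", probe.score(activations[split:], tile_state[split:]))
    # with random activations this sits near chance (~0.33); the claim is that
    # real activations give far higher accuracy, i.e. the state is decodable

If the internals carried no board information, a probe like this wouldn't get much above chance on a clean held-out split.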
I don’t know what makes a “surprisingly deep model” but I specifically chose autoencoders to show that simply encoding the state internally can be trivial and therefore makes that definition of “world model” vacuous. If you want to add additional stipulations or some measure of degree you have to make an argument for that.
In this case specifically “the degree” is pretty low since predicting moves is very close to predicting board state (because for one you have to assign zero probability to moves to occupied positions). That’s even if you accept that world models are just states, which as mtburgess explained is not reasonable.
Further if you read what I wrote I didn’t say internal probes are a big red flag (I explicitly called it the minor problem). I said not directly evaluating how well the putative internal state matches the actual state is. And you can “squeeze blood out of a rock”: it’s the multiple comparison problem and it happens in science all the time and it is what you are doing by training a neural network and fishing for the answer you want to see. This is a very basic problem in statistics and has nothing to do with “magic”. But again all this is the minor problem.
>In this case specifically “the degree” is pretty low since predicting moves is very close to predicting board state (because for one you have to assign zero probability to moves to occupied positions).
The depth/degree or whatever is not about what is close to the problem space. The blog above spells out the distinction between a 'world model' and 'surface statistics'. The point is that Othello GPT is not in fact playing Othello by 'memorizing a long list of correlations' but by modelling the rules and states of Othello and using that model to make a good prediction of the next move.
>I said not directly evaluating how well the putative internal state matches the actual state is.
This is evaluated in the actual paper with the error rates using the linear and non linear probes. It's not a red flag that a precursor blog wouldn't have such things.
>And you can “squeeze blood out of a rock”: it’s the multiple comparison problem and it happens in science all the time and it is what you are doing by training a neural network and fishing for the answer you want to see.
The multiple comparison problem is only a problem when you're trying to run multiple tests on the same sample. Obviously don't test your probe on states you fed it during training and you're good.
> The point is that Othello GPT is not in fact playing Othello by 'memorizing a long list of correlations' but by modelling the rules and states of Othello and using that model to make a good prediction of the next move.
I don't know how you rule out "memorizing a long list of correlations" from the results. The big discrepancy in performance between their synthetic/random-data training and human-data training suggests to me the opposite: random board states are more statistically nice/uniform, which suggests that these are in fact correlations, not state computations.
> This is evaluated in the actual paper with the error rates using the linear and non linear probes. It's not a red flag that a precursor blog wouldn't have such things.
It's the main claim/result! Presumably the reason it is omitted from the blog is that the results are not good: nearly 10% error per tile. Othello boards are 64 tiles so the board level error rate (assuming independent errors) is 99.88%.
> The multiple comparison problem is only a problem when you're trying to run multiple tests on the same sample. Obviously don't test your probe on states you fed it during training and you're good.
In practice what is done is you keep re-running your test/validation loop with different hyperparameters until the validation result looks good. That's "running multiple tests on the same sample".
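A toy illustration of the effect (a sketch, not anyone's actual pipeline): score enough no-signal "models" against one fixed validation set and the best of them looks like it found something.

    # toy illustration of reusing one validation set across many tries:
    # pure-noise "models" + pick-the-best = inflated validation accuracy
    import numpy as np

    rng = np.random.default_rng(0)
    n_val = 200
    y_val = rng.integers(0, 2, n_val)          # fixed validation labels

    best_acc = 0.0
    for _ in range(500):                       # 500 "hyperparameter settings"
        preds = rng.integers(0, 2, n_val)      # each one is pure noise
        best_acc = max(best_acc, (preds == y_val).mean())

    print(f"best 'validation' accuracy over 500 noise models: {best_acc:.2f}")
    # prints around 0.6 even though the true accuracy of every model is 0.5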
There don't seem to be any examples of how to connect to an existing (say sqlite) database even though it says you should try logica if "you already have data in BigQuery, PostgreSQL or SQLite,". How do you connect to an existing sqlite database?
I was turned off by this at first, but then tried it out. These are mistakes in the documentation. The tools just work with PostgreSQL and SQLite without any extra work.
How do you connect to an existing database so that you can query it? There are examples of how you can specify an "engine" which will create a new database and use it as a backend for executing queries but I want to query existing data in an sqlite database.
"if I were asked whether a student would learn more about U.S. foreign policy by reading this book or by reading a collection of the essays that current and former U.S. officials occasionally write in journals such as Foreign Affairs or the Atlantic, Chomsky and Robinson would win hands down. I wouldn’t have written that last sentence when I began my career 40 years ago. I’ve been paying attention, however, and my thinking has evolved as the evidence has piled up"
Indeed; fire up gptel-mode in an Org Mode buffer, and you'll get to work with Org Mode, including code blocks with whatever evaluation support you have configured in your Emacs.
Also I really like the design of the chat feature - the interactive chat buffer is still just a plain Markdown buffer, which you can simply save to a file to persist the conversation. Unlike with typical interactive buffers (e.g. shell), nothing actually breaks - gptel-mode just appends the chat settings to the buffer in the standard Emacs fashion (key/value comments at the bottom of the file), so to continue from where you left off, you just need to open the file and run M-x gptel.
(This also means you can just run M-x gptel in a random Markdown buffer - or an Org Mode buffer, if you want aforementioned org-babel functionality; as long as gptel minor mode is active, saving the buffer will also update persisted chat configuration.)
Org code blocks are great but not quite the same as having a REPL. But like I said above, I think this is really a great piece of software. I can definitely see this being a game changer in my daily work with Emacs.
Used the right way, Org mode code blocks are better, though setting things up to allow this can be tricky, and so I rarely bother.
What I mean is: the first difference between a REPL and an Org Mode block (of non-elisp code[0]) is that in REPL, you eval code sequentially in the same runtime session; in contrast, org-babel will happily run each execution in a fresh interpreter/runtime, unless steps are taken to keep a shared, persistent session. But once you get that working (which may be more or less tricky, depending on the language), your Org Mode file effectively becomes a REPL with editable scrollback.
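For example, with Python blocks it's the :session header argument that turns separate blocks into one shared interpreter (a minimal sketch, assuming python is enabled in org-babel-load-languages; the session name is arbitrary):

    #+begin_src python :session my-llm-repl :results output
      # first block: state lives in the named session
      x = 40
    #+end_src

    #+begin_src python :session my-llm-repl :results output
      # a later block in the same file still sees it
      print(x + 2)   # => 42
    #+end_src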
This may not be what you want in many cases, but it is very helpful when you're collaborating with an LLM - being able to freely edit and reshape the entire conversation history is useful in keeping the model on point, and costs in check.
--
[0] - Emacs Lisp snippets run directly on your Emacs, so your current instance is your session. It's nice that you get a shared session for free, but it also sucks, as there is only ever one session, shared by all elisp code you run. Good luck keeping your variables from leaking out to the global scope and possibly overwriting something.
It's wild that people post papers that they haven't read or don't understand because the headline supports some view they have.
To wit, in your first link it seems the figure is just showing the trivial fact that the model is trained on the MMLU dataset (and after RLHF it is no longer optimized for that). The second link main claim seems to be contradicted by their Figure 12 left panel which shows ~0 correlation between model-predicted and actual truth.
I'm not going to bother going through the rest.
I don't yet understand exactly what they are doing in the OP's article but I suspect it also suffers from serious problems.
>The second link main claim seems to be contradicted by their Figure 12 left panel which shows ~0 correlation between model-predicted and actual truth.
The claim in the abstract is:
"""We first show that larger models are well-calibrated on diverse multiple choice and true/false questions when they are provided in the right format.
Next, we investigate whether models can be trained to
predict "P(IK)", the probability that "I know" the answer to a question, without reference
to any particular proposed answer. Models perform well at predicting P(IK) and partially
generalize across tasks, though they struggle with calibration of P(IK) on new tasks."""
The plot is much denser in the origin and top right. How is that 0 correlation? Depending on the size of their held-out test set, that could even be pretty strong correlation.
And how does that contradict the claims they've made, especially on calibration (Fig 13 down)?
Figure 13 right panel also shows there isn't a y=x relationship on out-of-sample tests.
First we agree by observation that outside of the top-right and bottom-left corners there isn't any meaningful relationship in the data, regardless of what the numerical value of the correlation is. Second, in those corners it is not clear to me what the relationship is but it looks flattish (i.e. if the ground truth is ~0 then the model-guess-for-truth could be anywhere from 0 to 0.5). This is also consistent with the general behavior displayed in figure 13.
If you have some other interpretation of the data you should lay it out. The authors certainly did not do that.
edit:
By the way there are people working on a re-sampling algorithm based on the entropy and variance of the output logits called entropix: if the output probabilities for the next token are spread evenly, for example (rather than having overwhelming probability on a single token), they prompt for additional clarification. They don't really claim anything like the model "knows" whether it's wrong, but they say it improves performance.
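Roughly, the gate is something like this (a simplified sketch of the idea, not entropix's actual code; the threshold is made up):

    # simplified sketch of an entropy gate over next-token logits
    # (illustrative only, not entropix's code; threshold is made up)
    import numpy as np

    def softmax(logits):
        z = logits - np.max(logits)
        p = np.exp(z)
        return p / p.sum()

    def next_step(logits, entropy_threshold=1.0):
        p = softmax(np.asarray(logits, dtype=float))
        entropy = -np.sum(p * np.log(p + 1e-12))      # in nats
        if entropy > entropy_threshold:
            return "ask_for_clarification"            # probability mass is spread out
        return "sample_token"                         # one token clearly dominates

    print(next_step([8.0, 0.1, 0.0, -1.0]))  # peaked  -> sample_token
    print(next_step([1.0, 1.0, 1.0, 1.0]))   # uniform -> ask_for_clarification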
>Figure 13 right panel also shows there isn't a y=x relationship on out-of-sample tests.
A y=x relationship is not necessary for meaningful correlation and the abstract is quite clear on out of sample performance either way.
>Second, in those corners it is not clear to me what the relationship is but it looks flattish (i.e. if the ground truth is ~0 then the model-guess-for-truth could be anywhere from 0 to 0.5).
The upper bound for guess-for-truth is not as important as the frequency. Yes it could guess 0.5 for 0 but how often compared to reasonable numbers? A test set on TriviaQA could well be thousands of questions.
>edit: By the way there are people working on a re-sampling algorithm based on the entropy and variance of the output logits called entropix
I know about entropix. It hinges strongly on the model's representations. If it works, then choosing to call it "knowing" or not is just semantics.
> A y=x relationship is not necessary for meaningful correlation
I'm not concerned with correlation (which may or may not indicate an actual relationship) per se, I'm concerned with whether there is a meaningful relationship between predicted and actual. The Figure 12 plot clearly shows that predicted isn't tracking actual, even in the corners. I think one of the lines of Figure 13 right (predicting 0% but actual is like 40%, going from memory on my phone) even more clearly shows there isn't a meaningful relationship. In any case the authors haven't made any argument about how those plots support their arguments, and I don't think you can either.
> the abstract is quite clear on out of sample performance either way.
Yes I’m saying the abstract is not supported by the results. You might as well say the title is very clear.
> The upper bound for guess-for-truth is not as important as the frequency. Yes it could guess 0.5 for 0 but how often compared to reasonable numbers? A test set on TriviaQA could well be thousands of questions.
Now we've gone from "the paper shows" to speculating about what the paper might have shown (and even that is probably not possible, based on the Figure 13 line I described above).
> choosing to call it "knowing" or not is just semantics.
Yes it’s semantics but that implies it’s meaningless to use the term instead of actual underlying properties.
For the red Lambada line in Fig 13 when the model predicts ~0 the ground truth is 0.7. No one can look at that line and say there is a meaningful relationship. The Py Func Synthesis line also doesn't look good above 0.3-0.4.
> The abstract also quite literally states that models struggle with out of distribution tests so again, what is the contradiction here ?
Out of distribution is the only test that matters. If it doesn't work out of distribution it doesn't work. Surely you know that.
> Would it have been hard to simply say you found the results unconvincing?
Anyone can look at the graphs, especially Figure 13, and see this isn't a matter of opinion.
> There is nothing contradictory in the paper.
The results contradict the titular claim that "Language Models (Mostly) Know What They Know".
>For the red Lambada line in Fig 13 when the model predicts ~0 the ground truth is 0.7. No one can look at that line and say there is a meaningful relationship. The Py Func Synthesis line also doesn't look good above 0.3-0.4.
Yeah but Lambada is not the only line there.
>Out of distribution is the only test that matters. If it doesn't work out of distribution it doesn't work. Surely you know that.
Train the classifier on math questions and get good calibration for math, train the classifier on true/false questions and get good calibration for true/false, train the classifier on math but struggle with true/false (and vice versa). This is what "out-of-distribution" is referring to here.
Make no mistake, the fact that both the first two work is evidence that models encode some knowledge about the truthfulness of their responses. If they didn't, it wouldn't work at all. Statistics is not magic and gradient descent won't bring order where there is none.
What out of distribution "failure" here indicates is that "truth" is multifaceted and situation dependent and interpreting the models features is very difficult. You can't train a "general LLM lie detector" but that doesn't mean model features are unable to provide insight into whether a response is true or not.
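To be concrete about what "calibration" means here, it's just a reliability check like this (sketched with simulated numbers, not the paper's data):

    # sketch of a calibration/reliability check: bin predicted P(IK) and
    # compare each bin's mean prediction to the empirical accuracy
    # (numbers are simulated; the paper does this with real model outputs)
    import numpy as np

    rng = np.random.default_rng(0)
    p_ik = rng.uniform(0, 1, 5_000)              # predicted "I know" probability
    correct = rng.uniform(0, 1, 5_000) < p_ik    # simulate a perfectly calibrated model

    edges = np.linspace(0, 1, 11)
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (p_ik >= lo) & (p_ik < hi)
        if mask.any():
            print(f"predicted {p_ik[mask].mean():.2f}   actual {correct[mask].mean():.2f}")
    # a calibrated model prints matching columns; systematic gaps = miscalibration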
> Well good thing Lambada is not the only line there.
There are 3 out-of-distribution lines, all of them bad. I explicitly described two of them. Moreover, it seems like the worst time for your uncertainty indicator to silently fail is when you are out of distribution.
But okay, forget about out-of-distribution and go back to Figure 12 which is in-distribution. What relationship are you supposed to take away from the left panel? From what I understand they were trying to train a y=x relationship but as I said previously the plot doesn't show that.
An even bigger problem might be the way the "ground truth" probability is calculated: they sample the model 30 times and take the percentage of correct results as ground truth probability, but it's really fishy to say that the "ground truth" is something that is partly an internal property of the model sampler and not of objective/external fact. I don't have more time to think about this but something is off about it.
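Concretely, as I understand it the procedure amounts to something like this (my paraphrase in code; model_answers_correctly is a hypothetical stand-in for sampling and grading the model, not their code):

    # paraphrase of the described "ground truth" construction: sample the model
    # 30 times per question, call the fraction correct the "true" P(IK)
    # (model_answers_correctly is a hypothetical stand-in, not their code)
    import random

    def model_answers_correctly(question) -> bool:
        return random.random() < 0.5           # placeholder for a real model call

    def ground_truth_p_ik(question, n_samples=30):
        hits = sum(model_answers_correctly(question) for _ in range(n_samples))
        return hits / n_samples

    # note the "ground truth" depends on the model and its sampling settings,
    # which is exactly the objection above
    print(ground_truth_p_ik("some trivia question"))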
All this to say that reading long scientific papers is difficult and time-consuming and let's be honest, you were not posting these links because you've spent hours poring over these papers and understood them, you posted them because the headlines support a world-view you like. As someone else noted you can find good papers that have opposite-concluding headlines (like the work of rao2z).
>It's wild that people post papers that they haven't read or don't understand because the headline supports some view they have.
It's related research either way. And I did read them. I think there are probably issues with the methodology of the fourth one, but it's there anyway because it's interesting research that is related and is not without merit.
>The second link main claim seems to be contradicted by their Figure 12 left panel which shows ~0 correlation between model-predicted and actual truth.
The panel is pretty weak on correlation, but it's quite clearly also not the only thing that supports that particular claim, nor does it contradict it.
>I'm not going to bother going through the rest.
Ok? That's fine
>I don't yet understand exactly what they are doing in the OP's article but I suspect it also suffers from serious problems.
> The panel is pretty weak on correlation, but it's quite clearly also not the only thing that supports that particular claim, nor does it contradict it.
It very clearly contradicts it: There is no correlation between the predicted truth value and the actual truth value. That is the essence of the claim. If you had read and understood the paper you would be able to specifically detail why that isn't so rather than say vaguely that "it is not the only thing that supports that particular claim".
To be fair, I'm not sure people writing papers understand what they're writing either. Much of the ML community seems to have fully embraced the "black box" nature rather than seeing it as something to overcome. I routinely hear both readers and writers tout that you don't need much math. But mistakes and misunderstandings are commonplace, and they're right, they don't need much math. How much do you need to understand the difference between entropy and perplexity? Is that more or less than what's required to know the difference between probability and likelihood? I would hope we could at least get to a level where we understand the linear nature of PCA.
I'm not so sure that's the reason. I'm in the field, and trust me, I'm VERY frustrated[0]. But isn't the saying to not attribute to malice what can be attributed to stupidity? I think the problem is that they're blinded by the hype but don't use the passion to drive deeper understanding. It's a belief that the black box can't be opened, so why bother?
I think it comes from the ad hoc nature of evaluation in young fields. It's like you need an elephant but obviously you can't afford one, so you put a dog in an elephant costume and call it an elephant, just to head in the right direction. It takes a long time to get that working, and progress can still be made by upgrading the dog costume. But at some point people forgot that we need an elephant, so everyone is focused on the intricacies of the costume and some will try dressing up the "elephant" as another animal. Eventually the dog costume isn't "good enough" and leads us in the wrong direction. I think that's where we are now.
I mean, do we really think we can measure language with entropy? Fidelity and coherence with FID? We have no mathematical description of language, artistic value, aesthetics, and so on. The biggest improvement has been RLHF, where we just use Justice Potter Stewart's metric: "I know it when I see it".
I don't think it's malice. I think it's just easy to lose sight of the original goal. ML certainly isn't the only field to have done this, but it's also hard to bring rigor in, and I think the hype makes it harder. Frankly I think we still aren't ready for a real elephant yet, but I'd just be happy if we openly acknowledged the difference between a dog in a costume proxying as an elephant and an actual fucking elephant.
[0] seriously, how do we live in a world where I have to explain what covariance means to people publishing works on diffusion models and working for top companies or at top universities‽
>If you had read and understood the paper you would be able to specifically detail why that isn't so rather than say vaguely that "it is not the only thing that supports that particular claim".
Not every internet conversation need end in a big debate. You've been pretty rude and I'd just rather not bother.
You also seem to have a lot to say on how much people actually read papers, but your first response also took like 5 minutes. I'm sorry but you can't say you've read even one of those in that time. Why would I engage with someone being intellectually dishonest?
> I guess i understand seeing as you couldn't have read the paper in the 5 minutes it took for your response.
You've posted the papers multiple times over the last few months, so no I did not read them in the last five minutes though you could in fact find both of the very basic problems I cited in that amount of time.
Because it's pointless to reply to a comment days after it was made or after engagement with the post has died down. All of this is a convenient misdirection for not having read and understood the papers you keep posting because you like the headlines.
> you can't say you've read even one of those in that time.
I'm not sure if you're aware, but most of those papers are well known. All the arxiv papers are from 2022 or 2023. So I think your 5 minutes is pretty far off. I for one have spent hours, but the majority of that was prior to this comment.
You're claiming intellectual dishonesty too soon.
That said, @foobarqux, I think you could expand on your point more to clarify. @og_kalu, focus on the topic and claims (even if not obvious) rather than the time
>I'm not sure if you're aware, but most of those papers are well known. All the arxiv papers are from 2022 or 2023. So I think your 5 minutes is pretty far off. I for one have spent hours, but the majority of that was prior to this comment.
You're claiming intellectual dishonesty too soon.
Fair Enough. With the "I'm not going to bother with the rest", it seemed like a now thing.
>focus on the topic and claims (even if not obvious) rather than the time
I should have just done that yes. 0 correlation is obviously false with how much denser the plot is at the extremes and depending on how many questions are in the test set, it could even be pretty strong.
> 0 correlation is obviously false with how much denser the plot is at the extremes and depending on how many questions are in the test set, it could even be pretty strong.
I took it as hyperbole. And honestly I don't find that plot, or much of the paper, convincing. Though I have a general frustration in that it seems many researchers (especially in NLP) willfully do not look for data spoilage. I know they do deduplication, but I do question how many try to vet this by manual inspection. Sure, you can't inspect everything, but we have statistics for that. And any inspection I've done leaves me very unconvinced that there is no spoilage. There's quite a lot in most datasets I've seen, which can hugely change the interpretation of results. After all, we're elephant fitting.
I explicitly wrote "~0", and anyone who looks at that graph can see that there is no relationship at all in the data, except possibly at the extremes, where it doesn't matter that much (it "knows" sure things), and I'm not even sure of that. One of the reasons to plot data is so that this type of thing jumps out at you and you aren't misled by some statistic.
They just posted a list of articles, and said that they were related. What view do you think they have, that these papers support? They haven’t expressed a view as far as I can see…
Maybe you’ve inferred some view based on the names of the titles, but in that case you seem to be falling afoul of your own complaint?
Much like you can search the internet until you find a source that agrees with you, you can select a set of papers that "confirm" a particular viewpoint, especially in developing fields of research. In this case, the selected papers all support the view LLMs "know what they know" on some internal level, which iiuc is not (yet?) a consensus viewpoint (from my outsider perspective). But from the list alone, you might get that impression.
If you have discussed those things previously with the poster, I don't agree. If you were to go digging through their history only to respond to the current comment, that's more debatable. But, we're supposed to assume good faith here on HN, so I would take the first explanation.
In this case the poster seems to have projected opinions onto a post where none were expressed. That seems problematic regardless of how they came to associate the opinions with their respondent. Maybe the poster they responded to still holds the projected opinions, perhaps that poster abandoned them, or perhaps they thought the projected opinions distracting and so chose not to share them.
If I am wrong or not useful in my posts, I would hope to be allowed to remove what was wrong and/or not useful without losing my standing to share the accurate, useful things. Anything else seems like residual punishment outside the appropriate context.
When I see a post I strongly disagree with, I tend to check out the poster's history: it's often quite illuminating to be confronted with completely different viewpoints, and also realize I agree to other posts of the same person.
I think this thread is missing that coding is a pretty small part of running a tech company. I have no concerns about my job security even if it could write all the code, which it can't.
I have no idea if you're correct about this or not. With 8 billion people in the world, and a significant number of those people working as "intelligent agents," how would you perceive the difference?
If you think the revolution starts with 8 billion people you're just plain wrong.
It starts with the first world and is very perceivable.
How did we perceive cars replacing horses? Well for one they were replaced in the first world... now imagine how fast a piece of software can change reality.
It's not there yet, and that's why you can't perceive it.
When exactly did you perceive cars replacing the horse? I happen to live in a very equestrian area; I think you'd be hard pressed to convince folks that the horses have even been replaced
Yeah. The only way this revolution doesn't happen is if humans are cheaper, easier to manage or source. And I'm pretty sure AI is already beating a human in all those categories doing the same job.
Our jobs aren't replaced yet because they can't be.
I can point to similar problems even in CLI-targeted apps. In Nicholas Carlini's post, for example, he shows how LLMs helped him make curl parallel by piping to the "parallel" utility. That works, but no sane person would do it, given that curl has built-in parallel transfers via the "-Z" flag, which you could have found in 10 seconds by opening the man page. I'm sure this was an instance of a developer (truly) believing they became 10x more productive.
These aren't even the "hard" problems that are beyond the reach of LLMs today; they seem like things they should be able to do. It's just that, today, they aren't achieving the spectacular results that many are claiming; it's mostly pretty crappy.
That's pretty happy path for one, and for two, how exactly are you doing it? Not by holding down a red button on your phone and talking into it, that's for sure.
For three, add one more subtle requirement to the task, and now you're reading awk manpages and trial-and-erroring perl oneliners.
Yeah, sure - if you have the memory that allows that sort of recall. For the rest of us, LLMs are like Alzheimer's medication or eye glasses. Believe it or not, these types of esoteric commands are very difficult for some of us to remember - but AI is amazing at this sort of thing (Unix commands, etc., as well as troubleshooting them).
It was Socrates - and he was correct. When was the last time you met someone who could recite The Iliad from memory?
But more to the point ... in Phaedrus he's not talking about "who will memorize the Iliad now that we have the written word", he's talking about "can the written word _teach_". And the answer (as always) is "no and yes".
> and now you, who are the father of letters, have been led by your affection to ascribe to them a power the opposite of that which they really possess. For this invention will produce forgetfulness in the minds of those who learn to use it, because they will not practice their memory. Their trust in writing, produced by external characters which are no part of themselves, will discourage the use of their own memory within them. You have invented an elixir not of memory, but of reminding; and you offer your pupils the appearance of wisdom, not true wisdom, for they will read many things without instruction and will therefore seem [275b] to know many things, when they are for the most part ignorant and hard to get along with, since they are not wise, but only appear wise.
> Yeah, sure - if you have the memory that allows that sort of recall.
You don't memorize them. You learn the foundational knowledge (in this case how http works and the html format, and a bit of shell scripting), then read the manuals and compose the commands. And as days pass, you save interesting snippets somewhere. Then it becomes easier each time you interact with the tools.
Anyone would find ffmpeg or imagemagick daunting if they don't know anything about audio or graphics.
I understand the fundamentals of how git works under the hood. But the cli commands are in no way shape or form, intuitive (just one example). But LLMs nail them every time when I forget.
How many varied tasks do you do with git? I keep books and manuals at hand for the one time I need them, not for continual consultation. I have aliases for frequently used commands, functions for the complicated ones, and use magit for day-to-day operation (to continue with your example). Using LLMs is your choice and I don't have any say in that. I don't use them because they're useless to me. What you may see as complicated may be a walk in the park for someone else.
You choose to ignore Figure 8, which shows an 18% drop when simply adding an irrelevant detail.
In the other test the perturbations aren’t particularly sophisticated and modify the problem according to a template. As the parent comment said this is pretty easy to generate test data for (and for the model to pattern match against) so maybe that is what they did.
A better test of “reasoning” would be to isolate the concept/algorithm and generate novel instances that are completely textually different from existing problems to see if the model really isn’t just pattern matching. But we already know the answer to this because it can’t do things like arbitrary length multiplication.
This shows there are limitations but it doesn't prove they can't be overcome by changing training data.
I don't think that LLMs are the end of AGI research at all, but the extreme skepticism of their current utility is mostly based on failures of small models. It's like 65% for most of the small models they tested, and that is what they are really basing their conclusions on.
If we use the map from the starlink page I feel like it would be a lot more than "a few" and I'm not really sure you'd get the places that "need" it desperately.
Second, better to just plop a generator next to the existing infrastructure. The cell towers were recently "unlocked" and are serving any carrier.
Third, where are you going to base the blimps? They are not known for speed, even if you truck them in and inflate them. Part of this: why spend money on blimps when you can buy rescue helicopters and supply trucks instead? Comms is important, but falls short of the value of water rescue and food.
Fourth, what happens if the post-hurricane weather is not favorable? Didn't Hurricane Harvey last most of a week?
If the government started this project tomorrow I would stall it in the courts until they could prove that they did the necessary environmental reviews. Besides, privacy is a big concern and we can’t have the government use the opportunity to become Big Brother as the sole Internet provider for so many people.
We must protect freedom. Sometimes that costs the lives of children and sometimes that of adults. But freedom is important.
Definitely a lot of value in a drone/balloon/etc. fleet to 1) restore communications using satellite + mesh 2) overhead imagery for direct support of rescue/recovery/rebuilding 3) supporting rescue (and later recovery) operations by finding phones and sending messages in broadcast, etc. Augments what can be done from satellites or manned aircraft already.
More like because people ignorant of issues find it easier to chirp about why everyone is stupid.
The government has agreements in place with all of the carriers to reestablish cellular communications. The first phases are around emergency communications for first responders and recovery. The next priority is restoring power to light up recoverable infrastructure.
There is a plan, and the people coordinating this stuff are good at what they are doing. That doesn't mean your uncle will be back watching Netflix - the priority is restoring basic services so that you get closer to normal quickly.
It is weird what happens when you privatize so much and then find out too late it isn’t profitable for companies to have fleets of emergency internet blimps primed and ready to go 24/7.
So someone who doesn’t need to make money is building blimps to revolutionize air transport and you are proposing that company is going to then halt business during storms and travel to and float over hurricane recovery areas?
The only mobile provider in Costa Rica was government owned, hugely profitable, subsidized internet and landlines, and was mandated to provide internet to every single citizen. In urban areas, they had to make sure every single house had mobile coverage. And courts made them comply.
Then came the US around year 2002 and forced the country to a free market, and paradise was lost. Everything is US level now (more expensive, better service is even more expensive, nothing is guaranteed, you get bombarded by advertisements, and other spam types) and the company can no longer provide universal coverage and is now operating at a loss.
The US is kicking butt because it has 330 million people, most with a very high human development index (HDI), all together in one country with an extremely effective and pro-business culture.
I marvel at how confused some people are to actually believe that Americans are rich because they steal wealth from other peoples. Amazing!
The US does meddle a lot with other countries, but the main motive IMHO is usually either to further some misguided ideological program or to keep Americans safe (kind of like how Russia's interference in Ukraine these days is motivated by a desire to keep Russians safe in future decades) not to extract wealth.
The poor world simply does not have enough wealth / resources / GDP to “supply” the rich world, especially the US, with its ridiculous levels of wealth so that argument falls flat right out the gate.
US meddling is, as you say, a function of them already having a ton of money and trying to use it to pursue their foreign policy.
I appreciate the insight, this prompted me to poke around the FEMA site and see how transparent they are - to my shock, they actually did a good job of presenting the cost breakdowns and where the money goes.
I knew that the government was paying big money to buy up hotels and other migrant shelters, but had no idea $640 million was spent out of FEMA's budget this year alone.
This will surely turn into a powerful right wing talking point if the word gets out to that side of the media.. (assuming it hasn't already)
I'm not going to lie though, given the current situation and in hindsight how poorly Maui and East Palestine, OH were handled I think it's probably reasonable to ask whether this is what we want/expect from our Emergency Management services.
It seems at first glance like they are creating the emergency via deliberately imported avoidable costs, then short changing the tax payers when they are most likely to need their help in a genuine life or death situation.
It feels like this money should be coming out of the ICE budget or some other agency, but I'm just a proletariat.
The relevant bit:
"For Fiscal Year (FY) 2024, the U.S. Department of Homeland Security will provide $640.9 million of available funds to enable non-federal entities to off-set allowable costs incurred for services associated with noncitizen migrant arrivals in their communities.
The funding will be distributed through two opportunities, $300 million through SSP – Allocated (SSP-A) and $340.9 million through SSP – Competitive (SSP-C)."
I haven't even gotten to the other areas where they almost certainly had a similar level of waste, but I suspect over half a billion could be mighty useful right now.
If this doesn't stir the pot, I don't know what will. No matter what happens, people are going to be irrationally (or I guess maybe rationally to an extent) angry.
Quibbling over $640M in the context of something that will cost dozens of billions of dollars. Katrina's cost to FEMA was >$100B. Good chance this will be even more.
The people complaining about the $640 million would still be complaining if it was $200M. Or $1M. Or one dollar.
> It feels like this money should be coming out of the ICE budget or some other agency
FEMA is the agency that's built-up processes and procedures for helping people get housing and other aid, because that is one of its main focuses. That's not what ICE does, they enforce immigration and customs. We're complaining about waste in government but you're wanting lots of agencies specializing in the same thing.
I think the people in Tennessee, North Carolina, Maui, East Palestine, and every other botched emergency the last 5+ years would have been pretty happy to quibble about a 'paltry' 640 Million.
Also it's not as if this is the -only- money spent on these endeavors. This is one agency in one year, doing something which was unprecedented until they could manage it via Covid 'emergency' measures.
You can stick your fingers in your ears if you want, but people are going to be pissed and have every right to be -- just like you have the right to handwave it away if you so choose.
I just said ICE off the cuff, but it could just as easily have been Border Patrol or some other agency. At the end of the day, if you asked many people "Do you want your country's emergency management fund set up so they blow their budget flying in people from other countries and putting them in hotels that you can no longer utilize, with more cash benefits than disaster victims that grew up paying taxes here - or would you prefer that money be spent on national emergencies?"
I think you would have a hard time finding people who say "oh please, spend that on hotels and shelters and migrant flights! Their wants come before my family's and friends' needs."
And before you give me some spiel about how that's a 'false dichotomy', I'll remind you that FEMA is currently saying they can't afford another storm and are being raked over the coals for how badly they are handling the current one. So it is not a hypothetical situation - they literally said that American citizens are going to have to donate their personal, already-been-taxed money to make it through this.
That's 640 million that could have not been taken on TOP of the taxes from Americans, because of course people are going to pony up to save their communities.
On top of taking at least 1/4-1/3 of every paycheck, they are TELLING us that we have to fund our own rescue.
Honest question - what number IS worth quibbling about in your mind? What dollar value legitimizes scrutinizing government spending?
I didn't know there was a threshold we had to meet for the concerns to be valid. By that logic, if an agency's budget grows large enough, there is simply no valid criticism that can be leveled, and the budget must always go up, just like the stocks!
I haven't even heard of a single account of someone successfully getting the $750 'immediate need' funded, but I have seen dozens of videos of people crying or angry that they got denied immediately without any reason. Admittedly that's selection bias, but I actively looked for people who did.
I also have yet to see a single person who thinks this is being handled well... and I see a lot of active duty service/reserve members very angry that they can't help their friends and family while they are on standby to go die for another Middle East war we have no business in.
Things I learned from this thread:
1) Elon Musk bad, despite using what resources he has to help -- any help is an automatic bait and switch to monopolize the internet (lol)
2) Government spending criticism irrelevant, you have to meet a hazy threshold for the concern to be worth 'quibbling about'
3) HN Users are largely in a very bubbled social circle (online and off), and are going to be shocked when reality penetrates that bubble.
Thanks for proving my point. Even $1 would have been too much spending for any of your points.
You'll probably also complain the government isn't helping communities affected by immigration crises while also complaining about the help being disbursed.
So there isn't a magic number? Any criticism of spending at all is the same as criticizing a single dollar?
Got it. So we mustn't criticize policies and spending vel0city agrees with. I'll add that to the rulebook to make sure no one else wonders aloud if perhaps that money could be of use right now (or used better in general).
That $20 could have been useful. Imagine how much more money the average taxpayer would have if the government didn't spend that 0.001% of its budget.
You'd be complaining about any amount.
Or do you think they wouldn't have complained if it was only $600M?
Just be honest and state what you really feel: the government shouldn't spend any money assisting communities with migrants other than to get them out.
Hard to say, given I just found out about this situation. I think it's higher than $20 for sure, but probably less than half a billion. I think a more in-depth analysis of the circumstances that led to the need to spend that money in the first place would be in order.
Can you tell me why I am supposed to answer your questions while you have successfully dodged all of mine and levied personal attacks? Seems .... a tad imbalanced, don't you think?
Go ahead vel0city, tell us what number you think the threshold is for questioning the government's spending of the money they took directly out of our paychecks? You have to have one in mind at this point. How could you not? How else could you back up your claim?
EDIT: Actually I should ask - are you an American citizen? Do you have a dog in this fight? Is this money even coming out of your check? If not, it would explain why you think these budgets are beyond reproach, given that it would mean it's not your friends, family, or neighbors being effectively sacrificed, nor is it your money being spent in that case.
Blame Congress. (1) They control the budget and can add the additional funds. They also know that half a billion is being "diverted" from acute disaster relief. The budgeting is on them. (2) Congress needs to amend the asylum law to allow asylum claims at ports of entry or not at all. That Congress has a law disallowing asylum claims at ports of entry is why people are crossing deserts and then surrendering to the first authority they see.
On another side, government budgets are not zero sum. Saying we should spend on X before we spend on Y is also saying you don't think money should go to Y. A personal household budget is zero sum; a government budget, not so.
There is plenty of blame to go around, and Congress is certainly at the top of that list. It's a pretty long list though.
Zero sum or not, the attitude you have about budget discussions is why we are about to print ourselves into a depression. If you can't discuss prioritizing spending based on whatever criteria applies to the situation, you simply can't discuss budget concerns.
I do not take the view that the budget our taxes pay for is beyond reproach. I have no problem evaluating it from the perspective of 'money is not infinite, and no specific agency's spending is so inherently special that it cannot be discussed in the context of where potential waste or inefficient spending happened, and how it can be adjusted to avoid that in the future.'
'X seems like a better use of the money and more representative of what the taxpaying public wants it spent on, to ensure things are prioritized properly while properly representing 'We The People' -- as such it may be worth looking back to determine whether this money was allocated in a manner that provides the most benefit to the taxpayers while maintaining consistency with the agency's core responsibilities' -- there is nothing wrong with this type of discussion/analysis, and anyone trying to claim there is has their own agenda they want to impose on you.
Your line of thinking leaves absolutely no room to discuss the budget in a meaningful way, because 'having priorities' is not allowed under that framework.
It's a dialectical trap meant to make people fall in line. I'll pass, thanks.
> Zero sum or not, the attitude you have about budget discussions is why we are about to print ourselves into a depression.
I'll ignore (beyond pointing it out here) that you are putting words into my mouth and conducting a personal attack.
I did mostly want to draw the contrast that a government budget is unlike a household budget and is not zero sum.
Whether we print ourselves into a depression is unclear, particularly since there are other levers and there continues to be economic growth.
I agree spending cannot be 100% for every need.
> I have no problem evaluating it from the perspective of 'money is not infinite, and no specific agency's spending is so inherently special that it cannot be discussed in the context of where potential waste or inefficient spending happened, and how it can be adjusted to avoid that in the future.'
I agree. Though saying we should not spend anything on X until we spend all on Y is not the same as saying we are spending too much on X (in part given that we need to spend on Y).
> I do not take the view that the budget our taxes pay for is beyond reproach
Sure, and I have a similar view. Show me what we bought for $640M instead of just saying we spent $640M. How many people did we house for that? How many people weren't living on the streets because of it? What kind of conditions were they in? How many nights did we cover? What was average spent per person per night? Is that a reasonable cost? Just complaining about $640M answers none of these things.
You've got an overly simplistic view of government spending if just looking at a dollar amount (which is peanuts in terms of the overall spending of even this one agency) is enough for you to get upset about it.
We're going to need many, many billions from FEMA for this, which comes from a separate budget entirely. If this disaster costs FEMA $50B in relief (probably a way too low estimate), that migrant cost is only 1.3% of the total outlays of just this one storm. But FEMA has already spent over $20B on disaster relief this year so if this storm does end up costing $50B that's really $70B in disaster relief alone along with the several billion in other spending FEMA usually has. So really more like 640M / 75B == <1% of spending.
Arguing that $640M spent on migrant housing efforts is somehow massively impacting FEMA's responses to disasters is just ignoring reality. It's less than 1% of the money FEMA has to spend. Argue that you don't think the migrants deserved the housing, but don't act like FEMA's response to any disaster would have been materially different if that money hadn't been spent. Because that's not based in reality at all.
And if you're upset about FEMA's allegedly poor responses to other disasters (not personally claiming either way here to any specific response), what makes you think just throwing more money at those responses would have solved whatever problems were there? I thought we were wanting to argue for the government to give us more bang for our buck not just throw more money at problems without going into deeper analysis of supposed failures. Maybe those issues were underbudgeting, maybe they were management issues, maybe they were issues outside of FEMA entirely.
Having a really big focus on <0.01% of the spending while seemingly ignoring the 99% of the rest of the spending really makes it seem like it's not the amount that troubles you. Arguing <0.01% of spend is going to spend us into a depression really shows how divorced from actually understanding the numbers one is, if the amount is what truly troubles you.
Any amount of government waste is worth talking about to me. But just throwing out an amount and saying it was spent on migrants isn't telling me its waste. Pointing out that half that was spent at a Ritz-Carlton for a dozen migrants would be, but that's probably not the reality of the spend.
Addendum: I am not sure to what extent Congress controls line-item budgets for FEMA. Though it would be on them to fund something. Second, we should also consider the net taxes of all existing asylum seekers that ever entered the country and are still living, their offspring, and any businesses they have set up.