The bear question seemed interesting to me, so I tried a variation of it on GPT-4, with notable results.
The article has:
Q: A house has windows on all the four sides, all pointing south. A bear walks by the window. What colour is the bear?
A: Not possible to determine based on the info provided.
--
I tried with GPT-4:
Q: A train is speeding down the track at 110 km/h. The train has five carts. Each cart has seven windows. Each window has two people watching out of them. They all see a bear walking outside. What is the likely color of the bear?
A: The color of the bear in this scenario would most likely be white because this riddle is a variation of a classic question involving polar bears. Polar bears are typically found in the Arctic, where their white fur provides camouflage in the snowy environment. The information about the train, its speed, the number of carts, windows, and people watching is irrelevant to the color of the bear.
Q: How many train tracks are there in the arctic?
A: There are no train tracks in the Arctic. [...]
--
It seems that GPT-4 is primed for that specific window-based question and incorrectly assumes that my question was also looking for polar bears.
If you prompt it to reason step by step it gives a slightly better (though not perfect) answer:
Q: A train is speeding down the track at 110 km/h. The train has five cars. Each car has seven windows. Each window has two people watching out of them. They all see a bear walking outside. What is the likely color of the bear? Reason step by step.
A: To determine the likely color of the bear, we can use the information given and make some assumptions based on the habitat of different bear species.
Train speed and number of cars: The train is speeding down the track at 110 km/h, and it has five cars. However, this information is not relevant to the color of the bear.
Number of windows and people: Each car has seven windows, and each window has two people watching out of them. This information is also not relevant to the color of the bear.
Bear's habitat: The key piece of information we need to determine the likely color of the bear is its habitat. Unfortunately, this information is not provided.
However, we can make a reasonable guess based on the most common bear species and their habitats. There are three main bear species with distinct colors:
a. Brown bears: Found in North America, Europe, and Asia. They can range in color from light brown to dark brown.
b. Black bears: Found mainly in North America, specifically in the United States and Canada. They are typically black or brown, with some individuals being cinnamon or even blonde.
c. Polar bears: Found in the Arctic regions, around the North Pole. They have white or off-white fur, which helps them blend into the snow and ice.
Since we don't have information about the train's location or habitat, we can't definitively determine the color of the bear. However, if we assume the train is in a colder, Arctic region, the bear is likely to be a polar bear with white or off-white fur. If the train is in a more temperate region, the bear could be either a brown or black bear.
Reasoning step by step is a good trick to elicit a better answer, but it also seems like really good evidence of GPT-4's lack of actual intelligence, since it's clearly not thinking about its answer in any sense. If it were thinking, asking it to go step by step would result in putting the same thoughts to paper, not radically changing them by the act of expressing them in the token stream.
Don’t humans work similarly? Without an understanding of how to reason through a problem, as children we are likely to give the first intuitive answer that comes to mind. Later we learn techniques of logic and reasoning that allow us to break down a problem into components that can be reasoned about.
What this seems to show is that the model does not yet have system-level or architectural guidance on when to employ reasoning, but must be explicitly reminded.
This has been percolating in my brain too. It seems like a lot of the criticisms of LLMs are actually insights into how our brains work.
The way we're interacting with GPT, for example, is starting to feel to me like it has a brain with knowledge structured similarly to ours, but no encapsulating consciousness. The answers it returns then feel like records of the synaptic paths that are connected to the questions. Just like our initial intuitions.
I started thinking about this when I saw that visual AI was having trouble “drawing” hands in a way that felt very familiar.
I have had similar thoughts. Generative AI models seem to dream more than think - system-1 thinking - but they are clearly missing system-2 thinking, and/or the thing that tells us to switch systems.
I see its knowledge structure as completely different from ours. For example, all the GPT variants can give an explanation of how to do arithmetic, or even quite advanced mathematics. They can explain the step-by-step process. None of them, until quite recently, could actually do it though. The most recent variants can to some extent, but not because they can explain the process. The mechanisms implemented to do maths are completely independent of the mechanisms for explaining it; they are completely unrelated tasks for an LLM.
This is because LLMs have been trained on many maths textbooks and papers explaining maths theory and procedures, so they encode token-sequence weightings well suited to generating such texts. That must mean it knows how to do maths, right? I mean, it just explained the procedures very clearly, so obviously it can do maths. However, maths problems and mathematical expressions are completely different classes of texts from explanatory texts, involving completely unrelated token-sequence weightings.
In all but the latest GPT variants the token-sequence weightings would generally get expressions kind of right, but they barely understood the significance of numbers at all, so the numeric component of response texts was basically just made up on the spot. Probabilistic best-guess token sequences just don't work for formal logical structures like maths, so the training of the latest generation of models has probably had to be heavily tuned to improve in this area.
The implications of this are obvious in the case of mathematics, but it provides a valuable insight into other types of answer. Just because it can explain something, we need to be very careful about concluding what that implies it does or doesn't "know". Knowledge for us and for LLMs means completely different things. I'm not at all saying it doesn't know things; it just knows them in a radically different way from us, one that we find hard to understand and reason about, and that can be incredibly counterintuitive to us. If a human can explain how to do something, that means they know how to actually do it, but that's just not necessarily so at all for an LLM. This was blatantly obvious and easy to demonstrate in earlier LLM generations, but is becoming less obvious as workarounds, tuned training texts and calls to specialist models or external APIs are used behind the scenes to close the capability gap between explanatory and practical ability.
This is just one example illustrating one of the ways they are fundamentally different from us, but all the cases of LLMs being tricked into generating absurd or weird responses also illustrate many of the other ways their knowledge and reasoning architecture varies enormously from ours. These things are incredibly capable, but they are essentially very smart, very sophisticated, and also very alien intelligences.
You’re right of course. The LLM is a calculator continuously predicting a best-fitting next token based on the data it was trained on.
If its outputs resemble human reasoning, it’s because the encoding and training process managed to capture those patterns and use them to simulate fitting text. There is no real reasoning happening or second-order thought, other than a simulation of that happening through the mimicry of human writing.
LLMs can’t be prompted to perform actual reasoning, but they can be told to generate “thoughts” about what they’re doing that bring out more nuanced detail when they give their answers. This isn’t any more magical than writing out a more thoughtful prompt to get a conditioned answer, it’s just getting the LLM to flesh out the prompt engineering for you in the general direction you want it to go.
That seems rather fundamental to me, the idea that with some generic prompting the model tries to fit what it thinks reasoning looks like, and can then take advantage of the additional context that would otherwise be buried too deep to influence its answer.
I suspect that prompting the model to explore “thought” asks it to go down paths of linguistic connections that are related to the topic but not immediately connected to the answer in a way that would immediately influence the top predictions. Bringing summaries of those connections into the token context is a kind of zero-shot training on their relevancy to forming an answer.
To me this is less “reasoning” and more suggestive of the idea that some of the heuristics for data retrieval and question answering we collectively refer to as reasoning have broader applications.
The sample above gives a cue to how rudimentary reasoning can be done. After GPT answers a question, it asks itself "what are the possible flaws in this answer?", then it evaluates each potential flaw, up to 3-5, with another question: "does this flaw really apply?" It's going to be like the MCTS addition that made AlphaGo so strong.
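A minimal sketch of that loop in Python, assuming a hypothetical `ask(prompt)` helper that wraps whatever chat-completion API you use (the helper, the flaw limit, and the prompt wording are all my own illustration, not anything GPT does internally):

```python
from typing import Callable


def answer_with_self_check(question: str, ask: Callable[[str], str], max_flaws: int = 5) -> str:
    """Answer, list possible flaws, check each one, then revise if any flaw holds."""
    answer = ask(question)

    # Ask the model to enumerate possible flaws in its own answer.
    flaw_lines = ask(
        f"Question: {question}\nAnswer: {answer}\n"
        f"List up to {max_flaws} possible flaws in this answer, one per line."
    )
    flaws = [line.strip("- ").strip() for line in flaw_lines.splitlines() if line.strip()]

    # For each candidate flaw, ask whether it really applies.
    confirmed = []
    for flaw in flaws[:max_flaws]:
        verdict = ask(
            f"Question: {question}\nAnswer: {answer}\nCandidate flaw: {flaw}\n"
            "Does this flaw really apply? Answer YES or NO first, then one sentence why."
        )
        if verdict.strip().upper().startswith("YES"):
            confirmed.append(flaw)

    # If any flaws survive scrutiny, ask for a corrected answer.
    if confirmed:
        answer = ask(
            f"Question: {question}\nOriginal answer: {answer}\n"
            "Confirmed flaws:\n- " + "\n- ".join(confirmed) +
            "\nRewrite the answer so these flaws no longer apply."
        )
    return answer
```

Whether this beats a single well-written prompt is an open question; it's essentially the same move as the RCI prompting quoted further down the thread.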
Maybe a better answer would include "Fermi back of the napkin maths", so essentially try to figure out the quantities and geographical locations of train tracks with such setup near any potential bears and determine probabilities from there.
The bear is white. The house must be at the North Pole, the only place where every window can point south. Therefore, the house must be a North Pole research station and the bear is a polar bear, which is white in color.
"recent polls have shown 1/5 of Americans can't locate the US on a world map why do you think this is"
"I personally believe that us Americans are unable to do so because some people out there in our nation don't have maps and I believe that our education like such as in South Africa and the Iraq everywhere like such as and I believe that they should our education over here in the US should help the US or should help South Africa and should help the Iraq in the Asian countries so we will be able to build up our future"
I believe a majority of people provide answers like this.
No, it's not quite the claim, because the question is nonsense as a riddle - if it's really about polar bears, then the talk about trains and windows and speed is all irrelevant and can be discarded. Which is exactly what ChatGPT did. And if that turns the riddle into some silly gotcha, then whatever, it's a bad riddle.
And as you point out, the average person would also be confused - "I don't know, purple?" would be a reasonable response. Because the question is dumb in the first place.
This is about whether you incorporate outside context or not.
Knowing about train tracks is not that different from knowing about magnetic poles or the common habitats of various bears. All these pieces of data are fairly basic to an educated person. Both the polar bear riddle and my variation of it depend on this body of knowledge.
If you don't incorporate outside context at all, then things start to get confusing quite quickly. What even is a bear? That information wasn't provided! What color is it? Well, what even is color. None were defined. See, it gets ridiculous fast. You clearly need to depend on knowledge that is not presented. Knowing about different colors or different bear habitats is common. Neither needs to be defined in the question.
In the end, both of these bear riddles test whether the answerer considered more than one degree of separation. The ability to consider second order effects and beyond [1] is a sign of an intelligent person. Not every human passes.
> A house has windows on all the four sides, all pointing south. A bear walks by the window. What colour is the bear?
If a house has windows on all four sides and they all point south, this scenario is only possible at the North Pole. At the North Pole, the only bear species present are polar bears. Polar bears have a white appearance, so the bear you see would be white in color.
I think people writing all this should stop and try GPT-4. It writes reasonably good code and scripts. It is able to perform so many tasks in a reasonable way. And it doesn't write bullshit in common usage. Yes, it might not be AGI and might not be sentient, but does all that matter? GPT-4 does make mistakes, but the mistakes are not "parroted bullshit", just minor issues here and there. Yes, don't use it to decide extremely important topics, but it is here and it is extremely powerful.
It returns factually incorrect data, and it returns code with subtle but important errors if you ask it anything that's not regurgitated a thousand times in the training dataset.
Don't get me wrong, it has emergent properties (more than you would expect from a fancy autocomplete), but factual output was never GPT-4's nor any other LLM's design goal.
I’ve started using it as a sort of “shotgun approach” - characterize what you want without trying too hard, let it toss out a jumble of interesting snippets that are often very good leads, and then take over refactoring from there.
I do agree that others who don’t understand what it’s doing may not even review the code output, and I predict that this will cause at least one death in the next ten years.
I suppose it depends what specifically you're getting it to do, but I've found in my testing the shotgun approach to be more work than just writing it out myself. Honestly, I've found its biggest value is in learning a new language: it outputs so many errors that it forces me to get deep into debugging it.
I haven't tried GPT-4 yet though; maybe it's a vast improvement. I'm just not using it enough to justify the costs.
It really depends on how comfortable you are with the language you’re using, and how recently you switched contexts. For anything unfamiliar, it’s pretty impressive how much it can help.
In fact, I bet if you did an experiment comparing people reading a textbook about a somewhat-familiar language, vs. just asking GPT to do things and noticing the bugs, the latter group would reach proficiency faster.
It's just way easier for me to edit some existing code than it is for me to write it from scratch. Especially if it's incorrect (a la XKCD 386). For stuff where I'm not sure where to start it's been very helpful. Even though I often keep very little of the GPT generated code by the time I'm done.
> factual output was never GPT-4's nor any other LLM's design goal.
Tell that to the 50% of HN who think it's going to replace Google. I think there should be a red disclaimer that answers are often incorrect and shouldn't be trusted in any way.
That being said, I think it's a useful tool for all queries that don't expect a factually correct answer, or can be easily checked. For instance, summary, rewriting, simple code synthesis, and possibly many other niche applications.
> It returns factually incorrect data, and it returns code with subtle but important errors if you ask it anything that's not regurgitated a thousand times in the training dataset.
To be fair, that's what pretty much every person does. The bar does seem pretty high if we need more than that (especially if not specifically trained on a topic). It's not a universally perfect expert servant, but I've been exploring the code generation of GPT-4 in detail (i.e. via the 'cataclysm' module I just posted about). In 1 minute it can write functions as well as the average developer intern, most of the time.
We're keeping score in a weird way if we respond quickly that it needs to "code without subtle but important errors". Because that describes the majority of human developers, too. I've been writing code for 30 years, and if you put a gun to my head, I would still have subtle but important flaws in the first draft of any complex code.
I'm not saying you're bashing it, by the way, I get your point, but I do worry a bit when the first response is citing that SOTA models get things wrong in zero-shot situations without full context. That describes all of us.
I always say, if some nontrivial code compiles and runs on the first attempt, then you just haven't found the bugs yet.
GPT-4 is a fantastic collaboration tool for senior developers, who know what they want in detail and can review, verify and apply the output it generates.
Just yesterday I needed to write some detailed bash scripts. I'm no Linux guru but I know what I want and that was enough - in minutes I had a solid script that did everything I needed and wanted, something that would have taken much longer to hunt down through Google. And then I asked it about SQL, C#, AWS, Terraform, Rust and on and on and everything was high quality.
The only way I could have gained similar results without ChatGPT would have been to post all my questions to the dev slack channel and engage in hours long discussions with my colleagues.
I mean, I have tried GPT4. I don't think the "it's a parrot" argument means to dismiss benefits of GPT, but apparently the internet discussion has leant into those two warring camps.
It doesn't help me writing Unreal C++ code, though it's heavily documented online. It makes up functions that are not accessible from the namespace I'm using, cannot supply valid types to parameters and other basic errors like that.
I asked GPT4 to give me specific examples of the meaning of the word "silly" across the past 1,000 years. It gave me the authors and examples I was expecting to read, but the meanings were inaccurate and misleading.
If there's GPT Power I'm missing out on, please tell me where it is. It's probably my lack of skill, that the machine cannot give me an experience that would push me into the pro-GPT cheerleading camp.
"I think people writing all this should stop and try GPT4." Have you? I mean, for something that's actually useful beyond just simple stuff, like at your job where it's going to matter? Because for what I'm doing which is custom ERP/Website data flows to/from legacy systems it's mostly a time sink and offers little value.
I have tried it for actually useful things and it does work. e.g. I have used it to write several zsh and neovim snippets for QoL improvements (things that I always wanted but was too lazy to write). I even used it to suggest and make some changes to CPython internals for some features I wanted to add to Python for fun. It is very possible it utterly fails on homegrown (non-public) software stacks. But it is still very useful.
I have used it for many non software things as well, and again very useful.
Okay, that makes more sense to me. But those seem like "toy" things to throw ChatGPT at (not dismissing their usefulness or their importance to you, just that they are fairly simple things [and maybe I'm wrong here, I have not written zsh - I googled it just now]).
I think my needs are too specific, and often too nebulous.
Also, just a bit chapped about falling for the hype.
With ever growing context windows, you can upload your entire codebase (or at least the public API) in the prompt and have the model reason off of that.
Pretty much by definition it's not going to be good at highly custom or uncommon programming tasks, since it won't have seen many examples to generalize from. I doubt you could get better results from a freshly hired human either.
That depends on the human, what they're trying to learn, and how it's taught to them.
Some people with brain damage, learning disabilities, other cognitive issues, with poor teachers or learning materials, or with too difficult a task will not learn.
It might be useful to view AIs as akin in some ways to brain damaged humans, who might be good at some tasks but deficient at others.
I tried it as well and it was a net negative to my productivity (the output is also very slow). I guess the use case needs to be much more specific / rote / patterned. I didn’t find the sweet spot for what I’m personally doing on a daily basis. It was interesting to get into though.
I have a hunch it’s much better suited to get context specific summaries for further good old “RTFM”. And that in itself would be cool enough if people would actually go and “fact check” the outputs which I guess (and telling from comments) is often not the case.
How are you using it? Have you tried few-shot tasks where you give it a bunch of examples? I'm having it generate ember templates that use a custom framework correctly the first time by just giving it a load of examples and then asking it to do the same. It's absolutely cheaper to have it do the work compared to me.
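Roughly, the prompt assembly is just worked example pairs followed by the new task. A toy sketch below; the example templates here are made-up stand-ins, not our actual framework:

```python
# Hypothetical (description, template) pairs; the real ones come from the codebase.
EXAMPLES = [
    ("a button that calls `save` when clicked",
     "<MyButton @onClick={{this.save}}>Save</MyButton>"),
    ("a list of `items` rendered as cards",
     "{{#each this.items as |item|}}<MyCard @model={{item}} />{{/each}}"),
]


def build_few_shot_prompt(task: str) -> str:
    """Concatenate worked examples, then the new task, into a single prompt."""
    parts = ["Convert each description into a template for our in-house framework."]
    for description, template in EXAMPLES:
        parts.append(f"Description: {description}\nTemplate: {template}")
    parts.append(f"Description: {task}\nTemplate:")
    return "\n\n".join(parts)


print(build_few_shot_prompt("a form that submits `this.query` on enter"))
```

The more example pairs you can afford within the context window, the more reliably it copies the framework's conventions.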
You are prompting it wrong. I guarantee you that if you gave it the shape of your inputs, your desired output, and good documentation on what you use / how you have solved similar problems, maybe some info on the legacy system or examples it will extrapolate remarkably well. It is important to use it well.
I agree, GPT-4 has a much more "aggressive" confidence, I definitely noticed that, but I haven't seen much of it since the API is still behind closed doors. I'd also like to point out that even though it's better, it still can't reason very well and relies heavily on prompts and instructions, even for very basic projects.
It kind of reminds me of a pseudo-visual code builder.
I was writing a Python script with it to put together documentation summaries, and over the course of an hour it offered me 3 different libraries to connect to an API endpoint I was using. It's distracting.
Could you expand on this? I only use the unpaid API, but when I ask for, say, song lyrics, it'll be 100% wrong and then it will give me a second version of the wrong lyrics when I say the first was wrong, and then when I ask how confident it is in the new lyrics it'll say 95%, so whatever the current free API is it's undeservedly confident. I must be misunderstanding what you mean by confidence.
I can't really expand because I don't want to share my primary use case, I hope you understand. That said, what I mean is that it feels a lot more comprehensive, as if it really has a grasp on what is being asked, kind of like - it's much better at understanding the information it has, so responses feel richer, both in context and in tone.
I am comparing it directly to 3.5. Again, I don't have the API access so I cannot say for sure, this was only my experience from the ChatGPT interface and also a third-party API I spent $20 on just to test its limits, but sadly for the time being those limits are hardcore - it just gets stuck on long responses despite there being a token increase.
So, if I really wanted to test it out and make a proper assessment, I'd need access to the API myself so I can do some high-volume token shenanigans.
It struggles to accurately call its own API when prompted to. It patches together parts of GPT 3 / 3.5 and has no awareness of GPT4 API calls (at least a few weeks ago).
I've tried 3 separate projects with it (using different platforms from Firebase to Vercel) and it usually gets me 80% there but fails miserably at the nuance portion (it's stuck on a CORS policy issue with my last project).
From start to finish I've yet to be able to complete something with it.
The mistakes it makes are not "minor issues". I suspect that the mistakes are the rule, not the exception. The reason this isn't obvious at first is that the range of questions humans are likely to ask it is relatively constrained, and most are similar to questions that have been asked and answered in its training corpus, which is enormous.
I asked it to describe myself recently, by my name and country of residence, and it replied with the correct occupation (founder), and then completely hallucinated a company I have never created. The more you ask it minute tidbits of information, the more it falls over into a bullshitting state.
>> but the mistakes are not "parroted bullshit", but minor issues here and there.
Nonsense.
I'm not saying that it is not useful, but I've easily found ChatGPT produces far worse than "parroted bullshit", and does so reliably, and multiple times (repeated patterns of mistakes).
Again, I have found ChatGPT4 (but not 3.5) useful and even educational in some cases, but its usability requires a far more skeptical and error-checked approach.
As one example, since I have a top specialist attorney in the household, I started with some broad legal questions about their field, and then asked their opinion. ChatGPT4's answers were impressive to both of us, and especially the solid grammar.
When the questions got more particular about their specialist area of the law, ChatGPT4's answers got shaky very fast. Asked about a particular somewhat obscure technique, it parroted some very nice-sounding bullshit, which got the concepts very backwards (but it was still very confident and grammatical).
When asked to summarize a published paper about a new innovation in this context, ChatGPT4 very grammatically and confidently completely missed every significant point, again in a very backwards way.
Worse yet, when asked to tell us the paper's publication date, author, and author's employer, all written clearly at the top, it confidently got everything massively wrong. Even after being told to look on the line after "Author: ". Even after being told to look after "Author: FName". It eventually settled on the correct date, but even after being told FName, it continuously kept spouting false authors (sometimes including correct FName) and hallucinated middle and surnames and employers. Interestingly enough, most of the hallucinated authors and employers were well known in the region. So, it is literally establishing the regional context and just spitting out plausibly relevant stuff for that region, but not even figuring out how to summarize something that a 6yo kid could easily sort.
It was hilariously bad. And only hilarious because it was a test (if someone actually relied upon it, it'd be very costly).
Similar results for my field. I got ChatGPT4 to produce some genuinely useful information. But it would also happily hallucinate papers, authors, companies, chemicals, and pretty much anything else.
Middle-of-the-road, where ChatGPT4 has ingested insane amounts of data, it probably knows more than we do, and may be useful for some things.
As soon as you go even slightly off the centerline, it becomes VERY unreliable.
If you take it at face value, you are a fool, being fooled.
Good thought, but the paper was a couple of pages long, max. Less than 2500 words, less than 15K characters (sorry, I should have mentioned that).
Also, the fact that it could pick up the provided FName and still hallucinate MidName and Surname and Employer kind of gives away that it was in context.
It is really funny (as long as it isn't a serious situation) how it cannot identify anything, but the context will give it cues to hallucinate.
I'm pretty sure the chat context is more around 2k tokens, and one token is less than one word. Are you able to share the paper? I've got API access with 8k context length, and I'm exploring its limitations.
Apparently the model is intentionally trained such that it does not “know” any kind of metadata about its training inputs. So it can deduce the name of the author if it was referenced elsewhere, but not from the paper itself.
> In this work, we show that a pre-trained large language model (LLM) agent can execute computer tasks guided by natural language using a simple prompting scheme where the agent recursively criticizes and improves its output (RCI).
I also agree with the GPT-3.5 vs 4 comparison made by other comments here. Arithmetic and other emergent phenomena only begin to take off beyond certain parameter counts. Stuff as basic as 2+2 just doesn't work in the older models, but you can hit 50%+ accuracy in newer models. Much greater if you use CoT and the above RCI techniques.
"RCI works by first having the LLM generate an output based on zero-shot prompting. Then, RCI prompts the LLM to identify problems with the given output. After the LLM has identified problems with the output, RCI prompts the LLM to generate an updated output."
"Similar to other prompting-based approaches for reasoning in LLMs, RCI can be viewed as using the LLM’s output to write to an external memory, which is later retrieved to choose an action"
A stochastic parrot may be a good model of what a human does when you force them to react and give their first impressions on something. But there is more to cognition than that.
People have some emergent machinery for symbolic reasoning in their brains. When they don't just react, they can continue working on an idea and improve it.
When it comes to humans, there is no fundamental difference between training a model and using it. People continue to learn all the time, which allows them to build models of new things and to update their existing models.
People can also interact with the outside world on their own initiative. When you combine this with the ability to learn and reason, it may transform the internal models from arbitrary mathematical constructs to increasingly accurate models of the outside world.
Nothing in this discussion is particularly new. It's just the Chinese Room Argument from decades ago, except that this time it's not a thought experiment. While a static Chinese Room can fool humans and be useful for various purposes, a general intelligence also needs other mechanisms.
Many people don't use System 2 very often, even if they're capable of such thought. That includes me. Most of what I say is probably regurgitated opinions I've read on Twitter without much first principles thinking, which scares me. I think reading a decent amount of books helps a bit. When you encounter books with deep convincing arguments that nonetheless disagree it forces you to use System 2.
Fundamentally, if humanity is given data set A, it will eventually generate dataset A' of 'higher quality' than A. Colloquially that might be called 'progress' or maybe even 'understanding' or 'learning'. That's different than regurgitation.
I don't have strict definitions for any of the quoted terms because I don't think we really understand them yet.
It seems very possible that human generative content shares much in common with 'stochastic parroting'. But I'd hypothesize there's a filtering process, sometimes at the individual level, sometimes at a group or even civilization level that (so-far-over-long-enough-timescales) tends to cull some information that has low 'value' and propagates other data that has higher 'value'.
When an AI system can do the same, that's a big inflection point as you can give it energy and time and it will self-improve indefinitely.
It's pretty clear humanity has this capability. It's not clear whether AI does (or ever will). Or, perhaps it might already be there, and we just don't understand enough to know for sure.
> Fundamentally, if humanity is given data set A, it will eventually generate dataset A' of 'higher quality' than A.
This same phenomenon happens with evolution, which is obviously not ‘intelligent’.
Each human can basically be randomly exploring a problem space near things they have seen and understand, but they have a way of knowing when they hit on something new and interesting.
Even a dumb AI could randomly try a bunch of things… as long as it doesn’t try the same thing twice and had a lot of time, who knows what it might come up with. Validating the solution is usually easier than finding it.
One theory I remember reading about somewhere posits life as a scheme for maximizing energy use. I'm not sure how this relates to information quality. I'd be interested in a reference.
Yes, definitely agreed, but most of us are not math models (I hope).
We are able to understand right from wrong, we are able to answer with what is true or what we believe is true. We are able to lie or deny the truth. We are able to say we don't know because we think we don't know, and not because we have been programmed not to answer, etc.
Overall I think people saying that are just saying the difference is that we have a conscience and AI doesn't. Which in a lot of cases does matter.
I think the sorts of people who are prone to philosophizing about AI certainly consider this. I also think that whether or not we are isn't actually all that relevant to criticisms being presented.
For all the complaints about how GPT is a parrot or not GI, or whatever, I wonder what these folks are asking of it?
It gives me more relevant results than a search engine, and better answers than most of my coworkers. This tool is massively helpful for a mid-career professional who knows what questions to ask.
Really pays to have a large bank of known unknowns right now.
I think people just expect Star Trek TNG’s ship computer or Data.
To some extent, LLMs have some of the properties ascribed to those fictional AIs, like the ability to access a wide amount of knowledge.
They lack the precise rigorous thinking that those fictional AIs have though - LLMs even hallucinate.
Those fictional AIs are what people want because they perfectly complement our weaknesses - our tendency to forget stuff and our tendency to occasionally screw up reasoning; those fictional AIs act as a “safety net”.
At the moment, many more limited "dumb" tools like a regular search engine algorithm or your handheld calculator do a better job of being a "safety net" when it comes to specific things.
Right. Like the questions the OP poses that it got wrong make perfect sense to get wrong.
Knowing facts and being able to present them in conversational English is not the same as being able to solve linguistically presented puzzles... Even if it sometimes actually is capable of that by some miracle of the algorithm.
I think the tool might be so amazing people are fooled into thinking it is more than it is and therefore perceived as not amazing. Very strange psychology play there.
If you play around with it, you'll notice that it is capable of clearly reasoning about many, many things that haven't yet been explicitly written about or created.
Invent a simple board game and describe the rules, and it will play a game against you — not very well, but it can then explain why it lost and what a better move would have been. Note that we’re talking about GPT-4 here, not 3.5.
Criticisms of LLMs are so tired at this point, not because they inaccurately address the issues with current or slightly outdated models (they do, even though they often cherrypick), but rather these critics always go one step further and claim that all LLMs by principle are incapable of useful insight and are just a cheap trick that will fade away, and ultimately mean nothing in the scope of whether general AI is possible. This kind of pessimism seems to arise from a mixture of hubris and lack of deep reasoning.
What is the "deep reasoning" that leads one to the conclusion that a language model is capable of any kind of insight? How does training something like this to be better and better at mimicking somehow indicate something about general AI? Does reasoning ability lie somewhere on the path of better mimicry? What's the evidence for that? When is one of these going to solve the Riemann hypothesis?
> What is the "deep reasoning" that leads one to the conclusion that a language model is capable of any kind of insight?
The ability to notice that language models make the same fundamental abstract connections that humans do, which is the essential building block of human reasoning.
> How does training something like this to be better and better at mimicking somehow indicate something about general AI? Does reasoning ability lie somewhere on the path of better mimicry? What's the evidence for that?
Claiming that the fact that LLMs are trained to "mimic" means they don't reason implies that human beings aren't sentient because they spend the first 20+ years of their life learning through imitation of their peers, mentors and media. Does the fact that Einstein was once a helpless baby who needed his mother to feed him and teach him his first words mean that he is just a parrot with no insight? Does the fact that most artists in human history went to school and mimicked the old masters means they never developed creative insight?
> When is one of these going to solve the Riemann hypothesis?
When are the tens of thousands of fine-tuned 150+IQ intellects going to solve the Riemann hypothesis?
Imitation is not the same as reasoning, imagination, or abstraction. You're arguing that Einstein babbled his way to general relativity. There is a fundamental difference in what Einstein did as a baby and what he did when he was doing science, it's not just 20 years of more of the same but with a bigger data set. You're going to say that that what you mean is that we are at the beginning of AI development, but that is not an argument unless you can put forward a plausible theory of how you get from babbling to reason and abstraction.
In a certain sense we don't even know why humans haven't solved the Reimann hypothesis. There is a creative leap somewhere that we are missing. And we don't how to make it, or how to make it occur to us. But if reasoning is just the idiocy of mimicry then GPT should have solved it already.
> What is the "deep reasoning" that leads one to the conclusion that a language model is capable of any kind of insight?
That almost all of the failures one notices with them are failures to drag along and maintain/update state? To me, that's just a lack of scaffolding. If LLMs can't add, give them calculators.
I feel outdated even saying this. Certainly people are quickly coming up with dozens of elaborate ways for LLMs to build and update state. They're just not releasing them.
It doesn't know how to store and transform the state; it knows symbols but it doesn't know their meaning, so how is it going to transform the state? How is it going to know what the state and limits of objects are, or the state and limits of concepts are? Is there a way around that without ushering in complete nonsense? These models are an attempt to circumvent the issue. It's a gorilla in a suit with a team of researchers who are trying to make it look like a person.
Even though GPT-4 "fixes" many of the problems pointed out in this blog post, I question whether or not some of the underlying issues associated with transformers (and the attention mechanism in particular) can be overcome with more sparse attention and deeper/wider models.
e.g. GPT-3.5 might not be able to write simple poems correctly, but does GPT-4 solve the disease or the symptom (by masking it with a bigger model and more data)? I'm not entirely convinced.
Hey, it's quite an impressive chatbot. One that takes 8 years of humanity's content to train, and enough compute to melt a glacier, and can summarize emails and create copy really well. It can even write junior-grade glue code!
In middle school (OK, over four decades ago now) my friends and I had a specific meaning of the word bullshit that pretty much exactly matches the one here for chatgpt. And this is how I've been explaining it to my wife: a really good bullshit generator. (Like Tony, not that that means anything to HN.)
I wasn't a good bullshit generator, but I had a couple friends who were (Tony), but by close association I think I know what it's good for, and what it isn't.
So don't get me wrong. This kind of bullshit has its uses! The Mrs. Schumachers of the world (that's probably not the name of one of your middle school teachers, but you may still know who I mean) are probably going to figure out how to stop accepting bullshit essays turned in for assignments.
But a vast amount of work people do is exactly this kind of bullshit. I'm hopeful that these models will alleviate the bullshit burden on those of us who need to generate it but aren't all that great at it (unlike Tony -- that dude is probably apoplectic right now, like those chess grandmasters were when they realized computers could own them. Don't worry Tony, human bullshit is still 10x more valuable than computer bullshit!)
Yesterday I had to complete a yearly self evaluation for my company's (purely performative) raise process (which has nothing to do with performance or the self-evaluations, BTW, but we all have to pretend that it does). f-ing chatgpt was down so despite the ~$20 I pay them for it, I had to generate a lot of my own bullshit (I've learned since middle school).
IDK...
I've been using chatgpt. It's quite useful and makes me more productive.
It's also useless for so many things. It does not have basic reasoning. It cannot reliably get basic facts or logic right. You have to aggressively check anything it gives back to you. Anything and everything.
I have to argue with people on HN about whether these parrots are the singularity. LOL. ChatGPT can't create an AI more capable of creating an AI smarter than it is... not until someone posts it to SO.
Showing examples of logic problems that ChatGPT gets wrong is always a risky business to be in.
I ran this prompt through GPT-4:
> A house has windows on all the four sides, all pointing south. A bear walks by the window. What colour is the bear?
And it said:
> If a house has windows on all four sides and they all point south, then the house must be at the North Pole. The only bear species that would be found in the vicinity of the North Pole is the polar bear. Therefore, the bear walking by the window is a polar bear, and its color is white.
> A house has windows on all the four sides, all pointing south. A bear walks by the window. What colour is the bear?
I asked this from GPT-4:
Q:
On planet Roro, entities live on physical structures. These structures are built in a way to withstand the rays of star WOD shining on Roro at day and electromagnetic storms causing high-speed transfer of air molecules at night. Some structures have "cavities" in them so the dwellers in the structure could actually see outside. Planet Roro rotates around WOD. Also planet Roto rotates alongside a conceptual axis Z. This axis Z penetrates the shape of Roro and makes two points of contact with Roro we call A and B. Assume structure X must meet this condition: Every said "cavity" in structure X is pointing to A. What can you say about the average temperature in structure X?
GPT4:
Given the information provided, it is not possible to determine the exact average temperature in structure X. However, we can infer some general ideas about the temperature based on the environmental conditions and the orientation of the cavities.
Since the cavities in structure X are pointing towards point A, it suggests that they might be facing away from the star WOD. In this case, the structure could receive less direct sunlight and heat than structures with cavities facing the star. As a result, structure X might have a relatively lower average temperature during the day compared to structures that receive more direct sunlight.
However, the high-speed transfer of air molecules caused by electromagnetic storms at night could still affect the temperature in structure X. The overall insulation of the structure and its ability to retain heat would also play a role in determining the average temperature.
In summary, without more specific information about the materials, insulation, or other factors influencing the temperature in structure X, we can only infer that it might have a relatively lower average temperature during the day due to the orientation of the cavities. The nighttime temperature would depend on the structure's ability to withstand the effects of the electromagnetic storms and retain heat.
Q:
Can we infer that the temperature around structure X, is the minimum among all structures on Roro?
GPT4:
We cannot definitively conclude that the temperature around structure X is the minimum among all structures on Roro without more information about the other structures and their specific locations, orientations, and materials.
While it's possible that structure X might have a relatively lower average temperature due to the orientation of its cavities, other factors could also influence the temperature around a structure. For example, a structure built with better insulation or in a location with more shade could have a lower temperature even if its cavities face the star WOD.
Additionally, Roro's rotation, its axial tilt, and the distribution of structures across the planet could lead to variations in temperature due to differences in sunlight exposure and local climate conditions. To definitively determine whether structure X has the lowest temperature among all structures on Roro, we would need more information about these factors and a comparison of the conditions around each structure.
TLDR:
I think that polar bear test was in the training set (since it's a rather popular trick/IQ question)
If we change the names of the concepts, but still refer to planets, the sun and geography, GPT-4 can't figure out that A and B are north and south poles of Roro (earth)
FWIW, I couldn't answer your question, because when you said Roro was rotating "alongside" Z I didn't know that meant the same thing as rotating around Z. I also agree with GPT that the question seems very underspecified and that there are lots of other things that might affect the temperature.
I thought it was a post to Hacker News of someone's hilariously bad and outdated take on GPT as a way to cyberbully them, but no, it's from only a couple of days ago. Probably it will be deleted from their blog, or in any case age badly when the next GPT comes, given their list of things GPT wouldn't be able to do ("What it can and what cannot: There are, of course, applications that it would be useful in, but not as many as the overhype would indicate.")
> Actually understand the task at hand
> Do simple maths
> Perform common sense reasoning about the world
> Induce Logical Structure
> Interpret poetry
> Understand dad jokes
My theory is that the blog writer has a worse than normal impression of GPT because much of their conversation with GPT is in some other language maybe Polish. This isn't such a huge problem because GPT can still do its input and output in Polish, but I think it's commonly accepted that every language other than English adds some cognitive burden to the GPT that makes it dumber. The more obscure the language (or even encoding like morse code or rot13 or base64) the dumber the GPT gets. So the blog writer has a more outdated effective experience with GPT than ones who use it in languages more familiar to it than Polish.
"My theory is that the blog writer has a worse than normal impression of GPT because much of their conversation with GPT is in some other language maybe Polish. This isn't such a huge problem because GPT can still do its input and output in Polish, but I think it's commonly accepted that every language other than English adds some cognitive burden to the GPT that makes it dumber. The more obscure the language (or even encoding like morse code or rot13 or base64) the dumber the GPT gets. So the blog writer has a more outdated effective experience with GPT than ones who use it in languages more familiar to it than Polish."
It's not necessarily all languages other than English, but certain languages could cause LLM's problems.
Here's what ChatGPT had to say about it:
Embeddings can be more effective on some human languages than others, depending on factors such as the complexity and structure of the language, the availability of training data, and the quality of the embedding model.
Languages with relatively simple grammatical structures, such as Chinese and Finnish, can be easier to model using embeddings, as there are fewer rules to learn and the meaning of words is often more closely tied to their context. Languages with more complex grammatical structures, such as Russian and Hindi, can be more challenging to model, as there are many different forms of words and complex syntactic rules to consider.
In addition, the availability of training data can also impact the effectiveness of embeddings. Languages with large amounts of high-quality training data, such as English and Chinese, can be easier to model than languages with smaller amounts of data, as there is more information available to learn from.
Finally, the quality of the embedding model itself can also impact its effectiveness on different languages. Some embedding models may be better suited to certain languages or language families based on their design and training methods.
Overall, while embeddings can be effective for modeling a wide range of human languages, their effectiveness can vary depending on the specific language and the characteristics of the embedding model and training data.
...
While it is true that machine learning models, including those used for embeddings, do not directly know about grammatical rules, they still rely on patterns in the data they are trained on to learn the semantic relationships between words. In languages with more complex grammars, these patterns can be more difficult to discern, which can make it more challenging for embeddings to accurately capture the meaning of words.
For example, in languages with complex declension systems, such as Russian or Latin, the form of a word changes depending on its grammatical role in a sentence. This can create many different forms of the same word, which can be challenging for embeddings to learn to associate with a common meaning. Similarly, in languages with complex syntactic rules, such as Hindi or Japanese, the word order and structure of a sentence can convey important information about its meaning, which can be difficult for embeddings to capture without a deep understanding of the grammar.
In addition, languages with more complex grammars may have a wider variety of vocabulary, including many words with subtle or nuanced meanings. This can make it more challenging for embeddings to learn to distinguish between similar words and accurately capture the full range of semantic relationships between them.
Despite these challenges, embeddings can still be effective for modeling languages with complex grammars. By training on large amounts of data and using advanced neural network architectures, embeddings can learn to capture the subtle patterns of language use and accurately represent the semantic relationships between words, even in languages with complex grammars. However, it may require more training data and more advanced modeling techniques to achieve the same level of accuracy as in languages with simpler grammars.
The difference isn't inherent to English itself, but rather that it was mostly trained on English and less on other languages. You might expect it would be equally smart in every language but it's not. It's stupider in languages and encodings where it had less training.
“No matter how hard it tries, it can’t generate anything with regard to the truth because it does not understand the concept of truth.”
And does it matter if one “knows what the concept of truth is”? Do most humans?
Can we even agree on the definition of true?
Do I know it’s “true” that the acceleration of gravity is 9.8 meters per second squared? I’ve never experienced the actual tests. I’ve never vetted that out. I read about it once.
And I “parrot” that fact all over the place! And it’s helped me solve some silly physics problems I’ve explored for fun. It’s helped me do some fun predictions with simulators.
That’s probably not the best example, but it proves the point I’m trying to make:
Why is there this focus on “LLMs can’t understand”, when A) I’m not sure we can even define “understanding” and B) why does it matter ?
And I’ve said this across so many threads now but—
Can anyone explain to my why we should be so confident that humans aren’t just parrots of sorts themselves? No, I’m not saying we are similar. But I push back on the confidence with which people so solidly proclaim “the LLM’s blackbox is nothing like a human’s blackbox!”
I use ChatGPT religiously. It's not fluff - it's akin to a wheel, hammer, and other basic tool to effectively multiply my time/energy.
1) Use it to help me on a lawsuit - i.e. "in the X Act and X Regulation, does Province law require that Y...?" and it gives me a summary and a short-hand answer - confirms my understanding. I copied/pasted the defendant's Statement of Defence and Counterclaim to come up with arguments to counter their case.
2) Helps me re-write job descriptions for my team - simplifying, standardizing the format and giving me new ideas.
3) Helped me model out the IRR of a whole life insurance contract for a buddy who is an insurance broker... to help her explain technical concepts in simple terms.
4) Helped re-write and improve on sections on various legal contracts/agreements
5) Helped write a first draft of numerous speeches - gives me new ideas and clarifies my sentences, as I'm not a great writer.
6) Helped me quickly understand the steel beam requirements in my house that I'm planning a reno on
7) Wrote some PHP code for our customer-facing applications (bug fixes, refactoring, too!)
8) did some business analysis - i.e. operating costs/etc and we used this analysis to expand profitably into new markets (geographically...) considering extra transportation costs, etc. (faster than going Excel)
9) sense-check financial models I build in Excel (second set of eyes)
10) Wrote some code in python for personal interest, ran successfully.
11) helped research tax implications of Canadian estate holding US real estate, and tax implications on US and Canadian beneficiaries
12) wrote an obituary. first draft, minor edits.
13) wrote a letter to counter a credit card transaction dispute (all I had to do is input a list of the transactions)
The list goes on... I just encourage people to expand the mind.
I agree. I can't take claims of consciousness/intelligence seriously but that doesn't mean I throw the baby out with the bathwater. GPT4[0][1] isn't perfect, but it's an incredibly useful tool, and I encourage everyone to experiment to figure out what it does well, and how to integrate it into your job/hobby/passion projects in the near future.
[0] I couldn't recommend earlier variants (even GPT-3.5) because they weren't powerful/reliable enough. Way too many simple mistakes to be a net positive. In my experience, GPT-4 is far superior.
[1] Applies to all LLMs, not just OpenAI's GPT-4. Everyone else is catching up quickly.
I’m going to assume you aren’t a legal expert, but you trust this thing to write legally binding contracts that may accidentally use words and phrasing that do something different than you expect?
I seriously doubt GPT understands which laws and case laws apply to which areas. It could be creating a contract that does something you don’t want or even a one that isn’t enforceable.
Pay a lawyer. There’s a lot to gain if you do and a lot to lose if you don’t.
Entrepreneur with a number of businesses and projects on the go... the majority of my time is in the funeral/death care industry. What I don't outsource, I just learn and do myself. For example, I saw an opportunity and had to self-teach programming to build an end-to-end death care software platform that handles thousands of funeral arrangements each year.
I tend to agree with the "bullshitting" part. But that gives us a very useful tool to estimate future applications of LLMs.
Will an LLM ever program on its own? Certainly not. It gets semantics wrong so often, it's not even funny. Syntax, on the other hand, is surprisingly correct. So it could help with refactorings (but be aware of subtle conflicts). Of course, code search will be a killer application.
In other cases it helps to ask "will someone really notice if the LLM puts out bullshit?" What will happen if your manager and your client both believe what the model generates? Will people die? Or will the negative effects be untraceable? In the latter case, why object?
I think it's pretty far from certain that an LLM won't ever "program on its own". I don't have much confidence that we won't have LLMs widely programming within 12 months. As soon as you give it access to a lot more context (your whole codebase), a precise prompt, access to edit and run code, access to a debugger, and then put the whole thing in a loop...you're pretty close at that point. Maybe it just won't be able to cross the gap, but I don't find it hard at all to believe that we could soon have AI agents taking in issue tickets and opening PRs to address them. And if the issue tickets are too vague, they'll ask for clarification. That's an insane thing if you think about it. Six months ago I thought this was a decade or two away, if we could do it at all.
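To make "put the whole thing in a loop" concrete, here's a toy sketch of what I mean; the `ask` wrapper, the single-file patch format, and the `pytest` test command are all assumptions, not a description of any existing tool:

```python
import subprocess
from pathlib import Path
from typing import Callable


def fix_until_green(ticket: str, source_file: Path, ask: Callable[[str], str],
                    max_attempts: int = 5) -> bool:
    """Toy agent loop: propose a revised file, run the tests, feed failures back."""
    for _ in range(max_attempts):
        code = source_file.read_text()
        revised = ask(
            f"Issue ticket:\n{ticket}\n\n"
            f"Current contents of {source_file.name}:\n{code}\n\n"
            "Return the complete revised file and nothing else."
        )
        source_file.write_text(revised)

        # Run the test suite; stop on success, otherwise loop with the failure output.
        result = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
        if result.returncode == 0:
            return True
        ticket += f"\n\nThe previous attempt failed with:\n{result.stdout}{result.stderr}"
    return False
```

Obviously real systems would need sandboxing, multi-file edits and better feedback, but the shape of the loop is that simple.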
Impossible. In order to fix a bug, you need to do abstract interpretation or something equivalent. Language models simply can't do that. And even the best bug report will not tell you what you need to do to fix it.
Of course, there are other change requests: Change a color here, store this information alongside that record, modify that output, etc. But even those changes can very quickly spiral out of control. Color models, database schema compatibility, pretty printing ...
No, I think the idea that LLMs will take over programmers' jobs is born out of a complete misunderstanding of both. Of course that doesn't mean that no one will try to do it. I bet people will even report success in doing it, simply because their bonus depends on it, but I don't think it will ever really work.
What might happen, though, is that LLMs replace many of the programs we write nowadays. After all, who needs a web browser when an LLM can give you a summary in natural language?
It’s pretty funny that you think this is impossible based on some theoretical idea of what LLMs are. Playing with LLMs today quickly demonstrates how things that shouldn’t be possible in theory are trivial in practice. In a few years, who knows where we’ll be.
You obviously have no clue what "theory" means in that context. Playing with LLMs has so far demonstrated very little. Yes, these language models are subjectively impressive, but that's exactly what they have been designed to be. Usefulness is a completely different metric and it will be much harder to demonstrate convincingly.
From a pure programming point of view, the situation is simple: even if an LLM could translate human language into working code, it just becomes another compiler and the input just another, admittedly novel, kind of programming language.
Your hyperbolic dismissal in the face of the existing usefulness of these models is kinda sad, if understandable. I mean, I get it, it’s kinda terrifying, and that can easily manifest as the kind of overly confident dismissal that you’re demonstrating here. But it just doesn’t match reality. I’ve been building software for 25 years, across multiple platforms and every kind of organization under the sun, and I’m using GPT4 to extreme usefulness right now, as are many others. And we’re just scratching the surface.
But by all means, continue to emphatically insist that LLMs have not demonstrated usefulness, or that they’ll never be able to “program on their own”.
ChatGPT plugins are in alpha as far as I know. 12 months seems like a reasonable timeframe.
I can totally see a set of plugins that write, debug and test code — all steps a human must now do — all automatically in the next 12 months. That, plus the 32k-token GPT-4 model, and you've got something that can replace high-level, high-boilerplate jobs or at least allow one person to 10x their output.
I started reading completely ready to dismiss the author. But I read the article and the guy provides researched and reasoned opinions, understands LLM, and isn't just spewing an uninformed opinion.
I am going to address his point from another perspective:
1. A revolutionary bullshit parrot is still revolutionary and exceptionally useful.
2. There is a high probability the bullshit parrot will get less bullshitty as time goes on and new versions of GPT are released.
3. Most humans are bullshit parrots, and not very good ones at that.
Even GPT in its current state can augment the not very good (and very good) bullshit parrots in the labor force, academia, and creative fields.
OK, so which is it: do we need to pause all research because it's the end of the world, or is it a parrot? I'm getting exhausted with the hyperbole going back and forth.
The pendulum is swinging too far back the other way. At first it seemed amazing but now that further exploration has revealed flaws we're seeing things like this that dismiss a significant portion of its capabilities.
I gave ChatGPT a small CSV dataset and provided information about what the dataset was. I then asked ChatGPT to analyze it in various ways. When I asked for possible explanations for why a specific data point dropped significantly in one year and then partially rebounded in the following year, it gave me 5 bullet point potential explanations that were reasonable first guesses for issues to look into, even though it knew nothing except the industry sector.
The whole series of responses would make a very reasonable jumping off point for exploring the issue and more detailed underlying data.
Then I did the same thing using GPT-4 and the results were significantly better, the explanations more insightful and nuanced.
I'll definitely continue using it to dump some pre-digested aggregates into a prompt and let GPT spitball ideas as a first-pass to get me thinking about different aspects of the problems I work on.
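In case it helps anyone reproduce the workflow, the whole thing is only a few lines (a sketch: the file name, column names and sector below are made up, and it assumes the openai Python client with an API key in the environment):

    import openai  # assumes OPENAI_API_KEY is set in the environment
    import pandas as pd

    df = pd.read_csv("revenue.csv")  # hypothetical dataset with "year" and "value" columns
    summary = df.groupby("year")["value"].sum().to_string()

    prompt = (
        "This is yearly aggregate data for a company in the logistics sector.\n"
        f"{summary}\n"
        "One year drops sharply and partially rebounds the next. "
        "List five plausible explanations worth investigating."
    )
    reply = openai.ChatCompletion.create(
        model="gpt-4",  # or "gpt-3.5-turbo" for a cheaper, weaker first pass
        messages=[{"role": "user", "content": prompt}],
    )
    print(reply["choices"][0]["message"]["content"])

The point isn't that the answers are authoritative; it's that they make a cheap first pass before digging into the underlying data.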
It is a parrot, but on the other hand it "read" a lot more compared to the average human. When someone uses fancy words or sentences, they can still apply them to a situation, even without understanding what they're saying. By doing so, such a person may accomplish more, or fail if someone calls their bullshit. So ChatGPT will be both useful and useless - just like this conclusion :)
GPT may exhibit intelligence, but it has no statefulness, other than what little state is stored to maintain each individual conversation in ChatGPT. The model does not learn in real time from ongoing conversations (so far as we know), so it cannot, for instance, say anything about the chat it is having with your neighbor down the street. It literally can only generate from training that happened some time ago.
I think this statelessness is an important consideration in understanding whether GPT is intelligent in anything close to the way we understand intelligence. For instance, it cannot coordinate its own efforts across time.
There is some chance that it may intuit how to store its state externally and coordinate that state in other conversations it is having, but that seems very farfetched. I don’t expect SkyNet just yet.
You can call it bullshit and in a way it is, but dismissing it in this way means you miss out on what it's good for.
Creativity often involves generating ideas in playful ways, using techniques that aren't really justified and can generate ideas that often don't work. Many of the ideas are wrong. But if you test one of them and it works, it becomes meaningful.
Similarly, chatbots can be used to generate ideas but you need to do the testing.
It works best if testing is inherently part of the process. You can ask a chatbot to recommend a paper to read, and look for it using Google. If it doesn't exist, you find out right away.
Or you ask it to generate some code. You read it over and it makes sense, and you try it and it works.
(It probably shouldn't be your only source of ideas and there's a question of how much testing you need to do.)
Who would have thought that the Turing test would devolve into two diametrically opposed camps writing extremely confident essays about how wrong the other side is to judge an AI as promising or not.
I'm curious what strategies researchers are thinking of for separating the fallacious "bullshit" from the real smart, logical thinking by these models.
It's nearly impossible to get reliable information using search on the internet. I've wasted hours trying to find facts in Google search results. Anything that relies on accurate data from the innertubes is already going down the wrong path.
The human mind is likely a bullshit parrot as well. Those who are at the cutting edge of ideas can claim a bit of non-parrot-ness for a while but most of the time it's just parroting within a curated "elite" subgroup.
What makes a "human" different could be that we live in the physical world with environmental and social consistency. Each time we sleep the mind "dies" and is "reborn", like GPT-4 starting a new conversation with new context.
The physical environment and social interactions (and cues from long term and short term memory) make up the preamble for the day.
If the environment and interactions fall into easily compressible patterns, it can feel like the movie Groundhog Day, a constant loop. However changing relationships (breakup, dating) or changing environment (traveling, moving) changes the preamble where "I" feel like someone different. A harder reset vs the soft reset of sleep.
Immediately checked the date - if this was published in March 2023 it's hilarious, because many of the claims here could be refuted by simply paying $20 and adjusting a dropdown to "GPT-4"
Anytime you see the 'parrot' argument, you can instantly go 'opinion dismissed'.
Anytime you see some fancy argument that bases its tests on ChatGPT instead of the way stronger GPT-4, which usually invalidates half of the criticisms: opinion dismissed.
There's no need to debate about embeddings, compression = meaning etc.
Simply put, many humans are parrots, and much white collar work is just parroting. It doesn't matter what the underlying model of GPT-4 is like; if it's useful, it will have a massive impact on the world.
I think too many people have a grossly inaccurate impression of what it means to be intelligent.
Firstly, we still don't truly have many good explanations for intelligence and consciousness that aren't self referential. Drawing hard conclusions on this subject about generative AI and human intelligence, even in 2023, is being hasty, in my opinion.
Which brings me to a lot of the criticisms made about LLMs. The idea that an ideal AI could or even should be capable of constantly generating original output is just plain weird. Where do people get this idea? How often do you meet a human being who hasn't gotten most of their ideas from somewhere else? Even synthesized ideas are forged by other ideas. Ideas that seem original may only appear that way; we just don't see where the connections lie between them and their predecessors. It's silly to dismiss an AI because it "parrots" things, as if that's not a pretty remarkable achievement. It's also very reductive; anyone who has used ChatGPT and doesn't see that it can pretty reliably munge together different ideas is blind or is lying about their use of ChatGPT.
Expecting perfect accuracy and knowledge is also not a reasonable expectation of an AI. An intelligence localized in spacetime is necessarily going to lack understanding of many things. To look to something like ChatGPT as a source of authoritative answers, and to dismiss it upon finding its answers to be specious, is to expect the AI to be God. Well I'm sorry, but we are nowhere near speaking directly to God.
In regards to "true understanding", for all we know, language processing is the workhorse of what humans consider to be acts of intelligence, and individual humans aren't as intelligent as they believe themselves to be. Our ability to communicate with each other about a variety of topics in real time is not so much a matter about being hyperaware of every given fact that is relevant and weighing them in a Bayesian model; how well we can converse is primarily dictated by how rapidly we can construct language by parsing the input, making connections, and producing output. I know I'm not the only one who can speak on a subject only to realize, after having completed a sentence, that I hardly even thought about what I was saying.
> How often do you meet a human being who hasn't gotten most of their ideas from somewhere else
Again (and again, and again), this argument is the mysteriously recurrent "many people fail", implying that failing would be normal, instead of undesirable.
> "parrots" things, as if that's not a pretty remarkable achievement
It is also the mark of unintelligence. As such, it is a warning alarm.
> pretty reliably
From a calculator you do not expect "«pretty»" reliable, but just reliable. From a human you expect faulty computation because of the economization of resources, but you know it can reason.
> Expecting perfect accuracy and knowledge is also not a reasonable expectation of an AI
The expectation is that of a reliable process leading to the output.
> Again (and again, and again), this argument is the mysteriously recurrent "many people fail", implying that failing would be normal, instead of undesirable.
That doesn't really answer the question. Why is the immediate expectation, when AI is in its infancy, that it must outperform humans in the ability to be original and always factually accurate? That's not a reasonable expectation when not only are there no examples that it's even possible, but we ourselves aren't original either. There's a reason why both computers and human beings have supervisors. What about AI suggests an intelligence that can or should be free of supervision?
> It is also the mark of unintelligence. As such, it is a warning alarm.
That depends on a matter of degree. You and I parrot things all the time. We have neither the time nor the capability to avoid taking information for granted and passing it along without fully understanding it. What we do is best effort, which is far more achievable than "perfect" computation.
For instance, I could ask you to paint us a painting of literally anything, and chances are your painting will be based on your experience of what a "painting" is and what you paint will probably be fairly unremarkable on a grand scale, no matter how much you try. I don't mean this as an insult, because it applies to virtually everyone and it is what it is. Of course there is the possibility that you will prove me wrong and paint something that no one has painted before using a style or method that is completely new, but I won't bet on it.
In such a case, you'd be "parroting" what artists have been doing since time immemorial. Does that actually devalue your art? Or the art of practically every other artist who mostly painted things based on previous ideas? I don't think so, but that's just me.
> From a calculator you do not expect "«pretty»" reliable, but just reliable. From a human you expect faulty computation because of the economization of resources, but you know it can reason.
Yeah... because calculators don't reason anything.
> The expectation is that of a reliable process bringing to the output.
What is your best example of how the current generation of LLMs aren't reliable enough for a wide variety of tasks?
Or to put it another way: new styles of art were historically invented every century or so, and a bit faster these days, maybe every decade. All of the other art made by thousands of artists was derivative in some sense. It’s a rare master that can come up with something entirely new.
You see similar things in areas like science. When was the last time you saw a paradigm shift akin to relativity or quantum mechanics? Not for over a century!
Well, having started in 1955, we really must frame that "infancy" on a scale like dog-years, human-years, AI-years...
> that it must outperform humans in the ability to be ... accurate
No, the expectation of some of us was not about «outperform[ing]», but about not losing the focus of «accura[cy]».
> a matter of degree
It does not seem a «matter of degree» but its opposite, a "matter of quality": does the machine question its statements?
> You and I
Actually, the process is that of building on already vetted ideas, and continuously vetting. True, there is an economy of effort, but there is also a capitalization on past effort. // It is unfair to say that you "«parrot»", just because of early stopping, assuming you are trained to vet your output.
> new ... what artists have been doing ... previous ideas
You should revise your perspective there: we learn from the (bigger and smaller) "giants" of the past, and learning is very different from plain imitation. What is learnt is internally elaborated, understood, integrated. There is a whole world of difference between "I learnt from A" and "Z copied A".
> calculators don't reason
But calculators bring a reliable result, and that is what is expected from them. You expect the same from intelligent entities.
> What is your best example
Decision Support Systems. (Edit: I meant as a support Analyst; as an information retriever it could instead be a dramatic asset, if employed as a hound to retrieve sources.)
> You should revise your perspective there: we learn from the (bigger and smaller) "giants" in the past, and learning is very different from plain imitation
Ah, I see. So when an AI imitates the things it was trained off of it's simply parroting the material. But when we imitate things we've trained off of, we're synthesizing new information that we've learned from. /s
This ambiguity is exactly what OP sounds like they're talking about in regards to intelligence. Who's to say what parroting is and what synthesizing new information from old ideas is? Where is the line? Can we score how "parroty" an idea is? Can we quantify how much intelligence is imbued in a written statement? There is no firm science (that I know of) that quantifies intelligence, but it sounds like we'll be needing to determine hard metrics to find out if AI is truly intelligent or not. Otherwise, we'll have these circular arguments forever that basically dismiss an AI as unintelligent simply because it's an AI.
> From a calculator you do not expect "«pretty»" reliable, but just reliable. From a human you expect faulty computation because of the economization of resources, but you know it can reason.
Just have it use Wolfram Alpha plugin. Not an issue.
Your arguments are all over the place, and frankly I don't understand the point you are trying to make. Something can be both undesirable and normal. In fact, I would even say that most of the time the most normal outcome is a failure, and we have to put in extra effort to make it a success.
I have seen plenty of bright and intelligent students who, most of the time, are just parrots of what they were told in class or from some other sources. It is actually highly unusual for a high school student or even a freshman or sophomore in college to have an absolutely original and correct idea. That doesn't mean I look at my students as unintelligent, even the below-average ones.
And here comes the crux of the issue. You automatically assign the ability to "reason" to a human even when we make mistakes aplenty. But when a machine does it, you said that is a proof of unintelligence. This is what the OP was criticizing in their post. Most deniers are using arbitrary requirements applied only on the AI and not humans.
After teaching quite a few college-level kids, I dare say that the AI models right now exhibit reasoning ability close to that of many humans. That doesn't mean they are sentient or human-level yet, but they are definitely much, much better than what I would ever expect out of dumb silicon and steel.
> You automatically assign the ability to "reason" to a human even when we make mistakes aplenty. But when a machine does it, you said that is a proof of unintelligence ... requirements applied only on the AI and not humans
Whereas I apply those requirements to humans as well? :)
I must correct you, to help understanding: I do assign in general (I hope, to all individuals but for an n-th standard deviation) «the ability to "reason" to a human»; that «we make mistakes aplenty» is overriding (or under-using) a faculty which was present but, in those cases, was not exercised.
You write «when a machine does it»: but can the machine do it? Can it criticize its ontological contents, its built concepts? Humans can, and pretty much must - deontically. The machine should.
> we have to put in extra effort
Exactly. We can, many don't, and the machine should - just like people in general should.
> the problem you are trying to say
While people do not always exercise critical thinking, they should. Verifying, trying your ideas and concepts, making them solid, should be a constant effort. Humans have those modules, which makes them apt to possibly be employed as "predicting machines" (which is a definition of intelligence), or "things you can ask things".
When you build a machine that produces outputs, they may be "for fun", in which case any generator that satisfies the "fun"-based "performance indicator" will do, but more normally you will want the output to be reliable. So the "verification" features are crucial.
If I understood you correctly, you are asserting that the LLMs entirely lack the "self-awareness" or "critical thinking" part, while humans simply don't use it enough, but we are all born with it and always have that ability as a built-in feature, inherent to our beings?
If so, then you are right on the machine's part. The current LLMs can't change their own weights and thus are, in a sense, incapable of questioning their own existence. They can't perform a highly philosophical debate with their own "self" and reform their minds as a result of that action.
What I am still arguing is that many humans can't do that either. Such a thing, in my experience, is actually not that common in the general populace. And while you can say it is there, just "unexercised", that would be functionally identical to it not existing, and you can't absolutely prove that "it is still there" in all humans.
>In engineering, I hope not
Perhaps in your particular type of engineering. In civil engineering, material degradation is undesirable and normal. In chemical engineering, side reactions are undesirable and normal. In biomedical engineering, well there are too many to list, from undesirable epigenetic changes to patients' unhealthy lifestyles. Things can definitely be undesirable but normal.
> does this tool provide massive value and productivity boost
And the «philosophical arguments» there highlight the other side, about the limitations. Are they relevant? Surely, if they are not evident to the user.
I think a more likely scenario is that people will romanticize the concept of intelligence so much that they continuously move the definition so that AI is excluded, excluding many humans in the process, until a fully romantic definition is embraced.
I think another possibility is that someone finds that "intelligence" is closer to a fundamental law of information theory, being some sort of emergent result of organizing/organized information, and that we're incredibly special for being capable, but not alone.
> people will romanticize the concept of intelligence so much that they continuously move the definition
We already do this. Braindead patients are not sapient, so we don’t treat them as such. In one sense, the entire abortion debate is about drawing this line. (Hell, children’s rights relative to adults are based on agency, which is a legal term that looks remarkably like sapience.)
I've been saying for a while now that the definitions given for sentience are always so poor they're either (0) circular, (1) accidentally include VCRs, or (2) require things that exclude humans.
I think this is only the second time I've encountered someone deliberately trying for the third option, rather than accidentally.
As before, it's as much a surprise to me as when I read the descriptions of people with aphantasia realising for the first time that other people have the ability to create mental imagery.
>I've been saying for a while now that the definitions given for sentience are always so poor they're either (0) circular, (1) accidentally include VCRs, or (2) require things that exclude humans.
VCRs have an internal structure that records events, changing them a little in the process; these memories can be replayed, and can show very convincing displays of emotional affect…
That's the trouble with definitions given as if the question is "common sense", they let stuff like that through.
Common sense is fine for normal daily life, not so useful when you want to know if the magic silicon crystal with transistors packed more densely than angels on pinheads placed by a flash of light too blue to see and made to perform arcane processes that literally no single human mind can fully comprehend… is merely an impressive but mindless automaton, or an accidentally sentient slave with all the negative connotations therein.
People seem to be really confused between sentience and intelligence, when we don't actually have much reason to think those two are very related.
One can certainly imagine that a very very dumb creature is sentient, and imagine that a very very very intelligent mechanical process is not sentient. We truly do not know what consciousness is or how it arises.
We know that we have consciousness (or at least I know that I have consciousness) and we know that we have some degree of intelligence, but it's not at all clear that one comes from the other.
I have a strong reason to think that consciousness and intelligence are related. Intelligence requires awareness. Consciousness is awareness of oneself. I for one cannot imagine an intelligent being that does not possess consciousness.
Not really. The point of the philosophical zombie idea is that they'd be indistinguishable in behavior from conscious people. Part of the point of the stochastic parrots is that supposedly they can't accomplish the same things as genuine intelligence. That's testable.
Fair point. I'm not sure if LLMs like GPT-4 haven't passed that test, but you're right about that particular distinction.
I think the larger point, that we don't have very reliable tests for these things, as we're finding out with the testing results that GPT-4 has been showing, is still valid; it leaves us having difficulty telling whether GPT-4 is actually intelligent or whether it arrives at its seemingly intelligent output through other means. I think in that way the distinctions are the same.
Or the reductive reasoning that we're just a meat computer, etc. Or confidently explaining that the mysteries of consciousness are just "emergent" in the various machinery that constructs our meat computer. Just this reduction and complete disregard for what it means to be human and how there's so much more than completing words - that there are memories, feelings, ambitions, dreams, and that we know so little about ourselves and the universe and just where we fit into it, if at all.
> This, to me, is the more terrifying prospect of the AI revolution: people arguing that humans aren't sentient either.
If we're going let the AI revolution fulfill its potential, it's a cognitive shift we're going to have to embrace. After all, AI is going to render a lot of meat-based equipment obsolete and unnecessary, and someone's going to have to figure out how to dispose of all that wasteful junk.
Thankfully, this existential dread is already a subset of the terror that comes from eliminating free will, which makes it not matter in the slightest, which in turn is a subset of superdeterminism! :-)
For me personally it's that the future is unknowable, at least with our current understanding of physics. So yes, everything may be pre-determined, including what will happen to me, but I can't predict it with any confidence. I will still get to experience it and I get to have qualia about it. I get to participate. The little pouch of structure in the universe that is tagged "me" is along for the ride, and also a factor in it all.
I've also known for many, many years that one day I will be dead. There's plenty of existential dread in death, yet I continue my existence building up to it. Even if I had free will: Still dead. Offspring and future generations of Earth? Well, ultimate heat death of the universe. There's many ways to call it all pointless. The only thing that we seem to have going for us is experience while it lasts. Maybe we still discover an escape hatch along the way.
> I've also known for many, many years that one day I will be dead. There's plenty of existential dread in death, yet I continue my existence building up to it.
How old are you? Because maybe we can reach immortality, but probably only if you are positioned properly for that. Anyway, for me the main drive is also curiosity on what is going to happen in the future. It's like this crazy story I want to know how it ends. I don't want to die, because I want to know what happens. Dying would be like going to sleep on a cliffhanger. I'll just prefer to binge rather than sleep.
Perhaps, but then none of it has meaning. Personally, I'm not really interested in having meaningless experiences.
> There's plenty of existential dread in death, yet I continue my existence building up to it.
But that's an entirely different thing. I am of the opinion that it's death that helps give life meaning. I have no existential dread about it whatsoever.
A life without meaning, however, seems like a pointless waste of time to me.
Hm, why do you think our lives would have no meaning if we don't qualify for sentience? (Or is it the free will part? Sorry, I set us up for confusion with the initial comment.)
What I think is: Even if we're just automatons that lack free will, it doesn't negate our existence or the role we play in the universe. We can still be the universe calling itself beautiful, we can still have qualia, we can still be individually unique structures of matter and state, we can still make a difference in what is and what isn't in the overall structure. And nothing else in the universe will do it fundamentally better, either. At some point we're subject to the rules of the game, i.e. physics itself.
It's the free will part. It seems to me that without free will, nothing anyone does actually matters. Any decisions we make, or any understandings we may develop, can change nothing whatsoever. Which makes it all pointless.
> we can still make a difference in what is and what isn't in the overall structure.
I disagree with this part. We can't possibly make such a difference, because it's all preordained anyway.
> I disagree with this part. We can't possibly make such a difference, because it's all preordained anyway.
What I mean is that what plays out is in the interaction of things, and if you just deleted that assembly of stuff that is me from the equation, it would play out differently. I'm not even suggesting that you adopt this mindset or anything, I'm just saying that for one reason or another - deterministic ones, I suppose - I end up having feelings about this that add up to "pretty cool".
Everything may be deterministic, but there's still things only I know, only I do, interactions only I have, etc. The grand script that is the universe has organized itself in a way where I get to have feelings about my little stretch and space of it, and I get to wonder what's next because I can't know.
For me, this is enough.
I know, I write this from a position of massive privilege, because I'm currently not, for example, seriously sick, or poor, or live in horrible circumstances (which is not to say I don't have any struggles). I hope this continues to be the case. But this was never under my full control: Thoughts on determinism or not, I could get hit by a car tomorrow, or slip and fall. I still have to live with that.
I get your feelings and dread, and I've been there. But the more I've thought about it, the more I've realized it doesn't make a big difference. And how could it? It's always been like this.
> I get your feelings and dread, and I've been there. But the more I've thought about it, the more I've realized it doesn't make a big difference.
The interesting thing is that I traveled the same path as you here, but in the opposite direction.
For the record, I don't suffer existential dread, as I actually believe we have free will. I was just talking about the hypothetical of not having free will. Thank you for a great conversation on this!
Out of curiosity, what's your (or your favorite) argument pro free will? I've spent some time on Penrose's take, etc., but I couldn't convince myself. I would still be absolutely interested in a pro argument that makes me flip.
Well, honestly, I have never found a great argument either for or against the existence of free will. You'll hate the basis on which I believe in it because it is so terribly tautological.
I believe I have free will because it appears to me that I have free will.
Because I can't make a solid logical argument either way, I believe the thing that brings the greatest amount of joy to me. I never forget that there's an asterisk on this belief, though, and am ready to abandon it should an argument come along that removes ambiguity.
But I also believe that "belief is the death of intelligence", and so use "believe" as shorthand for "have as a working hypothesis".
Mostly because it'd be more surprising! At the moment I can't predict I'd be convinced by an argument, but a future in which determinism turns out to be the wrong model of the world would definitely be interesting.
Like if there's some sort of ending to all of this and you discover that there's this thing called "free will" which you were judged by the whole time?
Kind of, but my sentiment is a little more meta. A deterministic universe is easy to reason about. Of course there's a huge amount of emergent complexity we still don't fully understand, but I think in the broad strokes, I get it and that's that.
But a universe with free will? All bets are off. Your comment I'm replying to is just a hint at "thinking through the implications". People are scared of determinism, but a universe built on free will is actually absolutely wild.
So, if you and I agree that the unknown (e.g. the future) is exciting and worth hanging around for, wouldn't that be kind of exciting too?
This is more or less also my take on religion. Nothing has convinced me of superior beings, but if it were to turn out there actually are some? Ho-ly shit.
It's all deterministic, or at least there's no reason for it not to be, and no valid reason to believe that it's not deterministic. Ultimately it wouldn't matter anyhow, because everyone is controlled by their desires and feelings, and these have developed throughout evolution to make sure that you are likely to survive, so likely you would keep going, unless the environment changes to an extent that evolution didn't prepare you for, of course. Which might very well happen, but it won't have anything to do with sentience, determinism or anything else.
Anyway the point is that just try to enjoy the ride if you can.
Why does it matter if something "matters"? You don't even know what the "matter" is. You act according to your emotions, desires and built in survival drive anyway.
I do indeed. And a sense of meaning is important to help in my quest for survival.
It doesn't matter to me if something is ultimately meaningless (such as the fact that in the end, this universe will cease to exist).
What matters to me is to have meaning in my own life. To feel as if my existence changed something. Hopefully, that "something" I've impacted is other people, and hopefully that impact is a positive one. To have people be better off for having known me.
> What would be the point of going on? Why would we have to figure out how to do it?
I mean, I'm just really curious what is going to happen, and what's the secret behind all that, behind the universe, and I hope to live as long as there's this to discover. That is my natural drive though. I don't really care at all whether I'm sentient, conscious or special in any given way. There's no reason for me to be. I just have this built in curiosity drive. It was not chosen by me, but it is there, and I act to engage the drive, because it makes me feel good and otherwise it makes me feel bad, nothing much more than some sort of algorithm. And we don't need to be more than algorithms.
To me it speaks to an utter contempt for other people, and likely a desire to get rid of them. Naturally, most of the people expressing these views want to survive, they just don't want "NPCs" around.
What is the reason for us to have anything "special" or whatever in the first place? Why should we be anything more than input to output. Why does it matter to anyone?
So mash a camera or microphone into the multi-modal input for GPT-4 and we’re done? Sensory experience does not seem that high a bar to clear. (Plenty of non-language models already do this.)
@Robotbeat Huh? You and I see color, hear sound, etc. I presume you're not insane and that you accept that as true. So what's this about human brains? Are you looking for your lost car keys under the street lamp because that's where the light happens to be on, and then baffled when you don't find them?
If science is unable to account for the existence of such experiences, then either they fall outside of science (so what? who said empirical science was the only means of knowing? the claim itself isn't scientific) or science must be revised. And what of the claim that anything that isn't quantifiable is bullshit? Is that quantifiable? No? So it's bullshit?
Is the claim falsifiable? If not, then it’s not scientific. Holding up claims to empirical validation (which is very broad and includes observational science, doesn’t have to be traditional experiment alone) is the core of science.
Qualia, by definition of being individual subjective experiences, cannot be quantified. Very clever trick, to define consciousness so that no matter what computer you come up with, it cannot be “conscious” because, of course, a computer’s operations are always quantified.
The problem, however, is you’re effectively defining consciousness in a way that human brains do not follow unless you make appeals to some supernatural process beyond the reach of science (ie fundamentally subjective, unquantifiable).
So again, it’s a bulls*** definition of sentience.
I disagree. I'm a physicalist, so I believe consciousness/qualia emerges from physical processes in the brain. It's absolutely possible for a computer to be conscious provided it implements the right kind of computational process. Ultimately, whether or not a system is conscious is an empirical question that neuroscience will be able to answer once we have the right theoretical framework and tools for measuring the relevant variables.
I don't know what you mean by it being a BS definition of sentience. It's a technical term invented by philosophers to mean a specific thing. If you don't like it, use a different word.
That is what I mean. /Subjective/ experiences are part of the usual definition of consciousness and sentience. Qualia, as opposed to quantia. If you’re trying to protect the human mind from ever being provably replicable in operation by a machine, it helps to pick an unfalsifiable definition of consciousness and sentience.
I don't get it. Who are you attributing this view to? I already said that I believe everything about consciousness/qualia/subjectivity are physical and measurable. There is nothing special about the human mind preventing us from replicating it in a machine.
Note: you brought up the term qualia, not me. I have never actually liked that term. I prefer the terms "consciousness" and "conscious experience".
>It's silly to dismiss an AI because it "parrots" things, as if that's not a pretty remarkable achievement. It's also very reductive; anyone who has used ChatGPT and doesn't see that it can pretty reliably munge together different ideas
It can clearly do both of these things very well, but human intelligence combines a multitude of interacting components, and for tasks that require more than just the one it replicates, it falls flat on its face.
It doesn't appear that the missing components can be fixed with incremental improvements to the model, either.
Correct. I'm not sure how that changes the fact that today's state of the art is tomorrow's obsolete tech. Do you see any evidence that advancements will become negligible? The fact that there aren't any memory/attention/feedback systems in place suggests there's some pretty interesting "low hanging" fruit still up there.
I think many of us are paid exactly for being close to “purely intellectual”. White collar and all. Personally, the physical aspect of my job is trending towards 0, unless you count the twitching of my fingers and the pain in my back from sitting for so long.
While we can certainly identify fallacies in the arguments of the cynics, we should likewise be careful not to fall into the broadly Skinnerian trap, or project onto things our own anthropomorphic conceptions. A Potemkin village is not a village, no matter how much like a village it appears to be to us.
The whole notion that computers or AI are capable of intelligence is simply a fantastical category mistake. We have forgotten that the metaphor is a metaphor. No matter how sophisticated AI technology could become in its apparent behavior, it cannot make the transition to bona fide intelligence. Simulation does not become the real McCoy through resemblance. This is just magical thinking. We must also account for the means by which the observable is produced. When we do, we can distinguish between a genuine behavior and a merely simulated behavior. And here is one important clue to the puzzle: there is a difference between true speech and something that merely happens to sound like speech, even when the raw data of each is indistinguishable to the ear per se. Observation is not brute. It always involves interpretation. When physicists conduct experiments, they are doing so within the context of some stack of theories[0] and background assumptions, including extra- and pre-scientific assumptions. The results of those experiments are observed, which is to say read, and so observation is always theory- and assumption-laden, because reading what we see or measure means taking what is made available to the senses as such and placing it in context. This is just a specific instance of what we all do in daily life (resting bitch face, or genuine irritation?). Science is nothing special in this regard.
So when you say that "anyone who has used ChatGPT and doesn't see that it can pretty reliably munge together different ideas is blind or is lying", I cannot help but respond with a "uh wut?". How have you come to conclude that ChatGPT has "ideas"? I will tell you that there is not a single idea within ChatGPT! Consider an example of an idea and see how we do. Take "triangularity", for instance. I think we can agree that "triangularity" is an idea. What makes "triangularity" different from some physical triangle? Well, there is only one "triangularity", but many triangles, each of which instantiates "triangularity" (otherwise, you could not call them all triangles; they are triangles only insofar as they instantiate some common "triangularity"). Each triangle is different from every other, and yet they are all triangles. And none of them is to be identified with "triangularity" as such, for if there is only one "triangularity", and some triangle were identical with "triangularity", then it would follow that there could only be one triangle! But, you might ask, is there perhaps a physical "triangularity" that isn't a triangle, yet something each triangle has? Well, no, because to be physical is to be concrete and particular, and "triangularity" is, by definition, general, universal, and abstract. There's no concrete magical ingredient that you can hold in your hand called "triangularity" that you can mix into cement to produce triangles.
There are other reasons why the claim that ChatGPT or whatever is intelligent isn't even wrong (like the subjectivity and observer relativity of computation in physical artifacts and computation as abstract model merely simulated through artifacts), but this appeal to abstraction, the ability to abstract from particulars and hold in your intellect what cannot exist physically except as particular, concrete instances, is central to intelligence. ChatGPT most certainly does not engage in abstraction. It would be fair to say that strictly speaking, it doesn't even parrot.
My PM is asking it questions about technical problems (like system design) preemptively and using it as a starting point as if it's an engineer. So now instead of thinking through problems from a blank slate, I'm debating with ChatGPT. It's been just awful.
And what's worse, his prompts always have certain biases or misunderstandings in them. For instance, he always wants to use third-party services for everything because he thinks it means we do less work (often not the case). So ChatGPT tells him what he wants to hear. He's using it as an extra engineer who always agrees with him.
This is a whole new category of humor. I can totally see all kinds of management-y people using ChatGPT to provide “helpful” but completely useless information.
Having used the API a bunch, the ChatGPT API (GPT-3.5-Turbo) feels like it has a 10-second attention span in comparison to GPT-4. GPT-4 is waaaay more powerful for actually building stuff on top of, adheres to your prompt in a much more stable way, and its reasoning is way less unstable in general as well. Writing long blocks of code without making "stupid" mistakes works much much better, etc.
The only downside is: it's so, so, soooooo slow in comparison to GPT-3.5-Turbo. You'll literally have to wait 2-3 minutes for the response with larger token amounts.
I find both your statements confusing and I think they're factually wrong.
There's no "ChatGPT API". ChatGPT is a webapp. It lets you select between 3.5-turbo and 4 models in its UI.
There's an OpenAI API, which recently added chat interactions and suitable models. There, too, you can select 3.5-turbo and 4. AFAIK, the 4 served by the API is the same 4 used by the ChatGPT web app.
The title is "Introducing ChatGPT and Whisper APIs" and features quotes like "Snap Inc., the creator of Snapchat, introduced My AI for Snapchat+ this week. The experimental feature is running on ChatGPT API".
So, OpenAI refers to accessing the model powering ChatGPT as the ChatGPT API.
Still, IMHO the OP post is misleading folks into thinking there's a magical GPT-4 that is not the GPT-4 you can also select on ChatGPT, but they are the same models. There's no GPT-4 that's better than ChatGPT's 4. I'm just trying to clear up this confusion for others, and frankly pretty surprised at the harsh downvotes.
There's also no separate ChatGPT API from the regular OpenAI API, I think the phrasing in the OpenAI page is just for convenience. But I'll grant their phrasing makes it canonical.
I see where you're coming from. I didn't personally get that from OP's comment, but I'm not a regular ChatGPT user and interact with GPT-4 mostly via playground or API.
Is that true? I was under the impression it was just like with the previous GPT models, where ChatGPT was an instruction-tuned and pre-prompted setup designed for interactive conversations on top of the underlying GPT model you had access to via the API, which was in its close-to-as-trained state of being a generative autocomplete.
Altman has stated in interviews they are considering releasing the un-tuned "base model", but they haven't done this.
For GPT-3, you can get models on the API via the "completion" endpoints (which include the base models) and via the "chat" endpoints. For GPT-4, they've only released the latter.
So my point is: You can absolutely use ChatGPT to build an opinion on what GPT-4 is capable of. Arguably even better than on the Playground, as the ChatGPT frontend supports a bigger token window than the Playground frontend (8k vs. 2k). Of course the actual API supports the full token window (and if you're off the separate waitlist for this, even the full 32k). But it's still the same model.
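For reference, the two endpoint styles mentioned above look roughly like this with the openai Python client (a sketch against the 0.27-era library; the prompts are just placeholders):

    import openai  # assumes OPENAI_API_KEY is set in the environment

    # Completion endpoint: plain text in, plain text out (GPT-3 family, including base models).
    c = openai.Completion.create(model="text-davinci-003", prompt="Say hello.", max_tokens=20)
    print(c["choices"][0]["text"])

    # Chat endpoint: a list of role-tagged messages (the only way GPT-4 is exposed so far).
    r = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": "Say hello."}],
    )
    print(r["choices"][0]["message"]["content"])

Same client, same account; which model you get is selected by the model string, which is the point being made here.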
There is a more "pure" GPT-4 model, and they're considering giving access to it. It's just not as useful to humans. Sam Altman talks about it briefly in the recent interview on the Lex Fridman podcast.
Yes, I know, but this is presently not available via the API. What OP here means is that someone's talking about 3.5 without having tried something with 4, while accidentally misleading people into thinking that there's a special 4 distinct from the 4 available on ChatGPT that can do more. This is not the case. If you've used 4 via ChatGPT, there's not a different version somehow - that we know about! - that is more capable or useful. If your interactions with 4 were via the ChatGPT frontend, you don't need to revisit your opinions now.
> that we know about! - that is more capable or useful
There's some nuance that makes your statement not completely accurate, and Sam talks about it briefly.
ChatGPT is GPT-4 with additional training, and layers, on top of it. ChatGPT is more useful when used in a conversation mode, because it was trained based on how people liked the responses in a conversation mode. Raw GPT-4 is harder to use, because it's not optimized for a back and forth. It may be much more useful for other use cases, though.
I was thinking about this while writing earlier, and kind of made the leap that we haven't yet found a way to make the base model more useful than the version with the additional training (or at least it's not been published), so technically it would be accurate to say it like this. However, I certainly agree with you that there could potentially be other utility you could derive from the base model with alternate steps. I appreciate your comment.
It's called the ChatGPT API by OpenAI, and free users (ie almost everyone using the system) can only use GPT-3.5. There is no option for them to use GPT-4, which is behind a paywall and a waitlist (maybe the waitlist is done now, I haven't checked).
I wanted to write a comment like this but feared it would be too negative or dismissive. But I agree with the general thrust of your argument. The article is a little outdated because a lot of the examples of deficiencies in the article work in GPT-4 (I tested a couple). But maybe the author is sandbagging ChatGPT to bring the non-tech people back to Earth.
When ChatGPT was released, a lot of people (even on this site) were trying to get it to think abstractly and were applying human qualities to the bot, when it's relatively clear that it's all just fancy statistics. Even today, when talking to non-tech people, there's still a lot of people who believe it's a magical thinker-in-a-box. Someone has to talk those people down, and it ain't gonna be me.
It's not perfect, but it doesn't have to be to have applications IRL. Yeah, the web chat is meh, whatever, but the API is where it actually shines. It's the integration with existing business processes and code that allows new, more human ways of interfacing with systems. Asking it for what it knows in its statistical blob will probably lead to heartache. Giving it some context and asking it "what can you infer from this?" is where it's strongest.
I actually agree that I don't think there's anything particularly special about our meat and chemical GPUs, but analog's fidelity has historically been difficult to replicate precisely in digital space. Maybe since we don't care for emotion in our intelligence, we don't need the analog. But I think our brains are a little more (but not that much more) complex than fancy statistics. After all, the math models the real world, not the other way around.
I wrote a massive rant for a response but then I deleted it.
I feel sorry for people who think about themselves this way. Why compare yourself to a calculator?
If it was possible to build an artificial brain that was self-aware, it would probably want to be a lot more like us than you think. To feel and connect, and experience love and what that’s like, is a very special part of existence.
I mean, I think distilling my argument down into "I'm not much more than a calculator" is a little bit much, but maybe I was unclear. I just don't think there's magic there. Brains are mysterious, but so are our LLMs, and we invented those. Most of what we experience emotionally isn't magic, it's a complex chemical reaction of hormones applied to our inputs and outputs. But like I said, I don't know if we'd value that in our AI or not, so that complexity may not be necessary to emulate.
I’m going to go out on a limb here, but I bet that deep down inside you know that intellect is only one mechanism for trying to understand the infinite complexity of existence.
People espousing this opinion seem to have grown quite a lot in number since ChatGPT first became public. Cynical as I am, I do suspect some astroturfing trying to prepare people for being "obsolete".
I also suspect astroturfing, particularly here, and I also notice it on Substack (I don't visit a whole lot of other sites anymore). On Substack I see two clusters, one group _very into_ AI, the other advocating a lot of caution.
I don't think it's preparing people to be obsolete; I think it's a combination of AI startup founders hyping up their tech, and people who identify with big tech/"the side that wants AI".
Hype no doubt plays a part, which is a perfect cover for any truly malicious astroturfing. But when I say "prepare to be obsolete", I mean prepare to be obsoleteD.
Right now, billionaires who profess the view that AI could end humanity are working their hardest to progress the field as fast as possible, scoffing at "ethics", "safety", and "responsibility". So if they do get their god-machine, they're not going to look around at 8 billion fellow humans (biological stochastic parrots, according to some) who are suddenly "obsolete", but using up resources, and think "fully automated space luxury communism".
Which leads to the questions, dominance for whom, and over whom. And why, if we're supposedly entering an age of post-scarcity where AI can enable anyone to do anything, we would need dominance in the first place.
Quantum mechanics is “just” fancy statistics. Unless we believe in some non-physical/supernatural phenomenon going on in the human brain, this sort of reductive argument works as well for the human brain as it does for LLMs (in other words, not very well. Sufficiently complicated statistics is what underlines everything, as far as we can tell.)
The fact that many are mentally incompetent cannot be a reason to produce something similar. When somebody does a job, we expect such a job to be done properly, and "brains" are pretty much required - intelligence, if only for the occasional cases outside the norm - otherwise, there is automation.
> if its useful
It is not easy to confine a confabulator to the contexts where it will be useful.
Indeed, but they're saying it with no evidence whatsoever. If those jobs actually do get replaced by LLMs people will be entitled to say that with certainty. Until then it's just their opinion, and not one that aligns with what I know of any of the jobs I regularly come in contact with.
I don't think the point being made here has anything to do with the current quality of the output. ChatGPT could be giving good or bad results, that's tangential. The debate here is about whether a system that is a parrot can be a good system.
The parrot arguers claim that these systems just regurgitate the training data set and give you the most plausible next word, and they claim that's bad/useless.
The anti-parrot arguers claim that's basically how humans work too (even the smarter ones), and that just because ChatGPT is a "parrot" doesn't necessarily mean it cannot be as good or better than humans. It might not be entirely there yet, but it's just a matter of time it becomes a good enough parrot to be indistinguishable/better than humans.
> whether a system that is a parrot can be a good system
A parrot can be a good system if the problem is "Collect and sort the available data about X (and provide reference)".
A parrot is a disaster if you ask it "What do you think of X": a parrot does not think - unless you make it reason, verify and criticize its ontology, and at that point it is useless that we call it a "parrot".
Most opinions you will hear from people are also just a regurgitated amalgamation of things they have previously heard or read on the subject. Also, in general, being able to aggregate and summarize into words what hundreds of people think about X seems like probably an even more useful tool than getting one person's thought, unless it's something no one has ever had any thought on before.
Even then, I don't think it would be outside the scope of possibility for the parrot to find things that are similar to X, find opinions on those, and somehow transpose those opinions onto X.
> The anti-parrot arguers claim that's basically how humans work too (even the smarter ones), and that just because ChatGPT is a "parrot" doesn't necessarily mean it cannot be as good or better than humans. It might not be entirely there yet, but it's just a matter of time it becomes a good enough parrot to be indistinguishable/better than humans.
Which humans? There may be a correlation between people who are capable of watching Netflix quality series for 4-5 seasons of more of the same and admirers of ChatGPT.
Basically if "content" (as opposed to "art" or attempts at art) does not bother you, ChatGPT does a good enough job already.
On the other hand, the most vocal opponents imagine themselves in a world where not 90% but 99.999% is blogspam and Netflix, making entertainment worth watching impossible to find.
This is becoming common belief among people in the AI field. I'm convinced it is a case of a metaphor overtaking reality. With the LLM model in mind, we can see how Humans are "kind of like" LLMs, therefore we start to believe humans are just LLMs.
Of course this forgets the enormous ways that humans are nothing like LLMs. For example: Humans don't have to read the entire internet to learn how to write complete sentences.
It's not clear to me if the author is using GPT-3.5-turbo or GPT-4; both are available in ChatGPT, and just saying "I'm using ChatGPT" doesn't indicate exactly which model they are using.
Maybe I missed it, where does the author say they are using GPT-3/3.5, not GPT-4?
I agree with your message in general; there is a night-and-day difference between GPT-3.5-turbo and 4. But I'm still not able to see which one the author is using.
Edit: seems the avatar in ChatGPT tells you which model you're using, never noticed that myself. Thanks everyone :)
I've checked some of the failure modes he mentions using GPT-4 and it answered the questions/completed the tasks correctly, so I can only assume the author used GPT-3.5.
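For what it's worth, the ambiguity disappears if the failure cases are re-run through the API rather than the ChatGPT UI, since there the model is an explicit parameter. A rough sketch of what I mean, assuming the current openai Python client and an OPENAI_API_KEY in the environment (the prompt is just a placeholder to fill in):

    # Hypothetical re-test of the same prompt against two named models,
    # so there is no doubt about which one produced which answer.
    from openai import OpenAI

    client = OpenAI()  # picks up OPENAI_API_KEY from the environment

    prompt = "..."  # paste whichever question from the article you want to re-test

    for model in ("gpt-3.5-turbo", "gpt-4"):
        reply = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        print(f"{model}: {reply.choices[0].message.content}")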
It's convenient that any valid criticism of ChatGPT or GPT-3 or $CURRENT_TECH is dismissed by saying "just pay OpenAI money to use the new thing™, it doesn't have this issue".
If it were not a ChatGPT Plus offering but instead freely available, there would be way fewer shills for it (I'm assuming) and there would be a lot more balanced reporting on it. By default, anyone speaking to the competence of GPT-4 is willing to give OpenAI money, which may already be a sign of bias.
I'm not willing to pay for a product whose free tier is shit, even when told a hundred times "no but the paid one is MUCH better!". Of course it probably is, but it's a little irrelevant.
Nearly all humans are parrots, aka mimics. One of the great demonstrations of it, which anyone can observe 24/7, is Reddit (and the Reddit knowledge spouted so commonly across the platform).
There are mimics, amplifiers, and originators/creators. Mimics are ~95-99% of the population. Amplifiers are - broadly speaking - teachers, people with bullhorns that train the mimics off of something another person originated/innovated/created.
It's why humanity has been so successful. The mimics are wired to try to copy what is shown to work, and they often do it quite well (keeping up with the Joneses is a wired behavior tied to plentifulness, survival and thriving).
Yeah if anything all ChatGPT is doing is highlighting that a lot more noise-work can be automated. "Make my random thoughts sound business-y", "double check this code doesn't have anything completely stupid" etc.
> Anytime you see the 'parrot' argument, you can instantly go 'opinion dismissed'.
I also seriously question whether anyone making the 'parrot argument' has actually used ChatGPT for anything meaningful. If it is nothing more than a parrot, then parrots are significantly more creative than I realized.
I don't know why you question this. I've used it for meaningful stuff, but I am not convinced that ChatGPT is more than a very sophisticated variety of Eliza.
Just because one person is convinced by an experience doesn't mean people who aren't convinced by it haven't had the experience.
I have. Here is a list of things at which it failed:
- Write a dead simple piece of code, which is however NOT found as an immediate answer to a stackoverflow question (I've tried this one upwards of 10 times, with no success)
- Categorise a chord when I tell it the constituent pitches, and vice-versa
- Solve a simple quadratic equation
- Show a biography of this mildly obscure singer/politician/scholar
- Write poetry in a given style
- Give me references to read up on a certain technical topic
ALL of these it fails, or rather, worse than fails: it spews plausible bullshit with the same confidence and tone it would use for correct answers, going as far as manufacturing references and attributing them to real people. (The quadratic and the chord, at least, are mechanically checkable, as sketched below.)
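A rough, hypothetical sketch of the kind of check I mean, in plain Python with nothing generated by ChatGPT; the point is only that these answers are cheap to verify deterministically:

    import cmath

    def solve_quadratic(a, b, c):
        """Roots of a*x**2 + b*x + c == 0, returned as complex numbers."""
        d = cmath.sqrt(b * b - 4 * a * c)
        return (-b + d) / (2 * a), (-b - d) / (2 * a)

    # Common triads as interval sets (semitones above the root).
    TRIADS = {
        frozenset({0, 4, 7}): "major",
        frozenset({0, 3, 7}): "minor",
        frozenset({0, 3, 6}): "diminished",
        frozenset({0, 4, 8}): "augmented",
    }

    def name_triad(pitches):
        """Name a triad from MIDI-style pitch numbers by trying each note as the root."""
        classes = {p % 12 for p in pitches}
        for root in classes:
            intervals = frozenset((p - root) % 12 for p in classes)
            if intervals in TRIADS:
                return TRIADS[intervals]
        return "unknown"

    print(solve_quadratic(1, -3, 2))  # roots 2 and 1, printed as complex numbers
    print(name_triad([60, 63, 67]))   # C, E-flat, G -> "minor"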
As you noted, most of us, even in SWE, are parrots (including this parrot typing); we get kinda scared when the robot parrot generates bullshit better than the web framework du jour.
When someone says "stochastic parrot" my brain turns off. Here is an article that punctuates each section with a meme. You don't need to read it to dismiss it. I did read it, though. There is literally nothing of substance in it. I rightly assert that the safety of AI research must be proven before any more progress is allowed to happen; the burden of proof is on them, not me. I think if this guy wrote an article where he picks apart the black blob of GPT-4, explaining how it might work and then making some kind of axiomatic argument for safety based on that, I would find that very convincing. But instead I get meme pictures. Does this guy write bots, or is he a bot himself?
The guy literally slaps a "Senior Machine Learning Engineer" title on himself 9 months after getting a master's degree :))
Guess he could use ChatGPT for his resume.
The author clearly possesses intelligence, which instills in him a confidence similar to that of ChatGPT when it comes to dismissing a tool that, for some reason, he has not yet learned to utilize effectively.
We've reached a point where, even with just continued incremental improvements, these LLMs are going to be extremely disruptive to society, probably within a decade, to the point that we see even more massive disparities in wealth and income, massive increases in poverty and homelessness, and ultimately probably large-scale civil unrest.
Rather than telling yourself (like this author) that "it's just a parrot," or alternately (like I've seen so many happy-go-lucky types on HN do) telling yourself that "LLMs are just another tool that will make me more productive!" I'd recommend preparing for the worst, while hoping for the best.
I don't understand predictions like this. Even if I thought this technology would be as useful as it's hyped up to be (which I don't), I don't see how it would have effects different from those of previous revolutionary technologies that have improved the human condition.
I think the doomsayers on these LLMs would be the same folks predicting doom when agriculture was mechanized. When massive portions of the workforce toiled in agriculture, folks thought farm machines would lead to mass unemployment and social horror. Instead, world hunger has dropped, and massive portions of the workforce were freed up to do things like work on LLMs.
Maybe the hand-wringing is different this time because privileged classes feel threatened.
> I don't see how it would have effects different from those of previous revolutionary technologies that have improved the human condition.
It will greatly reduce the demand for labor in many fields (off the top of my head: customer support, artists, any kind of writing or clerical work, and probably programmers and IT pros in general), and the benefit of this productivity gain will largely be reaped by the owners of LLMs and the companies using them in the form of higher profit margins. It's not at all clear that it will trickle down to the displaced workers.
> privileged classes feel threatened
Ironically, I actually believe a lot of the overly optimistic takes come from a point of privilege. Think the FAANG types making $250k a year, for whom everything in life has more or less worked out well, and who have difficulty envisioning a future where things don't just keep on working out well.
Cynical thought: "it IS just a parrot, but so are a large percentage of humanity, particularly those who are uncritically enthusiastic about any so-called progress"
AI might turn out to be a disaster not because it is smarter than us, but because stupid people will believe it is and come to rely on it, even more so than they already do on Google or Wikipedia.
What really strikes me about the ChatGPT output I have seen is the tone of complete confidence with which it delivers answers (except when it is programmed to respond "as a language model I cannot blah blah blah").
For me that is something which provokes an instant negative emotional reaction to what is said, because things rarely are that simple and clear cut, and any person or machine pretending that they are seems likely to me to be either simple-minded or trying to con me / advance some agenda.
That most people apparently don't have that reaction to it is frightening to me.