“Note that we do not directly compare with prior methods that take Minecraft screen pixels as input and output low-level controls [54–56]. It would not be an apple-to-apple comparison, because we rely on the high-level Mineflayer API to control the agent. Our work’s focus is on pushing the limits of GPT-4 for lifelong embodied agent learning, rather than solving the 3D perception or sensorimotor control problems. VOYAGER is orthogonal and can be combined with gradient-based approaches like VPT as long as the controller provides a code API.”
Interestingly, they mod the server so that the game pauses while waiting for a response from GPT-4. That's a nice way to get around the delays.
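The pause trick amounts to making the environment synchronous: the world only advances once the model's reply arrives, so API latency never costs game time. A toy sketch of that pattern (all names here are hypothetical, not from the paper's code):

```python
import time

def llm_propose_action(observation):
    """Stand-in for a slow GPT-4 call; hypothetical, returns a command string."""
    time.sleep(0.01)  # simulate network latency
    return f"act_on({observation})"

def run_episode(observations):
    """Each loop iteration is one un-paused step of the world.
    The environment is effectively frozen during the LLM call,
    so latency has no in-game cost."""
    log = []
    for obs in observations:
        action = llm_propose_action(obs)  # world 'paused' here
        log.append(action)
    return log

print(run_episode(["zombie", "tree"]))
```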
I still don't understand it and it blows my mind - how such properties emerge just from compressing the task of next-word prediction. (Yes, I know this is an oversimplification, but not a misleading one.)
No task, but we need to be clear that it did have the data. Remember that GPT4 was trained on a significant portion of the internet, which likely includes sites like Reddit and game-fact websites. So there's a good chance GPT4 learned the tech tree and was trained on data about how to progress up that tree, including speedrunner discussions. (Also remember that as of March, GPT4 is trained on images too, not just text.)
What data it was trained on is very important, and I'm not sure why we keep coming back to this issue. "GPT4 has no zero-shot data" should be as drilled into everyone's head as sayings like "correlation does not imply causation" and "garbage in, garbage out". Maybe people do not know this data is on the internet? But I'd be surprised if the average HN user thought that way.
This doesn't make the paper less valuable or meaningful. But it is more like watching a 10 year old who's read every chess book and played against computers beat (or do really well against) a skilled player, versus a 10 year old who's never heard of chess beating a skilled player. Both are still impressive; one just seems like magic and should raise suspicion.
> I still don't understand it and it blows my mind - how such properties emerge just from compressing the task of next word prediction.
The Mineflayer library is very popular, so all the relevant tasks are likely already extant in the training data.
> I think it puts an end to the claim that "language models are only stochastic parrots and cannot do any reasoning".
But then two sentences later:
> I still don't understand it and it blows my mind
I've said this before to others and it bears repeating, because your line of thinking is dangerous (not in the sudden-AI-cataclysm sense): feeling totally qualified to make such a statement while armed with ignorance rather than knowledge is the cause of the mass hysteria around LLMs.
What is happening can be understood without resorting to the sort of magical thinking that ascribes agency to these models.
This is what has (as an ML researcher) made me hate conversations around ML/AI recently. Honestly, it's getting me burned out on an area of research I truly love and am passionate about. A lot of technical people are openly and confidently talking about magic: talking as if the model didn't have access to relevant information (the "zero-shot myth") and other such nonsense. It is one thing for a layman to say these things, but another to see them in the top comment on a website aimed at people with high tech literacy. And it's even worse to see it coming from my research peers. These models are impressive, and I don't want to diminish that (I shouldn't have to say this sentence, but here we are), but we have to be clear that the models aren't magic either. We know a lot about how they work too. They aren't black boxes; they are opaque, and every day we reduce the opacity.
For clarity, here's an alternative explanation of the results that assumes even less than the paper's setting (and explains AutoGPT better). The LLM has a good memory. The LLM is told (or can infer through relevant keywords like "diamond axe") that it is in a Minecraft setting. It then looks up a compressed version of a player's guide that was part of its training data, and uses that data to execute goals. This is still an impressive feat! But it is still in line with the stochastic parrot paradigm. I'm not sure why people think stochastic parrots aren't impressive. They are.
But right now ML/AI culture feels like Anime or weed culture. The people it attracts makes you feel embarrassed to be associated with it.
What about any of what you've just said screams parrot to you?
I mean here is how the man who coined the term describes it.
A "stochastic parrot", according to Bender, is a system "for haphazardly stitching together sequences of linguistic forms … according to probabilistic information about how they combine, but without any reference to meaning."
So... what exactly from what you've just stated implies the above meaning?
>>LLM has a good memory.
Pretty much this.
> the man
The woman. Bender is a woman. In fact, three of the four authors are women, and the fourth's identity is not public.
> according to probabilistic information about how they combine, but without any reference to meaning.
This is the part. I don't think the analogy of the parrot is particularly apt, because we all know the parrot doesn't understand calculus but is able to repeat formulas if you teach it. But we have to realize that there are real-world human examples of stochastic parrots, and these are more akin to LLMs. If you don't know the phrase "Gell-Mann Amnesia", let me introduce you to it. It is the observation that you can hear a speaker/writer discuss a subject you're familiar with, watch them make many mistakes, and then, when they move to a subject you are not familiar with, trust them anyway. We can call this writer or speaker a stochastic parrot as well, since they are using words to sound convincing but do not actually know the meaning behind those words. It is convincing because it matches the probabilistic information that a real expert might use. The difference is in understanding.
But this gets us to a topic at large that is still open: what does it mean to understand? We have no real answer to this. But a well-agreed-upon part of the definition is the ability to generalize: to take knowledge and apply it to new situations. This is why many ML researchers are looking at zero-shot tasks. But in the current paradigm this term has become very muddied and in many cases is being used incorrectly (you can see my rants about code generation with HumanEval, or how training on LAION doesn't allow for zero-shot COCO classification).
For this work specifically, we need to evaluate and think about understanding carefully. The critique I am giving is that people are acting as if this is "understanding" similar to dropping a 10 year old into Minecraft and having them figure out how to play despite never having heard of the game before (though maybe they've played games before; Minecraft is also many kids' "intro game"). This is clearly not what is happening with GPT. GPT has processed a lot of information about the game before entering its environment. It has read guides on how to play and how to optimize gameplay, it has seen images of the environment (though this version doesn't use pixel information), and it has even read code for bots that farm items. The prompts used in this work tell GPT to use Mineflayer. They also tell it things like that mining iron ore gets you raw iron, and several other strong hints about how to play the game.

Chain of Thought (CoT) prompting also casts doubt on the understanding of an LLM, and really provides a strong case against understanding (since an understanding creature considers this on its own). CoT feeds recurrent information back into the bot, and this causes statistical (Bayesian) updates. This is not dissimilar from being allowed to reroll a set of dice while also being able to load them. You can argue that CoT is part of the thought process of an entity that understands things, but you need to recognize that this is not inherent to how GPT does things. You may want to draw an analogy to teaching a child, who confidently spits out the wrong answer until you say "are you sure?" But we need to be careful drawing these parallels, and to think carefully and with nuance. The nuance is critical here.
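The reroll analogy can be made concrete: resampling (re-prompting) and biasing the sampler (strong hints in the prompt) both raise the hit rate without the die "understanding" anything. A toy Monte Carlo sketch of that claim:

```python
import random

random.seed(0)

def roll(loaded=0.0):
    """One d6 roll; `loaded` shifts extra probability mass toward a 6."""
    if random.random() < loaded:
        return 6
    return random.randint(1, 6)

def p_six(rerolls, loaded=0.0, trials=20000):
    """Estimated probability of at least one 6 given extra rerolls."""
    hits = sum(
        any(roll(loaded) == 6 for _ in range(rerolls + 1))
        for _ in range(trials)
    )
    return hits / trials

base = p_six(rerolls=0)                          # fair die, one roll: ~1/6
rerolled = p_six(rerolls=2)                      # same die, three attempts
rerolled_loaded = p_six(rerolls=2, loaded=0.25)  # three attempts, loaded die

print(base, rerolled, rerolled_loaded)
```

The success rate climbs at every step, yet nothing about the die changed its "knowledge" of the target.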
But I want to give you some more intuition about this idea of understanding. We attribute understanding to many creatures, and I'll select a subset that is harder to argue against: mammals and birds. While they don't understand everything at the level of humans, it is clear that there are certain tasks they do understand: using tools, quickly adapting to novel environments, and much more. But there's a key clue here about something: we know that they can all simulate their environments. How? Because they dream. I can't help but think this is part of the inspiration for how Philip K. Dick named his book, since the question we're getting at is part of its central theme. But as for GPT, it isn't embodied. It does not seem able to answer questions about itself, and it has shown clear difficulties simulating any environment. While it makes some hits, it makes more misses.
TLDR: see this prompt and ChatGPT's response: https://i.imgur.com/sK4pLw0.png
Fwiw: Bard answers similarly to ChatGPT: https://i.imgur.com/CmWsf9X.png https://i.imgur.com/QJXIBDl.png https://i.imgur.com/zSGjYss.png
Side note: I'm often critical of Bender myself. I think she is far too harsh on LLMs and is promoting doomerism that isn't helpful. But this has nothing to do with the meaning of "stochastic parrot". We should also recognize that the term has changed and adapted as it has entered the lexicon, just like every other word/phrase in human language.
And wow, that's GPT4.
I've had similar thoughts as you. It feels like amazing intelligence one day, but the next it seems like an extremely good, but naive, pattern matcher.
I've experienced similar GPT-4 disappointments trying to teach it concepts not well represented in its training data (it does badly), or making modifications to programs that go outside its training data (e.g., making a tax calculator compute long-term capital gains tax correctly). It ends up doing much worse than a human.
> They both weigh the same amount, which is 1 pound.
It is clearly a strong example of Gell-Mann Amnesia when we can't trust it to tell us the difference between two simple things but we trust it to tell us complicated things.
It is also a clear example of how it is a stochastic parrot -- it doesn't understand what it is saying -- as it even explains its reasoning and is not self-consistent. We wouldn't expect an entity that understands something to be wildly inconsistent within such a short period of time. Clearly the model is relying more on the statistics of the question (the pattern and frequency with which most of those words appear in that order) than on the actual content and meaning of those words.
Despite this, I still frequently use LLMs. I just scrutinize them and don't trust them. Utility and trust are different things and people seem to be forgetting this.
Well, I can predict the next few token sequences you're about to get in response to your comment: "That's why you got that answer; GPT4 is so much better", etc.
Regarding your earlier comment about burnout, you're not alone. I stayed on HN because I could have the occasional good discussion about AI. There were always conversations that quickly got saturated with low-knowledge comments, the inevitable effect of discussions about "intelligence", "understanding" and other things everybody has some experience with but for which there is no commonly accepted formal definition that can keep the discussion focused. That kind of comment used to be more or less constant in quantity and I could usually still find the informed users' corner. After ChatGPT went viral though, those kinds of comments have really exploded and most conversations have no more space for reasoned and knowledgeable exchange.
>> LLM has a good memory.
Btw, intuitively, neural nets are memories. That's why they need so much data and still can't generalise (but, well, they need all that data because they can't generalise). There's a paper arguing so with actual maths, by Pedro Domingos, but a) it's a single paper, b) I haven't read it carefully, and c) it's got an "Xs are Ys" type of title, so I refuse to link it. With LLMs you can sort of see them working like random-access memories when you have to tweak a prompt carefully to get a specific result (or like how you only get the right data from a relational database when you make the right query). I think, if we trained an LLM to generate prompts for an LLM, we'd find that the prompts that maximise the probability of a certain answer look nothing like the chatty, human-like prompts people compose when speaking to a chatbot; they'd even look random and incomprehensible to humans.
Ah, my bad. GPT-4 via Bing Precise gets it correct:
> A kilogram of feathers weighs more than a pound of bricks. A kilogram is a metric unit of mass and is equivalent to 2.20462 pounds. So, a kilogram is heavier than a pound.
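The conversion the model cites is checkable in a couple of lines, which is what makes the earlier "both weigh 1 pound" answer so telling:

```python
LB_PER_KG = 2.20462  # pounds per kilogram

kilogram_of_feathers_lb = 1.0 * LB_PER_KG  # mass of 1 kg, expressed in lb
pound_of_bricks_lb = 1.0                   # 1 lb is just 1 lb

# The material is irrelevant; only the units matter.
print(kilogram_of_feathers_lb > pound_of_bricks_lb)
```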
To me, it would be more convincing if they developed an entirely new game with somewhat novel and arbitrary rules and saw if the embodied agent could learn that game.
So this isn't really open-ended work; it's just making it do something it is already trained on, by connecting it to an API whose docs it has learned.
>mines straight up and down
- mining straight up means you either seal your path behind you or are limited in how high you can go
- mining straight down likely traps you in a pit
- mining straight down far enough can drop you straight into lava, as many Minecraft players learn early on