I remember I was working on a game where you fly a ship and shoot alien ships. I didn't have a lot of experience at the time, so I just made the ships move in one direction and eventually take a random turn; if you got within a certain distance they moved away from you, but if you got too close they would kamikaze toward you. I also needed to figure out when they should shoot, so I just decided that they would randomly shoot whenever you shot.
What happened, though, is that since the range of your bullets is short, you would fly up towards them and they would 'trick' you by first flying away and then suddenly beelining at you. And when you fired, sometimes they would fire at the same time and get you.
The comments I got from people who played the game were 'wow, the AI is really good'. Not sure if this is a direct example of the Forer effect.
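For what it's worth, the behaviour described above boils down to a few lines of state logic per ship. A minimal sketch (hypothetical Python; every threshold, name, and number here is made up, not from the original game):

```python
import math
import random

FLEE_RANGE, KAMIKAZE_RANGE = 200.0, 60.0      # made-up distance thresholds
TURN_CHANCE, RETURN_FIRE_CHANCE = 0.02, 0.5   # made-up probabilities

def update_ship(ship, player, player_fired):
    """One frame of the enemy logic described above."""
    dx, dy = player["x"] - ship["x"], player["y"] - ship["y"]
    dist = math.hypot(dx, dy)
    if dist < KAMIKAZE_RANGE:                  # too close: charge the player
        ship["heading"] = math.atan2(dy, dx)
    elif dist < FLEE_RANGE:                    # within range: move away
        ship["heading"] = math.atan2(-dy, -dx)
    elif random.random() < TURN_CHANCE:        # otherwise wander, turning at random
        ship["heading"] = random.uniform(0, 2 * math.pi)
    ship["x"] += math.cos(ship["heading"])     # advance one step along heading
    ship["y"] += math.sin(ship["heading"])
    # "randomly shoot when you shoot": True means this ship fires back this frame
    return player_fired and random.random() < RETURN_FIRE_CHANCE
```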
Let me make this very clear: Generating syntactically correct code in various languages that also looks plausible is no small feat. It is, in fact, extremely impressive and will certainly have an impact on SE.
But. Every single test I ran led to functionally wrong designs, from smallish memory errors in C (which, hilariously, ChatGPT was able to correct and explain when pointed to them), to misplaced/hallucinated methods in Python APIs, to completely hallucinated Perl packages. It never gave me a useful answer.
I don't know if that tells us something about my work vs. your work or my approach to the model vs yours. But it definitely tells me that such a model cannot replace a developer.
I have the same experience. I think the transformer-based models are impressive, but whenever I try to actually use them for something, e.g. having ChatGPT write something or using GitHub Copilot, the results I get are terrible.
I was legitimately afraid for my job when GPT-4 was released. After using it, however, I sleep easy now.
You know, I went ahead and paid to access GPT-4 and tried to use it for something that I guess is programming related, but maybe not what you think.
I was using it to try and help me create Arma3 scenarios, which often involves using an ugly little DSL called SQF.
The last thing I asked it about was how to play a custom audio file sitting on my disk. It gave me a simple one-liner that didn't do anything. I asked it several questions to clarify, but it seemed confident that this one-liner was all I needed to play an audio file sitting anywhere on my disk.
Eventually I said fuck it, and decided to google it myself, and it turns out it’s much more involved than just a single function call with a path as an argument, as suggested by ChatGPT. In fact you have to create an entirely separate config file with classes to represent the audio, and then you can reference that class to play the audio, not just a simple string path.
I’ve had it fail with a lot of other things, mostly trying to make computer controlled units perform deterministic, scripted actions, but it’s hard to tell if GPT-4 or the game’s engine was at fault here, because in my experience making Arma’s AI units do anything reliably is near impossible, no matter how much scripting you do.
I had the same experience, I was worried and now that I've used the tools I sleep easier. But I still worry that these tools will improve dramatically and quickly. Somebody tell me I don't have to worry, please.
If an analogy helps, I think the current AI craze will lead to some obvious and permanent changes to how the world works in a way that’s very similar to the self-driving Tesla hype over the past 5+ years.
Things like ChatGPT might get 90% of the way there very, very quickly for many use cases (similar to how lane keeping and adaptive cruise control are 90% of “self-driving”), but the remaining 10% of edge cases will require humans for a very long time (similar to how we never quite managed to get “100% self-driving” because the vehicles still can’t respond reliably 100% of the time). But we still value it because the 90% works.
Once people understand that 10% limitation and once the 90% value has been gained/normalized/integrated, the hype will die down and we’ll be watching a slow grind to figure out how to “fix” the last 10% which is necessary to live up to what’s being hyped today.
I can guarantee you that the equivalent code which I generate (taking >100 times as much time) has a comparable error/typo rate. I can say this with confidence because ChatGPT, while not perfect, "knows" lots of things I don't know and absolutely makes me 10x as efficient. Now, I am a pretty crummy software engineer so ymmv, but it's worth noting how far the goalposts have moved for assessing AI: from "simulates understanding of language in a general way" to "comparable in any way to a professional specialist in a random task (and much faster)". A few months ago everyone was losing their minds over how good ChatGPT was, and while some of that was overblown, nothing has fundamentally changed except we have become used to it and are looking for the next shiny tech thing. At least we're not talking about crypto anymore.
>But it definitely tells me that such a model cannot replace a developer.
Certainly not an average developer (however that's decided in C's culture). Not making any statements on skill of anyone here, but I'd bet it could replace the very low ends of skill as it is (GPT-4 and maybe Bard). I doubt there's too many of those kinds of developers though.
It's not a guaranteed positive for all languages/people, but it's farther along being useful as an enhancement technology than it is as replacement technology. I think it's making progress on both axes, but one is clearly ahead of the other and it's still possible it stops progress until other innovations are made.
As a counterpoint, I am absolutely horrible at writing Bash but I used ChatGPT to write a fairly complex (~50 lines, multiple functions) script in Bash that works just dandy. It wasn't perfect, and I made a couple adjustments, but it was 100x faster than what I could have done without it.
As a counter-counterpoint, I tried to use GPT to avoid learning Bash argument parsing. The script was buggy. After about 30 minutes of iterating with ChatGPT I gave up and read the docs. The issue was immediately apparent: GPT inconsistently mixed up getopt and getopts. Sometimes the code was correct; sometimes it blended the syntaxes of the two into something nonsensical.
If it can correct itself when prompted to do so, there should just be a loop that does that automatically. You can't get all of GPT's abilities from a single pass. Humans are also not very capable without iterative, reflective processing.
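A rough sketch of that kind of loop (hypothetical Python; `llm()` is a stand-in for whatever chat-completion call you use, not a real API):

```python
def llm(prompt: str) -> str:
    raise NotImplementedError  # placeholder: wire up your chat-completion call here

def generate_with_self_review(task: str, rounds: int = 3) -> str:
    """Generate, have the model critique its own output, revise, repeat."""
    draft = llm(f"Write code for the following task:\n{task}")
    for _ in range(rounds):
        critique = llm(
            f"Task:\n{task}\n\nDraft:\n{draft}\n\n"
            "List any bugs or mistakes. Reply 'LGTM' if there are none."
        )
        if "LGTM" in critique:   # the model sees nothing left to fix
            break
        draft = llm(
            f"Task:\n{task}\n\nDraft:\n{draft}\n\nFix these issues:\n{critique}"
        )
    return draft
```

The trade-off, as noted elsewhere in the thread, is that every round costs another model call.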
Counterpoint that literally just bit me this morning: ChatGPT completely lied to me about the scaling properties of Kinesis. I was confused about the 1000 writes/second but only 5 reads/second throughput of a Kinesis shard and was asking it a number of questions, including specifically whether a batch read would count as only a single read operation. It explicitly told me that each record in the batch counted as an individual read. Most of the other information it gave me was accurate, so I took this to be true.
Only the next day, when I still couldn't wrap my head around why a real-time processing system would be designed like this and did more Google searching, did I find it explicitly spelled out in the docs that a batch read counts as only a single read operation and can retrieve up to 10,000 records.
It's the first time I've been bit by this and it will definitely make me wary of any information it provides me in the future for technologies I am not intimately familiar with (at which point it's unlikely I'd need to consult it in the first place).
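For reference, the behaviour the docs describe is visible in a minimal boto3 sketch (the stream name and shard id below are hypothetical):

```python
import boto3

kinesis = boto3.client("kinesis")

# One GetRecords call counts as one read transaction against the shard's
# 5-reads-per-second limit, no matter how many records it returns
# (up to 10,000 records / 10 MB per call).
iterator = kinesis.get_shard_iterator(
    StreamName="my-stream",                 # hypothetical stream name
    ShardId="shardId-000000000000",         # hypothetical shard id
    ShardIteratorType="TRIM_HORIZON",
)["ShardIterator"]

response = kinesis.get_records(ShardIterator=iterator, Limit=10000)
print(f"{len(response['Records'])} records returned by a single read call")
```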
Yes, please. Do not use LLMs as a substitute for Google search. If you are looking for factual information, just use Google, Bing, or DuckDuckGo. You should only use ChatGPT for things where you are able to review its work.
Technology is supposed to make us smarter. Blindly believing an AI that we know can hallucinate makes us dumb with confidence.
> You should only use ChatGPT for things where you are able to review its work.
This keeps being my argument when people at work daydream about time and cost savings by offloading non-critical business functions to AI. I say, "Great, so it can produce 1000x more work than a person. But then what army of people are we planning to use to check those outputs?"
I'm super-impressed with the current crop of language models for their ability to so accurately simulate correctness, but their inability to understand what they don't know - because, in fact, they don't 'know' any of it in the sense that we do - makes them like very productive but completely untrustworthy employees. A junior dev who monopolizes his mentor's time through inconsistent performance is not a good hire.
Eh, part of the problem is people don't currently understand what LLMs are doing...
Have you ever had a dumb/wrong thought in your head? I'm going to go ahead and answer yes for you: you do all the time. Yet you don't (hopefully) verbalize a stream of consciousness to the people around you. In general you think of something and then reflect on whether it is true or false.
This is not what LLMs do; they pitch back the first 'thought' they have, "correct" or not. This is why things like CoT/ToT (chain-of-thought and tree-of-thought prompting) greatly increase the accuracy of LLM output. The problem? They require at least an order of magnitude more processing to get an answer, and with GPU time already expensive and in high demand, you don't see much of it happening.
Betting on LLMs commonly being wrong is not a safe bet at this point.
Even if the error rate of LLMs decreases with additional GPU power there's little rhyme or reason to their confabulations. Even if only 1% of the code is in error there's no guidance or pattern to where those errors might be.
It's like reviewing an overconfident junior developer's code except you can't learn their particular weaknesses. If a developer is bad about memory leaks, you know to check their every PR for memory leaks. An LLM won't necessarily produce the same types of errors given similar prompts or even the same prompt with some period of time between invocations.
In this paper, we introduce the Tree-of-Thought (ToT) framework, a novel approach aimed at improving the problem-solving capabilities of auto-regressive large language models (LLMs). The ToT technique is inspired by the human mind’s approach for solving complex reasoning tasks through trial and error. In this process, the human mind explores the solution space through a tree-like thought process, allowing for backtracking when necessary. To implement ToT as a software system, we augment an LLM with additional modules including a prompter agent, a checker module, a memory module, and a ToT controller. In order to solve a given problem, these modules engage in a multi-round conversation with the LLM. The memory module records the conversation and state history of the problem solving process, which allows the system to backtrack to the previous steps of the thought-process and explore other directions from there. To verify the effectiveness of the proposed technique, we implemented a ToT-based solver for the Sudoku Puzzle. Experimental results show that the ToT framework can significantly increase the success rate of Sudoku puzzle solving.
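To make the backtracking idea concrete, here is a heavily simplified sketch (my own illustration in Python, not the paper's implementation): candidate "thoughts" are proposed, a checker filters them, and the search backs up when a branch dead-ends.

```python
def tot_search(state, propose, check, is_solution, depth=0, max_depth=20):
    """Depth-first search over candidate 'thoughts'.

    propose(state)     -> candidate next states (in a real system, from the LLM)
    check(state)       -> False for states the checker module rejects
    is_solution(state) -> True once the problem is solved
    """
    if is_solution(state):
        return state
    if depth >= max_depth:
        return None
    for candidate in propose(state):
        if not check(candidate):        # checker rejects this thought
            continue
        result = tot_search(candidate, propose, check, is_solution,
                            depth + 1, max_depth)
        if result is not None:          # solved somewhere down this branch
            return result
    return None                         # dead end: backtrack to the caller
```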
I find LLMs to be very good at helping me figure out what I need to google search when I'm not entirely sure what I'm looking for. I can describe my problem from my perspective and it'll propose solutions that I can then go verify by searching and reading up on 'em.
Before, I'd often end up going down the wrong rabbit hole when I would try to write out my problem in a google-able manner.
Generally, chatbots function in direct communication with humans and do not function autonomously. Thus they are dependent on not just human meaning-making but human thirst for meaning which will find it even when it isn't there.
> Generally, chatbots function in direct communication with humans and do not function autonomously. Thus they are dependent on not just human meaning-making but human thirst for meaning which will find it even when it isn't there.
The latter doesn't follow. They may just be dependent on good prompt design.
People who aren't used to the AIs, and don't know how to use them, get worse results. This shouldn't be a surprise; it's how every tool works.
The prompt is everything. It’s sort of like knowing how to formulate a google search query, but cubed. You need to learn a whole bag of tricks and you can’t use the intuitions of talking to people for talking to the AI. Some people are by now really adept at prompt engineering and that makes them skilled at setting the AI to work.
That’s why people’s first impressions are often so wildly off. People who have the fortune of asking the right prompts for their first conversations walk away believing they’ve spoken to AGI, people who ask a less than optimal prompt (or things it is bad at like math problems) walk away not seeing what the hype is about.
Same thing with Copilot. The quality of the code it suggests depends a lot on the context, so if someone isn’t writing good comments and giving their variables proper names, the suggested code will be terrible.
These systems are very capable, but using them is a skill that must be learned, and there’s little material out there to teach that skill because things move so quickly.
The whole history of AI debates seems to be an exercise in goalpost-moving. Whether it's AI proponents or skeptics, no one can seem to agree on a stable benchmark. Personally, I think it's due to intelligence itself being ill-defined as a concept. How are we supposed to build something when we don't even agree on what it is we're trying to build?
That’s not really an issue when we’re talking about AI. I said in a separate comment on the topic: intelligence is such a complicated thing that we seem to only be able to define it by pointing to things and saying “that’s not it.”
If we didn’t move the goalposts, we’d have declared Stockfish to be full AI, despite it only being a chess-playing program, long ago.
> We agreed on a definition of artificial intelligence–the Turing test–for 50 years.
No, we didn’t. There was plenty of disagreement over it. Heck, the Chinese Room is a very popular rejection of the fundamental premise of it.
That aside, even if being able to fool humans in linguistic interaction were, in a blind scenario (where the builders didn’t know the criteria used to test), reasonably likely to be a good test of general intelligence, LLMs are about as obvious and deliberate an attempt to Goodhart’s-Law the Turing Test as one could imagine.
Using a particular capacity to test a more general capacity is obviously vulnerable to systems built to specialize in the tested capacity, as opposed to those that have it as a consequence of general ability.
The fact it was defined a long time ago isn’t good enough. If, upon passing the test, you realise the test was insufficient, you change the test. You don’t shrug your shoulders and go “well it must be right because a guy said so 50 years ago!”
My confidence comes from watching GothamChess play it at… chess. The problem with ChatGPT isn’t that it’s bad at chess or doesn’t understand the rules of the game. It obviously has no concept of what it’s doing. It can’t even keep track of which pieces are still on the board: a trivial task for even a small child.
No, they look at the board, though strong chess players can do it blindfolded.
My point is that the hyperbole about “we’ve cracked AI but they changed the goalposts” is self evidently not true. You just proved it right there: I need to add more plugins because ChatGPT does not understand what it’s doing.
It’s a potentially useful tool but that doesn’t make it intelligent.
I’m using “intelligent” in the “intelligent life” meaning, not the “how intelligent is this person” meaning. By this meaning, any reasonable person would realise it’s not crossing that threshold.
Where is this threshold? I don’t think anyone really knows yet. That’s why I don’t see a problem with “moving the goalposts,” because doing so is the best way to help us truly understand what it means to be an intelligent life form (artificial or otherwise).
And this is why the word "intelligent" is useless. By your use of the word, we're not getting anywhere, and then suddenly the Terminator kicks in the door and steps on your head, and only then it's "oh, yeah, I guess we reached the intelligence point". That is a piss-poor predictor of capabilities.
And again "reasonable" is a pretty useless metric. Asking a 'reasonable' person about any system that requires expert knowledge to understand is going to derive an unreasonable answer. This is because they'll conflate intelligence with human behavior.
> In your use of the word we're not getting anywhere then suddenly terminator kicks in the door
This is just silly. Also, a machine killing me is still not enough. We have drone targeting systems in development right now but I don’t think you’d call them AI.
> And again "reasonable" is a pretty useless metric.
When you can’t define intelligence, it’s really the starting point.
The thing with LLMs is that they look almost there but, as many are pointing out, the method by which they make inferences is an analysis of how words fit together without understanding the meaning. This is why things like ChatGPT confidently spout absolute nonsense about topics they weren’t trained on (with much human intervention): it doesn’t know what it’s saying so it doesn’t realise it’s making stuff up.
Came here to say this. The code it generates for me is usually not perfect, but for me it provides conceptual insights that have gotten me unstuck from gnarly situations.
The conventional narrative is that ChatGPT can code as well as a junior engineer, but I feel like it’s more like a senior engineer scribbling on a whiteboard: no, it’s not perfect code, but it’s the right idea.
I think the interpretation is that there is a certain statistical distribution of statements that appear “more broadly true” than they are. I think there is some meat to this, even with code that works as intended. The degree of creativity might be lower than we perceive.
I agree with the other comments that this is a fairly useless article. Perhaps I’m looking too much into this specific example, but I fail to see how the entirety of AI’s success (per this title) is misrepresented due to its inability to provide a horoscope (good example by another comment) that’s specific to some person in question.
We see a lot of articles that swing too far towards “AI will change everything!” just as much as “no, AI is not actually effective/meaningful!”. This is the latter. I’m surprised that anyone would genuinely try to use chatGPT/language LLMs this way.
I’m mainly enjoying chatGPT as a way to distill widespread information on the internet into a pseudo conversation. It’s great for learning new topics - my favorite is generating and explaining code snippets for languages/libraries I’m not familiar with.
> I fail to see how the entirety of AI’s success (per this title) is misrepresented
The title does not claim that the "entirety" of AI's success is misrepresented. It is questioning "how much" is, and I think that this is a fair question. The author does admit that the first paragraph in the example does have valid information which shows the AI with knowledge. It's the second paragraph that is just fluff.
I think that it is valid to ask how much this "fluff" impresses us and leads us to see the AI as being knowledgeable. Perhaps the author does draw too strong of a conclusion, but he still makes a good point.
> my favorite is generating and explaining code snippets for languages/libraries I’m not familiar with.
For me this is one of the more dangerous uses.
Humans are already pretty bad at detecting errors in code. Bertrand Meyer, an expert with some renown in formal methods, couldn’t find an error in a one-liner of Eiffel code generated by ChatGPT. What hope do programmers with less training have to recognize when ChatGPT has given them an incorrect summary?
What hope do junior programmers have of finding the errors in the code they've written themselves?
I work in the code security industry, and there is one truism: people typically write code until it compiles and/or doesn't return an immediate error; they do not write code until it is 'correct'.
What fraction of people using AI are looking up opinions on themselves? While I'm sure someone has written a completely automated horoscope generator, in general I think these LLMs are being applied in situations where there would be no Forer Effect.
It's a time-honored tradition with new tech, asking it what it thinks about us.
"According to a study by the Pew Internet & American life project, 47% of American adult Internet users have undertaken a vanity search in Google or another search engine. Some egosurf purely for entertainment, such as finding celebrities with the same name. However, many people egosurf as a means of online reputation management. Egosurfing can be used to find data spills, released information that is undesirable to have in the public eye. By searching one's own name in an online search engine, one can take on the perspective of a stranger attempting to find out personal information. Some egosurf in order to conceal personal images or information from potential employers, clients, identity thieves and the like. Similarly, some use egosurfing to maintain a positive public image and to achieve self-promotion."
Maybe it's something related. When I ask ChatGPT technical questions I get very confident answers that are often completely wrong. But setting aside the confidence aspect, I notice ChatGPT responses can be vague and widely applicable.
In one case I ask a question about X, and ChatGPT responds with an answer: Y because Z. Well Y is wrong, so I tell ChatGPT Y is wrong and I get a new answer: Not Y because Z.
In another case I ask a question about X, and I get an answer that talks endlessly about Z without ever giving the answer Y.
The reasoning all sounds good, it's relevant and useful, but the actual answer might as well be a Mad Lib.
This isn't it. Look at Figure 4 in https://arxiv.org/abs/2303.08774. Those test results aren't like reading your horoscope and thinking 'hmm, that does sound like me', and they aren't like Philip K. Dick using ChatGPT recreationally and thinking that it's VALIS talking to him directly.
Interesting observation. I find it incredibly difficult to write queries for ChatGPT that are not biased. If you write a question mentioning two concepts, the answer will almost without fail suggest a connection between the two. Whereas if you were to mention just one of them, the chance that the other concept appears in the response might be very small.
Seems a very specific scenario where you are well-known enough to be able to ask an LLM what it thinks of you and use that to judge how good it is. I don't see Barnum statements being applicable in many other use cases.
It didn't know who I was at all, a great relief all told. The eye of Sauron has not fallen on me yet, it seems.
I've been looking at using LLMs to extract structured information from structured but human-written documents (local municipal codes). Unfortunately the models need a lot of handholding to do this without hallucinating or making generic statements.
They produce output that looks correct and that's an accomplishment on its own. Unfortunately the output often has little bearing on the reality presented to it.
Realistically... they need to evolve much more. As an experiment I paid people on Fiverr to do the exact same thing with the same prompt, and they got it right. These are not Americans, so the norms of American municipalities are new to them.
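A rough sketch of the kind of handholding involved: pin the model to a fixed output schema and reject anything that doesn't validate or can't be traced back to the source text (hypothetical Python; `llm()` and the field names are invented for illustration):

```python
import json

SCHEMA_HINT = (
    "Return ONLY JSON of the form "
    '{"setback_ft": number|null, "max_height_ft": number|null, "quote": string}. '
    "Use null when the document does not state a value. "
    '"quote" must be copied verbatim from the document.'
)

def llm(prompt: str) -> str:
    raise NotImplementedError  # placeholder: wire up your completion call here

def extract(section_text: str):
    """Ask for a fixed schema, then reject output that doesn't check out."""
    raw = llm(f"{SCHEMA_HINT}\n\nDocument section:\n{section_text}")
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None                      # model didn't follow the schema
    if data.get("quote") and data["quote"] not in section_text:
        return None                      # cited "evidence" isn't actually in the source
    return data
```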
> Some of them become emotional at having their personality revealed to them. Only then to be told that everyone gets the same results.
I don't think this is as much of a gotcha as the author supposes, because those tests act as a kind of mirror: they let you project yourself onto the results and then see yourself in the reflection. That's why they're useful as a self-reflection tool when you take them yourself, but not at all actionable when someone else tries to assess you with them.
Wink wink, nudge nudge: tarot works on the same principle.
I guess people who are not aware of this effect might make a lot of decisions in life based on it, and a confident AI agent will be like a high-tech con artist.
As a child, I would read the horoscope of the day with the wrong sign to my mother and her female friends, and they always thought it was talking about them; I never had the courage to tell them I switched it as a joke. I keep imagining a fine-tuned GPT for insights from the "mystic realm".
We can extend this phenomenon to all of knowledge. You ask ChatGPT something and it gives you some very generic 10-point blogspam-esque answer. You immediately think ChatGPT is intelligent and that it has understood your particular question and that it has given you a tailor-made answer.
Wow, this article really got me thinking! It's fascinating how AI, like ChatGPT, can create statements that feel so personal and unique to us, when in reality they're quite generic. It's a testament to the power of the Forer Effect, and how our brains are wired to find meaning and personal relevance in broad statements. It's a bit like reading a horoscope, isn't it? You can't help but feel like it's speaking directly to you. This definitely adds a new layer to how I perceive AI-generated content. It's not just about the technology's capability to generate human-like text, but also about our human tendency to find personal meaning in it.