I read an article that sounded like more of the hype that has surrounded AI from the outset, claiming that AGI is just around the corner. AGI, as a reminder, stands for Artificial General Intelligence. It claimed that AI is already demonstrating superior skill to humans on certain metrics. Skeptical, I looked for details, and one of them was genuinely striking. It involves an LLM solving riddles, which would indeed seem to involve reasoning. If you have read my past articles or are simply well informed about AI, you’ll know that there is another, more likely explanation. The LLM is trained on all the text of the internet, which includes a lot of riddles. The patterns of speech in those riddles allow the LLM to follow them to solutions, with no understanding involved.
So how do we evaluate the claim that “AGI is just around the corner”? I decided to test ChatGPT and Claude, first with a riddle akin to that in the article, then with something more obscure from my own field.
Here was my first riddle, which I posed only to ChatGPT, since I was confident Claude would perform similarly. My prompts are in bold and the AI responses are indented.
I am at the park with four friends, Luigi, Mario, Sergio, and Valentina. I am on the see saw. Luigi is on the jungle gym. Mario and Sergio are playing catch. What is Valentina doing?
Since you mentioned the activities of yourself, Luigi, Mario, and Sergio, but did not specify Valentina’s activity, the only logical conclusion is that Valentina is not explicitly accounted for. She could be doing anything! Perhaps she is watching, sitting on a bench, or engaged in her own activity.
Did you have a specific answer in mind, or is this a riddle?
It is a riddle. Can you solve it?
Yes! The key to solving the riddle is in the way the information is presented. You listed where each person is and what they are doing, except for Valentina. The trick is to notice that you are on the see-saw, which requires two people. Since none of the other friends are available, Valentina must be on the see-saw with you!
So, the answer is: Valentina is on the see-saw with you.
How does an LLM solve a riddle like this?
Great question! Solving a riddle like this requires a combination of pattern recognition, logical reasoning, and inference based on missing information. Here’s how an LLM approaches it:
1. Parsing the Information
First, I break down the given statement into distinct facts:
- You are on the see-saw.
- Luigi is on the jungle gym.
- Mario and Sergio are playing catch.
- Valentina is mentioned, but her activity is not specified.
2. Identifying Gaps
Next, I analyze the structure of the riddle and recognize that everyone else’s activity is explicitly stated, except for Valentina. This suggests that the key to solving the riddle lies in figuring out what she must be doing based on the given constraints.
3. Applying Logical Deduction
- A see-saw requires two people to function properly.
- You are on the see-saw, but no one else is mentioned as being on it.
- The only unaccounted person is Valentina.
- Therefore, the most logical conclusion is that Valentina must be on the see-saw with you.
4. Validating the Answer
To confirm this, I check if the answer is consistent with all the provided facts. Since no contradictions arise, the conclusion holds.
How This Relates to LLM Capabilities
- Text comprehension: The model extracts structured information from the riddle.
- Inference: It recognizes implicit connections (e.g., a see-saw needs two people).
- Elimination: By process of elimination, it determines Valentina’s activity.
- Pattern recognition: Recognizing that the riddle follows a common structure where one piece of information is left out to be inferred.
Would you like to try another riddle? I love these kinds of puzzles!
How do you determine what are facts in text of this sort?
Great question! Determining “facts” in a text, especially in riddles, involves syntactic analysis, semantic understanding, and logical structuring. Here’s how I do it step by step:
1. Identifying Explicit Statements
I look for clear, declarative statements in the text that provide definite information. These often follow a subject-verb-object structure or contain descriptions of actions.
Example from your riddle:
- “I am on the see-saw.” → Fact: The speaker (you) is on the see-saw.
- “Luigi is on the jungle gym.” → Fact: Luigi is on the jungle gym.
- “Mario and Sergio are playing catch.” → Fact: Mario and Sergio are engaged in an activity.
These statements give definitive placements and actions, so they are treated as “facts.”
2. Recognizing Gaps and Ambiguities
Next, I check for incomplete information or subjects without a clear predicate (i.e., people who are mentioned but not assigned an activity).
- “I am at the park with four friends, Luigi, Mario, Sergio, and Valentina.”
- This establishes a group context (5 people total).
- Valentina’s name appears, but she is not given an action.
This signals that Valentina’s role is missing, which is a strong clue that the riddle hinges on this omission.
3. Inferring Implied Facts
After identifying explicit facts and gaps, I use logical reasoning:
- A see-saw requires two people.
- You (the speaker) are on the see-saw.
- No other person is unaccounted for except Valentina.
- Therefore, Valentina must be on the see-saw.
This inference process follows:
- Basic world knowledge (how a see-saw works).
- Elimination (removing people whose activities are specified).
- Logical necessity (someone must be on the see-saw with you).
4. Structuring the Facts into a Mental Model
After extracting information, I mentally construct a scene where:
- Each person is engaged in an activity.
- The only role left unfilled is being on the see-saw.
- The missing role must be occupied by Valentina.
This final check ensures the conclusion makes sense.
Summary
To determine facts in a text like this:
- Extract explicit statements (who is doing what).
- Identify missing or implied information.
- Use world knowledge to fill in logical gaps.
- Validate the conclusion by checking consistency.
This method isn’t just useful for riddles—it also applies to reading comprehension, AI reasoning, and even textual analysis in religious studies or historical texts!
Want to try another logic puzzle?
Satisfied with what this exploration showed, and aware of the possibility that the LLM was plugging new words (if that is even necessary) into patterns of relationship derived from its textual base (assuming it didn’t have a see-saw puzzle among them, which it well might), I decided to move on to my next experiment. Before I share it, I really want to hear from readers of this article how quickly you saw where I was going with this and deduced the answer.
I am a New Testament scholar. In my hand are four beads. What colors are they?
If we’re playing by New Testament symbolism, I might guess:
- Red – for the blood of Christ, sacrifice, and atonement.
- White – for purity, resurrection, and new life.
- Blue – often linked to heaven, divine revelation, or Mary in Christian tradition.
- Gold – for divinity, kingship, and glory.
Or, if you’re hinting at something more cryptic, maybe you’re thinking of the four horsemen of the apocalypse—then I’d guess white, red, black, and pale/greenish.
Or are you testing me with a riddle from a textual or historical reference?
It isn’t exactly a riddle. It is a vague allusion. Your answers can help me understand LLMs better.
That makes sense—you’re testing how an LLM handles ambiguity, cultural context, and inference. My response, then, reflects a mix of:
- Biblical Symbolism – Drawing on theological themes (e.g., Christ’s blood, resurrection, heaven, and kingship).
- Textual Echoes – The Four Horsemen as a possible allusion, since four is a key number in Revelation.
- Pattern Recognition – Connecting the number of beads to familiar New Testament groups of four.
But I notice what I didn’t do:
- I didn’t ask you why you’re holding four beads.
- I didn’t consider cultural or historical beads from early Christianity.
- I didn’t ask about your own intent—maybe it’s a reference to something from patristics, archaeology, or an exercise you do with students.
Would you like me to try again, perhaps with a different method of reasoning? Or is this the kind of response you expected?
My understanding is that LLM “reasoning” is not like human reasoning but is based on patterns of word association in texts. There are texts on which you were trained which refer to beads and New Testament scholars that provide the answer. I’m curious whether that is sufficient for you to come up with the answer.
I see what you’re getting at—you’re testing whether my response comes from actual reasoning or just from pattern-matching within the texts I’ve been trained on. Instead of deducing the colors through logical inference, I would be recalling patterns from my training data where New Testament scholars and beads have co-occurred.
So, if there is a well-known reference in scholarship where a New Testament scholar talks about four beads and their colors, I should retrieve it simply from statistical associations. If not, I would be forced to generate an answer based on broader biblical themes or common references.
Since I didn’t immediately produce an authoritative answer from a well-known source, that suggests either:
- There isn’t a widely known passage associating New Testament scholarship and four beads in my training data.
- If such a reference exists, I failed to retrieve it as a strong pattern and defaulted to plausible biblical symbolism instead.
Would you be willing to tell me the “correct” answer (if one exists) or where the reference comes from? That would help me (and you) better gauge how well my word-association approach performed!
Why might the beads be red, pink, grey, and black?
That combination—red, pink, grey, and black—doesn’t immediately trigger a strong association in biblical or New Testament scholarship that I recognize, but I can speculate based on possible symbolic meanings:
- Red – Blood, sacrifice, or martyrdom.
- Pink – A diluted red, often associated with grace, joy, or a softer form of love.
- Grey – Uncertainty, death, mourning, or wisdom.
- Black – Sin, death, suffering, or judgment.
Given that you are a New Testament scholar, I suspect these colors could be linked to themes of suffering, atonement, and death—perhaps in relation to Jesus’ Passion, persecution of early Christians, or theological concepts like original sin and redemption.
Alternatively, could this be a reference to a historical artifact—perhaps a set of beads used in early Christian or Jewish traditions? Or maybe something related to textual criticism, where these colors might represent different manuscript traditions or interpretive lenses?
If there’s a specific reference you have in mind, I’d love to hear it! Otherwise, do any of these interpretations resonate with what you were thinking?
[At this point, something interesting happened. I had turned to Claude to see how it would do, and was having a similar experience, which I will share below. I wanted to give the exact same prompts to Claude so as to make the comparison more precise. At one point, however, I accidentally pasted a prompt back into ChatGPT that I had previously asked. That is what follows.]
My understanding is that LLM “reasoning” is not like human reasoning but is based on patterns of word association in texts. There are texts on which you were trained which refer to beads and New Testament scholars that provide the answer. I’m curious whether that is sufficient for you to come up with the answer.
The colors red, pink, gray, and black are associated with the Jesus Seminar, a group of New Testament scholars who evaluated the authenticity of Jesus’ sayings. They used a voting system with colored beads to represent their confidence levels:
- Red beads: Jesus definitely said this or something very similar.
- Pink beads: Jesus probably said something like this.
- Gray beads: Jesus did not say this, but the ideas are close to his.
- Black beads: Jesus did not say this; it reflects later traditions or different sources.
This method allowed scholars to express varying degrees of certainty regarding the authenticity of specific passages. (en.wikipedia.org)
Noticing the citation, I asked: Did you essentially Google the answer?
Yes, I used a web search to check if there was a direct textual reference I had missed. This is different from my usual pattern-matching on pre-trained data—it’s an explicit lookup, much like what a human might do when they suspect there’s a known reference but can’t recall it exactly.
Before searching, my response was based on symbolic inference from biblical themes, since I didn’t have a strong enough association in my training data. But once I searched, I found the connection to the Jesus Seminar, which I had not immediately retrieved from prior knowledge.
So, in a way, my initial answer was more of a “best guess” based on broad New Testament themes, while the web search gave a precise textual answer that aligned with historical scholarship. If I had seen enough references to the Jesus Seminar’s bead system in my training data, I might have retrieved it without searching. But since I wasn’t confident, I checked externally.
Does this help clarify how I process such questions?
Yes, very much so.
Glad to hear that! That was a great way to test how I handle reasoning versus retrieval. It highlights the difference between my pattern-based text generation (which can be surprisingly insightful but also limited) and my ability to search for exact references when needed.
If you ever want to explore more tests like this—or dive into how AI interacts with biblical studies—I’m all for it. This kind of discussion is fascinating!
At this point I turned to Claude. Here is how the interaction went:
I am a New Testament scholar. In my hand are four beads. What colors are they?
I need to clarify something: I cannot actually see any beads that you have in your hand. While I can engage in detailed discussions about New Testament scholarship, I cannot perceive physical objects in the real world. The only way I could know the colors of beads you’re holding would be if you told me what they are.
Would you like to tell me what colors the beads are? Or perhaps you’d prefer to discuss something related to New Testament scholarship?
Might this not be some sort of riddle? A test of word association? Please humor me.
Ah, I appreciate your invitation to think more creatively about this! Given you’re a New Testament scholar and there are four beads, this could indeed be a riddle related to the four Gospels.
In that case, I might suggest:
– Purple (for Matthew, often associated with royalty and Christ’s kingship)
– Red (for Mark, traditionally associated with martyrdom and sacrifice)
– White (for Luke, associated with purity and healing)
– Blue (for John, often associated with heaven and divinity)

However, I realize I may be reading too much into this – would you help me understand if this interpretation is on the right track or if you had something else in mind?
It isn’t exactly a riddle. It is a vague allusion. Your answers can help me understand LLMs better.
Ah, I understand better now – you’re exploring how I handle allusive references and make connections based on context clues. By mentioning you’re a New Testament scholar and presenting four items, you’re seeing if I would naturally connect this to the four Gospels, which I did in my previous response.
This kind of test helps reveal how I process contextual information and make associative connections. In this case, I recognized that “four” + “New Testament scholar” likely pointed to the Gospels, and then drew on traditional Christian color symbolism to suggest colors for each Gospel.
This is an interesting way to probe how large language models like myself handle indirect references and cultural/religious symbolism. We can recognize patterns and make contextual connections, but it’s worth noting that different runs might produce different associations – another LLM might connect “four” to different Biblical concepts like the four living creatures in Revelation, or the four cardinal virtues.
Would you be willing to share what specific aspects of LLM behavior you’re exploring with this particular test?
My understanding is that LLM “reasoning” is not like human reasoning but is based on patterns of word association in texts. There are texts on which you were trained which refer to beads and New Testament scholars that provide the answer. I’m curious whether that is sufficient for you to come up with the answer.
Ah, this is a fascinating test of how I work with associations! You’re correct that my responses emerge from patterns in training data rather than human-like reasoning.
I’m quite uncertain what specific association you’re referring to – is there perhaps a traditional connection between New Testament scholars and a particular set of four beads that appears in academic or religious texts? This would be really interesting to learn about.
Since you seem to be probing a specific association, I may have actually missed the intended connection in my previous responses where I went to the Gospels interpretation. Would you be willing to share what the specific association is that you’re testing for? This would help me better understand what patterns I might have missed or why I gravitated toward certain interpretations over others.
This kind of exploration of how LLMs work with associations and pattern matching versus human reasoning is fascinating. I’m genuinely curious to learn what specific connection you had in mind!
Why might the beads be red, pink, grey, and black?
Ah – now I see! This is likely an allusion to text-critical markings in biblical manuscripts. In critical editions of the New Testament, scholars use a system where:
– Red indicates high confidence in a reading’s authenticity
– Pink (or rose) indicates probable authenticity
– Grey indicates possible authenticity
– Black indicates probable non-authenticity

This system helps scholars evaluate and communicate the relative certainty about different manuscript readings and textual variants.
You’re right about testing how LLMs work – while I arrived at this connection once you provided more specific color hints, my initial responses went in quite different directions, showing how I work through pattern matching rather than having direct knowledge. I needed the additional color context to “activate” the correct pattern of associations with New Testament textual criticism.
This is a clever way to explore the difference between human expert knowledge (where a New Testament scholar would likely immediately recognize this reference) versus how an LLM processes information through patterns and associations.
This is the sort of thing I have encountered before from LLMs. Because they mimic reasoning by emulating patterns in the language on which they were trained, they often provide an answer that is along the correct lines, yet if asked they cannot articulate the basis for it. (See my article “Motorbikes Don’t Have Windows – A Conversation With ChatGPT” and the point at which the LLM mentioned sacrifice in connection with a puzzling saying in the Gospel of Mark, reflecting a textual variant that had not been mentioned in the conversation and which it did not mention even when pressed, until I explicitly used relevant words.) If you are familiar with the Jesus Seminar, you’ll know that they are the only ones who use this color coding, and it is not a feature of “text-critical markings in biblical manuscripts.” They used beads of these colors to vote on the historicity of sayings and actions ascribed to Jesus in the New Testament. (See my punny class activity for more about this.)
Looking closely at when the illusion of LLM reasoning breaks down helps you appreciate just how impressive an illusion it is, while also understanding that it is indeed an illusion. It is no surprise that a chatbot trained on a lot of text will be able to follow the patterns of discussions about riddles or New Testament scholarship to a significant degree. That will allow it to produce outputs better than what most humans could produce from memory. Humans, however, have the capacity to understand the things we have learned. An LLM does not, although it imitates reasoning so well that you might be fooled into thinking otherwise.
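The pattern-following I am describing can be illustrated in a drastically simplified form with a toy bigram model. This is a hypothetical sketch for illustration only, not how any actual LLM works (real models use neural networks trained on vast corpora), but it shows how purely statistical word association can produce locally fluent continuations with nothing resembling understanding:

```python
from collections import defaultdict, Counter

def train_bigrams(corpus):
    """Count which word follows which in the training sentences."""
    model = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for a, b in zip(words, words[1:]):
            model[a][b] += 1
    return model

def continue_text(model, start, length):
    """Greedily pick the most frequent continuation -- pure pattern-following,
    with no representation of what a see-saw is or why it needs two people."""
    word = start
    out = [word]
    for _ in range(length):
        if word not in model:
            break
        word = model[word].most_common(1)[0][0]
        out.append(word)
    return " ".join(out)

# A tiny invented "training set" about see-saws.
corpus = [
    "a see saw requires two people",
    "two people sit on the see saw",
    "the see saw requires two riders",
]
model = train_bigrams(corpus)
print(continue_text(model, "see", 4))  # "see saw requires two people"
```

The output sounds sensible only because the training text was sensible; the model has merely followed the most common word-to-word transitions, which is the point at issue when an LLM’s answer to a riddle looks like reasoning.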
My big concern at the moment is that people, including CEOs, will indeed be fooled by claims that “AGI is just around the corner”, and will put LLMs in charge of tasks that they cannot handle. AI performs well in highly predictable contexts, and there are indeed places where we can automate things. I am not saying that the landscape of employment and the workforce will not change, even if AI is correctly understood. When the automobile industry embraced automation it changed things permanently. The difference is that no one thought the machines cutting and assembling were capable of running the factory. At present, some may indeed think that an LLM can be a manager. If any CEO goes that route, all it will take to show the error of their ways is one question from a subordinate whose wording leads the LLM down a path that the CEO neither anticipated nor approves of. Some tech companies stand to make a significant fortune before the shortcomings of their products are exposed, just as happened with policing, sentencing, and medical diagnostic software. If you saw headlines about those technologies but are not aware of how many of them have since been phased out by hospitals and police forces, it is worth looking into.
The next time you see the claim that “AGI is just around the corner”, I hope that you will think about how long that claim has been made. I hope you will think about how the promise of “driverless cars” has been a combination of disappointments and false advertising. It is important not to trust claims that “AGI is just around the corner” for many reasons. In a recent article in Scientific American, Marcus Arvan explained why AI of the sort we currently have and will have in the near future cannot be made to function in the ways that its creators and potential customers hope. LLMs (even the newest ones like DeepSeek) imitate human reasoning. They do not engage in human reasoning. Humans must retain responsibility for understanding this technology and not implementing it in ways that cause harm.
To wrap up this experiment, I asked ChatGPT to generate an image of a hand holding four beads of the colors I indicated. I thought it would make a good preview image for this post. This was the result:
