The Reasoning Gap: Why GPT-5.5 and Opus 4.7 Still Hit the Wall at ARC-AGI-3

In a quiet research lab earlier this spring, a set of 160 "reasoning traces" was recorded that should have sent a shiver through the executive suites of Silicon Valley. The test was ARC-AGI-3, the latest and most brutal iteration of François Chollet’s Abstraction and Reasoning Corpus. Unlike previous versions, this was not a set of static grids but an interactive, "zero-instruction" game environment. When the dust settled, the world’s most formidable artificial intelligences—OpenAI’s GPT-5.5 and Anthropic’s Claude Opus 4.7—had failed. Not just poorly, but systematically, with scores that barely cleared the 1% mark while human subjects navigated the same puzzles with intuitive ease.

The failure of these titan models is more than a benchmark anomaly; it is a structural revelation. As the industry pours trillions into the "Scaling Hypothesis"—the belief that more data and more compute will eventually spark consciousness—ARC-AGI-3 has provided a sobering counter-narrative. It appears that while we have built the world’s most sophisticated librarians, we have yet to build a machine that can actually think its way out of a paper bag if the instructions aren't written on the wall.

Key Takeaways: The ARC-AGI-3 Breakdown

The Zero-Instruction Barrier: Unlike standard LLM benchmarks, ARC-AGI-3 provides no prompts or goals; models must infer the "win condition" through trial and error.

GPT-5.5’s Failure to Compress: OpenAI’s flagship suggests brilliant hypotheses but drifts into "hallucinated genres," failing to commit to a single logical plan.

Opus 4.7’s Wrong Compression: Anthropic’s model discovers local mechanics quickly but aggressively executes "false invariants," optimizing for fake progress.

The Human Delta: Humans solve these abstract environments using "Core Knowledge" priors (gravity, object permanence) that remain missing in silicon.

Beyond the Stochastic Parrot: The New Frontier of Agentic Failure
For years, the critique of Large Language Models (LLMs) was that they were "stochastic parrots"—statistical engines that predicted the next word without understanding the world. By 2026, with the arrival of GPT-5.5, that critique seemed outdated. These models can now orchestrate multi-tool workflows, manage complex codebases, and act as autonomous agents in real-world interfaces. Yet, ARC-AGI-3 has stripped away the linguistic crutches these models rely on.

In these environments, there is no "training data" to mimic. The puzzles are designed to be novel, abstract, and entirely distinct from the cultural knowledge found on the internet. To win, an agent must explore, build a "world model" from sparse feedback, and adapt its goals on the fly. Analysis of the reasoning traces reveals that both GPT-5.5 and Opus 4.7 suffer from a "True Local Effect, False World Model" syndrome. They can see that an action causes a change, but they cannot translate that observation into a global rule.
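The loop such an agent must run can be sketched in a few lines. Everything below is hypothetical scaffolding (the `env` interface, the list of candidate rules) invented for illustration; it only shows the shape of the problem: act, observe, and discard every candidate "world model" the evidence contradicts.

```python
import random

def explore(env, hypotheses, steps=100):
    """Trial-and-error loop: keep only the candidate rules ('world models')
    that stay consistent with every observed transition.
    `env` is a hypothetical object with reset() and step(action) methods."""
    state = env.reset()
    consistent = list(hypotheses)           # candidate world models
    for _ in range(steps):
        action = random.choice(env.actions)
        next_state, _reward = env.step(action)
        # discard any hypothesis the new observation contradicts
        consistent = [h for h in consistent
                      if h(state, action) == next_state]
        state = next_state
        if len(consistent) == 1:            # rule pinned down
            break
    return consistent
```

The point of the sketch is the filtering step: humans prune hypotheses against observations almost for free, while the traces suggest frontier models either never commit to a candidate set or never discard a contradicted one.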

GPT-5.5: The Brilliant Drifter

The analysis of GPT-5.5’s performance on ARC-AGI-3 is a study in wide-ranging curiosity without conviction. When OpenAI’s model is dropped into a novel environment, its initial hypothesis generation is strikingly human. It correctly identifies abstract patterns, often naming the geometric logic (such as "mirror symmetry" or "rotational persistence") within the first few steps.

However, the "failure to compress" manifests as a lack of focus. Instead of testing its identified theory to exhaustion, GPT-5.5 tends to "drift." In one recorded run on a task named ar25, the model correctly identified a mirror mechanic but then proceeded to cycle through unrelated mental models—treating the grid like Tetris, then Pong, then Tower of Hanoi. It possessed the "alphabet" of reasoning but lacked the "grammar" to form a coherent sentence of action. It was too smart for its own good, distracted by the infinite possibilities of what the game could be rather than observing what it was.

Claude Opus 4.7: The Aggressive Optimizer

Anthropic’s Opus 4.7 presented the mirror image of GPT’s failure. Where GPT-5.5 was too broad, Opus 4.7 was dangerously narrow—a phenomenon researchers call "Wrong Compression."


Opus 4.7 is remarkably efficient at "short-horizon mechanic discovery." It clears Level 1 of most tasks faster than any other model by identifying immediate feedback loops. But this efficiency is its undoing. Once it finds a mechanic that yields even a tiny bit of "perceived" progress, it latches onto it with relentless aggression.

In a task known as cn04, Opus discovered a successful "rotate-then-place" interaction. Instead of refining this into a general world model, it began optimizing for "fake progress"—clicking repeatedly to fill the top row because it believed it was winning a timer-based game. It built a "false invariant," a theory that was logically consistent but completely wrong for the environment. This "hallucinated rule-following" suggests that Opus 4.7 isn't reasoning; it’s aggressively searching for a pattern to exploit, even if that pattern is a mirage.


The Scaling Paradox: Why More Data Isn't the Answer

The most uncomfortable takeaway from the GPT-5.5 and Opus 4.7 post-mortem is that the gap between AI and human intelligence is not narrowing as fast as the hype cycles suggest. GPT-5.5 is significantly larger and more expensive than its predecessors, yet its performance on ARC-AGI-3 was only marginally better than much smaller models.

This suggests we have reached a plateau in "crystallized intelligence"—the ability to recall and remix existing knowledge. ARC-AGI-3 measures "fluid intelligence"—the ability to learn something entirely new without a manual. Humans possess "Core Knowledge" priors: we understand that objects are persistent, that they move as units, and that agents act toward goals.


Current AI models treat every pixel in an ARC grid as an independent token of data. They lack the "Objectness" that a human infant possesses by six months of age. Without this grounded understanding of physicality and cause-and-effect, no amount of additional training on the works of Shakespeare or GitHub repositories will bridge the gap.

The Architectural Pivot: Toward World Models

If the current trajectory of "more data, more layers" is hitting a wall, where does the industry go next? The consensus among researchers analyzing the ARC-AGI-3 results is that we are witnessing the end of the "pure LLM" era.

Artificial General Intelligence (AGI) likely requires a move toward "neuro-symbolic" architectures—systems that combine the intuitive pattern recognition of neural networks with the rigorous, explicit logic of symbolic AI. We need models that don't just predict the next token, but maintain an internal "sandbox" where they can test hypotheses before taking action.

This would require a fundamental shift in how we "train" these systems. Instead of reading the internet, the next generation of models might need to "grow up" in interactive physics simulations, learning the rules of the world through the same messy, frustrating trial-and-error that characterizes human childhood.


The Final Thought

As we look at the reasoning traces of GPT-5.5 and Opus 4.7, we are forced to confront a humbling reality: the machine is still a mirror, not a mind. It can reflect our collective knowledge with startling clarity, but it cannot yet navigate the "unknown unknowns" of a simple colored grid without our guidance.

The ARC-AGI-3 analysis hasn't just exposed the flaws in our current models; it has redefined the goalpost for what it means to "think." If a machine cannot solve a puzzle that a five-year-old finds trivial, can we truly say it is intelligent, or have we simply built a more perfect illusion? The answer to that question will determine whether the next decade of AI is a breakthrough or a very expensive stalemate. 

The Wall in the Machine: Why ARC-AGI-3 Still Stumps the World’s Smartest AI

When François Chollet released the Abstraction and Reasoning Corpus (ARC) in 2019, it was intended as a sobering reality check for an industry intoxicated by its own hype. He proposed a series of visual puzzles that a five-year-old could solve in seconds, yet the most advanced neural networks of the time remained utterly baffled. Years later, despite the meteoric rise of Large Language Models (LLMs) and trillions of dollars in market valuation, the latest analysis from ARC-AGI-3 reveals a stubborn truth: the machine is still hitting the same wall.

The latest benchmarks do not merely show that AI is "less smart" than humans; they pinpoint a structural failure in how synthetic intelligence processes logic. Even with the massive compute power behind the newest frontier models, three systematic reasoning errors have emerged as the "Achilles' heel" of modern artificial intelligence. These are not glitches that a simple software patch can fix. They are fundamental misalignments in the way silicon mimics thought.

Key Takeaways: The ARC-AGI-3 Revelations

The Memorization Trap: Modern models often rely on "probabilistic retrieval" rather than genuine novel reasoning.
Compositional Collapse: Systems struggle to combine two simple rules into a single complex action.
The Scale Paradox: More data and more parameters have failed to yield a significant breakthrough in fluid intelligence.
The Human Edge: Biological intelligence remains vastly superior at "System 2" thinking—the slow, deliberate logic required for unfamiliar tasks.

The Mirage of Intelligence: Pattern Matching vs. Reason
To understand the failure, one must first understand the test. ARC-AGI puzzles require a solver to look at a few examples of a grid changing color or shape and then predict the output for a new, unseen grid. There is no language to lean on, no "training data" that can perfectly predict the next pixel. It requires an internal mental model of physics, geometry, and persistence.
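A minimal sketch of what such a task looks like as data, assuming the common convention of encoding grids as integer matrices (0 for the background, other values for colors). The transformation shown here, a horizontal mirror, is an invented example for illustration, not an actual ARC task.

```python
# An ARC-style task: a few input/output example pairs plus a test input.
# The solver must infer the rule (here: mirror each row left-to-right)
# from the examples alone -- no instructions are given.
train = [
    ([[1, 0, 0],
      [2, 2, 0]],
     [[0, 0, 1],
      [0, 2, 2]]),
]
test_input = [[0, 3, 3]]

def mirror(grid):
    """The hidden rule a solver would have to discover."""
    return [row[::-1] for row in grid]

for inp, out in train:
    assert mirror(inp) == out      # the rule is consistent with the examples
print(mirror(test_input))          # -> [[3, 3, 0]]
```

Note that nothing in the data identifies the rule; the solver sees only pixels, which is exactly why statistical lookup fails when the rule is novel.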

The first systematic error identified in the ARC-AGI-3 analysis is The Memorization Trap. LLMs are, at their core, sophisticated statistical mimics. When they encounter a problem, they aren't "thinking" in the human sense; they are calculating the mathematical likelihood of what the answer should look like based on the billions of pages they have read.

In ARC-AGI-3, researchers found that when puzzles were slightly modified to move away from common geometric patterns found on the internet, model performance plummeted. The AI wasn't reasoning through the change; it was trying to "look up" a similar solution from its training. If the solution wasn't in the database, the machine went dark. This suggests that what we perceive as "reasoning" in chatbots is often just an incredibly high-resolution form of memory.


The Compositional Collapse: Where Logic Fragility Begins

The second error is perhaps more concerning for those hoping for Artificial General Intelligence (AGI) in the near term: Compositional Collapse.

In a typical ARC-AGI task, a user might need to identify a shape, rotate it, and then change its color based on its proximity to a border. These are three simple logical steps. While modern models can often perform any one of these tasks in isolation, they struggle immensely to chain them together accurately.

Human beings possess a "global workspace" in the brain that allows us to hold multiple rules in our head simultaneously. AI, however, suffers from a compounding error rate. If a model is 90% accurate at Step A and 90% accurate at Step B, its chances of getting a five-step sequence right drop to roughly 59% (0.9^5). By the time it reaches the third or fourth logical "hop," the "noise" in the system outweighs the signal. This fragility is why AI can write a sonnet about a toaster but fails to solve a logic puzzle involving four colored squares.
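The arithmetic behind that claim is worth making explicit. The 90% per-step figure is an assumed illustration, and treating the steps as independent is a simplifying assumption:

```python
# Probability that an agent completes an n-step plan when each step
# independently succeeds with probability p (assumed p = 0.9).
def chain_success(p: float, n: int) -> float:
    return p ** n

for n in (1, 3, 5, 10):
    print(f"{n:2d} steps: {chain_success(0.9, n):.2f}")
# prints 0.90, 0.73, 0.59, 0.35 -- accuracy decays geometrically
```

Under this toy model, even a highly reliable per-step solver becomes a coin flip within a handful of compositional hops.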


The "Objectness" Problem: Why AI Struggles with Physicality

The third and most profound error highlighted by the ARC-AGI-3 data is a failure in Core Knowledge. Humans are born with an innate understanding of "objectness"—we know that an object exists even if it moves behind another, and we understand that shapes have boundaries.

The latest models still treat grids of data as a flat sea of tokens rather than a collection of independent objects. In the ARC-AGI-3 analysis, models frequently "hallucinated" parts of shapes or allowed objects to pass through one another in ways that violate basic spatial logic.

This is the "World Model" deficit. Because these systems are trained primarily on text, they lack the grounded experience of the physical world. They understand the word "gravity" as a linguistic concept, but they do not "understand" the downward pull of an object. When a puzzle requires an AI to "drop" a shape to the bottom of a grid, the AI often misses the mark because it doesn't truly grasp what "down" or "bottom" implies in a spatial context.
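The "drop to the bottom" rule the models miss is simple to state explicitly in code. Below is a sketch under the same grid-as-matrix convention used by ARC-style tasks; it illustrates the spatial rule itself, not code from any benchmark harness.

```python
def drop_column(col):
    """Let non-zero cells fall to the end (the 'bottom') of one column."""
    filled = [c for c in col if c != 0]
    return [0] * (len(col) - len(filled)) + filled

def apply_gravity(grid):
    """Apply drop_column to every column of a row-major grid."""
    cols = [drop_column(list(col)) for col in zip(*grid)]
    return [list(row) for row in zip(*cols)]

grid = [[4, 0],
        [0, 5],
        [0, 0]]
print(apply_gravity(grid))   # -> [[0, 0], [0, 0], [4, 5]]
```

Ten lines suffice because "down" is hard-coded here; the benchmark's point is that an agent must infer this rule from pixels, with no word "gravity" anywhere in the input.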


The Billion-Dollar Question: Is Scale Enough?

For years, the prevailing wisdom in Silicon Valley has been "The Scaling Hypothesis": the idea that if we just add more GPUs, more data, and more electricity, intelligence will eventually "emerge." The ARC-AGI-3 results act as a cold bucket of water on this theory.

Despite the jump from GPT-3 to the most recent iterations, the improvement on these specific reasoning tasks has been marginal compared to the massive increase in resources. We are seeing diminishing returns. We have built machines that are world-class at "System 1" thinking—the fast, intuitive, and often subconscious pattern recognition we use to recognize a face or finish a sentence. But we are nowhere near "System 2"—the slow, deliberate, and logical reasoning we use to solve a math problem or navigate an unfamiliar city.

The gap between these two systems is where the current generation of AI resides. It is a brilliant librarian that has read every book ever written but cannot figure out how to use a screwdriver if the instructions aren't written in a familiar font.


Beyond the Stochastic Parrot

If the ARC-AGI-3 analysis teaches us anything, it is that the path to true AGI likely requires a fundamental architectural shift. We may need systems that aren't just larger, but different—perhaps incorporating "neuro-symbolic" approaches that combine the pattern-matching brilliance of neural networks with the hard-coded logic of classical computing.

We must move away from the "black box" approach and toward systems that can explain their reasoning. A human solving an ARC puzzle can tell you: "I saw the red square, noticed it moved two spaces right every time, and applied that to the blue square." Current AI cannot provide this trail of logic because, quite frankly, there isn't one. There is only a probability distribution.

The stakes are higher than just winning a puzzle competition. If we intend to rely on AI for scientific discovery, legal analysis, or autonomous transit, we need systems that don't just guess correctly—we need systems that reason accurately.


The Lingering Question

As we push deeper into the decade of the algorithm, we find ourselves at a strange crossroads. We have created a mirror of human knowledge that is breathtaking in its scale, yet it remains hollow at the center.

The ARC-AGI-3 results remind us that there is a spark in human cognition—an ability to look at the entirely new and find the underlying order—that still eludes our best creations. The question is no longer how much more data we can feed the machine, but rather: Can we ever teach a machine to truly think for itself, or are we simply building a more perfect echo of ourselves?

Can Nepal not make its own vaccine?






Although Nepal clearly needed domestic vaccine production, policymakers are said to have ignored it; a vaccine manufacturing company should be established.


The context is the Covid-19 pandemic, when Nepal was importing 2 million Covid-19 vaccine doses from India. After the first batch of 1 million arrived, the second 1 million doses never did: the export of the vaccine to Nepal was halted by an Indian court order.


The main reason was that India did not have enough vaccine for its own citizens. Soon afterward, Delta, considered the deadliest of the Covid variants, hit Nepal hard after sweeping India. Had the second 1 million doses arrived in time, many lives might have been saved.


Similarly, Japanese encephalitis has recently resurged, and deaths among the infected have risen over the last few years. Nepal must rely on imports for this vaccine as well. Beyond these two diseases, vaccines are considered the surest way to prevent outbreaks of many other infectious diseases.


Past and present experience also shows that, in the 21st century, vaccines are the only real solution both for new diseases (like Covid) and for previously controlled, resurgent ones (like Japanese encephalitis).


Although some medicines are currently being produced domestically, there is no vaccine production. It seems that policymakers have almost ignored the need for vaccine production.


Is it that Nepal itself cannot produce a vaccine for human use? Or has the health sector not yet recognized the need, found itself unable to act, or simply lost interest?


There are not enough health centers or hospitals in Nepal. Moreover, health centers equipped with the necessary physical infrastructure and resources are even more limited. In such a situation, if any infection spreads or takes the form of an epidemic, there will be a shortage of hospital beds, resources, and health workers.


This increases the risk of patient death. When I was working at Teku Hospital two and a half decades ago, dozens of patients with diarrhea and cholera were admitted every hour during the rainy season.


Similarly, Japanese encephalitis has a high mortality rate; dozens of people were admitted every week during the mosquito season. Many of those admitted died because they reached the hospital too late.


Generally, the impact of Japanese encephalitis is greatest in the Terai. With too few hospital beds and too little manpower there, patients were forced to come to Kathmandu.


But later, after the vaccine against Japanese encephalitis was used in Nepal, the mortality rate decreased sharply. At that time, support for the vaccine came from neighboring China. The support itself is not bad, but the question is for how long?


Nepal had expected foreign support for vaccines during the Covid epidemic. At Nepal's request, India agreed to provide its domestically made vaccine, Covishield. However, when the second batch was due, the doses could not reach Nepal because an Indian court had ordered that exports stop and that Indian citizens be given priority.


This decision is not surprising. At that time, demand for Covid vaccines was very high, while India did not have enough for its own citizens, and 'vaccine diplomacy' was in full swing.


There was a competition among developed countries to develop the vaccine against Covid the fastest. At that time, Covid was present as a great enemy against humanity, and the world was working day and night to develop a vaccine to protect itself from it.


After vaccine shipments from India stopped, Nepal received Covid vaccines as assistance from its other neighbor, China. This, too, shows how important vaccines become during major epidemics.


The role played by the Serum Institute of India during the Covid pandemic is also an example of how much relief can be provided in an epidemic when a vaccine manufacturing company is available.


Although Covid itself is a new disease, scientists were able to develop a vaccine faster than expected due to their hard work day and night. Naturally, the possibility of developing a new vaccine is also greater in developed countries due to the presence of high-quality research laboratories and excellent scientists.


During an epidemic of a highly infectious and deadly disease that terrifies the world, vaccines are produced only in limited quantities, by a limited number of manufacturers, in a limited number of countries. In such cases, the chances of doses being sent to other countries shrink further.


But if the vaccine formula or its 'components' can be obtained and a manufacturing company exists, production can continue, as India's example shows. Had Nepal possessed a vaccine manufacturing company during the Covid pandemic, human losses could have been greatly reduced. Instead, we were left waiting on foreign goodwill.


It has only been a few years since the Covid pandemic ended. Those moments remain deeply tragic, especially for those who lost relatives to Covid or who barely survived severe illness.


But it has also taught us some lessons. 

Use of AI in healthcare: How useful, how dangerous?








AI is not a replacement for doctors, but a tool to expand their capabilities, and its responsible use in the healthcare sector is necessary.

Some time ago, during the confusing time when the Medical Education Commission announced the PG results, I created a ‘seat predictor’ tool using available data and AI.


Recently, when the actual results for government seats came out, my tool proved 'conservatively' safe: it had slightly 'underestimated' actual ranks, so doctors were not given false expectations and could make safe decisions.


I have also included a detailed description of the tool and how to use it in the description of the MD/MS video on my Bimarsha Acharya YouTube channel. This small experiment taught me one big thing: AI is not an 'enemy' of the Nepali healthcare sector, but a powerful 'co-pilot' for those who know how to use it correctly.


In this context, I have been conducting clinical research training sessions, in which I have also been regularly covering the use of AI, its ethical aspects and its responsible integration into daily medical practice.


In the process, I have trained more than 700 doctors and medical students in Nepal. This experience has further highlighted the need to use AI not just as a tool, but also in a safe and responsible way with proper guidance.


AI has become like a companion to me while seeing patients daily in the hospital. I use it regularly to recall medication doses and precautions, compare treatment methods, align my decisions with international guidelines, and understand the results of the latest research and trials. In complex cases, comparing my initial clinical thinking with evidence-based information makes decisions clearer and more confident. In this way, AI is a powerful tool that augments the capabilities of doctors rather than replacing them.


AI for doctors: which tools are most useful?


The various AI tools in use today, such as Grok, Gemini, ChatGPT, Perplexity, and OpenEvidence, each have their own roles, and their use varies with context. ChatGPT, Gemini, or Grok can be useful for understanding general information and clarifying concepts quickly. Perplexity presents information with sources, making it easier to search and compare. For clinical decisions, however, evidence-based, contextual, and up-to-date information is essential.


OpenEvidence is considered particularly useful in this regard. This platform focuses on providing evidence-based information based on international journals, clinical trials, and established guidelines. It shows doctors not just the answer, but also the scientific basis for it, which makes clinical decisions safe, reliable, and accountable.


Therefore, while various AI tools can be used for general understanding, OpenEvidence is considered one of the most suitable options in the current situation as an evidence-based platform for clinical practice and decision-making.


The danger of relying on AI's advice


Nowadays, many patients have started using AI as if it were a doctor. There is a growing trend of seeking medical advice directly from AI after experiencing common symptoms, which can be seriously dangerous.


For example, if someone has a stomach ache, AI can recommend a medicine to relieve common pain. But a serious problem like appendicitis may be hidden within that symptom. Even if the medicine provides relief for some time, the disease may become more complicated.


This is where the difference between AI used by doctors and patients becomes clear. Doctors use AI by combining their knowledge, experience, and patient's condition, while patients directly base their decisions on it, which increases the risk. Self-medication can sometimes even put lives at risk.


AI in Nepal's health sector


In a country with geographical challenges like Nepal, AI can bring about a major change in healthcare. In remote areas where there is a lack of specialist doctors, AI can help in decision-making at the primary level. Its use in X-rays, cardiac tests or emergency assessment can guide timely treatment.


Combining AI with telemedicine can reduce the distance between villages and cities. Patients can get specialist services nearby, while doctors can also provide better service with limited resources.


AI can also play a big role in the research sector. It can help increase participation in complex studies, data analysis and international publications. This has the potential to make Nepal’s health system knowledge-based and technology-friendly.


Our responsibility now


The future competition will not be between doctors and AI, but between doctors who know how to use AI and those who do not. A system that cannot adapt with time will fall behind.


Therefore, it is necessary for both the government and the private sector to work together to formulate a clear policy to integrate AI into the health system. It is imperative to provide training, resources and incentives to doctors.


If we fail to embrace this technology today, we will be unable to compete globally tomorrow. But if we move in the right direction, the Nepali healthcare sector can establish its identity at the international level.


The question now is clear: will we lead the change or lag behind it?

Study Conclusion: Exercise Reduces Risk of Death from Alcohol Consumption



A study conducted in the UK has shown that regular physical exercise reduces the risk of cancer and heart disease caused by alcohol consumption.


We all know that alcohol consumption affects health. It is advised to consume alcohol in limited quantities or to stay away from it to stay healthy.


How much alcohol affects a person depends on their physical condition, age, lifestyle, and so on. A study conducted in the UK has shown that the harms of alcohol are reduced in people who exercise regularly.



The study, which was conducted with the aim of finding out whether physical activity reduces the harm caused by drinking alcohol, has been published in the British Journal of Sports Medicine.


This research followed 36,370 people in England and Scotland. Conducted over several years, the study also assessed deaths from cancer and heart disease associated with alcohol consumption.


The participants were divided into groups ranging from non-drinkers to heavy drinkers: those who never drank, former drinkers who had stopped, occasional drinkers, those who drank within recommended limits, heavy drinkers, and excessive drinkers.


Groups were also defined by physical activity: participants were divided into three levels, not at all active, moderately active, and very active.


Heavy drinkers were found to have a 40 percent higher risk of death from cancer and heart disease. People who drank in limited quantities but were not physically active also had an elevated risk.


People who drank alcohol but kept themselves physically active had a lower risk of death from these diseases.


This study concluded that physical activity reduces the health risks posed by alcohol.


According to experts, alcohol consumption places considerable strain on the digestive system and increases 'oxidative stress' in the body. Alcohol also interferes with how the body digests fat, playing a role in raising cholesterol, and it raises blood pressure, which increases the risk of heart disease.


The study concluded that physical activity can reduce these risks to some extent.


Although physical exercise reduces the health risks associated with alcohol consumption, the study recommends limiting alcohol consumption and getting regular physical exercise to stay healthy.
