What is the ARC-AGI benchmark?
The ARC-AGI benchmark assesses an AI system's ability to generalise: to identify patterns and rules from a small number of examples. ChatGPT relies on vast datasets of human text for model training, whereas on this benchmark the o3 model has to work from a much smaller number of examples. For instance, it solves puzzles made of grids of coloured squares, inferring a rule from three examples and applying it to determine the fourth element. The format resembles the IQ tests that people might recall from school.
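To make the puzzle format concrete, here is a minimal sketch of one such task in Python. The grids, the colour codes and the `mirror` rule are all invented for illustration and are not taken from the benchmark itself. Each grid is a list of rows, and each integer stands for a colour.

```python
# Hypothetical toy task in the spirit of ARC-AGI (not an actual benchmark item).
# A grid is a list of rows; each integer is an assumed colour code.

def mirror(grid):
    """Candidate rule: flip every row left to right."""
    return [list(reversed(row)) for row in grid]

# Three worked examples, each an (input grid, output grid) pair.
examples = [
    ([[1, 0], [0, 0]], [[0, 1], [0, 0]]),
    ([[2, 2, 0]], [[0, 2, 2]]),
    ([[0, 3], [3, 0]], [[3, 0], [0, 3]]),
]

# The fourth puzzle: only the input is given, the output must be worked out.
test_input = [[4, 0, 0], [0, 5, 0]]

def fits_all(rule, pairs):
    """True if the rule reproduces the output of every worked example."""
    return all(rule(inp) == out for inp, out in pairs)

if fits_all(mirror, examples):
    prediction = mirror(test_input)
    print(prediction)
```

The point of the sketch is that nothing about the rule is given in advance: it has to be inferred from the three pairs and only then applied to the unseen grid.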
How OpenAI o3 Stands Out
What makes o3 distinctive is its flexibility. Although the method has not been disclosed, experts think that the model uses a "chain of thought" approach. This means considering several chains of reasoning and settling on the simplest, or "weakest", rules that fit the task. By favouring simpler rules, the o3 model improves its chances of coping with unfamiliar challenges.
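The idea of preferring the simplest rule that fits can be sketched in a few lines of Python. This is only an illustration of the general principle, not o3's actual method, which is undisclosed. The candidate rules, their complexity costs and the example grids are all assumptions made up for the sketch.

```python
# Illustrative only: choose the simplest candidate rule consistent with the examples.
# The rules and their "cost" values below are hypothetical.

def identity(grid):
    return [row[:] for row in grid]

def flip_h(grid):
    """Mirror each row left to right."""
    return [row[::-1] for row in grid]

def flip_v(grid):
    """Reverse the order of the rows."""
    return [row[:] for row in grid[::-1]]

def rotate_180(grid):
    """Composite rule: two flips, so it is given a higher assumed cost."""
    return flip_v(flip_h(grid))

# (assumed complexity cost, name, rule)
CANDIDATES = [
    (4, "rotate_180", rotate_180),
    (2, "flip_v", flip_v),
    (2, "flip_h", flip_h),
    (1, "identity", identity),
]

def simplest_consistent_rule(examples, candidates):
    """Return (name, rule) for the lowest-cost rule that fits every example."""
    for cost, name, rule in sorted(candidates, key=lambda c: c[0]):
        if all(rule(inp) == out for inp, out in examples):
            return name, rule
    return None

# With these grids, both flip_h and rotate_180 reproduce every output.
examples = [
    ([[1, 2, 3]], [[3, 2, 1]]),
    ([[4, 0]], [[0, 4]]),
    ([[5, 5, 6], [5, 5, 6]], [[6, 5, 5], [6, 5, 5]]),
]

name, rule = simplest_consistent_rule(examples, CANDIDATES)
print(name)
```

Both `flip_h` and `rotate_180` explain the three examples, but they disagree on a grid such as `[[7, 8], [9, 0]]`. Choosing the cheaper rule is a bet that the simpler explanation is the one more likely to carry over to an unseen case.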
By developing a general-purpose o3 model to take the ARC-AGI test, OpenAI has probably found a way to make the model focus on problem solving rather than simple memorisation. This strategy echoes the breakthroughs made by systems such as AlphaGo, which beat the world champion at the game of Go by evaluating potential moves with similar heuristic methods.