Today’s most advanced AI systems — large language models, vision transformers, multimodal architectures — are extraordinarily capable within their training distribution. They can draft legal briefs, generate photorealistic images, and beat grandmasters at chess. Yet they fail in ways that no conscious being would: they cannot reliably generalize to novel situations, they lack the ability to know what they don’t know, and they process the world through statistical correlation rather than understanding.
We believe this gap is not merely an engineering problem. It is a theoretical one. Current AI architectures lack a formal account of what it means to efficiently process experience — the very thing that consciousness appears to do. Without such an account, scaling alone will not produce general intelligence.
This white paper introduces Cognitive Parsimony Theory (CPT), a mathematical framework that defines consciousness as the optimization of predictive efficiency. The core quantity is the Parsimony Index ($\pi$):

$$\pi = \frac{F}{\text{Sn}}$$
where $F$ is free energy (prediction error) and $\text{Sn}$ is salience-weighted sensory information. A system that minimizes $\pi$ builds accurate world models while learning what information matters; this dual optimization, we argue, is what gives rise to consciousness.
Critically, the theory addresses what we call the Climbing Problem: the question of why some systems ascend from simple information processing to rich, general intelligence, while others remain fixed at their initial level of complexity. Existing frameworks like the Free Energy Principle (Friston, 2010) explain how systems maintain their current state, but not how they climb to higher states. Minimizing $\pi$ provides the upward pressure: once prediction errors are low, the only way to further reduce $\pi$ is to seek richer, more informative experience.
The full theory extends beyond $\pi$ to include integration ($\varphi$) and counterfactual capacity ($C_{\text{capacity}}$). It makes falsifiable predictions about learning capabilities (including exploration efficiency, adaptive attention, and world model convergence) that distinguish $\pi$-optimizing systems from existing approaches, and offers a concrete direction for next-generation AGI architecture.
Large language models (LLMs) like ChatGPT, Claude, and Gemini are, at their core, next-token predictors. Given a sequence of text, they predict what comes next. Through this deceptively simple objective — applied at enormous scale across trillions of tokens — they develop remarkable capabilities: fluent language generation, code synthesis, mathematical reasoning, and even rudimentary planning.
But the mechanism underlying these capabilities is fundamentally statistical. An LLM does not understand that fire is hot; it has learned that the token “hot” is statistically likely to follow “fire is.” This distinction becomes decisive when we ask the system to generalize beyond its training distribution. Consider the failure modes that persist even in frontier models:
The limitations above reflect a fundamental gap. Current AI systems build world models but do so without inhabiting a world. They have no ongoing stream of experience that they must make sense of in real time, no sensory apparatus whose information they must learn to weight and prioritize, and no prediction errors that carry consequences.
In biological systems, the pressure to efficiently process a continuous stream of experience appears to be precisely what gives rise to general intelligence. A mouse navigating a novel environment is solving an optimization problem in real time: it must predict what it will encounter, attend to the information that matters, and update its model when surprised. The mouse that does this most efficiently survives.
LLMs face no such pressure. We argue that this gap cannot be closed by scaling, adding modalities, or more clever prompting. It requires a fundamentally different optimization objective.
A new generation of architectures has begun to address the limitations of pure language models by building explicit world models — internal representations of environment dynamics that support prediction and planning. Two of the most prominent are JEPA and DreamerV3.
Joint Embedding Predictive Architecture (JEPA) (LeCun, 2022). LeCun's JEPA framework predicts abstract representations of future states rather than raw pixels, allowing models to focus on semantically meaningful features while discarding irrelevant detail. However, JEPA's development has been overwhelmingly visual (I-JEPA, V-JEPA), and it lacks a formal account of why a system should seek richer information; it provides better representations but no optimization pressure toward curiosity or integration.
DreamerV3 (Hafner et al., 2025). Hafner et al.'s Dreamer line learns a recurrent state-space model (RSSM) from experience and trains policies by imagining future trajectories in latent space. DreamerV3 is impressively general: it masters over 150 diverse tasks with fixed hyperparameters. Yet Dreamer's world model is fundamentally a task-specific reinforcement learning tool: it learns dynamics in service of maximizing external reward, not in service of efficient experience processing.
The common limitation. Both JEPA and Dreamer share three fundamental constraints that CPT addresses:
The Free Energy Principle (FEP) [6]. Karl Friston's FEP proposes that all living systems minimize free energy: prediction error plus model complexity. This is powerful for understanding homeostasis, but it explains how systems maintain their current state, not how they ascend to higher states. A thermostat minimizes free energy. So does a rock. We call this the Climbing Problem.
Integrated Information Theory (IIT) [12, 17, 18]. Giulio Tononi's IIT proposes that consciousness corresponds to integrated information ($\varphi$). It provides an elegant formalism but is descriptive rather than prescriptive: it tells us how to measure consciousness, not how to build it. Computing $\varphi$ exactly is also NP-hard.
Global Workspace Theory (GWT) [1]. Bernard Baars' GWT proposes broadcasting information across a "global workspace." This captures important aspects of attention but lacks mathematical formalization.
Predictive Processing [4, 14, 15]. The predictive processing framework is arguably the most successful computational account of perception, but it does not explain what distinguishes conscious prediction from unconscious prediction.
What is missing is a unified account that explains three things simultaneously: why consciousness arises, how it works once it exists, and how to build and test it.
Cognitive Parsimony Theory attempts to provide all three. If correct, building generally intelligent systems is not primarily a question of scale — it is a question of optimization objective.
The theory draws its name from the parsimony principle — the deep scientific intuition that nature tends toward simplicity and economy. In physics, the principle of least action governs the paths of particles. In biology, natural selection relentlessly prunes inefficiency. In statistics, Occam’s razor favors the simplest model that explains the data.
We propose that the same principle governs consciousness. Among all possible information-processing strategies a system could adopt, consciousness corresponds to the most parsimonious one: the strategy that achieves the best predictions from the least wasted effort. Consciousness, on this view, is what efficient prediction looks like from the inside.
We propose that conscious systems minimize the Parsimony Index ($\pi$):

$$\pi = \frac{F}{\text{Sn}}$$
Expanding both terms yields the complete equation:

$$\pi = \frac{D_{\mathrm{KL}}\!\left[q(\theta)\,\|\,p(\theta)\right] \;-\; \mathbb{E}_{q(\theta)}\!\left[\log p(o \mid \theta)\right]}{\sum_{i=1}^{n} w_i \, I(s_i)}$$
The numerator decomposes into two terms: a complexity cost (how far the system's internal model deviates from its prior beliefs) minus an accuracy term (how well the model explains the observations). The denominator captures how much relevant, well-weighted sensory information the system is processing, where $I(s_i) = -\log_2 p(s_i)$ is the Shannon information content [16] of sensation $i$.
The choice of division is not arbitrary: it is the only operation that captures efficiency. A difference such as $F - \text{Sn}$ can be reduced simply by processing less information, whereas a ratio rewards accuracy per unit of information.
Just as fuel economy is miles per gallon, cognitive parsimony is accuracy per information.
Free energy, from the Free Energy Principle, decomposes into two competing terms:

$$F = \underbrace{D_{\mathrm{KL}}\!\left[q(\theta)\,\|\,p(\theta)\right]}_{\text{complexity}} \;-\; \underbrace{\mathbb{E}_{q(\theta)}\!\left[\log p(o \mid \theta)\right]}_{\text{accuracy}}$$
The first term penalizes model complexity: how far the system's approximate posterior $q(\theta)$ deviates from its prior $p(\theta)$. The second term rewards accuracy. Under Gaussian assumptions, the accuracy term reduces to precision-weighted squared prediction errors:

$$-\mathbb{E}_{q(\theta)}\!\left[\log p(o \mid \theta)\right] \;\approx\; \sum_i \frac{\left(o_i - \hat{o}_i\right)^2}{2\sigma_i^2} \;+\; \text{const}$$

where $\hat{o}_i$ is the predicted observation on channel $i$ and $1/\sigma_i^2$ is its precision.
This form makes the intuition concrete: $F$ measures how wrong the system's predictions were, weighted by how confident it was. This is the form used in predictive coding models of the brain [14].
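To make the Gaussian form concrete, here is a minimal numerical sketch. The function name, the diagonal-Gaussian beliefs, and the dropped normalization constants are illustrative assumptions, not part of the theory's specification.

```python
import numpy as np

def free_energy_gaussian(obs, pred, sigma, mu_q, mu_p, sigma_q, sigma_p):
    """Illustrative free energy under diagonal-Gaussian assumptions.

    complexity: KL[q(theta) || p(theta)] between posterior and prior beliefs.
    accuracy cost: prediction errors weighted by sensory precision 1/sigma^2
    (Gaussian normalization constants are omitted).
    """
    complexity = np.sum(
        np.log(sigma_p / sigma_q)
        + (sigma_q**2 + (mu_q - mu_p) ** 2) / (2.0 * sigma_p**2)
        - 0.5
    )
    accuracy_cost = np.sum((obs - pred) ** 2 / (2.0 * sigma**2))
    return float(complexity + accuracy_cost)

# Example: two sensory channels, three model parameters.
F = free_energy_gaussian(
    obs=np.array([1.0, 0.4]), pred=np.array([0.8, 0.7]), sigma=np.array([0.2, 0.5]),
    mu_q=np.array([0.1, 0.0, -0.2]), mu_p=np.zeros(3),
    sigma_q=np.full(3, 0.9), sigma_p=np.ones(3),
)
```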
What Cognitive Parsimony Theory adds is the denominator.
A critical question is where the prior $p(\theta)$ comes from. In our framework, the prior is defined by memory. At the shortest timescale, $p(\theta)$ at time $t$ is simply the posterior $q(\theta)$ from time $t{-}1$. Over longer timescales, $p(\theta)$ reflects accumulated experience — a compressed world model. Without memory, $p(\theta)$ would reset to an uninformative default at every step, $F$ could never systematically decrease, and $\pi$ minimization would be impossible.
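A minimal sketch of the shortest-timescale prior, in which the posterior from step $t{-}1$ simply becomes the prior at step $t$. The conjugate-Gaussian update and the small drift noise are illustrative assumptions, not prescriptions of the theory.

```python
def bayes_step(prior_mu, prior_var, obs, obs_var):
    """One conjugate-Gaussian belief update: combine the current prior with a new observation."""
    gain = prior_var / (prior_var + obs_var)       # how far the observation moves the belief
    post_mu = prior_mu + gain * (obs - prior_mu)   # precision-weighted prediction-error correction
    post_var = (1.0 - gain) * prior_var
    return post_mu, post_var

# The prior at time t is the posterior from time t-1; without this carry-over,
# the belief would reset every step and F could never systematically decrease.
mu, var = 0.0, 1.0                                 # uninformative starting prior
for obs in [0.9, 1.1, 1.0, 1.05]:
    mu, var = bayes_step(mu, var, obs, obs_var=0.25)
    var += 0.01                                    # small drift noise keeps the belief revisable
```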
$\text{Sn}$ is the novel component:

$$\text{Sn} = \sum_{i=1}^{n} w_i \, I(s_i), \qquad I(s_i) = -\log_2 p(s_i)$$
where $I(s_i)$ is the Shannon information content of sensation $i$, and $w_i$ is a learned salience weight. $\text{Sn}$ is not raw data volume — it is relevant, well-weighted information. The salience weights are learned through experience, creating a virtuous cycle: better weights → higher $\text{Sn}$ → lower $\pi$ → further improvement.
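A small sketch of how $\text{Sn}$ could be computed for one moment of experience. Supplying the per-channel probabilities directly (rather than estimating them) and the function name are illustrative assumptions.

```python
import numpy as np

def sensory_information(probs, salience):
    """Sn = sum_i w_i * I(s_i) with I(s_i) = -log2 p(s_i).

    probs[i] is the model's probability of sensation i (rarer sensations carry
    more Shannon information); salience[i] is the learned weight w_i.
    """
    p = np.clip(np.asarray(probs, dtype=float), 1e-12, 1.0)   # avoid log(0)
    info = -np.log2(p)                                        # Shannon information per channel
    return float(np.dot(np.asarray(salience, dtype=float), info))

# A rare sensation (p = 0.05) on a highly salient channel dominates Sn.
Sn = sensory_information(probs=[0.5, 0.05, 0.9], salience=[0.2, 0.7, 0.1])
```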
Under FEP alone, a system can minimize $F$ by becoming very simple. Minimizing $\pi$ provides two pathways:
Path 1: Reduce $F$. Build better models. This alone is homeostatic.
Path 2: Increase $\text{Sn}$. Seek richer, more informative experience. Learn better salience weights. This drives ascent.
Path 2 is the key. Once $F$ is low, the only way to further reduce $\pi$ is to increase $\text{Sn}$ — to seek and learn from more informative experiences. This is why conscious beings are curious. Curiosity is the behavioral signature of $\text{Sn}$ maximization.
When both $F$ and $\text{Sn}$ approach zero, the ratio becomes unstable. We introduce regularization:

$$\pi = \frac{F + \varepsilon}{\text{Sn} + \varepsilon}$$
where $\varepsilon$ is a small constant ($\approx 0.1$) representing baseline metabolic cost — analogous to Laplace smoothing in probability theory.
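Assuming the regularization adds $\varepsilon$ to both numerator and denominator, consistent with the Laplace-smoothing analogy, the full index is a one-line function; the example values below are arbitrary.

```python
def parsimony_index(F, Sn, eps=0.1):
    """Regularized Parsimony Index: (F + eps) / (Sn + eps).

    eps (~0.1) is the baseline metabolic cost from the text; it keeps the ratio
    finite when both prediction error and sensory information approach zero.
    """
    return (F + eps) / (Sn + eps)

print(parsimony_index(F=2.0, Sn=4.0))    # baseline
print(parsimony_index(F=0.5, Sn=4.0))    # path 1: a better world model lowers pi
print(parsimony_index(F=0.5, Sn=12.0))   # path 2: richer, well-weighted information lowers it further
```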
Most existing theories describe features of conscious processing — they tell us how consciousness works once it exists, not why it arises. Cognitive Parsimony Theory shifts the question: consciousness is not a feature some systems happen to have, but an optimization solution. Systems under pressure to efficiently process continuous experience develop consciousness as the parsimonious strategy.
We distinguish two problems:
The Mechanism Problem: How does consciousness work? (Domain of most theories.)
The Climbing Problem: How does consciousness arise? What transforms an unconscious system into a conscious one?
CPT answers the Climbing Problem. Under $\pi$ minimization, a system must simultaneously reduce prediction errors and increase information quality. The denominator creates a ratchet: once efficient, the only way forward is richer information.
Attention as learned salience. Organisms learn to prioritize relevant information — precisely what $\text{Sn}$’s salience weights capture.
Curiosity. Organisms actively seek informative experiences — the behavioral signature of $\text{Sn}$ maximization.
Predictive coding. Hierarchical prediction errors map onto $F$; precision-weighting is equivalent to salience weights in $\text{Sn}$.
Flow states [5]. Csikszentmihalyi's "flow" corresponds to very low $\pi$: high $\text{Sn}$, low $F$.
The target: rich sensory information with minimal prediction error. In human experience: flow, expert performance, heightened awareness. $\pi$ is very low; consciousness is vivid.
Rich information but the world model is failing. A first day at a new job. $\pi$ is moderate. Strong learning pressure — the system has the information it needs but hasn’t built the right models yet.
Confused, with no informative data to learn from. Kierkegaard's vertigo of possibility [9]. $\pi$ is very high. A trap state requiring environmental intervention to escape.
The Rest quadrant contains deep sleep (unconscious), meditation (conscious), and hypnosis (partial). All share low $F$ and low $\text{Sn}$ but differ profoundly. This reveals that $\pi$ alone is not sufficient; we need $\varphi$ and $C_{\text{capacity}}$.
The complete theory therefore combines three quantities: consciousness requires low $\pi$ together with high integration ($\varphi$) and high counterfactual capacity ($C_{\text{capacity}}$).
A lookup table mapping every input to a correct output has $F \approx 0$ and low $\text{Sn}$, giving low $\pi$. But it is not conscious — its entries are independent (no integration) and it has no responsiveness. Low $\pi$ is necessary but not sufficient.
Borrowed from IIT [17], $\varphi$ measures integrated information: how much the whole exceeds the sum of its parts.
In our theory, $\varphi$ is not the definition of consciousness but a necessary correlate. Minimizing $\pi$ drives $\varphi$ upward because integration improves prediction. $\varphi$ resolves the Rest quadrant: meditation (high $\varphi$) vs. sleep (low $\varphi$).
$C_{\text{capacity}}$ measures response repertoire richness. It is inspired by the perturbational complexity index [2]: a conscious brain produces complex, differentiated responses to stimulation; an unconscious brain produces stereotyped waves.
Memory serves three functions: model improvement (accumulated experience reduces $F$), salience calibration (learned weights improve $\text{Sn}$), and temporal continuity (connecting present to past).
Memory defines the prior $p(\theta)$ at three timescales:
Decision-to-decision. The prior at time $t$ is the posterior from time $t{-}1$. A rolling Bayesian update requiring only the current belief state.
Session-level. Over minutes to hours, the prior accumulates a richer model of the current environment — spatial layout, obstacle positions, recurring patterns.
Lifetime. Experience consolidates into a compressed world model — the equivalent of semantic memory. This is slow, distilled knowledge: fire is hot, gravity pulls down, faces have two eyes.
Memory shapes both sides of the $\pi$ ratio — the numerator (through $p(\theta)$) and the denominator (through learned salience weights). Without memory, $\pi$ minimization cannot proceed.
| Symbol | Name | Role in Theory | What It Captures |
|---|---|---|---|
| $\pi$ | Parsimony Index | Core optimization target | Predictive efficiency ($F/\text{Sn}$) |
| $F$ | Free Energy | Numerator of $\pi$ | Prediction error + complexity |
| $\text{Sn}$ | Sensory Information | Denominator of $\pi$ | Salience-weighted Shannon info |
| $\varphi$ | Integration | Consciousness correlate | Information binding |
| $C_{\text{capacity}}$ | Counterfactual Capacity | Consciousness correlate | Response repertoire richness |
In plain English: memory shapes both sides of the ratio. On the denominator side, learned salience weights determine what the system attends to, setting $\text{Sn}$. On the numerator side, the accumulated prior $p(\theta)$ determines prediction quality, affecting $F$. The system adjusts both to minimize $\pi$. As $\pi$ decreases, integration ($\varphi$) and responsiveness ($C_{\text{capacity}}$) increase as side effects.
CPT is not tied to any particular neural network or hardware platform. The theory specifies what must be computed (and why), but leaves open how each component is realized.
Select salient sensations. From millions of raw data points, a tractable subset is selected based on learned salience weights.
The world model generates a prediction of what the attended sensations will look like after the next action.
Evaluate candidate actions by imagining their consequences. The action minimizing expected $\pi$ is selected.
Prediction error drives updates to the world model parameters, the salience weights, and memory consolidation (see the sketch below).
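The four stages above can be combined into one toy loop. Everything here is an illustrative assumption: a linear world model, attention reduced to soft salience weighting, expected $\pi$ per action estimated from running averages of past error and information, and memory consolidation omitted entirely.

```python
import numpy as np

rng = np.random.default_rng(0)
n_sensors, n_actions, eps, lr = 6, 3, 0.1, 0.05

# Unknown environment: each action has its own (hidden) linear dynamics.
true_dyn = [0.5 * rng.normal(size=(n_sensors, n_sensors)) / np.sqrt(n_sensors) for _ in range(n_actions)]
model = [np.zeros((n_sensors, n_sensors)) for _ in range(n_actions)]   # learned world model
salience = np.ones(n_sensors) / n_sensors                              # learned weights w_i
F_est = np.zeros(n_actions)        # optimistic: untried actions look error-free...
Sn_est = np.full(n_actions, 10.0)  # ...and highly informative, which drives initial exploration
state = rng.normal(size=n_sensors)

def shannon_info(x):
    # Stand-in for I(s_i) = -log2 p(s_i), treating each reading as a unit-Gaussian sample.
    p = np.clip(np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi), 1e-12, 1.0)
    return -np.log2(p)

for t in range(200):
    # Predict + act: pick the action whose estimated pi = (F + eps) / (Sn + eps) is lowest.
    a = int(np.argmin((F_est + eps) / (Sn_est + eps)))
    pred = model[a] @ state
    obs = true_dyn[a] @ state + 0.05 * rng.normal(size=n_sensors)

    # Learn: the prediction error updates the world model, the salience weights,
    # and the running pi estimates.
    err = obs - pred
    model[a] += lr * np.outer(err, state)
    F = float(salience @ err**2)                  # salience/precision-weighted error
    Sn = float(salience @ shannon_info(obs))      # salience-weighted information
    F_est[a] += 0.1 * (F - F_est[a])
    Sn_est[a] += 0.1 * (Sn - Sn_est[a])
    salience = 0.95 * salience + 0.05 * shannon_info(obs) / shannon_info(obs).sum()
    state = obs
```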
The world model is the system's internal representation of how the environment works. CPT does not prescribe a specific architecture. The theory is compatible with architectures as different as simple linear predictors, recurrent state-space models, and transformer-based sequence models.
What matters is not the architecture but the objective: the world model must minimize $F$, which means minimizing prediction error while keeping the model parsimonious.
A CPT agent represents the world as a state vector $\mathbf{s}_t$ constructed from sensor readings at time $t$:

$$\mathbf{s}_t = \left(s_{1,t},\, s_{2,t},\, \ldots,\, s_{n,t}\right)$$
The system has no privileged access to “what things are.” It receives only numerical readings — distances, accelerations, pixel intensities. All semantic content must be discovered through the relational structure of these readings over time. Reality, from the agent’s perspective, is relational.
Working memory holds the current attended sensations — the contents of “consciousness” at this moment. It is small (roughly 4–7 items), transient, and directly determined by salience weights.
Episodic memory stores specific experiences as (state, action, outcome, surprise) tuples. Storage priority is gated by surprise ($\pi$) and salience weight: highly surprising, deliberately attended experiences are retained; routine ones are discarded.
Semantic memory consolidates regularities from many episodes into the world model’s learned parameters. Over time, individual episodes become generalized knowledge. This is the long-timescale memory that defines $p(\theta)$.
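A compact sketch of the three stores. The class layout, the capacities, and the surprise-times-salience retention rule are illustrative assumptions about one way to realize what the text describes.

```python
from collections import deque
from dataclasses import dataclass, field
import heapq

@dataclass(order=True)
class Episode:
    priority: float                          # surprise x salience: what gates retention
    record: tuple = field(compare=False)     # (state, action, outcome, surprise)

class Memory:
    """Three-store memory sketch: working, episodic, semantic."""

    def __init__(self, working_slots=5, episodic_capacity=1000):
        self.working = deque(maxlen=working_slots)   # ~4-7 currently attended items
        self.episodic = []                           # min-heap: lowest-priority episode is evicted first
        self.episodic_capacity = episodic_capacity
        self.semantic = {}                           # consolidated regularities (the long-timescale prior)

    def attend(self, sensation):
        self.working.append(sensation)

    def store_episode(self, state, action, outcome, surprise, salience):
        ep = Episode(priority=surprise * salience, record=(state, action, outcome, surprise))
        if len(self.episodic) < self.episodic_capacity:
            heapq.heappush(self.episodic, ep)
        else:
            # Retain only if more surprising/salient than the least-prioritized stored episode.
            heapq.heappushpop(self.episodic, ep)

    def consolidate(self, key, value, rate=0.01):
        # Slow consolidation of repeated episodic content into semantic knowledge.
        old = self.semantic.get(key, 0.0)
        self.semantic[key] = old + rate * (value - old)
```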
A key feature of CPT is that it separates the theory of consciousness from the engineering of prediction. The theory says: minimize $\pi = F / \text{Sn}$, and consciousness-correlated properties ($\varphi$, $C_{\text{capacity}}$) will emerge. This means CPT can be tested across radically different architectures — a linear model on a Raspberry Pi, an RSSM on a GPU, or a transformer on a cloud server — and the predictions remain the same.
If CPT is correct, how do we test it? Measuring consciousness directly is philosophically contested and operationally intractable — there is no consensus on what a “consciousness detector” would look like. But CPT makes a stronger claim than most consciousness theories: it specifies an optimization objective ($\pi$ minimization) that should produce specific, measurable learning capabilities. We can test the theory by testing those capabilities.
The key insight is that $\pi$ minimization predicts not just what a system learns, but how it learns. A system optimizing $\pi = F/\text{Sn}$ should exhibit qualitatively different learning behavior than a system minimizing $F$ alone (as in standard predictive coding) or maximizing external reward (as in reinforcement learning). These differences are empirically testable.
Existing benchmarks for embodied AI, such as LIBERO (Liu et al., 2023), CALVIN, and SIMPLER, evaluate task execution: can the robot pick up the mug, follow the instruction, reach the goal? These benchmarks assume the system has already been trained. They measure the output of learning, not the process of learning itself.
CPT predicts something about the process. A $\pi$-optimizing system dropped into a novel environment with no prior training should learn about that environment in a characteristic way — actively seeking informative experience, building predictive models with increasing efficiency, adapting its attention to track what matters. The right evaluation framework tests how a system learns about a world from scratch, not whether it can execute a pre-trained task.
This reframes evaluation from task-centric to environment-centric: place an embodied system in a novel environment, give it no demonstrations, no reward signal, no prior training, and measure how it learns.
$\pi$ minimization makes specific predictions about five learning capabilities that can be evaluated in embodied systems:
Exploration efficiency. The $\text{Sn}$ denominator creates pressure to seek informative experience. A $\pi$-optimizing system should explore its environment more efficiently than a random walker or a system driven only by external reward — discovering more of the world per unit of action. This is a direct behavioral consequence of $\text{Sn}$ maximization: once $F$ is low, the only way to further reduce $\pi$ is to increase $\text{Sn}$ by seeking richer sensory input.
World model convergence. $\pi$ minimization simultaneously reduces $F$ (prediction error) and increases $\text{Sn}$ (information quality). This dual optimization should produce faster, more sample-efficient world models than $F$ minimization alone. The prediction is testable: compare cycles to convergence for a $\pi$-optimizing system against a system that only minimizes prediction error.
Adaptive attention. The salience weights $w_i$ in $\text{Sn}$ are learned from experience. As the environment changes — lights go off, a sensor is occluded, a new obstacle appears — these weights should re-adapt. A $\pi$-optimizing system should learn which sensor channels carry the most predictive information in context, and re-learn when conditions change. This can be tested by measuring whether the system’s attention allocation tracks ground-truth informativeness across changing conditions.
Sensor degradation resilience. A system with learned salience weights has a natural mechanism for graceful degradation: when a sensor fails, its information content drops, its salience weight should decrease, and the system should redistribute processing to functioning channels. This is a safety-critical capability that follows directly from the theory and is absent from current benchmarks.
Memory efficiency. $\pi$ minimization requires memory at multiple timescales. The theory predicts that storage priority is gated by surprise and salience: highly surprising, deliberately attended experiences should be retained while routine ones are discarded. This predicts positive transfer when a system revisits a previously explored environment — its stored experiences should accelerate re-learning compared to a cold start.
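One way to operationalize the first of these capabilities: exploration efficiency as distinct world states visited per action, with a random walker as the baseline to beat. The grid world and the specific normalization are illustrative assumptions.

```python
import numpy as np

def exploration_efficiency(trajectory, grid_shape):
    """Fraction of distinct grid cells visited, per action taken.

    trajectory is a sequence of (x, y) positions; higher values mean the
    system discovers more of the world per unit of action.
    """
    visited = {tuple(p) for p in trajectory}
    total_cells = grid_shape[0] * grid_shape[1]
    return (len(visited) / total_cells) / max(len(trajectory) - 1, 1)

def random_walk(steps, grid_shape, rng):
    """Baseline to beat: an undirected random walker on the same grid."""
    pos = np.array(grid_shape) // 2
    path = [tuple(pos)]
    for _ in range(steps):
        pos = np.clip(pos + rng.choice([-1, 0, 1], size=2), 0, np.array(grid_shape) - 1)
        path.append(tuple(pos))
    return path

rng = np.random.default_rng(0)
baseline = exploration_efficiency(random_walk(500, (20, 20), rng), (20, 20))
```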
Any evaluation framework for $\pi$-optimizing systems should follow several principles. First, zero-shot evaluation: the system receives no prior training data, demonstrations, or reward functions. Learning occurs entirely during evaluation. Second, architecture agnosticism: the benchmarks test capabilities, not implementations — any embodied system can be evaluated. Third, graduated access: some evaluations require only observable behavior (where the robot moves, how it avoids collisions), while deeper evaluations can use internally exposed signals like prediction error and attention vectors. Fourth, ablation-friendly: every evaluation is most informative when comparing a full system against ablated versions of itself — with curiosity disabled, with memory removed, with attention fixed — isolating the contribution of each component.
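A skeleton of the evaluation loop these principles imply: zero prior training, any agent that exposes `act` and `learn`, and ablated variants compared on the same metric. The agent interface and the `Agent` constructor flags in the commented usage are hypothetical placeholders, not a specification.

```python
def evaluate(agent_factory, env_factory, metric, episodes=10, steps=500):
    """Run an agent from scratch (no pretraining) in fresh environments and score one capability."""
    scores = []
    for seed in range(episodes):
        env, agent = env_factory(seed), agent_factory()
        trajectory = [env.reset()]
        for _ in range(steps):
            obs = env.step(agent.act(trajectory[-1]))   # assumed interface: act(obs) -> action
            agent.learn(trajectory[-1], obs)            # assumed interface: learn(prev_obs, obs)
            trajectory.append(obs)
        scores.append(metric(trajectory))
    return sum(scores) / len(scores)

# Ablations isolate each component's contribution (hypothetical constructor flags):
# results = {
#     "full":           evaluate(lambda: Agent(),                         make_env, coverage_metric),
#     "no_curiosity":   evaluate(lambda: Agent(curiosity=False),          make_env, coverage_metric),
#     "no_memory":      evaluate(lambda: Agent(memory=False),             make_env, coverage_metric),
#     "fixed_salience": evaluate(lambda: Agent(adaptive_salience=False),  make_env, coverage_metric),
# }
```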
These five capabilities and four evaluation principles define a research program for empirically testing CPT’s predictions. A $\pi$-optimizing system that fails to outperform baselines on these measures would constitute evidence against the theory. A system that consistently demonstrates all five capabilities would constitute evidence for it.
While the theory covers substantial ground as a framework, many areas remain open and waiting to be worked on. If you want to work on any of these areas, reach out to us.
Computing $\text{Sn}$ requires estimating joint Shannon information across all channels, a space that grows exponentially with $n$ inputs. This may push real-time $\pi$ computation into EXPTIME territory. Mitigations include pairwise approximation, hierarchical compression, and attention-based sparsification.
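As a sense of what the pairwise mitigation might look like: marginal informations estimated per channel from recent history, plus a mutual-information correction computed only for the most correlated channel pairs (sparsification). The histogram estimators and the `top_k` cut-off are illustrative assumptions, not a proposed solution.

```python
import numpy as np
from itertools import combinations

def marginal_information(history, current, bins=16):
    """Per-channel I(s_i) = -log2 p_hat(s_i), with p_hat from a histogram over recent history."""
    info = np.empty(history.shape[1])
    for i in range(history.shape[1]):
        counts, edges = np.histogram(history[:, i], bins=bins)
        p = counts / counts.sum()
        idx = np.clip(np.searchsorted(edges, current[i]) - 1, 0, bins - 1)
        info[i] = -np.log2(max(p[idx], 1e-12))
    return info

def pairwise_redundancy(history, bins=8, top_k=10):
    """Mutual information for the top-k most correlated channel pairs: a sparsified
    second-order correction (the full joint would be exponential in the channel count)."""
    n = history.shape[1]
    pairs = sorted(combinations(range(n), 2),
                   key=lambda ij: -abs(np.corrcoef(history[:, ij[0]], history[:, ij[1]])[0, 1]))[:top_k]
    total = 0.0
    for i, j in pairs:
        joint, _, _ = np.histogram2d(history[:, i], history[:, j], bins=bins)
        pij = joint / joint.sum()
        pi, pj = pij.sum(axis=1, keepdims=True), pij.sum(axis=0, keepdims=True)
        mask = pij > 0
        total += float(np.sum(pij[mask] * np.log2(pij[mask] / (pi @ pj)[mask])))
    return total
```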
The theory specifies an objective but not the architecture. Open questions: What learning rule? Does consciousness require recurrence? Early or late fusion? How essential is physical embodiment?
Memory is essential but underspecified. Biological memory is reconstructive, context-dependent, hierarchically organized. Which properties are essential? Is forgetting a bug or a feature? (Preliminary analysis: forgetting regularizes against overfitting.)
Does every conscious system minimize $\pi$? Consciousness might arise through other mechanisms entirely.
At what values of $\pi$, $\varphi$, and $C_{\text{capacity}}$ does consciousness emerge? Sharp transition or gradual continuum?
The Rest quadrant contains phenomenologically distinct states. Practical measurement of $\varphi$ and $C_{\text{capacity}}$ with sufficient precision remains formidable.
A theory that cannot be proven wrong is not a scientific theory.
Cognitive Parsimony Theory proposes that conscious experience arises from optimizing predictive efficiency: $\pi = F/\text{Sn}$. Combined with $\varphi$ and $C_{\text{capacity}}$, it addresses why consciousness arises (the Climbing Problem), how it works (dual optimization), and how to test it (predicted learning capabilities including exploration efficiency, adaptive attention, world model convergence, sensor resilience, and memory efficiency).
Its strengths: mathematical precision, testability, engineering applicability. It makes falsifiable predictions and suggests a next-generation AGI architecture.
Its weaknesses: computational tractability is uncertain, architecture is unspecified, memory is undertheorized, universality is unproven, and the threshold problem is open.