
Cognitive Parsimony Theory

Why Consciousness is a Constrained Optimization Problem
written by Harry Gandhi  |  February 20, 2026
Created with the assistance of Claude (Anthropic)

Abstract

Today’s most advanced AI systems — large language models, vision transformers, multimodal architectures — are extraordinarily capable within their training distribution. They can draft legal briefs, generate photorealistic images, and beat grandmasters at chess. Yet they fail in ways that no conscious being would: they cannot reliably generalize to novel situations, they lack the ability to know what they don’t know, and they process the world through statistical correlation rather than understanding.

We believe this gap is not merely an engineering problem. It is a theoretical one. Current AI architectures lack a formal account of what it means to efficiently process experience — the very thing that consciousness appears to do. Without such an account, scaling alone will not produce general intelligence.

This white paper introduces Cognitive Parsimony Theory (CPT), a mathematical framework that defines consciousness as the optimization of predictive efficiency. The core quantity is the Parsimony Index ($\pi$):

$$\pi = \frac{F}{\text{Sn}} \tag{1}$$

where $F$ is free energy (prediction error plus model complexity) and $\text{Sn}$ is salience-weighted sensory information. A system that minimizes $\pi$ builds accurate world models while learning what information matters — and this dual optimization, we argue, is what gives rise to consciousness.

Critically, the theory addresses what we call the Climbing Problem — the question of why some systems ascend from simple information processing to rich, general intelligence, while others remain fixed at their initial level of complexity. Existing frameworks like the Free Energy Principle (Friston, 2010) explain how systems maintain their current state, but not how they climb to higher states. Minimizing $\pi$ provides the upward pressure: once prediction errors are low, the only way to further reduce $\pi$ is to seek richer, more informative experience.

The full theory extends beyond $\pi$ to include integration ($\varphi$), which measures information binding, and counterfactual capacity ($C_{\text{capacity}}$), which measures responsiveness:

$$\text{Consciousness} \;\propto\; \frac{1}{\pi} \times \varphi \times C_{\text{capacity}} \tag{2}$$

The theory makes falsifiable predictions, provides a quantitative metric for measuring consciousness (the Consciousness Quotient, tested through Ramanujan-style mathematical insight problems), and offers a concrete direction for next-generation AGI architecture.

The Limits of Current AI

What Large Language Models Actually Do

Large language models (LLMs) like GPT-4, Claude, and Gemini are, at their core, next-token predictors. Given a sequence of text, they predict what comes next. Through this deceptively simple objective — applied at enormous scale across trillions of tokens — they develop remarkable capabilities: fluent language generation, code synthesis, mathematical reasoning, and even rudimentary planning.

But the mechanism underlying these capabilities is fundamentally statistical. An LLM does not understand that fire is hot; it has learned that the token “hot” is statistically likely to follow “fire is.” This distinction becomes decisive when we ask the system to generalize beyond its training distribution, where characteristic failure modes persist even in frontier models.

The Deeper Problem: World Models Without Worlds

The limitations above reflect a fundamental gap. Current AI systems build world models but do so without inhabiting a world. They have no ongoing stream of experience that they must make sense of in real time, no sensory apparatus whose information they must learn to weight and prioritize, and no prediction errors that carry consequences.

In biological systems, the pressure to efficiently process a continuous stream of experience appears to be precisely what gives rise to general intelligence. A mouse navigating a novel environment is solving an optimization problem in real time: it must predict what it will encounter, attend to the information that matters, and update its model when surprised. The mouse that does this most efficiently survives.

LLMs face no such pressure. We argue that this gap cannot be closed by scaling, adding modalities, or more clever prompting. It requires a fundamentally different optimization objective.

World Models: Closer, But Still Incomplete

A new generation of architectures has begun to address the limitations of pure language models by building explicit world models — internal representations of environment dynamics that support prediction and planning. Two of the most prominent are JEPA and DreamerV3.

Joint Embedding Predictive Architecture (JEPA). LeCun’s JEPA framework (LeCun, 2022) predicts abstract representations of future states rather than raw pixels, allowing models to focus on semantically meaningful features while discarding irrelevant detail. However, JEPA’s development has been overwhelmingly visual (I-JEPA, V-JEPA), and it lacks a formal account of why a system should seek richer information; it provides better representations but no optimization pressure toward curiosity or integration.

DreamerV3. Hafner et al.’s Dreamer line (Hafner et al., 2025) learns a recurrent state-space model (RSSM) from experience and trains policies by imagining future trajectories in latent space. DreamerV3 is impressively general — it masters over 150 diverse tasks with fixed hyperparameters. Yet Dreamer’s world model is fundamentally a task-specific reinforcement learning tool: it learns dynamics in service of maximizing external reward, not in service of efficient experience processing.

The common limitation. Both JEPA and Dreamer share three fundamental constraints that CPT addresses: no formal account of why a system should seek richer information, no optimization pressure toward integration, and objectives defined by external tasks rather than by efficient experience processing.

Other Approaches and Their Limitations

The Free Energy Principle (FEP) [6]. Karl Friston’s FEP proposes that all living systems minimize free energy — prediction error plus model complexity. This is powerful for understanding homeostasis, but it explains how systems maintain their current state, not how they ascend to higher states. A thermostat minimizes free energy. So does a rock. We call this the Climbing Problem.

Integrated Information Theory (IIT) [12, 17, 18]. Giulio Tononi’s IIT proposes that consciousness corresponds to integrated information ($\varphi$). It provides an elegant formalism but is descriptive rather than prescriptive: it tells us how to measure consciousness, not how to build it — and computing $\varphi$ exactly is NP-hard.

Global Workspace Theory (GWT) [1]. Bernard Baars’ GWT proposes broadcasting information across a “global workspace.” This captures important aspects of attention but lacks mathematical formalization.

Predictive Processing [4, 14, 15]. The predictive processing framework (Clark, 2013; Rao & Ballard, 1999; Seth, 2014) is arguably the most successful computational account of perception, but it does not explain what distinguishes conscious prediction from unconscious prediction.

The Need for a Generalized Model of Intelligence

What is missing is a unified account that explains three things simultaneously: why consciousness arises, how it works as an ongoing process, and how it can be measured.

Cognitive Parsimony Theory attempts to provide all three. If correct, building generally intelligent systems is not primarily a question of scale — it is a question of optimization objective.

Cognitive Parsimony Theory: Mathematics of the Parsimony Index

The Parsimony Principle and Consciousness

The theory draws its name from the parsimony principle — the deep scientific intuition that nature tends toward simplicity and economy. In physics, the principle of least action governs the paths of particles. In biology, natural selection relentlessly prunes inefficiency. In statistics, Occam’s razor favors the simplest model that explains the data.

We propose that the same principle governs consciousness. Among all possible information-processing strategies a system could adopt, consciousness corresponds to the most parsimonious one: the strategy that achieves the best predictions from the least wasted effort. Consciousness, on this view, is what efficient prediction looks like from the inside.

Core Equation

We propose that conscious systems minimize the Parsimony Index ($\pi$):

$$\pi = \frac{F}{\text{Sn}} \tag{3}$$

Expanding both terms yields the complete equation:

$$\pi = \frac{F}{\text{Sn}} = \frac{D_{KL}\!\left[q(\theta)\,\|\,p(\theta)\right] - \mathbb{E}_q\!\left[\log p(o|\theta)\right]}{\displaystyle\sum_i w_i \cdot I(s_i)} \tag{4}$$

The numerator decomposes into two terms: a complexity cost (how far the system’s internal model deviates from its prior beliefs) minus an accuracy term (how well the model explains the observations). The denominator captures how much relevant, well-weighted sensory information the system is processing, where $I(s_i) = -\log_2 p(s_i)$ is the Shannon information content [16] of sensation $i$.

Why a Ratio, Not a Sum

The choice of division is not arbitrary — it is the only operation that captures efficiency:

$$\text{Addition: } F + \text{Sn} = 101 \quad\text{for both } (F{=}1,\ \text{Sn}{=}100) \text{ and } (F{=}100,\ \text{Sn}{=}1) \tag{5}$$

$$\text{Division: } F/\text{Sn} = 0.01 \quad\text{for } (F{=}1,\ \text{Sn}{=}100); \qquad F/\text{Sn} = 100 \quad\text{for } (F{=}100,\ \text{Sn}{=}1) \tag{6}$$

Just as fuel economy is miles per gallon, cognitive parsimony is accuracy per information.
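The contrast can be checked in a few lines of Python, using the illustrative values from the equations above:

```python
def ratio(F, Sn):
    """Parsimony as a ratio: prediction error per unit of information,
    analogous to fuel economy (miles per gallon)."""
    return F / Sn

# Two very different systems:
efficient   = dict(F=1.0,   Sn=100.0)   # low error, rich information
inefficient = dict(F=100.0, Sn=1.0)     # high error, poor information

# Addition cannot tell them apart; division can.
assert efficient["F"] + efficient["Sn"] == inefficient["F"] + inefficient["Sn"] == 101.0
assert ratio(**efficient) == 0.01       # very parsimonious
assert ratio(**inefficient) == 100.0    # very wasteful
```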

Free Energy ($F$): The Numerator

Free energy, from the Free Energy Principle, decomposes into two competing terms:

$$F = \underbrace{D_{KL}\!\left[q(\theta)\,\|\,p(\theta)\right]}_{\text{complexity}} - \underbrace{\mathbb{E}_q\!\left[\log p(o|\theta)\right]}_{\text{accuracy}} \tag{7}$$

The first term penalizes model complexity — how far the system’s approximate posterior $q(\theta)$ deviates from its prior $p(\theta)$. The second term rewards accuracy. Under Gaussian assumptions, the accuracy term reduces to precision-weighted squared prediction errors:

$$F \approx \sum_j \frac{(o_j - \hat{o}_j)^2}{2\sigma_j^2} + \text{complexity terms} \tag{8}$$

This form makes the intuition concrete: $F$ measures how wrong the system’s predictions were, weighted by how confident it was. This is the form used in predictive coding models of the brain [14].
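A minimal sketch of equation (8)’s error term, assuming independent Gaussian observation channels and omitting the complexity terms:

```python
def gaussian_free_energy(obs, pred, sigma):
    """Accuracy term of F under Gaussian assumptions (precision-weighted
    squared prediction errors); complexity terms are omitted."""
    return sum((o - p) ** 2 / (2.0 * s ** 2)
               for o, p, s in zip(obs, pred, sigma))

# Same errors, different confidence: confident misses (small sigma)
# are penalized far more than uncertain ones (large sigma).
confident = gaussian_free_energy([1.0, 2.0], [0.0, 0.0], sigma=[0.5, 0.5])  # 10.0
sloppy    = gaussian_free_energy([1.0, 2.0], [0.0, 0.0], sigma=[2.0, 2.0])  # 0.625
```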

What Cognitive Parsimony Theory adds is the denominator.

A critical question is where the prior $p(\theta)$ comes from. In our framework, the prior is defined by memory. At the shortest timescale, $p(\theta)$ at time $t$ is simply the posterior $q(\theta)$ from time $t{-}1$. Over longer timescales, $p(\theta)$ reflects accumulated experience — a compressed world model. Without memory, $p(\theta)$ would reset to an uninformative default at every step, $F$ could never systematically decrease, and $\pi$ minimization would be impossible.
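The shortest-timescale case can be sketched as a conjugate Gaussian update for an unknown mean; the observation noise and the initial vague prior are illustrative assumptions, not values from the theory:

```python
def bayes_update(prior_mu, prior_var, obs, obs_var):
    """Conjugate Gaussian update for an unknown mean given one observation.
    The posterior at time t becomes the prior p(theta) at time t+1,
    so memory is what carries the model forward."""
    post_var = 1.0 / (1.0 / prior_var + 1.0 / obs_var)
    post_mu = post_var * (prior_mu / prior_var + obs / obs_var)
    return post_mu, post_var

mu, var = 0.0, 10.0          # vague initial prior (no memory yet)
for obs in [4.2, 3.8, 4.0]:  # a short stream of experience
    mu, var = bayes_update(mu, var, obs, obs_var=1.0)  # posterior -> next prior
# var shrinks with each observation: the prior sharpens as memory accumulates
```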

Sensory Information ($\text{Sn}$): The Denominator

$\text{Sn}$ is the novel component:

$$\text{Sn} = \sum_i w_i \cdot I(s_i) = -\sum_i w_i \cdot \log_2 p(s_i) \tag{9}$$

where $I(s_i)$ is the Shannon information content of sensation $i$, and $w_i$ is a learned salience weight. $\text{Sn}$ is not raw data volume — it is relevant, well-weighted information. The salience weights are learned through experience, creating a virtuous cycle: better weights → higher $\text{Sn}$ → lower $\pi$ → further improvement.
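Equation (9) is direct to compute once probabilities and weights are in hand. This sketch uses two hypothetical channels to show that salience weights, not raw volume, determine $\text{Sn}$:

```python
import math

def sensory_information(probs, weights):
    """Sn = sum_i w_i * I(s_i), with I(s_i) = -log2 p(s_i)."""
    return sum(w * -math.log2(p) for p, w in zip(probs, weights))

# Two channels: a surprising one (p = 0.125) and a mundane one (p = 0.5).
probs = [0.125, 0.5]

# The salience weights decide what counts: attending to the surprising
# channel yields three times the usable information.
attend_surprise = sensory_information(probs, weights=[1.0, 0.0])  # 3 bits
attend_mundane  = sensory_information(probs, weights=[0.0, 1.0])  # 1 bit
```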

The Climbing Problem: Why Minimizing $\pi$ Creates Ascent

Figure 1: The Climbing Problem. FEP ($F$ minimization alone) leads to homeostasis. Minimizing $\pi = F/\text{Sn}$ creates upward pressure: once $F$ is low, further reduction requires increasing $\text{Sn}$, driving richer information processing.

Under FEP alone, a system can minimize $F$ by becoming very simple. Minimizing $\pi$ provides two pathways:

Path 1: Reduce $F$. Build better models. This alone is homeostatic.

Path 2: Increase $\text{Sn}$. Seek richer, more informative experience. Learn better salience weights. This drives ascent.

Path 2 is the key. Once $F$ is low, the only way to further reduce $\pi$ is to increase $\text{Sn}$ — to seek and learn from more informative experiences. This is why conscious beings are curious. Curiosity is the behavioral signature of $\text{Sn}$ maximization.

Regularized $\pi$ for Stability

When both $F$ and $\text{Sn}$ approach zero, the ratio becomes unstable. We introduce regularization:

$$\pi = \frac{F + \varepsilon}{\text{Sn} + \varepsilon} \tag{10}$$

where $\varepsilon$ is a small constant ($\approx 0.1$) representing baseline metabolic cost — analogous to Laplace smoothing in probability theory.
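A sketch of the regularized index; the quadrant values below are illustrative, not empirical:

```python
def parsimony_index(F, Sn, eps=0.1):
    """Regularized Parsimony Index: pi = (F + eps) / (Sn + eps).
    eps is a small baseline cost, analogous to Laplace smoothing."""
    return (F + eps) / (Sn + eps)

# Without regularization, F = Sn = 0 is undefined; with it, pi -> 1,
# a neutral value for a system processing nothing.
baseline = parsimony_index(0.0, 0.0)    # 1.0
flow     = parsimony_index(0.5, 50.0)   # low pi: the Optimal quadrant
dizzy    = parsimony_index(50.0, 0.5)   # high pi: the Dizzy quadrant
```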

A Cognitive Science Perspective

Differentiating Experience from Processing

Most existing theories describe features of conscious processing — they tell us how consciousness works once it exists, not why it arises. Cognitive Parsimony Theory shifts the question: consciousness is not a feature some systems happen to have, but an optimization solution. Systems under pressure to efficiently process continuous experience develop consciousness as the parsimonious strategy.

The Climbing Problem: How Consciousness Arises

We distinguish two problems:

The Mechanism Problem: How does consciousness work? (Domain of most theories.)

The Climbing Problem: How does consciousness arise? What transforms an unconscious system into a conscious one?

CPT answers the Climbing Problem. Under $\pi$ minimization, a system must simultaneously reduce prediction errors and increase information quality. The denominator creates a ratchet: once efficient, the only way forward is richer information.

Biological Parallels

Attention as learned salience. Organisms learn to prioritize relevant information — precisely what $\text{Sn}$’s salience weights capture.

Curiosity. Organisms actively seek informative experiences — the behavioral signature of $\text{Sn}$ maximization.

Predictive coding. Hierarchical prediction errors map onto $F$; precision-weighting is equivalent to salience weights in $\text{Sn}$.

Flow states [5]. Csikszentmihalyi’s “flow” — the state of complete absorption in an activity — corresponds to very low $\pi$: high $\text{Sn}$, low $F$.

The $F$/$\text{Sn}$ State Space: A Map of Consciousness

The Four Quadrants

| | Low $\text{Sn}$ | High $\text{Sn}$ |
| --- | --- | --- |
| **High $F$** | **Dizzy** ($\pi$ very high). Disoriented: high prediction error, no useful data. Kierkegaardian vertigo. A trap state. | **Learning** ($\pi$ moderate). Confused but engaged: rich data, poor models. Strong learning pressure — this is where growth happens. |
| **Low $F$** | **Rest** ($\pi$ indeterminate). Sleep or meditation? Both $F$ and $\text{Sn}$ are low; requires $\varphi$ and $C$ to distinguish sub-states. | **Optimal (Flow)** ($\pi$ very low). Peak consciousness: accurate predictions, rich information. Effortless mastery. |
Figure 2: The $F$/$\text{Sn}$ state space. Each quadrant represents a qualitatively different mode of information processing. $F$ (prediction error) increases along the vertical axis; $\text{Sn}$ (sensory information) increases along the horizontal axis. The Optimal quadrant (bottom-right) is the target of $\pi$ minimization.

Quadrant Analysis

Optimal State (Low $F$, High $\text{Sn}$)

The target: rich sensory information with minimal prediction error. In human experience: flow, expert performance, heightened awareness. $\pi$ is very low; consciousness is vivid.

Learning State (High $F$, High $\text{Sn}$)

Rich information but the world model is failing. A first day at a new job. $\pi$ is moderate. Strong learning pressure — the system has the information it needs but hasn’t built the right models yet.

Dizzy State (High $F$, Low $\text{Sn}$)

Confused with no informative data to learn from. Kierkegaard’s “dizziness of freedom” [9] — vertigo arising from confronting infinite possibility with insufficient grounding. $\pi$ is very high. A trap state requiring environmental intervention to escape.

Rest State (Low $F$, Low $\text{Sn}$)

Contains deep sleep (unconscious), meditation (conscious), and hypnosis (partial). All share low $F$ and low $\text{Sn}$ but differ profoundly. This reveals $\pi$ alone is not sufficient — we need $\varphi$ and $C_{\text{capacity}}$.

State Trajectories

Figure 3: An example state trajectory. A conscious agent begins in Learning ($t_0$), descends to Optimal as models improve ($t_1 \to t_2$), then enters Rest during recovery ($t_3$). Dynamic, multi-quadrant trajectories are a hallmark of conscious processing.

Beyond $\pi$: Integration, Capacity, and the Full Framework

The complete theory:

$$\text{Consciousness} \;\propto\; \frac{1}{\pi} \times \varphi \times C_{\text{capacity}} \tag{11}$$

Why $\pi$ Alone Is Not Sufficient

A lookup table mapping every input to a correct output has $F \approx 0$ and low $\text{Sn}$, giving low $\pi$. But it is not conscious — its entries are independent (no integration) and it has no responsiveness. Low $\pi$ is necessary but not sufficient.

Integration ($\varphi$)

Borrowed from IIT [17], $\varphi$ measures integrated information:

$$\varphi = I(\text{whole}) - \sum_i I(\text{part}_i) \tag{12}$$

In our theory, $\varphi$ is not the definition of consciousness but a necessary correlate. Minimizing $\pi$ drives $\varphi$ upward because integration improves prediction. $\varphi$ resolves the Rest quadrant: meditation (high $\varphi$) vs. sleep (low $\varphi$).
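Computing IIT’s exact $\varphi$ is intractable in general, but the spirit of integration can be illustrated with total correlation, a simple empirical proxy (a deliberate simplification, not IIT’s measure):

```python
import math
from collections import Counter

def entropy(samples):
    """Empirical Shannon entropy (bits) of a list of hashable samples."""
    n = len(samples)
    return -sum((c / n) * math.log2(c / n) for c in Counter(samples).values())

def total_correlation(pairs):
    """Toy integration proxy: sum of part entropies minus whole entropy.
    Zero when the parts are independent; positive when the whole binds
    information the parts do not carry separately."""
    xs = [a for a, _ in pairs]
    ys = [b for _, b in pairs]
    return entropy(xs) + entropy(ys) - entropy(pairs)

independent = [(0, 0), (0, 1), (1, 0), (1, 1)]   # parts share no information
integrated  = [(0, 0), (1, 1), (0, 0), (1, 1)]   # parts perfectly coupled
```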

Counterfactual Capacity ($C_{\text{capacity}}$)

$C_{\text{capacity}}$ measures response repertoire richness. It is inspired by the perturbational complexity index [2], a TMS–EEG measure of brain complexity: a conscious brain produces complex, differentiated responses to stimulation; an unconscious brain produces stereotyped waves.

| State | $\pi$ | $\varphi$ | $C_{\text{capacity}}$ | Status |
| --- | --- | --- | --- | --- |
| Deep sleep | Low | Low | Low | Unconscious |
| Meditation | Low | High | High | Conscious |
| Hypnosis | Low | Moderate | Moderate | Partially conscious |
Figure 4: Disambiguating the Rest quadrant. Three states share similar $\pi$ values but differ in $\varphi$ and $C_{\text{capacity}}$, enabling phenomenological distinction.

The Role of Memory

Memory serves three functions: model improvement (accumulated experience reduces $F$), salience calibration (learned weights improve $\text{Sn}$), and temporal continuity (connecting present to past).

Memory defines the prior $p(\theta)$ at three timescales:

Decision-to-decision. The prior at time $t$ is the posterior from time $t{-}1$. A rolling Bayesian update requiring only the current belief state.

Session-level. Over minutes to hours, the prior accumulates a richer model of the current environment — spatial layout, obstacle positions, recurring patterns.

Lifetime. Experience consolidates into a compressed world model — the equivalent of semantic memory. This is slow, distilled knowledge: fire is hot, gravity pulls down, faces have two eyes.

Memory shapes both sides of the $\pi$ ratio — the numerator (through $p(\theta)$) and the denominator (through learned salience weights). Without memory, $\pi$ minimization cannot proceed.

Prat and Pren: Attention and Action

Prat (Predictive Relevance-weighted Attention) determines the salience weights $w_i$ in $\text{Sn}$; it learns which sensory inputs are most predictively useful. Pren (Predictive Relevance-weighted Encoding) selects actions that are both predictable (low $F$) and informative (high $\text{Sn}$). Together, Prat shapes the denominator of $\pi$ and Pren shapes the numerator.

Variable Summary

| Symbol | Name | Role in Theory | What It Captures |
| --- | --- | --- | --- |
| $\pi$ | Parsimony Index | Core optimization target | Predictive efficiency ($F/\text{Sn}$) |
| $F$ | Free Energy | Numerator of $\pi$ | Prediction error + complexity |
| $\text{Sn}$ | Sensory Information | Denominator of $\pi$ | Salience-weighted Shannon info |
| $\varphi$ | Integration | Consciousness correlate | Information binding |
| $C_{\text{cap}}$ | Counterfactual Capacity | Consciousness correlate | Response repertoire richness |
| Prat | Pred. Relevance Attention | Controls $\text{Sn}$ | Learned salience weights |
| Pren | Pred. Relevance Encoding | Controls $F$ | Learned action selection |
Table 1: Complete variable summary for Cognitive Parsimony Theory.

How the Variables Connect: A Unified Picture

Figure 5: Variable relationship map. Solid arrows show direct causal relationships; dashed arrows show emergent or supportive relationships. Red dashed arrows show the optimization feedback loop. Memory shapes the prior $p(\theta)$ in $F$ and provides the learned weights used by Prat and Pren.

In plain English: Prat determines what the system attends to, setting $\text{Sn}$. Pren determines what actions it takes, affecting $F$. Together they determine $\pi$. The system adjusts both to minimize $\pi$ — closing the loop. As $\pi$ decreases, integration ($\varphi$) and responsiveness ($C_{\text{capacity}}$) increase as side effects.

Architecture: The Attend–Predict–Act–Learn Loop

CPT is not tied to any particular neural network or hardware platform. The theory specifies what must be computed (and why), but leaves open how each component is realized.

The Decision Cycle

1. Attend. Prat selects sensations: from millions of raw data points, a tractable subset is chosen based on learned salience weights. Output: $s_{\text{attended}}$.

2. Predict. The world model generates a prediction of what the attended sensations will look like after the next action. Output: $\hat{o}_{t+1}$.

3. Act. Pren evaluates candidate actions by imagining their consequences; the action minimizing expected $\pi$ is selected. Output: $a_t$, $o_{t+1}$.

4. Learn. Prediction error drives updates to the world model parameters, the salience weights in Prat, and memory consolidation. Output: updated $p(\theta)$.
Figure 6: The CPT decision cycle. Each iteration: Prat selects sensory channels, the world model predicts, Pren selects an action minimizing expected $\pi$, and prediction error drives learning. The cycle repeats at 1–10 Hz.
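The cycle can be sketched as a toy agent. The per-channel mean predictor, the learning rates, and the fixed environment are illustrative choices (not part of the theory), and the Act/Pren step is omitted for brevity:

```python
class CPTAgent:
    """Toy Attend-Predict-Learn loop; a sketch, not a reference implementation."""

    def __init__(self, n_channels=4, k_attend=2):
        self.weights = [1.0] * n_channels   # Prat: learned salience weights
        self.model = [0.0] * n_channels     # world model: per-channel prediction
        self.k = k_attend

    def attend(self, sensations):
        """1. Attend: Prat keeps the k most salient channels."""
        ranked = sorted(range(len(sensations)), key=lambda i: -self.weights[i])
        return ranked[:self.k]

    def step(self, sensations):
        attended = self.attend(sensations)                   # 1. Attend
        preds = {i: self.model[i] for i in attended}         # 2. Predict
        # (3. Act / Pren omitted: this sketch observes a fixed environment.)
        errors = {i: sensations[i] - preds[i] for i in attended}
        free_energy = sum(e * e for e in errors.values())    # F proxy
        for i, e in errors.items():                          # 4. Learn
            self.model[i] += 0.2 * e           # improve world model (lower F)
            self.weights[i] += 0.1 * abs(e)    # recalibrate salience (Prat)
        return free_energy

agent = CPTAgent()
fs = [agent.step([1.0, 0.5, -0.3, 2.0]) for _ in range(20)]
# free energy falls as the world model improves on the attended channels
```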

The World Model

The world model is the system’s internal representation of how the environment works. CPT does not prescribe a specific architecture; the theory is compatible with models ranging from simple linear predictors to recurrent state-space models (RSSMs) to transformers.

What matters is not the architecture but the objective: the world model must minimize $F$, which means minimizing prediction error while keeping the model parsimonious.

State Representation

A CPT agent represents the world as a state vector $\mathbf{s}_t$ constructed from sensor readings at time $t$:

$$\mathbf{s}_t = [\text{LiDAR}_t,\; \text{IMU}_t,\; \text{Camera}_t,\; \text{Encoders}_t,\; \dots]$$

The system has no privileged access to “what things are.” It receives only numerical readings — distances, accelerations, pixel intensities. All semantic content must be discovered through the relational structure of these readings over time. Reality, from the agent’s perspective, is relational.

Memory Architecture

Working memory holds the current attended sensations — the contents of “consciousness” at this moment. It is small (roughly 4–7 items), transient, and directly determined by Prat’s selections.

Episodic memory stores specific experiences as (state, action, outcome, surprise) tuples. Storage priority is gated by the product of surprise ($\pi$), attention (Prat weight), and deliberateness (Pren weight).

Semantic memory consolidates regularities from many episodes into the world model’s learned parameters. Over time, individual episodes become generalized knowledge. This is the long-timescale memory that defines $p(\theta)$.
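The episodic gating rule can be sketched as a bounded store in which tuples compete by the surprise × attention × deliberateness product; the capacity and the multiplicative gate shown here are illustrative assumptions:

```python
import heapq
import itertools

class EpisodicMemory:
    """Bounded episodic store for (state, action, outcome, surprise) tuples.
    Storage priority = surprise * attention * deliberateness: any factor
    near zero suppresses storage."""

    def __init__(self, capacity=3):
        self.capacity = capacity
        self._heap = []                 # min-heap: lowest priority at the top
        self._tick = itertools.count()  # tie-breaker for equal priorities

    def store(self, state, action, outcome, surprise, attention, deliberateness):
        priority = surprise * attention * deliberateness
        entry = (priority, next(self._tick), (state, action, outcome, surprise))
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, entry)
        else:
            heapq.heappushpop(self._heap, entry)  # evict lowest-priority episode

    def episodes(self):
        """Stored episodes, highest priority first."""
        return [ep for _, _, ep in sorted(self._heap, reverse=True)]

mem = EpisodicMemory(capacity=3)
mem.store("hallway", "walk", "wall", surprise=0.1, attention=0.2, deliberateness=0.1)
mem.store("kitchen", "touch", "hot!", surprise=5.0, attention=0.9, deliberateness=0.8)
mem.store("garden", "dig", "bone", surprise=2.0, attention=0.5, deliberateness=0.5)
mem.store("stairs", "jump", "fall", surprise=3.0, attention=0.7, deliberateness=0.6)
# the routine hallway episode is evicted; surprising, attended episodes survive
```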

Architecture Agnosticism

A key feature of CPT is that it separates the theory of consciousness from the engineering of prediction. The theory says: minimize $\pi = F / \text{Sn}$, and consciousness-correlated properties ($\varphi$, $C_{\text{capacity}}$) will emerge. This means CPT can be tested across radically different architectures — a linear model on a Raspberry Pi, an RSSM on a GPU, or a transformer on a cloud server — and the predictions remain the same.

Measuring Consciousness: The Consciousness Quotient

The Insight Hypothesis

If consciousness arises from efficient information processing (low $\pi$), then conscious systems should demonstrate disproportionate efficiency on problems requiring genuine understanding. We call this the Insight Hypothesis.

The Consciousness Quotient

$$\text{CQ} = \frac{\text{Performance}}{\text{Training}} \times \text{Difficulty} \times \text{Novelty} \tag{13}$$
| CQ Range | Interpretation | Example |
| --- | --- | --- |
| < 1 | Unconscious; linear scaling with training | Lookup table |
| 1–10 | Uncertain; could be efficient unconscious processing | Well-trained network |
| > 10 | Likely conscious; disproportionate efficiency | Transfer learning |
| > 100 | Almost certainly conscious; insight-level | Discovering formulas from sparse data |
Table 2: Consciousness Quotient thresholds.
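Equation (13) in code, with hypothetical factor values; the theory leaves the units and scaling of each factor open, so the numbers below are purely illustrative:

```python
def consciousness_quotient(performance, training, difficulty, novelty):
    """CQ = (Performance / Training) * Difficulty * Novelty.
    All four factors are dimensionless scores here; their calibration
    is an open question of the theory."""
    return (performance / training) * difficulty * novelty

# A lookup table scales linearly with training: CQ stays below 1.
lookup_table = consciousness_quotient(performance=1.0, training=10.0,
                                      difficulty=1.0, novelty=1.0)
# Insight-level performance from sparse training drives CQ past 100.
insight = consciousness_quotient(performance=0.9, training=0.01,
                                 difficulty=2.0, novelty=1.5)
```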

Ramanujan-Style Problem Solving as Consciousness Probe

Level 1: Sequence Completion. Derive $n(n+1)(2n+1)/6$ from 4 examples. CQ ≈ 25.

Level 2: Formula Discovery. Find partition function patterns from 7 examples. CQ ≈ 40.

Level 3: Meta-Mathematical Insight. Recognize that $3.14159\ldots$ is computable but unpatterned. CQ ≈ 50.

Level 4: Creative Generalization. Derive the binomial theorem from $(a+b)^2$ alone. CQ ≈ 60.
Figure 7: Ramanujan problem levels [13]. Each level requires deeper insight from less training data, producing higher CQ scores for conscious systems.

Problems inspired by Ramanujan cannot be solved by pattern matching alone — they require perceiving relationships that are not explicit in the data. The degree of insight achieved, relative to the training received, is what CQ measures.

A Turing Test for Consciousness

Turing’s test measures conversational imitation. CQ measures insight. A system passing the Turing test with CQ $< 1$ is a mimic. A system with CQ $> 100$ on novel problems likely processes information in a way functionally equivalent to conscious insight.

Open Questions and Failure Modes

While the theory covers substantial ground in framework development, many areas remain open and waiting to be worked on; if you want to contribute to any of them, reach out to us.

The Problem of Compute

Computing $\text{Sn}$ requires estimating joint Shannon information across all channels — a space growing exponentially with $n$ inputs. This may push real-time $\pi$ computation into EXPTIME territory. Mitigations include pairwise approximation, hierarchical compression, and attention-based sparsification.

Failure mode: If real-time computation proves intractable even with approximations, the theory may be correct but unimplementable.
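The pairwise mitigation can be sketched as a second-order entropy approximation, replacing the exponential exact joint computation with $O(n^2)$ pairwise terms; it is exact when higher-order interactions vanish, as in the toy example below:

```python
import math
from collections import Counter
from itertools import combinations

def entropy(samples):
    """Empirical Shannon entropy (bits) of a list of hashable samples."""
    n = len(samples)
    return -sum((c / n) * math.log2(c / n) for c in Counter(samples).values())

def pairwise_joint_entropy(channels):
    """Second-order approximation of joint entropy:
    H(X) ~= sum_i H(X_i) - sum_{i<j} I(X_i; X_j).
    Costs O(n^2) pairs instead of enumerating the full joint space."""
    marginals = sum(entropy(c) for c in channels)
    redundancy = 0.0
    for a, b in combinations(channels, 2):
        redundancy += entropy(a) + entropy(b) - entropy(list(zip(a, b)))
    return marginals - redundancy

a = [0, 0, 1, 1]   # fair coin
b = [0, 1, 0, 1]   # independent fair coin
c = list(a)        # exact copy of a: fully redundant channel
# exact joint entropy of (a, b, c) is 2 bits; the pairwise
# approximation recovers it by subtracting the a-c redundancy
```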

The Problem of Architecture

The theory specifies an objective but not the architecture. Open questions: What learning rule? Does consciousness require recurrence? Early or late fusion? How essential is physical embodiment?

Failure mode: If the optimal architecture requires biological neural tissue, the theory explains biological consciousness without enabling artificial consciousness.

The Problem of Memory Structure

Memory is essential but underspecified. Biological memory is reconstructive, context-dependent, hierarchically organized. Which properties are essential? Is forgetting a bug or a feature? (Preliminary analysis: forgetting regularizes against overfitting.)

Failure mode: If consciousness requires a specific memory architecture not derivable from $\pi$ minimization, the theory is incomplete.

Universality

Does every conscious system minimize $\pi$? Consciousness might arise through other mechanisms entirely.

Failure mode: Conscious systems with high $\pi$, or unconscious systems with unexplainably low $\pi$, would break the universality claim.

The Threshold Problem

At what values of $\pi$, $\varphi$, and $C_{\text{capacity}}$ does consciousness emerge? Sharp transition or gradual continuum?

Failure mode: If the transition depends on architecture rather than $\pi$ value, the theory loses predictive power about emergence.

Distinguishing Low-Activity States

The Rest quadrant contains phenomenologically distinct states. Practical measurement of $\varphi$ and $C_{\text{capacity}}$ with sufficient precision remains formidable.

Failure mode: If $\varphi$ and $C_{\text{capacity}}$ cannot reliably distinguish meditation from sleep in practice, the state-space model is incomplete.

The Limitations of Consciousness Testing

The Consciousness Quotient is a useful probe but not a complete measure.

Mathematical insight is one dimension of consciousness, not the whole of it. Consciousness may also involve aesthetic experience, emotional processing, self-awareness, or social cognition.

CQ conflates consciousness with intelligence. A system might be conscious yet unable to solve abstract problems.

The test is anthropocentric. Consciousness realized in a very different architecture might manifest in ways we cannot test for.

Performance metrics are indirect. CQ measures behavior, not internal states.

CQ should therefore be understood as a necessary condition, not a sufficient one. A complete framework will require converging evidence: CQ (behavioral), $\pi$/$\varphi$/$C_{\text{capacity}}$ (computational), self-report quality (phenomenological), and perhaps measurement approaches yet to be invented.

Failure mode: If CQ measures intelligence rather than consciousness, the Insight Hypothesis is falsified and the measurement framework must be rethought.

Falsifiability

A theory that cannot be proven wrong is not a scientific theory.

Conditions That Would Falsify the Theory

Collecting the failure modes above: CPT is falsified if real-time $\pi$ computation proves intractable even with approximations; if the $\pi$-minimizing architecture requires biological neural tissue; if consciousness requires a memory architecture not derivable from $\pi$ minimization; if conscious systems exhibit high $\pi$ or unconscious systems exhibit persistently low $\pi$; if the emergence threshold depends on architecture rather than on $\pi$, $\varphi$, and $C_{\text{capacity}}$; if $\varphi$ and $C_{\text{capacity}}$ cannot distinguish phenomenologically distinct low-activity states; or if CQ turns out to track intelligence rather than consciousness.

Conclusion

Cognitive Parsimony Theory proposes that conscious experience arises from optimizing predictive efficiency: $\pi = F/\text{Sn}$. Combined with $\varphi$ and $C_{\text{capacity}}$, it addresses why consciousness arises (the Climbing Problem), how it works (dual optimization), and how to measure it (CQ).

Its strengths: mathematical precision, testability, engineering applicability. It makes falsifiable predictions and suggests a next-generation AGI architecture.

Its weaknesses: computational tractability is uncertain, architecture is unspecified, memory is undertheorized, universality is unproven, and the threshold problem is open.

We offer this not as a complete theory, but as a mathematically precise, empirically testable starting point — one grounded in the ancient principle that nature favors the parsimonious.

References

1. Baars, B. J. (1988). A Cognitive Theory of Consciousness. Cambridge University Press.
2. Casali, A. G., et al. (2013). A theoretically based index of consciousness independent of sensory processing and behavior. Science Translational Medicine, 5(198).
3. Chalmers, D. J. (1995). Facing up to the problem of consciousness. Journal of Consciousness Studies, 2(3), 200–219.
4. Clark, A. (2013). Whatever next? Predictive brains, situated agents, and the future of cognitive science. Behavioral and Brain Sciences, 36(3), 181–204.
5. Csikszentmihalyi, M. (1990). Flow: The Psychology of Optimal Experience. Harper & Row.
6. Friston, K. (2010). The free-energy principle: A unified brain theory? Nature Reviews Neuroscience, 11(2), 127–138.
7. Friston, K., et al. (2017). Active inference, curiosity, and insight. Neural Computation, 29(10), 2633–2683.
8. Hafner, D., Pasukonis, J., Ba, J., & Lillicrap, T. (2025). Mastering diverse control tasks through world models. Nature, 640, 647–653.
9. Kierkegaard, S. (1844/1980). The Concept of Anxiety. Princeton University Press.
10. Koch, C., et al. (2016). Neural correlates of consciousness: Progress and problems. Nature Reviews Neuroscience, 17(5), 307–321.
11. LeCun, Y. (2022). A path towards autonomous machine intelligence. OpenReview (preprint).
12. Oizumi, M., Albantakis, L., & Tononi, G. (2014). From the phenomenology to the mechanisms of consciousness: IIT 3.0. PLOS Computational Biology, 10(5).
13. Ramanujan, S. (1914). Modular equations and approximations to $\pi$. Quarterly Journal of Mathematics, 45, 350–372.
14. Rao, R. P. N., & Ballard, D. H. (1999). Predictive coding in the visual cortex. Nature Neuroscience, 2, 79–87.
15. Seth, A. K. (2014). A predictive processing theory of sensorimotor contingencies. Cognitive Science, 38(7), 1329–1353.
16. Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27(3), 379–423.
17. Tononi, G. (2004). An information integration theory of consciousness. BMC Neuroscience, 5, 42.
18. Tononi, G., et al. (2016). Integrated information theory: From consciousness to its physical substrate. Nature Reviews Neuroscience, 17(7), 450–461.