What Hyperbolic Geometry Reveals About How LLMs Reason
Most interpretability work assumes that LLM representations live in flat, Euclidean space. We compute cosine similarities, run PCA, project with t-SNE: all tools built on Euclidean assumptions.
But reasoning has hierarchical structure. Premises support conclusions. Abstract claims generalize over specific ones. If you draw a proof tree, you see something that looks like a branching hierarchy, not a point cloud in flat space.
Trees embed poorly in Euclidean space but naturally in hyperbolic space, where volume grows exponentially with radius. This mismatch motivated a paper I wrote for the ICLR 2026 Workshop (GRaM Tiny Paper Track), which was accepted for poster presentation. I wanted to test whether hyperbolic probes capture hierarchical reasoning structure in LLM hidden states better than Euclidean ones.
The results were more dramatic than I expected.
Why hyperbolic geometry?
In Euclidean space, the area of a circle grows as $\pi r^2$. In hyperbolic space, it grows exponentially with $r$. There is dramatically more room at the edges of a hyperbolic disk, which is exactly what you need to embed trees: they have exponentially more leaves than internal nodes.
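To see the gap concretely, compare the enclosed areas. In the hyperbolic plane at curvature $-1$, a disk of radius $r$ has area $2\pi(\cosh r - 1)$, which grows like $e^r$; a quick sketch (function names are mine):

```python
import math

def euclidean_area(r):
    """Area of a Euclidean disk of radius r: pi * r^2."""
    return math.pi * r ** 2

def hyperbolic_area(r):
    """Area of a hyperbolic disk of radius r at curvature -1:
    A(r) = 2*pi*(cosh(r) - 1), which grows like e^r."""
    return 2 * math.pi * (math.cosh(r) - 1)

for r in [1, 5, 10]:
    print(f"r={r:2d}  euclidean={euclidean_area(r):10.1f}  "
          f"hyperbolic={hyperbolic_area(r):.3e}")
```

For small $r$ the two areas nearly agree, but by $r = 10$ the hyperbolic disk holds over a hundred times more area, which is why trees with exponentially many leaves fit without crowding.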
The Poincare disk makes this concrete. Points near the center represent high-level, general concepts. Points near the boundary represent specific, leaf-level details. Distances between points capture hierarchical relationships.
This is not just an analogy. Nickel & Kiela (2017) showed that 5-dimensional Poincare embeddings can match 200-dimensional Euclidean embeddings on hierarchical data. More recently, He et al. (2025) measured the intrinsic $\delta$-hyperbolicity of LLM embeddings and found values between 0.07 and 0.20, suggesting genuine tree-like structure.
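Gromov $\delta$-hyperbolicity can be estimated directly from a pairwise distance matrix via the four-point condition (tree metrics give $\delta = 0$; reported values like 0.07-0.20 are typically small relative to the diameter). A brute-force sketch, with my own function names, feasible only for small point sets:

```python
import itertools
import numpy as np

def gromov_product(d, x, y, w):
    """Gromov product (x|y)_w = 0.5 * (d(x,w) + d(y,w) - d(x,y))."""
    return 0.5 * (d[x, w] + d[y, w] - d[x, y])

def delta_hyperbolicity(d):
    """Exact delta over all four-point configurations (O(n^4), small n only):
    delta = max over x,y,z,w of min((x|z)_w, (y|z)_w) - (x|y)_w."""
    n = d.shape[0]
    delta = 0.0
    for x, y, z, w in itertools.product(range(n), repeat=4):
        delta = max(delta,
                    min(gromov_product(d, x, z, w),
                        gromov_product(d, y, z, w))
                    - gromov_product(d, x, y, w))
    return delta

# Points on a line form a tree (path) metric, so delta is exactly 0:
pts = np.array([0.0, 1.0, 2.0, 3.0])
d = np.abs(pts[:, None] - pts[None, :])
print(delta_hyperbolicity(d))  # 0.0
```

In practice one estimates $\delta$ by sampling quadruples rather than enumerating all of them.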
So: if LLM representations during reasoning have hierarchical structure, can hyperbolic probes capture it more faithfully than Euclidean ones?
Setup
I compared two models from the same architecture family but with different training regimes:
- DeepSeek-R1-Distill-Qwen-7B: reasoning-specialized, trained with chain-of-thought distillation. Generates explicit reasoning steps via <think> tokens.
- Qwen2.5-7B-Instruct: standard instruction-tuned model from the same Qwen2.5 base.
Both are 28-layer transformers with 3584-dimensional hidden states, so the comparison isolates the effect of reasoning-specialized training rather than architectural differences.
The dataset is PrOntoQA (Saparov & He, 2023): 1000 logical reasoning problems with depths 1 through 5, forming clean linear chains. The templated structure minimizes linguistic confounds, letting us focus on geometric structure rather than surface-level language patterns.
For probing, I map layer activations to either Euclidean space or the Poincare ball ($d = 5$, curvature $c = 0.5$) and train lightweight probes to predict pairwise reasoning depth distances. The probe uses spectral normalization and Maximum Distance Rescaling for numerical stability. Training uses a stress-normalized loss (standard Kruskal stress from multidimensional scaling):
\[\mathcal{L} = \frac{\sum_{i \neq j} (d_{\text{pred}}(i,j) - d_{\text{true}}(i,j))^2}{\sum_{i \neq j} d_{\text{true}}(i,j)^2}\]

I evaluate across 8 layers (L8 through L27) with 5-fold cross-validation, using Spearman $\rho$ and distortion (mean absolute distance error) as metrics.
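The paper's probe targets the Poincare ball at $c = 0.5$ (typically via a Riemannian optimization library; not shown here). As an illustration only, here is the closed-form distance on the standard $c = 1$ Poincare ball together with the stress loss above, in NumPy; the function names are mine:

```python
import numpy as np

def poincare_dist(x, y):
    """Distance on the c=1 Poincare ball:
    d(x, y) = arccosh(1 + 2||x-y||^2 / ((1-||x||^2)(1-||y||^2)))."""
    num = 2 * np.sum((x - y) ** 2, axis=-1)
    denom = (1 - np.sum(x ** 2, axis=-1)) * (1 - np.sum(y ** 2, axis=-1))
    return np.arccosh(1 + num / denom)

def stress_loss(d_pred, d_true):
    """Stress-normalized loss from the post (Kruskal stress):
    sum over pairs of (d_pred - d_true)^2, divided by sum of d_true^2."""
    d_pred, d_true = np.asarray(d_pred), np.asarray(d_true)
    return float(((d_pred - d_true) ** 2).sum() / (d_true ** 2).sum())

# Distance from the origin to a point at Euclidean radius r is 2*arctanh(r):
o = np.zeros(2)
x = np.array([0.5, 0.0])
print(poincare_dist(o, x))  # ~1.0986, i.e. 2*arctanh(0.5)
```

Note how distances blow up near the boundary of the ball: that is the exponential volume growth doing the work.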
Finding 1: Euclidean probes break down in reasoning models
At the final layer (L27), the hyperbolic probe achieved Spearman $\rho = 0.967$ on both models. Robust, consistent, unremarkable in the best way.
The Euclidean probe told a different story. On Qwen (the standard model), it performed well: $\rho = 0.955$. On DeepSeek (the reasoning model), it collapsed to $\rho = 0.488$. Same architecture, same probing task, same layer. The only difference is the training regime.
The distortion numbers made the gap sharper. DeepSeek Euclidean distortion at L27 was 0.562, roughly 6x higher than the hyperbolic probe’s 0.090. Qwen’s Euclidean distortion was 0.139, comparable to its hyperbolic result (0.104).
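For reference, both metrics are straightforward to compute from predicted and ground-truth pairwise distances. A minimal sketch (no tie handling in the rank correlation; a library implementation would handle ties):

```python
import numpy as np

def spearman_rho(a, b):
    """Spearman rank correlation: Pearson correlation of the ranks.
    Assumes no ties; argsort-of-argsort yields each element's rank."""
    ra = np.argsort(np.argsort(a))
    rb = np.argsort(np.argsort(b))
    return float(np.corrcoef(ra, rb)[0, 1])

def distortion(d_pred, d_true):
    """Mean absolute distance error, the distortion metric used above."""
    return float(np.abs(np.asarray(d_pred) - np.asarray(d_true)).mean())
```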
Since the target metric (1D ordinal depth) embeds isometrically in both geometries, this advantage has to come from the representation structure itself. The model’s internal geometry genuinely favors hyperbolic decoding.
The degradation is progressive. Looking across layers, DeepSeek’s Euclidean probe is stable from L8 through L21 ($\rho \approx 0.97$), starts dropping at L23 ($\rho = 0.842$), partially recovers at L25 ($\rho = 0.906$), then falls off at L27 ($\rho = 0.488$). The hyperbolic probe stays above $\rho = 0.90$ across all layers. Qwen’s Euclidean probe shows no degradation at any layer.
Finding 2: thinking tokens concentrate hierarchical information
Chain-of-thought models produce explicit reasoning tokens during generation. Following Qian et al. (2025), I identified “thinking tokens” by matching reasoning markers: “Wait”, “Hmm”, “Let me”, “So”, “Therefore”, “Thus”, “Hence”, “Because”, “Since”. These constitute about 6.7% of the sequence (roughly 20.7 tokens per sample on average).
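The marker-matching step can be sketched as a simple regex pass. This is a simplification of my own: the actual analysis operates on tokenizer positions, and the case-sensitivity choice here is illustrative:

```python
import re

# Reasoning markers from the list above (following Qian et al., 2025).
THINKING_MARKERS = ["Wait", "Hmm", "Let me", "So", "Therefore",
                    "Thus", "Hence", "Because", "Since"]
pattern = re.compile(
    r"\b(" + "|".join(re.escape(m) for m in THINKING_MARKERS) + r")\b")

def thinking_token_spans(text):
    """Return (start, end) character spans of reasoning markers in text."""
    return [(m.start(), m.end()) for m in pattern.finditer(text)]

text = ("So the premise holds. Wait, check the second step. "
        "Therefore X is a Y.")
print(thinking_token_spans(text))  # three spans: So, Wait, Therefore
```

Activations at the matched positions are then pooled and probed, rather than pooling uniformly over the whole sequence.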
At Layer 27, probing these thinking tokens with the hyperbolic probe gave $\rho = 0.871$. Probing the last token gave $\rho = 0.468$. Uniform pooling over all tokens gave $\rho = 0.390$.
The thinking token advantage is concentrated at the final layer ($\Delta\rho = +0.481$ at L27). At intermediate layers (L19, L23, L25), thinking tokens actually perform worse than uniform pooling. The benefit only emerges where representations are most compressed. This suggests that hierarchical information gets consolidated into these specific token positions at the model’s output layer.
This provides geometric validation of what Qian et al. (2025) found through mutual information analysis: reasoning dynamics are concentrated in sparse, identifiable tokens that constitute just 0.5-5% of the generated sequence.
What the compression statistics reveal
Why do Euclidean probes fail specifically at late layers in reasoning models? I computed layer-wise activation statistics for DeepSeek and found a clear pattern of representational compression at L27.
From L25 to L27:
- Activation norms decrease by 41% (1333 to 782)
- Norm variance increases by 214% (39.7 to 124.4)
- Participation ratio (effective dimensionality) drops 43% (45.5 to 25.8)
- Isotropy increases roughly 20x (0.0049 to 0.096)
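The participation ratio used above is a standard effective-dimensionality estimate from the eigenvalues of the activation covariance: $\text{PR} = (\sum_i \lambda_i)^2 / \sum_i \lambda_i^2$, which equals $d$ for isotropic data and 1 for rank-1 data. A sketch (function names mine):

```python
import numpy as np

def participation_ratio(acts):
    """Effective dimensionality PR = (sum lam)^2 / sum(lam^2) over the
    eigenvalues of the covariance of acts (shape: samples x dims)."""
    lam = np.linalg.eigvalsh(np.cov(acts, rowvar=False))
    lam = np.clip(lam, 0.0, None)  # guard against tiny negative eigenvalues
    return float(lam.sum() ** 2 / (lam ** 2).sum())

rng = np.random.default_rng(0)
iso = rng.normal(size=(2000, 50))             # roughly isotropic: PR near 50
aniso = iso * np.array([10.0] + [0.1] * 49)   # one dominant axis: PR near 1
print(participation_ratio(iso), participation_ratio(aniso))
```

Isotropy has several estimators in the literature; the qualitative point is the same: a large PR drop plus a norm collapse means pairwise Euclidean distances at that layer carry much less usable signal.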
Qwen also compresses at its final layer, but less severely: participation ratio drops 29% (vs. 43%), and its effective dimensionality at L27 (43.1) is 67% higher than DeepSeek’s (25.8). This milder compression explains why Qwen’s Euclidean probes still work.
The interpretation: reasoning-specialized training creates representations that compress more aggressively at the output layer. Reduced effective dimensionality and loss of directional diversity mean Euclidean distances lose resolution. Hyperbolic geometry, with its exponential volume growth, accommodates this compressed structure where flat geometry cannot.
Limitations
This is a preliminary investigation with several important caveats.
Both models share the Qwen2.5 backbone, so the cross-model comparison reflects training regime differences rather than architectural ones. I only evaluated 7B-parameter models; scaling to 70B+ might reveal different patterns. PrOntoQA provides clean 1D chains, but real-world reasoning involves branching hierarchies and is considerably messier. Models were loaded with 4-bit quantization, which may affect activation distributions. And while layer statistics provide evidence for representational compression, full mechanistic understanding would require circuit-level analysis identifying which attention heads and MLPs drive the observed behavior.
What’s next
The direction I find most promising is using hyperbolic geometry to build interpretability tools that work with the natural structure of reasoning rather than flattening it.
If reasoning has hierarchical structure, our tools for understanding it should respect that geometry. Most interpretability methods assume flat spaces by default. The results here suggest that for reasoning-specialized models, this assumption can miss real structure.
I’m also curious whether different reasoning training approaches (RLHF, process reward models, constitutional AI) leave distinct geometric fingerprints. If they do, hyperbolic probing could become a diagnostic tool for comparing training regimes. And extending from linear chains to branching DAG structures (as Zhong et al., 2026 have started exploring) would test whether the geometric advantage holds for more complex reasoning topologies.
Accepted at the ICLR 2026 Workshop on Geometry-grounded Representation Learning and Generative Modeling (GRaM). The paper is on OpenReview.