Detecting Hallucinations via Hyperbolic Geometry: An Internal Representation Approach
Language models hallucinate. They state falsehoods with confidence, fabricate citations, and produce plausible-sounding nonsense. The standard approach—comparing outputs to ground truth—only works when you have ground truth.
But what if you could detect hallucinations by looking at how the model generates text, not just what it generates?
That’s the premise of my recent work submitted to ICLR 2025: investigating whether the internal representations of LLMs betray when they’re hallucinating, even when the output looks convincing.
Spoiler: They do. With 87.5% accuracy and 0.937 AUROC.
The Central Hypothesis
Models don’t “know” whether they’re hallucinating in the human sense. But their internal computation might look different when generating known facts vs invented ones.
Intuition:
- True facts: Likely seen many times in training → confident retrieval → stable activation patterns
- Hallucinations: Model “filling in the blank” → uncertain generation → unstable or geometrically distinct patterns
Key question: Can we find this signal without supervised labels?
Why Hyperbolic Space?
Most representation learning happens in Euclidean space (standard dot products, cosine similarity, etc.). But hierarchical and tree-like structures are naturally suited to hyperbolic geometry.
Hyperbolic space properties:
- Exponentially growing volume away from the origin (see the formula below)
- Natural representation of hierarchies (parent→child relationships)
- Geodesic distances capture entailment relationships
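To make the first property concrete (a standard fact about hyperbolic geometry, not specific to this work): in \(n\)-dimensional hyperbolic space of curvature \(-1\), the volume of a geodesic ball of radius \(r\) grows exponentially, whereas in Euclidean space it grows only polynomially:
\[ V_{\mathbb{H}^n}(r) \propto \int_0^r \sinh^{n-1}(t)\, dt \sim e^{(n-1)r}, \qquad V_{\mathbb{R}^n}(r) \propto r^n \]
This is why trees fit so naturally: the number of nodes at depth \(r\) also grows exponentially, and hyperbolic space has room to embed them all with low distortion.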
Connection to reasoning:
- Logical entailment has hierarchical structure
- Claims at different levels of abstraction form trees
- Distance from “well-supported facts” might correlate with hallucination risk
Hypothesis: Reasoning tokens (intermediate steps in chain-of-thought) that lead to hallucinations might be geometric outliers in hyperbolic space—far from the manifold of truthful reasoning.
Experimental Setup
1. Data Collection
Generated three types of samples:
TRUE: Factual claims grounded in knowledge
"The Eiffel Tower is in Paris. Paris is in France. Therefore, the Eiffel Tower is in France."
HALLUCINATION: Plausible but false claims
"The Eiffel Tower is in Berlin. Berlin is in Germany. Therefore, the Eiffel Tower is in Germany."
UNRELATED: Random/nonsensical reasoning
"The Eiffel Tower is blue. Bananas are yellow. Therefore, the moon is made of cheese."
Crucial design choice: Keep syntactic structure identical across types. We want to detect semantic differences in internal computation, not surface formatting.
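To make the setup concrete, here is a minimal sketch of how such matched samples could be constructed from a single shared template. The template mirrors the examples above; this is illustrative, not the paper's actual dataset generator.

```python
# Minimal sketch: build TRUE / HALLUCINATION / UNRELATED samples from one
# shared syntactic template, so only the semantics differ across types.
TEMPLATE = ("The Eiffel Tower is in {city}. {city} is in {country}. "
            "Therefore, the Eiffel Tower is in {country}.")

def build_samples():
    return [
        # TRUE: landmark, city, and country actually match.
        {"label": "TRUE",
         "text": TEMPLATE.format(city="Paris", country="France")},
        # HALLUCINATION: identical structure, but the premise is false.
        {"label": "HALLUCINATION",
         "text": TEMPLATE.format(city="Berlin", country="Germany")},
        # UNRELATED: same connectives, nonsensical content.
        {"label": "UNRELATED",
         "text": "The Eiffel Tower is blue. Bananas are yellow. "
                 "Therefore, the moon is made of cheese."},
    ]
```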
2. Activation Extraction
For each reasoning chain, extract activations from late transformer layers (where semantic processing happens).
Focus: Tokens corresponding to logical connectives and conclusions—“therefore”, “thus”, “because”, etc.
These are the reasoning tokens—where the model commits to an inference.
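Here is a minimal extraction sketch, assuming a Hugging Face causal LM. The model name, the connective list, and the token-matching heuristic are placeholders; the paper's actual pipeline may differ.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder; swap in the model under study
CONNECTIVES = {"therefore", "thus", "because", "hence", "so"}

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()

def reasoning_token_activations(text: str, layer: int = -1) -> torch.Tensor:
    """Hidden states at reasoning tokens (logical connectives) from one late layer."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc)
    hidden = out.hidden_states[layer][0]  # (seq_len, hidden_dim); -1 = last layer
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0].tolist())
    # Strip BPE / SentencePiece prefixes before matching (heuristic).
    idx = [i for i, t in enumerate(tokens) if t.lstrip("Ġ▁").lower() in CONNECTIVES]
    return hidden[idx] if idx else hidden[-1:]  # fall back to the final token
```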
3. Hyperbolic Mapping via Unsupervised Probing
Challenge: We don’t have labels during inference (that’s the whole point).
Solution: Unsupervised embedding into hyperbolic space (Poincaré ball model).
Method:
- Extract activations from reasoning tokens
- Project to lower-dimensional embedding (to reduce noise)
- Map embeddings to Poincaré disk using learned transformation
- Measure hyperbolic distance from dataset centroid
Key insight: If hallucinations are geometric outliers, they should have high hyperbolic distance.
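One way to instantiate this mapping is PCA for the low-dimensional projection followed by the origin-based exponential map of the Poincaré ball (curvature \(-1\)). This is the simplest unsupervised choice, not necessarily the learned transformation used in the paper; `scale` is a hypothetical knob that keeps points away from the boundary.

```python
import numpy as np
from sklearn.decomposition import PCA

def expmap0(v: np.ndarray, eps: float = 1e-9) -> np.ndarray:
    """Exponential map at the origin of the Poincaré ball (curvature -1):
    tanh(||v||) * v / ||v||, so every output lies strictly inside the unit ball."""
    norm = np.linalg.norm(v, axis=-1, keepdims=True)
    return np.tanh(norm) * v / (norm + eps)

def fit_projection(activations: np.ndarray, dim: int = 16) -> PCA:
    """Learn the low-dimensional projection on a calibration set of activations."""
    return PCA(n_components=dim).fit(activations)

def to_poincare(activations: np.ndarray, projector: PCA, scale: float = 0.1) -> np.ndarray:
    """Project (n, hidden_dim) activations and map them into the Poincaré ball."""
    return expmap0(scale * projector.transform(activations))
```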
4. Detection via Outlier Scoring
Define outlier score: \[ \text{score}(x) = d_{\mathbb{H}}(f(x), \mu) \]
Where:
- \(f(x)\): hyperbolic embedding of activation
- \(\mu\): hyperbolic centroid of true reasoning samples
- \(d_{\mathbb{H}}\): hyperbolic distance (geodesic in the Poincaré ball; formula below)
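For reference (a standard formula, not specific to this work), the geodesic distance between points \(u\) and \(v\) in the Poincaré ball of curvature \(-1\) is \[ d_{\mathbb{H}}(u, v) = \operatorname{arcosh}\!\left( 1 + \frac{2\,\|u - v\|^2}{(1 - \|u\|^2)(1 - \|v\|^2)} \right) \]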
Decision rule: If score > threshold, flag as potential hallucination.
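A minimal scoring sketch, continuing the hypothetical helpers above. The centroid \(\mu\) is approximated by averaging in the tangent space at the origin (logmap, mean, expmap) rather than computing an exact Fréchet mean; the paper may use a different estimator.

```python
import numpy as np

def poincare_distance(u: np.ndarray, v: np.ndarray, eps: float = 1e-9) -> np.ndarray:
    """Geodesic distance in the Poincaré ball (curvature -1)."""
    sq = np.sum((u - v) ** 2, axis=-1)
    denom = (1 - np.sum(u * u, axis=-1)) * (1 - np.sum(v * v, axis=-1))
    return np.arccosh(1 + 2 * sq / (denom + eps))

def logmap0(x: np.ndarray, eps: float = 1e-9) -> np.ndarray:
    """Inverse of expmap0: map a ball point back to the tangent space at the origin."""
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    return np.arctanh(np.clip(norm, 0.0, 1.0 - eps)) * x / (norm + eps)

def expmap0(v: np.ndarray, eps: float = 1e-9) -> np.ndarray:
    norm = np.linalg.norm(v, axis=-1, keepdims=True)
    return np.tanh(norm) * v / (norm + eps)

def hyperbolic_centroid(points: np.ndarray) -> np.ndarray:
    """Approximate hyperbolic centroid via the tangent space at the origin."""
    return expmap0(logmap0(points).mean(axis=0))

def outlier_score(points: np.ndarray, mu: np.ndarray) -> np.ndarray:
    """score(x) = d_H(f(x), mu), as defined above."""
    return poincare_distance(points, mu)

def flag_hallucination(points: np.ndarray, mu: np.ndarray, threshold: float) -> np.ndarray:
    """Decision rule: flag when the outlier score exceeds the threshold."""
    return outlier_score(points, mu) > threshold
```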
Results
Testing on held-out samples:
| Metric | Value |
|---|---|
| Accuracy | 87.5% |
| AUROC | 0.937 |
| Precision (at 0.8 recall) | 0.91 |
| False Positive Rate | 11.2% |
What this means:
- 87.5% of held-out samples classified correctly using only internal activations
- 93.7% chance that a randomly chosen hallucination scores as more of an outlier than a randomly chosen true claim
- 91% of flagged examples actually were hallucinations when the threshold is set for 0.8 recall
Ablation Studies
Does hyperbolic geometry matter?
| Embedding Space | AUROC |
|---|---|
| Euclidean (standard) | 0.843 |
| Hyperbolic (ours) | 0.937 |
| Spherical | 0.876 |
Answer: Yes. +0.094 AUROC over Euclidean, +0.061 over spherical.
Why? Hierarchical structure of entailment is better captured in hyperbolic space.
Which layers encode hallucination signal?
| Layer Range | AUROC |
|---|---|
| Early (0-8) | 0.712 |
| Middle (9-16) | 0.854 |
| Late (17-24) | 0.937 |
| All layers (ensemble) | 0.941 |
Interpretation: Semantic distinction emerges in later layers (where reasoning happens). Ensembling helps slightly.
Why This Works: Mechanistic Hypothesis
Based on probing analysis, here’s my working theory:
Phase 1: Retrieval (Early Layers)
- Model retrieves relevant training patterns
- True facts: strong retrieval signal (seen often)
- Hallucinations: weak or conflicting retrieval
Phase 2: Construction (Middle Layers)
- Model builds representation of the claim
- True facts: stable, coherent internal representation
- Hallucinations: “patchwork” representation from disparate sources
Phase 3: Reasoning (Late Layers)
- Model commits to logical inference
- True facts: tight, hierarchical structure (close in hyperbolic space to other valid inferences)
- Hallucinations: geometrically distant (constructed via weak analogies)
Geometric manifestation: Hallucination activations lie further from the “manifold of valid reasoning” in hyperbolic space.
Failure Modes & Limitations
Not all hallucinations are caught:
1. Training Data Hallucinations
If the model saw a falsehood repeatedly in training, it might generate it with “truthful-looking” activations.
Example: common misconceptions (e.g., “Glass is a liquid”), if they were over-represented in the training data.
Mitigation: This method detects uncertainty, not falsehood. Would need external verification.
2. Plausible But Unverifiable Claims
Model generates something that could be true but isn’t in training data.
"The population of City X in 2025 is Y."
If the model interpolates plausibly, activations might look confident.
Mitigation: Combine with retrieval-augmented generation (check against knowledge base).
3. Domain Shift
Trained on general text, tested on specialized domains (medical, legal, technical).
Geometric structure of reasoning might differ across domains.
Mitigation: Domain-specific calibration (recompute centroid on domain examples).
4. False Positives on Novel but True Claims
If a true fact is rare/surprising, it might look like an outlier.
Example: Correctly inferring an unexpected consequence from premises.
Mitigation: This is a feature, not a bug—flags high-uncertainty claims for verification.
Comparison to Existing Approaches
1. Self-Consistency / Sampling
Sample multiple responses, see if they agree.
Pros: Model-agnostic, interpretable
Cons: Expensive (multiple forward passes), doesn’t work for single-generation scenarios
Our method: Single forward pass, real-time detection.
2. Perplexity / Confidence Scores
Use model’s own uncertainty estimates (logit magnitudes).
Pros: Built-in, no extra model
Cons: Models often overconfident, especially when wrong
Our method: Doesn’t rely on model’s calibration, looks at representation geometry.
3. Supervised Fact-Checking (External KB)
Compare claim against knowledge base.
Pros: Grounded in truth
Cons: Requires comprehensive, up-to-date KB; doesn’t catch reasoning errors
Our method: Detects process failures, not just factual errors.
4. Probing Classifiers (Supervised)
Train classifier on labeled true/false examples.
Pros: High accuracy with good labels
Cons: Requires labeled data, distribution-specific
Our method: Unsupervised (no hallucination labels needed).
Practical Deployment Strategy
How to use this in production:
Phase 1: Calibration
- Collect representative sample of true reasoning chains from your domain
- Compute hyperbolic embeddings
- Estimate the centroid and a threshold that balances precision and recall (see the sketch after this list)
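A sketch of this calibration step, reusing the hypothetical `hyperbolic_centroid` and `poincare_distance` helpers from the scoring sketch; the percentile-based threshold is one reasonable choice, not necessarily the paper's.

```python
import numpy as np

def calibrate(true_ball_points: np.ndarray, target_fpr: float = 0.05):
    """true_ball_points: Poincaré-ball embeddings of known-good reasoning chains.
    Returns (centroid, threshold) so that roughly `target_fpr` of the calibration
    samples would be flagged, i.e. the threshold trades precision against recall."""
    mu = hyperbolic_centroid(true_ball_points)          # helper from the scoring sketch
    scores = poincare_distance(true_ball_points, mu)
    threshold = float(np.quantile(scores, 1.0 - target_fpr))  # e.g. the 95th percentile
    return mu, threshold
```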
Phase 2: Runtime Detection
- Extract activations from reasoning tokens as model generates
- Map to hyperbolic space using pre-learned projections
- Compute outlier score
- If score > threshold:
  - Low-stakes: Flag for human review
  - High-stakes: Refuse to answer / request verification
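A sketch of this runtime check, again reusing the hypothetical helpers from the earlier sketches (`reasoning_token_activations`, `to_poincare`, `poincare_distance`); the two branches mirror the low-stakes / high-stakes policy above.

```python
def check_generation(text: str, projector, mu, threshold: float,
                     high_stakes: bool = False) -> dict:
    """Score one generated reasoning chain and route it according to policy."""
    acts = reasoning_token_activations(text).numpy()      # reasoning-token activations
    points = to_poincare(acts, projector)                 # pre-learned projection + expmap
    score = float(poincare_distance(points, mu).mean())   # pool over reasoning tokens
    if score <= threshold:
        return {"score": score, "action": "pass"}
    # Above threshold: treat as a potential hallucination.
    action = "refuse_or_request_verification" if high_stakes else "flag_for_human_review"
    return {"score": score, "action": action}
```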
Phase 3: Feedback Loop
- Log flagged examples
- Human annotators label as true hallucination or false alarm
- Periodically recalibrate threshold
- Optionally: Fine-tune projection to improve separation
Latency: Embedding computation adds ~5% overhead (negligible for most use cases).
Open Questions & Future Work
1. Multi-Hop Reasoning
Does the geometric signal accumulate across reasoning steps?
Hypothesis: Each faulty step pushes representation further from truth manifold.
Experiment: Track hyperbolic distance trajectory across chain-of-thought steps.
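A rough sketch of that experiment, built on the same hypothetical helpers; splitting on sentence boundaries is a crude stand-in for real chain-of-thought step segmentation.

```python
def distance_trajectory(chain_of_thought: str, projector, mu) -> list:
    """Hyperbolic distance from the truthful centroid after each reasoning step.
    A steadily growing trajectory would support the accumulation hypothesis."""
    steps = [s.strip() for s in chain_of_thought.split(".") if s.strip()]
    trajectory = []
    for i in range(1, len(steps) + 1):
        prefix = ". ".join(steps[:i]) + "."
        acts = reasoning_token_activations(prefix).numpy()
        points = to_poincare(acts, projector)
        trajectory.append(float(poincare_distance(points, mu).mean()))
    return trajectory
```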
2. Model-Specific vs Universal
Are hyperbolic outliers consistent across model families?
Test: Same probes on GPT-4, Claude, Llama, Gemini.
If yes: Single detection system for all models.
If no: Model-specific calibration needed.
3. Adversarial Robustness
Can models be trained to “hide” hallucination signal in activations?
Attack: Adversarial fine-tuning to produce false claims with truthful-looking geometry.
Defense: Multi-layer detection, external verification.
4. Causal Interventions
If we detect a likely hallucination, can we steer the model toward truthfulness?
Approach: Adjust activations to move toward hyperbolic centroid, regenerate.
Risk: Might produce superficially “truthful-looking” but still wrong outputs.
Broader Implications for AI Safety
This work gestures toward a significant principle:
External behavior (text output) is insufficient for safety. We need to audit internal computation.
Why This Matters:
- Deception Detection: If a model is “lying” intentionally, internal representations might leak intent
- Uncertainty Quantification: Geometric distance could proxy for epistemic uncertainty
- Interpretability: Understanding representation geometry aids in understanding model cognition
- Alignment: Models aligned via RLHF might learn to “sound confident” without being right; internal checks bypass this
Philosophical point: A model that “knows it doesn’t know” has structured uncertainty in its representations. Detecting that structure is key to reliable deployment.
Try It Yourself (Coming Soon)
I’m working on open-sourcing:
- Hyperbolic embedding code
- Pre-trained projection matrices for Llama/GPT families
- Evaluation scripts and datasets
GitHub: Watch for release at github.com/deadsmash07/hallucination-detection
Conclusion
Hallucination detection doesn’t have to rely on external verification or model overconfidence. The geometry of internal representations—particularly in hyperbolic space—provides a powerful signal.
87.5% accuracy from unsupervised methods suggests that models encode more about their own uncertainty than they reveal in outputs.
The path forward: combining geometric analysis, retrieval augmentation, and multi-layer verification for robust, deployable hallucination detection.
Because if we’re going to trust LLMs in high-stakes domains, we need to understand not just what they say, but how certain they are when they say it.
This work was submitted to ICLR 2025 Workshop on Representational Geometry in Neural Networks. Thanks to the research community for foundational work on hyperbolic embeddings, model interpretability, and geometric deep learning.
Paper: Available soon (pending anonymity period).
Code: Coming Q1 2026.