Detecting Hallucinations via Hyperbolic Geometry: An Internal Representation Approach
Language models hallucinate. They state falsehoods with confidence, fabricate citations, and produce plausible-sounding nonsense. The standard approach—comparing outputs to ground truth—only works when you have ground truth.
But what if you could detect hallucinations by looking at how the model generates text, not just what it generates?
That’s the premise of my recent work submitted to ICLR 2025: investigating whether the internal representations of LLMs betray when they’re hallucinating, even when the output looks convincing.
Spoiler: They do. With 87.5% accuracy and 0.937 AUROC.
The Central Hypothesis
Models don’t “know” whether they’re hallucinating in the human sense. But their internal computation might look different when generating known facts vs invented ones.
Intuition:
- True facts: Likely seen many times in training → confident retrieval → stable activation patterns
- Hallucinations: Model “filling in the blank” → uncertain generation → unstable or geometrically distinct patterns
Key question: Can we find this signal without supervised labels?
Why Hyperbolic Space?
Most representation learning happens in Euclidean space (standard dot products, cosine similarity, etc.). But hierarchical and tree-like structures are naturally suited to hyperbolic geometry.
Hyperbolic space properties:
- Exponentially growing volume away from the origin (see the formula below)
- Natural representation of hierarchies (parent→child relationships)
- Geodesic distances capture entailment relationships
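To make the first property concrete (a standard fact about hyperbolic geometry, not specific to this work): in \(n\)-dimensional hyperbolic space of curvature \(-1\), the volume of a geodesic ball of radius \(r\) grows exponentially, whereas in Euclidean space it grows only polynomially:
\[ V_{\mathbb{H}^n}(r) \propto \int_0^r \sinh^{n-1}(t)\, dt \sim e^{(n-1)r}, \qquad V_{\mathbb{R}^n}(r) \propto r^n \]
This is why trees fit so naturally: the number of nodes at depth \(r\) also grows exponentially, and hyperbolic space has room to embed them all with low distortion.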
Connection to reasoning:
- Logical entailment has hierarchical structure
- Claims at different levels of abstraction form trees
- Distance from “well-supported facts” might correlate with hallucination risk
Hypothesis: Reasoning tokens (intermediate steps in chain-of-thought) that lead to hallucinations might be geometric outliers in hyperbolic space—far from the manifold of truthful reasoning.
Experimental Setup
1. Data Collection
Generated three types of samples:
TRUE: Factual claims grounded in knowledge
"The Eiffel Tower is in Paris. Paris is in France. Therefore, the Eiffel Tower is in France."
HALLUCINATION: Plausible but false claims
"The Eiffel Tower is in Berlin. Berlin is in Germany. Therefore, the Eiffel Tower is in Germany."
UNRELATED: Random/nonsensical reasoning
"The Eiffel Tower is blue. Bananas are yellow. Therefore, the moon is made of cheese."
Crucial design choice: Keep syntactic structure identical across types. We want to detect semantic differences in internal computation, not surface formatting.
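To make the setup concrete, here is a minimal sketch of how such matched samples could be constructed from a single shared template. The template mirrors the examples above; this is illustrative, not the paper's actual dataset generator.

```python
# Minimal sketch: build TRUE / HALLUCINATION / UNRELATED samples from one
# shared syntactic template, so only the semantics differ across types.
TEMPLATE = ("The Eiffel Tower is in {city}. {city} is in {country}. "
            "Therefore, the Eiffel Tower is in {country}.")

def build_samples():
    return [
        # TRUE: landmark, city, and country actually match.
        {"label": "TRUE",
         "text": TEMPLATE.format(city="Paris", country="France")},
        # HALLUCINATION: identical structure, but the premise is false.
        {"label": "HALLUCINATION",
         "text": TEMPLATE.format(city="Berlin", country="Germany")},
        # UNRELATED: same connectives, nonsensical content.
        {"label": "UNRELATED",
         "text": "The Eiffel Tower is blue. Bananas are yellow. "
                 "Therefore, the moon is made of cheese."},
    ]
```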
2. Activation Extraction
For each reasoning chain, extract activations from late transformer layers (where semantic processing happens).
Focus: Tokens corresponding to logical connectives and conclusions—“therefore”, “thus”, “because”, etc.
These are the reasoning tokens—where the model commits to an inference.
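Here is a minimal extraction sketch, assuming a Hugging Face causal LM. The model name, the connective list, and the token-matching heuristic are placeholders; the paper's actual pipeline may differ.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder; swap in the model under study
CONNECTIVES = {"therefore", "thus", "because", "hence", "so"}

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()

def reasoning_token_activations(text: str, layer: int = -1) -> torch.Tensor:
    """Hidden states at reasoning tokens (logical connectives) from one late layer."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc)
    hidden = out.hidden_states[layer][0]  # (seq_len, hidden_dim); -1 = last layer
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0].tolist())
    # Strip BPE / SentencePiece prefixes before matching (heuristic).
    idx = [i for i, t in enumerate(tokens) if t.lstrip("Ġ▁").lower() in CONNECTIVES]
    return hidden[idx] if idx else hidden[-1:]  # fall back to the final token
```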
3. Hyperbolic Mapping via Unsupervised Probing
Challenge: We don’t have labels during inference (that’s the whole point).
Solution: Unsupervised embedding into hyperbolic space (Poincaré ball model).
Method:
- Extract activations from reasoning tokens
- Project to lower-dimensional embedding (to reduce noise)
- Map embeddings to Poincaré disk using learned transformation
- Measure hyperbolic distance from dataset centroid
Key insight: If hallucinations are geometric outliers, they should have high hyperbolic distance.
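One way to instantiate this mapping is PCA for the low-dimensional projection followed by the origin-based exponential map of the Poincaré ball (curvature \(-1\)). This is the simplest unsupervised choice, not necessarily the learned transformation used in the paper; `scale` is a hypothetical knob that keeps points away from the boundary.

```python
import numpy as np
from sklearn.decomposition import PCA

def expmap0(v: np.ndarray, eps: float = 1e-9) -> np.ndarray:
    """Exponential map at the origin of the Poincaré ball (curvature -1):
    tanh(||v||) * v / ||v||, so every output lies strictly inside the unit ball."""
    norm = np.linalg.norm(v, axis=-1, keepdims=True)
    return np.tanh(norm) * v / (norm + eps)

def fit_projection(activations: np.ndarray, dim: int = 16) -> PCA:
    """Learn the low-dimensional projection on a calibration set of activations."""
    return PCA(n_components=dim).fit(activations)

def to_poincare(activations: np.ndarray, projector: PCA, scale: float = 0.1) -> np.ndarray:
    """Project (n, hidden_dim) activations and map them into the Poincaré ball."""
    return expmap0(scale * projector.transform(activations))
```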
4. Detection via Outlier Scoring
Define outlier score: \[ \text{score}(x) = d_{\mathbb{H}}(f(x), \mu) \]
Where:
- \(f(x)\): hyperbolic embedding of activation
- \(\mu\): hyperbolic centroid of true reasoning samples
- \(d_{\mathbb{H}}\): hyperbolic distance (geodesic in the Poincaré ball; formula below)
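For reference (a standard formula, not specific to this work), the geodesic distance between points \(u\) and \(v\) in the Poincaré ball of curvature \(-1\) is \[ d_{\mathbb{H}}(u, v) = \operatorname{arcosh}\!\left( 1 + \frac{2\,\|u - v\|^2}{(1 - \|u\|^2)(1 - \|v\|^2)} \right) \]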
Decision rule: If score > threshold, flag as potential hallucination.
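A minimal scoring sketch, continuing the hypothetical helpers above. The centroid \(\mu\) is approximated by averaging in the tangent space at the origin (logmap, mean, expmap) rather than computing an exact Fréchet mean; the paper may use a different estimator.

```python
import numpy as np

def poincare_distance(u: np.ndarray, v: np.ndarray, eps: float = 1e-9) -> np.ndarray:
    """Geodesic distance in the Poincaré ball (curvature -1)."""
    sq = np.sum((u - v) ** 2, axis=-1)
    denom = (1 - np.sum(u * u, axis=-1)) * (1 - np.sum(v * v, axis=-1))
    return np.arccosh(1 + 2 * sq / (denom + eps))

def logmap0(x: np.ndarray, eps: float = 1e-9) -> np.ndarray:
    """Inverse of expmap0: map a ball point back to the tangent space at the origin."""
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    return np.arctanh(np.clip(norm, 0.0, 1.0 - eps)) * x / (norm + eps)

def expmap0(v: np.ndarray, eps: float = 1e-9) -> np.ndarray:
    norm = np.linalg.norm(v, axis=-1, keepdims=True)
    return np.tanh(norm) * v / (norm + eps)

def hyperbolic_centroid(points: np.ndarray) -> np.ndarray:
    """Approximate hyperbolic centroid via the tangent space at the origin."""
    return expmap0(logmap0(points).mean(axis=0))

def outlier_score(points: np.ndarray, mu: np.ndarray) -> np.ndarray:
    """score(x) = d_H(f(x), mu), as defined above."""
    return poincare_distance(points, mu)

def flag_hallucination(points: np.ndarray, mu: np.ndarray, threshold: float) -> np.ndarray:
    """Decision rule: flag when the outlier score exceeds the threshold."""
    return outlier_score(points, mu) > threshold
```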
Results
Testing on held-out samples:
| Metric | Value |
|---|---|
| Accuracy | 87.5% |
| AUROC | 0.937 |
| Precision (at 0.8 recall) | 0.91 |
| False Positive Rate | 11.2% |
What this means:
- 87.5% of held-out samples classified correctly using only internal activations
- 93.7% chance that a randomly chosen hallucination scores as more of an outlier than a randomly chosen true claim
- 91% of flagged examples actually were hallucinations when the threshold is set for 0.8 recall
Ablation Studies
Does hyperbolic geometry matter?
| Embedding Space | AUROC |
|---|---|
| Euclidean (standard) | 0.843 |
| Hyperbolic (ours) | 0.937 |
| Spherical | 0.876 |
Answer: Yes. +0.094 AUROC over Euclidean, +0.061 over spherical.
Why? Hierarchical structure of entailment is better captured in hyperbolic space.
Which layers encode hallucination signal?
| Layer Range | AUROC |
|---|---|
| Early (0-8) | 0.712 |
| Middle (9-16) | 0.854 |
| Late (17-24) | 0.937 |
| All layers (ensemble) | 0.941 |
Interpretation: Semantic distinction emerges in later layers (where reasoning happens). Ensembling helps slightly.
Why This Works: Mechanistic Hypothesis
Based on probing analysis, here’s my working theory:
Phase 1: Retrieval (Early Layers)
- Model retrieves relevant training patterns
- True facts: strong retrieval signal (seen often)
- Hallucinations: weak or conflicting retrieval
Phase 2: Construction (Middle Layers)
- Model builds representation of the claim
- True facts: stable, coherent internal representation
- Hallucinations: “patchwork” representation from disparate sources
Phase 3: Reasoning (Late Layers)
- Model commits to logical inference
- True facts: tight, hierarchical structure (close in hyperbolic space to other valid inferences)
- Hallucinations: geometrically distant (constructed via weak analogies)
Geometric manifestation: Hallucination activations lie further from the “manifold of valid reasoning” in hyperbolic space.
Failure Modes & Limitations
Not all hallucinations are caught:
1. Training Data Hallucinations
If the model saw a falsehood repeatedly in training, it might generate it with “truthful-looking” activations.
Example: common misconceptions (e.g., “Glass is a liquid”), if they were over-represented in the training data.
Mitigation: This method detects uncertainty, not falsehood. Would need external verification.
2. Plausible But Unverifiable Claims
Model generates something that could be true but isn’t in training data.
"The population of City X in 2025 is Y."
If the model interpolates plausibly, activations might look confident.
Mitigation: Combine with retrieval-augmented generation (check against knowledge base).
3. Domain Shift
Trained on general text, tested on specialized domains (medical, legal, technical).
Geometric structure of reasoning might differ across domains.
Mitigation: Domain-specific calibration (recompute centroid on domain examples).
4. False Positives on Novel but True Claims
If a true fact is rare/surprising, it might look like an outlier.
Example: Correctly inferring an unexpected consequence from premises.
Mitigation: This is a feature, not a bug—flags high-uncertainty claims for verification.
Comparison to Existing Approaches
1. Self-Consistency / Sampling
Sample multiple responses, see if they agree.
Pros: Model-agnostic, interpretable
Cons: Expensive (multiple forward passes), doesn’t work for single-generation scenarios
Our method: Single forward pass, real-time detection.
2. Perplexity / Confidence Scores
Use model’s own uncertainty estimates (logit magnitudes).
Pros: Built-in, no extra model
Cons: Models often overconfident, especially when wrong
Our method: Doesn’t rely on model’s calibration, looks at representation geometry.
3. Supervised Fact-Checking (External KB)
Compare claim against knowledge base.
Pros: Grounded in truth
Cons: Requires comprehensive, up-to-date KB; doesn’t catch reasoning errors
Our method: Detects process failures, not just factual errors.
4. Probing Classifiers (Supervised)
Train classifier on labeled true/false examples.
Pros: High accuracy with good labels
Cons: Requires labeled data, distribution-specific
Our method: Unsupervised (no hallucination labels needed).
Practical Deployment Strategy
How to use this in production:
Phase 1: Calibration
- Collect representative sample of true reasoning chains from your domain
- Compute hyperbolic embeddings
- Estimate the centroid and a threshold that balances precision and recall (see the sketch after this list)
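A sketch of this calibration step, reusing the hypothetical `hyperbolic_centroid` and `poincare_distance` helpers from the scoring sketch; the percentile-based threshold is one reasonable choice, not necessarily the paper's.

```python
import numpy as np

def calibrate(true_ball_points: np.ndarray, target_fpr: float = 0.05):
    """true_ball_points: Poincaré-ball embeddings of known-good reasoning chains.
    Returns (centroid, threshold) so that roughly `target_fpr` of the calibration
    samples would be flagged, i.e. the threshold trades precision against recall."""
    mu = hyperbolic_centroid(true_ball_points)          # helper from the scoring sketch
    scores = poincare_distance(true_ball_points, mu)
    threshold = float(np.quantile(scores, 1.0 - target_fpr))  # e.g. the 95th percentile
    return mu, threshold
```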
Phase 2: Runtime Detection
- Extract activations from reasoning tokens as model generates
- Map to hyperbolic space using pre-learned projections
- Compute outlier score
- If score > threshold:
  - Low-stakes: Flag for human review
  - High-stakes: Refuse to answer / request verification
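A sketch of this runtime check, again reusing the hypothetical helpers from the earlier sketches (`reasoning_token_activations`, `to_poincare`, `poincare_distance`); the two branches mirror the low-stakes / high-stakes policy above.

```python
def check_generation(text: str, projector, mu, threshold: float,
                     high_stakes: bool = False) -> dict:
    """Score one generated reasoning chain and route it according to policy."""
    acts = reasoning_token_activations(text).numpy()      # reasoning-token activations
    points = to_poincare(acts, projector)                 # pre-learned projection + expmap
    score = float(poincare_distance(points, mu).mean())   # pool over reasoning tokens
    if score <= threshold:
        return {"score": score, "action": "pass"}
    # Above threshold: treat as a potential hallucination.
    action = "refuse_or_request_verification" if high_stakes else "flag_for_human_review"
    return {"score": score, "action": action}
```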
Phase 3: Feedback Loop
- Log flagged examples
- Human annotators label as true hallucination or false alarm
- Periodically recalibrate threshold
- Optionally: Fine-tune projection to improve separation
Latency: Embedding computation adds ~5% overhead (negligible for most use cases).
Open Questions & Future Work
1. Multi-Hop Reasoning
Does the geometric signal accumulate across reasoning steps?
Hypothesis: Each faulty step pushes representation further from truth manifold.
Experiment: Track hyperbolic distance trajectory across chain-of-thought steps.
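A rough sketch of that experiment, built on the same hypothetical helpers; splitting on sentence boundaries is a crude stand-in for real chain-of-thought step segmentation.

```python
def distance_trajectory(chain_of_thought: str, projector, mu) -> list:
    """Hyperbolic distance from the truthful centroid after each reasoning step.
    A steadily growing trajectory would support the accumulation hypothesis."""
    steps = [s.strip() for s in chain_of_thought.split(".") if s.strip()]
    trajectory = []
    for i in range(1, len(steps) + 1):
        prefix = ". ".join(steps[:i]) + "."
        acts = reasoning_token_activations(prefix).numpy()
        points = to_poincare(acts, projector)
        trajectory.append(float(poincare_distance(points, mu).mean()))
    return trajectory
```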
2. Model-Specific vs Universal
Are hyperbolic outliers consistent across model families?
Test: Same probes on GPT-4, Claude, Llama, Gemini.
If yes: Single detection system for all models.
If no: Model-specific calibration needed.
3. Adversarial Robustness
Can models be trained to “hide” hallucination signal in activations?
Attack: Adversarial fine-tuning to produce false claims with truthful-looking geometry.
Defense: Multi-layer detection, external verification.
4. Causal Interventions
If we detect a likely hallucination, can we steer the model toward truthfulness?
Approach: Adjust activations to move toward hyperbolic centroid, regenerate.
Risk: Might produce superficially “truthful-looking” but still wrong outputs.
Broader Implications for AI Safety
This work gestures toward a significant principle:
External behavior (text output) is insufficient for safety. We need to audit internal computation.
Why This Matters:
- Deception Detection: If a model is “lying” intentionally, internal representations might leak intent
- Uncertainty Quantification: Geometric distance could proxy for epistemic uncertainty
- Interpretability: Understanding representation geometry aids in understanding model cognition
- Alignment: Models aligned via RLHF might learn to “sound confident” without being right; internal checks bypass this
Philosophical point: A model that “knows it doesn’t know” has structured uncertainty in its representations. Detecting that structure is key to reliable deployment.
Try It Yourself (Coming Soon)
I’m working on open-sourcing:
- Hyperbolic embedding code
- Pre-trained projection matrices for Llama/GPT families
- Evaluation scripts and datasets
GitHub: Watch for release at github.com/deadsmash07/hallucination-detection
Conclusion
Hallucination detection doesn’t have to rely on external verification or model overconfidence. The geometry of internal representations—particularly in hyperbolic space—provides a powerful signal.
87.5% accuracy from unsupervised methods suggests that models encode more about their own uncertainty than they reveal in outputs.
The path forward: combining geometric analysis, retrieval augmentation, and multi-layer verification for robust, deployable hallucination detection.
Because if we’re going to trust LLMs in high-stakes domains, we need to understand not just what they say, but how certain they are when they say it.
This work was submitted to ICLR 2025 Workshop on Representational Geometry in Neural Networks. Thanks to the research community for foundational work on hyperbolic embeddings, model interpretability, and geometric deep learning.
Paper: Available soon (pending anonymity period).
Code: Coming Q1 2026.