Arnav Raj

Dual Degree (B.Tech + M.Tech) Computer Science & Engineering, IIT Delhi

Fourth-year dual degree student focused on AI safety and LLM reliability. Currently at Google DeepMind (Colab-Bench evaluation) and Abundant AI (YC-24). Solo-authored paper on probing LLM reasoning with hyperbolic geometry accepted at the ICLR 2026 GRaM workshop; co-author on KG-MuLQA, under review at ACL 2026. Interested in building evaluation and interpretability tooling that makes models more robust and trustworthy.

AI Safety · LLM Evaluation · Interpretability · RLHF · Reinforcement Learning
CURRENT

Google DeepMind (IC)

STEM Expert, AI Evaluation
Mar 2026 to Present · Bengaluru, India (Remote)

Curating expert-level ML challenge tasks for Colab-Bench, an internal benchmark for collaborative multi-step reasoning in LLMs. Building evaluation rubrics and working with DeepMind engineers to surface failure modes in Gemini's STEM reasoning.

CURRENT

Abundant AI (YC-24)

ML Engineer & AI Training Data Research Intern
Nov 2025 to Present · San Francisco, CA (Remote)

Building GPU-enabled evaluation infrastructure for AI agents and RL-based data curation. Previously designed adversarial ML tasks that exposed systematic failures in frontier LLMs. Datasets now used by 3 of the top 5 global AI labs.

Georgia Institute of Technology Financial Services Innovation Lab

Research Intern
May 2024 to Jun 2025 · Atlanta, GA (Remote)

Co-developed KG-MuLQA (under review at ACL 2026), a framework for generating multi-hop knowledge-graph-grounded questions. Built the end-to-end pipeline producing 20,139 QA pairs across 170 financial documents.

Harvard University Edge Computing Lab

Research Intern
May 2024 to Dec 2024 · Cambridge, MA (Remote)

Built a LangChain benchmarking framework for evaluating LLM-generated RTL hardware designs across GPT-4 and Llama. Implemented end-to-end validation (syntax, testbench, PPA analysis) and compared prompting strategies for hardware code generation.

Hyperbolic Geometry of Reasoning: Probing LLM Hidden States

Arnav Raj
Accepted at ICLR 2026 Workshop on Geometry-grounded Representation Learning and Generative Modeling (GRaM)

Solo-authored. Hyperbolic probes maintain robust performance across all layers while Euclidean probes degrade at late layers in reasoning models. Thinking tokens concentrate hierarchical information at compressed final layers.

OpenReview →
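
For readers unfamiliar with hyperbolic probes, the sketch below shows the standard Poincaré-ball distance that such probes commonly build on, next to the Euclidean distance for comparison. It is an illustrative assumption, not code from the paper: the function names are hypothetical, and the paper's exact probe, curvature, and projection step may differ.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points strictly inside the unit Poincare ball."""
    sq_u = sum(x * x for x in u)
    sq_v = sum(x * x for x in v)
    if sq_u >= 1.0 or sq_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    # d(u, v) = arcosh(1 + 2 * |u - v|^2 / ((1 - |u|^2) * (1 - |v|^2)))
    arg = 1.0 + 2.0 * sq_diff / ((1.0 - sq_u) * (1.0 - sq_v))
    return math.acosh(arg)


def euclidean_distance(u, v):
    """Ordinary straight-line distance, for comparison."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
```

Distances grow rapidly near the boundary of the ball, which is why hyperbolic space is often used to embed tree-like, hierarchical structure in few dimensions.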

KG-MuLQA: Multi-hop Question Answering over Knowledge Graphs for Long-Context Evaluation

Nikita Tatarinov, B Vidhyakshaya Kannan, Haricharana Srinivasa, Arnav Raj, et al.
Under review at ACL 2026 (ARR)

Framework for systematic multi-hop question generation. 20,139 QA pairs across 170 financial documents for structured long-context reasoning evaluation.

arXiv →

LLM Code Agent Evaluation Framework

Comprehensive evaluation suite for LLM code generation inspired by SWE-bench and MLE-bench, achieving 73% task completion with iterative self-correction.

Python · Docker · LangChain · OpenAI API · pytest · AST

RL Agent for Code Optimization

PPO-based reinforcement learning agent that optimizes code performance through iterative refinement, achieving 35% runtime reduction and 20% memory improvement.

Python · PyTorch · OpenAI Gym · Ray RLlib · AST · subprocess

Hangman AI: Transformer-Based Game Solver

Transformer-driven Hangman solver achieving >60% success using character-level modeling, morphological augmentation, and multi-strategy guess selection.

Python · PyTorch · NLTK · Transformers · NumPy
GitHub →

Graph Neural Network for User Personality Prediction

GNN applied to a bipartite user–product interaction graph to infer user personality traits with high accuracy.

Python · PyTorch · PyTorch-Geometric
GitHub →

Data Search & Retrieval Engine

Neural-augmented retrieval system combining classical inverted indexing with dense vector search for hybrid relevance scoring across semi-structured technical documents.

Python · Elasticsearch · Faiss · hnswlib · FastAPI · SentenceTransformers · Docker
GitHub →
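
One common way to combine inverted-index and dense-vector scores is to normalize each score set and take a weighted sum. The sketch below illustrates that idea only; the function names, the min-max normalization, and the `alpha` weight are assumptions for illustration and are not taken from the project itself.

```python
def min_max(scores):
    """Rescale a {doc_id: score} mapping to the [0, 1] range."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def hybrid_rank(lexical, dense, alpha=0.5):
    """Rank documents by alpha * lexical + (1 - alpha) * dense (both normalized).

    A document missing from one score set contributes 0 for that component.
    """
    lex_n, den_n = min_max(lexical), min_max(dense)
    docs = set(lex_n) | set(den_n)
    fused = {
        doc: alpha * lex_n.get(doc, 0.0) + (1.0 - alpha) * den_n.get(doc, 0.0)
        for doc in docs
    }
    return sorted(fused, key=fused.get, reverse=True)
```

Normalizing first matters because lexical scores (for example BM25) and cosine similarities live on different scales; without it one signal would dominate the sum.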

Context-Aware Spelling Correction

Noisy-channel + smoothed N-gram language modeling system achieving 88% accuracy on context-dependent spelling errors.

Python · NLTK
GitHub →
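
The noisy-channel idea is to pick the candidate word w that maximizes P(w | context) × P(typed | w). The sketch below is a toy illustration under stated assumptions, not the project's code: it uses an add-k smoothed bigram model, a deliberately crude two-valued channel model, and a candidate list supplied by the caller. All function names and the `p_correct` parameter are hypothetical.

```python
import math
from collections import Counter


def train_bigram(sentences):
    """Count unigrams and bigrams, with a start-of-sentence marker."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent.split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def bigram_logprob(prev, word, unigrams, bigrams, k=1.0):
    """Add-k smoothed log P(word | prev)."""
    vocab = len(unigrams)
    return math.log((bigrams[(prev, word)] + k) / (unigrams[prev] + k * vocab))


def channel_logprob(typed, candidate, p_correct=0.95):
    """Toy channel: the typed word is what was intended with probability p_correct."""
    return math.log(p_correct if typed == candidate else 1.0 - p_correct)


def correct(prev, typed, candidates, unigrams, bigrams):
    """Return the candidate maximizing language-model score plus channel score."""
    return max(
        candidates,
        key=lambda w: bigram_logprob(prev, w, unigrams, bigrams)
        + channel_logprob(typed, w),
    )
```

A realistic channel model would score candidates by edit operations rather than a single constant; the constant is used here only to keep the example short.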

TotalRecall: Cognitive Health App

Cross-platform Flutter app with AI-assisted memory support workflows for Alzheimer’s patients, backed by Firebase services.

Flutter · Dart · Firebase
GitHub →

Advanced Analytic Tool

A modular analytical platform that ingests heterogeneous datasets (CSV/JSON/SQL streams) and provides extensible pipelines for preprocessing, feature engineering, and rapid experimentation with classical ML and lightweight deep models.

Python · Pandas · Scikit-learn · LightGBM · FastAPI · Docker
GitHub →

AI Player for Havannah Board Game

Monte Carlo Tree Search (MCTS) agent with UCB exploration achieving >80% win rate against strong RAVE-only baselines.

Python · MCTS
GitHub →
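
The UCB exploration step in MCTS can be sketched with the textbook UCB1 rule, which balances a child's observed win rate against how rarely it has been visited. This is a generic illustration, not the agent's actual code: the dictionary layout for children, the function names, and the default exploration constant are assumptions.

```python
import math


def ucb1_score(child_wins, child_visits, parent_visits, c=math.sqrt(2)):
    """UCB1 value of a child node; unvisited children are always tried first."""
    if child_visits == 0:
        return math.inf
    exploit = child_wins / child_visits
    explore = c * math.sqrt(math.log(parent_visits) / child_visits)
    return exploit + explore


def select_child(children, parent_visits, c=math.sqrt(2)):
    """Pick the child (a dict with 'wins' and 'visits') with the highest UCB1 score."""
    return max(
        children,
        key=lambda ch: ucb1_score(ch["wins"], ch["visits"], parent_visits, c),
    )
```

A larger `c` favors exploring rarely visited moves; a smaller `c` favors moves that have already performed well.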

SDN-based Intelligent Network Controller

High‑throughput OpenFlow controller with proactive L2 learning, shortest‑path routing, and loop prevention for complex topologies.

Python · Ryu · OpenFlow · Mininet
GitHub →

Application-Layer Reliable Transport & Congestion Control

Implemented a reliable transport protocol over UDP plus Reno & CUBIC congestion control variants with detailed performance benchmarking.

Python · UDP · Mininet
GitHub →
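
As a rough illustration of Reno-style congestion control, the sketch below shows the additive-increase, multiplicative-decrease window update in units of segments. It is a simplified model under stated assumptions, not the project's implementation: the function names are hypothetical, and timeouts, duplicate-ACK counting, and the CUBIC variant are left out.

```python
def reno_on_ack(cwnd, ssthresh, mss=1.0):
    """Return the new congestion window after one new ACK."""
    if cwnd < ssthresh:
        # Slow start: grow by one segment per ACK (doubles roughly every RTT).
        return cwnd + mss
    # Congestion avoidance: grow by about one segment per RTT.
    return cwnd + mss * mss / cwnd


def reno_on_loss(cwnd, mss=1.0):
    """Return (new_cwnd, new_ssthresh) after a loss detected by duplicate ACKs."""
    ssthresh = max(cwnd / 2.0, 2.0 * mss)
    # Simplified fast recovery: resume from the halved window.
    return ssthresh, ssthresh
```

For example, a window of 20 segments drops to 10 on loss and then grows by roughly one segment per round trip until the next loss.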

OS Kernel Enhancements in xv6

Implemented a page swapping subsystem in the xv6 teaching OS, including victim selection, swap slot management, and page fault handling.

C · x86 · QEMU
GitHub →

I'm a fourth-year dual degree student at IIT Delhi (B.Tech + M.Tech in Computer Science & Engineering) working at the intersection of AI safety, LLM evaluation, and interpretability. My research focuses on understanding how language models reason and where they fail, particularly through geometric and probing-based approaches.

At Google DeepMind, I build evaluation tasks for Colab-Bench, stress-testing collaborative reasoning in frontier models. At Abundant AI, I've designed adversarial ML benchmarks that are now part of the training pipeline for three of the world's top five AI labs. My solo-authored paper on probing LLM hidden states with hyperbolic geometry was accepted at the ICLR 2026 Workshop (GRaM), and I co-authored KG-MuLQA, a long-context evaluation framework under review at ACL 2026.

Outside of research, I co-founded the AI Safety Club at IIT Delhi and completed alignment training through BlueDot Impact and ARENA. I enjoy working on problems where evaluation methodology, model behavior, and safety considerations intersect.

Download Resume (PDF) →

IIT Delhi

Indian Institute of Technology Delhi

B.Tech and M.Tech in Computer Science & Engineering
2022-2027

NK Security Scholar

Merit-based scholarship awarded to the top 30 students at IIT Delhi for academic and technical excellence

Smart India Hackathon

National Top 5 Finalist in both 2023 and 2024 editions (India's largest student innovation competition)

JEE Advanced 2022

All India Rank 1,158 out of 1,000,000+ candidates (top 0.1%)

KVPY SX Fellowship 2021

National science fellowship awarded by the Government of India and IISc Bangalore

Codeforces Expert

1700+ rating in competitive programming

National Science Olympiads

Top 250 Astronomy, Top 300 Chemistry

Founding Member & Technical Lead / AI Safety Club, IIT Delhi

2025 – Present

Co-founded student research group on AI alignment, interpretability, and evaluation. Led reading groups on mechanistic interpretability (TransformerLens) and safety frameworks. Completed BlueDot Impact AI safety training and ARENA curriculum.

Technical Consultant / STEM AI Hackathon 2026 (AI-Collab Hack)

Jan 2026 – Present

Providing technical mentorship to 20+ teams building AI agents for STEM education at a hackathon jointly organized by IIT Delhi, Imperial College London, and Microsoft Garage.

Senior Editor / Tech Ambit (Pan-IIT Magazine)

2023 – 2025

Led 15-member editorial team across 23 IITs; curated and edited 30+ technical articles on AI and systems research.

Mess Secretary / Zanskar Hostel, IIT Delhi

Jun 2024 – May 2025

Elected by 400+ residents; managed operations team of 13. Awarded Best Mess Secretary for digitalization initiatives.