Arnav Raj
Dual Degree (B.Tech + M.Tech) Computer Science & Engineering, IIT Delhi
Fourth‑year dual degree student focused on AI safety, evaluation of large language models, and reliable ML systems. Currently exploring reinforcement learning, building on experience in operating systems, networks, and performance‑minded backend work. I enjoy designing retrieval‑augmented generation pipelines, building observability and evaluation tooling that turns model behavior into measurable signals, and tuning latency/quality trade‑offs in large‑scale inference.
RAG • Model Interpretability • Long‑Context LLM Eval • Model Observability • ML for Systems • RL (Exploring)
Currently seeking internship & collaborative project roles for 2025–26.

Education

- Indian Institute of Technology Delhi
  B.Tech and M.Tech in Computer Science & Engineering • 2022–2027
- Mess Secretary, BHM
  Leadership & Operations • Jun 2024–2025
- Senior Editor, TechAmbit (Pan‑IIT Magazine)
  Editorial & Tech Strategy • 2023–Present

Honors & Awards

- JEE Advanced AIR 1158 (top 0.5% in India)
- KVPY SX Fellowship, awarded by the Government of India
- Top 250 in India, NSEA (National Standard Examination in Astronomy)
- Top 300 in India, NSEC (National Standard Examination in Chemistry)
- Codeforces Expert (rating 1700+)
- IMC Prosperity Trading Challenge – World Rank 8 (Round 1)
- 2× Smart India Hackathon National Finalist

Experience

- Harvard University
  Summer Research Intern – Edge Computing Lab
  - Built a LangChain‑based benchmarking and validation framework for LLM‑generated RTL code (syntax → testbench → PPA loop with automatic re‑prompting), evaluated across GPT‑4 and Llama; the loop is sketched after this list.
  - Compared prompting strategies (chain‑of‑thought, zero‑shot, few‑shot) under graded design complexity, tracking accuracy, efficiency, and robustness.
  - Automated validation with syntax checks and module/system‑level testbenches; failing cases were re‑prompted, and passing designs forwarded to PPA analysis.
  - Co‑authoring a paper targeting DATE 2025 (preprint in preparation).
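
A minimal sketch of the generate → validate → re‑prompt loop described above, assuming a LangChain‑style chat model (llm.invoke); the helpers run_syntax_check, run_testbench, and run_ppa are hypothetical placeholders for the actual compiler, simulator, and synthesis tooling:

```python
# Sketch of the syntax -> testbench -> PPA loop with automatic re-prompting.
# All helper names are illustrative placeholders, not the real framework.
from dataclasses import dataclass

@dataclass
class CheckResult:
    passed: bool
    log: str  # tool output; fed back into the next prompt on failure

def run_syntax_check(rtl: str) -> CheckResult:
    # Placeholder: would invoke a Verilog compiler (e.g. iverilog) here.
    ok = "module" in rtl
    return CheckResult(ok, "syntax ok" if ok else "no module declaration found")

def run_testbench(rtl: str) -> CheckResult:
    # Placeholder: would run module- and system-level testbench simulation.
    return CheckResult(True, "all tests passed")

def run_ppa(rtl: str) -> dict:
    # Placeholder: would synthesize the design and report power/perf/area.
    return {"area": 0.0, "power": 0.0, "delay": 0.0}

def benchmark_design(llm, spec: str, max_retries: int = 3):
    prompt = f"Write synthesizable Verilog for the following spec:\n{spec}"
    for _ in range(max_retries):
        rtl = llm.invoke(prompt).content  # LangChain-style chat-model call
        for check in (run_syntax_check, run_testbench):
            result = check(rtl)
            if not result.passed:
                # Auto re-prompt: feed the failure log back to the model.
                prompt += ("\n\nPrevious attempt failed:\n" + result.log +
                           "\nPlease fix and regenerate.")
                break
        else:
            return run_ppa(rtl)  # all checks passed -> PPA analysis
    return None  # design never passed validation within the retry budget
```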
- Georgia Institute of Technology
  Summer Research Intern – FSI Lab
  - Co‑authored KG‑QAGen (submitted to the NeurIPS 2025 Datasets & Benchmarks track), a framework for generating multi‑hop questions over knowledge graphs.
  - Co‑developed KG‑QAGen‑D, a dataset of 20,139 long‑context multi‑hop QA pairs for evaluating structured long‑context reasoning.
  - Built an end‑to‑end benchmarking pipeline (automatic chunking, batched question generation, multi‑chunk answer synthesis) that scales LLM evaluation across 170 agreements; a sketch follows this list.
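
A rough sketch of that pipeline's three stages, again assuming a LangChain‑style chat model; every function and parameter name here is illustrative rather than the actual KG‑QAGen code:

```python
# Chunk -> batched question generation -> multi-chunk answer synthesis.
# Names and prompts are illustrative placeholders, not the real pipeline.

def chunk_document(text: str, max_chars: int = 4000, overlap: int = 200) -> list[str]:
    """Split a long agreement into overlapping chunks that fit a context window."""
    chunks, start, step = [], 0, max_chars - overlap
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        start += step
    return chunks

def generate_questions(llm, chunks: list[str], batch_size: int = 8) -> list[dict]:
    """Batch several chunks per request to amortize per-call overhead."""
    questions = []
    for i in range(0, len(chunks), batch_size):
        batch = chunks[i:i + batch_size]
        prompt = ("Write one multi-hop question per passage below:\n\n" +
                  "\n---\n".join(batch))
        reply = llm.invoke(prompt).content  # LangChain-style chat-model call
        questions += [{"question": q, "source_chunks": batch}
                      for q in reply.splitlines() if q.strip()]
    return questions

def synthesize_answer(llm, question: dict) -> str:
    """Answer a question using evidence drawn from multiple chunks at once."""
    context = "\n\n".join(question["source_chunks"])
    return llm.invoke(f"Context:\n{context}\n\nQuestion: {question['question']}\n"
                      "Answer using only the context above.").content
```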