LLM Code Agent Evaluation Framework
End-to-end evaluation framework for benchmarking LLM code generation capabilities on realistic software engineering tasks. Built a containerized testing infrastructure that measures functional correctness, code quality, and self-repair abilities across GPT-4, Claude, and Llama models.
- Task Design: Curated 150+ software engineering challenges spanning algorithm implementation, bug fixing, API integration, and refactoring—each with ground-truth test suites and complexity ratings.
- Containerized Execution: Docker-based sandboxed runtime for safe code execution with resource limits (CPU, memory, timeout), preventing runaway processes and ensuring reproducible evaluation conditions (sandbox sketch after this list).
- Automated Validation Pipeline: Three stages: (1) AST parsing for syntax correctness, (2) pytest execution against hidden test cases, (3) static analysis via pylint/flake8 for code-quality scoring (sketched after this list).
- Self-Correction Loop: Iterative re-prompting mechanism in which failing test output and error traces are fed back to the LLM for up to 3 correction attempts, measuring the improvement rate per iteration (loop sketched after this list).
- Multi-Model Benchmarking: Evaluated GPT-4-turbo, Claude-3-Opus, and Llama-70B with identical prompts and temperature settings. GPT-4-turbo achieved a 73% first-pass completion rate; iterative correction raised it to 84%.
- Metrics Dashboard: Tracked pass@1, pass@3, average correction rounds, token efficiency (tokens per correct solution), and execution-time distribution across task categories (pass@k estimator sketched after this list).
- LangChain Integration: Modular agent architecture using LangChain for prompt management, tool calling, and structured output parsing with Pydantic validation (sketched after this list).
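A minimal sketch of the sandboxed runner from the Containerized Execution item, assuming the Docker CLI is on the PATH; the `run_sandboxed` helper, image tag, and limit values are illustrative, not the framework's actual interface.

```python
import subprocess
import tempfile
from pathlib import Path

def run_sandboxed(code: str, timeout_s: int = 30) -> subprocess.CompletedProcess:
    """Execute generated code inside a throwaway Docker container with resource caps."""
    workdir = Path(tempfile.mkdtemp())
    (workdir / "solution.py").write_text(code)
    cmd = [
        "docker", "run", "--rm",
        "--network", "none",            # generated code gets no network access
        "--cpus", "1.0",                # CPU limit
        "--memory", "512m",             # memory limit
        "-v", f"{workdir}:/workspace:ro",
        "python:3.11-slim",
        "python", "/workspace/solution.py",
    ]
    # The host-side timeout is the last line of defense against runaway processes;
    # subprocess.TimeoutExpired is treated as a failed run by the caller.
    return subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
```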
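A compressed sketch of the three validation stages; the `validate` helper, file layout, and result schema are illustrative.

```python
import ast
import subprocess

def validate(solution_path: str, test_path: str) -> dict:
    """Stage 1: AST syntax check; stage 2: hidden pytest suite; stage 3: lint score."""
    result = {"syntax_ok": False, "tests_passed": False, "lint_score": None}

    # Stage 1: ast.parse catches syntax errors without executing anything.
    try:
        with open(solution_path) as f:
            ast.parse(f.read())
        result["syntax_ok"] = True
    except SyntaxError:
        return result

    # Stage 2: run the hidden test suite with pytest.
    tests = subprocess.run(["pytest", "-q", test_path], capture_output=True, text=True)
    result["tests_passed"] = tests.returncode == 0

    # Stage 3: pylint prints "Your code has been rated at X/10"; parse the score.
    lint = subprocess.run(["pylint", solution_path], capture_output=True, text=True)
    for line in lint.stdout.splitlines():
        if "rated at" in line:
            result["lint_score"] = float(line.split("rated at")[1].split("/")[0])
    return result
```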
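A sketch of the self-correction loop, assuming a `generate(prompt)` callable that wraps whichever model client is under test and a `run_tests(code)` helper returning a pass flag plus an error trace; both names and the prompt wording are illustrative.

```python
MAX_ATTEMPTS = 3  # matches the framework's cap of 3 correction attempts

def solve_with_repair(task_prompt, generate, run_tests):
    """Re-prompt the model with its own test failures until it passes or attempts run out."""
    history, prompt = [], task_prompt
    for attempt in range(1, MAX_ATTEMPTS + 1):
        code = generate(prompt)
        passed, error_trace = run_tests(code)
        history.append({"attempt": attempt, "passed": passed})
        if passed:
            break
        # Feed the failing output and traceback back so the model can repair its code.
        prompt = (
            f"{task_prompt}\n\nYour previous solution:\n{code}\n\n"
            f"It failed with:\n{error_trace}\n\nReturn a corrected solution."
        )
    return {"code": code, "attempts": history}
```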
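One common way to compute pass@1 and pass@3 is the unbiased estimator over n sampled completions per task, of which c pass; this sketch assumes that approach and is not necessarily the framework's exact implementation.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k completions drawn
    from n generated samples (c of which pass the tests) is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 samples per task, 4 passing -> pass@1 = 0.40, pass@3 ≈ 0.83
print(pass_at_k(10, 4, 1), pass_at_k(10, 4, 3))
```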
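A sketch of the LangChain structured-output path with Pydantic validation, assuming recent `langchain-core` and `langchain-openai` packages; the `Solution` schema and prompt text are illustrative, and import paths vary across LangChain versions.

```python
from pydantic import BaseModel, Field
from langchain_core.output_parsers import PydanticOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

class Solution(BaseModel):
    """Schema the agent must return; validated by Pydantic before execution."""
    code: str = Field(description="Complete Python solution")
    explanation: str = Field(description="One-paragraph rationale")

parser = PydanticOutputParser(pydantic_object=Solution)

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a coding agent. {format_instructions}"),
    ("human", "{task}"),
]).partial(format_instructions=parser.get_format_instructions())

# LCEL pipeline: prompt -> model -> structured parse into a Solution object.
chain = prompt | ChatOpenAI(model="gpt-4-turbo", temperature=0) | parser

result = chain.invoke({"task": "Reverse a singly linked list."})
print(result.code)
```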
Tech: Python, Docker, LangChain, OpenAI API, Anthropic API, pytest, AST, subprocess, logging.
Designed with extensibility in mind: new models and task categories can be added via YAML configuration without code changes, as illustrated below.
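The config entry below is hypothetical (the field names are not the framework's actual schema), but it shows the shape of YAML-driven model registration:

```python
import yaml  # PyYAML

# Hypothetical entry; the real schema lives in the framework's YAML files.
EXAMPLE_CONFIG = """
models:
  - name: claude-3-opus
    provider: anthropic
    temperature: 0.2
    max_correction_rounds: 3
task_categories:
  - bug_fixing
  - refactoring
"""

def load_models(config_text: str) -> list:
    """Parse the YAML registry so new models are picked up without code changes."""
    return yaml.safe_load(config_text).get("models", [])

for spec in load_models(EXAMPLE_CONFIG):
    print(f"registering {spec['name']} via the {spec['provider']} provider")
```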