Data Search & Retrieval Engine
This project implements a hybrid semantic + lexical retrieval engine tuned for technical PDFs, API docs, and research abstracts.
- Indexing: BM25 + character & subword analyzers for robustness to token noise.
- Dense layer: Sentence-transformer embeddings (multi-qa family or all-mpnet-base-v2) searched with approximate kNN over an HNSW index.
- Hybrid scoring: Weighted fusion (Reciprocal Rank Fusion + normalized cosine similarity).
- Query expansion: Pseudo-relevance feedback + optional LLM paraphrase generation.
- Evaluation: nDCG@k, MRR, Recall@k using curated relevance sets.
- Serving: FastAPI microservice exposing /query, /analyze, /embed endpoints.
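The hybrid scoring step listed above can be illustrated with a minimal sketch. It is an assumption about how the two signals might be combined, not the project's actual ranker: the function names (`rrf_scores`, `min_max`, `hybrid_rank`), the equal default weights, and the RRF constant `k=60` are all hypothetical choices made for illustration.

```python
from typing import Dict, List, Tuple


def rrf_scores(rankings: List[List[str]], k: int = 60) -> Dict[str, float]:
    """Reciprocal Rank Fusion: sum 1 / (k + rank) over every ranking a doc appears in."""
    scores: Dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return scores


def min_max(values: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1]; a constant input maps to 0.0 everywhere."""
    if not values:
        return {}
    lo, hi = min(values.values()), max(values.values())
    if hi == lo:
        return {doc_id: 0.0 for doc_id in values}
    return {doc_id: (v - lo) / (hi - lo) for doc_id, v in values.items()}


def hybrid_rank(
    bm25_ranking: List[str],
    dense_ranking: List[str],
    cosine: Dict[str, float],
    w_rrf: float = 0.5,   # hypothetical weight on the rank-fusion signal
    w_cos: float = 0.5,   # hypothetical weight on the cosine signal
    k: int = 60,          # assumed RRF smoothing constant
) -> List[Tuple[str, float]]:
    """Weighted sum of normalized RRF scores and normalized cosine similarities."""
    rrf = min_max(rrf_scores([bm25_ranking, dense_ranking], k))
    cos = min_max(cosine)
    docs = set(rrf) | set(cos)
    fused = {d: w_rrf * rrf.get(d, 0.0) + w_cos * cos.get(d, 0.0) for d in docs}
    # Sort by descending score, breaking ties by doc id for determinism.
    return sorted(fused.items(), key=lambda item: (-item[1], item[0]))


# Usage with made-up document ids and similarity values:
results = hybrid_rank(
    bm25_ranking=["a", "b", "c"],
    dense_ranking=["b", "a", "d"],
    cosine={"b": 0.9, "a": 0.7, "d": 0.4},
)
```

Both signals are min-max normalized before weighting so that neither dominates purely because of its scale. A document missing from one signal simply contributes 0.0 for it.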
Pipeline: Ingestion → Text normalization & section segmentation → Dual indexing (sparse + dense) → Hybrid ranker → Reranker (cross-encoder) → Response synthesis.
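To check the ranked output of this pipeline against a curated relevance set, the metrics named under Evaluation (nDCG@k, MRR, Recall@k) can be computed roughly as follows. This is a sketch under stated assumptions: the function names and data shapes are hypothetical, and nDCG uses the linear-gain formulation (graded relevance divided by log2 of rank + 1), which may differ from the variant the project uses.

```python
import math
from typing import Dict, List, Set


def recall_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in ranked[:k] if doc_id in relevant)
    return hits / len(relevant)


def reciprocal_rank(ranked: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mrr(runs: Dict[str, List[str]], qrels: Dict[str, Set[str]]) -> float:
    """Mean reciprocal rank over all queries in `runs`."""
    if not runs:
        return 0.0
    total = sum(reciprocal_rank(r, qrels.get(q, set())) for q, r in runs.items())
    return total / len(runs)


def ndcg_at_k(ranked: List[str], gains: Dict[str, float], k: int) -> float:
    """nDCG@k with linear gains; `gains` maps doc id to graded relevance."""
    dcg = sum(
        gains.get(doc_id, 0.0) / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked[:k], start=1)
    )
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 1) for rank, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0


# Usage with made-up judgments for a single query:
ranked = ["d3", "d1", "d2"]
gains = {"d1": 2.0, "d2": 1.0}
recall = recall_at_k(ranked, set(gains), k=2)
ndcg = ndcg_at_k(ranked, gains, k=3)
```

Here `runs` maps a query id to its ranked document ids and `qrels` maps a query id to its set of relevant ids; both layouts are assumptions chosen to keep the example self-contained.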
Tech: Python, Elasticsearch / OpenSearch, Faiss / hnswlib, FastAPI, SentenceTransformers, Docker.
Use Cases: Internal knowledge base search, research literature triage, code and API assistance.