Data Search & Retrieval Engine

This project implements a hybrid semantic + lexical retrieval engine tuned for technical PDFs, API docs, and research abstracts.

  • Indexing: BM25 + character & subword analyzers for robustness to token noise.
  • Dense layer: Sentence-transformer embeddings (multi-qa family / all-mpnet-base-v2) with approximate kNN search over an HNSW index.
  • Hybrid scoring: Weighted fusion of sparse and dense results (Reciprocal Rank Fusion combined with normalized cosine similarity).
  • Query expansion: Pseudo‑relevance feedback + optional LLM paraphrase generation.
  • Evaluation: nDCG@k, MRR, Recall@k using curated relevance sets.
  • Serving: FastAPI microservice exposing /query, /analyze, /embed endpoints.
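
The evaluation metrics listed above can be sketched in plain Python. This is a minimal illustration, not the project's actual evaluation code: the function names, the linear-gain form of nDCG, and the convention of returning 0.0 for a query with no relevant documents are all assumptions made for the example.

```python
import math
from typing import Dict, List, Sequence, Set, Tuple


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0  # assumed convention for queries with no judged-relevant docs
    hits = sum(1 for doc_id in ranked[:k] if doc_id in relevant)
    return hits / len(relevant)


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: List[Tuple[Sequence[str], Set[str]]]) -> float:
    """MRR: the reciprocal rank averaged over (ranking, relevant set) pairs."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(ranked, rel) for ranked, rel in runs) / len(runs)


def ndcg_at_k(ranked: Sequence[str], gains: Dict[str, float], k: int) -> float:
    """nDCG@k with linear gains and a log2(rank + 1) discount."""
    dcg = sum(
        gains.get(doc_id, 0.0) / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked[:k], start=1)
    )
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 1) for rank, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0
```

Each function takes a ranked list of document IDs for one query plus the curated judgments for that query, so the same relevance set can feed all three metrics.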

Pipeline: Ingestion → Text normalization & section segmentation → Dual indexing (sparse + dense) → Hybrid ranker → Reranker (cross-encoder) → Response synthesis.
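
One way the hybrid ranker stage could combine the two signals is sketched below. It is only one reading of "weighted fusion": the helper names, the min-max normalization, the weight `alpha`, and the RRF constant `k = 60` (a commonly used default) are illustrative assumptions, not details confirmed by the project.

```python
from typing import Dict, List, Sequence, Tuple


def rrf_scores(rankings: Sequence[Sequence[str]], k: int = 60) -> Dict[str, float]:
    """Reciprocal Rank Fusion: sum of 1 / (k + rank) over every input ranking."""
    scores: Dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return scores


def min_max(values: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1]; if all scores are equal, map each to 1.0."""
    if not values:
        return {}
    lo, hi = min(values.values()), max(values.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in values}
    return {doc_id: (v - lo) / (hi - lo) for doc_id, v in values.items()}


def hybrid_rank(
    bm25_ranking: Sequence[str],
    dense_ranking: Sequence[str],
    cosine: Dict[str, float],
    alpha: float = 0.5,  # hypothetical weight on the RRF term
    k: int = 60,
) -> List[Tuple[str, float]]:
    """Blend normalized RRF scores with normalized cosine similarities."""
    rrf = min_max(rrf_scores([bm25_ranking, dense_ranking], k))
    cos = min_max(cosine)
    fused = {
        doc_id: alpha * rrf.get(doc_id, 0.0) + (1 - alpha) * cos.get(doc_id, 0.0)
        for doc_id in set(rrf) | set(cos)
    }
    # Highest score first; ties broken by document ID for determinism.
    return sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))
```

The fused list would then be truncated to a candidate set and passed to the cross-encoder reranker. Because RRF uses only ranks, it sidesteps the scale mismatch between BM25 scores and cosine similarities, while the cosine term preserves some of the dense model's score margin.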

Tech: Python, Elasticsearch / OpenSearch, Faiss / hnswlib, FastAPI, SentenceTransformers, Docker.

Use Cases: Internal knowledge base search, research literature triage, code and API assistance.
