Neural Food Search

Production-Grade GenAI Search Pipeline

Open Source · GenAI · MLOps · Search & IR
Try the Live Demo →

Search 13,591 restaurants across 4 US cities with natural language

About

Neural Food Search is an end-to-end GenAI search pipeline that transforms natural language queries like “cozy ramen locals love, not too loud” into ranked restaurant results. It demonstrates how to build, evaluate, and operate a production ML system using modern retrieval and ranking techniques.

The pipeline combines BM25 keyword search, dense vector retrieval (BGE-M3), sparse retrieval, Reciprocal Rank Fusion, cross-encoder reranking, and LLM listwise reranking into a 4-stage architecture deployed on Google Cloud Run.

Beyond the search pipeline itself, the project showcases a complete MLOps lifecycle: rigorous offline evaluation with ablation studies, custom model training experiments (LambdaMART reranker, DistilBERT/T5 analyzer distillation), and production monitoring with drift detection frameworks.

Key Numbers

13,591 restaurants indexed
53,125 dishes indexed
+80% NDCG improvement
$0.005 cost per query

Pipeline Architecture

1. Query Analyzer (Claude Haiku)

Parses natural language into structured intent, HyDE document, filters, and negative constraints.
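The analyzer's structured output can be sketched as a small record type. The field names and example values below are illustrative, not the project's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class QueryAnalysis:
    intent: str                 # e.g. "find_restaurant"
    hyde_document: str          # hypothetical ideal result, embedded for dense retrieval
    filters: dict               # structured constraints (cuisine, ambience, city, ...)
    negatives: list = field(default_factory=list)  # things the user does NOT want

# What the analyzer might emit for "cozy ramen locals love, not too loud":
analysis = QueryAnalysis(
    intent="find_restaurant",
    hyde_document="A small, quiet ramen shop beloved by neighborhood regulars.",
    filters={"cuisine": "ramen", "ambience": "cozy"},
    negatives=["loud"],
)
```

The HyDE document matters because a fabricated "ideal answer" usually embeds closer to relevant documents than the raw query does.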

2. Hybrid Retrieval (BGE-M3 + Elastic)

Three parallel paths (BM25 + dense + sparse) fused with Reciprocal Rank Fusion (k=60).
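Reciprocal Rank Fusion needs no trained weights: each document's fused score is the sum of 1/(k + rank) over the lists it appears in. A minimal sketch with k=60 and toy doc IDs:

```python
def rrf_fuse(ranked_lists, k=60):
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank_list(d))."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

# Three retrieval paths, each returning doc IDs best-first (toy data).
bm25   = ["r3", "r1", "r7"]
dense  = ["r1", "r3", "r9"]
sparse = ["r1", "r7", "r3"]
fused = rrf_fuse([bm25, dense, sparse])
```

Documents ranked well by several paths rise to the top; a high rank in only one path is discounted, which is why fusion is robust without any tuning.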

3. Cross-Encoder Reranking (bge-reranker-v2-m3)

Pairwise relevance scoring on the top-50 candidates using a cross-attention transformer.
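The reranking loop itself is simple: score every (query, document) pair jointly, then sort. In the real pipeline the scorer is bge-reranker-v2-m3; here a toy token-overlap function stands in so the sketch runs anywhere:

```python
def pair_score(query: str, doc: str) -> float:
    # Stand-in for the cross-encoder: fraction of query tokens found in the doc.
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q), 1)

def rerank(query, candidates, top_k=50):
    # Score each (query, doc) pair jointly, return docs best-first.
    scored = [(pair_score(query, doc), doc) for doc in candidates[:top_k]]
    return [doc for _, doc in sorted(scored, key=lambda s: s[0], reverse=True)]

docs = ["quiet ramen shop locals love", "loud sports bar", "ramen chain outlet"]
reranked = rerank("cozy ramen locals love", docs)
```

Unlike the bi-encoder retrieval stage, a cross-encoder sees query and document together, so it can weigh interactions between them at the cost of one forward pass per pair.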

4. LLM Listwise Reranking (Claude Haiku)

Holistic reordering with comparative reasoning and natural-language match explanations.
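A listwise stage shows the model all candidates at once and asks for a full ordering. A minimal sketch of prompt construction and reply parsing; the prompt wording and the mocked model reply are illustrative, not the project's actual prompts:

```python
def build_listwise_prompt(query, candidates):
    # Number the candidates so the model can reply with indices only.
    lines = [f"[{i}] {doc}" for i, doc in enumerate(candidates, start=1)]
    return (
        f"Query: {query}\n"
        "Rank the candidates below from best to worst match.\n"
        "Reply with the indices in order, comma-separated.\n\n"
        + "\n".join(lines)
    )

def apply_llm_order(candidates, reply):
    # Map a reply like "3,1,2" back onto the candidate list.
    order = [int(tok) - 1 for tok in reply.split(",")]
    return [candidates[i] for i in order]

candidates = ["ramen shop A", "sushi bar B", "ramen shop C"]
prompt = build_listwise_prompt("cozy ramen", candidates)
reordered = apply_llm_order(candidates, "3,1,2")  # mocked model reply
```

Because the model sees the whole slate, it can reason comparatively ("C is cozier than A") in a way pairwise scorers structurally cannot.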

What We Found

RRF fusion is the single biggest quality lever (+80% NDCG)

Combining BM25, dense, and sparse retrieval with a zero-parameter rank fusion formula outperformed every learned model we trained. With only 30 eval queries, simplicity beats complexity.

Small models can replace LLM APIs — if you pick the right output format

Our first T5 training run produced 0% valid JSON. Switching to pipe-delimited output gave 100% parse rate with the same model and data. The bottleneck was representation, not model capacity.
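The fix is easy to see in code: a flat pipe-delimited line has no nesting, quoting, or brace-balancing for a small model to get wrong. The field layout below is illustrative:

```python
def parse_analyzer_output(raw: str) -> dict:
    """Parse one pipe-delimited analyzer line into fields.

    Unlike JSON, a malformed token here degrades one field,
    not the entire parse.
    """
    intent, cuisine, filters, negatives = raw.split("|")
    return {
        "intent": intent.strip(),
        "cuisine": cuisine.strip(),
        "filters": [f for f in filters.split(";") if f],
        "negatives": [n for n in negatives.split(";") if n],
    }

parsed = parse_analyzer_output("find_restaurant|ramen|cozy;local_favorite|loud")
```

The target format is part of the model's job description: the easier the output is to emit token-by-token, the smaller the model that can emit it reliably.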

Reranking stages look worse in metrics — but are actually better

Cross-encoder and LLM stages appear to decrease NDCG because they promote ungraded documents into the top-10. We documented this honestly rather than hiding it — the eval methodology matters as much as the numbers.
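The mechanism is visible in the NDCG computation itself: any document without a relevance judgment counts as relevance 0, so a reranker that surfaces a genuinely good but unjudged document is penalized. A self-contained sketch with toy grades:

```python
import math

def ndcg_at_k(ranked_ids, grades, k=10):
    # Ungraded documents get relevance 0 -- the judgment-coverage trap.
    rels = [grades.get(doc_id, 0) for doc_id in ranked_ids[:k]]
    dcg = sum(r / math.log2(i + 2) for i, r in enumerate(rels))
    ideal = sorted(grades.values(), reverse=True)[:k]
    idcg = sum(r / math.log2(i + 2) for i, r in enumerate(ideal))
    return dcg / idcg if idcg else 0.0

grades = {"a": 3, "b": 2, "c": 1}           # only three docs were ever judged
before = ndcg_at_k(["a", "b", "c"], grades)  # retrieval order: perfect score
after  = ndcg_at_k(["a", "x", "b"], grades)  # reranker promotes unjudged "x"
```

Here `after < before` even if "x" is an excellent result, which is exactly the pattern the reranking stages produce against a judgment pool built from earlier stages.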

End-to-End Process

This project covers the full lifecycle of building, measuring, and operating a production ML system.

Build

Data pipeline (Yelp → spaCy NER → Claude Batch API → BGE-M3 embeddings → Elastic indexing) processing 13K restaurants and 7M reviews.

Deploy

3 Cloud Run services (models, API, UI) with service-to-service auth, secret management, and scale-to-zero infrastructure.

Evaluate

100 eval queries across 10 types, 7-stage ablation study, per-query-type breakdown, and honest failure analysis on the worst-performing queries.

Experiment

LambdaMART custom reranker (26 features, boost tradeoff analysis) and T5 analyzer distillation (2,768 training examples, V1 classifier → V2 generator).

Operate

Component contract monitoring for frozen-weight models, dual NDCG tracking, deployment decision frameworks, and concrete next-step actions.
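A contract check for a frozen-weight component can be as simple as comparing its live score distribution to a recorded baseline: the weights cannot drift, so a shifted distribution points at upstream input or data drift. The thresholds and baseline numbers below are illustrative, not the project's actual values:

```python
def check_score_contract(scores, baseline_mean, baseline_std, tol=3.0):
    """Flag drift when the live mean score sits more than `tol`
    baseline standard deviations away from the recorded baseline."""
    mean = sum(scores) / len(scores)
    z = abs(mean - baseline_mean) / baseline_std
    return {"mean": mean, "z": z, "ok": z <= tol}

# Live reranker scores from a monitoring window (toy numbers).
report = check_score_contract(
    [0.62, 0.58, 0.71, 0.66], baseline_mean=0.64, baseline_std=0.05
)
```

When the check fails, the model itself is exonerated by construction, which narrows the investigation to the query mix, the candidate pool, or the index.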

Tech Stack

Search: Elastic Cloud Serverless
Embeddings: BAAI/bge-m3 (1024-dim)
Reranker: bge-reranker-v2-m3
LLM: Claude Haiku 4.5
API: FastAPI + Python 3.11
UI: Next.js 15 + Tailwind
Infra: GCP Cloud Run
Data: Yelp Open Dataset
ML Training: LightGBM + PyTorch