Price comparison only works if you're comparing the same product. But a gallon of Kirkland organic whole milk at Costco and a gallon of Simple Truth organic whole milk at Kroger are functionally identical to a consumer — they just have different names, SKUs, and serving size labels.

The Deduplication Problem

At 40,000+ distinct product names across our scrape targets, manual synonym matching doesn't scale. The first generation of our system used TF-IDF similarity scores, which worked reasonably well on branded products but collapsed on private-label equivalents.

pgvector + OpenAI Embeddings

We replaced TF-IDF with sentence embeddings from OpenAI's text-embedding-3-small model, stored in Supabase's pgvector extension. Each product name is embedded on ingest; at query time, a cosine similarity search returns the top-k candidates above a 0.92 threshold, which a lightweight classifier then confirms or rejects.

The classifier is a fine-tuned logistic regression trained on ~5,000 hand-labeled pairs. It's fast (sub-millisecond) and handles edge cases the embedding similarity misses — like private-label products with nearly identical names but different serving sizes.

Performance

Query latency is consistently under 40ms for the embedding lookup + classifier pass on our current dataset size. The pgvector HNSW index handles the similarity search; we tuned ef_search to 64, which gives us >99% recall at our query volume.