Embeddings cheatsheet.
What are embeddings
Text → vector (e.g., 1536 floats). Similar text → similar vectors. Foundation for RAG, semantic search, clustering.
OpenAI
from openai import OpenAI
client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.embeddings.create(
    model="text-embedding-3-small",  # 1536 dim, cheap
    # model="text-embedding-3-large",  # 3072 dim
    input=["text 1", "text 2"],
)
vectors = [d.embedding for d in response.data]
Cohere
import cohere
co = cohere.Client(...)
response = co.embed(
    texts=["text 1", "text 2"],
    model="embed-english-v3.0",
    input_type="search_document",  # use "search_query" when embedding queries
)
vectors = response.embeddings
Voyage (great for RAG)
import voyageai
vo = voyageai.Client()
result = vo.embed(["text"], model="voyage-3", input_type="document")
Local (sentence-transformers)
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("BAAI/bge-base-en-v1.5")
embeddings = model.encode(["text 1", "text 2"], normalize_embeddings=True)
Models worth trying: BAAI/bge-large-en-v1.5, mixedbread-ai/mxbai-embed-large-v1, intfloat/e5-large-v2.
Cosine similarity
import numpy as np
def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
# If normalized: dot product alone
For normalized (unit-length) vectors, cosine similarity equals the dot product, so you can skip computing the norms. Faster.
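A quick check of that shortcut, as a minimal sketch. The vectors are made-up toy values rather than real embeddings, and `normalize` is a helper name introduced here for illustration.

```python
import numpy as np

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def normalize(v):
    """Scale a vector to unit length (hypothetical helper)."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

a = np.array([3.0, 4.0, 0.0])  # toy values, not real embeddings
b = np.array([1.0, 2.0, 2.0])

full = cosine(a, b)                                   # divides by both norms
fast = float(np.dot(normalize(a), normalize(b)))      # plain dot product on unit vectors
print(full, fast)
```

Normalize once at insert time and every later comparison is a single dot product.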
Vector DBs
- Pinecone: managed.
- Weaviate: managed/open.
- Qdrant: open, easy.
- Chroma: simple, embeddable.
- Milvus: scale.
- pgvector: Postgres extension.
- Redis: with redis-stack.
- Elasticsearch / OpenSearch: with kNN.
pgvector
CREATE EXTENSION vector;
CREATE TABLE docs (
id SERIAL PRIMARY KEY,
text TEXT,
embedding vector(1536)
);
CREATE INDEX ON docs USING hnsw (embedding vector_cosine_ops);
-- Insert
INSERT INTO docs (text, embedding) VALUES ('...', '[0.1, 0.2, ...]');
-- Query
SELECT text, embedding <=> '[...]' AS distance
FROM docs ORDER BY distance LIMIT 5;
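Operators: <=> is cosine distance (1 - cosine similarity), <-> is L2 distance, <#> is negative inner product. All three sort ascending, so smaller means closer.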
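For intuition, here is roughly what the three operators compute, written as plain Python. This is a sketch of the math only, not pgvector's implementation, and the function names are made up for this example.

```python
import math

def cosine_distance(a, b):
    """What <=> returns: 1 - cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def l2_distance(a, b):
    """What <-> returns: Euclidean distance."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def negative_inner_product(a, b):
    """What <#> returns: the dot product with its sign flipped."""
    return -sum(x * y for x, y in zip(a, b))

a = [1.0, 2.0]
b = [2.0, 4.0]  # same direction as a, twice the length

print(cosine_distance(a, b))         # close to 0: same direction
print(l2_distance(a, b))             # sqrt(5): the lengths differ
print(negative_inner_product(a, b))  # -10.0
```

The sign flip on the inner product exists so that ORDER BY ... ASC puts the best match first for every operator.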
Qdrant
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams
client = QdrantClient(":memory:")
client.create_collection(
    "docs",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
)
client.upsert(collection_name="docs", points=[
    PointStruct(id=1, vector=[...], payload={"text": "..."}),
])
# Newer qdrant-client releases deprecate search() in favor of query_points()
results = client.search(collection_name="docs", query_vector=[...], limit=5)
Chroma
import chromadb
client = chromadb.PersistentClient(path="./db")
coll = client.get_or_create_collection("docs")
coll.add(ids=["1"], embeddings=[[...]], metadatas=[{"src": "..."}], documents=["text"])
results = coll.query(query_embeddings=[[...]], n_results=5)
Chunking
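Split text into chunks of roughly 300-500 tokens with ~10-15% overlap. The helper below approximates tokens with whitespace-separated words.
def chunk(text, chunk_size=500, overlap=50):
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    chunks = []
    step = chunk_size - overlap
    for i in range(0, len(words), step):
        chunks.append(" ".join(words[i:i + chunk_size]))
        if i + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks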
Better: chunk along semantic or structural boundaries (headings, paragraphs, sentences) instead of fixed word counts. Libraries: langchain.text_splitter, unstructured.
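A minimal sketch of structural chunking with no library at all: pack whole paragraphs into chunks up to a word budget. It assumes paragraphs are separated by blank lines, and `chunk_by_paragraph` and `max_words` are names made up for this example.

```python
def chunk_by_paragraph(text, max_words=300):
    """Greedily pack whole paragraphs into chunks of at most max_words words.

    A single paragraph longer than max_words becomes its own oversized chunk.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current, count = [], [], 0
    for para in paragraphs:
        n = len(para.split())
        # Flush the current chunk if this paragraph would overflow the budget.
        if current and count + n > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks

doc = "Intro paragraph here.\n\nSecond paragraph with more words.\n\nThird one."
chunks = chunk_by_paragraph(doc, max_words=8)
print(chunks)
```

Paragraphs never get cut mid-sentence this way, at the cost of uneven chunk sizes.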
Matryoshka embeddings
The text-embedding-3 models (small and large) support dimension reduction via the dimensions parameter:
client.embeddings.create(
    model="text-embedding-3-large",
    input=["..."],
    dimensions=512,  # cut from 3072
)
Lower dim → faster, less storage, slight quality drop.
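If you already stored full-size vectors, the same idea can be applied locally: keep the leading components and re-normalize. A sketch, assuming the model was trained Matryoshka-style so the leading components carry most of the signal; `truncate_embedding` is a made-up helper and the input is random data standing in for a real embedding.

```python
import numpy as np

def truncate_embedding(vec, dim):
    """Keep the first `dim` components and rescale to unit length."""
    v = np.asarray(vec, dtype=float)[:dim]
    norm = np.linalg.norm(v)
    if norm == 0:
        raise ValueError("cannot normalize an all-zero vector")
    return v / norm

rng = np.random.default_rng(0)
full = rng.normal(size=3072)       # stand-in for a real 3072-dim embedding
full /= np.linalg.norm(full)

short = truncate_embedding(full, 512)
print(short.shape, np.linalg.norm(short))
```

The re-normalization matters: a truncated vector is no longer unit length, so dot product would stop matching cosine.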
Hybrid search
Combine vector search (semantic) + keyword (BM25) for best recall.
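from rank_bm25 import BM25Okapi
# tokenized_corpus (assumed name): a list of token lists, one per document
bm25 = BM25Okapi(tokenized_corpus)
bm25_scores = bm25.get_scores(tokenized_query)
vector_scores = ...
# Reciprocal Rank Fusion
def rrf(rankings, k=60):
    """rankings: lists of doc ids, each ordered best-first."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank + 1)
    return sorted(scores.items(), key=lambda x: -x[1])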
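A worked example of the fusion step, with made-up scores. RRF takes ranked lists of ids rather than raw scores, so each score list is converted to a ranking first; `to_ranking` is a helper name introduced here.

```python
def rrf(rankings, k=60):
    """Reciprocal Rank Fusion over lists of doc ids ordered best-first."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank + 1)
    return sorted(scores.items(), key=lambda x: -x[1])

def to_ranking(doc_ids, scores):
    """Order doc ids by descending score (hypothetical helper)."""
    return [d for d, _ in sorted(zip(doc_ids, scores), key=lambda p: -p[1])]

# Toy scores, not from a real index.
bm25_ranking = to_ranking(["d1", "d2", "d3"], [1.2, 0.4, 2.5])       # d3, d1, d2
vector_ranking = to_ranking(["d1", "d3", "d4"], [0.91, 0.55, 0.78])  # d1, d4, d3

fused = rrf([bm25_ranking, vector_ranking])
top_ids = [doc_id for doc_id, _ in fused]
print(top_ids)
```

Here d1 wins because it ranks near the top of both lists, even though d3 has the best BM25 score. Working on ranks sidesteps the problem that BM25 and cosine scores live on different scales.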
Reranking
Initial retrieval (top 50) → rerank with cross-encoder (top 5).
from sentence_transformers import CrossEncoder
reranker = CrossEncoder("BAAI/bge-reranker-large")
pairs = [(query, doc) for doc in candidates]
scores = reranker.predict(pairs)
ranked = sorted(zip(candidates, scores), key=lambda x: -x[1])
Or use Cohere Rerank / Voyage Rerank APIs.
Dimensions tradeoffs
| | Lower dim | Higher dim |
|---|---|---|
| Storage | less | more |
| Speed | faster | slower |
| Quality | lower | higher |
| Cost (API) | same | same |
For most use cases: 768-1024 is plenty.
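The storage row is easy to put numbers on. A back-of-envelope sketch, assuming float32 values (4 bytes each) and ignoring index overhead and metadata; the function name is made up for this example.

```python
def raw_vector_bytes(n_vectors, dim, bytes_per_value=4):
    """Raw vector storage only: no index structures, no payloads."""
    return n_vectors * dim * bytes_per_value

for dim in (512, 1024, 1536, 3072):
    gb = raw_vector_bytes(1_000_000, dim) / 1e9
    print(f"{dim:>4} dims: {gb:.2f} GB per 1M vectors")
```

An HNSW index adds graph links on top of this, so treat the result as a lower bound.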
Cost
OpenAI text-embedding-3-small: $0.02 / 1M tokens. Very cheap.
For huge corpora: consider local model + batch inference.
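To see where "huge" starts to matter, a rough cost estimate using the price quoted above. The chunk count and average chunk length are made-up inputs; check current pricing before relying on the number.

```python
PRICE_PER_MILLION_TOKENS = 0.02  # USD, text-embedding-3-small, figure quoted above

def embedding_cost_usd(n_chunks, avg_tokens_per_chunk,
                       price_per_million=PRICE_PER_MILLION_TOKENS):
    """One-off cost of embedding a corpus, ignoring retries and re-embeds."""
    total_tokens = n_chunks * avg_tokens_per_chunk
    return total_tokens / 1_000_000 * price_per_million

# Hypothetical corpus: 1M chunks at ~400 tokens each.
cost = embedding_cost_usd(1_000_000, 400)
print(f"${cost:.2f}")
```

Remember that switching models means paying this again for the whole corpus.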
Indexing strategy
- HNSW: most common, fast. Memory-heavy.
- IVF (FAISS): good for billions, partitioned.
- DiskANN: SSD-friendly.
Most managed DBs use HNSW by default.
Distance metrics
- Cosine: most common; angle-based. Use normalized vectors.
- Dot product: same as cosine if normalized.
- L2 (Euclidean): distance-based.
For most LLM embeddings: cosine.
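On normalized vectors the choice matters less than it looks: squared L2 distance equals 2 - 2 * (dot product), so all three metrics give the same ordering. A small sketch on random unit vectors (not real embeddings) that checks this:

```python
import numpy as np

rng = np.random.default_rng(42)

# 100 random unit-length "documents" and one unit-length query.
docs = rng.normal(size=(100, 64))
docs /= np.linalg.norm(docs, axis=1, keepdims=True)
query = rng.normal(size=64)
query /= np.linalg.norm(query)

dot = docs @ query                         # equals cosine similarity here
l2 = np.linalg.norm(docs - query, axis=1)  # Euclidean distance

top_by_dot = np.argsort(-dot)[:10]  # highest similarity first
top_by_l2 = np.argsort(l2)[:10]     # smallest distance first
print(top_by_dot, top_by_l2)
```

If your vectors are not normalized, the orderings can diverge, which is why the metric has to match how the model was trained.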
Common mistakes
- Inconsistent embedding model (insert with one, query with another).
- Using dot product or L2 on un-normalized vectors and reading the result as cosine similarity.
- Storing too-large chunks → diluted similarity.
- Skipping reranking → good recall, poor precision in the top results.
- Updating model without re-embedding the corpus.
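The first and last mistakes can be caught mechanically by recording the model name next to the vectors and checking it on every write and query. A toy in-memory sketch; `EmbeddingStore` is a hypothetical class for illustration, not a real library API.

```python
class EmbeddingStore:
    """Toy in-memory store that pins one embedding model per collection."""

    def __init__(self, model_name, dim):
        self.model_name = model_name
        self.dim = dim
        self.vectors = {}

    def _check(self, vector, model_name):
        if model_name != self.model_name:
            raise ValueError(
                f"store was built with {self.model_name!r}, got {model_name!r}; "
                "re-embed the corpus before switching models"
            )
        if len(vector) != self.dim:
            raise ValueError(f"expected {self.dim} dims, got {len(vector)}")

    def add(self, doc_id, vector, model_name):
        self._check(vector, model_name)
        self.vectors[doc_id] = list(vector)

    def query(self, vector, model_name, top_k=5):
        self._check(vector, model_name)
        scored = [
            (doc_id, sum(x * y for x, y in zip(vector, stored)))
            for doc_id, stored in self.vectors.items()
        ]
        return sorted(scored, key=lambda p: -p[1])[:top_k]


store = EmbeddingStore("text-embedding-3-small", dim=3)
store.add("a", [1.0, 0.0, 0.0], model_name="text-embedding-3-small")
store.add("b", [0.0, 1.0, 0.0], model_name="text-embedding-3-small")

try:
    store.query([1.0, 0.0, 0.0], model_name="text-embedding-3-large")
    rejected = False
except ValueError as err:
    rejected = True
    print("rejected:", err)
```

In a real vector DB the same idea usually means storing the model name in collection metadata or in the collection name itself.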