MarkTechPost AI | Other AI

A Hands-On Coding Guide: Building Semantic, Hybrid, Sparse, and Quantized Vector Search Systems Powered by pgvector

May 28, 2026, 08:07

Key Takeaways

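This tutorial builds a complete pgvector playground in Google Colab and explores how PostgreSQL can serve as a powerful vector database for modern AI applications. We start by installing PostgreSQL, compiling the pgvector extension, connecting through Psycopg, and registering the vector types for smooth Python integration. We then generate embeddings with SentenceTransformers, store them in PostgreSQL, build HNSW indexes, and run semantic search, filtered search, distance metric comparisons, half-precision storage, binary quantization, sparse vector search, hybrid retrieval, and vector aggregation. Along the way, we learn how pgvector, using only open-source tools, supports practical retrieval-augmented generation, recommendation, similarity search, and hybrid search systems.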

Site AI-Compiled Version

In this tutorial, we build a complete pgvector playground inside Google Colab and explore how PostgreSQL can work as a powerful vector database for modern AI applications. We start by installing PostgreSQL, compiling the pgvector extension, connecting through Psycopg, and registering vector types for smooth Python integration. Then, we create embeddings with SentenceTransformers, store them in PostgreSQL, build HNSW indexes, and run semantic search, filtered search, distance metric comparisons, half-precision storage, binary quantization, sparse vector search, hybrid retrieval, and vector aggregation. Through this workflow, we learn how pgvector supports practical retrieval-augmented generation, recommendation, similarity search, and hybrid search systems using only open-source tools.

    import os
    import subprocess
    import sys
    import time


    def sh(cmd: str, check: bool = True):
        """Run a shell command, streaming a compact log."""
        print(f" $ {cmd}")
        return subprocess.run(cmd, shell=True, check=check,
                              stdout=subprocess.DEVNULL, stderr=subprocess.STDOUT)


    print("[0/10] Installing PostgreSQL + building pgvector (≈1–2 min)...")
    sh("apt-get -qq update")
    sh("apt-get -qq install -y postgresql postgresql-contrib "
       "postgresql-server-dev-all build-essential git")
    if not os.path.exists("/tmp/pgvector"):
        sh("git clone --depth 1 https://github.com/pgvector/pgvector.git /tmp/pgvector")
    sh("cd /tmp/pgvector && make && make install")
    sh("service postgresql start")
    time.sleep(3)
    sh("""sudo -u postgres psql -c "ALTER USER postgres PASSWORD 'postgres';" """)

    print("[0/10] Installing Python packages...")
    sh(f"{sys.executable} -m pip install -q pgvector psycopg[binary] "
       f"sentence-transformers numpy")

We set up the complete PostgreSQL and pgvector environment. We install the required system packages, clone and build pgvector from source, start the PostgreSQL service, and configure the database password.
We also install the Python dependencies needed to connect to PostgreSQL and work with vector embeddings.

    import numpy as np
    import psycopg
    from pgvector import HalfVector, SparseVector
    from pgvector.psycopg import register_vector
    from sentence_transformers import SentenceTransformer

    print("\n[1/10] Connecting and enabling the 'vector' extension...")
    conn = psycopg.connect(
        "host=127.0.0.1 port=5432 dbname=postgres user=postgres password=postgres",
        autocommit=True,
    )
    conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
    register_vector(conn)
    ver = conn.execute("SELECT extversion FROM pg_extension WHERE extname='vector'").fetchone()[0]
    print(f" pgvector version: {ver}")

    print("\n[2/10] Loading embedding model + encoding corpus...")
    model = SentenceTransformer("all-MiniLM-L6-v2")
    DIM = model.get_sentence_embedding_dimension()

    corpus = [
        ("Octopuses have three hearts and blue blood.", "animals"),
        ("Transformers revolutionized natural language processing.", "technology"),
        ("Quantum computers exploit superposition and entanglement.", "technology"),
        ("GPUs accelerate deep learning by parallelizing matrix math.", "technology"),
        ("Sourdough bread relies on wild yeast and lactobacilli.", "food"),
        ("Dark chocolate contains flavonoid antioxidants.", "food"),
        ("A black hole's gravity is so strong light cannot escape.", "space"),
    ]
    contents = [c for c, _ in corpus]
    categories = [k for _, k in corpus]
    embeddings = model.encode(contents, normalize_embeddings=True)

    conn.execute("DROP TABLE IF EXISTS documents")
    conn.execute(f"""
        CREATE TABLE documents (
            id bigserial PRIMARY KEY,
            content text,
            category text,
            embedding vector({DIM})
        )
    """)
    with conn.cursor() as cur:
        cur.executemany(
            "INSERT INTO documents (content, category, embedding) VALUES (%s, %s, %s)",
            list(zip(contents, categories, [np.asarray(e) for e in embeddings])),
        )
    print(f" Inserted {len(corpus)} documents with {DIM}-d embeddings.")

We connect to PostgreSQL, enable the pgvector extension, and register vector support with Psycopg. We load the SentenceTransformers model, define a small text corpus, generate normalized embeddings, and create a PostgreSQL table for storing documents. We then insert each document with its category and vector representation so that we can perform semantic search later.

    print("\n[3/10] Building HNSW index and running semantic search...")
    conn.execute(
        "CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops) "
        "WITH (m = 16, ef_construction = 64)"
    )
    conn.execute("SET hnsw.ef_search = 100")


    def semantic_search(query: str, k: int = 4):
        q = np.asarray(model.encode(query, normalize_embeddings=True))
        return conn.execute(
            "SELECT content, category, embedding <=> %s AS distance "
            "FROM documents ORDER BY distance LIMIT %s",
            (q, k),
        ).fetchall()


    for content, cat, dist in semantic_search("animals that are unusually quick"):
        print(f" {dist:.3f} [{cat:<10}] {content}")

    print("\n[4/10] Filtered search (only category = 'space')...")
    q = np.asarray(model.encode("objects with extreme gravity", normalize_embeddings=True))
    rows = conn.execute(
        "SELECT content, embedding <=> %s AS distance "
        "FROM documents WHERE category = %s ORDER BY distance LIMIT 3",
        (q, "space"),
    ).fetchall()
    for content, dist in rows:
        print(f" {dist:.3f} {content}")

    print("\n[5/10] Same query under different distance metrics (top hit each)...")
    q = np.asarray(model.encode("brewing a hot caffeinated drink", normalize_embeddings=True))
    for op, label in [("<->", "L2"), ("<=>", "cosine"), ("<#>", "neg-inner"), ("<+>", "L1")]:
        content, score = conn.execute(
            f"SELECT content, embedding {op} %s AS s FROM documents ORDER BY s LIMIT 1",
            (q,)
        ).fetchone()
        print(f" {label:<10} {score:+.3f} {content}")

We build an HNSW index on the embedding column to enable faster, more efficient vector search.
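The four operators compared in step 5 above can be checked without a database. The following is a minimal NumPy sketch, not part of the original tutorial: the function names and the two toy vectors are hypothetical, and each function is only meant to mirror what the corresponding pgvector operator computes. It also illustrates why unit-normalized embeddings (as produced with normalize_embeddings=True) rank the same way under L2, cosine, and negative inner product.

```python
import numpy as np


def l2_distance(a, b):
    # Mirrors pgvector's <-> operator: Euclidean distance.
    return float(np.linalg.norm(a - b))


def cosine_distance(a, b):
    # Mirrors <=>: one minus cosine similarity.
    return float(1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def neg_inner_product(a, b):
    # Mirrors <#>: the inner product, negated so that smaller means closer.
    return float(-np.dot(a, b))


def l1_distance(a, b):
    # Mirrors <+>: sum of absolute component differences.
    return float(np.abs(a - b).sum())


# Two hypothetical unit-length vectors (not taken from the corpus).
a = np.array([3.0, 4.0]) / 5.0   # [0.6, 0.8]
b = np.array([1.0, 0.0])

for name, fn in [("L2", l2_distance), ("cosine", cosine_distance),
                 ("neg-inner", neg_inner_product), ("L1", l1_distance)]:
    print(f"{name:<10} {fn(a, b):+.3f}")

# For unit vectors: cosine distance = 1 + negative inner product,
# and squared L2 distance = 2 * cosine distance.
print(np.isclose(cosine_distance(a, b), 1.0 + neg_inner_product(a, b)))
print(np.isclose(l2_distance(a, b) ** 2, 2.0 * cosine_distance(a, b)))
```

Because all three quantities are monotone functions of the same dot product when the vectors have unit length, the top hit under L2, cosine, and negative inner product should coincide for normalized embeddings; L1 is the one metric here that can order results differently.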
We define a semantic search function that converts a query into an embedding and retrieves the closest documents by cosine distance (the <=> operator). We also perform metadata-filtered search and compare different pgvector distance operators such as L2, cosine, negative inner product, and L1.

    print("\n[6/10] Half-precision storage with halfvec...")
    conn.execute(f"ALTER TABLE documents ADD COLUMN IF NOT EXISTS embedding_half halfvec({DIM})")
    conn.execute("UPDATE documents SET embedding_half = embedding::halfvec")
    conn.execute(
        "CREATE INDEX ON documents USING hnsw (embedding_half halfvec_cosine_ops)"
    )
    q_half = HalfVector(model.encode("the galaxy we live in", normalize_embeddings=True))
    rows = conn.execute(
        "SELECT content, embedding_half <=> %s AS d FROM documents ORDER BY d LIMIT 2",
        (q_half,),
    ).fetchall()
    for content, d in rows:
        print(f" {d:.3f} {content}")

    print("\n[7/10] Binary quantization (Hamming) + exact re-rank...")
    conn.execute(
        f"CREATE INDEX ON documents "
        f"USING hnsw ((binary_quantize(embedding)::bit({DIM})) bit_hamming_ops)"
    )
    q = np.asarray(model.encode("parallel hardware for AI training", normalize_embeddings=True))
    rerank_sql = f"""
        SELECT content, candidates.embedding <=> %(q)s AS exact_distance
        FROM (
            SELECT content, embedding
            FROM documents
            ORDER BY binary_quantize(embedding)::bit({DIM}) <~> binary_quantize(%(q)s)::bit({DIM})
            LIMIT 8
        ) AS candidates
        ORDER BY exact_distance
        LIMIT 3
    """
    for content, d in conn.execute(rerank_sql, {"q": q}).fetchall():
        print(f" {d:.3f} {content}")

    print("\n[8/10] Native sparse vectors...")
    conn.execute("DROP TABLE IF EXISTS sparse_items")
    conn.execute("CREATE TABLE sparse_items (id bigserial PRIMARY KEY, embedding sparsevec(10))")
    sparse_data = [
        SparseVector({0: 1.0, 3: 2.0, 7: 1.5}, 10),
        SparseVector({1: 0.5, 3: 1.0, 9: 3.0}, 10),
        SparseVector({0: 0.2, 4: 2.5, 7: 0.8}, 10),
    ]
    with conn.cursor() as cur:
        cur.executemany("INSERT INTO sparse_items (embedding) VALUES (%s)", [(v,) for v in sparse_data])
    query_sparse = SparseVector({0: 1.0, 7: 1.0}, 10)
    rows = conn.execute(
        "SELECT id, embedding, embedding <#> %s AS neg_ip "
        "FROM sparse_items ORDER BY neg_ip LIMIT 3",
        (query_spa
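The two-stage pattern in step 7 (a coarse Hamming-distance shortlist over bit codes, then an exact cosine re-rank of the shortlist) can be illustrated in plain NumPy. This is a rough sketch rather than the tutorial's code: it assumes binary quantization maps positive components to 1 and everything else to 0, which is how pgvector documents binary_quantize, and the helper names, toy vectors, and shortlist size are all hypothetical.

```python
import numpy as np


def binary_quantize(v):
    # Assumed rule: 1 where the component is positive, otherwise 0.
    return (np.asarray(v) > 0).astype(np.uint8)


def hamming(a_bits, b_bits):
    # Number of bit positions where the two codes differ.
    return int(np.count_nonzero(a_bits != b_bits))


def cosine_distance(a, b):
    return float(1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def quantized_search(query, vectors, shortlist=3, k=2):
    """Coarse Hamming shortlist followed by an exact cosine re-rank."""
    q_bits = binary_quantize(query)
    codes = [binary_quantize(v) for v in vectors]
    # Stage 1: keep the `shortlist` rows whose bit codes are closest to the query's.
    coarse = sorted(range(len(vectors)), key=lambda i: hamming(codes[i], q_bits))[:shortlist]
    # Stage 2: re-rank only those candidates with the full-precision vectors.
    exact = sorted(coarse, key=lambda i: cosine_distance(vectors[i], query))[:k]
    return [(i, cosine_distance(vectors[i], query)) for i in exact]


# Hypothetical 4-dimensional toy data, chosen by hand for readability.
vectors = np.array([
    [0.9, 0.1, -0.3, 0.2],
    [-0.5, 0.8, 0.1, -0.2],
    [0.7, 0.2, -0.1, 0.6],
    [-0.6, -0.7, 0.3, 0.1],
    [0.1, -0.9, -0.4, 0.1],
])
query = np.array([0.8, 0.3, -0.2, 0.3])

hits = quantized_search(query, vectors)
for idx, dist in hits:
    print(f"row {idx}: exact cosine distance {dist:.3f}")
```

Rows 0 and 2 share the query's sign pattern, so they tie at Hamming distance 0 in the coarse stage; only the exact re-rank can separate them. That is the reason the SQL in step 7 fetches more candidates (LIMIT 8) than it finally returns (LIMIT 3).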

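The sparse query in step 8 orders rows by negative inner product (embedding <#> query). As a sanity check that needs no database, the sketch below recomputes that score in plain Python using the same index-to-value dictionaries passed to SparseVector in the tutorial. The function name is hypothetical, and the ids 1 to 3 assume that bigserial numbers rows in insertion order on the freshly created table.

```python
def sparse_neg_inner_product(a: dict, b: dict) -> float:
    """Negative inner product of two sparse vectors stored as {index: value}."""
    # Only indices present in both vectors contribute to the dot product.
    shared = a.keys() & b.keys()
    return float(-sum(a[i] * b[i] for i in shared))


# Same values as sparse_data in step 8; keys 1..3 are the assumed row ids.
items = {
    1: {0: 1.0, 3: 2.0, 7: 1.5},
    2: {1: 0.5, 3: 1.0, 9: 3.0},
    3: {0: 0.2, 4: 2.5, 7: 0.8},
}
query = {0: 1.0, 7: 1.0}

# Ascending order, matching ORDER BY neg_ip: the most negative score is the best match.
ranked = sorted(items, key=lambda i: sparse_neg_inner_product(items[i], query))
for i in ranked:
    print(i, sparse_neg_inner_product(items[i], query))
```

The second item shares no non-zero index with the query, so its inner product is zero and it sorts last; this shows how sparse scoring depends only on the overlapping dimensions.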
Original source: MarkTechPost AI
