MarkTechPost AI | Other AI

A Hands-On Coding Guide: Building Semantic, Hybrid, Sparse, and Quantized Vector Search Systems Powered by pgvector

May 28, 2026, 08:07

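Key Takeaways

This tutorial sets up a full pgvector playground in Google Colab to show how PostgreSQL can serve as a vector database for modern AI applications. It begins with installing PostgreSQL, compiling the pgvector extension, connecting through Psycopg, and registering the vector types for Python. It then generates embeddings with SentenceTransformers, stores them in PostgreSQL, builds HNSW indexes, and runs semantic search, filtered search, a distance metric comparison, half-precision storage, binary quantization, sparse vector search, hybrid retrieval, and vector aggregation. The result is a workflow, built entirely from open-source tools, that covers retrieval-augmented generation, recommendation, similarity search, and hybrid search.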

Site AI-Compiled Text

In this tutorial, we build a complete pgvector playground inside Google Colab and explore how PostgreSQL can work as a powerful vector database for modern AI applications. We start by installing PostgreSQL, compiling the pgvector extension, connecting through Psycopg, and registering vector types for smooth Python integration. Then, we create embeddings with SentenceTransformers, store them in PostgreSQL, build HNSW indexes, and run semantic search, filtered search, distance metric comparisons, half-precision storage, binary quantization, sparse vector search, hybrid retrieval, and vector aggregation. Through this workflow, we learn how pgvector supports practical retrieval-augmented generation, recommendation, similarity search, and hybrid search systems using only open-source tools.

    import os
    import subprocess
    import sys
    import time

    def sh(cmd: str, check: bool = True):
        """Run a shell command, echoing the command and discarding its output."""
        print(f" $ {cmd}")
        return subprocess.run(cmd, shell=True, check=check,
                              stdout=subprocess.DEVNULL, stderr=subprocess.STDOUT)

    print("[0/10] Installing PostgreSQL + building pgvector (≈1–2 min)...")
    sh("apt-get -qq update")
    sh("apt-get -qq install -y postgresql postgresql-contrib "
       "postgresql-server-dev-all build-essential git")
    if not os.path.exists("/tmp/pgvector"):
        sh("git clone --depth 1 https://github.com/pgvector/pgvector.git /tmp/pgvector")
    sh("cd /tmp/pgvector && make && make install")
    sh("service postgresql start")
    time.sleep(3)
    sh("""sudo -u postgres psql -c "ALTER USER postgres PASSWORD 'postgres';" """)

    print("[0/10] Installing Python packages...")
    sh(f"{sys.executable} -m pip install -q pgvector psycopg[binary] "
       f"sentence-transformers numpy")

We set up the complete PostgreSQL and pgvector environment. We install the required system packages, clone and build pgvector from source, start the PostgreSQL service, and configure the database password. We also install the Python dependencies needed to connect to PostgreSQL and work with vector embeddings.

    import numpy as np
    import psycopg
    from pgvector import HalfVector, SparseVector
    from pgvector.psycopg import register_vector
    from sentence_transformers import SentenceTransformer

    print("\n[1/10] Connecting and enabling the 'vector' extension...")
    conn = psycopg.connect(
        "host=127.0.0.1 port=5432 dbname=postgres user=postgres password=postgres",
        autocommit=True,
    )
    conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
    register_vector(conn)
    ver = conn.execute("SELECT extversion FROM pg_extension WHERE extname='vector'").fetchone()[0]
    print(f" pgvector version: {ver}")

    print("\n[2/10] Loading embedding model + encoding corpus...")
    model = SentenceTransformer("all-MiniLM-L6-v2")
    DIM = model.get_sentence_embedding_dimension()
    corpus = [
        ("Octopuses have three hearts and blue blood.", "animals"),
        ("Transformers revolutionized natural language processing.", "technology"),
        ("Quantum computers exploit superposition and entanglement.", "technology"),
        ("GPUs accelerate deep learning by parallelizing matrix math.", "technology"),
        ("Sourdough bread relies on wild yeast and lactobacilli.", "food"),
        ("Dark chocolate contains flavonoid antioxidants.", "food"),
        ("A black hole's gravity is so strong light cannot escape.", "space"),
    ]
    contents = [c for c, _ in corpus]
    categories = [k for _, k in corpus]
    embeddings = model.encode(contents, normalize_embeddings=True)

    conn.execute("DROP TABLE IF EXISTS documents")
    conn.execute(f"""
        CREATE TABLE documents (
            id bigserial PRIMARY KEY,
            content text,
            category text,
            embedding vector({DIM})
        )
    """)
    with conn.cursor() as cur:
        cur.executemany(
            "INSERT INTO documents (content, category, embedding) VALUES (%s, %s, %s)",
            list(zip(contents, categories, [np.asarray(e) for e in embeddings])),
        )
    print(f" Inserted {len(corpus)} documents with {DIM}-d embeddings.")

We connect to PostgreSQL, enable the pgvector extension, and register vector support with Psycopg. We load the SentenceTransformers model, define a small text corpus, generate normalized embeddings, and create a PostgreSQL table for storing documents. We then insert each document with its category and vector representation so that we can perform semantic search later.

    print("\n[3/10] Building HNSW index and running semantic search...")
    conn.execute(
        "CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops) "
        "WITH (m = 16, ef_construction = 64)"
    )
    conn.execute("SET hnsw.ef_search = 100")

    def semantic_search(query: str, k: int = 4):
        q = np.asarray(model.encode(query, normalize_embeddings=True))
        return conn.execute(
            "SELECT content, category, embedding <=> %s AS distance "
            "FROM documents ORDER BY distance LIMIT %s",
            (q, k),
        ).fetchall()

    for content, cat, dist in semantic_search("animals that are unusually quick"):
        print(f" {dist:.3f} [{cat:<10}] {content}")

    print("\n[4/10] Filtered search (only category = 'space')...")
    q = np.asarray(model.encode("objects with extreme gravity", normalize_embeddings=True))
    rows = conn.execute(
        "SELECT content, embedding <=> %s AS distance "
        "FROM documents WHERE category = %s ORDER BY distance LIMIT 3",
        (q, "space"),
    ).fetchall()
    for content, dist in rows:
        print(f" {dist:.3f} {content}")

    print("\n[5/10] Same query under different distance metrics (top hit each)...")
    q = np.asarray(model.encode("brewing a hot caffeinated drink", normalize_embeddings=True))
    for op, label in [("<->", "L2"), ("<=>", "cosine"), ("<#>", "neg-inner"), ("<+>", "L1")]:
        content, score = conn.execute(
            f"SELECT content, embedding {op} %s AS s FROM documents ORDER BY s LIMIT 1", (q,)
        ).fetchone()
        print(f" {label:<10} {score:+.3f} {content}")

We build an HNSW index on the embedding column so that nearest-neighbor queries can use approximate index search instead of scanning every row.
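The four operators compared in step [5/10] can be illustrated without a database. The following is a minimal sketch in plain Python, assuming tiny hand-made 3-d vectors (the names `docs`, `query`, and `rank` are hypothetical and not part of the tutorial). It shows that on unit-length vectors, such as the normalized embeddings used here, L2 distance, cosine distance, and negative inner product all order the results identically, because each is a monotone function of the dot product.

```python
import math

def l2_distance(a, b):
    # Euclidean distance, the quantity behind pgvector's <-> operator.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    # 1 - cosine similarity, the quantity behind <=>.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def negative_inner_product(a, b):
    # Negated dot product, the quantity behind <#> (smaller means more similar).
    return -sum(x * y for x, y in zip(a, b))

def l1_distance(a, b):
    # Taxicab distance, the quantity behind <+>.
    return sum(abs(x - y) for x, y in zip(a, b))

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

# Hypothetical toy data standing in for real embeddings.
query = normalize([1.0, 2.0, 0.5])
docs = {
    "a": normalize([1.0, 1.8, 0.4]),
    "b": normalize([-1.0, 0.2, 2.0]),
    "c": normalize([0.1, 1.0, 1.0]),
}

def rank(metric):
    # Ascending order: the closest document comes first, as with ORDER BY.
    return sorted(docs, key=lambda name: metric(query, docs[name]))

rankings = {
    "L2": rank(l2_distance),
    "cosine": rank(cosine_distance),
    "neg-inner": rank(negative_inner_product),
    "L1": rank(l1_distance),
}
for label, order in rankings.items():
    print(f"{label:<10} {order}")
```

For unit vectors the squared L2 distance equals twice the cosine distance, which is why the first three rankings agree. L1 is not tied to the dot product in the same way, so its ordering can differ on other data.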
We define a semantic search function that converts a query into an embedding and retrieves the closest documents by cosine distance (the <=> operator). We also perform metadata-filtered search and compare the pgvector distance operators for L2 (<->), cosine (<=>), negative inner product (<#>), and L1 (<+>).

    print("\n[6/10] Half-precision storage with halfvec...")
    conn.execute(f"ALTER TABLE documents ADD COLUMN IF NOT EXISTS embedding_half halfvec({DIM})")
    conn.execute("UPDATE documents SET embedding_half = embedding::halfvec")
    conn.execute(
        "CREATE INDEX ON documents USING hnsw (embedding_half halfvec_cosine_ops)"
    )
    q_half = HalfVector(model.encode("the galaxy we live in", normalize_embeddings=True))
    rows = conn.execute(
        "SELECT content, embedding_half <=> %s AS d FROM documents ORDER BY d LIMIT 2",
        (q_half,),
    ).fetchall()
    for content, d in rows:
        print(f" {d:.3f} {content}")

    print("\n[7/10] Binary quantization (Hamming) + exact re-rank...")
    conn.execute(
        f"CREATE INDEX ON documents "
        f"USING hnsw ((binary_quantize(embedding)::bit({DIM})) bit_hamming_ops)"
    )
    q = np.asarray(model.encode("parallel hardware for AI training", normalize_embeddings=True))
    rerank_sql = f"""
        SELECT content, candidates.embedding <=> %(q)s AS exact_distance
        FROM (
            SELECT content, embedding
            FROM documents
            ORDER BY binary_quantize(embedding)::bit({DIM}) <~> binary_quantize(%(q)s)::bit({DIM})
            LIMIT 8
        ) AS candidates
        ORDER BY exact_distance
        LIMIT 3
    """
    for content, d in conn.execute(rerank_sql, {"q": q}).fetchall():
        print(f" {d:.3f} {content}")

    print("\n[8/10] Native sparse vectors...")
    conn.execute("DROP TABLE IF EXISTS sparse_items")
    conn.execute("CREATE TABLE sparse_items (id bigserial PRIMARY KEY, embedding sparsevec(10))")
    sparse_data = [
        SparseVector({0: 1.0, 3: 2.0, 7: 1.5}, 10),
        SparseVector({1: 0.5, 3: 1.0, 9: 3.0}, 10),
        SparseVector({0: 0.2, 4: 2.5, 7: 0.8}, 10),
    ]
    with conn.cursor() as cur:
        cur.executemany("INSERT INTO sparse_items (embedding) VALUES (%s)", [(v,) for v in sparse_data])
    query_sparse = SparseVector({0: 1.0, 7: 1.0}, 10)
    rows = conn.execute(
        "SELECT id, embedding, embedding <#> %s AS neg_ip "
        "FROM sparse_items ORDER BY neg_ip LIMIT 3",
        (query_spa
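The sparse step ranks rows by negative inner product, and that arithmetic is easy to check by hand. The sketch below is a plain-Python illustration, not the tutorial's code: it represents each sparse vector as an `{index: value}` dict holding the same values as `sparse_data`, and the helper `sparse_neg_inner_product` is a hypothetical name. It also assumes the rows receive ids 1, 2, and 3 in insertion order, which is what a fresh `bigserial` column would normally produce.

```python
def sparse_neg_inner_product(a, b):
    """Negative inner product of two sparse vectors stored as {index: value} dicts.

    Only indices present in both vectors contribute, so the loop runs over
    the smaller dict and looks each index up in the larger one.
    """
    if len(a) > len(b):
        a, b = b, a
    return -sum(value * b[index] for index, value in a.items() if index in b)

# Same non-zero entries as the tutorial's sparse_data, keyed by the assumed row id.
items = {
    1: {0: 1.0, 3: 2.0, 7: 1.5},
    2: {1: 0.5, 3: 1.0, 9: 3.0},
    3: {0: 0.2, 4: 2.5, 7: 0.8},
}
query = {0: 1.0, 7: 1.0}

scores = {item_id: sparse_neg_inner_product(vec, query) for item_id, vec in items.items()}

# Ascending order, matching ORDER BY neg_ip: the most negative score is the best match.
ranking = sorted(scores, key=scores.get)
for item_id in ranking:
    print(item_id, scores[item_id])
```

Item 2 shares no index with the query, so its inner product is zero and it ranks last. Items 1 and 3 both overlap on indices 0 and 7, and item 1 wins because its values there are larger.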

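The two-stage query in step [7/10] first shortlists candidates by Hamming distance on binary-quantized vectors, then re-orders the shortlist by exact cosine distance. The following plain-Python sketch illustrates that idea on hypothetical 4-d vectors. The function names are illustrative, and `binary_quantize` here only mimics the behavior of the pgvector function of the same name (one bit per dimension, set where the component is positive).

```python
import math

def binary_quantize(vec):
    # One bit per dimension: 1 where the component is positive, otherwise 0.
    return [1 if x > 0 else 0 for x in vec]

def hamming_distance(bits_a, bits_b):
    # Number of positions at which the two bit strings differ.
    return sum(a != b for a, b in zip(bits_a, bits_b))

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def hamming_candidates(query, docs, n_candidates):
    # Stage 1: cheap, coarse shortlist using only the sign bits.
    q_bits = binary_quantize(query)
    ordered = sorted(
        docs, key=lambda name: hamming_distance(binary_quantize(docs[name]), q_bits)
    )
    return ordered[:n_candidates]

def rerank_exact(query, docs, candidates, k):
    # Stage 2: exact distance on the full-precision vectors, shortlist only.
    ordered = sorted(candidates, key=lambda name: cosine_distance(docs[name], query))
    return ordered[:k]

# Hypothetical toy vectors; d2 is listed first on purpose.
docs = {
    "d2": [0.1, 0.9, -0.7, 0.2],
    "d1": [0.9, 0.8, -0.1, 0.3],
    "d3": [-0.5, 0.4, 0.6, -0.2],
    "d4": [-0.8, -0.3, 0.9, -0.6],
}
query = [0.8, 0.7, -0.2, 0.1]

candidates = hamming_candidates(query, docs, n_candidates=2)
top = rerank_exact(query, docs, candidates, k=2)
print("shortlist:", candidates)
print("re-ranked:", top)
```

In this toy data d1 and d2 have the same sign pattern as the query, so the Hamming stage cannot tell them apart and keeps them in their stored order. The exact re-rank then moves d1 ahead of d2, which is the reason the shortlist is made larger than the final result size.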