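Building a Semantic Search Engine and Open-Status Classifier on the ResearchMath-14k Dataset
Key Takeaways
This tutorial uses the amphora/ResearchMath-14k dataset (research-level mathematics problems collected from arXiv): it loads the data, inspects its structure, and analyzes how problems are distributed across mathematical fields and open-status categories. It then extracts field-specific keywords, generates semantic embeddings, visualizes the problem distribution, clusters related problems, and builds a simple search engine. Finally, it trains a classifier to predict problem status from the embeddings and detects closely related or near-duplicate problems.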
In this tutorial, we work with the amphora/ResearchMath-14k dataset, a collection of research-level mathematics problems mined from arXiv. We load the dataset, inspect its structure, and explore how the problems are distributed across mathematical fields and open-status categories. We then move beyond basic analysis by extracting field-specific keywords, generating semantic embeddings, visualizing the problem landscape, clustering related problems, and building a simple search engine over the dataset. We also train a classifier to predict problem status from embeddings and detect closely related or near-duplicate problems.

    !pip -q install -U datasets sentence-transformers scikit-learn umap-learn \
        pandas matplotlib seaborn wordcloud 2>/dev/null

    import warnings, numpy as np, pandas as pd
    warnings.filterwarnings("ignore")
    import matplotlib.pyplot as plt
    import seaborn as sns
    sns.set_theme(style="whitegrid", palette="deep")

    SAMPLE_SIZE = 4000
    RANDOM_STATE = 42
    EMB_MODEL = "sentence-transformers/all-MiniLM-L6-v2"

We begin by installing the required libraries and importing the tools needed for analysis, visualization, embeddings, and data handling. We also set the main configuration values: the sample size, the random seed, and the embedding model. This gives us a clean setup before we start working with the ResearchMath dataset.

    from datasets import load_dataset

    ds = load_dataset("amphora/ResearchMath-14k", split="test")
    df = ds.to_pandas()
    print("Rows:", len(df))
    print("Columns:", list(df.columns))
    print(df.head(3))  # printed explicitly so the preview shows even mid-cell

    TEXT_COL = "self_contained_problem"
    # Keep only statements longer than 20 characters.
    df = df[df[TEXT_COL].astype(str).str.len() > 20].reset_index(drop=True)

We load the amphora/ResearchMath-14k dataset from Hugging Face and convert it into a pandas DataFrame. We inspect the number of rows, available columns, and a few sample records to understand the dataset structure.
We then keep only problem statements of meaningful length so that subsequent analysis works on useful text.

    print("\n--- open_status distribution ---")
    print(df["open_status"].value_counts(dropna=False))
    print("\n--- taxonomy_level_1 (math fields) ---")
    print(df["taxonomy_level_1"].value_counts())

    fig, axes = plt.subplots(1, 3, figsize=(20, 6))
    df["open_status"].value_counts().plot(
        kind="bar", ax=axes[0], color="steelblue")
    axes[0].set_title("Problem status"); axes[0].tick_params(axis="x", rotation=30)
    df["taxonomy_level_1"].value_counts().plot(
        kind="barh", ax=axes[1], color="seagreen")
    axes[1].set_title("Top-level math field"); axes[1].invert_yaxis()
    df["doc_len"] = df[TEXT_COL].str.split().apply(len)
    axes[2].hist(df["doc_len"].clip(upper=400), bins=40, color="indianred")
    axes[2].set_title("Problem length (words, clipped @400)")
    plt.tight_layout(); plt.show()

    ct = pd.crosstab(df["taxonomy_level_1"], df["open_status"], normalize="index")
    plt.figure(figsize=(10, 6))
    sns.heatmap(ct, annot=True, fmt=".2f", cmap="rocket_r")
    plt.title("Fraction of each status within each field")
    plt.tight_layout(); plt.show()

We explore the dataset by checking how problems are distributed across open-status labels and mathematical fields. We visualize the status counts, field counts, and problem lengths to quickly get an overview of the corpus. We also create a heatmap to see how open-status categories vary across different math fields.
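To make the heatmap easier to read, it may help to see what `normalize="index"` does to the cross-tabulation: each row (field) is divided by its own total, so every row sums to 1. The sketch below reproduces that arithmetic in plain Python on a made-up toy sample; the field and status names are illustrative placeholders, not values taken from the dataset, and `row_normalized_crosstab` is a hypothetical helper.

```python
from collections import Counter, defaultdict

def row_normalized_crosstab(pairs):
    """Return {field: {status: fraction}} where each field's fractions sum to 1."""
    counts = defaultdict(Counter)
    for field, status in pairs:
        counts[field][status] += 1
    table = {}
    for field, status_counts in counts.items():
        total = sum(status_counts.values())  # row total for this field
        table[field] = {s: c / total for s, c in status_counts.items()}
    return table

# Toy (field, status) pairs; the labels are invented for illustration.
toy = [
    ("Algebra", "open"),
    ("Algebra", "open"),
    ("Algebra", "resolved"),
    ("Analysis", "resolved"),
]
table = row_normalized_crosstab(toy)
for field, fractions in table.items():
    print(field, {s: round(f, 2) for s, f in fractions.items()})
```

Because the normalization is per row, a field with few problems can show extreme fractions; the raw counts from `value_counts()` are worth checking alongside the heatmap.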
    from sklearn.feature_extraction.text import TfidfVectorizer

    def top_terms_per_group(frame, group_col, text_col, k=8):
        out = {}
        for g, sub in frame.groupby(group_col):
            if len(sub) < 20:  # skip groups too small for stable statistics
                continue
            vec = TfidfVectorizer(max_features=3000, stop_words="english",
                                  ngram_range=(1, 2), min_df=3)
            X = vec.fit_transform(sub[text_col])
            scores = np.asarray(X.mean(axis=0)).ravel()  # mean TF-IDF per term
            terms = np.array(vec.get_feature_names_out())
            out[g] = terms[scores.argsort()[::-1][:k]].tolist()
        return out

    for field, terms in top_terms_per_group(df, "taxonomy_level_1", TEXT_COL).items():
        print(f"\n{field:35s} -> {', '.join(terms)}")

We use TF-IDF to find the most important terms within each top-level mathematical field. We group the dataset by field, fit a separate vectorizer on every group with at least 20 problems, and keep the eight unigrams or bigrams with the highest mean TF-IDF score. Because each vectorizer sees only its own group, the scores reflect what is prominent within a field rather than what distinguishes it from other fields. This helps us understand what topics and terminology dominate different areas of research in mathematics.

We sample the dataset and convert each mathematical problem into a semantic embedding using a SentenceTransformer model. We reduce the embeddings into two dimensions using UMAP, or PCA if UMAP is unavailable, and visualize the problem landscape by field. We then apply K-Means clustering and compare the resulting clusters with the human-labeled taxonomy using ARI and NMI.
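The sampling, embedding, projection, and clustering steps just described could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the tutorial's own cell: the helpers `sample_indices` and `embed_and_cluster`, the batch size, and the choice of one K-Means cluster per distinct field are all assumptions. It returns `work`, `model`, and `emb`, the names that the search and classifier code further down relies on.

```python
import random

SAMPLE_SIZE = 4000   # same values as the setup cell
RANDOM_STATE = 42
EMB_MODEL = "sentence-transformers/all-MiniLM-L6-v2"

def sample_indices(n_rows, sample_size=SAMPLE_SIZE, seed=RANDOM_STATE):
    """Pick row positions to embed; use every row if sample_size is None."""
    if sample_size is None or sample_size >= n_rows:
        return list(range(n_rows))
    return sorted(random.Random(seed).sample(range(n_rows), sample_size))

def embed_and_cluster(df, text_col="self_contained_problem",
                      label_col="taxonomy_level_1"):
    """Hypothetical helper: embed a sample, project to 2-D, cluster, and score."""
    # Heavy imports stay local so the sampling helper works without them.
    from sentence_transformers import SentenceTransformer
    from sklearn.cluster import KMeans
    from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

    work = df.iloc[sample_indices(len(df))].reset_index(drop=True)
    labels = work[label_col].fillna("unknown").astype(str)

    model = SentenceTransformer(EMB_MODEL)
    emb = model.encode(work[text_col].astype(str).tolist(), batch_size=64,
                       show_progress_bar=True, normalize_embeddings=True,
                       convert_to_numpy=True)

    # 2-D projection for plotting: UMAP if installed, otherwise PCA.
    try:
        import umap
        xy = umap.UMAP(n_components=2,
                       random_state=RANDOM_STATE).fit_transform(emb)
    except ImportError:
        from sklearn.decomposition import PCA
        xy = PCA(n_components=2, random_state=RANDOM_STATE).fit_transform(emb)

    # Assumption: one cluster per distinct field label.
    km = KMeans(n_clusters=labels.nunique(), n_init=10,
                random_state=RANDOM_STATE)
    clusters = km.fit_predict(emb)

    ari = adjusted_rand_score(labels, clusters)
    nmi = normalized_mutual_info_score(labels, clusters)
    return work, model, emb, xy, clusters, ari, nmi
```

Calling `work, model, emb, xy, clusters, ari, nmi = embed_and_cluster(df)` once would make those names available to later cells; `xy` can be passed to a scatter plot colored by field. ARI and NMI near 0 would suggest the clusters carry little information about the taxonomy, while values near 1 would suggest close agreement.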
    from sentence_transformers import util

    def search(query, k=5):
        q = model.encode([query], normalize_embeddings=True)
        sims = util.cos_sim(q, emb)[0].cpu().numpy()
        idx = sims.argsort()[::-1][:k]
        print(f'\n=== Query: "{query}" ===')
        for rank, i in enumerate(idx, 1):
            row = work.iloc[i]
            print(f"\n[{rank}] sim={sims[i]:.3f} | {row['taxonomy_level_1']} "
                  f"| status={row['open_status']}")
            print(" ", row[TEXT_COL][:260].replace("\n", " "), "...")

    search("rational points on hyperelliptic curves")
    search("multiplicativity of maximal output p-norm of a quantum channel")

    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import classification_report, ConfusionMatrixDisplay

    y = work["open_status"].values
    Xtr, Xte, ytr, yte = train_test_split(
        emb, y, test_size=0.25, random_state=RANDOM_STATE, stratify=y)
    clf = LogisticRegression(max_iter=2000, class_weight="balanced", C=2.0)
    clf.fit(Xtr, ytr)
    pred = clf.predict(Xte)
    print("\n=== open_status classifier (embeddings + logistic regression) ===")
    print(classification_report(yte, pred))

    fig, ax = plt.subplots(figsize=(7, 6))
    ConfusionMatrixDisplay.from_predictions(
        yte, pred, ax=ax, cmap="Blues", xticks_rotation=45,
        normalize="true", values_format=".2f")
    ax.set_title("open_status confusion matrix (row-normalized)")
    plt.tight_layout(); plt.show()

    sims = util.cos_sim(emb, emb).cpu().numpy()
    np.fill_diagonal(sims, 0)  # ignore each problem's similarity to itself
    i, j = np.unravel_index(sims.argmax(), sims.shape)
    print(f"\nMost similar pair (cos={sims[i, j]:.3f}):")
    for n in (i, j):
        print(f"\n paper_id={work.iloc[n]['paper_id']} | "
              f"{work.iloc[n]['taxonomy_level_1']}")
        print(" ", work.iloc[n][TEXT_COL][:240].replace("\n", " "), "...")

    print("\nDone. Set SAMPLE_SIZE=None at the top to run on the full 14.1k rows.")

We build a semantic search function that retrieves the most similar research problems for a given query.
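The ranking step inside `search` can be illustrated without a model. The sketch below uses hand-written three-dimensional vectors in place of real embeddings (an assumption made purely for illustration) and shows that, once vectors are scaled to unit length, cosine similarity is just a dot product followed by a descending sort. The helpers `normalize` and `top_k` are hypothetical names, not part of sentence-transformers.

```python
import math

def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def top_k(query_vec, corpus_vecs, k=2):
    """Return (corpus_index, cosine_similarity) pairs, best match first."""
    q = normalize(query_vec)
    scored = []
    for i, vec in enumerate(corpus_vecs):
        d = normalize(vec)
        # Dot product of unit vectors equals their cosine similarity.
        scored.append((sum(a * b for a, b in zip(q, d)), i))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(i, round(score, 3)) for score, i in scored[:k]]

# Toy "embeddings": rows 0 and 1 point in similar directions, row 2 does not.
corpus = [
    [1.0, 0.0, 0.0],
    [0.8, 0.6, 0.0],
    [0.0, 0.0, 1.0],
]
hits = top_k([1.0, 0.1, 0.0], corpus, k=2)
print(hits)
```

In the tutorial's `search`, `model.encode(..., normalize_embeddings=True)` takes care of scaling the query, and `util.cos_sim` computes the similarities against every row of `emb` at once.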
We then train a classifier on the embeddings to predict each problem’s open-status label. Finally, we compute similarity across all embedded problems to detect the closest pair and identify near-duplicate or strongly related problem statements.

In conclusion, we have a complete workflow for analyzing research-level mathematical problems using modern NLP and machine learning tools. We started with dataset exploration, then used TF-IDF, sentence embeddings, dimensionality reduction, clustering, semantic search, and classification to understand the corpus’s structure from multiple angles. It gives us a practical way to study how mathematical problems are grouped, how similar problem