io.github.benseverndev-oss/goldenmatch icon

GoldenMatch

by Benseverndev-oss

io.github.benseverndev-oss/goldenmatch

Find duplicate records in 30 seconds. Zero-config entity resolution, 97.2% F1 out of the box.

Version 2.2.0
Local
View source

GoldenMatch ยท v2.2.0

by Benseverndev-oss

65

๐ŸŸก Golden Suite

A polyglot data-quality and entity-resolution toolkit. Polished, opinionated, AI-native.

GoldenCheck profiles โ†’ GoldenFlow standardizes โ†’ GoldenMatch deduplicates โ†’ GoldenAnalysis reports, all orchestrated by GoldenPipe. With InferMap for schema mapping, a Rust extension layer for Postgres / DuckDB, and optional WebAssembly acceleration behind the edge-safe TypeScript ports.

โšก GoldenMatch scales from a CSV on your laptop to 100M+ rows on a Ray cluster โ€” verified: 100,000,000 records deduped recall-complete (correct across any partitioning) in 9.2 min, with a 0.36 GB driver footprint.


CI
codecov
OpenSSF Scorecard

GoldenMatch web workbench โ€” pair drilldown with NL prose

Pair drilldown in the web workbench: cluster members, field-level diff, and a one-line NL explanation per pair. pip install goldenmatch[web] then goldenmatch serve-ui <project>. More screenshots โ†’

# Headline package: dedupe a CSV in 30 seconds
pip install goldenmatch && goldenmatch dedupe customers.csv

# TypeScript / Edge runtimes
npm install goldenmatch

๐Ÿ†• v2.0.0 โ€” GoldenMatch 2.0.0: the first backwards-incompatible major. It removes four deprecation-window items, each shipped with a 1.x runway: the legacy :hash: identity lookup bridge + GOLDENMATCH_IDENTITY_ID_SCHEME (run goldenmatch identity migrate-ids before upgrading; un-fingerprintable rows keep their :hash: id), the GOLDENMATCH_CLUSTER_FRAMES_OUT gate + legacy dict cluster path (build_clusters stays as a frames-backed adapter), and the cheapest_healthy / _scale_aware_backend shims. Pipeline behavior is output-equivalent. Migration guide: Migrating to v2.

v1.30.0 โ€” Zero-training Fellegi-Sunter now beats hand-rolled, expert-tuned Splink, head-to-head and reproducibly. On one shared evaluator across every dataset Splink scores, GoldenMatch's probabilistic auto-config wins on all of them: historical_50k pairwise F1 0.778 vs 0.757 (cluster-level Bยณ 0.844 vs 0.789), febrl3 0.991 vs 0.965, synthetic_person 0.998 vs 0.996 โ€” made reproducible by an EM training-pair determinism fix (#829). Full bake-off: docs/benchmarks/2026-06-09-splink-bakeoff.md.

v1.26.0 โ€” 100M records, distributed, on a 4-worker Ray cluster โ€” verified. The distributed Phase-5 pipeline (GOLDENMATCH_DISTRIBUTED_PIPELINE=2) now runs a full 100,000,000-row dedupe end to end in ~213 s with the driver process peaking at 0.30 GB RSS. The unlock was removing every driver-side collect from the pipeline (scoring -> per-partition local connected-components -> distributed join -> distributed golden build + write), so nothing funnels back to a single node.


Why a suite?

Each tool stands alone, but they compose into a single pipeline:

flowchart LR
    raw([raw rows])
    golden([golden records])

    subgraph orchestration ["GoldenPipe orchestrates"]
        direction LR
        infermap[InferMap]
        goldencheck[GoldenCheck]
        goldenflow[GoldenFlow]
        goldenmatch[GoldenMatch]
        infermap --> goldencheck --> goldenflow --> goldenmatch
    end

    raw --> infermap
    goldenmatch --> golden
Step Role
InferMap schema mapping โ€” auto-aligns columns across heterogeneous sources
GoldenCheck profile + validate โ€” encoding, format, anomaly detection
GoldenFlow standardize + transform โ€” phone, date, address, categorical normalization
GoldenMatch dedupe + cluster + survivorship โ€” fuzzy / exact / probabilistic / LLM
GoldenAnalysis analysis + reporting โ€” one exportable report over any stage's output, plus cross-run regression detection
GoldenPipe orchestrator โ€” declarative YAML pipeline wiring the steps
  • Zero-config defaults that admit when they're unsure โ€” every step has a self-verifying preflight + postflight; results carry an inspectable report instead of failing silently.
  • 96.4% F1 on DBLP-ACM out of the box for entity resolution โ€” and the opt-in Fellegi-Sunter engine beats hand-rolled, expert-tuned Splink head-to-head on every dataset Splink scores (historical_50k pairwise F1 0.778 vs 0.757, cluster-level Bยณ 0.844 vs 0.789; one shared evaluator, reproducible bake-off).
  • Learning Memory โ€” corrections persist across runs and re-anchor across row reorders, so the system stops needing the same correction twice (GoldenMatch v1.6.0; off by default).
  • Identity Graph โ€” a durable graph layer above run-local clusters: stable entity_ids that survive across runs, an append-only event log, and create / absorb / merge / split semantics, surfaced on the CLI, REST, MCP, and SQL interfaces (the Identity Graph v2 feature, shipped in GoldenMatch v1.15).
  • Privacy-preserving record linkage โ€” match across organizations without sharing raw data (PPRL, 92.4% F1 on FEBRL4).
  • AI-native by design โ€” every package ships an MCP server, a REST API, and an A2A agent surface. 50+ MCP tools across the suite, including auto_configure + controller_telemetry for v1.7-v1.12 introspection.
  • AutoConfigController visible everywhere (v1.7-v1.12 surface-parity arc) โ€” web ControllerPanel, TUI Ctrl+A, CLI goldenmatch autoconfig, REST /autoconfig + /controller/telemetry, Postgres goldenmatch_autoconfig + gm_telemetry, DuckDB UDFs, MCP/A2A telemetry tools. One JSON shape across every interface.
  • Polyglot parity โ€” the full suite ships on npm (goldenmatch, goldencheck, goldenflow, goldenanalysis, infermap, goldenpipe) alongside PyPI; the TypeScript and Python implementations track the same outputs to 4-decimal precision via a cross-language parity harness.
  • Edge-safe, with optional native speed โ€” the TypeScript cores are dependency-free and node:*-free, so they run in browsers, Cloudflare Workers, Vercel Edge, and Deno. An opt-in WebAssembly backend (await enableWasm() / enableAnalysisWasm()) swaps in the same pyo3-free Rust kernels the Python wheels and the SQL UDFs use โ€” pure-TS stays the default and the byte-identical fallback, so default users download zero wasm bytes.
  • SQL-native, both engines at parity โ€” the same functions run inside PostgreSQL (pgrx extension) and DuckDB: dedupe / match / score / auto-config + telemetry / identity graph, plus data profiling, evaluate, Fellegi-Sunter probabilistic scoring, and GoldenFlow transforms.
  • Production paths โ€” Postgres sync, daemon mode, lineage tracking, review queues, dbt integration, GitHub Actions, and a Rust extension layer for Postgres / DuckDB.

The Suite

Package Lang What it does Install
GoldenMatch ๐ŸŸก Python ยท TS Zero-config entity resolution. Fuzzy + exact + probabilistic + LLM. Headline package. pip install goldenmatch ยท npm i goldenmatch
GoldenCheck Python ยท TS Data-quality scanning: encoding, Unicode, format validation, anomaly detection. pip install goldencheck ยท npm i goldencheck
GoldenFlow Python ยท TS Transforms & standardizers: phone, date, address, categorical normalization. pip install goldenflow ยท npm i goldenflow
GoldenPipe Python ยท TS Orchestrator that wires Check โ†’ Flow โ†’ Match into one declarative pipeline. pip install goldenpipe ยท npm i goldenpipe
InferMap Python ยท TS Schema mapping engine โ€” auto-aligns columns across heterogeneous sources. pip install infermap ยท npm i infermap
GoldenAnalysis Python ยท TS Cross-cutting analysis & reporting โ€” consumes any stage's typed artifacts (or a raw DataFrame) and emits a unified, exportable AnalysisReport; optional Rust / WASM histogram+quantile kernels. pip install goldenanalysis ยท npm i goldenanalysis
goldenmatch-extensions Rust Postgres extension (pgrx) + DuckDB UDFs. SQL-native fuzzy matching. source build
dbt-goldensuite dbt ยท Python dbt package โ€” dedupe + two-table match materializations (incl. zero-config Fellegi-Sunter), an ER match-quality build gate, quality-gate tests, transforms, and identity-graph reads for warehouse models. packages.yml (git subdir)
goldencheck-action YAML GitHub Action โ€” fail PRs that introduce data-quality regressions. Marketplace

Headline pitch and the deepest docs live in packages/python/goldenmatch/README.md (~1,300 lines, full feature list, CLI, architecture, benchmarks).


Choose your path

I want to... Go here
Deduplicate a CSV right now packages/python/goldenmatch
Use from Claude Desktop / Code packages/python/goldenmatch โ€” MCP
Edit rules in a browser, label pairs, compare runs packages/python/goldenmatch โ€” Web UI
Build AI agents that deduplicate ER Agent / A2A wiki page
Profile data quality before matching packages/python/goldencheck
Standardize messy fields (phone, date, address) packages/python/goldenflow
Run the full pipeline declaratively packages/python/goldenpipe
Map columns across schemas packages/python/infermap
Analyze + report across stages and runs packages/python/goldenanalysis
Write TypeScript / Node.js / Edge (browser, Workers; optional WASM) packages/typescript/goldenmatch
Match in Postgres / DuckDB SQL packages/rust/extensions
Add data-quality gates to dbt packages/dbt/goldensuite
Block bad data in GitHub PRs packages/actions/goldencheck
Run as Airflow DAGs examples/airflow/ โ€” 12 drop-in DAGs
Run from a single MCP container docker run ghcr.io/benseverndev-oss/goldensuite-mcp:latest
Pull every Suite container GitHub Packages

Quick examples

Python โ€” dedupe in 30 seconds

import goldenmatch as gm

# Zero-config
result = gm.dedupe("customers.csv")
print(result)  # DedupeResult(records=5000, clusters=847, match_rate=12.0%)
result.golden.write_csv("deduped.csv")

# Or be explicit
result = gm.dedupe("customers.csv",
    exact=["email"],
    fuzzy={"name": 0.85, "zip": 0.95},
    blocking=["zip"],
    threshold=0.85)

TypeScript โ€” edge-safe core

import { dedupe } from "goldenmatch";

const result = dedupe(rows, {
  fuzzy: { name: 0.85 },
  blocking: ["zip"],
  threshold: 0.85,
});
console.log(result.stats);  // { totalRecords, totalClusters, matchRate, ... }

Runs in browsers, Vercel Edge, Cloudflare Workers, Deno โ€” and optionally swaps in the Rust score-core kernel via await enableWasm(). ~940 tests, strict TypeScript (noUncheckedIndexedAccess, exactOptionalPropertyTypes).

Web workbench โ€” browser UI for matching

pip install 'goldenmatch[web]'
goldenmatch serve-ui my-project   # opens http://localhost:5050

GoldenMatch web UI

Edit rules with live validation, preview against a sampled slice, label pairs
(mirrored into Learning Memory automatically), compare runs (CCMS), sweep
parameters, browse the corrections store. Single-process localhost workbench
shipped as the optional [web] extra.

Composed pipeline

import goldenpipe as gp

pipeline = gp.Pipeline.from_yaml("pipeline.yaml")  # check โ†’ flow โ†’ match
result = pipeline.run("customers.csv")
result.report.write_html("report.html")

More: examples/ has runnable demos for every Suite scenario:
Python (quickstart, full pipeline, customer 360, PPRL, review workflow, MCP client) ยท
TypeScript (quickstart, Vercel Edge route, MCP client) ยท
Airflow DAGs (12 production-shaped pipelines).


Use cases (real-world pipelines)

Reproducible end-to-end pipelines running GoldenMatch on public data at scale, each with measured headline numbers vs baselines:

  • ๐Ÿ•ต๏ธ goldenmatch-shell-company-network โ€” investigative ER across ICIJ Offshore Leaks + OpenSanctions + GLEIF + UK PSC + UK disqualified-directors. Confidence-weighted graph, structure mining, named investigative candidates. โˆ’62.5% analyst-hours to triage vs single-source baselines; +133% adversarial perturbation recovery.
  • ๐Ÿ›ก๏ธ goldenmatch-vuln-attribution โ€” cross-database ER on 6.1M OSS vulnerability records across 40 sources (OSV, GHSA, PyPA, RustSec, Go vulndb, EPSS, CISA KEV, CVE Project bulk). 6,126,895 records โ†’ 847,475 canonical vulns in ~5 minutes end-to-end on a single 64GB runner via the full Golden Suite (Check + Flow + Match + Pipe).
  • โš–๏ธ goldenmatch-sanctions-reconciliation โ€” cross-list coverage analysis on 85 public sanctions lists across 50+ jurisdictions via OpenSanctions, plus 10-year OFAC SDN history and PEP/crypto cross-analysis. Coverage-gap benchmark for any sanctions-screening vendor.

Install variants

GoldenMatch ships fat optional extras so you only pay for what you use:

pip install goldenmatch                    # core (CSV in, CSV out) + native acceleration on common platforms
pip install goldenmatch[native]            # back-compat alias; native is already default on common platforms
pip install goldenmatch[embeddings]        # + sentence-transformers, FAISS
pip install goldenmatch[llm]               # + Claude / OpenAI for LLM boost
pip install goldenmatch[postgres]          # + Postgres sync
pip install goldenmatch[snowflake]         # + Snowflake connector
pip install goldenmatch[bigquery]          # + BigQuery connector
pip install goldenmatch[databricks]        # + Databricks connector
pip install goldenmatch[salesforce]        # + Salesforce connector
pip install goldenmatch[duckdb]            # + DuckDB out-of-core backend
pip install goldenmatch[ray]               # + Ray distributed backend (50M+ rows)
pip install goldenmatch[quality]           # + GoldenCheck integration
pip install goldenmatch[transform]         # + GoldenFlow integration
pip install goldenmatch[mcp]               # + MCP server for Claude Desktop
pip install goldenmatch[agent]             # + A2A agent (aiohttp)
pip install goldenmatch[web]               # + localhost browser workbench (FastAPI + React)

goldenmatch setup    # interactive wizard: GPU, API keys, database

Sister packages compose: pip install goldenpipe[full] brings in Check + Flow + Match together.


Remote MCP Server

GoldenMatch is hosted as an MCP server on โ€” connect from any MCP client without installing anything.

{
  "mcpServers": {
    "goldenmatch": {
      "url": "https://goldenmatch-mcp-production.up.railway.app/mcp/"
    }
  }
}

50+ MCP tools across the suite: deduplicate, match, explain, review, link privately, configure, scan quality, transform, synthesize golden records, and manage Learning Memory corrections.


Container images

Every Suite package ships as a multi-arch container image (linux/amd64 + linux/arm64) on GitHub Container Registry. Pull anonymously, no auth needed:

# One container, every Suite tool โ€” the convenience option
docker run -p 8300:8300 ghcr.io/benseverndev-oss/goldensuite-mcp:latest

# Per-package containers โ€” narrower deployments
docker run -p 8200:8200 ghcr.io/benseverndev-oss/goldenmatch-mcp:latest
docker run -p 8100:8100 ghcr.io/benseverndev-oss/goldencheck-mcp:latest
docker run -p 8150:8150 ghcr.io/benseverndev-oss/goldenflow-mcp:latest
docker run -p 8250:8250 ghcr.io/benseverndev-oss/goldenpipe-mcp:latest
docker run -p 8400:8400 ghcr.io/benseverndev-oss/infermap-mcp:latest

# Postgres + extension preinstalled
docker run -e POSTGRES_PASSWORD=secret ghcr.io/benseverndev-oss/goldenmatch-extensions:latest

Tags:

  • :latest โ€” current main
  • :main-<sha7> โ€” every push to main, immutable
  • :vX.Y.Z and :vX.Y โ€” pushed when a <package>-vX.Y.Z tag is created

See packages/python/goldensuite-mcp/README.md for the aggregator's tool-collision behaviour.


Airflow

12 drop-in DAGs at examples/airflow/, grouped by lifecycle stage:

Group DAGs
Core pipeline daily_dedupe, incremental_match, warehouse_native (Snowflake), customer_360 (multi-source)
Privacy pprl_linkage (two-party PPRL)
Onboarding & monitoring schema_align_and_load, schema_drift_alarm, quality_gate
Feedback loop review_worker, active_learning
Operationalize reverse_etl (Salesforce/HubSpot), backfill

TaskFlow API, Airflow 2.7+ (compatible with 3.x). Each DAG has tunable knobs at the top, idempotent retries, and is marker-protected against double-processing. Drop the file you want into your Airflow dags/ folder.


Repository layout

goldenmatch/
โ”œโ”€โ”€ packages/
โ”‚   โ”œโ”€โ”€ python/
โ”‚   โ”‚   โ”œโ”€โ”€ goldenmatch/      # entity resolution โ€” headline package
โ”‚   โ”‚   โ”œโ”€โ”€ goldencheck/      # data quality scanning
โ”‚   โ”‚   โ”œโ”€โ”€ goldenflow/       # transforms & standardizers
โ”‚   โ”‚   โ”œโ”€โ”€ goldenpipe/       # orchestrator
โ”‚   โ”‚   โ”œโ”€โ”€ infermap/         # schema mapping
โ”‚   โ”‚   โ””โ”€โ”€ goldenanalysis/   # cross-cutting analysis & reporting
โ”‚   โ”œโ”€โ”€ typescript/
โ”‚   โ”‚   โ”œโ”€โ”€ goldenmatch/      # full TS port (edge-safe core)
โ”‚   โ”‚   โ”œโ”€โ”€ goldencheck/      # TS implementation
โ”‚   โ”‚   โ”œโ”€โ”€ goldencheck-types/ # shared TS types
โ”‚   โ”‚   โ”œโ”€โ”€ goldenflow/       # TS transforms
โ”‚   โ”‚   โ”œโ”€โ”€ infermap/         # TS schema mapping
โ”‚   โ”‚   โ””โ”€โ”€ goldenanalysis/   # TS analysis & reporting (edge-safe + WASM)
โ”‚   โ”œโ”€โ”€ rust/
โ”‚   โ”‚   โ””โ”€โ”€ extensions/       # Postgres pgrx + DuckDB UDFs (own Cargo workspace)
โ”‚   โ”œโ”€โ”€ python/goldensuite-mcp/ # aggregator MCP server (one container, all tools)
โ”‚   โ”œโ”€โ”€ dbt/goldensuite/      # dbt package (materializations, tests, macros)
โ”‚   โ””โ”€โ”€ actions/goldencheck/  # GitHub Action
โ”œโ”€โ”€ examples/
โ”‚   โ”œโ”€โ”€ python/               # 6 runnable Python scripts (quickstart โ†’ MCP)
โ”‚   โ”œโ”€โ”€ typescript/           # 3 TS scripts (quickstart, Vercel Edge, MCP)
โ”‚   โ””โ”€โ”€ airflow/              # 12 drop-in Airflow DAGs
โ”œโ”€โ”€ docs/superpowers/         # design specs and implementation plans
โ”œโ”€โ”€ justfile                  # install / test / lint / build, all languages
โ”œโ”€โ”€ pyproject.toml            # uv workspace (root)
โ”œโ”€โ”€ pnpm-workspace.yaml       # TypeScript pnpm workspace (Turborepo)
โ”œโ”€โ”€ package.json              # root scripts + pnpm workspace root
โ””โ”€โ”€ .github/workflows/ci.yml

Workspaces (Cargo vs pnpm)

  • Cargo โ€” no root workspace. packages/rust/extensions/ is itself a Cargo workspace (the postgres crate is excluded for pgrx-specific build requirements). Cargo doesn't allow nested workspaces sharing members, so Cargo commands run from inside packages/rust/extensions/.
  • TypeScript โ€” a single pnpm workspace. packages/typescript/* form one pnpm + Turborepo workspace (see TypeScript dev setup). .npmrc pins node-linker=hoisted, giving a flat node_modules that avoids the Windows symlink issues an earlier per-package layout hit.

Build / test / lint everything

just install   # uv sync + per-package npm install + cargo fetch
just test      # all languages
just lint
just build

Reproducing benchmarks

Published GoldenMatch numbers (DQbench composite 91.04, DBLP-ACM 0.9641 F1, Febrl3 0.9443 F1, NCVR 0.9719 F1) map back to a single committed runner: scripts/run_benchmarks.py. See docs/reproducing-benchmarks.md for per-number commands, dataset URLs, expected output (with tolerance), variance notes (deterministic vs LLM-augmented), and a copy-pasteable one-click reproduction snippet for the DQbench composite. The same runner powers the weekly benchmarks.yml workflow.

Scale envelope

"How big can this handle?" is answered in docs/scale-envelope.md: per-backend ranges (Polars in-memory < 500K, DuckDB out-of-core 500K - 50M, Ray distributed >= 50M), block-size failure modes, candidate-pair math, and a single-page decision tree for picking a backend.

Verified at the top end: a full 100,000,000-row GoldenMatch dedupe on a 5-node Ray cluster (e2-standard-16, 80 CPU) in 9.2 min (554 s), 20,000,000 golden records recovered exactly, driver process peak 0.36 GB RSS โ€” the default distributed path is now recall-complete (blocking-key shuffle scoring + a distributed randomized-contraction WCC), so duplicates merge correctly no matter how the input is partitioned, and it stays driver-collect-free end to end (#844). A faster per-partition path is available via GOLDENMATCH_DISTRIBUTED_BLOCK_SHUFFLE=0 (driver-collect-free, ~213 s on a 4-worker run) for inputs where duplicates already co-locate within partitions โ€” but it under-merges when a cluster's members land in different input partitions, which is why recall-complete is the default. Recipe in packages/python/goldenmatch/configs/distributed-100m.yaml.


Contributing

  • Feature work goes on feature/<name> branches; merge via squash PR.
  • PR title format: feat: <description>, fix: <description>, docs: <description>.
  • Tests must pass on all three languages where the change applies; the parity harness in packages/typescript/goldenmatch/tests/parity/ enforces 4-decimal-tolerance Python โ†” TypeScript scorer parity.
  • See docs/superpowers/specs/ for design rationale on architectural decisions.

TypeScript dev setup (pnpm + Turborepo)

The TypeScript packages live in a single pnpm workspace orchestrated by Turborepo. From the repo root:

corepack enable                               # one-time, picks up pnpm@9.15.0 from package.json
pnpm install                                  # installs all workspace packages
pnpm turbo run build test typecheck lint      # full pipeline (cached after first run)
pnpm --filter goldenmatch test                # single package

Windows: enable Developer Mode for pnpm. pnpm install creates symlinks under node_modules/. Settings โ†’ For Developers โ†’ Developer Mode โ†’ On. If you see EPERM: operation not permitted, symlink ... during install, Dev Mode is off.

If corepack enable fails (often needs an admin shell on Windows), the fallback is npm i -g pnpm@9.15.0 โ€” functionally equivalent.


History

This repository was formed on 2026-05-01 by folding 8 sibling repos into the existing goldenmatch repo using git filter-repo. Full commit history is preserved for every source. See docs/superpowers/specs/2026-05-01-goldenmatch-monorepo-fold-in-design.md for the design rationale and docs/superpowers/plans/2026-05-01-goldenmatch-monorepo-fold-in.md for the step-by-step migration plan.


Author & License

Built by Ben Severn.

MIT โ€” see LICENSE.