How This Compares

Four Layers of Epstein Investigation Projects

There are 8+ open-source projects analyzing the Epstein documents. They fall into four layers; every other project stops at layer 1, 2, or 3. Only this project does layer 4.

1. Archive — collect & host raw documents
2. Search — full-text search + entity tagging
3. Extract → Visualize — NLP/LLM builds graphs
4. Model → Analyze gaps — unify-graph's typed schema, where constraint violations = investigative leads

The Standard Pipeline (Layers 1-3)

Every other project follows a variation of this pipeline:

Documents → OCR → NLP/LLM → Entities → Graph → "Look at all these connections"

They differ in scale (47 entities to 86,000), extraction method (regex to Claude AI), and visualization (2D force graphs to 3D clouds and Sankey diagrams). But they all answer the same question: "What do we know?"

The Inverted Pipeline (Layer 4)

This project inverts the question:

Entities → Typed Schema → Constraint Violations → "Here's what we DON'T know"

132 hand-curated entities are modeled in CUE, a configuration language designed for infrastructure-as-code. The CUE compiler enforces that every reference resolves, every type is consistent, and every constraint is satisfied. When something doesn't satisfy the schema, that's not a bug — it's a lead.
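As a sketch of the idea, assume the validated model is exported to JSON. The field names and entity IDs below are illustrative assumptions, not the project's actual schema; the point is only that a failed check is reported as a question, not an error:

```python
# Hypothetical slice of an exported entity graph (illustrative names only).
graph = {
    "entities": {
        "jeffrey_epstein": {"type": "Principal", "evidence": ["doc_0412"]},
        "shell_corp_a": {"type": "FinancialEnabler", "evidence": []},
    },
    "relationships": [
        {"from": "jeffrey_epstein", "to": "shell_corp_a"},
        {"from": "shell_corp_a", "to": "unknown_banker"},  # dangling target
    ],
}

def find_leads(graph):
    """Report schema violations as investigative leads rather than errors."""
    leads = []
    known = graph["entities"]
    # Dangling reference: a relationship endpoint with no modeled entity.
    for rel in graph["relationships"]:
        for end in (rel["from"], rel["to"]):
            if end not in known:
                leads.append(f"dangling reference: {end} is not yet modeled")
    # Empty evidence map: a modeled entity with zero document citations.
    for name, ent in known.items():
        if not ent["evidence"]:
            leads.append(f"empty evidence map: {name} has no citations")
    return leads

for lead in find_leads(graph):
    print(lead)
```

In the real project these checks live in the CUE schema and fail at compile time; the sketch just shows the violation-to-lead translation.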

Dangling Reference
Entity mentions a person not yet modeled.
→ Who are they? Why are they connected?
Empty Evidence Map
Entity exists in the graph but has zero document citations.
→ Most entities were added from structural relationships before evidence was linked. The gap is the work queue.
Type Inconsistency
Tagged as FinancialEnabler but no financial flows documented.
→ Where's the paper trail?
Sole Connector
One person is the only bridge between two cluster worlds.
→ 44 cluster pairs depend on a single entity. Discredit them and the connection vanishes.
Exposure Cascade
Pre-computed BFS: if X cooperates, who's within 2 degrees?
→ High reach is expected in a hub-and-spoke network — what matters is which entities are NOT reachable.
Cluster Isolation
Entity is assigned to a cluster but has zero connections to peers.
→ Misclassified, or genuinely isolated?
Community vs Cluster Mismatch
NetworkX Louvain community detection disagrees with the manual cluster label.
→ 73/132 entities structurally belong to a different group than their label says. Either the label is wrong or the entity operates across domains.
Entity Reconciliation
114/132 entities linked to Wikidata QIDs. CUE validates the mapping at compile time.
→ QIDs enable federated queries: SPARQL can pull in all linked identifiers (OpenCorporates, LittleSis, court records) from a single ID.
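Two of these checks, the exposure cascade and the sole-connector scan, can be sketched with NetworkX on a toy graph. The entity names and cluster labels below are invented for illustration:

```python
from collections import defaultdict

import networkx as nx

# Toy hub-and-spoke graph with manual cluster labels (invented data).
clusters = {
    "epstein": "core", "maxwell": "core",
    "banker": "finance", "fund": "finance",
    "fixer": "politics", "senator": "politics",
}
G = nx.Graph([
    ("epstein", "maxwell"), ("epstein", "banker"), ("banker", "fund"),
    ("epstein", "fixer"), ("fixer", "senator"),
])

# Exposure cascade: if this entity cooperates, who is within 2 hops?
reach = nx.single_source_shortest_path_length(G, "epstein", cutoff=2)
print(sorted(n for n in reach if n != "epstein"))
# → ['banker', 'fixer', 'fund', 'maxwell', 'senator']

# Sole connectors: cluster pairs joined by exactly one cross-cluster edge.
pair_edges = defaultdict(list)
for u, v in G.edges():
    cu, cv = clusters[u], clusters[v]
    if cu != cv:
        pair_edges[tuple(sorted((cu, cv)))].append((u, v))
for pair, edges in pair_edges.items():
    if len(edges) == 1:
        print(f"{pair[0]}-{pair[1]} depends on a single edge: {edges[0]}")
```

The project's pre-computed BFS and 44 fragile cluster pairs come from running this kind of scan over the full 132-entity model.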

Detailed Comparison

| | DugganUSA | phelix001 | maxandrews | SvetimFM | OWL-DOJ | epstein-archive | FULL_EPSTEIN_INDEX | unify-graph |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Layer | Search | Extract→Viz | Extract→Viz | Extract→Viz | Extract→Viz | Search | Archive | Model→Gaps |
| Core question | What's in the docs? | Who connects? | What relationships? | What clusters? | Enough for conviction? | What's in the corpus? | Where are the files? | What's missing? |
| Docs indexed | 71k+ | 19,154 | House Oversight | 69k chunks | 14,674 | 51k+ | ~20k pg + DOJ + FBI | Consumes DugganUSA API |
| Entities | — | 47 | 15k+ triples | 31 | 30+ | 86k | — | 132 (typed, curated) |
| Entity method | API search | pdfplumber + regex | Claude AI + dedup | Ollama embeddings | OWL autonomous | NLP pipeline | OCR text | Human-authored CUE |
| Analysis | Preset viz | Co-occurrence | Cluster filtering | Embedding clusters | Legal framework | Full-text search | — | Gap detection, bridges, cascades, BFS, reciprocity, NetworkX (betweenness, PageRank, community detection, k-core) |
| External ID reconciliation | — | — | — | — | — | — | — | 114/132 Wikidata QIDs |
| Completeness enforcement | — | — | — | — | — | — | — | CUE type constraints |
| Backend required? | Yes | No | Yes (SQLite) | No | No | Yes (SQLite) | No | No (static JSON) |
| Tech | Custom Python + vis-network | — | React + Claude + SQLite | Python + Plotly + Ollama | Vanilla JS + D3 | React + Express | CSV + GDrive | CUE + Python/NetworkX + D3 |

Complementary, Not Competing

This project already consumes DugganUSA's API to get corpus mention counts. Layer 4 sits on top of layers 1-3:

FULL_EPSTEIN_INDEX (raw documents)
        ↓
DugganUSA (search API, 71k docs)
        ↓
unify-graph (typed model + gap analysis)   ←   Wikidata (114 QIDs, entity reconciliation)
        ↓
Investigative leads + LLM context (TOON)

DugganUSA tells you what's in the documents. This project tells you what's still missing from the picture.

What This Shows (And Doesn't)

Honest framing — the model is young and intentionally incomplete. The gap analysis distinguishes between structural findings about the graph and statements about the model's own completeness.

Evidence coverage: 11.4% — most entities were added from structural relationships before evidence was linked. The gap is the work queue, not a discovery about the network.
Fragility: Most cross-cluster relationships depend on a single bridge entity. Discredit that person and the connection between those worlds disappears from the public record. This is a genuine structural finding — the graph looks dense but is actually hub-and-spoke.
Directionality bias: Almost all connections are one-way — the model reflects who investigators connected to whom, not mutual relationships. The graph shows the investigation's perspective, not the network's actual topology.
Reachability: Everyone is within 2 hops of Epstein. This is a mathematical consequence of hub-and-spoke structure, not an investigative finding — it would be true of any network built around a central figure.
Cluster validation: NetworkX Louvain detected 12 communities vs. 19 manual clusters. 73/132 entities mismatch. Community 0 alone spans 45 entities across 11 clusters — the "banking" label is structurally a broader financial orbit. This validates the model's usefulness — the disagreements identify cross-domain operators and potential misclassifications.
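The community-vs-cluster comparison can be sketched in a few lines. This toy example (invented labels, not the project's data) plants one deliberately mislabeled node and uses NetworkX's Louvain implementation to flag it:

```python
import networkx as nx

# Two dense triangles joined by one bridge edge. Node "f" is deliberately
# mislabeled "media" even though it sits in the banking triangle.
manual = {"a": "banking", "b": "banking", "f": "media",
          "c": "media", "d": "media", "e": "media"}
G = nx.Graph([("a", "b"), ("b", "f"), ("a", "f"),
              ("c", "d"), ("d", "e"), ("c", "e"),
              ("f", "c")])

communities = nx.community.louvain_communities(G, seed=42)
membership = {n: i for i, com in enumerate(communities) for n in com}

# Flag entities whose manual label disagrees with the majority label
# of their detected community.
for n in sorted(G):
    peer_labels = [manual[m] for m in G if membership[m] == membership[n]]
    majority = max(set(peer_labels), key=peer_labels.count)
    if manual[n] != majority:
        print(f"{n}: labeled {manual[n]}, structurally in a {majority} community")
```

Run over the real 132-entity graph, this disagreement scan is what surfaces the 73 mismatches and the oversized "banking" community.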

Built with CUE, NetworkX, and D3.js. Data from DugganUSA, Wikidata, DOJ EFTA Release, and public reporting. Source on GitHub.