How This Compares

Four Layers of Epstein Investigation Projects

There are 8+ open-source projects analyzing the Epstein documents. They fall into four layers; every other project stops at layer 1, 2, or 3. Only this project does layer 4.

1. Archive — collect & host raw documents
2. Search — full-text search + entity tagging
3. Extract → Visualize — NLP/LLM builds graphs
4. Model → Analyze gaps — unify-graph's typed schema, where constraint violations = investigative leads

The Standard Pipeline (Layers 1-3)

Every other project follows a variation of this pipeline:

Documents → OCR → NLP/LLM → Entities → Graph → "Look at all these connections"

They differ in scale (47 entities to 86,000), extraction method (regex to Claude AI), and visualization (2D force graphs to 3D clouds and Sankey diagrams). But they all answer the same question: "What do we know?"

The Inverted Pipeline (Layer 4)

This project inverts the question:

Entities → Typed Schema → Constraint Violations → "Here's what we DON'T know"

132 hand-curated entities are modeled in CUE, a configuration language designed for infrastructure-as-code. The CUE compiler enforces that every reference resolves, every type is consistent, and every constraint is satisfied. When something doesn't satisfy the schema, that's not a bug — it's a lead.
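As a sketch of the idea, assume the validated model is exported to JSON. The field names and entity IDs below are illustrative assumptions, not the project's actual schema; the point is only that a failed check is reported as a question, not an error:

```python
# Hypothetical slice of an exported entity graph (illustrative names only).
graph = {
    "entities": {
        "jeffrey_epstein": {"type": "Principal", "evidence": ["doc_0412"]},
        "shell_corp_a": {"type": "FinancialEnabler", "evidence": []},
    },
    "relationships": [
        {"from": "jeffrey_epstein", "to": "shell_corp_a"},
        {"from": "shell_corp_a", "to": "unknown_banker"},  # dangling target
    ],
}

def find_leads(graph):
    """Report schema violations as investigative leads rather than errors."""
    leads = []
    known = graph["entities"]
    # Dangling reference: a relationship endpoint with no modeled entity.
    for rel in graph["relationships"]:
        for end in (rel["from"], rel["to"]):
            if end not in known:
                leads.append(f"dangling reference: {end} is not yet modeled")
    # Empty evidence map: a modeled entity with zero document citations.
    for name, ent in known.items():
        if not ent["evidence"]:
            leads.append(f"empty evidence map: {name} has no citations")
    return leads

for lead in find_leads(graph):
    print(lead)
```

In the real project these checks live in the CUE schema and fail at compile time; the sketch just shows the violation-to-lead translation.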

Dangling Reference
Entity mentions a person not yet modeled.
→ Who are they? Why are they connected?
Empty Evidence Map
Entity exists in the graph but has zero document citations.
→ Most entities were added from structural relationships before evidence was linked. The gap is the work queue.
Type Inconsistency
Tagged as FinancialEnabler but no financial flows documented.
→ Where's the paper trail?
Sole Connector
One person is the only bridge between two cluster worlds.
→ 44 cluster pairs depend on a single entity. Discredit them and the connection vanishes.
Exposure Cascade
Pre-computed BFS: if X cooperates, who's within 2 degrees?
→ High reach is expected in a hub-and-spoke network — what matters is which entities are NOT reachable.
Cluster Isolation
Entity is assigned to a cluster but has zero connections to peers.
→ Misclassified, or genuinely isolated?
Community vs Cluster Mismatch
NetworkX Louvain community detection disagrees with the manual cluster label.
→ 73/132 entities structurally belong to a different group than their label says. Either the label is wrong or the entity operates across domains.
Entity Reconciliation
114/132 entities linked to Wikidata QIDs. CUE validates the mapping at compile time.
→ QIDs enable federated queries: SPARQL can pull in all linked identifiers (OpenCorporates, LittleSis, court records) from a single ID.
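Two of these checks, the exposure cascade and the sole-connector scan, can be sketched with NetworkX on a toy graph. The entity names and cluster labels below are invented for illustration:

```python
from collections import defaultdict

import networkx as nx

# Toy hub-and-spoke graph with manual cluster labels (invented data).
clusters = {
    "epstein": "core", "maxwell": "core",
    "banker": "finance", "fund": "finance",
    "fixer": "politics", "senator": "politics",
}
G = nx.Graph([
    ("epstein", "maxwell"), ("epstein", "banker"), ("banker", "fund"),
    ("epstein", "fixer"), ("fixer", "senator"),
])

# Exposure cascade: if this entity cooperates, who is within 2 hops?
reach = nx.single_source_shortest_path_length(G, "epstein", cutoff=2)
print(sorted(n for n in reach if n != "epstein"))
# → ['banker', 'fixer', 'fund', 'maxwell', 'senator']

# Sole connectors: cluster pairs joined by exactly one cross-cluster edge.
pair_edges = defaultdict(list)
for u, v in G.edges():
    cu, cv = clusters[u], clusters[v]
    if cu != cv:
        pair_edges[tuple(sorted((cu, cv)))].append((u, v))
for pair, edges in pair_edges.items():
    if len(edges) == 1:
        print(f"{pair[0]}-{pair[1]} depends on a single edge: {edges[0]}")
```

The project's pre-computed BFS and 44 fragile cluster pairs come from running this kind of scan over the full 132-entity model.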

Detailed Comparison

| | DugganUSA | phelix001 | maxandrews | SvetimFM | OWL-DOJ | epstein-archive | FULL_EPSTEIN_INDEX | unify-graph |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Layer | Search | Extract→Viz | Extract→Viz | Extract→Viz | Extract→Viz | Search | Archive | Model→Gaps |
| Core question | What's in the docs? | Who connects? | What relationships? | What clusters? | Enough for conviction? | What's in the corpus? | Where are the files? | What's missing? |
| Docs indexed | 71k+ | 19,154 | House Oversight | 69k chunks | 14,674 | 51k+ | ~20k pg + DOJ + FBI | Consumes DugganUSA API |
| Entities | — | 47 | 15k+ triples | 31 | 30+ | 86k | — | 132 (typed, curated) |
| Entity method | API search | pdfplumber + regex | Claude AI + dedup | Ollama embeddings | OWL autonomous | NLP pipeline | OCR text | Human-authored CUE |
| Analysis | Preset viz | Co-occurrence | Cluster filtering | Embedding clusters | Legal framework | Full-text search | — | Gap detection, bridges, cascades, BFS, reciprocity, NetworkX (betweenness, PageRank, community detection, k-core) |
| External ID reconciliation | — | — | — | — | — | — | — | 114/132 Wikidata QIDs |
| Completeness enforcement | — | — | — | — | — | — | — | CUE type constraints |
| Backend required? | Yes | No | Yes (SQLite) | No | No | Yes (SQLite) | No | No (static JSON) |
| Tech | Custom Python + vis-network | — | React + Claude + SQLite | Python + Plotly + Ollama | Vanilla JS + D3 | React + Express | CSV + GDrive | CUE + Python/NetworkX + D3 |

Complementary, Not Competing

This project already consumes DugganUSA's API to get corpus mention counts. Layer 4 sits on top of layers 1-3:

FULL_EPSTEIN_INDEX (raw documents)
        ↓
DugganUSA (search API, 71k docs)
        ↓
unify-graph (typed model + gap analysis)   ←   Wikidata (114 QIDs, entity reconciliation)
        ↓
Investigative leads + LLM context (TOON)

DugganUSA tells you what's in the documents. This project tells you what's still missing from the picture.

What This Shows (And Doesn't)

Honest framing — the model is young and intentionally incomplete. The gap analysis distinguishes between structural findings about the graph and statements about the model's own completeness.

Evidence coverage: 11.4% — most entities were added from structural relationships before evidence was linked. The gap is the work queue, not a discovery about the network.
Fragility: Most cross-cluster relationships depend on a single bridge entity. Discredit that person and the connection between those worlds disappears from the public record. This is a genuine structural finding — the graph looks dense but is actually hub-and-spoke.
Directionality bias: Almost all connections are one-way — the model reflects who investigators connected to whom, not mutual relationships. The graph shows the investigation's perspective, not the network's actual topology.
Reachability: Everyone is within 2 hops of Epstein. This is a mathematical consequence of hub-and-spoke structure, not an investigative finding — it would be true of any network built around a central figure.
Cluster validation: NetworkX Louvain detected 12 communities vs. 19 manual clusters. 73/132 entities mismatch. Community 0 alone spans 45 entities across 11 clusters — the "banking" label is structurally a broader financial orbit. This validates the model's usefulness — the disagreements identify cross-domain operators and potential misclassifications.
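The community-vs-cluster comparison can be sketched in a few lines. This toy example (invented labels, not the project's data) plants one deliberately mislabeled node and uses NetworkX's Louvain implementation to flag it:

```python
import networkx as nx

# Two dense triangles joined by one bridge edge. Node "f" is deliberately
# mislabeled "media" even though it sits in the banking triangle.
manual = {"a": "banking", "b": "banking", "f": "media",
          "c": "media", "d": "media", "e": "media"}
G = nx.Graph([("a", "b"), ("b", "f"), ("a", "f"),
              ("c", "d"), ("d", "e"), ("c", "e"),
              ("f", "c")])

communities = nx.community.louvain_communities(G, seed=42)
membership = {n: i for i, com in enumerate(communities) for n in com}

# Flag entities whose manual label disagrees with the majority label
# of their detected community.
for n in sorted(G):
    peer_labels = [manual[m] for m in G if membership[m] == membership[n]]
    majority = max(set(peer_labels), key=peer_labels.count)
    if manual[n] != majority:
        print(f"{n}: labeled {manual[n]}, structurally in a {majority} community")
```

Run over the real 132-entity graph, this disagreement scan is what surfaces the 73 mismatches and the oversized "banking" community.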

Built with CUE, NetworkX, and D3.js. Data from DugganUSA, Wikidata, DOJ EFTA Release, and public reporting. Source on GitHub.