There are 8+ open-source projects analyzing the Epstein documents. They fall into four layers — most do layer 1 or 2, a few do layer 3. Only this project does layer 4.
Every other project follows a variation of the same pipeline: ingest documents → extract entities → visualize.
They differ in scale (47 entities to 86,000), extraction method (regex to Claude AI), and visualization (2D force graphs to 3D clouds and Sankey diagrams). But they all answer the same question: "What do we know?"
This project inverts the question: not "What do we know?" but "What's missing?"
132 hand-curated entities modeled in CUE — a configuration language designed for infrastructure-as-code. The CUE compiler enforces that every reference resolves, every type is consistent, every constraint is satisfied. When something doesn't satisfy the schema, that's not a bug — it's a lead.
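The same referential-integrity check can be sketched outside CUE. Here is a minimal Python analogue over a hypothetical JSON export of the model — the entity names and field names are illustrative, not the project's actual schema:

```python
# Hypothetical export of the CUE-modeled entities (names and fields are
# illustrative, not the project's actual schema).
entities = {
    "ghislaine_maxwell": {"type": "person", "relationships": ["jeffrey_epstein"]},
    "jeffrey_epstein": {"type": "person", "relationships": []},
    "unknown_pilot": {"type": "person", "relationships": ["jeffrey_epstein", "flight_log_7"]},
}

# CUE enforces this at compile time: every relationship target must itself
# be a defined entity. The same check in plain Python:
def dangling_references(graph: dict) -> list[tuple[str, str]]:
    return [
        (src, dst)
        for src, node in graph.items()
        for dst in node["relationships"]
        if dst not in graph
    ]

# A dangling reference is not a bug to silence; it is a lead to chase.
print(dangling_references(entities))  # [('unknown_pilot', 'flight_log_7')]
```

The point of doing this in CUE rather than ad hoc scripts is that the check runs on every edit, so the model cannot drift into inconsistency silently.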
| | DugganUSA | phelix001 | maxandrews | SvetimFM | OWL-DOJ | epstein-archive | FULL_EPSTEIN_INDEX | unify-graph (this project) |
|---|---|---|---|---|---|---|---|---|
| Layer | Search | Extract→Viz | Extract→Viz | Extract→Viz | Extract→Viz | Search | Archive | Model→Gaps |
| Core question | What's in the docs? | Who connects? | What relationships? | What clusters? | Enough for conviction? | What's in the corpus? | Where are the files? | What's missing? |
| Docs indexed | 71k+ | 19,154 | House Oversight | 69k chunks | 14,674 | 51k+ | ~20k pages + DOJ + FBI | Consumes DugganUSA API |
| Entities | — | 47 | 15k+ triples | 31 | 30+ | 86k | — | 132 (typed, curated) |
| Entity method | API search | pdfplumber + regex | Claude AI + dedup | Ollama embeddings | OWL autonomous | NLP pipeline | OCR text | Human-authored CUE |
| Analysis | Preset viz | Co-occurrence | Cluster filtering | Embedding clusters | Legal framework | Full-text search | — | Gap detection, bridges, cascades, BFS, reciprocity, NetworkX (betweenness, PageRank, community detection, k-core) |
| External ID reconciliation | — | — | — | — | — | — | — | 114/132 Wikidata QIDs |
| Completeness enforcement | — | — | — | — | — | — | — | CUE type constraints |
| Backend required? | Yes | No | Yes (SQLite) | No | No | Yes (SQLite) | No | No (static JSON) |
| Tech | Custom | Python + vis-network | React + Claude + SQLite | Python + Plotly + Ollama | Vanilla JS + D3 | React + Express | CSV + GDrive | CUE + Python/NetworkX + D3 |
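The analysis column for this project lists standard graph algorithms. A rough NetworkX sketch of three of them, on a toy graph rather than the real 132-entity model (node names are placeholders):

```python
import networkx as nx
from networkx.algorithms.community import louvain_communities

# Toy stand-in for the entity graph; names are illustrative only.
G = nx.Graph()
G.add_edges_from([
    ("epstein", "maxwell"), ("maxwell", "socialite_c"),
    ("epstein", "academic_b"),
    ("epstein", "banker_a"), ("epstein", "fund_d"), ("banker_a", "fund_d"),
])

# Betweenness centrality surfaces bridge entities: the nodes that most
# shortest paths run through.
bc = nx.betweenness_centrality(G)
print(max(bc, key=bc.get))  # 'epstein'

# Louvain community detection, compared against hand-assigned clusters,
# is the kind of check behind the 12-vs-19 community comparison.
communities = louvain_communities(G, seed=42)

# k-core iteratively strips nodes of degree < k, exposing the dense core
# (here, the banking triangle survives the 2-core).
core = nx.k_core(G, k=2)
print(sorted(core.nodes))  # ['banker_a', 'epstein', 'fund_d']
```

None of this requires a backend, which is why the last column can stay "static JSON": the analysis runs offline and only its results ship to the browser.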
This project already consumes DugganUSA's API to get corpus mention counts. Layer 4 sits on top of layers 1-3:
DugganUSA tells you what's in the documents. This project tells you what's still missing from the picture.
Honest framing — the model is young and intentionally incomplete. The gap analysis distinguishes between structural findings about the graph and statements about the model's own completeness.
| Finding | Honest read |
|---|---|
| Evidence coverage | 11.4% — most entities were added from structural relationships before evidence was linked. The gap is the work queue, not a discovery about the network. |
| Fragility | Most cross-cluster relationships depend on a single bridge entity. Discredit that person and the connection between those worlds disappears from the public record. This is a genuine structural finding — the graph looks dense but is actually hub-and-spoke. |
| Directionality bias | Almost all connections are one-way — the model reflects who investigators connected to whom, not mutual relationships. The graph shows the investigation's perspective, not the network's actual topology. |
| Reachability | Everyone is within 2 hops of Epstein. This is a mathematical consequence of hub-and-spoke structure, not an investigative finding — it would be true of any network built around a central figure. |
| Cluster validation | NetworkX Louvain detected 12 communities vs. 19 manual clusters. 73/132 entities mismatch. Community 0 alone spans 45 entities across 11 clusters — the "banking" label is structurally a broader financial orbit. This validates the model's usefulness — the disagreements identify cross-domain operators and potential misclassifications. |
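The fragility and reachability findings above can be checked mechanically. A minimal NetworkX sketch on a toy hub-and-spoke graph (placeholder names, not the model's actual entities):

```python
import networkx as nx

# Illustrative hub-and-spoke graph; names are placeholders only.
G = nx.Graph()
G.add_edges_from([
    ("epstein", "maxwell"), ("epstein", "banker_a"),
    ("maxwell", "royal_b"), ("banker_a", "fund_c"),
])

# Fragility: articulation points are single entities whose removal
# disconnects the graph -- the "single bridge" structure described above.
cut_points = set(nx.articulation_points(G))
print(sorted(cut_points))  # ['banker_a', 'epstein', 'maxwell']

# Reachability: in a hub-and-spoke graph, a radius-2 ego graph around the
# hub covers every node, so "everyone within 2 hops" is expected, not news.
two_hops = nx.ego_graph(G, "epstein", radius=2)
print(two_hops.number_of_nodes() == G.number_of_nodes())  # True
```

This is why the table treats the two findings differently: articulation points are a property of this particular graph, while 2-hop reachability falls out of any graph built around a central figure.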
Built with CUE, NetworkX, and D3.js. Data from DugganUSA, Wikidata, DOJ EFTA Release, and public reporting. Source on GitHub.