Epstein Files Jan 30, 2026

Data hoarders on reddit have been hard at work archiving the latest Epstein Files release from the U.S. Department of Justice. Below is a compilation of their work with download links.

Please seed all torrent files to distribute and preserve this data.

Ref: https://old.reddit.com/r/DataHoarder/comments/1qrk3qk/epstein_files_datasets_9_10_11_300_gb_lets_keep/

Epstein Files Data Sets 1-8: INTERNET ARCHIVE LINK

Epstein Files Data Set 1 (2.47 GB): TORRENT MAGNET LINK
Epstein Files Data Set 2 (631.6 MB): TORRENT MAGNET LINK
Epstein Files Data Set 3 (599.4 MB): TORRENT MAGNET LINK
Epstein Files Data Set 4 (358.4 MB): TORRENT MAGNET LINK
Epstein Files Data Set 5 (61.5 MB): TORRENT MAGNET LINK
Epstein Files Data Set 6 (53.0 MB): TORRENT MAGNET LINK
Epstein Files Data Set 7 (98.2 MB): TORRENT MAGNET LINK
Epstein Files Data Set 8 (10.67 GB): TORRENT MAGNET LINK


Epstein Files Data Set 9 (Incomplete). Only contains 49 GB of 180 GB. Multiple reports of the DOJ server cutting off downloads at byte offset 48995762176.

ORIGINAL JUSTICE DEPARTMENT LINK

  • TORRENT MAGNET LINK (removed due to reports of CSAM)

/u/susadmin’s More Complete Data Set 9 (96.25 GB)
De-duplicated merge of the 45.63 GB and 86.74 GB versions

  • TORRENT MAGNET LINK (removed due to reports of CSAM)

Epstein Files Data Set 10 (78.64GB)

ORIGINAL JUSTICE DEPARTMENT LINK

  • TORRENT MAGNET LINK (removed due to reports of CSAM)
  • INTERNET ARCHIVE FOLDER (removed due to reports of CSAM)
  • INTERNET ARCHIVE DIRECT LINK (removed due to reports of CSAM)

Epstein Files Data Set 11 (25.55GB)

ORIGINAL JUSTICE DEPARTMENT LINK

SHA1: 574950c0f86765e897268834ac6ef38b370cad2a


Epstein Files Data Set 12 (114.1 MB)

ORIGINAL JUSTICE DEPARTMENT LINK

SHA1: 20f804ab55687c957fd249cd0d417d5fe7438281
MD5: b1206186332bb1af021e86d68468f9fe
SHA256: b5314b7efca98e25d8b35e4b7fac3ebb3ca2e6cfd0937aa2300ca8b71543bbe2
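For anyone re-hosting, digests like these can be recomputed with standard tools before seeding. A minimal sketch, using a stand-in file name (not the real DOJ archive) and GNU coreutils (macOS equivalents noted in comments):

```shell
# Stand-in file; substitute the actual Data Set 12 download.
printf 'hello' > release.zip

sha1=$(sha1sum release.zip | awk '{print $1}')      # macOS: shasum -a 1
md5=$(md5sum release.zip | awk '{print $1}')        # macOS: md5 -q release.zip
sha256=$(sha256sum release.zip | awk '{print $1}')  # macOS: shasum -a 256

printf 'SHA1:   %s\nMD5:    %s\nSHA256: %s\n' "$sha1" "$md5" "$sha256"
```

If any of the three digests differs from the values published above, the download is incomplete or has been altered.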


This list will be edited as more data becomes available, particularly with regard to Data Set 9 (EDIT: NOT ANYMORE)


EDIT [2026-02-02]: After being made aware of potential CSAM in the original Data Set 9 releases and seeing confirmation in the New York Times, I will no longer support any effort to maintain links to archives of it. There is suspicion of CSAM in Data Set 10 as well. I am removing links to both archives.

Some in this thread may be upset by this action. It is right to be distrustful of a government that has not shown signs of integrity. However, I do trust journalists who hold the government accountable.

I am abandoning this project and removing any links to content that commenters here and on reddit have suggested may contain CSAM.

Ref 1: https://www.nytimes.com/2026/02/01/us/nude-photos-epstein-files.html
Ref 2: https://www.404media.co/doj-released-unredacted-nude-images-in-epstein-files

  • Arthas@lemmy.world

    Epstein Files - Complete Dataset Audit Report

    Generated: 2026-02-16 | Scope: Datasets 1–12 (VOL00001–VOL00012) | Total Size: ~220 GB


    Background

    The Epstein Files consist of 12 datasets of court-released documents, each containing PDF files identified by EFTA document IDs. These datasets were collected from links shared throughout this Lemmy thread, with Dataset 9 cross-referenced against a partial copy we had downloaded independently.

    Each dataset includes OPT/DAT index files — the official Opticon load files used in e-discovery — which serve as the authoritative manifest of what each dataset should contain. This audit was compiled to:

    1. Verify completeness — compare every dataset against its OPT index to identify missing files
    2. Validate file integrity — confirm that all files are genuinely the file types they claim to be, not just by extension but by parsing their internal structure
    3. Detect duplicates — identify any byte-identical files within or across datasets
    4. Generate checksums — produce SHA256 hashes for every file to enable downstream integrity verification
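    The completeness check (item 1) can be sketched against a toy Opticon index. Field order in real OPT load files varies by vendor, so the `cut` position below is an assumption; inspect a few lines of the actual file first:

```shell
# Toy index + toy tree; real OPT paths are usually Windows-style.
mkdir -p ds/IMAGES
printf 'EFTA00000001,VOL00001,IMAGES\\EFTA00000001.pdf,Y,,,1\r\n' >  demo.opt
printf 'EFTA00000002,VOL00001,IMAGES\\EFTA00000002.pdf,Y,,,1\r\n' >> demo.opt
printf '%%PDF-1.4\n' > ds/IMAGES/EFTA00000001.pdf   # only the first file exists

# Expected paths from the index: strip CRs, normalize backslashes.
cut -d, -f3 demo.opt | tr -d '\r' | tr '\\' '/' | sort -u > expected.txt

# Paths actually on disk, relative to the dataset root.
(cd ds && find . -type f | sed 's|^\./||' | sort) > present.txt

# Paths listed in the index but absent on disk are the missing files.
comm -23 expected.txt present.txt > missing.txt
cat missing.txt   # -> IMAGES/EFTA00000002.pdf
```

    The same `comm` comparison run the other way (`comm -13`) would list files on disk that the index never mentions.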

    Executive Summary

    Metric                             Value
    Total Unique Files                 1,380,939
    Total Document IDs (OPT)           2,731,789
    Missing Files                      25 (Dataset 9 only)
    Corrupt PDFs                       3 (Dataset 9 only)
    Duplicates (intra + cross-dataset) 0
    Mislabeled Files                   0
    Overall Completeness               99.998%

    Dataset Overview

                          EPSTEIN FILES - DATASET SUMMARY
      ┌─────────┬──────────┬───────────┬──────────┬─────────┬─────────┬─────────┐
      │ Dataset │  Volume  │   Files   │ Expected │ Missing │ Corrupt │  Size   │
      ├─────────┼──────────┼───────────┼──────────┼─────────┼─────────┼─────────┤
      │    1    │ VOL00001 │     3,158 │    3,158 │       0 │       0 │ 2.5 GB  │
      │    2    │ VOL00002 │       574 │      574 │       0 │       0 │ 633 MB  │
      │    3    │ VOL00003 │        67 │       67 │       0 │       0 │ 600 MB  │
      │    4    │ VOL00004 │       152 │      152 │       0 │       0 │ 359 MB  │
      │    5    │ VOL00005 │       120 │      120 │       0 │       0 │ 62 MB   │
      │    6    │ VOL00006 │        13 │       13 │       0 │       0 │ 53 MB   │
      │    7    │ VOL00007 │        17 │       17 │       0 │       0 │ 98 MB   │
      │    8    │ VOL00008 │    10,595 │   10,595 │       0 │       0 │ 11 GB   │
      │    9    │ VOL00009 │   531,282 │  531,307 │      25 │       3 │ 96 GB   │
      │   10    │ VOL00010 │   503,154 │  503,154 │       0 │       0 │ 82 GB   │
      │   11    │ VOL00011 │   331,655 │  331,655 │       0 │       0 │ 27 GB   │
      │   12    │ VOL00012 │       152 │      152 │       0 │       0 │ 120 MB  │
      ├─────────┼──────────┼───────────┼──────────┼─────────┼─────────┼─────────┤
      │  TOTAL  │          │ 1,380,939 │1,380,964 │      25 │       3 │ ~220 GB │
      └─────────┴──────────┴───────────┴──────────┴─────────┴─────────┴─────────┘
    

    Notes

    • DS1: Two identical copies found (6,316 files on disk). Byte-for-byte identical via SHA256. Table above reflects one copy (3,158). One copy is redundant.
    • DS2: 699 document IDs map to 574 files (multi-page PDFs)
    • DS3: 1,847 document IDs across 67 files (~28 pages/doc avg)
    • DS5: 1:1 document-to-file ratio (single-page PDFs)
    • DS6: Smallest dataset by file count. ~37 pages/doc avg.
    • DS9: Largest dataset. 25 missing from OPT index, 3 structurally corrupt.
    • DS10: Second largest. 950,101 document IDs across 503,154 files.
    • DS11: Third largest. 517,382 document IDs across 331,655 files.

    Dataset 9 — Missing Files (25)
    EFTA00709804    EFTA00823221    EFTA00932520
    EFTA00709805    EFTA00823319    EFTA00932521
    EFTA00709806    EFTA00877475    EFTA00932522
    EFTA00709807    EFTA00892252    EFTA00932523
    EFTA00770595    EFTA00901740    EFTA00984666
    EFTA00774768    EFTA00912980    EFTA00984668
    EFTA00823190    EFTA00919433    EFTA01135215
    EFTA00823191    EFTA00919434    EFTA01135708
    EFTA00823192
    
    Dataset 9 — Corrupted Files (3)
    File              Size    Error
    EFTA00645624.pdf  35 KB   Missing trailer dictionary, broken xref table
    EFTA01175426.pdf  827 KB  Invalid xref entries, no page tree (0 pages)
    EFTA01220934.pdf  1.1 MB  Missing trailer dictionary, broken xref table

    All three have valid %PDF- headers but cannot be rendered due to structural corruption. They were likely damaged during the original document production or transfer.


    File Type Verification

    Two levels of verification performed on all 1,380,939 files:

    1. Magic Byte Detection (file command) — All files contain valid %PDF- headers. 0 mislabeled.
    2. Deep PDF Validation (pdfinfo, poppler 26.02.0) — Parsed xref tables, trailer dictionaries, and page trees. 3 structurally corrupt (Dataset 9 only).
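    The two levels can be sketched on a toy directory, assuming `file` and poppler's `pdfinfo` are installed (the directory and file names below are illustrative only):

```shell
# Toy directory stands in for a dataset: one genuine PDF header,
# one text file misnamed as .pdf.
mkdir -p ds
printf '%%PDF-1.4\nminimal\n' > ds/good.pdf
printf 'plain text\n' > ds/fake.pdf

# Level 1: magic-byte check -- anything not application/pdf is mislabeled.
find ds -type f -name '*.pdf' -exec file --mime-type {} + |
    grep -v 'application/pdf$' > mislabeled.txt || true

# Level 2: structural parse -- pdfinfo exits non-zero when the xref
# table, trailer dictionary, or page tree is broken (requires poppler).
find ds -type f -name '*.pdf' | while read -r f; do
    pdfinfo "$f" >/dev/null 2>&1 || echo "$f"
done > suspect.txt
```

    Level 1 catches renamed non-PDFs cheaply; Level 2 is slower but is what surfaced the three structurally corrupt Dataset 9 files.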

    Duplicate Analysis

    • Within Datasets: 0 intra-dataset hash duplicates across all 12 datasets.
    • Cross-Dataset: All 1,380,939 SHA256 hashes compared. 0 cross-dataset duplicates — every file is unique.
    • Dataset 1 Two Copies: Both copies byte-for-byte identical (SHA256 verified). One is redundant (~2.5 GB).
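    With per-dataset SHA256SUMS files in hand, the cross-dataset scan reduces to finding repeated hashes. A toy sketch with fabricated hash values standing in for real digests:

```shell
# Toy sums files: two "datasets" sharing one byte-identical file.
printf '%s  a.pdf\n' aaa111 > dataset_1_SHA256SUMS.txt
printf '%s  b.pdf\n' bbb222 > dataset_2_SHA256SUMS.txt
printf '%s  c.pdf\n' aaa111 >> dataset_2_SHA256SUMS.txt

# A hash printed more than once marks a byte-identical pair.
cat dataset_*_SHA256SUMS.txt |
    awk '{print $1}' | sort | uniq -d > duplicate_hashes.txt
cat duplicate_hashes.txt   # -> aaa111
```

    On the real data this produces an empty file, which is the "0 cross-dataset duplicates" result above.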

    Integrity Verification

    SHA256 checksums were generated for every file across all 12 datasets. Individual checksum files are available per dataset:

    File                        Hashes    Size
    dataset_1_SHA256SUMS.txt      3,158   256 KB
    dataset_2_SHA256SUMS.txt        574   47 KB
    dataset_3_SHA256SUMS.txt         67   5.4 KB
    dataset_4_SHA256SUMS.txt        152   12 KB
    dataset_5_SHA256SUMS.txt        120   9.7 KB
    dataset_6_SHA256SUMS.txt         13   1.1 KB
    dataset_7_SHA256SUMS.txt         17   1.4 KB
    dataset_8_SHA256SUMS.txt     10,595   859 KB
    dataset_9_SHA256SUMS.txt    531,282   42 MB
    dataset_10_SHA256SUMS.txt   503,154   40 MB
    dataset_11_SHA256SUMS.txt   331,655   26 MB
    dataset_12_SHA256SUMS.txt       152   12 KB

    To verify any file against its checksum:

    shasum -a 256 <filename>
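    To verify an entire dataset at once rather than one file at a time, the `-c` flag checks every entry in a SHA256SUMS file. A self-contained toy sketch (GNU `sha256sum` shown; on macOS, `shasum -a 256 -c` accepts the same checksum file):

```shell
# Toy demo: generate a checksum file, then verify against it.
printf 'sample page\n' > EFTA00000001.pdf
sha256sum EFTA00000001.pdf > SHA256SUMS.txt

# Verification succeeds while the file is intact...
sha256sum -c SHA256SUMS.txt              # EFTA00000001.pdf: OK

# ...and fails after any modification.
printf 'tampered\n' > EFTA00000001.pdf   # simulate corruption in transit
sha256sum -c SHA256SUMS.txt || echo 'MISMATCH'
```

    A non-zero exit status from `-c` is what makes this usable in scripts that re-verify a mirror after download.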
    

    If you’d like access to the SHA256 checksum files or can help host them, send me a DM.


    Methodology
    1. Hash Generation: SHA256 checksums via shasum -a 256 with 8-thread parallel processing
    2. OPT Index Comparison: Each dataset’s OPT load file parsed for expected file paths, compared against files on disk
    3. Intra-Dataset Duplicate Detection: SHA256 hashes compared within each dataset
    4. Cross-Dataset Duplicate Detection: All 1,380,939 hashes compared across all 12 datasets
    5. File Type Verification (Level 1): Magic byte detection via file command
    6. Deep PDF Validation (Level 2): Structure validation via pdfinfo (poppler 26.02.0) — xref tables, trailer dictionaries, page trees
    7. Cross-Copy Comparison: Dataset 1’s two copies compared via full SHA256 diff
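    Step 1's parallel hashing can be approximated with `xargs -P`. A toy sketch (in practice, point `find` at a real dataset directory; the demo files here are stand-ins):

```shell
# Toy tree standing in for a dataset directory.
mkdir -p demo && printf 'a' > demo/a.txt && printf 'b' > demo/b.txt

# Hash with 8 parallel workers (the report's "8-thread" step).
# Output order is nondeterministic under -P, so sort by path afterwards.
find demo -type f -print0 |
    xargs -0 -P 8 -n 64 sha256sum |
    sort -k2 > demo_SHA256SUMS.txt
wc -l < demo_SHA256SUMS.txt   # one line per file hashed
```

    `-n 64` batches files per worker invocation, which keeps process-spawn overhead low on trees with hundreds of thousands of small PDFs.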

    Recommendations

    1. Remove Dataset 1 duplicate copy — saves ~2.5 GB
    2. Document the 25 missing Dataset 9 files — community assistance may help locate these
    3. Preserve OPT/DAT index files — authoritative record of expected contents
    4. Distribute SHA256SUMS.txt files — for downstream integrity verification

    Report generated as part of the Epstein Files preservation and verification project.