About Genome-Eval
Genetic analysis pipeline from raw consumer data (e.g. 23andMe).
What This Is
genome-eval is a local analysis pipeline for the raw data file you get from a consumer DNA test (23andMe, AncestryDNA, MyHeritage, or FamilyTreeDNA). It reads that file and builds a growing, plain-language notebook of findings about your genome: which drugs you metabolize normally, quickly, or slowly; whether you carry hidden copies of recessive disease mutations; traits like lactose tolerance and how quickly you metabolize caffeine; polygenic risk scores for conditions like heart disease and type 2 diabetes; and your deep-ancestry maternal and paternal lineages.
Two things make it different from the report your testing company hands you. First, everything runs locally — your raw DNA never leaves your machine; the only things downloaded are public research datasets that have nothing to do with you. Second, it shows its work: every finding records the underlying evidence — study size, effect size, the ancestry of the cohort, two separate confidence axes, and a letter-grade evidence tier — so you can see why it says what it says, and re-grade it later as the science improves. Findings live in an append-only ledger, so revising one never erases its history.
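As a rough illustration of what "append-only" means here, the sketch below writes findings as JSON lines and records a revision as a new row that points back at the old one. The field names (`finding_id`, `supersedes`, and so on), the `append_finding` helper, and the example values are all hypothetical; the project's actual ledger schema may differ.

```python
import json
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import Optional


@dataclass(frozen=True)
class Finding:
    # Hypothetical schema: these field names are assumptions, not the
    # project's real ledger format.
    finding_id: str
    category: str
    summary: str
    study_size: int
    effect_size: float
    cohort_ancestry: str
    supersedes: Optional[str] = None  # id of the row this one revises


def append_finding(ledger: Path, finding: Finding) -> None:
    """Append one finding as a JSON line; existing lines are never rewritten."""
    with ledger.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(asdict(finding)) + "\n")


def read_ledger(ledger: Path) -> list:
    """Read every row, including superseded ones, in the order written."""
    with ledger.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "ledger.jsonl"
        # Placeholder values for illustration only.
        append_finding(path, Finding("f1", "traits", "example finding", 1000, 1.5, "EUR"))
        append_finding(
            path,
            Finding("f2", "traits", "example finding, re-graded", 5000, 1.4, "EUR", supersedes="f1"),
        )
        for row in read_ledger(path):
            print(row["finding_id"], "supersedes", row["supersedes"])
```

Because the file is only ever opened in append mode, the original `f1` row stays on disk after the revision, which is what keeps the history intact.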
Architecture at a Glance
Every arrow is one-way: this is a forward data pipeline, not a request/response service. Numbers match the steps below; steps 3a and 5 are where local copies of public reference data feed the compute stages.
What Happens When You Run an Analysis
1. You drop your raw provider file into `raw-source-genomes/` and run `normalize.py <id>`, picking a short ID for yourself. That directory is a read-only boundary: the file is never edited again.
2. The parser auto-detects which company produced the file and converts it to one standard internal format, written to `standardized-genomes/<id>.parquet`. From here on, nothing reads the raw file; all analysis reads only this normalized parquet plus your `profiles/<id>.json` metadata.
3. Imputation fills in the ~97% of genetic variants the chip never directly measured. Beagle compares your measured positions against a public reference panel (3a), the 1000 Genomes Project with its 2,500+ fully sequenced genomes, and statistically infers the gaps, expanding ~640K measured variants to tens of millions.
4. The analysis runners read the expanded genotypes. Each is one command: pharmacogenomics, carrier screening, polygenic scores, mtDNA and Y haplogroups, and trait calls.
5. Runners that need external evidence pull it from local copies of public data (5): PGS Catalog weight files and 1000G allele frequencies for polygenic scores, ClinVar/ACMG panels for carrier screening, and PhyloTree / ISOGG trees for haplogroups.
6. Each result is appended to the ledger as a finding carrying its full evidence: study size, p-value, effect size, replication count, cohort ancestry. `tier_rules.py` then computes the letter grade (A–E) from those metrics rather than anyone inventing it; revisions append a new row that supersedes the old one, never an edit.
7. On request, `generate_report.py` reads the active findings (latest non-superseded row per chain) and writes a plain-language markdown report to `reports/<date>-<id>.md`, grouped by category with confidence and tier shown on every finding.
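The "latest non-superseded row per chain" rule in the last step can be sketched as follows. This is a minimal illustration, not the logic in `generate_report.py`: it assumes each ledger row carries a unique `finding_id` and an optional `supersedes` key naming the row it revises, and both field names are hypothetical.

```python
from typing import Iterable


def active_findings(rows: Iterable[dict]) -> list:
    """Return the rows that no other row supersedes, in ledger order.

    Assumes hypothetical keys: a unique "finding_id" per row and an
    optional "supersedes" holding the id of the row being revised.
    """
    rows = list(rows)
    superseded = {r["supersedes"] for r in rows if r.get("supersedes")}
    return [r for r in rows if r["finding_id"] not in superseded]


def group_by_category(rows: Iterable[dict]) -> dict:
    """Bucket rows by their "category" key, preserving order within each bucket."""
    grouped = {}
    for row in rows:
        grouped.setdefault(row.get("category", "uncategorized"), []).append(row)
    return grouped


if __name__ == "__main__":
    # Placeholder ledger: a1 -> a2 -> a3 is one revision chain, b1 stands alone.
    ledger = [
        {"finding_id": "a1", "category": "traits"},
        {"finding_id": "b1", "category": "carrier"},
        {"finding_id": "a2", "category": "traits", "supersedes": "a1"},
        {"finding_id": "a3", "category": "traits", "supersedes": "a2"},
    ]
    for category, items in group_by_category(active_findings(ledger)).items():
        print(category, [r["finding_id"] for r in items])
```

Only the tail of each chain survives the filter, so a report built this way shows the current grading while the superseded rows remain in the ledger for audit.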
Tech Stack
| Layer | Tools |
|---|---|
| Language | Python 3.13 (CLI scripts, no web frontend) |
| Data format | pandas + pyarrow (parquet); append-only JSONL ledger |
| Imputation | Beagle 5.4 + conform-gt, 1000 Genomes Phase 3 EUR reference panels |
| Polygenic scores | PGS Catalog weights, 1000G EUR allele frequencies (empirical calibration over 503 samples) |
| Haplogroups | HaploGrep3 (mtDNA, PhyloTree 17.2) & yhaplo (Y, ISOGG 2016) |
| Runtime | Bundled portable Temurin JDK 21 (runs Beagle & HaploGrep3 — no system Java needed) |
| Reference evidence | ClinVar / ACMG carrier panels, curated SNP tables, FDA PGx labels |
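To make the polygenic-score row more concrete, here is a generic sketch of how a weighted score and an empirical percentile are commonly computed. It is an assumption about what "empirical calibration" means here (ranking one score against scores computed for a reference cohort), not a description of the project's runner; the function names, variant IDs, and numbers are made up for illustration.

```python
from bisect import bisect_right


def polygenic_score(weights: dict, dosages: dict) -> float:
    """Sum of effect weight x effect-allele dosage (0 to 2).

    Variants with a weight but no genotype are skipped.
    """
    return sum(w * dosages[v] for v, w in weights.items() if v in dosages)


def empirical_percentile(score: float, reference_scores: list) -> float:
    """Percentage of reference-cohort scores at or below `score`."""
    ordered = sorted(reference_scores)
    return 100.0 * bisect_right(ordered, score) / len(ordered)


if __name__ == "__main__":
    # Hypothetical weights and dosages; real weights would come from a
    # PGS Catalog scoring file and dosages from the imputed genotypes.
    weights = {"rsA": 0.5, "rsB": -0.25, "rsC": 1.0}
    dosages = {"rsA": 2, "rsB": 1}  # rsC has no genotype, so it is skipped
    score = polygenic_score(weights, dosages)

    # Stand-in for scores computed the same way across a reference cohort.
    reference = [0.0, 0.5, 0.75, 1.0]
    print(score, empirical_percentile(score, reference))
```

Ranking against an observed cohort distribution avoids assuming the scores are normally distributed, at the cost of tying the percentile to the ancestry of that cohort.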
Source Code
The project lives on GitHub at github.com/JackVance/genome-eval.
- All runners and library code: github.com/JackVance/genome-eval/tree/master/scripts
- Detailed workflow, SNP tables, and tier rules: …/.claude/skills/genome-eval/SKILL.md
- Setup and command reference: github.com/JackVance/genome-eval/blob/master/README.md