About Genome-Eval

What This Is

genome-eval is a local analysis pipeline for the raw data file you get from a consumer DNA test (23andMe, AncestryDNA, MyHeritage, or FamilyTreeDNA). It reads that file and builds a growing, plain-language notebook of findings about your genome: which drugs you metabolize normally, fast, or slowly; whether you carry hidden copies of recessive disease mutations; traits like lactose tolerance and how quickly you clear caffeine; polygenic risk scores for conditions like heart disease and type 2 diabetes; and your deep-ancestry maternal and paternal lineages.

Two things make it different from the report your testing company hands you. First, everything runs locally — your raw DNA never leaves your machine; the only things downloaded are public research datasets that have nothing to do with you. Second, it shows its work: every finding records the underlying evidence — study size, effect size, the ancestry of the cohort, two separate confidence axes, and a letter-grade evidence tier — so you can see why it says what it says, and re-grade it later as the science improves. Findings live in an append-only ledger, so revising one never erases its history.
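The append-only idea can be pictured with a minimal sketch. It assumes a JSONL file with one finding per line and two hypothetical fields, `id` and `supersedes`; the project's actual ledger schema is not shown here and may differ.

```python
import json
from pathlib import Path


def append_finding(ledger: Path, finding: dict) -> None:
    """Append one finding as a single JSON line; existing lines are never rewritten."""
    with ledger.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(finding) + "\n")


def active_findings(ledger: Path) -> list[dict]:
    """Return the rows that no other row supersedes (the current view of each chain)."""
    lines = ledger.read_text(encoding="utf-8").splitlines()
    rows = [json.loads(line) for line in lines if line.strip()]
    # "supersedes" is a hypothetical field holding the id of the row being revised.
    superseded = {row["supersedes"] for row in rows if row.get("supersedes")}
    return [row for row in rows if row["id"] not in superseded]
```

Revising a finding here means appending a second row whose `supersedes` points at the first. Both rows stay in the file, so the history remains readable, while `active_findings` returns only the newer one.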

Architecture at a Glance

[Architecture diagram. Legend: Client, Ingest / routing, Compute, Storage, External (public)]

Every arrow is one-way: this is a forward data pipeline, not a request/response service. Numbers match the steps below; steps 3a and 5 are where local copies of public reference data feed the compute stages.

What Happens When You Run an Analysis

1. You drop your raw provider file into raw-source-genomes/ and run normalize.py <id>, picking a short ID for yourself. That directory is a read-only boundary: the file is never modified after that.
2. The parser auto-detects which company produced the file and converts it to one standard internal format, written to standardized-genomes/<id>.parquet. From here on, nothing reads the raw file; all analysis reads only this normalized parquet plus your profiles/<id>.json metadata.
3. Imputation fills in the ~97% of genetic variants the chip never directly measured. Beagle compares your measured positions against a public reference panel (3a), the 1000 Genomes Project with its 2,500+ fully sequenced genomes, and statistically infers the gaps, expanding ~640K measured variants to tens of millions.
4. The analysis runners read the expanded genotypes. Each is one command: pharmacogenomics, carrier screening, polygenic scores, mtDNA and Y haplogroups, and trait calls.
5. Runners that need external evidence pull it from local copies of public data (5): PGS Catalog weight files and 1000G allele frequencies for polygenic scores, ClinVar/ACMG panels for carrier screening, and PhyloTree / ISOGG trees for haplogroups.
6. Each result is appended to the ledger as a finding carrying its full evidence: study size, p-value, effect size, replication count, cohort ancestry. tier_rules.py then computes the letter grade (A–E) from those metrics rather than anyone inventing it; a revision appends a new row that supersedes the old one and never edits it in place.
7. On request, generate_report.py reads the active findings (latest non-superseded row per chain) and writes a plain-language markdown report to reports/<date>-<id>.md, grouped by category with confidence and tier shown on every finding.
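The grading step above, where a letter tier is computed from recorded metrics instead of being assigned by hand, might look roughly like the sketch below. Every threshold, the `Evidence` class, and `compute_tier` are illustrative assumptions; the real rules in tier_rules.py are not reproduced here and may weigh other fields such as effect size or cohort ancestry.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    """Hypothetical subset of the evidence recorded with a finding."""
    study_size: int
    p_value: float
    replication_count: int


# Hypothetical thresholds, ordered from strictest to loosest.
# Each entry: (tier, min study size, max p-value, min replications)
TIER_RULES = [
    ("A", 100_000, 5e-8, 3),
    ("B", 10_000, 5e-8, 1),
    ("C", 1_000, 1e-5, 0),
    ("D", 100, 0.05, 0),
]


def compute_tier(ev: Evidence) -> str:
    """Return the first tier whose thresholds the evidence meets, else 'E'."""
    for tier, min_n, max_p, min_rep in TIER_RULES:
        if (
            ev.study_size >= min_n
            and ev.p_value <= max_p
            and ev.replication_count >= min_rep
        ):
            return tier
    return "E"
```

Because the grade is a pure function of the stored metrics, re-grading after a new replication study is a matter of appending a revised finding with updated evidence and recomputing the tier.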

Tech Stack

Layer	Tools
Language	Python 3.13 (CLI scripts, no web frontend)
Data format	pandas + pyarrow (parquet); append-only JSONL ledger
Imputation	Beagle 5.4 + conform-gt, 1000 Genomes Phase 3 EUR reference panels
Polygenic scores	PGS Catalog weights, 1000G EUR allele frequencies (empirical calibration over 503 samples)
Haplogroups	HaploGrep3 (mtDNA, PhyloTree 17.2) & yhaplo (Y, ISOGG 2016)
Runtime	Bundled portable Temurin JDK 21 (runs Beagle & HaploGrep3 — no system Java needed)
Reference evidence	ClinVar / ACMG carrier panels, curated SNP tables, FDA PGx labels
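
As a rough illustration of the polygenic-score row, the sketch below sums per-variant weights times effect-allele dosage and places the result within a set of reference scores. It assumes that "empirical calibration" means ranking against scores computed for the reference samples, which is one plausible reading rather than a confirmed description of the project's method. The function names and the choice to skip variants missing from the genotype are likewise assumptions.

```python
from bisect import bisect_left


def raw_score(weights: dict[str, float], dosages: dict[str, int]) -> float:
    """Sum weight x effect-allele dosage (0, 1, or 2) over variants present in both maps."""
    # Variants without a genotype are skipped here; a real runner may handle them differently.
    return sum(w * dosages[rsid] for rsid, w in weights.items() if rsid in dosages)


def empirical_percentile(score: float, reference_scores: list[float]) -> float:
    """Percentage of reference scores strictly below the given score."""
    ordered = sorted(reference_scores)
    return 100.0 * bisect_left(ordered, score) / len(ordered)
```

With a reference list holding one score per panel sample, `empirical_percentile` reads directly as "higher than this share of the reference cohort", with no assumption that the scores are normally distributed.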

Source Code

The project lives on GitHub at github.com/JackVance/genome-eval.

All runners and library code: github.com/JackVance/genome-eval/tree/master/scripts
Detailed workflow, SNP tables, and tier rules: …/.claude/skills/genome-eval/SKILL.md
Setup and command reference: github.com/JackVance/genome-eval/blob/master/README.md