About Genome-Eval

Genetic analysis pipeline from raw consumer data (e.g. 23andMe).

What This Is

genome-eval is a local analysis pipeline for the raw data file you get from a consumer DNA test (23andMe, AncestryDNA, MyHeritage, or FamilyTreeDNA). It reads that file and builds a growing, plain-language notebook of findings about your genome: which drugs you metabolize normally, fast, or slowly; whether you carry hidden copies of recessive disease mutations; traits like lactose tolerance and how quickly you metabolize caffeine; polygenic risk scores for conditions like heart disease and type 2 diabetes; and your deep-ancestry maternal and paternal lineages.

Two things make it different from the report your testing company hands you. First, everything runs locally — your raw DNA never leaves your machine; the only things downloaded are public research datasets that have nothing to do with you. Second, it shows its work: every finding records the underlying evidence — study size, effect size, the ancestry of the cohort, two separate confidence axes, and a letter-grade evidence tier — so you can see why it says what it says, and re-grade it later as the science improves. Findings live in an append-only ledger, so revising one never erases its history.
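
As a rough illustration of how a letter grade can be derived from recorded evidence instead of being assigned by hand, here is a minimal sketch. The Evidence class, the compute_tier function, and every threshold in it are hypothetical and chosen only for illustration; the project's actual rules live in tier_rules.py and may look quite different.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Evidence:
    """Hypothetical subset of the evidence a finding records."""
    study_size: int          # participants in the source study
    p_value: float           # reported significance
    replication_count: int   # independent replications


def compute_tier(ev: Evidence) -> str:
    """Map evidence to a grade A-E. All thresholds are made up for this sketch."""
    if ev.p_value < 5e-8 and ev.study_size >= 100_000 and ev.replication_count >= 3:
        return "A"
    if ev.p_value < 5e-8 and ev.study_size >= 10_000 and ev.replication_count >= 1:
        return "B"
    if ev.p_value < 1e-5 and ev.study_size >= 1_000:
        return "C"
    if ev.p_value < 0.05:
        return "D"
    return "E"


# Re-grading: when a larger, better-replicated study appears, only the
# evidence changes; the grade is recomputed from it.
original = Evidence(study_size=20_000, p_value=1e-9, replication_count=1)
updated = replace(original, study_size=250_000, p_value=1e-12, replication_count=4)
print(compute_tier(original), "->", compute_tier(updated))
```

Because the grade is a pure function of the stored metrics, the same finding can be re-graded at any time by feeding it updated evidence.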

Architecture at a Glance

[Figure: architecture diagram. Lanes: Client, Ingest / routing, Compute, Storage, External (public). Flow: You + Claude Code (raw provider file & commands) → normalize.py (parser → standard schema) → standardized parquet (analysis boundary) → Imputation (Beagle 5.4) → Analysis runners (PGx · carrier · PRS · haplogroup · traits) → Ledger + tier_rules (append-only JSONL) → Markdown report (reports/<date>-<id>.md). Public reference data (1000 Genomes · PGS Catalog · ClinVar / ACMG panels · PhyloTree · ISOGG trees) feeds Imputation at step 3a and the Analysis runners at step 5.]

Every arrow is one-way: this is a forward data pipeline, not a request/response service. Numbers match the steps below; steps 3a and 5 are where local copies of public reference data feed the compute stages.

What Happens When You Run an Analysis

  1. You drop your raw provider file into raw-source-genomes/ and run normalize.py <id>, picking a short ID for yourself. That directory is a read-only boundary — the file is never edited again.
  2. The parser auto-detects which company produced the file and converts it to one standard internal format, written to standardized-genomes/<id>.parquet. From here on, nothing reads the raw file — all analysis reads only this normalized parquet plus your profiles/<id>.json metadata.
  3. Imputation fills in the ~97% of genetic variants the chip never directly measured. Beagle compares your measured positions against a public reference panel (3a), the 1000 Genomes Project with its 2,500+ fully sequenced genomes, and statistically infers the gaps, expanding ~640K measured variants to tens of millions.
  4. The analysis runners read the expanded genotypes. Each is one command: pharmacogenomics, carrier screening, polygenic scores, mtDNA and Y haplogroups, and trait calls.
  5. Runners that need external evidence pull it from local copies of public data (5): PGS Catalog weight files and 1000G allele frequencies for polygenic scores, ClinVar/ACMG panels for carrier screening, and PhyloTree / ISOGG trees for haplogroups.
  6. Each result is appended to the ledger as a finding carrying its full evidence: study size, p-value, effect size, replication count, and cohort ancestry. tier_rules.py then computes the letter grade (A–E) from those metrics, so no grade is assigned by hand. A revision appends a new row that supersedes the old one; existing rows are never edited.
  7. On request, generate_report.py reads the active findings (latest non-superseded row per chain) and writes a plain-language markdown report to reports/<date>-<id>.md, grouped by category with confidence and tier shown on every finding.
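
Steps 6 and 7 can be sketched as follows: rows are only ever appended to a JSONL file, and the active findings are the rows that no later row supersedes. The field names (finding_id, supersedes), the helper names, and the sample rows are assumptions made for this sketch, not the project's actual ledger schema.

```python
import json
import tempfile
from pathlib import Path


def append_row(ledger: Path, row: dict) -> None:
    """Append one finding as a JSON line; existing lines are never rewritten."""
    with ledger.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(row) + "\n")


def active_findings(ledger: Path) -> list[dict]:
    """Return the rows that no other row supersedes (latest row per chain)."""
    lines = ledger.read_text(encoding="utf-8").splitlines()
    rows = [json.loads(line) for line in lines if line.strip()]
    superseded = {r["supersedes"] for r in rows if r.get("supersedes")}
    return [r for r in rows if r["finding_id"] not in superseded]


with tempfile.TemporaryDirectory() as tmp:
    ledger = Path(tmp) / "ledger.jsonl"  # hypothetical file name
    append_row(ledger, {"finding_id": "f1", "supersedes": None,
                        "trait": "lactose tolerance", "tier": "B"})
    append_row(ledger, {"finding_id": "f2", "supersedes": None,
                        "trait": "caffeine metabolism", "tier": "C"})
    # A revision is a new row pointing at the one it replaces.
    append_row(ledger, {"finding_id": "f3", "supersedes": "f1",
                        "trait": "lactose tolerance", "tier": "A"})

    active = active_findings(ledger)
    history_len = len(ledger.read_text(encoding="utf-8").splitlines())

print([r["finding_id"] for r in active])  # f1 is superseded by f3
print(history_len)                        # all three rows remain on disk
```

A report generator working this way would read only the active rows, while the full file still holds every earlier version of each finding.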

Tech Stack

Layer               Tools
Language            Python 3.13 (CLI scripts, no web frontend)
Data format         pandas + pyarrow (parquet); append-only JSONL ledger
Imputation          Beagle 5.4 + conform-gt, 1000 Genomes Phase 3 EUR reference panels
Polygenic scores    PGS Catalog weights, 1000G EUR allele frequencies (empirical calibration over 503 samples)
Haplogroups         HaploGrep3 (mtDNA, PhyloTree 17.2) & yhaplo (Y, ISOGG 2016)
Runtime             Bundled portable Temurin JDK 21 (runs Beagle & HaploGrep3; no system Java needed)
Reference evidence  ClinVar / ACMG carrier panels, curated SNP tables, FDA PGx labels
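
To make the polygenic-score entry concrete, the sketch below shows the usual shape of the arithmetic: a score is the sum of per-variant weights times effect-allele dosages, and an empirical calibration places that score among the scores of a reference group. The function names, the rsIDs, the weights, and the five-sample reference list are all invented for illustration; the project's own scoring and its calibration over 503 samples may be implemented differently.

```python
from bisect import bisect_right


def polygenic_score(weights: dict[str, float], dosages: dict[str, float]) -> float:
    """Sum of weight x effect-allele dosage over variants present in both maps."""
    return sum(w * dosages[rsid] for rsid, w in weights.items() if rsid in dosages)


def empirical_percentile(score: float, reference_scores: list[float]) -> float:
    """Percent of reference scores that are less than or equal to `score`."""
    ranked = sorted(reference_scores)
    return 100.0 * bisect_right(ranked, score) / len(ranked)


# Made-up weights (as a PGS Catalog file would supply) and dosages (0, 1, or 2
# copies of the effect allele; imputed dosages can be fractional).
weights = {"rs1": 0.5, "rs2": -0.25, "rs3": 0.125}
dosages = {"rs1": 2, "rs2": 1, "rs3": 0}

score = polygenic_score(weights, dosages)  # 0.5*2 - 0.25*1 + 0.125*0
reference = [0.0, 0.25, 0.5, 0.75, 1.0]    # stand-in for reference-panel scores
percentile = empirical_percentile(score, reference)
print(score, percentile)
```

Ranking against scores computed on real reference genomes, instead of assuming a theoretical distribution, is what makes a calibration "empirical" in this sense.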

Source Code

The project lives on GitHub at github.com/JackVance/genome-eval.