Mathematics and Computer Science

Pilot for Evaluation, Development, and Application of Genotype Imputation Algorithms and Pipelines at Scale

Scaling and benchmarking study for analysis of hundreds of thousands of experimental genotype samples

The genomic characterization of 500,000 participants in the Million Veteran Program can be combined with their complete, longitudinal medical records to identify genomic determinants of outcome for many critical health care delivery questions. Characterizing genotypes efficiently requires imputation tools that infer complete genotypes from experimental data. Such tools have been shown to work on thousands of samples, but challenges remain in scaling the analysis to hundreds of thousands of samples. To address these challenges, we are conducting a scaling and benchmarking study.

We will review the state of the art in large-scale imputation analysis, build an initial set of analytical workflows using the latest reference panels on computational resources at the DOE laboratories, create performance benchmarks, develop computation profiles and strategies for scaling up the analysis, and eventually run all of the samples through three popular imputation tools. We will adopt best practices in data management to enhance the reproducibility of the analyses. These software tools and processes will be developed in a scalable manner in collaboration with VA researchers.

These techniques are most directly needed for the cardiovascular health project but have clear applications to prostate cancer and suicide prevention.