ImputationPipeline – bbmri

This page describes the Imputation pipelines developed by the GoNL - Impute team. Please contribute. For help with trac wiki formatting: http://trac.edgewall.org/wiki/WikiFormatting
All scripts presented here are located in our SVN repository: http://www.bbmriwiki.nl/svn/imputation
Minutes of our Team Calls: http://www.bbmriwiki.nl/wiki/Imputations/Minutes

Contributors and Teams

UMC Groningen: Alexandros Kanterakis alexandros.kanterakis@…

Study data

Reference data

1000 Genomes October 2011 release
GoNL pilot 3, 48 trios, 192 haplotypes

Pre processing

Normalize beagle datasets

Location: http://www.bbmriwiki.nl/svn/Imputation/alex/scripts/Normalize_beagle_datasets.ftl
Takes a list of beagle and marker files and applies the following checks:
Checks whether the alleles of each SNP are compatible between study and reference. If an incompatibility cannot be corrected by SNP inversion, the SNP is discarded.
Checks whether a SNP has null alleles. If so, the SNP is removed from the study data.
Checks whether two SNPs with the same reference code (rs) are in the same position.
Checks whether two SNPs in the same position have the same reference code (rs).
Checks whether a SNP in the study has MAF < MAF_minimum, HWE < HWE_minimum or CR < CR_minimum. If any of these criteria is met, the SNP is discarded. (MAF = Minor Allele Frequency, HWE = Hardy-Weinberg Equilibrium, CR = Call Rate)
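
The compatibility check with SNP inversion can be illustrated with a minimal Python sketch. It is not taken from Normalize_beagle_datasets.ftl; the function name reconcile_alleles and the return labels are hypothetical, and it assumes that "inversion" means replacing each study allele by its complementary base.

```python
# Complement of each nucleotide, used to flip a SNP to the opposite strand.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reconcile_alleles(study_alleles, reference_alleles):
    """Classify one SNP as 'ok', 'inverted' or 'discard' (hypothetical labels).

    Both arguments are collections of observed alleles, e.g. {"A", "G"}.
    """
    study = set(study_alleles)
    reference = set(reference_alleles)
    # Alleles already agree with the reference: nothing to do.
    if study <= reference:
        return "ok"
    # Try the complementary strand; alleles outside A/C/G/T cannot be flipped.
    flipped = {COMPLEMENT[a] for a in study if a in COMPLEMENT}
    if len(flipped) == len(study) and flipped <= reference:
        return "inverted"
    # Neither orientation matches the reference.
    return "discard"
```

Note that A/T and C/G SNPs look identical on both strands, so a check of this kind cannot detect a strand problem for them.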

The script generates a log file with all inconsistencies found. The file ends with a summary of the problems found:

SNPs inverted: SNPs that were inverted to match the reference. For example, A/G SNPs in the reference that are T/C SNPs in the study.
Allele problems: Number of SNPs with inconsistent alleles in study and reference that could not be fixed by flipping.
Position problems (different references, same loci): SNPs at the same locus with different rs numbers. These SNPs are NOT removed. We keep the rs number of the reference panel.
Unresolved single alleles problems: SNPs in the study that have only one allele. These SNPs are filtered out.
Double rs codes problems: SNPs with a duplicated rs code. These SNPs are filtered out.
SNPs in study with MAF < MAF_minimum: SNPs with MAF below the MAF_minimum that was set.
SNPs in study with HWE < HWE_minimum: SNPs with HWE below the HWE_minimum that was set.
SNPs in study with CR < CR_minimum: SNPs with Call Rate below the CR_minimum that was set.
SNPs that differ in Allele Frequencies: SNPs whose difference in AF between reference and study is over the CR_minimum that was set.
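
The MAF, HWE and CR filters reported in this summary can be sketched as follows. This is an illustration under stated assumptions, not the script's code: genotypes are coded as 0, 1 or 2 copies of one allele with None for a missing call, the HWE value is taken to be the p-value of a one-degree-of-freedom chi-square test (the actual script may use a different test), and the function names are hypothetical.

```python
import math


def snp_qc_stats(genotypes):
    """Return (call_rate, maf, hwe_p) for one SNP.

    genotypes: list of 0, 1 or 2 (copies of allele B), None when missing.
    """
    called = [g for g in genotypes if g is not None]
    n = len(called)
    call_rate = n / len(genotypes) if genotypes else 0.0
    if n == 0:
        return call_rate, 0.0, 1.0
    freq_b = sum(called) / (2 * n)
    maf = min(freq_b, 1 - freq_b)
    if maf == 0.0:
        # Monomorphic SNP: the HWE test is undefined, report p = 1.
        return call_rate, maf, 1.0
    freq_a = 1 - freq_b
    observed = [called.count(k) for k in (0, 1, 2)]
    expected = [n * freq_a ** 2, 2 * n * freq_a * freq_b, n * freq_b ** 2]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square distribution with 1 degree of freedom.
    hwe_p = math.erfc(math.sqrt(chi2 / 2))
    return call_rate, maf, hwe_p


def passes_qc(genotypes, maf_minimum, hwe_minimum, cr_minimum):
    """True when the SNP meets none of the three discard criteria."""
    call_rate, maf, hwe_p = snp_qc_stats(genotypes)
    return maf >= maf_minimum and hwe_p >= hwe_minimum and call_rate >= cr_minimum
```

A SNP is kept only when all three values reach their minimum, which mirrors the rule that meeting any single criterion leads to removal.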

Options:

input_beagle_study : The study in beagle format
input_beagle_reference : The reference in beagle format
input_markers_study : The study's markers in beagle format
input_markers_reference : The reference's markers in beagle format
output_beagle_study : The Normalized output of the study (Use this as "study" for imputation)
output_beagle_reference : The Normalized output of the reference (Normally you will not use this file)
output_markers_study : The Markers of the normalized study
output_markers_reference : The Markers of the normalized reference
output_log_filename : the log filename

Imputation software

Impute2
Beagle
MaCH / Minimac

Quality metrics

Convert impute2 gprobs to TPED

Location: http://www.bbmriwiki.nl/svn/Imputation/alex/scripts/Convert_impute2_gprobs_to_PEDMAP_beagle.ftl

This script converts the results of an impute2 imputation to TPED. You can define an R2 threshold, where R2 is the allelic R2 described in http://www.sciencedirect.com/science/article/pii/S0002929709000123#sec2.7.2 . To obtain a complete TPED / TFAM dataset, copy the TFAM from the original study.

Options:

input_impute2_gprobs_filename : The gprobs file generated from impute2
output_TPED_filename : The output TPED filename
output_stats_filename : The file where the R2 estimates will be printed. It will contain ALL the R2 values, not only those surpassing the threshold
chromosome : The chromosome of this study
r2_threshold : The R2 threshold
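
The core of such a conversion can be sketched in Python. This is a hedged illustration rather than the content of the .ftl script: it assumes the usual impute2 layout of five leading columns (SNP id, rs id, position, allele A, allele B) followed by three genotype probabilities per individual, and the call_threshold parameter is a hypothetical addition that is not one of the options listed above. The per-SNP R2 filter is not reproduced here.

```python
def gprobs_line_to_tped(line, chromosome, call_threshold=0.9):
    """Convert one impute2 genotype-probability line to one TPED line.

    Each individual gets the most likely genotype, or the missing
    genotype "0 0" when its probability is below call_threshold.
    """
    fields = line.split()
    _snp_id, rs_id, position, allele_a, allele_b = fields[:5]
    probs = [float(x) for x in fields[5:]]
    if len(probs) % 3 != 0:
        raise ValueError("expected three probabilities per individual")
    # Genotypes in the order of the probability triple: AA, AB, BB.
    genotype_alleles = [
        (allele_a, allele_a),
        (allele_a, allele_b),
        (allele_b, allele_b),
    ]
    calls = []
    for i in range(0, len(probs), 3):
        triple = probs[i:i + 3]
        best = max(range(3), key=lambda k: triple[k])
        if triple[best] >= call_threshold:
            calls.extend(genotype_alleles[best])
        else:
            calls.extend(("0", "0"))
    # TPED columns: chromosome, SNP id, genetic distance, position, genotypes.
    return " ".join([str(chromosome), rs_id, "0", position] + calls)
```

The genetic distance column is written as 0 because the gprobs file carries no genetic map information.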

Statistics_of_imputation_results

Location: http://www.bbmriwiki.nl/svn/Imputation/alex/scripts/Statistics_of_imputation_results.ftl

Computes several statistics of imputation results. This is suitable when we have "real" genotype data to benchmark our imputation pipeline. The computed statistics are:

Allelic R2 : according to http://www.sciencedirect.com/science/article/pii/S0002929709000123#sec2.7.2
Real_Allelic_R2 : Computes the R2 (or coefficient of determination) between a real and an imputed genotype.
Imputation_Allele_Frequency and Standardized_allele_frequency_error : (From: http://www.sciencedirect.com/science/article/pii/S0002929709000123) Allele-frequency error is the difference between the true allele frequency in the sample and the estimated allele frequency in the sample computed from the posterior genotype probabilities. If the three posterior genotype probabilities for an individual are denoted pAA, pAB, and pBB, then the estimated A allele frequency is found by summing (2pAA + pAB) over all individuals and dividing by twice the number of individuals. However, allele-frequency error is difficult to interpret unless the true allele frequency and sample size are known. The standardized allele-frequency error is therefore computed as abs(p - q) / sqrt(p * (1 - p) / (2 * n)), where p is the allele frequency in the sample of n individuals from a population in Hardy-Weinberg equilibrium and q is the estimated allele frequency obtained from the imputed posterior genotype probabilities.
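
The allele-frequency statistics above translate directly into code. The sketch below is illustrative and not taken from Statistics_of_imputation_results.ftl; the function names are hypothetical. For Real_Allelic_R2 it assumes that genotypes are expressed as allele dosages (0 to 2) and that R2 is the squared Pearson correlation between real and imputed dosages.

```python
import math


def estimated_allele_frequency(posteriors):
    """Estimated A allele frequency from (pAA, pAB, pBB) tuples, one per individual."""
    n = len(posteriors)
    return sum(2 * p_aa + p_ab for p_aa, p_ab, _p_bb in posteriors) / (2 * n)


def standardized_allele_frequency_error(p, q, n):
    """abs(p - q) / sqrt(p * (1 - p) / (2 * n)).

    p: true allele frequency in the sample, q: estimated frequency,
    n: number of individuals. Undefined when p is 0 or 1.
    """
    return abs(p - q) / math.sqrt(p * (1 - p) / (2 * n))


def real_allelic_r2(real_dosages, imputed_dosages):
    """Squared Pearson correlation between real and imputed allele dosages."""
    n = len(real_dosages)
    mean_x = sum(real_dosages) / n
    mean_y = sum(imputed_dosages) / n
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(real_dosages, imputed_dosages))
    sxx = sum((x - mean_x) ** 2 for x in real_dosages)
    syy = sum((y - mean_y) ** 2 for y in imputed_dosages)
    if sxx == 0 or syy == 0:
        # No variation in one of the inputs: correlation is undefined.
        return float("nan")
    return sxy * sxy / (sxx * syy)
```

For example, with four individuals whose posteriors are (1, 0, 0), (0, 1, 0), (0, 0, 1) and (0.5, 0.5, 0), the sum of (2pAA + pAB) is 4.5, so the estimated A allele frequency is 4.5 / 8 = 0.5625.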

Options:

input_beagle_dosage_filename : The output of the beagle imputation
input_beagle_unimputed_filename : The beagle file with the "real", un-imputed genotypes
output_filename : Output filename for the stats

Complete pipelines

Results

References

Brian L. Browning, Sharon R. Browning. A Unified Approach to Genotype Imputation and Haplotype-Phase Inference for Large Data Sets of Trios and Unrelated Individuals. AJHG, Volume 84, Issue 2, 13 February 2009, Pages 210-223. doi:10.1016/j.ajhg.2009.01.005
http://www.plosgenetics.org/article/info%3Adoi%2F10.1371%2Fjournal.pgen.1000529 Impute2
http://www.biomedcentral.com/1471-2156/10/27 Accuracy of genome-wide imputation of untyped markers and impacts on statistical power for association studies
The effect of genome-wide association scan quality control on imputation outcome for common variants. QC on scan quality has minimal effect on imputation quality: http://www.nature.com/ejhg/journal/v19/n5/full/ejhg2010242a.html , http://www.nature.com/ng/journal/v39/n7/full/ng2088.html (Marchini) , http://www.nature.com/nrg/journal/v11/n7/full/nrg2796.html
Hickey JM, Kinghorn BP, Tier B, Wilson JF, Dunstan N, van der Werf JH. A combined long-range phasing and long haplotype imputation method to impute phase for SNP genotypes. Genet Sel Evol. 2011 Mar 10;43:12. http://www.ncbi.nlm.nih.gov/pubmed/21388557
The International HapMap Consortium. Integrating common and rare genetic variation in diverse human populations. Nature 467, 52-58. 2010. The HapMap Project
The use of imputation in GWAS studies: http://www.nature.com/ejhg/journal/v19/n2/full/ejhg2010157a.html?WT.i_dcsvid=%25%25LIST_ID%25%25-%25%25RECIPIENT_ID%25%25&WT.ec_id=MARKETING&WT.mc_id=EG1107CV030 Politopoulos I. et al. Genome-wide association of breast cancer: composite likelihood with imputed genotypes. European Journal of Human Genetics (2011) 19, 194–199.
Introduction of imputation? http://www.ncbi.nlm.nih.gov/pubmed/19165921 Detection of sharing by descent, long-range phasing and haplotype imputation.
The effect of genome-wide association scan quality control on imputation outcome for common variants. European Journal of Human Genetics (2011) 19, 610–614. http://www.nature.com/ejhg/journal/v19/n5/full/ejhg2010242a.html . "We found that the difference between the unQCed data and the fully QCed data on imputation outcome was minimal. Our study shows that imputation of common variants is generally very accurate and robust to GWAS QC, which is not a major factor affecting imputation outcome. A minority of common-frequency SNPs with particular properties cannot be accurately imputed regardless of QC stringency. These findings may not generalise to the imputation of low frequency and rare variants."
