wiki:ImputationTool

Introduction

ImputationTool is a collection of methods for pre- and post-processing of data in imputation-related tasks.

Implementation

ImputationTool developers:

  • Dr. Lude Franke (Lude@…): Format design, Initial methods.
  • Harm-Jan Westra (harm-jan@…): Extensions, Format converters, SNP checks.

It is written in Java and was developed with the NetBeans IDE.

Availability

Documentation

From the ImputationTool help screen:

ImputationTool v0.2


------------------------
PreProcessing
------------------------

# Create random batches of cases and controls from a TriTyper dataset. Creates a file called batches.txt in outdir.
--mode batch --in TriTyperdir --out outdir --size batchsize

------------------------
Imputation
------------------------

# Convert Impute Imputed data into TriTyper
--mode itt --in ImputeDir --out TriTyperDir

------------------------
Beagle
------------------------

# Convert Beagle files (one file/chromosome) to TriTyper. Filetemplate is a template for the filenames. The text CHROMOSOME will be replaced by the chromosome number.
--mode btt --in BeagleDir --tpl template --ext ext --out TriTyperDir [--fam famfile]

# Convert batches of Beagle files (multiple files/chromosome) to TriTyper files. Filetemplate is a template for the batch filenames. The text CHROMOSOME will be replaced by the chromosome number, BATCH by the batchname.
--mode bttb --in BeagleDir --tpl template --out TriTyperDir --size numbatches

------------------------
Ped+Map (Plink files)
------------------------

# Converts Ped and Map files created by ttpmh to Beagle format
--mode pmbg --in indir --batch-file batches.txt

# Converts TriTyper file to Plink Dosage format. Filetemplate is a template for the batch filenames. The text CHROMOSOME will be replaced by the chromosome number, BATCH by the batchname.
--mode ttpd --in indir --beagle beagledir --tpl template --batchdesc batchdescriptor --out outdir --fam famfile

# Converts PED and MAP files to TriTyper.
--mode pmtt --in Ped+MapDir --out TriTyperDir

# Converts TriTyper file to PED and MAP files. The FAM file is optional. --split splits the ped and map files per chromosome
--mode ttpm --in indir --out outdir [--fam famfile] [--split]

# Converts TriTyper dataset to Ped+Map concordant to reference (hap) dataset. Supply a batchfile if you want to export in batches. Supply a chromosome if you want to export a certain chromosome.
--mode ttpmh --in TriTyperDir --hap TriTyperReferenceDir --out outdir [--fam famfile] [--batch-file batchfile] [--chr chromosome] [--exclude fileName]

---------------------
PostProcessing
---------------------

# Correlates genotypes of imputed vs non-imputed datasets. Saves a file called correlationOutput.txt in outdir, containing correlation per chromosome as well as correlation distribution.
--mode corr --in TriTyperDir --name datasetname --in2 TriTyperDir2 --name2 datasetname2 --out outdir [--snps snplist]

# Correlates genotypes of imputed vs non-imputed datasets. Also take Beagle imputation score (R2) into account. Saves a file called correlationOutput.txt in outdir, containing correlation per chromosome as well as correlation distribution.
--mode corrb --in TriTyperDir --name datasetname --in2 TriTyperDir2 --name2 datasetname2 --out outdir --beagle beagleDir --tpl template --size numBatches 

# Gets all the excluded snps from chrx.excludedsnps.txt with a certain call-rate threshold (0 < threshold < 1.0)
--mode ecra --in TriTyperDir --threshold threshold

# Generates R2 distribution (beagle quality score) for each batch and chromosome, and tests each batch against chromosome R2 distribution, using WilcoxonMannWhitney test
--mode r2dist --in BeagleDir --template template --out outdir --size numbatches

# Merge two TriTyper datasets
--mode merge --in TriTyper1Dir --in2 TriTyper2Dir --out outdir
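
Several modes in the help screen take a filename template in which the text CHROMOSOME (and, for batched modes, BATCH) is replaced. The following is a minimal sketch of that substitution, assuming plain string replacement. The function name `expand_template` and the example template are hypothetical and are not part of ImputationTool; the actual file names depend on how the Beagle output was produced.

```python
def expand_template(template, chromosome, batch=None):
    """Fill the CHROMOSOME (and optionally BATCH) placeholders in a filename template."""
    name = template.replace("CHROMOSOME", str(chromosome))
    if batch is not None:
        name = name.replace("BATCH", str(batch))
    return name


# Hypothetical template, used only for illustration.
template = "chrCHROMOSOME.BATCH.out"
print(expand_template(template, 1, "batch0"))  # prints "chr1.batch0.out"
```

For modes that use one file per chromosome, the `batch` argument can simply be omitted.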

Example

A common scenario for ImputationTool: suppose that we have the following directory structure:

  • study
    • study.ped
    • study.map
  • reference
    • reference.ped
    • reference.map

To impute the study against the reference with Beagle:

  • mkdir study_TriTyper
  • Convert study ped/map data to TriTyper:
    • java -jar ImputationTool.jar --mode pmtt --in study/ --out study_TriTyper
  • mkdir reference_TriTyper
  • Convert reference ped/map data to TriTyper:
    • java -jar ImputationTool.jar --mode pmtt --in reference/ --out reference_TriTyper
  • mkdir batches
  • Create batches of 300 samples of the study:
    • java -jar ImputationTool.jar --mode batch --in study_TriTyper --out batches/ --size 300
  • mkdir reference_analyzed
  • Export the reference to Ped+Map files (this also performs some quality checks):
    • java -jar ImputationTool.jar --mode ttpmh --in reference_TriTyper --hap reference_TriTyper --out reference_analyzed
  • Convert the reference to Beagle format (repeat for the remaining chromosomes):
    • java -jar linkage2beagle.jar reference_analyzed/chr1.dat reference_analyzed/chr1.ped > reference_analyzed/chr1.bgl
  • mkdir study_reference_compare
  • Compare the study against the reference:
    • java -jar ImputationTool.jar --mode ttpmh --in study_TriTyper --hap reference_TriTyper --batch-file batches/batches.txt --out study_reference_compare
  • Convert the analyzed study to Beagle format (repeat for the remaining chromosomes):
    • java -jar linkage2beagle.jar study_reference_compare/chr1.dat study_reference_compare/chr1.ped > study_reference_compare/chr1.bgl
  • mkdir RESULTS
  • Run the imputation step (requires Beagle):
    • java -jar beagle.jar phased=reference_analyzed/chr1.bgl unphased=study_reference_compare/chr1.bgl markers=reference_analyzed/chr1.markersBeagleFormat missing=0 out=RESULTS/output
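
The conversion and imputation steps above are shown for chromosome 1 only. The following sketch generates the same commands for each autosome as a dry run (it only prints them). It assumes that the files for every chromosome follow the chrN naming seen above and that chromosomes 1 to 22 are present; the function name `chromosome_commands` and the per-chromosome output prefix `RESULTS/chrN` are hypothetical choices, the latter made so that results of different chromosomes do not overwrite each other.

```python
def chromosome_commands(chrom):
    """Build the shell commands for one chromosome, mirroring the chr1 example."""
    ref = f"reference_analyzed/chr{chrom}"
    study = f"study_reference_compare/chr{chrom}"
    return [
        # Convert reference and study Ped+Map output to Beagle format.
        f"java -jar linkage2beagle.jar {ref}.dat {ref}.ped > {ref}.bgl",
        f"java -jar linkage2beagle.jar {study}.dat {study}.ped > {study}.bgl",
        # Impute the study against the reference.
        f"java -jar beagle.jar phased={ref}.bgl unphased={study}.bgl "
        f"markers={ref}.markersBeagleFormat missing=0 out=RESULTS/chr{chrom}",
    ]


# Dry run: print the commands for chromosomes 1..22 (assumed to be autosomes only).
for chrom in range(1, 23):
    for command in chromosome_commands(chrom):
        print(command)
```

The printed lines can be reviewed and then pasted into a shell or saved as a script.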

The TriTyper Format

TriTyper is a binary format to store genotype information, including insertion, deletion and expression data, providing very efficient read/write/seek methods.

Filtering

In the ttpmh mode, ImputationTool applies the following filtering steps when comparing a study (GWAS) dataset to a reference dataset:

  • Assesses alleles and swaps the SNP if needed. For example:

ref: C/T, GWAS: A/G --> the GWAS alleles need to be complemented (T/C) and swapped to become C/T

  • Checks Hardy-Weinberg equilibrium (p-value <= 0.0001), MAF (< 0.01) and call rate (< 0.95). A SNP that meets any of these criteria is removed.
  • Checks if the SNP is present in the reference data. If not, the SNP is removed from the GWAS data.
  • Checks if the SNP has null alleles. If so, the SNP is removed from the GWAS data.
  • Checks if the allele frequency is comparable to the reference. If not (>25% difference), the SNP is removed from the GWAS data.
  • Assesses if the haplotype structure is comparable between reference and GWAS data. This is performed by pairwise comparison of r-squared between SNPs in both reference and GWAS. For SNPs in LD (r-squared > 0.1), the allele frequencies are compared. SNPs are removed from the GWAS data when the major allele differs more often than it is identical.
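
The allele check and the threshold checks listed above can be illustrated with a small sketch. This is not ImputationTool's code: the function names are hypothetical, the Hardy-Weinberg p-value is taken as an input rather than computed, and the ">25% difference" rule is assumed here to mean an absolute allele-frequency difference greater than 0.25.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def harmonize_alleles(ref, gwas):
    """Describe how the GWAS alleles map onto the reference alleles.

    Returns 'same', 'swap', 'complement' or 'complement+swap', or None when
    the alleles cannot be matched (including null or unknown alleles).
    Strand-ambiguous SNPs (A/T, C/G) cannot be resolved from alleles alone;
    this sketch reports them as 'same' or 'swap'.
    """
    if any(allele not in COMPLEMENT for allele in ref + gwas):
        return None
    flipped = tuple(COMPLEMENT[allele] for allele in gwas)
    if gwas == ref:
        return "same"
    if gwas[::-1] == ref:
        return "swap"
    if flipped == ref:
        return "complement"
    if flipped[::-1] == ref:
        return "complement+swap"
    return None


def passes_thresholds(hwe_p, maf, call_rate, gwas_freq, ref_freq):
    """Apply the thresholds from the list above; True means the SNP is kept."""
    if hwe_p <= 0.0001:  # deviates from Hardy-Weinberg equilibrium
        return False
    if maf < 0.01:  # minor allele frequency too low
        return False
    if call_rate < 0.95:  # too many missing genotypes
        return False
    if abs(gwas_freq - ref_freq) > 0.25:  # assumed absolute difference
        return False
    return True


# The example from the text: reference C/T, GWAS A/G.
print(harmonize_alleles(("C", "T"), ("A", "G")))  # prints "complement+swap"
print(passes_thresholds(hwe_p=0.2, maf=0.15, call_rate=0.99,
                        gwas_freq=0.30, ref_freq=0.70))  # prints "False"
```

The haplotype-structure check (pairwise r-squared between SNPs) needs genotype data for both datasets and is left out of this sketch.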

Last modified on Sep 19, 2011 5:12:30 PM