Changes between Version 1 and Version 2 of ChipBasedQcPipeline


Timestamp:
Sep 26, 2010 7:03:18 PM (14 years ago)
Author:
Yurii Aulchenko
Comment:

re-structuring the document

  • ChipBasedQcPipeline

    v1 v2  
    33= Protocol for comparison between sequencing (VCF) and chip data =
    44
    5 Created: Yurii Aulchenko, 2010.09.12
     5'''Status''': under development
    66
    7 Modified: (please fill in your name and date)
     7'''Contributors''': Yurii
     8
     9== Summary ==
    810
    911Aims of application of the protocol:
     
    1517 * Check if these rates are in agreement with those theoretically expected (so that we do not miss any important experimental factor).
    1618
    17 Below, a workflow is provided. This document assumes that VCF v.4 format (http://1000genomes.org/wiki/doku.php?id=1000_genomes:analysis:vcf4.0) is used and that the “+” strand is used in VCF. It is assumed that chip data come in ??? format (). This protocol does not specify the exact tools to be used; as most of the operations are data manipulations, it is up to the analysts involved to decide which tool is more convenient for them. The actual implementation of the workflow should allow reproducing the results and applying the same workflow to other data. This is important not only as good practice, but also because more data will come to the same pipeline in the future.
     19== Idea ==
    1820
    19 == CHIP-VCF BUILD AND DBSNP MATCHING TABLE ==
     21A principal idea of what questions should be addressed (without saying how) is summarized in ChipBasedQcPipelineIdea.
    2022
    21 Before starting, check which reference build is used by the VCF files and which dbSNP version is used to name the SNPs. If any of these differs between the chip and VCF data (and/or the strand is not always ‘+’ in the chip data), arrange a tab-separated file containing a conversion table (name: chip_data_conversion_table_yyyy.mm.dd.txt):
     23== Workflow ==
    2224
    23 Header line:
    24 
    25 CHRC    POSC    SNPC    STRANDC A1C     A2C     CHRV    POSV    SNPV    A1V     A2V
    26 
    27 The following lines should all contain 11 tab-delimited values. There should be no missing values.
    28  * CHR[C/V]: chromosome the SNP is located on, according to Chip [C] and VCF [V] build versions <integer from 1 to 22 for autosomes, “X” for X-chromosome, and “Y” for Y-chromosome>
    29  * POS[C/V]: chromosomal position, according to Chip [C] and VCF [V] build versions <integer>
    30  * SNP[C/V]: SNP rs-name, according to Chip [C] and VCF [V] dbSNP versions <alphanumeric>
    31  * STRANDC: strand in chip annotation <single character, either “+” or “-”>
    32  * A1C: first allele in chip annotation <single character, either “A”, “C”, “G” or “T”>
    33  * A2C: second allele in chip annotation <single character, either “A”, “C”, “G” or “T”>
    34  * A1V: (translated) first chip allele (A1C) according to “+” strand on VCF build <single character, either “A”, “C”, “G” or “T”>
    35  * A2V: (translated) second chip allele (A2C) according to “+” strand on VCF build <single character, either “A”, “C”, “G” or “T”>
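
The translation of A1C/A2C into A1V/A2V can be sketched as below. This is only an illustration, not a prescribed tool: the function name `to_plus_strand` is hypothetical, and the sketch assumes that the only difference between chip and VCF alleles is the strand recorded in STRANDC (a change of strand orientation between builds is not handled).

```python
# Hypothetical helper, not part of the protocol: complement of each nucleotide.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def to_plus_strand(allele, strand):
    """Return the chip allele as it reads on the "+" strand.

    Assumption: `strand` is the STRANDC value ("+" or "-") and the VCF
    always reports alleles on the "+" strand.
    """
    if strand == "+":
        return allele
    if strand == "-":
        return COMPLEMENT[allele]
    raise ValueError(f"unexpected strand: {strand!r}")


# Example: a SNP annotated as A/G on the "-" strand of the chip.
a1v = to_plus_strand("A", "-")
a2v = to_plus_strand("G", "-")
print(a1v, a2v)
```

A lookup table is used rather than string tricks so that any allele outside A/C/G/T fails loudly with a `KeyError`.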
    36 
    37 Questions:
    38  * Q: what is the build and dbSNP version used by chip and VCF?
    39  * Q: how many SNPs changed the name in VCF build?
    40  * Q: how many SNPs changed the strand in VCF build?
    41  * Q: please provide a 2x2 table (name change/not) x (strand change/not)
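
One possible way to obtain the 2x2 table from the conversion table is sketched below. It assumes, as an interpretation not spelled out above, that a name change means SNPC differs from SNPV and that a strand change means STRANDC is not “+”. The function name `change_table` and the two example rows are invented for illustration.

```python
import csv
import io
from collections import Counter


def change_table(rows):
    """Count SNPs by (name changed, strand changed).

    Assumptions: a name change is SNPC != SNPV; a strand change is
    STRANDC != "+", because the VCF is taken to be on the "+" strand.
    """
    counts = Counter()
    for row in rows:
        name_changed = row["SNPC"] != row["SNPV"]
        strand_changed = row["STRANDC"] != "+"
        counts[(name_changed, strand_changed)] += 1
    return counts


# Invented two-row conversion table, for illustration only.
example = io.StringIO(
    "CHRC\tPOSC\tSNPC\tSTRANDC\tA1C\tA2C\tCHRV\tPOSV\tSNPV\tA1V\tA2V\n"
    "1\t100\trs1\t+\tA\tG\t1\t110\trs1\tA\tG\n"
    "1\t200\trs2\t-\tA\tG\t1\t210\trs20\tT\tC\n"
)
table = change_table(csv.DictReader(example, delimiter="\t"))

# Rows: name unchanged/changed; columns: strand unchanged/changed.
for name_changed in (False, True):
    print(name_changed, table[(name_changed, False)], table[(name_changed, True)])
```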
    42 
    43 Note that in the future, samples typed on a number of different chip platforms will be coming in. Therefore, the above step should not assume that a particular chip is used!
    44 
    45 == UPDATED CHIP GENOTYPES ==
    46 
    47 Using the translation table described above, generate an updated chip genotypes file (name: chip_genotypes_yyyy.mm.dd.txt).
    48 
    49 This is a tab-delimited text file containing a table. The header line is
    50 
    51 ID      SNPV    QUALCHIP        A1VCHIP A2VCHIP GTCHIP
    52 
    53 The following lines should all contain 6 tab-delimited values. Use “.” (dot) for missing.
    54  * ID: sample ID (genotyped individual’s code) <alphanumeric>
    55  * SNPV: SNP rs-name, according to VCF dbSNP version <alphanumeric>
    56  * QUALCHIP: calling quality for the individual genotype
    57  * A1VCHIP: first allele of the personal genotype, translated according to “+” strand on VCF build <single character, either “A”, “C”, “G” or “T”>
    58  * A2VCHIP: second allele of the personal genotype, translated according to “+” strand on VCF build <single character, either “A”, “C”, “G” or “T”>
    59  * GTCHIP: genotype with alleles in alphabetic order <two characters, each either “A”, “C”, “G” or “T”>
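
As an illustration of the GTCHIP column, a genotype with alleles in alphabetic order can be built from A1VCHIP and A2VCHIP as sketched here. The function name `ordered_genotype` is hypothetical, and returning a single “.” when either allele is missing is an assumption about how missing genotypes should be coded.

```python
def ordered_genotype(a1, a2, missing="."):
    """Join two alleles in alphabetic order, e.g. ("T", "A") -> "AT".

    Assumption: if either allele is missing, the whole genotype is
    reported as missing (".").
    """
    if missing in (a1, a2):
        return missing
    return "".join(sorted((a1, a2)))


print(ordered_genotype("T", "A"))
print(ordered_genotype("G", "G"))
print(ordered_genotype(".", "A"))
```

Ordering the alleles makes “AT” and “TA” compare as equal, which is what allows GTCHIP to be compared directly with GTVCF later on.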
    60 
    61 Questions:
    62  * Q: do all SNPs in chip data have rs-number?
    63  * Q: what alleles are observed in chip data? Only A/T/G/C?
    64  * Q: are all SNPs bi-allelic?
    65 
    66 == EXTRACTION OF CHIP SNPS FROM VCF FILE ==
    67 
    68 From VCF, extract only lines containing SNPs also observed in the chip (see SNPV column of “chip_data_conversion_table_yyyy.mm.dd.txt”)
    69 
    70 Parse extracted lines, and arrange “Annotation” and “Genotypic” tables
    71 
    72 Annotation table (name: VCF_annotation_yyyy.mm.dd.txt). Tab-delimited file with a header line (and, consequently, extracting the following columns from VCF):
    73 
    74 CHROM   POS     ID      REF     ALT     QUAL    FILTER  INFO
    75 
    76 At the beginning of the file, add meta-info from VCF file
    77 
    78 Genotypic table (name: VCF_genotypes_yyyy.mm.dd.txt). Tab-delimited file containing the following information. Header line:
    79 
    80 ID      SNPV    GTVCF   GQ      DP      BATCH   ????
    81 
    82 Next lines should all contain XXX tab-delimited values. Use “.” (dot) for missing.
    83  * ID: sample ID (genotyped individual’s code) <alphanumeric>
    84  * SNPV: SNP rs-name, according to VCF dbSNP version <alphanumeric>
    85  * GTVCF: genotype with alleles in alphabetic order, <two characters, each either “A”, “C”, “G” or “T”>. This can be done by mapping the numbers provided in VCF GT field to REF and ALT and then ordering.
    86  * GQ, DP: directly from VCF file
    87  * BATCH …
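
The mapping from the VCF GT field to GTVCF described above might look like the following sketch. The function name `gt_to_genotype` is hypothetical. It assumes diploid calls whose allele indices are separated by “/” or “|”, with 0 standing for REF and 1, 2, ... for the comma-separated ALT alleles; anything else (including a missing call) is returned as “.”.

```python
import re


def gt_to_genotype(gt, ref, alt, missing="."):
    """Translate a VCF GT value such as "0/1" into an ordered genotype.

    Assumptions: diploid genotype, allele index 0 is REF, indices 1..n
    refer to the comma-separated ALT alleles, and single-base alleles.
    """
    alleles = [ref] + alt.split(",")
    calls = re.split(r"[/|]", gt)
    if len(calls) != 2 or "." in calls:
        return missing
    bases = [alleles[int(call)] for call in calls]
    return "".join(sorted(bases))


print(gt_to_genotype("0/1", "G", "A"))
print(gt_to_genotype("1|2", "G", "T,A"))
print(gt_to_genotype("./.", "G", "A"))
```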
    88 
    89 Merge chip and VCF genotypic tables (“chip_genotypes_yyyy.mm.dd.txt” and “VCF_genotypes_yyyy.mm.dd.txt”) using ID and SNPV as key variables. Keep all chip genotypes, substituting missing (“.”) when no information is available from VCF. Name the table “merged_chip_and_VCF_genotypes_yyyy.mm.dd.txt”.
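
The merge amounts to a left join of the chip table with the VCF table on (ID, SNPV). A minimal sketch using plain dictionaries follows; the function name `merge_genotypes` and the example rows are invented, and only the GTVCF, GQ and DP columns are carried over because the remaining VCF columns are not yet specified.

```python
def merge_genotypes(chip_rows, vcf_rows, vcf_fields=("GTVCF", "GQ", "DP")):
    """Left-join chip rows with VCF rows on the (ID, SNPV) key.

    Every chip row is kept; VCF fields are set to "." when the VCF
    table has no row for that sample and SNP.
    """
    vcf_by_key = {(row["ID"], row["SNPV"]): row for row in vcf_rows}
    merged = []
    for chip in chip_rows:
        vcf = vcf_by_key.get((chip["ID"], chip["SNPV"]), {})
        row = dict(chip)
        for field in vcf_fields:
            row[field] = vcf.get(field, ".")
        merged.append(row)
    return merged


# Invented example rows, for illustration only.
chip = [
    {"ID": "s1", "SNPV": "rs1", "GTCHIP": "AG"},
    {"ID": "s1", "SNPV": "rs2", "GTCHIP": "CC"},
]
vcf = [{"ID": "s1", "SNPV": "rs1", "GTVCF": "AG", "GQ": "40", "DP": "12"}]
merged = merge_genotypes(chip, vcf)
for row in merged:
    print(row)
```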
    90 
    91 Questions:
    92  * Q: What are the count and proportion of genotypes that do not match between GTCHIP and GTVCF? How much do these counts/proportions change when dropping rows with QUALCHIP < X (vary X)? How much do these counts/proportions change when dropping rows with GQ (DP) < X (vary X)?
    93  * Q: What is the proportion of false-positive and false-negative findings in our study, if we do not take trio structure into account?
    94  * Q: Find the QC metric thresholds maximizing specificity and sensitivity.
    95 
    96 Update the table with variable “CHIPVCFMISMATCH” (1 if mismatch, 0 for match, missing (“.”) if any is missing).
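
The CHIPVCFMISMATCH variable, and the mismatch counts asked for above, could be computed along these lines. The function names `chip_vcf_mismatch` and `mismatch_summary` are hypothetical, and treating a missing GQ as failing the threshold is an assumption.

```python
def chip_vcf_mismatch(gtchip, gtvcf, missing="."):
    """Return "1" for a mismatch, "0" for a match, "." if either is missing."""
    if gtchip == missing or gtvcf == missing:
        return missing
    return "0" if gtchip == gtvcf else "1"


def mismatch_summary(rows, min_gq=None):
    """Count and proportion of mismatches among comparable genotypes.

    Assumption: when `min_gq` is given, rows with missing GQ or with
    GQ below the threshold are dropped before counting.
    """
    flags = []
    for row in rows:
        if min_gq is not None and (row["GQ"] == "." or float(row["GQ"]) < min_gq):
            continue
        flag = chip_vcf_mismatch(row["GTCHIP"], row["GTVCF"])
        if flag != ".":
            flags.append(flag)
    n_mismatch = flags.count("1")
    proportion = n_mismatch / len(flags) if flags else None
    return n_mismatch, proportion


# Invented merged rows, for illustration only.
rows = [
    {"GTCHIP": "AG", "GTVCF": "AG", "GQ": "40"},
    {"GTCHIP": "AG", "GTVCF": "AA", "GQ": "10"},
    {"GTCHIP": "CC", "GTVCF": ".", "GQ": "."},
]
print(mismatch_summary(rows))
print(mismatch_summary(rows, min_gq=20))
```

Calling `mismatch_summary` over a range of `min_gq` values gives the "vary X" curve; the same pattern would apply to QUALCHIP or DP.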
    97 
    98  * Q: Explore which variables are significant predictors of mismatch, using multiple logistic regression.
    99 
    100 
    101 == CHIP SNPS MISSING FROM VCF ==
    102 
    103 Write the list of the chip SNPs not in VCF into the file “list_of_chip_snps_missing_in_VCF_yyyy.mm.dd.txt” (single column containing SNPV name). This should be done when matching chip SNPs with VCF SNPs (see section “EXTRACTION OF CHIP SNPS FROM VCF FILE”).
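
Finding the chip SNPs absent from the VCF is a set difference on SNP names. A small sketch follows, with the hypothetical function name `chip_snps_missing_in_vcf` and invented rs-names; it assumes that the VCF ID column already holds names comparable with SNPV.

```python
def chip_snps_missing_in_vcf(chip_snpv, vcf_ids):
    """Return the sorted chip SNP names (SNPV) that never occur in the VCF."""
    vcf_ids = set(vcf_ids)
    return sorted(snp for snp in set(chip_snpv) if snp not in vcf_ids)


# Invented names, for illustration only.
missing = chip_snps_missing_in_vcf(["rs3", "rs1", "rs2", "rs1"], ["rs2", "rs9"])
print("\n".join(missing))
```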
    104 
    105  * Q: How many variants do we miss in VCF (how many SNPs in file list_of_chip_snps_missing_in_VCF_yyyy.mm.dd.txt)?
    106 
    107 For each SNP in list_of_chip_snps_missing_in_VCF_yyyy.mm.dd.txt, derive the allele frequency from the chip data (chip_genotypes_yyyy.mm.dd.txt) and arrange the following table (name: annot_chip_snps_missing_in_VCF_yyyy.mm.dd.txt). The header line should contain
    108 
    109 SNPV     A1V     A2V     FREQA1V
    110 
    111 Each subsequent line should contain 4 tab-delimited values; SNPV, A1V, and A2V are explained above (the same as in the “chip_data_conversion_table_yyyy.mm.dd.txt” file). FREQA1V is the floating-point frequency of allele “A1V”.
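
FREQA1V can be derived by counting alleles in the chip genotypes, as in this sketch. The function name `freq_a1v` and the example rows are invented; the sketch assumes that missing alleles (“.”) are excluded from both the numerator and the denominator.

```python
def freq_a1v(chip_rows, snpv, a1v, missing="."):
    """Frequency of allele `a1v` at SNP `snpv` among non-missing chip alleles.

    Returns None when no non-missing allele is observed for the SNP.
    """
    n_a1 = 0
    n_total = 0
    for row in chip_rows:
        if row["SNPV"] != snpv:
            continue
        for allele in (row["A1VCHIP"], row["A2VCHIP"]):
            if allele == missing:
                continue
            n_total += 1
            if allele == a1v:
                n_a1 += 1
    return n_a1 / n_total if n_total else None


# Invented chip genotypes, for illustration only.
chip_rows = [
    {"ID": "s1", "SNPV": "rs1", "A1VCHIP": "A", "A2VCHIP": "G"},
    {"ID": "s2", "SNPV": "rs1", "A1VCHIP": "A", "A2VCHIP": "A"},
    {"ID": "s3", "SNPV": "rs1", "A1VCHIP": ".", "A2VCHIP": "."},
    {"ID": "s1", "SNPV": "rs2", "A1VCHIP": "C", "A2VCHIP": "C"},
]
print(freq_a1v(chip_rows, "rs1", "A"))
```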
    112 
    113  * Q: Does the frequency distribution of the missed variants match that expected under the assumption that we miss variants at random because of the limited number of chromosomes and coverage (for each trio we read two chromosomes at 12x and 2 chromosomes at 24x)?
    114 
     25Automated workflow (will be) provided in ChipBasedQcPipelineWorkflow page.