| 23 | | Header line: |
| 24 | | |
| 25 | | CHRC POSC SNPC STRANDC A1C A2C CHRV POSV SNPV A1V A2V |
| 26 | | |
| 27 | | Each subsequent line should contain 11 tab-delimited values, one per header column. There should be no missing values. |
| 28 | | * CHR[C/V]: chromosome on which the SNP is located, according to Chip [C] and VCF [V] build versions <integer from 1 to 22 for autosomes, “X” for X-chromosome, and “Y” for Y-chromosome> |
| 29 | | * POS[C/V]: chromosomal position, according to Chip [C] and VCF [V] build versions <integer> |
| 30 | | * SNP[C/V]: SNP rs-name, according to Chip [C] and VCF [V] dbSNP versions <alphanumeric> |
| 31 | | * STRANDC: strand in chip annotation <single character, either “+” or “-”> |
| 32 | | * A1C: first allele in chip annotation <single character, either “A”, “C”, “G” or “T”> |
| 33 | | * A2C: second allele in chip annotation <single character, either “A”, “C”, “G” or “T”> |
| 34 | | * A1V: (translated) first chip allele (A1C) according to “+” strand on VCF build <single character, either “A”, “C”, “G” or “T”> |
| 35 | | * A2V: (translated) second chip allele (A2C) according to “+” strand on VCF build <single character, either “A”, “C”, “G” or “T”> |
| 36 | | |
| 37 | | Questions: |
| 38 | | * Q: what is the build and dbSNP version used by chip and VCF? |
| 39 | | * Q: how many SNPs changed the name in VCF build? |
| 40 | | * Q: how many SNPs changed the strand in VCF build? |
| 41 | | * Q: please provide a 2x2 table (name change/not) x (strand change/not) |
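The 2x2 table asked for above could be tallied along these lines. This is only a sketch: the function name `change_table` is hypothetical, and it assumes that a name change means SNPC differs from SNPV and that a strand change shows up as translated alleles (A1V, A2V) differing from the chip alleles (A1C, A2C). Neither definition is fixed by this page.

```python
from collections import Counter

def change_table(rows):
    """Cross-tabulate (name changed) x (strand changed) over translation-table rows.

    rows: iterable of dicts keyed by the translation-table header names.
    Assumptions (not stated in the spec): a name change is SNPC != SNPV,
    a strand change is (A1C, A2C) != (A1V, A2V).
    """
    counts = Counter()
    for r in rows:
        name_changed = r["SNPC"] != r["SNPV"]
        strand_changed = (r["A1C"], r["A2C"]) != (r["A1V"], r["A2V"])
        counts[(name_changed, strand_changed)] += 1
    # Always return all four cells, including empty ones.
    return {(n, s): counts[(n, s)] for n in (False, True) for s in (False, True)}
```

The result is keyed by `(name_changed, strand_changed)`, so `table[(True, False)]` is the number of SNPs that were renamed but kept their alleles.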
| 42 | | |
| 43 | | Note that in the future, samples typed on a number of different chip platforms will be coming in. Therefore the above step should not assume that a particular chip is used! |
| 44 | | |
| 45 | | == UPDATED CHIP GENOTYPES == |
| 46 | | |
| 47 | | Using the translation table described above, generate an updated chip genotypes file (name: chip_genotypes_yyyy.mm.dd.txt). |
| 48 | | |
| 49 | | This is a tab-delimited text file containing a table. The header line is |
| 50 | | |
| 51 | | ID SNPV QUALCHIP A1VCHIP A2VCHIP GTCHIP |
| 52 | | |
| 53 | | Each subsequent line should contain 6 tab-delimited values, one per header column. Use “.” (dot) for missing. |
| 54 | | * ID: sample ID (genotyped individual’s code) <alphanumeric> |
| 55 | | * SNPV: SNP rs-name, according to VCF dbSNP version <alphanumeric> |
| 56 | | * QUALCHIP: calling quality for the individual genotype |
| 57 | | * A1VCHIP: first allele of the personal genotype, translated according to “+” strand on VCF build <single character, either “A”, “C”, “G” or “T”> |
| 58 | | * A2VCHIP: second allele of the personal genotype, translated according to “+” strand on VCF build <single character, either “A”, “C”, “G” or “T”> |
| 59 | | * GTCHIP: genotype with alleles in alphabetic order <two characters, each either “A”, “C”, “G” or “T”> |
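As an illustration of the GTCHIP column, the alphabetic ordering could be done as below. The helper name `ordered_genotype` is hypothetical, and returning “.” when either allele is missing is an assumption that follows the "use dot for missing" rule above.

```python
VALID_ALLELES = {"A", "C", "G", "T"}

def ordered_genotype(a1, a2):
    """Return the two alleles joined in alphabetic order, or "." if either is missing."""
    if a1 == "." or a2 == ".":
        return "."
    if a1 not in VALID_ALLELES or a2 not in VALID_ALLELES:
        # Anything other than A/C/G/T is reported rather than silently passed on.
        raise ValueError(f"unexpected allele(s): {a1!r}, {a2!r}")
    return "".join(sorted((a1, a2)))
```

For example, `ordered_genotype("T", "C")` gives `"CT"`, so a heterozygote is written the same way regardless of which allele was listed first.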
| 60 | | |
| 61 | | Questions: |
| 62 | | * Q: do all SNPs in chip data have rs-number? |
| 63 | | * Q: what alleles are observed in chip data? Only A/T/G/C? |
| 64 | | * Q: are all SNPs bi-allelic? |
| 65 | | |
| 66 | | == EXTRACTION OF CHIP SNPS FROM VCF FILE == |
| 67 | | |
| 68 | | From VCF, extract only lines containing SNPs also observed in the chip (see SNPV column of “chip_data_conversion_table_yyyy.mm.dd.txt”) |
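A minimal sketch of this extraction step follows. The function name `extract_chip_records` is hypothetical. The sketch assumes uncompressed VCF text and matches on the ID column (third field), which may hold several semicolon-separated identifiers.

```python
def extract_chip_records(vcf_lines, chip_snps):
    """Split VCF text into header lines and records whose ID is a chip SNP.

    vcf_lines: iterable of text lines from a VCF file.
    chip_snps: iterable of SNPV names taken from the conversion table.
    Returns (header_lines, records); each record is a list of tab-split fields.
    """
    wanted = set(chip_snps)
    header_lines, records = [], []
    for line in vcf_lines:
        line = line.rstrip("\n")
        if line.startswith("#"):
            # "##" meta-information lines and the "#CHROM" column header.
            header_lines.append(line)
            continue
        fields = line.split("\t")
        if len(fields) < 8:
            continue  # skip blank or malformed lines
        if wanted.intersection(fields[2].split(";")):
            records.append(fields)
    return header_lines, records
```

Keeping the header lines separate makes it straightforward to write the meta-info at the top of the annotation file afterwards.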
| 69 | | |
| 70 | | Parse extracted lines, and arrange “Annotation” and “Genotypic” tables |
| 71 | | |
| 72 | | Annotation table (name: VCF_annotation_yyyy.mm.dd.txt). Tab-delimited file with a header line, containing the following columns extracted from the VCF: |
| 73 | | |
| 74 | | CHROM POS ID REF ALT QUAL FILTER INFO |
| 75 | | |
| 76 | | At the beginning of the file, add meta-info from VCF file |
| 77 | | |
| 78 | | Genotypic table (name: VCF_genotypes_yyyy.mm.dd.txt). Tab-delimited file containing the following information. Header line: |
| 79 | | |
| 80 | | ID SNPV GTVCF GQ DP BATCH ???? |
| 81 | | |
| 82 | | Next lines should all contain XXX tab-delimited values. Use “.” (dot) for missing. |
| 83 | | * ID: sample ID (genotyped individual’s code) <alphanumeric> |
| 84 | | * SNPV: SNP rs-name, according to VCF dbSNP version <alphanumeric> |
| 85 | | * GTVCF: genotype with alleles in alphabetic order, <two characters, each either “A”, “C”, “G” or “T”>. This can be done by mapping the numbers provided in VCF GT field to REF and ALT and then ordering. |
| 86 | | * GQ, DP: directly from VCF file |
| 87 | | * BATCH … |
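The GTVCF mapping described in the list above (GT indices to REF/ALT, then alphabetic ordering) might look like the sketch below. The name `gt_to_gtvcf` is hypothetical. Treating non-diploid calls, missing calls and non-SNP alleles as “.” is an assumption.

```python
import re

def gt_to_gtvcf(gt, ref, alt):
    """Translate a VCF GT value such as "0/1" or "1|1" into an ordered allele pair.

    ref and alt are the REF and ALT column values; ALT may list several
    comma-separated alleles. Returns "." when the call cannot be expressed
    as two single-base A/C/G/T alleles.
    """
    alleles = [ref] + alt.split(",")      # index 0 is REF, 1.. are the ALT alleles
    indices = re.split(r"[/|]", gt)       # accept unphased "/" and phased "|"
    if len(indices) != 2 or not all(i.isdigit() for i in indices):
        return "."                        # missing ("./.") or non-diploid call
    try:
        called = [alleles[int(i)] for i in indices]
    except IndexError:
        return "."                        # GT refers to an allele not listed in ALT
    if any(a not in ("A", "C", "G", "T") for a in called):
        return "."                        # indel or symbolic allele, not a SNP call
    return "".join(sorted(called))
```

With REF “G” and ALT “A”, a GT of `0/1` becomes `AG`, which is directly comparable with GTCHIP.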
| 88 | | |
| 89 | | Merge chip and VCF genotypic tables (“chip_genotypes_yyyy.mm.dd.txt” and “VCF_genotypes_yyyy.mm.dd.txt”) using ID and SNPV as key variables. Keep all chip genotypes, substituting missing (“.”) when no information is available from VCF. Name the table “merged_chip_and_VCF_genotypes_yyyy.mm.dd.txt”. |
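The merge is a left join that keeps every chip row. A plain-Python sketch is given here; the function name `merge_genotypes` and the default list of carried-over VCF fields are assumptions. BATCH and the still unspecified columns are left out.

```python
def merge_genotypes(chip_rows, vcf_rows, vcf_fields=("GTVCF", "GQ", "DP")):
    """Left-join VCF genotype rows onto chip genotype rows by (ID, SNPV).

    Both inputs are lists of dicts keyed by column name. Every chip row is
    kept; VCF fields are set to "." when no matching VCF row exists.
    """
    vcf_by_key = {(r["ID"], r["SNPV"]): r for r in vcf_rows}
    merged = []
    for chip in chip_rows:
        vcf = vcf_by_key.get((chip["ID"], chip["SNPV"]), {})
        row = dict(chip)                   # copy so the input is not modified
        for field in vcf_fields:
            row[field] = vcf.get(field, ".")
        merged.append(row)
    return merged
```

VCF rows without a chip counterpart are dropped by design, since only chip genotypes have to be kept.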
| 90 | | |
| 91 | | Questions: |
| 92 | | * Q: What are the count and proportion of genotypes that do not match between GTCHIP and GTVCF? How much do these counts/proportions change when rows with QUALCHIP < X are dropped (vary X)? How much do they change when rows with GQ (DP) < X are dropped (vary X)? |
| 93 | | * Q: What is proportion of false-positive and false-negative findings in our study, if we do not take trio structure into account? |
| 94 | | * Q: Find out QC metrics thresholds maximizing specificity and sensitivity. |
| 95 | | |
| 96 | | Update the table with variable “CHIPVCFMISMATCH” (1 if mismatch, 0 for match, missing (“.”) if any is missing). |
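The CHIPVCFMISMATCH flag, and the mismatch count/proportion at a given QUALCHIP threshold, could be derived roughly as follows. Both function names are hypothetical. Dropping rows whose QUALCHIP is missing once a threshold is set is an assumption.

```python
def mismatch_flag(gtchip, gtvcf):
    """Return "1" for a mismatch, "0" for a match, "." if either genotype is missing."""
    if gtchip == "." or gtvcf == ".":
        return "."
    return "0" if gtchip == gtvcf else "1"

def mismatch_summary(rows, min_qualchip=None):
    """Count mismatches among rows that have both genotypes.

    rows: merged-table rows as dicts with GTCHIP, GTVCF and QUALCHIP.
    min_qualchip: if given, rows with QUALCHIP below it (or missing) are dropped.
    Returns (mismatch_count, mismatch_proportion); the proportion is None
    when no comparable rows remain.
    """
    flags = []
    for r in rows:
        flag = mismatch_flag(r["GTCHIP"], r["GTVCF"])
        if flag == ".":
            continue
        if min_qualchip is not None:
            qual = r["QUALCHIP"]
            if qual == "." or float(qual) < min_qualchip:
                continue
        flags.append(int(flag))
    n = len(flags)
    return sum(flags), (sum(flags) / n if n else None)
```

Calling `mismatch_summary` over a range of thresholds gives the numbers needed to see how the mismatch rate responds to the QUALCHIP cut-off; the same pattern would apply to GQ and DP.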
| 97 | | |
| 98 | | * Q: Explore, which variables are significant predictors of mismatch using multiple logistic regression. |
| 99 | | |
| 100 | | |
| 101 | | == CHIP SNPS MISSING FROM VCF == |
| 102 | | |
| 103 | | Write the list of the chip SNPs not in VCF into the file “list_of_chip_snps_missing_in_VCF_yyyy.mm.dd.txt” (single column containing SNPV name). This should be done when matching chip SNPs with VCF SNPs (see section “EXTRACTION OF CHIP SNPS FROM VCF FILE”). |
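The list of missing SNPs is a set difference between the chip SNPV names and the IDs found in the VCF. A small sketch, using a hypothetical helper that keeps the chip order and drops duplicates:

```python
def chip_snps_missing_in_vcf(chip_snps, vcf_ids):
    """Return chip SNPV names absent from the VCF, in chip order, without duplicates."""
    present = set(vcf_ids)
    missing, emitted = [], set()
    for snp in chip_snps:
        if snp not in present and snp not in emitted:
            missing.append(snp)
            emitted.add(snp)
    return missing
```

The length of the returned list answers the "how many variants do we miss" question directly.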
| 104 | | |
| 105 | | * Q: How many variants do we miss in VCF (how many SNPs in file list_of_chip_snps_missing_in_VCF_yyyy.mm.dd.txt)? |
| 106 | | |
| 107 | | For each SNP in list_of_chip_snps_missing_in_VCF_yyyy.mm.dd.txt, based on chip_genotypes_yyyy.mm.dd.txt derive frequency from chip data and arrange the following table (name: annot_chip_snps_missing_in_VCF_yyyy.mm.dd.txt). The header line should contain |
| 108 | | |
| 109 | | SNPV A1V A2V FREQA1V |
| 110 | | |
| 111 | | Each subsequent line should contain 4 tab-delimited values; SNPV, A1V, and A2V are explained above (the same as in the “chip_data_conversion_table_yyyy.mm.dd.txt” file). FREQA1V is the floating-point frequency of allele “A1V”. |
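FREQA1V can be obtained by counting A1V alleles over all non-missing chip genotypes of a SNP. The sketch below uses a hypothetical function name and assumes that genotypes with a missing allele are skipped entirely.

```python
def freq_a1v(a1v, genotypes):
    """Frequency of allele a1v among the chip genotypes of one SNP.

    genotypes: iterable of (A1VCHIP, A2VCHIP) pairs, one per individual.
    Pairs containing "." are ignored. Returns None if no genotype is usable.
    """
    a1_count = 0
    allele_total = 0
    for first, second in genotypes:
        if first == "." or second == ".":
            continue
        a1_count += (first == a1v) + (second == a1v)
        allele_total += 2
    return a1_count / allele_total if allele_total else None
```

For instance, the genotypes A/G and A/A carry three “A” alleles out of four, so the frequency of “A” is 0.75.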
| 112 | | |
| 113 | | * Q: Does the frequency distribution of the missed variants match the one expected under the assumption that variants are missed at random because of the limited number of chromosomes and coverage (for each trio we read two chromosomes at 12x and two chromosomes at 24x)? |
| 114 | | |
| | 25 | Automated workflow (will be) provided in ChipBasedQcPipelineWorkflow page. |