Changes between Version 7 and Version 8 of DataConcordance


Timestamp: Apr 21, 2011 4:16:21 PM
Author: laurent

  • DataConcordance

    v7 v8  
    1 = Data Concordance =
    2 
    3 This page summarizes the methods and results of the concordance checks between the following data sets:
    4 * Groningen Immunochip data
    5 * Groningen Sequence data
    6 * BGI Sequence data
    7 
    81[[TOC]]
    92
     3= Data Concordance =
     4This page summarizes the methods and results of the concordance checks between the following data sets:
     5
     6 * Groningen Immunochip data
     7 * Groningen Sequence data
     8 * BGI Sequence data
     9
    1010= Methods & Tools =
    11 
    1211== File Types ==
    1312All data sets were either generated or converted to VCF files aligned on the build Hg19 of the Human Reference Genome:
    14 * See [[GoNL_Immunochip_Data_Preparation]] about how the Immunochip data was processed
    15 * See [https://www.broad.harvard.edu/gsa/wiki/index.php/LiftOverVCF.pl GATK LiftOverVCF] about how to lift over a VCF file from one reference to another
     13
     14 * See [[GoNL_Immunochip_Data_Preparation]] about how the Immunochip data was processed
     15 * See [https://www.broad.harvard.edu/gsa/wiki/index.php/LiftOverVCF.pl GATK LiftOverVCF] about how to lift over a VCF file from one reference to another
    1616
    1717== Concordance calculation using [http://vcftools.sourceforge.net/ VCFTools] ==
    18 To calculate the concordance between the different files, [http://vcftools.sourceforge.net/ VCFTools] was used. More specifically:
    19 <pre>vcftools --vcf /data/lfrancioli/immunochip/hg19/GvNL.hg19.final.vcf --indv ${sample} --diff /data/lfrancioli/results/pilot/${sample}.human_g1k_v37.immuno.vcf --diff-site-discordance --diff-indv-discordance --diff-discordance-matrix</pre>
    20 This computes the concordance per file, site and individual as well as a discordance matrix. The comparison was run per sample, so only the file, site and discordance-matrix outputs were actually used.
     18To calculate the concordance between the different files, [http://vcftools.sourceforge.net/ VCFTools] was used. More specifically: <pre>vcftools --vcf /data/lfrancioli/immunochip/hg19/GvNL.hg19.final.vcf --indv ${sample} --diff /data/lfrancioli/results/pilot/${sample}.human_g1k_v37.immuno.vcf --diff-site-discordance --diff-indv-discordance --diff-discordance-matrix</pre> This computes the concordance per file, site and individual as well as a discordance matrix. The comparison was run per sample, so only the file, site and discordance-matrix outputs were actually used.
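Because the command above is run once per sample, it can be convenient to generate it programmatically. The following is a minimal sketch only: the paths and flags are copied from the command above, while the function name `build_diff_command` and the optional `--out` prefix per sample are assumptions added for illustration. Check the flags against the installed VCFTools version before use.

```python
# Paths copied from the command shown above.
CHIP_VCF = "/data/lfrancioli/immunochip/hg19/GvNL.hg19.final.vcf"
SEQ_DIR = "/data/lfrancioli/results/pilot"


def build_diff_command(sample, out_prefix=None):
    """Return the vcftools argument list comparing one sample across both files."""
    cmd = [
        "vcftools",
        "--vcf", CHIP_VCF,
        "--indv", sample,
        "--diff", f"{SEQ_DIR}/{sample}.human_g1k_v37.immuno.vcf",
        "--diff-site-discordance",
        "--diff-indv-discordance",
        "--diff-discordance-matrix",
    ]
    if out_prefix is not None:
        # Assumption: one output prefix per sample, so that runs do not
        # overwrite each other's result files.
        cmd += ["--out", out_prefix]
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)` for each sample ID in turn.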
    2119
    2220== Concordance aggregation using home-made scripts ==
     
    2523=== vcftools-diff_site-concordance.pl ===
    2624As its name suggests, this script runs over the individual .diff.site files produced by VCFTools and aggregates their information. The following features are available:
    27 * Per-major report
    28 ** SNP filtering
    29 ** Report as plain text or tab-delimited
    30 * SNP-major report
    31 ** SNP filtering
    32 ** Addition of MAF from a plink frq file
    33 ** Addition of SNP ID from a plink bim file
    34 ** Output of shared SNPs only
     25
     26 * Per-major report
     27
     28** SNP filtering ** Report as plain text or tab-delimited
     29
     30 * SNP-major report
     31
     32** SNP filtering ** Addition of MAF from a plink frq file ** Addition of SNP ID from a plink bim file ** Output of shared SNPs only
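The Perl script itself is not shown on this page. As a rough illustration of the kind of aggregation described above, the sketch below sums per-site counts both per sample and per SNP. It is not the author's implementation: the function names and the tab-delimited column names (`CHROM`, `POS`, `N_COMMON_CALLED`, `N_DISCORD`) are assumptions and should be checked against the actual VCFTools output.

```python
import csv
from collections import defaultdict


def aggregate_site_files(files):
    """Aggregate per-site discordance tables.

    files maps a sample name to an iterable of tab-delimited lines
    (an open file handle works). Column names are assumed, see above.
    Returns (per_sample, per_snp); each maps a key to [compared, discordant].
    """
    per_sample = defaultdict(lambda: [0, 0])
    per_snp = defaultdict(lambda: [0, 0])
    for sample, lines in files.items():
        for row in csv.DictReader(lines, delimiter="\t"):
            compared = int(row["N_COMMON_CALLED"])
            discordant = int(row["N_DISCORD"])
            snp = (row["CHROM"], row["POS"])
            # Update the per-sample and the per-SNP totals with the same counts.
            for counts in (per_sample[sample], per_snp[snp]):
                counts[0] += compared
                counts[1] += discordant
    return dict(per_sample), dict(per_snp)


def concordance(counts):
    """Fraction of compared genotypes that agree, or None if nothing was compared."""
    compared, discordant = counts
    return None if compared == 0 else 1 - discordant / compared
```

SNP filtering and the addition of MAF or SNP IDs from plink files would be further steps applied to the per-SNP dictionary.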
    3533
    3634=== vcftools-discordance-matrix.py ===
     
    3836
    3937== Reporting using R scripts ==
    40 For reporting purposes, R scripts were created. These scripts all take files created using vcftools-diff_site-concordance.pl or vcftools-discordance-matrix.py as input.
    41 The following scripts are available:
    42 * plot_shared_loci.R
    43 ** Plots the shared/unique loci in the two datasets per individual as a barplot
    44 ** Usage: Rscript plot_shared_loci.R <concordance_file> <out_plot.jpg> [name_dataset1] [name_dataset2]
    45 * plot_geno_concordance.R
    46 ** Plots the genotype concordance between two datasets per individual as a barplot
    47 ** Usage: Rscript plot_geno_concordance.R <concordance_file> <out_plot.jpg> [plot_title]
    48 * plot_discordance_matrix.R
    49 ** Plots the genotype discordance by "discordance type" (0/0 -> 0/1, 0/0 -> 1/1, 0/1 -> 0/0, etc.)
    50 ** Usage: Rscript plot_discordance_matrix.R <discordance_matrix_file> <out_plot.jpg> [dataset1_name] [dataset2_name] [show_concordant_data=FALSE] [<concordance_file>]
    51 ** Note:
    52 *** The last optional argument is a concordance file over the same data. It is used to plot as 'unknown' all loci that were not captured by the concordance matrix because the alleles were not exact matches (e.g. if the site was monomorphic in one set).
     38For reporting purposes, R scripts were created. These scripts all take files created using vcftools-diff_site-concordance.pl or vcftools-discordance-matrix.py as input. The following scripts are available:
     39
     40 * plot_shared_loci.R
     41
     42** Plots the shared/unique loci in the two datasets per individual as a barplot ** Usage: Rscript plot_shared_loci.R <concordance_file> <out_plot.jpg> [name_dataset1] [name_dataset2]
     43
     44 * plot_geno_concordance.R
     45
     46** Plots the genotype concordance between two datasets per individual as a barplot ** Usage: Rscript plot_geno_concordance.R <concordance_file> <out_plot.jpg> [plot_title]
     47
     48 * plot_discordance_matrix.R
     49
     50** Plots the genotype discordance by "discordance type" (0/0 -> 0/1, 0/0 -> 1/1, 0/1 -> 0/0, etc.) ** Usage: Rscript plot_discordance_matrix.R <discordance_matrix_file> <out_plot.jpg> [dataset1_name] [dataset2_name] [show_concordant_data=FALSE] [<concordance_file>] ** Note: *** The last optional argument is a concordance file over the same data. It is used to plot as 'unknown' all loci that were not captured by the concordance matrix because the alleles were not exact matches (e.g. if the site was monomorphic in one set).
    5351
    5452= Results - GoNL Pilot =
    5553== Groningen / BGI ==
    5654Datasets:
    57 * Groningen
    58 ** Produced using Groningen pipeline on hg19
    59 ** SNPs were only filtered for quality > Q10 (to avoid extreme numbers in VCF). It is considered an 'unfiltered' set and is expected to contain many false positives; however, sensitivity should be excellent, as any SNP not reported at this point will not be detected further in the pipeline.
    60 * BGI
    61 ** Produced using BGI pipeline on b36, then lifted over to hg19
    62 ** SNPs filtered using standard BGI filter setup
     55
     56 * Groningen
     57
     58** Produced using Groningen pipeline on hg19 ** SNPs were only filtered for quality > Q10 (to avoid extreme numbers in VCF). It is considered an 'unfiltered' set and is expected to contain many false positives; however, sensitivity should be excellent, as any SNP not reported at this point will not be detected further in the pipeline.
     59
     60 * BGI
     61
     62** Produced using BGI pipeline on b36, then lifted over to hg19 ** SNPs filtered using standard BGI filter setup
    6363
    6464=== Loci Concordance ===
    6565Below is a chart showing the shared and unique SNPs in the two datasets regardless of their genotypes. As expected, the vast majority of the SNPs are shared between the datasets. A relatively high number of SNPs are found only in Groningen (among them a majority of unfiltered false positives), and a small number of SNPs are unique to the BGI dataset (to be investigated).
    6666
    67 [[File:bgi_groningen_loci_concordance.jpg|center]]
     67[[Image(bgi_groningen_loci_concordance.jpg)]]
    6868
    6969Investigation showed that the three least concordant individuals encountered a problem while processing one of their lanes, leaving them with 2/3 of the normal coverage. The figures should be updated when the lanes have been processed and these individuals corrected.
     
    7272The following chart shows the genotype concordance on the shared SNPs between BGI and Groningen datasets.
    7373
    74 [[File:bgi_groningen_concordance.jpg|center]]
     74[[Image(bgi_groningen_concordance.jpg)]]
    7575
    7676Note: The chart above does not take sex chromosomes into account, because an artifact introduced by the way BGI mapped the Y chromosome showed all males as completely discordant over the sex chromosomes.
     
    7878== Groningen / Immunochip ==
    7979Datasets:
    80 * Groningen
    81 ** Produced using Groningen pipeline on hg19
    82 ** SNPs were only filtered for quality > Q10 (to avoid extreme numbers in VCF). It is considered an 'unfiltered' set and is expected to contain many false positives; however, sensitivity should be excellent, as any SNP not reported at this point will not be detected further in the pipeline.
    83 ** Homozygous reference loci corresponding to the Immunochip were added to the dataset as well
    84 * Immunochip
    85 ** ~165K loci after QC (both SNPs and homozygous reference)
    86 *** SNP HWE p-val > 1e-3
    87 *** SNP callrate > 99%
    88 ** Exported from Genome Studio, QC'ed and lifted over from hg18 to hg19
     80
     81 * Groningen
     82
     83** Produced using Groningen pipeline on hg19 ** SNPs were only filtered for quality > Q10 (to avoid extreme numbers in VCF). It is considered an 'unfiltered' set and is expected to contain many false positives; however, sensitivity should be excellent, as any SNP not reported at this point will not be detected further in the pipeline. ** Homozygous reference loci corresponding to the Immunochip were added to the dataset as well
     84
     85 * Immunochip
     86
     87** ~165K loci after QC (both SNPs and homozygous reference) *** SNP HWE p-val > 1e-3 *** SNP callrate > 99% ** Exported from Genome Studio, QC'ed and lifted over from hg18 to hg19
    8988
    9089=== Genotype Concordance ===
    9190The following chart shows the genotype concordance on the 165K Immunochip loci left after QC.
    9291
    93 [[File:groningen_immunochip_concordance.jpg|center]]
     92[[Image(groningen_immunochip_concordance.jpg)]]
    9493
    9594The 5 least concordant individuals can be explained as follows:
    96 * A3b, A7b samples are contaminated
    97 * A8a, A8c and R5A encountered a problem while processing one of their lanes, leaving them with 2/3 of the normal coverage. The figures should be updated when the lanes have been processed and these individuals corrected.
     95
     96 * A3b, A7b samples are contaminated
     97 * A8a, A8c and R5A encountered a problem while processing one of their lanes, leaving them with 2/3 of the normal coverage. The figures should be updated when the lanes have been processed and these individuals corrected.
     98
    9899Moreover, if we exclude the individuals above and keep only the sites reported by both the BGI and Groningen pipelines, the concordance reaches 99.029% on average.
    99 
    100100
    101101The graph below shows a preliminary analysis of the "types" of discordance observed. An important caveat has to be taken into account: VCFTools only reports sites where the alleles perfectly match. This means that all monomorphic sites in one dataset that are polymorphic in the other will not appear. This was especially problematic since we compared each sequenced sample separately against the whole Immunochip dataset. As a result almost all homozygous reference sites in the sequence data were not reported by VCFTools. All the discordant sites that did not have perfectly matching alleles are reported below as 'unknown' as it has yet to be investigated what discordance "type" they belong to.
    102102
    103 [[File:groningen_immunochip_discordance_matrix.jpg|center]]
     103[[Image(groningen_immunochip_discordance_matrix.jpg)]]
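To make the 'unknown' category concrete, the sketch below counts discordance types for one sample and sets aside loci whose alleles do not match exactly. It is an illustration only, not the logic of vcftools-discordance-matrix.py: the function name and the input layout (a locus key mapped to an alleles tuple and a genotype string) are hypothetical. As a simplification, every locus with mismatching alleles is counted as 'unknown', whether or not the underlying genotypes actually disagree.

```python
from collections import Counter


def discordance_types(calls_a, calls_b):
    """Count genotype transitions between two call sets for one sample.

    calls_a and calls_b map a locus key to (alleles, genotype),
    e.g. (("A", "G"), "0/1"). This layout is assumed for illustration.
    """
    counts = Counter()
    for locus, (alleles_a, gt_a) in calls_a.items():
        if locus not in calls_b:
            continue  # only loci present in both sets are compared
        alleles_b, gt_b = calls_b[locus]
        if alleles_a != alleles_b:
            # Alleles differ (e.g. the site is monomorphic in one set), so the
            # genotype codes are not comparable and the type stays 'unknown'.
            counts["unknown"] += 1
        elif gt_a != gt_b:
            counts[f"{gt_a} -> {gt_b}"] += 1
        else:
            counts["concordant"] += 1
    return counts
```

A locus that is homozygous reference with a single listed allele in the sequence data but polymorphic on the chip would land in 'unknown' here, which mirrors the caveat described above.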
    104104
    105 == BGI / Immunochip == 
     105== BGI / Immunochip ==
    106106Datasets:
    107 * BGI
    108 ** Produced using BGI pipeline on b36, then lifted over to hg19
    109 ** SNPs filtered using standard BGI filter setup. Note that no homozygous reference locus is reported.
    110 * Immunochip
    111 ** ~165K loci after QC (both SNPs and homozygous reference)
    112 *** SNP HWE p-val > 1e-3
    113 *** SNP callrate > 99%
    114 ** Exported from Genome Studio, QC'ed and lifted over from hg18 to hg19
     107
     108 * BGI
     109
     110** Produced using BGI pipeline on b36, then lifted over to hg19 ** SNPs filtered using standard BGI filter setup. Note that no homozygous reference locus is reported.
     111
     112 * Immunochip
     113
     114** ~165K loci after QC (both SNPs and homozygous reference) *** SNP HWE p-val > 1e-3 *** SNP callrate > 99% ** Exported from Genome Studio, QC'ed and lifted over from hg18 to hg19
    115115
    116116=== Genotype Concordance ===
    117 The following chart shows the concordance between the 2 datasets over ~47K shared loci. 
     117The following chart shows the concordance between the 2 datasets over ~47K shared loci.
    118118
    119 [[File:bgi_immunochip_concordance.jpg|center]]
     119[[Image(bgi_immunochip_concordance.jpg)]]
    120120
    121121Note that the 2 least concordant samples are contaminated, which explains their low concordance.