| 3 | | TODO: More info on statistics and how to access the dataset will be added here. |
| 4 | | BGI will share with us: SNP calling results (VCF), SOAP format for indels. |
| | 4 | == BGI hg19 Pipeline == |
| | 5 | * Alignment: bwa aln -o 1 -e 63 -i 15 -L -l 31 -k 2 |
| | 6 | * Parameters based on other projects for indel detection |
| | 7 | * Remove duplicates: samtools, merge first then remove duplicates |
| | 8 | * SNP Detection: [http://soap.genomics.org.cn/soapsnp.html SOAPsnp] |
| | 9 | * Params: -r 0.0005 -3 0.001 -u -2 |
| | 10 | * Depth > 4x |
| | 11 | * Q > 20 |
| | 12 | * CN < 2 |
| | 13 | * >5bp between each SNP |
| | 14 | * Indel Detection: [http://samtools.sourceforge.net/ samtools] pileup -ivcf |
| | 15 | * RMS Qual > 20 |
| | 16 | * Variation freq >= 0.1 |
| | 17 | * Consensus quality > Q20 |
| | 18 | * Supporting reads >= 2 |
| | 19 | * Results: ~700k per individual |
| | 20 | * CNV: [http://www.csie.ntu.edu.tw/~kmchao/tools/CNVDetector/ CNV_Detector] |
| | 21 | * Parameters: -r *.fa -c *.depth.insgle.out -o *.cnv.txt -l 90 |
| | 22 | * Depth-based: GC% and average sequencing depth of every sliding window of the sequenced genome are computed, and the depth distribution is modeled as a normal distribution with an estimated mean depth and sd for each level of GC content. Regions meeting the criteria below are identified as potential CNVs: |
| | 23 | * Depth was significantly different from WG average with same GC content |
| | 24 | * Flanking region... |
| | 25 | * Results: ~150 CNV > 10k, ~100 > 100k per individual |
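The depth-based CNV idea described above can be illustrated with a small sketch. This is not the CNV_Detector implementation: the integer GC binning, the z-score cutoff of 3, and the function name `flag_depth_outliers` are assumptions made only for illustration, and the flanking-region criterion is not modeled because the source does not spell it out.

```python
from collections import defaultdict
from statistics import mean, stdev


def flag_depth_outliers(windows, z_cutoff=3.0):
    """Flag windows whose depth deviates from other windows of similar GC%.

    windows: list of (window_id, gc_percent, avg_depth) tuples.
    z_cutoff: hypothetical significance threshold (not given in the source).
    """
    # Group depths by GC content, rounded to the nearest integer percent.
    by_gc = defaultdict(list)
    for _, gc, depth in windows:
        by_gc[round(gc)].append(depth)

    flagged = []
    for window_id, gc, depth in windows:
        depths = by_gc[round(gc)]
        if len(depths) < 2:
            continue  # a standard deviation cannot be estimated from one window
        mu, sd = mean(depths), stdev(depths)
        # Keep windows far from the mean depth of their own GC bin.
        if sd > 0 and abs(depth - mu) / sd > z_cutoff:
            flagged.append(window_id)
    return flagged
```

Comparing each window only against windows with the same GC content is what removes the GC-dependent coverage bias before a region is called a candidate.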
| | 26 | |
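As a rough illustration of the SNP filters listed in the pipeline above (depth > 4x, Q > 20, CN < 2, more than 5 bp between SNPs), the sketch below applies them to plain Python dictionaries. The field names (`chrom`, `pos`, `depth`, `qual`, `cn`) are hypothetical and do not correspond to SOAPsnp column names. It also assumes that the spacing rule drops every SNP that has a neighbour within 5 bp, and that spacing is checked after the other filters; the source does not say how BGI handles either point.

```python
def filter_snps(snps, min_depth=4, min_qual=20, max_cn=2, min_gap=5):
    """Return SNPs passing the depth, quality, copy-number and spacing filters.

    snps: list of dicts with keys chrom, pos, depth, qual, cn (hypothetical names).
    """
    # Per-site filters: depth > 4, Q > 20, CN < 2.
    passing = [
        s for s in snps
        if s["depth"] > min_depth and s["qual"] > min_qual and s["cn"] < max_cn
    ]
    passing.sort(key=lambda s: (s["chrom"], s["pos"]))

    def too_close(a, b):
        # Neighbours on the same chromosome must be more than min_gap bp apart.
        return a["chrom"] == b["chrom"] and abs(a["pos"] - b["pos"]) <= min_gap

    kept = []
    for i, snp in enumerate(passing):
        prev_close = i > 0 and too_close(passing[i - 1], snp)
        next_close = i + 1 < len(passing) and too_close(snp, passing[i + 1])
        if not (prev_close or next_close):
            kept.append(snp)
    return kept
```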
| | 27 | == BGI file format == |
| | 28 | BGI delivered all of its data both in the raw output format of the tools it uses and as a VCF version. Below is a summary of the data and formats, along with notes and known caveats. |
| | 29 | |
| | 30 | * Alignment data |
| | 31 | * BAM format |
| | 32 | * CNV |
| | 33 | * [http://www.csie.ntu.edu.tw/~kmchao/tools/CNVDetector/ CNV Detector] output format |
| | 34 | * According to its documentation, CNV Detector takes microarray intensity log2 ratio values as input. It is currently unclear which arrays or files were used for our samples. |
| | 35 | * [http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-40 VCF format] |
| | 36 | * Not assessed yet. |
| | 37 | * Indels |
| | 38 | * Samtools pileup format |
| | 39 | * Cannot find a matching description on the samtools website. To be completed when BGI provides the complete format specs. |
| | 40 | * [http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-40 VCF format] |
| | 41 | * Not assessed yet. |
| | 42 | * SNP calls |
| | 43 | * [http://soap.genomics.org.cn/soapsnp.html#output2 SOAPsnp output format] |
| | 44 | * |
| | 45 | * [http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-40 VCF format] |
| | 46 | * Some of the files are in VCF 3.0 format, others in VCF 4.0 format. |
| | 47 | * Some files are not sorted correctly; these need to be sorted again in order for most programs to work correctly |
| | 48 | * Some files do not have a proper sample name set (only SAMPLE). This is usually problematic for any program that works on multiple VCF files at the same time. |
| | 49 | * The filter column is either always 0 or always PASS; in both cases it only means that every listed SNP passed the filtering steps, so the files carry no information about SNPs that were removed by filtering. |
| | 50 | |
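Two of the VCF caveats above (unsorted records and the placeholder sample name SAMPLE) could be worked around along the following lines. This is a minimal sketch on plain text lines, not a replacement for a proper VCF toolkit: the function name `fix_vcf_lines` and the sample name used in the test are hypothetical, and chromosomes are sorted as plain strings, which may not match the reference order that some downstream programs expect.

```python
def fix_vcf_lines(lines, sample_name):
    """Rename the placeholder sample column and sort records by position.

    lines: VCF lines without trailing newlines (e.g. from str.splitlines()).
    sample_name: name to use in place of the literal 'SAMPLE' column header.
    """
    header, records = [], []
    for line in lines:
        if not line:
            continue  # skip empty lines
        if line.startswith("#CHROM"):
            cols = line.split("\t")
            # Sample columns start after the 9 fixed VCF columns.
            cols[9:] = [sample_name if c == "SAMPLE" else c for c in cols[9:]]
            header.append("\t".join(cols))
        elif line.startswith("#"):
            header.append(line)
        else:
            records.append(line)

    def sort_key(record):
        chrom, pos = record.split("\t", 2)[:2]
        return (chrom, int(pos))  # assumption: lexicographic chromosome order

    records.sort(key=sort_key)
    return header + records
```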
| | 51 | = Overview BGI datasets = |
| | 52 | TODO: More info on statistics and how to access the dataset will be added here. BGI will share with us: SNP calling results (VCF), SOAP format for indels. |