= '''BGI Data''' =

[[TOC()]]

Below is a description of the pipeline used by BGI to process the GoNL data on hg19, followed by a short description of the format of the files delivered by BGI.

== BGI hg19 Pipeline ==

 * Alignment: bwa aln -o 1 -e 63 -i 15 -L -l 31 -k 2
   * Parameters based on other projects for indel detection
 * Remove duplicates: samtools, merge first then remove duplicates
 * SNP Detection: [http://soap.genomics.org.cn/soapsnp.html SOAPsnp]
   * Params: -r 0.0005 -3 0.001 -u -2
   * Depth > 4x
   * Q > 20
   * CN < 2
   * >5bp between each SNP
 * Indel Detection: [http://samtools.sourceforge.net/ samtools] pileup -ivcf
   * RMS Qual > 20
   * Variation freq >= 0.1
   * Consensus quality > Q20
   * Supporting reads >= 2
   * Results: ~700k per individual
 * CNV: [http://www.csie.ntu.edu.tw/~kmchao/tools/CNVDetector/ CNV_Detector]
   * Parameters: -r *.fa -c *.depth.insgle.out -o *.cnv.txt -l 90
   * Depth-based: the GC% and average sequencing depth of every sliding window of the sequenced genome are computed, and the depth distribution is modeled as a normal distribution with an estimated mean depth and SD for each level of GC content. Regions meeting the criteria below are identified as potential CNVs:
     * Depth was significantly different from WG average with same GC content
     * Flanking region...
   * Results: ~150 CNV > 10k, ~100 > 100k per individual

== BGI file format ==

BGI delivered all its data both in the raw format of the tools they use and in a VCF version. Below is a summary of the data and formats, along with notes and known caveats. Note that unless specified otherwise, all data is aligned on hg19.

 * Alignment data
   * BAM format
 * CNV
   * [http://www.csie.ntu.edu.tw/~kmchao/tools/CNVDetector/ CNV Detector] output format
     * According to the documentation, CNV Detector uses microarray intensity log2 ratio values as input files. It is currently not clear which array/files were used for our samples.
   * [http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-40 VCF format]
     * Not assessed yet.
 * Indels
   * Samtools pileup format
     * Cannot find a matching description on the samtools website. To be completed when BGI provides the complete format specs.
   * [http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-40 VCF format]
     * Not assessed yet.
 * SNP calls
   * [http://soap.genomics.org.cn/soapsnp.html#output2 SOAPsnp output format]
   * [http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-40 VCF format]
     * Some of the files are in VCF 3.0 format, others in VCF 4.0 format.
     * Some files are not sorted correctly; these need to be sorted again for most programs to work correctly.
     * Some files do not have a proper sample name set (only SAMPLE). This is usually problematic for any program that works on multiple VCF files at the same time.
     * The filter column is either always 0 or always PASS; in both cases it only means that every SNP in the file passed the filtering steps, so it gives no information about the SNPs removed by filtering.

= Overview BGI datasets =

TODO: More info on statistics and how to access the dataset will be added here. BGI will share with us: SNP calling results (VCF), SOAP format for indels.
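The caveats listed for the SNP VCF files under BGI file format (mixed VCF versions, unsorted records, the placeholder sample name SAMPLE, an uninformative filter column) can be checked per file before feeding the data to other tools. The following is a minimal Python sketch, not part of the BGI delivery. The function name `summarize_vcf` is hypothetical, and "sorted" is assumed here to mean that each chromosome forms one contiguous block with non-decreasing positions; individual tools may require a specific chromosome order on top of that.

```python
def summarize_vcf(lines):
    """Collect the properties of one VCF file that the caveats refer to.

    `lines` is any iterable of text lines (an open file works).
    """
    version = None
    samples = []
    filters = set()
    is_sorted = True
    seen_chroms = set()
    prev_chrom, prev_pos = None, -1

    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("##fileformat="):
            version = line.split("=", 1)[1]
        elif line.startswith("#CHROM"):
            # The first nine columns are fixed; the rest are sample names.
            samples = line.split("\t")[9:]
        elif not line.startswith("#"):
            fields = line.split("\t")
            chrom, pos = fields[0], int(fields[1])
            filters.add(fields[6])  # FILTER is the 7th column
            if chrom != prev_chrom:
                # Assumption: a chromosome that reappears later means unsorted.
                if chrom in seen_chroms:
                    is_sorted = False
                seen_chroms.add(chrom)
            elif pos < prev_pos:
                is_sorted = False
            prev_chrom, prev_pos = chrom, pos

    return {
        "version": version,
        "samples": samples,
        "sorted": is_sorted,
        "filters": filters,
    }


# Made-up records for illustration only; these are not GoNL data.
example = [
    "##fileformat=VCFv4.0",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE",
    "chr1\t200\t.\tA\tG\t30\tPASS\t.\tGT\t0/1",
    "chr1\t100\t.\tC\tT\t40\tPASS\t.\tGT\t1/1",
]
report = summarize_vcf(example)
needs_sample_name = report["samples"] == ["SAMPLE"]
filter_is_uninformative = len(report["filters"]) <= 1
```

For a real file, `summarize_vcf(open(path))` would give the same report. A file flagged as unsorted or carrying the placeholder sample name would still need to be fixed with a separate tool.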
|| '''Batch''' || '''Samples''' || '''Lanes''' || '''Size''' || '''Groningen storage''' || '''Grid storage (VO)''' || '''Storage additional sites''' || '''Analysis site''' || '''Status''' ||
|| 1st batch (pilot phase) || 60 || 183 || || yes || yes (vlemed) || || UMGC, AMC (for comparison) || aligned, in progress ||
|| 2nd batch || 90 || 295 || || yes || yes (vlemed) || || AMC || in progress ||
|| 3rd batch || 222 || 683 || 10TB || yes || no || || UMGC || in progress ||
|| 4th batch || 235 || 630 || 10TB || yes || yes (bbmri.nl) || Hubrecht, EMC || LUMC/TUdelft, UMGC, Hubrecht, EMC || in progress ||
|| 5th batch || 153 || || || no || no || || || not on storage yet ||
|| 6th batch || 10 || || || no || || || || not arrived yet ||

== '''Directories''' ==

'''1st batch'''
 * Groningen
   * Raw data (fastq)
   * Results
 * Grid (vlemed)
   * Raw data SRM (fastq) srm://srm.grid.sara.nl:8443/pnfs/grid.sara.nl/data/vlemed/gvnl/data (access to members of vlemed/gvnl VO)
   * Raw data LFC (fastq) lfn://lfc.grid.sara.nl:5010/grid/vlemed/gvnl/data (access closed)
   * Results LFC lfn://lfc.grid.sara.nl:5010/grid/vlemed/gvnl/data/bam/markduplicates/realignment/fixmates/recalibrate/sorted (access closed)
   * Note: results will be copied to bbmri.nl VO when analysis is done

'''2nd batch'''
 * Groningen
   * Raw data (fastq)
   * Results (bam)
 * Grid (vlemed)
   * Raw data SRM (fastq) srm://srm.grid.sara.nl:8443/pnfs/grid.sara.nl/data/vlemed/gvnl/data (access to members of vlemed/gvnl VO)
   * Raw data LFC (fastq) lfn://lfc.grid.sara.nl:5010/grid/vlemed/gvnl/data (access closed)
   * Results LFC lfn://lfc.grid.sara.nl:5010/grid/vlemed/gvnl/data/bam/markduplicates/realignment/fixmates/recalibrate/sorted (access closed)
   * Note: results will be copied to bbmri.nl VO when analysis is done

'''3rd batch'''
 * Groningen
   * Raw data (fastq)
   * Results

'''4th batch'''
 * Groningen
   * Raw data (fastq)
   * Results
 * Grid (bbmri.nl)
   * Raw data SRM (fastq) srm://srm.grid.sara.nl:8443/pnfs/grid.sara.nl/data/bbmri.nl/fourth_batch (access to members of bbmri.nl VO)
   * Raw data LFC (fastq) lfn://lfc.grid.sara.nl:5010/grid/bbmri.nl/gonl/data/input (access to members of bbmri.nl VO)
   * Results LFC lfn://lfc.grid.sara.nl:5010/grid/bbmri.nl/gonl/data/output (access to members of bbmri.nl VO)
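
As a note on the depth-based CNV step described under BGI hg19 Pipeline: the idea of comparing each window's depth only against windows with the same GC content can be illustrated with a short Python sketch. This is not CNV_Detector's implementation. The function name `flag_depth_outliers`, the 1% GC bin width and the z-score cutoff of 3 are all assumptions chosen for illustration, and the flanking-region criterion is not modeled.

```python
from collections import defaultdict
from statistics import mean, stdev


def flag_depth_outliers(windows, z_cutoff=3.0, gc_bin_width=1.0):
    """Flag windows whose depth deviates from other windows of similar GC%.

    `windows` is a list of (gc_percent, depth) tuples, one per sliding window.
    Returns a list of (window_index, z_score) for the flagged windows.
    """
    # Group depths by GC level (assumed bin width: 1 percentage point).
    depths_by_bin = defaultdict(list)
    for gc, depth in windows:
        depths_by_bin[int(gc // gc_bin_width)].append(depth)

    # Estimate mean and SD per GC level; skip bins where SD is undefined or 0.
    stats = {}
    for key, depths in depths_by_bin.items():
        if len(depths) >= 2:
            sd = stdev(depths)
            if sd > 0:
                stats[key] = (mean(depths), sd)

    flagged = []
    for index, (gc, depth) in enumerate(windows):
        key = int(gc // gc_bin_width)
        if key in stats:
            mu, sd = stats[key]
            z = (depth - mu) / sd
            if abs(z) > z_cutoff:
                flagged.append((index, z))
    return flagged


# Made-up windows: depth near 30x at 40% GC, depth near 90x at 60% GC,
# plus one 40% GC window at 90x (index 20).
windows = [(40.0, 29.0 if i % 2 == 0 else 31.0) for i in range(20)]
windows.append((40.0, 90.0))
windows += [(60.0, 89.0 if i % 2 == 0 else 91.0) for i in range(20)]

flagged = flag_depth_outliers(windows)
print([index for index, _ in flagged])  # -> [20]
```

Only the 90x window at 40% GC is flagged; the 60% GC windows have the same depth but are normal for their GC level. Note that in this simplified version the outlier itself contributes to the estimated mean and SD of its bin, which a more careful model would avoid.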