[[TOC]]

== Port applications to Dutch Life Science Grid ==

'''People'''
 * AMC: Antoine van Kampen, Barbera van Schaik, Silvia D Olabarriaga, Mark Santcroos
 * Sara/BiGGrid: Tom Visser
 * UMCG: Morris Swertz, Freerk van Dijk

'''Description'''

Software is going to be implemented as workflow components. The workflows will run on the Dutch life science grid.
 * Information about the infrastructure: http://www.bioinformaticslaboratory.nl/twiki/bin/view/EBioScience/
 * Getting started: http://www.bioinformaticslaboratory.nl/twiki/bin/view/EBioScience/GettingStarted
 * [http://www.ncbi.nlm.nih.gov/pubmed/21156038 Luyf and van Schaik et al (2010)]

'''Implemented workflow components at AMC'''

The workflow components listed below are already available. We can expand the list with Pindel and (parts of) the GATK pipeline.
 * Splitting of fastq files
 * Building a BWA index on the genome sequence (base space and color space)
 * BWA for shotgun reads (base space and color space). It is possible to do parameter sweeps. Output is in bam format
 * Merge bam results
 * Samtools pileup
 * Varscan (pileup to snp, indel and cns)
 * Bam2coverage creates a UCSC wiggle file to display the genome coverage (per 50kbp)
 * Coverage-per-base determines the coverage for every base in the genome and summarizes the results (coverage versus frequency)
 * Annovar (works for hg18, working on other assemblies). This is a pipeline to annotate variants (gene, dbsnp, hapmap, 1000g, conservation, etc)
 * FastqC

'''Implemented components of the Groningen pipeline'''

''A more detailed description will follow later''
 * !BwaIllumina (''done'') - pe00-bwa-align-pair1.ftl, pe01-bwa-align-pair2.ftl, pe02-bwa-sampe.ftl, pe03-sam-to-bam.ftl, pe04-sam-sort.ftl
 * !MarkDuplicates (''done'') - pe05-mark-duplicates.ftl
 * PicardQC (''partly done'') - pe04b-picardQC.ftl. The R environment is not up and running yet, so the .pdf, .hist and .bamindexstats files cannot be produced yet. We will continue with the other components and fix this later. The attachment contains info about the required R packages.
 * !GatkGenerateIntervalFile (''done'') - see the e-mail from Freerk of Dec 13, 2010
 * !ReAlign (''in progress'') - pe06-realign.ftl. Fails, need to discuss this with Freerk
 * !FixMates (''done'') - pe07-fixmates.ftl
 * !GatkCovariates (''done'') - pe08-covariates-before.ftl, pe11-covariates-after.ftl
 * !GatkRecalibrate (''done'') - pe09-recalibrate.ftl
 * !SortSam (''done'') - pe10-sam-sort.ftl
 * !AnalyzeCovariates (''done'') - pe12-analyze-covariates.ftl
 * ''Update 31-1-2011'': integrating all the components in two workflows; one for alignment, one for the subsequent steps

'''To be implemented'''
 * The components of the Groningen pipeline that are not implemented as workflow components yet
 * Pindel

'''Data access rights'''

To restrict access to the data to the smallest possible group of people, we have created a subgroup "gvnl" within the "vlemed" Virtual Organisation (VO). To become part of this group, people need a Grid certificate and must be a member of the "vlemed" VO. The following page explains how to get a certificate and how to join the "vlemed" VO: http://www.bioinformaticslaboratory.nl/twiki/bin/view/EBioScience/EBioInfra#Access

For more information about data access see http://www.bioinformaticslaboratory.nl/twiki/bin/view/EBioScience/DataManagement

'''Things to address'''
 * Available disk space on the grid storage elements / worker nodes

== Data location on grid ==

=== Data ===

Browse to the following directory with the Vbrowser (this directory is only readable/writable for the gvnl group):

lfn://lfc.grid.sara.nl:5010/grid/vlemed/gvnl/Data/Results-test20101220

Each component in the workflow generates a new sub directory.
If you would like to retrieve the analysis results of the picard-qc component for one of the samples, go to the directory:
 * bwa_databasename_parametername/samplename_libraryname_lanename/picard1.32qc/
 * Example: lfn://lfc.grid.sara.nl:5010/grid/vlemed/gvnl/Data/bwa_gatk_index_basespace_human_g1k_v37.fasta.tar.gz_bwa_4_threads.txt/A4a_HUModqRBVDIBAPE_L7

The results of the analyses in which the fastq files were split are located in the directories that start with:
 * split_fileprefix_1/

'''Note that not everything is there yet and that the data structure may change, since we are still testing.'''

=== Workflows and databases ===

These directories are open to all members of the vlemed VO:
 * Workflows: lfn://lfc.grid.sara.nl:5010/grid/vlemed/AMC-e-BioScience/Sequence_WF
 * Databases: lfn://lfc.grid.sara.nl:5010/grid/vlemed/AMC-e-BioScience/Sequence_DB

The directories that contain the workflows have the following structure:
|| '''Directory''' || '''Description''' ||
|| bin || dependent binaries like bwa, samtools ||
|| GasW || component descriptions; each describes which executable has to run on the grid and specifies the input and output files/parameters ||
|| Java || dependent jar files, e.g. the GATK jar ||
|| parameterFiles || text files that contain exactly one line with the parameters that you would like to provide to bwa or another component ||
|| Scufl || old workflow description files, can be ignored ||
|| shFiles || these files are executed on the grid and described by the GASW descriptor; the gvnl shFiles are based on the Groningen templates ||
|| Workflows || workflow descriptions; clicking on one starts the Moteur plugin. Input files/parameters can be specified in the fields. If you click on "Run", the jobs are submitted to the grid ||

The shell files can (in most cases) run on any Linux cluster. In that case you need to place the shell file and the dependent executable(s) in one directory. Each shell file starts with an example of how to run it.
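
The one-line convention for the files in parameterFiles can be checked before a run. The following is only a minimal sketch under assumptions: the function name `read_parameter_file` is hypothetical and not part of the workflow code, and it assumes that blank lines (such as a trailing newline) do not count as extra lines.

```python
import shlex
from pathlib import Path


def read_parameter_file(path):
    """Return the parameters of a one-line parameter file as a list of tokens.

    Hypothetical helper for illustration; not part of the actual workflows.
    """
    text = Path(path).read_text()
    # Ignore blank lines so that a trailing newline is not seen as a second line.
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if len(lines) != 1:
        raise ValueError(
            f"{path}: expected exactly one line of parameters, found {len(lines)}"
        )
    # Split the single line the way a shell would, honouring quotes.
    return shlex.split(lines[0])
```

For example, if a file named bwa_4_threads.txt contained the single line `-t 4` (an assumed content, chosen because `-t` is the bwa option for the number of threads), the function would return the two tokens `-t` and `4`; a file with two parameter lines would raise a `ValueError`.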
== Workflow execution ==

'''On mini pilot. Split lines-per-file: 8,000,000. Start: 18-12-2010 16:40'''
|| '''Lane''' || '''Split, BWA, merge''' || '''Comments''' || '''Elapsed time (s)''' ||
|| A4a_L4_HUModqRBUDIBAPE || [https://hangplek.nikhef.nl:443/workflows/workflow-cOunUZ/html/workflow-cOunUZ.html done] || || 22740 ||
|| A4a_L6_HUModqRADDIAAPE || [https://hangplek.nikhef.nl:443/workflows/workflow-pqWdc1/html/workflow-pqWdc1.html done] || || 37750 ||
|| A4a_L7_HUModqRBVDIBAPE || [https://hangplek.nikhef.nl:443/workflows/workflow-OqeNdv/html/workflow-OqeNdv.html failed] || || ||
|| A4b_L3_HUModqRAFDIBAPE || [https://hangplek.nikhef.nl:443/workflows/workflow-GRDX0X/html/workflow-GRDX0X.html done] || || 32256 ||
|| A4b_L6_HUModqRBTDIBAPE || [https://hangplek.nikhef.nl:443/workflows/workflow-ZRH98o/html/workflow-ZRH98o.html failed] || || ||
|| R2A _L1_HUModqRADDIBAPE || [https://hangplek.nikhef.nl:443/workflows/workflow-1lNDQh/html/workflow-1lNDQh.html failed] || || ||
|| R2A _L1_HUModqRAFDIBAPE || [https://hangplek.nikhef.nl:443/workflows/workflow-rufel6/html/workflow-rufel6.html done] || || 35819 ||
|| R2A _L5_HUModqRAEDIAAPE || [https://hangplek.nikhef.nl:443/workflows/workflow-wVuMjg/html/workflow-wVuMjg.html failed] || || ||
|| R2B _L3_HUModqRBTDIAAPE || [https://hangplek.nikhef.nl:443/workflows/workflow-L3m5Yi/html/workflow-L3m5Yi.html done] || || 23754 ||
|| R2B _L4_HUModqRBUDIAAPE || [https://hangplek.nikhef.nl:443/workflows/workflow-alGMJK/html/workflow-alGMJK.html failed] || pair 1 not in correct gzip format? || ||
|| R2B _L6_HUModqRBTDIBAPE || [https://hangplek.nikhef.nl:443/workflows/workflow-LCaDkY/html/workflow-LCaDkY.html failed] || || ||
|| R2C _L2_HUModqRBUDIBAPE || [https://hangplek.nikhef.nl:443/workflows/workflow-gVXW9i/html/workflow-gVXW9i.html done] || || 40763 ||
|| R2C _L2_HUModqRBVDIBAPE || [https://hangplek.nikhef.nl:443/workflows/workflow-8bma4j/html/workflow-8bma4j.html failed] || || ||
|| R2C _L7_HUModqRBVDIAAPE || [https://hangplek.nikhef.nl:443/workflows/workflow-Rjh1lC/html/workflow-Rjh1lC.html done] || || 39374 ||
|| Unknown_L6_HUModqRBUDIAAPE || [https://hangplek.nikhef.nl:443/workflows/workflow-IG0HAm/html/workflow-IG0HAm.html done] || || 30307 ||

Note: The Merge component is red (marked as failed) in the finished workflows. The bam files were produced successfully.

'''BWA alignment on mini pilot without splitting the data'''
 * 19-12-2010 14:00 [https://hangplek.nikhef.nl:443/workflows/workflow-rIUQ2N/html/workflow-rIUQ2N.html done] - elapsed time 325886 s

'''PicardQC on finished bam files'''
 * 20-12-2010 11:30 [https://hangplek.nikhef.nl:443/workflows/workflow-3T5uMC/html/workflow-3T5uMC.html done] - elapsed time (10 bam files) 8730 s
 * [http://www.bbmriwiki.nl/attachment/wiki/BigCompute/log-picardqc-20101220.ods Info runtime and used disk space] (ods)

'''Coverage-per-base on finished bam files'''
 * 20-12-2010 12:40 [https://hangplek.nikhef.nl:443/workflows/workflow-qwMQka/html/workflow-qwMQka.html done] - elapsed time 89634 s

'''Mark-duplicates on finished bam files'''
 * 20-12-2010 12:47 [https://hangplek.nikhef.nl:443/workflows/workflow-E6tgrW/html/workflow-E6tgrW.html done] - elapsed time 10614 s

'''Re-align on finished bam files'''
 * 20-12-2010 17:48 [https://hangplek.nikhef.nl:443/workflows/workflow-ICmgZF/html/workflow-ICmgZF.html done] - elapsed time 9165 s

'''Gvnl pipeline on remaining samples (split, bwa, merge, coverage-per-base, picard-qc, mark-duplicates, re-align). Start: 20-12-2010 18:50'''
|| '''Lane''' || '''Progress''' || '''Elapsed time (s)''' ||
|| A4a_L7_HUModqRBVDIBAPE || [https://hangplek.nikhef.nl:443/workflows/workflow-RllyLv/html/workflow-RllyLv.html done] || 208542 ||
|| A4b_L6_HUModqRBTDIBAPE || [https://hangplek.nikhef.nl:443/workflows/workflow-A6WSPB/html/workflow-A6WSPB.html done] || 74476 ||
|| R2A _L1_HUModqRADDIBAPE || [https://hangplek.nikhef.nl:443/workflows/workflow-L2NZUb/html/workflow-L2NZUb.html done] || 72636 ||
|| R2A _L5_HUModqRAEDIAAPE || [https://hangplek.nikhef.nl:443/workflows/workflow-VoZud6/html/workflow-VoZud6.html done] || 55242 ||
|| R2B _L4_HUModqRBUDIAAPE || [https://hangplek.nikhef.nl:443/workflows/workflow-ldqezK/html/workflow-ldqezK.html failed] || ||
|| R2B _L6_HUModqRBTDIBAPE || [https://hangplek.nikhef.nl:443/workflows/workflow-q9PKaQ/html/workflow-q9PKaQ.html done] || 227859 ||
|| R2C _L2_HUModqRBVDIBAPE || [https://hangplek.nikhef.nl:443/workflows/workflow-jroCmW/html/workflow-jroCmW.html done] || 44730 ||

 * Note 1: The workflow stops after merge-bam. Working on it.
 * Note 2: The Dutch grid seems very busy at the moment. That could explain the longer execution times.

'''Gvnl pipeline on theoretical reads'''
 * 22-12-2010 19:14 [https://hangplek.nikhef.nl/workflows//workflow-FRErry/html/workflow-FRErry.html done] - elapsed time 38587 s
 * The workflow now continues after the merge-bam step
 * To do: implement the recalibration step, and a check on all bam files of the split input before the merge step

== Alternatives ==

=== Clusters ===
 * Groningen
   * Description here about code template and automatic PBS script generation. Job submission/monitoring
 * Leiden
 * Huygens
 * Lisa
 * Philips
 * DAS

=== Grid ===
 * EBioInfra http://www.bioinformaticslaboratory.nl/twiki/bin/view/EBioScience/
 * BiGGrid Cloud http://www.cloud.sara.nl/
 * Topos https://grid.sara.nl/wiki/index.php/Using_the_Grid/ToPoS