[[TOC]]

== Port applications to Dutch Life Science Grid ==

'''People'''
 * AMC: Antoine van Kampen, Barbera van Schaik, Silvia D Olabarriaga, Mark Santcroos
 * Sara/BiGGrid: Tom Visser
 * UMCG: Morris Swertz, Freerk van Dijk

=== Description ===

The software will be implemented as workflow components. The workflows will run on the Dutch life science grid.
 * Information about the infrastructure: http://www.bioinformaticslaboratory.nl/twiki/bin/view/EBioScience/
 * Getting started: http://www.bioinformaticslaboratory.nl/twiki/bin/view/EBioScience/GettingStarted
 * Angela CM Luyf, Barbera DC van Schaik, Michel de Vries, Frank Baas, Antoine HC van Kampen, Silvia D Olabarriaga (2010) Initial steps towards a production platform for DNA sequence analysis on the grid. BMC Bioinformatics 11: 1. Dec [http://www.ncbi.nlm.nih.gov/pubmed/21156038 pubmed]

=== Implemented workflow components at AMC ===

The following workflow components are already available. The list can be extended with Pindel and (parts of) the GATK pipeline.
 * Splitting of fastq files
 * Building a BWA index on the genome sequence (base space and color space)
 * BWA for shotgun reads (base space and color space). Parameter sweeps are possible. Output is in bam format
 * Merge bam results
 * Samtools pileup
 * Varscan (pileup to snp, indel and cns)
 * Bam2coverage creates a UCSC wiggle file to display the genome coverage (per 50kbp)
 * Coverage-per-base determines the coverage for every base in the genome and summarizes the results (coverage versus frequency)
 * Annovar (works for hg18, working on other assemblies).
   This is a pipeline to annotate variants (gene, dbsnp, hapmap, 1000g, conservation, etc)
 * FastqC

=== Implemented components of the Groningen pipeline ===

Each entry has the form: template (grid component, location of the workflow description).

==== Alignment, realignment, recalibration, stats ====

 * pe0--fastqc.ftl (FastqToFastQC, lfn://lfc.grid.sara.nl:5010/grid/vlemed/AMC-e-BioScience/Sequence_WF/quality/Workflow/FastqToFastQC.gwendia)
 * pe00-bwa-align-pair1.ftl (!BwaIllumina, lfn://lfc.grid.sara.nl:5010/grid/vlemed/AMC-e-BioScience/Sequence_WF/gvnl/Workflows/BwaIllumina.gwendia)
 * pe01-bwa-align-pair2.ftl (!BwaIllumina, lfn://lfc.grid.sara.nl:5010/grid/vlemed/AMC-e-BioScience/Sequence_WF/gvnl/Workflows/BwaIllumina.gwendia)
 * pe02-bwa-sampe.ftl (!BwaIllumina, lfn://lfc.grid.sara.nl:5010/grid/vlemed/AMC-e-BioScience/Sequence_WF/gvnl/Workflows/BwaIllumina.gwendia)
 * pe03-sam-to-bam.ftl (!BwaIllumina, lfn://lfc.grid.sara.nl:5010/grid/vlemed/AMC-e-BioScience/Sequence_WF/gvnl/Workflows/BwaIllumina.gwendia)
 * pe04a-!HsMetrics.ftl (!CalculateHsMetrics, lfn://lfc.grid.sara.nl:5010/grid/vlemed/AMC-e-BioScience/Sequence_WF/gvnl/Workflows/CalculateHsMetrics.gwendia)
 * pe04b-picardQC.ftl (PicardQC, lfn://lfc.grid.sara.nl:5010/grid/vlemed/AMC-e-BioScience/Sequence_WF/gvnl/Workflows/PicardQC.gwendia)
 * pe04-sam-sort.ftl (!SamSort, lfn://lfc.grid.sara.nl:5010/grid/vlemed/AMC-e-BioScience/Sequence_WF/gvnl/Workflows/SamSort.gwendia)
 * pe05-mark-duplicates.ftl (!MarkDuplicates, lfn://lfc.grid.sara.nl:5010/grid/vlemed/AMC-e-BioScience/Sequence_WF/gvnl/Workflows/MarkDuplicates.gwendia)
 * pe06-realign.ftl (!ReAlign, lfn://lfc.grid.sara.nl:5010/grid/vlemed/AMC-e-BioScience/Sequence_WF/gvnl/Workflows/ReAlign.gwendia)
 * pe07-fixmates.ftl (!FixMates, lfn://lfc.grid.sara.nl:5010/grid/vlemed/AMC-e-BioScience/Sequence_WF/gvnl/Workflows/FixMates.gwendia)
 * pe08-covariates-before.ftl (!GatkCovariates, lfn://lfc.grid.sara.nl:5010/grid/vlemed/AMC-e-BioScience/Sequence_WF/gvnl/Workflows/GatkCovariates.gwendia)
 * pe09-recalibrate.ftl (!GatkRecalibrate, lfn://lfc.grid.sara.nl:5010/grid/vlemed/AMC-e-BioScience/Sequence_WF/gvnl/Workflows/GatkRecalibrate.gwendia)
 * pe10-sam-sort.ftl (!SamSort, lfn://lfc.grid.sara.nl:5010/grid/vlemed/AMC-e-BioScience/Sequence_WF/gvnl/Workflows/SamSort.gwendia)
 * pe11-covariates-after.ftl (!GatkCovariates, lfn://lfc.grid.sara.nl:5010/grid/vlemed/AMC-e-BioScience/Sequence_WF/gvnl/Workflows/GatkCovariates.gwendia)
 * pe12-analyze-covariates.ftl (!GatkAnalyzeCovariates, lfn://lfc.grid.sara.nl:5010/grid/vlemed/AMC-e-BioScience/Sequence_WF/gvnl/Workflows/GatkAnalyzeCovariates.gwendia)

==== Merge bam per sample and perform SNP and indel calling ====

 * vc00a-unified-genotyper.ftl '''to do'''
 * vc00b-variant-filtration.ftl '''to do'''
 * vc00c-variant-eval.ftl '''to do'''
 * vc00d-picardMetrics.ftl (PicardQC, lfn://lfc.grid.sara.nl:5010/grid/vlemed/AMC-e-BioScience/Sequence_WF/gvnl/Workflows/PicardQC.gwendia)
 * vc00-merge.ftl '''to do'''
 * vc00.merge.ftl '''to do'''
 * vc01-coverage.ftl '''to do'''
 * vc01.unified_genotyper.ftl '''to do'''
 * vc02.picardQC.ftl (PicardQC, lfn://lfc.grid.sara.nl:5010/grid/vlemed/AMC-e-BioScience/Sequence_WF/gvnl/Workflows/PicardQC.gwendia)
 * vc02-realigner-target-creator.ftl '''to do'''
 * vc03.coverage.ftl '''to do'''
 * vc03-realign.ftl (!ReAlign, lfn://lfc.grid.sara.nl:5010/grid/vlemed/AMC-e-BioScience/Sequence_WF/gvnl/Workflows/ReAlign.gwendia)
 * vc04-fixmates.ftl (!FixMates, lfn://lfc.grid.sara.nl:5010/grid/vlemed/AMC-e-BioScience/Sequence_WF/gvnl/Workflows/FixMates.gwendia)
 * vc05-indel-genotyper-v2.ftl '''to do'''
 * vc06-filter-indels.ftl '''to do'''
 * vc07-unified-genotyper.ftl '''to do'''
 * vc08-make-indel-mask.ftl '''to do'''
 * vc09-variant-filtration.ftl '''to do'''
 * vc10-variant-eval.ftl '''to do'''
 * vc11-name-sort-bam.ftl (!SamSort, lfn://lfc.grid.sara.nl:5010/grid/vlemed/AMC-e-BioScience/Sequence_WF/gvnl/Workflows/SamSort.gwendia)
 * Pindel (Pindel, lfn://lfc.grid.sara.nl:5010/grid/vlemed/AMC-e-BioScience/Sequence_WF/pindel/Workflows/Pindel.gwendia)

=== Data access rights ===

To restrict access to the data to as few people as possible, we have created a subgroup "gvnl" within the "vlemed" Virtual Organisation (VO). To join this group, a person needs a Grid certificate and must be a member of the "vlemed" VO. The following page explains how to get a certificate and how to join the "vlemed" VO: http://www.bioinformaticslaboratory.nl/twiki/bin/view/EBioScience/EBioInfra#Access

For more information about data access, see http://www.bioinformaticslaboratory.nl/twiki/bin/view/EBioScience/DataManagement

=== Things to address ===

 * Available disk space on the grid storage elements / worker nodes

== Data location on grid ==

=== Data ===

The data is located on the storage element at Sara and is readable and writable only for the vlemed/gvnl group. This screencast demonstrates how to access the data from the Vbrowser: http://www.youtube.com/watch?v=FicwWGAbubQ

 * Storage location (resource): srm.grid.sara.nl
 * Path: /pnfs/grid.sara.nl/data/vlemed/gvnl

=== Workflows and databases ===

These directories are open to all members of the vlemed VO:
 * Workflows: lfn://lfc.grid.sara.nl:5010/grid/vlemed/AMC-e-BioScience/Sequence_WF
 * Databases: lfn://lfc.grid.sara.nl:5010/grid/vlemed/AMC-e-BioScience/Sequence_DB

The directories that contain the workflows have the following structure:

|| '''Directory''' || '''Description''' ||
|| bin || dependent binaries like bwa, samtools ||
|| GasW || component descriptions; each describes which executable has to run on the grid and specifies the input and output files/parameters ||
|| Java || dependent jar files, e.g. the GATK jar ||
|| parameterFiles || text files that contain exactly one line with the parameters that you would like to provide to bwa or another component ||
|| Scufl || old workflow description files, can be ignored ||
|| shFiles || these files are executed on the grid and described by the GASW descriptor; the gvnl shFiles are based on the Groningen templates ||
|| Workflows || workflow descriptions; clicking on one starts the Moteur plugin. Input files/parameters can be specified in the fields. Clicking "Run" submits the jobs to the grid ||

The shell files can (in most cases) run on any Linux cluster. In that case you need to place the shell file and the dependent executable(s) in one directory. At the start of each shell file there is an example of how to run it.

== Workflow execution ==

|| '''#''' || '''Sample''' || '''WF''' || '''Status''' || '''Start''' ||
|| 1 || A4a || [http://orange.ebioscience.amc.nl/workflows/workflow-693426f3/html/workflow-693426f3.html F] [http://orange.ebioscience.amc.nl/workflows/workflow-35a9b777/html/workflow-35a9b777.html F] [http://orange.ebioscience.amc.nl/workflows/workflow-4ba3f651/html/workflow-4ba3f651.html F] [http://orange.ebioscience.amc.nl/workflows/workflow-425d9ceb/html/workflow-425d9ceb.html workflow-425d9ceb] || done || ||
|| 2 || Vartest || [http://orange.ebioscience.amc.nl/workflows/workflow-490b15f8/html/workflow-490b15f8.html workflow-490b15f8] || done || ||
|| 3 || Iteration test || [http://orange.ebioscience.amc.nl/workflows/workflow-bf48aff1/html/workflow-bf48aff1.html workflow-bf48aff1] || failed || ||
|| 4 || Iteration test || [http://orange.ebioscience.amc.nl/workflows/workflow-923c6588/html/workflow-923c6588.html workflow-923c6588] || done || ||
|| 5 || 60-samples-batch (15 lanes) || [http://orange.ebioscience.amc.nl/workflows/workflow-d80b5767/html/workflow-d80b5767.html workflow-d80b5767] || 10 / 15 done || 11-02-2011 19:30 ||
|| 6 || 60-samples-batch A (55 lanes) || [http://orange.ebioscience.amc.nl/workflows/workflow-cbaca6e5/html/workflow-cbaca6e5.html workflow-cbaca6e5] || 15 / 55 done || 12-02-2011 13:55 ||
|| 7 || 60-samples-batch A remaining 1 (17 lanes) || [http://orange.ebioscience.amc.nl/workflows/workflow-835250b1/html/workflow-835250b1.html workflow-835250b1] || failed (grid very busy) || 07-03-2011 17:45 ||
|| 8 || 60-samples-batch A remaining 1 (17 lanes) || [http://orange.ebioscience.amc.nl/workflows/workflow-31eb952d/html/workflow-31eb952d.html workflow-31eb952d] || 1 / 17 done || 08-03-2011 14:15 ||
|| 9 || 60-samples-batch A remaining (27 lanes) || [http://orange.ebioscience.amc.nl/workflows/workflow-fd98c7c8/html/workflow-fd98c7c8.html workflow-fd98c7c8] || running || 15-03-2011 10:45 ||
|| 10 || 60-samples-batch G (27/54 lanes) || [http://orange.ebioscience.amc.nl/workflows/workflow-a781209c/html/workflow-a781209c.html workflow-a781209c] || running || 15-03-2011 20:37 ||
|| 11 || second-batch R10-11-12 (27 lanes) || [http://orange.ebioscience.amc.nl/workflows/workflow-6fe1cb10/html/workflow-6fe1cb10.html workflow-6fe1cb10] || running || 16-03-2011 10:57 ||
|| x || 60-samples-batch G (54 lanes) || || || ||
|| x || 60-samples-batch R (60 lanes) || || || ||

== Alternatives ==

=== Clusters ===

 * Groningen
   * Description here about code template and automatic PBS script generation. Job submission/monitoring
 * Leiden
 * Huygens
 * Lisa
 * Philips
 * DAS

=== Grid ===

 * EBioInfra http://www.bioinformaticslaboratory.nl/twiki/bin/view/EBioScience/
 * BiGGrid Cloud http://www.cloud.sara.nl/
 * Topos https://grid.sara.nl/wiki/index.php/Using_the_Grid/ToPoS
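
== Example: one-line parameter files ==

The parameterFiles directory described under "Workflows and databases" holds text files with exactly one line of parameters, and the BWA component supports parameter sweeps. The Python sketch below is only an illustration of that convention, not part of the actual workflows: it writes one single-line file per value of a swept option and checks that a file really contains exactly one line. The function names, the file naming scheme and the example values are assumptions made for this sketch; `-n` and `-l` are existing `bwa aln` options, but the values shown are arbitrary.

```python
import tempfile
from pathlib import Path


def write_parameter_file(path, parameters):
    """Write the parameters as exactly one line of text (hypothetical helper)."""
    line = " ".join(str(p) for p in parameters)
    if "\n" in line or "\r" in line:
        raise ValueError("parameters must fit on a single line")
    Path(path).write_text(line + "\n")


def read_parameter_file(path):
    """Return the single parameter line; fail if the file is empty or has more lines."""
    lines = Path(path).read_text().splitlines()
    if len(lines) != 1 or not lines[0].strip():
        raise ValueError(f"{path}: expected exactly one non-empty line, found {len(lines)}")
    return lines[0].strip()


def write_sweep(directory, flag, values, fixed=()):
    """Create one parameter file per value of `flag` (a simple parameter sweep).

    The naming scheme params_<flag>_<value>.txt is an assumption of this sketch.
    """
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    paths = []
    for value in values:
        path = directory / f"params_{flag.lstrip('-')}_{value}.txt"
        write_parameter_file(path, [*fixed, flag, value])
        paths.append(path)
    return paths


if __name__ == "__main__":
    # Illustrative values only: sweep the bwa aln option -n, keep -l fixed.
    with tempfile.TemporaryDirectory() as tmp:
        for p in write_sweep(tmp, "-n", [2, 3, 4], fixed=["-l", 32]):
            print(p.name, "->", read_parameter_file(p))
```

Each generated file could then be uploaded to the parameterFiles directory and selected as input in the Moteur plugin; how the workflows themselves iterate over such files is not covered by this sketch.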