| 108 | |
| 109 | == Quality Control == |
| 110 | The key quality-control metrics currently under discussion, together with their thresholds, are: |
| 111 | * RawData |
| 112 | ** FastQC report (per mate of the pair) |
| 113 | *** Manual look at files and check: |
| 114 | **** Avg Quality per read > 30 |
| 115 | **** Number of sequences ~60 million |
| 116 | **** Sequence quality should look OK |
| 117 | * Alignment (per lane) |
| 118 | ** Picard Alignment Summary Metrics |
| 119 | *** % PF (pass-filter) reads aligned (PCT_PF_READS_ALIGNED) > 90% |
| 120 | *** PF high-quality error rate (PF_HQ_ERROR_RATE) < 1% |
| 121 | *** PF reads aligned (PF_READS_ALIGNED) > 150 million |
| 122 | ** Picard GC Bias Metrics |
| 123 | *** GC Curve should look OK |
| 124 | *** Median GC% windows between 30 and 40 |
| 125 | *** Avg Mean Base Quality should be OK |
| 126 | ** Picard Insert Size Metrics |
| 127 | *** Peak should be at ~500 bp |
| 128 | *** Peak should be narrow |
| 129 | *** Should have few outliers |
| 130 | ** Picard BAM Index Stats |
| 131 | *** Should be uniform by Chromosome |
| 132 | ** GATK or Picard (currently testing) Coverage Metrics |
| 133 | *** Coverage distribution should approximate a Poisson curve with its peak at 12x |
| 134 | ** Picard Mark Duplicates |
| 135 | *** %duplicates between 5% and 8% |
| 136 | * Recalibration |
| 137 | ** GATK AnalyzeCovariates |
| 138 | *** No output currently; should revisit when working |
| 139 | ** Picard Quality by Cycle |
| 140 | *** To be determined once data is produced |
| 141 | ** Picard Quality Distribution |
| 142 | *** To be determined once data is produced |
| 143 | * Initial SNP Calling |
| 144 | ** To be determined once data is produced and analyzed. Initial thresholds should be derived from the difference between chip data and sequence data, and from the percentage of SNPs found in dbSNP. |
| 145 | |
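The numeric thresholds in the list above could be checked automatically once the metrics have been collected. The following is only a minimal sketch: the metric key names are hypothetical (they are not the output column names of FastQC or Picard), percentages are assumed to be given on a 0-100 scale, and the "between" ranges are assumed to be inclusive. The visual checks ("should look OK", curve shapes) are not covered and still need a manual look.

```python
# Sketch of an automated check for the numeric QC thresholds listed above.
# All metric names are hypothetical placeholders chosen for this example.
CHECKS = {
    "avg_quality_per_read": lambda v: v > 30,          # FastQC, per mate
    "pct_pf_reads_aligned": lambda v: v > 90,          # percent, 0-100 scale
    "pf_hq_error_rate_pct": lambda v: v < 1,           # percent, 0-100 scale
    "pf_reads_aligned":     lambda v: v > 150_000_000, # absolute read count
    "median_gc_pct":        lambda v: 30 <= v <= 40,   # assumed inclusive range
    "pct_duplicates":       lambda v: 5 <= v <= 8,     # assumed inclusive range
}


def failed_checks(metrics):
    """Return the sorted names of metrics that are present but out of range.

    Metrics missing from `metrics` are skipped rather than reported,
    since not every step produces output yet.
    """
    return sorted(
        name
        for name, passes in CHECKS.items()
        if name in metrics and not passes(metrics[name])
    )


if __name__ == "__main__":
    # Made-up values for illustration only, not real measurements.
    lane = {
        "avg_quality_per_read": 35,
        "pct_pf_reads_aligned": 93.5,
        "pct_duplicates": 12,
    }
    print(failed_checks(lane))
```

Keeping each threshold as a small predicate makes it easy to adjust a value, or to add the recalibration and SNP-calling thresholds once they have been determined.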