Changes between Version 2 and Version 3 of DeNovoVariationPipeline


Timestamp: Sep 24, 2010 5:03:52 PM
Author: Yurii Aulchenko
Comment: format fix + small update

  • DeNovoVariationPipeline

    v2 v3  
    7272== (1) COVERAGE ==
    7373
    74 For few trios, increase coverage (potentially only in parents, or even one parent); this will decrease the chance that we miss a heterozygote. We did calculations of what coverage should be so we get chance of het missing becoming comparable to the mutation rate; e.g. we aim chance het missing = 1e-8 or so (see Box below for computations). It appears that at ~32x only half of the situations described above will be attributable to inadequate coverage, while other half will be true ‘de novo’ mutations.
    75 
    76 Assume the heterozygote call is made when at least two reads show the variant. Let us also assume for the moment that coverage is always Nx. Denote reference sequence as “R” and alternative as “A”, so in fact the person is R/A. Let us compute the probability that we miss this heterozygote (i.e. will call it A/A or R/R):
    77 
    78 P(call R/A as R/R or A/A) = P(all N read R) + P([N-1] reads are R, and 1 read is A) + P(all N read A) + P([N-1] reads are A, and 1 read is R)
     74For a few trios, increase coverage (potentially only in the parents, or even in one parent); this will decrease the chance that we miss a heterozygote. We calculated what coverage is needed for the chance of missing a heterozygote to become comparable to the mutation rate; e.g. we aim for a chance of missing a het of 1e-8 or so (see Box 1 below for computations). It appears that at ~32x only half of the situations described above will be attributable to inadequate coverage, while the other half will be true ‘de novo’ mutations.
     75
     76
     77{{{
     78Box 1
     79
     80Assume the heterozygote call is made when at least two reads show the variant.
     81Let us also assume for the moment that coverage is always Nx. Denote reference
     82sequence as “R” and alternative as “A”, so in fact the person is R/A. Let us compute
     83the probability that we miss this heterozygote (i.e. will call it A/A or R/R):
     84
     85P(call R/A as R/R or A/A) = P(all N read R) + P([N-1] reads are R, and 1 read is A)
     86                                                + P(all N read A) + P([N-1] reads are A, and 1 read is R)
    7987
    8088Assuming that probability of reads follows binomial distribution, we get
     
    8391
    8492P(call R/A as R/R or A/A) ~ 1e-8 at N ~ 32
     93}}}
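
The Box 1 calculation can be reproduced with a few lines of Python. This is only a sketch under the assumptions stated in the box (fixed coverage N, each read showing R or A with probability 1/2, a heterozygote missed when fewer than two reads show one of the alleles); the function name `p_miss_het` and the parameter `p_alt` are illustrative and not part of the pipeline.

```python
from math import comb


def p_miss_het(n, p_alt=0.5):
    """Chance that a true R/A site is called R/R or A/A at coverage n.

    Assumes n >= 3, so the four terms below are distinct events.
    """
    def pmf(k):
        # Binomial probability that exactly k of the n reads show allele A.
        return comb(n, k) * p_alt**k * (1 - p_alt)**(n - k)

    # 0 or 1 reads show A (called R/R), or 0 or 1 reads show R (called A/A).
    return pmf(0) + pmf(1) + pmf(n) + pmf(n - 1)


if __name__ == "__main__":
    for n in (8, 16, 24, 32, 40):
        print(f"N = {n:2d}  P(miss het) = {p_miss_het(n):.3e}")
```

With p_alt = 0.5 the sum reduces to 2 * (N + 1) / 2^N, which is of the order of 1e-8 at N = 32, in line with the figure quoted in the box.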
    8594
    8695== (2) Exploit tagging of same-window reads ==
     
    149158rrrrrrrrrr
    150159
     160}}}
     161
     162Reads in offspring
     163
     164{{{
     165
     166rrCrrArrrr
     167
     168rrCrrArrrr
     169
     170rrCrrArrrr
     171
     172rrCrrArrrr
     173
     174rrCrrArrrr
     175
     176rrCrrArrrr
     177
     178rrrrrrrrrr
     179
     180rrrrrrrrrr
     181
     182rrrrrrrrrr
     183
     184rrrrrrrrrr
     185
     186rrrrrrrrrr
     187
     188rrrrrrrrrr
     189
    151190}}} 
    152191
    153 Reads in offspring
    154 
    155 {{{ 
    156 
    157 rrCrrArrrr
    158 
    159 rrCrrArrrr
    160 
    161 rrCrrArrrr
    162 
    163 rrCrrArrrr
    164 
    165 rrCrrArrrr
    166 
    167 rrCrrArrrr
    168 
    169 rrrrrrrrrr
    170 
    171 rrrrrrrrrr
    172 
    173 rrrrrrrrrr
    174 
    175 rrrrrrrrrr
    176 
    177 rrrrrrrrrr
    178 
    179 rrrrrrrrrr
    180 
    181 }}} 
    182 
    183 To address whether this is realistic scenario under which we can detect de novo mutations, we need to answer the question about probability that, given ‘de novo’ mutation occurs, what is the chance we will see that mutation in at least four reads (it is clear that for ‘de novo’ we must use more stringent calling criteria) and that in at least two of these 4 reads we will also see a heterozygote coming from a parent. Computations estimating this chance are provided below in the Box.
    184 
    185 From these computations, it appears that the chance to see ‘de novo’ in 4 or more reads, and see an existing (transmitted from a parent) variant in at least two of these reads is about 0.09. Thus, using outlined strategy we will be able to detect several de novo mutations per trio offspring, translating to hundreds (or thousands) de novo described from the whole data set. Note that in above we ignored the paired-end nature of our sequencing data, which, when properly accounted for, would probably double the numbers of detectable de novo mutations.
    186 
    187 The probability that we see a ‘de novo’ in at least 4 reads out of 12 is 0.93. The chance that an existing heterozygous site is covered in the same read can be computed assuming the read length of 100, uniform distribution of the read-start position across the genome, and heterozygote probability of 1/300 per site (Kai). Assume the ‘worst’ scenario of exactly 4 reads with ‘de novo’, what is the chance that in at least two of them we will see an existing heterozygote?
    188 
    189 Denoting the ‘de novo’ position in the read as 0, the ‘coverable’ position of a heterozygote may vary from -99 to +99. The chance that a heterozygote at +99 is included in the read is 0.01; if heterozygote is at +1, the chance is 0.99. Thus, for a heterozygote at position ‘j’ (j in -99 to -1 and 1 to 99) the chance to be included in the read is (1-abs(j/100)). We assume that a chance to have a ‘linked’ alternative variant at a position is ½ * 1/300 = 1/600. Thus the probability to detect a ‘linked’ variant in at least two reads out of 4 is:
     192To address whether this is a realistic scenario under which we can detect de novo mutations, we need to answer the following question: given that a ‘de novo’ mutation occurs, what is the chance that we will see that mutation in at least four reads (it is clear that for ‘de novo’ we must use more stringent calling criteria) and that in at least two of these four reads we will also see a heterozygote coming from a parent? Computations estimating this chance are provided below in Box 2.
     193
     194From these computations, it appears that the chance to see a ‘de novo’ in 4 or more reads, and to see an existing variant (transmitted from a parent) in at least two of these reads, is about 0.09. Thus, using the outlined strategy we will be able to detect several de novo mutations per trio offspring, translating to hundreds (or thousands) of de novo mutations described from the whole data set. Note that above we ignored the paired-end nature of our sequencing data, which, when properly accounted for, would probably double the number of detectable de novo mutations. Furthermore, if cross-read phasing works accurately and at longer distances, this will allow us to bring this proportion even higher.
     195
     196{{{
     197Box 2
     198
     199The probability that we see a ‘de novo’ in at least 4 reads out of 12 is 0.93.
     200The chance that an existing heterozygous site is covered in the same read
     201can be computed assuming the read length of 100, uniform distribution of
     202the read-start position across the genome, and heterozygote probability of
     2031/300 per site (Kai). Assume the ‘worst’ scenario of exactly 4 reads with
     204‘de novo’, what is the chance that in at least two of them we will see an
     205existing heterozygote?
     206
     207Denoting the ‘de novo’ position in the read as 0, the ‘coverable’ position
     208of a heterozygote may vary from -99 to +99. The chance that a heterozygote
     209at +99 is included in the read is 0.01; if heterozygote is at +1, the chance is
     2100.99. Thus, for a heterozygote at position ‘j’ (j in -99 to -1 and 1 to 99) the
     211chance to be included in the read is (1-abs(j/100)). We assume that a chance
     212to have a ‘linked’ alternative variant at a position is ½ * 1/300 = 1/600. Thus
     213the probability to detect a ‘linked’ variant in at least two reads out of 4 is:
    190214
    191215P(see variant in >=2 reads)
    192216
    193 = P(variant is at -99) * P(see variant in >=2 reads | variant is at -99) + P(variant is at -98) * P(see variant in >=2 reads | variant is at -98) + … + P(variant is at +99) * P(see variant in >=2 reads | variant is at +99)
    194 
    195 = 1/600 (P(see variant in >=2 reads | variant is at -99) + … P(see variant in >=2 reads | variant is at +99))
    196 
    197 = 1/600 [ 2 * SUM,,j=(1,99),, SUM,,k=2,4,, (1-j/100)^k^ * (j/100)^(4-k)^ ]
     217= P(variant is at -99) * P(see variant in >=2 reads | variant is at -99)
     218   + P(variant is at -98) * P(see variant in >=2 reads | variant is at -98) +
     219   … + P(variant is at +99) * P(see variant in >=2 reads | variant is at +99)
     220
     221= 1/600 (P(see variant in >=2 reads | variant is at -99)
     222   + … P(see variant in >=2 reads | variant is at +99))
     223
     224= 1/600 [ 2 * SUM_{j=(1,99)} SUM_{k=2,4} (1-j/100)^k * (j/100)^(4-k) ]
    198225
    199226Evaluation of this expression gives 
     
    201228P(see variant in >=2 reads) = 0.09
    202229
    203 Thus the joint probability to see ‘de novo’ in >=4 reads and see an established variant transmitted from a parent in at least 2 of these reads is 0.93*0.09 = 0.086.
     230Thus the joint probability to see ‘de novo’ in >=4 reads and see an
     231established variant transmitted from a parent in at least 2 of these reads
     232is 0.93*0.09 = 0.086.
     233}}}
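
The numbers in Box 2 can be checked with the short Python sketch below. It evaluates the expression exactly as written in the box (read length 100, 12x coverage, 4 reads carrying the ‘de novo’, a linked-variant chance of 1/600 per position, and no binomial coefficient inside the inner sum). The names `READ_LEN`, `P_LINKED`, `p_denovo_in_at_least` and `p_linked_in_at_least_two` are illustrative assumptions, not part of the pipeline.

```python
from math import comb

READ_LEN = 100                 # read length assumed in Box 2
P_LINKED = 0.5 * (1 / 300)     # chance of a 'linked' variant at a position


def p_denovo_in_at_least(min_reads=4, coverage=12, p_alt=0.5):
    """Binomial chance that at least min_reads of coverage reads show the de novo allele."""
    return sum(
        comb(coverage, k) * p_alt**k * (1 - p_alt)**(coverage - k)
        for k in range(min_reads, coverage + 1)
    )


def p_linked_in_at_least_two(n_reads=4):
    """Evaluate 1/600 * [2 * SUM_j SUM_k (1-j/100)^k * (j/100)^(4-k)] as written in Box 2."""
    total = 0.0
    for j in range(1, READ_LEN):          # offsets 1..99; factor 2 covers -99..-1
        p_in = 1 - j / READ_LEN           # chance a read includes offset j
        total += sum(
            p_in**k * (1 - p_in)**(n_reads - k)   # no C(n, k) term, as in the box
            for k in range(2, n_reads + 1)
        )
    return P_LINKED * 2 * total


if __name__ == "__main__":
    p_dn = p_denovo_in_at_least()
    p_link = p_linked_in_at_least_two()
    print(f"P(de novo in >=4 of 12 reads) = {p_dn:.3f}")
    print(f"P(see variant in >=2 reads)   = {p_link:.3f}")
    print(f"joint probability             = {p_dn * p_link:.3f}")
```

Evaluated this way, the two factors round to the 0.93 and 0.09 quoted in the box, and their unrounded product rounds to 0.086.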
    204234
    205235== Conclusions ==