Approximate word matches between two random sequences (Q2476396)

From MaRDI portal
scientific article
Language Label Description Also known as
English
Approximate word matches between two random sequences
scientific article

    Statements

    Approximate word matches between two random sequences (English)
    0 references
    0 references
    0 references
    19 March 2008
    0 references
    Let \({\mathcal L}= \{A,G,C,T\}\) be an alphabet with strand-symmetric probability measure \(\xi= \{\xi_A, \xi_G, \xi_C, \xi_T\}\) and perturbation parameter \(\eta\). That is, \(-1\leq\eta\leq 1\) is the unique number satisfying \(\xi_A= \xi_T= (1/4)(1+ \eta)\), \(\xi_C= \xi_G= (1/4)(1- \eta)\). Let \({\mathbf A}= A_1 A_2\cdots A_n\) and \({\mathbf B}= B_1 B_2\cdots B_n\) be two random squences of length \(n\) over \({\mathcal L}\) such that the letters are i.i.d. Let \({\mathbf x}\) and \({\mathbf y}\) be two words of length \(m< n\) and define the distance between \({\mathbf x}\) and \({\mathbf y}\) by \[ \delta({\mathbf x},{\mathbf y})=\text{number of characters mismatches between \({\mathbf x}\) and \({\mathbf y}\)}. \] We say that \({\mathbf x}\) is a \(k\)-neighbor of \({\mathbf y}\) if \(\delta({\mathbf x},{\mathbf y})\leq k< m\). Furthermore, define the statistic \(D^{(k)}_2\) to be the number of \(k\)-neighborhood \(m\)-word matches between the sequences \({\mathbf A}\) and \({\mathbf B}\), including overlaps. In this paper, the authors consider the mean of \(D^{(k)}_2\) and a lower and upper bound for the variance of \(D^{(k)}_2\) and prove among others the following theorem: If \(\eta\neq 0\), \(m= \alpha\log_{1/p_2}(n)+ C\) with \(0\leq\alpha<{1\over 2}\) and \(C\) a constant and \(0\leq k< m\) is fixed, then \[ {D^{(k)}_2- E(D^{(k)}_2)\over \sqrt{\text{Var}(D^{(k)}_2)}}\overset {d}\Rightarrow{\mathcal N}(0,1)\quad\text{as }{n}\infty. \] Here, \(p_2= \sum_{a\in{\mathcal L}}\xi^2_a\). The results are useful in the study of bioinformatics for expressed sequence tag database searches.
    0 references
    DNA sequences
    0 references
    number of \(m\)-letter word matches
    0 references
    central limit theorem
    0 references
    sequence comparison
    0 references
    word matches
    0 references

    Identifiers