A Method for Nucleotide Sequence Analysis
University of Information Technology and Management
ul. H. Sucharskiego 2, 35-225 Rzeszów, Poland
e-mail: bkozarzewski@wsiz.rzeszow.pl
(Received: 22 February 2012; revised: 29 May 2012; accepted: 7 June 2012; published online: 21 June 2012)
DOI: 10.12921/cmst.2012.18.01.5-10
OAI: oai:lib.psnc.pl:420
Abstract:
A decomposition of a symbolic sequence into a set of consecutive, distinct subsequences (mers) is presented. Several statistical distributions of nucleotide subsequences are defined and analysed. Sequence entropy and the similarity between sequences are defined in terms of the mer-length distribution. An alignment-free method of phylogenetic tree construction is proposed.
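The abstract does not give the decomposition rule or the exact entropy and similarity definitions, so the following Python sketch is only an illustration under stated assumptions. It assumes a greedy, Lempel-Ziv-style rule (each mer is the shortest prefix of the remaining sequence that has not been emitted yet), uses the Shannon entropy of the mer-length distribution, and uses total variation distance as a stand-in for the similarity measure. All function names are hypothetical and are not taken from the paper.

```python
from collections import Counter
from math import log2


def decompose_into_mers(sequence):
    """Split `sequence` left to right into consecutive, pairwise distinct mers.

    Assumption: greedy LZ78-style parsing, where each mer is the shortest
    prefix of the remaining text that has not been emitted before.
    """
    seen = set()
    mers = []
    start = 0
    for end in range(1, len(sequence) + 1):
        candidate = sequence[start:end]
        if candidate not in seen:
            seen.add(candidate)
            mers.append(candidate)
            start = end
    # A trailing fragment that repeats an earlier mer is dropped,
    # so the returned mers stay pairwise distinct.
    return mers


def mer_length_distribution(mers):
    """Return the relative frequency of each mer length."""
    counts = Counter(len(m) for m in mers)
    total = len(mers)
    return {length: n / total for length, n in sorted(counts.items())}


def shannon_entropy(distribution):
    """Shannon entropy (in bits) of a discrete distribution."""
    return -sum(p * log2(p) for p in distribution.values() if p > 0)


def length_distribution_distance(dist_a, dist_b):
    """Total variation distance between two mer-length distributions.

    Assumption: a stand-in dissimilarity, not the measure defined in the paper.
    """
    lengths = set(dist_a) | set(dist_b)
    return 0.5 * sum(abs(dist_a.get(k, 0.0) - dist_b.get(k, 0.0)) for k in lengths)


if __name__ == "__main__":
    mers = decompose_into_mers("AACGTACG")
    dist = mer_length_distribution(mers)
    print(mers)
    print(dist, shannon_entropy(dist))
```

A matrix of such pairwise distances between sequences could then be passed to any standard distance-based tree-building procedure; which procedure the paper uses is not stated in the abstract.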