Volume 20 (3) 2014, 93-100

A New Method for Symbolic Sequences Analysis. An Application to Long Sequences

Kozarzewski Bohdan

University of Information Technology and Management
ul. H. Sucharskiego 2, 35-225 Rzeszów, Poland
E-mail: bkozarzewski@wsiz.rzeszow.pl

DOI: 10.12921/cmst.2014.20.03.93-100

Abstract:

A method is proposed for decomposing a symbolic sequence into a set of consecutive, distinct, non-overlapping strings (words) of various lengths. Representing the sequence as a set of words allows one to use notions from set theory. The main result is a new definition of the similarity between any two sequences over a given alphabet; no prior sequence alignment is necessary. Two applications of the set of words are described in the present paper. In the first, the similarity measure is used to prepare centroids for the K-means algorithm, which results in a high-performance grouping method for long DNA sequences. The second concerns the statistical analysis of word attributes. It is shown that the similarity, complexity and correlation function of word attributes across sequences of digits of the fractional parts of some irrational numbers support the suggestion that these sequences are instances of a random sequence of decimal digits.
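
The idea of comparing two sequences through their word sets can be illustrated with a minimal sketch. The abstract does not state the exact parsing rule or the similarity formula, so both are assumptions here: the hypothetical `decompose` function uses an LZ78-style rule (each word is the shortest prefix of the unparsed remainder that has not appeared before), and the hypothetical `dice_similarity` function uses the Dice coefficient of the two word sets. Neither should be read as the paper's own definition.

```python
def decompose(seq: str) -> list[str]:
    """Split seq into consecutive, non-overlapping, pairwise distinct words.

    Assumption: LZ78-style parsing, not necessarily the rule used in the paper.
    """
    words = []
    seen = set()
    start = 0
    for end in range(1, len(seq) + 1):
        candidate = seq[start:end]
        if candidate not in seen:
            # Shortest new word found; record it and start the next word here.
            seen.add(candidate)
            words.append(candidate)
            start = end
    # A trailing fragment that repeats an earlier word is dropped,
    # so the returned words stay pairwise distinct.
    return words


def dice_similarity(a: str, b: str) -> float:
    """Dice coefficient of the word sets of a and b (an assumed measure)."""
    wa, wb = set(decompose(a)), set(decompose(b))
    if not wa and not wb:
        return 1.0  # two empty sequences are treated as identical
    return 2 * len(wa & wb) / (len(wa) + len(wb))


if __name__ == "__main__":
    print(decompose("AACGAACG"))
    print(dice_similarity("AACG", "ACGT"))
```

Because the comparison works on sets of words rather than on positions, the two sequences may differ in length and need no alignment, which is the property the abstract emphasizes.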

Supplementary material:

  • data – irrational numbers
  • data – clustering

Key words:

clustering, DNA sequences, irrational numbers, similarity and distance measures


© 2025 CMST