A Representative Set Method for Symbolic Sequence Clustering
University of Information Technology and Management
ul. H. Sucharskiego 2, 35-225 Rzeszów, Poland
E-mail: bkozarzewski@wsiz.rzeszow.pl
Received: 25 October 2012; revised: 24 February 2013; accepted: 1 March 2013; published online: 22 May 2013
DOI: 10.12921/cmst.2013.19.02.99-105
OAI: oai:lib.psnc.pl:461
Abstract:
Decomposing a sequence into a set of consecutive, distinct subsequences is crucial for symbolic sequence analysis. The decomposition significantly reduces the reference base that must be recorded for later retrieval of the sequence, and it makes possible original similarity and membership measures for sequences. These measures are the starting point for a new algorithm that clusters sequences into groups of similar individuals. Algorithms that use the concept of a representative set achieved relatively good clustering results. In contrast to the representative sets used in other applications, the one introduced here is precisely and uniquely defined.
Key words:
clustering, representative set, similarity and membership measures
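The abstract does not spell out the decomposition or the similarity measure, so the following is only a minimal sketch under stated assumptions: the sequence is parsed left to right into consecutive phrases that have not occurred before (an LZ78-style parsing), and two sequences are compared with a Dice coefficient on their phrase sets. The function names `decompose` and `dice_similarity` are hypothetical and are not taken from the paper.

```python
def decompose(seq):
    """Split seq into consecutive, pairwise distinct phrases.

    Assumption: an LZ78-style parsing. The current phrase grows one
    symbol at a time until it differs from every phrase found so far.
    """
    seen = set()
    phrases = []
    current = ""
    for symbol in seq:
        current += symbol
        if current not in seen:
            seen.add(current)
            phrases.append(current)
            current = ""
    # A trailing fragment that repeats an earlier phrase is dropped,
    # so the returned phrases stay distinct.
    return phrases


def dice_similarity(seq_a, seq_b):
    """Dice coefficient 2|A & B| / (|A| + |B|) on the two phrase sets."""
    set_a = set(decompose(seq_a))
    set_b = set(decompose(seq_b))
    if not set_a and not set_b:
        return 1.0  # two empty sequences are treated as identical
    return 2 * len(set_a & set_b) / (len(set_a) + len(set_b))


if __name__ == "__main__":
    print(decompose("AACGTACG"))
    print(dice_similarity("AACGTACG", "AACGTTCG"))
```

Under these assumptions, a set-based similarity of this kind could serve as input to a clustering step; the paper's own measures and its definition of the representative set may differ.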