Numerical Representation of Symbolic Data
University of Information Technology and Management
ul. H. Sucharskiego 2, 35-225 Rzeszów, Poland
E-mail: bkozarzewski@wsiz.rzeszow.pl
Received: 09 November 2015; revised: 17 December 2015; accepted: 21 December 2015; published online: 29 December 2015
DOI: 10.12921/cmst.2015.21.04.008
Abstract:
A method of direct numerical representation of symbolic data is proposed. The method starts by parsing a sequence into an ordered set (spectrum) of distinct, non-overlapping short strings of symbols (words). Next, the word spectrum is mapped onto a vector with binary components in a high-dimensional linear space. The numerical representation allows some arithmetic operations on symbolic data, among them a meaningful average spectrum of two sequences. As a test, the new numerical representation is used to build centroid vectors for the k-means clustering algorithm, which significantly improved the clustering quality. The advantage over the conventional approach is a high score of correct clustering of several kinds of real character sequences, such as novels, DNA and proteins.
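The pipeline summarized in the abstract can be illustrated with a minimal Python sketch. It rests on several assumptions that the abstract does not confirm: the parsing rule is taken to be a greedy, LZ78-style rule (extend the current word until it is one not seen before), the vector space is indexed by the sorted union of all words, and the average is the plain component-wise mean. All function names are hypothetical and chosen only for illustration; the method described in the paper may differ in each of these points.

```python
from typing import Iterable, List


def parse_distinct_words(seq: str) -> List[str]:
    """Parse seq left to right into distinct, non-overlapping words.

    Assumption: LZ78-style rule. The current word grows symbol by symbol
    and is emitted as soon as it differs from every word emitted so far.
    """
    seen = set()
    words: List[str] = []
    current = ""
    for symbol in seq:
        current += symbol
        if current not in seen:
            seen.add(current)
            words.append(current)
            current = ""
    # A trailing fragment that repeats an earlier word is dropped,
    # so every word in the spectrum stays distinct.
    return words


def build_vocabulary(spectra: Iterable[List[str]]) -> List[str]:
    """Sorted union of all words; each word is one axis of the space."""
    return sorted({word for spectrum in spectra for word in spectrum})


def to_binary_vector(spectrum: List[str], vocabulary: List[str]) -> List[float]:
    """Component is 1.0 if the word occurs in the spectrum, else 0.0."""
    present = set(spectrum)
    return [1.0 if word in present else 0.0 for word in vocabulary]


def average_spectrum(u: List[float], v: List[float]) -> List[float]:
    """Component-wise mean of two vectors (assumed form of the average)."""
    return [(a + b) / 2.0 for a, b in zip(u, v)]


def squared_distance(u: List[float], v: List[float]) -> float:
    """Squared Euclidean distance, as used by standard k-means."""
    return sum((a - b) ** 2 for a, b in zip(u, v))
```

With these definitions, the sequences "ABABBA" and "AABAB" parse into the spectra A, B, AB, BA and A, AB. Over the shared vocabulary their binary vectors can be averaged component by component, and an average of this kind is what a k-means centroid is built from: components shared by both sequences stay at 1, while components present in only one sequence become 0.5.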
Key words:
clustering, distance measure, numerical representation, symbolic sequence