Numerical Representation of Symbolic Data

Kozarzewski Bohdan

doi:10.12921/cmst.2015.21.04.008

University of Information Technology and Management
ul. H. Sucharskiego 2, 35-225 Rzeszów, Poland
E-mail: bkozarzewski@wsiz.rzeszow.pl

Received: 09 November 2015; revised: 17 December 2015; accepted: 21 December 2015; published online: 29 December 2015

Abstract:

A method for the direct numerical representation of symbolic data is proposed. The method starts by parsing a sequence into an ordered set (spectrum) of distinct, non-overlapping short strings of symbols (words). Next, the word spectrum is mapped onto a vector of binary components in a high-dimensional linear space. The numerical representation allows some arithmetic operations on symbolic data, among them a meaningful average spectrum of two sequences. As a test, the new numerical representation is used to build centroid vectors for the k-means clustering algorithm, which significantly enhances the clustering quality. The advantage over the conventional approach is a high rate of correct clustering for several real character sequences, such as novels, DNA and proteins.
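The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the paper's algorithm: the abstract does not state the parsing rule, so the sketch assumes fixed-length words cut at non-overlapping positions. The function names (word_spectrum, binary_vector, mean_vector), the two-letter word length and the DNA alphabet are all hypothetical choices made for illustration.

```python
from itertools import product


def word_spectrum(sequence, word_len=2):
    """Cut the sequence into non-overlapping words of length word_len
    (an assumed parsing rule) and keep each distinct word once,
    in order of first appearance."""
    words = [
        sequence[i:i + word_len]
        for i in range(0, len(sequence) - word_len + 1, word_len)
    ]
    return list(dict.fromkeys(words))  # dict preserves insertion order


def binary_vector(spectrum, vocabulary):
    """Map a spectrum onto a 0/1 vector with one component per vocabulary word."""
    present = set(spectrum)
    return [1 if word in present else 0 for word in vocabulary]


def mean_vector(u, v):
    """Component-wise average of two vectors, usable as a two-member centroid."""
    return [(a + b) / 2 for a, b in zip(u, v)]


# Hypothetical vocabulary: all 16 two-letter words over the DNA alphabet.
alphabet = "ACGT"
vocabulary = ["".join(p) for p in product(alphabet, repeat=2)]

s1 = word_spectrum("ACGTACGG")  # words AC, GT, AC, GG; the repeat is dropped
s2 = word_spectrum("ACTTGGCA")  # words AC, TT, GG, CA
v1 = binary_vector(s1, vocabulary)
v2 = binary_vector(s2, vocabulary)
centroid = mean_vector(v1, v2)
```

In this sketch a centroid component of 1.0 marks a word present in both spectra, 0.5 a word present in only one, and 0.0 a word absent from both, which is one way to read an "average spectrum" of two sequences.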

Key words:

clustering, distance measure, numerical representation, symbolic sequence

Volume 21 (4) 2015, 241-249
