Volume 21 (4) 2015, 241-249

Numerical Representation of Symbolic Data

Kozarzewski Bohdan

University of Information Technology and Management
ul. H. Sucharskiego 2, 35-225 Rzeszów, Poland
E-mail: bkozarzewski@wsiz.rzeszow.pl

Received: 09 November 2015; revised: 17 December 2015; accepted: 21 December 2015; published online: 29 December 2015

DOI:   10.12921/cmst.2015.21.04.008

Abstract:

A method for direct numerical representation of symbolic data is proposed. The method starts by parsing a sequence into an ordered set (spectrum) of distinct, non-overlapping short strings of symbols (words). Next, the word spectrum is mapped onto a vector of binary components in a high-dimensional linear space. The numerical representation allows some arithmetic operations on symbolic data, among them a meaningful average spectrum of two sequences. As a test, the new numerical representation is used to build centroid vectors for the k-means clustering algorithm, which significantly enhances the clustering quality. Its advantage over the conventional approach is a high rate of correct clustering for several real character sequences, such as novels, DNA, and proteins.
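
The pipeline outlined in the abstract can be illustrated with a minimal sketch. The abstract does not state the parsing rule, so the code below assumes an LZ78-style incremental parse as a stand-in for producing distinct, non-overlapping words. All function names are hypothetical, and the component-wise mean is only one plausible reading of the "average spectrum" used for k-means centroids.

```python
from typing import Dict, List, Sequence


def parse_distinct_words(sequence: str) -> List[str]:
    """Split a sequence into distinct, non-overlapping words.

    Assumption: LZ78-style rule (each word is the shortest prefix of the
    remaining text not seen before). The paper's actual rule may differ.
    """
    seen = set()
    words: List[str] = []
    current = ""
    for symbol in sequence:
        current += symbol
        if current not in seen:
            seen.add(current)
            words.append(current)
            current = ""
    # A trailing fragment that repeats an earlier word is dropped,
    # so every word in the spectrum stays distinct.
    return words


def build_vocabulary(spectra: Sequence[Sequence[str]]) -> Dict[str, int]:
    """Assign one vector coordinate to every word seen in any spectrum."""
    vocab: Dict[str, int] = {}
    for spectrum in spectra:
        for word in spectrum:
            vocab.setdefault(word, len(vocab))
    return vocab


def to_binary_vector(spectrum: Sequence[str], vocab: Dict[str, int]) -> List[float]:
    """Map a word spectrum onto a 0/1 vector over the shared vocabulary."""
    vector = [0.0] * len(vocab)
    for word in spectrum:
        vector[vocab[word]] = 1.0
    return vector


def average_vector(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Component-wise mean, usable as a k-means centroid (assumed reading)."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def squared_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Squared Euclidean distance between two representation vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


spectra = [parse_distinct_words("abababab"), parse_distinct_words("aabb")]
vocab = build_vocabulary(spectra)
vectors = [to_binary_vector(s, vocab) for s in spectra]
centroid = average_vector(vectors)
```

Because both sequences are embedded in the same coordinate system, the centroid is an ordinary vector and can be compared with either sequence by `squared_distance`, which is what a k-means update step requires.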

Key words:

clustering, distance measure, numerical representation, symbolic sequence


© 2025 CMST