
Volume 24 (1) 2018, 21-27

Korpusomat – a Tool for Creating Searchable Morphosyntactically Tagged Corpora

Kieraś Witold, Kobyliński Łukasz, Ogrodniczuk Maciej

Institute of Computer Science, Polish Academy of Sciences
Jana Kazimierza 5, 01-248 Warszawa
E-mail: w.kieras@ipipan.waw.pl, l.kobylinski@ipipan.waw.pl, m.ogrodniczuk@ipipan.waw.pl

Received: 21 March 2017; revised: 30 November 2017; accepted: 16 January 2018; published online: 31 March 2018

DOI: 10.12921/cmst.2018.0000005

Abstract:

The paper presents Korpusomat, a web application for building annotated corpora for corpus linguistic studies. Korpusomat combines existing tools, such as a morphological analyser, a tagger and a corpus search engine, and provides an easy-to-use environment for building corpora technically compatible with the National Corpus of Polish from almost any text, including texts in binary formats. In the paper we present the current state of the project, its features and functionalities, as well as future plans and development tasks. A usage example is also presented.

Key words:

corpus linguistics, corpus management, language corpora

