Korpusomat – a Tool for Creating Searchable Morphosyntactically Tagged Corpora
Kieraś Witold, Kobyliński Łukasz, Ogrodniczuk Maciej
Institute of Computer Science, Polish Academy of Sciences
Jana Kazimierza 5, 01-248 Warszawa
E-mail: w.kieras@ipipan.waw.pl, l.kobylinski@ipipan.waw.pl, m.ogrodniczuk@ipipan.waw.pl
Received: 21 March 2017; revised: 30 November 2017; accepted: 16 January 2018; published online: 31 March 2018
DOI: 10.12921/cmst.2018.0000005
Abstract:
The paper presents Korpusomat, a web application for building annotated corpora for corpus linguistic studies. Korpusomat combines existing tools, such as a morphological analyser, a tagger and a corpus search engine, and provides an easy-to-use environment for building corpora technically compatible with the National Corpus of Polish from almost any text, including texts in binary formats. We present the current state of the project, its features and functionalities, as well as future plans and development tasks. A usage example is also presented.
Key words:
References:
[1] A. Przepiórkowski, M. Bańko, R.L. Górski, B. Lewandowska-Tomaszczyk, eds, Narodowy Korpus Języka Polskiego [Eng.: National Corpus of Polish], Wydawnictwo Naukowe PWN, Warsaw, 2012.
[2] A. Kilgarriff et al., The Sketch Engine: ten years on, Lexicography, pages 1–30, (2014).
[3] M. Woliński, Morfeusz Reloaded, [In:] N. Calzolari, K. Choukri, T. Declerck, H. Loftsson, B. Maegaard, J. Mariani, A. Moreno, J. Odijk, S. Piperidis, eds, Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC 2014), pages 1106–1111, Reykjavík, Iceland, 2014, European Language Resources Association.
[4] W. Kieraś, Co jest zgodne z duchem kraftu? Próba korpusowego badania słownictwa związanego z piwem, Język Polski, (2017), in print.
[5] M. Woliński, W. Kieraś, The On-Line Version of Grammatical Dictionary of Polish, [In:] N. Calzolari, K. Choukri, T. Declerck, H. Loftsson, B. Maegaard, J. Mariani, A. Moreno, J. Odijk, S. Piperidis, eds, Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016), pages 2589–2594, Portorož, Slovenia, 2016, European Language Resources Association.
[6] J. Waszczuk, Harnessing the CRF complexity with domain-specific constraints. The case of morphosyntactic tagging of a highly inflected language, [In:] Proceedings of the 24th International Conference on Computational Linguistics (COLING 2012), pages 2789–2804, Mumbai, India, 2012.
[7] D. Janus, A. Przepiórkowski, POLIQARP 1.0: Some technical aspects of a linguistic search engine for large corpora, [In:] J. Waliński, K. Kredens, S. Goźdź-Roszkowski, eds, Proceedings of Practical Applications in Language and Computers Conference (PALC 2005), Frankfurt am Main, 2006, Peter Lang.
[8] A. Przepiórkowski, Z. Krynicki, Ł. Dębowski, M. Woliński, D. Janus, P. Bański, A Search Tool for Corpora with Positional Tagsets and Ambiguities, [In:] Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC 2004), pages 1235–1238, 2004.
[9] A. Przepiórkowski, M. Woliński, The Unbearable Lightness of Tagging: A Case Study in Morphosyntactic Tagging of Polish, [In:] Proceedings of the 4th International Workshop on Linguistically Interpreted Corpora (LINC 2003), pages 109–116, 2003.
[10] A. Przepiórkowski, The IPI PAN Corpus in Numbers, [In:] Z. Vetulani, ed., Proceedings of the 2nd Language & Technology Conference: Human Language Technologies as a Challenge for Computer Science and Linguistics (LTC 2005), pages 27–31, Poznań, Poland, 2005.
[11] Ł. Kobyliński, PoliTa: A multitagger for Polish, [In:] N. Calzolari, K. Choukri, T. Declerck, H. Loftsson, B. Maegaard, J. Mariani, A. Moreno, J. Odijk, S. Piperidis, eds, Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC 2014), pages 2949–2954, Reykjavík, Iceland, 2014, European Language Resources Association.
[12] A. Radziszewski, T. Śniatowski, Maca – a configurable tool to integrate Polish morphological data, [In:] F. Sánchez-Martínez, J.A. Pérez-Ortiz, eds, Proceedings of the 2nd International Workshop on Free/Open-Source Rule-Based Machine Translation, pages 29–36, 2011.
[13] J.S. Bień, Efficient Search in Hidden Text of Large DjVu Documents, [In:] R. Bernardi, S. Chambers, B. Gottfried, F. Segond, I. Zaihrayeu, eds, Advanced Language Technologies for Digital Libraries (NLP4DL 2009), volume 6699 of Lecture Notes in Computer Science, pages 1–14, Springer, 2009.
[14] M. Łaziński, Korpusy w programach badawczych i dydaktyce Instytutu Języka Polskiego Uniwersytetu Warszawskiego, [In:] I. Bundza, J. Kowalewski, A. Kravčuk, O. Slivinskij, eds, Język polski i polonistyka w Europie wschodniej: przeszłość i współczesność: praca zbiorowa z okazji dziesięciolecia Katedry Filologii Polskiej Narodowego Uniwersytetu Lwowskiego im. Iwana Franki, pages 584–585, Fìrma ÌNKOS, Kijów, 2015.
[15] M. Ogrodniczuk, The Polish Sejm Corpus, [In:] N. Calzolari, K. Choukri, T. Declerck, M.U. Doğan, B. Maegaard, J. Mariani, A. Moreno, J. Odijk, S. Piperidis, eds, Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC 2012), pages 2219–2223, Istanbul, Turkey, 2012, European Language Resources Association.