
Volume 27 (2) 2021, 41–55

Automatic Speech Recognition and its Application to Media Monitoring

Robert Cecko, Jerzy Jamroży, Waldemar Jęśko, Ewa Kuśmierek *, Marek Lange, Mariusz Owsianny

Poznan Supercomputing and Networking Center
ul. Jana Pawła II 10
61-139 Poznan, Poland
*E-mail: ewa.kusmierek@man.poznan.pl

Received: 21 May 2021; revised: 22 June 2021; accepted: 24 June 2021; published online: 29 June 2021

DOI: 10.12921/cmst.2021.0000015

Abstract:

In this paper we present an application of automatic speech recognition (ASR) technology in the area of media monitoring. We describe the computational models and methods behind two approaches that were crucial in the development of ASR: Hidden Markov Models combined with Gaussian Mixture Models (HMM-GMM) and Deep Neural Networks (DNN). Both approaches were implemented in our ARM-1 speech recognition engine developed for the Polish language. We provide details on the implementation choices, in particular the adjustments made for the media monitoring application, guided by the characteristics of media content. Finally, the performance of both versions of our engine is evaluated and compared.
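The evaluation mentioned above is conventionally reported as word error rate (WER), the standard ASR metric: the Levenshtein edit distance between the reference and hypothesis word sequences, divided by the reference length. The following is an illustrative sketch of that computation, not code from the paper:

```python
def wer(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words.
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # delete all remaining reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # insert all remaining hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # match / substitution
    return dp[len(ref)][len(hyp)] / len(ref)
```

For example, a four-word reference with one substituted word yields a WER of 0.25. A lower WER means a more accurate transcription; comparing the HMM-GMM and DNN versions of an engine on the same test material is typically done with exactly this measure.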

Key words:

automatic speech recognition, machine learning, media monitoring, neural networks, signal processing

