SI TEDx-UM Speech Database - Inštitut za elektroniko in telekomunikacije

Authors: Andrej Zgank, Darinka Verdonik, Mirjam Sepesy Maucec

SI TEDx-UM is a new Slovenian spoken language resource built from TEDx Talks. The speech database contains 242 talks with a total duration of 54 hours. The annotations and transcriptions of the acquired spoken material were generated automatically, by applying acoustic segmentation and automatic speech recognition based on the UMB Broadcast News speech recognizer. The development and evaluation subset was also manually transcribed, following the guidelines specified for the Slovenian GOS corpus. The SI TEDx-UM speech database can be used in various areas of speech technology (e.g., automatic speech recognition and speech-to-speech translation).

License: Creative Commons 3.0

Please cite the following reference if you use the SI TEDx-UM speech database for your research:

A. Zgank, D. Verdonik, M.S. Maucec, “The SI TEDx-UM Speech Database: a new Slovenian Spoken Language Resource”, in Proc. LREC 2016, Portoroz, Slovenia, May 2016.
Links:
– training set, speech recordings, part 1.
– training set, speech recordings, part 2.
– development and evaluation set, speech recordings.
– unsupervised transcriptions, v1 (23052016).
– manual transcriptions, v1 (23052016).