Speech and Audio Technology Laboratory
Recommended Paper
GigaSpeech: An evolving, multi-domain ASR corpus with 10,000 hours of transcribed audio
- DOI: 10.21437/Interspeech.2021-1965
- Venue: Interspeech
- Abstract: This paper introduces GigaSpeech, an evolving, multi-domain English speech recognition corpus with 10,000 hours of high quality labeled audio suitable for supervised training, and 33,000 hours of total audio suitable for semi-supervised and unsupervised training. Around 33,000 hours of transcribed audio is first collected from audiobooks, podcasts and YouTube, covering both read and spontaneous speaking styles, and a variety of topics, such as arts, science, sports, etc. A new forced alignment and segmentation pipeline is proposed to create sentence segments suitable for speech recognition training, and to filter out segments with low-quality transcription. For system training, GigaSpeech provides five subsets of different sizes, 10h, 250h, 1000h, 2500h, and 10000h. For our 10,000-hour XL training subset, we cap the word error rate at 4% during the filtering/validation stage, and for all our other smaller training subsets, we cap it at 0%. The DEV and TEST evaluation sets, on the other hand, are re-processed by professional human transcribers to ensure high transcription quality. Baseline systems are provided for popular speech recognition toolkits, namely Athena, ESPnet, Kaldi and Pika.
- Co-authors: Chao Weng, Dan Su, Daniel Povey, Jan Trmal, Junbo Zhang, Mingjie Jin, Sanjeev Khudanpur, Shinji Watanabe, Shuaijiang Zhao, Wei Zou, Xiangang Li, Xuchen Yao, Yongqing Wang, Yujun Wang, Zhao You, Zhiyong Yan
- First authors: Guoguo Chen, Shuzhou Chai, Guanbo Wang, Jiayu Du, Wei-Qiang Zhang
- Paper type: Conference paper
- Translated work: No
- Publication date: 2021-08-30
- Publication link: https://www.isca-speech.org/archive/interspeech_2021/chen21o_interspeech.html