Details
Title: HMM-based Speech Synthesis with Multiple Individual Voices using Exemplar-based Voice Conversion
Field: Informatics
Author: Trung-Nghia Phung
Publisher / Journal:
Year: 2017
ISSN/ISBN:
Abstract:

Traditional text-to-speech (TTS) systems can synthesize only a single individual voice. To synthesize other individual voices, the system has to be trained again on the new voices, and this training normally requires a huge amount of data that is usually available only for a few specific voices in the database.
The state-of-the-art TTS approach using Hidden Markov Models (HMMs), called HMM-based TTS, can synthesize speech with various voice personality characteristics by using speaker adaptation methods. However, the voices synthesized and adapted by HMM-based TTS are both "over-smooth"; when these voices are over-smooth, the detailed structures clearly linked to speaker individuality may be missing. Multiple voices can also be synthesized by combining voice conversion (VC) methods with HMM-based TTS. However, current voice conversion methods still cannot synthesize target speech that preserves the detailed information related to the speaker individuality of the target voice while using only a limited amount of target data. In this paper, we proposed combining exemplar-based voice conversion with HMM-based TTS to synthesize multiple high-quality individual voices from a small amount of target data. The evaluation results on the English CSTR data corpus confirmed the advantages of the proposed method.

Attachments: