Details
Title: Multiple emotional voice conversion in Vietnamese HMM-based speech synthesis using non-negative matrix factorization
Field: Computer Science
Author: Trung-Nghia Phung
Publisher / Journal:    Year: 2017
ISSN/ISBN: 2313-626X
Abstract:

Most current text-to-speech (TTS) systems can synthesize only a single voice
with a neutral emotion. If different emotional voices are required, the system
has to be retrained with the new emotional voices, and training normally
requires a huge amount of emotional speech data, which is usually impractical.
The state-of-the-art TTS using Hidden Markov Models (HMMs), called HMM-based
TTS, can synthesize speech with various emotions by using speaker adaptation
methods. However, emotional voices both synthesized and adapted by HMM-based
TTS are "over-smooth": when the voices are over-smooth, the detailed
structures closely linked to speaker emotions may be missing. Multiple voices
can also be synthesized by combining voice conversion (VC) methods with
HMM-based TTS, but current voice conversion methods still cannot synthesize
target speech that preserves the detailed information related to the speaker
emotions of the target voice while using only a limited amount of target-voice
data. In this paper, we propose combining exemplar-based emotional voice
conversion with HMM-based TTS to synthesize multiple high-quality emotional
voices from a small amount of target data. Evaluation results on a Vietnamese
emotional speech corpus confirm the merits of the proposed method.
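The core idea of exemplar-based voice conversion with non-negative matrix factorization can be illustrated with a minimal sketch: a source utterance is decomposed as a non-negative combination of frame-aligned source exemplars, and the same activations are then applied to the paired target exemplars. The sketch below is a simplified illustration with toy random data, not the paper's implementation; the dictionary sizes, iteration count, and variable names are assumptions for demonstration only.

```python
import numpy as np

def estimate_activations(X, D, n_iter=200, eps=1e-10):
    """Estimate non-negative activations H such that X ≈ D @ H,
    using Lee-Seung multiplicative updates with the dictionary D fixed."""
    rng = np.random.default_rng(0)
    H = rng.random((D.shape[1], X.shape[1])) + eps
    for _ in range(n_iter):
        # Multiplicative update minimizing the Frobenius reconstruction error.
        H *= (D.T @ X) / (D.T @ (D @ H) + eps)
    return H

# Toy paired exemplar dictionaries (columns are frame-aligned spectral frames).
rng = np.random.default_rng(1)
D_src = rng.random((20, 8))   # exemplars from the source (e.g. neutral) voice
D_tgt = rng.random((20, 8))   # paired exemplars from the target emotional voice

# A synthetic "source" utterance built from the source exemplars.
H_true = rng.random((8, 30))
X_src = D_src @ H_true

# Conversion: estimate activations on the source dictionary,
# then reuse them with the target dictionary.
H = estimate_activations(X_src, D_src)
X_converted = D_tgt @ H
```

Because the activations select which exemplars are active at each frame, reusing them with the target dictionary transfers the frame-level detail of the target exemplars, which is why this family of methods can avoid the over-smoothing of purely statistical conversion.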