Efficient Deep Learning-Based Speech Synthesis Model Study for TTS
Abstract
Deep learning-based TTS technology produces speech that closely resembles human speech. Such a TTS system receives text, converts it into a mel spectrogram with a Text-to-Mel model, and then turns the spectrogram into speech with a vocoder. However, many different Text-to-Mel and vocoder models exist, so a method is needed to analyze and combine them to determine which combinations produce good sound quality. Therefore, in this paper, in order to convert English text to speech, the 24-hour LJ Speech dataset, in which a single speaker reads seven books, was used. This dataset was used to train the Text-to-Mel models and the vocoder models. Each trained Text-to-Mel model was combined with each vocoder model, and the performance of the combinations was compared to identify the best and the worst combination.
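The two-stage pipeline and the pairwise comparison described above can be sketched as follows. This is a minimal illustration only: the model names, the lambda stand-ins, and the `score` function are hypothetical placeholders, not the paper's actual trained models or evaluation metric (which would be trained networks and a listening-test measure such as MOS).

```python
from itertools import product

def synthesize(text, text_to_mel, vocoder):
    """Run the two-stage TTS pipeline: text -> mel spectrogram -> waveform."""
    mel = text_to_mel(text)   # stage 1: Text-to-Mel
    return vocoder(mel)       # stage 2: vocoder

# Hypothetical stand-ins for trained models; real ones return tensors/audio.
text_to_mel_models = {
    "model_A": lambda text: f"melA({text})",
    "model_B": lambda text: f"melB({text})",
}
vocoders = {
    "vocoder_X": lambda mel: f"wavX({mel})",
    "vocoder_Y": lambda mel: f"wavY({mel})",
}

def score(audio):
    """Placeholder quality metric; the paper would use a perceptual measure."""
    return len(audio)

# Evaluate every Text-to-Mel x vocoder combination, as in the abstract.
results = {}
for (tm_name, tm), (voc_name, voc) in product(
        text_to_mel_models.items(), vocoders.items()):
    audio = synthesize("hello", tm, voc)
    results[(tm_name, voc_name)] = score(audio)

best = max(results, key=results.get)
worst = min(results, key=results.get)
```

The key design point is that the two stages communicate only through the mel spectrogram, which is why any Text-to-Mel model can in principle be paired with any vocoder.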