dc.description.abstracten |
Nowadays, the synthesis of human images and videos is arguably one of the most popular topics in the data science community. The synthesis of human speech is less trendy but closely related to that topic. Since Google researchers published the WaveNet paper in 2016, the state of the art has shifted from parametric and concatenative systems to deep learning models. Most work in the area focuses on improving the intelligibility and naturalness of the generated speech. However, almost every significant study also mentions ways to generate speech in the voices of different speakers. Usually, such an enhancement requires re-training the model to generate audio in the voice of a speaker that was not present in the training set. Additionally, studies focused on highly modular speech generation are rare. Therefore, there is room left for research on ways to add new parameters for other aspects of speech, such as sentiment, prosody, and melody. In this work, we aimed to implement a competitive text-to-speech solution with the ability to specify the speaker without model re-training, and to explore possibilities for adding emotions to the generated speech. Our approach generates good-quality speech with a mean opinion score of 3.78 (out of 5) and the ability to mimic a speaker's voice in real time, a significant improvement over the baseline, which obtains only 2.08. On top of that, we researched possibilities for representing sentiment. We built an emotion classifier that performs on the level of current state-of-the-art solutions, achieving an accuracy of more than 80%. |
dc.description.abstractuk |