Improving the naturalness of concatenative Vietnamese speech synthesis under limited data conditions

Phung Trung Nghia, Luong Chi Mai, Masato Akagi


Building a large speech corpus is costly and time-consuming. How to build high-quality speech synthesis under limited data conditions is therefore an important issue, specifically for under-resourced languages such as Vietnamese. Because concatenative speech synthesis (CSS) is currently the most natural-sounding approach, it was the target of this research. CSS requires all possible units of a specific phonetic unit set. This requirement may be easy to meet for non-tonal languages, in which the number of units in a set such as phonemes is relatively small. In tonal languages, however, the number of tonal phonetic units is significant, and it is difficult to design a small corpus covering all of them. Additionally, because all context-dependent phonetic units are required to ensure the naturalness of corpus-based CSS, it needs a large database, up to dozens of gigabytes in size, for concatenation. The motivation for this work is therefore to improve the naturalness of CSS under limited data conditions, and we addressed both of these problems. First, we attempted to reduce the number of tonal units required for CSS of tonal languages by using a method of tone transformation. Second, we attempted to reduce mismatch-context errors in concatenation regions so that CSS remains usable when matching-context units cannot be found in the database. Temporal Decomposition (TD), an interpolation method that decomposes a spectral or prosodic sequence into sparse event targets and corresponding temporal event functions, was used for both tasks. Previous studies have shown that TD can be used efficiently for spectral transformation. Therefore, a TD-based transformation of fundamental frequency (F0) contours, which represent the lexical tones in tonal languages, is proposed. The concept of TD is also close to that of co-articulation of speech, which is related to the contextual effect in CSS. Therefore, TD is also used to model, select, and modify co-articulated transition regions to reduce mismatch-context errors. Experimental results obtained from a small Vietnamese corpus demonstrated that the proposed lexical tone transformation was able to transform lexical tones, and that the proposed method of reducing mismatch-context errors in CSS of the general language was efficient. As a result, the two proposed methods are useful for improving the naturalness of Vietnamese CSS under limited data conditions.
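
The abstract describes TD as decomposing a sequence into sparse event targets and temporal event functions. The following is a minimal sketch of that idea for an F0 contour, not the authors' implementation. It assumes a restricted form in which only two adjacent event functions overlap and they sum to one at every frame, and it uses triangular (linear) event functions purely for illustration. All names (`event_functions`, `reconstruct`), the event positions, and the F0 values in Hz are hypothetical.

```python
# Illustrative sketch only: the contour is modelled as
#   y(n) ~= sum_k a_k * phi_k(n)
# where a_k are sparse event targets and phi_k are event functions.
# Assumption: adjacent event functions overlap pairwise and sum to one.

def event_functions(event_frames, n_frames):
    """Build triangular event functions phi[k][n] centred on event_frames."""
    n_events = len(event_frames)
    phi = [[0.0] * n_frames for _ in range(n_events)]
    for n in range(n_frames):
        if n <= event_frames[0]:
            phi[0][n] = 1.0          # before the first event
        elif n >= event_frames[-1]:
            phi[-1][n] = 1.0         # after the last event
        else:
            for k in range(n_events - 1):
                left, right = event_frames[k], event_frames[k + 1]
                if left <= n < right:
                    w = (n - left) / (right - left)
                    phi[k][n] = 1.0 - w      # outgoing event fades out
                    phi[k + 1][n] = w        # incoming event fades in
                    break
    return phi

def reconstruct(targets, phi):
    """Interpolate a contour from event targets and event functions."""
    n_frames = len(phi[0])
    return [sum(a * row[n] for a, row in zip(targets, phi))
            for n in range(n_frames)]

# Hypothetical example: three events over 21 frames.
event_frames = [0, 10, 20]
phi = event_functions(event_frames, 21)

level_targets = [120.0, 120.0, 120.0]    # made-up F0 targets (Hz), level shape
rising_targets = [110.0, 120.0, 150.0]   # made-up F0 targets (Hz), rising shape

level = reconstruct(level_targets, phi)
# Changing the tone shape here means swapping the targets while
# reusing the same event functions (the temporal structure).
rising = reconstruct(rising_targets, phi)
```

In this sketch the temporal structure stays fixed and only the sparse targets change, which is one simple way to picture why a target-based representation is convenient for transforming F0 contours.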


Concatenative speech synthesis, temporal decomposition, co-articulation, tone transformation, limited data, Vietnamese speech



Journal of Computer Science and Cybernetics ISSN: 1813-9663

Published by Vietnam Academy of Science and Technology