Data-driven methods for TTS database creation



Creating databases for concatenative TTS systems is currently labor-intensive and time-consuming. The work falls into two phases: recording nonsense words designed to cover the sound combinations that occur in synthesis, and analyzing and segmenting the audio. The recording phase can be skipped by using existing databases of recorded natural speech, but at the cost of greatly increased time and effort to create the diphone segments for a database. This project investigates how much of the analysis and segmentation can be automated, with the aim of enabling the creation of a large database of natural-speech diphones. The ability to use an extensive body of source audio brings with it the possibility of having several instances of a diphone to choose from at synthesis time. There are also plans to investigate means of choosing among these instances, in order to select the best fit to a given combination of contextual factors. A further possibility for improving TTS output is to introduce longer segments of audio into the database, thus bypassing the problem of acoustic artifacts introduced at concatenation boundaries. These longer segments are not intended to replace diphones, but to improve the quality of diphone combinations that occur frequently or that suffer from persistently poor quality. A final investigation concerns extending our existing male database to handle diphones that do not occur in Danish, specifically those occurring in English pronunciation which have no close Danish approximations. Unrestricted Danish text frequently contains English loanwords, and native Danish speakers are usually capable of approximating English pronunciation. A Danish TTS system must be capable of reflecting this. (Ove Andersen, Charles Hoequist, TDC Research Fund)
