As storage has become cheap, concatenative speech synthesis has moved away from small, fixed-size inventories toward large corpus-based systems whose databases contain multiple instances of most or all units. The motivation is to minimize runtime modification of the database units, which small-inventory concatenation requires both to smooth discontinuities at unit boundaries and to modify prosodic features of the stored signals. No matter how sophisticated the signal processing, the more a token is modified away from its original stored form, the more its naturalness, and perhaps its intelligibility, suffers.

The first question the study addresses is token and type coverage across two audio books. It was found that robust type coverage is difficult to achieve, but that poor type coverage does not necessarily imply similarly poor token coverage, because the missing types occur only infrequently.

The second question is the segmentation of large (5-10 hours) audio books. Each book is originally organized as one waveform file per chapter plus a single text file containing the manuscript of the complete book. To be useful for corpus-based speech synthesis, the books must ultimately be segmented at the phonetic/phonemic level, so an automatic two-step procedure has been devised. First, the audio and the manuscript are aligned and segmented into utterances by means of hidden Markov models (HMMs). Second, the HMMs are retrained and used for forced alignment at the phonemic level. Comprehensive testing shows that the utterance segmentation is very robust, while the phonemic segmentation is comparable to human performance (80%-90% of the boundaries within +/- 20 ms).
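The two measures mentioned above, type versus token coverage and the share of boundaries within a tolerance, can be illustrated with a minimal sketch. It is not the study's implementation: the function names `coverage` and `boundary_agreement`, the diphone-like labels, and the boundary times are all hypothetical, and the sketch assumes that automatic and manual boundaries are already paired one to one.

```python
from collections import Counter
from typing import Iterable, Sequence, Tuple


def coverage(inventory: Iterable[str], target: Sequence[str]) -> Tuple[float, float]:
    """Return (type_coverage, token_coverage) of `target` by `inventory`.

    Type coverage: share of distinct units in `target` found in `inventory`.
    Token coverage: share of unit occurrences in `target` whose type is found.
    """
    known = set(inventory)
    counts = Counter(target)
    if not counts:
        raise ValueError("target must contain at least one unit")
    covered_types = sum(1 for unit in counts if unit in known)
    covered_tokens = sum(n for unit, n in counts.items() if unit in known)
    return covered_types / len(counts), covered_tokens / len(target)


def boundary_agreement(auto: Sequence[float], manual: Sequence[float],
                       tol: float = 0.020) -> float:
    """Share of automatic boundaries within `tol` seconds of the manual ones.

    Assumes (hypothetically) that both sequences are aligned pairwise.
    """
    if len(auto) != len(manual) or not auto:
        raise ValueError("need two non-empty boundary lists of equal length")
    hits = sum(1 for a, m in zip(auto, manual) if abs(a - m) <= tol)
    return hits / len(auto)


# Toy data (invented): the two missing types each occur only once.
inventory_units = ["a-b", "b-c", "c-d"]
target_units = ["a-b"] * 5 + ["b-c"] * 3 + ["x-y", "y-z"]
type_cov, token_cov = coverage(inventory_units, target_units)
print(type_cov, token_cov)

# Toy boundary times in seconds (invented).
auto_bounds = [0.100, 0.250, 0.430]
manual_bounds = [0.105, 0.262, 0.480]
print(boundary_agreement(auto_bounds, manual_bounds))
```

In the toy data, half of the target types are missing from the inventory, yet only two of the ten tokens are affected, which mirrors the observation that rare missing types hurt token coverage far less than type coverage.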
Effective start/end date: 19/05/2010 → 31/12/2017