Evaluation of the quality of Estonian text-to-speech synthesis and a diphone corrector for the TTS system*
Meelis Mihkla, Einar Meister, Indrek Kiissel, Jürgen Lasn
Abstract
The main tasks of the Estonian text-to-speech synthesis project have in principle now been fulfilled: an Estonian diphone database has been created, and the linguistic processing of the text and prosody modelling have been realised. Planning further developments required an interim evaluation of the present state of the synthesis with regard to the intelligibility, smoothness and naturalness of the synthesised speech. Speech intelligibility depends to a great extent on the selection of speech units and their segmental quality. We used the Esprit/SAM test. Part of the test material was generated as VCV, VC and CV words, using 17 Estonian consonants in the environment of the extreme vowels a, i and u. The other set of stimuli was made up of the most frequent VCV, VC and CV combinations occurring in the Estonian language. To improve the smoothness of synthetic speech it seems reasonable to combine some words in the sentence into prosodic compounds. These unusual compounds will inevitably produce some unknown diphones. The same problem occurs in the pronunciation of foreign words and names. Therefore we need a diphone corrector. We also discuss future developments of the Estonian TTS synthesizer.
Keywords: Estonian TTS, quality evaluation, SAM test, diphone corrector
- INTRODUCTION
The aim of our text-to-speech synthesis project is to convert Estonian written text into orthoepically correct and natural-sounding spoken text for a wide range of practical applications. This is a joint project of the Institute of the Estonian Language, the Institute of Cybernetics and OÜ Filosoft.
We use a diphone synthesis method. The advantage of this approach over other systems is that coarticulatory transitions, which are difficult to control by rules, are naturally preserved without loss in diphones cut from real speech. The Estonian diphone database consists of 1720 diphones (for our model of text-to-speech compilative synthesis and for diphone segmentation, see Mihkla, Eek, Meister 1999a, 1999b).
We have also joined the international speech synthesis project initiated by the University of Mons in Belgium. This project enables us to use the Mons MBROLA synthesizer (Dutoit 1997) for concatenating diphones, matching them with each other, and changing the duration and fundamental frequency of sounds.
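For illustration, the sketch below shows one way such input could be assembled in the MBROLA .pho format (each line gives a phoneme, its duration in milliseconds and optional pitch points); the phoneme symbols and values are invented for the example and are not the actual labels of the Estonian diphone database.

```python
# A minimal sketch of preparing MBROLA-style input; symbols and values are illustrative.
def to_pho(segments):
    """segments: list of (phoneme, duration_ms, [(percent_of_duration, f0_hz), ...])."""
    lines = []
    for phoneme, dur, pitch_points in segments:
        points = " ".join(f"{pos} {f0}" for pos, f0 in pitch_points)
        lines.append(f"{phoneme} {dur} {points}".rstrip())
    return "\n".join(lines)

# An illustrative utterance (not real Estonian diphone labels).
example = [("_", 100, []),
           ("t", 80, [(50, 120)]),
           ("e", 120, [(0, 118), (100, 110)]),
           ("r", 70, []),
           ("e", 150, [(100, 100)]),
           ("_", 200, [])]
print(to_pho(example))
```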
Estonian orthography is not phonetic. Some essential phonological oppositions, as well as phonologically non-relevant phonetic facts (which are nevertheless important from the point of view of orthoepy and speech naturalness), are not revealed in the written form of Estonian (Mihkla, Eek, Meister 1998). It is therefore necessary to add diacritics to the orthography (e.g. for differentiating quantity degrees and marking palatalization). This problem has been solved, though only partially, by automatic morphological analysis of the text using rules of text labelling. This block of the synthesizer has been compiled by language technologists (H.-J. Kaalep, T. Vaino) from OÜ Filosoft (see http://www.filosoft.ee). Besides the morphological analysis, the block contains a set of statistical morphosyntactic rules to improve the accuracy of the analysis.
The main tasks of the project have in principle now been fulfilled and a beta version of the Estonian speech synthesizer is available from the WWW homepage (see http://www.eki.ee/keeletehnoloogia/).
This does not mean, however, that the text-to-speech system is complete and the result satisfies both users and developers. Estonian text-to-speech synthesis is still an open problem and a field of active research. Planning further developments required an interim evaluation of the present state of the synthesis with regard to the intelligibility and naturalness of the synthesised speech.
- TESTING PROCEDURE
Speech intelligibility depends to a great extent on the selection of speech units and their segmental quality. There are many tests for estimating the quality of speech synthesis. We used the Esprit/SAM test - Multi-lingual Speech Input/Output Assessment, Methodology and Standardization (Fourcin, Harland, Barry, Hazan 1989). This test has been proposed as a standard segmental test for European languages. It is a nonsense word test, combining VCV (vowel-consonant-vowel), VC (vowel-consonant) and CV (consonant-vowel) words. Part of the test material was generated as VCV, VC and CV words, using 17 Estonian consonants in the environment of the extreme Estonian vowels a, i and u. The other set of stimuli was made up of the most frequent VCV, VC and CV combinations occurring in the Estonian language.
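As an illustration of how such test material can be generated, the following sketch builds VCV, VC and CV nonsense words and distributes them over stimulus lists; the consonant inventory shown is only indicative, and the list division is a simplification of the actual procedure.

```python
import random

# Illustrative inventories; the real test used 17 Estonian consonants and the extreme vowels a, i, u.
CONSONANTS = ["p", "t", "k", "b", "d", "g", "s", "h", "f", "v",
              "m", "n", "l", "r", "j", "š", "s'"]
EXTREME_VOWELS = ["a", "i", "u"]

def make_stimuli():
    """Build the full set of VCV, VC and CV nonsense words."""
    stimuli = []
    for c in CONSONANTS:
        for v1 in EXTREME_VOWELS:
            stimuli.append(v1 + c)              # VC
            stimuli.append(c + v1)              # CV
            for v2 in EXTREME_VOWELS:
                stimuli.append(v1 + c + v2)     # VCV
    return stimuli

def make_lists(stimuli, n_lists=12):
    """Shuffle the stimuli and distribute them over the test lists."""
    random.shuffle(stimuli)
    return [stimuli[i::n_lists] for i in range(n_lists)]

if __name__ == "__main__":
    lists = make_lists(make_stimuli())
    print(len(lists), "lists,", sum(len(l) for l in lists), "stimuli in total")
```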
The test consists of 12 different stimulus lists containing about 450 words in total. There were 27 subjects (auditors; 12 male and 15 female), of whom around half were linguists. No instructions were given before listening; the subjects were simply asked to recognize a single consonant in different vowel environments. Unlike in written Estonian, they were asked to mark the palatalization of some consonants. The test took about 40 minutes to perform.
- RESULTS
Figure 1 shows a summary of the results of the 27 people who were tested. As there is no other reliable Estonian text-to-speech synthesizer in the world, it is only possible to compare the segmental quality of synthesized speech with that of human speech. The figures show that errors in consonant recognition occur most frequently when consonants are in the environment of extreme vowels and less often when they are in the most widespread clusters. One of the reasons for this is that the most common sound combinations are easier to recognize. Another reason is that the subjects were not used to marking palatalization (in written Estonian palatalization is not marked). They remembered to do it at the beginning of the test but forgot later on. Palatalization-related errors accounted for 20% of all errors, including those where the subjects forgot to mark it altogether.
Figure 1. Overall error rate (in percent) for human and synthetic speech in the CV, VCV and VC clusters (TEST 1 - 17 consonants in the environment of extreme vowels; TEST 2 - the most frequent VCV, VC and CV combinations occurring in the Estonian language).
Table 1 shows a sample of the confusion matrix concerning perception of consonants in synthetic VCV words. Typical errors in consonant recognition can be seen. The percentage of correctly perceived consonants is shown in the diagonal of the matrix.
Table 1. Confusion matrix for consonants in synthetic VCV words (percentage of correctly perceived consonants in the diagonal).
|    | b  | d  | f  | g  | h  | j  | k  | l  | l’ | m  | n  | n’ | p  | r  | s  | s’ | š  | t  | t’ | v  | ?  |
| b  | 86 | 1  |    |    |    |    |    |    |    |    |    |    | 10 |    |    |    |    |    |    | 3  |    |
| d  |    | 92 |    | 5  |    |    | 1  |    |    |    |    |    |    |    |    |    |    |    | 1  |    | 1  |
| f  |    |    | 97 |    | 1  |    |    |    |    |    |    |    |    |    |    |    | 1  |    |    |    | 1  |
| g  | 1  | 8  |    | 75 | 5  |    | 10 |    |    |    |    |    |    |    |    |    |    | 1  |    |    |    |
| h  | 1  |    | 19 | 14 | 51 |    |    | 1  |    |    |    |    | 1  |    |    |    |    |    |    | 11 | 2  |
| j  |    |    |    |    |    | 78 |    | 4  | 4  |    | 1  |    |    | 9  |    |    |    |    |    |    | 4  |
| k  |    |    |    | 1  |    |    | 97 |    |    |    |    |    |    |    |    |    |    | 1  |    |    | 1  |
| l  |    |    |    | 4  | 2  |    |    | 73 | 11 |    | 4  |    |    | 2  |    |    |    |    |    | 4  |    |
| l’ |    |    |    |    |    |    |    | 6  | 82 | 2  |    | 1  |    | 4  |    |    |    |    |    | 5  |    |
| m  | 1  |    |    |    | 2  |    |    |    |    | 91 |    |    |    |    |    |    |    |    |    | 4  | 2  |
| n  |    | 4  |    |    |    |    |    | 2  |    |    | 64 | 30 |    |    |    |    |    |    |    |    |    |
| n’ |    |    |    | 2  |    |    |    |    |    |    | 2  | 94 |    |    |    |    |    |    |    |    | 2  |
| p  | 4  |    | 2  |    |    |    |    |    |    |    |    |    | 86 | 2  |    |    |    |    |    | 5  | 1  |
| r  |    |    |    |    |    |    |    |    |    |    |    |    |    | 98 |    |    |    |    |    |    | 2  |
| s  |    |    |    |    |    |    |    |    |    |    |    |    |    |    | 83 | 15 | 2  |    |    |    |    |
| s’ |    | 2  |    |    | 1  |    |    |    |    |    |    |    |    |    | 7  | 89 |    |    | 1  |    |    |
| š  |    |    |    |    |    |    |    |    |    |    |    |    |    |    | 4  | 1  | 94 |    |    |    | 1  |
| t  |    | 11 |    |    |    |    |    |    |    |    |    |    |    |    |    |    |    | 82 | 5  |    | 2  |
| t’ |    | 11 |    |    |    |    | 22 | 4  |    |    |    |    |    |    |    |    |    |    | 59 |    | 4  |
| v  | 5  |    |    |    | 2  |    |    |    |    | 2  | 1  |    |    | 10 |    |    |    |    |    | 78 | 2  |
Some consonants, like f, k and r, are recognized very well, at almost 100%. It is interesting to note that the consonant r, which does not sound good in synthesized speech, is nevertheless well recognized in VCV nonsense words. Consequently, the intelligibility and naturalness of speech do not necessarily correlate.
Table 2 presents the most frequent errors of consonant recognition. An unpalatalized n was recognized as a palatalized n in 30% of the cases. The recognition errors fall into three groups: errors in determining palatalization - 20% of the cases (n → n’, s → s’, l → l’), errors in the perception of quantity - 12% (t → d, t’ → d’, g → k), and “pure” errors in determining consonants - 68% (h → g, v → r, j → r, etc.).
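The sketch below shows how errors can be grouped in this way from a confusion matrix stored as nested dictionaries; the category definitions and pair lists are simplified examples, not an exhaustive description of our classification.

```python
# Confusion matrix represented as {stimulus_consonant: {perceived_consonant: percent}}.
PALATALIZED = {"l": "l'", "n": "n'", "s": "s'", "t": "t'"}     # plain -> palatalized (illustrative)
QUANTITY_PAIRS = {("t", "d"), ("g", "k"), ("b", "p"),
                  ("d", "t"), ("k", "g"), ("p", "b")}          # quantity-related confusions (illustrative)

def group_errors(confusion):
    """Split all off-diagonal percentages into three error groups."""
    groups = {"palatalization": 0.0, "quantity": 0.0, "other": 0.0}
    for stim, responses in confusion.items():
        for resp, pct in responses.items():
            if resp == stim:
                continue                                       # diagonal cells are correct answers
            if PALATALIZED.get(stim) == resp or PALATALIZED.get(resp) == stim:
                groups["palatalization"] += pct                # e.g. n -> n', s -> s'
            elif (stim, resp) in QUANTITY_PAIRS:
                groups["quantity"] += pct                      # e.g. t -> d, g -> k
            else:
                groups["other"] += pct                         # e.g. h -> g, v -> r
    total = sum(groups.values()) or 1.0
    return {k: round(100 * v / total, 1) for k, v in groups.items()}

# e.g. group_errors({"n": {"n": 64, "n'": 30, "d": 4, "l": 2}})
```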
Table 2. The most frequent errors (%) in the VCV clusters.
n → n’ 30
t’ → k 22
h → f 19
s → s’ 15
h → g 14
l → l’ 11
t → d 11
h → v 11
t’ → d’ 11
v → r 10
g → k 10
b → p 10
j → r 9
g → d 8
Table 3 compares the test results of the Estonian text-to-speech system with those of the corresponding Swedish system (Carlson, Granström, Nord 1992). It should be mentioned that it was not easy to find analogous test results for CV, VCV and VC clusters in other languages. Perception of Estonian consonants is best in the CV cluster. This is partly connected with the fact that there is no palatalization of consonants at the beginning of a word and the plosives g, b and d are pronounced as k, p and t. The cluster type most prone to consonant perception errors is VCV. Swedish consonants are perceived almost equally well in all cluster types.
Table 3. Error rates in percent for the Estonian and Swedish TTS systems.
| System, testing year | CV | VCV | VC |
| Estonian TTS, 2000   | 4  | 17  | 11 |
| Swedish TTS, 1992    | 16 | 15  | 21 |
- DIPHONE CORRECTOR
On the basis of the segmental quality tests, we should be able to correct the selection of speech units and renew our diphone database. To improve the smoothness of synthetic speech it seems reasonable to combine some words in the sentence into prosodic compounds. These unusual compounds will inevitably produce some unknown diphones. The same problem occurs in the pronunciation of foreign words and names. Therefore we need a diphone corrector.
Due to missing diphones, the speech synthesizer produces many unexpected interruptions. Ideally, the diphone corrector must ensure smooth output speech regardless of whether a particular diphone exists. The corrector is built in two steps. The first step checks the presence of the transition in the diphone database and also applies the basic rules of diphone cloning (see Figure 2 and the sketch following it).
|   | b | p | b: | p: |
| a | B | b | p: | p: |
| e | B | b | p: | p: |
| ä | B | b | p: | p: |
Figure 2. An example of a diphone component table (a, e, ä are the first components of the diphones; b, p, b:, p: are the second ones; the simple clones a-p = a-b and a-b: = a-p:).
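As a rough illustration of this first step, the sketch below checks the database for a transition and falls back on a simple clone from the component table; the data structures, names and the tiny example database are our assumptions, not the actual implementation.

```python
# Illustrative database of available first->second transitions (a tiny subset).
DIPHONE_DB = {("a", "b"), ("a", "p:"), ("e", "b"), ("e", "p:"), ("ä", "b"), ("ä", "p:")}

# Simple cloning rules for the second component, as in Figure 2: a-p = a-b, a-b: = a-p:.
CLONE_RULES = {"p": "b", "b:": "p:"}

def find_diphone(first, second):
    """Return the diphone to use for a first->second transition, or None if unknown."""
    if (first, second) in DIPHONE_DB:
        return (first, second)              # transition recorded from real speech
    clone = CLONE_RULES.get(second)
    if clone and (first, clone) in DIPHONE_DB:
        return (first, clone)               # simple clone, e.g. a-p taken from a-b
    return None                             # hand the case over to the second step
```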
The second part of the corrector is a set of rules for replacing a missing diphone with the best existing one. But as diphone sequences form the basis of our compilative synthesis, we must keep an eye on the pronunciation as well as make sure that the preceding and following transitions are available in the diphone database.
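The sketch below illustrates this second step under the same assumptions: it tries an ordered list of substitute phones and accepts one only if both the preceding and the following transitions exist in the database; the substitution lists themselves are hypothetical examples.

```python
# Hypothetical replacement candidates, ordered by assumed phonetic closeness.
SUBSTITUTES = {"f": ["v", "h"], "š": ["s", "s'"]}

def replace_missing(prev_phone, missing, next_phone, db):
    """Pick a replacement whose surrounding transitions are both in the database db."""
    for cand in SUBSTITUTES.get(missing, []):
        if (prev_phone, cand) in db and (cand, next_phone) in db:
            return cand                     # both prev-cand and cand-next diphones exist
    return None                             # no safe replacement found

# e.g. replace_missing("a", "f", "i", {("a", "v"), ("v", "i")}) -> "v"
```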
- FUTURE DEVELOPMENTS OF ESTONIAN TTS
The quality of synthesized speech depends on the quality of prosody, which means controlling the fundamental frequency as well as the changes of sound intensity and sound duration in speech. Automatic and correct generation of prosody implies that the computer understands the text. As this is not the case, we must try to extract as much information from the text as possible. We need an automatic prosody predictor that would indicate the focus of sentences, add emphasis to words, and insert pauses into fluent speech. Such a prosody predictor can be created on the basis of syntactic data.
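To make the idea concrete, here is a deliberately naive sketch of such a predictor: it uses only punctuation and word position as surrogate syntactic cues to insert pauses and mark emphasis; the tag names and rules are purely hypothetical and far simpler than what a real syntax-based predictor would require.

```python
def predict_prosody(sentence):
    """Return (word, tags) pairs with naive pause and emphasis marks."""
    words = sentence.split()
    annotated = []
    for i, word in enumerate(words):
        tags = []
        if word.rstrip(".!?").endswith((",", ";", ":")):
            tags.append("PAUSE_SHORT")      # short pause after a clause boundary
        if i == len(words) - 1:
            tags.append("PAUSE_LONG")       # sentence-final pause
        if word.isupper() and len(word) > 1:
            tags.append("EMPHASIS")         # crude cue for focus/emphasis
        annotated.append((word.strip(".,;:!?"), tags))
    return annotated

# e.g. predict_prosody("Homme, kui ilm lubab, sõidame Tartusse.")
```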
High-quality synthetic speech clearly depends on the user and on the type of text. When reading news, for example, we use a different intonation than when reading narrative text. It is necessary to work out special user interfaces for changing the synthesis parameters according to different types of texts (news, messages, fiction) or according to the user's wishes.
REFERENCES
Carlson, R., Granström, B., Nord, L. 1992, Segmental Evaluation Using the Esprit/SAM Test and Mono-syllabic Words. – Talking Machines: Theories, Models and Design: 443-453.
Dutoit, T. 1997, An Introduction to Text-to-Speech Synthesis. Kluwer Academic Publishers, Dordrecht.
Fourcin, A., Harland, G., Barry, W., Hazan, V. 1989, Speech Input and Output Assessment – Multilingual Methods and Standards. Ellis Horwood Limited, Chichester, England.
Mihkla, M., Eek, A., Meister, E. 1998, Creation of the Estonian Diphone Database for Text-to-Speech Synthesis. – Proceedings of the Finnic Phonetics Symposium, August 11-14, 1998, Pärnu, Estonia. Linguistica Uralica 34, 3: 334-340.
Mihkla, M., Eek, A., Meister, E. 1999a, Diphone Synthesis of Estonian. – Proceedings of the International Workshop Dialogue’99. Computational Linguistics and Its Applications (ed. A. S. Narin’yani), vol. 2 (Applications): 351-353. Tarusa.
Mihkla, M., Eek, A., Meister, E. 1999b, Text-to-Speech Synthesis of Estonian. – Proceedings of the 6th European Conference on Speech Communication and Technology, Budapest, Hungary, September 5-10, 1999, vol. 5: 2095-2098.
* The support of the Estonian Science Foundation has made this work possible.