WO2005034084A1

WO2005034084A1 - Improvements to an utterance waveform corpus

Info

Publication number: WO2005034084A1
Application number: PCT/US2004/030569
Authority: WO
Inventors: Yi-Qing Zu; Jian-Cheng Huang
Original assignee: Motorola Inc
Current assignee: Motorola Solutions Inc
Priority date: 2003-09-29
Filing date: 2004-09-17
Publication date: 2005-04-14
Anticipated expiration: 2006-03-29
Also published as: EP1668630A1; CN1604077A; CN1604077B; KR100759729B1; EP1668630A4; KR20060056406A; EP1668630B1

Abstract

There is described a method (200) for providing a representation of a waveform for a word. The method (200) includes providing (220) transcriptions representing phrases and corresponding sampled and digitized utterance waveforms of the transcriptions, the transcriptions having marked natural phrase boundaries. The method (200) also provides for clustering (230) parts of the waveforms corresponding to identical words in the transcriptions to provide groups of waveforms for the identical words with similar prosodic features, the clustering being effected when the identical words are positioned in the transcriptions at locations relative to the natural phrase boundaries. The method (200) then processes (240) each of the groups of waveforms for the identical words to provide a representative utterance waveform for each group.

Description

IMPROVEMENTS TO AN UTTERANCE WAVEFORM CORPUS

FIELD OF THE INVENTION

The present invention relates generally to Text-To-Speech (TTS) synthesis. The invention is particularly useful for, but not necessarily limited to, determining an appropriate synthesized pronunciation of a text segment using an improved utterance waveform corpus.

BACKGROUND OF THE INVENTION

Text to Speech (TTS) conversion, often referred to as concatenated text to speech synthesis, allows electronic devices to receive an input text string and provide a converted representation of the string in the form of synthesized speech. However, a device that may be required to synthesize speech originating from a non-deterministic number of received text strings will have difficulty in providing high quality realistic synthesized speech. That is because the pronunciation of each word or syllable (for Chinese characters and the like) to be synthesized is context and location dependent. For example, a pronunciation of a word at the end of a sentence (input text string) may be drawn out or lengthened. The pronunciation of the same word may be lengthened even more if it occurs in the middle of a sentence where emphasis is required.

In most languages the pronunciation of a word depends on acoustic prosodic parameters comprising tone (pitch), volume (power or amplitude) and duration. The prosodic parameter values for a word are dependent upon the word's position in a phrase. One TTS approach is to identify matching text strings with a sufficiently long utterance in the corpus. However, this approach is computationally expensive, requires an unacceptably large corpus for most applications, and offers no guarantee of finding a suitable matching utterance in the corpus. Another approach employs a relatively small corpus and clustering of acoustic units (phonemes) representative of similar prosodic parameters. This approach is relatively computationally efficient but does not suitably address the problem of prosodic variations due to word position in a phrase.

In this specification, including the claims, the terms 'comprises', 'comprising' or similar terms are intended to mean a non-exclusive inclusion, such that a method or apparatus that comprises a list of elements does not include those elements solely, but may well include other elements not listed.

SUMMARY OF THE INVENTION

According to one aspect of the invention there is provided a method for providing a representation of a waveform for a word, the method comprising: providing a plurality of transcriptions representing phrases and corresponding sampled and digitized utterance waveforms of the transcriptions, the transcriptions having marked natural phrase boundaries; clustering parts of the waveforms corresponding to identical words in the transcriptions to provide groups of waveforms for the identical words with similar prosodic features, the clustering being effected when the identical words are positioned in the transcriptions at locations relative to the natural phrase boundaries; and processing each of the groups of waveforms for the identical words to provide a representative utterance waveform thereof.

Preferably, the locations relative to the natural phrase boundaries are grouped into at least one of five positions such that there are five potential clusters for the identical words. Suitably, a first one of the positions is at a beginning of the transcriptions. Preferably, a second one of the positions is at an end of the transcriptions. Suitably, a third one of the positions is immediately before and adjacent the marked natural phrase boundaries that lie between the beginning and end of the transcriptions. Suitably, a fourth one of the positions is immediately following and adjacent the marked natural phrase boundaries that lie between the beginning and end of the transcriptions. Suitably, a fifth one of the positions is any position other than the first, second, third or fourth position in the transcriptions.

Preferably, the processing is further characterized by determining average values of the waveforms for the identical words to provide a representative utterance waveform thereof.

According to another aspect of the invention there is provided an electronic device for Text-To-Speech (TTS) synthesis comprising: a processor; a synthesizer coupled to the processor; a memory module coupled to the processor for providing text strings; and a waveform utterance corpus coupled to the processor; the corpus comprising representative utterance waveforms of clusters of identical words positioned in the text strings at locations relative to the natural phrase boundaries.

BRIEF DESCRIPTION OF THE DRAWINGS

In order that the invention may be readily understood and put into practical effect, reference will now be made to a preferred embodiment as illustrated with reference to the accompanying drawings in which: Fig. 1 is a schematic block diagram of an electronic device for use with the present invention; Fig. 2 illustrates a method 200 for providing a representation of a waveform for a word to be stored in an utterance corpus of Fig. 1; and Figs. 3A to 3C illustrate text strings and markers identifying natural phrase boundaries.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT OF THE INVENTION

Referring to Fig. 1 there is illustrated an electronic device 100, in the form of a radio-telephone, comprising a device processor 102 operatively coupled by a bus 103 to a user interface 104 that is typically a touch screen or alternatively a display screen and keypad. The electronic device 100 also has an utterance corpus 106, a speech synthesizer 110, non-volatile memory 120, Read Only Memory 118 and a radio communications module 116, all operatively coupled to the processor 102 by the bus 103. The speech synthesizer 110 has an output coupled to drive a speaker 112. The corpus 106 includes representations of words or phonemes and associated sampled, digitized and processed utterance waveforms PUWs. In other words, and as described below, the non-volatile memory 120 (memory module) provides text strings in use for Text-To-Speech (TTS) synthesis (the text may be received by module 116 or otherwise). Also, the waveform utterance corpus comprises representative utterance waveforms of clusters of identical words positioned in transcriptions, representing phrases and corresponding sampled and digitized utterance waveforms, at locations relative to the natural phrase boundaries as described below.

As will be apparent to a person skilled in the art, the radio frequency communications unit 116 is typically a combined receiver and transmitter having a common antenna. The radio frequency communications unit 116 has a transceiver coupled to the antenna via a radio frequency amplifier. The transceiver is also coupled to a combined modulator/demodulator that couples the communications unit 116 to the processor 102. Also, in this embodiment the non-volatile memory 120 (memory module) stores a user programmable phonebook database Db, and the Read Only Memory 118 stores operating code (OC) for the device processor 102.

Referring to Figs. 2 and 3A to 3C there is illustrated a method 200 for providing a representation of a waveform for a word. The method 200, after a start step 210, comprises a step 220 of providing a plurality of text strings representing phrases and corresponding sampled and digitized utterance waveforms of the text strings, the text strings having marked natural phrase boundaries 310. These natural phrase boundaries are inserted manually into transcriptions of speech waveforms, the transcriptions being phrases or sentences. Also, the sampled and digitized utterance waveforms are typically in the form of feature vectors, as will be apparent to a person skilled in the art.

The method 200 then effects a step 230 of clustering parts of the waveforms corresponding to identical words in the transcriptions to provide groups of waveforms for identical words with similar prosodic features, the clustering being effected when the identical words are positioned in the transcription at locations LS relative to the natural phrase boundaries 310. For instance, the transcription 300 of Fig. 3A "| The cat | sat on the mat|" has three natural phrase boundaries 310 indicated by the markers "|"; the transcription 300 of Fig. 3B "| The cat | sat on the mat| in the house|" has four natural phrase boundaries, and so does the transcription 300 of Fig. 3C "|The dog| sat on the mat| next to the cat |".

During the step 230 of clustering, the locations LS of the words in the transcription relative to the natural phrase boundaries 310 are grouped into one of five positions such that there are five potential clusters for the identical words. A first (1st) one of the positions is at a beginning of the text string. Thus, in the three transcription examples of Figs. 3A to 3C there is an identical word "The" in the first (1st) one of the positions.
Other identical words should be found in further transcriptions, and all instances of the word "the" in the first (1st) one of the positions will be grouped together during the step 230 of clustering.

A second (2nd) one of the positions is at an end of the transcription. In the three transcription examples of Figs. 3A to 3C there are no identical words in this position (mat, house, cat) and therefore none of these words will be grouped together during the step 230 of clustering. However, identical words in the second (2nd) one of the positions may be found in further transcriptions.

A third (3rd) one of the positions is immediately before and adjacent the marked natural phrase boundaries 310 that lie between the beginning and end of the transcription. In the three transcription examples of Figs. 3A to 3C there are two groups of identical words, "cat" and "mat", in the third (3rd) one of the positions. Other identical words should be found in further transcriptions, and all instances of the word "cat" in the third (3rd) one of the positions will be grouped together during the step 230 of clustering. The same also applies to the word "mat" (and "dog").

A fourth (4th) one of the positions is immediately following and adjacent the marked natural phrase boundaries 310 that lie between the beginning and end of the transcription. In the three transcription examples of Figs. 3A to 3C there is an identical word "sat" in the fourth (4th) one of the positions. Other identical words should be found in further transcriptions, and all instances of the word "sat" in the fourth (4th) one of the positions will be grouped together during the step 230 of clustering. The same also applies to the words "in" and "next".

A fifth (5th) one of the positions is any position other than the first, second, third or fourth position in the transcription. In the three transcription examples of Figs. 3A to 3C there are identical words "on" and "the" in the fifth (5th) one of the positions. Other identical words should be found in further transcriptions, and all instances of the word "on" in the fifth (5th) one of the positions will be grouped together during the step 230 of clustering, as will instances of the identical word "the". The same also applies to the word "to".

After step 230, a step 240 of processing provides for processing each of the groups of waveforms for the identical words to provide a representative utterance waveform thereof. Specifically, the step 240 of processing preferably provides for determining average values of the waveforms corresponding to identical words to provide a representative utterance waveform thereof. The average values are calculated by summing each element in the feature vectors for each cluster and then dividing by the number of feature vectors. For instance, if there were 100 instances of the word "the" identified in the first (1st) position of the text strings, then each corresponding element in the feature vector for each of the 100 instances would be summed and the result would be divided by 100 to obtain a mean value for each feature vector element. Hence, after processing, an average sampled digitized waveform SDW representative of the cluster for the word "the" in the first (1st) position of an utterance is stored in the utterance corpus 106 in a storing step 250. The method then ends after clustering is completed for each word.

Advantageously, the present invention allows for storing average sampled digitized waveforms SDWs, each representative of a cluster for an associated word. The average sampled digitized waveforms SDWs essentially model acoustic prosodic features for words, wherein parameters of the acoustic prosodic features comprising tone (pitch), volume (power or amplitude) and duration are dependent upon their position in a sentence or phrase relative to the natural phrase boundaries.
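By way of illustration only, steps 220 to 250 might be sketched in Python as follows. This sketch is not part of the described embodiment. The names `word_positions` and `build_corpus` are hypothetical, words are lower-cased so that "The" and "the" count as identical, each word instance is assumed to be represented by a single fixed-length feature vector, and the fourth position is taken to be the word immediately after an interior boundary, as the "sat" example suggests. The tie-breaking order for one-word phrases is also an assumption.

```python
from collections import defaultdict

BOUNDARY = "|"  # marker for a natural phrase boundary, as in Figs. 3A to 3C


def word_positions(transcription):
    """Return (word, position) pairs, with position in 1..5."""
    # Split on the boundary marker; the outer markers leave empty pieces.
    phrases = [p.split() for p in transcription.split(BOUNDARY) if p.strip()]
    result = []
    for pi, phrase in enumerate(phrases):
        for wi, word in enumerate(phrase):
            first_phrase = pi == 0
            last_phrase = pi == len(phrases) - 1
            first_word = wi == 0
            last_word = wi == len(phrase) - 1
            if first_phrase and first_word:
                pos = 1  # beginning of the transcription
            elif last_phrase and last_word:
                pos = 2  # end of the transcription
            elif last_word:
                pos = 3  # immediately before an interior boundary
            elif first_word:
                pos = 4  # immediately after an interior boundary
            else:
                pos = 5  # any other position
            result.append((word.lower(), pos))
    return result


def build_corpus(examples):
    """Cluster by (word, position) and average each cluster element-wise.

    examples: iterable of (transcription, vectors), where vectors holds one
    equal-length list of floats per word, in word order (an assumption).
    """
    clusters = defaultdict(list)
    for transcription, vectors in examples:
        for key, vec in zip(word_positions(transcription), vectors):
            clusters[key].append(vec)
    corpus = {}
    for key, vecs in clusters.items():
        n = len(vecs)
        # Sum each feature-vector element over the cluster, then divide by n.
        corpus[key] = [sum(column) / n for column in zip(*vecs)]
    return corpus


print(word_positions("| The cat | sat on the mat|"))
# [('the', 1), ('cat', 3), ('sat', 4), ('on', 5), ('the', 5), ('mat', 2)]
```

Keying the clusters on the pair (word, position) keeps, for example, "the" at the beginning of a transcription separate from "the" in the fifth position, so each receives its own averaged representation.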
The detailed description provides a preferred exemplary embodiment only, and is not intended to limit the scope, applicability, or configuration of the invention. Rather, the detailed description of the preferred exemplary embodiment provides those skilled in the art with an enabling description for implementing the preferred exemplary embodiment of the invention. It should be understood that various changes may be made in the function and arrangement of elements without departing from the spirit and scope of the invention as set forth in the appended claims.

Claims

WE CLAIM:

1. A method for providing a representation of a waveform for a word, the method comprising: providing a plurality of transcriptions representing phrases and corresponding sampled and digitized utterance waveforms of the transcriptions, the transcriptions having marked natural phrase boundaries; clustering parts of the waveforms corresponding to identical words in the transcriptions to provide groups of waveforms for the identical words with similar prosodic features, the clustering being effected when the identical words are positioned in the transcriptions at locations relative to the natural phrase boundaries; and processing each of the groups of waveforms for the identical words to provide a representative utterance waveform thereof.

2. A method as claimed in claim 1, wherein the locations relative to the natural phrase boundaries are grouped into at least one of five positions such that there are five potential clusters for the identical words.

3. A method as claimed in claim 2, wherein a first one of the positions is at a beginning of the transcriptions.

4. A method as claimed in claim 2, wherein a second one of the positions is at an end of the transcriptions.

5. A method as claimed in claim 2, wherein a third one of the positions is immediately before and adjacent the marked natural phrase boundaries between the beginning and end of the transcriptions.

6. A method as claimed in claim 2, wherein a fourth one of the positions is immediately following and adjacent the marked natural phrase boundaries that lie between the beginning and end of the transcriptions.

7. A method as claimed in claim 2, wherein a fifth one of the positions is any position other than the first, second, third or fourth position in the transcriptions.

8. A method as claimed in claim 1, wherein the processing is further characterized by determining average values of the waveforms for the identical words to provide a representative utterance waveform thereof.

9. An electronic device for Text-To-Speech (TTS) synthesis comprising: a processor; a synthesizer coupled to the processor; a memory module coupled to the processor for providing text strings; and a waveform utterance corpus coupled to the processor; the corpus comprising representative utterance waveforms of clusters of identical words positioned in the text strings at locations relative to the natural phrase boundaries.