WO2005034084A1 - Improvements to an utterance waveform corpus - Google Patents

Improvements to an utterance waveform corpus

Publication number
WO2005034084A1
Authority
WO
WIPO (PCT)
Prior art keywords
transcriptions
waveforms
positions
identical words
utterance
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/US2004/030569
Other languages
French (fr)
Inventor
Yi-Qing Zu
Jian-Cheng Huang
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Motorola Solutions Inc
Original Assignee
Motorola Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2003-09-29
Filing date
2004-09-17
Publication date
2005-04-14
Application filed by Motorola Inc
Priority to EP04784432.9A (EP1668630B1)
Publication of WO2005034084A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10 Prosody rules derived from text; Stress or intonation
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/06 Elementary speech units used in speech synthesisers; Concatenation rules
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Machine Translation (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

There is described a method (200) for providing a representation of a waveform for a word. The method (200) includes providing (220) transcriptions representing phrases and corresponding sampled and digitized utterance waveforms of the transcriptions, the transcriptions having marked natural phrase boundaries. The method (200) also provides for clustering (230) parts of the waveforms corresponding to identical words in the transcriptions to provide groups of waveforms for the identical words with similar prosodic features, the clustering being effected when the identical words are positioned in the transcriptions at locations relative to natural phrase boundaries. The method then processes (240) each of the groups of waveforms for the identical words to provide a representative utterance waveform for each group.

Description

IMPROVEMENTS TO AN UTTERANCE WAVEFORM CORPUS
FIELD OF THE INVENTION
The present invention relates generally to Text-To-Speech (TTS) synthesis. The invention is particularly useful for, but not necessarily limited to, determining an appropriate synthesized pronunciation of a text segment using an improved utterance waveform corpus.
BACKGROUND OF THE INVENTION
Text to Speech (TTS) conversion, often referred to as concatenated text to speech synthesis, allows electronic devices to receive an input text string and provide a converted representation of the string in the form of synthesized speech. However, a device that may be required to synthesize speech originating from a non-deterministic number of received text strings will have difficulty in providing high quality realistic synthesized speech. That is because the pronunciation of each word or syllable (for Chinese characters and the like) to be synthesized is context and location dependent. For example, the pronunciation of a word at the end of a sentence (input text string) may be drawn out or lengthened. The pronunciation of the same word may be lengthened even more if it occurs in the middle of a sentence where emphasis is required.
In most languages the pronunciation of a word depends on acoustic prosodic parameters comprising tone (pitch), volume (power or amplitude) and duration. The prosodic parameter values for a word are dependent upon the position of the word in a phrase.
One TTS approach is to identify matching text strings with a sufficiently long utterance in the corpus. However, this approach is computationally expensive, requires an unacceptably large corpus for most applications, and offers no guarantee of finding a suitable matching utterance in the corpus. Another approach employs a relatively small corpus and clustering of acoustic units (phonemes) representative of similar prosodic parameters. This approach is relatively computationally efficient but does not suitably address the problem of prosodic variations due to word position in a phrase.
In this specification, including the claims, the terms 'comprises', 'comprising' or similar terms are intended to mean a non-exclusive inclusion, such that a method or apparatus that comprises a list of elements does not include those elements solely, but may well include other elements not listed.
SUMMARY OF THE INVENTION
According to one aspect of the invention there is provided a method for providing a representation of a waveform for a word, the method comprising: providing a plurality of transcriptions representing phrases and corresponding sampled and digitized utterance waveforms of the transcriptions, the transcriptions having marked natural phrase boundaries; clustering parts of the waveforms corresponding to identical words in the transcriptions to provide groups of waveforms for the identical words with similar prosodic features, the clustering being effected when the identical words are positioned in the transcriptions at locations relative to the natural phrase boundaries; and processing each of the groups of waveforms for the identical words to provide a representative utterance waveform thereof.
Preferably, the locations relative to the natural phrase boundaries are grouped into at least one of five positions such that there are five potential clusters for the identical words. Suitably, a first one of the positions is at a beginning of the transcriptions. Preferably, a second one of the positions is at an end of the transcriptions. Suitably, a third one of the positions is immediately before and adjacent the marked natural phrase boundaries between the beginning and end of the transcriptions. Suitably, a fourth one of the positions is immediately preceding and adjacent the marked natural phrase boundaries that lie between the beginning and end of the transcriptions. Suitably, a fifth one of the positions is any position other than the first, second, third or fourth position in the transcriptions. Preferably, the processing is further characterized by determining average values of the waveforms for the identical words to provide a representative utterance waveform thereof.
According to another aspect of the invention there is provided an electronic device for Text-To-Speech (TTS) synthesis comprising: a processor; a synthesizer coupled to the processor; a memory module coupled to the processor for providing text strings; and a waveform utterance corpus coupled to the processor; the corpus comprising representative utterance waveforms of clusters of identical words positioned in the text strings at locations relative to the natural phrase boundaries.
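The five positions set out in the summary can be illustrated with a minimal Python sketch. It is not part of the specification: the function name `classify_positions`, the integer position labels and the use of "|" as a boundary marker inside a plain string are assumptions made for illustration. The sketch also assumes that the fourth position covers the word immediately after an interior phrase boundary, which is the reading suggested by the worked examples in the detailed description (for instance "sat" in "| The cat | sat on the mat |"), and that a word qualifying for more than one position takes the lower-numbered one.

```python
from typing import List, Tuple

# Hypothetical labels for the five positions (not named in the specification).
FIRST, LAST, PRE_BOUNDARY, POST_BOUNDARY, OTHER = 1, 2, 3, 4, 5


def classify_positions(transcription: str, marker: str = "|") -> List[Tuple[str, int]]:
    """Return (word, position) pairs for one boundary-marked transcription."""
    # Split at the boundary markers and drop empty pieces, so leading and
    # trailing markers do not create empty phrases.
    phrases = [piece.split() for piece in transcription.split(marker)]
    phrases = [phrase for phrase in phrases if phrase]

    labelled: List[Tuple[str, int]] = []
    for pi, phrase in enumerate(phrases):
        for wi, word in enumerate(phrase):
            starts_phrase = wi == 0
            ends_phrase = wi == len(phrase) - 1
            if pi == 0 and starts_phrase:
                position = FIRST           # beginning of the transcription
            elif pi == len(phrases) - 1 and ends_phrase:
                position = LAST            # end of the transcription
            elif ends_phrase:
                position = PRE_BOUNDARY    # just before an interior boundary
            elif starts_phrase:
                position = POST_BOUNDARY   # just after an interior boundary (assumed reading)
            else:
                position = OTHER           # anywhere else
            # Lower-casing makes "The" and "the" count as identical words.
            labelled.append((word.lower(), position))
    return labelled
```

For the transcription "| The cat | sat on the mat |" the sketch labels "the" as position 1, "cat" as 3, "sat" as 4, "on" and the second "the" as 5, and "mat" as 2. Punctuation handling is deliberately left out.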
BRIEF DESCRIPTION OF THE DRAWINGS
In order that the invention may be readily understood and put into practical effect, reference will now be made to a preferred embodiment as illustrated with reference to the accompanying drawings in which: Fig. 1 is a schematic block diagram of an electronic device for use with the present invention; Fig. 2 illustrates a method 200 for providing a representation of a waveform for a word to be stored in an utterance corpus of Fig. 1; and Figs. 3A to 3C illustrate text strings and markers identifying natural phrase boundaries.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT OF THE INVENTION
Referring to Fig. 1 there is illustrated an electronic device 100, in the form of a radio-telephone, comprising a device processor 102 operatively coupled by a bus 103 to a user interface 104 that is typically a touch screen or alternatively a display screen and keypad. The electronic device 100 also has an utterance corpus 106, a speech synthesizer 110, Non Volatile memory 120, Read Only Memory 118 and Radio communications module 116, all operatively coupled to the processor 102 by the bus 103. The speech synthesizer 110 has an output coupled to drive a speaker 112. The corpus 106 includes representations of words or phonemes and associated sampled, digitized and processed utterance waveforms PUWs. In other words, and as described below, the Non Volatile memory 120 (memory module) provides text strings in use for Text-To-Speech (TTS) synthesis (the text may be received by module 116 or otherwise). Also, the waveform utterance corpus comprises representative utterance waveforms of clusters of identical words positioned in transcriptions, representing phrases and corresponding sampled and digitized utterance waveforms, at locations relative to the natural phrase boundaries as described below.
As will be apparent to a person skilled in the art, the radio frequency communications unit 116 is typically a combined receiver and transmitter having a common antenna. The radio frequency communications unit 116 has a transceiver coupled to the antenna via a radio frequency amplifier. The transceiver is also coupled to a combined modulator/demodulator that couples the communications unit 116 to the processor 102. Also, in this embodiment the non-volatile memory 120 (memory module) stores a user programmable phonebook database Db and Read Only Memory 118 stores operating code (OC) for the device processor 102.
Referring to Figs. 2 and 3A to 3C there is illustrated a method 200 for providing a representation of a waveform for a word. After a start step 210, the method 200 comprises a step 220 of providing a plurality of text strings representing phrases and corresponding sampled and digitized utterance waveforms of the text strings, the text strings having marked natural phrase boundaries 310. These natural phrase boundaries are inserted manually into transcriptions of speech waveforms, the transcriptions being phrases or sentences. Also, the sampled and digitized utterance waveforms are typically in the form of feature vectors, as will be apparent to a person skilled in the art.
The method 200 then effects a step 230 of clustering parts of the waveforms corresponding to identical words in the transcriptions to provide groups of waveforms for identical words with similar prosodic features, the clustering being effected when the identical words are positioned in the transcription at locations LS relative to the natural phrase boundaries 310. For instance, the transcription 300 of Fig. 3A "| The cat | sat on the mat |" has three natural phrase boundaries 310 indicated by the markers "|"; the transcription 300 of Fig. 3B "| The cat | sat on the mat | in the house |" has four natural phrase boundaries, and so does the transcription 300 of Fig. 3C "| The dog | sat on the mat | next to the cat |".
During the step 230 of clustering, the locations LS of the words in the transcription relative to the natural phrase boundaries 310 are grouped into one of five positions such that there are five potential clusters for the identical words. A first (1st) one of the positions is at a beginning of the text string. Thus, in the three transcription examples of Figs. 3A to 3C, there is an identical word "The" in the first (1st) one of the positions.
Other identical words should be found in further transcriptions, and all instances of the word "the" in the first (1st) one of the positions will be grouped together during the step 230 of clustering.
A second (2nd) one of the positions is at an end of the transcription. In the three transcription examples of Figs. 3A to 3C there are no identical words in this position (mat, house, cat) and therefore none of these words will be grouped together during the step 230 of clustering. However, identical words in the second (2nd) one of the positions may be found in further transcriptions.
A third (3rd) one of the positions is immediately before and adjacent the marked natural phrase boundaries 310 that lie between the beginning and end of the transcription. In the three transcription examples of Figs. 3A to 3C there are two groups of identical words, "cat" and "mat", in the third (3rd) one of the positions. Other identical words should be found in further transcriptions, and all instances of the word "cat" in the third (3rd) one of the positions will be grouped together during the step 230 of clustering. The same also applies to the words "mat" and "dog".
A fourth (4th) one of the positions is immediately preceding and adjacent the marked natural phrase boundaries 310 that lie between the beginning and end of the transcription. In the three transcription examples of Figs. 3A to 3C there is an identical word "sat" in the fourth (4th) one of the positions. Other identical words should be found in further transcriptions, and all instances of the word "sat" in the fourth (4th) one of the positions will be grouped together during the step 230 of clustering. The same also applies to the words "in" and "near".
A fifth (5th) one of the positions is any position other than the first, second, third or fourth position in the transcription. In the three transcription examples of Figs. 3A to 3C there are identical words "on" and "the" in the fifth (5th) one of the positions.
Other identical words should be found in further transcriptions, and all instances of the word "on" in the fifth (5th) one of the positions will be grouped together during the step 230 of clustering, as will instances of the identical word "the". The same also applies to the word "to".
After step 230, a step 240 of processing provides for processing each of the groups of waveforms for the identical words to provide a representative utterance waveform thereof. Specifically, the step 240 of processing preferably provides for determining average values of the waveforms corresponding to identical words to provide a representative utterance waveform thereof. The average values are calculated by summing each element in the feature vectors for each cluster and then dividing by the number of feature vectors. For instance, if there were 100 instances of the word "the" identified in the first (1st) position of the text strings, then each corresponding element in the feature vector for each of the 100 instances would be summed and the result would be divided by 100 to obtain a mean value for each feature vector element. Hence, after processing, an average sampled digitized waveform SDW representative of the cluster for the word "the" in the first (1st) position of an utterance is stored in the utterance corpus 106 in a storing step 250. The method then ends after clustering and processing are completed for each word.
Advantageously, the present invention allows for storing average sampled digitized waveforms SDWs, each representative of a cluster for an associated word. The average sampled digitized waveforms SDWs essentially model acoustic prosodic features for words, wherein parameters of the acoustic prosodic features comprising tone (pitch), volume (power or amplitude) and duration are dependent upon their position in a sentence or phrase relative to the natural phrase boundaries.
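Steps 230 to 250 can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the implementation of the embodiment: the names `cluster_instances`, `mean_vector` and `build_corpus` are hypothetical, positions are represented as integers 1 to 5, and every word instance is assumed to have already been reduced to a feature vector of the same fixed length (utterance segments of differing duration would need some length normalisation first, which the description does not specify).

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Sequence, Tuple

ClusterKey = Tuple[str, int]                 # (word, position 1..5); hypothetical layout
Instance = Tuple[str, int, Sequence[float]]  # (word, position, feature vector)


def cluster_instances(instances: Iterable[Instance]) -> Dict[ClusterKey, List[Sequence[float]]]:
    """Step 230 (sketch): group feature vectors of identical words by position."""
    clusters: Dict[ClusterKey, List[Sequence[float]]] = defaultdict(list)
    for word, position, vector in instances:
        clusters[(word.lower(), position)].append(vector)
    return dict(clusters)


def mean_vector(vectors: List[Sequence[float]]) -> List[float]:
    """Step 240 (sketch): element-wise mean of the feature vectors in one cluster."""
    length = len(vectors[0])
    if any(len(v) != length for v in vectors):
        raise ValueError("all feature vectors in a cluster must have the same length")
    # Sum each element across the cluster, then divide by the number of vectors.
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def build_corpus(instances: Iterable[Instance]) -> Dict[ClusterKey, List[float]]:
    """Step 250 (sketch): keep one representative vector per (word, position)."""
    return {key: mean_vector(vs) for key, vs in cluster_instances(instances).items()}
```

With 100 instances of "the" in the first position, `build_corpus` would hold a single entry under the key `("the", 1)` whose elements are the per-element sums divided by 100, mirroring the worked example in the text.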
The detailed description provides a preferred exemplary embodiment only, and is not intended to limit the scope, applicability, or configuration of the invention. Rather, the detailed description of the preferred exemplary embodiment provides those skilled in the art with an enabling description for implementing the preferred exemplary embodiment of the invention. It should be understood that various changes may be made in the function and arrangement of elements without departing from the spirit and scope of the invention as set forth in the appended claims.

Claims

WE CLAIM:
1. A method for providing a representation of a waveform for a word, the method comprising: providing a plurality of transcriptions representing phrases and corresponding sampled and digitized utterance waveforms of the transcriptions, the transcriptions having marked natural phrase boundaries; clustering parts of the waveforms corresponding to identical words in the transcriptions to provide groups of waveforms for the identical words with similar prosodic features, the clustering being effected when the identical words are positioned in the transcriptions at locations relative to the natural phrase boundaries; and processing each of the groups of waveforms for the identical words to provide a representative utterance waveform thereof.
2. A method as claimed in claim 1, wherein the locations relative to the natural phrase boundaries are grouped into at least one of five positions such that there are five potential clusters for the identical words.
3. A method as claimed in claim 2, wherein a first one of the positions is at a beginning of the transcriptions.
4. A method as claimed in claim 2, wherein a second one of the positions is at an end of the transcriptions.
5. A method as claimed in claim 2, wherein a third one of the positions is immediately before and adjacent the marked natural phrase boundaries between the beginning and end of the transcriptions.
6. A method as claimed in claim 2, wherein a fourth one of the positions is immediately preceding and adjacent the marked natural phrase boundaries that lie between the beginning and end of the transcriptions.
7. A method as claimed in claim 2, wherein a fifth one of the positions is any position other than the first, second, third or fourth position in the transcriptions.
8. A method as claimed in claim 1, wherein the processing is further characterized by determining average values of the waveforms for the identical words to provide a representative utterance waveform thereof.
9. An electronic device for Text-To-Speech (TTS) synthesis comprising: a processor; a synthesizer coupled to the processor; a memory module coupled to the processor for providing text strings; and a waveform utterance corpus coupled to the processor; the corpus comprising representative utterance waveforms of clusters of identical words positioned in the text strings at locations relative to the natural phrase boundaries.
PCT/US2004/030569 2003-09-29 2004-09-17 Improvements to an utterance waveform corpus Ceased WO2005034084A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
EP04784432.9A EP1668630B1 (en) 2003-09-29 2004-09-17 Improvements to an utterance waveform corpus

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN03134795.9 2003-09-29
CN031347959A CN1604077B (en) 2003-09-29 2003-09-29 Improvement for pronunciation waveform corpus

Publications (1)

Publication Number Publication Date
WO2005034084A1 (en) 2005-04-14

Family

ID=34398363

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2004/030569 Ceased WO2005034084A1 (en) 2003-09-29 2004-09-17 Improvements to an utterance waveform corpus

Country Status (4)

Country Link
EP (1) EP1668630B1 (en)
KR (1) KR100759729B1 (en)
CN (1) CN1604077B (en)
WO (1) WO2005034084A1 (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111833842A (en) * 2020-06-30 2020-10-27 讯飞智元信息科技有限公司 Synthetic sound template discovery method, device and equipment
US11393447B2 (en) * 2019-06-18 2022-07-19 Lg Electronics Inc. Speech synthesizer using artificial intelligence, method of operating speech synthesizer and computer-readable recording medium
US11398219B2 (en) * 2019-09-16 2022-07-26 Lg Electronics Inc. Speech synthesizer using artificial intelligence and method of operating the same
US11443732B2 (en) * 2019-02-15 2022-09-13 Lg Electronics Inc. Speech synthesizer using artificial intelligence, method of operating speech synthesizer and computer-readable recording medium

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111128116B (en) * 2019-12-20 2021-07-23 珠海格力电器股份有限公司 Voice processing method and device, computing equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5384893A (en) * 1992-09-23 1995-01-24 Emerson & Stern Associates, Inc. Method and apparatus for speech synthesis based on prosodic analysis
US5774855A (en) * 1994-09-29 1998-06-30 Cselt-Centro Studi E Laboratori Tellecomunicazioni S.P.A. Method of speech synthesis by means of concentration and partial overlapping of waveforms
WO2000030069A2 (en) 1998-11-13 2000-05-25 Lernout & Hauspie Speech Products N.V. Speech synthesis using concatenation of speech waveforms
US6144939A (en) * 1998-11-25 2000-11-07 Matsushita Electric Industrial Co., Ltd. Formant-based speech synthesizer employing demi-syllable concatenation with independent cross fade in the filter parameter and source domains
US20020128841A1 (en) * 2001-01-05 2002-09-12 Nicholas Kibre Prosody template matching for text-to-speech systems

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100259777B1 (en) * 1997-10-24 2000-06-15 정선종 Optimal synthesis unit selection method in text-to-speech system
CA2366952A1 (en) * 1999-03-15 2000-09-21 British Telecommunications Public Limited Company Speech synthesis
KR20010035173A (en) * 2001-01-10 2001-05-07 백종관 Personal Text-To-Speech Synthesizer Using Training Tool Kit for Synthesizing Voice and Method Thereof
CN1259631C (en) * 2002-07-25 2006-06-14 摩托罗拉公司 Chinese text to voice joint synthesis system and method using rhythm control

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP1668630A4 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11443732B2 (en) * 2019-02-15 2022-09-13 Lg Electronics Inc. Speech synthesizer using artificial intelligence, method of operating speech synthesizer and computer-readable recording medium
US11393447B2 (en) * 2019-06-18 2022-07-19 Lg Electronics Inc. Speech synthesizer using artificial intelligence, method of operating speech synthesizer and computer-readable recording medium
US11398219B2 (en) * 2019-09-16 2022-07-26 Lg Electronics Inc. Speech synthesizer using artificial intelligence and method of operating the same
CN111833842A (en) * 2020-06-30 2020-10-27 讯飞智元信息科技有限公司 Synthetic sound template discovery method, device and equipment
CN111833842B (en) * 2020-06-30 2023-11-03 讯飞智元信息科技有限公司 Synthetic tone template discovery method, device and equipment

Also Published As

Publication number Publication date
EP1668630A1 (en) 2006-06-14
CN1604077A (en) 2005-04-06
CN1604077B (en) 2012-08-08
KR100759729B1 (en) 2007-09-20
EP1668630A4 (en) 2008-04-23
KR20060056406A (en) 2006-05-24
EP1668630B1 (en) 2013-10-23

Similar Documents

Publication Publication Date Title
WO2005034085A1 (en) Identifying natural speech pauses in a text string
US6505158B1 (en) Synthesis-based pre-selection of suitable units for concatenative speech
KR100403293B1 (en) Speech synthesizing method, speech synthesis apparatus, and computer-readable medium recording speech synthesis program
US6463413B1 (en) Speech recognition training for small hardware devices
WO2005034082A1 (en) Method for synthesizing speech
WO1996023298A2 (en) System and method for generating and using context dependent sub-syllable models to recognize a tonal language
WO2004034377A2 (en) Apparatus, methods and programming for speech synthesis via bit manipulations of compressed data base
JP5172682B2 (en) Generating words and names using N-grams of phonemes
KR20150105075A (en) Apparatus and method for automatic interpretation
US20060229877A1 (en) Memory usage in a text-to-speech system
KR100848148B1 (en) Syllable unit speech recognition device, character input unit using syllable unit speech recognition device, method and recording medium
EP1668629B1 (en) Letter-to-sound conversion for synthesized pronunciation of a text segment
EP1668630B1 (en) Improvements to an utterance waveform corpus
US5897617A (en) Method and device for preparing and using diphones for multilingual text-to-speech generating
JPH0420998A (en) Voice synthesizing device
Kishore et al. Building Hindi and Telugu voices using festvox
US7676366B2 (en) Adaptation of symbols
KR0134707B1 (en) LSP Speech Synthesis Method Using Diphone Unit
JP2006098994A (en) A method for preparing a dictionary, a method for preparing training data for an acoustic model, and a computer program
CN111696530B (en) Target acoustic model obtaining method and device
Suontausta et al. Low memory decision tree method for text-to-phoneme mapping
Zitouni et al. Orientel: speech-based interactive communication applications for the mediterranean and the middle east.
Eady et al. Pitch assignment rules for speech synthesis by word concatenation
Gakuru Development of a kenyan english text to speech system: A method of developing a TTS for a previously undefined english dialect
Dobler et al. A Server for Area Code Information Based on Speech Recognition and Synthesis by Concept

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BW BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE EG ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NA NI NO NZ OM PG PH PL PT RO RU SC SD SE SG SK SL SY TJ TM TN TR TT TZ UA UG US UZ VC VN YU ZA ZM ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): GM KE LS MW MZ NA SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IT LU MC NL PL PT RO SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

121 EP: the EPO has been informed by WIPO that EP was designated in this application
WWE Wipo information: entry into national phase

Ref document number: 2004784432

Country of ref document: EP

WWE Wipo information: entry into national phase

Ref document number: 1020067006142

Country of ref document: KR

WWP Wipo information: published in national office

Ref document number: 1020067006142

Country of ref document: KR

WWP Wipo information: published in national office

Ref document number: 2004784432

Country of ref document: EP