JPH04277B2

Info

Publication number: JPH04277B2
Application number: JP58079617A
Authority: JP
Inventors: Yamato Sato
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: NTT Inc
Priority date: 1983-05-07
Filing date: 1983-05-07
Publication date: 1992-01-06
Also published as: JPS59204097A

Description

DETAILED DESCRIPTION OF THE INVENTION This invention relates to a speech synthesis method that synthesizes arbitrary continuous speech from speech elements smaller than words.

<Prior art> Two kinds of methods have been known for synthesizing the speech of arbitrary words and sentences: synthesis from phonemes, the most basic elements of speech (elements corresponding to phonetic symbols such as a, i, p, k, s), and synthesis from composite elements of phonemes, such as the dyad (a pair of two consecutive phonemes: vowel-consonant, vowel-vowel, consonant-consonant) and the VCV (vowel-consonant-vowel). Synthesis from phonemes must realize by "rules" the transitions from one phoneme to the next and the deformations that accompany coarticulation. Not all of this knowledge has yet been established, so the method does not always yield synthesized speech of good quality. Synthesis from composite elements of phonemes is comparatively practical, and some systems based on it are in actual use. In conventional methods that use VCV or CV elements, the synthesis elements are joined mainly in the vowel portion. The vowel portion is the most audible part of speech, and joining there causes concatenation distortion of the spectral parameters and changes in timbre, which degrades the quality of the synthesized speech.

<Purpose of the invention> To improve the quality of synthesized speech, this invention provides a speech synthesis method based on new elements that eliminate the drawbacks of the conventional composite elements.

<Principle of the invention> As already noted, joining synthesis elements in the steady-state portion of a vowel can degrade the quality of the synthesized speech. Synthesis elements of the CVC type (C: consonant, V: vowel) are therefore introduced. In such an element the vowel portion has already been influenced by the preceding and following consonants. The element thus naturally contains the effects of coarticulation and can reproduce smooth mouth movement. In addition, vowel durations must be set appropriately to give synthesized speech a sense of rhythm, and in a CVC element a realistic vowel length is already realized through the influence of the consonants on both sides.

It is also convenient to include within a single element the sequences that are pronounced as one syllable: (vowel followed by the moraic nasal) and (vowel followed by the vowel i). These phoneme sequences are not only very tightly bound but also occur frequently in Japanese.

As described above, this invention sets CVC and CVMC (where M is the moraic nasal N or the vowel i) as the basic forms of the speech synthesis elements. However, given the variety of Japanese phonemes, the number of synthesis elements of these basic forms exceeds 6,000, which is more than is practical. In practice, some of these elements will be phoneme sequences that hardly ever occur in Japanese, and others, even if they do occur, will do so very rarely. In this invention, therefore, on the basis of the frequency of occurrence of Japanese phoneme sequences, frequently used sequences are synthesized with CVC or CVMC elements, and infrequently used ones are synthesized from elements such as CV, VC, VV, VN and NC. This greatly reduces both the memory needed to store the synthesis elements and the work of preparing them. Fig. 1 shows, as an example, the cumulative percentage of CVC and CVMC sequences obtained from an analysis of Japanese dictionary headwords. It shows that the roughly 1,000 most frequent CVC and CVMC phoneme sequences cover 90% of Japanese words. Synthesis is therefore performed using about 1,000 such long-unit elements together with short-unit elements such as CV, VC, VV, VN and CN.
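The frequency-based selection of long-unit elements described above can be sketched as follows. This is only an illustrative sketch: the function name `select_long_units`, the toy counts, and the simplification of measuring coverage as a share of unit occurrences (rather than of words, as in Fig. 1) are assumptions, not part of the patent.

```python
from collections import Counter

def select_long_units(unit_counts, coverage=0.90):
    """Pick the most frequent units until their cumulative share reaches `coverage`.

    unit_counts: mapping from a CVC/CVMC unit name to its occurrence count.
    Returns (selected unit names, cumulative share actually reached).
    """
    total = sum(unit_counts.values())
    selected, running = [], 0
    for unit, count in Counter(unit_counts).most_common():
        if running >= coverage * total:
            break  # target coverage already reached
        selected.append(unit)
        running += count
    return selected, running / total

# Toy counts, invented for illustration (not data from Fig. 1).
counts = {"sak": 50, "kur": 30, "des": 15, "reid": 4, "raw": 1}
inventory, reached = select_long_units(counts, coverage=0.90)
print(inventory, reached)
```

Units that fall outside the selected inventory would then be handled with short-unit elements, as the text describes.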

<Embodiment> Fig. 2 shows an embodiment of this invention. When the character sequence of a word or sentence to be synthesized is input at input terminal 1, the synthesis element name sequence creation unit 2 first divides the input character sequence into a sequence of long-unit elements of the CVC or CVMC type and sends the name of each long-unit element to the synthesis element name collation unit 3. The collation unit 3 reads the names of the speech synthesis elements from the long-unit synthesis element name storage memory 4, compares them with the long-unit element name transferred from the creation unit 2, determines whether that name exists in memory 4, and returns the result to the creation unit 2. If the long-unit element is not found in memory 4, the creation unit 2 divides the missing element into a sequence of short-unit elements: a CVC into CV and VC, and a CVMC into CV, VM and MC. For example, sakurawa kireidesu ("the cherry blossoms are beautiful") becomes

sak+kur+raw+wa kir+re+ei+id+des+su

where re+ei+id (underlined in the original) is an example of division into three short-unit elements because no long-unit speech synthesis element [reid] was available. At the ends of words, short-unit elements such as [wa] and [su] are used.
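The division into long-unit elements with fallback to short-unit elements can be sketched as below. This is a minimal sketch under stated assumptions: each phoneme is a single character, `N` stands for the moraic nasal, and the names `segment`, `split_short` and the toy inventory are hypothetical. Patterns the text does not discuss (for example a word-initial vowel) are simply emitted as two-phoneme units.

```python
VOWELS = set("aiueo")

def is_consonant(ph):
    return ph not in VOWELS and ph != "N"

def split_short(unit):
    """CVC -> CV, VC; CVMC -> CV, VM, MC (adjacent units share one phoneme)."""
    return [unit[i:i + 2] for i in range(len(unit) - 1)]

def segment(word, long_inventory):
    """Split a phoneme string into CVC/CVMC units, falling back to short units."""
    units, p = [], 0
    while p < len(word) - 1:
        rest = word[p:]
        if (len(rest) >= 4 and is_consonant(rest[0]) and rest[1] in VOWELS
                and rest[2] in ("N", "i") and is_consonant(rest[3])):
            cand = rest[:4]          # CVMC candidate
        elif (len(rest) >= 3 and is_consonant(rest[0]) and rest[1] in VOWELS
                and is_consonant(rest[2])):
            cand = rest[:3]          # CVC candidate
        else:
            cand = rest[:2]          # word end or unhandled pattern
        if len(cand) == 2 or cand in long_inventory:
            units.append(cand)
        else:
            units.extend(split_short(cand))  # long unit not stored
        p += len(cand) - 1           # last phoneme starts the next unit
    return units

inventory = {"sak", "kur", "raw", "kir", "des"}  # toy inventory, [reid] absent
print(segment("sakurawa", inventory))
print(segment("kireidesu", inventory))
```

With this toy inventory the two words reproduce the unit sequence given in the example above.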

Once the synthesis element name sequence has been obtained by the creation unit 2 in this way, it is sent to the speech synthesis control unit 5. The speech synthesis spectral parameters of the various CVC and CVMC elements are stored in the long-unit synthesis element parameter storage memory 6, and those of all CV, VC and other short-unit elements are stored in the short-unit synthesis element parameter storage memory 7. The control unit 5 controls the parameter readout unit 8 to read the parameters for each speech synthesis element in the input name sequence from memory 6 or memory 7, and sends them to the parameter smoothing unit 9.
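The readout step can be pictured as a lookup in two tables. The sketch below is hypothetical: it models memories 6 and 7 as Python dictionaries, uses made-up scalar frame values, and the name `read_parameters` is an assumption for illustration.

```python
def read_parameters(unit_names, long_params, short_params):
    """Fetch the parameter frames of each unit from the long or short store."""
    result = []
    for name in unit_names:
        # Long-unit store (memory 6) first, otherwise short-unit store (memory 7).
        store = long_params if name in long_params else short_params
        result.append((name, store[name]))
    return result

# Toy stores with invented frame values (one number per frame for brevity).
long_params = {"kir": [0.1, 0.2, 0.3], "des": [0.5, 0.4]}
short_params = {"re": [0.3, 0.3], "ei": [0.3, 0.6], "id": [0.6, 0.5], "su": [0.4]}
frames = read_parameters(["kir", "re", "ei", "id", "des", "su"],
                         long_params, short_params)
```

In a real system each frame would hold a vector of spectral parameters plus amplitude rather than a single number.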

The parameter smoothing unit 9 smooths the parameters at and around the junctions between speech synthesis elements by applying the following weighted averaging to the spectral parameters (PARCOR coefficients, LSP parameters, formants, etc.) and to the amplitude.

$$\bar{P}_f = \sum_{i=f-l}^{f+l} W_{i-f}\, P_i, \qquad \text{where } \sum_{j=-l}^{l} W_j = 1$$

P_f: parameter at frame f; W_i: weighting coefficient; f: frame number; 2l+1: length in frames of the window over which the weighted averaging is performed.

This processing is carried out over several frames before and after each junction between speech synthesis elements.
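The weighted averaging around a junction can be sketched as follows, assuming one scalar parameter per frame and a window of 2l+1 weights that sum to 1. The function names and the `span` argument (how many frames on each side of the junction are smoothed) are illustrative assumptions. The text only says "several frames".

```python
def smooth_frame(params, weights, f):
    """Weighted average over frames f-l..f+l; weights[j + l] holds W_j."""
    l = len(weights) // 2
    return sum(weights[i - f + l] * params[i] for i in range(f - l, f + l + 1))

def smooth_junction(params, weights, junction, span):
    """Smooth frames junction-span..junction+span, leaving other frames as is."""
    l = len(weights) // 2
    out = list(params)
    start = max(l, junction - span)
    stop = min(len(params) - l, junction + span + 1)
    for f in range(start, stop):
        out[f] = smooth_frame(params, weights, f)  # reads unsmoothed input
    return out

# A step at frame 4 stands in for a parameter "jump" at a junction.
track = [0, 0, 0, 0, 1, 1, 1, 1]
smoothed = smooth_junction(track, [0.25, 0.5, 0.25], junction=4, span=1)
print(smoothed)  # [0, 0, 0, 0.25, 0.75, 1.0, 1, 1]
```

Each output frame is computed from the original, unsmoothed values, so the result does not depend on the order in which frames are processed.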

Fig. 3 shows an embodiment of the parameter smoothing unit 9. The parameters of a speech synthesis element read from the synthesis element parameter storage memories 6 and 7 are first stored in the input buffer memory 21. The input buffer memory 21 is double-buffered, so the parameters of the following speech synthesis element are held at the same time. The arithmetic control unit 22 controls the smoothing computation. In sections of a synthesis element where no smoothing is needed, it transfers the parameters frame by frame to the output buffer register 24 through the readout control unit 23. In the smoothing section near a junction between speech synthesis elements, the parameters within the weighted-averaging window are read sequentially from the input buffer memory 21, multiplied in the multiplier 25 by the weighting coefficients stored in memory 26, and accumulated by the adder 27 and the accumulation register 28. The resulting smoothed parameter value for the frame at the window center is sent to the output buffer register 24. When the parameters of all frames of one synthesis element have been transferred to the output buffer register 24, the parameters of the next speech synthesis element are written into the input buffer memory 21 and the same operation is repeated. The envelope of the weighting coefficients within the window is one that peaks at the center, such as a triangular or Gaussian shape.
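As a rough illustration of the weight table and the multiply-accumulate step, the sketch below builds normalized triangular and Gaussian windows and forms one smoothed value. The function names and the `sigma` parameter are assumptions. The text specifies only that the envelope peaks at the center, and the normalization to a sum of 1 follows the condition on the weighting coefficients.

```python
import math

def triangular_weights(l):
    """2l+1 weights with a triangular envelope, normalized to sum to 1."""
    raw = [l + 1 - abs(j) for j in range(-l, l + 1)]
    total = sum(raw)
    return [w / total for w in raw]

def gaussian_weights(l, sigma):
    """2l+1 weights with a Gaussian envelope, normalized to sum to 1."""
    raw = [math.exp(-0.5 * (j / sigma) ** 2) for j in range(-l, l + 1)]
    total = sum(raw)
    return [w / total for w in raw]

def multiply_accumulate(window, weights):
    """Product-sum of one window, mirroring multiplier, adder and accumulator."""
    acc = 0.0
    for p, w in zip(window, weights):
        acc += p * w
    return acc

tri = triangular_weights(1)                      # [0.25, 0.5, 0.25]
center_value = multiply_accumulate([0, 1, 1], tri)
```

A wider window or a larger `sigma` smooths more strongly, at the cost of blurring genuine transitions near the junction.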

Fig. 4A shows an example of a time series of LSP parameters before this processing, and Fig. 4B shows the same time series after the averaging. As a result of the parameter averaging, for example, the junction 31 between "za" (ザ) and "gu" (グ) and the junction 32 between "ru" (ル) and "ma" (マ) become junctions 33 and 34, respectively. This gives the synthesized speech smoothness and avoids the intrusion of extraneous sounds caused by abrupt changes in parameter values at the junctions.

Returning to Fig. 2, once the parameter smoothing unit 9 has produced the parameter time series of the speech synthesis elements, the speech synthesis control unit 5 supplies the excitation source signal from the sound source generation unit 10 to the speech synthesis digital filter 11. The filter characteristics of the digital filter 11 are controlled by the parameter time series from the parameter smoothing unit 9, and the final synthesized speech signal is output at the synthesized speech output terminal 12. The sound source generation unit 10 and the speech synthesis digital filter 11 together constitute the speech synthesizer.

<Effects> As explained above, because this invention uses CVC and CVMC as the basic elements of speech synthesis, linear joining of parameters in the highly audible vowel portion is avoided, the effects of coarticulation are naturally included, and smooth mouth movement is reproduced, so that synthesized speech of better quality than before is obtained. In addition, the combined use of CV, VC, VV and VN type synthesis elements reduces the total number of speech synthesis elements, so the apparatus can be realized economically. Furthermore, when parameter smoothing is applied where speech synthesis elements are joined, smooth synthesized speech is obtained and the intrusion of extraneous sounds caused by "jumps" in the parameters at the junctions can be prevented.

[Brief description of the drawings]

Fig. 1 is a diagram showing the cumulative percentage of occurrence of CVC and CVMC sequences in Japanese words. Fig. 2 is a block diagram showing an embodiment of this invention. Fig. 3 is a block diagram showing a specific example of the parameter smoothing unit 9 in Fig. 2. Fig. 4 is a diagram showing the difference between parameters with and without averaging.

1: character sequence input terminal, 2: synthesis element name sequence creation unit, 3: synthesis element name collation unit, 4: long-unit synthesis element name storage memory, 5: speech synthesis control unit, 6: long-unit synthesis element parameter storage memory, 7: short-unit synthesis element parameter storage memory, 8: parameter readout unit, 9: parameter smoothing unit, 10: sound source generation unit, 11: speech synthesis digital filter, 12: synthesized speech output terminal, 21: input buffer memory, 22: arithmetic control unit, 23: readout control unit, 24: output buffer register, 25: multiplier, 26: memory, 27: adder, 28: accumulation register.

Claims

1. A speech synthesis method comprising: a first memory that stores the names of speech synthesis elements, each having the form of a vowel, a vowel sequence, or a vowel and moraic nasal sequence sandwiched between consonants on both sides; a second memory that stores the spectral parameters of each of those speech synthesis elements; a third memory that stores the spectral parameters of speech synthesis elements in the form of two-phoneme sequences; and a speech synthesizer; wherein, if a speech synthesis element is stored in the first memory, the spectral parameters of that element are read from the second memory, and if it is not stored in the first memory, the corresponding spectral parameters are read from the third memory; and means for joining the spectral parameters thus obtained to one another to create a continuous spectral parameter time series, which is supplied to the speech synthesizer to synthesize continuous speech.