JPH03216699A

JPH03216699A - Sound source data generating method of sound synthesizer

Info

Publication number: JPH03216699A
Application number: JP2012283A
Authority: JP
Inventors: Kiyoshi Ishida; 清石田; Yoshimasa Sawada; 沢田　喜正; Norio Suda; 典雄須田
Original assignee: Meidensha Corp; Meidensha Electric Manufacturing Co Ltd
Current assignee: Meidensha Corp; Meidensha Electric Manufacturing Co Ltd
Priority date: 1990-01-22
Filing date: 1990-01-22
Publication date: 1991-09-24
Anticipated expiration: 2015-05-08
Also published as: JP3038755B2

Abstract

PURPOSE: To reduce rough-sounding noise by reanalyzing the residual information with analysis windows set successively around the averaged peak positions, and generating sound source data by cutting out the reanalyzed residual information with reference to the peak positions. CONSTITUTION: Analysis windows centered on the averaged peak position sequence are set successively on the residual waveform to perform re-segmentation and reanalysis, and the resulting residual is cut out uniformly with reference to the previous peak position sequence to generate a sound source file. Consequently, the peak position of the sound source wave transitions smoothly from pitch to pitch, and the residual waveform has no phase variation in peak position. The residual waveforms of the pitch sections are then averaged along the time axis with their phases aligned, by taking the arithmetic mean or weighted mean of the residual waveform B being processed and the other residual waveforms A and C. Applied to the synthesis of female voices and the like, the method reduces rough-sounding noise in the synthesized speech and yields synthesized speech of good overall quality.

Description

[Detailed Description of the Invention]

A. Field of Industrial Application

The present invention relates to a speech synthesizer using a rule-based synthesis method, and more particularly to a method for generating sound source data.

B. Summary of the Invention

In a speech synthesizer that uses residual information obtained from a plurality of speech waveforms as sound source information, the present invention reduces rough-sounding noise in the synthesized speech by reducing the variation in peak position across the pitch sections of the residual information, and further by smoothing fluctuations in peak value and emphasizing the impulsiveness of sections with weak impulsiveness.

C. Prior Art

A speech synthesizer using the rule-based synthesis method divides an input character string into words and phrases by syntactic analysis, determines the intonation and accent of each, decomposes the words and phrases into syllables and further into phonemes, obtains the sound source wave and articulation filter parameters for each syllable or phoneme, and obtains synthesized speech as the response output of the articulation filter to the sound source wave.

Such speech synthesizers use either impulses and noise, or residual information, as the sound source information. In the residual-based method, the speech waveform is subjected to linear predictive analysis to obtain articulation parameters, the speech waveform is input to an articulation filter based on these parameters, the residual waveform is obtained at its output, and this residual waveform is sampled and encoded to form the sound source information. When a segment of the speech waveform is cut out, the original waveform is multiplied by a window function (Hamming window, Hanning window, etc.) so that no abrupt change occurs at either end of the cut-out section.
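The residual extraction described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `lpc_residual`, the default prediction order, and the use of the autocorrelation method are all assumptions made for the example.

```python
import numpy as np

def lpc_residual(frame, order=10):
    """Return (predictor coefficients, residual) for one windowed frame.

    Hypothetical helper: autocorrelation-method linear prediction.
    """
    # Taper the frame so both ends fall off smoothly (Hamming window).
    windowed = np.asarray(frame, dtype=float) * np.hamming(len(frame))
    n = len(windowed)
    # Autocorrelation values r[0..order].
    r = np.array([np.dot(windowed[: n - k], windowed[k:]) for k in range(order + 1)])
    # Normal equations R a = r[1:], with R a symmetric Toeplitz matrix.
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1:])
    # Prediction from past samples, then residual = signal - prediction.
    predicted = np.zeros(n)
    for k in range(1, order + 1):
        predicted[k:] += a[k - 1] * windowed[:-k]
    return a, windowed - predicted
```

For voiced speech, the residual returned here is expected to show one impulse-like peak per pitch period, which is the property the rest of the method relies on.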

D. Problems to Be Solved by the Invention

In the method that uses the residual as the sound source, when information compression is applied to the sampling and encoding of the residual waveform in order to compress the sound source information, the speech synthesized from this sound source information deviates markedly from the characteristics of human speech.

On the other hand, when the original speech waveforms used to obtain the sound source information include, in addition to consonant + vowel (CV) waveforms, vowel + consonant (VC) waveforms and common vowel (V) waveforms that are mixed in the transition sections, and the residual information is obtained from this mixed waveform and used as the sound source, the amplitude and shape of the sound source can vary greatly from one pitch section to the next (particularly when analyzing female speech). As a result, the synthesized speech often contains rough-sounding noise, and synthesized speech of good overall quality cannot be obtained.

For example, in the residual waveform of a female voice shown in FIG. 4, the times T0 to T3 from the per-pitch reference points t0 to t3 to the peak (impulse) positions vary, the peak values fluctuate greatly from pitch to pitch, and section 2, where the impulse appears strongly, is mixed with section 3, which is close to noise (weakly impulsive). Such variation in peak position, fluctuation in peak value, and mixing of weakly impulsive sections are considered the main causes of rough-sounding noise in the synthesized speech.

An object of the present invention is to provide a method for generating sound source data with reduced rough-sounding noise in a speech synthesizer that obtains residual information from a plurality of speech waveforms and uses it as the sound source.

E. Means for Solving the Problems

To achieve the above object, the present invention is characterized by calculating, for each pitch section of the residual information obtained by analyzing a speech waveform, the cross-correlation between a complete impulse train and the residual information; calculating the peak position of each pitch section from the position at which the correlation is maximum; reanalyzing the residual information with analysis windows taken successively around the averaged peak positions obtained by averaging this peak position sequence; and generating sound source data by cutting out the reanalyzed residual information with reference to the peak positions.

The present invention is further characterized in that the cut-out residual information is averaged in the time-axis direction, pitch section by pitch section, with the phases of the residual waveforms aligned.

F. Operation

To address the variation in the peak position of the residual waveform, the peak position of each pitch section is obtained by cross-correlating the residual waveform with a complete impulse in each pitch section. A residual waveform with reduced peak-position variation is then obtained by cutting out and reanalyzing with analysis windows centered on the averaged peak positions obtained by averaging these peak positions.

In addition, the residual waveform with reduced peak-position variation is averaged with the phases aligned across pitch sections, which smooths the pitch-to-pitch fluctuation of the peak value and emphasizes the impulsiveness of sections with weak impulsiveness.

G. Embodiment

FIG. 1 is a processing procedure diagram showing one embodiment of the method of the present invention. In step S1, as in conventional residual information generation, speech feature parameters are obtained from the mixed waveform of a plurality of speech waveforms, and the residual is extracted as the sound source information. Steps S2 to S7 apply waveform processing to the residual information so that its (impulse-like) peak does not vary greatly from frame to frame. To this end, first, for the residual waveform obtained in frame i (a in FIG. 2), a complete impulse train (b in FIG. 2) offset by a fixed time Δt from the reference points t0 to t4 of the pitch sections is prepared, and the cross-correlation between this complete impulse train and the residual waveform is calculated (step S2). From this calculation, the shift width that maximizes the correlation coefficient is taken as Xt, and Xt is obtained for each pitch section (step S3). The calculated value (Xt + Δt) is taken as the peak position measured from the reference point in frame i.
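Steps S2 and S3 can be sketched as below. This is an illustration under stated assumptions: the function name `peak_offsets` is hypothetical, each pitch section is correlated with a single unit impulse placed Δt samples after its reference point, and positions are expressed in samples.

```python
import numpy as np

def peak_offsets(residual, ref_points, delta_t):
    """Return, per pitch section, the peak offset (Xt + delta_t) in samples.

    Hypothetical helper: ref_points holds the section boundaries t0, t1, ...
    and delta_t must be smaller than every section length.
    """
    residual = np.asarray(residual, dtype=float)
    offsets = []
    for start, end in zip(ref_points[:-1], ref_points[1:]):
        section = residual[start:end]
        # Unit impulse placed delta_t samples after the reference point.
        impulse = np.zeros(len(section))
        impulse[delta_t] = 1.0
        # Cross-correlation over every relative shift of the impulse.
        corr = np.correlate(section, impulse, mode="full")
        lags = np.arange(-(len(section) - 1), len(section))
        x_t = lags[np.argmax(corr)]  # shift Xt with maximum correlation
        offsets.append(int(x_t + delta_t))
    return offsets
```

With a single unit impulse per section, the maximizing shift lands on the largest positive sample of that section, so the returned offset is the peak position relative to the section's reference point.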

The peak position sequence (Xt + Δt) obtained for each frame is averaged in the frame direction so that the peak position transitions smoothly (step S4). For a female speech waveform, for example, this averaging reduces the variation caused by peak-position extraction errors in sections where the impulsiveness of the residual is weak or where the variation is very large.
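One way to realize the averaging of step S4 is a short moving average over the peak position sequence. The source does not specify the averaging length or the edge handling, so the odd window `width` and the edge padding below are assumptions, as is the function name.

```python
import numpy as np

def smooth_peak_positions(offsets, width=3):
    """Moving average of peak offsets along the frame direction.

    Hypothetical helper: width is assumed odd; edge values are repeated
    so the output has the same length as the input.
    """
    values = np.asarray(offsets, dtype=float)
    padded = np.pad(values, width // 2, mode="edge")
    kernel = np.ones(width) / width
    return np.convolve(padded, kernel, mode="valid")
```

An isolated extraction error, such as one offset of 9 among offsets of 3, is pulled back toward its neighbors, which is the effect the averaging is meant to achieve.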

Note that an averaged peak position may not coincide with an actual peak of the residual waveform. It is, however, where the residual peak ought to appear (peaks should appear at the pitch interval), and treating this position as the peak position reduces the variation.

Next, analysis windows centered on the averaged peak positions are taken successively on the residual waveform to perform re-segmentation (step S5) and reanalysis (step S6). The residual obtained as a result is cut out uniformly with reference to the preceding peak position sequence to create a sound source file (step S7). As a result, the peak position of the sound source wave transitions smoothly from pitch to pitch, and the phase variation of the peak position in the residual waveform is eliminated.
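The re-segmentation of step S5 can be sketched as cutting fixed-length, windowed segments centered on the averaged peak positions. The window type (Hamming), the fixed half-length, zero padding at the signal ends, and the function name are assumptions for illustration; the source does not give these details.

```python
import numpy as np

def cut_centered_windows(signal, centers, half_len):
    """Cut Hamming-windowed segments of length 2*half_len + 1 around centers.

    Hypothetical helper: centers are absolute sample positions; fractional
    averaged positions are rounded to the nearest sample.
    """
    window = np.hamming(2 * half_len + 1)
    # Zero-pad so windows near either end of the signal stay in range.
    padded = np.pad(np.asarray(signal, dtype=float), half_len)
    segments = []
    for c in centers:
        c = int(round(c))
        # Original index c sits at padded index c + half_len.
        segments.append(padded[c : c + 2 * half_len + 1] * window)
    return np.array(segments)
```

Because every segment is centered on its peak position, the peak falls at the same index in each row, which is what allows the later phase-aligned averaging.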

Next, the residual waveforms of the pitch sections are averaged in the time-axis direction with their phases aligned (step S8). As shown in FIG. 3, this averaging takes the arithmetic mean or weighted mean of the residual waveform B being processed and the other residual waveforms A and C. In the resulting residual waveform B′, impulsiveness is improved even in pitch sections where it is weak overall, and the transition of impulse amplitude is also smoothed.
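The weighted averaging of step S8 can be sketched as follows, assuming the pitch-section residuals have already been cut to equal length with their peaks at the same index. The weights and the function name are illustrative assumptions; the source only states that an arithmetic or weighted mean of B with A and C is taken.

```python
import numpy as np

def average_aligned(segments, weights=(0.25, 0.5, 0.25)):
    """Replace each segment B by a weighted mean of its neighbors A, B, C.

    Hypothetical helper: segments is a 2-D array, one phase-aligned pitch
    section per row; the first and last rows reuse themselves as neighbor.
    """
    segs = np.asarray(segments, dtype=float)
    # Repeat the first and last rows so every row has two neighbors.
    padded = np.pad(segs, ((1, 1), (0, 0)), mode="edge")
    w_a, w_b, w_c = weights
    return w_a * padded[:-2] + w_b * padded[1:-1] + w_c * padded[2:]
```

A section with no clear impulse that sits between two strongly impulsive sections acquires a peak at the shared peak index, while peak amplitudes of neighboring sections are drawn closer together.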

H. Effects of the Invention

As described above, according to the present invention, sound source data is generated after reducing the variation in peak position across the pitch sections of the residual information, and further after smoothing fluctuations in peak value and emphasizing the impulsiveness of sections with weak impulsiveness. Applied to the synthesis of female voices and the like, the method therefore reduces rough-sounding noise in the synthesized speech and yields synthesized speech of good overall quality.

[Brief explanation of drawings]

FIG. 1 is a processing procedure diagram showing one embodiment of the method of the present invention, FIG. 2 is a diagram of a residual waveform and a complete impulse waveform, FIG. 3 is a waveform diagram of the averaging of residual waveforms, and FIG. 4 is a conventional residual waveform diagram.

Claims

[Claims]

(1) A method for generating sound source data for a speech synthesizer, characterized by calculating, for each pitch section of residual information obtained by analyzing a speech waveform, the cross-correlation between a complete impulse train and the residual information; calculating the peak position of each pitch section from the position at which the correlation is maximum; reanalyzing the residual information with analysis windows taken successively around the averaged peak positions obtained by averaging this peak position sequence; and generating sound source data by cutting out the reanalyzed residual information with reference to the peak positions.

(2) A sound source data generation method for a speech synthesizer, characterized in that the extracted residual information is averaged in the time axis direction by aligning the phases of the residual waveforms for each pitch section.