JPH1114672A

JPH1114672A - Method for estimating spectrum of periodic waveform and program recording medium therefor

Info

Publication number: JPH1114672A
Application number: JP16417997A
Authority: JP
Inventors: Kiyoaki Aikawa; 清明相川
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: NTT Inc
Priority date: 1997-06-20
Filing date: 1997-06-20
Publication date: 1999-01-22

Abstract

PROBLEM TO BE SOLVED: To estimate a spectrum with high accuracy without synchronizing the analysis with the pitch. SOLUTION: A minute section τ of about 3-5 ms is cut out of the input speech waveform (S3), multiplied by a Hamming window, and subjected to a DFT (discrete Fourier transform) (S5). The minute-section spectrum M = log(X + 1), where X is the DFT result, is computed (S6) and accumulated as A = A + M^e, with A initially 0 (S7). The section τ is shifted by 2-4 ms and cut out again (S8). This is repeated over a conventional analysis section T, and the spectrum S = (A/N)^(1/e) is obtained, where N − 1 is the number of shifts (S10). The procedure is carried out for each spectral component.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

BACKGROUND OF THE INVENTION 1. Field of the Invention: The present invention relates to a method for estimating, at fixed time intervals (analysis windows), the spectrum of a waveform having a periodic structure, such as a speech waveform, and to a program recording medium therefor.

[0002]

2. Description of the Related Art: In speech information processing, a spectral time series is first obtained from the speech waveform. In conventional short-time spectral analysis, the speech spectrum is computed from the entire speech waveform contained in the analysis window at once. The analysis window is typically about 20 ms to 40 ms wide. So that the spectrum does not fluctuate from frame to frame, the window width is normally set to at least twice the pitch period (the glottal opening/closing period, that is, the time from one opening to the next) and shorter than the duration of a phoneme (vowel or consonant). With this kind of analysis, a harmonic structure appears in the spectrum [see, for example, Sadaoki Furui, Digital Speech Processing, Tokai University Press, 1985]. The spectral shape is therefore easily affected by the pitch period. Spectrum estimation accuracy can be improved by cutting out one-pitch segments of the speech signal in synchronization with the pitch, analyzing them, and averaging over a given interval, but it is difficult to cut out pitch segments accurately.

[0003]

SUMMARY OF THE INVENTION: An object of the present invention is to provide a spectrum estimation method for periodic signals that, without synchronizing to the pitch, can estimate a correct spectrum unaffected by the pitch period, and a program recording medium therefor.

[0004]

MEANS FOR SOLVING THE PROBLEM: According to the present invention, a spectrum is analyzed with an analysis window shorter than the period T of the periodic signal, each analysis result is raised to a real power, and a plurality of these short-time spectra are integrated to estimate the spectrum for the period T. In the case of a speech waveform, short pieces of the waveform are cut out using a short window (the minute-section window width). The spectrum obtained from such a short piece of the speech waveform is called a minute-section spectrum. Here, "short" means a duration of about one pitch period of the speech waveform or less. Because a waveform this short has no periodic structure in time, its spectrum shows no harmonic structure. FIG. 1A shows how the speech is cut out: τ is the minute-section window width (time) used to obtain a minute-section spectrum, δ is the minute-section shift width (time), and T is the analysis window width (time) of the conventional analysis. To obtain the same effective window width as the conventional window, the following number of minute-section spectra should be integrated.

[0005]

$$N = \frac{T - \tau}{\delta} + 1 \tag{1}$$

Using the following L_p norm (p-th power mean distance) as the function that integrates the minute-section spectra allows various integration methods to be expressed in a unified way.

$$S(\omega, t) = \left\{ \frac{1}{N} \sum_{i=0}^{N-1} M(\omega, t + \delta i)^{e} \right\}^{1/e} \tag{2}$$

Here, M(ω, t) is the minute-section spectrum at time t and ω is the frequency. e (e ≠ 0) is the exponent; when e = 1, equation (2) is simply the arithmetic mean of the minute-section spectra. When e is −∞ the calculation yields the minimum, and when e is ∞ it yields the maximum. The result of equation (2) is called the integrated minute-section spectrum, and the spectral analysis method that uses equation (2) is called the minute-section spectrum method.
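As a rough illustration of equations (1) and (2), the following Python sketch counts the minute sections in a frame and integrates a few made-up values at a single frequency. The function names `num_sections` and `power_mean`, the sample values, and the choice to truncate a fractional N are assumptions made for this example and do not come from the patent text.

```python
def num_sections(T_ms, tau_ms, delta_ms):
    """Equation (1): N = (T - tau) / delta + 1.

    Assumption: a fractional result is truncated, i.e. only sections that
    fit entirely inside T are counted.
    """
    return int((T_ms - tau_ms) // delta_ms) + 1


def power_mean(values, e):
    """Equation (2) at one frequency: ((1/N) * sum(M_i ** e)) ** (1/e)."""
    if e == 0:
        raise ValueError("e must be nonzero")
    return (sum(v ** e for v in values) / len(values)) ** (1.0 / e)


m = [1.0, 2.0, 4.0]                 # made-up minute-section values
print(num_sections(30, 5, 2))       # 5 ms sections shifted by 2 ms within 30 ms
print(power_mean(m, 1))             # e = 1: arithmetic mean
print(power_mean(m, 50))            # large e: approaches max(m)
print(power_mean(m, -50))           # very negative e: approaches min(m)
```

Large positive or negative exponents stand in for the limiting cases e = ∞ and e = −∞ mentioned in the text.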

[0006]

Each minute-section spectrum M(ω, t) is obtained, for example, by FFT. The FFT order is a power of 2 that is at least twice the number of frequency analysis channels and is the first such value whose FFT window length on the waveform exceeds the window length τ of the minute-section spectrum. The waveform data of length τ, multiplied by a Hamming window, is placed left-justified in the FFT buffer, the remainder is set to 0, and the FFT is performed. With k denoting the channel and i the time, and with P(k, i) denoting the linear FFT spectrum of the minute section, the minute-section spectrum M(k, i) used for integration is given by the following equation.

[0007]

$$M(k, i) = \log\bigl(1 + P(k, i)\bigr) \tag{3}$$

This is close to a logarithmic spectrum, but its value is always positive. Each term must be positive when the L_p norm is computed. If K is the FFT order and f_s is the sampling frequency of the speech, the frequency corresponding to channel k is given by equation (4).

[0008]

$$\omega(k) = \frac{\pi f_s k}{2K} \tag{4}$$
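The FFT set-up described in paragraphs [0006] and [0007] can be sketched in Python as follows. This is only an illustration under stated assumptions: the helper names `fft_order`, `hamming`, and `minute_section_spectrum` are hypothetical, a plain DFT loop stands in for an FFT routine, P is taken to be the squared magnitude, and the 16 channels and 1500 Hz test tone are arbitrary example values.

```python
import cmath
import math


def fft_order(n_channels, tau_samples):
    """Smallest power of 2 that is >= 2 * n_channels and exceeds tau_samples."""
    k = 1
    while k < 2 * n_channels or k <= tau_samples:
        k *= 2
    return k


def hamming(n):
    """Standard Hamming window of length n (n >= 2)."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * j / (n - 1)) for j in range(n)]


def minute_section_spectrum(segment, order):
    """Window the segment, zero-pad it to `order`, and return M = log(1 + P)."""
    windowed = [x * w for x, w in zip(segment, hamming(len(segment)))]
    padded = windowed + [0.0] * (order - len(segment))   # left-justified data
    spectrum = []
    for k in range(order // 2):                           # one value per channel
        acc = sum(x * cmath.exp(-2j * math.pi * k * n / order)
                  for n, x in enumerate(padded))
        power = abs(acc) ** 2        # assumption: P is the squared magnitude
        spectrum.append(math.log(1.0 + power))            # equation (3)
    return spectrum


fs = 12000                               # example sampling frequency (Hz)
tau = 60                                 # 5 ms at 12 kHz
order = fft_order(16, tau)               # 16 channels is an assumed value
tone = [math.sin(2 * math.pi * 1500 * n / fs) for n in range(tau)]
M = minute_section_spectrum(tone, order)
print(order, len(M), M.index(max(M)))
```

Because log(1 + P) is never negative, every element of `M` can safely be raised to a positive real power in the later integration step.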

[0009]

DESCRIPTION OF THE PREFERRED EMBODIMENTS: FIG. 3 shows an embodiment of the method according to the invention. First, the time pointer i within a speech section of, for example, T = 30 ms is set to 0 (S1), and the contents A of the buffer that accumulates the spectrum are cleared, that is, initialized to A = 0 (S2). Next, the speech in a minute section (t to t + τ) of about τ = 5 ms is cut out (S3), and a window, for example a Hamming window, is applied to the cut-out speech signal (S4). A k-th order DFT (discrete Fourier transform) is applied to the windowed speech signal of section τ to obtain the power spectrum X (S5).

[0010]

From the DFT result X, the minute-section spectrum M is obtained by equation (3), that is, M = log(X + 1) (S6). The logarithm is the natural logarithm. The minute-section spectrum M thus obtained is raised to the power e and cumulatively added to the spectrum A held in the accumulation buffer (S7). That is, the following operation is performed:

$$A \leftarrow A + M^{e}$$

Next, the time pointer i is incremented by 1 and the section is moved by the minute-section shift width, for example δ = about 2 ms, so that the time becomes t + δ (S8). It is then determined whether the end of the speech data has been reached, that is, whether the time has advanced from t to t + T, the end of the T = 30 ms speech section (S9). If not, the process returns to step S3, the speech in the minute section shifted by δ (t + δ to t + δ + τ) is cut out, and the same processing is repeated.

[0011]

While shifting by δ in this way, the spectrum M of the speech in each minute section τ is obtained and cumulatively added to the buffer contents A. When the spectra of N minute sections, as given by equation (1), have been accumulated, that is, when the end of the speech section δi = T = 30 ms is reached, this is detected in step S9. The accumulated spectrum held in the buffer,

$$A = \sum_{i=0}^{N-1} M^{e},$$

is divided by the number N of accumulated sections, and the quotient is raised to the power 1/e; in other words, equation (2) is evaluated to obtain the integrated minute-section spectrum S (S10). The processing shown in FIG. 2 omits the repetition over frequencies, so A, X, M, and S in FIG. 2 are vectors with as many elements as there are frequency channels.
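Steps S1 to S10 can be put together in a short Python sketch for one analysis frame. It is a minimal illustration rather than the patented implementation: the function names, the direct DFT loop, the restriction to e > 0, the 12 kHz sampling rate, the order of 64, and the 1500 Hz test tone are all assumptions chosen for the example.

```python
import cmath
import math


def section_log_spectrum(segment, order):
    """S4-S6: Hamming window, DFT power spectrum X, then M = log(X + 1)."""
    n = len(segment)
    windowed = [x * (0.54 - 0.46 * math.cos(2 * math.pi * j / (n - 1)))
                for j, x in enumerate(segment)]
    out = []
    for k in range(order // 2):
        # Summing only over the windowed samples equals a zero-padded DFT.
        acc = sum(x * cmath.exp(-2j * math.pi * k * j / order)
                  for j, x in enumerate(windowed))
        out.append(math.log(abs(acc) ** 2 + 1.0))
    return out


def integrated_spectrum(frame, tau, delta, e, order):
    """S1-S10 for one frame; tau and delta are given in samples."""
    if e <= 0:
        raise ValueError("this sketch assumes e > 0")
    if tau < 2 or tau > len(frame) or tau > order:
        raise ValueError("tau must satisfy 2 <= tau <= min(len(frame), order)")
    acc = [0.0] * (order // 2)                 # S2: A = 0
    count = 0
    start = 0                                  # S1: time pointer
    while start + tau <= len(frame):           # S9: end of the frame reached?
        m = section_log_spectrum(frame[start:start + tau], order)  # S3-S6
        acc = [a + mk ** e for a, mk in zip(acc, m)]               # S7
        count += 1
        start += delta                         # S8: shift by delta
    return [(a / count) ** (1.0 / e) for a in acc]                 # S10


fs = 12000                                     # example sampling rate (Hz)
frame = [math.sin(2 * math.pi * 1500 * n / fs) for n in range(360)]  # T = 30 ms
S = integrated_spectrum(frame, tau=60, delta=24, e=2.0, order=64)
print(len(S), S.index(max(S)))
```

The accumulator is a list with one entry per frequency channel, mirroring the remark that A, X, M, and S are vectors.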

[0012]

The following shows the result of a simulation, for one frequency channel, of how temporal fluctuations in the minute-section spectrum values obtained in this way are integrated. Suppose the value at a certain frequency of the spectrum varies sinusoidally as

$$s(i) = 0.5 - 0.5\cos(4\pi i / N) + \varepsilon, \quad 0 < i < N \tag{5}$$

where ε is a small constant that keeps the values from diverging.

[0013]

The equation that integrates the minute-section spectra with the L_p norm is given by

$$v(e) = \left\{ \sum_{i=0}^{N-1} s(i)^{e} \right\}^{1/e} \tag{6}$$

The aim here is to integrate minute-section spectra, which is basically not a problem of finding the minimum, so 0 < e is assumed. When low-level noise is superimposed, or when the excitation source is unstable, it is appropriate to emphasize the high-energy portions; in such cases the minute-section spectra should be integrated with 1 < e. To emphasize the low-energy portions in order to remove sudden noise, 0 < e < 1 should be used.
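The effect of the exponent can be seen on a toy sequence in which one minute section is hit by a sudden loud event. The sketch below uses the normalized form of equation (2); the values and the name `power_mean` are made up for illustration.

```python
def power_mean(values, e):
    """Normalized power mean, as in equation (2), at one frequency."""
    return (sum(v ** e for v in values) / len(values)) ** (1.0 / e)


steady = [1.0, 1.0, 1.0, 1.0, 1.0]
burst = [1.0, 1.0, 10.0, 1.0, 1.0]   # one section contains a sudden loud event

for e in (0.5, 1.0, 4.0):
    print(e, power_mean(steady, e), power_mean(burst, e))
```

With e below 1 the burst pulls the integrated value up less than the arithmetic mean does, while with e above 1 the high-energy section dominates, which matches the two use cases described in the text.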

[0014]

Suppose the spectral fluctuation of equation (5) is buried in temporally constant noise ν, so that

$$r(i) = \max[s(i), \nu] \tag{7}$$

That is, as shown in FIG. 1B, with time on the horizontal axis and level on the vertical axis, the signal s(i) varies as curve 11; when noise at various levels ν₁, ν₂, ν₃, ..., each parallel to the horizontal axis, is superimposed, r(i) is the larger of the signal s(i) and the noise ν. FIG. 1C shows the result of evaluating equation (6) for various values of e on the signal r(i), whose portions below the noise level are buried by the noise ν. FIG. 2 shows that when e is about 4 or more, the integrated spectrum level stays almost constant and is hardly affected by the noise, even when the signal is buried in noise of about 1/2 of the spectral maximum. This indicates that the minute-section spectrum method of the invention is robust against noise.
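The simulation of equations (5) to (7) is easy to reproduce in outline. The Python sketch below measures how much v(e) rises when a constant noise floor ν is applied; N = 100, ε = 1e-6, and the function names are assumptions for this example, and the printed numbers are not the values plotted in the patent figures.

```python
import math


def s(i, n, eps=1e-6):
    """Equation (5): sinusoidal fluctuation plus a small constant."""
    return 0.5 - 0.5 * math.cos(4.0 * math.pi * i / n) + eps


def v(values, e):
    """Equation (6): (sum of x ** e) ** (1/e)."""
    return sum(x ** e for x in values) ** (1.0 / e)


def relative_rise(e, nu, n=100):
    """Relative growth of v(e) when the noise floor nu of equation (7) is applied."""
    clean = [s(i, n) for i in range(n)]
    noisy = [max(x, nu) for x in clean]        # r(i) = max[s(i), nu]
    return v(noisy, e) / v(clean, e) - 1.0


for e in (1.0, 2.0, 4.0, 8.0):
    print(e, round(relative_rise(e, nu=0.5), 4))
```

In this sketch the rise for e = 1 is large, while for e of 4 or more it is only a few percent, which is in line with the robustness described above.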

[0015]

In the above, the time series of minute-section spectra is integrated separately at each frequency, but a weighted sum of the spectra that depends on the power of each minute section may be used instead. Let u(t) be the power of a minute section and Q(ω, t) the power-normalized spectrum of that section. They are related as follows.

$$u(t) = \frac{1}{2\pi} \int_{-\pi}^{\pi} M(\omega, t)\, d\omega \tag{8}$$

$$Q(\omega, t) = \frac{M(\omega, t)}{u(t)} \tag{9}$$

Accordingly, when a power-dependent weighted sum of the spectra is used, the integration formula for the minute-section spectra becomes the following.

[0016]

$$S(\omega, t) = \frac{\sum_{i=0}^{N-1} u(t + \delta i)^{e}\, Q(\omega, t + \delta i)}{\sum_{i=0}^{N-1} u(t + \delta i)^{e}} \tag{10}$$

In this case, in FIG. 2, the minute-section spectrum M obtained in step S6 is separated into u(t) and Q(ω, t) according to the relationship of equation (9), and in step S7, A = A + u^e Q is computed on the basis of equation (10). That is, A = A + u^e is computed for each spectrum. Q(ω, t + δi) may be obtained not only by FFT but also by LPC analysis.
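A minimal Python sketch of the power-weighted integration of equations (8) to (10) follows. It assumes that the integral in equation (8) can be replaced by the mean of M over the frequency channels, and that two accumulators are kept, one for the numerator and one for the denominator of equation (10); the function name and the sample spectra are hypothetical.

```python
def weighted_integration(sections, e):
    """Power-weighted integration of minute-section spectra (equation (10)).

    sections: list of minute-section spectra M, each a list of equal length.
    """
    numerator = [0.0] * len(sections[0])
    denominator = 0.0
    for m in sections:
        u = sum(m) / len(m)        # discrete stand-in for equation (8)
        if u <= 0:
            raise ValueError("section power must be positive")
        q = [mk / u for mk in m]   # equation (9): Q = M / u
        w = u ** e                 # weight of this minute section
        numerator = [a + w * qk for a, qk in zip(numerator, q)]  # A = A + u^e Q
        denominator += w           # running sum of u^e
    return [a / denominator for a in numerator]


# Two sections with the same shape but different power: the shape is preserved.
print(weighted_integration([[1.0, 3.0], [2.0, 6.0]], e=2.0))
# Two sections with equal power but different shapes: the shapes are averaged.
print(weighted_integration([[2.0, 2.0], [1.0, 3.0]], e=1.0))
```

Because each Q is normalized by its own section power, the exponent e only controls how strongly high-power sections dominate the average.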

[0017]

FIG. 4A shows an example of speech recognition using minute-section spectra. In this example the speech recognition section uses HMMs (hidden Markov models) [Seiichi Nakagawa, Speech Recognition by Probabilistic Models, IEICE, 1988]. Input speech from the microphone 21 passes through a low-pass filter 22 whose pass band is 1/2 of the sampling frequency (for example, 12 kHz), and is then converted from an analog signal to a digital signal at the sampling frequency by the A/D converter 23. This digital speech signal is converted to a spectral time series by the minute-section spectrum estimating section 24 according to the invention, using minute-section spectra. The spectral time series of training speech is input to the HMM training section 25, where HMMs are created and stored in the HMM storage section 26. The spectral time series of speech to be recognized is input to the HMM recognition section 27, which performs recognition by referring to the HMMs in the HMM storage section 26 and the list of recognition vocabulary in the vocabulary information storage section 28; the result is shown on the display section 29. HMM training and recognition use the standard methods described in the reference above.

[0018]

The minute-section spectrum estimating section 24 generates the spectral time series from minute-section spectra as shown in FIG. 4B. First, the time pointer is set to 0 (S1), and a speech signal of T = 30 ms starting at the time pointer is cut out of the continuous speech waveform (S2). A spectrum is extracted from this 30 ms speech signal by the minute-section spectrum method (S3). Next, the time pointer is advanced by 10 ms (S4). This value is the so-called frame rate, or frame period, of the data sent to the speech recognition section 27. It is then determined whether the time pointer has reached the end of the speech signal (S5); if not, the process returns to step S2, and if so, the process ends. The minute-section spectrum extraction in step S3 is performed by the processing shown in FIG. 2.
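The frame loop of FIG. 4B (steps S1 to S5) can be sketched as follows. The function `spectrum_time_series` and its `analyze` callback are hypothetical names; `analyze` stands for whatever per-frame routine implements the minute-section spectrum method, and the loop stops at the last frame that fits entirely in the signal, which is an assumption about the end condition.

```python
def spectrum_time_series(signal, fs, analyze, frame_ms=30.0, hop_ms=10.0):
    """Slide a T = frame_ms window in hop_ms steps and analyze each frame."""
    frame_len = int(round(fs * frame_ms / 1000.0))
    hop = int(round(fs * hop_ms / 1000.0))
    series = []
    pos = 0                                             # S1: time pointer = 0
    while pos + frame_len <= len(signal):               # S5: end of signal?
        frame = signal[pos:pos + frame_len]             # S2: cut out T
        series.append(analyze(frame))                   # S3: per-frame spectrum
        pos += hop                                      # S4: advance by 10 ms
    return series


# 100 ms of dummy samples at 12 kHz; `len` stands in for the real analysis.
frames = spectrum_time_series(list(range(1200)), 12000, len)
print(len(frames), frames[0])
```

With a 30 ms frame and a 10 ms frame period, consecutive frames overlap by 20 ms.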

[0019]

The invention can be applied to spectrum estimation not only for speech waveforms but also for other periodic waveforms.

[0020]

EFFECTS OF THE INVENTION: As described above, according to the invention the spectrum of a speech waveform or the like is estimated by integrating spectra obtained from minute sections, so that by choosing the value of e, specifically e > 1, the high-energy minute-section spectra can be integrated selectively. High-energy spectra occur in synchronization with the pitch period, so spectrum estimation with accuracy close to that of pitch-synchronous spectral analysis becomes possible. As a result, the spectrum is less affected by the pitch period and by where the speech analysis section is cut out, and recognition performance improves when the method is applied to speech recognition.

[0021]

Choosing e such that 0 < e < 1 emphasizes the low-energy portions and makes it possible to remove sudden noise. Obtaining the spectral sequence for speech recognition with the invention was compared with obtaining it by a conventional FFT using a window of about twice the pitch period or longer. In recognizing phonemes with different speaking styles, the phoneme recognition rate improved from the conventional 64% to 71%. The minute-section spectrum method is effective for any e from 0.5 to 2.

[0022]

With a frame window length T of 15 ms, 20 ms, 30 ms, or 40 ms, the minute-section spectrum method gives a higher recognition rate than the conventional FFT spectrum computed with a long data window. The improvement is large when recognizing speech with a different speaking style, and it is greater for vowels. A minute-section frame shift δ of 2 ms to 4 ms and a minute-section frame window length τ of about 3 ms to 5 ms work well, that is, about 1/5 to 1/14 of the window length T. In particular, about 5 ms is good for speech of the same speaking style, and about 3 ms for speech of a different speaking style.

[Brief description of the drawings]

FIG. 1: A is a diagram showing an example of the relationship between the conventional spectral analysis window T and the minute-section window width τ and minute-section shift width δ of the invention; B is a diagram showing the output waveform of a frequency channel buried in noise.

FIG. 2 is a diagram showing how the integrated minute-section spectrum depends on changes in the noise level taken as a parameter.

FIG. 3 is a flowchart showing an example of the spectrum estimation method according to the invention.

FIG. 4: A is a diagram showing the functional configuration of a speech recognition apparatus to which the minute-section spectrum estimation method of the invention is applied; B is a flowchart showing the procedure for generating the minute-section spectrum time series.

Claims

[Claims]

1. A method for estimating the spectrum of a periodic waveform for each fixed time section T, characterized in that: minute sections τ shorter than the fixed time section T are cut out of the periodic waveform in the fixed time section T while being sequentially shifted by a minute-section shift width δ shorter than the minute section τ; a spectrum M of the waveform of each cut-out minute section is obtained; and each of these minute-section spectra M is raised to the power of a real number e, averaged, and further raised to the power of 1/e to obtain the spectrum of the fixed time section T.

2. The method for estimating the spectrum of a periodic waveform according to claim 1, wherein the spectrum M of the minute section is obtained by applying a discrete Fourier transform to the waveform of the minute section and computing M = log(X + 1) (log being the natural logarithm) from the resulting power spectrum X.

3. The method for estimating the spectrum of a periodic waveform according to claim 1 or 2, wherein the minute-section spectrum M is separated into the minute-section power u and a spectrum Q normalized by the power u, the raising to the power of the real number e is applied to the minute-section power u, and a weighted average using the resulting u^e is obtained.

4. A recording medium on which is recorded a program for causing a computer to execute, when estimating the spectrum of a speech waveform for each fixed time section T: cutting out, from the speech waveform in the fixed time section T, minute sections τ shorter than the fixed time section T while sequentially shifting by a minute-section shift width δ shorter than the minute section τ; obtaining the spectrum M of the speech waveform of each cut-out minute section; and raising each of these minute-section spectra M to the power of a real number e, averaging them, and further raising the average to the power of 1/e to obtain the spectrum of the fixed time section T.

5. A recording medium on which is recorded a program for causing a computer to execute, when estimating the spectrum of a speech waveform for each fixed time section T: cutting out, from the speech waveform in the fixed time section T, minute sections τ shorter than the fixed time section T while sequentially shifting by a minute-section shift width δ shorter than the minute section τ; obtaining the spectrum M of the speech waveform of each cut-out minute section as the product u·Q of the minute-section power u and the spectrum Q normalized by the minute-section power; obtaining, for each of these minute-section spectra M, u^e Q in which the minute-section power u is raised to the power of a real number e; and averaging these to obtain the spectrum of the fixed time section T.