JPH0448400B2

Info

Publication number: JPH0448400B2
Application number: JP22487885A
Authority: JP
Inventors: Yukio Tabei; Makoto Morito; Kozo Yamada
Original assignee: Oki Electric Industry Co Ltd
Current assignee: Oki Electric Industry Co Ltd
Priority date: 1985-10-11
Filing date: 1985-10-11
Publication date: 1992-08-06
Also published as: JPS6286399A

Description

DETAILED DESCRIPTION OF THE INVENTION (Field of Industrial Application) The present invention relates to a speech recognition method capable of highly accurate recognition even in high-noise conditions.

(Prior Art) A conventional speech recognition method of this type is described in the Transactions of the Institute of Electronics and Communication Engineers, J68-1 (January 1985), pp. 78-85. FIG. 2 is a flowchart of this conventional speech recognition method using local peaks. The input speech is frequency-analyzed every 10 msec by a bank of 15-channel bandpass filters (see 1 in FIG. 2). To normalize individual differences in glottal source characteristics, the speech spectrum is expressed logarithmically on both the amplitude and frequency axes, a least-squares approximation line is obtained (see 2 in FIG. 2), and the spectrum is corrected by taking its difference from that line. If the slope of the least-squares approximation line is positive, however, the difference from the mean value is taken instead. Then, as shown in FIG. 3, in each frame (10 msec) and for each portion at 0 dB or above, the channel holding the maximum value among those with an amplitude of at least 1/2 of the respective maximum is regarded as having a local peak and set to "1", and all other channels are set to "0", which binarizes the spectrum (see 3 in FIG. 2). The bandpass filter has 15 channels, but a 16th channel is added to carry the sign of the slope: it is set to "1" when the slope of the least-squares approximation line is negative, in which case the frame is regarded as voiced, and to "0" when the slope is positive, in which case the frame is regarded as unvoiced (see 4 in FIG. 2).

The weighted-average dictionary is obtained as a multi-valued pattern by linearly stretching a plurality of binarized patterns along the time axis to the length of the longest one and adding them together (see 5 in FIG. 2).

To match the binary input pattern against the multi-valued weighted-average dictionary, the shorter pattern is linearly stretched along the time axis to the length of the longer one, a similarity measure is computed, and the category name of the standard pattern giving the maximum similarity is taken as the recognition result (see 6 in FIG. 2).

(Problems to be Solved by the Invention) The conventional speech recognition method described above works effectively in environments with a good signal-to-noise ratio, such as when a close-talking microphone is used. In high-noise environments, however, it tends to pick up peaks caused by noise fluctuations, which increases recognition errors.

An object of the present invention is to provide a speech recognition method that is robust to noise and has high recognition accuracy. Even when peaks due to noise fluctuations are present, the method avoids picking them up by using a local peak vector calculation process that takes into account how such peaks differ in character from the local peaks of speech.

(Means for Solving the Problems) In the speech recognition method according to the present invention, the input speech is first frequency-analyzed in each speech frame into a multi-channel feature vector.

The feature vector of the input speech is then spectrum-normalized using the least-squares approximation line of the speech frame to which it belongs. A binary window vector is calculated in which each component is 1 if the corresponding component of the spectrum-normalized feature vector is positive and 0 if it is zero or negative, and this window vector is smoothed. The product of each component of the smoothed window vector and the corresponding component of the spectrum-normalized feature vector is then calculated, and from the resulting feature vector a local peak vector is calculated in which the components corresponding to channels with a local maximum in the frequency direction are set to 1. Finally, the similarity between the time series of local peak vectors of the input speech and each of a plurality of standard patterns prepared in advance is calculated to determine the category of the input speech.

(Operation) In the present invention, after the spectrum-normalized feature vector of the input speech is extracted and before the local peak vector is extracted, the window vector obtained from the spectrum-normalized feature vector is smoothed and multiplied by the spectrum-normalized feature vector. When the local peak vector is extracted, this suppresses the mistaken extraction of peaks caused by input noise fluctuations as local peaks of the input speech, so the local peak vector of the input speech is extracted stably.

(Embodiment) FIG. 1 is a block diagram showing an embodiment of the present invention. The configuration and operation of the speech recognition device shown in FIG. 1 are described below.

[Input processing]

The input speech is converted into an electrical signal by a microphone (not shown), passed through an amplifier (not shown) and a low-pass filter (not shown), sampled by an A/D converter (not shown) at a sampling frequency of, for example, 12 kHz, and applied to input terminal 101.

[Frequency analysis processing]

The digital values supplied from the input terminal are frequency-analyzed in the frequency analysis unit 102 and converted into a time series of feature vectors, one per speech frame. The frequency analysis unit 102 consists of bandpass filters, an absolute-value calculation unit, and low-pass filters.

For the frequency analysis, this embodiment uses bandpass filters with the low-Q characteristics shown in FIG. 4. Low-Q bandpass filters are chosen because the aim is stable extraction of local peaks.

The output of each bandpass filter is converted to its absolute value, fed to a low-pass filter, and resampled once per speech frame period (10 msec in this embodiment) to yield the feature vector.

Let $a_i^k$ denote the resampled output of the channel-$k$ low-pass filter in the $i$-th speech frame. The feature vector $a_i$ of the $i$-th speech frame is then expressed as $a_i = (a_i^1, a_i^2, \ldots, a_i^K)$, where $K$ is the number of channels ($K = 22$ in this embodiment) and $a_i^1, a_i^2, \ldots, a_i^K$ are the components of the feature vector $a_i$.

[Frame power calculation process]

For each speech frame, the frame power calculation unit 103 receives the feature vector $a_i$ output by the frequency analysis unit 102 and calculates the frame power $P_i$ of that speech frame by equation (1).

[Equation (1) not reproduced]

[Voice section detection processing]

The speech interval detection unit 104 detects the speech interval using the frame power $P_i$ output by the frame power calculation unit 103.

Various algorithms have been proposed for speech interval detection, and the algorithm itself is not the object of the present invention. Here, the start point $I_S$ is detected as the first frame of a run in which the frame power $P_i$ stays at or above a predetermined threshold $P_S$ for $T_1$ frames or more. The end point $I_E$ is detected as the frame at which, after the start point of the speech, $P_i$ first falls to or below a threshold $P_E$ and then remains there for $T_2$ consecutive frames.
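The start and end rules above can be sketched in Python as follows. This is only an illustrative reading of the rule, not the circuit of the embodiment: the function name `detect_speech_interval` and its parameter names are hypothetical, and the fallback to the last frame when no end point is found is an assumption that the text does not specify.

```python
def detect_speech_interval(power, p_s, p_e, t1, t2):
    """Return (start, end) frame indices, or None if no start point is found.

    power   : list of frame powers P_i
    p_s, t1 : start threshold P_S and required run length T_1 (frames)
    p_e, t2 : end threshold P_E and required run length T_2 (frames)
    """
    # Start point I_S: first frame of a run of t1 frames with P_i >= P_S.
    start = None
    run = 0
    for i, p in enumerate(power):
        run = run + 1 if p >= p_s else 0
        if run >= t1:
            start = i - t1 + 1
            break
    if start is None:
        return None

    # End point I_E: first frame of a run of t2 frames with P_i <= P_E,
    # searched only after the start point.
    run = 0
    for i in range(start + 1, len(power)):
        run = run + 1 if power[i] <= p_e else 0
        if run >= t2:
            return start, i - t2 + 1

    # Assumption: if the power never settles below P_E, use the last frame.
    return start, len(power) - 1
```

A short dip below $P_E$ that lasts fewer than $T_2$ frames does not end the interval, which is what keeps brief pauses inside a word from splitting it.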

[Spectral normalization processing]

The spectrum normalization unit 105 receives the feature vector $a_i$ of the input speech output by the frequency analysis unit 102 and first converts each component $a_i^k$ logarithmically into $x_i^k$ by equation (2):

$x_i^k = \begin{cases} c \log a_i^k & (a_i^k \ge 1) \\ 0 & (a_i^k \le 0) \end{cases}$ ……(2)

where $c$ is a constant determined by the number of bits of $a_i^k$ and the number of bits of $x_i^k$.

Next, the spectrum normalization of equation (4) is performed using the least-squares approximation line given by equation (3):

$y_i^k = u_i \cdot k + v_i$ ……(3)

where $u_i$ and $v_i$ are the slope and intercept of the line fitted by least squares to $x_i^k$ over the channels $k = 1, \ldots, K$.

$z_i^k = x_i^k - y_i^k$ ……(4)

[Local peak vector calculation process]

FIG. 5 shows the detailed configuration of the local peak vector calculation unit 106 according to the present invention.
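As an illustration of the spectrum normalization of equations (2) to (4), whose output $z_i^k$ is the input to unit 106, a minimal Python sketch follows. Several points are assumptions rather than statements of the embodiment: the function name `spectral_normalize` is hypothetical, the logarithm is taken as natural with the base absorbed into `c`, values between 0 and 1 are mapped to 0 because the text defines only $a_i^k \ge 1$ and $a_i^k \le 0$, and the slope and intercept use the standard closed-form least-squares solution over $k = 1, \ldots, K$.

```python
import math


def spectral_normalize(a, c=1.0):
    """Return z^k = x^k - (u*k + v) for one frame's feature vector `a`.

    Requires at least two channels so that a line can be fitted.
    """
    K = len(a)
    # Equation (2): log conversion; values below 1 are mapped to 0 (assumption
    # for 0 < a < 1, which the source does not cover).
    x = [c * math.log(v) if v >= 1 else 0.0 for v in a]

    # Standard least-squares line fit over channel numbers k = 1..K.
    ks = list(range(1, K + 1))
    sum_k = sum(ks)
    sum_k2 = sum(k * k for k in ks)
    sum_x = sum(x)
    sum_kx = sum(k * xv for k, xv in zip(ks, x))
    u = (K * sum_kx - sum_k * sum_x) / (K * sum_k2 - sum_k ** 2)  # slope
    v = (sum_x - u * sum_k) / K                                   # intercept

    # Equation (4): subtract the fitted line from the log spectrum.
    return [xv - (u * k + v) for k, xv in zip(ks, x)]
```

Because the fitted line includes an intercept, the components of the result sum to zero, so positive components mark channels that rise above the overall spectral tilt.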

In FIG. 5, 501 is an input terminal for the spectrum-normalized data $z_i^k$, 502 is a binarization unit, 503 is a smoothing unit, 504 is a multiplication unit, 505 is a local maximum extraction unit, and 506 is a local peak vector output terminal.

From the data $z_i^k$ spectrum-normalized by the spectrum normalization unit 105, the binarization unit 502 calculates the binary window vector $W_i = (W_i^1, W_i^2, \ldots, W_i^k, \ldots, W_i^K)$ given by equation (5), where $k$ is the channel number:

$W_i^k = \begin{cases} 1 & (z_i^k > 0) \\ 0 & (z_i^k \le 0) \end{cases}$ ……(5)

Here $W_i^1, W_i^2, \ldots, W_i^K$ are the components of the window vector $W_i$. The smoothing unit 503 then smooths the window vector $W_i$ to obtain the smoothed window vector $\bar{W}_i = (\bar{W}_i^1, \ldots, \bar{W}_i^k, \ldots, \bar{W}_i^K)$.

The smoothing sets $\bar{W}_i^k$ to zero wherever the component $W_i^k$ of $W_i$ is not part of a run of two or more consecutive channels equal to 1.

For example, ……010110…… is smoothed to ……000110……
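The smoothing rule can be illustrated with a short Python sketch. The function name `smooth_window` is hypothetical, and treating the positions beyond the first and last channel as 0 is an assumption about the boundary that the text does not state.

```python
def smooth_window(w):
    """Keep a 1 only if at least one adjacent channel is also 1."""
    K = len(w)
    out = []
    for k in range(K):
        left = w[k - 1] if k > 0 else 0       # assumption: 0 outside the band
        right = w[k + 1] if k < K - 1 else 0  # assumption: 0 outside the band
        out.append(1 if w[k] == 1 and (left == 1 or right == 1) else 0)
    return out


# The example from the text: 010110 becomes 000110.
print(smooth_window([0, 1, 0, 1, 1, 0]))  # [0, 0, 0, 1, 1, 0]
```

An isolated 1 is the typical footprint of a single channel pushed above the least-squares line by noise, which is why only runs of two or more channels survive.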

Next, the multiplication unit 504 obtains the product of each component $\bar{W}_i^k$ of the smoothed window vector $\bar{W}_i$ and the spectrum-normalized data $z_i^k$ by equation (6):

$L_i^k = z_i^k \cdot \bar{W}_i^k$ (where $k = 1, \ldots, K$) ……(6)

Using the $L_i^k$ obtained here, the local maximum extraction unit 505 then evaluates the condition of equation (7):

$L_i^k > L_i^{k+1}$ and $L_i^{k-1} < L_i^k$, where $k = 1, \ldots, K$, $L_i^0 = -\infty$, $L_i^{K+1} = -\infty$ ……(7)

It calculates the local peak vector $r_i = (r_i^1, r_i^2, \ldots, r_i^k, \ldots, r_i^K)$ whose component is $r_i^k = 1$ for each $k$ that satisfies the condition and $r_i^k = 0$ for each $k$ that does not. Here $r_i^1, r_i^2, \ldots, r_i^K$ are the components of the local peak vector $r_i$.
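The chain of equations (5) to (7), from normalized data to local peak vector, can be sketched in Python as below. The function name `local_peak_vector` is hypothetical, and the window is padded with 0 at both ends for the smoothing step, which is an assumption; the $-\infty$ padding for the peak test follows equation (7).

```python
def local_peak_vector(z):
    """Return the binary local peak vector r for one frame of normalized data z."""
    K = len(z)

    # Equation (5): binary window, 1 where z is positive.
    w = [1 if v > 0 else 0 for v in z]

    # Smoothing: drop 1s that have no neighboring 1 (0-padded ends, assumption).
    pad = [0] + w + [0]
    w_bar = [1 if pad[k] and (pad[k - 1] or pad[k + 1]) else 0
             for k in range(1, K + 1)]

    # Equation (6): windowed feature vector.
    L = [zv * wb for zv, wb in zip(z, w_bar)]

    # Equation (7): strict local maxima, with -infinity beyond both ends.
    ext = [float("-inf")] + L + [float("-inf")]
    return [1 if ext[k] > ext[k + 1] and ext[k - 1] < ext[k] else 0
            for k in range(1, K + 1)]


z = [-1.0, 2.0, -0.5, 1.0, 3.0, 2.0, -1.0, -2.0]
print(local_peak_vector(z))  # [0, 0, 0, 0, 1, 0, 0, 0]
```

In this example the isolated positive value at the second channel would be a peak without the window, but smoothing removes it, and only the peak inside the three-channel run remains.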

FIG. 6a shows an example of the spectrum-normalized data $z_i^k$, FIG. 6b an example of the components $W_i^k$ of the window vector $W_i$, FIG. 6c an example of the components $\bar{W}_i^k$ of the smoothed window vector $\bar{W}_i$, FIG. 6d an example of the product $L_i^k$ of $z_i^k$ and $\bar{W}_i^k$, and FIG. 6e an example of the components $r_i^k$ of the local peak vector $r_i$.

[Similarity calculation process]

The similarity calculation unit 107 receives the time series of local peak vectors $r_i$ of the input speech output by the local peak vector calculation unit 106 and calculates its similarity to every standard pattern stored in the standard pattern memory 108.

The standard patterns are created before recognition is performed. For each category, local peak vectors are calculated from one or more training utterances by the same processing as at recognition time, and they are added together after their time axes are expanded or contracted.

Each standard pattern is therefore stored as a time series of weighted local peak vectors. In this embodiment, the number of standard patterns is $M$.
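One way to picture how such a weighted standard pattern could be built is the Python sketch below. It is a sketch under stated assumptions, not the procedure of the embodiment: the names `stretch` and `build_standard_pattern` are hypothetical, the common length is taken to be that of the longest utterance (as in the prior-art dictionary), and the time axis is stretched by a simple nearest-frame mapping.

```python
def stretch(pattern, length):
    """Linearly map a list of frames onto `length` frames (nearest frame)."""
    n = len(pattern)
    return [pattern[(j * n) // length] for j in range(length)]


def build_standard_pattern(utterances):
    """Add time-aligned binary local peak vectors into one weighted pattern.

    utterances : list of utterances, each a list of frames, each frame a
                 list of K binary components.
    """
    length = max(len(u) for u in utterances)  # assumption: longest utterance
    K = len(utterances[0][0])
    total = [[0] * K for _ in range(length)]
    for u in utterances:
        for j, frame in enumerate(stretch(u, length)):
            for k in range(K):
                total[j][k] += frame[k]
    return total


short = [[1, 0], [0, 1]]
long_ = [[1, 0], [1, 0], [0, 1], [0, 1]]
print(build_standard_pattern([short, long_]))
# [[2, 0], [2, 0], [0, 2], [0, 2]]
```

Each component of the result counts how many training utterances had a local peak at that channel and time, which is what makes the stored vectors weighted rather than binary.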

In the similarity calculation unit 107, the frame-to-frame similarity $S(i, j)$ between the input speech and a standard pattern is obtained by equation (8).

[Equation (8) not reproduced]

Here $r_i$ is the local peak vector of the $i$-th frame of the input speech, $D_j$ is the feature vector of the $j$-th frame of the standard pattern, $r_i^t$ is the transpose of $r_i$, and $D_j^t$ is the transpose of $D_j$.

The correspondence between $i$ and $j$ can also be made nonlinear, but this embodiment uses linear matching. The length of the $m$-th standard pattern is denoted $SL_m$.

The similarity $S_m$ between the input speech and the $m$-th standard pattern is then obtained by equation (9).

[Equation (9) not reproduced]

In this way, the similarity $S_m$ (where $m = 1, \ldots, M$) to the input speech is calculated for all $M$ standard patterns.
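Since equations (8) and (9) are not available in this text, the following Python sketch only illustrates the general shape of a linear-matching similarity and should not be read as the formulas of the embodiment. Everything specific in it is an assumption: the frame similarity is taken to be a normalized inner product of $r_i$ and $D_j$, input frame $i$ is mapped linearly onto standard-pattern frame $j$, the frame similarities are averaged over the input frames, and the function names are hypothetical.

```python
import math


def frame_similarity(r, d):
    """Assumed stand-in for S(i, j): normalized inner product of r_i and D_j."""
    num = sum(a * b for a, b in zip(r, d))
    den = math.sqrt(sum(a * a for a in r) * sum(b * b for b in d))
    return num / den if den > 0 else 0.0


def pattern_similarity(inp, std):
    """Assumed stand-in for S_m: mean frame similarity under linear matching.

    inp : time series of binary local peak vectors of the input speech
    std : time series of weighted vectors of one standard pattern (length SL_m)
    """
    n, sl = len(inp), len(std)
    total = 0.0
    for i in range(n):
        j = (i * sl) // n  # linear correspondence between i and j
        total += frame_similarity(inp[i], std[j])
    return total / n
```

With this stand-in, an input whose peaks fall on the heavily weighted channels of a standard pattern in every frame scores close to 1, and an input whose peaks never coincide with it scores 0.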

[Determination process]

The determination unit 109 receives the similarities $S_m$ (where $m = 1, \ldots, M$) to the standard patterns output by the similarity calculation unit 107, extracts the highest one, and identifies and outputs, as the determination result, the category name of the standard pattern that gave it.

That is, the determination process finds the $m_0$ given by equation (10):

$m_0 = \arg\max_m S_m$ ……(10)

and outputs the category name of the $m_0$-th standard pattern to output terminal 110.
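Equation (10) amounts to an argmax over the $M$ similarities, as in the short Python sketch below. The function name `decide` and the category names are hypothetical, and returning the first maximum when similarities tie is an assumption that the text does not address.

```python
def decide(similarities, categories):
    """Return the category name of the standard pattern with the highest S_m."""
    m0 = max(range(len(similarities)), key=lambda m: similarities[m])
    return categories[m0]


print(decide([0.2, 0.9, 0.5], ["zero", "one", "two"]))  # one
```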

The above description covers the case where each process is performed in hardware, but each process can of course also be performed in software.

(Effects of the Invention) As described in detail above, according to the present invention a window vector is obtained from the spectrum-normalized feature vector of the input speech, the window vector is smoothed, and the spectrum-normalized feature vector is multiplied by it as a spectral window before the local peak vector is calculated. Local peaks caused by noise are therefore not mistaken for local peaks of speech, the similarity calculation against each standard pattern and the determination process operate with high accuracy, and as a result a speech recognition device with good recognition accuracy can be realized.

[Brief explanation of drawings]

FIG. 1 is a block diagram showing the configuration of an embodiment of the present invention. FIG. 2 is a flowchart of a conventional speech recognition method. FIG. 3 is a diagram explaining the conventional binarization of the input signal. FIG. 4 is a frequency characteristic diagram of the bandpass filters used for frequency analysis in an embodiment of the present invention. FIG. 5 is a block diagram showing the configuration of the local peak vector calculation unit of the present invention. FIGS. 6a to 6e are diagrams explaining the process of extracting the local peak vector of the input speech in the present invention.

102: frequency analysis unit, 103: frame power calculation unit, 104: speech interval detection unit, 105: spectrum normalization unit, 106: local peak vector calculation unit, 107: similarity calculation unit, 108: standard pattern memory, 109: determination unit, 502: binarization unit, 503: smoothing unit, 504: multiplication unit, 505: local maximum extraction unit.

Claims

[Claims] 1. A speech recognition method comprising: a process of frequency-analyzing input speech in each speech frame of a predetermined period and extracting a feature vector whose components are the frequency components of the input speech; a process of spectrum-normalizing the feature vector of the input speech using a least-squares approximation line of the speech frame to which the feature vector belongs and extracting a spectrum-normalized feature vector; a process of extracting a window vector whose binary components are obtained by converting each component of the spectrum-normalized feature vector to "1" if the component is positive and to "0" if it is zero or negative; a process of smoothing the window vector and extracting a smoothed window vector; a process of calculating the product of each component of the spectrum-normalized feature vector and the corresponding component of the smoothed window vector and extracting the result as a windowed feature vector; a process of determining, for the windowed feature vector, whether a local maximum exists in the frequency direction and converting it into a binary local peak vector in which components corresponding to channels with a local maximum, that is, a local peak, are "1" and the other components are "0"; and a process of calculating the similarity between the time series of local peak vectors of the input speech and a plurality of standard patterns prepared in advance and determining the category of the input speech.