JPH0568716B2

Info

Publication number: JPH0568716B2
Application number: JP59058435A
Authority: JP
Inventors: Makoto Morito; Masao Takeuchi; Akihiko Fujisawa; Yukio Tabei
Original assignee: Oki Electric Industry Co Ltd
Current assignee: Oki Electric Industry Co Ltd
Priority date: 1984-03-28
Filing date: 1984-03-28
Publication date: 1993-09-29
Also published as: JPS60203992A

Description

[Detailed Description of the Invention]

(Technical Field) The present invention relates to a speech recognition method, and more specifically to a speech recognition method in which the recognition operation starts as soon as the start of the input speech is detected, without waiting for the end of the spoken word to be confirmed.

(Background Art) One known form of speech recognition stores, for each standard utterance, a standard pattern expressed as a time series of frames of frequency components; extracts an input pattern from the input speech, likewise as a time series of frames of frequency components; computes the dissimilarity between the input pattern and each standard pattern; and identifies the input speech on the basis of that dissimilarity.

An example is disclosed in Oki Kenkyu Kaihatsu (Oki Research and Development), No. 118, p. 53, December 1982 (Showa 57). In such methods, the standard or input speech pattern is usually obtained by setting frames at regular intervals, performing frequency analysis, applying a logarithmic transformation and a normalization of vocal-cord vibration characteristics and the like using a least-squares fitted line, and expressing the result as a time series of frames of frequency components.

Two methods are known for setting the matching path used to compute the dissimilarity between the input pattern and each standard pattern: DP matching, which uses dynamic programming, and essentially linear matching, as in the reference above. Linear matching is advantageous from the standpoint of simplifying the configuration. However, as illustrated in Fig. 1, the speaking rate of a word varies greatly: it differs between individuals, changes with psychological state and circumstances, and even utterances perceived as "standard" show a 20% to 40% spread in utterance length, so some countermeasure is needed. Various forms of linear matching have been proposed, but in all of them, not only the one in the reference above, the matching path is set after the end of the input speech has been detected. This is a problem for recognition response time, and the input pattern must also be stored from the start point until the end point is confirmed.

(Object of the Invention) An object of the present invention is to raise recognition speed by starting the matching operation immediately upon detecting the start of the input speech, and to absorb variations in speaking rate by setting matching paths that anticipate such variations.

(Summary of the Invention) A first feature of the present invention is that matching is performed for every frame of the input pattern, and that the dissimilarity corresponding to each matching path of each standard pattern is updated and stored at every frame.

First, the voiced state of the input speech can be detected using speech power. In this case, the start of speech is taken to be the point at which the frame power P(j) (where j is the frame number of the input pattern) exceeds a predetermined threshold. However, external noise and the like can push P(j) above the threshold even when no speech is being input, producing a false start point. Therefore, the frame at which P(j) exceeds the threshold is provisionally treated as the start point and recognition processing begins, but unless the frame power exceeds the threshold for three or more consecutive frames, that input frame is not regarded as the start of speech: recognition processing is aborted and the process returns to start-point detection. The frame length here is 16 msec. The frames from the start of speech onward whose frame power exceeds the threshold are numbered separately; the h-th such frame is called the h-th speech frame, to distinguish this number from the plain input frame number. That is, the speech frame with speech frame number h corresponds to the h-th input frame within the voiced sections.
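The start-point rule described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `find_start`, the list-of-powers input, and the use of `>=` for the threshold comparison are choices made here, not details given in the patent, and the sketch returns the confirmed start index rather than modelling the provisional start of recognition.

```python
from typing import Optional, Sequence

CONFIRM_FRAMES = 3  # consecutive voiced frames required by the text


def find_start(powers: Sequence[float], threshold: float) -> Optional[int]:
    """Return the index of the first frame that begins a run of at least
    CONFIRM_FRAMES frames with power at or above the threshold, else None.

    Hypothetical helper: names and the >= comparison are assumptions.
    """
    run_start = None  # provisional start point
    run_len = 0
    for j, p in enumerate(powers):
        if p >= threshold:
            if run_len == 0:
                run_start = j  # provisional start: recognition would begin here
            run_len += 1
            if run_len >= CONFIRM_FRAMES:
                return run_start  # start point confirmed
        else:
            # Run broken before confirmation: abandon it and keep searching.
            run_start, run_len = None, 0
    return None
```

A two-frame noise burst is rejected, while a run of three voiced frames is accepted from its first frame.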

If the matching process that normalizes speaking rate can start at the beginning of the speech and run once per period in which the speech analyzer produces an output (the frame period), there is no need to store all of the analyzer's data from the start point, and the response time is also shortened.

A second feature of the present invention is that matching paths are set for the cases of slow, standard, and fast speech, and matching is performed on each of these paths. When the start of speech is detected, the speaking rate of the word about to be input is unknown. If matching paths are set assuming slow, standard, and fast speech, and matching is performed on each path, matching can begin before the end point is detected. Of course, a path on which the end of the input coincides exactly with the end of the standard pattern is unlikely to exist, but the dissimilarity is expected to be smallest on the path where the end of the input most nearly coincides with the end of the standard pattern.

Some words, meanwhile, contain a stretch in which the frame power falls below the threshold, that is, silent frames, as between the "i" and the "chi" of "ichi". Such a stretch is called a "power dip". The length of a power dip depends on the word but rarely exceeds 30 frames. When, after the start of speech has been detected, the frame power at some frame falls below the threshold (a silent state), it cannot be determined at that moment whether the frame is the beginning of a power dip or the end of the speech. This is normally decided by whether a frame satisfying the speech start condition (frame power at or above the threshold for three or more consecutive frames) occurs within the following 30 frames, so the decision can take up to 30 frames. The matching result for frames whose power is below the threshold must therefore be held in abeyance in some way. The present invention solves this problem by stopping the update of the speech frame number and suspending the matching process for frames whose power is below the threshold, that is, silent frames.
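The handling of silent frames described above can be illustrated with a small Python sketch. It assumes the input list begins at the confirmed start point, and it simplifies the end rule: the end is declared after `max_dip` consecutive silent frames, without checking the three-consecutive-frame condition for resumption. The name `track_speech_frames` and its parameters are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple


def track_speech_frames(
    powers: Sequence[float], threshold: float, max_dip: int = 30
) -> Tuple[List[int], Optional[int]]:
    """Return (h per input frame, input frame index where the end is confirmed).

    The speech frame number h advances only on voiced frames; on silent
    frames it is held, so a power dip leaves the matching state untouched.
    Simplified sketch: the end is confirmed after max_dip silent frames in a row.
    """
    h = 0
    silent_run = 0
    h_per_frame: List[int] = []
    for j, p in enumerate(powers):
        if p >= threshold:
            h += 1          # voiced frame: update the speech frame number
            silent_run = 0
        else:
            silent_run += 1  # silent frame: h is held
        h_per_frame.append(h)
        if silent_run >= max_dip:
            return h_per_frame, j  # end of speech confirmed at input frame j
    return h_per_frame, None       # end not yet confirmed
```

With the default of 30 frames, a short dip inside a word only pauses the counter; the second return value stays `None` until a long enough silence is seen.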

Fig. 2(a) shows an example of the multiple matching paths used to match the input pattern against a standard pattern in the speech recognition method of the present invention; Fig. 2(b) shows an example of the frame power of the input pattern; and Fig. 2(c) shows an example of the dissimilarities Dn(j), D′n(j), and D″n(j) on each matching path between the input pattern and the standard pattern.

Fig. 2(a) shows the case where the range of speaking rate is taken to be, for example, ±20% and three matching paths are set. In Fig. 2(a), the horizontal axis is the frame number of the input pattern and the vertical axis is the frame number of the standard pattern. Take the n-th standard pattern Sn as an example and let its frame length be SL(n). Path 101 assumes speech uttered 20% slower, path 102 assumes standard speech, and path 103 assumes speech uttered 20% faster. When the power of the j-th input frame is at or above the threshold, the distance to the standard pattern Sn on each of the three paths is given by the following equations. Here h is the speech frame number corresponding to the j-th input frame; W(i, j) is the component of the input pattern with input frame number j and channel number i (i = 1 to 8); and Sn(i, k) is the component of the standard pattern with frame number k and channel number i.

Distance for path 101:

$$d_n(j) = \sum_{i=1}^{8} \left| W(i,j) - S_n(i,k) \right|, \qquad k = \begin{cases} [h/1.2] & [h/1.2] \le SL(n) \\ SL(n) & [h/1.2] > SL(n) \end{cases} \tag{1}$$

Distance for path 102:

$$d'_n(j) = \sum_{i=1}^{8} \left| W(i,j) - S_n(i,k') \right|, \qquad k' = \begin{cases} h & h \le SL(n) \\ SL(n) & h > SL(n) \end{cases} \tag{2}$$

Distance for path 103:

$$d''_n(j) = \sum_{i=1}^{8} \left| W(i,j) - S_n(i,k'') \right|, \qquad k'' = \begin{cases} [h/0.8] & [h/0.8] \le SL(n) \\ SL(n) & [h/0.8] > SL(n) \end{cases} \tag{3}$$

Here [ ] denotes the Gauss bracket (the integer part).

According to these equations, on path 101 the distance is computed between the j-th input frame of the input pattern and the k-th frame of the standard pattern. On path 102 it is computed between the j-th input frame and the k′-th frame of the standard pattern, and on path 103 between the j-th input frame and the k″-th frame of the standard pattern. However, when k, k′, or k″, which denote frame numbers of the standard pattern, would exceed the length SL(n) of that standard pattern, they are limited to SL(n).
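The frame-number mapping of the three paths, including the limit at SL(n), can be sketched as below. The sketch uses exact integer arithmetic (h/1.2 = 5h/6 and h/0.8 = 5h/4) to avoid floating-point rounding in the Gauss bracket; the names `PATHS` and `pattern_frame` are illustrative and not taken from the patent.

```python
# Ratios as (numerator, denominator): h/1.2 = 5h/6, h, h/0.8 = 5h/4.
PATHS = {101: (5, 6), 102: (1, 1), 103: (5, 4)}


def pattern_frame(h: int, path: int, sl: int) -> int:
    """Standard-pattern frame number for speech frame h on the given path.

    Implements k = min([h * ratio], SL(n)) with floor division as the
    Gauss bracket. Hypothetical helper for illustration only.
    """
    num, den = PATHS[path]
    return min((h * num) // den, sl)
```

For h = 12 and SL(n) = 20 the three paths give frames 10, 12, and 15; with SL(n) = 14 the fast path is limited to 14. For h = 1 the slow path yields 0, and the text does not say how that index is treated, so the sketch leaves it as computed.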

When, on the other hand, the power of the j-th input frame is below the threshold, the distances dn(j), d′n(j), and d″n(j) on the respective paths are forced to

$$d_n(j) = 0 \tag{4}$$

$$d'_n(j) = 0 \tag{5}$$

$$d''_n(j) = 0 \tag{6}$$

so that, in effect, nothing is added to the dissimilarity when the frame power is below the threshold. For the same reason, the standard patterns are also stored with the frames corresponding to power dips, that is, to the silent state, removed.

Although the distance is thus set to 0 for frames that are not significant for the dissimilarity, such as those in a power dip or those matched after the end of the standard pattern, the matching in the present invention is essentially linear.

Next, the dissimilarities Dn(j), D′n(j), and D″n(j) up to the j-th input frame of the input pattern are computed.

Dissimilarity for path 101:

$$D_n(j) = d_n(j) + D_n(j-1) \tag{7}$$

Dissimilarity for path 102:

$$D'_n(j) = d'_n(j) + D'_n(j-1) \tag{8}$$

Dissimilarity for path 103:

$$D''_n(j) = d''_n(j) + D''_n(j-1) \tag{9}$$

That is, the dissimilarity for the j-th frame on each path is obtained by adding the per-channel distances (for example |W(i, j) − Sn(i, k)|), summed over all channels, to the dissimilarity value for the (j−1)-th frame (for example Dn(j−1)). These operations are carried out when the j-th frame is input. Computing the dissimilarity for the j-th input frame requires only the input pattern data of the j-th frame, the standard pattern data corresponding to each path, and the dissimilarity data for the (j−1)-th input frame one frame earlier; input pattern data from two or more frames earlier is not needed. The required storage is therefore smaller than for linear time-warping matching methods that must keep the input pattern until the end point is detected.
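The frame-by-frame accumulation of Eqs. (1) to (9) can be sketched as a small streaming class that keeps only three running sums per standard pattern. This is an illustration under assumptions: the class name `PathAccumulator`, the 1-based frame convention, and the lower limit of 1 on the frame number (for the case where the Gauss bracket gives 0) are choices made here, not details stated in the patent.

```python
from typing import List, Sequence

PATH_RATIOS = ((5, 6), (1, 1), (5, 4))  # paths 101, 102, 103 as exact fractions


class PathAccumulator:
    """Running dissimilarities D, D', D'' for one standard pattern."""

    def __init__(self, pattern: Sequence[Sequence[float]]):
        self.pattern = pattern   # pattern[k - 1] holds the channels of frame k
        self.sl = len(pattern)   # SL(n)
        self.h = 0               # speech frame number
        self.dissim = [0.0, 0.0, 0.0]

    def push(self, frame: Sequence[float], voiced: bool) -> List[float]:
        """Consume one input frame and return the updated dissimilarities."""
        if voiced:
            self.h += 1
            for p, (num, den) in enumerate(PATH_RATIOS):
                k = min((self.h * num) // den, self.sl)  # limit at SL(n)
                k = max(k, 1)  # assumption: index 0 is treated as frame 1
                ref = self.pattern[k - 1]
                # City-block distance over channels, added to the running sum.
                self.dissim[p] += sum(abs(w - s) for w, s in zip(frame, ref))
        # Silent frame: distance forced to 0, so the sums are simply held.
        return list(self.dissim)
```

Only the current frame and the three sums are held, which mirrors the storage saving the text claims; the input frame itself can be discarded after `push` returns.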

Fig. 2(c) shows the dissimilarities Dn(j), D′n(j), and D″n(j) on each matching path between the input pattern and the standard pattern. As Fig. 2(c) shows, forcing the distance to 0 when the frame power is below the threshold causes Dn(j), D′n(j), and D″n(j) to hold their values. Consequently, the dissimilarity at the end of the speech equals the dissimilarity at the input frame 30 frames after the end (the point at which the end is first detected).

Next, once the end of the speech has been detected (30 frames after the end of the speech), the category is decided from the dissimilarities obtained for each standard pattern. Let j be the input frame number and H the speech frame number at the time of end detection; the dissimilarities of the paths for the n-th standard pattern are then Dn(j), D′n(j), and D″n(j). There are as many such sets of dissimilarities as there are standard patterns (N). The method of deciding the category from these dissimilarities is as follows. In the first decision step, one of the per-path dissimilarities Dn(j), D′n(j), D″n(j) for the n-th standard pattern is selected. This selection uses L, L′, and L″, given by the following equations for the speech frame number H at the time of end detection.

Path 101:

$$L = [H/1.2] \tag{10}$$

Path 102:

$$L' = H \tag{11}$$

Path 103:

$$L'' = [H/0.8] \tag{12}$$

These values L, L′, and L″ resemble the expressions giving the standard pattern frame number corresponding to a speech frame, but they are not limited by the standard pattern length SL(n). L, L′, and L″ are therefore independent of the standard pattern. Of L, L′, and L″, only the dissimilarity for the path whose value is closest to the standard pattern length SL(n) is selected. For example, if L′ is closest to SL(n), path 102 corresponds and its dissimilarity D′n(j) is selected. The selected dissimilarity is denoted DDn. This selection is made for each standard pattern.
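The first decision step can be sketched as follows: evaluate Eqs. (10) to (12) for the final speech frame number H and pick the path whose value is closest to SL(n). The function name `select_path` and the path indices 0, 1, 2 (for paths 101, 102, 103) are illustrative. The patent does not say how ties are broken; this sketch assumes the lowest path index wins.

```python
def select_path(H: int, sl: int) -> int:
    """Return 0, 1, or 2 (paths 101, 102, 103): the path whose predicted
    standard-pattern length is closest to sl.

    Lengths follow Eqs. (10)-(12) and are not limited by sl.
    Tie-breaking toward the lowest index is an assumption of this sketch.
    """
    lengths = ((H * 5) // 6, H, (H * 5) // 4)  # [H/1.2], H, [H/0.8]
    return min(range(3), key=lambda p: abs(lengths[p] - sl))
```

For H = 12 the candidate lengths are 10, 12, and 15, so a standard pattern of length 14 selects the fast path.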

The second decision step follows. The minimum is found among the per-pattern dissimilarities DDn obtained in the first decision step. The category attached to the standard pattern giving this minimum is the recognition result.
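Both decision steps together can be sketched as a single function. The data layout (a dictionary from a category label to the pair of SL(n) and the three path dissimilarities), the function name `decide`, and the example labels are assumptions made for illustration; tie-breaking in the first step again favors the lowest path index.

```python
from typing import Dict, Optional, Sequence, Tuple


def decide(
    H: int, candidates: Dict[str, Tuple[int, Sequence[float]]]
) -> Optional[str]:
    """Return the category whose selected dissimilarity DD_n is smallest.

    candidates maps a category label to (SL(n), [D, D', D'']).
    Hypothetical layout; the patent describes hardware registers instead.
    """
    lengths = ((H * 5) // 6, H, (H * 5) // 4)  # Eqs. (10)-(12)
    best_label: Optional[str] = None
    best_dd = float("inf")
    for label, (sl, dissim) in candidates.items():
        # First step: path whose predicted length is closest to SL(n).
        p = min(range(3), key=lambda q: abs(lengths[q] - sl))
        dd = dissim[p]  # DD_n
        # Second step: keep the smallest DD_n seen so far.
        if dd < best_dd:
            best_label, best_dd = label, dd
    return best_label
```

A small value on a path that was not selected for its pattern is ignored, which is the point of the first step.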

(Embodiment) Fig. 3 shows one embodiment of a circuit configuration that performs the matching process and the decision process of the present invention. Its operation is described in detail below.

In Fig. 3, 52 is a speech frame counter that counts the number of speech frames from the start point. When the start point is detected its content is set to 0 by reset pulse 50; thereafter it counts up on count pulse 51, which is applied whenever the power of an input frame exceeds the threshold. When the power of the input frame is below the threshold, count pulse 51 is not applied and the output of speech frame counter 52 is held. Let h be the speech frame number output by speech frame counter 52. 53 is a signal indicating the type of path used in matching; since there are three paths, it takes the values 0 to 2. 54 is a ROM that gives, for the h-th speech frame, the corresponding standard pattern frame number on each path. ROM 54 stores values corresponding to [h/0.8], h, and [h/1.2]. Let the output of ROM 54 be l. 55 is the standard pattern number signal, which gives the number of the standard pattern and is denoted n. When the total number of standard patterns is N, it takes the values 0 to N−1. 56 is a ROM that stores the length SL(n) of the n-th standard pattern.

57 is a comparator that compares the output of ROM 54, the standard pattern frame number l, with the content of ROM 56, the standard pattern length SL(n), and outputs "1" if l ≤ SL(n) and "0" if l > SL(n). 58 is a selector that selects the output of ROM 54 when the output of comparator 57 is "1" and the output of ROM 56 when it is "0". Through comparator 57 and selector 58, selector 58 thus outputs l if l ≤ SL(n) and SL(n) if l > SL(n). Let the output of selector 58 be k.

59 is a signal giving the channel number i. 60 is the standard pattern memory, which is addressed by the channel number i, the output of selector 58 (for example k), and the standard pattern number signal n, and outputs the components Sn(i, k) of the standard pattern. 61 is a memory that holds the components W(i, j) of one frame of the spectrum-normalized input pattern; its address is given by channel number signal 59. 62 is the input terminal of memory 61, to which the input data W(i, j), spectrum-normalized by a spectrum normalization unit (not shown), is supplied. 63 is an arithmetic unit that, from the output W(i, j) of memory 61 and the output Sn(i, k) of standard pattern ROM 60, outputs the following value according to the control signal CONT.

$$\text{output of 63} = \begin{cases} \left| W(i,j) - S_n(i,k) \right| & \text{when CONT} = 1 \\ 0 & \text{when CONT} = 0 \end{cases} \tag{13}$$

The CONT signal is "1" when the frame power is at or above the threshold and "0" when it is below. 64 is an adder. 65 is a RAM addressed by the values of path signal 53 and standard pattern number signal 55; it stores the dissimilarities Dn(j), D′n(j), and D″n(j). 67 is a ROM that gives, for the speech frame length h, the path number to be selected in the first decision step. Because the path number to select is specified per standard pattern, ROM 67 takes the speech frame number h and the standard pattern number n as its address and outputs the path number to be selected. 68 is a comparator that compares the output of ROM 67 with path signal 53 and outputs "1" when they match. 69 is a converter controlled by comparator 68: when the output of comparator 68 is "1" it passes the output of RAM 65 unchanged, and when it is "0" it converts the output of RAM 65 to the maximum dissimilarity value.

70 is a minimum-value selection circuit that compares the output of converter 69 with the output of register 71, described below, and outputs the smaller value; it outputs two signals. One is the smaller of the two compared values, which is stored in register 71. The other is a clock issued when the comparison shows the value from converter 69 to be smaller; it serves as the input clock of register 72. Register 71 holds the minimum dissimilarity and is set to the maximum dissimilarity value at the beginning of each frame period. Register 72 stores the standard pattern number signal on the output pulse of minimum-value selection circuit 70, and thus holds the number of the standard pattern giving the minimum dissimilarity.

Fig. 3 is configured as described above; its operation is explained below.

Processing starts when the frame power P(j) reaches or exceeds the threshold, but it is reset unless P(j) stays at or above the threshold for three or more consecutive frames. Before the start frame of the speech, counter 52 is held in the reset state by reset pulse 50, and all values in memory 65 are reset. The processing within one frame period after start detection is described in order below; for the purposes of explanation, let the input frame number be j. When the frame power of the j-th input frame exceeds the threshold, count pulse 51 is applied to counter 52, which counts up and outputs the speech frame number h. The standard pattern frame number corresponding to speech frame number h is output by ROM 54, ROM 56, comparator 57, and selector 58. The data Sn(i, k) for channel i of the k-th frame of the n-th standard pattern is output by ROM 60. Meanwhile, the spectrum-normalized input data W(i, j) of the j-th input frame, output by the preceding spectrum normalization unit (not shown), has been supplied through input terminal 62 and stored in memory 61. The output Sn(i, k) of ROM 60 and the output W(i, j) of memory 61 are output in synchronization with channel number signal 59, and arithmetic unit 63 performs the operation given by Eq. (13). The operations corresponding to Eqs. (1) to (9) are carried out between the output of arithmetic unit 63 and memory 65. In practice, Eqs. (1) to (9) are carried out in the following unified form.

$$(\text{memory 65}) \leftarrow (\text{memory 65}) + \begin{cases} \left| W(i,j) - S_n(i,k) \right| \\ 0 \end{cases}$$

Next, the operation of the first decision step is described.

The path selection required for the first decision step is performed by ROM 67, comparator 68, and converter 69. ROM 67 stores the path number to be selected for the n-th standard pattern when the speech frame number is h; it is set to the number of the path for which the result of Eqs. (10) to (12), evaluated for that speech frame number, is closest to the length SL(n) of the n-th standard pattern. The path number output by ROM 67 and path signal 53 are compared by comparator 68, which outputs "1" to converter 69 when they match. Converter 69 outputs the output of memory 65 when its input from comparator 68 is "1" and the maximum dissimilarity value when it is "0". As a result, when the output of comparator 68 is "0", that is, when the output of ROM 67 and path signal 53 do not match, the corresponding dissimilarity is in effect barred from being selected by the minimum-value decision. This accomplishes the first decision step. Next, minimum-value selection circuit 70 stores in register 71 the smaller of the dissimilarity held in register 71 and the dissimilarity output by converter 69. At the same time, if the output of converter 69 is the smaller, a pulse is applied to register 72 and the current standard pattern number is stored in register 72. When this processing has been carried out for all standard patterns, register 72 holds the number of the standard pattern giving the minimum dissimilarity at that point. Fig. 4 shows a time chart of the above processing within one frame period.
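The scan performed within one frame period by comparator 68, converter 69, minimum-value selection circuit 70, and registers 71 and 72 can be mimicked in software as below. This is a behavioral sketch, not a description of the actual circuit: the constant `MAX_DISSIM` (the text gives no word width), the list layouts of `ram65` and `rom67_row`, and the function name `scan_frame` are all assumptions.

```python
from typing import List, Optional, Sequence, Tuple

MAX_DISSIM = 0xFFFF  # assumed "maximum dissimilarity value"; width not given


def scan_frame(
    ram65: Sequence[Sequence[int]], rom67_row: Sequence[int]
) -> Tuple[Optional[int], int]:
    """Scan all (pattern, path) pairs once and return (register 72, register 71).

    ram65[n][p] is the dissimilarity of pattern n on path p (0..2);
    rom67_row[n] is the path number to select for pattern n at the current h.
    """
    reg71 = MAX_DISSIM            # register 71: preset to the maximum value
    reg72: Optional[int] = None   # register 72: best pattern number so far
    for n, row in enumerate(ram65):
        for p, value in enumerate(row):            # path signal 53: 0, 1, 2
            match = rom67_row[n] == p              # comparator 68
            out = value if match else MAX_DISSIM   # converter 69
            if out < reg71:                        # min-value selection 70
                reg71, reg72 = out, n              # clock registers 71 and 72
    return reg72, reg71
```

Masking the non-selected paths with the maximum value lets one plain minimum search implement both decision steps, which is the trick the circuit description relies on.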

The above processing is performed every frame period, and the content of register 72 at the time the end point is detected is the final recognition result, which is output from output terminal 73.

(Effects of the Invention) As described above, because the present invention performs the distance and dissimilarity calculations for every input frame, it has the advantage that the recognition result is available within one frame after the end point is detected. Furthermore, the dissimilarity calculation requires only accumulating the distance for the current input frame onto the dissimilarity value for the previous input frame, so the input data from start point to end point need not be stored.

Moreover, because the method is aimed at simplifying the circuit configuration, it lends itself to LSI implementation, making it possible to supply an inexpensive speech recognition LSI chip with a small gate count; it can also be realized in software on a general-purpose microprocessor.

[Brief Description of the Drawings]

Fig. 1 is a diagram for explaining variation in utterance length; Fig. 2 is a diagram outlining the matching paths of the present invention; Fig. 3 is a block diagram of one embodiment of the present invention; and Fig. 4 is a time chart of the processing within one frame period in this embodiment.

52: counter for speech frame number h; 54: ROM for generating the value corresponding to the standard pattern frame number; 56: ROM storing the standard pattern lengths; 57: comparator; 58: selector; 60: standard pattern memory; 61: input pattern memory; 63: distance arithmetic unit; 64: adder; 67: ROM of path numbers to be selected; 68: comparator; 69: converter; 70: minimum-value selection circuit; 71: memory for the minimum dissimilarity; 72: memory for the standard pattern number that is the recognition result.

Claims

[Scope of Claims]

1. A speech recognition method using standard patterns, each expressed as a time series of frames of frequency components corresponding to a standard utterance with the frames corresponding to a silent state removed, and a path provided for each standard pattern, characterized by:

(a) extracting an input pattern from the input speech as a time series of frames of frequency components;

(b) detecting the start of the input speech and starting to count frames of the input pattern, updating the speech frame number sequentially while a voiced state is detected, stopping the update of the speech frame number while a silent state is detected, and resuming the update if the voiced state resumes before the end of the input speech is confirmed;

(c) each time the speech frame number is updated, generating standard pattern frame numbers in an essentially linear relationship to that speech frame number, thereby setting a plurality of matching paths between the input pattern and each standard pattern;

(d) each time the speech frame number is updated, calculating the distance between the input pattern and each standard pattern for the frames paired by each matching path;

(e) defining the dissimilarity as the cumulative distance along a matching path from the start of the input speech to a given speech frame number, and, each time the speech frame number is updated, adding the distance at that frame number to the immediately preceding dissimilarity and temporarily storing the result, thereby updating and storing the dissimilarity for each matching path of each standard pattern;

(f) each time the speech frame number is updated, generating, for each standard pattern and on the basis of that standard pattern, one path selection signal, selecting for each standard pattern the dissimilarity for the matching path designated by the path selection signal, and updating and storing the code of the standard pattern whose selected dissimilarity is the minimum; and

(g) when the end of the input speech is confirmed, recognizing as the category of the input speech the code of the standard pattern stored as giving the minimum dissimilarity.