JPH044600B2


Info

Publication number: JPH044600B2
Application number: JP56076472A
Authority: JP
Priority date: 1981-05-22
Filing date: 1981-05-22
Publication date: 1992-01-28
Also published as: JPS57191699A

Description

[Detailed description of the invention]

The present invention relates to a pattern matching device used in pattern recognition systems such as speech recognition.

Pattern recognition systems such as speech recognition, which are now being put into practical use, recognize an input pattern by the pattern matching method: the input pattern is compared with various standard patterns stored in advance. In speech recognition, for example, one standard pattern is stored for each word to be recognized.

A problem with the pattern matching method in speech recognition is that the input speech stretches and shrinks arbitrarily along the time axis from one utterance to the next. Even when the same speaker utters the same word, the utterances never have exactly the same length. Consequently, even if the stored standard pattern was produced by the same speaker as the input speech, the speaking rate varies arbitrarily, the degree of similarity changes with each utterance, and a correct recognition result is not obtained. A matching method based on dynamic programming (hereinafter abbreviated as DP) is widely used to perform matching while compensating for this time-axis misalignment between the input pattern and the standard pattern (see, for example, Japanese Unexamined Patent Application Publication No. S47-30242). An outline of DP is given next.

A speech pattern is expressed as a time series of feature vectors A_i = (a_1i, a_2i, ..., a_mi, ..., a_ni):

\[ A = A_1, A_2, A_3, \ldots, A_i, \ldots, A_I \tag{1} \]

The feature vector A_i describes the speech in the i-th frame when the speech signal is divided into I intervals in the time domain (each interval is called a frame). The outputs of a bank of n band-pass filters with different center frequencies can serve as the feature vector, for example; a_mi is then the output of the m-th of the n band-pass filters for the i-th frame. The input speech pattern X and the standard speech pattern Y are expressed as time series of feature vectors:

\[ X = X_1, X_2, X_3, \ldots, X_i, \ldots, X_I, \qquad Y = Y_1, Y_2, Y_3, \ldots, Y_j, \ldots, Y_J \tag{2} \]

To obtain the degree of similarity between the input pattern X and the standard pattern Y, the distance between their feature vectors is needed. The distance d can be computed, for example, as a Euclidean distance (in squared form):

\[ d(i, j) = (X_i - Y_j)^2 = \sum_{u=1}^{n} (x_{ui} - y_{uj})^2 \tag{3} \]

where x_ui and y_uj are the u-th components of X_i and Y_j. If there were no stretching or shrinking of the time axis between the input pattern X and the standard pattern Y, their frames would correspond one-to-one, and the similarity S would be obtained from the sum of the distances d as

\[ S = \frac{1}{n} \sum_{m=1}^{n} (X_m - Y_m)^2 = \frac{1}{n} \sum_{m=1}^{n} d(m, m) \tag{4} \]

Here m is the frame number that designates one of the n frames of the input pattern and of the standard pattern. The smaller S is, the higher the similarity, that is, the more closely the patterns (words) resemble each other.
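As a rough illustration of equations (3) and (4), the following Python sketch computes the frame distance and the one-to-one similarity for two patterns of equal length. The function names and the toy feature values are assumptions made for this example; they do not come from the patent.

```python
def frame_distance(x, y):
    """Squared Euclidean distance between two feature vectors, as in equation (3)."""
    if len(x) != len(y):
        raise ValueError("feature vectors must have the same dimension")
    return sum((xu - yu) ** 2 for xu, yu in zip(x, y))


def linear_similarity(X, Y):
    """Equation (4): mean frame distance when X and Y correspond one-to-one."""
    if len(X) != len(Y):
        raise ValueError("equation (4) assumes patterns of equal length")
    return sum(frame_distance(x, y) for x, y in zip(X, Y)) / len(X)


# Toy patterns: two frames, two filter outputs per frame (made-up values).
X = [[1.0, 2.0], [3.0, 4.0]]
Y = [[1.0, 4.0], [3.0, 1.0]]
print(frame_distance(X[0], Y[0]))  # (1-1)^2 + (2-4)^2 = 4.0
print(linear_similarity(X, Y))     # (4.0 + 9.0) / 2 = 6.5
```

A smaller value means a closer match, so a recognizer built on this measure would pick the standard pattern with the smallest S.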

As noted above, speech stretches and shrinks each time it is uttered, so equation (4) cannot give the similarity S accurately. DP therefore computes S while aligning the time axes, as follows. Fig. 1 shows how the input pattern X and the standard pattern Y are aligned on the time axis, and Fig. 2 shows the same alignment with X and Y taken as the vertical and horizontal coordinates. The arrows show the alignment process, that is, how the stretching and shrinking of the time axis is absorbed. Accumulating the distances along the path indicated by the arrows gives the similarity

\[ S = \{ d(1, 1) + d(2, 2) + d(2, 3) + \cdots + d(8, 7) \} / 9 \tag{5} \]

There are various kinds of DP-based pattern matching. For recognition in which time-axis variation is limited to some extent, as with words, a DP that restricts the width of the time-axis alignment is used (hereinafter, "DP" always refers to this kind of DP). Path selection in this DP is performed as follows.

Point E in Fig. 3 may be reached from three directions, namely from points F, G, and H. Let D(i, j-1), D(i-1, j-1), and D(i-1, j) be the cumulative sums of the distance d up to points F, G, and H, respectively. The cumulative sum D(i, j) at point E then takes one of the values in equation (6), depending on which of the three directions the path comes from, and the path that gives the smallest value is selected as D(i, j).

\[ D(i, j) = \min \left\{ \begin{array}{l} D(i, j-1) + d(i, j) \\ D(i-1, j-1) + W \cdot d(i, j) \\ D(i-1, j) + d(i, j) \end{array} \right. \tag{6} \]

Here W is the weight applied when the diagonal path is selected and is normally 2. The path length L(i, j) up to point E is chosen according to equation (7), following the path selected in equation (6). With L(i, j-1), L(i-1, j-1), and L(i-1, j) denoting the path lengths up to points F, G, and H:

\[ L(i, j) = \left\{ \begin{array}{l} L(i, j-1) + 1 \\ L(i-1, j-1) + W \\ L(i-1, j) + 1 \end{array} \right. \tag{7} \]

The initial values D(1, 1) and L(1, 1) are expressed as follows.

\[ D(1, 1) = d(1, 1), \qquad L(1, 1) = 1 \tag{8} \]

Equation (6) is evaluated under the following constraints.

\[ 1 \le i \le I, \quad 1 \le j \le J, \quad j - \gamma \le i \le j + \gamma, \quad \gamma \text{ an integer} \tag{9} \]

γ determines the range over which time-axis variation between the input pattern X and the standard pattern Y is absorbed, that is, the alignment width. The larger γ is, the larger the stretching and shrinking between the input pattern and the standard pattern that can be aligned. If γ is too large, however, the amount of processing increases and the patterns are over-aligned. In Fig. 2 the alignment width is l = 2γ + 1 with γ = 1 (a deviation of ±1), so alignment is possible with l = 3. In Fig. 2 the final similarity S is obtained from the cumulative sum D(8, 7) and the path length L(8, 7) as

\[ S = D(8, 7) / L(8, 7) \tag{10} \]

which corresponds to equation (5).
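The recurrence of equations (6) to (10) can be sketched in Python as follows. This is a minimal illustration and not the patented circuit: the function names, the use of `None` to signal rejection, and the tie-breaking rule (the shorter path length wins when cumulative distances are equal) are assumptions of this example.

```python
import math


def frame_distance(x, y):
    """Squared Euclidean distance between two feature vectors (equation (3))."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def dp_match(X, Y, gamma, W=2.0):
    """Band-limited DP matching; returns S = D(I, J) / L(I, J), or None if rejected."""
    I, J = len(X), len(Y)
    # 1-indexed tables; cells outside the band keep an infinite cumulative distance.
    D = [[math.inf] * (J + 1) for _ in range(I + 1)]
    L = [[0.0] * (J + 1) for _ in range(I + 1)]
    for i in range(1, I + 1):
        # Constraint (9): j - gamma <= i <= j + gamma.
        for j in range(max(1, i - gamma), min(J, i + gamma) + 1):
            d = frame_distance(X[i - 1], Y[j - 1])
            if i == 1 and j == 1:
                D[1][1], L[1][1] = d, 1.0  # initial values, equation (8)
                continue
            # Equations (6) and (7): take the predecessor with the smallest sum.
            D[i][j], L[i][j] = min(
                (D[i][j - 1] + d, L[i][j - 1] + 1),              # from point F
                (D[i - 1][j - 1] + W * d, L[i - 1][j - 1] + W),  # from point G
                (D[i - 1][j] + d, L[i - 1][j] + 1),              # from point H
            )
    if D[I][J] == math.inf:
        return None  # end cell lies outside the band: lengths differ by more than gamma
    return D[I][J] / L[I][J]  # equation (10)


X = [[0.0], [1.0], [2.0]]
Y = [[0.0], [1.0], [1.0], [2.0]]              # same shape, one frame held longer
print(dp_match(X, Y, gamma=1))                # 0.0: the stretch is absorbed
print(dp_match([[0.0]] * 5, [[0.0]] * 2, 1))  # None: length gap exceeds gamma
```

Only the cells inside the band are visited, which is the source of the reduced processing load that the text attributes to the restricted alignment width.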

Because this DP needs to perform matching only within the region enclosed by the dotted lines in Fig. 2, distances need to be computed only within that region as well. The amount of processing is therefore small, and matching against a large number of standard patterns becomes feasible.

The difficulty with this DP is the segmentation of the input speech, that is, the detection of where it starts and ends. The start point is fixed, as in equation (8), and the width that can be aligned is limited to γ, as in equation (9). If the start point is segmented incorrectly, the starting position shifts and alignment may no longer be possible within the width γ. If the end point is segmented incorrectly, the length of the input speech may exceed the ±γ that this DP allows, and the utterance may be judged to be a different word. When DP processing is performed in real time, this segmentation must also be performed in real time, which raises several problems.

The specific problems of real-time DP processing and segmentation are described next. The input speech is segmented, that is, its start and end points are detected, using, for example, the power of the input speech. This exploits the fact that the power is low when nothing is being uttered (silence) and rises when speech is uttered.
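A power-based start detector of the kind described here might look like the sketch below. The mean-square definition of power, the fixed threshold, and the function names are assumptions of this example; the patent does not specify how the power is computed or compared.

```python
def frame_power(samples):
    """Mean squared amplitude of one frame of audio samples."""
    return sum(s * s for s in samples) / len(samples)


def find_start(powers, threshold):
    """Index of the first frame whose power exceeds the threshold, or None."""
    for i, p in enumerate(powers):
        if p > threshold:
            return i
    return None


print(frame_power([1, -1, 1, -1]))            # 1.0
print(find_start([0.0, 0.1, 2.0, 3.0], 1.0))  # 2
```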

In determining the end point, a word containing a geminate consonant, such as kitte ("stamp"), has a silent interval at the geminate, and this interval risks being mistaken for the end point (near point A in Fig. 4). In a word whose ending tends to be devoiced, such as Hamamatsu, the presence of a speech signal near the end may be unclear (between points B and C in Fig. 4). In both cases, even when a likely end point is found, it cannot be confirmed as the end point without observing the signal for several hundred milliseconds after it. For the input speech of Fig. 4, the actual end point is C, but A and B may be mistaken for it, and to decide that C is the end point the input must be observed up to point D. If A or B is wrongly judged to be the end point, the input speech is cut off partway and is treated as shorter than the actual utterance. If, to avoid this, the end point is decided only after a silent interval of several hundred milliseconds has been confirmed, the end point is not determined until point D. When DP processing runs in real time, however, the input up to point D has by then already been processed by the DP processing section; the input is treated as speech extending to point D and is judged to be longer than the actual utterance. Because the DP described above limits the time-axis variation to ±γ relative to the duration of the standard pattern in order to reduce processing, an end point wrongly placed at A, B, or D makes the input fall outside this range, and the input is rejected.
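The relationship between end-point candidates (A, B, C) and the confirmation point (D) can be illustrated with a small sketch. Here a candidate is the last voiced frame before each silent run, and the end point is confirmed once the silence has lasted `hangover` frames. The function name, the `hangover` parameter, and the toy power values are assumptions of this example.

```python
def detect_endpoints(powers, threshold, hangover):
    """Return (candidates, confirmed).

    candidates: index of the last voiced frame before each silent run.
    confirmed:  (end_index, confirmation_index) once silence has lasted
                `hangover` frames, or None if the end was never confirmed.
    """
    candidates = []
    in_speech = False
    silent_run = 0
    for i, p in enumerate(powers):
        if p > threshold:
            in_speech = True
            silent_run = 0
        elif in_speech:
            silent_run += 1
            if silent_run == 1:
                candidates.append(i - 1)       # a possible end point
            if silent_run >= hangover:
                return candidates, (candidates[-1], i)
    return candidates, None


# Toy power contour with two short gaps and a final long silence.
powers = [0, 5, 5, 0, 5, 5, 0, 0, 5, 0, 0, 0, 0]
print(detect_endpoints(powers, threshold=1.0, hangover=3))
# ([2, 5, 8], (8, 11)): candidates A=2, B=5, C=8; C is confirmed at D=11
```

The gap between the confirmed end (frame 8) and the confirmation (frame 11) is the delay that causes the difficulty described above: by the time C is known to be the end point, a real-time DP has already consumed the frames up to D.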

In the conventional method, therefore, the input speech is first segmented and stored in a memory; after segmentation is complete, the input speech of the speech interval is read out of the memory and the matching process is performed.

To clarify the problems of this conventional method, a conventional recognition device for speech and the like is first outlined with reference to Fig. 5. The feature extraction section 1 extracts features from the input speech. The distance calculation section 2 calculates the distance between the feature-extracted input speech and each of the standard patterns (words) stored in advance in the standard pattern memory 3.

The DP processing section 4 performs matching on the distances from the distance calculation section 2 while aligning the time axis. The decision section 5 takes the similarities to the standard patterns output by the DP processing section 4 and outputs the standard pattern with the highest similarity as the answer. For the reasons given above, the conventional device has a memory 10, placed after the feature extraction section 1, that holds one word of input speech. It stores the input speech from the start point to the end point detected by the speech segmentation section 6, so that memory 10 contains correctly segmented input speech. After the end point is detected, the processing from distance calculation onward is applied to the input speech stored in memory 10.

This conventional device cannot operate in real time. Real-time processing becomes possible, however, if memory 10 in Fig. 5 is configured so that writing from the feature extraction section 1 and reading by the distance calculation section 2 can take place simultaneously. In that case memory 10 needs to hold only the several hundred milliseconds of information between points C and D in Fig. 4, so its capacity can be smaller than in the conventional device. The device of Fig. 5 operated in this way is described below.

Because of memory 10, data output by the feature extraction section 1 reaches the distance calculation section 2 several hundred milliseconds later. Thus, when the speech at point D enters the feature extraction section 1 and the speech segmentation section 6 detects the end point, the data for point C, the actual end point, is still held in memory 10 and has not yet been input to the distance calculation section 2. The speech segmentation section 6 lets the input speech up to point C pass from memory 10 to the distance calculation section 2 and then, at point D, sends the end detection signal EE to the DP processing section 4 to complete the DP computation. The DP result at that moment is transferred to the decision section 5, which finds the standard pattern with the highest similarity and outputs it as the answer. A device capable of nearly real-time processing was realized in this way. In this device, however, memory 10 is a circuit in which reads and writes contend, and the time from the actual end point C to point D, where C can be confirmed as the end point, is not constant. The control circuit of memory 10 and the control of the speech segmentation section 6 therefore both become complicated.
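The role of memory 10 in this nearly real-time arrangement can be pictured as a delay line between feature extraction and distance calculation. The sketch below is a simplification that assumes a fixed delay equal to the confirmation time; as the text notes, the real interval from C to D is not constant, which is what makes the actual control complicated. The function name and the frame numbering are assumptions of this example.

```python
from collections import deque


def delayed_frames(frames, delay):
    """Yield each frame `delay` steps after it was written into the buffer.

    Frames still in the buffer when the input stops are never emitted,
    mirroring how the frames after the true end point are discarded.
    """
    fifo = deque()
    for frame in frames:
        fifo.append(frame)       # write side: feature extraction
        if len(fifo) > delay:
            yield fifo.popleft() # read side: distance calculation


# Input runs up to confirmation point D = frame 11 with a 3-frame delay,
# so the matcher only ever sees frames 0..8, ending at the true end point C = 8.
print(list(delayed_frames(range(12), delay=3)))
```

Every step both writes and reads the buffer, which corresponds to the read/write contention that the text identifies as a drawback of this design.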

An object of the present invention is therefore to solve the above problems and to provide a pattern matching device, for use in speech recognition and the like, that can process input speech in real time. To achieve this object, the invention is characterized in that, while real-time processing is carried out, the DP result at the moment each end-point candidate is detected is stored in a memory in turn, and when one of the candidates is found to be the true end point, the DP result corresponding to that true end point is read from the memory and taken as the true DP result.

The present invention is described below with reference to an embodiment.

Fig. 7 is a block diagram of a speech recognition device according to the present invention, and Fig. 6 shows the alignment between the input pattern of Fig. 4 (horizontal axis) and one standard pattern (vertical axis).

Points A, B, C, and D in Fig. 6 correspond to times A, B, C, and D of the input speech in Fig. 4.

The device of Fig. 7 adds a memory 7 between the DP processing section 4 and the decision section 5 of the conventional device of Fig. 5. Memory 7 has a capacity equal to the number of words stored in the standard pattern memory 3, that is, the size of the recognition vocabulary. In addition to a start-point detection section 60, the speech segmentation section 6 of this device contains an end-point candidate detection section 61, which detects points that look like the end point (end-point candidates) and reports them to the DP processing section 4, and an end-point detection section 62, which detects the end point and reports it to the DP processing section 4 and the decision section 5. The start-point detection section 60 generates the start signal SS at the start of the input pattern; the end-point candidate detection section 61 generates the end-point candidate signal SE at points A, B, and C of Fig. 4; and the end-point detection section 62 generates the end signal EE at point D. On receiving SE, the DP processing section 4 writes the similarity between each standard pattern and the input pattern at that moment into memory 7; on receiving EE, it ends processing. On receiving EE, the decision section 5 reads the values in memory 7 and performs the same decision process as in Fig. 5.

The operation of the device is described next with reference to Fig. 6. Based on features such as autocorrelation coefficients and power extracted from the input pattern by the feature extraction section 1, the device first detects the start point and then starts the processing from the distance calculation section 2 onward. As processing proceeds and the speech at point A enters the feature extraction section 1, the power has almost vanished there, so A may be the end point; the end-point candidate detection section 61 detects A as an end-point candidate and sends the signal SE to the DP processing section 4. In response to SE, the DP processing section 4 controls whether the DP matching result up to the point where the candidate was detected, that is, up to A, is saved in memory 7. In the case of Fig. 6, the length of the standard pattern and the length of the input up to A differ greatly and cannot be reconciled within the alignment width l, so the data saved in memory 7 is not a matching result as such but reject data, or data representing the lowest similarity. The end-point detection section 62 is a circuit that confirms the end point when no speech is input again, that is, when there is no power, within several hundred milliseconds after the end-point candidate detection section 61 has detected a candidate. Between A and B, speech is input again at B within several hundred milliseconds, so the end signal EE is not output.

At points B and C, likewise, the end-point candidate detection section 61 judges for the same reason that an end-point candidate has been reached and sends the signal SE to the DP processing section 4. Each time a matching result arrives from the DP processing section 4, memory 7 is overwritten with the new result. For example, the matching results between the input pattern and each standard pattern that were written into memory 7 at point A are overwritten at point B with the matching results at B. Likewise, the results at B are overwritten at C with the results at C. For the standard pattern of Fig. 6, the matching result at C takes a definite value, namely the result of DP processing up to the point marked with a white circle, because the difference in length between the standard pattern and the speech (word) lies within the range -γ to +γ. Between B and C, speech is input again after the candidate at B is detected, so the end-point detection section 62 does not output the end signal. Between C and D, however, no speech is input for several hundred milliseconds, so the utterance can be judged to have ended. The end-point detection section 62 therefore sends the end signal EE to the DP processing section 4 and the decision section 5 at point D. The DP processing section 4 then stops processing, and the decision section 5 reads the similarity to each standard pattern from memory 7, finds the standard pattern with the highest similarity, and outputs it as the answer. Since no speech is input after point C, no further end-point candidate appears, and memory 7 therefore holds the matching results up to point C, the actual end point. In this way the invention achieves segmentation and DP matching in real time.
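The flow described above can be sketched in software as follows: the DP advances one input frame at a time, its current score for every standard pattern is copied into a small store (playing the role of memory 7) whenever an end-point candidate appears, and the stored scores are used for the decision once the end point is confirmed. This is only an illustrative sketch. The class and function names, the power threshold, the `hangover` length, and the use of infinity as the reject value are assumptions of this example; "highest similarity" is implemented as the smallest S, following the convention that a smaller S means a closer match.

```python
import math


def frame_distance(x, y):
    """Squared Euclidean distance between two feature vectors (equation (3))."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


class StreamingDP:
    """Column-by-column DP of equations (6) to (9) against one standard pattern."""

    def __init__(self, standard, gamma, W=2.0):
        self.Y, self.gamma, self.W = standard, gamma, W
        self.i = 0                                  # input frames consumed so far
        self.D = [math.inf] * (len(standard) + 1)   # D(i, j) for the current i
        self.L = [0.0] * (len(standard) + 1)        # L(i, j) for the current i

    def push(self, x):
        """Consume one input frame and advance the DP by one column."""
        self.i += 1
        i, J = self.i, len(self.Y)
        D = [math.inf] * (J + 1)
        L = [0.0] * (J + 1)
        for j in range(max(1, i - self.gamma), min(J, i + self.gamma) + 1):
            d = frame_distance(x, self.Y[j - 1])
            if i == 1 and j == 1:
                D[j], L[j] = d, 1.0                 # equation (8)
                continue
            D[j], L[j] = min(
                (D[j - 1] + d, L[j - 1] + 1),                          # from (i, j-1)
                (self.D[j - 1] + self.W * d, self.L[j - 1] + self.W),  # from (i-1, j-1)
                (self.D[j] + d, self.L[j] + 1),                        # from (i-1, j)
            )
        self.D, self.L = D, L

    def similarity(self):
        """D(i, J) / L(i, J) for the frames seen so far; infinity means rejected."""
        return self.D[-1] / self.L[-1] if self.D[-1] < math.inf else math.inf


def recognize(frames, powers, standards, threshold, hangover, gamma):
    """Return the index of the best standard pattern, or None if none is accepted."""
    matchers = None        # created when the start point is detected (signal SS)
    memory7 = None         # one saved score per standard pattern
    silent_run = 0
    confirmed = False
    for x, p in zip(frames, powers):
        voiced = p > threshold
        if matchers is None:
            if not voiced:
                continue                   # still waiting for the start point
            matchers = [StreamingDP(y, gamma) for y in standards]
        if voiced:
            silent_run = 0
        else:
            silent_run += 1
            if silent_run == 1:            # end-point candidate (SE): overwrite store
                memory7 = [m.similarity() for m in matchers]
            if silent_run >= hangover:     # end point confirmed (EE): stop the DP
                confirmed = True
                break
        for m in matchers:
            m.push(x)                      # the DP itself never waits
    if not confirmed or min(memory7) == math.inf:
        return None
    return memory7.index(min(memory7))


# Word 0 contains a one-frame silent gap in mid-word, like a geminate consonant.
standards = [[[1.0], [0.0], [1.0]], [[1.0], [1.0], [1.0]]]
frames = [[1.0], [0.0], [1.0], [0.0], [0.0], [0.0]]
powers = [5.0, 0.0, 5.0, 0.0, 0.0, 0.0]
print(recognize(frames, powers, standards, threshold=1.0, hangover=3, gamma=1))  # 0
```

In this toy run the first candidate (after frame 0) stores only reject values because one input frame cannot reach the end of a three-frame pattern within γ = 1; the second candidate overwrites them with the scores for the full word, and those are the scores read out when the silence is confirmed. The store is written at candidates and read once at the end, so the two accesses never coincide.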

Next, the amount of hardware in the conventional circuit of Fig. 5 is compared with that of the circuit of Fig. 7, the embodiment of the invention. The speech segmentation section 6 of Fig. 7 has exactly the same capability as the conventional one; the only addition is a function that outputs a flag whenever a likely end point (end-point candidate) is found in the course of determining the end point. The increase in hardware over the conventional circuit is therefore negligible. Memory 7 is written by the DP processing section 4 and read by the decision section 5, but the two never access it at the same time, so no contention arises; control is simpler and the hardware smaller than in the conventional real-time circuit. Memory 7 needs a capacity equal to the number of standard-pattern words, so it grows as the number of words to be recognized increases. For a practical recognition device with a vocabulary of a few hundred words or fewer, however, it requires less capacity than memory 10 of the conventional device of Fig. 5.

As described above, the present invention makes it possible to realize a device that requires less hardware than conventional devices and that segments the input pattern and performs the subsequent processing in real time, which is a significant benefit.

[Brief explanation of drawings]

Figs. 1 to 3 illustrate the principle of pattern matching by DP; Fig. 4 shows the change in power of an input speech pattern; Fig. 5 is a block diagram of a conventional pattern matching device; Fig. 6 shows the alignment between the input pattern of Fig. 4 and a standard pattern; and Fig. 7 is a block diagram of one embodiment of the pattern matching device according to the present invention. 6: speech segmentation section.

Claims

[Claims]

1. A pattern matching device that matches an input pattern against a plurality of standard patterns while extracting a valid signal section in real time, characterized by comprising: means for detecting the start point of the valid signal section; means for detecting end-point candidates of the valid signal section; means for detecting the end point of the valid signal section; matching processing means that, upon detection of the start point, starts a matching process for determining the degree of similarity between the input pattern and each standard pattern and that, upon detection of an end-point candidate, outputs the degree of similarity between the input pattern and each standard pattern at the time the candidate is detected; storage means for holding the output of the matching processing means at the time the end-point candidate is detected; and determining means that, when the end point is detected, reads from the storage means the output of the matching processing means obtained at the end-point candidate corresponding to that end point, and determines and outputs the standard pattern having the maximum degree of similarity.