JPH0552516B2

Info

Publication number: JPH0552516B2
Application number: JP58048112A
Authority: JP
Inventors: Seiichi Nakagawa; Hidekazu Tsuboka
Original assignee: Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Holdings Corp
Priority date: 1983-03-22
Filing date: 1983-03-22
Publication date: 1993-08-05
Also published as: JPS59173884A

Description

[Detailed description of the invention]

Field of Industrial Application

The present invention relates to a pattern comparison device that compares an input pattern with a plurality of standard patterns, each represented as a sequence of feature vectors, and identifies the input pattern. In particular, it relates to a pattern comparison device applicable to tasks such as the recognition of spoken words.

Conventional Configuration and Its Problems

Speech is the most natural means by which humans generate information, so being able to use it as the input means of a human-machine system would be of great benefit.

Word speech recognition devices based on speaker-dependent registration have already been put into practical use. In this scheme, the speaker who intends to use the device first utters every word to be recognized; each utterance is converted into a sequence of feature vectors and registered in a word dictionary as a standard pattern. At recognition time, the uttered speech is likewise converted into a sequence of feature vectors, its closeness to each word in the dictionary is computed according to a predetermined rule, and the most similar word is output as the recognition result.

This method works well when the number of words to be recognized is small, but when the vocabulary grows to hundreds or thousands of words, three problems become impossible to ignore:

(1) The burden on the speaker at registration increases significantly.
(2) The time needed to compute the similarity or distance between the uttered speech and the standard patterns increases significantly, and the response of the recognition device slows down.
(3) The memory required for the word dictionary becomes very large.

Object of the Invention

The present invention provides a novel pattern comparison device that solves these three problems.

Structure of the Invention

For recognizing word speech, the present invention takes the monosyllable as the basic unit of recognition. Each monosyllable, expressed as a sequence of feature vectors, is stored as a monosyllable standard pattern, and a word dictionary stores each word as a concatenation of codes, each code designating one monosyllable. The word dictionary tells how the monosyllables are concatenated in each word; the feature vectors of the corresponding monosyllable standard patterns, concatenated in that order, serve as the word standard pattern. At recognition time, the input pattern (the uttered speech converted into a sequence of feature vectors) is compared with every word standard pattern, and the closest word is taken as the recognition result.
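The relationship between the word dictionary and the monosyllable standard patterns can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `build_word_pattern`, the romanized codes, and the two-dimensional feature vectors are all hypothetical.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]
Template = List[Vector]  # one monosyllable standard pattern: a sequence of feature vectors


def build_word_pattern(codes: List[str], templates: Dict[str, Template]) -> Template:
    """Concatenate monosyllable standard patterns in the order given by the dictionary entry."""
    pattern: Template = []
    for code in codes:
        pattern.extend(templates[code])
    return pattern


# Hypothetical monosyllable standard patterns (toy 2-dimensional feature vectors).
templates = {
    "sa": [[1.0, 0.0], [0.8, 0.2]],
    "ka": [[0.0, 1.0]],
    "na": [[0.5, 0.5], [0.4, 0.6]],
}

# The word dictionary stores only code strings, never feature vectors.
word_dictionary = {
    "sakana": ["sa", "ka", "na"],
    "kasa": ["ka", "sa"],
}

sakana = build_word_pattern(word_dictionary["sakana"], templates)
kasa = build_word_pattern(word_dictionary["kasa"], templates)
print(len(sakana), len(kasa))
```

Adding a word to such a dictionary adds only a short list of codes; the stored feature vectors are untouched.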
In Japanese, any word can be expressed as a concatenation of monosyllables, so the standard pattern of any word can be expressed as a concatenation of monosyllable standard patterns. The speaker therefore does not need to utter every word at registration; uttering the monosyllables is enough. Japanese has 101 monosyllables, so however large the vocabulary grows, only 101 monosyllable utterances need to be registered. This solves problem (1).

Further, as described later, the inter-vector distances, which account for most of the computation, need not be recomputed each time a word standard pattern is matched. For each frame of the input speech they are computed once against each monosyllable standard pattern. This cost does not change however much the number of words grows, which solves problem (2).

Finally, once the feature-vector sequence of each monosyllable is stored, the word dictionary need only store, for each word, its monosyllable string as a string of symbols. Memory therefore grows little as the vocabulary grows, which solves problem (3).

Let $q(w,l)$ denote the name of the $l$-th monosyllable of the $w$-th word, let $A = a_1 a_2 \cdots a_I$ be the speech pattern produced when several monosyllables are uttered in succession, and let $R^{n} = b^{n}_{1} b^{n}_{2} \cdots b^{n}_{J^{n}}$ be the standard pattern of the $n$-th monosyllable, where $w = 1, 2, \ldots, W$; $l = 1, 2, \ldots, L_w$; $n = 1, 2, \ldots, N$; and $a_i$, $b^{n}_{j}$ are feature vectors. The standard pattern $R^{w}$ of the $w$-th word is then

$$R^{w} = R^{q(w,1)} R^{q(w,2)} \cdots R^{q(w,L_w)} = b^{q(w,1)}_{1} b^{q(w,1)}_{2} \cdots b^{q(w,1)}_{J^{q(w,1)}} \, b^{q(w,2)}_{1} b^{q(w,2)}_{2} \cdots b^{q(w,2)}_{J^{q(w,2)}} \cdots b^{q(w,L_w)}_{1} b^{q(w,L_w)}_{2} \cdots b^{q(w,L_w)}_{J^{q(w,L_w)}}$$

where writing patterns side by side denotes their concatenation. The pattern comparison device of the present invention performs DP matching between $R^{w}$ and the input speech pattern $A$, and finds the $w$ for which the resulting distance $D(A, R^{w})$ between the two patterns is smallest.

Description of the Embodiments

FIG. 1 is a block diagram of a first embodiment of the present invention.

1 is the input terminal for the speech signal. 2 is a feature extraction unit that converts the input speech signal into a sequence of sets of numerical values (feature vectors) by frequency analysis, LPC analysis, PARCOR analysis, correlation analysis, or the like. 3 is a monosyllable standard pattern storage unit: for the $n$-th of the $N$ Japanese monosyllables ($n = 1, 2, \ldots, N$), the pattern $R^{n}$ obtained by converting the monosyllable into the feature-vector sequence $\{b^{n}_{j}\}$ with the feature extraction unit 2 is registered here in advance, before recognition, as the monosyllable standard pattern.

4 is a speech interval detection unit that detects the speech interval by a well-known method, for example by obtaining the power of each frame from the output of the feature extraction unit 2 and taking as the speech interval the period during which the power exceeds a predetermined threshold. 5 is a frame counting unit that counts the frames from the start to the end of the speech interval; terminal 6 outputs how many frames after the start of the speech interval the current frame is.

7 is an inter-vector distance calculation unit that computes the distance $d^{n}(i,j)$ between the input vector $a_i$ of the $i$-th frame and each vector $b^{n}_{j}$ of the $n$-th monosyllable standard pattern $R^{n}$, for $n = 1, 2, \ldots, N$ and $j = 1, 2, \ldots, J^{n}$. The simplest definition of the distance between vectors is the city-block distance. Writing the vectors as $a_i = (a_{i1}, a_{i2}, \ldots, a_{in})$ and $b^{n}_{j} = (b^{n}_{j1}, b^{n}_{j2}, \ldots, b^{n}_{jn})$, where the $n$ in the component subscripts is the vector dimension and not the monosyllable index, the distance can be defined as

$$d^{n}(i,j) = \sum_{k=1}^{n} \left| a_{ik} - b^{n}_{jk} \right| \qquad (1)$$

8 is an inter-vector distance storage unit that stores the distances $d^{n}(i,j)$ obtained by the inter-vector distance calculation unit 7. 9 is the word dictionary, in which each word to be recognized is stored as a concatenation of codes designating monosyllables. 10 is a cumulative distance storage unit that holds the earlier cumulative distances needed for the cumulative distance computation described next. 11 is a cumulative distance calculation unit: from the code strings stored in the word dictionary 9 it learns how the monosyllables of each word are concatenated and, following that order, computes the cumulative distance up to the current frame from the inter-vector distances held in the storage unit 8 and the cumulative distances held in the storage unit 10. Its results are stored in the cumulative distance storage unit 10. 12 is a decision unit: when the input of the word speech is complete, it reads the final-frame cumulative distance of each word from the storage unit 10 and outputs the word with the smallest value as the recognition result. 13 is the output terminal for the recognition result.

The operation of each unit is now described in more detail, for the processing at the $i$-th frame.

This embodiment uses the DP matching path shown in FIG. 2. Consider a lattice graph with the input frame number $i$ on the horizontal axis and the standard pattern frame number $j$ on the vertical axis, and let $D^{n}(i,j)$ be the cumulative distance, along the path from coordinate $(1,1)$ to coordinate $(i,j)$, between the partial pattern formed by frames 1 to $j$ of the standard pattern $R^{n}$ and the partial pattern formed by frames 1 to $i$ of the input pattern $A$. With this path, $D^{n}(i,j)$ satisfies the recurrence

$$D^{n}(i,j) = d^{n}(i,j) + \min \left\{ \begin{array}{l} D^{n}(i-2,\,j-1) + d^{n}(i-1,\,j) \\ D^{n}(i-1,\,j-1) \\ D^{n}(i-1,\,j-2) \end{array} \right. \qquad (2)$$

First, the distances between the output vector $a_i$ of the feature extraction unit 2 and all the vectors of the monosyllable standard patterns are computed by the inter-vector distance calculation unit 7 as described above and stored in the inter-vector distance storage unit 8. The only inter-vector distances needed to evaluate equation (2) are $d^{n}(i,j)$ and $d^{n}(i-1,j)$, so the storage unit 8 need only hold the distances for the current frame $i$ and the previous frame $i-1$ of the input pattern, for $n = 1, 2, \ldots, N$ and $j = 1, 2, \ldots, J^{n}$.

The cumulative distance calculation unit 11 essentially evaluates equation (2), but does so along the monosyllable string supplied by the word dictionary 9. Consider matching against word $w$ ($w = 1, 2, \ldots, W$). Word $w$ consists of $L_w$ monosyllables, and its $l$-th monosyllable is written $(w,l)$. For word $w$, the cumulative distance $D^{(w,l)}(i,j)$ from coordinate $(1,1)$ of $(w,1)$ to coordinate $(i,j)$ of $(w,l)$ is computed as a continuation of the matching result up to the preceding monosyllable $(w,l-1)$, and $D^{(w,L_w)}(I, J^{q(w,L_w)})$ is taken as the cumulative distance resulting from matching the input pattern against word $w$.

With the matching path of FIG. 2, the cumulative distance in the $l$-th monosyllable of word $w$ takes as its initial values the cumulative distances at the last two frames of the $(l-1)$-th monosyllable. It is therefore easiest to evaluate equation (2) separately for $j = 1, 2$ and for $3 \le j \le J^{q(w,l)}$.
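The city-block distance of equation (1) and the recurrence of equation (2) can be sketched for a single standard pattern as follows. This is an illustrative sketch only: the names `city_block` and `dp_distance` are hypothetical, and the boundary handling (the path starts at (1, 1) and any cell outside the lattice is treated as unreachable) is an assumption made so that the example is self-contained.

```python
import math
from typing import List, Sequence


def city_block(a: Sequence[float], b: Sequence[float]) -> float:
    """Equation (1): sum of absolute component differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def dp_distance(inputs: List[Sequence[float]], template: List[Sequence[float]]) -> float:
    """Cumulative distance D(I, J) under recurrence (2), using 1-based frame indices."""
    I, J = len(inputs), len(template)

    # Local distances d(i, j); row 0 and column 0 are unused padding.
    d = [[0.0] * (J + 1) for _ in range(I + 1)]
    for i in range(1, I + 1):
        for j in range(1, J + 1):
            d[i][j] = city_block(inputs[i - 1], template[j - 1])

    D = [[math.inf] * (J + 1) for _ in range(I + 1)]

    def prev(i: int, j: int) -> float:
        # Assumption: cells outside the lattice are unreachable (infinite distance).
        return D[i][j] if i >= 1 and j >= 1 else math.inf

    D[1][1] = d[1][1]  # assumption: the matching path starts at (1, 1)
    for i in range(2, I + 1):
        for j in range(1, J + 1):
            best = min(
                prev(i - 2, j - 1) + d[i - 1][j],  # diagonal step followed by a horizontal step
                prev(i - 1, j - 1),                # diagonal step
                prev(i - 1, j - 2),                # step that skips one template frame
            )
            D[i][j] = d[i][j] + best
    return D[I][J]


print(dp_distance([[0.0], [0.0], [1.0]], [[0.0], [1.0]]))
```

The sketch keeps the whole lattice in memory for readability; as the text notes, only the two most recent input frames are actually needed.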

Writing $q(w,l)$ for the name of monosyllable $(w,l)$, the recurrence becomes:

[Table]

The initial conditions are

$$D^{(w,0)}(-1,0) = 0,\quad d^{q(w,1)}(0,1) = 0,\quad D^{(w,0)}(i,0) = \infty,\quad D^{(w,0)}(i,-1) = \infty,\quad D^{(w,l)}(-1,j) = \infty,\quad D^{(w,l)}(0,j) = \infty,\quad J^{q(w,0)} = 0.$$

The results of these computations are stored successively in the cumulative distance storage unit 10. As is clear from equation (2) and FIG. 2, the only past cumulative distances needed to process the $i$-th frame are those of frames $i-1$ and $i-2$, so the storage unit 10 need only hold the cumulative distances of the previous frame and the frame before it.

As a result of these computations, the final value $D^{(w,L_w)}(i, J^{q(w,L_w)})$ of each word at the $i$-th frame is also stored in the cumulative distance storage unit 10. The description so far covers the processing of the $i$-th frame, where $i$ is set by the count of the frame counting unit 5. The processing is therefore repeated each time the frame advances by one. When the speech interval ends, that is, when $i = I$, the storage unit 10 holds the final cumulative distance $D^{(w,L_w)}(I, J^{q(w,L_w)})$ of every word. When the speech interval detection unit 4 detects the end of the speech, these values are read out for $w = 1, \ldots, W$, and the decision unit 12 finds

$$\hat{w} = \operatorname*{argmin}_{w} \left[ D^{(w,L_w)}(I, J^{q(w,L_w)}) \right]$$

and outputs $\hat{w}$ as the recognition result. Here $\operatorname*{argmin}_{x}[f(x)]$ denotes the $x$ that minimizes $f(x)$.

FIG. 3 is a flowchart of the operation of this embodiment; a software implementation can follow the same flowchart. Steps (100) to (105) perform initialization. Steps (106) to (115) are the processing for the $i$-th frame: steps (107) to (109) compute the inter-vector distances and steps (110) to (115) compute the cumulative distances. Within the latter, step (111) performs initialization, step (113) computes the cumulative distance for $j = 1, 2$, and steps (114) to (115) compute it for $3 \le j \le J^{q(w,l)}$. Step (118) finally selects the word with the smallest cumulative distance and corresponds to the computation performed by the decision unit 12 of FIG. 1.

A second embodiment is described next. It is an improvement on the first embodiment. When monosyllables are uttered in succession, the pattern near each monosyllable boundary becomes ambiguous. Matching accuracy can therefore be improved by leaving the start and end of each monosyllable of the standard pattern free, in other words by allowing some frames at the start and end to be skipped during matching. In FIG. 1 this is easily achieved by slightly changing how the cumulative distance calculation unit 11 obtains the cumulative distance, that is, by changing its recurrence.

Let the endpoint-free sections at the head and tail of each monosyllable standard pattern be $\delta_1$ frames and $\delta_2$ frames, respectively. For each monosyllable standard pattern $n$, the frame at which matching starts is one of frames 1 to $\delta_1$, and the frame at which matching ends is one of frames $J^{n} - \delta_2$ to $J^{n}$; in each case the optimal frame is selected. The constraint of FIG. 2 is again adopted for the matching path, and the cumulative distance $D^{(w,l)}(i,j)$ is modified accordingly.
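The endpoint-free sections can be made concrete with a small sketch. It is illustrative only: the function names `free_frames` and `best_exit`, the clamping of the first end frame to 1, and the representation of one column of cumulative distances as a Python list are assumptions, not taken from the patent.

```python
from typing import List, Tuple


def free_frames(J: int, delta1: int, delta2: int) -> Tuple[range, range]:
    """Admissible 1-based start frames 1..delta1 and end frames J-delta2..J."""
    starts = range(1, delta1 + 1)
    ends = range(max(1, J - delta2), J + 1)  # assumption: never go below frame 1
    return starts, ends


def best_exit(column: List[float], delta2: int) -> float:
    """Smallest cumulative distance over the end-free frames of one monosyllable.

    column[j - 1] is the cumulative distance at template frame j for a given input frame.
    """
    _, ends = free_frames(len(column), 1, delta2)
    return min(column[j - 1] for j in ends)


starts, ends = free_frames(J=6, delta1=2, delta2=1)
print(list(starts), list(ends))
print(best_exit([9.0, 7.0, 6.0, 4.0, 2.5, 3.0], delta2=1))
```

In this toy column the best exit is frame 5 rather than the final frame 6, which is the kind of choice that leaving the end point free permits.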

For the $l$-th monosyllable, the cumulative distance is then computed as:

[Table]

FIG. 4 is a flowchart of the operation of the second embodiment; steps bearing the same numbers as in FIG. 3 perform the same processing as in FIG. 3. Because the end-free section is $J^{q(w,l)} - \delta_2$ to $J^{q(w,l)}$, step (117') sets $D^{(w,0)}(i-1, J^{q(w,0)} - k)$, which appears in the cumulative distance computation for $l = 1$, to $\infty$ for $k = 0, 1, 2, \ldots, \delta_2$. Step (118) obtains the cumulative distance up to frame $i-1$ of monosyllable $(w,l-1)$ as the minimum within the end-free section. Step (119) handles $j = 1, 2$; steps (120) and (121) handle $3 \le j \le \delta_1$; and steps (122) and (123) handle $\delta_1 + 1 \le j \le J^{q(w,l)}$. In this way the start-free matching described above is realized.

As described above, the second embodiment improves the recognition rate because the unstable portions where monosyllables join can be skipped during matching. To make the matching finer still, a method that introduces weights is proposed. Ordinary matching treats all frames to be matched with uniform weight. If instead the important parts of each pattern, those that best express its characteristics, are matched with large weights and the remaining parts with small weights, then patterns that are close in distance, and hence easily confused, can also be discriminated well.

FIG. 5 shows a third embodiment, a more reliable recognition device obtained by introducing weights into the first embodiment. It differs from the first embodiment of FIG. 1 in that a weight coefficient storage unit 14 is added and the cumulative distance calculation unit 11 uses these weight coefficients in its computation. With the matching path of FIG. 2, the weights of the individual paths can be set as shown in FIG. 6. With this weighting, however the matching path between the $n$-th standard pattern and the input pattern is chosen, the sum of the weights along the path is

$$\sum_{j=1}^{J^{n}} H^{n}(j) + M,$$

where $M$ is the number of frames of the input pattern, and is therefore constant for a given standard pattern and input pattern.

The computation in the cumulative distance calculation unit 11 is as follows.

[Table]

The decision is then made according to

[Table]

that is, the recognition result is the word for which the cumulative distance up to the final frame, divided by the total weight for that word, is smallest ($I$ is common to all words and can be omitted). If the weight sum $\sum_{j=1}^{J^{n}} H^{n}(j)$ is made the same for every monosyllable, then, since $I$ is also constant while matching against all the standard patterns (words), $\hat{w}$ can be obtained as

$$\hat{w} = \operatorname*{argmin}_{w} \left[ D^{(w,L_w)}(I, J^{q(w,L_w)}) / L_w \right].$$

FIG. 7 is a flowchart of the operation of the third embodiment. Steps (200) to (201) compute $d^{n}(1,1)$ in advance. Steps (202) to (207) set the initial values for the recurrence. Processing for $i = 1$ is complete at step (207), so steps (208) to (217) are the processing for $i = 2$ onward. Steps (209) to (211) compute the inter-vector distances at input frame $i$ against all monosyllables. Steps (212) to (217) compute the cumulative distance $D^{(w,L_w)}(I, J^{q(w,L_w)})$ for each word $w$. Step (213) supplies the initial values for that computation. Steps (214) to (217) compute the cumulative distance for the $l$-th monosyllable of word $w$: for each monosyllable, step (215) handles $j = 1, 2$ and steps (216) to (217) handle $3 \le j \le J^{q(w,l)}$. Step (218) corresponds to the decision unit 12 and operates as already described.

The first to third embodiments use the constraint of FIG. 2 for the matching path, but various other paths, such as those shown in FIGS. 9(a) to 9(d), can also be considered. In that case the weight of each path need only be chosen so that, with the standard pattern and the input pattern fixed, the sum of the weights along the matching path does not depend on which path is taken. FIG. 8 shows an example of how the weights can be set. Setting $H^{n}(j) = 0$ gives the ordinary case in which every path has weight 1.

The embodiments above register monosyllable speech patterns as the standard patterns. If word speech patterns are registered instead, connected-word speech can be recognized in exactly the same way. This is especially effective when the ways in which the words may be connected are predetermined.

Alternatively, if VCV (vowel + consonant + vowel) patterns are held as the standard patterns in place of monosyllables, and word standard patterns are built by concatenating them, the recognition rate can be improved for more naturally uttered input speech.

In the third embodiment, the sum of the weights along the matching path was made constant for each monosyllable regardless of the path; it may of course instead be made constant for the word as a whole.

The weighting described in the third embodiment can also be introduced into the endpoint-free matching of the second embodiment. This is easily done by setting the weights $H^{n}(j)$ for the endpoint-free sections to zero.
The embodiments deal only with speech signals. More generally, when the patterns to be recognized are composed as a succession of basic patterns, and the ways in which the basic patterns may be concatenated are predetermined, preparing the basic patterns as standard patterns allows the continuous patterns to be recognized in the same way as in these embodiments.

Effects of the Invention

The present invention resolves at once the following problems of large-vocabulary word recognition devices based on speaker-dependent registration:

(1) The burden on the speaker when registering standard patterns is large.
(2) Matching the standard patterns against the input pattern takes time, and the response of the recognition device is slow.
(3) The memory needed to store the standard patterns becomes enormous.

In addition, the recognition rate can be improved by introducing free start and end points and weights.
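Effects (2) and (3) can be illustrated with a rough count. Only the figure of 101 monosyllables comes from the description; the vocabulary size, syllables per word, and frames per monosyllable below are hypothetical values chosen purely for the arithmetic.

```python
N_SYLLABLES = 101          # number of Japanese monosyllables stated in the description
FRAMES_PER_SYLLABLE = 20   # hypothetical average length of a monosyllable standard pattern
WORDS = 1000               # hypothetical vocabulary size W
SYLLABLES_PER_WORD = 4     # hypothetical average number of monosyllables per word

# Whole-word standard patterns: every word stores, and is compared against, its own frames.
word_template_frames = WORDS * SYLLABLES_PER_WORD * FRAMES_PER_SYLLABLE

# Monosyllable standard patterns: per input frame, local distances are computed once per
# stored monosyllable frame, however many words reuse that monosyllable.
syllable_template_frames = N_SYLLABLES * FRAMES_PER_SYLLABLE

# The word dictionary itself stores only one code per monosyllable of each word.
dictionary_codes = WORDS * SYLLABLES_PER_WORD

print(word_template_frames)      # 80000
print(syllable_template_frames)  # 2020
print(dictionary_codes)          # 4000
```

Under these assumed figures, the stored feature vectors and the per-frame local distance computations shrink from 80000 to 2020 template frames. The cumulative distance recurrence itself is still evaluated for every word, so the saving applies to the inter-vector distances, which the description identifies as the dominant cost.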

[Brief explanation of the drawing]

FIG. 1 is a block diagram of the pattern comparison device of the first embodiment of the present invention. FIG. 2 shows the DP matching path. FIG. 3 is a flowchart of the operation of the first embodiment. FIG. 4 is a flowchart of the operation of the second embodiment. FIG. 5 is a block diagram of the pattern comparison device of the third embodiment of the present invention. FIG. 6 shows the weighting of each path. FIG. 7 is a flowchart of the operation of the third embodiment. FIG. 8 shows examples of various weightings. FIGS. 9(a) to 9(d) show various paths.

2: feature extraction unit; 3: monosyllable standard pattern storage unit; 9: word dictionary; 11: cumulative distance calculation unit; 12: decision unit.

Claims

[Claims]

1. A pattern comparison device comprising:

means for converting an input signal into a sequence of feature vectors $a_1, a_2, \ldots, a_I$;

a word dictionary that stores, for $n$-th standard patterns $R^{n} = b^{n}_{1}, b^{n}_{2}, \ldots, b^{n}_{J^{n}}$ each consisting of a sequence of feature vectors (where $N$ is the number of types of standard pattern and $n \in \{1, 2, \ldots, N\}$), each combination pattern formed by concatenating these standard patterns (hereinafter the name of a combination pattern is written $w \in \{1, 2, \ldots, W\}$ and each $w$ is called a word), expressed as a sequence of names of standard patterns;

means for calculating a cumulative distance (similarity) between the combination pattern $R^{q(w,1)} R^{q(w,2)} \cdots R^{q(w,L_w)}$ of standard patterns corresponding to word $w$ (where $q(w,k)$ is the name of the $k$-th, $k \in \{1, 2, \ldots, L_w\}$, of the $L_w$ standard patterns constituting word $w \in \{1, 2, \ldots, W\}$, and writing patterns side by side denotes their concatenation) and the input pattern $a_1, a_2, \ldots, a_I$, the cumulative distance (similarity) being obtained by minimizing (maximizing), by dynamic programming, a function of the pairings between the feature vectors $b^{q(w,k)}_{j}$ constituting the combination pattern and the feature vectors $a_i$ constituting the input pattern; and

determination means for finding the word for which this cumulative distance (similarity) is minimum (maximum);

wherein the device includes inter-vector distance (similarity) calculation means that calculates the distance (similarity) $d^{n}(i,j)$ between the $j$-th frame $b^{n}_{j}$ of the standard pattern $R^{n}$ and the $i$-th frame $a_i$ of the input pattern for $n = 1, \ldots, N$ and $j = 1, \ldots, J^{n}$, and intermediate cumulative distance (similarity) calculation means that, where $(w,k)$ denotes the $k$-th standard pattern forming word $w$, calculates the cumulative distance (similarity) $D^{(w,k)}(i,j)$ between the partial pattern from the first frame of $(w,1)$ to the $j$-th frame of $(w,k)$ and the partial pattern from the first frame to the $i$-th frame of the input pattern, for each frame $i$, for $k = 1, \ldots, L_w$ and $j = 1, \ldots, J^{q(w,k)}$, using the calculated $d^{n}(i,j)$;

the device being characterized in that $D^{(w,L_w)}(I, J^{q(w,L_w)})$ is taken as the distance (similarity) between the standard pattern sequence for word $w$ and the input pattern.