JPS6335997B2

Info

Publication number: JPS6335997B2
Application number: JP55174340A
Authority: JP
Inventors: Hidekazu Tsuboka; Yoshiteru Mifune; Satoru Kabasawa
Original assignee: Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Holdings Corp
Priority date: 1980-12-10
Filing date: 1980-12-10
Publication date: 1988-07-18
Also published as: JPS5797598A

Description

[Detailed description of the invention]

The present invention relates to a speech recognition device that samples an input speech signal at fixed intervals, converts it into a phoneme sequence, compares that sequence with each word of a word dictionary registered in advance in phoneme symbols, and outputs the word with the highest similarity as the recognition result. Its object is to speed up the comparison with the word dictionary by merging the converted phoneme sequence.

FIG. 1 is a block diagram showing the conventional configuration of a speech recognition device that first decomposes the input speech signal into a phoneme sequence and then performs word recognition. Reference numeral 1 denotes a speech signal input terminal. 2 is a frequency analysis unit, consisting for example of an n-channel filter bank, that performs frequency analysis of the input speech signal. That is, for the input speech signal, a voltage corresponding to the output level of each band-pass filter in the filter bank is output as the magnitude of the corresponding frequency component. If these outputs are sampled every 10 msec, for example, the speech signal is converted into a sequence of n-dimensional vectors, one every 10 msec (each 10 msec interval is called a frame).

3 is a standard pattern storage unit that stores the standard patterns of the phonemes; each phoneme is stored as the n-dimensional vector obtained by frequency analysis of that phoneme with the filter bank. 4 is a phoneme recognition unit that calculates, for each n-dimensional vector output by the frequency analysis unit 2, which phoneme pattern in the standard pattern storage unit 3 it is closest to, and outputs the phoneme corresponding to the closest standard pattern as the recognition result. At this point the input speech has been converted into a phoneme sequence. 5 is a word dictionary in which each word of the vocabulary to be recognized is registered in advance as a combination of phonemes, roughly like a romanized spelling. 6 is a word recognition unit that compares the phoneme sequence obtained by the phoneme recognition unit 4 with each word of the word dictionary 5, on the basis of inter-phoneme similarities determined in advance, experimentally or theoretically, for each pair of phonemes, and outputs the word with the highest similarity as the recognition result. 7 is an output terminal for the decision result.

Because the phoneme sequence recognized by the phoneme recognition unit 4 contains many errors, each input n-dimensional vector is not mapped to a single phoneme. Instead, it is output as a set consisting of a first candidate phoneme, a second candidate phoneme, and the reliability of the first candidate phoneme. FIG. 2 shows an example. For the i-th vector V_i, A_i is the first candidate phoneme, B_i is the second candidate phoneme, and r_i is the reliability of the first candidate. The phonemes and the reliability are determined by calculating the distance between V_i and each standard pattern vector; the standard pattern with the smallest distance becomes the first candidate and the one with the next smallest distance becomes the second candidate. With d_i1 the former distance and d_i2 the latter, the reliability r_i is obtained as

    r_i = d_i2 / (d_i1 + d_i2)

The distance between the input vector V_i and the standard pattern vectors A_i and B_i for the phonemes A_i and B_i can be defined as an ordinary distance between vectors. For example, if it is defined as the Euclidean distance, then with

    A_i = (A_i1, A_i2, ..., A_in)
    B_i = (B_i1, B_i2, ..., B_in)
    V_i = (V_i1, V_i2, ..., V_in)

the distances become

    d_i1 = sqrt( Σ_{k=1..n} (A_ik − V_ik)^2 )
    d_i2 = sqrt( Σ_{k=1..n} (B_ik − V_ik)^2 )

Defined this way, r_i = 0.5 when A_i and B_i are about equally likely, and r_i approaches 1 as the likelihood of A_i increases.

To compare the phoneme sequence obtained in this way with each word of the word dictionary, a similarity must be defined between each element of the phoneme sequence and the phonemes that make up each word. The similarity between two phonemes, say P and Q, is obtained by a linear transformation of the distance between the two phonemes, where that distance is found by statistical processing of a large amount of data collected as n-dimensional vectors corresponding to P and Q. Call this similarity S_0(P, Q). The similarity S(i, j) between the i-th phoneme set of the phoneme sequence and the j-th phoneme D_j of the k-th word W_k, the dictionary word under comparison, can then be given, for example, by

    S(i, j) = r_i S_0(A_i, D_j) + (1 − r_i) S_0(B_i, D_j)

The similarity between the input phoneme sequence and the word W_k can be obtained from S(i, j) by well-known dynamic programming on a lattice graph with j on the vertical axis and i on the horizontal axis.

In the recognition described above, the output phoneme sequence of the phoneme recognition unit 4 is usually not fed to the word recognition unit 6 as is, but is first merged to reduce the number of phonemes. There are two reasons. First, with sampling at about 10 msec the same phoneme often occurs in consecutive frames, so the sequence is highly redundant and matching it directly against the word dictionary is inefficient; merging raises the recognition speed. Second, the phoneme recognition results are often wrong in unstable parts such as transitions between phonemes, and merging removes these errors.

The present invention provides a speech recognition device with a new merging method for the recognition scheme described above.

As explained for FIG. 2, if the first candidate phoneme of the i-th frame is A_i, the second candidate phoneme is B_i, and the reliability of A_i is r_i, then the reliability of B_i is 1 − r_i. The following describes how to merge the phoneme sequence from the j-th frame to the (j+k)-th frame and obtain the first candidate phoneme F_l, the second candidate phoneme S_l, and the reliability R_l. The phonemes contained in this range are A_j, A_(j+1), ..., A_(j+k), B_j, B_(j+1), ..., B_(j+k), with the associated reliabilities r_j, r_(j+1), ..., r_(j+k), 1 − r_j, 1 − r_(j+1), ..., 1 − r_(j+k), respectively. Suppose that m distinct phonemes occur among them and call them X_1, X_2, ..., X_m; each of A_j, ..., B_(j+k) is then equal to one of X_1, ..., X_m. Let u_i be the sum of the reliabilities attached to those of A_j, ..., B_(j+k) that are equal to X_i. If u_λ is the largest of u_1, u_2, ..., u_m and u_μ is the second largest, the merged result can be given by

    F_l = X_λ,  S_l = X_μ,  R_l = u_λ / (u_λ + u_μ)

FIG. 3 shows the configuration of a speech recognition device incorporating the merging method of the present invention. Elements 1 to 7 operate in the same way as in the conventional example, and 8 is a phoneme sequence correction unit that performs the merging according to the present invention.

As for the range of merging, experiments show that fixing it at units of three frames (that is, m = 3) is the simplest method and is relatively effective.

As an example, consider merging frames i−1 through i+1 (the case k = 2). In FIG. 2, the reliabilities of the phonemes contained in these frames are as shown in the table below.

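[Table]

    Phoneme      A_(i-1)   A_i   A_(i+1)   B_(i-1)       B_i       B_(i+1)
    Reliability  r_(i-1)   r_i   r_(i+1)   1 − r_(i-1)   1 − r_i   1 − r_(i+1)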
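The per-frame quantities used in this example (first candidate, second candidate, and the reliability r_i), together with the frame-to-dictionary similarity S(i, j), can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function names, the `templates` mapping, and the `s0` callable are hypothetical, and the fallback value 0.5 when both distances are zero is an assumption the text does not address.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def recognize_frame(v, templates):
    """Return (first candidate, second candidate, reliability) for frame vector v.

    `templates` (hypothetical) maps a phoneme label to its standard pattern
    vector and must contain at least two phonemes.
    """
    # Rank phonemes by distance between their standard pattern and v.
    ranked = sorted(templates, key=lambda p: euclidean(templates[p], v))
    first, second = ranked[0], ranked[1]
    d1 = euclidean(templates[first], v)
    d2 = euclidean(templates[second], v)
    # r = d2 / (d1 + d2): 0.5 when both candidates are equally close,
    # approaching 1 as d1 shrinks relative to d2.
    # Assumption: return 0.5 if both distances are exactly zero.
    r = d2 / (d1 + d2) if (d1 + d2) > 0 else 0.5
    return first, second, r


def frame_similarity(frame, d, s0):
    """S(i, j) = r_i * S0(A_i, D_j) + (1 - r_i) * S0(B_i, D_j).

    `frame` is an (A_i, B_i, r_i) triple, `d` is the dictionary phoneme D_j,
    and `s0` (hypothetical) is a callable giving the inter-phoneme similarity.
    """
    a, b, r = frame
    return r * s0(a, d) + (1.0 - r) * s0(b, d)


# Toy two-channel example with made-up standard patterns.
templates = {"a": [1.0, 0.0], "i": [0.0, 1.0], "u": [1.0, 1.0]}
frame = recognize_frame([0.9, 0.1], templates)
```

In this toy setup the frame vector lies closest to the pattern for "a", so "a" becomes the first candidate with a reliability between 0.5 and 1. A real system would use the n filter-bank channels and measured standard patterns in place of the made-up two-dimensional vectors.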

Suppose now, for example, that A_(i-1), A_i, and B_(i+1) are phoneme X_1, that B_(i-1) and B_i are phoneme X_2, and that A_(i+1) is phoneme X_3 (the case m = 3). Then

    u_1 = r_(i-1) + r_i + (1 − r_(i+1))
    u_2 = (1 − r_(i-1)) + (1 − r_i)
    u_3 = r_(i+1)

Hence, if u_1 ≥ u_2 ≥ u_3, then u_λ = u_1 and u_μ = u_2, and the merged result is

    F_l = X_1,  S_l = X_2,  R_l = u_1 / (u_1 + u_2)

As described above, according to the present invention the frames in the range to be merged can be merged rationally, on the basis of how many times each phoneme occurs in that range and the reliability of each occurrence. Because the phoneme sequence is matched against the word dictionary with its redundancy reduced, recognition speed is improved more efficiently than before, and accuracy is achieved as well.
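The merging rule and the worked example above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions: the function names are hypothetical, the reliability values 0.7, 0.6, and 0.55 are invented for the example, and letting the final block be shorter than three frames is an assumption the text does not cover.

```python
from collections import defaultdict


def merge_frames(frames):
    """Merge consecutive (first, second, r) triples into one triple.

    Each phoneme accumulates r when it appears as a first candidate and
    1 - r when it appears as a second candidate (the totals u_1..u_m).
    The two largest totals give the merged candidates and reliability.
    Assumes the first and second candidate of a frame are different
    phonemes, so at least two totals exist.
    """
    totals = defaultdict(float)
    for first, second, r in frames:
        totals[first] += r
        totals[second] += 1.0 - r
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    (x_lam, u_lam), (x_mu, u_mu) = ranked[0], ranked[1]
    return x_lam, x_mu, u_lam / (u_lam + u_mu)


def merge_sequence(frames, size=3):
    """Merge a frame sequence in fixed blocks of `size` frames.

    Assumption: a final block shorter than `size` is merged as is.
    """
    return [merge_frames(frames[s:s + size])
            for s in range(0, len(frames), size)]


# Worked example: A_(i-1) = A_i = B_(i+1) = X1, B_(i-1) = B_i = X2,
# A_(i+1) = X3, with invented reliabilities 0.7, 0.6, 0.55.
example = [("X1", "X2", 0.7), ("X1", "X2", 0.6), ("X3", "X1", 0.55)]
merged = merge_frames(example)
```

With these invented values, u_1 = 0.7 + 0.6 + 0.45 = 1.75, u_2 = 0.3 + 0.4 = 0.7, and u_3 = 0.55, so u_1 ≥ u_2 ≥ u_3 holds and the merged triple is (X1, X2, R_l) with R_l = 1.75 / 2.45, approximately 0.714. Merging three frames into one set also cuts the length of the sequence passed to dictionary matching to about one third.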

[Brief explanation of the drawing]

FIG. 1 is a block diagram of a conventional speech recognition device that performs phoneme analysis, FIG. 2 is a diagram explaining the phoneme sequence obtained as a result of phoneme recognition, and FIG. 3 is a block diagram showing an embodiment of the speech recognition device of the present invention.

2: frequency analysis unit, 3: standard pattern storage unit, 4: phoneme recognition unit, 5: word dictionary, 6: word recognition unit, 8: phoneme sequence correction unit.

Claims

[Claims]

1 A speech recognition device comprising: means for converting an input speech signal into a sequence of feature vectors; phoneme recognition means for converting each feature vector of the vector sequence into a set consisting of a first candidate phoneme, a second candidate phoneme, and a reliability of the first candidate phoneme, the reliability being defined so that it is 0.5 when the first candidate phoneme and the second candidate phoneme are equally certain and approaches 1 as the certainty of the first candidate phoneme increases; merging means for merging several consecutive sets of first candidate phoneme, second candidate phoneme, and reliability of the first candidate phoneme; and means for comparing the output sequence of the merging means with each word of a word dictionary, in which each word of the recognition vocabulary is expressed as a combination of phonemes, and outputting the word with the highest similarity as the recognition result; the speech recognition device being characterized in that, where A_i is the first candidate phoneme of the i-th set of the output sequence of the phoneme recognition means, B_i is its second candidate phoneme, r_i is the reliability of the first candidate phoneme, and 1 − r_i is the reliability of the second candidate phoneme, the merging means, when merging the j-th through (j+k)-th sets, takes for the m kinds of phonemes X_1, X_2, ..., X_m contained therein u_i as the sum of the reliabilities corresponding to those of A_j, A_(j+1), ..., A_(j+k), B_j, B_(j+1), ..., B_(j+k) that are equal to the phoneme X_i, and, with u_λ the largest of u_1, u_2, ..., u_m and u_μ the next largest, has means for setting the first candidate phoneme after merging to X_λ, the second candidate phoneme to X_μ, and the reliability of the first candidate phoneme to u_λ / (u_λ + u_μ).