JPH0155477B2


Info

Publication number: JPH0155477B2
Application number: JP58183695A
Authority: JP
Inventors: Hisanori Kanezashi; Kunio Akiba; Takao Irumano
Original assignee: Computer Basic Technology Research Association Corp
Current assignee: Computer Basic Technology Research Association Corp
Priority date: 1983-10-01
Filing date: 1983-10-01
Publication date: 1989-11-24
Also published as: JPS6075889A

Description

DETAILED DESCRIPTION OF THE INVENTION (Field of Industrial Application) The present invention relates to a word speech recognition method that recognizes words by matching input speech against a word dictionary written in phoneme notation.

(Configuration of the Conventional Example and Its Problems) FIG. 1 is a functional block diagram of an apparatus for executing both an example of a conventional word speech recognition method and an embodiment of the word speech recognition method of the present invention. The conventional example is described with reference to FIGS. 1 and 2. In FIG. 1, 1 is a parameter extraction unit that creates a time series of parameters from the input speech; 2 is a probability density calculation unit that matches the parameters against the phoneme standard patterns and calculates the probability density of each phoneme; and 3 is a word recognition unit that performs segmentation for each phoneme, likelihood calculation, word similarity calculation, and so on. In addition, 4 is a phoneme standard pattern unit that stores the phoneme standard patterns, which represent the distribution of the various parameters of each phoneme in the form of a per-phoneme mean value (μ_i) and a covariance matrix (Σ_i) between the parameters; and 5 is a word dictionary unit that stores a word dictionary in which every word to be recognized is written as a symbol string in units of phonemes. In the word dictionary, for example, the words "Sapporo" and "Asahikawa" are written as "SAQPORO" and "ASAHIKAWA", respectively.
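The two stored resources described above, the phoneme standard patterns (unit 4) and the word dictionary (unit 5), can be pictured with a small sketch. The class name `PhonemePattern`, the function `phoneme_sequence`, and all numeric values are hypothetical and chosen only for illustration; the sketch also assumes one character per phoneme symbol, which the text does not state.

```python
from dataclasses import dataclass


@dataclass
class PhonemePattern:
    """Standard pattern of one phoneme (hypothetical container)."""
    mean: list        # mean vector, one value per parameter
    covariance: list  # covariance matrix stored as a list of rows


# Word dictionary: word -> phoneme symbol string, using the text's examples.
WORD_DICTIONARY = {
    "Sapporo": "SAQPORO",
    "Asahikawa": "ASAHIKAWA",
}

# Illustrative two-parameter pattern; the numbers are invented.
PATTERNS = {
    "A": PhonemePattern(mean=[1.0, 0.5],
                        covariance=[[1.0, 0.2], [0.2, 1.0]]),
}


def phoneme_sequence(word):
    """Split a dictionary entry into phoneme symbols.

    Assumes one character per phoneme symbol (an assumption, not from the source).
    """
    return list(WORD_DICTIONARY[word])
```

Under that assumption, `phoneme_sequence("Sapporo")` yields the seven symbols S, A, Q, P, O, R, O, which is the dictionary phoneme sequence that drives segmentation.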

FIG. 2 illustrates how the likelihood value of each phoneme moves when speech is uttered from the phoneme sequence XYZ.

Next, the operation of the above conventional example is described. The parameter extraction unit 1 analyzes the input speech in 10 ms frames, extracts parameters, and creates a parameter time series. Next, the probability density calculation unit 2 matches the parameters obtained for each frame against the phoneme standard patterns in the phoneme standard pattern unit 4 and calculates the probability density of each phoneme from those parameter values. Next, for each dictionary entry, the word recognition unit 3 segments the speech into phonemes according to the dictionary phoneme sequence that makes up that entry, calculates the likelihood l of each segmented interval according to the type of phoneme and the equation below, and obtains the similarity for that dictionary entry as the average of the likelihoods of its phonemes. Here, let the phoneme be X, let N_s and N_e be the frame numbers of the start and end of the interval segmented for X, and let C_n be the value of the parameters in the n-th frame; the likelihood l_X of phoneme X is then defined by the following equation.
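The likelihood equation itself does not survive in this text. What the surrounding description does state is that it involves a division of probability densities, with the density of phoneme X in the numerator and a summation over a set of phonemes i in the denominator, evaluated over the frames N_s to N_e. One form consistent with that description is sketched below; the averaging over the frame count is an assumption, and the original may differ (for example by using a logarithm or a plain sum). C_n denotes the parameters of the n-th frame.

```latex
l_X \;=\; \frac{1}{N_e - N_s + 1}
          \sum_{n=N_s}^{N_e}
          \frac{\phi_X(C_n)}{\sum_{i} \phi_i(C_n)}
```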

φ_i(C_n) represents the probability density of a phoneme i and is defined as follows.
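The density equation is likewise missing from this text. Because each standard pattern is stored as a mean vector and a covariance matrix, one plausible reading, offered here as an assumption rather than as the patent's confirmed formula, is the multivariate normal density over the n parameters of a frame, where μ_i is the mean vector and Σ_i the covariance matrix of phoneme i:

```latex
\phi_i(C) \;=\; \frac{1}{(2\pi)^{n/2}\,\lvert\Sigma_i\rvert^{1/2}}
               \exp\!\left(-\tfrac{1}{2}\,(C-\mu_i)^{\mathsf T}\,
               \Sigma_i^{-1}\,(C-\mu_i)\right)
```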

C: the n parameters in one frame (vector)
μ_i: the mean value of the parameters of a phoneme i (vector)
Σ_i: the covariance matrix

In the likelihood equation for l_X, the range of i in the summation in the denominator of the probability density ratio depends on what phoneme X is; for example, when X is the phoneme A, i ranges over the five vowels A, E, I, O, and U. The word similarity L_M is then obtained from these likelihoods for each dictionary entry according to the following equation, and the dictionary entry with the largest L_M is taken as the recognized word.

$$L_M = \frac{1}{NP} \sum_{i=1}^{NP} l_i$$

L_M: similarity of the M-th word in the dictionary
l_i: likelihood of phoneme i in the dictionary phoneme sequence
NP: number of phonemes in the dictionary entry

In the above conventional example, as shown in FIG. 2, articulatory coupling between phoneme X and the preceding phoneme Y and following phoneme Z in the dictionary phoneme sequence causes phonemes j₁ and j₂, other than X, Y, and Z, to appear in the transitional parts of the segmented interval with probability density values comparable to that of X. Because the numerator of the likelihood equation for phoneme X takes only the probability density of X into account, a sufficient likelihood is not obtained, which has been a cause of word misrecognition.
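The conventional procedure described above (per-phoneme likelihood, word similarity as the mean of the phoneme likelihoods, selection of the entry with the largest similarity) can be sketched as follows. The function names and all density and likelihood values are hypothetical. The per-frame averaging inside `conventional_likelihood` is an assumption about the missing likelihood equation; only the ratio of the target density to a sum over competing phonemes, and the mean in `word_similarity`, come from the text.

```python
def conventional_likelihood(densities, target, competitors, start, end):
    """Likelihood of phoneme `target` over frames start..end (inclusive).

    densities   -- list of dicts, one per frame: phoneme symbol -> density
    competitors -- phonemes summed in the denominator (e.g. the five vowels)
    Assumes the per-frame density ratio is averaged over the interval.
    """
    total = 0.0
    for n in range(start, end + 1):
        frame = densities[n]
        denominator = sum(frame[i] for i in competitors)
        total += frame[target] / denominator
    return total / (end - start + 1)


def word_similarity(phoneme_likelihoods):
    """L_M: mean of the likelihoods of the phonemes in one dictionary entry."""
    return sum(phoneme_likelihoods) / len(phoneme_likelihoods)


def recognize(likelihoods_by_word):
    """Return the dictionary entry whose similarity L_M is largest."""
    return max(likelihoods_by_word,
               key=lambda w: word_similarity(likelihoods_by_word[w]))


# Invented densities for two frames segmented as phoneme A.
frames = [
    {"A": 0.6, "E": 0.1, "I": 0.1, "O": 0.1, "U": 0.1},
    {"A": 0.8, "E": 0.05, "I": 0.05, "O": 0.05, "U": 0.05},
]
l_a = conventional_likelihood(frames, "A", "AEIOU", 0, 1)

# Invented per-phoneme likelihoods for two dictionary entries.
best = recognize({
    "SAQPORO": [0.7, 0.6, 0.8, 0.7, 0.6, 0.8, 0.7],
    "ASAHIKAWA": [0.4, 0.5, 0.3, 0.4, 0.5, 0.3, 0.4, 0.5, 0.3],
})
```

In this toy data the frames are dominated by A, so the ratio stays high. In a transitional frame where another phoneme has a density comparable to that of X, the same ratio drops, which is the shortfall the text describes.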

(Object of the Invention) The present invention eliminates the above drawback of the conventional example; its object is to improve the accuracy of the likelihood calculation and thereby improve the word recognition rate.

(Structure of the Invention) To achieve the above object, the present invention, when calculating the likelihood of a phoneme X, incorporates the probability density values of phonemes other than X that appear in the transitional parts, which has the effect of improving the accuracy of the likelihood calculation.

(Description of the Embodiment) An embodiment of the present invention is described below. The phoneme standard patterns and the word dictionary in this embodiment are the same as in the conventional example. The parameter time series obtained in the parameter extraction step is also the same as in the conventional example.

First, the parameter extraction unit 1 obtains the parameters for each frame from the input speech, and the probability density calculation unit 2 then uses those parameter values to calculate the probability density obtained from each phoneme standard pattern. The steps up to this point are the same as in the conventional example. Next, for each dictionary entry in the word dictionary unit 5, the word recognition unit 3 segments phoneme X according to the dictionary phoneme sequence that makes up that entry and calculates the likelihood l_X of the interval segmented for X. In doing so, it takes into account the probability densities of phonemes other than X that appear in the transitional parts because of articulatory coupling with the preceding phoneme Y and the following phoneme Z in the dictionary phoneme sequence (φ_j1 and φ_j2 of j₁ and j₂ in FIG. 2), and obtains the likelihood l_X according to the following equation.

N_s, N_e: start and end frame numbers of the segmented interval
C_n: value of the parameters in the n-th frame
φ: probability density, as defined for the conventional example

Here, W(X, Y, Z, n) is a weighting function determined by the phoneme Y that precedes X and the phoneme Z that follows X in the dictionary phoneme sequence, and by the frame position within the segmented interval.

The range of j in the second term of the numerator, Σ_j φ_j(C_n), is set according to the phonemes that appear in the transitional parts. The values of W(X, Y, Z, n) and j are determined in advance by preliminary experiments or the like. The definition of i in the denominator is the same as in the conventional likelihood equation. From the likelihood values thus obtained, the word similarity L_M is calculated for each dictionary entry by the same word similarity equation as in the conventional example, and the dictionary entry with the largest L_M is taken as the recognized word.
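A minimal sketch of the improved likelihood, under stated assumptions, may help make the idea concrete. The function names `improved_likelihood` and `edge_weight`, the density values, and the weight values are all hypothetical; the text says only that W and the range of j are found by preliminary experiments. The sketch assumes that W multiplies the summed densities of the transitional phonemes in the numerator and that the per-frame ratio is averaged over the interval, neither of which is confirmed by the surviving text.

```python
def improved_likelihood(densities, target, competitors, transitional,
                        weight, start, end):
    """Likelihood of `target` that also credits transitional phonemes.

    densities    -- list of dicts, one per frame: phoneme symbol -> density
    competitors  -- phonemes i summed in the denominator
    transitional -- phonemes j expected in the transitional parts
    weight       -- callable n -> W for this (X, Y, Z) context (hypothetical)
    """
    total = 0.0
    for n in range(start, end + 1):
        frame = densities[n]
        extra = weight(n) * sum(frame[j] for j in transitional)
        denominator = sum(frame[i] for i in competitors)
        total += (frame[target] + extra) / denominator
    return total / (end - start + 1)


def edge_weight(n):
    """Invented weight: full credit in the first and last frame only."""
    return 1.0 if n in (0, 2) else 0.0


# Invented densities for an interval segmented as phoneme A, where E and O
# appear in the transitional frames with densities comparable to A.
frames = [
    {"A": 0.4, "E": 0.4, "I": 0.05, "O": 0.05, "U": 0.1},   # transition from Y
    {"A": 0.8, "E": 0.05, "I": 0.05, "O": 0.05, "U": 0.05},  # steady part
    {"A": 0.4, "E": 0.05, "I": 0.05, "O": 0.4, "U": 0.1},   # transition to Z
]

# A weight of zero everywhere reduces to the conventional calculation.
conventional = improved_likelihood(frames, "A", "AEIOU", ["E", "O"],
                                   lambda n: 0.0, 0, 2)
improved = improved_likelihood(frames, "A", "AEIOU", ["E", "O"],
                               edge_weight, 0, 2)
```

With these invented numbers the conventional value is pulled down by the two transitional frames, while the weighted version recovers most of the lost likelihood. Passing the weight as a callable keeps the dependence on the phoneme context (X, Y, Z) outside the likelihood routine.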

In this embodiment, the preceding and following phonemes in the dictionary entry are taken into account within the segmented interval, and the likelihood is calculated using the probability density values of the phonemes that appear in the transitional parts; this has the advantage of yielding a more accurate likelihood.

(Effects of the Invention) With the configuration described above, the present invention provides the following effect. Within the segmented interval, a highly accurate likelihood can be obtained by taking into account the preceding and following phonemes in the dictionary entry and calculating the likelihood using the probability density values of the phonemes that appear in the transitional parts.

[Brief explanation of drawings]

FIG. 1 is a diagram for explaining the word speech recognition method of the conventional example and of an embodiment of the present invention. FIG. 2 is a diagram showing the change over time of the probability density φ of phoneme X, together with the preceding and following phonemes Y and Z and the phonemes j₁ and j₂ that appear in the transitional parts. 1: parameter extraction unit; 2: probability density calculation unit; 3: word recognition unit; 4: phoneme standard pattern unit; 5: word dictionary unit.

Claims

[Claims]

1. A word speech recognition method that recognizes words in input speech using a word dictionary, in which each word to be recognized is written as a symbol string in units of phonemes, and a standard pattern for each phoneme, expressed as a distribution of the parameters of that phoneme, wherein: the input speech is matched against each dictionary entry in the word dictionary; the input speech is segmented phoneme by phoneme according to the dictionary phoneme sequence that makes up each dictionary entry; when the likelihood of a segmented speech interval is calculated using the standard pattern of the corresponding phoneme, a likelihood is calculated that includes the probability density values of phonemes that appear in the transitional parts of the segmented interval, according to the preceding and following adjacent phonemes in the dictionary phoneme sequence; and this likelihood value is used to obtain the similarity between the dictionary entry and the input speech, whereby the word is recognized.