JPS60202494A

JPS60202494A - Word voice recognition

Info

Publication number: JPS60202494A
Application number: JP59058173A
Authority: JP
Inventors: 金指　久則; 入間野　孝雄; 秋場　国夫
Original assignee: Computer Basic Technology Research Association Corp
Current assignee: Computer Basic Technology Research Association Corp
Priority date: 1984-03-28
Filing date: 1984-03-28
Publication date: 1985-10-12
Also published as: JPH045391B2

Abstract

(57) [Abstract] This publication contains application data from before electronic filing, so no abstract data is recorded.

Description

DETAILED DESCRIPTION OF THE INVENTION (Field of Industrial Application) The present invention relates to a word speech recognition method that recognizes words by matching input speech against a word dictionary written in phoneme notation.

(Constitution of the Conventional Example and Its Problems) FIG. 1 is a functional block diagram of an apparatus for executing both an example of a conventional word speech recognition method and an embodiment of the word speech recognition method of the present invention. The conventional example is described with reference to FIGS. 1, 2, and 3.

In FIG. 1, 1 is a parameter extraction unit that creates a time series of parameters from the input speech; 2 is a probability density calculation unit that matches the parameters against the phoneme standard patterns and calculates the probability density of each phoneme; and 3 is a word recognition unit that performs segmentation, likelihood calculation, and word similarity calculation for each phoneme.

In addition, 4 is a phoneme standard pattern unit that stores the phoneme standard patterns, which represent the distribution of the various parameters for each phoneme in the form of a per-phoneme mean (μi) and a covariance matrix between the parameters (Σi). 5 is a word dictionary unit that stores a word dictionary in which all words to be recognized are written as symbol strings in phoneme units. In this dictionary, for example, the words "Sapporo" and "Kannai" are written as "SAQPORO" and "KAN=NAI", respectively.
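As an illustration of the dictionary notation above, the following is a minimal sketch of how such phoneme strings might be split into individual phoneme symbols. The function name `tokenize`, the constant `MULTI_CHAR_PHONEMES`, and the idea of a greedy left-to-right match are assumptions made for this example; the patent only states that words are stored as phoneme symbol strings and uses `N=` for the moraic nasal.

```python
# Hypothetical list of multi-character phoneme symbols. "N=" is the moraic
# nasal in the patent's notation; it must be matched before the single "N".
MULTI_CHAR_PHONEMES = ("N=",)


def tokenize(entry: str) -> list[str]:
    """Split a dictionary string such as "KAN=NAI" into phoneme symbols."""
    phonemes = []
    i = 0
    while i < len(entry):
        for symbol in MULTI_CHAR_PHONEMES:
            if entry.startswith(symbol, i):
                phonemes.append(symbol)
                i += len(symbol)
                break
        else:
            # No multi-character symbol starts here: take one character.
            phonemes.append(entry[i])
            i += 1
    return phonemes


WORD_DICTIONARY = {word: tokenize(word) for word in ("SAQPORO", "KAN=NAI")}
print(WORD_DICTIONARY["KAN=NAI"])  # ['K', 'A', 'N=', 'N', 'A', 'I']
```

Keeping `N=` as a single token matters later, because the recognition step needs to tell the moraic nasal apart from the ordinary nasal that follows it.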

Next, the operation of the conventional example is described. In the parameter extraction unit 1, the input speech is analyzed in 10 ms frames, parameters are extracted, and a parameter time series is created. Next, in the probability density calculation unit 2, the parameters obtained for each frame are matched against the phoneme standard patterns in the phoneme standard pattern unit 4, and the probability density of each phoneme is calculated. Next, in the word recognition unit 3, phoneme segmentation is performed for each dictionary entry according to the dictionary phoneme sequence that constitutes the entry. For each phoneme, the likelihood ℓ of the interval segmented for that phoneme is calculated according to equation (1), and the similarity is obtained as the average of the likelihoods of the phonemes in the dictionary entry. Here, let the phoneme be X, let the start and end frame numbers of the interval segmented for X be Ns and Ne, and let the parameter values in the n-th frame be Cn. The likelihood ℓ_X of phoneme X is then defined by equation (1).

φ_i(Cn) denotes the probability density of a phoneme i and is defined by equation (2).
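Equation (2) itself is not reproduced in this text. Because the standard patterns are stored as a mean μi and a covariance matrix Σi, one plausible reading is a multivariate normal density. The sketch below is written under that assumption only; the function names `cholesky` and `gaussian_density` are hypothetical, and the patent may define φ_i differently.

```python
import math


def cholesky(matrix):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for r in range(n):
        for c in range(r + 1):
            s = sum(lower[r][k] * lower[c][k] for k in range(c))
            if r == c:
                lower[r][c] = math.sqrt(matrix[r][r] - s)
            else:
                lower[r][c] = (matrix[r][c] - s) / lower[c][c]
    return lower


def gaussian_density(c_n, mean, cov):
    """Multivariate normal density of parameter vector c_n (assumed form of phi_i)."""
    d = len(mean)
    lower = cholesky(cov)
    diff = [x - m for x, m in zip(c_n, mean)]
    # Forward substitution solves lower @ z = diff; z.z is then the squared
    # Mahalanobis distance between c_n and the mean.
    z = []
    for r in range(d):
        s = sum(lower[r][k] * z[k] for k in range(r))
        z.append((diff[r] - s) / lower[r][r])
    mahalanobis_sq = sum(v * v for v in z)
    log_det = 2.0 * sum(math.log(lower[r][r]) for r in range(d))
    log_density = -0.5 * (d * math.log(2.0 * math.pi) + log_det + mahalanobis_sq)
    return math.exp(log_density)
```

Working through the Cholesky factor avoids forming an explicit matrix inverse and keeps the sketch free of third-party dependencies.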

In equation (1), the range of i in the summation that forms the denominator of the probability-density ratio depends on the phoneme X. For example, when X is the phoneme A, i ranges over the five vowels A, E, I, O, and U.
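The exact form of equation (1) is not reproduced in this text. The description says only that it divides the density of phoneme X by a sum of densities over a phoneme-dependent set and evaluates this over frames Ns to Ne. The sketch below assumes a plain average of that ratio over the interval; the averaging, the function name `segment_likelihood`, and the toy numbers are all assumptions, and the original may for instance work with logarithms instead.

```python
def segment_likelihood(densities, phoneme, competitors, n_start, n_end):
    """Average normalised density of `phoneme` over frames n_start..n_end (inclusive).

    densities[i][n] holds phi_i(C_n). `competitors` is the set of phonemes
    summed in the denominator (for /A/, the five vowels).
    """
    total = 0.0
    for n in range(n_start, n_end + 1):
        denom = sum(densities[i][n] for i in competitors)
        total += densities[phoneme][n] / denom
    return total / (n_end - n_start + 1)


VOWELS = ("A", "E", "I", "O", "U")
# Toy densities for three frames; the numbers are invented for illustration.
densities = {
    "A": [0.6, 0.8, 0.4],
    "E": [0.1, 0.05, 0.2],
    "I": [0.1, 0.05, 0.2],
    "O": [0.1, 0.05, 0.1],
    "U": [0.1, 0.05, 0.1],
}
print(segment_likelihood(densities, "A", VOWELS, 0, 2))
```

Restricting the denominator to the competing phonemes of the same class means the score reflects how well X stands out from its likely confusions rather than from every phoneme in the inventory.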

The word similarity L_M is obtained for each dictionary entry according to equation (3), and the dictionary entry with the largest L_M is taken as the recognized word.
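The averaging and maximum-selection step described above can be sketched as follows. The function names `word_similarity` and `recognize` and the per-phoneme likelihood values are hypothetical and chosen only to illustrate the rule.

```python
def word_similarity(phoneme_likelihoods):
    """L_M: mean of the per-phoneme likelihoods of one dictionary entry."""
    return sum(phoneme_likelihoods) / len(phoneme_likelihoods)


def recognize(likelihoods_by_entry):
    """Return the dictionary entry whose similarity L_M is largest."""
    return max(
        likelihoods_by_entry,
        key=lambda entry: word_similarity(likelihoods_by_entry[entry]),
    )


# Invented per-phoneme likelihoods for two candidate entries.
candidates = {
    "SAQPORO": [0.2, 0.5, 0.1, 0.3, 0.4, 0.2, 0.4],
    "KAN=NAI": [0.7, 0.8, 0.6, 0.5, 0.9, 0.7],
}
print(recognize(candidates))  # KAN=NAI
```

Because the similarity is a mean, a single badly scored phoneme lowers the score of the whole word, which is why a segmentation error on one phoneme can cause a misrecognition.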

$$L_M = \frac{1}{N_P} \sum_{j=1}^{N_P} \ell_j \qquad (3)$$

FIG. 2 shows the time variation of the probability density of each phoneme in the /AN=NA/ portion when /KAN=NAI/ (Kannai) is uttered. In this case, segmentation and likelihood calculation for the /AN=NA/ portion are performed according to the time variation of the probability density values φ_A, φ_N=, φ_N, and φ_A of the phonemes /A/, /N=/, /N/, and /A/. For /AN=NA/, the segmented interval (a-h) is assigned to the first /A/, and ℓ_A is calculated from φ_A according to equation (1). ℓ_N=, ℓ_N, and ℓ_A are then calculated in the same way for /N=/, /N/, and /A/.

FIG. 3 shows the time variation of the probability density of each phoneme when the same word /KAN=NAI/ is uttered by a different speaker. In FIG. 3, segmentation and likelihood calculation for the /AN=NA/ portion are again performed according to the time variation of φ_A, φ_N=, φ_N, and φ_A. When /N=/ is segmented, however, the probability density φ_N of the following phoneme /N/ does not become sufficiently large in the /N/ interval, while φ_N= keeps a large value throughout the /N/ interval and persists up to the start of the interval of the next phoneme /A/. The segmented interval for /N=/ therefore becomes the interval (g-h), which includes the /N/ interval. As a result, the segmentation of the phoneme /N/ that follows /N=/ goes wrong and the likelihood ℓ_N becomes low. Words containing a moraic nasal immediately followed by a nasal were therefore prone to misrecognition.
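The failure described above can be illustrated with toy numbers. The patent does not spell out the segmentation rule, so the sketch below assumes a simple dominance rule in which /N=/ keeps every frame where its density is at least as large as that of /N/. The rule, the function name `dominant_run_length`, and the density values are all assumptions.

```python
# Invented densities over the frames spanning the /N=/ and /N/ portions.
phi_n_moraic = [0.9, 0.8, 0.8, 0.7, 0.6]  # phi_{N=}
phi_n = [0.1, 0.2, 0.3, 0.4, 0.5]         # phi_N never overtakes phi_{N=}


def dominant_run_length(first, second):
    """Count leading frames in which `first` stays at least as large as `second`."""
    count = 0
    for a, b in zip(first, second):
        if a < b:
            break
        count += 1
    return count


frames_for_moraic = dominant_run_length(phi_n_moraic, phi_n)
frames_left_for_n = len(phi_n) - frames_for_moraic
print(frames_for_moraic, frames_left_for_n)  # 5 0
```

With these values /N=/ absorbs all five frames and none remain for /N/, which mirrors the situation in FIG. 3 where the /N=/ interval (g-h) swallows the /N/ interval.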

(Object of the Invention) The present invention eliminates the drawback of the conventional example described above. Its object is to improve the accuracy of the likelihood calculation and thereby improve the word recognition rate.

(Constitution of the Invention) To achieve the above object, when segmenting and calculating the likelihood of a phoneme sequence in which a moraic nasal is immediately followed by a nasal, the present invention segments the two consecutive phonemes together and calculates the likelihood for them as a unit. This improves the accuracy of both segmentation and likelihood calculation.

(Description of the Embodiment) An embodiment of the present invention is described below with reference to FIGS. 1 and 3. In FIG. 1, the phoneme standard patterns are the same as in the conventional example. The word dictionary writes each word to be recognized as a string of phoneme symbols. The parameter time series obtained by parameter extraction is also the same as in the conventional example. The operation of this embodiment is as follows. First, the parameter extraction unit 1 obtains the parameters for each frame from the input speech. The probability density calculation unit 2 then calculates the probability densities from those parameter values and each phoneme standard pattern.

Next, in the word recognition unit 3, for each dictionary entry in the word dictionary unit 5, each phoneme X is segmented according to the dictionary phoneme sequence that constitutes the entry, and the likelihood ℓ_X of the interval segmented for X is calculated. When the dictionary phoneme sequence contains a moraic nasal immediately followed by a nasal, however, the probability density of the first phoneme, the moraic nasal, remains dominant until the end of the following nasal. The two consecutive phonemes are therefore segmented together, and the likelihood is calculated for the combined interval. Looking at the probability densities φ_A, φ_N=, φ_N, and φ_A of the phonemes /A/, /N=/, /N/, and /A/ in the /AN=NA/ portion of FIG. 3, φ_N= is larger than φ_N in the /N/ portion and continues up to the start (h) of /A/. Accordingly, the values of φ_N= are used to segment the two consecutive phonemes /N=N/ together from g to h, and the likelihood ℓ_N=N for the two phonemes is obtained from the values of φ_N= over the segmented interval (g-h) according to the two-phoneme form of the likelihood equation. For ordinary phonemes, by contrast, the likelihood is calculated with equation (1) as in the conventional method.
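The merging step of this embodiment can be sketched as a preprocessing pass over the dictionary phoneme sequence. This is a minimal illustration, not the patent's implementation: the function name `merge_nasal_pairs`, the `(label, scoring_phoneme)` representation, and the membership of `NASALS` are assumptions, since the text does not list which symbols count as nasals.

```python
MORAIC_NASAL = "N="   # symbol used for the moraic nasal in the patent's examples
NASALS = {"N", "M"}   # assumption: the text does not enumerate the nasal symbols


def merge_nasal_pairs(phonemes):
    """Collapse each moraic-nasal + nasal pair into a single unit.

    Returns (label, scoring_phoneme) tuples. A merged unit is scored with the
    moraic nasal's density; every other phoneme is scored with its own.
    """
    units = []
    i = 0
    while i < len(phonemes):
        is_pair = (
            phonemes[i] == MORAIC_NASAL
            and i + 1 < len(phonemes)
            and phonemes[i + 1] in NASALS
        )
        if is_pair:
            units.append((phonemes[i] + phonemes[i + 1], MORAIC_NASAL))
            i += 2
        else:
            units.append((phonemes[i], phonemes[i]))
            i += 1
    return units


print(merge_nasal_pairs(["K", "A", "N=", "N", "A", "I"]))
# [('K', 'K'), ('A', 'A'), ('N=N', 'N='), ('A', 'A'), ('I', 'I')]
```

After this pass the segmenter only has to find one boundary pair (g and h in FIG. 3) for /N=N/ instead of trying to locate an unreliable boundary between /N=/ and /N/.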

In this embodiment, the phoneme sequence of a moraic nasal followed by a nasal is segmented and scored as a single unit, which has the advantage of improving the recognition rate for words containing these two consecutive phonemes.

Note that the symbols are used in the same way as in the equations above.

(Effects of the Invention) As described above, by segmenting a moraic nasal and the nasal that follows it together and calculating the likelihood for the pair as a unit, the present invention performs segmentation and likelihood calculation more accurately than the conventional method.

[Brief Description of the Drawings]

FIG. 1 is a diagram for explaining the word speech recognition method of the conventional example and of an embodiment of the present invention. FIG. 2 is a diagram showing the time variation of the probability densities φ_A, φ_N=, φ_N, and φ_A of the elements /A/, /N=/, /N/, and /A/ of the /AN=NA/ portion when /KAN=NAI/ (Kannai) is uttered. FIG. 3 is a diagram showing the time variation of φ_A, φ_N=, φ_N, and φ_A when a speaker other than the one in FIG. 2 utters /KAN=NAI/.

1: parameter extraction unit; 2: probability density calculation unit; 3: word recognition unit; 4: phoneme standard pattern unit; 5: word dictionary unit.

Claims

[Claims]

A word speech recognition method that recognizes the words of input speech using a word dictionary, in which the words to be recognized are written as symbol strings in phoneme units, and a standard pattern for each phoneme, expressed as the distribution of that phoneme's acoustic parameters, wherein: the input speech is matched against each dictionary entry of the word dictionary; the input speech is segmented phoneme by phoneme according to the dictionary phoneme sequence that constitutes each dictionary entry; the probability density that each segmented speech interval was generated by the corresponding phoneme is calculated using the standard pattern of that phoneme; and the similarity between each dictionary entry and the input speech is calculated from the probability density values over the segmented speech intervals in order to recognize the word; the method being characterized in that, for a phoneme sequence in a dictionary word in which a moraic nasal is immediately followed by a nasal, the two consecutive phonemes of the moraic nasal and the nasal are segmented together and the likelihood is calculated for them as a unit.