JPH045400B2

Info

Publication number: JPH045400B2
Application number: JP358784A
Authority: JP
Priority date: 1984-01-13
Filing date: 1984-01-13
Publication date: 1992-01-31
Also published as: JPS60147799A

Description

DETAILED DESCRIPTION OF THE INVENTION

(Field of Industrial Application) The present invention relates to a speech recognition method that recognizes words by matching input speech against a word dictionary written in phoneme notation.

(Structure of the Conventional Example and Its Problems) Fig. 1 is a block diagram outlining the functions of a device that implements one conventional word recognition method (the first conventional example). Fig. 2 is a diagram showing how the ratio of mid-band power to low-band power changes over a W interval.

In Fig. 1, 1 is a parameter extraction unit, 2 is a phoneme segmentation unit, 3 is a phoneme recognition unit, 4 is a word dictionary unit, 5 is a confusion matrix unit, and 6 is a word recognition unit. The word dictionary unit 4 stores a word dictionary in which every word to be recognized is written in phonemes. In that dictionary, for example, the words "Sapporo", "Asahikawa", "Wakayama", and "Okayama" are written as "SAQPORO", "ASAHIKAWA", "WAKAJAMA", and "OKAJAMA". The confusion matrix unit 5 stores a confusion matrix giving, for each phoneme used in the dictionary notation, the probability of each outcome in actual phoneme recognition. For example, the probability that A is recognized as A is 85%, the probability that A is recognized as O is 7%, and the probability that A drops out of the recognized phoneme sequence because of a segmentation error is 5%.
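As an illustration, the confusion matrix can be pictured as a nested lookup table. The following is only a minimal sketch: the three probabilities for A are the ones quoted in the text, while the table layout, the names `CONFUSION`, `DELETED`, and `confusion_prob`, and the floor value for unlisted pairs are assumptions made for this example.

```python
# Hypothetical layout: CONFUSION[dictionary_phoneme][recognized_phoneme] -> probability.
DELETED = None  # assumed marker for "dropped from the recognized phoneme sequence"

CONFUSION = {
    # The three values below are the examples quoted in the text.
    "A": {"A": 0.85, "O": 0.07, DELETED: 0.05},
}


def confusion_prob(dict_phoneme, recognized, floor=1e-3):
    """Probability that dict_phoneme comes out as `recognized`.

    The floor returned for pairs missing from the table is an assumption.
    """
    return CONFUSION.get(dict_phoneme, {}).get(recognized, floor)


print(confusion_prob("A", "O"))      # 0.07
print(confusion_prob("A", DELETED))  # 0.05
```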

Next, the operation of the above conventional example is described. The parameter extraction unit 1 analyzes the input speech in 10 ms frames and extracts parameters. The phoneme segmentation unit 2 performs phoneme segmentation (dividing the input speech into one interval per phoneme) based on the per-frame properties of the speech, changes in the parameters, and so on. The phoneme recognition unit 3 then performs phoneme recognition on each segmented interval. The word recognition unit 6 uses the confusion matrix in the confusion matrix unit 5 to calculate the similarity between the recognized phoneme sequence obtained by the phoneme recognition unit 3 and each dictionary entry stored in the word dictionary unit 4, and takes the word (dictionary entry) with the maximum similarity as the recognized word.

In this conventional example, the wa-row semivowel (hereinafter W) is segmented using the ratio R_ML of the mid-band (600 to 1500 Hz) power P_M of the input speech to its low-band (250 to 600 Hz) power P_L:

$$R_{ML} \triangleq \frac{P_M}{P_L} \qquad (1)$$

An interval in which R_ML forms a valley, as shown in Fig. 2, is segmented as a W interval. This exploits the property that R_ML traces a valley in response to the change in the first formant frequency within a W interval, but it has the following drawback. R_ML also varies with formant frequency changes outside W intervals, so valleys of R_ML occur in other intervals as well. To avoid mistaking those for W intervals, a threshold was placed on the valley depth: a valley whose depth exceeded the threshold was segmented as a W interval, and shallower valleys were ignored. As a result, only about 70% of W intervals were segmented correctly, and word misrecognition was frequent. For example, as illustrated in Fig. 1, when the input word is WAKAJAMA, the W often goes unsegmented for the reason above, the recognized phoneme sequence becomes AKAJAMA, and the similarity calculation against the dictionary entries yields the wrong result OKAJAMA. Frequent word misrecognitions of this kind were the drawback of this conventional example.
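The threshold-based W segmentation of the first conventional example can be sketched roughly as follows. This is an illustrative sketch, not the patent's implementation: the text does not say how a valley's extent or depth is measured, so the definitions used here (nearest peaks on either side; depth measured from the lower peak) and all function names and sample values are assumptions.

```python
def band_power_ratio(p_mid, p_low):
    """Per-frame ratio R_ML = P_M / P_L, as in equation (1)."""
    return [pm / pl for pm, pl in zip(p_mid, p_low)]


def find_valleys(r_ml):
    """Return (start, bottom, end, depth) for every valley in r_ml.

    Assumption: a valley runs from the nearest peak left of a local minimum
    to the nearest peak right of it, and its depth is the lower of the two
    peaks minus the minimum.
    """
    valleys = []
    n = len(r_ml)
    for i in range(1, n - 1):
        if r_ml[i - 1] > r_ml[i] <= r_ml[i + 1]:
            left = i
            while left > 0 and r_ml[left - 1] >= r_ml[left]:
                left -= 1
            right = i
            while right < n - 1 and r_ml[right + 1] >= r_ml[right]:
                right += 1
            depth = min(r_ml[left], r_ml[right]) - r_ml[i]
            if depth > 0:
                valleys.append((left, i, right, depth))
    return valleys


def w_intervals_with_threshold(r_ml, min_depth):
    """Keep only valleys deeper than min_depth, as the conventional method does."""
    return [(s, e) for s, _, e, d in find_valleys(r_ml) if d > min_depth]


# Hypothetical R_ML contour: a shallow valley at frame 1, a deep one at frame 4.
r_ml = [2.0, 1.8, 1.9, 2.0, 1.0, 2.1]
print(w_intervals_with_threshold(r_ml, min_depth=0.5))  # [(3, 5)]
```

With the threshold in place the shallow valley at frame 1 is discarded. If that valley were the real W, it would be lost, which is the failure mode described above.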

Next, a second conventional example is described with reference to Fig. 3. In Fig. 3, the parameter extraction unit 1 and the word dictionary unit 4 are the same as those of the first conventional example shown in Fig. 1. In the second conventional example, the parameter extraction unit 1 analyzes the input in 10 ms frames and extracts parameters. The word recognition unit 7 matches the resulting parameter time series directly against the word dictionary of the word dictionary unit 4. For each dictionary entry, it assumes that the input speech is an utterance of that entry, segments the input one phoneme at a time according to the entry's dictionary phoneme sequence, and calculates a likelihood for each segmented interval, that is, a measure of how probable it is that the interval is an utterance of the corresponding phoneme. The similarity between the dictionary entry and the input speech is taken as the average of these likelihoods, and the word is recognized from it.

When segmenting a W interval corresponding to a W in the dictionary phoneme sequence, the second conventional example computed, for every frame, the distance to each vowel standard pattern, and exploited the property that within "wa" the nearest vowel changes from U to A or from O to A. However, except when the utterance is very clear, the nearest vowel per frame in a W interval is often A or O throughout, so the change cannot be detected. In that case the W interval cannot be segmented: the method tends to conclude that the input "contains no W interval", which caused word misrecognition.
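The vowel-transition test used by the second conventional example can be illustrated with a short sketch. The distance measure, the two-dimensional feature vectors, the template values, and the function names are all assumptions for illustration; the text only says that the nearest vowel standard pattern is found per frame and that a U to A or O to A change marks a W interval.

```python
def nearest_vowel(frame, templates):
    """Label of the vowel standard pattern closest to the frame.

    Assumption: squared Euclidean distance over a feature vector; the text
    does not name the distance measure.
    """
    return min(
        templates,
        key=lambda v: sum((x - y) ** 2 for x, y in zip(frame, templates[v])),
    )


def has_wa_transition(labels):
    """True if consecutive frame labels change from U or O to A."""
    return any(
        prev in ("U", "O") and cur == "A"
        for prev, cur in zip(labels, labels[1:])
    )


# Hypothetical 2-D vowel templates and frames.
templates = {"A": [1.0, 0.0], "O": [0.0, 1.0], "U": [-1.0, 0.0]}
clear = [[-0.9, 0.1], [0.8, 0.1]]   # nearest vowels: U then A
unclear = [[0.9, 0.0], [1.0, 0.1]]  # nearest vowel is A throughout

print(has_wa_transition([nearest_vowel(f, templates) for f in clear]))    # True
print(has_wa_transition([nearest_vowel(f, templates) for f in unclear]))  # False
```

The second case mirrors the failure described above: when every frame is closest to A, no transition is visible and the W interval is missed.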

(Object of the Invention) The present invention eliminates the drawbacks of the conventional examples described above. Its object is to improve the word recognition rate by making it possible to segment W intervals correctly.

(Structure of the Invention) To achieve the above object, the present invention is a speech recognition method characterized by the following. Input speech is matched against each dictionary entry of a word dictionary and segmented phoneme by phoneme according to the dictionary phoneme sequence constituting each entry. When a wa-row semivowel is segmented, the time variation of the ratio R_ML of the mid-band power P_M to the low-band power P_L of the input speech is also checked, and an interval in which R_ML reaches a local minimum is segmented as the wa-row semivowel. After segmentation, the likelihood of each phoneme is calculated, and the likelihood values are used to determine the similarity between each dictionary entry and the input speech, from which the input word is recognized. With this feature, the present invention segments reliably and achieves a high word recognition rate.

(Description of the Embodiment) An embodiment of the present invention is described below with reference to the drawings. As with the second conventional example, the basic configuration of a device that carries out the speech recognition method of this embodiment is shown in the block diagram of Fig. 3. In Fig. 3, the parameter extraction unit 1 and the word dictionary unit 4 are the same as in the first and second conventional examples.

The operation of this embodiment is as follows. First, the parameter extraction unit 1 analyzes the input speech in 10 ms frames and extracts parameters. The word recognition unit 7 matches the resulting parameter time series directly against the word dictionary of the word dictionary unit 4. For each dictionary entry, it assumes that the input speech is an utterance of that entry, segments the input one phoneme at a time according to the entry's dictionary phoneme sequence, and calculates a likelihood for each segmented interval, that is, a measure of how probable it is that the interval is an utterance of the corresponding phoneme. The similarity between the dictionary entry and the input speech is taken as the average of these likelihoods, and the word is recognized from it. In this embodiment, when a W interval is segmented in correspondence with a wa-row semivowel W in the dictionary phoneme sequence, an interval in which the ratio R_ML of the mid-band power to the low-band power of the input speech, given by equation (1), forms a valley as shown in Fig. 2 is segmented as the W interval.
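The scoring step described above, in which the similarity of a dictionary entry is the average of its per-phoneme likelihoods and the best-scoring entry is the recognized word, can be sketched as follows. The likelihood values, their 0 to 1 scale, and the function names are hypothetical; how each likelihood is computed is not specified in the text and is left out of the sketch.

```python
def similarity(phoneme_likelihoods):
    """Similarity of one dictionary entry: the mean of its per-phoneme likelihoods."""
    return sum(phoneme_likelihoods) / len(phoneme_likelihoods)


def recognize(likelihoods_by_entry):
    """Return the dictionary entry with the highest similarity."""
    return max(
        likelihoods_by_entry,
        key=lambda entry: similarity(likelihoods_by_entry[entry]),
    )


# Hypothetical per-phoneme likelihoods after dictionary-driven segmentation
# of an utterance of "Wakayama" (one value per phoneme of each entry).
likelihoods_by_entry = {
    "WAKAJAMA": [0.8, 0.9, 0.7, 0.9, 0.8, 0.9, 0.8, 0.9],
    "OKAJAMA": [0.3, 0.7, 0.9, 0.8, 0.9, 0.8, 0.9],
}
print(recognize(likelihoods_by_entry))  # WAKAJAMA
```

Averaging rather than summing keeps entries with different numbers of phonemes comparable, which matters here because WAKAJAMA has eight phonemes and OKAJAMA has seven.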

In this embodiment, even if a valley of R_ML occurs outside a W interval, it causes no problem unless the W of a dictionary entry other than the input word happens to be segmented at exactly that point. Even if such an erroneous segmentation does occur, it can be distinguished from a correct W segmentation by computing the likelihood with other parameters as well, so that the erroneous interval receives a low likelihood. Therefore, although valleys of R_ML are used to segment W intervals, no threshold on valley depth is needed, unlike the first conventional example. W intervals are segmented correctly even when the valley of R_ML is shallow, and the word recognition rate improves. Moreover, because R_ML almost never fails to form a valley in a W interval, the proportion of W intervals segmented correctly is far higher than in the second conventional example, which also improves the word recognition rate. For example, as illustrated in Fig. 3, when the input word is WAKAJAMA, the W is segmented reliably and the word recognition result is correctly WAKAJAMA.

(Effects of the Invention) With the configuration described above, the present invention provides the following effect. When input speech is segmented according to the dictionary phoneme sequence of each dictionary entry, W is segmented by exploiting the fact that the ratio of the mid-band power to the low-band power of the input speech forms a valley in a W interval. W intervals are therefore segmented reliably, which has the advantage of improving the word recognition rate.

[Brief explanation of drawings]

Fig. 1 is a block diagram outlining the functions of a device used to carry out the speech recognition method of the first conventional example. Fig. 2 is a diagram showing how the ratio of mid-band power to low-band power changes over a W interval. Fig. 3 is a block diagram outlining the functions of a device used to carry out the speech recognition methods of the second conventional example and of the embodiment of the present invention.

1: parameter extraction unit; 4: word dictionary unit; 7: word recognition unit.

Claims

[Claims]

1. A speech recognition method characterized in that: words to be recognized in input speech are matched against each dictionary entry of a word dictionary written in phoneme notation; the input speech is segmented phoneme by phoneme according to the dictionary phoneme sequence constituting each dictionary entry; when a wa-row semivowel is segmented, the time variation of the ratio R_ML of the mid-band power P_M to the low-band power P_L of the input speech is also checked, and an interval in which the ratio R_ML reaches a local minimum is segmented as the wa-row semivowel; a likelihood, which is a measure of how probable it is that each segmented speech interval is an utterance of the corresponding phoneme, is then calculated; and the similarity between each dictionary entry and the input speech is determined from the likelihood values, whereby the word in the input speech is recognized.