JPH0449958B2

Info

Publication number: JPH0449958B2
Application number: JP58240415A
Authority: JP
Inventors: Masahiro Hamada; Hideki Fuje
Original assignee: Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Holdings Corp
Priority date: 1983-12-19
Filing date: 1983-12-19
Publication date: 1992-08-12
Also published as: JPS60130800A

Description

[Detailed description of the invention]

Field of Industrial Application

The present invention relates to a speech recognition device based on spectral similarity evaluation.

Configuration of the Conventional Example and Its Problems

The problems of conventional speech recognition devices are explained below, taking a word recognition method as an example.

In FIG. 1, input speech is analyzed frame by frame by spectrum analysis means 1 and then energy-normalized by energy normalization means 2 to correct for variations in utterance intensity. Spectral similarity calculation means 3 then calculates the spectral similarity between the normalized input and standard pattern 6, which was processed by the same means and stored in advance. The per-frame spectral similarities obtained in this way are accumulated over the entire word length by similarity accumulation means 4, and determination means 5 makes the final determination.

In general, the intensity of human speech differs from one utterance to the next, even when the same word is pronounced. Energy normalization means 2 is effective in correcting such variations in utterance intensity. Its drawback is that it normalizes energy regardless of the nature of the segment, including segments whose energy ought to differ because of the nature of speech itself (for example, a vowel segment versus a voiceless consonant segment). FIG. 2 illustrates this concretely.

In FIG. 2A, the solid line represents the spectral intensity of a vowel and the broken line that of a consonant. The vertically hatched area between the two spectra indicates the spectral dissimilarity (the inverse concept of similarity) and can be regarded as representing the true spectral distance. FIG. 2B, on the other hand, shows the same vowel and consonant spectra after energy normalization.
In that figure, the consonant spectrum, whose energy was originally small, has been scaled up to roughly the same level as the spectral energy of the vowel, so the spectral dissimilarity indicated by the vertically hatched area is smaller than in FIG. 2A. In other words, energy normalization causes vowel and consonant spectra that originally had clearly different energy intensities, and at the same time a large spectral dissimilarity, to be evaluated as having roughly equal energy intensities and a small spectral dissimilarity.

In word recognition, the normalized spectral similarity is accumulated frame by frame between the registered speech template and the speech to be recognized. If the dissimilarity of speech segments that are clearly different linguistically (for example, the vowel and consonant segments in the example above) is weighted more heavily than the dissimilarity of segments that are linguistically similar (for example, segments of the same vowel that differ only slightly in utterance intensity), misrecognition between words that are clearly different linguistically can be prevented in the final recognition result.

In short, energy normalization removes the instability caused by energy intensity at the time of utterance, but it also blurs differences that ought to be linguistically clear and dominant.

Object of the Invention

The present invention solves the above drawback of the conventional device. Its object is to provide a speech recognition device that retains the above advantage of energy normalization while preventing, as far as possible, misrecognitions that are clearly unacceptable linguistically, thereby improving the recognition rate.

Configuration of the Invention

To achieve the above object, the speech recognition device of the present invention comprises: spectrum analysis means for analyzing the frequency characteristics of input speech frame by frame; energy normalization means for obtaining a normalized spectrum by excluding, from the spectral features obtained by the spectrum analysis means, elements attributable to the strength of the speech energy; spectral similarity calculation means for determining, frame by frame, the similarity between the normalized spectrum of registered speech and the normalized spectrum of speech for recognition; similarity accumulation means for accumulating the per-frame similarities obtained from the spectral similarity calculation means over the entire length of the speech unit to be recognized; phoneme sequence analysis means for determining the phoneme sequence of input speech; phoneme similarity evaluation means for evaluating the similarity between the phoneme sequence of the registered speech and that of the speech for recognition over the entire length of the speech unit to be recognized; and determination means for making the final determination using both the spectral similarity obtained from the similarity accumulation means and the phoneme similarity obtained from the phoneme similarity evaluation means.

Description of the Embodiment

An embodiment of the present invention is described below with reference to the drawings.

FIG. 3 is a block diagram of a speech recognition device according to an embodiment of the present invention. Components identical to those shown in FIG. 1 are given the same reference numerals, and their description is omitted. In FIG. 3, 7 is phoneme sequence analysis means, 8 is phoneme similarity evaluation means, and 9 is a phoneme sequence standard pattern.

The input speech is analyzed frame by frame by spectrum analysis means 1 and corrected for variations in utterance intensity by energy normalization means 2. Spectral similarity calculation means 3 then calculates the spectral similarity to spectrum standard pattern 6, which was processed by the same means and stored in advance, and the per-frame spectral similarities obtained in this way are accumulated over the entire word length by similarity accumulation means 4.
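
The spectral path described above (per-frame energy normalization, per-frame comparison, accumulation over the word) can be sketched as follows. This is only an illustrative sketch: the description does not specify the normalization or the similarity measure, so unit L2-norm scaling and Euclidean distance are assumptions, as is the requirement that both patterns have the same number of frames (no time alignment). All function names are hypothetical.

```python
import math


def normalize_energy(frame):
    """Scale a spectral frame to unit L2 norm (assumed form of energy normalization)."""
    norm = math.sqrt(sum(x * x for x in frame))
    if norm == 0.0:
        return list(frame)  # silent frame: nothing to scale
    return [x / norm for x in frame]


def frame_distance(a, b):
    """Euclidean distance between two frames (assumed dissimilarity measure)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def accumulated_distance(input_frames, template_frames):
    """Sum the per-frame distances between normalized spectra over the whole word.

    Assumes both patterns have the same number of frames.
    """
    if len(input_frames) != len(template_frames):
        raise ValueError("patterns must have the same number of frames")
    return sum(
        frame_distance(normalize_energy(a), normalize_energy(b))
        for a, b in zip(input_frames, template_frames)
    )


# Toy version of FIG. 2: a strong vowel frame and a weak consonant frame.
vowel = [8.0, 6.0, 2.0, 1.0]
consonant = [0.5, 0.5, 1.0, 1.0]
raw = frame_distance(vowel, consonant)
normalized = frame_distance(normalize_energy(vowel), normalize_energy(consonant))
print(raw, normalized)
```

With these made-up values the distance after normalization is much smaller than the raw distance, which is the effect shown in FIG. 2B. The same normalization makes two utterances that differ only in loudness compare as identical, which is the benefit the device is meant to keep.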
Meanwhile, the input speech is also fed to phoneme sequence analysis means 7. The phoneme sequence obtained there and phoneme sequence standard pattern 9, which was processed by the same means and stored in advance, are input to phoneme similarity evaluation means 8, which determines the phoneme similarity. The spectral similarity accumulated over the entire word length and the phoneme similarity are then input to determination means 5.

Determination means 5 makes its determination as follows. For convenience, the following explanation uses dissimilarity, the inverse concept of similarity, and refers to it as distance. The table below concerns two values: the smallest of the spectral distances obtained from the combinations of each registered speech with the speech for recognition (hereinafter the first candidate distance, d₁), and the difference between the smallest and the second smallest distance (hereinafter the second candidate distance difference, Δ₂). It shows each possible combination of the magnitude relations between these two values and separately predetermined thresholds (hereinafter the first candidate distance threshold, θ₁, and the second candidate distance difference threshold, θ₂).

[Table: cases of d₁ versus θ₁ and Δ₂ versus θ₂]
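
As a rough illustration of the two quantities the table is indexed by, the sketch below computes d₁ and Δ₂ from a set of accumulated spectral distances and reports how they compare with θ₁ and θ₂. The function names, the dictionary layout, and the use of strict inequalities are assumptions. The outcome assigned to each case is not sketched here, because the table contents are not reproduced in this text.

```python
def first_candidate_and_margin(distances):
    """Return (best, second, d1, delta2) from a mapping of registered word -> distance.

    d1 is the smallest accumulated spectral distance, and delta2 is the gap
    between the smallest and the second smallest distance.
    """
    if len(distances) < 2:
        raise ValueError("at least two registered words are required")
    ranked = sorted(distances.items(), key=lambda item: item[1])
    (best, d1), (second, d2) = ranked[0], ranked[1]
    return best, second, d1, d2 - d1


def threshold_case(d1, delta2, theta1, theta2):
    """Return which side of each threshold the two values fall on.

    The pair (d1 < theta1, delta2 < theta2) identifies one of four cases.
    Strict inequality is an assumption.
    """
    return (d1 < theta1, delta2 < theta2)


# Hypothetical accumulated distances for three registered words.
example = {"ichi": 3.0, "shichi": 3.5, "hachi": 9.0}
best, second, d1, delta2 = first_candidate_and_margin(example)
print(best, second, d1, delta2, threshold_case(d1, delta2, theta1=5.0, theta2=1.0))
```

In this made-up example the best candidate is close enough to the input, but the runner-up is almost as close, which is the kind of situation in which a second source of evidence such as the phoneme sequence becomes useful.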

Claims

[Claims]
1. A speech recognition device comprising: spectrum analysis means for analyzing the frequency characteristics of input speech frame by frame; energy normalization means for obtaining a normalized spectrum by excluding, from the spectral features obtained by the spectrum analysis means, elements attributable to the strength of the speech energy; spectral similarity calculation means for determining, frame by frame, the similarity between the normalized spectrum of registered speech and the normalized spectrum of speech for recognition; similarity accumulation means for accumulating the per-frame similarities obtained from the spectral similarity calculation means over the entire length of the speech unit to be recognized; phoneme sequence analysis means for determining the phoneme sequence of input speech; phoneme similarity evaluation means for evaluating the similarity between the phoneme sequence of the registered speech and the phoneme sequence of the speech for recognition over the entire length of the speech unit to be recognized; and determination means for making a final determination using both the spectral similarity obtained from the similarity accumulation means and the phoneme similarity obtained from the phoneme similarity evaluation means.
2. The speech recognition device according to claim 1, wherein the determination means is configured to make the determination using three factors: the largest similarity among the plurality of spectral similarities obtained between each registered speech and the speech for recognition, the similarity difference between the largest similarity and the second largest similarity, and the phoneme similarity.
3. The speech recognition device according to claim 2, wherein the determination means is configured to make the determination when the largest similarity is greater than a first predetermined threshold and the similarity difference is less than a second predetermined threshold.
4. The speech recognition device according to claim 3, wherein the determination means compares the phoneme sequence of the registered speech that gave the largest spectral similarity and the phoneme sequence of the registered speech that gave the second largest spectral similarity with the phoneme sequence of the speech for recognition, and takes as the recognition result the registered speech whose phoneme sequence is evaluated as more similar according to the prescribed rules.
5. The speech recognition device according to claim 3, wherein the determination means compares the phoneme sequence of the registered speech that gave the largest spectral similarity and the phoneme sequence of the registered speech that gave the second largest spectral similarity with the phoneme sequence of the speech for recognition and, based on the rules: takes the registered speech that gave the largest spectral similarity as the recognition result when both phoneme sequences are evaluated as similar to the phoneme sequence of the speech for recognition; takes the registered speech whose phoneme sequence is evaluated as similar as the recognition result when one phoneme sequence is evaluated as similar to the phoneme sequence of the speech for recognition and the other is not; and outputs no recognition result, on the ground that satisfactory recognition cannot be performed, when neither phoneme sequence is evaluated as similar to the phoneme sequence of the speech for recognition.
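
The three-way rule of claim 5 can be sketched as below. The claim does not define how two phoneme sequences are judged similar, so the predicate used here (edit distance of at most one phoneme) is purely a stand-in assumption, and every name in the block is hypothetical. The sketch covers only the case in which the top two spectral candidates are too close to separate. It says nothing about the other threshold cases.

```python
def edit_distance(a, b):
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute or match
        prev = cur
    return prev[-1]


def is_similar(a, b, max_edits=1):
    """Stand-in similarity rule (an assumption, not taken from the patent)."""
    return edit_distance(a, b) <= max_edits


def resolve_close_candidates(best, second, input_phonemes, templates, similar=is_similar):
    """Apply the claim 5 rule to the two top spectral candidates.

    Returns the chosen registered word, or None when no result is output.
    """
    best_ok = similar(templates[best], input_phonemes)
    second_ok = similar(templates[second], input_phonemes)
    if best_ok:      # both similar, or only the first candidate similar
        return best
    if second_ok:    # only the second candidate similar
        return second
    return None      # neither similar: recognition is rejected


# Hypothetical phoneme templates for two registered words.
templates = {"ichi": ["i", "ch", "i"], "hachi": ["h", "a", "ch", "i"]}
print(resolve_close_candidates("ichi", "hachi", ["h", "a", "ch", "i"], templates))
```

In the example the spectral ranking puts "ichi" first, but only the "hachi" template matches the input phoneme sequence, so the second candidate is returned. Passing a different `similar` predicate is the intended way to try other similarity rules.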