JPH0449958B2 - Voice recognition device - Google Patents

Publication number
JPH0449958B2
Authority
JP
Japan
Prior art keywords
speech
similarity
recognition
sequence
phoneme
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired
Application number
JP58240415A
Other languages
Japanese (ja)
Other versions
JPS60130800A (en)
Inventor
Masahiro Hamada
Hideki Fuje
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Panasonic Holdings Corp
Original Assignee
Matsushita Electric Industrial Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Matsushita Electric Industrial Co Ltd
Priority to JP58240415A
Publication of JPS60130800A
Publication of JPH0449958B2
Application granted

Description

【発明の詳細な説明】[Detailed description of the invention]

産業上の利用分野 本発明は、スペクトル類似度評価に基づく音声
認識装置に関するものである。 従来例の構成とその問題点 従来の音声認識装置の問題点を、単語認識方法
を例にとつて説明する。 第1図において、入力音声はスペクトル分析手
段1でフレーム毎に分析され、エネルギ正規化手
段2で発声強度のばらつきを補正するためのエネ
ルギ正規化を受けた後、予め同様の手段で処理さ
れ蓄えられていた標準パターン6との間のスペク
トル的な類似度がスペクトル類似度計算手段3で
計算され、このようにして得られたフレーム毎の
スペクトル類似性が類似度累積手段4により単語
全長にわたつて累積され、判定手段5により最終
の判定が行なわれるしくみとなつている。 一般に人間の発声強度は、同一の単語を発音す
る場合でもその都度異なる。エネルギ正規化手段
3はこのような発声強度のばらつきを補正するた
めに有効であるが、反面、音声の本質からしてエ
ネルギ強度に違いがあるべき区間〔例えば母音区
間と無声子音区間など〕についても、その性質に
かかわりなくエネルギ正規化を行なつてしまうと
いう欠点がある。第2図はその様子を具体的に示
したものである。 第2図Aにおいて、実線は母音のスペクトル強
度を、破線は子音のスペクトル強度をそれぞれ表
わしている。また両スペクトルの間の縦縞部分は
スペクトル非類似度〔類似度の逆の概念〕を示し
ており、本来のスペクトル的距離を表わしている
と考えることができる。一方、第2図Bは上記の
母音と子音とのスペクトルがエネルギ正規化処理
を受けた後のスペクトルを示している。同図で
は、本来エネルギの小さかつた子音のスペクトル
が拡大され、母音のスペクトルエネルギと同程度
になつたため、同図中の縦縞部分で示されたスペ
クトル非類似度は第2図Aの場合よりも小さくな
つている。即ち、エネルギ正規化処理によると、
本来明らかにエネルギ強度が異なり、同時にスペ
クトル非類似度が大きかつた母音と子音とのスペ
クトルが、比較的エネルギ強度が等しく、同時に
スペクトル非類似度が小さいものとして評価され
ることになる。 ところで、単語認識を行なう場合には、登録音
声テンプレートと認識対象音声との間でフレーム
毎に正規化スペクトル類似度を累積していく。従
つて言語的に明らかに異なつていると考えられる
音声区間〔例えば上記例の母音区間と子音区間〕
の非類似度は、言語的に似通つていると考えられ
る音声区間〔例えば同一種の母音であり、発声強
度のみが若干異なる母音区間〕の非類似度に比
べ、より大きな重みづけをもつて評価される方
が、最終の認識結果において明らかに言語的に異
なつていると考えられる言語間での誤認識が発生
することを防ぐことができる。 即ち、エネルギ正規化処理によると、発声時の
エネルギ強度的不安定要因が除外できる反面、言
語的に明確かつ支配的であるべき相違点をあいま
いにしてしまうという問題点があつた。 発明の目的 本発明は上記従来の欠点を解消するもので、エ
ネルギ正規化処理のもつ上記長所を生かしつつ、
同時に言語的に明らかに不都合と考えられるよう
な誤認識の発生をできるだけ阻止し、認識率の向
上を図ることのできる音声認識装置を提供するこ
とを目的とする。 発明の構成 上記目的を達成するため、本発明の音声認識装
置は、入力音声の周波数的特徴をフレーム毎に分
析するスペクトル分析手段と、このスペクトル分
析手段で得られたスペクトル特徴から音声のエネ
ルギの強弱に起因する要素を除外して正規化スペ
クトルを得るエネルギを正規化手段と、登録音声
の正規化スペクトルと認識用音声の正規化スペク
トルとの間の類似度をフレーム毎に求めるスペク
トル類似度計算手段と、このスペクトル類似度計
算手段から得られるフレーム毎の類似度を認識し
ようとする音声単位長の全体にわたつて累積する
類似度累積手段と、入力音声の音韻系列を求める
音韻系列分析手段と、登録音声の音韻系列と認識
用音声の音韻系列との間の類似度を認識しようと
する音声単位長の全体にわたつて評価する音韻類
似度評価手段と、前記類似度累積手段から得られ
るスペクトル類似度と前記音韻類似度評価手段か
ら得られる音韻類似度との両者を用いて最終の判
定を下す判定手段とを備えた構成である。 実施例の説明 以下、本発明の一実施例について、図面に基づ
いて説明する。 第3図は本発明の一実施例における音声認識装
置の構成図であり、第1図に示す構成要素と同一
の構成要素には同一の符号を付してその説明を省
略する。第3図において、7は音韻系列分析手
段、8は音韻類似度評価手段、9は音韻系列標準
パターンである。 入力音声はスペクトル分析手段1でフレーム毎
に分析され、エネルギ正規化手段2で発生強度の
ばらつきの補正を受けた後、予め同様の手段で処
理され蓄えられていたスペクトル標準パターン6
との間のスペクトル的な類似度がスペクトル類似
度計算手段3で計算され、このようにして得られ
たフレーム毎のスペクトル類似度が類似度累積手
段4により単語全長にわたつて累積されていく。
一方、前記入力音声は音韻系列分析手段7にも入
力されここで得られた音韻系列と、予め同様の手
段で処理され蓄えられていた音韻系列標準パター
ン9とが音韻類似度評価手段8に入力され、ここ
で音韻類似度が求められる。さらに、前記単語全
長にわたつて累積されたスペクトル類似度と、音
韻類似度とは判定手段5に入力される。 判定手段5における判定は次のように行なわれ
る。なお以下の説明では便宜上類似度と逆の概念
である非類似度を考え、これを距離と呼ぶことに
する。さて下記表は各登録音声と認識用音声との
組み合わせから得られた複数のスペクトル距離の
うち最も小さい距離〔以下第1候補距離(d1)と
呼ぶ〕と、最も小さい距離と2番目に小さい距離
との差〔以下第2候補距離差Δ2と呼ぶ〕との2
つの値について、それらが、予め別に定めたしき
い値〔以下第1候補距離しきい値(θ1)および第
2候補距離差しきい値(θ2)と呼ぶ〕に対してと
り得る大小関係の組み合わせの各場合を示してい
る。
Field of Industrial Application

The present invention relates to a speech recognition device based on spectral similarity evaluation.

Configuration of the Conventional Example and Its Problems

The problems of a conventional speech recognition device are explained here using word recognition as an example.

In FIG. 1, input speech is analyzed frame by frame by spectrum analysis means 1 and is energy-normalized by energy normalization means 2 to correct variations in utterance intensity. Spectral similarity calculation means 3 then calculates the spectral similarity to standard pattern 6, which has been processed by the same means and stored in advance. The per-frame spectral similarity obtained in this way is accumulated over the entire word length by similarity accumulation means 4, and the final determination is made by determination means 5.

In general, the intensity of human speech differs from one utterance to the next, even when the same word is pronounced. Energy normalization means 2 is effective for correcting such variations in utterance intensity. Its drawback is that it also normalizes intervals whose energy ought to differ because of the nature of speech (for example, a vowel interval and a voiceless consonant interval), regardless of their properties. FIG. 2 shows this concretely.

In FIG. 2A, the solid line represents the spectral intensity of a vowel and the broken line represents that of a consonant. The vertically striped region between the two spectra indicates the spectral dissimilarity (the opposite of similarity) and can be regarded as the original spectral distance. FIG. 2B shows the same vowel and consonant spectra after energy normalization.
In FIG. 2B, the spectrum of the consonant, which originally had low energy, has been scaled up to roughly the spectral energy of the vowel, so the spectral dissimilarity indicated by the vertical stripes is smaller than in FIG. 2A. That is, under energy normalization, vowel and consonant spectra that originally had clearly different energy and a large spectral dissimilarity are evaluated as having comparable energy and a small spectral dissimilarity.

In word recognition, the normalized spectral similarity between the registered speech template and the speech to be recognized is accumulated frame by frame. Therefore, if the dissimilarity of speech intervals that are clearly different linguistically (for example, the vowel and consonant intervals above) is given greater weight than the dissimilarity of intervals that are linguistically similar (for example, vowel intervals of the same type that differ only slightly in utterance intensity), misrecognition between utterances that are clearly different linguistically can be prevented in the final recognition result.

In short, energy normalization removes the instability caused by utterance intensity, but it also obscures differences that should be linguistically clear and dominant.

Object of the Invention

The present invention solves the above drawback of the conventional device. Its object is to provide a speech recognition device that keeps the advantage of energy normalization while preventing, as far as possible, misrecognitions that are clearly unacceptable from a linguistic point of view, thereby improving the recognition rate.

Structure of the Invention

To achieve this object, the speech recognition device of the present invention comprises: spectrum analysis means for analyzing the frequency characteristics of input speech frame by frame; energy normalization means for obtaining a normalized spectrum by removing, from the spectral features obtained by the spectrum analysis means, the component caused by the strength of the speech energy; spectral similarity calculation means for obtaining, frame by frame, the similarity between the normalized spectrum of registered speech and the normalized spectrum of the speech to be recognized; similarity accumulation means for accumulating the per-frame similarity obtained from the spectral similarity calculation means over the entire length of the speech unit to be recognized; phoneme sequence analysis means for obtaining the phoneme sequence of the input speech; phoneme similarity evaluation means for evaluating the similarity between the phoneme sequence of the registered speech and the phoneme sequence of the speech to be recognized over the entire length of the speech unit to be recognized; and determination means for making the final determination using both the spectral similarity obtained from the similarity accumulation means and the phoneme similarity obtained from the phoneme similarity evaluation means.

Description of the Embodiment

An embodiment of the present invention is described below with reference to the drawings.

FIG. 3 is a block diagram of a speech recognition device according to an embodiment of the present invention. Components identical to those shown in FIG. 1 are given the same reference numerals and their explanation is omitted. In FIG. 3, 7 is phoneme sequence analysis means, 8 is phoneme similarity evaluation means, and 9 is a phoneme sequence standard pattern.

Input speech is analyzed frame by frame by spectrum analysis means 1 and corrected for variations in utterance intensity by energy normalization means 2. Spectral similarity calculation means 3 then calculates the spectral similarity to spectrum standard pattern 6, which has been processed by the same means and stored in advance. The per-frame spectral similarity obtained in this way is accumulated over the entire word length by similarity accumulation means 4.
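The spectral path described above (energy normalization, per-frame comparison, accumulation over the word) can be sketched as follows. This is only an illustration under stated assumptions: the patent does not specify the normalization formula, the distance measure, or how frames are time-aligned, so unit-norm scaling, Euclidean distance, and one-to-one frame pairing are all assumed here, and the spectrum values are made up.

```python
import math

def normalize_energy(frame):
    """Scale one spectrum frame to unit Euclidean norm (assumed normalization)."""
    norm = math.sqrt(sum(x * x for x in frame))
    if norm == 0.0:
        return list(frame)  # silent frame: nothing to scale
    return [x / norm for x in frame]

def frame_distance(a, b):
    """Euclidean distance between two equal-length spectrum frames (assumed measure)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def accumulated_distance(template, utterance, normalize=True):
    """Sum per-frame distances over the whole word, pairing frames one-to-one."""
    total = 0.0
    for t, u in zip(template, utterance):
        if normalize:
            t, u = normalize_energy(t), normalize_energy(u)
        total += frame_distance(t, u)
    return total

# Hypothetical single-frame spectra: a strong vowel and a weak consonant.
vowel = [[8.0, 6.0, 2.0, 1.0]]
consonant = [[0.5, 0.5, 1.0, 1.5]]

raw = accumulated_distance(vowel, consonant, normalize=False)
normalized = accumulated_distance(vowel, consonant, normalize=True)
print(raw, normalized)  # the normalized distance is the smaller of the two
```

With these values the distance after normalization is smaller than the raw distance, which is the effect FIG. 2B is described as showing: normalization removes the energy difference that separated the vowel from the consonant.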
Meanwhile, the input speech is also fed to phoneme sequence analysis means 7. The phoneme sequence obtained there and phoneme sequence standard pattern 9, which has been processed by the same means and stored in advance, are input to phoneme similarity evaluation means 8, which determines the phoneme similarity. The spectral similarity accumulated over the entire word length and the phoneme similarity are then input to determination means 5.

Determination means 5 operates as follows. For convenience, the following explanation uses dissimilarity, the opposite of similarity, and calls it distance. The table below considers two values: the smallest of the spectral distances obtained from each pairing of a registered speech with the recognition speech (hereinafter the first candidate distance, d1), and the difference between the smallest distance and the second-smallest distance (hereinafter the second candidate distance difference, Δ2). It shows each possible combination of the magnitude relationships between these two values and their separately predetermined thresholds (hereinafter the first candidate distance threshold, θ1, and the second candidate distance difference threshold, θ2).
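As a small illustration of the two quantities defined above, the following sketch computes d1 and Δ2 from a set of accumulated spectral distances and reports how each compares with its threshold. The word labels, distance values, and threshold values are hypothetical, and the use of strict inequalities is an assumption; the table itself is not reproduced in this text, so the sketch does not say what the device does in each case.

```python
def candidate_statistics(distances):
    """Return (best_label, d1, delta2) from a {label: accumulated distance} mapping.

    d1 is the smallest distance; delta2 is the gap between the smallest
    and the second-smallest distance. At least two candidates are required.
    """
    ranked = sorted(distances.items(), key=lambda item: item[1])
    (best_label, d1), (_, d_second) = ranked[0], ranked[1]
    return best_label, d1, d_second - d1

def threshold_relations(d1, delta2, theta1, theta2):
    """Return (d1 < theta1, delta2 < theta2); strictness is an assumption."""
    return d1 < theta1, delta2 < theta2

# Hypothetical accumulated distances for three registered words.
distances = {"tako": 12.0, "saka": 13.5, "mizu": 30.0}
best, d1, delta2 = candidate_statistics(distances)
close_enough, ambiguous = threshold_relations(d1, delta2, theta1=20.0, theta2=5.0)
print(best, d1, delta2, close_enough, ambiguous)
```

Here the best candidate is within the distance threshold, but its margin over the runner-up is below the difference threshold, which is the situation in which a second source of evidence such as the phoneme sequence becomes useful.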

【表】【table】

Claims (1)

【特許請求の範囲】 1 入力音声の周波数的特徴をフレーム毎に分析
するスペクトル分析手段と、このスペクトル分析
手段で得られたスペクトル特徴から音声のエネル
ギの強弱に起因する要素を除外して正規化スペク
トルを得るエネルギ正規化手段と、登録音声の正
規化スペクトルと認識用音声の正規化スペクトル
との間の類似度をフレーム毎に求めるスペクトル
類似度計算手段と、このスペクトル類似度計算手
段から得られるフレーム毎の類似度を認識しよう
とする音声単位長の全体にわたつて累積する類似
度累積手段と、入力音声の音韻系列を求める音韻
系列分析手段と、登録音声の音韻系列と認識用音
声の音韻系列との間の類似度を認識しようとする
音声単位長の全体にわたつて評価する音韻類似度
評価手段と、前記類似度累積手段から得られるス
ペクトル類似度と前記音韻類似度評価手段から得
られる音韻類似度との両者を用いて最終の判定を
下す判定手段とを備えた音声認識装置。 2 判定手段は、登録音声のそれぞれと認識用音
声との間から得られた複数のスペクトル類似度の
うち、最も大きい類似度と、最も大きい類似度と
2番目に大きい類似度との類似度差と、音韻類似
度との三者を用いて判定を下す構成とした特許請
求の範囲第1項記載の音声認識装置。 3 判定手段は、最も大きい類似度が予め定めら
れた第1のしきい値より大きくなり、かつ類似度
差が予め定められた第2のしきい値より小さくな
つた時に判定を下す構成とした特許請求の範囲第
2項記載の音声認識装置。 4 判定手段は、最も大きいスペクトル類似度を
与えた登録音声の音韻系列と、2番目に大きいス
ペクトル類似度を与えた登録音声の音韻系列と
を、認識用音声の音韻系列と比較し、予め定めた
規則によつて音韻系列がより類似していると評価
された方の登録音声をもつて認識結果とする構成
とした特許請求の範囲第3項記載の音声認識装
置。 5 判定手段は、最も大きいスペクトル類似度を
与えた登録音声の音韻系列と、2番目に大きいス
ペクトル類似度を与えた登録音声の音韻系列と
を、認識用音声の音韻系列と比較し、予め定めた
規則に基づいて、双方の音韻系列がともに認識用
音声の音韻系列と類似していると評価された時に
は最も大きいスペクトル類似度を与えた登録音声
をもつて認識結果とし、いずれか一方の音韻系列
が認識用音声の音韻系列に類似しかつ他方の音韻
系列が認識用音声の音韻系列に類似していないと
評価された時には類似していると評価された音韻
系列を与えた登録音声をもつて認識結果とし、双
方の音韻系列がともに認識用音声の音韻系列と類
似していないと評価された時には満足すべき認識
が行なえなかつたとして認識結果を出力しない構
成とした特許請求の範囲第3項記載の音声認識装
置。
[Claims]

1. A speech recognition device comprising: spectrum analysis means for analyzing the frequency characteristics of input speech frame by frame; energy normalization means for obtaining a normalized spectrum by removing, from the spectral features obtained by the spectrum analysis means, the component caused by the strength of the speech energy; spectral similarity calculation means for obtaining, frame by frame, the similarity between the normalized spectrum of registered speech and the normalized spectrum of recognition speech; similarity accumulation means for accumulating the per-frame similarity obtained from the spectral similarity calculation means over the entire length of the speech unit to be recognized; phoneme sequence analysis means for obtaining the phoneme sequence of the input speech; phoneme similarity evaluation means for evaluating the similarity between the phoneme sequence of the registered speech and the phoneme sequence of the recognition speech over the entire length of the speech unit to be recognized; and determination means for making the final determination using both the spectral similarity obtained from the similarity accumulation means and the phoneme similarity obtained from the phoneme similarity evaluation means.

2. The speech recognition device according to claim 1, wherein the determination means makes its determination using three values: the largest of the spectral similarities obtained between each registered speech and the recognition speech, the similarity difference between the largest similarity and the second-largest similarity, and the phoneme similarity.

3. The speech recognition device according to claim 2, wherein the determination means makes its determination when the largest similarity is greater than a predetermined first threshold and the similarity difference is smaller than a predetermined second threshold.

4. The speech recognition device according to claim 3, wherein the determination means compares the phoneme sequence of the registered speech that gave the largest spectral similarity, and the phoneme sequence of the registered speech that gave the second-largest spectral similarity, with the phoneme sequence of the recognition speech, and takes as the recognition result the registered speech whose phoneme sequence is evaluated as more similar under a predetermined rule.

5. The speech recognition device according to claim 3, wherein the determination means compares the phoneme sequence of the registered speech that gave the largest spectral similarity, and the phoneme sequence of the registered speech that gave the second-largest spectral similarity, with the phoneme sequence of the recognition speech and, based on a predetermined rule: when both phoneme sequences are evaluated as similar to the phoneme sequence of the recognition speech, takes the registered speech that gave the largest spectral similarity as the recognition result; when one phoneme sequence is evaluated as similar to the phoneme sequence of the recognition speech and the other is not, takes the registered speech whose phoneme sequence was evaluated as similar as the recognition result; and when neither phoneme sequence is evaluated as similar to the phoneme sequence of the recognition speech, outputs no recognition result on the ground that satisfactory recognition could not be performed.
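The determination rule of claims 3 and 5 can be sketched roughly as below. The claims leave the "predetermined rule" for phoneme similarity unspecified, so this sketch assumes a hypothetical rule: two phoneme sequences count as similar when their edit distance is at most `max_edits`. The candidate words, similarity scores, and thresholds are invented for illustration, and the sketch does not decide cases outside the claim 3 condition because the claims do not state how those are handled.

```python
def edit_distance(a, b):
    """Levenshtein distance between two phoneme sequences (lists of strings)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute or match
        prev = cur
    return prev[-1]

def phonemes_similar(a, b, max_edits=1):
    """Hypothetical similarity rule: at most max_edits edit operations apart."""
    return edit_distance(a, b) <= max_edits

def decide(candidates, input_phonemes, theta1, theta2):
    """candidates: list of (label, spectral_similarity, phoneme_sequence).

    Returns (label or None, reason). Needs at least two candidates.
    """
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    first, second = ranked[0], ranked[1]
    # Claim 3 condition: top score high enough, but margin over runner-up small.
    if not (first[1] > theta1 and first[1] - second[1] < theta2):
        return None, "outside the claim 3 condition; not handled in this sketch"
    first_ok = phonemes_similar(first[2], input_phonemes)
    second_ok = phonemes_similar(second[2], input_phonemes)
    if first_ok:
        # Covers both-similar and only-first-similar: keep the spectral winner.
        return first[0], "spectral winner confirmed by phonemes"
    if second_ok:
        return second[0], "runner-up chosen by phonemes"
    return None, "rejected: neither phoneme sequence is similar"

candidates = [
    ("tako", 0.90, ["t", "a", "k", "o"]),
    ("saka", 0.88, ["s", "a", "k", "a"]),
    ("mizu", 0.40, ["m", "i", "z", "u"]),
]
result = decide(candidates, ["s", "a", "k", "a"], theta1=0.8, theta2=0.05)
print(result)
```

In this example the spectral scores alone would pick "tako", but its phoneme sequence is two edits away from the input while that of "saka" matches, so the runner-up is returned. Claim 4 describes an alternative in which the more similar of the two phoneme sequences is always chosen.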
JP58240415A 1983-12-19 1983-12-19 voice recognition device Granted JPS60130800A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP58240415A JPS60130800A (en) 1983-12-19 1983-12-19 voice recognition device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP58240415A JPS60130800A (en) 1983-12-19 1983-12-19 voice recognition device

Publications (2)

Publication Number Publication Date
JPS60130800A JPS60130800A (en) 1985-07-12
JPH0449958B2 true JPH0449958B2 (en) 1992-08-12

Family

ID=17059130

Family Applications (1)

Application Number Title Priority Date Filing Date
JP58240415A Granted JPS60130800A (en) 1983-12-19 1983-12-19 voice recognition device

Country Status (1)

Country Link
JP (1) JPS60130800A (en)

Also Published As

Publication number Publication date
JPS60130800A (en) 1985-07-12


Legal Events

Date Code Title Description
EXPY Cancellation because of completion of term