JPH042197B2

Info

Publication number: JPH042197B2
Application number: JP58163537A
Authority: JP
Priority date: 1983-09-05
Filing date: 1983-09-05
Publication date: 1992-01-16
Also published as: JPS6053998A

Description

DETAILED DESCRIPTION OF THE INVENTION

Field of the Invention

The present invention relates to a speech recognition device that automatically recognizes speech signals uttered by a human voice.

Conventional Configuration and Its Problems

A speech recognition device that automatically recognizes speech is considered a very effective means of giving data and commands from a person to computers and various machines.

Most speech recognition devices studied or reported so far use pattern matching as their operating principle. In this method, a standard pattern is stored in advance for every word that must be recognized. An unknown input pattern is compared with each standard pattern to compute a degree of match (hereinafter called the similarity), and the input is judged to be the same word as the standard pattern that gives the best match. Because pattern matching requires a standard pattern for every word to be recognized, new standard patterns must be entered and stored whenever the uttered voice changes. Consequently, when several hundred or more words are to be recognized, uttering and registering every word takes time and effort, and the memory required for registration is expected to become enormous. A further drawback is that the time needed to match the input pattern against the standard patterns grows as the number of words increases.

In contrast, there is a method that divides the input speech into phoneme units, recognizes it as a combination of phonemes (hereinafter called phoneme recognition), and computes the similarity against a word dictionary written in phoneme units. This method requires far less memory for the word dictionary, needs less time for pattern matching, and makes it easy to change the dictionary contents. For example, the utterance "akai" ("red") can be expressed in the very simple form AKAI by combining the three phonemes /a/, /k/, and /i/, so it is easy to handle the speech of many words from unspecified speakers.

Fig. 1 shows a block diagram of a speech recognition system based on phoneme recognition. Speech input through a microphone or the like is analyzed by the acoustic analysis unit 1. The analysis uses a bank of band-pass filters or linear predictive analysis and yields spectral information every frame period (about 10 ms). The phoneme discrimination unit 2 uses the spectral information from the acoustic analysis unit 1 and the data in the standard pattern storage unit 3 to discriminate phonemes frame by frame. The standard patterns in the standard pattern storage unit 3 are obtained in advance for each phoneme from the speech of many speakers. The segmentation unit 4 detects speech intervals and determines the boundaries between phonemes (hereinafter called segmentation) from the analysis output of the acoustic analysis unit 1. The phoneme recognition unit 5 combines the results of the segmentation unit 4 and the phoneme discrimination unit 2 to decide which phoneme each phoneme interval contains, which produces a phoneme sequence. The word recognition unit 6 matches this phoneme sequence against the word dictionary 7, which is likewise written as phoneme sequences, and outputs the word with the highest similarity as the recognition result.
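The dictionary lookup performed by the word recognition unit can be sketched as follows. The source does not say how the similarity between a recognized phoneme sequence and a dictionary entry is measured, so this sketch assumes a plain Levenshtein (edit) distance; the names `edit_distance`, `recognize_word`, and `WORD_DICTIONARY`, and the dictionary entries themselves, are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two phoneme sequences (lists of symbols)."""
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, start=1):
        cur = [i]
        for j, pb in enumerate(b, start=1):
            cost = 0 if pa == pb else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution or match
        prev = cur
    return prev[-1]


def recognize_word(phonemes, dictionary):
    """Return the dictionary word whose phoneme sequence is closest to the input.

    Assumption: "highest similarity" is taken to mean "smallest edit distance".
    """
    return min(dictionary, key=lambda word: edit_distance(phonemes, dictionary[word]))


# Hypothetical dictionary written as phoneme sequences.
WORD_DICTIONARY = {
    "akai": ["a", "k", "a", "i"],
    "aoi": ["a", "o", "i"],
    "kuroi": ["k", "u", "r", "o", "i"],
}

# A phoneme sequence with one misrecognized vowel still maps to "akai".
best = recognize_word(["a", "k", "e", "i"], WORD_DICTIONARY)
```

Because each entry is only a short list of symbols, adding or changing a word means editing one dictionary entry, which is the memory and maintenance advantage the text attributes to phoneme-based dictionaries.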

When the above method is applied to unspecified speakers, the most important point is to obtain high recognition accuracy stably for any speaker and environment. Achieving this must not place an excessive burden on the speaker, nor require expensive components when the method is built into a speech recognition device.

However, the speech recognition devices reported or prototyped so far have the drawback that they do not adequately satisfy these conditions.

As a conventional example, one method uses the prediction residual (see Shikano and Kohda, "Evaluation of LPC distance measures for vowel recognition in conversational speech," Journal of the Institute of Electronics and Communication Engineers, 80/5, Vol. J63-D, No. 5). In this method, the maximum parameters $A_{ij}$ ($j = 1, 2, \ldots, p$, where $p$ is the analysis order) of phoneme $i$ are obtained in advance by linear predictive analysis of the speech of many speakers, and the prediction residual is computed by the following equation.

$$N_i = \sum_{j=1}^{p} A_{ij} S_j \qquad (1)$$

Here $S_j$ is the autocorrelation coefficient obtained from the unknown input speech. The prediction residual $N_i$ is computed for each target phoneme and used as a distance measure, and the phoneme with the smallest $N_i$ is taken as the discrimination result.
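The decision rule of equation (1) can be sketched as below: compute the residual for every phoneme and pick the smallest. The function names and the numeric values are hypothetical and only illustrate the arithmetic; they are not taken from the cited method.

```python
def prediction_residual(A_i, S):
    """Equation (1): N_i = sum_j A_ij * S_j for one phoneme i."""
    return sum(a * s for a, s in zip(A_i, S))


def classify_by_residual(A, S):
    """Return (best_phoneme, residuals); the phoneme with the smallest N_i wins.

    A maps each phoneme label to its parameter list A_i (length p).
    S is the list of autocorrelation coefficients of the unknown input.
    """
    residuals = {phoneme: prediction_residual(A_i, S) for phoneme, A_i in A.items()}
    return min(residuals, key=residuals.get), residuals


# Hypothetical two-parameter example (p = 2).
A = {"a": [1.0, -0.5], "i": [1.0, 0.2]}
S = [1.0, 0.8]
best_phoneme, residuals = classify_by_residual(A, S)
```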

However, because the maximum parameters $A_{ij}$, which correspond to the standard patterns of the phonemes, are simple averages, this method cannot cope with variations in utterance caused by coarticulation, even if a learning function that rebuilds $A_{ij}$ for the user is provided. The recognition rate is therefore low.

Object of the Invention

The object of the present invention is to eliminate these drawbacks and to provide a speech recognition device that can handle unspecified speakers and stably achieves high recognition accuracy without being affected by differences in speaker, environment, or words.

Structure of the Invention

To achieve the above object, the present invention comprises: an acoustic analysis unit that calculates a spectrum or similar information (hereinafter called spectral information) from a speech signal; a coefficient storage unit that stores in advance standard patterns including at least the variance-covariance matrix and the mean values of spectral information obtained from standard speech signals of many speakers; a similarity calculation unit that obtains a similarity for each phoneme using the spectral information and the standard patterns; a word dictionary storage unit that stores a word dictionary written as similarities or phoneme sequences; an output unit that matches the similarities or phoneme sequence produced through the similarity calculation unit against the word dictionary and outputs the word with the highest similarity as the recognition result; and a learning unit that creates new mean values from the result of the output unit and the spectral information of the acoustic analysis unit and rewrites the contents of the coefficient storage unit on the basis of that result.

Description of the Embodiment

Fig. 2 shows one embodiment of the configuration of the speech recognition device of the present invention. The speech signal from the microphone 31 is converted to 12 bits at a 12 kHz sampling rate by the AD converter 21. The signal processing circuit 22 applies pre-emphasis and a 20 ms Hamming window, and the linear predictive analysis processor 23 calculates LPC cepstrum coefficients every 10 ms. The LPC cepstrum coefficients are passed to the similarity calculation unit 24, which calculates the similarity to each phoneme for every frame and transfers it to the main memory 27. The coefficient memory 25 stores the filter coefficients for each phoneme.
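The framing implied by these figures can be sketched as follows: at 12 kHz, a 20 ms window is 240 samples and a 10 ms step is 120 samples, so consecutive frames overlap by half. The pre-emphasis coefficient is not given in the source, so the value 0.97 below is an assumption, as are all function names. The LPC cepstrum computation itself is outside this sketch.

```python
import math

SAMPLE_RATE_HZ = 12_000
WINDOW_LEN = SAMPLE_RATE_HZ * 20 // 1000   # 20 ms window -> 240 samples
FRAME_SHIFT = SAMPLE_RATE_HZ * 10 // 1000  # 10 ms step   -> 120 samples


def pre_emphasize(samples, coeff=0.97):
    """First-order pre-emphasis y[n] = x[n] - coeff * x[n-1] (coeff is assumed)."""
    if not samples:
        return []
    return [samples[0]] + [samples[n] - coeff * samples[n - 1]
                           for n in range(1, len(samples))]


def hamming_window(length):
    """Standard Hamming window w[n] = 0.54 - 0.46 * cos(2*pi*n / (length - 1))."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (length - 1))
            for n in range(length)]


def split_into_frames(samples):
    """Pre-emphasize, then cut 20 ms Hamming-windowed frames every 10 ms."""
    window = hamming_window(WINDOW_LEN)
    emphasized = pre_emphasize(samples)
    result = []
    for start in range(0, len(emphasized) - WINDOW_LEN + 1, FRAME_SHIFT):
        chunk = emphasized[start:start + WINDOW_LEN]
        result.append([w * x for w, x in zip(window, chunk)])
    return result


# 100 ms of a 1 kHz test tone (1200 samples) yields 9 complete frames.
tone = [math.sin(2 * math.pi * 1000 * n / SAMPLE_RATE_HZ) for n in range(1200)]
tone_frames = split_into_frames(tone)
```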

Meanwhile, the band filter 26 calculates the band power of about three channels and the total power, and transfers them to the main memory 27 as data for phoneme segmentation. The main processor 28 uses the results of the similarity calculation unit 24 and the band filter 26 to detect speech intervals and segment them into phonemes. It then determines, for each interval, the phoneme with the highest similarity from the per-phoneme similarities of the similarity calculation unit 24, and builds a phoneme sequence. This phoneme sequence is matched against the word dictionary memory 29, which is likewise written as phoneme sequences, and the word with the highest similarity is output to the output unit 30 as the recognition result.

With this alone the device can be used for unspecified speakers, but because the coefficient memory 25, which holds the standard patterns, is fixed, recognition performance varies widely between speakers and the recognition rate can become quite low. A learning unit 32 is therefore provided to add a learning function. The learning unit 32 receives the LPC cepstrum coefficients obtained by the linear predictive analysis processor 23, builds learning data with reference to the result obtained from the output unit 30, recalculates the discriminant coefficients for each phoneme that best suit the speaker on the basis of the variance-covariance matrix obtained in advance, and transfers them to the coefficient memory 25.

Next, the operation of the phoneme recognition device according to the present invention is described in detail with reference to Fig. 2.

In advance, the vowels /a/, /o/, /u/, /i/, /e/ and the nasals are cut out, via the AD converter 21, from a large number of spoken words uttered by many speakers and input through the microphone 31. Using this speech data, the signal processing circuit 22 and the linear predictive analysis processor 23 perform linear predictive analysis for every 10 ms analysis interval and calculate $p$-dimensional LPC cepstrum coefficients. From these LPC cepstrum coefficients, the covariance matrix $W$ over all phonemes and the mean $m_i$ of each phoneme ($i$ denotes the phoneme) are obtained. From these results, the discriminant coefficients $a_{ij}$ ($j = 1, 2, \ldots, p$) for phoneme $i$ can be expressed as

$$a_{ij} = 2 \sum_{j'=1}^{p} \delta^{jj'} m_{ij'} \qquad (2)$$

where $\delta^{jj'}$ is the $(j, j')$ element of the inverse $W^{-1}$ of the covariance matrix $W$.
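Equation (2) amounts to the matrix-vector product $a_i = 2 W^{-1} m_i$. A minimal sketch, assuming the inverse covariance matrix is already available as a list of rows, is given below together with the constant $m_i^{t} W^{-1} m_i$ that is stored alongside it. The helper `invert_2x2` exists only to keep the example self-contained for $p = 2$; all names and numbers are hypothetical.

```python
def invert_2x2(W):
    """Inverse of a 2x2 matrix; stands in for a general inversion routine."""
    (a, b), (c, d) = W
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]


def discriminant_coefficients(W_inv, m_i):
    """Equation (2): a_ij = 2 * sum_j' W_inv[j][j'] * m_i[j']."""
    p = len(m_i)
    return [2.0 * sum(W_inv[j][jp] * m_i[jp] for jp in range(p)) for j in range(p)]


def constant_term(W_inv, m_i):
    """Quadratic form m_i^t W^-1 m_i stored with the standard pattern."""
    p = len(m_i)
    return sum(m_i[j] * W_inv[j][jp] * m_i[jp] for j in range(p) for jp in range(p))


# Hypothetical p = 2 example with a diagonal covariance matrix.
W = [[2.0, 0.0], [0.0, 0.5]]
m_a = [1.0, 3.0]                       # mean vector of one phoneme
W_inv = invert_2x2(W)                  # [[0.5, 0], [0, 2.0]]
a_a = discriminant_coefficients(W_inv, m_a)
c_a = constant_term(W_inv, m_a)
```

Since $W$ is shared by all phonemes, the inversion is done once, and each phoneme then needs only one matrix-vector product and one quadratic form.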

For each phoneme, $a_{ij}$, $m_{ij'}$, $\delta^{jj'}$, and $m_i^{t} W^{-1} m_i$ (described below) are obtained and stored in the coefficient memory 25 as the standard pattern.

Next, the user utters speech whose content is known in advance (for example /a/, /i/, /u/, /e/, /o/). The linear predictive analysis processor 23 obtains the LPC cepstrum coefficients for each analysis interval in the speech interval and transfers them to the learning unit 32. Meanwhile, the discriminant filter 24 obtains the similarity using the standard patterns stored beforehand in the coefficient memory 25. In the similarity calculation unit 24, the Mahalanobis distance $D_i^2$ for the LPC cepstrum coefficients $x$ of the input signal can be expressed as

$$D_i^2 = x^{t} W^{-1} x - \sum_{j=1}^{p} a_{ij} x_j + m_i^{t} W^{-1} m_i \qquad (3)$$

where $t$ denotes transposition. Because the first term does not depend on the phoneme, the similarity $L_i$ is expressed in simplified form as

$$L_i = \sum_{j=1}^{p} a_{ij} x_j - m_i^{t} W^{-1} m_i \qquad (4)$$

and equation (4) is used to calculate the similarity. The result is transferred to the main memory 27, and a phoneme sequence is built by the main processor 28. Next, a value indicating the position on the time axis of the phoneme to be learned is returned from the output unit 30 to the learning unit 32, and the mean of the LPC cepstrum coefficients of that phoneme is obtained. This is repeated as many times as necessary while changing the kind of speech. The mean of each phoneme, suitably weighted, is added to the original mean without learning ($m_{ij'}$) to produce a new mean for each phoneme, which replaces the mean $m_{ij'}$ in the coefficient memory 25. The new mean is then used to correct the discriminant coefficients $a_{ij}$ and the constant term (second term) of equation (4) for each phoneme, and these are transferred to the coefficient memory 25 as the new standard pattern, rewriting the standard pattern.
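The similarity of equation (4) and the learning step can be sketched together. The source says only that the user's mean is "suitably weighted" and added to the original mean; the convex combination `(1 - weight) * m_old + weight * m_user` used here, the default `weight=0.5`, and all names are assumptions made for illustration. The covariance matrix is left untouched, as in the text, so only the mean, the coefficients $a_{ij}$, and the constant term are recomputed.

```python
def similarity(a_i, c_i, x):
    """Equation (4): L_i = sum_j a_ij * x_j - m_i^t W^-1 m_i (c_i is the constant)."""
    return sum(a * xj for a, xj in zip(a_i, x)) - c_i


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[j] for v in vectors) / n for j in range(len(vectors[0]))]


def adapt_pattern(W_inv, m_old, user_vectors, weight=0.5):
    """Blend the stored mean with the user's mean and rebuild a_i and the constant.

    Assumption: new mean = (1 - weight) * m_old + weight * m_user.
    W_inv (the inverse of the shared covariance matrix) is not re-estimated.
    """
    m_user = mean_vector(user_vectors)
    m_new = [(1.0 - weight) * mo + weight * mu for mo, mu in zip(m_old, m_user)]
    p = len(m_new)
    a_new = [2.0 * sum(W_inv[j][k] * m_new[k] for k in range(p)) for j in range(p)]
    c_new = sum(m_new[j] * W_inv[j][k] * m_new[k] for j in range(p) for k in range(p))
    return m_new, a_new, c_new


# Hypothetical p = 2 example with an identity inverse covariance.
W_inv = [[1.0, 0.0], [0.0, 1.0]]
m_old = [0.0, 0.0]
user_vectors = [[2.0, 0.0], [2.0, 2.0]]     # cepstrum vectors labeled as one phoneme
m_new, a_new, c_new = adapt_pattern(W_inv, m_old, user_vectors)
score_at_mean = similarity(a_new, c_new, m_new)
```

The update needs only averaging, one matrix-vector product, and one quadratic form per phoneme, which is consistent with the text's remark that learning requires no high-precision arithmetic circuit.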

Next, actual speech recognition is described. For an unknown speech signal input from the microphone 10, the signal processing circuit 22 and the linear predictive analysis processor 23 obtain the LPC cepstrum coefficients $x$ ($x_1, x_2, \ldots, x_p$), which are transferred to the similarity calculation unit 24. Using the standard patterns obtained in advance and stored in the coefficient memory 25, the similarity $L_i$ of phoneme $i$ is obtained from equation (4).

This is obtained for every phoneme ($i = 1, 2, \ldots, n$, where $n$ is the number of phonemes) and transferred to the main memory 27. The main processor 28 performs phoneme recognition by combining these similarities with the result of segmentation based on the output of the band filter 26, and builds a phoneme sequence.
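How the per-frame similarities and the segment boundaries might be combined is sketched below. The source does not state how frame-level similarities are pooled within a segment, so summing them over the frames of each segment is an assumption, and `decide_phonemes` and the sample values are hypothetical.

```python
def decide_phonemes(frame_similarities, segments):
    """Pick one phoneme per segment from frame-level similarities.

    frame_similarities: list of dicts, one per frame, mapping phoneme -> L_i.
    segments: list of (start_frame, end_frame) pairs, end exclusive.
    Assumption: similarities are summed over the frames of a segment.
    """
    sequence = []
    for start, end in segments:
        totals = {}
        for frame in frame_similarities[start:end]:
            for phoneme, score in frame.items():
                totals[phoneme] = totals.get(phoneme, 0.0) + score
        if totals:  # skip empty segments rather than failing
            sequence.append(max(totals, key=totals.get))
    return sequence


# Four frames, two segments of two frames each (hypothetical values).
frame_similarities = [
    {"a": 2.0, "i": 1.0},
    {"a": 1.5, "i": 1.6},
    {"a": 0.2, "i": 3.0},
    {"a": 0.1, "i": 2.5},
]
phoneme_sequence = decide_phonemes(frame_similarities, [(0, 2), (2, 4)])
```

Pooling over the segment keeps a single noisy frame, such as the second one here, from flipping the decision for the whole interval.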

Finally, the phoneme sequence is matched against the word dictionary memory 29, and the word with the highest similarity is transferred to the output unit 30 as the recognition result.

In the above embodiment, speech whose content is known in advance is input before recognition and the standard patterns in the coefficient memory 25 are corrected on the basis of the result. The standard patterns in the coefficient memory 25 may of course also be corrected during recognition, on the basis of the recognition results.

In that case there is no need to learn from speech of known content beforehand, and the device can automatically follow changes in the environment, changes in the speaker's voice, and the like.

As described, this embodiment is a speech recognition device based on phoneme recognition, characterized by a learning function that builds the standard pattern of each phoneme to suit the user through simple learning in advance, which gives it high speech recognition performance. The calculation for learning is also very simple, so a new standard pattern can be built immediately without a special arithmetic circuit of high precision.

Fig. 3 compares the phoneme recognition rates with and without learning for ten adult male speakers. It shows the case 34 in which learning used all the evaluation words and the case 35 in which learning used a small set of about 20 words. In both cases the recognition rate is higher than in the case 33 without learning, and the effect is particularly large for speakers (NS, KS, SM, and others) whose recognition rates were previously extremely low.

Fig. 4 shows the standard deviation of the recognition rate for each phoneme. Compared with the case 41 without learning, the variation decreases both in the case 42 where learning used all words and in the case 43 where it used a small number of words, which improves the performance of the subsequent word matching stage.

This embodiment has the following effects.

Giving the speech recognition device a learning function automatically creates standard patterns suited to the user and yields good recognition accuracy with little variation due to environmental changes or individual differences between speakers.

Learning can be performed automatically before or during use by uttering a small amount of speech, and the standard patterns can be created very simply and quickly without any special equipment.

Effects of the Invention

In summary, the present invention provides a speech recognition device comprising: an acoustic analysis unit that calculates a spectrum or similar information (hereinafter called spectral information) from a speech signal; a coefficient storage unit that stores in advance standard patterns including at least the variance-covariance matrix and the mean values of spectral information obtained from standard speech signals of many speakers; a similarity calculation unit that obtains a similarity for each phoneme using the spectral information and the standard patterns; a word dictionary storage unit that stores a word dictionary written as similarities or phoneme sequences; an output unit that matches the similarities or phoneme sequence produced through the similarity calculation unit against the word dictionary and outputs the word with the highest similarity as the recognition result; and a learning unit that creates new mean values from the result of the output unit and the spectral information of the acoustic analysis unit and rewrites the contents of the coefficient storage unit on the basis of that result. The device greatly reduces the variation in recognition accuracy between speakers and has the advantage that it can be used stably with unspecified speakers.

[Brief explanation of drawings]

Fig. 1 is a block diagram of a conventional speech recognition device based on phoneme recognition. Fig. 2 is a block diagram of a speech recognition device according to one embodiment of the present invention. Fig. 3 shows the effect of the speech recognition device of the present invention for each speaker. Fig. 4 shows the effect of the speech recognition device of the present invention as the standard deviation for each phoneme.

21: AD converter; 22: signal processing circuit; 23: linear predictive analysis processor; 24: similarity calculation unit; 25: coefficient memory; 27: main memory; 28: main processor; 29: word dictionary memory; 30: output unit; 32: learning unit.

Claims

1. A speech recognition device comprising: an acoustic analysis unit that calculates a spectrum or similar information (hereinafter called spectral information) from a speech signal; a coefficient storage unit that stores in advance standard patterns including at least the variance-covariance matrix and the mean values of spectral information obtained from standard speech signals of many speakers; a similarity calculation unit that obtains a similarity for each phoneme using the spectral information and the standard patterns; a word dictionary storage unit that stores a word dictionary written as similarities or phoneme sequences; an output unit that matches the similarities or phoneme sequence produced through the similarity calculation unit against the word dictionary and outputs the word with the highest similarity as the recognition result; and a learning unit that creates new mean values from the result of the output unit and the spectral information of the acoustic analysis unit and rewrites the contents of the coefficient storage unit on the basis of that result.