JPS6048040B2

JPS6048040B2 - Learning processing method for individual differences in speech recognition

Info

Publication number: JPS6048040B2
Application number: JP52126072A
Authority: JP
Inventors: 貞煕古井
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: NTT Inc
Priority date: 1977-10-20
Filing date: 1977-10-20
Publication date: 1985-10-24
Also published as: JPS5459008A

Abstract

PURPOSE: To attain a high speech recognition rate, and thus to improve efficiency of use, by deriving the standard pattern of each phoneme from only part of the learning samples. CONSTITUTION: When a new speaker who was not used in calculating the conversion formula utters some of the learning words to be recognized, the average autocorrelation coefficients of the various phonemes in those words are calculated from the speech waves. These values are then converted into log cross-sectional area ratios. Using these area ratios and the conversion formula, values equivalent to the log cross-sectional area ratios converted from the correlation coefficients averaged over all words to be recognized are calculated, both for the phonemes contained in the learning words and for the phonemes not contained in them. In this way, a standard pattern for each phoneme is produced in accordance with individual differences.

Description

[Detailed Description of the Invention] The present invention relates to a learning processing method for individual differences in speech recognition, and in particular to a learning method for speech recognition in which, in a machine speech recognition device, standard patterns adapted to individual differences in speech waves are created efficiently from a small number of learning words (learning samples).

In general, when the meaning of uttered speech is recognized mechanically, the input speech is expressed in a suitable form, the similarity between that speech and standard patterns stored in advance is measured, and the content of the input speech is determined according to the degree of similarity. The frequency spectrum, or something similar to it, is generally used as the form for expressing the input speech and the standard patterns. However, even when the same content or the same words are uttered, the frequency spectrum differs from speaker to speaker, so the content of the input speech cannot be recognized with high accuracy if similarity is measured against standard patterns common to all speakers. Conventional methods have therefore either had each new speaker utter all the words to be recognized as a prescribed learning sample so that the standard patterns can be recreated, or had the speaker utter only some of the words to be recognized and estimated the standard patterns of phonemes not contained in the learning sample by means of a multidimensional conversion formula. With these methods, however, as the range of utterances to be recognized expands to, for example, several hundred words, the number of kinds of standard pattern grows accordingly, and uttering all of those words to recreate the standard patterns takes a great deal of effort and time. With the method that has the speaker utter only part of the vocabulary and estimates the standard patterns of phonemes not contained in the uttered speech (learning sample), the effect of the uttered content being biased relative to the whole recognition vocabulary cannot be removed, so the content of the input speech could not be recognized with high accuracy. This drawback is a very serious constraint on making machine speech recognition devices practical.

In order to eliminate the above drawbacks of the conventional examples, the object of the present invention is to provide an improved learning method for speech recognition in which standard patterns of the phonemes contained in a partial learning sample of the recognition vocabulary are provisionally created and, on the basis of these values, values corresponding to the standard patterns that would be obtained if all the words to be recognized were used as learning samples are calculated, both for phonemes contained in the learning sample and for phonemes not contained in it, and are used as the standard pattern of each phoneme.

Embodiments are described in detail below with reference to the drawing. The figure shows one embodiment of the present invention, in which 1 is a speech signal input terminal; 2 is a low-pass filter; 3 is an analog-to-digital conversion circuit; 4 is an autocorrelation coefficient calculation circuit; 5 is a phoneme interval segmentation circuit; 6 is an intra-segment autocorrelation coefficient averaging circuit, which averages the autocorrelation coefficients within one segment; 7 is a switch; 8a and 8b are terminals of switch 7; 9 is an all-word autocorrelation coefficient averaging circuit, which averages the autocorrelation coefficients corresponding to the same phoneme extracted from all words; 10 is an average correlation coefficient storage unit; 11 is a log cross-sectional area ratio calculation circuit; 12 is an all-word learning log cross-sectional area ratio storage unit; 13 is a partial-word autocorrelation coefficient averaging circuit, which averages the autocorrelation coefficients corresponding to the same phoneme extracted from a predetermined number of words, that is, the partial words; 14 is an average correlation coefficient storage unit; 15 is a log cross-sectional area ratio calculation circuit; 16 is a partial-word learning log cross-sectional area ratio storage unit; 17 is an autocorrelation coefficient averaging circuit, which averages the autocorrelation coefficients for the same phoneme extracted from all speakers; 18 is a log cross-sectional area ratio calculation circuit; 19 is an average log cross-sectional area ratio storage unit; 20 is a switch; 21a and 21b are terminals of switch 20; 22 is a normal matrix calculation circuit; 23 is a normal equation calculation circuit; 24 is a conversion formula storage unit; 25 is a standard pattern calculation circuit; 26 is a maximum likelihood spectral parameter calculation circuit; 27 is a standard pattern storage unit; and 28 is a weight storage unit.

Next, the operation of this embodiment is described.

First, switch 20 is connected to terminal 21a, and the standard pattern storage unit 27 is loaded with suitable values for the standard pattern of each phoneme, obtained empirically or calculated by analyzing the speech of a particular speaker. All the words to be recognized are then uttered as learning samples in a predetermined order. This analog waveform is applied to the speech signal input terminal 1. The low-pass filter 2 removes high-frequency components above, for example, 3.4 kHz, and the analog-to-digital conversion circuit 3 samples and quantizes the signal at, for example, 8 kHz and 11 bits, converting it into a digital waveform. The autocorrelation coefficient calculation circuit 4 then cuts out intervals of fixed duration at a period of, for example, 15 ms and calculates the autocorrelation coefficients up to a finite order, for example the 10th. When the speech waveform at the t-th time point within the cut-out interval (1 ≤ t ≤ N, where N is the number of sample points corresponding to the cut-out length) is denoted by x_t, the τ-th order autocorrelation coefficient is defined by the following equation.

[Equation not reproduced in the source text]

The phoneme interval segmentation circuit 5 uses these autocorrelation coefficients and the standard pattern of each phoneme stored in the standard pattern storage unit 27 to calculate the likelihood of each phoneme, and segments the uttered speech into phoneme intervals according to those values.
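The frame-wise analysis attributed to circuit 4 can be sketched as follows. This is only an illustration under stated assumptions: the defining equation is not legible in this text, so the sketch uses the common unnormalized short-time form r(τ) = Σ x_t·x_(t+τ); the function names and the `frame_len` parameter are hypothetical, and the frame length itself is not given in the description.

```python
def autocorrelation(frame, max_lag):
    """Return [r(0), ..., r(max_lag)] with r(tau) = sum_t x[t] * x[t + tau].

    Assumption: unnormalized short-time autocorrelation; the patent's own
    normalization (if any) is not recoverable from the text.
    """
    n = len(frame)
    return [
        sum(frame[t] * frame[t + tau] for t in range(n - tau))
        for tau in range(max_lag + 1)
    ]


def frame_autocorrelations(samples, frame_len, hop, max_lag):
    """Cut a fixed-length frame every `hop` samples and analyze each frame.

    With 8 kHz sampling and a 15 ms period, hop would be 120 samples.
    """
    return [
        autocorrelation(samples[start:start + frame_len], max_lag)
        for start in range(0, len(samples) - frame_len + 1, hop)
    ]
```

For the example values in the description (8 kHz sampling, 15 ms period, 10th order), a call might look like `frame_autocorrelations(samples, frame_len=240, hop=120, max_lag=10)`, where the 240-sample frame length is purely a placeholder.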

When the standard pattern (maximum likelihood spectral parameters) of phoneme j is denoted by A_j(τ) (0 ≤ τ ≤ τ_max), the likelihood of phoneme j is defined by the following equation.

[Equation not reproduced in the source text]

The correlation coefficients of each phoneme interval segmented in this way are sent to the intra-segment autocorrelation coefficient averaging circuit 6, where the average correlation coefficients are calculated for each phoneme interval.

Switch 7 is initially connected to terminal 8a, so that the average correlation coefficients of each phoneme interval are sent to the all-word autocorrelation coefficient averaging circuit 9, which calculates, for each phoneme, the average correlation coefficients extracted from all the words to be recognized. These values are stored in the average correlation coefficient storage unit 10. If the word currently being uttered is one that will later be used for learning as part of the recognition vocabulary (a learning word), switch 7 is connected to terminal 8b, and the average correlation coefficients of each phoneme interval are sent to the partial-word autocorrelation coefficient averaging circuit 13, which calculates, for each phoneme, the average correlation coefficients extracted from the learning words. These values are stored in the average correlation coefficient storage unit 14. When all the learning samples have been uttered in the predetermined order, the contents of the average correlation coefficient storage unit 10 are converted into log cross-sectional area ratios by the log cross-sectional area ratio calculation circuit 11 and stored in the all-word learning log cross-sectional area ratio storage unit 12.
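The two averaging paths (circuit 6 followed by circuits 9 and 13) can be sketched as below. The data layout, the function names, and the choice to give every phoneme segment equal weight in the per-phoneme mean are assumptions made for illustration; the description does not say whether averaging is weighted by segment length.

```python
from collections import defaultdict


def mean_vector(vectors):
    """Component-wise mean of equally long vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def average_per_phoneme(segments, learning_words):
    """Average autocorrelation vectors per phoneme.

    segments: iterable of (word, phoneme, frames), where frames is the list
    of autocorrelation vectors inside one phoneme interval (hypothetical layout).
    Returns (all_word_average, partial_word_average), each phoneme -> vector.
    """
    all_pool = defaultdict(list)
    partial_pool = defaultdict(list)
    for word, phoneme, frames in segments:
        segment_mean = mean_vector(frames)           # role of circuit 6
        all_pool[phoneme].append(segment_mean)       # role of circuit 9
        if word in learning_words:
            partial_pool[phoneme].append(segment_mean)  # role of circuit 13
    all_avg = {p: mean_vector(v) for p, v in all_pool.items()}
    partial_avg = {p: mean_vector(v) for p, v in partial_pool.items()}
    return all_avg, partial_avg
```

In this sketch a learning word contributes to both pools, which reflects the fact that the learning words are a subset of the recognition vocabulary.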

Similarly, the contents of the average correlation coefficient storage unit 14 are converted into log cross-sectional area ratios by the log cross-sectional area ratio calculation circuit 15 and stored in the partial-word learning log cross-sectional area ratio storage unit 16. The above operations are repeated for a prescribed large number of speakers, for example about 30, and the values for each speaker are stored separately in the average correlation coefficient storage unit 10, the all-word learning log cross-sectional area ratio storage unit 12, and the partial-word learning log cross-sectional area ratio storage unit 16.
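The conversion from averaged autocorrelation coefficients to log cross-sectional area ratios (circuits 11, 15 and 18) is not spelled out in the description. One standard route, shown here as a hedged sketch, is to obtain PARCOR (reflection) coefficients with the Levinson-Durbin recursion and then take g = log((1 - k) / (1 + k)). The sign convention of k and of the logarithm varies between texts and is an assumption here, as are the function names.

```python
import math


def parcor_from_autocorrelation(r, order):
    """Levinson-Durbin recursion; returns reflection coefficients k_1..k_order."""
    a = [1.0] + [0.0] * order   # prediction polynomial, a[0] = 1
    err = r[0]                  # prediction error energy
    ks = []
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err
        ks.append(k)
        updated = a[:]
        for j in range(1, i):
            updated[j] = a[j] + k * a[i - j]
        updated[i] = k
        a = updated
        err *= 1.0 - k * k
    return ks


def log_area_ratios(r, order):
    """Log area ratios under the assumed convention g = log((1 - k) / (1 + k))."""
    return [
        math.log((1.0 - k) / (1.0 + k))
        for k in parcor_from_autocorrelation(r, order)
    ]
```

The log area ratio is unbounded, unlike the PARCOR coefficient itself, which is confined to (-1, 1); that property is commonly cited as the reason it is convenient for linear operations such as the regression described later.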

When all the predetermined speakers have finished speaking, the average correlation coefficients of each phoneme for each speaker, stored in the average correlation coefficient storage unit 10, are sent to the autocorrelation coefficient averaging circuit 17, which calculates, for each phoneme, the average correlation coefficients over all speakers. These values are converted into log cross-sectional area ratios by the log cross-sectional area ratio calculation circuit 18 and stored in the average log cross-sectional area ratio storage unit 19. Next, in order to obtain conversion formulas that estimate every all-word learning log cross-sectional area ratio of every phoneme as a linear combination of all the log cross-sectional area ratios of all the phonemes obtained by partial-word learning, the contents of the all-word learning log cross-sectional area ratio storage unit 12 and of the partial-word learning log cross-sectional area ratio storage unit 16 are sent to the normal matrix calculation circuit 22, where a normal matrix is calculated for every combination of phonemes stored in the two storage units. For the s-th component (1 ≤ s ≤ τ_max) of phoneme i stored in the former storage unit and phoneme j stored in the latter, the elements φ_ij,s(m, n) (0 ≤ m ≤ τ_max, 0 ≤ n ≤ τ_max + 1) of the normal matrix Φ_ij,s are defined by the following equation.

[Equation not reproduced in the source text]

Here Z_jl(m) denotes the m-th component of the log cross-sectional area ratio of phoneme j for speaker l, and Z_jl(0) = 1. This normal matrix is the coefficient matrix of the simultaneous linear equations that determine the coefficients B_m(s) of the formula for obtaining the s-th component Z_il(s) of phoneme i by conversion from the components Z_jl(m) (0 ≤ m ≤ τ_max) of phoneme j stored in the partial-word learning log cross-sectional area ratio storage unit 16.

B_m(s) is then determined by the simultaneous linear equations whose coefficients are the elements φ_ij,s(m, n) of this normal matrix.

The values of the normal matrix are sent to the normal equation calculation circuit 23, which solves the simultaneous linear equations whose coefficients are the above φ_ij,s(m, n).

A method such as Gaussian elimination (the sweep-out method) is used for this solution. The resulting solution is denoted here by B_m(s) (0 ≤ m ≤ τ_max). The above calculation is carried out for every order from the 1st to the τ_max-th and for every combination of phonemes, and the results are stored in the conversion formula storage unit 24.
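The least-squares fit behind a single conversion formula (one target component s, one phoneme pair i, j) can be sketched as follows. Because the defining equation of the normal matrix is not legible in this text, the sketch assumes the usual augmented matrix of sums of products over speakers, with the constant term Z_jl(0) = 1 included, and solves it by Gauss-Jordan elimination as one example of a sweep-out method. All names are hypothetical.

```python
def normal_matrix(partial_vectors, targets):
    """Augmented normal matrix for least squares.

    partial_vectors[l] = [1, Z_jl(1), ..., Z_jl(tau_max)] for speaker l
    targets[l]         = Z_il(s), the all-word value for the same speaker
    Row m holds sum_l Z_jl(m) * Z_jl(n) for each n, then sum_l Z_jl(m) * Z_il(s).
    """
    dim = len(partial_vectors[0])
    rows = []
    for m in range(dim):
        row = [sum(z[m] * z[n] for z in partial_vectors) for n in range(dim)]
        row.append(sum(z[m] * t for z, t in zip(partial_vectors, targets)))
        rows.append(row)
    return rows


def solve_by_elimination(augmented):
    """Gauss-Jordan elimination with partial pivoting; returns B_0..B_tau_max."""
    n = len(augmented)
    a = [[float(v) for v in row] for row in augmented]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        p = a[col][col]
        a[col] = [v / p for v in a[col]]
        for r in range(n):
            if r != col:
                factor = a[r][col]
                a[r] = [v - factor * w for v, w in zip(a[r], a[col])]
    return [row[n] for row in a]
```

With about 30 speakers and 11 unknowns per formula (a constant plus ten components), the system is overdetermined, so the fit is a regression rather than an exact interpolation.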

At the same time, the contribution rate of {Z_jl(m)} (0 ≤ m ≤ τ_max) to Z_il(s) is calculated and stored in the weight storage unit 28.

The contribution rate W_ij(s) is defined by the following equation.

[Equation not reproduced in the source text]

Here Z̄_i(s) is the mean of Z_il(s) over all speakers.
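Since the defining equation is missing from this text, the exact form of the contribution rate cannot be confirmed. A common definition that matches the surrounding description (it uses the mean of Z_il(s) over speakers) is the coefficient of determination of the regression, and the sketch below assumes that form. It should be read as one plausible interpretation, not as the patent's formula; the function name is hypothetical.

```python
def contribution_rate(targets, predictions):
    """Assumed form: 1 - (residual sum of squares) / (total sum of squares).

    targets[l]     = Z_il(s) measured for speaker l (all-word learning)
    predictions[l] = value predicted from phoneme j by the conversion formula
    """
    mean = sum(targets) / len(targets)
    total = sum((t - mean) ** 2 for t in targets)
    residual = sum((t - p) ** 2 for t, p in zip(targets, predictions))
    return 1.0 - residual / total
```

Under this reading, a value near 1 means phoneme j predicts the component well across speakers, and a value near 0 means it does no better than the speaker-independent mean.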

Next, when the voice characteristics of a new speaker who was not used in the above processing are to be learned, switch 7 is connected to terminal 8b and switch 20 is connected to terminal 21b, and the speaker is asked to utter some of the words to be recognized, the learning words, in the same predetermined order as in the processing described above.

This speech wave is applied to the speech signal input terminal 1 as described above and is converted into a time series of autocorrelation coefficients by way of the low-pass filter 2, the analog-to-digital conversion circuit 3, and the autocorrelation coefficient calculation circuit 4. The phoneme interval segmentation circuit 5 then segments it into phoneme intervals, and the intra-segment autocorrelation coefficient averaging circuit 6 calculates the average correlation coefficients of each phoneme interval. These values are sent to the partial-word autocorrelation coefficient averaging circuit 13, which calculates, for each phoneme, the average correlation coefficients over all the learning words. The values are held temporarily in the average correlation coefficient storage unit 14 and, when all the learning words have been uttered, are converted into log cross-sectional area ratios by the log cross-sectional area ratio calculation circuit 15 and stored in the partial-word learning log cross-sectional area ratio storage unit 16. These values are then sent to the standard pattern calculation circuit 25, together with the contents of the conversion formula storage unit 24, the average log cross-sectional area ratio storage unit 19, and the weight storage unit 28. There, for every component of the log cross-sectional area ratio of each phoneme, a value corresponding to the value that would be obtained if all the words to be recognized were uttered is calculated. The following equation is used for this calculation.

[Equation (7) not reproduced in the source text]

Here Z_il is the target log cross-sectional area ratio vector of phoneme i for speaker l (Z_il = (Z_il(1), Z_il(2), ..., Z_il(τ_max))ᵀ); Z_jl is the log cross-sectional area ratio vector of phoneme j for speaker l stored in the partial-word learning log cross-sectional area ratio storage unit 16; Ω_L is the set of phonemes contained in the learning words; ψ_ij is the matrix formed by arranging in order the conversion formulas for the 1st to τ_max-th components of the log cross-sectional area ratio of phoneme i, for the combination of phoneme i when all words are uttered and phoneme j when only the learning words are uttered; and r is a weight, for which a value such as 0.5 is used empirically.

As Y_il, if phoneme i or a phoneme similar to it is contained in Ω_L, the log cross-sectional area ratio of that phoneme stored in the partial-word learning log cross-sectional area ratio storage unit 16 is used; if not, the log cross-sectional area ratio of phoneme i averaged over many speakers, stored in the average log cross-sectional area ratio storage unit 19, is used. Equation (7) above consists of two terms.

The first term uses the conversion matrices ψ_ij to relate the log cross-sectional area ratios Z_jl of the various phonemes extracted from the learning words to the log cross-sectional area ratio Z_il of each phoneme to be estimated. Rather than estimating Z_il simply as the mean of the estimates from the individual phonemes, it takes a weighted mean using the contribution rates W_ij (vectors), which correspond to the reliability of each estimate, and so increases the reliability of the estimated value. The second term is intended to stabilize the estimate: to avoid extreme fluctuations of the estimated value caused by estimation errors in the first term, a weighted mean is taken of the first-term estimate and either the log cross-sectional area ratio of the phoneme concerned extracted directly from the learning words or the mean over many speakers. The standard pattern Z_il of phoneme i for speaker l calculated in this way is sent to the maximum likelihood spectral parameter calculation circuit 26 and converted into maximum likelihood spectral parameters.
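The two-term estimate described above can be sketched as follows. Equation (7) itself is not legible in this text, so the form used here is an assumption reconstructed from the prose: a contribution-weighted mean of the per-phoneme predictions, blended with the fallback vector Y_il using the weight r. In particular, which of the two terms receives r and which receives 1 - r is a guess, and all names are hypothetical.

```python
def apply_formula(coefficients, z_partial):
    """Apply one conversion matrix: component s = B_0(s) + sum_m B_m(s) * Z_jl(m).

    coefficients[s] = [B_0(s), B_1(s), ..., B_tau_max(s)]
    z_partial       = [Z_jl(1), ..., Z_jl(tau_max)]
    """
    extended = [1.0] + list(z_partial)
    return [sum(b * z for b, z in zip(row, extended)) for row in coefficients]


def adapt_phoneme(formulas, weights, partial_lars, fallback, r=0.5):
    """Estimate the all-word log area ratio vector of one target phoneme.

    formulas[j], weights[j]: conversion matrix and contribution rates for
        learning-set phoneme j (relative to the target phoneme)
    partial_lars[j]: log area ratios of phoneme j from the new speaker
    fallback: Y_il, measured directly or taken from the multi-speaker mean
    Assumed form: r * weighted_mean(predictions) + (1 - r) * fallback.
    """
    dim = len(fallback)
    numerator = [0.0] * dim
    denominator = [0.0] * dim
    for j, z in partial_lars.items():
        prediction = apply_formula(formulas[j], z)
        for s in range(dim):
            numerator[s] += weights[j][s] * prediction[s]
            denominator[s] += weights[j][s]
    return [
        r * numerator[s] / denominator[s] + (1.0 - r) * fallback[s]
        for s in range(dim)
    ]
```

The blend with the fallback acts as a simple shrinkage step: even if every regression prediction is poor, the result cannot move further from Y_il than the fraction r allows.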

The maximum likelihood spectral parameters can be obtained by converting the log cross-sectional area ratios into PARCOR coefficients, converting these into linear prediction coefficients, and taking their sums of products; details of this method are given in reference (1) and elsewhere.

Reference (1): Itakura, "Communication by speech feature parameters," Joint Convention of the Four Institutes of Electrical Engineers, 1972 (Showa 47), 219.

The maximum likelihood spectral parameters of each phoneme are stored in the standard pattern storage unit 27 as standard patterns for word speech recognition.
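The conversion chain named above (log area ratio, then PARCOR, then linear prediction coefficients, then sums of products) can be sketched as follows. The sketch assumes the convention g = log((1 - k) / (1 + k)) for the log area ratio and interprets "sums of products" as the autocorrelation of the prediction coefficient sequence, A(τ) = Σ a_i·a_(i+τ); both are assumptions, since the description defers the details to reference (1). Function names are hypothetical.

```python
import math


def parcor_from_lar(lars):
    """Invert g = log((1 - k) / (1 + k)), which gives k = -tanh(g / 2)."""
    return [-math.tanh(g / 2.0) for g in lars]


def lpc_from_parcor(ks):
    """Step-up recursion: build a = [1, a_1, ..., a_p] from reflection coefficients."""
    a = [1.0]
    for k in ks:
        order = len(a)  # order of the polynomial being built in this step
        a = [1.0] + [a[j] + k * a[order - j] for j in range(1, order)] + [k]
    return a


def coefficient_correlation(a):
    """A(tau) = sum_i a[i] * a[i + tau] for tau = 0..p (assumed 'sum of products')."""
    return [
        sum(a[i] * a[i + tau] for i in range(len(a) - tau))
        for tau in range(len(a))
    ]


def max_likelihood_parameters(lars):
    """Full chain from log area ratios to the assumed spectral parameters."""
    return coefficient_correlation(lpc_from_parcor(parcor_from_lar(lars)))
```

Parameters of this form pair naturally with the short-time autocorrelation of the input, because a weighted sum of A(τ) and r(τ) gives the prediction residual energy without refitting a model for each input frame.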

With this configuration, standard patterns whose units are phonemes or the like can be adapted to the speaker, on the basis of the log cross-sectional area ratios of phonemes extracted from a small number of learning words, as values close to those that would be obtained if all the words to be recognized were uttered. Machine recognition of speech can therefore be carried out very efficiently and with a high recognition rate. As described above, according to the present invention, the general relationship between the spectrum (log cross-sectional area ratio) of each phoneme when a small number of learning words are uttered and the average spectrum (log cross-sectional area ratio) of each phoneme when all the words to be recognized are uttered is prepared as a conversion formula, and the values obtained by applying this conversion formula to the spectrum of each phoneme obtained when the small number of learning words are uttered are used as the standard pattern of each phoneme. A high recognition rate can therefore be obtained with a small number of learning words even in speech recognition covering a large vocabulary, and a speech recognition device that is very easy to use and inexpensive can be provided. The present invention thus provides a highly useful method for learning individual differences in speech recognition.

[Brief explanation of the drawing]

The figure is a block diagram of an embodiment of the present invention.
1: speech input terminal; 2: low-pass filter; 3: analog-to-digital converter; 4: autocorrelation coefficient calculation circuit; 5: phoneme interval segmentation circuit; 6: intra-segment autocorrelation coefficient averaging circuit; 7: switch; 8a, 8b: switch terminals; 9: all-word autocorrelation coefficient averaging circuit; 10: average correlation coefficient storage unit; 11: log cross-sectional area ratio calculation circuit; 12: all-word learning log cross-sectional area ratio storage unit; 13: partial-word autocorrelation coefficient averaging circuit; 14: average correlation coefficient storage unit; 15: log cross-sectional area ratio calculation circuit; 16: partial-word learning log cross-sectional area ratio storage unit; 17: autocorrelation coefficient averaging circuit; 18: log cross-sectional area ratio calculation circuit; 19: average log cross-sectional area ratio storage unit; 20: switch; 21a, 21b: switch terminals; 22: normal matrix calculation circuit; 23: normal equation calculation circuit; 24: conversion formula storage unit; 25: standard pattern calculation circuit; 26: maximum likelihood spectral parameter calculation circuit; 27: standard pattern storage unit; 28: weight storage unit.

Claims

[Claims]

1. A learning processing method for individual differences in speech recognition, characterized by comprising: means for calculating autocorrelation coefficients up to a finite order from a speech wave; means for segmenting the speech wave into phoneme intervals; means for averaging the autocorrelation coefficients within each segment, and a storage unit for storing them; means for calculating, for each phoneme, the average of the autocorrelation coefficients over all the words to be recognized, and a storage unit for storing them; means for calculating the average of the autocorrelation coefficients of each phoneme contained in some learning words of the recognition vocabulary, and a storage unit for storing them; means for converting the correlation coefficients into log cross-sectional area ratios; means for calculating, as a multidimensional conversion formula for all combinations of phonemes, the numerical relationship between the log cross-sectional area ratios converted from the correlation coefficients averaged over all the words to be recognized and the log cross-sectional area ratios converted from the correlation coefficients averaged over some learning words of the recognition vocabulary; a storage unit for storing this conversion formula; and means for calculating, using this conversion formula and on the basis of the log cross-sectional area ratios converted from the correlation coefficients averaged over some learning words of the recognition vocabulary, the log cross-sectional area ratios converted from the correlation coefficients averaged over all the words to be recognized; wherein, when a new speaker who was not used in calculating the conversion formula utters some of the learning words of the recognition vocabulary, the average correlation coefficients of the various phonemes in those words are calculated from the speech wave, these values are converted into log cross-sectional area ratios, and, using these values and the conversion formula, values equivalent to the log cross-sectional area ratios converted from the correlation coefficients averaged over all the words to be recognized are calculated both for phonemes contained in the learning words and for phonemes not contained in them, whereby standard patterns in units of phonemes are created adaptively to individual differences.