JPS58145999A

JPS58145999A - Recognition of voice

Info

Publication number: JPS58145999A
Application number: JP57029472A
Authority: JP
Inventors: 雅男渡; 誠赤羽; 俊彦和久; 久雄西岡
Original assignee: Sony Corp
Current assignee: Sony Corp
Priority date: 1982-02-25
Filing date: 1982-02-25
Publication date: 1983-08-31
Also published as: JPH0441357B2

Abstract

(57) [Abstract] This publication contains application data filed before electronic filing, so no abstract data is recorded.

Description

DETAILED DESCRIPTION OF THE INVENTION The present invention relates to speech recognition and proposes a novel method.

In speech recognition, word recognition for a specific speaker has already been put into practical use. In this approach the specific speaker utters every word to be recognized, and the acoustic parameters of each word are detected with a band-pass filter bank or the like and stored (registered). When the specific speaker later speaks, the acoustic parameters of the utterance are detected and compared with the registered acoustic parameters of each word; when they match, the utterance is recognized as that word.

In such a device, when the time axis of the speaker's utterance differs from that at registration, the time series of acoustic parameters, extracted at fixed intervals (5 to 20 msec), is expanded or contracted to align the time axes. This allows the device to cope with variations in speaking rate.
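The time-axis alignment described above can be illustrated with a minimal sketch. The source does not say which alignment method the prior-art devices use, so the linear resampling below is an assumption chosen for brevity, and the function name `stretch_to_length` is hypothetical.

```python
def stretch_to_length(frames, target_len):
    """Linearly resample a sequence of parameter vectors to target_len frames.

    Assumption: plain linear interpolation stands in for whatever
    time-axis expansion/contraction a real device would use.
    """
    if target_len < 1 or not frames:
        raise ValueError("need at least one frame and a positive target length")
    if len(frames) == 1 or target_len == 1:
        return [list(frames[0]) for _ in range(target_len)]
    scale = (len(frames) - 1) / (target_len - 1)
    out = []
    for k in range(target_len):
        pos = k * scale                      # fractional position in the input
        lo = int(pos)
        hi = min(lo + 1, len(frames) - 1)
        frac = pos - lo
        out.append([(1 - frac) * a + frac * b
                    for a, b in zip(frames[lo], frames[hi])])
    return out
```

After stretching, an input utterance and a registered template have the same number of frames and can be compared frame by frame.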

With this device, however, the acoustic parameters of the entire word must be registered and stored in advance for every word to be recognized, which requires an enormous storage capacity and amount of computation. The number of recognizable words was therefore limited.

In contrast, recognition in units of phonemes (for Japanese written in Roman letters: A, I, U, E, O, K, S, T, and so on) or syllables (KA, KI, KU, and so on) has been proposed. In that case, however, although phonemes with a quasi-stationary portion, such as vowels, are easy to recognize, it is extremely difficult to identify phonemes whose phonemic features are very short, such as plosives (K, T, P, and so on), as a single phoneme from acoustic parameters alone.

Conventionally, therefore, speech uttered discretely syllable by syllable is registered, and discretely uttered input speech is time-aligned in the same way as in word recognition before recognition is performed. Because this requires a special manner of speaking, it could be used only in limited applications.

Furthermore, when unspecified speakers are the recognition target, the acoustic parameters show a large variance due to individual differences, and recognition cannot be achieved by time-axis alignment alone as described above. Methods have therefore been proposed in which, for example, several sets of acoustic parameters are registered for each word and the closest set is recognized, or in which the entire word is converted into parameters of a fixed dimension and discriminated with a discriminant function. All of these, however, require an enormous storage capacity or a large amount of computation, and the number of recognizable words becomes extremely small.

In view of these points, the present invention proposes a novel speech recognition method that enables easy and reliable speech recognition even for unspecified speakers. An embodiment of the present invention is described below with reference to the drawings.

Observation of how phonemes are uttered shows that phonemes such as vowels and fricatives (S, H, and so on) can be drawn out when spoken. Consider, for example, the utterance "hai" ("yes"). As shown in FIG. 1A, this utterance changes as "silence → H → A → I → silence". The same "hai" can also be uttered as shown in FIG. 1B. Here the lengths of the quasi-stationary portions of H, A, and I change from utterance to utterance, and this causes the variation along the time axis. It was found, however, that the transition portions between phonemes (shown hatched) vary comparatively little along the time axis.

The present inventors focused on this point.

In FIG. 2, a speech signal supplied to a microphone (1) is fed through a microphone amplifier (2) and a low-pass filter (3) passing 5.5 kHz and below to an A/D conversion circuit (4). A sampling clock of 12.5 kHz (80 µsec interval) from a clock generator (5) is supplied to the A/D conversion circuit (4), and at this timing the speech signal is converted into digital signals each having a predetermined number of bits (= 1 word). The converted speech signal is supplied to a register (6) of 5 × 64 words. A frame clock with a 5.12 msec interval from the clock generator (5) is supplied to a quinary counter (7), and its count value is supplied to the register (6), so that the speech signal is shifted 64 words at a time and a shifted signal of 4 × 64 words is taken out of the register (6).
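The framing performed by the register and counter can be sketched in software as a sliding window. This is a simplified illustration, not the circuit itself: the 5 × 64 word register is modelled as plain slicing, and the names `frames`, `HOP` and `FRAME_LEN` are hypothetical. The arithmetic also shows how the 80 µsec sampling interval and the 5.12 msec frame interval fit together.

```python
SAMPLE_RATE_HZ = 12_500          # sampling clock
HOP = 64                         # words shifted per frame clock
FRAME_LEN = 4 * HOP              # 256 words handed to the FFT stage

def frames(samples):
    """Yield overlapping 256-sample frames, advancing 64 samples each time."""
    for start in range(0, len(samples) - FRAME_LEN + 1, HOP):
        yield samples[start:start + FRAME_LEN]

sample_period_us = 1_000_000 / SAMPLE_RATE_HZ      # 80 microseconds
frame_period_ms = HOP * sample_period_us / 1000    # 64 samples per frame clock
```

Each frame therefore overlaps the previous one by 192 samples, and a new frame becomes available every 64 samples.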

The 4 × 64 = 256 word signal taken out of the register (6) is supplied to a fast Fourier transform (FFT) circuit (8). In the FFT circuit (8), let the waveform function represented by the n_f sampled data contained in a time length T be

$$U_{n_f T}(t) \tag{1}$$

Fourier-transforming this function yields the signal

$$U_{n_f T}(f) = U_{1 n_f T}(f) + j\,U_{2 n_f T}(f) \tag{2}$$

The signal from the FFT circuit (8) is supplied to a power spectrum detection circuit (9), from which the power spectrum signal

$$|U_{n_f T}(f)|^2 = U_{1 n_f T}^2(f) + U_{2 n_f T}^2(f) \tag{3}$$

is taken out. Because the Fourier-transformed signal is symmetric on the frequency axis, half of the n_f data obtained by the Fourier transform are redundant. This half is therefore discarded and n_f/2 data are taken out. That is, the 256 word signal supplied to the FFT circuit (8) is converted and output as a 128 word power spectrum signal.
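The power spectrum step can be sketched as follows. For self-containment the sketch uses a direct DFT rather than a fast Fourier transform; the result is the same, only slower. The function name `power_spectrum` is hypothetical.

```python
import cmath

def power_spectrum(frame):
    """Return the squared magnitude of the first half of the DFT bins.

    For a real input the second half mirrors the first, so only
    len(frame) // 2 values are kept (256 samples -> 128 values).
    """
    n = len(frame)
    spectrum = []
    for k in range(n // 2):
        acc = sum(x * cmath.exp(-2j * cmath.pi * k * i / n)
                  for i, x in enumerate(frame))
        spectrum.append(acc.real ** 2 + acc.imag ** 2)   # real^2 + imag^2
    return spectrum
```

Feeding a pure cosine at bin 2 concentrates the power in element 2 of the result, which is a quick way to check the indexing.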

This power spectrum signal is supplied to an emphasis circuit (10), where it is weighted to apply a correction for auditory perception.

As the weighting, for example, a correction that boosts the high-frequency components is applied.

The weighted signal is supplied to a band division circuit (11) and divided into, for example, 32 bands according to a mel frequency scale matched to auditory characteristics. Where a band boundary does not coincide with a division point of the power spectrum, the signal is apportioned between the adjacent bands, so that a signal corresponding to the amount of signal in each band is taken out. In this way the 128 word power spectrum signal is compressed to 32 words while its acoustic features are preserved.
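The proportional apportioning at band boundaries can be sketched as below. The source does not give its mel-scale formula, so the common approximation 2595·log10(1 + f/700) is used here as an assumption, as is the 6250 Hz upper limit (half of a 12.5 kHz sampling rate). All function names are hypothetical.

```python
import math

def mel(hz):
    # Assumed mel formula; the source does not state which one it uses.
    return 2595.0 * math.log10(1.0 + hz / 700.0)

def inv_mel(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_edges(n_bins, n_bands, max_hz):
    """Band edges in fractional bin units, equally spaced on the mel scale."""
    top = mel(max_hz)
    return [inv_mel(top * b / n_bands) / max_hz * n_bins
            for b in range(n_bands + 1)]

def apportion(power, edges):
    """Sum power per band; a bin straddling an edge is split by overlap."""
    bands = []
    for lo, hi in zip(edges, edges[1:]):
        total = 0.0
        for k in range(int(math.floor(lo)), min(len(power), int(math.ceil(hi)))):
            overlap = min(hi, k + 1) - max(lo, k)   # share of bin k in this band
            if overlap > 0:
                total += power[k] * overlap
        bands.append(total)
    return bands
```

For example, `apportion(power, mel_edges(128, 32, 6250.0))` reduces 128 spectrum values to 32 band values while keeping the total power (up to rounding).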

This signal is supplied to a logarithm circuit (12) and converted into the logarithm of each value. This removes the redundancy introduced by, for example, the weighting in the emphasis circuit (10). The logarithmic power spectrum

$$\log |U_{n_f T}(f)|^2 \tag{4}$$

is referred to as the spectral parameter x(i) (i = 0, 1, ..., 31).

This spectral parameter x(i) is supplied to a discrete Fourier transform (DFT) circuit (13). In the DFT circuit (13), if the number of divided bands is M, the M-dimensional spectral parameter x(i) (i = 0, 1, ..., M−1) is regarded as a real symmetric parameter of 2M points and a DFT is performed. Accordingly,

$$X(m) = \sum_{i=0}^{2M-1} x(i)\, W_{2M}^{mi} \tag{5}$$

where

$$W_{2M}^{mi} = \exp\left(-j\,\frac{2\pi m i}{2M}\right), \qquad m = 0, 1, \ldots, 2M-1.$$

Because the function subjected to this DFT is regarded as an even function,

$$W_{2M}^{mi} = \cos\frac{2\pi m i}{2M} = \cos\frac{\pi m i}{M},$$

and from these

$$X(m) = \sum_{i=0}^{2M-1} x(i)\cos\frac{\pi m i}{M}. \tag{6}$$

This DFT extracts acoustic parameters that represent the envelope characteristics of the spectrum.

From the spectral parameter x(i) transformed by the DFT in this way, the values of the P dimensions from order 0 to P−1 (for example P = 8) are taken out and used as the local parameters L(p) (p = 0, 1, ..., P−1):

$$L(p) = \sum_{i=0}^{2M-1} x(i)\cos\frac{\pi i p}{M} \tag{7}$$

Taking into account that the spectral parameter is symmetric and setting

$$x(i) = x(2M-i-1) \tag{8}$$

the local parameter L(p) becomes

$$L(p) = \sum_{i=0}^{M-1} x(i)\left\{\cos\frac{\pi i p}{M} + \cos\frac{\pi (2M-i-1)\, p}{M}\right\} \tag{9}$$

where p = 0, 1, ..., P−1. In this way the 32 word signal is compressed to P (for example 8) words.
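The compression from M band values to P local parameters can be sketched as a cosine sum over the mirrored sequence. The sketch assumes the mirror relation x(i) = x(2M − i − 1) and the cosine kernel cos(π·i·p/M); both are read from the partly illegible formulas and should be treated as assumptions. The name `local_parameters` is hypothetical.

```python
import math

def local_parameters(x, P=8):
    """Return L(0..P-1) for an M-point log spectrum x.

    x is extended to 2M points by mirroring, then the low-order
    cosine terms are kept as a description of the spectral envelope.
    """
    M = len(x)
    sym = list(x) + list(reversed(x))        # sym[i] == sym[2M - 1 - i]
    return [sum(v * math.cos(math.pi * i * p / M) for i, v in enumerate(sym))
            for p in range(P)]
```

A flat spectrum puts everything into L(0) and leaves the higher orders at zero, which matches the idea that the low orders describe the envelope shape.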

The local parameters L(p) are supplied to a memory device (14). The memory device (14) consists of storage sections of P words per row arranged in a matrix of, for example, 16 rows. The local parameters L(p) are stored in sequence, dimension by dimension, and the frame clock with a 5.12 msec interval from the clock generator (5) is supplied so that the parameters in each row are shifted sequentially in the lateral direction.

As a result, the memory device (14) holds 16 frames (81.92 msec) of the P-dimensional local parameters L(p) at 5.12 msec intervals, and these are updated with new parameters at every frame clock.

In addition, the weighted signal from the emphasis circuit (10) is supplied to a band division circuit (21) and, as above, divided into N (for example 20) bands according to the mel scale, and signals V(n) (n = 0, 1, ..., N−1) corresponding to the amount of signal in each band are taken out. These signals are supplied to a biased logarithm circuit (22) to form

$$v'(n) = \log\bigl(V(n) + B\bigr) \tag{10}$$

The signals V(n) are also supplied to an accumulator circuit (23), and from the accumulated value V_a

$$v'_a = \log\bigl(V_a + B\bigr) \tag{11}$$

is formed. These signals are then supplied to an arithmetic circuit (24) to form

$$v(n) = v'_a - v'(n) \tag{12}$$

By using the signal v(n) formed in this way, the change from one phoneme to another is of about the same magnitude in each order (n = 0, 1, ..., N−1), so that variation in the amount of change depending on the kind of phoneme is avoided.

Moreover, because the normalized parameter v(n) is formed by taking logarithms before the arithmetic operation, fluctuation of the parameter v(n) caused by changes in the level of the input speech is eliminated. Furthermore, because the bias B is added before the operation, sensitivity to minute components of the input speech (noise and the like) can be lowered, as is clear from the fact that the parameter v(n) → 0 if B → ∞.
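The two properties claimed for the normalized parameter, insensitivity to input level and reduced sensitivity to small components as the bias grows, can be checked with a short sketch. The source does not legibly define the accumulated value, so taking it as the mean of the band signals is an assumption, and the name `normalized_band_params` is hypothetical.

```python
import math

def normalized_band_params(V, bias):
    """Return the log of the (assumed) band mean minus the log of each band.

    V    : band signals, one value per band
    bias : the constant B added before taking logarithms
    """
    mean_level = sum(V) / len(V)             # assumption: accumulator output is the mean
    log_mean = math.log(mean_level + bias)
    return [log_mean - math.log(v + bias) for v in V]
```

With `bias=0.0`, scaling every band by the same factor leaves the output unchanged; with a very large bias, every output approaches zero.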

This parameter v(n) is supplied to a memory device (25), in which 2w + 1 (for example 9) frames are stored.

The stored signals are supplied to an arithmetic circuit (26), which forms, for each band n, a value Y_{n,t} from the values of v(n) over the frame window GF_N (equation (13); the operator applied over the window is illegible in the source), where

$$GF_N = \{\, I \;;\; -w + t \le I \le w + t \,\}.$$

This signal and the parameter v(n) are supplied to an arithmetic circuit (27), which forms the parameter T(t) of equation (14) (illegible in the source). T(t) is the transition-point detection parameter; it is supplied to a peak discrimination circuit (28), which detects the transition points between phonemes in the input speech signal.
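Because the formulas for the window value and for T(t) are not legible in the source, the following is only one plausible reading, stated as an assumption: the window value is taken to be the per-band minimum over the 2w + 1 frames, and T(t) the summed excess over that minimum. The name `transition_parameter` is hypothetical.

```python
def transition_parameter(v, w=4):
    """Return T[t] for frames with a full window of 2w + 1 frames.

    v[t][n] is the normalized band parameter of frame t, band n.
    Assumption: the per-band reference is the window minimum and
    T(t) sums how far each value in the window lies above it.
    """
    n_frames, n_bands = len(v), len(v[0])
    T = [0.0] * n_frames
    for t in range(w, n_frames - w):
        window = range(t - w, t + w + 1)
        total = 0.0
        for n in range(n_bands):
            floor = min(v[i][n] for i in window)          # assumed window value
            total += sum(v[i][n] - floor for i in window)
        T[t] = total
    return T
```

Under this reading T(t) stays at zero while the spectrum is steady and rises only when the window spans a change, which is the behaviour the text attributes to the detection parameter.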

Because the parameter T(t) is defined over w frames on each side of frame t, there is no risk of spurious irregularities or multiple peaks. FIG. 3 shows, for example, the case in which the utterance "zero" is processed as 12-bit digital data at a sampling frequency of 12.5 kHz, a 256-point FFT is performed at a frame period of 5.12 msec, and the above detection is carried out with the number of bands N = 20, the bias B = 0, and the number of detection frames 2w + 1 = 9. In the figure, A is the speech waveform, B the phonemes, and C the detection signal, which produces pronounced peaks at each of the transitions "silence → Z", "Z → E", "E → R", "R → O", and "O → silence". Some irregularities due to noise appear in the silent portions, but these become nearly 0, as shown by the broken line, when the bias B is increased.

The transition-point detection signal T(t) is supplied to the memory device (14), and the memory device (14) is read out at the moment when the local parameters L(p) corresponding to the timing of this detection signal have been shifted to the eighth row. In reading out the memory device (14), 16 frames of signal are read out in the lateral direction for each dimension p. The signals read out are supplied to a DFT circuit (15).

In this circuit (15), a DFT is performed in the same way as above, and the envelope characteristics of the time-series variation of the acoustic parameters are extracted. From the DFT output, the values of the Q dimensions from order 0 to Q−1 (for example Q = 3) are taken out.

This DFT is performed for each dimension p, forming in total P × Q (= 24) words of transition-point parameters K(p, q) (p = 0, 1, ..., P−1; q = 0, 1, ..., Q−1). Since K(0, 0) is a constant, q may instead be taken as 1 to Q for p = 0.
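The second transform, taken along the time axis for each dimension, can be sketched as follows. The text only says the DFT is done "in the same way", so reusing a mirrored cosine sum for the time direction is an assumption, and the names `cosine_orders` and `transition_parameters` are hypothetical.

```python
import math

def cosine_orders(seq, n_orders):
    """Low-order cosine terms of seq after mirroring it to twice its length."""
    n = len(seq)
    sym = list(seq) + list(reversed(seq))
    return [sum(v * math.cos(math.pi * i * q / n) for i, v in enumerate(sym))
            for q in range(n_orders)]

def transition_parameters(block, Q=3):
    """block[t][p]: frames x local parameters around a transition point.

    Returns K[p][q]: for each dimension p, the first Q orders of the
    transform of that dimension's time series.
    """
    P = len(block[0])
    return [cosine_orders([frame[p] for frame in block], Q) for p in range(P)]
```

With 16 frames of 8 local parameters and Q = 3, the result holds 8 × 3 = 24 values, matching the parameter count given in the text.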

That is, in FIG. 4, when transition points such as B have been detected for an input speech signal (HAI) such as A, the overall power spectrum of this signal is as shown at C. If the power spectrum at, for example, the "H → A" transition point is as shown at D, this signal is emphasized to become E and compressed on the mel scale to become F. This signal is subjected to the DFT to become G, the 16 frames before and after it are arranged in a matrix as shown at H, and this signal is then subjected to the DFT in the direction of the time axis t, one dimension after another, to form the transition-point parameters K(p, q).

The transition-point parameters K(p, q) are supplied to a Mahalanobis distance calculation circuit (16), and cluster coefficients from a memory device (17) are also supplied to the circuit (16), where the Mahalanobis distance to each cluster is calculated.

The cluster coefficients are obtained by extracting transition-point parameters, in the same way as above, from utterances by a number of speakers, classifying them according to the phonemes involved, and analyzing them statistically.

The calculated Mahalanobis distances are supplied to a decision circuit (18), which determines from which phoneme to which phoneme the detected transition point is a transition, and the result is delivered to an output terminal (19).

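That is, for the 12 words "hai" (yes), "iie" (no), and "0 (zero)" through "9 (kyuu)", for example, the speech of a large number of speakers (100 or more) is supplied to the device in advance, the transition points are detected, and the transition-point parameters are extracted. These transition-point parameters are classified into a table such as that shown in FIG. 5 and analyzed statistically for each class (cluster). In the figure, * denotes silence.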
(Kieu) For 1012 words, the voices of a large number of speakers (more than 100 Å) are supplied in advance to the above-mentioned device, a transition point is detected, and a transition point parameter is extracted. The transient point parameters are classified into a table as shown in FIG. 5, for example, and statistically analyzed into # (cluster). * in the figure indicates silence.

For these transition-point parameters, let an arbitrary sample be R^{(a)}_{r,n} (r = 1, 2, ..., 24), where a is the cluster index (for example, a = 1 corresponds to * → H and a = 2 to H → A) and n is the speaker number. The covariance matrix

$$A^{(a)}_{r,s} = E\left[\bigl(R^{(a)}_{r,n} - \bar{R}^{(a)}_{r}\bigr)\bigl(R^{(a)}_{s,n} - \bar{R}^{(a)}_{s}\bigr)\right] \tag{15}$$

is computed, where

$$\bar{R}^{(a)}_{r} = E\bigl(R^{(a)}_{r,n}\bigr)$$

and E denotes the ensemble average, and its inverse matrix

$$B^{(a)}_{r,s} = \bigl(A^{(a)}_{t,u}\bigr)^{-1}_{r,s} \tag{16}$$

is obtained.

The distance between an arbitrary transition-point parameter K_r and cluster a is then given by the Mahalanobis distance

$$D(K_r, a) = \sum_{r}\sum_{s}\bigl(K_r - \bar{R}^{(a)}_{r}\bigr)\, B^{(a)}_{r,s}\, \bigl(K_s - \bar{R}^{(a)}_{s}\bigr) \tag{17}$$

Accordingly, by obtaining B^{(a)}_{r,s} and \bar{R}^{(a)}_{r} in advance and storing them in the memory device (17), the Mahalanobis distance calculation circuit (16) can calculate the Mahalanobis distance to the transition-point parameters of the input speech.
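The distance computation can be sketched as a quadratic form with a stored mean vector and inverse covariance matrix per cluster. The sketch returns the squared form, consistent with the later use of a square root; estimating and inverting the covariance matrix is assumed to have been done beforehand. The names `mahalanobis_sq` and `nearest_cluster`, and the cluster labels in the usage note, are hypothetical.

```python
def mahalanobis_sq(k, mean, inv_cov):
    """Squared Mahalanobis distance of vector k from a cluster.

    mean    : cluster mean vector
    inv_cov : inverse covariance matrix of the cluster (list of rows)
    """
    d = [a - b for a, b in zip(k, mean)]
    size = len(d)
    return sum(d[r] * inv_cov[r][s] * d[s]
               for r in range(size) for s in range(size))

def nearest_cluster(k, clusters):
    """clusters maps a label to (mean, inv_cov); returns (label, distance)."""
    return min(((label, mahalanobis_sq(k, mean, inv_cov))
                for label, (mean, inv_cov) in clusters.items()),
               key=lambda item: item[1])
```

For instance, with clusters labelled `"*>H"` and `"H>A"`, `nearest_cluster(k, clusters)` names the transition whose training statistics best match the observed 24-value vector.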

In this way, for each transition point of the input speech, the circuit (16) delivers the minimum distance to the clusters and the order of the transition point. These are supplied to the decision circuit (18), and the recognition decision is made at the point when the input speech becomes silent.

For example, for each word, a word distance is obtained as the mean of the square roots of the minimum distances between each transition-point parameter and the clusters.

To allow for the possibility that some transition points are missed, the word distance for each word is obtained for several types that assume such omissions. Candidates in which the order of the transition points differs from the table are, however, rejected. The word with the smallest word distance is then taken as the recognition result.
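The word-level decision can be sketched as follows. The data layout is an assumption made for illustration: each word carries one or more variants (the full transition sequence plus versions with omitted transitions), and a variant whose length or order does not fit the observed transitions is simply not scored. The names `recognize`, `vocabulary` and the transition labels are hypothetical.

```python
import math

def recognize(distances, vocabulary):
    """Pick the word whose transition sequence best fits the input.

    distances  : one dict per detected transition, mapping a cluster
                 label to the squared distance for that transition
    vocabulary : maps a word to a list of variants, each variant being
                 the expected cluster labels in time order
    Returns (word, word_distance); word is None if nothing fits.
    """
    best = (None, math.inf)
    for word, variants in vocabulary.items():
        for variant in variants:
            if len(variant) != len(distances):
                continue                      # sequence does not fit: rejected
            score = sum(math.sqrt(d.get(label, math.inf))
                        for d, label in zip(distances, variant)) / len(variant)
            if score < best[1]:
                best = (word, score)
    return best
```

The score is the mean of the square roots of the per-transition distances, so a word is favoured when every one of its transitions lies close to the matching cluster.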

Speech recognition is performed in this way. According to the present invention, because the phoneme changes at the transition points of speech are detected, the result is unaffected by time-axis variation, and good recognition can be performed for unspecified speakers.

In addition, because the parameters are extracted at the transition points as described above, one transition point can be recognized with, for example, 24 dimensions, and recognition can be performed extremely easily and accurately.

When the above device was trained with 120 speakers and an experiment was conducted on the above 12 words with speakers other than these 120, an average recognition rate of 96.5% was obtained.

Furthermore, in the above example, the "H → A" of "hai" and the "H → A" of "8 (hachi)" can be classified into the same cluster. Therefore, if α is the number of phonemes of the language to be recognized, then by computing ${}_{\alpha}C_{2}$ clusters in advance and storing the cluster coefficients in the memory device (17), the method can be applied to the recognition of various words, and a large vocabulary can be recognized easily.

[BRIEF DESCRIPTION OF THE DRAWINGS] FIG. 1 is a diagram for explaining speech, FIG. 2 is a system diagram of an example of the present invention, and FIGS. 3 to 5 are diagrams for explaining it.

(1) is a microphone, (3) a low-pass filter, (4) an A/D conversion circuit, (5) a clock generator, (6) a register, (7) a counter, (8) a fast Fourier transform circuit, (9) a power spectrum detection circuit, (10) an emphasis circuit, (11) a band division circuit, (12) a logarithm circuit, (13) and (15) discrete Fourier transform circuits, (14) and (17) memory devices, (16) a Mahalanobis distance calculation circuit, (18) a decision circuit, (19) an output terminal, and (21) to (28) circuits for transition-point detection.

Procedural Amendment, May 25, 1983 (Showa 58)

1. Indication of the case: Patent Application No. 29472 of 1982 (Showa 57)
2. Title of the invention: Speech recognition method
3. Person making the amendment: Relationship to the case: patent applicant. Address: 6-7-35 Kitashinagawa, Shinagawa-ku, Tokyo. Name: (218) Sony Corporation. Representative director: Norio Ohga.
6. Number of inventions increased by the amendment: (illegible in the source)

Contents of the amendment:

(1) In the specification, page 7, line 9: "2M points" is corrected to "2M−1 points".
(2) Same page, line 10: "a DFT is performed" is corrected to "a DFT of 2M−2 points is performed".
(3) Same page, lines 11 to 14: equation (5) and "m = 0, 1, ..., 2M−1" are corrected; the corrected equation is illegible in the source, and the corrected range is "m = 0, 1, ..., 2M−3".
(4) Page 8, lines 1 to 2: the expression for W is corrected (corrected expression illegible in the source).
(5) Same page, line 4: the expression for X(m) is corrected (corrected expression illegible in the source).
(6) Same page, lines 11 to 15: the expression for L(p) is corrected (corrected expression illegible in the source).
(7) Page 9, line 2: "x(i) = x(2M−i−1) ... (8)" is corrected to "x(i) = x(2M−i−2)".
(8) Same page, line 4: correction illegible in the source.
(9) Page 10, line 10: correction illegible in the source.
(10) Page 11: a symbol is corrected (both forms illegible in the source).
(11) Same page, line 17: correction illegible in the source.
(12) Page 13, line 11: "is a constant" is corrected to "expresses the power of the speech waveform, so for power normalization".
(13) Page 14, lines 6, 7, and 8: the miswritten term for "cluster coefficient" is corrected in each place.
(14) Page 17, line 6: "96.5%" is corrected (corrected value illegible in the source).
(15) Same page: "${}_{\alpha}C_{2}$" is corrected to "about ${}_{\alpha}P_{2}$".

End.

Claims

[Claims]

A speech recognition method characterized by having means for detecting transition portions between phonemes, including silence, extracting speech of a predetermined length at each detected transition portion and converting it into parameters, and using these parameters as the basic unit of recognition.