JPH0376475B2

Info

Publication number: JPH0376475B2
Application number: JP57146408A
Authority: JP
Inventors: Hidenori Shinoda; Yoichi Takebayashi; Tomio Sakata
Original assignee: Tokyo Shibaura Electric Co Ltd
Current assignee: Toshiba Corp
Priority date: 1982-08-24
Filing date: 1982-08-24
Publication date: 1991-12-05
Also published as: JPS5936300A

Description

[Detailed description of the invention]

[Technical field of the invention]

The present invention relates to a speech recognition device that can effectively recognize discretely uttered word speech.

[Technical background of the invention and its problems]

When recognizing discretely uttered word speech, detecting the speech section in which the word speech exists within the input speech signal sequence is a very important preprocessing step. Conventionally, the speech section has generally been detected from changes in the energy of the input speech. This detection method has the advantage of being very simple. However, in a speech input environment where much noise is added to the uttered word speech, it has been extremely difficult to remove the noise and recognize the word speech stably. The reason is that when noise occurs close to the section in which the target speech exists, energy alone cannot distinguish the word speech from the noise, and the noise is taken into the recognition process as if it were part of the speech. Various processing schemes, such as endpoint-free DP matching, have been proposed to overcome this problem, but they require an enormous amount of recognition processing and are therefore of little practical use.

[Purpose of the invention]

The present invention has been made in view of these circumstances. Its object is to provide a simple and highly practical speech recognition device that can stably recognize discretely uttered word speech without being disturbed by noise.

[Summary of the invention]

The present invention acoustically analyzes an input speech signal, extracts from the analysis result, for example, its energy change and phonemic features, and detects start candidate points and end candidate points of the speech signal according to these features. Then, for each of the plurality of speech candidate sections obtained from all possible combinations of the start candidate points and end candidate points, the features needed for recognizing the speech signal are extracted by resampling, and speech recognition is performed according to these resampled features.

[Effect of the invention]

According to the present invention, therefore, speech recognition is performed separately on the features resampled from each of the plurality of speech candidate sections, and the most reliable recognition result among them is extracted. This discards the information from speech candidate sections that were detected with noise components included, so that stable speech recognition can be achieved in a simple manner.

[Embodiments of the invention]

An embodiment of the present invention is described below with reference to the drawings.

FIG. 1 is a schematic configuration diagram of the device of the embodiment. Discretely uttered word speech is converted from sound to an electrical signal in the speech input unit 1, amplified to an appropriate signal level, and then A/D converted and taken in. The acoustic analysis unit 2 performs acoustic analysis by passing the speech signal received through the speech input unit 1 through digital band-pass filters and decomposing it into a spectrum at each predetermined frame period. The feature extraction unit 3 receives the spectrally decomposed speech signal data and processes it frame by frame to extract the features of the speech signal. That is, the feature extraction unit 3 extracts the phonemic features of each frame. For example, it attaches a label indicating the vowel type to a vowel frame and a label indicating the nasal type to a nasal frame, and labels other consonants as plosive, voiced or unvoiced, silent, fricative, and so on. This yields a speech feature time series consisting of the sequence of labels attached to the frames of the input speech signal.

The vowel and nasal labels are determined, for example, by calculating the similarity between the spectral pattern data obtained for each frame and the standard spectral patterns registered in advance in a dictionary for each type of vowel and nasal. The consonant types are determined, for example, by examining the rough shape of the spectral pattern obtained for each frame. That is, if the spectral pattern increases monotonically with frequency, the frame is judged to be fricative, and if the spectral pattern is bell-shaped, with a raised central portion, the frame is judged to be plosive.
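The spectral-shape rule for consonants described above (a spectrum that rises monotonically with frequency is treated as fricative, a bell-shaped spectrum as plosive) could be sketched roughly as follows. This is only an illustration under stated assumptions: each frame is taken to be a list of band-pass filter energies ordered from low to high frequency, and the function name `classify_consonant_shape`, the label strings, and the exact tests for "monotonic" and "bell-shaped" are hypothetical, since the patent gives no numerical criteria.

```python
def classify_consonant_shape(spectrum):
    """Roughly label one frame from the outline of its band spectrum.

    spectrum: band energies ordered from low to high frequency (assumed layout).
    Returns "fricative", "plosive", or "other" (hypothetical label names).
    """
    n = len(spectrum)
    if n < 3:
        return "other"
    # Monotonically increasing along frequency -> fricative.
    non_decreasing = all(spectrum[i] <= spectrum[i + 1] for i in range(n - 1))
    if non_decreasing and spectrum[0] < spectrum[-1]:
        return "fricative"
    # Bell shape: rises to a single interior peak, then falls -> plosive.
    peak = max(range(n), key=lambda i: spectrum[i])
    rising = all(spectrum[i] <= spectrum[i + 1] for i in range(peak))
    falling = all(spectrum[i] >= spectrum[i + 1] for i in range(peak, n - 1))
    if 0 < peak < n - 1 and rising and falling:
        return "plosive"
    return "other"
```

A practical labeler would presumably tolerate small ripples in the spectrum rather than require strict monotonicity; the strict tests here are chosen only to keep the sketch short.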

The speech section detection circuit 4 receives the feature time series of the speech signal, consisting of the label sequence obtained by the feature extraction unit 3 as described above, together with the spectrum data of the speech signal obtained by the acoustic analysis unit 2 and its energy data. According to this information, it determines start candidate points S and end candidate points E of the input speech signal, and detects a plurality of speech candidate sections from the possible combinations of these start candidate points S and end candidate points E. Suppose, for example, that the digit "1" is uttered as "ichi" and is input with noise before and after it. The speech signal waveform then looks, for example, as shown in FIG. 2. For such an input speech signal, start candidate points S₁, S₂, S₃ and end candidate points E₁, E₂, E₃ are determined according to the feature time series, spectrum data, and energy data described above. Because a start candidate point must always precede an end candidate point in time, the speech candidate sections obtained from the combinations of these start and end candidate points are, in the example shown in FIG. 2, the following.

[S₁, E₁], [S₁, E₂], [S₁, E₃], [S₂, E₁], [S₂, E₂], [S₂, E₃], [S₃, E₃]

Of these speech candidate sections, [S₂, E₁] and [S₃, E₃] are shorter than one speech frame, so they may be excluded from the candidate sections to be processed.
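The pairing of start and end candidate points can be sketched as below. This is a minimal illustration, assuming the candidate points are represented as frame indices; the function name `candidate_sections`, the `min_frames` parameter, and the example frame positions are hypothetical and chosen only so that the ordering matches the FIG. 2 example (S₁ < S₂ < E₁ < E₂ < S₃ < E₃).

```python
def candidate_sections(starts, ends, min_frames=1):
    """Pair every start candidate with every later end candidate.

    starts, ends: frame indices of candidate points (assumed representation).
    min_frames: hypothetical minimum section length; shorter pairs are dropped.
    """
    sections = []
    for s in sorted(starts):
        for e in sorted(ends):
            # A start candidate must come before the end candidate in time.
            if e > s and e - s >= min_frames:
                sections.append((s, e))
    return sections


# Illustrative frame positions only: S1, S2, S3 and E1, E2, E3.
starts = [10, 30, 80]
ends = [31, 60, 81]
all_sections = candidate_sections(starts, ends)             # seven sections
long_sections = candidate_sections(starts, ends, min_frames=2)  # drops the two shortest
```

With these positions the first call yields the seven sections listed above, and raising `min_frames` removes the two very short sections corresponding to [S₂, E₁] and [S₃, E₃].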

For each of the speech candidate sections obtained in this way, the recognition unit 5 receives features of the speech signal, for example its spectrum information. It then resamples the spectrum information sequence of the speech signal in each speech candidate section, obtains a feature pattern vector from it, and performs speech recognition by, for example, calculating the similarity between that vector and the standard pattern vector of each of the speech categories registered in advance as a dictionary. The recognition unit 5 performs this recognition process on each of the speech candidate sections obtained as described above and outputs the recognition results to the control unit 6. The control unit 6 controls the processing units 2, 3, 4, and 5, receives the recognition result for each speech candidate section from the recognition unit 5, and makes an overall judgment. When a speech candidate section contains noise, its recognition result (similarity value) naturally becomes worse. The control unit 6 uses this fact to discard such results, extracts the most reliable recognition result, and outputs it as the correct recognition result for the speech signal. In this way, the recognition result obtained from a highly reliable speech candidate section is output, and recognition results obtained from information that includes noise added before or after the speech are effectively excluded. In other words, from among the plurality of speech candidate sections, only the information in a speech candidate section that does not include noise is effectively extracted and recognized.
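The flow of resampling each candidate section, scoring it against the dictionary, and keeping the best result can be sketched as follows. This is an illustration under assumptions, not the patent's implementation: the function names, the choice of evenly spaced frame picking for resampling, and the normalized squared inner product used as the similarity are all stand-ins, and the toy spectra and dictionary entries are invented for the example.

```python
def resample(frames, n_points):
    """Pick n_points frames at evenly spaced positions across a section."""
    if len(frames) == 1 or n_points == 1:
        return [frames[0]] * n_points
    step = (len(frames) - 1) / (n_points - 1)
    return [frames[round(i * step)] for i in range(n_points)]


def similarity(x, ref):
    """Normalized squared inner product (a simple stand-in measure)."""
    dot = sum(a * b for a, b in zip(x, ref))
    nx = sum(a * a for a in x)
    nr = sum(b * b for b in ref)
    return 0.0 if nx == 0 or nr == 0 else dot * dot / (nx * nr)


def recognize(spectra, sections, dictionary, n_points=4):
    """Return (score, word, section) for the best-scoring candidate section."""
    best = None
    for s, e in sections:
        sampled = resample(spectra[s:e + 1], n_points)
        vec = [v for frame in sampled for v in frame]  # flatten to one pattern vector
        for word, ref in dictionary.items():
            score = similarity(vec, ref)
            if best is None or score > best[0]:
                best = (score, word, (s, e))
    return best


# Toy data: two-band spectra with noise frames at both ends (frames 0 and 5).
spectra = [[5, 5], [1, 0], [1, 0], [0, 1], [0, 1], [5, 5]]
dictionary = {"ichi": [1, 0, 1, 0, 0, 1, 0, 1], "ni": [0, 1, 0, 1, 1, 0, 1, 0]}
best = recognize(spectra, [(0, 5), (0, 4), (1, 4)], dictionary)
```

In this toy case the section (1, 4), which excludes the noisy frames, scores highest, which mirrors the idea that sections containing noise yield lower similarity values and are thereby rejected.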

The speech recognition process for each speech candidate section in the recognition unit 5 may use any of the various methods proposed in the past, as appropriate. Needless to say, various speech features may also be used in this recognition process.

The most distinctive part of this device, namely the detection of the start candidate points S and end candidate points E of the speech signal and the detection of the speech candidate sections obtained from combinations of these points, is performed by the speech section detection circuit 4 as follows. FIG. 3 is a flowchart showing an example of the procedure. The process first initializes the processing control counter and then takes in the speech signal energy of the n-th frame. Next, for example, the speech signal is divided into a silent class and a speech class according to a provisionally set threshold, the between-class variance of the two classes is computed, and the optimal threshold E_th is set so that this variance is maximized. The threshold E_th is then compared with the input speech energy E_(o). The time at which the input speech energy E_(o) exceeds the threshold E_th is taken as S'_(i), the first candidate point for the start. Next, the time at which the input speech energy E_(o) falls below the threshold E_th is detected and taken as E'_(i), the first candidate point for the end. The interval between the start and end candidate points obtained in this way is then computed as

T_k = |S'_(k) − E'_(k)|

and a speech candidate section is obtained by determining whether this interval exceeds a predetermined interval T_th. This removes erroneous speech candidate sections that arise as short fragments. The detection described above is applied successively over all frames of the input speech signal, all possible combinations are checked, and all speech candidate sections are obtained.
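The threshold selection and candidate-point detection just described can be sketched as below. This is a hedged illustration: choosing a threshold that maximizes the between-class variance is implemented here as an exhaustive search over the observed energy values, which is one straightforward reading of the text, and the function names and the handling of edge cases are assumptions.

```python
def optimal_threshold(energies):
    """Choose the threshold maximizing the between-class variance of the
    silent class (energy <= threshold) and the speech class (energy > threshold).
    Returns None when all energies are equal (no split is possible)."""
    best_t, best_var = None, -1.0
    n = len(energies)
    for t in sorted(set(energies))[:-1]:      # each candidate leaves both classes non-empty
        low = [e for e in energies if e <= t]
        high = [e for e in energies if e > t]
        w0, w1 = len(low) / n, len(high) / n
        m0, m1 = sum(low) / len(low), sum(high) / len(high)
        var = w0 * w1 * (m0 - m1) ** 2        # between-class variance
        if var > best_var:
            best_t, best_var = t, var
    return best_t


def candidate_points(energies, e_th, t_th):
    """Return (start, end) frame pairs where the energy stays above e_th
    for more than t_th frames; shorter fragments are discarded.
    A run still open at the last frame is ignored in this sketch."""
    pairs, start = [], None
    for n, e in enumerate(energies):
        if e > e_th and start is None:
            start = n                         # energy rises above the threshold
        elif e <= e_th and start is not None:
            if n - start > t_th:              # interval T_k must exceed T_th
                pairs.append((start, n))
            start = None
    return pairs


energies = [1, 1, 9, 1, 1, 8, 9, 10, 9, 1, 1]
e_th = optimal_threshold(energies)
pairs = candidate_points(energies, e_th, t_th=2)
```

In this example the single high-energy frame at index 2 is rejected as a fragment because its interval does not exceed `t_th`, while the longer run starting at frame 5 is kept.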

After that, the speech features of each speech candidate section are examined to determine whether the section contains noise-like components, and speech candidate sections that contain such components are removed from the recognition targets. The end of the utterance is then detected, for example, from the fact that the condition E_(o) < E_th continues for a predetermined period M_th. The speech features in the speech candidate sections detected so far are compared with the utterance shapes of the words registered in advance, and the final speech candidate sections are determined. For every speech candidate section determined in this way, the similarity to the registered word dictionary is computed, for example using the multiple (composite) similarity method. The similarity values are compared with one another, and the most reliable result is output as the recognition result.
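The end-of-utterance test described above, in which the energy stays below E_th for a period M_th, can be sketched as a run-length check over frame energies. This is a minimal sketch; the function name `utterance_end`, the use of a frame count for M_th, and the return convention are assumptions.

```python
def utterance_end(energies, e_th, m_th):
    """Return the index of the frame at which the energy has stayed below
    e_th for m_th consecutive frames, or None if that never happens."""
    run = 0
    for n, e in enumerate(energies):
        run = run + 1 if e < e_th else 0   # count consecutive low-energy frames
        if run >= m_th:
            return n
    return None


end_frame = utterance_end([5, 6, 1, 1, 7, 1, 1, 1, 1], e_th=3, m_th=3)
```

A short dip (frames 2 and 3 in the example) does not end the utterance because it is shorter than `m_th`. A practical version would presumably start this check only after a start candidate point has been found, so that leading silence is not mistaken for the end of the utterance.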

With the recognition process described above, the recognition results obtained for each of the candidate sections that may contain the word speech are compared with one another and the most reliable one is extracted. The recognition result derived from the features of the discretely uttered speech itself can therefore be obtained with high reliability. In other words, because detection of the speech section and the recognition process are carried out in conjunction with each other, recognition can be performed stably, which is of great practical value.

The present invention is not limited to the embodiment described above. For example, various methods may be used as appropriate for detecting the start and end candidate points, for extracting the speech candidate sections obtained from their combinations, and for the recognition process itself. The speech features used in the above processing are likewise not particularly limited. In short, the present invention can be implemented with various modifications without departing from its gist.

[Brief explanation of drawings]

FIG. 1 is a schematic configuration diagram of a device according to an embodiment of the present invention. FIG. 2 shows a speech signal waveform illustrating the processing of the embodiment, together with the start and end candidate points and the resulting speech candidate sections. FIG. 3 shows the flow of recognition processing in the embodiment.

1: speech input unit, 2: acoustic analysis unit, 3: feature extraction unit, 4: speech section detection circuit, 5: recognition unit, 6: control unit.

Claims

[Claims]

1. A speech recognition device comprising: means for inputting a speech signal and performing acoustic analysis at each predetermined frame period; means for extracting features of the speech signal from the acoustic analysis result by labeling vowels and nasals for each frame based on phoneme similarity and labeling consonants from the rough shape of the spectrum; means for detecting start candidate points and end candidate points of the speech signal according to the features of the speech signal and the acoustic analysis result; means for determining, using the vowel/nasal labels and consonant labels obtained for each frame, whether each of a plurality of speech candidate sections obtained from all possible combinations of the start candidate points and end candidate points is speech or noise; means for resampling and extracting the features of the speech signal in the speech candidate sections determined to be speech; and recognition means.