JPH0376471B2

Info

Publication number: JPH0376471B2
Application number: JP57012795A
Authority: JP
Inventors: Hidenori Shinoda; Tomio Sakata; Yoichi Takebayashi
Original assignee: Tokyo Shibaura Electric Co Ltd
Current assignee: Toshiba Corp
Priority date: 1982-01-29
Filing date: 1982-01-29
Publication date: 1991-12-05
Also published as: JPS58130395A

Description

[Detailed description of the invention]

[Technical field of the invention]

The present invention relates to a speech section detection device capable of stably and reliably detecting the speech section of an isolated word utterance.

[Technical background of the invention]

When a word is recognized from the overall speech pattern of an isolated word utterance, the accuracy with which its speech section is detected greatly affects the recognition rate. In detecting the speech section, it is particularly important to prevent so-called addition errors, in which noise or the like is taken in as part of the speech section, and so-called omission errors, in which the speech section is fixed with part of the speech missing.

In a conversational recognition system, where speech is generally supplied in a quiet environment that secures an energy S/N ratio of 30 dB or more, the above problem is solved relatively easily. In this type of system the speaker is actively prompted to speak, so a silent section before the utterance is guaranteed to some extent. If the mean, and further the variance and the like, of the energy and of the zero-crossing count in this silent section are obtained, the threshold for speech section detection can be set according to the background noise level, and accurate speech section detection becomes possible.
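By way of illustration only, the noise-adaptive threshold described above might be sketched as follows. The specification states only that the mean and variance of the leading silent section are used; the rule "mean plus k standard deviations", the function name `noise_threshold`, and the value of `k` are assumptions introduced for this sketch.

```python
import statistics

def noise_threshold(leading_energy, k=3.0):
    """Threshold from frames assumed to be silence: mean + k * std.

    The multiplier k is a hypothetical tuning value, not from the source.
    """
    mean = statistics.mean(leading_energy)
    std = statistics.pstdev(leading_energy)  # population standard deviation
    return mean + k * std

# Per-frame energy of the silent section that precedes the prompted utterance.
silence = [1.0, 1.2, 0.8, 1.1, 0.9]
threshold = noise_threshold(silence)
```

A frame whose energy exceeds `threshold` would then be treated as a speech candidate. This works only while the leading section really is silent, which is the limitation taken up in the next section.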

[Problems with background technology]

However, there is no guarantee that speech is input in a quiet environment, and in general the silent section described above cannot be reliably secured. Moreover, depending on the speaking environment, the ambient noise level may be high, or the noise level itself may fluctuate. Conventional systems could not cope with such background noise at all, and it was extremely difficult to detect the speech section accurately and stably. As a result, addition and omission errors prevented the recognition rate from being made sufficiently high.

Furthermore, highly accurate speech recognition requires that the start and end positions of the speech section be stable. For example, if the start is defined as the onset of the vowel of the first phoneme of the word, that onset must be detected regardless of the speech level and the background noise level. It is very difficult, however, to satisfy these requirements with a single speech parameter or a single algorithm, and this also posed a problem of practicality.

[Purpose of the invention]

The present invention has been made in consideration of these circumstances. Its object is to provide a highly practical speech section detection device that accurately detects the start and end of speech uttered in an environment where the S/N ratio between the speech signal and the background noise is poor and the background noise level fluctuates greatly, and that thereby detects the speech section stably and with high accuracy.

[Summary of the invention]

The present invention extracts a certain speech parameter from the input speech and uses it to detect an approximate speech section of the input speech. It then defines, as the detection section, the section running from several frames before the tentative start of that speech section to several frames after its tentative end, and detects the speech within this detection section using other feature parameters of the input speech, thereby determining the start and the end of the speech section with high accuracy.

[Effect of the invention]

According to the present invention, therefore, a speech section is first determined tentatively using a speech parameter that is relatively robust to noise, and the uniquely defined start and end of the speech are then each detected with high accuracy using other feature parameters, so that the speech section can be detected accurately. Moreover, because the high-accuracy detection is applied only to the tentatively detected section extended by a predetermined number of frames before and after it, the speech section can be detected stably without being greatly affected by background noise. Speech section detection can thus be performed accurately regardless of the speaking environment, which yields great practical benefits such as an improved speech recognition rate.

[Embodiments of the invention]

An embodiment of the present invention is described below with reference to the drawings.

FIG. 1 is a schematic configuration diagram of the device of the embodiment, and FIG. 2 is a diagram for explaining its operation.

A speech signal input from a microphone through an input amplifier is led to the speech parameter extraction section 1. For each short time width T, the speech parameter extraction section 1 obtains the effective (RMS) value E of the full-band energy of the input speech signal and the effective values B₁, B₂, B₃, B₄ of the channel outputs of the input speech signal extracted through, for example, a 4-channel wideband bandpass filter, and outputs them as speech parameter time series. The speech parameter time series E, B₁, B₂, B₃, B₄ obtained in this way are temporarily stored in the buffer memory 2 over a predetermined time width. The signal E shown in FIG. 2 is an example of a speech pattern with energy as the speech parameter.
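As a minimal sketch of the per-frame effective value computed by the speech parameter extraction section 1, the following assumes non-overlapping frames of `frame_len` samples, each corresponding to one time width T. The function name `frame_rms`, the frame layout, and the sample values are assumptions; the specification does not state the value of T or the filter design. The band values B₁ to B₄ would be obtained by applying the same function to each bandpass-filtered signal.

```python
import math

def frame_rms(samples, frame_len):
    """RMS value of each non-overlapping frame; a trailing partial frame is dropped."""
    values = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        values.append(math.sqrt(sum(x * x for x in frame) / frame_len))
    return values

# Three frames of four samples: silence, a loud frame, a quieter frame.
signal = [0.0] * 4 + [1.0, -1.0, 1.0, -1.0] + [0.5, -0.5, 0.5, -0.5]
E = frame_rms(signal, 4)
```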

The first-stage speech section detection section 3 receives the speech parameter time series of the energy E, compares the energy with a preset threshold E₁, and detects the time a₁ at which the energy exceeds the threshold E₁ as the tentative start of the speech. In this detection algorithm, when the input speech energy E exceeds the threshold E₁ and remains above it for a predetermined time, the time at which the threshold E₁ was first exceeded is taken as the tentative start a₁. If the period above the threshold E₁ does not last for the predetermined time (a predetermined number of frames: 50 to 70 msec), it is regarded as noise and the tentative start detection is performed again.

The information on the tentative start a₁ detected in this way is then given to the threshold calculation section 4. The threshold calculation section 4 obtains, for example, the mean of the input speech energy E from the start of speech input up to the tentative start a₁ and, by adding a predetermined value to it or the like, sets a threshold E₂ for tentative end detection, which it gives to the first-stage speech section detection section 3. Using the newly given threshold E₂, the first-stage speech section detection section 3 then detects the tentative end: when the input speech energy E falls below the threshold E₂ and remains below it for a predetermined time (a predetermined number of frames: about 250 to 300 msec), the time b₁ immediately before the energy E fell below the threshold E₂ is taken as the tentative end. Compared with the detection of the tentative start a₁, the detection of the tentative end b₁ therefore takes the background noise level into account to some degree. In this way the first-stage speech section detection section 3 detects a tentative speech section of the input speech, obtaining the tentative start a₁ and the tentative end b₁ by comparing the energy E, used as the speech parameter, with the thresholds E₁ and E₂. The speech section indicated by the tentative start a₁ and the tentative end b₁ is unaffected by sections whose energy was raised by impulsive noise, by silent sections within the word, and the like.
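The first-stage detection can be sketched roughly as below, under stated assumptions. The function names, the parameter `margin` (the "predetermined value" added to the noise mean), the run lengths expressed in frames, and the choice to return `None` when no end is found are all hypothetical; the specification gives the durations only in milliseconds and does not state the frame period.

```python
def find_tentative_start(E, e1, min_run):
    """Index a1 where E first exceeds e1 and stays above it for min_run frames."""
    run = 0
    for i, e in enumerate(E):
        if e > e1:
            run += 1
            if run >= min_run:
                return i - run + 1  # first frame of the sustained run
        else:
            run = 0  # too short: regarded as noise, search again
    return None

def end_threshold(E, a1, margin):
    """E2: mean energy from input start up to a1, plus a fixed margin."""
    noise = E[:a1]
    mean = sum(noise) / len(noise) if noise else 0.0
    return mean + margin

def find_tentative_end(E, a1, e2, min_run):
    """Index b1 just before E drops below e2 and stays below for min_run frames."""
    run = 0
    for i in range(a1, len(E)):
        if E[i] < e2:
            run += 1
            if run >= min_run:
                return i - run  # frame immediately before the drop
        else:
            run = 0
    return None  # assumption: no end reported if the silence is too short

# Frame 2 is a one-frame noise spike; speech occupies frames 5 to 8.
energy = [1, 1, 9, 1, 1, 8, 9, 9, 7, 1, 1, 1, 1]
a1 = find_tentative_start(energy, e1=5, min_run=3)
e2 = end_threshold(energy, a1, margin=1.0)
b1 = find_tentative_end(energy, a1, e2, min_run=3)
```

In this example the isolated spike at frame 2 is rejected because it does not last three frames, and it also raises the noise mean from which E₂ is derived, which is how the end threshold comes to reflect the background noise level.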

The labeling section 5 is given the information on the tentative start a₁ and the tentative end b₁ detected by the first-stage speech section detection section 3. The threshold calculation section 4 reads the speech parameters E, B₁, B₂, B₃, B₄ for the section from the tentative start a₁ to the tentative end b₁ out of the buffer memory 2, and obtains the maximum energy value E_M at the time M at which the energy E is largest, together with the channel outputs B_1M, B_2M, B_3M, B_4M at that time M. A predetermined value is subtracted from each of these values, parameter by parameter, and the resulting labeling thresholds E_T, B_1T, B_2T, B_3T, B_4T are given to the labeling section 5.
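The threshold calculation for labeling might look like the following sketch. The dictionary layout, the function name `labeling_thresholds`, and the offset values are assumptions; the specification says only that a predetermined value is subtracted from each parameter's value at the energy peak M.

```python
def labeling_thresholds(params, a1, b1, offsets):
    """Thresholds = each parameter's value at the energy peak M minus its offset.

    params maps a parameter name to its per-frame series and must contain "E".
    """
    # M: frame within [a1, b1] where the full-band energy E is largest.
    m = max(range(a1, b1 + 1), key=lambda i: params["E"][i])
    return {name: series[m] - offsets[name] for name, series in params.items()}

params = {"E": [1, 2, 10, 4], "B1": [0, 1, 6, 2]}
thresholds = labeling_thresholds(params, a1=1, b1=3, offsets={"E": 7, "B1": 4})
```

Because the thresholds are tied to the peak level of the utterance rather than to the noise floor, they move with the speech level.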

From the information on the tentative start a₁, the labeling section 5 sets a time several frames before it, (a₁ − N_F), as the start a₂ of the detection section; from the information on the tentative end b₁, it sets a time several frames after it, (b₁ + N_E), as the end b₂ of the detection section. For the section from the start a₂ to the end b₂, it reads the speech parameters sequentially out of the buffer memory 2 and compares each with its previously set labeling threshold E_T, B_1T, B_2T, B_3T, B_4T. As shown in the table in FIG. 2, the result of comparing each speech parameter with its threshold is registered for each time point in turn. For example, when a speech parameter is greater than its threshold, the speech element is regarded as strong and the data "1" is registered; when it is smaller than its threshold, the speech element is regarded as weak and the data "0" is registered. The table obtained in this way is then processed at each time point, for example by taking the logical sum (OR) across the parameters, to give a time series of "Q" and "V". In this Q-V table, "V" denotes an element of a speech section and "Q" an element of a silent section.
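A small sketch of this labeling step follows, using the logical sum that the text gives as an example. The names `detection_section` and `label_frames`, the clamping of a₂ and b₂ to the buffer limits, and the sample values are assumptions made for illustration.

```python
def detection_section(a1, b1, n_f, n_e, n_frames):
    """Widen [a1, b1] by n_f frames before and n_e frames after, within the buffer."""
    return max(0, a1 - n_f), min(n_frames - 1, b1 + n_e)

def label_frames(params, thresholds, a2, b2):
    """Label each frame in [a2, b2] 'V' if any parameter exceeds its threshold, else 'Q'."""
    labels = []
    for i in range(a2, b2 + 1):
        bits = [1 if params[name][i] > thresholds[name] else 0 for name in params]
        labels.append("V" if any(bits) else "Q")  # logical sum (OR) of the bits
    return labels

params = {"E": [1, 2, 10, 4, 1], "B1": [0, 3, 6, 2, 0]}
thresholds = {"E": 3, "B1": 2}
a2, b2 = detection_section(a1=1, b1=3, n_f=1, n_e=1, n_frames=5)
labels = label_frames(params, thresholds, a2, b2)
```

Frame 1 is labeled "V" although its full-band energy is below E_T, because the band output B₁ exceeds its own threshold; this is the case of a weak phoneme whose energy is concentrated in one band.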

The second-stage speech section detection section 6 performs section detection on the input speech based on the labeling results "Q" and "V" obtained as described above. The detection of the start and end here is almost the same as the detection algorithm of the first-stage speech section detection section 3 described earlier, but it is carried out by referring to the Q-V table and searching in the time direction for frames labeled as the speech element "V". First, the first frame a labeled "V" is detected, and it is checked whether the "V" labeling then continues for a predetermined number of frames, for example the number of frames corresponding to 40 to 50 msec. When this condition is satisfied, frame a is recognized as the start of the input speech.

Next, the first frame b labeled "Q" is detected, and it is checked whether the "Q" labeling then continues for a predetermined number of frames, for example over 250 to 300 msec. If it does not, it is checked whether the frames labeled "V" continue for a predetermined number, for example the number of frames corresponding to 40 to 50 msec. If this condition is satisfied, another speech section within the word is regarded as having appeared, and the end detection operation is performed again. If the condition is not satisfied, the "V" frames are regarded as caused by noise, and the number of frames labeled "V" is added to the count of "Q". In this way the start a and the end b of the input speech are each detected, and the speech section is determined.
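The second-stage search over the Q-V sequence can be sketched as follows. The function name `detect_segment`, the run lengths in frames (`min_v` for 40 to 50 msec, `min_q` for 250 to 300 msec), the convention that b is the first "Q" frame of the closing silence, and the return of `None` when the input ends too soon are assumptions of this sketch, not statements of the specification.

```python
def detect_segment(labels, min_v, min_q):
    """Return (a, b) so that the speech is labels[a:b], or None if not found."""
    n = len(labels)
    a = None
    i = 0
    # Start: the first run of 'V' lasting at least min_v frames.
    while i < n:
        if labels[i] == "V":
            j = i
            while j < n and labels[j] == "V":
                j += 1
            if j - i >= min_v:
                a, i = i, j
                break
            i = j
        else:
            i += 1
    if a is None:
        return None
    # End: min_q frames of 'Q'; short 'V' bursts inside are counted as 'Q'.
    while i < n:
        b = i          # candidate end: first 'Q' frame after speech
        q_count = 0
        while i < n:
            if labels[i] == "Q":
                q_count += 1
                i += 1
            else:
                j = i
                while j < n and labels[j] == "V":
                    j += 1
                if j - i >= min_v:
                    i = j  # speech resumed within the word: redo end detection
                    break
                q_count += j - i  # short burst regarded as noise
                i = j
            if q_count >= min_q:
                return a, b
        else:
            return None  # input ended before enough silence was seen
    return None

labels = list("QQVQQVVVVQVQVVVQQQQQ")
segment = detect_segment(labels, min_v=3, min_q=4)
```

In this example the single "V" at frame 2 is too short to be a start, the single "V" at frame 10 is absorbed into the silence count, and the three-frame "V" run at frames 12 to 14 is treated as a second speech section of the same word, so the end moves to frame 15.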

With the speech section detection performed by this device as described above, the speech section of the input speech can be detected stably, reliably, and accurately without being adversely affected by background noise. The start and end of a speech section are independent of the background noise level and should therefore be determined without being influenced by it. In the present invention, the thresholds are accordingly set with reference to the maximum level of the input speech, and each speech parameter is labeled according to its threshold. Moreover, because the labeling decision is made across a plurality of speech parameters, even a phoneme whose energy is concentrated in a certain band and whose overall energy level is low can be detected reliably. The phonemes in each frame of the input speech can thus be detected reliably, and the speech section can be detected accurately. Speech section detection can therefore be performed stably and accurately regardless of the speaking environment, even when the background noise level is fluctuating, which is a great practical advantage.

The present invention is not limited to the above embodiment. For example, feature parameters such as LPC prediction errors of various orders or correlation coefficients of the speech signal may be adopted as the speech parameters extracted from the input speech. It is also useful to take the relationship between the channel outputs of the bandpass filter as a feature parameter, and the number of channels of the bandpass filter may be chosen according to the specification. In short, the present invention can be implemented with various modifications without departing from its gist.

[Brief explanation of drawings]

FIG. 1 is a schematic configuration diagram of a device according to an embodiment of the present invention, and FIG. 2 is a diagram showing speech parameters and a labeling table for explaining the operation of the device of the embodiment.

Description of symbols: 1, speech parameter extraction section; 2, buffer memory; 3, first-stage speech section detection section; 4, threshold calculation section; 5, labeling section; 6, second-stage speech section detection section.

Claims

[Scope of Claims]

1. A speech section detection device comprising: means for extracting a plurality of types of speech parameter time series from input speech; a memory for storing the extracted speech parameter time series; means for detecting a tentative start and a tentative end of a speech section from one of the speech parameter time series; means for setting, as a labeling target section, the section from a predetermined number of frames before the tentative start to a predetermined number of frames after the tentative end; means for detecting the maximum level of the speech parameter time series in the set labeling target section; means for labeling, with reference to this maximum level, whether each frame is silent or not, using at least the stored speech parameter time series other than the one used for detecting the tentative start and end; and means for determining the start and end of the speech section of the input speech according to the labeling results.

2. The speech section detection device according to claim 1, wherein the speech parameter time series used to detect the tentative start and end has the input speech energy as its feature parameter.

3. The speech section detection device according to claim 1, wherein the tentative start is detected by detecting the time at which the level of the speech parameter time series having speech energy as its feature parameter exceeds a predetermined threshold and remains so for a predetermined number of frames, and the tentative end is detected by detecting the time at which the level of that speech parameter time series falls below a threshold determined according to the distribution of the speech parameter in the section up to the start and remains so for a predetermined number of frames.

4. The speech section detection device according to claim 1, wherein the start and the end are detected by detecting the times immediately before and after labeling indicating silence continues for a predetermined number of frames.