JPS6336679B2

Info

Publication number: JPS6336679B2
Application number: JP56212858A
Authority: JP
Inventors: Hisayo Kusuhara; Kazuaki Mayumi; Hidekazu Tsuboka
Original assignee: Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Holdings Corp
Priority date: 1981-12-29
Filing date: 1981-12-29
Publication date: 1988-07-21
Also published as: JPS58116595A

Description

DETAILED DESCRIPTION OF THE INVENTION

The present invention relates to a word speech recognition device based on zero-cross information of the speech signal.

Many speech recognition devices are large-scale systems that use numerous band-pass filters, spectrum analysis means, and the like; they are complex in structure and expensive. On the other hand, there is demand for a speech recognition device that is simple and inexpensive, even if it recognizes only a small number of words and has a low recognition rate.

One simple speech recognition method uses zero-cross information of the speech waveform. A comparator detects the zero crosses of the speech signal, and a computer or the like processes them to extract the features of a word and recognize it. In practice, if the zero crosses of the input signal are simply taken, then, as shown in FIG. 1, noise contained in the input signal causes zero crosses to be detected even outside the speech, so speech sections cannot be separated from sections without speech. Therefore, in addition to the zero-cross information, the envelope of the input signal is detected, the section in which the envelope exceeds a predetermined value is regarded as a speech section, and the zero-cross information of that section is used for recognition. With this method, however, the circuit for envelope detection becomes complicated.

Another conceivable way to separate speech sections is to give the comparator hysteresis, or to shift the comparator's reference voltage away from the average value of the input signal, so that the comparator does not respond to noise whose amplitude is small compared with the speech. FIG. 2 and FIG. 3 show the input waveform and the comparator output waveform when the comparator is given hysteresis and when the comparator's reference voltage is shifted, respectively. For a noisy input waveform, setting the hysteresis appropriately or shifting the reference voltage appropriately prevents the comparator output state from changing during sections without speech. Recognition is therefore performed using the information from sections in which the output state changes. Strictly speaking, the information obtained this way is not zero-cross information, but for speech signals whose amplitude is large compared with the noise, the comparator hysteresis or reference voltage shift has little effect, and almost the same information as zero-cross information is obtained. For consonant portions with small amplitude, however, zero-cross information may be lost. In a recognition method using zero-cross information, it is desirable that the difference in the number of zero crosses be large between portions with relatively few zero crosses, such as vowels, and portions with many zero crosses, such as fricatives. Because fricative portions have small amplitude, the above method loses their zero-cross information, their features are less likely to appear, and sufficient recognition accuracy cannot be obtained.
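The trade-off described above can be illustrated with a small simulation. The following is a minimal sketch, not the patented circuit: the function names `comparator` and `count_transitions`, the 8 kHz sample rate, and the two synthetic test signals (a loud 300 Hz "vowel" and a quiet 3 kHz "fricative") are all assumptions chosen for illustration.

```python
import math

def comparator(samples, reference=0.0, hysteresis=0.0):
    """Model a comparator as a list of 0/1 output states.

    The output goes high when the input rises above reference + hysteresis
    and goes low when it falls below reference - hysteresis.
    """
    state = 0
    out = []
    for x in samples:
        if state == 0 and x > reference + hysteresis:
            state = 1
        elif state == 1 and x < reference - hysteresis:
            state = 0
        out.append(state)
    return out

def count_transitions(states):
    """Count changes of the comparator output state."""
    return sum(1 for a, b in zip(states, states[1:]) if a != b)

# Hypothetical test signals, 10 ms each at an assumed 8 kHz sample rate.
RATE = 8000
t = [n / RATE for n in range(80)]
vowel = [1.0 * math.sin(2 * math.pi * 300 * x) for x in t]
fricative = [0.05 * math.sin(2 * math.pi * 3000 * x) for x in t]

for name, sig in (("vowel", vowel), ("fricative", fricative)):
    plain = count_transitions(comparator(sig))
    hyst = count_transitions(comparator(sig, hysteresis=0.1))
    print(name, "plain:", plain, "with hysteresis:", hyst)
```

With these assumed amplitudes, the hysteresis band is wider than the quiet fricative, so its transitions vanish entirely while the loud vowel is barely affected. This is the loss of zero-cross information for small-amplitude consonants that the text describes.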

Accordingly, an object of the present invention is to improve recognition accuracy in a device that obtains zero-cross information with a comparator and uses it to perform word speech recognition.

The present invention comprises first comparison means that compares the speech signal with a reference value corresponding to the average level of the speech signal; second comparison means that compares the speech signal with a reference value deviated from the average level of the speech signal by a predetermined amount; means for detecting a speech section from the output of the second comparison means; and recognition means that performs speech recognition from the zero-cross information of the output of the first comparison means within the detected speech section. This makes it possible to improve recognition accuracy with a simple configuration. Embodiments of the present invention are described below.

FIG. 4 shows the configuration of an embodiment of a speech recognition device according to the present invention. In the figure, a speech signal input from microphone 1 is amplified by preamplifier 2, passes through high-pass filter 3, which removes the pitch component, and offset adjustment section 4, which adjusts the average value of the input speech signal to zero level, and is then input to comparators 5 and 6. Comparator 5 in effect compares the speech signal with a reference value deviated from the average level of the speech signal by a predetermined amount. For example, it is given hysteresis as shown in FIG. 2 so that it does not respond to small-amplitude signals consisting mainly of noise. Instead of hysteresis, the reference voltage of the comparator may be shifted from the average value (zero level) of the input signal, as shown in FIG. 3, so that it does not respond to noise. Comparator 6, on the other hand, has no hysteresis, and its reference voltage matches the average value of the input signal.

The outputs of comparators 5 and 6 are connected to an input terminal and an interrupt input terminal of microcomputer 7, respectively. To keep the overall price as low as possible, a microcomputer of about 4 bits is normally used. RAM 8 is attached externally when the RAM capacity inside the microcomputer is small.

Microcomputer 7 reads the input from comparator 5 continuously and, each time the state of that input changes, increments by one the contents of the memory allocated for counting. Meanwhile, an interrupt occurs each time the input from comparator 6 changes from high level to low level, and the interrupt routine increments by one the contents of the memory allocated for counting comparator 6. In addition, a timer interrupt is generated every 10 msec, which yields the number of output state changes of comparators 5 and 6 within each 10 msec period (called one frame). The number of output state changes of comparator 6 in one frame is called the number of zero crosses, and the number of output state changes of comparator 5 is called the number of quasi-zero crosses. The flow of this processing is shown as steps 61 and 63, and 62 and 65, in FIG. 6.
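The per-frame counting can be sketched in software as follows. This is an illustrative model only: the real device counts edges with polling and hardware interrupts, whereas here both comparator outputs are assumed to be available as equal-length lists of 0/1 samples, and `frame_counts` and `samples_per_frame` are hypothetical names. Following the text, every state change of comparator 5 is counted, while comparator 6 is counted only on high-to-low transitions.

```python
def frame_counts(out5, out6, samples_per_frame):
    """Return (zero_counts, quasi_counts), one entry per complete frame.

    out5, out6: equal-length sequences of 0/1 comparator output states.
    A trailing partial frame is discarded.
    """
    zero_counts, quasi_counts = [], []
    zero = quasi = 0
    prev5, prev6 = out5[0], out6[0]
    for n, (s5, s6) in enumerate(zip(out5, out6), start=1):
        if s5 != prev5:              # polled input: any state change counts
            quasi += 1
        if prev6 == 1 and s6 == 0:   # interrupt input: high-to-low edge only
            zero += 1
        prev5, prev6 = s5, s6
        if n % samples_per_frame == 0:   # stands in for the 10 msec timer interrupt
            zero_counts.append(zero)
            quasi_counts.append(quasi)
            zero = quasi = 0
    return zero_counts, quasi_counts

# Tiny hand-made example with 4 samples per frame.
out5 = [0, 0, 1, 1, 1, 0, 0, 0]
out6 = [1, 0, 1, 0, 0, 0, 1, 0]
print(frame_counts(out5, out6, samples_per_frame=4))
```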

First, the speech section is detected using the number of quasi-zero crosses. As described above, comparator 5 is arranged not to respond to noise, so the number of quasi-zero crosses can be expected to be nonzero in frames within a speech section and zero in frames without speech. It is therefore possible to detect the speech section from the number of quasi-zero crosses. However, silent intervals also exist within a speech section, for example before a plosive, so it is not sufficient simply to regard the frames with a nonzero number of quasi-zero crosses as the speech section. The speech section is therefore determined from the sequence of quasi-zero-cross numbers by the method shown in FIG. 5. When a predetermined number (for example, 5) of consecutive frames have a nonzero number of quasi-zero crosses, the first of those nonzero frames is taken as the start of the speech section. After the speech section has started, when a predetermined number (for example, 30) of consecutive frames have a number of quasi-zero crosses equal to zero, the last nonzero frame is taken as the end of the speech section. If the speech section detected in this way does not reach a certain length (for example, 20 frames), it is too short to be a spoken word and is not regarded as a speech section. Likewise, if the detected speech section exceeds a predetermined length (for example, 120 frames), it is too long to be a spoken word and is not regarded as a speech section.
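The detection rules above can be expressed as a short offline routine. This is a minimal sketch under stated assumptions rather than the flow of FIG. 5 itself: it operates on a complete list of quasi-zero-cross numbers instead of a live stream, the function and parameter names are hypothetical, and the default values are the example figures from the text (5, 30, 20, and 120 frames).

```python
def detect_speech_section(quasi, min_run=5, end_silence=30,
                          min_len=20, max_len=120):
    """Return (start, end) as 0-based inclusive frame indices, or None."""
    n = len(quasi)
    i = 0
    while i < n:
        if quasi[i] == 0:
            i += 1
            continue
        # Measure the run of consecutive nonzero frames starting at i.
        run_end = i
        while run_end < n and quasi[run_end] != 0:
            run_end += 1
        if run_end - i < min_run:
            i = run_end              # run too short: look for a new start
            continue
        start = i
        last_nonzero = run_end - 1
        silent = 0
        k = run_end
        # Short silences (e.g. before a plosive) are bridged; the section
        # ends only after end_silence consecutive zero frames.
        while k < n and silent < end_silence:
            if quasi[k] == 0:
                silent += 1
            else:
                silent = 0
                last_nonzero = k
            k += 1
        if silent < end_silence:
            return None              # input ended before the section closed
        length = last_nonzero - start + 1
        if min_len <= length <= max_len:
            return start, last_nonzero
        i = k                        # too short or too long: keep searching
    return None

# A 3-frame blip is ignored; a 4-frame pause inside the word is bridged.
quasi = [0] * 3 + [2] * 3 + [0] * 2 + [3] * 25 + [0] * 4 + [1] * 10 + [0] * 40
print(detect_speech_section(quasi))
```

One simplification to note: an over-long section is rejected here only after its end has been found, whereas a streaming implementation could give up as soon as the length limit is passed.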

In parallel with speech section detection 64, the numbers of zero crosses are written to RAM 8 (step 67). As shown in FIG. 6, whether or not to write the number of zero crosses is decided based on the value of the number of quasi-zero crosses. The portion enclosed by the broken line in FIG. 6 is processing inside the microcomputer. FIG. 7 shows this in detail. Here N is the number of frames written to RAM 8, and M is the number of frames in which the number of quasi-zero crosses is zero. First, when a frame with a nonzero number of quasi-zero crosses appears, write control 66 writes the number of zero crosses of each frame, one frame at a time, starting from the beginning of the area of RAM 8 allocated for storing the numbers of zero crosses. If, after writing has started, a frame with zero quasi-zero crosses appears before the predetermined number of frames (here 5) has been reached, the data is judged not to be a speech section and a new start point of the speech section is sought. After the speech section has started, when the number of quasi-zero crosses is zero for a predetermined number of consecutive frames (here 30), writing stops because the speech section has ended, and it is determined whether the written speech section is at least a predetermined number of frames long (here 20); if not, it is not regarded as a speech section and a new start point is sought. If it is judged to be a speech section, the counting and writing of the numbers of zero crosses and quasi-zero crosses end and processing moves to the next step. If the end of the speech section is not detected even after a predetermined number of frames has been written (here 150 frames, the sum of the 120-frame maximum speech section length and the 30 frames needed to detect the end of the speech section), the data is not regarded as a speech section and a new start point is sought.

Once the data has been judged to be a speech section and writing has ended, the frame a predetermined number of frames (here 30) back from the last written frame is the end point of the speech section. Once the speech section has been determined in this way, 16 frames are sampled at equal intervals from the numbers of zero crosses within the speech section. If the speech section is n frames long, the sampled frames are the 1st, (n/16 + 1)th, (2n/16 + 1)th, (3n/16 + 1)th, ..., (15n/16 + 1)th frames. Because the microcomputer used here is a 4-bit machine, division by 16 is achieved by ignoring the lowest word of the memory allocated for the frame calculation, so the above calculation can be performed with additions alone. In addition, because frame numbers and RAM addresses correspond in order, the sample frames are actually calculated in terms of RAM addresses rather than frame numbers.
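The addition-only sampling can be sketched as below. This reflects one reading of the text, stated here as an assumption: a running sum is increased by n for each sample, and dropping its lowest 4-bit word (a right shift by 4) gives the integer part of k·n/16. The function name `sample_frames` is hypothetical, and indices are 0-based, so index 0 corresponds to the "1st" frame in the text.

```python
def sample_frames(zero_counts):
    """Pick 16 values at roughly equal spacing from a speech section.

    zero_counts: the numbers of zero crosses for the n frames of the section.
    """
    n = len(zero_counts)
    acc = 0                # running sum k * n, built by addition only
    picked = []
    for _ in range(16):
        picked.append(zero_counts[acc >> 4])  # drop the low 4 bits: acc // 16
        acc += n
    return picked

# With n = 32 every second frame is taken; with n = 20 the spacing alternates.
print(sample_frames(list(range(32))))
print(sample_frames(list(range(20))))
```

Since the largest index used is the integer part of 15n/16, it never exceeds n - 1, so no bounds check is needed for any non-empty section.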

Recognition is performed based on the 16 numbers of zero crosses per word obtained in this way. First, for each word to be recognized, the 16 numbers of zero crosses described above are registered in RAM 8. The registered values are called the standard pattern. After registration mode has been signaled to microcomputer 7 by a switch input, a word is spoken into the microphone; 16 values are then sampled from the numbers of zero crosses written to RAM 8 and written in order from the beginning of the RAM area allocated for standard pattern registration.

When registration is complete, the mode is set to recognition by a switch input, and speech is then input from the microphone. Let $S_i$ ($i = 1, \ldots, 16$) be the $i$-th of the 16 numbers of zero crosses sampled from the input speech, and let $T_{ji}$ ($i = 1, \ldots, 16$; $j = 1, \ldots, W$, where $W$ is the number of registered words) be the $i$-th value of the $j$-th registered word in the standard pattern. The distance $D_j$ between the input speech and the $j$-th word is then given by

$$D_j = \sum_{i=1}^{16} \left| T_{ji} - S_i \right|$$

$D_j$ is calculated for each standard pattern, and the word corresponding to the standard pattern that gives the minimum value of $D_j$ is taken as the recognition result and displayed on a display element (not shown) such as an LED.
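The matching step is a nearest-template search under the sum of absolute differences. A minimal sketch follows; the function names, the dictionary layout, and the example words and values are hypothetical and serve only to show the calculation.

```python
def distance(template, pattern):
    """Sum of absolute differences between two 16-value patterns (D_j)."""
    return sum(abs(t - s) for t, s in zip(template, pattern))

def recognize(templates, pattern):
    """Return the registered word whose standard pattern is closest.

    templates: dict mapping each registered word to its 16 values.
    """
    return min(templates, key=lambda word: distance(templates[word], pattern))

# Hypothetical standard patterns for two registered words.
templates = {
    "up": [2] * 8 + [10] * 8,
    "down": [10] * 8 + [2] * 8,
}
spoken = [3] * 8 + [9] * 8
print(recognize(templates, spoken))
```

Only subtraction, absolute value, and addition are needed, which is consistent with the text's point that the processing suits a small microcomputer.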

As described above, this embodiment uses two comparators, one for speech section detection and one for calculating the distance between words, together with a microcomputer. The speech section is thereby detected with a simple circuit, and the zero crosses of small-amplitude consonant portions can also be detected with relatively good accuracy. Moreover, because the processing for recognition is simple, the device can be realized with a microcomputer of about 4 bits.

If zero crosses are so frequent, because of noise or the like, that the interrupt processing of the microcomputer may be unable to keep up, comparator 6 may also be given a slight hysteresis, or its reference voltage may be deviated from the average value of the speech signal.

[Brief Description of the Drawings]

FIG. 1 is a waveform diagram for explaining the operation of the comparator in a conventional word speech recognition device; FIG. 2 is a waveform diagram for explaining the operation of a comparator provided with hysteresis; FIG. 3 is a waveform diagram for explaining the operation of a comparator having a deviated reference voltage; FIG. 4 is a block diagram of a word speech recognition device according to an embodiment of the present invention; FIG. 5 is a flowchart of speech section detection in the same embodiment; and FIG. 6 and FIG. 7 are flowcharts of the writing of the numbers of zero crosses to RAM.

1: microphone; 2: preamplifier; 3: high-pass filter; 4: offset adjustment section; 5, 6: comparators; 7: microcomputer; 8: RAM.

Claims

1. A word speech recognition device comprising: first comparing means that compares a speech signal with a reference value corresponding to the average level of the speech signal; second comparing means that compares the speech signal with a reference value deviated from the average level of the speech signal by a predetermined amount; means for detecting a speech section from the output of the second comparing means; and recognition means for performing word speech recognition by comparing the zero-cross information of the output of the first comparing means in the detected speech section with standard zero-cross information.