JPS6120880B2

Info

Publication number: JPS6120880B2
Application number: JP55088020A
Authority: JP
Inventors: Akinobu Masuko
Original assignee: Tokyo Shibaura Electric Co Ltd
Current assignee: Toshiba Corp
Priority date: 1980-06-28
Filing date: 1980-06-28
Publication date: 1986-05-24
Also published as: JPS5713499A

Description

[Detailed description of the invention]

本発明は、入力音声信号による命令、即ち話者
の音声波から抽出されて物理量の時系列を特徴パ
ターンとしてとらえ、これをあらかじめ登録され
たパターンと比較して音声信号による命令を認知
する所謂、パターンマツチング法による音声認識
装置に係り、特にパターン間の類似度の判定手段
に音声の誤認識を防止する手段を設けた音声認識
装置に関する。一般に音声認識の方法は、音声信号から何らか
の特徴を抽出した後得られる特徴（入力）パター
ンとあらかじめ登録されている登録パターンとの
類似度を直接計算する方式と、前記音声信号から
特徴を抽出した後にこれを音韻系列に置きかえ、
これとあらかじめ登録されている単語辞書（パタ
ーン）とを比較して類似度を計算する方式の２つ
の方式に大別される。これら２つの方式のうち、
後者は音韻単位の識別を行うために、単語数が多
い場合の音声認識に優位である。しかし、単語数
がさほど多くない場合には、前者によるパターン
マツチング認識の方が一般に高い認識率が得られ
る。認識される単語数が数10程度の規模の前記パタ
ーンマツチングによる音声認識システムとして
は、民生機器においては例えば、テレビジヨン受
像機を音声によつて制御する場合が挙げられる。
つまり、テレビジヨン受像機の電源制御、音量制
御、チヤンネル切替等の制御を、あらかじめ音声
認識装置に制御内容を表わす言葉の音声を登録し
ておき、応答装置には認識応答として音声を記憶
させておき、音声命令と登録された制御内容とを
照合して一致すると制御内容を認識したことを音
声によつて返答するとともに所定の制御をするよ
うな場合である。例えば、チヤンネル切替制御に
おいて、１チヤンネを選ぶ場合、あらかじめ「１
チヤンネル」という音声を登録パターンとして記
憶しておいたときに、音声命令を受信するマイク
に向い「１チヤンネル」という音声命令を下すと
音声応答で「オーケー（OK）」と返答し、１チ
ヤンネルが選局される。しかし、ここで問題となるのは、「１チヤンネ
ル」と音声命令を下した時に、これと音声が類似
する「８チヤンネル」という音声命令が制御パタ
ーン（登録パターン）として登録されている点で
ある。即ち、「イチ」と「ハチ」の両者の音声は
類似しており、「イチ」と「ハチ」とを誤まつて
音声認識するのをいかに防止するかが問題とな
る。これは、「イチ」という語と「ハチ」という
語において、「チ」の発音部分の音声エネルギー
が大きい為に、「イ」と「ハ」を区別するのが困
難になることに起因する。一般に、一つの単語の
中にアクセントをもつ音声があると、その部分に
音声エネルギーが集中し、他の部分の音声情報の
認識が困難となる。従つて、音声認識に際して
は、音声命令の強音以外の部分の情報を失うこと
なく特徴（入力）パターンと登録パターンとの比
較をしなければならない。また、話者が音声を発生する場合、同じ単語を
発声しても、発声するたびに振幅が変化する。従
つて音声認識に際しては、振幅が変化しても同じ
単語であれば常に同じパターンが得られるように
しなければならない。また、制御内容を音声によつて登録パターンと
して登録する際の音声と、音声命令として発する
音声の発声速度は必ずしも一致しない。このこと
は、ある単語を登録した後、その単語を再度同じ
ように発声しても単語長は異なることを意味す
る。この為、入力パターンと登録パターン間の類
似度を評価するに際しては、時間軸についても考
慮しなければ誤認識がなされる。また、話者が言葉を発声する際一単語の発声期
間であつても無声音となる期間がある。この無声
音部のデータというのは音声認識にあまり寄与し
ないばかりか、かえつて認識率を悪くすることが
多い。特に数字の「イチ」，「ハチ」等の場合は両
方とも中央に無音声部を含み、しかも音声が発声
される部分においても両者が違うところといえば
「イ」，「ハ」の部分だけである。このような場合
発声に際して無音声部の長さが変動すると「イ
チ」を「ハチ」と誤認識したり、逆に「ハチ」を
「イチ」と誤認識する可能性がある。本発明は上記問題のうち、特に無音声部の存在
に起因する音声の誤認識という問題を解決し得る
音声認識装置を提供することを目的とする。以下、図面を参照して本発明の実施例を詳細に
説明するが、まず本発明を理解しやすくする為に
現在考えられているパターンマツチング法に基づ
いた音声認識装置を第１図乃至第８図を用いて説
明し、次に第１図及び第９図乃至第１６図を用い
てこの発明の一実施例を説明する。入力音声の特徴を示す入力パターンとあらかじ
め所定の単語が登録されている登録パターンとの
類似度を判別して音声認識を行う、所謂、パター
ンマツチング法による音声認識においては、入力
及び登録パターンの特徴の抽出の仕方によつては
前述したような点に起因して誤認識率が左右され
る。そこで、本発明においては、入力パターンの
特徴の抽出にあたり、入力音声に対する振幅の正
規化、時間軸の正規化を行ない誤認識を防ぐべく
行うとともに、更には、両パターン間の類似度の
計算の簡素化を図つた音声認識装置が考案されて
いる。第１図は、このような音声認識装置を示す回路
ブロツク線図であり、発声による音圧振動をマイ
クロフオンで電気信号に変換し、更に前記音声の
周波数分布を平担化する機能を有する音声入力部
１、この音声入力部１により得られる電気信号に
変換された音声信号からその特徴を抽出する特徴
抽出部２、この特徴抽出部２により抽出された特
徴を記憶するとともにこれと入力パターンとの比
較の演算処理を行ない音声による制御命令を判別
する認識処理部３を有し、制御命令が認識された
ことを音声により応答する音声応答部４が必要に
よつては付加される。この音声応答部４は、応答
すべき言葉をパターンとして記憶してあるメモリ
部４０１、第２のＩ／Ｏ（入出力）ポート４０
２、制御部４０３、Ｄ／Ａ変換器４０４、ローパ
スフイルタ４０５を有しており、話者の音声指令
が認知されたことをテレビジヨン受像機４０６等
の被制御機器の音声回路から音声により応答す
る。音声入力部１において、入力音声は、ワイヤレ
スマイク１１によりFM波に変換した後FM受像
機１２で受信してプリアンプ１３に入力する形態
と、前記プリアンプ１３前段に設けたマイクロフ
オン１４によつて入力する形態のいずれかにより
システムにとり入れられる。これらいずれの形態
の場合においても、認識に必要な音声信号とそれ
以外の音響信号との比であるSN比は、主として
マイクロフオンの指向性に左右されるのでマイク
ロフオン１１，１４は単一指向性のものを用い
る。プリアンプ１３に得られる電気信号に変換さ
れた音声信号は、単音節明瞭度を向上するため高
音域をプリエンフアシス回路１５により強調す
る。このようにして、得られる音声入力部１の出力
は、特徴抽出部２に供給され、ここで入力及び登
録パターンの形成に必要な特徴データの抽出処理
がなされる。即ち、話者の音声波から時系列的に
周波数をとらえ、音声を周波数分析しこれらのデ
ータを一定時間間隔でサンプリングするととも
に、サンプリングされたアナログデータをＡ／Ｄ
変換器によりデジタル量に変換する。つまり、特
徴抽出部２の入力端には１６_１〜１６₁₅で示され
るスイツチド・キヤパシタ・バンドパスフイルタ
（以下BPFと称する。）が接続されている。この１
６_１〜１６₁₅のBPFの中心周波数は印加されるク
ロツクで決まり、その各々のフイルタ特性は６次
のチエビシエフ特性で略−36dB／OCTの減衰特
性を持つ、そして、前記BPF１６_1〜15により、
略200Hz〜6.4KHzの帯域を1/3オクターブ間隔で
15バンドに分離している。この15に分離されたバ
ンドの帯域成分の音声信号を通過させる１６_1〜1
_５のBPFの夫々には、略20m_sec間隔で信号をサン
プル・ホールドするサンプル・ホールド回路１７
_1〜15が接続されており、このサンプル・ホール
ド作用により到来する音声の特徴データが抽出さ
れる。このようにしてサンプル・ホールド回路１７₁
_〜１５に抽出された特徴データはアナログ量である
が、例えば８ビツトのＡ／Ｄ変換器（アナログ−
デジタル変換器）１８によつてデジタル量に変換
される。このとき、前記サンプル・ホールド回路
１７_1〜15と前記Ａ／Ｄ変換器１８間の切換制御
は、マルチプレクサ１９によつて行なわれる。従
つて、音声信号から抽出した、第２図に示す時間
−周波数−レベルの特性をデジタル化した量が前
記Ａ／Ｄ変換器１８に得られる。そして、この
Ａ／Ｄ変換器１８で抽出された音声の特徴データ
は、第１のＩ／Ｏ（入出力）ポート２０を介して
認識処理部３に供給される。認識処理部３は、制御内容、例えば受信するチ
ヤンネルの指定、電源のオンオフの制御を音声に
よつて指示する場合にその指令音声から抽出され
た音声の特徴を記憶させ登録するための登録パタ
ーンメモリ２１、話者が希望する制御内容を発声
した際にその指示音声の特徴を入力パターンとし
て一担記憶するための入力パターンメモリ２２、
この入力パターンメモリ２２の内容が前記登録パ
ターンメモリ２１に記憶された、いずれの登録パ
ターンと類似するかの判定を行うためのプログラ
ムを記憶するシステムプログラムメモリ２３、こ
のシステムプログラムの内容を実行するCPU
（中央処理装置）２４からなる。そして、この
CPU２４は例えば、８ビツトのマイクロプロセ
ツサが用いられ、前記システムプログラムメモリ
２３は、2Kバイトの容量をもつROMで構成さ
れ、前記入力パターンメモリ２２、登録パターン
メモリ２１は10Kバイトの容量をもつRAMによ
つて構成される。この10KバイトのRAMのうち
1.75Kバイトは入力パターンメモリ２２として、
略7.5Kバイトは登録パターンメモリ２１として
用いられる。このような構成の認識処理部３に、前記特徴抽
出部２で抽出されたデータが、入力パターンデー
タ、登録パターンデータとして送られる訳である
が、先ず登録パターンデータが送られる場合につ
いて述べる。登録パターンデータが認識処理部３の登録パタ
ーンメモリ２１に送られる場合は、前述の様に話
者が希望する制御内容を何通りか発声により音声
認識装置に登録する場合である。ここで、いま１
チヤンネルの選局を登録パターンメモリ２１に制
御内容として記憶させる場合についてみると、
「１チヤンネル」という音声の特徴データは前記
Ａ／Ｄ変換器１８にデイジタルデータとして抽出
される。そして、このデータは第１のＩ／Ｏポー
ト２０を介して登録パターンメモリ２１に送られ
るが、このとき前記入力パターンメモリ２２に次
に示される行列式〓の形で一旦収納される。ここで、行列式の行数はサンプル回数、即ち、
前記スイツチド・キヤパシタ・バンドパスフイル
タ１６の出力が略20m_secの間隔のサンプルパルス
に呼応してサンプルされる回数を示し、列数は
BPF１６の個数を示し、各成分はデジタル化され
た前記各BPFのサンプル値である。このようにし
て、抽出された話者の音声の特徴データは、未だ
音声の振幅情報に対する正規化なされていない。
つまり、話者のアクセントの位置或は強音によつ
て弱音の情報が後退することに対する処理、及び
同一の言葉を発声しても発声するたびに振幅が変
化することに対する処理が行なわれていないので
話者の音声の特徴を十分に表わしているとはいえ
ない。そこで、前記行列式の各行の成分に対する
加重を行う。即ち、前記〓で表わされる一旦、入
力パターンメモリ２２に収納されたデータに対し
てシステムプログラム２３に記憶された次に示す
演算をCPU２４によつて行ない演算結果の行列
式〓を前記登録パターンメモリ２１に登録パター
ンとして格納する。このようにして、音声情報のうちの振幅情報は
正規化される。この振幅の正規化は、話者が制御
内容として発声する音声に対してすべてなされた
うえで、前記登録パターンメモリ２１にその内容
（行列式）が記憶される。こうして、話者が発声
により、前記登録パターンメモリ２１に希望する
制御内容を登録することで、音声認識装置に対す
る制御内容のセツテングは終了し、制御内容の数
に等しい種類の登録パターン（〓_１，〓_２……〓
_o）が前記登録パターンメモリ２１に記憶され
る。上述のように、音声の特徴を示す行列式〓に対
する振幅の正規化を行う演算は、前記システムプ
ログラム２３に記憶されたプログラム内容に応じ
てCPU２４によつて実行されるが、その実行内
容を次に模式的に説明する。即ち、前述の第１図中の第１のＩ／Ｏポート２
０、システムプログラム２３、CPU２４の動作
は、次に示す第３図の機能動作に対応できる。つまり、第３図中のラツチ回路３０_1〜15（実
際には入力パターンメモリ２２において行列式〓
が一担収納される部分に相当する。）には、前記
行列式〓に相当するデータがラツチされ、ラツチ
された内容は加算器３１、及び乗算器３２に夫々
供給される。そして、この加算器３１の出力は、
レベル判定回路３３と除算器３４_1〜15に供給さ
れる。前記加算器３１は、前記行列式〓の各行成
分の要素を加算し、

The present invention relates to a speech recognition device based on the so-called pattern matching method, in which a voice command, that is, the time series of physical quantities extracted from the speaker's voice wave, is treated as a feature pattern and compared with previously registered patterns in order to recognize the command. More particularly, it relates to a speech recognition device whose means for judging the similarity between patterns includes means for preventing misrecognition of speech.

Speech recognition methods fall broadly into two classes. The first extracts features from the speech signal and directly computes the similarity between the resulting feature (input) pattern and previously registered patterns. The second extracts features from the speech signal, converts them into a phoneme sequence, and computes the similarity by comparing that sequence with a previously registered word dictionary (patterns). Because the second method identifies individual phonemes, it has the advantage when the vocabulary is large. When the vocabulary is not very large, however, pattern matching by the first method generally achieves a higher recognition rate.

In consumer equipment, an example of a pattern-matching speech recognition system with a vocabulary on the order of several tens of words is the voice control of a television receiver.
That is, words describing control operations of the television receiver, such as power control, volume control, and channel switching, are registered in the speech recognition device beforehand as speech, and the speech to be used as the recognition response is stored in a response device. A voice command is then compared with the registered control operations, and when a match is found the device replies by voice that the command has been recognized and carries out the corresponding control. In channel switching, for example, to select channel 1 the utterance "1 channel" is stored beforehand as a registered pattern; when the user faces the microphone that receives voice commands and says "1 channel", the device answers "OK" by voice and channel 1 is selected.

The problem is that when the command "1 channel" is spoken, the similar-sounding command "8 channel" is also registered as a control pattern (registered pattern). The Japanese words "ichi" (one) and "hachi" (eight) sound alike, so the question is how to prevent one from being recognized as the other. The confusion arises because the speech energy of the "chi" portion of both words is large, which makes "i" and "ha" hard to distinguish. In general, when a word contains an accented sound, the speech energy concentrates in that portion and the information in the remaining portions becomes hard to recognize. Speech recognition must therefore compare the feature (input) pattern with the registered pattern without losing the information carried by the parts of the command other than the strong sounds.

In addition, the amplitude changes every time a speaker utters the same word. Recognition must therefore yield the same pattern for the same word regardless of amplitude.

The speaking rate at registration time also does not necessarily match the speaking rate when the command is issued. After a word has been registered, saying it again in the same way still produces a different word length. When the similarity between the input pattern and the registered pattern is evaluated, the time axis must therefore be taken into account as well, or misrecognition results.
Furthermore, even within the utterance of a single word there are periods in which the sound is unvoiced. Data from these unvoiced (silent) portions not only contributes little to recognition but often lowers the recognition rate. The numerals "ichi" and "hachi" in particular both contain a silent portion in the middle, and even in the voiced portions they differ only in "i" versus "ha". If the length of the silent portion varies between utterances, "ichi" may be misrecognized as "hachi", or "hachi" as "ichi".

Of the problems described above, the object of the present invention is to provide a speech recognition device that solves in particular the misrecognition caused by the presence of silent portions.

Embodiments of the present invention are described in detail below with reference to the drawings. To make the invention easier to understand, a speech recognition device based on the pattern matching method as currently conceived is first described with reference to FIGS. 1 to 8, and an embodiment of the invention is then described with reference to FIG. 1 and FIGS. 9 to 16.

In speech recognition by the pattern matching method, which judges the similarity between an input pattern representing the features of the input speech and registered patterns in which predetermined words are registered, the misrecognition rate depends, for the reasons given above, on how the features of the input and registered patterns are extracted. Therefore, in the present invention, a speech recognition device has been devised that normalizes the amplitude and the time axis of the input speech when extracting the features of the input pattern, in order to prevent misrecognition, and that also simplifies the calculation of the similarity between the two patterns.

FIG. 1 is a circuit block diagram of such a speech recognition device. It comprises a speech input section 1, which converts the sound-pressure vibration of the utterance into an electrical signal with a microphone and flattens the frequency distribution of the speech; a feature extraction section 2, which extracts features from the speech signal delivered as an electrical signal by the speech input section 1; and a recognition processing section 3, which stores the features extracted by the feature extraction section 2 and compares them with the input pattern to identify the spoken control command. A voice response section 4, which reports by voice that the control command has been recognized, is added where required. The voice response section 4 comprises a memory section 401 that stores the response words as patterns, a second I/O (input/output) port 402, a control section 403, a D/A converter 404, and a low-pass filter 405, and it reports by voice, through the audio circuit of a controlled device such as a television receiver 406, that the speaker's voice command has been recognized.

In the speech input section 1, speech enters the system in one of two ways: it is converted into an FM signal by a wireless microphone 11, received by an FM receiver 12, and fed to a preamplifier 13, or it is picked up by a microphone 14 placed ahead of the preamplifier 13. In either case the S/N ratio, that is, the ratio of the speech signal needed for recognition to other acoustic signals, depends mainly on the directivity of the microphone, so unidirectional microphones are used for the microphones 11 and 14. The speech signal obtained at the preamplifier 13 has its high-frequency range emphasized by a pre-emphasis circuit 15 in order to improve monosyllable intelligibility.

The output of the speech input section 1 is supplied to the feature extraction section 2, which extracts the feature data needed to form the input and registered patterns. That is, the speaker's voice wave is frequency-analyzed as a time series, the resulting data are sampled at fixed time intervals, and the sampled analog data are converted into digital values by an A/D converter. Specifically, switched-capacitor band-pass filters (hereinafter BPFs) 16_1 to 16_15 are connected to the input of the feature extraction section 2. The center frequency of each of the BPFs 16_1 to 16_15 is determined by the applied clock, and each filter has a sixth-order Chebyshev characteristic with an attenuation slope of approximately -36 dB/oct.
The BPFs 16_1 to 16_15 divide the band from approximately 200 Hz to 6.4 kHz into 15 bands at 1/3-octave intervals. Each of the BPFs 16_1 to 16_15, which pass the speech signal components of these 15 bands, is connected to one of the sample-and-hold circuits 17_1 to 17_15, which sample and hold the signal at intervals of approximately 20 ms. The feature data of the incoming speech are extracted by this sample-and-hold action.

The feature data extracted by the sample-and-hold circuits 17_1 to 17_15 are analog values, and they are converted into digital values by an A/D converter (analog-to-digital converter) 18 of, for example, 8 bits. Switching between the sample-and-hold circuits 17_1 to 17_15 and the A/D converter 18 is performed by a multiplexer 19. The A/D converter 18 thus yields a digitized version of the time-frequency-level characteristic shown in FIG. 2, extracted from the speech signal. The feature data obtained at the A/D converter 18 are supplied to the recognition processing section 3 through a first I/O (input/output) port 20.

The recognition processing section 3 consists of a registered pattern memory 21, which stores and registers the speech features extracted from the command utterance for each control operation to be given by voice, such as selecting the channel to be received or switching the power on and off; an input pattern memory 22, which temporarily stores, as the input pattern, the features of the utterance when the speaker speaks the desired control operation; a system program memory 23, which stores the program that determines which of the registered patterns stored in the registered pattern memory 21 the contents of the input pattern memory 22 resemble; and a CPU (central processing unit) 24, which executes this system program. The CPU 24 is, for example, an 8-bit microprocessor. The system program memory 23 is a ROM with a capacity of 2 Kbytes, and the input pattern memory 22 and the registered pattern memory 21 are formed by a RAM with a capacity of 10 Kbytes, of which 1.75 Kbytes serve as the input pattern memory 22 and approximately 7.5 Kbytes as the registered pattern memory 21.

The data extracted by the feature extraction section 2 are sent to the recognition processing section 3 as either input pattern data or registered pattern data. The case of registered pattern data is described first.

Registered pattern data are sent to the registered pattern memory 21 when, as described above, the speaker registers the desired control operations in the speech recognition device by speaking each of them. Consider storing the selection of channel 1 in the registered pattern memory 21 as a control operation. The feature data of the utterance "1 channel" are obtained as digital data at the A/D converter 18. These data are sent to the registered pattern memory 21 through the first I/O port 20, but they are first stored temporarily in the input pattern memory 22 in the form of the matrix 〓 given next.

The number of rows of this matrix is the number of samples, that is, the number of times the outputs of the switched-capacitor band-pass filters 16 are sampled in response to sample pulses spaced approximately 20 ms apart. The number of columns is the number of BPFs 16, and each element is the digitized sample value of the corresponding BPF. The feature data of the speaker's voice extracted in this way have not yet been normalized with respect to the amplitude of the speech.
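As a rough check of the 15-band layout described above, the following sketch computes 1/3-octave band edges starting at 200 Hz. It assumes that 200 Hz is the lower edge of the first band and that a band center is the geometric mean of its two edges; the text gives only the approximate range and the number of bands, so the exact frequencies and the function names are assumptions.

```python
# Sketch: 1/3-octave band edges for a 15-band filter bank.
# Assumption: 200 Hz is the lower edge of band 1 (the text gives only
# "approximately 200 Hz to 6.4 kHz" and "15 bands").

F_LOW_HZ = 200.0
NUM_BANDS = 15


def third_octave_edges(f_low, num_bands):
    """Return num_bands + 1 edges, each 2**(1/3) times the previous one."""
    return [f_low * 2.0 ** (k / 3.0) for k in range(num_bands + 1)]


def band_centers(edges):
    """Geometric mean of adjacent edges (an assumed definition of 'center')."""
    return [(lo * hi) ** 0.5 for lo, hi in zip(edges, edges[1:])]


edges = third_octave_edges(F_LOW_HZ, NUM_BANDS)
centers = band_centers(edges)
```

Fifteen 1/3-octave steps span exactly five octaves, so under this assumption the upper edge comes out at 200 × 2^5 = 6400 Hz, which is consistent with the stated range.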
In other words, nothing has yet been done about the information in weak sounds being masked by the position of the speaker's accent or by strong sounds, or about the amplitude changing each time the same word is spoken, so the data cannot be said to represent the features of the speaker's voice adequately. The elements of each row of the matrix are therefore weighted. That is, the CPU 24 applies the following operation, stored in the system program memory 23, to the data 〓 temporarily held in the input pattern memory 22, and stores the resulting matrix 〓 in the registered pattern memory 21 as a registered pattern.

In this way the amplitude information in the speech is normalized. This amplitude normalization is applied to every utterance the speaker makes as a control operation before its contents (matrix) are stored in the registered pattern memory 21. Once the speaker has registered the desired control operations in the registered pattern memory 21 by speaking them, the setup of the control operations in the speech recognition device is complete, and as many registered patterns (〓_1, 〓_2, ..., 〓_o) as there are control operations are stored in the registered pattern memory 21.

As stated above, the operation that normalizes the amplitude of the matrix 〓 representing the speech features is executed by the CPU 24 according to the program stored in the system program memory 23. Its content is explained schematically below.

The operation of the first I/O port 20, the system program memory 23, and the CPU 24 in FIG. 1 can be represented by the functional blocks of FIG. 3. The latch circuits 30_1 to 30_15 in FIG. 3 (which in practice correspond to the part of the input pattern memory 22 where the matrix 〓 is temporarily stored) latch the data corresponding to the matrix 〓, and the latched contents are supplied to an adder 31 and to multipliers 32. The output of the adder 31 is supplied to a level determination circuit 33 and to dividers 34_1 to 34_15. The adder 31 adds the elements of each row of the matrix 〓,

【式】を算出するが、この夫々の総和値で前記ラツチ回路３０_1〜15に
ラツチされた成分要素の各々が除算器３４_1〜15
で除算される。ここで、除算器３４_1〜15の前段
に乗算器３２_1〜15が設けられておりＮなる乗算
を行うが、これは前記除算結果を整数部で評価す
るためのもので場合によつては省略し得る。ま
た、前記の除算器３４_1〜15で除算され振幅が正
規化されたデータは、バスラインを通して登録パ
ターンとして、登録パターンメモリ２１に登録さ
れる。また、前記レベル判定部３３には所定レベルの
閾値が設定されており、前記加算器３１の出力の
レベルが設定された閾値以下の時は、前記ラツチ
回路３５_1〜15のラツチされた内容をクリアし、
それ以外の時は前記両ラツチ回路を制御しない。
このように、ラツチ回路３５_1〜15に、前記加算
器３１の出力が一定値以上の時のみラツチ動作を
させることにより、検出する音声が小さい状態で
の雑音による誤動作が防止される。上述の第３図の説明から判る様に、話者が希望
する制御内容を登録パターンメモリ２１に登録す
る過程において、振幅が正規化される前の特徴デ
ータは、一旦、RAMで構成される入力パターン
メモリ２２に記憶されこの後に振幅が正規化さ
れ、特徴パターンとして登録パターンメモリ２１
に記憶される。次に、話者が登録した制御内容に対して、希望
する制御内容を音声により指示した場合について
述べる。話者が、登録した制御内容のうち、希望する制
御内容を発声し音声により指令をすると、音声の
特徴データは登録パターンの時と同様に振幅が正
規化され入力パターンメモリ２２に記憶される。
ここで、話者が音声指令した内容に対し、その振
幅に対する正規化を行なつた入力パターンは次に
示す行列式〓で示されるものとする。この振幅が正規化され入力パターンメモリ２２
に記憶される入力パターン〓は、既に制御内容と
して登録パターンメモリ２１に登録されている登
録パターンとの参照が行われる。この参照動作に
よる両パターン間の類似度の演算処理により、類
似度が一番近いパターンに対応する制御内容を話
者が指令した制御内容であると判定する。このような入力パターンと登録パターンの両パ
ターン間に類似度は、次に示されるパターン間の
距離Ｄを計算することにより判別される。即ち、
前記振幅が正規化された登録パターン〓と入力パ
ターン〓との各成分ｋ_1j，ｆ_1jの差の絶対値をと
ることにより得られる行列式を両パターン間の距
離を表わす行列式距離パターンＤと定義し、この
行列式Ｄの各成分の総和値によつて類似度を算出
する。このことを更に述べると、前記距離パター
ンＤは次式で示され、かつ類似度ｄは次のように
示される。上記、類似度ｄの計算は全登録パターン、いい
かえると全制御内容を表わすパターンに対して行
われ、類似度ｄの値が最つとも小さいパターンを
話者が音声によつて指令したパターンであると判
定する。このようにして音声認識が行われるが、
上述のように音声の振幅に対する正規化を行うこ
とで誤認識率は著しく低減される。話者の発声に
対する音声認識はこうして、登録パターンと入力
パターンの類似度が、前記システムプログラムメ
モリ２３に設定された類似度算出プログラムによ
つて指示される演算が前記CPU２４で実行され
ることにより算出され、音声認識による機器の制
御が可能となる。上述した音声のパターン・マツチング法による
音声認識では、振幅が正規化されることで単語中
の強音部分に比較して弱音部分の情報が小さい点
及び同じ単語でも発声のたびに振幅が変動しやす
い点に起因する音声の誤認識は低減される。しか
し話者が同一の単語を発声してもその発声時間が
常に一致するとは限らない。この問題を解決する
には時間軸についても正規化を行なうことが必要
であり、次にこの時間軸の正規化について説明す
る。時間軸の正規化は、話者の発音単語の発音開
始時刻と発音終了時刻との間にかかる時間を、常
に一定の定数ｎで分割することによりなされる。
つまり、話者がある単語を発声するにある時は時
間T₁かかり、またあるときには時間T₂を要した
場合、それぞれの場合、特徴抽出のためのサンプ
ル時間間隔をΔT₁＝Ｔ_１／ｎ，ΔT₂＝Ｔ_２／ｎとするこ
とで解決される。このことは、時間軸のずれに呼応し
て音声の特徴が生起する時刻がずれるという現象
に根拠をおく。従つて、話者の発声の開始時刻と
終了時刻は極力正確に検知する必要がある。前述
のように、入力パターン、登録パターンのいずれ
の場合においても話者の音声の特徴の抽出は、
BPF１６_1〜15、サンプル・ホールド回路１７_1〜1
₅の両者に依存するが、両回路はいずれもその動
作に時定数的な要素をもつ。とりわけ、サンプ
ル・ホールド回路のピーク検波方式は話者の発声
の終了時刻の検出を正しく行うのに大きく左右す
る。従つて、特徴抽出部２を構成するサンプル・
ホールド回路におけるピーク検波方式、及びサン
プリングのタイミングは話者の発声長を正確にと
らえた上で時間軸の正規化を行うのに重要な点と
なる。次に、時間軸の補正を適格にするに適した特徴
抽出部２の他の例について説明する。一般に話者がある単音を（第４図に示す音声
波形）発声すると、前記BPF１６_1〜15の出力に
は第４図に示すように、ピーク値間のピツチが
Ｐの波紋が得られる。このピツチＰは、例えば
「ア」という単音を発声した場合には約8m_secであ
るが、普通の音声ではこのピツチは５〜15m_sec以
内に入いる。このようなピツチＰを有する第４図
に示されるBPF１６_1〜15の出力は、夫々第４
図に示される様にピーク検波されるわけである
が、検波するときの時定数によつては第４図，
に示されるように発声の終了時刻を誤まつて検
出する。即ち、ピーク検波によるリツプルを少な
くするために時定数を大きくすると、検波出力は
第４図で判るように、時刻t₁で実際には発声が
終了しているにも拘らず、時刻t₂まで音声が継続
していると認識する。また、これに対して時定数
を小さくした場合には、検波波形にリツプルが生
じて正確な特徴パターン抽出が望めない。このこ
とは、時間軸の正規化と特徴パターンの抽出に影
響を与え誤つた音声認識を行う原因ともなる。そこで、近年、ピツチ周期より長い周期でピー
ク値検出を行う方法が考えられている。以下この
方法について図面を参照して説明する。第５図は、第１図に示した特徴抽出部２の他の
例を示す回路ブロツク線図であり、入力端子P₁に
音声入力部１（図示せず。）からの音声信号が
BPF４１_1〜oに供給される。そして、このBPF４
１_1〜oの各々の出力はダイオードＤ_1〜oと、ピー
ク検出機能を有するサンプル・ホールド回路４２
_1〜oを構成するMOSトランジスタＱ_1〜o及びピー
ク値をホールドするコンデンサＣ_1〜oによつてピ
ーク検波される。ピーク検波によつて検出された
ピーク値、即ち、音声の振幅データは前記コンデ
ンサＣ_1〜oに保持され、これらの振幅データは２
進−10進デコーダ４３とMOSトランジスタ
Q′_1〜oよりなるマルチプレクサ４４を介してＡ／
Ｄ変換器４５に供給される。ここで前記MOSト
ランジスタＱ_1〜oがオンのときは前記マルチプレ
クサ４４を構成するMOSトランジスタQ′_1〜oは
オフの状態であり、一方のトランジスタ群がオン
のときは他方のトランジスタ群がオフとなる様に
制御されている。このため、前記MOSトランジ
スタＱ_1〜oがオンのときコンデンサＣ_1〜oに保持
された音声の振幅データは、前記MOSトランジ
スタＱ_1〜oがオフのときにMOSトランジスタ
Q′_1〜oを介してＡ／Ｄ変換器４５に供給されたデ
ジタル量に変換される。前記ピーク値のサンプリ
ングは、前述したピツチＰの時間より長い時間Ｔ
で行なわれ、時間Ｔだけピーク値が保持されると
その後、トランジスタＴ_1〜o、抵抗Ｒ_1〜o，
R′_1〜oによつて構成されるリセツト回路４６によ
つて前記コンデンサＣ_1〜oの充電電荷は放電され
る。この放電時間後、再びピーク値の検出が開始
されこれを話者の発声の終了までくり返す。第６
図を用いてこのことを説明すると、第６図は
BPF４１_1〜oのうちの１つの出力を示し、同図
に示す時間Ｔのサンプリングパルスで音声のピー
ク値が検出されるとともにピーク値が保持され、
同図に示すリセツトパルスでコンデンサＣ_1〜o
の充電電荷は放電されるので、Ａ／Ｄ変換器４５
の入力には同図に示す波形が入力される。第６
図で判るように音声のピーク値は、前述のピツチ
Ｐよりも長い時間Ｔだけ保持され、しかも放電時
はリセツトパルス期間なので、放電による誤まつ
た検波出力の振幅データをＡ／Ｄ変換器４５に送
ることもない。次に前記のＴなる時間、ピーク値をサンプル保
持するためのサンプリングパルスを発生させる手
段及びリセツトパルスを発生させる手段について
第５，７，８図を用いて説明する。前記コンデン
サＣ_1〜oに音声のピーク値をサンプル保持するた
めのサンプリングパルスは、分周器４７とナンド
ゲート４８によつて得られる。即ち、分周器４７のクロツク端子CKには、第
７図のCKで示されるクロツクパルスが印加さ
れ、これを分周してQ₀〜Q₁に示される出力をナ
ンドゲート４８に印加することにより第７図中
で示すサンプリングパルスが得られる。このサン
プリングパルスが前記MOSトランジスタＱ_1〜o
の導通を制御することは前述の通りである。ま
た、第１のモノマルチ４９は前記サンプリングパ
ルスａの立ちさがりを検出してパルス（第７図
）を発生しフリツプフロツプ５０の出力を反転
する（第７図）。すると、ナンドゲート５１、
インバータ５２を介して第８図に示すクロツクパ
ルスCK′がｍビツトカウンタ５３に印加されこの
クロツクパルスCK′をカウントし始め前記マルチ
プレクサ４４を構成する２進−10進デコーダを順
次切替え、全てのスキヤンが終わると前記ｍビツ
トカウンタ５３の出力Ｑがインバータ５４を介し
て前記フリツプフロツプ５０にリセツトパルスと
して供給され、フリツプフロツプ５０の状態が再
び反転する。そして、これと同時に第２のモノマ
ルチ５５が前記トランジスタＴ_1〜oを導通させコ
ンデンサＣ_1〜oの充電電荷を放電させるリセツト
パルス（第７，８図、第６図ではに相当す
る。）を発生する。尚、分周器４７に接続された、イニシヤライズ
回路５７は、電源投入時に前記分周器４７をリセ
ツトするためのものでＲは抵抗、Ｄはダイオー
ド、Ｃはコンデンサである。また、前記Ａ／Ｄ変換器４５へのデータの読み
込みのタイミングは次のようにして第８図に示
すパルスを発生することにより行なわれる。前述
のように、サンプリングパルス（第７図）の立
ち上がりで、第１のモノマルチ４９はパルス（第
７，８図）を発生す。このパルスによりフリツ
プフロツプ５０の状態は反転し（第７，８図
）、ｍビツトカウンタ５３にはクロツクパルス
CK′（第８図）が印加される。このクロツクパ
ルス（第８図）の立ち下がりは第３のモノマル
チ５６で検出され、この第３のモノマルチ５６の
出力には第８図で示されるパルスが発生され
る。そして、このパルスが前記Ａ／Ｄ変換器４５
のデータ読み込みタイミングパルスとして用いら
れる。このように、近年、単音発声時にみられる前述
のピツチＰより大きい時間Ｔを音声の特徴抽出の
ためのサンプル時間とし、ピーク検波時において
リツプルによる音声認識時における誤つた特徴抽
出を防止するようにしている。また、話者の発声
終了時刻の判定に際しても、その誤差範囲を略前
記ピツチ長Ｐよりも少ない範囲とすることができ
るので、時間軸に対する正規化を行うにあたり誤
認識を低減できる。いいかえると、話者が同一の
単語を発生するに要する時間を発声のたびに異な
らせたとしても、このことによる音声の誤認識を
低減することができる。次にこの発明の特徴とする点について説明す
る。この発明は前述したような無音声部の存在に
起因する音声の誤認識を解決する為になされたも
のである。この無音声部の存在に起因する誤認識
について第９図乃至第１１図を用いて説明する。
第９図は第２図に示したような時間−周波数−レ
ベルの特性を時間経過に伴なう各周波数成分のレ
ベルの変化として示したものである。また第１０
図ａ，ｂはそれぞれ数字の「ハチ」，「イチ」の音
声データのうち１つのBPFの出力データを簡単に
描いたものである。ここでは、説明の便宜上「イ
チ」の無音声部を「ハチ」のそれよりも短かくし
て示す。今、第１０図ａ，ｂを登録パターンと
し、第１１図ａに示すような無音声部の短かい
「ハチ」という音声が入力されたとすると、この
入力パターンと先の登録パターンとを比較してみ
ると第１１図ｂ，ｃのようになる。第１１図ｂ，
ｃにおいて斜線部がパターン間の距離に相当し、
第１１図ｂは登録パターン「ハチ」に対する入力
パターン「ハチ」の距離を示し、第１１図ｃは登
録パターン「イチ」に対する入力パターン「ハ
チ」の距離を示す。この場合、図示の如く第１１
図ｂに示すものが同図ｃに示すものより距離が大
きいので、入力音声が「ハチ」にもかかわらず、
「イチ」と認識される。つまり、無音声部の長さ
が変化しただけで、誤認識あるいは不認識をまね
く虞れがある。このことは前述したような時間軸
の正規化を行なうことで少し軽減されるがそれで
も残る問題である。そこでこの発明は話者の発声期間中において所
定の閾値以下の入力音声を取り込まないようにし
て無音声部の長さの変動による誤認識、あるいは
不認識を防止するようにしたものである。その一
実施例を第１図及び第１２図を用いて説明する。
まず、制御内容を登録する場合について説明す
る。Ｉ／Ｏポート２０を介して各サンプリング期
間毎に認識処理部３の入力パターンメモリ２２に
供給される特徴データ（前述の行列式〓の各行成
分に相当する）は前述のように各行毎にその成分
が加算される。そしてこの加算結果が所定の閾値
より大きいと、その行は振幅が正規化された状態
で登録パターンメモリ２１に転送され、所定のア
ドレスに収納される。一方、前記加算結果が所定
の閾値より小さいと登録パターンメモリ２１への
転送は行なわれない。ところで先の例ではこのように加算結果が所定
の閾値より小さい行に対しても登録パターンメモ
リ２１のアドレスのカウントが行なわれるので、
加算結果が所定の閾値より小さい行も結果的には
登録パターンメモリの対応するアドレス“０”デ
ータとして記憶される。これに対し、この発明では加算結果が所定の閾
値より小さいような行に対しては登録パターンメ
モリ２１のアドレスのカウントが行なわれない。
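A minimal sketch of the behavior described in this passage: a row (one sample time) whose element sum is below the threshold is neither stored nor counted, so the memory address advances only for rows above the threshold and a silent stretch of any length disappears from the stored pattern. The earlier example, which still advances the address and stores such rows as zero data, is included for contrast. Function names, the threshold value, and the sample rows are assumptions, and the amplitude normalization of each stored row is omitted to keep the example short.

```python
# Sketch: drop low-level rows instead of storing them as zeros.
# Hypothetical names and values; each row holds the BPF outputs of one sample time.

def store_with_zero_rows(rows, threshold):
    """Earlier example: the address advances for every row; quiet rows become zeros."""
    return [row if sum(row) > threshold else [0] * len(row) for row in rows]


def store_skipping_silence(rows, threshold):
    """Behavior described here: the address advances only for rows above the threshold."""
    memory = []
    for row in rows:
        if sum(row) > threshold:
            memory.append(row)  # address counter advances by one
        # otherwise the counter is held and nothing is written
    return memory


voiced_a, voiced_b, silent = [40, 50], [60, 30], [1, 0]
short_gap = [voiced_a, silent, voiced_b]
long_gap = [voiced_a, silent, silent, silent, voiced_b]

stored_short = store_skipping_silence(short_gap, threshold=8)
stored_long = store_skipping_silence(long_gap, threshold=8)
```

With the skipping variant, both utterances produce the same stored rows, so the length of the silent portion no longer affects the distance between patterns.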
これを更に述べると、加算結果が所定の閾値より
小さくなると上述のようにアドレスのカウントは
行なわれず、加算結果が所定の閾値より小さくな
る行の直前の行に対応するアドレスが指定された
ままとなつている。この状態より加算結果が再び
所定の閾値より大きくなるとアドレスのカウント
が進み、この時の振幅の正規化された行成分は、
加算結果が所定の閾値より小さくなる行の直前の
行の振幅が正規化されたものが記憶されているア
ドレスのすぐ後のアドレスに記憶される。したが
つて第１２図ａに示すような無音声部を含む「ハ
チ」なる信号が入つてきても、同図ｂに唆す無音
声部の消去された信号に変換された状態で登録パ
ターンメモリ２１に登録される。また、同図ｃに
示すような無音声部を含む「イチ」なる信号が入
つてきても、同図ｄに示すような信号を変換され
た状態で登録される。登録された制御内容の中の所望の制御内容を話
者が発声により指令するような場合も同様で、あ
る行の成分の加算結果が所定の閾値より小さいよ
うな場合は、入力パターンメモリ２２のアドレス
のカウントは停止する。このような話者の発声期間中において、振幅レ
ベルが所定の閾値より小さくなるような部分が存
在するような場合は、この部分をデータとして登
録しないようにすることにより、第１１図を用い
て説明したような無音声部の長さの変動による誤
認識あるいは不認識を防止することができる。上述したような処理を行なう為のプログラムは
システムプログラムメモリ２３に記憶されてお
り、このプログラムに従つてCPU２４によつて
実行されるが、その実行内容を次に模式的に説明
する。即ち、先の第１図中の第１のＩ／Ｏポート２
０、システムプログラム２３、CPU２４の動作
は次の第１３図及び第１６図を用いて説明するよ
うな機能動作に対応できる。なお、第１３図に示
す音声認識装置において、時間軸の正規化を行な
う部分の回路は第５図に示す回路と略同じなので
同一符号を付し詳細な説明を省略する。一方、振
幅の正規化においては前述した例と異なる。即
ち、前述した振幅の正規化手段は各サンプル時点
毎に複数のBPFの出力のピーク値の総和を求め、
この総和でそのサンプル時点における各BPFの出
力のピーク値を除算して総和に対する各BPFのピ
ーク値の比率を求め、これを登録あるいは入力パ
ターンとしている。これに対し、第１３図に示す
装置では各サンプル時点毎に各BPFの出力のピー
ク値のうち最大のものを選択し、この最大ピーク
値でそのサンプル時点における各BPFのピーク値
を除算して最大ピーク値に対する各BPFの出力の
ピーク値の比率を求め、これを登録あるいは入力
パターンとしている。そこで、本発明の特徴に係る部分の説明に入る
前に、まず第１３図の装置における振幅の正規化
手段について第１４図及び第１５図を参照しなが
ら説明する。なお、第１４図においてCK，Q₀，Q₁，，
，として示す信号は先の第５図の動作説明に
おいて第７図にCK，Q₀，Q₁，，，として
示す信号と同一である。また第１５図CK′，，
，，，に示す信号も同じく先の第５図の
動作説明において第８図にCK′，，，，
，として示した信号と同一である。第１５図
にとして示す信号はＡ／Ｄ変換器４５の変換終
了信号を示す。すなわち、第１３図に示すコンデ
ンサＣ_1〜oに保持されている音声のピーク値を示
すデータは、２進−10進デコーダ４３の出力に応
じて順次対応するMOSトランジスタQ′_1〜oを介
してＡ／Ｄ変換器４５に供給され、モノマルチ５
６から発生されるデータ読み出しタイミングパル
ス（第１５図）のタイミングでデジタル量に変
換される。そしてＡ／Ｄ変換器４５は各コンデン
サＣ_1〜oに保持されているデータのＡ／Ｄ変換を
終了するたびに第１５図として示す変換終了信
号を発生する。また第１５図にとして示すモノ
マルチ５５の出力信号は第４のモノマルチ５７に
供給され、このモノマルチ５７によつてモノマル
チ５５の出力信号の立ち下がりに同期した第１５
図にとして示す信号が導出される。同様にモノ
マルチ５７の出力信号は第５のモノマルチ５８に
供給され第１４図及び第１５図にとして示す信
号が得られる。このモノマルチ５８の出力信号が
コンデンサＣ_1〜oの充電電荷を放電させる為のリ
セツトパルスとして使われる。前記Ａ／Ｄ変換器４５の出力はラツチ回路５９
_1〜oに供給される。このラツチ回路５９_1〜oの書
き込みタイミングはアンド回路６０_1〜oの出力に
よつて得られる。すなわち、このアンド回路６０
_1〜oの各一方の入力端には前述した変換終了信号
（第１５図）が供給され、他方の入力端には前
記２進−10進デコーダ４３の出力が各対応して供
給されている。したがつてアンド回路６０_1〜oは
変換終了信号のタイミングで順次ゲートを開き、
その出力は対応するラツチ回路５９_1〜oにクロツ
クパルスとして供給され、Ａ／Ｄ変換器４５の出
力は順次対応するラツチ回路５９_1〜oに書き込ま
れる。このラツチ回路５９_1〜oにラツチされたデ
ータ行列式〓の各行成分はそれぞれ乗算器６１₁
_〜ｏにてＮ倍され、除算器６２_1〜oに供給され
る。そして、この除算器６２_1〜oにおいて前記コ
ンデンサＣ_1〜oにサンプル保持された各サンプル
時における音声のピーク値のうち最大ピーク値を
デジタル量に変換したもので除算される。除算処
理を受けた除算器６２_1〜oの出力はラツチ回路６
３_1〜oに書き込まれる。この書き込みのタイミン
グ信号としては前記モノマルチ５５の出力信号
（第１５図）が使われる。一方、前記ラツチ回路５９_1〜oに書き込まれた
データは加算器６５に加算され、レベル判定回路
６６に供給される。このレベル判定回路６６は加
算器６５の出力（

[Equation] is thereby calculated, and each of the elements latched in the latch circuits 30_1 to 30_15 is divided by its row sum in the dividers 34_1 to 34_15. Multipliers 32_1 to 32_15 placed ahead of the dividers 34_1 to 34_15 multiply by a factor N. This allows the division result to be evaluated as an integer, and it may be omitted in some cases. The data divided by the dividers 34_1 to 34_15, whose amplitude is thereby normalized, are registered through the bus line in the registered pattern memory 21 as a registered pattern.

A threshold of a predetermined level is set in the level determination circuit 33. When the output level of the adder 31 is at or below this threshold, the latched contents of the latch circuits 35_1 to 35_15 are cleared; otherwise the two latch circuits are left as they are. Because the latch circuits 35_1 to 35_15 thus latch only when the output of the adder 31 is at or above a certain value, malfunction due to noise while the detected speech is weak is prevented.

As the explanation of FIG. 3 shows, when the speaker registers a desired control operation in the registered pattern memory 21, the feature data are first stored, before amplitude normalization, in the input pattern memory 22 formed by the RAM. The amplitude is then normalized, and the result is stored in the registered pattern memory 21 as a feature pattern.

Next, the case in which the speaker gives one of the registered control operations as a voice command is described. When the speaker utters the desired control operation, the amplitude of the speech feature data is normalized in the same way as for a registered pattern, and the data are stored in the input pattern memory 22.
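The row-sum normalization and level check described above can be sketched as follows. This is an illustration under stated assumptions, not the program held in the system program memory: the function name, the scale factor N = 100, and the threshold value are hypothetical, and integer division stands in for evaluating the quotient as an integer.

```python
# Sketch of per-row amplitude normalization (each row = one sample time).
# Hypothetical names and values: normalize_frame, scale_n=100, threshold=8.

def normalize_frame(frame, scale_n=100, threshold=8):
    """Normalize one row of BPF sample values.

    Each element is multiplied by scale_n and divided by the row sum, so the
    result is an integer share of the row total. A row whose sum is at or
    below the threshold is treated as noise and cleared to zeros.
    """
    total = sum(frame)
    if total <= threshold:
        return [0] * len(frame)
    return [(value * scale_n) // total for value in frame]


quiet = normalize_frame([10, 30, 60])
loud = normalize_frame([20, 60, 120])  # same spectral shape, twice the level
noise = normalize_frame([1, 2, 1])
```

The two rows with the same spectral shape give the same normalized values even though one is twice as loud, which is the property the amplitude normalization is meant to provide.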
Here, the input pattern obtained by normalizing the amplitude of the speaker's voice command is denoted by the matrix 〓 given next. The amplitude-normalized input pattern 〓 stored in the input pattern memory 22 is compared against the registered patterns already stored in the registered pattern memory 21 as control operations. By computing the similarity between the patterns in this comparison, the control operation corresponding to the most similar pattern is judged to be the one the speaker commanded.

The similarity between the input pattern and a registered pattern is judged by computing the distance D between the patterns, as follows. The matrix obtained by taking the absolute value of the difference between corresponding elements k_ij and f_ij of the amplitude-normalized registered pattern 〓 and the input pattern 〓 is defined as the distance pattern D between the two patterns, and the similarity is calculated as the sum of all elements of D. Written out, the distance pattern D and the similarity d are

$$D = \bigl(\lvert k_{ij} - f_{ij} \rvert\bigr), \qquad d = \sum_{i}\sum_{j} \lvert k_{ij} - f_{ij} \rvert$$

The similarity d is calculated for every registered pattern, that is, for the patterns representing all the control operations, and the pattern with the smallest value of d is judged to be the one the speaker commanded by voice. Speech recognition is performed in this way.
By normalizing the audio amplitude as described above, the misrecognition rate is significantly reduced. In this way, the speech recognition of the speaker's utterance is performed by calculating the degree of similarity between the registered pattern and the input pattern by the CPU 24 executing calculations instructed by the similarity calculation program set in the system program memory 23. This makes it possible to control devices using voice recognition. In speech recognition using the above-mentioned speech pattern matching method, the amplitude is normalized, so the information on the soft parts of the word is smaller than the strong parts of the word, and the amplitude fluctuates each time the same word is uttered. Misrecognition of speech due to easy-to-understand points is reduced. However, even if speakers utter the same word, the utterance times do not always match. To solve this problem, it is necessary to normalize the time axis as well, and next, this normalization of the time axis will be explained. The time axis is normalized by always dividing the time taken between the pronunciation start time and the pronunciation end time of a word pronounced by the speaker by a constant constant n.
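The distance calculation described above, in which d is the sum over all components of |k_ij - f_ij| and the registered pattern with the smallest d is selected, can be sketched as follows. The function names and the dictionary of labeled patterns are assumptions made for this illustration, and the pattern values are invented; the sketch also assumes that both patterns have the same number of rows and columns.

```python
def distance(registered, observed):
    """Sum of absolute component differences between two patterns.

    Both patterns are lists of rows (one row per sample time, one
    column per filter channel) and are assumed to have the same shape.
    """
    return sum(
        abs(k - f)
        for reg_row, obs_row in zip(registered, observed)
        for k, f in zip(reg_row, obs_row)
    )


def recognize(registered_patterns, observed):
    """Return the label of the registered pattern closest to `observed`."""
    return min(
        registered_patterns,
        key=lambda label: distance(registered_patterns[label], observed),
    )


# Invented example patterns: two sample times, two channels.
patterns = {
    "on": [[50, 50], [20, 80]],
    "off": [[90, 10], [10, 90]],
}
spoken = [[55, 45], [25, 75]]
result = recognize(patterns, spoken)
```

Note that a smaller d means a closer match, so the "similarity" of the text is really a dissimilarity measure.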
In other words, if a speaker takes time T1 on one occasion and time T2 on another to utter a certain word, the sample time interval for feature extraction is set to ΔT1 = T1/n and ΔT2 = T2/n, respectively. This relies on the fact that the times at which the features of the voice occur shift in proportion to the stretching of the time axis. It is therefore necessary to detect the start time and the end time of the speaker's utterance as accurately as possible.
As mentioned above, for both input patterns and registered patterns the speaker's voice features are extracted through the BPFs 16_1 to 16_15 and the sample-and-hold circuits 17_1 to 17_15, and both kinds of circuit involve time constants in their operation. In particular, the peak detection method of the sample-and-hold circuits strongly affects whether the end time of the utterance is detected correctly. The peak detection method and the sampling timing of the sample-and-hold circuits in the feature extraction section 2 are therefore important for capturing the utterance length accurately and normalizing the time axis.
Next, another example of the feature extraction section 2, suited to correcting the time axis, will be described. In general, when a speaker utters a single sound (the voice waveform shown in FIG. 4), ripples with a pitch P between peaks appear in the outputs of the BPFs 16_1 to 16_15, as shown in FIG. 4. For example, the pitch P is approximately 8 msec when a single sound such as "a" is uttered, and in normal speech it falls within 5 to 15 msec. The outputs of the BPFs 16_1 to 16_15 are peak-detected as shown in FIG. 4, but depending on the time constant used for detection, the end time of the utterance is detected incorrectly, as also shown in FIG. 4. That is, when the time constant is made large to reduce the ripple produced by peak detection, the detection output persists until time t2 even though the utterance actually ends at time t1, so the voice is taken to be continuing. If, on the other hand, the time constant is made small, ripples appear in the detected waveform and accurate extraction of the feature pattern cannot be expected. Either case affects the normalization of the time axis and the extraction of feature patterns, and leads to misrecognition.
For this reason, methods have recently been considered in which peak detection is performed over a period longer than the pitch period. This method is explained below with reference to the drawings. FIG. 5 is a circuit block diagram showing another example of the feature extraction section 2. The input voice signal is supplied to the BPFs 41_1 to 41_o. Each output of the BPFs 41_1 to 41_o is peak-detected by the diodes D_1 to D_o, the MOS transistors Q_1 to Q_o that form the sample-and-hold circuit 42 with a peak detection function, and the capacitors C_1 to C_o that hold the peak value. The detected peak values, that is, the voice amplitude data, are held in the capacitors C_1 to C_o, and these amplitude data are supplied to the A/D converter 45 through the multiplexer 44, which consists of the binary-to-decimal decoder 43 and the MOS transistors Q'_1 to Q'_o.
The two transistor groups are controlled so that when one is on the other is off: while the MOS transistors Q_1 to Q_o are on, the MOS transistors Q'_1 to Q'_o of the multiplexer 44 are off. The voice amplitude data held in the capacitors C_1 to C_o while Q_1 to Q_o were on is therefore supplied, once Q_1 to Q_o turn off, to the A/D converter 45 via Q'_1 to Q'_o and converted into a digital quantity. The peak value is sampled over a time T longer than the pitch P described above.
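As a rough discrete-time analogue of holding the peak over a time T longer than the pitch P and then starting afresh, the following sketch takes the peak magnitude of each consecutive block of samples. It is an illustration only: the patent describes an analog sample-and-hold circuit, whereas the function name, the digitized `samples`, and the block length `window` (standing in for T) are assumptions.

```python
def windowed_peaks(samples, window):
    """Peak magnitude of each consecutive block of `window` samples.

    Starting a new block discards the previous peak, which plays the
    role of discharging the hold capacitor before the next detection.
    """
    peaks = []
    for start in range(0, len(samples), window):
        block = samples[start:start + window]
        peaks.append(max(abs(s) for s in block))
    return peaks


# Invented waveform: two voiced blocks followed by silence.
waveform = [0, 3, -5, 2, 1, -1, 4, 0, 0, 0, 0, 0]
envelope = windowed_peaks(waveform, window=4)
```

Because each block is evaluated independently, the envelope drops to zero in the first block after the sound stops instead of decaying slowly, which is the property that matters for locating the end of the utterance.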
Once the peak value has been held for the time T, the charges in the capacitors C_1 to C_o are discharged by the reset circuit 46, which consists of the transistors T_1 to T_o and the resistors R_1 to R_o and R'_1 to R'_o. After this discharge period, peak detection starts again, and this is repeated until the speaker's utterance ends. To explain this with FIG. 6: FIG. 6 shows the output of one of the BPFs 41_1 to 41_o, and the peak value of the voice is detected and held by the sampling pulse of duration T shown in the figure. The reset pulse shown in the same figure then discharges the charge stored in the capacitors C_1 to C_o, so the waveform shown in the figure is applied to the input of the A/D converter 45. As FIG. 6 shows, the peak value of the voice is held over a time T longer than the pitch P, and because the discharge takes place during the reset pulse period, the erroneous amplitude data that the detection output takes on during the discharge is never passed to the A/D converter 45.
Next, the means for generating the sampling pulse that sample-holds the peak value for the time T, and the means for generating the reset pulse, will be explained using FIGS. 5, 7, and 8. The sampling pulse for sample-holding the peak value of the voice in the capacitors C_1 to C_o is obtained from the frequency divider 47 and the NAND gate 48. That is, the clock pulse shown as CK in FIG. 7 is applied to the clock terminal CK of the frequency divider 47, and the sampling pulse shown in FIG. 7 is obtained. As described above, this sampling pulse controls the conduction of the MOS transistors Q_1 to Q_o.
The first monostable multivibrator (monomulti) 49 detects the falling edge of the sampling pulse, generates a pulse (FIG. 7), and inverts the output of the flip-flop 50 (FIG. 7). The clock pulse CK' shown in FIG. 8 is then applied to the m-bit counter 53 via the NAND gate 51 and the inverter 52; the counter starts counting CK' and switches the binary-to-decimal decoder of the multiplexer 44 from channel to channel. When all channels have been scanned, the output Q of the m-bit counter 53 is supplied via the inverter 54 to the flip-flop 50 as a reset pulse, and the state of the flip-flop 50 is inverted again. At the same time, the second monomulti 55 generates a pulse (shown in FIGS. 6, 7, and 8) that turns on the transistors T_1 to T_o and discharges the capacitors C_1 to C_o. The initialization circuit 57 connected to the frequency divider 47 resets the divider when the power is turned on; in it, R is a resistor, D a diode, and C a capacitor.
The timing of reading data into the A/D converter 45 is determined by generating the pulses shown in FIG. 8 as follows. As described above, at the rising edge of the sampling pulse (FIG. 7), the first monomulti 49 generates a pulse (FIGS. 7 and 8). This pulse inverts the state of the flip-flop 50 (FIGS. 7 and 8), and the clock pulse CK' (FIG. 8) is applied to the m-bit counter 53. The falling edge of this clock pulse (FIG. 8) is detected by the third monomulti 56, which generates at its output the pulse shown in FIG. 8. This pulse is used as the data read timing pulse of the A/D converter 45.
In this way, a time T longer than the pitch P observed in a single sound has recently come to be used as the sample time for extracting speech features, in order to prevent erroneous feature extraction caused by ripple in peak detection. In addition, the error in determining the end time of the utterance can be kept to roughly less than the pitch length P, so misrecognition is reduced when the time axis is normalized. In other words, even if the time a speaker takes to utter the same word differs from utterance to utterance, the misrecognition this causes can be reduced.
Next, the features of this invention will be explained. This invention was made to eliminate the misrecognition of speech caused by the presence of silent portions, mentioned above. Misrecognition caused by such a silent portion will be explained using FIGS. 9 to 11.
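The time-axis normalization discussed above, in which an utterance of duration T is sampled at intervals of ΔT = T/n so that every utterance yields n frames, can be approximated in software by resampling an already captured frame sequence. This is a sketch under that assumption; in the text the sampling interval itself is adjusted, and the function name and the choice of taking the frame at the start of each interval are illustrative.

```python
def normalize_time_axis(frames, n):
    """Resample a frame sequence to exactly n frames.

    The utterance is divided into n equal intervals of length
    len(frames) / n and the frame at the start of each is kept.
    """
    total = len(frames)
    if total == 0:
        raise ValueError("empty utterance")
    return [frames[(i * total) // n] for i in range(n)]


# The same invented word spoken slowly (12 frames) and quickly (4 frames).
slow = normalize_time_axis([1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4], 4)
fast = normalize_time_axis([1, 2, 3, 4], 4)
```

Both utterances map to the same four frames, which is why the accuracy of the detected start and end times matters: an error in the total length shifts every resampled frame.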
FIG. 9 shows the time-frequency-level characteristics of FIG. 2 as the change over time of the level of each frequency component. FIGS. 10a and 10b are simplified depictions of the output data of one BPF for the voice data of the numerals "hachi" (eight) and "ichi" (one), respectively. For convenience of explanation, the silent portion of "ichi" is drawn shorter than that of "hachi".
Now suppose that FIGS. 10a and 10b are the registered patterns, and that a "hachi" with a short silent portion, as shown in FIG. 11a, is input. Comparing this input pattern with the registered patterns gives FIGS. 11b and 11c. In FIGS. 11b and 11c the shaded area corresponds to the distance between the patterns: FIG. 11b shows the distance of the input "hachi" from the registered "hachi", and FIG. 11c shows its distance from the registered "ichi". In this case the distance in FIG. 11b is larger than that in FIG. 11c, so the input is recognized as "ichi" even though the voice was "hachi". In other words, a mere change in the length of the silent portion can lead to misrecognition or to a failure to recognize. Normalizing the time axis as described above alleviates this somewhat, but the problem remains.
The present invention therefore does not capture input speech below a predetermined threshold during the speaker's utterance period, so as to prevent misrecognition or non-recognition caused by variation in the length of the silent portion. One embodiment is explained using FIG. 1 and FIG. 12.
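The effect described with FIGS. 10 and 11 can be reproduced with a toy single-channel example. All numbers below are invented for illustration and are not taken from the patent figures; time-axis normalization is deliberately left out to keep the example small.

```python
def distance(a, b):
    """Sum of absolute differences between two equal-length sequences."""
    return sum(abs(x - y) for x, y in zip(a, b))


# Invented one-channel patterns; zeros are the silent portion.
registered = {
    "hachi": [5, 5, 0, 0, 0, 0, 9, 9],   # registered with a long silent gap
    "ichi":  [4, 4, 0, 0, 9, 9, 0, 0],   # registered with a short silent gap
}
spoken_hachi = [5, 5, 0, 0, 9, 9, 0, 0]  # "hachi" uttered with a short gap

distances = {word: distance(pattern, spoken_hachi)
             for word, pattern in registered.items()}
best = min(distances, key=distances.get)
```

In this toy case the misaligned second syllable dominates the distance to the registered "hachi", so the input is matched to "ichi", which is the kind of error the invention is aimed at.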
First, the case of registering control contents is explained. The feature data (corresponding to each row of the matrix described above) supplied for each sampling period to the input pattern memory 22 of the recognition processing unit 3 via the I/O port 20 has its components summed row by row, as described above. If the sum is greater than a predetermined threshold, the row is amplitude-normalized, transferred to the registered pattern memory 21, and stored at a given address. If the sum is smaller than the threshold, the row is not transferred to the registered pattern memory 21.
In the previous example, the address of the registered pattern memory 21 was advanced even for rows whose sum was below the threshold, so such rows ended up stored as "0" data at the corresponding addresses of the registered pattern memory. In the present invention, by contrast, the address of the registered pattern memory 21 is not advanced for rows whose sum is below the threshold.
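The storage rule just described, in which the address advances only for rows whose sum exceeds the threshold, can be sketched as follows. The function name and the example rows are assumptions for illustration, amplitude normalization is omitted for brevity, and the handling of a sum exactly equal to the threshold (treated here as silent) is an assumption as well.

```python
def store_voiced_rows(rows, threshold):
    """Store only the rows whose component sum exceeds `threshold`.

    The list index plays the role of the memory address: it advances
    only when a row is stored, so silent rows leave no gap.
    """
    memory = []
    for row in rows:
        if sum(row) > threshold:
            memory.append(list(row))   # address advances by one
        # otherwise the row is dropped and the address stays put
    return memory


# Invented two-channel utterances differing only in silent-gap length.
long_gap = [[5, 5], [0, 0], [0, 0], [0, 0], [9, 9]]
short_gap = [[5, 5], [0, 1], [9, 9]]
stored_long = store_voiced_rows(long_gap, threshold=2)
stored_short = store_voiced_rows(short_gap, threshold=2)
```

Both utterances produce the same stored pattern, so the length of the silent portion no longer contributes to the distance between patterns.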
In more detail, when the sum falls below the predetermined threshold the address is not advanced, so the address of the last row before the sum fell below the threshold remains selected. When the sum next exceeds the threshold, the address advances, and the amplitude-normalized row at that moment is stored at the address immediately following the one that holds the last row before the sum fell below the threshold. Therefore, even if a "hachi" signal containing a silent portion is input, it is registered in the registered pattern memory 21 without that silent portion, and the same applies to an "ichi" signal containing a silent portion.
The same holds when the speaker commands one of the registered control contents by voice: if the sum of the components of a row is below the predetermined threshold, the address counting of the input pattern memory 22 stops. Thus, if the utterance period contains a portion whose amplitude level is below the predetermined threshold, not storing that portion as data prevents the misrecognition or non-recognition caused by variation in the length of the silent portion described above.
A program for performing this processing is stored in the system program memory 23 and executed by the CPU 24; its operation is outlined next. That is, the operations of the I/O port 20, the system program memory 23, and the CPU 24 in FIG. 1 correspond to the functional operations described below using FIGS. 13 and 16.
In the speech recognition device shown in FIG. 13, the circuit for normalizing the time axis is substantially the same as the circuit shown in FIG. 5, so the same reference numerals are used and a detailed explanation is omitted. The amplitude normalization, on the other hand, differs from the earlier example. The amplitude normalization means described earlier computes, at each sample time, the sum of the peak values of the outputs of the BPFs, divides the peak value of each BPF output at that sample time by this sum to obtain the ratio of each peak value to the total, and registers this ratio or uses it as the input pattern. In the apparatus of FIG. 13, by contrast, the maximum of the BPF output peak values is selected at each sample time, the peak value of each BPF at that sample time is divided by this maximum, and the resulting ratio of each peak value to the maximum is registered or used as the input pattern. Before describing the features of the present invention, therefore, the amplitude normalization means of the apparatus of FIG. 13 is described with reference to FIGS. 14 and 15.
In FIG. 14, the signals CK, Q0, Q1, and so on are the same as the signals shown as CK, Q0, Q1, and so on in FIG. 7 in the description of the operation of FIG. 5. Likewise, the signals CK' and so on in FIG. 15 are the same as those shown in FIG. 8 in that description. One of the signals in FIG. 15 is the conversion end signal of the A/D converter 45. That is, the data indicating the peak value of the voice held in the capacitors C_1 to C_o is supplied to the A/D converter 45 and converted into a digital quantity at the timing of the data read timing pulse (FIG. 15) generated by the monomulti 56. The A/D converter 45 generates the conversion end signal shown in FIG. 15 each time it completes A/D conversion of the data held in one of the capacitors C_1 to C_o. Further, a signal shown in the figure is derived from the output signal of the monomulti 55. Similarly, the output signal of the monomulti 57 is supplied to a fifth monomulti 58 to obtain the signals shown in FIGS. 14 and 15. The output signal of this monomulti 58 is used as the reset pulse that discharges the capacitors C_1 to C_o.
The output of the A/D converter 45 is supplied to the latch circuits 59_1 to 59_o. The write timing of the latch circuits 59_1 to 59_o is given by the outputs of the AND circuits 60_1 to 60_o. That is, the conversion end signal (FIG. 15) is supplied to one input terminal of each of the AND circuits 60_1 to 60_o, and the corresponding output of the binary-to-decimal decoder 43 is supplied to the other input terminal. The AND circuits 60_1 to 60_o therefore open their gates in turn at the timing of the conversion end signal; their outputs are supplied as clock pulses to the corresponding latch circuits 59_1 to 59_o, and the output of the A/D converter 45 is written into the latch circuits 59_1 to 59_o in sequence.
Each row component of the data matrix latched in the latch circuits 59_1 to 59_o is multiplied by N in the multipliers 61_1 to 61_o and supplied to the dividers 62_1 to 62_o. In the dividers 62_1 to 62_o, each value is divided by the digital value of the maximum of the voice peak values sampled and held in the capacitors C_1 to C_o at that sampling time. The outputs of the dividers 62_1 to 62_o are written into the latch circuits 63_1 to 63_o. The output signal of the monomulti 55 (FIG. 15) is used as the timing signal for this writing. Meanwhile, the data written in the latch circuits 59_1 to 59_o is summed by the adder 65 and supplied to the level determination circuit 66.

When the level of the output of the adder 65 (corresponding to [formula]) is at or below a predetermined threshold, the level determination circuit 66 clears the contents latched in the latch circuits 63_1 to 63_o. For this purpose, the output of the level determination circuit 66 and the output of the monomulti 67 (same timing as the signal shown in FIG. 15) are supplied through the AND circuit 68 to the clear terminals of the latch circuits 63_1 to 63_o, so that these latch circuits are cleared at the timing of the output of the monomulti 67. When the output level of the adder 65 is at or above the threshold, the contents latched in the latch circuits 63_1 to 63_o are not cleared and are stored, via the bus line and at the timing of the output of the address counter 84 described in detail later, in the input pattern memory 64 (in practice, the part of the input pattern memory that holds the input pattern).
The means for finding the maximum of the voice peak values in each sample period is explained next. This is done by a circuit consisting of the latch circuit 69, the latch circuit 70, the comparator 71, the NOR circuits 72 and 73, the data select gate 74, the inverter circuits 75 and 76, and the OR circuit 77. The n pieces of converted data from the A/D converter 45 are supplied in sequence to one or the other of the latch circuits 69 and 70. The contents X1 and X2 latched in the latch circuits 69 and 70 are compared in magnitude by the comparator 71, which has three outputs: one goes "H" if X1 < X2, another if X1 = X2, and the third if X1 > X2. The NOR circuit 72 receives one of these comparator outputs together with the conversion end signal of the A/D converter 45 inverted by the inverter circuit 76. The NOR circuit 73 receives the other two comparator outputs and the output of the inverter circuit 76. Depending on the state of the comparator outputs, one or the other of the NOR circuits 72 and 73 goes "H" and enables the converted data to be written into the corresponding latch circuit 70 or 69. In the data select gate 74, the data latched in one of the latch circuits 69 and 70 is selected according to an output of the comparator 71 and supplied to the dividers 62_1 to 62_o. That is, if this output is "H", data X2 is selected,
and if it is "L", data X1 is selected. If X1 = X2 either could be selected, but in this configuration data X1 is selected. The latch circuits 69 and 70 are cleared by a signal obtained by passing the output of the initialization circuit 57, inverted by the inverter circuit 75, and the output of the monomulti 58 through the OR circuit 77.
The output signal of the OR circuit 77 is shown in FIG. 15. Suppose, for example, that the power of the device is turned on at the timing of the first pulse of this signal. The OR circuit 77 then supplies that pulse to the latch circuits 69 and 70, and both are cleared. The comparator 71 therefore drives its X1 = X2 output "H" and its other two outputs "L". When the A/D converter 45 then operates, only the output of the NOR circuit 72 goes "H" at the timing of the first conversion end signal, and the first converted data is written into the latch circuit 70. As a result the X1 < X2 output of the comparator 71 goes "H" and the other two outputs go "L". At the timing of the second conversion end signal, therefore, the output of the NOR circuit 73 goes "H" and the second converted data is written into the latch circuit 69. Thereafter, in the same way, one of the three outputs of the comparator 71 goes "H" according to the magnitudes of the data X1 and X2 in the latch circuits 69 and 70, and the converted data is written into the latch circuit 70 or 69 that corresponds to whichever of the NOR circuits 72 and 73 has gone "H". In this way the larger of the data latched in the latch circuits 70 and 69 is supplied to the dividers 62_1 to 62_o via the data select gate 74. The dividers 62_1 to 62_o thus finally yield the data obtained by dividing each voice peak value of the first sample period by the maximum peak value of that period, and this data is written into the latch circuits 63_1 to 63_o. When the amplitude normalization of the peak values of the first sample period is completed, the latch circuits 69 and 70 are reset by the output signal of the monomulti 58 (FIGS. 11 and 12) and return to their initial state. The peak value data obtained in the second and subsequent sampling periods is amplitude-normalized in the same way.
Next, the inhibiting means that characterizes the present invention, which keeps portions of the input voice whose amplitude level is at or below a predetermined threshold from being captured, is explained with reference to the signal waveform diagram of FIG. 16. Some of the signals shown in FIG. 16 are the same as the corresponding signals shown in FIG. 15 above.
When the speaker inputs a voice to the device, the peak values converted into digital quantities are latched in the latch circuits 59_1 to 59_o at each sample time, as described above. If the sum of the latched peak values is greater than the threshold of the level determination circuit 66, the amplitude-normalized data is stored in the input pattern memory 64. Suppose now that the output of the adder circuit 65 has exceeded the threshold of the level determination circuit 66; as described above, the output of the level determination circuit 66 then goes low. This output of the level determination circuit 66 is supplied to the AND circuit 68 as described above, and is also supplied to the AND circuit 79 via the inverter circuit 78. The AND circuit 79 also receives the output signal of the monomulti 67. The signal shown in FIG. 16 is therefore obtained from the AND circuit 79 at the timing of the output signal of the monomulti 67. The output of the AND circuit 79 is supplied through the OR circuit 80 to the counter 81 as a reset pulse and resets the counter 81. The output of the AND circuit 68 is supplied to the AND circuit 83 via the inverter circuit 82. The AND circuit 83 also receives the output signal of the monomulti 58 (FIG. 16). The signal shown in FIG. 16 is therefore obtained at the output of the AND circuit 83 at the timing of the output signal of the monomulti 58. The output of the AND circuit 83 is supplied to the address counter 84 as a clock pulse. The count of the address counter 84 thus advances by one each time a clock pulse is supplied, causing the data latched in the latch circuits 63_1 to 63_o to be written to the designated address of the input pattern memory 64. Also,
the output of the AND circuit 83 is further supplied to the flip-flop 85. As shown in FIG. 16, the output of the flip-flop 85 is inverted by the very first output pulse of the AND circuit 83 and goes to the "L" level. This output of the flip-flop 85 is supplied to the NOR circuit 86.
When, from this state, the output level of the adder circuit 65 becomes smaller than the threshold of the level determination circuit 66, the signal shown in FIG. 16 is obtained at the output of the AND circuit 68. The latch circuits 63_1 to 63_o are thereby reset as described above. The output of the AND circuit 68 is also inverted by the inverter circuit 82 and supplied to the NOR circuit 86. The gate of the NOR circuit 86 is thus opened at the timing of the signal shown in FIG. 16, and its output is supplied to the counter 81 as a clock pulse. At this time no output is obtained from the AND circuit 83, so the address counter 84 of the input pattern memory 64 stops counting. The address of the input pattern memory 64 designated by the output of the address counter 84 therefore does not move and is held at the address in effect just before the output of the adder circuit 65 fell to or below the threshold of the level determination circuit 66.
If the output of the adder circuit 65 again exceeds the threshold of the level determination circuit 66 before, for example, the eighth clock pulse has been supplied, the counter 81 is reset by the output signal of the AND circuit 79 (FIG. 16). At this time an output signal (FIG. 16) is also obtained from the AND circuit 83, the count of the address counter 84 advances by one, the address following the one in effect just before the sum of the adder circuit 65 fell to or below the threshold is designated, and the data is stored there.
Conversely, if the sum of the adder circuit 65 has not exceeded the threshold of the level determination circuit 66 even after eight clock pulses have been supplied, the Q3 output of the counter 81 (FIG. 16) goes "H" at the timing of the eighth clock pulse. This Q3 output is supplied via the inverter circuit 87 to the flip-flop 50 to reset it, and is also supplied to the seventh monomulti 88. The monomulti 88 thereby derives the signal P1 shown in FIG. 16. This signal is used as the processing start signal. That is, when the processing start signal P1 is derived, the recognition processing unit 3 judges that the whole of the voice uttered by the speaker has been stored in the input pattern memory 64 as an input pattern, and compares the registered patterns with the input pattern. The part that performs this comparison is not shown in the figure. When the comparison is finished, the processing end signal shown in FIG. 16 is derived and resets the counter 81. The processing end signal is also supplied to the flip-flop 85 as a set pulse, putting the flip-flop 85 in the set state, that is, with its output "H", and is supplied through the OR circuit 89 to the address counter 84 and the input pattern memory 64 as a reset pulse, resetting them. Thereafter the above operations are repeated every time a voice is input.
When the device is powered on, the flip-flop 85 is set by the signal obtained by inverting the output of the initialization circuit 57 in the inverter circuit 75. At the same time, the counter 81 is reset by the signal obtained by passing the output of the inverter circuit 75 through the OR circuit 80, and the address counter 84 and the input pattern memory 64 are reset by the signal obtained by passing the output of the inverter circuit 75 through the OR circuit 89.
In this way, when the sum of the adder circuit 65 is smaller than the threshold of the level determination circuit 66, the address counter 84 stops counting, and the input pattern memory 64 remains at the address designated just before the sum fell below the threshold. If this state lasts for less than the period in which the counter 81 counts eight clock pulses (corresponding to eight samplings), the address counter 84 starts counting again as soon as the sum of the adder circuit 65 exceeds the threshold of the level determination circuit 66, and data is again taken into the input pattern memory. Conversely, the above state is the counter 81
If it lasts for a period shorter than the period in which eight clock pulses are counted by , it is assumed that all data have been input, and the input pattern and the registered pattern are compared. In the above description, the present invention has been described based on the case where the speaker commands the desired control content by vocalization with respect to the control content that has been registered in advance.
When registering a plurality of control contents, data are written to the registered pattern memory in the same manner. The device controlled by the speech recognition device according to the present invention is not limited to television receivers; the invention can be applied generally to systems that require remote control. As described above, the present invention provides a speech recognition device that solves the problem of misrecognition caused in particular by the presence of silent parts.
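The gating behaviour described above (store a sampling period only while the summed filter level exceeds the threshold, hold the address during short silences, and treat eight consecutive silent sampling periods as the end of the utterance) can be illustrated in software. The following Python sketch is only a model written for this explanation, not the circuit of the embodiment: the function name collect_input_pattern, the representation of a sampling period as a list of per-filter peak values, and the choice to treat a level exactly equal to the threshold as silent are all assumptions.

```python
from typing import Iterable, List, Sequence, Tuple

SILENCE_LIMIT = 8  # consecutive silent sampling periods that end an utterance


def collect_input_pattern(
    frames: Iterable[Sequence[int]],
    threshold: int,
    silence_limit: int = SILENCE_LIMIT,
) -> Tuple[List[List[int]], bool]:
    """Hypothetical software model of the silence gating.

    Each frame holds the per-filter peak values of one sampling period.
    Returns the stored pattern and a flag that is True when the utterance
    was ended by `silence_limit` consecutive silent frames.
    """
    pattern: List[List[int]] = []  # plays the role of input pattern memory 64
    silent_run = 0                 # plays the role of counter 81
    started = False                # models flip-flop 85 after the first stored frame

    for frame in frames:
        level = sum(frame)         # plays the role of adder circuit 65
        if level > threshold:      # comparison done by level determination circuit 66
            pattern.append(list(frame))  # the address advances only for voiced frames
            started = True
            silent_run = 0         # a short silence is forgotten once speech resumes
        elif started:
            silent_run += 1        # the address is held; only the silence count grows
            if silent_run >= silence_limit:
                return pattern, True   # corresponds to the processing start signal
        # Silent frames before the first voiced frame are ignored (assumption).
    return pattern, False
```

For example, a short gap inside a word is skipped without ending the pattern, so only the voiced sampling periods reach the comparison stage:

```python
frames = [[0, 0], [5, 4], [0, 1], [6, 3]] + [[0, 0]] * 8 + [[9, 9]]
pattern, finished = collect_input_pattern(frames, threshold=2)
# pattern holds [5, 4] and [6, 3]; the trailing [9, 9] is never read
```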

[Brief description of the drawings]

FIG. 1 is a circuit block diagram showing an example of a speech recognition device currently under consideration and an embodiment of the present invention; FIGS. 2 and 3 are a time-frequency-amplitude level characteristic diagram and a circuit block diagram used to explain the speech recognition device currently under consideration; FIG. 4 is a signal waveform diagram used to explain the detection characteristics of speech waves; FIG. 5 is a circuit block diagram showing another example of a speech recognition device currently under consideration; FIG. 6 is a signal waveform diagram used to explain the operation of FIG. 5; FIGS. 7 and 8 are timing charts used to explain the operation of FIG. 5; FIG. 9 is a time-frequency-level characteristic diagram used to explain the present invention; FIGS. 10 to 12 are schematic signal waveform diagrams used to explain the present invention; FIG. 13 is a circuit block diagram showing an example that schematically illustrates the functional operation of the present invention; and FIGS. 14 to 16 are timing charts for explaining the operation of FIG. 13.
21: registered pattern memory; 22: input pattern memory; 23: system program memory; 24: CPU; 55 to 58, 67, 88: monomulti; 68, 79, 83: AND circuit; 75, 78, 82, 87: inverter circuit; 80, 89: OR circuit; 81: counter; 84: address counter; 85: flip-flop; 86: NOR circuit.

Claims

[Claims]

1. A speech recognition device comprising at least: a plurality of filters, each extracting components in a different frequency band from the voice uttered by a speaker; peak value detection means that samples the outputs of the plurality of filters at the same timing, detects the peak value of the output of each filter during the sampling period, and holds that peak value for the sampling period, repeating this operation at predetermined time intervals during the utterance period of the speaker; and a recognition processing section having prohibition means for prohibiting, among the peak values of the outputs of the plurality of filters over the plurality of sampling periods in the peak value detection means, the peak values of a sampling period in which the peak value of ...; the speech recognition device being characterized in that only peak values of the outputs of the plurality of filters during a sampling period are used as data indicating the characteristics of the speech uttered by the speaker.