JPH0243893A

JPH0243893A - voice recognition device

Info

Publication number: JPH0243893A
Application number: JP63193746A
Authority: JP
Inventors: Junichiro Fujimoto; 潤一郎藤本
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 1988-08-03
Filing date: 1988-08-03
Publication date: 1990-02-14

Abstract

(57) [Abstract] This publication contains application data from before electronic filing, so no abstract data is recorded.

Description

[Detailed Description of the Invention]

Technical Field

The present invention relates to a speech recognition device.

Prior Art

In recent years, speech recognition devices have been actively developed, and both speaker-dependent and speaker-independent systems have been put into practical use. However, the recognition rate of such devices changes greatly when the usage environment changes.

For example, when speech is picked up with a reflector such as a conference table or blackboard placed near the microphone, the transmission frequency characteristic takes a comb shape with many dips, and its level characteristic varies greatly with the positional relationship among the speaker, the microphone, and the reflector. This is cited as one of the factors that lower the recognition rate when speech recognition is performed by DP matching or similar methods. The effect of the transmission frequency characteristic on recognition performance was examined in a recognition evaluation experiment using similar words, in which the distance between the speaker and the pickup microphone was varied and speech was recorded with and without a reflector. The results are reported in the Proceedings of the Acoustical Society of Japan, March 1988, pp. 269-270.
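The comb-shaped characteristic described above can be illustrated with a deliberately simple model. The sketch below assumes a single reflection that arrives with relative amplitude `reflection_gain` after travelling an extra path of `extra_path_m` metres; neither quantity, nor the speed of sound used, comes from the cited report. Under that assumption the response is H(f) = 1 + a·exp(-j2πfτ), with dips wherever the reflection arrives in anti-phase.

```python
import cmath
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed room-temperature value


def comb_magnitude_db(freq_hz, extra_path_m, reflection_gain):
    """Magnitude (dB) of the direct sound plus one delayed reflection."""
    tau = extra_path_m / SPEED_OF_SOUND  # extra delay of the reflection
    h = 1.0 + reflection_gain * cmath.exp(-2j * math.pi * freq_hz * tau)
    return 20.0 * math.log10(abs(h))


def dip_frequencies(extra_path_m, max_hz):
    """Frequencies up to max_hz where the reflection is in anti-phase."""
    tau = extra_path_m / SPEED_OF_SOUND
    dips = []
    k = 0
    while (2 * k + 1) / (2.0 * tau) <= max_hz:
        dips.append((2 * k + 1) / (2.0 * tau))
        k += 1
    return dips
```

Because the dip positions depend only on the extra path length, moving the speaker, microphone, or reflector shifts every dip at once, which is consistent with the strong dependence on geometry noted above.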

According to that report, speech was recorded simultaneously with three microphones placed at different distances from the speaker's mouth. With the height from the reflector to the sound source (the speaker) and the height from the reflector to the microphone each set to 30 cm, and the distance L from the sound source to the microphone set to 10, 50, and 90 cm, the following was reportedly found:

(1) With no reflector and no intra-speaker variation (the fluctuation in speech that occurs each time the speaker utters a word), the recognition rate shows almost no difference with microphone pickup distance.

(2) With intra-speaker variation, the recognition rate fluctuates by about 20%.

(3) With a reflector, the recognition rate falls as the distance between the speaker and the pickup microphone increases, and the drop is especially marked at L = 90 cm. The amount varies from speaker to speaker.

(4) Even with a reflector, the recognition rate at L = 50 cm lies within the range of the speaker's intra-speaker variation, but it is several percent lower than at L = 10 cm under the same conditions.

Thus the recognition rate differs depending on whether a reflector is nearby. This appears prominently when, for example, a speaker-independent recognition device is used in an automobile. With a speaker-dependent system the problem can be avoided to some extent by creating the standard patterns in the usage environment, but a speaker-independent system cannot take such a countermeasure because the environment in which it will be used is not known in advance. In a confined space such as the interior of an automobile, many surfaces reflect sound, so the effective frequency characteristic of the microphone changes and the recognition rate falls.

Purpose

The present invention was made in view of the above circumstances, and its object is to provide a recognition device whose recognition rate does not fall even in places with different acoustic characteristics, for example inside an automobile.

Configuration

To achieve the above object, the present invention is a speech recognition device having an acoustic/electric converter that converts speech into an electric signal, a filter group that analyzes the electric signal, and a pattern comparison section that compares the analyzed results, characterized by comprising: means for reproducing one or more sounds at the center frequencies of the filters; means for calculating and storing the sum or average of the filter outputs; means for obtaining the difference between the output value of each filter and the stored value; and means for increasing or decreasing the filter output according to the magnitude of that difference. The invention is described below on the basis of an embodiment.

Fig. 1 is a block diagram for explaining one embodiment of the present invention. In the figure, 1 is a ROM, 2 is a D/A conversion circuit, 3 is an amplifier, 4 is a speaker, 5 is a microphone, 6 is a microphone amplifier, 7 is a filter group, 8 is a subtraction section, 9 is a register, 10 is an adder, 11 is a bit shift section, 12 is a register, 13 is a difference calculation section, 14 is a switch, 15 is a comparison section, 16 is a speech dictionary, 17 is a maximum similarity calculation section, and 18 is a recognition result output section.

In the illustrated embodiment the filter group 7 has eight filters. Sound from the speech-input microphone 5 is amplified by the microphone amplifier 6 and analyzed by the filter group 7. The analysis results are rectified, quantized by an A/D converter (not shown), passed through the subtraction section 8, and stored in the register 9. Although not shown in the figure, the output of the microphone amplifier 6 is normally converted to a logarithm.

As shown in Fig. 2, the subtraction section 8 subtracts a set value from the output of each filter 7_1, 7_2, .... The thresholds to be subtracted, 8_1', 8_2', ..., are supplied to the respective subtraction units 8_1, 8_2, ..., and are initially set to 0. Next, the eight values stored in the register 9 are added by the adder 10, and the bit shift section 11 shifts the sum by three bits, which divides it by 8 and yields the average; this average is stored in the register 12.

The switch 14 is then set to side A, each of the values 1 to 8 in the register 9 is subtracted from the average in the register 12, and the results are written to the subtraction section 8 for the filter outputs, that is, to the thresholds 8_1', 8_2', ... in Fig. 2. The value Y_i set in threshold i is therefore expressed, using the eight register values X_i (i = 1 to 8) and their average \bar{X}, as

$$Y_i = \bar{X} - X_i \qquad (1)$$

Meanwhile, the ROM 1 stores a signal, encoded for example as PCM, in which sine waves at the center frequencies of the filters 7_1, 7_2, ... are added together. This signal is converted to analog, amplified, and reproduced from the electro-acoustic transducer (speaker) 4. The amplitude of each frequency component stored in the ROM 1 must be set so that the output levels are constant when reproduced. The adjustment described above is performed while this sound is being reproduced.
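The averaging and threshold-setting steps can be sketched in a few lines. This is a minimal illustration, not the circuit itself: the function names and the sample levels are hypothetical, and the sign convention in `correct` is an assumption. The text describes section 8 as subtracting a threshold, whereas the sketch adds Y_i (equivalently, subtracts X_i minus the average), which is the convention under which the calibrated output becomes flat as in Fig. 3(d).

```python
NUM_CHANNELS = 8  # number of band-pass filters in the embodiment


def calibrate(channel_levels):
    """Return (average, offsets) for eight quantized channel levels.

    The average mirrors adder 10 followed by bit shift section 11
    (sum >> 3); the offsets follow formula (1): Y_i = average - X_i.
    """
    if len(channel_levels) != NUM_CHANNELS:
        raise ValueError("expected 8 channel levels")
    total = sum(channel_levels)
    average = total >> 3  # shifting right by three bits divides by 8
    offsets = [average - x for x in channel_levels]
    return average, offsets


def correct(channel_levels, offsets):
    """Apply stored offsets (assumed sign: output = X_i + Y_i)."""
    return [x + y for x, y in zip(channel_levels, offsets)]


# Hypothetical log-domain levels measured while the test tone plays.
measured = [40, 52, 31, 47, 36, 50, 29, 43]
average, offsets = calibrate(measured)
flattened = correct(measured, offsets)
```

With these sample values the sum is 328, the shifted average is 41, and every corrected channel equals 41. Because the levels are logarithmic, adding an offset per channel corresponds to applying a per-channel gain to the original signal.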

Fig. 3 illustrates the adjustment. The horizontal axis is the channel number of each band-pass filter, which represents frequency, and the vertical axis is level. Panel (a) shows the characteristic in free space; the speech dictionary for speaker-independent recognition is created under this condition. When the recognition device is brought into a confined space such as an automobile, the characteristic becomes as in panel (b). The average level calculated from these eight points is shown by the broken line in the figure.

Subtracting each value in panel (b) from the average according to formula (1) gives panel (c), and these values are used as the thresholds in Fig. 2. After this adjustment the filter outputs become as in panel (d): the characteristic is corrected back to that of panel (a), which prevents the fall in recognition rate caused by use in a confined space. For recognition, the switch 14 is set to side B and recognition is performed with the frequency characteristic corrected. The figure shows a comparison section and a maximum similarity calculation section as the recognition part; these are needed regardless of the pattern matching method, and any method may be used, for example the one known as DP matching, which uses dynamic programming.
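The text names DP matching without specifying it, so the following is only a generic sketch of dynamic-programming alignment between a sequence of corrected channel-level frames and stored templates. The city-block frame distance, the three-way step pattern, the function names, and the use of minimum accumulated distance as a stand-in for maximum similarity are all assumptions made for illustration.

```python
def frame_distance(a, b):
    """City-block distance between two channel-level frames."""
    return sum(abs(x - y) for x, y in zip(a, b))


def dp_match(seq_a, seq_b):
    """Accumulated distance of the best monotonic alignment."""
    inf = float("inf")
    rows, cols = len(seq_a), len(seq_b)
    cost = [[inf] * (cols + 1) for _ in range(rows + 1)]
    cost[0][0] = 0.0
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            d = frame_distance(seq_a[i - 1], seq_b[j - 1])
            # Extend the cheapest of the three allowed predecessors.
            cost[i][j] = d + min(cost[i - 1][j],
                                 cost[i][j - 1],
                                 cost[i - 1][j - 1])
    return cost[rows][cols]


def recognize(utterance, dictionary):
    """Return the dictionary word whose template is closest."""
    return min(dictionary, key=lambda w: dp_match(utterance, dictionary[w]))
```

For example, with a hypothetical dictionary `{"a": [[0, 0], [10, 10]], "b": [[10, 10], [0, 0]]}`, the utterance `[[0, 0], [0, 0], [10, 10]]` aligns to template "a" at zero cost despite its extra frame, so `recognize` selects "a".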

Effects

As is clear from the above description, the present invention corrects for the influence of sound reflections even in a confined space such as a room and flattens the frequency characteristic of the speech input from the microphone, which improves the recognition rate.

[Brief Description of the Drawings]

Fig. 1 is a block diagram for explaining one embodiment of the present invention, Fig. 2 is a detailed view of the subtraction section shown in Fig. 1, and Fig. 3 is a diagram for explaining the operation of the present invention.

1: ROM, 2: D/A conversion circuit, 3: amplifier, 4: speaker, 5: microphone, 6: microphone amplifier, 7: filter group, 8: subtraction section, 9: register, 10: adder, 11: bit shift section, 12: register, 13: difference calculation section, 14: switch, 15: comparison section, 16: speech dictionary, 17: maximum similarity calculation section, 18: recognition result output section.

Claims

[Claims]

1. A speech recognition device having an acoustic/electric converter that converts speech into an electric signal, a filter group that analyzes the electric signal, and a pattern comparison section that compares the analyzed results, the speech recognition device comprising: means for reproducing one or more sounds at the center frequencies of the filters; means for calculating and storing the sum or average of the outputs of the filters; means for obtaining the difference between the output value of each filter and the stored value; and means for increasing or decreasing the filter output according to the magnitude of that difference.