JPH02212899A

JPH02212899A - Voice recognition system

Info

Publication number: JPH02212899A
Application number: JP1034768A
Authority: JP
Inventors: Junichiro Fujimoto; 潤一郎藤本
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 1989-02-13
Filing date: 1989-02-13
Publication date: 1990-08-24

Abstract

(57) [Abstract] This publication consists of application data from before electronic filing, so no abstract data is recorded.

Description

[Detailed Description of the Invention]

Technical Field

The present invention relates to a speech recognition system.

Prior Art

In so-called speaker-dependent speech recognition devices, which users operate after registering their own voices in advance, the recognition performance of the device depends on the quality of the standard patterns created through training. It is important that each word is registered without mispronunciation and as a proper pattern.

One approach, shown for example in Japanese Unexamined Patent Publication No. 59-20114i, is to register words while listening to one's own uttered voice. However, if a long word such as "Mstislav Rostropovich" is registered in abbreviated form as "Mstis" or as "Rostro", the user may later forget which form was registered and have to register the word again. Another approach, as in Japanese Unexamined Patent Publication No. 57-86979, is to keep a recording of the voice, but this requires a separate large memory and a recording and playback device. These problems would be solved if the sound uttered at registration could be reproduced from the speech recognition dictionary itself. However, the method of creating binarized standard patterns in order to compress the amount of standard pattern data (10th Symposium on Information Theory and Its Applications, Nov. 19-21, 1987, pp. 475-480) has the drawback that it does not retain enough information about the speech to reproduce it.

Purpose

The present invention was made in view of the circumstances described above. In particular, its purpose is to improve recognition accuracy by presenting the utterance made at registration, thereby reminding the user of how each word was registered.

Configuration

To achieve the above purpose, the present invention concerns a speech recognition system in which speech is frequency-analyzed and the result is binarized to form a standard pattern, and in which a classification of the frequency analysis result, based on how it is distributed between the high and low bands, is registered together with the standard pattern and used for recognition. The system is characterized in that it has a pulse sound source and a noise source, modulates the output of the sound source at the characteristic frequencies of the standard pattern, selects one of the two sound sources according to the data representing the frequency distribution at that time, and includes a section that converts the modulated wave into an audible wave and outputs it.

The invention is described below on the basis of an embodiment.

FIG. 1 is a block diagram for explaining the present invention. In the figure, 1 is an acoustic/electrical signal converter, 2 is a speech analysis section, 3 is a switch, 4 is a dictionary section, 5 is a recognition section, 6 is a reproduction section, 7 is an output, and 8 is an electrical signal/acoustic converter. Switch 3 selects between dictionary registration (side a) and recognition (side b).

This figure assumes a speaker-dependent system; it goes without saying that such a switch is unnecessary in a speaker-independent system.

FIG. 2 is a detailed view of the acoustic/electrical signal converter 1 and the speech analysis section 2 shown in FIG. 1. As is well known, a microphone is used as the acoustic/electrical signal converter 1, and its output is amplified by a microphone amplifier A and fed to filters F1 to F15. These are bandpass filters, 15 of them arranged in parallel, but the exact number and characteristics are not essential. It is normally sufficient for them to cover all of the range from about 150 Hz to 10,000 Hz, or only its important parts. The filter outputs are rectified by rectifiers R1 to R15, the least-squares line section 9 fits a least-squares line to the resulting values, and the binarization section 10 then binarizes them.

FIG. 3 illustrates this process. When the rectified filter outputs are arranged from the lowest frequency upward, the result looks like (a). Strictly speaking, (a) should show 15 discrete points, but they are drawn here as a continuous curve. The least-squares line L is drawn through this data. Subtracting the value of L at each channel from the corresponding output value gives a result like (b). Binarizing (b) by assigning "1" around each positive peak and "0" elsewhere yields a binary pattern like (c). A 16th value is then appended after the 15 values of (c): 1 when the slope of the least-squares line in (a) is negative, 0 when it is positive. This gives a set of 16 values.

Standard patterns are created from data obtained in this way and registered in the dictionary section. The creation of standard patterns is described in the reference cited above; one method is to add together the binarized patterns obtained from several utterances, but the invention is not limited to this. For recognition, the binarized data is transferred to the recognition section and matched against the patterns in the dictionary, and the most similar standard pattern is output as the recognition result. The matching method is not particularly limited and any method may be used, but computing similarity from the degree of overlap between the input and the dictionary pattern, as in the cited reference, is suitable.
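The binarization, pattern accumulation, and overlap matching described above can be sketched in Python roughly as follows. This is only an illustration under stated assumptions, not the circuit of the embodiment: the rule for deciding what counts as a peak, the assumption that utterances are already aligned to the same number of frames, and the overlap score itself are all assumptions, and every function name is hypothetical.

```python
def least_squares_line(values):
    """Fit y = a*k + b to the points (k, values[k]) and return (a, b)."""
    n = len(values)
    k_mean = (n - 1) / 2
    v_mean = sum(values) / n
    sxx = sum((k - k_mean) ** 2 for k in range(n))
    sxy = sum((k - k_mean) * (v - v_mean) for k, v in enumerate(values))
    a = sxy / sxx
    return a, v_mean - a * k_mean


def binarize_frame(values):
    """Turn rectified filter outputs (low to high frequency) into 0/1 values.

    Assumption: a channel counts as a peak when its residual above the
    fitted line is positive and not smaller than either neighbour. The
    text only says that "1" is assigned around the positive peaks.
    """
    a, b = least_squares_line(values)
    resid = [v - (a * k + b) for k, v in enumerate(values)]
    bits = []
    for k, r in enumerate(resid):
        left = resid[k - 1] if k > 0 else float("-inf")
        right = resid[k + 1] if k < len(resid) - 1 else float("-inf")
        bits.append(1 if r > 0 and r >= left and r >= right else 0)
    bits.append(1 if a < 0 else 0)  # extra value: 1 when the line slopes down
    return bits


def make_standard_pattern(utterances):
    """Add binarized frames from several utterances element by element.

    Assumption: every utterance has the same number of frames.
    """
    return [[sum(col) for col in zip(*frames)] for frames in zip(*utterances)]


def overlap_similarity(input_frames, standard_pattern):
    """Score how strongly the input's 1s fall on high counts in the pattern."""
    return sum(b * w
               for frame, ref in zip(input_frames, standard_pattern)
               for b, w in zip(frame, ref))
```

With a dictionary held as a mapping from word to standard pattern, recognition would then amount to picking the word whose pattern gives the highest `overlap_similarity` for the input frames.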

FIG. 4 shows the reproduction section in detail. Data arrives from the dictionary section 4 at fixed intervals, in sets of 16 values. The shorter the interval, the better the quality of the reproduced sound, but the amount of data grows, so an interval of the kind normally used for speech recognition, about 5 to 10 ms, is suitable. The 16th value is handled together with the others, as channels 1 to 16.

Switches S1 to S15 are on/off switches, each turned on or off according to whether the data of the corresponding channel 1 to 15 is 0. Switch S16 selects the sound source according to whether the channel 16 data is 0: if it is 0, the noise source 13 is connected to switches S1 to S15; otherwise the pulse sound source 12 is connected. In the description above the switches distinguish between 0 and any other value, but it is better to vary this threshold with the number of patterns combined into the standard pattern. For example, when the standard pattern is created by adding three patterns, the switches should preferably treat 0 to 1 as one state and 2 to 3 as the other.

The signal that passes through switches S1 to S15 is applied to filters F1 to F15. These filters should preferably be the same as the analysis filters in FIG. 2, and care must be taken that each channel's data is routed to the same filter that produced it at analysis time. If the filters are numbered in order of increasing center frequency, and the analysis results are likewise labeled channels 1 to 15 from the lowest frequency, then the data analyzed by filter n becomes the channel n data and is fed to filter n on reproduction. The outputs obtained in this way are summed by the adder 11 and, after amplification by amplifier A, drive the speaker. The period of the pulse sound source 12 should be close to the human pitch period; about 200 to 300 Hz is appropriate.
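The reproduction path described above (source selection by the 16th value, gating by channels 1 to 15, bandpass filtering, and summation) might be sketched in Python as below. The two-pole resonators standing in for filters F1 to F15, their pole radius, the log-spaced center frequencies, the sample rate, and all function and parameter names are assumptions made for illustration; the source does not specify any of them.

```python
import math
import random


def center_frequencies(n=15, low_hz=200.0, high_hz=5000.0):
    """Hypothetical log-spaced centre frequencies (not given in the source)."""
    ratio = (high_hz / low_hz) ** (1.0 / (n - 1))
    return [low_hz * ratio ** i for i in range(n)]


def resynthesize(pattern, sample_rate=16000, frame_ms=10, pitch_hz=250.0,
                 on_threshold=1, seed=0):
    """Rebuild a rough waveform from 16-value frames (15 bands + source flag).

    on_threshold is the smallest value treated as "on": 1 for plain binary
    patterns, 2 when three patterns were added, following the text.
    """
    rng = random.Random(seed)
    freqs = center_frequencies(len(pattern[0]) - 1)
    r = 0.97  # assumed resonator pole radius (sets the bandwidth)
    coeffs = [(2 * r * math.cos(2 * math.pi * f / sample_rate), -r * r)
              for f in freqs]
    state = [[0.0, 0.0] for _ in freqs]  # last two outputs of each filter
    frame_len = sample_rate * frame_ms // 1000
    pitch_period = round(sample_rate / pitch_hz)
    out = []
    t = 0
    for frame in pattern:
        use_pulse = frame[-1] >= on_threshold            # source-select switch
        gates = [v >= on_threshold for v in frame[:-1]]  # switches S1 to S15
        for _ in range(frame_len):
            if use_pulse:
                x = 1.0 if t % pitch_period == 0 else 0.0  # pulse source 12
            else:
                x = rng.uniform(-1.0, 1.0)                 # noise source 13
            total = 0.0
            for ch, (a1, a2) in enumerate(coeffs):
                y1, y2 = state[ch]
                y = (x if gates[ch] else 0.0) + a1 * y1 + a2 * y2
                state[ch] = [y, y1]
                total += y
            out.append(total)  # adder 11
            t += 1
    return out
```

Calling `resynthesize(standard_pattern, on_threshold=2)` would correspond to the case of a standard pattern built by adding three binarized patterns. Filter state is carried across frames so that the output does not click at frame boundaries, which is a design choice of this sketch rather than something stated in the text.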

With this configuration, speech can be reproduced from the binarized data used for speech recognition, making it possible to hear what was uttered and how.

Effect

As is clear from the above description, the present invention makes the standard patterns in the dictionary audible. Not only can the user recall the utterance made at registration, but if a standard pattern was registered with unwanted sound attached, the user can hear it. As a result, the recognition accuracy of the device can be improved.

[Brief Description of the Drawings]

FIG. 1 is a block diagram for explaining the present invention, FIG. 2 is a detailed view of the acoustic/electrical signal converter 1 and the speech analysis section 2 shown in FIG. 1, FIG. 3 is a diagram for explaining an example of binarization, and FIG. 4 is a detailed view of the reproduction section 6 shown in FIG. 1.

1... acoustic/electrical signal converter, 2... speech analysis section, 3... switch, 4... dictionary section, 5... recognition section, 6... reproduction section, 7... output, 8... electrical signal/acoustic converter, 9... least-squares line section, 10... binarization section, 11... adder section, 12... pulse sound source, 13... noise source.

Claims

1. A speech recognition system in which speech is frequency-analyzed and the result is binarized to form a standard pattern, and in which a classification of the frequency analysis result, based on how it is distributed between the high and low bands, is registered together with the standard pattern and used for recognition, the system being characterized in that it has a pulse sound source and a noise source, modulates the output of the sound source at the characteristic frequencies of the standard pattern, selects one of the two sound sources according to the data representing the frequency distribution at that time, and comprises a section that converts the modulated wave into an audible wave and outputs it.