JPH0160160B2

Info

Publication number: JPH0160160B2
Application number: JP56174115A
Authority: JP
Inventors: Kazunaga Yoshida
Original assignee: Nippon Electric Co Ltd
Current assignee: NEC Corp
Priority date: 1981-10-30
Filing date: 1981-10-30
Publication date: 1989-12-21
Also published as: JPS5876894A

Description

[Detailed description of the invention]

The present invention relates to an improvement in a speech recognition device that recognizes monosyllables, such as those corresponding to Japanese kana characters, uttered in isolation.

Conventionally, pattern matching has been used to recognize monosyllables uttered in isolation. A typical approach is as follows. Most Japanese monosyllables have the form consonant + vowel, so the two parts are recognized separately. First, the consonant part and the vowel part are cut out of the speech pattern obtained by analyzing the input speech. The speech pattern of the vowel part, that is, the vowel pattern, is pattern-matched against pre-registered vowel standard patterns to determine the vowel category.

Next, the consonant pattern is pattern-matched against pre-registered consonant standard patterns to determine the consonant category, and the monosyllable recognition result is then determined. Because the vowel part generally yields a stable pattern, the method of cutting it out and the method of pattern matching pose little difficulty. For the consonant pattern, on the other hand, there are many variations of both the cutting-out method and the pattern matching method.

For example, one conceivable method is to cut out a fixed number of frames of the speech pattern from the start of the speech and use them as the consonant pattern. Because this method compares patterns that have the same fixed number of frames, there is no need to expand or contract the time axis, so recognition requires only a small amount of computation.
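The fixed-frame-count comparison described above can be sketched as follows. This is only an illustration under assumptions not stated in the patent: each frame is a feature vector given as a tuple of floats, the frame-to-frame distance is Euclidean, and the name `fixed_length_distance` is hypothetical.

```python
from math import dist


def fixed_length_distance(input_frames, reference_frames, n_frames):
    """Sum of frame-wise distances over the first n_frames frames.

    Assumption: frames are tuples of floats and the distance is Euclidean;
    the patent does not specify the distance measure.
    """
    if len(input_frames) < n_frames or len(reference_frames) < n_frames:
        raise ValueError("both patterns must contain at least n_frames frames")
    # Frames at the same position are compared, so no time warping is needed.
    return sum(
        dist(a, b)
        for a, b in zip(input_frames[:n_frames], reference_frames[:n_frames])
    )
```

The cost grows linearly with `n_frames`, which is why this method needs little computation, but the same `n_frames` is applied to every consonant regardless of its actual duration.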

However, the duration of a consonant can differ greatly from one utterance to another and from one category to another. If the number of frames is set to match consonants of long duration, a large amount of memory is needed to store the standard patterns. In addition, when consonants of short duration are compared with each other, the comparison is made over a long span of frames, so their distinguishing features may be diluted. Conversely, if the number of frames is set to match consonants of short duration, consonants of long duration become difficult to recognize.

A second conceivable method is to cut out the consonant part according to the length of the consonant, and to match the consonant pattern against the consonant standard pattern while nonlinearly expanding and contracting the time axis by dynamic programming. With this method, patterns of different lengths can be matched in fine detail.
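For reference, matching with a nonlinearly warped time axis can be sketched with a textbook dynamic-programming recurrence. The patent does not specify which step pattern or distance measure is meant, so the symmetric three-way step, the Euclidean frame distance, and the name `dtw_distance` below are all assumptions made for illustration.

```python
from math import dist, inf


def dtw_distance(a, b):
    """Accumulated distance of the best monotonic alignment of a and b.

    a and b are lists of frame vectors (tuples of floats).
    """
    n, m = len(a), len(b)
    # cost[i][j]: best accumulated distance aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(a[i - 1], b[j - 1])
            # Extend the cheapest of the three allowed predecessor alignments.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]
```

The nested loop evaluates a frame distance for every pair of frames, so the work grows with the product of the two pattern lengths rather than with their maximum.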

However, dynamic programming requires a considerable amount of computation. Moreover, unlike word recognition, monosyllable recognition is thought to gain little from nonlinear expansion and contraction of the time axis, so this is not necessarily the best method.

An object of the present invention is to recognize monosyllables with a small amount of standard pattern memory and a small amount of computation while obtaining high recognition performance. To achieve this object, the monosyllable recognition device according to the present invention comprises: an analysis unit that converts a monosyllable uttered in isolation into a speech pattern; a speech extraction unit that cuts out the consonant part and the vowel part from the speech pattern to form a consonant pattern and a vowel pattern; a standard pattern memory unit that stores the consonant patterns and vowel patterns of monosyllables uttered in advance as consonant standard patterns and vowel standard patterns, respectively; a vowel matching unit that matches the input vowel pattern against the vowel standard patterns to determine the vowel category; and a consonant matching unit that, when matching the input consonant pattern against a consonant standard pattern whose time length differs from it, appends a vowel pattern after the shorter of the two consonant patterns before matching.

An embodiment of the present invention will be described below with reference to the drawings. FIG. 1 is a block diagram showing the overall configuration of an embodiment of the present invention. Speech input from the microphone 1 is analyzed by the analysis unit 2 and output as a speech pattern P. The speech extraction unit 3 cuts out a consonant pattern C and a vowel pattern V from the speech pattern P. When standard patterns are registered, these patterns are stored in the consonant standard pattern memory 4 and the vowel standard pattern memory 5, respectively. At recognition time, the vowel matching unit 6 first matches the vowel pattern V against the vowel standard patterns VR. For this matching it is sufficient to compute the distance between vowel patterns. A vowel category such as /a/, /i/, /u/, /e/, or /o/ is thereby recognized, and the result is output as the vowel category VC. The consonant matching unit 7 matches the consonant part and outputs the recognition result as a consonant category CC such as /k/ or /s/.

The operating principle of the consonant matching unit 7 will now be explained. FIG. 2 is a conceptual diagram for explaining an example of the operation of the consonant matching unit 7. Suppose that the distance between the input consonant pattern 11 (a 5-frame pattern in the figure) and the consonant standard pattern 12 (an 8-frame pattern in the figure) is to be obtained. Each speech pattern is a time series of vectors, with time running to the right. From the start of the speech up to the fifth frame, frames at the same position in the input consonant pattern and the consonant standard pattern are compared, as shown by the arrows 13, to obtain the distance. After the fifth frame, the remaining frames of the consonant standard pattern 12 are each paired with the one-frame vowel pattern 15 of the input speech, as shown by the arrows 14, to obtain the distance.

The rear part of a consonant pattern is normally almost continuous with the vowel pattern, and the vowel pattern is almost constant over time. Therefore, continuing the shorter consonant pattern with the vowel pattern reproduces the original pattern.
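The matching principle of FIG. 2 can be sketched in a few lines. This is a minimal illustration, not the patented circuit: frames are assumed to be tuples of floats, the distance is assumed to be Euclidean (the patent does not name a distance measure), the vowel is represented by a single frame as in the figure, and the function name `padded_consonant_distance` is hypothetical.

```python
from math import dist


def padded_consonant_distance(in_cons, in_vowel, ref_cons, ref_vowel):
    """Distance between two consonant patterns of possibly different length.

    in_cons, ref_cons: lists of frame vectors (tuples of floats).
    in_vowel, ref_vowel: one vowel frame each, used to continue whichever
    consonant pattern runs out first.
    """
    total = 0.0
    for k in range(max(len(in_cons), len(ref_cons))):
        # Past the end of a consonant pattern, substitute its vowel frame.
        a = in_cons[k] if k < len(in_cons) else in_vowel
        b = ref_cons[k] if k < len(ref_cons) else ref_vowel
        total += dist(a, b)
    return total
```

With a 5-frame input and an 8-frame standard pattern, frames 1 to 5 are compared position by position and frames 6 to 8 of the standard pattern are compared with the input vowel frame, so the number of frame distances equals the longer length rather than the product of the two lengths.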

With this method, patterns of different lengths can be matched easily. Furthermore, since the consonant and vowel parts of each standard pattern are stored separately and only to the extent needed, the amount of standard pattern memory is small. The example shown here is the case where the input consonant pattern is shorter than the consonant standard pattern; in the opposite case, the same procedure is carried out with the roles of the two patterns interchanged.

FIG. 3 is a block diagram showing a specific example of the circuit configuration of the consonant matching unit 7. The consonant pattern C of the input speech is held in the consonant buffer 21, and the vowel pattern V is held in the vowel buffer 22. The consonant standard pattern CR is held in the consonant standard pattern buffer 23, and the vowel standard pattern VR is held in the vowel standard pattern buffer 24. The input pattern frame counter 25 outputs a frame address IA to the consonant buffer 21, and the consonant pattern CF of the addressed frame is output from the consonant buffer 21. Similarly, the standard pattern frame counter 26 outputs a frame address RA to the consonant standard pattern buffer 23, and the consonant standard pattern CRF of the addressed frame is output from the consonant standard pattern buffer 23. The input pattern frame counter 25 and the standard pattern frame counter 26 count up from 1 in synchronization, and each counter stops when its value reaches the number of frames of its own pattern. While the frame counters continue counting, the data selectors 27 and 28 select the consonant pattern CF and the consonant standard pattern CRF and output them to the distance calculation unit 29. The distance calculation unit 29 calculates the distance between the two patterns for each frame. The accumulator 30 sums these per-frame distances over one pattern.

The case where the consonant pattern has fewer frames than the consonant standard pattern will now be described. When the value of the input pattern frame counter 25 reaches the number of frames of the consonant pattern and the counter stops, the data selector 27 selects the vowel frame data VF from the vowel buffer 22 and outputs it to the distance calculation unit 29. When the value of the standard pattern frame counter 26 reaches the number of frames of the consonant standard pattern, the distance between the two patterns has been obtained, and the accumulated distance d is output from the accumulator 30. The above is the case where the input consonant pattern has fewer frames; when the consonant standard pattern has fewer frames, the circuit operates in the same way, with the vowel standard pattern VRF appended after the consonant standard pattern CRF. The accumulated distance d is input to the minimum value calculation unit 31 for each standard pattern category. The minimum value calculation unit 31 finds the minimum over all consonant categories, and the consonant category CC that gives the minimum is output as the recognition result.
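The behaviour of the FIG. 3 circuit can be imitated in software roughly as follows. This is a sketch under assumptions: the step-by-step loop stands in for the synchronized counters 25 and 26, the distance is assumed to be Euclidean, the assignment of data selector 28 to the standard pattern side is inferred from the text, and the function names and the dictionary layout of `references` are hypothetical.

```python
from math import dist


def accumulate_distance(cons, vowel, ref_cons, ref_vowel):
    """Accumulated distance d for one standard pattern (accumulator 30)."""
    n_in, n_ref = len(cons), len(ref_cons)
    acc = 0.0
    for step in range(1, max(n_in, n_ref) + 1):
        ia = min(step, n_in)    # counter 25 stops at the input frame count
        ra = min(step, n_ref)   # counter 26 stops at the standard frame count
        # Selectors pass consonant frames while the counter is still counting,
        # and the vowel frame once it has stopped.
        cf = cons[ia - 1] if step <= n_in else vowel             # selector 27
        crf = ref_cons[ra - 1] if step <= n_ref else ref_vowel   # selector 28
        acc += dist(cf, crf)    # distance calculation unit 29
    return acc


def recognize_consonant(cons, vowel, references):
    """Return the category with the smallest accumulated distance (unit 31).

    references: dict mapping a category name to a pair
    (consonant standard frames, vowel standard frame).
    """
    distances = {
        category: accumulate_distance(cons, vowel, ref_cons, ref_vowel)
        for category, (ref_cons, ref_vowel) in references.items()
    }
    return min(distances, key=distances.get)
```

For instance, with `references = {"k": ([(0.0,), (1.0,), (4.0,)], (5.0,)), "s": ([(3.0,), (3.0,)], (5.0,))}`, an input of `cons=[(0.0,), (1.0,)]` and `vowel=(5.0,)` accumulates 1.0 against "k" and 5.0 against "s", so "k" is selected.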

The embodiment described above is merely one example chosen for convenience of explanation, and the present invention is not limited to this embodiment.

[Brief explanation of drawings]

FIG. 1 is a block diagram showing an embodiment of the present invention, FIG. 2 is a conceptual diagram for explaining an example of the operation of the consonant matching unit, and FIG. 3 is a block diagram showing an example of a specific circuit of the consonant matching unit. In the figures, 1 is a microphone, 2 is an analysis unit, 3 is a speech extraction unit, 4 is a consonant standard pattern memory, 5 is a vowel standard pattern memory, 6 is a vowel matching unit, 7 is a consonant matching unit, 11 is a consonant pattern, 12 is a consonant standard pattern, 15 is a vowel pattern, 21 is a consonant buffer, 22 is a vowel buffer, 23 is a consonant standard pattern buffer, 24 is a vowel standard pattern buffer, 25 is an input pattern frame counter, 26 is a standard pattern frame counter, 27 and 28 are data selectors, 29 is a distance calculation unit, 30 is an accumulator, and 31 is a minimum value calculation unit.

Claims

[Claims]

1. A monosyllable recognition device comprising: an analysis unit that converts a monosyllable uttered in isolation into a speech pattern; a speech extraction unit that cuts out a consonant part and a vowel part from the speech pattern to form a consonant pattern and a vowel pattern; a standard pattern memory unit that stores the consonant patterns and vowel patterns of monosyllables uttered in advance as consonant standard patterns and vowel standard patterns, respectively; a vowel matching unit that matches the input vowel pattern against the vowel standard patterns and determines a vowel category; and a consonant matching unit that, when matching the input consonant pattern against a consonant standard pattern whose time length differs from it, appends a vowel pattern after the shorter consonant pattern and performs matching.