JPS5876894A

JPS5876894A - Monosyllable recognition equipment

Info

Publication number: JPS5876894A
Application number: JP56174115A
Authority: JP
Inventors: 吉田　和永
Original assignee: Nippon Electric Co Ltd
Current assignee: NEC Corp
Priority date: 1981-10-30
Filing date: 1981-10-30
Publication date: 1983-05-10
Also published as: JPH0160160B2

Abstract

(57) [Abstract] This publication contains application data from before electronic filing, so no abstract data is recorded.

Description

DETAILED DESCRIPTION OF THE INVENTION The present invention relates to an improvement in a speech recognition device that recognizes monosyllables, each corresponding to a Japanese kana character or the like, uttered in isolation.

Conventionally, pattern matching has been used as a method for recognizing monosyllables uttered in isolation.

Typical methods include the following.

Since most Japanese monosyllables have the form of a consonant followed by a vowel, a technique that recognizes the two separately is normally used. First, the consonant part and the vowel part are cut out of the speech pattern obtained by analyzing the input speech.

The speech pattern of the vowel part, that is, the vowel pattern, is matched against pre-registered vowel standard patterns to determine the vowel category.

Next, the consonant pattern is matched against pre-registered consonant standard patterns to determine the consonant category, and the monosyllable recognition result is determined.

Because the vowel part generally yields a stable pattern, the methods of cutting it out and matching it are comparatively unproblematic. For the consonant part, however, the cutting-out and pattern matching methods have many variants.

For example, one conceivable method cuts out a speech pattern of a fixed number of frames from the start of the speech and uses it as the consonant pattern. Because this method compares patterns that have the same fixed number of frames, there is no need to expand or contract the time axis, so recognition requires little computation.
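The fixed-frame-count method can be illustrated with a minimal sketch. The function name `fixed_length_distance`, the representation of a pattern as a list of feature tuples, and the use of Euclidean distance per frame are all assumptions made for illustration; the publication does not specify the distance measure.

```python
from math import dist


def fixed_length_distance(speech, reference, n_frames):
    """Sum per-frame Euclidean distances over the first n_frames frames.

    speech, reference: lists of equal-dimension feature tuples (one per frame).
    n_frames: the fixed number of frames treated as the consonant pattern.
    """
    if len(speech) < n_frames or len(reference) < n_frames:
        raise ValueError("both patterns must contain at least n_frames frames")
    # Frames at the same position are compared directly; no time warping.
    return sum(dist(a, b) for a, b in zip(speech[:n_frames], reference[:n_frames]))


speech = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
reference = [(0.0, 3.0), (1.0, 1.0), (9.0, 9.0)]
result = fixed_length_distance(speech, reference, 2)
```

The cost grows linearly with the frame count, which is the low-computation property the text describes.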

However, the duration of a consonant can differ greatly from one utterance to another and from one category to another.

If the number of frames is set to match the consonants of long duration, a large amount of memory is needed to store the standard patterns. In addition, when consonants of short duration are compared with each other, comparing over a long frame count may dilute their distinguishing features. Conversely, if the number of frames is set to match the consonants of short duration, consonants of long duration become difficult to recognize.

A second conceivable method cuts out the consonant part according to the length of the consonant and matches the consonant pattern against the consonant standard pattern using dynamic programming, which expands and contracts the time axis nonlinearly. This method allows patterns of different lengths to be matched finely. However, dynamic programming requires a considerable amount of computation. Furthermore, unlike word recognition, monosyllable recognition is thought to benefit little from nonlinear expansion and contraction of the time axis, so this is not necessarily the optimal method.
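For comparison, the dynamic programming approach can be sketched as a textbook dynamic time warping recurrence. This is an illustrative stand-in rather than the specific formulation the publication has in mind: the step pattern, the Euclidean frame distance, and the name `dp_matching_distance` are assumptions.

```python
from math import dist, inf


def dp_matching_distance(a, b):
    """Minimum accumulated frame distance under nonlinear time alignment."""
    n, m = len(a), len(b)
    # cost[i][j]: best accumulated distance aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(a[i - 1], b[j - 1])
            # Allow stretching either pattern or advancing both together.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


long_pattern = [(0.0,), (1.0,), (1.0,), (2.0,)]
short_pattern = [(0.0,), (1.0,), (2.0,)]
result = dp_matching_distance(long_pattern, short_pattern)
```

The nested loop evaluates a frame distance for every pair of frames, so the work grows with the product of the two lengths, which is the computational cost the text objects to.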

An object of the present invention is to recognize monosyllables with a small amount of standard pattern memory and a small amount of computation while obtaining high recognition performance. To achieve this object, the monosyllable recognition device according to the present invention comprises: an analysis unit that converts a monosyllable uttered in isolation into a pattern, called the speech pattern; a speech segmentation unit that cuts out the consonant part and the vowel part from the speech pattern as a consonant pattern and a vowel pattern; a standard pattern memory unit that stores the consonant patterns and vowel patterns of monosyllables uttered in advance as consonant standard patterns and vowel standard patterns, respectively; a vowel matching unit that matches the input vowel pattern against the vowel standard patterns to determine the vowel category; and a consonant matching unit that, when the input consonant pattern and a consonant standard pattern differ in duration, appends the vowel pattern after the shorter consonant pattern before matching them.

An embodiment of the present invention is described below with reference to the drawings. FIG. 1 is a block diagram showing the overall configuration of an embodiment of the present invention. Speech input from microphone 1 is analyzed by analysis unit 2 and output as speech pattern P. From speech pattern P, speech segmentation unit 3 cuts out consonant pattern C and vowel pattern V. When standard patterns are registered, these patterns are held in consonant standard pattern memory 4 and vowel standard pattern memory 5, respectively. At recognition time, vowel pattern V is first matched against vowel standard pattern VR in vowel matching unit 6.

For this matching it suffices to compute the distance between vowel patterns. The vowel category, such as /a/, /i/, /u/, /e/, or /o/, is thereby recognized, and the result is output as vowel category VC. Consonant matching unit 7 matches the consonant part and outputs the recognition result as consonant category CC, such as /k/.

The operating principle of consonant matching unit 7 is as follows. FIG. 2 is a conceptual diagram illustrating an example of its operation. Suppose the distance is to be computed between input consonant pattern 11 (a 5-frame pattern in the figure) and consonant standard pattern 12 (an 8-frame pattern in the figure). Each speech pattern is a time series of vectors, with time running to the right. From the start of the speech through the fifth frame, frames at the same position in the input consonant pattern and the consonant standard pattern are compared to obtain distances, as shown by arrows 13. Beyond the fifth frame, each remaining frame of consonant standard pattern 12 is paired with the one-frame vowel pattern 15 of the input speech, as shown by arrows 14, to obtain distances. The tail of a consonant pattern normally runs almost continuously into the vowel pattern, and the vowel pattern is nearly constant over time. Therefore, appending the vowel pattern after the shorter consonant pattern approximately reproduces the original pattern.
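The padding scheme of FIG. 2 can be sketched as follows. The sketch assumes that each pattern is a list of feature tuples, that each vowel is summarized by a single frame, and that frames are compared with Euclidean distance; the function name `padded_consonant_distance` and these choices are illustrative assumptions, not details given in the publication.

```python
from math import dist


def padded_consonant_distance(cons_in, vowel_in, cons_ref, vowel_ref):
    """Accumulate frame distances, padding the shorter consonant with its vowel.

    cons_in, cons_ref: lists of feature tuples (input and standard consonant).
    vowel_in, vowel_ref: one feature tuple each (one-frame vowel patterns).
    """
    n = max(len(cons_in), len(cons_ref))
    total = 0.0
    for t in range(n):
        # Past the end of a consonant pattern, reuse that side's vowel frame.
        a = cons_in[t] if t < len(cons_in) else vowel_in
        b = cons_ref[t] if t < len(cons_ref) else vowel_ref
        total += dist(a, b)
    return total


# 5-frame input against an 8-frame standard pattern, as in FIG. 2.
cons_in = [(0.0,), (1.0,), (2.0,), (3.0,), (4.0,)]
cons_ref = [(0.0,), (1.0,), (2.0,), (3.0,), (4.0,), (5.0,), (5.0,), (6.0,)]
vowel = (5.0,)
result = padded_consonant_distance(cons_in, vowel, cons_ref, vowel)
```

Because the same rule is applied to both sides, swapping the input and the standard pattern gives the same distance, which matches the remark that the opposite case is handled by exchanging the two.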

This method makes it easy to match patterns of different lengths. In addition, since the standard patterns need only be stored separately for the consonant parts and the vowel parts, in the numbers required, the amount of standard pattern memory is small. The example here shows the case where the input consonant pattern is shorter than the consonant standard pattern; in the opposite case, exactly the same procedure applies with the two exchanged.

FIG. 3 is a block diagram showing a specific configuration example of the circuit of consonant matching unit 7. Consonant pattern C of the input speech is held in consonant buffer 21, and vowel pattern V is held in vowel buffer 22. Likewise, consonant standard pattern CR is held in consonant standard pattern buffer 23, and vowel standard pattern VR is held in vowel standard pattern buffer 24. Input pattern frame counter 25 outputs a frame address to consonant buffer 21, and the per-frame consonant pattern CF is output from consonant buffer 21 according to this address. Similarly, standard pattern frame counter 26 supplies a frame address to consonant standard pattern buffer 23. Input pattern frame counter 25 and standard pattern frame counter 26 count up synchronously from 1, and when either counter reaches the frame count of its own pattern, that counter stops. While the frame counters continue counting, data selectors 27 and 28 select consonant pattern CF and consonant standard pattern CRF and output them to distance calculation unit 29. Distance calculation unit 29 computes the distance between the two patterns for each frame. Accumulator 30 sums these per-frame distances over one pattern.

Consider the case where the consonant pattern has fewer frames than the consonant standard pattern. When input pattern frame counter 25 reaches the frame count of the consonant pattern and stops, data selector 27 selects vowel frame data VF from vowel buffer 22 and outputs it to distance calculation unit 29. When standard pattern frame counter 26 reaches the frame count of the consonant standard pattern, the distance between the two patterns has been obtained, and accumulated distance d is output from accumulator 30. The above describes the case where the input consonant pattern has fewer frames; the circuit operates in the same way in the opposite case, where the consonant standard pattern has fewer frames, with vowel standard pattern VRF appended after consonant standard pattern CRF. The accumulated distance d is input to minimum value calculation unit 31 for each standard pattern category. This unit computes the minimum over all consonant categories, and the consonant category CC that gives the minimum is output as the recognition result.
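The data flow of FIG. 3 can be mimicked in software as a rough sketch. The two counters, the two selectors, the accumulator, and the minimum search below correspond loosely to elements 25 to 31; the function names, the dictionary layout of the standard patterns, the pairing of each consonant standard pattern with one vowel standard frame, and the Euclidean frame distance are assumptions made for illustration.

```python
from math import dist


def accumulate_distance(cons_in, vowel_frame_in, cons_ref, vowel_frame_ref):
    """Emulate counters 25/26, selectors 27/28 and accumulator 30."""
    count_in = count_ref = 0  # frame counters (0-based here, 1-based in the text)
    d = 0.0                   # accumulated distance
    while count_in < len(cons_in) or count_ref < len(cons_ref):
        # Selector on the input side: consonant frame while counting, else vowel.
        if count_in < len(cons_in):
            frame_in = cons_in[count_in]
            count_in += 1
        else:
            frame_in = vowel_frame_in
        # Selector on the standard side: same rule.
        if count_ref < len(cons_ref):
            frame_ref = cons_ref[count_ref]
            count_ref += 1
        else:
            frame_ref = vowel_frame_ref
        d += dist(frame_in, frame_ref)
    return d


def recognize_consonant(cons_in, vowel_frame_in, references):
    """Return the category with the smallest accumulated distance.

    references: dict mapping category -> (cons_ref, vowel_frame_ref).
    """
    distances = {
        category: accumulate_distance(cons_in, vowel_frame_in, cons_ref, vowel_ref)
        for category, (cons_ref, vowel_ref) in references.items()
    }
    return min(distances, key=distances.get)


references = {
    "k": ([(0.0,), (1.0,), (5.0,)], (5.0,)),
    "s": ([(9.0,), (9.0,)], (5.0,)),
}
category = recognize_consonant([(0.0,), (1.0,)], (5.0,), references)
```

Each frame pair is visited once, so the cost per category is proportional to the longer of the two consonant patterns.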

The embodiment described above is merely one example chosen for convenience of explanation, and the present invention is not limited to this embodiment.

[Brief Description of the Drawings]

FIG. 1 is a block diagram showing an embodiment of the present invention, FIG. 2 is a conceptual diagram illustrating an example of the operation of the consonant matching unit, and FIG. 3 is a block diagram showing an example of a specific circuit of the consonant matching unit.

Claims

[Claims]

A monosyllable recognition device comprising: an analysis unit that converts a monosyllable uttered in isolation into a pattern, called the speech pattern; a speech segmentation unit that cuts out the consonant part and the vowel part from the speech pattern as a consonant pattern and a vowel pattern; a standard pattern memory unit that stores the consonant patterns and vowel patterns of monosyllables uttered in advance as consonant standard patterns and vowel standard patterns, respectively; a vowel matching unit that matches the input vowel pattern against the vowel standard patterns to determine the vowel category; and a consonant matching unit that, when matching the input consonant pattern against a consonant standard pattern of different duration, appends the vowel pattern after the shorter consonant pattern before matching.