JPH0223878B2

Info

Publication number: JPH0223878B2
Application number: JP58072420A
Authority: JP
Inventors: Norihiro Jinnai
Original assignee: Individual
Current assignee: Individual
Priority date: 1983-04-23
Filing date: 1983-04-23
Publication date: 1990-05-25
Also published as: JPS59197100A

Description

DETAILED DESCRIPTION OF THE INVENTION

The present invention relates to a speech detection method for detecting human speech.

With the development of computers, speech input devices have been proposed. Such a device stores the features of speech waves in a speech storage unit in advance and recognizes input speech by comparing the features of a newly input speech wave with the stored values.

However, conventional speech recognition identifies phonemes from a speech wave in which the nasal and oral outputs are superimposed, so it is not fully satisfactory in terms of recognition rate or recognition time.

When a voiced sound is produced, the vocal cord vibration wave is divided by the velum into a part directed to the nasal cavity and a part directed to the oral cavity. Each part is modified by transfer characteristics that depend on the differing shapes of the nasal and oral cavities, and is then output from the nostrils and the lips. During this process the shape of the nasal cavity does not change, but the velum moves, the shape of the oral cavity changes with the movement of the teeth, tongue, and so on, and the lips open and close. The vocal cord vibration wave that is split between the oral and nasal cavities is therefore modified differently along the two paths to the nostrils and to the lips, so the waveforms output from the nostrils and from the lips are clearly different.

When an unvoiced sound is produced, the nasal output is extremely small and only the oral output is present, so the two output waveforms are again clearly different.

This invention was made by focusing on the fact that the speech waves emitted from the oral cavity and the nasal cavity exhibit clearly different waveforms, and its object is to make phoneme recognition far easier.

To achieve this object, the speech wave detection method of this invention detects the speech output from the nasal cavity and, at the same time, the speech output from the oral cavity, and creates three discrimination planes. The D-S plane takes as its first feature the difference in rise time between the nasal output energy curve and the oral output energy curve, and as its second feature the ratio of the slopes of the normalized nasal output energy curve and the normalized oral output energy curve at the rise of the oral output energy curve. The O-Δ plane takes as its first feature the ratio of the nasal output energy to the pharyngeal passage energy at the rise of the pharyngeal passage energy curve, and as its second feature the maximum slope of the time-variation curve of that ratio during phonation. The N-D plane takes as its first feature the noise-rejected differential zero-crossing count, obtained by excluding the differential zero crossings of silent sections of the oral output speech signal, and as its second feature the difference in rise time between this count and the nasal output energy curve.

Based on the distribution of an uttered speech wave on these planes, the phoneme groups or phonemes are classified by the following series of steps: (step 1) /N/ is separated on the O-Δ plane; (step 2) the sa-row phonemes, the affricates, the fricatives, the za-row phonemes, and the group consisting of the ma-row, na-row, ga-row, da-row, and ba-row phonemes are separated on the N-D plane; (step 3) the group consisting of the ya-row, ra-row, and wa-row phonemes is separated on the O-Δ plane; (step 4) the group consisting of the ka-row phonemes, the ta-row plosives, and the pa-row phonemes is separated on the D-S plane; and (step 5) the a-row phonemes and the ha-row phonemes are distinguished on the N-D plane.

In the speech wave detection method of this invention configured in this way, the D-S, O-Δ, and N-D planes are first determined by a preliminary experiment based on the utterances of a large number of unspecified speakers. An utterance by an arbitrary speaker is then classified into a phoneme group or phoneme on the basis of these planes, and the speech is recognized.

According to this invention, speech is thus recognized on the basis of two clearly different waveforms, the nasal output and the oral output, which makes recognition highly accurate and easy. When the processing is carried out by computer, the recognition rate, recognition time, and storage capacity are markedly improved.

An embodiment of this invention is described below with reference to the accompanying drawings.

As shown in Fig. 1, a head arm 2 is fixed to a person's head 1, and from it an arm 4, which supports microphones 3m and 3n at its tip, extends in front of the face so that its position can be adjusted. Microphone 3n faces the nostrils to detect the nasal output of speech, and microphone 3m faces the mouth to detect the oral output. It is desirable that either one of microphones 3n and 3m be mounted so that its position can be adjusted. It is also desirable to provide a shielding plate 5 between microphones 3m and 3n to separate the nasal output from the oral output, so that the two outputs enter microphones 3n and 3m without mixing.

As shown in Fig. 2, the signals from microphones 3m and 3n are fed through a switch 6, such as a multiplexer, to an AD converter 7, converted into digital signals, and input to a computer 8. The computer 8 performs recognition processing on the uttered speech based on the outputs of the two microphones.

This recognition processing can be carried out in various ways; one example is as follows.

In this approach, the parameters D, S, (En/Eo)_p, Δ(En/Eo), CR, and DT listed below are calculated from the energy curves NEm and NEn of the speech outputs detected by microphones 3m and 3n. From these parameters the D-S plane, the (En/Eo)_p-Δ(En/Eo) plane (O-Δ plane), and the CR-DT plane (N-D plane) are obtained, and the uttered sound is identified on the basis of these planes.

D: difference in rise time between NEn and NEm
S: ratio of the slopes of NEn and NEm at the rise
En: energy passing through the nasal cavity
Eo: energy passing through the pharyngeal cavity
(En/Eo)_p: value at the starting point of the En/Eo curve
Δ(En/Eo): maximum slope of the En/Eo curve
CR: noise-rejected differential zero-crossing count
DT: difference in rise time between the noise-rejected differential zero-crossing count and NEn

The creation of each plane, and phoneme identification based on it, is described next.

(i) D-S plane

Figs. 3a to 3l show the energy curves NEm and NEn obtained by detecting the sounds ア /a/, カ /ka/, サ /sa/, タ /ta/, ナ /na/, ハ /ha/, マ /ma/, ヤ /ya/, ラ /ra/, ワ /wa/, パ /pa/, and ン /N/ with microphones 3m and 3n. They are time-variation curves of the observed speech wave energy, each normalized by its maximum value.
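As a rough illustration of the curves just described, the sketch below computes a short-time energy curve for one microphone channel and normalizes it by its maximum. The frame length and the function name `normalized_energy_curve` are assumptions made for this example; the text does not say how the energy is framed.

```python
def normalized_energy_curve(samples, frame_len=64):
    """Return per-frame energy divided by the largest frame energy.

    frame_len is a hypothetical choice; the source gives no frame size.
    """
    energies = []
    # Non-overlapping frames; a trailing partial frame is dropped.
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energies.append(sum(x * x for x in frame))
    peak = max(energies, default=0.0)
    if peak == 0:
        # Silent input: return an all-zero curve instead of dividing by zero.
        return [0.0 for _ in energies]
    return [e / peak for e in energies]
```

Applying the same function to the digitized signals of microphones 3n and 3m would give curves playing the roles of NEn and NEm.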

In these energy curves, NEn and NEm rise simultaneously for /a/ and /ha/ and vary in the same way during phonation. For /ka/, /ta/, and /pa/, a peak caused by the plosive airflow appears at the rise of NEm, and NEm rises earlier than NEn. For /sa/, a small value appears in NEm during the /s/ interval (arrow). For /na/ and /ma/, NEn rises earlier than NEm, and NEn begins to decrease at the moment NEm begins to increase. For /ya/, /ra/, and /wa/, NEn and NEm rise almost simultaneously, but the slope at the rise is larger for NEn than for NEm. For /N/, the oral output is extremely small, and the energy curve of the room noise appears in NEm.

As parameters representing these features of the energy curve of each phoneme, the delay time D and the slope ratio S defined by equations (1) and (2) are calculated.

$$D = t_{op} - t_{np} \tag{1}$$

$$S = \frac{\mathrm{NE}_n(t_{n3}) - \mathrm{NE}_n(t_{np})}{\mathrm{NE}_m(t_{n3}) - \mathrm{NE}_m(t_{np})} \tag{2}$$

Here t_op and t_np are the times at which NEn and NEm, respectively, first exceed 5% of their maximum values, and t_n3 is the time an arbitrary interval, for example 19.2 msec, after t_np. Equation (1) gives the difference in rise time between NEn and NEm, and equation (2) gives the ratio of the slopes of NEn and NEm at the rise of NEm.
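Equations (1) and (2) can be sketched in code as follows. This is a minimal illustration, assuming that both curves are sampled on the same frame grid, have equal length, and are not silent; the function names and the `frame_ms` parameter are hypothetical.

```python
def first_crossing(curve, fraction):
    """Index of the first sample above fraction * max(curve), or None."""
    level = fraction * max(curve)
    for i, value in enumerate(curve):
        if value > level:
            return i
    return None


def delay_and_slope_ratio(ne_n, ne_m, frame_ms, window_ms=19.2, onset=0.05):
    """Return (D in msec, S) from nasal curve ne_n and oral curve ne_m."""
    i_n = first_crossing(ne_n, onset)  # nasal onset, t_op in the text
    i_m = first_crossing(ne_m, onset)  # oral onset, t_np in the text
    step = max(1, round(window_ms / frame_ms))
    j = min(i_m + step, len(ne_n) - 1, len(ne_m) - 1)  # t_n3, clamped
    d = (i_n - i_m) * frame_ms                 # equation (1)
    rise_n = ne_n[j] - ne_n[i_m]
    rise_m = ne_m[j] - ne_m[i_m]
    # Equation (2); an unchanged oral curve is reported as infinite S.
    s = rise_n / rise_m if rise_m != 0 else float("inf")
    return d, s
```

With this sign convention, a negative D means the nasal curve rises first, as the text reports for /na/ and /ma/.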

Fig. 4 plots these two parameters on the D-S plane for the ten monosyllables of Fig. 3 other than /sa/ and /N/, with each point marked by the initial letter of the utterance (for example, T for /ta/). The speech samples are isolated utterances by ten male speakers. /sa/ was excluded because the NEm value varies widely during the /s/ interval, which makes the detection of t_np unstable, and /N/ was excluded because the oral output is so small that t_np cannot be determined. In the figure, points with S > 3.0 are plotted at S = 3.0.

According to this figure, the delay time D for /a/ and /ha/ is small and the slope ratio S is distributed around 1.0. For /ka/, /ta/, and /pa/, D > 0 and S is small. For /na/ and /ma/, D < 0 and S < 0. For /ya/, /ra/, and /wa/, D is small and S is larger than for the other phoneme groups. In this group there are some samples in which NEm rises slightly later than NEn (D < 0). In those samples the overall shapes of NEn and NEm were almost the same as in Fig. 3, but the first peak of the NEn curve came early, so the slope of NEn was calculated near the top of that peak, which is thought to have made the slope ratio S somewhat smaller (arrow in Fig. 4).

It can thus be seen that obtaining D and S allows the phonemes to be classified into several groups.

(ii) O-Δ plane

As shown in Fig. 1, let E_p(t) be the energy passing through the pharyngeal cavity 9, let E_o(t) and E_n(t) be the output energies of the nasal cavity 10 and the oral cavity 11, respectively, and let E_o(in)(t) and E_n(in)(t) be the energies observed by microphones 3n and 3m. First, the time-variation curve of E_o(t)/E_p(t), that is, the ratio of the nasal output energy to the pharyngeal passage energy, is estimated from the observed values E_o(in)(t) and E_n(in)(t).

Assuming that energy is transmitted without loss in the vocal tract, equation (3) holds.

$$E_p(t) = E_o(t) + E_n(t) \tag{3}$$

If C_o and C_n denote the fractions of the radiated energy that enter the respective microphones, then

$$E_o^{(in)}(t) = C_o E_o(t), \qquad E_n^{(in)}(t) = C_n E_n(t) \tag{4}$$

and equations (3) and (4) give

$$C_o E_p(t) = E_o^{(in)}(t) + \frac{C_o}{C_n} E_n^{(in)}(t) \tag{5}$$

The question here is how to obtain C_o/C_n. One way is to place a single microphone at the opening at one end of a cylinder, cover the mouth and nose with the opening at the other end to detect E_p, detect E_n and E_o with the arrangement shown in Fig. 1, and calculate the C_o and C_n for which E_p = E_n + E_o. Because C_n and C_o vary with the positions of microphones 3m and 3n, the microphones should be kept fixed during this procedure, and E_p, E_n, and E_o should be compared using averages over several measurements.
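One possible way to carry out this calibration numerically is a least-squares fit, sketched below. The fit itself is an assumption of this example: the text only says that C_o and C_n are chosen so that E_p = E_n + E_o holds for averaged measurements. The function name `fit_mic_fractions` and its inputs (averaged energies from several trials) are hypothetical.

```python
def fit_mic_fractions(ep, eo_in, en_in):
    """Estimate (C_o, C_n) from averaged measurements.

    ep    : pharyngeal energies measured through the cylinder
    eo_in : energies observed by the nasal microphone 3n
    en_in : energies observed by the oral microphone 3m
    Fits ep ~ a * eo_in + b * en_in with a = 1/C_o and b = 1/C_n.
    """
    soo = sum(o * o for o in eo_in)
    snn = sum(n * n for n in en_in)
    son = sum(o * n for o, n in zip(eo_in, en_in))
    sop = sum(o * p for o, p in zip(eo_in, ep))
    snp = sum(n * p for n, p in zip(en_in, ep))
    det = soo * snn - son * son
    if det == 0:
        raise ValueError("measurements do not determine C_o and C_n")
    a = (sop * snn - snp * son) / det  # 1 / C_o
    b = (snp * soo - sop * son) / det  # 1 / C_n
    return 1.0 / a, 1.0 / b
```

Only the ratio C_o/C_n is needed afterwards, so any common scale error in the cylinder measurement cancels out.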

Using the C_o/C_n obtained in this way, equation (6) follows.

$$\frac{E_o}{E_p} = \frac{E_o(t)}{E_p(t)} = \frac{E_o^{(in)}(t)}{E_o^{(in)}(t) + (C_o/C_n)\,E_n^{(in)}(t)} \tag{6}$$

Fig. 5 shows the time-variation curves of E_o/E_p from equation (6) for the utterances /a/ and so on described above.
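Equation (6) maps directly onto the two observed energy sequences. The sketch below is illustrative only; the function name and the handling of frames with zero total energy are assumptions.

```python
def nasal_ratio_curve(eo_in, en_in, co_over_cn):
    """E_o/E_p per frame, following equation (6).

    eo_in : energies observed by the nasal microphone 3n
    en_in : energies observed by the oral microphone 3m
    co_over_cn : the calibrated ratio C_o/C_n
    """
    ratios = []
    for o, n in zip(eo_in, en_in):
        denom = o + co_over_cn * n  # equals C_o * E_p, equation (5)
        # Assumption: frames with no energy at all are reported as 0.
        ratios.append(o / denom if denom > 0 else 0.0)
    return ratios
```

The result lies between 0 (all energy leaves through the mouth) and 1 (all energy leaves through the nose).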

The following parameters are defined to represent the features of this E_o/E_p curve.

$$\left(\frac{E_o}{E_p}\right)_p = \frac{E_o(t_p)}{E_p(t_p)} \tag{7}$$

$$\Delta\left(\frac{E_o}{E_p}\right) = \operatorname{Max}\left[\frac{E_o(t)}{E_p(t)} - \frac{E_o(t')}{E_p(t')}\right] \tag{8}$$

Here Max denotes the maximum value of the expression in brackets, t_p is the time at which E_p first exceeds 15% of its maximum value, and t' is the time a fixed interval, for example 19.2 msec, after an arbitrary time t at or after t_p. Equation (7) gives the value at the left end of the E_o/E_p curve, and equation (8) gives its maximum slope.

Fig. 6 plots these two parameters on the O-Δ plane for the 12 monosyllables of Fig. 5, uttered in isolation by ten male speakers, with each point marked by the initial letter of the utterance. (To keep the figure readable, only five samples are plotted for some phonemes; the other five are distributed similarly.) According to this figure, the 12 monosyllables can be classified into four groups: (/a/, /ka/, /sa/, /ta/, /ha/, /pa/), (/na/, /ma/), (/ya/, /ra/, /wa/), and (/N/).
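The two O-Δ features of equations (7) and (8) can be sketched as below. This is a hedged illustration, assuming a non-silent utterance sampled at a fixed frame period; the function name, `frame_ms`, and the use of the scaled quantity C_o·E_p in place of E_p (the 15% threshold is unaffected by the scale) are choices made for the example.

```python
def o_delta_features(eo_in, en_in, co_over_cn, frame_ms, lag_ms=19.2, onset=0.15):
    """Return ((E_o/E_p)_p, delta(E_o/E_p)) from the two observed energies."""
    # C_o * E_p from equation (5); proportional to E_p.
    ep = [o + co_over_cn * n for o, n in zip(eo_in, en_in)]
    level = onset * max(ep)
    # t_p: first frame where E_p exceeds 15% of its maximum.
    tp = next(i for i, value in enumerate(ep) if value > level)
    ratio = [o / p if p > 0 else 0.0 for o, p in zip(eo_in, ep)]
    lag = max(1, round(lag_ms / frame_ms))
    start_value = ratio[tp]                          # equation (7)
    diffs = [ratio[t] - ratio[t + lag]               # equation (8)
             for t in range(tp, len(ratio) - lag)]
    max_slope = max(diffs) if diffs else 0.0
    return start_value, max_slope
```

In the toy case of a ratio that starts at 1 and falls toward 0, the function returns a start value of 1 and a positive maximum slope.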

/ʃa/, /za/, /ga/, /da/, and /ba/ are not treated on this plane. This is because the distribution of /ʃa/ is the same as that of /sa/, and the distributions of /za/, /ga/, /da/, and /ba/ resemble that of /na/ but are broad and difficult to classify.

(iii) N-D plane

This plane is used to identify fricatives such as /s/ and /z/ and affricates such as /tʃi/ and /tsu/. One of its parameters, the noise-rejected differential zero-crossing count (noise rejected differential zero crossing rate), is described first.

For the oral output speech waveform of, for example, /su/ shown in Fig. 7, let {x_i} be the oral output speech signal at the sample points. The noise-rejected differential zero-crossing count CR is then defined by (9).

If {(x_{i+1} − x_i)(x_i − x_{i−1})} < 0 and {|x_{i+1}| > threshold or |x_i| > threshold or |x_{i−1}| > threshold}, one noise-rejected differential zero crossing is counted at sample point i; CR is the total of these crossings within a given interval. (9)

Figs. 8a to 8f show CR as a function of time for the utterances /u/, /su/, /zu/, /tsu/, /hu/, and /nu/.
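Definition (9) amounts to counting local extrema of the waveform while ignoring those that lie entirely below an amplitude threshold. A minimal sketch, with a hypothetical function name and with the threshold value left to the caller since the text does not give one:

```python
def noise_rejected_dzc(x, threshold):
    """Count noise-rejected differential zero crossings in samples x."""
    count = 0
    for i in range(1, len(x) - 1):
        # The first difference changes sign at sample i.
        turns = (x[i + 1] - x[i]) * (x[i] - x[i - 1]) < 0
        # At least one of the three samples exceeds the threshold.
        loud = max(abs(x[i - 1]), abs(x[i]), abs(x[i + 1])) > threshold
        if turns and loud:
            count += 1
    return count
```

Calling this function on successive intervals of the oral signal gives a CR-versus-time curve of the kind plotted in Fig. 8.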

As the other parameter, the difference in rise time DT (delay time) between CR and NEn is defined by equation (10).

$$\mathrm{DT} = \begin{cases} t_{op} - z_p, & t_{op} - z_p \ge 0 \\ t_{op} - t_{np}, & t_{op} - z_p < 0 \end{cases} \tag{10}$$

Here t_op and t_np are the rise times of NEn and NEm, respectively, and z_p, the rise time of the noise-rejected differential zero-crossing count, is the time at which the count first exceeds 9. Figs. 8a to 8f also show the normalized time-variation curves NEn and NEm of the nasal and oral output energy.
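Equation (10) can be sketched as follows. The text does not say what to do when the count never exceeds 9; falling back to t_op − t_np in that case is an assumption of this example, as are the function name and the per-interval `cr_curve` input.

```python
def delay_time(t_op, t_np, cr_curve, frame_ms, min_count=9):
    """DT in msec, following equation (10).

    t_op, t_np : rise times (msec) of the nasal and oral energy curves
    cr_curve   : CR value for each analysis interval of length frame_ms
    """
    # z_p: first interval in which the count exceeds min_count.
    z_index = next((i for i, c in enumerate(cr_curve) if c > min_count), None)
    if z_index is None:
        # Assumption: no CR rise found, use the energy-curve difference.
        return t_op - t_np
    dt = t_op - z_index * frame_ms
    return dt if dt >= 0 else t_op - t_np
```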

Fig. 9 plots each monosyllable, grouped by following vowel and marked with the initial letter of the utterance, on the plane (N-D plane) whose vertical axis is the noise-rejected differential zero-crossing count [N.R.-D.Z.C.R.] (CR) and whose horizontal axis is the difference in rise time (delay time) between CR and NEn.

This figure confirms that the monosyllables can be classified into /s/, /ʃ/, /z/, /h/, (/n/, /m/, /g/, /d/, /b/), and vowels.

/ka/, /ta/, /pa/, /ya/, /ra/, /wa/, and /N/ are not treated in this figure. The distribution of /ka/, /ta/, and /pa/ lies between those of /a/ and /ha/, and the distribution of /ya/, /ra/, and /wa/ is the same as that of /a/, both of which are inconvenient for classification; /N/ has no oral output and therefore cannot be plotted on the N-D plane.

The method of creating each plane has been described above. An example of a phoneme recognition algorithm that uses these planes is given next.

This example algorithm identifies the phonemes /a/, /ka/, /sa/, /ta/, /na/, /ha/, /ma/, /ya/, /ra/, /wa/, /pa/, /ʃa/, /za/, /ga/, /da/, /ba/, and /N/, and follows the identification flowchart shown in Fig. 10.

In step 1, /N/ is separated and identified on the O-Δ plane. The distribution of /N/ on the O-Δ plane is very distinct, so it is appropriate to separate it from the other phonemes first.

In step 2, the four groups /sa/, /ʃa/, /za/, and (/ma/, /na/, /ga/, /da/, /ba/) are separated and identified on the N-D plane.

In step 3, the group (/ya/, /ra/, /wa/) is separated and identified, again on the O-Δ plane.

In step 4, (/a/, /ha/) is separated from (/ka/, /ta/, /pa/) on the D-S plane.

In step 5, /a/ and /ha/ are distinguished, again on the N-D plane.

These five stages of processing classify the 17 phonemes into 9 groups. The five-stage structure is used because the method separates and identifies first those phonemes that are most distinctly separated from the other phoneme groups on each plane.
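The five steps above can be sketched as a simple decision cascade. The region tests are passed in as predicates because the text gives no numerical boundaries; `TOY_REGIONS` below contains made-up thresholds purely so that the example runs, and all names are hypothetical.

```python
def classify_syllable(o_delta, n_d, d_s, regions):
    """Five-step cascade over the three discrimination planes.

    o_delta : ((E_o/E_p)_p, delta(E_o/E_p)) point on the O-Δ plane
    n_d     : (DT, CR) point on the N-D plane
    d_s     : (D, S) point on the D-S plane
    regions : label -> predicate, standing in for boundaries that
              would come from a preliminary experiment
    """
    if regions["/N/"](o_delta):                        # step 1, O-Δ plane
        return "/N/"
    for label in ("/sa/", "/ʃa/", "/za/", "(/ma/,/na/,/ga/,/da/,/ba/)"):
        if regions[label](n_d):                        # step 2, N-D plane
            return label
    if regions["(/ya/,/ra/,/wa/)"](o_delta):           # step 3, O-Δ plane
        return "(/ya/,/ra/,/wa/)"
    if regions["(/ka/,/ta/,/pa/)"](d_s):               # step 4, D-S plane
        return "(/ka/,/ta/,/pa/)"
    return "/ha/" if regions["/ha/"](n_d) else "/a/"   # step 5, N-D plane


# Made-up boundaries for illustration only; not read off the figures.
TOY_REGIONS = {
    "/N/": lambda p: p[0] > 0.9 and p[1] < 0.05,
    "/sa/": lambda p: p[1] > 60,
    "/ʃa/": lambda p: 40 < p[1] <= 60,
    "/za/": lambda p: 20 < p[1] <= 40 and p[0] < 50,
    "(/ma/,/na/,/ga/,/da/,/ba/)": lambda p: p[0] < 0,
    "(/ya/,/ra/,/wa/)": lambda p: p[1] > 0.3,
    "(/ka/,/ta/,/pa/)": lambda p: p[0] > 5,
    "/ha/": lambda p: p[1] > 5,
}
```

Ordering the tests this way mirrors the stated rationale: each step removes the group that is most clearly isolated on its plane, so later tests only need to separate what remains.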

The above concerns phoneme identification for the vowel /a/ and syllables whose following vowel is /a/, that is, the a-vowel series (ア, カ, サ, タ, ...). If the vowel /a/ is replaced with /e/ or /o/ and the following vowel /a/ is replaced with /e/ or /o/, the phoneme groups or phonemes of the e-vowel and o-vowel series can be classified in the same way. Likewise, if the vowel /a/ is replaced with /i/ or /u/, the following vowel /a/ is replaced with /i/ or /u/, and the affricates /tʃi/ and /tsu/ are separated in step 2, the phoneme groups or phonemes of the i-vowel and u-vowel series can be classified in the same way. That these affricates can be classified is confirmed by Fig. 9 (in the figure, T denotes an affricate: /tʃi/ in Fig. 9b and /tsu/ in Fig. 9c).

Within each of the phoneme groups obtained in this way, the individual phonemes are identified by a well-known conventional recognition technique, for example one based on the spectral centroid frequency, peak frequencies, and valley frequencies and their changes over time, and the final decision is made.

The embodiment above uses a shielding plate to separate the nasal output from the oral output. When no shielding plate is provided, the D-S, O-Δ, and N-D planes can be created after applying a correction such as the one shown in Fig. 11: on each of the energy curves NEm and NEn, the 35% and 15% points of the maximum value are joined by a straight line, and the intersection of this line with the time axis is taken as t_op (or t_np).
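The correction just described is a linear extrapolation of the rising edge back to zero energy. A minimal sketch, assuming the curve rises through both levels and that crossing times between frames may be linearly interpolated (the interpolation and the function name are assumptions of this example):

```python
def extrapolated_onset(curve, frame_ms, low=0.15, high=0.35):
    """Onset time (msec) where the line through the 15% and 35% points
    of the maximum meets the time axis."""
    peak = max(curve)

    def crossing_time(level):
        # First upward crossing of `level`, interpolated between frames.
        for i in range(1, len(curve)):
            if curve[i - 1] < level <= curve[i]:
                frac = (level - curve[i - 1]) / (curve[i] - curve[i - 1])
                return (i - 1 + frac) * frame_ms
        raise ValueError("curve never rises through the requested level")

    t_low = crossing_time(low * peak)
    t_high = crossing_time(high * peak)
    if t_high == t_low:
        return t_low  # vertical edge: both points share one time
    slope = (high - low) * peak / (t_high - t_low)
    return t_low - low * peak / slope
```

For a curve that ramps up linearly, the extrapolated onset coincides with the start of the ramp, which is the behaviour the correction is meant to recover when leakage between the two microphones blurs the 5% crossing.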

In the speech recognition described above, identification can be made more accurate by adding a detector, such as a camera, that detects mouth movements and combining it with the detection method of this invention.

[Brief description of the drawings]

Fig. 1 is an explanatory diagram showing an example of this invention; Fig. 2 is a control block diagram using this invention; Figs. 3a to 3l are graphs showing energy distribution over time; Fig. 4 is a graph showing the D-S plane; Figs. 5a to 5l are graphs showing the time-variation curves of E_o/E_p; Fig. 6 is a graph showing the O-Δ plane; Fig. 7 is the oral output speech waveform of /su/; Figs. 8a to 8f are graphs of the normalized time-variation curves of the nasal and oral output energy and of the time-variation curve of the noise-rejected differential zero-crossing count of the oral output; Figs. 9a to 9e are graphs showing the N-D plane; Fig. 10 is a flowchart showing an example of speech recognition; and Fig. 11 is a graph showing an example of the correction.

3m, 3n: microphones; 4: support arm.

Claims

[Scope of Claims]

1. A speech wave detection method characterized in that: the speech output from the nasal cavity is detected and, at the same time, the speech output from the oral cavity is detected; a discrimination D-S plane is created whose first feature is the difference in rise time between the nasal output energy curve and the oral output energy curve and whose second feature is the ratio of the slopes of the normalized nasal output energy curve and the normalized oral output energy curve at the rise of the oral output energy curve; a discrimination O-Δ plane is created whose first feature is the ratio of the nasal output energy to the pharyngeal passage energy at the rise of the pharyngeal passage energy curve and whose second feature is the maximum slope of the time-variation curve of that ratio during phonation; a discrimination N-D plane is created whose first feature is the noise-rejected differential zero-crossing count, obtained by excluding the differential zero crossings of silent sections of the oral output speech signal, and whose second feature is the difference in rise time between this count and the nasal output energy curve; and, based on the distribution of the uttered speech wave on each of these discrimination planes, the phoneme groups or phonemes are classified by a series of steps consisting of: (step 1) separating /N/ on the O-Δ plane; (step 2) separating, on the N-D plane, the sa-row phonemes, the affricates, the fricatives, the za-row phonemes, and the group consisting of the ma-row, na-row, ga-row, da-row, and ba-row phonemes; (step 3) separating, on the O-Δ plane, the group consisting of the ya-row, ra-row, and wa-row phonemes; (step 4) separating, on the D-S plane, the group consisting of the ka-row phonemes, the ta-row plosives, and the pa-row phonemes; and (step 5) distinguishing, on the N-D plane, the a-row phonemes from the ha-row phonemes.