JPS6068393A

JPS6068393A - Discrimination of phoneme

Info

Publication number: JPS6068393A
Application number: JP58177345A
Authority: JP
Inventors: 二矢田　勝行
Original assignee: Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Holdings Corp
Priority date: 1983-09-26
Filing date: 1983-09-26
Publication date: 1985-04-18
Also published as: JPH0316039B2

Abstract

(57) [Abstract] This publication contains application data that predates electronic filing, so no abstract data is recorded.

Description

Detailed Description of the Invention

Field of the Invention

The present invention relates to a method of recognizing phonemes in a recognition method characterized by performing phoneme recognition.

Structure of the Conventional Example and Its Problems

In recent years, research and development on speech recognition for unspecified speakers and large vocabularies has become active.

A speech recognition method characterized by performing phoneme recognition is well suited to recognizing large vocabularies from unspecified speakers, for several reasons. It is little affected by speaker-dependent variation such as differences in accent. Because it converts the speech signal into a phoneme sequence, that is, into symbols that carry little information and correspond to linguistic units, the word dictionary can be small. The contents of the word dictionary can also be created and changed easily. The key point in this method is to perform phoneme recognition accurately. Vowel recognition aside, consonant recognition has long been regarded as a technically difficult problem. The discussion below therefore concentrates mainly on consonants.

Many phonetic studies of the characteristics of individual consonants have been carried out over the years. However, there are few prior examples of so-called automatic recognition, in which the speech signal is segmented, consonant intervals are detected, and the phonemes are recognized. Here, a technique previously filed by a group including the present applicant is taken as the conventional example, and its problems are described.

The method of detecting phoneme intervals in the conventional example is explained with reference to FIG. 1. The following three methods are used together.

(a) A voiced/unvoiced decision is made, and unvoiced intervals are taken as consonant intervals. As shown in FIG. 1(a), the speech is divided into frames of about 10 msec, and a voiced/unvoiced decision is made for each frame. An interval in which unvoiced frames continue is taken as a consonant interval. This method is effective for detecting unvoiced consonants.

(b) Each frame of the speech interval is compared with the standard patterns of the five vowels and the nasals (/m/, /n/ and the moraic nasal), and the phoneme with the highest similarity is taken as the recognition result for that frame. An interval in which frames recognized as nasal continue is taken as a consonant interval (FIG. 1(b)). This segmentation method exploits the fact that the spectrum of a voiced consonant is closer to that of a nasal than to that of a vowel.

(c) Power dips are detected from the temporal movement of the speech power, and each dip interval is taken as a consonant interval (FIG. 1(c)). This segmentation method exploits the fact that the power of a consonant is lower than the power of the vowel portion.
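The run-based and dip-based segmentation described above can be illustrated with a minimal sketch. It assumes the per-frame voiced/unvoiced (or nasal) decisions are already available as boolean flags, and it uses a hypothetical dip criterion (a frame lying at least `depth` below the power peaks on both sides); the source gives no numeric criterion, so `depth` and the function names are illustrative only.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start_frame, end_frame), end exclusive


def runs_of(flags: List[bool]) -> List[Interval]:
    """Return every maximal run of consecutive True flags as a frame interval."""
    intervals: List[Interval] = []
    start = None
    for i, flag in enumerate(flags):
        if flag and start is None:
            start = i                      # a run begins here
        elif not flag and start is not None:
            intervals.append((start, i))   # the run ended just before frame i
            start = None
    if start is not None:                  # the run reaches the last frame
        intervals.append((start, len(flags)))
    return intervals


def power_dip_flags(power: List[float], depth: float) -> List[bool]:
    """Flag frames lying at least `depth` below the peaks on both sides.

    `depth` is a hypothetical threshold, not a value taken from the source.
    """
    n = len(power)
    left_peak = [0.0] * n
    right_peak = [0.0] * n
    best = float("-inf")
    for i in range(n):                     # highest power seen so far from the left
        best = max(best, power[i])
        left_peak[i] = best
    best = float("-inf")
    for i in range(n - 1, -1, -1):         # highest power seen so far from the right
        best = max(best, power[i])
        right_peak[i] = best
    return [min(left_peak[i], right_peak[i]) - power[i] >= depth for i in range(n)]


# Consonant intervals from per-frame unvoiced decisions
unvoiced = [False, True, True, False, True]
print(runs_of(unvoiced))

# Consonant intervals from a power contour containing one dip
power = [10.0, 10.0, 4.0, 3.0, 4.0, 10.0, 10.0]
print(runs_of(power_dip_flags(power, depth=5.0)))
```

The same `runs_of` helper serves both the unvoiced-frame and the nasal-frame criteria, since each reduces to finding runs of identically labelled frames.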

For the consonant intervals obtained in this way, the conventional method recognizes the consonant as follows, as explained with reference to FIG. 2. Taking all frames of the consonant interval, the similarity to the standard pattern of each consonant is calculated frame by frame. In the figure, the similarity to phoneme A is written $l_a$, the similarity to phoneme B $l_b$, and the similarity to phoneme C $l_c$. The similarity is given by (Formula 1):

$$l_j = -(x - \mu_j)^{T} \, \Sigma_j^{-1} \, (x - \mu_j) - K_j \qquad \text{(Formula 1)}$$

where $j$ is the phoneme name (a, b, c, ...) and $K_j$ is a constant that depends on phoneme $j$. $x$ is the input feature parameter (LPC cepstrum coefficient) vector, $\mu_j$ is the mean vector (standard pattern), and $\Sigma_j$ is the covariance matrix (standard pattern).

Let $L_a, L_b, L_c, \ldots$ be the sums, over all frames, of the similarities to phonemes A, B, C, ... respectively, so that

$$L_j = \sum_{\text{frames}} l_j \qquad (j = a, b, c, \ldots)$$

The phoneme whose similarity sum is largest is taken as the recognized phoneme.
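The conventional decision rule (per-frame similarity summed over the whole interval, then the largest sum wins) can be sketched as follows. This is only an illustration: it assumes the inverse covariance matrix of each phoneme has been computed beforehand, and the names `frame_similarity` and `classify_by_sum` and the toy patterns are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]
Matrix = Sequence[Sequence[float]]
# A standard pattern: (mean vector, inverse covariance matrix, constant K)
Pattern = Tuple[Vector, Matrix, float]


def quad_form(diff: Vector, inv_cov: Matrix) -> float:
    """Compute diff^T * inv_cov * diff."""
    n = len(diff)
    return sum(diff[i] * inv_cov[i][j] * diff[j] for i in range(n) for j in range(n))


def frame_similarity(x: Vector, pattern: Pattern) -> float:
    """Similarity of one frame: -(x - mu)^T Sigma^{-1} (x - mu) - K."""
    mean, inv_cov, k = pattern
    diff = [a - b for a, b in zip(x, mean)]
    return -quad_form(diff, inv_cov) - k


def classify_by_sum(frames: List[Vector],
                    patterns: Dict[str, Pattern]) -> Tuple[str, Dict[str, float]]:
    """Sum the frame similarities over the interval and pick the largest sum."""
    totals = {name: sum(frame_similarity(x, p) for x in frames)
              for name, p in patterns.items()}
    best = max(totals, key=lambda name: totals[name])
    return best, totals


identity = [[1.0, 0.0], [0.0, 1.0]]          # toy inverse covariance
patterns = {
    "a": ([0.0, 0.0], identity, 0.0),
    "b": ([2.0, 2.0], identity, 0.0),
}
frames = [[0.1, 0.0], [0.2, -0.1]]           # every frame is weighted equally
best, totals = classify_by_sum(frames, patterns)
print(best, totals)
```

Every frame contributes with the same weight, which is exactly the property the text goes on to criticize.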

The problem with the conventional example described above lies in the latter stage, phoneme discrimination: once the interval has been fixed by segmentation, the similarity is calculated frame by frame over the whole of that interval.

In other words, the whole consonant interval is assumed to be temporally static, and every part of the interval is treated equally.

Vowels aside, however, the feature parameters of consonants and semivowels change over time within the interval, and the characteristics of each phoneme are found in the form of that change. Moreover, the part that carries the characteristics (the feature part) differs with the type of consonant or semivowel. For example, in voiced and unvoiced plosives the features needed to discriminate the phoneme are concentrated around the burst; in nasals the feature part lies in the transition to the following vowel; and in liquids and semivowels the characteristics lie in the movement of the parameters over the whole phoneme interval.

To discriminate consonants and semivowels, therefore, it is effective to extract accurately the feature part that distinguishes each phoneme and to discriminate the phoneme by focusing on the temporal movement of the parameters within that feature part. The conventional example makes no such provision.

Object of the Invention

The present invention eliminates the above drawbacks of the prior art. Its object is to provide a means of discriminating phonemes with high accuracy by accurately extracting the feature part of a phoneme and matching it against phoneme standard patterns in a way that includes the temporal movement of the parameters in the feature part.

Structure of the Invention

To achieve the above object, the present invention provides a phoneme discrimination method characterized by: preparing standard patterns created for each phoneme from the part of the phoneme interval that well expresses the characteristics of the phoneme (hereinafter, the feature part), these being referred to as phoneme standard patterns, and a standard pattern created from the information surrounding the feature part for the group of phonemes to be discriminated, referred to as the surrounding-information standard pattern; segmenting the input speech to obtain a feature-part candidate interval within the phoneme interval; performing pattern matching by applying the phoneme standard patterns and the surrounding-information standard pattern at each time point in the feature-part candidate interval; obtaining, over the whole feature-part candidate interval, the similarity to each phoneme in a form from which the influence of the surroundings of the feature part has been removed; and discriminating the phoneme by comparing the similarities within the feature-part candidate interval.

Description of the Embodiment

An embodiment of the present invention is described below with reference to the drawings.

In this embodiment, the feature part of each phoneme is located accurately by visual inspection, and phoneme standard patterns are created in advance from a large amount of data. In addition, a standard pattern of the information surrounding the feature part is created for all the phonemes whose similarities are compared at one time (the phonemes belonging to one phoneme group). To discriminate a phoneme, a candidate interval for the feature part is first set somewhat generously using segmentation parameters. The standard pattern of each phoneme and the standard pattern of the surrounding information are then applied over the whole candidate interval, shifting one frame at a time, and the similarity is calculated. In this way the feature part is detected accurately within the candidate interval and the phoneme is discriminated at the same time.

First, the method of creating the standard patterns is explained. A phoneme standard pattern is created by extracting the feature parameters from the feature part, which is cut out by visual inspection for each phoneme, and obtaining the mean values and the covariance matrix over a large amount of data. The feature parameters are LPC cepstrum coefficients obtained by linear predictive analysis (LPC analysis).

Let $d$ be the number of feature parameters per frame and $J$ the number of frames in the feature part. The parameter sequence $x$ of the feature part is then expressed as

$$x = \left(x_1^{(1)}, x_2^{(1)}, \ldots, x_d^{(1)}, \; \ldots, \; x_1^{(J)}, x_2^{(J)}, \ldots, x_d^{(J)}\right) \qquad \text{(Formula 2)}$$

where $x_i^{(k)}$ is the $i$-th LPC cepstrum coefficient in the $k$-th frame of the feature part.

The parameter sequence is extracted from a large amount of data, and the mean vector $\mu$ of the elements and the covariance matrix $\Sigma$ between the elements are obtained and used as the standard pattern:

$$\mu = \left(\mu_1^{(1)}, \mu_2^{(1)}, \ldots, \mu_d^{(1)}, \; \ldots, \; \mu_1^{(J)}, \mu_2^{(J)}, \ldots, \mu_d^{(J)}\right) \qquad \text{(Formula 3)}$$

where $\mu_i^{(k)} = \frac{1}{M} \sum x_i^{(k)}$, the sum being taken over the $M$ data items ($M$ is the number of data). The covariance matrix is complicated and is not written out here. A distinctive point of the method of this embodiment is thus that the standard patterns are created from the feature parameters of several frames, that is, with the temporal movement of the parameters taken into account.
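The construction of a phoneme standard pattern described above (stack the coefficients of all frames of the feature part into one vector, then take the mean and covariance over many examples) can be sketched as below. The covariance normalization by the number of data items is an assumption, since the source does not write the covariance out, and all function names and toy values are hypothetical.

```python
from typing import List, Sequence

Frame = Sequence[float]          # d feature parameters of one frame


def stack_frames(frames: Sequence[Frame]) -> List[float]:
    """Concatenate the frames of a feature part into one vector, frame by frame."""
    return [coeff for frame in frames for coeff in frame]


def mean_vector(samples: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean over the M stacked vectors."""
    m = len(samples)
    dim = len(samples[0])
    return [sum(s[i] for s in samples) / m for i in range(dim)]


def covariance_matrix(samples: Sequence[Sequence[float]],
                      mean: Sequence[float]) -> List[List[float]]:
    """Covariance between all stacked elements (divided by M: an assumption)."""
    m = len(samples)
    dim = len(mean)
    cov = [[0.0] * dim for _ in range(dim)]
    for s in samples:
        diff = [s[i] - mean[i] for i in range(dim)]
        for i in range(dim):
            for j in range(dim):
                cov[i][j] += diff[i] * diff[j] / m
    return cov


# Two toy examples of one phoneme, each with 2 frames of 2 coefficients
tokens = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[3.0, 4.0], [5.0, 6.0]],
]
samples = [stack_frames(t) for t in tokens]
mu = mean_vector(samples)
sigma = covariance_matrix(samples, mu)
print(mu)
print(sigma)
```

Because the stacked vector spans several frames, the off-diagonal blocks of the covariance capture how the coefficients of one frame co-vary with those of another, which is where the temporal movement enters the pattern.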

Next, the method of creating the standard pattern of the surrounding information is explained. One surrounding-information standard pattern is created for each group of phonemes whose similarities are compared with one another in phoneme discrimination: for example, one for the voiced plosive group (/b/, /d/, /g/), one for the nasal group (/m/, /n/, /ŋ/), and so on.

The standard pattern of the surrounding information captures, as a pattern, the nature of the information around the feature part. Each phoneme group has properties common to the group. For example, the feature part of a voiced plosive is around the burst, but the burst is always preceded by a buzz interval of several frames and followed by a rapid transition into the vowel. In an unvoiced plosive, a silent interval precedes the feature part (around the burst).

In a nasal, stationary nasal frames precede the feature part (the transition to the following vowel).

The standard pattern of the surrounding information turns such properties, which are common to a phoneme group, into a standard pattern.

FIG. 3 shows an example of a concrete creation method. A surrounding-information interval is determined by setting an interval of sufficient length before and after the feature part (the hatched part in the figure). The length of the interval is set for each phoneme group. Over the surrounding-information interval $L$, $d \times J$-dimensional parameter sequences are taken $J$ frames at a time, shifting by one frame as shown in the figure, until the whole interval has been covered. This operation is applied to all the data belonging to the phoneme group, and the standard pattern for the surrounding information is obtained in the same way as the phoneme standard patterns. Data from the feature part itself is also mixed in by this procedure, but this causes no problem, because the stationary surrounding information carries far greater weight.
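The one-frame-shift windowing used to collect training vectors for the surrounding-information pattern can be sketched as follows. The window length in frames is a parameter of the sketch, and the helper names and toy data are hypothetical.

```python
from typing import List, Sequence

Frame = Sequence[float]


def sliding_windows(frames: Sequence[Frame], window: int) -> List[List[float]]:
    """Stack `window` consecutive frames into one vector, shifting by one frame."""
    return [
        [coeff for frame in frames[start:start + window] for coeff in frame]
        for start in range(len(frames) - window + 1)
    ]


def surround_training_vectors(tokens: Sequence[Sequence[Frame]],
                              window: int) -> List[List[float]]:
    """Pool the windows of every surrounding-information interval in the group."""
    vectors: List[List[float]] = []
    for frames in tokens:
        vectors.extend(sliding_windows(frames, window))
    return vectors


def mean_vector(samples: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean of the pooled vectors."""
    m = len(samples)
    return [sum(s[i] for s in samples) / m for i in range(len(samples[0]))]


# Two toy surrounding-information intervals with one coefficient per frame
tokens = [
    [[1.0], [2.0], [3.0], [4.0]],
    [[2.0], [2.0], [2.0]],
]
vectors = surround_training_vectors(tokens, window=2)
print(len(vectors))
print(mean_vector(vectors))
```

Windows that overlap the feature part are pooled together with the rest, which matches the remark in the text that feature-part data is mixed in but is outweighed by the stationary surroundings.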

Next, the phoneme recognition method is explained. First, the consonant interval is determined using segmentation parameters, and the candidate interval for the feature part is set somewhat generously, taking the rear end of the consonant interval as the reference point. Any segmentation parameters may be used; in this embodiment the power dips in the high and low frequency bands are mainly used, with the rising edge of the dip as the reference point. When the power dip is hard to capture, the voiced/unvoiced decision and nasality are used as well.
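As a small illustration of setting the candidate interval generously around the reference point, the sketch below widens the interval by fixed margins and clips it to the utterance. The margins `before` and `after` are hypothetical; the source does not state how wide the candidate interval is.

```python
from typing import Tuple


def candidate_interval(reference: int, before: int, after: int,
                       n_frames: int) -> Tuple[int, int]:
    """Return (t1, t2), inclusive frame indices around the reference frame.

    `before` and `after` are assumed margins in frames, not values from the source.
    """
    t1 = max(0, reference - before)             # do not run off the start
    t2 = min(n_frames - 1, reference + after)   # do not run off the end
    return t1, t2


# Reference point at frame 20 (for example, the rising edge of a power dip)
print(candidate_interval(reference=20, before=5, after=3, n_frames=100))
```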

Next, the method of detecting the feature part within the feature-part candidate interval and discriminating the phoneme is explained. For simplicity, the explanation below assumes that the phoneme group consists of two phonemes (phoneme 1 and phoneme 2). The idea is the same when there are more phonemes.

Let the feature-part candidate interval be $t_1$ to $t_2$. Let $x_t$ be the unknown input vector (the data to be discriminated) at time $t$ ($t_1 \le t \le t_2$); $x_t$ has the same form as (Formula 2). Let $\mu_1$ be the standard pattern (mean) of phoneme 1, $\mu_2$ the standard pattern (mean) of phoneme 2, and $\mu_e$ the standard pattern (mean) of the surrounding information, and let $\Sigma$ be the covariance matrix common to phoneme 1, phoneme 2 and the surrounding information.

$\Sigma$ is created by averaging the individual covariance matrices.

Let $L_{1,t}$ be the similarity (distance) between the unknown input at time $t$ and phoneme 1. Then

$$L_{1,t} = (x_t - \mu_1)^{T} \, \Sigma^{-1} \, (x_t - \mu_1) - (x_t - \mu_e)^{T} \, \Sigma^{-1} \, (x_t - \mu_e) \qquad \text{(Formula 4)}$$

Similarly, let $L_{2,t}$ be the distance to phoneme 2:

$$L_{2,t} = (x_t - \mu_2)^{T} \, \Sigma^{-1} \, (x_t - \mu_2) - (x_t - \mu_e)^{T} \, \Sigma^{-1} \, (x_t - \mu_e) \qquad \text{(Formula 5)}$$

The first term of each formula is the Mahalanobis distance to the phoneme, and the second term is the Mahalanobis distance to the surrounding information. The meaning of these formulas is therefore that the distance between the unknown input at time $t$ and the phoneme standard pattern, reduced by the distance to the surrounding information, is adopted as the new distance to the phoneme. (Formula 4) and (Formula 5) are evaluated over the period $t_1$ to $t_2$, and whichever of $L_{1,t}$ and $L_{2,t}$ attains the minimum during this period determines the recognized phoneme.
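The discrimination rule described above (phoneme distance minus surrounding-information distance, minimized over both the time points of the candidate interval and the phonemes) can be sketched as follows. The sketch assumes the inverse of the shared covariance matrix has been computed beforehand; the function names and the one-dimensional toy values are hypothetical.

```python
from typing import Dict, Sequence, Tuple

Vector = Sequence[float]
Matrix = Sequence[Sequence[float]]


def mahalanobis_sq(x: Vector, mean: Vector, inv_cov: Matrix) -> float:
    """Squared Mahalanobis distance (x - mean)^T inv_cov (x - mean)."""
    diff = [a - b for a, b in zip(x, mean)]
    n = len(diff)
    return sum(diff[i] * inv_cov[i][j] * diff[j] for i in range(n) for j in range(n))


def corrected_distance(x: Vector, phoneme_mean: Vector,
                       surround_mean: Vector, inv_cov: Matrix) -> float:
    """Distance to the phoneme minus distance to the surrounding information."""
    return (mahalanobis_sq(x, phoneme_mean, inv_cov)
            - mahalanobis_sq(x, surround_mean, inv_cov))


def discriminate(windows: Sequence[Vector], phoneme_means: Dict[str, Vector],
                 surround_mean: Vector, inv_cov: Matrix) -> Tuple[str, int, float]:
    """Return (phoneme, time index, distance) of the overall minimum."""
    best = None
    for t, x in enumerate(windows):            # every time point t1..t2
        for name, mean in phoneme_means.items():
            dist = corrected_distance(x, mean, surround_mean, inv_cov)
            if best is None or dist < best[2]:
                best = (name, t, dist)
    if best is None:
        raise ValueError("empty candidate interval or phoneme set")
    return best


# One-dimensional toy example with unit variance
inv_cov = [[1.0]]
phoneme_means = {"phoneme1": [0.0], "phoneme2": [4.0]}
surround_mean = [10.0]
windows = [[9.0], [1.0], [8.0]]                # inputs at t1, t1+1, t1+2
print(discriminate(windows, phoneme_means, surround_mean, inv_cov))
```

The time index returned alongside the phoneme is the detected position of the feature part, so detection and discrimination come out of the same search.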

In practice, (Formula 4) and (Formula 5) can be expanded into the following simple forms (the derivation is omitted because it is not essential):

$$L_{1,t} = a_1 \cdot x_t - b_1 \qquad \text{(Formula 4)}'$$

$$L_{2,t} = a_2 \cdot x_t - b_2 \qquad \text{(Formula 5)}'$$

$a_1$, $a_2$, $b_1$ and $b_2$ are taken as the new standard patterns, which incorporate the surrounding information.
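For reference, one way to obtain the linear form is sketched here. It is not taken from the source, which omits the derivation; it assumes a shared symmetric covariance matrix $\Sigma$, writes $\mu_p$ for the mean of phoneme $p$ and $\mu_e$ for the mean of the surrounding information, and the symbols $a_p$ and $b_p$ are chosen to match the form "weight vector times input, minus a constant".

```latex
\begin{aligned}
L_{p,t}
 &= (x_t-\mu_p)^{T}\Sigma^{-1}(x_t-\mu_p) - (x_t-\mu_e)^{T}\Sigma^{-1}(x_t-\mu_e) \\
 &= \bigl(x_t^{T}\Sigma^{-1}x_t - 2\mu_p^{T}\Sigma^{-1}x_t + \mu_p^{T}\Sigma^{-1}\mu_p\bigr)
  - \bigl(x_t^{T}\Sigma^{-1}x_t - 2\mu_e^{T}\Sigma^{-1}x_t + \mu_e^{T}\Sigma^{-1}\mu_e\bigr) \\
 &= 2(\mu_e-\mu_p)^{T}\Sigma^{-1}x_t
  - \bigl(\mu_e^{T}\Sigma^{-1}\mu_e - \mu_p^{T}\Sigma^{-1}\mu_p\bigr) \\
 &= a_p \cdot x_t - b_p ,
\end{aligned}
\qquad
a_p = 2\,\Sigma^{-1}(\mu_e-\mu_p), \quad
b_p = \mu_e^{T}\Sigma^{-1}\mu_e - \mu_p^{T}\Sigma^{-1}\mu_p .
```

The quadratic term in $x_t$ cancels only because the phoneme and the surrounding information share the same $\Sigma$, which is presumably why a common covariance matrix is used. Each phoneme then costs one inner product per time point.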

A conceptual explanation of the above method is given with reference to FIG. 4.

Consider discriminating a consonant in the situation shown in FIG. 4(a). Suppose that, for the true feature part of this consonant (the hatched part), the feature-part candidate interval $T$ has been obtained as $t_1$ to $t_2$. FIG. 4(b) shows the temporal variation of the distance to phoneme 1 and to phoneme 2 as a solid line and a broken line respectively. A, B and C mark the positions where the distance reaches a local minimum. At the true feature part (point B), the distance to phoneme 1 is smaller than that to phoneme 2, so this consonant should be discriminated as phoneme 1. Within the feature-part candidate interval obtained automatically from the segmentation parameters, however, phoneme 2 attains the minimum at point A, so as things stand the consonant would be misclassified as phoneme 2. FIG. 4(c) shows the distance between the unknown input and the standard pattern of the surrounding information; its value becomes large near the true feature part, because that standard pattern is built mainly from the surrounding information. FIG. 4(d) shows the distance to the phoneme standard patterns that incorporate the surrounding information, which is equivalent to subtracting (c) from (b). In (d) the value at point B is smaller than at point A, so the consonant is correctly discriminated as phoneme 1.
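The effect described for FIG. 4 can be reproduced with made-up numbers. The distance curves below are invented purely for illustration (they are not data from the source): index 0 plays the role of point A and index 2 the role of point B.

```python
from typing import Dict, List, Tuple


def overall_minimum(curves: Dict[str, List[float]]) -> Tuple[str, int]:
    """Return (phoneme, time index) of the smallest value over all curves."""
    return min(
        ((name, t) for name, curve in curves.items() for t in range(len(curve))),
        key=lambda item: curves[item[0]][item[1]],
    )


def subtract(curve: List[float], surround: List[float]) -> List[float]:
    """Remove the surrounding-information distance from a phoneme distance."""
    return [c - s for c, s in zip(curve, surround)]


# Invented distances over a four-frame candidate interval
raw = {
    "phoneme1": [6.0, 5.0, 3.0, 5.5],
    "phoneme2": [2.0, 6.0, 4.0, 6.0],   # spurious minimum at index 0 (point A)
}
surround = [0.5, 1.0, 5.0, 1.0]          # large near the true feature part

corrected = {name: subtract(curve, surround) for name, curve in raw.items()}

print(overall_minimum(raw))        # decision from the raw distances
print(overall_minimum(corrected))  # decision after the correction
```

With these numbers the raw distances select phoneme 2 at the spurious point, while the corrected distances select phoneme 1 at the true feature part.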


Thus, by using the method of the present invention, the true feature part can be extracted accurately from the rough feature-part candidate interval obtained with the segmentation parameters, and the phoneme can be discriminated.

The explanation above used the Mahalanobis distance, but the idea is the same for other distance measures. When (Formula 1) is used, for example, likelihood is used in place of distance and the maximum in place of the minimum. The explanation above also dealt with consonants, but the same method can be applied to other phonemes that vary over time, such as semivowels.

FIG. 5 is a block diagram for realizing the method of the present invention. Reference numeral 1 denotes a feature parameter extraction section, which analyzes the input speech and extracts feature parameters. Here it performs LPC analysis, obtains LPC cepstrum coefficients, and sends them to a similarity calculation section 2. The feature parameter extraction section 1 also obtains the high-band power and the low-band power of the input speech and sends them to a segmentation section 4. The similarity calculation section 2 calculates, frame by frame, the distance between the input parameters and each standard pattern stored in a standard pattern storage section 3. Besides the phoneme standard patterns and the surrounding-information standard patterns, the standard pattern storage section 3 also stores the voiced/unvoiced standard patterns and the standard patterns of the five vowels and the nasals, which are used for segmentation.

The segmentation section 4 determines the phoneme interval from the similarity information sent from the similarity calculation section 2 and the power information sent from the feature parameter extraction section 1. A phoneme discrimination section 5 extracts the feature part from the phoneme interval, the similarity to each phoneme and the similarity to the surrounding information, compares the phoneme similarities at the feature part to discriminate the phoneme, and outputs the result.

With this embodiment, an average recognition rate of 76.1% was obtained for word-medial consonants. The same evaluation with the conventional method gave 72.6%. Considering that the conventional method recognizes some consonants only as consonant groups, the effect of this embodiment is clearly significant.

The data used were 212 words uttered by a total of 20 male and female speakers.

To examine the effect of using the surrounding-information standard pattern, an experiment was carried out on voiced plosives and nasals. Without the surrounding-information standard pattern, the recognition rate was 7[illegible]7% for voiced plosives and 64.1% for nasals; with it, the rates improved to 74.7% and 7[illegible].2% respectively. The effect is particularly marked for nasals. This is because the power dip of a nasal is indistinct, so the phoneme interval cannot be extracted accurately. Introducing the surrounding-information standard pattern allows the feature part to be detected accurately even when the phoneme interval is unclear, and the recognition rate improves, which verifies the effect of this embodiment.
This improved to 74.7%, 7es, and 2%, respectively. In particular, a remarkable effect appears on nasal sounds. This is because the power delag of nasal sounds is unclear, so phoneme intervals cannot be extracted accurately. When the standard pattern 7 of surrounding information was introduced, even when the phoneme section was unclear, the characteristic part could be detected accurately and the recognition rate was improved, thereby verifying the effect of this example.

Effects of the Invention

In summary, the present invention provides a phoneme discrimination method characterized by: preparing standard patterns created for each phoneme from the part of the phoneme interval that well expresses the characteristics of the phoneme (hereinafter, the feature part) and a standard pattern created from the information surrounding the feature part for the group of phonemes to be discriminated; segmenting the input speech to obtain a feature-part candidate interval within the phoneme interval; performing pattern matching by applying the phoneme standard patterns and the surrounding-information standard pattern at each time point in the feature-part candidate interval; obtaining, over the whole feature-part candidate interval, the similarity to each phoneme in a form from which the influence of the surroundings of the feature part has been removed; and discriminating the phoneme by comparing the similarities within the feature-part candidate interval. It has advantages such as the following.

(a) Phonemes can be discriminated with high accuracy on automatically segmented speech.

(b) By using the phoneme standard patterns and the surrounding-information standard pattern, the part that is effective for phoneme discrimination (the feature part) can be extracted automatically and accurately, and matching can be performed on it.

Brief Description of the Drawings

FIG. 1 is a diagram explaining a conventional phoneme recognition method. FIG. 2 is a diagram explaining a conventional similarity calculation method. FIG. 3 is a diagram explaining how the surrounding-information standard pattern is created in the phoneme discrimination method according to an embodiment of the present invention. FIG. 4 is a diagram explaining how the feature part is detected and the phoneme is discriminated in the same method. FIG. 5 is a block diagram for embodying the same method.

1: feature parameter extraction section; 2: similarity calculation section; 3: standard pattern storage section; 4: segmentation section; 5: phoneme discrimination section.

Agent: Patent attorney Toshio Nakao and one other

[Drawings: FIG. 1 to FIG. 5]

Claims

(1) A phoneme discrimination method characterized by: preparing standard patterns created for each phoneme from the part of the phoneme interval that well expresses the characteristics of the phoneme (hereinafter referred to as the feature part), these being referred to as phoneme standard patterns, and a standard pattern created from the information surrounding the feature part for the group of phonemes to be discriminated, referred to as the surrounding-information standard pattern; segmenting the input speech to obtain a feature-part candidate interval within the phoneme interval; performing pattern matching by applying the phoneme standard patterns and the surrounding-information standard pattern at each time point in the feature-part candidate interval; obtaining, over the whole feature-part candidate interval, the similarity to each phoneme in a form from which the influence of the surroundings of the feature part has been removed; and discriminating the phoneme by comparing the similarities within the feature-part candidate interval.

(2) The phoneme discrimination method according to claim 1, characterized in that a statistical distance measure is used as the distance measure in the pattern matching.

(3) The phoneme discrimination method according to claim 2, characterized in that the statistical distance measure is the Mahalanobis distance.