JPH064096A

JPH064096A - Voice recognizer

Info

Publication number: JPH064096A
Application number: JP4165163A
Authority: JP
Inventors: Yasushi Yamazaki; 泰山崎; Akihiro Kimura; 晋太木村
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 1992-06-24
Filing date: 1992-06-24
Publication date: 1994-01-14
Anticipated expiration: 2017-12-03
Also published as: JP3353334B2

Abstract

(57) [Abstract] [Purpose] To improve word recognition performance in a speech recognition device. [Constitution] In a speech recognition device that matches an input speech pattern against standard patterns using the DP method and takes the standard pattern with the smallest matching distance as the recognition result, the input pattern is divided into phonemes using the matching result, the variance of the deviation between each phoneme's duration and its standard duration is calculated, and this variance is added to the matching distance to correct it. The dividing unit 6 divides the input into phonemes using the matching result, the time length deviation calculation unit 7 calculates the variance of the deviation from the standard duration, and the distance correction unit 8 corrects the matching distance. The device also has a phoneme selection unit 9 that selects the phonemes for which the time length deviation is calculated, and a word selection unit 10 that selects the words whose distance is corrected.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to a speech recognition device that recognizes a word by matching an input speech pattern against standard word patterns.

[0002]

[Prior Art] FIG. 4 is a block diagram of a conventional speech recognition device. FIG. 5 shows the matching data it uses.

[0003] The spectrum analysis unit 1 divides the input speech into fixed time intervals (frames), performs spectrum analysis on each frame using an FFT or the like, and holds the analysis result. As the unit of analysis, the frame length is about 10 milliseconds, and the band from about 200 Hz to 5000 Hz is divided into about 20 channels, with the power in each sub-band taken as the value. The band is divided by equal division, mel-scale division (division according to the sensitivity of the human ear), or the like.
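As a rough illustration of the two band-division schemes mentioned above, the following Python sketch computes the edges of 20 channels between 200 Hz and 5000 Hz, once with equal spacing and once with mel spacing. The mel mapping used here is one common approximation and is an assumption, since the description does not say which mapping is used; the function names are hypothetical.

```python
import math

def hz_to_mel(f_hz: float) -> float:
    # One common mel approximation (assumed; the description names no formula).
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m: float) -> float:
    # Inverse of hz_to_mel.
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def band_edges(f_lo: float, f_hi: float, channels: int, mel: bool = False) -> list[float]:
    """Return channels + 1 band edges in Hz, equally spaced in Hz or in mel."""
    if mel:
        lo, hi = hz_to_mel(f_lo), hz_to_mel(f_hi)
        return [mel_to_hz(lo + (hi - lo) * c / channels) for c in range(channels + 1)]
    return [f_lo + (f_hi - f_lo) * c / channels for c in range(channels + 1)]

equal_edges = band_edges(200.0, 5000.0, 20)           # 240 Hz per channel
mel_edges = band_edges(200.0, 5000.0, 20, mel=True)   # narrow low bands, wide high bands
```

With mel spacing the low-frequency channels are narrower than the high-frequency ones, which is the sense in which the division follows the sensitivity of the ear.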

[0004] As shown in FIG. 5(A), the analysis result is denoted A = {a_ij}, where i is the input frame number and j is the frequency division number (channel number).

[0005] As shown in FIG. 5(B), the phoneme template storage unit 2 holds, as templates, standard speech patterns processed with the same division method as the input, one set for each phoneme or phoneme-like speech unit (hereinafter, phoneme).

[0006] Phonemes are divided into about 20 categories, such as vowels (A, I, U, E, O) and consonants (K, S, T, N, H, M, ...), and about 10 templates are prepared for each category to cover word-initial and word-final positions and the variation caused by neighboring phonemes.

[0007] As shown in FIG. 5(B), the templates are denoted B = {b_klj}, where k is the phoneme category (type) number, l is the template number within the category, and j is the frequency division number.

[0008] The duration storage unit 3 stores, for each phoneme with category number k, the minimum and maximum durations {s_k, t_k}, the mean duration (standard duration) {v_k}, and, as shown in FIGS. 5(C) and 5(D), the duration weight distribution G = {g_km}, where m is the duration in frames. These are collectively called the standard duration distribution.

[0009] As shown in FIG. 5(E), the word model storage unit 4 stores the phoneme sequence C = {c_wn} of each word model, where w is the word number and n is the phoneme number within the word model.
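One possible in-memory layout for the stored data B, {s_k, t_k}, {v_k}, G and C is sketched below in Python. The class names, the two-channel templates, the duration limits, and the weight function `abs(m - v)` are all hypothetical values chosen for illustration; the description does not give concrete templates or weights.

```python
from dataclasses import dataclass

@dataclass
class PhonemeCategory:
    """Everything stored for one phoneme category k."""
    name: str
    templates: list[list[float]]   # b_klj: L templates, each with J channel values
    s: int                         # minimum duration s_k (frames)
    t: int                         # maximum duration t_k (frames)
    v: float                       # standard (mean) duration v_k (frames)
    g: dict[int, float]            # g_km: weight for each duration m in s..t

@dataclass
class WordModel:
    """One word w as its phoneme sequence c_wn, given as category names."""
    name: str
    phonemes: list[str]

# Hypothetical values for illustration only (2 channels, 1 template per category).
categories = {
    "O": PhonemeCategory("O", [[0.9, 0.1]], s=5, t=30, v=12.0,
                         g={m: abs(m - 12.0) for m in range(5, 31)}),
    "T": PhonemeCategory("T", [[0.2, 0.7]], s=2, t=10, v=4.0,
                         g={m: abs(m - 4.0) for m in range(2, 11)}),
}
word = WordModel("OTO", ["O", "T", "O"])  # hypothetical word model
```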

[0010] The matching unit 5 matches the input speech against each word model and takes the most similar one as the recognition result. The difficulty is that the duration of the same word differs from utterance to utterance, so nonlinear pattern matching along the time axis is required.

[0011] The DP (dynamic programming) method is used to align the input pattern A with the standard pattern C_w step by step along the time axis. FIG. 6 is a conceptual diagram of matching by the DP method. To find the time-series correspondence between the input pattern A and the standard pattern C_w, the time axis of A is stretched and compressed to find the path that passes through the points where frames intersect (grid points) and connects the whole with the shortest total length. To do this, the accumulated total distance is minimized, based on the distance between an input frame and a phoneme template (the local distance). This is done for every word model, and the word with the smallest distance is taken as the result.

[0012] For each word model, the local distance d_in is the distance between input frame i and the nearest phoneme template of node n (the n-th phoneme) of the word model. With k the category number of the phoneme at that node and L the number of templates in that category,

$$ d_{in} = \min_{1 \le l \le L} \operatorname{dist}\left(a_{i}, b_{kl}\right) \qquad \text{(Equation 1)} $$

(k: category number of the n-th phoneme of the word model), where a_i and b_kl are the channel vectors (a_ij) and (b_klj) over j. The frame-to-template distance measure dist is not specified in this text.
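The nearest-template rule for the local distance can be sketched in Python as follows. The Euclidean distance between channel vectors is an assumption made for this sketch, since the text here does not state which distance measure is used, and `local_distance` is a hypothetical name.

```python
import math

def local_distance(frame: list[float], templates: list[list[float]]) -> float:
    """d_in: distance from one input frame to the nearest of the L templates
    of the category at node n (Euclidean distance assumed)."""
    return min(math.dist(frame, template) for template in templates)

# Two-channel toy data: the frame coincides with the first template.
d_exact = local_distance([1.0, 2.0], [[1.0, 2.0], [4.0, 6.0]])
d_near = local_distance([0.0, 0.0], [[3.0, 4.0], [6.0, 8.0]])
```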

[0013] By controlling the duration, the DP path takes the form shown in FIG. 6. The value at each grid point is the shortest accumulated distance needed to reach it. The accumulated distance D_in to each grid point, written here in the three-term form implied by the term-by-term description in paragraphs [0014] and [0015], is

$$ D_{in} = \min_{s_k \le y \le t_k} \left[ \sum_{i'=i-y+1}^{i} d_{i'n} + g_{ky} + D_{i-y,\,n-1} \right] \qquad \text{(Equation 2)} $$

where y is the duration (number of frames). The initial values of the accumulated distance are

D_00 = 0, D_i0 = ∞ (i = 1, ..., I), D_0n = ∞ (n = 1, ..., N).

[0014] The first term of Equation 2 expresses that, taking the accumulated distance D_in at grid point (i, n) as the reference, the duration constraint (s_k to t_k) limits the grid points from which this point can be reached. That is, for a path reaching a given grid point in FIG. 6, the immediately preceding grid point is limited to those allowed by this constraint; all others are excluded.

[0015] The second term is the deviation from the mean (standard) duration, converted into a distance as a weight, and the third term is the accumulated distance up to the preceding grid point. The matching distance to word model w is the accumulated distance through the final input frame and the final phoneme of the word model:

D_w = D_IN (w: word number; I: final input frame; N: final phoneme of the word model)

The matching result is D = {D_w}.

[0016] Among the matching results, the word model with the smallest matching distance is output as the recognition result.
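The duration-controlled DP matching described in paragraphs [0013] to [0015] can be sketched in Python as follows. This is a minimal sketch under assumptions, not the patented implementation: it assumes that the accumulated distance at grid point (i, n) is the minimum, over durations y between s_k and t_k, of the local distances of the last y frames, plus the duration weight g_ky, plus the accumulated distance at (i - y, n - 1). The name `dp_match` and the argument layout are hypothetical.

```python
import math

def dp_match(local, node_durations):
    """
    local[i][n]: local distance for frame i and node n (both 0-based here).
    node_durations[n]: (s, t, g) for node n, where g maps duration y -> weight.
    Returns (D_w, Y); Y[i][n] is the best duration of node n (1-based)
    when that node ends at frame i (1-based).
    """
    I, N = len(local), len(node_durations)
    D = [[math.inf] * (N + 1) for _ in range(I + 1)]
    Y = [[0] * (N + 1) for _ in range(I + 1)]
    D[0][0] = 0.0  # initial value; every other D[i][0] and D[0][n] stays infinite
    for n in range(1, N + 1):
        s, t, g = node_durations[n - 1]
        for i in range(1, I + 1):
            segment = 0.0
            for y in range(1, min(t, i) + 1):
                segment += local[i - y][n - 1]  # adds frame i - y + 1 (1-based)
                if y < s:
                    continue  # shorter than the minimum duration s_k
                candidate = segment + g[y] + D[i - y][n - 1]
                if candidate < D[i][n]:
                    D[i][n], Y[i][n] = candidate, y
    return D[I][N], Y

# Toy input: node 1 fits frames 1-3, node 2 fits frames 4-5; weights set to zero.
toy_local = [[0, 1], [0, 1], [0, 1], [1, 0], [1, 0]]
flat = {y: 0.0 for y in range(1, 5)}
toy_distance, toy_Y = dp_match(toy_local, [(1, 4, flat), (1, 4, flat)])
```

Running this once per word model and taking the smallest returned distance corresponds to the conventional recognition step; the table Y is what a later backtracking step needs to recover the phoneme boundaries.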

[0017]

[Problem to Be Solved by the Invention] The following problem can arise when words are matched with the above prior art. For example, when the utterance "Oota" is input, it may be misrecognized as "OOITA" rather than "OOTA". This happens when the matching distance D_1 to "OOITA" (hereinafter word number 1), say 50, is smaller than the matching distance D_2 to "OOTA" (hereinafter word number 2), say 60, so that "OOITA" is obtained as the recognition result. FIGS. 7(A) and 7(B) show example matching results between the input and each word in this case. The durations of the phonemes of "OOITA" and "OOTA" are "10, 10, 10, 7, 23" and "15, 15, 7, 23", respectively.

[0018] (Unit: frames; 1 frame = 10 milliseconds) Suppose the standard durations of the phonemes are

(O) v_5 = 12, (I) v_2 = 12, (T) v_8 = 4, (A) v_1 = 20 (unit: frames).

The deviation from the standard duration is

$$ Z_{wn} = v_k - \mathrm{dur}_n \qquad \text{(Equation 3)} $$

(w: word number; n: node number in the word model; k: category number of the n-th phoneme of the word model; v_k: standard duration of category k; dur_n: duration when taken as the n-th phoneme of the word model). The deviations are then as follows (unit: frames):

Word 1 "OOITA"            Word 2 "OOTA"
(O) Z_11 = 12 - 10 = 2    (O) Z_21 = 12 - 15 = -3
(O) Z_12 = 12 - 10 = 2    (O) Z_22 = 12 - 15 = -3
(I) Z_13 = 12 - 10 = 2    (T) Z_23 = 4 - 7 = -3
(T) Z_14 = 4 - 7 = -3     (A) Z_24 = 20 - 23 = -3
(A) Z_15 = 20 - 23 = -3

These values are plotted in FIG. 7(C).
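The deviations listed above follow directly from Equation 3. A small Python sketch, using the standard durations and segment durations given in this example (the function and variable names are hypothetical):

```python
# Standard durations v_k in frames, as given in the example above.
STANDARD_DURATION = {"O": 12, "I": 12, "T": 4, "A": 20}

def duration_deviations(phonemes, durations, standard=STANDARD_DURATION):
    """Z_wn = v_k - dur_n for each node n of one word model (Equation 3)."""
    return [standard[p] - d for p, d in zip(phonemes, durations)]

z_ooita = duration_deviations("OOITA", [10, 10, 10, 7, 23])  # [2, 2, 2, -3, -3]
z_oota = duration_deviations("OOTA", [15, 15, 7, 23])        # [-3, -3, -3, -3]
```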

[0019] This shows that when the input speech is uttered faster or slower than usual, the deviations of the phonemes from their standard durations all lie in one direction, whereas when the input is matched against a different word, the direction (and size) of the deviations varies.

[0020] As described above, even when the matching distance is small, the matching result may be wrong if the durations vary widely around the mean. The object of the present invention is to realize a speech recognition device with a higher recognition rate by taking this variation of the durations around the mean into account.

[0021]

[Means for Solving the Problem] FIG. 1 is a block diagram of the principle of the present invention. In addition to the components of the conventional speech recognition device, the device comprises a dividing unit 6 that divides the input speech pattern into phonemes, a time length deviation calculation unit 7 that calculates the variance of the deviations from the standard durations, and a distance correction unit 8 that corrects the matching distance.

[0022]

[Operation] One cause of incorrect recognition results in the conventional speech recognition device is that the matching distance is used without considering how the deviation from the standard duration varies from phoneme to phoneme.

[0023] To solve this problem, the variance SD_w of the deviations of the phonemes from their standard durations is calculated,

$$ SD_w = \frac{1}{N} \sum_{n=1}^{N} \left( Z_{wn} - \overline{Z}_w \right)^2 \qquad \text{(Equation 4)} $$

(\overline{Z}_w = ave Z_w: mean of the deviations Z_wn from the standard duration; N: number of phonemes in word model w), and is added to the conventional matching distance D_w as a correction distance.

[0024]

$$ ND_w = D_w + k \, SD_w \qquad \text{(Equation 5)} $$

(k: proportionality constant). This yields a new matching distance ND_w that takes the variation of the duration deviations into account and represents the similarity more accurately.
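The variance and the corrected distance can be sketched in Python as follows. The sketch assumes the population variance (division by the number of phonemes), which is the reading consistent with the value SD_1 = 6 used in the embodiment; the deviations, the distances 50 and 60, and k = 2 are the values of the document's own "OOITA" / "OOTA" example. The function names are hypothetical.

```python
def deviation_variance(z):
    """SD_w: population variance of the deviations Z_wn of one word model."""
    mean = sum(z) / len(z)
    return sum((x - mean) ** 2 for x in z) / len(z)

def corrected_distance(d_w, z, k):
    """ND_w = D_w + k * SD_w."""
    return d_w + k * deviation_variance(z)

sd_ooita = deviation_variance([2, 2, 2, -3, -3])   # scattered deviations
sd_oota = deviation_variance([-3, -3, -3, -3])     # uniform deviations
nd_ooita = corrected_distance(50, [2, 2, 2, -3, -3], k=2)
nd_oota = corrected_distance(60, [-3, -3, -3, -3], k=2)
```

A uniformly fast or slow utterance shifts every deviation by the same amount, so its variance stays near zero and its distance is barely penalized.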

[0025] FIG. 3 illustrates the operation of the dividing unit 6. FIG. 3(A) shows the DP path when the input is matched against "OOITA". Each grid point shows the minimum accumulated distance D_in up to that point, and the path with the shortest overall accumulated distance is indicated by boxes and arrows.

[0026] FIG. 3(B) shows, for each grid point, the value Y_in of the duration y at which the accumulated distance D_in of Equation 2 is minimized. The dividing unit 6 receives Y_in from the matching unit 5 and stores it. After matching, the input is divided into phonemes by tracing the path backward from the end (word end) to the beginning (word start), as shown in FIG. 3(B). This yields the phoneme durations {dur_n} corresponding to the nodes of the word model.
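The backward trace performed by the dividing unit can be sketched in Python as follows. The sketch assumes that Y is stored as a table indexed by 1-based frame number i and 1-based node number n, holding the best duration of node n when it ends at frame i; this layout and the name `backtrace_durations` are hypothetical. The test table encodes the "OOTA" segmentation (15, 15, 7, 23 frames) from the earlier example.

```python
def backtrace_durations(Y, num_frames, num_nodes):
    """Walk from the final grid point (I, N) back to the start, collecting dur_n."""
    durations = [0] * num_nodes
    i = num_frames
    for n in range(num_nodes, 0, -1):
        y = Y[i][n]            # duration chosen for node n when it ends at frame i
        durations[n - 1] = y
        i -= y                 # node n - 1 must end where node n begins
    if i != 0:
        raise ValueError("path does not reach the start of the input")
    return durations

# Build a sparse Y table for a 60-frame input matched against a 4-node word.
I, N = 60, 4
Y = [[0] * (N + 1) for _ in range(I + 1)]
Y[60][4], Y[37][3], Y[30][2], Y[15][1] = 23, 7, 15, 15
durations = backtrace_durations(Y, I, N)
```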

[0027] The time length deviation calculation unit 7 uses Equation 3 to calculate the difference Z_wn between the duration dur_n of each divided phoneme and the standard duration v_k obtained from the duration storage unit 3, and then uses Equation 4 to obtain the variance SD_w of the deviations.

[0028] The distance correction unit 8 uses Equation 5 to calculate the matching distance ND_w corrected by adding the variance SD_w of the deviations, and takes the word with the smallest corrected distance as the recognition result. This configuration gives a more accurate recognition result.

[0029] In practice, when obtaining the variance SD_w of the deviations, the processing time can be shortened by limiting the phonemes or the words considered.

[0030]

[Embodiment] An embodiment of the present invention is described in detail below with reference to the drawings. FIG. 2 is a block diagram of the embodiment. Components with the same functions as in FIGS. 1 and 4 are given the same reference numerals.

[0031] In FIG. 2, 1 is the spectrum analysis unit, which consists of an AD converter, an FFT arithmetic circuit and the like, and a storage unit that stores the analysis result. 2 is the phoneme template storage unit, 3 is the duration storage unit, and 4 is the word model storage unit; these are placed, together with the analysis-result storage unit, in the storage of an EWS (engineering workstation). 5 is the matching unit, 6 is the dividing unit, 7 is the time length deviation calculation unit, and 8 is the distance correction unit; these are implemented by the processor and software of the EWS.

[0032] The operating procedure is as follows.

(1) The spectrum analysis unit 1 performs spectrum analysis and holds the result A = {a_ij}. As the unit of analysis, the frame length is about 10 milliseconds, and the band from about 200 Hz to 5000 Hz is divided into about 20 channels, with the power in each sub-band taken as the value.

(2) The matching unit 5 performs DP matching using the phoneme templates B = {b_klj}, the word models C = {c_wn}, the minimum and maximum durations {s_k, t_k}, and the weights G = {g_km} stored in the phoneme template storage unit 2, the duration storage unit 3, and the word model storage unit 4, and obtains D = {D_w}. There are about 20 categories, about 10 templates, and about 1000 word models. Up to this point the procedure is the same as in the prior art.

(3) Next, the dividing unit 6 uses the matching result to divide the input speech into phonemes as shown in FIG. 3 and determines the duration of each phoneme.

(4) The time length deviation calculation unit 7 calculates the variance of the deviations of the phonemes from the standard.

(5) The distance correction unit 8 corrects the matching distance using the result calculated by the time length deviation calculation unit 7.

[0033] For example, in the "Oota" case described above, the deviations for "OOITA" are scattered, so its distance is increased, while the deviations for "OOTA" show no scatter at all, so its distance is left unchanged. Concretely, the means of the deviations are ave Z_1 = 0 and ave Z_2 = -3 (ave Z: mean of the deviations Z from the standard duration), so that SD_1 = (3 × 2² + 2 × (-3)²) / 5 = 6 and SD_2 = 0. With k = 2, the matching distances become

ND_1 = D_1 + k SD_1 = 50 + 2 × 6 = 62
ND_2 = D_2 + k SD_2 = 60 + 2 × 0 = 60

Since no distance is smaller than ND_2 (the matching distances D_3 onward to the other word models are assumed to be much larger), the matching result is correctly "Oota".

[0034] This embodiment additionally includes a phoneme selection unit 9 and a word selection unit 10. Either one alone may of course be used. The phoneme selection unit 9 limits the phonemes for which the deviation from the standard is calculated. This is because deviations appear more prominently in vowels and similar sounds than in short-duration consonants. For example, limiting the calculation to vowels shortens the processing time.
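Restricting the deviation calculation to vowels, as the phoneme selection unit does in this example, can be sketched in Python as follows. The vowel set matches the five vowel categories named earlier in the description; the function name is hypothetical.

```python
VOWELS = frozenset("AIUEO")

def select_deviations(phonemes, deviations, allowed=VOWELS):
    """Keep only the deviations of phonemes chosen for the variance calculation."""
    return [z for p, z in zip(phonemes, deviations) if p in allowed]

# Deviations of "OOITA" from the earlier example; the consonant T is dropped.
vowel_deviations = select_deviations("OOITA", [2, 2, 2, -3, -3])
```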

[0035] The word selection unit 10 limits the words for which distance correction is performed. Distance correction need not be performed for every word model; it is sufficient to perform it for those ranked highest by matching distance D_w (those likely to be the correct recognition result). If only the top few words are corrected, little processing time is needed.

[0036] For example, distance correction may be performed for:

(1) words whose distance D_w is at or below a threshold;
(2) the top n words in ascending order of distance D_w;
(3) words whose difference in distance D_w from the first-ranked word is at or below a threshold; or
(4) words whose difference in distance D_w from the word ranked one place higher is at or below a threshold.
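The four selection rules can be sketched in Python as follows. The function names, thresholds, and the distances for "OKITA" and "OSAKA" are hypothetical; only the distances 50 and 60 come from the document's example. The last rule is read here as walking down the ranking and stopping at the first gap that exceeds the threshold, which is one interpretation of the text. The sketch assumes a non-empty set of candidates.

```python
def ranked(dist):
    """Words sorted by ascending matching distance D_w."""
    return sorted(dist.items(), key=lambda kv: kv[1])

def by_absolute_threshold(dist, threshold):
    return [w for w, d in ranked(dist) if d <= threshold]

def by_rank(dist, n):
    return [w for w, _ in ranked(dist)[:n]]

def by_gap_to_best(dist, threshold):
    order = ranked(dist)
    best = order[0][1]
    return [w for w, d in order if d - best <= threshold]

def by_gap_to_previous(dist, threshold):
    order = ranked(dist)
    chosen = [order[0][0]]
    for (_, prev), (w, d) in zip(order, order[1:]):
        if d - prev > threshold:
            break  # stop at the first gap larger than the threshold
        chosen.append(w)
    return chosen

distances = {"OOITA": 50, "OOTA": 60, "OKITA": 75, "OSAKA": 140}
```

Only the words returned by the chosen rule would then go through division, variance calculation, and distance correction.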

[0037]

[Effect of the Invention] As described in detail above, according to the present invention, reflecting the duration deviations in the matching distance as post-processing of the conventional matching method enables more precise matching and realizes a speech recognition device with a high recognition rate.

[Brief description of drawings]

[FIG. 1] Block diagram of the principle of the present invention.

[FIG. 2] Block diagram of an embodiment of the present invention.

[FIG. 3] Diagram explaining the operation of the dividing unit.

[FIG. 4] Block diagram of a conventional speech recognition device.

[FIG. 5] Diagram showing the matching data.

[FIG. 6] Conceptual diagram of matching by the DP method.

[FIG. 7] Diagram showing an example of matching results.

[Explanation of symbols]

1: spectrum analysis unit
2: phoneme template storage unit
3: duration storage unit
4: word model storage unit
5: matching unit
6: dividing unit
7: time length deviation calculation unit
8: distance correction unit
9: phoneme selection unit
10: word selection unit

Claims

[Claims]

1. A speech recognition device having a spectrum analysis unit (1) that performs spectrum analysis of input speech and stores the result as feature time-series data, a phoneme template storage unit (2) that stores feature data of phonemes or phoneme-like speech units, a duration storage unit (3) that stores the standard duration distribution of the phonemes or phoneme-like speech units, a word model storage unit (4) that stores models of words or word-like speech units, and a matching unit (5) that matches the spectrum analysis result of the input speech against the word models while controlling duration using the phoneme templates and the duration distribution, the speech recognition device being characterized by comprising: a dividing unit (6) that divides the input speech into phonemes or phoneme-like speech units using the matching result; a time length deviation calculation unit (7) that calculates the difference between the duration of each divided phoneme or phoneme-like speech unit and its standard duration; and a distance correction unit (8) that corrects the matching distance using the calculation result.

2. The speech recognition device according to claim 1, further comprising a phoneme selection unit (9) that specifies the phonemes or phoneme-like speech units for which the difference between the duration and the standard duration is to be calculated.

3. The speech recognition device according to claim 1, further comprising a word selection unit (10) that selects, as the words or word-like speech units subject to distance correction, those whose matching distance is at or below a predetermined threshold.

4. The speech recognition device characterized by having a word selection unit (10) that selects, as the words or word-like speech units subject to distance correction, those whose rank in the matching result is within a predetermined rank.

5. The speech recognition device according to claim 1, characterized by having a word selection unit (10) that selects, as the words or word-like speech units subject to distance correction, those whose difference in matching distance from the first-ranked word in the matching result is at or below a predetermined threshold.

6. The speech recognition device according to claim 3, further comprising a word selection unit (10) that selects, as the words or word-like speech units subject to distance correction, those up to the word whose difference in matching distance from the word ranked one place higher in the matching result is at or below a predetermined threshold.