JPH11194792A

JPH11194792A - Speech recognition device, speech recognition method, and recording medium recording the method

Info

Publication number: JPH11194792A
Application number: JP10000117A
Authority: JP
Inventors: Akio Amano; 明雄天野; Toshiyuki Odaka; 俊之小高; Yasunari Obuchi; 康成大淵
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 1998-01-05
Filing date: 1998-01-05
Publication date: 1999-07-21

Abstract

(57) [Abstract] [Problem] To realize large-vocabulary speech recognition with a small amount of processing and without degradation of recognition accuracy. [Solution] Speech input from speech input means 1 is analyzed by speech analysis means 2, which outputs a time series of feature vectors, and speech detection means 3 determines the speech interval. Acoustic matching means 4 matches the feature vector time series against the standard patterns for basic speech units stored in standard pattern storage means 5, and word evaluation means 6 evaluates the recognition targets based on the matching results. Acoustic matching means 4 matches each standard pattern over the entire interval of the feature vector time series of the input speech and obtains a matching result as a time series for each standard pattern. Word evaluation means 6 evaluates each word based on the matching result time series obtained for each standard pattern and on the information in word dictionary 7, which describes each recognition-target word as a sequence of basic speech units. The recognition result is obtained according to the evaluation results.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to speech recognition technology that uses basic units of spoken language, such as syllables and phonemes (consonants and vowels), as standard patterns. In particular, it relates to a word/sentence speech recognition device that realizes large-vocabulary speech recognition with a small amount of processing in speech recognition where the standard patterns consist of occurrence probability distributions of feature vectors, to a microcomputer device therefor, to a speech recognition method, and to a computer-readable recording medium on which the recognition method is recorded.

[0002]

[Prior Art] In a speech recognition device, particularly one in which the standard patterns consist of occurrence probability distributions of feature vectors, probability calculation accounts for most of the recognition processing. In ordinary speech recognition methods, the number of probability calculations is proportional to the number of words to be recognized, so large-vocabulary speech recognition requires an enormous amount of processing, and large-scale hardware has been needed to achieve real-time recognition. Several methods have been proposed to reduce this processing load. Representative methods are described below.

[0003] The first prior art is a technique called "beam search" (see IEICE Transactions D, Vol. J71-D, No. 9 (September 1988), pp. 1650-1659). Beam search abandons the computation partway for recognition candidates that are judged unlikely during the computation. One variant continues the computation only for a fixed number of the most likely candidates; another sets a threshold on the recognition score and continues only for candidates at or above the threshold. With either variant, the computation is reduced by a fixed ratio relative to computing all recognition candidates.
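The two pruning styles described above (keep a fixed number of the most likely candidates, or keep candidates at or above a threshold) can be sketched as follows. This is a minimal illustration rather than the cited paper's implementation; the function names, the candidate words, and the log-score values are all hypothetical.

```python
def prune_top_k(scores, k):
    """Keep the k highest-scoring candidates (ties broken by name)."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return dict(ranked[:k])


def prune_by_threshold(scores, threshold):
    """Keep candidates whose partial score is at or above the threshold."""
    return {word: s for word, s in scores.items() if s >= threshold}


# Hypothetical partial log-scores for four candidate words.
partial = {"tokyo": -12.0, "kyoto": -15.5, "osaka": -30.2, "nara": -41.0}
survivors_k = prune_top_k(partial, 2)
survivors_t = prune_by_threshold(partial, -20.0)
```

Either way, the pruned candidates are never scored to the end, which is why the optimal result is no longer guaranteed.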

[0004] In contrast to beam search, which abandons the computation partway, the technique that carries the computation through to the end for all candidates is called "full search". Because full search computes every candidate to the end, it is guaranteed to yield the optimal solution. A candidate judged unlikely partway through the computation may still turn out to be the top-ranked correct answer if the computation is continued to the end, so beam search, which abandons the computation partway, does not guarantee the optimal solution.

[0005] The second prior art first performs acoustic-level processing to obtain phoneme or syllable recognition results and then applies linguistic processing to those results to obtain the final recognition result (see the 16th Applied Informatics Research Center Symposium "Current and Future of Speech Recognition", Applied Informatics Research Center, Tohoku University, '90.5, 29-30). In this approach, recognition is performed in units of phonemes or syllables, the results are obtained as multiple hypotheses at the phoneme or syllable level, such as a phoneme lattice or syllable lattice, and the lattice is matched against a word dictionary to obtain the recognition result. The matching performed here is symbol-level matching, which requires far less processing than acoustic-level matching that involves probability calculations. With this method, acoustic matching is needed only for the number of phonemes or syllables, which greatly reduces the amount of computation. However, because a decision is made at the acoustic matching level, if the correct candidate is not contained in the phoneme or syllable lattice, no processing at the dictionary matching level can recover the correct answer.

[0006]

[Problems to Be Solved by the Invention] As described above, the first prior art has the advantage of reducing the processing amount by a certain fixed ratio, but acoustic matching still grows in proportion to the number of words to be recognized. The second prior art obtains recognition results in phoneme or syllable units and therefore has the advantage that the acoustic matching load can be held to a constant amount. However, because results are decided per phoneme or per syllable in the acoustic processing, no final result can be obtained for a hypothesis that is dropped from the candidates at that stage. An object of the present invention is to solve these problems and to provide a word speech recognition device and a sentence speech recognition device, a microcomputer device therefor, a speech recognition method for words and sentences, and a computer-readable recording medium on which the recognition method is recorded, in which a final evaluation result is obtained for all recognition-target hypotheses and the acoustic matching load is not proportional to the number of words to be recognized but is held to a constant amount, so that the processing amount is small and recognition accuracy suffers little degradation.

[0007]

[Means for Solving the Problems] To achieve the above object, the word speech recognition of the present invention matches the standard pattern of each basic speech unit, such as a syllable, syllable chain, or phoneme, over the entire interval of the feature vector time series of the input speech and obtains a matching result as a time series for each standard pattern. It then evaluates each word based on a word dictionary, in which words are described as sequences of basic speech units such as syllables, syllable chains, or phonemes, and on the matching result time series obtained for each standard pattern, and thereby obtains the recognition result.

[0008] The sentence speech recognition of the present invention additionally stores a grammar that describes each recognition-target sentence as a sequence of words, and evaluates each sentence based on the matching results for each standard pattern, the word dictionary, and the grammar to obtain the recognition result.

[0009] Furthermore, the microcomputer device of the present invention is constituted by mounting the means required for the above recognition on a semiconductor chip, and the storage medium of the present invention is a CD-ROM or the like on which the procedure (steps) for performing the above word or sentence speech recognition is recorded as program code.

[0010]

[Embodiments of the Invention] Embodiments of the present invention are described in detail below with reference to the drawings. Various units, such as syllables, syllable chains, and phonemes, can serve as the unit of the standard pattern; for simplicity, the case in which the syllable is the unit is described here. The description below deals mainly with word speech recognition, but it goes without saying that the present invention can also be applied to sentence speech recognition. That is, by using a grammar that describes each recognition-target sentence as a sequence of words, in addition to the same syllable-unit standard patterns and word dictionary as in the embodiment below, and by combining the syllable-level matching results on the basis of the word dictionary and the grammar, continuously uttered sentences and conversational speech can also be recognized.

[0011] FIG. 1 is a functional block diagram of an embodiment of the word speech recognition device of the present invention. Input speech is first converted into an electric signal by speech input means 1. The converted signal is then analyzed by speech analysis means 2, which outputs a time series of feature vectors. This feature vector time series is input to speech detection means 3. The feature vector time series of the interval that speech detection means 3 judges to be a speech interval is input to acoustic matching means 4. In acoustic matching means 4, the input feature vector time series is matched against all syllable standard patterns stored in standard pattern storage means 5, and a time series of matching results is obtained for each syllable standard pattern. Word evaluation means 6 evaluates each word based on the matching result time series of each syllable standard pattern and the word dictionary information stored in word dictionary 7, and obtains an evaluation value for each word. Decision means 8 determines and outputs the final recognition result based on the evaluation values given to the words; for example, it outputs the five candidate words with the highest evaluation values.

[0012] In the speech recognition device of the present invention, the standard patterns may be prepared as feature vector time series, with matching realized by pattern matching, or they may be prepared as time series of output probability distributions of feature vectors, with matching realized by probability calculation. Here, the latter method based on output probability distributions of feature vectors, that is, an implementation based on hidden Markov models (HMMs), is described.

[0013] First, the HMM is briefly described with reference to FIG. 2. FIG. 2 is a diagram for explaining the hidden Markov model (HMM) of a basic recognition unit used in the present invention. In the figure, each circle represents a state and each arrow represents a transition between states. The symbol a(i,j) attached to an arrow denotes the probability that a transition from state i to state j occurs, and the symbol b(i,j,v) denotes the probability that feature vector v is output when the transition from state i to state j occurs.

[0014] HMMs are broadly divided into two types according to how b(i,j,v) is represented: discrete output probability distribution HMMs and continuous output probability distribution HMMs. In a discrete output probability distribution HMM, the feature vector v is vector-quantized, the value of b(i,j,v) is computed in advance for each quantization code and stored in a table, and the probability is obtained by table lookup. In a continuous output probability distribution HMM, a distribution function is assumed and the probability is obtained by evaluating that function with the feature vector v. The Gaussian distribution is commonly used as the distribution function. When Gaussian distributions are used, b(i,j,v) is obtained by equation (1).

[Equation 1] where v, μ: column vectors; t: transpose; Σ: matrix (covariance matrix); |Σ|: determinant of Σ
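The body of equation (1) does not survive in this text. Given the symbol definitions above and the description of b(i,j,v) as a weighted sum of Gaussian distributions, a standard Gaussian mixture density would take the form below. This is a reconstruction under assumptions: the mixture weights λ_m and the feature dimension n are assumed notation, and the parameters μ_m and Σ_m are understood to belong to the transition from state i to state j.

```latex
b(i,j,v) \;=\; \sum_{m=1}^{M} \lambda_m \,
  \frac{1}{(2\pi)^{n/2}\,|\Sigma_m|^{1/2}}
  \exp\!\left( -\tfrac{1}{2}\,(v-\mu_m)^{t}\,\Sigma_m^{-1}\,(v-\mu_m) \right),
\qquad \sum_{m=1}^{M} \lambda_m = 1
```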

[0015] Equation (1) shows the case in which the probability distribution is expressed as a weighted sum of multiple Gaussian distributions. The probability distribution is sometimes expressed by a single Gaussian distribution, but for speaker-independent speech recognition a weighted sum of multiple Gaussian distributions is generally used. b(i,j,v) is the occurrence probability (or probability density) associated with each state transition when feature vector v is obtained; in acoustic matching, the transition probabilities a(i,j) are also used to compute the accumulated probability of each HMM state. The accumulated probability of each state can be computed efficiently by dynamic programming, for example by the method known as the Viterbi algorithm. Equations (2) to (4) give the recurrence for the Viterbi computation. Here, γ(i,t) is the probability of having observed the feature vector time series V1, V2, ..., Vt and being in the i-th state of the HMM.

[0016] The accumulated probability γ(i,t) of each HMM state can be obtained by the recurrence of equations (2) to (4). The series of processing from the probability calculation of equation (1) through the probability accumulation by the recurrence of equations (2) to (4) is the processing performed by acoustic matching means 4. Standard pattern storage means 5 stores an HMM such as that described with FIG. 2 for each basic speech unit, such as a syllable, syllable chain, or phoneme, and each time acoustic matching means 4 obtains a feature vector v, it performs the probability accumulation for all HMMs stored in standard pattern storage means 5.
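Equations (2) to (4) are not reproduced in this text, so the following is only a generic sketch of Viterbi accumulation using the symbols a(i,j), b(i,j,v) and γ(i,t) defined above. It works in the log domain and assumes that every path starts in state 0; the function name, the toy two-state model, and its probabilities are hypothetical.

```python
import math

NEG_INF = float("-inf")


def viterbi_accumulate(log_a, log_b, observations):
    """Return gamma[t][j]: best log-probability of being in state j after
    consuming the first t observations, starting in state 0 at t = 0."""
    n_states = len(log_a)
    gamma = [[NEG_INF] * n_states for _ in range(len(observations) + 1)]
    gamma[0][0] = 0.0  # log(1): the path starts in state 0
    for t, v in enumerate(observations, start=1):
        for j in range(n_states):
            best = NEG_INF
            for i in range(n_states):
                if log_a[i][j] == NEG_INF or gamma[t - 1][i] == NEG_INF:
                    continue  # transition not allowed or state unreachable
                cand = gamma[t - 1][i] + log_a[i][j] + log_b(i, j, v)
                best = max(best, cand)
            gamma[t][j] = best
    return gamma


# Toy left-to-right model: state 0 may stay or advance, state 1 only stays.
log_a = [[math.log(0.5), math.log(0.5)], [NEG_INF, 0.0]]
emit = [{"x": 0.9, "y": 0.1}, {"x": 0.2, "y": 0.8}]


def log_b(i, j, v):
    # In this toy model the output probability depends only on the target state j.
    return math.log(emit[j][v])


gamma = viterbi_accumulate(log_a, log_b, ["x", "y"])
```

Taking the maximum over predecessor states, instead of the sum, is what distinguishes the Viterbi accumulation from the full forward probability.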

[0017] The above description covers the case in which continuous HMMs are used and each continuous HMM is represented by a mixture of multiple Gaussian distributions. The description that follows, however, covers the case in which a type of HMM known as the semi-continuous HMM is used to reduce the processing amount further.

[0018] FIG. 3 is a detailed functional block diagram of an embodiment of the word speech recognition device of the present invention that uses semi-continuous HMMs. The functions of an embodiment of the word speech recognition device were described with reference to FIG. 1; FIG. 3 details acoustic matching means 4 and standard pattern storage means 5 of FIG. 1 for the semi-continuous HMM. With semi-continuous HMMs, acoustic matching is a three-stage process. The first stage is probability calculation in probability calculation means 41, the second stage is probability mixing in probability mixing means 42, and the third stage is probability accumulation in probability accumulation means 43. The first-stage probability calculation computes, from feature vector v, the probability under each individual Gaussian distribution in equation (1). The second-stage probability mixing combines the results for the individual Gaussian distributions to obtain b(i,j,v). The third-stage probability accumulation accumulates probabilities according to the recurrence of equations (2) to (4).

[0019] The number of probability distributions that actually exist is Nu × Su × M, where Nu is the number of basic recognition units, Su is the number of states in the HMM of each basic recognition unit, and M is the number of distributions per state. With Nu = 400, Su = 2, and M = 3, there are 2400 distributions. Without semi-continuous HMMs, all of these distributions must be evaluated, but with semi-continuous HMMs the processing amount is greatly reduced. In the semi-continuous HMM, similar distributions among the 2400 are grouped together so that only representative distributions need to be evaluated. For example, the 2400 distributions are clustered into 256 clusters, a representative distribution is created for each cluster, and evaluating the representative distribution is substituted for evaluating the actual distributions. In this way, the 2400 probability calculations required without semi-continuous HMMs are reduced to 256.
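The counts in this paragraph can be checked with a few lines of arithmetic. The variable names are hypothetical; the numbers are those given in the text.

```python
n_units = 400        # Nu: number of basic recognition units
states_per_unit = 2  # Su: HMM states per unit
mix_per_state = 3    # M: distributions per state
n_codebook = 256     # representative distributions after clustering

full_evals = n_units * states_per_unit * mix_per_state  # evaluations per frame
reduction = full_evals / n_codebook                      # saving factor
print(full_evals, n_codebook, reduction)
```

Under these figures the per-frame Gaussian evaluations drop by a factor of a little over nine, independent of the vocabulary size.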

[0020] Representative distribution storage means 51 stores these representative distributions. In this embodiment, Gaussian distributions are used as the probability distributions, and the covariance matrices are assumed to have diagonal components only. Representative distribution storage means 51 stores the mean vector and covariance matrix (diagonal components only) of each Gaussian distribution. As shown in FIG. 4, for each representative distribution number 101, the corresponding mean vector 102 and covariance matrix (diagonal components only) 103 are stored. Probability calculation means 41 uses these to compute probabilities. With such representative distributions, each HMM holds one of the representative distributions instead of a distribution of its own. Because a representative distribution can be identified by its number alone, each standard pattern is expressed using representative distribution numbers. Semi-continuous HMM storage means 52 stores HMMs described with these representative distribution numbers. Each semi-continuous HMM stored in semi-continuous HMM storage means 52 has the form shown in FIG. 5.

[0021] For each representative distribution stored in representative distribution storage means 51, probability calculation means 41 computes the probability of that distribution using the feature vector v obtained from speech detection means 3. The probability value is computed with the Gaussian distribution formula, equation (5).

[Equation 5]
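The body of equation (5) does not survive in this text. Since the embodiment uses Gaussian distributions whose covariance matrices have diagonal components only, the density of representative distribution k would presumably factor over the feature dimensions as below. The index k, the dimension n, and the per-dimension variances σ² are assumed notation, not taken from the original equation.

```latex
p_k(v) \;=\; \prod_{d=1}^{n}
  \frac{1}{\sqrt{2\pi\,\sigma_{k,d}^{2}}}
  \exp\!\left( -\frac{(v_d-\mu_{k,d})^{2}}{2\,\sigma_{k,d}^{2}} \right)
```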

[0022] As shown in FIG. 6, the probability values computed by probability calculation means 41 are obtained as pairs of a representative distribution number 201 and the probability value 202 for that distribution. Probability mixing means 42 performs probability mixing for every state of every HMM stored in semi-continuous HMM storage means 52, referring to the probability calculation results shown in FIG. 6, and obtains the output probability b(i,j,v) of each state. Probability accumulation means 43 receives the output probability b(i,j,v) of each state, executes the Viterbi computation, and obtains and outputs the accumulated probability for every state of every HMM. The probability accumulation performed here is the continuous Viterbi computation used in word spotting and the like, and strictly speaking its recurrence differs from that of equations (2) to (4). The structure of the HMM used also differs slightly from the HMM shown in FIG. 5. FIG. 7 shows the structure of the HMM actually used. The HMM of FIG. 7 differs from the HMM of FIG. 5 in that a state with no self-loop is added at the head. The HMM of FIG. 5 allows only matching with a fixed start point, whereas the structure of FIG. 7 allows matching with a free start point. The recurrence for start-free matching differs slightly from that of equations (2) to (4).
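The first two stages described above (evaluate each representative distribution once per frame, then mix per state by looking up distribution numbers) can be sketched as follows. This is a minimal sketch under assumptions: the data layout, the function names, the per-state mixture weights, and the two-entry codebook are hypothetical and not taken from FIG. 4 to FIG. 6.

```python
import math


def diag_gaussian(v, mean, var):
    """Density of a Gaussian with diagonal covariance, evaluated at vector v."""
    log_p = 0.0
    for x, m, s2 in zip(v, mean, var):
        log_p += -0.5 * (math.log(2.0 * math.pi * s2) + (x - m) ** 2 / s2)
    return math.exp(log_p)


def codebook_probs(v, codebook):
    """Stage 1: evaluate every representative distribution once per frame."""
    return {k: diag_gaussian(v, mean, var) for k, (mean, var) in codebook.items()}


def mix_state(probs, components):
    """Stage 2: weighted sum over the (number, weight) pairs of one state."""
    return sum(weight * probs[k] for k, weight in components)


# Hypothetical codebook: number -> (mean vector, diagonal variances).
codebook = {1: ([0.0, 0.0], [1.0, 1.0]), 2: ([3.0, 3.0], [1.0, 1.0])}
state_a = [(1, 0.7), (2, 0.3)]  # a state described by codebook numbers
state_b = [(1, 0.1), (2, 0.9)]

probs = codebook_probs([0.0, 0.0], codebook)  # computed once, shared by all states
b_a = mix_state(probs, state_a)
b_b = mix_state(probs, state_b)
```

The expensive exponentials are confined to `codebook_probs`; mixing for each state is only a handful of multiply-adds on the shared table.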

[0023] Equations (6) to (8) give the recurrence for start-free matching. Equation (6) is the same as equation (2). The differences are that, as in equation (7), γ(i,t) is given the value 1 at each time, and that, as in equation (8), the quantities compared in the maximum decision are normalized by the matching path length L.

[0024] By storing which state was selected in the maximum selection of equation (8), the start point of the matching path can be obtained. In this way, probability accumulation means 43 computes, for each HMM, an acoustic matching result time series such as that shown in FIG. 8. As shown in FIG. 8, the acoustic matching result time series gives, as the score of each HMM, an accumulated probability value 302 at each time 301, together with start point information 303 of the matching path that yields that accumulated probability value. FIG. 8 shows the matching result for one HMM; similar results are obtained for all HMMs stored in semi-continuous HMM storage means 52. The row for time t in FIG. 8 shows that this HMM matched the input speech from time 23 to time t with a score of 0.009174.
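Equations (6) to (8) are not reproduced in this text, so the following is only a rough sketch of start-free (continuous Viterbi) matching for one HMM, producing a score and a start point at every time in the manner of FIG. 8. Several choices are assumptions: output probabilities are attached to states rather than transitions, the entry with value 1 is modeled as a new path of log-score 0 that may begin at any frame, and candidates are compared by log-probability divided by path length. The function name and the toy model are hypothetical.

```python
import math


def spot_unit(frames, log_emit, log_stay, log_next):
    """Start-free matching of one left-to-right HMM against all frames.

    Returns, for each time t, (normalized log score, start time) of the best
    path ending in the last state at t, or None if no path can end there.
    """
    n = len(log_stay)               # number of emitting states
    prev = [None] * n               # cell = (log-prob, path length, start time)
    results = []
    for t, v in enumerate(frames):
        cur = [None] * n
        for j in range(n):
            cands = []
            if prev[j] is not None:             # self-loop
                lp, length, start = prev[j]
                cands.append((lp + log_stay[j], length + 1, start))
            if j == 0:                          # a new path may start at any t
                cands.append((0.0, 1, t))
            elif prev[j - 1] is not None:       # advance from the previous state
                lp, length, start = prev[j - 1]
                cands.append((lp + log_next[j - 1], length + 1, start))
            if cands:
                e = log_emit(j, v)
                cands = [(lp + e, length, start) for lp, length, start in cands]
                # Compare candidates by score normalized by path length.
                cur[j] = max(cands, key=lambda c: c[0] / c[1])
        last = cur[n - 1]
        results.append(None if last is None else (last[0] / last[1], last[2]))
        prev = cur
    return results


# Toy two-state unit "a then b" embedded in noise frames "n".
emit = [{"a": 0.8, "b": 0.1, "n": 0.1}, {"a": 0.1, "b": 0.8, "n": 0.1}]
frames = ["n", "n", "a", "a", "b", "b", "n"]
results = spot_unit(
    frames,
    lambda j, v: math.log(emit[j][v]),
    log_stay=[math.log(0.5), math.log(0.5)],
    log_next=[math.log(0.5)],
)
best_t = max(
    (t for t, r in enumerate(results) if r is not None),
    key=lambda t: results[t][0],
)
```

Because the start time is carried along with each cell, the start point of the best path ending at any time is available without a separate backtrace.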

[0025] Word evaluation means 6 evaluates each word based on the matching result time series obtained for each HMM and the word dictionary information stored in word dictionary 7, and obtains an evaluation value for each word. FIG. 9 is a flowchart for explaining the processing performed by word evaluation means 6. The flowchart of FIG. 9 shows the evaluation process for one word. The algorithm evaluates a word by working backward from its last syllable to its first. Suppose the word being evaluated consists of N syllables. The syllable number i to be processed is set to N, the score to 0, and the search start time t to the end time T of the input speech (step 801).

[0026] Next, the maximum of the matching result time series of the HMM corresponding to the final syllable is found within a certain range from the end of the input speech (time = T). Let this maximum be Smax and the time at which it occurs be tmax. Because the matching result time series contains start point information, the start time tstart corresponding to tmax can be obtained (step 802). Then i is set to i − 1, the Smax just obtained is added to the score of the word, and tstart is set as the new search start point t in preparation for the search for the preceding syllable (step 803). Steps 802 and 803 are repeated until the syllable number i becomes 0. When i becomes 0 (step 804: Y), processing for the word is complete and the computation ends.

[0027] FIG. 10 shows the above processing as a matching path on a diagram with time on the horizontal axis and HMM states on the vertical axis (called a trellis). The example shown in FIG. 10 is the word "Kokubunji" (こくぶんじ), whose syllables are ko, ku, bu, n, and ji. The maximum matching value of the syllable "ji" is found between time T−α and time T, and the corresponding matching start point is denoted t1; the maximum matching value of the preceding syllable "n" is then found between t1−α and t1+α. Similarly, with the corresponding matching start point denoted t2, the maximum matching value of the preceding syllable "bu" is found between t2−α and t2+α. With the corresponding start point denoted t3, the maximum for the preceding syllable "ku" is found between t3−α and t3+α, and with the corresponding start point denoted t4, the maximum for the preceding syllable "ko" is found between t4−α and t4+α. The maxima obtained in this way are accumulated to give the score of the word "Kokubunji".
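The Kokubunji example can be written compactly as a recursion. The notation below is introduced only for illustration and does not appear in the source: S_k(t) is the matching value of the HMM for the k-th syllable ending at time t, b_k(t) is the corresponding matching start point, τ_k is the end time selected for syllable k, and N is the number of syllables (N = 5 for Kokubunji, so t_1 to t_4 are the start points named in the text).

```latex
\[
\begin{aligned}
\tau_N &= \operatorname*{arg\,max}_{T-\alpha \,\le\, t \,\le\, T} S_N(t), \\
t_j &= b_{N-j+1}\!\left(\tau_{N-j+1}\right), \qquad j = 1, \dots, N-1, \\
\tau_{N-j} &= \operatorname*{arg\,max}_{t_j-\alpha \,\le\, t \,\le\, t_j+\alpha} S_{N-j}(t), \qquad j = 1, \dots, N-1, \\
\mathrm{Score} &= \sum_{k=1}^{N} S_k(\tau_k).
\end{aligned}
\]
```

For j = 1 this gives t_1 as the start point of "ji" and the window [t_1−α, t_1+α] for "n", matching the order of the description.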

[0028] In the above description, the acoustic matching-result time series of FIG. 8 is obtained for every HMM at every time. Because this increases both the memory requirement and the amount of processing, it goes without saying that both can be reduced by recording a result only when the accumulated probability value exceeds a certain reference value, or by recording only the times at which the accumulated probability value reaches a local maximum in the time direction. Likewise, although the word evaluation process shown in the flowchart of FIG. 9 accumulates scores over all syllables for every word, the amount of processing can also be reduced by, for example, aborting the processing partway when the Smax obtained for an intermediate syllable is at or below a certain reference value.
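The three reductions mentioned above can be sketched as follows. This is an illustrative sketch only: the function names, the sparse dictionary layout `{end_time: (score, start_time)}`, the treatment of ties and boundary points in the local-maximum test, and the parameter names `threshold` and `s_min` are assumptions, not details given in the source.

```python
from typing import Dict, List, Optional, Tuple

Entry = Tuple[float, int]  # (accumulated score, start time); hypothetical layout


def keep_above_threshold(series: List[Entry], threshold: float) -> Dict[int, Entry]:
    """Record only end times whose accumulated score exceeds the reference value."""
    return {t: e for t, e in enumerate(series) if e[0] > threshold}


def keep_local_maxima(series: List[Entry]) -> Dict[int, Entry]:
    """Record only end times where the score is a local maximum in time."""
    kept = {}
    for t, (s, start) in enumerate(series):
        left = series[t - 1][0] if t > 0 else float("-inf")
        right = series[t + 1][0] if t + 1 < len(series) else float("-inf")
        if s >= left and s > right:  # on a plateau, keep its last point
            kept[t] = (s, start)
    return kept


def score_word_pruned(syllables: List[str], sparse: Dict[str, Dict[int, Entry]],
                      T: int, alpha: int, s_min: float) -> Optional[float]:
    """Backward scoring over sparse results; None means the word was abandoned."""
    t, total = T, 0.0
    for syl in reversed(syllables):
        window = [e for u, e in sparse[syl].items()
                  if t - alpha <= u <= min(T, t + alpha)]
        if not window:
            return None                  # nothing was recorded in the window
        s_max, t_start = max(window, key=lambda e: e[0])
        if s_max <= s_min:
            return None                  # Smax at or below the reference value
        total += s_max
        t = t_start
    return total
```

In this sketch a word is also abandoned when no entry survives in the search window, which is one way the sparse recording and the early termination can interact.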

[0029] It is of course also possible to preselect the recognition target words using a low-cost method entirely different from the speech recognition processing of the present invention, thereby reducing the number of target words, and then to apply the speech recognition processing of the present invention.

[0030] FIG. 11 is a block diagram showing a specific hardware configuration of the speech recognition apparatus of FIG. 3, as an example of the word speech recognition apparatus of the present invention. In the figure, 111 is a microphone that receives speech and converts it into an electric signal; 112 is an amplifier that amplifies the speech signal converted into an electric signal; 113 is an A/D converter; 114 is a memory that stores an operating system (OS) 1141, a speech recognition program 1142, representative distributions 1143, semi-continuous HMMs 1144, a word dictionary 1145, a work area 1146, and the like; 115 is an arithmetic processor (CPU); and 116 denotes other peripheral devices such as a printer and a display device. The microphone 111 of FIG. 11 corresponds to the speech input means 1 of FIG. 3, and the functions of the speech analysis means 2, the speech detection means 3, the acoustic matching means 4 (probability calculation means 41, probability mixing means 42, probability accumulation means 43), the standard pattern storage means 5 (representative distribution storage means 51, semi-continuous HMM storage means 52), the word evaluation means 6, the word dictionary 7, and the determination means 8 of FIG. 1 are realized by the arithmetic processor 115 of FIG. 11 together with the programs and data stored in the memory 114.

[0031] Furthermore, a microcomputer device for word speech recognition can be realized by incorporating onto a semiconductor chip the functions of the speech analysis means 2, the speech detection means 3, the acoustic matching means 4 (probability calculation means 41, probability mixing means 42, probability accumulation means 43), the standard pattern storage means 5 (representative distribution storage means 51, semi-continuous HMM storage means 52), the word evaluation means 6, the word dictionary 7, and the determination means 8 of FIG. 3, that is, the arithmetic processor 115 of FIG. 11 and the speech recognition program, representative distributions, semi-continuous HMMs, word dictionary, and the like held in the memory 114. Such a device can be built into various information devices that require speech recognition, such as car navigation systems, telephones, and PDAs (Personal Digital Assistants), so its range of application is wide.

[0032] As noted earlier, the above embodiment was described for the case of word speech recognition for simplicity. However, a sentence speech recognition apparatus, a microcomputer device for sentence speech recognition, and a sentence speech recognition method that recognize sentence speech, such as continuously uttered sentences and conversational utterances, can also be realized as follows. In addition to the same syllable-unit standard patterns and word dictionary, a grammar describing each recognition target sentence as a sequence of words is stored. The matching means matches each standard pattern over the entire span of the feature vector time series of the input speech and obtains a matching result for each standard pattern as a time series. The evaluation means then evaluates each sentence based on the matching results obtained as time series for each standard pattern and on the information in the word dictionary and the grammar, and obtains a recognition result according to that evaluation.
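One possible reading of the sentence case is sketched below. The source does not specify how sentences are scored from the matching results, so this sketch makes several assumptions: the grammar is a finite list of word sequences, each sentence is expanded into syllables through the word dictionary, and the same backward accumulation of windowed maxima described for words is reused. All names (`expand`, `score_units`, `recognize_sentence`) and the data layout are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical layout: unit -> list of (score, start_time), indexed by end time.
Series = Dict[str, List[Tuple[float, int]]]


def expand(sentence: Sequence[str], dictionary: Dict[str, List[str]]) -> List[str]:
    """Replace each word by its syllable sequence from the word dictionary."""
    return [syl for word in sentence for syl in dictionary[word]]


def score_units(units: List[str], results: Series, alpha: int) -> float:
    """Backward accumulation of windowed maxima, last unit first."""
    T = len(next(iter(results.values()))) - 1
    t, total = T, 0.0
    for unit in reversed(units):
        series = results[unit]
        window = range(max(0, t - alpha), min(T, t + alpha) + 1)
        t_max = max(window, key=lambda u: series[u][0])
        s_max, t = series[t_max]  # the start time becomes the next search point
        total += s_max
    return total


def recognize_sentence(grammar: List[Tuple[str, ...]],
                       dictionary: Dict[str, List[str]],
                       results: Series, alpha: int) -> Tuple[str, ...]:
    """Return the sentence allowed by the grammar with the highest score."""
    return max(grammar,
               key=lambda s: score_units(expand(s, dictionary), results, alpha))
```

The matching results are shared by every sentence in this sketch, so adding sentences to the grammar adds only lookups, not further acoustic matching.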

[0033] In addition, if the steps constituting the word speech recognition method and the sentence speech recognition method described above are converted into program code and recorded on a recording medium such as a CD-ROM or FD (flexible disk), they become easy to distribute commercially, and the speech recognition method of the present invention can be widely disseminated.

[0034] The above embodiment achieves the intended object of the present invention. That is, it provides a word speech recognition apparatus and a sentence speech recognition apparatus, a microcomputer device therefor, a speech recognition method for words and sentences, and a computer-readable recording medium on which the recognition method is recorded, in which a final evaluation result is obtained for every hypothesis to be recognized, the amount of acoustic matching processing is not proportional to the number of recognition target words but is held to a constant amount, and the processing load is therefore small with little degradation in recognition accuracy.

[0035]

[Effects of the Invention] As described above, according to the present invention, the number of probability calculations required for speech recognition can be greatly reduced, enabling large-vocabulary speech recognition with a small amount of processing while maintaining recognition accuracy.

[Brief description of the drawings]

FIG. 1 is a block diagram showing the configuration of an embodiment of the speech recognition apparatus of the present invention.

FIG. 2 is a diagram illustrating the hidden Markov model (HMM) of a basic recognition unit used in the speech recognition apparatus of the present invention.

FIG. 3 is a block diagram showing the detailed configuration of an embodiment of the speech recognition apparatus of the present invention.

FIG. 4 is a diagram illustrating the representative distribution storage means of the present invention.

FIG. 5 is a diagram illustrating the semi-continuous hidden Markov model (HMM) used in the speech recognition apparatus of the present invention.

FIG. 6 is a diagram illustrating the representative distribution probability holding means of the present invention.

FIG. 7 is a diagram illustrating the semi-continuous hidden Markov model (HMM) used in the speech recognition apparatus of the present invention.

FIG. 8 is a diagram illustrating the time series of acoustic matching results.

FIG. 9 is a flowchart illustrating the word evaluation calculation process in the word evaluation means.

FIG. 10 is a diagram illustrating the concept of the word evaluation calculation process in the word evaluation means.

FIG. 11 is a block diagram showing a specific hardware configuration of the speech recognition apparatus of FIG. 3.

[Explanation of symbols]

1: speech input means, 2: speech analysis means, 3: speech detection means, 4: acoustic matching means, 5: standard pattern storage means, 6: word evaluation means, 7: word dictionary, 8: determination means, 41: probability calculation means, 42: probability mixing means, 51: representative distribution storage means, 52: semi-continuous HMM storage means, 111: microphone, 112: amplifier, 113: A/D converter, 114: memory, 1141: operating system (OS), 1142: speech recognition program, 1143: representative distributions, 1144: semi-continuous HMMs, 1145: word dictionary, 1146: work area, 115: arithmetic processor (CPU), 116: other peripheral devices.

Claims

[Claims]

1. A word speech recognition apparatus comprising: speech input means for inputting speech; speech analysis means for analyzing the input speech and outputting a time series of feature vectors; standard pattern storage means for storing standard patterns for basic speech units; a word dictionary describing each recognition target word as a sequence of basic speech units; matching means for matching the time series of feature vectors of the input speech against the standard patterns; and evaluation means for evaluating recognition targets based on the matching results, wherein the matching means matches each standard pattern over the entire span of the feature vector time series of the input speech and obtains a matching result for each standard pattern as a time series, and the evaluation means evaluates each word based on the matching results obtained as time series for each standard pattern and on the information in the word dictionary, and obtains a recognition result according to the evaluation results.

2. A microcomputer device for word speech recognition comprising: speech analysis means for analyzing input speech and outputting a time series of feature vectors; standard pattern storage means for storing standard patterns for basic speech units; a word dictionary describing each recognition target word as a sequence of basic speech units; matching means for matching the time series of feature vectors of the input speech against the standard patterns; and evaluation means for evaluating recognition targets based on the matching results, wherein the matching means matches each standard pattern over the entire span of the feature vector time series of the input speech and obtains a matching result for each standard pattern as a time series, and the evaluation means evaluates each word based on the matching results obtained as time series for each standard pattern and on the information in the word dictionary, and obtains a recognition result according to the evaluation results.

3. A sentence speech recognition apparatus comprising: speech input means for inputting speech; speech analysis means for analyzing the input speech and outputting a time series of feature vectors; standard pattern storage means for storing standard patterns for basic speech units; a word dictionary describing each recognition target word as a sequence of basic speech units; a grammar describing each recognition target sentence as a sequence of words; matching means for matching the time series of feature vectors of the input speech against the standard patterns; and evaluation means for evaluating recognition targets based on the matching results, wherein the matching means matches each standard pattern over the entire span of the feature vector time series of the input speech and obtains a matching result for each standard pattern as a time series, and the evaluation means evaluates each sentence based on the matching results obtained as time series for each standard pattern and on the information in the word dictionary and the grammar, and obtains a recognition result according to the evaluation results.

4. A microcomputer device for sentence speech recognition comprising: speech analysis means for analyzing input speech and outputting a time series of feature vectors; standard pattern storage means for storing standard patterns for basic speech units; a word dictionary describing each recognition target word as a sequence of basic speech units; a grammar describing each recognition target sentence as a sequence of words; matching means for matching the time series of feature vectors of the input speech against the standard patterns; and evaluation means for evaluating recognition targets based on the matching results, wherein the matching means matches each standard pattern over the entire span of the feature vector time series of the input speech and obtains a matching result for each standard pattern as a time series, and the evaluation means evaluates each sentence based on the matching results obtained as time series for each standard pattern and on the information in the word dictionary and the grammar, and obtains a recognition result according to the evaluation results.

5. A word speech recognition method comprising: a speech input step of inputting speech; a speech analysis step of analyzing the input speech and outputting a time series of feature vectors; a matching step of matching the time series of feature vectors of the input speech against standard patterns for basic speech units; and an evaluation step of evaluating recognition targets based on the matching results, wherein the matching step matches each standard pattern over the entire span of the feature vector time series of the input speech and obtains a matching result for each standard pattern as a time series, and the evaluation step evaluates each word based on the matching results obtained as time series for each standard pattern and on the information in a word dictionary describing each recognition target word as a sequence of basic speech units, and obtains a recognition result according to the evaluation results.

6. The word speech recognition method according to claim 5, wherein the standard patterns are constituted by occurrence probability distributions of speech feature vectors, and the matching step calculates a probability for each standard pattern from the feature vectors of the input speech and the occurrence probability distributions and performs the matching based on the calculated probability values.

7. The word speech recognition method according to claim 5, wherein the matching step performs the matching calculation based on dynamic programming.

8. The word speech recognition apparatus and word speech recognition method according to claim 7, wherein said dynamic programming method uses a Viterbi algorithm.

9. The word speech recognition method according to claim 5, wherein the basic speech unit is a syllable.

10. The word speech recognition method according to claim 5, wherein the basic speech unit is a three-phoneme chain of vowel, consonant, and vowel.

11. The word speech recognition method according to any one of claims 5 to 10, wherein, in matching each standard pattern over the entire span of the feature vector time series of the input speech and obtaining a matching result as a time series for each standard pattern, the matching step obtains as the matching-result time series only those portions where the intermediate evaluation result is higher than a predetermined reference value.

12. The collating step comprises: collating each of the standard patterns over the entire section of the feature vector time series of the input voice to obtain a collation result as a time series for each standard pattern; 11. The method according to claim 5, wherein only the portion where the middle evaluation result is maximum in the time direction is obtained as a time series of the comparison result.
The word speech recognition method according to any one of the above.

13. The word speech recognition method according to claim 11, wherein the matching result obtained as a time series for each standard pattern includes, for each time, the evaluation value of the standard pattern ending at that time and the corresponding start-point information.

14. The word speech recognition method according to claim 5, wherein the evaluation step terminates partway the evaluation of any recognition target whose intermediate evaluation falls below a predetermined reference value.

15. A computer-readable recording medium on which the steps of the word speech recognition method according to claim 5 are recorded.

16. A sentence speech recognition method comprising: a speech input step of inputting speech; a speech analysis step of analyzing the input speech and outputting a time series of feature vectors; a matching step of matching the time series of feature vectors of the input speech against standard patterns for basic speech units; and an evaluation step of evaluating recognition targets based on the matching results, wherein the matching step matches each standard pattern over the entire span of the feature vector time series of the input speech and obtains a matching result for each standard pattern as a time series, and the evaluation step evaluates each sentence based on the matching results obtained as time series for each standard pattern, on the information in a word dictionary describing each recognition target word as a sequence of basic speech units, and on the information in a grammar describing each recognition target sentence as a sequence of words, and obtains a recognition result according to the evaluation results.

17. The sentence speech recognition method according to claim 16, wherein the standard patterns are constituted by occurrence probability distributions of speech feature vectors, and the matching step calculates a probability for each standard pattern from the feature vectors of the input speech and the occurrence probability distributions and performs the matching based on the calculated probability values.

18. The sentence speech recognition method according to claim 17, wherein the matching step performs the matching calculation based on dynamic programming.

19. The sentence speech recognition method according to claim 18, wherein said dynamic programming uses a Viterbi algorithm.

20. The sentence speech recognition method according to claim 16, wherein the basic speech unit is a syllable.

21. The speech basic unit according to claim 16, wherein the vowel, consonant, and vowel are a three-phoneme chain.
10. The sentence speech recognition method according to any one of items 9 to 9.

22. The sentence speech recognition method according to any one of claims 16 to 21, wherein, in matching each standard pattern over the entire span of the feature vector time series of the input speech and obtaining a matching result as a time series for each standard pattern, the matching step obtains as the matching-result time series only those portions where the intermediate evaluation result is higher than a predetermined reference value.

23. The collating step comprises: collating each of the standard patterns over the entire section of the feature vector time series of the input speech to obtain a collation result as a time series for each standard pattern; The method according to claim 16, wherein only a portion where the middle evaluation result is maximum in the time direction is obtained as a time series of the matching result.
Item 1. The sentence speech recognition method according to any one of 1.

24. The sentence speech recognition method according to any one of claims 16 to 23, wherein the matching result obtained as a time series for each standard pattern includes, for each time, the evaluation value of the standard pattern ending at that time and the corresponding start-point information.

25. The sentence speech recognition method according to claim 16, wherein the evaluation step terminates partway the evaluation of any recognition target whose intermediate evaluation falls below a predetermined reference value.

26. A computer-readable recording medium on which the steps of the sentence speech recognition method according to claim 16 are recorded.