JPH0484197A

JPH0484197A - Continuous voice recognizer

Info

Publication number: JPH0484197A
Application number: JP2200530A
Authority: JP
Inventors: Atsushi Horioka; 篤史堀岡
Original assignee: Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Holdings Corp
Priority date: 1990-07-26
Filing date: 1990-07-26
Publication date: 1992-03-17
Anticipated expiration: 2014-07-19
Also published as: JP2921059B2

Abstract

(57) [Abstract] This publication contains application data from before electronic filing, so no abstract data is recorded.

Description

[Detailed Description of the Invention]

Industrial Field of Application

The present invention relates to a continuous speech recognition device that obtains recognition results using an evaluation value expressed as a linear combination of acoustic reliability and connection reliability, each variably weighted.

Prior Art

In recent years, as speech recognition technology has developed, continuous speech recognition devices are about to be put into practical use in various fields, and doing so requires solving various problems that arise when such devices are used in practice. One of these practical problems is that, when the input continuous speech contains an unclearly uttered portion, that portion is misrecognized and an incorrect sentence is output.

To address this problem, a conventional continuous speech recognition device does not obtain a sentence-level result by simply concatenating the segments recognized with the highest acoustic reliability. Instead, it defines an evaluation function as a linear combination of the acoustic reliability and the connection reliability of the recognized segments and takes the candidate with the highest evaluation value as the recognition result. Consequently, even when the acoustic reliability of a segment is low, the evaluation value is high if its grammatical connection to the neighboring segments (the connection reliability) is strong, so the input can be recognized correctly even when part of it is uttered unclearly.

The following description, with reference to FIGS. 3, 4, and 5, takes as an example a conventional continuous speech recognition device of the kind described above in which the segments are words.

FIG. 3 is a block diagram of the conventional continuous speech recognition device, FIG. 4 is a flowchart of word lattice generation, and FIG. 5 is a flowchart of the processing in the connection unit. In FIG. 3, 1 is a signal input terminal, 2 is an analysis unit, 4 is a feature parameter storage unit, 5 is a matching unit, 6 is a word lattice storage unit, 8 is a prediction unit, 9 is a connection unit, 10 is a recognition result output terminal, and 12 is a switch. The operation of the speech recognition device configured in this way is described below.

Registration of the standard speech is described first, with reference to FIG. 3. Switch 12 is set so that the output of the analysis unit is fed to the feature parameter storage unit. Standard speech, input word by word from signal input terminal 1, is passed to analysis unit 2, which calculates feature parameters for each frame and registers them in feature parameter storage unit 4. Registration is complete when this has been repeated for every word to be recognized.

Recognition is described next, with reference to FIGS. 3 and 4. Switch 12 is set so that the output of the analysis unit is fed to the matching unit. As in registration, the signal to be recognized is input from signal input terminal 1 (process 21), and analysis unit 2 calculates the feature parameters for each frame (process 22). Matching unit 5 then matches the input signal against the standard speech.

First, the frame number and the word number are both initialized to 1 (processes 23 and 24). Taking the frame with the current frame number as the start point, the input is matched against the word to be recognized that has the current word number (process 27). If the similarity is at or above the decision threshold (process 28), that word is taken as a recognition segment candidate and its similarity as the acoustic reliability, and both are output to the word lattice storage unit together with the start and end points of the match (process 29). The word number is then incremented by 1 (process 30) and processing moves to the next word to be recognized. When this has been done for all words to be recognized (process 26), the frame number is incremented by 1 (process 31) and the same processing is performed with the next frame as the start point. When all frames and all words to be recognized have been processed (process 25), the processing in matching unit 5 ends. As a result, the names of the recognition word candidates that may be present in the input continuous speech are recorded as a word lattice, together with their start positions, end positions, and acoustic reliabilities, and output to word lattice storage unit 6.

Prediction unit 8 uses grammar, statistical information, or the like to determine the words that can follow a recognition word candidate received from connection unit 9, and outputs them to connection unit 9 as next-word candidates together with their connection reliabilities (expressed, for example, as probabilities).

The processing in connection unit 9 is described next, with reference to FIGS. 3 and 5.
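The word lattice generation loop of processes 23 to 31 can be sketched roughly as follows. This is only an illustration under stated assumptions: the `LatticeEntry` record, the `match` callback and its return shape are hypothetical names that do not appear in the patent, and frames are indexed from 0 here although the text counts from 1.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class LatticeEntry:
    """One word lattice record (hypothetical layout)."""
    word: str
    start: int       # start frame of the match
    end: int         # end frame of the match
    acoustic: float  # similarity, used as the acoustic reliability


# Hypothetical matcher: given the feature frames, a start frame and a word,
# it returns (similarity, end_frame), or None when no match can be attempted.
Matcher = Callable[[Sequence[float], int, str], Optional[Tuple[float, int]]]


def build_word_lattice(
    features: Sequence[float],
    vocabulary: Sequence[str],
    match: Matcher,
    threshold: float,
) -> List[LatticeEntry]:
    lattice: List[LatticeEntry] = []
    for start in range(len(features)):        # frame loop (processes 25, 31)
        for word in vocabulary:               # word loop (processes 26, 30)
            result = match(features, start, word)  # matching (process 27)
            if result is None:
                continue
            similarity, end = result
            if similarity >= threshold:       # threshold test (process 28)
                # record candidate, reliability and span (process 29)
                lattice.append(LatticeEntry(word, start, end, similarity))
    return lattice
```

Because every frame is tried as a start point for every word, the cost of this loop grows with the product of the number of frames and the vocabulary size.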

The connection unit concatenates the recognition word candidate names received from word lattice storage unit 6 to generate multiple recognition result candidates, and outputs the one with the highest evaluation value to recognition result output terminal 10 as the recognition result. To obtain the recognition result candidates, the partial recognition result is first set to the empty string (process 1), and the partial-result end position, which is the end position of the last recognition word candidate in the partial recognition result, is initialized to 0 (a frame number) (process 2). Next, if there is a recognition word candidate whose start position satisfies

partial-result end position - gap ≤ start position ≤ partial-result end position + gap   (Equation 1.1)

(that is, a candidate that can follow) (process 4), its name is appended to the end of the string in the partial recognition result (process 5). Here gap is a constant that allows for errors in the start and end positions detected by the matching unit. The partial-result end position is then updated according to Equation 2 below (process 6).

partial-result end position = partial-result end position + (end position of the recognition word candidate - start position of the recognition word candidate)   (Equation 2)

If the recognition word candidate appended in process 5 is the i-th from the beginning of the sentence, connection unit 9 sends the (i-1)-th recognition word candidate to the prediction unit and obtains from it the next-word candidates (the words that can be the i-th word of the sentence) and their connection reliabilities (process 7). These are used to update the evaluation value according to Equation 3 below (process 9). The evaluation function h(W_i) of a partial recognition result consisting of the word string (W_1, W_2, ..., W_i) is expressed as

$$h(W_i) = h(W_{i-1}) + \alpha \cdot g(W_i) + \beta \cdot f(W_{i-1}, W_i) \qquad \text{(Equation 3)}$$

where h(W_0) = 0.

Here g(W_i) is the acoustic reliability of word W_i, f(W_{i-1}, W_i) is the connection reliability from word W_{i-1} to word W_i, and α and β are weights (constants). Processes 4 through 9 are repeated until the condition of process 4 is no longer satisfied. The partial-result end position is then tested against the condition

input speech frame length - gap ≤ partial-result end position ≤ input speech frame length + gap   (Equation 1.2)

(process 10), and if the condition is satisfied, the partial recognition result at that point is saved as a recognition result candidate together with its evaluation value (process 13). This procedure is carried out over the entire word lattice input to connection unit 9 to obtain every recognition result candidate that can exist, and the candidate with the highest evaluation value is output from recognition result output terminal 10 as the recognition result.

Problems to Be Solved by the Invention

In the configuration described above, however, the weights applied to the acoustic reliability and the connection reliability (α and β in Equation 3) are fixed. The two reliabilities are therefore not reflected efficiently in the evaluation value, and introducing them contributes nothing to improving the recognition rate.

In addition, when the input speech contains a silent section such as a pause for breath, the recognition segment candidates cannot be connected and no recognition result is output. If the value of gap in Equation 1.1 is instead increased in anticipation of silent sections, an enormous number of recognition result candidates are generated and, in the end, the correct recognition result is not output.

An object of the present invention is to provide a continuous speech recognition device that varies the weights applied to the acoustic reliability and the connection reliability (α and β in Equation 3) according to the information amount of the next-segment prediction in segment connection, the duration of silence in the input speech, or both, so that each reliability contributes faithfully to improving the recognition rate.

Means for Solving the Problems

To achieve this object, the continuous speech recognition device according to the first invention comprises: an analysis unit that detects feature parameters for each frame of an input signal; a matching unit that matches the output of the analysis unit against the feature parameters of each segment of a standard signal and outputs recognition segment candidates and their acoustic reliabilities; a prediction unit that outputs the next-segment candidates predicted from the partial recognition result received from a connection unit, their connection reliabilities, and the information amount of those predicted next-segment candidates; and the connection unit, which connects the recognition segment candidates output by the matching unit to output a recognition result and also outputs the partial recognition result to the prediction unit.

The continuous speech recognition device according to the second invention comprises: an analysis unit that detects feature parameters for each frame of an input signal; a detection unit that detects silent sections of the input signal; a matching unit that matches the output of the analysis unit against the feature parameters of each segment of a standard signal and outputs recognition segment candidates and their acoustic reliabilities; a prediction unit that outputs the next-segment candidates predicted from the partial recognition result received from a connection unit, together with their connection reliabilities; and the connection unit, which connects the recognition segment candidates output by the matching unit to output a recognition result and also outputs the partial recognition result to the prediction unit.

The continuous speech recognition device according to the third invention comprises: an analysis unit that detects feature parameters for each frame of an input signal; a detection unit that detects silent sections of the input signal; a matching unit that matches the output of the analysis unit against the feature parameters of each segment of a standard signal and outputs recognition segment candidates and their acoustic reliabilities; a prediction unit that outputs the next-segment candidates predicted from the partial recognition result received from a connection unit, their connection reliabilities, and the information amount of those next-segment candidates; and the connection unit, which connects the recognition segment candidates output by the matching unit to output a recognition result and also outputs the partial recognition result to the prediction unit.

Operation

In the continuous speech recognition device of the first invention, the analysis unit detects feature parameters for each frame of the input signal. The matching unit matches the output of the analysis unit against the feature parameters of each segment of the standard signal and outputs recognition segment candidates and their acoustic reliabilities. The prediction unit outputs to the connection unit the next-segment candidates predicted from the partial recognition result received from the connection unit, their connection reliabilities, and the information amount of those next-segment candidates. The connection unit connects the recognition segment candidates using an evaluation value expressed as a linear combination of the acoustic reliability received from the matching unit and the connection reliability received from the prediction unit, each weighted according to the information amount of the next-segment candidates received from the prediction unit. It thereby obtains the recognition result and also outputs the partial recognition result to the prediction unit.

In the continuous speech recognition device of the second invention, the analysis unit detects feature parameters for each frame of the input signal, and the detection unit detects the silent sections of the input signal. The matching unit matches the output of the analysis unit against the feature parameters of each segment of the standard signal and outputs recognition segment candidates and their acoustic reliabilities. The prediction unit outputs to the connection unit the next-segment candidates predicted from the partial recognition result received from the connection unit, together with their connection reliabilities. The connection unit connects the recognition segment candidates using an evaluation value expressed as a linear combination of the acoustic reliability received from the matching unit and the connection reliability received from the prediction unit, each weighted according to the duration of the silent section received from the detection unit. It thereby obtains the recognition result and also outputs the partial recognition result to the prediction unit.

In the continuous speech recognition device of the third invention, the analysis unit detects feature parameters for each frame of the input signal, and the detection unit detects the silent sections of the input signal. The matching unit matches the output of the analysis unit against the feature parameters of each segment of the standard signal and outputs recognition segment candidates and their acoustic reliabilities. The prediction unit outputs to the connection unit the next-segment candidates predicted from the partial recognition result received from the connection unit, their connection reliabilities, and the information amount of those next-segment candidates. The connection unit connects the recognition segment candidates using an evaluation value expressed as a linear combination of the acoustic reliability received from the matching unit and the connection reliability received from the prediction unit, each weighted according to both the information amount of the next-segment candidates received from the prediction unit and the duration of the silent section received from the detection unit. It thereby obtains the recognition result and also outputs the partial recognition result to the prediction unit.

Embodiments

Embodiments of the first, second, and third inventions are described below with reference to FIG. 1, FIG. 2, and Table 1. The embodiment below describes the continuous speech recognition device set forth in claim 3. Because the evaluation value calculation in this embodiment remains valid even when the information amount of the next-segment candidates or the duration of the silent section is not used as an input, it also serves as the embodiment of the inventions set forth in claims 1 and 2.

FIG. 1 is a block diagram of a continuous speech recognition device that uses words as segments, according to one embodiment of the present invention. In the figure, 1 is a signal input terminal, 2 is an analysis unit, 3 is a detection unit, 4 is a feature parameter storage unit, 5 is a matching unit, 6 is a word lattice storage unit, 7 is weighting unit (1), 8 is a prediction unit, 9 is a connection unit, 10 is weighting unit (2), 11 is a recognition result output terminal, and 12 is a switch. The operation of the speech recognition device configured in this way is described below.

When standard signals are registered, switch 12 is set so that the output of the analysis unit is fed to the feature parameter storage unit. A standard signal input from signal input terminal 1 is passed to analysis unit 2, which calculates feature parameters such as LPC cepstrum coefficients for each frame and stores them in feature parameter storage unit 4. Registration is complete when this has been repeated for every word to be recognized.

In this embodiment, the following are also registered in prediction unit 8 in advance: every word to be recognized (as a preceding word); the words to be recognized that can follow it, as next-word candidates; and, as the connection reliability, the connection probability determined statistically beforehand (the probability that each next-word candidate appears after the preceding word has appeared).

In this embodiment, the perplexity e(W_i) is used as the information amount of the next-word candidates; it is calculated by the equation below and registered in prediction unit 8.

[Equation not preserved in the source text.]

Here p(W_{i-1}, W_i) is the connection reliability of word W_i following word W_{i-1}. An example of this registration is shown in Table 1.

Table 1 [table contents not preserved in the source text]

In Table 1, W_{i-1} is the preceding word and W_i is the next-word candidate. When a recognition word candidate is input from connection unit 9, prediction unit 8 outputs to connection unit 9 the words that can follow that candidate, their connection probabilities, and the perplexity, as the next-word candidates, the connection reliabilities, and the information amount of the next-word candidates, respectively.
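A minimal sketch of such a prediction unit is given below. The patent's own perplexity equation and the contents of Table 1 did not survive in this text, so the sketch makes two loudly stated assumptions: perplexity is taken to be the common definition 2^H, where H is the entropy of the next-word distribution after the preceding word, and the class, field, and word names are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


def next_word_perplexity(successors: Dict[str, float]) -> float:
    """Perplexity 2**H of a next-word distribution (assumed definition)."""
    total = sum(successors.values())
    if total <= 0:
        raise ValueError("need at least one successor with positive probability")
    entropy = 0.0
    for p in successors.values():
        if p > 0:
            q = p / total  # normalise in case the table does not sum to 1
            entropy -= q * math.log2(q)
    return 2.0 ** entropy


@dataclass(frozen=True)
class Prediction:
    next_word: str
    connection: float   # connection probability, used as connection reliability
    information: float  # perplexity of the successor set


class PredictionUnit:
    """Lookup table keyed by the preceding word (hypothetical structure)."""

    def __init__(self, table: Dict[str, Dict[str, float]]) -> None:
        self._table = table
        # Registered in advance, as the text describes for the embodiment.
        self._perplexity = {
            prev: next_word_perplexity(succ) for prev, succ in table.items() if succ
        }

    def predict(self, previous_word: str) -> List[Prediction]:
        successors = self._table.get(previous_word)
        if not successors:
            return []  # no registered word can follow
        info = self._perplexity[previous_word]
        return [Prediction(word, prob, info) for word, prob in successors.items()]
```

Under this assumed definition, a preceding word followed by two equally likely words has perplexity 2, and a preceding word with a single possible successor has perplexity 1.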

Recognition is described next, with reference to FIGS. 1 and 4. Switch 12 is set so that the output of the analysis unit is fed to the matching unit. As in registration, the signal to be recognized is input from signal input terminal 1 (process 21), and analysis unit 2 calculates the feature parameters for each frame (process 22). Matching unit 5 then matches the input signal against the standard speech.

First, the frame number and the word number are both initialized to 1 (processes 23 and 24). Taking the frame with the current frame number as the start point, the input is matched against the word to be recognized that has the current word number (process 27). If the similarity is at or above the decision threshold (process 28), that word is taken as a recognition segment candidate and its similarity as the acoustic reliability, and both are output to the word lattice storage unit together with the start and end points of the match (process 29). The word number is then incremented by 1 (process 30) and processing moves to the next word to be recognized. When this has been done for all words to be recognized (process 26), the frame number is incremented by 1 (process 31) and the same processing is performed with the next frame as the start point. When all frames and all words to be recognized have been processed (process 25), the processing in matching unit 5 ends. As a result, the names of the recognition word candidates that may be present in the input continuous speech are recorded as a word lattice, together with their start positions, end positions, and acoustic reliabilities, and output to word lattice storage unit 6. This word lattice generation method is the same as in the conventional example.

The input signal is also fed to the detection unit, which calculates the power of the input signal for each frame and judges a frame to be silent when its power is at or below a fixed threshold. A run of consecutive silent frames is treated as a silent section, and its start and end positions are output as a pair to connection unit 9.
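The power-threshold silence detection just described can be sketched as follows. The patent does not say how frame power is computed or whether the end position is inclusive, so both are assumptions here: power is taken as the mean squared sample value, and the end index is exclusive so that end minus start equals the number of silent frames. The function names are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple


def frame_power(frame: Sequence[float]) -> float:
    """Mean squared amplitude of one frame (assumed power measure)."""
    if not frame:
        return 0.0
    return sum(x * x for x in frame) / len(frame)


def detect_silent_sections(
    frames: Sequence[Sequence[float]], threshold: float
) -> List[Tuple[int, int]]:
    """Return (start, end) frame index pairs of silent runs, end exclusive."""
    sections: List[Tuple[int, int]] = []
    start: Optional[int] = None
    for index, frame in enumerate(frames):
        silent = frame_power(frame) <= threshold  # at or below the threshold
        if silent and start is None:
            start = index                     # a silent run begins
        elif not silent and start is not None:
            sections.append((start, index))   # the run ended before this frame
            start = None
    if start is not None:                     # input ended during silence
        sections.append((start, len(frames)))
    return sections
```

A practical detector would probably also discard very short runs, since a single low-power frame inside a word is not a pause; the patent text does not describe such a rule.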

The processing in connection unit 9 is described next, with reference to FIGS. 1 and 2.

The connection unit concatenates the recognition word candidate names received from word lattice storage unit 6 to generate multiple recognition result candidates, and outputs the one with the highest evaluation value to recognition result output terminal 11 as the recognition result. To obtain the recognition result candidates, the partial recognition result is first set to the empty string (process 1), and the partial-result end position, which is the end position of the last recognition word candidate in the partial recognition result, is initialized to 0 (a frame number) (process 2). The connection unit then receives from the detection unit the position information of the silent sections, namely their start and end positions (process 3). Next, if there is a recognition word candidate whose start position satisfies

partial-result end position - gap ≤ start position ≤ partial-result end position + gap   (Equation 1.1)

(that is, a candidate that can follow) (process 4), its name is appended to the end of the string in the partial recognition result (process 5). Here gap is a constant that allows for errors in the start and end positions detected by the matching unit.

If, however, no recognition word candidate satisfies Equation 1.1 (process 4), and the condition

input speech frame length - gap ≤ partial-result end position ≤ input speech frame length + gap   (Equation 1.2)

is not satisfied (process 10), then the condition

partial-result end position - gap ≤ start position of a silent section ≤ partial-result end position + gap   (Equation 1.3)

is tested (process 11). If it is satisfied, a silent section is judged to be present, the partial-result end position is extended by the duration of the silent section (process 12), and processing returns to process 4. If Equation 1.3 is not satisfied (process 11), no word can follow, so the partial recognition result obtained so far is judged to be wrong and processing is abandoned.
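The decision made when no candidate can follow (processes 10 to 12) can be sketched as below. The function names and the string labels returned are hypothetical, and silent sections are assumed to arrive as (start, end) frame pairs whose difference is the duration.

```python
from typing import List, Optional, Tuple


def within_gap(reference: int, position: int, gap: int) -> bool:
    """True when reference - gap <= position <= reference + gap."""
    return reference - gap <= position <= reference + gap


def skip_silence(
    partial_end: int, silent_sections: List[Tuple[int, int]], gap: int
) -> Optional[int]:
    """Apply Equation 1.3; return the extended end position, or None."""
    for start, end in silent_sections:
        if within_gap(partial_end, start, gap):  # silence starts near the end
            return partial_end + (end - start)   # extend by its duration
    return None


def on_dead_end(
    partial_end: int,
    total_frames: int,
    silent_sections: List[Tuple[int, int]],
    gap: int,
) -> Tuple[str, int]:
    """Decide what to do when no word candidate satisfies Equation 1.1."""
    if within_gap(total_frames, partial_end, gap):  # Equation 1.2: input consumed
        return ("accept", partial_end)
    extended = skip_silence(partial_end, silent_sections, gap)
    if extended is not None:
        return ("retry", extended)   # go back to process 4 with the new end
    return ("abort", partial_end)    # partial result judged to be wrong
```

The same `within_gap` helper also expresses Equation 1.1 when it is called with the partial-result end position and a candidate's start position. A caller that loops on "retry" would likely need to mark a silent section as consumed, because a section shorter than gap could otherwise match again; the patent text does not address this case.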

With this method, processing can proceed without changing the value of gap even when the input speech contains silent sections caused, for example, by the speaker pausing for breath.

When a next-word candidate that can follow exists in process 4, its name is appended to the end of the string in the partial recognition result (process 5), and the partial-result end position is updated according to Equation 2 below (process 6).

partial-result end position = partial-result end position + (end position of the recognition word candidate - start position of the recognition word candidate)   (Equation 2)

If the recognition word candidate appended in process 5 is the i-th from the beginning of the sentence, connection unit 9 sends the (i-1)-th recognition word candidate to the prediction unit and obtains from it the next-word candidates (the words that can be the i-th word of the sentence) and their connection reliabilities (process 7).
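Processes 1, 2, 5, and 6 amount to a small state update, sketched below. The `PartialResult` record and the function name are hypothetical, and the example words and frame numbers are made up.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class PartialResult:
    """Partial recognition result; the defaults mirror processes 1 and 2."""
    words: Tuple[str, ...] = ()  # empty string of words (process 1)
    end: int = 0                 # partial-result end position (process 2)


def append_candidate(
    partial: PartialResult, word: str, start: int, end: int
) -> PartialResult:
    """Append a candidate (process 5) and apply Equation 2 (process 6)."""
    return PartialResult(
        words=partial.words + (word,),
        end=partial.end + (end - start),  # advance by the candidate's duration
    )


# Usage with made-up candidates: the end position advances by each duration.
first = append_candidate(PartialResult(), "turn", 1, 12)
second = append_candidate(first, "on", 13, 20)
```

Note that Equation 2 advances the end position by the duration of the candidate, not to the candidate's own end position, so the running end position can drift from the lattice positions by up to gap per word.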

The information amount of the next word candidate (the i-th word candidate) as seen from the (i−1)-th recognized word candidate from the beginning of the sentence is also obtained from the prediction unit. Using these values, the evaluation value is updated according to Formula 3 below (process 7). Here, the evaluation function h(W_i) of the recognition intermediate result consisting of the word string (W_1, W_2, ..., W_i) is expressed as follows.

$$h(W_i) = h(W_{i-1}) + a \cdot g(W_i) + b \cdot f(W_{i-1}, W_i) \qquad \text{(Formula 3)}$$

where $h(W_0) = 0$.

The weights are

$$a = \gamma \cdot \tau \cdot e(W_i)$$

$$b = \delta / (\tau \cdot e(W_i))$$

Here g(W_i) is the acoustic reliability of word W_i that the word lattice storage unit 6 outputs to weighting unit (1) 7; it is given weight a in weighting unit (1) 7 and input to the connection unit 9. f(W_{i−1}, W_i) is the connection reliability from word W_{i−1} to word W_i that the prediction unit 8 outputs to weighting unit (2) 10; it is given weight b in weighting unit (2) 10 and input to the connection unit 9. The acoustic reliability weight a is made proportional to the information amount e(W_i) of the next word candidate input from the prediction unit 8, and the connection reliability weight b is a function inversely proportional to that information amount. τ is the time length of the silent section (end position of the silent section − start position of the silent section) when a silent section exists immediately before the recognized word candidate W_i; when no silent section exists (τ = 0), a minimum value is set so that b does not become infinite. γ and δ are fixed constants.

With this method, acoustic reliability is prioritized when the information amount of the next word candidate is large, and connection reliability is prioritized when it is small. Consequently, even when a word string that tends to be uttered indistinctly is input and its acoustic reliability is low (such word strings tend to have next word candidates with a small information amount and rarely contain silent sections), the connection reliability can be prioritized to raise the evaluation value, so the recognition rate can be improved. Thereafter, processes 4 to 9 are repeated until the condition of process 4 is no longer satisfied.
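The weighting and the evaluation update can be sketched in Python as follows. This is a minimal sketch under stated assumptions: Formula 3 is read here as the accumulation h(W_i) = h(W_{i−1}) + a·g(W_i) + b·f(W_{i−1}, W_i) with h(W_0) = 0, the weights are read as a = γ·τ·e(W_i) and b = δ/(τ·e(W_i)), and the function names, the default constants, and the floor `tau_min` are hypothetical values chosen only for illustration.

```python
def weights(info_amount, tau, gamma=1.0, delta=1.0, tau_min=1.0):
    """Return (a, b) for one next word candidate.

    info_amount: information amount e(W_i) of the next word candidate.
    tau: time length of the silent section just before W_i (0 if none).
    tau_min: hypothetical floor so that b stays finite when tau is 0.
    """
    if info_amount <= 0:
        raise ValueError("info_amount must be positive")
    tau = max(tau, tau_min)
    a = gamma * tau * info_amount        # acoustic reliability weight
    b = delta / (tau * info_amount)      # connection reliability weight
    return a, b


def update_evaluation(h_prev, g, f, info_amount, tau, **constants):
    """One step of the assumed Formula 3 recursion."""
    a, b = weights(info_amount, tau, **constants)
    return h_prev + a * g + b * f


# Indistinct word: low acoustic reliability g, high connection reliability f,
# small information amount, no preceding silent section.
h = update_evaluation(0.0, g=0.4, f=0.9, info_amount=0.5, tau=0.0)
```

In this example the small information amount makes b larger than a, so the connection reliability term dominates the evaluation value, which is the behavior the description attributes to the method.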

Thereafter, if Formula 1.2 is satisfied (process 10), the recognition intermediate result at that time is saved as a recognition result candidate together with its evaluation value (process 13). The above procedure for obtaining recognition intermediate results is carried out over the entire word lattice input to the connection unit 9 to obtain every recognition result candidate that can exist, and the recognition result candidate with the highest evaluation value among them is output as the recognition result from the recognition result output terminal 11.
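The final selection step can be sketched in Python as below. The representation of a recognition result candidate as a `(text, evaluation_value)` pair, the function name, and the sample values are assumptions made for illustration; the patent does not describe how the candidates are stored.

```python
def select_recognition_result(candidates):
    """Return the saved candidate with the highest evaluation value,
    or None when no complete candidate was found."""
    if not candidates:
        return None
    return max(candidates, key=lambda candidate: candidate[1])


# Hypothetical saved candidates: (recognized text, evaluation value).
saved = [("candidate A", 1.7), ("candidate B", 2.4), ("candidate C", 0.9)]
result = select_recognition_result(saved)
```

Returning None for an empty list corresponds to the case where no path through the word lattice yields a complete recognition result candidate.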

As described above, this embodiment concerns the invention described in claim 3, but the approach is also effective when the evaluation value calculation does not use the time length of the silent section (as described in claim 1) or does not use the information amount of the next segment candidate (as described in claim 2).
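The three variants can be compared with a short Python sketch. The functional forms for claims 1 and 2 are an assumption: they are obtained here by simply dropping the unused factor from the claim 3 weights a = γ·τ·e and b = δ/(τ·e), which the patent text does not spell out. The function name and argument names are hypothetical.

```python
def weights_for_claim(claim, info_amount=1.0, tau=1.0, gamma=1.0, delta=1.0):
    """Return (a, b) for the weighting variant of the given claim.

    claim 1: weights depend only on the information amount.
    claim 2: weights depend only on the silent-section time length.
    claim 3: weights depend on both.
    """
    if claim == 1:
        factor = info_amount
    elif claim == 2:
        factor = tau
    elif claim == 3:
        factor = tau * info_amount
    else:
        raise ValueError("claim must be 1, 2, or 3")
    if factor <= 0:
        raise ValueError("weighting factor must be positive")
    return gamma * factor, delta / factor


a1, b1 = weights_for_claim(1, info_amount=2.0, tau=5.0)
a2, b2 = weights_for_claim(2, info_amount=2.0, tau=5.0)
a3, b3 = weights_for_claim(3, info_amount=2.0, tau=5.0)
```

In all three variants a and b move in opposite directions, so a larger factor shifts the evaluation value toward acoustic reliability and a smaller one toward connection reliability.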

Effects of the Invention

The continuous speech recognition device of the first invention changes the weights applied to the acoustic reliability and the connection reliability according to the information amount of the next segment candidate, so both reliabilities can be reflected efficiently in the evaluation value, which leads to an improved recognition rate. That is, even when the acoustic reliability of the input signal is low, if the information amount of the next segment candidate is small, the connection reliability can be prioritized to raise the evaluation value, so a correct recognition result can be obtained.

The continuous speech recognition device of the second invention changes the weights applied to the acoustic reliability and the connection reliability according to the time length of the silent section in the input speech, so both reliabilities can be reflected efficiently in the evaluation value, which leads to an improved recognition rate. That is, even when the acoustic reliability of the input signal is low, if no silent section exists, the connection reliability can be prioritized to raise the evaluation value, so a correct recognition result can be obtained.

Furthermore, when the input speech contains a silent section such as a pause for breath, recognition segment candidates can be connected without greatly increasing the number of recognition result candidates, so cases in which no recognition result is output no longer occur.

The continuous speech recognition device of the third invention changes the weights applied to the acoustic reliability and the connection reliability according to both the information amount of the next segment candidate and the time length of the silent section in the input speech. It therefore not only obtains the effects of the first and second inventions but also incorporates a characteristic of speech, the input signal, namely that the information amount of the next segment candidate is proportional to the time length of the silent section, so the recognition rate can be improved by raising the quality of the input signal.

[Brief explanation of drawings]

FIG. 1 is a block diagram of a speech recognition device according to an embodiment of the present invention, FIG. 2 is a flowchart of the processing in the connection unit in an embodiment of the present invention, FIG. 3 is a block diagram of a speech recognition device in the conventional example, FIG. 4 is a flowchart of word lattice generation in the conventional example, and FIG. 5 is a flowchart of the processing in the connection unit in the conventional example.

1 ... input terminal, 2 ... analysis unit, 3 ... detection unit, 4 ... feature parameter storage unit, 5 ... matching unit, 6 ... word lattice storage unit, 7 ... weighting unit (1), 8 ... prediction unit, 9 ... connection unit, 10 ... weighting unit (2), 11 ... output terminal, 12 ... switch.

Name of agent: Patent attorney Shigetaka Awano and one other

Claims

[Claims]

(1) A continuous speech recognition device comprising: an analysis unit that detects feature parameters for each unit time (hereinafter referred to as a frame) of an input signal; a matching unit that compares the output of the analysis unit with feature parameters for each fixed time period (hereinafter referred to as a segment) of a standard signal and outputs recognition segment candidates and their similarity (hereinafter referred to as acoustic reliability); a connection unit that connects the recognition segment candidates output by the matching unit to output a recognition result, and outputs a recognition intermediate result to a prediction unit; and the prediction unit, which outputs to the connection unit the next segment candidate predicted from the recognition intermediate result input from the connection unit, the reliability that this segment will appear (hereinafter referred to as connection reliability), and the information amount of the predicted next segment candidate, wherein the connection unit obtains a continuous speech recognition result by connecting the recognition segment candidates using an evaluation value expressed as a linear combination of the acoustic reliability input from the matching unit and the connection reliability input from the prediction unit, each weighted according to the information amount of the next segment candidate input from the prediction unit.

(2) A continuous speech recognition device comprising: an analysis unit that detects feature parameters for each frame of an input signal; a detection unit that detects silent sections of the input signal; a matching unit that compares the output of the analysis unit with feature parameters for each segment of a standard signal and outputs recognition segment candidates and their acoustic reliability; a connection unit that connects the recognition segment candidates output by the matching unit to output a recognition result, and outputs a recognition intermediate result to a prediction unit; and the prediction unit, which outputs to the connection unit the next segment candidate predicted from the recognition intermediate result input from the connection unit and its connection reliability, wherein the connection unit obtains a continuous speech recognition result by connecting the recognition segment candidates using an evaluation value expressed as a linear combination of the acoustic reliability input from the matching unit and the connection reliability input from the prediction unit, each weighted according to the time length of the silent section input from the detection unit.

(3) A continuous speech recognition device comprising: an analysis unit that detects feature parameters for each frame of an input signal; a detection unit that detects silent sections of the input signal; a matching unit that compares the output of the analysis unit with feature parameters for each segment of a standard signal and outputs recognition segment candidates and their acoustic reliability; a connection unit that connects the recognition segment candidates output by the matching unit to output a recognition result, and outputs a recognition intermediate result to a prediction unit; and the prediction unit, which outputs to the connection unit the next segment candidate predicted from the recognition intermediate result input from the connection unit, its connection reliability, and the information amount of the next segment candidate, wherein the connection unit obtains a continuous speech recognition result by connecting the recognition segment candidates using an evaluation value expressed as a linear combination of the acoustic reliability input from the matching unit and the connection reliability input from the prediction unit, each weighted according to the information amount of the next segment candidate input from the prediction unit and the time length of the silent section input from the detection unit.