JPH10207485A

JPH10207485A - Speech recognition device and speaker adaptation method

Info

Publication number: JPH10207485A
Application number: JP9009777A
Authority: JP
Inventors: Hiroshi Kanazawa; 博史金澤
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 1997-01-22
Filing date: 1997-01-22
Publication date: 1998-08-07

Abstract

(57) [Summary]

[Problem] To improve recognition performance even with a small amount of training data, to allow further improvement once a large amount of data has been collected, and to minimize the burden that adaptation places on the user.

[Solution] In the speaker adaptation mode, a phoneme label sequence determination unit 13 takes the input speech of a specific speaker and obtains correct phoneme sequence information by matching it against the HMMs in a dictionary storage unit 15 that correspond to the correct phoneme sequence. It also obtains optimal phoneme sequence information, for the sequence with the maximum score, by matching against all the HMMs in the dictionary storage unit 15. An adaptation unit 14 trains the mean vectors and variances of the phoneme HMMs in the dictionary storage unit 15 by maximum a posteriori (MAP) estimation in accordance with the correct phoneme sequence information. It then compares the phoneme label sequence in the correct phoneme sequence information with the one in the optimal phoneme sequence information, extracts the speech patterns to which a phoneme label different from the correct phoneme label has been assigned, and subtracts those speech patterns from the mean vector of the phoneme HMM corresponding to that differing phoneme label.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to a speech recognition device having a speaker adaptation function, and to a speaker adaptation method.

[0002]

[Prior Art] In general, two approaches to speaker adaptation are used in speech recognition: adapting the dictionary used for recognition to the characteristics of the speaker's voice, and normalizing the speaker characteristics contained in the input speech.

[0003] First, as a method of adapting a recognition dictionary to the speech of a specific speaker, speaker adaptation in the multiple similarity method (also rendered as the composite similarity method), one of the statistical recognition techniques, is known. The speech recognition dictionary used in this method consists of the eigenvalues and eigenvectors obtained by principal component analysis of a covariance matrix built from speech patterns. To adapt this dictionary to a specific speaker's speech, the covariance matrix of each category (the unit of matching) is updated with the speech patterns belonging to that category, as in the following equation, and principal component analysis of the updated covariance matrix yields a speaker-adapted recognition dictionary.

[0004]

$$K' = K + a \sum X X^{t}$$

Here, $K'$ is the covariance matrix after the update, $K$ is the covariance matrix before the update, $X$ is a speech pattern belonging to the corresponding category, $a$ is the update coefficient, and $t$ denotes transposition.
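As an illustration of the covariance update in paragraph [0004], the following is a minimal Python sketch, not part of the original publication, of adding a speaker's patterns to one category's covariance matrix and then re-running principal component analysis. The function name `adapt_category`, the use of NumPy, and any value chosen for the update coefficient `a` are assumptions made for this example.

```python
import numpy as np

def adapt_category(K, patterns, a):
    """Update one category's covariance matrix, then redo PCA (illustrative only).

    K        : (d, d) covariance matrix before the update
    patterns : (n, d) array, one speaker speech pattern X per row
    a        : update coefficient (its value is an assumption, not from the source)
    """
    K = np.asarray(K, dtype=float)
    X = np.asarray(patterns, dtype=float)
    # K' = K + a * sum_i X_i X_i^t ; X.T @ X is that sum of outer products.
    K_new = K + a * (X.T @ X)
    # Principal component analysis of the symmetric matrix K'.
    eigvals, eigvecs = np.linalg.eigh(K_new)
    order = np.argsort(eigvals)[::-1]  # sort so the largest eigenvalue comes first
    return K_new, eigvals[order], eigvecs[:, order]
```

The returned eigenvalues and eigenvectors play the role of the speaker-adapted dictionary entries for that category; how many of them a real system keeps is not stated in the text.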

[0005] In this way, by adding speech uttered by a specific speaker to the covariance matrix as it was before the update (for example, one created for unspecified speakers), a recognition dictionary that reflects the characteristics of the specific speaker's voice can be created.

[0006] A method called LVQ (Learning Vector Quantization) has also been proposed. This method adapts to the speaker the codebook used to generate the code sequences for a discrete HMM (Hidden Markov Model). Here, the codebook (code vectors) of each category is updated using the input speech pattern, based on the recognition result. Specifically, when the input speech is judged to belong to a category different from the correct category, processing is performed that brings the input speech pattern closer to the correct category and moves it away from the incorrect category. (Reference: Shun-ichi Amari (supervising editor), Seiichi Nakagawa, Kiyohiro Shikano, and Yoh'ichi Tohkura, "Speech, Hearing and Neural Network Models," pp. 205-206, Ohmsha.)

Furthermore, a speaker adaptation method for the mean vectors of the Gaussian distributions in a continuous HMM, called maximum a posteriori (MAP) estimation, has been proposed. Like the methods above, it updates the parameters of the continuous HMM using speech uttered by the speaker. (Reference: Japanese Patent Application Laid-Open No. 8-95592.)

On the other hand, the spectrum mapping method has been proposed as a way of normalizing the speaker characteristics of the input speech. In this method, a correspondence table for mapping the speech patterns of a specific speaker onto those of a standard speaker is obtained in advance, and at recognition time the specific speaker's input speech is converted into the standard speaker's speech before recognition. (Reference: ATR Advanced Technology Series: Automatic Translation Telephone, pp. 70-72, Ohmsha.) This gives recognition performance close to that of the standard speaker for the specific speaker's speech, without changing the recognition dictionary.

[0007]

[Problems to Be Solved by the Invention] As described above, the speaker adaptation methods proposed for speech recognition include methods that adapt the recognition dictionary to a specific speaker's speech using speech data uttered by that speaker, and methods that establish a correspondence with a standard speaker and map the specific speaker's input speech onto the standard speaker's speech.

[0008] However, recognition schemes based on statistical techniques need a large amount of adaptation data, and a small amount of data has little effect, so collecting the adaptation speech data places a heavy burden on the speaker. Conversely, with methods that are effective with a small amount of data, such as MAP estimation, the performance gain saturates early, so even if a large amount of speech data can be collected, the recognition rate does not improve beyond a certain level.

[0009] Furthermore, even with the spectrum mapping method and similar methods, a large amount of speech data from the specific speaker is needed in advance to obtain a correct mapping, so the burden on the speaker is again a problem.

[0010] To reduce the burden on the speaker, a variant of the above methods called unsupervised learning, in which the correct category is not given in advance, has also been studied. This makes it possible to train the dictionary on speech uttered by the speaker without correct-answer information. Its advantage is that speech actually submitted for recognition can be used for training as it is, without the speaker having to utter training speech beforehand. However, the speech may be learned as the wrong category, and the improvement in recognition performance is generally smaller than with supervised learning.

[0011] The present invention was made in view of the above problems. Its object is to provide a speech recognition device and a speaker adaptation method that markedly improve recognition performance even with a small amount of training data, that can be expected to improve further once a large amount of data has been collected, and that minimize the burden adaptation places on the user.

[0012]

[Means for Solving the Problems] To solve the above problems, the present invention is characterized by comprising: phoneme sequence information determination means which, for a known phoneme sequence corresponding to the input speech of a specific speaker, matches the speech pattern obtained by speech analysis of that input speech against the corresponding recognition dictionaries (phoneme recognition dictionaries) to extract correct phoneme sequence information containing the matching result, and which matches the speech pattern against all the recognition dictionaries (phoneme recognition dictionaries) to extract optimal phoneme sequence information containing the matching result for the phoneme sequence that gives the maximum likelihood; first adaptive learning means which trains the relevant recognition dictionaries by MAP estimation in accordance with the correct phoneme sequence information; and second adaptive learning means which compares the correct phoneme sequence information with the optimal phoneme sequence information, extracts the parts where they differ, and trains the relevant recognition dictionaries using the speech pattern in the direction that eliminates those differences.

[0013] In this configuration, supplying a known phoneme sequence corresponding to the input speech of a specific speaker allows correct phoneme sequence information to be extracted, containing the phoneme segments and phoneme label sequence that correspond to that known phoneme sequence (the correct phoneme sequence). In addition, matching the speech pattern of the input speech (the input speech pattern) against all the recognition dictionaries yields optimal phoneme sequence information, containing the phoneme segments and phoneme label sequence of the phoneme sequence with the maximum likelihood (score). When the phoneme recognition dictionaries are phoneme HMMs, the correct and optimal phoneme sequence information include the normal-distribution parameters, consisting of a mean vector and a variance, for each phoneme, each state, and each mixture component (each component of the mixture of normal distributions).

[0014] Once the correct phoneme sequence information has been extracted, the relevant phoneme recognition dictionaries can be trained by MAP estimation in accordance with it. In this training, when the phoneme recognition dictionaries are phoneme HMMs, the mean vector and variance that parameterize each normal distribution of an HMM are updated selectively, using the speech patterns of the corresponding phoneme segments (those carrying that phoneme label). Normally each phoneme HMM has normal-distribution parameters for each state (identified by a state number) and each mixture component (identified by a mixture number), so the parameters of (number of phonemes) × (number of states) × (number of mixture components) normal distributions are trained. However, the normal-distribution parameters of a phoneme HMM for which no speech pattern exists in a corresponding phoneme segment are excluded from training.

[0015] Next, the parts where the correct and optimal phoneme sequence information differ are extracted, and the phoneme recognition dictionaries are trained (updated) using the input speech pattern in the direction that eliminates those differences. The differing parts are extracted by comparing the phoneme label sequence in the correct phoneme sequence information (the correct label sequence) with the phoneme label sequence in the optimal phoneme sequence information (the optimal label sequence). For example, they are the segments of the optimal label sequence that have been assigned a phoneme label different from the correct phoneme label of the correct label sequence. Extracting the speech patterns in these segments and using them to train the phoneme recognition dictionaries makes it possible to update the dictionaries in the direction that eliminates the differences. In particular, when the phoneme recognition dictionaries are phoneme HMMs, the speech patterns in a segment that the optimal label sequence assigns to a phoneme label different from the correct label sequence are subtracted from the mean vector of the phoneme HMM of that differing phoneme label. As a result, when a pattern similar to that speech pattern appears in the future, it will be treated as a speech pattern of the correct phoneme label.

[0016] Thus, in the present invention, the use of MAP estimation achieves effective speaker adaptation even when the training data is small. In addition, comparing the correct label sequence with the optimal label sequence enables competitive learning that takes the recognition result into account, so recognition performance can be improved further when a large amount of speech data is available. The user therefore gets the full benefit of the adaptation function whether the training data is scarce or plentiful, and as a result the burden of training is greatly reduced.

[0017]

[Embodiments of the Invention] An embodiment of the present invention is described below with reference to the drawings. FIG. 1 is a block diagram showing the basic configuration of a speech recognition device according to one embodiment of the present invention. The speech recognition device of FIG. 1 (this device) mainly comprises a speech input unit 11, a speech analysis unit 12, a phoneme label sequence determination unit 13, an adaptation unit 14, a dictionary storage unit 15, a recognition unit 16, a recognition vocabulary storage unit 17, and a control unit 18.

[0018] This device operates in two modes: (1) a speaker adaptation mode and (2) a recognition mode. The user selects the mode by operating an input means such as a keyboard or mouse. The control unit 18 accepts the mode designation from the user and controls execution of processing in the accepted mode.

[0019] Of these two modes, the processing in (1) the speaker adaptation mode, which characterizes the present invention, is described here with reference to the flowchart of FIG. 2 as appropriate. First, in the speaker adaptation mode, when the user (the specific speaker) utters speech, the speech input unit 11 A/D-converts it at a predetermined sampling frequency into speech data, which is a digital time-series signal. The speech input unit 11 computes the speech power of this data at fixed intervals (for example, 8 ms; this unit is hereinafter called a frame) and uses the power time series to detect the start and end times of the uttered speech. Based on the detected start and end times, the speech input unit 11 then extracts the speech data within that interval and sends it to the speech analysis unit 12.
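The frame-power endpoint detection of paragraph [0019] could look roughly like the following Python sketch. The source gives only the 8 ms frame length as an example, so the function name `detect_endpoints`, the fixed decibel threshold, and the rule "first and last frame above the threshold" are all assumptions made for illustration.

```python
import numpy as np

def detect_endpoints(samples, sample_rate, frame_ms=8.0, threshold_db=-40.0):
    """Return (start_sample, end_sample) of the speech region, or None.

    Power is computed per fixed-length frame; frames whose power exceeds
    threshold_db (relative to amplitude 1.0, an assumed convention) count as speech.
    """
    x = np.asarray(samples, dtype=float)
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    n_frames = len(x) // frame_len
    if n_frames == 0:
        return None
    frames = x[:n_frames * frame_len].reshape(n_frames, frame_len)
    power_db = 10.0 * np.log10(np.mean(frames ** 2, axis=1) + 1e-12)
    active = np.flatnonzero(power_db > threshold_db)
    if active.size == 0:
        return None
    start = int(active[0]) * frame_len          # first sample of the first loud frame
    end = (int(active[-1]) + 1) * frame_len     # one past the last loud frame
    return start, end
```

A practical detector would usually add hangover time and noise-floor tracking; the sketch keeps only the power-time-series idea named in the text.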

[0020] The speech analysis unit 12 performs frequency analysis, for example by fast Fourier transform (FFT), on the speech data within the start-to-end interval sent from the speech input unit 11, converting the time-series signal of the speech data into time-series data of frequency parameters. Here, a 256-point FFT is performed, and the resulting 128-dimensional power spectrum is compressed on the Bark scale into 16 band-pass filter outputs, producing a speech pattern that consists of one 16-dimensional feature vector per frame.
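To make the analysis chain of paragraph [0020] concrete, here is a hedged Python sketch of one frame going through a 256-point FFT, a 128-bin power spectrum, and 16 Bark-scale bands. The source does not state the sampling frequency, the window, the Bark formula, or how band edges are placed; the 12 kHz rate, the Hann window, Traunmueller's Bark approximation, and equal-width bands on the Bark axis are all assumptions, as is the name `frame_features`.

```python
import numpy as np

def bark(f_hz):
    # Traunmueller's approximation of the Bark scale (an assumed choice).
    return 26.81 * f_hz / (1960.0 + f_hz) - 0.53

def frame_features(frame, sample_rate=12000, n_fft=256, n_bands=16):
    """Map one frame of samples to a 16-dimensional band-energy vector."""
    x = np.asarray(frame, dtype=float)[:n_fft]
    x = np.pad(x, (0, n_fft - len(x)))                 # zero-pad short frames
    spectrum = np.fft.rfft(x * np.hanning(n_fft), n=n_fft)
    power = np.abs(spectrum[:n_fft // 2]) ** 2         # 128 bins, Nyquist bin dropped
    freqs = np.arange(n_fft // 2) * sample_rate / n_fft
    z = bark(freqs)
    edges = np.linspace(z[0], z[-1], n_bands + 1)      # equal-width bands in Bark
    band = np.clip(np.searchsorted(edges, z, side="right") - 1, 0, n_bands - 1)
    # Sum the power of all FFT bins that fall into each band.
    return np.bincount(band, weights=power, minlength=n_bands)
```

Stacking the vectors returned for successive frames gives the time series of 16-dimensional feature vectors that the text calls a speech pattern.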

[0021] The speech pattern obtained by the speech analysis unit 12, that is, the time-series data of frequency parameters resulting from frequency analysis of the input speech, is stored in a speech pattern storage unit (not shown).

[0022] Thus, when the specific speaker utters each of n words m times, m speech patterns for each of the n words are stored in the speech pattern storage unit.

[0023] When the group of speech patterns corresponding to all the input speech needed for speaker adaptation has been stored in the speech pattern storage unit, the control unit 18 activates the phoneme label sequence determination unit 13. The phoneme label sequence determination unit 13 then performs phoneme label sequence determination processing, which obtains the correct phoneme sequence information and the optimal phoneme sequence information as described below (step S1).

[0024] That is, for the known phoneme sequence corresponding to the input speech (the correct phoneme sequence), the phoneme label sequence determination unit 13 performs Viterbi matching between the speech pattern obtained by the speech analysis unit 12 and the phoneme HMMs (here, continuous HMMs) stored in the dictionary (recognition dictionary) storage unit 15 that correspond to that correct phoneme sequence. This yields the segment of each phoneme and, for each frame in a segment, the state and the mixture component (the normal distribution with which mixture number) of the phoneme HMM to which the frame corresponds. The result is held as correct phoneme sequence information, which includes the phoneme label sequence.

[0025] Separately, without being given the correct phoneme sequence, the phoneme label sequence determination unit 13 performs Viterbi matching between the speech pattern and all the phoneme HMMs stored in the dictionary storage unit 15. This yields the phoneme sequence that gives the maximum score (likelihood), hereinafter called the optimal phoneme sequence, together with its segments (phoneme segments) and the phoneme HMM state and mixture component corresponding to each frame. The result is held as optimal phoneme sequence information, which includes the phoneme label sequence.
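Both matching passes in paragraphs [0024] and [0025] rest on Viterbi search. The following Python sketch shows a generic log-domain Viterbi decoder, offered only as an illustration: the function name `viterbi` and the matrix layout are assumptions, and the per-frame mixture component selection described in the text is not modelled. Under this sketch, the constrained pass would pass in only the states and transitions of the correct phoneme sequence, while the unconstrained pass would allow every phoneme HMM.

```python
import numpy as np

def viterbi(log_emit, log_trans, log_init):
    """Return (best state path, best score) for T frames over S states.

    log_emit  : (T, S) log-likelihood of each frame under each state
    log_trans : (S, S) log transition scores, row = from-state, column = to-state
    log_init  : (S,)   log initial-state scores
    """
    log_emit = np.asarray(log_emit, dtype=float)
    T, S = log_emit.shape
    score = np.asarray(log_init, dtype=float) + log_emit[0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + log_trans          # cand[i, j]: reach state j from state i
        back[t] = np.argmax(cand, axis=0)          # best predecessor of each state
        score = cand[back[t], np.arange(S)] + log_emit[t]
    path = [int(np.argmax(score))]
    for t in range(T - 1, 0, -1):                  # trace the predecessors backwards
        path.append(int(back[t, path[-1]]))
    return path[::-1], float(np.max(score))
```

Reading off where the state path crosses from one phoneme's states to the next gives the phoneme segments, and the per-frame state indices correspond to the state numbers held in the sequence information.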

[0026] FIGS. 3 and 4 show an example of the correct phoneme sequence information and the optimal phoneme sequence information obtained in this way by the phoneme label sequence determination unit 13, for the case where the input speech is "watashi" (わたし, meaning "I") and the phoneme notation is therefore "WATASHI".

[0027] The phoneme label sequence determination unit 13 performs the above processing on all the speech patterns obtained by the speech analysis unit 12. When the phoneme label sequence determination processing (step S1) is finished, control passes to the adaptation unit 14. Using the two kinds of phoneme sequence information obtained for each speech pattern by the phoneme label sequence determination unit 13, the adaptation unit 14 updates (trains) the parameters of the phoneme HMMs by the procedure described below.

[0028] First, based on the correct phoneme sequence information, the adaptation unit 14 takes as training targets the parameters (mean vector and variance) of each normal distribution of the phoneme HMM (continuous HMM) of each phoneme in the corresponding correct phoneme sequence, and updates them by MAP estimation as follows (step S2).

[0029] That is, the adaptation unit 14 updates the mean vector of a normal distribution of the phoneme HMM of phoneme (phoneme category) $k$ by

$$\mu_k' = \frac{\alpha \mu_k + \sum_i X_i}{\alpha + N}$$

and likewise updates the variance by

$$\sigma_k' = \frac{\sum_i X_i^{2} - (\alpha + N)\,{\mu_k'}^{2} + \beta \sigma_k + \alpha \mu_k^{2}}{N + \beta}$$

[0030] Here, $\mu_k$ is the mean vector before the update, $\mu_k'$ is the mean vector after the update, $N$ is the number of speech patterns used for training, $X_i$ is a speech pattern used for training, $\sigma_k$ is the variance before the update, $\sigma_k'$ is the variance after the update, $\alpha$ and $\beta$ are update coefficients, $\sum_i X_i$ is the sum of the $N$ speech patterns $X_i$, and $\sum_i X_i^{2}$ is the sum of the squares of the $N$ speech patterns $X_i$.
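The two MAP update formulas of paragraphs [0029] and [0030] translate almost directly into code. The Python sketch below is an illustration under stated assumptions: the name `map_update` is hypothetical, squares are taken element by element (that is, a diagonal variance is assumed), and a distribution with no training patterns is returned unchanged, which mirrors the exclusion rule stated elsewhere in the description.

```python
import numpy as np

def map_update(mu, var, patterns, alpha, beta):
    """MAP update of one normal distribution's mean vector and variance.

    mu, var  : parameters before the update (length-d vectors)
    patterns : the N speech patterns X_i assigned to this distribution, one per row
    alpha, beta : update coefficients (their values are not given in the source)
    """
    mu = np.asarray(mu, dtype=float)
    var = np.asarray(var, dtype=float)
    X = np.asarray(patterns, dtype=float).reshape(-1, mu.size)
    N = X.shape[0]
    if N == 0:
        return mu, var  # no data for this distribution: leave it untrained
    # mu' = (alpha * mu + sum_i X_i) / (alpha + N)
    mu_new = (alpha * mu + X.sum(axis=0)) / (alpha + N)
    # sigma' = (sum_i X_i^2 - (alpha + N) mu'^2 + beta * sigma + alpha * mu^2) / (N + beta)
    var_new = ((X ** 2).sum(axis=0) - (alpha + N) * mu_new ** 2
               + beta * var + alpha * mu ** 2) / (N + beta)
    return mu_new, var_new
```

With a large `alpha` the updated mean stays close to the prior mean, and as `N` grows it approaches the sample mean, which is why the method is effective with little data but saturates early.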

[0031] The above training (updating) is performed for every combination of phoneme, state, and mixture component. For example, with 100 phonemes, 4 states per phoneme HMM, and 5 mixture components, 100 × 4 × 5 = 2000 normal distributions are trained. However, a distribution for which no speech pattern is available for training is not trained.

[0032] When the adaptation unit 14 has finished the training (update) processing described above, that is, the processing that trains (updates) the mean vector and variance of each normal distribution of the phoneme HMM of each phoneme in the correct phoneme sequence indicated by the correct phoneme sequence information (step S2), it performs additional training of the mean vectors. This additional training is described with reference to FIG. 5, assuming for simplicity that each phoneme HMM has three states and three mixture components.

[0033] Here, as stated earlier, suppose that the speech "watashi" has been input as training speech. Suppose also that the sequence "W-A-T-A-SH-I", the phoneme notation of this input speech, has been fitted to the speech pattern of the input speech by Viterbi matching with the phoneme HMMs in the phoneme label sequence determination unit 13, and that as a result the segments of the phonemes in the sequence (W, A, T, A, SH, I) have been obtained (by the processing of step S1), as in the example of the correct phoneme sequence denoted by reference numeral 51 in FIG. 5.

[0034] Suppose further that when the phoneme label sequence determination unit 13 searched for the optimal phoneme sequence with the maximum score for the speech pattern of the input speech "watashi", it obtained (by the processing of step S1) the optimal phoneme sequence "W-A-H-T-A-SH-I" denoted by reference numeral 52 in FIG. 5, together with the segments of the phonemes in that sequence (W, A, H, T, A, SH, I).

[0035] Between the correct phoneme sequence 51 and the optimal phoneme sequence 52 there are six segments that carry different phoneme labels; call them segments a, b, c, d, e, and f. Segment a, for example, is assigned to the phoneme (label) "W" in the correct phoneme sequence 51 but to "A" in the optimal phoneme sequence 52. As is clear from FIG. 3, segment a consists of two frames, the fifth and the sixth.

[0036] When step S2 is finished, the adaptation unit 14 first sets the frame number j, which indicates the position of the frame to be processed, to its initial value of 1 (step S3).

[0037] Next, the adaptation unit 14 compares the phoneme label of the j-th frame in the correct phoneme sequence 51 (contained in the correct phoneme sequence information) with the phoneme label of the j-th frame in the optimal phoneme sequence 52 (contained in the optimal phoneme sequence information) (step S4). If the two labels do not differ (step S5), the adaptation unit 14 advances the frame number j by one (step S6) and returns to step S4.

[0038] In the example of FIG. 3, the phoneme labels of the first through fourth frames are "W" in both the correct phoneme sequence 51 and the optimal phoneme sequence 52, so they match. In contrast, the phoneme labels of the fifth and sixth frames, that is, of the frames in segment a, are "W" in the correct phoneme sequence 51 but "A" in the optimal phoneme sequence 52, so they differ.

[0039] When the phoneme label of the j-th frame differs between the correct phoneme sequence 51 and the optimal phoneme sequence 52, as with the fifth or sixth frame (steps S4, S5), the adaptation unit 14 holds the phoneme label name, HMM state number, and HMM mixture number of the j-th frame in the optimal phoneme sequence 52 (for the fifth frame in the example of FIG. 3: phoneme label name = A, HMM state number = 1, HMM mixture number = 3), together with the speech pattern of the j-th frame (step S7).

[0040] Next, the adaptation unit 14 determines whether processing has reached the final frame (step S8). If the final frame has not yet been processed, it advances the frame number j by one (step S6) and returns to step S4.
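The frame-by-frame comparison of steps S4 to S8 can be sketched as follows. This is only an illustration under stated assumptions: the names `FrameAlignment`, `HeldFrame`, and `collect_mismatched_frames` are hypothetical, and the feature values are made up. Only the frame-5 entry (label A, state 1, mixture 3) is taken from the FIG. 3 example in the text.

```python
from typing import List, NamedTuple, Sequence


class FrameAlignment(NamedTuple):
    """One frame of a matching result (hypothetical field names)."""
    label: str    # phoneme label name
    state: int    # HMM state number
    mixture: int  # HMM mixture number


class HeldFrame(NamedTuple):
    """What step S7 holds for a frame whose labels differ."""
    frame: int                # 1-based frame number j
    label: str                # label assigned in the optimal sequence
    state: int
    mixture: int
    pattern: Sequence[float]  # speech pattern of frame j


def collect_mismatched_frames(
    correct: Sequence[FrameAlignment],
    optimal: Sequence[FrameAlignment],
    patterns: Sequence[Sequence[float]],
) -> List[HeldFrame]:
    """Walk all frames and hold those whose phoneme labels differ."""
    if not (len(correct) == len(optimal) == len(patterns)):
        raise ValueError("all three sequences must cover the same frames")
    held = []
    for j, (c, o, x) in enumerate(zip(correct, optimal, patterns), start=1):
        if c.label != o.label:  # steps S4 and S5: labels differ
            # step S7: keep the optimal-sequence label, state, mixture, pattern
            held.append(HeldFrame(j, o.label, o.state, o.mixture, x))
    return held  # the loop ends once the final frame is processed (step S8)


# FIG. 3 example: frames 1-4 are "W" in both sequences; frames 5-6 are "W"
# in the correct sequence but "A" in the optimal one. State and mixture
# numbers other than those of frame 5, and all feature values, are assumed.
correct = [FrameAlignment("W", 1, 1)] * 6
optimal = [FrameAlignment("W", 1, 1)] * 4 + [FrameAlignment("A", 1, 3)] * 2
patterns = [[0.1 * j, 0.2 * j] for j in range(1, 7)]
held = collect_mismatched_frames(correct, optimal, patterns)
```

Here `held` contains the entries for frames 5 and 6, each tagged with the label, state number, and mixture number from the optimal sequence, which is the information needed to locate the normal distribution to be updated.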

[0041] The processing from step S4 onward is repeated in this way. Once the final frame has been processed (step S8), the adaptation unit 14 uses the speech patterns it has held, namely those of the frames whose phoneme labels differ between the correct phoneme sequence 51 (in the correct phoneme sequence information) and the optimal phoneme sequence 52 (in the optimal phoneme sequence information), to update the mean vector of the normal distribution identified by the corresponding state number and mixture number in the corresponding phoneme HMM, and stores the result back in the dictionary storage unit 15 (step S9).

[0042] The mean vector update in step S9 is described in detail below. Taking section a as an example, this section is assigned to "W" in the correct phoneme sequence 51 but to "A" in the optimal phoneme sequence 52. Section a should properly be regarded as "W", not "A".

[0043] In this embodiment, therefore, so that a pattern resembling the speech patterns in section a is not matched to "A" when it appears in the future, the speech patterns in section a are subtracted from the mean vector of the phoneme HMM for "A" (mean vector update processing), according to the following equation.

[0044]

$$\mu_k'' = \mu_k' + \frac{\gamma}{N} \sum_{i=1}^{N} \left( X_i - \mu_k' \right)$$

Here, $\mu_k'$ is the mean vector before the update, $\mu_k''$ is the mean vector after the update, $\gamma$ is the update coefficient (a negative value), $X_i$ is a speech pattern used for training, $N$ is the number of speech patterns used for training, $k$ is the phoneme category, and the sum is taken of $X_i - \mu_k'$ over the $N$ speech patterns $X_i$.
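As a worked illustration of this update, the sketch below applies the equation to plain Python lists. The function name `update_mean` and the numeric values are assumptions chosen for the example, not values from the embodiment.

```python
from typing import List, Sequence


def update_mean(mu: Sequence[float],
                patterns: Sequence[Sequence[float]],
                gamma: float) -> List[float]:
    """Return mu'' = mu' + (gamma / N) * sum_i (X_i - mu').

    mu       -- mean vector before the update (mu')
    patterns -- the N speech patterns X_i held for this normal distribution
    gamma    -- update coefficient; the text specifies a negative value
    """
    if gamma >= 0:
        raise ValueError("gamma is expected to be negative")
    if not patterns:
        return list(mu)  # nothing was held, so the mean is left unchanged
    n = len(patterns)
    return [m + (gamma / n) * sum(x[d] - m for x in patterns)
            for d, m in enumerate(mu)]


# Made-up two-dimensional example: the held patterns average to (3.0, 1.0).
mu_before = [1.0, 2.0]
held_patterns = [[2.0, 2.0], [4.0, 0.0]]
mu_after = update_mean(mu_before, held_patterns, gamma=-0.1)
# mu_after is approximately [0.8, 2.1]
```

Because the coefficient is negative, each component of the mean moves away from the average of the held patterns (from 1.0 to about 0.8 while the patterns average 3.0, and from 2.0 to about 2.1 while they average 1.0), which is the intended effect of making "A" a worse match for such patterns.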

[0045] The phoneme HMM whose mean vector has been updated in this way is stored back in the dictionary storage unit 15, as described above, and used for recognition processing. By updating (training) the mean vector and variance of each phoneme HMM with speech uttered by a specific speaker in this manner, the phoneme HMMs can be adapted to that speaker and recognition performance can be improved. The improvement obtained when this phoneme HMM parameter update (training) method, that is, the speaker adaptation method, is applied to a speech recognition device was confirmed in a 500-word recognition experiment with three male speakers. The result, the recognition rate averaged over the speakers as a function of the number of speech data items used for training, is shown by the solid line in FIG. 6. For reference, the broken line shows the case in which only the maximum a posteriori probability estimation method is used. In FIG. 6, the horizontal axis is the number of speech data items used for training and the vertical axis is the recognition rate.

[0046] As is clear from FIG. 6, the method applied in this embodiment achieves higher recognition performance than the maximum a posteriori probability estimation method alone, even when little training data is available, and its recognition performance does not saturate as the amount of training data increases.

[0047] When updating the HMM parameters (the mean vectors and variances) according to the flowchart of FIG. 2 (steps S1 to S9) has raised recognition performance by, for example, at least a predetermined ratio, the series of speaker adaptation processing ends (step S10). If the rate of improvement has not reached the predetermined ratio, the processing from step S1 onward is performed again. The termination condition may instead be the number of passes (the number of times steps S1 to S9 are repeated) rather than the rate of improvement in recognition performance.
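The two termination conditions of step S10 can be sketched as a small driver loop. Everything here is hypothetical: `adapt_once` stands in for one pass of steps S1 to S9, `evaluate` for whatever measurement of recognition rate is available, and the improvement is assumed to be measured relative to the rate before the first pass, which is only one possible reading of "predetermined ratio".

```python
from typing import Callable, Optional


def run_adaptation(adapt_once: Callable[[], None],
                   evaluate: Callable[[], float],
                   max_passes: int = 10,
                   min_gain: Optional[float] = None) -> int:
    """Repeat adaptation passes until a stopping condition holds.

    Stops when the relative gain in recognition rate over the baseline
    reaches min_gain (if given), or after max_passes passes at the latest.
    Returns the number of passes performed.
    """
    if max_passes < 1:
        raise ValueError("max_passes must be at least 1")
    baseline = evaluate()  # recognition rate before any adaptation (assumed)
    for passes in range(1, max_passes + 1):
        adapt_once()  # one pass of steps S1 to S9
        if min_gain is not None and baseline > 0:
            if (evaluate() - baseline) / baseline >= min_gain:
                return passes  # step S10: improvement reached the ratio
    return max_passes  # pass-count condition
```

Keeping `max_passes` as a hard upper bound means the loop still terminates when the improvement criterion is never met.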

[0048] Next, processing in the recognition mode is described. The recognition mode is not directly related to the present invention, so a commonly used recognition process is briefly described here as an example.

[0049] In the recognition mode, the speech input unit 11 and the speech analysis unit 12 operate as in the speaker adaptation mode described above, and the speech analysis unit 12 obtains a speech pattern representing the features of the input speech.

[0050] The speech pattern of the input speech obtained by the speech analysis unit 12 is sent to the recognition unit 16. For each vocabulary item stored in the recognition vocabulary storage unit 17, the recognition unit 16 performs Viterbi matching against the speech pattern using the phoneme HMMs in the dictionary storage unit 15 and obtains a score (likelihood). For example, when the vocabulary items are words, the recognition unit 16 concatenates the phoneme HMMs corresponding to the phoneme sequence that makes up each word to form a word HMM, and matches the speech pattern against each word HMM. After obtaining scores for all vocabulary items in this way, the recognition unit 16 outputs the item with the highest score as the recognition result.
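The word-level matching described above can be sketched in a heavily simplified form. The assumptions are substantial and are not taken from the embodiment: features are one-dimensional, each HMM state is a single Gaussian rather than a mixture, transition probabilities are ignored, and the words "wa" and "wi" and all parameter values are invented for the example.

```python
import math
from typing import Dict, List, Sequence, Tuple

State = Tuple[float, float]  # (mean, variance) of a one-dimensional Gaussian


def log_gauss(x: float, mean: float, var: float) -> float:
    """Log density of a one-dimensional normal distribution."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def viterbi_score(frames: Sequence[float], states: Sequence[State]) -> float:
    """Best-path log score through a left-to-right chain of states.

    The path starts in the first state, ends in the last state, and at each
    frame either stays in the current state or advances by one state.
    """
    if not frames or not states:
        raise ValueError("frames and states must be non-empty")
    neg_inf = float("-inf")
    score = [neg_inf] * len(states)
    score[0] = log_gauss(frames[0], *states[0])
    for x in frames[1:]:
        new = [neg_inf] * len(states)
        for s, (mean, var) in enumerate(states):
            best = score[s] if s == 0 else max(score[s], score[s - 1])
            new[s] = best + log_gauss(x, mean, var)
        score = new
    return score[-1]


def recognize(frames: Sequence[float],
              lexicon: Dict[str, List[str]],
              phoneme_hmms: Dict[str, List[State]]) -> Tuple[str, float]:
    """Score every word HMM and return the best word with its score."""
    best_word, best_score = "", float("-inf")
    for word, phonemes in lexicon.items():
        # Build the word HMM by concatenating the phoneme HMM states.
        word_states = [st for p in phonemes for st in phoneme_hmms[p]]
        s = viterbi_score(frames, word_states)
        if s > best_score:
            best_word, best_score = word, s
    return best_word, best_score


# Toy data: one state per phoneme, invented means and unit variances.
phoneme_hmms = {"W": [(0.0, 1.0)], "A": [(5.0, 1.0)], "I": [(10.0, 1.0)]}
lexicon = {"wa": ["W", "A"], "wi": ["W", "I"]}
word, score = recognize([0.1, -0.2, 5.1, 4.9], lexicon, phoneme_hmms)
```

With these toy values the frames lie close to the "W" mean and then the "A" mean, so `word` is "wa". A practical system would use multi-dimensional feature vectors, Gaussian mixtures per state, and transition probabilities.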

[0051] The functions of the units of the speech recognition device configured as in FIG. 1 are realized on a computer, for example the personal computer 70 shown in FIG. 7, which has a speech input function provided by a built-in microphone or a microphone input terminal. A recording medium, for example a floppy disk (FD) 71, stores a program that causes the personal computer 70 to function mainly as the speech analysis unit 12, the phoneme label sequence determination unit 13, the adaptation unit 14, the recognition unit 16, and the control unit 18. The floppy disk 71 is loaded into the personal computer 70, which reads and executes the program recorded on it.

[0052] The speech analysis conditions described in the above embodiment and the phoneme sequences shown in FIG. 5 are merely examples, and the invention is not limited to them. More generally, the present invention is not limited to the above embodiment and can be implemented with various modifications without departing from its gist.

[0053]

[Effects of the Invention] As described in detail above, because the present invention uses the maximum a posteriori probability estimation method, the effect of speaker adaptation is pronounced even when only a small amount of training data is available. Moreover, because it also uses competitive learning between the optimal phoneme sequence and the correct phoneme sequence, recognition performance does not level off as training data increases, unlike when the maximum a posteriori probability estimation method is used alone, and further improvement can be expected. As a result, the user does not have to utter a large amount of speech at once and can control when to utter speech data for speaker adaptation according to the situation, which reduces the user's burden of adaptation.

[Brief description of the drawings]

FIG. 1 is a block diagram showing the basic configuration of a speech recognition device according to an embodiment of the present invention.

FIG. 2 is a flowchart for explaining processing in the speaker adaptation mode in the embodiment.

FIG. 3 is a diagram showing part of an example of the correct phoneme sequence information and the optimal phoneme sequence information obtained by the phoneme label sequence determination unit 13 in FIG. 1.

FIG. 4 is a diagram showing another part of the example of the correct phoneme sequence information and the optimal phoneme sequence information obtained by the phoneme label sequence determination unit 13 in FIG. 1.

FIG. 5 is a diagram showing sections in which different phoneme labels are assigned in the correct phoneme sequence and the optimal phoneme sequence.

FIG. 6 is a diagram showing the effect of the speaker adaptation method applied in the embodiment in comparison with the case in which only the maximum a posteriori probability estimation method is used.

FIG. 7 is a diagram showing the external appearance of a personal computer that implements the speech recognition device of FIG. 1.

[Explanation of symbols]

11: speech input unit
12: speech analysis unit
13: phoneme label sequence determination unit (phoneme sequence information determining means)
14: adaptation unit (first adaptive learning means, second adaptive learning means)
15: dictionary storage unit
16: recognition unit
17: recognition vocabulary storage unit
18: control unit

Claims

1. A speech recognition device comprising:
speech input means for inputting uttered speech;
speech analysis means for analyzing the speech input by the speech input means to obtain a speech pattern representing its features;
dictionary storage means for storing a group of recognition dictionaries for matching, one for each phoneme;
recognition means for performing, in a recognition mode, recognition processing of the speech pattern obtained by the speech analysis means using the recognition dictionaries in the dictionary storage means;
phoneme sequence information determining means for, in a speaker adaptation mode and for a known phoneme sequence corresponding to input speech of a specific speaker, extracting correct phoneme sequence information, which includes information on the matching result, by matching the speech pattern obtained from the input speech by the speech analysis means against the corresponding recognition dictionaries in the dictionary storage means, and extracting optimal phoneme sequence information, which includes information on the matching result for the phoneme sequence giving the maximum likelihood, by matching the speech pattern against all recognition dictionaries in the dictionary storage means;
first adaptive learning means for training the corresponding recognition dictionaries in the dictionary storage means by a maximum a posteriori probability estimation method in accordance with the correct phoneme sequence information extracted by the phoneme sequence information determining means; and
second adaptive learning means for comparing the correct phoneme sequence information and the optimal phoneme sequence information extracted by the phoneme sequence information determining means to extract the portions in which they differ, and training the corresponding recognition dictionaries in the dictionary storage means, using the speech pattern obtained by the speech analysis means, in the direction that eliminates the difference.

2. A speech recognition device comprising:
speech input means for inputting uttered speech;
speech analysis means for analyzing the speech input by the speech input means to obtain a speech pattern representing its features;
dictionary storage means for storing a group of phoneme HMMs for matching, one for each phoneme;
recognition means for performing, in a recognition mode, recognition processing of the speech pattern obtained by the speech analysis means using the phoneme HMMs in the dictionary storage means;
phoneme sequence information determining means for, in a speaker adaptation mode and for a known phoneme sequence corresponding to input speech of a specific speaker, extracting correct phoneme sequence information, which includes a phoneme label sequence, by matching the speech pattern obtained from the input speech by the speech analysis means against the corresponding phoneme HMMs in the dictionary storage means, and extracting optimal phoneme sequence information, which includes a phoneme label sequence for the phoneme sequence giving the maximum likelihood, by matching the speech pattern against all phoneme HMMs in the dictionary storage means;
first adaptive learning means for training the parameters of the corresponding phoneme HMMs in the dictionary storage means by a maximum a posteriori probability estimation method in accordance with the correct phoneme sequence information extracted by the phoneme sequence information determining means; and
second adaptive learning means for comparing the phoneme label sequence in the correct phoneme sequence information extracted by the phoneme sequence information determining means with the phoneme label sequence in the optimal phoneme sequence information, extracting, from the speech patterns obtained by the speech analysis means, a speech pattern to which a phoneme label different from the correct phoneme label is assigned, and using that speech pattern to train the parameters of the corresponding phoneme HMM in the dictionary storage means in the direction that eliminates the difference between the phoneme labels.

3. A speech recognition device comprising:
speech input means for inputting uttered speech;
speech analysis means for analyzing the speech input by the speech input means to obtain a speech pattern representing its features;
dictionary storage means for storing a group of phoneme HMMs for matching, one for each phoneme, each including normal distribution parameters consisting of a mean vector and a variance;
recognition means for performing, in a recognition mode, recognition processing of the speech pattern obtained by the speech analysis means using the phoneme HMMs in the dictionary storage means;
phoneme sequence information determining means for, in a speaker adaptation mode and for a known phoneme sequence corresponding to input speech of a specific speaker, extracting correct phoneme sequence information, which includes a phoneme label sequence, by matching the speech pattern obtained from the input speech by the speech analysis means against the corresponding phoneme HMMs in the dictionary storage means, and extracting optimal phoneme sequence information, which includes a phoneme label sequence for the phoneme sequence giving the maximum likelihood, by matching the speech pattern against all phoneme HMMs in the dictionary storage means;
first adaptive learning means for training the mean vectors and variances of the corresponding phoneme HMMs in the dictionary storage means by a maximum a posteriori probability estimation method in accordance with the correct phoneme sequence information extracted by the phoneme sequence information determining means; and
second adaptive learning means for comparing the phoneme label sequence in the correct phoneme sequence information extracted by the phoneme sequence information determining means with the phoneme label sequence in the optimal phoneme sequence information, extracting, from the speech patterns obtained by the speech analysis means, a speech pattern to which a phoneme label different from the correct phoneme label is assigned, and subtracting that speech pattern from the mean vector of the phoneme HMM in the dictionary storage means that corresponds to that phoneme label.

4. A speaker adaptation method for training a group of recognition dictionaries, one for each phoneme, that are stored in dictionary storage means and used for recognition processing of a speech pattern obtained by analyzing input speech and representing its features, so that the recognition dictionaries are adapted to a specific speaker, the method comprising:
a first step of, for a known phoneme sequence corresponding to input speech of the specific speaker, extracting correct phoneme sequence information, which includes information on the matching result, by matching the speech pattern of the input speech against the corresponding recognition dictionaries in the dictionary storage means, and extracting optimal phoneme sequence information, which includes information on the matching result for the phoneme sequence giving the maximum likelihood, by matching the speech pattern against all recognition dictionaries in the dictionary storage means;
a second step of training the corresponding recognition dictionaries in the dictionary storage means by a maximum a posteriori probability estimation method in accordance with the correct phoneme sequence information extracted in the first step; and
a third step of comparing the correct phoneme sequence information and the optimal phoneme sequence information extracted in the first step to extract the portions in which they differ, and training the corresponding recognition dictionaries in the dictionary storage means, using the speech pattern, in the direction that eliminates the difference.