JPH10274994A

JPH10274994A - Similar word identification method and device

Info

Publication number: JPH10274994A
Application number: JP27024997A
Authority: JP
Inventors: Yasunaga Miyazawa; 康永宮沢; Sunao Aizawa; 直相澤; Mitsuhiro Inazumi; 満広稲積; Hiroo Hasegawa; 浩男長谷川
Original assignee: Seiko Epson Corp
Current assignee: Seiko Epson Corp
Priority date: 1997-01-30
Filing date: 1997-10-02
Publication date: 1998-10-13

Abstract

(57) [Abstract]
[Problem] To identify, with high accuracy, similar words that are prone to misrecognition when word recognition is performed using a DRNN word model.
[Solution] In one example of the method, when the speech of a word is input, a word detection signal output unit 4 uses a DRNN word model to produce a DRNN output corresponding to the input word speech data, and the input word speech data is converted into code data using a codebook 7. When the word detection signal output unit 4 produces a DRNN output indicating at least a certain level of certainty, a recognition processing unit 9 sets, on that DRNN output, a predetermined section containing the characteristic portion of the input word, examines the code data within that section, and, based on the result, distinguishes the input word from words similar to it.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to a similar word identification method and a similar word identification device for distinguishing words with similar pronunciations. In particular, it relates to such a method and device for speech recognition that uses a DRNN (Dynamic Recurrent Neural Networks) word model, one of the speech recognition techniques for unspecified speakers.

[0002]

[Prior Art] One speech recognition technique for unspecified speakers uses a DRNN word model (the present applicant has already filed applications on this DRNN-based speech recognition technique, including JP-A-6-4079 and JP-A-6-119476).

[0003] In this DRNN word model, the weights between units and the biases are determined according to a predetermined learning rule so that, when the feature vector sequence of a word is input as time-series data, an appropriate output for that word is obtained. As a result, for the speech data of a word uttered by an unspecified speaker, the model produces an output close to the teacher output for that word.

[0004] For example, when the time-series data of the feature vector sequence of the word "ohayou" ("good morning") uttered by an unspecified speaker is input, the data for each dimension of the feature vector at each time is given to the corresponding input unit and transformed by the weights and biases set according to the learning rule, so as to obtain an output close to the ideal output (teacher output) for that word. This processing is carried out in time sequence, time step by time step, over the entire feature vector sequence of the one word input as time-series data. In this way, for the speech data of a word uttered by an unspecified speaker, an output close to the teacher output for that word is obtained.

[0005] The learning rule that adjusts the weights so that each DRNN speech model, prepared for each word to be recognized, produces an appropriate output for its word is described on pages 17 to 24 of IEICE Technical Report SP92-125 (1993-01), published by the Institute of Electronics, Information and Communication Engineers.

[0006] Speech recognition using DRNN word models trained in advance on a number of words in this way is briefly described below with reference to FIG. 7.

[0007] This DRNN-based speech recognition technique takes words registered in advance as recognition targets (here, "ohayou", "tenki" ("weather"), and so on) as keywords within continuous speech such as "ohayou, ii otenki da ne" ("good morning, nice weather, isn't it"). It obtains values indicating where in the input speech, and with what degree of certainty, each keyword is present, and uses those values to understand the continuous speech.

[0008] For example, suppose that when the speaker utters "ohayou, ii otenki da ne", a speech signal such as that shown in FIG. 7(a) is output. For this speech signal, an output such as that in FIG. 7(b) is obtained at the "ohayou" portion of the signal, and an output such as that in FIG. 7(c) is obtained at the "tenki" portion. In FIGS. 7(b) and 7(c), values such as 0.9 and 0.8 indicate certainty (degree of similarity); a high value such as 0.9 or 0.8 means that the word is present in the input speech with high certainty. That is, as shown in FIG. 7(b), the registered word "ohayou" is present in portion w1 on the time axis of the input speech signal with a certainty of 0.9, and, as shown in FIG. 7(c), the registered word "tenki" is present in portion w2 on the time axis with a certainty of 0.8.
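The keyword-spotting behaviour described for FIG. 7 can be sketched roughly as follows. This is only an illustration under stated assumptions, not the implementation in the patent: the function name `detect_word`, the threshold of 0.7, and the sample output traces are all hypothetical, and a real DRNN output would come from the trained word model rather than from hard-coded lists.

```python
from typing import List, Tuple


def detect_word(outputs: List[float], threshold: float = 0.7) -> List[Tuple[int, int, float]]:
    """Return (start_frame, end_frame, peak) for each run of frames whose
    output is at or above the threshold (threshold value is an assumption)."""
    detections = []
    start = None
    peak = 0.0
    for t, y in enumerate(outputs):
        if y >= threshold:
            if start is None:
                start, peak = t, y      # a new run begins
            else:
                peak = max(peak, y)     # track the highest certainty in the run
        elif start is not None:
            detections.append((start, t - 1, peak))
            start = None
    if start is not None:               # run continues to the last frame
        detections.append((start, len(outputs) - 1, peak))
    return detections


# Hypothetical per-frame outputs of the "ohayou" and "tenki" word models.
ohayou_trace = [0.0, 0.1, 0.5, 0.9, 0.8, 0.3, 0.0, 0.0]
tenki_trace = [0.0, 0.0, 0.0, 0.0, 0.1, 0.4, 0.8, 0.2]

print(detect_word(ohayou_trace))  # the run corresponds to a section like w1
print(detect_word(tenki_trace))   # the run corresponds to a section like w2
```

Each returned tuple gives a candidate position of the keyword on the time axis together with its peak certainty, which is the kind of information the text attributes to portions w1 and w2.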

[0009] In this way, a DRNN word model is created for each word to be recognized, and input speech is recognized using these DRNN word models.

[0010] When a DRNN word model is created for a given word, training is performed on utterances in which two words, the recognition target word and some other word, are spoken in succession.

[0011] For example, as shown in FIG. 8, for a continuous speech signal of two words (the recognition target word is word 1 and the other is word 2), the model is trained so that the output rises for the speech data of word 1 and falls for the speech data of word 2 that follows. Although not illustrated, the order is also reversed: the model is trained so that the output does not rise for the speech data of word 2 and rises for the speech data of word 1 that follows.
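As a rough sketch of the two-word training targets described above, the teacher output for a word pair could be laid out as below. The rectangular 1.0/0.0 shape, the function name `teacher_output`, and the frame counts are assumptions made for illustration; the text only states that the output is raised for the target word and not for the other word.

```python
from typing import List


def teacher_output(len_first: int, len_second: int, target_is_first: bool,
                   high: float = 1.0, low: float = 0.0) -> List[float]:
    """Build a per-frame teacher signal for two words spoken in succession.

    The signal is `high` over the frames of the recognition target word and
    `low` over the frames of the other word (rectangular shape is assumed).
    """
    if target_is_first:
        return [high] * len_first + [low] * len_second
    return [low] * len_first + [high] * len_second


# Word 1 (target, 3 frames) followed by word 2 (2 frames), as in FIG. 8.
print(teacher_output(3, 2, target_is_first=True))
# Reversed order: word 2 (3 frames) followed by word 1 (target, 2 frames).
print(teacher_output(3, 2, target_is_first=False))
```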

[0012]

[Problems to be Solved by the Invention] DRNN word models are trained in this way. The problem in speech recognition using DRNN word models trained and created in this way is that, when a word resembling a recognition target word (a similar word) is uttered, a DRNN output with at least a certain level of certainty appears even though the word is not the recognition target word.

[0013] This is because, although DRNN word models are trained on the speech data of two words given in succession as described above, words with similar pronunciations are, as a rule, not paired with each other in training. The reason is as follows. Take "nanji" ("what time") and "nando" ("how many times") as words with similar pronunciations (called similar words). To create a speech model for "nanji", one would give the speech data of "nanji" and "nando" in succession and train the model to raise its output for "nanji" but not for the similar-sounding "nando". Such training is contradictory at the portion "nan", which is the same phoneme sequence in both words.

[0014] Consequently, a DRNN word model trained with "nanji" as the recognition target word often produces, when the speaker utters "nando", a DRNN output equivalent to that for "nanji", in which case the utterance is recognized as "nanji".

[0015] Users may also request that, in addition to a word such as "nanji" that has been trained in advance and registered as a recognition target word, the word "nando" be made recognizable as well. When users ask for similar words to be recognized reliably in this way, it must be possible to accommodate the request with simple processing.

[0016] Accordingly, an object of the present invention is to enable simple and highly accurate identification of similar words, so that similar words can be recognized reliably, by making use of existing speech models without changing the current training method for DRNN speech models.

[0017]

[Means for Solving the Problems] To achieve the above object, the invention of claim 1, among the similar word identification methods of the present invention, is a similar word identification method that has a speech model trained to produce a predetermined output representing certainty in response to the speech data of a given word, and that, when the output for an input word is obtained using this speech model and recognition is performed based on that output, distinguishes the input word from similar words with which it may be confused. The method is characterized in that, when the speech of a word is input and the speech model responding to the speech data of that input word produces an output indicating at least a certain level of certainty, a predetermined section containing the characteristic portion of the input word is set on that output, the features of the speech data of the input word within that section are examined, and, based on the result, the input word is distinguished from words similar to it.

[0018] The invention of claim 2 is a similar word identification method that has a DRNN speech model trained to produce a predetermined output representing certainty in response to the speech data of a given word, and that, when the DRNN output for an input word is obtained using this DRNN speech model and recognition is performed based on that output, distinguishes the input word from similar words with which it may be confused. The method is characterized in that, when the speech of a word is input, a DRNN output corresponding to the input word speech data is produced using the DRNN speech model and the input word speech data is converted into code data using a codebook; and, when a DRNN output indicating at least a certain level of certainty is produced for the input word, a predetermined section containing the characteristic portion of the input word is set on that DRNN output, the code data within the set section is examined, and, based on the result, the input word is distinguished from words similar to it.

[0019] The invention of claim 3 is the invention of claim 2 in which, of the code data within the set predetermined section, the code data corresponding to vowels is examined, and the input word is distinguished from words similar to it according to which vowel is present.

[0020] The invention of claim 4 is the invention of claim 2 or 3 in which the codebook is a codebook generated from the five vowels.

[0021] The invention of claim 5 is the invention of claim 2, 3, or 4 in which the DRNN speech model is associated with a similar word group, similar words being grouped by type, and is trained so that, within each similar word group, a DRNN output indicating at least a certain level of certainty is obtained for every word in the group.

[0022] Thus, as one method of identifying similar words, the present invention vector-quantizes the input speech data using a codebook to obtain code data, and identifies the input word by examining which vowels are present in a predetermined section of the DRNN output. Similar words that cannot be distinguished from the DRNN output alone can thereby be identified with high accuracy using existing DRNN models as they are, without changing the training method for DRNN speech models. Furthermore, using a codebook generated from the five vowels greatly simplifies the processing.
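The vowel-codebook approach can be illustrated with a small sketch. Everything concrete here is an assumption made for the example: the two-dimensional toy centroids in `VOWEL_CODEBOOK`, the squared-Euclidean nearest-centroid rule, the majority vote over the section, and the `VOWEL_TO_WORD` mapping (which relies on "nanji" ending in the vowel i and "nando" ending in o). Real feature vectors would have more dimensions, and the section boundaries would be derived from the DRNN output.

```python
from collections import Counter
from typing import Dict, Optional, Sequence, Tuple

Vector = Tuple[float, float]

# Toy 2-D centroids for the five vowels (illustrative values only).
VOWEL_CODEBOOK: Dict[str, Vector] = {
    "a": (1.0, 0.0),
    "i": (0.0, 1.0),
    "u": (-1.0, 0.0),
    "e": (0.7, 0.7),
    "o": (-0.7, -0.7),
}

# Hypothetical mapping from the vowel found in the section to the word.
VOWEL_TO_WORD = {"i": "nanji", "o": "nando"}


def quantize(frame: Vector, codebook: Dict[str, Vector]) -> str:
    """Return the label of the nearest centroid (squared Euclidean distance)."""
    return min(
        codebook,
        key=lambda v: sum((x - c) ** 2 for x, c in zip(frame, codebook[v])),
    )


def dominant_vowel(frames: Sequence[Vector], start: int, end: int,
                   codebook: Dict[str, Vector] = VOWEL_CODEBOOK) -> str:
    """Vector-quantize frames[start..end] (inclusive, assumed non-empty)
    and return the most frequent vowel code in that section."""
    codes = [quantize(f, codebook) for f in frames[start:end + 1]]
    return Counter(codes).most_common(1)[0][0]


def identify(frames: Sequence[Vector], start: int, end: int) -> Optional[str]:
    """Map the dominant vowel of the section to a word, or None if unmapped."""
    return VOWEL_TO_WORD.get(dominant_vowel(frames, start, end))
```

With only five codes to compare against, each frame needs just five distance computations, which is one way to read the statement that a five-vowel codebook simplifies the processing.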

[0023] The invention of claim 6 is a similar word identification method that has a DRNN speech model trained to produce a predetermined output representing certainty in response to the speech data of a given word, and that, when the DRNN output for an input word is obtained using this DRNN speech model and recognition is performed based on that output, distinguishes the input word from similar words with which it may be confused. The method is characterized in that, for each of the similar words that may be confused, speech data obtained from multiple speakers uttering the word is coded using a predetermined codebook, the resulting code data is used to generate, for each word, a code histogram over a predetermined section containing the characteristic portion of that word, and the histogram data for each word is stored as standard histogram data; when the speech of a word is input, a DRNN output corresponding to the input word speech data is produced using the DRNN speech model and the input word speech data is converted into code data using the predetermined codebook; and, when a DRNN output indicating at least a certain level of certainty is produced for the input word, a predetermined section containing the characteristic portion of the input word is set on that DRNN output, a code histogram within that section is generated from the code data, and this histogram data is compared with the standard histogram data to distinguish the input word from words similar to it.

[0024] The invention of claim 7 is the invention of claim 6 in which the two histograms are each normalized, the difference between them is taken, and the input word is distinguished from words similar to it according to the magnitude of that difference.

[0025] The method of identifying similar words from histogram data created from the input word data and the standard histogram data likewise makes it possible to identify, with high accuracy, similar words that cannot be distinguished from the DRNN output alone, using existing DRNN models as they are and without changing the training method for DRNN speech models. Moreover, because this invention compares the frequency distribution of the code data in the predetermined section containing the characteristic portion of the word with the frequency distribution for standard speakers, it enables even more accurate identification, so that even similar words that are extremely prone to misrecognition can be identified with high accuracy.
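The histogram comparison can be sketched as follows. The sketch assumes integer code indices, normalization by dividing each count by the number of frames in the section, and the sum of absolute bin differences as the measure of "the magnitude of the difference"; the text does not specify these details, and the function names and sample data are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Sequence


def normalized_histogram(codes: Sequence[int], codebook_size: int) -> List[float]:
    """Count how often each code occurs in the section and scale to sum to 1."""
    if not codes:
        raise ValueError("section contains no code data")
    counts = Counter(codes)
    return [counts[c] / len(codes) for c in range(codebook_size)]


def histogram_difference(h1: Sequence[float], h2: Sequence[float]) -> float:
    """Sum of absolute bin-wise differences (an assumed difference measure)."""
    return sum(abs(a - b) for a, b in zip(h1, h2))


def identify_by_histogram(codes: Sequence[int],
                          standards: Dict[str, List[float]],
                          codebook_size: int) -> str:
    """Return the word whose standard histogram is closest to the input's."""
    hist = normalized_histogram(codes, codebook_size)
    return min(standards, key=lambda word: histogram_difference(hist, standards[word]))


# Hypothetical standard histograms over a 4-entry codebook.
standards = {
    "nanji": [0.0, 0.75, 0.25, 0.0],
    "nando": [0.25, 0.0, 0.0, 0.75],
}
print(identify_by_histogram([1, 1, 2, 1], standards, codebook_size=4))
```

Normalizing first makes sections of different lengths comparable, which matters because speakers utter the same word at different speeds.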

[0026] The invention of claim 8 is the invention of claim 6 or 7 in which the DRNN speech model is associated with a similar word group, similar words being grouped by type, and is trained so that, within each similar word group, a DRNN output indicating at least a certain level of certainty is obtained for every word in the group.

[0027] Further, the invention of claim 9 is a similar word identification method that has a DRNN speech model trained to produce a predetermined output representing certainty in response to the speech data of a given word, and that, when the DRNN output for an input word is obtained using this DRNN speech model and recognition is performed based on that output, distinguishes the input word from similar words with which it may be confused. The method is characterized in that DRNN sub-speech models are created in advance, each trained so that a DRNN output indicating at least a certain level of certainty is obtained for the characteristic portion of one of the similar words that may be confused; when the speech of a word is input and a DRNN output indicating at least a certain level of certainty is produced for the input word, a predetermined section containing the characteristic portion of the word is set on that DRNN output, the state of the DRNN outputs of the DRNN sub-speech models within that section is examined, and, based on the result, the input word is distinguished from words similar to it.

[0028] The invention of claim 10 is the invention of claim 9 in which the input word is distinguished from words similar to it according to which DRNN sub-speech model produces a DRNN output whose value indicates at least a certain level of certainty.

[0029] The method of identifying similar words using DRNN sub-speech models likewise allows similar words that cannot be distinguished from the DRNN output alone to be identified reliably without changing the existing training method for DRNN speech models. Moreover, because this invention makes its judgment from the DRNN outputs of DRNN speech models that cover only the characteristic portions of the similar words, it enables even more accurate identification, so that even similar words that are extremely prone to misrecognition can be identified with high accuracy.
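The sub-model decision can be sketched as a comparison of sub-model outputs inside the predetermined section. The function name, the 0.7 threshold, the rule of picking the highest peak when more than one sub-model exceeds the threshold, and the sample traces are assumptions for illustration; the text states only that the decision depends on which sub-model output reaches a certain level of certainty.

```python
from typing import Dict, Optional, Sequence


def identify_by_submodels(sub_outputs: Dict[str, Sequence[float]],
                          start: int, end: int,
                          threshold: float = 0.7) -> Optional[str]:
    """Return the word whose sub-model output peaks highest at or above the
    threshold within frames start..end (inclusive), or None if none does."""
    best_word = None
    best_peak = 0.0
    for word, trace in sub_outputs.items():
        peak = max(trace[start:end + 1], default=0.0)
        if peak >= threshold and peak > best_peak:
            best_word, best_peak = word, peak
    return best_word


# Hypothetical outputs of sub-models trained on the distinguishing portions
# of "nanji" and "nando".
sub_outputs = {
    "nanji": [0.0, 0.1, 0.2, 0.9, 0.8],
    "nando": [0.0, 0.1, 0.3, 0.2, 0.1],
}
print(identify_by_submodels(sub_outputs, start=2, end=4))
```

Returning `None` when no sub-model reaches the threshold leaves room for the caller to reject the utterance rather than force a choice between the similar words.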

[0030] The invention of claim 11 is the invention of claim 9 or 10 in which the DRNN speech model is associated with a similar word group, similar words being grouped by type, and is trained so that, within each similar word group, a DRNN output indicating at least a certain level of certainty is obtained for every word in the group.

[0031] Three broad methods have been described above. All of them involve simple processing and can reliably identify similar words without changing the existing training method for DRNN speech models. In addition, by associating the DRNN speech model with a similar word group, similar words being grouped by type, and training it so that, within each similar word group, a DRNN output indicating at least a certain level of certainty is obtained for every word in the group, there is no need to create a speech model for each individual similar word when the similar words are each made recognition target words, which is also advantageous in terms of cost. Among the similar word identification devices of the present invention, the invention of claim 12 is a similar word identification device that has a speech model trained to produce a predetermined output representing certainty in response to the speech data of a given word, and that, when the output for an input word is obtained using this speech model and recognition is performed based on that output, distinguishes the input word from similar words with which it may be confused. The device is characterized by comprising: word detection signal output means that, when the speech of a word is input, produces an output indicating at least a certain level of certainty by means of the speech model responding to the speech data of that input word; and recognition processing means that, when the word detection signal output means produces an output indicating at least a certain level of certainty, sets on that output a predetermined section containing the characteristic portion of the input word, examines the features of the speech data of the input word within that section, and, based on the result, distinguishes the input word from words similar to it.

[0032] The invention of claim 13 is a similar word identification device that has a DRNN speech model trained to produce a predetermined output representing certainty in response to the speech data of a given word, and that, when the DRNN output for an input word is obtained using this DRNN speech model and recognition is performed based on that output, distinguishes the input word from similar words with which it may be confused. The device is characterized by comprising: word detection signal output means that, when the speech of a word is input, produces a DRNN output corresponding to the input word speech data using the DRNN speech model; coding means that converts the input word speech data into code data using a codebook; and recognition processing means that, when the word detection signal output means produces a DRNN output indicating at least a certain level of certainty, sets on that DRNN output a predetermined section containing the characteristic portion of the input word, examines the code data produced by the coding means within the set section, and, based on the result, distinguishes the input word from words similar to it.

[0033] The invention of claim 14 is the invention of claim 13 in which, of the code data within the set predetermined section, the code data corresponding to vowels is examined, and the input word is distinguished from words similar to it according to which vowel is present.

[0034] The invention of claim 15 is the invention of claim 13 or 14 in which the codebook is a codebook generated from the five vowels.

[0035] The invention of claim 16 is the invention of claim 13, 14, or 15 in which the DRNN speech model is associated with a similar word group, similar words being grouped by type, and is trained so that, within each similar word group, a DRNN output indicating at least a certain level of certainty is obtained for every word in the group.

[0036] Thus, the similar word identification device of the present invention vector-quantizes the input speech data using a codebook to obtain code data, and identifies the input word by examining which vowels are present in a predetermined section of the DRNN output. Similar words that cannot be distinguished from the DRNN output alone can thereby be identified with high accuracy using existing DRNN models as they are, without changing the training method for DRNN speech models. Furthermore, using a codebook generated from the five vowels greatly simplifies the processing.

[0037] The invention of claim 17 is a similar word identification device that has a DRNN speech model trained to produce a predetermined output representing certainty in response to the speech data of a given word, and that, when the DRNN output for an input word is obtained using this DRNN speech model and recognition is performed based on that output, distinguishes the input word from similar words with which it may be confused. The device is characterized by comprising: standard histogram data storage means that stores, as standard histogram data, code histograms created, for each of the similar words that may be confused, from the code data within a predetermined section containing the characteristic portion of the word, taken from code data obtained by coding speech data uttered by multiple speakers; word detection signal output means that, when the speech of a word is input, produces a DRNN output corresponding to the input word speech data using the DRNN speech model; coding means that converts the input word speech data into code data using a predetermined codebook; and recognition processing means that, when the word detection signal output means produces a DRNN output indicating at least a certain level of certainty, sets on that DRNN output a predetermined section containing the characteristic portion of the input word, generates a code histogram within that section from the code data produced by the coding means, and compares this histogram data with the standard histogram data to distinguish the input word from words similar to it.

【００３８】また、請求項１８の発明は、請求項１７に
おいて、それぞれのヒストグラムを正規化したのち、両
者の差分をとり、その差分の大きさから入力単語とその
入力単語に類似する単語との識別を行うことを特徴とし
ている。The invention of claim 18 is characterized in that, in claim 17, each histogram is normalized, the difference between the two is taken, and the input word is distinguished from words similar to it on the basis of the magnitude of that difference.

【００３９】このように、入力単語データを基に作成さ
れたヒストグラムデータと標準ヒストグラムデータから
類似単語を識別する類似単語識別装置は、前記同様、Ｄ
ＲＮＮ出力だけでは識別できない類似単語に対して、Ｄ
ＲＮＮ音声モデルの学習方法を変えることなく、既存の
ＤＲＮＮモデルをそのまま用いて、高精度に類似単語の
識別が可能となる。さらに、この発明は、単語の特徴部
分を含む所定区間のコードデータの度数分布を標準話者
の度数分布と比較しているので、より一層、高精度な識
別が可能となり、きわめて誤認識されやすい類似単語に
ついても高精度に識別可能となる。Thus, as with the method described above, the similar word identification device that identifies similar words from histogram data created from the input word data and from standard histogram data can identify, with high accuracy, similar words that cannot be distinguished by the DRNN output alone, using the existing DRNN model as it is and without changing how the DRNN speech model is trained. Furthermore, because this invention compares the frequency distribution of the code data in a predetermined section containing the characteristic portion of the word with the frequency distribution of standard speakers, identification becomes still more accurate, and even similar words that are very easily misrecognized can be identified with high accuracy.

【００４０】請求項１９の発明は、請求項１７または１
８の発明において、前記ＤＲＮＮ音声モデルを、類似単
語の種類毎にグループ分けされた類似単語グループに対
応させ、それぞれの類似単語グループにおいて、そのグ
ループ内の単語すべてに対して一定以上の確からしさを
表すＤＲＮＮ出力が得られるように学習された音声モデ
ルとしたことを特徴としている。The invention of claim 19 is characterized in that, in the invention of claim 17 or 18, the DRNN speech model is associated with similar word groups formed by grouping similar words by type, and is a speech model trained so that, in each similar word group, a DRNN output representing at least a certain level of certainty is obtained for every word in the group.

【００４１】また、請求項２０の発明は、ある単語の音
声データに反応して確からしさを表す所定の出力が得ら
れるように学習されたＤＲＮＮ音声モデルを有し、この
ＤＲＮＮ音声モデルを用いて入力単語に対するＤＲＮＮ
出力を取り出してその出力に基づいて認識処理する際、
誤認識される可能性のある類似単語との識別を行う類似
単語識別装置において、誤認識される可能性のある類似
単語それぞれの特徴部分に対し、一定以上の確からしさ
を表すＤＲＮＮ出力が得られるように学習されたＤＲＮ
Ｎサブ音声モデルを記憶するＤＲＮＮサブ音声モデル記
憶手段と、ある単語の音声が入力されたとき、前記ＤＲ
ＮＮ音声モデルを用いてその入力単語データに対応した
ＤＲＮＮ出力を出すとともに、前記ＤＲＮＮサブ音声モ
デルを用いて前記入力単語の特徴部分に対応したＤＲＮ
Ｎ出力を出す単語検出信号出力手段と、この単語検出信
号出力手段から前記ＤＲＮＮ音声モデルを用いて一定以
上の確からしさを表すＤＲＮＮ出力が出された場合、そ
のＤＲＮＮ出力にその単語の特徴部分を含む所定区間を
設定し、その所定区間内において前記入力単語に対する
前記ＤＲＮＮサブ音声モデルによるＤＲＮＮ出力状態を
調べ、その結果に基づいて入力単語とその入力単語に類
似する単語との識別を行う認識処理部とを有することを
特徴としている。The invention of claim 20 is a similar word identification device that has a DRNN speech model trained to produce a predetermined output representing certainty in response to the speech data of a given word, and that, when extracting the DRNN output for an input word with this model and performing recognition based on that output, distinguishes the input word from similar words with which it may be confused. The device comprises: DRNN sub-speech model storage means that stores DRNN sub-speech models trained so that a DRNN output representing at least a certain level of certainty is obtained for the characteristic portion of each confusable similar word; word detection signal output means that, when the speech of a word is input, produces the DRNN output corresponding to the input word data using the DRNN speech model and also produces the DRNN output corresponding to the characteristic portion of the input word using the DRNN sub-speech models; and a recognition processing unit that, when the word detection signal output means produces, using the DRNN speech model, a DRNN output representing at least a certain level of certainty, sets in that DRNN output a predetermined section containing the characteristic portion of the word, examines the state of the DRNN outputs of the DRNN sub-speech models for the input word within that section, and distinguishes the input word from words similar to it on the basis of the result.

【００４２】請求項２１の発明は、請求項２０の発明に
おいて、どのＤＲＮＮサブ音声モデルによるＤＲＮＮ出
力が一定以上の確からしさを表す値となっているかによ
り入力単語とその入力単語に類似する単語との識別を行
うことを特徴としている。The invention of claim 21 is characterized in that, in the invention of claim 20, the input word is distinguished from words similar to it according to which DRNN sub-speech model produces a DRNN output whose value represents at least a certain level of certainty.
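
The decision rule of claim 21 can be illustrated with a minimal Python sketch. It assumes that each DRNN sub-speech model yields one output value per frame, that the section is given as inclusive frame indices, and that when more than one sub-model reaches the threshold the one with the higher peak wins; the patent states none of these details, and the names `classify_by_submodels`, `sub_outputs`, and `threshold` are hypothetical.

```python
from typing import Dict, List, Optional


def classify_by_submodels(
    sub_outputs: Dict[str, List[float]],
    t1_start: int,
    t1_end: int,
    threshold: float,
) -> Optional[str]:
    """Pick the word whose sub-model output reaches the threshold inside t1.

    sub_outputs maps a word label to the per-frame output of the sub-model
    trained on that word's characteristic portion (assumed representation).
    The section [t1_start, t1_end] is inclusive and must be non-empty.
    """
    # Peak output of every sub-model inside the section t1.
    peaks = {
        word: max(frames[t1_start:t1_end + 1])
        for word, frames in sub_outputs.items()
    }
    # Keep only the sub-models that reach the certainty threshold.
    candidates = {word: peak for word, peak in peaks.items() if peak >= threshold}
    if not candidates:
        return None  # no sub-model responded; leave the decision open
    # Assumed tie-break: the sub-model with the highest peak wins.
    return max(candidates, key=candidates.get)
```

Returning `None` when no sub-model responds leaves room for a caller to reject the detection, which is a design choice of this sketch rather than something the claim prescribes.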

【００４３】このようにＤＲＮＮサブ音声モデルを用い
て類似単語を識別する類似単語識別装置によっても、前
記同様、ＤＲＮＮ出力だけでは識別できない類似単語に
対して、ＤＲＮＮ音声モデルの学習方法を変えることな
く、既存のＤＲＮＮモデルをそのまま用いて、高精度に
類似単語の識別が可能となる。さらに、この発明は、類
似単語同士の特徴部分のみのＤＲＮＮ音声モデルを用い
てそのＤＲＮＮ出力から判断するので、より一層、高精
度な識別が可能となり、きわめて誤認識されやすい類似
単語についても高精度に識別可能となる。The similar word identification device that identifies similar words using DRNN sub-speech models in this way can likewise identify, with high accuracy, similar words that cannot be distinguished by the DRNN output alone, using the existing DRNN model as it is and without changing how the DRNN speech model is trained. Furthermore, because this invention judges from the DRNN outputs of DRNN speech models that cover only the characteristic portions of the similar words, identification becomes still more accurate, and even similar words that are very easily misrecognized can be identified with high accuracy.

【００４４】また、請求項２２の発明は、請求項２０ま
たは２１の発明において、前記ＤＲＮＮ音声モデルは、
類似単語の種類毎にグループ分けされた類似単語グルー
プに対応させ、それぞれの類似単語グループにおいて、
そのグループ内の単語すべてに対して一定以上の確から
しさを表すＤＲＮＮ出力が得られるように学習された音
声モデルであることを特徴としている。The invention of claim 22 is characterized in that, in the invention of claim 20 or 21, the DRNN speech model is associated with similar word groups formed by grouping similar words by type, and is a speech model trained so that, in each similar word group, a DRNN output representing at least a certain level of certainty is obtained for every word in the group.

【００４５】以上のように、ここでは大きく分けて３つ
の類似単語識別装置について述べたが、これらのどの類
似単語識別装置においても、装置構成が大幅に複雑化す
ることはなく、また、既存のＤＲＮＮ音声モデルの学習
方法を変更せずに、簡単な処理で類似単語を確実に識別
することができる。また、前記ＤＲＮＮ音声モデルは、
類似単語の種類毎にグループ分けされた類似単語グルー
プに対応させ、それぞれの類似単語グループにおいて、
そのグループ内の単語すべてに対して一定以上の確から
しさを表すＤＲＮＮ出力が得られるように学習された音
声モデルとすることにより、類似単語を認識対象単語と
する際、類似単語１つ１つに対して音声モデルを作成す
る必要がなくなり、コスト的にも優れたものとなる。As described above, three broad types of similar word identification device have been presented. In none of them does the device configuration become significantly more complex, and each can reliably identify similar words by simple processing without changing the existing training method for the DRNN speech model. In addition, by associating the DRNN speech model with similar word groups formed by grouping similar words by type, and training it so that, in each group, a DRNN output representing at least a certain level of certainty is obtained for every word in the group, there is no need to create a speech model for each individual similar word when similar words are made recognition targets, which is also advantageous in terms of cost.

【００４６】[0046]

【発明の実施の形態】以下、本発明の実施の形態を図面
を参照して説明する。なお、以下に説明する実施の形態
では、誤認識されやすい類似単語語として「なんじ（何
時）」と「なんど（何度）」を用い、これらそれぞれの
単語を認識処理する例について説明する。Embodiments of the present invention will be described below with reference to the drawings. In the embodiments described below, “Nanji” (何時, “what time”) and “Nando” (何度, “how many degrees”) are used as similar words that are easily misrecognized, and an example of recognizing each of these words is described.

【００４７】（第１の実施の形態）図１は第１の実施の
形態を実現するための単語識別装置を示すブロック図で
あり、音声入力部としてのマイクロホン１、Ａ／Ｄ変換
部２、音声分析部３、単語検出信号出力部４、ＤＲＮＮ
出力情報記憶部５、ベクトル量子化部６、コードブック
７、コードデータ記憶部８、認識処理部９などから構成
されている。(First Embodiment) FIG. 1 is a block diagram of a word identification device for realizing the first embodiment. It comprises a microphone 1 serving as the speech input unit, an A/D conversion unit 2, a speech analysis unit 3, a word detection signal output unit 4, a DRNN output information storage unit 5, a vector quantization unit 6, a codebook 7, a code data storage unit 8, a recognition processing unit 9, and so on.

【００４８】前記マイクロホン１から入力された音声は
Ａ／Ｄ変換部２でＡ／Ｄ変換されたのち、音声分析部３
でたとえば１０次元のＬＰＣケプストラム係数で表され
る特徴ベクトル列に変換される。The speech input from the microphone 1 is A/D-converted by the A/D conversion unit 2 and then converted by the speech analysis unit 3 into a feature vector sequence represented by, for example, 10-dimensional LPC cepstrum coefficients.

【００４９】前記単語検出信号出力部４はＤＲＮＮ単語モ
デル記憶部４１、ワードスポッティング部４２から構成
される。ＤＲＮＮ単語モデル記憶部４１は、認識対象単
語ごとのＤＲＮＮ単語モデルデータが記憶されるもの
で、認識対象単語としては、たとえば、「おはよう」、
「おやすみ」、「なんじ」などの単語であるとする。The word detection signal output unit 4 consists of a DRNN word model storage unit 41 and a word spotting unit 42. The DRNN word model storage unit 41 stores DRNN word model data for each word to be recognized; the words to be recognized are assumed to be, for example, “Ohayou” (good morning), “Oyasumi” (good night), and “Nanji”.

【００５０】ワードスポッティング部４２は、音声分析
部３からの音声データが入力されると、ＤＲＮＮ単語モ
デル記憶部４１の内容を用いて、キーワード（認識対象
単語）に対するＤＲＮＮ出力（ＤＲＮＮ出力の開始時
刻、終了時刻、確からしさを表す出力値などのデータ）
を得る。そして、これらの各データはＤＲＮＮ出力情報
記憶部５に記憶される。When speech data from the speech analysis unit 3 is input, the word spotting unit 42 uses the contents of the DRNN word model storage unit 41 to obtain the DRNN output for each keyword (word to be recognized), that is, data such as the start time and end time of the DRNN output and an output value representing certainty. These data are stored in the DRNN output information storage unit 5.
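
The patent does not say how the start time, end time, and certainty value are derived from the DRNN output. As a rough Python sketch only, the following assumes the word model emits one output value per analysis frame and that a detection is any run of frames at or above a fixed threshold; the names `Detection` and `find_detections` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Detection:
    start: int   # first frame at or above the threshold
    end: int     # last frame at or above the threshold (inclusive)
    peak: float  # highest output value inside the run


def find_detections(outputs: List[float], threshold: float) -> List[Detection]:
    """Return every run of consecutive frames whose output reaches the threshold."""
    detections: List[Detection] = []
    start: Optional[int] = None
    for i, value in enumerate(outputs):
        if value >= threshold and start is None:
            start = i  # a run begins
        elif value < threshold and start is not None:
            # The run ended on the previous frame.
            detections.append(Detection(start, i - 1, max(outputs[start:i])))
            start = None
    if start is not None:
        # The run lasts until the final frame.
        detections.append(Detection(start, len(outputs) - 1, max(outputs[start:])))
    return detections
```

Each `Detection` corresponds to one record of the kind held in the DRNN output information storage unit 5, under the assumptions above.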

【００５１】なお、この実施の形態では、前記したよう
に、類似語として「なんじ」と「なんど」を例にしてい
る。そして、「なんじ」という単語について学習された
単語モデルを有し、入力話者の発話する「なんじ」とい
う音声データに対しては、高い確からしさを有する出力
が出てくるようになっている。また、入力話者の発話す
る「なんど」という音声データに対しても同様の出力が
出力される。In this embodiment, as noted above, “Nanji” and “Nando” are used as the example similar words. The device has a word model trained on the word “Nanji”, so that an output with high certainty is produced for speech data in which the input speaker says “Nanji”. A similar output is also produced for speech data in which the input speaker says “Nando”.

【００５２】ところで、前記コードブック７は「あ・い
・う・え・お」の５個の母音から作成されたコードサイ
ズが５のコードブックである。The codebook 7 is a codebook of code size 5 created from the five vowels “a, i, u, e, o” (あ・い・う・え・お).

【００５３】ベクトル量子化部６は、音声分析された入
力音声の特徴ベクトル列を前記コードブック７を用いて
ベクトル量子化しコードデータを作成するものであり、
そのコードデータはコードデータ記憶部８に格納され
る。The vector quantizing unit 6 quantizes the feature vector sequence of the input speech subjected to the speech analysis using the codebook 7 to generate code data.
The code data is stored in the code data storage unit 8.

【００５４】このような構成において具体的な処理につ
いて説明する。Specific processing in such a configuration will be described.

【００５５】前述したように、ＤＲＮＮ単語モデル記憶
部４１は、認識対象単語ごとのＤＲＮＮ単語モデルを記
憶している。したがって、入力話者が「なんじ」と発話
すれば、ワードスポッティング部４２からは一定以上の
確からしさを有するＤＲＮＮ出力（「なんじ」の音声モ
デルによるＤＲＮＮ出力）が出され、そのＤＲＮＮ出力
の開始時刻、終了時刻、確からしさを表す出力値が検出
される。そして、これら各データはＤＲＮＮ出力情報記
憶部５に格納される。As described above, the DRNN word model storage section 41 stores a DRNN word model for each recognition target word. Therefore, if the input speaker speaks “Nanji”, the word spotting section 42 outputs a DRNN output having a certain degree of certainty (DRNN output based on the voice model of “Nanji”), and outputs the DRNN output. Output values representing start time, end time, and certainty are detected. These data are stored in the DRNN output information storage unit 5.

【００５６】なお、この第１の実施の形態およびその後
の第２、第３の実施の形態における説明の中で、ワード
スポッティング部４２からのＤＲＮＮ出力というような
表現がある場合は、「なんじ」の音声モデルによるＤＲ
ＮＮ出力という意味である。一方、入力話者が「なん
ど」と発話したときも、同様に、ワードスポッティング
部４２からは一定以上の確からしさを有するＤＲＮＮ出
力が出るとともに、そのＤＲＮＮ出力の開始時刻、終了
時刻、確からしさを表す値が出力され、そのＤＲＮＮ出
力の開始時刻、終了時刻、確からしさを表す出力値など
のデータはＤＲＮＮ出力情報記憶部５に格納される。In the description of this first embodiment and of the second and third embodiments that follow, an expression such as “the DRNN output from the word spotting unit 42” means the DRNN output of the speech model for “Nanji”. When the input speaker utters “Nando”, the word spotting unit 42 likewise produces a DRNN output with at least a certain level of certainty, together with the start time, end time, and certainty value of that DRNN output, and these data are stored in the DRNN output information storage unit 5.

【００５７】図２（ａ）は入力話者が発話１として「い
まなんじかな」と発話した場合のＤＲＮＮ出力を示すも
ので、図２（ｂ）は入力話者が発話２として「いまなん
どかな」と発話した場合のＤＲＮＮ出力を示すものであ
る。このように、「なんじ」および「なんど」の部分で
共に一定以上の確からしさを有するＤＲＮＮ出力が出て
くる。FIG. 2(a) shows the DRNN output when the input speaker says “ima nanji kana” (いまなんじかな) as utterance 1, and FIG. 2(b) shows the DRNN output when the input speaker says “ima nando kana” (いまなんどかな) as utterance 2. In both cases a DRNN output with at least a certain level of certainty appears, at the “Nanji” portion and the “Nando” portion respectively.

【００５８】そして、入力話者が「いまなんじかな」と
発話した場合、その音声データ（特徴ベクトル列）は、
ベクトル量子化部６にも与えられる。このベクトル量子化
部６では、５つの母音から作成されたコードブック７を
用いて、入力話者の「いまなんじかな」の音声データを
ベクトル量子化する。When the input speaker says “ima nanji kana”, the speech data (feature vector sequence) is also supplied to the vector quantization unit 6. The vector quantization unit 6 vector-quantizes the speaker's “ima nanji kana” speech data using the codebook 7 created from the five vowels.

【００５９】すなわち、コードブック７には、「あ・い
・う・え・お」のそれぞれの母音に対応する５個のコー
ドベクトル、つまり、「あ」に対してはｃ０、「い」に
対してはｃ１、「う」に対してはｃ２、「え」に対して
はｃ３、「お」に対してはｃ４が存在し、これらのコー
ドベクトル「ｃ０，ｃ１，・・・，ｃ４」と、入力話者
が発話して得られた特徴ベクトル列を構成する各特徴ベ
クトルとの距離を計算し、最短距離のコードベクトルと
の対応付けを行うことによりコード化してコードデータ
を得る。このコード化されたコード列の例を図２（ｃ）に
示す。この図２（ｃ）からもわかるように、「いまなん
じかな」の音韻列のうち、たとえば、「い」は母音の
「い」の音韻そのものであり、また、「じ」の部分にも
母音の「い」の音韻が存在するため、そのコードデータ
はｃ１が多いデータとなり、「ま」、「な」、「か」、
「な」などは母音の「あ」の音韻が存在するため、その
コードデータはｃ０が多いデータとなる。That is, the codebook 7 contains five code vectors corresponding to the vowels “a, i, u, e, o”: c0 for “a”, c1 for “i”, c2 for “u”, c3 for “e”, and c4 for “o”. The distance between each of these code vectors c0, c1, ..., c4 and each feature vector in the feature vector sequence obtained from the input speaker's utterance is calculated, and each feature vector is assigned the code vector at the shortest distance, which yields the code data. FIG. 2(c) shows an example of the resulting code sequence. As FIG. 2(c) shows, in the phoneme sequence of “ima nanji kana”, the “i” is the vowel “i” itself and the “ji” portion also contains the vowel “i”, so their code data consist mostly of c1; “ma”, “na”, “ka”, and “na” contain the vowel “a”, so their code data consist mostly of c0.
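
The nearest-code-vector assignment described above can be sketched in Python as follows. This is an illustration only: it assumes squared Euclidean distance (the patent does not name the distance measure), and the function names are hypothetical.

```python
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(
    features: Sequence[Sequence[float]],
    codebook: Sequence[Sequence[float]],
) -> List[int]:
    """Map each feature vector to the index of its nearest code vector.

    With the five-vowel codebook of the first embodiment, index 0 would stand
    for c0 ("a"), 1 for c1 ("i"), 2 for c2 ("u"), 3 for c3 ("e"), 4 for c4 ("o").
    """
    codes = []
    for vec in features:
        nearest = min(
            range(len(codebook)),
            key=lambda k: squared_distance(vec, codebook[k]),
        )
        codes.append(nearest)
    return codes
```

In the device the feature vectors would be the 10-dimensional LPC cepstrum vectors produced by the speech analysis unit 3; the function itself works for any dimension.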

【００６０】このようにコード化されたコードデータは
コードデータ記憶部８に格納される。そして、認識処理
部９では、このコードデータ記憶部８に格納されたコー
ドデータと前記ＤＲＮＮ出力情報記憶部５に格納された
ＤＲＮＮ出力データに基づいて、入力話者の発話した単
語が、「なんじ」か「なんど」であるかを判定する。こ
の判定処理について以下に説明する。The code data encoded in this way are stored in the code data storage unit 8. The recognition processing unit 9 then determines, from the code data stored in the code data storage unit 8 and the DRNN output data stored in the DRNN output information storage unit 5, whether the word spoken by the input speaker is “Nanji” or “Nando”. This determination process is described below.

【００６１】前記ＤＲＮＮ出力情報記憶部５に格納され
るデータは、前述したように、ＤＲＮＮ出力の開始時
刻、終了時刻、確からしさを表す出力値であり、これら
のデータに基づいて、「なんじ」に対応するＤＲＮＮ出
力（図２(a)参照）のうち、ある区間ｔ１を設定する。
この区間ｔ１は、この場合、「なんじ」と「なんど」の
識別であるから、両者に最も違いの出ると思われる
「じ」または「ど」の音韻部分のＤＲＮＮ出力を十分含
むような区間を設定する。つまり、入力される類似単語
の特徴部分（この場合、「じ」、「ど」の部分）に対す
るＤＲＮＮ出力を含むような区間を設定する。As described above, the data stored in the DRNN output information storage unit 5 are the start time and end time of the DRNN output and the output value representing certainty. Based on these data, a section t1 is set within the DRNN output corresponding to “Nanji” (see FIG. 2(a)). Because the task here is to distinguish “Nanji” from “Nando”, the section t1 is chosen so that it amply covers the DRNN output for the phoneme portion “ji” or “do”, where the two words are expected to differ most. In other words, the section is set so as to include the DRNN output for the characteristic portion of the input similar word (here, the “ji” or “do” portion).

【００６２】そして、図２（ｃ）に示すコードデータ列
における区間ｔ１に対応するコードデータが主にどのよ
うなコードベクトルで構成されているかを調べる。この
場合、区間ｔ１におけるコードベクトルは、ｃ２，ｃ
１，ｃ１，ｃ１が存在している。Next, the code data corresponding to the section t1 in the code data sequence shown in FIG. 2(c) are examined to see which code vectors they mainly consist of. In this case, the code vectors in the section t1 are c2, c1, c1, c1.

【００６３】このように、区間ｔ１には「い」のコード
ベクトルが存在しているので、図２（ａ）で示されるＤ
ＲＮＮ出力は、「なんじ」に対するＤＲＮＮ出力である
と判断する。Since code vectors for “i” are thus present in the section t1, the DRNN output shown in FIG. 2(a) is judged to be the DRNN output for “Nanji”.

【００６４】また、入力話者が「いまなんどかな」と発
話したとすると、この場合も、ＤＲＮＮ出力は「なん
じ」とほぼ同様の出力となるが、この「いまなんどか
な」の特徴ベクトル列をコードブック７を用いてベクト
ル量子化部６でコード化すると、図２（ｄ）のようなコ
ード列となる。この図２（ｄ）からもわかるように、こ
の場合、区間ｔ１におけるコードベクトルは、ｃ２，ｃ
４，ｃ４，ｃ４が存在している。If the input speaker says “ima nando kana”, the DRNN output is again almost the same as for “Nanji”, but when the feature vector sequence of “ima nando kana” is encoded by the vector quantization unit 6 using the codebook 7, the code sequence shown in FIG. 2(d) is obtained. As FIG. 2(d) shows, the code vectors in the section t1 in this case are c2, c4, c4, c4.

【００６５】このように、区間ｔ１には「お」のコード
ベクトルが存在しているので、図２（ａ）で示されるＤ
ＲＮＮ出力は、「なんど」に対するＤＲＮＮ出力である
と判断する。Since code vectors for “o” are thus present in the section t1, the DRNN output shown in FIG. 2(a) is judged to be the DRNN output for “Nando”.
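
The vowel check in the section t1 can be sketched in Python as below. The text only says that the word is judged by which vowel code is present in t1; comparing the counts of c1 and c4 is an assumption of this sketch, as are the names `classify_by_vowel`, `CODE_I`, and `CODE_O`.

```python
from collections import Counter
from typing import Optional, Sequence

CODE_I = 1  # c1, the code vector for the vowel "i" (as in "ji")
CODE_O = 4  # c4, the code vector for the vowel "o" (as in "do")


def classify_by_vowel(
    codes: Sequence[int], t1_start: int, t1_end: int
) -> Optional[str]:
    """Decide between "nanji" and "nando" from the codes inside section t1.

    t1_start and t1_end are inclusive frame indices into the code sequence.
    """
    counts = Counter(codes[t1_start:t1_end + 1])
    if counts[CODE_I] > counts[CODE_O]:
        return "nanji"
    if counts[CODE_O] > counts[CODE_I]:
        return "nando"
    return None  # neither vowel dominates; the sketch leaves this undecided
```

For the two sections quoted in the text, c2, c1, c1, c1 gives "nanji" and c2, c4, c4, c4 gives "nando".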

【００６６】以上のようにこの第１の実施の形態では、
類似単語を識別する方法として、５つの母音から作成し
たコードブックを用いて入力音声データをベクトル量子
化し、コードデータを得て、ＤＲＮＮ出力の所定区間に
どのような母音が存在するかを調べて、入力単語を識別
するようにしている。これにより、ＤＲＮＮ出力だけで
は識別できない単語に対しても高精度に識別できるよう
になる。As described above, the first embodiment identifies similar words by vector-quantizing the input speech data with a codebook created from the five vowels to obtain code data, and then examining which vowels are present in a predetermined section of the DRNN output in order to identify the input word. This makes it possible to identify with high accuracy even words that cannot be distinguished by the DRNN output alone.

【００６７】なお、この第１の実施の形態では、コード
ブック７は５つの母音から作成されたコード数が５個の
ものを使用する例について述べたが、このコードブック
７のコード数は５個に限られるものではなく、たとえ
ば、子音を含んだ音声から作成されたもっとコード数の
多いコードブックを用いるようにしてもよい。たとえ
ば、全音素の特徴を含むコードブックを用いた場合、
「なんじ」と「なんど」を例に取れば、入力話者の発話
して得られた「じ」または「ど」の音声データを、コー
ドブックを用いてベクトル量子化して、「なんじ」また
は「なんど」のＤＲＮＮ出力の前記設定区間ｔ１に
「じ」または「ど」に対応するコードデータが有るか否
かを判断することにより、入力話者の発話した単語が、
「なんじ」であるか「なんど」であるかを判断すること
ができる。ただし、この第１の実施の形態においては、
５つの母音から作成されたコードブックを用いた方が処
理量の点から有利である。The first embodiment has been described using a codebook 7 with five codes created from the five vowels, but the number of codes in the codebook 7 is not limited to five; for example, a codebook with more codes, created from speech that includes consonants, may be used. With a codebook that covers the features of all phonemes, taking “Nanji” and “Nando” as the example, the speech data for “ji” or “do” uttered by the input speaker are vector-quantized using that codebook, and whether the speaker said “Nanji” or “Nando” can be determined by checking whether code data corresponding to “ji” or to “do” are present in the set section t1 of the DRNN output. In the first embodiment, however, a codebook created from the five vowels is advantageous in terms of the amount of processing.

【００６８】以上説明した第１の実施の形態で説明した
方法を用いて、成人男性と成人女性の合計二百数十名の
話者数にて「なんど」と「なんじ」について実際に発話
して認識率を求める実験を行った結果、９５％近い認識
率が得られた。なお、ここで用いたコードブックは、男
女別のコードブック（離散母音発話データから作成した
サイズ＝５のコードブック）である。この実験結果から
も類似単語についてきわめて高精度に識別を行うことが
できることがわかる。An experiment was conducted in which a total of some two hundred and several tens of speakers, adult men and women, actually uttered “Nando” and “Nanji”, and the recognition rate was measured using the method of the first embodiment described above; a recognition rate close to 95% was obtained. The codebooks used were gender-specific codebooks (codebooks of size 5 created from discrete vowel utterance data). This result also shows that similar words can be identified with very high accuracy.

【００６９】（第２の実施の形態）次に本発明の第２の
実施の形態について説明する。図３は第２の実施の形態
を実現するための類似単語識別装置の構成図であり、第
１の実施の形態の構成と異なるのは、標準ヒストグラム
記憶部１１とヒストグラム生成部１２を設けた点にあ
る。その他の構成要素は図１とほぼ同じであるので、同
一部分には同一符号が付されている。ただし、この第２
の実施の形態で用いられるコードブック７は、全音素の
特徴を含むコードブックが使用される。この第２の実施
の形態では、コード数が６４のコードブックを使用した
例で説明する。(Second Embodiment) Next, a second embodiment of the present invention will be described. FIG. 3 is a configuration diagram of a similar word identification device for realizing the second embodiment. It differs from the configuration of the first embodiment in that a standard histogram storage unit 11 and a histogram generation unit 12 are provided. The other components are almost the same as in FIG. 1, and the same parts are given the same reference numerals. However, the codebook 7 used in the second embodiment is a codebook that covers the features of all phonemes. The second embodiment is described using an example with a codebook of 64 codes.

【００７０】前記標準ヒストグラム記憶部１１には、標
準ヒストグラムデータが記憶される。この標準ヒストグ
ラムデータは、たとえば、「なんじ」という単語につい
て、数百人の話者一人一人が発話して得られた音声デー
タを、６４のサイズのコードブックを用いてベクトル量
子化したとき、そのコードブックのどのコードベクトル
が何回出現したかを示すヒストグラムデータである。こ
の標準ヒストグラムデータは、誤認識されやすい類似単
語ごとに予め作成しておくものである。The standard histogram storage unit 11 stores standard histogram data. For the word “Nanji”, for example, the standard histogram data are histogram data showing how many times each code vector of the codebook appeared when the speech data uttered by each of several hundred speakers were vector-quantized using a codebook of size 64. Such standard histogram data are created in advance for each similar word that is easily misrecognized.

【００７１】図４（ａ）は、数百人のうち或る一人の話
者が「なんじ」と発話して得られた音声データを６４サ
イズのコードブックを用いてベクトル量子化したとき得
られたコードデータ列を示すものであり、これを一人一
人について求め、ｃ０〜ｃ６３のコードベクトルごとに
出現数を累積して、標準ヒストグラムを作成する。FIG. 4(a) shows the code data sequence obtained when the speech data of one of the several hundred speakers saying “Nanji” are vector-quantized using the codebook of size 64. Such a sequence is obtained for every speaker, and the number of occurrences is accumulated for each of the code vectors c0 to c63 to create the standard histogram.
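
The accumulation step can be sketched in Python as follows. This is a minimal illustration with a hypothetical function name; each inner sequence is assumed to hold the code indices of one speaker's utterance (or only of its portion inside the section t1, as the following paragraphs require for the stored histogram).

```python
from typing import List, Sequence


def accumulate_histogram(
    code_sequences: Sequence[Sequence[int]], codebook_size: int
) -> List[int]:
    """Count how often each code index occurs across all speakers' sequences."""
    histogram = [0] * codebook_size
    for codes in code_sequences:
        for code in codes:
            histogram[code] += 1  # one more occurrence of this code vector
    return histogram
```

For the second embodiment `codebook_size` would be 64, so the result has one bin for each of c0 to c63.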

【００７２】そして、前述の第１の実施の形態で説明し
たように、「なんじ」の音声データに反応して出力する
ＤＲＮＮ出力（図２（ａ）参照）のうち、前記同様、あ
る区間ｔ１を設定する。そして、前記作成された標準ヒ
ストグラムのうち、この区間ｔ１における標準ヒストグ
ラムデータを標準ヒストグラム記憶部１１に記憶させて
おく。図４（ｂ）は標準ヒストグラム記憶部１１に記憶
される標準ヒストグラム例を示すものである。Then, as described in the first embodiment, a section t1 is set in the same way within the DRNN output produced in response to the speech data for “Nanji” (see FIG. 2(a)). Of the standard histogram created above, the standard histogram data for this section t1 are stored in the standard histogram storage unit 11. FIG. 4(b) shows an example of a standard histogram stored in the standard histogram storage unit 11.

【００７３】すなわち、この図４（ｂ）に示される標準
ヒストグラムは、「なんじ」に対するＤＲＮＮ出力の区
間ｔ１に対応する音韻部分における数百人の話者から得
られたコードベクトルの累積出現数を表すものとなる。That is, the standard histogram shown in FIG. 4(b) represents the cumulative number of occurrences of each code vector, obtained from several hundred speakers, in the phoneme portion corresponding to the section t1 of the DRNN output for “Nanji”.

【００７４】同様にして、「なんじ」の類似単語である
「なんど」に対しても、前記区間ｔ１に対応する音韻部
分における数百人の話者から得られたコードベクトルの
標準ヒストグラムを作成しておく。Similarly, for “Nando”, the word similar to “Nanji”, a standard histogram of the code vectors obtained from several hundred speakers in the phoneme portion corresponding to the section t1 is created in advance.

【００７５】以上のようにして類似単語（ここでは「な
んじ」と「なんど」）のＤＲＮＮ出力の所定区間ｔ１部
分におけるそれぞれの標準ヒストグラムを予め作成して
おき、それを標準ヒストグラム記憶部１１に記憶させて
おく。In this way, a standard histogram for the predetermined section t1 of the DRNN output is created in advance for each of the similar words (here, “Nanji” and “Nando”) and stored in the standard histogram storage unit 11.

【００７６】そして、ユーザが「いまなんじかな」と発
話した場合、その音声データ（特徴ベクトル列）は、ワ
ードスポッティング部４２に与えられるとともに、ベク
トル量子化部６にも与えられる。このベクトル量子化部６
では、６４のコードサイズのコードブック７を用いて、
ユーザの「いまなんじかな」の音声データをベクトル量
子化してコードデータを得る。このコードデータはコー
ドデータ記憶部８に格納される。When the user says “ima nanji kana”, the speech data (feature vector sequence) are supplied to the word spotting unit 42 and also to the vector quantization unit 6. The vector quantization unit 6 vector-quantizes the user's “ima nanji kana” speech data using the codebook 7 of code size 64 to obtain code data, which are stored in the code data storage unit 8.

【００７７】そして、認識処理部９では、ヒストグラム
の生成処理が必要と判断すると、ＤＲＮＮ出力情報から
得られた区間ｔ１における入力話者のヒストグラムを生
成する。なお、ヒストグラムの生成処理が必要か否かの
判断は、ＤＲＮＮ出力情報記憶部５の内容を見て、ユー
ザの発話した「なんじ」に対して、一定以上の確からしさ
を示す値が出力された場合は、ヒストグラムの生成処理
が必要と判断する。When the recognition processing unit 9 judges that histogram generation is necessary, it generates a histogram for the input speaker in the section t1 obtained from the DRNN output information. Whether histogram generation is necessary is judged from the contents of the DRNN output information storage unit 5: it is judged necessary when a value indicating at least a certain level of certainty has been output for the “Nanji” spoken by the user.

【００７８】前記区間ｔ１における入力話者のヒストグ
ラムを生成する処理は、コードデータ記憶部８に格納さ
れたユーザの「いまなんじかな」の音声データに対する
コードデータのうち、前記ＤＲＮＮ出力の区間ｔ１に対
応する部分のコードベクトルのヒストグラムを生成す
る。これにより生成されたヒストグラムの例を図４
（ｃ）に示す。そして、この入力話者のヒストグラムと
前記標準ヒストグラムの距離を求めるが、標準ヒストグ
ラムは数百人から得られたヒストグラムであり、入力話
者のヒストグラムは一人の音声データから得られたヒス
トグラムであるため、それぞれを正規化して距離を求め
る。この正規化処理は特に限定されるものではない。To generate the input speaker's histogram for the section t1, a histogram of the code vectors is created from the part of the code data for the user's “ima nanji kana” speech data, stored in the code data storage unit 8, that corresponds to the section t1 of the DRNN output. FIG. 4(c) shows an example of a histogram generated in this way. The distance between this input speaker histogram and each standard histogram is then obtained. Because the standard histogram is derived from several hundred speakers whereas the input speaker histogram is derived from the speech data of a single person, each is normalized before the distance is calculated. The normalization method is not particularly limited.

【００７９】正規化された入力話者ヒストグラムと「な
んじ」に対する標準ヒストグラムとの差分ヒストグラム
を求めるとともに、入力話者ヒストグラムと「なんど」
に対する標準ヒストグラムとの差分ヒストグラムを求め
る。図４（ｄ）は入力話者ヒストグラムと「なんじ」に
対する標準ヒストグラムとの差分ヒストグラム（絶対
値）を示すものである。A difference histogram is obtained between the normalized input speaker histogram and the standard histogram for “Nanji”, and likewise between the input speaker histogram and the standard histogram for “Nando”. FIG. 4(d) shows the difference histogram (absolute values) between the input speaker histogram and the standard histogram for “Nanji”.

【００８０】このようにして求められた差分ヒストグラ
ム（絶対値）における累積度数をたし算してその合計を
求める。The frequencies in the difference histogram (absolute values) obtained in this way are added together to obtain their total.

【００８１】以上の処理を入力話者ヒストグラムと「な
んど」に対する標準ヒストグラムについても行い、両者
の差分ヒストグラムを求め、その差分ヒストグラムの累
積度数を足して合計を求める。The above processing is also performed on the input speaker histogram and the standard histogram for "Nando", a difference histogram between the two is obtained, and the cumulative frequency of the difference histogram is added to obtain the sum.

【００８２】そして、それぞれの合計値を比較して合計
値の小さい方を選択する。たとえば、入力話者ヒストグ
ラムと「なんじ」に対する標準ヒストグラムとにより求
められた差分ヒストグラム（絶対値）における累積度数
の合計値が、入力話者ヒストグラムと「なんど」に対す
る標準ヒストグラムとにより求められた差分ヒストグラ
ム（絶対値）における合計値よりも小さい場合は、入力
話者の発話した単語は「なんじ」であると判定する。The two totals are then compared and the smaller one is selected. For example, if the total of the frequencies in the difference histogram (absolute values) obtained from the input speaker histogram and the standard histogram for “Nanji” is smaller than the total in the difference histogram (absolute values) obtained from the input speaker histogram and the standard histogram for “Nando”, the word spoken by the input speaker is judged to be “Nanji”.
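
The comparison of the second embodiment (normalize, take absolute differences, sum, choose the smaller total) can be sketched in Python as below. The patent leaves the normalization method open; dividing each bin by the histogram total is an assumption of this sketch, and the function names are hypothetical.

```python
from typing import Dict, List, Sequence


def normalize(histogram: Sequence[float]) -> List[float]:
    """Scale a histogram so that its bins sum to 1 (assumed normalization)."""
    total = sum(histogram)
    if total == 0:
        return [0.0] * len(histogram)
    return [count / total for count in histogram]


def difference_total(h1: Sequence[float], h2: Sequence[float]) -> float:
    """Sum of the absolute bin-wise differences of the normalized histograms."""
    return sum(abs(a - b) for a, b in zip(normalize(h1), normalize(h2)))


def classify_by_histogram(
    input_hist: Sequence[float], standard_hists: Dict[str, Sequence[float]]
) -> str:
    """Return the word whose standard histogram gives the smallest total."""
    return min(
        standard_hists,
        key=lambda word: difference_total(input_hist, standard_hists[word]),
    )
```

Normalizing by the total makes a single speaker's histogram comparable with one accumulated over several hundred speakers, which is the reason the text gives for normalizing before the distance is taken.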

【００８３】以上のように、この第２の実施の形態で
は、類似単語について数百人が発話して得られたそれぞ
れの音声データをコード化し、そのコードデータを基に
前記したような類似単語ごとの標準ヒストグラムを作成
しておき、この標準ヒストグラムと入力話者ヒストグラ
ムとの差分ヒストグラムを求め、その差分ヒストグラム
から入力単語を識別するようにしている。この第２の実
施の形態によっても、第１の実施の形態同様、ＤＲＮＮ
出力だけでは識別できない単語に対しても、既存のＤＲ
ＮＮ音声モデルの学習方法を変更することなく類似単語
を確実に識別することができる。As described above, in the second embodiment the speech data uttered by several hundred people for each similar word are encoded, a standard histogram for each similar word is created in advance from the code data as described, the difference histogram between each standard histogram and the input speaker histogram is obtained, and the input word is identified from the difference histograms. As with the first embodiment, the second embodiment can reliably identify similar words, even words that cannot be distinguished by the DRNN output alone, without changing the existing training method for the DRNN speech model.

【００８４】また、この第２の実施の形態で示した方法
は、単語の特徴部分を含む所定区間のコードデータの度
数分布を標準話者の度数分布と比較しているので、より
一層、高精度な識別が可能となり、きわめて誤認識され
やすい類似単語についても高精度に識別可能となる。Moreover, because the method of the second embodiment compares the frequency distribution of the code data in a predetermined section containing the characteristic portion of the word with the frequency distribution of standard speakers, identification becomes still more accurate, and even similar words that are very easily misrecognized can be identified with high accuracy.

【００８５】以上説明した第２の実施の形態で説明した
方法を用いて、成人男性と成人女性の合計二百数十名の
話者数にて「なんど」と「なんじ」について実験した結
果、ほぼ１００％に近い認識率が得られた。なお、ここ
で用いたコードブックは、男性用コードブックでそのコ
ードサイズは２５６のコードブックであり、標準ヒスト
グラムは男性用、女性用、男女兼用を作成したが、どの
標準ヒストグラムを用いてもほぼ同様の高い認識率が得
られた。In an experiment on “Nando” and “Nanji” using the method of the second embodiment described above, with a total of some two hundred and several tens of speakers, adult men and women, a recognition rate close to 100% was obtained. The codebook used was a male codebook of code size 256. Standard histograms were created for men, for women, and for both combined, and almost the same high recognition rate was obtained whichever standard histogram was used.

【００８６】（第３の実施の形態）次に本発明の第３の
実施の形態について説明する。図５は第３の実施の形態
を実現するための類似識別装置の構成図であり、音声入
力部としてのマイクロホン１、Ａ／Ｄ変換部２、音声分
析部３、単語検出信号出力部４、ＤＲＮＮ出力情報記憶
部５、認識処理部９、サブ単語ＤＲＮＮ出力情報記憶部
１３などから構成されている。(Third Embodiment) Next, a third embodiment of the present invention will be described. FIG. 5 is a configuration diagram of a similar word identification device for realizing the third embodiment. It comprises a microphone 1 serving as the speech input unit, an A/D conversion unit 2, a speech analysis unit 3, a word detection signal output unit 4, a DRNN output information storage unit 5, a recognition processing unit 9, a sub-word DRNN output information storage unit 13, and so on.

【００８７】この第３の実施の形態による単語検出信号
出力部４は、前記第１、第２の実施の形態で用いたＤＲ
ＮＮ単語モデル記憶部４１、ワードスポッティング部４
２の他に、ＤＲＮＮサブ単語モデル記憶部４３を有して
いる。このＤＲＮＮサブ単語モデル記憶部４３は、類似
単語としての「なんじ」と「なんど」におけるそれぞれ
の特徴部分「じ」と「ど」のＤＲＮＮ単語モデルデータ
を記憶するものである。The word detection signal output unit 4 of the third embodiment has, in addition to the DRNN word model storage unit 41 and the word spotting unit 42 used in the first and second embodiments, a DRNN sub-word model storage unit 43. The DRNN sub-word model storage unit 43 stores DRNN word model data for “ji” and “do”, the respective characteristic portions of the similar words “Nanji” and “Nando”.

【００８８】また、サブ単語ＤＲＮＮ出力情報記憶部１
３は、ＤＲＮＮサブ単語モデル記憶部４３を用いて、ワ
ードスポッティング処理されたときに出力されるＤＲＮ
Ｎ出力の開始時刻、終了時刻、確からしさを表すデータ
などを記憶するものである。なお、この実施の形態で
は、前記ＤＲＮＮ出力情報記憶部５は、サブ単語ＤＲＮ
Ｎ出力情報記憶部１３に対して入力単語そのもののＤＲ
ＮＮ出力情報を記憶するものであるから、両者を区別す
るために、以下では入力単語ＤＲＮＮ出力情報記憶部５
という。The sub-word DRNN output information storage unit 13 stores data such as the start time, end time, and certainty value of the DRNN output produced when word spotting is performed using the DRNN sub-word model storage unit 43. In this embodiment, the DRNN output information storage unit 5, in contrast to the sub-word DRNN output information storage unit 13, stores the DRNN output information for the input word itself; to distinguish the two, it is referred to below as the input word DRNN output information storage unit 5.

[0089] The processing of the third embodiment is described below. In the third embodiment as well, the similar words "nanji" and "nando" are used as examples.

[0090] Suppose the speaker utters "ima nanji kana". The speech is analyzed by the speech analysis unit 3, output as a feature vector sequence, and supplied to the word spotting unit 42. Word spotting is then performed using the contents of the DRNN word model storage unit 41, and the word spotting unit 42 produces, at the "nanji" portion, a DRNN output having at least a certain level of certainty, as shown in FIG. 6(a). Data such as the start time, end time, and certainty value of this DRNN output are stored in the input word DRNN output information storage unit 5.

[0091] Thus, when the user utters "ima nanji kana", a DRNN output is produced at the "nanji" portion. At the same time, word spotting is performed on the phoneme portion "ji" using the contents of the DRNN sub-word model storage unit 43, and the word spotting unit 42 produces, at the "ji" portion, a DRNN output having at least a certain level of certainty, as shown in FIG. 6(b). The start time, end time, and certainty value of the DRNN output for the "ji" phoneme portion are stored in the sub-word DRNN output information storage unit 13.

[0092] When there is a DRNN output having at least a certain level of certainty for "nanji", the recognition processing unit 9 sets a section t1 based on the time information stored in the input word DRNN output information storage unit 5, as described in the first and second embodiments. It then examines the sub-word DRNN output within section t1 and determines from the result which word was input. In other words, the recognition processing unit 9 recognizes the input word based on the contents of the input word DRNN output information storage unit 5 and the sub-word DRNN output information storage unit 13. The specific processing is as follows.

[0093] In this case, the sub-word DRNN output information storage unit 13 holds data such as the start time, end time, and certainty value of the DRNN output for "ji".

[0094] Accordingly, if there is a DRNN output having at least a certain level of certainty for "nanji", and a sub-word DRNN output having at least a certain level of certainty exists within section t1, the recognition processing unit 9 determines that the input speech is "nanji".

[0095] On the other hand, when the speaker utters "ima nando kana", the word spotting unit 42 produces, at the "nando" portion, a DRNN output having at least a certain level of certainty, as shown in FIG. 6(c). The start time, end time, and certainty value of this DRNN output are detected and stored in the input word DRNN output information storage unit 5.

[0096] Thus, when the user utters "ima nando kana", a DRNN output is produced at the "nando" portion. Word spotting is also performed on the phoneme portion "do" using the DRNN sub-word model, and the word spotting unit 42 produces, at the "do" portion, a DRNN output having at least a certain level of certainty, as shown in FIG. 6(d). The start time, end time, and certainty value of the DRNN output for the "do" phoneme portion are detected and stored in the sub-word DRNN output information storage unit 13.

[0097] Accordingly, if there is a DRNN output having at least a certain level of certainty for "nando", and a sub-word DRNN output having at least a certain level of certainty exists within section t1, the recognition processing unit 9 determines that the input speech is "nando".
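The decision rule of paragraphs [0092] to [0097] can be sketched as follows. This is only an illustrative sketch, not the patented implementation: the `Detection` record, the threshold value, the rule that places section t1 over the trailing part of the word-level output, and the `SUBWORD_TO_WORD` table are all assumptions introduced here for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Detection:
    """One DRNN output event, as kept in storage units 5 and 13."""
    label: str    # model that responded, e.g. "nanji", "ji" or "do"
    start: float  # start time of the DRNN output (seconds)
    end: float    # end time of the DRNN output (seconds)
    score: float  # certainty value of the DRNN output


# Hypothetical table: which word each sub-word model confirms.
SUBWORD_TO_WORD = {"ji": "nanji", "do": "nando"}


def identify(word_det: Detection,
             subword_dets: List[Detection],
             threshold: float = 0.5,
             t1_fraction: float = 0.5) -> Optional[str]:
    """Return the identified word, or None if nothing can be decided."""
    # Step 1: the word-level DRNN output must reach the threshold.
    if word_det.score < threshold:
        return None

    # Step 2 (assumed rule): section t1 covers the trailing part of the
    # word-level output, where "ji" and "do" are expected to occur.
    duration = word_det.end - word_det.start
    t1_start = word_det.end - t1_fraction * duration
    t1_end = word_det.end

    # Step 3: look for a sufficiently certain sub-word output inside t1.
    best: Optional[Detection] = None
    for det in subword_dets:
        overlaps_t1 = det.start < t1_end and det.end > t1_start
        if (overlaps_t1 and det.score >= threshold
                and det.label in SUBWORD_TO_WORD):
            if best is None or det.score > best.score:
                best = det

    return SUBWORD_TO_WORD[best.label] if best is not None else None
```

When both sub-word models respond inside t1, this sketch picks the one with the higher certainty value; the source does not state how such a tie would be resolved.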

[0098] In the third embodiment, DRNN word models of a single phoneme, such as "ji" and "do", are held. In actual recognition, a single-phoneme DRNN word model often produces output for various sounds other than "ji" or "do", so DRNN word models for recognizing a single phoneme by itself still have problems at this stage. However, when used as a sub-word model as in the third embodiment, data sufficient for practical use can be obtained.

[0099] Since "nanji" and "nando" have been described here, the DRNN sub-word models were single-phoneme models corresponding to "ji" and "do". Depending on the similar words, however, a sub-word model of two or more phonemes may be used instead of a single-phoneme DRNN sub-word model.

[0100] For example, if "nanji" and "nannichi" (何日, "what day") are similar words, a DRNN sub-word model of "ji" is prepared for "nanji", and a DRNN sub-word model of, for example, "nichi" is prepared for "nannichi". In this way, various similar words can be handled.

[0101] As in the first and second embodiments, the third embodiment described above makes it possible to reliably identify similar words, even words that cannot be distinguished by the DRNN output alone, without any change to the learning method of the existing DRNN word models. Moreover, because the third embodiment identifies words by the DRNN output for the very phonemes that differ between the similar words, even more accurate identification becomes possible.

[0102] As described above, in speech recognition using DRNN word models, when a similar word such as "nanji" or "nando" is input, a DRNN output is produced for both words. According to the first to third embodiments, such similar words can be identified with high accuracy merely by adding simple processing, without changing the learning method of the DRNN word models held by the recognition device.

[0103] The present invention can also identify similar words by means other than the first to third embodiments. For example, identification can be performed using general speech recognition techniques capable of keyword spotting, such as endpoint-free DP matching, the HMM (hidden Markov model) method, or neural network methods. The case of endpoint-free DP matching is briefly described below. Here too, "nanji" and "nando" are to be distinguished.

[0104] Standard speaker feature data are prepared for each of "nanji" and "nando". The section t1 described above is set in the DRNN output for the portion corresponding to "nanji" that is produced when the speaker utters, for example, "ima nanji kana". Within section t1, DP matching is performed between the input speech feature data and the standard speaker feature data to obtain a distance, and the input word is identified from that distance. Although this identification method requires more processing than the first to third embodiments, the present invention can be fully realized by this method as well.
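As a rough illustration of the DP matching alternative in paragraph [0104], the sketch below aligns a standard-speaker template against the input frames of section t1 with free start and end points on the input side, then picks the word whose template gives the smaller distance. The Euclidean frame distance, the length normalization, and the function names are assumptions made for this example; the source does not specify them.

```python
import math
from typing import Dict, List, Sequence

Frame = Sequence[float]


def frame_distance(a: Frame, b: Frame) -> float:
    """Euclidean distance between two feature vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def endpoint_free_dp_distance(template: List[Frame],
                              segment: List[Frame]) -> float:
    """DP matching of `template` against any contiguous part of `segment`.

    The template must be matched completely, but it may start and end
    at any frame of the segment (free start and end points).
    """
    n, m = len(template), len(segment)
    if n == 0 or m == 0:
        raise ValueError("template and segment must be non-empty")
    inf = float("inf")
    # Row 0 is all zeros: starting anywhere in the segment costs nothing.
    prev = [0.0] * (m + 1)
    for i in range(1, n + 1):
        cur = [inf] * (m + 1)
        for j in range(1, m + 1):
            d = frame_distance(template[i - 1], segment[j - 1])
            # Vertical, diagonal and horizontal predecessors.
            cur[j] = d + min(prev[j], prev[j - 1], cur[j - 1])
        prev = cur
    # Free end point: take the best cost over all end frames.
    return min(prev[1:]) / n


def identify_by_dp(segment: List[Frame],
                   templates: Dict[str, List[Frame]]) -> str:
    """Return the word whose template is closest to the t1 segment."""
    return min(templates,
               key=lambda w: endpoint_free_dp_distance(templates[w], segment))
```

The cost of this approach grows with the product of the template length and the segment length for every candidate word, which is consistent with the remark that it needs more processing than the first to third embodiments.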

[0105] In each of the embodiments above, "nanji" and "nando" were used as examples of similar words that are easily misrecognized. Similar words are of course not limited to these, and the present invention can identify other similar words as well.

[0106] Consider, for example, "nanji" and "nando" as a pair of easily misrecognized similar words, and suppose the word model originally held by the speech recognition device is the word model for "nanji". If "nando" also triggers this word model and produces a DRNN output at or above a certain level, then performing the processing described in the embodiments above not only allows the original recognition target word "nanji" to be identified with high accuracy, but also makes it possible to add "nando" to the speech recognition device as a new recognition target word. The same applies to other pairs of similar words.

[0107] Furthermore, in implementing the present invention, the word model held by the recognition device may be trained from the outset so that a DRNN output at or above a certain level is produced for all of the similar words.

[0108] Taking "nanji" and "nando" as example similar words, a speech model that produces a sufficient DRNN output for both "nanji" and "nando" may be created in advance, so that a DRNN output is deliberately produced whichever of these similar words is input. The identification processing described in the embodiments above is then performed to determine which word was input.

[0109] In this way, when the recognition target words include multiple easily misrecognized similar words, there is no need to hold a separate speech model for each similar word, which is also advantageous in terms of cost.

[0110] In this connection, groups of similar words may be created, and a word model may be created for each group such that a DRNN output having at least a certain level of certainty is produced for every word in the group.

[0111] For example, words that are easily misrecognized for one another are grouped together: similar word group A contains "nanji" and "nando", and similar word group B contains "dengon" (伝言, message), "denwa" (電話, telephone), and "denki" (電気, electricity). A DRNN word model A is then created, trained so that an output having at least a certain level of certainty is produced for every word in similar word group A, and a DRNN word model B is created, trained likewise for every word in similar word group B. In this way a word model is created for each group. Word models are created in the same way for other similar word groups.

[0112] By creating groups of similar words and holding a word model for each group that produces a DRNN output at or above a certain level for every word in the group, when, for example, "nanji" is input, the corresponding word model responds and produces a DRNN output having at least a certain level of certainty. If the processing described in the embodiments above is then performed, the input speech can be determined to be "nanji".
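The two-stage flow of paragraphs [0110] to [0112] (a group-level word model responds first, then one of the identification processes narrows the result to a single word) might be organized as in the following sketch. The `GROUPS` table mirrors the example groups A and B; the threshold and the `discriminate` callback are hypothetical stand-ins for whichever identification process of the embodiments is used.

```python
from typing import Callable, Dict, List, Optional

# Example groups from the text: one DRNN word model per group.
GROUPS: Dict[str, List[str]] = {
    "A": ["nanji", "nando"],
    "B": ["dengon", "denwa", "denki"],
}


def recognize(group_scores: Dict[str, float],
              discriminate: Callable[[List[str]], str],
              threshold: float = 0.5) -> Optional[str]:
    """Two-stage recognition with group-level word models.

    group_scores: certainty value output by each group's word model.
    discriminate: second-stage process that picks one word from the
                  candidates of the responding group (assumed interface).
    """
    # Stage 1: find group models whose output reaches the threshold.
    fired = [(score, group) for group, score in group_scores.items()
             if score >= threshold]
    if not fired:
        return None
    # If several groups respond, use the most certain one (assumption).
    _, group = max(fired)
    # Stage 2: identify the word inside the responding group.
    return discriminate(GROUPS[group])
```

With this arrangement only one word model per group has to be stored, and the per-word decision is left entirely to the second stage.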

[0113] In this way, even when the recognition target words include many easily misrecognized similar words, there is no need to hold a word model for each similar word, which is also advantageous in terms of cost.

[0114] The present invention may also use the similar word identification processes described in the embodiments above in combination.

[0115] A processing program that performs the processing of the present invention can be stored on a storage medium such as a floppy disk, an optical disk, or a hard disk, and the present invention includes such storage media. The data may also be obtained from a network.

【０１１６】[0116]

[Effects of the Invention] As described above, according to the present invention, similar words that cannot be distinguished by the DRNN output alone can be identified with high accuracy using the existing DRNN word models as they are, without changing the learning method of the DRNN word models. As one method of achieving this, the input speech data are vector-quantized using a codebook to obtain code data, and the input word is identified by examining which vowels are present in a predetermined section of the DRNN output. This allows similar words that cannot be distinguished by the DRNN output alone to be identified with high accuracy using the existing DRNN word models as they are, without changing their learning method. Furthermore, by using a codebook generated from the five vowels, similar words can be identified with a very small amount of processing.
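The vowel-based check summarized above can be illustrated with a toy vector quantizer. The two-dimensional centroids in `VOWEL_CODEBOOK`, the majority-vote rule, and the `vowel_to_word` mapping are invented for this sketch; a real codebook would be built from speech feature vectors of the five vowels.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple

Frame = Tuple[float, float]

# Hypothetical five-vowel codebook: one centroid per vowel.
VOWEL_CODEBOOK: Dict[str, Frame] = {
    "a": (1.0, 0.0),
    "i": (0.0, 1.0),
    "u": (-1.0, 0.0),
    "e": (0.5, 0.5),
    "o": (0.0, -1.0),
}


def quantize(frame: Frame) -> str:
    """Map a feature vector to the code of its nearest centroid."""
    def sq_dist(code: str) -> float:
        cx, cy = VOWEL_CODEBOOK[code]
        return (frame[0] - cx) ** 2 + (frame[1] - cy) ** 2
    return min(VOWEL_CODEBOOK, key=sq_dist)


def dominant_vowel(frames: List[Frame]) -> str:
    """Return the vowel code that occurs most often in the section."""
    codes = [quantize(f) for f in frames]
    return Counter(codes).most_common(1)[0][0]


def identify_by_vowel(frames: List[Frame],
                      vowel_to_word: Dict[str, str]) -> Optional[str]:
    """Identify the word from the dominant vowel in the section."""
    return vowel_to_word.get(dominant_vowel(frames))
```

For the running example one would pass the frames of the predetermined section together with a mapping such as `{"i": "nanji", "o": "nando"}`. Because only five centroids are compared per frame, the amount of processing stays small.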

[0117] The method of identifying similar words from histogram data created from the input word data and standard histogram data likewise allows similar words that cannot be distinguished by the DRNN output alone to be identified with high accuracy using the existing DRNN word models as they are, without changing the learning method of the DRNN word models. By comparing the frequency distribution of the code data in the predetermined section containing the characteristic part of the word with the frequency distribution for standard speakers, even more accurate identification becomes possible, and even similar words that are extremely easy to misrecognize can be identified with high accuracy.

[0118] Furthermore, the method of identifying similar words using DRNN sub-word models likewise allows similar words that cannot be distinguished by the DRNN output alone to be reliably identified without changing the learning method of the existing DRNN word models. By using DRNN word models of only the characteristic parts of the similar words, and identifying them by the DRNN output for the very phonemes that differ between the similar words, even more accurate identification becomes possible.

[Brief description of the drawings]

FIG. 1 is a configuration diagram of a similar word identification device according to the first embodiment of the present invention.

FIG. 2 is a diagram illustrating the identification process in the first embodiment.

FIG. 3 is a configuration diagram of a similar word identification device according to the second embodiment of the present invention.

FIG. 4 is a diagram illustrating the identification process in the second embodiment.

FIG. 5 is a configuration diagram of a similar word identification device according to the third embodiment of the present invention.

FIG. 6 is a diagram illustrating the identification process in the third embodiment.

FIG. 7 is a diagram illustrating the DRNN output when word spotting is performed using a DRNN word model.

FIG. 8 is a diagram illustrating the process of training on two consecutive words when training a DRNN word model.

[Explanation of symbols]

1 Microphone
2 A/D conversion unit
3 Speech analysis unit
4 Word detection signal output unit
5 DRNN output information storage unit
6 Vector quantization unit
7 Codebook
8 Code data storage unit
9 Recognition processing unit
11 Standard histogram data storage unit
12 Histogram generation unit
13 Sub-word DRNN output information storage unit
41 DRNN word model storage unit
42 Word spotting unit
43 DRNN sub-word model storage unit

Continued from the front page: (72) Inventor: Hiroo Hasegawa, c/o Seiko Epson Corporation, 3-3-5 Owa, Suwa-shi, Nagano-ken

Claims

[Claims]

1. A similar word identification method for identifying similar words that may be misrecognized when a speech model, trained to produce a predetermined output representing certainty in response to speech data of a given word, is used to obtain an output for an input word and recognition processing is performed based on that output, the method characterized in that: when speech of a given word is input and the speech model responding to the speech data of the input word produces an output representing at least a certain level of certainty, a predetermined section containing a characteristic part of the input word is set in that output, the features of the speech data of the input word within the predetermined section are examined, and the input word is distinguished from words similar to it based on the result.

2. A similar word identification method for identifying similar words that may be misrecognized when a DRNN speech model, trained to produce a predetermined output representing certainty in response to speech data of a given word, is used to obtain a DRNN output for an input word and recognition processing is performed based on that output, the method characterized in that: when speech of a given word is input, a DRNN output corresponding to the input word speech data is produced using the DRNN speech model, and the input word speech data are converted into code data using a codebook; and when a DRNN output representing at least a certain level of certainty is produced for the input word, a predetermined section containing the characteristic part of the input word is set in the DRNN output, the code data within the set section are examined, and the input word is distinguished from words similar to it based on the result.

3. The similar word identification method according to claim 2, wherein the process of examining the code data in the set predetermined section and distinguishing the input word from words similar to it based on the result examines, among the code data in the set section, the code data corresponding to vowels, and distinguishes the input word from words similar to it according to which vowel is present.

4. The similar word identification method according to claim 2 or 3, wherein the codebook is a codebook generated from five vowels.

5. The similar word identification method according to claim 2, wherein the DRNN speech model is associated with similar word groups formed by grouping similar words by type, and is a speech model trained, for each similar word group, to produce a DRNN output representing at least a certain level of certainty for every word in the group.

6. A similar word identification method for identifying similar words that may be misrecognized when a DRNN speech model, trained to produce a predetermined output representing certainty in response to speech data of a given word, is used to obtain a DRNN output for an input word and recognition processing is performed based on that output, the method characterized in that: for each of the similar words that may be misrecognized, speech data obtained by having a plurality of speakers utter the word are converted into code data using a predetermined codebook, a code histogram is created for each word from the code data in a predetermined section containing the characteristic part of the word, and the histogram data for each word are stored as standard histogram data; when speech of a given word is input, a DRNN output corresponding to the input word speech data is produced using the DRNN speech model, and the input word speech data are converted into code data using the predetermined codebook; and when a DRNN output representing at least a certain level of certainty is produced for the input word, a predetermined section containing the characteristic part of the input word is set in the DRNN output, a code histogram for that section is created from the code data, and the input word is distinguished from words similar to it by comparing this histogram data with the standard histogram data.

7. The similar word identification method according to claim 6, wherein the process of comparing the histogram data created from the input word with the standard histogram data to distinguish the input word from words similar to it obtains the difference between the two, and distinguishes the input word from words similar to it according to the magnitude of the difference.

8. The similar word identification method according to claim 6, wherein the DRNN speech model is associated with similar word groups formed by grouping similar words by type, and is a speech model trained, for each similar word group, to produce a DRNN output representing at least a certain level of certainty for every word in the group.

9. A similar word identification method for identifying similar words that may be misrecognized when a DRNN speech model, trained to produce a predetermined output representing certainty in response to speech data of a given word, is used to obtain a DRNN output for an input word and recognition processing is performed based on that output, the method characterized in that: DRNN sub-speech models are provided, each trained to produce a DRNN output representing at least a certain level of certainty for the characteristic part of one of the similar words that may be misrecognized; and when speech of a given word is input and a DRNN output representing at least a certain level of certainty is produced for the input word, a predetermined section containing the characteristic part of the word is set in the DRNN output, the state of the DRNN output from the DRNN sub-speech model within that section is examined, and the input word is distinguished from words similar to it based on the result.

10. The similar word identification method according to claim 9, wherein the process of examining the state of the DRNN output from the DRNN sub-speech model within the predetermined section and distinguishing the input word from words similar to it based on the result distinguishes them according to whether the DRNN output from the DRNN sub-speech model has a value representing at least a certain level of certainty.

11. The similar word identification method according to claim 9 or 10, wherein the DRNN speech model is associated with similar word groups formed by grouping similar words by type, and is a speech model trained, for each similar word group, to produce a DRNN output representing at least a certain level of certainty for every word in the group.

12. A similar word identification device for identifying similar words that may be misrecognized when a speech model, trained to produce a predetermined output representing certainty in response to speech data of a given word, is used to obtain an output for an input word and recognition processing is performed based on that output, the device comprising: a word detection signal output unit that, when speech of a given word is input, produces an output representing at least a certain level of certainty from the speech model responding to the speech data of the input word; and a recognition processing unit that, when such an output is produced by the word detection signal output unit, sets in that output a predetermined section containing a characteristic part of the input word, examines the features of the speech data of the input word within the predetermined section, and distinguishes the input word from words similar to it based on the result.

13. A similar word identification device for identifying similar words that may be misrecognized when a DRNN speech model, trained to produce a predetermined output representing certainty in response to speech data of a given word, is used to obtain a DRNN output for an input word and recognition processing is performed based on that output, the device comprising: word detection signal output means that, when speech of a given word is input, produces a DRNN output corresponding to the input word speech data using the DRNN speech model; coding means that converts the input word speech data into code data using a codebook; and recognition processing means that, when a DRNN output representing at least a certain level of certainty is produced by the word detection signal output means, sets in the DRNN output a predetermined section containing the characteristic part of the input word, examines the code data produced by the coding means within the set section, and distinguishes the input word from words similar to it based on the result.

14. The similar word identification device according to claim 13, wherein the process of examining the code data in the set predetermined section and distinguishing the input word from words similar to it based on the result examines, among the code data in the set section, the code data corresponding to vowels, and distinguishes the input word from words similar to it according to which vowel is present.

15. The similar word identification device according to claim 13 or 14, wherein the codebook is a codebook generated from five vowels.

16. The DRNN speech model is adapted to correspond to similar word groups grouped for each type of similar word, and in each similar word group, certain degree of certainty is given to all the words in the group. 16. A speech model that has been trained to obtain a DRNN output.
Description similar word identification device.

17. A similar word identification device for identifying similar words that may be misrecognized when a DRNN speech model, trained to produce a predetermined output representing certainty in response to speech data of a given word, is used to obtain a DRNN output for an input word and recognition processing is performed based on that output, the device comprising: standard histogram storage means that stores, as standard histogram data, a code histogram for each similar word, created from the code data in a predetermined section containing the characteristic part of that word, out of the code data obtained by coding speech data obtained by having a plurality of speakers utter each of the similar words that may be misrecognized; word detection signal output means that, when speech of a given word is input, produces a DRNN output corresponding to the input word speech data using the DRNN speech model; coding means that converts the input word speech data into code data; and recognition processing means that, when a DRNN output representing at least a certain level of certainty is produced by the word detection signal output means, sets in the DRNN output a predetermined section containing the characteristic part of the input word, creates a code histogram for that section from the code data produced by the coding means, compares this histogram data with the standard histogram data, and distinguishes the input word from words similar to it.

18. The similar word identification device according to claim 17, wherein the process of comparing the histogram data created from the input word with the standard histogram data to distinguish the input word from words similar to it comprises normalizing each histogram, then obtaining the difference between the two, and distinguishing the input word from words similar to it based on the magnitude of that difference.
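
The comparison in claim 18 can be sketched as follows. This is an illustration, not part of the claim. It assumes integer code indices, normalization by frame count, and a sum of absolute differences as the "difference"; the claim does not fix any of these. The labels `word_A` and `word_B` and all function names are hypothetical.

```python
from collections import Counter


def normalized_histogram(codes, codebook_size):
    """Count how often each code index occurs and scale the counts to sum to 1."""
    total = len(codes)
    if total == 0:
        return [0.0] * codebook_size
    counts = Counter(codes)
    return [counts.get(c, 0) / total for c in range(codebook_size)]


def histogram_difference(h1, h2):
    """Sum of absolute bin differences (an assumed difference measure)."""
    return sum(abs(a - b) for a, b in zip(h1, h2))


def identify(input_codes, standard_histograms, codebook_size):
    """Return the similar word whose standard histogram differs least from the input."""
    h = normalized_histogram(input_codes, codebook_size)
    return min(
        standard_histograms,
        key=lambda word: histogram_difference(h, standard_histograms[word]),
    )
```

Here `standard_histograms` maps each similar word to an already normalized histogram, and `input_codes` holds the code indices that fall inside the predetermined section of the input word.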

19. The similar word identification device according to claim 17, wherein the DRNN speech model corresponds to similar word groups, each grouping similar words of one type, and is a speech model trained so that, within each similar word group, a DRNN output indicating at least a certain degree of certainty is obtained for every word in the group.

20. A similar word identification device that has a DRNN speech model trained to produce a predetermined output representing certainty in response to speech data of a given word, uses the DRNN speech model to obtain a DRNN output for an input word, performs recognition processing based on that output, and identifies similar words that may be misrecognized, the device comprising: DRNN sub-speech model storage means for storing DRNN sub-speech models trained so that a DRNN output indicating at least a certain degree of certainty is obtained for the characteristic portion of each of the similar words that may be misrecognized; a word detection signal output unit which, when speech of a word is input, outputs a DRNN output corresponding to the input word speech data using the DRNN speech model and outputs a DRNN output corresponding to the characteristic portion of the input word using the DRNN sub-speech model; and a recognition processing unit which, when a DRNN output indicating at least a certain degree of certainty is output from the word detection signal output unit using the DRNN speech model, sets in that DRNN output a predetermined section including the characteristic portion of the word, examines the DRNN output of the DRNN sub-speech model for the input word within the predetermined section, and distinguishes the input word from words similar to it based on the result.

21. The similar word identification device according to claim 20, wherein the process of examining the DRNN output of the DRNN sub-speech model for the input word within the predetermined section, and distinguishing the input word from words similar to it based on the result, makes the distinction according to whether the DRNN output of the DRNN sub-speech model has a value indicating at least a certain degree of certainty.
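
The decision in claims 20 and 21 can be illustrated with a minimal sketch. It is not part of the claims. It assumes per-frame outputs in the range 0 to 1, a section defined as a fixed window around the peak of the main model output, and a single threshold shared by both models; the function name, the parameters `threshold` and `half_width`, and their default values are hypothetical.

```python
def identify_with_sub_models(main_output, sub_outputs, threshold=0.5, half_width=2):
    """Pick the similar word whose sub-model output is high inside the section.

    main_output: per-frame output of the main DRNN speech model.
    sub_outputs: dict mapping each similar word to the per-frame output of its
                 sub-speech model for the same input.
    Returns None if the main output never reaches the threshold, or if no
    sub-model output reaches it inside the section.
    """
    if not main_output:
        return None
    # Frame where the main model is most confident.
    peak = max(range(len(main_output)), key=main_output.__getitem__)
    if main_output[peak] < threshold:
        return None
    # Assumed way of setting the predetermined section: a window around the peak.
    start = max(0, peak - half_width)
    end = min(len(main_output), peak + half_width + 1)
    best_word, best_value = None, threshold
    for word, output in sub_outputs.items():
        value = max(output[start:end], default=0.0)
        if value >= best_value:
            best_word, best_value = word, value
    return best_word
```

When several sub-models exceed the threshold, this sketch keeps the one with the highest output in the section, which is a design choice of the example rather than something the claims state.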

22. The similar word identification device according to claim 20, wherein the DRNN speech model corresponds to similar word groups, each grouping similar words of one type, and is a speech model trained so that, within each similar word group, a DRNN output indicating at least a certain degree of certainty is obtained for every word in the group.