JPH0464076B2

Info

Publication number: JPH0464076B2
Application number: JP58085241A
Authority: JP
Inventors: Sadaichi Watanabe; Teruhiko Ukita
Original assignee: Tokyo Shibaura Electric Co Ltd
Current assignee: Toshiba Corp
Priority date: 1983-05-16
Filing date: 1983-05-16
Publication date: 1992-10-13
Also published as: JPS59211096A

Description

[Detailed description of the invention]

[Technical Field of the Invention]

The present invention relates to a highly practical speech recognition device capable of accurately recognizing continuously uttered input speech.

[Technical Background of the Invention and Its Problems]

Automatic speech recognition is extremely important as an interface technology that enables direct information input from humans to machines. It can be regarded as a process of converting a speech pattern, produced by continuously uttering a sequence of linguistic symbols such as phonemes, syllables, or words, into discrete linguistic symbols and recognizing each of those symbols.

The human vocal organs, however, are physical objects with a certain mass, so uttered speech does not change discretely. As a result, the speech pattern corresponding to a linguistic symbol in continuously uttered speech is inevitably affected by its preceding and following context.
For example, taking phonemes as the linguistic symbols, if the speech pattern (speech waveform) of the "o" portion of the word "aoi" (blue) is cut out and listened to, it is difficult to recognize it as a clear "o". Taking words as the linguistic symbols, the "chi" in the digit string "83" (hachi-san) is often devoiced, whereas in "81" (hachi-ichi) it is almost never devoiced. In other words, even for the same digit "8", the speech pattern differs.

Thus the linguistic symbols of continuously uttered speech often change under the strong influence of their context, so it has been unreasonable to treat one linguistic symbol as one speech pattern and recognize it independently of its context. Conventionally, it has therefore been proposed to categorize the various deformations and prepare multiple speech patterns as standard patterns for each linguistic symbol. However, when attempting to recognize all continuous utterances of speech sounds or syllables with many kinds of context, or of words from a large vocabulary, the number of standard patterns to be prepared becomes enormous, which is impractical.

On the other hand, since the speech pattern for a linguistic symbol is deformed by its context, it has also been proposed to remove that influence before recognition. However, removing the contextual influence on a given symbol requires that the symbols before and after it already be determined. Determining those neighboring symbols in turn requires that their own neighbors be determined, so the procedure is circular.
In other words, because the linguistic symbols influence one another, the influence of context could not easily be removed.

[Object of the Invention]

The present invention has been made in view of these circumstances. Its object is to provide a highly practical speech recognition device that can accurately recognize continuously uttered input speech by removing the influence of the context between successive linguistic symbols.

[Summary of the Invention]

The present invention analyzes input speech and separates its feature parameter sequence into units of discrete linguistic symbols; obtains, for each unit, one or more linguistic symbol candidates and the likelihood of each candidate for that unit; updates each likelihood as a function of compatibility coefficients between the symbol candidates of temporally adjacent units; and, when an updated likelihood exceeds a predetermined threshold, outputs the symbol candidate having that likelihood as the recognition result for that unit.

[Effects of the Invention]

According to the present invention, the likelihoods of the linguistic symbol candidates obtained for each unit of the input speech, which has been separated into units of discrete linguistic symbols, are updated as a function of the compatibility coefficients between symbol candidates of temporally adjacent units. Consequently, even when the preceding and following symbol candidates have not been determined, highly likely symbol candidates can be determined while their influence is removed step by step.
In other words, the likelihoods of the multiple symbol candidates are updated successively, on the basis of their context, through the compatibility coefficients between units. The inter-unit influence reflected in the compatibility coefficients is thereby gradually removed, and each linguistic symbol of the continuously uttered input speech can be recognized accurately. Continuously uttered speech, composed of a linguistic symbol string whose symbols influence one another throughout, can thus be recognized very effectively with that influence removed, which is highly beneficial as an interface technology for information input from humans to machines.

[Embodiment of the Invention]

An embodiment of the present invention is described below with reference to the drawings. Phonemes are used here as the example linguistic symbol, but syllables, words, and the like can of course also be handled as linguistic symbols.

FIG. 1 is a schematic configuration diagram of the device of the embodiment. Continuously uttered input speech is led to an acoustic analysis section 1, where it is acoustically analyzed at each predetermined analysis interval and converted into feature parameters. The acoustic analysis section 1 is constituted, for example, by a filter bank consisting of a plurality of band-pass filters, a known technique for spectrum analysis.

The feature parameter time series of the acoustically analyzed input speech is input to a segmentation/symbolization processing section 2, where it is divided into segments of roughly phoneme size, the unit of discrete linguistic symbols, and each unit is symbolized.
The segmentation/symbolization processing section 2 detects phoneme boundaries from feature parameters such as speech power and spectral change, and divides the input into phoneme units. For each unit it then obtains one or more phoneme candidates, that is, linguistic symbol candidates, by applying a statistical discriminant function or by matching against standard patterns. At the same time, it obtains for each phoneme candidate a likelihood expressed by a distance, a similarity, or the like. This likelihood indicates how plausible the phoneme candidate is when compared with the feature parameters of the unit in question.

FIG. 2 shows the phoneme candidates obtained in this way and their likelihoods, aligned with the speech power of the input speech "onsei". That is, it shows the phoneme candidates of each unit obtained by segmenting and symbolizing the word "oNsei" (speech) as letters of the alphabet, with their likelihoods on a scale where 100 is the maximum. As this example shows, the sequence of the highest-likelihood candidates for each unit is "omsai", which indicates that the feature parameters of each unit have changed under the mutual influence of neighboring units. A conventional method that performs recognition using only such symbolization results therefore produces errors caused by the mutual influence between units.

The information on the one or more phoneme candidates and their likelihoods for each unit, obtained by the segmentation/symbolization processing section 2 as shown for example in FIG. 2, is sent to and stored in a storage section 3. The information stored in the storage section 3 is evaluated by a symbol evaluation section 5 under the control of a control section 4.

This evaluation process is described next. As noted above, the phoneme of each separated unit is influenced by the phonemes before and after it.
Therefore, the likelihood of a phoneme candidate alone is not sufficient to judge whether the candidate obtained for a given unit is correct. Let the likelihood of phoneme candidate $\alpha$ obtained for the $i$-th unit, that is, the probability that it is correct, be defined as $P_i(\alpha)$, and let the phoneme candidates obtained for the preceding and following units $(i-1)$ and $(i+1)$ be $\beta$ and $\gamma$, respectively. The mutual influence between $\beta$ and $\alpha$, and between $\alpha$ and $\gamma$, is expressed by compatibility coefficients $r(\beta, \alpha)$ and $r(\alpha, \gamma)$. The compatibility coefficient $r(\beta, \alpha)$ is a measure that takes a large value for a correct phoneme sequence $(\beta, \alpha)$ and a small value for sequences $(\beta, \delta)$ and $(\delta, \alpha)$ involving another phoneme $\delta$. For example, the probability that the phoneme sequence $(\beta, \alpha)$ is correctly recognized may be used as $r(\beta, \alpha)$. This choice exploits the property that the probability of $(\beta, \alpha)$ being correctly symbolized is usually greater than the probability of another sequence such as $(\beta, \delta)$ being erroneously symbolized as $(\beta, \alpha)$. The coefficients can easily be obtained statistically from, for example, the processing results of the segmentation/symbolization processing section 2, such as recognition rates or character occurrence frequencies. The following table shows only representative values of the compatibility coefficients obtained in this way.
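As one possible reading of the statistical estimation described above, the following Python sketch computes a compatibility coefficient for each adjacent phoneme pair as the fraction of its occurrences in reference transcriptions that were also symbolized as that same pair. The function name `estimate_compatibility`, the unit-by-unit alignment of reference and symbolized sequences, and the sample strings are assumptions made for illustration; the source does not specify the estimation procedure.

```python
from collections import Counter


def estimate_compatibility(reference_seqs, recognized_seqs):
    """Estimate r(beta, alpha) for every adjacent pair seen in the references.

    Assumption: each recognized sequence is aligned unit by unit with its
    reference sequence, so position i in one corresponds to position i in
    the other.
    """
    seen = Counter()     # occurrences of each reference pair
    correct = Counter()  # occurrences symbolized as the same pair
    for ref, rec in zip(reference_seqs, recognized_seqs):
        if len(ref) != len(rec):
            raise ValueError("sequences must be aligned unit by unit")
        for i in range(len(ref) - 1):
            pair = (ref[i], ref[i + 1])
            seen[pair] += 1
            if (rec[i], rec[i + 1]) == pair:
                correct[pair] += 1
    return {pair: correct[pair] / seen[pair] for pair in seen}


# Hypothetical training data: one character per phoneme unit.
reference = ["oNsei", "oNsei", "aoi"]
recognized = ["oNsei", "omsai", "aoi"]
r = estimate_compatibility(reference, recognized)
print(r[("o", "N")], r[("a", "o")])
```

Pairs that never occur in the reference data receive no entry here; how such pairs should be scored is left open by the source.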

[Table]

Using these compatibility coefficients, the symbol evaluation section 5 updates the likelihood $P_i(\alpha)$ of phoneme candidate $\alpha$ according to its relationship with the preceding and following phoneme candidates $\beta$ and $\gamma$. Let $P_i^k(\alpha)$ denote the likelihood after $k$ update passes ($k = 1, 2, \ldots$). The update is performed by the following operations:

$$S_i^k(\alpha) = P_i^{k-1}(\alpha) \times \Bigl\{ 1 + \sum_{\beta} P_{i-1}^{k}(\beta)\, r(\beta, \alpha) + \sum_{\gamma} P_{i+1}^{k-1}(\gamma)\, r(\alpha, \gamma) \Bigr\} \tag{1}$$

$$P_i^k(\alpha) = \frac{S_i^k(\alpha)}{\sum_{\delta} S_i^k(\delta)} \tag{2}$$

where $P_0^k(\beta) = P_{I+1}^k(\gamma) = 0$.

In equation (1), processing is performed as a function of the compatibility coefficients in order to remove the mutual influence between the candidates of adjacent units. The resulting $S_i^k(\alpha)$ is then normalized by equation (2) so that the probabilities within the unit sum to 1, and this gives the updated likelihood. This processing is repeated $k$ times over the entire input speech (all units), updating the likelihood of every phoneme candidate.

The superscript $k$ in the second term inside the braces of equation (1) serves to speed up convergence of the likelihoods in the iterative update; $k-1$ may be used instead. Depending on how the compatibility coefficients $r$ are set, convergence may not be guaranteed, so an upper limit on $k$ may be set in advance. For the initial likelihoods at $k = 1$, it is sufficient to use the ratios of the similarity values obtained by the segmentation/symbolization processing section 2.

FIG. 3 is a control flow of this likelihood update process, and the symbol evaluation section 5 operates according to it. Each time a new likelihood is obtained for a phoneme candidate, the likelihood stored in the storage section 3 is updated.

FIG. 4 shows the transitions when the information of FIG. 2 is updated using the compatibility coefficients in the table above. As the figure shows, the updates change the likelihoods of the phoneme candidates and reorder the candidates. This means the influence between units is being gradually removed. In this example the correct phoneme symbols (oNsei) are obtained at $k = 5$, and the candidate ranking does not change thereafter; that is, the process has converged. Accordingly, a recognition result consisting of the correct symbol string is obtained by outputting the phoneme candidate of each unit once the candidates have converged, or once the likelihood of the top-ranked candidate exceeds a predetermined threshold.

In this way, the device removes the mutual influence between the linguistic symbols of continuously uttered speech and recognizes the symbol string simply and correctly, which is of great practical benefit.

The present invention is not limited to the above embodiment. For example, if the phoneme of the unit of interest is assumed to be influenced by several phonemes before and after it, the processing of equations (1) and (2) may be performed with weights according to the distance between the phonemes. That is, with weights $\omega_{ij} = 1/|i-j|$ ($i \neq j$), the calculation may be performed as

$$S_i^k(\alpha) = P_i^{k-1}(\alpha) \times \Bigl\{ 1 + \sum_{j<i} \omega_{ij} \sum_{\beta} P_j^{k}(\beta)\, r(\beta, \alpha) + \sum_{j>i} \omega_{ij} \sum_{\gamma} P_j^{k-1}(\gamma)\, r(\alpha, \gamma) \Bigr\} \tag{3}$$

In this case, good results are obtained by giving $\omega_{ij} = 1$ to adjacent phonemes, $\omega_{ij} = 0.5$ to the next ones out, and so on.

The update described above is linear, but a nonlinear relation, such as one expressed as a product, may be introduced. The compatibility coefficients may of course also be set using the values of other feature parameters. As noted earlier, in the word strings "83" (hachi-san) and "81" (hachi-ichi), the same word "8" has different speech patterns because of the influence of neighboring words (here, the following word). The similarity values for "8" consequently differ, and the risk of misrecognition is high. The embodiment above can be applied directly to such cases to remove the influence of neighboring words.

Furthermore, the linguistic symbol unit may be a syllable, a word, or the like. In short, the present invention can be carried out with various modifications without departing from its gist.
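The iterative update of equations (1) and (2) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: the dictionary representation, the function names `initial_likelihoods`, `update_pass`, and `recognize`, the treatment of unlisted pairs as coefficient 0, the threshold 0.9, the pass limit, and all numeric values are hypothetical. The values of FIG. 2 and of the coefficient table are not available in this text.

```python
def initial_likelihoods(similarities):
    """Turn per-unit similarity scores into ratios that sum to 1 per unit."""
    result = []
    for unit in similarities:
        total = sum(unit.values())
        result.append({sym: score / total for sym, score in unit.items()})
    return result


def update_pass(P, r):
    """One pass of equations (1) and (2) over all units, left to right.

    P is a list of {symbol: likelihood} dicts, one per unit.
    r maps (left_symbol, right_symbol) to a compatibility coefficient;
    pairs missing from r are treated as 0 (an assumption of this sketch).
    Likelihoods are assumed positive so the normalizer is never zero.
    """
    new_P = []
    for i, unit in enumerate(P):
        # Left neighbor: already updated in this pass (superscript k).
        left = new_P[i - 1] if i > 0 else {}
        # Right neighbor: value from the previous pass (superscript k-1).
        right = P[i + 1] if i + 1 < len(P) else {}
        S = {}
        for a, p in unit.items():
            support = sum(q * r.get((b, a), 0.0) for b, q in left.items())
            support += sum(q * r.get((a, g), 0.0) for g, q in right.items())
            S[a] = p * (1.0 + support)                    # equation (1)
        total = sum(S.values())
        new_P.append({a: s / total for a, s in S.items()})  # equation (2)
    return new_P


def recognize(P, r, threshold=0.9, max_passes=20):
    """Repeat update passes until every top candidate exceeds the threshold
    or the pass limit is reached, then return the top candidate per unit."""
    for _ in range(max_passes):
        P = update_pass(P, r)
        if all(max(unit.values()) > threshold for unit in P):
            break
    labels = [max(unit, key=unit.get) for unit in P]
    return labels, P


# Hypothetical two-unit example in which "m" initially outranks "N".
similarities = [{"o": 60, "a": 40}, {"m": 55, "N": 45}]
r = {("o", "N"): 0.9, ("o", "m"): 0.1, ("a", "m"): 0.2, ("a", "N"): 0.2}
labels, final_P = recognize(initial_likelihoods(similarities), r)
print(labels)
```

The pass limit reflects the remark that convergence is not guaranteed for every choice of coefficients. The weighted variant of equation (3) would extend the two support sums over more distant units, each scaled by $\omega_{ij}$.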

[Brief explanation of the drawing]

FIG. 1 is a schematic configuration diagram of a device according to an embodiment of the present invention; FIG. 2 is a diagram showing input speech and its phoneme candidates; FIG. 3 is a diagram showing the flow of the likelihood update process; and FIG. 4 is a diagram showing an example of the update process.

1: acoustic analysis section; 2: segmentation/symbolization processing section; 3: storage section; 4: control section; 5: symbol evaluation section.

Claims

[Claims]

1. A speech recognition device characterized by comprising: means for analyzing input speech to separate the feature parameter sequence of the input speech into units of discrete linguistic symbols and for obtaining, for each separated unit, one or more linguistic symbol candidates and the likelihood of each candidate for that unit; means for updating the likelihood of each linguistic symbol candidate as a function of compatibility coefficients between the linguistic symbol candidates of temporally adjacent units; and means for outputting a linguistic symbol candidate as the recognition result for its unit when its updated likelihood exceeds a predetermined threshold or when the update processing has been performed a predetermined number of times.