JPH0683384A

JPH0683384A - A device for automatic detection and identification of utterance intervals of multiple speakers in speech

Info

Publication number: JPH0683384A
Application number: JP4231157A
Authority: JP
Inventors: Masahide Sugiyama; 雅英杉山
Original assignee: A T R JIDO HONYAKU DENWA KENKYUSHO KK
Current assignee: A T R JIDO HONYAKU DENWA KENKYUSHO KK
Priority date: 1992-08-31
Filing date: 1992-08-31
Publication date: 1994-03-25
Anticipated expiration: 2010-01-11
Also published as: JPH071438B2

Abstract

(57) [Summary] [Object] To provide a device for automatically detecting and identifying the utterance intervals of multiple speakers in speech, capable of detecting and identifying the speech intervals of an arbitrary number of unknown speakers. [Structure] A speech feature extraction unit 2 converts input speech 1 into a time series 3 of feature vectors. A quantization unit 6 converts this time series into a code sequence 7 using a common codebook 5 created by a common codebook creation unit 4. A speech interval start/end detection unit 8 divides the code sequence into speech intervals, and the appearance frequency of each code is calculated for each interval. An appearance probability calculation unit 10 creates a set 11 of appearance probabilities, a cluster analysis unit 12 divides this set into several clusters 13, and the class of each speech interval is determined from the information of the clusters 13.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to a device for automatically detecting and identifying the utterance intervals of multiple speakers in speech, and more particularly to such a device that automatically detects and identifies the utterance intervals of multiple unknown speakers.

[0002]

[Prior Art and Problems to Be Solved by the Invention] The utterance intervals of each speaker must be identified in tasks such as automatically detecting and identifying multiple persons in speech, discriminating multiple languages in speech, identifying non-speech, distinguishing speech from noise, and building acoustic and language models.

[0003] Conventionally, to detect the utterance intervals of multiple speakers, each speaker is first registered using samples of his or her speech, and speaker identification techniques are then used to detect and identify each speaker's utterance intervals. This approach, however, requires prior registration, so it cannot detect and identify the speech intervals of an arbitrary number of unknown speakers.

[0004] Therefore, the main object of the present invention is to provide a device for automatically detecting and identifying the utterance intervals of multiple speakers in speech that can detect and identify the speech intervals of an arbitrary number of unknown speakers.

[0005]

[Means for Solving the Problems] The invention according to claim 1 comprises: speech feature extraction means for extracting a feature pattern from input speech; common codebook creation means for creating common codes; quantization means for quantizing the feature pattern extracted by the speech feature extraction means with the common codes created by the common codebook creation means; appearance probability calculation means for calculating the appearance probabilities of the common codes for each of a plurality of speech intervals; cluster analysis means for cluster-analyzing the calculated appearance probabilities; and identification means for detecting the appearance probabilities belonging to each resulting cluster and identifying the speech intervals corresponding to those appearance probabilities.

[0006] In the invention according to claim 2, the start and end points of the plurality of speech intervals are given in advance. In the invention according to claim 3, the start and end points of the plurality of speech intervals are detected automatically.

[0007] The invention according to claim 4 automatically determines the number of speakers when that number is not given in advance.

[0008] In the invention according to claim 5, the appearance probabilities are calculated for the utterance intervals of multiple speakers in speech that also contains a noise cluster corresponding to speaker-independent noise intervals.

[0009] The invention according to claim 6 comprises: speech feature extraction means for extracting a feature pattern from input speech; common codebook creation means for creating common codes; quantization means for quantizing the extracted feature pattern with the common codes created by the common codebook creation means; updating means for updating, starting from given initial values, the code appearance probabilities, the transition probabilities, and the initial state probabilities of the states of an ergodic hidden Markov model; determination means for determining a condition for stopping the updates; and means for decoding the speech using the resulting ergodic hidden Markov model.
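Claim 6 does not spell out the update rule or the stop condition. A common way to update the code output, transition, and initial-state probabilities of a discrete HMM from initial values is Baum-Welch (EM) re-estimation; the minimal pure-Python sketch below assumes that choice, uses a scaled forward-backward pass, and assumes the stop condition is a log-likelihood gain smaller than `tol`. Function names and parameters are illustrative, not taken from the patent. "Ergodic" here means the transition matrix `A` lets every state reach every other state, so no transitions are structurally forced to zero.

```python
import math
from typing import List, Sequence

Matrix = List[List[float]]


def forward_backward(codes: Sequence[int], pi: Sequence[float], A: Matrix, B: Matrix):
    """Scaled forward-backward pass; returns alpha_hat, beta_hat, scale factors, log P(codes)."""
    n, T = len(pi), len(codes)
    alpha = [[0.0] * n for _ in range(T)]
    scale = [0.0] * T
    for t in range(T):
        for s in range(n):
            prior = pi[s] if t == 0 else sum(alpha[t - 1][r] * A[r][s] for r in range(n))
            alpha[t][s] = prior * B[s][codes[t]]
        scale[t] = sum(alpha[t])  # assumes every observed code has nonzero probability
        alpha[t] = [a / scale[t] for a in alpha[t]]
    beta = [[1.0] * n for _ in range(T)]
    for t in range(T - 2, -1, -1):
        for r in range(n):
            beta[t][r] = sum(A[r][s] * B[s][codes[t + 1]] * beta[t + 1][s]
                             for s in range(n)) / scale[t + 1]
    return alpha, beta, scale, sum(math.log(c) for c in scale)


def reestimate(codes: Sequence[int], A: Matrix, B: Matrix, alpha, beta, scale):
    """One Baum-Welch M-step: new initial-state, transition and code-output probabilities."""
    n, T, m = len(A), len(codes), len(B[0])
    gamma = [[alpha[t][s] * beta[t][s] for s in range(n)] for t in range(T)]
    new_pi = list(gamma[0])
    new_A = []
    for r in range(n):
        den = sum(gamma[t][r] for t in range(T - 1))
        row = []
        for s in range(n):
            # Expected number of r -> s transitions (sum of scaled xi_t(r, s)).
            num = sum(alpha[t][r] * A[r][s] * B[s][codes[t + 1]] * beta[t + 1][s] / scale[t + 1]
                      for t in range(T - 1))
            row.append(num / den if den > 0 else A[r][s])
        new_A.append(row)
    new_B = []
    for s in range(n):
        den = sum(gamma[t][s] for t in range(T))
        new_B.append([sum(gamma[t][s] for t in range(T) if codes[t] == k) / den
                      if den > 0 else B[s][k] for k in range(m)])
    return new_pi, new_A, new_B


def baum_welch(codes, pi, A, B, max_iter: int = 100, tol: float = 1e-6):
    """Repeat EM updates until the log-likelihood gain falls below tol (assumed stop condition)."""
    history: List[float] = []
    for _ in range(max_iter):
        alpha, beta, scale, loglik = forward_backward(codes, pi, A, B)
        history.append(loglik)
        if len(history) > 1 and history[-1] - history[-2] < tol:
            break
        pi, A, B = reestimate(codes, A, B, alpha, beta, scale)
    return pi, A, B, history
```

In this reading each state would stand for one speaker, and the trained parameters are what the decoding means then uses.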

[0010] The invention according to claim 7 comprises: speech feature extraction means for extracting a feature pattern from input speech; updating means for updating, starting from initial values given in advance, the appearance probabilities of the speech features, the branch probabilities, the transition probabilities, and the initial state probabilities of the states of a mixed continuous distribution ergodic hidden Markov model; determination means for determining a condition for stopping the updates; and means for decoding the speech using the resulting mixed continuous distribution ergodic hidden Markov model.
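In the mixed continuous distribution model of claim 7, each state emits feature vectors from a weighted sum of densities, and the "branch probabilities" most likely correspond to the mixture weights. The sketch below assumes diagonal-covariance Gaussian components (the patent does not name the component density) and evaluates one state's log output density with log-sum-exp for numerical stability; the function names are hypothetical.

```python
import math
from typing import Sequence


def log_gauss_diag(x: Sequence[float], mean: Sequence[float], var: Sequence[float]) -> float:
    """log N(x; mean, diag(var)) for a diagonal-covariance Gaussian."""
    return -0.5 * sum(math.log(2.0 * math.pi * v) + (xi - mu) ** 2 / v
                      for xi, mu, v in zip(x, mean, var))


def state_log_density(x: Sequence[float], weights: Sequence[float],
                      means: Sequence[Sequence[float]],
                      variances: Sequence[Sequence[float]]) -> float:
    """log b_s(x) = log sum_m c_m N(x; mu_m, diag(var_m)), with c_m the branch (mixture) weights."""
    terms = [math.log(c) + log_gauss_diag(x, mu, v)
             for c, mu, v in zip(weights, means, variances) if c > 0]
    top = max(terms)  # log-sum-exp: factor out the largest term to avoid underflow
    return top + math.log(sum(math.exp(t - top) for t in terms))
```

During re-estimation, the branch weights of each state would be updated alongside the transition and initial-state probabilities.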

[0011]

[Operation] The device according to the present invention extracts a feature pattern from input speech, quantizes the extracted feature pattern with common codes, and calculates the probability with which each common code appears in each of a plurality of speech intervals. Cluster analysis of this set of appearance probabilities divides it into several clusters. If the number of speakers is given in advance, the cluster analysis continues splitting until that number is reached; if the number of speakers is unknown, splitting stops when an evaluation criterion obtained from the cluster analysis satisfies a certain condition. The appearance probabilities belonging to each resulting cluster are judged to belong to the same speaker, and the speech intervals that produced those appearance probabilities are attributed to that speaker. If the plurality of speech intervals is not available in advance, the speech can be segmented automatically with an automatic speech interval detection method. Alternatively, an ergodic hidden Markov model can be used to segment the speech into intervals and identify the speaker cluster of each interval at the same time.

[0012]

[Embodiments] FIG. 1 is a block diagram of an embodiment of the present invention. Referring to FIG. 1, input speech 1 is supplied to a speech feature extraction unit 2 and converted into a time series 3 of feature vectors. A common codebook creation unit 4 creates a common codebook 5 in advance, either from the same speech or independently of it, and supplies it to a quantization unit 6. The quantization unit 6 converts the time series of feature vectors supplied from the speech feature extraction unit 2 into a code sequence 7. The code sequence 7 is supplied to a speech interval start/end detection unit 8, which detects the start and end points of speech intervals and divides the sequence into a plurality of speech intervals. The resulting set 9 of speech intervals is supplied to an appearance probability calculation unit 10, which counts the appearance frequency of each code in each speech interval, creates a set 11 of appearance probabilities, and supplies it to a cluster analysis unit 12.
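The front end of FIG. 1 (quantization unit 6 and appearance probability calculation unit 10) can be sketched as follows. The sketch assumes squared Euclidean distance for the nearest-codeword search and intervals given as half-open frame-index ranges `[start, end)`; codebook training (unit 4) is out of scope, and `quantize` and `appearance_probabilities` are hypothetical names.

```python
from typing import List, Sequence, Tuple


def quantize(features: Sequence[Sequence[float]],
             codebook: Sequence[Sequence[float]]) -> List[int]:
    """Replace each feature vector by the index of its nearest codeword (squared Euclidean)."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(range(len(codebook)), key=lambda k: sqdist(x, codebook[k])) for x in features]


def appearance_probabilities(codes: Sequence[int],
                             intervals: Sequence[Tuple[int, int]],
                             codebook_size: int) -> List[List[float]]:
    """Relative frequency of every code inside each non-empty [start, end) interval."""
    probs = []
    for start, end in intervals:
        counts = [0] * codebook_size
        for c in codes[start:end]:
            counts[c] += 1
        total = sum(counts)
        probs.append([n / total for n in counts])
    return probs


codes = quantize([[0.1], [0.9], [1.1], [4.8], [5.2], [0.0]], [[0.0], [1.0], [5.0]])
print(codes)  # [0, 1, 1, 2, 2, 0]
print(appearance_probabilities(codes, [(0, 3), (3, 6)], 3))
```

Each row of the result is one interval's probability vector, that is, one element of the set 11 handed to the cluster analysis.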

[0013] The cluster analysis unit 12 divides the set of appearance probabilities into several clusters. If the number of clusters is specified in advance, that number can be used; if it is not specified, the number is set according to an evaluation criterion. For example, a vector quantization method can be used for the cluster analysis, with a threshold on the quantization distortion. The clusters 13 produced by the cluster analysis unit 12 are supplied to a speech interval class discrimination unit 14. Based on the cluster information, the appearance probabilities belonging to each cluster are regarded as having been uttered by the same speaker, and the speech intervals corresponding to those appearance probabilities are detected and identified as utterances of that speaker.
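Paragraph [0013] says only that a vector quantization method with a distortion threshold may be used. One way to read this is codebook-style splitting: start with one centroid, repeatedly split the cluster with the largest distortion, and stop when the requested number of clusters is reached or when the mean distortion falls below a threshold. The sketch below assumes squared Euclidean distance between probability vectors and seeds each split with the member farthest from its centroid; the patent fixes neither detail, and the function names are hypothetical.

```python
from typing import List, Optional, Sequence

Vec = List[float]


def sqdist(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(vectors: Sequence[Sequence[float]]) -> Vec:
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def assign(points, centroids) -> List[int]:
    """Index of the nearest centroid for every point."""
    return [min(range(len(centroids)), key=lambda k: sqdist(p, centroids[k])) for p in points]


def refine(points, centroids, iters: int = 20):
    """Lloyd (k-means) iterations starting from the given centroids; empty clusters keep theirs."""
    for _ in range(iters):
        labels = assign(points, centroids)
        new = [mean([p for p, lab in zip(points, labels) if lab == k]) if k in labels else c
               for k, c in enumerate(centroids)]
        if new == centroids:
            break
        centroids = new
    return centroids, assign(points, centroids)


def split_clustering(points, num_clusters: Optional[int] = None,
                     threshold: Optional[float] = None):
    """Split the worst cluster until num_clusters is reached or mean distortion <= threshold."""
    if num_clusters is None and threshold is None:
        raise ValueError("specify num_clusters or threshold")
    centroids: List[Vec] = [mean(points)]
    labels = [0] * len(points)
    while True:
        per_cluster = [sum(sqdist(p, centroids[k]) for p, lab in zip(points, labels) if lab == k)
                       for k in range(len(centroids))]
        if num_clusters is not None:
            if len(centroids) >= num_clusters:
                break  # number of speakers given in advance
        elif sum(per_cluster) / len(points) <= threshold:
            break  # distortion criterion satisfied
        worst = max(range(len(centroids)), key=per_cluster.__getitem__)
        if per_cluster[worst] == 0.0:
            break  # every cluster is already tight; nothing left to split
        members = [p for p, lab in zip(points, labels) if lab == worst]
        farthest = max(members, key=lambda p: sqdist(p, centroids[worst]))
        centroids, labels = refine(points, centroids + [list(farthest)])
    return centroids, labels
```

Intervals sharing a label would then be treated as one speaker, which is the role of the class discrimination unit 14.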

[0014] FIG. 2 is a block diagram of another embodiment of the present invention. The embodiment shown in FIG. 2 is the same as that of FIG. 1 except for the following point: a speech interval start/end detection unit 15 detects the start and end points of intervals corresponding to a designated speech category other than speakers (for example, a plurality of language categories such as Japanese and English), divides the speech into a plurality of speech intervals, and creates a set 9 of speech intervals. The appearance probability calculation unit 10 then calculates the appearance probabilities 11 in the same manner as in the embodiment of FIG. 1.

[0015] FIG. 3 is a block diagram of still another embodiment of the present invention. In FIG. 3, the speech feature extraction unit 2, the common codebook creation unit 4, and the quantization unit 6 are the same as in the embodiments of FIGS. 1 and 2. The code sequence 7 produced by the quantization unit 6 is supplied to a discrete ergodic HMM (hidden Markov model) calculation unit 16, which estimates parameters 17. The parameters 17 are supplied to a speech backtrace unit 18, which uses the ergodic HMM again, with the estimated parameters, to compute the optimal correspondence between the code sequence of the speech and the states, producing backtrace information 19. The backtrace information 19 is supplied to a speech interval state correspondence unit 20, which uses it to detect and identify the speech intervals belonging to each state as having been uttered by the same speaker.
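The backtrace step of FIG. 3 amounts to Viterbi decoding with the estimated ergodic HMM: find the most likely state sequence for the code sequence 7, then read off contiguous runs of the same state as intervals of one speaker. A minimal log-domain sketch, assuming `pi`, `A`, and `B` hold the estimated parameters 17 (initial-state, transition, and code-output probabilities); the function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def _log(x: float) -> float:
    return math.log(x) if x > 0 else float("-inf")


def viterbi(codes: Sequence[int], pi: Sequence[float],
            A: Sequence[Sequence[float]], B: Sequence[Sequence[float]]) -> List[int]:
    """Most likely state path for the code sequence (log domain, with back-pointers)."""
    n = len(pi)
    delta = [_log(pi[s]) + _log(B[s][codes[0]]) for s in range(n)]
    back: List[List[int]] = []
    for o in codes[1:]:
        prev, ptr, delta = delta, [], []
        for s in range(n):
            best = max(range(n), key=lambda r: prev[r] + _log(A[r][s]))
            ptr.append(best)
            delta.append(prev[best] + _log(A[best][s]) + _log(B[s][o]))
        back.append(ptr)
    state = max(range(n), key=lambda s: delta[s])
    path = [state]
    for ptr in reversed(back):  # backtrace from the last frame to the first
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path


def segments(path: Sequence[int]) -> List[Tuple[int, int, int]]:
    """Collapse a state path into (start, end_exclusive, state) runs."""
    out, start = [], 0
    for t in range(1, len(path) + 1):
        if t == len(path) or path[t] != path[start]:
            out.append((start, t, path[start]))
            start = t
    return out


A = B = [[0.9, 0.1], [0.1, 0.9]]
path = viterbi([0, 0, 0, 1, 1, 1], [0.5, 0.5], A, B)
print(segments(path))  # [(0, 3, 0), (3, 6, 1)]
```

Each run plays the role of one entry in the state correspondence of unit 20: all runs labelled with the same state are attributed to the same speaker.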

[0016] FIG. 4 is a block diagram of yet another embodiment of the present invention. The embodiment shown in FIG. 4 uses a mixed continuous distribution ergodic HMM calculation unit 23. The speech feature extraction unit 2 converts input speech 1 into a time series 3 of feature vectors, which is input to the mixed continuous distribution ergodic HMM calculation unit 23, and the parameters 24 of the model are estimated. Based on the estimated parameters, the speech backtrace unit 18 uses the ergodic HMM again to compute the optimal correspondence between the feature vector sequence and the states, producing backtrace information 19. The backtrace information 19 is supplied to the speech interval state correspondence unit 20, which uses it to detect and identify the speech intervals belonging to each state as having been uttered by the same speaker. If the number of clusters is specified in advance, the number of states can be set to that number. If it is not specified, the number can be set according to an evaluation criterion; one way to do so is to apply a threshold to the likelihood of the HMM.
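Paragraph [0016] suggests choosing the number of states with a threshold on the HMM likelihood. Because the training likelihood tends to rise as states are added, one plausible reading, assumed here, is to keep adding states until the log-likelihood gain from one more state drops below a threshold. The sketch takes a caller-supplied `train(n)` function (hypothetical) that trains an n-state model and returns its log-likelihood.

```python
from typing import Callable


def choose_num_states(train: Callable[[int], float], max_states: int, min_gain: float) -> int:
    """Grow N from 1 and stop once an extra state improves the log-likelihood by < min_gain."""
    best_n, best_ll = 1, train(1)
    for n in range(2, max_states + 1):
        ll = train(n)
        if ll - best_ll < min_gain:
            break  # the extra state is not worth it under the assumed criterion
        best_n, best_ll = n, ll
    return best_n


# Toy log-likelihoods standing in for real training runs.
toy = {1: -100.0, 2: -60.0, 3: -58.0, 4: -57.5}
print(choose_num_states(lambda n: toy[n], max_states=4, min_gain=5.0))  # 2
```

The selected state count then plays the role of the number of speakers.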

[0017] FIG. 5 is a block diagram of a further embodiment of the present invention. In the embodiment shown in FIG. 5 as well, the speech feature extraction unit 2, the common codebook creation unit 4, and the quantization unit 6 are the same as in the embodiments of FIGS. 1 to 3. The code sequence 7 produced by the quantization unit 6 is supplied to a speech and noise interval start/end detection unit 21, which detects the start and end points of speech intervals and noise intervals, divides the sequence into a plurality of speech intervals and noise intervals, and creates a set 22 of speech and noise intervals. The appearance probability calculation unit 10 calculates the appearance probabilities by counting the appearance frequency of each code in each interval of the set 22, and supplies the resulting set 11 of appearance probabilities to the cluster analysis unit 12.
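The patent does not say how unit 21 separates speech intervals from noise intervals (it operates on the code sequence 7). As a crude stand-in, the sketch below labels fixed-length frames of raw samples as speech or noise by comparing their energy with a threshold in dB, then groups consecutive frames with the same label into intervals. The frame length, the threshold, and the function names are assumptions for illustration only.

```python
import math
from typing import List, Sequence, Tuple


def frame_energies_db(samples: Sequence[float], frame_len: int) -> List[float]:
    """Mean-square energy of each non-overlapping frame in dB (a small floor avoids log(0))."""
    return [10.0 * math.log10(sum(x * x for x in samples[i:i + frame_len]) / frame_len + 1e-12)
            for i in range(0, len(samples) - frame_len + 1, frame_len)]


def label_intervals(energies_db: Sequence[float],
                    threshold_db: float) -> List[Tuple[int, int, str]]:
    """Group consecutive frames into (start, end_exclusive, label) runs of 'speech' or 'noise'."""
    runs, start = [], 0
    for t in range(1, len(energies_db) + 1):
        if t == len(energies_db) or \
                (energies_db[t] >= threshold_db) != (energies_db[start] >= threshold_db):
            label = "speech" if energies_db[start] >= threshold_db else "noise"
            runs.append((start, t, label))
            start = t
    return runs


e = frame_energies_db([0.0] * 4 + [1.0] * 4 + [0.0] * 4, frame_len=2)
print(label_intervals(e, threshold_db=-60.0))
# [(0, 2, 'noise'), (2, 4, 'speech'), (4, 6, 'noise')]
```

The resulting runs could serve as the set 22 of intervals whose code frequencies are then counted by unit 10.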

[0018] The cluster analysis unit 12 divides the set 11 of appearance probabilities into several clusters. If the number of clusters is specified in advance, that number can be used; otherwise, the number can be set according to an evaluation criterion. As in the embodiment of FIG. 1, when a vector quantization method is used for the cluster analysis, a threshold on the quantization distortion can be applied. Based on the information of the clusters 13, the speech interval class discrimination unit 14 regards the appearance probabilities belonging to each cluster 13 as originating from the same speaker category or the same noise category, and detects and identifies the corresponding speech and noise intervals as having been generated by the same category.

[0019]

[Effects of the Invention] As described above, according to the present invention, a feature pattern is extracted from input speech, the feature pattern is quantized with common codes, the appearance probabilities of the common codes are calculated for each speech interval, the calculated appearance probabilities are cluster-analyzed, and the appearance probabilities belonging to each cluster are detected to identify the corresponding speech intervals. As a result, the speech intervals of an arbitrary number of unknown speakers can be detected and identified without prior registration.

[Brief Description of the Drawings]

FIG. 1 is a block diagram of an embodiment of the present invention.

FIG. 2 is a block diagram of another embodiment of the present invention.

FIG. 3 is a block diagram of still another embodiment of the present invention.

FIG. 4 is a block diagram of yet another embodiment of the present invention.

FIG. 5 is a block diagram of a further embodiment of the present invention.

[Explanation of Symbols]

1 Input speech
2 Speech feature extraction unit
3 Feature sequence
4 Common codebook creation unit
5 Common codebook
6 Quantization unit
7 Code sequence
8, 15 Speech interval start/end detection unit
9 Set of speech intervals
10 Appearance probability calculation unit
11 Set of appearance probabilities
12 Cluster analysis unit
13 Cluster
14 Speech interval class discrimination unit
16 Discrete ergodic HMM calculation unit
17 Ergodic HMM parameters
18 Speech backtrace unit
19 Backtrace information
20 Speech interval state correspondence unit
21 Speech and noise interval start/end detection unit
23 Mixed continuous distribution ergodic HMM calculation unit

Claims

[Claims]

1. A device for automatically detecting and identifying the utterance intervals of multiple speakers in speech, comprising: speech feature extraction means for extracting a feature pattern from input speech; common codebook creation means for creating common codes; quantization means for quantizing the feature pattern extracted by the speech feature extraction means with the common codes created by the common codebook creation means; appearance probability calculation means for calculating the appearance probabilities of the common codes for each of a plurality of speech intervals; cluster analysis means for cluster-analyzing the appearance probabilities calculated by the appearance probability calculation means; and identification means for detecting the appearance probabilities belonging to each cluster produced by the cluster analysis means and identifying the speech intervals corresponding to those appearance probabilities.

2. The device for automatically detecting and identifying the utterance intervals of multiple speakers in speech according to claim 1, wherein the start and end points of the plurality of speech intervals are predetermined.

3. The device for automatically detecting and identifying the utterance intervals of multiple speakers in speech according to claim 1, wherein the start and end points of the plurality of speech intervals are detected automatically.

4. A device for automatically detecting and identifying the utterance intervals of multiple speakers in speech, wherein the cluster analysis means automatically determines the number of speakers when the number of speakers is not given in advance.

5. A device for automatically detecting and identifying the utterance intervals of multiple speakers in speech, wherein the appearance probability calculation means calculates the appearance probabilities of the utterance intervals of multiple speakers in speech that contains a noise cluster corresponding to speaker-independent noise intervals.

6. A device for automatically detecting and identifying the utterance intervals of multiple speakers in speech, comprising: speech feature extraction means for extracting a feature pattern from input speech; common codebook creation means for creating common codes; quantization means for quantizing the feature pattern extracted by the speech feature extraction means with the common codes created by the common codebook creation means; updating means for updating, starting from given initial values, the code appearance probabilities, transition probabilities, and initial state probabilities of the states of an ergodic hidden Markov model; determination means for determining a condition for stopping the updates performed by the updating means; and means for decoding the speech using the resulting ergodic hidden Markov model.

7. A device for automatically detecting and identifying the utterance intervals of multiple speakers in speech, comprising: speech feature extraction means for extracting a feature pattern from input speech; updating means for updating, starting from initial values given in advance, the appearance probabilities of the speech features, the branch probabilities, the transition probabilities, and the initial state probabilities of the states of a mixed continuous distribution ergodic hidden Markov model; determination means for determining a condition for stopping the updates performed by the updating means; and means for decoding the speech using the resulting mixed continuous distribution ergodic hidden Markov model.