WO2024201995A1 - Speaker labeling device, speaker labeling method, and program - Google Patents

Speaker labeling device, speaker labeling method, and program

Info

Publication number
WO2024201995A1
WO2024201995A1 PCT/JP2023/013520 JP2023013520W WO2024201995A1 WO 2024201995 A1 WO2024201995 A1 WO 2024201995A1 JP 2023013520 W JP2023013520 W JP 2023013520W WO 2024201995 A1 WO2024201995 A1 WO 2024201995A1
Authority
WO
WIPO (PCT)
Prior art keywords
speaker
speakers
segment
sequence
markov model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/JP2023/013520
Other languages
English (en)
Japanese (ja)
Inventor
マーク デルクロア
章子 荒木
智広 中谷
厚徳 小川
直弘 俵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
NTT Inc
Original Assignee
Nippon Telegraph and Telephone Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nippon Telegraph and Telephone Corp
Priority to PCT/JP2023/013520 (WO2024201995A1)
Priority to JP2025509583A (JPWO2024201995A1)
Publication of WO2024201995A1
Legal status: Ceased

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/04 Training, enrolment or model building
    • G10L17/16 Hidden Markov models [HMM]

Definitions

  • The disclosed technology relates to a technique for estimating who is speaking and when, that is, which parts of a recording are spoken by which speaker. This technology is called "speaker diarization (SD)."
  • One way to perform speaker diarization is to split the input speech signal into short segments, extract a speaker embedding vector from each segment, and cluster these speaker embedding vectors to identify speakers and assign a speaker label to each segment. This approach is called speaker embedding vector clustering (VC). A speaker label is an identifier that uniquely identifies a speaker.
  • VBx in Non-Patent Document 1 is a state-of-the-art clustering algorithm for VC.
  • VBx is a clustering algorithm based on variational Bayesian estimation, which uses a hidden Markov model (HMM) to model the time evolution of speakers in a conversation (i.e., the time evolution of speaker embedding vectors).
  • FIG. 1 shows the topology of an HMM for VBx with three speakers (A, B, C).
  • Reference numerals 201, 202, and 203 represent the states in which the speaker is A, B, and C, respectively, and 204 is a non-output node.
  • P_loop is the transition probability to the same speaker (state) and is an adjustable parameter.
  • (1 - P_loop) is the probability of changing speakers, which corresponds to the transition probability to the non-output node. From the non-output node, there is an immediate transition to one of the states A, B, or C with probabilities π_A, π_B, or π_C. Therefore, the transition probability p(s | s') from state s' to state s is given by equation (1): p(s | s') = (1 - P_loop) π_s + δ_{ss'} P_loop.
  • δ_{ss'} is the Kronecker delta.
  • π_s is estimated from the input X along with Y and Z described below.
  • the HMM of VBx can be written as equation (2).
  • T is the total number of segments to be analyzed
  • S is the number of HMM states.
  • The states of the HMM are associated with speakers, and the state output probability p(x_t | s) is given by equation (3): p(x_t | s) = N(x_t; V y_s, I).
  • I is an identity matrix.
  • N(x_t; V y_s, I) denotes a Gaussian function of the vector variable x_t with mean vector V y_s and covariance matrix I.
  • VBx trains a probabilistic linear discriminant analysis (PLDA) model on speaker embedding vectors extracted from training data, derives V y_s from the trained PLDA model, and uses it as the parameter of the Gaussian function.
  • The transition probability p(z_t | z_{t-1}) is as shown in equation (1).
  • The probability distribution of y_s is the standard normal distribution (equation (4)): p(y_s) = N(y_s; 0, I).
  • the problem with the VC approach is that it assumes that there is only one active speaker in a segment (no overlapping voices), which is not the case in normal conversations.
  • EEND: end-to-end neural diarization
  • EEND-VC: EEND combined with VC
  • A speech signal 10 produced by three speakers, A, B, and C, is input to EEND-VC.
  • The number of speaker streams is estimated from S_t using a neural network.
  • A neural network is then used to calculate, for each stream c, the speaker activity a_{c,t} (when an utterance occurred within the segment) and the speaker vector x_{c,t} (indicating the characteristics of the speaker).
  • The speaker vectors x_{c,t} are then clustered into groups.
  • EEND-VC extends traditional clustering approaches, such as K-means and AHC, with constraints to handle multi-stream speaker embedding vectors.
  • The speaker labeling device uses a hidden Markov model in which a speaker embedding vector sequence is the observed information and a speaker label sequence is the hidden state, the elements of the hidden state being combinations of speakers in a speech signal. It includes a hidden Markov model initialization unit that initializes this hidden Markov model so that the state output probability is expressed as a product of Gaussian functions, one for each speaker.
  • the disclosed technique allows us to use the power of VBx clustering to cluster multi-stream speaker embedding vectors such as those obtained in the EEND-VC system, allowing overlapping speakers to be treated consistently, as in VBx.
  • cAHC: constrained agglomerative hierarchical clustering
  • FIG. 1 is a diagram illustrating the HMM topology of prior art VBx.
  • FIG. 2 is a diagram for explaining the conventional technology EEND-VC.
  • FIG. 3 is a diagram for explaining the HMM topology of the disclosed technology MS-VBx.
  • FIG. 4 is a functional block diagram of a speaker labeling system according to a first embodiment.
  • FIG. 5 is a flowchart illustrating the operation of the speaker labeling system according to the first embodiment.
  • 11A to 11C are diagrams for explaining experimental results regarding the performance of the disclosed technology.
  • FIG. 2 is a diagram showing an example of the functional configuration of a computer.
  • MS-VBx extends the VBx approach to handle overlapping speech: we assume that in each segment there can be any number of active speakers.
  • Figure 3 shows an example of the HMM topology of MS-VBx. It is assumed that there are three speakers in the entire recording, and that up to two speakers appear simultaneously in each segment.
  • denotes an inactive speaker.
  • The states "BA", "CA", and "CB" are omitted, because each state is a combination of speakers and these duplicate "AB", "AC", and "BC".
  • P loop is the transition probability to the same state (loop probability) and is an adjustable parameter.
  • the state output probabilities of an HMM are expressed as a product of multiple Gaussian functions, with each Gaussian function associated with a different speaker. Parameters associated with the same speaker are integrated across multiple states of the HMM. Subsequent speaker label estimation is performed similarly to conventional VBx, except that the parameters are concatenated and there are multiple input speaker embedding vectors for each speech segment. The above is an overview of MS-VBx.
  • FIG. 4 is a functional block diagram showing an example of the configuration of a speaker labeling system 40 according to the first embodiment.
  • the speaker labeling system 40 receives an audio signal 41 (a recording having sections in which the speech of speakers A, B, and C overlap) as input, and outputs one or more speaker labels 42 estimated for each segment.
  • the speaker labeling system 40 includes a multi-stream speaker embedding estimator (MSEE) 401 and a speaker labeling device 402.
  • the speaker labeling device 402 includes an HMM initialization unit 403 , a speaker estimation unit 404 , and a speaker model storage unit 405 .
  • the speaker model storage unit stores pre-trained speaker models.
  • a pre-trained speaker model is used to obtain prior knowledge on the parameters of the state output probabilities of the HMM, which can be obtained by training a PLDA model on the speaker embedding vectors obtained by a multi-stream speaker embedding vector estimator on training data covering a large number of speakers.
  • a PLDA model can be trained using standard algorithms, for example by inputting speaker embedding vectors from multiple speakers and outputting PLDA parameters.
  • FIG. 5 is a flow chart illustrating an example of the operation of the speaker labeling system.
  • The speaker labeling system will be described in detail with reference to FIGS. 4 and 5.
  • The MSEE 401 receives an audio signal (step S501).
  • The audio signal is, for example, a recording of a meeting or conversation lasting from a few seconds to a few minutes, and includes sections in which multiple speakers speak simultaneously.
  • The MSEE 401 divides the audio signal 41 into segments, each, for example, one second to several seconds long (step S502).
  • The MSEE 401 calculates, for each segment, a speaker embedding vector for each active speaker in the segment (step S503).
  • The set x_t of C' speaker vectors obtained in segment t is expressed by equation (6).
  • the number of speaker embedding vectors obtained from a segment corresponds to the estimated number of active speakers in the segment, C', and therefore the number of speaker embedding vectors obtained may vary from segment to segment and may be less than the total number of speakers in the input speech signal.
  • the speaker labeling device 402 receives the sequence of speaker embedding vectors X (equation (7)) output by the MSEE 401.
  • T is the total number of segments analyzed.
  • the HMM initialization unit 403 of the speaker labeling device 402 estimates the total number C of speakers in the input speech signal (step S504).
  • The HMM initialization unit 403 initializes the HMM parameters (step S505), which means creating an HMM structure as shown in FIG. 3. Using the estimate C of the total number of speakers calculated in step S504, all combinations of speakers are generated. A priori knowledge can be used to eliminate unlikely HMM states and thereby reduce the number of states. Because MS-VBx can reduce the estimated number of speakers during inference but cannot increase it, it is desirable to overestimate the total number of speakers C. Another clustering technique, such as cAHC, can be used to obtain the estimate of the total number of speakers and an initial value of the state occupancy, which indicates which states are most likely for each segment.
  • Let X be the speaker embedding vector sequence (observed information) output by the MSEE 401, and let Z = {z_1, z_2, ..., z_T} be the sequence of speaker label assignments for the segments (hidden states).
  • The HMM of MS-VBx can be written as equation (8).
  • Y is the set of speaker-related latent variables, and S ≥ C is the number of states in the HMM.
  • The transition probability p(z_t | z_{t-1}) is given by equation (5), and π_s is found by variational Bayesian estimation together with Y and Z.
  • The distribution of y_{s,c} is the standard normal distribution (equation (10)).
  • the state output probability is expressed as a product of multiple Gaussian functions, and each Gaussian function is associated with a different speaker (Equation (11)).
  • the output probability of state s of the HMM represents the probability that a speaker is active in a segment.
  • corresponds to the inter-speaker covariance matrix in the space transformed into the PLDA model.
  • the speaker estimation unit 404 estimates the HMM parameters and the most likely HMM state sequence based on a given sequence of multi-stream speaker embedding vectors. To this end, it performs a recursive optimization as described below.
  • the output consists of the most likely HMM state for each segment. From there, a speaker label can be assigned to each of the multi-stream speaker embedding vectors.
  • The posterior distribution given X is found by variational Bayesian (VB) estimation. Specifically, an approximate probability distribution is used, and the following procedure is carried out to obtain it:
  • When the ELBO (evidence lower bound) satisfies a predetermined convergence condition, the iterations of steps (2-1) to (2-7) are terminated, and the most likely HMM state s for each segment t is output based on γ_{t,s} (that is, segment t is labeled) (step S513).
  • The MSEE 401 can be any system that takes a speech signal as input and outputs a sequence of multi-stream speaker embedding vectors.
  • speaker activity can also be generated, but speaker activity is not necessarily required for MS-VBx.
  • the HMM of the first embodiment also assumes a state in which the number of active speakers is less than the number of channels C.
  • the number of speakers may also be soft-determined.
  • the program describing this processing can be recorded on a computer-readable recording medium.
  • Examples of computer-readable recording media include magnetic recording devices, optical disks, magneto-optical recording media, and semiconductor memories.
  • the program may be distributed, for example, by selling, transferring, or lending portable recording media such as DVDs or CD-ROMs on which the program is recorded. Furthermore, the program may be distributed by storing the program in a storage device of a server computer and transferring the program from the server computer to other computers via a network.
  • A computer that executes such a program first stores, for example, the program recorded on a portable recording medium or the program transferred from a server computer in its own storage device. Then, when executing a process, the computer reads the program stored in its own storage device and executes the process according to the read program. As another form of execution, the computer may read the program directly from the portable recording medium and execute the process according to it, or may execute the process according to the received program each time a program is transferred from the server computer to the computer.
  • the above-mentioned process may also be executed by a so-called ASP (Application Service Provider) type service that does not transfer the program from the server computer to the computer, but realizes the processing function only by issuing an execution instruction and obtaining the results.
  • the program in this form includes information used for processing by an electronic computer that is equivalent to a program (such as data that is not a direct command to the computer but has properties that specify the processing of the computer).
  • the device is configured by executing a specific program on a computer, but at least a portion of the processing may be realized by hardware.
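
As an illustration of the HMM structure described above (states formed from combinations of speakers, with the "BA"-style duplicates omitted, and a transition probability built from P_loop and the per-state probabilities π_s), the following is a minimal Python sketch. The function names `enumerate_states` and `transition_matrix`, the `max_active` parameter, and the uniform initial value of π are assumptions made for this example and are not taken from the disclosure.

```python
from itertools import combinations


def enumerate_states(speakers, max_active):
    """Return every non-empty combination of at most max_active speakers.

    Combinations are unordered, so ("A", "B") appears once and
    ("B", "A") is never generated.
    """
    states = []
    for k in range(1, max_active + 1):
        states.extend(combinations(speakers, k))
    return states


def transition_matrix(pi, p_loop):
    """Build A[prev][nxt] = (1 - p_loop) * pi[nxt] + p_loop * [prev == nxt].

    With probability p_loop the HMM stays in the same state; otherwise it
    passes through the non-output node and re-enters state nxt with
    probability pi[nxt].
    """
    n = len(pi)
    return [
        [(1.0 - p_loop) * pi[nxt] + (p_loop if prev == nxt else 0.0)
         for nxt in range(n)]
        for prev in range(n)
    ]


# Three speakers in the recording, at most two active per segment.
states = enumerate_states(["A", "B", "C"], max_active=2)

# Uniform pi is only a placeholder (assumption); in the described method
# pi_s is estimated together with Y and Z.
pi = [1.0 / len(states)] * len(states)
A = transition_matrix(pi, p_loop=0.9)

print(states)
print([round(sum(row), 6) for row in A])  # each row sums to 1
```

Each row of the matrix sums to one because the π_s values sum to one, which is a quick sanity check when experimenting with different values of P_loop.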

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Stereophonic System (AREA)

Abstract

An extension of VBx is proposed to enable processing of multi-stream speaker embedding vectors. To this end, a speaker labeling device according to the present disclosure uses a hidden Markov model in which a sequence of speaker embedding vectors serves as observed information and a sequence of speaker labels serves as the hidden state, whose elements consist of combinations of speakers in an audio signal. The speaker labeling device includes a hidden Markov model initialization unit that initializes a hidden Markov model in which the state output probability is expressed as a product of Gaussian functions for each speaker.
PCT/JP2023/013520 2023-03-31 2023-03-31 Speaker labeling device, speaker labeling method, and program Ceased WO2024201995A1 (fr)
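
The state output probability summarized in the abstract, a product of Gaussian functions with one factor per speaker in the state, can be sketched in the log domain as a sum of per-speaker log-densities. The sketch below is illustrative only. It assumes identity covariance (as in the Gaussian N(x_t; V y_s, I) of the description), assumes that the streams of a segment are already aligned with the speaker order of the state, and assigns zero probability when the number of streams differs from the number of speakers in the state. The names `log_gauss_identity`, `state_log_output_prob`, and `speaker_means` are hypothetical, and the mean vectors are made-up stand-ins for V y.

```python
import math


def log_gauss_identity(x, mean):
    """Log-density of N(x; mean, I) for equal-length lists x and mean."""
    d = len(x)
    sq_dist = sum((xi - mi) ** 2 for xi, mi in zip(x, mean))
    return -0.5 * (d * math.log(2.0 * math.pi) + sq_dist)


def state_log_output_prob(x_t, state, speaker_means):
    """Log of the product of per-speaker Gaussians for one segment.

    x_t   : list of embedding vectors, one per stream in the segment
    state : tuple of speaker names active in this HMM state
    Stream c is scored against the c-th speaker of the state (assumption).
    """
    if len(x_t) != len(state):
        return float("-inf")  # simplification made for this sketch
    return sum(
        log_gauss_identity(x, speaker_means[spk])
        for x, spk in zip(x_t, state)
    )


# Toy two-dimensional means standing in for V y of each speaker.
speaker_means = {"A": [0.0, 0.0], "B": [3.0, 0.0], "C": [0.0, 3.0]}
states = [("A",), ("B",), ("C",), ("A", "B"), ("A", "C"), ("B", "C")]

# One segment with two streams: the first lies near A, the second near B.
x_t = [[0.1, -0.1], [2.9, 0.2]]
best = max(states, key=lambda s: state_log_output_prob(x_t, s, speaker_means))
print(best)
```

Working in the log domain keeps the product of several Gaussians numerically stable when the embedding dimension is large.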

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/JP2023/013520 WO2024201995A1 (fr) 2023-03-31 2023-03-31 Speaker labeling device, speaker labeling method, and program
JP2025509583A JPWO2024201995A1 (fr) 2023-03-31 2023-03-31

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2023/013520 WO2024201995A1 (fr) 2023-03-31 2023-03-31 Speaker labeling device, speaker labeling method, and program

Publications (1)

Publication Number Publication Date
WO2024201995A1 (fr) 2024-10-03

Family

ID=92904513

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2023/013520 Ceased WO2024201995A1 (fr) Speaker labeling device, speaker labeling method, and program 2023-03-31 2023-03-31

Country Status (2)

Country Link
JP (1) JPWO2024201995A1 (fr)
WO (1) WO2024201995A1 (fr)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2011007497A1 (fr) * 2009-07-16 2011-01-20 NEC Corporation Speech data analysis device, speech data analysis method, and speech data analysis program
JP2014502375A (ja) * 2010-12-10 2014-01-30 Panasonic Corporation Device and method for pass-phrase modeling for speaker verification, and speaker verification system
US20160225374A1 (en) * 2012-09-28 2016-08-04 Agnito, S.L. Speaker Recognition
JP2016057461A (ja) * 2014-09-09 2016-04-21 Fujitsu Limited Speaker indexing device, speaker indexing method, and computer program for speaker indexing
US20220122615A1 (en) * 2019-03-29 2022-04-21 Microsoft Technology Licensing Llc Speaker diarization with early-stop clustering
JP2021026050A (ja) * 2019-07-31 2021-02-22 Ricoh Co., Ltd. Speech recognition system, information processing device, speech recognition method, and program

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
TAKAFUMI KOSHINAKA, KENTARO NAGATOMO, KENJI SATO: "Online speaker clustering using an ergodic HMM and its application to meeting minute generation.", IEICE TECHNICAL REPORT, IEICE, JP, vol. 109, no. 375, 14 January 2010 (2010-01-14), JP, pages 39 - 44, XP009557928, ISSN: 0913-5685 *

Also Published As

Publication number Publication date
JPWO2024201995A1 (fr) 2024-10-03

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 23930630; Country of ref document: EP; Kind code of ref document: A1)
ENP Entry into the national phase (Ref document number: 2025509583; Country of ref document: JP; Kind code of ref document: A)
WWE WIPO information: entry into national phase (Ref document number: 2025509583; Country of ref document: JP)
NENP Non-entry into the national phase (Ref country code: DE)
122 EP: PCT application non-entry in European phase (Ref document number: 23930630; Country of ref document: EP; Kind code of ref document: A1)