WO2020098083A1 - Call separation method and apparatus, computer device and storage medium - Google Patents

Call separation method and apparatus, computer device and storage medium

Info

Publication number
WO2020098083A1
WO2020098083A1 (application PCT/CN2018/123553, CN2018123553W)
Authority
WO
WIPO (PCT)
Prior art keywords
call
segment
speaker
call segment
speakers
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/CN2018/123553
Other languages
English (en)
Chinese (zh)
Inventor
刘博卿
贾雪丽
程宁
王健宗
肖京
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Publication of WO2020098083A1 publication Critical patent/WO2020098083A1/fr
Anticipated expiration legal-status Critical
Ceased legal-status Critical Current

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 - Speaker identification or verification techniques
    • G10L17/04 - Training, enrolment or model building
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 - Detection of presence or absence of voice signals

Definitions

  • This application relates to the field of artificial intelligence, in particular to a call separation method, device, computer equipment and storage medium.
  • the embodiments of the present application provide a call separation method, device, computer equipment, and storage medium to solve the current problem of inaccurate call separation.
  • an embodiment of the present application provides a call separation method, including:
  • Mute detection is used to remove the mute segment in the original call segment to obtain the first call segment;
  • based on the target model, a variational Bayes algorithm is used to determine the second call segments of the same speaker, and the second call segments of the same speaker are marked with a unified label.
  • an embodiment of the present application provides a call separation device, including:
  • An original call segment acquisition module, used to acquire an original call segment, where the original call segment includes at least two call segments of different speakers;
  • a first call segment acquisition module, used to remove the mute segment in the original call segment using mute detection to obtain the first call segment;
  • a second call segment acquisition module, configured to cut the first call segment to obtain at least three second call segments, where one speaker corresponds to one or more second call segments;
  • a target model acquisition module, used to acquire the i-vector features of each of the second call segments, and to model each of the i-vector features with the pre-trained double-covariance probabilistic linear discriminant analysis model to obtain a target model of each second call segment;
  • a unified labeling module, used to determine the second call segments of the same speaker based on the target models, and to mark the second call segments of the same speaker with a unified label using the variational Bayes algorithm.
  • In a third aspect, an embodiment of the present application provides a computer device, including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor.
  • When executing the computer-readable instructions, the processor implements the following steps:
  • Mute detection is used to remove the mute segment in the original call segment to obtain the first call segment;
  • based on the target model, a variational Bayes algorithm is used to determine the second call segments of the same speaker, and the second call segments of the same speaker are marked with a unified label.
  • An embodiment of the present application provides a non-volatile computer-readable storage medium, including computer-readable instructions which, when executed, implement the following steps:
  • Mute detection is used to remove the mute segment in the original call segment to obtain the first call segment;
  • based on the target model, a variational Bayes algorithm is used to determine the second call segments of the same speaker, and the second call segments of the same speaker are marked with a unified label.
  • Mute detection is first performed on the original call speech, which removes the mute segments in which no one is speaking and helps improve the efficiency and accuracy of call separation. The first call segment is then cut to obtain the second call segments of different speakers, which provides an important technical premise for subsequently determining the second call segments of the same speaker. Next, the pre-trained double-covariance probabilistic linear discriminant analysis model is used for modeling to obtain the target model of each second call segment; this model represents the characteristics of each second call segment more accurately. Finally, the second call segments of the same speaker are determined by the variational Bayes algorithm, which clusters the second call segments belonging to the same speaker with high accuracy, yielding an accurate call separation effect.
  • FIG. 1 is a flowchart of a call separation method in an embodiment of the present application
  • FIG. 2 is a schematic diagram of a call separation device in an embodiment of the present application.
  • FIG. 3 is a schematic diagram of a computer device in an embodiment of the present application.
  • It should be understood that although the terms first, second, third, etc. may be used to describe the preset ranges and the like in the embodiments of the present application, the preset ranges should not be limited by these terms; these terms are only used to distinguish the preset ranges from each other.
  • For example, without departing from the scope of the embodiments of the present application, the first preset range may also be called the second preset range, and similarly, the second preset range may also be called the first preset range.
  • Depending on the context, the word "if" as used herein can be interpreted as "when", "upon", "in response to determining", or "in response to detecting".
  • Similarly, the phrases "if it is determined" or "if (a stated condition or event) is detected" can be interpreted as "when it is determined", "in response to determining", "when (the stated condition or event) is detected", or "in response to detecting (the stated condition or event)".
  • FIG. 1 shows a flowchart of the call separation method in this embodiment.
  • the call separation method can be applied to a terminal device that performs call separation, and is used to realize the function of call separation. Specifically, it can be applied to a phone call separation system installed on a computer device.
  • the computer device is a device that can perform human-computer interaction with a user, including but not limited to computers, smart phones, and tablets.
  • the call separation method includes the following steps:
  • the original call segment includes at least two call segments of different speakers.
  • the original call segment may be a call segment obtained by a recording device and including at least two different speakers. In an embodiment, it may specifically be an original call segment composed of multiple speakers recorded by a recording device in a conference scene.
  • Mute detection is used to remove the mute segment in the original call segment to obtain the first call segment.
  • Mute detection refers to detecting the silent portions of the original call segment in which no one is speaking.
  • It can be implemented using voice endpoint detection, also known as voice activity detection (VAD), based on, for example, frame amplitude, frame energy, the short-time zero-crossing rate, or a deep neural network.
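The VAD techniques listed above can be illustrated with a minimal frame-energy detector; the frame length and energy threshold below are illustrative choices, not values from this application:

```python
import numpy as np

def remove_silence(signal, sr, frame_ms=25, energy_ratio=0.1):
    """Drop frames whose energy falls below a fraction of the loudest
    frame's energy, and concatenate the remaining (voiced) frames into
    the 'first call segment'."""
    frame_len = int(sr * frame_ms / 1000)
    n = len(signal) // frame_len
    frames = signal[:n * frame_len].reshape(n, frame_len)
    energies = (frames.astype(float) ** 2).sum(axis=1)   # per-frame energy
    voiced = energies > energy_ratio * energies.max()
    return frames[voiced].ravel()
```

A production system would more likely use a trained detector (e.g. the deep-neural-network option mentioned above), but the frame-energy rule shows the mechanics of removing mute segments.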
  • S30 Cut the first call segment to obtain at least three second call segments, where one speaker corresponds to one or more second call segments.
  • the first call voice segment is continuous on the time axis, but the call voice segments of different speakers will alternately appear on the time axis. Therefore, the first call voice segment can be cut into call segments corresponding to different speakers, and these segments are the second call segments.
  • The obtained second call segments include at least three segments (with only two segments there would be no alternation of speakers to separate), and one speaker can correspond to one or more second call segments. For example, if there are 10 second call segments corresponding to a total of 4 speakers A, B, C, and D, then A may correspond to 5 second call segments, B to 2, C to 1, and D to 2.
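The segment-to-speaker mapping in the example above can be made concrete with a short sketch; the ordering of labels is invented, only the 5/2/1/2 split comes from the text:

```python
from collections import Counter

# Hypothetical labelling of 10 second call segments by 4 speakers,
# mirroring the A/B/C/D example in the text (the order is invented).
segment_labels = ["A", "A", "B", "C", "A", "D", "A", "B", "D", "A"]
counts = Counter(segment_labels)   # A: 5, B: 2, C: 1, D: 2
```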
  • step S30 the first call segment is cut to obtain at least three second call segments, specifically including:
  • The Bayesian information criterion (BIC) estimates the partially unknown state with subjective probabilities under incomplete information, then uses the Bayes formula to revise the probability of occurrence, and finally makes the optimal decision using the expected value and the revised probability.
  • The likelihood ratio (LR) is an indicator that reflects how much more likely the observations are under one hypothesis than under another.
  • Based on the Bayesian information criterion and the likelihood ratio, the specific time at which the speaker changes in the first call segment can be determined; that is, the speaker transition points in the first call segment can be detected.
  • S32 Cut the first call segment according to the speaker's transition point to obtain at least three second call segments.
  • cutting the first call segment according to the obtained transition point can achieve a preliminary call separation effect, and it can be determined that each obtained second call segment corresponds to a speaker.
  • Through steps S31-S32, the first call segment is cut so that each second call segment obtained corresponds to one speaker, which provides an important technical premise for subsequently determining the second call segments of the same speaker.
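The change-point test in steps S31-S32 can be sketched with the textbook full-covariance ΔBIC criterion; the penalty weight `lam` and the fixed candidate split below are illustrative assumptions, not this application's exact formulation:

```python
import numpy as np

def delta_bic(X, t, lam=1.0):
    """ΔBIC for a hypothesised speaker change at frame t of the feature
    window X (frames x dims): compare one full-covariance Gaussian for
    X against separate Gaussians for X[:t] and X[t:], minus the BIC
    model-complexity penalty. Positive values favour a change point."""
    n, d = X.shape

    def logdet(Y):
        # regularised log-determinant of the sample covariance
        cov = np.cov(Y, rowvar=False) + 1e-6 * np.eye(d)
        return np.linalg.slogdet(cov)[1]

    penalty = 0.5 * (d + 0.5 * d * (d + 1)) * np.log(n)
    return (n * logdet(X)
            - t * logdet(X[:t])
            - (n - t) * logdet(X[t:])
            - lam * penalty)
```

A full detector would slide this test over candidate split times and keep the maxima above zero as transition points.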
  • S40 Obtain the i-vector features of each second call segment, and use a pre-trained double covariance probability linear discriminant analysis model to model each i-vector feature to obtain the target model of each second call segment.
  • The i-vector feature is a compact vector extracted from the Gaussian mixture model (GMM) mean supervector.
  • Besides speaker information, the i-vector feature also carries channel, microphone, and speaking-style information, and can fully reflect the voiceprint characteristics of the speech.
  • The double-covariance probabilistic linear discriminant analysis model is used to extract speaker information from the i-vectors, which can then be used to compare and distinguish voiceprint features.
  • The double-covariance probabilistic linear discriminant analysis model assumes that each i-vector is generated from two latent variables: a speaker vector y (different speakers have different vectors) and a residual vector ε (different segments have different vectors).
  • The total number of speakers is S.
  • Let I = {i_1, ..., i_M} be a given set of indicator vectors for the M second call segments.
  • The speaker-independent residual vector ε of the m-th second call segment follows a Gaussian distribution with mean 0 and covariance L⁻¹.
  • The double covariance in the double-covariance probabilistic linear discriminant analysis model thus comes from the covariances of the speaker vectors y and the residual vectors ε_m respectively. Understandably, the modeling process computes the representation of each second call segment under the double-covariance probabilistic linear discriminant analysis model.
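The generative story behind the two covariances can be sketched as follows; the dimensionality, speaker counts, and the two diagonal covariance matrices are illustrative assumptions, not parameters from this application:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_speakers, segs_per_spk = 8, 4, 5

# Two latent sources of variability, each with its own covariance:
# speaker vectors y_s, and per-segment residuals eps ~ N(0, L^-1).
between_cov = 4.0 * np.eye(d)    # covariance of the speaker vectors y
within_cov = 0.25 * np.eye(d)    # residual covariance L^-1

speakers = rng.multivariate_normal(np.zeros(d), between_cov, size=n_speakers)
ivectors, labels = [], []
for s, y in enumerate(speakers):
    for _ in range(segs_per_spk):
        # each simulated i-vector is speaker vector + segment residual
        ivectors.append(y + rng.multivariate_normal(np.zeros(d), within_cov))
        labels.append(s)
ivectors = np.array(ivectors)
```

Under this story, segments of the same speaker stay close while segments of different speakers are separated by the between-speaker covariance, which is what the clustering step later exploits.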
  • The variational Bayes algorithm (Variational Bayes, VB for short) is an approximate posterior inference method that converges to a local optimum but has a deterministic solution.
  • The problem of determining the second call segments of the same speaker can thus be reduced to computing the posterior probability that a given speaker spoke in a given second call segment, where the posterior probability refers to the conditional probability of a random event or uncertain assertion after the relevant evidence or background is given and taken into account.
  • Under the above assumptions, the joint distribution P(Y, I) is defined, and the variational Bayes algorithm is used to approximate the posterior distribution over the speakers Y.
  • Step S50, in which a variational Bayes algorithm is used based on the target model to determine the second call segments of the same speaker, specifically includes:
  • S512 Obtain the expression of the speaker posterior probability based on the target model and the variational Bayes algorithm, where s denotes a speaker, S is the total number of speakers, y_s is the latent speaker vector of speaker s, and Q(Y) follows a Gaussian distribution with mean μ_s and a corresponding covariance.
  • S513 Update the posterior probability Q (I) of the second call segment and the posterior probability Q (Y) of the speaker based on the variational Bayesian algorithm.
  • The update process of the maximum expectation (Expectation-Maximization, EM) algorithm is used in the calculation process of the variational Bayes algorithm.
  • The maximum expectation algorithm includes an E-step and an M-step: the posterior probability Q(I) of the second call segments and the posterior probability Q(Y) of the speakers are updated in the variational E-step, and in the M-step each second call segment m is assigned to a speaker s.
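The E-step/M-step interplay described above can be caricatured with a simplified update that replaces the model's full double-covariance structure with isotropic covariances; the tempered E-step echoes the deterministic-annealing variant also described in this application:

```python
import numpy as np

def vb_em_step(X, mus, within_var=1.0, temperature=1.0):
    """One simplified iteration of the variational E-step / M-step loop:
    the E-step updates the soft assignments q_ms of segments (rows of X)
    to speakers from Gaussian log-likelihoods, optionally tempered; the
    M-step re-estimates the speaker means from the responsibilities."""
    # E-step: tempered log N(x_m | mu_s, within_var * I), then normalise
    sq = ((X[:, None, :] - mus[None, :, :]) ** 2).sum(axis=-1)
    logp = (-0.5 * sq / within_var) / temperature
    logp -= logp.max(axis=1, keepdims=True)      # numerical stability
    q = np.exp(logp)
    q /= q.sum(axis=1, keepdims=True)            # q_ms: segment-to-speaker posteriors
    # M-step: responsibility-weighted speaker means
    mus = (q.T @ X) / q.sum(axis=0)[:, None]
    return mus, q
```

A few iterations of this loop already separate well-spread clusters of segment vectors; the application's full algorithm additionally carries the PLDA covariances through both updates.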
  • Step S513 specifically includes:
  • A temperature parameter can also be introduced, so that a deterministic-annealing variant of the variational Bayes algorithm updates the posterior probability of the second call segments and the posterior probability of the speakers.
  • In the update process, q_ms is updated according to the annealed update formula, where s′ is used to distinguish the s in q_ms before the update and the temperature parameter scales the update;
  • T is the matrix transpose operation;
  • L is the precision matrix, i.e. the inverse of the covariance L⁻¹;
  • tr(·) is the trace operation of the matrix;
  • const is a term that does not depend on the speaker.
  • The update of the speaker posterior probability is expressed in terms of the corresponding precision matrix (the inverse of its covariance) and C_s, which is likewise an inverse covariance term.
  • S514 Determine the second call segment of the same speaker according to the updated Q (I) and the updated Q (Y).
  • the posterior probability that the speaker has spoken in a given second conversation segment can be obtained, thereby determining the second conversation segment of the same speaker.
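Once the posteriors have been updated, the decision in step S514 can be sketched as an argmax over the soft assignments; the hard argmax rule is an assumed reading of the text:

```python
import numpy as np

def group_segments_by_speaker(q):
    """Assign each segment m to its most probable speaker under the
    updated posteriors q (segments x speakers) and group the segment
    indices per speaker."""
    groups = {}
    for m, s in enumerate(q.argmax(axis=1)):
        groups.setdefault(int(s), []).append(m)
    return groups
```

All segment indices grouped under one key are then marked with that speaker's unified label.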
  • Before step S50, that is, before the variational Bayes algorithm is used based on the target model to determine the second call segments of the same speaker, the method further includes:
  • S521 Initialize the number of speakers in the posterior probability of the second call segment, and use each different speaker in the posterior probability of the second call segment as a pair.
  • the number of speakers in the posterior probability of initializing the second call segment may specifically be initialized to 3 speakers.
  • S522 Calculate the distance between each pair of speakers to obtain the two speakers that are farthest apart.
  • Cosine similarity and/or a likelihood ratio score can be used as the criterion for measuring distance.
  • S523 Repeat, a preset number of times, the steps of initializing the number of speakers in the posterior probability of the second call segment, treating each two different speakers in the posterior probability as a pair, and calculating the distance between each pair of speakers to obtain the two farthest speakers; the two speakers that are farthest apart across all of the repetitions are then used as the starting point of the variational Bayes calculation.
  • That is, steps S521-S522 are repeated a preset number of times (for example, 10 times), and the two speakers that are farthest apart across all repetitions are taken as the starting point of the variational Bayes calculation.
  • Steps S521-S523 optimize the initialization of the variational Bayes algorithm, making the results obtained when it iterates with the maximum expectation algorithm more accurate; the accurate posterior probability that a speaker spoke in a given second call segment then allows the second call speech to be better separated by speaker.
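Steps S521-S523 can be sketched as repeated random initialisation that keeps the farthest pair under the cosine-distance criterion; drawing the candidate speakers directly from the segment i-vectors is an assumed reading of the text:

```python
import numpy as np

def farthest_pair_init(ivectors, n_trials=10, n_speakers=3, rng=None):
    """Repeat random initialisation n_trials times (steps S521-S522),
    each time drawing n_speakers candidate speaker vectors, and keep
    the pair with the largest cosine distance as the starting point
    for the variational Bayes iterations (step S523)."""
    rng = np.random.default_rng(rng)
    best_pair, best_dist = None, -np.inf
    for _ in range(n_trials):
        # S521: initialise speakers as a random subset of the i-vectors
        idx = rng.choice(len(ivectors), size=n_speakers, replace=False)
        centers = ivectors[idx]
        # S522: cosine distance between every pair of candidate speakers
        for a in range(n_speakers):
            for b in range(a + 1, n_speakers):
                u, v = centers[a], centers[b]
                cos = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
                if 1.0 - cos > best_dist:
                    best_dist, best_pair = 1.0 - cos, (u.copy(), v.copy())
    return best_pair, best_dist
```

Starting the variational iterations from a well-separated pair reduces the chance of the EM loop collapsing into a poor local optimum.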
  • the embodiments of the present application further provide device embodiments that implement the steps and methods in the above method embodiments.
  • FIG. 2 shows a functional block diagram of a call separation device corresponding to the call separation method in the embodiment.
  • the call separation device includes an original call segment acquisition module 10, a first call segment acquisition module 20, a second call segment acquisition module 30, a target model acquisition module 40 and a unified label module 50.
  • The implementation functions of the original call segment acquisition module 10, the first call segment acquisition module 20, the second call segment acquisition module 30, the target model acquisition module 40, and the unified labeling module 50 correspond one-to-one to the steps of the call separation method in the embodiment;
  • to avoid repetition, this embodiment does not elaborate on each of them.
  • the original call segment obtaining module 10 is used to obtain an original call segment, and the original call segment includes at least two call segments of different speakers.
  • the first call segment acquisition module 20 is used to remove the mute segment in the original call segment using mute detection to obtain the first call segment.
  • the second call segment acquisition module 30 is configured to cut the first call segment to obtain at least three second call segments, where one speaker corresponds to one or more second call segments.
  • the target model acquisition module 40 is used to acquire the i-vector features of each second call segment, and use the pre-trained double covariance probability linear discriminant analysis model to model each i-vector feature to obtain each second The target model of the call segment.
  • the unified labeling module 50 is used to determine the second conversation segment of the same speaker based on the target model, and use the variational Bayes algorithm to mark the second conversation segment of the same speaker as a unified label.
  • The second call segment acquisition module 30 includes a transition point acquisition unit and a second call segment acquisition unit.
  • the transition point acquisition unit is used to detect and obtain the speaker's transition point in the first call segment based on the Bayesian information criterion and the likelihood ratio.
  • the second call segment acquisition unit is configured to cut the first call segment according to the speaker's transition point to obtain at least three second call segments.
  • The speaker-independent residual vector ε of the m-th second call segment follows a Gaussian distribution with mean 0 and covariance L⁻¹.
  • The unified labeling module 50 includes a speaker posterior probability acquisition unit, an updating unit, and a determining unit.
  • The speaker posterior probability acquisition unit is used to obtain the expression of the speaker posterior probability based on the target model and the variational Bayes algorithm, where s denotes a speaker, S is the total number of speakers, y_s is the latent speaker vector of speaker s, and Q(Y) follows a Gaussian distribution with mean μ_s and a corresponding covariance.
  • the updating unit is used to update the posterior probability Q (I) of the second call segment and the posterior probability Q (Y) of the speaker based on the variational Bayes algorithm.
  • the determining unit is configured to determine the second call segment of the same speaker according to the updated Q (I) and the updated Q (Y).
  • the call separation device further includes an initialization unit, a distance unit, and a starting point determination unit.
  • the initialization unit is used for initializing the number of speakers in the posterior probability of the second conversation segment, and using each different speaker in the posterior probability of the second conversation segment as a pair.
  • The distance unit is used to calculate the distance between each pair of speakers to obtain the two speakers that are farthest apart.
  • The starting point determining unit is used to repeat, a preset number of times, the steps of initializing the number of speakers in the posterior probability of the second call segment, treating each two different speakers in the posterior probability as a pair, and calculating the distance between each pair of speakers to obtain the two farthest speakers; the two speakers that are farthest apart across all of the repetitions serve as the starting point of the variational Bayes calculation.
  • The updating unit updates q_ms in the posterior probability Q(I) of the second call segment according to the annealed update formula, where s′ is used to distinguish the s in q_ms before the update;
  • T is the matrix transpose operation;
  • L is the precision matrix, i.e. the inverse of the covariance L⁻¹;
  • tr(·) is the trace operation of the matrix;
  • const is a term that does not depend on the speaker.
  • The posterior probability Q(Y) of the speaker is updated in terms of the corresponding precision matrix (the inverse of its covariance) and C_s, which is likewise an inverse covariance term.
  • This embodiment provides a computer non-volatile readable storage medium.
  • the computer non-volatile readable storage medium stores computer readable instructions.
  • When the computer-readable instructions are executed by the processor, the call separation method in the embodiment is implemented; to avoid repetition, details are not described here again.
  • the computer-readable instructions are executed by the processor, the functions of the modules / units in the call separation device in the embodiment are implemented. To avoid repetition, details are not described here one by one.
  • The computer device 60 of this embodiment includes a processor 61, a memory 62, and computer-readable instructions 63 stored in the memory 62 and executable on the processor 61.
  • When the computer-readable instructions 63 are executed by the processor 61, the call separation method in the embodiment is implemented; to avoid repetition, details are not described here again.
  • Alternatively, when the computer-readable instructions are executed by the processor 61, the functions of each module/unit of the call separation device in the embodiment are implemented; to avoid repetition, they are not described here one by one.
  • the computer device 60 may be a computing device such as a desktop computer, a notebook, a palmtop computer, and a cloud server.
  • Computer equipment may include, but is not limited to, a processor 61 and a memory 62.
  • FIG. 3 is only an example of the computer device 60 and does not constitute a limitation on it; the computer device 60 may include more or fewer components than shown, combine certain components, or have different components.
  • computer equipment may also include input and output devices, network access devices, buses, and so on.
  • the so-called processor 61 may be a central processing unit (Central Processing Unit, CPU), or other general-purpose processors, digital signal processors (Digital Signal Processors, DSP), application specific integrated circuits (Application Specific Integrated Circuit, ASIC), Field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc.
  • the general-purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
  • the memory 62 may be an internal storage unit of the computer device 60, such as a hard disk or a memory of the computer device 60.
  • The memory 62 may also be an external storage device of the computer device 60, for example, a plug-in hard disk equipped on the computer device 60, a smart media card (SMC), a secure digital (SD) card, a flash card, etc.
  • the memory 62 may also include both the internal storage unit of the computer device 60 and the external storage device.
  • the memory 62 is used to store computer readable instructions and other programs and data required by the computer device.
  • the memory 62 may also be used to temporarily store data that has been or will be output.

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Telephonic Communication Services (AREA)
  • Telephone Function (AREA)

Abstract

The present application relates to a call separation method and apparatus, a computer device, and a storage medium, belonging to the field of artificial intelligence. The call separation method comprises: acquiring an original call segment; using mute detection to remove mute segments from the original call segment to obtain a first call segment; cutting the first call segment to obtain at least three second call segments, one speaker corresponding to one or more second call segments; acquiring i-vector features of each second call segment, and modeling each i-vector feature with a pre-trained double-covariance probabilistic linear discriminant analysis model to obtain a target model of each second call segment; and, based on the target models, using a variational Bayes algorithm to determine the second call segments of the same speaker and marking the second call segments of the same speaker with a unified label. With this call separation method, the call segments corresponding to different speakers in a call can be accurately separated.
PCT/CN2018/123553 2018-11-13 2018-12-25 Call separation method and apparatus, computer device and storage medium Ceased WO2020098083A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811347184.3A CN109360572B (zh) 2018-11-13 2018-11-13 Call separation method, device, computer equipment and storage medium
CN201811347184.3 2018-11-13

Publications (1)

Publication Number Publication Date
WO2020098083A1 true WO2020098083A1 (fr) 2020-05-22

Family

ID=65344905

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/123553 Ceased WO2020098083A1 (fr) 2018-12-25 Call separation method and apparatus, computer device and storage medium

Country Status (2)

Country Link
CN (1) CN109360572B (fr)
WO (1) WO2020098083A1 (fr)


Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110390946A (zh) * 2019-07-26 2019-10-29 龙马智芯(珠海横琴)科技有限公司 Speech signal processing method and apparatus, electronic device and storage medium
CN110517667A (zh) * 2019-09-03 2019-11-29 龙马智芯(珠海横琴)科技有限公司 Speech processing method and apparatus, electronic device and storage medium
CN113129893B (zh) * 2019-12-30 2022-09-02 Oppo(重庆)智能科技有限公司 Speech recognition method, apparatus, device and storage medium
CN112669855A (zh) * 2020-12-17 2021-04-16 北京沃东天骏信息技术有限公司 Speech processing method and apparatus
CN112735384B (zh) * 2020-12-28 2024-07-05 科大讯飞股份有限公司 Turning point detection method, apparatus and device applied to speaker separation
CN113051426A (zh) * 2021-03-18 2021-06-29 深圳市声扬科技有限公司 Audio information classification method and apparatus, electronic device and storage medium
CN113707173B (zh) * 2021-08-30 2023-12-29 平安科技(深圳)有限公司 Speech separation method, apparatus, device and storage medium based on audio segmentation
CN114974264B (zh) * 2022-04-15 2025-02-28 厦门快商通科技股份有限公司 Speaker segmentation method and system based on an improved variational Bayes algorithm
CN116434758B (zh) * 2023-04-07 2026-02-03 平安科技(深圳)有限公司 Voiceprint recognition model training method and apparatus, electronic device and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106971713A (zh) * 2017-01-18 2017-07-21 清华大学 Speaker labeling method and system based on density peak clustering and variational Bayes
CN107342077A (zh) * 2017-05-27 2017-11-10 国家计算机网络与信息安全管理中心 Speaker segmentation and clustering method and system based on factor analysis
CN107452403A (zh) * 2017-09-12 2017-12-08 清华大学 A speaker labeling method
WO2018005620A1 (fr) * 2016-06-28 2018-01-04 Pindrop Security, Inc. System and method for cluster-based audio event detection
US20180254051A1 (en) * 2017-03-02 2018-09-06 International Business Machines Corporation Role modeling in call centers and work centers


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112071438A (zh) * 2020-09-29 2020-12-11 武汉东湖大数据交易中心股份有限公司 Intelligent pertussis screening method and system
CN112071438B (zh) * 2020-09-29 2022-06-14 武汉东湖大数据交易中心股份有限公司 Intelligent pertussis screening method and system
CN115168643A (zh) * 2022-09-07 2022-10-11 腾讯科技(深圳)有限公司 Audio processing method, apparatus, device and computer-readable storage medium

Also Published As

Publication number Publication date
CN109360572A (zh) 2019-02-19
CN109360572B (zh) 2022-03-11

Similar Documents

Publication Publication Date Title
WO2020098083A1 (fr) Call separation method and apparatus, computer device and storage medium
US11996091B2 (en) Mixed speech recognition method and apparatus, and computer-readable storage medium
US11335352B2 (en) Voice identity feature extractor and classifier training
US10468032B2 (en) Method and system of speaker recognition using context aware confidence modeling
CN106683680B (zh) Speaker recognition method and apparatus, computer device and computer-readable medium
US10269346B2 (en) Multiple speech locale-specific hotword classifiers for selection of a speech locale
US9202462B2 (en) Key phrase detection
US9589560B1 (en) Estimating false rejection rate in a detection system
WO2021208287A1 (fr) Voice activity detection method and apparatus for emotion recognition, electronic device and storage medium
CN109960743A (zh) Conference content differentiation method and apparatus, computer device and storage medium
WO2021093449A1 (fr) Wake-up word detection method and apparatus using artificial intelligence, device, and medium
WO2020073694A1 (fr) Voiceprint recognition method, model training method and server
US20150325240A1 (en) Method and system for speech input
CN107221320A (zh) Method, apparatus, device and computer storage medium for training an acoustic feature extraction model
KR20230116886A (ko) 페이크 오디오 검출을 위한 자기 지도형 음성 표현
CN108346427A (zh) Speech recognition method, apparatus, device and storage medium
WO2020253051A1 (fr) Lip language recognition method and apparatus
WO2014029099A1 (fr) I-vector based clustering of training data in speech recognition
CN114049900B (zh) Model training method, identity recognition method, apparatus and electronic device
CN113657249B (zh) Training method, prediction method, apparatus, electronic device and storage medium
WO2019237518A1 (fr) Model library establishment method, speech recognition method and apparatus, device and medium
CN110299150A (zh) Real-time speech speaker separation method and system
CN110491375A (zh) Target language detection method and apparatus
CN111462762B (zh) Speaker vector regularization method and apparatus, electronic device and storage medium
CN115221351A (zh) Audio matching method and apparatus, electronic device and computer-readable storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18939999

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18939999

Country of ref document: EP

Kind code of ref document: A1