WO2023088142A1 - Audio signal processing method, apparatus, device and storage medium - Google Patents
- Publication number
- WO2023088142A1 (PCT/CN2022/130728)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- audio
- sets
- segments
- audio segments
- cluster center
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/06—Decision making techniques; Pattern matching strategies
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/04—Segmentation; Word boundary detection
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/04—Training, enrolment or model building
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
- G10L2015/0631—Creating reference templates; Clustering
Definitions
- the present disclosure relates to the field of information technology, and in particular to an audio signal processing method, device, equipment and storage medium.
- AI artificial intelligence
- the present disclosure provides an audio signal processing method, apparatus, device and storage medium, which improve the accuracy of unsupervised role separation based on single-channel speech.
- an embodiment of the present disclosure provides a method for processing roles in a conference scene, including:
- according to the second cluster center of the first set, determining one or more second target segments in the first set, wherein the similarity between the feature information corresponding to each second target segment and the second cluster center of the first set is greater than or equal to a second threshold;
- an audio signal processing method including:
- the plurality of audio segments are clustered to obtain one or more second sets, wherein the audio segments in the same second set correspond to the same role label.
- an audio signal processing device including:
- a segment module configured to perform segment processing on the audio signal to obtain multiple audio segments
- a clustering module configured to cluster the multiple audio segments according to the feature information of each of the multiple audio segments to obtain one or more first sets
- a determining module configured to determine the first cluster center of each of the first sets according to the feature information of the audio segments included in each of the first sets;
- the clustering module is further configured to: cluster the multiple audio segments according to the first cluster center of each of the first sets to obtain one or more second sets, wherein the audio segments in the same second set correspond to the same role label.
- an electronic device including:
- the computer program is stored in the memory and is configured to be executed by the processor to implement the method as described in the first aspect or the second aspect.
- an embodiment of the present disclosure provides a computer-readable storage medium, on which a computer program is stored, and the computer program is executed by a processor to implement the method described in the first aspect or the second aspect.
- an embodiment of the present disclosure provides a conference system, where the system includes a terminal and a server; wherein, the terminal and the server are connected by communication;
- the terminal is used to send audio signals of conference multi-roles to the server, and the server is used to execute the method described in the second aspect; or
- the server is configured to send the conference multi-role audio signal to the terminal, and the terminal is configured to execute the method described in the second aspect.
- multiple audio segments are obtained by segmenting the audio signal, and the multiple audio segments are clustered according to the feature information of each audio segment to obtain one or more first sets. Further, the first cluster center of each of the first sets is determined according to the feature information of the audio segments included in that first set, and the multiple audio segments are clustered according to the first cluster center of each of the first sets to obtain one or more second sets, wherein the audio segments in the same second set correspond to the same role label. That is to say, after the initial clustering is performed on the multiple audio segments, the multiple audio segments can be clustered again according to the first cluster center of each first set, thereby improving the accuracy of unsupervised role separation for single-channel speech.
- FIG. 1 is a flowchart of an audio signal processing method provided by an embodiment of the present disclosure
- FIG. 2 is a schematic diagram of an application scenario provided by an embodiment of the present disclosure
- FIG. 3 is a flowchart of an audio signal processing method provided by another embodiment of the present disclosure.
- FIG. 4 is a flowchart of an audio signal processing method provided by another embodiment of the present disclosure.
- FIG. 5 is a schematic diagram of a clustering result provided by an embodiment of the present disclosure.
- FIG. 6 is a flowchart of an audio signal processing method provided by another embodiment of the present disclosure.
- FIG. 7 is a schematic structural diagram of an audio signal processing device provided by another embodiment of the present disclosure.
- FIG. 8 is a schematic structural diagram of an embodiment of an electronic device provided by an embodiment of the present disclosure.
- unsupervised role separation based on single-channel voice is a necessary and challenging technology in conference systems, and has a wide range of application requirements.
- the accuracy of current unsupervised role separation based on single-channel speech is low.
- the unsupervised role separation specifically refers to obtaining the number of roles in the speech and the time information of each role speaking when the role information is unknown.
- an embodiment of the present disclosure provides an audio signal processing method, which will be introduced below in conjunction with specific embodiments.
- FIG. 1 is a flowchart of an audio signal processing method provided by an embodiment of the present disclosure.
- this embodiment is applicable to the situation where the audio signal is processed in the client. The method can be executed by an audio signal processing device, the device can be implemented in the form of software and/or hardware, and the device can be configured in an electronic device, such as a terminal, including smartphones, PDAs, tablets, wearables with displays, desktops, laptops, all-in-ones, smart home devices, and more.
- this embodiment can also be applied to the situation where the audio signal is processed in the server. The method can be executed by an audio signal processing device, the device can be implemented in the form of software and/or hardware, and the device can be configured in an electronic device, for example a server.
- the audio signal processing method described in this embodiment may be applicable to application scenarios such as unsupervised role separation of single-channel speech, speech recognition, and a conference system. As shown in Figure 1, the specific steps of the method are as follows:
- the terminal 21 can obtain an audio signal from a server 22 .
- an audio signal is stored locally in the terminal 21 .
- the terminal 21 may collect audio signals through an audio collection module.
- the audio signal may be a single-channel voice, and further, the terminal 21 may perform unsupervised role separation on the audio signal.
- the terminal 21 may also send its local or collected audio signal to the server 22, and the server 22 performs unsupervised role separation on the audio signal. The following description takes the case in which the terminal 21 performs unsupervised role separation on the audio signal as an example.
- the terminal 21 may segment the audio signal to obtain multiple audio segments.
- methods such as voice activity detection (VAD) and the Bayesian information criterion (BIC) can be used for segmentation processing.
- An audio segment may also be called a voice segment, and each audio segment obtained by segmenting may be an audio segment of 1-2 seconds.
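- The segmentation step can be illustrated with a minimal sketch. It assumes a plain frame-energy threshold as a stand-in for VAD (the source names VAD and BIC but does not specify an algorithm), and the names `energy_vad`, `split_spans`, `frame_ms`, `threshold` and `max_len` are hypothetical. Voiced spans are cut into pieces of at most 2 seconds, in line with the 1-2 second segments mentioned above.

```python
def energy_vad(samples, sample_rate, frame_ms=30, threshold=0.01):
    """Return (start_s, end_s) spans whose mean frame energy exceeds threshold.

    Assumption: a simple energy gate stands in for a real VAD.
    """
    frame_len = max(1, int(sample_rate * frame_ms / 1000))
    n_frames = len(samples) // frame_len
    spans = []
    start = None
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        energy = sum(x * x for x in frame) / frame_len
        t = i * frame_len / sample_rate
        if energy > threshold and start is None:
            start = t                      # speech begins
        elif energy <= threshold and start is not None:
            spans.append((start, t))       # speech ends
            start = None
    if start is not None:
        spans.append((start, n_frames * frame_len / sample_rate))
    return spans


def split_spans(spans, max_len=2.0):
    """Cut each voiced span into audio segments no longer than max_len seconds."""
    segments = []
    for start, end in spans:
        t = start
        while end - t > max_len:
            segments.append((t, t + max_len))
            t += max_len
        if end > t:
            segments.append((t, end))
    return segments


# Synthetic signal: 1 s silence, 5 s of constant "speech", 1 s silence.
rate = 16000
signal = [0.0] * rate + [0.5] * (5 * rate) + [0.0] * rate
segments = split_spans(energy_vad(signal, rate))
```

- Each resulting `(start, end)` pair would then be passed to the feature extractor; a production system would likely replace the energy gate with a trained VAD.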
- S102 Perform clustering processing on the multiple audio segments according to feature information of each of the multiple audio segments to obtain one or more first sets.
- the terminal 21 may use x-vector, ResNet or other embedding methods to extract feature information of each audio segment, and the feature information may specifically be an embedding feature, that is, an embedded vector representation.
- x-vector and Resnet are embedded vector representation methods based on neural network models.
- the terminal 21 may calculate the similarity between every two audio segments according to the embedding features of each audio segment. It can be understood that the similarity between the audio segment A and the audio segment B may be the similarity between the embedding features of the audio segment A and the embedding features of the audio segment B. The greater the similarity, the smaller the distance between the embedding feature of audio segment A and the embedding feature of audio segment B, and at the same time, the smaller the distance between audio segment A and audio segment B.
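- The pairwise similarity computation can be sketched as follows. The source does not name the similarity measure, so cosine similarity is an assumption here, and `cosine_similarity` and `similarity_matrix` are hypothetical helper names.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_u * norm_v)


def similarity_matrix(embeddings):
    """Similarity between every two audio segments, as an n x n matrix."""
    n = len(embeddings)
    return [[cosine_similarity(embeddings[i], embeddings[j]) for j in range(n)]
            for i in range(n)]


# Toy 2-D embeddings for audio segments A, B and C.
emb = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
sim = similarity_matrix(emb)
```

- With this measure, a corresponding distance can be taken as one minus the similarity, so a larger similarity means a smaller distance, as stated above.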
- the agglomerative hierarchical clustering (AHC) method may specifically be: according to the similarity between every two audio segments in the plurality of audio segments, determine the two audio segments with the highest similarity score and merge them into a new audio segment. Further, calculate the similarity between the new audio segment and each of the other audio segments, and repeat the merging and the similarity calculation until the constraint criterion is reached.
- the merging is then stopped, so as to obtain one or more first sets.
- the clustering method is not limited to AHC, and may also be another clustering algorithm, for example, the k-means clustering algorithm (k-means).
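- The AHC procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions: cosine similarity as the measure, average linkage to score a merged group against the others, and a similarity floor `stop_threshold` as the constraint criterion. None of these choices is specified in the source, and `ahc` is a hypothetical name.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def ahc(embeddings, stop_threshold=0.5):
    """Agglomerative clustering; returns a list of index lists (first sets)."""
    clusters = [[i] for i in range(len(embeddings))]

    def linkage(c1, c2):
        # Average similarity over all cross pairs (assumed linkage rule).
        total = sum(cosine(embeddings[i], embeddings[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        best_score, best_pair = -math.inf, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                score = linkage(clusters[a], clusters[b])
                if score > best_score:
                    best_score, best_pair = score, (a, b)
        if best_score < stop_threshold:
            break  # constraint criterion reached: stop merging
        a, b = best_pair
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


emb = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
first_sets = ahc(emb)
```

- The number of sets is not fixed in advance; it follows from where the merging stops, which matches the unsupervised setting in which the number of roles is unknown.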
- after the above-mentioned clustering method is used to cluster the multiple audio segments obtained by the segmentation process, for each first set among the one or more first sets, the first cluster center of the first set is determined according to the embedding features corresponding to the one or more audio segments in that first set.
- S104 Perform clustering processing on the multiple audio segments according to the first cluster center of each of the first sets to obtain one or more second sets, wherein the audio segments in the same second set correspond to the same role label.
- the multiple audio segments obtained after segmentation processing can be re-clustered according to the first cluster center of each first set to obtain one or more second sets, and the audio segments in the same second set correspond to the same role label.
- one or more second sets may be recorded as the updated clustering result.
- the updated clustering result is a result of updating the initial clustering result.
- multiple audio segments are obtained by segmenting the audio signal, and the multiple audio segments are clustered according to the feature information of each audio segment to obtain one or more first sets. Further, the first cluster center of each of the first sets is determined according to the feature information of the audio segments included in that first set, and the multiple audio segments are clustered according to the first cluster center of each of the first sets to obtain one or more second sets, wherein the audio segments in the same second set correspond to the same role label. That is to say, after the initial clustering is performed on the multiple audio segments, the multiple audio segments can be clustered again according to the first cluster center of each first set, thereby improving the accuracy of unsupervised role separation for single-channel speech.
- the first cluster center of each of the first sets can be determined.
- determining the first cluster center of each of the first sets includes: determining a first target segment in the first set, wherein the sum of the similarity scores between the first target segment and the other audio segments in the first set is greater than a first threshold; and using the feature information corresponding to the first target segment as the first cluster center of the first set.
- the first set 1 includes more than one audio segment.
- the first set 1 includes audio segment A, audio segment B and audio segment C.
- an audio segment can be determined from the audio segment A, the audio segment B and the audio segment C as the first target segment, and the first target segment can be the segment that represents the first cluster center. For example, assuming that audio segment A is used as the first target segment, the similarity score between audio segment A and audio segment B and the similarity score between audio segment A and audio segment C are calculated, and the two similarity scores are added to obtain the sum of the similarity scores.
- if the sum of the similarity scores is greater than the first threshold, the audio segment A can be used as the first target segment; otherwise, continue to traverse to find the first target segment. Alternatively, calculate the sum of similarity scores separately for the cases in which audio segment A, audio segment B, and audio segment C is assumed to be the first target segment. If the sum of similarity scores is largest when audio segment A is assumed to be the first target segment, then audio segment A is determined as the first target segment.
- the calculation of the sum of similarity scores when audio segment B or audio segment C is assumed to be the first target segment follows the same process as for audio segment A, and is not repeated here.
- the feature information corresponding to the first target segment is used as the first cluster center of the first set. It can be understood that the first target segment obtained in this way is the segment that best represents the first cluster center. Therefore, the one or more second sets obtained after performing S104 are more accurate than the one or more first sets obtained in S102.
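- The second variant above, in which the segment with the largest sum of similarity scores is chosen, can be sketched as follows. Cosine similarity is an assumption, and `pick_first_target` is a hypothetical name.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def pick_first_target(embeddings):
    """Return (index, score_sum) of the segment whose summed similarity
    to all other segments in the same first set is largest."""
    best_index, best_sum = None, -math.inf
    for i, candidate in enumerate(embeddings):
        score_sum = sum(cosine(candidate, other)
                        for j, other in enumerate(embeddings) if j != i)
        if score_sum > best_sum:
            best_index, best_sum = i, score_sum
    return best_index, best_sum


# First set 1 with audio segments A, B and C (toy embeddings).
first_set_1 = [[1.0, 0.0], [0.8, 0.2], [0.6, 0.4]]
index, score_sum = pick_first_target(first_set_1)
first_cluster_center = first_set_1[index]
```

- In this toy set the middle segment B is selected, since it is the most similar to both of the others. The threshold variant would instead accept the first candidate whose sum exceeds the first threshold.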
- determining the first cluster center of each of the first sets includes the following steps, as shown in FIG. 3:
- determining the second cluster center of the first set according to the feature information of the audio segments included in the first set includes: calculating a first mean value of the feature information of the audio segments included in the first set; and using the first mean value as the second cluster center of the first set.
- the first set 1 includes audio segment A, audio segment B and audio segment C. Since audio segment A, audio segment B, and audio segment C each correspond to an embedding feature, the average value of the embedding features corresponding to audio segment A, audio segment B, and audio segment C can be calculated, and this average value is recorded as the first mean value. Further, the first mean value is used as the initial cluster center of the first set 1, and the initial cluster center is recorded as the second cluster center.
- the initial cluster center of the first set 1, that is, the second cluster center may be updated to obtain an updated cluster center of the first set 1, and the updated cluster center is recorded as the first cluster center.
- updating the second cluster center of the first set to obtain the first cluster center of the first set includes: according to the second cluster center of the first set, determining one or more second target segments in the first set, wherein the similarity between the feature information corresponding to each second target segment and the second cluster center of the first set is greater than or equal to a second threshold; and determining the first cluster center of the first set based on the one or more second target segments in the first set.
- one or more second target segments are determined from audio segment A, audio segment B, and audio segment C, and each of the one or more second target segments may be a K-nearest neighbor segment of the second cluster center. That is to say, the similarity between the embedding feature of each second target segment and the second cluster center is greater than or equal to the second threshold, that is, the distance between the embedding feature of each second target segment and the second cluster center is smaller than a certain threshold.
- audio segment A and audio segment B may serve as the second target segment respectively.
- the first cluster center of the first set 1, that is, the updated cluster center can be determined.
- determining the first cluster center of the first set according to one or more second target segments in the first set includes: calculating a second mean value of the feature information corresponding to the one or more second target segments in the first set; and using the second mean value as the first cluster center of the first set.
- the average value of the embedding features of audio segment A and the embedding features of audio segment B is calculated, and the average value is recorded as the second mean value, and further, the second mean value is used as the first cluster center of the first set 1 .
- the second cluster center of the first set 1 can be calculated through S301, and the second cluster center can be the average value of the embedding features corresponding to the audio segment A, the audio segment B, and the audio segment C respectively.
- the average value of the embedding feature of audio segment A and the embedding feature of audio segment B can be used as the first cluster center.
- the first cluster center is more accurate than the second cluster center, therefore, the one or more second sets obtained after performing S104 are more accurate than the one or more first sets obtained by S102.
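- The two-stage center computation described above (first mean value, selection of second target segments, second mean value) can be sketched as follows. Cosine similarity and the value of `second_threshold` are assumptions, and `mean_vector` and `refine_center` are hypothetical names.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def refine_center(vectors, second_threshold=0.8):
    """Return (second_center, first_center) for one first set."""
    # First mean value: initial (second) cluster center.
    second_center = mean_vector(vectors)
    # Second target segments: close enough to the initial center.
    targets = [v for v in vectors if cosine(v, second_center) >= second_threshold]
    # Second mean value: updated (first) cluster center.
    # Assumption: fall back to the initial center if no segment qualifies.
    first_center = mean_vector(targets) if targets else second_center
    return second_center, first_center


# Audio segments A and B agree; C is an outlier placed in the set by mistake.
first_set_1 = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
second_center, first_center = refine_center(first_set_1)
```

- Here the outlier C pulls the initial center toward itself, while the updated center is computed from A and B only, which illustrates why the first cluster center is described as more accurate than the second.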
- Fig. 4 is a flowchart of an audio signal processing method provided by another embodiment of the present disclosure. The specific steps of the method are as follows:
- S401 Perform segmentation processing on an audio signal to obtain multiple audio segments.
- S402. Perform clustering processing on the multiple audio segments according to the feature information of each of the multiple audio segments to obtain one or more first sets.
- the first cluster centers respectively corresponding to the first set 1, the first set 2, and the first set 3 are obtained.
- the first cluster center can be obtained through several methods as described above, which will not be repeated here.
- for each audio segment, calculate the distance between the audio segment and the first cluster center of the first set 1, the distance between the audio segment and the first cluster center of the first set 2, and the distance between the audio segment and the first cluster center of the first set 3. Since these 3 distances may be different, it can be determined from them which first cluster center the audio segment is closest to.
- for example, if there are 3 first sets, then there are 3 first cluster centers, and each first cluster center can correspond to a second set, so there are 3 second sets, and the one or more audio segments included in each second set are the audio segments closest to the corresponding first cluster center.
- the distances between one or more audio segments included in the second set and the first cluster center are less than or equal to the third threshold. That is to say, the second set may be a result of partially or fully adjusting more than one audio segment in the first set. Wherein, the audio segments in the same second set correspond to the same role label.
- the first set 1, the first set 2 and the first set 3 are initial clustering results. After determining the first cluster centers corresponding to the first set 1, the first set 2, and the first set 3 respectively, it is determined that the distance between the audio segment C and the first cluster center of the first set 2 is the shortest, so the audio segment C may be adjusted from the first set 1 into the first set 2. Similarly, the distance between the audio segment F and the first cluster center of the first set 3 is the shortest, therefore, the audio segment F is adjusted from the first set 2 to the first set 3, thereby obtaining the re-clustering results.
- the re-clustering results are the second sets corresponding to each first set.
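- The reassignment step can be sketched as follows. Distance is taken as one minus cosine similarity, which is an assumption, and `assign_segments` and the optional `third_threshold` handling (label `-1` for segments farther than the threshold from every center) are hypothetical.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def assign_segments(embeddings, centers, third_threshold=None):
    """Give each audio segment the label of its nearest first cluster center."""
    labels = []
    for embedding in embeddings:
        distances = [1.0 - cosine(embedding, center) for center in centers]
        nearest = min(range(len(centers)), key=distances.__getitem__)
        if third_threshold is not None and distances[nearest] > third_threshold:
            labels.append(-1)  # assumption: leave far-away segments unassigned
        else:
            labels.append(nearest)
    return labels


centers = [[1.0, 0.0], [0.0, 1.0]]            # first cluster centers
segments = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]]
role_labels = assign_segments(segments, centers)
```

- Segments that share a label form one second set and therefore one role label.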
- multiple audio segments are obtained by segmenting the audio signal, and the multiple audio segments are clustered according to the feature information of each audio segment to obtain one or more first sets. Further, the first cluster center of each of the first sets is determined according to the feature information of the audio segments included in that first set, and the multiple audio segments are clustered according to the first cluster center of each of the first sets to obtain one or more second sets, wherein the audio segments in the same second set correspond to the same role label. That is to say, after the initial clustering is performed on the multiple audio segments, the multiple audio segments can be clustered again according to the first cluster center of each first set, thereby improving the accuracy of unsupervised role separation for single-channel speech. This can effectively avoid clustering errors caused by inaccurate cluster centers, for example, dividing two audio segments of the same role into different classes, or assigning some audio segments of one role to the class of another role.
- Fig. 6 is a flowchart of an audio signal processing method provided by another embodiment of the present disclosure. The specific steps of the method are as follows:
- S602. Perform clustering processing on the multiple audio segments according to the feature information of each of the multiple audio segments to obtain one or more first sets.
- S605. Determine one or more second target segments in the first set according to the second cluster center of the first set, wherein the similarity between the feature information corresponding to each second target segment and the second cluster center of the first set is greater than or equal to the second threshold.
- after performing S609, the second set may also be used as the first set, so as to repeatedly execute S603-S609.
- the number of iterations of S603-S609 may be a preset number. That is to say, on the basis of the initial clustering result, the second cluster center and the first cluster center may be updated multiple times, and role labels may be reassigned to each audio segment.
- in each iteration, the K-nearest neighbor segments of the second cluster center, that is, the one or more second target segments, may change. Therefore, the first cluster center can be updated in each iteration and can continuously approach the real cluster center, thereby greatly reducing the influence of noise points on the cluster center and ensuring the accuracy of the cluster center.
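- The iterative refinement can be sketched end to end as follows. This is an illustration under assumptions rather than the disclosed implementation: cosine similarity, the values of `second_threshold` and `iterations`, the early stop when labels no longer change, and the name `recluster` are all hypothetical.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def mean_vector(vectors):
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def recluster(embeddings, labels, second_threshold=0.8, iterations=3):
    """Repeat center update and reassignment a preset number of times."""
    labels = list(labels)
    for _ in range(iterations):
        centers = {}
        for label in sorted(set(labels)):
            members = [e for e, l in zip(embeddings, labels) if l == label]
            second_center = mean_vector(members)           # first mean value
            targets = [e for e in members
                       if cosine(e, second_center) >= second_threshold]
            centers[label] = mean_vector(targets or members)  # second mean value
        # Reassign every segment to the most similar updated center.
        new_labels = [max(centers, key=lambda lab: cosine(e, centers[lab]))
                      for e in embeddings]
        if new_labels == labels:
            break  # assumption: stop early once the assignment is stable
        labels = new_labels
    return labels


emb = [[1.0, 0.0], [0.9, 0.1], [0.1, 0.9], [0.0, 1.0], [0.05, 0.95]]
initial = [0, 0, 0, 1, 1]      # the third segment starts in the wrong set
final = recluster(emb, initial)
```

- In this toy run the mislabeled third segment is excluded from the center of set 0 as a noise point and then moved to set 1 during reassignment.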
- the accuracy of role separation can be increased from 90% to 94% through the method described in the embodiment of the present disclosure, and the effect is significantly improved.
- the embodiment of the present disclosure also provides a method for processing roles in a conference scene, the method includes the following steps:
- according to the second cluster center of the first set, determining one or more second target segments in the first set, wherein the similarity between the feature information corresponding to each second target segment and the second cluster center of the first set is greater than or equal to the second threshold.
- FIG. 7 is a schematic structural diagram of an audio signal processing device provided by an embodiment of the present disclosure.
- the audio signal processing device provided in the embodiment of the present disclosure can execute the processing flow provided in the embodiment of the audio signal processing method.
- the audio signal processing device 70 includes:
- a segmentation module 71 configured to perform segmentation processing on an audio signal to obtain multiple audio segments;
- a clustering module 72 configured to perform clustering processing on the multiple audio segments according to the feature information of each of the multiple audio segments to obtain one or more first sets;
- a determining module 73 configured to determine the first cluster center of each of the first sets according to the feature information of the audio segments included in each of the first sets;
- the clustering module 72 is further configured to: perform clustering processing on the plurality of audio segments according to the first cluster center of each of the first sets to obtain one or more second sets, wherein the audio segments in the same second set correspond to the same role label.
- the determining module 73 is specifically configured to: determine the first target segment in the first set, wherein the sum of the similarity scores between the first target segment and the other audio segments in the first set is greater than a first threshold; and use the feature information corresponding to the first target segment as the first cluster center of the first set.
- the determining module 73 includes a determining unit 731 and an updating unit 732, wherein the determining unit 731 is configured to determine the second cluster center of the first set according to the feature information of the audio segments included in the first set
- the updating unit 732 is configured to update the second cluster centers of the first set to obtain the first cluster centers of the first set.
- the determining unit 731 is specifically configured to: calculate a first mean value of the feature information of the audio segments included in the first set; and use the first mean value as a second cluster center of the first set.
- the updating unit 732 is specifically configured to: determine one or more second target segments in the first set according to the second cluster center of the first set, wherein the similarity between the feature information corresponding to each second target segment and the second cluster center of the first set is greater than or equal to the second threshold; and determine the first cluster center of the first set according to the one or more second target segments in the first set.
- when the updating unit 732 determines the first cluster center of the first set according to one or more second target segments in the first set, it is specifically configured to: calculate a second mean value of the feature information corresponding to the one or more second target segments in the first set; and use the second mean value as the first cluster center of the first set.
- when the clustering module 72 performs clustering processing on the plurality of audio segments according to the first cluster center of each of the first sets to obtain one or more second sets, it is specifically configured to: for each audio segment in the plurality of audio segments, calculate the distance between the audio segment and each of the first cluster centers according to the feature information of the audio segment and the first cluster center of each of the first sets; and divide, among the plurality of audio segments, the audio segments whose distance from a first cluster center is less than or equal to a third threshold into the corresponding second set.
- the audio signal processing device of the embodiment shown in FIG. 7 can be used to implement the technical solution of the above method embodiment, and its implementation principle and technical effect are similar, and will not be repeated here.
- FIG. 8 is a schematic structural diagram of an embodiment of an electronic device provided by an embodiment of the present disclosure. As shown in FIG. 8 , the electronic device includes a memory 81 and a processor 82 .
- the memory 81 is used to store programs. In addition to the above-mentioned programs, the memory 81 may also be configured to store other various data to support operations on the electronic device. Examples of such data include instructions for any application or method operating on the electronic device, contact data, phonebook data, messages, pictures, videos, etc.
- Memory 81 can be realized by any type of volatile or non-volatile storage device or their combination, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable Programmable Read Only Memory (EPROM), Programmable Read Only Memory (PROM), Read Only Memory (ROM), Magnetic Memory, Flash Memory, Magnetic or Optical Disk.
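- The processor 82 is coupled to the memory 81 and executes the program stored in the memory 81 to:
- segment an audio signal to obtain a plurality of audio segments; cluster the plurality of audio segments according to the feature information of each audio segment to obtain one or more first sets; determine the first cluster center of each first set according to the feature information of the audio segments included in that first set; and cluster the plurality of audio segments according to the first cluster center of each first set to obtain one or more second sets, wherein the audio segments in the same second set correspond to the same role tag.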
- The electronic device may further include a communication component 83, a power supply component 84, an audio component 85, a display 86 and other components.
- FIG. 8 only schematically shows some components, which does not mean that the electronic device includes only the components shown in FIG. 8.
- The communication component 83 is configured to facilitate wired or wireless communication between the electronic device and other devices. The electronic device can access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component 83 receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 83 also includes a near field communication (NFC) module to facilitate short-range communication.
- The NFC module may be implemented based on radio frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology and other technologies.
- The power supply component 84 provides power for the various components of the electronic device.
- The power supply component 84 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the electronic device.
- The audio component 85 is configured to output and/or input audio signals.
- The audio component 85 includes a microphone (MIC), which is configured to receive external audio signals when the electronic device is in an operation mode, such as a call mode, a recording mode or a voice recognition mode.
- The received audio signal may be further stored in the memory 81 or sent via the communication component 83.
- The audio component 85 also includes a speaker for outputting audio signals.
- The display 86 includes a screen, which may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from a user.
- The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. A touch sensor may sense not only the boundary of a touch or swipe action, but also the duration and pressure associated with the touch or swipe action.
- An embodiment of the present disclosure further provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements the audio signal processing method described in the above embodiments.
- An embodiment of the present disclosure also provides a conference system. The system includes a terminal and a server, and the terminal and the server are communicatively connected.
- The terminal is configured to send a multi-role conference audio signal to the server, and the server is configured to execute the role processing method in the conference scenario described in the above embodiments; or
- the server is configured to send a multi-role conference audio signal to the terminal, and the terminal is configured to execute the role processing method in the conference scenario described in the above embodiments.
Landscapes
- Engineering & Computer Science (AREA)
- Multimedia (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Health & Medical Sciences (AREA)
- Computational Linguistics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Artificial Intelligence (AREA)
- Business, Economics & Management (AREA)
- Game Theory and Decision Science (AREA)
- Evolutionary Computation (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Claims (12)
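- A role processing method in a conference scenario, wherein the method comprises: receiving an audio signal of multiple roles in a conference; segmenting the audio signal to obtain a plurality of audio segments; clustering the plurality of audio segments according to feature information of each of the plurality of audio segments to obtain one or more first sets; calculating a first mean of the feature information of the audio segments included in the first set; using the first mean as a second cluster center of the first set; determining, according to the second cluster center of the first set, one or more second target segments in the first set, wherein the similarity between the feature information corresponding to the second target segment and the second cluster center of the first set is greater than or equal to a second threshold; calculating a second mean of the feature information corresponding to the one or more second target segments in the first set; using the second mean as a first cluster center of the first set; for each audio segment of the plurality of audio segments, calculating, according to the feature information of the audio segment and the first cluster center of each first set, the distance between the audio segment and each first cluster center; assigning, among the plurality of audio segments, the audio segments whose distance from the first cluster center is less than or equal to a third threshold to a second set; determining role information of a plurality of speakers in the audio signal according to the second set; and taking the second set as the first set and repeating the process from calculating the first mean to determining the role information.
- An audio signal processing method, wherein the method comprises: segmenting an audio signal to obtain a plurality of audio segments; clustering the plurality of audio segments according to feature information of each of the plurality of audio segments to obtain one or more first sets; determining a first cluster center of each first set according to the feature information of the audio segments included in that first set; and clustering the plurality of audio segments according to the first cluster center of each first set to obtain one or more second sets, wherein the audio segments in the same second set correspond to the same role tag.
- The method according to claim 2, wherein determining the first cluster center of each first set according to the feature information of the audio segments included in that first set comprises: determining a first target segment in the first set, wherein the sum of the similarity scores between the first target segment and the other audio segments in the first set is greater than a first threshold; and using the feature information corresponding to the first target segment as the first cluster center of the first set.
- The method according to claim 2, wherein determining the first cluster center of each first set according to the feature information of the audio segments included in that first set comprises: determining a second cluster center of the first set according to the feature information of the audio segments included in the first set; and updating the second cluster center of the first set to obtain the first cluster center of the first set.
- The method according to claim 4, wherein determining the second cluster center of the first set according to the feature information of the audio segments included in the first set comprises: calculating a first mean of the feature information of the audio segments included in the first set; and using the first mean as the second cluster center of the first set.
- The method according to claim 4, wherein updating the second cluster center of the first set to obtain the first cluster center of the first set comprises: determining, according to the second cluster center of the first set, one or more second target segments in the first set, wherein the similarity between the feature information corresponding to the second target segment and the second cluster center of the first set is greater than or equal to a second threshold; and determining the first cluster center of the first set according to the one or more second target segments in the first set.
- The method according to claim 6, wherein determining the first cluster center of the first set according to the one or more second target segments in the first set comprises: calculating a second mean of the feature information corresponding to the one or more second target segments in the first set; and using the second mean as the first cluster center of the first set.
- The method according to claim 2, wherein clustering the plurality of audio segments according to the first cluster center of each first set to obtain one or more second sets comprises: for each audio segment of the plurality of audio segments, calculating, according to the feature information of the audio segment and the first cluster center of each first set, the distance between the audio segment and each first cluster center; and assigning, among the plurality of audio segments, the audio segments whose distance from the first cluster center is less than or equal to a third threshold to a second set.
- An audio signal processing apparatus, comprising: a segmentation module, configured to segment an audio signal to obtain a plurality of audio segments; a clustering module, configured to cluster the plurality of audio segments according to feature information of each of the plurality of audio segments to obtain one or more first sets; and a determination module, configured to determine a first cluster center of each first set according to the feature information of the audio segments included in that first set; wherein the clustering module is further configured to cluster the plurality of audio segments according to the first cluster center of each first set to obtain one or more second sets, wherein the audio segments in the same second set correspond to the same role tag.
- An electronic device, comprising: a memory; a processor; and a computer program; wherein the computer program is stored in the memory and is configured to be executed by the processor to implement the method according to any one of claims 1-8.
- A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the method according to any one of claims 1-8.
- A conference system, wherein the system comprises a terminal and a server; the terminal and the server are communicatively connected; the terminal is configured to send an audio signal of multiple roles in a conference to the server, and the server is configured to execute the method according to claim 1; or the server is configured to send an audio signal of multiple roles in a conference to the terminal, and the terminal is configured to execute the method according to claim 1.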
Priority Applications (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US18/685,019 US20240355335A1 (en) | 2021-11-16 | 2022-11-08 | Audio signal processing method and apparatus, device and storage medium |
| EP22894679.4A EP4375988B1 (en) | 2021-11-16 | 2022-11-08 | Audio signal processing method and apparatus, and device and storage medium |
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202111351380.X | 2021-11-16 | | |
| CN202111351380.XA CN113808578B (zh) | 2021-11-16 | 2021-11-16 | Audio signal processing method, apparatus, device and storage medium |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2023088142A1 (zh) | 2023-05-25 |
Family
ID=78898545
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/CN2022/130728 Ceased WO2023088142A1 (zh) | 2021-11-16 | 2022-11-08 | Audio signal processing method, apparatus, device and storage medium |
Country Status (4)
| Country | Link |
|---|---|
| US (1) | US20240355335A1 (zh) |
| EP (1) | EP4375988B1 (zh) |
| CN (1) | CN113808578B (zh) |
| WO (1) | WO2023088142A1 (zh) |
Families Citing this family (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
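| CN114596877B (zh) * | 2022-03-03 | 2024-11-08 | 北京百度网讯科技有限公司 | Speaker separation method and apparatus, electronic device and storage medium |
| CN114465737B (zh) * | 2022-04-13 | 2022-06-24 | 腾讯科技(深圳)有限公司 | Data processing method and apparatus, computer device and storage medium |
| CN115269909A (zh) * | 2022-07-28 | 2022-11-01 | 腾讯音乐娱乐科技(深圳)有限公司 | Audio classification method, audio search method, computer device and program product |
| CN116524937A (zh) * | 2023-01-11 | 2023-08-01 | 阿里巴巴达摩院(杭州)科技有限公司 | Method for detecting speaker change points, and method and apparatus for training a detection model |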
Citations (8)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN103699573A (zh) * | 2013-11-28 | 2014-04-02 | 微梦创科网络科技(中国)有限公司 | UGC tag clustering method and apparatus for social platforms |
| CN106845518A (zh) * | 2016-12-19 | 2017-06-13 | 苏州蓝盛电子有限公司 | Automobile fault data diagnosis method based on K-means cluster analysis algorithm |
| CN110414569A (zh) * | 2019-07-03 | 2019-11-05 | 北京小米智能科技有限公司 | Clustering implementation method and apparatus |
| US20200082809A1 (en) * | 2016-12-14 | 2020-03-12 | International Business Machines Corporation | Using recurrent neural network for partitioning of audio data into segments that each correspond to a speech feature cluster identifier |
| WO2020199013A1 (en) * | 2019-03-29 | 2020-10-08 | Microsoft Technology Licensing, Llc | Speaker diarization with early-stop clustering |
| CN111899755A (zh) * | 2020-08-11 | 2020-11-06 | 华院数据技术(上海)有限公司 | Speaker voice separation method and related device |
| CN111966798A (zh) * | 2020-07-24 | 2020-11-20 | 北京奇保信安科技有限公司 | Intent recognition method and apparatus based on multi-round K-means algorithm, and electronic device |
| CN113593597A (zh) * | 2021-08-27 | 2021-11-02 | 中国电信股份有限公司 | Voice noise filtering method and apparatus, electronic device and medium |
Family Cites Families (13)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US9368116B2 (en) * | 2012-09-07 | 2016-06-14 | Verint Systems Ltd. | Speaker separation in diarization |
| WO2016022588A1 (en) * | 2014-08-04 | 2016-02-11 | Flagler Llc | Voice tallying system |
| US10141009B2 (en) * | 2016-06-28 | 2018-11-27 | Pindrop Security, Inc. | System and method for cluster-based audio event detection |
| US10440325B1 (en) * | 2018-07-17 | 2019-10-08 | International Business Machines Corporation | Context-based natural language participant modeling for videoconference focus classification |
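| CN111291177B (zh) * | 2018-12-06 | 2024-08-02 | 中兴通讯股份有限公司 | Information processing method and apparatus, and computer storage medium |
| CN110930984A (zh) * | 2019-12-04 | 2020-03-27 | 北京搜狗科技发展有限公司 | Voice processing method and apparatus, and electronic device |
| CN111462758A (zh) * | 2020-03-02 | 2020-07-28 | 深圳壹账通智能科技有限公司 | Method, apparatus, device and storage medium for intelligent conference role classification |
| US11546690B2 (en) * | 2020-04-27 | 2023-01-03 | Orcam Technologies Ltd. | Processing audio and video |
| CN111599346B (zh) * | 2020-05-19 | 2024-02-20 | 科大讯飞股份有限公司 | Speaker clustering method, apparatus, device and storage medium |
| CN111477251B (zh) * | 2020-05-21 | 2023-09-05 | 北京百度网讯科技有限公司 | Model evaluation method and apparatus, and electronic device |
| CN112420069A (zh) * | 2020-11-18 | 2021-02-26 | 北京云从科技有限公司 | Voice processing method and apparatus, machine-readable medium and device |
| CN112562693B (zh) * | 2021-02-24 | 2021-05-28 | 北京远鉴信息技术有限公司 | Clustering-based speaker determination method, determination apparatus and electronic device |
| CN113450773A (zh) * | 2021-05-11 | 2021-09-28 | 多益网络有限公司 | Video transcript generation method and apparatus, storage medium and electronic device |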
2021
- 2021-11-16 CN CN202111351380.XA patent/CN113808578B/zh active Active
2022
- 2022-11-08 EP EP22894679.4A patent/EP4375988B1/en active Active
- 2022-11-08 US US18/685,019 patent/US20240355335A1/en active Pending
- 2022-11-08 WO PCT/CN2022/130728 patent/WO2023088142A1/zh not_active Ceased
Non-Patent Citations (1)
| Title |
|---|
| See also references of EP4375988A4 |
Also Published As
| Publication number | Publication date |
|---|---|
| US20240355335A1 (en) | 2024-10-24 |
| EP4375988A1 (en) | 2024-05-29 |
| EP4375988B1 (en) | 2026-01-14 |
| EP4375988A4 (en) | 2024-09-25 |
| CN113808578B (zh) | 2022-04-15 |
| CN113808578A (zh) | 2021-12-17 |
Similar Documents
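| Publication | Publication Date | Title |
|---|---|---|
| WO2023088142A1 (zh) | | Audio signal processing method, apparatus, device and storage medium |
| CN109829433B (zh) | | Face image recognition method and apparatus, electronic device and storage medium |
| CN107102746B (zh) | | Candidate word generation method and apparatus, and apparatus for candidate word generation |
| CN103914518B (zh) | | Clustering method and related apparatus |
| CN109829435B (zh) | | Video image processing method and apparatus, and computer-readable medium |
| US20160012820A1 (en) | | Multilevel speech recognition method and apparatus |
| WO2021027344A1 (zh) | | Image processing method and apparatus, electronic device and storage medium |
| US20160210965A1 (en) | | Method and apparatus for speech recognition |
| EP2757493A2 (en) | | Natural language processing method and system |
| CN110781957A (zh) | | Image processing method and apparatus, electronic device and storage medium |
| CN108108455B (zh) | | Destination push method and apparatus, storage medium and electronic device |
| WO2020228163A1 (zh) | | Image processing method and apparatus, electronic device and storage medium |
| CN114139726A (zh) | | Data processing method and apparatus, electronic device and storage medium |
| CN112333596A (zh) | | Earphone equalizer adjustment method and apparatus, server and medium |
| CN107133361A (zh) | | Gesture recognition method and apparatus, and terminal device |
| CN117011581A (zh) | | Image recognition method, medium, apparatus and computing device |
| CN109003607A (zh) | | Speech recognition method and apparatus, storage medium and electronic device |
| CN108922520A (zh) | | Speech recognition method and apparatus, storage medium and electronic device |
| CN111797880A (zh) | | Data processing method and apparatus, storage medium and electronic device |
| CN109583583B (zh) | | Neural network training method and apparatus, computer device and readable medium |
| WO2025055714A1 (zh) | | Pedestrian re-identification method and apparatus, electronic device and storage medium |
| CN113298747B (zh) | | Picture and video detection method and apparatus |
| CN114550728B (zh) | | Method and apparatus for labeling speakers, and electronic device |
| CN111368015B (zh) | | Method and apparatus for compressing a map |
| KR102131353B1 (ko) | | Method for applying feedback of machine learning prediction data and system thereof |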
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | EP: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 22894679; Country of ref document: EP; Kind code of ref document: A1 |
| | WWE | WIPO information: entry into national phase | Ref document number: 2022894679; Country of ref document: EP |
| | WWE | WIPO information: entry into national phase | Ref document number: 18685019; Country of ref document: US |
| | ENP | Entry into the national phase | Ref document number: 2022894679; Country of ref document: EP; Effective date: 20240219 |
| | WWG | WIPO information: grant in national office | Ref document number: 2022894679; Country of ref document: EP |