WO2023088142A1 - Audio signal processing method, apparatus, device and storage medium - Google Patents

Audio signal processing method, apparatus, device and storage medium

Info

Publication number
WO2023088142A1
Authority
WO
WIPO (PCT)
Prior art keywords
audio
sets
segments
audio segments
cluster center
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/CN2022/130728
Other languages
English (en)
French (fr)
Inventor
王宪亮
索宏彬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Damo Hangzhou Technology Co Ltd
Original Assignee
Alibaba Damo Hangzhou Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Damo Hangzhou Technology Co Ltd
Priority to US18/685,019 (patent US20240355335A1)
Priority to EP22894679.4A (patent EP4375988B1)
Publication of WO2023088142A1
Anticipated expiration
Legal status: Ceased

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/02Feature extraction for speech recognition; Selection of recognition unit
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00Speaker identification or verification techniques
    • G10L17/06Decision making techniques; Pattern matching strategies
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/04Segmentation; Word boundary detection
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/06Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063Training
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/08Speech classification or search
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/08Speech classification or search
    • G10L15/16Speech classification or search using artificial neural networks
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00Speaker identification or verification techniques
    • G10L17/04Training, enrolment or model building
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/06Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063Training
    • G10L2015/0631Creating reference templates; Clustering

Definitions

  • the present disclosure relates to the field of information technology, and in particular to an audio signal processing method, device, equipment and storage medium.
  • AI artificial intelligence
  • the present disclosure provides an audio signal processing method, apparatus, device and storage medium, which improve the accuracy of unsupervised role separation based on single-channel speech.
  • an embodiment of the present disclosure provides a method for processing roles in a conference scene, including:
  • according to the second cluster center of the first set, determining one or more second target segments in the first set, where the similarity between the feature information corresponding to each second target segment and the second cluster center of the first set is greater than or equal to a second threshold;
  • an audio signal processing method including:
  • the plurality of audio segments are clustered to obtain one or more second sets, wherein the audio segments in the same second set correspond to the same role label.
  • an audio signal processing device including:
  • a segmentation module configured to segment the audio signal to obtain multiple audio segments;
  • a clustering module configured to cluster the multiple audio segments according to the feature information of each of the multiple audio segments to obtain one or more first sets
  • a determining module configured to determine the first cluster center of each of the first sets according to the feature information of the audio segments included in each of the first sets;
  • the clustering module is further configured to: cluster the multiple audio segments according to the first cluster center of each of the first sets to obtain one or more second sets, wherein the audio segments in the same second set correspond to the same role label.
  • an electronic device including:
  • the computer program is stored in the memory and is configured to be executed by the processor to implement the method as described in the first aspect or the second aspect.
  • an embodiment of the present disclosure provides a computer-readable storage medium, on which a computer program is stored, and the computer program is executed by a processor to implement the method described in the first aspect or the second aspect.
  • an embodiment of the present disclosure provides a conference system, where the system includes a terminal and a server; wherein, the terminal and the server are connected by communication;
  • the terminal is used to send audio signals of conference multi-roles to the server, and the server is used to execute the method described in the second aspect; or
  • the server is configured to send the conference multi-role audio signal to the terminal, and the terminal is configured to execute the method described in the second aspect.
  • multiple audio segments are obtained by segmenting the audio signal, and the multiple audio segments are clustered according to the feature information of each audio segment to obtain one or more first sets. Further, the first cluster center of each first set is determined according to the feature information of the audio segments included in that first set, and the multiple audio segments are clustered according to the first cluster center of each first set to obtain one or more second sets, wherein the audio segments in the same second set correspond to the same role label. That is to say, after the initial clustering is performed on the multiple audio segments, the multiple audio segments can be clustered again according to the first cluster center of each first set, thereby improving the accuracy of unsupervised role separation for single-channel speech.
  • FIG. 1 is a flowchart of an audio signal processing method provided by an embodiment of the present disclosure
  • FIG. 2 is a schematic diagram of an application scenario provided by an embodiment of the present disclosure
  • FIG. 3 is a flowchart of an audio signal processing method provided by another embodiment of the present disclosure.
  • FIG. 4 is a flowchart of an audio signal processing method provided by another embodiment of the present disclosure.
  • FIG. 5 is a schematic diagram of a clustering result provided by an embodiment of the present disclosure.
  • FIG. 6 is a flowchart of an audio signal processing method provided by another embodiment of the present disclosure.
  • FIG. 7 is a schematic structural diagram of an audio signal processing device provided by another embodiment of the present disclosure.
  • FIG. 8 is a schematic structural diagram of an embodiment of an electronic device provided by an embodiment of the present disclosure.
  • unsupervised role separation based on single-channel voice is a necessary and challenging technology in conference systems, and has a wide range of application requirements.
  • the accuracy of current unsupervised role separation based on single-channel speech is low.
  • the unsupervised role separation specifically refers to obtaining the number of roles in the speech and the time information of each role speaking when the role information is unknown.
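  • To make the expected output of unsupervised role separation concrete, the following is a minimal sketch, not taken from the patent, that turns per-segment role labels into the number of roles and the speaking intervals of each role. The function name `role_timeline` and the `(start, end)` tuple layout are assumptions made for illustration.

```python
from typing import List, Tuple


def role_timeline(
    segments: List[Tuple[float, float]], labels: List[int]
) -> Tuple[int, List[Tuple[int, float, float]]]:
    """Return (number of roles, list of (role, start, end) speaking turns).

    `segments` holds (start_s, end_s) per audio segment in time order and
    `labels` holds the role label assigned to each segment (hypothetical layout).
    """
    turns: List[Tuple[int, float, float]] = []
    for (start, end), label in zip(segments, labels):
        # Merge with the previous turn when the same role keeps speaking
        # and the two segments are contiguous in time.
        if turns and turns[-1][0] == label and turns[-1][2] == start:
            turns[-1] = (label, turns[-1][1], end)
        else:
            turns.append((label, start, end))
    return len(set(labels)), turns


if __name__ == "__main__":
    n_roles, turns = role_timeline([(0.0, 1.5), (1.5, 3.0), (3.0, 4.0)], [0, 0, 1])
    print(n_roles, turns)
```

  • The role labels here are anonymous integers, which matches the unsupervised setting: the method recovers how many roles speak and when, not who they are.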
  • an embodiment of the present disclosure provides an audio signal processing method, which will be introduced below in conjunction with specific embodiments.
  • FIG. 1 is a flowchart of an audio signal processing method provided by an embodiment of the present disclosure.
  • This embodiment is applicable to the situation where the audio signal is processed in the client. The method can be executed by an audio signal processing device, the device can be implemented in the form of software and/or hardware, and the device can be configured in an electronic device, such as a terminal, including smartphones, PDAs, tablets, wearables with displays, desktops, laptops, all-in-ones, smart home devices, and more.
  • this embodiment can also be applied to the situation where the audio signal is processed in the server. The method can be executed by an audio signal processing device, the device can be implemented in the form of software and/or hardware, and the device can be configured in an electronic device, for example, a server.
  • the audio signal processing method described in this embodiment may be applicable to application scenarios such as unsupervised role separation of single-channel speech, speech recognition, and a conference system. As shown in Figure 1, the specific steps of the method are as follows:
  • the terminal 21 can obtain an audio signal from a server 22 .
  • an audio signal is stored locally in the terminal 21 .
  • the terminal 21 may collect audio signals through an audio collection module.
  • the audio signal may be a single-channel voice, and further, the terminal 21 may perform unsupervised role separation on the audio signal.
  • the terminal 21 may also send its local or collected audio signal to the server 22, and the server 22 performs unsupervised role separation on the audio signal. The following description takes the case in which the terminal 21 performs unsupervised role separation on the audio signal as an example.
  • the terminal 21 may segment the audio signal to obtain multiple audio segments.
  • methods such as voice activity detection (VAD) and the Bayesian Information Criterion (BIC) can be used for segmentation processing.
  • An audio segment may also be called a voice segment, and each audio segment obtained by segmenting may be an audio segment of 1-2 seconds.
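  • The segmentation step can be sketched as follows. This is an illustrative stand-in rather than the patent's implementation: it uses a simple frame-energy test in place of a real VAD, does not show BIC-based change detection, and the parameter names and defaults (`frame_ms`, `energy_threshold`, `max_segment_s`) are assumptions.

```python
from typing import List, Sequence, Tuple


def segment_audio(
    samples: Sequence[float],
    sample_rate: int,
    frame_ms: int = 20,
    energy_threshold: float = 0.01,
    max_segment_s: float = 1.5,
) -> List[Tuple[float, float]]:
    """Return (start_s, end_s) segments covering the speech-like regions."""
    frame_len = max(1, sample_rate * frame_ms // 1000)
    max_frames = max(1, int(max_segment_s * 1000 / frame_ms))
    n_frames = len(samples) // frame_len
    segments: List[Tuple[float, float]] = []

    def flush(start: int, end: int) -> None:
        # Split one speech run [start, end) of frames into short segments.
        for s in range(start, end, max_frames):
            e = min(s + max_frames, end)
            segments.append((s * frame_len / sample_rate, e * frame_len / sample_rate))

    run_start = None
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        energy = sum(x * x for x in frame) / frame_len
        if energy >= energy_threshold:
            if run_start is None:
                run_start = i  # a speech run begins
        elif run_start is not None:
            flush(run_start, i)  # silence ends the current run
            run_start = None
    if run_start is not None:
        flush(run_start, n_frames)
    return segments


if __name__ == "__main__":
    # 1 s of silence, 3 s of constant-amplitude "speech", 0.5 s of silence.
    signal = [0.0] * 1000 + [0.5] * 3000 + [0.0] * 500
    print(segment_audio(signal, sample_rate=1000))
```

  • Capping the segment length keeps each segment in the 1-2 second range mentioned above, so that each segment is likely to contain a single role.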
  • S102. Perform clustering processing on the multiple audio segments according to feature information of each of the multiple audio segments to obtain one or more first sets.
  • the terminal 21 may use x-vector, ResNet or another embedding method to extract feature information of each audio segment, and the feature information may specifically be an embedding feature.
  • x-vector and ResNet are embedding methods based on neural network models.
  • the terminal 21 may calculate the similarity between every two audio segments according to the embedding features of each audio segment. It can be understood that the similarity between the audio segment A and the audio segment B may be the similarity between the embedding features of the audio segment A and the embedding features of the audio segment B. The greater the similarity, the smaller the distance between the embedding feature of audio segment A and the embedding feature of audio segment B, and at the same time, the smaller the distance between audio segment A and audio segment B.
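  • The pairwise similarity computation can be sketched as below. The patent does not name a similarity measure; cosine similarity is assumed here, and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two embedding features (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_matrix(embeddings: Sequence[Sequence[float]]) -> List[List[float]]:
    """Similarity between every two audio segments, as an n x n matrix."""
    n = len(embeddings)
    return [
        [cosine_similarity(embeddings[i], embeddings[j]) for j in range(n)]
        for i in range(n)
    ]


if __name__ == "__main__":
    for row in similarity_matrix([[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]]):
        print(row)
```

  • Under this assumption a matching distance can be taken as one minus the similarity, which is consistent with the statement that greater similarity means smaller distance.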
  • the clustering method of agglomerative hierarchical clustering (AHC) may specifically be: according to the similarity between every two audio segments in the plurality of audio segments, determine the two audio segments with the highest similarity score, and merge the two audio segments into a new audio segment. Further, calculate the similarity between the new audio segment and each of the other audio segments, and repeat the merging process and the similarity calculation until the constraint criterion is reached.
  • the merging is then stopped, so as to obtain one or more first sets.
  • the clustering method is not limited to AHC, and may also be another clustering algorithm, for example, the k-means clustering algorithm (k-means).
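  • A minimal sketch of the AHC procedure follows. Several details are assumptions not stated in the patent: cosine similarity as the score, average linkage for comparing merged groups, and a similarity floor (`stop_threshold`) as the constraint criterion.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def ahc(embeddings: Sequence[Sequence[float]], stop_threshold: float = 0.5) -> List[List[int]]:
    """Greedy agglomerative clustering; returns lists of segment indices."""
    clusters: List[List[int]] = [[i] for i in range(len(embeddings))]

    def linkage(c1: List[int], c2: List[int]) -> float:
        # Average similarity over all cross-cluster pairs (assumed linkage).
        total = sum(cosine_similarity(embeddings[i], embeddings[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        best_score, best_a, best_b = -math.inf, 0, 1
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                score = linkage(clusters[a], clusters[b])
                if score > best_score:
                    best_score, best_a, best_b = score, a, b
        if best_score < stop_threshold:
            break  # constraint criterion reached: stop merging
        merged = clusters[best_a] + clusters[best_b]
        clusters = [c for k, c in enumerate(clusters) if k not in (best_a, best_b)]
        clusters.append(merged)
    return clusters


if __name__ == "__main__":
    emb = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    print(ahc(emb))
```

  • Each returned list of indices plays the part of one first set. The pairwise search is quadratic per merge, which is acceptable for a sketch but would normally be replaced by a cached similarity matrix.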
  • after the above-mentioned clustering method is used to cluster the multiple audio segments obtained by the segmentation processing, for each first set in the one or more first sets, the first cluster center of the first set is determined according to the embedding features corresponding to the one or more audio segments included in that first set.
  • S104. Perform clustering processing on the multiple audio segments according to the first cluster center of each of the first sets to obtain one or more second sets, wherein the audio segments in the same second set correspond to the same role label.
  • the multiple audio segments obtained after segmentation processing can be re-clustered according to the first cluster center of each first set to obtain one or more second sets, and the audio segments in the same second set correspond to the same role label.
  • one or more second sets may be recorded as the updated clustering result.
  • the updated clustering result is a result of updating the initial clustering result.
  • multiple audio segments are obtained by segmenting the audio signal, and the multiple audio segments are clustered according to the feature information of each audio segment to obtain one or more first sets. Further, the first cluster center of each first set is determined according to the feature information of the audio segments included in that first set, and the multiple audio segments are clustered according to the first cluster center of each first set to obtain one or more second sets, wherein the audio segments in the same second set correspond to the same role label. That is to say, after the initial clustering is performed on the multiple audio segments, the multiple audio segments can be clustered again according to the first cluster center of each first set, thereby improving the accuracy of unsupervised role separation for single-channel speech.
  • the first cluster center of each of the first sets can be determined.
  • determining the first cluster center of each of the first sets includes: determining a first target segment in the first set, where the sum of the similarity scores between the first target segment and the other audio segments in the first set is greater than a first threshold; and using the feature information corresponding to the first target segment as the first cluster center of the first set.
  • the first set 1 includes more than one audio segment.
  • the first set 1 includes audio segment A, audio segment B and audio segment C.
  • an audio segment can be determined from the audio segment A, the audio segment B and the audio segment C as the first target segment, and the first target segment is the segment that represents the first cluster center. For example, assuming that audio segment A is used as the first target segment, the similarity score between audio segment A and audio segment B and the similarity score between audio segment A and audio segment C are calculated, and the two similarity scores are accumulated to obtain the sum of the similarity scores.
  • if the sum of the similarity scores is greater than the first threshold, the audio segment A can be used as the first target segment; otherwise, continue to traverse to find the first target segment. Alternatively, respectively calculate the sum of similarity scores when audio segment A is assumed to be the first target segment, the sum when audio segment B is assumed to be the first target segment, and the sum when audio segment C is assumed to be the first target segment. If the sum of similarity scores is largest when audio segment A is assumed to be the first target segment, then audio segment A is determined as the first target segment.
  • the sum of similarity scores when audio segment B or audio segment C is assumed to be the first target segment is calculated in the same way as for audio segment A, and the process will not be repeated here.
  • the feature information corresponding to the first target segment is used as the first cluster center of the first set. It can be understood that the first target segment obtained in this way is the segment that can best represent the first cluster center. Therefore, the one or more second sets obtained after performing S104 are more accurate than the one or more first sets obtained in S102.
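  • The second variant described above, choosing the segment with the largest sum of similarity scores, can be sketched as follows. Cosine similarity and the function name `pick_first_target` are assumptions for illustration.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def pick_first_target(
    embeddings: Sequence[Sequence[float]], members: List[int]
) -> Tuple[int, float]:
    """Return (index, score sum) of the member most similar to all other members.

    `members` lists the indices of the audio segments in one first set.
    """
    best_index, best_sum = members[0], -math.inf
    for i in members:
        # Accumulate the similarity scores between candidate i and the others.
        total = sum(
            cosine_similarity(embeddings[i], embeddings[j]) for j in members if j != i
        )
        if total > best_sum:
            best_index, best_sum = i, total
    return best_index, best_sum


if __name__ == "__main__":
    # Segments A, B, C of one first set; B lies between A and C.
    emb = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
    index, total = pick_first_target(emb, [0, 1, 2])
    print(index, round(total, 4))
```

  • The embedding of the returned segment then serves as the first cluster center. Because the center is an actual segment rather than an average, a single outlying segment cannot pull it away from the bulk of the set.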
  • determining the first cluster center of each of the first sets includes the following steps, as shown in FIG. 3:
  • determining the second cluster center of the first set according to the feature information of the audio segments included in the first set includes: calculating a first mean value of the feature information of the audio segments included in the first set; and using the first mean value as the second cluster center of the first set.
  • the first set 1 includes audio segment A, audio segment B and audio segment C. Since audio segment A, audio segment B, and audio segment C each correspond to an embedding feature, the average value of the embedding features corresponding to audio segment A, audio segment B, and audio segment C can be calculated, and this average value is recorded as the first mean value. Further, the first mean value is used as the initial cluster center of the first set 1, and the initial cluster center is recorded as the second cluster center.
  • the initial cluster center of the first set 1, that is, the second cluster center may be updated to obtain an updated cluster center of the first set 1, and the updated cluster center is recorded as the first cluster center.
  • updating the second cluster center of the first set to obtain the first cluster center of the first set includes: according to the second cluster center of the first set, determining one or more second target segments in the first set, where the similarity between the feature information corresponding to each second target segment and the second cluster center of the first set is greater than or equal to a second threshold; and determining the first cluster center of the first set based on the one or more second target segments in the first set.
  • one or more second target segments are determined from audio segment A, audio segment B, and audio segment C, and the one or more second target segments may be the K-nearest neighbor segments of the second cluster center. That is to say, the similarity between the embedding feature of each second target segment and the second cluster center is greater than or equal to the second threshold, that is, the distance between the embedding feature of each second target segment and the second cluster center is smaller than a certain threshold.
  • audio segment A and audio segment B may serve as the second target segment respectively.
  • the first cluster center of the first set 1, that is, the updated cluster center can be determined.
  • determining the first cluster center of the first set according to one or more second target segments in the first set includes: calculating a second mean value of the feature information corresponding to the one or more second target segments in the first set; and using the second mean value as the first cluster center of the first set.
  • the average value of the embedding features of audio segment A and the embedding features of audio segment B is calculated, and the average value is recorded as the second mean value, and further, the second mean value is used as the first cluster center of the first set 1 .
  • the second cluster center of the first set 1 can be calculated through S301, and the second cluster center can be the average value of the embedding features corresponding to the audio segment A, the audio segment B, and the audio segment C respectively.
  • the average value of the embedding feature of audio segment A and the embedding feature of audio segment B can be used as the first cluster center.
  • the first cluster center is more accurate than the second cluster center, therefore, the one or more second sets obtained after performing S104 are more accurate than the one or more first sets obtained by S102.
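  • The mean-then-refine procedure can be sketched as follows, reproducing the A/B/C example in which only A and B are close enough to the initial center. Cosine similarity, the threshold value, and the fallback when no segment passes the threshold are assumptions not given in the patent.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def mean_vector(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def refine_center(vectors: Sequence[Sequence[float]], second_threshold: float) -> List[float]:
    """Compute the first cluster center of one first set."""
    # First mean value -> second (initial) cluster center.
    second_center = mean_vector(vectors)
    # Second target segments: members close enough to the initial center.
    targets = [v for v in vectors if cosine_similarity(v, second_center) >= second_threshold]
    if not targets:
        return second_center  # assumed fallback: keep the initial center
    # Second mean value -> first (updated) cluster center.
    return mean_vector(targets)


if __name__ == "__main__":
    segment_a, segment_b, segment_c = [1.0, 0.0], [0.9, 0.1], [0.0, 1.0]
    print(refine_center([segment_a, segment_b, segment_c], second_threshold=0.85))
```

  • In this example segment C is excluded from the second mean, so the updated center moves toward A and B and is no longer skewed by the outlying segment.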
  • FIG. 4 is a flowchart of an audio signal processing method provided by another embodiment of the present disclosure. The specific steps of the method are as follows:
  • S401. Perform segmentation processing on an audio signal to obtain multiple audio segments.
  • S402. Perform clustering processing on the multiple audio segments according to the feature information of each of the multiple audio segments to obtain one or more first sets.
  • the first cluster centers respectively corresponding to the first set 1, the first set 2, and the first set 3 are obtained.
  • the first cluster center can be obtained through several methods as described above, which will not be repeated here.
  • for each audio segment, calculate the distance between the audio segment and the first cluster center of the first set 1, the distance between the audio segment and the first cluster center of the first set 2, and the distance between the audio segment and the first cluster center of the first set 3. Since these 3 distances may be different, it can be determined from the 3 distances which first cluster center the audio segment is closest to.
  • For example, if there are 3 first sets, then there are 3 first cluster centers, and each first cluster center can correspond to one second set, so there are 3 second sets, and the one or more audio segments included in each second set are the audio segments closest to the corresponding first cluster center.
  • the distances between the one or more audio segments included in the second set and the first cluster center are less than or equal to the third threshold. That is to say, the second set may be a result of partially or fully adjusting the one or more audio segments in the first set. Wherein, the audio segments in the same second set correspond to the same role label.
  • the first set 1, the first set 2 and the first set 3 are initial clustering results. After determining the first cluster centers corresponding to the first set 1, the first set 2, and the first set 3 respectively, it is determined that the distance between the audio segment C and the first cluster center of the first set 2 is the shortest, so the audio segment C may be adjusted from the first set 1 into the first set 2. Similarly, the distance between the audio segment F and the first cluster center of the first set 3 is the shortest, so the audio segment F is adjusted from the first set 2 to the first set 3, thereby obtaining the re-clustering results, namely the second sets corresponding to each first set.
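  • The reassignment step can be sketched as follows. Each audio segment receives the label of its nearest first cluster center; the distance is assumed to be one minus cosine similarity, and the third-threshold check mentioned above is omitted for brevity.

```python
import math
from typing import List, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Assumed distance: 1 - cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - (dot / (na * nb) if na and nb else 0.0)


def reassign(
    embeddings: Sequence[Sequence[float]], centers: Sequence[Sequence[float]]
) -> List[int]:
    """Return, for each audio segment, the index of its nearest first cluster center."""
    labels: List[int] = []
    for vector in embeddings:
        distances = [cosine_distance(vector, center) for center in centers]
        labels.append(min(range(len(centers)), key=lambda k: distances[k]))
    return labels


if __name__ == "__main__":
    centers = [[1.0, 0.0], [0.0, 1.0]]
    segments = [[1.0, 0.0], [0.8, 0.2], [0.3, 0.7], [0.0, 1.0]]
    print(reassign(segments, centers))
```

  • Segments that share a returned index form one second set and therefore share one role label.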
  • multiple audio segments are obtained by segmenting the audio signal, and the multiple audio segments are clustered according to the feature information of each audio segment to obtain one or more first sets. Further, the first cluster center of each first set is determined according to the feature information of the audio segments included in that first set, and the multiple audio segments are clustered according to the first cluster center of each first set to obtain one or more second sets, wherein the audio segments in the same second set correspond to the same role label. That is to say, after the initial clustering is performed on the multiple audio segments, the multiple audio segments can be clustered again according to the first cluster center of each first set, thereby improving the accuracy of unsupervised role separation for single-channel speech. This effectively avoids clustering errors caused by inaccurate cluster centers, for example, dividing two audio segments of the same role into different classes, or assigning some audio segments of one role to the class of another role.
  • FIG. 6 is a flowchart of an audio signal processing method provided by another embodiment of the present disclosure. The specific steps of the method are as follows:
  • S602. Perform clustering processing on the multiple audio segments according to the feature information of each of the multiple audio segments to obtain one or more first sets.
  • S605. Determine one or more second target segments in the first set according to the second cluster center of the first set, where the similarity between the feature information corresponding to each second target segment and the second cluster center of the first set is greater than or equal to the second threshold.
  • the second set after performing S609, may also be used as the first set, so as to repeatedly execute S603-S609.
  • the number of iterations of S603-S609 may be a preset number. That is to say, on the basis of the initial clustering result, the second cluster center and the first cluster center may be updated multiple times, and role labels may be reassigned to each audio segment.
  • in each iteration, the K-nearest neighbor segments of the second cluster center, that is, the one or more second target segments, may change. Therefore, the first cluster center can be updated in each iteration and can continuously approach the real cluster center, thereby greatly reducing the influence of noise points on the cluster center and ensuring the accuracy of the cluster center.
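  • The iterative refinement can be sketched end to end as follows. This is a minimal illustration under assumptions: cosine similarity, a fixed `second_threshold`, a fallback to all members when no segment passes the threshold, and an early stop when the labels no longer change (the patent only mentions a preset number of iterations).

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def mean_vector(vectors: Sequence[Sequence[float]]) -> List[float]:
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def iterate_reclustering(
    embeddings: Sequence[Sequence[float]],
    labels: List[int],
    n_iters: int = 3,
    second_threshold: float = 0.8,
) -> List[int]:
    """Repeat center refinement and reassignment, starting from initial labels."""
    for _ in range(n_iters):
        centers: Dict[int, List[float]] = {}
        for label in sorted(set(labels)):
            members = [embeddings[i] for i, l in enumerate(labels) if l == label]
            second_center = mean_vector(members)  # initial center of this set
            targets = [
                v for v in members
                if cosine_similarity(v, second_center) >= second_threshold
            ] or members  # assumed fallback when nothing passes the threshold
            centers[label] = mean_vector(targets)  # updated (first) center
        # Reassign every segment to the most similar updated center.
        new_labels = [
            max(centers, key=lambda label: cosine_similarity(v, centers[label]))
            for v in embeddings
        ]
        if new_labels == labels:
            break  # assumed early stop: assignments are stable
        labels = new_labels
    return labels


if __name__ == "__main__":
    emb = [[1.0, 0.0], [0.9, 0.1], [0.2, 0.8], [0.0, 1.0], [0.1, 0.9]]
    # Segment 2 starts in the wrong set and is moved during the first iteration.
    print(iterate_reclustering(emb, [0, 0, 0, 1, 1]))
```

  • Each pass uses the second sets of the previous pass as the new first sets, so a segment that was misassigned initially stops contributing to the wrong center once it has been moved.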
  • the accuracy of role separation can be increased from 90% to 94% through the method described in the embodiment of the present disclosure, and the effect is significantly improved.
  • the embodiment of the present disclosure also provides a method for processing roles in a conference scene, the method includes the following steps:
  • according to the second cluster center of the first set, determine one or more second target segments in the first set, where the similarity between the feature information corresponding to each second target segment and the second cluster center of the first set is greater than or equal to the second threshold.
  • FIG. 7 is a schematic structural diagram of an audio signal processing device provided by an embodiment of the present disclosure.
  • the audio signal processing device provided in the embodiment of the present disclosure can execute the processing flow provided in the embodiment of the audio signal processing method.
  • the audio signal processing device 70 includes:
  • a segmentation module 71, configured to perform segmentation processing on the audio signal to obtain multiple audio segments;
  • a clustering module 72 configured to perform clustering processing on the multiple audio segments according to the feature information of each of the multiple audio segments to obtain one or more first sets;
  • a determining module 73 configured to determine the first cluster center of each of the first sets according to the feature information of the audio segments included in each of the first sets;
  • the clustering module 72 is further configured to: perform clustering processing on the plurality of audio segments according to the first cluster center of each of the first sets to obtain one or more second sets, wherein the audio segments in the same second set correspond to the same role label.
  • the determining module 73 is specifically configured to: determine the first target segment in the first set, where the sum of the similarity scores between the first target segment and the other audio segments in the first set is greater than a first threshold; and use the feature information corresponding to the first target segment as the first cluster center of the first set.
  • the determining module 73 includes a determining unit 731 and an updating unit 732, wherein the determining unit 731 is configured to determine the second cluster center of the first set according to the feature information of the audio segments included in the first set
  • the updating unit 732 is configured to update the second cluster centers of the first set to obtain the first cluster centers of the first set.
  • the determining unit 731 is specifically configured to: calculate a first mean value of the feature information of the audio segments included in the first set; and use the first mean value as a second cluster center of the first set.
  • the updating unit 732 is specifically configured to: determine one or more second target segments in the first set according to the second cluster center of the first set, where the similarity between the feature information corresponding to each second target segment and the second cluster center of the first set is greater than or equal to the second threshold; and determine the first cluster center of the first set according to the one or more second target segments in the first set.
  • when the updating unit 732 determines the first cluster center of the first set according to one or more second target segments in the first set, it is specifically configured to: calculate the second mean value of the feature information corresponding to the one or more second target segments in the first set; and use the second mean value as the first cluster center of the first set.
  • when the clustering module 72 performs clustering processing on the plurality of audio segments according to the first cluster center of each of the first sets to obtain one or more second sets, it is specifically configured to: for each audio segment in the plurality of audio segments, calculate the distance between the audio segment and each first cluster center according to the feature information of the audio segment and the first cluster center of each of the first sets; and divide the audio segments, among the plurality of audio segments, whose distance from a first cluster center is less than or equal to a third threshold into the second set corresponding to that first cluster center.
  • the audio signal processing device of the embodiment shown in FIG. 7 can be used to implement the technical solution of the above method embodiment, and its implementation principle and technical effect are similar, and will not be repeated here.
  • FIG. 8 is a schematic structural diagram of an embodiment of an electronic device provided by an embodiment of the present disclosure. As shown in FIG. 8 , the electronic device includes a memory 81 and a processor 82 .
  • the memory 81 is used to store programs. In addition to the above-mentioned programs, the memory 81 may also be configured to store other various data to support operations on the electronic device. Examples of such data include instructions for any application or method operating on the electronic device, contact data, phonebook data, messages, pictures, videos, etc.
  • the memory 81 can be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, or a magnetic or optical disk.
  • the processor 82 coupled with the memory 81, executes the program stored in the memory 81 for:
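  • segmenting an audio signal to obtain a plurality of audio segments; clustering the plurality of audio segments according to the feature information of each audio segment to obtain one or more first sets; determining a first cluster center of each first set according to the feature information of the audio segments included in that first set; and clustering the plurality of audio segments according to the first cluster center of each first set to obtain one or more second sets, wherein the audio segments in the same second set correspond to the same role tag.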
  • the electronic device may further include: a communication component 83 , a power supply component 84 , an audio component 85 , a display 86 and other components.
  • FIG. 8 only schematically shows some components, which does not mean that the electronic device only includes the components shown in FIG. 8 .
  • the communication component 83 is configured to facilitate wired or wireless communication between the electronic device and other devices. Electronic devices can access wireless networks based on communication standards, such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component 83 receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 83 also includes a near field communication (NFC) module to facilitate short-range communication.
  • the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology and other technologies.
  • the power supply component 84 provides power for various components of the electronic device.
  • Power supply components 84 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power to electronic devices.
  • the audio component 85 is configured to output and/or input audio signals.
  • the audio component 85 includes a microphone (MIC), which is configured to receive external audio signals when the electronic device is in an operating mode, such as a call mode, a recording mode, or a voice recognition mode.
  • the received audio signal may be further stored in the memory 81 or sent via the communication component 83 .
  • the audio component 85 also includes a speaker for outputting audio signals.
  • the display 86 includes a screen, which may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from a user.
  • the touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensor may not only sense a boundary of a touch or swipe action, but also detect duration and pressure associated with the touch or swipe action.
  • an embodiment of the present disclosure further provides a computer-readable storage medium, on which a computer program is stored, and the computer program is executed by a processor to implement the audio signal processing method described in the above-mentioned embodiments.
  • an embodiment of the present disclosure also provides a conference system, which includes a terminal and a server, the terminal and the server being communicatively connected;
  • the terminal is configured to send a multi-role conference audio signal to the server, and the server is configured to execute the role processing method in a conference scenario described in the above embodiments; or
  • the server is configured to send a multi-role conference audio signal to the terminal, and the terminal is configured to execute the role processing method in a conference scenario described in the above embodiments.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Business, Economics & Management (AREA)
  • Game Theory and Decision Science (AREA)
  • Evolutionary Computation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

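The present disclosure relates to an audio signal processing method, apparatus, device, and storage medium. An audio signal is segmented to obtain a plurality of audio segments, and the plurality of audio segments are clustered according to the feature information of each audio segment to obtain one or more first sets. A first cluster center of each first set is then determined according to the feature information of the audio segments included in that first set, and the plurality of audio segments are clustered according to the first cluster center of each first set to obtain one or more second sets, where the audio segments in the same second set correspond to the same role tag. In other words, after the initial clustering of the plurality of audio segments, the segments can be clustered again according to the first cluster center of each first set, which improves the accuracy of unsupervised role separation based on single-channel speech.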

Description

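Audio signal processing method, apparatus, device, and storage medium
This application claims priority to Chinese patent application No. 202111351380.X, entitled "Audio signal processing method, apparatus, device, and storage medium", filed with the China Patent Office on November 16, 2021, the entire contents of which are incorporated herein by reference.
Technical Field
The present disclosure relates to the field of information technology, and in particular to an audio signal processing method, apparatus, device, and storage medium.
Background
With the continuous development of science and technology, artificial intelligence (AI) technologies such as speech recognition and role separation are being applied more and more widely.
At present, unsupervised role separation based on single-channel speech is an essential and challenging technology in conference systems and is in wide demand.
However, the inventors of the present application have found that the accuracy of current unsupervised role separation based on single-channel speech is low.
Summary
To solve the above technical problem, or at least partially solve it, the present disclosure provides an audio signal processing method, apparatus, device, and storage medium that improve the accuracy of unsupervised role separation based on single-channel speech.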
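In a first aspect, an embodiment of the present disclosure provides a role processing method in a conference scenario, including:
receiving a multi-role conference audio signal;
segmenting the audio signal to obtain a plurality of audio segments;
clustering the plurality of audio segments according to the feature information of each of the plurality of audio segments to obtain one or more first sets;
calculating a first mean value of the feature information of the audio segments included in the first set;
using the first mean value as a second cluster center of the first set;
determining one or more second target segments in the first set according to the second cluster center of the first set, where the similarity between the feature information corresponding to each second target segment and the second cluster center of the first set is greater than or equal to a second threshold;
calculating a second mean value of the feature information corresponding to the one or more second target segments in the first set;
using the second mean value as a first cluster center of the first set;
for each of the plurality of audio segments, calculating the distance between the audio segment and each first cluster center according to the feature information of the audio segment and the first cluster center of each first set;
assigning to a second set those audio segments, among the plurality of audio segments, whose distance from the first cluster center is less than or equal to a third threshold;
determining role information of a plurality of speakers in the audio signal according to the second set; and
taking the second set as the first set and repeating the process from calculating the first mean value to determining the role information.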
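In a second aspect, an embodiment of the present disclosure provides an audio signal processing method, including:
segmenting an audio signal to obtain a plurality of audio segments;
clustering the plurality of audio segments according to the feature information of each of the plurality of audio segments to obtain one or more first sets;
determining a first cluster center of each first set according to the feature information of the audio segments included in that first set; and
clustering the plurality of audio segments according to the first cluster center of each first set to obtain one or more second sets, where the audio segments in the same second set correspond to the same role tag.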
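In a third aspect, an embodiment of the present disclosure provides an audio signal processing apparatus, including:
a segmentation module, configured to segment an audio signal to obtain a plurality of audio segments;
a clustering module, configured to cluster the plurality of audio segments according to the feature information of each of the plurality of audio segments to obtain one or more first sets; and
a determining module, configured to determine a first cluster center of each first set according to the feature information of the audio segments included in that first set;
where the clustering module is further configured to cluster the plurality of audio segments according to the first cluster center of each first set to obtain one or more second sets, and the audio segments in the same second set correspond to the same role tag.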
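In a fourth aspect, an embodiment of the present disclosure provides an electronic device, including:
a memory;
a processor; and
a computer program;
where the computer program is stored in the memory and is configured to be executed by the processor to implement the method according to the first aspect or the second aspect.
In a fifth aspect, an embodiment of the present disclosure provides a computer-readable storage medium on which a computer program is stored, the computer program being executed by a processor to implement the method according to the first aspect or the second aspect.
In a sixth aspect, an embodiment of the present disclosure provides a conference system that includes a terminal and a server, the terminal and the server being communicatively connected;
the terminal is configured to send a multi-role conference audio signal to the server, and the server is configured to execute the method according to the second aspect; or
the server is configured to send a multi-role conference audio signal to the terminal, and the terminal is configured to execute the method according to the second aspect.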
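In the audio signal processing method, apparatus, device, and storage medium provided by the embodiments of the present disclosure, an audio signal is segmented to obtain a plurality of audio segments, and the plurality of audio segments are clustered according to the feature information of each audio segment to obtain one or more first sets. A first cluster center of each first set is then determined according to the feature information of the audio segments included in that first set, and the plurality of audio segments are clustered according to the first cluster center of each first set to obtain one or more second sets, where the audio segments in the same second set correspond to the same role tag. In other words, after the initial clustering of the plurality of audio segments, the segments can be clustered again according to the first cluster center of each first set, which improves the accuracy of unsupervised role separation based on single-channel speech.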
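Brief Description of the Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the present disclosure.
To describe the technical solutions in the embodiments of the present disclosure or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, a person of ordinary skill in the art can derive other drawings from these drawings without creative effort.
FIG. 1 is a flowchart of an audio signal processing method provided by an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of an application scenario provided by an embodiment of the present disclosure;
FIG. 3 is a flowchart of an audio signal processing method provided by another embodiment of the present disclosure;
FIG. 4 is a flowchart of an audio signal processing method provided by yet another embodiment of the present disclosure;
FIG. 5 is a schematic diagram of a clustering result provided by an embodiment of the present disclosure;
FIG. 6 is a flowchart of an audio signal processing method provided by another embodiment of the present disclosure;
FIG. 7 is a schematic structural diagram of an audio signal processing apparatus provided by another embodiment of the present disclosure;
FIG. 8 is a schematic structural diagram of an embodiment of an electronic device provided by an embodiment of the present disclosure.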
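Detailed Description
To make the above objects, features, and advantages of the present disclosure clearer, the solutions of the present disclosure are further described below. It should be noted that the embodiments of the present disclosure and the features in the embodiments may be combined with one another provided there is no conflict.
Many specific details are set forth in the following description to facilitate a full understanding of the present disclosure, but the present disclosure may also be implemented in ways other than those described here; obviously, the embodiments in the specification are only some, not all, of the embodiments of the present disclosure.
Generally, unsupervised role separation based on single-channel speech is an essential and challenging technology in conference systems and is in wide demand. However, its accuracy is currently low. Here, unsupervised role separation means obtaining the number of roles in the speech and the time information of each role's speech when the role information is not known in advance.
To address this problem, the embodiments of the present disclosure provide an audio signal processing method, which is introduced below with reference to specific embodiments.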
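FIG. 1 is a flowchart of an audio signal processing method provided by an embodiment of the present disclosure. This embodiment is applicable to audio signal processing on a client. The method may be executed by an audio signal processing apparatus, which may be implemented in software and/or hardware and may be configured in an electronic device such as a terminal, including a smartphone, a handheld computer, a tablet computer, a wearable device with a display, a desktop computer, a notebook computer, an all-in-one computer, a smart home device, and the like. Alternatively, this embodiment is applicable to audio signal processing on a server side, in which case the method may likewise be executed by an audio signal processing apparatus implemented in software and/or hardware and configured in an electronic device such as a server. The method is described below taking a terminal as an example. In addition, the audio signal processing method of this embodiment is applicable to scenarios such as unsupervised role separation of single-channel speech, speech recognition, and conference systems. As shown in FIG. 1, the method includes the following steps: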
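S101. Segment an audio signal to obtain a plurality of audio segments.
As shown in FIG. 2, the terminal 21 may obtain the audio signal from the server 22. Alternatively, the audio signal is stored locally on the terminal 21. Alternatively, the terminal 21 may capture the audio signal through an audio capture module. Specifically, the audio signal may be single-channel speech, and the terminal 21 may perform unsupervised role separation on it. In some other embodiments, the terminal 21 may instead send its locally stored or captured audio signal to the server 22, and the server 22 performs unsupervised role separation on the audio signal. The following description takes the case where the terminal 21 performs unsupervised role separation on the audio signal as an example.
Specifically, the terminal 21 may segment the audio signal to obtain a plurality of audio segments. The segmentation may use methods such as voice activity detection (VAD) or the Bayesian information criterion (BIC). An audio segment may also be called a speech segment, and each audio segment obtained by segmentation may be 1 to 2 seconds long.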
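S102. Cluster the plurality of audio segments according to the feature information of each of the plurality of audio segments to obtain one or more first sets.
For example, the terminal 21 may use x-vector, Resnet, or another embedding method to extract the feature information of each audio segment; the feature information may specifically be an embedding feature. x-vector and Resnet are both embedding methods based on neural network models. The terminal 21 may then calculate the similarity between every two audio segments according to their embedding features. It can be understood that the similarity between audio segment A and audio segment B may be the similarity between the embedding feature of audio segment A and the embedding feature of audio segment B. The greater the similarity, the smaller the distance between the two embedding features, and hence the smaller the distance between audio segment A and audio segment B.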
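The pairwise similarity described above can be sketched as follows. The disclosure does not name a similarity measure, so cosine similarity is an assumption here, and the function names `cosine_similarity` and `pairwise_similarity` are hypothetical; embeddings are represented as plain lists of floats.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_u * norm_v)


def pairwise_similarity(embeddings):
    """Return the n x n matrix of similarities between all audio segments."""
    n = len(embeddings)
    return [
        [cosine_similarity(embeddings[i], embeddings[j]) for j in range(n)]
        for i in range(n)
    ]
```

Under this assumption a larger similarity corresponds to a smaller distance, for example when the distance is taken as one minus the cosine similarity.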
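Further, agglomerative hierarchical clustering (AHC) is applied to the audio segments obtained by segmentation to obtain one or more first sets, each of which may include one or more audio segments. The one or more first sets may be referred to as the initial clustering result. The AHC procedure may be as follows: according to the similarity between every two of the audio segments, determine the two audio segments with the highest similarity score and merge them into a new audio segment. Then calculate the similarity between every two audio segments among the new audio segment and the remaining audio segments, and repeat the merging and similarity calculation until a stopping criterion is met. For example, merging stops when the similarity score falls below a preset threshold, which yields the one or more first sets. It can be understood that the clustering method is not limited to AHC and may be another clustering algorithm, such as the k-means clustering algorithm (k-means).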
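A minimal sketch of the merge loop described above, under stated assumptions: each merged group is represented by the mean of its members' embeddings, similarity is cosine similarity, and the names `agglomerative_cluster` and `stop_threshold` are hypothetical. The disclosure does not specify how a merged segment is represented, so this is one possible reading rather than the claimed procedure.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def mean_vector(vectors):
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def agglomerative_cluster(embeddings, stop_threshold):
    """Merge the most similar pair until the best similarity drops below the threshold.

    Returns a list of clusters, each a list of segment indices (the "first sets").
    """
    clusters = [[i] for i in range(len(embeddings))]
    while len(clusters) > 1:
        best = None  # (similarity, index_a, index_b)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                center_a = mean_vector([embeddings[i] for i in clusters[a]])
                center_b = mean_vector([embeddings[i] for i in clusters[b]])
                sim = cosine(center_a, center_b)
                if best is None or sim > best[0]:
                    best = (sim, a, b)
        if best[0] < stop_threshold:
            break  # stopping criterion: no pair is similar enough to merge
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters
```

The threshold implicitly determines the number of first sets, which suits the unsupervised setting where the number of roles is not known in advance.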
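S103. Determine a first cluster center of each first set according to the feature information of the audio segments included in that first set.
For example, after the audio segments obtained by segmentation have been clustered by the above clustering method, for each of the one or more first sets, the first cluster center of that first set is determined according to the embedding features of the one or more audio segments it contains.
S104. Cluster the plurality of audio segments according to the first cluster center of each first set to obtain one or more second sets, where the audio segments in the same second set correspond to the same role tag.
After the first cluster center of each first set has been determined, the audio segments obtained by segmentation can be clustered again according to the first cluster center of each first set to obtain one or more second sets, and the audio segments in the same second set correspond to the same role tag. The one or more second sets may be referred to as the updated clustering result, which is the result of updating the initial clustering result.
In this embodiment of the present disclosure, an audio signal is segmented to obtain a plurality of audio segments, and the plurality of audio segments are clustered according to the feature information of each audio segment to obtain one or more first sets. A first cluster center of each first set is then determined according to the feature information of the audio segments included in that first set, and the plurality of audio segments are clustered according to the first cluster center of each first set to obtain one or more second sets, where the audio segments in the same second set correspond to the same role tag. In other words, after the initial clustering of the plurality of audio segments, the segments can be clustered again according to the first cluster center of each first set, which improves the accuracy of unsupervised role separation based on single-channel speech.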
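On the basis of the above embodiment, determining the first cluster center of each first set according to the feature information of the audio segments included in that first set can be implemented in several ways, some of which are introduced below.
In one feasible implementation, determining the first cluster center of each first set according to the feature information of the audio segments included in that first set includes: determining a first target segment in the first set, where the sum of the similarity scores between the first target segment and the other audio segments in the first set is greater than a first threshold; and using the feature information corresponding to the first target segment as the first cluster center of the first set.
For example, the initial clustering of the audio segments yields three first sets, denoted first set 1, first set 2, and first set 3. Each first set includes one or more audio segments. For example, first set 1 includes audio segment A, audio segment B, and audio segment C. To determine the first cluster center of first set 1, one of audio segments A, B, and C may be selected as the first target segment, which may be the segment that represents the first cluster center. For example, taking audio segment A as a candidate first target segment, calculate the similarity score between A and B and the similarity score between A and C, and add the two scores to obtain the sum of similarity scores. If this sum is greater than the first threshold, audio segment A can serve as the first target segment; otherwise, continue traversing the other segments to find the first target segment. Alternatively, calculate the sum of similarity scores for each of A, B, and C taken in turn as the candidate first target segment. If the sum is largest when A is the candidate, A is determined to be the first target segment. The sums for B and C as candidates are calculated in the same way as for A and are not described again here. The feature information corresponding to the first target segment is then used as the first cluster center of the first set. It can be understood that the first target segment obtained in this way is the segment that best represents the first cluster center; therefore, in this approach, the one or more second sets obtained after S104 are more accurate than the one or more first sets obtained in S102.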
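The second variant above, which picks the segment with the largest sum of similarity scores, can be sketched as follows. Cosine similarity is an assumption, and `select_first_target` is a hypothetical name; the function returns the index of the chosen segment within the set, whose embedding would then serve as the first cluster center.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def select_first_target(set_embeddings):
    """Return the index of the segment whose summed similarity to all
    other segments in the same first set is largest."""
    best_index = 0
    best_sum = float("-inf")
    for i, candidate in enumerate(set_embeddings):
        total = sum(
            cosine(candidate, other)
            for j, other in enumerate(set_embeddings)
            if j != i
        )
        if total > best_sum:
            best_index, best_sum = i, total
    return best_index
```

Because the center is an actual segment rather than an average, a single outlier in the set cannot pull it away from the bulk of the segments.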
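In another feasible implementation, determining the first cluster center of each first set according to the feature information of the audio segments included in that first set includes the following steps, as shown in FIG. 3:
S301. Determine a second cluster center of the first set according to the feature information of the audio segments included in the first set.
Optionally, determining the second cluster center of the first set according to the feature information of the audio segments included in the first set includes: calculating a first mean value of the feature information of the audio segments included in the first set; and using the first mean value as the second cluster center of the first set.
For example, first set 1 includes audio segment A, audio segment B, and audio segment C. Since audio segments A, B, and C each have a corresponding embedding feature, the average of these three embedding features can be calculated; this average is denoted the first mean value. The first mean value is then used as the initial cluster center of first set 1, and this initial cluster center is denoted the second cluster center.
S302. Update the second cluster center of the first set to obtain the first cluster center of the first set.
For example, the initial cluster center of first set 1, i.e., the second cluster center, may be updated to obtain the updated cluster center of first set 1, which is denoted the first cluster center.
Optionally, updating the second cluster center of the first set to obtain the first cluster center of the first set includes: determining one or more second target segments in the first set according to the second cluster center of the first set, where the similarity between the feature information corresponding to each second target segment and the second cluster center of the first set is greater than or equal to a second threshold; and determining the first cluster center of the first set according to the one or more second target segments in the first set.
For example, according to the second cluster center of first set 1, one or more second target segments are determined from audio segments A, B, and C; these second target segments may be the K-nearest-neighbor segments of the second cluster center. That is, the similarity between the embedding feature of each second target segment and the second cluster center is greater than or equal to the second threshold, i.e., the distance between the embedding feature of each second target segment and the second cluster center is less than a certain threshold. For example, audio segments A and B may each serve as a second target segment. The first cluster center of first set 1, i.e., the updated cluster center, can then be determined from audio segments A and B.
Optionally, determining the first cluster center of the first set according to the one or more second target segments in the first set includes: calculating a second mean value of the feature information corresponding to the one or more second target segments in the first set; and using the second mean value as the first cluster center of the first set.
For example, the average of the embedding feature of audio segment A and the embedding feature of audio segment B is calculated and denoted the second mean value, and the second mean value is used as the first cluster center of first set 1. It can be understood that S301 yields the second cluster center of first set 1, which may be the average of the embedding features of audio segments A, B, and C. After the K-nearest-neighbor segments of the second cluster center, i.e., audio segments A and B, have been determined, the average of the embedding features of A and B can be used as the first cluster center. The first cluster center is more accurate than the second cluster center; therefore, the one or more second sets obtained after S104 are more accurate than the one or more first sets obtained in S102.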
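Steps S301 and S302 can be sketched as a two-stage mean. The names `refine_center` and `second_threshold` are hypothetical, cosine similarity is an assumption, and falling back to the initial mean when no segment passes the threshold is also an assumption, since the disclosure does not cover that case.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def mean_vector(vectors):
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def refine_center(set_embeddings, second_threshold):
    """Return the first cluster center of one first set."""
    # S301: the first mean value is the second (initial) cluster center.
    second_center = mean_vector(set_embeddings)
    # S302: keep only the segments close enough to the second cluster center.
    targets = [
        e for e in set_embeddings
        if cosine(e, second_center) >= second_threshold
    ]
    if not targets:
        return second_center  # assumed fallback, not stated in the source
    # The second mean value is the first (updated) cluster center.
    return mean_vector(targets)
```

With the set `[[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]` and a threshold of 0.8, the third segment is excluded as an outlier and the refined center is the mean of the first two.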
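FIG. 4 is a flowchart of an audio signal processing method provided by another embodiment of the present disclosure. The method includes the following steps:
S401. Segment an audio signal to obtain a plurality of audio segments.
S402. Cluster the plurality of audio segments according to the feature information of each of the plurality of audio segments to obtain one or more first sets.
S403. Determine a first cluster center of each first set according to the feature information of the audio segments included in that first set.
Specifically, the implementation and principles of S401-S403 are the same as those of the corresponding steps in the above embodiments and are not described again here.
S404. For each of the plurality of audio segments, calculate the distance between the audio segment and each first cluster center according to the feature information of the audio segment and the first cluster center of each first set.
For example, executing S401 yields nine audio segments, denoted in order audio segment A, audio segment B, audio segment C, audio segment D, audio segment E, audio segment F, audio segment G, audio segment H, and audio segment J. Executing S402 yields three first sets: first set 1 includes audio segments A, B, and C; first set 2 includes audio segments D, E, and F; and first set 3 includes audio segments G, H, and J. Executing S403 yields the first cluster centers corresponding to first set 1, first set 2, and first set 3. The first cluster centers can be obtained in any of the ways described above, which are not repeated here. Then, for each of the nine audio segments, the distance between the audio segment and the first cluster center of first set 1, the distance between the audio segment and the first cluster center of first set 2, and the distance between the audio segment and the first cluster center of first set 3 are calculated according to the embedding feature of the audio segment. Since these three distances may differ, they can be used to determine which first cluster center the audio segment is closest to.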
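S405. Assign to a second set those audio segments, among the plurality of audio segments, whose distance from the first cluster center is less than or equal to a third threshold.
For example, the above steps determine which first cluster center each of the nine audio segments is closest to. The one or more audio segments around each first cluster center can thus be determined, which amounts to re-clustering the nine audio segments. The one or more audio segments around each first cluster center are then assigned to one second set. For example, if there are three first sets, there are three first cluster centers; each first cluster center may correspond to one second set, so there are also three second sets, and the one or more audio segments in each second set are those closest to the corresponding first cluster center. Here, the distance between each audio segment in a second set and the corresponding first cluster center can be regarded as less than or equal to the third threshold. In other words, a second set may be the result of adjusting some or all of the audio segments in a first set. The audio segments in the same second set correspond to the same role tag.
For example, as shown in FIG. 5, first set 1, first set 2, and first set 3 are the initial clustering result. After the first cluster centers corresponding to first set 1, first set 2, and first set 3 have been determined, audio segment C is found to be closest to the first cluster center of first set 2, so audio segment C can be moved from first set 1 to first set 2. Similarly, audio segment F is closest to the first cluster center of first set 3, so audio segment F is moved from first set 2 to first set 3. This yields the re-clustering result shown in FIG. 5, i.e., the second set corresponding to each first set.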
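The reassignment in S404 and S405 can be sketched as a nearest-center lookup. Several points here are assumptions rather than part of the disclosure: distance is taken as one minus cosine similarity, the names `reassign` and `third_threshold` are hypothetical, and a segment whose nearest center is farther than the third threshold is marked `None` because the source does not say how such a segment is handled.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def reassign(embeddings, centers, third_threshold):
    """Return, for each segment, the index of its second set (or None)."""
    labels = []
    for e in embeddings:
        # S404: distance from this segment to every first cluster center.
        distances = [1.0 - cosine(e, c) for c in centers]
        nearest = min(range(len(centers)), key=distances.__getitem__)
        # S405: accept the nearest center only within the third threshold.
        if distances[nearest] <= third_threshold:
            labels.append(nearest)
        else:
            labels.append(None)  # assumed handling of far-away segments
    return labels
```

Segments that share a label form one second set and would receive the same role tag.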
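In this embodiment, an audio signal is segmented to obtain a plurality of audio segments, and the plurality of audio segments are clustered according to the feature information of each audio segment to obtain one or more first sets. A first cluster center of each first set is then determined according to the feature information of the audio segments included in that first set, and the plurality of audio segments are clustered according to the first cluster center of each first set to obtain one or more second sets, where the audio segments in the same second set correspond to the same role tag. In other words, after the initial clustering of the plurality of audio segments, the segments can be clustered again according to the first cluster center of each first set, which improves the accuracy of unsupervised role separation based on single-channel speech. This effectively avoids clustering errors caused by inaccurate cluster centers, such as assigning two audio segments of the same role to different classes, or assigning some of one role's audio segments to another role's class.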
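FIG. 6 is a flowchart of an audio signal processing method provided by another embodiment of the present disclosure. The method includes the following steps:
S601. Segment an audio signal to obtain a plurality of audio segments.
S602. Cluster the plurality of audio segments according to the feature information of each of the plurality of audio segments to obtain one or more first sets.
S603. Calculate a first mean value of the feature information of the audio segments included in the first set.
S604. Use the first mean value as the second cluster center of the first set.
S605. Determine one or more second target segments in the first set according to the second cluster center of the first set, where the similarity between the feature information corresponding to each second target segment and the second cluster center of the first set is greater than or equal to a second threshold.
S606. Calculate a second mean value of the feature information corresponding to the one or more second target segments in the first set.
S607. Use the second mean value as the first cluster center of the first set.
S608. For each of the plurality of audio segments, calculate the distance between the audio segment and each first cluster center according to the feature information of the audio segment and the first cluster center of each first set.
S609. Assign to a second set those audio segments, among the plurality of audio segments, whose distance from the first cluster center is less than or equal to a third threshold.
Specifically, the implementation and principles of S601-S609 are the same as those of the corresponding steps in the above embodiments and are not described again here.
In addition, in this embodiment, after S609 is executed, the second set may be taken as the first set so that S603-S609 are executed repeatedly. The number of iterations of S603-S609 may be a preset number. In other words, on the basis of the initial clustering result, the second cluster center and the first cluster center can be iteratively updated multiple times, and role tags can be reassigned to the audio segments. In each iteration, the K-nearest-neighbor segments of the second cluster center, i.e., the one or more second target segments, change; the first cluster center can therefore be updated in each iteration and move progressively closer to the true cluster center, which greatly reduces the influence of noise points on the cluster center and keeps the cluster center accurate. With the method described in the embodiments of the present disclosure, the accuracy of role separation can be improved from 90% to 94%, a clear improvement.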
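The iteration over S603-S609 can be sketched as the loop below. It is a simplified illustration under stated assumptions: cosine similarity is the measure, every segment is assigned to its most similar center (the third-threshold check is omitted for brevity), and the names `iterate_clustering`, `second_threshold`, and `num_iterations` are hypothetical.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def mean_vector(vectors):
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def refine_center(members, second_threshold):
    second_center = mean_vector(members)            # S603-S604
    targets = [e for e in members
               if cosine(e, second_center) >= second_threshold]  # S605
    return mean_vector(targets) if targets else second_center    # S606-S607


def iterate_clustering(embeddings, labels, second_threshold, num_iterations):
    """labels[i] is the initial set index of segment i; returns updated labels."""
    for _ in range(num_iterations):
        set_ids = sorted(set(labels))
        centers = {}
        for sid in set_ids:
            members = [e for e, lab in zip(embeddings, labels) if lab == sid]
            centers[sid] = refine_center(members, second_threshold)
        # S608-S609 (simplified): move each segment to its most similar center;
        # the resulting second sets become the first sets of the next round.
        labels = [
            max(set_ids, key=lambda sid: cosine(e, centers[sid]))
            for e in embeddings
        ]
    return labels
```

In the test data used to check this sketch, one segment starts in the wrong set; the refined center excludes it from the mean in the first round, so it is moved to the correct set and the labels stay stable in the second round.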
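In addition, an embodiment of the present disclosure provides a role processing method in a conference scenario, which includes the following steps:
S701. Receive a multi-role conference audio signal.
S702. Segment the audio signal to obtain a plurality of audio segments.
S703. Cluster the plurality of audio segments according to the feature information of each of the plurality of audio segments to obtain one or more first sets.
S704. Calculate a first mean value of the feature information of the audio segments included in the first set.
S705. Use the first mean value as the second cluster center of the first set.
S706. Determine one or more second target segments in the first set according to the second cluster center of the first set, where the similarity between the feature information corresponding to each second target segment and the second cluster center of the first set is greater than or equal to a second threshold.
S707. Calculate a second mean value of the feature information corresponding to the one or more second target segments in the first set.
S708. Use the second mean value as the first cluster center of the first set.
S709. For each of the plurality of audio segments, calculate the distance between the audio segment and each first cluster center according to the feature information of the audio segment and the first cluster center of each first set.
S710. Assign to a second set those audio segments, among the plurality of audio segments, whose distance from the first cluster center is less than or equal to a third threshold.
S711. Determine role information of a plurality of speakers in the audio signal according to the second set.
S712. Take the second set as the first set and repeat the process from calculating the first mean value to determining the role information.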
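FIG. 7 is a schematic structural diagram of an audio signal processing apparatus provided by an embodiment of the present disclosure. The audio signal processing apparatus provided by this embodiment can execute the processing flow provided by the embodiments of the audio signal processing method. As shown in FIG. 7, the audio signal processing apparatus 70 includes:
a segmentation module 71, configured to segment an audio signal to obtain a plurality of audio segments;
a clustering module 72, configured to cluster the plurality of audio segments according to the feature information of each of the plurality of audio segments to obtain one or more first sets; and
a determining module 73, configured to determine a first cluster center of each first set according to the feature information of the audio segments included in that first set;
where the clustering module 72 is further configured to cluster the plurality of audio segments according to the first cluster center of each first set to obtain one or more second sets, and the audio segments in the same second set correspond to the same role tag.
Optionally, the determining module 73 is specifically configured to: determine a first target segment in the first set, where the sum of the similarity scores between the first target segment and the other audio segments in the first set is greater than a first threshold; and use the feature information corresponding to the first target segment as the first cluster center of the first set.
Optionally, the determining module 73 includes a determining unit 731 and an updating unit 732, where the determining unit 731 is configured to determine the second cluster center of the first set according to the feature information of the audio segments included in the first set, and the updating unit 732 is configured to update the second cluster center of the first set to obtain the first cluster center of the first set.
Optionally, the determining unit 731 is specifically configured to: calculate a first mean value of the feature information of the audio segments included in the first set; and use the first mean value as the second cluster center of the first set.
Optionally, the updating unit 732 is specifically configured to: determine one or more second target segments in the first set according to the second cluster center of the first set, where the similarity between the feature information corresponding to each second target segment and the second cluster center of the first set is greater than or equal to a second threshold; and determine the first cluster center of the first set according to the one or more second target segments in the first set.
Optionally, when determining the first cluster center of the first set according to the one or more second target segments in the first set, the updating unit 732 is specifically configured to: calculate a second mean value of the feature information corresponding to the one or more second target segments in the first set; and use the second mean value as the first cluster center of the first set.
Optionally, when clustering the plurality of audio segments according to the first cluster center of each first set to obtain one or more second sets, the clustering module 72 is specifically configured to: for each of the plurality of audio segments, calculate the distance between the audio segment and each first cluster center according to the feature information of the audio segment and the first cluster center of each first set; and assign to a second set those audio segments, among the plurality of audio segments, whose distance from the first cluster center is less than or equal to a third threshold.
The audio signal processing apparatus of the embodiment shown in FIG. 7 can be used to execute the technical solutions of the above method embodiments; its implementation principles and technical effects are similar and are not described again here.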
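The internal functions and structure of the audio signal processing apparatus are described above; the apparatus can be implemented as an electronic device. FIG. 8 is a schematic structural diagram of an embodiment of an electronic device provided by an embodiment of the present disclosure. As shown in FIG. 8, the electronic device includes a memory 81 and a processor 82.
The memory 81 is configured to store a program. In addition to the above program, the memory 81 may be configured to store various other data to support operations on the electronic device. Examples of such data include instructions for any application or method operating on the electronic device, contact data, phonebook data, messages, pictures, videos, and the like.
The memory 81 can be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, or a magnetic or optical disk.
The processor 82 is coupled to the memory 81 and executes the program stored in the memory 81 to:
segment an audio signal to obtain a plurality of audio segments;
cluster the plurality of audio segments according to the feature information of each of the plurality of audio segments to obtain one or more first sets;
determine a first cluster center of each first set according to the feature information of the audio segments included in that first set; and
cluster the plurality of audio segments according to the first cluster center of each first set to obtain one or more second sets, where the audio segments in the same second set correspond to the same role tag.
Further, as shown in FIG. 8, the electronic device may also include other components such as a communication component 83, a power supply component 84, an audio component 85, and a display 86. FIG. 8 shows only some components schematically, which does not mean that the electronic device includes only the components shown in FIG. 8.
The communication component 83 is configured to facilitate wired or wireless communication between the electronic device and other devices. The electronic device can access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof. In an exemplary embodiment, the communication component 83 receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 83 further includes a near field communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on radio frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
The power supply component 84 provides power to the various components of the electronic device. The power supply component 84 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the electronic device.
The audio component 85 is configured to output and/or input audio signals. For example, the audio component 85 includes a microphone (MIC), which is configured to receive external audio signals when the electronic device is in an operating mode, such as a call mode, a recording mode, or a voice recognition mode. The received audio signal may be further stored in the memory 81 or sent via the communication component 83. In some embodiments, the audio component 85 further includes a speaker for outputting audio signals.
The display 86 includes a screen, which may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from a user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensors may sense not only the boundary of a touch or swipe action but also the duration and pressure associated with the touch or swipe operation.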
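In addition, an embodiment of the present disclosure provides a computer-readable storage medium on which a computer program is stored, the computer program being executed by a processor to implement the audio signal processing method described in the above embodiments.
In addition, an embodiment of the present disclosure provides a conference system that includes a terminal and a server, the terminal and the server being communicatively connected;
the terminal is configured to send a multi-role conference audio signal to the server, and the server is configured to execute the role processing method in a conference scenario described in the above embodiments; or
the server is configured to send a multi-role conference audio signal to the terminal, and the terminal is configured to execute the role processing method in a conference scenario described in the above embodiments.
It should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "include", "comprise", and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that includes the element.
The above are only specific implementations of the present disclosure, provided to enable those skilled in the art to understand or implement the present disclosure. Various modifications to these embodiments will be apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present disclosure. Therefore, the present disclosure is not limited to the embodiments described herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.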

Claims (12)

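  1. A role processing method in a conference scenario, wherein the method comprises:
    receiving a multi-role conference audio signal;
    segmenting the audio signal to obtain a plurality of audio segments;
    clustering the plurality of audio segments according to the feature information of each of the plurality of audio segments to obtain one or more first sets;
    calculating a first mean value of the feature information of the audio segments included in the first set;
    using the first mean value as a second cluster center of the first set;
    determining one or more second target segments in the first set according to the second cluster center of the first set, wherein the similarity between the feature information corresponding to each second target segment and the second cluster center of the first set is greater than or equal to a second threshold;
    calculating a second mean value of the feature information corresponding to the one or more second target segments in the first set;
    using the second mean value as a first cluster center of the first set;
    for each of the plurality of audio segments, calculating the distance between the audio segment and each first cluster center according to the feature information of the audio segment and the first cluster center of each first set;
    assigning to a second set those audio segments, among the plurality of audio segments, whose distance from the first cluster center is less than or equal to a third threshold;
    determining role information of a plurality of speakers in the audio signal according to the second set; and
    taking the second set as the first set and repeating the process from calculating the first mean value to determining the role information.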
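  2. An audio signal processing method, wherein the method comprises:
    segmenting an audio signal to obtain a plurality of audio segments;
    clustering the plurality of audio segments according to the feature information of each of the plurality of audio segments to obtain one or more first sets;
    determining a first cluster center of each first set according to the feature information of the audio segments included in that first set; and
    clustering the plurality of audio segments according to the first cluster center of each first set to obtain one or more second sets, wherein the audio segments in the same second set correspond to the same role tag.
  3. The method according to claim 2, wherein determining the first cluster center of each first set according to the feature information of the audio segments included in that first set comprises:
    determining a first target segment in the first set, wherein the sum of the similarity scores between the first target segment and the other audio segments in the first set is greater than a first threshold; and
    using the feature information corresponding to the first target segment as the first cluster center of the first set.
  4. The method according to claim 2, wherein determining the first cluster center of each first set according to the feature information of the audio segments included in that first set comprises:
    determining a second cluster center of the first set according to the feature information of the audio segments included in the first set; and
    updating the second cluster center of the first set to obtain the first cluster center of the first set.
  5. The method according to claim 4, wherein determining the second cluster center of the first set according to the feature information of the audio segments included in the first set comprises:
    calculating a first mean value of the feature information of the audio segments included in the first set; and
    using the first mean value as the second cluster center of the first set.
  6. The method according to claim 4, wherein updating the second cluster center of the first set to obtain the first cluster center of the first set comprises:
    determining one or more second target segments in the first set according to the second cluster center of the first set, wherein the similarity between the feature information corresponding to each second target segment and the second cluster center of the first set is greater than or equal to a second threshold; and
    determining the first cluster center of the first set according to the one or more second target segments in the first set.
  7. The method according to claim 6, wherein determining the first cluster center of the first set according to the one or more second target segments in the first set comprises:
    calculating a second mean value of the feature information corresponding to the one or more second target segments in the first set; and
    using the second mean value as the first cluster center of the first set.
  8. The method according to claim 2, wherein clustering the plurality of audio segments according to the first cluster center of each first set to obtain one or more second sets comprises:
    for each of the plurality of audio segments, calculating the distance between the audio segment and each first cluster center according to the feature information of the audio segment and the first cluster center of each first set; and
    assigning to a second set those audio segments, among the plurality of audio segments, whose distance from the first cluster center is less than or equal to a third threshold.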
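  9. An audio signal processing apparatus, comprising:
    a segmentation module, configured to segment an audio signal to obtain a plurality of audio segments;
    a clustering module, configured to cluster the plurality of audio segments according to the feature information of each of the plurality of audio segments to obtain one or more first sets; and
    a determining module, configured to determine a first cluster center of each first set according to the feature information of the audio segments included in that first set;
    wherein the clustering module is further configured to cluster the plurality of audio segments according to the first cluster center of each first set to obtain one or more second sets, and the audio segments in the same second set correspond to the same role tag.
  10. An electronic device, comprising:
    a memory;
    a processor; and
    a computer program;
    wherein the computer program is stored in the memory and is configured to be executed by the processor to implement the method according to any one of claims 1-8.
  11. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the method according to any one of claims 1-8.
  12. A conference system, wherein the system comprises a terminal and a server, the terminal and the server being communicatively connected;
    the terminal is configured to send a multi-role conference audio signal to the server, and the server is configured to execute the method according to claim 1; or
    the server is configured to send a multi-role conference audio signal to the terminal, and the terminal is configured to execute the method according to claim 1.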
PCT/CN2022/130728 2021-11-16 2022-11-08 Audio signal processing method, apparatus, device, and storage medium Ceased WO2023088142A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
US18/685,019 US20240355335A1 (en) 2021-11-16 2022-11-08 Audio signal processing method and apparatus, device and storage medium
EP22894679.4A EP4375988B1 (en) 2021-11-16 2022-11-08 Audio signal processing method and apparatus, and device and storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111351380.X 2021-11-16
CN202111351380.XA CN113808578B (zh) 2021-11-16 2021-11-16 Audio signal processing method, apparatus, device, and storage medium

Publications (1)

Publication Number Publication Date
WO2023088142A1 true WO2023088142A1 (zh) 2023-05-25

Family

ID=78898545

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/130728 Ceased WO2023088142A1 (zh) 2021-11-16 2022-11-08 Audio signal processing method, apparatus, device, and storage medium

Country Status (4)

Country Link
US (1) US20240355335A1 (zh)
EP (1) EP4375988B1 (zh)
CN (1) CN113808578B (zh)
WO (1) WO2023088142A1 (zh)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
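CN114596877B (zh) * 2022-03-03 2024-11-08 北京百度网讯科技有限公司 Speaker separation method and apparatus, electronic device, and storage medium
CN114465737B (zh) * 2022-04-13 2022-06-24 腾讯科技(深圳)有限公司 Data processing method and apparatus, computer device, and storage medium
CN115269909A (zh) * 2022-07-28 2022-11-01 腾讯音乐娱乐科技(深圳)有限公司 Audio classification method, audio search method, computer device, and program product
CN116524937A (zh) * 2023-01-11 2023-08-01 阿里巴巴达摩院(杭州)科技有限公司 Method for detecting speaker change points, and method and apparatus for training a detection model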

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103699573A (zh) * 2013-11-28 2014-04-02 微梦创科网络科技(中国)有限公司 UGC tag clustering method and apparatus for social platforms
CN106845518A (zh) * 2016-12-19 2017-06-13 苏州蓝盛电子有限公司 Automobile fault data diagnosis method based on K‑means cluster analysis algorithm
CN110414569A (zh) * 2019-07-03 2019-11-05 北京小米智能科技有限公司 Clustering implementation method and apparatus
US20200082809A1 (en) * 2016-12-14 2020-03-12 International Business Machines Corporation Using recurrent neural network for partitioning of audio data into segments that each correspond to a speech feature cluster identifier
WO2020199013A1 (en) * 2019-03-29 2020-10-08 Microsoft Technology Licensing, Llc Speaker diarization with early-stop clustering
CN111899755A (zh) * 2020-08-11 2020-11-06 华院数据技术(上海)有限公司 Speaker speech separation method and related device
CN111966798A (zh) * 2020-07-24 2020-11-20 北京奇保信安科技有限公司 Intent recognition method and apparatus based on multi-round K-means algorithm, and electronic device
CN113593597A (zh) * 2021-08-27 2021-11-02 中国电信股份有限公司 Speech noise filtering method and apparatus, electronic device, and medium

Family Cites Families (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9368116B2 (en) * 2012-09-07 2016-06-14 Verint Systems Ltd. Speaker separation in diarization
WO2016022588A1 (en) * 2014-08-04 2016-02-11 Flagler Llc Voice tallying system
US10141009B2 (en) * 2016-06-28 2018-11-27 Pindrop Security, Inc. System and method for cluster-based audio event detection
US10440325B1 (en) * 2018-07-17 2019-10-08 International Business Machines Corporation Context-based natural language participant modeling for videoconference focus classification
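CN111291177B (zh) * 2018-12-06 2024-08-02 中兴通讯股份有限公司 Information processing method and apparatus, and computer storage medium
CN110930984A (zh) * 2019-12-04 2020-03-27 北京搜狗科技发展有限公司 Speech processing method and apparatus, and electronic device
CN111462758A (zh) * 2020-03-02 2020-07-28 深圳壹账通智能科技有限公司 Method, apparatus, device, and storage medium for intelligent conference role classification
US11546690B2 (en) * 2020-04-27 2023-01-03 Orcam Technologies Ltd. Processing audio and video
CN111599346B (zh) * 2020-05-19 2024-02-20 科大讯飞股份有限公司 Speaker clustering method, apparatus, device, and storage medium
CN111477251B (zh) * 2020-05-21 2023-09-05 北京百度网讯科技有限公司 Model evaluation method and apparatus, and electronic device
CN112420069A (zh) * 2020-11-18 2021-02-26 北京云从科技有限公司 Speech processing method and apparatus, machine-readable medium, and device
CN112562693B (zh) * 2021-02-24 2021-05-28 北京远鉴信息技术有限公司 Clustering-based speaker determination method, determination apparatus, and electronic device
CN113450773A (zh) * 2021-05-11 2021-09-28 多益网络有限公司 Video transcript generation method and apparatus, storage medium, and electronic device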

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103699573A (zh) * 2013-11-28 2014-04-02 微梦创科网络科技(中国)有限公司 UGC tag clustering method and apparatus for social platforms
US20200082809A1 (en) * 2016-12-14 2020-03-12 International Business Machines Corporation Using recurrent neural network for partitioning of audio data into segments that each correspond to a speech feature cluster identifier
CN106845518A (zh) * 2016-12-19 2017-06-13 苏州蓝盛电子有限公司 Automobile fault data diagnosis method based on K‑means cluster analysis algorithm
WO2020199013A1 (en) * 2019-03-29 2020-10-08 Microsoft Technology Licensing, Llc Speaker diarization with early-stop clustering
CN110414569A (zh) * 2019-07-03 2019-11-05 北京小米智能科技有限公司 Clustering implementation method and apparatus
CN111966798A (zh) * 2020-07-24 2020-11-20 北京奇保信安科技有限公司 Intent recognition method and apparatus based on multi-round K-means algorithm, and electronic device
CN111899755A (zh) * 2020-08-11 2020-11-06 华院数据技术(上海)有限公司 Speaker speech separation method and related device
CN113593597A (zh) * 2021-08-27 2021-11-02 中国电信股份有限公司 Speech noise filtering method and apparatus, electronic device, and medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP4375988A4

Also Published As

Publication number Publication date
US20240355335A1 (en) 2024-10-24
EP4375988A1 (en) 2024-05-29
EP4375988B1 (en) 2026-01-14
EP4375988A4 (en) 2024-09-25
CN113808578B (zh) 2022-04-15
CN113808578A (zh) 2021-12-17

Similar Documents

Publication Publication Date Title
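WO2023088142A1 (zh) Audio signal processing method, apparatus, device, and storage medium
CN109829433B (zh) Face image recognition method and apparatus, electronic device, and storage medium
CN107102746B (zh) Candidate word generation method and apparatus, and apparatus for candidate word generation
CN103914518B (zh) Clustering method and related apparatus
CN109829435B (zh) Video image processing method and apparatus, and computer-readable medium
US20160012820A1 (en) Multilevel speech recognition method and apparatus
WO2021027344A1 (zh) Image processing method and apparatus, electronic device, and storage medium
US20160210965A1 (en) Method and apparatus for speech recognition
EP2757493A2 (en) Natural language processing method and system
CN110781957A (zh) Image processing method and apparatus, electronic device, and storage medium
CN108108455B (zh) Destination push method and apparatus, storage medium, and electronic device
WO2020228163A1 (zh) Image processing method and apparatus, electronic device, and storage medium
CN114139726A (zh) Data processing method and apparatus, electronic device, and storage medium
CN112333596A (zh) Method and apparatus for adjusting an earphone equalizer, server, and medium
CN107133361A (zh) Gesture recognition method and apparatus, and terminal device
CN117011581A (zh) Image recognition method, medium, apparatus, and computing device
CN109003607A (zh) Speech recognition method and apparatus, storage medium, and electronic device
CN108922520A (zh) Speech recognition method and apparatus, storage medium, and electronic device
CN111797880A (zh) Data processing method and apparatus, storage medium, and electronic device
CN109583583B (zh) Neural network training method and apparatus, computer device, and readable medium
WO2025055714A1 (zh) Pedestrian re-identification method and apparatus, electronic device, and storage medium
CN113298747B (zh) Picture and video detection method and apparatus
CN114550728B (zh) Method and apparatus for labeling speakers, and electronic device
CN111368015B (zh) Method and apparatus for compressing a map
KR102131353B1 (ko) Method for applying feedback of machine learning prediction data and system thereof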

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22894679

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 2022894679

Country of ref document: EP

WWE Wipo information: entry into national phase

Ref document number: 18685019

Country of ref document: US

ENP Entry into the national phase

Ref document number: 2022894679

Country of ref document: EP

Effective date: 20240219

WWG Wipo information: grant in national office

Ref document number: 2022894679

Country of ref document: EP