WO2024255461A1 - Speech processing method, apparatus, device, medium, and program product - Google Patents
- Publication number
- WO2024255461A1 (application PCT/CN2024/089862)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- speech
- voiceprint
- feature
- network layer
- feature map
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0272—Voice signal separating
- G10L21/028—Voice signal separating using properties of sound source
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/02—Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/06—Decision making techniques; Pattern matching strategies
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/18—Artificial neural networks; Connectionist approaches
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0272—Voice signal separating
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/18—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L2021/02087—Noise filtering the noise being separate speech, e.g. cocktail party
Definitions
- the present application relates to the field of computer technology, in particular to the field of artificial intelligence, and specifically to a speech processing method, a speech processing device, a computer device, a computer-readable storage medium, and a computer program product.
- Aliased voice data is voice data that is a mixture of voice signals generated by multiple sound sources (i.e., objects that generate sound).
- the aliased voice data recorded by a recording device from a physical environment may include voice signals generated by multiple participants, and may also include voice signals generated by certain devices in the physical environment (such as devices that play conference videos).
- Existing source separation methods for aliased speech data include: (1) separation by human listening, which is slow and inefficient; (2) separation based on timbre frequency, which cannot segment accurately when multiple objects have similar timbre frequencies; (3) separation based on sound source distance, which only works when the sound sources are at different distances; and (4) a dedicated speech segmentation model trained for a specified object, which is not portable and cannot be applied universally.
- the embodiments of the present application provide a speech processing method, apparatus, device, medium and program product, which can separate the pure speech signal of any specified object from the mixed speech data and have universality.
- an embodiment of the present application provides a speech processing method, which is executed by a computer device, and the method includes:
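- acquiring aliased speech data, where the aliased speech data includes speech signals generated by each of at least two objects;
- acquiring reference speech data of a designated object, where the designated object refers to any one of the at least two objects;
- the reference speech data includes a reference speech signal of the designated object;
- extracting a voiceprint representation vector of the designated object from the reference speech data, where the voiceprint representation vector represents the voiceprint characteristics of the designated object;
- inputting the aliased speech data and the voiceprint representation vector into a preset speech segmentation model, where the speech segmentation model is used to segment, based on an attention mechanism, a target speech signal matching the voiceprint characteristics from the aliased speech data;
- generating a voice file of the designated object based on the segmented target speech signal.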
- an embodiment of the present application provides a speech processing device, the device comprising:
- An acquisition unit configured to acquire aliased speech data, wherein the aliased speech data includes a speech signal generated by each of at least two objects;
- the acquisition unit is further used to acquire reference speech data of a specified object;
- the specified object refers to any one of at least two objects;
- the reference speech data includes a reference speech signal of the specified object;
- a processing unit configured to extract a voiceprint representation vector of a specified object from the reference speech data, wherein the voiceprint representation vector is used to represent a voiceprint characteristic of the specified object;
- the processing unit is further used to input the aliased speech data and the voiceprint representation vector into a preset speech segmentation model, wherein the speech segmentation model is used to: segment a target speech signal matching the voiceprint characteristics from the aliased speech data based on an attention mechanism;
- the processing unit is further used to generate a voice file of a designated object based on the segmented target voice signal.
- an embodiment of the present application provides a computer device, the computer device comprising:
- a processor, and a computer-readable storage medium storing a computer program; when the computer program is executed by the processor, the above-mentioned speech processing method is implemented.
- an embodiment of the present application provides a computer-readable storage medium, which stores a computer program, and the computer program is suitable for being loaded by a processor and executing the above-mentioned speech processing method.
- an embodiment of the present application provides a computer program product, which includes a computer program; when the computer program is executed by a processor, the above-mentioned speech processing method is implemented.
- aliased speech data to be segmented is obtained, and the aliased speech data contains speech signals generated by each of at least two objects; if there is a need to segment the speech signal generated by a specified object among the at least two objects, a section of reference speech data of the specified object (such as a few seconds of speech generated by the specified object) can be obtained; the specified object can be any one of the at least two objects.
- a voiceprint representation vector of the specified object is extracted from the reference speech data, and the voiceprint representation vector can represent the voiceprint characteristics of the specified object, and the voiceprint characteristics are unique and can represent the identity of the specified object.
- the voiceprint representation vector that can uniquely represent the identity of the specified object and the aliased speech data to be segmented can be input into a preset speech segmentation model, so that the speech segmentation model can segment a target speech signal that matches the voiceprint characteristics of the specified object from the aliased speech data based on the attention mechanism, thereby generating a separate speech file for the specified object based on the segmented target speech signal.
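The flow summarized above can be sketched as a small pipeline. This is a minimal sketch, not the application's implementation: `extract_voiceprint` and `segment` are hypothetical callables standing in for the trained voiceprint vector extraction model and the speech segmentation model, and the 16 kHz sample rate and 16-bit mono WAV output are assumptions chosen only for illustration.

```python
import io
import struct
import wave
from typing import Callable, List

Signal = List[float]       # mono samples, assumed to lie in [-1.0, 1.0]
Voiceprint = List[float]   # voiceprint representation vector


def to_wav_bytes(signal: Signal, sample_rate: int = 16000) -> bytes:
    """Encode a mono float signal as 16-bit PCM WAV data (format is an assumption)."""
    pcm = b"".join(
        struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767)) for s in signal
    )
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)          # 2 bytes per sample = 16-bit
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return buffer.getvalue()


def separate_to_file(
    aliased: Signal,
    reference: Signal,
    extract_voiceprint: Callable[[Signal], Voiceprint],
    segment: Callable[[Signal, Voiceprint], Signal],
    sample_rate: int = 16000,
) -> bytes:
    """Reference speech -> voiceprint -> segmented target signal -> voice file."""
    voiceprint = extract_voiceprint(reference)
    target = segment(aliased, voiceprint)
    return to_wav_bytes(target, sample_rate)


# Toy stand-ins so the sketch runs; they do not perform real separation.
def stub_extract(reference: Signal) -> Voiceprint:
    return [sum(reference) / len(reference)]


def stub_segment(aliased: Signal, voiceprint: Voiceprint) -> Signal:
    return [0.5 * s for s in aliased]


wav_data = separate_to_file([0.0, 0.5, -0.5], [0.1, 0.2], stub_extract, stub_segment)
```

Passing the two models in as callables keeps the orchestration independent of any particular network, which mirrors the text's point that only the voiceprint input changes from one specified object to the next.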
- the embodiments of the present application support extracting a voiceprint representation vector that represents the voiceprint characteristics of a specified object from the pure reference voice data of the specified object, and using the voiceprint representation vector as a reference, and utilizing the attention mechanism provided by the voice segmentation model to clearly and accurately calculate and extract the target voice signal of the specified object from the aliased voice data, thereby improving the extraction purity of the target voice signal and achieving a more accurate voice separation effect.
- the embodiments of the present application only need to obtain the reference voice data of the specified object to segment the target voice signal of the specified object from the aliased voice data; if you want to obtain the voice data of other objects, you only need to replace the voiceprint representation vector of the object to be segmented, and there is no need to train a dedicated network for each object, which greatly improves convenience and portability, and improves the versatility of this solution.
- FIG. 1 is a schematic diagram of the architecture of a speech processing system provided by an exemplary embodiment of the present application.
- FIG. 2 is a schematic diagram of the architecture of a speech processing scenario provided by an exemplary embodiment of the present application.
- FIG. 3 is a flow chart of a speech processing method provided by an exemplary embodiment of the present application.
- FIG. 4 is a schematic diagram of an interface for a user to input reference voice data of a specified object provided by an exemplary embodiment of the present application.
- FIG. 5 is a schematic diagram of the structure of an existing Unet network.
- FIG. 6 is a schematic diagram of the structure of a speech segmentation model constructed by adding an Attention mechanism to each network layer in a Unet network, provided by an exemplary embodiment of the present application.
- FIG. 7a is a schematic diagram of speech segmentation when a target network layer is a convolutional layer or a convolutional connection layer provided by an exemplary embodiment of the present application.
- FIG. 7b is a schematic diagram of speech segmentation when a target network layer is an upsampling layer provided by an exemplary embodiment of the present application.
- FIG. 8 is a flow chart of another speech processing method provided by an exemplary embodiment of the present application.
- FIG. 9 is a schematic diagram of a process of extracting a voiceprint vector provided by an exemplary embodiment of the present application.
- FIG. 10 is a schematic structural diagram of an improved PANNS provided by an exemplary embodiment of the present application.
- FIG. 11 is a schematic diagram of the structure of a transformer network provided by an exemplary embodiment of the present application.
- FIG. 12 is a schematic diagram of the structure of a speech processing device provided by an exemplary embodiment of the present application.
- FIG. 13 is a schematic diagram of the structure of a computer device provided by an exemplary embodiment of the present application.
- Specifically, a speech processing scheme is provided: a speech separation scheme that performs source separation on aliased speech data.
- Aliased speech data can also be referred to as aliased speech or a mixed audio signal: audio in which multiple speech signals (also called audio signals) are mixed; that is, "aliasing" here means that multiple speech signals are mixed together.
- the aliased speech data can be understood as: speech data containing speech signals generated by multiple sound sources, which is collected directly from the environment by a sound receiving device (such as a microphone).
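As a rough illustration of what "aliased" means here, a single microphone capturing several sources can be approximated as the sample-wise sum of the individual signals. The sketch below assumes this simple additive model; the sine tones, the 8 kHz sample rate, and the helper names `tone` and `mix` are illustrative stand-ins and do not come from the application.

```python
import math
from typing import List

SAMPLE_RATE = 8000  # Hz, illustrative value


def tone(freq_hz: float, seconds: float, amplitude: float = 0.5) -> List[float]:
    """Generate a sine tone standing in for the speech signal of one sound source."""
    n = int(SAMPLE_RATE * seconds)
    return [
        amplitude * math.sin(2 * math.pi * freq_hz * t / SAMPLE_RATE)
        for t in range(n)
    ]


def mix(*signals: List[float]) -> List[float]:
    """Sum equal-length signals sample by sample to form the aliased signal."""
    length = len(signals[0])
    if any(len(s) != length for s in signals):
        raise ValueError("all signals must have the same length")
    return [sum(samples) for samples in zip(*signals)]


source_1 = tone(220.0, 0.01)   # stand-in for object 1
source_2 = tone(330.0, 0.01)   # stand-in for object 2
aliased = mix(source_1, source_2)
```

Source separation is the inverse problem: given only `aliased`, recover the contribution of one specified source.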
- multiple speech signals can be generated by different objects (or called sound sources), and the objects here can include but are not limited to: humans, animals or physical devices (such as cars), etc.; the embodiment of the present application does not limit the sources of the multiple speech signals contained in the aliased speech data.
- For example, in a conference scenario, the collected voice data usually includes voice signals generated by different participants; if the conference scenario also includes a device for playing audio and video, the collected voice data also includes voice signals emitted by that device. The voice data collected in the conference scenario can therefore be called aliased voice data, which includes voice signals generated by multiple objects in the conference scenario.
- source separation refers to the process of separating the voice signal of a specified object from the aliased voice data.
- source separation can be simply understood as the technology of separating the aliased voice data through signal processing or other algorithms to segment the target voice signal of the specified object from the aliased voice data, and finally generating a separate audio file (or voice file) of the specified object.
- the target voice signal generated by a specified object can be extracted from the aliased voice data through the source separation technology to generate a voice file of the specified object; in this way, when the voice file is played, only the voice generated by the specified object exists, so as to achieve the purpose of identifying the voice generated by a specified object.
- the embodiment of the present application provides a new voice processing solution, which mainly includes: obtaining aliased voice data to be segmented, wherein the aliased voice data includes a voice signal generated by each of at least two objects, such as the voice signal included in the aliased voice data includes: a voice signal generated by object 1 and a voice signal generated by object 2; if the user wants to extract the target voice signal generated by a specified object (such as any one of the at least two objects) from the aliased voice data, a reference voice data containing a reference voice signal of the specified object can be obtained.
- a voiceprint representation vector of the specified object can be extracted based on the reference voice data, and the voiceprint representation vector can represent the voiceprint characteristics of the specified object.
- the voiceprint characteristics can be understood as the sound characteristics of the specified object, such as the unique pitch or timbre of the specified object.
- the voiceprint representation vector of the specified object and the aliased voice data are input into the voice segmentation model, and the attention mechanism in the voice segmentation model can be used to segment and extract the target voice signal matching the voiceprint characteristics of the specified object from the aliased voice data, thereby generating a separate voice file for the specified object based on the target voice signal.
- The embodiment of the present application relies on the uniqueness of each user's voiceprint characteristics. Given only a segment of reference voice data of any specified object, the voiceprint representation vector that characterizes that object's voiceprint can be extracted, and the target voice signal of the specified object can then be separated from the aliased voice data based on that vector. This not only separates the target voice signal of the specified object accurately, but also works for the object behind any voice signal contained in the aliased voice data, so the solution is highly reusable and portable, reduces the complexity of user input operations, and makes the entire system more universal.
- the embodiment of the present application calculates the voiceprint characteristics and aliased voice data of the specified object based on the attention mechanism, which greatly improves the clarity and accuracy of extracting the target voice signal of the specified object from the aliased voice data, avoids too much noise in the extracted target voice signal, and achieves a more accurate and pure voice separation effect.
- the embodiment of the present application mainly implements the speech processing solution through a reusable designated speaker speech segmentation system based on voiceprint vector embedding, that is, the system deploys the speech processing solution provided by the embodiment of the present application; in this way, when any user has the need to separate the speech signal of the aliased speech data, the system can be called to automatically separate and extract the speech file corresponding to the designated object from the aliased speech data.
- the exemplary architecture diagram of the system can be seen in Figure 1; as shown in Figure 1, the system mainly includes two modules, namely: a voiceprint vector extraction model and a speech segmentation model; the following is a brief introduction to these two modules, wherein:
- Voiceprint vector extraction model, which can also be called a voiceprint vector extractor or voiceprint recognition network.
- The voiceprint vector extraction model is mainly used to identify the specified object to be segmented and extract the identity semantic vector of the specified object; in the embodiment of the present application, this identity semantic vector is called a voiceprint representation vector (or simply a voiceprint vector), and it characterizes the voiceprint characteristics of the specified object.
- The voiceprint vector extraction model is constructed based on an improved Pretrained Audio Neural Networks (PANNS) network and a Transformer network.
- the voiceprint vector extraction model is obtained by fully training with an open source large-scale speaker data set (a data set containing rich voice data).
- the trained voiceprint vector extraction model has the ability to fully express the voiceprint characteristics of the object; thus, the voiceprint vector extraction model can be used as a voiceprint vector extractor for the entire system; in the inference stage, after loading the model parameters trained in advance with a large-scale speaker data set, the trained voiceprint vector extraction model can be used to calculate the voiceprint representation vector of the specified object for the reference voice data of the specified object (such as a short segment (such as a few seconds or more than ten seconds) of speech), and the voiceprint representation vector is used to represent the voiceprint characteristics of the specified object.
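To make the idea of a single voiceprint representation vector for a short reference clip concrete, the sketch below pools frame-level embeddings into one fixed-length vector. It is a sketch under stated assumptions: the frame-level embeddings are taken as given (in the application they would come from the trained extraction model), and mean pooling, L2 normalization, and cosine comparison are common choices assumed here, not steps confirmed by the text. The function names are hypothetical.

```python
import math
from typing import List, Sequence


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def pool_voiceprint(frame_embeddings: Sequence[Sequence[float]]) -> List[float]:
    """Average frame-level embeddings over time, then normalize (assumed pooling)."""
    if not frame_embeddings:
        raise ValueError("need at least one frame")
    count = len(frame_embeddings)
    dim = len(frame_embeddings[0])
    mean = [sum(frame[d] for frame in frame_embeddings) / count for d in range(dim)]
    return l2_normalize(mean)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Compare two voiceprint vectors; values near 1.0 suggest the same speaker."""
    return sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))


# Two toy frames of a 2-dimensional embedding, purely illustrative.
voiceprint = pool_voiceprint([[1.0, 0.0], [3.0, 0.0]])
```

Because the pooled vector has a fixed length regardless of clip duration, a few seconds of reference speech and a longer clip both yield a vector the segmentation model can consume in the same way.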
- the improved PANNS in the voiceprint vector extraction model is an improvement on the traditional PANNS; the improvement is mainly reflected in: designing the information exchange link between the time domain link and the frequency domain link, so that there are multiple information exchanges in the time domain and frequency domain during the voiceprint representation vector extraction process, thereby achieving information complementarity between the time domain and the frequency domain, enabling the high-level network to fully perceive the underlying network information and improve the accuracy of voiceprint vector extraction.
- PANNS is an audio neural network trained based on a large audio data set (including speech data of a large number of speakers); it is usually used for audio pattern recognition or audio frame-level vectorization (embedding) as the encoding network at the front end of the model.
- The Transformer network is a model that relies on the attention mechanism to relate its input and output; it abandons the convolutional model structure and achieves better performance using only the attention mechanism and a feed-forward neural network, without needing a sequence-aligned recurrent architecture.
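The two building blocks named above, attention and a feed-forward network, can be sketched in a few lines. This is a minimal single-head illustration in plain Python with toy weights; residual connections, layer normalization, multiple heads, and positional encoding are deliberately omitted, and all weight matrices are placeholders rather than anything from the application.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (n x m) and b (m x p)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m: Matrix) -> Matrix:
    return [list(col) for col in zip(*m)]


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x: Matrix, w_q: Matrix, w_k: Matrix, w_v: Matrix) -> Tuple[Matrix, Matrix]:
    """Scaled dot-product self-attention: softmax(Q K^T / sqrt(d)) V."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    scale = math.sqrt(len(q[0]))
    scores = matmul(q, transpose(k))
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v), weights


def feed_forward(x: Matrix, w1: Matrix, w2: Matrix) -> Matrix:
    """Position-wise feed-forward network with a ReLU in between (biases omitted)."""
    hidden = [[max(0.0, h) for h in row] for row in matmul(x, w1)]
    return matmul(hidden, w2)


identity = [[1.0, 0.0], [0.0, 1.0]]
frames = [[1.0, 0.0], [0.0, 1.0]]          # two toy time steps
attended, attn_weights = self_attention(frames, identity, identity, identity)
block_output = feed_forward(attended, identity, identity)
```

Every time step attends to every other time step in one matrix product, which is why no recurrent, step-by-step processing is needed.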
- Speech segmentation model, which can also be called a semantic segmentation network or a segmentation network.
- The speech segmentation model mainly receives the voiceprint characteristics of a specified object (specifically, the voiceprint representation vector) output by the voiceprint vector extraction model and, based on that vector, uses an attention mechanism to extract from the aliased speech data a speech signal that matches the voiceprint characteristics of the specified object.
- the speech segmentation model is a segmentation model that integrates the attention mechanism.
- the target speech signal related to the voiceprint feature of the specified object can be calculated in combination with the attention mechanism during the feature processing of the segmentation network for the aliased speech data, and the target speech signal of the specified object can be segmented from the aliased speech data; in this way, the target speech signal of the specified object in the aliased speech data can be calculated and extracted more clearly and accurately, and the pure target speech signal of the specified object can be separated, so as to achieve a more accurate and pure speech separation effect.
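One way to picture how a voiceprint can steer feature processing is a sigmoid attention gate: each frame of mixture features is scored against the voiceprint vector, and the score scales how much of that frame passes through. The sketch below is an assumption made for illustration only. The text does not specify this formula, and in the application the attention mechanism is integrated into the layers of the segmentation network rather than applied as a single gate; `attention_gate` is a hypothetical name.

```python
import math
from typing import List, Sequence, Tuple


def attention_gate(
    frame_features: Sequence[Sequence[float]],
    voiceprint: Sequence[float],
) -> Tuple[List[List[float]], List[float]]:
    """Scale each frame by a sigmoid of its scaled dot product with the voiceprint.

    Toy formulation: very large negative scores would overflow math.exp,
    which is acceptable for small illustrative inputs only.
    """
    scale = math.sqrt(len(voiceprint))
    gated: List[List[float]] = []
    gates: List[float] = []
    for frame in frame_features:
        score = sum(f * v for f, v in zip(frame, voiceprint)) / scale
        gate = 1.0 / (1.0 + math.exp(-score))   # attention coefficient in (0, 1)
        gates.append(gate)
        gated.append([gate * f for f in frame])
    return gated, gates


# Frame 0 resembles the voiceprint direction; frame 1 is orthogonal to it.
mixture_frames = [[4.0, 0.0], [0.0, 4.0]]
target_voiceprint = [1.0, 0.0]
gated_frames, gate_values = attention_gate(mixture_frames, target_voiceprint)
```

Frames that align with the voiceprint receive a gate close to 1 and are largely preserved, while unrelated frames are attenuated, which is the intuition behind extracting a purer target signal.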
- The attention mechanism is modeled on human attention: it quickly filters the information of interest out of a large amount of information. It is mainly used to solve the problem that a reasonable vector representation is difficult to obtain when the input sequence of a time series model is long.
- The approach is to retain the intermediate results of the time series model, learn over them with a new model, and associate them with the output, thereby screening the information.
- the system provided in the embodiment of the present application includes two modules.
- the voiceprint vector extraction model extracts the voiceprint representation vector that can characterize the voiceprint characteristics of the specified object from the reference voice data of the specified object
- the voiceprint representation vector can be embedded into the voice segmentation model; in this way, the voice segmentation model can extract and separate the voice signal that matches the voiceprint characteristics from the aliased voice data based on the attention mechanism, and achieve a better signal separation effect.
- The system provided in the embodiment of the present application is a fully automatic segmentation system built on multiple deep learning neural networks (such as an improved PANNS network, a Transformer network, and a segmentation network integrated with an attention mechanism). The user only needs to input the reference voice data of the specified object and the aliased voice data to be segmented, and the system automatically and quickly extracts the voice signal of the specified object from the aliased voice data, which greatly improves the efficiency of voice segmentation, removes the need for manual participation, and enables rapid standardization.
- the speech separation model in the system can be reused; where reusability means that each time the system performs source separation, it only needs to replace the voiceprint characteristics of the extracted object, and there is no need to train a separate segmentation network for each object.
- the network can be easily migrated, making the entire system highly versatile.
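The reusability described above amounts to keeping one separation model fixed and changing only the voiceprint passed to it. The following sketch illustrates that calling pattern; `ReusableSeparator` and `toy_model` are hypothetical names, and the toy model (keep frames with a positive dot product against the voiceprint) is a placeholder rather than the application's segmentation network.

```python
from typing import Callable, Dict, List

Frame = List[float]
Model = Callable[[List[Frame], List[float]], List[Frame]]


class ReusableSeparator:
    """One shared separation model; switching speakers only switches the voiceprint."""

    def __init__(self, model: Model) -> None:
        self._model = model
        self._voiceprints: Dict[str, List[float]] = {}

    def enroll(self, name: str, voiceprint: List[float]) -> None:
        """Register a voiceprint vector for an object; no training happens here."""
        self._voiceprints[name] = voiceprint

    def separate(self, name: str, mixture: List[Frame]) -> List[Frame]:
        """Run the same model, conditioned on the named object's voiceprint."""
        return self._model(mixture, self._voiceprints[name])


def toy_model(mixture: List[Frame], voiceprint: List[float]) -> List[Frame]:
    """Placeholder model: keep frames that point in the voiceprint's direction."""
    return [f for f in mixture if sum(a * b for a, b in zip(f, voiceprint)) > 0]


separator = ReusableSeparator(toy_model)
separator.enroll("object_1", [1.0, 0.0])
separator.enroll("object_2", [0.0, 1.0])

mixture = [[2.0, 0.0], [0.0, 3.0], [1.0, 0.0]]
frames_1 = separator.separate("object_1", mixture)
frames_2 = separator.separate("object_2", mixture)
```

Adding a third object would require only another `enroll` call, with no per-object network.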
- the system shown in FIG1 can be deployed in a computer device, specifically in an application running in the computer device (such as deployed in the application in the form of a plug-in); that is, the solution is provided by the application running in the computer device.
- the application can refer to a computer program for completing one or more specific tasks.
- The same application can be classified in different ways depending on the dimension considered.
- the application may include but is not limited to: a client installed in the terminal, a small program that can be used without downloading and installing (as a subprogram of the client), a web (World Wide Web) application opened through a browser, etc.
- the application may include but is not limited to: IM (Instant Messaging) application, content interaction application, audio application or video application, etc.
- the instant messaging application refers to an application for instant communication of messages and social interaction based on the Internet.
- the instant messaging application may include but is not limited to: a social application with communication functions, a map application with social interaction functions, a game application, etc.
- Content interaction applications refer to applications that can realize content interaction, such as online banking, sharing platforms, personal space, news and other applications.
- Audio applications refer to applications that realize audio functions based on the Internet. Audio applications may include but are not limited to: music applications with music playback and editing capabilities, radio applications with radio playback capabilities, or live broadcast applications with live broadcast capabilities, etc.
- Video applications refer to applications that can play pictures. Video applications may include but are not limited to: applications with short videos (videos are often short, such as a few seconds or minutes), applications with long videos (such as movies or TV series that often have a long playback time), etc.
- Computer equipment may include terminals and/or servers.
- Terminals may include but are not limited to: smartphones (such as smartphones running Android or iOS), tablet computers, portable personal computers, mobile Internet devices (MID), vehicle-mounted devices, head-mounted devices, smart TVs, or smart home devices, etc.
- The embodiments of this application do not limit the type of terminal.
- the terminal is deployed with the system shown in Figure 1 or an application (or plug-in) providing the system, etc.
- the server can be an independent physical server, or a server cluster or distributed system composed of multiple physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, content delivery networks (CDN), and big data and artificial intelligence platforms.
- A schematic diagram of the system architecture of an exemplary voice processing solution jointly executed by a terminal and a server can be seen in FIG. 2; as shown in FIG. 2, terminal 201 is a device held by a user with a voice separation requirement.
- server 202 can first identify the reference voice data through the voiceprint vector extraction model in the system, obtain a voiceprint representation vector for representing the identity of the specified object, and then embed the voiceprint representation vector into the voice segmentation model in the system; after receiving the voiceprint representation vector of the specified object and the aliased voice data to be segmented, the voice segmentation model can extract a pure target voice signal that matches the voiceprint characteristics represented by the voiceprint representation vector from the aliased voice data based on the attention mechanism, thereby generating a voice file of the specified object based on the target voice signal. In this way, the server 202 returns the voice file of the designated object to the terminal 201 , so that the user can play the voice file containing only the voice data of the designated object through the terminal 201 .
- The above briefly introduces the voice processing scheme using the example in which the computer device includes both a terminal and a server; when the computer device is only a terminal or only a server, it executes the scheme in a similar way, except that the execution subject differs, which is not repeated here.
- the terminal 201 and the server 202 shown in FIG2 can be directly or indirectly connected by wired or wireless communication, and this application does not limit this.
- The speech processing solution provided in the embodiment of the present application can be applied to any application scenario with speech separation requirements; the computer device providing the solution may differ depending on the application scenario, and this is not limited here.
- the application scenario may include but is not limited to at least one of the following: film and television drama scenario, audio and video creation scenario, and conversation scenario.
- the application scenario is a film or TV drama scenario.
- the film or TV drama scenario is a dubbing scenario for a character in the film or TV drama.
- Voice actors are often required to dub a certain role in a film or television drama (for example, after the aliased voice data captured by on-set sound recording is submitted for review, some lines do not comply with regulations and need to be re-recorded).
- The voice data obtained by on-set sound recording during the filming or post-production of a film or television drama is usually aliased voice data containing multiple voice signals.
- the application scenario is an audio and video creation scenario.
- the audio and video creation scenario is a secondary creation scenario for audio and video (i.e., re-creation for existing audio and video).
- In the secondary creation scenario, users like to extract lines of a specified actor from multiple audio and video works for dialogue editing, that is, to edit the voice data of the specified actor from different audio and video works into the same audio and video work.
- the application scenario is a conversation scenario.
- the conversation scenario is an online conference scenario.
- In online conference scenarios, there is often a need to transcribe speech into text, that is, to convert the recorded voice data into text form. However, in online conferences with multiple participants, transcribing mixed voice data that contains the voice signals of multiple people has always been a difficult problem; here, transcription refers to converting the voice signal of a specified person among the multiple people into text.
- the voice signal of each object participating in the online conference can be segmented and extracted from the mixed voice data, and then the voice signal of each object can be input into the voice recognition system to achieve text transcription, which can greatly improve the accuracy of the mixed voice transcription of the conversation.
- the collection and processing of relevant data in the embodiments of this application should be strictly in accordance with the requirements of relevant laws and regulations.
- the acquisition of personal information requires the knowledge or consent of the individual subject (or the legal basis for obtaining the information), and the subsequent use and processing of data shall be carried out within the scope of authorization of laws and regulations and the subject of personal information.
- When the embodiments of this application are applied to specific products or technologies, for example when obtaining reference voice data of a specified object, the permission or consent of the specified object shall be obtained, and the collection, use and processing of relevant data (such as the collection and publication of bullet comments posted by the object) shall comply with the relevant laws, regulations and standards of the relevant regions.
- the embodiment of the present application proposes a more detailed speech processing method.
- the speech processing method proposed in the embodiment of the present application will be described in detail below in conjunction with the accompanying drawings.
- FIG. 3 shows a flow chart of a speech processing method provided by an exemplary embodiment of the present application; the speech processing method may be executed by a computer device in the aforementioned system, such as a terminal and/or a server; the speech processing method may include but is not limited to steps S301-S304:
- S301 Acquire aliased speech data to be segmented.
- S302 Acquire reference speech data of a designated object among at least two objects.
- the aliased voice data to be segmented includes the voice signals generated by each of at least two objects.
- For example, a piece of heavy metal music includes the "lyrics" voice signal generated by the "singer", the "melody" voice signal generated by the "guitar", and the "melody" voice signal generated by the "drum set", etc. The heavy metal music is therefore aliased voice data: the objects included in it are the singer, the guitar and the drum set, and the voice signals included in it are those generated by the "singer", the "guitar" and the "drum set".
- the reference voice data of the certain object can be obtained.
- The certain object is called the specified object. The reference voice data of the specified object is not the same as the aliased voice data, but it includes a pure reference voice signal of the specified object; in this way, the reference voice signal contained in the reference voice data of the specified object can be used as a reference signal to separate the target voice signal of the specified object from the at least two voice signals, corresponding to the at least two objects, contained in the aliased voice data.
- the specified object can be any object of the at least two objects contained in the aliased voice data from which the user wants to extract the voice signal; as can be seen from the foregoing description, the object can refer to a human, an animal or a physical device; for ease of explanation, the object type of the specified object is taken as an example for introduction, and this is specially explained here.
- the reference voice data refers to a segment of voice data containing the reference voice signal of the specified object.
- the reference voice data should be a segment of relatively pure voice data containing the specified object.
- For example, the reference voice data contains only the reference voice signal of the specified object. For another example, the reference voice data contains both the reference voice signal of the specified object and other voice signals, but it must be ensured that the reference voice signal of the specified object can be easily extracted from the reference voice data mixed with other voice signals (for example, the signal frequency of the other voice signals is low while that of the reference voice signal is relatively high), which is conducive to analyzing the pure reference voice data to extract relatively accurate voiceprint characteristics of the specified object.
- the embodiment of the present application does not limit the type, duration and source of the reference voice data.
- the type of reference voice data may include but is not limited to: a segment of audio generated by the specified object reading an article, a segment of audio generated by the specified object speaking, or a segment of audio generated by the specified object singing, etc.
- the duration of the reference voice data may be a few seconds or more than ten seconds, etc.
- the source of the reference voice data may include but is not limited to: when the designated object and the user with the voice separation requirement are different users, the reference voice data may be sent to the user by the designated object, or downloaded or recorded by the user through certain channels (such as historical voice information); when the designated object and the user with the voice separation requirement are the same user, the reference voice data may be input in real time by the designated object, that is, collected in real time through a microphone deployed in the terminal held by the user.
- FIG. 4 shows an exemplary interface for a user to input reference voice data of a specified object. As shown in FIG. 4, a voice acquisition interface 401 is displayed on the screen of a terminal held by the user, and the voice acquisition interface 401 includes an acquisition area 402 for reference voice data.
- at least two voice acquisition entrances can be displayed in the acquisition area 402, such as a collection entrance 4021 and an upload entrance 4022.
- When the collection entrance 4021 is triggered, it means that the user wants to input the reference voice data of the specified object by real-time collection (the specified object is the user, or the specified object and the user are in the same physical environment); the microphone of the terminal is then turned on, so that the reference voice signal in the user's physical environment can be collected in real time to generate the reference voice data.
- When the upload entrance 4022 is triggered, it means that the user wants to input the reference voice data of the specified object by uploading a file; the user can then upload the reference voice data of the specified object from a storage space (such as the local storage space of the terminal, a cloud storage space or a server storage space).
- the interface elements (such as the interface content contained in the interface) and interface styles of the voice acquisition interface are not limited to those shown in Figure 4.
- the upload entry of the aliased voice data can also be displayed in the voice acquisition interface, through which the user can replace the aliased voice data to be segmented.
- a text conversion control (or component, button, option, etc.) can also be added to the voice acquisition interface, so that the user can trigger the text conversion control before or after voice separation to convert the separated voice signal into text form with one click, shorten the text conversion path to a certain extent, and thus improve the text conversion efficiency.
- S303 extracting a voiceprint representation vector of a specified object from the reference speech data, and inputting the aliased speech data and the voiceprint representation vector into a preset speech segmentation model, wherein the speech segmentation model is used to segment a speech signal matching the voiceprint characteristics of the specified object from the aliased speech data based on an attention mechanism.
- S304 Generate a voice file of a designated object based on the segmented voice signal.
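- As a non-limiting illustration, steps S301-S304 can be read as a simple pipeline. The sketch below only fixes the order of the steps; the function name `process_speech` and the three callables passed to it (`extract_voiceprint`, `segment`, `write_file`) are hypothetical stand-ins for the voiceprint vector extraction model, the speech segmentation model and the file writer, none of which is specified at this level of detail in the text.

```python
from typing import Callable, Sequence

Signal = Sequence[float]
Vector = Sequence[float]


def process_speech(
    aliased: Signal,                                  # S301: aliased speech data to be segmented
    reference: Signal,                                # S302: reference speech of the designated object
    extract_voiceprint: Callable[[Signal], Vector],   # hypothetical voiceprint vector extraction model
    segment: Callable[[Signal, Vector], Signal],      # hypothetical attention-based segmentation model
    write_file: Callable[[Signal], str],              # hypothetical file writer, returns a file path
) -> str:
    """Run steps S303 and S304 on the inputs acquired in S301 and S302."""
    voiceprint = extract_voiceprint(reference)        # S303: voiceprint representation vector
    target = segment(aliased, voiceprint)             # S303: signal matching the voiceprint
    return write_file(target)                         # S304: voice file of the designated object
```

- Passing the models in as callables keeps the sketch independent of any particular model implementation.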
- A voiceprint is the sound wave spectrum that carries voice information, and is a biometric feature composed of multiple characteristic dimensions such as wavelength, frequency and intensity.
- A voiceprint is stable, measurable and unique, and can be used to uniquely identify the voice characteristics of an object; that is, a voiceprint can be used to characterize the identity of an object. Therefore, after obtaining relatively pure reference voice data of a specified object, the embodiment of the present application supports extracting the voiceprint characteristics of the specified object from the reference voice data, so as to facilitate the subsequent extraction of the voice signal based on the unique voiceprint characteristics.
- The voiceprint vector extraction model outputs the voiceprint representation vector (or simply the voiceprint vector) of the specified object; that is, the voiceprint vector extraction model analyzes the reference voice data to obtain a voiceprint representation vector that characterizes the voiceprint characteristics of the specified object.
- By using vector embedding to convey the voiceprint information, the voiceprint representation vector can be input into the voice segmentation model to participate in the calculation of the attention mechanism, so as to segment from the aliased voice data the target voice signal that matches the voiceprint characteristics of the specified object.
- With this vector embedding mechanism, the voice segmentation model needs no additional training on any object's historical voice data; only a voiceprint representation vector extracted from a small amount of reference voice data is required. This removes the dependence on large-scale voice data, makes the system more reusable, portable and efficient, reduces the complexity of user input operations, and improves the generality of the system.
- the speech segmentation model provided in the embodiment of the present application is obtained by improving the traditional speech segmentation network by using the attention mechanism; specifically, it is obtained by integrating the attention mechanism into the traditional speech segmentation network.
- the traditional speech segmentation network involved in the embodiment of the present application is a Unet (or represented as U-net, U-Net, etc.) network, which is one of the algorithms for semantic segmentation using a fully convolutional network; it mainly uses a symmetrical U-shaped structure including a compression path and an expansion path.
- FIG. 5 shows a schematic diagram of an exemplary network structure of a Unet network.
- the Unet network is a U-shaped symmetrical network structure, which includes a left-right symmetrical feature extraction subnetwork and an upsampling subnetwork, and the feature extraction subnetwork and the upsampling subnetwork are connected through a convolution connection layer.
- the hierarchical distribution of m convolutional layers means that the m convolutional layers are connected in sequence, and the upper convolutional layer of any two adjacent convolutional layers in the m convolutional layers is used as the convolutional layer of the upper level, and the lower convolutional layer is used as the convolutional layer of the next level, and the feature map output by the convolutional layer of the previous level is used as the input of the convolutional layer of the next level.
- A pooling function can be deployed after each convolutional layer: the convolutional network in the convolutional layer first performs feature extraction on the aliased speech data, and the pooling function (pool) then further extracts higher-order features, so that the features to be highlighted in the aliased speech data are effectively retained. The embodiment of the present application does not limit the type of pooling function; for example, if the pooling function is maximum pooling (max pool), it keeps the maximum feature within each pooling window (such as a window of size 2*2) of the feature map output by the convolutional layer.
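- As an illustration of maximum pooling with a 2*2 window, the following minimal sketch keeps the largest value in each non-overlapping 2*2 block of a two-dimensional feature map. The function name `max_pool_2x2`, the stride of 2 and the plain nested-list representation of the feature map are assumptions made for the example.

```python
from typing import List

Matrix = List[List[float]]


def max_pool_2x2(feature_map: Matrix) -> Matrix:
    """Apply non-overlapping 2*2 max pooling (assumed stride 2) to a 2-D feature map.

    Any trailing odd row or column is dropped.
    """
    rows = len(feature_map) // 2
    cols = len(feature_map[0]) // 2 if feature_map else 0
    return [
        [
            max(
                feature_map[2 * r][2 * c],
                feature_map[2 * r][2 * c + 1],
                feature_map[2 * r + 1][2 * c],
                feature_map[2 * r + 1][2 * c + 1],
            )
            for c in range(cols)
        ]
        for r in range(rows)
    ]
```

- For the 4*4 map `[[1, 3, 2, 0], [4, 2, 1, 5], [0, 1, 7, 2], [2, 6, 3, 4]]`, the pooled result is `[[4, 5], [6, 7]]`, so the height and width are both halved.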
- the feature extraction subnetwork and the upsampling subnetwork are symmetrical, and the upsampling subnetwork can be simply understood as a decoding network, which includes an upsampling layer (up sampling layer) corresponding to each convolutional layer in the feature extraction subnetwork.
- a transposed convolution (up-Conv) with a convolution kernel of 2*2 is also deployed after the convolution connection layer and each upsampling layer to achieve the upsampling function through transposed convolution.
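- As an illustration of upsampling through transposed convolution, the sketch below spreads every input value over a 2*2 output block weighted by a 2*2 kernel, which doubles the height and width of a single-channel feature map. The stride of 2, the single channel and the function name `transposed_conv_2x2` are assumptions; the text only states that the kernel is 2*2.

```python
from typing import List

Matrix = List[List[float]]


def transposed_conv_2x2(feature_map: Matrix, kernel: Matrix) -> Matrix:
    """Transposed convolution with a 2*2 kernel and an assumed stride of 2."""
    height, width = len(feature_map), len(feature_map[0])
    output = [[0.0] * (2 * width) for _ in range(2 * height)]
    for r in range(height):
        for c in range(width):
            # Each input value contributes a scaled copy of the kernel
            # to its own 2*2 block of the output.
            for kr in range(2):
                for kc in range(2):
                    output[2 * r + kr][2 * c + kc] += feature_map[r][c] * kernel[kr][kc]
    return output
```

- With stride 2 the output blocks do not overlap, so a 2*2 input becomes a 4*4 output; in a trained network the kernel values would be learned rather than fixed.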
- With its symmetrical network structure, the Unet network can be built from scratch, with the weights initialized and the model then trained; it can also borrow the convolutional layer structure of an existing network (such as VGG, a convolutional network, or ResNet, a residual neural network) together with the corresponding trained weight files, and then add the subsequent upsampling layers for training. Reusing existing weight files in this way can greatly speed up model training.
- Each convolutional layer, convolutional connection layer and upsampling layer includes multiple convolutional networks connected in sequence; as shown in FIG. 5, the feature extraction subnetwork, convolutional connection layer and upsampling layer can each include three convolutional networks with a convolution kernel of 3*3.
- the convolutional network is also called a convolutional neural network (CNN);
- a convolutional neural network is a feedforward neural network, which is mainly composed of one or more convolutional layers and a fully connected layer at the top, and also includes associated weights and a pooling layer.
- an activation function can be deployed after each convolutional network to add nonlinear factors to the model through the activation function, so that the trained model can solve problems that the linear model cannot solve.
- The embodiment of the present application does not limit the type of activation function; for example, the activation function can be a ReLU function (rectified linear unit), a Sigmoid function, a Tanh function, and the like.
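- For reference, the three common activation functions can be written as the following scalar functions. This is a minimal sketch using only the Python standard library; in a real network they would be applied element-wise to the output of each convolutional network.

```python
import math


def relu(x: float) -> float:
    """Rectified linear unit: passes positive values, zeroes out negative ones."""
    return max(0.0, x)


def sigmoid(x: float) -> float:
    """Logistic function, written to avoid overflow for large negative inputs."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def tanh(x: float) -> float:
    """Hyperbolic tangent, mapping inputs into the range (-1, 1)."""
    return math.tanh(x)
```

- All three are nonlinear, which is what allows stacked convolutional networks to represent more than a single linear transformation.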
- the Unet network can also effectively combine high-level feature maps and low-level feature maps through skip connections (also called copy and crop) to obtain the final feature map.
- The specific process of skip connection may include: the feature map obtained by each convolutional layer in the feature extraction subnetwork is concatenated to the corresponding upsampling layer in the upsampling subnetwork, so that the feature maps are effectively used in subsequent calculations.
- this method of skipping feature maps of different dimensions can effectively avoid direct supervision and loss calculation in high-level feature maps, effectively combine the features in low-level feature maps, so that the final feature map contains both high-dimensional features and many low-dimensional features, realizing the fusion of features at different scales and improving the accuracy of the model results.
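- As an illustration of the skip connection ("copy and crop"), the sketch below center-crops every channel of an encoder feature map to the spatial size of the decoder feature map and then concatenates the two along the channel axis. The channel-first nested-list layout and the function names `center_crop` and `skip_connect` are assumptions made for the example.

```python
from typing import List

Channel = List[List[float]]      # one 2-D channel
FeatureMap = List[Channel]       # assumed layout: list of channels


def center_crop(channel: Channel, height: int, width: int) -> Channel:
    """Cut a height*width window out of the middle of a channel."""
    top = (len(channel) - height) // 2
    left = (len(channel[0]) - width) // 2
    return [row[left:left + width] for row in channel[top:top + height]]


def skip_connect(encoder_map: FeatureMap, decoder_map: FeatureMap) -> FeatureMap:
    """Copy the encoder map, crop it to the decoder size, and concatenate channels."""
    height, width = len(decoder_map[0]), len(decoder_map[0][0])
    cropped = [center_crop(channel, height, width) for channel in encoder_map]
    return cropped + [[list(row) for row in channel] for channel in decoder_map]
```

- The result carries the low-level encoder channels alongside the high-level decoder channels, so the following convolutional networks can use both.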
- the above-mentioned Figure 5 gives a detailed introduction to the network structure of the traditional Unet network.
- the speech segmentation model provided in the embodiment of the present application is obtained by improving the network structure of the Unet network.
- the improvement of the network structure of the Unet network in the embodiment of the present application mainly includes: integrating the attention mechanism in all or part of the network layers (such as convolutional layers, convolutional connection layers and upsampling layers) in the network structure of the Unet network.
- the attention mechanism is integrated in all or part of the networks in the speech segmentation model obtained by improving the Unet network based on the attention mechanism.
- The voiceprint representation vector of the specified object can be embedded into each network layer in the Unet network, so that each network layer can deeply perceive the voiceprint information or voiceprint characteristics represented by the voiceprint representation vector; the final output voice signal is therefore closer to that of the specified object, ensuring that the extracted voice signal is purer.
- the structural diagram of the speech segmentation model constructed when the Attention mechanism is added to each network layer in the Unet network can be seen in FIG6.
- the improved speech segmentation model has the same basic network architecture as the original Unet network architecture, but an attention mechanism is added to each layer in the Unet network architecture, and the input information of the attention mechanism is: the voiceprint representation vector of the specified object and the feature map output by the previous layer of the attention mechanism.
- the entire model can deeply perceive and learn the extracted voiceprint representation vector, so that the calculation of each layer can be close to the voiceprint representation vector, ensuring that the final extracted speech signal matches the voiceprint characteristics represented by the voiceprint representation vector.
- the fusion position of the attention mechanism in the multiple convolutional networks corresponding to the network layer is not fixed, and the fusion position shown in FIG6 is exemplary.
- each network layer in the speech segmentation model is integrated with the attention mechanism as an example to introduce the specific implementation process of the speech segmentation model based on the attention mechanism to segment the speech signal matching the voiceprint characteristics of the specified object from the aliased speech data, and generate the speech file of the specified object based on the speech signal; the process may include but is not limited to steps (1)-(4), wherein:
- Time domain and frequency domain are two commonly used concepts in audio applications, and are two dimensions for measuring audio features. The time domain displays and processes the sampling points of the speech signal over time, that is, it is bound to time; the frequency domain represents the energy distribution of the speech signal across frequency bands. Through conversion formulas (such as the Fourier Transform, Laplace Transform or Z Transform), the speech signal can be converted from the time domain to the frequency domain, or from the frequency domain to the time domain.
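- As an illustration of converting between the two domains, the sketch below implements a direct discrete Fourier transform and its inverse with the Python standard library. It is a deliberately simple O(n^2) version for short signals; the text does not state which transform variant or frame length is used, so the function names `dft` and `inverse_dft` and the whole-signal transform are assumptions.

```python
import cmath
from typing import List


def dft(signal: List[float]) -> List[complex]:
    """Time domain to frequency domain: one complex coefficient per frequency bin."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def inverse_dft(spectrum: List[complex]) -> List[float]:
    """Frequency domain back to time domain (assumes the original signal was real)."""
    n = len(spectrum)
    return [
        (sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n).real
        for t in range(n)
    ]
```

- Applying `inverse_dft` to the output of `dft` recovers the original samples up to floating-point rounding, which is the property the segmentation flow relies on when it returns from spectrum features to a time-domain signal.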
- the correlation between the voiceprint representation vector and the speech spectrum features is calculated to obtain the speech spectrum feature segmentation that matches the voiceprint characteristics.
- the voiceprint representation vector is input into each network layer in the speech segmentation network, specifically the attention mechanism integrated in each network layer.
- the correlation between the voiceprint representation vector and the first feature map of the corresponding network layer can be calculated to obtain the second feature map output by the corresponding network layer; it is worth noting that the first feature map of the network layer is different depending on the position of the network layer in the speech segmentation network, which will be introduced in the subsequent embodiments.
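- The text does not give a formula for the correlation calculation. As one possible reading, the sketch below scores every frame of a feature map by its scaled dot product with the voiceprint representation vector, normalizes the scores with a softmax over frames, and reweights the frames accordingly. The scoring function, the frame-per-row layout and the name `voiceprint_attention` are all assumptions, not the method of the embodiment.

```python
import math
from typing import List

FeatureMap = List[List[float]]  # assumed layout: one row of features per frame


def voiceprint_attention(feature_map: FeatureMap, voiceprint: List[float]) -> FeatureMap:
    """Reweight each frame by its softmax-normalized similarity to the voiceprint.

    Assumption: the "correlation" is a scaled dot product. The output has the
    same shape as the input feature map.
    """
    dim = len(voiceprint)
    scores = [
        sum(f * v for f, v in zip(frame, voiceprint)) / math.sqrt(dim)
        for frame in feature_map
    ]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    return [[w * f for f in frame] for w, frame in zip(weights, feature_map)]
```

- Frames that resemble the voiceprint receive larger weights, and the feature dimension is unchanged, so the output can be passed directly to the next convolutional network.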
- the second feature map output by the 2m+1th network layer in the speech segmentation model (that is, the last upsampling layer in the upsampling subnetwork) can be used as the speech spectrum feature segmentation that matches the voiceprint representation vector;
- the speech spectrum feature segmentation is specifically the segmentation in the speech spectrum feature that matches the voiceprint characteristics, that is, the segmentation in the speech spectrum characteristics that belongs to the specified object.
- each network layer in the speech segmentation network can deeply sense the voiceprint information of the specified object, so that the final segmented output voice signal is closer to the voiceprint characteristics of the specified object, ensuring that the extracted voice signal is purer and more accurate.
- the target network layer is a convolutional layer or a convolutional connection layer in a speech segmentation model
- The fusion position of the attention mechanism among the multiple convolutional networks of the target network layer is between the first convolutional network and the second convolutional network, that is, between the first of the sequentially connected convolutional networks contained in the target network layer and the convolutional network immediately after it.
- the target network layer calculates the correlation between the voiceprint representation vector and the first feature map of the target network layer to obtain the second feature map output by the target network layer.
- the specific implementation process may include: first, using the first convolutional network in the target network layer, performing feature extraction processing on the first feature map of the target network layer to obtain the third feature map of the target network layer;
- the feature extraction processing here refers to the process of extracting useful information (i.e., features) from the first feature map for subsequent classification, clustering, regression and other tasks;
- the feature extraction processing process may include: preprocessing the first feature map (such as denoising, normalization or standardization), and performing feature extraction on the preprocessed first feature map to extract useful features, and screening representative or discriminative features from the extracted features, and the screened features are used as features after feature extraction processing.
- the target network layer is the first convolution layer 701 of the hierarchical distribution in the speech segmentation model (specifically, the feature extraction subnetwork)
- the first feature map of the target network layer is the speech spectrum feature obtained by frequency domain conversion of the aliased speech data
- the target network layer is other convolution layers (such as convolution layer 702) in the speech segmentation model except the first convolution layer 701
- the first feature map of the target network layer is obtained by pooling the feature map output by the upper-level network layer adjacent to the target network layer (such as convolution layer 701)
- the pooling process is performed by the pooling layer in the target network layer, and the pooling process aims to reduce the size and parameter amount of the feature map output by the upper-level network layer through parallel processing or data compression, thereby reducing the amount of calculation.
- the correlation between the voiceprint representation vector and the third feature map of the target network layer is calculated to obtain the fourth feature map of the target network layer;
- the feature dimension of the third feature map of the target network layer is the same as the feature dimension of the fourth feature map;
- the feature map (such as the third feature map and the fourth feature map) can be expressed in the form of a vector, so the feature dimension of the feature map can be the dimension of the vector, each dimension in the vector corresponds to a feature, that is, the feature dimension before and after the attention mechanism calculation is the same.
- the fourth feature map is subjected to feature extraction processing using other convolutional networks except the first convolutional network in the target network layer to obtain the second feature map output by the target network layer.
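- The order of operations described above for a convolutional layer or convolutional connection layer (first convolutional network, then attention, then the remaining convolutional networks) can be sketched as follows. The function name `conv_layer_forward` is hypothetical, and the convolutional networks and the attention mechanism are passed in as callables because their internals are not fixed here.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")  # type of a feature map
V = TypeVar("V")  # type of the voiceprint representation vector


def conv_layer_forward(
    first_map: T,
    voiceprint: V,
    convs: Sequence[Callable[[T], T]],
    attention: Callable[[T, V], T],
) -> T:
    """Forward pass with attention fused after the first convolutional network."""
    third_map = convs[0](first_map)                  # first convolutional network
    fourth_map = attention(third_map, voiceprint)    # same feature dimension as third_map
    second_map = fourth_map
    for conv in convs[1:]:                           # remaining convolutional networks
        second_map = conv(second_map)
    return second_map
```

- The variable names follow the numbering of the feature maps in the description, so the second feature map is the output of the whole layer.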
- the attention mechanism is integrated in the multiple convolutional networks contained in the convolutional layer or the convolutional connection layer.
- the features that match the voiceprint representation vector of the specified object can be focused on in the aliased speech data based on the attention mechanism, and then the second feature map that matches the voiceprint representation vector can be analyzed through the attention mechanism in the feature extraction process, and then the target speech signal of the specified object can be accurately segmented from the aliased speech data based on the second feature map.
- the target network layer is an upsampling layer in a speech segmentation model
- the fusion position of the attention mechanism in the multiple convolutional networks corresponding to the target network layer is: the position after the last convolutional network in the multiple sequentially connected convolutional networks.
- the target network layer calculates the correlation between the voiceprint representation vector and the first feature map of the target network layer to obtain the second feature map output by the target network layer.
- the specific implementation process may include: first, using multiple sequentially connected convolutional networks in the target network layer to perform feature extraction processing on the target feature map to obtain the first feature map of the target network layer.
- The target feature map here is obtained by concatenating the feature map output by the convolutional layer corresponding to the target network layer with the feature map output by the upper-level network layer of the target network layer. As shown in FIG. 7b, the input of the first upsampling layer 703 in the upsampling subnetwork is the target feature map, which is obtained by concatenating the feature map output by the convolution connection layer 704 at the level above the first upsampling layer 703 with the feature map output by the convolution layer 705 corresponding to the first upsampling layer 703.
- the attention mechanism fused in the target network layer is used to calculate the correlation between the voiceprint representation vector and the first feature map of the target network layer to obtain the second feature map output by the target network layer; the feature dimension of the second feature map is the same as that of the first feature map.
- The attention mechanism is integrated in the multiple convolutional networks contained in the upsampling layer. After features are extracted from the feature map output by the previous network layer and the feature map output by the corresponding convolutional layer, the attention mechanism can focus on the features in the resulting first feature map that match the voiceprint representation vector of the specified object, thereby obtaining the second feature map that matches the voiceprint representation vector of the specified object; based on the second feature map, the target speech signal of the specified object can then be accurately segmented from the aliased speech data.
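- For an upsampling layer, the order differs: the two input feature maps are concatenated, all convolutional networks are applied, and the attention mechanism comes last. The sketch below shows only this ordering; the function name `upsampling_layer_forward` is hypothetical, and the concatenation, convolutional networks and attention mechanism are supplied as callables.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")  # type of a feature map
V = TypeVar("V")  # type of the voiceprint representation vector


def upsampling_layer_forward(
    encoder_map: T,
    previous_map: T,
    voiceprint: V,
    concatenate: Callable[[T, T], T],
    convs: Sequence[Callable[[T], T]],
    attention: Callable[[T, V], T],
) -> T:
    """Forward pass with attention fused after the last convolutional network."""
    target_map = concatenate(encoder_map, previous_map)  # feature concatenation
    first_map = target_map
    for conv in convs:                                   # all convolutional networks
        first_map = conv(first_map)
    return attention(first_map, voiceprint)              # second feature map, same dimension
```

- Here `encoder_map` stands for the feature map output by the corresponding convolutional layer and `previous_map` for the feature map output by the upper-level network layer.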
- FIGS. 7a and 7b are only exemplary processes for the target network layer to perform correlation calculation when the attention mechanism is respectively integrated into an exemplary fusion position in the feature extraction module, the convolution connection layer and the upsampling sub-network layer; when the attention mechanism is integrated into different fusion positions in the target network layer, the specific implementation process of the target network layer performing the correlation calculation is different.
- The method of converting the speech spectrum feature segment from the frequency domain to the time domain may include but is not limited to the aforementioned Fourier Transform, Laplace Transform or Z Transform.
- the file format of the voice file can be set according to the personalized needs of the user, and the embodiment of the present application does not limit the file format of the voice file.
- When the voice file is a text file, a speech recognition algorithm or tool can be used to convert the voice signal that matches the voiceprint characteristics of the specified object into text; the converted text is then post-processed (such as correcting spelling errors, adding punctuation and tidying the text) and saved as a text file in a text format (such as .doc format).
- When the voice file is an audio file, the voice signal that matches the voiceprint characteristics of the specified object can be directly saved as an audio file in an audio format (such as .WAV format).
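- As an illustration of saving a time-domain signal in .WAV format, the sketch below uses the `wave` module of the Python standard library. The mono channel, the 16-bit sample width, the float samples in the range [-1, 1] and the function name `write_wav` are assumptions; the text does not specify the audio parameters.

```python
import struct
import wave
from typing import BinaryIO, Sequence, Union


def write_wav(samples: Sequence[float], sample_rate: int, target: Union[str, BinaryIO]) -> None:
    """Write float samples in [-1, 1] as a mono 16-bit PCM WAV file.

    `target` is a file path or a writable, seekable binary file object.
    """
    frames = b"".join(
        # Clip to [-1, 1], then scale to the signed 16-bit range.
        struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767))
        for s in samples
    )
    with wave.open(target, "wb") as wav_file:
        wav_file.setnchannels(1)        # assumed: mono
        wav_file.setsampwidth(2)        # assumed: 16-bit samples
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(frames)
```

- For example, `write_wav(target_signal, 16000, "designated_object.wav")` would store a signal sampled at 16 kHz, where both the sampling rate and the file name are illustrative.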
- the embodiment of the present application supports converting the aliased speech data from the time domain to the frequency domain to obtain the speech spectrum features of the aliased speech data in the frequency domain.
- the attention mechanism of each network layer in the speech segmentation model can be used to calculate the correlation between the voiceprint representation vector and the speech spectrum features that belong to the frequency domain, thereby ensuring the feasibility of the correlation calculation; considering that the final segmentation is the signal in the time domain, it is necessary to convert the speech spectrum features from the frequency domain to the time domain to obtain the target speech signal that matches the voiceprint characteristics, thereby ensuring that the final extracted signal is a time domain signal that can be understood and read by the device.
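The round trip between the time domain and the frequency domain described above can be illustrated with a small sketch. It assumes non-overlapping frames, no window function and a naive discrete Fourier transform written in plain Python; these are simplifications chosen for readability and are not taken from the application, and all function names are hypothetical.

```python
import cmath
import math


def dft(frame):
    """Naive discrete Fourier transform of one frame (time -> frequency)."""
    n = len(frame)
    return [
        sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Inverse transform of one frame (frequency -> time), real part only."""
    n = len(spectrum)
    return [
        (sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n).real
        for t in range(n)
    ]


def to_spectrum_frames(signal, frame_len):
    """Split a signal into non-overlapping frames and transform each one."""
    usable = len(signal) - len(signal) % frame_len
    return [dft(signal[i:i + frame_len]) for i in range(0, usable, frame_len)]


def to_time_signal(spectrum_frames):
    """Transform every frame back to the time domain and concatenate."""
    signal = []
    for spectrum in spectrum_frames:
        signal.extend(idft(spectrum))
    return signal


# Usage: a short sine wave survives the round trip.
original = [math.sin(2 * math.pi * 3 * t / 16) for t in range(32)]
frames = to_spectrum_frames(original, frame_len=16)
restored = to_time_signal(frames)
```

In a real system the model would modify the spectrum between the two conversions; here the frames are passed through unchanged to show only the domain conversion itself.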
- each network layer in the speech segmentation network is different, specifically, the feature dimensions of each convolutional network in each network layer are different. Therefore, before inputting the voiceprint representation vector of the specified object into each network layer in the speech segmentation network, it is also necessary to use a network layer to transform the dimension of the voiceprint representation vector to obtain the voiceprint representation vector after the dimension transformation.
- dimension transformation refers to changing the feature dimension of the voiceprint representation vector, so that the feature dimension of the voiceprint representation vector after dimension transformation is the same as the feature dimension of the feature map of the attention mechanism to be input into the corresponding network layer, so that when the voiceprint representation vector after dimension transformation is input into the speech segmentation model, the speech segmentation model can effectively process the voiceprint representation vector to avoid the unavailability of the voiceprint representation vector caused by different dimensions.
- the feature map of the attention mechanism to be input into the corresponding network layer can refer to the third feature map described above.
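The dimension transformation can be pictured as a small linear layer per network layer that maps the voiceprint representation vector to the feature dimension expected there. The sketch below is an assumption about one possible form: the helper names, the random weights and the example dimensions (8, then 4, 16 and 32) are all hypothetical.

```python
import random


def make_projection(in_dim, out_dim, seed=0):
    """Create a hypothetical linear layer as a (weights, bias) pair."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weights, bias


def project(vector, layer):
    """Map a vector to the layer's output dimension: W @ v + b."""
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, vector)) + b for row, b in zip(weights, bias)]


# Usage: one 8-dimensional voiceprint vector, three network layers
# whose attention inputs have different feature dimensions.
voiceprint = [0.1 * i for i in range(8)]
layer_dims = [4, 16, 32]
projected = [project(voiceprint, make_projection(8, d, seed=d)) for d in layer_dims]
```

In a trained model the weights would be learned together with the rest of the network rather than drawn at random.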
- the attention mechanism can be inserted between any two convolutional networks in a plurality of convolutional networks connected sequentially in the network layer;
- the first feature map of the network layer can refer to the feature map output by the convolutional network in the network layer that is adjacent to the attention mechanism and is located before the attention mechanism.
- the embodiment of the present application innovatively constructs a fully automatic speech processing solution, which realizes the segmentation of speech signals based on the voiceprint vector embedding method; the user only needs to input a small segment of reference speech data of the specified object and the aliased speech data to be segmented, and the voice signal of the specified object is then automatically and quickly separated from the aliased speech data, which can greatly improve the efficiency of voice segmentation, remove the need for manual participation, and form a rapid, standardized process.
- the voiceprint vector embedding method is adopted, and the voiceprint representation vector used to represent the voiceprint characteristics of the specified object is input into the voice segmentation model to participate in the Attention calculation, which can make the voice segmentation model independent of any object's historical voice data for additional training, and can get rid of the dependence on the object's large-scale voice data, so as to achieve the system's high reusability and portability, making the entire system more universal.
- the input voiceprint representation vector can be calculated with the Attention mechanism of each network layer in the Unet network, so as to ensure that the feature map output by the voice segmentation model and the voiceprint characteristics represented by the voiceprint representation vector of the specified object are more closely matched, avoiding too much noise in the extracted voice signal of the specified object, improving the purity of the voice signal, and improving the segmentation accuracy of the voice segmentation model.
- FIG. 8 shows a flowchart of another voice processing method provided by an exemplary embodiment of the present application
- the voice processing method may be executed by a computer device in the aforementioned system, such as a terminal and/or a server; the voice processing method may include but is not limited to steps S801-S806:
- S802 Acquire reference speech data of a designated object among at least two objects.
- S803 Perform short-term correlation analysis on the reference speech data of the designated object.
- S804 Perform long-term correlation analysis on the reference speech data of the designated object to obtain a voiceprint representation vector of the designated object.
- in steps S803-S804, in order to learn clearer voiceprint characteristics of the specified object from its reference voice data, the embodiment of the present application supports analyzing the reference voice data by combining short-time correlation and long-time correlation, so as to extract a voiceprint representation vector that can fully express the voiceprint characteristics of the specified object.
- the short-time correlation analysis of the reference voice data can be simply understood as: the process of feature analysis of a shorter (such as 20 milliseconds) voice signal in the reference voice data; considering that the voice signal in the reference voice data usually does not change in a short time, after the reference voice data is discretized, the information distribution of each shorter voice signal in the time domain and frequency domain can be used to extract the characteristics of the voice signal in the short time, thereby realizing the feature analysis of each voice signal in the reference voice data.
- the short-time correlation analysis focuses on the feature analysis of the segmented voice signals in the reference voice data.
- the long-time correlation analysis of the reference voice data can be simply understood as: the process of feature analysis of the entire reference voice data; that is, the long-time correlation analysis focuses on the semantic expression of the entire signal sequence of the reference voice data.
- the voiceprint vector extraction model includes an improved audio neural network (PANNS) network and a transformer (Transformer) network.
- the improved audio neural network (PANNS) network can be referred to as improved PANNS, which is mainly used for short-time correlation analysis of reference voice data
- the transformer (Transformer) network can be referred to as transformer network, which is mainly used for long-time correlation analysis of reference voice data.
- a schematic diagram of an exemplary use of the improved PANNS and the Transformer network to perform short-term correlation analysis and long-term correlation analysis on reference speech data can be seen in FIG. 9.
- the reference speech data is converted from the time domain to the frequency domain to obtain the reference speech spectrum characteristics corresponding to the reference speech data; wherein, the conversion formula from the time domain to the frequency domain can be referred to the aforementioned related description, which will not be repeated here.
- the reference speech spectrum characteristics can also be segmented to obtain the reference speech spectrum feature segmentation corresponding to each speech data segmentation;
- the reference speech spectrum feature segmentation involved in the embodiment of the present application can be a logarithmic Mel spectrum (Log-mel or Logmel).
- the mel (Mel) scale is a nonlinear frequency scale determined from the human ear's perception of equidistant pitch changes (that is, the frequency bands are equidistantly distributed on the Mel scale), where pitch refers to how high or low a sound is; the scale is set artificially so that signal processing better matches changes in the auditory perception of the human ear.
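To make "equidistant on the Mel scale" concrete, the sketch below uses one commonly used Hz-to-mel conversion formula, `mel = 2595 * log10(1 + f / 700)`. The choice of this formula, the function names and the band count are assumptions for illustration; the application does not state which mel definition it uses.

```python
import math


def hz_to_mel(hz):
    """Convert a frequency in Hz to mel (one common convention)."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_edges(low_hz, high_hz, n_bands):
    """Return n_bands + 2 edge frequencies equally spaced on the mel scale."""
    low_mel, high_mel = hz_to_mel(low_hz), hz_to_mel(high_hz)
    step = (high_mel - low_mel) / (n_bands + 1)
    return [mel_to_hz(low_mel + i * step) for i in range(n_bands + 2)]


def log_compress(energies, floor=1e-10):
    """Take the logarithm of band energies, as done for a log-mel spectrum."""
    return [math.log(max(e, floor)) for e in energies]


# Usage: 10 bands between 0 Hz and 8 kHz.
edges = mel_band_edges(0.0, 8000.0, n_bands=10)
```

The edges are evenly spaced in mel but grow progressively wider in Hz, which mirrors the ear's coarser resolution at high frequencies.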
- the reference speech spectrum feature segment corresponding to each speech data segment is input into the improved PANNS segment by segment.
- the complete reference speech data is input into the improved PANNS; thus, the improved PANNS can segment the received reference speech data according to the segmentation rules followed when segmenting the reference speech spectrum characteristics, and obtain multiple speech data segments corresponding to the reference speech data.
- the improved PANNS will perform short-term correlation analysis on each speech data segment based on each speech data segment and the corresponding reference speech spectrum characteristic segment, and obtain the voiceprint semantic feature vector corresponding to each speech data segment; the voiceprint semantic feature vector is used to characterize the semantic characteristics of the corresponding speech data segment.
- the vector sequence composed of the voiceprint semantic feature vectors corresponding to each speech data segment (or called voiceprint semantic feature vector sequence) is input into the Transformer network, so that the Transformer network can perform long-term correlation analysis on the voiceprint semantic feature vector sequence to obtain the voiceprint representation vector of the specified object, where the voiceprint semantic feature vector sequence includes the voiceprint semantic feature vector corresponding to each speech data segment.
- the Transformer network is a sequence network, and its input is an overall vector sequence, and its output is also a sequence (that is, the voiceprint representation vector of the specified object is a sequence); after the Transformer network outputs a sequence, the last vector in the sequence can be called state, which contains the semantic fusion of the entire sequence; in this way, the overall sequence output by the Transformer network can be averaged (mean), and the average result and the state can be superimposed to generate a voiceprint representation vector that integrates the semantic features expressed by the entire sequence.
- the above-mentioned method of obtaining the voiceprint representation vector by combining the speech features represented by the average result of the Transformer network output sequence and the speech features represented by the overall sequence can effectively ensure that the voiceprint representation vector can more fully reflect the voiceprint characteristics of the specified object and ensure the accuracy of voiceprint feature extraction.
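The fusion of the sequence mean with the final "state" vector can be sketched as follows. Reading "superimposed" as element-wise addition is an assumption, and the function name `fuse_mean_and_state` is hypothetical.

```python
def fuse_mean_and_state(sequence):
    """Combine the mean of a vector sequence with its last vector.

    `sequence` is a list of equal-length vectors (the Transformer output).
    The element-wise sum is one possible reading of "superimposed".
    """
    dim = len(sequence[0])
    count = len(sequence)
    mean = [sum(vec[d] for vec in sequence) / count for d in range(dim)]
    state = sequence[-1]                     # last vector of the sequence
    return [m + s for m, s in zip(mean, state)]


# Usage: three 2-dimensional output vectors.
outputs = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
voiceprint_vector = fuse_mean_and_state(outputs)   # mean [3, 4] + state [5, 6]
```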
- the exemplary structural diagram of the improved PANNS can be seen in FIG. 10; as shown in FIG. 10, the input information of the improved PANNS is the reference speech data, that is, the original speech sampling point sequence (the original sequence of the audio signal).
- the improved PANNS can be divided into two branches, namely the time domain branch (or called the time domain processing branch) and the frequency domain branch (or called the frequency domain processing branch).
- the input information of the time domain branch is the reference speech data
- the input information of the frequency domain branch is the reference speech spectrum feature corresponding to the reference speech data
- the reference speech spectrum feature is obtained by converting the time domain signal "reference speech data" from the time domain to the frequency domain.
- the improved PANNS focuses on short-term correlation analysis of the reference speech data, that is, it supports processing by using the improved PANNS in a segmented input manner; specifically, the improved PANNS only processes a segment of speech data in the reference speech data and the reference speech spectrum feature corresponding to the speech data each time.
- the input information of each input time domain branch is a segment of speech data in the reference speech data; similarly, the input information of each input frequency domain branch is a segment of reference speech spectrum feature in the frequency domain corresponding to a segment of speech data in the reference speech data.
- the reference speech data can be converted from the time domain to the frequency domain to obtain the reference speech spectrum feature corresponding to the reference speech data, and the reference speech spectrum feature is segmented to obtain the reference speech spectrum feature segment corresponding to each speech data segment; and the reference speech data is segmented to obtain multiple speech data segments corresponding to the reference speech data.
- segmentation rules followed by the segmentation processing of the reference speech data are the same as the segmentation rules followed by the segmentation processing of the reference speech spectrum feature; for example, the segmentation rules may include: when the reference speech data is segmented, speech data with a duration of 20 milliseconds is periodically collected as a segment, then when the reference speech spectrum feature is segmented, each reference speech spectrum feature segment is converted to the time domain and the corresponding speech data segment duration is 20 milliseconds. The accuracy of each characteristic analysis is ensured by the correspondence between time domain segmentation and frequency domain segmentation.
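The shared segmentation rule can be sketched as follows: the waveform is cut into 20 ms pieces, and the frame-level spectrum features are grouped so that each group spans the same 20 ms. The 16 kHz sample rate, the 10 ms frame hop and the function names are assumptions made only for this example.

```python
def split_into_segments(samples, sample_rate, segment_ms=20):
    """Cut a waveform into consecutive segments of `segment_ms` milliseconds."""
    seg_len = sample_rate * segment_ms // 1000
    usable = len(samples) - len(samples) % seg_len
    return [samples[i:i + seg_len] for i in range(0, usable, seg_len)]


def split_feature_frames(frames, hop_ms, segment_ms=20):
    """Group spectrum frames so that each group covers `segment_ms` milliseconds."""
    per_segment = segment_ms // hop_ms
    usable = len(frames) - len(frames) % per_segment
    return [frames[i:i + per_segment] for i in range(0, usable, per_segment)]


# Usage: 100 ms of audio at 16 kHz and its 10 spectrum frames (10 ms hop).
sample_rate = 16000
samples = [0.0] * 1600
spectrum_frames = [[0.0] * 64 for _ in range(10)]

speech_segments = split_into_segments(samples, sample_rate)
spectrum_segments = split_feature_frames(spectrum_frames, hop_ms=10)
```

Both lists have the same length, so the i-th speech data segment and the i-th spectrum feature segment describe the same 20 ms of audio.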
- the improved PANNS can perform short-term correlation analysis on each speech data segment based on each speech data segment and the corresponding reference speech spectrum feature segment, and obtain the voiceprint semantic feature vector corresponding to each speech data segment; the voiceprint semantic feature vector is used to characterize the semantic characteristics of the corresponding speech data segment.
- the specific process of the short-term correlation analysis described above is introduced, wherein:
- the time domain branch contains multiple one-dimensional convolutional layers (Conv 1D) and maximum pooling layers (Max pooling); where Conv is the abbreviation of Convolutional, D is the abbreviation of Dimension, and the dimension of the maximum pooling layer is 1, and the stride (abbreviated as s) is a one-dimensional vector with a length of 4.
- the time domain branch includes, in sequence: one-dimensional convolutional layer → one-dimensional convolutional block (Conv 1D block, composed of one or more one-dimensional convolutional layers) → maximum pooling layer (dimension is 1, and the stride s is 4) → one-dimensional convolutional block → maximum pooling layer → one-dimensional convolutional block → maximum pooling layer; after a convolutional layer performs feature extraction on the target speech data segment, the adjacent maximum pooling layer that follows it selects from the extracted feature map, which helps to further extract the features of interest in the feature map.
- the reference speech data (specifically, the target speech data segments) can be subjected to multiple feature extractions to obtain a one-dimensional sequence (resize), and the one-dimensional sequence is transformed (Reshape) to obtain multiple two-dimensional time domain feature maps (or two-dimensional graph Wavegrams).
- the time domain feature map is a graphical representation in the time domain that describes the changes of the reference speech signal in the reference speech data over time, and can intuitively display the basic characteristics of the reference speech signal (such as the periodicity, frequency components, and phase relationships of the reference speech signal).
- the purpose of the dimensional conversion here is to enable the converted time domain feature map to be fused with the frequency domain feature map output by the frequency domain branch.
- the time domain characteristics of the speech signal in the reference speech data (such as information such as audio loudness and sampling point amplitude) can be directly learned when the reference speech data is subjected to feature extraction processing through the time domain branch.
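One stage of the time domain branch (one-dimensional convolution, max pooling with stride 4, then a reshape into a two-dimensional map) can be sketched in plain Python. The kernel values, the pooling window of 4 and the 4-row output shape are illustrative assumptions; only the stride of 4 comes from the description above.

```python
def conv1d(signal, kernel):
    """Valid 1-D convolution (cross-correlation form, no padding)."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


def max_pool1d(signal, size=4, stride=4):
    """Keep the largest value in each window, moving `stride` steps at a time."""
    return [max(signal[i:i + size]) for i in range(0, len(signal) - size + 1, stride)]


def reshape_2d(sequence, rows):
    """Reshape a 1-D sequence into `rows` rows of equal length."""
    cols = len(sequence) // rows
    return [sequence[r * cols:(r + 1) * cols] for r in range(rows)]


# Usage: a 67-sample segment shrinks to 65 after convolution,
# to 16 after pooling, and becomes a 4 x 4 map after reshaping.
segment = [float(i % 8) for i in range(67)]
features = conv1d(segment, [0.25, 0.5, 0.25])
pooled = max_pool1d(features)
wavegram = reshape_2d(pooled, rows=4)
```

Each pooling step with stride 4 reduces the sequence length by roughly a factor of four, which is how the branch condenses raw samples into a compact map that can be fused with the frequency domain features.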
- the frequency domain branch includes multiple two-dimensional convolution layers (Conv 2D) and maximum pooling layers (Max pooling); wherein, the dimension of the maximum pooling layer of the frequency domain branch is 2.
- the frequency domain branch includes in sequence: a two-dimensional convolution block (composed of one or more two-dimensional convolution layers) → a maximum pooling layer (dimension is 2) → a two-dimensional convolution block → a maximum pooling layer → a two-dimensional convolution block; through the multiple two-dimensional convolution layers and the maximum pooling layer, the reference speech spectrum feature segment (Logmel) corresponding to the target speech data segment can be subjected to feature extraction processing to obtain multiple frequency domain feature maps (Feature maps).
- the frequency domain feature map is a graphical representation of the frequency components of the reference speech signal in the reference speech data in the frequency domain, and can display the various frequency components contained in the reference speech signal and their corresponding amplitudes or intensities. It is worth noting that the feature dimension of the frequency domain feature map here is the same as the feature dimension of the time domain feature map output by the time domain branch. In the above process, by using a large number of two-dimensional convolutional layers in the frequency domain branch, the frequency domain characteristics of the speech signal in the reference speech data can be directly learned when the frequency domain branch is used to perform feature extraction processing on the reference speech spectrum feature segments.
- the time domain feature map and the frequency domain feature map can be fused to generate the voiceprint semantic feature vector corresponding to the target speech data segment.
- there are multiple information exchanges between the time domain branch and the frequency domain branch: in each exchange, the features of the time domain branch are transformed in dimension (Reshape) and then fused (concat) with the features of the frequency domain branch, and the fusion result is convolved by a two-dimensional convolution block (Conv 2D block) and input into a higher-level fusion module for fusion.
- time domain processing can obtain the correlation of the reference voice data in time sequence, while frequency domain processing can obtain the correlation of the reference voice data across different frequencies; the two belong to different domains. Through the information interaction between the two domains (time domain and frequency domain) described above, the time domain and frequency domain can maintain information complementarity, so that the high-level network can perceive the underlying network information, thereby achieving full learning of the reference voice data.
- the intermediate time domain feature map obtained by the time domain branch in the i-th feature extraction process, the intermediate frequency domain feature map obtained by the frequency domain branch in the i-th feature extraction process, and the i-1th intermediate feature vector obtained by the i-1th feature extraction process are fused to generate the i-th intermediate feature vector after the i-th feature extraction process.
- the voiceprint semantic feature vector corresponding to the target speech data segment is generated.
- the specific generation process here may include: inputting the kth intermediate feature vector into two-dimensional convolutional neural network layers (2D CNN layers) for feature extraction to obtain a vector sequence; averaging the vector sequence to obtain the average value (mean) and performing a maximum value operation on it to obtain the maximum value (max); adding the average value and the maximum value (sum); passing the result through a layer of activation function (ReLU) to obtain a feature vector (vector); and normalizing the feature vector using a normalization function (softmax) to convert it into a voiceprint semantic feature vector representing a probability distribution, that is, the voiceprint semantic feature vector corresponding to the target speech data segment.
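The pooling steps after the convolutional layers (mean, max, sum, activation, softmax) can be sketched as below. The sketch starts from an already extracted vector sequence, reads the activation as a rectified linear unit, and uses a hypothetical function name; it is an illustration under those assumptions rather than the exact implementation.

```python
import math


def pool_to_vector(sequence):
    """Turn a vector sequence into one probability-like feature vector.

    Steps: per-dimension mean and max, their sum, a rectified linear
    activation (assumed), then softmax normalization.
    """
    dim = len(sequence[0])
    mean = [sum(v[d] for v in sequence) / len(sequence) for d in range(dim)]
    peak = [max(v[d] for v in sequence) for d in range(dim)]
    summed = [m + p for m, p in zip(mean, peak)]
    activated = [max(0.0, x) for x in summed]
    # Subtract the maximum before exponentiating for numerical stability.
    shift = max(activated)
    exps = [math.exp(x - shift) for x in activated]
    total = sum(exps)
    return [e / total for e in exps]


# Usage: two 3-dimensional vectors.
vector = pool_to_vector([[1.0, -2.0, 0.0], [3.0, -4.0, 0.0]])
```

Because of the softmax, the entries of the result are non-negative and sum to one, which matches the description of a vector "representing a probability distribution".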
- the entire reference speech data is divided into frames and input into the improved PANNS respectively.
- the improved PANNS selects to use the last vector of the entire sequence and the mean of the entire sequence to fuse together to generate the final voiceprint representation vector, and can obtain a multi-band semantic feature vector sequence representing the entire reference speech data (including the voiceprint semantic feature vector corresponding to each speech data segment).
- short-term correlation analysis of the reference speech data can be realized, and the voiceprint characteristics of each segmented speech signal in the reference speech data can be fully learned, so that the extracted voiceprint representation vector can fully express the voiceprint information of the specified object.
- the embodiment of the present application also introduces the Transformer network to learn the long-term correlation information of the reference speech data.
- the attention mechanism of the Transformer network can better attend to the voiceprint characteristics of the specified object in the reference speech data, and better realize the extraction of global feature information from the reference speech data.
- the schematic diagram of the network structure of the Transformer network can be seen in Figure 11; as shown in Figure 11, the Transformer network adopts an encoder (encoding)-decoder (decoding) architecture.
- the encoder side is formed by stacking N encoder layers, and the decoder side is formed by stacking N decoder layers (i.e., "N×" in Figure 11).
- the encoder layer mainly includes two sublayers, among which: the first sublayer contains a multi-head attention mechanism (Multi-Head Attention), residual and normalization (Add&Norm); the second sublayer contains a feedforward neural network (Feed Forward), residual and normalization layer.
- the multi-head attention mechanism contained in the first sublayer can help obtain the contextual semantics of the voiceprint semantic feature vector corresponding to the speech data segment.
- the decoder layer mainly consists of three sub-layers, among which: the first sub-layer contains the masked multi-head attention mechanism (Masked Multi-Head Attention) and residual and normalization layers; the second sub-layer contains the multi-head attention mechanism (Multi-Head Attention) and residual and normalization (Add&Norm) layers; and the third sub-layer contains the feedforward neural network (Feed Forward) and residual and normalization layers;
- the decoding layer can help obtain the key content that needs attention through this three-layer structure.
- the input information of the encoder side in the Transformer network is the voiceprint semantic feature vector sequence (Input Embedding) output by the improved PANNS, and the voiceprint semantic feature vector sequence contains the voiceprint semantic feature vector corresponding to each speech data segment.
- the input voiceprint semantic feature vector sequence is positionally encoded (Positional Encoding) to realize the preprocessing of the voiceprint semantic feature vector sequence; wherein, position encoding is a method of additionally representing each element (word) in the vector sequence using the position information of that element, which enables the data input to the encoding side to carry position information.
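As an illustration of position encoding, the sketch below adds the sinusoidal encoding widely used with Transformer networks to each vector of a sequence. The application does not say which encoding scheme is used, so the sinusoidal form, the base of 10000 and the function names are assumptions.

```python
import math


def positional_encoding(length, dim):
    """Build a sinusoidal position table of shape length x dim."""
    table = []
    for pos in range(length):
        row = []
        for i in range(dim):
            # Even and odd indices of a pair share one frequency.
            angle = pos / (10000 ** (2 * (i // 2) / dim))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


def add_position(sequence):
    """Add the position table element-wise to a sequence of vectors."""
    encoding = positional_encoding(len(sequence), len(sequence[0]))
    return [[x + p for x, p in zip(vec, pe)] for vec, pe in zip(sequence, encoding)]


# Usage: three 4-dimensional zero vectors, so the result is the encoding itself.
embedded = add_position([[0.0] * 4 for _ in range(3)])
```

After this step two identical feature vectors at different positions no longer look identical to the encoder, which is what lets the attention layers take order into account.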
- the position-encoded data is input to the encoding side, and each layer in the encoding side (as described above for the structure of the encoding side) encodes the input data to obtain the encoding result. Then, after obtaining the encoding result output by the encoding side, the decoding side supports combining the encoding result and the output feature of the decoding side at the previous moment (shifted right) as the input data of the decoding side at this time, and the decoding side decodes the input data. Finally, the output features of the decoding side are linearly transformed (Linear) and classified (softmax) to obtain the voiceprint representation vector (Output Probabilities) of the specified object.
- the voiceprint semantic feature vector sequence can be calculated through the Transformer network, which can make the long-term semantic expression of the entire reference speech data of the entire voiceprint semantic feature vector sequence clearer, that is, the final output voiceprint representation vector can fully and clearly express the voiceprint characteristics of the specified object.
- S805 Inputting the aliased speech data and the voiceprint representation vector into a preset speech segmentation model, where the speech segmentation model is used to segment a target speech signal that matches the voiceprint characteristics of a specified object from the aliased speech data based on an attention mechanism.
- for the specific implementation of step S805, reference may be made to the relevant description, in the embodiment shown in Figure 3 above, of step S302 regarding segmenting the voice signal that matches the voiceprint characteristics of the specified object from the aliased voice data based on the attention mechanism; it is not repeated here.
- the embodiment of the present application mainly adopts the improved Unet network (i.e., the speech segmentation network) to implement the segmentation of the mixed speech data based on the attention mechanism to obtain a speech signal that matches the voiceprint characteristics of the specified object.
- the speech segmentation network mainly integrates the attention mechanism into the traditional Unet network, such as adding the attention mechanism to each network layer in the traditional Unet network to realize the improvement of the traditional Unet network.
- the embodiment of the present application also supports model distillation of the speech segmentation model integrated with the attention mechanism to obtain a speech segmentation model after model distillation; in this way, the aforementioned correlation calculation can be implemented by the speech segmentation model after model distillation, thereby reducing the scale of the entire system and reducing the overall parameter amount and time consumption.
- the voiceprint representation vector and the speech spectrum feature are both expressed in the form of vectors, so the method used by the speech segmentation model after model distillation to calculate the correlation between them may include, but is not limited to, the dot product method; the correlation calculated by the dot product method is the product of the modulus of the voiceprint representation vector, the modulus of the speech spectrum feature, and the cosine of the angle between them; if the voiceprint representation vector is $\mathbf{a}$, the speech spectrum feature is $\mathbf{b}$, and the angle between them is $\theta$, then the dot product of the voiceprint representation vector and the speech spectrum feature is $\mathbf{a} \cdot \mathbf{b} = |\mathbf{a}|\,|\mathbf{b}|\cos\theta$; the dot product result is used as the similarity between the voiceprint representation vector and the speech spectrum feature.
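The dot-product correlation can be checked numerically with a few lines of plain Python. The example vectors and function names are hypothetical; the point is only that the component-wise dot product equals the product of the two moduli and the cosine of the angle between the vectors.

```python
import math


def dot(a, b):
    """Component-wise dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    """Modulus (Euclidean length) of a vector."""
    return math.sqrt(dot(a, a))


def cosine(a, b):
    """Cosine of the angle between two non-zero vectors."""
    return dot(a, b) / (norm(a) * norm(b))


# Usage: a toy voiceprint vector and a toy spectrum feature vector.
voiceprint = [3.0, 4.0]
feature = [4.0, 3.0]
similarity = dot(voiceprint, feature)
```

A larger dot product means the feature points in a direction closer to the voiceprint vector (or has a larger magnitude), so it is treated as more likely to belong to the specified object.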
- Model distillation is a method of learning a large model (teacher model) with a large number of parameters to obtain a more compact small model (student model) with a small number of parameters.
- Model distillation for speech segmentation models is mainly implemented through technologies such as pruning and knowledge distillation; among them, pruning, also called model pruning (Model Pruning), is a model compression technology that aims to reduce the complexity of the speech segmentation model by deleting some unimportant parameters or structures in it, thereby improving the inference speed of the speech segmentation model and reducing its storage requirements.
- Knowledge distillation can be understood as taking a relatively complex speech segmentation model as a teacher model and training a student model with a small number of parameters and a simple structure to learn and imitate the output of the teacher model, so that the trained student model not only has the advantage of low computational complexity, but also has the same model performance as the teacher model; in the embodiment of the present application, the trained student model is the speech segmentation model after knowledge distillation.
- the model pruning and knowledge distillation mentioned above are the main implementation technologies of model distillation, but model distillation can also include other technologies.
- the embodiments of the present application do not limit the specific implementation methods of model distillation.
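As one hedged illustration of "deleting some unimportant parameters", the sketch below zeroes out the weights with the smallest absolute values. Using weight magnitude as the importance criterion, the 50% sparsity level and the function name are assumptions; the application leaves the pruning criterion open.

```python
def prune_by_magnitude(weights, sparsity):
    """Zero out the given fraction of weights with the smallest magnitude.

    `weights` is a flat list of parameters; `sparsity` is in [0, 1].
    Magnitude is only one possible measure of importance.
    """
    n_prune = int(len(weights) * sparsity)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    dropped = set(order[:n_prune])
    return [0.0 if i in dropped else w for i, w in enumerate(weights)]


# Usage: remove the three smallest of six weights.
pruned = prune_by_magnitude([0.9, -0.05, 0.3, 0.01, -0.7, 0.2], sparsity=0.5)
```

In practice a pruned model is usually fine-tuned afterwards so that the remaining parameters can compensate for the removed ones.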
- S806 Generate a voice file of a designated object based on the segmented target voice signal.
- for step S806, reference may be made to the relevant description of the specific implementation process shown in step S304 in the embodiment shown in FIG. 3, which will not be repeated here.
- the embodiment of the present application provides a reusable voice segmentation method for a specified object with voiceprint vector embedding.
- the method uses a voiceprint vector extraction model composed of an improved PANNS and a Transformer network to extract voiceprints from a small section of reference voice data of the specified object.
- the extracted voiceprint representation vector of the specified object can more fully and clearly express the voiceprint characteristics of the specified object.
- the embodiment of the present application also integrates the attention mechanism into the Unet network, so that the Unet network (i.e., the speech segmentation model) that integrates the attention mechanism can realize the segmentation of aliased voice data based on the attention mechanism, and can more clearly and accurately calculate and extract the voice signal of the specified object in the aliased voice data, ensure the purity of the extracted voice signal, and achieve a more accurate and pure voice separation effect.
- the embodiment of the present application only needs to obtain the reference voice data of the specified object to segment the voice signal of the specified object from the aliased voice data; to obtain the voice data of another object, only the voiceprint characteristics of the object to be segmented need to be changed. There is no need to train a dedicated network for each object, which makes the solution convenient, transferable and highly versatile.
- FIG. 12 shows a schematic structural diagram of a speech processing device provided by an exemplary embodiment of the present application; the speech processing device can be used to execute some or all of the steps in the method embodiment shown in FIG. 3 or FIG. 8.
- The speech processing device includes the following units:
- an acquisition unit 1201, configured to acquire aliased speech data, where the aliased speech data includes a speech signal generated by each of at least two objects;
- the acquisition unit 1201 is further configured to acquire reference speech data of a designated object;
- the designated object is any one of the at least two objects;
- the reference speech data includes a reference speech signal of the designated object;
- a processing unit 1202, configured to extract a voiceprint representation vector of the designated object from the reference speech data, where the voiceprint representation vector represents the voiceprint characteristics of the designated object;
- the processing unit 1202 is further configured to input the aliased speech data and the voiceprint representation vector into a preset speech segmentation model, where the speech segmentation model is configured to segment, based on an attention mechanism, a target speech signal matching the voiceprint characteristics from the aliased speech data;
- the processing unit 1202 is further configured to generate a voice file of the designated object based on the segmented target speech signal.
- The process of segmenting, based on an attention mechanism, a target speech signal matching the voiceprint characteristics from the aliased speech data includes:
- converting the aliased speech data from the time domain to the frequency domain to obtain the speech spectrum features corresponding to the aliased speech data; the speech spectrum features are the feature representation of the aliased speech data in the frequency domain;
- calculating, based on the attention mechanism, the correlation between the voiceprint representation vector and the speech spectrum features to obtain a speech spectrum feature segment that matches the voiceprint characteristics;
- the speech spectrum feature segment is the segment of the speech spectrum features that matches the voiceprint characteristics;
- converting the speech spectrum feature segment from the frequency domain to the time domain to obtain the target speech signal that matches the voiceprint characteristics.
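The time-domain to frequency-domain round trip described above can be sketched as follows. This is a minimal illustration, not the model of the present application: the function names, the non-overlapping framing, and the use of a multiplicative `mask` as a stand-in for the output of the attention-based correlation calculation are all assumptions made for the example.

```python
import numpy as np

def to_spectrum(signal, frame_len=256):
    """Time domain -> frequency domain: one FFT per non-overlapping frame."""
    n_frames = len(signal) // frame_len
    frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)
    return np.fft.rfft(frames, axis=1)  # shape: (n_frames, frame_len // 2 + 1)

def to_signal(spectrum, frame_len=256):
    """Frequency domain -> time domain: inverse FFT per frame, then flatten."""
    return np.fft.irfft(spectrum, n=frame_len, axis=1).reshape(-1)

def segment_by_mask(mixture, mask, frame_len=256):
    """Keep the part of the spectrum selected by `mask` (hypothetical stand-in
    for the attention-based correlation step) and return it as a waveform."""
    spectrum = to_spectrum(mixture, frame_len)
    return to_signal(spectrum * mask, frame_len)

# Example: an all-pass mask returns the mixture unchanged.
rng = np.random.default_rng(0)
mixture = rng.standard_normal(1024)
all_pass = np.ones((4, 129))
recovered = segment_by_mask(mixture, all_pass)
```

In a real system the mask (or the segmented spectrum itself) would come from the trained speech segmentation model, and overlapping windowed frames would normally replace the rectangular non-overlapping frames used here for brevity.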
- the correlation calculation is implemented by a speech segmentation model;
- the speech segmentation model includes a feature extraction subnetwork and an upsampling subnetwork, and the feature extraction subnetwork and the upsampling subnetwork are connected by a convolutional connection layer;
- the feature extraction subnetwork and the upsampling subnetwork are symmetrical; the feature extraction subnetwork contains m convolutional layers distributed in a hierarchy, and the upsampling subnetwork contains an upsampling layer corresponding to each of the m convolutional layers, where m is a positive integer; the convolutional layer, the convolutional connection layer, and the upsampling layer all include multiple convolutional networks connected in sequence;
- the attention mechanism is integrated into all or part of the network layers in the speech segmentation model, and the fusion position of the attention mechanism in the multiple convolutional networks included in the network layer is not fixed;
- the network layer includes a convolutional layer, an upsampling layer and a convolutional connection layer.
- Each network layer in the speech segmentation model is integrated with an attention mechanism; when calculating, based on the attention mechanism, the correlation between the voiceprint representation vector and the speech spectrum features to obtain the speech spectrum feature segment matching the voiceprint characteristics, the processing unit 1202 is specifically configured to:
- input the voiceprint representation vector into each network layer in the speech segmentation model;
- calculate, based on the attention mechanism integrated in each network layer, the correlation between the voiceprint representation vector and the first feature map of the corresponding network layer to obtain the second feature map output by the corresponding network layer; the voiceprint characteristics represented by the second feature map match the voiceprint characteristics represented by the voiceprint representation vector;
- use the second feature map output by the (2m+1)-th network layer in the speech segmentation model as the speech spectrum feature segment that matches the voiceprint characteristics represented by the voiceprint representation vector; the (2m+1)-th network layer in the speech segmentation model is the last upsampling layer in the upsampling subnetwork.
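The layer count follows from the symmetric structure: m convolutional layers, one convolutional connection layer, and m upsampling layers give 2m+1 network layers, the last of which is an upsampling layer. A small sketch, with hypothetical layer names chosen only for this illustration:

```python
def build_layer_layout(m):
    """Return the ordered layer names of the symmetric model for a given m."""
    if m < 1:
        raise ValueError("m must be a positive integer")
    encoder = [f"conv_layer_{i}" for i in range(1, m + 1)]
    bridge = ["conv_connection_layer"]
    # Upsampling layers mirror the convolutional layers in reverse order,
    # so upsampling_layer_i corresponds to conv_layer_i.
    decoder = [f"upsampling_layer_{i}" for i in range(m, 0, -1)]
    return encoder + bridge + decoder

layout = build_layer_layout(3)
```

For m = 3 the layout has seven entries, with the convolutional connection layer in the middle and the upsampling layer paired with the first convolutional layer at the end.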
- any network layer in the speech segmentation model that is integrated with the attention mechanism is represented as a target network layer;
- the target network layer is a convolutional layer or a convolutional connection layer;
- the fusion position of the attention mechanism in the target network layer is: the position between the first convolutional network and the second convolutional network adjacent to the first convolutional network in the multiple convolutional networks connected sequentially included in the target network layer;
- When calculating, based on the attention mechanism integrated in each network layer, the correlation between the voiceprint representation vector and the first feature map of the corresponding network layer to obtain the second feature map output by the corresponding network layer, the processing unit 1202 is specifically configured to:
- perform feature extraction on the first feature map of the target network layer using the first convolutional network in the target network layer to obtain the third feature map of the target network layer; when the target network layer is the first of the hierarchically distributed convolutional layers in the feature extraction subnetwork, the first feature map of the target network layer is the speech spectrum features; when the target network layer is a convolutional layer other than the first convolutional layer in the speech segmentation model, the first feature map of the target network layer is obtained by pooling the feature map output by the adjacent upper-level network layer;
- calculate, according to the attention mechanism integrated in the target network layer, the correlation between the voiceprint representation vector and the third feature map of the target network layer to obtain the fourth feature map of the target network layer;
- the feature dimension of the third feature map is the same as the feature dimension of the fourth feature map;
- perform feature extraction on the fourth feature map using the convolutional networks in the target network layer other than the first convolutional network to obtain the second feature map output by the target network layer.
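The ordering inside such a layer (first convolutional network, then attention, then the remaining convolutional networks) can be sketched as below. The text does not fix the exact form of the correlation calculation, so the scaled dot-product scoring with a softmax over time, the pointwise convolutions, and all function names are assumptions for illustration only; the sketch also assumes the voiceprint vector already has the same dimension as the channel axis of the feature map.

```python
import numpy as np

def voiceprint_attention(feature_map, voiceprint):
    """Re-weight a (channels, time) feature map by its correlation with the
    voiceprint vector; the output has the same shape as the input."""
    channels = feature_map.shape[0]
    scores = voiceprint @ feature_map / np.sqrt(channels)  # one score per time step
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                               # softmax over time
    return feature_map * weights                           # broadcast over channels

def pointwise_conv(weight):
    """A 1x1 convolution followed by ReLU, standing in for a convolutional network."""
    return lambda x: np.maximum(weight @ x, 0.0)

def conv_layer_forward(first_map, voiceprint, convs):
    third_map = convs[0](first_map)                        # first convolutional network
    fourth_map = voiceprint_attention(third_map, voiceprint)
    second_map = fourth_map
    for conv in convs[1:]:                                 # remaining convolutional networks
        second_map = conv(second_map)
    return third_map, fourth_map, second_map

rng = np.random.default_rng(0)
convs = [pointwise_conv(rng.standard_normal((8, 4))),
         pointwise_conv(rng.standard_normal((8, 8)))]
third, fourth, second = conv_layer_forward(
    rng.standard_normal((4, 10)), rng.standard_normal(8), convs)
```

The third and fourth feature maps share the same shape, matching the requirement that the attention step does not change the feature dimension.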
- any network layer in the speech segmentation model that is integrated with the attention mechanism is represented as a target network layer;
- the target network layer is an upsampling layer;
- the fusion position of the attention mechanism in the multiple convolutional networks corresponding to the target network layer is: the position after the last convolutional network in the multiple convolutional networks sequentially connected in the target network layer;
- When calculating, based on the attention mechanism integrated in each network layer, the correlation between the voiceprint representation vector and the first feature map of the corresponding network layer to obtain the second feature map output by the corresponding network layer, the processing unit 1202 is specifically configured to:
- perform feature extraction on a target feature map using the multiple sequentially connected convolutional networks in the target network layer to obtain the first feature map of the target network layer; the target feature map is obtained by concatenating the feature map output by the convolutional layer that corresponds to the target network layer in the feature extraction subnetwork with the feature map output by the upper-level network layer of the target network layer;
- calculate, using the attention mechanism integrated in the target network layer, the correlation between the voiceprint representation vector and the first feature map of the target network layer to obtain the second feature map output by the target network layer; the feature dimension of the second feature map is the same as that of the first feature map.
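The feature concatenation that produces the target feature map of an upsampling layer can be sketched as follows. The text only states that the two feature maps are concatenated; repeating time steps to bring the upper-level output to the resolution of the skip connection, and concatenating along the channel axis, are assumptions of this example, as is the function name.

```python
import numpy as np

def build_target_feature_map(skip_map, previous_map):
    """Concatenate the matching convolutional layer's output (skip_map) with the
    upper-level layer's output (previous_map), both shaped (channels, time)."""
    factor = skip_map.shape[1] // previous_map.shape[1]
    upsampled = np.repeat(previous_map, factor, axis=1)  # nearest-neighbour upsampling
    if upsampled.shape[1] != skip_map.shape[1]:
        raise ValueError("time lengths must differ by an integer factor")
    return np.concatenate([skip_map, upsampled], axis=0)  # stack along channels

skip_map = np.ones((4, 8))       # from the corresponding convolutional layer
previous_map = np.zeros((6, 4))  # from the upper-level network layer
target_map = build_target_feature_map(skip_map, previous_map)
```

The convolutional networks of the upsampling layer would then process `target_map`, with the attention step applied after the last of them.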
- The processing unit 1202 is further configured to:
- perform dimension transformation on the voiceprint representation vector to obtain a dimension-transformed voiceprint representation vector;
- the feature dimension of the dimension-transformed voiceprint representation vector is the same as the feature dimension of the feature map to be input into the attention mechanism integrated in the corresponding network layer.
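One simple way to realize such a dimension transformation is a separate linear projection per network layer. The sketch below uses randomly initialized matrices purely as placeholders for learned weights; the function names and the choice of a linear map are assumptions, since the text does not specify how the transformation is performed.

```python
import numpy as np

def make_projections(embed_dim, layer_channels, seed=0):
    """One (channels, embed_dim) matrix per layer; random placeholders for learned weights."""
    rng = np.random.default_rng(seed)
    return [rng.standard_normal((c, embed_dim)) / np.sqrt(embed_dim)
            for c in layer_channels]

def transform_voiceprint(voiceprint, projection):
    """Map the voiceprint vector to the feature dimension expected by one layer."""
    return projection @ voiceprint

voiceprint = np.ones(16)
projections = make_projections(16, [8, 32, 64])
transformed = [transform_voiceprint(voiceprint, p) for p in projections]
```

Each transformed vector then matches the channel dimension of the feature map it is correlated with in its layer.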
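- The processing unit 1202 is further configured to:
- if the number of network layers integrated with the attention mechanism in the speech segmentation model is greater than a number threshold, perform model distillation on the speech segmentation model to obtain a distilled speech segmentation model;
- the correlation calculation is then performed by the distilled speech segmentation model.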
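- When extracting the voiceprint representation vector of the specified object from the reference voice data, the processing unit 1202 is specifically configured to:
- segment the reference voice data to obtain a plurality of voice data segments corresponding to the reference voice data;
- convert the reference voice data from the time domain to the frequency domain to obtain the reference speech spectrum feature corresponding to the reference voice data;
- segment the reference speech spectrum feature to obtain a reference speech spectrum feature segment corresponding to each voice data segment;
- perform short-term correlation analysis on each voice data segment, based on that voice data segment and the corresponding reference speech spectrum feature segment, to obtain the voiceprint semantic feature vector corresponding to each voice data segment; the voiceprint semantic feature vector represents the semantic characteristics of the voice data segment;
- perform long-term correlation analysis on the voiceprint semantic feature vector sequence, which includes the voiceprint semantic feature vector corresponding to each voice data segment, to obtain the voiceprint representation vector of the specified object.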
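The pairing of each voice data segment with its reference speech spectrum feature segment can be sketched as below. Frame length, segment length, the magnitude spectrum, and the function name are assumptions made for this example; the text does not specify how the segments are sized.

```python
import numpy as np

def segment_reference(signal, frame_len=128, frames_per_segment=4):
    """Split a reference waveform into segments and return, for each segment,
    the matching slice of the whole-signal frame-level spectrum."""
    seg_len = frame_len * frames_per_segment
    n_segments = len(signal) // seg_len
    trimmed = signal[: n_segments * seg_len]
    speech_segments = trimmed.reshape(n_segments, seg_len)
    # Spectrum of the whole reference signal: one magnitude FFT per frame.
    frames = trimmed.reshape(-1, frame_len)
    spectrum = np.abs(np.fft.rfft(frames, axis=1))
    # Group the frame-level spectrum so that segment j lines up with waveform segment j.
    spectrum_segments = spectrum.reshape(n_segments, frames_per_segment, -1)
    return speech_segments, spectrum_segments

rng = np.random.default_rng(0)
reference = rng.standard_normal(1100)
speech_segments, spectrum_segments = segment_reference(reference)
```

Each pair `(speech_segments[j], spectrum_segments[j])` would then be the input to the per-segment analysis that produces one voiceprint semantic feature vector.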
- Any one of the multiple voice data segments is denoted as a target voice data segment; when performing short-term correlation analysis on each voice data segment, based on that voice data segment and the corresponding reference speech spectrum feature segment, to obtain the voiceprint semantic feature vector corresponding to each voice data segment, the processing unit 1202 is specifically configured to:
- perform feature extraction on the target voice data segment to obtain a time domain feature map;
- perform feature extraction on the reference speech spectrum feature segment corresponding to the target voice data segment to obtain a frequency domain feature map;
- fuse the time domain feature map and the frequency domain feature map to generate the voiceprint semantic feature vector corresponding to the target voice data segment.
- The feature extraction is performed k times, where k is an integer greater than 1; any one of these is denoted as the i-th feature extraction; when fusing the time domain feature map and the frequency domain feature map to generate the voiceprint semantic feature vector corresponding to the target voice data segment, the processing unit 1202 is specifically configured to:
- when i = 1, fuse the intermediate time domain feature map and the intermediate frequency domain feature map obtained by the first feature extraction to generate the first intermediate feature vector;
- when 1 < i ≤ k, fuse the intermediate time domain feature map and the intermediate frequency domain feature map obtained by the i-th feature extraction with the (i-1)-th intermediate feature vector obtained after the (i-1)-th feature extraction to generate the i-th intermediate feature vector;
- generate the voiceprint semantic feature vector corresponding to the target voice data segment based on the k-th intermediate feature vector obtained when i = k.
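The recursion over the k feature extraction passes can be sketched as a simple loop in which each pass fuses its own time-domain and frequency-domain maps with the vector carried over from the previous pass. The fusion operator itself is not specified in the text; mean pooling, concatenation, and addition of the previous vector are assumptions chosen only to keep the example small, as are the function names.

```python
import numpy as np

def fuse(time_map, freq_map, previous=None):
    """Fuse one pass's (channels, length) maps into a vector, adding the
    previous intermediate vector when there is one."""
    fused = np.concatenate([time_map.mean(axis=-1), freq_map.mean(axis=-1)])
    return fused if previous is None else fused + previous

def progressive_fusion(time_maps, freq_maps):
    """Run the i = 1..k fusion chain and return the k-th intermediate vector."""
    if len(time_maps) < 2 or len(time_maps) != len(freq_maps):
        raise ValueError("need k > 1 passes with one time map and one frequency map each")
    intermediate = None
    for time_map, freq_map in zip(time_maps, freq_maps):
        intermediate = fuse(time_map, freq_map, intermediate)
    return intermediate

# Three passes with constant maps make the accumulation easy to follow.
time_maps = [np.ones((4, 5)) for _ in range(3)]
freq_maps = [2 * np.ones((4, 6)) for _ in range(3)]
final_vector = progressive_fusion(time_maps, freq_maps)
```

With these constant inputs each pass contributes 1 to the time-domain half and 2 to the frequency-domain half, so after three passes the halves hold 3 and 6 respectively.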
- The units of the speech processing device shown in FIG. 12 may be combined, individually or all together, into one or several other units, or one or more of the units may be further split into multiple functionally smaller units; either way the same operations can be achieved without affecting the technical effect of the embodiments of the present application.
- The above units are divided based on logical functions.
- The function of one unit may be realized by multiple units, or the functions of multiple units may be realized by one unit.
- The speech processing device may also include other units.
- These functions may also be implemented with the assistance of other units, and may be implemented by multiple units in collaboration.
- FIG. 13 shows a schematic structural diagram of a computer device provided by an exemplary embodiment of the present application.
- the computer device includes a processor 1301, a communication interface 1302, and a computer-readable storage medium 1303.
- the processor 1301, the communication interface 1302, and the computer-readable storage medium 1303 may be connected via a bus or other means.
- the communication interface 1302 is used to receive and send data.
- the computer-readable storage medium 1303 may be stored in a memory of the computer device, the computer-readable storage medium 1303 is used to store computer programs, and the processor 1301 is used to execute the computer programs stored in the computer-readable storage medium 1303.
- the processor 1301 (or CPU (Central Processing Unit)) is the computing core and control core of the computer device, which is suitable for implementing one or more computer programs, and is specifically suitable for loading and executing one or more computer programs to implement the corresponding method flow or corresponding function.
- the embodiment of the present application also provides a computer-readable storage medium (Memory), which is a memory device in a computer device for storing programs and data. It is understandable that the computer-readable storage medium here can include both built-in storage media in the computer device and, of course, extended storage media supported by the computer device.
- the computer-readable storage medium provides a storage space that stores the processing system of the computer device.
- one or more computer programs suitable for being loaded and executed by the processor 1301 are also stored in the storage space.
- the computer-readable storage medium here can be a high-speed RAM memory or a non-volatile memory, such as at least one disk storage; optionally, it can also be at least one computer-readable storage medium located away from the aforementioned processor.
- An embodiment of the present application also provides a computer program product, which includes a computer program.
- When the computer program is executed by a processor, the above-mentioned speech processing method is implemented.
- the computer program product includes a computer program (one or more).
- When the computer program is loaded and executed, the processes or functions described in the embodiments of the present application are carried out.
- the computer device may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable device.
- the computer program may be stored in a computer-readable storage medium or transmitted via a computer-readable storage medium.
- The computer program may be transmitted from one website, computer device, server, or data center to another website, computer device, server, or data center by wired means (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless means (e.g., infrared, radio, microwave).
- The computer-readable storage medium may be any available medium that a computer device can access, or a data storage device, such as a server or a data center, that integrates one or more available media.
- the available medium may be a magnetic medium (e.g., a floppy disk, a hard disk, a tape), an optical medium (e.g., a DVD), or a semiconductor medium (e.g., a solid-state drive (SSD)), etc.
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Human Computer Interaction (AREA)
- Multimedia (AREA)
- Acoustics & Sound (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Signal Processing (AREA)
- Computational Linguistics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Quality & Reliability (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Business, Economics & Management (AREA)
- Game Theory and Decision Science (AREA)
- Telephonic Communication Services (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Claims (15)
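- A speech processing method, characterized in that the method is performed by a computer device and comprises: acquiring aliased speech data, the aliased speech data including a speech signal generated by each of at least two objects; acquiring reference speech data of a designated object, the designated object being any one of the at least two objects, and the reference speech data including a reference speech signal of the designated object; extracting a voiceprint representation vector of the designated object from the reference speech data, the voiceprint representation vector being used to represent the voiceprint characteristics of the designated object; inputting the aliased speech data and the voiceprint representation vector into a preset speech segmentation model, the speech segmentation model being configured to segment, based on an attention mechanism, a target speech signal matching the voiceprint characteristics from the aliased speech data; and generating a voice file of the designated object based on the segmented target speech signal.
- The method according to claim 1, characterized in that the process of segmenting, based on the attention mechanism, the target speech signal matching the voiceprint characteristics from the aliased speech data comprises: converting the aliased speech data from the time domain to the frequency domain to obtain speech spectrum features corresponding to the aliased speech data, the speech spectrum features being the feature representation of the aliased speech data in the frequency domain; performing, based on the attention mechanism, correlation calculation on the voiceprint representation vector and the speech spectrum features to obtain a speech spectrum feature segment matching the voiceprint characteristics, the speech spectrum feature segment being the segment of the speech spectrum features that matches the voiceprint characteristics; and converting the speech spectrum feature segment from the frequency domain to the time domain to obtain the target speech signal matching the voiceprint characteristics.
- The method according to claim 1 or 2, characterized in that the correlation calculation is implemented by the speech segmentation model; the speech segmentation model includes a feature extraction subnetwork and an upsampling subnetwork, which are connected by a convolutional connection layer; the feature extraction subnetwork and the upsampling subnetwork are symmetrical; the feature extraction subnetwork contains m hierarchically distributed convolutional layers, and the upsampling subnetwork contains an upsampling layer corresponding to each of the m convolutional layers, m being a positive integer; the convolutional layers, the convolutional connection layer, and the upsampling layers each include multiple sequentially connected convolutional networks; wherein the attention mechanism is integrated into all or some of the network layers in the speech segmentation model, and the fusion position of the attention mechanism among the multiple convolutional networks included in a network layer is not fixed; the network layers include the convolutional layers, the upsampling layers, and the convolutional connection layer.
- The method according to any one of claims 1-3, characterized in that each network layer in the speech segmentation model is integrated with the attention mechanism; the performing, based on the attention mechanism, correlation calculation on the voiceprint representation vector and the speech spectrum features to obtain a speech spectrum feature segment matching the voiceprint characteristics comprises: inputting the voiceprint representation vector into each network layer in the speech segmentation model; performing, based on the attention mechanism integrated in each network layer, correlation calculation on the voiceprint representation vector and the first feature map of the corresponding network layer to obtain a second feature map output by the corresponding network layer, the voiceprint characteristics represented by the second feature map matching the voiceprint characteristics represented by the voiceprint representation vector; and using the second feature map output by the (2m+1)-th network layer in the speech segmentation model as the speech spectrum feature segment matching the voiceprint characteristics represented by the voiceprint representation vector, the (2m+1)-th network layer being the last upsampling layer in the upsampling subnetwork.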
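- The method according to any one of claims 1-4, characterized in that any network layer in the speech segmentation model that is integrated with the attention mechanism is denoted as a target network layer; the target network layer is a convolutional layer or the convolutional connection layer; the fusion position of the attention mechanism in the target network layer is the position between the first convolutional network and the adjacent second convolutional network among the multiple sequentially connected convolutional networks included in the target network layer; the performing, based on the attention mechanism integrated in each network layer, correlation calculation on the voiceprint representation vector and the first feature map of the corresponding network layer to obtain the second feature map output by the corresponding network layer comprises: performing feature extraction on the first feature map of the target network layer using the first convolutional network in the target network layer to obtain a third feature map of the target network layer; wherein, when the target network layer is the first of the hierarchically distributed convolutional layers in the feature extraction subnetwork, the first feature map of the target network layer is the speech spectrum features; when the target network layer is a convolutional layer other than the first convolutional layer in the speech segmentation model, the first feature map of the target network layer is obtained by pooling the feature map output by the adjacent upper-level network layer; performing, according to the attention mechanism integrated in the target network layer, correlation calculation on the voiceprint representation vector and the third feature map of the target network layer to obtain a fourth feature map of the target network layer, the feature dimension of the third feature map being the same as that of the fourth feature map; and performing feature extraction on the fourth feature map using the convolutional networks in the target network layer other than the first convolutional network to obtain the second feature map output by the target network layer.
- The method according to any one of claims 1-5, characterized in that any network layer in the speech segmentation model that is integrated with the attention mechanism is denoted as a target network layer; the target network layer is an upsampling layer; the fusion position of the attention mechanism in the target network layer is the position after the last of the multiple sequentially connected convolutional networks in the target network layer; the performing, based on the attention mechanism integrated in each network layer, correlation calculation on the voiceprint representation vector and the first feature map of the corresponding network layer to obtain the second feature map output by the corresponding network layer comprises: performing feature extraction on a target feature map using the multiple sequentially connected convolutional networks in the target network layer to obtain the first feature map of the target network layer, the target feature map being obtained by concatenating the feature map output by the convolutional layer that corresponds to the target network layer in the feature extraction subnetwork with the feature map output by the upper-level network layer of the target network layer; and performing, using the attention mechanism integrated in the target network layer, correlation calculation on the voiceprint representation vector and the first feature map of the target network layer to obtain the second feature map output by the target network layer, the feature dimension of the second feature map being the same as that of the first feature map.
- The method according to any one of claims 1-6, characterized in that, before the performing, based on the attention mechanism integrated in each network layer, correlation calculation on the voiceprint representation vector and the first feature map of the corresponding network layer to obtain the second feature map output by the corresponding network layer, the method further comprises: performing dimension transformation on the voiceprint representation vector to obtain a dimension-transformed voiceprint representation vector; wherein the feature dimension of the dimension-transformed voiceprint representation vector is the same as the feature dimension of the feature map to be input into the attention mechanism integrated in the corresponding network layer.
- The method according to any one of claims 1-7, characterized in that, if the number of network layers integrated with the attention mechanism in the speech segmentation model is greater than a number threshold, the method further comprises: performing model distillation on the speech segmentation model to obtain a distilled speech segmentation model; wherein the correlation calculation is implemented by the distilled speech segmentation model.
- The method according to any one of claims 1-8, characterized in that the extracting a voiceprint representation vector of the designated object from the reference speech data comprises: segmenting the reference speech data to obtain multiple speech data segments corresponding to the reference speech data; converting the reference speech data from the time domain to the frequency domain to obtain reference speech spectrum features corresponding to the reference speech data; segmenting the reference speech spectrum features to obtain a reference speech spectrum feature segment corresponding to each speech data segment; performing short-term correlation analysis on each speech data segment, based on that speech data segment and the corresponding reference speech spectrum feature segment, to obtain a voiceprint semantic feature vector corresponding to each speech data segment, the voiceprint semantic feature vector being used to represent the semantic characteristics of the speech data segment; and performing long-term correlation analysis on a voiceprint semantic feature vector sequence to obtain the voiceprint representation vector of the designated object, the voiceprint semantic feature vector sequence including the voiceprint semantic feature vector corresponding to each speech data segment.
- The method according to any one of claims 1-9, characterized in that any one of the multiple speech data segments is denoted as a target speech data segment; the performing short-term correlation analysis on each speech data segment, based on that speech data segment and the corresponding reference speech spectrum feature segment, to obtain the voiceprint semantic feature vector corresponding to each speech data segment comprises: performing feature extraction on the target speech data segment to obtain a time domain feature map; performing feature extraction on the reference speech spectrum feature segment corresponding to the target speech data segment to obtain a frequency domain feature map; and fusing the time domain feature map and the frequency domain feature map to generate the voiceprint semantic feature vector corresponding to the target speech data segment.
- The method according to any one of claims 1-10, characterized in that the feature extraction is performed k times, k being an integer greater than 1; any one of these is denoted as the i-th feature extraction; the fusing the time domain feature map and the frequency domain feature map to generate the voiceprint semantic feature vector corresponding to the target speech data segment comprises: when i = 1, fusing the intermediate time domain feature map and the intermediate frequency domain feature map obtained by the first feature extraction to generate a first intermediate feature vector after the first feature extraction; when 1 < i ≤ k, fusing the intermediate time domain feature map and the intermediate frequency domain feature map obtained by the i-th feature extraction with the (i-1)-th intermediate feature vector obtained by the (i-1)-th feature extraction to generate an i-th intermediate feature vector after the i-th feature extraction; and generating the voiceprint semantic feature vector corresponding to the target speech data segment based on the k-th intermediate feature vector obtained after the k-th feature extraction when i = k.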
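- A speech processing device, characterized in that the speech processing device is installed in a computer device and comprises: an acquisition unit, configured to acquire aliased speech data, the aliased speech data including a speech signal generated by each of at least two objects; the acquisition unit being further configured to acquire reference speech data of a designated object, the designated object being any one of the at least two objects, and the reference speech data including a reference speech signal of the designated object; a processing unit, configured to extract a voiceprint representation vector of the designated object from the reference speech data, the voiceprint representation vector being used to represent the voiceprint characteristics of the designated object; the processing unit being further configured to input the aliased speech data and the voiceprint representation vector into a preset speech segmentation model, the speech segmentation model being configured to segment, based on an attention mechanism, a target speech signal matching the voiceprint characteristics from the aliased speech data; and the processing unit being further configured to generate a voice file of the designated object based on the segmented target speech signal.
- A computer device, characterized by comprising: a processor adapted to execute a computer program; and a computer-readable storage medium storing a computer program which, when executed by the processor, implements the speech processing method according to any one of claims 1-11.
- A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program, the computer program being adapted to be loaded by a processor to execute the speech processing method according to any one of claims 1-11.
- A computer program product, characterized in that the computer program product includes a computer program which, when executed by a processor, implements the speech processing method according to any one of claims 1-11.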
Priority Applications (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| EP24822416.4A EP4636761A4 (en) | 2023-06-13 | 2024-04-25 | PROCESS AND APPARATUS FOR SPEECH PROCESSING, DEVICE, SUPPORT, AND PRODUCT-PROGRAM |
| US19/257,189 US20250329334A1 (en) | 2023-06-13 | 2025-07-01 | Speech processing method and apparatus, device, and medium |
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202310699993.5 | 2023-06-13 | ||
| CN202310699993.5A CN119132328A (zh) | 2023-06-13 | 2023-06-13 | 一种语音处理方法、装置、设备、介质及程序产品 |
Related Child Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| US19/257,189 Continuation US20250329334A1 (en) | 2023-06-13 | 2025-07-01 | Speech processing method and apparatus, device, and medium |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| WO2024255461A1 (zh) | 2024-12-19 |
| WO2024255461A9 (zh) | 2025-01-30 |
Family
ID=93748676
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/CN2024/089862 Ceased WO2024255461A1 (zh) | 2023-06-13 | 2024-04-25 | 一种语音处理方法、装置、设备、介质及程序产品 |
Country Status (4)
| Country | Link |
|---|---|
| US (1) | US20250329334A1 (zh) |
| EP (1) | EP4636761A4 (zh) |
| CN (1) | CN119132328A (zh) |
| WO (1) | WO2024255461A1 (zh) |
Cited By (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN120260546A (zh) * | 2025-06-03 | 2025-07-04 | 中国电子科技集团公司第二十八研究所 | 一种基于双模型动态触发的语音流切分方法 |
| CN120496557A (zh) * | 2025-03-05 | 2025-08-15 | 西安赛普特信息科技有限公司 | 一种双路轻量级时频域自适应神经网络模型及其使用方法 |
Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN108109619A (zh) * | 2017-11-15 | 2018-06-01 | 中国科学院自动化研究所 | 基于记忆和注意力模型的听觉选择方法和装置 |
| CN111429937A (zh) * | 2020-05-09 | 2020-07-17 | 北京声智科技有限公司 | 语音分离方法、模型训练方法及电子设备 |
| WO2022048239A1 (zh) * | 2020-09-04 | 2022-03-10 | 华为技术有限公司 | 音频的处理方法和装置 |
| CN115116448A (zh) * | 2022-08-29 | 2022-09-27 | 四川启睿克科技有限公司 | 语音提取方法、神经网络模型训练方法、装置及存储介质 |
| CN115376541A (zh) * | 2022-07-13 | 2022-11-22 | 平安科技(深圳)有限公司 | 基于语音数据的角色分离方法和装置、设备、介质 |
Family Cites Families (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN114329041B (zh) * | 2021-11-17 | 2025-06-10 | 腾讯科技(深圳)有限公司 | 一种多媒体数据处理方法、装置以及可读存储介质 |
2023
- 2023-06-13 CN CN202310699993.5A patent/CN119132328A/zh active Pending
2024
- 2024-04-25 WO PCT/CN2024/089862 patent/WO2024255461A1/zh not_active Ceased
- 2024-04-25 EP EP24822416.4A patent/EP4636761A4/en active Pending
2025
- 2025-07-01 US US19/257,189 patent/US20250329334A1/en active Pending
Non-Patent Citations (1)
| Title |
|---|
| See also references of EP4636761A4 * |
Also Published As
| Publication number | Publication date |
|---|---|
| EP4636761A1 (en) | 2025-10-22 |
| CN119132328A (zh) | 2024-12-13 |
| EP4636761A4 (en) | 2026-04-29 |
| WO2024255461A9 (zh) | 2025-01-30 |
| US20250329334A1 (en) | 2025-10-23 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| Czyzewski et al. | An audio-visual corpus for multimodal automatic speech recognition | |
| JP6855527B2 (ja) | 情報を出力するための方法、及び装置 | |
| WO2024140434A1 (zh) | 基于多模态知识图谱的文本分类方法、设备及存储介质 | |
| CN111883107B (zh) | 语音合成、特征提取模型训练方法、装置、介质及设备 | |
| CN110517689A (zh) | 一种语音数据处理方法、装置及存储介质 | |
| US20250329334A1 (en) | Speech processing method and apparatus, device, and medium | |
| CN112183107A (zh) | 音频的处理方法和装置 | |
| CN114329041B (zh) | 一种多媒体数据处理方法、装置以及可读存储介质 | |
| WO2024140430A9 (zh) | 基于多模态深度学习的文本分类方法、设备及存储介质 | |
| CN109474843A (zh) | 语音操控终端的方法、客户端、服务器 | |
| CN113763925B (zh) | 语音识别方法、装置、计算机设备及存储介质 | |
| CN113573161B (zh) | 多媒体数据处理方法、装置、设备及存储介质 | |
| CN113889081A (zh) | 语音识别方法、介质、装置和计算设备 | |
| CN119854545A (zh) | 一种基于深度学习的新闻智能播报系统及方法 | |
| Liu et al. | Anti-forensics of fake stereo audio using generative adversarial network | |
| CN111883105B (zh) | 用于视频场景的上下文信息预测模型的训练方法及系统 | |
| WO2024082928A1 (zh) | 语音处理方法、装置、设备和介质 | |
| CN121194017A (zh) | 高光短视频的剪辑方法、装置、存储介质以及电子设备 | |
| CN113407779B (zh) | 一种视频检测方法、设备及计算机可读存储介质 | |
| CN118802398A (zh) | 会议纪要生成方法、装置、存储介质及电子设备 | |
| CN118779486A (zh) | 一种基于文本提示词的语音内容检索方法、设备及计算机可读存储介质 | |
| CN118587625A (zh) | 一种视频文件的检测方法、装置及计算设备 | |
| CN117373463A (zh) | 用于语音处理的模型训练方法、设备、介质及程序产品 | |
| CN118411996B (zh) | 音色转换方法、装置、电子设备、存储介质和程序产品 | |
| CN120690233A (zh) | 音视频情绪标记方法、装置、电子设备、存储介质及产品 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 24822416; Country of ref document: EP; Kind code of ref document: A1 |
| | WWE | Wipo information: entry into national phase | Ref document number: 2024822416; Country of ref document: EP |
| | ENP | Entry into the national phase | Ref document number: 2024822416; Country of ref document: EP; Effective date: 20250715 |
| | WWP | Wipo information: published in national office | Ref document number: 2024822416; Country of ref document: EP |
| | NENP | Non-entry into the national phase | Ref country code: DE |