WO2024255461A1 - Speech processing method, apparatus, device, medium, and program product - Google Patents

Speech processing method, apparatus, device, medium, and program product

Info

Publication number
WO2024255461A1
Authority
WO
WIPO (PCT)
Prior art keywords
speech
voiceprint
feature
network layer
feature map
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/CN2024/089862
Other languages
English (en)
French (fr)
Other versions
WO2024255461A9 (zh)
Inventor
冯鑫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to EP24822416.4A (EP4636761A4)
Publication of WO2024255461A1
Publication of WO2024255461A9
Priority to US19/257,189 (US20250329334A1)

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272 Voice signal separating
    • G10L21/028 Voice signal separating using properties of sound source
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/06 Decision making techniques; Pattern matching strategies
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/18 Artificial neural networks; Connectionist approaches
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272 Voice signal separating
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L2021/02087 Noise filtering the noise being separate speech, e.g. cocktail party

Definitions

  • the present application relates to the field of computer technology, in particular to the field of artificial intelligence, and specifically to a speech processing method, a speech processing device, a computer device, a computer-readable storage medium, and a computer program product.
  • Aliased voice data is voice data that is a mixture of voice signals generated by multiple sound sources (i.e., objects that generate sound).
  • the aliased voice data recorded by a recording device from a physical environment may include voice signals generated by multiple participants, and may also include voice signals generated by certain devices in the physical environment (such as devices that play conference videos).
  • Existing source separation methods for aliased speech data include: (1) separating the aliased speech data by ear, a manual listening approach that makes segmentation slow and inefficient; (2) separating by timbre frequency, which cannot segment accurately when several objects have similar timbre frequencies; (3) separating by sound-source distance, which only works when the sound sources are at different distances; and (4) using a speech segmentation model dedicated to one specified object, which is not portable and cannot be applied generally.
  • The embodiments of the present application provide a speech processing method, apparatus, device, medium and program product, which can separate the pure speech signal of any specified object from mixed speech data and are generally applicable.
  • An embodiment of the present application provides a speech processing method, which is executed by a computer device, and the method includes:
  • acquiring aliased speech data, wherein the aliased speech data includes a speech signal generated by each of at least two objects;
  • acquiring reference speech data of a specified object, wherein the specified object is any one of the at least two objects, and the reference speech data includes a reference speech signal of the specified object;
  • extracting a voiceprint representation vector of the specified object from the reference speech data, wherein the voiceprint representation vector is used to represent a voiceprint characteristic of the specified object;
  • inputting the aliased speech data and the voiceprint representation vector into a preset speech segmentation model, wherein the speech segmentation model is used to segment, based on an attention mechanism, a target speech signal matching the voiceprint characteristic from the aliased speech data;
  • generating a voice file of the specified object based on the segmented target speech signal.
  • an embodiment of the present application provides a speech processing device, the device comprising:
  • An acquisition unit configured to acquire aliased speech data, wherein the aliased speech data includes a speech signal generated by each of at least two objects;
  • the acquisition unit is further used to acquire reference speech data of a specified object;
  • the specified object refers to any one of at least two objects;
  • the reference speech data includes a reference speech signal of the specified object;
  • a processing unit configured to extract a voiceprint representation vector of a specified object from the reference speech data, wherein the voiceprint representation vector is used to represent a voiceprint characteristic of the specified object;
  • the processing unit is further used to input the aliased speech data and the voiceprint representation vector into a preset speech segmentation model, wherein the speech segmentation model is used to: segment a target speech signal matching the voiceprint characteristics from the aliased speech data based on an attention mechanism;
  • the processing unit is further used to generate a voice file of a designated object based on the segmented target voice signal.
  • an embodiment of the present application provides a computer device, the computer device comprising:
  • a processor and a computer-readable storage medium, wherein the computer-readable storage medium stores a computer program, and when the computer program is executed by the processor, the above-mentioned speech processing method is implemented.
  • an embodiment of the present application provides a computer-readable storage medium, which stores a computer program, and the computer program is suitable for being loaded by a processor and executing the above-mentioned speech processing method.
  • an embodiment of the present application provides a computer program product, which includes a computer program; when the computer program is executed by a processor, the above-mentioned speech processing method is implemented.
  • aliased speech data to be segmented is obtained, and the aliased speech data contains speech signals generated by each of at least two objects; if there is a need to segment the speech signal generated by a specified object among the at least two objects, a section of reference speech data of the specified object (such as a few seconds of speech generated by the specified object) can be obtained; the specified object can be any one of the at least two objects.
  • a voiceprint representation vector of the specified object is extracted from the reference speech data, and the voiceprint representation vector can represent the voiceprint characteristics of the specified object, and the voiceprint characteristics are unique and can represent the identity of the specified object.
  • the voiceprint representation vector that can uniquely represent the identity of the specified object and the aliased speech data to be segmented can be input into a preset speech segmentation model, so that the speech segmentation model can segment a target speech signal that matches the voiceprint characteristics of the specified object from the aliased speech data based on the attention mechanism, thereby generating a separate speech file for the specified object based on the segmented target speech signal.
  • the embodiments of the present application support extracting a voiceprint representation vector that represents the voiceprint characteristics of a specified object from the pure reference voice data of the specified object, and using the voiceprint representation vector as a reference, and utilizing the attention mechanism provided by the voice segmentation model to clearly and accurately calculate and extract the target voice signal of the specified object from the aliased voice data, thereby improving the extraction purity of the target voice signal and achieving a more accurate voice separation effect.
  • The embodiments of the present application only need the reference voice data of the specified object in order to segment the target voice signal of the specified object from the aliased voice data; to obtain the voice data of another object, only the voiceprint representation vector needs to be replaced with that of the object to be segmented, and there is no need to train a dedicated network for each object, which greatly improves convenience and portability and improves the versatility of this solution.
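  • The flow summarized above (extract a voiceprint from reference speech, pass it with the mixture to one shared segmentation step, swap only the voiceprint to target another object) can be sketched as follows. This is a minimal illustration, not the models of the present application: `separate_specified_object`, `toy_extract_voiceprint` and `toy_segment` are hypothetical names, and the toy stand-ins use a naive DFT mask on pure tones in place of the trained voiceprint vector extraction model and the attention-based speech segmentation model.

```python
import cmath
import math
from typing import Callable, List

Signal = List[float]
Voiceprint = List[float]


def separate_specified_object(
    mixture: Signal,
    reference: Signal,
    extract_voiceprint: Callable[[Signal], Voiceprint],
    segment: Callable[[Signal, Voiceprint], Signal],
) -> Signal:
    """Extract the voiceprint from the reference, then segment the mixture with it."""
    voiceprint = extract_voiceprint(reference)
    return segment(mixture, voiceprint)


def dft(x: Signal) -> List[complex]:
    """Naive discrete Fourier transform (fine for tiny toy inputs)."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum: List[complex]) -> Signal:
    """Naive inverse DFT, returning the real part of each sample."""
    n = len(spectrum)
    return [
        (sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n).real
        for t in range(n)
    ]


def toy_extract_voiceprint(reference: Signal) -> Voiceprint:
    """ASSUMPTION: stand-in 'voiceprint' = normalized magnitude spectrum."""
    mags = [abs(c) for c in dft(reference)]
    total = sum(mags) or 1.0
    return [m / total for m in mags]


def toy_segment(mixture: Signal, voiceprint: Voiceprint) -> Signal:
    """ASSUMPTION: stand-in segmentation = keep the bins the voiceprint emphasizes.

    Requires the mixture and the reference to have the same length.
    """
    peak = max(voiceprint)
    mask = [1.0 if v > 0.5 * peak else 0.0 for v in voiceprint]
    return idft([c * m for c, m in zip(dft(mixture), mask)])


# Two toy "objects": pure tones at different frequencies, mixed additively.
n = 32
tone_a = [math.sin(2 * math.pi * 3 * t / n) for t in range(n)]
tone_b = [math.sin(2 * math.pi * 7 * t / n) for t in range(n)]
mixture = [a + b for a, b in zip(tone_a, tone_b)]

# The same segmentation step is reused; only the reference (voiceprint) changes.
out_a = separate_specified_object(mixture, tone_a, toy_extract_voiceprint, toy_segment)
out_b = separate_specified_object(mixture, tone_b, toy_extract_voiceprint, toy_segment)
```

  • The point of the sketch is the interface: the segmentation step takes the voiceprint as an input rather than being trained per object, so targeting a different object only requires a different reference recording.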
  • FIG. 1 is a schematic diagram of the architecture of a speech processing system provided by an exemplary embodiment of the present application.
  • FIG. 2 is a schematic diagram of the architecture of a speech processing scenario provided by an exemplary embodiment of the present application.
  • FIG. 3 is a flow chart of a speech processing method provided by an exemplary embodiment of the present application.
  • FIG. 4 is a schematic diagram of an interface for a user to input reference voice data of a specified object provided by an exemplary embodiment of the present application.
  • FIG. 5 is a schematic diagram of the structure of an existing Unet network.
  • FIG. 6 is a schematic diagram of the structure of a speech segmentation model constructed by adding an Attention mechanism to each network layer in a Unet network, provided by an exemplary embodiment of the present application.
  • FIG. 7a is a schematic diagram of speech segmentation when a target network layer is a convolutional layer or a convolutional connection layer provided by an exemplary embodiment of the present application.
  • FIG. 7b is a schematic diagram of speech segmentation when a target network layer is an upsampling layer provided by an exemplary embodiment of the present application.
  • FIG. 8 is a flow chart of another speech processing method provided by an exemplary embodiment of the present application.
  • FIG. 9 is a schematic diagram of a process of extracting a voiceprint vector provided by an exemplary embodiment of the present application.
  • FIG. 10 is a schematic structural diagram of an improved PANNS provided by an exemplary embodiment of the present application.
  • FIG. 11 is a schematic diagram of the structure of a transformer network provided by an exemplary embodiment of the present application.
  • FIG. 12 is a schematic diagram of the structure of a speech processing device provided by an exemplary embodiment of the present application.
  • FIG. 13 is a schematic diagram of the structure of a computer device provided by an exemplary embodiment of the present application.
  • A speech processing scheme is provided; specifically, a speech separation scheme for performing source separation on aliased speech data.
  • Aliased speech data, also called aliased speech or a mixed audio signal, is audio in which multiple speech signals (also called audio signals) are mixed; that is, "aliasing" here means that multiple speech signals are mixed together.
  • the aliased speech data can be understood as: speech data containing speech signals generated by multiple sound sources, which is collected directly from the environment by a sound receiving device (such as a microphone).
  • multiple speech signals can be generated by different objects (or called sound sources), and the objects here can include but are not limited to: humans, animals or physical devices (such as cars), etc.; the embodiment of the present application does not limit the sources of the multiple speech signals contained in the aliased speech data.
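  • As a compact way to state the definition above, the aliased speech data can be written as a sum of per-object signals. The additive form is an assumption made here for illustration (the text only says the signals are mixed), and the symbols x, s_i and N are introduced for this note only:

```latex
x(t) = \sum_{i=1}^{N} s_i(t), \qquad N \ge 2
```

  • Here x(t) is the aliased speech data, s_i(t) is the speech signal generated by the i-th object, and source separation for a specified object k amounts to recovering s_k(t) from x(t).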
  • For example, in a conference scenario, the collected voice data usually includes voice signals generated by different participants; if the conference scenario also includes a device for playing audio and video, the collected voice data also includes voice signals emitted by that device. The voice data collected in the conference scenario can therefore be called aliased voice data, which includes voice signals generated by multiple objects in the conference scenario.
  • source separation refers to the process of separating the voice signal of a specified object from the aliased voice data.
  • source separation can be simply understood as the technology of separating the aliased voice data through signal processing or other algorithms to segment the target voice signal of the specified object from the aliased voice data, and finally generating a separate audio file (or voice file) of the specified object.
  • the target voice signal generated by a specified object can be extracted from the aliased voice data through the source separation technology to generate a voice file of the specified object; in this way, when the voice file is played, only the voice generated by the specified object exists, so as to achieve the purpose of identifying the voice generated by a specified object.
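  • The last step described above, turning the separated target voice signal into a standalone voice file, can be sketched with Python's standard `wave` module. This is a minimal sketch under stated assumptions: the function name `write_voice_file`, the 16 kHz default sample rate, the mono 16-bit PCM format and the [-1, 1] float sample range are all choices made for this example; the source does not specify a file format.

```python
import io
import struct
import wave
from typing import List


def write_voice_file(samples: List[float], sample_rate: int = 16000) -> bytes:
    """Encode float samples in [-1, 1] as a mono 16-bit PCM WAV file (in memory)."""
    # Clip to the valid range so out-of-range samples cannot overflow int16.
    clipped = [max(-1.0, min(1.0, s)) for s in samples]
    pcm = struct.pack(
        "<%dh" % len(clipped),
        *(int(round(s * 32767)) for s in clipped),
    )
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)        # mono
        wav.setsampwidth(2)        # 2 bytes = 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return buffer.getvalue()


# Hypothetical separated signal for the specified object.
voice_file_bytes = write_voice_file([0.0, 0.5, -0.5, 1.0])
```

  • The returned bytes can be saved to disk or returned to a client; when played, the file contains only the samples that were passed in, which in the scheme above would be the target voice signal of the specified object.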
  • The embodiment of the present application provides a new voice processing solution, which mainly includes: obtaining aliased voice data to be segmented, wherein the aliased voice data includes a voice signal generated by each of at least two objects (for example, a voice signal generated by object 1 and a voice signal generated by object 2); if the user wants to extract from the aliased voice data the target voice signal generated by a specified object (any one of the at least two objects), reference voice data containing a reference voice signal of the specified object can be obtained.
  • a voiceprint representation vector of the specified object can be extracted based on the reference voice data, and the voiceprint representation vector can represent the voiceprint characteristics of the specified object.
  • the voiceprint characteristics can be understood as the sound characteristics of the specified object, such as the unique pitch or timbre of the specified object.
  • the voiceprint representation vector of the specified object and the aliased voice data are input into the voice segmentation model, and the attention mechanism in the voice segmentation model can be used to segment and extract the target voice signal matching the voiceprint characteristics of the specified object from the aliased voice data, thereby generating a separate voice file for the specified object based on the target voice signal.
  • The embodiment of the present application relies on the uniqueness of each user's voiceprint characteristics. Only a section of reference voice data of any specified object needs to be provided to extract the voiceprint representation vector that characterizes the voiceprint characteristics of that object, and the target voice signal of the specified object can then be separated and extracted from the aliased voice data based on the voiceprint representation vector. This not only achieves accurate separation of the target voice signal of the specified object from the aliased voice data, but also allows source separation for the object to which any voice signal in the aliased voice data belongs; the solution is therefore highly reusable and portable, reduces the complexity of user input operations, and makes the entire system more universal.
  • the embodiment of the present application calculates the voiceprint characteristics and aliased voice data of the specified object based on the attention mechanism, which greatly improves the clarity and accuracy of extracting the target voice signal of the specified object from the aliased voice data, avoids too much noise in the extracted target voice signal, and achieves a more accurate and pure voice separation effect.
  • the embodiment of the present application mainly implements the speech processing solution through a reusable designated speaker speech segmentation system based on voiceprint vector embedding, that is, the system deploys the speech processing solution provided by the embodiment of the present application; in this way, when any user has the need to separate the speech signal of the aliased speech data, the system can be called to automatically separate and extract the speech file corresponding to the designated object from the aliased speech data.
  • the exemplary architecture diagram of the system can be seen in Figure 1; as shown in Figure 1, the system mainly includes two modules, namely: a voiceprint vector extraction model and a speech segmentation model; the following is a brief introduction to these two modules, wherein:
  • Voiceprint vector extraction model, which can also be called a voiceprint vector extractor or a voiceprint recognition network.
  • The voiceprint vector extraction model is mainly used to identify the identity of the specified object to be segmented and extract the identity semantic vector of the specified object; in the embodiments of the present application, this identity semantic vector is called a voiceprint representation vector (or simply a voiceprint vector), and it is used to characterize the voiceprint characteristics of the specified object.
  • The voiceprint vector extraction model is constructed based on an improved Pretrained Audio Neural Networks (PANNS) network and a Transformer network.
  • the voiceprint vector extraction model is obtained by fully training with an open source large-scale speaker data set (a data set containing rich voice data).
  • the trained voiceprint vector extraction model has the ability to fully express the voiceprint characteristics of the object; thus, the voiceprint vector extraction model can be used as a voiceprint vector extractor for the entire system; in the inference stage, after loading the model parameters trained in advance with a large-scale speaker data set, the trained voiceprint vector extraction model can be used to calculate the voiceprint representation vector of the specified object for the reference voice data of the specified object (such as a short segment (such as a few seconds or more than ten seconds) of speech), and the voiceprint representation vector is used to represent the voiceprint characteristics of the specified object.
  • the improved PANNS in the voiceprint vector extraction model is an improvement on the traditional PANNS; the improvement is mainly reflected in: designing the information exchange link between the time domain link and the frequency domain link, so that there are multiple information exchanges in the time domain and frequency domain during the voiceprint representation vector extraction process, thereby achieving information complementarity between the time domain and the frequency domain, enabling the high-level network to fully perceive the underlying network information and improve the accuracy of voiceprint vector extraction.
  • PANNS is an audio neural network trained based on a large audio data set (including speech data of a large number of speakers); it is usually used for audio pattern recognition or audio frame-level vectorization (embedding) as the encoding network at the front end of the model.
  • The Transformer network is a transduction model that relies on the attention mechanism to relate input and output; it abandons the convolutional model structure and achieves better performance using only the attention mechanism and a feed-forward neural network (Feed Forward Neural Network), without the need for a sequence-aligned recurrent architecture.
  • a speech segmentation model which can be called a semantic segmentation network or a segmentation network.
  • The speech segmentation model is mainly used to receive the voiceprint characteristics of a specified object (specifically, the voiceprint representation vector) output by the voiceprint vector extraction model and, based on the voiceprint representation vector, to extract from the aliased speech data a speech signal that matches the voiceprint characteristics of the specified object using an attention mechanism.
  • the speech segmentation model is a segmentation model that integrates the attention mechanism.
  • the target speech signal related to the voiceprint feature of the specified object can be calculated in combination with the attention mechanism during the feature processing of the segmentation network for the aliased speech data, and the target speech signal of the specified object can be segmented from the aliased speech data; in this way, the target speech signal of the specified object in the aliased speech data can be calculated and extracted more clearly and accurately, and the pure target speech signal of the specified object can be separated, so as to achieve a more accurate and pure speech separation effect.
  • The attention mechanism is a technique modeled on human attention; in short, it imitates the way humans quickly pick out the information they want to focus on from a large amount of information. It is mainly used to solve the problem that a reasonable vector representation is difficult to obtain when the input sequence of a time series model is long.
  • The approach is to retain the intermediate results of the time series model, learn from them with a new model, and associate them with the output, thereby achieving information screening.
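  • To make the information-screening idea concrete, the sketch below uses a voiceprint vector as the query and scores it against per-frame features of the mixture, so that frames resembling the voiceprint receive higher weights. The scaled dot-product form, the function names `softmax` and `attend`, and the toy two-dimensional features are assumptions made for illustration; this section does not give the attention equations used inside the speech segmentation model.

```python
import math
from typing import List, Tuple


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(
    query: List[float],
    keys: List[List[float]],
    values: List[List[float]],
) -> Tuple[List[float], List[float]]:
    """Scaled dot-product attention for a single query.

    Returns the attention weights (one per key) and the weighted sum of values.
    """
    d = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
        for key in keys
    ]
    weights = softmax(scores)
    dim = len(values[0])
    context = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, context


# Hypothetical inputs: a voiceprint query and three frame features of the mixture.
voiceprint = [1.0, 0.0]
frame_keys = [[4.0, 0.0], [0.0, 4.0], [4.0, 0.0]]   # frames 0 and 2 resemble the voiceprint
frame_values = [[1.0], [0.0], [1.0]]
weights, context = attend(voiceprint, frame_keys, frame_values)
```

  • In this toy input, frames 0 and 2 score far higher than frame 1, so the weighted output is dominated by the frames that match the query; that is the screening behavior the paragraph above describes.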
  • the system provided in the embodiment of the present application includes two modules.
  • After the voiceprint vector extraction model extracts, from the reference voice data of the specified object, the voiceprint representation vector that characterizes the voiceprint characteristics of the specified object, the voiceprint representation vector can be embedded into the voice segmentation model; in this way, the voice segmentation model can extract and separate the voice signal that matches the voiceprint characteristics from the aliased voice data based on the attention mechanism, and achieve a better signal separation effect.
  • The system provided in the embodiment of the present application is a fully automatic segmentation system constructed based on multiple deep learning neural networks (such as an improved PANNS network, a Transformer network, and a segmentation network integrated with an attention mechanism); the user only needs to input the reference voice data of the specified object and the aliased voice data to be segmented, and the system can automatically and quickly extract the voice signal of the specified object from the aliased voice data, greatly improving the efficiency of voice segmentation, removing the need for manual participation, and enabling rapid standardization.
  • the speech separation model in the system can be reused; where reusability means that each time the system performs source separation, it only needs to replace the voiceprint characteristics of the extracted object, and there is no need to train a separate segmentation network for each object.
  • the network can be easily migrated, making the entire system highly versatile.
  • the system shown in FIG1 can be deployed in a computer device, specifically in an application running in the computer device (such as deployed in the application in the form of a plug-in); that is, the solution is provided by the application running in the computer device.
  • the application can refer to a computer program for completing one or more specific tasks.
  • The same application can be classified into different types along different dimensions.
  • the application may include but is not limited to: a client installed in the terminal, a small program that can be used without downloading and installing (as a subprogram of the client), a web (World Wide Web) application opened through a browser, etc.
  • the application may include but is not limited to: IM (Instant Messaging) application, content interaction application, audio application or video application, etc.
  • the instant messaging application refers to an application for instant communication of messages and social interaction based on the Internet.
  • the instant messaging application may include but is not limited to: a social application with communication functions, a map application with social interaction functions, a game application, etc.
  • Content interaction applications refer to applications that can realize content interaction, such as online banking, sharing platforms, personal space, news and other applications.
  • Audio applications refer to applications that realize audio functions based on the Internet. Audio applications may include but are not limited to: music applications with music playback and editing capabilities, radio applications with radio playback capabilities, or live broadcast applications with live broadcast capabilities, etc.
  • Video applications refer to applications that can play video. Video applications may include but are not limited to: short-video applications (whose videos typically last a few seconds or minutes), long-video applications (such as those offering movies or TV series with long playback times), etc.
  • Computer equipment may include terminals and/or servers.
  • Terminals may include but are not limited to: smart phones (such as smart phones running the Android system or iOS), tablet computers, portable personal computers, mobile Internet devices (MID), vehicle-mounted devices, head-mounted devices, smart TVs or smart home devices, etc.
  • The embodiments of this application do not limit the type of terminal.
  • the terminal is deployed with the system shown in Figure 1 or an application (or plug-in) providing the system, etc.
  • the server can be an independent physical server, or a server cluster or distributed system composed of multiple physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, content delivery networks (CDN), and big data and artificial intelligence platforms.
  • A schematic diagram of the system architecture of an exemplary voice processing solution jointly executed by a terminal and a server is shown in FIG. 2; as shown in FIG. 2, terminal 201 is a device held by a user with a voice separation requirement.
  • Server 202 can first process the reference voice data through the voiceprint vector extraction model in the system to obtain a voiceprint representation vector representing the identity of the specified object, and then embed the voiceprint representation vector into the voice segmentation model in the system; after receiving the voiceprint representation vector of the specified object and the aliased voice data to be segmented, the voice segmentation model can extract, based on the attention mechanism, a pure target voice signal that matches the voiceprint characteristics represented by the voiceprint representation vector from the aliased voice data, thereby generating a voice file of the specified object based on the target voice signal. Server 202 then returns the voice file of the specified object to terminal 201, so that the user can play, through terminal 201, the voice file containing only the voice data of the specified object.
  • The above process of the voice processing scheme is briefly introduced by taking a computer device that comprises both a terminal and a server as an example; when the computer device is only a terminal or only a server, the idea of the computer device executing the voice processing scheme is similar to the process described above, except that the execution subject is different, and is not repeated here.
  • the terminal 201 and the server 202 shown in FIG2 can be directly or indirectly connected by wired or wireless communication, and this application does not limit this.
  • the speech processing solution provided in the embodiment of the present application can be applied to any application scenario with speech separation requirements; the computer device providing the solution may differ depending on the application scenario, and this application does not limit this.
  • the application scenario may include but is not limited to at least one of the following: film and television drama scenario, audio and video creation scenario, and conversation scenario.
  • the application scenario is a film or TV drama scenario.
  • the film or TV drama scenario is a dubbing scenario for a character in the film or TV drama.
  • voice actors are often required to dub a certain role in the film and television drama (for example, after the aliased voice data captured by sound recording is submitted for review, some of the lines do not meet the regulations and need to be re-recorded).
  • the voice data captured by sound recording during the filming process or post-production process of the film and television drama is usually aliased voice data containing multiple voice signals.
  • the application scenario is an audio and video creation scenario.
  • the audio and video creation scenario is a secondary creation scenario for audio and video (i.e., re-creation for existing audio and video).
  • in the secondary creation scenario, users like to extract some lines of a specified actor from multiple audio and video works for line dialogue editing, that is, to edit the voice data of the specified actor from different audio and video works into the same audio and video work.
  • the application scenario is a conversation scenario.
  • the conversation scenario is an online conference scenario.
  • in online conference scenarios, there is often a need to transcribe speech into text, that is, to convert the recorded voice data into text form; however, when multiple people participate in an online conference, transcribing mixed voice data that contains the voice signals of multiple people has always been a difficult problem; here, transcription refers to the process of converting the voice signal of a specified person among the multiple people into text.
  • the voice signal of each object participating in the online conference can be segmented and extracted from the mixed voice data, and then the voice signal of each object can be input into the voice recognition system to achieve text transcription, which can greatly improve the accuracy of the mixed voice transcription of the conversation.
  • the collection and processing of relevant data in the embodiments of this application should be strictly in accordance with the requirements of relevant laws and regulations.
  • the acquisition of personal information requires the knowledge or consent of the individual subject (or the legal basis for obtaining the information), and the subsequent use and processing of data shall be carried out within the scope of authorization of laws and regulations and the subject of personal information.
  • when the embodiments of this application are applied to specific products or technologies, such as when obtaining reference voice data of a specified object, the permission or consent of the specified object shall be obtained, and the collection, use and processing of relevant data (such as the collection and release of bullet comments posted by the object, etc.) shall comply with the relevant laws, regulations and standards of the relevant regions.
  • the embodiment of the present application proposes a more detailed speech processing method.
  • the speech processing method proposed in the embodiment of the present application will be described in detail below in conjunction with the accompanying drawings.
  • FIG3 shows a flow chart of a speech processing method provided by an exemplary embodiment of the present application; the speech processing method may be executed by a computer device in the aforementioned system, such as a terminal and/or a server; the speech processing method may include but is not limited to steps S301-S304:
  • S301 Acquire aliased speech data to be segmented.
  • S302 Acquire reference speech data of a designated object among at least two objects.
  • the aliased voice data to be segmented includes the voice signals generated by each of at least two objects.
  • for example, a piece of heavy metal music includes the "lyrics" voice signal generated by the "singer", the "melody" voice signal generated by the "guitar", and the "melody" voice signal generated by the "drum set", etc.; the piece of heavy metal music is therefore aliased voice data, the objects included in the aliased voice data are the singer, the guitar and the drum set, and the voice signals included in the aliased voice data are the voice signal generated by the "singer", the voice signal generated by the "guitar", and the voice signal generated by the "drum set".
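To make the notion of aliased voice data concrete, the following minimal sketch mixes three synthetic sources into one signal. The sample rate, the frequencies and the sine tones standing in for the singer, guitar and drum set are illustrative assumptions, not values from the application.

```python
import numpy as np

SAMPLE_RATE = 8000  # assumed sample rate in Hz, for illustration only


def tone(freq_hz: float, seconds: float = 1.0) -> np.ndarray:
    """Return a sine tone standing in for the signal of one object."""
    t = np.arange(int(SAMPLE_RATE * seconds)) / SAMPLE_RATE
    return np.sin(2.0 * np.pi * freq_hz * t)


# Hypothetical stand-ins for the three objects in the example.
sources = {
    "singer": tone(220.0),
    "guitar": tone(440.0),
    "drum_set": tone(110.0),
}

# The aliased (mixed) data is the sample-wise sum of all object signals.
aliased = np.sum(list(sources.values()), axis=0)
```

Separating the signal of one designated object means recovering one entry of `sources` when only `aliased` and a reference recording of that object are available.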
  • the reference voice data of the certain object can be obtained.
  • the certain object is called a designated object; the reference voice data of the designated object is not the same as the aliased voice data, but the reference voice data includes a pure reference voice signal of the specified object; in this way, the reference voice signal contained in the reference voice data of the specified object can be used as a reference signal to separate the target voice signal of the specified object from at least two voice signals corresponding to at least two objects contained in the aliased voice data.
  • the specified object can be any object of the at least two objects contained in the aliased voice data from which the user wants to extract the voice signal; as can be seen from the foregoing description, the object can refer to a human, an animal or a physical device; for ease of explanation, the object type of the specified object is taken as an example for introduction, and this is specially explained here.
  • the reference voice data refers to a segment of voice data containing the reference voice signal of the specified object.
  • the reference voice data should be a segment of relatively pure voice data containing the specified object.
  • the reference voice data only contains the reference voice signal of the specified object; for another example, the reference voice data also contains the reference voice signal of the specified object and other voice signals, but it is necessary to ensure that the reference voice signal of the specified object can be easily extracted from the reference voice data mixed with other voice signals (such as the signal frequency of other voice signals is low, while the signal frequency of the reference voice signal of the specified object is relatively high, etc.), which is conducive to analyzing the pure reference voice data to extract the relatively accurate voiceprint characteristics of the specified object.
  • the embodiment of the present application does not limit the type, duration and source of the reference voice data.
  • the type of reference voice data may include but is not limited to: a segment of audio generated by the specified object reading an article, a segment of audio generated by the specified object speaking, or a segment of audio generated by the specified object singing, etc.
  • the duration of the reference voice data may be a few seconds or more than ten seconds, etc.
  • the source of the reference voice data may include but is not limited to: when the designated object and the user with the voice separation requirement are different users, the reference voice data may be sent to the user by the designated object, or downloaded or recorded by the user through certain channels (such as historical voice information); when the designated object and the user with the voice separation requirement are the same user, the reference voice data may be input in real time by the designated object, that is, collected in real time through a microphone deployed in the terminal held by the user.
  • an exemplary interface schematic diagram for inputting reference voice data of a specified object by a user is shown in FIG. 4; as shown in FIG. 4, a voice acquisition interface 401 is displayed on the terminal screen of a terminal held by a user, and the voice acquisition interface 401 includes an acquisition area 402 for reference voice data.
  • at least two voice acquisition entrances can be displayed in the acquisition area 402, such as a collection entrance 4021 and an upload entrance 4022.
  • when the collection entrance 4021 is triggered, it means that the user wants to input the reference voice data of the specified object (the specified object is the user, or the specified object and the user are in the same physical environment) by real-time collection; the microphone of the terminal is then turned on, so that the reference voice signal in the physical environment of the user can be collected in real time to generate reference voice data.
  • when the upload entrance 4022 is triggered, it means that the user wants to input the reference voice data of the specified object by uploading a file; the user can then upload the reference voice data of the specified object from a storage space (such as the local storage space of the terminal, the cloud storage space or the server storage space, etc.).
  • the interface elements (such as the interface content contained in the interface) and interface styles of the voice acquisition interface are not limited to those shown in Figure 4.
  • the upload entry of the aliased voice data can also be displayed in the voice acquisition interface, through which the user can replace the aliased voice data to be segmented.
  • a text conversion control (or component, button, option, etc.) can also be added to the voice acquisition interface, so that the user can trigger the text conversion control before or after voice separation to convert the separated voice signal into text form with one click, shorten the text conversion path to a certain extent, and thus improve the text conversion efficiency.
  • S303 extracting a voiceprint representation vector of a specified object from the reference speech data, and inputting the aliased speech data and the voiceprint representation vector into a preset speech segmentation model, wherein the speech segmentation model is used to segment a speech signal matching the voiceprint characteristics of the specified object from the aliased speech data based on an attention mechanism.
  • S304 Generate a voice file of a designated object based on the segmented voice signal.
  • voiceprint is the sound wave spectrum that carries voice information, and is a biometric feature composed of multiple characteristic dimensions such as wavelength, frequency, and intensity.
  • Voiceprint has the characteristics of stability, measurability, and uniqueness, and can be used to uniquely identify the voice characteristics of an object, that is, voiceprint can be used to characterize the identity of an object. Therefore, after obtaining relatively pure reference voice data of a specified object, the embodiment of the present application supports extracting the voiceprint characteristics of the specified object from the reference voice data, so as to facilitate the subsequent extraction of the voice signal based on the unique voiceprint characteristics.
  • the voiceprint vector extraction model outputs the voiceprint characterization vector (or simply referred to as voiceprint vector) of the specified object, that is, the voiceprint vector extraction model analyzes the reference voice data to obtain a voiceprint characterization vector that can be used to characterize the voiceprint characteristics of the specified object.
  • the voiceprint vector extraction model can innovatively use vector embedding to transmit the voiceprint information characterization, and input the voiceprint characterization vector into the voice segmentation model to participate in the calculation of the attention mechanism, so as to segment the target voice signal that matches the voiceprint characteristics of the specified object from the aliased voice data.
  • This innovative vector embedding mechanism enables the voice segmentation model to be independent of any object's historical voice data for additional training, and only needs to extract the voiceprint characterization vector from a small amount of reference voice data, which can get rid of the dependence on large-scale voice data, thereby making the system highly reusable and portable, making the entire system more efficient, and reducing the complexity of user input operations, and improving the universality of the system.
  • the speech segmentation model provided in the embodiment of the present application is obtained by improving the traditional speech segmentation network by using the attention mechanism; specifically, it is obtained by integrating the attention mechanism into the traditional speech segmentation network.
  • the traditional speech segmentation network involved in the embodiment of the present application is a Unet (or represented as U-net, U-Net, etc.) network, which is one of the algorithms for semantic segmentation using a fully convolutional network; it mainly uses a symmetrical U-shaped structure including a compression path and an expansion path.
  • a schematic diagram of an exemplary network structure of a Unet network is shown in FIG. 5.
  • the Unet network is a U-shaped symmetrical network structure, which includes a left-right symmetrical feature extraction subnetwork and an upsampling subnetwork, and the feature extraction subnetwork and the upsampling subnetwork are connected through a convolution connection layer.
  • the hierarchical distribution of m convolutional layers means that the m convolutional layers are connected in sequence, and the upper convolutional layer of any two adjacent convolutional layers in the m convolutional layers is used as the convolutional layer of the upper level, and the lower convolutional layer is used as the convolutional layer of the next level, and the feature map output by the convolutional layer of the previous level is used as the input of the convolutional layer of the next level.
  • a pooling function can be deployed after each convolutional layer; by first using the convolutional network in the convolutional layer to perform feature extraction on the aliased speech data, and then using the pooling function (pool) to further extract higher-order features, the features to be highlighted in the aliased speech data are effectively retained; the embodiment of the present application does not limit the type of pooling function; for example, if the pooling function is maximum pooling (max pool), it keeps the maximum feature within each pooling window (such as a window of size 2*2) of the feature map output by the convolutional layer.
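As an illustration of maximum pooling with a 2*2 window, the sketch below keeps the largest value in each non-overlapping 2*2 block of a single-channel feature map. The stride of 2 and the function name `max_pool_2x2` are assumptions made for the example.

```python
import numpy as np


def max_pool_2x2(feature_map: np.ndarray) -> np.ndarray:
    """Max pooling with a 2*2 window and stride 2 over a (H, W) map."""
    h, w = feature_map.shape
    h2, w2 = h // 2, w // 2
    # Drop a trailing odd row/column, then group values into 2*2 blocks.
    blocks = feature_map[: h2 * 2, : w2 * 2].reshape(h2, 2, w2, 2)
    return blocks.max(axis=(1, 3))


x = np.array([[1, 3, 2, 0],
              [4, 2, 1, 1],
              [0, 0, 5, 6],
              [1, 2, 7, 8]])
pooled = max_pool_2x2(x)  # [[4, 2], [2, 8]]
```

Each output value summarizes four input values, so the feature map halves in height and width while the strongest responses are retained.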
  • the feature extraction subnetwork and the upsampling subnetwork are symmetrical, and the upsampling subnetwork can be simply understood as a decoding network, which includes an upsampling layer (up sampling layer) corresponding to each convolutional layer in the feature extraction subnetwork.
  • a transposed convolution (up-Conv) with a convolution kernel of 2*2 is also deployed after the convolution connection layer and each upsampling layer to achieve the upsampling function through transposed convolution.
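The upsampling effect of a transposed convolution with a 2*2 kernel can be sketched for a single channel as follows. A stride of 2 is assumed here (the text only states the kernel size), so each input value is scaled by the kernel and written to its own 2*2 output block.

```python
import numpy as np


def up_conv_2x2(x: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Transposed convolution, 2*2 kernel, stride 2, on a (H, W) map.

    With stride equal to the kernel size the output blocks do not
    overlap, so the result is exactly the Kronecker product: x[i, j] *
    kernel lands at rows 2i..2i+1 and columns 2j..2j+1.
    """
    if kernel.shape != (2, 2):
        raise ValueError("kernel must be 2*2")
    return np.kron(x, kernel)


small = np.array([[1.0, 2.0],
                  [3.0, 4.0]])
upsampled = up_conv_2x2(small, np.ones((2, 2)))  # shape (4, 4)
```

In a trained network the kernel values are learned and there is one kernel per input/output channel pair; the all-ones kernel above simply repeats each value.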
  • the symmetrical network structure of the Unet network allows the network to be built from scratch, with its weights initialized and the model then trained; it also allows the convolutional layer structure of some existing networks (such as vgg (a convolutional network) or resnet (residual neural network)) and their trained weight files to be reused, with the subsequent upsampling layers then added for training; reusing existing weight files in this way can greatly speed up model training.
  • each convolutional layer, convolutional connection layer and upsampling layer includes multiple convolutional networks connected in sequence; as shown in FIG5, the feature extraction subnetwork, convolutional connection layer and upsampling layer can each include three convolutional networks with a convolution kernel of 3*3.
  • the convolutional network is also called a convolutional neural network (CNN);
  • a convolutional neural network is a feedforward neural network, which is mainly composed of one or more convolutional layers and a fully connected layer at the top, and also includes associated weights and a pooling layer.
  • an activation function can be deployed after each convolutional network to add nonlinear factors to the model through the activation function, so that the trained model can solve problems that the linear model cannot solve.
  • the embodiment of the present application does not limit the type of activation function; for example, the activation function can be a ReLU function (rectified linear unit), a Sigmoid function, a Tanh function, and the like.
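For reference, the three common activation functions can be written in a few lines. This is a generic sketch of their standard definitions, not code from the application.

```python
import numpy as np


def relu(x: np.ndarray) -> np.ndarray:
    """Rectified linear unit: negative inputs become zero."""
    return np.maximum(x, 0.0)


def sigmoid(x: np.ndarray) -> np.ndarray:
    """Logistic function, mapping inputs into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))


def tanh(x: np.ndarray) -> np.ndarray:
    """Hyperbolic tangent, mapping inputs into (-1, 1)."""
    return np.tanh(x)


values = np.array([-1.0, 0.0, 2.0])
rectified = relu(values)  # [0.0, 0.0, 2.0]
```

All three are nonlinear, which is what lets a stack of convolutional networks represent more than a single linear transform.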
  • the Unet network can also effectively combine high-level feature maps and low-level feature maps through skip connections (also called copy and crop) to obtain the final feature map.
  • the specific process of skip connection may include: the feature map obtained by each convolutional layer in the feature extraction subnetwork is concatenated to the corresponding upsampling layer in the upsampling subnetwork, so that the feature maps are effectively used in subsequent calculations.
  • this method of skipping feature maps of different dimensions can effectively avoid direct supervision and loss calculation in high-level feature maps, effectively combine the features in low-level feature maps, so that the final feature map contains both high-dimensional features and many low-dimensional features, realizing the fusion of features at different scales and improving the accuracy of the model results.
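A skip connection ("copy and crop") can be sketched as a center crop followed by channel-wise concatenation. The (channels, height, width) layout, the shapes and the helper names below are assumptions chosen for illustration.

```python
import numpy as np


def center_crop(fmap: np.ndarray, height: int, width: int) -> np.ndarray:
    """Crop a (C, H, W) feature map around its center to (C, height, width)."""
    _, h, w = fmap.shape
    top, left = (h - height) // 2, (w - width) // 2
    return fmap[:, top:top + height, left:left + width]


def skip_connect(encoder_map: np.ndarray, decoder_map: np.ndarray) -> np.ndarray:
    """Concatenate encoder features onto decoder features along channels."""
    _, h, w = decoder_map.shape
    cropped = center_crop(encoder_map, h, w)
    return np.concatenate([cropped, decoder_map], axis=0)


low_level = np.zeros((64, 68, 68))   # from a convolutional layer
high_level = np.ones((64, 64, 64))   # from the level above the upsampling layer
fused = skip_connect(low_level, high_level)  # shape (128, 64, 64)
```

The fused map carries both sets of channels, so the following convolutional networks can draw on low-level and high-level features at once.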
  • the above-mentioned Figure 5 gives a detailed introduction to the network structure of the traditional Unet network.
  • the speech segmentation model provided in the embodiment of the present application is obtained by improving the network structure of the Unet network.
  • the improvement of the network structure of the Unet network in the embodiment of the present application mainly includes: integrating the attention mechanism in all or part of the network layers (such as convolutional layers, convolutional connection layers and upsampling layers) in the network structure of the Unet network.
  • the attention mechanism is integrated in all or part of the networks in the speech segmentation model obtained by improving the Unet network based on the attention mechanism.
  • the voiceprint representation vector of the specified object can be embedded into each network layer in the Unet network, so that each network layer in the network can deeply feel the voiceprint information or voiceprint characteristics represented by the voiceprint representation vector, so that the final output voice signal is closer to the specified object, ensuring that the extracted voice signal is purer.
  • the structural diagram of the speech segmentation model constructed when the Attention mechanism is added to each network layer in the Unet network can be seen in FIG6.
  • the improved speech segmentation model has the same basic network architecture as the original Unet network architecture, but an attention mechanism is added to each layer in the Unet network architecture, and the input information of the attention mechanism is: the voiceprint representation vector of the specified object and the feature map output by the previous layer of the attention mechanism.
  • the entire model can deeply perceive and learn the extracted voiceprint representation vector, so that the calculation of each layer can be close to the voiceprint representation vector, ensuring that the final extracted speech signal matches the voiceprint characteristics represented by the voiceprint representation vector.
  • the fusion position of the attention mechanism in the multiple convolutional networks corresponding to the network layer is not fixed, and the fusion position shown in FIG6 is exemplary.
  • each network layer in the speech segmentation model is integrated with the attention mechanism as an example to introduce the specific implementation process of the speech segmentation model based on the attention mechanism to segment the speech signal matching the voiceprint characteristics of the specified object from the aliased speech data, and generate the speech file of the specified object based on the speech signal; the process may include but is not limited to steps (1)-(4), wherein:
  • time domain and frequency domain are two commonly used concepts in audio applications, and are also two dimensions for measuring audio features; the time domain displays and processes the sampling points of the speech signal over time, that is, it is bound to time; the frequency domain is a characteristic representation of the energy distribution of the speech signal in each frequency band; through conversion formulas (such as Fourier Transform, Laplace Transform or Z Transform, etc.), the speech signal can be converted from the time domain to the frequency domain, or from the frequency domain to the time domain.
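The time-to-frequency conversion can be illustrated with a short-time Fourier transform. The choice of a short-time transform, the Hann window, the frame length of 256 and the hop of 128 are assumptions for this sketch; the application only names the Fourier Transform as one option.

```python
import numpy as np


def stft(signal: np.ndarray, frame_len: int = 256, hop: int = 128) -> np.ndarray:
    """Short-time Fourier transform; returns (frames, frame_len // 2 + 1)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)


sample_rate = 8000  # assumed, in Hz
t = np.arange(sample_rate) / sample_rate
spectrum = stft(np.sin(2.0 * np.pi * 1000.0 * t))

# Magnitude gives the energy distribution over frequency bands per frame.
magnitude = np.abs(spectrum)
peak_bin = int(magnitude.mean(axis=0).argmax())
peak_hz = peak_bin * sample_rate / 256  # bin 32, i.e. 1000 Hz
```

The resulting two-dimensional array (time frames by frequency bins) is the kind of speech spectrum feature that a convolutional network can process as a feature map.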
  • the correlation between the voiceprint representation vector and the speech spectrum features is calculated to obtain the speech spectrum feature segmentation that matches the voiceprint characteristics.
  • the voiceprint representation vector is input into each network layer in the speech segmentation network, specifically the attention mechanism integrated in each network layer.
  • the correlation between the voiceprint representation vector and the first feature map of the corresponding network layer can be calculated to obtain the second feature map output by the corresponding network layer; it is worth noting that the first feature map of the network layer is different depending on the position of the network layer in the speech segmentation network, which will be introduced in the subsequent embodiments.
  • the second feature map output by the (2m+1)-th network layer in the speech segmentation model (that is, the last upsampling layer in the upsampling subnetwork) can be used as the speech spectrum feature segmentation that matches the voiceprint representation vector;
  • the speech spectrum feature segmentation is specifically the segmentation in the speech spectrum feature that matches the voiceprint characteristics, that is, the segmentation in the speech spectrum characteristics that belongs to the specified object.
  • each network layer in the speech segmentation network can deeply sense the voiceprint information of the specified object, so that the final segmented output voice signal is closer to the voiceprint characteristics of the specified object, ensuring that the extracted voice signal is purer and more accurate.
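One way to picture the correlation calculation between a voiceprint representation vector and a feature map is a gate computed from a scaled dot product. The exact attention formula is not given in this passage, so the dot-product score, the sigmoid gate and the name `voiceprint_attention` are all assumptions of this sketch.

```python
import numpy as np


def voiceprint_attention(feature_map: np.ndarray,
                         voiceprint: np.ndarray) -> np.ndarray:
    """Reweight a (C, H, W) feature map by its correlation with a (C,) vector."""
    channels = feature_map.shape[0]
    # Scaled dot product between the voiceprint and the feature at each position.
    scores = np.einsum("c,chw->hw", voiceprint, feature_map) / np.sqrt(channels)
    # Sigmoid gate in (0, 1): positions correlated with the voiceprint are kept,
    # anti-correlated positions are suppressed.
    gate = 1.0 / (1.0 + np.exp(-scores))
    return feature_map * gate[None, :, :]


voiceprint = np.array([1.0, 1.0])
fmap = np.zeros((2, 1, 2))
fmap[:, 0, 0] = voiceprint    # position that matches the voiceprint
fmap[:, 0, 1] = -voiceprint   # position that opposes it
attended = voiceprint_attention(fmap, voiceprint)
```

The output has the same shape as the input, and the matching position keeps more of its magnitude than the opposing one.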
  • the target network layer is a convolutional layer or a convolutional connection layer in a speech segmentation model
  • the fusion position of the attention mechanism in the multiple convolutional networks corresponding to the target network layer is: the position between the first convolutional network and the second convolutional network adjacent to the first convolutional network in the multiple sequentially connected convolutional networks included in the target network layer; that is, the fusion position of the attention mechanism in the multiple convolutional networks corresponding to the target network layer is: the position between the first convolutional network in the multiple sequentially connected convolutional networks included in the target network layer, and the convolutional network adjacent to the first convolutional network and located after the first convolutional network in the multiple sequentially connected convolutional networks.
  • the target network layer calculates the correlation between the voiceprint representation vector and the first feature map of the target network layer to obtain the second feature map output by the target network layer.
  • the specific implementation process may include: first, using the first convolutional network in the target network layer, performing feature extraction processing on the first feature map of the target network layer to obtain the third feature map of the target network layer;
  • the feature extraction processing here refers to the process of extracting useful information (i.e., features) from the first feature map for subsequent classification, clustering, regression and other tasks;
  • the feature extraction processing process may include: preprocessing the first feature map (such as denoising, normalization or standardization), and performing feature extraction on the preprocessed first feature map to extract useful features, and screening representative or discriminative features from the extracted features, and the screened features are used as features after feature extraction processing.
  • the target network layer is the first convolution layer 701 of the hierarchical distribution in the speech segmentation model (specifically, the feature extraction subnetwork)
  • the first feature map of the target network layer is the speech spectrum feature obtained by frequency domain conversion of the aliased speech data
  • the target network layer is other convolution layers (such as convolution layer 702) in the speech segmentation model except the first convolution layer 701
  • the first feature map of the target network layer is obtained by pooling the feature map output by the upper-level network layer adjacent to the target network layer (such as convolution layer 701)
  • the pooling process is performed by the pooling layer in the target network layer, and the pooling process aims to reduce the size and parameter amount of the feature map output by the upper-level network layer through parallel processing or data compression, thereby reducing the amount of calculation.
  • the correlation between the voiceprint representation vector and the third feature map of the target network layer is calculated to obtain the fourth feature map of the target network layer;
  • the feature dimension of the third feature map of the target network layer is the same as the feature dimension of the fourth feature map;
  • the feature map (such as the third feature map and the fourth feature map) can be expressed in the form of a vector, so the feature dimension of the feature map can be the dimension of the vector, each dimension in the vector corresponds to a feature, that is, the feature dimension before and after the attention mechanism calculation is the same.
  • the fourth feature map is subjected to feature extraction processing using other convolutional networks except the first convolutional network in the target network layer to obtain the second feature map output by the target network layer.
  • the attention mechanism is integrated in the multiple convolutional networks contained in the convolutional layer or the convolutional connection layer.
  • the features that match the voiceprint representation vector of the specified object can be focused on in the aliased speech data based on the attention mechanism, and then the second feature map that matches the voiceprint representation vector can be analyzed through the attention mechanism in the feature extraction process, and then the target speech signal of the specified object can be accurately segmented from the aliased speech data based on the second feature map.
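The ordering described for a convolutional layer (first convolutional network, then attention, then the remaining convolutional networks) can be sketched as below. The zero padding, the ReLU after each convolution, the sigmoid-gated dot-product attention and all weights are assumptions made so the example runs; the names follow the first, second, third and fourth feature maps of the text.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def conv3x3(x: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """3*3 convolution with zero padding and ReLU.

    x is (C_in, H, W); weights is (C_out, C_in, 3, 3).
    """
    padded = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    windows = sliding_window_view(padded, (3, 3), axis=(1, 2))  # (C_in, H, W, 3, 3)
    return np.maximum(np.einsum("chwij,ocij->ohw", windows, weights), 0.0)


def attend(fmap: np.ndarray, voiceprint: np.ndarray) -> np.ndarray:
    """Assumed attention: gate each position by its voiceprint correlation."""
    scores = np.einsum("c,chw->hw", voiceprint, fmap) / np.sqrt(fmap.shape[0])
    return fmap * (1.0 / (1.0 + np.exp(-scores)))


def conv_layer_with_attention(first_map, voiceprint, weights):
    third_map = conv3x3(first_map, weights[0])   # first convolutional network
    fourth_map = attend(third_map, voiceprint)   # attention keeps the dimension
    second_map = fourth_map
    for w in weights[1:]:                        # remaining convolutional networks
        second_map = conv3x3(second_map, w)
    return second_map


rng = np.random.default_rng(0)
first_map = rng.standard_normal((1, 16, 16))     # stand-in speech spectrum feature
weights = [0.1 * rng.standard_normal(shape)
           for shape in [(4, 1, 3, 3), (4, 4, 3, 3), (4, 4, 3, 3)]]
second_map = conv_layer_with_attention(first_map, rng.standard_normal(4), weights)
```

The voiceprint vector must have as many entries as the third feature map has channels, which is why a dimension transformation of the vector is needed per layer.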
  • the target network layer is an upsampling layer in a speech segmentation model
  • the fusion position of the attention mechanism in the multiple convolutional networks corresponding to the target network layer is: the position after the last convolutional network in the multiple sequentially connected convolutional networks.
  • the target network layer calculates the correlation between the voiceprint representation vector and the first feature map of the target network layer to obtain the second feature map output by the target network layer.
  • the specific implementation process may include: first, using multiple sequentially connected convolutional networks in the target network layer to perform feature extraction processing on the target feature map to obtain the first feature map of the target network layer.
  • the target feature map here is obtained by feature splicing the feature map output by the convolutional layer corresponding to the target network layer and the feature map output by the upper-level network layer of the target network layer; as shown in FIG7b, the input information of the first upsampling layer 703 in the upsampling subnetwork is the target feature map, which is obtained by feature concatenation of the feature map output by the convolution connection layer 704 at the level above the first upsampling layer 703 and the feature map output by the convolution layer 705 corresponding to the first upsampling layer 703.
  • the attention mechanism fused in the target network layer is used to calculate the correlation between the voiceprint representation vector and the first feature map of the target network layer to obtain the second feature map output by the target network layer; the feature dimension of the second feature map is the same as that of the first feature map.
  • the attention mechanism is integrated in the multiple convolutional networks contained in the upsampling layer. After features are extracted from the feature map output by the previous network layer and the feature map output by the corresponding convolutional layer, the attention mechanism can be used to focus on the features that match the voiceprint representation vector of the specified object in the first feature map obtained by feature extraction, thereby obtaining the second feature map that matches the voiceprint representation vector of the specified object; based on the second feature map, the target speech signal of the specified object can then be accurately segmented from the aliased speech data.
  • FIGS. 7a and 7b are only exemplary processes for the target network layer to perform correlation calculation when the attention mechanism is respectively integrated into an exemplary fusion position in the feature extraction module, the convolution connection layer and the upsampling sub-network layer; when the attention mechanism is integrated into different fusion positions in the target network layer, the specific implementation process of the target network layer performing the correlation calculation is different.
  • the method of converting the speech spectrum feature segment from the frequency domain to the time domain may include but is not limited to the aforementioned Fourier Transform, Laplace Transform or Z Transform, etc., and is not limited to this.
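The conversion back to the time domain can be sketched with an inverse short-time Fourier transform using overlap-add. The short-time framing, the Hann window and the frame and hop sizes are assumptions of this example; the round trip on an unmodified spectrum shows that the conversion itself loses nothing.

```python
import numpy as np

FRAME, HOP = 256, 128          # assumed analysis parameters
WINDOW = np.hanning(FRAME)


def stft(signal: np.ndarray) -> np.ndarray:
    """Time domain to frequency domain, one row per windowed frame."""
    n_frames = 1 + (len(signal) - FRAME) // HOP
    frames = np.stack([signal[i * HOP:i * HOP + FRAME] * WINDOW
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)


def istft(spectrum: np.ndarray) -> np.ndarray:
    """Frequency domain back to time domain by overlap-add."""
    frames = np.fft.irfft(spectrum, n=FRAME, axis=1)
    length = FRAME + HOP * (len(frames) - 1)
    signal = np.zeros(length)
    norm = np.zeros(length)
    for i, frame in enumerate(frames):
        signal[i * HOP:i * HOP + FRAME] += frame
        norm[i * HOP:i * HOP + FRAME] += WINDOW
    # Undo the window weighting; samples with zero weight stay zero.
    return np.divide(signal, norm, out=np.zeros(length), where=norm > 1e-8)


t = np.arange(8000) / 8000.0
original = np.sin(2.0 * np.pi * 440.0 * t)
restored = istft(stft(original))
```

In the separation setting, the array passed to `istft` would be the speech spectrum feature segment of the specified object rather than the full spectrum.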
  • the file format of the voice file can be set according to the personalized needs of the user, and the embodiment of the present application does not limit the file format of the voice file.
  • when the voice file is a text file, a speech recognition algorithm or tool can be used to convert the voice signal that matches the voiceprint characteristics of the specified object into text; text processing is performed on the converted text (such as correcting spelling errors, adding conformances, and clarifying the text), and the processed text is then saved as a text file in a text format (such as .doc format).
  • when the voice file is an audio file, the voice signal that matches the voiceprint characteristics of the specified object can be directly saved as an audio file in an audio format (such as .WAV format).
  • the embodiment of the present application supports converting the aliased speech data from the time domain to the frequency domain to obtain the speech spectrum features of the aliased speech data in the frequency domain.
  • the attention mechanism of each network layer in the speech segmentation model can be used to calculate the correlation between the voiceprint representation vector and the speech spectrum features that belong to the frequency domain, thereby ensuring the feasibility of the correlation calculation; considering that the final segmentation is the signal in the time domain, it is necessary to convert the speech spectrum features from the frequency domain to the time domain to obtain the target speech signal that matches the voiceprint characteristics, thereby ensuring that the final extracted signal is a time domain signal that can be understood and read by the device.
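The round trip from the time domain to the frequency domain and back can be sketched as follows. This is a minimal illustration that assumes a plain discrete Fourier transform applied to one short frame; the function names `dft` and `idft` and the 16-sample test tone are assumptions for demonstration, and a practical system would use a windowed short-time transform from a signal processing library.

```python
import cmath
import math

def dft(frame):
    """Time domain -> frequency domain (naive discrete Fourier transform)."""
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spectrum):
    """Frequency domain -> time domain (inverse transform, real part only)."""
    n = len(spectrum)
    return [(sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n)
                 for k in range(n)) / n).real
            for t in range(n)]

# A hypothetical frame: a sine wave completing 2 cycles over 16 samples.
frame = [math.sin(2 * math.pi * 2 * t / 16) for t in range(16)]
spectrum = dft(frame)       # energy concentrates in frequency bin 2
restored = idft(spectrum)   # matches the original frame up to rounding error
```

In the scheme described above, the segmentation model would operate on the spectrum between the two transforms, and the inverse transform yields the time domain signal that a device can play back or store.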
  • each network layer in the speech segmentation network is different, specifically, the feature dimensions of each convolutional network in each network layer are different. Therefore, before inputting the voiceprint representation vector of the specified object into each network layer in the speech segmentation network, it is also necessary to use a network layer to transform the dimension of the voiceprint representation vector to obtain the voiceprint representation vector after the dimension transformation.
  • dimension transformation refers to changing the feature dimension of the voiceprint representation vector, so that the feature dimension of the voiceprint representation vector after dimension transformation is the same as the feature dimension of the feature map of the attention mechanism to be input into the corresponding network layer, so that when the voiceprint representation vector after dimension transformation is input into the speech segmentation model, the speech segmentation model can effectively process the voiceprint representation vector to avoid the unavailability of the voiceprint representation vector caused by different dimensions.
  • the feature map of the attention mechanism to be input into the corresponding network layer can refer to the third feature map described above.
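The dimension transformation can be pictured as one linear projection per network layer. The sketch below is an assumption-laden illustration: the helper names, the random weights, and the layer dimensions 8, 16 and 32 are all hypothetical, and in a trained model the projection weights would be learned rather than sampled.

```python
import random

def make_projection(in_dim, out_dim, seed=0):
    """Build a hypothetical out_dim x in_dim weight matrix (learned in practice)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]

def project(vector, weights):
    """Apply the linear map so the result has len(weights) entries."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]

voiceprint = [0.2, -0.5, 0.9, 0.1]   # hypothetical 4-dimensional voiceprint vector
layer_dims = [8, 16, 32]             # hypothetical feature dimensions of three layers
projected = {
    dim: project(voiceprint, make_projection(len(voiceprint), dim, seed=dim))
    for dim in layer_dims
}
```

Each entry of `projected` has the feature dimension expected by the attention mechanism of the corresponding layer, so the same voiceprint vector can be supplied to layers of different widths.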
  • the attention mechanism can be inserted between any two convolutional networks in a plurality of convolutional networks connected sequentially in the network layer;
  • the first feature map of the network layer can refer to the feature map output by the convolutional network in the network layer that is adjacent to the attention mechanism and is located before the attention mechanism.
  • the embodiment of the present application constructs a fully automatic speech processing solution that segments speech signals based on the voiceprint vector embedding method; the user only needs to input a small segment of reference speech data of the specified object and the aliased speech data to be segmented, and the voice signal of the specified object is then automatically and quickly separated from the aliased speech data, which can greatly improve the efficiency of voice segmentation, removes the need for manual participation, and forms a fast, standardized process.
  • the voiceprint vector embedding method is adopted, and the voiceprint representation vector used to represent the voiceprint characteristics of the specified object is input into the voice segmentation model to participate in the Attention calculation, which can make the voice segmentation model independent of any object's historical voice data for additional training, and can get rid of the dependence on the object's large-scale voice data, so as to achieve the system's high reusability and portability, making the entire system more universal.
  • the input voiceprint representation vector can be calculated with the Attention mechanism of each network layer in the Unet network, so as to ensure that the feature map output by the voice segmentation model and the voiceprint characteristics represented by the voiceprint representation vector of the specified object are more closely matched, avoiding too much noise in the extracted voice signal of the specified object, improving the purity of the voice signal, and improving the segmentation accuracy of the voice segmentation model.
  • FIG. 8 shows a flowchart of another voice processing method provided by an exemplary embodiment of the present application
  • the voice processing method may be executed by a computer device in the aforementioned system, such as a terminal and/or a server; the voice processing method may include but is not limited to steps S801-S806:
  • S802 Acquire reference speech data of a designated object among at least two objects.
  • S803 Perform short-term correlation analysis on the reference speech data of the designated object.
  • S804 Perform long-term correlation analysis on the reference speech data of the designated object to obtain a voiceprint representation vector of the designated object.
  • steps S803-S804 in order to learn the clearer voiceprint characteristics of the specified object from the reference voice data of the specified object, the embodiment of the present application supports the analysis of the reference voice data in combination with short-time correlation and long-time correlation to extract the voiceprint representation vector that can fully express the voiceprint characteristics of the specified object.
  • the short-time correlation analysis of the reference voice data can be simply understood as: the process of feature analysis of a shorter (such as 20 milliseconds) voice signal in the reference voice data; considering that the voice signal in the reference voice data usually does not change in a short time, after the reference voice data is discretized, the information distribution of each shorter voice signal in the time domain and frequency domain can be used to extract the characteristics of the voice signal in the short time, thereby realizing the feature analysis of each voice signal in the reference voice data.
  • the short-time correlation analysis focuses on the feature analysis of the segmented voice signals in the reference voice data.
  • the long-time correlation analysis of the reference voice data can be simply understood as: the process of feature analysis of the entire reference voice data; that is, the long-time correlation analysis focuses on the semantic expression of the entire signal sequence of the reference voice data.
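Short-time analysis starts by cutting the reference speech data into short segments. The sketch below assumes a 16 kHz sampling rate and non-overlapping 20 millisecond segments; the sampling rate, the function name `split_segments`, and the choice to drop a trailing partial segment are assumptions made for illustration.

```python
def split_segments(samples, sample_rate, segment_ms=20):
    """Cut samples into consecutive, non-overlapping segments of segment_ms each.

    A trailing remainder shorter than one segment is dropped (an assumption).
    """
    size = sample_rate * segment_ms // 1000
    return [samples[i:i + size] for i in range(0, len(samples) - size + 1, size)]

samples = [0.0] * 16000                      # one second of hypothetical audio at 16 kHz
segments = split_segments(samples, 16000)    # 50 segments of 320 samples each
```

Short-time analysis then works on each element of `segments`, while long-time analysis works on the sequence of per-segment results as a whole.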
  • the voiceprint vector extraction model includes an improved audio neural network (PANNS) network and a transformer (Transformer) network.
  • the improved audio neural network (PANNS) network can be referred to as improved PANNS, which is mainly used for short-time correlation analysis of reference voice data
  • the transformer (Transformer) network can be referred to as transformer network, which is mainly used for long-time correlation analysis of reference voice data.
  • a schematic diagram of an exemplary use of the improved PANNS and the Transformer network to perform short-term correlation analysis and long-term correlation analysis on reference speech data can be seen in FIG. 9.
  • the reference speech data is converted from the time domain to the frequency domain to obtain the reference speech spectrum characteristics corresponding to the reference speech data; wherein, the conversion formula from the time domain to the frequency domain can be referred to the aforementioned related description, which will not be repeated here.
  • the reference speech spectrum characteristics can also be segmented to obtain the reference speech spectrum feature segmentation corresponding to each speech data segmentation;
  • the reference speech spectrum feature segmentation involved in the embodiment of the present application can be a logarithmic Mel spectrum (Log-mel or Logmel).
  • the Mel spectrum uses a nonlinear frequency scale (the Mel scale) derived from the human ear's perception of equal steps in pitch, where pitch refers to how high or low a sound is; frequency bands are spaced equally on the Mel scale, so signal processing on this scale follows the auditory perception of the human ear more closely.
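The nonlinear spacing of the Mel scale can be made concrete with a short sketch. It assumes the commonly used conversion mel = 2595 · log10(1 + f / 700); the embodiment does not state which Mel formula it uses, so the formula, the function names, and the 0 to 8000 Hz range are assumptions.

```python
import math

def hz_to_mel(freq_hz):
    """Convert Hz to Mel using a widely used formula (an assumption here)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_band_edges(f_min, f_max, n_bands):
    """Return n_bands + 2 edge frequencies in Hz, equally spaced on the Mel scale."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_bands + 1)
    return [mel_to_hz(lo + i * step) for i in range(n_bands + 2)]

edges = mel_band_edges(0.0, 8000.0, 10)   # hypothetical 10-band layout up to 8 kHz
```

The edges are equidistant in Mel but grow progressively wider in Hz, which is why low frequencies receive finer resolution than high frequencies. A log-Mel feature would then take the logarithm of the energy collected in each band.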
  • the reference speech spectrum feature segments corresponding to the speech data segments are input into the improved PANNS segment by segment.
  • the complete reference speech data is input into the improved PANNS; thus, the improved PANNS can segment the received reference speech data according to the segmentation rules followed when segmenting the reference speech spectrum characteristics, and obtain multiple speech data segments corresponding to the reference speech data.
  • the improved PANNS will perform short-term correlation analysis on each speech data segment based on each speech data segment and the corresponding reference speech spectrum characteristic segment, and obtain the voiceprint semantic feature vector corresponding to each speech data segment; the voiceprint semantic feature vector is used to characterize the semantic characteristics of the corresponding speech data segment.
  • the vector sequence composed of the voiceprint semantic feature vectors corresponding to each speech data segment (or called voiceprint semantic feature vector sequence) is input into the Transformer network, so that the Transformer network can perform long-term correlation analysis on the voiceprint semantic feature vector sequence to obtain the voiceprint representation vector of the specified object, where the voiceprint semantic feature vector sequence includes the voiceprint semantic feature vector corresponding to each speech data segment.
  • the Transformer network is a sequence network, and its input is an overall vector sequence, and its output is also a sequence (that is, the voiceprint representation vector of the specified object is a sequence); after the Transformer network outputs a sequence, the last vector in the sequence can be called state, which contains the semantic fusion of the entire sequence; in this way, the overall sequence output by the Transformer network can be averaged (mean), and the average result and the state can be superimposed to generate a voiceprint representation vector that integrates the semantic features expressed by the entire sequence.
  • the above-mentioned method of obtaining the voiceprint representation vector by combining the speech features represented by the average of the Transformer network output sequence with the speech features represented by the state (the last vector of the sequence) can effectively ensure that the voiceprint representation vector more fully reflects the voiceprint characteristics of the specified object and ensure the accuracy of voiceprint feature extraction.
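The combination of the sequence average and the last vector (state) can be sketched in a few lines. The sketch assumes that "superimposing" means element-wise addition; the function name `pool_sequence` and the toy output sequence are illustrative assumptions.

```python
def pool_sequence(sequence):
    """Fuse a sequence of vectors into one vector: mean over the sequence plus last vector.

    Assumption: the superposition is an element-wise sum.
    """
    count = len(sequence)
    dim = len(sequence[0])
    mean = [sum(vec[i] for vec in sequence) / count for i in range(dim)]
    state = sequence[-1]                     # last vector of the output sequence
    return [m + s for m, s in zip(mean, state)]

outputs = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # hypothetical Transformer outputs
voiceprint = pool_sequence(outputs)              # mean [3, 4] plus state [5, 6]
```

The mean summarizes every position equally, while the state emphasizes the accumulated context at the end of the sequence, so the sum carries both views.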
  • the exemplary structural diagram of the improved PANNS can be seen in FIG10 ; as shown in FIG10 , the input information of the improved PANNS is the reference speech data, that is, the input of the improved PANNS uses the original speech sampling point sequence, that is, the original sequence of the audio signal.
  • the improved PANNS can be divided into two branches, namely the time domain branch (or called the time domain processing branch) and the frequency domain branch (or called the frequency domain processing branch).
  • the input information of the time domain branch is the reference speech data
  • the input information of the frequency domain branch is the reference speech spectrum feature corresponding to the reference speech data
  • the reference speech spectrum feature is obtained by converting the time domain signal "reference speech data" from the time domain to the frequency domain.
  • the improved PANNS focuses on short-term correlation analysis of the reference speech data, that is, it supports processing by using the improved PANNS in a segmented input manner; specifically, the improved PANNS only processes a segment of speech data in the reference speech data and the reference speech spectrum feature corresponding to the speech data each time.
  • the input information of each input time domain branch is a segment of speech data in the reference speech data; similarly, the input information of each input frequency domain branch is a segment of reference speech spectrum feature in the frequency domain corresponding to a segment of speech data in the reference speech data.
  • the reference speech data can be converted from the time domain to the frequency domain to obtain the reference speech spectrum feature corresponding to the reference speech data, and the reference speech spectrum feature is segmented to obtain the reference speech spectrum feature segment corresponding to each speech data segment; and the reference speech data is segmented to obtain multiple speech data segments corresponding to the reference speech data.
  • segmentation rules followed by the segmentation processing of the reference speech data are the same as the segmentation rules followed by the segmentation processing of the reference speech spectrum feature; for example, the segmentation rules may include: when the reference speech data is segmented, speech data with a duration of 20 milliseconds is periodically collected as a segment, then when the reference speech spectrum feature is segmented, each reference speech spectrum feature segment is converted to the time domain and the corresponding speech data segment duration is 20 milliseconds. The accuracy of each characteristic analysis is ensured by the correspondence between time domain segmentation and frequency domain segmentation.
  • the improved PANNS can perform short-term correlation analysis on each speech data segment based on each speech data segment and the corresponding reference speech spectrum feature segment, and obtain the voiceprint semantic feature vector corresponding to each speech data segment; the voiceprint semantic feature vector is used to characterize the semantic characteristics of the corresponding speech data segment.
  • the specific process of the short-term correlation analysis described above is introduced, wherein:
  • the time domain branch contains multiple one-dimensional convolutional layers (Conv 1D) and maximum pooling layers (Max pooling); where Conv is the abbreviation of Convolutional, D is the abbreviation of Dimension, and the dimension of the maximum pooling layer is 1, and the stride (abbreviated as s) is a one-dimensional vector with a length of 4.
  • the time domain branch includes, in sequence: one-dimensional convolutional layer → one-dimensional convolutional block (Conv 1D block, composed of one or more one-dimensional convolutional layers) → maximum pooling layer (dimension is 1, and the stride s is 4) → one-dimensional convolutional block → maximum pooling layer → one-dimensional convolutional block → maximum pooling layer; after features are extracted from the target speech data segment by a convolution layer, the adjacent maximum pooling layer that follows it selects from the resulting feature map, which helps to further extract the features of interest in the feature map.
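The effect of the one-dimensional maximum pooling layers in the time domain branch can be illustrated with a minimal sketch. The stride of 4 comes from the description; the kernel size of 4, the function name `max_pool_1d`, and the 320-sample segment length are assumptions, and the convolution blocks between the pooling layers are omitted.

```python
def max_pool_1d(sequence, kernel=4, stride=4):
    """Keep the largest value in each window of `kernel` samples, moving by `stride`."""
    return [max(sequence[i:i + kernel])
            for i in range(0, len(sequence) - kernel + 1, stride)]

signal = [0.1, 0.5, -0.2, 0.3, 0.9, 0.0, 0.4, -0.7, 0.2, 0.2, 0.6, 0.1]
pooled = max_pool_1d(signal)          # one maximum per window of four samples

# Three pooling stages shrink a hypothetical 320-sample segment by a factor of 64.
length = 320
for _ in range(3):
    length = len(max_pool_1d([0.0] * length))
```

Each pooling stage keeps only the strongest response in every window, which is one way to read "selecting" the features of interest from the feature map.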
  • the reference speech data (specifically, the target speech data segments) can be subjected to multiple feature extractions to obtain a one-dimensional sequence (resize), and the one-dimensional sequence is transformed (Reshape) to obtain multiple two-dimensional time domain feature maps (or two-dimensional graph Wavegrams).
  • the time domain feature map is a graphical representation in the time domain that describes the changes of the reference speech signal in the reference speech data over time, and can intuitively display the basic characteristics of the reference speech signal (such as the periodicity, frequency components, and phase relationships of the reference speech signal).
  • the purpose of the dimensional conversion here is to enable the converted time domain feature map to be fused with the frequency domain feature map output by the frequency domain branch.
  • the time domain characteristics of the speech signal in the reference speech data (such as information such as audio loudness and sampling point amplitude) can be directly learned when the reference speech data is subjected to feature extraction processing through the time domain branch.
  • the frequency domain branch includes multiple two-dimensional convolution layers (Conv 2D) and maximum pooling layers (Max pooling); wherein, the dimension of the maximum pooling layer of the frequency domain branch is 2.
  • the frequency domain branch includes in sequence: a two-dimensional convolution block (composed of one or more two-dimensional convolution layers) → a maximum pooling layer (dimension is 2) → a two-dimensional convolution block → a maximum pooling layer → a two-dimensional convolution block; through the multiple two-dimensional convolution layers and the maximum pooling layer, the reference speech spectrum feature segment (Logmel) corresponding to the target speech data segment can be subjected to feature extraction processing to obtain multiple frequency domain feature maps (Feature maps).
  • the frequency domain feature map is a graphical representation of the frequency components of the reference speech signal in the reference speech data in the frequency domain, and can display the various frequency components contained in the reference speech signal and their corresponding amplitudes or intensities. It is worth noting that the feature dimension of the frequency domain feature map here is the same as the feature dimension of the time domain feature map output by the time domain branch. In the above process, by using a large number of two-dimensional convolutional layers in the frequency domain branch, the frequency domain characteristics of the speech signal in the reference speech data can be directly learned when the frequency domain branch is used to perform feature extraction processing on the reference speech spectrum feature segments.
  • the time domain feature map and the frequency domain feature map can be fused to generate the voiceprint semantic feature vector corresponding to the target speech data segment.
  • there are multiple information exchanges between the time domain branch and the frequency domain branch: in each exchange, the features of the time domain branch are transformed in dimension (Reshape) and then fused (concat) with the features of the frequency domain branch, and the fusion result is convolved by a two-dimensional convolution block (Conv 2D block) and input into a higher-level fusion module for further fusion.
  • time domain processing can obtain the correlation of the reference voice data over time, while frequency domain processing can obtain the correlation of the reference voice data across different frequencies; the two belong to different domains. Through the information interaction between the two domains (time domain and frequency domain) described above, the time domain and frequency domain remain complementary, so that the higher-level network layers can perceive information from the lower-level layers, thereby achieving full learning of the reference voice data.
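The Reshape and concat steps of one information exchange can be sketched with nested lists. The shapes, the helper names, and the idea of concatenating along a channel axis are assumptions chosen for illustration; the two-dimensional convolution block that follows the fusion is omitted.

```python
def reshape(sequence, rows, cols):
    """Turn a flat sequence into a rows x cols two-dimensional map."""
    if len(sequence) != rows * cols:
        raise ValueError("sequence length must equal rows * cols")
    return [sequence[r * cols:(r + 1) * cols] for r in range(rows)]

def concat_channels(*groups):
    """Concatenate groups of equally shaped 2-D maps along the channel axis."""
    fused = []
    for group in groups:
        fused.extend(group)
    return fused

time_features = list(range(12))              # hypothetical 1-D time domain output
wavegram = reshape(time_features, 3, 4)      # one 3 x 4 time domain feature map
freq_maps = [[[0.0] * 4 for _ in range(3)] for _ in range(2)]  # two 3 x 4 frequency maps
fused = concat_channels([wavegram], freq_maps)   # three channels of shape 3 x 4
```

The reshape is what makes the fusion possible: only after the time domain features take the same two-dimensional shape as the frequency domain feature maps can they be stacked together.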
  • the intermediate time domain feature map obtained by the time domain branch in the i-th feature extraction process, the intermediate frequency domain feature map obtained by the frequency domain branch in the i-th feature extraction process, and the i-1th intermediate feature vector obtained by the i-1th feature extraction process are fused to generate the i-th intermediate feature vector after the i-th feature extraction process.
  • the voiceprint semantic feature vector corresponding to the target speech data segment is generated.
  • the specific generation process may include: inputting the kth intermediate feature vector into a two-dimensional convolutional neural network (2D CNN layers) for feature extraction to obtain a vector sequence; averaging the vector sequence to obtain the average value (mean) and taking its maximum to obtain the maximum value (max); adding the average value and the maximum value (sum); passing the sum through an activation function layer (ReLU) to obtain a feature vector (vector); and normalizing the feature vector with a normalization function (softmax), which converts it into a voiceprint semantic feature vector representing a probability distribution, that is, the voiceprint semantic feature vector corresponding to the target speech data segment.
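The pooling and normalization steps at the end of this process can be sketched as follows. The sketch starts from a vector sequence that is assumed to be the output of the two-dimensional convolution layers, takes the mean and maximum over the sequence axis, and treats the activation function as ReLU; the function name `pooled_head` and the toy input are illustrative assumptions.

```python
import math

def pooled_head(sequence):
    """mean + max over the sequence, then ReLU, then softmax."""
    dim = len(sequence[0])
    mean = [sum(vec[i] for vec in sequence) / len(sequence) for i in range(dim)]
    peak = [max(vec[i] for vec in sequence) for i in range(dim)]
    summed = [m + p for m, p in zip(mean, peak)]
    activated = [max(0.0, x) for x in summed]          # ReLU
    top = max(activated)                               # shift for numerical stability
    exps = [math.exp(x - top) for x in activated]
    total = sum(exps)
    return [e / total for e in exps]                   # softmax: entries sum to 1

# Hypothetical two-step vector sequence with three features per step.
segment_vector = pooled_head([[1.0, -2.0, 0.5], [3.0, -4.0, 0.5]])
```

For this input the summed vector is [5, -5, 1], ReLU clips the negative entry to 0, and the softmax turns the result into a probability distribution over the three features.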
  • the entire reference speech data is divided into frames and input into the improved PANNS respectively.
  • the improved PANNS selects to use the last vector of the entire sequence and the mean of the entire sequence to fuse together to generate the final voiceprint representation vector, and can obtain a multi-band semantic feature vector sequence representing the entire reference speech data (including the voiceprint semantic feature vector corresponding to each speech data segment).
  • short-term correlation analysis of the reference speech data can be realized, and the voiceprint characteristics of each segmented speech signal in the reference speech data can be fully learned, so that the extracted voiceprint representation vector can fully express the voiceprint information of the specified object.
  • the embodiment of the present application also introduces the Transformer network to learn the long-term correlation information of the reference speech data.
  • the attention mechanism of the Transformer network can better attend to the voiceprint characteristics of the specified object in the reference speech data, and better extract global feature information from the reference speech data.
  • the schematic diagram of the network structure of the Transformer network can be seen in Figure 11; as shown in Figure 11, the Transformer network adopts an encoder (encoding)-decoder (decoding) architecture.
  • the encoder side is formed by stacking N encoder layers, and the decoder side is formed by stacking N decoder layers (i.e., "N×" in Figure 11).
  • the encoder layer mainly includes two sublayers, among which: the first sublayer contains a multi-head attention mechanism (Multi-Head Attention), residual and normalization (Add&Norm); the second sublayer contains a feedforward neural network (Feed Forward), residual and normalization layer.
  • the multi-head attention mechanism contained in the first sublayer can help obtain the contextual semantics of the voiceprint semantic feature vector corresponding to the speech data segment.
  • the decoder layer mainly consists of three sub-layers, among which: the first sub-layer contains the masked multi-head attention mechanism (Masked Multi-Head Attention), residual and normalization layers; the second sub-layer contains the multi-head attention mechanism (Multi-Head Attention), residual and normalization (Add&Norm) layers; and the third sub-layer contains the feedforward neural network (Feed Forward), residual and normalization layers;
  • the decoding layer can help obtain the key content that needs attention through this three-layer structure.
  • the input information of the encoder side in the Transformer network is the voiceprint semantic feature vector sequence (Input Embedding) output by the improved PANNS, and the voiceprint semantic feature vector sequence contains the voiceprint semantic feature vector corresponding to each speech data segment.
  • the input voiceprint semantic feature vector sequence is positionally encoded (Positional Encoding) to realize the preprocessing of the voiceprint semantic feature vector sequence; wherein, position encoding is a method of secondary representation of each word in the vector sequence using the position information of the word, which enables the data input to the encoding side to carry the position information of the word.
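One common way to attach position information is the sinusoidal encoding used in the original Transformer design. The embodiment does not say which encoding it uses, so the sinusoidal form, the base of 10000, and the function names below are assumptions.

```python
import math

def positional_encoding(length, dim):
    """Sinusoidal table: sine on even feature indices, cosine on odd ones."""
    table = []
    for pos in range(length):
        row = []
        for i in range(dim):
            angle = pos / (10000 ** ((2 * (i // 2)) / dim))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

def add_position(sequence):
    """Add the position table to a sequence of equally sized vectors."""
    table = positional_encoding(len(sequence), len(sequence[0]))
    return [[x + p for x, p in zip(vec, row)] for vec, row in zip(sequence, table)]

# Three hypothetical zero vectors, so the result shows the encoding itself.
encoded = add_position([[0.0] * 4 for _ in range(3)])
```

After this step, two identical voiceprint semantic feature vectors at different positions in the sequence are no longer identical, which lets the attention layers take segment order into account.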
  • the position-encoded data is input to the encoding side, and each layer in the encoding side (as described above for the structure of the encoding side) encodes the input data to obtain the encoding result. Then, after obtaining the encoding result output by the encoding side, the decoding side supports combining the encoding result and the output feature of the decoding side at the previous moment (shifted right) as the input data of the decoding side at this time, and the decoding side decodes the input data. Finally, the output features of the decoding side are linearly transformed (Linear) and classified (softmax) to obtain the voiceprint representation vector (Output Probabilities) of the specified object.
  • the voiceprint semantic feature vector sequence can be calculated through the Transformer network, which can make the long-term semantic expression of the entire reference speech data of the entire voiceprint semantic feature vector sequence clearer, that is, the final output voiceprint representation vector can fully and clearly express the voiceprint characteristics of the specified object.
  • S805 Inputting the aliased speech data and the voiceprint representation vector into a preset speech segmentation model, where the speech segmentation model is used to segment a target speech signal that matches the voiceprint characteristics of a specified object from the aliased speech data based on an attention mechanism.
  • for the specific implementation of step S805, reference may be made to the description, in the embodiment shown in Figure 3 above, of how step S302 segments the voice signal matching the voiceprint characteristics of the specified object from the aliased voice data based on the attention mechanism; it is not repeated here.
  • the embodiment of the present application mainly adopts the improved Unet network (i.e., the speech segmentation network) to implement the segmentation of the mixed speech data based on the attention mechanism to obtain a speech signal that matches the voiceprint characteristics of the specified object.
  • the speech segmentation network mainly integrates the attention mechanism into the traditional Unet network, such as adding the attention mechanism to each network layer in the traditional Unet network to realize the improvement of the traditional Unet network.
  • the embodiment of the present application also supports model distillation of the speech segmentation model integrated with the attention mechanism to obtain a speech segmentation model after model distillation; in this way, the aforementioned correlation calculation can be implemented by the speech segmentation model after model distillation, thereby reducing the scale of the entire system and reducing the overall parameter amount and time consumption.
  • the voiceprint representation vector and the speech spectrum feature are both expressed as vectors, so the method by which the distilled speech segmentation model calculates the correlation between them may include, but is not limited to, the dot product method. Under the dot product method, the correlation is the product of the modulus of the voiceprint representation vector, the modulus of the speech spectrum feature, and the cosine of the angle between them: if the voiceprint representation vector is $\vec{a}$, the speech spectrum feature is $\vec{b}$, and the angle between them is $\theta$, then their dot product is $\vec{a} \cdot \vec{b} = |\vec{a}|\,|\vec{b}|\cos\theta$. The dot product result is used as the similarity between the voiceprint representation vector and the speech spectrum feature.
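The dot product relation described above can be checked numerically. The sketch uses two hypothetical two-dimensional vectors so that the angle between them can be computed independently with `atan2`; the vector values and function names are assumptions for illustration only.

```python
import math

def dot(a, b):
    """Sum of element-wise products."""
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    """Modulus (Euclidean length) of a vector."""
    return math.sqrt(dot(a, a))

voiceprint = [3.0, 4.0]    # hypothetical voiceprint representation vector
spectrum = [4.0, 3.0]      # hypothetical speech spectrum feature

similarity = dot(voiceprint, spectrum)
# Angle between the two vectors, measured from their directions in the plane.
theta = math.atan2(spectrum[1], spectrum[0]) - math.atan2(voiceprint[1], voiceprint[0])
by_modulus = norm(voiceprint) * norm(spectrum) * math.cos(theta)
```

Both routes give the same value (24 for these vectors), and the value grows as the two vectors become longer or more closely aligned.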
  • Model distillation is a method of learning a large model (teacher model) with a large number of parameters to obtain a more compact small model (student model) with a small number of parameters.
  • Model distillation for speech segmentation models mainly implements model distillation for large models through technologies such as pruning and knowledge distillation; among them: pruning is called model pruning (Model Pruning), which is a model compression technology that aims to reduce the complexity of the speech segmentation model by deleting some unimportant parameters or structures in the speech segmentation model, improve the reasoning speed of the speech segmentation model, and reduce the storage requirements of the speech segmentation model.
  • Knowledge distillation can be understood as taking a relatively complex speech segmentation model as a teacher model and training a student model with a small number of parameters and a simple structure to learn and imitate the output of the teacher model, so that the trained student model not only has the advantage of low computational complexity, but also approaches the model performance of the teacher model; in the embodiment of the present application, the trained student model is the speech segmentation model after knowledge distillation.
  • the model pruning and knowledge distillation mentioned above are the main implementation technologies of model distillation, but model distillation can also include other technologies.
  • the embodiments of the present application do not limit the specific implementation methods of model distillation.
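Since the specific implementation is left open, the following sketch shows only one simple possibility for each technique: magnitude-based pruning, which zeroes the weights with the smallest absolute values, and a distillation loss computed as the mean squared error between student and teacher outputs. Both choices, the function names, and all numeric values are assumptions rather than the method of the embodiment.

```python
def prune_by_magnitude(weights, keep_ratio):
    """Zero out all but the largest-magnitude weights (ties may keep a few extra)."""
    keep = max(1, int(len(weights) * keep_ratio))
    threshold = sorted((abs(w) for w in weights), reverse=True)[keep - 1]
    return [w if abs(w) >= threshold else 0.0 for w in weights]

def distillation_loss(student_out, teacher_out):
    """Mean squared error between the student output and the teacher output."""
    return sum((s - t) ** 2 for s, t in zip(student_out, teacher_out)) / len(teacher_out)

# Hypothetical layer weights: the three largest magnitudes survive pruning.
pruned = prune_by_magnitude([0.8, -0.05, 0.3, 0.01, -0.6, 0.02], keep_ratio=0.5)

# Hypothetical outputs: the loss shrinks as the student imitates the teacher more closely.
loss = distillation_loss([0.2, 0.7], [0.0, 1.0])
```

Pruning reduces the number of effective parameters directly, while the distillation loss gives a training signal for a smaller student model; the two can be used separately or together.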
  • S806 Generate a voice file of a designated object based on the segmented target voice signal.
  • step S806 reference may be made to the relevant description of the specific implementation process shown in step S304 in the embodiment shown in FIG. 3 , and will not be repeated here.
  • the embodiment of the present application provides a reusable voice segmentation method for a specified object with voiceprint vector embedding.
  • the method uses a voiceprint vector extraction model composed of an improved PANNS and a Transformer network to extract voiceprints from a small section of reference voice data of the specified object.
  • the extracted voiceprint representation vector of the specified object can more fully and clearly express the voiceprint characteristics of the specified object.
  • the embodiment of the present application also integrates the attention mechanism into the Unet network, so that the Unet network (i.e., the speech segmentation model) that integrates the attention mechanism can realize the segmentation of aliased voice data based on the attention mechanism, and can more clearly and accurately calculate and extract the voice signal of the specified object in the aliased voice data, ensure the purity of the extracted voice signal, and achieve a more accurate and pure voice separation effect.
  • the embodiment of the present application only needs the reference voice data of the specified object to segment the voice signal of the specified object from the aliased voice data; to obtain the voice data of another object, only the voiceprint characteristics of the object to be segmented need to be changed. There is no need to train a dedicated network for each object, which makes the solution convenient, transferable, and highly versatile.
  • FIG. 12 shows a schematic diagram of the structure of a speech processing device provided by an exemplary embodiment of the present application; the speech processing device can be used to execute some or all of the steps in the method embodiment shown in FIG. 3 or FIG. 8.
  • the speech processing device includes the following units:
  • an acquisition unit 1201, configured to acquire aliased speech data, where the aliased speech data includes a speech signal generated by each of at least two objects;
  • the acquisition unit 1201 is further configured to acquire reference speech data of a designated object;
  • the designated object refers to any one of at least two objects;
  • the reference speech data includes a reference speech signal of the designated object;
  • the processing unit 1202 is used to extract a voiceprint representation vector of a specified object from the reference speech data, where the voiceprint representation vector is used to represent the voiceprint characteristics of the specified object;
  • the processing unit 1202 is further configured to input the aliased speech data and the voiceprint representation vector into a preset speech segmentation model, wherein the speech segmentation model is configured to: segment a target speech signal matching the voiceprint characteristics from the aliased speech data based on an attention mechanism;
  • the processing unit 1202 is further configured to generate a voice file of a designated object based on the segmented target voice signal.
  • the process of segmenting a target speech signal matching a voiceprint characteristic from aliased speech data based on an attention mechanism includes:
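  • the aliased speech data is converted from the time domain to the frequency domain to obtain speech spectrum features; the speech spectrum features are the representation of the aliased speech data in the frequency domain;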
  • the correlation between the voiceprint representation vector and the speech spectrum feature is calculated to obtain a speech spectrum feature segment that matches the voiceprint feature;
  • the speech spectrum feature segment is a segment in the speech spectrum feature that matches the voiceprint feature;
  • the speech spectrum feature segments are converted from the frequency domain to the time domain to obtain the target speech signal that matches the voiceprint characteristics.
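The time-to-frequency conversion, the selection of a matching spectral segment, and the conversion back to the time domain can be sketched on a toy signal. This is only an illustration under stated assumptions: a naive discrete Fourier transform stands in for whatever transform the system uses, and a fixed hypothetical `mask` stands in for the output of the attention-based speech segmentation model.

```python
import cmath
import math


def dft(frame):
    """Naive discrete Fourier transform: time domain -> frequency domain."""
    n = len(frame)
    return [
        sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Inverse transform: frequency domain -> time domain (real part only)."""
    n = len(spectrum)
    return [
        (sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n).real
        for t in range(n)
    ]


def separate(frame, mask):
    """Keep the spectral bins selected by `mask`, then return to the time domain."""
    spectrum = dft(frame)
    segment = [m * s for m, s in zip(mask, spectrum)]
    return idft(segment)


n = 16
target = [math.sin(2 * math.pi * 2 * t / n) for t in range(n)]
other = [0.5 * math.sin(2 * math.pi * 5 * t / n) for t in range(n)]
mixed = [a + b for a, b in zip(target, other)]
# Hypothetical mask: bins 2 and 14 hold the target sinusoid and its mirror image.
mask = [1.0 if k in (2, 14) else 0.0 for k in range(n)]
recovered = separate(mixed, mask)
```

In the described method the selection is learned from the voiceprint representation vector rather than fixed by hand; the sketch only shows where the two domain conversions sit around that selection.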
  • the correlation calculation is implemented by a speech segmentation model;
  • the speech segmentation model includes a feature extraction subnetwork and an upsampling subnetwork, and the feature extraction subnetwork and the upsampling subnetwork are connected by a convolutional connection layer;
  • the feature extraction subnetwork and the upsampling subnetwork are symmetrical; the feature extraction subnetwork contains m convolutional layers distributed in a hierarchy, and the upsampling subnetwork contains an upsampling layer corresponding to each of the m convolutional layers, where m is a positive integer; the convolutional layer, the convolutional connection layer, and the upsampling layer all include multiple convolutional networks connected in sequence;
  • the attention mechanism is integrated into all or part of the network layers in the speech segmentation model, and the fusion position of the attention mechanism in the multiple convolutional networks included in the network layer is not fixed;
  • the network layer includes a convolutional layer, an upsampling layer and a convolutional connection layer.
  • each network layer in the speech segmentation model is integrated with an attention mechanism; when calculating the correlation between the voiceprint representation vector and the speech spectrum feature based on the attention mechanism to obtain the speech spectrum feature segment matching the voiceprint characteristics, the processing unit 1202 is specifically configured to perform the following:
  • the correlation between the voiceprint representation vector and the first feature map of the corresponding network layer is calculated to obtain the second feature map output by the corresponding network layer; the voiceprint characteristics represented by the second feature map match the voiceprint characteristics represented by the voiceprint representation vector;
  • the second feature map output by the (2m+1)-th network layer in the speech segmentation model is used as the speech spectrum feature segment that matches the voiceprint characteristics represented by the voiceprint representation vector; the (2m+1)-th network layer in the speech segmentation model is the last upsampling layer in the upsampling subnetwork.
  • any network layer in the speech segmentation model that is integrated with the attention mechanism is represented as a target network layer;
  • the target network layer is a convolutional layer or a convolutional connection layer;
  • the fusion position of the attention mechanism in the target network layer is: the position between the first convolutional network and the second convolutional network adjacent to the first convolutional network in the multiple convolutional networks connected sequentially included in the target network layer;
  • the processing unit 1202 is used to calculate the correlation between the voiceprint representation vector and the first feature map of the corresponding network layer based on the attention mechanism integrated in each network layer to obtain the second feature map output by the corresponding network layer, specifically for:
  • the first convolutional network in the target network layer is used to perform feature extraction processing on the first feature map of the target network layer to obtain the third feature map of the target network layer; wherein, when the target network layer is the first convolutional layer of the hierarchical distribution in the feature extraction subnetwork, the first feature map of the target network layer is the speech spectrum feature; when the target network layer is a convolutional layer other than the first convolutional layer in the speech segmentation model, the first feature map of the target network layer is obtained by pooling the feature map output by the upper-level network layer adjacent to the target network layer;
  • the correlation between the voiceprint representation vector and the third feature map of the target network layer is calculated to obtain the fourth feature map of the target network layer;
  • the feature dimension of the third feature map is the same as the feature dimension of the fourth feature map;
  • the fourth feature map is subjected to feature extraction processing by using other convolutional networks except the first convolutional network in the target network layer to obtain a second feature map output by the target network layer.
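The order of operations inside such a target network layer (first convolutional network, then attention with the voiceprint vector, then the remaining convolutional networks) can be sketched as follows. The exact attention formula is not given in this section, so the scaled dot-product scoring with a softmax over frames is an assumption, and the convolutional networks are passed in as plain callables. All names are hypothetical.

```python
import math


def voiceprint_attention(feature_map, voiceprint):
    """Reweight each frame by its softmax-normalized correlation with the voiceprint.

    feature_map: list of frames, each a list of C floats; voiceprint: C floats.
    The output has the same shape as the input.
    """
    dim = len(voiceprint)
    scores = [
        sum(f * v for f, v in zip(frame, voiceprint)) / math.sqrt(dim)
        for frame in feature_map
    ]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    return [[w * f for f in frame] for w, frame in zip(weights, feature_map)]


def target_layer(first_map, voiceprint, first_conv, other_convs):
    """First conv -> attention -> remaining convs, as in the steps above."""
    third_map = first_conv(first_map)
    fourth_map = voiceprint_attention(third_map, voiceprint)
    out = fourth_map
    for conv in other_convs:
        out = conv(out)
    return out


def identity(feature_map):
    """Stand-in for a convolutional network; leaves the map unchanged."""
    return feature_map


second_map = target_layer(
    [[1.0, 0.0], [0.0, 1.0]], [2.0, 0.0], identity, [identity, identity]
)
```

With the toy input, the frame aligned with the voiceprint receives the larger weight, and the third and fourth feature maps have the same dimensions, as the text requires.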
  • any network layer in the speech segmentation model that is integrated with the attention mechanism is represented as a target network layer;
  • the target network layer is an upsampling layer;
  • the fusion position of the attention mechanism in the multiple convolutional networks corresponding to the target network layer is: the position after the last convolutional network in the multiple convolutional networks sequentially connected in the target network layer;
  • the processing unit 1202 is used to calculate the correlation between the voiceprint representation vector and the first feature map of the corresponding network layer based on the attention mechanism integrated in each network layer to obtain the second feature map output by the corresponding network layer, specifically for:
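  • the multiple sequentially connected convolutional networks in the target network layer perform feature extraction on a target feature map to obtain the first feature map of the target network layer; the target feature map is obtained by concatenating the feature map output by the convolutional layer corresponding to the target network layer in the feature extraction subnetwork with the feature map output by the upper-level network layer of the target network layer;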
  • the attention mechanism integrated in the target network layer is used to calculate the correlation between the voiceprint representation vector and the first feature map of the target network layer to obtain the second feature map output by the target network layer; the feature dimension of the second feature map is the same as that of the first feature map.
  • processing unit 1202 is further configured to:
  • the feature dimension of the voiceprint representation vector after dimension transformation is the same as the feature dimension of the feature map to be input into the attention mechanism fused in the corresponding network layer.
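One common way to match dimensions is a learned linear projection per network layer. The patent text here does not say how the dimension transformation is performed, so the linear map below is an assumption, and the weight values are made up for illustration.

```python
def project(vector, weight):
    """Linear map from len(vector) dimensions to len(weight) dimensions.

    weight is a list of rows; each row has one coefficient per input dimension.
    """
    assert all(len(row) == len(vector) for row in weight)
    return [sum(w * x for w, x in zip(row, vector)) for row in weight]


voiceprint = [1.0, 2.0, 3.0]          # hypothetical 3-dimensional voiceprint vector
layer_weight = [[1.0, 0.0, 0.0],      # hypothetical projection for a layer whose
                [0.0, 1.0, 1.0]]      # feature maps have 2 channels
projected = project(voiceprint, layer_weight)
```

Each network layer would hold its own projection, so a single voiceprint vector can take part in attention at every feature-map width in the model.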
  • the processing unit 1202 is further configured to:
  • the correlation calculation is realized by the speech segmentation model after model distillation.
  • the processing unit 1202 is configured to extract the voiceprint representation vector of the specified object from the reference voice data, specifically to:
  • the reference voice data is segmented to obtain a plurality of voice data segments corresponding to the reference voice data;
  • the reference speech spectrum feature is segmented to obtain a reference speech spectrum feature segment corresponding to each speech data segment;
  • any one of the multiple voice data segments is represented as a target voice data segment; the processing unit 1202 is used to perform short-term correlation analysis on each voice data segment based on each voice data segment and the corresponding reference voice spectrum feature segment, and obtain the voiceprint semantic feature vector corresponding to each voice data segment, specifically for:
  • the time domain feature map and the frequency domain feature map are fused to generate the voiceprint semantic feature vector corresponding to the target speech data segment.
  • the number of feature extraction processes is k, where k is an integer greater than 1; any feature extraction process is represented as the i-th feature extraction process; the processing unit 1202 is used to fuse the time domain feature map and the frequency domain feature map to generate a voiceprint semantic feature vector corresponding to the target speech data segment, specifically for:
  • the intermediate time domain feature map and the intermediate frequency domain feature map obtained by the first feature extraction process are fused to generate the first intermediate feature vector after the first feature extraction process;
  • the intermediate time domain feature map and the intermediate frequency domain feature map obtained by the i-th feature extraction process, and the i-1th intermediate feature vector obtained by the i-1th feature extraction process are fused to generate the i-th intermediate feature vector after the i-th feature extraction process;
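The recurrence over the k feature extraction passes can be sketched as below. The fusion operator is not specified in this section, so an element-wise mean is assumed, and the sketch assumes that the k-th intermediate vector is taken as the result. Function names are hypothetical.

```python
def fuse(*vectors):
    """Element-wise mean of equally sized vectors (assumed fusion operator)."""
    return [sum(values) / len(values) for values in zip(*vectors)]


def iterative_fusion(time_maps, freq_maps):
    """Fuse k pairs of intermediate time-domain and frequency-domain maps.

    Pass 1 fuses the first pair; pass i (i > 1) fuses the i-th pair together
    with the (i-1)-th intermediate vector.
    """
    assert len(time_maps) == len(freq_maps) and len(time_maps) > 1
    intermediate = fuse(time_maps[0], freq_maps[0])
    for t_map, f_map in zip(time_maps[1:], freq_maps[1:]):
        intermediate = fuse(t_map, f_map, intermediate)
    return intermediate


vector = iterative_fusion(
    time_maps=[[1.0, 2.0], [3.0, 4.0]],
    freq_maps=[[3.0, 4.0], [6.0, 2.0]],
)
```

Carrying the previous intermediate vector into each pass is what lets later passes see both domains and the earlier fused result.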
  • the units in the speech processing device shown in FIG. 12 can be combined, individually or all together, into one or several other units, or one or more of the units can be further divided into multiple functionally smaller units; either arrangement achieves the same operation without affecting the technical effect of the embodiment of the present application.
  • the above-mentioned units are divided based on logical functions.
  • the function of one unit can also be realized by multiple units, or the function of multiple units is realized by one unit.
  • the speech processing device may also include other units.
  • these functions can also be implemented by other units, and can be implemented by multiple units in collaboration.
  • FIG13 shows a schematic diagram of the structure of a computer device provided by an exemplary embodiment of the present application.
  • the computer device includes a processor 1301, a communication interface 1302, and a computer-readable storage medium 1303.
  • the processor 1301, the communication interface 1302, and the computer-readable storage medium 1303 may be connected via a bus or other means.
  • the communication interface 1302 is used to receive and send data.
  • the computer-readable storage medium 1303 may be stored in a memory of the computer device, the computer-readable storage medium 1303 is used to store computer programs, and the processor 1301 is used to execute the computer programs stored in the computer-readable storage medium 1303.
  • the processor 1301 (or CPU (Central Processing Unit)) is the computing core and control core of the computer device, which is suitable for implementing one or more computer programs, and is specifically suitable for loading and executing one or more computer programs to implement the corresponding method flow or corresponding function.
  • the embodiment of the present application also provides a computer-readable storage medium (Memory), which is a memory device in a computer device for storing programs and data. It is understandable that the computer-readable storage medium here can include both built-in storage media in the computer device and, of course, extended storage media supported by the computer device.
  • the computer-readable storage medium provides a storage space that stores the processing system of the computer device.
  • one or more computer programs suitable for being loaded and executed by the processor 1301 are also stored in the storage space.
  • the computer-readable storage medium here can be a high-speed RAM memory or a non-volatile memory, such as at least one disk storage; optionally, it can also be at least one computer-readable storage medium located away from the aforementioned processor.
  • An embodiment of the present application also provides a computer program product, which includes a computer program.
  • when the computer program is executed by a processor, the above-mentioned speech processing method is implemented.
  • the computer program product includes a computer program (one or more).
  • the computer program executes the above-mentioned process or function of the embodiment of the present application.
  • the computer device may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable device.
  • the computer program may be stored in a computer-readable storage medium or transmitted via a computer-readable storage medium.
  • the computer program may be transmitted from one website, computer device, server, or data center to another website, computer device, server, or data center by wired means (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless means (e.g., infrared, radio, microwave).
  • the computer-readable storage medium may be any available medium that a computer device can access, or a data storage device, such as a server or a data center, that integrates one or more available media.
  • the available medium may be a magnetic medium (e.g., a floppy disk, a hard disk, a tape), an optical medium (e.g., a DVD), or a semiconductor medium (e.g., a solid-state drive (SSD)), etc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Signal Processing (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Quality & Reliability (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Business, Economics & Management (AREA)
  • Game Theory and Decision Science (AREA)
  • Telephonic Communication Services (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

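A speech processing method, apparatus, computer device, storage medium, and program product. The method includes: acquiring aliased speech data (S301); acquiring reference speech data of a specified object (S302); extracting a voiceprint representation vector of the specified object from the reference speech data, the voiceprint representation vector being used to represent the voiceprint characteristics of the specified object, and inputting the aliased speech data and the voiceprint representation vector into a preset speech segmentation model, the speech segmentation model being used to segment, based on an attention mechanism, a target speech signal matching the voiceprint characteristics from the aliased speech data (S303); and generating a voice file of the specified object based on the segmented speech signal (S304). The method can segment a pure speech signal of any specified object from aliased speech data.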

Description

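Speech processing method, apparatus, device, medium, and program product
This application claims priority to Chinese Patent Application No. 202310699993.5, filed with the China Patent Office on June 13, 2023 and entitled "Speech processing method, apparatus, device, medium, and program product", which is incorporated herein by reference in its entirety.
Technical Field
This application relates to the field of computer technology, in particular to the field of artificial intelligence, and specifically to a speech processing method, a speech processing apparatus, a computer device, a computer-readable storage medium, and a computer program product.
Background
Aliased speech data (also called aliased speech) is speech data in which speech signals produced by multiple sound sources (that is, objects that produce sound) are mixed together. For example, aliased speech data recorded from a physical environment by a recording device in a conference scenario may include speech signals produced by multiple participants, and may also include speech signals produced by certain devices in that environment (such as a device playing a conference video).
Existing source separation methods for aliased speech data include: 1. Separating the aliased speech data by human ear; this manual listening approach makes segmentation slow and inefficient. 2. Separating the aliased speech data by timbre frequency; when multiple objects have similar timbre frequencies, accurate segmentation is not possible. 3. Separating the aliased speech data by the distance of the sound sources; segmentation is then constrained by the requirement that the sources be at different distances. 4. Separating the aliased speech data with a segmentation model dedicated to the specified object; this approach is not portable and cannot be applied generally.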
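Summary
Embodiments of this application provide a speech processing method, apparatus, device, medium, and program product that can segment a pure speech signal of any specified object from aliased speech data and are generally applicable.
In one aspect, an embodiment of this application provides a speech processing method performed by a computer device, the method including:
acquiring aliased speech data, the aliased speech data including a speech signal produced by each of at least two objects;
acquiring reference speech data of a specified object, the specified object being any one of the at least two objects, and the reference speech data including a reference speech signal of the specified object;
extracting a voiceprint representation vector of the specified object from the reference speech data, the voiceprint representation vector being used to represent the voiceprint characteristics of the specified object;
inputting the aliased speech data and the voiceprint representation vector into a preset speech segmentation model, the speech segmentation model being used to segment, based on an attention mechanism, a target speech signal matching the voiceprint characteristics from the aliased speech data; and
generating a voice file of the specified object based on the segmented target speech signal.
In another aspect, an embodiment of this application provides a speech processing apparatus, including:
an acquisition unit, configured to acquire aliased speech data, the aliased speech data including a speech signal produced by each of at least two objects;
the acquisition unit being further configured to acquire reference speech data of a specified object, the specified object being any one of the at least two objects, and the reference speech data including a reference speech signal of the specified object;
a processing unit, configured to extract a voiceprint representation vector of the specified object from the reference speech data, the voiceprint representation vector being used to represent the voiceprint characteristics of the specified object;
the processing unit being further configured to input the aliased speech data and the voiceprint representation vector into a preset speech segmentation model, the speech segmentation model being used to segment, based on an attention mechanism, a target speech signal matching the voiceprint characteristics from the aliased speech data; and
the processing unit being further configured to generate a voice file of the specified object based on the segmented target speech signal.
In another aspect, an embodiment of this application provides a computer device, including:
a processor, configured to load and execute a computer program; and
a computer-readable storage medium storing a computer program that, when executed by the processor, implements the above speech processing method.
In another aspect, an embodiment of this application provides a computer-readable storage medium storing a computer program that is adapted to be loaded by a processor to execute the above speech processing method.
In another aspect, an embodiment of this application provides a computer program product including a computer program that, when executed by a processor, implements the above speech processing method.
In the embodiments of this application, aliased speech data to be segmented is acquired, the aliased speech data including a speech signal produced by each of at least two objects. If the speech signal produced by a specified object among the at least two objects needs to be segmented, a piece of reference speech data of that specified object (for example, a few seconds of speech produced by it) can be acquired; the specified object may be any one of the at least two objects. A voiceprint representation vector of the specified object is extracted from the reference speech data. The vector represents the voiceprint characteristics of the specified object, which are unique and can identify the specified object. The voiceprint representation vector, which uniquely identifies the specified object, and the aliased speech data to be segmented can then be input into a preset speech segmentation model, so that the model segments, based on an attention mechanism, the target speech signal matching the voiceprint characteristics of the specified object from the aliased speech data, and a separate voice file of the specified object is generated from the segmented target speech signal. On the one hand, the embodiments support extracting, from pure reference speech data of the specified object, a voiceprint representation vector that represents its voiceprint characteristics and, with that vector as a reference, using the attention mechanism of the speech segmentation model to compute and extract the target speech signal of the specified object clearly and accurately from the aliased speech data, which improves the purity of the extracted signal and yields more accurate speech separation. On the other hand, only the reference speech data of the specified object is needed to segment its target speech signal from the aliased speech data; to obtain the speech data of another object, only the voiceprint representation vector of the object to be segmented needs to be replaced, and no dedicated network has to be trained for each object, which greatly improves convenience and transferability and makes the solution more general.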
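Brief Description of the Drawings
FIG. 1 is a schematic architecture diagram of a speech processing system provided by an exemplary embodiment of this application;
FIG. 2 is a schematic architecture diagram of a speech processing scenario provided by an exemplary embodiment of this application;
FIG. 3 is a schematic flowchart of a speech processing method provided by an exemplary embodiment of this application;
FIG. 4 is a schematic diagram of an interface through which a user inputs reference speech data of a specified object, provided by an exemplary embodiment of this application;
FIG. 5 is a schematic structural diagram of an existing Unet network;
FIG. 6 is a schematic structural diagram of a speech segmentation model built by adding an attention mechanism to every network layer of the Unet network, provided by an exemplary embodiment of this application;
FIG. 7a is a schematic diagram of speech segmentation when the target network layer is a convolutional layer or a convolutional connection layer, provided by an exemplary embodiment of this application;
FIG. 7b is a schematic diagram of speech segmentation when the target network layer is an upsampling layer, provided by an exemplary embodiment of this application;
FIG. 8 is a schematic flowchart of another speech processing method provided by an exemplary embodiment of this application;
FIG. 9 is a schematic flowchart of voiceprint vector extraction provided by an exemplary embodiment of this application;
FIG. 10 is a schematic structural diagram of an improved PANNS provided by an exemplary embodiment of this application;
FIG. 11 is a schematic structural diagram of a Transformer network provided by an exemplary embodiment of this application;
FIG. 12 is a schematic structural diagram of a speech processing apparatus provided by an exemplary embodiment of this application;
FIG. 13 is a schematic structural diagram of a computer device provided by an exemplary embodiment of this application.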
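Detailed Description
The embodiments of this application provide a speech processing solution, specifically a speech separation solution that performs source separation on aliased speech data. Aliased speech data, also referred to as aliased speech or a mixed audio signal, is a piece of audio in which multiple speech signals (also called audio signals) are mixed; that is, "aliased" means that multiple speech signals are mixed or blended together. In practical application scenarios, aliased speech data can be understood as speech data that is captured directly from the environment by a sound pickup device (such as a microphone) and contains speech signals produced by multiple sound sources. The multiple speech signals may be produced by different objects (also called sound sources), where an object may include, but is not limited to, a human, an animal, or a physical device (such as a car); the embodiments do not limit the sources of the speech signals contained in the aliased speech data. For example, in a conference scenario in which several people take part in a discussion, the captured speech data usually contains speech signals produced by different participants; if the conference scenario also includes a device playing audio or video, the captured speech data also contains the speech signal emitted by that device. The speech data captured in the conference scenario can therefore be called aliased speech data, and it contains speech signals produced by multiple objects in the conference scenario.
Further, source separation refers to the process of separating the speech signal of a specified object from aliased speech data. In other words, source separation can be understood simply as a technique that separates aliased speech data by signal processing or other algorithms so as to segment the target speech signal of a specified object from the aliased speech data and finally generate a separate audio file (or voice file) for that object. For example, after a piece of aliased speech data is captured in a noisy outdoor scene, source separation can be used to extract the target speech signal produced by a specified object from it and generate a voice file of that object; when the voice file is played, only the speech produced by the specified object is present, which achieves the goal of identifying the speech produced by that object.
Building on this brief introduction to aliased speech data and source separation, the embodiments provide a new speech processing solution, which mainly includes the following. Aliased speech data to be segmented is acquired, containing a speech signal produced by each of at least two objects; for example, the speech signals in the aliased speech data include a speech signal produced by object 1 and a speech signal produced by object 2. If a user wants to extract from the aliased speech data the target speech signal produced by a specified object (for example, any one of the at least two objects), a piece of reference speech data containing a reference speech signal of the specified object can be acquired. A voiceprint representation vector of the specified object can then be extracted from the reference speech data. The vector represents the voiceprint characteristics of the specified object, which can be understood as the distinctive features of its voice, such as its unique pitch or timbre. The voiceprint representation vector of the specified object and the aliased speech data are then input into the speech segmentation model, whose attention mechanism is used to segment and extract from the aliased speech data the target speech signal matching the voiceprint characteristics of the specified object, and a separate voice file is generated for the specified object from that target speech signal.
It can be seen that, on the one hand, the embodiments rely on the uniqueness of each user's voiceprint characteristics: providing a piece of reference speech data of any specified object, from which a voiceprint representation vector representing its voiceprint characteristics is extracted, is sufficient to separate and extract the target speech signal of that object from the aliased speech data based on the vector. This not only separates the target speech signal of the specified object accurately from the aliased speech data, but also allows source separation for the object behind any speech signal contained in the aliased speech data, which makes the solution highly reusable and portable, reduces the complexity of user input, and makes the whole system more general. On the other hand, the embodiments use an attention mechanism to perform the calculation on the voiceprint characteristics of the specified object and the aliased speech data, which greatly improves the clarity and accuracy with which the target speech signal of the specified object is extracted, prevents the extracted signal from containing excessive noise, and achieves a more accurate and purer speech separation result.
The embodiments implement the speech processing solution mainly through a reusable, voiceprint-vector-embedding-based speech segmentation system for a specified speaker; that is, the system deploys the speech processing solution provided by the embodiments. Any user who needs to separate speech signals from aliased speech data can invoke the system to automatically separate and extract the voice file of the specified object. An exemplary architecture of the system is shown in FIG. 1. As shown in FIG. 1, the system mainly includes two modules, a voiceprint vector extraction model and a speech segmentation model, which are briefly introduced below: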
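(1) The voiceprint vector extraction model, which may also be called a voiceprint vector extractor or a voiceprint recognition network. It is mainly used to identify the specified object to be segmented and to extract an identity semantic vector of that object; in the embodiments this identity semantic vector is called the voiceprint representation vector (or voiceprint vector for short) and represents the voiceprint characteristics of the specified object.
As shown in FIG. 1, the voiceprint vector extraction model is built on an improved Pretrained Audio Neural Networks (PANNS) network and a Transformer network. The model is fully trained on an open-source large-scale speaker dataset (a dataset containing rich speech data), and the trained model is able to express the voiceprint characteristics of an object adequately, so it can serve as the voiceprint vector extractor of the whole system. At inference time, after the model parameters trained in advance on the large-scale speaker dataset are loaded, the trained model can compute the voiceprint representation vector of the specified object from its reference speech data (for example, a short clip of a few seconds to a dozen or so seconds); this vector represents the voiceprint characteristics of the specified object. Training the voiceprint vector extraction model on a large-scale speaker dataset therefore removes the need to collect training data specifically for each object in the aliased speech data, and removes the dependence on extracting data for an object in the aliased speech data in order to build a model dedicated to that object.
Specifically: ① The improved PANNS in the voiceprint vector extraction model is obtained by improving the conventional PANNS. The main improvement is a link for exchanging information between the time-domain path and the frequency-domain path, so that information is exchanged between the time domain and the frequency domain multiple times while the voiceprint representation vector is extracted. The time domain and the frequency domain thus remain complementary in information, higher network layers can fully perceive the information of lower layers, and the accuracy of voiceprint vector extraction improves. PANNS is an audio neural network trained on a large audio dataset (containing speech data of a large number of speakers); it is typically used for audio pattern recognition or frame-level audio embedding, as the encoding network at the front end of a model. ② The Transformer network is a transduction model that relies on an attention mechanism to compute its input and output. It abandons the convolutional model structure and uses only attention and feed-forward neural networks, achieving good performance without a sequence-aligned recurrent architecture.
(2) The speech segmentation model, which may also be called a semantic segmentation network or a segmentation network. It is mainly used to receive the voiceprint characteristics of the specified object supplied by the voiceprint vector extraction model (specifically, the voiceprint representation vector) and, based on that vector, to use an attention mechanism to extract from the aliased speech data the speech signal matching the voiceprint characteristics of the specified object.
As shown in FIG. 1, the speech segmentation model is a segmentation model fused with an attention mechanism. By introducing an attention mechanism into the segmentation network, the target speech signal related to the voiceprint characteristics of the specified object can be computed with the attention mechanism while the segmentation network processes the features of the aliased speech data, so that the target speech signal of the specified object can be segmented from the aliased speech data. The target speech signal of the specified object can thus be computed and extracted more clearly and accurately, and the pure target speech signal can be separated, achieving a more accurate and purer separation result. The attention mechanism is a problem-solving approach modelled on human attention; put simply, it imitates the way human attention quickly filters the information of interest out of a large amount of information. It is mainly used to address the difficulty of obtaining a reasonable vector representation when the input sequence of a sequential model is long: the intermediate results of the sequential model are retained, learned by a new model, and associated with the output, thereby filtering the information.
In summary, the system provided by the embodiments contains two modules. After the voiceprint vector extraction model extracts from the reference speech data of the specified object a voiceprint representation vector that represents its voiceprint characteristics, the vector can be embedded into the speech segmentation model, which can then extract and separate, based on the attention mechanism, the speech signal matching those voiceprint characteristics from the aliased speech data, achieving good signal separation. On the one hand, as FIG. 1 shows, the system is a fully automatic segmentation system built from several deep learning neural networks (such as the improved PANNS network, the Transformer network, and the segmentation network fused with an attention mechanism). The user only needs to input the reference speech data of the specified object and the aliased speech data to be segmented, and the system automatically and quickly extracts the speech signal of the specified object from the aliased speech data, which greatly improves segmentation efficiency, removes manual involvement, and enables rapid standardization. On the other hand, embedding the voiceprint characteristics extracted by the voiceprint vector extraction model into the architecture of the speech segmentation model makes the speech separation model in the system reusable. Reusable means that each time the system performs source separation, only the voiceprint characteristics of the object to be extracted need to be replaced; no separate segmentation network has to be trained for each object, so the network is easy to transfer and the whole system is highly general.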
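The system shown in FIG. 1 can be deployed in a computer device, specifically in an application running on the computer device (for example, as a plug-in of the application); that is, the solution is provided by an application running on the computer device. ① An application is a computer program that performs one or more specific tasks. Classifying applications along different dimensions (such as how they run or what they do) gives the type of the same application in each dimension. For example, by the way they run, applications may include, but are not limited to, a client installed on a terminal, a mini program that can be used without being downloaded and installed (a subprogram of a client), and a web (World Wide Web) application opened through a browser. By function, applications may include, but are not limited to, IM (instant messaging) applications, content interaction applications, audio applications, and video applications. An instant messaging application is an Internet-based application for instant message exchange and social interaction, and may include, but is not limited to, a social application with communication functions, a map application with social interaction functions, a game application, and so on. A content interaction application is an application that enables content interaction, such as an online banking, sharing platform, personal space, or news application. An audio application is an application that provides audio functions over the Internet, and may include, but is not limited to, a music application capable of playing and editing music, a radio application capable of playing radio stations, or a live-streaming application. A video application is an application that can play video, and may include, but is not limited to, an application offering short videos (typically a few seconds to a few minutes long) and an application offering long videos (such as films or television series with a long running time).
② The computer device may include a terminal and/or a server. The terminal may include, but is not limited to, a smartphone (such as a smartphone running Android or iOS), a tablet computer, a portable personal computer, a mobile Internet device (MID), an in-vehicle device, a head-mounted device, a smart television, or a smart home device; the embodiments do not limit the type of terminal. The terminal is deployed with the system shown in FIG. 1 or with an application (or plug-in) that provides the system. The server may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (CDN), and big data and artificial intelligence platforms.
The embodiments can therefore be executed by a terminal, by a server, or jointly by a terminal and a server. An exemplary system architecture in which a terminal and a server jointly execute the speech processing solution is shown in FIG. 2. As shown in FIG. 2, terminal 201 is a device held by a user who needs speech separation. When the user needs to separate the voice file of a specified object from aliased speech data, the user can send the aliased speech data to be segmented and the reference speech data of the specified object to server 202 through terminal 201. After receiving them, server 202 first performs identity recognition on the reference speech data with the voiceprint vector extraction model in the system to obtain a voiceprint representation vector that represents the identity of the specified object, and then embeds the vector into the speech segmentation model in the system. After receiving the voiceprint representation vector of the specified object and the aliased speech data to be segmented, the speech segmentation model extracts from the aliased speech data, based on the attention mechanism, a pure target speech signal matching the voiceprint characteristics represented by the vector, and a voice file of the specified object is generated from that target speech signal. Server 202 then returns the voice file to terminal 201, so that the user can play, on terminal 201, a voice file containing only the speech data of the specified object.
It should be understood that the above briefly describes the flow of the speech processing solution with the computer device being a terminal and a server. When the computer device is a terminal alone or a server alone, it executes the solution along similar lines, differing only in the executing entity, and details are not repeated here. In addition, terminal 201 and server 202 shown in FIG. 2 may be connected directly or indirectly through wired or wireless communication, which is not limited in this application.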
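Further, the speech processing solution provided by the embodiments can be applied to any application scenario that requires speech separation; the computer device providing the solution differs with the scenario, which is not limited here. The application scenarios may include, but are not limited to, at least one of the following: film and television scenarios, audio and video creation scenarios, and conversation scenarios.
Optionally, the application scenario is a film and television scenario, for example, dubbing a character in a film or television series. Specifically, during production, a voice actor often needs to dub a character (for example, after the recorded aliased speech data is submitted for review, some lines do not meet requirements and must be re-recorded). However, the speech data recorded during shooting or post-production is usually aliased speech data containing multiple speech signals. Before dubbing, therefore, the speech signals of the actors other than the specified actor to be re-dubbed must be extracted in pure form from the aliased speech data, so that the re-dubbed speech signal of the specified actor can be mixed with the extracted pure speech signals of the other actors to generate new aliased speech data, which is added to the film or television series. In this scenario, the embodiments provide accurate speech segmentation without training a dedicated segmentation network on large amounts of data from every actor, so pure actor speech can be extracted and segmented quickly and efficiently.
Optionally, the application scenario is an audio and video creation scenario, for example, secondary creation based on existing audio and video (that is, creating new works from existing audio and video). Specifically, in secondary creation, users like to extract some of a specified actor's lines from multiple audio or video works and edit them into a dialogue, that is, to cut the specified actor's speech data from different works into the same work. This involves extracting the pure speech signal of the specified actor from multiple works. Because the lines in each work are accompanied by background music or the speech signals of other objects, the background music must be removed to obtain the pure speech signal of the specified actor, so that the extracted pure speech signals can be merged to generate an edited voice file for that actor.
Optionally, the application scenario is a conversation scenario, for example, an online meeting. Specifically, online meetings often require speech-to-text transcription, that is, converting the recorded speech data into text. In an online meeting with multiple participants, however, transcribing aliased speech data that contains the speech signals of several people has long been difficult; transcription here means converting the speech signal of one specified person among them into text. With the embodiments, the speech signal of each object participating in the online meeting can first be segmented and extracted from the aliased speech data according to that object's voiceprint characteristics, and each object's speech signal can then be input into a speech recognition system for transcription, which greatly improves the accuracy of transcribing aliased conversational speech.
It is worth noting that the application scenarios to which the speech processing solution applies are not limited to those above, and the application or platform hosting the solution differs with the scenario.
It should also be noted that the collection and processing of relevant data in the embodiments must strictly comply with applicable laws and regulations. Personal information may be obtained only with the knowledge or consent of the individual concerned (or on another lawful basis), and subsequent use and processing of the data must stay within the scope of the laws and regulations and of the authorization given by the individual. For example, when the embodiments are applied to a specific product or technology, such as when reference speech data of a specified object is acquired, the permission or consent of the specified object must be obtained, and the collection, use, and processing of the relevant data (such as the collection and publication of bullet comments posted by an object) must comply with the relevant laws, regulations, and standards of the relevant region.
Based on the speech processing solution described above, the embodiments propose a more detailed speech processing method, which is described in detail below with reference to the accompanying drawings.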
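FIG. 3 shows a schematic flowchart of a speech processing method provided by an exemplary embodiment of this application. The method can be executed by the computer device in the system mentioned above, for example, a terminal and/or a server, and may include, but is not limited to, steps S301 to S304:
S301: Acquire aliased speech data to be segmented.
S302: Acquire reference speech data of a specified object among the at least two objects.
In steps S301 and S302, the aliased speech data to be segmented contains a speech signal produced by each of at least two objects. For example, a heavy metal track includes a "lyrics" speech signal produced by the singer, a "melody" speech signal produced by the guitar, and a "melody" speech signal produced by the drum kit. The track is therefore aliased speech data; the objects it contains are the singer, the guitar, and the drum kit, and the speech signals it contains are those produced by the singer, the guitar, and the drum kit respectively.
Further, if the user needs to separate and extract the speech signal of a particular object from the aliased speech data, reference speech data of that object can be acquired. That object is then called the specified object. The reference speech data of the specified object differs from the aliased speech data, but it contains a pure reference speech signal of the specified object, which can serve as a reference for separating the target speech signal of the specified object from the at least two speech signals, corresponding to the at least two objects, contained in the aliased speech data. The specified object and the reference speech data are briefly introduced below. ① The specified object may be any object, among the at least two objects contained in the aliased speech data, whose speech signal the user wants to extract. As described above, an object may be a human, an animal, or a physical device; for ease of explanation, the following takes a human specified object as an example. In the heavy metal example above, if the user wants to extract the lyrics produced by the singer from the noisy track, the singer is the specified object; the speech signals produced by the instruments must then be separated from the speech signal produced by the singer, and the pure speech signal of the singer extracted.
② Reference speech data is a piece of speech data that contains a reference speech signal of the specified object. To ensure that reasonably pure voiceprint characteristics of the specified object can be extracted from it, the reference speech data should be a reasonably pure piece of speech data of the specified object. For example, the reference speech data contains only the reference speech signal of the specified object; or it contains both the reference speech signal of the specified object and other speech signals, provided that the reference speech signal of the specified object is easy to extract from the mixture (for example, the other speech signals have a relatively low frequency and the reference speech signal of the specified object a relatively high one). This makes it easier to analyse the reference speech data and extract accurate voiceprint characteristics of the specified object. The embodiments do not limit the type, duration, or source of the reference speech data. By way of example, the type may include, but is not limited to, a piece of audio of the specified object reading an article aloud, speaking, or singing a cappella. The duration may be a few seconds or a dozen or so seconds. As for the source: when the specified object and the user who needs speech separation are different users, the reference speech data may be sent to the user by the specified object, or downloaded or recorded by the user through some channel (such as historical voice messages); when they are the same user, the reference speech data may be recorded by the specified object in real time, that is, captured in real time by the microphone of the terminal held by the user.
An exemplary interface through which a user inputs the reference speech data of a specified object is shown in FIG. 4. As shown in FIG. 4, the screen of the terminal held by the user displays a speech acquisition interface 401, which contains an acquisition area 402 for reference speech data. In detail, the acquisition area 402 can display at least two speech acquisition entries, such as a capture entry 4021 and an upload entry 4022. When the capture entry 4021 is triggered, the user wants to input the reference speech data of the specified object by real-time capture (the specified object is the user, or the specified object and the user are in the same physical environment); the microphone of the terminal is turned on so that the reference speech signal in the user's physical environment can be captured in real time to generate the reference speech data. When the upload entry 4022 is triggered, the user wants to input the reference speech data by uploading a file, and can upload the reference speech data of the specified object from a storage space (such as the terminal's local storage, cloud storage, or server storage).
It should be understood that the interface elements (such as the content of the interface) and the interface style of the speech acquisition interface are not limited to those shown in FIG. 4. For example, the interface may also display an upload entry for aliased speech data, through which the user can replace the aliased speech data to be segmented. As another example, a text conversion control (also called a component, button, or option) may be added to the interface, so that the user can trigger it before or after speech separation to convert the separated speech signal into text in one step, which shortens the text conversion path to some extent and improves conversion efficiency.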
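S303: Extract a voiceprint representation vector of the specified object from the reference speech data, and input the aliased speech data and the voiceprint representation vector into a preset speech segmentation model, the speech segmentation model being used to segment, based on an attention mechanism, the speech signal matching the voiceprint characteristics of the specified object from the aliased speech data.
S304: Generate a voice file of the specified object based on the segmented speech signal.
In steps S303 and S304, a voiceprint is the sound wave spectrum that carries speech information; it is a biometric feature made up of several feature dimensions such as wavelength, frequency, and intensity. A voiceprint is stable, measurable, and unique, so it can uniquely identify the characteristics of an object's voice, that is, it can represent the identity of the object. After acquiring reasonably pure reference speech data of the specified object, the embodiments therefore support extracting the voiceprint characteristics of the specified object from that data, so that speech signals can subsequently be separated and extracted based on those unique characteristics.
As can be seen from the system shown in FIG. 1, analysing the reference speech data with the voiceprint vector extraction model extracts from it the voiceprint characteristics that represent the identity of the specified object. In practice, the model outputs the voiceprint representation vector (or voiceprint vector for short) of the specified object; that is, analysing the reference speech data yields a voiceprint representation vector that represents the voiceprint characteristics of the specified object. Further, after the voiceprint representation vector is extracted, the voiceprint information is passed on by vector embedding: the vector is input into the speech segmentation model and takes part in the attention computation, so that the target speech signal matching the voiceprint characteristics of the specified object can be segmented from the aliased speech data. With this vector embedding mechanism, the speech segmentation model does not need additional training on any object's historical speech data; extracting a voiceprint representation vector from a small amount of reference speech data is sufficient. This removes the dependence on large-scale speech data, makes the system highly reusable and portable and more efficient, reduces the complexity of user input, and makes the system more general.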
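The speech segmentation model provided by the embodiments is obtained by improving a conventional speech segmentation network with an attention mechanism, specifically by fusing the attention mechanism into the conventional network. The conventional speech segmentation network involved here is the Unet network (also written U-net or U-Net). Unet is one of the algorithms that use a fully convolutional network for semantic segmentation; it uses a symmetric U-shaped structure with a contracting path and an expanding path.
An exemplary network structure of the Unet network is shown in FIG. 5. As shown in FIG. 5, the Unet network is a symmetric U-shaped structure that includes a feature extraction subnetwork and an upsampling subnetwork, which mirror each other and are connected by a convolutional connection layer. ① The feature extraction subnetwork can be understood simply as the downsampling path or encoder. It contains m hierarchically distributed convolutional layers (m = 4 in FIG. 5), where m is a positive integer. Hierarchically distributed means that the m convolutional layers are connected in sequence: of any two adjacent convolutional layers, the earlier one is the upper-level layer and the later one is the lower-level layer, and the feature map output by the upper-level layer is the input of the adjacent lower-level layer. As shown in FIG. 5, a pooling function can be deployed after each convolutional layer. Features are first extracted from the aliased speech data by the convolutional networks in the convolutional layer, and the pooling function (pool) then extracts higher-order features, which effectively preserves the features of the aliased speech data that should stand out. The embodiments do not limit the type of pooling function; for example, with max pooling (max pool), the maximum feature within each pooling window (for example, a 2*2 window) of the feature map output by the convolutional layer is retained. ② Correspondingly, the upsampling subnetwork is symmetric to the feature extraction subnetwork and can be understood simply as the decoder; it contains an upsampling layer (up sampling layer) corresponding to each convolutional layer in the feature extraction subnetwork. As shown in FIG. 5, a transposed convolution (up-conv) with a 2*2 kernel is deployed after the convolutional connection layer and after each upsampling layer to perform upsampling. With this symmetric structure, the Unet network can either be implemented from scratch, with its weights initialized and the model then trained, or borrow the convolutional layer structure of an existing network (such as ResNet, a residual neural network, or VGG, a convolutional network) together with the corresponding trained weight file, and then add the upsampling layers for training. Using existing weight files in this way can greatly speed up training of the deep learning model.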
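The 2*2 max pooling applied after each convolutional layer can be illustrated on a small feature map. This is a minimal pure-Python sketch with non-overlapping windows (stride 2), which is the usual Unet configuration but an assumption here; the function name is hypothetical.

```python
def max_pool_2x2(feature_map):
    """Keep the maximum of each non-overlapping 2x2 window.

    feature_map: list of rows of equal length; trailing odd rows or
    columns, if any, are dropped.
    """
    rows, cols = len(feature_map), len(feature_map[0])
    return [
        [
            max(
                feature_map[r][c], feature_map[r][c + 1],
                feature_map[r + 1][c], feature_map[r + 1][c + 1],
            )
            for c in range(0, cols - 1, 2)
        ]
        for r in range(0, rows - 1, 2)
    ]


pooled = max_pool_2x2([
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
])
```

Each pooling step halves both dimensions of the feature map, which is why the upsampling subnetwork needs a matching number of transposed convolutions to restore the original size.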
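Further, each convolutional layer, the convolutional connection layer, and each upsampling layer include multiple convolutional networks connected in sequence. As shown in FIG. 5, each convolutional layer of the feature extraction subnetwork, the convolutional connection layer, and each upsampling layer may contain three convolutional networks with 3*3 kernels. A convolutional network is also called a convolutional neural network (CNN), a feed-forward neural network composed mainly of one or more convolutional layers and fully connected layers at the top, together with associated weights and pooling layers. As shown in FIG. 5, an activation function can be deployed after each convolutional network to introduce non-linearity into the model, so that the trained model can solve problems that a linear model cannot. The embodiments do not limit the type of activation function; for example, it may be ReLU (rectified linear unit), Sigmoid, or Tanh.
Furthermore, the Unet network can use skip connections (skip-connection, also called copy and crop) to combine high-level and low-level feature maps effectively into the final feature map. Specifically, the feature map obtained by each convolutional layer in the feature extraction subnetwork is concatenated into the corresponding upsampling layer of the upsampling subnetwork, so that the feature map of every layer is used in subsequent computation. Compared with a network structure without skip connections, connecting feature maps of different dimensions in this way avoids performing supervision and loss computation directly on the high-level feature map alone and effectively incorporates the features of the low-level feature maps. The final feature map therefore contains both high-dimensional and many low-dimensional features, fusing features at different scales and improving the accuracy of the model's results.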
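The concatenation performed by a skip connection can be sketched as joining two feature maps along the channel axis. The layout (a list of positions, each holding a list of channel values) and the function name are assumptions made for this illustration.

```python
def concat_channels(encoder_map, decoder_map):
    """Concatenate two feature maps position by position along the channel axis.

    Both maps must cover the same positions; the channel counts may differ.
    """
    assert len(encoder_map) == len(decoder_map)
    return [enc + dec for enc, dec in zip(encoder_map, decoder_map)]


encoder_features = [[1.0, 2.0], [3.0, 4.0]]   # low-level map, 2 channels
decoder_features = [[5.0], [6.0]]             # high-level map, 1 channel
joined = concat_channels(encoder_features, decoder_features)
```

The joined map carries the channels of both inputs, so the convolutional networks of the upsampling layer see low-level and high-level features side by side.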
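FIG. 5 above describes the network structure of the conventional Unet network in detail; the speech segmentation model provided by the embodiments is obtained by improving that structure. The main improvement is to fuse an attention mechanism into all or some of the network layers of the Unet network (such as the convolutional layers, the convolutional connection layer, and the upsampling layers). In other words, in the speech segmentation model obtained by improving the Unet network with the attention mechanism, all or some of the network layers are fused with the attention mechanism. For ease of explanation, the following takes as the speech segmentation model a Unet network in which the attention mechanism is added to every network layer. Adding the attention mechanism to every network layer embeds the voiceprint representation vector of the specified object into each layer of the Unet network, so that each layer can deeply perceive the voiceprint information or characteristics represented by the vector; the final output speech signal is thus closer to the specified object, and the extracted signal is purer.
By way of example, the structure of the speech segmentation model built by adding the attention mechanism to every network layer of the Unet network is shown in FIG. 6. As shown in FIG. 6, the improved speech segmentation model has the same basic architecture as the original Unet network, but an attention mechanism is added at every level of the architecture, and the inputs of each attention mechanism are the voiceprint representation vector of the specified object and the feature map output by the level preceding that attention mechanism. The voiceprint representation vector of the specified object is embedded into every network layer, mainly by performing attention computation with the feature map of each layer, so that the whole model deeply perceives and learns the extracted vector, the computation at every level is drawn toward the vector, and the final extracted speech signal matches the voiceprint characteristics the vector represents. Note that the fusion position of the attention mechanism among the multiple convolutional networks of a network layer is not fixed; the positions shown in FIG. 6 are examples.
Based on the above description of the network structure of the speech segmentation model in FIG. 6, the following takes the case in which every network layer of the model is fused with the attention mechanism as an example, and describes how the model segments, based on the attention mechanism, the speech signal matching the voiceprint characteristics of the specified object from the aliased speech data and how a voice file of the specified object is generated from that signal. The process may include, but is not limited to, steps (1) to (4):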
(1)将混叠语音数据从时域转换至频域,得到混叠语音数据对应的语音频谱特征,即该语音频谱特征是混叠语音数据在频域上的特征表现。
考虑到Unet网络的输入信息属于频域,因此需要将混叠语音数据进行转换,得到该混叠语音数据对应的语音频谱特征;这样该语音频谱特征可以作为图片输入至语音分割网络中。其中,时域和频域是音频应用中常用的两个概念,也是衡量音频特征的两个维度概念;时域是通过将语音信号的采样点在时间上进行展示处理,即与时间进行相关绑定;频域是通过将语音信号在各个频带上进行能量分布的一种特征表现;通过转换公式(如傅里叶变换(Fourier Transform)、拉普拉斯变换(Laplace Transform)或者索尔兹变换(ZTransform)等)可以实现语音信号从时域转换至频域,或者从频域转换为时域。
(2)基于注意力机制对声纹表征向量和语音频谱特征进行相关度计算,得到与声纹特性相匹配的语音频谱特征分段。
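As shown in Figure 6, after the voiceprint representation vector of the designated object has been extracted by the voiceprint vector extraction model, it is input into every network layer of the speech segmentation network, specifically into the attention mechanism fused into each layer. Following the attention mechanism fused into each network layer, a correlation calculation is performed on the voiceprint representation vector and the first feature map of that layer to obtain the second feature map output by that layer. Note that the first feature map of a network layer depends on the position of the layer in the speech segmentation network, as described in later embodiments. The second feature map output by the (2m+1)-th network layer of the speech segmentation model (that is, the last up-sampling layer of the up-sampling sub-network) is then taken as the speech spectrum feature segment that matches the voiceprint representation vector. This segment is the part of the speech spectrum feature that matches the voiceprint characteristics, that is, the part of the speech spectrum feature that belongs to the designated object. Embedding the voiceprint representation vector, which represents the voiceprint characteristics of the designated object, into every network layer of the speech segmentation model so that it takes part in feature extraction lets every layer fully perceive the voiceprint information of the designated object. The speech signal finally segmented is therefore closer to the voiceprint characteristics of the designated object, and the extracted speech signal is purer and more accurate.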
不难理解的是,根据注意力机制在语音分割模型中的融合位置不同,特征提取子网络和上采样子网络中的网络层按照融合的注意力机制,对声纹表征向量和相应网络层的上一层级所输出的第一特征图进行相关度计算的具体实施过程有所不同。下面以语音分割模型中融合有注意力机制的任一网络层表示为目标网络层为例,目标网络层按照注意力机制进行相关度计算进行示例性说明,其中:
在一种实现方式中,如图7a所示,假设目标网络层为语音分割模型中的卷积层或卷积连接层,且注意力机制在目标网络层对应的多个卷积网络中的融合位置为:目标网络层包括的顺序连接的多个卷积网络中,首个卷积网络和与首个卷积网络相邻的第二个卷积网络之间的位置;也就是说,注意力机制在目标网络层对应的多个卷积网络中的融合位置为:目标网络层包括的顺序连接的多个卷积网络中的首个卷积网络,和顺序连接的多个卷积网络中与首个卷积网络相邻且位于首个卷积网络之后的卷积网络之间的位置。此实现方式下,该目标网络层对声纹表征向量和该目标网络层的第一特征图进行相关度计算,以得到该目标网络层输出的第二特征图的具体实施过程可以包括:首先,采用目标网络层中的首个卷积网络,对目标网络层的第一特征图进行特征提取处理,得到该目标网络层的第三特征图;此处特征提取处理是指从第一特征图中提取出有用的信息(即特征),以供后续的分类、聚类和回归等任务使用的过程;特征提取处理的过程可以包括:对第一特征图进行预处理(如去噪、归一化或标准化等处理),并对预处理后的第一特征图进行特征提取,以提取出有用的特征,并从提取出的特征中筛选具有代表性或区分度的特征,筛选后的特征作为特征提取处理后的特征。其中,该目标网络层为语音分割模型(具体是特征提取子网络)中层级分布的首个卷积层701时,该目标网络层的第一特征图为对混叠语音数据进行频域转换所得到的语音频谱特征;目标网络层为语音分割模型中除首个卷积层701外的其他卷积层(如卷积层702)时,该目标网络层的第一特征图是对该目标网络层相邻的上一层级网络层(如卷积层701)输出的特征图进行池化处理得到的;池化处理由目标网络层中的池化层执行的,池化处理旨在通过并行处理或数据压缩等手段,减小上一层级网络层输出的特征图的尺寸和参数量,从而降低计算量。然后,按照目标网络层中融合的注意力机制,对声纹表征向量和目标网络层的第三特征图进行相关度计算,得到该目标网络层的第四特征图;该目标网络层的第三特征图的特征维度和第四特征图的特征维度相同;其中,特征图(如第三特征图和第四特征图)可以表现为向量形式,因此特征图的特征维度可以是向量的维度,该向量中的每一个维度对应一个特征,也就是说,注意力机制计算前后的特征维度是相同的。最后,采用目标网络层中除首个卷积网络外的其他卷积网络对第四特征图进行特征提取处理,得到目标网络层输出的第二特征图。
由此可见,在目标网络层为语音分割模型中的卷积层或卷积连接层的情况下,在卷积层或卷积连接层所包含的多个卷积网络中融合注意力机制,这样在采用卷积层或卷积连接层中包含的多个卷积网络对混叠语音数据进行特征提取的过程中,能够基于注意力机制在混叠语音数据中聚焦与指定对象的声纹表征向量相匹配的特征,从而通过特征提取过程中的注意力机制,分析得到与声纹表征向量相匹配的第二特征图,进而基于第二特征图能够从混叠语音数据中准确地分割出指定对象的目标语音信号。
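In another implementation, as shown in Figure 7b, assume that the target network layer is an up-sampling layer of the speech segmentation model and that the attention mechanism is fused after the last of the sequentially connected convolutional networks of the target network layer. In this implementation, the target network layer performs the correlation calculation on the voiceprint representation vector and its first feature map to obtain its second feature map as follows. First, the sequentially connected convolutional networks of the target network layer perform feature extraction on a target feature map to obtain the first feature map of the target network layer. The target feature map is obtained by concatenating the feature map output by the convolutional layer corresponding to the target network layer with the feature map output by the network layer one level above the target network layer. As shown in Figure 7b, the input of the first up-sampling layer 703 of the up-sampling sub-network is a target feature map obtained by concatenating the feature map output by the convolutional connection layer 704, which is one level above the first up-sampling layer 703, with the feature map output by the convolutional layer 705 that corresponds to the first up-sampling layer 703. Then, the attention mechanism fused into the target network layer performs the correlation calculation on the voiceprint representation vector and the first feature map of the target network layer to obtain the second feature map output by the target network layer; the second feature map has the same feature dimension as the first feature map.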
由此可见,在目标网络层为语音分割模型中的上采样层的情况下,在上采样层所包含的多个卷积网络中融合注意力机制,这样在采用上采样层中包含的多个卷积网络对上一层级网络层输出的特征图和对应的卷积层输出的特征图进行特征提取后,能够基于注意力机制在特征提取后的第一特征图中聚焦与指定对象的声纹表征向量相匹配的特征,从而分析得到与指定对象的声纹表征向量相匹配的第二特征图,进而基于第二特征图能够从混叠语音数据中准确地分割出指定对象的目标语音信号。
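It should be understood that Figures 7a and 7b only illustrate the correlation calculation of the target network layer for one exemplary fusion position of the attention mechanism in the feature extraction sub-network, the convolutional connection layer, and the up-sampling sub-network, respectively. When the attention mechanism is fused at a different position in the target network layer, the specific implementation of the correlation calculation differs accordingly.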
(3)将与声纹特性相匹配的语音频谱特征分段从频域转换至时域,得到与声纹特性相匹配的目标语音信号。
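After the speech spectrum feature segment that matches the voiceprint characteristics of the designated object has been extracted from the speech spectrum feature of the mixed speech data in the preceding steps, the segment must be converted from the frequency domain to the time domain to obtain a target speech signal that conforms to the data transmission format. The conversion can use, without limitation, the inverse of the Fourier transform (Fourier Transform), Laplace transform (Laplace Transform), or Z-transform (Z Transform) mentioned above.
(4) Generate a speech file of the designated object based on the target speech signal that matches the voiceprint characteristics. The format of the speech file can be set according to the user's needs and is not limited by the embodiments of this application. For example, when the speech file is a text file, a speech recognition algorithm or tool can convert the speech signal that matches the voiceprint characteristics of the designated object into text; the text is then processed (for example, spelling correction, punctuation insertion, and text cleaning) and saved as a text file in a text format (such as .doc). As another example, when the speech file is an audio file, the speech signal that matches the voiceprint characteristics of the designated object can be saved directly as an audio file in an audio format (such as .WAV).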
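Steps (1) and (3) above rely on a time-to-frequency conversion and its inverse. The following sketch shows the round trip with a plain discrete Fourier transform written with the standard library only. It is an illustration under stated assumptions: a production system would more likely use a short-time Fourier transform from a signal-processing library, and the names `dft` and `idft` and the four-sample signal are hypothetical.

```python
import cmath


def dft(signal):
    """Discrete Fourier transform: time-domain samples -> complex spectrum."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Inverse transform: complex spectrum -> real time-domain samples."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)).real / n
        for t in range(n)
    ]


samples = [0.0, 1.0, 0.0, -1.0]   # one period of a sine wave
spectrum = dft(samples)           # energy sits in bins 1 and 3
restored = idft(spectrum)         # matches `samples` up to rounding error
```

Because the inverse transform recovers the samples up to floating-point rounding, any processing done on the spectrum (here, selecting the part that matches the voiceprint) can be carried back to a time-domain signal.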
基于上述步骤(1)-(4)所阐述的语音信号提取过程可知,本申请实施例支持将混叠语音数据从时域转换到频域,得到混叠语音数据在频域上的语音频谱特征,这样将混叠语音数据转换为能够和声纹表征向量进行相关度计算且属于频域的语音频谱特征后,就可以利用语音分割模型中的每一网络层的注意力机制,对同属于频域的声纹表征向量和语音频谱特征进行相关度计算,确保相关度计算的可行性;考虑到最终想要分割得到的是时域上的信号,因而需要将语音频谱特征分段从频域转换到时域,得到与声纹特性相匹配的目标语音信号,确保最终提取出的是能够被设备理解和读取的时域信号。考虑到语音分割网络中每一网络层的特征维度不同,具体是每个网络层中各卷积网络的特征维度不同。因此,在将指定对象的声纹表征向量输入至语音分割网络中的每一网络层之前,还需要先采用一层网络层对声纹表征向量进行维度变换,得到维度变换后的声纹表征向量。其中,维度变换是指对声纹表征向量的特征维度进行变化,使得维度变换后的声纹表征向量的特征维度,与待输入至相应网络层中融合的注意力机制的特征图的特征维度相同,从而将维度变换后的声纹表征向量输入至语音分割模型时,语音分割模型才能对该声纹表征向量进行有效处理,避免维度不同造成的声纹表征向量的不可用性。其中,待输入至相应网络层中融合的注意力机制的特征图,可以是指前述描述的第三特征图。此外,注意力机制可以插入至网络层中顺序连接的多个卷积网络中的任意两个卷积网络之间;网络层的第一特征图可以是指该网络层中与注意力机制相邻且位于该注意力机制之前的卷积网络所输出的特征图。
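The dimension transformation and the attention-based correlation calculation can be sketched together. The application does not fix the exact form of the attention mechanism, so the following is only one possible reading: the voiceprint vector is projected by a bias-free linear layer to the feature dimension, each position of the feature map is scored by a scaled dot product with the projected vector, and the softmax of the scores re-weights the positions. The names `project` and `voiceprint_attention`, the weights, and the sample values are all assumptions.

```python
import math


def project(vector, weight):
    """Bias-free linear layer; `weight` has shape (out_dim, in_dim)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weight]


def voiceprint_attention(feature_map, voiceprint):
    """Re-weight each position of `feature_map` by its relevance to `voiceprint`."""
    dim = len(voiceprint)
    scores = [
        sum(f * v for f, v in zip(position, voiceprint)) / math.sqrt(dim)
        for position in feature_map
    ]
    peak = max(scores)                        # subtract the maximum for stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # The output keeps the shape of the input, as the description requires.
    return [[w * f for f in position] for w, position in zip(weights, feature_map)]


voiceprint = [1.0, 0.0, 2.0, 1.0]             # hypothetical 4-dim voiceprint vector
weight = [[0.5, 0.0, 0.0, 0.5],               # hypothetical 2x4 projection matrix
          [0.0, 1.0, 0.5, 0.0]]
projected = project(voiceprint, weight)       # 2-dim, same as the feature map
third_map = [[1.0, 1.0], [0.0, 0.0], [2.0, -1.0]]
fourth_map = voiceprint_attention(third_map, projected)
```

The feature map before attention (`third_map`) and after attention (`fourth_map`) have identical dimensions, which is why the projection to a matching feature dimension has to happen first.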
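In summary, the embodiments of this application construct a fully automatic speech processing scheme that segments speech signals by embedding a voiceprint vector. The user only needs to input a short piece of reference speech data of the designated object and the mixed speech data to be segmented, and the speech signal of the designated object is separated from the mixed speech data automatically and quickly. This greatly improves the efficiency of speech segmentation, removes the need for manual work, and allows rapid standardization. On the one hand, the scheme embeds the voiceprint vector: the voiceprint representation vector that represents the voiceprint characteristics of the designated object is input into the speech segmentation model and takes part in the Attention computation. The speech segmentation model therefore needs no additional training on the historical speech data of any object and no longer depends on large-scale speech data of the object, which makes the system highly reusable and portable, and more general. On the other hand, the Unet network is improved with the attention mechanism so that the input voiceprint representation vector takes part in an Attention computation with every network layer of the Unet network. The feature map output by the speech segmentation model therefore fits the voiceprint characteristics represented by the voiceprint representation vector more closely, which keeps excessive noise out of the extracted speech signal of the designated object and improves both the purity of the speech signal and the segmentation accuracy of the model.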
请参见图8,图8示出了本申请一个示例性实施例提供的另一种语音处理方法的流程示意图;该语音处理方法可以由前述提及的系统中的计算机设备来执行,如计算机设备为终端和/或服务器;该语音处理方法可包括但不限于步骤S801-S806:
S801:获取待分割的混叠语音数据。
S802:获取至少两个对象中的指定对象的参考语音数据。
需要说明的是,步骤S801-S802所示的具体实施过程,可以参见前述图3所示实施例中步骤S301-S302所示的具体实施过程的相关描述,在此不作赘述。
S803:对指定对象的参考语音数据进行短时相关分析。
S804:对所述指定对象的参考语音数据进行长时相关分析,得到指定对象的声纹表征向量。
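In steps S803-S804, to learn clear voiceprint characteristics of the designated object from its reference speech data, the embodiments of this application combine short-time and long-time correlation analysis of the reference speech data to extract a voiceprint representation vector that fully expresses those characteristics. Short-time correlation analysis of the reference speech data can be understood simply as feature analysis of a short stretch (for example, 20 milliseconds) of the speech signal. Within such a short time the speech signal in the reference speech data can usually be treated as approximately unchanged (quasi-stationary), so after the reference speech data has been discretized, the distribution of each short stretch of signal in the time and frequency domains can be used to extract the features of the signal within that stretch, which gives a feature analysis of every segment of the reference speech data. In short, short-time correlation analysis focuses on the features of the individual segments of the reference speech data. Long-time correlation analysis, in contrast, can be understood simply as feature analysis of the reference speech data as a whole; it focuses on the semantic expression of the entire signal sequence of the reference speech data.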
上述提及的短时相关分析和长时相关分析,主要是依赖于图1所示系统中的声纹向量提取模型实现的。正如前述所描述的,声纹向量提取模型中包含改进型的音频神经网络(PANNS)网络和转换器(Transformer)网络。其中:改进型的音频神经网络(PANNS)网络可以简称为改进型PANNS,主要用于对参考语音数据进行短时相关分析;转换器(Transformer)网络可以简称为Transformer网络,主要用于对参考语音数据进行长时相关分析。
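An exemplary schematic of short-time and long-time correlation analysis of the reference speech data with the improved PANNS and the Transformer network is shown in Figure 9. As shown in Figure 9, the reference speech data is first converted from the time domain to the frequency domain to obtain the corresponding reference speech spectrum feature; the conversion formulas are described above and are not repeated here. To enable short-time correlation analysis, the reference speech spectrum feature is then segmented to obtain the reference speech spectrum feature segment corresponding to each speech data segment. The reference speech spectrum feature segment in the embodiments of this application can be a log-mel spectrum (Log-mel or Logmel). The mel scale is a non-linear frequency scale determined from listeners' perceptual judgments of equally spaced changes in pitch (pitch, how high or low a sound is), so that frequency bands are equally spaced on the mel scale; in signal processing it is set deliberately to follow the way human hearing perceives change.
Then, using segment-wise input, the reference speech spectrum feature segment corresponding to each speech data segment is input into the improved PANNS, and the complete reference speech data is also input into the improved PANNS. The improved PANNS can then segment the received reference speech data according to the same segmentation rule that was applied to the reference speech spectrum feature, obtaining the multiple speech data segments of the reference speech data. Based on each speech data segment and its reference speech spectrum feature segment, the improved PANNS performs short-time correlation analysis on each speech data segment to obtain the corresponding voiceprint semantic feature vector, which represents the semantic characteristics of that speech data segment.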
最后,将每个语音数据分段对应的声纹语义特征向量所组成的向量序列(或称为声纹语义特征向量序列)输入至Transformer网络,这样Transformer网络可以对声纹语义特征向量序列进行长时相关分析,得到指定对象的声纹表征向量,此处的声纹语义特征向量序列中包括每个语音数据分段对应的声纹语义特征向量。其中,Transformer网络作为一个序列网络,其输入是一个整体的向量序列,其输出也是一个序列(即指定对象的声纹表征向量为一个序列);在Transformer网络输出一个序列后,可以将该序列中最后一个向量称为state,该state包含了整体序列的语义融合;这样,可以将Transformer网络输出的整体序列求取平均(mean),并将平均结果和state进行叠加,生成一个综合了整体序列所表达的语义特征的声纹表征向量。上述这种综合Transformer网络输出序列的平均结果所表征的语音特征和整体序列所表征的语音特征,来得到声纹表征向量的方式,可以有效确保声纹表征向量能够较为充分地体现指定对象的声纹特性,确保声纹特性提取的准确性。
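The pooling of the Transformer output described above (average the whole output sequence, then superimpose the last vector, called state) can be sketched as follows. The sketch reads "superimpose" as element-wise addition, which is an assumption; the function name `pool_sequence` and the sample values are hypothetical.

```python
def pool_sequence(output_sequence):
    """Combine the mean of the output sequence with its last vector (the state)."""
    length = len(output_sequence)
    dim = len(output_sequence[0])
    mean = [sum(vec[i] for vec in output_sequence) / length for i in range(dim)]
    state = output_sequence[-1]               # last vector of the sequence
    # Element-wise addition is assumed here for "superimpose".
    return [m + s for m, s in zip(mean, state)]


sequence = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(pool_sequence(sequence))  # [8.0, 10.0]
```

In the example the mean is [3.0, 4.0] and the state is [5.0, 6.0], so the pooled vector is [8.0, 10.0].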
在上述对包含改进型PANNS和Transformer网络的声纹向量提取模型进行整体介绍的基础上,下面分别对改进型PANNS和Transformer网络的结构和流程进行介绍,其中:
(1)改进型PANNS。
改进型PANNS的示例性结构示意图可以参见图10;如图10所示,该改进型PANNS的输入信息为参考语音数据,即该改进型PANNS的输入使用的是原语音采样点序列,即音频信号的原始序列。该改进型PANNS可以分为两个支路,分别为时域支路(或称为时域处理支路)和频域支路(或称为频域处理支路)。其中:时域支路的输入信息是参考语音数据,频域支路的输入信息是参考语音数据对应的参考语音频谱特征;该参考语音频谱特征是将时域信号“参考语音数据”从时域转换至频域所得到的。进一步的,由前述描述可知,为了充分提取和表征指定对象的声纹特性,该改进型PANNS着重于对参考语音数据进行短时相关分析,即支持通过分段输入的方式采用改进型PANNS进行处理;具体是改进型PANNS每次只对参考语音数据中的一段语音数据和该语音数据对应的参考语音频谱特征进行处理。
也就是说,每次输入时域支路的输入信息是参考语音数据中的一段语音数据分段;同理,每次输入频域支路的输入信息是参考语音数据中的一段语音数据对应的在频域中的参考语音频谱特征分段。其中,将参考语音数据输入至改进型PANNS后,可以将该参考语音数据从时域转换至频域,得到参考语音数据对应的参考语音频谱特征,并对参考语音频谱特征进行分段处理,得到每个语音数据分段对应的参考语音频谱特征分段;以及,对参考语音数据进行分段处理,得到参考语音数据对应的多个语音数据分段。值得注意的是,参考语音数据进行分段处理所遵循的分段规则,和参考语音频谱特征进行分段处理所遵循的分段规则是相同的;例如,分段规则可以包括:参考语音数据进行分段处理时,是周期性的采集20毫秒时长的语音数据为一个分段,则参考语音频谱特征进行分段处理时,每个参考语音频谱特征分段转换至时域后对应的语音数据分段时长为20毫秒。通过这种时域分段和频域分段相对应的方式,确保每次特性分析的准确性。
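The segmentation rule in the example above (one segment per 20 milliseconds of speech) can be sketched as follows. The 16 kHz sample rate, the non-overlapping segments, the dropping of a trailing partial segment, and the function name `split_into_segments` are assumptions made for illustration.

```python
def split_into_segments(samples, sample_rate, segment_ms=20):
    """Split time-domain samples into consecutive, non-overlapping segments."""
    segment_len = sample_rate * segment_ms // 1000   # samples per segment
    if segment_len <= 0:
        raise ValueError("segment length must be at least one sample")
    # A trailing partial segment is dropped in this sketch.
    return [
        samples[start:start + segment_len]
        for start in range(0, len(samples) - segment_len + 1, segment_len)
    ]


samples = list(range(1000))                   # stand-in for 1000 audio samples
segments = split_into_segments(samples, sample_rate=16000)
print(len(segments), len(segments[0]))        # 3 320
```

At 16 kHz a 20 ms segment holds 320 samples. Applying the same boundaries to the spectrum frames keeps each time-domain segment aligned with its frequency-domain counterpart.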
继续参见图10,在分段输入的情况下,改进型PANNS可以分别基于每个语音数据分段和相应的参考语音频谱特征分段,对每个语音数据分段进行短时相关分析,得到每个语音数据分段对应的声纹语义特征向量;该声纹语义特征向量用于表征相应语音数据分段的语义特性。为便于阐述,以多个语音数据分段中的任一语音数据分段表示为目标语音数据分段为例,对上述描述的短时相关分析的具体过程进行介绍,其中:
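① The time-domain branch contains multiple one-dimensional convolutional layers (Conv 1D) and max pooling layers (Max pooling), where Conv abbreviates Convolutional and D abbreviates Dimension; the max pooling layers are one-dimensional and use a stride (abbreviated s) of 4. As shown in Figure 10, assume that the time-domain branch consists, in order, of: a one-dimensional convolutional layer → a one-dimensional convolutional block (Conv 1D block, made up of one or more one-dimensional convolutional layers) → a max pooling layer (one-dimensional, stride s = 4) → a one-dimensional convolutional block → a max pooling layer → a one-dimensional convolutional block → a max pooling layer. After a convolutional layer has extracted features from the target speech data segment, the max pooling layer that follows it selects among the extracted features, which helps isolate the features of interest in the feature map. Through this layer-by-layer progression of one-dimensional convolutional layers and max pooling layers, features are extracted from the reference speech data (specifically, the target speech data segment) multiple times to obtain a one-dimensional sequence (resize), which is converted by a dimension transformation (Reshape) into multiple two-dimensional time-domain feature maps (also called the two-dimensional Wavegram). A time-domain feature map is a graphical representation, in the time domain, of how the reference speech signal in the reference speech data changes over time, and it shows the basic characteristics of the signal directly (such as its periodicity, frequency components, and phase relationships). The purpose of the dimension transformation is to allow the resulting time-domain feature maps to be fused with the frequency-domain feature maps output by the frequency-domain branch. Using many one-dimensional convolutional layers in the time-domain branch lets feature extraction learn the time-domain characteristics of the speech signal (such as loudness and sample amplitude) directly from the reference speech data.
② The frequency-domain branch contains multiple two-dimensional convolutional layers (Conv 2D) and max pooling layers (Max pooling); the max pooling layers of this branch are two-dimensional. As shown in Figure 10, assume that the frequency-domain branch consists, in order, of: a two-dimensional convolutional block (made up of one or more two-dimensional convolutional layers) → a max pooling layer (two-dimensional) → a two-dimensional convolutional block → a max pooling layer → a two-dimensional convolutional block. These layers extract features from the reference speech spectrum feature segment (Logmel) corresponding to the target speech data segment to obtain multiple frequency-domain feature maps (Feature maps). A frequency-domain feature map is a graphical representation, in the frequency domain, of the frequency components of the reference speech signal, and it shows the frequency components contained in the signal together with their amplitude or intensity. Note that the frequency-domain feature maps have the same feature dimension as the time-domain feature maps output by the time-domain branch. Using many two-dimensional convolutional layers in the frequency-domain branch lets feature extraction learn the frequency-domain characteristics of the speech signal directly from the reference speech spectrum feature segment.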
③在基于前述步骤得到时域支路输出的时域特征图(Wavegram),和频域支路输出的频域特征图(Feature map)后,可以将时域特征图和频域特征图进行融合处理,生成目标语音数据分段对应的声纹语义特征向量。详细地,如图10所示的在时域支路和频域支路之间还存在多次时域和频域之间的信息交流,分别是将时域支路的信息特征进行维度变换(Reshape),然后与频域支路的特征进行融合(concat),并将融合结果经过二维卷积块(Conv 2D block)卷积后输入到更高层的融合模块进行融合。考虑到时域处理中能够获得参考语音数据在时序上的关联性,而频域处理中可以获得参考语音数据在不同频率上的关联性,两者属于不同领域;那么通过上述描述的两个域(时域和频域)之间的信息交互,能够实现时域和频域保持信息上的互补,使得高层网络能够感知到底层网络信息,从而实现针对参考语音数据的充分学习。
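On this basis, assume that feature extraction is performed k times in the time-domain and frequency-domain branches, where k is an integer greater than 1, and let any one of these be the i-th feature extraction. The fusion of the time-domain feature maps output by the time-domain branch with the frequency-domain feature maps output by the frequency-domain branch can then proceed as follows. First, when i = 1 (the first feature extraction), the intermediate time-domain feature map and the intermediate frequency-domain feature map obtained by the two branches in the 1st feature extraction are fused to generate the first intermediate feature vector. Then, when 1 < i ≤ k, the intermediate time-domain feature map and the intermediate frequency-domain feature map obtained in the i-th feature extraction are fused with the (i-1)-th intermediate feature vector to generate the i-th intermediate feature vector. Finally, the voiceprint semantic feature vector of the target speech data segment is generated from the k-th intermediate feature vector obtained when i = k. This last step can include: inputting the k-th intermediate feature vector into two-dimensional convolutional neural network layers (2D CNN layers) for feature extraction to obtain a vector sequence; computing the mean (mean) and the maximum (max) of the vector sequence; adding (sum) the mean and the maximum and passing the result through an activation function (ReLU) to obtain a feature vector (vector); and normalizing the feature vector with a normalization function (softmax), which converts it into a voiceprint semantic feature vector expressed as a probability distribution, namely the voiceprint semantic feature vector of the target speech data segment.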
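The final stage described above (mean and max over the vector sequence, sum, activation, softmax) can be sketched as follows. The sketch assumes that the mean and the maximum are taken per dimension across the sequence and that the activation is ReLU; the function name `embedding_head` and the sample values are hypothetical, and the preceding 2D CNN layers are left out.

```python
import math


def embedding_head(vector_sequence):
    """Mean + max over the sequence, then ReLU, then softmax normalization."""
    count = len(vector_sequence)
    dim = len(vector_sequence[0])
    mean = [sum(v[i] for v in vector_sequence) / count for i in range(dim)]
    peak = [max(v[i] for v in vector_sequence) for i in range(dim)]
    activated = [max(0.0, m + p) for m, p in zip(mean, peak)]   # ReLU(mean + max)
    top = max(activated)                       # subtract the maximum for stability
    exps = [math.exp(a - top) for a in activated]
    total = sum(exps)
    return [e / total for e in exps]           # entries sum to 1


vector = embedding_head([[1.0, -4.0], [3.0, -2.0]])
```

For this input the mean is [2.0, -3.0] and the maximum is [3.0, -2.0]; their sum [5.0, -5.0] becomes [5.0, 0.0] after ReLU, and softmax turns that into a distribution dominated by the first entry.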
综上所述,整个参考语音数据进行分帧并分别输入到改进型PANNS中,该改进型PANNS选择使用整条序列的最后一个向量和整条序列的均值一起融合,从而生成最后的声纹表征向量,能够得到代表整个参考语音数据的多频带语义特征向量序列(包括每个语音数据分段对应的声纹语义特征向量)。通过该改进型PANNS能够实现针对参考语音数据的短时相关分析,充分学习参考语音数据中各分段语音信号的声纹特性,能够让提取的声纹表征向量充分表达指定对象的声纹信息。
(2)Transformer网络。
考虑到改进型PANNS关注于短时相关性,也就是特征是按照分段进行计算的;因此为了能够充分从参考语音数据中充分学习指定对象的声纹特性,本申请实施例还引入Transformer网络来对参考语音数据进行长时关联信息的学习。Transformer网络拥有的注意力机制能够更好地关注到参考语言数据关于指定对象的声纹特性,较好地实现针对参考语言数据的全局特征信息的提取。
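As an example, the structure of the Transformer network is shown in Figure 11. As shown in Figure 11, the Transformer network uses an encoder-decoder architecture. The encoder side is a stack of N encoder layers and the decoder side is a stack of N decoder layers (the "N×" in Figure 11). An encoder layer mainly contains two sub-layers: the first contains a multi-head attention mechanism (Multi-Head Attention) with residual connection and normalization (Add&Norm), and the second contains a feed-forward neural network (Feed Forward) with residual connection and normalization. The multi-head attention mechanism of the first sub-layer helps capture the contextual semantics of the voiceprint semantic feature vectors of the speech data segments. A decoder layer mainly contains three sub-layers: the first contains a masked multi-head attention mechanism (Masked Multi-Head Attention) with residual connection and normalization, the second contains a multi-head attention mechanism (Multi-Head Attention) with residual connection and normalization (Add&Norm), and the third contains a feed-forward neural network (Feed Forward) with residual connection and normalization. This three-sub-layer structure helps the decoder capture the content that most needs attention.
In a specific implementation, the input of the encoder side of the Transformer network is the voiceprint semantic feature vector sequence (Input Embedding) output by the improved PANNS, which contains the voiceprint semantic feature vector of each speech data segment. Positional encoding (Positional Encoding) is first applied to the input sequence as preprocessing. Positional encoding gives every element of the vector sequence a second representation based on its position, so that the data entering the encoder side carries position information. The position-encoded data is then input into the encoder side, whose layers (described above) encode it to produce the encoding result. After obtaining the encoding result, the decoder side combines it with its own output features from the previous time steps (shifted right) as its current input and decodes that input. Finally, the output features of the decoder side pass through a linear transformation (Linear) and classification (softmax) to give the voiceprint representation vector of the designated object (Output Probabilities). Passing the voiceprint semantic feature vector sequence through the Transformer network in this way makes the long-time semantic expression of the whole reference speech data clearer, so that the final voiceprint representation vector expresses the voiceprint characteristics of the designated object fully and clearly.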
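The application does not state which positional encoding is used. As an illustration only, the sketch below uses the sinusoidal encoding that is standard for Transformer networks and adds it to each vector of the sequence; the function names `positional_encoding` and `add_positions` are hypothetical.

```python
import math


def positional_encoding(length, dim):
    """Sinusoidal table: sine on even indices, cosine on odd indices (assumed form)."""
    table = []
    for pos in range(length):
        row = []
        for i in range(dim):
            # Indices 2k and 2k+1 share the same frequency 1 / 10000^(2k/dim).
            angle = pos / (10000 ** ((i - i % 2) / dim))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


def add_positions(sequence):
    """Add the position table to a sequence of equal-length vectors."""
    table = positional_encoding(len(sequence), len(sequence[0]))
    return [[x + p for x, p in zip(vec, pos)] for vec, pos in zip(sequence, table)]


print(positional_encoding(1, 4))  # [[0.0, 1.0, 0.0, 1.0]]
```

Because every position receives a distinct pattern, the encoder can tell the voiceprint semantic feature vector of an early segment apart from that of a later one, even though attention itself is order-agnostic.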
S805:将混叠语音数据和声纹表征向量输入至预设的语音分割模型,该语音分割模型用于:基于注意力机制从混叠语音数据中分割出与指定对象的声纹特性相匹配的目标语音信号。
需要说明的是,步骤S805所示的具体实施过程,可以参见前述图3所示实施例中,步骤S302中关于基于注意力机制从混叠语音数据中分割出与指定对象的声纹特性相匹配的语音信号这部分具体实施过程的相关描述,在此不作赘述。
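As described for step S302 above, the embodiments of this application mainly use the improved Unet network (that is, the speech segmentation network) to segment the mixed speech data based on the attention mechanism and obtain the speech signal that matches the voiceprint characteristics of the designated object. The speech segmentation network is obtained by fusing the attention mechanism into the conventional Unet network, for example by adding an attention mechanism to every network layer of the conventional Unet network. Because the speech segmentation model may contain many Attention mechanisms and therefore a large number of parameters, the embodiments of this application also support model distillation of the attention-fused speech segmentation model. The correlation calculation described above can then be performed by the distilled speech segmentation model, which shrinks the overall system and reduces both the parameter count and the running time. The voiceprint representation vector and the speech spectrum feature are both expressed as vectors, so the distilled speech segmentation model can compute their correlation by, without limitation, a dot product. The dot product of the voiceprint representation vector and the speech spectrum feature is the product of their magnitudes multiplied by the cosine of the angle between them: if the voiceprint representation vector is $\vec{a}$, the speech spectrum feature is $\vec{b}$, and the angle between them is θ, then the dot product is $\vec{a}\cdot\vec{b}=|\vec{a}||\vec{b}|\cos\theta$, and this result is taken as the similarity between the voiceprint representation vector and the speech spectrum feature. Model distillation is a method that learns from a large model with many parameters (teacher model) to obtain a more compact small model with fewer parameters (student model). Model distillation of the speech segmentation model is mainly implemented with techniques such as pruning (pruning) and knowledge distillation. Pruning, also called model pruning (Model Pruning), is a model compression technique that removes unimportant parameters or structures from the speech segmentation model to reduce its complexity, speed up its inference, and reduce its storage requirements. Knowledge distillation treats the relatively complex speech segmentation model as a teacher model and trains a student model that has fewer parameters and a simpler structure; during training the student model learns to imitate the outputs of the teacher model, so that the trained student model requires less computation while achieving performance comparable to that of the teacher model. In the embodiments of this application, the trained student model is the speech segmentation model after knowledge distillation. Model pruning and knowledge distillation are the main techniques for model distillation, but other techniques can also be used; the embodiments of this application do not limit the specific implementation of model distillation.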
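The dot-product similarity described above can be checked numerically. The sketch below computes the dot product directly and confirms that it equals the product of the two magnitudes and the cosine of the angle between the vectors; the two-dimensional sample vectors and the helper names are hypothetical.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    """Magnitude (Euclidean length) of a vector."""
    return math.sqrt(dot(a, a))


def cosine(a, b):
    """Cosine of the angle between two non-zero vectors."""
    return dot(a, b) / (norm(a) * norm(b))


voiceprint = [3.0, 4.0]          # hypothetical voiceprint representation vector
feature = [4.0, 3.0]             # hypothetical speech spectrum feature
similarity = dot(voiceprint, feature)
print(similarity)                # 24.0
```

Both vectors have magnitude 5 and the cosine of their angle is 0.96, so the product 5 * 5 * 0.96 gives the same value, 24, as the direct sum of products.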
S806:基于所分割出的目标语音信号,生成指定对象的语音文件。
需要说明的是,步骤S806所示的具体实施过程,可以参见前述图3所示实施例中步骤S304所示的具体实施过程的相关描述,在此不作赘述。
综上所述,本申请实施例提供了一种声纹向量嵌入的可复用性指定对象的语音分割方法,该方法采用改进型PANNS和Transformer网络所组成的声纹向量提取模型,来对指定对象的一小段参考语音数据进行声纹提取,通过结合改进型PANNS的短时相关分析和transformer网络的长时相关分析,能够使得提取到的指定对象的声纹表征向量更为充分和清晰地表达出指定对象的声纹特性。此外,本申请实施例还通过将注意力机制融合至Unet网络中,使得融合了注意力机制的Unet网络(即语音分割模型)基于注意力机制来实现混叠语音数据的分割,能够更清晰准确的计算提取出混叠语音数据中指定对象的语音信号,确保提取的语音信号的纯净,实现更准确纯净的语音分离效果。此外,本申请实施例只需获取指定对象的参考语音数据,就能实现从混叠语音数据中分割出该指定对象的语音信号;如果想要获取其他对象的语音数据,则只需更换待分割的对象的声纹特性即可,不需要针对每个对象训练一个专属网络,能够做到方案的便捷可迁移,使得本方案具有极高的通用性。
上述详细阐述了本申请实施例的方法,为了便于更好地实施本申请实施例的上述方案,相应地,下面提供了本申请实施例的装置。
图12示出了本申请一个示例性实施例提供的一种语音处理装置的结构示意图;该语音处理装置可以用于执行图3或图8所示的方法实施例中的部分或全部步骤。请参见图12,该语音处理装置包括如下单元:
获取单元1201,用于获取混叠语音数据,混叠语音数据中包含至少两个对象中每个对象产生的语音信号;
获取单元1201,还用于获取指定对象的参考语音数据;指定对象是指至少两个对象中的任一个;参考语音数据中包含指定对象的参考语音信号;
处理单元1202,用于从参考语音数据中提取指定对象的声纹表征向量,所述声纹表征向量用于表征指定对象的声纹特性;
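The processing unit 1202 is further configured to input the mixed speech data and the voiceprint representation vector into a preset speech segmentation model, the speech segmentation model being configured to segment, based on an attention mechanism, a target speech signal that matches the voiceprint characteristics from the mixed speech data;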
处理单元1202,还用于基于所分割出的目标语音信号,生成指定对象的语音文件。
在一种实现方式中,基于注意力机制从混叠语音数据中分割出与声纹特性相匹配的目标语音信号的过程,包括:
将混叠语音数据从时域转换至频域,得到混叠语音数据对应的语音频谱特征;语音频谱特征是混叠语音数据在频域上的特征表现;
基于注意力机制对声纹表征向量和语音频谱特征进行相关度计算,得到与声纹特性相匹配的语音频谱特征分段;所述语音频谱特征分段是所述语音频谱特征中,与所述声纹特性相匹配的分段;
将语音频谱特征分段从频域转换至时域,得到与声纹特性相匹配的目标语音信号。
在一种实现方式中,相关度计算是通过语音分割模型实现的;语音分割模型中包括特征提取子网络和上采样子网络,特征提取子网络和上采样子网络之间通过卷积连接层进行连接;
特征提取子网络和上采样子网络具有对称性;特征提取子网络中包含层级分布的m个卷积层,上采样子网络中包含m个卷积层中每个卷积层对应的上采样层,m为正整数;卷积层、卷积连接层和上采样层中均包括顺序连接的多个卷积网络;
其中,语音分割模型中的全部或部分网络层中融合有注意力机制,注意力机制在网络层包括的多个卷积网络中的融合位置不固定;网络层包括卷积层、上采样层和卷积连接层。
在一种实现方式中,语音分割模型中每个网络层均融合有注意力机制;处理单元1202,用于基于注意力机制对声纹表征向量和语音频谱特征进行相关度计算,得到与声纹特性相匹配的语音频谱特征分段时,具体用于:
将声纹表征向量输入至语音分割模型中的每个网络层;
基于每个网络层中融合的注意力机制,对声纹表征向量和相应网络层的第一特征图进行相关度计算,以得到相应网络层输出的第二特征图;第二特征图所表征的声纹特性和声纹表征向量所表征的声纹特性相匹配;
将语音分割模型中第2m+1个网络层输出的第二特征图,作为与声纹表征向量所表征的声纹特性相匹配的语音频谱特征分段;语音分割模型中的第2m+1个网络层为上采样子网络中的最后一个上采样层。
在一种实现方式中,语音分割模型中融合有注意力机制的任一网络层表示为目标网络层;目标网络层为卷积层或者卷积连接层;注意力机制在目标网络层中的融合位置为:目标网络层包括的顺序连接的多个卷积网络中,首个卷积网络和与首个卷积网络相邻的第二个卷积网络之间的位置;
处理单元1202,用于基于每个网络层中融合的注意力机制,对声纹表征向量和相应网络层的第一特征图进行相关度计算,以得到相应网络层输出的第二特征图时,具体用于:
采用目标网络层中的首个卷积网络,对目标网络层的第一特征图进行特征提取处理,得到目标网络层的第三特征图;其中,目标网络层为特征提取子网络中层级分布的首个卷积层时,目标网络层的第一特征图为语音频谱特征;目标网络层为语音分割模型中除首个卷积层外的其他卷积层时,目标网络层的第一特征图是对与目标网络层相邻的上一层级网络层输出的特征图进行池化处理得到的;
按照目标网络层中融合的注意力机制,对声纹表征向量和目标网络层的第三特征图进行相关度计算,得到目标网络层的第四特征图;第三特征图的特征维度和第四特征图的特征维度相同;
采用目标网络层中除首个卷积网络外的其他卷积网络对第四特征图进行特征提取处理,得到目标网络层输出的第二特征图。
在一种实现方式中,语音分割模型中融合有注意力机制的任一网络层表示为目标网络层;目标网络层为上采样层;注意力机制在目标网络层对应的多个卷积网络中的融合位置为:目标网络层中顺序连接的多个卷积网络中的最后一个卷积网络之后的位置;
处理单元1202,用于基于每个网络层中融合的注意力机制,对声纹表征向量和相应网络层的第一特征图进行相关度计算,以得到相应网络层输出的第二特征图时,具体用于:
采用目标网络层中顺序连接的多个卷积网络,对目标特征图进行特征提取处理,得到目标网络层的第一特征图;目标特征图是将目标网络层在特征提取子网络中对应的卷积层输出的特征图,和目标网络层的上一层级网络层输出的特征图进行特征拼接得到的;
采用目标网络层中融合的注意力机制,对声纹表征向量和目标网络层的第一特征图进行相关度计算,得到目标网络层输出的第二特征图;第二特征图的特征维度和第一特征图的特征维度相同。
在一种实现方式中,处理单元1202,还用于:
对声纹表征向量进行维度变换,得到维度变换后的声纹表征向量;
其中,维度变换后的声纹表征向量的特征维度,与待输入至相应网络层中融合的注意力机制的特征图的特征维度相同。
在一种实现方式中,若语音分割模型中融合有注意力机制的网络层的数量大于数量阈值,则处理单元1202,还用于:
对语音分割模型进行模型蒸馏,得到模型蒸馏后的语音分割模型;
其中,相关度计算由模型蒸馏后的语音分割模型实现。
在一种实现方式中,处理单元1202,用于从参考语音数据中提取指定对象的声纹表征向量时,具体用于:
对参考语音数据进行分段处理,得到参考语音数据对应的多个语音数据分段;
将参考语音数据从时域转换至频域,得到参考语音数据对应的参考语音频谱特征;
对参考语音频谱特征进行分段处理,得到每个语音数据分段对应的参考语音频谱特征分段;
分别基于每个语音数据分段和相应的参考语音频谱特征分段,对每个语音数据分段进行短时相关分析,得到每个语音数据分段对应的声纹语义特征向量;声纹语义特征向量用于表征语音数据分段的语义特性;
对声纹语义特征向量序列进行长时相关分析,得到指定对象的声纹表征向量;声纹语义特征向量序列中包括每个语音数据分段对应的声纹语义特征向量。
在一种实现方式中,多个语音数据分段中的任一语音数据分段表示为目标语音数据分段;处理单元1202,用于分别基于每个语音数据分段和相应的参考语音频谱特征分段,对每个语音数据分段进行短时相关分析,得到每个语音数据分段对应的声纹语义特征向量时,具体用于:
对目标语音数据分段进行特征提取处理,得到时域特征图;
对目标语音数据分段对应的参考语音频谱特征分段进行特征提取处理,得到频域特征图;
将时域特征图和频域特征图进行融合处理,生成目标语音数据分段对应的声纹语义特征向量。
在一种实现方式中,特征提取处理的次数为k次,k为大于1的整数;任一次特征提取处理表示为第i次特征提取处理;处理单元1202,用于将时域特征图和频域特征图进行融合处理,生成目标语音数据分段对应的声纹语义特征向量时,具体用于:
当i=1时,将第1次特征提取处理得到的中间时域特征图和中间频域特征图进行融合处理,生成第1次特征提取处理后的第一中间特征向量;
当1<i≤k时,将第i次特征提取处理得到的中间时域特征图和中间频域特征图,以及第i-1次特征提取处理得到的第i-1中间特征向量进行融合处理,生成第i次特征提取处理后的第i中间特征向量;
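generating, based on the k-th intermediate feature vector obtained after the k-th feature extraction when i = k, the voiceprint semantic feature vector corresponding to the target speech data segment.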
根据本申请的一个实施例,图12所示的语音处理装置中的各个单元可以分别或全部合并为一个或若干个另外的单元来构成,或者其中的某个(些)单元还可以再拆分为功能上更小的多个单元来构成,这可以实现同样的操作,而不影响本申请的实施例的技术效果的实现。上述单元是基于逻辑功能划分的,在实际应用中,一个单元的功能也可以由多个单元来实现,或者多个单元的功能由一个单元实现。在本申请的其它实施例中,该语音处理装置也可以包括其它单元,在实际应用中,这些功能也可以由其它单元协助实现,并且可以由多个单元协作实现。根据本申请的另一个实施例,可以通过在包括中央处理单元(CPU)、随机存取存储介质(RAM)、只读存储介质(ROM)等处理元件和存储元件的例如计算机的通用计算设备上运行能够执行如图3及图8所示的相应方法所涉及的各步骤的计算机程序(包括程序代码),来构造如图12中所示的语音处理装置,以及来实现本申请实施例的语音处理方法。计算机程序可以记载于例如计算机可读记录介质上,并通过计算机可读记录介质装载于上述计算设备中,并在其中运行。
基于同一发明构思,本申请实施例中提供的语音处理装置解决问题的原理与有益效果与本申请方法实施例中语音处理方法解决问题的原理和有益效果相似,可以参见方法的实施的原理和有益效果,为简洁描述,在这里不再赘述。
图13示出了本申请一个示例性实施例提供的一种计算机设备的结构示意图。请参见图13,该计算机设备包括处理器1301、通信接口1302以及计算机可读存储介质1303。其中,处理器1301、通信接口1302以及计算机可读存储介质1303可通过总线或者其它方式连接。其中,通信接口1302用于接收和发送数据。计算机可读存储介质1303可以存储在计算机设备的存储器中,计算机可读存储介质1303用于存储计算机程序,处理器1301用于执行计算机可读存储介质1303存储的计算机程序。处理器1301(或称CPU(Central Processing Unit,中央处理器))是计算机设备的计算核心以及控制核心,其适于实现一条或多条计算机程序,具体适于加载并执行一条或多条计算机程序从而实现相应方法流程或相应功能。
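The embodiments of this application also provide a computer-readable storage medium (Memory), which is a memory device in the computer equipment used to store programs and data. The computer-readable storage medium here can include a built-in storage medium of the computer equipment and can also include an extended storage medium supported by the computer equipment. The computer-readable storage medium provides storage space that stores the processing system of the computer equipment, and also stores one or more computer programs suitable for being loaded and executed by the processor 1301. The computer-readable storage medium can be a high-speed RAM memory or a non-volatile memory (non-volatile memory), such as at least one disk memory; optionally, it can also be at least one computer-readable storage medium located remotely from the processor.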
在一个实施例中,该计算机设备可以是前述实施例提到的终端或服务器;该计算机可读存储介质中存储有一条或多条计算机程序;由处理器1301加载并执行计算机可读存储介质中存放的一条或多条计算机程序,以实现上述语音处理方法实施例中的相应步骤;具体实现中,计算机可读存储介质中的一条或多条计算机程序,由处理器1301加载并执行本申请各实施例的步骤;其中,本申请各实施例的步骤可以参见前述各实施例的相关描述,在此不作赘述。
基于同一发明构思,本申请实施例中提供的计算机设备解决问题的原理与有益效果与本申请方法实施例中语音处理方法解决问题的原理和有益效果相似,可以参见方法的实施的原理和有益效果,为简洁描述,在这里不再赘述。
本申请实施例还提供一种计算机程序产品,该计算机程序产品包括计算机程序,该计算机程序被处理器执行时,实现上述语音处理方法。
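Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in the embodiments disclosed in this application can be implemented in electronic hardware or in a combination of computer software and electronic hardware. Whether these functions are performed in hardware or software depends on the specific application and design constraints of the technical solution. Skilled practitioners may use different methods to implement the described functions for each particular application, but such implementations should not be considered beyond the scope of this application.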
在上述实施例中,可以全部或部分地通过软件、硬件、固件或者其任意组合来实现。当使用软件实现时,可以全部或部分地以计算机程序产品的形式实现。计算机程序产品包括计算机程序(一个或多个)。在计算机设备上加载和执行计算机程序时,计算机程序执行本申请实施例上述的流程或功能。计算机设备可以是通用计算机、专用计算机、计算机网络、或者其他可编程设备。计算机程序可以存储在计算机可读存储介质中,或者通过计算机可读存储介质进行传输。计算机程序可以从一个网站站点、计算机设备、服务器或数据中心通过有线(例如,同轴电缆、光纤、数字用户线(DSL))或无线(例如,红外、无线、微波等)方式向另一个网站站点、计算机设备、服务器或数据中心进行传输。计算机可读存储介质可以是计算机设备能够存取的任何可用介质或者是包含一个或多个可用介质集成的服务器、数据中心等数据存储设备。可用介质可以是磁性介质(例如,软盘、硬盘、磁带)、光介质(例如,DVD)、或者半导体介质(例如,固态硬盘(Solid State Disk,SSD))等。
以上所述,仅为本申请的具体实施方式,但本申请的保护范围并不局限于此,任何熟悉本技术领域的技术人员在本申请揭露的技术范围内,可轻易想到变化或替换,都应涵盖在本申请的保护范围之内。因此,本申请的保护范围应以权利要求的保护范围为准。

Claims (15)

  1. 一种语音处理方法,其特征在于,所述方法由计算机设备执行,所述方法包括:
    获取混叠语音数据,所述混叠语音数据中包含至少两个对象中每个所述对象产生的语音信号;
    获取指定对象的参考语音数据,所述指定对象是指所述至少两个对象中的任一个;所述参考语音数据中包含所述指定对象的参考语音信号;
    从所述参考语音数据中提取所述指定对象的声纹表征向量,所述声纹表征向量用于表征所述指定对象的声纹特性;
    将所述混叠语音数据和所述声纹表征向量输入预设的语音分割模型,所述语音分割模型用于:基于注意力机制从所述混叠语音数据中分割出与所述声纹特性相匹配的目标语音信号;
    基于所分割出的所述目标语音信号,生成所述指定对象的语音文件。
  2. 如权利要求1所述的方法,其特征在于,基于注意力机制从所述混叠语音数据中分割出与所述声纹特性相匹配的目标语音信号的过程,包括:
    将所述混叠语音数据从时域转换至频域,得到所述混叠语音数据对应的语音频谱特征;所述语音频谱特征是所述混叠语音数据在所述频域上的特征表现;
    基于注意力机制对所述声纹表征向量和所述语音频谱特征进行相关度计算,得到与所述声纹特性相匹配的语音频谱特征分段;所述语音频谱特征分段是所述语音频谱特征中,与所述声纹特性相匹配的分段;
    将所述语音频谱特征分段从所述频域转换至所述时域,得到与所述声纹特性相匹配的所述目标语音信号。
  3. 如权利要求1或2所述的方法,其特征在于,所述相关度计算是通过所述语音分割模型实现的;所述语音分割模型中包括特征提取子网络和上采样子网络,所述特征提取子网络和所述上采样子网络之间通过卷积连接层进行连接;
    所述特征提取子网络和所述上采样子网络具有对称性;所述特征提取子网络中包含层级分布的m个卷积层,所述上采样子网络中包含所述m个卷积层中每个所述卷积层对应的上采样层,m为正整数;所述卷积层、所述卷积连接层和所述上采样层中均包括顺序连接的多个卷积网络;
    其中,所述语音分割模型中的全部或部分网络层中融合有所述注意力机制,所述注意力机制在所述网络层包括的多个卷积网络中的融合位置不固定;所述网络层包括所述卷积层、所述上采样层和所述卷积连接层。
  4. 如权利要求1-3任一项所述的方法,其特征在于,所述语音分割模型中每个所述网络层均融合有所述注意力机制;所述基于注意力机制对所述声纹表征向量和所述语音频谱特征进行相关度计算,得到与所述声纹特性相匹配的语音频谱特征分段,包括:
    将所述声纹表征向量输入至所述语音分割模型中的每个所述网络层;
    基于每个所述网络层中融合的所述注意力机制,对所述声纹表征向量和相应所述网络层的第一特征图进行相关度计算,以得到相应所述网络层输出的第二特征图;所述第二特征图所表征的声纹特性和所述声纹表征向量所表征的声纹特性相匹配;
    将所述语音分割模型中第2m+1个网络层输出的第二特征图,作为与所述声纹表征向量所表征的声纹特性相匹配的语音频谱特征分段;所述第2m+1个网络层为所述上采样子网络中的最后一个上采样层。
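  5. The method according to any one of claims 1-4, wherein any network layer of the speech segmentation model into which the attention mechanism is fused is denoted as a target network layer; the target network layer is the convolutional layer or the convolutional connection layer; and the fusion position of the attention mechanism in the target network layer is the position between the first convolutional network and the second convolutional network adjacent to the first convolutional network, among the multiple sequentially connected convolutional networks included in the target network layer;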
    所述基于每个所述网络层中融合的所述注意力机制,对所述声纹表征向量和相应所述网络层的第一特征图进行相关度计算,以得到相应所述网络层输出的第二特征图,包括:
    采用所述目标网络层中的首个所述卷积网络,对所述目标网络层的第一特征图进行特征提取处理,得到所述目标网络层的第三特征图;其中,所述目标网络层为所述特征提取子网络中层级分布的首个卷积层时,所述目标网络层的第一特征图为所述语音频谱特征;所述目标网络层为所述语音分割模型中除首个所述卷积层外的其他所述卷积层时,所述目标网络层的第一特征图是对与所述目标网络层相邻的上一层级所述网络层输出的特征图进行池化处理得到的;
    按照所述目标网络层中融合的所述注意力机制,对所述声纹表征向量和所述目标网络层的第三特征图进行相关度计算,得到所述目标网络层的第四特征图;所述第三特征图的特征维度和所述第四特征图的特征维度相同;
    采用所述目标网络层中除首个所述卷积网络外的其他卷积网络对所述第四特征图进行特征提取处理,得到所述目标网络层输出的第二特征图。
  6. 如权利要求1-5任一项所述的方法,其特征在于,所述语音分割模型中融合有注意力机制的任一网络层表示为目标网络层;所述目标网络层为所述上采样层;所述注意力机制在所述目标网络层中的融合位置为:所述目标网络层中顺序连接的多个卷积网络中的最后一个卷积网络之后的位置;
    所述基于每个所述网络层中融合的所述注意力机制,对所述声纹表征向量和相应所述网络层的第一特征图进行相关度计算,以得到相应所述网络层输出的第二特征图,包括:
    采用所述目标网络层中顺序连接的多个卷积网络,对目标特征图进行特征提取处理,得到所述目标网络层的第一特征图;所述目标特征图是将所述目标网络层在所述特征提取子网络中对应的卷积层输出的特征图,和所述目标网络层的上一层级网络层输出的特征图进行特征拼接得到的;
    采用所述目标网络层中融合的注意力机制,对所述声纹表征向量和所述目标网络层的第一特征图进行相关度计算,得到所述目标网络层输出的第二特征图;所述第二特征图的特征维度和所述第一特征图的特征维度相同。
  7. 如权利要求1-6任一项所述的方法,其特征在于,所述基于每个所述网络层中融合的所述注意力机制,对所述声纹表征向量和相应所述网络层的第一特征图进行相关度计算,以得到相应所述网络层输出的第二特征图之前,还包括:
    对所述声纹表征向量进行维度变换,得到维度变换后的所述声纹表征向量;
    其中,维度变换后的所述声纹表征向量的特征维度,与待输入至相应所述网络层中融合的注意力机制的特征图的特征维度相同。
  8. 如权利要求1-7任一项所述的方法,其特征在于,若所述语音分割模型中融合有所述注意力机制的网络层的数量大于数量阈值,则所述方法还包括:
    对所述语音分割模型进行模型蒸馏,得到模型蒸馏后的语音分割模型;
    其中,所述相关度计算由模型蒸馏后的所述语音分割模型实现。
  9. 如权利要求1-8任一项所述的方法,其特征在于,所述从所述参考语音数据中提取所述指定对象的声纹表征向量,包括:
    对所述参考语音数据进行分段处理,得到所述参考语音数据对应的多个语音数据分段;
    将所述参考语音数据从时域转换至频域,得到所述参考语音数据对应的参考语音频谱特征;
    对所述参考语音频谱特征进行分段处理,得到每个所述语音数据分段对应的参考语音频谱特征分段;
    分别基于每个所述语音数据分段和相应的所述参考语音频谱特征分段,对每个所述语音数据分段进行短时相关分析,得到每个所述语音数据分段对应的声纹语义特征向量;所述声纹语义特征向量用于表征所述语音数据分段的语义特性;
    对声纹语义特征向量序列进行长时相关分析,得到所述指定对象的声纹表征向量;所述声纹语义特征向量序列中包括每个所述语音数据分段对应的声纹语义特征向量。
  10. 如权利要求1-9任一项所述的方法,其特征在于,多个所述语音数据分段中的任一语音数据分段表示为目标语音数据分段;所述分别基于每个所述语音数据分段和相应的所述参考语音频谱特征分段,对每个所述语音数据分段进行短时相关分析,得到每个所述语音数据分段对应的声纹语义特征向量,包括:
    对所述目标语音数据分段进行特征提取处理,得到时域特征图;
    对所述目标语音数据分段对应的参考语音频谱特征分段进行特征提取处理,得到频域特征图;
    将所述时域特征图和所述频域特征图进行融合处理,生成所述目标语音数据分段对应的声纹语义特征向量。
  11. 如权利要求1-10任一项所述的方法,其特征在于,所述特征提取处理的次数为k次,k为大于1的整数;任一次特征提取处理表示为第i次特征提取处理;所述将所述时域特征图和所述频域特征图进行融合处理,生成所述目标语音数据分段对应的声纹语义特征向量,包括:
    当i=1时,将首次特征提取处理得到的中间时域特征图和中间频域特征图进行融合处理,生成所述首次特征提取处理后的第一中间特征向量;
    当1<i≤k时,将所述第i次特征提取处理得到的中间时域特征图和中间频域特征图,以及第i-1次特征提取处理得到的第i-1中间特征向量进行融合处理,生成所述第i次特征提取处理后的第i中间特征向量;
    基于i=k时第k次特征提取处理后的第k中间特征向量,生成所述目标语音数据分段对应的声纹语义特征向量。
  12. 一种语音处理装置,其特征在于,所述语音处理装置搭载于计算机设备,所述语音处理装置包括:
    获取单元,用于获取混叠语音数据,所述混叠语音数据中包含至少两个对象中每个所述对象产生的语音信号;
    所述获取单元,还用于获取指定对象的参考语音数据,所述指定对象是指所述至少两个对象中的任一个;所述参考语音数据中包含所述指定对象的参考语音信号;
    处理单元,用于从所述参考语音数据中提取所述指定对象的声纹表征向量,所述声纹表征向量用于表征所述指定对象的声纹特性;
    所述处理单元,还用于将所述混叠语音数据和所述声纹表征向量输入预设的语音分割模型,所述语音分割模型用于:基于注意力机制从所述混叠语音数据中分割出与所述声纹特性相匹配的目标语音信号;
    所述处理单元,还用于基于所分割出的所述目标语音信号,生成所述指定对象的语音文件。
  13. A computer device, comprising:
    处理器,适于执行计算机程序;
    计算机可读存储介质,所述计算机可读存储介质中存储有计算机程序,所述计算机程序被所述处理器执行时,实现如权利要求1-11任一项所述的语音处理方法。
  14. 一种计算机可读存储介质,其特征在于,所述计算机可读存储介质存储有计算机程序,所述计算机程序适于被处理器加载并执行如权利要求1-11任一项所述的语音处理方法。
  15. 一种计算机程序产品,其特征在于,所述计算机程序产品包括计算机程序,所述计算机程序被处理器执行时,实现如权利要求1-11任一项所述的语音处理方法。
PCT/CN2024/089862 2023-06-13 2024-04-25 一种语音处理方法、装置、设备、介质及程序产品 Ceased WO2024255461A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP24822416.4A EP4636761A4 (en) 2023-06-13 2024-04-25 PROCESS AND APPARATUS FOR SPEECH PROCESSING, DEVICE, SUPPORT, AND PRODUCT-PROGRAM
US19/257,189 US20250329334A1 (en) 2023-06-13 2025-07-01 Speech processing method and apparatus, device, and medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202310699993.5 2023-06-13
CN202310699993.5A CN119132328A (zh) 2023-06-13 2023-06-13 一种语音处理方法、装置、设备、介质及程序产品

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US19/257,189 Continuation US20250329334A1 (en) 2023-06-13 2025-07-01 Speech processing method and apparatus, device, and medium

Publications (2)

Publication Number Publication Date
WO2024255461A1 true WO2024255461A1 (zh) 2024-12-19
WO2024255461A9 WO2024255461A9 (zh) 2025-01-30

Family

ID=93748676

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2024/089862 Ceased WO2024255461A1 (zh) 2023-06-13 2024-04-25 一种语音处理方法、装置、设备、介质及程序产品

Country Status (4)

Country Link
US (1) US20250329334A1 (zh)
EP (1) EP4636761A4 (zh)
CN (1) CN119132328A (zh)
WO (1) WO2024255461A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN120260546A (zh) * 2025-06-03 2025-07-04 中国电子科技集团公司第二十八研究所 一种基于双模型动态触发的语音流切分方法
CN120496557A (zh) * 2025-03-05 2025-08-15 西安赛普特信息科技有限公司 一种双路轻量级时频域自适应神经网络模型及其使用方法

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108109619A (zh) * 2017-11-15 2018-06-01 中国科学院自动化研究所 基于记忆和注意力模型的听觉选择方法和装置
CN111429937A (zh) * 2020-05-09 2020-07-17 北京声智科技有限公司 语音分离方法、模型训练方法及电子设备
WO2022048239A1 (zh) * 2020-09-04 2022-03-10 华为技术有限公司 音频的处理方法和装置
CN115116448A (zh) * 2022-08-29 2022-09-27 四川启睿克科技有限公司 语音提取方法、神经网络模型训练方法、装置及存储介质
CN115376541A (zh) * 2022-07-13 2022-11-22 平安科技(深圳)有限公司 基于语音数据的角色分离方法和装置、设备、介质

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114329041B (zh) * 2021-11-17 2025-06-10 腾讯科技(深圳)有限公司 一种多媒体数据处理方法、装置以及可读存储介质

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP4636761A4 *

Also Published As

Publication number Publication date
EP4636761A1 (en) 2025-10-22
CN119132328A (zh) 2024-12-13
EP4636761A4 (en) 2026-04-29
WO2024255461A9 (zh) 2025-01-30
US20250329334A1 (en) 2025-10-23

Legal Events

Code: Title (Description)
121: Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 24822416; Country of ref document: EP; Kind code of ref document: A1)
WWE: Wipo information: entry into national phase (Ref document number: 2024822416; Country of ref document: EP)
ENP: Entry into the national phase (Ref document number: 2024822416; Country of ref document: EP; Effective date: 20250715)
WWP: Wipo information: published in national office (Ref document number: 2024822416; Country of ref document: EP)
NENP: Non-entry into the national phase (Ref country code: DE)