WO2024109375A1 - Training method, apparatus, device and medium for a speech conversion model - Google Patents
Training method, apparatus, device and medium for a speech conversion model
- Publication number
- WO2024109375A1 (application PCT/CN2023/124162)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- sample
- audio
- model
- conversion
- accent
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/003—Changing voice quality, e.g. pitch or formants
- G10L21/007—Changing voice quality, e.g. pitch or formants characterised by the process used
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/003—Changing voice quality, e.g. pitch or formants
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/24—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/003—Changing voice quality, e.g. pitch or formants
- G10L21/007—Changing voice quality, e.g. pitch or formants characterised by the process used
- G10L21/013—Adapting to target pitch
- G10L2021/0135—Voice conversion or morphing
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Definitions
- the embodiments of the present application relate to the field of audio processing technology, and in particular to a training method, device, equipment and medium for a speech conversion model.
- accent conversion is usually implemented with a voice conversion model, and training such a model requires a large amount of parallel corpus.
- a parallel corpus consists of audio recordings of the same spoken content in different accents.
- the embodiment of the present application provides a method, device, equipment and medium for training a speech conversion model, which can ensure the training quality of the speech conversion model while reducing the demand for manually recorded parallel corpus.
- the technical solution is as follows:
- an embodiment of the present application provides a method for training a speech conversion model, the method being executed by a computer device, comprising:
- a speech conversion model is generated based on the first ASR model, the second conversion model, and the third conversion model obtained through training, and the speech conversion model is used to convert audio in a first accent into audio in a second accent.
- an embodiment of the present application provides a speech conversion method, the method is performed by a computer device, a speech conversion model is set in the computer device, the speech conversion model includes a first ASR model, a second conversion model and a third conversion model, the method includes:
- the second content feature is converted into audio through the third conversion model to obtain second accent audio.
- an embodiment of the present application provides a training device for a speech conversion model, the device comprising:
- a training module configured to train a first ASR model based on a first sample audio, and to train a second ASR model based on a second sample audio, wherein the first sample audio corresponds to a first accent, and the second sample audio corresponds to a second accent;
- the training module is further used to train a first conversion model based on a first sample text and a first sample content feature corresponding to the first sample audio, wherein the first sample content feature is obtained by extracting the first sample audio by the first ASR model, and the first conversion model is used to convert the text into content features of the first accent;
- the training module is further used to construct parallel sample data based on the first conversion model, a second sample text corresponding to the second sample audio, and a second sample content feature, wherein the second sample content feature is extracted by the second ASR model from the second sample audio, and the parallel sample data includes different content features, wherein different content features correspond to different accents, and different content features correspond to the same text; a second conversion model is trained based on the parallel sample data, wherein the second conversion model is used to convert content features between the first accent and the second accent;
- the training module is further used to train a third conversion model based on sample content features of different sample audios, wherein the third conversion model is used to convert the content features into audio;
- a generation module is used to generate a speech conversion model based on the first ASR model, the second conversion model and the third conversion model obtained through training, wherein the speech conversion model is used to convert audio in a first accent into audio in a second accent.
- an embodiment of the present application provides a speech conversion device, wherein the device includes:
- an acquisition module configured to acquire a first accent audio, wherein the first accent audio corresponds to a first accent;
- an extraction module configured to extract a first content feature from the first accent audio by using the first ASR model, wherein the first content feature corresponds to the first accent;
- a content feature conversion module configured to convert the first content feature into a second content feature by using the second conversion model, wherein the second content feature corresponds to a second accent;
- an audio conversion module configured to perform audio conversion on the second content feature by using the third conversion model to obtain a second accent audio.
- an embodiment of the present application provides a computer device, which includes a processor and a memory, wherein the memory stores at least one instruction, and the at least one instruction is loaded and executed by the processor to implement the training method of the speech conversion model as described in the above aspects, or the speech conversion method as described in the above aspects.
- an embodiment of the present application provides a computer-readable storage medium, in which at least one instruction is stored, and the at least one instruction is loaded and executed by a processor to implement the training method of the speech conversion model as described in the above aspects, or the speech conversion method as described in the above aspects.
- an embodiment of the present application provides a computer program product, which includes computer instructions, and the computer instructions are stored in a computer-readable storage medium; a processor of a computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the training method of the speech conversion model as described in the above aspects, or the speech conversion method as described in the above aspects.
- in the embodiments of the present application, a first conversion model for converting text into content features is trained. The first conversion model and the second sample text corresponding to the second sample audio are then used to construct parallel sample data that shares the same text content but corresponds to different accents. The parallel sample data is used to train a second conversion model for converting content features between accents, and a third conversion model is trained for converting content features into audio, which completes the training of the speech conversion model.
- because the intermediate model obtained during training is used to construct the parallel corpus, there is no need to record parallel corpora of different accents before model training. This reduces the demand for manually recorded parallel corpus while maintaining training quality, which helps improve training efficiency and the quality of the trained model when samples are insufficient.
- FIG. 1 shows a schematic diagram of a speech conversion system provided by an exemplary embodiment of the present application.
- FIG. 2 shows a flow chart of a method for training a speech conversion model provided by an exemplary embodiment of the present application.
- FIG. 3 shows a flow chart of an accent conversion method provided by an exemplary embodiment of the present application.
- FIG. 4 is a schematic diagram of a voice setting interface shown in an exemplary embodiment of the present application.
- FIG. 5 is a schematic diagram of an implementation of an accent conversion process provided by an exemplary embodiment of the present application.
- FIG. 6 is a flowchart of a text-to-content feature process shown in an exemplary embodiment of the present application.
- FIG. 7 is a diagram of an FFT structure provided by an exemplary embodiment of the present application.
- FIG. 8 is a schematic structural diagram of a first conversion model shown in an exemplary embodiment of the present application.
- FIG. 9 is a flow chart of a second conversion model training process shown in an exemplary embodiment of the present application.
- FIG. 10 is a schematic diagram of the structure of a second conversion model shown in an exemplary embodiment of the present application.
- FIG. 11 is a schematic diagram of the structure of a third conversion model shown in an exemplary embodiment of the present application.
- FIG. 12 is a flowchart of a third conversion model training process shown in an exemplary embodiment of the present application.
- FIG. 13 is a schematic diagram of an implementation of an accent conversion process provided by another exemplary embodiment of the present application.
- FIG. 14 is a structural block diagram of a training device for a speech conversion model provided by an exemplary embodiment of the present application.
- FIG. 15 is a structural block diagram of a speech conversion device provided by an exemplary embodiment of the present application.
- FIG. 16 shows a schematic diagram of the structure of a computer device provided by an exemplary embodiment of the present application.
- the speech conversion model is composed of a first ASR model (for converting audio to text), a second conversion model (for converting content features between different accents) and a third conversion model (for converting content features to audio).
- the first conversion model for converting text to content features is trained, thereby constructing parallel sample data with the help of the first conversion model for subsequent training of the second conversion model and the third conversion model.
- parallel corpora are constructed with the help of the conversion models obtained through training, without the need to manually record a large amount of parallel corpora in advance, thereby reducing the dependence of the training process on parallel corpora and ensuring the quality of model training.
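The composition described above can be sketched as a chain of three callables. This is a minimal illustration, not the patent's implementation: the class name `SpeechConversionModel`, the type aliases, and the three `toy_*` stand-ins are hypothetical, and real components would be trained neural networks.

```python
from dataclasses import dataclass
from typing import Callable, List

Audio = List[float]            # hypothetical: raw audio samples
Features = List[List[float]]   # hypothetical: one content-feature vector per frame


@dataclass
class SpeechConversionModel:
    """Chains the three trained components named in the text."""
    extract_content: Callable[[Audio], Features]      # first ASR model
    convert_content: Callable[[Features], Features]   # second conversion model
    synthesize: Callable[[Features], Audio]           # third conversion model

    def convert(self, first_accent_audio: Audio) -> Audio:
        first_features = self.extract_content(first_accent_audio)
        second_features = self.convert_content(first_features)
        return self.synthesize(second_features)


# Toy stand-ins so the sketch runs; they are assumptions, not real models.
def toy_extract(audio: Audio) -> Features:
    # Group samples into frames of two values each.
    return [audio[i:i + 2] for i in range(0, len(audio), 2)]


def toy_convert(features: Features) -> Features:
    # Shift every value by one to mimic a mapping in feature space.
    return [[value + 1.0 for value in frame] for frame in features]


def toy_synthesize(features: Features) -> Audio:
    # Flatten the frames back into a sample sequence.
    return [value for frame in features for value in frame]


model = SpeechConversionModel(toy_extract, toy_convert, toy_synthesize)
second_accent_audio = model.convert([0.0, 1.0, 2.0, 3.0])
```

Keeping the three stages as separate callables mirrors the text: each stage is trained on its own and only assembled at the end.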
- it should be noted that the information (including but not limited to user device information and user personal information), data (including but not limited to data used for analysis, stored data and displayed data) and signals involved in this application are all authorized by the user or fully authorized by all parties, and the collection, use and processing of relevant data must comply with the relevant laws, regulations and standards of the relevant countries and regions.
- for example, the audio, accent and text involved in this application are all obtained with full authorization.
- the speech conversion model trained by the training method provided in the embodiment of the present application can be applied to various scenarios requiring accent conversion.
- FIG. 1 shows a schematic diagram of a speech conversion system according to an exemplary embodiment of the present application.
- the speech conversion system includes: an audio acquisition device 110, a terminal 120 and a server 130.
- the audio acquisition device 110 is a device for collecting user voice.
- the audio acquisition device 110 can be an earphone, a microphone, or an AR/VR device with a sound receiving function, etc., which is not limited in this embodiment of the present application.
- the audio acquisition device 110 is connected to the terminal 120 by wired or wireless means and transmits the collected user voice to the terminal 120, which then performs accent conversion processing on the user voice.
- the terminal 120 can be an electronic device such as a smart phone, a tablet computer, a personal computer, or a vehicle-mounted terminal.
- an application with an accent conversion function is provided in the terminal 120. Through the application, the user can set an accent conversion target, thereby converting the user's voice from an original voice to a target voice.
- in one possible implementation, the accent conversion is implemented locally by the terminal 120 (the speech conversion model is set in the terminal 120); in another possible implementation, the accent conversion is implemented by the terminal 120 with the aid of the server 130 (the speech conversion model is set in the server 130, and the terminal 120 transmits the accent conversion request to the server 130).
- the server 130 may be an independent physical server, or a server cluster or distributed system composed of multiple physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, content delivery networks (CDN), and big data and artificial intelligence platforms.
- the server 130 may be a background server that implements the accent conversion function, and is used to provide conversion services between different accents.
- multiple speech conversion models are provided in the server 130, and different speech conversion models are used to achieve conversion between different accents. For example, when supporting conversion of Mandarin into n local accents, n speech conversion models are provided in the server 130.
- the server 130 obtains accent corpora of different accents, where the accent corpora are composed of audio and corresponding text, so as to train a corresponding speech conversion model based on the accent corpora.
- before performing accent conversion, the user sets, through the terminal 120, the first accent to be converted into the second accent; the terminal 120 then sends an accent conversion request to the server 130, requesting that the server 130 use the corresponding speech conversion model (the one that converts the first accent into the second accent) to perform accent conversion.
- the audio acquisition device 110 transmits the collected user voice in the first accent to the terminal 120, and the terminal 120 transmits the user voice in the first accent to the server 130.
- the server 130 converts it into user voice in the second accent through the speech conversion model and feeds it back to the terminal 120 for further processing.
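The server-side selection of a speech conversion model per accent pair can be sketched as a lookup table. This is a hypothetical illustration: the class `AccentConversionServer`, its method names, and the accent labels are assumptions, and the lambdas stand in for trained models.

```python
from typing import Callable, Dict, List, Tuple

Audio = List[float]
ConversionModel = Callable[[Audio], Audio]


class AccentConversionServer:
    """Holds one conversion model per (source accent, target accent) pair."""

    def __init__(self) -> None:
        self._models: Dict[Tuple[str, str], ConversionModel] = {}

    def register(self, source: str, target: str, model: ConversionModel) -> None:
        self._models[(source, target)] = model

    def handle_request(self, source: str, target: str, audio: Audio) -> Audio:
        key = (source, target)
        if key not in self._models:
            raise LookupError(f"no speech conversion model for {source} -> {target}")
        # Run the model that was trained for this accent pair.
        return self._models[key](audio)


server = AccentConversionServer()
# Placeholder models; a real deployment would register trained networks.
server.register("mandarin", "dialect_a", lambda audio: [s * 0.5 for s in audio])
server.register("mandarin", "dialect_b", lambda audio: list(reversed(audio)))
converted = server.handle_request("mandarin", "dialect_b", [1.0, 2.0, 3.0])
```

Supporting conversion of Mandarin into n local accents then amounts to n `register` calls, one per target accent.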
- the terminal 120 processes the converted user voice differently depending on the application scenario.
- the following uses several exemplary application scenarios for illustration.
- after the terminal obtains the converted user voice, it merges the user voice with the produced content (such as a virtual human short video or a virtual human long video) to obtain virtual human content.
- the mouth of the virtual human can be controlled according to the converted user voice to improve the matching degree between the virtual human mouth movement and the voice.
- the terminal obtains the first accent audio of the real user, which corresponds to the real user's first accent, and extracts features from the first accent audio through the first ASR model in the speech conversion model to obtain the first content feature in the first accent. The terminal then converts the first content feature into the second content feature, which corresponds to the second accent, through the second conversion model in the speech conversion model. Finally, the terminal performs audio conversion on the second content feature through the third conversion model in the speech conversion model to obtain the second accent audio for the virtual human.
- the virtual anchor can pre-set the live broadcast accent through the accent setting interface.
- the terminal sends the user voice collected by the microphone to the server, and the server converts the user voice with the original accent into the user voice with the live broadcast accent and feeds it back to the terminal.
- the terminal merges the user voice with the live broadcast accent with the video stream containing the virtual anchor image, and then pushes the merged audio and video stream to each viewer client in the live broadcast room through the push stream server.
- the accent of the virtual anchor during live broadcast can be preset.
- the terminal collects the first accent audio corresponding to the real user through a microphone, and the first accent audio corresponds to the first accent of the real user.
- the terminal extracts features from the first accent audio through the first ASR model in the speech conversion model to obtain the first content feature in the first accent;
- the terminal converts the first content feature into the second content feature, which corresponds to the second accent, through the second conversion model in the speech conversion model; the terminal then performs audio conversion on the second content feature through the third conversion model in the speech conversion model to obtain the second accent audio for the virtual anchor, that is, the virtual anchor broadcasts live using the second accent audio.
- users can set the accent to be used when interacting in the Metaverse.
- the user's voice is collected by headsets, AR/VR devices and the like and transmitted to the terminal, which hands it over to the server for accent conversion.
- the server controls the virtual character in the Metaverse to play the converted accent audio to achieve voice interaction with other virtual characters.
- the second accent for interacting with other virtual characters can be pre-selected.
- the terminal collects the first accent audio corresponding to the real user through the microphone, and the first accent audio corresponds to the first accent of the real user.
- the terminal extracts features from the first accent audio through the first ASR model in the speech conversion model to obtain the first content feature in the first accent. The terminal then converts the first content feature into the second content feature, which corresponds to the second accent, through the second conversion model in the speech conversion model. Finally, the terminal performs audio conversion on the second content feature through the third conversion model in the speech conversion model to obtain the second accent audio for the virtual character in the metaverse, that is, the virtual character interacts with other virtual characters using the second accent audio.
- the above application scenarios are only exemplary descriptions.
- the speech conversion model trained by the method provided in the embodiments of the present application can also be used in real-world application scenarios such as voice calls (to facilitate voice communication between callers with different accents) and translation, and the embodiments of the present application do not constitute a limitation to this.
- for convenience of description, the following embodiments take as an example the case where the speech conversion model is trained and used by a computer device (which may be a terminal or a server) and is trained to convert a first accent into a second accent (schemes for converting other source speech into target speech are similar); this is not intended to be limiting.
- Fig. 2 shows a flow chart of a method for training a speech conversion model provided by an exemplary embodiment of the present application. The method is executed by a computer device and includes the following steps.
- Step 201: train a first ASR model based on a first sample audio, and train a second ASR model based on a second sample audio, wherein the first sample audio corresponds to a first accent and the second sample audio corresponds to a second accent.
- the first accent is the source accent and the second accent is the target accent; that is, the trained speech conversion model is used to convert speech in the first accent into speech in the second accent.
- the first sample audio corresponds to a first sample text, and the second sample audio corresponds to a second sample text.
- the first sample text does not need to be the same as the second sample text, so a public speech data set can be directly used for model training.
- for example, a computer device uses the WenetSpeech dataset as the first sample audio and the KeSpeech dataset as the second sample audio. The WenetSpeech dataset includes 10,000 hours of ASR data (see the introduction at https://zhuanlan.zhihu.com/p/424118791); the KeSpeech dataset contains ASR data of dialects from different regions (see the introduction at https://datasets-benchmarks-proceedings.neurips.cc/paper/2021/hash/0336dcbab05b9d5ad24f4333c7658a0e-Abstract-round2.html).
- a computer device inputs sample audio into the ASR model to obtain a predicted text output by the ASR model, thereby training the ASR model based on the predicted text and the sample text corresponding to the sample audio.
- the model architecture of the ASR model includes but is not limited to Wenet, wav2vec2, Kaldi, etc., which is not limited in the embodiments of the present application.
- Wenet is a speech recognition toolkit for industrial applications, open-sourced by the Mobvoi (Chumenwenwen) speech team and the speech laboratory of Northwestern Polytechnical University. It provides a one-stop path from training to deployment of speech recognition; for an introduction, see https://zhuanlan.zhihu.com/p/349586567. Wav2vec was proposed in a paper published at Interspeech 2019.
- Kaldi is an open source speech recognition toolkit that uses weighted finite-state transducers (WFST) to implement its decoding algorithm. Its main code is written in C++, with supporting tools written as bash and Python scripts; for an introduction, see https://zhuanlan.zhihu.com/p/84050431.
- the ASR model can be trained from scratch on the sample audio (suitable when a large amount of sample audio is available), or obtained by fine-tuning a pre-trained ASR model on the sample audio (suitable when only a small amount of sample audio is available).
- for example, the first ASR model is trained from scratch on the first sample audio, and the second ASR model is obtained by fine-tuning the first ASR model on the second sample audio.
- the trained ASR model is used to extract content features in speech.
- the content features are called BN (bottleneck) features; they are usually the features of the last layer of the ASR model, which retain the content of the speech and discard other characteristics such as timbre and pitch.
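Reading BN features off an ASR model can be sketched as returning the activations of the last hidden layer instead of the token scores. The two-layer `ToyASRModel` below, its weights, and its method names are hypothetical stand-ins; a real ASR model (for example one built with Wenet) is far larger and exposes its layers differently.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def matvec(matrix: Matrix, vector: Vector) -> Vector:
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def relu(vector: Vector) -> Vector:
    return [max(0.0, x) for x in vector]


class ToyASRModel:
    """Hypothetical two-layer model: audio frame -> hidden layer -> token scores."""

    def __init__(self, hidden_weights: Matrix, output_weights: Matrix) -> None:
        self.hidden_weights = hidden_weights
        self.output_weights = output_weights

    def forward(self, frame: Vector) -> Tuple[Vector, Vector]:
        bottleneck = relu(matvec(self.hidden_weights, frame))  # last hidden layer
        token_scores = matvec(self.output_weights, bottleneck)
        return bottleneck, token_scores

    def extract_bn(self, frames: List[Vector]) -> List[Vector]:
        # Keep only the hidden activations and drop the text prediction.
        return [self.forward(frame)[0] for frame in frames]


asr = ToyASRModel(
    hidden_weights=[[1.0, 0.0], [0.0, -1.0]],
    output_weights=[[1.0, 1.0], [1.0, -1.0], [0.0, 1.0]],
)
bn_features = asr.extract_bn([[2.0, 3.0], [-1.0, -4.0]])
```

The toy model makes no attempt to show why such features discard timbre and pitch; that property comes from training the network to predict text.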
- the training process of the first ASR model includes: the computer device inputs the first sample audio into the first ASR model for text extraction to obtain a first predicted text; the computer device calculates the loss function value between the first predicted text and the first sample text corresponding to the first sample audio; the computer device updates the model parameters of the first ASR model based on the loss function value between the first predicted text and the first sample text, thereby realizing the training of the first ASR model.
- the training process of the second ASR model includes: the computer device inputs the second sample audio into the second ASR model for text extraction to obtain a second predicted text; the computer device calculates the loss function value between the second predicted text and the second sample text corresponding to the second sample audio; the computer device updates the model parameters of the second ASR model based on the loss function value between the second predicted text and the second sample text, thereby realizing the training of the second ASR model.
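The predict, compute loss, update parameters cycle described for both ASR models can be sketched with a deliberately tiny classifier. This is an assumption-laden illustration: it uses a per-frame softmax classifier with cross-entropy loss and assumes each frame already has a known target token, whereas a real ASR model would use a sequence-level loss over unaligned audio and text. All function names are hypothetical.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(weights: List[List[float]], frame: List[float]) -> List[float]:
    logits = [sum(w * x for w, x in zip(row, frame)) for row in weights]
    return softmax(logits)


def train_step(weights, frames, token_ids, lr):
    """One gradient step; returns the mean cross-entropy loss before the update."""
    n_tokens, dim = len(weights), len(weights[0])
    grads = [[0.0] * dim for _ in range(n_tokens)]
    total_loss = 0.0
    for frame, target in zip(frames, token_ids):
        probs = predict(weights, frame)
        total_loss -= math.log(probs[target])      # loss: predicted vs. sample text
        for k in range(n_tokens):
            error = probs[k] - (1.0 if k == target else 0.0)
            for d in range(dim):
                grads[k][d] += error * frame[d]
    for k in range(n_tokens):                      # update the model parameters
        for d in range(dim):
            weights[k][d] -= lr * grads[k][d] / len(frames)
    return total_loss / len(frames)


frames = [[1.0, 0.0], [0.0, 1.0]]   # toy sample audio frames
token_ids = [0, 1]                  # toy sample text, one token per frame
weights = [[0.0, 0.0], [0.0, 0.0]]
losses = [train_step(weights, frames, token_ids, lr=0.5) for _ in range(50)]
```

Fine-tuning the second ASR model from the first would, in this sketch, correspond to starting `weights` from the first model's trained values rather than from zeros.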
- Step 202: train a first conversion model based on a first sample text corresponding to the first sample audio and a first sample content feature, wherein the first sample content feature is extracted from the first sample audio by the first ASR model, and the first conversion model is used to convert text into content features of the first accent.
- a data augmentation scheme is adopted so that content feature conversion can be learned from non-parallel corpora (that is, corpora that correspond to different accents and to different texts).
- the computer device extracts features from the first sample audio through the trained first ASR model to obtain first sample content features of the first sample audio, thereby training a first conversion model based on the first sample text corresponding to the first sample audio and the first sample content features.
- the first conversion model can be called a text content feature conversion model (Text2BN model), which is used to realize the conversion between text and source accent content features.
- the training process of the first conversion model includes: the computer device inputs the first sample text into the first conversion model to obtain the first predicted content feature; the computer device extracts features from the first sample audio through the first ASR model to obtain the first sample content feature; the computer device calculates the loss function value between the first predicted content feature and the first sample content feature; the computer device updates the model parameters of the first conversion model based on that loss function value, thereby training the first conversion model.
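The Text2BN training procedure above can be sketched as a regression from text to content features. The sketch makes strong simplifying assumptions that are not in the source: the model is a lookup table with one feature vector per token, each token maps to exactly one feature frame, and the loss is mean squared error. The names `train_text2bn` and `text2bn` and the toy data are hypothetical.

```python
from typing import Dict, List, Tuple

Features = List[List[float]]


def train_text2bn(pairs: List[Tuple[List[str], Features]], dim: int,
                  lr: float = 0.5, epochs: int = 50):
    """Fit one feature vector per token by gradient descent on squared error."""
    table: Dict[str, List[float]] = {}
    history = []
    for _ in range(epochs):
        total, count = 0.0, 0
        for tokens, target_features in pairs:
            for token, target in zip(tokens, target_features):
                predicted = table.setdefault(token, [0.0] * dim)
                diff = [p - t for p, t in zip(predicted, target)]
                total += sum(d * d for d in diff) / dim       # loss for this frame
                count += 1
                # Gradient of the mean squared error with respect to the prediction.
                table[token] = [p - lr * 2.0 * d / dim
                                for p, d in zip(predicted, diff)]
        history.append(total / count)
    return table, history


def text2bn(table: Dict[str, List[float]], tokens: List[str]) -> Features:
    return [table[token] for token in tokens]


# Toy first sample text and the content features the first ASR model
# is assumed to have extracted from the matching first sample audio.
pairs = [(["ni", "hao"], [[1.0, 0.0], [0.0, 1.0]])]
table, history = train_text2bn(pairs, dim=2)
first_accent_features = text2bn(table, ["hao", "ni"])
```

A real first conversion model would also have to predict how many feature frames each token spans, which this one-frame-per-token table ignores.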
- Step 203: construct parallel sample data based on the first conversion model, the second sample text corresponding to the second sample audio, and the second sample content feature, wherein the second sample content feature is extracted from the second sample audio by the second ASR model, and the parallel sample data includes different content features that correspond to different accents and to the same text.
- the computer device performs text conversion on the second sample text corresponding to the second sample audio based on the first conversion model to obtain content features of the first accent corresponding to the second sample text; the computer device aggregates the content features of the first accent corresponding to the second sample text and the content features of the second accent corresponding to the second sample text to obtain parallel sample data.
- the content features of the second accent corresponding to the second sample text are extracted by a second ASR model.
- after the first conversion model is trained, the computer device performs data augmentation based on the second sample text corresponding to the second sample audio and the first conversion model, and constructs parallel sample data from the second sample content features and the first-accent content features obtained by data augmentation.
- the parallel sample data includes the content features of the first accent corresponding to the same text (generated by the first conversion model) and the content features of the second accent (extracted by the second ASR model).
- the computer device can construct parallel sample data corresponding to text A based on the first conversion model, text A and the dialect sample content features of the dialect sample audio corresponding to text A, and the parallel sample data includes the Mandarin and dialect content features corresponding to text A.
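The construction of parallel sample data described above can be sketched as follows. This is a minimal illustration, not the implementation of the application: the function name `build_parallel_samples`, the dictionary keys, and the stand-in `toy_text2bn` (which replaces the trained first conversion model) are all assumptions, and content features are modeled as plain lists of frames.

```python
from typing import Callable, Dict, List, Sequence

Feature = List[List[float]]  # frames x feature dimensions


def build_parallel_samples(
    second_sample_texts: Sequence[str],
    second_sample_features: Sequence[Feature],
    text_to_first_accent: Callable[[str], Feature],
) -> List[Dict[str, object]]:
    """Pair generated first-accent features with extracted second-accent features."""
    if len(second_sample_texts) != len(second_sample_features):
        raise ValueError("each second sample text needs one second sample content feature")
    parallel = []
    for text, second_feature in zip(second_sample_texts, second_sample_features):
        parallel.append({
            "text": text,
            # Generated by the (stand-in) first conversion model.
            "first_accent": text_to_first_accent(text),
            # Extracted beforehand by the second ASR model.
            "second_accent": second_feature,
        })
    return parallel


def toy_text2bn(text: str) -> Feature:
    # Hypothetical stand-in for the trained first conversion model:
    # one 2-dimensional frame per character.
    return [[float(ord(ch) % 7), 1.0] for ch in text]


pairs = build_parallel_samples(["ab"], [[[0.5, 0.5], [0.1, 0.9]]], toy_text2bn)
```

Each entry holds two content features for the same text, one per accent, which is the property the second conversion model relies on during training.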
- Step 204: train a second conversion model based on the parallel sample data, where the second conversion model is used to convert content features between the first accent and the second accent.
- the computer device trains a second conversion model based on parallel sample data corresponding to the same text.
- the second conversion model can be called a content feature conversion model (BN2BN model), which is used to convert the content features of the source accent into the content features of the target accent.
- the BN2BN model implements the accent transfer task; an introduction is available at https://zhuanlan.zhihu.com/p/586037409.
- the computer device trains a second conversion model that converts the content features of Mandarin into the content features of the dialect.
- the sample content features corresponding to the first sample audio and the second sample audio can be directly used to train the second conversion model.
- the training process of the second conversion model includes: the computer device extracts features from the second sample audio through the second ASR model to obtain the second sample content feature; the computer device converts the second sample text corresponding to the second sample audio through the first conversion model to obtain the third sample content feature, where the third sample content feature refers to the content feature of the audio generated by expressing the second sample text in the first accent; the computer device inputs the third sample content feature into the second conversion model to obtain the second predicted content feature; the computer device calculates the loss function value between the second predicted content feature and the second sample content feature; the computer device updates the model parameters of the second conversion model based on this loss function value, thereby training the second conversion model.
- Step 205: train a third conversion model based on sample content features of different sample audios, where the third conversion model is used to convert content features into audio.
- the third conversion model can be called a content audio conversion model, which generates the audio of the target speech from the content features of the target speech.
- the third conversion model may include an acoustic model and a vocoder, wherein the acoustic model is used to generate an audio spectrum based on the content feature, and the vocoder is used to generate audio based on the audio spectrum.
- the samples for training the third conversion model may be sample audios of various accents.
- the training of the third conversion model can be performed once the training of the ASR models is completed; that is, the third conversion model can be trained in parallel with the first and second conversion models.
- the embodiment of the present application does not limit the training sequence of the model.
- the training process of the third conversion model includes: the computer device inputs the sample content features and the speaker identifier corresponding to the sample audio into the third conversion model to generate audio and obtain predicted audio; the computer device calculates the loss function value between the predicted audio and the sample audio; the computer device updates the model parameters of the third conversion model based on the loss function value between the predicted audio and the sample audio, thereby realizing the training of the third conversion model.
- Step 206: generate a speech conversion model based on the trained first ASR model, second conversion model, and third conversion model, where the speech conversion model is used to convert audio in the first accent into audio in the second accent.
- after the first ASR model, the second conversion model and the third conversion model are trained through the above steps, the computer device combines these models to obtain the final speech conversion model.
- the models are spliced in the order first ASR model → second conversion model → third conversion model; that is, the output of the first ASR model is input into the second conversion model, and the output of the second conversion model is input into the third conversion model.
- the speech conversion model trained to convert Mandarin into a dialect consists of a Mandarin ASR model, a Mandarin-dialect content conversion model, and a content-audio conversion model.
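The splicing order can be illustrated with a small composition sketch. The three callables below are toy stand-ins, not trained models, and the name `make_speech_conversion_model` is hypothetical; the speaker identifier that the third conversion model may also receive is omitted here for brevity.

```python
from typing import Callable, List

Feature = List[List[float]]  # frames x feature dimensions
Audio = List[float]          # waveform samples


def make_speech_conversion_model(
    first_asr: Callable[[Audio], Feature],
    second_conversion: Callable[[Feature], Feature],
    third_conversion: Callable[[Feature], Audio],
) -> Callable[[Audio], Audio]:
    """Chain the three trained models in the order described in the text."""
    def convert(first_accent_audio: Audio) -> Audio:
        first_content = first_asr(first_accent_audio)      # audio -> first-accent features
        second_content = second_conversion(first_content)  # first -> second accent features
        return third_conversion(second_content)            # features -> audio
    return convert


# Toy stand-ins so the sketch runs end to end (assumptions, not real models).
toy_asr = lambda audio: [[sample] for sample in audio]
toy_bn2bn = lambda feats: [[value + 1.0 for value in frame] for frame in feats]
toy_bn2wav = lambda feats: [value for frame in feats for value in frame]

convert = make_speech_conversion_model(toy_asr, toy_bn2bn, toy_bn2wav)
second_accent_audio = convert([0.0, 0.5])
```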
- in the embodiments of the present application, a first conversion model for converting text into content features is trained. The first conversion model and the second sample text corresponding to the second sample audio are then used to construct parallel sample data that contains the same text content in different accents. The parallel sample data is used to train a second conversion model for converting content features between different accents, and a third conversion model for converting content features into audio is also trained, which completes the training of the speech conversion model. Because the intermediate model obtained during training is used to construct the parallel corpora, there is no need to record parallel corpora of different accents before model training. While the quality of model training is ensured, the demand for manually recorded parallel corpora is reduced, which helps improve the efficiency of model training and the training quality of the model when samples are insufficient.
- the speech conversion method can be implemented by using the speech conversion model, and the speech conversion method is executed by a computer device, and the speech conversion model includes a first ASR model, a second conversion model, and a third conversion model.
- when performing speech conversion, the computer device obtains a first accent audio, which corresponds to a first accent; the computer device extracts features from the first accent audio through the first ASR model to obtain a first content feature, which corresponds to the first accent; the computer device converts the first content feature into a second content feature through the second conversion model, where the second content feature corresponds to the second accent; the computer device performs audio conversion on the second content feature through the third conversion model to obtain a second accent audio, thereby completing the speech conversion.
- after receiving the first accent audio, the computer device extracts content features through the first ASR model in the speech conversion model to obtain the first content feature.
- the computer device inputs the first content features extracted by the first ASR model into the second conversion model, and the second conversion model converts the content features between the first accent and the second accent to obtain the second content features in the second accent.
- the first content feature and the second content feature correspond to the same text (both are texts corresponding to the first accent audio).
- the second conversion model includes a convolution layer and N stacked FFT layers; after the computer device performs convolution processing on the first content feature through the convolution layer in the second conversion model, the convolution result is input into the N stacked FFT layers for conversion to obtain the second content feature.
- the computer device inputs the second content feature and the speaker identifier of the speaker corresponding to the target timbre into a third conversion model to obtain the second accent audio.
- the third conversion model includes a third conversion sub-model and a vocoder, wherein the third conversion sub-model is used to convert the content features into audio spectrum features, and the vocoder is used to generate audio based on the audio spectrum features.
- the third conversion sub-model includes a convolution layer and N stacked FFT layers.
- the audio spectrum features may be Mel spectrum features, MFCC (Mel Frequency Cepstrum Coefficient) features, etc., which is not limited in the embodiments of the present application.
- the vocoder may be an autoregressive WaveNet or WaveRNN, or a non-autoregressive HiFi-GAN or MelGAN, etc., which is not limited in the embodiments of the present application.
- the following description takes Mel spectrum features as the audio spectrum features and HiFi-GAN as the vocoder as an example, but this is not a limitation.
- the computer device inputs the second content feature and the speaker identifier into a third conversion sub-model to obtain an audio spectrum feature; the computer device inputs the audio spectrum feature into a vocoder to obtain a second accent audio.
- Figure 3 shows a flow chart of an accent conversion method provided by an exemplary embodiment of the present application. The method is executed by a computer device and includes the following steps.
- Step 301: in response to an accent conversion instruction, extract a first content feature of a first accent audio through a first ASR model, where the first content feature corresponds to a first accent, and the accent conversion instruction is used to instruct conversion of the audio from the first accent to a second accent.
- the accent conversion instruction is triggered after the accent setting is completed.
- in addition to the virtual character image setting option, the Metaverse virtual character setting interface 41 also includes a voice setting option.
- the user can set the timbre and accent of the virtual character through the voice setting option.
- the user can enter the Metaverse by triggering the enter button 42.
- the computer device receives the accent conversion instruction, which includes the accent identifiers of the source accent and the target accent.
- this embodiment takes the case where the source accent is the first accent and the target accent is the second accent as an example.
- after receiving the first accent audio in the first accent, the computer device extracts content features through the first ASR model in the speech conversion model to obtain a first content feature, which eliminates interference such as timbre and pitch and retains only the features at the level of the expressed content.
- the computer device uses the last layer BN feature of the first ASR model as the first content feature.
- when it is necessary to convert Mandarin into a dialect, the computer device extracts features of Mandarin audio 51 through a Mandarin ASR model 52 to obtain Mandarin content features 53.
- Step 302: convert the first content feature into a second content feature using a second conversion model, where the second content feature corresponds to a second accent.
- the computer device inputs the first content feature extracted by the first ASR model into the second conversion model, and the second conversion model converts the content feature between the first accent and the second accent to obtain the second content feature under the second accent.
- the first content feature and the second content feature correspond to the same text (both are texts corresponding to the audio of the first accent).
- the BN2BN model 54 is used to convert content features between Mandarin and dialect. After obtaining the Mandarin content features 53 , the computer device further performs feature conversion on the Mandarin content features 53 through the BN2BN model 54 to obtain the dialect content features 55 .
- Step 303: perform audio conversion on the second content feature through a third conversion model to obtain second accent audio.
- the computer device inputs the second content feature into a third conversion model, and the third conversion model generates a second accent audio based on the content feature.
- the computer device inputs the dialect content feature 55 into the BN2Wav model 56 to obtain the dialect audio 57 output by the BN2Wav model 56 .
- the first conversion model serves as a key model for constructing parallel sample data.
- the computer device inputs the first sample text into the first conversion model to obtain the first predicted content feature output by the first conversion model, thereby training the first conversion model with the first sample content feature as the supervision of the first predicted content feature.
- the computer device uses the first sample content feature as the supervision of the first predicted content feature, determines the first conversion model loss based on the feature difference between the first predicted content feature and the first sample content feature, and trains the first conversion model based on the first conversion model loss.
- the loss may be an MSE (Mean Square Error) loss or other types of losses, which are not limited in this embodiment.
- the mean square error refers to the mean of the squared feature differences between the first predicted content feature and the first sample content feature, that is, the mean of the squared errors.
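- the loss of the first conversion model, $Loss_{Text2BN}$, can be expressed as:
- $$Loss_{Text2BN} = \mathrm{MSE}\left(BN_{na}, \widehat{BN}_{na}\right)$$
- where $BN_{na}$ is the first sample content feature extracted by the first ASR model, and $\widehat{BN}_{na}$ is the first predicted content feature output by the first conversion model.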
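As a minimal sketch of the mean square error described above, the following computes the loss between a predicted content feature and a sample content feature. Representing a content feature as a list of frames and the function name `mse_loss` are assumptions made for illustration only.

```python
from typing import List

Feature = List[List[float]]  # frames x feature dimensions


def mse_loss(predicted: Feature, target: Feature) -> float:
    """Mean of the squared element-wise differences between two features."""
    if len(predicted) != len(target):
        raise ValueError("predicted and target must have the same number of frames")
    total = 0.0
    count = 0
    for p_frame, t_frame in zip(predicted, target):
        if len(p_frame) != len(t_frame):
            raise ValueError("frames must have the same dimension")
        for p, t in zip(p_frame, t_frame):
            total += (p - t) ** 2
            count += 1
    if count == 0:
        raise ValueError("features must not be empty")
    return total / count


# Squared differences are 0, 4, 0, 4, so the mean is 2.0.
loss = mse_loss([[1.0, 2.0], [3.0, 4.0]], [[1.0, 0.0], [3.0, 2.0]])
```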
- the first conversion model includes a first conversion sub-model, a duration prediction sub-model and a second conversion sub-model, wherein the first conversion sub-model is used to realize the conversion between text and text encoding features, the duration prediction sub-model is used to predict the expression duration of the text, and the second conversion sub-model is used to convert the text encoding features into content features.
- Step 601: encode the first sample text through the first conversion sub-model to obtain a first text encoding feature.
- an N-layer stacked FFT (Feed Forward Transformer) is used to form the first conversion sub-model.
- the FFT is used to map the data to a high-dimensional space and then to a low-dimensional space through linear transformation, so as to extract deeper features.
- the FFT includes a multi-head attention mechanism layer and a convolution layer.
- the FFT structure is shown in FIG. 7.
- the original input is first processed by the multi-head attention layer 701; the multi-channel results obtained by the multi-head attention layer 701 and the original input then undergo weighting and normalization 702, and the result is input to the convolution layer 703 for convolution processing.
- the input and output of the convolution layer 703 are added and then undergo weighting and normalization 702 to produce the final output.
- the first conversion sub-model obtained by superimposing multiple layers of FFT is used for text encoding to improve the quality of text encoding.
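The data flow through one FFT (Feed Forward Transformer) layer can be sketched as below. This sketch assumes that the "weighting and normalization" step corresponds to a residual addition followed by layer normalization, as in common transformer blocks; the source does not confirm this detail. The attention and convolution sub-layers are passed in as callables, and the zero-output stand-ins exist only so that the wiring can be run.

```python
import math
from typing import Callable, List

Frames = List[List[float]]  # frames x feature dimensions


def layer_norm(frame: List[float], eps: float = 1e-5) -> List[float]:
    """Normalize one frame to zero mean and (approximately) unit variance."""
    mean = sum(frame) / len(frame)
    var = sum((v - mean) ** 2 for v in frame) / len(frame)
    return [(v - mean) / math.sqrt(var + eps) for v in frame]


def add_and_norm(x: Frames, sublayer_out: Frames) -> Frames:
    """Add the sub-layer output to its input, then normalize each frame."""
    return [
        layer_norm([a + b for a, b in zip(x_frame, s_frame)])
        for x_frame, s_frame in zip(x, sublayer_out)
    ]


def fft_block(
    x: Frames,
    attention: Callable[[Frames], Frames],
    convolution: Callable[[Frames], Frames],
) -> Frames:
    attended = add_and_norm(x, attention(x))                # attention, then add and normalize
    return add_and_norm(attended, convolution(attended))    # convolution, then add and normalize


# Stand-in sub-layers that output zeros (assumption, for demonstration only).
zeros = lambda frames: [[0.0 for _ in frame] for frame in frames]
out = fft_block([[1.0, 2.0, 3.0], [4.0, 6.0, 8.0]], zeros, zeros)
```

The output keeps the same number of frames and the same feature dimension as the input, which is what allows N such layers to be stacked.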
- the first conversion sub-model can also be implemented using other types of modules such as LSTM (Long Short-Term Memory) (which needs to include an attention mechanism and keep the input and output sizes consistent), which is not limited in the embodiments of the present application.
- Step 602: perform duration prediction on the first text encoding feature through the duration prediction sub-model to obtain a predicted duration, where the predicted duration represents the pronunciation duration of the first sample text.
- since spoken text has a certain duration, in order to improve the authenticity of the audio obtained through subsequent conversion (so that the converted speech matches a real person's speaking speed), the computer device performs duration prediction through the duration prediction sub-model to obtain the pronunciation duration of the first sample text.
- the predicted duration includes the pronunciation sub-durations corresponding to each sub-text in the first sample text. For example, if the first sample text is "The weather is really good today", the predicted duration includes the pronunciation durations corresponding to "today", "day", "weather", "air", "real", and "good".
- Step 603: expand the first text encoding feature based on the predicted duration to obtain a second text encoding feature.
- the computer device performs feature expansion on the first text encoding feature based on the predicted duration, and copies the sub-features in the first text encoding feature so that the duration corresponding to the copied sub-features is consistent with the pronunciation sub-duration of the corresponding sub-text.
- for example, if the first text encoding feature is "abcd", the second text encoding feature after feature expansion may be "aabbbcdddd".
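The feature expansion in Step 603 can be sketched as a simple repetition of each sub-feature. The function name `expand_by_duration` and the durations `[2, 3, 1, 4]` (expressed as frame counts) are assumptions inferred from the "abcd" example; in practice the durations would come from the duration prediction sub-model.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def expand_by_duration(sub_features: Sequence[T], durations: Sequence[int]) -> List[T]:
    """Repeat each sub-feature as many times as its predicted duration (in frames)."""
    if len(sub_features) != len(durations):
        raise ValueError("one duration is required per sub-feature")
    expanded: List[T] = []
    for feature, frames in zip(sub_features, durations):
        if frames < 0:
            raise ValueError("durations must be non-negative")
        expanded.extend([feature] * frames)
    return expanded


# Each character stands for one sub-feature of the first text encoding feature.
expanded = "".join(expand_by_duration("abcd", [2, 3, 1, 4]))
```

With these assumed durations, `expanded` is `"aabbbcdddd"`, matching the example in the text.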
- Step 604: convert the second text encoding feature into a first predicted content feature through a second conversion sub-model.
- the feature size of the first predicted content feature output by the second conversion sub-model is consistent with the feature size of the second text encoding feature input to the second conversion sub-model.
- the second conversion sub-model includes N layers of FFT to improve the conversion quality of text encoding features to content features.
- the first conversion sub-model 81 first performs feature encoding on the first sample text to obtain the first text encoding feature, and inputs the first text encoding feature into the duration prediction sub-model 82 to obtain the predicted duration, and performs feature expansion processing on the first text encoding feature based on the predicted duration to obtain the second text encoding feature.
- the second conversion sub-model 83 performs feature conversion on the second text encoding feature to obtain the first predicted content feature.
- the process may include the following steps.
- Step 901: convert the second sample text through the first conversion model to obtain a third sample content feature, where the third sample content feature refers to the content feature of the audio generated by expressing the second sample text in the first accent.
- when constructing parallel sample data based on the second sample audio, the computer device performs content feature conversion on the second sample text corresponding to the second sample audio. Since the first conversion model converts text into content features of the first accent, applying it to the second sample text yields the third sample content feature, that is, the content feature of the audio that would be generated by expressing the second sample text in the first accent.
- in this way, the content features of the parallel corpus can be generated directly, eliminating the need to manually record the parallel corpus and extract content features from it.
- Step 902: construct parallel sample data based on the second sample content feature and the third sample content feature.
- Step 903: input the third sample content feature into the second conversion model to obtain a second predicted content feature.
- the second conversion model includes a convolution layer and N stacked FFT layers; the specific structure of the FFT is shown in FIG. 7 and is not described again here.
- the content feature is first processed by the convolution layer, and then processed by the N layers of FFT to obtain the converted content feature.
- the convolution result is input into the N-layer FFT 1002 to obtain the second predicted content feature.
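The forward pass of the second conversion model (a convolution layer followed by N stacked FFT layers) can be sketched as follows. This is a simplified illustration under stated assumptions: the convolution is applied per feature dimension along the time axis with zero padding and, as is usual in deep learning frameworks, without flipping the kernel; a real convolution layer would also mix feature channels. The FFT layers are passed in as callables, and identity functions are used as stand-ins.

```python
from typing import Callable, List, Sequence

Feature = List[List[float]]  # frames x feature dimensions


def conv1d_same(x: Feature, kernel: Sequence[float]) -> Feature:
    """Per-dimension 1-D convolution along time with zero padding (same length)."""
    if len(kernel) % 2 == 0:
        raise ValueError("kernel length must be odd to keep the frame count")
    half = len(kernel) // 2
    dims = len(x[0]) if x else 0
    out: Feature = []
    for t in range(len(x)):
        frame = []
        for d in range(dims):
            acc = 0.0
            for k, weight in enumerate(kernel):
                src = t + k - half
                if 0 <= src < len(x):  # positions outside the sequence count as zero
                    acc += weight * x[src][d]
            frame.append(acc)
        out.append(frame)
    return out


def bn2bn_forward(
    x: Feature,
    kernel: Sequence[float],
    fft_layers: Sequence[Callable[[Feature], Feature]],
) -> Feature:
    hidden = conv1d_same(x, kernel)   # convolution layer
    for layer in fft_layers:          # N stacked FFT layers
        hidden = layer(hidden)
    return hidden


identity = lambda feats: feats  # stand-in for a trained FFT layer (assumption)
converted = bn2bn_forward([[1.0], [2.0], [3.0]], [0.25, 0.5, 0.25], [identity, identity])
```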
- Step 904: train the second conversion model using the second sample content feature as supervision for the second predicted content feature.
- the computer device determines the second conversion model loss based on the difference between the second sample content feature and the second predicted content feature, thereby training the second conversion model based on the second conversion model loss.
- the loss may be an MSE loss or other types of losses, which is not limited in this embodiment.
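- the loss of the second conversion model, $Loss_{BN2BN}$, can be expressed as:
- $$Loss_{BN2BN} = \mathrm{MSE}\left(BN_{ac}, \widehat{BN}_{ac}\right)$$
- where $BN_{ac}$ is the second sample content feature extracted by the second ASR model, and $\widehat{BN}_{ac}$ is the second predicted content feature output by the second conversion model.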
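Steps 903 and 904 amount to supervised regression from the third sample content feature to the second sample content feature. The toy sketch below illustrates that training loop with a deliberately tiny model: a single scalar scale `w` and offset `b` stand in for the second conversion model, and plain gradient descent on an MSE loss is assumed (the source does not specify an optimizer). All names and numeric values are hypothetical.

```python
from typing import List, Tuple

Feature = List[List[float]]  # frames x feature dimensions


def predict(source: Feature, w: float, b: float) -> Feature:
    """Toy stand-in for the second conversion model: element-wise w * x + b."""
    return [[w * v + b for v in frame] for frame in source]


def train_step(
    source: Feature, target: Feature, w: float, b: float, lr: float
) -> Tuple[float, float, float]:
    """One gradient-descent step on the MSE between prediction and target."""
    predicted = predict(source, w, b)
    count = sum(len(frame) for frame in source)
    loss = grad_w = grad_b = 0.0
    for s_frame, p_frame, t_frame in zip(source, predicted, target):
        for s, p, t in zip(s_frame, p_frame, t_frame):
            error = p - t
            loss += error ** 2
            grad_w += 2.0 * error * s
            grad_b += 2.0 * error
    return w - lr * grad_w / count, b - lr * grad_b / count, loss / count


third_sample = [[0.0, 1.0], [2.0, 3.0]]   # generated first-accent features (input)
second_sample = [[1.0, 3.0], [5.0, 7.0]]  # extracted second-accent features (supervision)
w, b = 0.0, 0.0
losses = []
for _ in range(200):
    w, b, loss = train_step(third_sample, second_sample, w, b, lr=0.05)
    losses.append(loss)
```

In this toy setup the supervision equals `2 * x + 1`, so the loss shrinks as `w` approaches 2 and `b` approaches 1.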
- the speaker identifier of the sample audio needs to be taken as part of the input so that the trained third conversion model can output audio with a specific timbre.
- the computer device inputs the sample content feature and the speaker identifier corresponding to the sample audio into the third conversion model to obtain the predicted audio, thereby training the third conversion model based on the predicted audio and the sample audio.
- the predicted audio and the sample audio correspond to the same audio content and have the same timbre.
- different speakers correspond to different speaker identifiers.
- alternatively, speakers are pre-classified into different timbres, so that the same speaker identifier is assigned to different speakers corresponding to the same timbre.
- the third conversion model includes a third conversion sub-model and a vocoder, wherein the third conversion sub-model is used to convert content features into audio spectrum features, and the vocoder is used to generate audio based on the audio spectrum features.
- the third conversion sub-model includes a convolution layer and N stacked FFT layers.
- the audio spectrum features may be Mel spectrum features, MFCC (Mel Frequency Cepstrum Coefficient) features, etc., which is not limited in the embodiments of the present application.
- the vocoder may be an autoregressive WaveNet or WaveRNN, or a non-autoregressive HiFi-GAN or MelGAN, etc., which is not limited in the embodiments of the present application.
- the following description takes Mel spectrum features as the audio spectrum features and HiFi-GAN as the vocoder as an example, but this is not a limitation.
- the computer device inputs the sample content features and the speaker identifier into the third conversion sub-model to obtain the predicted audio spectrum features, and then inputs the predicted audio spectrum features into the vocoder to obtain the predicted audio.
- the BN2Wav model includes a BN2Mel sub-model 1101 and a HiFi-GAN sub-model 1102, wherein the BN2Mel sub-model 1101 includes a convolution layer 11011 and an N-layer stacked FFT 11012.
- the computer device inputs the sample content feature BN and the speaker identifier spk_id of the sample audio into the BN2Mel sub-model 1101. The BN2Mel sub-model 1101 inputs the converted Mel spectrum into the HiFi-GAN sub-model 1102, which converts the Mel spectrum into the predicted audio.
- in one possible implementation, the computer device jointly trains the third conversion sub-model and the vocoder.
- in another possible implementation, the computer device first trains the third conversion sub-model and then trains the vocoder based on the trained third conversion sub-model, so as to improve the training efficiency.
- the training process of the third conversion model may include the following steps.
- Step 1201: input the sample content features and the speaker identifier into the third conversion sub-model to obtain predicted audio spectrum features.
- the computer device inputs the sample content feature and the speaker identifier into the third conversion sub-model to obtain a predicted Mel spectrum corresponding to the sample audio.
- Step 1202: train the third conversion sub-model using the sample audio spectrum features of the sample audio as supervision for the predicted audio spectrum features.
- the computer device extracts audio spectrum features from the sample audio to obtain sample audio spectrum features, determines a third conversion sub-model loss based on the difference between the predicted audio spectrum features and the sample audio spectrum features, and trains the third conversion sub-model based on this loss.
- the loss may be an MSE loss or other types of losses, which is not limited in this embodiment.
- the loss of the third conversion sub-model, $Loss_{BN2Mel}$, can be expressed as:
- $$Loss_{BN2Mel} = \mathrm{MSE}\left(Mel, \widehat{Mel}\right)$$
- where $Mel$ is the sample audio spectrum feature obtained by directly performing audio spectrum feature extraction on the sample audio, and $\widehat{Mel}$ is the predicted audio spectrum feature output by the third conversion sub-model.
- Step 1203: when the training of the third conversion sub-model is completed, input the predicted audio spectrum features output by the trained third conversion sub-model into the vocoder to obtain the predicted audio.
- after completing the training of the third conversion sub-model, the computer device inputs the sample content features and the speaker identifier into the trained third conversion sub-model to obtain the predicted audio spectrum features, and then inputs the predicted audio spectrum features into the vocoder to obtain the predicted audio output by the vocoder.
- the computer device inputs the predicted Mel spectrum features output by the trained BN2Mel sub-model into HiFi-GAN to obtain the predicted audio output by HiFi-GAN.
- Step 1204: train the vocoder in the third conversion model based on the predicted audio and the sample audio.
- the computer device determines the conversion loss of the vocoder using the sample audio as supervision for the predicted audio, and trains the vocoder based on this loss.
- when the vocoder is an adversarial network, taking HiFi-GAN as an example, the computer device follows the adversarial training idea and trains the generator and the discriminator against each other.
- the loss of the generator involves the following terms: a Mel spectrum term that compares the Mel spectrum features obtained by re-converting the audio $G(s)$ generated by the generator with the Mel spectrum features extracted from the sample audio;
- $L_{FM}(G; D)$, the feature matching loss between the generated audio and the sample audio;
- $L_{G}(G; D)$, the discriminant loss of the generated audio.
- the loss of the discriminator can be expressed as:
- $$L_{D}(G; D) = (D(x) - 1)^{2} + (D(G(s)))^{2}$$
- where $D(x)$ is the discriminator's discrimination result for the sample audio, and $D(G(s))$ is the discriminator's discrimination result for the predicted audio.
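The discriminator loss built from D(x) and D(G(s)) can be computed as in the following sketch. Averaging each term over a batch of discriminator outputs is an assumption made here for illustration, and the function name `discriminator_loss` is hypothetical.

```python
from typing import Sequence


def discriminator_loss(d_real: Sequence[float], d_fake: Sequence[float]) -> float:
    """(D(x) - 1)^2 + (D(G(s)))^2, with each term averaged over its batch."""
    if not d_real or not d_fake:
        raise ValueError("both batches must be non-empty")
    real_term = sum((d - 1.0) ** 2 for d in d_real) / len(d_real)
    fake_term = sum(d ** 2 for d in d_fake) / len(d_fake)
    return real_term + fake_term


# A discriminator that scores sample audio as 1 and predicted audio as 0 has zero loss.
perfect = discriminator_loss([1.0, 1.0], [0.0, 0.0])
# A discriminator that cannot tell them apart (0.5 everywhere) is penalized on both terms.
confused = discriminator_loss([0.5, 0.5], [0.5, 0.5])
```

The loss pushes the discriminator output toward 1 for sample audio and toward 0 for audio produced by the generator.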
- the third conversion model trained in the above manner can not only convert content features into audio, but also add a specific timbre to the converted audio. In addition, the user can select a target timbre.
- when the accent conversion instruction includes a target timbre, the computer device inputs the second content feature and the speaker identifier of the speaker corresponding to the target timbre into a third conversion model to obtain second accent audio, wherein the second accent audio has a second accent and the target timbre.
- when it is necessary to convert Mandarin into a dialect with a target timbre, the computer device extracts features of Mandarin audio 1301 through Mandarin ASR model 1302 to obtain Mandarin content features 1303.
- the computer device further performs feature conversion on Mandarin content features 1303 through BN2BN model 1304 to obtain dialect content features 1305.
- the computer device inputs the dialect content features 1305 and the speaker identifier corresponding to the target timbre 1306 into BN2Wav model 1307 to obtain dialect audio 1308 with the target timbre output by BN2Wav model 1307.
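How the target timbre reaches the third conversion model can be sketched as a lookup from timbre to speaker identifier, followed by the two stages of the BN2Wav model. Everything below is illustrative: the `AccentConversionInstruction` fields, the timbre names, the `default_speaker_id` fallback (the source does not state what happens when no target timbre is given), and the toy `bn2mel` and `vocoder` stand-ins are assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Feature = List[List[float]]  # frames x feature dimensions
Audio = List[float]          # waveform samples


@dataclass
class AccentConversionInstruction:
    source_accent: str
    target_accent: str
    target_timbre: Optional[str] = None


def synthesize(
    second_content: Feature,
    instruction: AccentConversionInstruction,
    speaker_ids: Dict[str, int],
    default_speaker_id: int,
    bn2mel: Callable[[Feature, int], Feature],
    vocoder: Callable[[Feature], Audio],
) -> Audio:
    """Resolve the speaker identifier, then run sub-model and vocoder in sequence."""
    if instruction.target_timbre is None:
        spk_id = default_speaker_id  # hypothetical fallback
    else:
        spk_id = speaker_ids[instruction.target_timbre]
    return vocoder(bn2mel(second_content, spk_id))


# Toy stand-ins so the sketch runs (assumptions, not trained models).
toy_bn2mel = lambda feats, spk: [[value + spk for value in frame] for frame in feats]
toy_vocoder = lambda mel: [value for frame in mel for value in frame]

instruction = AccentConversionInstruction("mandarin", "dialect", "timbre_b")
audio = synthesize(
    [[0.0, 1.0]], instruction, {"timbre_a": 1, "timbre_b": 2}, 0, toy_bn2mel, toy_vocoder
)
```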
- during training, the speaker identifier corresponding to the sample audio is also taken as input, so that the third conversion model can perform audio conversion based on both the content features and the speaker's timbre features.
- the third conversion model can output audio with different timbres, thereby achieving dual conversion of accent and timbre.
- FIG. 14 is a structural block diagram of a training device for a speech conversion model provided by an exemplary embodiment of the present application, the device comprising:
- a training module 1401 configured to train a first ASR model based on a first sample audio, and to train a second ASR model based on a second sample audio, wherein the first sample audio corresponds to a first accent, and the second sample audio corresponds to a second accent;
- the training module 1401 is further used to train a first conversion model based on a first sample text and a first sample content feature corresponding to the first sample audio, wherein the first sample content feature is obtained by extracting the first sample audio by the first ASR model, and the first conversion model is used to convert the text into content features of the first accent;
- the training module 1401 is further used to construct parallel sample data based on the first conversion model, a second sample text corresponding to the second sample audio, and a second sample content feature, wherein the second sample content feature is extracted by the second ASR model from the second sample audio, and the parallel sample data includes different content features, and different content features correspond to different accents, and different content features correspond to the same text; train a second conversion model based on the parallel sample data, and the second conversion model is used to convert content features between the first accent and the second accent;
- the training module 1401 is further used to train a third conversion model based on sample content features of different sample audios, wherein the third conversion model is used to convert the content features into audio;
- the generation module 1402 is used to generate a speech conversion model based on the first ASR model, the second conversion model and the third conversion model obtained through training, wherein the speech conversion model is used to convert audio in a first accent into audio in a second accent.
- the training module 1401 is used to:
- the second sample text is converted by the first conversion model to obtain a third sample content feature, where the third sample content feature refers to a content feature of an audio generated by expressing the second sample text in the first accent;
- the parallel sample data is constructed based on the second sample content feature and the third sample content feature.
- the training module 1401 is used to: input the third sample content feature into the second conversion model to obtain a second predicted content feature;
- the second conversion model is trained by using the second sample content feature as supervision of the second predicted content feature.
- the training module 1401 is used to: input the first sample text into the first conversion model to obtain a first predicted content feature output by the first conversion model;
- the first conversion model is trained by using the first sample content feature as supervision of the first predicted content feature.
- the first conversion model includes a first conversion sub-model, a duration prediction sub-model and a second conversion sub-model;
- the training module 1401 is used to:
- perform duration prediction on the first text encoding feature by using the duration prediction sub-model to obtain a predicted duration, wherein the predicted duration is used to represent the pronunciation duration of the first sample text;
- the second text encoding feature is converted into the first predicted content feature through the second conversion sub-model.
- the first conversion sub-model and the second conversion sub-model each include an FFT (feed-forward Transformer) block, and the FFT includes a multi-head attention layer and a convolution layer.
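Claim 6 places a feature-expansion step between duration prediction and the second conversion sub-model. The source does not say how the expansion is done; the sketch below assumes the common approach of repeating each text encoding vector for its predicted number of frames. The function name `expand_by_duration` and the toy values are hypothetical.

```python
from typing import List, Sequence


def expand_by_duration(
    encodings: Sequence[Sequence[float]],
    durations: Sequence[int],
) -> List[List[float]]:
    """Repeat each text encoding vector durations[i] times (assumed expansion rule)."""
    if len(encodings) != len(durations):
        raise ValueError("one duration is required per encoding vector")
    expanded: List[List[float]] = []
    for vector, duration in zip(encodings, durations):
        # A non-positive duration contributes no frames.
        expanded.extend(list(vector) for _ in range(max(0, duration)))
    return expanded


# Toy example: two text units predicted to last 2 and 3 frames.
first_text_encoding = [[0.1, 0.2], [0.3, 0.4]]
predicted_duration = [2, 3]
second_text_encoding = expand_by_duration(first_text_encoding, predicted_duration)
```

Under this assumption the expanded sequence has one vector per output frame, so its length can match the frame-level content feature that supervises the first conversion model.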
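- the training module 1401 is used to:
- input the sample content feature and the speaker identifier corresponding to the sample audio into the third conversion model to obtain predicted audio, wherein the predicted audio and the sample audio correspond to the same audio content and have the same timbre, and different speakers correspond to different speaker identifiers;
- train the third conversion model based on the predicted audio and the sample audio.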
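- the third conversion model includes a third conversion sub-model and a vocoder;
- the training module 1401 is used to:
- input the sample content feature and the speaker identifier into the third conversion sub-model to obtain a predicted audio spectrum feature;
- input the predicted audio spectrum feature into the vocoder to obtain the predicted audio.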
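- the training module 1401 is used to:
- train the third conversion sub-model by using the sample audio spectrum feature of the sample audio as supervision of the predicted audio spectrum feature;
- after training of the third conversion sub-model is completed, input the predicted audio spectrum feature output by the trained third conversion sub-model into the vocoder to obtain the predicted audio;
- train the vocoder in the third conversion model based on the predicted audio and the sample audio.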
- the device further comprises:
- a conversion module configured to: extract, in response to an accent conversion instruction, a first content feature of the first accent audio through the first ASR model, wherein the first content feature corresponds to the first accent, and the accent conversion instruction instructs conversion of the audio from the first accent to the second accent;
- convert the first content feature into a second content feature through the second conversion model, wherein the second content feature corresponds to the second accent;
- convert the second content feature into audio through the third conversion model to obtain second accent audio.
- the accent conversion instruction includes a target timbre;
- the conversion module is used to:
- input the second content feature and the speaker identifier of the speaker corresponding to the target timbre into the third conversion model to obtain the second accent audio, wherein different speakers correspond to different speaker identifiers.
- in the embodiment of the present application, a first conversion model for converting text into content features is trained first. The first conversion model and the second sample text corresponding to the second sample audio are then used to construct parallel sample data that contains the same text content in different accents. The parallel sample data is used to train a second conversion model for converting content features between accents, and a third conversion model for converting content features into audio is also trained, which completes the training of the speech conversion model. Because the parallel corpus is constructed with an intermediate model obtained during training, no parallel corpus of different accents needs to be recorded before model training. This reduces the demand for manually recorded parallel corpora while maintaining training quality, which helps improve training efficiency and improves the training quality of the model when samples are insufficient.
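The parallel-data construction summarized above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the trained first conversion model is stood in for by a plain callable, and the names `build_parallel_samples` and `toy_text_to_feature` as well as the nested-list feature layout are assumptions.

```python
from typing import Callable, List, Sequence, Tuple

Feature = List[List[float]]  # assumed layout: one feature vector per frame


def build_parallel_samples(
    text_to_first_accent: Callable[[str], Feature],
    second_accent_samples: Sequence[Tuple[str, Feature]],
) -> List[Tuple[Feature, Feature]]:
    """Pair first-accent features generated from text with second-accent features."""
    pairs: List[Tuple[Feature, Feature]] = []
    for second_text, second_feature in second_accent_samples:
        # Third sample content feature: the same text expressed in the first accent.
        third_feature = text_to_first_accent(second_text)
        # Model input is the first-accent feature; the supervision target is
        # the second-accent feature extracted by the second ASR model.
        pairs.append((third_feature, second_feature))
    return pairs


def toy_text_to_feature(text: str) -> Feature:
    """Hypothetical stand-in for the trained first conversion model."""
    return [[float(ord(ch))] for ch in text]


samples = [("hi", [[1.0], [2.0]])]
parallel = build_parallel_samples(toy_text_to_feature, samples)
```

Each resulting pair shares one text but carries two accents, which is the property the second conversion model is trained on.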
- the device provided in the above embodiment is only illustrated by the division of the above functional modules.
- the above functions can be assigned to different functional modules as needed, that is, the internal structure of the device is divided into different functional modules to complete all or part of the functions described above.
- the device and method embodiments provided in the above embodiment belong to the same concept, and the implementation process thereof is detailed in the method embodiment, which will not be repeated here.
- FIG. 15 is a structural block diagram of a speech conversion device provided by an exemplary embodiment of the present application, the device comprising:
- an acquisition module 1501, configured to acquire a first accent audio, where the first accent audio corresponds to a first accent;
- an extraction module 1502, configured to extract a first content feature from the first accent audio by using the first ASR model, where the first content feature corresponds to the first accent;
- a content feature conversion module 1503, configured to convert the first content feature into a second content feature by using the second conversion model, wherein the second content feature corresponds to a second accent;
- an audio conversion module 1504, configured to perform audio conversion on the second content feature through the third conversion model to obtain a second accent audio.
- the content feature conversion module 1503 is further used to input the first content feature extracted by the first ASR model into a second conversion model, and the second conversion model performs content feature conversion between the first accent and the second accent to obtain a second content feature in the second accent.
- the second conversion model includes a convolution layer and an N-layer stacked FFT.
- the content feature conversion module 1503 is further used to perform convolution processing on the first content feature through the convolution layer in the second conversion model, and input the convolution result into the N-layer stacked FFT for conversion to obtain the second content feature.
- the third conversion model includes a third conversion sub-model and a vocoder, wherein the third conversion sub-model is used to convert the content features into audio spectrum features, and the vocoder is used to generate audio based on the audio spectrum features.
- the audio conversion module 1504 is further configured to input the second content feature and the speaker identifier into a third conversion sub-model to obtain an audio spectrum feature.
- the audio conversion module 1504 is further used to input the audio spectrum features into the vocoder to obtain the second accent audio.
- the third conversion sub-model includes a convolution layer and an N-layer stacked FFT.
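The module chain of the speech conversion device (extraction, content feature conversion, audio conversion with a speaker identifier) can be sketched as a simple composition. The class `SpeechConversionModel`, its field names, and the lambda stand-ins are hypothetical; real models would replace the toy callables.

```python
from dataclasses import dataclass
from typing import Callable, List

Audio = List[float]
Feature = List[List[float]]


@dataclass
class SpeechConversionModel:
    first_asr: Callable[[Audio], Feature]                 # audio -> first-accent content feature
    second_conversion: Callable[[Feature], Feature]       # first-accent -> second-accent feature
    third_sub_model: Callable[[Feature, int], Feature]    # feature + speaker id -> spectrum feature
    vocoder: Callable[[Feature], Audio]                   # spectrum feature -> audio

    def convert(self, first_accent_audio: Audio, speaker_id: int) -> Audio:
        first_content = self.first_asr(first_accent_audio)
        second_content = self.second_conversion(first_content)
        spectrum = self.third_sub_model(second_content, speaker_id)
        return self.vocoder(spectrum)


# Toy stand-ins that only make the data flow visible.
model = SpeechConversionModel(
    first_asr=lambda audio: [[x] for x in audio],
    second_conversion=lambda feats: [[v * 2 for v in f] for f in feats],
    third_sub_model=lambda feats, spk: [[v + spk for v in f] for f in feats],
    vocoder=lambda spec: [f[0] for f in spec],
)
second_accent_audio = model.convert([0.5, 1.0], speaker_id=1)
```

Keeping the four stages as separate callables mirrors the description, in which the third conversion sub-model and the vocoder are distinct parts of the third conversion model.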
- FIG. 16 shows a schematic diagram of the structure of a computer device provided by an exemplary embodiment of the present application.
- the computer device can be a screen projection device or terminal in the above-mentioned embodiment.
- the computer device 1600 includes a central processing unit (CPU) 1601, a system memory 1604 including a random access memory 1602 and a read-only memory 1603, and a system bus 1605 connecting the system memory 1604 and the central processing unit 1601.
- the computer device 1600 also includes a basic input/output system (I/O system) 1606 that helps transmit information between components within the computer device, and a mass storage device 1607 for storing an operating system 1613, an application program 1614 and other program modules 1615.
- the basic input/output system 1606 includes a display 1608 for displaying information and an input device 1609, such as a mouse or a keyboard, through which a user inputs information.
- the display 1608 and the input device 1609 are connected to the central processing unit 1601 through an input/output controller 1610 connected to the system bus 1605.
- the basic input/output system 1606 may also include an input/output controller 1610 for receiving and processing inputs from a plurality of other devices such as a keyboard, a mouse, or an electronic stylus.
- the input/output controller 1610 also provides output to a display screen, a printer, or other types of output devices.
- the mass storage device 1607 is connected to the central processing unit 1601 through a mass storage controller (not shown) connected to the system bus 1605.
- the mass storage device 1607 and its associated computer readable media provide non-volatile storage for the computer device 1600. That is, the mass storage device 1607 may include a computer readable medium (not shown) such as a hard disk or drive.
- the computer-readable medium may include computer storage media and communication media.
- Computer storage media include volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storing information such as computer-readable instructions, data structures, program modules or other data.
- Computer storage media include random access memory.
- the computer storage medium may be a random access memory (RAM), a read-only memory (ROM), a flash memory or other solid-state storage technology, a compact disc read-only memory (CD-ROM), a digital versatile disc (DVD) or other optical storage, a cassette, a magnetic tape, a disk storage or other magnetic storage device.
- the computer storage medium is not limited to the above.
- the system memory 1604 and the mass storage device 1607 may be collectively referred to as a memory.
- the memory stores one or more programs, and the one or more programs are configured to be executed by one or more central processing units 1601.
- the one or more programs contain instructions for implementing the above-mentioned methods.
- the central processing unit 1601 executes the one or more programs to implement the methods provided by the above-mentioned various method embodiments.
- the computer device 1600 can also be connected to a remote computer on the network through a network such as the Internet. That is, the computer device 1600 can be connected to the network 1612 through the network interface unit 1611 connected to the system bus 1605, or the network interface unit 1611 can be used to connect to other types of networks or remote computer systems (not shown).
- An embodiment of the present application also provides a computer-readable storage medium, which stores at least one instruction, and the at least one instruction is loaded and executed by a processor to implement the training method of the speech conversion model described in the above embodiment, or the speech conversion method described in the above aspect.
- the computer-readable storage medium may include a ROM, a RAM, a solid-state drive (SSD), an optical disk, or the like.
- the RAM may include resistive random access memory (ReRAM) and dynamic random access memory (DRAM).
- the embodiment of the present application provides a computer program product or a computer program, which includes computer instructions stored in a computer-readable storage medium.
- the processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the training method of the speech conversion model described in the above embodiment, or the speech conversion method described in the above aspect.
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Computational Linguistics (AREA)
- Multimedia (AREA)
- Acoustics & Sound (AREA)
- Human Computer Interaction (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Quality & Reliability (AREA)
- Artificial Intelligence (AREA)
- Signal Processing (AREA)
- Theoretical Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Biophysics (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- General Engineering & Computer Science (AREA)
- Computing Systems (AREA)
- Molecular Biology (AREA)
- General Health & Medical Sciences (AREA)
- Evolutionary Computation (AREA)
- Data Mining & Analysis (AREA)
- Biomedical Technology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Electrically Operated Instructional Devices (AREA)
Abstract
Description
$$L_G = L_G(G;D) + L_{FM}(G;D) + L_{mel}(G)$$
$$L_D(G;D) = (D(x)-1)^2 + (D(G(s)))^2$$
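The two loss expressions above have a least-squares adversarial form. The sketch below evaluates them numerically, reading D(x) as the discriminator score for real sample audio and D(G(s)) as the score for generated audio; that reading, the batch averaging, and the function names are assumptions not stated in this section.

```python
from typing import Sequence


def discriminator_loss(d_real: Sequence[float], d_fake: Sequence[float]) -> float:
    """(D(x) - 1)^2 + (D(G(s)))^2, averaged over a batch of scores (assumed)."""
    real_term = sum((score - 1.0) ** 2 for score in d_real) / len(d_real)
    fake_term = sum(score ** 2 for score in d_fake) / len(d_fake)
    return real_term + fake_term


def generator_total_loss(adversarial: float, feature_matching: float, mel: float) -> float:
    """Sum of the three generator terms, unweighted as in the expression above."""
    return adversarial + feature_matching + mel


perfect = discriminator_loss(d_real=[1.0], d_fake=[0.0])   # discriminator fully correct
fooled = discriminator_loss(d_real=[0.0], d_fake=[1.0])    # discriminator fully wrong
total = generator_total_loss(0.5, 0.25, 0.25)
```

The discriminator term is zero when real audio scores 1 and generated audio scores 0, and grows as the discriminator is fooled.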
Claims (20)
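- A method for training a speech conversion model, the method being performed by a computer device and comprising: training a first ASR model based on first sample audio, and training a second ASR model based on second sample audio, wherein the first sample audio corresponds to a first accent and the second sample audio corresponds to a second accent; training a first conversion model based on a first sample text corresponding to the first sample audio and a first sample content feature, wherein the first sample content feature is extracted by the first ASR model from the first sample audio, and the first conversion model is used to convert text into content features of the first accent; constructing parallel sample data based on the first conversion model, a second sample text corresponding to the second sample audio, and a second sample content feature, wherein the second sample content feature is extracted by the second ASR model from the second sample audio, the parallel sample data includes different content features, the different content features correspond to different accents, and the different content features correspond to the same text; training a second conversion model based on the parallel sample data, wherein the second conversion model is used to convert content features between the first accent and the second accent; training a third conversion model based on sample content features of different sample audios, wherein the third conversion model is used to convert content features into audio; and generating a speech conversion model based on the trained first ASR model, second conversion model, and third conversion model, wherein the speech conversion model is used to convert audio in the first accent into audio in the second accent.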
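- The method according to claim 1, wherein constructing the parallel sample data based on the first conversion model, the second sample text corresponding to the second sample audio, and the second sample content feature comprises: converting the second sample text through the first conversion model to obtain a third sample content feature, wherein the third sample content feature refers to a content feature of audio produced by expressing the second sample text in the first accent; and constructing the parallel sample data based on the second sample content feature and the third sample content feature.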
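- The method according to claim 2, wherein training the second conversion model based on the parallel sample data comprises: inputting the third sample content feature into the second conversion model to obtain a second predicted content feature; and training the second conversion model by using the second sample content feature as supervision of the second predicted content feature.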
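- The method according to claim 3, wherein training the second conversion model by using the second sample content feature as supervision of the second predicted content feature comprises: determining a second conversion model loss according to a difference between the second sample content feature and the second predicted content feature, and training the second conversion model based on the second conversion model loss.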
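- The method according to any one of claims 1 to 4, wherein training the first conversion model based on the first sample text corresponding to the first sample audio and the first sample content feature comprises: inputting the first sample text into the first conversion model to obtain a first predicted content feature output by the first conversion model; and training the first conversion model by using the first sample content feature as supervision of the first predicted content feature.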
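- The method according to claim 5, wherein the first conversion model includes a first conversion sub-model, a duration prediction sub-model, and a second conversion sub-model; and inputting the first sample text into the first conversion model to obtain the first predicted content feature output by the first conversion model comprises: encoding the first sample text through the first conversion sub-model to obtain a first text encoding feature; performing duration prediction on the first text encoding feature through the duration prediction sub-model to obtain a predicted duration, wherein the predicted duration is used to represent the pronunciation duration of the first sample text; performing feature expansion on the first text encoding feature based on the predicted duration to obtain a second text encoding feature; and converting the second text encoding feature through the second conversion sub-model to obtain the first predicted content feature.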
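- The method according to claim 6, wherein the first conversion sub-model and the second conversion sub-model include an FFT, and the FFT includes a multi-head attention mechanism layer and a convolution layer.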
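- The method according to claim 5, wherein training the first conversion model by using the first sample content feature as supervision of the first predicted content feature comprises: determining a first conversion model loss according to a difference between the first sample content feature and the first predicted content feature, and training the first conversion model based on the first conversion model loss.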
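- The method according to any one of claims 1 to 8, wherein training the third conversion model based on the sample content features of different sample audios comprises: inputting the sample content feature and a speaker identifier corresponding to the sample audio into the third conversion model to obtain predicted audio, wherein the predicted audio and the sample audio correspond to the same audio content and have the same timbre, and different speakers correspond to different speaker identifiers; and training the third conversion model based on the predicted audio and the sample audio.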
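- The method according to claim 9, wherein the third conversion model includes a third conversion sub-model and a vocoder; and inputting the sample content feature and the speaker identifier corresponding to the sample audio into the third conversion model to obtain the predicted audio comprises: inputting the sample content feature and the speaker identifier into the third conversion sub-model to obtain a predicted audio spectrum feature; and inputting the predicted audio spectrum feature into the vocoder to obtain the predicted audio.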
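- The method according to claim 10, wherein before inputting the predicted audio spectrum feature into the vocoder to obtain the predicted audio, the method further comprises: training the third conversion sub-model by using a sample audio spectrum feature of the sample audio as supervision of the predicted audio spectrum feature; inputting the predicted audio spectrum feature into the vocoder to obtain the predicted audio comprises: when training of the third conversion sub-model is completed, inputting the predicted audio spectrum feature output by the trained third conversion sub-model into the vocoder to obtain the predicted audio; and training the third conversion model based on the predicted audio and the sample audio comprises: training the vocoder in the third conversion model based on the predicted audio and the sample audio.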
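- The method according to claim 11, wherein training the third conversion sub-model by using the sample audio spectrum feature of the sample audio as supervision of the predicted audio spectrum feature comprises: performing audio spectrum feature extraction on the sample audio to obtain the sample audio spectrum feature; and determining a third conversion sub-model loss according to a difference between the predicted audio spectrum feature and the sample audio spectrum feature, and training the third conversion sub-model based on the third conversion sub-model loss.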
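- The method according to any one of claims 1 to 12, wherein the method comprises: in response to an accent conversion instruction, extracting a first content feature of first accent audio through the first ASR model, wherein the first content feature corresponds to the first accent, and the accent conversion instruction instructs conversion of audio from the first accent to the second accent; converting the first content feature into a second content feature through the second conversion model, wherein the second content feature corresponds to the second accent; and performing audio conversion on the second content feature through the third conversion model to obtain second accent audio.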
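- The method according to claim 13, wherein the accent conversion instruction includes a target timbre; and performing audio conversion on the second content feature through the third conversion model to obtain the second accent audio comprises: inputting the second content feature and a speaker identifier of a speaker corresponding to the target timbre into the third conversion model to obtain the second accent audio, wherein different speakers correspond to different speaker identifiers.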
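- A speech conversion method, the method being performed by a computer device provided with a speech conversion model, the speech conversion model including a first ASR model, a second conversion model, and a third conversion model, the method comprising: acquiring first accent audio, wherein the first accent audio corresponds to a first accent; extracting a first content feature from the first accent audio through the first ASR model, wherein the first content feature corresponds to the first accent; converting the first content feature into a second content feature through the second conversion model, wherein the second content feature corresponds to a second accent; and performing audio conversion on the second content feature through the third conversion model to obtain second accent audio.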
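- A device for training a speech conversion model, the device comprising: a training module, configured to train a first ASR model based on first sample audio and train a second ASR model based on second sample audio, wherein the first sample audio corresponds to a first accent and the second sample audio corresponds to a second accent; the training module being further configured to train a first conversion model based on a first sample text corresponding to the first sample audio and a first sample content feature, wherein the first sample content feature is extracted by the first ASR model from the first sample audio, and the first conversion model is used to convert text into content features of the first accent; the training module being further configured to construct parallel sample data based on the first conversion model, a second sample text corresponding to the second sample audio, and a second sample content feature, wherein the second sample content feature is extracted by the second ASR model from the second sample audio, the parallel sample data includes different content features, the different content features correspond to different accents, and the different content features correspond to the same text, and to train a second conversion model based on the parallel sample data, wherein the second conversion model is used to convert content features between the first accent and the second accent; the training module being further configured to train a third conversion model based on sample content features of different sample audios, wherein the third conversion model is used to convert content features into audio; and a generation module, configured to generate a speech conversion model based on the trained first ASR model, second conversion model, and third conversion model, wherein the speech conversion model is used to convert audio in the first accent into audio in the second accent.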
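- A speech conversion device, the device comprising: an acquisition module, configured to acquire first accent audio, wherein the first accent audio corresponds to a first accent; an extraction module, configured to extract a first content feature from the first accent audio through the first ASR model, wherein the first content feature corresponds to the first accent; a content feature conversion module, configured to convert the first content feature into a second content feature through the second conversion model, wherein the second content feature corresponds to a second accent; and an audio conversion module, configured to perform audio conversion on the second content feature through the third conversion model to obtain second accent audio.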
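- A computer device, comprising a processor and a memory, wherein the memory stores at least one instruction, and the at least one instruction is loaded and executed by the processor to implement the method for training a speech conversion model according to any one of claims 1 to 14, or the speech conversion method according to claim 15.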
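- A computer-readable storage medium, wherein the readable storage medium stores at least one instruction, and the at least one instruction is loaded and executed by a processor to implement the method for training a speech conversion model according to any one of claims 1 to 14, or the speech conversion method according to claim 15.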
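- A computer program product, comprising computer instructions stored in a computer-readable storage medium, wherein a processor of a computer device reads the computer instructions from the computer-readable storage medium and executes the computer instructions, so that the computer device performs the method for training a speech conversion model according to any one of claims 1 to 14, or the speech conversion method according to claim 15.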
Priority Applications (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| EP23893467.3A EP4618072A4 (en) | 2022-11-21 | 2023-10-12 | METHOD AND APPARATUS FOR TRAINING A SPEECH CONVERSION MODEL, DEVICE AND SUPPORT |
| US18/885,324 US20250006212A1 (en) | 2022-11-21 | 2024-09-13 | Method and apparatus for training speech conversion model, device, and medium |
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202211455842.7 | 2022-11-21 | ||
| CN202211455842.7A CN116959447A (zh) | 2022-11-21 | 2022-11-21 | Method and apparatus for training speech conversion model, device, and medium |
Related Child Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US18/885,324 Continuation US20250006212A1 (en) | 2022-11-21 | 2024-09-13 | Method and apparatus for training speech conversion model, device, and medium |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2024109375A1 (zh) | 2024-05-30 |
Family
ID=88453595
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/CN2023/124162 Ceased WO2024109375A1 (zh) | 2022-11-21 | 2023-10-12 | Method and apparatus for training speech conversion model, device, and medium |
Country Status (4)
| Country | Link |
|---|---|
| US (1) | US20250006212A1 (zh) |
| EP (1) | EP4618072A4 (zh) |
| CN (1) | CN116959447A (zh) |
| WO (1) | WO2024109375A1 (zh) |
Cited By (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN119649834A (zh) * | 2024-12-02 | 2025-03-18 | 平安科技(深圳)有限公司 | Speech conversion generation method and apparatus, computer device, and storage medium |
Families Citing this family (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
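| CN118298836B (zh) * | 2024-05-29 | 2024-08-23 | 摩尔线程智能科技(北京)有限责任公司 | Timbre conversion method and apparatus, electronic device, storage medium, and program product |
| CN119785824A (zh) * | 2024-12-03 | 2025-04-08 | 平安科技(深圳)有限公司 | Accent conversion method, apparatus, device, and storage medium |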
Citations (8)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20160064033A1 (en) * | 2014-08-26 | 2016-03-03 | Microsoft Corporation | Personalized audio and/or video shows |
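| CN110085244A (zh) * | 2019-05-05 | 2019-08-02 | 广州虎牙信息科技有限公司 | Live-streaming interaction method and apparatus, electronic device, and readable storage medium |
| CN112767912A (zh) * | 2020-12-28 | 2021-05-07 | 深圳市优必选科技股份有限公司 | Cross-lingual voice conversion method and apparatus, computer device, and storage medium |
| CN113223542A (zh) * | 2021-04-26 | 2021-08-06 | 北京搜狗科技发展有限公司 | Audio conversion method and apparatus, storage medium, and electronic device |
| CN113450759A (zh) * | 2021-06-22 | 2021-09-28 | 北京百度网讯科技有限公司 | Speech generation method and apparatus, electronic device, and storage medium |
| CN113838448A (zh) * | 2021-06-16 | 2021-12-24 | 腾讯科技(深圳)有限公司 | Speech synthesis method, apparatus, device, and computer-readable storage medium |
| CN114038484A (zh) * | 2021-12-16 | 2022-02-11 | 游密科技(深圳)有限公司 | Speech data processing method and apparatus, computer device, and storage medium |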
| US20220382998A1 (en) * | 2021-05-25 | 2022-12-01 | Compal Electronics, Inc. | Translation method and translation device |
2022
- 2022-11-21 CN CN202211455842.7A patent/CN116959447A/zh active Pending
2023
- 2023-10-12 WO PCT/CN2023/124162 patent/WO2024109375A1/zh not_active Ceased
- 2023-10-12 EP EP23893467.3A patent/EP4618072A4/en active Pending
2024
- 2024-09-13 US US18/885,324 patent/US20250006212A1/en active Pending
Patent Citations (8)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20160064033A1 (en) * | 2014-08-26 | 2016-03-03 | Microsoft Corporation | Personalized audio and/or video shows |
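| CN110085244A (zh) * | 2019-05-05 | 2019-08-02 | 广州虎牙信息科技有限公司 | Live-streaming interaction method and apparatus, electronic device, and readable storage medium |
| CN112767912A (zh) * | 2020-12-28 | 2021-05-07 | 深圳市优必选科技股份有限公司 | Cross-lingual voice conversion method and apparatus, computer device, and storage medium |
| CN113223542A (zh) * | 2021-04-26 | 2021-08-06 | 北京搜狗科技发展有限公司 | Audio conversion method and apparatus, storage medium, and electronic device |
| US20220382998A1 (en) * | 2021-05-25 | 2022-12-01 | Compal Electronics, Inc. | Translation method and translation device |
| CN113838448A (zh) * | 2021-06-16 | 2021-12-24 | 腾讯科技(深圳)有限公司 | Speech synthesis method, apparatus, device, and computer-readable storage medium |
| CN113450759A (zh) * | 2021-06-22 | 2021-09-28 | 北京百度网讯科技有限公司 | Speech generation method and apparatus, electronic device, and storage medium |
| CN114038484A (zh) * | 2021-12-16 | 2022-02-11 | 游密科技(深圳)有限公司 | Speech data processing method and apparatus, computer device, and storage medium |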
Non-Patent Citations (2)
| Title |
|---|
| See also references of EP4618072A4 * |
| ZHANG YONGMAO; WANG ZHICHAO; YANG PEIJI; SUN HONGSHEN; WANG ZHISHENG; XIE LEI: "AccentSpeech: Learning Accent from Crowd-sourced Data for Target Speaker TTS with Accents", 2022 13TH INTERNATIONAL SYMPOSIUM ON CHINESE SPOKEN LANGUAGE PROCESSING (ISCSLP), IEEE, 11 December 2022 (2022-12-11), pages 76 - 80, XP034290712, DOI: 10.1109/ISCSLP57327.2022.10037914 * |
Also Published As
| Publication number | Publication date |
|---|---|
| EP4618072A1 (en) | 2025-09-17 |
| CN116959447A (zh) | 2023-10-27 |
| EP4618072A4 (en) | 2026-02-18 |
| US20250006212A1 (en) | 2025-01-02 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
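| WO2024109375A1 (zh) | Method and apparatus for training speech conversion model, device, and medium | |
| EP4270255B1 (en) | Cross-lingual voice conversion system and method | |
| JP4246790B2 (ja) | Speech synthesis device | |
| CN111465982B (zh) | Signal processing device and method, training device and method, and program | |
| JP2020034895A (ja) | Response method and device | |
| JP2014519082A (ja) | Text-based video generation | |
| CN113205793A (zh) | Audio generation method and apparatus, storage medium, and electronic device | |
| CN116561294B (zh) | Sign language video generation method and apparatus, computer device, and storage medium | |
| WO2025066461A1 (zh) | Method for processing speech data and conference speech, and server | |
| CN118898986A (zh) | Speech synthesis model training method, speech synthesis method, and task platform | |
| WO2025020916A1 (zh) | Task processing method, automatic question answering method, and multimedia data recognition model training method | |
| WO2024255461A1 (zh) | Speech processing method, apparatus, device, medium, and program product | |
| KR102875075B1 (ko) | Apparatus and method for generating subtitles and meeting minutes based on speech recognition | |
| CN117316185A (zh) | Audio and video generation method, apparatus, device, and storage medium | |
| CN119274535B (zh) | Speech processing method, apparatus, device, medium, and program product | |
| Savale et al. | Multilingual video dubbing system | |
| CN113889130A (zh) | Speech conversion method, apparatus, device, and medium | |
| US20220383850A1 (en) | System and method for posthumous dynamic speech synthesis using neural networks and deep learning | |
| CN118069805B (zh) | Intelligent question answering method and apparatus based on speech-text collaboration | |
| CN119835488A (zh) | Voice changing method for live-streaming room voice conversations, and apparatus, device, and medium therefor | |
| CN119815067A (zh) | Digital human live-streaming method, apparatus, device, and storage medium | |
| CN117373463A (zh) | Model training method for speech processing, device, medium, and program product | |
| KR20220023381A (ko) | Method and apparatus for synthesizing speech with emotional prosody from a small amount of a single speaker's speech data | |
| WO2024251169A1 (zh) | Speech recognition method, device, and storage medium | |
| HK40100471A (zh) | Method and apparatus for training speech conversion model, device, and medium |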
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 23893467; Country of ref document: EP; Kind code of ref document: A1 |
| | WWE | Wipo information: entry into national phase | Ref document number: 11202500651U; Country of ref document: SG |
| | WWP | Wipo information: published in national office | Ref document number: 11202500651U; Country of ref document: SG |
| | WWE | Wipo information: entry into national phase | Ref document number: 2023893467; Country of ref document: EP |
| | ENP | Entry into the national phase | Ref document number: 2023893467; Country of ref document: EP; Effective date: 20250613 |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | WWP | Wipo information: published in national office | Ref document number: 2023893467; Country of ref document: EP |