WO2023207638A1 - Model training method, speech-to-speech translation method, device, and medium
- Publication number
- WO2023207638A1 (application PCT/CN2023/088492)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- speech
- time step
- samples
- translation
- module
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/40—Processing or translation of natural language
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/40—Processing or translation of natural language
- G06F40/42—Data-driven translation
- G06F40/44—Statistical methods, e.g. probability models
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/40—Processing or translation of natural language
- G06F40/58—Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G06N3/0455—Auto-encoder networks; Encoder-decoder networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
- G10L13/027—Concept to speech synthesisers; Generation of natural phrases from machine-based concepts
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/005—Language recognition
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/18—Speech classification or search using natural language modelling
- G10L15/183—Speech classification or search using natural language modelling using context dependencies, e.g. language models
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
Definitions
- Embodiments of the present application relate to the field of machine learning technology, and in particular, to a model training method, a speech-to-speech translation method, a device, and a medium.
- the Speech-to-Speech Translation (S2ST) model aims to translate source language speech into target language speech. It is widely used in various scenarios such as video translation, multinational conference speeches, and translating intercoms. Usually speech-to-speech translation models need to be trained through a large amount of data. However, it is currently difficult to collect paired speech-to-speech translation samples in real-life scenarios. This lack of data leads to the problem of low model training accuracy.
- This application provides a model training method, speech-to-speech translation method, device and medium, thereby improving model training accuracy.
- A first aspect provides a model training method, including: obtaining speech recognition samples and real speech-to-speech translation samples; generating pseudo-annotated speech-to-speech translation samples based on the speech recognition samples; and training a speech-to-speech translation model based on the pseudo-annotated speech-to-speech translation samples and the real speech-to-speech translation samples.
- A second aspect provides a speech-to-speech translation method, including: obtaining source language speech features; and inputting the source language speech features into a speech-to-speech translation model trained as in the first aspect or in an optional manner of the first aspect, to obtain the target language speech features corresponding to the source language speech features.
- A third aspect provides a model training device, including: an acquisition module, a generation module and a training module.
- The acquisition module is used to acquire speech recognition samples and real speech-to-speech translation samples.
- The generation module is used to generate pseudo-annotated speech-to-speech translation samples based on the speech recognition samples.
- the training module is used to train the speech-to-speech translation model based on pseudo-annotated speech-to-speech translation samples and real speech-to-speech translation samples.
- A fourth aspect provides a speech-to-speech translation device, including: an acquisition module and a processing module, wherein the acquisition module is used to acquire source language speech features, and the processing module is used to input the source language speech features into the speech-to-speech translation model trained as in the first aspect or in an optional manner of the first aspect, to obtain the target language speech features corresponding to the source language speech features.
- A fifth aspect provides an electronic device, including: a processor and a memory, the memory being used to store a computer program, and the processor being used to call and run the computer program stored in the memory to perform the method described in the first aspect or the second aspect.
- a sixth aspect provides a computer-readable storage medium for storing a computer program, the computer program causing a computer to execute the method described in the first or second aspect.
- Speech recognition samples are relatively easy to collect. Pseudo-annotated speech-to-speech translation samples can be generated from them, which expands the set of speech-to-speech translation samples and in turn can improve model training accuracy.
- Figure 1 is the frame diagram of Transformer
- Figure 2 is a schematic diagram of a system architecture involved in an embodiment of the present application.
- Figure 3 is a flow chart of a model training method provided by an embodiment of the present application.
- Figure 4 is a flow chart of another model training method provided by an embodiment of the present application.
- Figure 5 is a schematic diagram of a speech-to-speech translation model provided by an embodiment of the present application.
- Figure 6 is a schematic diagram of another speech-to-speech translation model provided by an embodiment of the present application.
- Figure 7 is a flow chart of a speech-to-speech translation method provided by an embodiment of the present application.
- Figure 8 is a schematic diagram of a model training device 800 provided by an embodiment of the present application.
- Figure 9 is a schematic diagram of a speech-to-speech translation device 900 provided by an embodiment of the present application.
- FIG. 10 is a schematic block diagram of an electronic device 1000 provided by an embodiment of the present application.
- The encoder is used to process the source language speech features and compress them into a fixed-length hidden representation.
- The hidden representation is also called the context vector (context), semantic encoding, semantic vector, etc. The hidden representation is expected to be information that represents the speech features of a language well.
- The decoder is initialized with the hidden representation and produces the target language speech features.
- Figure 1 is a framework diagram of the Transformer.
- Add: residual connection
- Norm: layer normalization
- the decoder is almost the same as the encoder, except that an additional layer of multi-head attention mechanism (encoder-decoder attention) can be added in the middle to process the output of the encoder.
- the first unit of the decoder which is the first unit using the multi-head self-attention mechanism, performs a masking operation to ensure that the decoder does not read information after the current position.
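The masking operation described above can be illustrated with a small sketch. It is not taken from the application: the function names `causal_mask` and `masked_softmax` are hypothetical, and plain Python lists stand in for tensors. Position i may only attend to positions j <= i, so later positions receive zero attention weight.

```python
import math

def causal_mask(n):
    """Return an n x n matrix; entry [i][j] is True if position i may attend to j."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask):
    """Row-wise softmax over attention scores, ignoring masked-out (future) positions."""
    out = []
    for row, allowed in zip(scores, mask):
        # Future positions are set to -inf so that they contribute exp(-inf) = 0.
        masked = [s if ok else float("-inf") for s, ok in zip(row, allowed)]
        peak = max(masked)  # finite, because the diagonal is always allowed
        exps = [math.exp(s - peak) for s in masked]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

# Uniform scores: each row spreads its weight evenly over the visible positions.
scores = [[0.0] * 4 for _ in range(4)]
weights = masked_softmax(scores, causal_mask(4))
```

With uniform scores, the first row puts all weight on position 0, the second row splits it between positions 0 and 1, and so on.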
- this application proposes to expand the training data to improve the model training accuracy.
- The system architecture of the embodiment of the present application is shown in Figure 2.
- Figure 2 is a schematic diagram of a system architecture related to an embodiment of the present application, including user equipment 201, data collection equipment 202, training equipment 203, execution equipment 204, database 205 and content library 206.
- the data collection device 202 is used to read training data from the content library 206 and store the read training data in the database 205 .
- the training data involved in the embodiments of this application includes pseudo-annotated speech-to-speech translation samples and real speech-to-speech translation samples.
- the training device 203 trains the speech-to-speech translation model based on the training data maintained in the database 205, so that the trained speech-to-speech translation model can effectively translate the source language speech into the target language speech.
- the execution device 204 is configured with an I/O interface 207 for data interaction with external devices.
- The source language speech features sent by the user device 201 are received through the I/O interface.
- The computing module 208 in the execution device 204 uses the trained speech-to-speech translation model to process the input source language speech features and output the target language speech features, and the corresponding results are sent to the user device 201 through the I/O interface.
- the user device 201 may include a mobile phone, a tablet computer, a laptop computer, a handheld computer, a mobile internet device (mobile internet device, MID), a desktop computer, or other terminal devices with the function of installing a browser.
- the execution device 204 may be a server.
- the server may be a computing device such as a rack server, a blade server, a tower server, or a cabinet server.
- the server can be an independent test server or a test server cluster composed of multiple test servers.
- the execution device 204 is connected to the user device 201 through the network.
- The network may be an intranet, the Internet, a Global System for Mobile Communications (GSM) network, a Wideband Code Division Multiple Access (WCDMA) network, a 4th Generation (4G) network, a 5th Generation (5G) network, Bluetooth, Wireless Fidelity (Wi-Fi), a call network, or another wireless or wired network.
- Figure 2 is only a schematic diagram of a system architecture provided by an embodiment of the present application, and the positional relationship between the devices, components, modules, etc. shown in the figure does not constitute any limitation.
- the above-mentioned data collection device 202, user device 201, training device 203 and execution device 204 may be the same device.
- the above-mentioned database 205 can be distributed on one server or multiple servers, and the above-mentioned content library 206 can be distributed on one server or multiple servers.
- Figure 3 is a flow chart of a model training method provided by an embodiment of the present application. This method can be executed by any electronic device such as a mobile phone, tablet computer, notebook computer, palmtop computer, MID, desktop computer, etc. For example, it can be executed by the training device in Figure 2, and this application does not limit this. As shown in Figure 3, the method includes:
- S310 Obtain speech recognition samples and real speech-to-speech translation samples
- S320 Generate pseudo-annotated speech-to-speech translation samples based on the speech recognition samples
- S330 Train a speech-to-speech translation model based on pseudo-annotated speech-to-speech translation samples and real speech-to-speech translation samples.
- the speech-to-speech translation model in this application can be a speech-to-speech translation model based on multi-task learning (Multi-Task Learning, MTL).
- multi-task learning is a promising field in machine learning. Its goal is to use the useful information contained in multiple learning tasks to help learn a more accurate learner for each task.
- Because inductive biases are shared between different tasks, the tasks can generally improve each other, which keeps a single task from easily falling into a local optimum.
- For convenience, real speech-to-speech translation samples and pseudo-annotated speech-to-speech translation samples are collectively referred to as speech-to-speech translation samples below. It should be understood that the number of data elements included in a speech-to-speech translation sample depends on whether the speech-to-speech translation model is based on multi-task learning or single-task learning. For example, if the speech-to-speech translation model is based on single-task learning, a speech-to-speech translation sample can be a two-tuple, including: source language speech features and target language speech features.
- If the speech-to-speech translation model is based on multi-task learning, and the multiple tasks include one main task and two auxiliary tasks, then the main task is the speech-to-speech translation task and the two auxiliary tasks are a speech recognition task and a speech-to-text translation task. The speech recognition task is used to convert the source language speech features into source language text, and the speech-to-text translation task is used to convert the source language speech features into source language text and the source language text into target language text. In this case, a speech-to-speech translation sample can be a four-tuple, including: source language speech features, source language text, target language speech features and target language text.
- the real speech-to-speech translation samples include: first source language speech features, first source language text, first target language speech features, and first target language text.
- The first source language speech feature is a real source language speech feature, the first source language text is a real source language text, and the first target language text is also a real target language text.
- The first target language speech feature is the target language speech feature that the electronic device obtains by synthesizing the first target language text. For example, the electronic device can input the first target language text into a speech synthesis model to obtain the first target language speech features.
- the speech recognition samples include: second source language speech features and second source language text.
- the second source language speech feature is a real source language speech feature
- the second source language text is also a real source language text.
- the so-called real source language speech features refer to the source language speech features that can be obtained in real scenes.
- an electronic device can collect a user's voice through a microphone and extract the features of the voice.
- The real source language text can be a language text obtained manually.
- For example, the user can transcribe a speech to form the language text corresponding to the speech.
- The real target language text can also be a language text obtained manually.
- the user translates the content in the source language text into the target language text.
- the above-mentioned speech recognition samples may be one or more, and the above-mentioned real speech-to-speech translation samples may be one or more.
- The electronic device can translate the second source language text to obtain the second target language text, and synthesize the second target language text to obtain the second target language speech features. The pseudo-labeled speech-to-speech translation sample may then be a four-tuple, including: second source language speech features, second source language text, second target language text, and second target language speech features.
- the first two items in the pseudo-annotated speech-to-speech translation samples namely the second source language speech features and the second source language text, are both real.
- the electronic device can input the second source language text into a machine translation (Machine Translation, MT) model to obtain the second target language text.
- the electronic device can input the text of the second target language into a speech synthesis (Text-To-Speech, TTS) model to obtain the speech features of the second target language.
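As a rough illustration of how a pseudo-annotated four-tuple could be assembled from a speech recognition sample, the sketch below chains an MT step and a TTS step. Everything in it is an assumption for illustration: `S2STSample`, `make_pseudo_samples`, `toy_mt` and `toy_tts` are hypothetical names, and the toy functions only stand in for real MT and TTS models.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Features = List[List[float]]  # one feature vector per frame

@dataclass
class S2STSample:
    src_speech: Features   # real source language speech features
    src_text: str          # real source language text
    tgt_text: str          # target language text (machine-translated when pseudo)
    tgt_speech: Features   # target language speech features (synthesized)
    is_pseudo: bool

def make_pseudo_samples(
    asr_samples: List[Tuple[Features, str]],
    translate: Callable[[str], str],
    synthesize: Callable[[str], Features],
) -> List[S2STSample]:
    """Turn (speech features, transcript) pairs into pseudo-annotated four-tuples."""
    samples = []
    for src_speech, src_text in asr_samples:
        tgt_text = translate(src_text)      # MT step: source text -> target text
        tgt_speech = synthesize(tgt_text)   # TTS step: target text -> target features
        samples.append(S2STSample(src_speech, src_text, tgt_text, tgt_speech, True))
    return samples

# Toy stand-ins so the sketch runs without any model (purely hypothetical).
def toy_mt(text: str) -> str:
    return text.upper()

def toy_tts(text: str) -> Features:
    return [[float(len(word))] for word in text.split()]

asr_samples = [([[0.1, 0.2]], "hello world")]
pseudo = make_pseudo_samples(asr_samples, toy_mt, toy_tts)
```

Passing the MT and TTS steps in as callables keeps the sketch independent of any particular model.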
- The difference between real speech-to-speech translation samples and pseudo-annotated speech-to-speech translation samples lies mainly in the target language text. For example, a real speech-to-speech translation sample is a four-tuple $\{s_{src}, t_{src}, t_{tgt}, s_{tgt}\}$, where $s_{src}$ represents the real source language speech features, $t_{src}$ represents the real source language text, $t_{tgt}$ represents the real target language text, and $s_{tgt}$ represents the target language speech features obtained by synthesizing $t_{tgt}$.
- A pseudo-annotated speech-to-speech translation sample is a four-tuple $\{s_{src}, t_{src}, t'_{tgt}, s_{tgt}\}$, where $s_{src}$ represents the real source language speech features, $t_{src}$ represents the real source language text, $t'_{tgt}$ represents the target language text obtained by inputting the real source language text into MT, and $s_{tgt}$ represents the target language speech features obtained by synthesizing $t'_{tgt}$.
- This application may refer to speech-to-speech translation training based on such pseudo-annotated speech-to-speech translation samples as Pseudo Translation Label Adaptation (PTLA).
- pseudo-annotated speech-to-speech translation samples are obtained on the basis of speech recognition samples.
- pseudo-annotated speech-to-speech translation samples can also be constructed based on the source language speech features.
- For example, the electronic device can obtain real source language speech features and input them into an Automatic Speech Recognition (ASR) model to obtain the corresponding source language text. It then inputs the source language text into the MT model to obtain the target language text, and inputs the target language text into the TTS model to obtain the target language speech features. These source language speech features, source language text, target language text and target language speech features constitute a pseudo-annotated speech-to-speech translation sample.
- The source language speech feature may be a log-mel spectrogram of the source language speech, for example an 80-channel log-mel spectrogram, but is not limited to this.
- The target language speech feature may be a linear frequency spectrogram (linear freq spectrogram) of the target language speech, but is not limited to this.
- the training process of the speech-to-speech translation model by electronic devices includes: pre-training phase (pre-training) and fine-tuning phase (fine tuning).
- Pre-training refers to the process of training a model in advance, or to the model that has been trained in advance.
- Fine-tuning refers to the process of applying a pre-trained model to the data set of a certain task and adapting the parameters to the data set of the task.
- In pre-training, the electronic device randomly initializes the parameters, then trains the network model and continuously adjusts its parameters so that the loss of the network model becomes smaller and smaller, until the training stop condition is met.
- In fine-tuning, the electronic device directly uses the previously trained network model, takes its parameters as the initialization parameters for the task, and then trains the network model and continuously adjusts its parameters so that the loss becomes smaller and smaller, until the training stop condition is met.
- the above-mentioned real speech-to-speech translation samples may also be called original speech-to-speech translation samples.
- Pseudo-annotated speech-to-speech translation samples may also be called derived speech-to-speech translation samples.
- the real speech-to-speech translation sample can be used in the pre-training stage of the speech-to-speech translation model, and can also be used in the fine-tuning stage of the model.
- the pseudo-annotated speech-to-speech translation sample can be used in the pre-training stage of the speech-to-speech translation model, or can also be used in the fine-tuning stage of the model. This application does not limit this.
- Speech recognition samples are relatively easy to collect. Pseudo-annotated speech-to-speech translation samples can be generated from them, which expands the set of speech-to-speech translation samples and in turn can improve model training accuracy.
- the above S330 includes:
- S410 Pre-train the speech-to-speech translation model based on the pseudo-annotated speech-to-speech translation samples
- S420 Fine-tune the pre-trained speech-to-speech translation model based on real speech-to-speech translation samples.
- The electronic device can directly fine-tune the pre-trained speech-to-speech translation model on real speech-to-speech translation samples. That is, the electronic device fine-tunes the pre-trained speech-to-speech translation model using only real speech-to-speech translation samples.
- the electronic device can also fine-tune the pre-trained speech-to-speech translation model based on real speech-to-speech translation samples and pseudo-annotated speech-to-speech translation samples.
- the model training method based on this method can be called a hybrid training method.
- This hybrid training algorithm can preserve as much of the gain from pseudo-annotated speech-to-speech translation samples as possible. Because the scale of speech recognition samples is much larger than the scale of real speech-to-speech translation samples, the scale of pseudo-annotated speech-to-speech translation samples is also much larger than the scale of real speech-to-speech translation samples.
- Real speech-to-speech translation samples can therefore be upsampled to expand their scale, and the pre-trained speech-to-speech translation model can then be fine-tuned on the upsampled real speech-to-speech translation samples together with the pseudo-annotated speech-to-speech translation samples.
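The upsampling step can be sketched as follows. The application does not state an upsampling ratio, so this sketch assumes, purely for illustration, that real samples are repeated until they roughly match the number of pseudo-annotated samples; `upsample` and `build_hybrid_set` are hypothetical names.

```python
import math

def upsample(real_samples, target_size):
    """Repeat the real samples whole until there are at least target_size of them."""
    if not real_samples:
        return []
    repeats = max(1, math.ceil(target_size / len(real_samples)))
    return list(real_samples) * repeats

def build_hybrid_set(real_samples, pseudo_samples):
    """Mix upsampled real samples with pseudo-annotated samples for fine-tuning."""
    return upsample(real_samples, len(pseudo_samples)) + list(pseudo_samples)

real = ["r1", "r2"]
pseudo = ["p1", "p2", "p3", "p4", "p5"]
mixed = build_hybrid_set(real, pseudo)
```

In practice the mixed set would also be shuffled before each epoch; that step is omitted here to keep the example deterministic.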
- The electronic device can also annotate the real speech-to-speech translation samples with a first label, which identifies a sample as a real sample and can be represented by real, and annotate the pseudo-labeled speech-to-speech translation samples with a second label, which identifies a sample as a pseudo-labeled sample and can be represented by pseudo.
- the model training method based on this method can be called a prompt training method. Based on this prompt training method, the model can better distinguish between real speech-to-speech translation samples and pseudo-annotated speech-to-speech translation samples.
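A minimal sketch of the labelling step follows. The text names only the label values real and pseudo; how the label is fed to the model is not described, so storing it in a `tag` field of a dictionary is an assumption made here for illustration, as is the helper name `tag_samples`.

```python
REAL_TAG = "real"      # first label: marks a real sample
PSEUDO_TAG = "pseudo"  # second label: marks a pseudo-labeled sample

def tag_samples(samples, tag):
    """Return copies of the sample dicts with a 'tag' field attached."""
    return [{**sample, "tag": tag} for sample in samples]

real_samples = [{"src_text": "hello", "tgt_text": "bonjour"}]
pseudo_samples = [{"src_text": "thanks", "tgt_text": "merci"}]

tagged = tag_samples(real_samples, REAL_TAG) + tag_samples(pseudo_samples, PSEUDO_TAG)
```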
- The electronic device can pre-train the speech-to-speech translation model on pseudo-labeled speech-to-speech translation samples and fine-tune the pre-trained model on real speech-to-speech translation samples. That is, the pseudo-labeled data is mainly used in the pre-training process, which prevents pseudo-annotated speech-to-speech translation samples from misleading the model optimization results.
- Real speech-to-speech translation samples can also be upsampled, and the pre-trained speech-to-speech translation model then fine-tuned on the upsampled real samples together with the pseudo-annotated speech-to-speech translation samples.
- On the one hand, this addresses the low model training accuracy caused by the shortage of real speech-to-speech translation samples. On the other hand, it prevents pseudo-annotated speech-to-speech translation samples from misleading the model optimization results.
- The electronic device can also annotate real speech-to-speech translation samples and pseudo-annotated speech-to-speech translation samples with corresponding labels, which allows the model to better distinguish between the two kinds of samples.
- the speech-to-speech translation model may be an existing translator (Translatotron) model or the speech-to-speech translation model as shown in Figure 5 , which is not limited in this application.
- Figure 5 is a schematic diagram of a speech-to-speech translation model provided by an embodiment of the present application.
- The model includes: an encoder module 510, a first attention module 520, a first decoder module 530, N second attention modules 540 and N second decoder modules 550, where N is a positive integer and the N second attention modules correspond one-to-one to the N second decoder modules.
- the model can be a speech-to-speech translation model based on multi-task learning, and the multi-task includes: a main task and N auxiliary tasks.
- The main task is a speech-to-speech translation task, and the number of second attention modules and second decoder modules is consistent with the number of auxiliary tasks.
- For example, with two auxiliary tasks (N = 2), the auxiliary tasks can be a speech recognition task and a speech-to-text translation task, but are not limited to this.
- The speech recognition task is used to convert source language speech features into source language text.
- The speech-to-text translation task is used to convert the source language speech features into source language text and the source language text into target language text.
- As another example, N = 1, that is, there is one auxiliary task.
- the auxiliary task can be a speech recognition task or a speech-to-text translation task, but is not limited to this.
- the first attention module 520 and the first decoder module 530 correspond to the main task, and each of the following sets of the second attention module 540 and the second decoder module 550 correspond to an auxiliary task.
- the first decoder module 530 is mainly used to predict and synthesize speech features of the target language.
- The two auxiliary tasks take the output of the encoder module 510 as input, and their prediction losses are added to the loss of the main task in the form of a weighted sum.
- the second decoder module 550 is not used.
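The weighted-sum combination of the main-task loss and the auxiliary-task losses can be sketched as below. The weights are not given in the text, so the values used here are placeholders, and giving the main-task loss a weight of 1 is an assumption of this sketch; `multitask_loss` is a hypothetical name.

```python
def multitask_loss(main_loss, aux_losses, aux_weights):
    """Main-task loss plus a weighted sum of auxiliary-task losses."""
    if len(aux_losses) != len(aux_weights):
        raise ValueError("need exactly one weight per auxiliary loss")
    return main_loss + sum(w * l for w, l in zip(aux_weights, aux_losses))

# Placeholder values: speech-to-speech loss, then the ASR and
# speech-to-text translation auxiliary losses with illustrative weights.
total = multitask_loss(2.0, aux_losses=[1.0, 3.0], aux_weights=[0.5, 0.5])
```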
- the encoder module 510 is used to obtain the source language speech features and process the source language speech features to obtain multiple sets of first hidden state representations corresponding to the source language speech features.
- Figure 6 is a schematic diagram of another speech-to-speech translation model provided by the embodiment of the present application.
- The encoder module 510 includes: a convolutional neural network sub-module 5101 and a first Transformer module 5102.
- The convolutional neural network sub-module 5101 is used to obtain the source language speech features and process them to obtain the second hidden state representation corresponding to the source language speech features.
- The first Transformer module 5102 is used to obtain the second hidden state representation and process it to obtain multiple sets of first hidden state representations.
- the convolutional neural network sub-module 5101 may include two convolutional neural network layers, but is not limited thereto.
- two convolutional neural network layers can reduce the length of the 80-channel logarithmic mel spectrogram to one quarter of the original. That is, assuming that the 80-channel logarithmic mel spectrogram is represented by 100 vectors, each of which is 80-dimensional, then 25 vectors are obtained after processing by the two convolutional neural network layers.
- the dimension of these vectors is kept consistent with the number of hidden units in the first converter module 5102. For example, if the number of hidden units is 512, then each of the 25 vectors output by the two convolutional neural network layers is also 512-dimensional.
- these 25 512-dimensional vectors can be understood as the second hidden state representation mentioned above.
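The shape arithmetic above can be checked with a short sketch. The kernel size 3, stride 2 and padding 1 used here are assumptions chosen so that each layer halves the sequence length; the text only states that two layers reduce the length to one quarter.

```python
def conv_out_len(length, kernel=3, stride=2, padding=1):
    """Output length of a 1-D convolution along the time axis.

    kernel/stride/padding are assumed values, not taken from the source text.
    """
    return (length + 2 * padding - kernel) // stride + 1


frames, mel_dim, hidden_units = 100, 80, 512

after_first = conv_out_len(frames)        # length after the first layer
after_second = conv_out_len(after_first)  # length after the second layer

input_shape = (frames, mel_dim)             # 100 vectors, each 80-dimensional
output_shape = (after_second, hidden_units)  # 25 vectors, each 512-dimensional
```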
- the first converter module 5102 may be similar to the encoder structure shown in Figure 1, that is, the first converter module 5102 may include 6 converter layers, or may include 12 converter layers.
- Each converter layer can have 512 hidden units. That is to say, the hidden representation output by the converter layer can be 512-dimensional.
- Each converter layer can consist of two subunits.
- the first one is a self-attention network using a multi-head self-attention mechanism.
- the multi-head self-attention network here can use 8 attention heads.
- the second is a fully connected feed-forward network, in which the feed-forward network can use 2048-dimensional internal states. Both subunits use residual connections and layer normalization.
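The hyperparameters listed above can be collected in a small configuration object. This is only an illustrative sketch; the class name `ConverterLayerConfig` and the derived `head_dim` property are assumptions, although splitting the hidden dimension evenly across heads is the usual convention for multi-head attention.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConverterLayerConfig:
    """Hypothetical container for the per-layer values quoted in the text."""
    hidden_dim: int = 512   # hidden units per converter layer
    num_heads: int = 8      # heads in the multi-head self-attention subunit
    ffn_dim: int = 2048     # internal state size of the feed-forward subunit

    @property
    def head_dim(self) -> int:
        # Each head attends over an equal slice of the hidden dimension.
        if self.hidden_dim % self.num_heads != 0:
            raise ValueError("hidden_dim must be divisible by num_heads")
        return self.hidden_dim // self.num_heads


cfg = ConverterLayerConfig()
per_head = cfg.head_dim
```

Under these values each of the 8 heads works on a 64-dimensional slice of the 512-dimensional representation.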
- after the convolutional neural network sub-module 5101 outputs 25 512-dimensional vectors, the first converter module 5102 processes them to obtain N groups of first hidden state representations, and each group of first hidden state representations is also 25 512-dimensional vectors.
- the 25 512-dimensional vectors obtained by the last layer of the first converter module 5102 can be output to the first attention module 520, and the 25 512-dimensional vectors obtained by a middle layer may be output to the second attention module 540.
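The routing of encoder layer outputs described above can be sketched as follows. Which middle layers feed the auxiliary tasks is not stated in the text, so the indices below are hypothetical, as is the function name `route_encoder_outputs`; strings stand in for the 25 x 512 hidden state groups.

```python
def route_encoder_outputs(layer_outputs, aux_layer_indices):
    """Send the last layer's output to the main task and selected
    intermediate layers' outputs to the auxiliary tasks.

    aux_layer_indices are 0-based positions chosen by the caller (assumed).
    """
    main_input = layer_outputs[-1]
    aux_inputs = [layer_outputs[i] for i in aux_layer_indices]
    return main_input, aux_inputs


# Placeholder outputs for a 6-layer first converter module.
layer_outputs = [f"layer{i}_out" for i in range(1, 7)]
main_input, aux_inputs = route_encoder_outputs(layer_outputs, aux_layer_indices=[2, 3])
```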
- the first attention module 520 is used to obtain one set of first hidden state representations among the multiple sets of first hidden state representations and the first vector corresponding to each time step output by the first decoder module 530, and to process that set of first hidden state representations and the first vector corresponding to each time step to obtain the first attention representation corresponding to each time step.
- the first decoder module 530 is used to obtain the second vector corresponding to each time step, process the second vector corresponding to each time step to obtain the first vector corresponding to each time step, output the first vector corresponding to each time step to the first attention module 520, obtain the first attention representation corresponding to each time step, and process the first attention representation corresponding to each time step to obtain the target language speech features corresponding to the source language speech features.
- the first decoder module 530 includes: a pre-processing network (prenet) 5301, a second converter module 5302 and a post-processing network (postnet) 5303; the pre-processing network 5301 is used to obtain the second vector corresponding to each time step, process the second vector corresponding to each time step to obtain the first vector corresponding to each time step, and output the first vector corresponding to each time step to the first attention module 520; the second converter module 5302 is used to obtain the first attention representation corresponding to each time step, and process the first attention representation corresponding to each time step to obtain the target language speech features at each time step; the post-processing network 5303 is used to process the target language speech features at each time step to obtain the target language speech features corresponding to the source language speech features.
- the bottleneck dimension of the pre-processing network 5301 may be 32.
- the pre-processing network 5301 can obtain an 80-dimensional all-0 vector.
- This all-0 vector is the second vector corresponding to the first time step.
- the pre-processing network 5301 can process the all-0 vector.
- a 512-dimensional all-0 vector is obtained.
- This all-0 vector is the first vector corresponding to the first time step.
- the pre-processing network 5301 can input the all-0 vector to the first attention module 520.
- the first attention module 520 can process the all-0 vector and the 25 512-dimensional vectors obtained from the encoder module 510 to obtain the first attention representation corresponding to the first time step.
- the first attention module 520 inputs the first attention representation corresponding to the first time step to the second converter module 5302.
- the second converter module 5302 can process the first attention representation corresponding to the first time step to obtain the target language speech features at the first time step.
- the target language speech features at the first time step are the predicted target language speech features at the first time step.
- the pre-processing network 5301 can also obtain the actual target language speech features at the first time step.
- the actual target language speech features at the first time step can be understood as the second vector corresponding to the second time step.
- the pre-processing network 5301 can process the second vector corresponding to the second time step to obtain a 512-dimensional vector.
- the 512-dimensional vector is the first vector corresponding to the second time step.
- the pre-processing network 5301 can input the first vector corresponding to the second time step to the first attention module 520, and the first attention module 520 can process the first vector corresponding to the second time step and the 25 512-dimensional vectors obtained from the encoder module 510 to obtain the first attention representation corresponding to the second time step.
- the first attention module 520 inputs the first attention representation corresponding to the second time step to the second converter module 5302, and the second converter module 5302 can process the first attention representation corresponding to the second time step to obtain the target language speech features at the second time step.
- the target language speech features at the second time step are the predicted target language speech features at the second time step.
- the actual target language speech features at the i-th time step can be understood as the second vector corresponding to the (i+1)-th time step, and the pre-processing network 5301 can process the second vector corresponding to the (i+1)-th time step to obtain a 512-dimensional vector, which is the first vector corresponding to the (i+1)-th time step. Further, the pre-processing network 5301 can input the first vector corresponding to the (i+1)-th time step to the first attention module 520.
- the first attention module 520 can process the first vector corresponding to the (i+1)-th time step and the 25 512-dimensional vectors obtained from the encoder module 510 to obtain the first attention representation corresponding to the (i+1)-th time step. Furthermore, the first attention module 520 inputs the first attention representation corresponding to the (i+1)-th time step to the second converter module 5302. The second converter module 5302 can process the first attention representation corresponding to the (i+1)-th time step to obtain the target language speech features at the (i+1)-th time step.
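The step-by-step procedure above feeds the actual target features of step i as the decoder input of step i+1, starting from an all-0 vector. A minimal sketch of how such decoder inputs could be assembled follows; the function name `teacher_forcing_inputs` is an assumption, and the frames are plain Python lists rather than real spectrogram data.

```python
def teacher_forcing_inputs(target_frames, feature_dim=80):
    """Build the "second vectors" fed to the pre-processing network.

    Step 1 receives an all-0 vector; step i+1 receives the actual target
    frame of step i, so the inputs are the targets shifted right by one.
    """
    start_frame = [0.0] * feature_dim
    return [start_frame] + [list(frame) for frame in target_frames[:-1]]


# Three toy 80-dimensional target frames filled with 1.0, 2.0 and 3.0.
targets = [[float(t)] * 80 for t in (1, 2, 3)]
decoder_inputs = teacher_forcing_inputs(targets)
```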
- the pre-processing network 5301 can obtain an 80-dimensional all-0 vector, which is the second vector corresponding to the first time step.
- the pre-processing network 5301 can process the all-0 vector to obtain a 512-dimensional all-0 vector, which is the first vector corresponding to the first time step.
- the pre-processing network 5301 can input the all-0 vector to the first attention module 520, and the first attention module 520 can process the all-0 vector and the 25 512-dimensional vectors obtained from the encoder module 510 to obtain the first attention representation corresponding to the first time step.
- the first attention module 520 inputs the first attention representation corresponding to the first time step to the second converter module 5302.
- the second converter module 5302 can process the first attention representation corresponding to the first time step to obtain the target language speech features at the first time step.
- the target language speech features at the first time step are the predicted target language speech features at the first time step.
- the pre-processing network 5301 can process the predicted target language speech features at the first time step to obtain a 512-dimensional vector, where the predicted target language speech features at the first time step can be understood as the second vector corresponding to the above-mentioned second time step.
- the 512-dimensional vector is the first vector corresponding to the second time step.
- the pre-processing network 5301 can input the first vector corresponding to the second time step to the first attention module 520.
- the first attention module 520 can process the first vector corresponding to the second time step and the 25 512-dimensional vectors obtained from the encoder module 510 to obtain the first attention representation corresponding to the second time step.
- the first attention module 520 inputs the first attention representation corresponding to the second time step to the second converter module 5302.
- the second converter module 5302 can process the first attention representation corresponding to the second time step to obtain the target language speech features at the second time step.
- the predicted target language speech features at the i-th time step can be understood as the second vector corresponding to the (i+1)-th time step, and the pre-processing network 5301 can process the second vector corresponding to the (i+1)-th time step to obtain a 512-dimensional vector, which is the first vector corresponding to the (i+1)-th time step. Further, the pre-processing network 5301 can input the first vector corresponding to the (i+1)-th time step to the first attention module 520.
- the first attention module 520 can process the first vector corresponding to the (i+1)-th time step and the 25 512-dimensional vectors obtained from the encoder module 510 to obtain the first attention representation corresponding to the (i+1)-th time step. Furthermore, the first attention module 520 inputs the first attention representation corresponding to the (i+1)-th time step to the second converter module 5302. The second converter module 5302 can process the first attention representation corresponding to the (i+1)-th time step to obtain the target language speech features at the (i+1)-th time step.
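When the predicted features of step i are fed back as the input of step i+1, decoding becomes an autoregressive loop. The sketch below shows only that control flow: `step_fn` is a hypothetical stand-in for the prenet, attention and second converter module together, and the fixed `num_steps` is an assumption because the text does not describe a stopping criterion.

```python
def autoregressive_decode(step_fn, encoder_states, num_steps, feature_dim=80):
    """Run the decoder step by step, feeding each prediction back in.

    step_fn(prev_frame, encoder_states) -> next predicted frame (hypothetical).
    """
    prev_frame = [0.0] * feature_dim  # all-0 vector for the first time step
    outputs = []
    for _ in range(num_steps):
        prev_frame = step_fn(prev_frame, encoder_states)
        outputs.append(prev_frame)
    return outputs


def toy_step(prev_frame, encoder_states):
    # Toy stand-in for the real decoder: adds 1.0 to every feature.
    return [x + 1.0 for x in prev_frame]


frames = autoregressive_decode(toy_step, encoder_states=None, num_steps=3, feature_dim=2)
```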
- the second converter module 5302 can input the target language speech features at each time step to the post-processing network 5303, and the post-processing network 5303 can perform a weighted summation of the target language speech features at each time step to obtain the target language speech features corresponding to the source language speech features.
- the second converter module 5302 may be similar to the decoder structure shown in Figure 1, that is, the second converter module 5302 may include 6 converter layers. This application does not limit this.
- the second converter module 5302 may employ the same hyperparameters as the first converter module 5102.
- the second attention module 540 is used to obtain one set of first hidden state representations among the multiple sets of first hidden state representations and the third vector corresponding to each time step output by the second decoder module 550 corresponding to the second attention module 540, and to process that set of first hidden state representations and the third vector corresponding to each time step to obtain the second attention representation corresponding to each time step;
- the second decoder module 550 corresponding to the second attention module 540 is used to obtain the fourth vector corresponding to each time step, process the fourth vector corresponding to each time step to obtain the third vector corresponding to each time step, output the third vector corresponding to each time step to the second attention module 540, obtain the second attention representation corresponding to each time step, and process the second attention representation corresponding to each time step to obtain the auxiliary representation corresponding to the source language speech features.
- the second decoder module 550 may include: a pre-processing network, a third converter module and a post-processing network; the pre-processing network is used to obtain the fourth vector corresponding to each time step, process the fourth vector corresponding to each time step to obtain the third vector corresponding to each time step, and output the third vector corresponding to each time step to the second attention module 540; the third converter module is used to obtain the second attention representation corresponding to each time step, and process the second attention representation corresponding to each time step to obtain the auxiliary representation at each time step; the post-processing network is used to process the auxiliary representation at each time step to obtain the auxiliary representation corresponding to the source language speech features.
- the bottleneck dimension of the pre-processing network may be 32.
- the pre-processing network can obtain an 80-dimensional embedding vector, which is the fourth vector corresponding to the first time step.
- the pre-processing network can process the vector to obtain a 512-dimensional vector, which is the third vector corresponding to the first time step.
- the pre-processing network can input this vector to the second attention module 540, and the second attention module 540 can process this vector and the 25 512-dimensional vectors obtained from the encoder module 510 to obtain the second attention representation corresponding to the first time step.
- the second attention module 540 inputs the second attention representation corresponding to the first time step to the third converter module.
- the third converter module can process the second attention representation corresponding to the first time step to obtain the auxiliary representation at the first time step.
- the auxiliary representation at the first time step is the predicted auxiliary representation at the first time step.
- the pre-processing network can also obtain the actual auxiliary representation at the first time step.
- the actual auxiliary representation at the first time step can be understood as the fourth vector corresponding to the second time step.
- the pre-processing network can process the fourth vector corresponding to the second time step to obtain a 512-dimensional vector.
- the 512-dimensional vector is the third vector corresponding to the second time step.
- the pre-processing network can input the third vector corresponding to the second time step to the second attention module 540.
- the second attention module 540 can process the third vector corresponding to the second time step and the 25 512-dimensional vectors obtained from the encoder module 510 to obtain the second attention representation corresponding to the second time step. Furthermore, the second attention module 540 inputs the second attention representation corresponding to the second time step to the third converter module. The third converter module can process the second attention representation corresponding to the second time step to obtain an auxiliary representation at the second time step. The auxiliary representation at the second time step is the predicted auxiliary representation at the second time step.
- the actual auxiliary representation at the i-th time step can be understood as the fourth vector corresponding to the i+1-th time step.
- the pre-processing network can process the fourth vector corresponding to the (i+1)-th time step to obtain a 512-dimensional vector.
- the 512-dimensional vector is the third vector corresponding to the (i+1)-th time step.
- the pre-processing network can input the third vector corresponding to the (i+1)-th time step to the second attention module 540, and the second attention module 540 can process the third vector corresponding to the (i+1)-th time step and the 25 512-dimensional vectors obtained from the encoder module 510 to obtain the second attention representation corresponding to the (i+1)-th time step.
- the second attention module 540 inputs the second attention representation corresponding to the (i+1)-th time step to the third converter module.
- the third converter module can process the second attention representation corresponding to the (i+1)-th time step to obtain the auxiliary representation at the (i+1)-th time step.
- the third converter module can input the auxiliary representation at each time step to the post-processing network, and the post-processing network can perform a weighted summation of the auxiliary representation at each time step to obtain an auxiliary representation corresponding to the speech features of the source language.
- the above-mentioned auxiliary representation may be a speech recognition result, such as a source language text corresponding to the source language speech.
- the above-mentioned auxiliary representation may be a speech translation result, such as a target language text.
- the speech-to-speech translation model provided by this application is a corresponding improvement on the existing Translatotron model, specifically by replacing the long short-term memory network (Long Short-Term Memory, LSTM) in the Translatotron model with converter (Transformer) modules.
- the speech-to-speech translation model can be called a transformer-based translator model (Transformer-based Translatotron).
- in an LSTM, the calculation at each time step is a local calculation, while in the converter module, the calculation at each time step is a global calculation, which can improve the model accuracy.
- the Transformer-based Translatotron provided in this application can be trained without pseudo-annotated speech-to-speech translation samples, and the speech-to-speech model trained in this way can be called the baseline system.
- the Transformer-based Translatotron provided in this application can also be trained based on pseudo-annotated speech-to-speech translation samples.
- the speech-to-speech model trained in this way can be called the baseline system + PTLA.
- This application uses the TEDEn2Zh data set (English to Chinese), which is commonly used in speech translation, to test the performance of the baseline system and the baseline system + PTLA, as shown in Table 1:
- S-PER represents the phoneme recognition error rate of the speech recognition task on the test set.
- Tp-BLEU represents the phoneme-based Bilingual Evaluation Understudy (BLEU) score of the speech-to-text translation task on the test set.
- Dev-BLEU represents the phoneme-based BLEU score of the main task on the development set.
- test-BLEU represents the phoneme-based BLEU score of the main task on the test set.
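As an illustration of a phoneme-level error rate such as S-PER, the sketch below uses the common definition of edit distance divided by reference length. The text does not state how S-PER is actually computed, so this definition, the function names, and the toy phoneme sequences are all assumptions.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions,
    insertions and deletions each cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def phoneme_error_rate(ref_phonemes, hyp_phonemes):
    """Edit distance normalized by the reference length (assumed definition)."""
    if not ref_phonemes:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_phonemes, hyp_phonemes) / len(ref_phonemes)


# One deleted phoneme out of four reference phonemes.
per = phoneme_error_rate(["n", "i", "h", "ao"], ["n", "i", "ao"])
```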
- the baseline system can achieve good performance in complex language-to-language translation, and the baseline system + PTLA solution can effectively improve model performance.
- a speech-to-speech translation method is provided below:
- Figure 7 is a flow chart of a speech-to-speech translation method provided by an embodiment of the present application.
- This method can be executed by any electronic device such as a mobile phone, tablet computer, notebook computer, handheld computer, MID, or desktop computer, for example the execution device shown in Figure 2; this application does not limit this.
- the method includes:
- S720 Input the source language speech features into the speech-to-speech translation model to obtain the target language speech features corresponding to the source language speech features.
- the speech-to-speech translation model can be trained by the above-mentioned model training method. Since the speech-to-speech translation model obtained by the above-mentioned training method has higher accuracy, speech-to-speech translation can be better realized with it.
- Figure 8 is a schematic diagram of a model training device 800 provided by an embodiment of the present application.
- the device 800 includes: an acquisition module 810, a generation module 820 and a training module 830, where the acquisition module 810 is used to acquire speech recognition samples and real speech-to-speech translation samples; the generation module 820 is used to generate pseudo-annotated speech-to-speech translation samples according to the speech recognition samples; the training module 830 is used to train the speech-to-speech translation model according to the pseudo-annotated speech-to-speech translation samples and the real speech-to-speech translation samples.
- the training module 830 is specifically configured to: pre-train a speech-to-speech translation model based on pseudo-annotated speech-to-speech translation samples, and fine-tune the pre-trained speech-to-speech translation model based on real speech-to-speech translation samples.
- the training module 830 is specifically configured to: fine-tune the pre-trained speech-to-speech translation model through real speech-to-speech translation samples; or fine-tune the pre-trained speech-to-speech translation model based on real speech-to-speech translation samples and pseudo-annotated speech-to-speech translation samples.
- the device 800 further includes: a labeling module 840, configured to, before the pre-trained speech-to-speech translation model is fine-tuned according to the real speech-to-speech translation samples and the pseudo-labeled speech-to-speech translation samples, label the real speech-to-speech translation samples with a first label, where the first label is used to identify the real speech-to-speech translation samples as real samples, and label the pseudo-labeled speech-to-speech translation samples with a second label, where the second label is used to identify the pseudo-labeled speech-to-speech translation samples as pseudo-labeled samples.
- the training module 830 is specifically used to: upsample real speech-to-speech translation samples to obtain upsampled speech-to-speech translation samples; and fine-tune the pre-trained speech-to-speech translation model according to the upsampled speech-to-speech translation samples and the pseudo-annotated speech-to-speech translation samples.
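The labeling and upsampling steps above can be sketched as building a mixed fine-tuning set. The label strings `<real>` and `<pseudo>`, the upsampling factor, and upsampling by simple repetition are all assumptions; the text only says that the two sample types receive distinct labels and that real samples are upsampled.

```python
def build_finetune_set(real_samples, pseudo_samples, upsample_factor=2):
    """Label each sample by origin and repeat the real samples.

    Returns a list of (label, sample) pairs; labels and factor are hypothetical.
    """
    labeled_real = [("<real>", s) for s in real_samples] * upsample_factor
    labeled_pseudo = [("<pseudo>", s) for s in pseudo_samples]
    return labeled_real + labeled_pseudo


real = ["real_a", "real_b"]
pseudo = ["pseudo_a", "pseudo_b", "pseudo_c"]
finetune_set = build_finetune_set(real, pseudo, upsample_factor=2)
```

Repeating the real samples raises their share of the mixture, which matters when the pseudo-annotated samples greatly outnumber them.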
- real speech-to-speech translation samples include: first source language speech features, first source language text, first target language speech features, and first target language text; speech recognition samples include: second source language speech features and second source language text.
- the generation module 820 is specifically configured to: translate the second source language text to obtain the second target language text; synthesize the second target language text to obtain the second target language speech features; wherein the pseudo-annotated speech-to-speech translation samples include: second source language speech features, second source language text, second target language text, and second target language speech features.
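The generation of a pseudo-annotated sample from a speech recognition sample can be sketched as a two-stage pipeline. The `translate` and `synthesize` callables are hypothetical placeholders for a machine translation system and a speech synthesis system, and the class names are illustrative only.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class AsrSample:
    src_speech: List[float]  # second source language speech features
    src_text: str            # second source language text


@dataclass
class PseudoS2STSample:
    src_speech: List[float]
    src_text: str
    tgt_text: str            # second target language text (translated)
    tgt_speech: List[float]  # second target language speech features (synthesized)


def make_pseudo_sample(sample: AsrSample,
                       translate: Callable[[str], str],
                       synthesize: Callable[[str], List[float]]) -> PseudoS2STSample:
    """Translate the source text, then synthesize speech for the translation."""
    tgt_text = translate(sample.src_text)
    tgt_speech = synthesize(tgt_text)
    return PseudoS2STSample(sample.src_speech, sample.src_text, tgt_text, tgt_speech)


# Toy stand-ins so the sketch runs without real MT or TTS systems.
pseudo = make_pseudo_sample(
    AsrSample(src_speech=[0.1, 0.2], src_text="hello"),
    translate=lambda text: text.upper(),
    synthesize=lambda text: [float(len(text))],
)
```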
- the speech-to-speech translation model includes: an encoder module, a first attention module, a first decoder module, N second attention modules, and N second decoder modules, where N is a positive integer, and the N second attention modules have a one-to-one correspondence with the N second decoder modules;
- the encoder module is used to obtain the speech features of the source language and process the speech features of the source language to obtain multiple sets of first hidden state representations corresponding to the speech features of the source language;
- the first attention module is used to obtain one set of first hidden state representations among the multiple sets of first hidden state representations and the first vector corresponding to each time step output by the first decoder module, and to process that set of first hidden state representations and the first vector corresponding to each time step to obtain the first attention representation corresponding to each time step;
- the first decoder module is used to obtain the second vector corresponding to each time step, process the second vector corresponding to each time step to obtain the first vector corresponding to each time step, output the first vector corresponding to each time step to the first attention module, obtain the first attention representation corresponding to each time step, and process the first attention representation corresponding to each time step to obtain the target language speech features corresponding to the source language speech features;
- the second attention module is used to obtain one set of first hidden state representations among the multiple sets of first hidden state representations and the third vector corresponding to each time step output by the second decoder module corresponding to the second attention module, and to process that set of first hidden state representations and the third vector corresponding to each time step to obtain the second attention representation corresponding to each time step;
- the second decoder module corresponding to the second attention module is used to obtain the fourth vector corresponding to each time step, process the fourth vector corresponding to each time step to obtain the third vector corresponding to each time step, output the third vector corresponding to each time step to the second attention module, obtain the second attention representation corresponding to each time step, and process the second attention representation corresponding to each time step to obtain an auxiliary representation corresponding to the speech features of the source language.
- the encoder module includes: a convolutional neural network sub-module and a first converter module; the convolutional neural network sub-module is used to obtain the source language speech features and process the source language speech features to obtain the second hidden state representation corresponding to the source language speech features; the first converter module is used to obtain the second hidden state representation and process the second hidden state representation to obtain multiple sets of first hidden state representations.
- the first decoder module includes: a pre-processing network, a second converter module and a post-processing network; the pre-processing network is used to obtain the second vector corresponding to each time step, process the second vector corresponding to each time step to obtain the first vector corresponding to each time step, and output the first vector corresponding to each time step to the first attention module; the second converter module is used to obtain the first attention representation corresponding to each time step, and process the first attention representation corresponding to each time step to obtain the target language speech features at each time step; the post-processing network is used to process the target language speech features at each time step to obtain the target language speech features corresponding to the source language speech features.
- the device embodiments and the method embodiments may correspond to each other, and similar descriptions may refer to the method embodiments. To avoid repetition, they will not be repeated here.
- the device 800 shown in Figure 8 can execute the above-mentioned model training method embodiment, and the foregoing and other operations and/or functions of each module in the device 800 respectively implement the corresponding processes in the above-mentioned model training method. For brevity, they are not repeated here.
- the device 800 in the embodiment of the present application is described above from the perspective of functional modules in conjunction with the accompanying drawings. It should be understood that these functional modules can be implemented in the form of hardware, through instructions in the form of software, or through a combination of hardware and software modules. Specifically, each step of the method embodiments in the embodiments of the present application can be completed by integrated logic circuits of hardware in the processor and/or instructions in the form of software. The steps of the methods disclosed in conjunction with the embodiments of the present application can be directly embodied as being executed by a hardware decoding processor, or executed by a combination of hardware and software modules in the decoding processor.
- the software module may be located in a mature storage medium in the field such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, register, etc.
- the storage medium is located in the memory, and the processor reads the information in the memory and completes the steps in the above model training method embodiment in combination with its hardware.
- Figure 9 is a schematic diagram of a speech-to-speech translation device 900 provided by an embodiment of the present application.
- the device 900 includes: an acquisition module 910 and a processing module 920, where the acquisition module 910 is used to acquire source language speech. Feature; processing module 920 is used to input source language speech features into the speech-to-speech translation model trained by the above model training method to obtain target language speech features corresponding to the source language speech features.
- the device embodiments and the method embodiments may correspond to each other, and similar descriptions may refer to the method embodiments. To avoid repetition, they will not be repeated here.
- the device 900 shown in Figure 9 can execute the above embodiment of the speech-to-speech translation method, and the foregoing and other operations and/or functions of each module in the device 900 are respectively intended to implement the corresponding processes in the above-mentioned speech-to-speech translation method. , for the sake of brevity, will not be repeated here.
- the device 900 in the embodiment of the present application is described above from the perspective of functional modules in conjunction with the accompanying drawings. It should be understood that this functional module can be implemented in the form of hardware, can also be implemented through instructions in the form of software, or can also be implemented through a combination of hardware and software modules. Specifically, each step of the method embodiments in the embodiments of the present application can be completed by integrated logic circuits of hardware in the processor and/or instructions in the form of software. The steps of the methods disclosed in conjunction with the embodiments of the present application can be directly embodied in hardware. The execution of the decoding processor is completed, or the execution is completed using a combination of hardware and software modules in the decoding processor.
- The software module may be located in a storage medium that is mature in the field, such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or a register.
- The storage medium is located in the memory. The processor reads the information in the memory and completes the steps of the above speech-to-speech translation method embodiment in combination with its hardware.
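The two-module structure of device 900 described above can be sketched in Python as follows. This is only an illustrative sketch: the class names, the `Features` type alias, and `toy_model` are hypothetical stand-ins that do not appear in the application, and the trained speech-to-speech translation model is replaced by a placeholder function.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical type alias: one utterance as a list of feature frames.
Features = List[List[float]]


@dataclass
class AcquisitionModule:
    """Stands in for acquisition module 910: supplies source language speech features."""
    source: Callable[[], Features]

    def acquire(self) -> Features:
        return self.source()


@dataclass
class ProcessingModule:
    """Stands in for processing module 920: applies a trained speech-to-speech model."""
    model: Callable[[Features], Features]

    def process(self, source_features: Features) -> Features:
        return self.model(source_features)


@dataclass
class SpeechToSpeechDevice:
    """Stands in for device 900: wires the two modules together."""
    acquisition: AcquisitionModule
    processing: ProcessingModule

    def translate(self) -> Features:
        return self.processing.process(self.acquisition.acquire())


def toy_model(frames: Features) -> Features:
    # Placeholder for the trained model (assumption): simply reverses the frame order.
    return [list(frame) for frame in reversed(frames)]


device = SpeechToSpeechDevice(
    AcquisitionModule(lambda: [[0.1, 0.2], [0.3, 0.4]]),
    ProcessingModule(toy_model),
)
result = device.translate()
```

Keeping acquisition and processing behind separate interfaces mirrors the logical division of modules in the text; either part could equally be backed by hardware or by software.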
- FIG. 10 is a schematic block diagram of an electronic device 1000 provided by an embodiment of the present application.
- the electronic device 1000 may include:
- A memory 1010 and a processor 1020, where the memory 1010 is used to store a computer program and transmit the program code to the processor 1020.
- the processor 1020 can call and run the computer program from the memory 1010 to implement the method in the embodiment of the present application.
- the processor 1020 may be configured to execute the above method embodiments according to instructions in the computer program.
- the processor 1020 may include but is not limited to:
- Digital Signal Processor (DSP)
- Application Specific Integrated Circuit (ASIC)
- Field Programmable Gate Array (FPGA)
- the memory 1010 includes but is not limited to:
- Non-volatile memory, which can be read-only memory (Read-Only Memory, ROM), programmable read-only memory (Programmable ROM, PROM), erasable programmable read-only memory (Erasable PROM, EPROM), electrically erasable programmable read-only memory (Electrically EPROM, EEPROM), or flash memory.
- Volatile memory, which may be random access memory (Random Access Memory, RAM) used as an external cache. Many forms of RAM are available, for example:
- static random access memory (Static RAM, SRAM)
- dynamic random access memory (Dynamic RAM, DRAM)
- synchronous dynamic random access memory (Synchronous DRAM, SDRAM)
- double data rate synchronous dynamic random access memory (Double Data Rate SDRAM, DDR SDRAM)
- enhanced synchronous dynamic random access memory (Enhanced SDRAM, ESDRAM)
- synchronous link dynamic random access memory (Synchlink DRAM, SLDRAM)
- direct Rambus random access memory (Direct Rambus RAM, DR RAM)
- the computer program can be divided into one or more modules, and the one or more modules are stored in the memory 1010 and executed by the processor 1020 to complete the tasks provided by this application.
- the one or more modules may be a series of computer program instruction segments capable of completing specific functions. The instruction segments are used to describe the execution process of the computer program in the electronic device.
- the electronic device may also include:
- A transceiver 1030, which may be connected to the processor 1020 or the memory 1010.
- the processor 1020 can control the transceiver 1030 to communicate with other devices. Specifically, it can send information or data to other devices, or receive information or data sent by other devices.
- Transceiver 1030 may include a transmitter and a receiver.
- the transceiver 1030 may further include an antenna, and the number of antennas may be one or more.
- The components of the electronic device are connected through a bus system, which includes a power bus, a control bus, and a status signal bus in addition to the data bus.
- This application also provides a computer storage medium on which a computer program is stored.
- When the computer program is executed by a computer, the computer can perform the method of the above method embodiments.
- embodiments of the present application also provide a computer program product containing instructions, which when executed by a computer causes the computer to perform the method of the above method embodiments.
- the computer program product includes one or more computer instructions.
- the computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable device.
- The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another. For example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired means (such as coaxial cable, optical fiber, or digital subscriber line (DSL)) or by wireless means (such as infrared, wireless, microwave, etc.).
- the computer-readable storage medium can be any available medium that can be accessed by a computer or a data storage device such as a server or data center integrated with one or more available media.
- the available media may be magnetic media (such as floppy disks, hard disks, magnetic tapes), optical media (such as digital video discs (DVD)), or semiconductor media (such as solid state disks (SSD)), etc.
- the disclosed systems, devices and methods can be implemented in other ways.
- the device embodiments described above are only illustrative.
- the division of the modules is only a logical function division. In actual implementation, there may be other division methods.
- For example, multiple modules or components may be combined or integrated into another system, or some features may be ignored or not implemented.
- The mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or modules, and may be in electrical, mechanical, or other forms.
- Modules described as separate components may or may not be physically separated, and components shown as modules may or may not be physical modules, that is, they may be located in one place, or they may be distributed to multiple network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the solution of this embodiment. For example, each functional module in each embodiment of the present application can be integrated into a processing module, or each module can exist physically alone, or two or more modules can be integrated into one module.
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Theoretical Computer Science (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Artificial Intelligence (AREA)
- Human Computer Interaction (AREA)
- Multimedia (AREA)
- Acoustics & Sound (AREA)
- General Health & Medical Sciences (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Biomedical Technology (AREA)
- Data Mining & Analysis (AREA)
- Software Systems (AREA)
- Mathematical Physics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Computing Systems (AREA)
- Biophysics (AREA)
- Molecular Biology (AREA)
- Evolutionary Computation (AREA)
- Signal Processing (AREA)
- Probability & Statistics with Applications (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Quality & Reliability (AREA)
- Machine Translation (AREA)
Claims (15)
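- A model training method, characterized by comprising: obtaining speech recognition samples and real speech-to-speech translation samples; generating pseudo-labeled speech-to-speech translation samples according to the speech recognition samples; and training a speech-to-speech translation model according to the pseudo-labeled speech-to-speech translation samples and the real speech-to-speech translation samples.
- The method according to claim 1, characterized in that training the speech-to-speech translation model according to the pseudo-labeled speech-to-speech translation samples and the real speech-to-speech translation samples comprises: pre-training the speech-to-speech translation model according to the pseudo-labeled speech-to-speech translation samples, and fine-tuning the pre-trained speech-to-speech translation model according to the real speech-to-speech translation samples.
- The method according to claim 2, characterized in that fine-tuning the pre-trained speech-to-speech translation model according to the real speech-to-speech translation samples comprises: fine-tuning the pre-trained speech-to-speech translation model using the real speech-to-speech translation samples; or fine-tuning the pre-trained speech-to-speech translation model according to the real speech-to-speech translation samples and the pseudo-labeled speech-to-speech translation samples.
- The method according to claim 3, characterized in that, before fine-tuning the pre-trained speech-to-speech translation model according to the real speech-to-speech translation samples and the pseudo-labeled speech-to-speech translation samples, the method further comprises: labeling the real speech-to-speech translation samples with a first label, the first label being used to identify the real speech-to-speech translation samples as real samples; and labeling the pseudo-labeled speech-to-speech translation samples with a second label, the second label being used to identify the pseudo-labeled speech-to-speech translation samples as pseudo-labeled samples.
- The method according to claim 3 or 4, characterized in that fine-tuning the pre-trained speech-to-speech translation model according to the real speech-to-speech translation samples and the pseudo-labeled speech-to-speech translation samples comprises: upsampling the real speech-to-speech translation samples to obtain upsampled speech-to-speech translation samples; and fine-tuning the pre-trained speech-to-speech translation model using the upsampled speech-to-speech translation samples and the pseudo-labeled speech-to-speech translation samples.
- The method according to any one of claims 1-4, characterized in that the real speech-to-speech translation samples comprise: a first source language speech feature, a first source language text, a first target language speech feature, and a first target language text; and the speech recognition samples comprise: a second source language speech feature and a second source language text.
- The method according to claim 6, characterized in that generating the pseudo-labeled speech-to-speech translation samples according to the speech recognition samples comprises: translating the second source language text to obtain a second target language text; and synthesizing the second target language text to obtain a second target language speech feature; wherein the pseudo-labeled speech-to-speech translation samples comprise: the second source language speech feature, the second source language text, the second target language text, and the second target language speech feature.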
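- The method according to any one of claims 1-4, characterized in that the speech-to-speech translation model comprises: an encoder module, a first attention module, a first decoder module, N second attention modules, and N second decoder modules, where N is a positive integer and the N second attention modules correspond one-to-one to the N second decoder modules; the encoder module is used to obtain source language speech features and process the source language speech features to obtain multiple groups of first hidden state representations corresponding to the source language speech features; the first attention module is used to obtain one group of first hidden state representations among the multiple groups of first hidden state representations and the first vector corresponding to each time step output by the first decoder, and to process that group of first hidden state representations and the first vector corresponding to each time step to obtain a first attention representation corresponding to each time step; the first decoder module is used to obtain a second vector corresponding to each time step, process the second vector corresponding to each time step to obtain the first vector corresponding to each time step, output the first vector corresponding to each time step to the first attention module, obtain the first attention representation corresponding to each time step, and process the first attention representation corresponding to each time step to obtain target language speech features corresponding to the source language speech features; in the training stage of the speech-to-speech translation model, the second attention module is used to obtain one group of first hidden state representations among the multiple groups of first hidden state representations and the third vector corresponding to each time step output by the second decoder corresponding to the second attention module, and to process that group of first hidden state representations and the third vector corresponding to each time step to obtain a second attention representation corresponding to each time step; the second decoder module corresponding to the second attention module is used to obtain a fourth vector corresponding to each time step, process the fourth vector corresponding to each time step to obtain the third vector corresponding to each time step, output the third vector corresponding to each time step to the second attention module, obtain the second attention representation corresponding to each time step, and process the second attention representation corresponding to each time step to obtain an auxiliary representation corresponding to the source language speech features.
- The method according to claim 8, characterized in that the encoder module comprises: a convolutional neural network submodule and a first transformer module; the convolutional neural network submodule is used to obtain the source language speech features and process the source language speech features to obtain a second hidden state representation corresponding to the source language speech features; the first transformer module is used to obtain the second hidden state representation and process the second hidden state representation to obtain the multiple groups of first hidden state representations.
- The method according to claim 9, characterized in that the first decoder module comprises: a pre-processing network, a second transformer module, and a post-processing network; the pre-processing network is used to obtain the second vector corresponding to each time step, process the second vector corresponding to each time step to obtain the first vector corresponding to each time step, and output the first vector corresponding to each time step to the first attention module; the second transformer module is used to obtain the first attention representation corresponding to each time step and process the first attention representation corresponding to each time step to obtain the target language speech features at each time step; the post-processing network is used to process the target language speech features at each time step to obtain the target language speech features corresponding to the source language speech features.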
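- A speech-to-speech translation method, characterized by comprising: obtaining source language speech features; and inputting the source language speech features into a speech-to-speech translation model trained by the method according to any one of claims 1 to 10, to obtain target language speech features corresponding to the source language speech features.
- A model training apparatus, characterized by comprising: an acquisition module, used to obtain speech recognition samples and real speech-to-speech translation samples; a generation module, used to generate pseudo-labeled speech-to-speech translation samples according to the speech recognition samples; and a training module, used to train a speech-to-speech translation model according to the pseudo-labeled speech-to-speech translation samples and the real speech-to-speech translation samples.
- A speech-to-speech translation apparatus, characterized by comprising: an acquisition module, used to obtain source language speech features; and a processing module, used to input the source language speech features into a speech-to-speech translation model trained by the method according to any one of claims 1 to 10, to obtain target language speech features corresponding to the source language speech features.
- An electronic device, characterized by comprising: a processor and a memory, the memory being used to store a computer program, and the processor being used to call and run the computer program stored in the memory to execute the method according to any one of claims 1 to 11.
- A computer-readable storage medium, characterized by being used to store a computer program, the computer program causing a computer to execute the method according to any one of claims 1 to 11.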
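The data preparation recited in the claims (translating and synthesizing speech recognition samples into pseudo-labeled samples, tagging real and pseudo-labeled samples, and upsampling the real samples before fine-tuning) can be illustrated with the Python sketch below. It is a minimal sketch under stated assumptions, not the applicant's implementation: all names are hypothetical, `toy_translate` and `toy_synthesize` are trivial stand-ins for a machine translation system and a speech synthesizer, and upsampling by simple repetition is an assumption, since the claims do not specify how upsampling is performed.

```python
import random
from dataclasses import dataclass
from typing import Callable, List

Features = List[List[float]]  # hypothetical: one utterance as a list of feature frames


@dataclass
class ASRSample:
    src_speech: Features
    src_text: str


@dataclass
class S2STSample:
    src_speech: Features
    src_text: str
    tgt_text: str
    tgt_speech: Features
    tag: str = ""  # "real" (first label) or "pseudo" (second label)


def make_pseudo_sample(
    asr: ASRSample,
    translate: Callable[[str], str],
    synthesize: Callable[[str], Features],
) -> S2STSample:
    """Translate the source text, then synthesize target speech features from it."""
    tgt_text = translate(asr.src_text)
    tgt_speech = synthesize(tgt_text)
    return S2STSample(asr.src_speech, asr.src_text, tgt_text, tgt_speech, tag="pseudo")


def build_finetune_set(
    real: List[S2STSample],
    pseudo: List[S2STSample],
    upsample_factor: int,
    seed: int = 0,
) -> List[S2STSample]:
    """Tag real samples, repeat them upsample_factor times (assumed upsampling), then mix."""
    tagged_real = [
        S2STSample(s.src_speech, s.src_text, s.tgt_text, s.tgt_speech, tag="real")
        for s in real
    ]
    mixed = tagged_real * upsample_factor + list(pseudo)
    random.Random(seed).shuffle(mixed)
    return mixed


def toy_translate(text: str) -> str:
    # Stand-in for a machine translation system (assumption).
    return text.upper()


def toy_synthesize(text: str) -> Features:
    # Stand-in for a speech synthesizer (assumption): one frame per word.
    return [[float(len(word))] for word in text.split()]


asr_data = [ASRSample([[0.1]], "hello world"), ASRSample([[0.2]], "good morning")]
real_data = [S2STSample([[0.3]], "thank you", "MERCI", [[5.0]])]

pseudo_data = [make_pseudo_sample(s, toy_translate, toy_synthesize) for s in asr_data]
finetune_set = build_finetune_set(real_data, pseudo_data, upsample_factor=3)
```

In this sketch `pseudo_data` would serve as the pre-training set and `finetune_set` as the mixed fine-tuning set; the `tag` field lets a model distinguish real from pseudo-labeled samples during fine-tuning.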
Priority Applications (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| EP23795071.2A EP4517742A4 (en) | 2022-04-26 | 2023-04-14 | METHOD AND APPARATUS FOR MODEL TRAINING, METHOD AND APPARATUS FOR SPEECH-TO-SPEECH TRANSLATION, AND SUPPORT |
| US18/724,300 US20250061888A1 (en) | 2022-04-26 | 2023-04-14 | Model training method and apparatus, speech-to-speech translation method and apparatus, and medium |
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202210448585.8 | 2022-04-26 | ||
| CN202210448585.8A CN114822499B (zh) | 2022-04-26 | 2022-04-26 | Model training method and apparatus, speech-to-speech translation method and apparatus, and medium |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2023207638A1 (zh) | 2023-11-02 |
Family
ID=82508177
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/CN2023/088492 Ceased WO2023207638A1 (zh) | Model training method and apparatus, speech-to-speech translation method and apparatus, and medium | 2022-04-26 | 2023-04-14 |
Country Status (4)
| Country | Link |
|---|---|
| US (1) | US20250061888A1 (zh) |
| EP (1) | EP4517742A4 (zh) |
| CN (1) | CN114822499B (zh) |
| WO (1) | WO2023207638A1 (zh) |
Families Citing this family (9)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
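| CN114822499B (zh) * | 2022-04-26 | 2024-11-01 | 北京有竹居网络技术有限公司 | Model training method and apparatus, speech-to-speech translation method and apparatus, and medium |
| CN115881102A (zh) * | 2022-11-14 | 2023-03-31 | 北京数美时代科技有限公司 | A speech recognition method, system, and storage medium for data-scarce scenarios |
| CN115862630A (zh) * | 2022-11-30 | 2023-03-28 | 北京有竹居网络技术有限公司 | Speech translation method and apparatus, electronic device, and medium |
| CN115828943A (zh) * | 2022-12-28 | 2023-03-21 | 沈阳雅译网络技术有限公司 | A speech translation model modeling method and device based on speech synthesis data |
| CN116206616A (zh) * | 2022-12-30 | 2023-06-02 | 沈阳雅译网络技术有限公司 | A speech translation and speech recognition method based on dynamic sequence compression |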
| US12596890B2 (en) * | 2023-03-30 | 2026-04-07 | Salesforce, Inc. | Systems and methods for cross-lingual transfer learning |
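| CN116343751B (zh) * | 2023-05-29 | 2023-08-11 | 深圳市泰为软件开发有限公司 | Audio analysis method and apparatus based on speech translation |
| CN119091870A (zh) * | 2023-06-05 | 2024-12-06 | 北京有竹居网络技术有限公司 | Method for generating a speech translation model, translation method, and apparatus |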
| US20250095631A1 (en) * | 2023-09-18 | 2025-03-20 | Adobe Inc. | Position-based text-to-speech model |
Citations (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
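| CN111738025A (zh) * | 2020-08-20 | 2020-10-02 | 腾讯科技(深圳)有限公司 | Artificial intelligence-based translation method and apparatus, electronic device, and storage medium |
| CN112966529A (zh) * | 2021-04-08 | 2021-06-15 | 中译语通科技股份有限公司 | Neural network machine translation training method, system, medium, device, and application |
| CN114822499A (zh) * | 2022-04-26 | 2022-07-29 | 北京有竹居网络技术有限公司 | Model training method and apparatus, speech-to-speech translation method and apparatus, and medium |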
Family Cites Families (11)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US11409791B2 (en) * | 2016-06-10 | 2022-08-09 | Disney Enterprises, Inc. | Joint heterogeneous language-vision embeddings for video tagging and search |
| GB201804073D0 (en) * | 2018-03-14 | 2018-04-25 | Papercup Tech Limited | A speech processing system and a method of processing a speech signal |
| WO2020146873A1 (en) * | 2019-01-11 | 2020-07-16 | Applications Technology (Apptek), Llc | System and method for direct speech translation system |
| US12032920B2 (en) * | 2019-03-29 | 2024-07-09 | Google Llc | Direct speech-to-speech translation via machine learning |
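| CN110503945B (zh) * | 2019-09-06 | 2022-07-08 | 北京金山数字娱乐科技有限公司 | A training method and apparatus for a speech processing model |
| CN110717345B (zh) * | 2019-10-15 | 2020-07-07 | 内蒙古工业大学 | A recurrent neural network cross-lingual machine translation method with translation realignment |
| CN111597778B (zh) * | 2020-04-15 | 2023-05-30 | 哈尔滨工业大学 | A self-supervision-based method and system for automatically optimizing machine translation output |
| CN111859994B (zh) * | 2020-06-08 | 2024-01-23 | 北京百度网讯科技有限公司 | Machine translation model acquisition and text translation method, apparatus, and storage medium |
| JP7663171B2 (ja) * | 2020-08-17 | 2025-04-16 | 国立研究開発法人情報通信研究機構 | Training method for a machine translation model for generating pseudo-parallel data, method for obtaining pseudo-parallel data, and training method for a machine translation model |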
| US12050882B2 (en) * | 2021-11-23 | 2024-07-30 | Baidu Usa Llc | Fused acoustic and text encoding for multimodal bilingual pretraining and speech translation |
| WO2023178583A1 (en) * | 2022-03-24 | 2023-09-28 | Microsoft Technology Licensing, Llc | Advanced clustering for self-supervised learning in speech recognition |
2022
- 2022-04-26 CN CN202210448585.8A patent/CN114822499B/zh active Active
2023
- 2023-04-14 EP EP23795071.2A patent/EP4517742A4/en active Pending
- 2023-04-14 WO PCT/CN2023/088492 patent/WO2023207638A1/zh not_active Ceased
- 2023-04-14 US US18/724,300 patent/US20250061888A1/en active Pending
Patent Citations (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
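| CN111738025A (zh) * | 2020-08-20 | 2020-10-02 | 腾讯科技(深圳)有限公司 | Artificial intelligence-based translation method and apparatus, electronic device, and storage medium |
| CN112966529A (zh) * | 2021-04-08 | 2021-06-15 | 中译语通科技股份有限公司 | Neural network machine translation training method, system, medium, device, and application |
| CN114822499A (zh) * | 2022-04-26 | 2022-07-29 | 北京有竹居网络技术有限公司 | Model training method and apparatus, speech-to-speech translation method and apparatus, and medium |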
Non-Patent Citations (3)
| Title |
|---|
| JIA YE; JOHNSON MELVIN; MACHEREY WOLFGANG; WEISS RON J.; CAO YUAN; CHIU CHUNG-CHENG; ARI NAVEEN; LAURENZO STELLA; WU YONGHUI: "Leveraging Weakly Supervised Data to Improve End-to-end Speech-to-text Translation", ICASSP 2019 - 2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), IEEE, 12 May 2019 (2019-05-12), pages 7180 - 7184, XP033565891, DOI: 10.1109/ICASSP.2019.8683343 * |
| See also references of EP4517742A4 * |
| YE JIA; MICHELLE TADMOR RAMANOVICH; TAL REMEZ; ROI POMERANTZ: "Translatotron 2: Robust direct speech-to-speech translation", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 3 December 2021 (2021-12-03), 201 Olin Library Cornell University Ithaca, NY 14853, XP091110046 * |
Also Published As
| Publication number | Publication date |
|---|---|
| US20250061888A1 (en) | 2025-02-20 |
| EP4517742A4 (en) | 2026-04-22 |
| EP4517742A1 (en) | 2025-03-05 |
| CN114822499B (zh) | 2024-11-01 |
| CN114822499A (zh) | 2022-07-29 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| WO2023207638A1 (zh) | | Model training method and apparatus, speech-to-speech translation method and apparatus, and medium |
| Tu et al. | | End-to-end text-to-speech for low-resource languages by cross-lingual transfer learning |
| CN113470662B (zh) | | Generating and using text-to-speech data for keyword spotting systems and speaker adaptation in speech recognition systems |
| US11322133B2 (en) | | Expressive text-to-speech utilizing contextual word-level style tokens |
| CN113327575B (zh) | | A speech synthesis method and apparatus, computer device, and storage medium |
| WO2023197206A1 (en) | | Personalized and dynamic text to speech voice cloning using incompletely trained text to speech models |
| WO2024055752A1 (zh) | | Training method for a speech synthesis model, speech synthesis method, and related apparatus |
| Kumar et al. | | Towards building text-to-speech systems for the next billion users |
| CN107590135A (zh) | | Automatic translation method, device, and system |
| US12597365B2 (en) | | Automatic translation between sign language and spoken language |
| US20240304175A1 (en) | | Speech modification using accent embeddings |
| CN116129862A (zh) | | Speech synthesis method and apparatus, electronic device, and storage medium |
| US20150221298A1 (en) | | System and Method for Cloud-Based Text-to-Speech Web Services |
| WO2023175367A1 (en) | | End-to-end streaming speech translation with neural transducer |
| CN115910002B (zh) | | An audio generation method, storage medium, and electronic device |
| CN114974249B (zh) | | A speech recognition method, apparatus, and storage medium |
| US12603078B2 (en) | | Generating speech data using artificial intelligence techniques |
| US20240233704A9 (en) | | Residual adapters for few-shot text-to-speech speaker adaptation |
| US20240386903A1 (en) | | Speech speed adjustment method and apparatus, electronic device, and readable storage medium |
| Arya et al. | | Direct vs cascaded speech-to-speech translation using transformer |
| CN115114933A (zh) | | Method, apparatus, device, and storage medium for text processing |
| WO2020166359A1 (ja) | | Estimation device, estimation method, and program |
| US12223979B1 (en) | | Pre-trained machine learning models for real-time speech form conversion |
| CN119274531B (zh) | | Training method for a speech unit prediction model, speech synthesis method, and electronic device |
| CN115294955B (zh) | | A model training and speech synthesis method, apparatus, device, and medium |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 23795071; Country of ref document: EP; Kind code of ref document: A1 |
| | WWE | Wipo information: entry into national phase | Ref document number: 18724300; Country of ref document: US |
| | WWE | Wipo information: entry into national phase | Ref document number: 2023795071; Country of ref document: EP |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | ENP | Entry into the national phase | Ref document number: 2023795071; Country of ref document: EP; Effective date: 20241126 |