WO2023207638A1 - Model training method, speech-to-speech translation method, device and medium - Google Patents

Info

Publication number
WO2023207638A1
Authority
WO
WIPO (PCT)
Prior art keywords
speech
time step
samples
translation
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/CN2023/088492
Other languages
English (en)
French (fr)
Inventor
董倩倩
岳凤鹏
高汝霆
王明轩
白奇丙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Youzhuju Network Technology Co Ltd
Original Assignee
Beijing Youzhuju Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Youzhuju Network Technology Co Ltd
Priority to EP23795071.2A (patent EP4517742A4)
Priority to US18/724,300 (patent US20250061888A1)
Publication of WO2023207638A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G06F40/42 Data-driven translation
    • G06F40/44 Statistical methods, e.g. probability models
    • G06F40/58 Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G06N3/045 Combinations of networks
    • G06N3/0455 Auto-encoder networks; Encoder-decoder networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/027 Concept to speech synthesisers; Generation of natural phrases from machine-based concepts
    • G10L15/005 Language recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/08 Speech classification or search
    • G10L15/18 Speech classification or search using natural language modelling
    • G10L15/183 Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility

Definitions

  • Embodiments of the present application relate to the field of machine learning technology, and in particular, to a model training method, a speech-to-speech translation method, a device, and a medium.
  • The speech-to-speech translation (S2ST) model aims to translate source language speech into target language speech. It is widely used in scenarios such as video translation, multinational conference speeches, and intercom translation. Speech-to-speech translation models usually need to be trained on a large amount of data. However, it is currently difficult to collect paired speech-to-speech translation samples in real-life scenarios, and this lack of data leads to low model training accuracy.
  • This application provides a model training method, a speech-to-speech translation method, a device and a medium, thereby improving model training accuracy.
  • A first aspect provides a model training method, including: obtaining speech recognition samples and real speech-to-speech translation samples; generating pseudo-annotated speech-to-speech translation samples based on the speech recognition samples; and training a speech-to-speech translation model based on the pseudo-annotated speech-to-speech translation samples and the real speech-to-speech translation samples.
  • A second aspect provides a speech-to-speech translation method, including: obtaining source language speech features; and inputting the source language speech features into a speech-to-speech translation model trained as in the first aspect or in an optional manner of the first aspect, to obtain the target language speech features corresponding to the source language speech features.
  • A third aspect provides a model training device, including: an acquisition module, a generation module and a training module. The acquisition module is used to acquire speech recognition samples and real speech-to-speech translation samples; the generation module is used to generate pseudo-annotated speech-to-speech translation samples based on the speech recognition samples; and the training module is used to train the speech-to-speech translation model based on the pseudo-annotated speech-to-speech translation samples and the real speech-to-speech translation samples.
  • A fourth aspect provides a speech-to-speech translation device, including: an acquisition module and a processing module. The acquisition module is used to acquire source language speech features; the processing module is used to input the source language speech features into a speech-to-speech translation model trained as in the first aspect or in an optional manner of the first aspect, to obtain the target language speech features corresponding to the source language speech features.
  • A fifth aspect provides an electronic device, including: a processor and a memory, the memory being used to store a computer program, and the processor being used to call and run the computer program stored in the memory to perform the method described in the first aspect or the second aspect.
  • A sixth aspect provides a computer-readable storage medium for storing a computer program, the computer program causing a computer to execute the method described in the first or second aspect.
  • Speech recognition samples are relatively easy to collect. Pseudo-annotated speech-to-speech translation samples can therefore be generated from them, expanding the set of speech-to-speech translation samples, which in turn can improve model training accuracy.
  • Figure 1 is a framework diagram of the Transformer.
  • Figure 2 is a schematic diagram of a system architecture involved in an embodiment of the present application.
  • Figure 3 is a flow chart of a model training method provided by an embodiment of the present application.
  • Figure 4 is a flow chart of another model training method provided by an embodiment of the present application.
  • Figure 5 is a schematic diagram of a speech-to-speech translation model provided by an embodiment of the present application.
  • Figure 6 is a schematic diagram of another speech-to-speech translation model provided by an embodiment of the present application.
  • Figure 7 is a flow chart of a speech-to-speech translation method provided by an embodiment of the present application.
  • Figure 8 is a schematic diagram of a model training device 800 provided by an embodiment of the present application.
  • Figure 9 is a schematic diagram of a speech-to-speech translation device 900 provided by an embodiment of the present application.
  • Figure 10 is a schematic block diagram of an electronic device 1000 provided by an embodiment of the present application.
  • The encoder processes the source language speech features and compresses them into a fixed-length hidden representation.
  • The hidden representation is also called a context vector, semantic encoding, semantic vector, etc. The hidden representation is expected to carry information that represents the speech features of a language well.
  • The decoder is initialized with the hidden representation to obtain the target language speech features.
  • Figure 1 is a framework diagram of the Transformer.
  • Add: residual connection
  • Norm: layer normalization
  • The decoder is almost the same as the encoder, except that an additional multi-head attention layer (encoder-decoder attention) can be added in the middle to process the output of the encoder.
  • The first unit of the decoder, which uses the multi-head self-attention mechanism, performs a masking operation to ensure that the decoder does not read information after the current position.
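  • The masking operation can be illustrated with a minimal sketch in plain Python. The function names `causal_mask` and `masked_softmax` and the toy score matrix are illustrative assumptions, not part of the application; the sketch only shows that each position receives zero attention weight for later positions.

```python
import math


def causal_mask(length):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(length)] for i in range(length)]


def masked_softmax(scores, mask):
    """Row-wise softmax that assigns zero weight to masked-out positions."""
    weights = []
    for row_scores, row_mask in zip(scores, mask):
        exps = [math.exp(s) if keep else 0.0
                for s, keep in zip(row_scores, row_mask)]
        total = sum(exps)  # never zero: the diagonal is always kept
        weights.append([e / total for e in exps])
    return weights


# Toy attention scores for a sequence of three positions (assumed values).
scores = [[0.0, 1.0, 2.0],
          [1.0, 0.0, 1.0],
          [2.0, 1.0, 0.0]]
weights = masked_softmax(scores, causal_mask(3))
```

  • In this sketch the first position can only attend to itself, so its weight row places all mass on position 0.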
  • This application proposes to expand the training data to improve model training accuracy.
  • The system architecture of the embodiments of the present application is shown in Figure 2.
  • Figure 2 is a schematic diagram of a system architecture related to an embodiment of the present application, including user equipment 201, data collection equipment 202, training equipment 203, execution equipment 204, database 205 and content library 206.
  • the data collection device 202 is used to read training data from the content library 206 and store the read training data in the database 205 .
  • the training data involved in the embodiments of this application includes pseudo-annotated speech-to-speech translation samples and real speech-to-speech translation samples.
  • the training device 203 trains the speech-to-speech translation model based on the training data maintained in the database 205, so that the trained speech-to-speech translation model can effectively translate the source language speech into the target language speech.
  • the execution device 204 is configured with an I/O interface 207 for data interaction with external devices.
  • The source language speech features sent by the user device 201 are received through the I/O interface 207.
  • The computing module 208 in the execution device 204 uses the trained speech-to-speech translation model to process the input source language speech features, outputs the target language speech features, and sends the result to the user device 201 through the I/O interface 207.
  • the user device 201 may include a mobile phone, a tablet computer, a laptop computer, a handheld computer, a mobile internet device (mobile internet device, MID), a desktop computer, or other terminal devices with the function of installing a browser.
  • the execution device 204 may be a server.
  • the server may be a computing device such as a rack server, a blade server, a tower server, or a cabinet server.
  • the server can be an independent test server or a test server cluster composed of multiple test servers.
  • the execution device 204 is connected to the user device 201 through the network.
  • The network may be a wireless or wired network such as an intranet, the Internet, a Global System for Mobile Communications (GSM) network, a Wideband Code Division Multiple Access (WCDMA) network, a fourth generation (4G) network, a fifth generation (5G) network, Bluetooth, Wireless Fidelity (Wi-Fi), or a call network.
  • Figure 2 is only a schematic diagram of a system architecture provided by an embodiment of the present application, and the positional relationship between the devices, components, modules, etc. shown in the figure does not constitute any limitation.
  • the above-mentioned data collection device 202, user device 201, training device 203 and execution device 204 may be the same device.
  • the above-mentioned database 205 can be distributed on one server or multiple servers, and the above-mentioned content library 206 can be distributed on one server or multiple servers.
  • Figure 3 is a flow chart of a model training method provided by an embodiment of the present application. This method can be executed by any electronic device such as a mobile phone, tablet computer, notebook computer, palmtop computer, MID, or desktop computer. For example, it can be executed by the training device in Figure 2; this application does not limit this. As shown in Figure 3, the method includes:
  • S310 Obtain speech recognition samples and real speech-to-speech translation samples;
  • S320 Generate pseudo-annotated speech-to-speech translation samples based on the speech recognition samples;
  • S330 Train a speech-to-speech translation model based on the pseudo-annotated speech-to-speech translation samples and the real speech-to-speech translation samples.
  • the speech-to-speech translation model in this application can be a speech-to-speech translation model based on multi-task learning (Multi-Task Learning, MTL).
  • Multi-task learning is a promising field in machine learning. Its goal is to use the useful information contained in multiple learning tasks to help learn a more accurate learner for each task.
  • Because inductive biases are shared between different tasks, the tasks can generally improve one another, which prevents a single task from easily falling into a local optimum.
  • For convenience, real speech-to-speech translation samples and pseudo-annotated speech-to-speech translation samples are collectively referred to as speech-to-speech translation samples below. It should be understood that the number of data elements in a speech-to-speech translation sample depends on whether the speech-to-speech translation model is based on multi-task learning or single-task learning. For example, if the model is based on single-task learning, a speech-to-speech translation sample can be a two-tuple, including: source language speech features and target language speech features.
  • If the speech-to-speech translation model is based on multi-task learning, and the multiple tasks include one main task and two auxiliary tasks, where the main task is the speech-to-speech translation task and the two auxiliary tasks are a speech recognition task and a speech-to-text translation task, then a speech-to-speech translation sample can be a four-tuple, including: source language speech features, source language text, target language speech features and target language text. Here the speech recognition task is used to convert the source language speech features into source language text, and the speech-to-text translation task is used to convert the source language speech features into source language text and the source language text into target language text.
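  • The two sample layouts described above can be sketched as a small data structure. The class name `S2STSample`, its field names, and the `as_tuple` helper are hypothetical and chosen only for illustration; speech features are represented here as lists of per-frame vectors.

```python
from dataclasses import dataclass
from typing import List, Optional

Features = List[List[float]]  # one feature vector per frame


@dataclass
class S2STSample:
    src_speech: Features
    tgt_speech: Features
    src_text: Optional[str] = None  # only needed for multi-task learning
    tgt_text: Optional[str] = None  # only needed for multi-task learning

    def as_tuple(self, multi_task: bool):
        """Return a two-tuple (single-task) or a four-tuple (multi-task)."""
        if not multi_task:
            return (self.src_speech, self.tgt_speech)
        if self.src_text is None or self.tgt_text is None:
            raise ValueError("multi-task samples need both texts")
        return (self.src_speech, self.src_text,
                self.tgt_speech, self.tgt_text)


example = S2STSample(src_speech=[[0.1]], tgt_speech=[[0.2]],
                     src_text="hi", tgt_text="hola")
```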
  • the real speech-to-speech translation samples include: first source language speech features, first source language text, first target language speech features, and first target language text.
  • The first source language speech feature is a real source language speech feature, the first source language text is a real source language text, and the first target language text is also a real target language text. The first target language speech feature is a target language speech feature that the electronic device obtains by synthesizing the first target language text. For example, the electronic device can input the first target language text into a speech synthesis model to obtain the first target language speech features.
  • the speech recognition samples include: second source language speech features and second source language text.
  • The second source language speech feature is a real source language speech feature, and the second source language text is also a real source language text.
  • the so-called real source language speech features refer to the source language speech features that can be obtained in real scenes.
  • an electronic device can collect a user's voice through a microphone and extract the features of the voice.
  • The real source language text can be a text obtained manually. For example, a user can transcribe a speech to form the text corresponding to the speech.
  • The real target language text can also be a text obtained manually. For example, a user translates the content of the source language text into the target language text.
  • the above-mentioned speech recognition samples may be one or more, and the above-mentioned real speech-to-speech translation samples may be one or more.
  • The electronic device can translate the second source language text to obtain the second target language text, and synthesize the second target language text to obtain the second target language speech features. The pseudo-annotated speech-to-speech translation sample may be a four-tuple, including: second source language speech features, second source language text, second target language text, and second target language speech features.
  • the first two items in the pseudo-annotated speech-to-speech translation samples namely the second source language speech features and the second source language text, are both real.
  • the electronic device can input the second source language text into a machine translation (Machine Translation, MT) model to obtain the second target language text.
  • the electronic device can input the text of the second target language into a speech synthesis (Text-To-Speech, TTS) model to obtain the speech features of the second target language.
  • The difference between real speech-to-speech translation samples and pseudo-annotated speech-to-speech translation samples lies mainly in the target language text. For example, a real speech-to-speech translation sample is a four-tuple $\{s_{src}, t_{src}, t_{tgt}, s_{tgt}\}$, where $s_{src}$ represents the real source language speech features, $t_{src}$ represents the real source language text, $t_{tgt}$ represents the real target language text, and $s_{tgt}$ represents the target language speech features obtained by synthesizing $t_{tgt}$.
  • A pseudo-annotated speech-to-speech translation sample is a four-tuple $\{s_{src}, t_{src}, t'_{tgt}, s_{tgt}\}$, where $s_{src}$ represents the real source language speech features, $t_{src}$ represents the real source language text, $t'_{tgt}$ represents the target language text obtained by inputting the real source language text into the MT model, and $s_{tgt}$ represents the target language speech features obtained by synthesizing $t'_{tgt}$.
  • This application may refer to speech-to-speech translation based on such pseudo-annotated speech-to-speech translation samples as Pseudo Translation Label Adaptation (PTLA).
  • In the above, pseudo-annotated speech-to-speech translation samples are obtained on the basis of speech recognition samples.
  • Pseudo-annotated speech-to-speech translation samples can also be constructed based on the source language speech features alone.
  • For example, the electronic device can obtain real source language speech features, input them into an automatic speech recognition (ASR) model to obtain the corresponding source language text, input the source language text into the MT model to obtain the target language text, and input the target language text into the TTS model to obtain the target language speech features. These source language speech features, source language text, target language text and target language speech features constitute a pseudo-annotated speech-to-speech translation sample.
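  • Both construction routes can be sketched as follows, assuming the ASR, MT and TTS models are available as plain callables. The function names and the toy stand-in models are hypothetical; real models would replace `toy_asr`, `toy_mt` and `toy_tts`.

```python
from typing import Callable, List, NamedTuple

Features = List[List[float]]  # one feature vector per frame


class S2STSample(NamedTuple):
    s_src: Features  # source language speech features
    t_src: str       # source language text
    t_tgt: str       # target language text (pseudo label when produced by MT)
    s_tgt: Features  # target language speech features synthesized from t_tgt


def pseudo_from_asr_sample(s_src: Features, t_src: str,
                           mt: Callable[[str], str],
                           tts: Callable[[str], Features]) -> S2STSample:
    """Route 1: start from a speech recognition sample (speech, text)."""
    t_tgt = mt(t_src)
    return S2STSample(s_src, t_src, t_tgt, tts(t_tgt))


def pseudo_from_speech(s_src: Features,
                       asr: Callable[[Features], str],
                       mt: Callable[[str], str],
                       tts: Callable[[str], Features]) -> S2STSample:
    """Route 2: start from speech only and recognize the text with ASR."""
    return pseudo_from_asr_sample(s_src, asr(s_src), mt, tts)


# Toy stand-ins so the sketch runs without any trained model (assumptions).
def toy_asr(features: Features) -> str:
    return "hello"


def toy_mt(text: str) -> str:
    return text.upper()


def toy_tts(text: str) -> Features:
    return [[float(ord(ch))] for ch in text]


a = pseudo_from_asr_sample([[0.1, 0.2]], "hi", toy_mt, toy_tts)
b = pseudo_from_speech([[0.3, 0.4]], toy_asr, toy_mt, toy_tts)
```

  • In route 1 the first two elements are real data; in route 2 only the source speech features are real, because the source text is itself a model output.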
  • The source language speech feature may be a log-mel spectrogram of the source language speech, and the log-mel spectrogram may be an 80-channel log-mel spectrogram, but is not limited to this.
  • The target language speech feature may be a linear frequency spectrogram of the target language speech, but is not limited to this.
  • The training of the speech-to-speech translation model by the electronic device includes a pre-training stage and a fine-tuning stage.
  • Pre-training refers to a model trained in advance, or to the process of training a model in advance.
  • Fine-tuning refers to the process of applying a pre-trained model to the data set of a certain task and adapting the parameters to the data set of the task.
  • In pre-training, the electronic device randomly initializes the parameters, then starts training the network model and continuously adjusts its parameters so that the loss of the network model becomes smaller and smaller until the training stop condition is met.
  • In fine-tuning, the electronic device directly uses the previously trained network model, takes its parameters as the initialization parameters for the task, and then trains the network model and continuously adjusts its parameters so that the loss becomes smaller and smaller until the training stop condition is met.
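  • The only difference between the two stages described above is where the initial parameters come from. A minimal sketch with a one-parameter toy model y = w * x makes this concrete; the model, data, learning rate and stop condition are all assumptions chosen for illustration and bear no relation to the actual speech-to-speech translation model.

```python
import random


def train(samples, w_init, lr=0.05, tol=1e-6, max_steps=10_000):
    """Gradient descent on mean squared error for the toy model y = w * x."""
    w = w_init
    for _ in range(max_steps):
        grad = sum(2 * (w * x - y) * x for x, y in samples) / len(samples)
        if abs(grad) < tol:  # training stop condition
            break
        w -= lr * grad       # adjust the parameter to reduce the loss
    return w


random.seed(0)
pretrain_data = [(x, 2.0 * x) for x in (1.0, 2.0, 3.0)]  # stands in for pseudo samples
finetune_data = [(x, 2.2 * x) for x in (1.0, 2.0)]       # stands in for real samples

# Pre-training: parameters start from a random initialization.
w_pre = train(pretrain_data, w_init=random.uniform(-1.0, 1.0))
# Fine-tuning: parameters start from the pre-trained values.
w_fine = train(finetune_data, w_init=w_pre)
```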
  • the above-mentioned real speech-to-speech translation samples may also be called original speech-to-speech translation samples.
  • Pseudo-annotated speech-to-speech translation samples may also be called derived speech-to-speech translation samples.
  • the real speech-to-speech translation sample can be used in the pre-training stage of the speech-to-speech translation model, and can also be used in the fine-tuning stage of the model.
  • the pseudo-annotated speech-to-speech translation sample can be used in the pre-training stage of the speech-to-speech translation model, or can also be used in the fine-tuning stage of the model. This application does not limit this.
  • Speech recognition samples are relatively easy to collect. Pseudo-annotated speech-to-speech translation samples can therefore be generated from them, expanding the set of speech-to-speech translation samples, which in turn can improve model training accuracy.
  • the above S330 includes:
  • S410 Pre-train the speech-to-speech translation model based on the pseudo-annotated speech-to-speech translation samples
  • S420 Fine-tune the pre-trained speech-to-speech translation model based on real speech-to-speech translation samples.
  • The electronic device can directly fine-tune the pre-trained speech-to-speech translation model with real speech-to-speech translation samples. That is, the electronic device fine-tunes the pre-trained speech-to-speech translation model with real speech-to-speech translation samples only.
  • The electronic device can also fine-tune the pre-trained speech-to-speech translation model based on both real speech-to-speech translation samples and pseudo-annotated speech-to-speech translation samples.
  • A model training method based on this approach can be called a hybrid training method.
  • This hybrid training method preserves as much as possible the gains from the pseudo-annotated speech-to-speech translation samples. Since the scale of speech recognition samples is much larger than the scale of real speech-to-speech translation samples, the scale of pseudo-annotated speech-to-speech translation samples is also much larger than the scale of real speech-to-speech translation samples.
  • Real speech-to-speech translation samples can therefore be upsampled to expand their scale, and the pre-trained speech-to-speech translation model can then be fine-tuned with the upsampled real speech-to-speech translation samples and the pseudo-annotated speech-to-speech translation samples.
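  • A minimal sketch of the upsampling step is given below. The source does not state an upsampling ratio, so matching the number of real samples to the number of pseudo-annotated samples is an assumption, as are the function names `upsample` and `build_hybrid_set`.

```python
import math
import random


def upsample(real, target_size):
    """Repeat the real samples in order until target_size items exist."""
    if not real:
        raise ValueError("need at least one real sample")
    repeats = math.ceil(target_size / len(real))
    return (list(real) * repeats)[:target_size]


def build_hybrid_set(real, pseudo, seed=0):
    """Mix upsampled real samples with pseudo-annotated ones and shuffle."""
    mixed = upsample(real, len(pseudo)) + list(pseudo)
    random.Random(seed).shuffle(mixed)
    return mixed


real = ["r1", "r2"]
pseudo = ["p1", "p2", "p3", "p4", "p5"]
mixed = build_hybrid_set(real, pseudo)
```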
  • The electronic device can also annotate the real speech-to-speech translation samples with a first label, which identifies a sample as a real sample and can be represented by real, and annotate the pseudo-annotated speech-to-speech translation samples with a second label, which identifies a sample as a pseudo-annotated sample and can be represented by pseudo.
  • A model training method based on this approach can be called a prompt training method. With this prompt training method, the model can better distinguish between real speech-to-speech translation samples and pseudo-annotated speech-to-speech translation samples.
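  • The labeling step can be sketched as below. The label values real and pseudo come from the text; storing the label in a `"label"` field of a dictionary is an assumption, since the source does not specify how the label is attached to a sample or fed to the model.

```python
REAL_LABEL = "real"
PSEUDO_LABEL = "pseudo"


def label_sample(sample: dict, is_pseudo: bool) -> dict:
    """Return a copy of the sample annotated with its origin label."""
    labeled = dict(sample)  # copy so the original sample is not modified
    labeled["label"] = PSEUDO_LABEL if is_pseudo else REAL_LABEL
    return labeled


real_sample = {"t_src": "hi", "t_tgt": "hola"}
pseudo_sample = {"t_src": "bye", "t_tgt": "adios"}
training_set = [label_sample(real_sample, is_pseudo=False),
                label_sample(pseudo_sample, is_pseudo=True)]
```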
  • The electronic device can pre-train a speech-to-speech translation model based on pseudo-annotated speech-to-speech translation samples and fine-tune the pre-trained model based on real speech-to-speech translation samples. That is, the pseudo-annotated data is mainly used in the pre-training process. In this way, pseudo-annotated speech-to-speech translation samples can be prevented from misleading the model optimization results.
  • Real speech-to-speech translation samples can be upsampled, and the pre-trained speech-to-speech translation model can then be fine-tuned with the upsampled real speech-to-speech translation samples and the pseudo-annotated speech-to-speech translation samples.
  • On the one hand, this method addresses the low model training accuracy caused by the shortage of real speech-to-speech translation samples. On the other hand, it can prevent pseudo-annotated speech-to-speech translation samples from misleading the model optimization results.
  • The electronic device can also annotate the real speech-to-speech translation samples and the pseudo-annotated speech-to-speech translation samples with corresponding labels, which allows the model to better distinguish between real speech-to-speech translation samples and pseudo-annotated speech-to-speech translation samples.
  • The speech-to-speech translation model may be an existing Translatotron model or the speech-to-speech translation model shown in Figure 5, which is not limited in this application.
  • Figure 5 is a schematic diagram of a speech-to-speech translation model provided by an embodiment of the present application.
  • the model includes: an encoder module 510, a first attention module 520, a first decoder module 530, N
  • the second attention module 540 and the N second decoder modules 550, N is a positive integer
  • the N second attention modules correspond to the N second decoder modules one-to-one.
  • the model can be a speech-to-speech translation model based on multi-task learning, and the multi-task includes: a main task and N auxiliary tasks.
  • the main task is a speech-to-speech translation task, where the above-mentioned second attention
  • the number of modules and second decoder modules is consistent with the number of auxiliary tasks.
  • For example, when there are two auxiliary tasks (N = 2), they can be a speech recognition task and a speech-to-text translation task respectively, but are not limited to this.
  • The speech recognition task is used to convert source language speech features into source language text. The speech-to-text translation task is used to convert the source language speech features into source language text and then convert the source language text into target language text.
  • For example, N = 1, that is, there is one auxiliary task.
  • the auxiliary task can be a speech recognition task or a speech-to-text translation task, but is not limited to this.
  • The first attention module 520 and the first decoder module 530 correspond to the main task, and each set of a second attention module 540 and a second decoder module 550 corresponds to one auxiliary task.
  • the first decoder module 530 is mainly used to predict and synthesize speech features of the target language.
  • The auxiliary tasks take the output of the encoder module 510 as input, and their predicted loss values are added to the loss of the main task in the form of a weighted sum.
  • In the inference (testing) stage, the second decoder module 550 is not used.
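  • The weighted-sum combination of the main-task loss and the auxiliary-task losses can be sketched as follows. This is a minimal illustration only: the function name `total_loss`, the weight values and the loss values are hypothetical, since the application does not specify them.

```python
def total_loss(main_loss, aux_losses, aux_weights):
    """Main-task loss plus a weighted sum of the auxiliary-task losses."""
    if len(aux_losses) != len(aux_weights):
        raise ValueError("one weight per auxiliary task is required")
    return main_loss + sum(w * l for w, l in zip(aux_weights, aux_losses))


# Hypothetical values: speech-to-speech loss, then the losses of the
# speech recognition and speech-to-text translation auxiliary tasks.
loss = total_loss(2.0, [1.0, 3.0], [0.5, 0.5])
print(loss)
```

  • Only the training objective uses the auxiliary terms, so a model trained this way can drop the auxiliary decoders without changing the main decoding path.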
  • the encoder module 510 is used to obtain the source language speech features and process the source language speech features to obtain multiple sets of first hidden state representations corresponding to the source language speech features.
  • Figure 6 is a schematic diagram of another speech-to-speech translation model provided by the embodiment of the present application.
  • The encoder module 510 includes: a convolutional neural network sub-module 5101 and a first converter (Transformer) module 5102;
  • the convolutional neural network sub-module 5101 is used to obtain the source language speech features and process the source language speech features to obtain the second hidden state representation corresponding to the source language speech features;
  • the first converter module 5102 is used to obtain the second hidden state representation and process the second hidden state representation to obtain multiple sets of first hidden state representations.
  • the convolutional neural network sub-module 5101 may include two convolutional neural network layers, but is not limited thereto.
  • The two convolutional neural network layers can reduce the length of the 80-channel log-mel spectrogram to one quarter of the original. That is, assuming that the 80-channel log-mel spectrogram is represented by 100 vectors, each of which is 80-dimensional, then 25 vectors are obtained after processing by the two convolutional neural network layers.
  • The dimension of these vectors is kept consistent with the number of hidden units in the first converter module 5102. For example, if the number of hidden units is 512, then each of the 25 vectors output by the two convolutional neural network layers is also 512-dimensional.
  • These 25 512-dimensional vectors can be understood as the second hidden state representation mentioned above.
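  • The 100-to-25 length reduction can be checked with the standard convolution output-length formula. The sketch below assumes each of the two layers uses kernel size 3, stride 2 and padding 1; these values are illustrative assumptions, since the application only states that the length is reduced to one quarter.

```python
def conv_out_len(length, kernel=3, stride=2, padding=1):
    """Output length of a 1-D convolution along the time axis."""
    return (length + 2 * padding - kernel) // stride + 1


frames = 100                 # 100 input vectors, each 80-dimensional
for _ in range(2):           # two convolutional layers (assumed stride 2 each)
    frames = conv_out_len(frames)
print(frames)                # 25
```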
  • The first converter module 5102 may be similar to the encoder structure shown in Figure 1; that is, the first converter module 5102 may include 6 converter (Transformer) layers, or may include 12 converter layers.
  • Each converter layer can have 512 hidden units; that is, the hidden representation output by each converter layer can be 512-dimensional.
  • Each converter layer can consist of two subunits.
  • The first is a self-attention network using a multi-head self-attention mechanism, for example an 8-head self-attention mechanism.
  • The second is a fully connected feed-forward network, which can use 2048-dimensional internal states. Both subunits use residual connections and layer normalization.
  • Assuming that the convolutional neural network sub-module 5101 outputs 25 512-dimensional vectors, then after processing by the first converter module 5102, N groups of first hidden state representations can be obtained, and each group of first hidden state representations is also 25 512-dimensional vectors.
  • The 25 512-dimensional vectors obtained by the last layer of the first converter module 5102 can be output to the first attention module 520, and the 25 512-dimensional vectors obtained by a middle layer may be output to the second attention module 540.
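  • The encoder hyperparameters and the layer taps described above can be collected in a small configuration object. This is a sketch under stated assumptions: the names `EncoderConfig` and `tap_layers` are hypothetical, and taking the middle layer as `num_layers // 2` is one possible reading of "middle layer", not something the application fixes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EncoderConfig:
    num_layers: int = 6      # the text also mentions 12 as an option
    hidden_dim: int = 512    # hidden units per converter layer
    num_heads: int = 8       # heads in the multi-head self-attention
    ffn_dim: int = 2048      # internal size of the feed-forward network


def tap_layers(cfg, num_aux_tasks):
    """Return 1-based layer indices feeding the main and auxiliary attention."""
    main_tap = cfg.num_layers                          # last layer -> module 520
    aux_taps = [cfg.num_layers // 2] * num_aux_tasks   # assumed middle layer -> 540
    return main_tap, aux_taps


cfg = EncoderConfig()
main_tap, aux_taps = tap_layers(cfg, num_aux_tasks=2)
head_dim = cfg.hidden_dim // cfg.num_heads
print(main_tap, aux_taps, head_dim)
```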
  • The first attention module 520 is used to obtain one set of first hidden state representations among the multiple sets of first hidden state representations and the first vector corresponding to each time step output by the first decoder module, and to process this set of first hidden state representations and the first vector corresponding to each time step to obtain the first attention representation corresponding to each time step.
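  • How an attention module might combine one decoder vector with the encoder's hidden state vectors can be sketched as below. The application does not state the attention formula, so scaled dot-product attention is an assumption here, and the function name `attend` is hypothetical.

```python
import math


def attend(query, keys):
    """Scaled dot-product attention of one query over a list of key vectors.

    The keys double as values, so the result is a weighted average of them.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    top = max(scores)                          # subtract max for stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(keys[0])
    context = [sum(w * key[d] for w, key in zip(weights, keys)) for d in range(dim)]
    return weights, context


# Toy sizes; in the text the keys would be 25 vectors of 512 dimensions.
weights, context = attend([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
print(weights, context)
```

  • With an all-zero query, as in the first time step, every encoder vector receives the same weight, so the attention representation is their plain average.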
  • The first decoder module 530 is used to obtain the second vector corresponding to each time step, process the second vector corresponding to each time step to obtain the first vector corresponding to each time step, output the first vector corresponding to each time step to the first attention module 520, obtain the first attention representation corresponding to each time step, and process the first attention representation corresponding to each time step to obtain the target language speech features corresponding to the source language speech features.
  • The first decoder module 530 includes: a pre-processing network (prenet) 5301, a second converter module 5302 and a post-processing network (postnet) 5303. The pre-processing network 5301 is used to obtain the second vector corresponding to each time step, process the second vector corresponding to each time step to obtain the first vector corresponding to each time step, and output the first vector corresponding to each time step to the first attention module. The second converter module 5302 is used to obtain the first attention representation corresponding to each time step, and process the first attention representation corresponding to each time step to obtain the target language speech features at each time step. The post-processing network 5303 is used to process the target language speech features at each time step to obtain the target language speech features corresponding to the source language speech features.
  • the bottleneck dimension of the pre-processing network 5301 may be 32.
  • In the training stage, for example, the pre-processing network 5301 can obtain an 80-dimensional all-zero vector, which is the second vector corresponding to the first time step.
  • The pre-processing network 5301 can process the all-zero vector to obtain a 512-dimensional all-zero vector, which is the first vector corresponding to the first time step.
  • Further, the pre-processing network 5301 can input this all-zero vector to the first attention module 520, and the first attention module 520 can process the all-zero vector and the 25 512-dimensional vectors obtained from the encoder module 510 to obtain the first attention representation corresponding to the first time step.
  • The first attention module 520 then inputs the first attention representation corresponding to the first time step to the second converter module 5302, and the second converter module 5302 can process it to obtain the target language speech features at the first time step, which are the predicted target language speech features at the first time step.
  • The pre-processing network 5301 can also obtain the actual target language speech features at the first time step, which can be understood as the second vector corresponding to the second time step.
  • The pre-processing network 5301 can process the second vector corresponding to the second time step to obtain a 512-dimensional vector, which is the first vector corresponding to the second time step.
  • Further, the pre-processing network 5301 can input the first vector corresponding to the second time step to the first attention module 520, and the first attention module 520 can process the first vector corresponding to the second time step and the 25 512-dimensional vectors obtained from the encoder module 510 to obtain the first attention representation corresponding to the second time step.
  • The first attention module 520 then inputs the first attention representation corresponding to the second time step to the second converter module 5302, and the second converter module 5302 can process it to obtain the target language speech features at the second time step, which are the predicted target language speech features at the second time step.
  • In general, the actual target language speech features at the i-th time step can be understood as the second vector corresponding to the (i+1)-th time step. The pre-processing network 5301 can process the second vector corresponding to the (i+1)-th time step to obtain a 512-dimensional vector, which is the first vector corresponding to the (i+1)-th time step.
  • Further, the pre-processing network 5301 can input the first vector corresponding to the (i+1)-th time step to the first attention module 520, and the first attention module 520 can process it together with the 25 512-dimensional vectors obtained from the encoder module 510 to obtain the first attention representation corresponding to the (i+1)-th time step. The first attention module 520 then inputs this first attention representation to the second converter module 5302, and the second converter module 5302 can process it to obtain the target language speech features at the (i+1)-th time step.
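  • The procedure above, in which the actual features at step i become the decoder input at step i+1, is commonly called teacher forcing. A minimal sketch of how the decoder inputs could be built from the ground-truth frames follows; the helper name `teacher_forcing_inputs` is hypothetical.

```python
def teacher_forcing_inputs(target_frames, feat_dim=80):
    """Shift the ground-truth frames right by one step.

    Step 1 receives an all-zero frame; step i+1 receives the actual
    target frame of step i.
    """
    go_frame = [0.0] * feat_dim
    return [go_frame] + [list(frame) for frame in target_frames[:-1]]


# Three toy 80-dimensional target frames.
targets = [[1.0] * 80, [2.0] * 80, [3.0] * 80]
inputs = teacher_forcing_inputs(targets)
print(len(inputs), inputs[0][0], inputs[1][0], inputs[2][0])
```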
  • In the inference stage, for example, the pre-processing network 5301 can obtain an 80-dimensional all-zero vector, which is the second vector corresponding to the first time step.
  • The pre-processing network 5301 can process the all-zero vector to obtain a 512-dimensional all-zero vector, which is the first vector corresponding to the first time step.
  • Further, the pre-processing network 5301 can input this all-zero vector to the first attention module 520, and the first attention module 520 can process the all-zero vector and the 25 512-dimensional vectors obtained from the encoder module 510 to obtain the first attention representation corresponding to the first time step.
  • The first attention module 520 then inputs the first attention representation corresponding to the first time step to the second converter module 5302, and the second converter module 5302 can process it to obtain the target language speech features at the first time step, which are the predicted target language speech features at the first time step.
  • The pre-processing network 5301 can process the predicted target language speech features at the first time step, which can be understood as the second vector corresponding to the second time step, to obtain a 512-dimensional vector. This 512-dimensional vector is the first vector corresponding to the second time step.
  • Further, the pre-processing network 5301 can input the first vector corresponding to the second time step to the first attention module 520, and the first attention module 520 can process the first vector corresponding to the second time step and the 25 512-dimensional vectors obtained from the encoder module 510 to obtain the first attention representation corresponding to the second time step.
  • The first attention module 520 then inputs the first attention representation corresponding to the second time step to the second converter module 5302, and the second converter module 5302 can process it to obtain the target language speech features at the second time step.
  • In general, the predicted target language speech features at the i-th time step can be understood as the second vector corresponding to the (i+1)-th time step. The pre-processing network 5301 can process the second vector corresponding to the (i+1)-th time step to obtain a 512-dimensional vector, which is the first vector corresponding to the (i+1)-th time step.
  • Further, the pre-processing network 5301 can input the first vector corresponding to the (i+1)-th time step to the first attention module 520, and the first attention module 520 can process it together with the 25 512-dimensional vectors obtained from the encoder module 510 to obtain the first attention representation corresponding to the (i+1)-th time step. The first attention module 520 then inputs this first attention representation to the second converter module 5302, and the second converter module 5302 can process it to obtain the target language speech features at the (i+1)-th time step.
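  • When the predicted features of step i are fed back as the input of step i+1, decoding is autoregressive. The loop can be sketched as below. Here `step_fn` is a hypothetical stand-in for the chain of pre-processing network, first attention module and second converter module, and the fixed `num_steps` is an assumption because the application does not describe a stopping rule.

```python
def autoregressive_decode(step_fn, num_steps, feat_dim=80):
    """Feed each predicted frame back as the next decoder input."""
    prev = [0.0] * feat_dim          # all-zero vector for the first time step
    outputs = []
    for _ in range(num_steps):
        pred = step_fn(prev)         # predicted features at this time step
        outputs.append(pred)
        prev = pred                  # becomes the input of the next step
    return outputs


def toy_step(prev_frame):
    """Toy stand-in for the decoder: adds 1.0 to every feature."""
    return [x + 1.0 for x in prev_frame]


frames = autoregressive_decode(toy_step, num_steps=3)
print([f[0] for f in frames])
```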
  • The second converter module 5302 can input the target language speech features at each time step to the post-processing network 5303, and the post-processing network 5303 can perform a weighted summation of the target language speech features at each time step to obtain the target language speech features corresponding to the source language speech features.
  • The second converter module 5302 may be similar to the decoder structure shown in Figure 1; that is, the second converter module 5302 may include 6 converter layers. This application does not limit this.
  • The second converter module 5302 may employ the same hyperparameters as the first converter module 5102.
  • The second attention module 540 is used to obtain one set of first hidden state representations among the multiple sets of first hidden state representations and the third vector corresponding to each time step output by the second decoder module corresponding to the second attention module 540, and to process this set of first hidden state representations and the third vector corresponding to each time step to obtain the second attention representation corresponding to each time step;
  • the second decoder module 550 corresponding to the second attention module 540 is used to obtain the fourth vector corresponding to each time step, process the fourth vector corresponding to each time step to obtain the third vector corresponding to each time step, output the third vector corresponding to each time step to the second attention module 540, obtain the second attention representation corresponding to each time step, and process the second attention representation corresponding to each time step to obtain the auxiliary representation corresponding to the source language speech features.
  • The second decoder module 550 may include: a pre-processing network, a third converter module and a post-processing network. The pre-processing network is used to obtain the fourth vector corresponding to each time step, process the fourth vector corresponding to each time step to obtain the third vector corresponding to each time step, and output the third vector corresponding to each time step to the second attention module 540. The third converter module is used to obtain the second attention representation corresponding to each time step, and process the second attention representation corresponding to each time step to obtain the auxiliary representation at each time step. The post-processing network is used to process the auxiliary representation at each time step to obtain the auxiliary representation corresponding to the source language speech features.
  • the bottleneck dimension of the pre-processing network may be 32.
  • The pre-processing network can obtain an 80-dimensional embedding vector, which is the fourth vector corresponding to the first time step.
  • The pre-processing network can process this vector to obtain a 512-dimensional vector, which is the third vector corresponding to the first time step.
  • Further, the pre-processing network can input this vector to the second attention module 540, and the second attention module 540 can process this vector and the 25 512-dimensional vectors obtained from the encoder module 510 to obtain the second attention representation corresponding to the first time step.
  • The second attention module 540 then inputs the second attention representation corresponding to the first time step to the third converter module.
  • The third converter module can process the second attention representation corresponding to the first time step to obtain the auxiliary representation at the first time step, which is the predicted auxiliary representation at the first time step.
  • The pre-processing network can also obtain the actual auxiliary representation at the first time step, which can be understood as the fourth vector corresponding to the second time step.
  • The pre-processing network can process the fourth vector corresponding to the second time step to obtain a 512-dimensional vector, which is the third vector corresponding to the second time step.
  • Further, the pre-processing network can input the third vector corresponding to the second time step to the second attention module 540, and the second attention module 540 can process the third vector corresponding to the second time step and the 25 512-dimensional vectors obtained from the encoder module 510 to obtain the second attention representation corresponding to the second time step.
  • The second attention module 540 then inputs the second attention representation corresponding to the second time step to the third converter module, and the third converter module can process it to obtain the auxiliary representation at the second time step, which is the predicted auxiliary representation at the second time step.
  • In general, the actual auxiliary representation at the i-th time step can be understood as the fourth vector corresponding to the (i+1)-th time step.
  • The pre-processing network can process the fourth vector corresponding to the (i+1)-th time step to obtain a 512-dimensional vector, which is the third vector corresponding to the (i+1)-th time step.
  • Further, the pre-processing network can input the third vector corresponding to the (i+1)-th time step to the second attention module 540, and the second attention module 540 can process the third vector corresponding to the (i+1)-th time step and the 25 512-dimensional vectors obtained from the encoder module 510 to obtain the second attention representation corresponding to the (i+1)-th time step.
  • The second attention module 540 then inputs the second attention representation corresponding to the (i+1)-th time step to the third converter module, and the third converter module can process it to obtain the auxiliary representation at the (i+1)-th time step.
  • The third converter module can input the auxiliary representation at each time step to the post-processing network, and the post-processing network can perform a weighted summation of the auxiliary representation at each time step to obtain the auxiliary representation corresponding to the source language speech features.
  • the above-mentioned auxiliary representation may be a speech recognition result, such as a source language text corresponding to the source language speech.
  • the above-mentioned auxiliary representation may be a speech translation result, such as a target language text.
  • The speech-to-speech translation model provided by this application is an improvement on the existing Translatotron model, specifically by replacing the long short-term memory (LSTM) network in the Translatotron model with a converter (Transformer) module.
  • Accordingly, the speech-to-speech translation model can be called a Transformer-based Translatotron.
  • In an LSTM, the calculation at each time step is a local calculation, whereas in the converter module the calculation at each time step is a global calculation, which can improve the model accuracy.
  • The Transformer-based Translatotron provided in this application can be trained without pseudo-annotated speech-to-speech translation samples, and the speech-to-speech translation model trained in this way can be called the baseline system.
  • The Transformer-based Translatotron provided in this application can also be trained with pseudo-annotated speech-to-speech translation samples, and the speech-to-speech translation model trained in this way can be called the baseline system + PTLA.
  • This application uses the TEDEn2Zh data set (English to Chinese), which is commonly used in speech translation, to test the performance of the baseline system and of the baseline system + PTLA, as shown in Table 1:
  • S-PER represents the phoneme error rate of the speech recognition task on the test set.
  • Tp-BLEU represents the phoneme-based Bilingual Evaluation Understudy (BLEU) score of the speech-to-text translation task on the test set.
  • Dev-BLEU represents the phoneme-based BLEU score of the main task on the development set.
  • test-BLEU represents the BLEU based on the phoneme calculation of the main task on the test set.
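  • As an illustration of a phoneme-level metric such as S-PER, a phoneme error rate is commonly computed as the edit distance between the reference and predicted phoneme sequences divided by the reference length. The application does not state how its scores were computed, so the sketch below is only this common definition, and the phoneme sequences are made up.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def phoneme_error_rate(ref, hyp):
    """Edit distance normalised by the reference length."""
    return edit_distance(ref, hyp) / len(ref)


reference = ["n", "i", "h", "a", "o"]   # hypothetical phoneme sequence
predicted = ["n", "i", "a", "o"]        # one phoneme dropped
print(phoneme_error_rate(reference, predicted))
```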
  • the baseline system can achieve good performance in complex language-to-language translation, and the baseline system + PTLA solution can effectively improve model performance.
  • a speech-to-speech translation method is provided below:
  • Figure 7 is a flow chart of a speech-to-speech translation method provided by an embodiment of the present application.
  • This method can be executed by any electronic device such as a mobile phone, tablet computer, notebook computer, handheld computer, MID or desktop computer, for example the execution device in Figure 2, which is not limited in this application.
  • the method includes:
  • S720 Input the source language speech features into the speech-to-speech translation model to obtain the target language speech features corresponding to the source language speech features.
  • the speech-to-speech translation model can be trained by the above-mentioned model training method. Since the speech-to-speech translation model obtained by the above-mentioned training method has higher accuracy, based on this, speech-to-speech translation can be better realized.
  • Figure 8 is a schematic diagram of a model training device 800 provided by an embodiment of the present application.
  • The device 800 includes: an acquisition module 810, a generation module 820 and a training module 830. The acquisition module 810 is used to acquire speech recognition samples and real speech-to-speech translation samples; the generation module 820 is used to generate pseudo-annotated speech-to-speech translation samples according to the speech recognition samples; the training module 830 is used to train the speech-to-speech translation model according to the pseudo-annotated speech-to-speech translation samples and the real speech-to-speech translation samples.
  • the training module 830 is specifically configured to: pre-train a speech-to-speech translation model based on pseudo-annotated speech-to-speech translation samples, and fine-tune the pre-trained speech-to-speech translation model based on real speech-to-speech translation samples.
  • The training module 830 is specifically configured to: fine-tune the pre-trained speech-to-speech translation model with the real speech-to-speech translation samples; or fine-tune the pre-trained speech-to-speech translation model with both the real speech-to-speech translation samples and the pseudo-annotated speech-to-speech translation samples.
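  • The two-stage schedule (pre-train on pseudo-annotated samples, then fine-tune on real samples alone or on real plus pseudo-annotated samples) can be sketched as a plan of stages. The function name `training_stages` and the string sample placeholders are hypothetical, and the actual optimisation loop is omitted.

```python
def training_stages(real_samples, pseudo_samples, mix_pseudo_in_finetune=False):
    """Return the data used in each training stage, in order."""
    pretrain = list(pseudo_samples)            # stage 1: pseudo-annotated only
    finetune = list(real_samples)              # stage 2: real samples ...
    if mix_pseudo_in_finetune:
        finetune += list(pseudo_samples)       # ... optionally plus pseudo samples
    return [("pretrain", pretrain), ("finetune", finetune)]


stages = training_stages(["r1"], ["p1", "p2"], mix_pseudo_in_finetune=True)
for name, data in stages:
    print(name, data)
```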
  • The device 800 further includes: a labeling module 840, configured to, before the pre-trained speech-to-speech translation model is fine-tuned according to the real speech-to-speech translation samples and the pseudo-annotated speech-to-speech translation samples, label the real speech-to-speech translation samples with a first label, which identifies them as real samples, and label the pseudo-annotated speech-to-speech translation samples with a second label, which identifies them as pseudo-annotated samples.
  • The training module 830 is specifically used to: upsample the real speech-to-speech translation samples to obtain upsampled speech-to-speech translation samples; and fine-tune the pre-trained speech-to-speech translation model with the upsampled speech-to-speech translation samples and the pseudo-annotated speech-to-speech translation samples.
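  • Labeling and upsampling can be combined when the fine-tuning set is built. The sketch below is illustrative: the label values, the dictionary layout and the rule of repeating real samples until they roughly match the pseudo-annotated count are all assumptions, since the application does not fix an upsampling ratio.

```python
import math

REAL, PSEUDO = "real", "pseudo"   # hypothetical first and second labels


def build_finetune_set(real_samples, pseudo_samples):
    """Tag every sample with its origin and upsample the real ones."""
    real = [{"label": REAL, **s} for s in real_samples]
    pseudo = [{"label": PSEUDO, **s} for s in pseudo_samples]
    # Assumed rule: repeat real samples until they at least match the pseudo count.
    factor = max(1, math.ceil(len(pseudo) / max(1, len(real))))
    return real * factor + pseudo


real = [{"id": "r1"}, {"id": "r2"}]
pseudo = [{"id": f"p{i}"} for i in range(5)]
mixed = build_finetune_set(real, pseudo)
print(len(mixed), sum(s["label"] == REAL for s in mixed))
```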
  • real speech-to-speech translation samples include: first source language speech features, first source language text, first target language speech features, and first target language text; speech recognition samples include: second source language speech features and second source language text.
  • The generation module 820 is specifically configured to: translate the second source language text to obtain the second target language text; and synthesize speech from the second target language text to obtain the second target language speech features; wherein the pseudo-annotated speech-to-speech translation samples include: the second source language speech features, the second source language text, the second target language text, and the second target language speech features.
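  • Generating a pseudo-annotated sample from a speech recognition sample amounts to a translate-then-synthesize pipeline. In the sketch below, `translate` and `synthesize` are hypothetical callables standing in for a machine translation system and a speech synthesis system, which the application does not name; the toy stand-ins exist only so the example runs.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class S2STSample:
    src_speech: List[List[float]]   # source language speech features
    src_text: str                   # source language text
    tgt_text: str                   # target language text
    tgt_speech: List[List[float]]   # target language speech features
    is_pseudo: bool


def make_pseudo_sample(src_speech, src_text,
                       translate: Callable[[str], str],
                       synthesize: Callable[[str], List[List[float]]]):
    """Build a pseudo-annotated sample from one speech recognition sample."""
    tgt_text = translate(src_text)          # text-to-text translation
    tgt_speech = synthesize(tgt_text)       # text-to-speech synthesis
    return S2STSample(src_speech, src_text, tgt_text, tgt_speech, True)


def toy_tts(text):
    """Toy stand-in: one single-value frame per character."""
    return [[float(ord(ch))] for ch in text]


sample = make_pseudo_sample([[0.0]], "hi", str.upper, toy_tts)
print(sample.tgt_text, sample.tgt_speech)
```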
  • The speech-to-speech translation model includes: an encoder module, a first attention module, a first decoder module, N second attention modules, and N second decoder modules, where N is a positive integer and the N second attention modules correspond one-to-one to the N second decoder modules;
  • the encoder module is used to obtain the speech features of the source language and process the speech features of the source language to obtain multiple sets of first hidden state representations corresponding to the speech features of the source language;
  • the first attention module is used to obtain one set of first hidden state representations among the multiple sets of first hidden state representations and the first vector corresponding to each time step output by the first decoder module, and to process this set of first hidden state representations and the first vector corresponding to each time step to obtain the first attention representation corresponding to each time step;
  • the first decoder module is used to obtain the second vector corresponding to each time step, process the second vector corresponding to each time step to obtain the first vector corresponding to each time step, output the first vector corresponding to each time step to the first attention module, obtain the first attention representation corresponding to each time step, and process the first attention representation corresponding to each time step to obtain the target language speech features corresponding to the source language speech features;
  • the second attention module is used to obtain one set of first hidden state representations among the multiple sets of first hidden state representations and the third vector corresponding to each time step output by the second decoder module corresponding to the second attention module, and to process this set of first hidden state representations and the third vector corresponding to each time step to obtain the second attention representation corresponding to each time step;
  • the second decoder module corresponding to the second attention module is used to obtain the fourth vector corresponding to each time step, process the fourth vector corresponding to each time step to obtain the third vector corresponding to each time step, output the third vector corresponding to each time step to the second attention module, obtain the second attention representation corresponding to each time step, and process the second attention representation corresponding to each time step to obtain an auxiliary representation corresponding to the source language speech features.
  • The encoder module includes: a convolutional neural network sub-module and a first converter module; the convolutional neural network sub-module is used to obtain the source language speech features and process the source language speech features to obtain the second hidden state representation corresponding to the source language speech features; the first converter module is used to obtain the second hidden state representation and process the second hidden state representation to obtain multiple sets of first hidden state representations.
  • The first decoder module includes: a pre-processing network, a second converter module and a post-processing network; the pre-processing network is used to obtain the second vector corresponding to each time step, process the second vector corresponding to each time step to obtain the first vector corresponding to each time step, and output the first vector corresponding to each time step to the first attention module; the second converter module is used to obtain the first attention representation corresponding to each time step, and process the first attention representation corresponding to each time step to obtain the target language speech features at each time step; the post-processing network is used to process the target language speech features at each time step to obtain the target language speech features corresponding to the source language speech features.
  • the device embodiments and the method embodiments may correspond to each other, and similar descriptions may refer to the method embodiments. To avoid repetition, they will not be repeated here.
  • The device 800 shown in Figure 8 can execute the above-mentioned model training method embodiment, and the foregoing and other operations and/or functions of each module in the device 800 respectively implement the corresponding processes in the above-mentioned model training method. For brevity, they are not repeated here.
  • The device 800 in the embodiment of the present application is described above from the perspective of functional modules in conjunction with the accompanying drawings. It should be understood that the functional modules can be implemented in the form of hardware, through instructions in the form of software, or through a combination of hardware and software modules. Specifically, each step of the method embodiments in the embodiments of the present application can be completed by integrated logic circuits of hardware in the processor and/or by instructions in the form of software. The steps of the methods disclosed in conjunction with the embodiments of the present application can be directly embodied as being executed by a hardware decoding processor, or executed by a combination of hardware and software modules in the decoding processor.
  • the software module may be located in a mature storage medium in the field such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, register, etc.
  • the storage medium is located in the memory, and the processor reads the information in the memory and completes the steps in the above model training method embodiment in combination with its hardware.
  • Figure 9 is a schematic diagram of a speech-to-speech translation device 900 provided by an embodiment of the present application.
  • the device 900 includes: an acquisition module 910 and a processing module 920, where the acquisition module 910 is used to acquire source language speech features; the processing module 920 is used to input the source language speech features into the speech-to-speech translation model trained by the above model training method to obtain target language speech features corresponding to the source language speech features.
  • the device embodiments and the method embodiments may correspond to each other, and similar descriptions may refer to the method embodiments. To avoid repetition, they will not be repeated here.
  • the device 900 shown in Figure 9 can execute the above embodiment of the speech-to-speech translation method, and the foregoing and other operations and/or functions of each module in the device 900 respectively implement the corresponding processes in the above-mentioned speech-to-speech translation method. For the sake of brevity, they are not described again here.
  • the device 900 in the embodiment of the present application is described above from the perspective of functional modules in conjunction with the accompanying drawings. It should be understood that a functional module can be implemented in the form of hardware, through instructions in the form of software, or through a combination of hardware and software modules. Specifically, each step of the method embodiments in the embodiments of the present application can be completed by integrated logic circuits of hardware in the processor and/or instructions in the form of software. The steps of the methods disclosed in conjunction with the embodiments of the present application can be directly embodied as being executed by a hardware decoding processor, or executed by a combination of hardware and software modules in the decoding processor.
  • the software module may be located in a mature storage medium in the field such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, register, etc.
  • the storage medium is located in the memory, and the processor reads the information in the memory and completes the steps in the above speech-to-speech translation method embodiment in combination with its hardware.
  • FIG. 10 is a schematic block diagram of an electronic device 1000 provided by an embodiment of the present application.
  • the electronic device 1000 may include:
  • a memory 1010 and a processor 1020, where the memory 1010 is used to store a computer program and transmit the program code to the processor 1020.
  • the processor 1020 can call and run the computer program from the memory 1010 to implement the method in the embodiment of the present application.
  • the processor 1020 may be configured to execute the above method embodiments according to instructions in the computer program.
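  • the processor 1020 may include but is not limited to:
  • a general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, and so on.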
  • the memory 1010 includes but is not limited to:
  • volatile memory and/or non-volatile memory. Non-volatile memory can be read-only memory (Read-Only Memory, ROM), programmable read-only memory (Programmable ROM, PROM), erasable programmable read-only memory (Erasable PROM, EPROM), electrically erasable programmable read-only memory (Electrically EPROM, EEPROM) or flash memory. Volatile memory may be Random Access Memory (RAM), which is used as an external cache. By way of example and not limitation, many forms of RAM are available, such as:
  • static random access memory (Static RAM, SRAM)
  • dynamic random access memory (Dynamic RAM, DRAM)
  • synchronous dynamic random access memory (Synchronous DRAM, SDRAM)
  • double data rate synchronous dynamic random access memory (Double Data Rate SDRAM, DDR SDRAM)
  • enhanced synchronous dynamic random access memory (Enhanced SDRAM, ESDRAM)
  • synchronous link dynamic random access memory (synch link DRAM, SLDRAM)
  • direct Rambus random access memory (Direct Rambus RAM, DR RAM)
  • the computer program can be divided into one or more modules, and the one or more modules are stored in the memory 1010 and executed by the processor 1020 to complete the method provided by this application.
  • the one or more modules may be a series of computer program instruction segments capable of completing specific functions. The instruction segments are used to describe the execution process of the computer program in the electronic device.
  • the electronic device may also include:
  • Transceiver 1030 which may be connected to the processor 1020 or the memory 1010.
  • the processor 1020 can control the transceiver 1030 to communicate with other devices. Specifically, it can send information or data to other devices, or receive information or data sent by other devices.
  • Transceiver 1030 may include a transmitter and a receiver.
  • the transceiver 1030 may further include an antenna, and the number of antennas may be one or more.
  • it should be understood that the components of the electronic device are connected through a bus system, where in addition to the data bus, the bus system also includes a power bus, a control bus and a status signal bus.
  • This application also provides a computer storage medium on which a computer program is stored.
  • when the computer program is executed by a computer, the computer can perform the method of the above method embodiments.
  • embodiments of the present application also provide a computer program product containing instructions, which when executed by a computer causes the computer to perform the method of the above method embodiments.
  • the computer program product includes one or more computer instructions.
  • the computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable device.
  • the computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server or data center over a wired connection (such as coaxial cable, optical fiber, digital subscriber line (DSL)) or wirelessly (such as infrared, radio, microwave, etc.).
  • the computer-readable storage medium can be any available medium that can be accessed by a computer or a data storage device such as a server or data center integrated with one or more available media.
  • the available media may be magnetic media (such as floppy disks, hard disks, magnetic tapes), optical media (such as digital video discs (DVD)), or semiconductor media (such as solid state disks (SSD)), etc.
  • the disclosed systems, devices and methods can be implemented in other ways.
  • the device embodiments described above are only illustrative.
  • the division of the modules is only a logical function division. In actual implementation, there may be other division methods.
  • multiple modules or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the coupling or direct coupling or communication connection between each other shown or discussed may be through some interfaces, indirect coupling or communication connection of devices or modules, and may be in electrical, mechanical or other forms.
  • Modules described as separate components may or may not be physically separated, and components shown as modules may or may not be physical modules, that is, they may be located in one place, or they may be distributed to multiple network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the solution of this embodiment. For example, each functional module in each embodiment of the present application can be integrated into a processing module, or each module can exist physically alone, or two or more modules can be integrated into one module.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Artificial Intelligence (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Biomedical Technology (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Computation (AREA)
  • Signal Processing (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Quality & Reliability (AREA)
  • Machine Translation (AREA)

Abstract

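A model training method, a speech-to-speech translation method, an apparatus and a medium. The method includes: obtaining speech recognition samples and real speech-to-speech translation samples (S310); generating pseudo-labeled speech-to-speech translation samples from the speech recognition samples (S320); and training a speech-to-speech translation model on the pseudo-labeled speech-to-speech translation samples and the real speech-to-speech translation samples (S330). The method addresses the low model training accuracy caused by the scarcity of translation sample data.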

Description

模型训练方法、语音到语音翻译方法、装置及介质
优先权信息
本申请要求于2022年04月26日提交的,申请名称为“模型训练方法、语音到语音翻译方法、装置及介质”的、中国专利申请号“2022104485858”的优先权,该申请的全部内容通过引用结合在本申请中。
技术领域
本申请实施例涉及机器学习技术领域,尤其涉及一种模型训练方法、语音到语音翻译方法、装置及介质。
背景技术
语音到语音翻译(Speech-to-Speech Translation,S2ST)模型旨在将源语言语音翻译为目标语言语音,其广泛应用于视频翻译、跨国会议演讲、翻译对讲机等各种场景。通常语音到语音翻译模型需要通过大量数据训练得到,而目前在现实场景中很难收集成对的语音到语音的翻译样本,这种数据匮乏的情况导致模型训练精度较低的问题。
发明内容
本申请提供一种模型训练方法、语音到语音翻译方法、装置及介质,从而可以提高模型训练精度。
第一方面,提供一种模型训练方法,包括:获取语音识别样本和真实的语音到语音翻译样本;根据语音识别样本生成伪标注的语音到语音翻译样本;根据伪标注的语音到语音翻译样本和真实的语音到语音翻译样本训练语音到语音翻译模型。
第二方面,提供一种语音到语音翻译方法,包括:获取源语言语音特征;将源语言语音特征输入至如第一方面或第一方面的可选方式训练得到的语音到语音翻译模型,得到源语言语音特征对应的目标语言语音特征。
第三方面,提供一种模型训练装置,包括:获取模块、生成模块和训练模块,获取模块用于获取语音识别样本和真实的语音到语音翻译样本;生成模块用于根据语音识别样本生成伪标注的语音到语音翻译样本;训练模块用于根据伪标注的语音到语音翻译样本和真实的语音到语音翻译样本训练语音到语音翻译模型。
第四方面,提供语音到语音翻译装置,包括:获取模块和处理模块,其中,获取模块用于获取源语言语音特征;处理模块用于将源语言语音特征输入至如第一方面或第一方面的可选方式训练得到的语音到语音翻译模型,得到源语言语音特征对应的目标语言语音特征。
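In a fifth aspect, an electronic device is provided, including a processor and a memory, where the memory is used to store a computer program, and the processor is used to call and run the computer program stored in the memory to execute the method of the first aspect or the second aspect.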
第六方面,提供了一种计算机可读存储介质,用于存储计算机程序,该计算机程序使得计算机执行第一方面或第二方面所述的方法。
综上,虽然在现实场景中很难收集成对的语音到语音的翻译样本,但是语音识别样本却比较好收集,基于该语音识别样本可以生成伪标注的语音到语音的翻译样本,从而扩充了语音到语音的翻译样本,进而可以提高模型训练精度。
附图说明
为了更清楚地说明本发明实施例中的技术方案,下面将对实施例描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本发明的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他的附图。
图1为Transformer的框架图;
图2为本申请实施例涉及的一种系统架构示意图;
图3为本申请实施例提供的一种模型训练方法的流程图;
图4为本申请实施例提供的另一种模型训练方法的流程图;
图5为本申请实施例提供的一种语音到语音翻译模型的示意图;
图6为本申请实施例提供的另一种语音到语音翻译模型的示意图;
图7为本申请实施例提供的一种语音到语音翻译方法的流程图;
图8为本申请实施例提供的一种模型训练装置800的示意图;
图9为本申请实施例提供的一种语音到语音翻译装置900的示意图;
图10是本申请实施例提供的电子设备1000的示意性框图。
具体实施方式
下面将结合本发明实施例中的附图,对本发明实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例仅仅是本发明一部分实施例,而不是全部的实施例。基于本发明中的实施例,本领域普通技术人员在没有做出创造性劳动的前提下所获得的所有其他实施例,都属于本发明保护的范围。
需要说明的是,本发明的说明书和权利要求书及上述附图中的术语“第一”、“第二”等是用于区别类似的对象,而不必用于描述特定的顺序或先后次序。应该理解这样使用的数据在适当情况下可以互换,以便这里描述的本发明的实施例能够以除了在这里图示或描述的那些以外的顺序实施。此外,术语“包括”和“具有”以及他们的任何变形,意图在于覆盖不排他的包含,例如,包含了一系列步骤或单元的过程、方法、系统、产品或服务器不必限于清楚地列出的那些步骤或单元,而是可包括没有清楚地列出的或对于这些过程、方法、产品或设备固有的其它步骤或单元。
在介绍本申请技术方案之前,下面首先对本申请技术方案的相关知识进行阐述:
一、编码器(Encoder)和解码器(Decoder):
编码器(Encoder),用于处理源语言语音特征,并将源语言语音特征压缩成固定长度的隐藏表示,该隐藏表示也被称为上下文向量(context)、语义编码、语义向量等,期望该隐藏表示能够比较好的表示语言语音特征的信息。
解码器(Decoder),利用隐藏表示初始化解码器以得到目标语言语音特征。
二、转换器(Transformer):
图1为Transformer的框架图,如图1所示,编码器可以由N=6个一模一样的单元组成。每个单元包含两个子单元。第一个是采用多头自注意力机制(multi-head self-attention mechanism)的自注意力网络,第二个是全连接的前馈网络,激活函数是ReLU。这两个子单元都是用了残差连接(residual connection,ADD)和层归一化(layer normalization,Norm)。解码器与编码器几乎一样,只不过可以在中间多增加了一层多头注意力机制(encoder-decoder attention)来处理编码器的输出。同时,解码器的第一个单元即采用多头自注意力机制的第一个单元为了确保解码器不会读取当前位置之后的信息进行了遮挡(masking)操作。
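The masking operation described above can be illustrated with a minimal pure-Python sketch of single-head scaled dot-product self-attention in which position i attends only to positions 0..i. The function names are hypothetical, and the learned projections, multiple heads, residual connections and layer normalization are omitted, so this is an illustration of the masking idea rather than the architecture in Figure 1.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_self_attention(q: Matrix, k: Matrix, v: Matrix) -> Matrix:
    """Scaled dot-product attention where step i only sees steps 0..i."""
    dim = len(q[0])
    outputs = []
    for i, query in enumerate(q):
        # Masking: keys at positions j > i are simply never scored.
        scores = [
            sum(a * b for a, b in zip(query, k[j])) / math.sqrt(dim)
            for j in range(i + 1)
        ]
        weights = softmax(scores)
        outputs.append(
            [sum(weights[j] * v[j][c] for j in range(i + 1)) for c in range(len(v[0]))]
        )
    return outputs


x = [[1.0, 0.0], [0.0, 1.0]]
out = causal_self_attention(x, x, x)
# The first position can only attend to itself, so out[0] equals x[0].
```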
下面对本申请所要解决的技术问题和发明构思进行说明:
如上所述,通常语音到语音翻译模型需要通过大量数据训练得到,而目前在现实场景中很难收集成对的语音到语音的翻译样本,这种数据匮乏的情况导致模型训练精度较低的问题。
为了解决上述技术问题,本申请提出了扩充训练数据,以提高模型训练精度。
在一些实施例中,本申请实施例的系统架构如图2所示。
图2为本申请实施例涉及的一种系统架构示意图,用户设备201、数据采集设备202、训练设备203、执行设备204、数据库205和内容库206。
其中,数据采集设备202用于从内容库206中读取训练数据,并将读取的训练数据存储至数据库205中。本申请实施例涉及的训练数据包括伪标注的语音到语音翻译样本和真实的语音到语音翻译样本。
训练设备203基于数据库205中维护的训练数据,对语音到语音翻译模型进行训练,使得训练后的语音到语音翻译模型可以有效地实现源语言语音到目标语言语音的翻译。
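In addition, referring to Figure 2, the execution device 204 is configured with an I/O interface 207 for data interaction with external devices, for example, receiving through the I/O interface the source language speech features sent by the user device 201. The computing module 208 in the execution device 204 uses the trained speech-to-speech translation model to process the input source language speech features, outputs the target language speech features, and sends the corresponding result to the user device 201 through the I/O interface.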
其中,用户设备201可以包括手机、平板电脑、笔记本电脑、掌上电脑、移动互联网设备(mobile internet device,MID)、台式电脑、或其他具有安装浏览器功能的终端设备。
执行设备204可以为服务器。
示例性的,服务器可以是机架式服务器、刀片式服务器、塔式服务器或机柜式服务器等计算设备。该服务器可以是独立的测试服务器,也可以是多个测试服务器所组成的测试服务器集群。
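In this embodiment, the execution device 204 is connected to the user device 201 through a network. The network may be a wireless or wired network such as an intranet, the Internet, a Global System of Mobile communication (GSM) network, a Wideband Code Division Multiple Access (WCDMA) network, a fourth-generation (the 4th Generation, 4G) network, a fifth-generation (the 5th Generation, 5G) network, Bluetooth, wireless fidelity (Wi-Fi), or a telephone network.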
需要说明的是,图2仅是本申请实施例提供的一种系统架构的示意图,图中所示设备、器件、模块等之间的位置关系不构成任何限制。在一些实施例中,上述数据采集设备202与用户设备201、训练设备203和执行设备204可以为同一个设备。上述数据库205可以分布在一个服务器上也可以分布在多个服务器上,上述的内容库206可以分布在一个服务器上也可以分布在多个服务器上。
下面通过一些实施例对本申请实施例的技术方案进行详细说明。下面这几个实施例可以相互结合,对于相同或相似的概念或过程可能在某些实施例不再赘述。
图3为本申请实施例提供的一种模型训练方法的流程图,该方法可以由手机、平板电脑、笔记本电脑、掌上电脑、MID、台式电脑等任何电子设备执行,例如可以由图2中的训练设备执行,本申请对此不做限制,如图3所示,该方法包括:
S310:获取语音识别样本和真实的语音到语音翻译样本;
S320:根据语音识别样本生成伪标注的语音到语音翻译样本;
S330:根据伪标注的语音到语音翻译样本和真实的语音到语音翻译样本训练语音到语音翻译模型。
可选地,本申请中的语音到语音翻译模型可以是基于多任务学习(Multi-Task Learning,MTL)的语音到语音翻译模型,当然也可以是基于单任务学习的语音到语音翻译模型,本申请对此不做限制。其中,多任务学习是机器学习中一个很有前景的领域,其目标是利用多个学习任务中所包含的有用信息来帮助为每个任务学习得到更为准确的学习器。并且在多任务学习中,由于不同任务之间共享归纳偏好,所以任务之间一般是可以相互提升而避免单任务容易陷入局部最优。
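For convenience, real speech-to-speech translation samples and pseudo-labeled speech-to-speech translation samples are collectively referred to below as speech-to-speech translation samples. It should be understood that the number of data elements in a speech-to-speech translation sample depends on whether the speech-to-speech translation model is based on multi-task learning or single-task learning. For example, if the model is based on single-task learning, a sample may be a two-tuple including source language speech features and target language speech features. If the model is based on multi-task learning, and the tasks include one main task and two auxiliary tasks, where the main task is the speech-to-speech translation task and the two auxiliary tasks are a speech recognition task and a speech-to-text translation task (the speech recognition task converts source language speech features into source language text; the speech-to-text translation task converts source language speech features into source language text and then converts that source language text into target language text), then a sample may be a four-tuple including: source language speech features, source language text, target language speech features and target language text.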
可选地,真实的语音到语音翻译样本包括:第一源语言语音特征、第一源语言文本、第一目标语言语音特征和第一目标语言文本。
应理解的是,第一源语言语音特征是真实的源语言语音特征,第一源语言文本是真实的源语言文本,第一目标语言文本也是真实的目标语言文本,第一目标语言语音特征是电子设备对第一目标语言文本合成得到的目标语言语音特征,例如:电子设备可以将第一目标语言文本输入至语音合成模型中,得到第一目标语言语音特征。
可选地,语音识别样本包括:第二源语言语音特征和第二源语言文本。其中,该第二源语言语音特征是真实的源语言语音特征,该第二源语言文本也是真实的源语言文本。
应理解的是,所谓真实的源语言语音特征指的是在真实场景中可以得到的源语言语音特征,例如:电子设备可以通过麦克风采集某用户的语音,并提取该语音的特征。真实的源语言文本可以是通过人工方式得到的语言文本,例如:用户可以将一段语音记录下来,以形成该语音对应的语言文本。真实的目标语言文本也可以是通过人工方式得到的语言文本,例如:用户将源语言文本中的内容翻译为目标语言文本。
可选地,上述语音识别样本可以是一个或多个,上述真实的语音到语音翻译样本可以是一个或多个。
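Optionally, the electronic device may translate the second source language text to obtain a second target language text, and synthesize the second target language text to obtain second target language speech features. The pseudo-labeled speech-to-speech translation sample may then be a four-tuple including: the second source language speech features, the second source language text, the second target language text and the second target language speech features. In other words, the first two items of the pseudo-labeled sample, namely the second source language speech features and the second source language text, are both real.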
可选地,电子设备可以将第二源语言文本输入至机器翻译(Machine Translation,MT)模型中,得到第二目标语言文本。电子设备可以将第二目标语言文本输入至语音合成(Text-To-Speech,TTS)模型中,得到第二目标语言语音特征。
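The construction of a pseudo-labeled four-tuple from a speech recognition sample can be sketched as follows. This is a minimal sketch under stated assumptions: `AsrSample`, `S2STSample` and `make_pseudo_sample` are hypothetical names, and `toy_translate` and `toy_synthesize` are toy stand-ins for whatever MT and TTS models are actually used, which the text does not specify.

```python
from dataclasses import dataclass
from typing import Callable, List

Features = List[List[float]]  # one feature vector per frame


@dataclass(frozen=True)
class AsrSample:
    src_speech: Features  # real source language speech features
    src_text: str         # real source language text


@dataclass(frozen=True)
class S2STSample:
    src_speech: Features
    src_text: str
    tgt_text: str
    tgt_speech: Features
    is_pseudo: bool


def make_pseudo_sample(
    sample: AsrSample,
    translate: Callable[[str], str],        # assumed MT interface
    synthesize: Callable[[str], Features],  # assumed TTS interface
) -> S2STSample:
    """Translate the real source text, then synthesize the translated text."""
    tgt_text = translate(sample.src_text)
    tgt_speech = synthesize(tgt_text)
    return S2STSample(sample.src_speech, sample.src_text, tgt_text, tgt_speech, True)


# Toy stand-ins so the sketch runs without any real model.
def toy_translate(text: str) -> str:
    return text.upper()


def toy_synthesize(text: str) -> Features:
    return [[float(ord(ch))] for ch in text]


pseudo = make_pseudo_sample(AsrSample([[0.1, 0.2]], "hello"), toy_translate, toy_synthesize)
```

Only the last two fields are machine-generated; the source speech features and source text are carried over unchanged from the speech recognition sample.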
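It should be understood that, given the construction process above, a real speech-to-speech translation sample and a pseudo-labeled one differ mainly in the target language text. For example, a real sample is a four-tuple {s_src, t_src, t_tgt, s_tgt}, where s_src denotes the real source language speech features, t_src denotes the real source language text, t_tgt denotes the real target language text, and s_tgt denotes the target language speech features synthesized from t_tgt. A pseudo-labeled sample is a four-tuple {s_src, t_src, t'_tgt, s_tgt}, where s_src denotes the real source language speech features, t_src denotes the real source language text, t'_tgt denotes the target language text obtained by inputting the real source language text into the MT model, and s_tgt denotes the target language speech features synthesized from t'_tgt.
This application may refer to speech-to-speech translation based on such pseudo-labeled speech-to-speech translation samples as Pseudo Translation Label Adaptation (PTLA).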
应理解的是,上述伪标注的语音到语音翻译样本是在语音识别样本的基础上的得到的,实际上,也可以基于源语言语音特征构造伪标注的语音到语音翻译样本。例如:电子设备可以获取真实的源语言语音特征,将该源语言语音特征输入至自动语音识别(Automatic Speech Recognition,ASR)模型,得到该源语言语音特征对应的源语言文本,再将该源语言文本输入至MT模型中得到目标语言文本,最后可以将该目标语言文本输入至TTS模型中,得到目标语言语音特征,基于此,这些源语言语音特征、源语言文本、目标语言文本和目标语言语音特征构成伪标注的语音到语音翻译样本。
可选地,在本申请中,源语言语音特征可以是源语言语音的对数梅尔谱图(log-mel spectrogram),该对数梅尔谱图可以是80通道的对数梅尔谱图,但不限于此。
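Optionally, in this application, the target language speech features may be a linear frequency spectrogram of the target language speech, but are not limited thereto.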
应理解的是,电子设备对语音到语音翻译模型训练过程包括:预训练阶段(pre-training)和微调阶段(fine tuning)。
预训练指的是预先训练模型或者预先训练模型的过程。微调指的是将预训练过的模型作用于某任务的数据集,并使参数适应该任务的数据集的过程。
例如:当需要搭建一个网络模型来完成一个特定图像分类任务时,首先,电子设备需要随机初始化参数,然后开始训练网络模型,不断调整该模型的参数,使得网络模型的损失越来越小,直到满足训练停止条件为止,该过程就是预训练过程。当获取到一个与上述特定图像分类任务类似的图像分类任务时,电子设备可以直接使用之前训练的网络模型,将该网络模型的参数来作为这一任务的初始化参数,然后训练网络模型,不断调整该模型的参数,使得网络模型的损失越来越小,直到满足训练停止条件为止,该过程就是微调过程。
应理解的是,上述真实的语音到语音翻译样本也可以被称为原始的语音到语音翻译样本。伪标注的语音到语音翻译样本也可以被称为衍生的语音到语音翻译样本。其中,该真实的语音到语音翻译样本可以作用于语音到语音翻译模型的预训练阶段,也可以作用于该模型的微调阶段。该伪标注的语音到语音翻译样本可以作用于语音到语音翻译模型的预训练阶段,也可以作用于该模型的微调阶段,本申请对此不做限制。
综上,虽然在现实场景中很难收集成对的语音到语音的翻译样本,但是语音识别样本却比较好收集,基于该语音识别样本可以生成伪标注的语音到语音的翻译样本,从而扩充了语音到语音的翻译样本,进而可以提高模型训练精度。
下面将示例性地阐述若干模型训练方法:
如图4所示,上述S330包括:
S410:根据伪标注的语音到语音翻译样本预训练语音到语音翻译模型;
S420:根据真实的语音到语音翻译样本微调预训练后的语音到语音翻译模型。
应理解的是,由于本申请引入了伪标注的语音到语音翻译样本,为了提高模型训练精度,可以将伪标注的语音到语音翻译样本应用于模型的预训练阶段,而将真实的语音到语音翻译样本应用于模型的微调阶段。
可选地,电子设备可以直接通过真实的语音到语音翻译样本微调预训练后的语音到语音翻译模型,也就是说,电子设备只通过真实的语音到语音翻译样本微调预训练后的语音到语音翻译模型。
可选地,电子设备也可以根据真实的语音到语音翻译样本和伪标注的语音到语音翻译样本微调预训练后的语音到语音翻译模型。基于该方式所进行的模型训练方法可以称为混合训练方法。
应理解的是,这种混合训练算法可以最大限度地保留伪标注的语音到语音翻译样本收益。由于语音识别样本的规模相对于真实的语音到语音翻译样本的规模要大很多,基于此,伪标注的语音到语音翻译样本的规模相对于真实的语音到语音翻译样本的规模要大很多,为了防止伪标注的语音到语音翻译样本误导模型优化结果,在本申请中,可以对真实的语音到语音翻译样本进行上采样,以扩充真实的语音到语音翻译样本的规模,进而可以通过上采样后的语音到语音翻译样本和伪标注的语音到语音翻译样本微调预训练后的语音到语音翻译模型。
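The upsampling step for mixed fine-tuning can be sketched as below. The text does not state a target ratio, so repeating the real samples until they match the number of pseudo-labeled samples (a 1:1 mix) is an assumption, and `upsample` and `build_finetune_set` are hypothetical helper names.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def upsample(real: Sequence[T], target_size: int) -> List[T]:
    """Repeat real samples cyclically until there are at least target_size."""
    if not real:
        raise ValueError("need at least one real sample")
    size = max(target_size, len(real))
    return [real[i % len(real)] for i in range(size)]


def build_finetune_set(real: Sequence[T], pseudo: Sequence[T], seed: int = 0) -> List[T]:
    """Mix upsampled real samples with pseudo-labeled ones (assumed 1:1)."""
    mixed = upsample(real, len(pseudo)) + list(pseudo)
    random.Random(seed).shuffle(mixed)
    return mixed


real_samples = ["r1", "r2"]
pseudo_samples = ["p1", "p2", "p3", "p4", "p5"]
mixed = build_finetune_set(real_samples, pseudo_samples)
```

Cyclic repetition keeps every real sample represented about equally often; sampling with replacement would be an equally plausible choice.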
可选地,电子设备在根据真实的语音到语音翻译样本和伪标注的语音到语音翻译样本微调预训练后的语音到语音翻译模型之前,还可以对真实的语音到语音翻译样本标注第一标签,第一标签用于标识真实的语音到语音翻译样本为真实样本,可以用real表示;对伪标注的语音到语音翻译样本标注第二标签,第二标签用于标识伪标注的语音到语音翻译样本为伪标注样本,可以用pseudo表示。基于该方式所进行的模型训练方法可以称为提示(prompt)训练方法,基于这种提示训练方式,可以使得模型更好地区分真实的语音到语音翻译样本和伪标注的语音到语音翻译样本。
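A minimal sketch of the prompt labeling follows. The text only says that samples are marked with `real` or `pseudo`; storing the label in a separate `tag` field of a dictionary is an assumption made for illustration, as is the helper name `tag_sample`. How the label is actually fed to the model is not specified.

```python
from typing import Any, Dict

REAL_TAG = "real"      # first label: marks a real sample
PSEUDO_TAG = "pseudo"  # second label: marks a pseudo-labeled sample


def tag_sample(sample: Dict[str, Any], is_pseudo: bool) -> Dict[str, Any]:
    """Return a copy of the sample with its provenance label attached."""
    tagged = dict(sample)
    tagged["tag"] = PSEUDO_TAG if is_pseudo else REAL_TAG
    return tagged


original = {"src_text": "hello", "tgt_text": "你好"}
tagged_real = tag_sample(original, is_pseudo=False)
tagged_pseudo = tag_sample(original, is_pseudo=True)
```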
在本申请实施例中,电子设备可以根据伪标注的语音到语音翻译样本预训练语音到语音翻译模型;根据真实的语音到语音翻译样本微调预训练后的语音到语音翻译模型,即伪标注数据主要应用于预训练过程,通过这种方式可以防止伪标注的语音到语音翻译样本误导模型优化结果。
进一步地,由于真实的语音到语音翻译样本匮乏,所以伪标注数据也可以参与进微调过程,但是为了防止伪标注的语音到语音翻译样本误导模型优化结果,可以对真实的语音到语音翻译样本进行上采样,进而可以通过上采样后的语音到语音翻译样本和伪标注的语音到语音翻译样本微调预训练后的语音到语音翻译模型,通过该方法一方面可以解决真实的语音到语音翻译样本匮乏而导致的模型训练精度较低的问题,另一方面可以防止伪标注的语音到语音翻译样本误导模型优化结果。
更进一步地,电子设备在根据真实的语音到语音翻译样本和伪标注的语音到语音翻译样本微调预训练后的语音到语音翻译模型之前,还可以对真实的语音到语音翻译样本和伪标注的语音到语音翻译样本标注对应标签,从而可以使得模型更好地区分真实的语音到语音翻译样本和伪标注的语音到语音翻译样本。
可选地,在本申请中,语音到语音翻译模型可以是现有的翻译器(Translatotron)模型或者可以如图5所示的语音到语音翻译模型,本申请对此不做限制。
图5为本申请实施例提供的一种语音到语音翻译模型的示意图,如图5所示,该模型包括:编码器模块510、第一注意力模块520、第一解码器模块530、N个第二注意力模块540和N个第二解码器模块550,N为正整数,N个第二注意力模块和N个第二解码器模块一一对应,图5以N=2为例,当然N也可以等于1、3等等。
可选地,该模型可以是基于多任务学习的语音到语音翻译模型,且多任务包括:一个主任务和N个辅助任务,主任务是语音到语音的翻译任务,其中,上述第二注意力模块、第二解码器模块的数量与辅助任务数量一致,例如:N=2,即存在两个辅助任务,两个辅助任务可以分别是语音识别任务和语音到文本的翻译任务,但不限于此。该语音识别任务用于将源语言语音特征转换为源语言文本,该语音到文本的翻译任务用于将源语言语音特征转换为源语言文本,并将该源语言文本转换为目标语言文本。再例如:N=1,即存在一个辅助任务,该辅助任务可以是语音识别任务或语音到文本的翻译任务,但不限于此。第一注意力模块520和第一解码器模块530对应于主任务,而下面的每组第二注意力模块540和第二解码器模块550对应于一个辅助任务。第一解码器模块530主要用于预测合成目标语言语音特征。训练时,两个辅助任务接受编码器模块510的输入,并将预测的损失值以加权和的形式加入到主任务,测试时,第二解码器模块550不被使用。
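The weighted-sum combination of the auxiliary losses with the main-task loss can be sketched as below. The weight values are illustrative assumptions and `multitask_loss` is a hypothetical name; the text states only that the auxiliary losses are added to the main task as a weighted sum during training.

```python
from typing import Sequence


def multitask_loss(
    main_loss: float,
    aux_losses: Sequence[float],
    aux_weights: Sequence[float],
) -> float:
    """Main-task loss plus a weighted sum of the auxiliary-task losses."""
    if len(aux_losses) != len(aux_weights):
        raise ValueError("one weight is required per auxiliary loss")
    return main_loss + sum(w * loss for w, loss in zip(aux_weights, aux_losses))


# N = 2 auxiliary tasks: speech recognition and speech-to-text translation.
total = multitask_loss(main_loss=2.0, aux_losses=[1.0, 0.5], aux_weights=[0.3, 0.2])
```

At test time the auxiliary decoders are unused, so only the main branch contributes to the output.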
编码器模块510用于获取源语言语音特征,并对源语言语音特征进行处理,得到源语言语音特征对应的多组第一隐藏状态表示。
可选地,图6为本申请实施例提供的另一种语音到语音翻译模型的示意图,如图6所示,编码器模块510包括:卷积神经网络子模块5101和第一转换器模块5102;卷积神经网络子模块5101用于获取源语言语音特征,并对源语言语音特征进行处理,得到源语言语音特征对应的第二隐藏状态表示;第一转换器模块5102用于获取第二隐藏状态表示,并对第二隐藏状态表示进行处理,得到多组第一隐藏状态表示。
可选地,卷积神经网络子模块5101可以包括两层卷积神经网络层,但不限于此。例如,80通道的对数梅尔谱图输入卷积神经网络子模块5101之后,两层卷积神经网络层可以将80通道的对数梅尔谱图的长度映射为原来的四分之一,即假设之前80通道的对数梅尔谱图是通过100个向量表示,每个向量是80维的,那么经过两层卷积神经网络层处理之后,得到的是25个向量,为了与第一转换器模块5102中的隐藏单元数量保持一致,如隐藏单元数量是512,那么经过两层卷积神经网络层处理之后的25个向量的维数也是512,其中,这里可以将25个512维的向量理解为上述第二隐藏状态表示。
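The reduction from 100 frames to 25 can be checked with the standard convolution output-length formula. The kernel size 3, stride 2 and padding 1 used here are assumptions; the text states only that two convolutional layers reduce the length to one quarter.

```python
def conv_out_len(length: int, kernel: int = 3, stride: int = 2, padding: int = 1) -> int:
    """Output length of a 1-D convolution along the time axis."""
    return (length + 2 * padding - kernel) // stride + 1


def subsampled_len(length: int, layers: int = 2) -> int:
    """Length after stacking several stride-2 convolutional layers."""
    for _ in range(layers):
        length = conv_out_len(length)
    return length


frames_out = subsampled_len(100)  # 100 -> 50 -> 25
```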
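Optionally, the first converter module 5102 may be similar to the encoder structure shown in Figure 1; that is, it may include 6 converter layers or 12 converter layers, which is not limited in this application. Each converter layer may have 512 hidden units, that is, the hidden representation output by a converter layer may be 512-dimensional. Each converter layer may contain two sub-units. The first is a self-attention network using a multi-head self-attention mechanism, which may be an 8-head self-attention network; this is not limited in this application. The second is a fully connected feed-forward network, which may use a 2048-dimensional inner state. Both sub-units use residual connections and layer normalization. After the first converter module 5102 processes the second hidden state representation, N sets of first hidden state representations are obtained, where N here denotes the number of converter layers included in the first converter module 5102.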
结合上面的示例,假设卷积神经网络子模块5101输出的是25个512维的向量,那么经过第一转换器模块5102处理后,可以得到N组第一隐藏状态表示,每组第一隐藏状态表示也是25个512维的向量,其中,第一转换器模块5102的最后一层所得到的25个512维的向量可以输出给第一注意力模块520,中间层所得到的25个512维的向量可以输出给第二注意力模块540。
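The routing of per-layer encoder outputs can be sketched as follows: every layer's output is kept, the last one goes to the main-task attention, and selected intermediate ones go to the auxiliary attentions. Which intermediate layers feed which auxiliary task is not stated in the text, so the indices below are assumptions, and the toy layers stand in for real converter layers.

```python
from typing import Callable, List, Sequence, Tuple

Hidden = List[List[float]]  # e.g. 25 vectors of dimension 512


def encode_all_layers(layers: Sequence[Callable[[Hidden], Hidden]], x: Hidden) -> List[Hidden]:
    """Run the layers in order and keep the output of every layer."""
    states = []
    for layer in layers:
        x = layer(x)
        states.append(x)
    return states


def route_states(states: List[Hidden], aux_layer_indices: Sequence[int]) -> Tuple[Hidden, List[Hidden]]:
    """Last layer feeds the main task; chosen intermediate layers feed auxiliary tasks."""
    main_memory = states[-1]
    aux_memories = [states[i] for i in aux_layer_indices]
    return main_memory, aux_memories


def toy_layer(h: Hidden) -> Hidden:
    # Toy stand-in for a converter layer: adds 1 to every element.
    return [[v + 1.0 for v in row] for row in h]


states = encode_all_layers([toy_layer] * 6, [[0.0, 0.0]])
main_memory, aux_memories = route_states(states, aux_layer_indices=[1, 3])
```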
第一注意力模块520用于获取多组第一隐藏状态表示中的一组第一隐藏状态表示以及第一解码器输出的各个时间步对应的第一向量,并对该组第一隐藏状态表示和各个时间步对应的第一向量进行处理,得到各个时间步对应的第一注意力表示。
第一解码器模块530用于获取各个时间步对应的第二向量,并对各个时间步对应的第二向量进行处理,得到各个时间步对应的第一向量,将各个时间步对应的第一向量输出给第一注意力模块520,获取各个时间步对应的第一注意力表示,并对各个时间步对应的第一注意力表示进行处理,得到源语言语音特征对应的目标语言语音特征。
可选地,如图6所示,第一解码器模块530包括:前处理网络(prenet)5301、第二转换器模块5302和后处理网络(postnet)5303;前处理网络5301用于获取各个时间步对应的第二向量,并对各个时间步对应的第二向量进行处理,得到各个时间步对应的第一向量,将各个时间步对应的第一向量输出给第一注意力模块;第二转换器模块5302用于获取各个时间步对应的第一注意力表示,并对各个时间步对应的第一注意力表示进行处理,得到各个时间步上的目标语言语音特征;后处理网络5303用于对各个时间步上的目标语言语音特征进行处理,得到源语言语音特征对应的目标语言语音特征。
可选地,前处理网络5301的瓶颈(bottleneck)维数可以是32。
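Continuing the example above, in the training phase the pre-processing network 5301 may obtain an 80-dimensional all-zero vector, which is the second vector corresponding to the first time step, and process it to obtain a 512-dimensional all-zero vector, which is the first vector corresponding to the first time step. The pre-processing network 5301 may then input this vector to the first attention module 520, which processes it together with the 25 512-dimensional vectors obtained from the encoder module 510 to obtain the first attention representation corresponding to the first time step. The first attention module 520 then inputs this first attention representation to the second converter module 5302, which processes it to obtain the target language speech features at the first time step; these are the predicted target language speech features at the first time step. In addition, the pre-processing network 5301 may obtain the ground-truth target language speech features at the first time step, which can be understood as the second vector corresponding to the second time step. The pre-processing network 5301 may process this second vector to obtain a 512-dimensional vector, which is the first vector corresponding to the second time step, and input it to the first attention module 520. The first attention module 520 may process the first vector corresponding to the second time step together with the 25 512-dimensional vectors obtained from the encoder module 510 to obtain the first attention representation corresponding to the second time step, and input it to the second converter module 5302. The second converter module 5302 may process this first attention representation to obtain the target language speech features at the second time step, which are the predicted target language speech features at the second time step.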
总之,前处理网络5301可以获取第i个时间步上的实际地目标语言语音特征,i=1,2……M,M是总共的时间步数。该第i个时间步上的实际地目标语言语音特征可以被理解为第i+1个时间步对应的第二向量,前处理网络5301可以对第i+1个时间步对应的第二向量进行处理,得到512维的向量,该512维的向量是第i+1个时间步对应的第一向量,进一步地,前处理网络5301可以将该第i+1个时间步对应的第一向量输入至第一注意力模块520,第一注意力模块520可以对该第i+1个时间步对应的第一向量以及从编码器模块510获取到的25个512维的向量进行处理,得到第i+1个时间步对应的第一注意力表示,更进一步地,第一注意力模块520将第i+1个时间步对应的第一注意力表示输入至第二转换器模块5302,第二转换器模块5302可以对第i+1个时间步对应的第一注意力表示进行处理,得到第i+1个时间步上的目标语言语音特征。
在测试阶段,前处理网络5301可以获取80维的全0向量,该全0向量是第一个时间步对应的第二向量,前处理网络5301可以对全0向量进行处理,得到512维的全0向量,该全0向量是第一个时间步对应的第一向量,进一步地,前处理网络5301可以将该全0向量输入至第一注意力模块520,第一注意力模块520可以对该全0向量以及从编码器模块510获取到的25个512维的向量进行处理,得到第一个时间步对应的第一注意力表示,更进一步地,第一注意力模块520将第一个时间步对应的第一注意力表示输入至第二转换器模块5302,第二转换器模块5302可以对第一个时间步对应的第一注意力表示进行处理,得到第一个时间步上的目标语言语音特征,该第一个时间步上的目标语言语音特征是预测得到的第一个时间步上的目标语言语音特征。进一步地,前处理网络5301可以对第一个时间步上的预测地目标语言语音特征并对其进行处理,得到512维的向量,其中,第一个时间步上的预测地目标语言语音特征可以被理解为上述第二个时间步对应的第二向量,该512维的向量是第二个时间步对应的第一向量,进一步地,前处理网络5301可以将该第二个时间步对应的第一向量输入至第一注意力模块520,第一注意力模块520可以对该第二个时间步对应的第一向量以及从编码器模块510获取到的25个512维的向量进行处理,得到第二个时间步对应的第一注意力表示,更进一步地,第一注意力模块520将第二个时间步对应的第一注意力表示输入至第二转换器模块5302,第二转换器模块5302可以对第二个时间步对应的第一注意力表示进行处理,得到第二个时间步上的目标语言语音特征。
总之,前处理网络5301可以获取第i个时间步上的预测地目标语言语音特征,i=1,2……M,M是总共的时间步数。该第i个时间步上的预测地目标语言语音特征可以被理解为第i+1个时间步对应的第二向量,前处理网络5301可以对第i+1个时间步对应的第二向量进行处理,得到512维的向量,该512维的向量是第i+1个时间步对应的第一向量,进一步地,前处理网络5301可以将该第i+1个时间步对应的第一向量输入至第一注意力模块520,第一注意力模块520可以对该第i+1个时间步对应的第一向量以及从编码器模块510获取到的25个512维的向量进行处理,得到第i+1个时间步对应的第一注意力表示,更进一步地,第一注意力模块520将第i+1个时间步对应的第一注意力表示输入至第二转换器模块5302,第二转换器模块5302可以对第i+1个时间步对应的第一注意力表示进行处理,得到第i+1个时间步上的目标语言语音特征。
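The only difference between the training and test loops described above is what is fed back as the next step's input: the ground-truth frame during training, the model's own prediction during testing. A minimal sketch, in which `step_fn` is a hypothetical stand-in for the whole pre-processing network, attention and converter stack:

```python
from typing import Callable, List, Optional, Sequence

Frame = List[float]


def decode(
    step_fn: Callable[[Frame], Frame],
    num_steps: int,
    feat_dim: int = 80,
    targets: Optional[Sequence[Frame]] = None,
) -> List[Frame]:
    """Run the decoder step by step, starting from an all-zero frame.

    With targets (training), the ground-truth frame of step i becomes the
    input of step i+1. Without targets (testing), the predicted frame of
    step i is fed back instead.
    """
    prev = [0.0] * feat_dim
    outputs = []
    for i in range(num_steps):
        pred = step_fn(prev)
        outputs.append(pred)
        prev = list(targets[i]) if targets is not None else pred
    return outputs


def toy_step(frame: Frame) -> Frame:
    # Toy stand-in for one decoder step: adds 1 to every element.
    return [v + 1.0 for v in frame]


free_running = decode(toy_step, num_steps=3, feat_dim=2)
teacher_forced = decode(
    toy_step, num_steps=3, feat_dim=2,
    targets=[[10.0, 10.0], [20.0, 20.0], [30.0, 30.0]],
)
```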
可选地,第二转换器模块5302可以将各个时间步上的目标语言语音特征输入给后处理网络5303,后处理网络5303可以对各个时间步上的目标语言语音特征进行加权求和,得到源语言语音特征对应的目标语言语音特征。
可选地,第二转换器模块5302可以类似于图1所示的解码器结构,即该第二转换器模块5302可以包括6个转换器层,本申请对此不做限制,第二转换器模块5302可以与第一转换器模块5102采用相同的超参数。
在对语音到语音翻译模型的训练阶段,第二注意力模块540用于获取多组第一隐藏状态表示中的一组第一隐藏状态表示以及第二注意力模块540对应的第二解码器输出的各个时间步对应的第三向量,并对该组第一隐藏状态表示和各个时间步对应的第三向量进行处理,得到各个时间步对应的第二注意力表示;
第二注意力模块540对应的第二解码器模块550用于获取各个时间步对应的第四向量,并对各个时间步对应的第四向量进行处理,得到各个时间步对应的第三向量,将各个时间步对应的第三向量输出给第二注意力模块540,获取各个时间步对应的第二注意力表示,并对各个时间步对应的第二注意力表示进行处理,得到源语言语音特征对应的辅助表示。
可选地,第二解码器模块550可以包括:前处理网络、第三转换器模块和后处理网络;前处理网络用于获取各个时间步对应的第四向量,并对各个时间步对应的第四向量进行处理,得到各个时间步对应的第三向量,将各个时间步对应的第三向量输出给第二注意力模块540;第三转换器模块用于获取各个时间步对应的第二注意力表示,并对各个时间步对应的第二注意力表示进行处理,得到各个时间步上的辅助表示;后处理网络用于对各个时间步上的辅助表示进行处理,得到源语言语音特征对应的辅助表示。
可选地,前处理网络的瓶颈(bottleneck)维数可以是32。
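For example, in the training phase, the pre-processing network of the second decoder module 550 may obtain an 80-dimensional embedding vector, which is the fourth vector corresponding to the first time step, and process it to obtain a 512-dimensional vector, which is the third vector corresponding to the first time step. The pre-processing network may then input this vector to the second attention module 540, which processes it together with the 25 512-dimensional vectors obtained from the encoder module 510 to obtain the second attention representation corresponding to the first time step. The second attention module 540 then inputs this second attention representation to the third converter module, which processes it to obtain the auxiliary representation at the first time step; this is the predicted auxiliary representation at the first time step. In addition, the pre-processing network may obtain the ground-truth auxiliary representation at the first time step, which can be understood as the fourth vector corresponding to the second time step. The pre-processing network may process this fourth vector to obtain a 512-dimensional vector, which is the third vector corresponding to the second time step, and input it to the second attention module 540. The second attention module 540 may process the third vector corresponding to the second time step together with the 25 512-dimensional vectors obtained from the encoder module 510 to obtain the second attention representation corresponding to the second time step, and input it to the third converter module. The third converter module may process this second attention representation to obtain the auxiliary representation at the second time step, which is the predicted auxiliary representation at the second time step.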
总之,前处理网络可以获取第i个时间步上的实际地辅助表示,i=1,2……M,M是总共的时间步数。该第i个时间步上的实际地辅助表示可以被理解为第i+1个时间步对应的第四向量,前处理网络可以对第i+1个时间步对应的第四向量进行处理,得到512维的向量,该512维的向量是第i+1个时间步对应的第三向量,进一步地,前处理网络可以将该第i+1个时间步对应的第三向量输入至第二注意力模块540,第二注意力模块540可以对该第i+1个时间步对应的第三向量以及从编码器模块510获取到的25个512维的向量进行处理,得到第i+1个时间步对应的第二注意力表示,更进一步地,第二注意力模块540将第i+1个时间步对应的第二注意力表示输入至第三转换器模块,第三转换器模块可以对第i+1个时间步对应的第二注意力表示进行处理,得到第i+1个时间步上的辅助表示。
可选地,第三转换器模块可以将各个时间步上的辅助表示输入给后处理网络,后处理网络可以对各个时间步上的辅助表示进行加权求和,得到源语言语音特征对应的辅助表示。
可选地,当辅助任务是语音识别任务时,上述辅助表示可以是语音识别结果,如源语言语音对应的源语言文本。当辅助任务是语音到文本的翻译任务时,上述辅助表示可以是语音翻译结果,如目标语言文本。
应理解的是,本申请提供的语音到语音翻译模型是对现有的翻译器(Translatotron)模型进行了相应的改进,具体是将翻译器(Translatotron)模型中的长短期记忆网络(Long Short-Term Memory,LSTM)替换为转换器模块,在本申请中,可以将该语音到语音翻译模型称为基于转换器的翻译器模型(Transformer-based Translatotron)。在LSTM中每个时间步上的计算都是局部计算,而在转换器模块中,每个时间步上的计算都是全局计算,从而可以提高模型精度。
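It should be understood that the Transformer-based Translatotron provided in this application may be trained without pseudo-labeled speech-to-speech translation samples; a speech-to-speech model trained in this way may be called the baseline system. The Transformer-based Translatotron may of course also be trained with pseudo-labeled speech-to-speech translation samples; a speech-to-speech model trained in this way may be called baseline system + PTLA.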
本申请提供了语音翻译中常用的TEDEn2Zh数据集(英到中)来测试基线系统和基线系统+PTLA的性能,具体如表1所示:
表1
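Here, S-PER denotes the phoneme error rate of the speech recognition task on the test set. Tp-BLEU denotes the phoneme-based Bilingual Evaluation Understudy (BLEU) score of the speech-to-text translation task on the test set. Dev-BLER denotes the phoneme-based BLEU score of the main task on the development set. test-BLEU denotes the phoneme-based BLEU score of the main task on the test set.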
从表1中可以看出,基线系统可以在复杂的语向翻译上取得良好性能,基线系统+PTLA方案可以有效提升模型表现。
下面将提供一种语音到语音翻译方法:
图7为本申请实施例提供的一种语音到语音翻译方法的流程图,该方法可以由手机、平板电脑、笔记本电脑、掌上电脑、MID、台式电脑等任何电子设备执行,例如可以由图2中的执行设备执行,本申请对此不做限制,如图7所示,该方法包括:
S710:获取源语言语音特征;
S720:将源语言语音特征输入至语音到语音翻译模型,得到源语言语音特征对应的目标语言语音特征。
应理解的是,该语音到语音翻译模型可以由上述模型训练方法训练得到,由于通过上述训练方法所得到的语音到语音翻译模型精度更高,基于此,可以更好地实现语音到语音翻译。
图8为本申请实施例提供的一种模型训练装置800的示意图,如图8所示,该装置800包括:获取模块810、生成模块820和训练模块830,其中,获取模块810用于获取语音识别样本和真实的语音到语音翻译样本;生成模块820用于根据语音识别样本生成伪标注的语音到语音翻译样本;训练模块830用于根据伪标注的语音到语音翻译样本和真实的语音到语音翻译样本训练语音到语音翻译模型。
可选地,训练模块830具体用于:根据伪标注的语音到语音翻译样本预训练语音到语音翻译模型,并根据真实的语音到语音翻译样本微调预训练后的语音到语音翻译模型。
可选地,训练模块830具体用于:通过真实的语音到语音翻译样本微调预训练后的语音到语音翻译模型;或者,根据真实的语音到语音翻译样本和伪标注的语音到语音翻译样本微调预训练后的语音到语音翻译模型。
可选地,该装置800还包括:标注模块840,用于在根据真实的语音到语音翻译样本和伪标注的语音到语音翻译样本微调预训练后的语音到语音翻译模型之前,对真实的语音到语音翻译样本标注第一标签,第一标签用于标识真实的语音到语音翻译样本为真实样本;对伪标注的语音到语音翻译样本标注第二标签,第二标签用于标识伪标注的语音到语音翻译样本为伪标注样本。
可选地,训练模块830具体用于:对真实的语音到语音翻译样本进行上采样,得到上采样后的语音到语音翻译样本;通过上采样后的语音到语音翻译样本和伪标注的语音到语音翻译样本微调预训练后的语音到语音翻译模型。
可选地,真实的语音到语音翻译样本包括:第一源语言语音特征、第一源语言文本、第一目标语言语音特征和第一目标语言文本;语音识别样本包括:第二源语言语音特征和第二源语言文本。
可选地,生成模块820具体用于:对第二源语言文本进行翻译,得到第二目标语言文本;对第二目标语言文本进行合成,得到第二目标语言语音特征;其中,伪标注的语音到语音翻译样本包括:第二源语言语音特征、第二源语言文本、第二目标语言文本和第二目标语言语音特征。
可选地,语音到语音翻译模型包括:编码器模块、第一注意力模块、第一解码器模块、N个第二注意力模块和N个第二解码器模块,N为正整数,N个第二注意力模块和N个第二解码器模块一一对应;
编码器模块用于获取源语言语音特征,并对源语言语音特征进行处理,得到源语言语音特征对应的多组第一隐藏状态表示;
第一注意力模块用于获取多组第一隐藏状态表示中的一组第一隐藏状态表示以及第一解码器输出的各个时间步对应的第一向量,并对该组第一隐藏状态表示和各个时间步对应的第一向量进行处理,得到各个时间步对应的第一注意力表示;
第一解码器模块用于获取各个时间步对应的第二向量,并对各个时间步对应的第二向量进行处理,得到各个时间步对应的第一向量,将各个时间步对应的第一向量输出给第一注意力模块,获取各个时间步对应的第一注意力表示,并对各个时间步对应的第一注意力表示进行处理,得到源语言语音特征对应的目标语言语音特征;
在对语音到语音翻译模型的训练阶段,第二注意力模块用于获取多组第一隐藏状态表示中的一组第一隐藏状态表示以及第二注意力模块对应的第二解码器输出的各个时间步对应的第三向量,并对该组第一隐藏状态表示和各个时间步对应的第三向量进行处理,得到各个时间步对应的第二注意力表示;
第二注意力模块对应的第二解码器模块用于获取各个时间步对应的第四向量,并对各个时间步对应的第四向量进行处理,得到各个时间步对应的第三向量,将各个时间步对应的第三向量输出给第二注意力模块,获取各个时间步对应的第二注意力表示,并对各个时间步对应的第二注意力表示进行处理,得到源语言语音特征对应的辅助表示。
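Optionally, the encoder module includes a convolutional neural network sub-module and a first converter module. The convolutional neural network sub-module is used to obtain the source language speech features and process them to obtain the second hidden state representation corresponding to the source language speech features. The first converter module is used to obtain the second hidden state representation and process it to obtain the multiple sets of first hidden state representations.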
可选地,第一解码器模块包括:前处理网络、第二转换器模块和后处理网络;前处理网络用于获取各个时间步对应的第二向量,并对各个时间步对应的第二向量进行处理,得到各个时间步对应的第一向量,将各个时间步对应的第一向量输出给第一注意力模块;第二转换器模块用于获取各个时间步对应的第一注意力表示,并对各个时间步对应的第一注意力表示进行处理,得到各个时间步上的目标语言语音特征;后处理网络用于对各个时间步上的目标语言语音特征进行处理,得到源语言语音特征对应的目标语言语音特征。
应理解的是,装置实施例与方法实施例可以相互对应,类似的描述可以参照方法实施例。为避免重复,此处不再赘述。具体地,图8所示的装置800可以执行上述模型训练方法实施例,并且装置800中的各个模块的前述和其它操作和/或功能分别为了实现上述模型训练方法中的相应流程,为了简洁,在此不再赘述。
上文中结合附图从功能模块的角度描述了本申请实施例的装置800。应理解,该功能模块可以通过硬件形式实现,也可以通过软件形式的指令实现,还可以通过硬件和软件模块组合实现。具体地,本申请实施例中的方法实施例的各步骤可以通过处理器中的硬件的集成逻辑电路和/或软件形式的指令完成,结合本申请实施例公开的方法的步骤可以直接体现为硬件译码处理器执行完成,或者用译码处理器中的硬件及软件模块组合执行完成。可选地,软件模块可以位于随机存储器,闪存、只读存储器、可编程只读存储器、电可擦写可编程存储器、寄存器等本领域的成熟的存储介质中。该存储介质位于存储器,处理器读取存储器中的信息,结合其硬件完成上述模型训练方法实施例中的步骤。
图9为本申请实施例提供的一种语音到语音翻译装置900的示意图,如图9所示,该装置900包括:获取模块910和处理模块920,其中,获取模块910用于获取源语言语音特征;处理模块920用于将源语言语音特征输入至通过上述模型训练方法训练得到的语音到语音翻译模型,得到源语言语音特征对应的目标语言语音特征。
应理解的是,装置实施例与方法实施例可以相互对应,类似的描述可以参照方法实施例。为避免重复,此处不再赘述。具体地,图9所示的装置900可以执行上述语音到语音翻译方法实施例,并且装置900中的各个模块的前述和其它操作和/或功能分别为了实现上述语音到语音翻译方法中的相应流程,为了简洁,在此不再赘述。
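The device 900 of the embodiment of this application is described above from the perspective of functional modules in conjunction with the accompanying drawings. It should be understood that a functional module can be implemented in the form of hardware, through instructions in the form of software, or through a combination of hardware and software modules. Specifically, each step of the method embodiments in the embodiments of this application can be completed by integrated logic circuits of hardware in the processor and/or instructions in the form of software, and the steps of the methods disclosed in conjunction with the embodiments of this application can be directly embodied as being executed by a hardware decoding processor, or executed by a combination of hardware and software modules in the decoding processor. Optionally, the software module may be located in a storage medium that is mature in the field, such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or a register. The storage medium is located in the memory, and the processor reads the information in the memory and completes the steps in the above speech-to-speech translation method embodiment in combination with its hardware.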
图10是本申请实施例提供的电子设备1000的示意性框图。
如图10所示,该电子设备1000可包括:
存储器1010和处理器1020,该存储器1010用于存储计算机程序,并将该程序代码传输给该处理器1020。换言之,该处理器1020可以从存储器1010中调用并运行计算机程序,以实现本申请实施例中的方法。
例如,该处理器1020可用于根据该计算机程序中的指令执行上述方法实施例。
在本申请的一些实施例中,该处理器1020可以包括但不限于:
通用处理器、数字信号处理器(Digital Signal Processor,DSP)、专用集成电路(Application Specific Integrated Circuit,ASIC)、现场可编程门阵列(Field Programmable Gate Array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件等等。
在本申请的一些实施例中,该存储器1010包括但不限于:
易失性存储器和/或非易失性存储器。其中,非易失性存储器可以是只读存储器(Read-Only Memory,ROM)、可编程只读存储器(Programmable ROM,PROM)、可擦除可编程只读存储器(Erasable PROM,EPROM)、电可擦除可编程只读存储器(Electrically EPROM,EEPROM)或闪存。易失性存储器可以是随机存取存储器(Random Access Memory,RAM),其用作外部高速缓存。通过示例性但不是限制性说明,许多形式的RAM可用,例如静态随机存取存储器(Static RAM,SRAM)、动态随机存取存储器(Dynamic RAM,DRAM)、同步动态随机存取存储器(Synchronous DRAM,SDRAM)、双倍数据速率同步动态随机存取存储器(Double Data Rate SDRAM,DDR SDRAM)、增强型同步动态随机存取存储器(Enhanced SDRAM,ESDRAM)、同步连接动态随机存取存储器(synch link DRAM,SLDRAM)和直接内存总线随机存取存储器(Direct Rambus RAM,DR RAM)。
在本申请的一些实施例中,该计算机程序可以被分割成一个或多个模块,该一个或者多个模块被存储在该存储器1010中,并由该处理器1020执行,以完成本申请提供的方法。该一个或多个模块可以是能够完成特定功能的一系列计算机程序指令段,该指令段用于描述该计算机程序在该电子设备中的执行过程。
如图10所示,该电子设备还可包括:
收发器1030,该收发器1030可连接至该处理器1020或存储器1010。
其中,处理器1020可以控制该收发器1030与其他设备进行通信,具体地,可以向其他设备发送信息或数据,或接收其他设备发送的信息或数据。收发器1030可以包括发射机和接收机。收发器1030还可以进一步包括天线,天线的数量可以为一个或多个。
应当理解,该电子设备中的各个组件通过总线系统相连,其中,总线系统除包括数据总线之外,还包括电源总线、控制总线和状态信号总线。
本申请还提供了一种计算机存储介质,其上存储有计算机程序,该计算机程序被计算机执行时使得该计算机能够执行上述方法实施例的方法。或者说,本申请实施例还提供一种包含指令的计算机程序产品,该指令被计算机执行时使得计算机执行上述方法实施例的方法。
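When implemented in software, the implementation may take the form of a computer program product in whole or in part. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions according to the embodiments of this application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server or data center to another website, computer, server or data center by wire (such as coaxial cable, optical fiber, digital subscriber line (DSL)) or wirelessly (such as infrared, radio, microwave, etc.). The computer-readable storage medium may be any available medium that a computer can access, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (for example, a floppy disk, hard disk or magnetic tape), an optical medium (for example, a digital video disc (DVD)), or a semiconductor medium (for example, a solid state disk (SSD)).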
本领域普通技术人员可以意识到,结合本文中所公开的实施例描述的各示例的模块及算法步骤,能够以电子硬件、或者计算机软件和电子硬件的结合来实现。这些功能究竟以硬件还是软件方式来执行,取决于技术方案的特定应用和设计约束条件。专业技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能,但是这种实现不应认为超出本申请的范围。
在本申请所提供的几个实施例中,应该理解到,所揭露的系统、装置和方法,可以通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如,该模块的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个模块或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或模块的间接耦合或通信连接,可以是电性,机械或其它的形式。
作为分离部件说明的模块可以是或者也可以不是物理上分开的,作为模块显示的部件可以是或者也可以不是物理模块,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部模块来实现本实施例方案的目的。例如,在本申请各个实施例中的各功能模块可以集成在一个处理模块中,也可以是各个模块单独物理存在,也可以两个或两个以上模块集成在一个模块中。
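The above are only specific embodiments of this application, but the protection scope of this application is not limited thereto. Any change or replacement that a person skilled in the art could readily conceive within the technical scope disclosed in this application shall fall within the protection scope of this application. Therefore, the protection scope of this application shall be subject to the protection scope of the claims.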

Claims (15)

  1. 一种模型训练方法,其特征在于,包括:
    获取语音识别样本和真实的语音到语音翻译样本;
    根据所述语音识别样本生成伪标注的语音到语音翻译样本;
    根据所述伪标注的语音到语音翻译样本和所述真实的语音到语音翻译样本训练语音到语音翻译模型。
  2. 根据权利要求1所述的方法,其特征在于,所述根据所述伪标注的语音到语音翻译样本和所述真实的语音到语音翻译样本训练语音到语音翻译模型,包括:
    根据所述伪标注的语音到语音翻译样本预训练语音到语音翻译模型,并根据所述真实的语音到语音翻译样本微调预训练后的语音到语音翻译模型。
  3. 根据权利要求2所述的方法,其特征在于,所述根据所述真实的语音到语音翻译样本微调预训练后的语音到语音翻译模型,包括:
    通过所述真实的语音到语音翻译样本微调预训练后的语音到语音翻译模型;或者,
    根据所述真实的语音到语音翻译样本和所述伪标注的语音到语音翻译样本微调预训练后的语音到语音翻译模型。
  4. 根据权利要求3所述的方法,其特征在于,所述根据所述真实的语音到语音翻译样本和所述伪标注的语音到语音翻译样本微调预训练后的语音到语音翻译模型之前,还包括:
    对所述真实的语音到语音翻译样本标注第一标签,所述第一标签用于标识所述真实的语音到语音翻译样本为真实样本;
    对所述伪标注的语音到语音翻译样本标注第二标签,所述第二标签用于标识所述伪标注的语音到语音翻译样本为伪标注样本。
  5. 根据权利要求3或4所述的方法,其特征在于,所述根据所述真实的语音到语音翻译样本和所述伪标注的语音到语音翻译样本微调预训练后的语音到语音翻译模型,包括:
    对所述真实的语音到语音翻译样本进行上采样,得到上采样后的语音到语音翻译样本;
    通过所述上采样后的语音到语音翻译样本和所述伪标注的语音到语音翻译样本微调预训练后的语音到语音翻译模型。
  6. 根据权利要求1-4任一项所述的方法,其特征在于,所述真实的语音到语音翻译样本包括:第一源语言语音特征、第一源语言文本、第一目标语言语音特征和第一目标语言文本;所述语音识别样本包括:第二源语言语音特征和第二源语言文本。
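  7. The method according to claim 6, characterized in that generating the pseudo-labeled speech-to-speech translation samples from the speech recognition samples comprises:
    translating the second source-language text to obtain second target-language text;
    synthesizing speech from the second target-language text to obtain second target-language speech features;
    wherein the pseudo-labeled speech-to-speech translation samples comprise: the second source-language speech features, the second source-language text, the second target-language text, and the second target-language speech features.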
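  8. The method according to any one of claims 1 to 4, characterized in that the speech-to-speech translation model comprises: an encoder module, a first attention module, a first decoder module, N second attention modules, and N second decoder modules, where N is a positive integer and the N second attention modules are in one-to-one correspondence with the N second decoder modules;
    the encoder module is configured to obtain source-language speech features and to process the source-language speech features to obtain multiple groups of first hidden state representations corresponding to the source-language speech features;
    the first attention module is configured to obtain one group of first hidden state representations from the multiple groups of first hidden state representations, together with a first vector for each time step output by the first decoder, and to process that group of first hidden state representations and the first vector for each time step to obtain a first attention representation for each time step;
    the first decoder module is configured to obtain a second vector for each time step, process the second vector for each time step to obtain the first vector for each time step, output the first vector for each time step to the first attention module, obtain the first attention representation for each time step, and process the first attention representation for each time step to obtain target-language speech features corresponding to the source-language speech features;
    in the training phase of the speech-to-speech translation model, the second attention module is configured to obtain one group of first hidden state representations from the multiple groups of first hidden state representations, together with a third vector for each time step output by the second decoder corresponding to the second attention module, and to process that group of first hidden state representations and the third vector for each time step to obtain a second attention representation for each time step;
    the second decoder module corresponding to the second attention module is configured to obtain a fourth vector for each time step, process the fourth vector for each time step to obtain the third vector for each time step, output the third vector for each time step to the second attention module, obtain the second attention representation for each time step, and process the second attention representation for each time step to obtain an auxiliary representation corresponding to the source-language speech features.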
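  9. The method according to claim 8, characterized in that the encoder module comprises: a convolutional neural network submodule and a first Transformer module;
    the convolutional neural network submodule is configured to obtain the source-language speech features and to process the source-language speech features to obtain a second hidden state representation corresponding to the source-language speech features;
    the first Transformer module is configured to obtain the second hidden state representation and to process the second hidden state representation to obtain the multiple groups of first hidden state representations.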
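  10. The method according to claim 9, characterized in that the first decoder module comprises: a pre-processing network, a second Transformer module, and a post-processing network;
    the pre-processing network is configured to obtain the second vector for each time step, process the second vector for each time step to obtain the first vector for each time step, and output the first vector for each time step to the first attention module;
    the second Transformer module is configured to obtain the first attention representation for each time step and to process the first attention representation for each time step to obtain target-language speech features at each time step;
    the post-processing network is configured to process the target-language speech features at each time step to obtain the target-language speech features corresponding to the source-language speech features.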
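  11. A speech-to-speech translation method, characterized by comprising:
    obtaining source-language speech features;
    inputting the source-language speech features into a speech-to-speech translation model trained by the method of any one of claims 1 to 10, to obtain target-language speech features corresponding to the source-language speech features.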
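  12. A model training apparatus, characterized by comprising:
    an obtaining module configured to obtain speech recognition samples and real speech-to-speech translation samples;
    a generating module configured to generate pseudo-labeled speech-to-speech translation samples from the speech recognition samples;
    a training module configured to train a speech-to-speech translation model on the pseudo-labeled speech-to-speech translation samples and the real speech-to-speech translation samples.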
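  13. A speech-to-speech translation apparatus, characterized by comprising:
    an obtaining module configured to obtain source-language speech features;
    a processing module configured to input the source-language speech features into a speech-to-speech translation model trained by the method of any one of claims 1 to 10, to obtain target-language speech features corresponding to the source-language speech features.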
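  14. An electronic device, characterized by comprising:
    a processor and a memory, the memory being configured to store a computer program, and the processor being configured to invoke and run the computer program stored in the memory to perform the method of any one of claims 1 to 11.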
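  15. A computer-readable storage medium, characterized in that it is configured to store a computer program, the computer program causing a computer to perform the method of any one of claims 1 to 11.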
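
The data-preparation steps recited in claims 1, 4, 5 and 7 can be illustrated with a minimal Python sketch. It is not the applicant's implementation. The class names `ASRSample` and `S2STSample`, the functions `make_pseudo_samples` and `build_finetune_set`, the tag strings, and the `translate` and `synthesize` callables (stand-ins for a machine translation system and a speech synthesizer) are all hypothetical. Upsampling by whole-number repetition is likewise an assumption; the claims only say that the real samples are upsampled.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

# Hypothetical tag values; the claims only require two distinguishable labels.
REAL_TAG = "<real>"
PSEUDO_TAG = "<pseudo>"


@dataclass(frozen=True)
class ASRSample:
    """Speech recognition sample: source speech features plus transcript."""
    src_speech: Sequence[float]
    src_text: str


@dataclass(frozen=True)
class S2STSample:
    """Speech-to-speech translation sample with the four fields of claim 6."""
    src_speech: Sequence[float]
    src_text: str
    tgt_text: str
    tgt_speech: Sequence[float]
    tag: str  # REAL_TAG or PSEUDO_TAG (claim 4)


def make_pseudo_samples(
    asr_samples: Sequence[ASRSample],
    translate: Callable[[str], str],
    synthesize: Callable[[str], Sequence[float]],
) -> List[S2STSample]:
    """Turn ASR samples into pseudo-labeled S2ST samples (claim 7)."""
    pseudo = []
    for sample in asr_samples:
        tgt_text = translate(sample.src_text)   # source text -> target text
        tgt_speech = synthesize(tgt_text)       # target text -> target speech
        pseudo.append(
            S2STSample(sample.src_speech, sample.src_text,
                       tgt_text, tgt_speech, PSEUDO_TAG)
        )
    return pseudo


def build_finetune_set(
    real: Sequence[S2STSample],
    pseudo: Sequence[S2STSample],
    upsample_factor: int,
) -> List[S2STSample]:
    """Mix upsampled real samples with pseudo-labeled samples (claim 5).

    Assumption: upsampling repeats every real sample `upsample_factor` times.
    """
    if upsample_factor < 1:
        raise ValueError("upsample_factor must be at least 1")
    return list(real) * upsample_factor + list(pseudo)


if __name__ == "__main__":
    # Toy stand-ins for a translation system and a speech synthesizer.
    asr = [ASRSample([0.1, 0.2], "hello"), ASRSample([0.3], "world")]
    pseudo = make_pseudo_samples(asr, lambda t: t.upper(),
                                 lambda t: [float(len(t))])
    real = [S2STSample([0.5], "hi", "HI", [2.0], REAL_TAG)]
    mixed = build_finetune_set(real, pseudo, upsample_factor=3)
    print(len(mixed), [s.tag for s in mixed])
```

In a two-stage schedule like the one in claims 2 and 3, the output of `make_pseudo_samples` would serve as the pre-training set, and the output of `build_finetune_set` (or the real samples alone) as the fine-tuning set.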
PCT/CN2023/088492 2022-04-26 2023-04-14 Model training method, speech-to-speech translation method, apparatus, and medium Ceased WO2023207638A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP23795071.2A EP4517742A4 (en) 2022-04-26 2023-04-14 METHOD AND APPARATUS FOR MODEL TRAINING, METHOD AND APPARATUS FOR SPEECH-TO-SPEECH TRANSLATION, AND SUPPORT
US18/724,300 US20250061888A1 (en) 2022-04-26 2023-04-14 Model training method and apparatus, speech-to-speech translation method and apparatus, and medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210448585.8 2022-04-26
CN202210448585.8A CN114822499B (zh) 2022-04-26 2022-04-26 Model training method, speech-to-speech translation method, apparatus, and medium

Publications (1)

Publication Number Publication Date
WO2023207638A1 (zh) 2023-11-02

Family

ID=82508177

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/088492 Ceased WO2023207638A1 (zh) 2022-04-26 2023-04-14 Model training method, speech-to-speech translation method, apparatus, and medium

Country Status (4)

Country Link
US (1) US20250061888A1 (zh)
EP (1) EP4517742A4 (zh)
CN (1) CN114822499B (zh)
WO (1) WO2023207638A1 (zh)

Also Published As

Publication number Publication date
US20250061888A1 (en) 2025-02-20
EP4517742A4 (en) 2026-04-22
EP4517742A1 (en) 2025-03-05
CN114822499B (zh) 2024-11-01
CN114822499A (zh) 2022-07-29

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23795071

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 18724300

Country of ref document: US

WWE Wipo information: entry into national phase

Ref document number: 2023795071

Country of ref document: EP

NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 2023795071

Country of ref document: EP

Effective date: 20241126