WO2020200178A1 - Speech synthesis method, device and computer-readable storage medium - Google Patents
Speech synthesis method, device and computer-readable storage medium
- Publication number
- WO2020200178A1 (application PCT/CN2020/082172)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- parameter
- acoustic
- vocoder
- speech synthesis
- text
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/06—Elementary speech units used in speech synthesisers; Concatenation rules
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
- G10L13/086—Detection of language
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/263—Language identification
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
- G10L13/04—Details of speech synthesis systems, e.g. synthesiser structure or memory management
- G10L13/047—Architecture of speech synthesisers
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
- G10L13/10—Prosody rules derived from text; Stress or intonation
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
- G10L19/0018—Speech coding using phonetic or linguistical decoding of the source; Reconstruction using text-to-speech synthesis
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
- G10L19/02—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using spectral analysis, e.g. transform vocoders or subband vocoders
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
- G10L19/04—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
- G10L19/16—Vocoder architecture
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
Definitions
- the present disclosure relates to the field of computer technology, and in particular, to a speech synthesis method, device and computer-readable storage medium.
- A speech synthesis system performs text-to-speech (TTS) conversion: it converts text into sound through a series of algorithmic operations, so that a machine simulates human pronunciation.
- the current speech synthesis system generally only supports the pronunciation of a single language.
- a technical problem to be solved by the present disclosure is: how to implement an end-to-end speech synthesis system that supports pronunciation in multiple languages.
- A speech synthesis method including: dividing text into multiple segments belonging to different language types; converting each segment into corresponding phonemes according to the language type to which the segment belongs, to generate the phoneme sequence of the text; inputting the phoneme sequence into a pre-trained speech synthesis model and converting it into vocoder characteristic parameters; and inputting the vocoder characteristic parameters into a vocoder to generate speech.
- Dividing the text into multiple segments belonging to different language types includes: identifying the language type to which each character belongs according to the encoding of each character in the text; and grouping consecutive characters belonging to the same language type into one segment of that language type.
- Generating the phoneme sequence of the text includes: determining the prosodic structure of the text; and, according to the prosodic structure of the text, adding a prosody mark after the phoneme corresponding to each character in the text to form the phoneme sequence of the text.
- Inputting the phoneme sequence into a pre-trained speech synthesis model and converting it into vocoder feature parameters includes: inputting the phoneme sequence into an acoustic parameter prediction model in the speech synthesis model and converting it into acoustic feature parameters; and inputting the acoustic feature parameters into a vocoder parameter conversion model in the speech synthesis model to obtain the output vocoder feature parameters.
- The acoustic parameter prediction model includes an encoder, a decoder, and an attention model; inputting the phoneme sequence into the acoustic parameter prediction model in the speech synthesis model and converting it into acoustic feature parameters includes: using the attention model to determine the attention weight of each feature representation output by the encoder at the current moment; and judging whether the attention weight of the feature representation corresponding to a preset element in the phoneme sequence is the maximum among the attention weights, and if so, ending the decoding process.
- the acoustic characteristic parameters include speech frequency spectrum parameters;
- The vocoder parameter conversion model is composed of a multi-layer deep neural network and a long short-term memory (LSTM) network.
- up-sampling is performed by repeating the acoustic characteristic parameter so that the frequency of the acoustic characteristic parameter is equal to the frequency of the vocoder characteristic parameter.
- The method further includes training the speech synthesis model, wherein the training method includes: dividing the speech sample corresponding to each training text into frames according to a preset frequency, and extracting acoustic feature parameters for each frame, to generate a first acoustic feature parameter sample corresponding to each training text; training the acoustic parameter prediction model using each training text and its first acoustic feature parameter sample; converting each training text into a second acoustic feature parameter sample using the trained acoustic parameter prediction model; converting the speech sample corresponding to each training text into a vocoder feature parameter sample according to the synthesis frequency of the vocoder; and training the vocoder parameter conversion model using the second acoustic feature parameter sample and the vocoder feature parameter sample corresponding to each training text.
- The acoustic parameter prediction model includes an encoder, a decoder, and an attention model; inputting the phoneme sequence into the acoustic parameter prediction model in the speech synthesis model and converting it into acoustic feature parameters includes: inputting the phoneme sequence into the encoder to obtain the feature representation that the encoder outputs for each element of the phoneme sequence; inputting the feature representation of each element, the decoder hidden state output at the current moment by the first recurrent layer of the decoder, and the cumulative attention weight information of each element at the previous moment into the attention model to obtain a context vector; inputting the decoder hidden state output at the current moment by the first recurrent layer and the context vector into the second recurrent layer of the decoder to obtain the decoder hidden state output at the current moment by the second recurrent layer; and predicting the acoustic feature parameters according to the decoder hidden state output by the decoder at each moment.
- Converting each segment into corresponding phonemes according to the language type to which the segment belongs includes: normalizing each segment according to its language type; performing word segmentation on each normalized segment according to its language type; and converting the words of each segment into corresponding phonemes according to the preset phoneme conversion table of the segment's language type; wherein the phonemes include the tones corresponding to the characters.
- A speech synthesis device including: a language recognition module for dividing text into multiple segments belonging to different language types; a phoneme conversion module for converting each segment into corresponding phonemes according to the language type to which the segment belongs, to generate the phoneme sequence of the text; a parameter conversion module for inputting the phoneme sequence into a pre-trained speech synthesis model and converting it into vocoder characteristic parameters; and a speech generation module for inputting the vocoder characteristic parameters into a vocoder to generate speech.
- the language recognition module is used to identify the language type to which each character belongs according to the encoding of each character in the text; divide consecutive characters belonging to the same language type into a segment of the language type.
- the phoneme conversion module is used to determine the prosodic structure of the text; according to the prosodic structure of the text, a prosody mark is added after the phoneme corresponding to each character in the text to form a phoneme sequence of the text.
- The parameter conversion module is used to input the phoneme sequence into the acoustic parameter prediction model in the speech synthesis model and convert it into acoustic feature parameters, and to input the acoustic feature parameters into the vocoder parameter conversion model in the speech synthesis model to obtain the output vocoder characteristic parameters.
- The acoustic parameter prediction model includes an encoder, a decoder, and an attention model; the parameter conversion module is used to determine, using the attention model, the attention weight of each feature representation output by the encoder at the current moment, and to judge whether the attention weight of the feature representation corresponding to the preset element in the phoneme sequence is the maximum among the attention weights; if so, the decoding process is ended.
- the acoustic characteristic parameters include speech frequency spectrum parameters;
- The vocoder parameter conversion model is composed of a multi-layer deep neural network and a long short-term memory (LSTM) network.
- up-sampling is performed by repeating the acoustic characteristic parameter so that the frequency of the acoustic characteristic parameter is equal to the frequency of the vocoder characteristic parameter.
- The model training module is used to: divide the speech sample corresponding to each training text into frames according to a preset frequency and extract acoustic feature parameters for each frame, to generate a first acoustic feature parameter sample corresponding to each training text; train the acoustic parameter prediction model using each training text and its first acoustic feature parameter sample; convert each training text into a second acoustic feature parameter sample using the trained acoustic parameter prediction model; convert the speech sample corresponding to each training text into a vocoder feature parameter sample according to the synthesis frequency of the vocoder; and train the vocoder parameter conversion model using the second acoustic feature parameter sample and the vocoder feature parameter sample corresponding to each training text.
- The acoustic parameter prediction model includes an encoder, a decoder, and an attention model; the parameter conversion module is used to: input the phoneme sequence into the encoder to obtain the feature representation that the encoder outputs for each element of the phoneme sequence; input the feature representation of each element, the decoder hidden state output at the current moment by the first recurrent layer of the decoder, and the cumulative attention weight information of each element at the previous moment into the attention model to obtain a context vector; input the decoder hidden state output at the current moment by the first recurrent layer and the context vector into the second recurrent layer of the decoder to obtain the decoder hidden state output at the current moment by the second recurrent layer; and predict the acoustic feature parameters according to the decoder hidden state output by the decoder at each moment.
- the phoneme conversion module is used to normalize each segment according to the language type to which each segment belongs; and perform word segmentation on each normalized segment according to the language type to which each segment belongs; The word segmentation of each segment is converted into corresponding phonemes according to the preset phoneme conversion table corresponding to the language type to which each segment belongs; wherein the phonemes include the tones corresponding to the characters.
- A speech synthesis device including: a memory; and a processor coupled to the memory, the processor being configured to execute the speech synthesis method of any of the foregoing embodiments based on instructions stored in the memory.
- a computer-readable storage medium having a computer program stored thereon, wherein the program is executed by a processor to implement the speech synthesis method of any of the foregoing embodiments.
- the language category in the text is first identified, and the text is divided into multiple segments belonging to different language categories. According to the language type to which each segment belongs, each segment is converted into a corresponding phoneme.
- the phoneme sequence of the text is converted into the characteristic parameters of the vocoder by the input speech synthesis model, and the vocoder outputs speech according to the characteristic parameters of the vocoder.
- The solution of the present disclosure realizes an end-to-end speech synthesis system that supports pronunciation in multiple languages. Moreover, converting a phoneme sequence into vocoder characteristic parameters, rather than converting a character sequence directly, makes the synthesized speech more accurate, smooth and natural.
- FIG. 1 shows a schematic flowchart of a speech synthesis method according to some embodiments of the present disclosure.
- Fig. 2 shows a schematic structural diagram of a speech synthesis model of some embodiments of the present disclosure.
- Fig. 3 shows a schematic flowchart of a speech synthesis method according to other embodiments of the present disclosure.
- Fig. 4 shows a schematic structural diagram of a speech synthesis device according to some embodiments of the present disclosure.
- Fig. 5 shows a schematic structural diagram of a speech synthesis device according to other embodiments of the present disclosure.
- Fig. 6 shows a schematic structural diagram of a speech synthesis device according to still other embodiments of the present disclosure.
- the present disclosure proposes a speech synthesis method, which is described below in conjunction with FIG. 1.
- Figure 1 is a flowchart of some embodiments of the disclosed speech synthesis method. As shown in Fig. 1, the method of this embodiment includes: steps S102 to S108.
- In step S102, the text is divided into multiple segments belonging to different language types.
- the language type to which each character belongs is identified according to the encoding of each character in the text; consecutive characters belonging to the same language type are divided into a segment of the language type. For example, if the text contains Chinese and English characters, the Unicode code or other codes of the characters in the text can be obtained, and the Chinese characters and English characters in the text can be recognized according to the Unicode code, and the text can be divided into multiple fragments in different languages. If it contains characters in other languages (for example, Japanese, French, etc.), it can be recognized according to the corresponding encoding form.
- When a sentence contains only preset English characters, the sentence is marked as a Chinese sentence so that the preset English characters are subsequently normalized as Chinese. For example, a preset English character string such as 12km/h can be converted to its Chinese reading (12 kilometers per hour) during normalization, so that the synthesized speech uses the Chinese pronunciation, which better matches the habits of Chinese users.
- When a sentence contains only certain special international characters, the sentence can be marked as a preset language type according to the pronunciation requirement, which facilitates subsequent text normalization and speech synthesis processing.
- The above step (7) may include the following steps. (i) Determine whether the language type of the current character is the same as that of the previous character; if so, execute (ii), otherwise execute (iv). (ii) Move the current character into the current segment set. (iii) Determine whether the end of the sentence has been reached; if so, execute (iv), otherwise execute (v). (iv) Mark the language type of the characters in the current segment set and move them out of the current segment set. (v) Take the next character as the current character and return to (i).
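The character-by-character grouping described above can be illustrated with a minimal sketch. The function names, the language labels "zh", "en" and "other", the code-point ranges, and the rule that neutral characters (digits, punctuation) join the currently open segment are all assumptions made for this illustration; the text does not specify them.

```python
from typing import List, Optional, Tuple


def char_language(ch: str) -> str:
    """Classify one character by its Unicode code point (coarse, illustrative ranges)."""
    cp = ord(ch)
    if 0x4E00 <= cp <= 0x9FFF:            # CJK Unified Ideographs block
        return "zh"
    if ch.isascii() and ch.isalpha():     # basic Latin letters
        return "en"
    return "other"                        # digits, punctuation, spaces, ...


def split_by_language(text: str) -> List[Tuple[str, str]]:
    """Group consecutive characters of the same language into (language, segment) pairs."""
    segments: List[Tuple[str, str]] = []
    current_lang: Optional[str] = None
    current_chars: List[str] = []
    for ch in text:
        lang = char_language(ch)
        # Assumption: neutral characters join the segment that is currently open.
        if lang == "other" and current_lang is not None:
            lang = current_lang
        # A change of language closes the current segment and labels it.
        if lang != current_lang and current_chars:
            segments.append((current_lang, "".join(current_chars)))
            current_chars = []
        current_lang = lang
        current_chars.append(ch)
    # The end of the text closes the last open segment.
    if current_chars:
        segments.append((current_lang, "".join(current_chars)))
    return segments
```

For a mixed sentence such as "我喜欢music和电影", this sketch yields one Chinese segment, one English segment, and another Chinese segment. A production system would need broader ranges and sentence-level rules such as the ones for preset English characters described earlier.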
- In step S104, each segment is converted into corresponding phonemes according to the language type to which the segment belongs, and the phoneme sequence of the text is generated.
- Each segment is normalized according to its language type; each normalized segment undergoes word segmentation according to its language type; and the words of each segment are converted into corresponding phonemes according to the preset phoneme conversion table of the language type to which the segment belongs.
- Text usually contains many non-standard forms such as abbreviations and numbers, for example 12km/s or 2019. A normalization operation must convert such non-standard text into standardized text suitable for speech synthesis. Segments belonging to different languages are normalized separately: using the special-character comparison table of each language, non-standard characters are converted into standardized characters, for example 12km/s into twelve kilometers per second, to facilitate subsequent phoneme conversion.
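As a rough illustration of table-driven normalization for an English segment, the sketch below expands a number followed by a unit. The table contents, the function names, and the restriction to numbers from 0 to 99 are assumptions chosen to keep the example short; the text does not describe the actual comparison tables.

```python
import re

# Hypothetical per-language table of special tokens; a real table would be far larger.
UNIT_TABLE = {
    "en": {"km/s": "kilometers per second", "km/h": "kilometers per hour"},
}

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine",
        "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen", "sixteen",
        "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety"]


def number_to_words(n: int) -> str:
    """Spell out an integer in the range 0..99 (enough for this illustration)."""
    if not 0 <= n <= 99:
        raise ValueError("only 0..99 is supported in this sketch")
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]}-{ONES[ones]}"


def normalize_en(segment: str) -> str:
    """Expand '<number><unit>' tokens using the English table; leave other text unchanged."""
    units = UNIT_TABLE["en"]
    # Longest units first so that a longer unit is never shadowed by a shorter one.
    unit_pattern = "|".join(re.escape(u) for u in sorted(units, key=len, reverse=True))
    # The lookbehind keeps the pattern from matching the tail of a longer number.
    pattern = re.compile(rf"(?<!\d)(\d{{1,2}})\s*({unit_pattern})")

    def expand(match):
        return f"{number_to_words(int(match.group(1)))} {units[match.group(2)]}"

    return pattern.sub(expand, segment)
```

Each language would have its own table and its own number-reading rules, which is why the segments are normalized separately after language identification.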
- Each word obtained by word segmentation can then be converted into its corresponding phonemes (grapheme-to-phoneme conversion, G2P).
- the preset phoneme conversion table may include phoneme correspondences of polyphones, so as to perform accurate phoneme conversion for polyphones. It is also possible to recognize polyphonic characters in other ways, or perform phoneme conversion through other existing technologies, which are not limited to the examples given.
- the phoneme may include the tones corresponding to the characters, and using the tones as part of the phonemes can make the synthesized speech more accurate and natural.
- Some languages, such as English, have no tones, so no tone mark needs to be added to the phoneme sequence for them.
- the text can also be divided into prosodic structures, for example, to identify prosodic words and prosodic phrases in the text. According to the prosodic structure of the text, a prosodic mark is added after the phonemes corresponding to each character in the text to form the phoneme sequence of the text.
- the prosody mark can be a special mark added after the phoneme corresponding to the prosodic word or the prosodic phrase to indicate a pause.
- the prediction of the prosodic structure can adopt the existing technology, which will not be repeated here.
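To make the structure of the resulting phoneme sequence concrete, here is a small sketch that looks words up in per-language phoneme tables, appends a prosody mark after a word when one is given, and closes the sequence with an end symbol. The table entries, the phoneme symbols, the tone digits, and the marks "#1" and "#2" are hypothetical; only the ideas of tones as part of phonemes, prosody marks after phonemes, and an <EOS> end symbol come from the text.

```python
from typing import Dict, List, Tuple

# Hypothetical phoneme tables; entries and phoneme symbols are illustrative only.
PHONEME_TABLES: Dict[str, Dict[str, List[str]]] = {
    # Chinese: the tone digit is carried as part of the phoneme.
    "zh": {"你好": ["n", "i3", "h", "ao3"]},
    # English: no tone marks are needed.
    "en": {"hello": ["HH", "AH", "L", "OW"]},
}

# Symbol marking the end of the input phoneme sequence.
EOS = "<EOS>"


def build_phoneme_sequence(words: List[Tuple[str, str, str]]) -> List[str]:
    """Build a phoneme sequence from (language, word, prosody_mark) triples.

    prosody_mark is a hypothetical label such as "#1" (prosodic word) or
    "#2" (prosodic phrase); an empty string means no boundary after the word.
    """
    sequence: List[str] = []
    for lang, word, mark in words:
        key = word.lower() if lang == "en" else word
        sequence.extend(PHONEME_TABLES[lang][key])
        if mark:
            sequence.append(mark)
    sequence.append(EOS)
    return sequence
```

Calling `build_phoneme_sequence([("zh", "你好", "#1"), ("en", "Hello", "#2")])` produces the Chinese phonemes with tones, a prosody mark, the English phonemes, a second prosody mark, and the end symbol.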
- In step S106, the phoneme sequence is input into a pre-trained speech synthesis model and converted into vocoder characteristic parameters.
- The phoneme sequence of the text may include the phoneme (including tone) and prosody mark corresponding to each character, and may also include special symbols, such as the symbol <EOS> that indicates the end of the input phoneme sequence.
- the training process of the speech synthesis model will be described later.
- the speech synthesis model may include an acoustic parameter prediction model and a vocoder parameter conversion model.
- the acoustic parameters include, for example, speech spectral parameters, such as Mel spectral parameters or linear spectral parameters.
- The vocoder parameters depend on the vocoder actually used. For example, with the WORLD vocoder, the vocoder parameters can include the fundamental frequency (F0), Mel-generalized cepstral coefficients (MGC), band aperiodicity (BAP), etc.
- Inputting the phoneme sequence into the acoustic parameter prediction model in the speech synthesis model converts it into acoustic feature parameters; inputting the acoustic feature parameters into the vocoder parameter conversion model in the speech synthesis model yields the output vocoder feature parameters.
- the acoustic feature parameter prediction model adopts the Encoder-Decoder network structure, including: encoder, decoder and attention (Attention) model.
- the length of the input phoneme sequence and the output acoustic feature parameter sequence may not match, and usually the acoustic feature parameter sequence will be relatively long.
- the neural network structure based on Encoder-Decoder can perform flexible feature prediction, which conforms to the characteristics of speech synthesis.
- The encoder can include three one-dimensional convolution layers and a bidirectional LSTM (Long Short-Term Memory).
- The three one-dimensional convolution layers learn the local context of each phoneme, and the bidirectional LSTM computes bidirectional global information for each phoneme.
- Through the three convolution layers and the bidirectional LSTM, the encoder obtains an expressive, context-aware feature representation of the input phonemes.
- the decoder includes, for example, two fully connected layers and two LSTMs.
- the two fully connected layers can use Dropout technology to prevent the occurrence of neural network over-fitting.
- The attention model lets the decoder learn, during decoding, which input phonemes need attention at the current decoding moment. Through the attention mechanism, the decoder can also learn which input phonemes have already completed parameter prediction and which phonemes need special attention at the current moment.
- By combining the encoder context vector obtained from the attention model during decoding, the decoder can better predict the acoustic parameters needed at the current moment and better decide whether to end the decoding process.
- the following steps can be performed in the acoustic feature parameter prediction model.
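- The phoneme sequence is input to the encoder, which outputs a feature representation for each element of the phoneme sequence.
- The feature representation of each element, the decoder hidden state output at the current moment by the first recurrent layer of the decoder (such as the first LSTM), and the cumulative attention weight information of each element at the previous moment are input into the attention model to obtain the context vector.
- The decoder hidden state output by the first recurrent layer at the current moment and the context vector are input into the second recurrent layer of the decoder to obtain the decoder hidden state output by the second recurrent layer at the current moment.
- The acoustic feature parameters are predicted from the decoder hidden state at each moment; for example, the decoder hidden state sequence is linearly transformed to obtain the acoustic feature parameters.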
- j represents the position of each element in the input phoneme sequence
- M represents the total number of elements in the phoneme sequence.
- the prosody mark in the phoneme sequence will also be converted into the corresponding hidden state, and then into the decoder hidden state.
- the context vector can be calculated using the following formula.
- i represents the time step of the decoder, and j represents the position of the element in the phoneme sequence corresponding to the encoder; i and j are positive integers.
- v, W, V, U and b are parameters learned during model training.
- s_i represents the decoder hidden state output at the current i-th moment by the first recurrent layer (such as the first LSTM) in the decoder.
- h_j represents the feature representation corresponding to the j-th element.
- f_{i,j} are the vectors in f_i, and F is a convolution kernel with a preset length.
- α_{i-1} is the cumulative attention weight information (alignments) corresponding to each element at the (i-1)-th moment.
- e_{i,j} are numerical values, and e_i is the vector formed by the values corresponding to each element.
- β_i is a vector, and β_{i,j} represents a value in β_i.
- c_i represents the context vector at the i-th moment.
- M represents the total number of elements in the phoneme sequence.
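The formula itself does not appear in the text above. The symbol definitions are consistent with a location-sensitive additive attention, so, under that assumption, the computation might be written as follows. Writing the cumulative attention weights as α and the normalized attention weights as β is an assumption about the symbols, as are the softmax normalization and the final cumulative update, none of which is stated explicitly in the text.

```latex
\begin{aligned}
f_i &= F * \alpha_{i-1} \\
e_{i,j} &= v^{\top} \tanh\left(W s_i + V h_j + U f_{i,j} + b\right) \\
\beta_{i,j} &= \frac{\exp(e_{i,j})}{\sum_{k=1}^{M} \exp(e_{i,k})} \\
c_i &= \sum_{j=1}^{M} \beta_{i,j}\, h_j \\
\alpha_i &= \alpha_{i-1} + \beta_i
\end{aligned}
```

Here the convolution of the kernel F with the cumulative weights gives the location features f_i, the score e_{i,j} combines the decoder state s_i, the encoder feature h_j and the location feature f_{i,j}, and the context vector c_i is the weighted sum of the encoder features.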
- The attention model is used to determine the attention weight of each feature representation output by the encoder at the current moment, and it is judged whether the attention weight of the feature representation corresponding to the preset element in the phoneme sequence is the maximum among the attention weights (that is, the attention weights corresponding to all elements of the input phoneme sequence). If so, the decoding process ends.
- The attention weight of each feature representation is generated by the attention model.
- The preset element is the final <EOS> symbol of the phoneme sequence.
- This way of deciding whether to stop decoding lets the decoder stop according to actual needs: the learned alignment information is used to judge whether the decoding process should end. If the attention model has already shifted attention to the last symbol during decoding but the model has not correctly predicted the end of decoding, the system can force decoding to end according to the alignment information.
- This auxiliary end-of-decoding algorithm addresses the case in which the model fails to predict the end of decoding or predicts it incorrectly. It prevents the acoustic parameter prediction model from continuing to predict acoustic features for further frames and ultimately synthesizing unintelligible speech, which improves the accuracy, fluency and naturalness of the system's speech output.
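The stopping rule described above can be sketched in a few lines. The function names, the representation of one decoder step as a (frame, attention_weights, stop_predicted) triple, and the choice to discard the frame of the stopping step are assumptions for illustration; the text only specifies the comparison of the <EOS> attention weight against the maximum.

```python
from typing import Iterable, List, Sequence, Tuple


def should_stop_decoding(attention_weights: Sequence[float], eos_index: int) -> bool:
    """Return True when the preset element (e.g. <EOS>) holds the largest attention weight."""
    if not attention_weights:
        raise ValueError("attention_weights must not be empty")
    return attention_weights[eos_index] >= max(attention_weights)


def decode(steps: Iterable[Tuple[object, Sequence[float], bool]], eos_index: int) -> List[object]:
    """Collect frames until the model predicts a stop or attention has moved onto <EOS>.

    `steps` is a hypothetical stand-in for the decoder loop: each item is a
    (frame, attention_weights, stop_predicted) triple for one decoding step.
    """
    frames: List[object] = []
    for frame, weights, stop_predicted in steps:
        # Forced stop: attention already sits on <EOS> even if no stop was predicted.
        if stop_predicted or should_stop_decoding(weights, eos_index):
            break
        frames.append(frame)
    return frames
```

In this sketch the attention check acts as a safety net next to the model's own stop prediction, so decoding ends as soon as either signal fires.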
- The acoustic feature parameters (for example, Mel spectrum parameters) are input into the vocoder parameter conversion model and converted into vocoder feature parameters, after which the vocoder can perform speech synthesis.
- The vocoder parameter conversion model can adopt a DNN-LSTM (Deep Neural Network, Long Short-Term Memory network) structure.
- The network structure can include multiple deep neural network layers and a long short-term memory network.
- For example, the network structure includes two ReLU (activation function) connection layers and one LSTM layer.
- The acoustic feature parameters are first input into the DNN (for example, the ReLU layers), which learns a nonlinear transformation of the acoustic features and an internal feature representation; this is equivalent to a feature-learning process.
- The features output by the DNN are then input into the LSTM, which learns the historical dependencies of the acoustic feature parameters so as to obtain a smoother feature conversion.
- The inventor found through testing that the vocoder parameter conversion works better when the network structure includes two ReLU connection layers and one LSTM layer.
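- The two-ReLU-plus-one-LSTM structure described above can be sketched in plain Python as follows. This is only an illustration of the data flow, not the trained model of the disclosure: the random weights, the hidden size, the output dimension, and the final linear projection `proj` are all assumptions introduced here, and a practical implementation would use a deep learning framework.

```python
import math
import random


def relu_layer(x, weights, bias):
    """Fully connected layer followed by ReLU: max(0, W x + b)."""
    return [max(0.0, sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(weights, bias)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h, c, params):
    """One standard LSTM step; params maps a gate name to (W, b) acting on [x; h]."""
    xh = x + h  # concatenate input and previous hidden state

    def gate(name, activation):
        W, b = params[name]
        return [activation(sum(w * v for w, v in zip(row, xh)) + bi)
                for row, bi in zip(W, b)]

    i = gate("input", sigmoid)
    f = gate("forget", sigmoid)
    o = gate("output", sigmoid)
    g = gate("cell", math.tanh)
    c_new = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
    h_new = [ok * math.tanh(ck) for ok, ck in zip(o, c_new)]
    return h_new, c_new


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def build_model(in_dim, hidden_dim, out_dim, seed=0):
    """Create untrained, randomly initialised parameters (illustrative only)."""
    rng = random.Random(seed)
    return {
        "relu1": (random_matrix(hidden_dim, in_dim, rng), [0.0] * hidden_dim),
        "relu2": (random_matrix(hidden_dim, hidden_dim, rng), [0.0] * hidden_dim),
        "lstm": {name: (random_matrix(hidden_dim, 2 * hidden_dim, rng),
                        [0.0] * hidden_dim)
                 for name in ("input", "forget", "output", "cell")},
        # Assumed linear output layer mapping the LSTM state to vocoder features.
        "proj": (random_matrix(out_dim, hidden_dim, rng), [0.0] * out_dim),
    }


def convert(frames, model):
    """Map a sequence of acoustic feature frames to vocoder feature frames."""
    hidden_dim = len(model["relu1"][1])
    h, c = [0.0] * hidden_dim, [0.0] * hidden_dim
    outputs = []
    for x in frames:
        x = relu_layer(x, *model["relu1"])   # first ReLU layer
        x = relu_layer(x, *model["relu2"])   # second ReLU layer
        h, c = lstm_step(x, h, c, model["lstm"])  # LSTM carries history across frames
        W, b = model["proj"]
        outputs.append([sum(w * v for w, v in zip(row, h)) + bi
                        for row, bi in zip(W, b)])
    return outputs
```

- For instance, `convert([[0.5] * 80 for _ in range(4)], build_model(80, 8, 5))` maps four 80-dimensional Mel frames to four 5-dimensional output frames; the sizes 8 and 5 are placeholders chosen for the example.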
- When the frequency of the acoustic feature parameters is lower than the frequency of the vocoder feature parameters, up-sampling is performed by repeating the acoustic feature parameters so that the frequency of the acoustic feature parameters is equal to the frequency of the vocoder feature parameters.
- The acoustic parameter prediction model predicts parameters with 15 ms as one frame, whereas the vocoder usually synthesizes speech with 5 ms as one frame, so the two models are mismatched in frame frequency. To resolve this mismatch, the output of the acoustic parameter prediction model needs to be up-sampled to match the frequency of the vocoder model.
- Up-sampling can be performed by repeating the output of the acoustic parameter prediction model. For example, repeating the acoustic feature parameters three times turns a 1*80-dimensional Mel spectrum parameter into a 3*80-dimensional Mel spectrum parameter.
- The inventor has determined through testing that directly repeating features achieves good results compared with learning an up-sampling neural network or up-sampling by interpolation.
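- A minimal sketch of up-sampling by repetition, using the 15 ms and 5 ms frame lengths given above, might look like this. The function name and the list-of-lists frame representation are assumptions for the example.

```python
def upsample_by_repetition(frames, source_frame_ms=15, target_frame_ms=5):
    """Repeat each frame so that the frame rate matches the vocoder's.

    With 15 ms source frames and 5 ms target frames, every frame is
    repeated 15 // 5 = 3 times.
    """
    if source_frame_ms % target_frame_ms != 0:
        raise ValueError("source frame length must be a multiple of the target")
    factor = source_frame_ms // target_frame_ms
    return [list(frame) for frame in frames for _ in range(factor)]


mel = [[float(i)] * 80 for i in range(2)]   # two 80-dimensional Mel frames
upsampled = upsample_by_repetition(mel)
print(len(upsampled), len(upsampled[0]))    # 6 80
```

- Each 1*80 frame becomes 3*80 values, so two input frames yield six output frames.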
- In step S108, the vocoder feature parameters are input into the vocoder to generate speech.
- The vocoder parameter conversion model in the above embodiment can be combined with the WORLD vocoder.
- A simple network architecture speeds up computation and enables real-time speech generation, which reduces duplication and improves the speech synthesis effect.
- The language types in the text are first identified, and the text is divided into multiple segments belonging to different language types. According to the language type to which each segment belongs, each segment is converted into corresponding phonemes.
- The phoneme sequence of the text is input into the speech synthesis model and converted into vocoder feature parameters, and the vocoder outputs speech according to the vocoder feature parameters.
- The solution of the foregoing embodiment realizes an end-to-end speech synthesis system that supports pronunciation in multiple languages. Converting to vocoder feature parameters from the phoneme sequence, rather than directly from the character sequence, makes the synthesized speech more accurate, fluent, and natural. Further, by adding prosodic structure, tone, and the like, the speech synthesis effect can be further improved.
- With the new vocoder feature parameter conversion model, computation is accelerated to achieve real-time speech generation, which reduces duplication and further improves the speech synthesis effect.
- The above embodiment also proposes a decoder termination method, which addresses the problem that the model fails to predict the end of the decoding process or predicts it incorrectly, prevents the acoustic parameter prediction model from finally synthesizing unintelligible speech, and further improves the accuracy, fluency, and naturalness of the system's speech output.
- The method of training the speech synthesis model includes: converting the speech sample corresponding to each training text into a vocoder feature parameter sample according to the synthesis frequency of the vocoder; inputting each training text into the speech synthesis model to be trained to obtain the output vocoder feature parameters; and comparing the output vocoder feature parameters with the corresponding vocoder feature parameter samples and adjusting the parameters of the speech synthesis model to be trained according to the comparison results until training is completed.
- FIG. 3 is a flowchart of other embodiments of the speech synthesis method of the present disclosure. As shown in FIG. 3, the method of this embodiment includes: steps S302 to S310.
- In step S302, the speech samples corresponding to each training text are divided into frames according to a preset frequency, acoustic feature parameters are extracted for each frame, and the first acoustic feature parameter samples corresponding to each training text are generated.
- For example, each speech sample may be divided into frames of 15 ms, and the acoustic feature parameters of each frame may be extracted to generate the first acoustic feature parameter samples (for example, Mel spectrum parameters).
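- To make the frame arithmetic concrete, the sketch below splits a waveform into consecutive frames of a given duration. The 16 kHz sample rate, the non-overlapping framing, and the dropping of a trailing partial frame are simplifying assumptions for this example; the source only specifies the 15 ms frame length here (and 5 ms for the vocoder), and real feature extraction typically uses overlapping analysis windows.

```python
def split_into_frames(samples, sample_rate_hz, frame_ms):
    """Split a sequence of samples into consecutive, non-overlapping frames."""
    frame_len = sample_rate_hz * frame_ms // 1000  # samples per frame
    if frame_len <= 0:
        raise ValueError("frame length must be at least one sample")
    n_frames = len(samples) // frame_len  # a trailing partial frame is dropped
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]


one_second = [0.0] * 16000  # one second of silence at an assumed 16 kHz
acoustic_frames = split_into_frames(one_second, 16000, frame_ms=15)
vocoder_frames = split_into_frames(one_second, 16000, frame_ms=5)
print(len(acoustic_frames), len(vocoder_frames))  # 66 200
```

- The roughly threefold difference in frame count is the mismatch that the repetition-based up-sampling is meant to bridge.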
- In step S304, each training text and the first acoustic feature parameter sample corresponding to each training text are used to train the acoustic parameter prediction model.
- The training text can be divided into segments belonging to different language types, each segment can be converted into corresponding phonemes according to the language type to which it belongs, and a phoneme sequence of the training text can be generated.
- The phoneme sequence can include tones, prosody marks, and so on.
- The phoneme sequence of each training text is input into the acoustic parameter prediction model, and the output acoustic feature parameters corresponding to each training text are obtained.
- The output acoustic feature parameters are compared with the first acoustic feature parameter sample of the same training text, and the parameters of the acoustic parameter prediction model are adjusted according to the comparison results until a first preset target is met, which completes the training of the acoustic parameter prediction model.
- In step S306, the trained acoustic parameter prediction model is used to convert each training text into a second acoustic feature parameter sample.
- Each training text is input into the trained acoustic parameter prediction model to obtain the second acoustic feature parameter sample corresponding to that training text.
- In step S308, according to the synthesis frequency of the vocoder, the speech samples corresponding to each training text are converted into vocoder feature parameter samples.
- For example, each speech sample can be divided into frames of 5 ms, and each frame can be converted into a vocoder feature parameter sample (for example, MGC, BAP, log F0).
- In step S310, the second acoustic feature parameter sample and the vocoder feature parameter sample corresponding to each training text are used to train the vocoder parameter conversion model.
- Each second acoustic feature parameter sample is input into the vocoder parameter conversion model to obtain the output vocoder feature parameters.
- The output vocoder feature parameters are compared with the corresponding vocoder feature parameter samples, and the parameters of the vocoder parameter conversion model are adjusted according to the comparison results until a second preset target is met, which completes the training of the vocoder parameter conversion model.
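- The order of steps S302 to S310 can be summarized in the following sketch. Everything here is a hypothetical stand-in: `extract_acoustic`, `extract_vocoder`, and the `fit`/`predict` interface are assumed for illustration, and `RecordingModel` only records its training pairs instead of learning anything. The point is the data flow, in particular that the conversion model is fitted on features predicted by the trained acoustic model rather than on features extracted from the recordings.

```python
class RecordingModel:
    """Toy stand-in for a trainable model; it only records its training pairs."""

    def __init__(self):
        self.training_pairs = []

    def fit(self, inputs, targets):
        self.training_pairs = list(zip(inputs, targets))

    def predict(self, x):
        return "predicted(" + x + ")"


def train_speech_synthesis_model(corpus, extract_acoustic, extract_vocoder,
                                 acoustic_model, conversion_model):
    """corpus is a list of (training_text, speech_sample) pairs."""
    texts = [text for text, _ in corpus]
    # S302: first acoustic feature parameter samples, extracted from the speech.
    first_samples = [extract_acoustic(speech) for _, speech in corpus]
    # S304: train the acoustic parameter prediction model.
    acoustic_model.fit(texts, first_samples)
    # S306: second acoustic feature parameter samples, predicted by the trained model.
    second_samples = [acoustic_model.predict(text) for text in texts]
    # S308: vocoder feature parameter samples at the vocoder's synthesis frequency.
    vocoder_samples = [extract_vocoder(speech) for _, speech in corpus]
    # S310: train the vocoder parameter conversion model on the predicted features.
    conversion_model.fit(second_samples, vocoder_samples)
    return acoustic_model, conversion_model


acoustic, conversion = train_speech_synthesis_model(
    [("hi", "wav1")],
    extract_acoustic=lambda s: "mel(" + s + ")",
    extract_vocoder=lambda s: "voc(" + s + ")",
    acoustic_model=RecordingModel(),
    conversion_model=RecordingModel(),
)
print(conversion.training_pairs)  # [('predicted(hi)', 'voc(wav1)')]
```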
- The method of the foregoing embodiment uses the acoustic feature parameters predicted by the acoustic parameter prediction model as training data for the vocoder parameter conversion model, which can improve the accuracy of the vocoder parameter conversion model and make the synthesized speech more accurate, fluent, and natural.
- If the vocoder parameter conversion model were trained using real acoustic feature parameters (for example, Mel spectrum parameters) extracted directly from the speech files, the input features of the model during actual speech synthesis would not match its training features. This is because, in the actual speech synthesis process, the input features are the Mel spectrum predicted by the acoustic parameter prediction model,
- whereas the training process of the conversion model would use the real acoustic feature parameters of the speech files.
- A model trained in that way has not learned from the predicted acoustic feature parameters, which accumulate errors during the decoding process. The mismatch between the input features and the training features would therefore seriously degrade the performance of the vocoder parameter conversion model.
- the present disclosure also provides a speech synthesis device, which is described below with reference to FIG. 4.
- FIG. 4 is a structural diagram of some embodiments of the speech synthesis device of the present disclosure.
- the device 40 of this embodiment includes: a language recognition module 402, a phoneme conversion module 404, a parameter conversion module 406, and a speech generation module 408.
- the language recognition module 402 divides the text into multiple segments belonging to different language types.
- The language recognition module 402 is configured to recognize the language type to which each character belongs according to the encoding of each character in the text, and to divide consecutive characters belonging to the same language type into one segment of that language type.
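- One possible way to segment text by character encoding is sketched below. The concrete choices are assumptions made for this example, not details from the disclosure: the CJK Unified Ideographs range U+4E00 to U+9FFF is treated as Chinese (`"zh"`), ASCII letters as English (`"en"`), and digits, spaces, and punctuation are treated as neutral and attached to the neighbouring segment.

```python
def char_language(ch):
    """Classify one character by its Unicode code point (assumed ranges)."""
    if "\u4e00" <= ch <= "\u9fff":
        return "zh"
    if ch.isascii() and ch.isalpha():
        return "en"
    return None  # neutral: digits, spaces, punctuation, other scripts


def split_by_language(text):
    """Group consecutive characters of the same language into segments."""
    segments = []  # each entry is [language, substring]
    for ch in text:
        lang = char_language(ch)
        if not segments:
            segments.append([lang, ch])
        elif lang is None or lang == segments[-1][0]:
            segments[-1][1] += ch          # extend the current segment
        elif segments[-1][0] is None:
            segments[-1][0] = lang         # leading neutral text adopts this language
            segments[-1][1] += ch
        else:
            segments.append([lang, ch])    # language changed: start a new segment
    return [(lang, s) for lang, s in segments]


print(split_by_language("我喜欢Python编程"))
# [('zh', '我喜欢'), ('en', 'Python'), ('zh', '编程')]
```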
- the phoneme conversion module 404 is configured to convert each segment into a corresponding phoneme according to the language type to which each segment belongs to generate a phoneme sequence of the text.
- The phoneme conversion module 404 is configured to determine the prosodic structure of the text and, according to the prosodic structure of the text, add a prosody mark after the phoneme corresponding to each character in the text to form the phoneme sequence of the text.
- The phoneme conversion module 404 is configured to perform text normalization on each segment according to the language type to which the segment belongs; perform word segmentation on each normalized segment according to the language type to which the segment belongs; and convert the segmented words of each segment into corresponding phonemes according to the preset phoneme conversion table corresponding to the language type to which the segment belongs, wherein the phonemes include the tones corresponding to the characters.
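- The normalize, segment, and look-up sequence described above can be illustrated with a toy example. The tables below are hypothetical and contain only three entries; the pinyin-with-tone-digit and ARPAbet-style symbols, the whitespace and per-character tokenizers, and the trivial normalization are all assumptions for this sketch. A real system would expand numbers and abbreviations during normalization and use proper word segmentation.

```python
# Hypothetical, tiny phoneme conversion tables (illustrative entries only).
PHONEME_TABLES = {
    "zh": {"你": ["n", "i3"], "好": ["h", "ao3"]},   # tone carried as a digit
    "en": {"hello": ["HH", "AH0", "L", "OW1"]},
}


def normalize(segment, lang):
    """Placeholder text normalization: trim, and lowercase English."""
    segment = segment.strip()
    return segment.lower() if lang == "en" else segment


def tokenize(segment, lang):
    """Placeholder word segmentation: whitespace for English, per character otherwise."""
    return segment.split() if lang == "en" else list(segment)


def segment_to_phonemes(segment, lang):
    """Convert one single-language segment using that language's table."""
    table = PHONEME_TABLES[lang]
    phonemes = []
    for token in tokenize(normalize(segment, lang), lang):
        phonemes.extend(table[token])  # raises KeyError for unknown tokens
    return phonemes


def text_to_phonemes(segments):
    """Concatenate the phonemes of (language, segment) pairs in order."""
    sequence = []
    for lang, segment in segments:
        sequence.extend(segment_to_phonemes(segment, lang))
    return sequence


print(text_to_phonemes([("zh", "你好"), ("en", " Hello ")]))
# ['n', 'i3', 'h', 'ao3', 'HH', 'AH0', 'L', 'OW1']
```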
- The parameter conversion module 406 is configured to input the phoneme sequence into the pre-trained speech synthesis model and convert it into vocoder feature parameters.
- The parameter conversion module 406 is configured to input the phoneme sequence into the acoustic parameter prediction model in the speech synthesis model and convert it into acoustic feature parameters, and to input the acoustic feature parameters into the vocoder parameter conversion model in the speech synthesis model to obtain the output vocoder feature parameters.
- The acoustic parameter prediction model includes an encoder, a decoder, and an attention model. The parameter conversion module 406 is configured to use the attention model to determine the attention weight of each feature representation output by the encoder at the current moment, and to determine whether the attention weight of the feature representation corresponding to the preset element in the phoneme sequence is the maximum among the attention weights; if it is, the decoding process ends.
- The acoustic feature parameters include speech spectrum parameters.
- The vocoder parameter conversion model is composed of a multi-layer deep neural network and a long short-term memory network.
- When the frequency of the acoustic feature parameters is lower than the frequency of the vocoder feature parameters, up-sampling is performed by repeating the acoustic feature parameters so that the frequency of the acoustic feature parameters is equal to the frequency of the vocoder feature parameters.
- The parameter conversion module 406 is configured to: input the phoneme sequence into the encoder to obtain the feature representation, output by the encoder, corresponding to each element in the phoneme sequence; input the feature representation corresponding to each element, the decoder hidden state output at the current moment by the first recurrent layer in the decoder, and the cumulative attention weight information corresponding to each element at the previous moment into the attention model to obtain a context vector; input the decoder hidden state output at the current moment by the first recurrent layer in the decoder and the context vector into the second recurrent layer of the decoder to obtain the decoder hidden state at the current moment output by the second recurrent layer; and predict the acoustic feature parameters according to the decoder hidden states output by the decoder at each moment.
- The speech generation module 408 is configured to input the vocoder feature parameters into the vocoder to generate speech.
- The speech synthesis device 40 further includes a model training module 410, configured to: divide the speech samples corresponding to each training text into frames according to a preset frequency and extract acoustic feature parameters for each frame, to generate the first acoustic feature parameter sample corresponding to each training text; train the acoustic parameter prediction model using each training text and its corresponding first acoustic feature parameter sample; convert each training text into a second acoustic feature parameter sample using the trained acoustic parameter prediction model; convert the speech sample corresponding to each training text into a vocoder feature parameter sample according to the synthesis frequency of the vocoder; and train the vocoder parameter conversion model using the second acoustic feature parameter sample and the vocoder feature parameter sample corresponding to each training text.
- the speech synthesis apparatus in the embodiments of the present disclosure can be implemented by various computing devices or computer systems, which are described below in conjunction with FIG. 5 and FIG. 6.
- FIG. 5 is a structural diagram of some embodiments of the speech synthesis device of the present disclosure.
- The apparatus 50 of this embodiment includes a memory 510 and a processor 520 coupled to the memory 510. The processor 520 is configured to execute the speech synthesis method in any of the embodiments of the present disclosure based on instructions stored in the memory 510.
- the memory 510 may include, for example, a system memory, a fixed non-volatile storage medium, and the like.
- The system memory stores, for example, an operating system, application programs, a boot loader, a database, and other programs.
- FIG. 6 is a structural diagram of other embodiments of the speech synthesis device of the present disclosure.
- the device 60 of this embodiment includes a memory 610 and a processor 620, which are similar to the memory 510 and the processor 520, respectively. It may also include an input/output interface 630, a network interface 640, a storage interface 650, and so on. These interfaces 630, 640, 650, and the memory 610 and the processor 620 may be connected via a bus 660, for example.
- The input/output interface 630 provides a connection interface for input and output devices such as a display, a mouse, a keyboard, and a touch screen.
- The network interface 640 provides a connection interface for various networked devices; for example, it can connect to a database server or a cloud storage server.
- The storage interface 650 provides a connection interface for external storage devices such as SD cards and USB flash drives.
- The embodiments of the present disclosure may be provided as methods, systems, or computer program products. Therefore, the present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the present disclosure may take the form of a computer program product implemented on one or more computer-usable non-transitory storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
- These computer program instructions can also be stored in a computer-readable memory that can direct a computer or other programmable data processing equipment to work in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction device that implements the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
- These computer program instructions can also be loaded onto a computer or other programmable data processing equipment, so that a series of operation steps are executed on the computer or other programmable equipment to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable equipment provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Theoretical Computer Science (AREA)
- Spectroscopy & Molecular Physics (AREA)
- General Health & Medical Sciences (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Machine Translation (AREA)
Claims (22)
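- A speech synthesis method, comprising: dividing a text into multiple segments belonging to different language types; converting each of the segments into corresponding phonemes according to the language type to which the segment belongs, to generate a phoneme sequence of the text; inputting the phoneme sequence into a pre-trained speech synthesis model to convert it into vocoder feature parameters; and inputting the vocoder feature parameters into a vocoder to generate speech.
- The speech synthesis method according to claim 1, wherein dividing the text into multiple segments belonging to different language types comprises: identifying the language type to which each character in the text belongs according to the encoding of the character; and dividing consecutive characters belonging to the same language type into one segment of that language type.
- The speech synthesis method according to claim 1, wherein generating the phoneme sequence of the text comprises: determining a prosodic structure of the text; and adding, according to the prosodic structure of the text, a prosody mark after the phoneme corresponding to each character in the text, to form the phoneme sequence of the text.
- The speech synthesis method according to claim 1, wherein inputting the phoneme sequence into the pre-trained speech synthesis model to convert it into vocoder feature parameters comprises: inputting the phoneme sequence into an acoustic parameter prediction model in the speech synthesis model to convert it into acoustic feature parameters; and inputting the acoustic feature parameters into a vocoder parameter conversion model in the speech synthesis model to obtain the output vocoder feature parameters.
- The speech synthesis method according to claim 4, wherein the acoustic parameter prediction model comprises an encoder, a decoder, and an attention model; and inputting the phoneme sequence into the acoustic parameter prediction model in the speech synthesis model to convert it into acoustic feature parameters comprises: determining, using the attention model, the attention weight of each feature representation output by the encoder at the current moment; and determining whether the attention weight of the feature representation corresponding to a preset element in the phoneme sequence is the maximum among the attention weights, and if so, ending the decoding process.
- The speech synthesis method according to claim 4, wherein the acoustic feature parameters comprise speech spectrum parameters; and the vocoder parameter conversion model is composed of a multi-layer deep neural network and a long short-term memory network.
- The speech synthesis method according to claim 4, wherein, when the frequency of the acoustic feature parameters is lower than the frequency of the vocoder feature parameters, up-sampling is performed by repeating the acoustic feature parameters so that the frequency of the acoustic feature parameters is equal to the frequency of the vocoder feature parameters.
- The speech synthesis method according to claim 1, further comprising: training the speech synthesis model; wherein the training method comprises: dividing the speech sample corresponding to each training text into frames according to a preset frequency, and extracting acoustic feature parameters for each frame, to generate a first acoustic feature parameter sample corresponding to each training text; training the acoustic parameter prediction model using each training text and the first acoustic feature parameter sample corresponding to each training text; converting each training text into a second acoustic feature parameter sample using the trained acoustic parameter prediction model; converting the speech sample corresponding to each training text into a vocoder feature parameter sample according to the synthesis frequency of the vocoder; and training the vocoder parameter conversion model using the second acoustic feature parameter sample and the vocoder feature parameter sample corresponding to each training text.
- The speech synthesis method according to claim 4, wherein the acoustic parameter prediction model comprises an encoder, a decoder, and an attention model; and inputting the phoneme sequence into the acoustic parameter prediction model in the speech synthesis model to convert it into acoustic feature parameters comprises: inputting the phoneme sequence into the encoder to obtain the feature representation, output by the encoder, corresponding to each element in the phoneme sequence; inputting the feature representation corresponding to each element, the decoder hidden state output at the current moment by the first recurrent layer in the decoder, and the cumulative attention weight information corresponding to each element at the previous moment into the attention model to obtain a context vector; inputting the decoder hidden state output at the current moment by the first recurrent layer in the decoder and the context vector into the second recurrent layer of the decoder to obtain the decoder hidden state at the current moment output by the second recurrent layer of the decoder; and predicting the acoustic feature parameters according to the decoder hidden states output by the decoder at each moment.
- The speech synthesis method according to claim 1, wherein converting each of the segments into corresponding phonemes according to the language type to which the segment belongs comprises: performing text normalization on each segment according to the language type to which the segment belongs; performing word segmentation on each normalized segment according to the language type to which the segment belongs; and converting the segmented words of each segment into corresponding phonemes according to a preset phoneme conversion table corresponding to the language type to which the segment belongs; wherein the phonemes include the tones corresponding to the characters.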
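- A speech synthesis device, comprising: a language recognition module configured to divide a text into multiple segments belonging to different language types; a phoneme conversion module configured to convert each of the segments into corresponding phonemes according to the language type to which the segment belongs, to generate a phoneme sequence of the text; a parameter conversion module configured to input the phoneme sequence into a pre-trained speech synthesis model to convert it into vocoder feature parameters; and a speech generation module configured to input the vocoder feature parameters into a vocoder to generate speech.
- The speech synthesis device according to claim 11, wherein the language recognition module is configured to identify the language type to which each character in the text belongs according to the encoding of the character, and to divide consecutive characters belonging to the same language type into one segment of that language type.
- The speech synthesis device according to claim 11, wherein the phoneme conversion module is configured to determine a prosodic structure of the text, and to add, according to the prosodic structure of the text, a prosody mark after the phoneme corresponding to each character in the text, to form the phoneme sequence of the text.
- The speech synthesis device according to claim 11, wherein the parameter conversion module is configured to input the phoneme sequence into an acoustic parameter prediction model in the speech synthesis model to convert it into acoustic feature parameters, and to input the acoustic feature parameters into a vocoder parameter conversion model in the speech synthesis model to obtain the output vocoder feature parameters.
- The speech synthesis device according to claim 14, wherein the acoustic parameter prediction model comprises an encoder, a decoder, and an attention model; and the parameter conversion module is configured to determine, using the attention model, the attention weight of each feature representation output by the encoder at the current moment, and to determine whether the attention weight of the feature representation corresponding to a preset element in the phoneme sequence is the maximum among the attention weights, and if so, end the decoding process.
- The speech synthesis device according to claim 14, wherein the acoustic feature parameters comprise speech spectrum parameters; and the vocoder parameter conversion model is composed of a multi-layer deep neural network and a long short-term memory network.
- The speech synthesis device according to claim 14, wherein, when the frequency of the acoustic feature parameters is lower than the frequency of the vocoder feature parameters, up-sampling is performed by repeating the acoustic feature parameters so that the frequency of the acoustic feature parameters is equal to the frequency of the vocoder feature parameters.
- The speech synthesis device according to claim 11, further comprising: a model training module configured to: divide the speech sample corresponding to each training text into frames according to a preset frequency, and extract acoustic feature parameters for each frame, to generate a first acoustic feature parameter sample corresponding to each training text; train the acoustic parameter prediction model using each training text and the first acoustic feature parameter sample corresponding to each training text; convert each training text into a second acoustic feature parameter sample using the trained acoustic parameter prediction model; convert the speech sample corresponding to each training text into a vocoder feature parameter sample according to the synthesis frequency of the vocoder; and train the vocoder parameter conversion model using the second acoustic feature parameter sample and the vocoder feature parameter sample corresponding to each training text.
- The speech synthesis device according to claim 14, wherein the acoustic parameter prediction model comprises an encoder, a decoder, and an attention model; and the parameter conversion module is configured to: input the phoneme sequence into the encoder to obtain the feature representation, output by the encoder, corresponding to each element in the phoneme sequence; input the feature representation corresponding to each element, the decoder hidden state output at the current moment by the first recurrent layer in the decoder, and the cumulative attention weight information corresponding to each element at the previous moment into the attention model to obtain a context vector; input the decoder hidden state output at the current moment by the first recurrent layer in the decoder and the context vector into the second recurrent layer of the decoder to obtain the decoder hidden state at the current moment output by the second recurrent layer of the decoder; and predict the acoustic feature parameters according to the decoder hidden states output by the decoder at each moment.
- The speech synthesis device according to claim 11, wherein the phoneme conversion module is configured to: perform text normalization on each segment according to the language type to which the segment belongs; perform word segmentation on each normalized segment according to the language type to which the segment belongs; and convert the segmented words of each segment into corresponding phonemes according to a preset phoneme conversion table corresponding to the language type to which the segment belongs; wherein the phonemes include the tones corresponding to the characters.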
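- A speech synthesis device, comprising: a memory; and a processor coupled to the memory, the processor being configured to execute, based on instructions stored in the memory, the speech synthesis method according to any one of claims 1-10.
- A computer-readable storage medium having a computer program stored thereon, wherein the program, when executed by a processor, implements the steps of the method according to any one of claims 1-10.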
Priority Applications (3)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| JP2021558871A JP7464621B2 (ja) | 2019-04-03 | 2020-03-30 | 音声合成方法、デバイス、およびコンピュータ可読ストレージ媒体 |
| EP20783784.0A EP3937165B1 (en) | 2019-04-03 | 2020-03-30 | Speech synthesis method and apparatus, and computer-readable storage medium |
| US17/600,850 US11881205B2 (en) | 2019-04-03 | 2020-03-30 | Speech synthesis method, device and computer readable storage medium |
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201910266289.4A CN111798832B (zh) | 2019-04-03 | 2019-04-03 | 语音合成方法、装置和计算机可读存储介质 |
| CN201910266289.4 | 2019-04-03 |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2020200178A1 true WO2020200178A1 (zh) | 2020-10-08 |
Family
ID=72664952
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/CN2020/082172 Ceased WO2020200178A1 (zh) | 2019-04-03 | 2020-03-30 | 语音合成方法、装置和计算机可读存储介质 |
Country Status (5)
| Country | Link |
|---|---|
| US (1) | US11881205B2 (zh) |
| EP (1) | EP3937165B1 (zh) |
| JP (1) | JP7464621B2 (zh) |
| CN (1) | CN111798832B (zh) |
| WO (1) | WO2020200178A1 (zh) |
Cited By (21)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN112185340A (zh) * | 2020-10-30 | 2021-01-05 | 网易(杭州)网络有限公司 | 语音合成方法、语音合成装置、存储介质与电子设备 |
| CN112992177A (zh) * | 2021-02-20 | 2021-06-18 | 平安科技(深圳)有限公司 | 语音风格迁移模型的训练方法、装置、设备及存储介质 |
| CN113327576A (zh) * | 2021-06-03 | 2021-08-31 | 多益网络有限公司 | 语音合成方法、装置、设备及存储介质 |
| CN113409761A (zh) * | 2021-07-12 | 2021-09-17 | 上海喜马拉雅科技有限公司 | 语音合成方法、装置、电子设备以及计算机可读存储介质 |
| CN113450760A (zh) * | 2021-06-07 | 2021-09-28 | 北京一起教育科技有限责任公司 | 一种文本转语音的方法、装置及电子设备 |
| CN113707125A (zh) * | 2021-08-30 | 2021-11-26 | 中国科学院声学研究所 | 一种多语言语音合成模型的训练方法及装置 |
| CN113724683A (zh) * | 2021-07-23 | 2021-11-30 | 阿里巴巴达摩院(杭州)科技有限公司 | 音频生成方法、计算机设备及计算机可读存储介质 |
| CN113763922A (zh) * | 2021-05-12 | 2021-12-07 | 腾讯科技(深圳)有限公司 | 音频合成方法和装置、存储介质及电子设备 |
| CN114267375A (zh) * | 2021-11-24 | 2022-04-01 | 北京百度网讯科技有限公司 | 音素检测方法及装置、训练方法及装置、设备和介质 |
| CN114267376A (zh) * | 2021-11-24 | 2022-04-01 | 北京百度网讯科技有限公司 | 音素检测方法及装置、训练方法及装置、设备和介质 |
| CN114399991A (zh) * | 2022-01-27 | 2022-04-26 | 北京有竹居网络技术有限公司 | 语音合成方法、装置、存储介质及电子设备 |
| CN114495899A (zh) * | 2021-12-29 | 2022-05-13 | 深圳市优必选科技股份有限公司 | 一种基于时长信息的音频合成方法、装置及终端设备 |
| CN114765022A (zh) * | 2020-12-30 | 2022-07-19 | 大众问问(北京)信息科技有限公司 | 一种语音合成方法、装置、计算机设备和存储介质 |
| CN115101041A (zh) * | 2022-05-09 | 2022-09-23 | 北京百度网讯科技有限公司 | 语音合成与语音合成模型的训练方法、装置 |
| CN115132170A (zh) * | 2022-06-28 | 2022-09-30 | 腾讯音乐娱乐科技(深圳)有限公司 | 语种分类方法、装置及计算机可读存储介质 |
| CN115691476A (zh) * | 2022-06-06 | 2023-02-03 | 腾讯科技(深圳)有限公司 | 语音识别模型的训练方法、语音识别方法、装置及设备 |
| CN116129866A (zh) * | 2023-02-16 | 2023-05-16 | 北京百度网讯科技有限公司 | 语音合成方法、网络训练方法、装置、设备及存储介质 |
| CN116612742A (zh) * | 2023-04-28 | 2023-08-18 | 科大讯飞股份有限公司 | 语音合成方法、装置、设备及存储介质 |
| CN118782018A (zh) * | 2023-04-03 | 2024-10-15 | 科大讯飞股份有限公司 | 语音合成方法、装置、设备及存储介质 |
| CN118840996A (zh) * | 2024-06-27 | 2024-10-25 | 合肥智能语音创新发展有限公司 | 一种发音预测方法及相关装置 |
| CN120032621A (zh) * | 2025-01-16 | 2025-05-23 | 思必驰科技股份有限公司 | 面向vqtts模型的语音合成缺陷修正方法、设备及存储介质 |
Families Citing this family (44)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2022086590A1 (en) * | 2020-10-21 | 2022-04-28 | Google Llc | Parallel tacotron: non-autoregressive and controllable tts |
| CN112331183B (zh) * | 2020-10-27 | 2022-03-18 | 中科极限元(杭州)智能科技股份有限公司 | 基于自回归网络的非平行语料语音转换方法及系统 |
| CN112365878B (zh) * | 2020-10-30 | 2024-01-23 | 广州华多网络科技有限公司 | 语音合成方法、装置、设备及计算机可读存储介质 |
| CN112435650B (zh) * | 2020-11-11 | 2022-04-15 | 四川长虹电器股份有限公司 | 一种多说话人、多语言的语音合成方法及系统 |
| CN112420016B (zh) * | 2020-11-20 | 2022-06-03 | 四川长虹电器股份有限公司 | 一种合成语音与文本对齐的方法、装置及计算机储存介质 |
| JP7487794B2 (ja) * | 2020-11-25 | 2024-05-21 | 日本電信電話株式会社 | ラベリング処理方法、ラベリング処理装置およびラベリング処理プログラム |
| CN112634865B (zh) * | 2020-12-23 | 2022-10-28 | 爱驰汽车有限公司 | 语音合成方法、装置、计算机设备和存储介质 |
| CN113539231B (zh) * | 2020-12-30 | 2024-06-18 | 腾讯科技(深圳)有限公司 | 音频处理方法、声码器、装置、设备及存储介质 |
| CN112885328B (zh) | 2021-01-22 | 2024-06-28 | 华为技术有限公司 | 一种文本数据处理方法及装置 |
| CN112951200B (zh) * | 2021-01-28 | 2024-03-12 | 北京达佳互联信息技术有限公司 | 语音合成模型的训练方法、装置、计算机设备及存储介质 |
| CN112802449B (zh) * | 2021-03-19 | 2021-07-02 | 广州酷狗计算机科技有限公司 | 音频合成方法、装置、计算机设备及存储介质 |
| CN113035228B (zh) * | 2021-03-23 | 2024-08-23 | 广州酷狗计算机科技有限公司 | 声学特征提取方法、装置、设备及存储介质 |
| EP4248441A4 (en) * | 2021-03-25 | 2024-07-10 | Samsung Electronics Co., Ltd. | SPEECH RECOGNITION METHOD, DEVICE, ELECTRONIC DEVICE AND COMPUTER-READABLE STORAGE MEDIUM |
| CN115223539B (zh) * | 2021-03-30 | 2025-02-25 | 暗物智能科技(广州)有限公司 | 一种豪萨语语音合成方法及系统 |
| CN113761841B (zh) | 2021-04-19 | 2023-07-25 | 腾讯科技(深圳)有限公司 | 将文本数据转换为声学特征的方法 |
| CN113362803B (zh) * | 2021-05-31 | 2023-04-25 | 杭州芯声智能科技有限公司 | 一种arm侧离线语音合成的方法、装置及存储介质 |
| CN113345412A (zh) * | 2021-05-31 | 2021-09-03 | 平安科技(深圳)有限公司 | 语音合成方法、装置、设备以及存储介质 |
| CN113345415B (zh) * | 2021-06-01 | 2024-10-25 | 平安科技(深圳)有限公司 | 语音合成方法、装置、设备及存储介质 |
| CN113808571B (zh) * | 2021-08-17 | 2022-05-27 | 北京百度网讯科技有限公司 | 语音合成方法、装置、电子设备以及存储介质 |
| CN113838452B (zh) | 2021-08-17 | 2022-08-23 | 北京百度网讯科技有限公司 | 语音合成方法、装置、设备和计算机存储介质 |
| CN113838453B (zh) * | 2021-08-17 | 2022-06-28 | 北京百度网讯科技有限公司 | 语音处理方法、装置、设备和计算机存储介质 |
| CN114299910B (zh) * | 2021-09-06 | 2024-03-22 | 腾讯科技(深圳)有限公司 | 语音合成模型的训练方法、使用方法、装置、设备及介质 |
| CN114049873B (zh) * | 2021-10-29 | 2025-07-08 | 北京搜狗科技发展有限公司 | 语音克隆方法、训练方法、装置和介质 |
| GB2612624B (en) * | 2021-11-05 | 2025-10-15 | Spotify Ab | Methods and systems for synthesising speech from text |
| CN114678005B (zh) * | 2022-04-11 | 2025-09-23 | 平安科技(深圳)有限公司 | 一种语音合成方法、结构、终端及存储介质 |
| CN115223538B (zh) * | 2022-07-13 | 2025-07-25 | 深圳市腾讯计算机系统有限公司 | 声码器模型的训练方法、装置、设备、介质及程序产品 |
| US12555563B2 (en) * | 2022-08-15 | 2026-02-17 | Tencent America LLC | Systems and methods for character-to-phone conversion |
| CN117636841A (zh) * | 2022-08-19 | 2024-03-01 | 北京嘀嘀无限科技发展有限公司 | 语音合成方法、装置、设备、存储介质和程序产品 |
| CN116665636B (zh) * | 2022-09-20 | 2024-03-12 | 荣耀终端有限公司 | 音频数据处理方法、模型训练方法、电子设备和存储介质 |
| US12518736B2 (en) * | 2022-11-09 | 2026-01-06 | Square Enix Co., Ltd. | Non-transitory computer-readable medium and voice generating system |
| CN116052636A (zh) * | 2023-01-13 | 2023-05-02 | 长城汽车股份有限公司 | 中文语音合成方法、装置、终端及存储介质 |
| CN116665641A (zh) * | 2023-06-07 | 2023-08-29 | 腾讯音乐娱乐科技(深圳)有限公司 | 一种音频帧的基频预测方法、模型的训练方法及其装置 |
| US12363319B2 (en) | 2023-06-14 | 2025-07-15 | Microsoft Technology Licensing, Llc | Object-based context-based decoder correction |
| US12469507B2 (en) | 2023-06-14 | 2025-11-11 | Microsoft Technology Licensing, Llc | Predictive context-based decoder correction |
| US12561525B2 (en) * | 2023-07-31 | 2026-02-24 | Paypal, Inc. | Systems and methods for establishing multilingual context-preserving chunk library |
| CN117475992A (zh) * | 2023-11-21 | 2024-01-30 | 支付宝(杭州)信息技术有限公司 | 语音合成方法、装置、设备及存储介质 |
| CN117765926B (zh) * | 2024-02-19 | 2024-05-14 | 上海蜜度科技股份有限公司 | 语音合成方法、系统、电子设备及介质 |
| CN118486294B (zh) * | 2024-06-05 | 2025-03-25 | 内蒙古工业大学 | 一种基于分离对比学习的蒙古语未登录词读音增强方法 |
| CN118571236B (zh) * | 2024-08-05 | 2024-10-29 | 上海岩芯数智人工智能科技有限公司 | 一种基于音域范围的音频token化编码方法及装置 |
| CN119446114B (zh) * | 2024-09-30 | 2025-09-30 | 平安科技(深圳)有限公司 | 一种语音合成方法、装置、设备及其存储介质 |
| CN119724150B (zh) * | 2024-12-12 | 2025-11-14 | 安徽讯飞寰语科技有限公司 | 语音合成方法、系统、电子设备及存储介质 |
| CN119724148B (zh) * | 2025-02-27 | 2025-06-17 | 科大讯飞股份有限公司 | 语音合成方法及相关装置、设备和存储介质 |
| CN120580987B (zh) * | 2025-06-23 | 2025-12-02 | 广州佰锐网络科技有限公司 | 一种基于深度学习的多语言tts实时合成方法 |
| CN121354534B (zh) * | 2025-12-17 | 2026-03-20 | 科大讯飞股份有限公司 | 语音合成方法、装置、电子设备及存储介质 |
Citations (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN1540625A (zh) * | 2003-03-24 | 2004-10-27 | 微软公司 | 多语种文本-语音系统的前端结构 |
| US20060136216A1 (en) * | 2004-12-10 | 2006-06-22 | Delta Electronics, Inc. | Text-to-speech system and method thereof |
| US20120278081A1 (en) * | 2009-06-10 | 2012-11-01 | Kabushiki Kaisha Toshiba | Text to speech method and system |
| US9484014B1 (en) * | 2013-02-20 | 2016-11-01 | Amazon Technologies, Inc. | Hybrid unit selection / parametric TTS system |
| CN106297764A (zh) * | 2015-05-27 | 2017-01-04 | 科大讯飞股份有限公司 | 一种多语种混语文本处理方法及系统 |
| TW201705019A (zh) * | 2015-07-21 | 2017-02-01 | 華碩電腦股份有限公司 | 文字轉語音方法以及多語言語音合成裝置 |
Family Cites Families (17)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US6067520A (en) * | 1995-12-29 | 2000-05-23 | Lee And Li | System and method of recognizing continuous mandarin speech utilizing chinese hidden markou models |
| JP2975586B2 (ja) * | 1998-03-04 | 1999-11-10 | 株式会社エイ・ティ・アール音声翻訳通信研究所 | 音声合成システム |
| CA2562366A1 (en) * | 2004-04-06 | 2005-10-20 | Department Of Information Technology | A system for multiligual machine translation from english to hindi and other indian languages using pseudo-interlingua and hybridized approach |
| US20050267757A1 (en) * | 2004-05-27 | 2005-12-01 | Nokia Corporation | Handling of acronyms and digits in a speech recognition and text-to-speech engine |
| US20070106513A1 (en) * | 2005-11-10 | 2007-05-10 | Boillot Marc A | Method for facilitating text to speech synthesis using a differential vocoder |
| EP2276023A3 (en) * | 2005-11-30 | 2011-10-05 | Telefonaktiebolaget LM Ericsson (publ) | Efficient speech stream conversion |
| US8478581B2 (en) * | 2010-01-25 | 2013-07-02 | Chung-ching Chen | Interlingua, interlingua engine, and interlingua machine translation system |
| US8688435B2 (en) * | 2010-09-22 | 2014-04-01 | Voice On The Go Inc. | Systems and methods for normalizing input media |
| US9483461B2 (en) * | 2012-03-06 | 2016-11-01 | Apple Inc. | Handling speech synthesis of content for multiple languages |
| US9195656B2 (en) * | 2013-12-30 | 2015-11-24 | Google Inc. | Multilingual prosody generation |
| US9865251B2 (en) * | 2015-07-21 | 2018-01-09 | Asustek Computer Inc. | Text-to-speech method and multi-lingual speech synthesizer using the method |
| RU2632424C2 (ru) * | 2015-09-29 | 2017-10-04 | Общество С Ограниченной Ответственностью "Яндекс" | Способ и сервер для синтеза речи по тексту |
| US9799327B1 (en) * | 2016-02-26 | 2017-10-24 | Google Inc. | Speech recognition with attention-based recurrent neural networks |
| JP6819988B2 (ja) * | 2016-07-28 | 2021-01-27 | 国立研究開発法人情報通信研究機構 | 音声対話装置、サーバ装置、音声対話方法、音声処理方法およびプログラム |
| US10872598B2 (en) * | 2017-02-24 | 2020-12-22 | Baidu Usa Llc | Systems and methods for real-time neural text-to-speech |
| US10796686B2 (en) * | 2017-10-19 | 2020-10-06 | Baidu Usa Llc | Systems and methods for neural text-to-speech using convolutional sequence learning |
| CN107945786B (zh) | 2017-11-27 | 2021-05-25 | 北京百度网讯科技有限公司 | Speech synthesis method and apparatus |

2019
- 2019-04-03: CN application CN201910266289.4A, publication CN111798832B (zh), status: Active

2020
- 2020-03-30: US application US17/600,850, publication US11881205B2 (en), status: Active
- 2020-03-30: WO application PCT/CN2020/082172, publication WO2020200178A1 (zh), status: Ceased
- 2020-03-30: EP application EP20783784.0A, publication EP3937165B1 (en), status: Active
- 2020-03-30: JP application JP2021558871A, publication JP7464621B2 (ja), status: Active

Patent Citations (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN1540625A (zh) * | 2003-03-24 | 2004-10-27 | 微软公司 | Front-end architecture for a multilingual text-to-speech system |
| US20060136216A1 (en) * | 2004-12-10 | 2006-06-22 | Delta Electronics, Inc. | Text-to-speech system and method thereof |
| US20120278081A1 (en) * | 2009-06-10 | 2012-11-01 | Kabushiki Kaisha Toshiba | Text to speech method and system |
| US9484014B1 (en) * | 2013-02-20 | 2016-11-01 | Amazon Technologies, Inc. | Hybrid unit selection / parametric TTS system |
| CN106297764A (zh) * | 2015-05-27 | 2017-01-04 | 科大讯飞股份有限公司 | Multilingual mixed-language text processing method and system |
| TW201705019A (zh) * | 2015-07-21 | 2017-02-01 | 華碩電腦股份有限公司 | Text-to-speech method and multilingual speech synthesis device |
Non-Patent Citations (4)
| Title |
|---|
| BO LI; YU ZHANG; TARA SAINATH; YONGHUI WU; WILLIAM CHAN: "Bytes are All You Need: End-to-End Multilingual Speech Recognition and Synthesis with Bytes", ELECTRICAL ENGINEERING AND SYSTEMS SCIENCE, 22 November 2018 (2018-11-22), pages 1 - 5, XP080937763 * |
| ELIYA NACHMANI; LIOR WOLF: "Unsupervised Polyglot Text To Speech", COMPUTER SCIENCE, 6 February 2019 (2019-02-06), pages 1 - 5, XP081026077 * |
| LIUMENG XUE; WEI SONG; GUANGHUI XU; LEI XIE; ZHIZHENG WU: "Building a mixed-lingual neural TTS system with only monolingual data", ELECTRICAL ENGINEERING AND SYSTEMS SCIENCE, 12 April 2019 (2019-04-12), pages 1 - 6, XP081168422 * |
| See also references of EP3937165A4 * |
Cited By (29)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
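| CN112185340A (zh) * | 2020-10-30 | 2021-01-05 | 网易(杭州)网络有限公司 | Speech synthesis method, speech synthesis apparatus, storage medium, and electronic device |
| CN112185340B (zh) * | 2020-10-30 | 2024-03-15 | 网易(杭州)网络有限公司 | Speech synthesis method, speech synthesis apparatus, storage medium, and electronic device |
| CN114765022A (zh) * | 2020-12-30 | 2022-07-19 | 大众问问(北京)信息科技有限公司 | Speech synthesis method and apparatus, computer device, and storage medium |
| CN112992177A (zh) * | 2021-02-20 | 2021-06-18 | 平安科技(深圳)有限公司 | Training method, apparatus, device, and storage medium for a speech style transfer model |
| CN112992177B (zh) * | 2021-02-20 | 2023-10-17 | 平安科技(深圳)有限公司 | Training method, apparatus, device, and storage medium for a speech style transfer model |
| CN113763922A (zh) * | 2021-05-12 | 2021-12-07 | 腾讯科技(深圳)有限公司 | Audio synthesis method and apparatus, storage medium, and electronic device |
| CN113327576B (zh) * | 2021-06-03 | 2024-04-23 | 多益网络有限公司 | Speech synthesis method, apparatus, device, and storage medium |
| CN113327576A (zh) * | 2021-06-03 | 2021-08-31 | 多益网络有限公司 | Speech synthesis method, apparatus, device, and storage medium |
| CN113450760A (zh) * | 2021-06-07 | 2021-09-28 | 北京一起教育科技有限责任公司 | Text-to-speech method and apparatus, and electronic device |
| CN113409761A (zh) * | 2021-07-12 | 2021-09-17 | 上海喜马拉雅科技有限公司 | Speech synthesis method and apparatus, electronic device, and computer-readable storage medium |
| CN113409761B (zh) * | 2021-07-12 | 2022-11-01 | 上海喜马拉雅科技有限公司 | Speech synthesis method and apparatus, electronic device, and computer-readable storage medium |
| CN113724683B (zh) * | 2021-07-23 | 2024-03-22 | 阿里巴巴达摩院(杭州)科技有限公司 | Audio generation method, computer device, and computer-readable storage medium |
| CN113724683A (zh) * | 2021-07-23 | 2021-11-30 | 阿里巴巴达摩院(杭州)科技有限公司 | Audio generation method, computer device, and computer-readable storage medium |
| CN113707125B (zh) * | 2021-08-30 | 2024-02-27 | 中国科学院声学研究所 | Training method and apparatus for a multilingual speech synthesis model |
| CN113707125A (zh) * | 2021-08-30 | 2021-11-26 | 中国科学院声学研究所 | Training method and apparatus for a multilingual speech synthesis model |
| CN114267376A (zh) * | 2021-11-24 | 2022-04-01 | 北京百度网讯科技有限公司 | Phoneme detection method and apparatus, training method and apparatus, device, and medium |
| CN114267375A (zh) * | 2021-11-24 | 2022-04-01 | 北京百度网讯科技有限公司 | Phoneme detection method and apparatus, training method and apparatus, device, and medium |
| CN114267375B (zh) * | 2021-11-24 | 2022-10-28 | 北京百度网讯科技有限公司 | Phoneme detection method and apparatus, training method and apparatus, device, and medium |
| CN114495899A (zh) * | 2021-12-29 | 2022-05-13 | 深圳市优必选科技股份有限公司 | Audio synthesis method and apparatus based on duration information, and terminal device |
| CN114399991A (zh) * | 2022-01-27 | 2022-04-26 | 北京有竹居网络技术有限公司 | Speech synthesis method and apparatus, storage medium, and electronic device |
| CN115101041A (zh) * | 2022-05-09 | 2022-09-23 | 北京百度网讯科技有限公司 | Speech synthesis method and apparatus, and training method and apparatus for a speech synthesis model |
| CN115691476B (zh) * | 2022-06-06 | 2023-07-04 | 腾讯科技(深圳)有限公司 | Training method for a speech recognition model, speech recognition method, apparatus, and device |
| CN115691476A (zh) * | 2022-06-06 | 2023-02-03 | 腾讯科技(深圳)有限公司 | Training method for a speech recognition model, speech recognition method, apparatus, and device |
| CN115132170A (zh) * | 2022-06-28 | 2022-09-30 | 腾讯音乐娱乐科技(深圳)有限公司 | Language classification method and apparatus, and computer-readable storage medium |
| CN116129866A (zh) * | 2023-02-16 | 2023-05-16 | 北京百度网讯科技有限公司 | Speech synthesis method, network training method, apparatus, device, and storage medium |
| CN118782018A (zh) * | 2023-04-03 | 2024-10-15 | 科大讯飞股份有限公司 | Speech synthesis method, apparatus, device, and storage medium |
| CN116612742A (zh) * | 2023-04-28 | 2023-08-18 | 科大讯飞股份有限公司 | Speech synthesis method, apparatus, device, and storage medium |
| CN118840996A (zh) * | 2024-06-27 | 2024-10-25 | 合肥智能语音创新发展有限公司 | Pronunciation prediction method and related apparatus |
| CN120032621A (zh) * | 2025-01-16 | 2025-05-23 | 思必驰科技股份有限公司 | Speech synthesis defect correction method for VQTTS models, device, and storage medium |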
Also Published As
| Publication number | Publication date |
|---|---|
| CN111798832A (zh) | 2020-10-20 |
| JP2022527970A (ja) | 2022-06-07 |
| EP3937165C0 (en) | 2025-10-22 |
| CN111798832B (zh) | 2024-09-20 |
| EP3937165A1 (en) | 2022-01-12 |
| US20220165249A1 (en) | 2022-05-26 |
| JP7464621B2 (ja) | 2024-04-09 |
| EP3937165B1 (en) | 2025-10-22 |
| EP3937165A4 (en) | 2023-05-10 |
| US11881205B2 (en) | 2024-01-23 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US11881205B2 (en) | | Speech synthesis method, device and computer readable storage medium |
| CN114038447B (zh) | | Training method for a speech synthesis model, speech synthesis method, apparatus, and medium |
| US11322133B2 (en) | | Expressive text-to-speech utilizing contextual word-level style tokens |
| CN115547293B (zh) | | Multilingual speech synthesis method and system based on hierarchical prosody prediction |
| US20220246132A1 (en) | | Generating Diverse and Natural Text-To-Speech Samples |
| US11810471B2 (en) | | Computer implemented method and apparatus for recognition of speech patterns and feedback |
| CN114464162B (zh) | | Speech synthesis method, neural network model training method, and speech synthesis model |
| US12087272B2 (en) | | Training speech synthesis to generate distinct speech sounds |
| KR20210146368A (ko) | | End-to-end automatic speech recognition on numeric sequences |
| CN113205792A (zh) | | Mongolian speech synthesis method based on Transformer and WaveNet |
| CN115424604B (zh) | | Training method for a speech synthesis model based on a generative adversarial network |
| CN113450758B (zh) | | Speech synthesis method, apparatus, device, and medium |
| CN113257221B (zh) | | Speech model training method based on front-end design, and speech synthesis method |
| US12073822B2 (en) | | Voice generating method and apparatus, electronic device and storage medium |
| CN114863945A (zh) | | Text-based voice conversion method and apparatus, electronic device, and storage medium |
| Azim et al. | | Using character-level sequence-to-sequence model for word level text generation to enhance Arabic speech recognition |
| CN118800212A (zh) | | Speech synthesis front-end processing method, apparatus, device, and storage medium |
| CN116597809A (zh) | | Polyphonic character disambiguation method and apparatus, electronic device, and readable storage medium |
| CN120932627A (zh) | | NPU-based Chinese-English bilingual text-to-speech method and system |
| CN119517004B (zh) | | Text-to-speech conversion method, apparatus, device, and storage medium |
| JP7357518B2 (ja) | | Speech synthesis device and program |
| CN114267330A (zh) | | Speech synthesis method and apparatus, electronic device, and storage medium |
| CN115114933A (zh) | | Method, apparatus, device, and storage medium for text processing |
| Saychum et al. | | A great reduction of WER by syllable toneme prediction for Thai grapheme to phoneme conversion |
| Hendessi et al. | | A speech synthesizer for Persian text using a neural network with a smooth ergodic HMM |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | EP: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 20783784; Country of ref document: EP; Kind code of ref document: A1 |
| | ENP | Entry into the national phase | Ref document number: 2021558871; Country of ref document: JP; Kind code of ref document: A |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | ENP | Entry into the national phase | Ref document number: 2020783784; Country of ref document: EP; Effective date: 20211005 |
| | WWG | WIPO information: grant in national office | Ref document number: 2020783784; Country of ref document: EP |