WO2020200178A1 - Speech synthesis method, device, and computer-readable storage medium - Google Patents

Speech synthesis method, device, and computer-readable storage medium

Info

Publication number
WO2020200178A1
Authority
WO
WIPO (PCT)
Prior art keywords
parameter
acoustic
vocoder
speech synthesis
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/CN2020/082172
Other languages
English (en)
French (fr)
Inventor
武执政
张政臣
宋伟
饶永辉
解知杭
徐光辉
刘树勇
马博森
邱双稳
林隽民
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jingdong Century Trading Co Ltd and Beijing Jingdong Shangke Information Technology Co Ltd
Priority to JP2021558871A (patent JP7464621B2, ja)
Priority to EP20783784.0A (patent EP3937165B1, en)
Priority to US17/600,850 (patent US11881205B2, en)
Publication of WO2020200178A1 (zh)

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/06 Elementary speech units used in speech synthesisers; Concatenation rules
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/086 Detection of language
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/263 Language identification
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04 Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L13/047 Architecture of speech synthesisers
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10 Prosody rules derived from text; Stress or intonation
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/0018 Speech coding using phonetic or linguistical decoding of the source; Reconstruction using text-to-speech synthesis
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/02 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using spectral analysis, e.g. transform vocoders or subband vocoders
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
    • G10L19/16 Vocoder architecture
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis

Definitions

  • the present disclosure relates to the field of computer technology, and in particular, to a speech synthesis method, device and computer-readable storage medium.
  • the speech synthesis system can realize text-to-speech conversion (Text To Speech, TTS), which can convert text into sound through a series of algorithmic operations, and realize the process of the machine simulating human pronunciation.
  • the current speech synthesis system generally only supports the pronunciation of a single language.
  • a technical problem to be solved by the present disclosure is: how to implement an end-to-end speech synthesis system that supports pronunciation in multiple languages.
  • a speech synthesis method is provided, including: dividing text into multiple segments belonging to different language types; converting each segment into corresponding phonemes according to the language type to which the segment belongs, to generate the phoneme sequence of the text; inputting the phoneme sequence into a pre-trained speech synthesis model to convert it into vocoder feature parameters; and inputting the vocoder feature parameters into a vocoder to generate speech.
  • dividing the text into multiple segments belonging to different language types includes: identifying the language type to which each character belongs according to the encoding of each character in the text; and dividing consecutive characters belonging to the same language type into one segment of that language type.
  • generating the phoneme sequence of the text includes: determining the prosodic structure of the text; according to the prosodic structure of the text, adding a prosody mark after the phoneme corresponding to each character in the text to form the phoneme sequence of the text.
  • inputting the phoneme sequence into a pre-trained speech synthesis model and converting it into vocoder feature parameters includes: inputting the phoneme sequence into an acoustic parameter prediction model in the speech synthesis model and converting it into acoustic feature parameters; and inputting the acoustic feature parameters into a vocoder parameter conversion model in the speech synthesis model to obtain the output vocoder feature parameters.
  • the acoustic parameter prediction model includes an encoder, a decoder, and an attention model; inputting the phoneme sequence into the acoustic parameter prediction model in the speech synthesis model and converting it into acoustic feature parameters includes: using the attention model to determine the attention weight of each feature representation output by the encoder at the current moment; and judging whether the attention weight of the feature representation corresponding to a preset element in the phoneme sequence is the maximum among the attention weights, and if so, ending the decoding process.
  • the acoustic feature parameters include speech spectrum parameters; the vocoder parameter conversion model is composed of a multi-layer deep neural network and a long short-term memory network.
  • up-sampling is performed by repeating the acoustic characteristic parameter so that the frequency of the acoustic characteristic parameter is equal to the frequency of the vocoder characteristic parameter.
  • the method further includes training the speech synthesis model, wherein the training method includes: dividing the speech sample corresponding to each training text into frames according to a preset frequency and extracting acoustic feature parameters for each frame, to generate a first acoustic feature parameter sample corresponding to each training text; training the acoustic parameter prediction model using each training text and its first acoustic feature parameter sample; converting each training text into a second acoustic feature parameter sample using the trained acoustic parameter prediction model; converting the speech sample corresponding to each training text into a vocoder feature parameter sample according to the synthesis frequency of the vocoder; and training the vocoder parameter conversion model using the second acoustic feature parameter sample and the vocoder feature parameter sample corresponding to each training text.
  • the acoustic parameter prediction model includes an encoder, a decoder, and an attention model; inputting the phoneme sequence into the acoustic parameter prediction model in the speech synthesis model and converting it into acoustic feature parameters includes: inputting the phoneme sequence into the encoder to obtain the feature representation, output by the encoder, corresponding to each element in the phoneme sequence; inputting the feature representation corresponding to each element, the decoder hidden state output at the current moment by the first recurrent layer in the decoder, and the cumulative attention weight information corresponding to each element at the previous moment into the attention model to obtain the context vector; inputting the decoder hidden state output at the current moment by the first recurrent layer and the context vector into the second recurrent layer of the decoder to obtain the decoder hidden state output at the current moment by the second recurrent layer; and predicting the acoustic feature parameters according to the decoder hidden state at each moment output by the decoder.
  • converting each segment into corresponding phonemes according to the language type to which each segment belongs includes: normalizing each segment according to the language type to which it belongs; performing word segmentation on each normalized segment according to the language type to which it belongs; and converting the word segments of each segment into corresponding phonemes according to the preset phoneme conversion table corresponding to the language type of the segment; wherein the phonemes include the tones corresponding to the characters.
  • a speech synthesis device is provided, including: a language recognition module for dividing text into multiple segments belonging to different language types; a phoneme conversion module for converting each segment into corresponding phonemes according to the language type to which the segment belongs, to generate the phoneme sequence of the text; a parameter conversion module for inputting the phoneme sequence into a pre-trained speech synthesis model to convert it into vocoder feature parameters; and a speech generation module for inputting the vocoder feature parameters into a vocoder to generate speech.
  • the language recognition module is used to identify the language type to which each character belongs according to the encoding of each character in the text; divide consecutive characters belonging to the same language type into a segment of the language type.
  • the phoneme conversion module is used to determine the prosodic structure of the text; according to the prosodic structure of the text, a prosody mark is added after the phoneme corresponding to each character in the text to form a phoneme sequence of the text.
  • the parameter conversion module is used to input the phoneme sequence into the acoustic parameter prediction model in the speech synthesis model and convert it into acoustic feature parameters, and to input the acoustic feature parameters into the vocoder parameter conversion model in the speech synthesis model to obtain the output vocoder feature parameters.
  • the acoustic parameter prediction model includes an encoder, a decoder, and an attention model; the parameter conversion module is used to determine, using the attention model, the attention weight of each feature representation output by the encoder at the current moment, and to judge whether the attention weight of the feature representation corresponding to a preset element in the phoneme sequence is the maximum among the attention weights; if so, the decoding process is ended.
  • the acoustic feature parameters include speech spectrum parameters; the vocoder parameter conversion model is composed of a multi-layer deep neural network and a long short-term memory network.
  • up-sampling is performed by repeating the acoustic characteristic parameter so that the frequency of the acoustic characteristic parameter is equal to the frequency of the vocoder characteristic parameter.
  • the model training module is used to: divide the speech sample corresponding to each training text into frames according to the preset frequency and extract acoustic feature parameters for each frame, to generate a first acoustic feature parameter sample corresponding to each training text; train the acoustic parameter prediction model using each training text and its first acoustic feature parameter sample; convert each training text into a second acoustic feature parameter sample using the trained acoustic parameter prediction model; convert the speech sample corresponding to each training text into a vocoder feature parameter sample according to the synthesis frequency of the vocoder; and train the vocoder parameter conversion model using the second acoustic feature parameter sample and the vocoder feature parameter sample corresponding to each training text.
  • the acoustic parameter prediction model includes an encoder, a decoder, and an attention model; the parameter conversion module is used to: input the phoneme sequence into the encoder to obtain the feature representation, output by the encoder, corresponding to each element in the phoneme sequence; input the feature representation corresponding to each element, the decoder hidden state output at the current moment by the first recurrent layer in the decoder, and the cumulative attention weight information corresponding to each element at the previous moment into the attention model to obtain the context vector; input the decoder hidden state output at the current moment by the first recurrent layer and the context vector into the second recurrent layer of the decoder to obtain the decoder hidden state output at the current moment by the second recurrent layer; and predict the acoustic feature parameters according to the decoder hidden state at each moment output by the decoder.
  • the phoneme conversion module is used to normalize each segment according to the language type to which each segment belongs; and perform word segmentation on each normalized segment according to the language type to which each segment belongs; The word segmentation of each segment is converted into corresponding phonemes according to the preset phoneme conversion table corresponding to the language type to which each segment belongs; wherein the phonemes include the tones corresponding to the characters.
  • a speech synthesis device is provided, including: a memory; and a processor coupled to the memory, the processor being configured to execute the speech synthesis method of any of the foregoing embodiments based on instructions stored in the memory.
  • a computer-readable storage medium having a computer program stored thereon, wherein the program is executed by a processor to implement the speech synthesis method of any of the foregoing embodiments.
  • the language category in the text is first identified, and the text is divided into multiple segments belonging to different language categories. According to the language type to which each segment belongs, each segment is converted into a corresponding phoneme.
  • the phoneme sequence of the text is input into the speech synthesis model and converted into vocoder feature parameters, and the vocoder outputs speech according to the vocoder feature parameters.
  • the solution of the present disclosure realizes an end-to-end speech synthesis system that supports pronunciation in multiple languages. Moreover, converting a phoneme sequence into vocoder feature parameters, rather than converting a character sequence directly, makes the synthesized speech more accurate, smooth, and natural.
  • FIG. 1 shows a schematic flowchart of a speech synthesis method according to some embodiments of the present disclosure.
  • Fig. 2 shows a schematic structural diagram of a speech synthesis model of some embodiments of the present disclosure.
  • Fig. 3 shows a schematic flowchart of a speech synthesis method according to other embodiments of the present disclosure.
  • Fig. 4 shows a schematic structural diagram of a speech synthesis device according to some embodiments of the present disclosure.
  • Fig. 5 shows a schematic structural diagram of a speech synthesis device according to other embodiments of the present disclosure.
  • Fig. 6 shows a schematic structural diagram of a speech synthesis device according to still other embodiments of the present disclosure.
  • the present disclosure proposes a speech synthesis method, which is described below in conjunction with FIG. 1.
  • Figure 1 is a flowchart of some embodiments of the disclosed speech synthesis method. As shown in Fig. 1, the method of this embodiment includes: steps S102 to S108.
  • in step S102, the text is divided into multiple segments belonging to different language types.
  • the language type to which each character belongs is identified according to the encoding of each character in the text; consecutive characters belonging to the same language type are divided into a segment of the language type. For example, if the text contains Chinese and English characters, the Unicode code or other codes of the characters in the text can be obtained, and the Chinese characters and English characters in the text can be recognized according to the Unicode code, and the text can be divided into multiple fragments in different languages. If it contains characters in other languages (for example, Japanese, French, etc.), it can be recognized according to the corresponding encoding form.
  • when a sentence contains only preset English characters, the sentence is marked as a Chinese sentence so that the preset English characters can subsequently be normalized as Chinese. For example, a preset English string such as 12km/h can be converted to "12 kilometers per hour" during normalization, and the resulting speech uses Chinese pronunciation, which better matches the habits of Chinese users.
  • when a sentence contains only certain special international characters, the sentence can be marked as a preset language type according to the pronunciation requirement, which facilitates subsequent text normalization and speech synthesis.
  • the above step (7) may include the following steps. (i) Determine whether the language type of the current character is the same as that of the previous character; if so, execute (ii), otherwise execute (iv). (ii) Move the current character into the current segment set. (iii) Determine whether the end of the sentence has been reached; if so, execute (iv), otherwise execute (v). (iv) Mark the language type of the characters in the current segment set and move them out of the current segment set. (v) Update the current character to the next character and return to (i).
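The character-by-character splitting in steps (i) to (v) can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names, the two Unicode ranges, and the choice to attach digits and punctuation to the segment in progress are all assumptions, and the special cases for sentences made up only of preset English characters are not covered.

```python
from typing import List, Optional, Tuple


def char_language(ch: str) -> str:
    """Classify one character by its Unicode code point (illustrative ranges only)."""
    cp = ord(ch)
    if 0x4E00 <= cp <= 0x9FFF:  # CJK Unified Ideographs block
        return "zh"
    if "a" <= ch <= "z" or "A" <= ch <= "Z":
        return "en"
    return "other"  # digits, punctuation, whitespace, ...


def split_by_language(text: str) -> List[Tuple[str, str]]:
    """Group consecutive characters of the same language into (language, segment) pairs."""
    segments: List[Tuple[str, str]] = []
    current = ""
    current_lang: Optional[str] = None
    for ch in text:
        lang = char_language(ch)
        # Assumption: neutral characters join the segment already in progress.
        if lang == "other" and current_lang is not None:
            lang = current_lang
        # Language changed: close the current segment and start a new one.
        if current and lang != current_lang:
            segments.append((current_lang, current))
            current = ""
        current += ch
        current_lang = lang
    if current:  # end of sentence: flush the last segment
        segments.append((current_lang, current))
    return segments
```

Calling `split_by_language` on a mixed Chinese and English sentence yields alternating `"zh"` and `"en"` segments in their original order, which is the input the later normalization and phoneme conversion steps expect.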
  • each segment is converted into corresponding phonemes according to the language type to which the segment belongs, and a phoneme sequence of the text is generated.
  • each segment is normalized according to the language type to which it belongs; each normalized segment is split into words according to the language type to which it belongs; and the word segments of each segment are converted into corresponding phonemes according to the preset phoneme conversion table corresponding to the language type of the segment.
  • the text usually contains a large number of non-standard forms, such as 12km/s and 2019. A normalization operation must convert such non-standard text into standardized text that the speech synthesis system can synthesize. Segments belonging to different languages need to be normalized separately. According to the special character comparison table of each language, irregular characters can be converted into standardized characters (for example, 12km/s is converted to "twelve kilometers per second") to facilitate subsequent phoneme conversion.
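As a rough illustration of table-driven normalization for an English segment, the sketch below expands a one- or two-digit number followed by a unit abbreviation. The table contents, the function names, and the restriction to numbers below 100 are assumptions made for brevity; the actual special character comparison tables are not given in the text.

```python
import re

# Hypothetical unit table for English segments (an illustrative assumption).
UNITS_EN = {
    "km/s": "kilometers per second",
    "km/h": "kilometers per hour",
}

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n: int) -> str:
    """Spell out an integer in the range 0..99."""
    if not 0 <= n < 100:
        raise ValueError("this sketch only handles 0..99")
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]}-{ONES[ones]}"


# A one- or two-digit number (not preceded by another digit), optional
# whitespace, then one of the known unit abbreviations.
_UNIT_PATTERN = re.compile(
    r"(?<!\d)(\d{1,2})\s*(" + "|".join(re.escape(u) for u in UNITS_EN) + r")"
)


def normalize_en(segment: str) -> str:
    """Expand '<number><unit>' patterns into words; leave other text unchanged."""
    def expand(match: "re.Match[str]") -> str:
        return f"{number_to_words(int(match.group(1)))} {UNITS_EN[match.group(2)]}"
    return _UNIT_PATTERN.sub(expand, segment)
```

A Chinese segment would use its own table and number-reading rules, which is why the text normalizes segments of different languages separately.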
  • each word segment can then be converted into corresponding phonemes (grapheme-to-phoneme conversion, G2P).
  • the preset phoneme conversion table may include the phoneme correspondences of polyphonic characters, so that polyphonic characters are converted into phonemes accurately. Polyphonic characters can also be recognized in other ways, or phoneme conversion can be performed with other existing technologies; the approach is not limited to the examples given.
  • the phoneme may include the tones corresponding to the characters, and using the tones as part of the phonemes can make the synthesized speech more accurate and natural.
  • Some languages, such as English, have no tones, so no tone mark needs to be added to the phoneme sequence for them.
  • the text can also be divided into prosodic structures, for example, to identify prosodic words and prosodic phrases in the text. According to the prosodic structure of the text, a prosodic mark is added after the phonemes corresponding to each character in the text to form the phoneme sequence of the text.
  • the prosody mark can be a special mark added after the phoneme corresponding to the prosodic word or the prosodic phrase to indicate a pause.
  • the prediction of the prosodic structure can adopt the existing technology, which will not be repeated here.
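The assembly of a phoneme sequence with tones and prosody marks can be sketched as below. Everything concrete here is a placeholder: the toy phoneme tables, the pinyin-style tone digits, the `#1` and `#3` prosody marks, and the function name are assumptions for illustration, and the prosodic structure is passed in rather than predicted. The `<EOS>` end symbol follows the convention mentioned elsewhere in this description.

```python
from typing import Dict, List, Tuple

# Hypothetical toy phoneme conversion tables, one per language type.
# Chinese entries carry the tone as a digit on the final; English entries do not.
PHONEME_TABLE: Dict[str, Dict[str, List[str]]] = {
    "zh": {"你": ["n", "i3"], "好": ["h", "ao3"]},
    "en": {"hello": ["HH", "AH", "L", "OW"]},
}


def build_phoneme_sequence(tokens: List[Tuple[str, str, str]]) -> List[str]:
    """Convert (language, token, prosody_mark) triples into one phoneme sequence.

    prosody_mark is '' when no prosodic boundary follows the token.
    """
    sequence: List[str] = []
    for lang, token, mark in tokens:
        sequence.extend(PHONEME_TABLE[lang][token])  # table lookup (G2P)
        if mark:
            sequence.append(mark)  # pause marker after a prosodic word/phrase
    sequence.append("<EOS>")  # end-of-sequence symbol
    return sequence
```

For example, `build_phoneme_sequence([("zh", "你", ""), ("zh", "好", "#1"), ("en", "hello", "#3")])` produces a single sequence in which both languages share one symbol inventory, which is what allows one acoustic model to handle mixed-language text.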
  • in step S106, the phoneme sequence is input into a pre-trained speech synthesis model and converted into vocoder feature parameters.
  • the phoneme sequence of the text may include the phoneme (including tone) and prosody mark corresponding to each character, and may also include some special symbols, such as the symbol <EOS> that indicates the end of the input phoneme sequence.
  • the training process of the speech synthesis model will be described later.
  • the speech synthesis model may include an acoustic parameter prediction model and a vocoder parameter conversion model.
  • the acoustic parameters include, for example, speech spectral parameters, such as Mel spectral parameters or linear spectral parameters.
  • the vocoder parameters are determined by the vocoder actually used. For example, if the WORLD vocoder is used, the vocoder parameters can include the fundamental frequency (F0), Mel-generalized cepstral coefficients (MGC), band aperiodicity (BAP), and so on.
  • the phoneme sequence is input into the acoustic parameter prediction model in the speech synthesis model and converted into acoustic feature parameters; the acoustic feature parameters are then input into the vocoder parameter conversion model in the speech synthesis model to obtain the output vocoder feature parameters.
  • the acoustic feature parameter prediction model adopts the Encoder-Decoder network structure, including: encoder, decoder and attention (Attention) model.
  • the length of the input phoneme sequence and the output acoustic feature parameter sequence may not match, and usually the acoustic feature parameter sequence will be relatively long.
  • the neural network structure based on Encoder-Decoder can perform flexible feature prediction, which conforms to the characteristics of speech synthesis.
  • the encoder can include three layers of one-dimensional convolution and a bidirectional LSTM (Long Short-Term Memory).
  • the three one-dimensional convolution layers can learn the local context information of each phoneme, and the bidirectional LSTM encoding can compute the bidirectional global information of each phoneme.
  • through the three one-dimensional convolution layers and the bidirectional LSTM encoding, the encoder module can obtain a highly expressive feature representation of the input phonemes that carries context information.
  • the decoder includes, for example, two fully connected layers and two LSTMs.
  • the two fully connected layers can use Dropout technology to prevent the occurrence of neural network over-fitting.
  • the attention model allows the decoder to learn, during decoding, which internal representations of the input phonemes need attention at the current decoding moment. Through the attention mechanism, the decoder can also learn which input phonemes have already completed parameter prediction and which phonemes need special attention at the current moment.
  • by combining the encoder context vector obtained from the attention model during decoding, the decoder can better predict the acoustic parameters needed at the current moment and decide whether to end the decoding process.
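  • the following steps can be performed in the acoustic feature parameter prediction model.
  • the phoneme sequence is input to the encoder to obtain the feature representation, output by the encoder, corresponding to each element in the phoneme sequence.
  • the feature representation corresponding to each element, the decoder hidden state output at the current moment by the first recurrent layer (such as the first LSTM) in the decoder, and the cumulative attention weight information corresponding to each element at the previous moment are input into the attention model to obtain the context vector.
  • the decoder hidden state output at the current moment by the first recurrent layer and the context vector are input into the second recurrent layer of the decoder to obtain the decoder hidden state output at the current moment by the second recurrent layer.
  • the acoustic feature parameters are predicted from the decoder hidden state at each moment; for example, the decoder hidden state sequence is linearly transformed to obtain the acoustic feature parameters.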
  • j represents the position of each element in the input phoneme sequence
  • M represents the total number of elements in the phoneme sequence.
  • the prosody mark in the phoneme sequence will also be converted into the corresponding hidden state, and then into the decoder hidden state.
  • the context vector can be calculated using the following formula.
  • i denotes the time step of the decoder, and j denotes the position of the element in the phoneme sequence corresponding to the encoder; i and j are positive integers.
  • v, W, V, U, and b are parameters learned during model training.
  • s_i denotes the decoder hidden state output at the current (i-th) moment by the first recurrent layer (such as the first LSTM) in the decoder.
  • h_j denotes the feature representation corresponding to the j-th element.
  • f_{i,j} is a vector in f_i, and F is a convolution kernel with a preset length.
  • α_{i-1} is the cumulative attention weight information (alignments) corresponding to each element at the (i-1)-th moment.
  • e_{i,j} is a numerical value, and e_i is the vector of these values over all elements.
  • β_i is a vector, and β_{i,j} denotes a value in β_i.
  • c_i denotes the context vector at the i-th moment.
  • M denotes the total number of elements in the phoneme sequence.
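The symbols defined above (a convolution kernel F applied to the previous cumulative weights, learned parameters v, W, V, U, b, and a per-element score that is normalized into weights) match the usual form of location-sensitive attention. Under that assumption, and taking α and β as the names of the cumulative weights and the current attention weights, the computation can be written as the following sketch rather than as the patent's exact formula:

```latex
\begin{aligned}
f_i     &= F * \alpha_{i-1} \\
e_{i,j} &= v^{\top} \tanh\left(W s_i + V h_j + U f_{i,j} + b\right) \\
\beta_i &= \operatorname{softmax}(e_i) \\
c_i     &= \sum_{j=1}^{M} \beta_{i,j}\, h_j
\end{aligned}
```

Here `*` denotes convolution along the phoneme position j, and the softmax is taken over the M elements of the phoneme sequence, so the weights β_{i,j} sum to 1 for each decoder step i.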
  • the attention model is used to determine the attention weight of each feature representation output by the encoder at the current moment, and to judge whether the attention weight of the feature representation corresponding to the preset element in the phoneme sequence is the maximum among the attention weights (that is, among the attention weights corresponding to all elements in the input phoneme sequence). If it is, the decoding process ends.
  • the attention weight represented by the feature is generated by the attention model.
  • the preset element is the last ⁇ EOS> symbol in the phoneme sequence.
  • the above method of judging whether to stop decoding can make the decoder stop decoding according to actual needs. Judge whether it is necessary to end the decoding process through the learned Alignments information. If the attention model has already shifted the attention to the last symbol during decoding, but the decoding process is not correctly predicted to end the decoding process, the system can force the end of the decoding process according to the Alignments information.
  • the above-mentioned auxiliary decoding end algorithm can solve the problem that the model predicts that the decoding process ends or the prediction ends incorrectly. It prevents the acoustic parameter prediction model from continuing to predict the acoustic characteristics of several frames, and finally synthesizing some incomprehensible speech, improving the system The accuracy, fluency and naturalness of speech output.
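The stopping test described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `should_stop`, `decode_with_forced_stop` and `step_fn` are hypothetical, the preset element is assumed to be the last element (`<EOS>`), and keeping the frame produced at the stopping step is a choice the text does not specify.

```python
def should_stop(attention_weights, preset_index=None):
    """Return True if the preset element holds the largest attention weight.

    attention_weights: one weight per element of the input phoneme sequence.
    preset_index: position of the preset element; defaults to the last
    element, which is assumed here to be the <EOS> symbol.
    """
    if not attention_weights:
        raise ValueError("attention_weights must not be empty")
    if preset_index is None:
        preset_index = len(attention_weights) - 1
    return attention_weights[preset_index] == max(attention_weights)


def decode_with_forced_stop(step_fn, max_steps=1000):
    """Collect frames from a hypothetical step_fn until the stop test fires.

    step_fn(i) is assumed to return (frame, attention_weights) for step i.
    max_steps is a safety bound, not something taken from the source.
    """
    frames = []
    for i in range(max_steps):
        frame, weights = step_fn(i)
        frames.append(frame)  # assumption: the final frame is kept
        if should_stop(weights):
            break
    return frames
```

In a real decoder this test would run alongside the model's own end-of-decoding prediction, and decoding would end when either one fires.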
  • after the acoustic feature parameters of the input phoneme sequence have been predicted, the acoustic feature parameters (for example, Mel spectrum parameters) are input into the vocoder parameter conversion model and converted into vocoder feature parameters; speech synthesis can then be performed by the vocoder.
  • the vocoder parameter conversion model can adopt a DNN-LSTM (deep neural network and long short-term memory network) structure.
  • the network structure can consist of multiple deep neural network layers and a long short-term memory network.
  • for example, as shown in FIG. 2, the network structure contains two ReLU (activation function) layers and one LSTM layer.
  • the acoustic feature parameters are first input into the DNN layers (for example, the ReLU layers), which learn a nonlinear transformation of the acoustic features and an internal feature representation; this amounts to a feature learning process.
  • the features output by the DNN layers are input into the LSTM, which learns the historical dependencies of the acoustic feature parameters so as to obtain a smoother feature conversion.
  • the inventors found through testing that vocoder parameter conversion works better when the network structure contains two ReLU layers and one LSTM layer.
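To make the data flow concrete, here is a minimal forward-pass sketch of a two-ReLU-layer plus one-LSTM-layer converter in plain Python. It is not the patented model: the layer sizes, the random initialization, the final linear output layer and all function names are assumptions made for illustration, and no training is shown.

```python
import math
import random


def linear(x, weights, bias):
    """Affine map: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def lstm_step(x, h, c, weights, bias):
    """One LSTM step; weights maps [x, h] to 4 * len(h) gate pre-activations."""
    n = len(h)
    gates = linear(x + h, weights, bias)
    i_gate = [sigmoid(z) for z in gates[:n]]
    f_gate = [sigmoid(z) for z in gates[n:2 * n]]
    cand = [math.tanh(z) for z in gates[2 * n:3 * n]]
    o_gate = [sigmoid(z) for z in gates[3 * n:]]
    c_new = [f * ck + i * g for f, ck, i, g in zip(f_gate, c, i_gate, cand)]
    h_new = [o * math.tanh(ck) for o, ck in zip(o_gate, c_new)]
    return h_new, c_new


def make_params(in_dim, hidden, out_dim, seed=0):
    """Random, untrained parameters (illustrative sizes only)."""
    rng = random.Random(seed)

    def init(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)]
                for _ in range(rows)]

    return {
        "dense1": (init(hidden, in_dim), [0.0] * hidden),
        "dense2": (init(hidden, hidden), [0.0] * hidden),
        "lstm": (init(4 * hidden, 2 * hidden), [0.0] * (4 * hidden)),
        "out": (init(out_dim, hidden), [0.0] * out_dim),  # assumed output layer
    }


def convert(frames, params):
    """Map a sequence of acoustic frames to vocoder-parameter frames."""
    hidden = len(params["dense1"][1])
    h, c = [0.0] * hidden, [0.0] * hidden
    outputs = []
    for frame in frames:
        x = relu(linear(frame, *params["dense1"]))
        x = relu(linear(x, *params["dense2"]))
        h, c = lstm_step(x, h, c, *params["lstm"])  # carries history forward
        outputs.append(linear(h, *params["out"]))
    return outputs
```

The LSTM state `h, c` is what carries information from earlier frames, which is the mechanism the text credits for smoother conversion.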
  • in some embodiments, when the frequency (frame rate) of the acoustic feature parameters is lower than that of the vocoder feature parameters, the acoustic feature parameters are up-sampled by repetition so that the two frequencies are equal.
  • for example, the acoustic parameter prediction model predicts parameters with 15 ms per frame, whereas the vocoder usually synthesizes speech with 5 ms per frame, so the two frame rates do not match. To resolve this, the output of the acoustic parameter prediction model needs to be up-sampled to match the frame rate of the vocoder.
  • up-sampling can be performed by repeating the output of the acoustic parameter prediction model. For example, repeating a 1*80-dimensional Mel spectrum parameter three times yields a 3*80-dimensional Mel spectrum parameter.
  • the inventors determined through testing that, compared with learning an up-sampling neural network or up-sampling by interpolation, directly repeating the features already achieves good results.
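As a small illustration of up-sampling by repetition, the sketch below repeats each frame a fixed number of times. The function name and the default factor of 3 (15 ms divided by 5 ms) are assumptions drawn from the example in the text.

```python
def upsample_by_repetition(frames, factor=3):
    """Repeat every frame `factor` times, preserving order.

    frames: list of feature vectors (for example 80-dimensional Mel frames).
    factor: ratio of the two frame periods, e.g. 15 ms / 5 ms = 3.
    """
    if factor < 1:
        raise ValueError("factor must be at least 1")
    # Copy each frame so that later edits to one copy do not affect the others.
    return [list(frame) for frame in frames for _ in range(factor)]
```

A sequence of N frames at 15 ms therefore becomes 3N frames at 5 ms, with each group of three frames identical.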
  • in step S108, the vocoder feature parameters are input into the vocoder to generate speech.
  • the vocoder parameter conversion model in the above embodiment can be combined with the WORLD vocoder. Compared with the prior-art WaveNet (whose network structure is complex and which cannot generate speech online in real time), its simple network architecture speeds up computation and enables real-time speech generation.
  • compared with the prior-art Griffin-Lim model, it reduces overlapping sounds and improves the quality of speech synthesis.
  • in the method of the above embodiment, the language types in the text are first identified, and the text is divided into multiple segments belonging to different language types. Each segment is converted into corresponding phonemes according to the language type to which it belongs.
  • the phoneme sequence of the text is input into the speech synthesis model and converted into vocoder feature parameters, and the vocoder outputs speech according to these parameters.
  • the solution of the foregoing embodiment realizes an end-to-end speech synthesis system that supports pronunciation in multiple languages. Converting a phoneme sequence into vocoder feature parameters, rather than converting the character sequence directly, makes the synthesized speech more accurate, fluent and natural. Generating the phoneme sequence with prosodic structure, tones and the like added further improves the synthesis quality.
  • with the new vocoder feature parameter conversion model, computation is faster and real-time speech generation is achieved, overlapping sounds are reduced, and the quality of speech synthesis is further improved.
  • the above embodiment also proposes a decoder termination method, which addresses the case where the model fails to predict the end of decoding or predicts it incorrectly, prevents the acoustic parameter prediction model from ultimately synthesizing unintelligible speech, and further improves the accuracy, fluency and naturalness of the system's speech output.
  • in some embodiments, the method of training the speech synthesis model includes: converting the speech sample corresponding to each training text into vocoder feature parameter samples according to the synthesis frequency of the vocoder; inputting each training text into the speech synthesis model to be trained to obtain output vocoder feature parameters; and comparing the output vocoder feature parameters with the corresponding vocoder feature parameter samples and adjusting the parameters of the model according to the comparison result until training is complete.
  • FIG. 3 is a flowchart of other embodiments of the speech synthesis method of the present disclosure. As shown in FIG. 3, the method of this embodiment includes steps S302 to S310.
  • in step S302, the speech samples corresponding to each training text are divided into frames according to a preset frequency, acoustic feature parameters are extracted for each frame, and first acoustic feature parameter samples corresponding to each training text are generated.
  • for example, each speech sample may be divided into frames of 15 ms each, and acoustic feature parameters may be extracted from each frame to generate the first acoustic feature parameter samples (for example, Mel spectrum parameters).
  • in step S304, the acoustic parameter prediction model is trained using each training text and its corresponding first acoustic feature parameter samples.
  • each training text can first be divided into segments belonging to different language types, each segment converted into corresponding phonemes according to its language type, and a phoneme sequence of the training text generated.
  • the phoneme sequence can include tones, prosody marks and so on.
  • the phoneme sequence of each training text is input into the acoustic parameter prediction model to obtain the output acoustic feature parameters corresponding to that training text.
  • the output acoustic feature parameters for a training text are compared with its first acoustic feature parameter samples, and the parameters of the acoustic parameter prediction model are adjusted according to the comparison result until a first preset target is met, which completes the training of the acoustic parameter prediction model.
  • in step S306, the trained acoustic parameter prediction model is used to convert each training text into second acoustic feature parameter samples.
  • inputting each training text into the trained acoustic parameter prediction model yields the second acoustic feature parameter samples corresponding to that training text.
  • in step S308, according to the synthesis frequency of the vocoder, the speech samples corresponding to each training text are converted into vocoder feature parameter samples.
  • for example, the speech samples can be divided into frames of 5 ms each, and each frame converted into vocoder feature parameter samples (for example, MGC, BAP, log F0). Step S308 may be executed at any point, as long as it precedes step S310.
  • in step S310, the vocoder parameter conversion model is trained using the second acoustic feature parameter samples and the vocoder feature parameter samples corresponding to each training text.
  • for example, each second acoustic feature parameter sample is input into the vocoder parameter conversion model to obtain output vocoder feature parameters.
  • the output vocoder feature parameters are compared with the corresponding vocoder feature parameter samples, and the parameters of the vocoder parameter conversion model are adjusted according to the comparison result until a second preset target is met, which completes the training of the vocoder parameter conversion model.
  • the method of the foregoing embodiment uses the acoustic feature parameters predicted by the acoustic parameter prediction model as training data for the vocoder parameter conversion model, which can improve the accuracy of the conversion model and make the synthesized speech more accurate, fluent and natural.
  • this is because, if the vocoder parameter conversion model were trained on real acoustic feature parameters (for example, Mel spectrum parameters) extracted directly from the speech files, the model's input features at synthesis time would not match its training features. Specifically, in actual speech synthesis the input features are the Mel spectra predicted by the acoustic parameter prediction model, and the error of the predicted acoustic feature parameters grows as the number of decoding steps increases.
  • the training of the conversion model, however, would use the real acoustic feature parameters of the sound files.
  • the trained model would then never have seen predicted acoustic feature parameters, or acoustic feature parameters carrying the errors accumulated during decoding. The mismatch between input features and training features would therefore severely degrade the performance of the vocoder parameter conversion model.
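The order of steps S302 to S310 can be summarized in a short orchestration sketch. All six callables passed in are hypothetical placeholders (feature extractors and training routines are not specified at this level in the source); the point is only that the conversion model is fitted on predicted acoustic features rather than on ground-truth ones.

```python
def train_two_stage(texts, waveforms, extract_acoustic, extract_vocoder,
                    fit_acoustic_model, fit_conversion_model):
    """Run the two-stage training flow with caller-supplied components.

    extract_acoustic(waveform)  -> first acoustic feature samples (e.g. Mel)
    extract_vocoder(waveform)   -> vocoder feature samples (e.g. MGC, BAP, log F0)
    fit_acoustic_model(texts, samples) -> callable mapping text to features
    fit_conversion_model(inputs, targets) -> trained conversion model
    """
    # S302: ground-truth acoustic features at the acoustic model's frame rate.
    first_samples = [extract_acoustic(w) for w in waveforms]
    # S304: train the acoustic parameter prediction model.
    acoustic_model = fit_acoustic_model(texts, first_samples)
    # S306: re-predict features for the training texts with the trained model.
    second_samples = [acoustic_model(t) for t in texts]
    # S308: vocoder targets at the vocoder's frame rate (any time before S310).
    vocoder_samples = [extract_vocoder(w) for w in waveforms]
    # S310: train the converter on predicted inputs, not ground-truth inputs.
    conversion_model = fit_conversion_model(second_samples, vocoder_samples)
    return acoustic_model, conversion_model
```

Because the converter sees the same kind of imperfect input during training that it will see during synthesis, the train/inference mismatch described above is avoided.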
  • the present disclosure also provides a speech synthesis device, which is described below with reference to FIG. 4.
  • FIG. 4 is a structural diagram of some embodiments of the speech synthesis device of the present disclosure.
  • the device 40 of this embodiment includes: a language recognition module 402, a phoneme conversion module 404, a parameter conversion module 406, and a speech generation module 408.
  • the language recognition module 402 divides the text into multiple segments belonging to different language types.
  • the language recognition module 402 is configured to recognize the language type to which each character belongs according to the encoding of each character in the text; divide consecutive characters belonging to the same language type into a segment of the language type.
  • the phoneme conversion module 404 is configured to convert each segment into a corresponding phoneme according to the language type to which each segment belongs to generate a phoneme sequence of the text.
  • the phoneme conversion module 404 is used to determine the prosodic structure of the text; according to the prosodic structure of the text, a prosody mark is added after the phoneme corresponding to each character in the text to form a phoneme sequence of the text.
  • the phoneme conversion module 404 is configured to perform text normalization on each segment according to the language type to which each segment belongs; and perform word segmentation on each normalized segment according to the language type to which each segment belongs; The word segmentation of each segment is converted into corresponding phonemes according to the preset phoneme conversion table corresponding to the language type to which each segment belongs; wherein the phonemes include the tones corresponding to the characters.
  • the parameter conversion module 406 is configured to input the phoneme sequence into the pre-trained speech synthesis model and convert it into the characteristic parameters of the vocoder.
  • the parameter conversion module 406 is configured to input the phoneme sequence into the acoustic parameter prediction model in the speech synthesis model and convert it into acoustic feature parameters; input the acoustic feature parameters into the vocoder parameter conversion model in the speech synthesis model to obtain the output Characteristic parameters of the vocoder.
  • the acoustic parameter prediction model includes: an encoder, a decoder, and an attention model; the parameter conversion module 406 is used to use the attention model to determine the attention weight represented by each feature output by the encoder at the current moment; to determine the phoneme Whether the attention weight represented by the feature corresponding to the preset element in the sequence is the maximum value among the attention weights, and if it is, the decoding process ends.
  • the acoustic characteristic parameters include speech frequency spectrum parameters;
  • the vocoder parameter conversion model consists of a multi-layer deep neural network and a long short-term memory network.
  • in some embodiments, when the frequency of the acoustic feature parameters is lower than that of the vocoder feature parameters, the acoustic feature parameters are up-sampled by repetition so that their frequency equals that of the vocoder feature parameters.
  • in some embodiments, the parameter conversion module 406 is configured to: input the phoneme sequence into the encoder to obtain the feature representation of each element of the phoneme sequence output by the encoder; input the feature representation of each element, the decoder hidden state output at the current time step by the first recurrent layer of the decoder, and the cumulative attention weight information of each element at the previous time step into the attention model to obtain a context vector; input the decoder hidden state output at the current time step by the first recurrent layer and the context vector into the second recurrent layer of the decoder to obtain the decoder hidden state output by the second recurrent layer at the current time step; and predict the acoustic feature parameters from the decoder hidden states output by the decoder at each time step.
  • the speech generating module 408 is used to input the characteristic parameters of the vocoder into the vocoder to generate speech.
  • in some embodiments, as shown in FIG. 4, the speech synthesis device 40 further includes a model training module 410, configured to: divide the speech samples corresponding to each training text into frames according to a preset frequency, extract acoustic feature parameters for each frame, and generate first acoustic feature parameter samples corresponding to each training text; train the acoustic parameter prediction model using each training text and its first acoustic feature parameter samples; convert each training text into second acoustic feature parameter samples using the trained acoustic parameter prediction model; convert the speech samples corresponding to each training text into vocoder feature parameter samples according to the synthesis frequency of the vocoder; and train the vocoder parameter conversion model using the second acoustic feature parameter samples and vocoder feature parameter samples corresponding to each training text.
  • the speech synthesis apparatus in the embodiments of the present disclosure can be implemented by various computing devices or computer systems, which are described below in conjunction with FIG. 5 and FIG. 6.
  • FIG. 5 is a structural diagram of some embodiments of the speech synthesis device of the present disclosure.
  • as shown in FIG. 5, the device 50 of this embodiment includes a memory 510 and a processor 520 coupled to the memory 510; the processor 520 is configured to execute, based on instructions stored in the memory 510, the speech synthesis method of any of the embodiments of the present disclosure.
  • the memory 510 may include, for example, a system memory, a fixed non-volatile storage medium, and the like.
  • the system memory stores, for example, an operating system, an application program, a boot loader (Boot Loader), a database, and other programs.
  • FIG. 6 is a structural diagram of other embodiments of the speech synthesis device of the present disclosure.
  • the device 60 of this embodiment includes a memory 610 and a processor 620, which are similar to the memory 510 and the processor 520, respectively. It may also include an input/output interface 630, a network interface 640, a storage interface 650, and so on. These interfaces 630, 640, 650, and the memory 610 and the processor 620 may be connected via a bus 660, for example.
  • the input and output interface 630 provides a connection interface for input and output devices such as a display, a mouse, a keyboard, and a touch screen.
  • the network interface 640 provides a connection interface for various networked devices, for example, can be connected to a database server or a cloud storage server.
  • the storage interface 650 provides a connection interface for external storage devices such as SD cards and U disks.
  • those skilled in the art should understand that the embodiments of the present disclosure may be provided as methods, systems, or computer program products. The present disclosure may therefore take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the present disclosure may take the form of a computer program product implemented on one or more computer-usable non-transitory storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
  • the present disclosure is described with reference to flowcharts and/or block diagrams of methods, devices (systems) and computer program products according to its embodiments. Each flow and/or block in the flowcharts and/or block diagrams, and combinations of them, can be implemented by computer program instructions. These instructions can be provided to the processor of a general-purpose computer, special-purpose computer, embedded processor or other programmable data processing device to produce a machine, such that the instructions executed by the processor produce means configured to implement the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
  • these computer program instructions can also be stored in a computer-readable memory that can direct a computer or other programmable data processing device to work in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
  • these computer program instructions can also be loaded onto a computer or other programmable data processing device, so that a series of operation steps is executed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps configured to implement the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Theoretical Computer Science (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

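A speech synthesis method, comprising: dividing a text into multiple segments belonging to different language types (S102); converting each segment into corresponding phonemes according to the language type to which it belongs, to generate a phoneme sequence of the text (S104); inputting the phoneme sequence into a pre-trained speech synthesis model to convert it into vocoder feature parameters (S106); and inputting the vocoder feature parameters into a vocoder to generate speech (S108).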

Description

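Speech synthesis method, device and computer-readable storage medium
Cross-reference to related applications
This application is based on, and claims priority to, CN application No. 201910266289.4, filed on April 3, 2019, the disclosure of which is incorporated herein in its entirety.
Technical field
The present disclosure relates to the field of computer technology, and in particular to a speech synthesis method, a device and a computer-readable storage medium.
Background
A speech synthesis system performs text-to-speech (TTS) conversion: it turns text into sound through a series of algorithmic operations, so that a machine imitates human pronunciation.
Current speech synthesis systems generally support pronunciation in only a single language.
Summary
The inventors found that current speech synthesis systems generally support only Chinese or only English pronunciation and cannot pronounce multiple languages fluently.
One technical problem to be solved by the present disclosure is how to realize an end-to-end speech synthesis system that supports pronunciation in multiple languages.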
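According to some embodiments of the present disclosure, a speech synthesis method is provided, including: dividing a text into multiple segments belonging to different language types; converting each segment into corresponding phonemes according to the language type to which it belongs, to generate a phoneme sequence of the text; inputting the phoneme sequence into a pre-trained speech synthesis model to convert it into vocoder feature parameters; and inputting the vocoder feature parameters into a vocoder to generate speech.
In some embodiments, dividing the text into multiple segments belonging to different language types includes: identifying the language type of each character according to the encoding of each character in the text; and grouping consecutive characters of the same language type into one segment of that language type.
In some embodiments, generating the phoneme sequence of the text includes: determining the prosodic structure of the text; and, according to the prosodic structure, adding prosody marks after the phonemes corresponding to the characters in the text to form the phoneme sequence of the text.
In some embodiments, inputting the phoneme sequence into the pre-trained speech synthesis model to convert it into vocoder feature parameters includes: inputting the phoneme sequence into an acoustic parameter prediction model in the speech synthesis model to convert it into acoustic feature parameters; and inputting the acoustic feature parameters into a vocoder parameter conversion model in the speech synthesis model to obtain the output vocoder feature parameters.
In some embodiments, the acoustic parameter prediction model includes an encoder, a decoder and an attention model; inputting the phoneme sequence into the acoustic parameter prediction model to convert it into acoustic feature parameters includes: using the attention model to determine the attention weight of each feature representation output by the encoder at the current time step; and determining whether the attention weight of the feature representation corresponding to a preset element in the phoneme sequence is the maximum among the attention weights and, if so, ending the decoding process.
In some embodiments, the acoustic feature parameters include speech spectrum parameters; the vocoder parameter conversion model consists of a multi-layer deep neural network and a long short-term memory network.
In some embodiments, when the frequency of the acoustic feature parameters is lower than that of the vocoder feature parameters, the acoustic feature parameters are up-sampled by repetition so that their frequency equals that of the vocoder feature parameters.
In some embodiments, the method further includes training the speech synthesis model, where the training method includes: dividing the speech samples corresponding to each training text into frames according to a preset frequency, extracting acoustic feature parameters for each frame, and generating first acoustic feature parameter samples corresponding to each training text; training the acoustic parameter prediction model using each training text and its first acoustic feature parameter samples; converting each training text into second acoustic feature parameter samples using the trained acoustic parameter prediction model; converting the speech samples corresponding to each training text into vocoder feature parameter samples according to the synthesis frequency of the vocoder; and training the vocoder parameter conversion model using the second acoustic feature parameter samples and vocoder feature parameter samples corresponding to each training text.
In some embodiments, the acoustic parameter prediction model includes an encoder, a decoder and an attention model; inputting the phoneme sequence into the acoustic parameter prediction model to convert it into acoustic feature parameters includes: inputting the phoneme sequence into the encoder to obtain the feature representation of each element of the phoneme sequence output by the encoder; inputting the feature representation of each element, the decoder hidden state output at the current time step by the first recurrent layer of the decoder, and the cumulative attention weight information of each element at the previous time step into the attention model to obtain a context vector; inputting the decoder hidden state output at the current time step by the first recurrent layer and the context vector into the second recurrent layer of the decoder to obtain the decoder hidden state output by the second recurrent layer at the current time step; and predicting the acoustic feature parameters from the decoder hidden states output by the decoder at each time step.
In some embodiments, converting each segment into corresponding phonemes according to its language type includes: text-normalizing each segment according to its language type; word-segmenting each normalized segment according to its language type; and converting the words of each segment into corresponding phonemes according to a preset phoneme conversion table for the segment's language type, where the phonemes include the tones of the corresponding characters.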
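According to other embodiments of the present disclosure, a speech synthesis device is provided, including: a language recognition module, configured to divide a text into multiple segments belonging to different language types; a phoneme conversion module, configured to convert each segment into corresponding phonemes according to the language type to which it belongs, to generate a phoneme sequence of the text; a parameter conversion module, configured to input the phoneme sequence into a pre-trained speech synthesis model to convert it into vocoder feature parameters; and a speech generation module, configured to input the vocoder feature parameters into a vocoder to generate speech.
In some embodiments, the language recognition module is configured to identify the language type of each character according to the encoding of each character in the text, and to group consecutive characters of the same language type into one segment of that language type.
In some embodiments, the phoneme conversion module is configured to determine the prosodic structure of the text and, according to the prosodic structure, to add prosody marks after the phonemes corresponding to the characters in the text to form the phoneme sequence of the text.
In some embodiments, the parameter conversion module is configured to input the phoneme sequence into the acoustic parameter prediction model in the speech synthesis model to convert it into acoustic feature parameters, and to input the acoustic feature parameters into the vocoder parameter conversion model in the speech synthesis model to obtain the output vocoder feature parameters.
In some embodiments, the acoustic parameter prediction model includes an encoder, a decoder and an attention model; the parameter conversion module is configured to use the attention model to determine the attention weight of each feature representation output by the encoder at the current time step, and to determine whether the attention weight of the feature representation corresponding to a preset element in the phoneme sequence is the maximum among the attention weights and, if so, to end the decoding process.
In some embodiments, the acoustic feature parameters include speech spectrum parameters; the vocoder parameter conversion model consists of a multi-layer deep neural network and a long short-term memory network.
In some embodiments, when the frequency of the acoustic feature parameters is lower than that of the vocoder feature parameters, the acoustic feature parameters are up-sampled by repetition so that their frequency equals that of the vocoder feature parameters.
In some embodiments, a model training module is configured to: divide the speech samples corresponding to each training text into frames according to a preset frequency, extract acoustic feature parameters for each frame, and generate first acoustic feature parameter samples corresponding to each training text; train the acoustic parameter prediction model using each training text and its first acoustic feature parameter samples; convert each training text into second acoustic feature parameter samples using the trained acoustic parameter prediction model; convert the speech samples corresponding to each training text into vocoder feature parameter samples according to the synthesis frequency of the vocoder; and train the vocoder parameter conversion model using the second acoustic feature parameter samples and vocoder feature parameter samples corresponding to each training text.
In some embodiments, the acoustic parameter prediction model includes an encoder, a decoder and an attention model; the parameter conversion module is configured to: input the phoneme sequence into the encoder to obtain the feature representation of each element of the phoneme sequence output by the encoder; input the feature representation of each element, the decoder hidden state output at the current time step by the first recurrent layer of the decoder, and the cumulative attention weight information of each element at the previous time step into the attention model to obtain a context vector; input the decoder hidden state output at the current time step by the first recurrent layer and the context vector into the second recurrent layer of the decoder to obtain the decoder hidden state output by the second recurrent layer at the current time step; and predict the acoustic feature parameters from the decoder hidden states output by the decoder at each time step.
In some embodiments, the phoneme conversion module is configured to: text-normalize each segment according to its language type; word-segment each normalized segment according to its language type; and convert the words of each segment into corresponding phonemes according to a preset phoneme conversion table for the segment's language type, where the phonemes include the tones of the corresponding characters.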
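According to still other embodiments of the present disclosure, a speech synthesis device is provided, including: a memory; and a processor coupled to the memory, the processor being configured to execute, based on instructions stored in the memory, the speech synthesis method of any of the foregoing embodiments.
According to yet other embodiments of the present disclosure, a computer-readable storage medium is provided, on which a computer program is stored, where the program, when executed by a processor, implements the speech synthesis method of any of the foregoing embodiments.
In the present disclosure, the language types in the text are first identified, and the text is divided into multiple segments belonging to different language types. Each segment is converted into corresponding phonemes according to the language type to which it belongs. The phoneme sequence of the text is input into the speech synthesis model and converted into vocoder feature parameters, and the vocoder outputs speech according to these parameters. The solution of the present disclosure realizes an end-to-end speech synthesis system that supports pronunciation in multiple languages. Moreover, converting a phoneme sequence into vocoder feature parameters, rather than converting the character sequence directly, makes the synthesized speech more accurate, fluent and natural.
Other features and advantages of the present disclosure will become clear from the following detailed description of exemplary embodiments with reference to the drawings.
Brief description of the drawings
The drawings described here provide a further understanding of the present disclosure and form part of this application. The illustrative embodiments of the present disclosure and their description serve to explain the present disclosure and do not unduly limit it. In the drawings:
FIG. 1 is a schematic flowchart of a speech synthesis method according to some embodiments of the present disclosure.
FIG. 2 is a schematic structural diagram of a speech synthesis model according to some embodiments of the present disclosure.
FIG. 3 is a schematic flowchart of a speech synthesis method according to other embodiments of the present disclosure.
FIG. 4 is a schematic structural diagram of a speech synthesis device according to some embodiments of the present disclosure.
FIG. 5 is a schematic structural diagram of a speech synthesis device according to other embodiments of the present disclosure.
FIG. 6 is a schematic structural diagram of a speech synthesis device according to still other embodiments of the present disclosure.
Detailed description
The technical solutions in the embodiments of the present disclosure are described clearly and completely below with reference to the drawings. The described embodiments are only some, not all, of the embodiments of the present disclosure. The following description of at least one exemplary embodiment is merely illustrative and in no way limits the present disclosure or its application or use. All other embodiments obtained by a person of ordinary skill in the art from the embodiments of the present disclosure without creative effort fall within the scope of protection of the present disclosure.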
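The present disclosure proposes a speech synthesis method, described below with reference to FIG. 1.
FIG. 1 is a flowchart of some embodiments of the speech synthesis method of the present disclosure. As shown in FIG. 1, the method of this embodiment includes steps S102 to S108.
In step S102, the text is divided into multiple segments belonging to different language types.
In some embodiments, the language type of each character is identified according to the encoding of each character in the text, and consecutive characters of the same language type are grouped into one segment of that language type. For example, when the text contains Chinese and English characters, the Unicode code (or another encoding) of each character can be obtained and used to identify the Chinese characters and the English characters, so that the text is divided into segments of different languages. Characters of other languages (for example, Japanese or French) can be identified from their corresponding encodings.
The following describes a specific embodiment of the segmentation, taking text containing Chinese and English as an example. (1) According to the encoding of the characters in a sentence, determine whether the sentence contains English characters; if not, go to (2), otherwise go to (3). (2) Mark the sentence as a Chinese sentence. (3) Determine whether the sentence contains Chinese characters; if not, go to (4), otherwise go to (7). (4) Determine whether the sentence contains only preset English characters, which may include at least one of units of measurement, abbreviations and English numbering; if so, go to (5), otherwise go to (6). (5) Mark the sentence as a Chinese sentence. (6) Mark the sentence as an English sentence. (7) Divide the sentence into Chinese segments and English segments.
In the above embodiment, a sentence that contains only preset English characters is marked as a Chinese sentence, so that the preset English characters can later be normalized as Chinese. For example, preset English characters such as 12km/h can be converted to 12千米每小时 during normalization, so that the resulting speech is read in Chinese, which better suits Chinese users. Those skilled in the art will understand that, following the above embodiment, a sentence containing only certain internationally used special characters can be marked as a preset language type according to pronunciation needs, which facilitates subsequent text normalization and speech synthesis.
Step (7) may include the following steps. (i) Determine whether the language type of the current character is the same as that of the previous character; if so, go to (ii), otherwise go to (iv). (ii) Move the current character into the current segment set. (iii) Determine whether the end of the sentence has been reached; if so, go to (iv), otherwise go to (v). (iv) Mark the characters in the current segment set with their language type and remove them from the current segment set. (v) Make the next character the current character and return to (i).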
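The character-by-character segmentation of step (7) can be sketched as follows. This is an illustrative simplification, not the disclosed procedure in full: it covers only Chinese and English, tests only the basic CJK Unified Ideographs range, attaches digits, spaces and punctuation to the segment in progress, and omits the sentence-level rules (1) to (6). The function names and language tags are hypothetical.

```python
def char_language(ch):
    """Classify one character by code point; None means language-neutral."""
    if "\u4e00" <= ch <= "\u9fff":      # basic CJK Unified Ideographs block
        return "zh"
    if ch.isascii() and ch.isalpha():   # ASCII Latin letters
        return "en"
    return None                         # digits, spaces, punctuation, others


def split_by_language(text, default="zh"):
    """Group consecutive characters of the same language into segments."""
    segments = []
    current_lang, current = None, []
    for ch in text:
        # Neutral characters inherit the language of the running segment.
        lang = char_language(ch) or current_lang or default
        if current and lang != current_lang:
            segments.append((current_lang, "".join(current)))
            current = []
        current_lang = lang
        current.append(ch)
    if current:  # flush the last segment at the end of the sentence
        segments.append((current_lang, "".join(current)))
    return segments
```

For instance, `split_by_language("我爱Python编程")` yields a Chinese segment, an English segment and another Chinese segment, in that order.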
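In step S104, each segment is converted into corresponding phonemes according to the language type to which it belongs, and a phoneme sequence of the text is generated.
In some embodiments, each segment is text-normalized according to its language type; each normalized segment is word-segmented according to its language type; and the words of each segment are converted into corresponding phonemes according to a preset phoneme conversion table for the segment's language type. Text usually contains many non-standard abbreviations and expressions, such as 12km/s or 2019年, which must be normalized into standard text suitable for speech synthesis. Segments of different language types are normalized separately, using a special-character lookup table for each language type to convert non-standard characters into standard ones, for example converting 12km/s to 十二千米每秒, which facilitates the subsequent phoneme conversion.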
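A toy version of table-driven normalization for Chinese segments might look like the sketch below. The table contents, the restriction to one- or two-digit numbers and the function names are all assumptions chosen to reproduce the 12km/s example; a production normalizer would need far more rules.

```python
import re

# Hypothetical lookup table: unit abbreviation -> Chinese reading.
UNIT_READINGS = {"km/h": "千米每小时", "km/s": "千米每秒"}
DIGITS = "零一二三四五六七八九"

# A one- or two-digit number (not preceded by another digit) followed by a unit.
_PATTERN = re.compile(r"(?<!\d)(\d{1,2})(km/h|km/s)")


def read_number(n):
    """Spell an integer from 0 to 99 in Chinese numerals."""
    if not 0 <= n <= 99:
        raise ValueError("only 0..99 is supported in this sketch")
    if n < 10:
        return DIGITS[n]
    tens, ones = divmod(n, 10)
    head = "十" if tens == 1 else DIGITS[tens] + "十"
    return head + (DIGITS[ones] if ones else "")


def normalize_zh(text):
    """Replace '<number><unit>' patterns with their spoken Chinese form."""
    def spoken(match):
        return read_number(int(match.group(1))) + UNIT_READINGS[match.group(2)]
    return _PATTERN.sub(spoken, text)
```

Text that matches no rule is returned unchanged, so the function can be applied safely to every Chinese segment.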
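Different languages are segmented into words differently: English is segmented by word, whereas Chinese must be segmented using semantic information and the like. Each segment is therefore word-segmented according to its language type. Each word can be converted into phonemes (grapheme-to-phoneme, G2P) by looking it up in the preset phoneme conversion table for its language type. Words absent from the table (out-of-vocabulary, OOV), such as misspelled words, newly coined words and internet slang, can be converted by existing techniques such as neural networks. The preset phoneme conversion table may include the phoneme mappings of polyphonic characters so that they are converted accurately. Polyphonic characters may also be identified in other ways, and phoneme conversion may use other existing techniques; the disclosure is not limited to the examples given.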
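The table lookup with an out-of-vocabulary fallback can be illustrated with a short sketch. The lexicon entries and phoneme symbols below are invented for the example, and the letter-by-letter `spell_out` fallback merely stands in for the neural G2P model the text mentions.

```python
# Hypothetical per-language phoneme tables (tones written as trailing digits).
LEXICON = {
    "zh": {"你好": ["n", "i3", "h", "ao3"]},
    "en": {"hello": ["HH", "AH0", "L", "OW1"]},
}


def spell_out(word):
    """Placeholder OOV fallback: read the word letter by letter."""
    return [ch.upper() for ch in word if ch.isalpha()]


def to_phonemes(words, lang, lexicon=LEXICON, fallback=spell_out):
    """Convert segmented words of one language into a flat phoneme list."""
    table = lexicon.get(lang, {})
    phonemes = []
    for word in words:
        key = word.lower() if lang == "en" else word
        entry = table.get(key)
        # Use the table when possible, otherwise defer to the fallback model.
        phonemes.extend(entry if entry is not None else fallback(word))
    return phonemes
```

Passing a different `fallback` callable is how a trained G2P model would be plugged in without changing the lookup logic.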
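In some embodiments, the phonemes may include the tone of the corresponding character; treating the tone as part of the phoneme can make the synthesized speech more accurate and natural. Some languages, such as English, have no tones, so no tone marks need to be added to the phoneme sequence for them. In some embodiments, the prosodic structure of the text may also be determined, for example by identifying the prosodic words and prosodic phrases in the text. According to the prosodic structure, prosody marks are added after the phonemes corresponding to the characters in the text to form the phoneme sequence of the text. A prosody mark may be a special symbol indicating a pause, added after the phonemes of a prosodic word or prosodic phrase. Prosodic structure can be predicted with existing techniques, which are not described further here.
In step S106, the phoneme sequence is input into the pre-trained speech synthesis model and converted into vocoder feature parameters.
According to the above embodiments, the phoneme sequence of the text may include the phonemes (including tones) corresponding to each character, prosody marks, and some special symbols, for example the symbol <EOS> indicating the end of the input phoneme sequence. The training process of the speech synthesis model is described later.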
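As a small illustration of how such a sequence might be assembled, the sketch below appends a pause mark after each prosodic word and a final end symbol. The mark `#1` and the function name are hypothetical; only `<EOS>` is taken from the text.

```python
def build_phoneme_sequence(prosodic_words, pause_mark="#1", eos="<EOS>"):
    """Flatten per-word phoneme lists, adding a pause mark after each word.

    prosodic_words: list of phoneme lists, one per prosodic word.
    pause_mark: hypothetical symbol for a prosodic boundary.
    eos: symbol marking the end of the input phoneme sequence.
    """
    sequence = []
    for phonemes in prosodic_words:
        sequence.extend(phonemes)
        sequence.append(pause_mark)
    sequence.append(eos)
    return sequence
```

The final `<EOS>` element is the one the attention-based stopping test later watches for.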
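In some embodiments, the speech synthesis model may include an acoustic parameter prediction model and a vocoder parameter conversion model. The acoustic parameters include, for example, speech spectrum parameters such as Mel spectrum parameters or linear spectrum parameters. The vocoder parameters depend on the vocoder actually used; for example, with the WORLD vocoder they may include the fundamental frequency (F0), Mel-generalized cepstral coefficients (MGC), band aperiodicity (BAP) and so on. Inputting the phoneme sequence into the acoustic parameter prediction model in the speech synthesis model converts it into acoustic feature parameters; inputting the acoustic feature parameters into the vocoder parameter conversion model in the speech synthesis model yields the output vocoder feature parameters.
The acoustic feature parameter prediction model uses an encoder-decoder network structure, including an encoder, a decoder and an attention model. The lengths of the input phoneme sequence and the output acoustic feature parameter sequence need not match; the acoustic feature parameter sequence is usually longer. An encoder-decoder neural network can predict features flexibly, which suits speech synthesis. The encoder may contain three one-dimensional convolution layers and a bidirectional LSTM (Long Short-Term Memory network). The three convolution layers learn the local context of each phoneme, and the bidirectional LSTM computes bidirectional global information for each phoneme. Through these layers, the encoder obtains a highly expressive, context-aware feature representation of the input phonemes.
The decoder contains, for example, two fully connected layers and two LSTM layers. The fully connected layers can use dropout to prevent overfitting. The attention model lets the decoder learn, at each decoding step, which internal representations of the input phonemes to attend to; through the attention mechanism, the decoder can also learn which input phonemes have already had their parameters predicted and which phonemes need particular attention at the current step. The attention model yields a context vector over the encoder outputs; combining this context vector during decoding allows better prediction of the acoustic parameters needed at the current step and of whether to end the decoding process.
In some embodiments, the acoustic feature parameter prediction model may perform the following steps. The phoneme sequence is input into the encoder to obtain the feature representation of each element of the phoneme sequence. The feature representation of each element, the decoder hidden state output at the current time step by the first recurrent layer of the decoder (for example, the first LSTM), and the cumulative attention weight information of each element at the previous time step are input into the attention model to obtain a context vector. The decoder hidden state output at the current time step by the first recurrent layer and the context vector are input into the second recurrent layer of the decoder to obtain the decoder hidden state output by the second recurrent layer at the current time step. The acoustic feature parameters are predicted from the decoder hidden states output by the decoder at each time step, for example by applying a linear transformation to the sequence of decoder hidden states.
For example, the input phoneme sequence is X = [x_1, x_2, …, x_j, …, x_M] and the feature representation sequence output by the encoder is H = [h_1, h_2, …, h_j, …, h_M], where j is the position of each element in the input phoneme sequence and M is the total number of elements in the phoneme sequence. The sequence of hidden states output by the decoder is S = [s_1, s_2, …, s_i, …], where i is the decoder time step. The prosody marks in the phoneme sequence are also converted into corresponding hidden states, and in turn into decoder hidden states.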
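For example, the context vector can be computed with the following formulas.
$$e_{i,j} = v^{T}\tanh(W s_i + V h_j + U f_{i,j} + b) \qquad (1)$$
$$f_i = F * \alpha_{i-1} \qquad (2)$$
$$\beta_i = \mathrm{softmax}(e_i) \qquad (3)$$
Figure PCTCN2020082172-appb-000001
Here i denotes the decoder time step, j denotes the position of an element in the phoneme sequence fed to the encoder, and i and j are positive integers. v, W, V, U and b are parameters learned during model training. s_i denotes the decoder hidden state output at the current (i-th) time step by the first recurrent layer of the decoder (for example, the first LSTM). h_j denotes the feature representation of the j-th element, f_{i,j} is a vector in f_i, F is a convolution kernel of preset length, and α_{i-1} is the cumulative attention weight information (alignments) of each element at time step i-1. e_{i,j} is a scalar, e_i denotes the vector formed by the values e_{i,j} of all elements, β_i is a vector, β_{i,j} denotes a value in β_i, c_i denotes the context vector at the i-th time step, and M denotes the total number of elements in the phoneme sequence.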
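The following pure-Python sketch works through formulas (1) to (3) for a single decoder step. Two points are assumptions rather than statements from the source: the fourth formula appears only as an image, so the context vector is taken here to be the usual weighted sum c_i = Σ_j β_{i,j} h_j (consistent with the symbol definitions), and the cumulative weights are assumed to update as α_i = α_{i-1} + β_i. The parameter dictionary layout and function names are likewise hypothetical.

```python
import math


def matvec(matrix, x):
    return [sum(a * b for a, b in zip(row, x)) for row in matrix]


def location_features(filters, alpha_prev):
    """Formula (2): slide each kernel over alpha_prev with zero padding.

    Implemented as cross-correlation (no kernel flip), as is common in
    neural network libraries. Returns one feature vector f_{i,j} per j.
    """
    pad = len(filters[0]) // 2
    features = []
    for j in range(len(alpha_prev)):
        vec = []
        for kernel in filters:
            acc = 0.0
            for t, w in enumerate(kernel):
                idx = j + t - pad
                if 0 <= idx < len(alpha_prev):
                    acc += w * alpha_prev[idx]
            vec.append(acc)
        features.append(vec)
    return features


def softmax(values):
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def attention_step(s_i, h, alpha_prev, p):
    """Return (context c_i, weights beta_i, updated cumulative alpha_i)."""
    f = location_features(p["F"], alpha_prev)
    ws = matvec(p["W"], s_i)
    energies = []
    for h_j, f_j in zip(h, f):
        vh, uf = matvec(p["V"], h_j), matvec(p["U"], f_j)
        hidden = [math.tanh(a + b + c + d)
                  for a, b, c, d in zip(ws, vh, uf, p["b"])]
        energies.append(sum(v * x for v, x in zip(p["v"], hidden)))  # (1)
    beta = softmax(energies)                                          # (3)
    context = [sum(beta[j] * h[j][d] for j in range(len(h)))
               for d in range(len(h[0]))]          # assumed weighted sum
    alpha = [a + b for a, b in zip(alpha_prev, beta)]  # assumed update rule
    return context, beta, alpha
```

Feeding the returned `alpha` back in as `alpha_prev` at the next step is what lets the model track which phonemes have already been attended to.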
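In some embodiments, the attention model is used to determine the attention weight of each feature representation output by the encoder at the current time step; it is then determined whether the attention weight of the feature representation corresponding to a preset element in the phoneme sequence is the maximum among all attention weights (that is, the attention weights of all elements in the input phoneme sequence) and, if so, the decoding process ends. The attention weights of the feature representations are generated by the attention model. The preset element is, for example, the final <EOS> symbol of the phoneme sequence.
This stopping test lets the decoder stop decoding according to actual need: whether to end decoding is judged from the learned alignments information. If the attention model has already moved its attention to the final symbol during decoding, but the model has not correctly predicted the end of decoding, the system can force decoding to end based on the alignments information. This auxiliary end-of-decoding algorithm addresses the case where the model fails to predict the end of decoding or predicts it incorrectly, prevents the acoustic parameter prediction model from continuing to predict several more frames of acoustic features and ultimately synthesizing unintelligible speech, and improves the accuracy, fluency and naturalness of the system's speech output.
After the acoustic feature parameters of the input phoneme sequence have been predicted, the acoustic feature parameters (for example, Mel spectrum parameters) are input into the vocoder parameter conversion model and converted into vocoder feature parameters; speech synthesis can then be performed by the vocoder.
The vocoder parameter conversion model can adopt a DNN-LSTM (deep neural network and long short-term memory network) structure, which can consist of multiple deep neural network layers and a long short-term memory network. For example, as shown in FIG. 2, the network structure contains two ReLU (activation function) layers and one LSTM layer. The acoustic feature parameters are first input into the DNN layers (for example, the ReLU layers), which learn a nonlinear transformation of the acoustic features and an internal feature representation; this amounts to a feature learning process. The features output by the DNN layers are input into the LSTM, which learns the historical dependencies of the acoustic feature parameters so as to obtain a smoother feature conversion. The inventors found through testing that vocoder parameter conversion works better when the network structure contains two ReLU layers and one LSTM layer.
In some embodiments, when the frequency of the acoustic feature parameters is lower than that of the vocoder feature parameters, the acoustic feature parameters are up-sampled by repetition so that their frequency equals that of the vocoder feature parameters. For example, the acoustic parameter prediction model predicts parameters with 15 ms per frame, whereas the vocoder usually synthesizes speech with 5 ms per frame, so the two frame rates do not match. To resolve this, the output of the acoustic parameter prediction model needs to be up-sampled to match the frame rate of the vocoder. Up-sampling can be performed by repeating the output of the acoustic parameter prediction model; for example, repeating a 1*80-dimensional Mel spectrum parameter three times yields a 3*80-dimensional Mel spectrum parameter. The inventors determined through testing that, compared with learning an up-sampling neural network or up-sampling by interpolation, directly repeating the features already achieves good results.
In step S108, the vocoder feature parameters are input into the vocoder to generate speech.
The vocoder parameter conversion model in the above embodiment can be combined with the WORLD vocoder. Compared with the prior-art WaveNet (whose network structure is complex and which cannot generate speech online in real time), its simple network architecture speeds up computation and enables real-time speech generation; compared with the prior-art Griffin-Lim model, it reduces overlapping sounds and improves the quality of speech synthesis.
In the method of the above embodiment, the language types in the text are first identified, and the text is divided into multiple segments belonging to different language types. Each segment is converted into corresponding phonemes according to the language type to which it belongs. The phoneme sequence of the text is input into the speech synthesis model and converted into vocoder feature parameters, and the vocoder outputs speech according to these parameters. The solution of the above embodiment realizes an end-to-end speech synthesis system that supports pronunciation in multiple languages; converting a phoneme sequence into vocoder feature parameters, rather than converting the character sequence directly, makes the synthesized speech more accurate, fluent and natural. Generating the phoneme sequence with prosodic structure, tones and the like added further improves the synthesis quality. The new vocoder feature parameter conversion model speeds up computation, enables real-time speech generation, reduces overlapping sounds, and further improves the quality of speech synthesis. The above embodiment also proposes a decoder termination method, which addresses the case where the model fails to predict the end of decoding or predicts it incorrectly, prevents the acoustic parameter prediction model from ultimately synthesizing unintelligible speech, and further improves the accuracy, fluency and naturalness of the system's speech output.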
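In some embodiments, the method of training the speech synthesis model includes: converting the speech sample corresponding to each training text into vocoder feature parameter samples according to the synthesis frequency of the vocoder; inputting each training text into the speech synthesis model to be trained to obtain output vocoder feature parameters; and comparing the output vocoder feature parameters with the corresponding vocoder feature parameter samples and adjusting the parameters of the model according to the comparison result until training is complete.
To further improve the accuracy of the vocoder parameter conversion model, some embodiments of the training method of the speech synthesis model of the present disclosure are described below with reference to FIG. 3.
FIG. 3 is a flowchart of other embodiments of the speech synthesis method of the present disclosure. As shown in FIG. 3, the method of this embodiment includes steps S302 to S310.
In step S302, the speech samples corresponding to each training text are divided into frames according to a preset frequency, acoustic feature parameters are extracted for each frame, and first acoustic feature parameter samples corresponding to each training text are generated.
For example, each speech sample can be divided into frames of 15 ms each, and acoustic feature parameters extracted from each frame to generate the first acoustic feature parameter samples (for example, Mel spectrum parameters).
In step S304, the acoustic parameter prediction model is trained using each training text and its corresponding first acoustic feature parameter samples.
Each training text can first be divided into segments belonging to different language types, each segment converted into corresponding phonemes according to its language type, and a phoneme sequence of the training text generated. The phoneme sequence can include tones, prosody marks and so on. The phoneme sequence of each training text is input into the acoustic parameter prediction model to obtain the output acoustic feature parameters corresponding to that training text. The output acoustic feature parameters for a training text are compared with its first acoustic feature parameter samples, and the parameters of the acoustic parameter prediction model are adjusted according to the comparison result until a first preset target is met, which completes the training of the acoustic parameter prediction model.
In step S306, the trained acoustic parameter prediction model is used to convert each training text into second acoustic feature parameter samples.
Inputting each training text into the trained acoustic parameter prediction model yields the second acoustic feature parameter samples corresponding to that training text.
In step S308, according to the synthesis frequency of the vocoder, the speech samples corresponding to each training text are converted into vocoder feature parameter samples.
For example, the speech samples can be divided into frames of 5 ms each, and each frame converted into vocoder feature parameter samples (for example, MGC, BAP, log F0). Step S308 may be executed at any point, as long as it precedes step S310.
In step S310, the vocoder parameter conversion model is trained using the second acoustic feature parameter samples and the vocoder feature parameter samples corresponding to each training text.
For example, each second acoustic feature parameter sample is input into the vocoder parameter conversion model to obtain output vocoder feature parameters. The output vocoder feature parameters are compared with the corresponding vocoder feature parameter samples, and the parameters of the vocoder parameter conversion model are adjusted according to the comparison result until a second preset target is met, which completes the training of the vocoder parameter conversion model.
The method of the above embodiment uses the acoustic feature parameters predicted by the acoustic parameter prediction model as training data for the vocoder parameter conversion model, which can improve the accuracy of the conversion model and make the synthesized speech more accurate, fluent and natural. This is because, if the vocoder parameter conversion model were trained on real acoustic feature parameters (for example, Mel spectrum parameters) extracted directly from the speech files, the model's input features at synthesis time would not match its training features. Specifically, in actual speech synthesis the input features are the Mel spectra predicted by the acoustic parameter prediction model, and the error of the predicted acoustic feature parameters grows as the number of decoding steps increases, whereas the training of the conversion model would use the real acoustic feature parameters of the sound files. The trained model would then never have seen predicted acoustic feature parameters, or acoustic feature parameters carrying the errors accumulated during decoding, so the mismatch between input features and training features would severely degrade the performance of the vocoder parameter conversion model.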
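The present disclosure also provides a speech synthesis device, described below with reference to FIG. 4.
FIG. 4 is a structural diagram of some embodiments of the speech synthesis device of the present disclosure. As shown in FIG. 4, the device 40 of this embodiment includes: a language recognition module 402, a phoneme conversion module 404, a parameter conversion module 406 and a speech generation module 408.
The language recognition module 402 is configured to divide the text into multiple segments belonging to different language types.
In some embodiments, the language recognition module 402 is configured to identify the language type of each character according to the encoding of each character in the text, and to group consecutive characters of the same language type into one segment of that language type.
The phoneme conversion module 404 is configured to convert each segment into corresponding phonemes according to the language type to which it belongs, to generate a phoneme sequence of the text.
In some embodiments, the phoneme conversion module 404 is configured to determine the prosodic structure of the text and, according to the prosodic structure, to add prosody marks after the phonemes corresponding to the characters in the text to form the phoneme sequence of the text.
In some embodiments, the phoneme conversion module 404 is configured to: text-normalize each segment according to its language type; word-segment each normalized segment according to its language type; and convert the words of each segment into corresponding phonemes according to a preset phoneme conversion table for the segment's language type, where the phonemes include the tones of the corresponding characters.
The parameter conversion module 406 is configured to input the phoneme sequence into the pre-trained speech synthesis model to convert it into vocoder feature parameters.
In some embodiments, the parameter conversion module 406 is configured to input the phoneme sequence into the acoustic parameter prediction model in the speech synthesis model to convert it into acoustic feature parameters, and to input the acoustic feature parameters into the vocoder parameter conversion model in the speech synthesis model to obtain the output vocoder feature parameters.
In some embodiments, the acoustic parameter prediction model includes an encoder, a decoder and an attention model; the parameter conversion module 406 is configured to use the attention model to determine the attention weight of each feature representation output by the encoder at the current time step, and to determine whether the attention weight of the feature representation corresponding to a preset element in the phoneme sequence is the maximum among the attention weights and, if so, to end the decoding process.
In some embodiments, the acoustic feature parameters include speech spectrum parameters; the vocoder parameter conversion model consists of a multi-layer deep neural network and a long short-term memory network.
In some embodiments, when the frequency of the acoustic feature parameters is lower than that of the vocoder feature parameters, the acoustic feature parameters are up-sampled by repetition so that their frequency equals that of the vocoder feature parameters.
In some embodiments, the parameter conversion module 406 is configured to: input the phoneme sequence into the encoder to obtain the feature representation of each element of the phoneme sequence output by the encoder; input the feature representation of each element, the decoder hidden state output at the current time step by the first recurrent layer of the decoder, and the cumulative attention weight information of each element at the previous time step into the attention model to obtain a context vector; input the decoder hidden state output at the current time step by the first recurrent layer and the context vector into the second recurrent layer of the decoder to obtain the decoder hidden state output by the second recurrent layer at the current time step; and predict the acoustic feature parameters from the decoder hidden states output by the decoder at each time step.
The speech generation module 408 is configured to input the vocoder feature parameters into the vocoder to generate speech.
In some embodiments, as shown in FIG. 4, the speech synthesis device 40 further includes a model training module 410, configured to: divide the speech samples corresponding to each training text into frames according to a preset frequency, extract acoustic feature parameters for each frame, and generate first acoustic feature parameter samples corresponding to each training text; train the acoustic parameter prediction model using each training text and its first acoustic feature parameter samples; convert each training text into second acoustic feature parameter samples using the trained acoustic parameter prediction model; convert the speech samples corresponding to each training text into vocoder feature parameter samples according to the synthesis frequency of the vocoder; and train the vocoder parameter conversion model using the second acoustic feature parameter samples and vocoder feature parameter samples corresponding to each training text.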
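The speech synthesis devices in the embodiments of the present disclosure can each be implemented by various computing devices or computer systems, described below with reference to FIG. 5 and FIG. 6.
FIG. 5 is a structural diagram of some embodiments of the speech synthesis device of the present disclosure. As shown in FIG. 5, the device 50 of this embodiment includes a memory 510 and a processor 520 coupled to the memory 510; the processor 520 is configured to execute, based on instructions stored in the memory 510, the speech synthesis method of any of the embodiments of the present disclosure.
The memory 510 may include, for example, a system memory, a fixed non-volatile storage medium, and the like. The system memory stores, for example, an operating system, application programs, a boot loader, a database and other programs.
FIG. 6 is a structural diagram of other embodiments of the speech synthesis device of the present disclosure. As shown in FIG. 6, the device 60 of this embodiment includes a memory 610 and a processor 620, which are similar to the memory 510 and the processor 520 respectively. It may also include an input/output interface 630, a network interface 640, a storage interface 650 and so on. These interfaces 630, 640 and 650, the memory 610 and the processor 620 may be connected, for example, via a bus 660. The input/output interface 630 provides a connection interface for input and output devices such as a display, a mouse, a keyboard and a touch screen. The network interface 640 provides a connection interface for various networked devices and can, for example, connect to a database server or a cloud storage server. The storage interface 650 provides a connection interface for external storage devices such as SD cards and USB flash drives.
Those skilled in the art should understand that the embodiments of the present disclosure may be provided as methods, systems, or computer program products. The present disclosure may therefore take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the present disclosure may take the form of a computer program product implemented on one or more computer-usable non-transitory storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
The present disclosure is described with reference to flowcharts and/or block diagrams of methods, devices (systems) and computer program products according to its embodiments. Each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks, can be implemented by computer program instructions. These instructions can be provided to the processor of a general-purpose computer, special-purpose computer, embedded processor or other programmable data processing device to produce a machine, such that the instructions executed by the processor produce means configured to implement the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
These computer program instructions can also be stored in a computer-readable memory that can direct a computer or other programmable data processing device to work in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
These computer program instructions can also be loaded onto a computer or other programmable data processing device, so that a series of operation steps is executed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps configured to implement the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
The above are only preferred embodiments of the present disclosure and are not intended to limit it. Any modification, equivalent replacement or improvement made within the spirit and principles of the present disclosure shall fall within its scope of protection.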

Claims (22)

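  1. A speech synthesis method, comprising:
    dividing a text into a plurality of segments belonging to different languages;
    converting each of the segments into corresponding phonemes according to the language to which the segment belongs, and generating a phoneme sequence of the text;
    inputting the phoneme sequence into a pre-trained speech synthesis model, which converts it into vocoder feature parameters; and
    inputting the vocoder feature parameters into a vocoder to generate speech.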
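  2. The speech synthesis method according to claim 1, wherein
    dividing the text into a plurality of segments belonging to different languages comprises:
    identifying the language to which each character in the text belongs according to the encoding of that character; and
    grouping consecutive characters belonging to the same language into one segment of that language.
  3. The speech synthesis method according to claim 1, wherein generating the phoneme sequence of the text comprises:
    determining a prosodic structure of the text; and
    adding, according to the prosodic structure of the text, prosody marks after the phonemes corresponding to the characters in the text, so as to form the phoneme sequence of the text.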
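  4. The speech synthesis method according to claim 1, wherein
    inputting the phoneme sequence into the pre-trained speech synthesis model, which converts it into vocoder feature parameters, comprises:
    inputting the phoneme sequence into an acoustic parameter prediction model of the speech synthesis model, which converts it into acoustic feature parameters; and
    inputting the acoustic feature parameters into a vocoder parameter conversion model of the speech synthesis model to obtain the output vocoder feature parameters.
  5. The speech synthesis method according to claim 4, wherein
    the acoustic parameter prediction model comprises an encoder, a decoder, and an attention model; and
    inputting the phoneme sequence into the acoustic parameter prediction model of the speech synthesis model, which converts it into acoustic feature parameters, comprises:
    determining, using the attention model, the attention weights of the feature representations output by the encoder at the current time step; and
    determining whether the attention weight of the feature representation corresponding to a preset element in the phoneme sequence is the largest of the attention weights and, if so, ending the decoding process.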
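  6. The speech synthesis method according to claim 4, wherein
    the acoustic feature parameters comprise speech spectrum parameters; and
    the vocoder parameter conversion model is composed of a multi-layer deep neural network and a long short-term memory network.
  7. The speech synthesis method according to claim 4, wherein
    when the frequency of the acoustic feature parameters is lower than the frequency of the vocoder feature parameters, the acoustic feature parameters are upsampled by repetition so that the frequency of the acoustic feature parameters equals the frequency of the vocoder feature parameters.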
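  8. The speech synthesis method according to claim 1, further comprising: training the speech synthesis model, wherein
    the training comprises:
    dividing the speech sample corresponding to each training text into frames according to a preset frequency, and extracting acoustic feature parameters for each frame, to generate a first acoustic feature parameter sample corresponding to each of the training texts;
    training the acoustic parameter prediction model using the training texts and the first acoustic feature parameter samples corresponding to the training texts;
    converting each of the training texts into a second acoustic feature parameter sample using the trained acoustic parameter prediction model;
    converting the speech sample corresponding to each of the training texts into a vocoder feature parameter sample according to the synthesis frequency of the vocoder; and
    training the vocoder parameter conversion model using the second acoustic feature parameter samples and the vocoder feature parameter samples corresponding to the training texts.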
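  9. The speech synthesis method according to claim 4, wherein
    the acoustic parameter prediction model comprises an encoder, a decoder, and an attention model; and
    inputting the phoneme sequence into the acoustic parameter prediction model of the speech synthesis model, which converts it into acoustic feature parameters, comprises:
    inputting the phoneme sequence into the encoder to obtain the feature representation that the encoder outputs for each element of the phoneme sequence;
    inputting the feature representations of the elements, the decoder hidden state output by a first recurrent layer of the decoder at the current time step, and the cumulative attention weight information of the elements at the previous time step into the attention model to obtain a context vector;
    inputting the decoder hidden state output by the first recurrent layer of the decoder at the current time step and the context vector into a second recurrent layer of the decoder to obtain the decoder hidden state output by the second recurrent layer at the current time step; and
    predicting the acoustic feature parameters from the decoder hidden states output by the decoder at each time step.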
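  10. The speech synthesis method according to claim 1, wherein
    converting each of the segments into corresponding phonemes according to the language to which the segment belongs comprises:
    performing text normalization on each of the segments according to the language to which the segment belongs;
    performing word segmentation on each of the normalized segments according to the language to which the segment belongs; and
    converting the words of each of the segments into corresponding phonemes according to a preset phoneme conversion table for the language to which the segment belongs;
    wherein the phonemes include the tones corresponding to the characters.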
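  11. A speech synthesis apparatus, comprising:
    a language identification module configured to divide a text into a plurality of segments belonging to different languages;
    a phoneme conversion module configured to convert each of the segments into corresponding phonemes according to the language to which the segment belongs, and to generate a phoneme sequence of the text;
    a parameter conversion module configured to input the phoneme sequence into a pre-trained speech synthesis model, which converts it into vocoder feature parameters; and
    a speech generation module configured to input the vocoder feature parameters into a vocoder to generate speech.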
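  12. The speech synthesis apparatus according to claim 11, wherein
    the language identification module is configured to identify the language to which each character in the text belongs according to the encoding of that character, and to group consecutive characters belonging to the same language into one segment of that language.
  13. The speech synthesis apparatus according to claim 11, wherein
    the phoneme conversion module is configured to determine a prosodic structure of the text and, according to the prosodic structure of the text, to add prosody marks after the phonemes corresponding to the characters in the text, so as to form the phoneme sequence of the text.
  14. The speech synthesis apparatus according to claim 11, wherein
    the parameter conversion module is configured to input the phoneme sequence into an acoustic parameter prediction model of the speech synthesis model, which converts it into acoustic feature parameters, and to input the acoustic feature parameters into a vocoder parameter conversion model of the speech synthesis model to obtain the output vocoder feature parameters.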
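  15. The speech synthesis apparatus according to claim 14, wherein
    the acoustic parameter prediction model comprises an encoder, a decoder, and an attention model; and
    the parameter conversion module is configured to determine, using the attention model, the attention weights of the feature representations output by the encoder at the current time step, and to determine whether the attention weight of the feature representation corresponding to a preset element in the phoneme sequence is the largest of the attention weights and, if so, to end the decoding process.
  16. The speech synthesis apparatus according to claim 14, wherein
    the acoustic feature parameters comprise speech spectrum parameters; and
    the vocoder parameter conversion model is composed of a multi-layer deep neural network and a long short-term memory network.
  17. The speech synthesis apparatus according to claim 14, wherein
    when the frequency of the acoustic feature parameters is lower than the frequency of the vocoder feature parameters, the acoustic feature parameters are upsampled by repetition so that the frequency of the acoustic feature parameters equals the frequency of the vocoder feature parameters.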
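  18. The speech synthesis apparatus according to claim 11, further comprising:
    a model training module configured to divide the speech sample corresponding to each training text into frames according to a preset frequency and to extract acoustic feature parameters for each frame, to generate a first acoustic feature parameter sample corresponding to each of the training texts; to train the acoustic parameter prediction model using the training texts and the first acoustic feature parameter samples corresponding to the training texts; to convert each of the training texts into a second acoustic feature parameter sample using the trained acoustic parameter prediction model; to convert the speech sample corresponding to each of the training texts into a vocoder feature parameter sample according to the synthesis frequency of the vocoder; and to train the vocoder parameter conversion model using the second acoustic feature parameter samples and the vocoder feature parameter samples corresponding to the training texts.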
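  19. The speech synthesis apparatus according to claim 14, wherein
    the acoustic parameter prediction model comprises an encoder, a decoder, and an attention model; and
    the parameter conversion module is configured to input the phoneme sequence into the encoder to obtain the feature representation that the encoder outputs for each element of the phoneme sequence; to input the feature representations of the elements, the decoder hidden state output by a first recurrent layer of the decoder at the current time step, and the cumulative attention weight information of the elements at the previous time step into the attention model to obtain a context vector; to input the decoder hidden state output by the first recurrent layer of the decoder at the current time step and the context vector into a second recurrent layer of the decoder to obtain the decoder hidden state output by the second recurrent layer at the current time step; and to predict the acoustic feature parameters from the decoder hidden states output by the decoder at each time step.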
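  20. The speech synthesis apparatus according to claim 11, wherein
    the phoneme conversion module is configured to perform text normalization on each of the segments according to the language to which the segment belongs; to perform word segmentation on each of the normalized segments according to the language to which the segment belongs; and to convert the words of each of the segments into corresponding phonemes according to a preset phoneme conversion table for the language to which the segment belongs;
    wherein the phonemes include the tones corresponding to the characters.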
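  21. A speech synthesis apparatus, comprising:
    a memory; and
    a processor coupled to the memory, the processor being configured to execute, based on instructions stored in the memory, the speech synthesis method according to any one of claims 1 to 10.
  22. A computer-readable storage medium having a computer program stored thereon, wherein the program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 10.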
PCT/CN2020/082172 2019-04-03 2020-03-30 Speech synthesis method and apparatus, and computer-readable storage medium Ceased WO2020200178A1 (zh)

Priority Applications (3)

Application Number Priority Date Filing Date Title
JP2021558871A JP7464621B2 (ja) 2019-04-03 2020-03-30 Speech synthesis method, device, and computer-readable storage medium
EP20783784.0A EP3937165B1 (en) 2019-04-03 2020-03-30 Speech synthesis method and apparatus, and computer-readable storage medium
US17/600,850 US11881205B2 (en) 2019-04-03 2020-03-30 Speech synthesis method, device and computer readable storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910266289.4A CN111798832B (zh) 2019-04-03 2019-04-03 Speech synthesis method and apparatus, and computer-readable storage medium
CN201910266289.4 2019-04-03

Publications (1)

Publication Number Publication Date
WO2020200178A1 true WO2020200178A1 (zh) 2020-10-08

Family

ID=72664952

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/082172 Ceased WO2020200178A1 (zh) 2019-04-03 2020-03-30 Speech synthesis method and apparatus, and computer-readable storage medium

Country Status (5)

Country Link
US (1) US11881205B2 (zh)
EP (1) EP3937165B1 (zh)
JP (1) JP7464621B2 (zh)
CN (1) CN111798832B (zh)
WO (1) WO2020200178A1 (zh)

Also Published As

Publication number Publication date
CN111798832A (zh) 2020-10-20
JP2022527970A (ja) 2022-06-07
EP3937165C0 (en) 2025-10-22
CN111798832B (zh) 2024-09-20
EP3937165A1 (en) 2022-01-12
US20220165249A1 (en) 2022-05-26
JP7464621B2 (ja) 2024-04-09
EP3937165B1 (en) 2025-10-22
EP3937165A4 (en) 2023-05-10
US11881205B2 (en) 2024-01-23

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20783784

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2021558871

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 2020783784

Country of ref document: EP

Effective date: 20211005

WWG Wipo information: grant in national office

Ref document number: 2020783784

Country of ref document: EP