WO2023184874A1 - Speech synthesis method and apparatus - Google Patents

Speech synthesis method and apparatus

Info

Publication number
WO2023184874A1
WO2023184874A1 PCT/CN2022/118072 CN2022118072W WO2023184874A1 WO 2023184874 A1 WO2023184874 A1 WO 2023184874A1 CN 2022118072 W CN2022118072 W CN 2022118072W WO 2023184874 A1 WO2023184874 A1 WO 2023184874A1
Authority
WO
WIPO (PCT)
Prior art keywords
prosodic
target
sequence
clause
phoneme
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/CN2022/118072
Other languages
English (en)
French (fr)
Inventor
高羽
刘雪铃
涂建华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Midea Group Co Ltd
Midea Group Shanghai Co Ltd
Original Assignee
Midea Group Co Ltd
Midea Group Shanghai Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from CN202210344456.4A external-priority patent/CN114678002A/zh
Priority claimed from CN202210346097.6A external-priority patent/CN114708848B/zh
Priority claimed from CN202210346114.6A external-priority patent/CN114822490A/zh
Priority claimed from CN202210344448.XA external-priority patent/CN114678001A/zh
Priority claimed from CN202210346094.2A external-priority patent/CN114822489A/zh
Application filed by Midea Group Co Ltd, Midea Group Shanghai Co Ltd filed Critical Midea Group Co Ltd
Priority to EP22933875.1A priority Critical patent/EP4503017A4/en
Publication of WO2023184874A1 publication Critical patent/WO2023184874A1/zh
Anticipated expiration legal-status Critical
Ceased legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10 Prosody rules derived from text; Stress or intonation
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04 Details of speech synthesis systems, e.g. synthesiser structure or memory management

Definitions

  • The present application relates to the field of speech synthesis technology, and in particular to speech synthesis methods and devices, speech splicing methods and devices, methods and devices for obtaining audio and video file sizes, text transliteration methods and devices, and text segmentation methods and devices.
  • Text to Speech (TTS) technology is widely used in the field of speech synthesis.
  • Speech synthesis is usually performed directly on the entire text to be synthesized. For longer texts, synthesis takes longer, so the user waits a long time to obtain the synthesized speech. This low synthesis performance wastes the user's time and degrades the user experience.
  • This application aims to solve at least one of the technical problems existing in the prior art. To this end, this application proposes a speech synthesis method and a speech synthesis device.
  • a speech synthesis method which includes: segmenting a prosodic phoneme sequence of a target text and generating multiple sentence sequences.
  • the prosodic phoneme sequence includes multiple phonemes corresponding to the target text and a sequence located at Prosodic identifiers between adjacent phonemes, each clause sequence includes at least one phoneme; perform speech synthesis on the first sub-prosodic phoneme sequence in the plurality of clause sequences to obtain the first speech information; output the first speech information and Perform speech synthesis on a second sub-rhyme phoneme sequence in multiple clause sequences to generate second speech information.
  • the second sub-rhyme phoneme sequence is at least one clause sequence located after the first sub-rhyme phoneme sequence in the prosodic phoneme sequence. .
  • A speech synthesis device, including: a first processing module, configured to segment the prosodic phoneme sequence of the target text to generate multiple clause sequences, where the prosodic phoneme sequence includes multiple phonemes corresponding to the target text and prosodic identifiers located between adjacent phonemes, and each clause sequence includes at least one phoneme; a second processing module, configured to perform speech synthesis on the first sub-prosodic phoneme sequence among the multiple clause sequences to obtain the first speech information; and a third processing module, configured to output the first speech information and perform speech synthesis on the second sub-prosodic phoneme sequence among the multiple clause sequences to generate the second speech information, where the second sub-prosodic phoneme sequence is at least one clause sequence following the first sub-prosodic phoneme sequence in the prosodic phoneme sequence.
  • An electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the computer program, the above speech synthesis method is implemented.
  • A non-transitory computer-readable storage medium on which a computer program is stored. When the computer program is executed by a processor, the speech synthesis method described above is implemented.
  • A computer program product, including a computer program. When the computer program is executed by a processor, the speech synthesis method described above is implemented.
  • Figure 1 is a first schematic flowchart of a speech synthesis method according to an embodiment of the present application.
  • Figure 2 is a second schematic flowchart of a speech synthesis method according to an embodiment of the present application.
  • Figure 3 is a schematic structural diagram of a speech synthesis device according to an embodiment of the present application.
  • Figure 4 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
  • Figure 5 is a first schematic flowchart of a voice splicing method according to an embodiment of the present application.
  • Figure 6 is a second schematic flowchart of a voice splicing method according to an embodiment of the present application.
  • Figure 7 is a first schematic flowchart of a method for obtaining audio and video file sizes according to an embodiment of the present application.
  • Figure 8 is a second schematic flowchart of a method for obtaining audio and video file sizes according to an embodiment of the present application.
  • Figure 9 is a first schematic flowchart of a text transcription method according to an embodiment of the present application.
  • Figure 10 is a second schematic flowchart of a text transcription method according to an embodiment of the present application.
  • Figure 11 is a first schematic flowchart of a text segmentation method according to an embodiment of the present application.
  • Figure 12 is a second schematic flowchart of a text segmentation method according to an embodiment of the present application.
  • Figure 13 is a third schematic flowchart of a text segmentation method according to an embodiment of the present application.
  • References to the terms "one embodiment," "some embodiments," "an example," "specific examples," or "some examples" mean that specific features, structures, materials or characteristics described in connection with the embodiment or example are included in at least one embodiment or example of this application. In this specification, schematic expressions of these terms do not necessarily refer to the same embodiment or example. Furthermore, the specific features, structures, materials or characteristics described may be combined in any suitable manner in any one or more embodiments or examples, and those skilled in the art may combine different embodiments or examples, and features of different embodiments or examples, described in this specification unless they are inconsistent with each other.
  • the execution subject of the speech synthesis method can be a speech synthesis device, or a server, or it can also be a user's terminal, including but not limited to mobile phones, tablet computers, PC terminals, vehicle-mounted terminals, and household smart appliances.
  • the speech synthesis method includes: step 110, step 120 and step 130.
  • Step 110: Segment the prosodic phoneme sequence of the target text to generate multiple clause sequences.
  • the target text is the text currently used for speech synthesis.
  • the prosodic phoneme sequence is a sequence used to characterize the prosodic features and phoneme features of the target text.
  • the prosodic phoneme sequence includes prosodic identifiers located between adjacent phonemes and multiple phonemes corresponding to the target text.
  • a phoneme can be a combination of one or more phonetic units divided according to the natural properties of speech.
  • the phonetic unit can be the pinyin, initial consonant or final of a Chinese character, or an English word, English phonetic symbol or English letter.
  • the prosodic identifier is an identifier used to characterize the prosodic features corresponding to each phoneme in the target text.
  • the prosodic features include but are not limited to: the tones, syllables, prosodic words, prosodic phrases, intonation phrases, silences, pauses and other features corresponding to the phonemes.
  • the granularity level of the prosodic identifier used to characterize pauses is higher than that of the identifier used to characterize intonation phrases, which in turn is higher than that of the identifier used to characterize prosodic phrases.
  • Likewise, prosodic phrases are represented at a higher granularity level than prosodic words, and prosodic words at a higher granularity level than syllables.
  • the prosodic identifier may include: numbers, symbols and English phonemes between adjacent pinyin; the phoneme may include the pinyin corresponding to each Chinese character.
  • sil is the silence that represents the beginning and end of the sentence in the prosodic phoneme sequence
  • #0 represents the syllable
  • #1 represents the prosodic word
  • #2 represents the prosodic phrase
  • #3 represents the intonation phrase
  • #4 represents the end of the sentence.
  • the number at the end of a phoneme represents its tone.
  • For example, the 4 in shang4 represents the fourth tone of the pinyin "shang".
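  • The notation above can be read token by token. The following is a minimal sketch, not the application's implementation: the function name `classify_token` and the category labels are hypothetical, and the sketch assumes tokens are separated by whitespace and that pinyin tones are written as a trailing digit from 1 to 5.

```python
import re


def classify_token(token: str):
    """Classify one whitespace-separated token of a prosodic phoneme sequence.

    Returns a (kind, value) pair. The kind labels are illustrative only.
    """
    if token == "sil":
        # Silence marking the beginning or end of the sentence.
        return ("silence", None)
    match = re.fullmatch(r"#(\d+)", token)
    if match:
        # Prosodic identifier: #0 syllable ... #4 end of sentence.
        return ("identifier", int(match.group(1)))
    match = re.fullmatch(r"([a-z]+)([1-5])", token)
    if match:
        # Pinyin phoneme followed by its tone number.
        return ("pinyin", (match.group(1), int(match.group(2))))
    # Anything else (for example an English phoneme) is passed through.
    return ("other", token)


tokens = "sil shang4 #0 hai3 #0 shi4 #2".split()
classified = [classify_token(t) for t in tokens]
```

  • In this sketch, English phonemes are simply returned unchanged under the label "other", because the text does not specify how they are encoded.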
  • step 110 may include: converting the target text into a prosodic phoneme sequence; segmenting the prosodic phoneme sequence based on at least part of the plurality of prosodic identifiers to generate a plurality of clause sequences.
  • the prosodic phoneme sequence contains multiple phonemes and multiple prosodic identifiers, and the prosodic identifiers correspond to different granularity levels.
  • an appropriate granularity level can be selected as the segmentation criterion based on the actual situation, and the positions in the prosodic phoneme sequence of the prosodic identifiers at that level are used as segmentation points, at which the prosodic phoneme sequence is segmented to obtain multiple clause sequences.
  • each clause sequence includes a prosodic identifier at a segmentation point and at least one phoneme.
  • For example, for the prosodic phoneme sequence "sil shang4 #0 hai3 #0 shi4 #2 jin1 #0 tian1 #2 yin1 #0 zhuan3 #1 duo1 #0 yun2 #3 dong1 #0 nan2 #0 feng1 #2 san1 #0 dao4 #1 si4 #0 ji2 #4 sil", if it is determined based on actual needs to split at #3, the sequence is split at each position containing #3, and the prosodic identifier #3 is retained at the end of the preceding clause sequence.
  • the prosodic phoneme sequence is thus divided into two clause sequences: the first runs from the opening sil through "yun2 #3", and the second runs from "dong1" through the closing sil.
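  • The split at #3 described for this example can be sketched as follows. This is an illustrative sketch only: the function name `split_at_identifier` is hypothetical, and it assumes the sequence is a whitespace-separated string in which the chosen identifier stays with the preceding clause sequence.

```python
def split_at_identifier(sequence: str, boundary: str = "#3") -> list[str]:
    """Split a prosodic phoneme sequence after every `boundary` identifier.

    The boundary identifier is kept at the end of the preceding clause sequence.
    """
    clauses, current = [], []
    for token in sequence.split():
        current.append(token)
        if token == boundary:
            clauses.append(" ".join(current))
            current = []
    if current:
        # Tokens after the last boundary form the final clause sequence.
        clauses.append(" ".join(current))
    return clauses


sequence = (
    "sil shang4 #0 hai3 #0 shi4 #2 jin1 #0 tian1 #2 yin1 #0 zhuan3 #1 "
    "duo1 #0 yun2 #3 dong1 #0 nan2 #0 feng1 #2 san1 #0 dao4 #1 si4 #0 ji2 #4 sil"
)
clauses = split_at_identifier(sequence, "#3")
```

  • Passing a different identifier such as "#2" as `boundary` would yield shorter clause sequences, which corresponds to choosing a finer granularity level as the segmentation criterion.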
  • Step 120: Perform speech synthesis on the first sub-prosodic phoneme sequence among the multiple clause sequences to obtain the first speech information.
  • the first sub-prosodic phoneme sequence is the prosodic phoneme sequence before the first segmentation point of the prosodic phoneme sequence.
  • the first sub-prosodic phoneme sequence is clause sequence 1: sil shang4 #0 hai3 #0 shi4 #2 jin1 #0 tian1 #2 yin1 #0 zhuan3 #1 duo1 #0 yun2 #3.
  • a vocoder can be used to perform speech synthesis on the first sub-prosodic phoneme sequence to generate first speech information corresponding to the first sub-prosodic phoneme sequence.
  • Step 130: Output the first speech information and perform speech synthesis on the second sub-prosodic phoneme sequence among the multiple clause sequences to generate the second speech information.
  • the first speech information is returned to the client for output so that the user can play it.
  • the background continues to perform speech synthesis on the second sub-prosodic phoneme sequence to generate second speech information corresponding to the second sub-prosodic phoneme sequence.
  • the second sub-prosodic phoneme sequence is at least one clause sequence located after the first sub-prosodic phoneme sequence in the prosodic phoneme sequence.
  • the second sub-prosodic phoneme sequence is clause sequence 2: dong1 #0 nan2 #0 feng1 #2 san1 #0 dao4 #1 si4 #0 ji2 #4 sil.
  • the method may further include: when any to-be-matched clause sequence among the multiple clause sequences matches a cached target clause sequence, obtaining the target clause speech corresponding to that target clause sequence from the cache and determining it as the speech corresponding to the to-be-matched clause sequence; and when a to-be-matched clause sequence does not match any cached target clause sequence, performing speech synthesis on the to-be-matched clause sequence to generate the second clause speech.
  • That is, the first or second sub-prosodic phoneme sequence is matched against multiple clause sequences cached in advance, and the pre-generated, cached speech corresponding to the matched clause sequence is used as the synthesized speech for that sub-prosodic phoneme sequence.
  • In this way, the corresponding speech is first looked up in the cache without real-time synthesis, which improves the efficiency of speech synthesis.
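  • The cache lookup described above can be sketched as follows. This is a minimal sketch under stated assumptions: the cache is modeled as an in-memory dictionary keyed by the exact clause sequence string, `speech_for_clause` is a hypothetical name, and `synthesize` stands in for whatever synthesis backend is used. Storing newly synthesized speech back into the cache is an assumption of this sketch; the text only describes a pre-generated cache.

```python
from typing import Callable


def speech_for_clause(
    clause: str,
    cache: dict[str, bytes],
    synthesize: Callable[[str], bytes],
) -> bytes:
    """Return cached speech for an exact clause match, else synthesize it."""
    cached = cache.get(clause)
    if cached is not None:
        # Cache hit: no real-time synthesis is needed.
        return cached
    audio = synthesize(clause)
    # Assumption of this sketch: remember the result for later requests.
    cache[clause] = audio
    return audio


calls = []


def fake_synthesize(clause: str) -> bytes:
    """Stand-in synthesizer that records which clauses were synthesized."""
    calls.append(clause)
    return clause.encode("utf-8")


cache = {"sil ni3 #0 hao3 #4 sil": b"pre-generated audio"}
hit = speech_for_clause("sil ni3 #0 hao3 #4 sil", cache, fake_synthesize)
miss = speech_for_clause("dong1 #0 nan2 #4 sil", cache, fake_synthesize)
```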
  • the method may further include: segmenting the prosodic phoneme sequence of the target text to generate multiple candidate sequences; and combining a target candidate sequence with its adjacent candidate sequences among the multiple candidate sequences to generate multiple clause sequences, each with a corresponding granularity size.
  • the prosodic phoneme sequence "sil xi1 #0 wang4 #1 zhe4 #0 shou3 #0 ge1 #3
  • the multiple clause sequences can also be matched against the target clause sequences in the cache from front to back, in descending order, and the speech of a successfully matched target clause sequence is determined as the speech of the clause sequence.
  • the clause sequence is accurately matched with the target clause sequence from front to back, for example, first the clause sequence "xi1 #0 wang4 #1 zhe4 #0 shou3 # 0 ge1 #3 neng2 #0 rang4 #1 ni2 #1 xi3 #0 huan1 #1 wei4 #0 nin2 #1 bo1 #0 fang4 #1 EH1 K S #10 EH1 K S #0 de5 #1 EH1 K S #4( I hope this song will make you like playing XX's
  • If a clause sequence matches a target clause, the speech of that target clause is determined to be the speech corresponding to the clause sequence, and the comparison ends.
  • If there is no match, the shorter clause sequence "xi1 #0 wang4 #1 zhe4 #0 shou3 #0 ge1 #3 neng2 #0 rang4 #1 ni2 #1 xi3 #0 huan1 #1 wei4 #0 nin2 #1 bo1 #0 fang4 #1 (I hope this song will make you like playing it for you)" is compared with the target clause, and the above process is repeated until a clause sequence exactly matches a target clause, at which point the comparison ends.
  • Even the fastest deep learning models take a few seconds to tens of seconds to synthesize speech.
  • Longer texts take correspondingly longer, which means users must wait a long time to obtain the synthesized speech; this wastes the user's time and degrades the user experience.
  • In this application, the first sub-prosodic phoneme sequence is given priority for speech synthesis, and the resulting first speech information is output first.
  • While the first speech information is being output, the clause sequences after the first sub-prosodic phoneme sequence are synthesized, which effectively speeds up the system's feedback after receiving a network speech synthesis service request, shortens the user's waiting time, and helps improve the user experience.
  • Speech synthesis of the second sub-prosodic phoneme sequence among the multiple clause sequences can be performed sequentially, synthesizing each clause sequence in turn according to its segmentation order in the target text.
  • For example, the first clause sequence is the first sub-prosodic phoneme sequence and is given priority: speech synthesis is performed on it to generate the first speech information; while the first speech information is output, speech synthesis is performed on the second clause sequence; and after the second speech information corresponding to the second clause sequence is generated, speech synthesis is performed on the third clause sequence.
  • Speech synthesis of the second sub-prosodic phoneme sequence among the multiple clause sequences can also be performed on the clause sequences simultaneously.
  • For example, the first clause sequence is the first sub-prosodic phoneme sequence and is given priority: speech synthesis is performed on it to generate the first speech information; while the first speech information is output, the system's parallel synthesis capability is used to synthesize the second clause sequence and the third clause sequence in parallel.
  • In the speech synthesis method, the target text is divided into multiple clause sequences, and the first clause sequence is prioritized for speech synthesis to generate the first speech information.
  • Continuing to synthesize the subsequent clause sequences while the first speech information is output effectively speeds up the system's feedback after receiving a network speech synthesis service request, shortens the user's waiting time, and thus helps improve the user experience.
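  • The idea of outputting the first speech information while the remaining clause sequences are still being synthesized can be sketched with a thread pool. This is an illustrative sketch, not the application's implementation: `stream_speech` is a hypothetical name, `synthesize` stands in for the real synthesis backend, and `max_workers` is an assumed parameter (1 gives sequential synthesis of the remaining clauses, larger values give parallel synthesis).

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterator


def stream_speech(
    clauses: list[str],
    synthesize: Callable[[str], bytes],
    max_workers: int = 2,
) -> Iterator[bytes]:
    """Yield speech clause by clause, in the original clause order."""
    if not clauses:
        return
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # The first clause sequence is synthesized with priority.
        first = synthesize(clauses[0])
        # The remaining clauses are submitted before the first result is
        # handed to the caller, so they are synthesized in the background
        # while the caller outputs the first speech information.
        pending = [pool.submit(synthesize, clause) for clause in clauses[1:]]
        yield first
        for future in pending:
            yield future.result()


def fake_synthesize(clause: str) -> bytes:
    """Stand-in synthesizer used only for this example."""
    return clause.upper().encode("utf-8")


audio_chunks = list(stream_speech(["a", "b", "c"], fake_synthesize))
```

  • Because the futures are consumed in submission order, the caller receives the speech in clause order even when later clauses finish synthesizing first.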
  • the method may also include: obtaining the text to be synthesized; and, when the size of the text to be synthesized exceeds the target threshold, segmenting the text to be synthesized to generate the target text, where the size of the target text is not larger than the target threshold.
  • the text to be synthesized is the original text that requires speech synthesis.
  • the length of the text to be synthesized can range from tens or hundreds of words for regular text to thousands or tens of thousands of words for very long text.
  • the target threshold may be determined based on at least one of the computing power of the system and the upper limit of the ability of the speech synthesis model. For example, the target threshold may be determined to be within a range of several hundred words.
  • the size of the text to be synthesized is first compared with the target threshold. If it does not exceed the target threshold, the entire text to be synthesized is directly determined as the target text.
  • If it exceeds the target threshold, the text to be synthesized is first segmented into multiple paragraphs of first text such that the size of each paragraph does not exceed the target threshold, and the first of these paragraphs is determined as the target text.
  • Segmenting the text to be synthesized based on the target threshold takes the actual capabilities of the server into account and provides target text within the server's processing capability for speech synthesis, thereby improving the performance of speech synthesis.
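  • A threshold-based pre-segmentation of this kind might look like the sketch below. The text does not specify how the cut points are chosen, so this sketch makes two assumptions: size is measured in characters, and a cut is placed just after the last sentence-ending punctuation mark inside each window when one exists. The name `segment_text` is hypothetical.

```python
def segment_text(text: str, threshold: int) -> list[str]:
    """Split `text` into pieces of at most `threshold` characters."""
    if threshold <= 0:
        raise ValueError("threshold must be positive")
    if len(text) <= threshold:
        # Short text: the whole text becomes the target text.
        return [text]
    pieces = []
    start = 0
    while start < len(text):
        end = min(start + threshold, len(text))
        if end < len(text):
            # Prefer to cut just after the last sentence-ending mark
            # found inside the current window.
            cut = max(text.rfind(mark, start, end) for mark in ".!?。！？")
            if cut >= start:
                end = cut + 1
        pieces.append(text[start:end])
        start = end
    return pieces


pieces = segment_text("One. Two. Three.", threshold=10)
target_text = pieces[0]  # the first paragraph is used as the target text
```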
  • step 110 may include: obtaining sentence-end information, intonation phrases, prosodic phrases, prosodic words, and syllables of the target text; and marking the target text based on at least two of the sentence-end information, intonation phrases, prosodic phrases, prosodic words, and syllables to generate a prosodic phoneme sequence.
  • the syllable is a phonetic unit in the speech flow, and it is also the phonetic unit that is easiest for people to distinguish auditorily.
  • a syllable can be each Chinese character in the target text.
  • a prosodic word is a group of syllables that are closely related and pronounced together in actual speech flow.
  • a prosodic phrase is a medium-sized rhythmic chunk between prosodic words and intonation phrases.
  • a prosodic phrase can include multiple prosodic words and modal particles, and the multiple prosodic words that make up the prosodic phrase sound like they share a rhythmic group.
  • An intonation phrase is a sentence composed of multiple prosodic phrases connected according to a certain intonation pattern, and is used to represent larger pauses.
  • End-of-sentence information is used to represent the end of each long sentence.
  • each Chinese character in the target text (for example, the characters read shang4, hai3 and shi4) is a syllable corresponding to the target text;
  • words or phrases composed of characters, such as "Shanghai City", "today" and "overcast to cloudy", are the prosodic phrases corresponding to the target text;
  • the sentence composed of these prosodic phrases, "Shanghai City is overcast to cloudy today", is the intonation phrase corresponding to the target text.
  • After obtaining the sentence-end information, intonation phrases, prosodic phrases, prosodic words, and syllables of the target text, the target text is marked based on at least two of them to generate a prosodic phoneme sequence.
  • The rhythm of a sentence is often represented using its punctuation marks, for example by segmenting the sentence at commas or periods to obtain multiple clauses.
  • However, this method cannot segment text without punctuation.
  • It can also leave the two sides of a segmentation point unbalanced, giving a poor segmentation result.
  • In this application, at least two of sentence-end information, intonation phrases, prosodic phrases, prosodic words and syllables are used to characterize the rhythm of the sentence, and the target text is segmented on that basis, so that a cut never falls in the middle of a whole word and the sentence pauses and rhythm obtained after segmentation are more natural.
  • marking the target text based on at least two of sentence end information, intonation phrases, prosodic phrases, prosodic words and syllables, generating a prosodic phoneme sequence includes: converting the target text into a phoneme sequence; based on sentence end information, At least two of intonation phrases, prosodic phrases, prosodic words and syllables are used to generate multiple prosodic identifiers; a phoneme sequence is marked based on the multiple prosodic identifiers to generate a prosodic phoneme sequence.
  • the phoneme sequence is a sequence connected by pronunciation marks corresponding to each Chinese character or English in the target text, including pinyin, tones or English phonetic notations.
  • Prosodic identifiers are identifiers used to characterize the prosodic features corresponding to each phoneme in the target text. That is, prosodic identifiers are symbols used to represent sentence-end information, intonation phrases, prosodic phrases, prosodic words, and syllables.
  • A prosodic identifier can be represented by a combination of special symbols and numbers or by a specific combination of letters, such as "#0", "#1", "#2", "#3" and "#4", where different combinations represent different granularity levels.
  • #0 represents a syllable
  • #1 represents a prosodic word
  • #2 represents a prosodic phrase
  • #3 represents an intonation phrase
  • #4 represents the end of a sentence.
  • the granularity from small to large is: #0 < #1 < #2 < #3 < #4.
  • After obtaining the phoneme sequence and prosodic identifiers corresponding to the target text, the prosodic identifiers are inserted into the corresponding positions in the phoneme sequence. For example, the prosodic identifier #0 used to represent a syllable is inserted after the pinyin corresponding to each syllable in the phoneme sequence, and the prosodic identifier #2 used to characterize a prosodic phrase is inserted after each prosodic phrase in the phoneme sequence, thereby converting the phoneme sequence into a prosodic phoneme sequence.
  • sil represents the silence at the beginning and end of the sentence.
  • The target text is converted into a phoneme sequence, and the phoneme sequence is tagged with at least two kinds of prosodic identifiers corresponding to end-of-sentence information, intonation phrases, prosodic phrases, prosodic words, and syllables to generate the prosodic phoneme sequence.
  • This provides a more refined prosodic representation, which helps improve the fineness and accuracy of subsequent segmentation.
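  • Tagging a phoneme sequence with prosodic identifiers can be sketched as follows. This is a simplified sketch, not the application's method: `build_prosodic_sequence` is a hypothetical name, and it assumes the prosody analysis has already produced, for each pinyin syllable, the level (0 to 4) of the highest-level boundary that follows it.

```python
def build_prosodic_sequence(phonemes: list[str], levels: list[int]) -> str:
    """Interleave phonemes with prosodic identifiers and wrap in silences.

    `levels[i]` is the boundary level after `phonemes[i]`:
    0 syllable, 1 prosodic word, 2 prosodic phrase,
    3 intonation phrase, 4 end of sentence.
    """
    if len(phonemes) != len(levels):
        raise ValueError("each phoneme needs exactly one boundary level")
    tokens = ["sil"]  # silence at the beginning of the sentence
    for phoneme, level in zip(phonemes, levels):
        tokens.append(phoneme)
        tokens.append(f"#{level}")
    tokens.append("sil")  # silence at the end of the sentence
    return " ".join(tokens)


prosodic_sequence = build_prosodic_sequence(
    ["shang4", "hai3", "shi4", "jin1", "tian1"],
    [0, 0, 2, 0, 4],
)
```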
  • the method may further include: generating a target file size of the third speech information based on the prosodic phoneme sequence;
  • Step 130 may include: generating second voice information based on the target file size.
  • the third speech information is the speech information generated by combining the speech synthesized from at least two of the clause sequences corresponding to the target text, where one of the at least two clause sequences is the first sub-prosodic phoneme sequence.
  • the target file size is the predicted file size of the third voice information.
  • the target file size may be file volume information, or may be voice length information of the third voice information, which is not limited in this application.
  • generating the target file size of the third speech information based on the prosodic phoneme sequence may include: generating the predicted file size of the third speech information based on the prosodic phoneme sequence; and correcting the predicted file size based on the target residual value to generate the target file size.
  • the predicted file size is the initial, uncorrected file size value of the speech synthesized from the target text, predicted based on the prosodic phoneme sequence.
  • the target residual value is used to correct the predicted file size to improve the accuracy of the final generated target file size.
  • the target residual value is determined based on the sample file size and the predicted size of the sample audio file corresponding to the sample text.
  • the sample file size is the actual size of the sample audio file corresponding to the sample text.
  • the target file size is the corrected file size value, predicted based on the prosodic phoneme sequence, of the speech synthesized from the target text. Understandably, the target file size is more accurate than the predicted file size.
  • the target residual value is a predetermined value, for example, the target residual value can be the maximum residual value.
  • the predicted file size is corrected by performing supplementary residual processing on the predicted file size, thereby improving the accuracy of the final generated target file size.
  • the target residual value can be determined by the following steps:
  • obtaining the sample text, the sample audio file corresponding to the sample text, and the sample file size, where the sample audio file is generated by speech synthesis of the sample text;
  • determining the absolute value of the difference between the sample file size and the sample predicted file size as the target residual value.
  • the sample text can be a regular text of tens to hundreds of words, or an extremely long text of thousands or tens of thousands of words.
  • the sample audio file is the audio file finally generated by speech synthesis of the sample text.
  • the sample file size is the actual size value of the sample audio file or the actual audio duration.
  • a speech synthesis system can be used to calculate the actual wav file size or audio duration of the sample audio file corresponding to the sample text.
  • the sample predicted file size is the predicted, uncorrected size value or audio duration of the sample audio file.
  • the sample predicted file size should be generated in the same way as the predicted file size.
  • multiple predictions can be made on the sample prosodic phoneme sequence to obtain multiple sample predicted file sizes. The difference between each sample predicted file size and the sample file size is calculated to obtain multiple candidate differences; the absolute value of the minimum of these candidate differences is then determined as the target residual value, to improve the accuracy of the target residual value.
  • The size of the target audio file synthesized from the target text is predicted based on the prosodic phoneme sequence, and the predicted value is corrected based on the target residual value, so that the size of the target file can be predicted with high accuracy before the target audio file is generated.
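  • The residual correction can be illustrated with a small numeric sketch. The function names are hypothetical, the sizes are made-up byte counts used only for illustration, and two details are assumptions of this sketch: each candidate difference is taken as predicted size minus actual size, and the correction adds the target residual value to the predicted size.

```python
def target_residual(sample_predicted: list[int], sample_actual: list[int]) -> int:
    """Absolute value of the minimum (predicted - actual) candidate difference."""
    if not sample_predicted or len(sample_predicted) != len(sample_actual):
        raise ValueError("need matching, non-empty prediction and actual lists")
    differences = [p - a for p, a in zip(sample_predicted, sample_actual)]
    return abs(min(differences))


def corrected_file_size(predicted: int, residual: int) -> int:
    """Supplement the predicted size with the residual to get the target size."""
    return predicted + residual


# Illustrative numbers only: three predictions against a 1000-byte sample file.
residual = target_residual([980, 1010, 950], [1000, 1000, 1000])
target_size = corrected_file_size(2000, residual)
```

  • With predictions that undershoot, the minimum difference is the most negative one, so its absolute value is the largest shortfall observed on the samples; adding it makes the corrected size a conservative estimate.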
  • step 110 may include:
  • segmenting the prosodic phoneme sequence based on a first segmentation position and a second segmentation position to generate a first sub-prosodic phoneme sequence and at least two second sub-prosodic phoneme sequences.
  • The first sub-prosodic phoneme sequence is the part of the prosodic phoneme sequence located before the first segmentation position; the at least two second sub-prosodic phoneme sequences are the parts located after the first segmentation position; adjacent second sub-prosodic phoneme sequences are delimited by the second segmentation position; and the speech synthesis duration corresponding to the first sub-prosodic phoneme sequence is within the target duration.
  • the first segmentation position is the segmentation point of the first segmentation.
  • the second segmentation position is the position of every segmentation point other than that of the first segmentation.
  • the prosodic phoneme sequence can be segmented into two subsequences before and after the first segmentation position, and the subsequence before the first segmentation position is determined as the first sub-prosodic phoneme sequence.
  • the speech synthesis duration corresponding to the first sub-prosodic phoneme sequence generated based on the first segmentation position is within the target duration.
  • the speech synthesis duration corresponding to the first sub-prosodic phoneme sequence is the time it takes to synthesize the first sub-prosodic phoneme sequence into speech.
  • the speech synthesis time is related to the computing power of the speech synthesis system.
  • the target duration is a short duration.
  • the value of the target duration can be customized by the user, or a system default value can be used.
  • for example, the target duration can be set to 0.2s or 0.3s.
  • at least some of the prosodic identifiers located after the first segmentation position are searched for in the prosodic phoneme sequence to form a candidate set, and the second segmentation position is determined from the positions of the prosodic identifiers in the candidate set.
  • the second sub-prosodic phoneme sequence is the entire prosodic phoneme sequence located after the first segmentation position in the prosodic phoneme sequence.
  • for example, when the second segmentation position is the position corresponding to #3 but #3 cannot be found in the second sub-prosodic phoneme sequence, it can be understood that there is no second segmentation position.
  • the first segmentation position is determined based on the prosodic identifiers in the prosodic phoneme sequence, so that the speech synthesis duration corresponding to the first sub-prosodic phoneme sequence obtained from the first segmentation position falls within a reasonable range.
  • in addition, the first segmentation position determined in this way lies at a position with a longer pause duration, so that the pauses and rhythm of the resulting first sub-prosodic phoneme sequence are more natural, making the subsequently output speech synthesized from the first sub-prosodic phoneme sequence more natural and smooth.
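  • one way to picture the choice of the first segmentation position is the following Python sketch. It is only an illustration under stated assumptions: the synthesis time is estimated as a fixed cost per phoneme (a hypothetical model; the text only says the time depends on the computing power of the system), only identifiers of level #2 or higher are treated as acceptable pauses, and among the identifiers reachable within the target duration the highest level is preferred, with ties going to the later position.

```python
def first_segmentation_position(tokens, target_ms, cost_ms_per_phoneme, min_level=2):
    """Index of the prosodic identifier chosen as the first cut, or None."""
    best_index, best_level, phonemes_so_far = None, -1, 0
    for index, token in enumerate(tokens):
        if not token.startswith("#"):
            phonemes_so_far += 1          # "sil" is counted like any other phoneme
            continue
        if phonemes_so_far * cost_ms_per_phoneme > target_ms:
            break                         # any later cut would exceed the target duration
        level = int(token[1:])
        if level >= min_level and level >= best_level:
            best_index, best_level = index, level
    return best_index


tokens = ("sil shang4 #0 hai3 #0 shi4 #2 jin1 #0 tian1 #2 "
          "yin1 #0 zhuan3 #1 duo1 #0 yun2 #3 dong1 #0 nan2").split()
# Hypothetical budget: 300 ms target duration, 40 ms of synthesis time per phoneme.
cut = first_segmentation_position(tokens, target_ms=300, cost_ms_per_phoneme=40)
first_sub_sequence = tokens[:cut + 1]     # the identifier stays with the first part
```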
  • the method may further include: merging the first voice information and the second voice information to generate third voice information.
  • the second speech information is speech information obtained by speech synthesis of the second sub-prosodic phoneme sequence, where the second sub-prosodic phoneme sequence may be one or more clause sequences, all of which are located after the first sub-prosodic phoneme sequence in the target text.
  • while the first speech information is being output, speech synthesis can be performed on the second clause sequence, which is located after and adjacent to the first sub-prosodic phoneme sequence, to generate the second speech information corresponding to the second clause sequence; while the second speech information is being output, the first speech information and the second speech information are merged to generate the third speech information.
  • when there are several clause sequences, speech synthesis can first be performed on the second sub-prosodic phoneme sequence located after and adjacent to the first sub-prosodic phoneme sequence to generate second speech information. While that second speech information is being output, speech synthesis can be performed on a third clause sequence located after and adjacent to the second clause sequence to generate the second speech information corresponding to the third clause sequence.
  • subsequent clause sequences are synthesized in the same way until the second speech information corresponding to all clauses has been generated, and the first speech information and all of the second speech information are then merged to generate the third speech information.
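  • the overlap between outputting one clause and synthesizing the next can be sketched in Python as follows. This is a minimal illustration, not the implementation of the application: `synthesize` and `output` are hypothetical callables standing in for a real synthesis engine and playback device, and the final merge is shown as plain concatenation, whereas the text merges based on phoneme durations.

```python
from concurrent.futures import ThreadPoolExecutor


def synthesize_streaming(clauses, synthesize, output):
    """Output each clause's audio while the next clause is synthesized."""
    if not clauses:
        return b""
    pieces = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(synthesize, clauses[0])
        for next_clause in list(clauses[1:]) + [None]:
            audio = pending.result()      # wait for the clause being synthesized
            if next_clause is not None:
                # Start the next clause in the background before outputting.
                pending = pool.submit(synthesize, next_clause)
            output(audio)                 # output overlaps with the next synthesis
            pieces.append(audio)
    return b"".join(pieces)               # merged (third) speech information


played = []
merged = synthesize_streaming(
    ["clause-1", "clause-2", "clause-3"],
    synthesize=lambda clause: clause.encode("utf-8"),  # stand-in for a TTS call
    output=played.append,                              # stand-in for playback
)
```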
  • merging the first voice information and the second voice information may include: merging the first voice information and the second voice information based on the phoneme duration corresponding to the first voice information and the phoneme duration corresponding to the second voice information.
  • the phoneme duration is the pronunciation duration corresponding to the phoneme.
  • when the first speech information is speech, the portion of the speech whose duration corresponds to the redundant phonemes at the beginning or end of the first sub-prosodic phoneme sequence is truncated to generate the truncated first speech information.
  • when the first speech information is a high-level acoustic feature, the portion of the high-level acoustic feature whose duration corresponds to the redundant phonemes at the beginning or end of the first sub-prosodic phoneme sequence is truncated to generate truncated high-level acoustic features; a vocoder is then used to perform speech synthesis on the truncated high-level acoustic features to generate the truncated first speech information.
  • the method of cutting off the second voice information is the same as that of the first voice information, and will not be described again here.
  • the truncated speech information corresponding to the adjacent clause sequences is spliced in sequence until the speech information corresponding to all the clause sequences is spliced.
  • the first speech information and the second speech information are spliced based on the phoneme duration, which can improve the naturalness and smoothness of splicing adjacent speech information without the need for a preset speech splicing unit library.
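  • the duration-based truncation and splicing described above can be sketched in Python as follows. The sketch assumes that each piece of speech information is a list of waveform samples accompanied by a per-token duration in milliseconds, and that "sil" and "eos" are the redundant phonemes; the function name and the toy sample rate are hypothetical. Both ends are trimmed here for simplicity.

```python
def trim_redundant(samples, tokens, durations_ms, sample_rate,
                   redundant=("sil", "eos")):
    """Drop the samples that cover redundant phonemes at the start and end."""
    head_ms = 0
    for token, duration in zip(tokens, durations_ms):
        if token not in redundant:
            break
        head_ms += duration
    tail_ms = 0
    for token, duration in zip(reversed(tokens), reversed(durations_ms)):
        if token not in redundant:
            break
        tail_ms += duration
    start = head_ms * sample_rate // 1000
    end = len(samples) - tail_ms * sample_rate // 1000
    return samples[start:end]


# Toy data: 10 samples per second so the arithmetic is easy to follow.
first = trim_redundant(list(range(10)), ["sil", "a1", "sil", "eos"],
                       [200, 500, 200, 100], sample_rate=10)
second = trim_redundant(list(range(100, 108)), ["sil", "b2", "#4", "sil", "eos"],
                        [100, 400, 0, 200, 100], sample_rate=10)
merged = first + second   # splice the truncated pieces in clause order
```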
  • the speech synthesis device includes: a first processing module 310, a second processing module 320, and a third processing module 330.
  • the first processing module 310 is used to segment the prosodic phoneme sequence of the target text and generate multiple clause sequences, where the prosodic phoneme sequence includes multiple phonemes corresponding to the target text and prosodic identifiers located between adjacent phonemes, and each clause sequence includes at least one phoneme;
  • the second processing module 320 is used to perform speech synthesis on the first sub-prosodic phoneme sequence in the plurality of clause sequences to obtain the first speech information;
  • the third processing module 330 is used to output the first speech information and perform speech synthesis on the second sub-prosodic phoneme sequence in the plurality of clause sequences to generate the second speech information, where the second sub-prosodic phoneme sequence is at least one clause sequence located after the first sub-prosodic phoneme sequence in the prosodic phoneme sequence.
  • the target text is divided into multiple clause sequences, and the first clause sequence is prioritized for speech synthesis to generate the first speech information.
  • the first processing module 310 is also used to: convert the target text into a prosodic phoneme sequence, where the prosodic phoneme sequence includes prosodic identifiers between adjacent phonemes and multiple phonemes corresponding to the target text; and segment the prosodic phoneme sequence based on at least some of the multiple prosodic identifiers to generate a plurality of clause sequences, each clause sequence including at least one phoneme.
  • the device may further include: a fifth processing module, configured to combine the first voice information and the second voice information to generate third voice information after generating the second voice information.
  • the device may further include: a sixth processing module, configured to generate a target file size of the third speech information based on the prosodic phoneme sequence after converting the target text into a prosodic phoneme sequence;
  • the fourth processing module 340 is also used to generate second voice information based on the target file size.
  • the sixth processing module is also used to: generate the predicted file size of the third speech information based on the prosodic phoneme sequence; and correct the predicted file size based on the target residual value to generate the target file size. The target residual value is determined based on the sample predicted file size and the sample file size, where the sample predicted file size is the predicted size of the sample audio file corresponding to the sample text, and the sample file size is the actual size of the sample audio file corresponding to the sample text.
  • the fifth processing module is also configured to merge the first voice information and the second voice information based on the phoneme duration corresponding to the first voice information and the phoneme duration corresponding to the second voice information.
  • the device may further include: a seventh processing module, configured to obtain the text to be synthesized before the target text is converted into a prosodic phoneme sequence and, if the size of the text to be synthesized exceeds the target threshold, to segment the text to be synthesized to generate the target text, where the size of the target text does not exceed the target threshold.
  • the first processing module 310 is further configured to: determine the first segmentation position in the prosodic phoneme sequence based on multiple prosodic identifiers; determine the second segmentation position based on the prosodic identifiers located after the first segmentation position in the prosodic phoneme sequence; and segment the prosodic phoneme sequence based on the first segmentation position and the second segmentation position to generate a first sub-prosodic phoneme sequence and at least two second sub-prosodic phoneme sequences. The first sub-prosodic phoneme sequence is the part of the prosodic phoneme sequence located before the first segmentation position, the at least two second sub-prosodic phoneme sequences are the parts located after the first segmentation position, adjacent second sub-prosodic phoneme sequences are delimited by the second segmentation position, and the speech synthesis duration corresponding to the first sub-prosodic phoneme sequence is within the target duration.
  • the first processing module 310 is also used to: obtain the prosodic words, syllables, prosodic phrases, sentence-end information and intonation phrases of the target text; and mark the target text based on at least two of the prosodic words, syllables, prosodic phrases, sentence-end information and intonation phrases to generate a prosodic phoneme sequence.
  • the first processing module 310 is also used to: convert the target text into a phoneme sequence; generate multiple prosodic identifiers based on at least two of the prosodic words, syllables, prosodic phrases, sentence-end information and intonation phrases; and mark the phoneme sequence based on the multiple prosodic identifiers to generate a prosodic phoneme sequence.
  • Figure 4 illustrates a schematic diagram of the physical structure of an electronic device.
  • the electronic device may include: a processor 410, a communications interface 420, a memory 430 and a communication bus 440.
  • the processor 410, the communication interface 420, and the memory 430 complete communication with each other through the communication bus 440.
  • the processor 410 can call logical instructions in the memory 430 to execute a speech synthesis method.
  • the method includes: segmenting the prosodic phoneme sequence of the target text and generating multiple clause sequences, where the prosodic phoneme sequence includes multiple phonemes corresponding to the target text and prosodic identifiers located between adjacent phonemes, and each clause sequence includes at least one phoneme; performing speech synthesis on the first sub-prosodic phoneme sequence in the plurality of clause sequences to obtain the first speech information; and outputting the first speech information and performing speech synthesis on the second sub-prosodic phoneme sequence in the plurality of clause sequences to generate the second speech information.
  • the second sub-prosodic phoneme sequence is at least one clause sequence located after the first sub-prosodic phoneme sequence in the prosodic phoneme sequence.
  • the above-mentioned logical instructions in the memory 430 can be implemented in the form of software functional units and can be stored in a computer-readable storage medium when sold or used as an independent product.
  • the technical solution of the present application, in essence, or the part of it that contributes to the existing technology, or a part of the technical solution, can be embodied in the form of a software product.
  • the computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, etc.) to execute all or part of the steps of the methods of the various embodiments of the present application.
  • the aforementioned storage media include: a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, an optical disk, and other media that can store program code.
  • the application also provides a computer program product.
  • the computer program product includes a computer program.
  • the computer program can be stored on a non-transitory computer-readable storage medium.
  • when executed by a processor, the computer program can perform the speech synthesis method provided by the above method embodiments.
  • embodiments of the present application also provide a non-transitory computer-readable storage medium on which a computer program is stored.
  • when executed by a processor, the computer program implements the speech synthesis method provided by the above embodiments.
  • the speech splicing method includes: step 510, step 520 and step 530.
  • Step 510 Segment the prosodic phoneme sequence of the target text to generate multiple clause sequences.
  • the prosodic phoneme sequence includes prosodic identifiers between adjacent phonemes and multiple phonemes corresponding to the target text.
  • each clause sequence includes at least one phoneme;
  • step 510 may include: segmenting the prosodic phoneme sequence based on at least some of the plurality of prosodic identifiers to generate a plurality of clause sequences.
  • an entire prosodic phoneme sequence includes multiple phonemes and multiple prosodic identifiers, and the multiple prosodic identifiers include prosodic identifiers corresponding to different levels of granularity.
  • an appropriate level of granularity can be selected as the segmentation criterion based on the actual situation, and the positions of the prosodic identifiers of that level in the prosodic phoneme sequence can be used as segmentation points to segment the prosodic phoneme sequence and obtain multiple clause sequences.
  • each clause sequence includes a prosodic identifier at a segmentation point and at least one phoneme.
  • for example, for the prosodic phoneme sequence "sil shang4 #0 hai3 #0 shi4 #2 jin1 #0 tian1 #2 yin1 #0 zhuan3 #1 duo1 #0 yun2 #3 dong1 #0 nan2 #0 feng1 #2 san1 #0 dao4 #1 si4 #0 ji2 #4 sil", if it is determined based on actual needs to split at #3, the sequence is split at each position containing #3, and the prosodic separator #3 is retained in the preceding splicing unit;
  • the prosodic phoneme sequence can then be divided into the following multiple clause sequences:
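  • the split at #3 described above can be reproduced with a short Python sketch. The function name is hypothetical; the only rule taken from the text is that the prosodic separator stays with the preceding splicing unit.

```python
def split_at_identifier(sequence, identifier="#3"):
    """Split a prosodic phoneme sequence after every occurrence of `identifier`."""
    clauses, current = [], []
    for token in sequence.split():
        current.append(token)
        if token == identifier:
            # The identifier is retained in the preceding splicing unit.
            clauses.append(" ".join(current))
            current = []
    if current:
        clauses.append(" ".join(current))
    return clauses


sequence = ("sil shang4 #0 hai3 #0 shi4 #2 jin1 #0 tian1 #2 yin1 #0 zhuan3 #1 "
            "duo1 #0 yun2 #3 dong1 #0 nan2 #0 feng1 #2 san1 #0 dao4 #1 "
            "si4 #0 ji2 #4 sil")
clauses = split_at_identifier(sequence)
for clause in clauses:
    print(clause)
```

  • with this input the sketch yields two clause sequences, the first ending in "yun2 #3" and the second beginning with "dong1".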
  • the first segmentation position and the second segmentation position may also be determined in the prosodic phoneme sequence based on multiple prosodic identifiers, and the prosodic phoneme sequence may be segmented based on the first segmentation position and the second segmentation position to generate a clause sequence corresponding to the first sub-prosodic phoneme sequence and a clause sequence corresponding to the second sub-prosodic phoneme sequence.
  • the speech synthesis duration corresponding to the first sub-prosodic phoneme sequence is within the target duration.
  • the target duration may be determined based on at least one of the computing power of the system and the upper limit of the capability of the speech synthesis model. The target duration is a short duration whose value can be customized by the user or left at the system default; for example, it can be set to 0.2s or 0.3s.
  • the second sub-prosodic phoneme sequence may be further segmented based on the prosodic identifiers to obtain clause sequences.
  • for example, the second sub-prosodic phoneme sequence may be segmented at the prosodic separator #3. It can be understood that when there is no prosodic separator #3 in the second sub-prosodic phoneme sequence, the second sub-prosodic phoneme sequence is not segmented.
  • the speech synthesis duration corresponding to the first sub-prosodic phoneme sequence obtained from the first segmentation position can be kept within a reasonable range, which shortens the first-sentence response time of the synthesis system and reduces the delay; in addition, the first segmentation position and the second segmentation position determined in this way lie at positions with longer pause durations, making the pauses and rhythm of the segmented clause sequences more natural and thus making the subsequently output speech synthesized from the clause sequences more natural and smooth.
  • Step 520 Perform speech synthesis on each clause sequence to generate multiple first clause speech information
  • the first clause speech information is speech information generated by speech synthesis based on the clause sequence, and each clause sequence corresponds to one first clause speech information.
  • the first clause speech information includes the first duration corresponding to each prosodic identifier and each phoneme.
  • the first clause speech information can be speech, or it can be high-level acoustic features.
  • high-level acoustic features are physical quantities that characterize the acoustic properties of speech and can be used to reconstruct speech, including but not limited to: linear spectrum, Mel spectrum, and Mel cepstrum; the energy concentration regions, formant frequencies, formant intensities and bandwidths that characterize timbre; and the duration, fundamental frequency, and average speech power that characterize the prosody of speech.
  • a phoneme is a combination of one or more phonetic units divided according to the natural properties of speech.
  • the phonetic unit can be the pinyin, initial consonant or final rhyme corresponding to a Chinese character, or an English word, English phonetic symbol or English letter.
  • the first duration is the pronunciation duration corresponding to the prosodic identifier or phoneme.
  • for example, "shang4" can be used as a single phoneme or can be split into the two phonemes "sh" and "ang4", and each prosodic identifier or phoneme has a corresponding pronunciation duration.
  • all prosodic identifiers and phonemes in the first clause sequence may be obtained first, and then based on each prosodic identifier and phoneme, the first duration corresponding to the prosodic identifier and phoneme may be obtained.
  • step 520 may include: inputting the clause sequence to the target speech synthesis model and obtaining the first clause speech information output by the target speech synthesis model, where the target speech synthesis model is trained with sample prosodic phoneme sequences as samples and the sample segmented speech corresponding to the sample prosodic phoneme sequences as sample labels.
  • the target speech synthesis model may be an end-to-end speech synthesis model.
  • the clause sequence is obtained by dividing an entire prosodic phoneme sequence into multiple segments.
  • the input value of the target speech synthesis model is a sequence of clauses, and the output value is the first clause speech corresponding to the clause sequence, or the high-level acoustic feature corresponding to the first clause speech.
  • the target speech synthesis model is trained using sample clause sequences as samples and the sample clause speech corresponding to the sample clause sequences as sample labels.
  • the training process of the target speech synthesis model is similar to that of a general neural network model and will not be described in detail here.
  • each clause sequence can be converted into a prosodic phoneme sequence that can be received by the end-to-end speech synthesis model, and each phoneme and prosodic identifier in that sequence can then be obtained from it.
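  • the conversion of a clause sequence into a form the end-to-end model can receive can be sketched in Python as follows. The rules are inferred from the example sequences given in this description and are assumptions rather than a stated specification: each pinyin syllable is split into initial and final, "u" after j, q, x or y is written "v" (so "yun2" becomes "y vn2"), and the clause is wrapped so that it starts with "sil" and ends with "sil eos". The function names are hypothetical.

```python
INITIALS = ("zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l", "g", "k",
            "h", "j", "q", "x", "r", "z", "c", "s", "y", "w")
SPECIAL = ("sil", "eos")


def split_syllable(syllable):
    """Split a toned pinyin syllable into [initial, final] where possible."""
    for initial in INITIALS:              # two-letter initials are tried first
        if syllable.startswith(initial) and len(syllable) > len(initial):
            final = syllable[len(initial):]
            if initial in ("j", "q", "x", "y") and final.startswith("u"):
                final = "v" + final[1:]   # assumed notation for the "ü" sound
            return [initial, final]
    return [syllable]                     # no initial: keep the syllable whole


def to_model_input(clause):
    """Convert one clause sequence into the assumed model-ready form."""
    out = []
    for token in clause.split():
        if token in SPECIAL or token.startswith("#"):
            out.append(token)
        else:
            out.extend(split_syllable(token))
    if out and out[-1] == "eos":
        out.pop()
    if not out or out[0] != "sil":
        out.insert(0, "sil")
    if out[-1] != "sil":
        out.append("sil")
    out.append("eos")
    return " ".join(out)


model_input = to_model_input(
    "dong1 #0 nan2 #0 feng1 #2 san1 #0 dao4 #1 si4 #0 ji2 #4 sil")
```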
  • the method may further include: outputting the first clause speech information.
  • after the first clause speech information corresponding to a clause sequence is generated, the first clause speech information can be output.
  • the target text is divided into multiple clause sequences, speech synthesis is performed on each clause sequence to generate the first clause speech information corresponding to each clause sequence, and the first clause speech information corresponding to the first clause sequence in the target text is output first, which effectively speeds up the system's feedback after receiving a network speech synthesis service request, shortens the user's waiting time, and helps improve the user experience.
  • Step 530 Based on the first duration and on the segmentation order, in the prosodic phoneme sequence, of the clause sequence corresponding to each first clause speech information, splice the multiple first clause speech information to generate the target speech.
  • the target speech is the speech obtained by speech synthesis of the target text.
  • the clause speech information corresponding to each clause sequence needs to be truncated based on the first duration of the redundant phonemes at the beginning or end of that clause sequence.
  • when the first clause speech information is the first clause speech, the portion of the first clause speech whose duration corresponds to the redundant phonemes at the beginning or end of the clause sequence is truncated to generate the truncated first clause speech, and adjacent truncated first clause speech is spliced in sequence until all truncated first clause speech has been spliced, generating the target speech.
  • when the first clause speech information is the high-level acoustic features corresponding to the first clause speech, the portion of those high-level acoustic features whose duration corresponds to the redundant phonemes at the beginning or end of the clause sequence is truncated to generate the high-level acoustic features corresponding to the truncated first clause speech; a vocoder is then used to perform speech synthesis on these truncated high-level acoustic features to generate the truncated first clause speech.
  • adjacent truncated first clause speech is then spliced in sequence until all truncated first clause speech has been spliced, generating the target speech.
  • step 530 may include: based on the first duration corresponding to the target phoneme among the plurality of phonemes, truncating the speech corresponding to the target phoneme in the first clause speech information to generate the second clause speech information; and, based on the segmentation order in the prosodic phoneme sequence of the clause sequence corresponding to the second clause speech information, splicing the second clause speech information to generate the target speech.
  • the target phonemes are redundant phonemes in the clause sequence, including but not limited to the unpronounced phonemes at the beginning or the end of the clause sequence.
  • the second clause speech information is the speech information generated by truncating the redundant pause or silence duration in the first clause speech information.
  • the second clause speech information may be expressed as speech, or may be expressed as high-level acoustic features.
  • the form of the second clause speech information corresponds to the form of the first clause speech information.
  • the first clause speech information generated by synthesis includes speech information covering the durations of sil, eos and the like. This is redundant speech information such as silence or pauses; based on the first durations of the target phonemes "sil" and "eos" at the end of the sentence, the excess duration corresponding to sil and eos at the end of the first clause speech information can be cut off to generate the second clause speech information.
  • the second clause speech information corresponding to adjacent clause sequences is spliced in sequence until the second clause speech information corresponding to all clause sequences has been spliced.
  • according to the speech splicing method provided by the embodiment of the present application, after the target text is divided into multiple clause sequences and the first clause speech information corresponding to each clause sequence is synthesized, the speech corresponding to the redundant phonemes in the first clause speech information is cut off based on the durations corresponding to the phonemes in each clause sequence, so that the naturalness and fluency of splicing adjacent first clause speech information can be improved without the need for a preset speech splicing unit library or for smoothing of the speech units to be spliced.
  • the implementation of step 530 is described in detail below from two implementation perspectives.
  • first, the clause sequence corresponding to the first clause speech information is not the first clause sequence in the target text.
  • in this case, the target phoneme includes at least one of a redundant phoneme at the beginning of the sentence and a redundant phoneme at the end of the sentence.
  • truncating the speech corresponding to the target phoneme in the first clause speech information may include: determining that the clause sequence corresponding to the first clause speech information is not the first clause sequence in the target text, and truncating from the first clause speech information both the speech corresponding to the redundant phoneme at the end of the sentence and the speech corresponding to the redundant phoneme at the beginning of the sentence.
  • Prosodic phoneme sequence 1 that can be received by the end-to-end speech synthesis model: sil sh ang4 #0 h ai3 #0 sh i4 #2 j in1 #0 t ian1 #2 y in1 #0 zh uan3 #1 d uo1 #0 y vn2 #3 sil eos;
  • Prosodic phoneme sequence 2 that can be received by the end-to-end speech synthesis model: sil d ong1 #0 n an2 #0 f eng1 #2 s an1 #0 d ao4 #1 s i4 #0 j i2 #4 sil eos.
  • here, prosodic phoneme sequence 1 is the first clause sequence in the target text, and prosodic phoneme sequence 2 is not the first clause sequence in the target text.
  • for prosodic phoneme sequence 2, truncation is based on the durations of the redundant phonemes at both the beginning and the end of the sentence: based on the first duration of "sil" at the beginning of the sentence, speech or high-level acoustic features of the corresponding duration are truncated at the beginning; based on the first durations of "sil" and "eos" at the end of the sentence, speech or high-level acoustic features of the corresponding duration are truncated at the end.
  • the second clause speech information is generated in this way.
  • second, the clause sequence corresponding to the first clause speech information is the first clause sequence in the target text.
  • in this case, the target phoneme includes at least one of a redundant phoneme at the beginning of the sentence and a redundant phoneme at the end of the sentence.
  • truncating the speech corresponding to the target phoneme in the first clause speech information may also include: determining that the clause sequence corresponding to the first clause speech information is the first clause sequence in the target text, and truncating from the first clause speech information the speech corresponding to the redundant phonemes at the end of the sentence.
  • Prosodic phoneme sequence 1 that can be received by the end-to-end speech synthesis model: sil sh ang4 #0 h ai3 #0 sh i4 #2 j in1 #0 t ian1 #2 y in1 #0 zh uan3 #1 d uo1 #0 y vn2 #3 sil eos;
  • Prosodic phoneme sequence 2 that can be received by the end-to-end speech synthesis model: sil d ong1 #0 n an2 #0 f eng1 #2 s an1 #0 d ao4 #1 s i4 #0 j i2 #4 sil eos.
  • here, prosodic phoneme sequence 1 is the first clause sequence in the target text, and prosodic phoneme sequence 2 is not the first clause sequence in the target text.
  • for prosodic phoneme sequence 1, truncation is based only on the durations of the redundant phonemes at the end of the sentence: based on the first durations of "sil" and "eos" at the end of the sentence, speech or high-level acoustic features of the corresponding duration are truncated at the end to generate the second clause speech information.
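  • the two cases above differ only in whether the beginning of the clause is trimmed, which the following Python sketch makes explicit. The function name and the per-token durations (50 ms for "sil" and "eos", 100 ms for everything else) are hypothetical values chosen for illustration; in practice the first durations would come from the synthesis model.

```python
REDUNDANT = {"sil", "eos"}


def trim_amounts(tokens, first_durations_ms, is_first_clause):
    """Return (head_ms, tail_ms) to cut from a clause's synthesized speech."""
    head_ms = 0
    if not is_first_clause:
        # Only clauses after the first lose their leading redundant phonemes.
        for token, duration in zip(tokens, first_durations_ms):
            if token not in REDUNDANT:
                break
            head_ms += duration
    tail_ms = 0
    for token, duration in zip(reversed(tokens), reversed(first_durations_ms)):
        if token not in REDUNDANT:
            break
        tail_ms += duration
    return head_ms, tail_ms


sequence_1 = ("sil sh ang4 #0 h ai3 #0 sh i4 #2 j in1 #0 t ian1 #2 y in1 #0 "
              "zh uan3 #1 d uo1 #0 y vn2 #3 sil eos").split()
sequence_2 = ("sil d ong1 #0 n an2 #0 f eng1 #2 s an1 #0 d ao4 #1 s i4 #0 "
              "j i2 #4 sil eos").split()


def toy_durations(tokens):
    # Hypothetical first durations in milliseconds.
    return [50 if token in REDUNDANT else 100 for token in tokens]


cut_1 = trim_amounts(sequence_1, toy_durations(sequence_1), is_first_clause=True)
cut_2 = trim_amounts(sequence_2, toy_durations(sequence_2), is_first_clause=False)
```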
  • in both cases, the speech splicing method provided by the embodiment of the present application cuts off the speech corresponding to the redundant phonemes in the first clause speech information based on the durations corresponding to the phonemes in each clause sequence, which improves the naturalness and fluency of splicing adjacent first clause speech information without the need for a preset speech splicing unit library or for smoothing of the speech units to be spliced.
  • the method for obtaining the audio and video file size includes: step 710, step 720 and step 730.
  • Step 710 Obtain the target text
  • the target text is the text currently used for speech synthesis.
  • the target text can be a regular text on the order of tens to hundreds in size, or an extremely long text on the order of thousands or tens of thousands.
  • the target text can be a local file stored in the database, or it can also be a file downloaded from the network, which is not limited in this application.
  • Step 720 Extract features from the target text to generate target prosodic features and target phoneme features
  • the target prosodic features are used to characterize the prosodic features of the target text
  • the target phoneme features are used to characterize the phoneme features of the target text.
  • the target prosodic features and target phoneme features include but are not limited to: phonemes and their corresponding tones, syllables, prosodic words, prosodic phrases, intonation phrases, silence, pauses and other features.
  • step 720 may include: converting the target text into a prosodic phoneme sequence, where the prosodic phoneme sequence includes a plurality of phonemes corresponding to the target text and prosodic identifiers located between adjacent phonemes; and extracting features of the prosodic phoneme sequence to generate the target prosodic features and the target phoneme features.
  • converting the target text into a prosodic phoneme sequence may include: converting the target text into a phoneme sequence; obtaining sentence-end information, intonation phrases, prosodic phrases, prosodic words, and syllables of the phoneme sequence; and marking the phoneme sequence based on at least two of the sentence-end information, intonation phrases, prosodic phrases, prosodic words, and syllables to generate the prosodic phoneme sequence.
  • the target prosodic features and target phoneme features may include at least one of: the length of the prosodic phoneme sequence, the number of Chinese pinyin in the prosodic phoneme sequence, the number of pause symbols in the prosodic phoneme sequence, the number of English phonemes in the prosodic phoneme sequence, the number of Chinese phonemes in the prosodic phoneme sequence, the number of Chinese initial consonants in the prosodic phoneme sequence, the number of Chinese finals in the prosodic phoneme sequence, and the number of English phonemes of each category in the prosodic phoneme sequence.
  • the length of the prosodic phoneme sequence may be the number of phonemes in the prosodic phoneme sequence.
  • Step 730 Obtain the target file size of the target audio file based on the target prosodic features and the target phoneme features.
  • the target audio file is an audio file generated by speech synthesis of the entire target text.
  • the target audio file may be a standalone audio file, or it may be the audio file included in a video file.
  • the target file size is the predicted file size of the target audio file.
  • the target file size may be file volume information, or may be voice length information of the third voice information, which is not limited in this application.
  • step 730 may include: obtaining a first predicted file size of the target audio file based on the target prosodic feature and the target phoneme feature; summing the target residual value and the first predicted file size to generate the target file size.
  • the first predicted file size is the initial file size value of the uncorrected target text-synthesized speech predicted based on the target prosodic features and the target phoneme features.
  • the target residual value is used to correct the first predicted file size to improve the accuracy of the final generated target file size.
  • the target residual value is determined based on the sample file size and the predicted size of the sample audio file corresponding to the sample text.
  • the sample file size is the actual size of the sample audio file corresponding to the sample text.
  • the target file size is the file size value of the speech synthesized from the target text, obtained by correcting the prediction made based on the target prosodic features and the target phoneme features. Understandably, the accuracy of the target file size is higher than that of the first predicted file size.
  • the target residual value is a predetermined value.
  • the target residual value can be the maximum absolute value of the residual value.
  • the first predicted file size is corrected by performing supplementary residual processing on the first predicted file size, thereby improving the accuracy of the final generated target file size.
  • a neural network model can be used to predict the first predicted file size.
  • step 730 may include: inputting the target prosodic feature and the target phoneme feature to the file size prediction model, and obtaining the first predicted file size output by the file size prediction model.
  • the file size prediction model may be a pre-trained neural network model.
  • the file size prediction model is used to predict the file size value of the speech synthesized by the text based on the prosodic features and phoneme features of the text.
  • the training process of the file size prediction model is as follows: using the sample prosodic features and sample phoneme features as samples, and using the sample file size corresponding to the sample prosodic features and sample phoneme features as sample labels, the file size prediction model is trained.
  • sample prosodic features and sample phoneme features are generated by extracting prosodic features and phoneme features from the sample text.
  • the extraction method of the sample prosodic features and sample phoneme features is similar to the extraction method of the target prosodic features and target phoneme features described above, and is not repeated here.
  • the sample file size corresponding to the sample prosodic features and sample phoneme features is the actual size value of the sample audio file generated by speech synthesis of the sample text.
  • the target prosodic features and target phoneme features are input to the trained file size prediction model, and the file size prediction model outputs the initial file size value of the speech synthesized from the target text corresponding to the target prosodic features and target phoneme features, that is, the first predicted file size.
  • the sum of the first predicted file size and the target residual value is calculated to generate the target file size.
  • the computing efficiency in the actual application process can be improved.
  • target prosodic features and target phoneme features corresponding to each target text in the actual application process can be used as training samples for subsequent training of the file size prediction model.
  • as the training sample volume increases, the intelligence of the file size prediction model will continue to improve, and the final prediction results will become more accurate.
  • the target residual value is determined by the following steps:
  • obtaining the sample text, the sample audio file, and the sample file size corresponding to the sample audio file, where the sample audio file corresponding to the sample text is generated by speech synthesis of the sample text;
  • the maximum absolute value of the difference between the second predicted file size and the sample file size is determined as the target residual value.
  • the sample text can be a regular text with a length on the order of tens to hundreds, or an extremely long text with a length on the order of thousands or tens of thousands.
  • the sample audio file is the audio file finally generated by speech synthesis of the sample text.
  • the sample file size is the actual size value of the sample audio file or the actual audio duration.
  • a speech synthesis system can be used to calculate the actual wav file size or audio duration of the sample audio file corresponding to the sample text.
  • the second predicted file size is the predicted uncorrected size value or audio duration of the sample audio file.
  • the generation method of the second predicted file size should be consistent with the generation method of the first predicted file size.
  • feature extraction can be performed on the sample text to generate sample prosodic features and sample phoneme features, and the sample prosodic features and sample phoneme features are input to the file size prediction model to obtain the second predicted file size output by the file size prediction model.
  • the sample prosodic features and the sample phoneme features may be predicted multiple times to obtain multiple second predicted file sizes. The difference between each second predicted file size and the sample file size is then calculated to obtain multiple candidate differences, and the absolute value of the smallest non-positive value among the candidate differences is determined as the target residual value, which improves the accuracy of the target residual value.
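  • The residual selection and the subsequent correction can be sketched as follows. This is only an illustration: the function names are hypothetical, the numbers are invented for the example, and returning 0 when no candidate difference is non-positive is an assumption that the text does not address.

```python
def target_residual(second_predicted_sizes, sample_size):
    """Absolute value of the smallest non-positive candidate difference.

    Each candidate difference is (second predicted size - sample size), so a
    non-positive value means the model under-predicted the actual size.
    """
    diffs = [p - sample_size for p in second_predicted_sizes]
    non_positive = [d for d in diffs if d <= 0]
    if not non_positive:
        return 0  # assumption: no under-prediction, so no correction needed
    return abs(min(non_positive))


def target_file_size(first_predicted_size, residual):
    """Sum of the first predicted file size and the target residual value."""
    return first_predicted_size + residual


# Invented numbers: three predictions for a sample whose actual size is 1000.
RESIDUAL = target_residual([980, 1010, 950], sample_size=1000)
CORRECTED = target_file_size(2000, RESIDUAL)
```

  • Choosing the largest under-prediction as the residual biases the corrected value upward, so the reported size is unlikely to fall below the size of the file that is eventually generated.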
  • the prosodic features and phoneme features of the target text are extracted, and the size information of the target audio file synthesized from the target text is predicted based on the extracted target prosodic features and target phoneme features, so that the size value of the target audio file can be predicted before the target audio file is generated, which provides a certain degree of timeliness; in addition, the accuracy and precision of the prediction results are high.
  • the method may further include: segmenting the target text based on the target prosodic features and the target phoneme features to generate multiple clause sequences; performing speech synthesis on the multiple clause sequences to generate segmented speech; and outputting the segmented speech and the target file size, and splicing the segmented speech to generate the target audio file.
  • each clause sequence includes at least one phoneme, where the phoneme may be a Chinese phoneme or an English phoneme.
  • for example, the sample text "For detailed content, please search on the APP" can be converted into a sample prosodic phoneme sequence: sil xiang2 #0 xi4 #1 nei4 #0 rong2 #2 ma2 #0 fan5 #2 zai4 #1 AE1 P #0 shang4 #1 sou1 #0 xun2 #0 xia4 #4 sil;
  • sample prosodic features and sample phoneme features include but are not limited to: the length of the sample prosodic phoneme sequence; the number of occurrences of Chinese pinyin in the sample prosodic phoneme sequence; the number of pause symbols (#0 #1 #2 #3 sil) in the sample prosodic phoneme sequence; the number of English phonemes in the sample prosodic phoneme sequence; the number of Chinese phonemes in the sample prosodic phoneme sequence; the number of Chinese initial consonants in the sample prosodic phoneme sequence; the number of Chinese finals in the sample prosodic phoneme sequence; and the number of English phonemes of each category (Vowels, Diphthongs, R colored vowels, Stops, Affricates, Fricatives, Nasals, Liquids, Semivowels) in the sample prosodic phoneme sequence.
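  • A few of these counts can be computed directly from the sample prosodic phoneme sequence, as the sketch below shows. The token classification rules are assumptions made for illustration (lowercase letters plus a tone digit for pinyin, uppercase letters with an optional stress digit for English phonemes), and the function and key names are hypothetical. The pause symbol set follows the list given in the text, so "#4" is not counted as a pause symbol here.

```python
import re

# Pause symbols exactly as listed in the text; "#4" is therefore excluded.
PAUSE_SYMBOLS = {"#0", "#1", "#2", "#3", "sil"}
# Assumed token shapes, chosen to fit the sample sequence only.
PINYIN = re.compile(r"^[a-z]+[1-5]$")    # e.g. "xiang2", "fan5"
ENGLISH = re.compile(r"^[A-Z]+[0-2]?$")  # e.g. "AE1", "P"

SAMPLE = ("sil xiang2 #0 xi4 #1 nei4 #0 rong2 #2 ma2 #0 fan5 #2 zai4 #1 "
          "AE1 P #0 shang4 #1 sou1 #0 xun2 #0 xia4 #4 sil")


def extract_features(sequence):
    """Count a subset of the features named in the text."""
    tokens = sequence.split()
    return {
        "num_tokens": len(tokens),
        "num_pinyin": sum(1 for t in tokens if PINYIN.match(t)),
        "num_pauses": sum(1 for t in tokens if t in PAUSE_SYMBOLS),
        "num_english": sum(1 for t in tokens if ENGLISH.match(t)),
    }


FEATURES = extract_features(SAMPLE)
```

  • The remaining features (initial consonants, finals, English phoneme categories) would need lookup tables for the phoneme inventory in use, which the text does not specify.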
  • the sample prosodic features and sample phoneme features obtained above are input to the wav file size prediction model.
  • the target output of the training process is the actual wav file byte number of the sample audio file.
  • the client initiates a request. For example, get the target text: Shanghai will become cloudy today with southeasterly winds of level 3 to 4.
  • the system extracts target prosodic features and target phoneme features from the target text requested by the client.
  • the extracted target prosodic features and target phoneme features are input into the model as described above to obtain the first predicted file size.
  • the residual is added to the first predicted file size, and the generated target file size is the sum of the first predicted file size and the target residual value.
  • the target text requested by the client is segmented to generate multiple sentence sequences, for example, divided into:
  • second clause sequence: southeasterly winds of level 3 to 4.
  • the audio of the clause sequences following the first clause sequence is synthesized in order and written into the wav file until the entire request has been synthesized. For example, the audio of "southeasterly winds of level 3 to 4" is synthesized and written into the wav file, and the process then ends.
  • for example, the sample text "controllable" can be converted into a prosodic phoneme sequence: sil k e2 #0 y i3 #1 k ong4 #0 zh i4 #3 sil eos, and the durations of the prosodic phonemes (in numbers of mel spectrum frames) can be predicted as:
  • 3 1 3 1 1 6 2 2 7 2 4 5 11 4 12, and the sum of the durations of the phonemes is used as the sample file size.
  • in the subsequent model training process, the model can be configured with one 256-dimensional embedding layer, followed by four layers of one-dimensional convolutional neural network with 256 channels, followed by layer norm, followed by dropout, and finally followed by a fully connected layer with an output dimension of 1.
  • the loss function can include the MSE loss of the phoneme duration sequence and the MAE loss of the average total duration of each phoneme.
  • the Adam optimizer is used to iteratively optimize the model.
  • the total number of predicted mel spectrum frames can be obtained from the above model as the second predicted file size, and the maximum residual value is then calculated.
  • the client initiates a request. For example, get the target text: Shanghai will become cloudy today with southeasterly winds of level 3 to 4.
  • the system extracts target prosodic features and target phoneme features from the target text requested by the client.
  • the extracted target prosodic features and target phoneme features are input into the model as described above to obtain the first predicted file size.
  • the residual is added to the first predicted file size, and the generated target file size is the sum of the first predicted file size and the target residual value.
  • after the number of mel spectrum frames is calculated, the audio duration (the total number of mel spectrum frames) is converted into the wav file size:
  • wav file size = ((number of mel spectrum frames × mel spectrum frame shift / 16000) × 16000 × 16 × 1 / 8 + 44) bytes.
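  • The conversion can be worked through in a few lines. The reading of the constants is an interpretation rather than something the text states: 16000 is taken as the sample rate in Hz, 16 as the bits per sample, 1 as the number of channels, and 44 as the size of a standard WAV header, with the frame shift expressed in samples. The frame shift of 200 samples is an assumed value chosen only to make the example concrete.

```python
def wav_file_size_bytes(num_frames, frame_shift, sample_rate=16000,
                        bits_per_sample=16, channels=1, header_bytes=44):
    """Convert a mel spectrum frame count into a wav file size in bytes."""
    # Audio duration in seconds (frame_shift is assumed to be in samples).
    duration_s = num_frames * frame_shift / sample_rate
    # PCM payload: seconds * samples/second * bits/sample * channels / 8.
    payload = duration_s * sample_rate * bits_per_sample * channels / 8
    return int(round(payload)) + header_bytes


# Durations from the "controllable" sample, in mel spectrum frames.
DURATIONS = [3, 1, 3, 1, 1, 6, 2, 2, 7, 2, 4, 5, 11, 4, 12]
TOTAL_FRAMES = sum(DURATIONS)
SIZE = wav_file_size_bytes(TOTAL_FRAMES, frame_shift=200)
```

  • With these assumptions the 64 frames of the sample correspond to 12800 samples, that is, 25600 bytes of 16-bit mono audio plus the 44-byte header.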
  • the prosodic and phoneme features of the target text are extracted, and the size information of the target audio file synthesized from the target text is predicted based on the extracted target prosodic features and target phoneme features, so that the size value of the target audio file can be predicted before the target audio file is generated, which provides a certain degree of timeliness; in addition, the accuracy and precision of the prediction results are high.
  • the text transcription method includes: step 910 and step 920.
  • Step 910 Segment the prosodic phoneme sequence of the target text to generate multiple clause sequences
  • the target text is the text currently used for speech synthesis.
  • the prosodic identifier may include at least one of identifiers used to characterize syllables, used to characterize prosodic words, used to characterize prosodic phrases, used to characterize end-of-sentence information, and used to characterize intonation phrases.
  • different prosodic identifiers correspond to different levels of fine-grainedness: the fine-grainedness of the identifier used to characterize end-of-sentence information is greater than that of the identifier used to characterize intonation phrases, the fine-grainedness of the identifier used to characterize intonation phrases is greater than that of the identifier used to characterize prosodic phrases, the fine-grainedness of the identifier used to characterize prosodic phrases is greater than that of the identifier used to characterize prosodic words, and the fine-grainedness of the identifier used to characterize prosodic words is greater than that of the identifier used to characterize syllables.
  • the prosodic identifier may include: # and numbers between adjacent pinyin; the phoneme may include the pinyin and tone or English phonetic symbols corresponding to each Chinese character.
  • sil represents the silence at the beginning and end of the sentence in the prosodic phoneme sequence
  • #0 represents the syllable
  • #1 represents the prosodic word
  • #2 represents the prosodic phrase
  • #3 represents the intonation phrase
  • #4 represents the end of the sentence.
  • the number after the phoneme represents the tone of the phoneme.
  • the 4 in shang4 represents the fourth tone of the pinyin "shang".
  • the fine-grained order from small to large is: #0 < #1 < #2 < #3 < #4.
  • At least two clause sequences can be obtained.
  • step 910 may include: converting the target text into a prosodic phoneme sequence; segmenting the prosodic phoneme sequence based on at least part of the plurality of prosodic identifiers to generate a plurality of clause sequences.
  • an entire prosodic phoneme sequence includes multiple phonemes and multiple prosodic identifiers, and the multiple prosodic identifiers include prosodic identifiers corresponding to different fine-grained levels.
  • an appropriate fine-grained level can be selected as the segmentation criterion based on the actual situation, and the positions of the prosodic identifiers corresponding to that fine-grained level in the prosodic phoneme sequence can be used as segmentation points to segment the prosodic phoneme sequence and obtain multiple clause sequences.
  • each clause sequence includes a prosodic identifier at a segmentation point and at least one phoneme.
  • for example, for the prosodic phoneme sequence "sil shang4 #0 hai3 #0 shi4 #2 jin1 #0 tian1 #2 yin1 #0 zhuan3 #1 duo1 #0 yun2 #3 dong1 #0 nan2 #0 feng1 #2 san1 #0 dao4 #1 si4 #0 ji2 #4 sil", if it is determined based on actual needs to split at #3, the sequence is split at each position containing #3, and the prosodic identifier #3 is retained in the preceding splicing unit, so that the prosodic phoneme sequence is divided into multiple clause sequences.
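  • The split at #3 described in this example can be sketched as follows. The function name `split_at_identifier` and the whitespace tokenization are assumptions made for illustration; the sketch only shows that the identifier stays with the preceding unit.

```python
def split_at_identifier(sequence, split_ids=("#3",)):
    """Split a prosodic phoneme sequence after each chosen identifier.

    The identifier is kept at the end of the preceding clause sequence.
    """
    clauses, current = [], []
    for token in sequence.split():
        current.append(token)
        if token in split_ids:
            clauses.append(" ".join(current))
            current = []
    if current:  # trailing tokens after the last split point
        clauses.append(" ".join(current))
    return clauses


SEQUENCE = ("sil shang4 #0 hai3 #0 shi4 #2 jin1 #0 tian1 #2 yin1 #0 zhuan3 "
            "#1 duo1 #0 yun2 #3 dong1 #0 nan2 #0 feng1 #2 san1 #0 dao4 #1 "
            "si4 #0 ji2 #4 sil")
CLAUSES = split_at_identifier(SEQUENCE)
```

  • For the sequence above this yields two clause sequences, the first ending in "yun2 #3" and the second starting at "dong1". Passing a larger set such as `("#2", "#3")` would split at a coarser or finer level in the same way.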
  • step 910 may also include:
  • the prosodic phoneme sequence can be segmented into multiple candidate sequences.
  • the speech synthesis duration corresponding to the candidate sequence located before the first segmentation point is within the target duration.
  • the speech synthesis time is the time it takes to synthesize the candidate sequence into speech.
  • the target duration is a shorter duration.
  • the target duration can be customized based on the user, or the system default value can be used.
  • the target duration can be set to 0.2s or 0.3s.
  • the target candidate sequence can be any candidate sequence among multiple candidate sequences.
  • the target candidate sequence is combined with other adjacent candidate sequences to generate multiple combined clause sequences, where the multiple clause sequences include the original candidate sequences and the original prosodic phoneme sequence corresponding to the target text.
  • the fine-grained level of the clause sequence is greater than the fine-grained level of the target candidate sequence.
  • XX is "sil xi1 #0 wang4 #1 zhe4 #0 shou3 #0 ge1 #3
  • each clause sequence corresponds to a fine-grained level.
  • the clause sequences are sorted: the greater the fine-grainedness, the earlier the corresponding clause sequence is ranked. For example, "xi1 #0 wang4 #1 zhe4 #0 shou3 #0 ge1 #3 neng2 #0 rang4 #1 ni2 #1 xi3 #0 huan1 #1 wei4 #0 nin2 #1 bo1 #0 fang4 #1 EH1 K S #10 EH1 K S #0 de5 #1 EH1 K S #4" (I hope this song will make you like playing XX for you) is ranked in front.
  • the prosodic phoneme sequence is segmented based on the prosodic results predicted from semantics and people's speaking habits, so that it is segmented at long pauses rather than simply at punctuation marks, which helps to improve the naturalness of the target speech generated by subsequently splicing the synthesized clause speech.
  • Step 920 When it is determined that any to-be-matched clause sequence among the plurality of clause sequences matches a cached target clause sequence, obtain the target clause speech corresponding to the target clause sequence from the cache, and determine the speech corresponding to the to-be-matched clause sequence to be the target clause speech.
  • the target clause sequence is a clause sequence that is pre-generated and stored in the system.
  • the target clause sequence can be any one of all pre-stored clause sequences cached in the system.
  • the target clause speech is the speech generated by performing speech synthesis on the target clause sequence in advance.
  • the target clause speech is stored in the system, and a corresponding relationship is established between the target clause sequence and the target clause speech.
  • the cached target clause sequence can be exactly matched against the multiple clause sequences respectively. If it is determined that the target clause sequence matches any to-be-matched clause sequence among the multiple clause sequences, the to-be-matched clause sequence is determined as the first clause sequence, and the target clause speech corresponding to the target clause sequence is obtained from the cache;
  • the target clause voice corresponding to the target clause matching the first clause sequence can be directly determined as the voice corresponding to the first clause sequence.
  • the clause sequence to be matched includes any clause sequence among multiple clause sequences and combinations between different clause sequences.
  • step 920 may include: based on the descending order in step 910, accurately matching multiple clause sequences with the target clause sequence from front to back.
  • the clause sequences are exactly matched against the target clause sequence from front to back. For example, the clause sequence "xi1 #0 wang4 #1 zhe4 #0 shou3 #0 ge1 #3 neng2 #0 rang4 #1 ni2 #1 xi3 #0 huan1 #1 wei4 #0 nin2 #1 bo1 #0 fang4 #1 EH1 K S #10 EH1 K S #0 de5 #1 EH1 K S #4" (I hope this song will make you like playing XX for you) is first matched against the target clause sequence. If the match succeeds, the clause sequence is determined as the first clause sequence, the target clause speech corresponding to the target clause sequence matching the first clause sequence is determined to be the speech corresponding to the first clause sequence, and the comparison ends.
  • otherwise, the clause sequence "xi1 #0 wang4 #1 zhe4 #0 shou3 #0 ge1 #3 neng2 #0 rang4 #1 ni2 #1 xi3 #0 huan1 #1 wei4 #0 nin2 #1 bo1 #0 fang4 #1" (I hope this song will make you like playing it for you) is compared with the target clause sequence, and the above process is repeated until a clause sequence is found that exactly matches the target clause sequence, at which point the comparison ends.
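  • The front-to-back comparison can be sketched as a lookup over candidates that are already sorted in the order produced by step 910. The dictionary-based cache, the function name `first_cached_match`, and the placeholder byte strings are assumptions made for illustration; the text does not say how the cache is stored.

```python
def first_cached_match(candidates, cache):
    """Return (clause_sequence, speech) for the first exact cache hit.

    candidates -- clause sequences, already sorted front to back
    cache      -- mapping from cached target clause sequence to its speech
    Returns None when no candidate is present in the cache.
    """
    for sequence in candidates:
        speech = cache.get(sequence)
        if speech is not None:
            return sequence, speech  # comparison ends at the first match
    return None


# Hypothetical cache holding one shorter clause sequence.
CACHE = {"xi1 #0 wang4 #1 zhe4 #0 shou3 #0 ge1 #3": b"<cached speech>"}
CANDIDATES = [
    "xi1 #0 wang4 #1 zhe4 #0 shou3 #0 ge1 #3 neng2 #0 rang4 #1",
    "xi1 #0 wang4 #1 zhe4 #0 shou3 #0 ge1 #3",
]
MATCH = first_cached_match(CANDIDATES, CACHE)
```

  • Because longer candidates are tried first, a hit on a long sequence avoids both synthesis and splicing for everything it covers, while shorter candidates still give a chance of a partial hit.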
  • the method may further include: outputting the target sentence speech.
  • the target sentence voice is the voice corresponding to the first sentence sequence
  • the target sentence voice is the voice that is pre-generated and stored in the cache.
  • the target clause speech corresponding to the target clause sequence similar to the first clause sequence is directly determined as the speech corresponding to the first clause sequence, and the target clause speech is output.
  • the target text is converted into a prosodic phoneme sequence, the pause positions and pause duration levels of the target text are determined based on the prosodic features, the prosodic phoneme sequence is divided into multiple clause sequences based on the prosodic features, and each clause sequence is compared with the cached target clause sequence. Since the sequence used as the search keyword is shorter, it is easier to hit in the cache search, which can effectively improve the hit efficiency; when a clause sequence is the same as the target clause sequence, the target clause speech corresponding to the target clause sequence is directly determined as the speech of the clause sequence without performing speech synthesis again, thereby effectively reducing the computing power expenditure of the server.
  • the prosody prediction module and the segmentation module can be used respectively to perform the above steps.
  • the prosodic phoneme sequence is divided into multiple clause sequences based on prosodic features, and each clause sequence is compared with the cached target clause sequence, which can effectively improve the hit efficiency; when a clause sequence is the same as the target clause sequence, the target clause speech corresponding to the target clause sequence is directly determined as the speech of the clause sequence without performing speech synthesis again, thereby improving the efficiency of speech synthesis.
  • the method may further include: when it is determined that any to-be-matched clause sequence among the multiple clause sequences does not match the target clause sequence, performing speech synthesis on the to-be-matched clause sequence to generate the second clause speech.
  • the target clause sequence is a clause sequence that is pre-generated and stored in the system.
  • the target clause sequence can be any one of all pre-stored clause sequences cached in the system.
  • the sequence of clauses to be matched is determined to be the second clause sequence.
  • the target clause sequence is exactly matched against any to-be-matched clause sequence among the multiple clause sequences. If they do not match, the to-be-matched clause sequence is determined as the second clause sequence, and speech synthesis is performed on the second clause sequence to generate the second clause speech.
  • the voice of the second clause is a voice that does not exist in the cache.
  • the clause sequence can be accurately matched with the target clause sequence from front to back based on the descending order generated in step 910.
  • speech synthesis is performed on the clause sequence for which no similar sequence is found, and the second clause speech is generated.
  • performing speech synthesis on the second clause sequence to generate the second clause speech may include: converting the second clause sequence into a prosodic phoneme sequence that can be received by the end-to-end speech synthesis model; and performing speech synthesis on the prosodic phoneme sequence to generate the second clause speech.
  • the prosodic phoneme sequence is used to represent the prosodic information and phoneme information of the second clause sequence.
  • phonemes are the smallest phonetic units divided according to the natural properties of speech. Analyzed according to the pronunciation movements in syllables, one movement constitutes a phoneme, and the phoneme can be a Chinese phoneme or an English phoneme.
  • the second clause sequence can be expressed as sil shang4 #0 hai3 #0 shi4 #2 jin1 #0 tian1 #2 yin1 #0 zhuan3 #1 duo1 #0 yun2 #3, or as: sil sh ang4 #0 h ai3 #0 sh i4 #2 j in1 #0 t ian1 #2 y in1 #0 zh uan3 #1 d uo1 #0 y vn2 #3 sil eos and other phoneme sequences in different formats.
  • the second clause sequence is input to a speech synthesis system (such as an end-to-end speech synthesis model), and the second clause speech is synthesized by the speech synthesis system.
  • the text-to-phoneme module can be used to perform the above operations.
  • phonemes are used as the cache keywords, which overcomes the shortcoming that texts are cached as different sentences when the punctuation in the text changes, or when the way numbers are written changes, even though the pronunciation is exactly the same; this achieves standardized caching of the target text and improves cache efficiency.
  • the method may further include: segmenting the second clause speech based on the prosodic identifiers to generate multiple sub-second clause speech segments; and caching the multiple sub-second clause speech segments together with the sub-clause sequences corresponding to them.
  • the second clause speech can be segmented based on the prosodic identifier in the second clause sequence corresponding to the second clause speech, and multiple sub-second clause speech can be generated.
  • the clause sequence corresponding to the voice of each sub-second clause is the sub-clause sequence.
  • the sub-clause sequence and the sub-second clause speech can be cached in the system and used as a target clause sequence and its corresponding target clause speech in subsequent query processes.
  • the method may further include: splicing the second clause speech and the target clause speech to generate a target speech corresponding to the target text.
  • the target speech is the speech obtained by speech synthesis of the target text.
  • the target sentence voice is the voice that exists in the cache
  • the voice of the second clause is a voice that does not exist in the cache.
  • the target speech is generated based on at least one of the target clause speech in the cache and the newly generated second clause speech.
  • splicing the second clause speech and the target clause speech may also include: splicing the second clause speech and the target clause speech based on the segmentation order, in the prosodic phoneme sequence, of the clause sequence corresponding to the second clause speech and of the clause sequence corresponding to the target clause speech.
  • the speech corresponding to the adjacent clause sequence is spliced in sequence, until the speech corresponding to all the clause sequences is spliced, and the target speech is generated.
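  • The overall flow of reusing cached clause speech, synthesizing the clauses that miss, and splicing in segmentation order can be sketched as follows. Everything here is hypothetical scaffolding: `synthesize` stands in for whatever speech synthesis system is used, the cache is a plain dictionary, speech is represented as raw bytes joined without any header handling, and storing each newly synthesized clause whole is an assumption (the text describes caching sub-clause segments).

```python
def build_target_speech(clause_sequences, cache, synthesize):
    """Splice clause speech in the order the clauses were segmented.

    clause_sequences -- clause sequences in segmentation order
    cache            -- mapping from clause sequence to cached speech bytes
    synthesize       -- callable producing speech bytes for a clause sequence
    """
    pieces = []
    for sequence in clause_sequences:
        speech = cache.get(sequence)
        if speech is None:
            # Cache miss: this is a "second clause sequence".
            speech = synthesize(sequence)
            cache[sequence] = speech  # assumption: store it for later requests
        pieces.append(speech)
    return b"".join(pieces)


# Hypothetical stand-ins used only to exercise the sketch.
CALLS = []


def fake_synthesize(sequence):
    CALLS.append(sequence)
    return b"[new:" + sequence.encode() + b"]"


CACHE = {"a1 #0 b2 #3": b"[cached]"}
SPEECH = build_target_speech(["a1 #0 b2 #3", "c3 #4 sil"], CACHE, fake_synthesize)
```

  • Only the second clause reaches the synthesizer in this run, which is the computing power saving the embodiment describes.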
  • the target voice After the target voice is generated, the target voice can also be output.
  • the prosodic phoneme sequence is divided into multiple clause sequences based on prosodic features, and each clause sequence is compared with the cached target clause sequence, which can effectively improve the hit efficiency; speech synthesis is performed only when a clause sequence differs from the target clause sequence, which effectively reduces the computing power pressure on the server and improves the efficiency of speech synthesis.
  • the text segmentation method includes: step 1110, step 1120 and step 1130.
  • Step 1110 Convert the target text into a prosodic phoneme sequence.
  • the size of the target text exceeds the target threshold, and the target text exceeding the target threshold is segmented. It can be understood that the target text that exceeds the target threshold is the text whose speech synthesis time exceeds the preset range. Therefore, the efficiency of speech synthesis can be improved by segmenting the longer text and then performing speech synthesis on the segmented text.
  • Step 1120 Determine the first segmentation position in the prosodic phoneme sequence based on multiple prosodic identifiers
  • the first cutting position is the position of the cutting point used for the first cutting.
  • the prosodic phoneme sequence can be segmented into two subsequences before and after the first segmentation position, and the subsequence before the first segmentation position is determined as the first sub-prosodic phoneme sequence.
  • the speech synthesis duration corresponding to the first sub-prosodic phoneme sequence generated based on the first segmentation position is within the target duration.
  • the speech synthesis time is related to the computing power of the speech synthesis system.
  • the speech synthesis time corresponding to the first sub-prosodic phoneme sequence is the time it takes to synthesize the first sub-prosodic phoneme sequence into speech.
  • the target duration is a shorter duration.
  • the value of the target duration can be customized based on the user, or the system default value can be used.
  • the target duration can be set to 0.2s or 0.3s.
  • the plurality of prosodic identifiers may include: at least one of identifiers used to characterize syllables, used to characterize prosodic words, used to characterize prosodic phrases, used to characterize intonation phrases, and used to characterize end-of-sentence information.
  • the fine-grainedness of the identifier used to characterize the end of the sentence is greater than that of the identifier used for the intonation phrase, and the fine-grainedness of the identifier used to characterize the intonation phrase is greater than the fine-grainedness of the identifier used to characterize the prosodic phrase.
  • the identifiers used to characterize prosodic phrases are finer-grained than the identifiers used to characterize prosodic words, and the identifiers used to characterize prosodic words are finer-grained than the identifiers used to characterize syllables.
  • the prosodic identifier is a symbol used to represent at least one of sentence-end information, intonation phrase, prosodic phrase, prosodic word, and syllable.
  • rhythm identifier can be represented by a combination of special symbols and numbers or a specific combination of letters, such as "#0", “#1", “#2", “#3” and “#” respectively. 4" to represent prosody identifiers, and different combinations represent different levels of fine-grainedness.
  • #0 represents a syllable
  • #1 represents a prosodic word
  • #2 represents a prosodic phrase
  • #3 represents an intonation phrase
  • #4 represents the end of a sentence.
  • For example, the target text "Currently, Xiao Jia can control the water heater switch, adjust the temperature, and time the switch. Please search for details on the Jiaju app" can be converted into the prosodic phoneme sequence: sil mu4 #0 qian2 #1 xiao2 #0 jia3 #3 ke2 #0 yi3 #1 kong4 #0 zhi4 #1 re4 #0 shui3 #0 qi4 #1 kai1 #0 guan1 #3 tiao2 #0 jie2 #1 wen1 #0 du4 #3 ding4 #0 shi2 #1 kai1 #0 guan1 #3 xiang2 #0 xi4 #1 nei4 #0 rong2 #2 ma2 #0 fan5 #1 zai4 #1 jia3 #0 ju1 #1 AE1 P #0 shang4 #1 sou1 #0 xun2 #0 xia4 #4 sil.
  • The prosodic phoneme sequence includes multiple prosodic identifiers such as #0, #1, #3, and #4; ordered from smallest to largest granularity, the prosodic identifiers are: #0 < #1 < #2 < #3 < #4.
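As a small illustration of this notation, the sketch below locates the prosodic identifiers in a whitespace-separated prosodic phoneme sequence and reads their granularity from the digit after "#". The function name and the token-index convention are assumptions made for this example; they do not come from the application.

```python
from typing import List, Tuple

# Granularity levels as described in the text: a larger number means a larger granularity.
PROSODY_LEVELS = {
    "#0": "syllable",
    "#1": "prosodic word",
    "#2": "prosodic phrase",
    "#3": "intonation phrase",
    "#4": "sentence end",
}


def prosodic_positions(sequence: str) -> List[Tuple[int, str, int]]:
    """Return (token index, identifier, granularity) for every prosodic identifier."""
    tokens = sequence.split()
    return [
        (index, token, int(token[1:]))
        for index, token in enumerate(tokens)
        if token in PROSODY_LEVELS
    ]
```

For the prefix `sil mu4 #0 qian2 #1 xiao2 #0 jia3 #3`, the identifiers sit at token indexes 2, 4, 6 and 8.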
  • Each prosodic identifier occupies a position in the prosodic phoneme sequence. By comparing the speech synthesis duration of the sub-prosodic phoneme sequence before each of these positions with the target duration, the position of a first target identifier is selected from these positions as the first segmentation position, which ensures that the speech synthesis duration of the first sub-prosodic phoneme sequence generated from the first segmentation position is within the target duration.
  • Step 1120 may include: determining, based on the target threshold range, the prosodic identifier with the largest granularity among the multiple prosodic identifiers; and determining the position of that identifier in the prosodic phoneme sequence, within the target threshold range, as the first segmentation position.
  • The target threshold range is defined by the minimum and maximum values of the element pronunciation length.
  • The element pronunciation length is the sum of the pronunciation lengths of all phonemes before the target position in the prosodic phoneme sequence.
  • The target position can be the position of any prosodic identifier in the prosodic phoneme sequence.
  • The target threshold range can be expressed as (n, n+m), where the values of n and m can be user-defined or determined by an algorithm; n and m are both positive integers, and the sum of n and m does not exceed the sum of the pronunciation lengths of all phonemes in the prosodic phoneme sequence.
  • n can be set to 5
  • m can be set to 5
  • the target threshold range is determined to be 5-10 unit pronunciation lengths.
  • A Chinese character is 1 unit of pronunciation length.
  • An English string that forms a word is 2 units of pronunciation length.
  • An English string that does not form a word is 4 units of pronunciation length.
  • When the identifier with the largest granularity occurs at several positions, the one with the smallest position number, that is, the first one in the prosodic phoneme sequence, is determined as the first segmentation position.
  • determining the most fine-grained prosodic identifier from multiple prosodic identifiers based on the target threshold range may include:
  • the target sub-prosodic phoneme sequence is all the prosodic phoneme sequences before the target position in the prosodic phoneme sequence, where the target position is the position of any prosodic identifier in the prosodic phoneme sequence.
  • The first pronunciation length is the sum of the pronunciation lengths of the phonemes in the target sub-prosodic phoneme sequence.
  • The first pronunciation length can be obtained through the first segmentation position search module.
  • The prosodic phoneme sequence is converted into a prosodic phoneme list, and the list is input to the first segmentation position search module.
  • The first segmentation position search module sets the initial value of the current pronunciation length and the initial value of the list index to 0, initializes an empty prosodic position dictionary dict, and starts a loop based on the following formulas:
  • element pronunciation length = get_voice_length(list index);
  • first pronunciation length = first pronunciation length + element pronunciation length.
  • the element pronunciation length is the pronunciation length corresponding to the syllable at the current target position
  • the list index is used to represent the target position
  • the function get_voice_length (list index) calculates the pronunciation length of the element specified by the index.
  • The calculation method is: 1 Chinese character is 1 unit of pronunciation length, 1 English string that forms a word (found in the dictionary) is 2 units of pronunciation length, and an English string that is not a word (not in the dictionary) is 4 units of pronunciation length.
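The unit rule above can be sketched as a small function. The text only names `get_voice_length` and the three unit values; the element-based signature, the CJK range check, and the tiny `ENGLISH_WORDS` set are assumptions standing in for whatever dictionary a real system would use.

```python
# Hypothetical stand-in for a real English dictionary lookup.
ENGLISH_WORDS = {"app", "wifi"}


def get_voice_length(element: str) -> int:
    """Return the pronunciation length of one list element, in units."""
    # A single character in the CJK Unified Ideographs block counts as 1 unit.
    if len(element) == 1 and "\u4e00" <= element <= "\u9fff":
        return 1
    # An English string found in the dictionary counts as 2 units.
    if element.lower() in ENGLISH_WORDS:
        return 2
    # Any other string is treated as an English non-word: 4 units.
    return 4
```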
  • N first pronunciation lengths whose list index values are in the range of (1, N) can be obtained respectively, where N is the number of syllables in the list.
  • the current prosodic identifier and position index are recorded into the dictionary dict.
  • determining the target position corresponding to the first pronunciation length as the candidate segmentation point position includes:
  • When the first pronunciation length is within the target threshold range and the prosodic identifier at the corresponding target position occurs for the first time, that target position is determined as a candidate segmentation point position.
  • The current prosodic identifier and the list index are recorded as a key-value pair in the dictionary dict.
  • When the prosodic identifier corresponding to the current list index has appeared before, the current list index is skipped and the next iteration begins.
  • the list indexes corresponding to all the recorded prosody identifiers can be used as candidate segmentation point positions for determining the first segmentation position.
  • the current list index is incremented by one and the next cycle is entered.
  • the loop ends.
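The search loop described above can be sketched as follows. Several details are assumptions, because the extracted text omits them: the range bounds are treated as inclusive, every phoneme token counts as 1 unit by default, `sil` is ignored, and the loop stops once the running length exceeds the upper bound. `find_first_cut` is a hypothetical name.

```python
from typing import Callable, Dict, List, Optional


def find_first_cut(
    tokens: List[str],
    lower: int = 5,
    upper: int = 10,
    unit_length: Callable[[str], int] = lambda token: 1,
) -> Optional[int]:
    """Return the token index of the first segmentation position, or None."""
    total = 0  # running "first pronunciation length"
    first_seen: Dict[str, int] = {}  # prosodic identifier -> first index inside the range
    for index, token in enumerate(tokens):
        if token == "sil":
            continue
        if token.startswith("#"):
            # Record an identifier only on its first occurrence inside the range.
            if lower <= total <= upper and token not in first_seen:
                first_seen[token] = index
            continue
        total += unit_length(token)
        if total > upper:
            break  # past the range: the loop ends
    if not first_seen:
        return None
    # Pick the recorded identifier with the largest granularity.
    best = max(first_seen, key=lambda ident: int(ident[1:]))
    return first_seen[best]
```

With the range 5 to 10 and the example sequence of this section, only #0 and #1 fall inside the range, so the cut lands on the #1 after `yi3`, which agrees with the first sub-prosodic phoneme sequence given in the example.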
  • the prosodic phoneme sequence is segmented from the first segmentation position to generate the first sub-prosodic phoneme sequence.
  • The first segmentation position determined in this way is a position with a longer pause, so the pauses and rhythm of the resulting first sub-prosodic phoneme sequence are more natural, and the speech subsequently synthesized from it and output is more natural and smooth.
  • Step 1130: Segment the prosodic phoneme sequence based on the first segmentation position and the second segmentation position to generate at least a first sub-prosodic phoneme sequence and a second sub-prosodic phoneme sequence.
  • The first sub-prosodic phoneme sequence is generated based on the first segmentation position and is the part of the prosodic phoneme sequence located before the first segmentation position;
  • the second sub-prosodic phoneme sequence is the part of the prosodic phoneme sequence located after the first segmentation position.
  • The method may further include: determining a second segmentation position from the positions of the prosodic identifiers located after the first segmentation position in the prosodic phoneme sequence;
  • Step 1130 may include: segmenting the prosodic phoneme sequence based on the first segmentation position and the second segmentation position to generate a first sub-prosodic phoneme sequence and at least two second sub-prosodic phoneme sequences, where the at least two second sub-prosodic phoneme sequences are located after the first segmentation position in the prosodic phoneme sequence, and adjacent second sub-prosodic phoneme sequences are delimited by the second segmentation position.
  • The second segmentation position is the position of the segmentation point for every segmentation other than the first.
  • At least some of the prosodic identifiers after the first segmentation position are looked up in the prosodic phoneme sequence as a candidate set for determining the second segmentation position, and the positions of the prosodic identifiers in the candidate set are determined as the second segmentation positions.
  • the second segmentation position search module can be used to find the second segmentation position.
  • the first segmentation position and the prosodic phoneme sequence are input to the second segmentation position search module, and a segmentation point list output by the second segmentation position search module is obtained.
  • The segmentation point list includes the first segmentation position and the second segmentation position.
  • Determining the second segmentation position from the positions of the prosodic identifiers located after the first segmentation position may include: determining, as the second segmentation position, the position of each identifier characterizing an intonation phrase that is located after the first segmentation position in the prosodic phoneme sequence.
  • is the first segmentation position
  • are the second segmentation position.
  • the location of identifiers at other fine-grained levels can also be determined as the second segmentation location, which is not limited in this application.
  • the second sub-prosodic phoneme sequence is a prosodic phoneme sequence generated by segmenting the entire prosodic phoneme sequence located after the first segmentation position.
  • The second segmentation position is determined based on the first segmentation position and prosodic features, which improves the naturalness of the rhythm of the second sub-prosodic phoneme sequences generated by the subsequent segmentation and the balance between the two sides of each cut, avoids cutting in the middle of a whole word, and helps improve the efficiency and quality of subsequent speech synthesis.
  • When there is no second segmentation position, the second sub-prosodic phoneme sequence is the entire part of the prosodic phoneme sequence located after the first segmentation position.
  • For example, when the second segmentation position is defined as the position of #3 but no #3 can be found in the second sub-prosodic phoneme sequence, it can be understood that there is no second segmentation position.
  • first cutting position and the second cutting position determined in step 1120 are as follows:
  • The prosodic phoneme sequence is segmented in order, generating the first sub-prosodic phoneme sequence "sil mu4 #0 qian2 #1 xiao2 #0 jia3 #3 ke2 #0 yi3 #1" and the second sub-prosodic phoneme sequences "kong4 #0 zhi4 #1 re4 #0 shui3 #0 qi4 #1 kai1 #0 guan1 #3", "tiao2 #0 jie2 #1 wen1 #0 du4 #3", "ding4 #0 shi2 #1 kai1 #0 guan1 #3", and so on.
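A minimal sketch of this segmentation step, assuming the sequence is held as a token list, the first segmentation position is given as a token index, and every #3 after it is a second segmentation position. `split_sequence` is a hypothetical name; each delimiter stays with the unit that precedes it.

```python
from typing import List, Tuple


def split_sequence(
    tokens: List[str], first_cut: int, level: str = "#3"
) -> Tuple[List[str], List[List[str]]]:
    """Split into the first sub-sequence and the following sub-sequences."""
    first = tokens[: first_cut + 1]
    rest: List[List[str]] = []
    current: List[str] = []
    for token in tokens[first_cut + 1:]:
        current.append(token)
        if token == level:
            # Close the current unit at every identifier of the chosen level.
            rest.append(current)
            current = []
    if current:
        # No further cut point: the remainder forms the last unit.
        rest.append(current)
    return first, rest
```

If no identifier of the chosen level occurs after the first cut, `rest` holds a single unit containing everything after the first segmentation position.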
  • The speech synthesis time of the first sub-prosodic phoneme sequence "sil mu4 #0 qian2 #1 xiao2 #0 jia3 #3 ke2 #0 yi3 #1" is about 0.2 s.
  • The first segmentation position used to cut out the first sub-prosodic phoneme sequence is determined based on the speech synthesis duration of that sequence, so that this duration falls within a reasonable range, which shortens the first-sentence response time of the synthesis system.
  • Step 1110 may include: obtaining the sentence-end information, intonation phrases, prosodic phrases, prosodic words, and syllables of the target text; converting the target text into a phoneme sequence; generating multiple prosodic identifiers based on at least two of the sentence-end information, intonation phrases, prosodic phrases, prosodic words, and syllables; and marking the phoneme sequence with the multiple prosodic identifiers to generate a prosodic phoneme sequence.
  • The method may further include: performing speech synthesis on the first sub-prosodic phoneme sequence to generate a first speech; and outputting the first speech while performing speech synthesis on the second sub-prosodic phoneme sequence to generate a second speech.
  • The first sub-prosodic phoneme sequence is the sequence before the first segmentation point in the target text, that is, the sequence corresponding to the first sentence of the speech synthesized from the target text.
  • Speech synthesis can be performed on the first sub-prosodic phoneme sequence to generate the first speech.
  • The first speech is then output for the client to play. While the client plays the first speech, the system can synthesize the subsequent second sub-prosodic phoneme sequence to generate the second speech.
  • For example, the first speech "Currently, Xiao Jia can" is synthesized from the first sub-prosodic phoneme sequence and output; while the client plays the first speech, the system synthesizes the second sub-prosodic phoneme sequence "kong4 #0 zhi4 #1 re4 #0 shui3 #0 qi4 #1 kai1 #0 guan1 #3".
  • In the text segmentation method provided by the embodiments of the present application, the first speech corresponding to the first sub-prosodic phoneme sequence is synthesized first, and the subsequent second sub-prosodic phoneme sequence is synthesized while the first speech is being output, which speeds up the system's response after it receives a network speech synthesis service request and shortens the user's waiting time.
  • the device embodiments described above are only illustrative.
  • the units described as separate components may or may not be physically separated.
  • The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the solution of this embodiment. Persons of ordinary skill in the art can understand and implement the method without creative effort.
  • Each embodiment can be implemented by software plus a necessary general hardware platform, and of course it can also be implemented by hardware.
  • The computer software product can be stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk, or an optical disk, and includes a number of instructions that cause a computer device (which can be a personal computer, a server, a network device, etc.) to execute the methods of the embodiments or of certain parts of the embodiments.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

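The present application relates to the field of speech synthesis and provides a speech synthesis method, including: segmenting a prosodic phoneme sequence of a target text to generate multiple clause sequences, where the prosodic phoneme sequence includes multiple phonemes corresponding to the target text and prosodic identifiers located between adjacent phonemes, and each clause sequence includes at least one phoneme; performing speech synthesis on a first sub-prosodic phoneme sequence among the multiple clause sequences to obtain first speech information; and outputting the first speech information while performing speech synthesis on a second sub-prosodic phoneme sequence among the multiple clause sequences to generate second speech information, where the second sub-prosodic phoneme sequence is at least one clause sequence located after the first sub-prosodic phoneme sequence in the prosodic phoneme sequence. The speech synthesis method of the present application effectively speeds up the system's response after it receives a network speech synthesis service request and shortens the user's waiting time.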

Description

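Speech synthesis method and apparatus
Cross-reference to related applications
This application claims priority to Chinese Patent Application No. 202210344448X, titled "Speech synthesis method and speech synthesis apparatus"; Chinese Patent Application No. 2022103461146, titled "Speech splicing method and speech splicing apparatus"; Chinese Patent Application No. 2022103460976, titled "Method and apparatus for obtaining the size of an audio/video file"; Chinese Patent Application No. 2022103460942, titled "Text transcription method and text transcription apparatus"; and Chinese Patent Application No. 2022103444564, titled "Text segmentation method and text segmentation apparatus", each filed with the China Intellectual Property Office on March 31, 2022, the entire disclosures of which are incorporated herein by reference.
Technical field
This application relates to the technical field of speech synthesis, and in particular to a speech synthesis method and speech synthesis apparatus, a speech splicing method and speech splicing apparatus, a method and apparatus for obtaining the size of an audio/video file, a text transcription method and text transcription apparatus, and a text segmentation method and text segmentation apparatus.
Background
Text-to-speech (TTS) technology is widely used in the field of speech synthesis. In the related art, speech synthesis is usually performed directly on the entire text to be synthesized. For longer texts, synthesis takes more time, which means the user must wait longer to obtain the synthesized speech. Speech synthesis performance is therefore low, which wastes the user's time and degrades the user experience.
Summary
This application aims to solve at least one of the technical problems in the prior art. To this end, this application proposes a speech synthesis method and a speech synthesis apparatus.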
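According to a first aspect of the present application, a speech synthesis method is provided, including: segmenting a prosodic phoneme sequence of a target text to generate multiple clause sequences, where the prosodic phoneme sequence includes multiple phonemes corresponding to the target text and prosodic identifiers located between adjacent phonemes, and each clause sequence includes at least one phoneme; performing speech synthesis on a first sub-prosodic phoneme sequence among the multiple clause sequences to obtain first speech information; and outputting the first speech information while performing speech synthesis on a second sub-prosodic phoneme sequence among the multiple clause sequences to generate second speech information, where the second sub-prosodic phoneme sequence is at least one clause sequence located after the first sub-prosodic phoneme sequence in the prosodic phoneme sequence.
According to a second aspect of the present application, a speech synthesis apparatus is provided, including: a first processing module, configured to segment a prosodic phoneme sequence of a target text to generate multiple clause sequences, where the prosodic phoneme sequence includes multiple phonemes corresponding to the target text and prosodic identifiers located between adjacent phonemes, and each clause sequence includes at least one phoneme; a second processing module, configured to perform speech synthesis on a first sub-prosodic phoneme sequence among the multiple clause sequences to obtain first speech information; and a third processing module, configured to output the first speech information while performing speech synthesis on a second sub-prosodic phoneme sequence among the multiple clause sequences to generate second speech information, where the second sub-prosodic phoneme sequence is at least one clause sequence located after the first sub-prosodic phoneme sequence in the prosodic phoneme sequence.
According to a third aspect of the present application, an electronic device is provided, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the speech synthesis method described above when executing the computer program.
According to a fourth aspect of the present application, a non-transitory computer-readable storage medium is provided, on which a computer program is stored, where the computer program implements the speech synthesis method described above when executed by a processor.
According to a fifth aspect of the present application, a computer program product is provided, including a computer program that implements the speech synthesis method described above when executed by a processor.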
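Brief description of the drawings
To describe the technical solutions in the embodiments of the present application or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are merely some embodiments of the present application; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Figure 1 is the first schematic flowchart of the speech synthesis method according to an embodiment of the present application;
Figure 2 is the second schematic flowchart of the speech synthesis method according to an embodiment of the present application;
Figure 3 is a schematic structural diagram of the speech synthesis apparatus according to an embodiment of the present application;
Figure 4 is a schematic structural diagram of the electronic device according to an embodiment of the present application;
Figure 5 is the first schematic flowchart of the speech splicing method according to an embodiment of the present application;
Figure 6 is the second schematic flowchart of the speech splicing method according to an embodiment of the present application;
Figure 7 is the first schematic flowchart of the method for obtaining the size of an audio/video file according to an embodiment of the present application;
Figure 8 is the second schematic flowchart of the method for obtaining the size of an audio/video file according to an embodiment of the present application;
Figure 9 is the first schematic flowchart of the text transcription method according to an embodiment of the present application;
Figure 10 is the second schematic flowchart of the text transcription method according to an embodiment of the present application;
Figure 11 is the first schematic flowchart of the text segmentation method according to an embodiment of the present application;
Figure 12 is the second schematic flowchart of the text segmentation method according to an embodiment of the present application;
Figure 13 is the third schematic flowchart of the text segmentation method according to an embodiment of the present application.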
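Detailed description
The implementations of the present application are described in further detail below with reference to the drawings and embodiments. The following embodiments are used to illustrate the present application but not to limit its scope.
In this specification, a description referring to the terms "one embodiment", "some embodiments", "example", "specific example", or "some examples" means that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the present application. Such expressions do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials, or characteristics described may be combined in a suitable manner in any one or more embodiments or examples. In addition, provided they do not contradict each other, a person skilled in the art may combine the different embodiments or examples described in this specification and the features of those embodiments or examples.
The speech synthesis method of the embodiments of the present application is described below with reference to Figures 1 and 2.
The method may be executed by a speech synthesis apparatus, a server, or a user's terminal, including but not limited to mobile phones, tablet computers, PCs, in-vehicle terminals, and smart home appliances.
As shown in Figure 1, the speech synthesis method includes step 110, step 120 and step 130.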
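Step 110: Segment the prosodic phoneme sequence of the target text to generate multiple clause sequences.
In this step, the target text is the text currently used for speech synthesis.
The prosodic phoneme sequence is a sequence that characterizes the prosodic features and phoneme features of the target text.
The prosodic phoneme sequence includes multiple phonemes corresponding to the target text and prosodic identifiers located between adjacent phonemes.
A phoneme may be a combination of one or more speech units divided according to the natural attributes of speech; a speech unit may be the pinyin, initial, or final corresponding to a Chinese character, or an English word, English phonetic symbol, or English letter.
A prosodic identifier is an identifier that characterizes the prosodic feature corresponding to each phoneme in the target text. Prosodic features include, but are not limited to, the tone, syllable, prosodic word, prosodic phrase, intonation phrase, silence, and pause corresponding to a phoneme.
The granularity of the identifier characterizing a pause is higher than that of the identifier characterizing an intonation phrase, which is higher than that for a prosodic phrase, which is higher than that for a prosodic word, which is in turn higher than that for a syllable.
In practice, different symbols can be used to represent prosodic features of different granularity levels.
For example, the target text "上海市今天阴转多云东南风三到四级" ("Shanghai, overcast turning cloudy today, southeast wind force 3 to 4") can be converted into the prosodic phoneme sequence: sil shang4 #0 hai3 #0 shi4 #2 jin1 #0 tian1 #2 yin1 #0 zhuan3 #1 duo1 #0 yun2 #3 dong1 #0 nan2 #0 feng1 #2 san1 #0 dao4 #1 si4 #0 ji2 #4 sil.
For this prosodic phoneme sequence, the prosodic identifiers may include the digits, symbols, and English phonemes between adjacent pinyin items, and the phonemes may include the pinyin corresponding to each Chinese character.
Here, sil denotes the silence at the beginning and end of the sentence; #0 denotes a syllable, #1 a prosodic word, #2 a prosodic phrase, #3 an intonation phrase, and #4 the end of a sentence. The digit after each phoneme denotes its tone; for example, the 4 in shang4 indicates that the pinyin "shang" carries the fourth tone.
A complete prosodic phoneme sequence is formed by connecting successive clause sequences.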
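In some embodiments, step 110 may include: converting the target text into a prosodic phoneme sequence; and segmenting the prosodic phoneme sequence based on at least some of the multiple prosodic identifiers to generate multiple clause sequences.
In this embodiment, the target text is the text currently used for speech synthesis.
A complete prosodic phoneme sequence includes multiple phonemes and multiple prosodic identifiers, and the prosodic identifiers include identifiers of different granularity levels.
In practice, a suitable granularity level can be chosen as the segmentation criterion according to the actual situation, and the positions in the prosodic phoneme sequence of the prosodic identifiers of that level are used as segmentation points to obtain multiple clause sequences.
Note that each clause sequence includes the prosodic identifier at the segmentation point and at least one phoneme.
Each prosodic phoneme sequence has at least one segmentation point, so at least two clause sequences can be obtained.
For example, for the prosodic phoneme sequence "sil shang4 #0 hai3 #0 shi4 #2 jin1 #0 tian1 #2 yin1 #0 zhuan3 #1 duo1 #0 yun2 #3 dong1 #0 nan2 #0 feng1 #2 san1 #0 dao4 #1 si4 #0 ji2 #4 sil", if segmentation at #3 is chosen based on actual needs, the sequence is cut at each position containing #3, and the prosodic delimiter #3 is kept with the preceding splicing unit, so that the sequence is segmented into the following clause sequences:
Clause sequence 1: sil shang4 #0 hai3 #0 shi4 #2 jin1 #0 tian1 #2 yin1 #0 zhuan3 #1 duo1 #0 yun2 #3;
Clause sequence 2: dong1 #0 nan2 #0 feng1 #2 san1 #0 dao4 #1 si4 #0 ji2 #4 sil.
How segmentation points are determined is explained in later embodiments and is not detailed here.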
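Step 120: Perform speech synthesis on the first sub-prosodic phoneme sequence among the multiple clause sequences to obtain first speech information.
In this step, the first sub-prosodic phoneme sequence is the part of the prosodic phoneme sequence before its first segmentation point.
Continuing with clause sequence 1 and clause sequence 2 from the embodiment above, the first sub-prosodic phoneme sequence is clause sequence 1: sil shang4 #0 hai3 #0 shi4 #2 jin1 #0 tian1 #2 yin1 #0 zhuan3 #1 duo1 #0 yun2 #3.
In practice, a vocoder can be used to synthesize the first sub-prosodic phoneme sequence into the corresponding first speech information.
Step 130: Output the first speech information while performing speech synthesis on the second sub-prosodic phoneme sequence among the multiple clause sequences to generate second speech information, where the second sub-prosodic phoneme sequence is at least one clause sequence located after the first sub-prosodic phoneme sequence in the prosodic phoneme sequence.
In this step, after the first speech information is generated, it is returned to the client for output so that the user can play it.
While the first speech information is being output, the backend continues to synthesize the second sub-prosodic phoneme sequence to generate the corresponding second speech information.
The second sub-prosodic phoneme sequence is at least one clause sequence located after the first sub-prosodic phoneme sequence in the prosodic phoneme sequence. For clause sequence 1 and clause sequence 2 in the embodiment above, the second sub-prosodic phoneme sequence is clause sequence 2: dong1 #0 nan2 #0 feng1 #2 san1 #0 dao4 #1 si4 #0 ji2 #4 sil.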
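In other embodiments, the method may further include: when any clause sequence to be matched among the multiple clause sequences matches a cached target clause sequence, obtaining from the cache the target clause speech corresponding to the target clause sequence and using it as the speech for the clause sequence to be matched; and when a clause sequence to be matched does not match any cached target clause sequence, performing speech synthesis on it to generate second clause speech.
For example, the first or second sub-prosodic phoneme sequence is matched against multiple pre-cached clause sequences, and the pre-generated, cached speech corresponding to the matched clause sequence is retrieved as the synthesized speech for that sub-prosodic phoneme sequence. Because the speech is first looked up in the cache rather than synthesized in real time, the efficiency of speech synthesis improves.
In some embodiments, the method may further include: segmenting the prosodic phoneme sequence of the target text to generate multiple candidate sequences; and combining a target candidate sequence among the multiple candidate sequences with its adjacent candidate sequences to generate multiple clause sequences and the granularity size corresponding to each clause sequence.
For example, segmenting the prosodic phoneme sequence "sil xi1 #0 wang4 #1 zhe4 #0 shou3 #0 ge1 #3| neng2 #0 rang4 #1 ni2 #1 xi3 #0 huan1 #1 | wei4 #0 nin2 #1 bo1 #0 fang4 #1 | EH1 K S #10 EH1 K S #0 de5 #1 EH1 K S #4 sil" of the sentence "希望这首歌|能让你喜欢|为您播放|XX" yields multiple candidate sequences: "xi1 #0 wang4 #1 zhe4 #0 shou3 #0 ge1 #3", "neng2 #0 rang4 #1 ni2 #1 xi3 #0 huan1 #1", "wei4 #0 nin2 #1 bo1 #0 fang4 #1" and "EH1 K S #10 EH1 K S #0 de5 #1 EH1 K S #4".
Each candidate sequence is taken in turn as the target candidate sequence and combined with the adjacent candidate sequences, which gives the following groupings into clause sequences (the text between two "|" marks is one clause sequence):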
“sil xi1 #0 wang4 #1 zhe4 #0 shou3 #0 ge1 #3 | neng2 #0 rang4 #1 ni2 #1 xi3 #0 huan1 #1 | wei4 #0 nin2 #1 bo1 #0 fang4 #1 | EH1 K S #10 EH1 K S #0 de5 #1 EH1 K S #4 sil”
“sil xi1 #0 wang4 #1 zhe4 #0 shou3 #0 ge1 #3 neng2 #0 rang4 #1 ni2 #1 xi3 #0 huan1 #1 | wei4 #0 nin2 #1 bo1 #0 fang4 #1 | EH1 K S #10 EH1 K S #0 de5 #1 EH1 K S #4 sil”
“sil xi1 #0 wang4 #1 zhe4 #0 shou3 #0 ge1 #3 | neng2 #0 rang4 #1 ni2 #1 xi3 #0 huan1 #1 wei4 #0 nin2 #1 bo1 #0 fang4 #1 | EH1 K S #10 EH1 K S #0 de5 #1 EH1 K S #4 sil”
“sil xi1 #0 wang4 #1 zhe4 #0 shou3 #0 ge1 #3 neng2 #0 rang4 #1 ni2 #1 xi3 #0 huan1 #1 wei4 #0 nin2 #1 bo1 #0 fang4 #1 | EH1 K S #10 EH1 K S #0 de5 #1 EH1 K S #4 sil”
“sil xi1 #0 wang4 #1 zhe4 #0 shou3 #0 ge1 #3 | neng2 #0 rang4 #1 ni2 #1 xi3 #0 huan1 #1 | wei4 #0 nin2 #1 bo1 #0 fang4 #1 EH1 K S #10 EH1 K S #0 de5 #1 EH1 K S #4 sil”
“sil xi1 #0 wang4 #1 zhe4 #0 shou3 #0 ge1 #3 | neng2 #0 rang4 #1 ni2 #1 xi3 #0 huan1 #1 wei4 #0 nin2 #1 bo1 #0 fang4 #1 EH1 K S #10 EH1 K S #0 de5 #1 EH1 K S #4 sil”
“sil xi1 #0 wang4 #1 zhe4 #0 shou3 #0 ge1 #3 neng2 #0 rang4 #1 ni2 #1 xi3 #0 huan1 #1  wei4 #0 nin2 #1 bo1 #0 fang4 #1 EH1 K S #10 EH1 K S #0 de5 #1 EH1 K S #4 sil”
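The clause sequences are sorted in descending order of granularity size. Specifically, the larger the granularity, the earlier the clause sequence is placed; for example, "xi1 #0 wang4 #1 zhe4 #0 shou3 #0 ge1 #3 neng2 #0 rang4 #1 ni2 #1 xi3 #0 huan1 #1 wei4 #0 nin2 #1 bo1 #0 fang4 #1 EH1 K S #10 EH1 K S #0 de5 #1 EH1 K S #4 (希望这首歌能让你喜欢为您播放XX的X)" is placed before "xi1 #0 wang4 #1 zhe4 #0 shou3 #0 ge1 #3 neng2 #0 rang4 #1 ni2 #1 xi3 #0 huan1 #1 wei4 #0 nin2 #1 bo1 #0 fang4 #1 (希望这首歌能让你喜欢为您播放)".
In some embodiments, the clause sequences can also be matched, from first to last in the descending order, against the target clause sequences in the cache, and the speech of a successfully matched target clause sequence is used as the speech of the clause sequence.
In this embodiment, following the descending order, the clause sequences are matched exactly against the target clause sequences one by one. For example, the clause sequence "xi1 #0 wang4 #1 zhe4 #0 shou3 #0 ge1 #3 neng2 #0 rang4 #1 ni2 #1 xi3 #0 huan1 #1 wei4 #0 nin2 #1 bo1 #0 fang4 #1 EH1 K S #10 EH1 K S #0 de5 #1 EH1 K S #4 (希望这首歌能让你喜欢为您播放XX的X)" is matched first. If the match succeeds, this clause sequence is determined as the first clause sequence, the target clause speech of the matching target clause is used as the speech of the first clause sequence, and the comparison ends.
If the match fails, the clause sequence "xi1 #0 wang4 #1 zhe4 #0 shou3 #0 ge1 #3 neng2 #0 rang4 #1 ni2 #1 xi3 #0 huan1 #1 wei4 #0 nin2 #1 bo1 #0 fang4 #1 (希望这首歌能让你喜欢为您播放)" is compared with the target clauses next, and the process repeats until some clause sequence matches a target clause exactly, at which point the comparison ends.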
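The descending-order matching can be sketched as a longest-first lookup. The sketch assumes that the combined clause sequences all start at the first candidate, as in the example above, that the cache is a dictionary keyed by the exact joined sequence, and that audio is modelled as bytes; `longest_cached_prefix` is a hypothetical name.

```python
from typing import Dict, List, Optional, Tuple


def longest_cached_prefix(
    candidates: List[str], cache: Dict[str, bytes]
) -> Tuple[int, Optional[bytes]]:
    """Try the largest combination first; return (candidates covered, cached audio)."""
    for end in range(len(candidates), 0, -1):
        key = " ".join(candidates[:end])
        if key in cache:
            # Exact match found: stop comparing.
            return end, cache[key]
    return 0, None
```

A return value of `(0, None)` means no combination matched, in which case the clause sequence would have to be synthesized.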
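During research and development, the applicant found that in the related art speech synthesis is usually performed directly on the entire text to be synthesized. Because speech has a temporal nature, the conversion time is generally proportional to the length of the input text: the longer the sentence, the longer synthesis takes.
For inputs of tens to hundreds of characters, even the fastest current deep learning models take from several seconds to tens of seconds, and longer texts take longer still. The user must therefore wait a long time to obtain the synthesized speech, which wastes the user's time and degrades the user experience.
The applicant also found that, to address this problem, the related art includes an approach that segments the text to be synthesized into multiple sub-texts and synthesizes them in parallel. However, that approach is limited to GPU operation; on CPU servers it does not improve synthesis performance and still takes a great deal of time.
In the present application, the target text is segmented into multiple clause sequences, the first sub-prosodic phoneme sequence is synthesized first and the resulting first speech information is output first, and the clause sequences after the first sub-prosodic phoneme sequence are synthesized while the first speech information is being output. This effectively speeds up the system's response after it receives a network speech synthesis service request, shortens the user's waiting time, and helps improve the user experience.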
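In practice, performing speech synthesis on the second sub-prosodic phoneme sequence among the multiple clause sequences may mean synthesizing the clause sequences one after another in the order in which they were segmented from the target text.
For example, after the target text is segmented in order into a first, second, and third clause sequence, the first clause sequence is the first sub-prosodic phoneme sequence and is synthesized first to generate the first speech information; while the first speech information is being output, the second clause sequence is synthesized, and after its second speech information has been generated, the third clause sequence is synthesized.
Performing speech synthesis on the second sub-prosodic phoneme sequence may also mean synthesizing the clause sequences simultaneously.
For example, after the target text is segmented in order into a first, second, and third clause sequence, the first clause sequence is the first sub-prosodic phoneme sequence and is synthesized first to generate the first speech information; while the first speech information is being output, the second and third clause sequences are synthesized in parallel using the system's parallel synthesis capability.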
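Both variants can be sketched with a single generator. This is a rough illustration under assumptions: `synthesize` is a hypothetical callable returning audio bytes, a thread pool stands in for the system's background synthesis, and yielding a chunk stands in for returning it to the client.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterator, List


def stream_speech(
    clauses: List[str],
    synthesize: Callable[[str], bytes],
    workers: int = 1,
) -> Iterator[bytes]:
    """Yield audio clause by clause, starting background work before the first yield."""
    if not clauses:
        return
    # The first clause sequence is synthesized with priority.
    first_audio = synthesize(clauses[0])
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Remaining clauses are queued before the first audio is handed out,
        # so they are synthesized while the caller plays it.
        pending = [pool.submit(synthesize, clause) for clause in clauses[1:]]
        yield first_audio
        for future in pending:
            # Results are yielded in segmentation order.
            yield future.result()
```

With `workers=1` the later clause sequences are synthesized one after another; a larger value approximates the parallel variant.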
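According to the speech synthesis method provided by the embodiments of the present application, the target text is segmented into multiple clause sequences, the first clause sequence is synthesized first to generate the first speech information, and the subsequent clause sequences are synthesized while the first speech information is being output. This effectively speeds up the system's response after it receives a network speech synthesis service request and shortens the user's waiting time, which helps improve the user experience.
As shown in Figure 2, according to some embodiments of the present application, before step 110 the method may further include: obtaining the text to be synthesized; and, when the size of the text to be synthesized exceeds a target threshold, segmenting the text to be synthesized to generate the target text, where the size of the target text is not greater than the target threshold.
In this embodiment, the text to be synthesized is the original text on which speech synthesis is required.
The text to be synthesized may be ordinary text of tens to hundreds of characters, or very long text of thousands or tens of thousands of characters.
The target threshold may be determined based on at least one of the system's computing power and the upper capability limit of the speech synthesis model; for example, it may be set in the range of a few hundred characters.
In practice, the size of the obtained text to be synthesized is determined first and compared with the target threshold. When the size does not exceed the target threshold, the entire text to be synthesized is used directly as the target text.
When the size exceeds the target threshold, the text to be synthesized is first segmented, based on at least one of the system's computing power and the capability information of the speech synthesis model, into multiple first texts, each no larger than the target threshold, and the first of these first texts is determined as the target text.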
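The threshold check can be sketched as below. The source does not say how the over-long text is cut, so fixed-size character chunks are used here purely as a placeholder assumption; a real system would presumably cut at sentence or prosodic boundaries.

```python
from typing import List


def split_by_threshold(text: str, threshold: int) -> List[str]:
    """Return pieces no longer than `threshold` characters (placeholder strategy)."""
    if threshold <= 0:
        raise ValueError("threshold must be positive")
    if len(text) <= threshold:
        # Short enough: the whole text becomes the target text.
        return [text]
    return [text[i:i + threshold] for i in range(0, len(text), threshold)]
```

The first element of the returned list would play the role of the target text.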
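According to the speech synthesis method provided by the embodiments of the present application, segmenting the text to be synthesized based on the target threshold to generate the target text takes the actual capability of the server into account, so that the target text submitted for synthesis stays within the server's processing capacity, which improves speech synthesis performance.
In some embodiments, step 110 may include: obtaining the sentence-end information, intonation phrases, prosodic phrases, prosodic words, and syllables of the target text; and marking the target text based on at least two of the sentence-end information, intonation phrases, prosodic phrases, prosodic words, and syllables to generate the prosodic phoneme sequence.
In this embodiment, a syllable is a speech unit in the speech stream and the unit most easily distinguished by ear; for example, a syllable may be each Chinese character in the target text.
A prosodic word is a group of syllables that are closely linked in the actual speech stream and pronounced together.
A prosodic phrase is a medium-sized rhythmic chunk between the prosodic word and the intonation phrase. It may include multiple prosodic words and modal particles, and the prosodic words forming it sound as if they share one rhythm group.
An intonation phrase is a sentence formed by connecting multiple prosodic phrases according to a certain intonation pattern, and it marks a larger pause.
Sentence-end information marks the end of each long sentence.
For example, for the target text "上海市今天阴转多云东南风三到四级", each Chinese character such as "上", "海" and "市" is a syllable of the target text; words or phrases composed of words such as "上海市", "今天" and "阴转多云" are prosodic phrases of the target text; and the sentence "上海市今天阴转多云", composed of the prosodic phrases "上海市", "今天" and "阴转多云", is an intonation phrase of the target text.
After the sentence-end information, intonation phrases, prosodic phrases, prosodic words, and syllables of the target text are obtained, the target text is marked based on at least two of them to generate the prosodic phoneme sequence.
During research and development, the applicant found that the related art often uses the punctuation marks in a sentence to represent its prosody, for example by cutting the sentence at commas or periods to obtain multiple clauses. On the one hand, this approach cannot segment text without punctuation; on the other hand, it leaves the two sides of a cut unbalanced, so the segmentation result is poor.
In the present application, at least two of sentence-end information, intonation phrases, prosodic phrases, prosodic words, and syllables are used to represent the prosody of a sentence, and the target text is segmented on that basis. A cut never falls in the middle of a whole word, so the pauses and rhythm of the resulting clauses are natural.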
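In some embodiments, marking the target text based on at least two of the sentence-end information, intonation phrases, prosodic phrases, prosodic words, and syllables to generate the prosodic phoneme sequence includes: converting the target text into a phoneme sequence; generating multiple prosodic identifiers based on at least two of the sentence-end information, intonation phrases, prosodic phrases, prosodic words, and syllables; and marking the phoneme sequence with the multiple prosodic identifiers to generate the prosodic phoneme sequence.
In this embodiment, the phoneme sequence is the sequence formed by connecting the pronunciation marks, including pinyin, tone, or English phonetic notation, of each Chinese character or English item in the target text.
For example, the target text "上海市今天阴转多云东南风三到四级" can be converted into the phoneme sequence: shang4 hai3 shi4 jin1 tian1 yin1 zhuan3 duo1 yun2 dong1 nan2 feng1 san1 dao4 si4 ji2.
A prosodic identifier is an identifier that characterizes the prosodic feature corresponding to each phoneme in the target text; that is, a prosodic identifier is a symbol that represents sentence-end information, an intonation phrase, a prosodic phrase, a prosodic word, or a syllable.
In practice, a prosodic identifier can be written as a combination of special symbols and digits or as a specific combination of letters, for example "#0", "#1", "#2", "#3" and "#4"; different combinations represent different granularity levels.
For example, #0 denotes a syllable, #1 a prosodic word, #2 a prosodic phrase, #3 an intonation phrase, and #4 the end of a sentence. In this embodiment, the granularity from smallest to largest is: #0 < #1 < #2 < #3 < #4.
After the phoneme sequence and the prosodic identifiers of the target text are obtained, the prosodic identifiers are inserted at the corresponding positions in the phoneme sequence. For example, the identifier #0 for a syllable is inserted after the pinyin of each syllable, and the identifier #2 for a prosodic phrase is inserted after each prosodic phrase, which converts the phoneme sequence into the prosodic phoneme sequence.
For example, marking the phoneme sequence "shang4 hai3 shi4 jin1 tian1 yin1 zhuan3 duo1 yun2 dong1 nan2 feng1 san1 dao4 si4 ji2" with "#0", "#1", "#2", "#3" and "#4" generates the prosodic phoneme sequence: sil shang4 #0 hai3 #0 shi4 #2 jin1 #0 tian1 #2 yin1 #0 zhuan3 #1 duo1 #0 yun2 #3 dong1 #0 nan2 #0 feng1 #2 san1 #0 dao4 #1 si4 #0 ji2 #4 sil.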
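The marking step can be sketched as follows, assuming the prosody analysis has already produced one boundary level (0 to 4) per phoneme. `mark_prosody` and its list-based inputs are hypothetical; how the levels are predicted is outside this sketch.

```python
from typing import List


def mark_prosody(phonemes: List[str], levels: List[int]) -> str:
    """Interleave phonemes with '#level' identifiers and wrap the result in sil."""
    if len(phonemes) != len(levels):
        raise ValueError("one boundary level is required per phoneme")
    tokens = ["sil"]
    for phoneme, level in zip(phonemes, levels):
        # Each phoneme is followed by the identifier of the boundary after it.
        tokens.extend([phoneme, f"#{level}"])
    tokens.append("sil")
    return " ".join(tokens)
```

For the first three phonemes of the example, `mark_prosody(["shang4", "hai3", "shi4"], [0, 0, 2])` gives `sil shang4 #0 hai3 #0 shi4 #2 sil`.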
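Here, sil denotes the silence at the beginning and end of the sentence.
According to the speech synthesis method provided by the embodiments of the present application, the target text is converted into a phoneme sequence, and the phoneme sequence is marked with the prosodic identifiers corresponding to at least two of the sentence-end information, intonation phrases, prosodic phrases, prosodic words, and syllables to generate the prosodic phoneme sequence. This provides a finer prosodic representation, which helps improve the fineness and accuracy of the subsequent segmentation.
Continuing to refer to Figure 2, according to some embodiments of the present application, the method may further include: generating a target file size of third speech information based on the prosodic phoneme sequence;
Step 130 may include: generating the second speech information based on the target file size.
In this embodiment, the third speech information is the speech information generated by combining the speech information synthesized from at least two of the multiple clause sequences corresponding to the target text, where one of the at least two clause sequences is the first sub-prosodic phoneme sequence.
The target file size is the predicted file size of the third speech information.
The target file size may be file volume information or the speech length of the third speech information, which is not limited in this application.
After the target file size of the third speech information is obtained, the speech data generated from the second sub-prosodic phoneme sequence is padded based on the target file size, which generates the second speech information.
In some embodiments, generating the target file size of the third speech information based on the prosodic phoneme sequence may include: generating a predicted file size of the third speech information based on the prosodic phoneme sequence; and correcting the predicted file size with a target residual value to generate the target file size.
In this embodiment, the predicted file size is the initial, uncorrected file size of the speech synthesized from the target text, as predicted from the prosodic phoneme sequence.
The target residual value is used to correct the predicted file size in order to improve the accuracy of the final target file size.
The target residual value is determined based on the sample file size and the predicted size of the sample audio file corresponding to a sample text, where the sample file size is the actual size of the sample audio file corresponding to the sample text. The target file size is the corrected file size of the speech synthesized from the target text, as predicted from the prosodic phoneme sequence. The target file size is more accurate than the predicted file size.
The target residual value is a predetermined value; for example, it may be the maximum residual value.
In this embodiment, the predicted file size is corrected by supplementing it with the residual, which improves the accuracy of the final target file size.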
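In some embodiments, the target residual value can be determined through the following steps:
obtaining a sample text, a sample audio file corresponding to the sample text, and the sample file size of that sample audio file, where the sample audio file is generated by performing speech synthesis on the sample text;
converting the sample text into a sample prosodic phoneme sequence;
predicting the size of the sample audio file based on the sample prosodic phoneme sequence to generate a sample predicted file size of the sample audio file;
determining the absolute value of the difference between the sample file size and the sample predicted file size as the target residual value.
In this embodiment, the sample text may be ordinary text of tens to hundreds of characters, or very long text of thousands or tens of thousands of characters.
The sample audio file is the audio file finally generated by performing speech synthesis on the sample text.
The sample file size is the actual size or actual audio duration of the sample audio file.
For example, a speech synthesis system can be used to compute the true wav file size or audio duration of the sample audio file corresponding to the sample text.
The sample predicted file size is the predicted, uncorrected size or audio duration of the sample audio file.
Note that the sample predicted file size should be generated in the same way as the predicted file size.
The absolute value of the sample predicted file size minus the sample file size is computed and used as the target residual value.
In practice, the sample prosodic phoneme sequence can be predicted multiple times to obtain multiple sample predicted file sizes. The difference between each sample predicted file size and the sample file size is computed to obtain multiple candidate differences; the absolute value of the minimum candidate difference is then determined as the target residual value, which improves the accuracy of the target residual value.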
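A worked sketch of the residual computation described above. The difference is taken as predicted minus actual, as stated in the text; adding the residual to the prediction is an assumption based on the description of the correction as supplementing the prediction with the residual. Both function names are hypothetical.

```python
from typing import List


def target_residual(sample_size: int, predicted_sizes: List[int]) -> int:
    """Absolute value of the smallest (predicted - actual) candidate difference."""
    differences = [predicted - sample_size for predicted in predicted_sizes]
    return abs(min(differences))


def corrected_size(predicted_size: int, residual: int) -> int:
    """Supplement the raw prediction with the residual (assumed to be additive)."""
    return predicted_size + residual
```

For a sample file of 1000 bytes and predictions of 980, 1010 and 950 bytes, the candidate differences are -20, 10 and -50, so the residual is 50; a later raw prediction of 2000 bytes would then be corrected to 2050 bytes.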
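According to the speech synthesis method provided by the embodiments of the present application, the size of the target audio file synthesized from the target text is predicted from the prosodic phoneme sequence and the prediction is corrected with the target residual value, so the size of the target file can be predicted before the target audio file is generated, with high accuracy and precision.
Continuing to refer to Figure 2, in some embodiments, step 110 may include:
determining a first segmentation position in the prosodic phoneme sequence based on the multiple prosodic identifiers;
determining a second segmentation position from the positions, in the prosodic phoneme sequence, of the prosodic identifiers located after the first segmentation position;
segmenting the prosodic phoneme sequence based on the first segmentation position and the second segmentation position to generate a first sub-prosodic phoneme sequence and at least two second sub-prosodic phoneme sequences, where the first sub-prosodic phoneme sequence is the part of the prosodic phoneme sequence located before the first segmentation position, the at least two second sub-prosodic phoneme sequences are located after the first segmentation position, adjacent second sub-prosodic phoneme sequences are delimited by the second segmentation position, and the speech synthesis duration of the first sub-prosodic phoneme sequence is within the target duration.
In this embodiment, the first segmentation position is the segmentation point used for the first segmentation.
The second segmentation position is the position of the segmentation point for every segmentation other than the first.
Based on the first segmentation position, the prosodic phoneme sequence can be split into two subsequences, and the subsequence before the first segmentation position is determined as the first sub-prosodic phoneme sequence.
Note that the speech synthesis duration of the first sub-prosodic phoneme sequence generated from the first segmentation position is within the target duration.
The speech synthesis duration of the first sub-prosodic phoneme sequence is the time it takes to synthesize that sequence into speech.
The speech synthesis duration is related to the computing power of the speech synthesis system.
The target duration is a short duration. Its value can be set by the user, or the system default can be used; for example, the target duration can be set to 0.2 s or 0.3 s.
After the first segmentation position is determined, at least some of the prosodic identifiers after the first segmentation position are looked up in the prosodic phoneme sequence as a candidate set for determining the second segmentation position, and the positions of the prosodic identifiers in the candidate set are determined as the second segmentation positions.
In other embodiments, when there is no second segmentation position, the second sub-prosodic phoneme sequence is the entire part of the prosodic phoneme sequence located after the first segmentation position. For example, when the second segmentation position is defined as the position of #3 but no #3 can be found in the second sub-prosodic phoneme sequence, it can be understood that there is no second segmentation position.
In this embodiment, the first segmentation position is determined from the prosodic identifiers in the prosodic phoneme sequence, so that the speech synthesis duration of the resulting first sub-prosodic phoneme sequence falls within a reasonable range, which shortens the first-sentence response time of the synthesis system and reduces latency. In addition, the first segmentation position determined in this way is a position with a longer pause, so the pauses and rhythm of the resulting first sub-prosodic phoneme sequence are more natural, and the speech subsequently synthesized from it and output is more natural and smooth.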
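Continuing to refer to Figure 2, according to some embodiments of the present application, after step 130 the method may further include: merging the first speech information and the second speech information to generate the third speech information.
In this embodiment, the second speech information is the speech information obtained by performing speech synthesis on the second sub-prosodic phoneme sequence, where the second sub-prosodic phoneme sequence may be one or more clause sequences, all located after the first sub-prosodic phoneme sequence in the target text.
For example, when the second speech information is synthesized sequentially, the second clause sequence, which is adjacent to and located after the first sub-prosodic phoneme sequence, can be synthesized while the first speech information is being output, generating the second speech information of the second clause sequence; while the second speech information is being output, the first speech information and the second speech information are merged to generate the third speech information.
As another example of sequential synthesis, the second sub-prosodic phoneme sequence, which is adjacent to and located after the first sub-prosodic phoneme sequence, can be synthesized while the first speech information is being output, generating second speech information; while that second speech information is being output, the third clause sequence, which is adjacent to and located after the second clause sequence, can be synthesized to generate its second speech information; the subsequent clause sequences are synthesized in the same way while the preceding second speech information is being output, until the second speech information of all clauses has been generated, after which the first speech information and the second speech information of all clauses are combined to generate the third speech information.
When the second speech information is synthesized in parallel, the multiple clause sequences located after the first sub-prosodic phoneme sequence can be synthesized in parallel while the first speech information is being output, generating the second speech information of each clause sequence; the first speech information and the resulting pieces of second speech information are then combined to generate the third speech information.
In some embodiments, merging the first speech information and the second speech information may include: merging them based on the phoneme durations corresponding to the first speech information and the phoneme durations corresponding to the second speech information.
In this embodiment, a phoneme duration is the pronunciation duration of the phoneme.
For example, for clause sequence 1: sil shang4 #0 hai3 #0 shi4 #2 jin1 #0 tian1 #2 yin1 #0 zhuan3 #1 duo1 #0 yun2 #3, "shang" can be split into the two phonemes "sh" and "ang", each of which has a pronunciation duration.
During splicing, the speech information of each clause sequence first needs to be trimmed according to the duration of the redundant phonemes at the head or tail of the clause sequence, so as to remove the redundant phoneme duration at the head or tail of the speech information.
Taking the first speech information as an example: when the first speech information is speech, after the speech is synthesized, the duration corresponding to the redundant phonemes at the head or tail of the first sub-prosodic phoneme sequence is cut from the speech to generate the trimmed first speech information.
When the first speech information is a high-level acoustic feature, after the first speech information is synthesized, the duration corresponding to the redundant phonemes at the head or tail of the first sub-prosodic phoneme sequence is cut from the high-level acoustic feature to generate a trimmed high-level acoustic feature; a vocoder is then used to synthesize speech from the trimmed feature to generate the trimmed first speech information.
The second speech information is trimmed in the same way as the first speech information, which is not repeated here.
Then, following the segmentation order in the prosodic phoneme sequence of the first sub-prosodic phoneme sequence (for the first speech information) and of the second sub-prosodic phoneme sequence (for the second speech information), the trimmed speech information of adjacent clause sequences is spliced in order, starting from the first clause sequence, until the speech information of all clause sequences has been spliced.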
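The trim-then-splice procedure can be sketched on raw sample lists. This is an illustration under assumptions: each segment arrives with the durations (in seconds) of the redundant phonemes at its head and tail already known, audio is a plain list of samples, and `trim_and_join` is a hypothetical name.

```python
from typing import List, Sequence, Tuple

Segment = Tuple[Sequence[float], float, float]  # (samples, head seconds, tail seconds)


def trim_and_join(segments: List[Segment], sample_rate: int) -> List[float]:
    """Cut redundant head/tail durations from each segment, then splice in order."""
    output: List[float] = []
    for samples, head_seconds, tail_seconds in segments:
        start = int(round(head_seconds * sample_rate))
        end = len(samples) - int(round(tail_seconds * sample_rate))
        # An empty slice results if the trims cover the whole segment.
        output.extend(samples[start:end])
    return output
```

Segments must be passed in the segmentation order of their clause sequences, because the function splices them exactly in the order given.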
根据本申请实施例提供的语音合成方法,基于音素时长拼接第一语音信息和第二语音信息,能够实现在不需要预设语音拼接单元库的基础上,即可提高相邻语音信息拼接处的自然度与流畅度。
如图3所示,该语音合成装置包括:第一处理模块310、第二处理模块320、和第三处理模块330。
第一处理模块310,用于切分目标文本的韵律音素序列,生成多个分句序列,韵律音素序列包括与目标文本对应的多个音素以及位于相邻音素之间的韵律标识符,每个分句序列包括至少一个音素;第二处理模块320,用于对多个分句序列中的第一子韵律音素序列进行语音合成,得到第一语音信息;第三处理模块330,用于输出第一语音信息且对多个分句序列中的第二子韵律音素序列进行语音合成,生成第二语音信息,第二子韵律音素序列为在韵律音素序列中位于第一子韵律音素序列之后的至少一个分句序列。
根据本申请实施例提供的语音合成装置,通过将目标文本切分为多个分句序列,优先对第一个分句序列进行语音合成生成第一语音信息,在输出第一语音信息的过程中继续对后续分句序列进行语音合成,有效加快了系统在接收到网络语音合成服务请求后的反馈速度,缩短了用户的等待时间,从而有助于提高用户的使用体验。
在一些实施例中,第一处理模块310还用于:将目标文本转化为韵律音素序列,韵律音素序列包括位于相邻音素之间的韵律标识符以及与目标文本对应的多个音素;基于多个韵律标识符中的至少部分对韵律音素序列进行切分,生成多个分句序列,每个分句序列包括至少一个音素。
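In some embodiments, the apparatus may further include a fifth processing module configured to merge the first speech information and the second speech information after the second speech information is generated, so as to generate third speech information.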
在一些实施例中,该装置还可以包括:第六处理模块,用于在将目标文本转化为韵律音素序列之后,基于韵律音素序列,生成第三语音信息的目标文件大小;
第四处理模块340,还用于基于目标文件大小生成第二语音信息。
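In some embodiments, the sixth processing module is further configured to: generate a predicted file size of the third speech information based on the prosodic phoneme sequence; and correct the predicted file size based on a target residual value to generate the target file size. The target residual value is determined from the sample file size and the predicted size of the sample audio file corresponding to a sample text, where the sample file size is the actual size of that sample audio file.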
在一些实施例中,第五处理模块,还用于基于第一语音信息对应的音素时长,以及第二语音信息对应的音素时长,合并第一语音信息和第二语音信息。
在一些实施例中,该装置还可以包括:第七处理模块,用于在将目标文本转化为韵律音素序列之前,获取待合成文本;在待合成文本的大小超过目标阈值的情况下,切分待合成文本,生成目标文本,目标文本的大小不超过目标阈值。
在一些实施例中,第一处理模块310,还用于:基于多个韵律标识符在韵律音素序列中确定第一切分位置;从韵律音素序列中位于第一切分位置之后的韵律标识符在韵律音素序列的位置中,确定第二切分位置;基于第一切分位置和第二切分位置对韵律音素序列进行切分,生成第一子韵律音素序列和至少两个第二子韵律音素序列,第一子韵律音素序列为韵律音素序列中位于第一切分位置之前的韵律音素序列,至少两个第二子韵律音素序列为韵律音素序列中位于第一切分位置之后的韵律音素序列,相邻的第二子韵律音素序列基于第二切分位置确定,且第一子韵律音素序列对应的语音合成时长在目标时长内。
在一些实施例中,第一处理模块310,还用于:获取目标文本的韵律词、音节、韵律短语、句末信息和语调短语;基于韵律词、音节、韵律短语、句末信息和语调短语中的至少两种对目标文本进行标记,生成韵律音素序列。
在一些实施例中,第一处理模块310,还用于:将目标文本转化为音素序列;基于韵律词、音节、韵律短语、句末信息和语调短语中的至少两种,生成多个韵律标识符;基于多个韵律标识符对音素序列进行标记,生成韵律音素序列。
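Fig. 4 is a schematic diagram of the physical structure of an electronic device. As shown in Fig. 4, the electronic device may include a processor 410, a communications interface 420, a memory 430, and a communication bus 440, where the processor 410, the communications interface 420, and the memory 430 communicate with one another over the communication bus 440. The processor 410 may invoke logic instructions in the memory 430 to execute the speech synthesis method, which includes: segmenting the prosodic phoneme sequence of a target text to generate multiple clause sequences, where the prosodic phoneme sequence includes multiple phonemes corresponding to the target text and prosodic identifiers located between adjacent phonemes, and each clause sequence includes at least one phoneme; performing speech synthesis on a first prosodic phoneme subsequence among the clause sequences to obtain first speech information; and outputting the first speech information while performing speech synthesis on a second prosodic phoneme subsequence among the clause sequences to generate second speech information, where the second prosodic phoneme subsequence is at least one clause sequence located after the first prosodic phoneme subsequence in the prosodic phoneme sequence.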
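In addition, the logic instructions in the memory 430 may be implemented as software functional units and, when sold or used as a standalone product, may be stored in a computer-readable storage medium. On this understanding, the technical solution of this application, in essence, or the part of it that contributes over the prior art, or a part of the technical solution, may be embodied as a software product. The computer software product is stored in a storage medium and includes instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or some of the steps of the methods in the embodiments of this application. The storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, or an optical disc.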
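Further, this application also provides a computer program product that includes a computer program. The computer program may be stored on a non-transitory computer-readable storage medium, and when the computer program is executed by a processor, the computer can execute the speech synthesis method provided by the method embodiments above.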
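In another aspect, embodiments of this application also provide a non-transitory computer-readable storage medium storing a computer program that, when executed by a processor, implements the speech synthesis method provided by the embodiments above.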
下面结合图5-图6描述本申请实施例的语音拼接方法。
如图5所示,该语音拼接方法包括:步骤510、步骤520和步骤530。
步骤510、切分目标文本的韵律音素序列,生成多个分句序列,韵律音素序列包括位于相邻音素之间的韵律标识符和与目标文本对应的多个音素,每个分句序列包括至少一个音素;
在一些实施例中,步骤510可以包括:基于多个韵律标识符中的至少部分切分韵律音素序列,生成多个分句序列。
在该实施例中,对于一整段韵律音素序列,包括有多个音素和多个韵律标识符,多个韵律标识符中包括对应不同细粒度等级的韵律标识符。
在实际执行过程中,可以基于实际情况选择合适的细粒度等级作为切分标准,并将该细粒度等级对应的韵律标识符在韵律音素序列中的位置作为切分点,对韵律音素序列进行切分,以得到多个分句序列。
需要说明的是,每个分句序列包括切分点处的韵律标识符以及至少一个音素。
可以理解的是,对于每段韵律音素序列,对应至少一个切分点,则可以得到至少两个分句序列。
例如,对于韵律音素序列“sil shang4 #0 hai3 #0 shi4 #2 jin1 #0 tian1 #2 yin1 #0 zhuan3 #1 duo1 #0 yun2 #3 dong1 #0 nan2 #0 feng1 #2 san1 #0 dao4 #1 si4 #0 ji2 #4 sil”,基于实际需求确定在#3处进行切分,则分别在韵律音素序列中含有#3的位置进行切分,并保留韵律分割符#3至前一个拼接单元,从而可以将该韵律音素序列切分为以下多个分句序列:
分句序列1:sil shang4 #0 hai3 #0 shi4 #2 jin1 #0 tian1 #2 yin1 #0 zhuan3 #1 duo1 #0 yun2 #3;
分句序列2:dong1 #0 nan2 #0 feng1 #2 san1 #0 dao4 #1 si4 #0 ji2 #4 sil。
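The split at “#3” described above, with the separator kept at the end of the preceding unit, can be sketched as follows. The function name `split_at_marker` is an illustrative assumption; the token sequence is the one from the example.

```python
def split_at_marker(tokens, marker="#3"):
    """Split a token list after every occurrence of `marker`.

    The marker stays at the end of the clause that precedes the cut.
    """
    clauses, current = [], []
    for tok in tokens:
        current.append(tok)
        if tok == marker:
            clauses.append(current)
            current = []
    if current:  # remainder after the last marker
        clauses.append(current)
    return clauses


seq = ("sil shang4 #0 hai3 #0 shi4 #2 jin1 #0 tian1 #2 yin1 #0 zhuan3 #1 "
       "duo1 #0 yun2 #3 dong1 #0 nan2 #0 feng1 #2 san1 #0 dao4 #1 "
       "si4 #0 ji2 #4 sil").split()
clauses = split_at_marker(seq)
```

If the chosen marker does not occur in the sequence, the function returns the whole sequence as a single clause.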
在其他实施例中,也可以基于多个韵律标识符在韵律音素序列中确定第一切分位置及第二切分位置;基于第一切分位置和第二切分位置对韵律音素序列进行切分,生成第一子韵律音素序列对应的分句序列和第二子韵律音素序列对应的分句序列。
其中,第一子韵律音素序列对应的语音合成时长在目标时长内。例如,目标时长可以是基于系统的算力和语音合成模型的能力上限中的至少一种确定。例如,目标时长为一个较短的时长,目标时长的数值可以基于用户自定义,或者也可以采用系统默认值,例如可以将目标时长设置为0.2s或0.3s等。
可以继续基于韵律标识符对第二子韵律音素序列进行切分得到分句序列,例如可以是在韵律分割符#3处对第二子韵律音素序列进行切分。可以理解,当第二子韵律音素序列中不存在韵律分割符#3时,不继续对第二子韵律音素序列进行切分。
上述实施例中,基于第一切分位置所得到的第一子韵律音素序列,其对应的语音合成时长能够在合理的时长范围内,从而缩短合成系统的首句响应时间,缩短延迟时间;除此之外,基于该方式所确定的第一切分位置及第二切分位置为在停顿时长较长的位置,使得切分得到的分句序列的停顿和韵律更加自然,从而使得后续输出的基于分句序列合成的语音更加自然且流畅。
步骤520、分别对各个分句序列进行语音合成,生成多个第一分句语音信息;
在该步骤中,第一分句语音信息为基于分句序列进行语音合成所生成的语音信息,每一个分句序列对应一个第一分句语音信息。
第一分句语音信息包括每一个韵律标识符和音素对应的第一时长。
需要说明的是,第一分句语音信息可以为语音,或者也可以为高级声学特征;
其中,高级声学特征为用于表征语音声学特性且可用于重构语音的物理量,包括但不限于:线性谱、梅尔谱、梅尔倒谱,以及音色的能量集中区、共振峰频率、共振峰强度和带宽、表示语音韵律特性的时长、基频、平均语声功率等。
音素为一个或多个根据语音的自然属性划分出来的语音单位的组合,语音单位可以为一个汉字对应的拼音、声母或韵母或者一个英文单词、英文音标或英文字母。
第一时长即该韵律标识符或音素对应的发音时长。
例如,对于分句序列1:sil shang4 #0 hai3 #0 shi4 #2 jin1 #0 tian1 #2 yin1 #0 zhuan3 #1 duo1 #0 yun2 #3,“shang4”可以作为一个音素,也可以拆分为“sh”和“ang4”两个音素,每个韵律标识符或音素均对应有一个发音时长。
在实际执行过程中,可以先获取第一分句序列中全部的韵律标识符和音素,然后再基于每个韵律标识符和音素获取该韵律标识符和音素对应的第一时长。
在一些实施例中,步骤520可以包括:将分句序列输入至目标语音合成模型,获取由目标语音合成模型输出的第一分句语音信息,其中,目标语音合成模型为,以样本韵律音素序列为样本,以与样本韵律音素序列对应的样本分句语音为样本标签,训练得到。
在该实施例中,目标语音合成模型可以为端到端语音合成模型。
分句序列为一整段韵律音素序列分割成多段得到。
该目标语音合成模型的输入值为分句序列,输出值为该分句序列对应的第一分句语音,或第一分句语音对应的高级声学特征。
其中,目标语音合成模型为,以样本分句序列为样本,以与样本分句序列对应的样本分句语音为样本标签,训练得到。
目标语音合成模型的训练过程与神经网络模型的训练方式类似,在此不做赘述。
如图6所示,在实际执行过程中,可以将每个分句序列分别转化为端到端语音合成模型可接收的韵律音素序列,并基于韵律音素序列获取该韵律音素序列中每一个音素以及每一个韵律标识符对应的第一时长。
例如,将分句序列1:sil shang4 #0 hai3 #0 shi4 #2 jin1 #0 tian1 #2 yin1 #0 zhuan3 #1 duo1 #0 yun2 #3,转化为端到端语音合成模型可以接收的韵律音素序列1:sil sh ang4 #0 h ai3 #0 sh i4 #2 j in1 #0 t ian1 #2 y in1 #0 zh uan3 #1 d uo1 #0 y vn2 #3 sil eos。
然后对韵律音素序列1中的每一个韵律和音素进行语音合成,合成每一个韵律和音素对应的语音或高级声学特征,从而生成第一分句语音或第一分句语音对应的高级声学特征;并计算每个韵律标识符和音素对应的第一时长。
根据本申请的一些实施例,在步骤520之后,且在步骤530之前,该方法还可以包括:输出第一分句语音信息。
在该实施例中,在生成分句序列对应的第一分句语音信息之后,即可输出该第一分句语音信息。
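In this embodiment, the target text is segmented into multiple clause sequences, each clause sequence is synthesized separately to generate its first speech information, and the first speech information of the first clause sequence in the target text is output first. This speeds up the system's response after it receives a network speech synthesis request, shortens the user's waiting time, and helps improve the user experience.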
步骤530、基于第一分句语音信息对应的分句序列在韵律音素序列中的切分顺序和第一时长,对多个第一分句语音信息进行拼接,生成目标语音。
In this step, the target speech is the speech obtained by performing speech synthesis on the entire target text.
在拼接过程中,需要先基于每个分句序列的首部或尾部多余的第一时长对分句序列对应的分句语音信息进行截除。
在第一分句语音信息为第一分句语音的情况下,在合成第一分句语音后,截除第一分句语音中分句序列的首部或尾部多余的音素所对应时长的语音,生成截除后的第一分句语音,依次对相邻的截除后的第一分句语音进行拼接,直至拼接完成全部的截除后的第一分句语音,生成目标语音。
在第一分句语音信息为第一分句语音对应的高级声学特征的情况下,在合成第一分句语音对应的高级声学特征后,截除第一分句语音对应的高级声学特征中分句序列的首部或尾部多余的音素所对应时长的高级声学特征,生成截除后的第一分句语音对应的高级声学特征;然后使用声码器对截除后的第一分句语音对应的高级声学特征进行语音合成,以生成截除后的第一分句语音。
然后基于各分句序列在目标文本中的切分顺序,依次对相邻的截除后的第一分句语音进行拼接,直至拼接完成全部的截除后的第一分句语音,生成目标语音。
在一些实施例中,步骤530可以包括:基于多个音素中的目标音素对应的第一时长,截去第一分句语音信息中与目标音素对应的语音,生成第二分句语音信息;基于第二分句语音信息对应的分句序列在韵律音素序列中的切分顺序,拼接第二分句语音信息,生成目标语音。
在该实施例中,与目标音素对应的语音为分句序列中多余的音素,包括但不限于分句序列首部或尾部所对应的不发音的音素。
第二分句语音信息为对第一分句语音信息中多余的停顿或静音时长进行截除后所生成的语音信息。
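The second clause speech information may take the form of speech or of high-level acoustic features.
Its form corresponds to the form of the first clause speech information.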
例如,对韵律音素序列1:sil sh ang4 #0 h ai3 #0 sh i4 #2 j in1 #0 t ian1 #2 y in1 #0 zh uan3 #1 d uo1 #0 y vn2 #3 sil eos,进行语音合成所生成的第一语音信息中,包括sil以及eos等对应时长的语音信息,这些语音信息为静音或停顿等多余的语音信息,则可以基于句末目标音素“sil”和“eos”对应的第一时长,截去第一分句语音信息句末sil以及eos对应的多余的时长,生成第二分句语音信息。
然后基于第二分句语音信息对应的分句序列在韵律音素序列中的切分顺序,从第一个分句序列开始,依次拼接相邻的分句序列对应的第二分句语音信息,直至拼接完成全部的分句序列对应的第二分句语音信息。
需要说明的是,对于不同表现形式的第二语音信息,其对应的拼接过程也有所区别,将在后续实施例中进行具体说明,在此暂不作赘述。
根据本申请实施例提供的语音拼接方法,在将目标文本切分为多个分句序列后并合成各分句序列对应的第一分句语音信息后,基于各分句序列中音素对应的时长对第一分句语音信息中多余的音素对应的语音进行截除,从而实现在不需要预设语音拼接单元库且不需要对待拼接语音单元进行平滑处理的基础上,即可提高相邻第一分句语音信息拼接处的自然度与流畅度。
下面分别从两个实现角度对步骤530的实现方式进行具体说明。
一、第一分句语音信息对应的分句序列不为目标文本中的第一个分句序列。
继续参考图2,在一些实施例中,目标音素包括句首多余音素和句末多余音素中的至少一种,基于多个音素中的目标音素对应的第一时长,截去第一分句语音信息中与目标音素对应的语音,可以包括:确定第一分句语音信息对应的分句序列不为目标文本中的第一个分句序列,分别截去第一分句语音信息中与句末多余音素对应的语音和与句首多余音素对应的语音。
继续以目标文本“上海市今天阴转多云东南风三到四级”为例,对该实施例进行说明。
将目标文本“上海市今天阴转多云东南风三到四级”转化为:
端到端语音合成模型可接收的韵律音素序列1:sil sh ang4 #0 h ai3 #0 sh i4 #2 j in1 #0 t ian1 #2 y in1 #0 zh uan3 #1 d uo1 #0 y vn2 #3 sil eos;
端到端语音合成模型可接收的韵律音素序列2:sil d ong1 #0 n an2 #0 f eng1 #2 s an1 #0 d ao4 #1 s i4 #0 j i2 #4 sil eos。
其中端到端语音合成模型可接收的韵律音素序列1为该目标文本中的第一个分句序列,端到端语音合成模型可接收的韵律音素序列2不为该目标文本中的第一个分句序列。
在将端到端语音合成模型可接收的韵律音素序列2合成为第一语音信息后,根据端到端语音合成模型可接收的韵律音素序列2中句首和句末多余音素的时长,也即基于句首“sil”的第一时长,在首部截去对应时长语音或高级声学特征,基于句末“sil”和“eos”的第一时长,在尾部截去对应时长语音或高级声学特征,即可生成第二语音信息。
二、第一分句语音信息对应的分句序列为目标文本中的第一个分句序列。
继续参考图6,在另一些实施例中,目标音素包括句首多余音素和句末多余音素中的至少一种,基于多个音素中的目标音素对应的第一时长,截去第一分句语音信息中与目标音素对应的语音,还可以包括:确定第一分句语音信息对应的分句序列为目标文本中的第一个分句序列,截去第一分句语音信息中与句末多余音素对应的语音。
继续以目标文本“上海市今天阴转多云东南风三到四级”为例,对该实施例进行说明。
将目标文本“上海市今天阴转多云东南风三到四级”转化为:
端到端语音合成模型可接收的韵律音素序列1:sil sh ang4 #0 h ai3 #0 sh i4 #2 j in1 #0 t ian1 #2 y in1 #0 zh uan3 #1 d uo1 #0 y vn2 #3 sil eos;
端到端语音合成模型可接收的韵律音素序列2:sil d ong1 #0 n an2 #0 f eng1 #2 s an1 #0 d ao4 #1 s i4 #0 j i2 #4 sil eos。
其中端到端语音合成模型可接收的韵律音素序列1为该目标文本中的第一个分句序列,端到端语音合成模型可接收的韵律音素序列2不为该目标文本中的第一个分句序列。
在将端到端语音合成模型可接收的韵律音素序列1合成为第一语音信息后,根据端到端语音合成模型可接收的韵律音素序列1中句末多余音素的时长,也即基于“sil”和“eos”的第一时长,在尾部截去对应时长语音或高级声学特征,即可生成第二语音信息。
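The two trimming cases above can be sketched as follows. This is an illustration under stated assumptions: the speech information is modeled as a plain list of frames, `durations[i]` is the first duration of `tokens[i]` in frames, and the function names `trim_clause` and `splice` are hypothetical.

```python
def trim_clause(frames, tokens, durations, is_first):
    """Cut redundant silence from one clause's synthesized frames."""
    assert len(tokens) == len(durations) and sum(durations) == len(frames)
    start, end = 0, len(frames)
    # Trailing "sil" / "eos" are removed from every clause.
    i = len(tokens) - 1
    while i >= 0 and tokens[i] in ("sil", "eos"):
        end -= durations[i]
        i -= 1
    # The leading "sil" is removed only when the clause is not the first one.
    if not is_first:
        j = 0
        while j < len(tokens) and tokens[j] == "sil":
            start += durations[j]
            j += 1
    return frames[start:max(start, end)]


def splice(clauses):
    """Concatenate trimmed clauses in segmentation order.

    `clauses` is a list of (frames, tokens, durations) tuples.
    """
    out = []
    for idx, (frames, tokens, durations) in enumerate(clauses):
        out.extend(trim_clause(frames, tokens, durations, is_first=(idx == 0)))
    return out
```

Because the cut points come from the per-token durations, no separate silence detection or smoothing step is needed in this sketch.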
根据本申请实施例提供的语音拼接方法,在将目标文本切分为多个分句序列后并合成各分句序列对应的第一分句语音信息后,基于各分句序列中音素对应的时长对第一分句语音信息中多余的音素对应的语音进行截除,从而实现在不需要预设语音拼接单元库且不需要对待拼接语音单元进行平滑处理的基础上,即可提高相邻第一分句语音信息拼接处的自然度与流畅度。
下面结合图7-图8描述本申请实施例的音视频文件大小的获取方法。
如图7所示,该音视频文件大小的获取方法,包括:步骤710、步骤720和步骤730。
步骤710、获取目标文本;
在该步骤中,目标文本为当前用于进行语音合成的文本。
其中,目标文本可以为数十至数百级别的常规文本,也可以为数千或数万级别的超长文本。
目标文本可以为存储于数据库中的本地文件,或者也可以为从网络下载的文件,本申请不做限定。
步骤720、对目标文本进行特征提取,生成目标韵律特征和目标音素特征;
在该步骤中,目标韵律特征用于表征目标文本的韵律特征,目标音素特征用于表征目标文本的音素特征。
其中,目标韵律特征和目标音素特征包括但不限于:音素及其对应的声调、音节、韵律词、韵律短语、语调短语、静音以及停顿等特征。
在一些实施例中,步骤720可以包括:将目标文本转化为韵律音素序列,韵律音素序列包括位于相邻音素之间的韵律标识符和与目标文本对应的多个音素;提取韵律音素序列的特征,生成目标韵律特征和音素特征。
在一些实施例中,将目标文本转化为韵律音素序列,可以包括:将目标文本转化为音素序列;获取音素序列的句末信息、语调短语、韵律短语、韵律词和音节;基于句末信息、语调短语、韵律短语、韵律词和音节中的至少两种对音素序列进行标记,生成韵律音素序列。
在得到韵律音素序列后,提取韵律音素序列中的韵律特征和音素特征,即可生成目标韵律特征和目标音素特征。
在一些实施例中,目标韵律特征和目标音素特征可以包括:韵律音素序列的长度、韵律音素序列中的中文拼音的数量、韵律音素序列中的停顿符号的数量、韵律音素序列中的英文音素的数量、韵律音素序列中的中文音素的数量、韵律音素序列中的中文声母的数量、韵律音素序列中的中文韵母的数量以及韵律音素序列中的各个类别的英文音素中的至少一种。
其中,韵律音素序列的长度可以为韵律音素序列中音素的数量。
步骤730、基于目标韵律特征和目标音素特征,获取目标音频文件的目标文件大小。
在该步骤中,目标音频文件为对整个目标文本进行语音合成所生成的音频文件。
可以理解的是,对于音频文件,目标音频文件即为该音频文件;对于视频文件,目标音频文件为该视频文件中所包括的音频文件。
目标文件大小为预测得到的,目标音频文件的文件大小。
目标文件大小可以为文件体积信息,或者也可以为第三语音信息的语音长度信息,本申请不做限定。
在一些实施例中,步骤730可以包括:基于目标韵律特征和目标音素特征,获取目标音频文件的第一预测文件大小;对目标残差值和第一预测文件大小求和,生成目标文件大小。
在该实施例中,第一预测文件大小为基于目标韵律特征和目标音素特征预测得到的,未经校正的经目标文本合成的语音的初始文件大小值。
目标残差值用于对第一预测文件大小进行校正,以提高最终所生成的目标文件大小的准确性。
目标残差值基于样本文件大小和预测的样本文本对应的样本音频文件的大小确定的,样本文件大小为样本文本对应的样本音频文件的实际大小。
目标文件大小为基于目标韵律特征和目标音素特征预测,且经校正后的经目标文本合成的语音的文件大小值。可以理解的是,目标文件大小的准确性高于第一预测文件大小。
目标残差值为预先确定的数值,例如目标残差值可以为残差值的最大绝对值。
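In this embodiment, a residual is added to the first predicted file size to correct it, which improves the accuracy of the target file size that is finally generated.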
在实际执行过程中,可以采用神经网络模型来预测第一预测文件大小。
下面以神经网络模型为文件大小预测模型为例,对该实施例中第一预测文件大小的生成方式进行说明。
在一些实施例中,步骤730可以包括:将目标韵律特征和目标音素特征输入至文件大小预测模型,获取由文件大小预测模型输出的第一预测文件大小。
在该实施例中,文件大小预测模型可以为预训练的神经网络模型。
文件大小预测模型用于基于文本的韵律特征和音素特征预测该文本所合成的语音的文件大小值。
文件大小预测模型的训练过程为:以样本韵律特征和样本音素特征为样本,以与样本韵律特征和样本音素特征对应的样本文件大小为样本标签,对该文件大小预测模型进行训练。
其中,样本韵律特征和样本音素特征为对样本文本进行韵律特征和音素特征提取所生成的,样本韵律特征和样本音素特征的提取方式与上述目标韵律特征和目标音素特征的提取方式类似,在此不作赘述。
与样本韵律特征和样本音素特征对应的样本文件大小为对样本文本进行语音合成所生成的样本音频文件的实际大小值。
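In practice, the target prosodic features and target phoneme features are input into the trained file size prediction model, which outputs the initial file size of the speech that would be synthesized from the corresponding target text, that is, the first predicted file size.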
在得到第一预测文件大小后,计算第一预测文件大小和目标残差值的和,即可生成目标文件大小。
在该实施例中,通过采用预训练的模型来获取第一预测文件大小,能够提高实际应用过程中的计算效率。
除此之外,对于实际应用过程中的每一个目标文本所对应的目标韵律特征和目标音素特征均可以作为后续训练该文件大小预测模型的训练样本,随着训练样本体积的增大,该文件大小预测模型的智能程度也将不断提高,所最终预测生成的结果也将更加准确。
下面通过具体实施例,对目标残差值的确定方式进行说明。
在一些实施例中,目标残差值通过如下步骤确定:
获取样本文本、样本音频文件对应的样本文件大小和样本文本对应的样本音频文件,样本音频文件为对样本文本进行语音合成所生成的;
对样本文本进行特征提取,生成样本韵律特征和样本音素特征;
基于样本韵律特征和样本音素特征,获取样本音频文件的第二预测文件大小;
将第二预测文件大小和样本文件大小的差值的最大绝对值,确定为目标残差值。
在该实施例中,样本文本可以为数十至数百级别的常规文本,也可以为数千或数万级别的超长文本。
样本音频文件为对样本文本进行语音合成,所最终生成的音频文件。
样本文件大小为样本音频文件的实际大小值或实际音频时长。
例如,可以采用语音合成系统计算样本文本对应的样本音频文件的真实wav文件大小或音频时长。
第二预测文件大小为经预测得到的,未经校正的样本音频文件的大小值或音频时长。
需要说明的是,第二预测文件大小的生成方式应与第一预测文件大小的生成方式保持一致。
在实际执行过程中,可以对样本文本进行特征提取,生成样本韵律特征和样本音素特征,并将样本韵律特征和样本音素特征输入至文件大小预测模型,获取由文件大小预测模型输出的第二预测文件大小。
然后计算第二预测文件大小减去样本文件大小的差值的最大绝对值,作为目标残差值。
可以理解的是,在执行过程中,可以对样本韵律特征和样本音素特征进行多次预测,以得到多个第二预测文件大小。则分别计算每一个第二预测文件大小与样本文件大小的差值,得到多个候选差值;然后从多个候选差值中选择最小非正值的绝对值,确定为目标残差值,以提高目标残差值的准确度。
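The residual selection described above can be sketched as follows. The function names are illustrative, and returning 0 when no candidate difference is non-positive is an assumption, since the text does not cover that case.

```python
def target_residual(predicted_sizes, actual_sizes):
    """Absolute value of the smallest non-positive (predicted - actual) difference."""
    diffs = [p - a for p, a in zip(predicted_sizes, actual_sizes)]
    non_positive = [d for d in diffs if d <= 0]
    # Assumption: with no under-prediction at all, no correction is applied.
    return abs(min(non_positive)) if non_positive else 0


def target_file_size(first_prediction, residual):
    """Target file size = first predicted file size + target residual value."""
    return first_prediction + residual
```

Choosing the largest under-prediction as the residual biases the corrected size upward, so the announced size is unlikely to be smaller than the file that is eventually produced.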
根据本申请实施例提供的音视频文件大小的获取方法,通过对目标文本进行韵律特征以及音素特征的提取,并基于提取得到的目标韵律特征和目标音素特征预测由该目标文本所合成的目标音频文件的大小信息,能够在目标音频文件生成之前即可实现该目标文件的大小值的预测,具有一定的及时性;且预测结果的准确性和精确性较高。
如图8所示,根据本申请的一些实施例,在步骤730之后,该方法还可以包括:基于目标韵律特征和目标音素特征对目标文本进行切分,生成多个分句序列;对分句序列进行语音合成,生成分句语音;输出分句语音和目标文件大小,并对分句语音进行拼接,生成目标音频文件。
在该实施例中,每个分句序列包括至少一个音素,其中音素可以为中文音素或英文音素。
基于目标韵律特征中的音节、韵律词、韵律短语以及语调短语中的至少一个特征对目标文本进行切分,以得到至少两个分句序列。
例如,对于目标文本“上海市今天阴转多云东南风三到四级”,可以首先将其转化为韵律音素序列:sil shang4 #0 hai3 #0 shi4 #2 jin1 #0 tian1 #2 yin1 #0 zhuan3 #1 duo1 #0 yun2 #3 dong1 #0 nan2 #0 feng1 #2 san1 #0 dao4 #1 si4 #0 ji2 #4 sil;
然后在#3处进行切分,从而可以将该韵律音素序列切分为以下多个分句序列:
分句序列1:sil shang4 #0 hai3 #0 shi4 #2 jin1 #0 tian1 #2 yin1 #0 zhuan3 #1 duo1 #0 yun2 #3;
分句序列2:dong1 #0 nan2 #0 feng1 #2 san1 #0 dao4 #1 si4 #0 ji2 #4 sil。
对多个分句序列中切分顺序最前的分句序列进行语音合成,生成该分句序列对应的分句语音;
输出该分句序列对应的分句语音以及目标文件大小,并合成后续分句序列。
例如,对于样本文本:详细内容麻烦在APP上搜寻下,可以转化为样本韵律音素序列:sil xiang2 #0 xi4 #1 nei4 #0 rong2 #2 ma2 #0 fan5 #2 zai4 #1 AE1 P #0 shang4 #1 sou1 #0 xun2 #0 xia4 #4 sil;
然后对样本韵律音素序列进行特征提取,所提取的样本韵律特征和样本音素特征包括但不限于:样本韵律音素序列长度;样本韵律音素序列中的中文拼音出现的个数、样本韵律音素序列中的停顿符号(#0 #1 #2 #3 sil)的个数、样本韵律音素序列中的英文音素的个数、样本韵律音素序列中的中文音素的个数、样本韵律音素序列中的中文声母的个数、样本韵律音素序列中的中文韵母的个数、样本韵律音素序列中的每种类别的英文音素(Vowels,Diphthongs,R colored vowels,Stops,Affricates,Fricatives,Nasals,Liquids,Semivowels)的个数。
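A few of the counts listed above can be computed with a sketch like the one below. The token classification rules are assumptions made for illustration: a lower-case syllable followed by a tone digit is treated as Chinese pinyin, an upper-case token with an optional stress digit is treated as an English phoneme, and the length is taken as the total number of tokens.

```python
import re

PAUSES = {"#0", "#1", "#2", "#3", "sil"}   # pause symbols as listed in the text
PINYIN = re.compile(r"^[a-z]+[1-5]$")      # assumed pinyin pattern
ENGLISH = re.compile(r"^[A-Z]+[0-2]?$")    # assumed English phoneme pattern


def extract_features(sequence):
    """Count a subset of the prosodic and phoneme features of a sequence."""
    tokens = sequence.split()
    return {
        "length": len(tokens),
        "pinyin": sum(1 for t in tokens if PINYIN.match(t)),
        "pauses": sum(1 for t in tokens if t in PAUSES),
        "english_phonemes": sum(1 for t in tokens if ENGLISH.match(t)),
    }


sample = ("sil xiang2 #0 xi4 #1 nei4 #0 rong2 #2 ma2 #0 fan5 #2 zai4 #1 "
          "AE1 P #0 shang4 #1 sou1 #0 xun2 #0 xia4 #4 sil")
features = extract_features(sample)
```

The remaining features (initials, finals, and per-category English phoneme counts) would need phoneme-level tokens and a category table, which this sketch omits.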
在准备好训练数据后,则可以训练基于ElasticNet回归模型的wav文件大小预测模型。
将上述获取的样本韵律特征和样本音素特征输入至wav文件大小预测模型,训练过程的目标输出为样本音频文件的真实wav文件字节数。
具体地,可以使用交叉验证选取表现最好的模型参数,然后用所选取的参数训练 ElasticNet回归模型。
然后计算目标残差值,如使用样本韵律特征和样本音素特征作为模型的输入,得到第二预测文件大小。
计算第二预测文件大小减去样本文件大小的最小非正值的绝对值,作为最大残差值。
在实际应用过程中,客户端发起请求。如获取目标文本:上海市今天阴转多云东南风三到四级。
系统响应于请求,从客户端请求的目标文本中提取目标韵律特征和目标音素特征。
将提取的目标韵律特征和目标音素特征输入至如上所述的模型中,得到第一预测文件大小。
然后对第一预测文件大小增补残差,所生成的目标文件大小为第一预测文件大小和目标残差值之和。
将生成的目标文件大小作为wav文件大小预测值。
将wav文件大小预测值写入wav文件头。
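Writing the predicted size into the header before any audio exists can be sketched as follows, assuming the canonical 44-byte PCM WAV header and the 16 kHz, 16-bit, mono format mentioned in this application. The function name `wav_header` is illustrative.

```python
import struct


def wav_header(total_size, sample_rate=16000, bits=16, channels=1):
    """Build a 44-byte PCM WAV header that announces a file of `total_size` bytes."""
    block_align = channels * bits // 8
    byte_rate = sample_rate * block_align
    return struct.pack(
        "<4sI4s4sIHHIIHH4sI",
        b"RIFF", total_size - 8,          # RIFF chunk size excludes first 8 bytes
        b"WAVE",
        b"fmt ", 16, 1,                   # fmt chunk length, PCM format tag
        channels, sample_rate, byte_rate, block_align, bits,
        b"data", total_size - 44,         # data chunk size excludes the header
    )


header = wav_header(25644)
```

The header can then be sent to the client first, followed by the audio data of each clause as it is synthesized.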
然后将客户端请求的目标文本进行切分,生成多个分句序列,例如分为:
第一分句序列:上海市今天阴转多云;
第二分句序列:东南风三到四级。
合成第一分句序列“上海市今天阴转多云”的音频,生成第一分句语音,写入wav文件,返回给客户端。
然后按顺序合成第一分句序列之后的音频,并写入wav文件,直至合成完所有的请求。如合成“东南风三到四级”的音频,并写入wav文件,结束。
又如,对于文件大小表现为时长的情况,对于样本文本“可以控制”,可以转化为韵律音素序列:sil k e2 #0 y i3 #1 k ong4 #0 zh i4 #3 sil eos,并预测韵律和音素的时长(梅尔谱帧数):3 1 3 1 1 6 2 2 7 2 4 5 11 4 12,并将该音素的时长总和作为样本文件大小。
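In subsequent model training, the model may be configured as one 256-dimensional embedding layer, followed by four 1-D convolutional layers with 256 channels each, followed by layer norm, then dropout, then one fully connected layer with output dimension 1.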
然后将音素时长序列d转换到log域,其中,d’=log(d+1);
其中损失函数可以包括音素时长序列的MSE损失和平均每个音素的总时长MAE损失。
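The log-domain transform d' = log(d + 1) and an MSE over the transformed durations can be sketched in plain Python as follows. The helper names are illustrative, and the MAE term of the loss is left out because its exact definition is not given here.

```python
import math


def to_log(durations):
    """Map frame counts d to d' = log(d + 1)."""
    return [math.log(d + 1) for d in durations]


def from_log(values):
    """Invert the transform and round back to whole, non-negative frame counts."""
    return [max(0, round(math.exp(v) - 1)) for v in values]


def mse(pred, target):
    """Mean squared error between two equally long sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


# Durations (mel frames) from the example "sil k e2 #0 y i3 #1 k ong4 #0 zh i4 #3 sil eos".
d = [3, 1, 3, 1, 1, 6, 2, 2, 7, 2, 4, 5, 11, 4, 12]
total_frames = sum(d)
```

Adding 1 before taking the logarithm keeps zero-length durations defined.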
然后采用Adam优化器对模型进行迭代优化。
在计算目标残差值的过程中,可以基于以上模型得到预测的梅尔谱总帧数作为第二预测文件大小,然后进行最大残差值的计算。
在实际应用过程中,客户端发起请求。如获取目标文本:上海市今天阴转多云东南风三到四级。
系统响应于请求,从客户端请求的目标文本中提取目标韵律特征和目标音素特征。
将提取的目标韵律特征和目标音素特征输入至如上所述的模型中,得到第一预测文件大小。
然后对第一预测文件大小增补残差,所生成的目标文件大小为第一预测文件大小和目标残差值之和。
需要说明的是,在该实施例中,计算的是梅尔谱帧数,则根据梅尔谱帧移和wav文件的采样频率16000、采样位数16、声道数1将音频时长(梅尔谱总帧数)转换为wav文件大小:
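where WAV file size = ((number of mel frames × mel frame shift / 16000) × 16000 × 16 × 1 / 8 + 44) bytes. Here 16000 is the sampling frequency, 16 is the sample bit depth, 1 is the number of channels, and 44 bytes is the length of the WAV file header.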
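The conversion from mel frames to a WAV file size can be sketched as follows. The frame shift is assumed to be expressed in samples, in which case the sampling frequency cancels out of the formula; the value of 200 samples used below is a hypothetical example, not a value from this application.

```python
def wav_size_bytes(mel_frames, hop_samples, bits=16, channels=1, header_bytes=44):
    """Predicted WAV size in bytes for a given number of mel frames.

    (frames * hop / rate) * rate * bits * channels / 8 + 44 simplifies to
    frames * hop * bits * channels / 8 + 44 when hop is given in samples.
    """
    samples = mel_frames * hop_samples
    return samples * channels * bits // 8 + header_bytes


size = wav_size_bytes(64, 200)  # 64 frames, assumed frame shift of 200 samples
```

Integer arithmetic is used here to avoid floating-point rounding in the byte count.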
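According to the method for obtaining the size of an audio or video file provided by embodiments of this application, prosodic and phoneme features are extracted from the target text, and the size of the target audio file to be synthesized from that text is predicted from the extracted target prosodic features and target phoneme features. The size of the target file can therefore be predicted before the target audio file is generated, which makes the prediction timely, and the prediction is accurate and precise.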
下面结合图9-图10描述本申请实施例的文本转写方法。
如图9所示,该文本转写方法包括:步骤910和步骤920。
步骤910、对目标文本的韵律音素序列进行切分,生成多个分句序列;
在该步骤中,目标文本为当前用于进行语音合成的文本。
在一些实施例中,韵律标识符可以包括:用于表征音节、用于表征韵律词、用于表征韵律短语、用于表征句末信息和用于表征语调短语的标识符中的至少一种。
可以理解的是,不同的韵律标识符对应有不同的细粒度等级,其中,用于表征停顿的韵律标识符的细粒度大于用于表征语调短语的韵律的标识符的细粒度,用于表征语调短语的细粒度大于用于表征韵律短语的细粒度,用于表征韵律短语的细粒度大于用于表征韵律词的细粒度,用于表征韵律词的细粒度大于用于表征音节的细粒度。
在实际执行过程中,可以用不同的符号表示不同细粒度等级的韵律特征。
例如,对于目标文本“上海市今天阴转多云东南风三到四级”,可以将其转化为韵律音素序列:sil shang4 #0 hai3 #0 shi4 #2 jin1 #0 tian1 #2 yin1 #0 zhuan3 #1 duo1 #0 yun2 #3 dong1 #0 nan2 #0 feng1 #2 san1 #0 dao4 #1 si4 #0 ji2 #4 sil。
可以理解的是,对于该韵律音素序列,韵律标识符可以包括:各相邻的拼音之间的#和数字;音素可以包括每一个汉字对应的拼音和声调或英文音标。
其中,sil为韵律音素序列中的代表句首和句末的静音,#0代表着音节、#1代表韵律词、#2代表韵律短语、#3代表语调短语以及#4代表句末,每个音素后面的数字代表该音素的声调,如shang4中的4代表拼音“shang”的声调为第四声。且细粒度由小到大依次为:#0<#1<#2<#3<#4。
对于每段韵律音素序列,对应至少一个切分点,则可以得到至少两个分句序列。
在一些实施例中,步骤910可以包括:将目标文本转化为韵律音素序列;基于多个韵律标识符中的至少部分切分韵律音素序列,生成多个分句序列。
在该实施例中,对于一整段韵律音素序列,包括有多个音素和多个韵律标识符,多个韵律标识符中包括对应不同细粒度等级的韵律标识符。
在实际执行过程中,可以基于实际情况选择合适的细粒度等级作为切分标准,并将该细粒度等级对应的韵律标识符在韵律音素序列中的位置作为切分点,对韵律音素序列进行切分,以得到多个分句序列。
需要说明的是,每个分句序列包括切分点处的韵律标识符以及至少一个音素。
可以理解的是,对于每段韵律音素序列,对应至少一个切分点,则可以得到至少两个分句序列。
例如,对于韵律音素序列“sil shang4 #0 hai3 #0 shi4 #2 jin1 #0 tian1 #2 yin1 #0 zhuan3 #1 duo1 #0 yun2 #3 dong1 #0 nan2 #0 feng1 #2 san1 #0 dao4 #1 si4 #0 ji2 #4 sil”,基于实际需求确定在#3处进行切分,则分别在韵律音素序列中含有#3的位置进行切分,并保留韵律分割符#3至前一个拼接单元,从而可以将该韵律音素序列切分为以下多个分句序列:
分句序列1:sil shang4 #0 hai3 #0 shi4 #2 jin1 #0 tian1 #2 yin1 #0 zhuan3 #1 duo1 #0 yun2 #3;
分句序列2:dong1 #0 nan2 #0 feng1 #2 san1 #0 dao4 #1 si4 #0 ji2 #4 sil。
在一些实施例中,步骤910还可以包括:
基于多个韵律标识符中的目标标识符切分韵律音素序列,生成多个候选序列;
将多个候选序列中的目标候选序列与相邻候选序列进行组合,生成分句序列对应的细粒度大小以及多个分句序列;
基于分句序列对应的细粒度大小,对多个分句序列进行降序排序。
在该实施例中,基于目标标识符所对应的位置,可以将该韵律音素序列切分为多个候选序列,其中位于第一个切分点位置之前的候选序列,其对应的语音合成时长在目标时长内。
其中,语音合成时长为将该候选序列合成为语音所耗费的时间。
目标时长为一个较短的时长,目标时长可以基于用户自定义,或者也可以采用系统默认值,例如可以将目标时长设置为0.2s或0.3s等。
For example, for the prosodic phoneme sequence “sil mu4 #0 qian2 #1 xiao2 #0 jia3 #3 ke2 #0 yi3 #1 kong4 #0 zhi4 #1 re4 #0 shui3 #0 qi4 #1 kai1 #0 guan1 #3 tiao2 #0 jie2 #1 wen1 #0 du4 #3 ding4 #0 shi2 #1 kai1 #0 guan1 #3 xiang2 #0 xi4 #1 nei4 #0 rong2 #2 ma2 #0 fan5 #1 zai4 #1 jia3 #0 ju1 #1 AE1 P #0 shang4 #1 sou1 #0 xun2 #0 xia4 #4 sil”, the prosodic identifier that follows “yi3” and every “#3” prosodic identifier after “yi3” may be determined as target identifiers, producing the following segmented sequence:
sil mu4 #0 qian2 #1 xiao2 #0 jia3 #3 ke2 #0 yi3 #1 | kong4 #0 zhi4 #1 re4 #0 shui3 #0 qi4 #1 kai1 #0 guan1 #3 | tiao2 #0 jie2 #1 wen1 #0 du4 #3 | ding4 #0 shi2 #1 kai1 #0 guan1 #3 | xiang2 #0 xi4 #1 nei4 #0 rong2 #2 ma2 #0 fan5 #1 zai4 #1 jia3 #0 ju1 #1 AE1 P #0 shang4 #1 sou1 #0 xun2 #0 xia4 #4 sil
其中,“|”为目标标识符对应的切分点。
目标候选序列可以为多个候选序列中的任一候选序列,分别将该目标候选序列与相邻的其他候选序列进行组合,从而生成多个组合后的分句序列,其中,多个分句序列中包括原始的候选序列和目标文本对应的原始的韵律音素序列。
可以理解的是,分句序列的细粒度等级大于目标候选序列的细粒度等级。
例如,句子“希望这首歌|能让你喜欢|为您播放|XX”的韵律音素序列“sil xi1 #0 wang4 #1 zhe4 #0 shou3 #0 ge1 #3| neng2 #0 rang4 #1 ni2 #1 xi3 #0 huan1 #1 | wei4 #0 nin2 #1 bo1 #0 fang4 #1 | EH1 K S #10 EH1 K S #0 de5 #1 EH1 K S #4 sil”中,“xi1 #0 wang4 #1 zhe4 #0 shou3 #0 ge1 #3”、“neng2 #0 rang4 #1 ni2 #1 xi3 #0 huan1 #1”、“wei4 #0 nin2 #1 bo1 #0 fang4 #1”以及“EH1 K S #10 EH1 K S #0 de5 #1 EH1 K S #4”均为候选序列。
分别将候选序列中的任一候选序列作为目标候选序列,将其与相邻的其他候选序列进行组合,从而可以得到如下多个分句序列(两个“|”之间的为一个分句序列):
“sil xi1 #0 wang4 #1 zhe4 #0 shou3 #0 ge1 #3 | neng2 #0 rang4 #1 ni2 #1 xi3 #0 huan1 #1 | wei4 #0 nin2 #1 bo1 #0 fang4 #1 | EH1 K S #10 EH1 K S #0 de5 #1 EH1 K S #4 sil”
“sil xi1 #0 wang4 #1 zhe4 #0 shou3 #0 ge1 #3 neng2 #0 rang4 #1 ni2 #1 xi3 #0 huan1 #1 | wei4 #0 nin2 #1 bo1 #0 fang4 #1 | EH1 K S #10 EH1 K S #0 de5 #1 EH1 K S #4 sil”
“sil xi1 #0 wang4 #1 zhe4 #0 shou3 #0 ge1 #3 | neng2 #0 rang4 #1 ni2 #1 xi3 #0 huan1 #1 wei4 #0 nin2 #1 bo1 #0 fang4 #1 | EH1 K S #10 EH1 K S #0 de5 #1 EH1 K S #4 sil”
“sil xi1 #0 wang4 #1 zhe4 #0 shou3 #0 ge1 #3 neng2 #0 rang4 #1 ni2 #1 xi3 #0 huan1 #1 wei4 #0 nin2 #1 bo1 #0 fang4 #1 | EH1 K S #10 EH1 K S #0 de5 #1 EH1 K S #4 sil”
“sil xi1 #0 wang4 #1 zhe4 #0 shou3 #0 ge1 #3 | neng2 #0 rang4 #1 ni2 #1 xi3 #0 huan1 #1 | wei4 #0 nin2 #1 bo1 #0 fang4 #1 EH1 K S #10 EH1 K S #0 de5 #1 EH1 K S #4 sil”
“sil xi1 #0 wang4 #1 zhe4 #0 shou3 #0 ge1 #3 | neng2 #0 rang4 #1 ni2 #1 xi3 #0 huan1 #1 wei4 #0 nin2 #1 bo1 #0 fang4 #1 EH1 K S #10 EH1 K S #0 de5 #1 EH1 K S #4 sil”
“sil xi1 #0 wang4 #1 zhe4 #0 shou3 #0 ge1 #3 neng2 #0 rang4 #1 ni2 #1 xi3 #0 huan1 #1 wei4 #0 nin2 #1 bo1 #0 fang4 #1 EH1 K S #10 EH1 K S #0 de5 #1 EH1 K S #4 sil”
可以理解的是,每个分句序列均对应有细粒度等级,分句序列中所包括的候选序列的数量越多,则其对应的细粒度越大,例如“希望这首歌能让你喜欢”所对应的细粒度大于“希望这首歌”所对应的细粒度。
基于分句序列对应的细粒度大小,对多个分句序列进行排序,细粒度越大,则其对应的分句序列排在越前面,例如将“xi1 #0 wang4 #1 zhe4 #0 shou3 #0 ge1 #3 neng2 #0 rang4 #1 ni2 #1 xi3 #0 huan1 #1 wei4 #0 nin2 #1 bo1 #0 fang4 #1 EH1 K S #10 EH1 K S #0 de5 #1 EH1 K S #4(希望这首歌能让你喜欢为您播放XX的X)”排在“xi1 #0 wang4 #1 zhe4 #0 shou3 #0 ge1 #3 neng2 #0 rang4 #1 ni2 #1 xi3 #0 huan1 #1 wei4 #0 nin2 #1 bo1 #0 fang4 #1(希望这首歌能让你喜欢为您播放)”的前面。
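Enumerating the combinations of adjacent candidate sequences and sorting them in descending order of granularity can be sketched as follows. Here granularity is taken to be the number of candidate sequences in a combination, and ties are kept in left-to-right order, which is an assumption. The function name is illustrative.

```python
def clause_combinations(candidates):
    """Return every run of adjacent candidates, longest runs first."""
    n = len(candidates)
    spans = [(i, j) for i in range(n) for j in range(i + 1, n + 1)]
    # More candidates in a run means larger granularity; ties stay left to right.
    spans.sort(key=lambda s: (-(s[1] - s[0]), s[0]))
    return [" ".join(candidates[i:j]) for i, j in spans]


combos = clause_combinations(["A", "B", "C", "D"])
```

For four candidates this gives ten clause sequences, starting with the full sequence and ending with the single candidates.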
在该实施例中,基于语义及人的说话习惯预测的韵律结果对韵律音素序列进行切分,以在停顿时长较长位置分割,而非简单基于标点符号切分,有助于提高后续拼接所合成的多个分句语音所生成的目标语音的自然度。
步骤920、确定多个分句序列中的任一待匹配分句序列与缓存的目标分句序列匹配,从缓存中获取与目标分句序列对应的目标分句语音,并将该待匹配分句序列对应的语音确定为目标分句语音。
在该步骤中,目标分句序列为预先生成并存储于系统中的分句序列。
目标分句序列可以为缓存于系统中的全部的预存分句序列中的任意一个。
目标分句语音为预先对目标分句序列进行语音合成所生成的语音,该目标分句语音存储于系统中,且目标分句序列与目标分句语音之间建立有对应关系。
在实际执行过程中,可以将缓存的目标分句序列分别与多个分句序列进行精确匹配,确定目标分句序列与多个分句序列中的任一待匹配分句序列匹配,则将该待匹配分句序列确定为第一分句序列,并从缓存中获取与目标分句序列对应的目标分句语音;
在将该分句序列确定为第一分句序列后,则可以直接将与该第一分句序列相匹配的目标分句对应的目标分句语音确定为该第一分句序列对应的语音。
在一些实施例中,其中,待匹配分句序列包括多个分句序列中的任一分句序列及不同分句序列之间的组合。
如图10所示,在一些实施例中,步骤920可以包括:基于步骤910中的降序排序,从前至后依次将多个分句序列与目标分句序列进行精确匹配。
具体地,基于降序排序的次序,从前至后依次将多个分句序列与缓存中的目标分句序列进行匹配,并将匹配成功的目标分句序列的语音确定为分句序列的语音。
在该实施例中,基于步骤910中所生成的降序排序的顺序,从前到后依次将分句序列与目标分句序列进行精确匹配,例如先将分句序列“xi1 #0 wang4 #1 zhe4 #0 shou3 #0 ge1 #3 neng2 #0 rang4 #1 ni2 #1 xi3 #0 huan1 #1 wei4 #0 nin2 #1 bo1 #0 fang4 #1 EH1 K S #10 EH1 K S #0 de5 #1 EH1 K S #4(希望这首歌能让你喜欢为您播放XX的X)”与目标分句进行精确匹配,在匹配成功的情况下,则将该分句序列确定为第一分句序列,并将与该第一分句序列匹配的目标分句对应的目标分句语音确定为该第一分句序列对应的语音,结束比较。
在匹配不成功的情况下,则再将分句序列“xi1 #0 wang4 #1 zhe4 #0 shou3 #0 ge1 #3 neng2 #0 rang4 #1 ni2 #1 xi3 #0 huan1 #1 wei4 #0 nin2 #1 bo1 #0 fang4 #1(希望这首歌能让你喜欢为您播放)”与目标分句进行比较,并重复上述过程,直至确定某一分句序列与目标分句能够实现精确匹配,则结束比较。
在一些实施例中,在所有分句序列与目标分句均不能实现精确匹配的情况下,则基于该分句序列生成语音,具体实现方式将在后续实施例中进行说明,在此暂不作赘述。
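One possible reading of the matching procedure above is a longest-match-first cache lookup, sketched below. The left-to-right greedy strategy, the dictionary cache, and the `synthesize` callback are all assumptions made for illustration; the application only states that clause sequences are compared in descending order and that unmatched sequences are synthesized.

```python
def transcribe(candidates, cache, synthesize):
    """Resolve each part of the text from the cache, longest match first."""
    out, i, n = [], 0, len(candidates)
    while i < n:
        for j in range(n, i, -1):              # try the longest run first
            key = " ".join(candidates[i:j])
            if key in cache:                   # exact match on the phoneme key
                out.append(cache[key])
                i = j
                break
        else:
            # No cached run starts here: synthesize this candidate and cache it.
            key = candidates[i]
            audio = synthesize(key)
            cache[key] = audio
            out.append(audio)
            i += 1
    return out
```

Using the prosodic phoneme string as the key means that two texts with different punctuation but the same pronunciation map to the same cache entry.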
在一些实施例中,在步骤920之后,该方法还可以包括:输出目标分句语音。
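In this embodiment, the target clause speech is the speech corresponding to the first clause sequence; it is generated in advance and stored in the cache.
In practice, when a clause sequence is determined to match a target clause sequence exactly, the target clause speech of the matching target clause is taken directly as the speech of that first clause sequence, and the target clause speech is output.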
申请人在研发过程中发现,由于语音的时间特性,合成一段文本需要很多算力,如果语音合成请求量非常大,且基本上相同的文本,服务器将浪费大量算力重复相同的工作。一种简单方法是把待合成的文本作为key,把对应的已合成音频地址作为value,将这一组key和value存放在缓存中。当有重复的文本合成需求,则直接从缓存中查找到对应的音频,而避免重复使用算力合成相同的文本。
但该方法要求整个句子完全匹配,考虑到实际使用过程中,请求的文本相互之间很少存在完全相同的情况(如仅仅是标点符号不同,或者一句话中仅仅是某部分有所变化),从而导致命中率较低,进而影响缓存效率。
而在本申请中,通过将目标文本转化为韵律音素序列,基于韵律特征确定目标文本的停顿位置和停顿时长级别,并基于韵律特征将韵律音素序列切分为多个分句序列,分别将分句序列与缓存的目标分句序列进行比较,由于作为查找关键词的序列更短,在缓存查找中更容易命中,从而可以有效提高命中效率;在分句序列与目标序列相同的情况下,则直接将目标分句序列对应的目标分句语音确定为该分句序列的语音,而无需重新进行语音合成,从而有效降低服务器的算力开支。
如图10所示,在实际执行过程中,可以分别采用韵律预测模块和切分模块来执行上述步骤。
根据本申请实施例提供的文本转写方法,基于韵律特征将韵律音素序列切分为多个分句序列,分别将分句序列与缓存的目标分句序列进行比较,可以有效提高命中效率;在分句序列与目标序列相同的情况下,则直接将目标分句序列对应的目标分句语音确定为该分句序列的语音,而无需重新进行语音合成,从而提高语音合成的效率。
继续参考图10,根据本申请的一些实施例,该方法还可以包括:确定多个分句序列中的任一待匹配分句序列与目标分句序列不匹配,对待匹配分句序列进行语音合成,生成第二分句语音。
在该实施例中,目标分句序列为预先生成并存储于系统中的分句序列。
目标分句序列可以为缓存于系统中的全部的预存分句序列中的任意一个。
确定多个分句序列中的任一待匹配分句序列与目标分句序列不匹配,则将该待匹配分句序列确定为第二分句序列。
在实际执行过程中,将目标分句序列与多个分句序列中的任意待匹配分句序列进行精确匹配,在均不匹配的情况下,则将该待匹配分句序列确定为第二分句序列,并对第二分句序列进行语音合成,生成第二分句语音。
第二分句语音为缓存中不存在的语音。
在实际执行过程中,可以基于步骤910中所生成的降序排序的顺序,从前到后依次将分句序列与目标分句序列进行精确匹配,在所有分句序列与目标分句均不匹配的情况下,则对未查找到相似序列的分句序列进行语音合成,生成第二分句语音。
在一些实施例中,对第二分句序列进行语音合成,生成第二分句语音,可以包括:将第二分句序列转换为端到端语音合成模型可接收的韵律音素序列,对该韵律音素序列进行语音合成,生成第二分句语音。
在该实施例中,韵律音素序列用于表征第二分句序列的韵律信息和音素信息。
其中,音素为根据语音的自然属性划分出来的最小语音单位,依据音节里的发音动作来分析,一个动作即构成一个音素,音素可以为汉语音素,也可以为英语音素。
例如,第二分句序列可以表现为sil shang4 #0 hai3 #0 shi4 #2 jin1 #0 tian1 #2 yin1 #0 zhuan3 #1 duo1 #0 yun2 #3,或者表现为:sil sh ang4 #0 h ai3 #0 sh i4 #2 j in1 #0 t ian1 #2 y in1 #0 zh uan3 #1 d uo1 #0 y vn2 #3 sil eos等不同格式的音素序列。
将第二分句序列输入至语音合成系统(如端到端语音合成模型),由语音合成系统合成第二分句语音。
在实际执行过程中,可以采用文字转音素模块来执行以上操作。
在该实施例中,将音素作为关键字进行缓存,克服了文本中标点的变化,或在数字写法变化而读音完全相同时被当成不同句子进行缓存的缺点,能够实现目标文本的标准化缓存,提高缓存效率。
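Still referring to Fig. 10, according to some embodiments of this application, after the second clause speech is generated, the method may further include: segmenting the second clause speech based on the prosodic identifiers to generate multiple sub second clause speeches; and caching the sub second clause speeches together with their corresponding sub-clause sequences.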
在该实施例中,可以基于第二分句语音所对应的第二分句序列中的韵律标识符,对第二分句语音进行切分,生成多个子第二分句语音。
其中,每个子第二分句语音对应的分句序列即为子分句序列。
在得到子第二分句语音以及其对应的子分句序列后,可以将子分句序列和子第二分句语音缓存至系统,作为后续查询过程中的目标分句序列及其对应的目标分句语音。
根据本申请的一些实施例,在生成第二分句语音之后,该方法还可以包括:拼接第二分句语音和目标分句语音,生成目标文本对应的目标语音。
In this embodiment, the target speech is the speech obtained by performing speech synthesis on the entire target text.
目标分句语音为缓存中存在的语音;
第二分句语音为缓存中不存在的语音。
可以理解的是,该目标语音为基于缓存中的目标分句语音以及新生成的第二分句语音中的至少一种所生成。
在一些实施例中,拼接第二分句语音和目标分句语音,还可以包括:基于第二分句语音对应的分句序列在韵律音素序列中的切分顺序,以及目标分句语音对应的分句序列在韵律音素序列中的切分顺序,拼接第二分句语音和目标分句语音。
在该实施例中,基于第二分句语音对应的第二分句序列在韵律音素序列中的切分顺序,以及目标分句语音对应的第一分句序列在韵律音素序列中的切分顺序,从第一个分句序列开始,依次拼接相邻的分句序列对应的语音,直至拼接完成全部的分句序列对应语音,生成目标语音。
在生成目标语音后,还可以输出目标语音。
根据本申请实施例提供的文本转写方法,基于韵律特征将韵律音素序列切分为多个分句序列,分别将分句序列与缓存的目标分句序列进行比较,可以有效提高命中效率;只有在分句序列与目标序列不同的情况下,才进行语音合成,有效减轻服务器的算力压力,提高语音合成的效率。
下面结合图11-图13描述本申请实施例的文本的切分方法。
如图11所示,该文本的切分方法包括:步骤1110、步骤1120和步骤1130。
步骤1110、将目标文本转化为韵律音素序列。
在一些实施例中,该目标文本的大小超过目标阈值,对超过目标阈值的目标文本进行切分。可以理解,超过目标阈值的目标文本是语音合成时间超过预设范围的文本,故而可以通过对较长的文本进行切分,然后对切分后文本进行语音合成,可以提高语音合成的效率。
步骤1120、基于多个韵律标识符在韵律音素序列中确定第一切分位置;
在该步骤中,第一切分位置为用于第一次切分的切分点的位置。基于第一切分位置,可以将该韵律音素序列切分为前后两个子序列,且将位于第一切分位置之前的子序列确定为第一子韵律音素序列。
需要说明的是,基于第一切分位置所生成的第一子韵律音素序列,其对应的语音合成时长在目标时长内。
语音合成时长与语音合成系统的算力相关。
其中,第一子韵律音素序列对应的语音合成时长为将第一子韵律音素序列合成为语音所耗费的时间。
目标时长为一个较短的时长,目标时长的数值可以基于用户自定义,或者也可以采用系统默认值,例如可以将目标时长设置为0.2s或0.3s等。
在一些实施例中,多个韵律标识符可以包括:用于表征音节、用于表征韵律词、用于表征韵律短语、用于表征语调短语和用于表征句末信息的标识符中的至少一种;其中,用于表征句末信息的标识符的细粒度大于用于语调短语的标识符的细粒度,用于表征语调短语的标识符的细粒度大于用于表征韵律短语的标识符的细粒度,用于表征韵律短语的标识符的细粒度大于用于表征韵律词的标识符的细粒度,用于表征韵律词的标识符的细粒度大于用于表征音节的标识符的细粒度。
在该实施例中,韵律标识符为用于表征句末信息、语调短语、韵律短语、韵律词和音节中的至少一种的符号。
在实际执行过程中,可以采用特殊符号与数字组合的形式或特定字母组合来表示韵律标识符,例如分别用“#0”、“#1”、“#2”、“#3”以及“#4”来表示韵律标识符,不同的组合表征不同的细粒度级别。
如:#0代表着音节、#1代表韵律词、#2代表韵律短语、#3代表语调短语以及#4代表句末。
For example, the target text “目前小甲可以控制热水器开关,调节温度,定时开关,详细内容麻烦在甲居app上搜寻下” may be converted into the prosodic phoneme sequence: sil mu4 #0 qian2 #1 xiao2 #0 jia3 #3 ke2 #0 yi3 #1 kong4 #0 zhi4 #1 re4 #0 shui3 #0 qi4 #1 kai1 #0 guan1 #3 tiao2 #0 jie2 #1 wen1 #0 du4 #3 ding4 #0 shi2 #1 kai1 #0 guan1 #3 xiang2 #0 xi4 #1 nei4 #0 rong2 #2 ma2 #0 fan5 #1 zai4 #1 jia3 #0 ju1 #1 AE1 P #0 shang4 #1 sou1 #0 xun2 #0 xia4 #4 sil.
该韵律音素序列中包括#0、#1、#3以及#4等多个韵律标识符;其中,韵律标识符的细粒度由小到大依次为:#0<#1<#2<#3<#4。
获取每个韵律标识符在该韵律音素序列中的位置,根据韵律音素序列中在各位置之前的子韵律音素序列所对应的语音合成时长与目标时长之间的大小关系,从这多个位置中确定第一目标标识符所在的位置作为第一切分位置,从而确保基于该第一切分位置所生成的第一子韵律音素序列对应的语音合成时长在目标时长之内。
下面结合图12-图13,对该步骤的实现方式进行具体说明。
在一些实施例中,步骤1120可以包括:基于目标阈值范围,从多个韵律标识符中确定细粒度最大的韵律标识符;将细粒度最大的韵律标识符在目标阈值范围内韵律音素序列中的位置确定为第一切分位置。
在该实施例中,目标阈值范围为元素发音长度的最大值和最小值。
其中,元素发音长度为韵律音素序列中在目标位置之前的全部音素的发音长度之和。
目标位置可以为韵律音素序列中的任一韵律标识符所在的位置。
目标阈值范围可以用(n,n+m)表示,其中,n和m的取值可以基于用户自定义或者基于算法确定;其中,n和m均为正整数,且n和m的和不超过韵律音素序列中全部音素的发音长度之和。
例如可以将n设置为5,将m设置为5,也即将目标阈值范围确定为5-10个单位发音长度。
其中,一个汉字为一个单位发音长度,一个成词的英文单词是2个单位发音长度,一串不成词的英文音素是4个单位发音长度。
需要说明的是,在一些实施例中,如果有多个细粒度最大韵律标识符,则取位置号最小的那个,也就是取第一个所在韵律音素序列中的位置确定为第一切分位置。
继续以“目前小甲可以控制热水器开关,调节温度,定时开关,详细内容麻烦在甲居app上搜寻下”这一目标文本为例,进行说明。
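After the target text is converted into the prosodic phoneme sequence “sil mu4 #0 qian2 #1 xiao2 #0 jia3 #3 ke2 #0 yi3 #1 kong4 #0 zhi4 #1 re4 #0 shui3 #0 qi4 #1 kai1 #0 guan1 #3 tiao2 #0 jie2 #1 wen1 #0 du4 #3 ding4 #0 shi2 #1 kai1 #0 guan1 #3 xiang2 #0 xi4 #1 nei4 #0 rong2 #2 ma2 #0 fan5 #1 zai4 #1 jia3 #0 ju1 #1 AE1 P #0 shang4 #1 sou1 #0 xun2 #0 xia4 #4 sil”, the syllables between the 5th and the 10th unit of pronunciation length, counted from the first character of the sequence, are selected. They form the sequence “ke2 #0 yi3 #1 kong4 #0 zhi4 #1 re4 #0 shui3 #0”, which is taken as the candidate prosodic phoneme sequence within the target threshold range.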
顺次比较该待选韵律音素序列“ke2 #0 yi3 #1 kong4 #0 zhi4 #1 re4 #0 shui3 #0”中各个韵律标识符所对应的细粒度的大小,选择细粒度最大的韵律标识符所在的位置,确定为第一切分位置。对于上述待选韵律音素序列,#1为细粒度最大的韵律标识符,由于有多个#1,取第一个#1所在的位置确定为第一切分位置,也即,将音节“yi”之后的位置确定为第一切分位置,如下所示:
sil mu4 #0 qian2 #1 xiao2 #0 jia3 #3 ke2 #0 yi3 #1 | kong4 #0 zhi4 #1 re4 #0 shui3 #0 qi4 #1 kai1 #0 guan1 #3 tiao2 #0 jie2 #1 wen1 #0 du4 #3 ding4 #0 shi2 #1 kai1 #0 guan1 #3 xiang2 #0 xi4 #1 nei4 #0 rong2 #2 ma2 #0 fan5 #1 zai4 #1 jia3 #0 ju1 #1 AE1 P #0 shang4 #1 sou1 #0 xun2 #0 xia4 #4 sil。
其中,“|”表征第一切分位置。
在一些实施例中,基于目标阈值范围,从多个韵律标识符中确定细粒度最大的韵律标识符,可以包括:
获取韵律音素序列中目标子韵律音素序列全部音素的第一发音长度;
在第一发音长度在目标阈值范围内的情况下,将第一发音长度对应的目标位置确定为候选切分点位置,生成多个候选切分点位置;
从多个候选切分点位置所对应的韵律标识符中确定细粒度最大的韵律标识符。
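In this embodiment, the target prosodic phoneme subsequence is the entire portion of the prosodic phoneme sequence that precedes the target position, where the target position is the position of any prosodic identifier in the prosodic phoneme sequence.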
第一发音长度为该目标子韵律音素序列中各音素对应的发音长度之和。
如图13所示,在实际执行过程中,可以通过第一切分位置查找模块来获取第一发音长度。
例如,将韵律音素序列转换为带韵律的音素列表,并将该带韵律的音素列表输入至第一切分位置查找模块。
第一切分位置查找模块将当前发音长度的初始值和列表索引的初始值均设置为0,初始化空的韵律位置字典dict,并基于如下公式开始循环:
元素发音长度=get_voice_length(列表索引);
列表索引=列表索引+1;
第一发音长度=第一发音长度+元素发音长度;
其中,元素发音长度为当前目标位置处的音节所对应的发音长度,列表索引用于表征目标位置。
函数get_voice_length(列表索引)计算索引所指定的元素的发音长度,计算方法为:1个汉字为一个单位发音长度,1个成词的英文单词(在词典中的英文字符串)为2个单位发音长度,1个不成词(不在词典中)的英文字符串为4个单位发音长度。
通过以上方式,可以分别获取列表索引值在(1,N)范围内的N个第一发音长度,其中,N为列表中的音节的个数。
对于每一次生成的第一发音长度,均与目标阈值进行比较。
在第一发音长度大于目标阈值范围中的阈值下限且小于目标阈值范围中的阈值上限的情况下,将当前韵律标识符和位置索引记录至字典dict中。
在一些实施例中,在第一发音长度在目标阈值范围内的情况下,将第一发音长度对应的目标位置确定为候选切分点位置,包括:
在第一发音长度在目标阈值范围内的情况下,确定第一发音长度对应的目标位置处的韵律标识符为第一次出现,将第一发音长度对应的目标位置确定为候选切分点位置。
继续以上述例子进行说明,在第一发音长度大于目标阈值范围中的阈值下限且小于目标阈值范围中的阈值上限的情况下,判断当前列表索引处对应的韵律标识符是否为第一次出现的标识符,在确认当前列表索引处对应的韵律标识符为第一次出现的标识符的情况下,则将当前韵律和列表索引作为键值对,记录至字典dict中。
在另一些实施例中,在确认当前列表索引处对应的韵律标识符为之前已经出现过的标识符的情况下,则跳过当前列表索引,进入下一轮循环。
For example, for the prosodic phoneme sequence “sil mu4 #0 qian2 #1 xiao2 #0 jia3 #3 ke2 #0 yi3 #1 kong4 #0 zhi4 #1 re4 #0 shui3 #0 qi4 #1 kai1 #0 guan1 #3 tiao2 #0 jie2 #1 wen1 #0 du4 #3 ding4 #0 shi2 #1 kai1 #0 guan1 #3 xiang2 #0 xi4 #1 nei4 #0 rong2 #2 ma2 #0 fan5 #1 zai4 #1 jia3 #0 ju1 #1 AE1 P #0 shang4 #1 sou1 #0 xun2 #0 xia4 #4 sil”, the first pronunciation length is computed for each list index value in turn, starting from list index 1;
在计算到“ke2 #0”位置所对应的第一发音长度进入目标阈值范围后,记录该位置处的列表索引数值以及该位置处的韵律标识符“#0”;
然后将列表索引进行加一,计算“yi3 #1”位置所对应的第一发音长度,在确定“yi3 #1”位置所对应的第一发音长度在目标阈值范围内时,则记录该位置处的列表索引数值以及该位置处的韵律标识符“#1”;
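The list index is then incremented by one and the first pronunciation length at position “kong4 #0” is computed. When that length is found to be within the target threshold range, the prosodic identifier “#0” at that position is found not to be a first occurrence, so this list index is skipped and the list index is incremented by one. This process repeats until the first pronunciation length for the current list index exceeds the upper limit of the target threshold range, at which point the loop ends.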
所记录的全部韵律标识符所对应的列表索引即可作为用于确定第一切分位置的候选切分点位置。
然后从记录的全部韵律标识符中筛选出细粒度最大的韵律标识符,如从以上所记录的“#0”和“#1”中确定细粒度最大的韵律标识符“#1”,并将该韵律标识符“#1”对应的目标位置(即位置索引)确定为第一切分位置。
在另一些实施例中,在第一发音长度小于目标阈值范围中的阈值下限的情况下,则对当前列表索引进行加一处理,进入下一个循环。
在又一些实施例中,在列表索引数值超过列表元素个数的情况下,则结束循环。
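The search loop described above can be sketched as follows. Several details are assumptions: the list is modeled as (syllable, prosodic identifier) pairs with 0-based indices, the threshold bounds are treated as inclusive so that the result agrees with the worked example, pinyin is recognized by a simple pattern, and `ENGLISH_WORDS` is a hypothetical stand-in for the dictionary.

```python
import re

LEVEL = {"#0": 0, "#1": 1, "#2": 2, "#3": 3, "#4": 4}
PINYIN = re.compile(r"^[a-z]+[1-5]$")
ENGLISH_WORDS = {"app"}  # hypothetical dictionary of known English words


def get_voice_length(element):
    """1 unit per Chinese syllable, 2 per dictionary word, 4 otherwise."""
    if PINYIN.match(element):
        return 1
    return 2 if element.lower() in ENGLISH_WORDS else 4


def find_first_cut(items, lower=5, upper=10):
    """Return the index of the first segmentation position, or None."""
    seen = {}      # prosodic identifier -> index of its first occurrence in range
    length = 0     # running first pronunciation length
    for index, (syllable, marker) in enumerate(items):
        length += get_voice_length(syllable)
        if length > upper:
            break
        if length >= lower and marker not in seen:
            seen[marker] = index
    if not seen:
        return None
    best = max(seen, key=lambda m: LEVEL[m])   # largest granularity wins
    return seen[best]


items = [("mu4", "#0"), ("qian2", "#1"), ("xiao2", "#0"), ("jia3", "#3"),
         ("ke2", "#0"), ("yi3", "#1"), ("kong4", "#0"), ("zhi4", "#1"),
         ("re4", "#0"), ("shui3", "#0"), ("qi4", "#1"), ("kai1", "#0"),
         ("guan1", "#3")]
cut = find_first_cut(items)
```

Recording only the first occurrence of each identifier also resolves ties: when several identifiers of the largest granularity fall in the range, the earliest one is selected.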
在确定第一切分位置后,从该第一切分位置处切分韵律音素序列,即可生成第一子韵律音素序列。
申请人在研发过程中发现,由于语音的时间特性,通常系统转化的时间和输入文本的长度成正比,越长的句子合成所需的时间就越长,尤其对于一些超长的文本输入,还可能会造成系统容量超出限制。为解决以上问题,一种简单直接的想法是利用计算机系统的能力并行合成,而在切分过程中所面临的第一个问题就是需要如何切分并行任务。相关技术中主要是基于标点符号来对文本进行切分,但该切分方法既无法解决无标点符号的文本的切分,也无法解决切分后两端不均衡的问题。
而在本申请中,通过将目标文本转化为韵律音素序列,并基于韵律音素序列中的韵律特性来确定第一切分位置,使得基于第一切分位置所得到的第一子韵律音素序列,其对应的语音合成时长能够在合理的时长范围内,从而缩短合成系统的首句响应时间,缩短延迟时间;除此之外,基于该方式所确定的第一切分位置为在停顿时长较长的位置,使得切分得到的第一子韵律音素序列的停顿和韵律更加自然,从而使得后续输出的基于第一子韵律音素序列合成的语音更加自然且流畅。
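Step 1130: segment the prosodic phoneme sequence based on the first segmentation position and the second segmentation position to generate a first prosodic phoneme subsequence and at least one second prosodic phoneme subsequence, where the second prosodic phoneme subsequence is the part of the prosodic phoneme sequence located after the first segmentation position.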
在该步骤中,第一子韵律音素序列为基于第一切分位置所切分生成的,第一子韵律音素序列为韵律音素序列中位于第一切分位置之前的韵律音素序列;
第二子韵律音素序列为位于第一切分位置之后的韵律音素序列。
在一些实施例中,在步骤1120之后,且在步骤1130之前,该方法还可以包括:从韵律音素序列中位于第一切分位置之后的韵律标识符在韵律音素序列的位置中,确定第二切分位置;
步骤1130可以包括:基于第一切分位置和第二切分位置对韵律音素序列进行切分,生成第一子韵律音素序列和至少两个第二子韵律音素序列,至少两个第二子韵律音素序列为韵律音素序列中位于第一切分位置之后的韵律音素序列,相邻的第二子韵律音素序列基于第二切分位置确定。
在该实施例中,第二切分位置为除第一次切分以外的其他所有次切分所对应的切分点的位置。
在确定第一切分位置后,从韵律音素序列中查找第一切分位置之后的至少部分韵律标识符作为用于确定第二切分位置的候选集,并将候选集中的韵律标识符的位置确定为第二切分位置。
如图12所示,在实际执行过程中,可以采用第二切分位置查找模块来查找第二切分位置。
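For example, the first segmentation position and the prosodic phoneme sequence are input into the second-segmentation-position search module, which outputs a list of segmentation points containing the first segmentation position and the second segmentation positions.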
在一些实施例中,从韵律音素序列中位于第一切分位置之后的韵律标识符在韵律音素序列的位置中,确定第二切分位置可以包括:将韵律音素序列中位于第一切分位置之后的,用于表征语调短语的标识符所对应的位置确定为第二切分位置。
In this embodiment, the description continues with the prosodic phoneme sequence “sil mu4 #0 qian2 #1 xiao2 #0 jia3 #3 ke2 #0 yi3 #1 kong4 #0 zhi4 #1 re4 #0 shui3 #0 qi4 #1 kai1 #0 guan1 #3 tiao2 #0 jie2 #1 wen1 #0 du4 #3 ding4 #0 shi2 #1 kai1 #0 guan1 #3 xiang2 #0 xi4 #1 nei4 #0 rong2 #2 ma2 #0 fan5 #1 zai4 #1 jia3 #0 ju1 #1 AE1 P #0 shang4 #1 sou1 #0 xun2 #0 xia4 #4 sil” as the example.
在确定第一切分位置后,从第一切分位置所在的位置开始,也即从“yi3 #1”之后开始,依次查找后续位置处,韵律标识符为“#3”的韵律标识符所在的位置,并依次将这些位置确定为第二切分位置,从而得到如下切分序列:
sil mu4 #0 qian2 #1 xiao2 #0 jia3 #3 ke2 #0 yi3 #1 | kong4 #0 zhi4 #1 re4 #0 shui3 #0 qi4 #1 kai1 #0 guan1 #3 | tiao2 #0 jie2 #1 wen1 #0 du4 #3 | ding4 #0 shi2 #1 kai1 #0 guan1 #3 | xiang2 #0 xi4 #1 nei4 #0 rong2 #2 ma2 #0 fan5 #1 zai4 #1 jia3 #0 ju1 #1 AE1 P #0 shang4 #1 sou1 #0 xun2 #0 xia4 #4 sil;
其中,第一个“|”为第一切分位置,后续“|”均为第二切分位置。
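Finding the second segmentation positions and cutting the sequence can be sketched as follows. The function name and the flat token-list representation are assumptions; `first_cut` is the index of the identifier token that closes the first subsequence, and a shortened version of the example sequence is used.

```python
def split_sequence(tokens, first_cut, marker="#3"):
    """Cut at the first segmentation position and at every later `marker`."""
    cuts = [first_cut] + [i for i in range(first_cut + 1, len(tokens))
                          if tokens[i] == marker]
    parts, start = [], 0
    for c in cuts:
        parts.append(tokens[start:c + 1])   # identifier stays with the left part
        start = c + 1
    if start < len(tokens):                 # remainder after the last cut
        parts.append(tokens[start:])
    return parts


tokens = ("sil mu4 #0 qian2 #1 xiao2 #0 jia3 #3 ke2 #0 yi3 #1 kong4 #0 zhi4 #1 "
          "re4 #0 shui3 #0 qi4 #1 kai1 #0 guan1 #3 tiao2 #0 jie2 #1 wen1 #0 "
          "du4 #3 xia4 #4 sil").split()
parts = split_sequence(tokens, first_cut=12)
```

Identifiers before the first segmentation position, such as the “#3” after “jia3”, are ignored. When no `marker` follows the first cut, everything after it is returned as a single second subsequence.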
当然,在其他实施例中,还可以将其他细粒度等级的标识符所在的位置确定为第二切分位置,本申请不做限定。
第二子韵律音素序列为位于第一切分位置之后的整个韵律音素序列进行切分,所生成的韵律音素序列。
在第二切分位置为至少一个的情况下,第二子韵律音素序列为至少两个。
在该实施例中,基于第一切分位置以及韵律特征确定第二切分位置,以提高后续切分生成的第二子韵律音素序列的韵律自然的程度以及切分后两端的均衡性,避免在一个整词中间切断的情况,有助于提高后续语音合成的效率及质量。
当然,在一些实施例中,在没有第二切分位置的情况下,则第二子韵律音素序列即为韵律音素序列中位于第一切分位置之后的整个韵律音素序列。例如,当第二切分位置为#3对应的位置,但是在第二子韵律音素序列中查找不到#3时,此时可以理解为不存在第二切分位置。
例如,经步骤1120确定的第一切分位置和第二切分位置如下:
sil mu4 #0 qian2 #1 xiao2 #0 jia3 #3 ke2 #0 yi3 #1 | kong4 #0 zhi4 #1 re4 #0 shui3 #0 qi4 #1 kai1 #0 guan1 #3 | tiao2 #0 jie2 #1 wen1 #0 du4 #3 | ding4 #0 shi2 #1 kai1 #0 guan1 #3 | xiang2 #0 xi4 #1 nei4 #0 rong2 #2 ma2 #0 fan5 #1 zai4 #1 jia3 #0 ju1 #1 AE1 P #0 shang4 #1 sou1 #0 xun2 #0 xia4 #4 sil;
则在该步骤中,从第一个“|”所在的位置开始,依次对该韵律音素序列进行切分,从而生成第一子韵律音素序列“sil mu4 #0 qian2 #1 xiao2 #0 jia3 #3 ke2 #0 yi3 #1”,以及第二子韵律音素序列:“kong4 #0 zhi4 #1 re4 #0 shui3 #0 qi4 #1 kai1 #0 guan1 #3”、“tiao2 #0 jie2 #1 wen1 #0 du4 #3”以及“ding4 #0 shi2 #1 kai1 #0 guan1 #3”等。
其中,该第一子韵律音素序列“sil mu4 #0 qian2 #1 xiao2 #0 jia3 #3 ke2 #0 yi3 #1”的语音合成时长在0.2s左右。
根据本申请实施例提供的文本的切分方法,通过第一子韵律音素序列所对应的语音合成时长来确定用于切分得到第一子韵律音素序列的第一切分位置,以使第一子韵律音素序列对应的语音合成时长能够在合理的时长范围内,从而缩短合成系统的首句响应时间。
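In some embodiments, step 1110 may include: obtaining the sentence-end information, intonation phrases, prosodic phrases, prosodic words, and syllables of the target text; converting the target text into a phoneme sequence; generating multiple prosodic identifiers based on at least two of the sentence-end information, intonation phrases, prosodic phrases, prosodic words, and syllables; and marking the phoneme sequence with the prosodic identifiers to generate the prosodic phoneme sequence.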
根据本申请的一些实施例,在步骤1130之后,该方法还可以包括:对第一子韵律音素序列进行语音合成,生成第一语音;输出第一语音,并对第二子韵律音素序列进行语音合成,生成第二语音。
在该实施例中,第一子韵律音素序列为目标文本中的第一个切分点之前的序列,也即目标文本所合成的语音中的首句话所对应的序列。
在生成第一子韵律音素序列后,即可对第一子韵律音素序列进行语音合成,生成第一语音。
然后输出该第一语音以供客户端进行播放,在客户端播放该第一语音的同时,系统可以合成后续第二子韵律音素序列,以生成第二语音。
例如,在得到第一子韵律音素序列“sil mu4 #0 qian2 #1 xiao2 #0 jia3 #3 ke2 #0 yi3 #1”后,即可基于该第一子韵律音素序列合成第一语音“目前小甲可以”,并输出该第一语音;在客户端播放该第一语音的同时,系统对第二子韵律音素序列:“kong4 #0 zhi4 #1 re4 #0 shui3 #0 qi4 #1 kai1 #0 guan1 #3”进行合成。
根据本申请实施例提供的文本的切分方法,通过优先合成第一子韵律音素序列对应的第一语音,在输出第一语音的同时对后续第二子韵律音素序列进行语音合成,可以加快系统在接收到网络语音合成服务请求后的反馈速度,缩短用户的等待时间。
以上所描述的装置实施例仅仅是示意性的,其中作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部模块来实现本实施例方案的目的。本领域普通技术人员在不付出创造性的劳动的情况下,即可以理解并实施。
通过以上的实施方式的描述,本领域的技术人员可以清楚地了解到各实施方式可借助软件加必需的通用硬件平台的方式来实现,当然也可以通过硬件。基于这样的理解,上述技术方案本质上或者说对现有技术做出贡献的部分可以以软件产品的形式体现出来,该计算机软件产品可以存储在计算机可读存储介质中,如ROM/RAM、磁碟、光盘等,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)执行各个实施例或者实施例的某些部分的方法。
最后应说明的是:以上实施例仅用以说明本申请的技术方案,而非对其限制;尽管参照前述实施例对本申请进行了详细的说明,本领域的普通技术人员应当理解:其依然可以对前述各实施例所记载的技术方案进行修改,或者对其中部分技术特征进行等同替换;而这些修改或者替换,并不使相应技术方案的本质脱离本申请各实施例技术方案的精神和范围。
以上实施方式仅用于说明本申请,而非对本申请的限制。尽管参照实施例对本申请进行了详细说明,本领域的普通技术人员应当理解,对本申请的技术方案进行各种组合、修改或者等同替换,都不脱离本申请技术方案的精神和范围,均应涵盖在本申请的权利要求范围中。

Claims (35)

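  1. A speech synthesis method, comprising:
    splitting a prosodic phoneme sequence of a target text to generate a plurality of clause sequences, the prosodic phoneme sequence comprising a plurality of phonemes corresponding to the target text and prosodic identifiers located between adjacent ones of the phonemes, each of the clause sequences comprising at least one of the phonemes;
    performing speech synthesis on a first sub-prosodic-phoneme sequence among the plurality of clause sequences to obtain first speech information; and
    outputting the first speech information and performing speech synthesis on a second sub-prosodic-phoneme sequence among the plurality of clause sequences to generate second speech information, the second sub-prosodic-phoneme sequence being at least one clause sequence located after the first sub-prosodic-phoneme sequence in the prosodic phoneme sequence.
  2. The speech synthesis method according to claim 1, wherein, after the generating of the second speech information, the method further comprises:
    merging the second speech information and the first speech information to generate third speech information.
  3. The speech synthesis method according to claim 2, wherein, before the splitting of the prosodic phoneme sequence of the target text, the method further comprises:
    generating, based on the prosodic phoneme sequence, a target file size of the third speech information; and
    the generating of the second speech information comprises: generating the second speech information based on the target file size.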
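  4. The speech synthesis method according to claim 3, wherein the generating, based on the prosodic phoneme sequence, of the target file size of the third speech information comprises:
    generating, based on the prosodic phoneme sequence, a predicted file size of the third speech information; and
    correcting the predicted file size based on a target residual value to generate the target file size, the target residual value being determined based on a sample file size and a predicted size of a sample audio file corresponding to a sample text, the sample file size being the actual size of the sample audio file corresponding to the sample text.
  5. The speech synthesis method according to claim 4, further comprising obtaining the size of the sample audio file, wherein obtaining the size of the sample audio file comprises:
    performing feature extraction on a target text to generate a target prosodic feature and a target phoneme feature;
    obtaining, based on the target prosodic feature and the target phoneme feature of the target text, a target file size of a target audio file, the target audio file being generated by performing speech synthesis on the target text.
  6. The speech synthesis method according to claim 5, wherein the obtaining, based on the target prosodic feature and the target phoneme feature, of the target file size of the target audio file comprises:
    obtaining, based on the target prosodic feature and the target phoneme feature, a first predicted file size of the target audio file;
    summing the first predicted file size and a target residual value to generate the target file size, the target residual value being determined based on a sample file size and a predicted size of a sample audio file corresponding to a sample text, the sample file size being the actual size of the sample audio file corresponding to the sample text.
  7. The speech synthesis method according to claim 6, wherein the target residual value is determined by the following steps:
    obtaining a sample text, a sample audio file corresponding to the sample text, and a sample file size corresponding to the sample audio file, the sample audio file being generated by performing speech synthesis on the sample text;
    performing feature extraction on the sample text to generate a sample prosodic feature and a sample phoneme feature;
    obtaining, based on the sample prosodic feature and the sample phoneme feature, a second predicted file size of the sample audio file;
    determining the absolute value of the difference between the second predicted file size and the sample file size as the target residual value.
  8. The speech synthesis method according to claim 6, wherein the obtaining, based on the target prosodic feature and the target phoneme feature, of the first predicted file size of the target audio file comprises:
    inputting the target prosodic feature and the target phoneme feature into a file size prediction model, and obtaining the first predicted file size output by the file size prediction model; wherein
    the file size prediction model is obtained by training with sample prosodic features and sample phoneme features as samples and with the sample file sizes corresponding to the sample prosodic features and the sample phoneme features as sample labels.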
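  9. The speech synthesis method according to claim 5, wherein, after the obtaining of the target file size of the target audio file, the method further comprises:
    splitting the target text based on the target prosodic feature and the target phoneme feature to generate a plurality of clause sequences;
    performing speech synthesis on the clause sequences to generate clause speech;
    outputting the clause speech and the target file size, and splicing the clause speech to generate the target audio file.
  10. The speech synthesis method according to any one of claims 5-9, wherein the performing of feature extraction on the target text to generate the target prosodic feature and the target phoneme feature comprises:
    converting the target text into a prosodic phoneme sequence, the prosodic phoneme sequence comprising a plurality of phonemes corresponding to the target text and prosodic identifiers located between adjacent ones of the phonemes;
    performing feature extraction on the prosodic phoneme sequence to generate the target prosodic feature and the target phoneme feature.
  11. The speech synthesis method according to claim 2, wherein the merging of the second speech information and the first speech information comprises:
    merging the second speech information and the first speech information based on the phoneme duration corresponding to the second speech information and the phoneme duration corresponding to the first speech information.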
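  12. The speech synthesis method according to claim 11, wherein the method further comprises:
    performing speech synthesis on the clause sequences respectively to generate a plurality of pieces of first clause speech information, the first clause speech information comprising a first duration corresponding to each of the phonemes and each of the prosodic identifiers;
    splicing the plurality of pieces of first clause speech information, based on the first duration and on the splitting order, in the prosodic phoneme sequence, of the clause sequences corresponding to the first clause speech information, to generate a target speech.
  13. The speech synthesis method according to claim 12, wherein the splicing of the plurality of pieces of first clause speech information, based on the first duration and on the splitting order, in the prosodic phoneme sequence, of the clause sequences corresponding to the first clause speech information, to generate the target speech comprises:
    cutting off, based on the first duration corresponding to a target phoneme among the plurality of phonemes, the speech corresponding to the target phoneme from the first clause speech information to generate second clause speech information;
    splicing the second clause speech information, based on the splitting order, in the prosodic phoneme sequence, of the clause sequences corresponding to the second clause speech information, to generate the target speech.
  14. The speech synthesis method according to claim 13, wherein the target phoneme comprises at least one of a redundant sentence-initial phoneme and a redundant sentence-final phoneme, and the cutting off, based on the first duration corresponding to the target phoneme among the plurality of phonemes, of the speech corresponding to the target phoneme from the first clause speech information comprises:
    upon determining that the clause sequence corresponding to the first clause speech information is not the first clause sequence in the target text, cutting off from the first clause speech information both the speech corresponding to the redundant sentence-initial phoneme and the speech corresponding to the redundant sentence-final phoneme;
    upon determining that the clause sequence corresponding to the first clause speech information is the first clause sequence in the target text, cutting off from the first clause speech information the speech corresponding to the redundant sentence-final phoneme.
  15. The speech synthesis method according to claim 11, wherein, after the generating of the plurality of pieces of first clause speech information and before the splicing of the plurality of pieces of first clause speech information based on the first duration and on the splitting order, in the prosodic phoneme sequence, of the clause sequences corresponding to the first clause speech information, the method further comprises:
    outputting the first clause speech information.
  16. The speech synthesis method according to any one of claims 12-15, wherein the performing of speech synthesis on the clause sequences respectively comprises:
    inputting the clause sequences into a target speech synthesis model, and obtaining the first clause speech information output by the target speech synthesis model, wherein
    the target speech synthesis model is obtained by training with sample prosodic phoneme sequences as samples and with the sample clause speech corresponding to the sample prosodic phoneme sequences as sample labels.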
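  17. The speech synthesis method according to any one of claims 1-16, wherein, before the splitting of the prosodic phoneme sequence of the target text, the method further comprises:
    obtaining a text to be synthesized; and
    upon determining that the size of the text to be synthesized exceeds a target threshold, splitting the text to be synthesized to generate the target text, the size of the target text not exceeding the target threshold.
  18. The speech synthesis method according to any one of claims 1-16, wherein the splitting of the prosodic phoneme sequence of the target text to generate the plurality of clause sequences comprises:
    converting the target text into the prosodic phoneme sequence, the prosodic phoneme sequence comprising a plurality of phonemes corresponding to the target text and prosodic identifiers located between adjacent ones of the phonemes; and
    splitting the prosodic phoneme sequence based on at least some of the plurality of prosodic identifiers to generate the plurality of clause sequences.
  19. The speech synthesis method according to claim 18, wherein the splitting of the prosodic phoneme sequence based on at least some of the plurality of prosodic identifiers to generate the plurality of clause sequences comprises:
    determining a first split position in the prosodic phoneme sequence based on the plurality of prosodic identifiers;
    splitting the prosodic phoneme sequence based on the first split position to generate the plurality of clause sequences, the clause sequences comprising a first sub-prosodic-phoneme sequence and a second sub-prosodic-phoneme sequence, the first sub-prosodic-phoneme sequence being the part of the prosodic phoneme sequence located before the first split position, the second sub-prosodic-phoneme sequence being the part of the prosodic phoneme sequence located after the first split position, and the speech synthesis duration corresponding to the first sub-prosodic-phoneme sequence being within a target duration.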
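  20. The speech synthesis method according to claim 19, wherein the determining of the first split position in the prosodic phoneme sequence based on the plurality of prosodic identifiers comprises:
    determining, based on a target threshold range, the prosodic identifier with the largest granularity from among the plurality of prosodic identifiers;
    determining the position, in the prosodic phoneme sequence, of the prosodic identifier with the largest granularity as the first split position.
  21. The speech synthesis method according to claim 20, wherein the determining, based on the target threshold range, of the prosodic identifier with the largest granularity from among the plurality of prosodic identifiers comprises:
    obtaining a first pronunciation length of all phonemes of a target sub-prosodic-phoneme sequence in the prosodic phoneme sequence, the target sub-prosodic-phoneme sequence being the entire part of the prosodic phoneme sequence located before a target position;
    upon determining that the first pronunciation length is within the target threshold range and that the prosodic identifier at the target position corresponding to the first pronunciation length appears for the first time, determining the target position corresponding to the first pronunciation length as a candidate split point position, thereby generating a plurality of candidate split point positions;
    determining the prosodic identifier with the largest granularity from among the prosodic identifiers corresponding to the plurality of candidate split point positions.
  22. The speech synthesis method according to claim 18, wherein the splitting of the prosodic phoneme sequence based on at least some of the plurality of prosodic identifiers to generate the plurality of clause sequences comprises:
    determining a first split position in the prosodic phoneme sequence based on the plurality of prosodic identifiers;
    determining a second split position from among the positions, in the prosodic phoneme sequence, of the prosodic identifiers located after the first split position; and
    splitting the prosodic phoneme sequence based on the first split position and the second split position to generate the first sub-prosodic-phoneme sequence and at least two second sub-prosodic-phoneme sequences, the first sub-prosodic-phoneme sequence being the part of the prosodic phoneme sequence located before the first split position, the at least two second sub-prosodic-phoneme sequences being the parts of the prosodic phoneme sequence located after the first split position, adjacent ones of the second sub-prosodic-phoneme sequences being determined based on the second split position, and the speech synthesis duration corresponding to the first sub-prosodic-phoneme sequence being within a target duration.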
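  23. The speech synthesis method according to claim 18, wherein the converting of the target text into the prosodic phoneme sequence comprises:
    obtaining the syllables, prosodic words, prosodic phrases, intonation phrases and sentence-end information of the target text; and
    marking the target text based on at least two of the syllables, the prosodic words, the prosodic phrases, the intonation phrases and the sentence-end information to generate the prosodic phoneme sequence.
  24. The speech synthesis method according to claim 23, wherein the marking of the target text based on at least two of the syllables, the prosodic words, the prosodic phrases, the intonation phrases and the sentence-end information to generate the prosodic phoneme sequence comprises:
    converting the target text into a phoneme sequence;
    generating the plurality of prosodic identifiers based on at least two of the syllables, the prosodic words, the prosodic phrases, the intonation phrases and the sentence-end information; and
    marking the phoneme sequence with the plurality of prosodic identifiers to generate the prosodic phoneme sequence.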
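  25. The speech synthesis method according to any one of claims 1-24, wherein the method further comprises:
    splitting the prosodic phoneme sequence of the target text to generate a plurality of clause sequences, the prosodic phoneme sequence comprising a plurality of phonemes corresponding to the target text and prosodic identifiers located between adjacent ones of the phonemes, each of the clause sequences comprising at least one of the phonemes;
    upon determining that any clause sequence to be matched among the plurality of clause sequences matches a cached target clause sequence, obtaining from the cache the target clause speech corresponding to the target clause sequence, and determining the speech corresponding to the clause sequence to be matched to be the target clause speech.
  26. The speech synthesis method according to claim 25, wherein the method further comprises:
    upon determining that any clause sequence to be matched among the plurality of clause sequences does not match the target clause sequence, performing speech synthesis on the clause sequence to be matched to generate second clause speech.
  27. The speech synthesis method according to claim 26, wherein, after the generating of the second clause speech, the method further comprises:
    splitting the second clause speech based on the prosodic identifiers to generate a plurality of pieces of sub second clause speech;
    caching the plurality of pieces of sub second clause speech and the sub-clause sequences corresponding to the sub second clause speech.
  28. The speech synthesis method according to claim 26, wherein, after the generating of the second clause speech, the method further comprises:
    splicing the target clause speech and the second clause speech, based on the splitting order, in the prosodic phoneme sequence, of the clause sequence corresponding to the target clause speech and on the splitting order, in the prosodic phoneme sequence, of the clause sequence corresponding to the second clause speech, to generate the target speech corresponding to the target text.
  29. The speech synthesis method according to claim 25, wherein the splitting of the prosodic phoneme sequence based on at least some of the plurality of prosodic identifiers to generate the plurality of clause sequences comprises:
    splitting the prosodic phoneme sequence based on a target identifier among the plurality of prosodic identifiers to generate a plurality of candidate sequences, the speech synthesis duration corresponding to the candidate sequence located before the first split point being within a target duration;
    combining a target candidate sequence among the plurality of candidate sequences with an adjacent candidate sequence to generate the plurality of clause sequences, and determining the granularity corresponding to each clause sequence;
    sorting the plurality of clause sequences in descending order based on the granularity corresponding to each clause sequence.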
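  30. The speech synthesis method according to any one of claims 1-29, wherein the prosodic identifiers comprise at least one of: an identifier representing a syllable, an identifier representing a prosodic word, an identifier representing a prosodic phrase, an identifier representing an intonation phrase, and an identifier representing sentence-end information;
    and the granularity of the identifier representing sentence-end information is greater than that of the identifier representing an intonation phrase, the granularity of the identifier representing an intonation phrase is greater than that of the identifier representing a prosodic phrase, the granularity of the identifier representing a prosodic phrase is greater than that of the identifier representing a prosodic word, and the granularity of the identifier representing a prosodic word is greater than that of the identifier representing a syllable.
  31. The speech synthesis method according to any one of claims 1-30, wherein the target prosodic feature and the target phoneme feature comprise at least one of: the length of the prosodic phoneme sequence, the number of Chinese pinyin items in the prosodic phoneme sequence, the number of pause symbols in the prosodic phoneme sequence, the number of English phonemes in the prosodic phoneme sequence, the number of Chinese phonemes in the prosodic phoneme sequence, the number of Chinese initials in the prosodic phoneme sequence, the number of Chinese finals in the prosodic phoneme sequence, and the English phonemes of each category in the prosodic phoneme sequence.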
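  32. A speech synthesis apparatus, comprising:
    a first processing module configured to split a prosodic phoneme sequence of a target text to generate a plurality of clause sequences, the prosodic phoneme sequence comprising a plurality of phonemes corresponding to the target text and prosodic identifiers located between adjacent ones of the phonemes, each of the clause sequences comprising at least one of the phonemes;
    a second processing module configured to perform speech synthesis on a first sub-prosodic-phoneme sequence among the plurality of clause sequences to obtain first speech information; and
    a third processing module configured to output the first speech information and perform speech synthesis on a second sub-prosodic-phoneme sequence among the plurality of clause sequences to generate second speech information, the second sub-prosodic-phoneme sequence being at least one clause sequence located after the first sub-prosodic-phoneme sequence in the prosodic phoneme sequence.
  33. An electronic device, comprising:
    a processor; and
    a memory storing a computer program executable on the processor, wherein the program, when executed by the processor, causes the electronic device to perform the speech synthesis method according to any one of claims 1 to 31.
  34. A non-transitory computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, performs the speech synthesis method according to any one of claims 1 to 31.
  35. A computer program product, comprising a computer program, wherein the computer program, when executed by a processor, performs the speech synthesis method according to any one of claims 1 to 31.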
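
The residual correction recited in claims 6 and 7 amounts to simple arithmetic, which the following sketch illustrates. It assumes a single sample text, as the claims are worded; the function names and the byte counts are hypothetical, and the predicted sizes would in practice come from the file size prediction model of claim 8.

```python
def target_residual(predicted_sample_size: int, actual_sample_size: int) -> int:
    """Claim 7: absolute difference between predicted and actual sample size."""
    return abs(predicted_sample_size - actual_sample_size)

def corrected_file_size(first_predicted_size: int, residual: int) -> int:
    """Claim 6: sum of the first predicted file size and the target residual."""
    return first_predicted_size + residual

# Hypothetical numbers, in bytes.
residual = target_residual(predicted_sample_size=31_200, actual_sample_size=32_000)
print(corrected_file_size(48_500, residual))  # 48_500 + 800
```

Because the residual is an absolute value, the correction can only increase the predicted size.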
PCT/CN2022/118072 2022-03-31 2022-09-09 语音合成方法和装置 Ceased WO2023184874A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
EP22933875.1A EP4503017A4 (en) 2022-03-31 2022-09-09 Speech synthesis method and apparatus

Applications Claiming Priority (10)

Application Number Priority Date Filing Date Title
CN202210344456.4A CN114678002A (zh) 2022-03-31 2022-03-31 文本的切分方法和文本的切分装置
CN202210346097.6A CN114708848B (zh) 2022-03-31 2022-03-31 音视频文件大小的获取方法和装置
CN202210346114.6A CN114822490A (zh) 2022-03-31 2022-03-31 语音拼接方法和语音拼接装置
CN202210344448.XA CN114678001A (zh) 2022-03-31 2022-03-31 语音合成方法和语音合成装置
CN202210346094.2A CN114822489A (zh) 2022-03-31 2022-03-31 文本转写方法和文本转写装置
CN202210344448.X 2022-03-31
CN202210346097.6 2022-03-31
CN202210346094.2 2022-03-31
CN202210346114.6 2022-03-31
CN202210344456.4 2022-03-31

Publications (1)

Publication Number Publication Date
WO2023184874A1 (zh) 2023-10-05

Family

ID=88198941

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/118072 Ceased WO2023184874A1 (zh) 2022-03-31 2022-09-09 语音合成方法和装置

Country Status (2)

Country Link
EP (1) EP4503017A4 (zh)
WO (1) WO2023184874A1 (zh)

Citations (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2002304186A (ja) * 2001-04-05 2002-10-18 Sharp Corp 音声合成装置、音声合成方法および音声合成プログラム
CN108073572A (zh) * 2016-11-16 2018-05-25 北京搜狗科技发展有限公司 信息处理方法及其装置、同声翻译系统
CN110797006A (zh) * 2020-01-06 2020-02-14 北京海天瑞声科技股份有限公司 端到端的语音合成方法、装置及存储介质
CN111226275A (zh) * 2019-12-31 2020-06-02 深圳市优必选科技股份有限公司 基于韵律特征预测的语音合成方法、装置、终端及介质
CN111524500A (zh) * 2020-04-17 2020-08-11 浙江同花顺智能科技有限公司 语音合成方法、装置、设备和存储介质
CN112037758A (zh) * 2020-06-19 2020-12-04 四川长虹电器股份有限公司 一种语音合成方法及装置
CN112771607A (zh) * 2018-11-14 2021-05-07 三星电子株式会社 电子设备及其控制方法
CN112885328A (zh) * 2021-01-22 2021-06-01 华为技术有限公司 一种文本数据处理方法及装置
CN113053357A (zh) * 2021-01-29 2021-06-29 网易(杭州)网络有限公司 语音合成方法、装置、设备和计算机可读存储介质
CN113516964A (zh) * 2021-08-13 2021-10-19 北京房江湖科技有限公司 语音合成方法、可读存储介质及计算机程序产品
CN114678002A (zh) * 2022-03-31 2022-06-28 美的集团(上海)有限公司 文本的切分方法和文本的切分装置
CN114678001A (zh) * 2022-03-31 2022-06-28 美的集团(上海)有限公司 语音合成方法和语音合成装置
CN114708848A (zh) * 2022-03-31 2022-07-05 美的集团(上海)有限公司 音视频文件大小的获取方法和装置
CN114822489A (zh) * 2022-03-31 2022-07-29 美的集团(上海)有限公司 文本转写方法和文本转写装置
CN114822490A (zh) * 2022-03-31 2022-07-29 美的集团(上海)有限公司 语音拼接方法和语音拼接装置
CN115223541A (zh) * 2022-06-21 2022-10-21 深圳市优必选科技股份有限公司 文本转语音的处理方法、装置、设备及存储介质

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112420016B (zh) * 2020-11-20 2022-06-03 四川长虹电器股份有限公司 一种合成语音与文本对齐的方法、装置及计算机储存介质
CN112802450B (zh) * 2021-01-05 2022-11-18 杭州一知智能科技有限公司 一种韵律可控的中英文混合的语音合成方法及其系统

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP4503017A4 *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114678002A (zh) * 2022-03-31 2022-06-28 美的集团(上海)有限公司 文本的切分方法和文本的切分装置
CN118116364A (zh) * 2023-12-29 2024-05-31 上海稀宇极智科技有限公司 语音合成模型训练方法、语音合成方法、电子设备及存储介质
CN118053416A (zh) * 2024-03-12 2024-05-17 中邮消费金融有限公司 声音定制方法、装置、设备及存储介质
CN118940766A (zh) * 2024-10-11 2024-11-12 中孚安全技术有限公司 改善tts模型处理长文本性能的方法、系统及介质
CN118940766B (zh) * 2024-10-11 2025-03-18 中孚安全技术有限公司 改善tts模型处理长文本性能的方法、系统及介质
CN119832893A (zh) * 2024-12-12 2025-04-15 中电信人工智能科技(北京)有限公司 声学模型的生成方法、装置、电子设备及存储介质
CN119832893B (zh) * 2024-12-12 2025-10-28 中电信人工智能科技(北京)有限公司 声学模型的生成方法、装置、电子设备及存储介质
CN120164451A (zh) * 2025-03-14 2025-06-17 优酷文化科技(北京)有限公司 一种语音合成方法及装置

Also Published As

Publication number Publication date
EP4503017A1 (en) 2025-02-05
EP4503017A4 (en) 2025-05-07

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application. Ref document number: 22933875; Country of ref document: EP; Kind code of ref document: A1
WWE Wipo information: entry into national phase. Ref document number: 2022933875; Country of ref document: EP
ENP Entry into the national phase. Ref document number: 2022933875; Country of ref document: EP; Effective date: 20241031
NENP Non-entry into the national phase. Ref country code: DE