CN112802446A - Audio synthesis method and device, electronic equipment and computer-readable storage medium

Info

Publication number: CN112802446A
Application number: CN201911114561.3A
Authority: CN
Inventors: 张黄斌; 李辉
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2019-11-14
Filing date: 2019-11-14
Publication date: 2021-05-14
Anticipated expiration: 2039-11-14
Also published as: CN112802446B

Abstract

Embodiments of the present disclosure provide an audio synthesis method and apparatus, an electronic device, and a computer-readable storage medium. The method comprises: obtaining a music score to be processed and lyrics to be processed; extracting music features from the music score to be processed; extracting text features from the lyrics to be processed; processing the music features and the text features through an end-to-end neural network model to obtain spectrum information; and synthesizing, according to the spectrum information, singing audio corresponding to the music score to be processed and the lyrics to be processed. With the scheme provided by the embodiments of the present disclosure, singing audio can be synthesized through the end-to-end neural network model.

Description

Audio synthesis method and device, electronic equipment and computer-readable storage medium

Technical Field

The present disclosure relates to the field of computer technologies, and in particular, to an audio synthesis method and apparatus, an electronic device, and a computer-readable storage medium.

Background

Existing singing synthesis schemes use a DNN (Deep Neural Network) for modeling. The dry vocals (unaccompanied singing voice) they synthesize have low naturalness and sound quality and fall far short of the singing level of a real person.

Therefore, a new audio synthesis method and apparatus, an electronic device, and a computer-readable storage medium are needed.

It is to be noted that the information disclosed in the above background section is only for enhancement of understanding of the background of the present disclosure.

Disclosure of Invention

Embodiments of the present disclosure provide an audio synthesis method and apparatus, an electronic device, and a computer-readable storage medium, which can synthesize singing audio through an end-to-end neural network model, such that the synthesized singing audio has higher naturalness and sound quality and is closer to the singing level of a real person.

Additional features and advantages of the disclosure will be set forth in the detailed description which follows, or in part will be obvious from the description, or may be learned by practice of the disclosure.

According to an aspect of the embodiments of the present disclosure, there is provided an audio synthesis method, the method including: obtaining a music score to be processed and lyrics to be processed; extracting music features from the music score to be processed; extracting text features from the lyrics to be processed; processing the music features and the text features through an end-to-end neural network model to obtain spectrum information; and synthesizing, according to the spectrum information, singing audio corresponding to the music score to be processed and the lyrics to be processed.

According to an aspect of the embodiments of the present disclosure, there is provided an audio synthesis apparatus, the apparatus including: a music score and lyrics acquisition module configured to acquire a music score to be processed and lyrics to be processed; a music feature extraction module configured to extract music features from the music score to be processed; a text feature extraction module configured to extract text features from the lyrics to be processed; a spectrum information obtaining module configured to process the music features and the text features through an end-to-end neural network model to obtain spectrum information; and a singing audio synthesis module configured to synthesize, according to the spectrum information, the singing audio corresponding to the music score to be processed and the lyrics to be processed.

In some exemplary embodiments of the present disclosure, the music features include a musical instrument digital interface (MIDI) feature and a duration feature. The music feature extraction module comprises: a musical instrument digital interface feature obtaining submodule configured to obtain the musical instrument digital interface feature according to the pitch in the music score to be processed; and a duration feature obtaining submodule configured to normalize the note lengths in the music score to be processed to obtain the duration feature.
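As an illustration only, the two music features could be derived roughly as follows. This is a minimal sketch: the pitch-name mapping follows a common MIDI convention (C4 = 60), while the helper names `pitch_to_midi` and `normalize_durations` and the choice to normalize by the longest note are assumptions that the disclosure does not specify.

```python
# Hypothetical helpers; the disclosure does not fix a normalization scheme.
NOTE_OFFSETS = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}


def pitch_to_midi(pitch: str) -> int:
    """Convert a pitch name such as 'C4' or 'F#3' to a MIDI note number (C4 = 60)."""
    name, rest = pitch[0].upper(), pitch[1:]
    semitone = NOTE_OFFSETS[name]
    if rest.startswith("#"):      # sharp raises the pitch by one semitone
        semitone += 1
        rest = rest[1:]
    elif rest.startswith("b"):    # flat lowers the pitch by one semitone
        semitone -= 1
        rest = rest[1:]
    octave = int(rest)
    return 12 * (octave + 1) + semitone


def normalize_durations(note_lengths):
    """Scale note lengths into (0, 1] by dividing by the longest note (assumed scheme)."""
    longest = max(note_lengths)
    return [length / longest for length in note_lengths]


# A toy score: (pitch, note length in beats).
score = [("C4", 1.0), ("E4", 0.5), ("G4", 2.0)]
midi_features = [pitch_to_midi(pitch) for pitch, _ in score]
duration_features = normalize_durations([length for _, length in score])
```

Here `midi_features` holds one integer per note and `duration_features` holds one normalized value per note, so both sequences stay aligned with the notes of the score.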

In some exemplary embodiments of the present disclosure, the end-to-end neural network model includes a text encoder, a musical instrument digital interface encoder, and a duration encoder. The spectrum information obtaining module comprises: a text embedded vector obtaining submodule configured to process the text features through the text encoder to obtain a text embedded vector; a musical instrument digital interface embedded vector obtaining submodule configured to process the musical instrument digital interface feature through the musical instrument digital interface encoder to obtain a musical instrument digital interface embedded vector; a duration embedded vector obtaining submodule configured to process the duration feature through the duration encoder to obtain a duration embedded vector; a fused embedded vector obtaining submodule configured to obtain a fused embedded vector according to the text embedded vector, the musical instrument digital interface embedded vector, and the duration embedded vector; and a spectrum information obtaining submodule configured to obtain the spectrum information according to the fused embedded vector.
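The fusion of the three embedded vectors can be pictured with a short sketch. The passage above does not say how the fused embedded vector is formed, so the step-wise concatenation below, the function name `fuse_embeddings`, and the requirement that the three sequences have the same number of steps are all assumptions made for illustration.

```python
def fuse_embeddings(text_emb, midi_emb, duration_emb):
    """Concatenate three per-step embeddings into one vector per step (assumed scheme)."""
    if not (len(text_emb) == len(midi_emb) == len(duration_emb)):
        raise ValueError("embedding sequences must have the same number of steps")
    # List concatenation joins the feature dimensions of each step.
    return [t + m + d for t, m, d in zip(text_emb, midi_emb, duration_emb)]


# Toy embeddings for a two-step sequence.
text_emb = [[0.1, 0.2], [0.3, 0.4]]
midi_emb = [[0.5], [0.6]]
duration_emb = [[0.25], [1.0]]
fused = fuse_embeddings(text_emb, midi_emb, duration_emb)
```

Each fused vector has a width equal to the sum of the three encoder output widths (4 in this toy case).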

In some exemplary embodiments of the present disclosure, the musical instrument digital interface encoder includes a first dense neural network. Wherein the musical instrument digital interface embedded vector obtaining submodule comprises: a first dense vector obtaining unit configured to process the musical instrument digital interface features through the first dense neural network to obtain a first dense vector; a musical instrument digital interface embedded vector obtaining unit configured to obtain the musical instrument digital interface embedded vector from the first dense vector.

In some exemplary embodiments of the present disclosure, the musical instrument digital interface encoder further comprises a second dense neural network, a forward gated recurrent neural network, and a backward gated recurrent neural network. The musical instrument digital interface embedded vector obtaining unit comprises: a second dense vector obtaining subunit configured to process the first dense vector through the second dense neural network to obtain a second dense vector; a first feature map obtaining subunit configured to process the second dense vector through the forward gated recurrent neural network to obtain a first feature map; a second feature map obtaining subunit configured to process the first feature map through the backward gated recurrent neural network to obtain a second feature map; and a musical instrument digital interface embedded vector obtaining subunit configured to concatenate the second feature map and the second dense vector to obtain the musical instrument digital interface embedded vector.
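The data flow through this encoder (two dense layers, a forward recurrent pass, a backward recurrent pass over the forward output, then concatenation with the second dense vector) can be sketched in plain Python. This is only a toy under stated assumptions: the layer sizes, the tanh activations, the bias-free GRU-style gate equations, the random weights standing in for trained parameters, and the names `GRU` and `midi_encoder` are all illustrative choices, not details taken from the disclosure.

```python
import math
import random


def init(rows, cols, rng):
    """Random weight matrix; stands in for trained parameters."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(w, x):
    return [sum(a * b for a, b in zip(row, x)) for row in w]


def dense(w, x):
    """Dense layer with tanh activation (bias omitted for brevity)."""
    return [math.tanh(v) for v in matvec(w, x)]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


class GRU:
    """Minimal gated recurrent layer with update (z), reset (r), and candidate (n) gates."""

    def __init__(self, in_dim, hid_dim, rng):
        self.hid_dim = hid_dim
        self.wz = init(hid_dim, in_dim + hid_dim, rng)
        self.wr = init(hid_dim, in_dim + hid_dim, rng)
        self.wn = init(hid_dim, in_dim + hid_dim, rng)

    def step(self, x, h):
        z = [sigmoid(v) for v in matvec(self.wz, x + h)]
        r = [sigmoid(v) for v in matvec(self.wr, x + h)]
        gated = [ri * hi for ri, hi in zip(r, h)]
        n = [math.tanh(v) for v in matvec(self.wn, x + gated)]
        return [(1 - zi) * ni + zi * hi for zi, ni, hi in zip(z, n, h)]

    def run(self, seq, reverse=False):
        """Process the sequence left-to-right, or right-to-left when reverse=True."""
        h = [0.0] * self.hid_dim
        out = [None] * len(seq)
        order = reversed(range(len(seq))) if reverse else range(len(seq))
        for t in order:
            h = self.step(seq[t], h)
            out[t] = h
        return out


def midi_encoder(midi_notes, rng):
    x = [[note / 127.0] for note in midi_notes]        # scale note numbers to [0, 1]
    w1, w2 = init(8, 1, rng), init(4, 8, rng)
    forward, backward = GRU(4, 4, rng), GRU(4, 4, rng)
    first_dense = [dense(w1, v) for v in x]            # first dense vector
    second_dense = [dense(w2, v) for v in first_dense] # second dense vector
    first_map = forward.run(second_dense)              # forward pass
    second_map = backward.run(first_map, reverse=True) # backward pass over the forward output
    # Concatenate the second feature map with the second dense vector per step.
    return [m + d for m, d in zip(second_map, second_dense)]


embedding = midi_encoder([60, 64, 67], random.Random(0))
```

With these toy sizes, each note yields an 8-dimensional embedded vector: 4 values from the second feature map and 4 from the second dense vector.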

In some exemplary embodiments of the present disclosure, the duration encoder comprises a third dense neural network. The duration embedded vector obtaining submodule comprises: a duration embedded vector obtaining unit configured to process the duration feature through the third dense neural network to obtain the duration embedded vector.

In some exemplary embodiments of the present disclosure, the end-to-end neural network model further comprises an attention mechanism module and a spectrum decoder. The spectrum information obtaining submodule comprises: an attention context vector obtaining unit configured to process the fused embedded vector through the attention mechanism module to obtain an attention context vector; and a spectrum information obtaining unit configured to process the attention context vector through the spectrum decoder to obtain the spectrum information.
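To make the idea of an attention context vector concrete, the sketch below computes one with plain dot-product attention over a sequence of fused embedded vectors. The disclosure does not state which attention variant is used, so the dot-product scoring, the `query` argument (standing in for a decoder state), and the function name `attention_context` are assumptions.

```python
import math


def attention_context(query, memory):
    """Return (context, weights) for one query over a sequence of memory vectors."""
    # Score each memory step by its dot product with the query.
    scores = [sum(q * m for q, m in zip(query, step)) for step in memory]
    # Softmax with max subtraction for numerical stability.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # The context vector is the weighted sum of the memory vectors.
    dim = len(memory[0])
    context = [sum(w * step[i] for w, step in zip(weights, memory)) for i in range(dim)]
    return context, weights


fused_sequence = [[1.0, 0.0], [0.0, 1.0]]
context, weights = attention_context([0.0, 0.0], fused_sequence)
```

A zero query scores every step equally, so the weights are uniform and the context is the plain average of the fused vectors; a query that aligns with one step shifts most of the weight onto that step.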

In some exemplary embodiments of the present disclosure, the spectrum information includes mel spectrum parameters and linear spectrum parameters. The singing audio synthesis module comprises: a singing audio synthesis submodule configured to process the mel spectrum parameters and the linear spectrum parameters through a neural network vocoder to synthesize the singing audio.

In some exemplary embodiments of the present disclosure, the apparatus further comprises: a sample information acquisition module configured to acquire a sample music score, sample lyrics, and the corresponding sample singing audio; a sample music feature extraction module configured to extract sample music features from the sample music score; a sample text feature extraction module configured to extract sample text features from the sample lyrics; a sample spectrum information obtaining module configured to obtain sample spectrum information according to the sample singing audio; a spectrum prediction module configured to process the sample music features and the sample text features through the end-to-end neural network model to obtain predicted spectrum information; and a model training module configured to train the end-to-end neural network model according to the sample spectrum information and the predicted spectrum information.
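Training "according to the sample spectrum information and the predicted spectrum information" implies some measure of the gap between the two. The sketch below uses a mean absolute (L1) error over spectrogram frames; the choice of L1 and the function name `spectrum_l1_loss` are assumptions, since the passage above does not name a loss function.

```python
def spectrum_l1_loss(predicted, target):
    """Mean absolute error between two spectrograms given as lists of frames."""
    if len(predicted) != len(target):
        raise ValueError("spectrograms must have the same number of frames")
    total, count = 0.0, 0
    for pred_frame, target_frame in zip(predicted, target):
        if len(pred_frame) != len(target_frame):
            raise ValueError("frames must have the same number of bins")
        for p, t in zip(pred_frame, target_frame):
            total += abs(p - t)
            count += 1
    return total / count


# Toy spectrograms: two frames with two frequency bins each.
predicted_spectrum = [[1.0, 2.0], [3.0, 4.0]]
sample_spectrum = [[1.0, 1.0], [3.0, 2.0]]
loss = spectrum_l1_loss(predicted_spectrum, sample_spectrum)
```

In an actual training loop this value would be minimized by updating the model parameters; that optimization step is outside the scope of the sketch.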

According to an aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements an audio synthesis method as described in the above embodiments.

According to an aspect of the embodiments of the present disclosure, there is provided an electronic device including: one or more processors; a storage device configured to store one or more programs that, when executed by the one or more processors, cause the one or more processors to implement the audio synthesis method as described in the above embodiments.

In some embodiments of the present disclosure, music features are extracted from a music score to be processed, text features are extracted from lyrics to be processed, and the music features and the text features are processed through an end-to-end neural network model to obtain spectrum information, so that singing audio corresponding to the music score to be processed and the lyrics to be processed can be synthesized according to the spectrum information. Compared with a DNN singing synthesis scheme, the singing audio synthesized by the end-to-end neural network model has higher naturalness and better sound quality and is closer to the singing level of a real person.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure. It is to be understood that the drawings in the following description are merely exemplary of the disclosure, and that other drawings may be derived from those drawings by one of ordinary skill in the art without the exercise of inventive faculty. In the drawings:

FIG. 1 shows a schematic diagram of an exemplary system architecture to which an audio synthesis method or an audio synthesis apparatus of an embodiment of the present disclosure may be applied;

FIG. 2 illustrates a schematic structural diagram of a computer system suitable for use with the electronic device used to implement embodiments of the present disclosure;

FIG. 3 is a diagram illustrating the implementation of vocal stem synthesis using DNN in the related art;

FIG. 4 schematically illustrates a flow diagram of an audio synthesis method according to an embodiment of the present disclosure;

FIG. 5 is a diagram illustrating a processing procedure of step S2 shown in FIG. 4 in one embodiment;

FIG. 6 is a diagram illustrating a processing procedure of step S4 shown in FIG. 4 in one embodiment;

FIG. 7 is a diagram illustrating a processing procedure of step S4 shown in FIG. 4 in another embodiment;

FIG. 8 schematically illustrates a structural schematic of an end-to-end neural network model according to an embodiment of the present disclosure;

FIG. 9 is a diagram illustrating a processing procedure of step S42 shown in FIG. 6 in one embodiment;

FIG. 10 schematically shows a structural diagram of a MIDI encoder according to an embodiment of the present disclosure;

FIG. 11 is a diagram illustrating a processing procedure of step S422 shown in FIG. 9 in one embodiment;

FIG. 12 schematically shows a structural diagram of a MIDI encoder according to another embodiment of the present disclosure;

FIG. 13 schematically illustrates a structural diagram of a duration encoder according to an embodiment of the present disclosure;

FIG. 14 schematically shows a flow chart of an audio synthesis method according to another embodiment of the present disclosure;

FIG. 15 schematically illustrates a training process diagram of an end-to-end neural network model, according to an embodiment of the present disclosure;

FIG. 16 schematically illustrates a prediction process diagram of an end-to-end neural network model according to an embodiment of the present disclosure;

FIG. 17 is a schematic diagram showing an evaluation of the effect of singing synthesis using different approaches;

FIGS. 18-21 schematically show diagrams of application scenarios to which the singing synthesis method proposed by the embodiments of the present disclosure is applied;

FIG. 22 schematically shows a block diagram of an audio synthesis apparatus according to an embodiment of the present disclosure.

Detailed Description

Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art.

Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the disclosure. One skilled in the relevant art will recognize, however, that the subject matter of the present disclosure can be practiced without one or more of the specific details, or with other methods, components, devices, steps, and so forth. In other instances, well-known methods, devices, implementations, or operations have not been shown or described in detail to avoid obscuring aspects of the disclosure.

The block diagrams shown in the figures are functional entities only and do not necessarily correspond to physically separate entities. I.e. these functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor means and/or microcontroller means.

The flow charts shown in the drawings are merely illustrative and do not necessarily include all of the contents and operations/steps, nor do they necessarily have to be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the actual execution sequence may be changed according to the actual situation.

Fig. 1 shows a schematic diagram of an exemplary system architecture 100 to which an audio synthesis method or an audio synthesis apparatus of an embodiment of the present disclosure may be applied.

As shown in fig. 1, the system architecture 100 may include one or more of terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired or wireless communication links, or fiber optic cables.

It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation. For example, server 105 may be a server cluster comprised of multiple servers, or the like.

The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages or the like. The terminal devices 101, 102, 103 may be various electronic devices having display screens, including, but not limited to, smart phones, tablets, portable computers, desktop computers, wearable smart devices, smart home devices, digital cinema projectors, and the like.

The server 105 may be a server that provides various services. For example, the user sends various requests to the server 105 using the terminal device 103 (which may also be the terminal device 101 or 102). Based on the related information carried in a request, the server 105 may obtain feedback information in response to the request and return it to the terminal device 103, and the user may view the displayed feedback information on the terminal device 103.

As another example, the terminal device 103 (which may also be the terminal device 101 or 102) may be a smart TV, a VR (Virtual Reality)/AR (Augmented Reality) head-mounted display, or a mobile terminal such as a smart phone or a tablet computer on which an instant messaging or video application (APP) or the like is installed, and the user may send various requests to the server 105 through the smart TV, the VR/AR head-mounted display, or the instant messaging or video APP. The server 105 may obtain, based on a request, feedback information in response to the request and return the feedback information to the smart TV, the VR/AR head-mounted display, or the instant messaging or video APP, which then displays the returned feedback information.

FIG. 2 illustrates a schematic structural diagram of a computer system suitable for use in implementing the electronic device of an embodiment of the present disclosure.

It should be noted that the computer system 200 of the electronic device shown in fig. 2 is only an example, and should not bring any limitation to the functions and the scope of the application of the embodiments of the present disclosure.

As shown in fig. 2, the computer system 200 includes a Central Processing Unit (CPU) 201 that can perform various appropriate actions and processes in accordance with a program stored in a Read-Only Memory (ROM) 202 or a program loaded from a storage section 208 into a Random Access Memory (RAM) 203. In the RAM 203, various programs and data necessary for system operation are also stored. The CPU 201, ROM 202, and RAM 203 are connected to each other via a bus 204. An input/output (I/O) interface 205 is also connected to the bus 204.

The following components are connected to the I/O interface 205: an input section 206 including a keyboard, a mouse, and the like; an output section 207 including a display such as a Cathode Ray Tube (CRT) or a Liquid Crystal Display (LCD), a speaker, and the like; a storage section 208 including a hard disk and the like; and a communication section 209 including a network interface card such as a LAN (Local Area Network) card, a modem, or the like. The communication section 209 performs communication processing via a network such as the Internet. A drive 210 is also connected to the I/O interface 205 as needed. A removable medium 211, such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like, is mounted on the drive 210 as necessary, so that a computer program read out therefrom is installed into the storage section 208 as necessary.

In particular, according to embodiments of the present disclosure, the processes described below with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable storage medium, the computer program containing program code for performing the method illustrated by the flowchart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 209 and/or installed from the removable medium 211. The computer program, when executed by the Central Processing Unit (CPU) 201, performs various functions defined in the methods and/or apparatus of the present application.

It should be noted that the computer readable medium shown in the present disclosure may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a Read-Only Memory (ROM), an Erasable Programmable Read-Only Memory (EPROM or flash memory), an optical fiber, a portable compact disc Read-Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electromagnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF (Radio Frequency), etc., or any suitable combination of the foregoing.

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of methods, apparatus, and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The modules, sub-modules, units and sub-units described in the embodiments of the present disclosure may be implemented by software or by hardware, and the described modules, sub-modules, units and sub-units may also be disposed in a processor. The names of these modules, sub-modules, units and sub-units do not, in some cases, constitute a limitation on the modules, sub-modules, units and sub-units themselves.

As another aspect, the present application also provides a computer-readable storage medium, which may be included in the electronic device described in the above embodiments; or may exist separately without being assembled into the electronic device. The computer-readable storage medium carries one or more programs which, when executed by an electronic device, cause the electronic device to implement the method as described in the embodiments below. For example, the electronic device may implement the steps shown in fig. 4, 5, 6, 7, 9, 11, or 14.

Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human Intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.

Artificial intelligence technology is a comprehensive discipline that covers a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, machine learning/deep learning, and the like.

The key technologies of Speech Technology are Automatic Speech Recognition (ASR), speech synthesis (Text to Speech, TTS), and voiceprint recognition. Enabling computers to listen, see, speak, and feel is the development direction of future human-computer interaction, and voice is expected to become one of the most promising human-computer interaction modes.

Machine Learning (ML) is an interdisciplinary field involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other disciplines. It specializes in studying how a computer can simulate or realize human learning behavior in order to acquire new knowledge or skills and reorganize its existing knowledge structure to continuously improve its own performance. Machine learning is the core of artificial intelligence, is the fundamental way to make computers intelligent, and is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from demonstration.

With the research and progress of artificial intelligence technology, the artificial intelligence technology is developed and applied in a plurality of fields, such as common smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, automatic driving, unmanned aerial vehicles, robots, smart medical care, smart customer service, and the like.

The scheme provided by the embodiments of the present application relates to technologies such as artificial intelligence speech technology and machine learning, and is described in detail through the following embodiments.

In recent years, singing synthesis technology has attracted wide attention. Its greatest convenience is that it enables a computer to sing songs with any melody, so progress in singing synthesis is eagerly awaited in fields closely related to singing, such as music production and entertainment.

In the related art, singing synthesis techniques include a Hidden Markov Model (HMM) based parametric synthesis system and a waveform splicing synthesis based system.

The core of waveform splicing is to pre-record each pronunciation unit of a given language at different pitches, and then connect the recordings according to the lyrics and the music score. However, a system based on waveform splicing synthesis faces two major difficulties. First, waveform distortion is easily introduced during splicing, which makes the synthesized sound unnatural. Second, waveform splicing depends on a huge amount of recorded data, which takes a great deal of time and manpower to collect and thus makes singing synthesis more difficult.
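The splicing idea and both of its difficulties can be seen in a small sketch. Everything here is illustrative: the `units` dictionary keyed by (syllable, MIDI pitch), the function names, and the short linear crossfade used to soften the joins are assumptions, not details of any system described in this document.

```python
def crossfade_concat(a, b, overlap):
    """Join two sample lists, linearly crossfading over `overlap` samples."""
    if overlap == 0:
        return a + b
    head, tail = a[:-overlap], a[-overlap:]
    mixed = []
    for i in range(overlap):
        fade = (i + 1) / (overlap + 1)  # weight of the incoming clip
        mixed.append(tail[i] * (1 - fade) + b[i] * fade)
    return head + mixed + b[overlap:]


def splice(units, score, overlap=2):
    """Look up one recording per (syllable, pitch) and join them in score order."""
    audio = []
    for syllable, pitch in score:
        clip = units[(syllable, pitch)]  # KeyError if this unit was never recorded
        if audio:
            audio = crossfade_concat(audio, clip, min(overlap, len(audio), len(clip)))
        else:
            audio = list(clip)
    return audio


# Toy "recordings": constant-valued sample lists instead of real waveforms.
units = {("la", 60): [1.0] * 4, ("la", 62): [0.0] * 4}
audio = splice(units, [("la", 60), ("la", 62)])
```

The abrupt change between the two clips at the join is the kind of discontinuity that causes audible distortion, and the `KeyError` raised for any unrecorded (syllable, pitch) pair reflects the dependence on a very large recording inventory.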

A singing synthesis system based on parametric synthesis first determines the duration parameter sequence, the fundamental frequency parameter sequence, and the spectrum parameter sequence of each basic synthesis unit (such as a syllable or a phoneme), and then uses a parametric synthesizer to obtain a continuous singing signal from these parameter sequences.

Fig. 3 is a schematic diagram showing a related art implementation of vocal stem synthesis using DNN.

As shown in fig. 3, a singing synthesis scheme in the related art is modeled with a DNN and is divided into two parts: model training and prediction. In the model training part, the collected singing voice, music score information, and lyric information are used to train a duration model and an acoustic model. The duration model is used to predict the pronunciation duration of each phoneme, and the acoustic model is used to predict the spectrum parameters of the phonemes. In the model prediction part, the trained acoustic model and duration model are used to synthesize dry vocals according to the music score and lyrics input by the user. A phoneme (phone) is the smallest unit of speech divided from the perspective of timbre, such as the initials and finals in Chinese or the consonants and vowels in English.

Specifically, in the model training phase, the method comprises the following steps:

1.1, preparation data: the singing dry voice, the music score information and the lyric information of a singer are collected from a singing database.

1.2, labeling data: the singing audio, i.e., the collected singing dry voice, is cut into sentences. Information such as the word boundaries, phoneme boundaries, notes, pitch, and duration of each sentence is labeled manually. The labeled data is used as training data.

1.3, HTS (HMM-based Speech Synthesis System, a speech synthesis system based on hidden Markov models) segmentation: the HTS system is trained using the training data obtained in step 1.2 above, and each phoneme is then further segmented down to the state level.

Specifically, an HTS system is constructed in advance, and each sentence of singing audio in the training data is input to it. The HTS system predicts information such as the word boundaries, phoneme boundaries, notes, pitch, and duration of each sentence, and the predictions are compared with the corresponding manually labeled information to compute a loss function with which the HTS system is trained. The trained HTS system is then used to further segment each phoneme.

For example, the sound "a" is manually labeled as one complete "a" sound, but it can actually be segmented further into several parts (i.e., several states) according to its spectral features. When the model is actually trained, the state-level durations are predicted rather than the phoneme duration directly. HTS cuts each phoneme into 5 states and then outputs the duration of each state of each phoneme.

1.4, extracting duration model text features: duration model text features are extracted from the music score and the lyrics.

1.5, extracting phoneme duration parameters: duration parameters are extracted from the HTS segmentation result obtained in step 1.3 above. Here, training uses the state-level durations of the phonemes.

1.6, training the duration model: the duration model is trained using the duration model text features extracted in step 1.4 and the phoneme duration parameters extracted in step 1.5.

1.7, extracting acoustic parameters of the singing audio: each sentence of audio segmented in step 1.2 is further divided into N equal-length small pieces, where N is a positive integer greater than or equal to 1, and acoustic parameters expressing the spectral characteristics of the audio, together with the pitch-related fundamental frequency and vibrato parameter sequences, are extracted from each small piece of audio.

1.8, extracting acoustic model text features: the data extracted in steps 1.4 and 1.5 are combined, and acoustic model text features with the same length as the audio are generated by sequence expansion.

1.9, training the acoustic model: the acoustic model is trained using the data extracted in steps 1.7 and 1.8 above.

In the model prediction phase, the method comprises the following steps:

2.1, extracting the music score and lyric information provided by the user.

2.2, extracting a duration model text feature sequence from the music score and lyric information.

2.3, inputting the duration model text feature sequence obtained in step 2.2 into the duration model to predict the duration of each phoneme.

2.4, combining the duration model text feature sequence obtained in step 2.2 with the phoneme durations predicted in step 2.3, and obtaining an acoustic model text feature sequence by sequence expansion.

2.5, inputting the acoustic model text feature sequence obtained in step 2.4 into the acoustic model to predict the acoustic parameters, fundamental frequency, and vibrato parameter sequences of each small piece of audio.

2.6, inputting the acoustic parameters, fundamental frequency, and vibrato parameter sequences obtained in step 2.5 into a vocoder to obtain the synthesized singing dry voice audio.
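The "sequence expansion" in steps 1.8 and 2.4 can be pictured as repeating each phoneme-level feature once for every small piece of audio (frame) that the phoneme covers, so that the text features line up with the frame-level acoustic parameters. The following is only a minimal sketch under that assumption; the function name `expand_to_frames`, the 5 ms frame length, and the sample durations are hypothetical and are not taken from the related art.

```python
def expand_to_frames(phoneme_features, durations_ms, frame_ms=5):
    """Repeat each phoneme-level feature once per frame it spans.

    frame_ms is an assumed frame length; the related art does not state one.
    """
    if len(phoneme_features) != len(durations_ms):
        raise ValueError("one duration is required per phoneme")
    frames = []
    for feature, duration in zip(phoneme_features, durations_ms):
        n_frames = max(1, round(duration / frame_ms))  # at least one frame per phoneme
        frames.extend([feature] * n_frames)
    return frames


# Hypothetical phonemes with hypothetical predicted durations (step 2.3 output).
frame_features = expand_to_frames(["p", "ing"], [20, 35])
print(len(frame_features))
```

In this toy run, "p" is repeated for 4 frames and "ing" for 7 frames, giving a frame-level sequence of the kind the acoustic model would consume.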

However, in the above related art, the transitions between phonemes in the singing dry voice synthesized by DNN modeling are not smooth, the timbre is mechanical and unnatural, the overall listening experience is not pleasant, and the result falls far short of the singing level of a real person.

In the related art, end-to-end speech synthesis is a TTS (text-to-speech) technology, but the audio it synthesizes is ordinary speech; so far, no end-to-end technique has successfully synthesized a highly natural singing dry voice whose pitch and duration meet the requirements of a music score.

Fig. 4 schematically shows a flow diagram of an audio synthesis method according to an embodiment of the present disclosure. The method provided by the embodiment of the present disclosure may be executed by any electronic device with computing processing capability, for example, any one or more of the terminal devices 101, 102, 103 and/or the server 105 in fig. 1. In the following description, the server 105 is exemplified as an execution subject.

As shown in fig. 4, an audio synthesis method provided by an embodiment of the present disclosure may include the following steps.

In step S1, a to-be-processed musical score and its to-be-processed lyrics are acquired.

For example, a user sends a singing synthesis request to a server through a client. The request carries the music score to be processed and the lyric information to be processed that correspond to the singing audio the user has selected for synthesis, and after receiving the singing synthesis request the server obtains the music score to be processed and the lyrics to be processed from the request.

In the embodiment of the present disclosure, the to-be-processed music score may be in any one of the MIDI (Musical Instrument Digital Interface) format, the ASCII (American Standard Code for Information Interchange) format, the MusicXML (Music Extensible Markup Language) format, and the like. The music score to be processed can be a paper music score and/or an electronic music score, the lyrics to be processed correspond to the music score to be processed, and the lyrics to be processed can also be paper lyrics and/or electronic lyrics. The paper music score and paper lyrics may be printed or handwritten; they can be converted into an electronic music score and electronic lyrics by scanning, and the music score and lyrics in the required digital format are then obtained through software processing. The electronic music score and lyrics can be downloaded from a website or edited by a user, and the music score and lyrics in the required digital format are obtained after software processing.

In step S2, music features are extracted from the score to be processed.

In the disclosed embodiment, the musical features may include a MIDI feature associated with a note pitch and a duration (Interval) feature associated with a duration. The specific extraction process can be described with reference to the embodiment of fig. 5 below.

In step S3, text features are extracted from the lyrics to be processed.

In the embodiment of the present disclosure, the text features extracted from the lyrics to be processed may include any one or a combination of: phoneme features, prosodic boundary features, whether a phrase has ended, how many phonemes the current pinyin syllable has, whether the current phoneme is a retroflex sound, and the like.

For example, take the lyrics to be processed to be the word "pinyin". The text is first converted into the corresponding pinyin "ping yin", which can be split into the four-element sequence "p", "ing", "y", and "in". The generation of the text features is explained using several features as examples: how many phonemes the current pinyin syllable has (feature 2, where the number is the phoneme count), whether the current pinyin syllable is retroflex (feature 3, where "0" means not retroflex and "1" means retroflex), and whether the current pinyin syllable is a zero-initial syllable (feature 4, where "0" means not a zero-initial syllable and "1" means a zero-initial syllable):

TABLE 1

A zero-initial syllable is a syllable that consists only of a final, with no initial consonant; in practice, syllables beginning with a vowel (a, o, e, i, u, or ü) are zero-initial syllables. The resulting text features are therefore "p-2-0-0", "ing-2-0-0", "y-2-0-0", and "in-2-0-0".

It should be understood that the above-listed features are for illustration only and may in fact include hundreds of dimensions.
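The feature strings in the "ping yin" example can be reproduced with a few lines of code. This is only an illustrative sketch: the function name `text_features`, the pre-split syllable input, and the per-syllable retroflex flag are hypothetical conveniences, and a real front end would cover many more feature dimensions.

```python
VOWEL_INITIALS = "aoeiuü"  # syllables starting with one of these letters count as zero-initial


def text_features(syllables, retroflex_flags=None):
    """Build 'phoneme-count-retroflex-zero_initial' strings, one per phoneme.

    syllables: pinyin syllables already split into phonemes,
               e.g. [["p", "ing"], ["y", "in"]].
    retroflex_flags: optional 0/1 flag per syllable (assumed input, defaults to 0).
    """
    if retroflex_flags is None:
        retroflex_flags = [0] * len(syllables)
    features = []
    for phonemes, retroflex in zip(syllables, retroflex_flags):
        count = len(phonemes)                                  # feature 2
        zero_initial = int(phonemes[0][0] in VOWEL_INITIALS)   # feature 4
        for phoneme in phonemes:
            features.append(f"{phoneme}-{count}-{retroflex}-{zero_initial}")
    return features


features = text_features([["p", "ing"], ["y", "in"]])
print(features)  # ['p-2-0-0', 'ing-2-0-0', 'y-2-0-0', 'in-2-0-0']
```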

In an embodiment of the disclosure, one or more notes in the musical feature correspond to a lyric in the textual feature. The notes in the music characteristic have the time sequence, that is, the notes in the music score need to be played according to a preset time sequence to form a preset melody, and the lyrics in the text characteristic correspond to the notes, so the lyrics also have the preset time sequence.

In step S4, the music feature and the text feature are processed by an end-to-end neural network model to obtain spectrum information.

In step S5, a singing audio corresponding to the music score to be processed and the lyrics to be processed thereof is synthesized according to the frequency spectrum information.

Specifically, the spectrum information may be input into a pre-trained neural network vocoder, and a sampling signal of the singing audio corresponding to the to-be-processed music score and the to-be-processed lyrics thereof may be automatically output, for example, refer to the content of the neural network vocoder in fig. 8 below.

In the embodiment of the present disclosure, the singing audio may be a singing dry sound, or may be a finally synthesized song, so that the song can be returned for playing to the client of the user who sent the singing synthesis request. The dry sound refers to the original singing audio without any post-processing (operations such as reverberation, modulation, compression/limiting, and speed change). Its counterpart is the wet sound, which refers to the post-processed singing audio. After the singing dry sound corresponding to the music score to be processed and the lyrics to be processed is synthesized according to the frequency spectrum information, the singing dry sound can be mixed with the corresponding background music to obtain the finally synthesized song.

According to the audio synthesis method provided by the embodiment of the disclosure, music characteristics are extracted from a music score to be processed, text characteristics are extracted from lyrics to be processed, and the music characteristics and the text characteristics are processed through an end-to-end neural network model to obtain frequency spectrum information, so that singing audio corresponding to the music score to be processed and the lyrics to be processed can be synthesized according to the frequency spectrum information. On one hand, because the end-to-end neural network model is directly modeled from text to speech, a duration model in DNN modeling is not needed, and an HTS system is not needed for data preprocessing, the scheme provided by the embodiment of the disclosure can directly use the length of the phoneme without further segmenting to the state level of the phoneme; on the other hand, compared with the DNN singing synthesis scheme, the singing audio synthesized by adopting the end-to-end neural network model has higher naturalness and better tone quality and is closer to the real singing level.

Fig. 5 is a schematic diagram illustrating a processing procedure of step S2 shown in fig. 4 in an embodiment. In embodiments of the present disclosure, the musical features may include a musical instrument digital interface feature and a duration feature.

As shown in fig. 5, in the embodiment of the present disclosure, the step S2 may further include the following steps.

In step S21, the musical instrument digital interface features are obtained according to the pitch in the to-be-processed score.

In the disclosed embodiment, the MIDI features are word-level, and one word in the lyrics corresponds to one MIDI value. Pitch refers to the frequency of vibration of the vocal cords when singing. The pitch f (in Hz) in the score to be processed may be transformed according to the following formula to obtain the MIDI feature p:

p = 69 + 12 × log2(f / 440)    (1)

For example, a numbered musical notation for a song is as follows:

The above score is in the key of B-flat (♭B), i.e., scale degree 1 has a frequency of 233.08 Hz. The frequencies corresponding to the notes of the above numbered musical notation are 196.00 Hz, 233.08 Hz, 293.66 Hz, 195.00 Hz, and 456.16 Hz, as shown in Table 2 below, and formula (1) converts them into the (rounded) MIDI values 55, 58, 62, 55, and 70.

TABLE 2 Pitch to frequency LUT
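The pitch-to-MIDI conversion can be checked numerically. The sketch below assumes the standard equal-temperament mapping in which 440 Hz corresponds to MIDI note 69; this assumption reproduces the rounded MIDI values quoted for the example frequencies. The function name `hz_to_midi` is hypothetical.

```python
import math


def hz_to_midi(frequency_hz):
    """Convert a frequency in Hz to the nearest MIDI note number (assumes A4 = 440 Hz = note 69)."""
    return round(69 + 12 * math.log2(frequency_hz / 440.0))


# Frequencies quoted in the worked example.
frequencies = [196.00, 233.08, 293.66, 195.00, 456.16]
midi_values = [hz_to_midi(f) for f in frequencies]
print(midi_values)  # [55, 58, 62, 55, 70]
```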

In step S22, the duration of the notes in the score to be processed is normalized to obtain the duration feature.

In the disclosed embodiment, Interval is the duration (note value) of a note. Interval and Duration are not the same concept: Interval is the length, as indicated on the score, for which a note should be sung, but a person's actual pronunciation may not exactly match what the score specifies. Duration refers to the length of time for which each phoneme (e.g., a, ao, etc.) is pronounced when a person actually sings.

The duration feature may be obtained from the note duration in the score according to the following formula:

Further, taking the above numbered musical notation as an example, the notation indicates 234 beats per minute, that is, the length of one beat is 60 s / 234 ≈ 0.256 s = 256 ms. The note sequence corresponds to one beat (256 ms) followed by half beats (128 ms each), so the corresponding values of interval_feature are: 256/1500 ≈ 0.1707, 128/1500 ≈ 0.0853, 128/1500 ≈ 0.0853, 128/1500 ≈ 0.0853, and 128/1500 ≈ 0.0853.

In the embodiment of the present disclosure, the extracted musical instrument digital interface feature and the extracted duration feature are used as the music feature. The duration in the music score is not discretized (for example, converted into a sequence of "0" and "1"); instead, it is normalized by formula (2) above. Normalization works better than discretization here, so the extracted music feature is better.
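The arithmetic of the duration example can be replayed in code. This sketch only mirrors the worked numbers: the divisor of 1500 ms is taken from the example values, the beat pattern (one beat followed by four half beats) is inferred from the five quoted results, and the function name `interval_features` is hypothetical.

```python
def interval_features(beats, bpm, norm_ms=1500):
    """Normalize note lengths given in beats to duration (Interval) features.

    norm_ms=1500 is the divisor appearing in the worked example (an assumption
    about the general rule, not a confirmed constant).
    """
    beat_ms = round(60_000 / bpm)  # 60 s / 234 is about 256 ms per beat
    return [round(b * beat_ms / norm_ms, 4) for b in beats]


values = interval_features([1, 0.5, 0.5, 0.5, 0.5], bpm=234)
print(values)  # [0.1707, 0.0853, 0.0853, 0.0853, 0.0853]
```

Because the result is a continuous value rather than a one-hot or binary code, notes of similar length receive similar features, which is the property the normalization is meant to preserve.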

Fig. 6 is a schematic diagram illustrating a processing procedure of step S4 shown in fig. 4 in an embodiment. In an embodiment of the present disclosure, the end-to-end neural network model may include a text encoder (Text encoder), a musical instrument digital interface encoder (MIDI encoder), and a duration encoder (Interval encoder).

As shown in fig. 6, in the embodiment of the present disclosure, the step S4 may further include the following steps.

In step S41, the text features are processed by the text encoder to obtain a text Embedding (Embedding) vector.

The text encoder converts phoneme-level text into vectors. For the specific implementation, reference may be made to the text encoder in the fig. 8 embodiment below, which converts a character sequence into a hidden feature representation. The input text, such as the text features above, is fed to the text encoder, which may include, for example, a character embedding (Character Embedding) layer, 3 convolutional layers (3 Conv Layers), and a bidirectional LSTM (Long Short-Term Memory network) connected in sequence. The character embedding layer converts the text features into a 512-dimensional character embedding vector representation, which is then input to the 3 convolutional layers. Each convolutional layer includes 512 filters of shape 5 × 1, i.e., each filter spans 5 characters, and is followed by batch normalization (BN) and a ReLU (Rectified Linear Unit) activation. The output of the last convolutional layer is then input to the bidirectional LSTM, which includes 512 units (256 per direction) and outputs the text embedding vector.
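To make the layer sizes concrete, the following sketch traces the output shape of each stage of the text encoder for an input of a given length. It performs no learning and is not a model implementation; it assumes that the convolutions use padding that preserves the sequence length, which the text does not state, and the function name `text_encoder_shapes` is hypothetical.

```python
def text_encoder_shapes(seq_len, embed_dim=512, n_conv=3, conv_filters=512,
                        lstm_units_per_direction=256):
    """Return (layer name, output shape) pairs for a sequence of seq_len symbols.

    Assumes length-preserving padding for the 5-wide convolutions.
    """
    shapes = [("character embedding", (seq_len, embed_dim))]
    for i in range(1, n_conv + 1):
        shapes.append((f"conv {i} + BN + ReLU", (seq_len, conv_filters)))
    # Forward and backward LSTM outputs are joined, giving 2 x 256 = 512 features.
    shapes.append(("bidirectional LSTM", (seq_len, 2 * lstm_units_per_direction)))
    return shapes


for name, shape in text_encoder_shapes(seq_len=4):
    print(name, shape)
```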

In step S42, the musical instrument digital interface features are processed by the musical instrument digital interface encoder to obtain a musical instrument digital interface Embedding (MIDI Embedding) vector.

For example, referring to the MIDI encoder in the embodiment of fig. 8, the extracted MIDI features are input to the MIDI encoder, and the MIDI Embedding vector is output. The specific structure of the MIDI encoder can be referred to the following description of the embodiments shown in fig. 9 to 12.

In step S43, the duration feature is processed by the duration encoder to obtain a duration embedding (Interval Embedding) vector.

For example, referring to the duration encoder in the embodiment of fig. 8, the extracted duration feature is input to the duration encoder, and the Interval Embedding vector is output. For the specific structure of the duration encoder, reference may be made to the following description of the embodiment shown in fig. 13.

In step S44, a fused embedded vector is obtained based on the text embedded vector, the midi embedded vector, and the duration embedded vector.

For example, referring to the following fig. 8 embodiment, the output results of the bidirectional LSTM of the text encoder, the MIDI encoder and the duration encoder are spliced to obtain the fused embedded vector.
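One way to picture the splicing step is as a per-time-step concatenation of the three embeddings along the feature axis. The sketch below assumes that the three sequences have already been aligned to the same number of steps (for example, by repeating a word-level MIDI value for each phoneme of that word); neither that alignment nor the toy vector sizes come from the disclosure, and the function name `fuse_embeddings` is hypothetical.

```python
def fuse_embeddings(text_emb, midi_emb, interval_emb):
    """Concatenate three per-step embedding sequences along the feature axis."""
    if not (len(text_emb) == len(midi_emb) == len(interval_emb)):
        raise ValueError("all three sequences must have the same number of steps")
    return [t + m + d for t, m, d in zip(text_emb, midi_emb, interval_emb)]


# Toy sizes for readability; the real embeddings are far wider.
steps = 4
text_emb = [[0.0] * 6 for _ in range(steps)]
midi_emb = [[1.0] * 3 for _ in range(steps)]
interval_emb = [[2.0] * 2 for _ in range(steps)]

fused = fuse_embeddings(text_emb, midi_emb, interval_emb)
print(len(fused), len(fused[0]))  # 4 11
```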

In step S45, the spectrum information is obtained according to the fused embedded vector.

The spectral information may here comprise a mel-frequency spectrum and/or a linear spectrum. The mel spectrum may be converted into a linear spectrum. For example, referring to the embodiment of fig. 8 below, the fusion embedded vector is input to a Location Sensitive Attention mechanism (Location Sensitive Attention) module to generate an Attention context vector, and then input to a Mel-frequency spectrum Decoder (Mel Decoder) to generate a Mel-frequency spectrum (Mel spectrum).

Fig. 7 is a schematic diagram illustrating a processing procedure of step S4 shown in fig. 4 in another embodiment. In the disclosed embodiment, the end-to-end neural network model may further include an Attention mechanism (Attention) module and a spectrum decoder.

As shown in fig. 7, in the embodiment of the present disclosure, the step S4 may further include the following steps.

In step S46, the fused embedded vector is processed by the attention mechanism module to obtain an attention context vector.

For example, with reference to the fig. 8 embodiment below, the location-sensitive attention mechanism extends the additive attention mechanism by using the attention weights accumulated over previous decoder time steps as an additional feature. This encourages the model to keep moving forward through the input and reduces potential failure modes in which the decoder repeats or skips certain subsequences. The attention probabilities are calculated after the inputs and location features are mapped to a 128-dimensional hidden representation. The location features are computed using 32 one-dimensional convolution filters of length 31.

In step S47, the spectral information is obtained by the spectral decoder processing the attention context vector.

For example, referring to the fig. 8 embodiment below, the spectrum decoder is a Mel-spectrum decoder that includes a 5-convolutional-layer Post-Net (5 Conv Layer Post-Net, a network structure that maps the Mel spectrum into a linear spectrum), a 2-layer Pre-Net (2 Layer Pre-Net), 2 LSTM layers, and two linear transformation (Linear Projection) layers. The Mel-spectrum decoder is an autoregressive recurrent neural network that predicts the Mel spectrum from the encoded input sequence one frame at a time. The prediction from the previous time step is first passed through a small Pre-Net comprising 2 fully connected layers of 256 hidden ReLU units each. The output of the Pre-Net and the attention context vector are concatenated and input to 2 unidirectional LSTM layers (1024 units). The concatenation of the LSTM output and the attention context vector is mapped to the predicted target spectral frame by a linear transformation. Finally, the predicted Mel spectrum is passed to the 5 Conv Layer Post-Net, which predicts a residual that is added to the prediction to improve the overall reconstruction. Each Post-Net layer comprises 512 filters of shape 5 × 1 with BN, followed by a tanh activation function on every layer except the last. The Mel spectrum is then input to a neural network vocoder (Neural Network Vocoder), which outputs audio samples (Audio Samples), i.e., the singing dry voice.

Fig. 8 schematically illustrates a structural schematic diagram of an end-to-end neural network model according to an embodiment of the present disclosure.

As shown in fig. 8, the Mel Decoder in the embodiment of the present disclosure is an autoregressive recurrent neural network that, combined with the output of the text encoder, generates the Mel spectrum of the audio autoregressively. The Mel Spectrogram is output recursively. For example, at the start, one frame of random numbers is combined with the context vector to predict three frames of the Mel Spectrogram; the last of those three frames is then used as input, combined with the context vector, to predict the next three frames, and so on until the whole Mel Spectrogram has been predicted.

In parallel with the spectral frame prediction, the concatenation of the mel-spectrum decoder LSTM output and the attention context is projected to a scalar and passed to a sigmoid activation function to predict the probability of completion of the output sequence. "Stop Token" (Stop Token) predictions are used during inference to allow the model to dynamically decide when to terminate generation, rather than always generating a fixed duration.
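The recursive output and the stop-token check can be summarized as a simple loop. The following is a schematic sketch rather than the disclosed decoder: `predict_step` is a hypothetical stand-in for the attention module and Mel decoder, the context is held fixed for brevity (in the model it would be recomputed by the attention module at each step), and the values of `n_mels`, `max_steps`, and `stop_threshold` are assumptions.

```python
import random


def decode_mel(predict_step, context, n_mels=80, max_steps=1000, stop_threshold=0.5):
    """Autoregressively collect Mel frames until the stop token fires.

    predict_step(prev_frame, context) -> (list of new frames, stop probability)
    is a hypothetical callable; n_mels, max_steps and stop_threshold are assumed.
    """
    prev_frame = [random.random() for _ in range(n_mels)]  # start from one random frame
    frames = []
    for _ in range(max_steps):
        new_frames, stop_prob = predict_step(prev_frame, context)
        frames.extend(new_frames)
        prev_frame = new_frames[-1]     # the last predicted frame feeds the next step
        if stop_prob > stop_threshold:  # the stop token decides when to terminate
            break
    return frames


def make_dummy_step(total_steps, n_mels=80, frames_per_step=3):
    """Toy predictor: emits three zero frames per call and stops on the last call."""
    calls = {"n": 0}

    def step(prev_frame, context):
        calls["n"] += 1
        new_frames = [[0.0] * n_mels for _ in range(frames_per_step)]
        return new_frames, (1.0 if calls["n"] >= total_steps else 0.0)

    return step


mel = decode_mel(make_dummy_step(total_steps=4), context=None)
print(len(mel))  # 12
```

With the toy predictor stopping on its fourth call, the loop yields 4 × 3 = 12 frames, showing how the output length is decided at inference time instead of being fixed in advance.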

In the disclosed embodiment, the neural network Vocoder can use WaveNet Vocoder to convert the Mel Spectrogram feature representation into time domain waveform samples.

The method provided by the embodiment of the present disclosure modifies the model structure on the basis of TTS end-to-end synthesis technology and provides MusicTactron, an end-to-end synthesis technology suitable for TTM (Text To Music, i.e., text-to-singing synthesis that produces singing audio). Compared with the DNN singing synthesis scheme, the audio synthesized by the embodiment of the disclosure has higher naturalness and better tone quality, and is very close to the real singing level.

Note that the model structure of the end-to-end neural network model in fig. 8 is exemplified by Tacotron 2 with a MIDI encoder and an Interval encoder added. However, the present disclosure is not limited thereto, and other end-to-end neural network model structures may be improved in the same way. For example, the structure of Tacotron 1 may be improved: in the modified structure, the output results of the MIDI encoder and the Interval encoder are spliced together with the output result of the Tacotron 1 text encoder and input into the attention module. For another example, the structure of Deep Voice 3 may be improved: in the modified structure, the output results of the MIDI encoder and the Interval encoder, the output result of the Tacotron 1 text encoder, and the output result of the Deep Voice 3 text encoder are spliced together and input into the AttentionBlock module.

Fig. 9 is a schematic diagram illustrating a processing procedure of step S42 shown in fig. 6 in an embodiment. In the disclosed embodiment, the MIDI encoder may include a first dense neural network (Dense1, abbreviated as D1).

As shown in fig. 9, in the embodiment of the present disclosure, the step S42 may further include the following steps.

In step S421, the musical instrument digital interface features are processed by the first dense neural network to obtain a first dense vector.

In step S422, the midi embedded vector is obtained according to the first dense vector.

Fig. 10 schematically shows a structural diagram of a MIDI encoder according to an embodiment of the present disclosure.

As shown in fig. 10, in the embodiment of the present disclosure, the MIDI encoder includes a first dense neural network (D1). The extracted MIDI features are input into the Dense1 layer, which has 1024 nodes and a ReLU activation function, and the output is the MIDI embedded vector.

Fig. 11 is a schematic diagram illustrating a processing procedure of step S422 shown in fig. 9 in an embodiment. In the embodiment of the present disclosure, the musical instrument digital interface encoder may further include a second dense neural network, a forward gated recurrent neural network (GRU), and a backward gated recurrent neural network.

As shown in fig. 11, in the embodiment of the present disclosure, the step S422 may further include the following steps.

In step S4221, the first dense vector is processed by the second dense neural network to obtain a second dense vector.

In the embodiment of the present disclosure, the calculation result of D1 in fig. 10 may be input into the next layer, Dense2; the number of nodes of the Dense2 layer may also be 1024, and the activation function is ReLU.

In step S4222, the second dense vector is processed by the forward gated recurrent neural network to obtain a first feature map.

The calculation result of Dense2 in step S4221 above may be input into a GRU1 layer, whose number of nodes may be 128.

In step S4223, the first feature map is processed by the backward gated recurrent neural network to obtain a second feature map.

The calculation result of GRU1 in step S4222 above may be input into the next layer, GRU2, whose number of nodes may be 128.

In step S4224, the second feature map and the second dense vector are concatenated to obtain the midi embedded vector.

The results of the computations of Dense2 and GRU2 can be concatenated together as a MIDI Embedding vector to be output to the attention mechanism module.

Fig. 12 schematically shows a structural diagram of a MIDI encoder according to another embodiment of the present disclosure.

As shown in fig. 12, in the embodiment of the present disclosure, the input of the MIDI encoder is the MIDI features, and the MIDI encoder may include a first dense neural network (Dense1, D1), a second dense neural network (Dense2, D2), a forward gated recurrent neural network (GRU1), a backward gated recurrent neural network (GRU2), and a concatenation layer. The output result of D2 is concatenated with the output result of GRU2, and the MIDI embedded vector is output.

In the embodiment of fig. 12, the two Dense layers in the MIDI encoder are MIDI embedding layers, and the two GRUs together form a bidirectional GRU (Bidirectional GRU). Concatenating the calculation results of Dense2 and GRU2 helps guarantee the stability of the pitch, because identical MIDI features always yield an identical Dense part of the output. Here, concatenation means joining side by side: the second feature map and the second dense vector are spliced along the channel dimension to form a larger MIDI embedded vector that contains more feature information.

It should be understood that the network structure of the MIDI encoder is not limited to the structures illustrated in fig. 10 and fig. 12. For example, it may include three or more Dense layers and/or three or more GRU layers, and the GRU layers and/or Dense layers may be replaced by another recurrent neural network such as an LSTM, or by any deep neural network, which is not limited in this disclosure.
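The data flow of the fig. 12 structure (Dense1, Dense2, GRU1, GRU2, then concatenation of the Dense2 and GRU2 outputs) can be illustrated with a small pure-Python forward pass. This is only a sketch under several assumptions that are not in the disclosure: tiny layer sizes replace 1024 and 128, weights are random and untrained, GRU biases are omitted, each MIDI value is fed as a single scaled scalar, and the backward GRU is realized by running over the sequence in reverse. All function names are hypothetical.

```python
import math
import random


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def dense_relu(x, W, b):
    return [max(0.0, v + bi) for v, bi in zip(matvec(W, x), b)]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def make_gru(in_dim, hidden, rng):
    # One (input, recurrent) weight pair per gate: update z, reset r, candidate h.
    return {g: (rand_matrix(hidden, in_dim, rng), rand_matrix(hidden, hidden, rng))
            for g in "zrh"}


def gru_step(x, h, p):
    z = [sigmoid(a + b) for a, b in zip(matvec(p["z"][0], x), matvec(p["z"][1], h))]
    r = [sigmoid(a + b) for a, b in zip(matvec(p["r"][0], x), matvec(p["r"][1], h))]
    rh = [ri * hi for ri, hi in zip(r, h)]
    cand = [math.tanh(a + b) for a, b in zip(matvec(p["h"][0], x), matvec(p["h"][1], rh))]
    return [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]


def run_gru(seq, p, hidden, reverse=False):
    h, out = [0.0] * hidden, []
    for x in (reversed(seq) if reverse else seq):
        h = gru_step(x, h, p)
        out.append(h)
    return out[::-1] if reverse else out  # restore original time order


def midi_encoder(midi_seq, dense_dim=8, gru_dim=4, seed=0):
    rng = random.Random(seed)
    W1, b1 = rand_matrix(dense_dim, 1, rng), [0.0] * dense_dim
    W2, b2 = rand_matrix(dense_dim, dense_dim, rng), [0.0] * dense_dim
    gru1, gru2 = make_gru(dense_dim, gru_dim, rng), make_gru(gru_dim, gru_dim, rng)
    d2 = [dense_relu(dense_relu([m / 127.0], W1, b1), W2, b2) for m in midi_seq]
    g1 = run_gru(d2, gru1, gru_dim)                # forward pass (GRU1)
    g2 = run_gru(g1, gru2, gru_dim, reverse=True)  # backward pass (GRU2)
    return [a + b for a, b in zip(d2, g2)]         # concatenate Dense2 and GRU2 outputs


embedding = midi_encoder([55, 58, 62, 55, 70])
print(len(embedding), len(embedding[0]))  # 5 12
```

In this sketch, the first and fourth inputs are both MIDI 55, so the Dense part of their embeddings (the first 8 values) is identical even though the GRU part depends on context, which is the pitch-stability property the concatenation is described as providing.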

In an exemplary embodiment, the duration encoder may include a third dense neural network. Processing the duration feature by the duration encoder to obtain a duration embedded vector may include: processing the duration feature through the third dense neural network to obtain the duration embedded vector.

Fig. 13 schematically illustrates a structural diagram of a duration encoder according to an embodiment of the present disclosure.

As shown in fig. 13, the duration encoder provided by the embodiment of the present disclosure may include a third dense neural network (Dense3, D3). The extracted Interval feature is input into the Dense3 layer, whose number of nodes may be 1024 and whose activation function may be ReLU. The calculation result of Dense3 is input to the attention mechanism module as the Interval Embedding vector.

It is to be understood that the network structure of the duration encoder is not limited to the structure of fig. 13; for example, it may also be a structure in which at least one Dense layer is stacked with at least one GRU layer, which is not limited in this disclosure.

Fig. 14 schematically shows a flow chart of an audio synthesis method according to another embodiment of the present disclosure.

As shown in fig. 14, the method provided by the embodiment of the present disclosure may further include the following steps, which are different from the above-described embodiment. In the disclosed embodiment, a model training part and a model prediction part can be included. Wherein, the model training part comprises the following steps.

In step S6, a sample score, sample lyrics, and sample singing audio thereof are acquired.

The singing dry voice of a singer is collected as the sample singing audio, and the music score and lyrics corresponding to the singing dry voice are used as the sample music score and the sample lyrics, respectively.

In step S7, sample music features are extracted from the sample score.

The sample music score is cut into sentences, and information such as the notes, pitches, and durations of each sentence is labeled. MIDI features are obtained by converting the note pitches, and duration features are obtained from the duration information; the MIDI features and the duration features together serve as the sample music features.

In step S8, sample text features are extracted from the sample lyrics.

The sample lyrics are cut into sentences, and information such as the word boundaries and phoneme boundaries of each sentence is labeled. The sample text features are obtained from this information.

In step S9, sample spectral information is obtained from the sample singing audio.

Sample spectral information is extracted from the sample singing audio using a spectral extraction tool; for example, the sample spectral information may include labeled ground-truth Mel-spectrum parameters and linear-spectrum parameters.

In step S10, the sample music features and the sample text features are processed by the end-to-end neural network model, so as to obtain predicted spectrum information.

The MIDI features in the sample music features are input into the MIDI encoder, the duration features in the sample music features are input into the duration encoder, and the sample text features are input into the text encoder, to obtain the predicted spectrum information.

In step S11, the end-to-end neural network model is trained according to the sample spectrum information and the predicted spectrum information.

In the embodiment of the present disclosure, when the end-to-end neural network model is trained, the following loss function may be adopted:

loss = Mel_loss + Spec_loss (3)

In the above formula, loss represents the total loss, Mel_loss represents the mel-spectrum loss, and Spec_loss represents the linear-spectrum loss.

Wherein, the mel-frequency spectrum loss can be calculated by the following formula:

In the above formula, n1 is the number of mel-spectrum frames and is a positive integer greater than or equal to 1, y_i1 is the labeled ground-truth mel spectrum of the i1-th frame, x_i1 is the predicted mel spectrum of the i1-th frame, and i1 is a positive integer greater than or equal to 1 and less than or equal to n1.

The linear spectral loss can be calculated by the following formula:

In the above formula, n2 is the number of linear-spectrum frames and is a positive integer greater than or equal to 1, y_i2 is the labeled ground-truth linear spectrum of the i2-th frame, x_i2 is the predicted linear spectrum of the i2-th frame, and i2 is a positive integer greater than or equal to 1 and less than or equal to n2.
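As an illustration of how the total loss in formula (3) combines the two spectral terms, the following sketch computes each term as a per-frame mean absolute error averaged over all frames. The per-term formulas are not reproduced in this text, so the absolute-error form, as well as the function names `mean_frame_l1` and `total_loss`, are assumptions rather than the disclosed definition; only the sum Mel_loss + Spec_loss follows formula (3) directly.

```python
def mean_frame_l1(y_true, x_pred):
    """Average over frames of the mean absolute difference within each frame.

    y_true, x_pred: lists of frames, each frame a list of spectral values.
    ASSUMPTION: an L1-style error; the disclosure's exact per-term formula
    is not reproduced here.
    """
    if len(y_true) != len(x_pred) or not y_true:
        raise ValueError("need the same, non-zero number of frames")
    total = 0.0
    for y, x in zip(y_true, x_pred):
        total += sum(abs(a - b) for a, b in zip(y, x)) / len(y)
    return total / len(y_true)


def total_loss(mel_true, mel_pred, spec_true, spec_pred):
    """loss = Mel_loss + Spec_loss, as in formula (3)."""
    mel_loss = mean_frame_l1(mel_true, mel_pred)      # n1 mel-spectrum frames
    spec_loss = mean_frame_l1(spec_true, spec_pred)   # n2 linear-spectrum frames
    return mel_loss + spec_loss


loss = total_loss(
    mel_true=[[0.0, 1.0], [2.0, 3.0]],
    mel_pred=[[0.5, 1.0], [2.0, 2.0]],
    spec_true=[[1.0, 1.0, 1.0]],
    spec_pred=[[1.0, 2.0, 3.0]],
)
print(loss)  # 1.375 (Mel_loss 0.375 + Spec_loss 1.0)
```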

Fig. 15 schematically illustrates a training process diagram of an end-to-end neural network model according to an embodiment of the present disclosure.

As shown in fig. 15, a sample music score, sample lyrics, and the corresponding sample singing audio may be obtained from a singing database. Data segmentation and score annotation are performed on the obtained data, and sample text features and sample music features are extracted. Mel-spectrum parameters and linear-spectrum parameters are then predicted from the extracted features, and the acoustic model, i.e., the above-mentioned end-to-end neural network model, is trained against the ground-truth labeled mel-spectrum parameters and linear-spectrum parameters.

In an exemplary embodiment, the spectral information may include mel-spectrum parameters and linear-spectrum parameters. Synthesizing the singing audio corresponding to the music score to be processed and the lyrics to be processed according to the frequency spectrum information may include: processing the mel-spectrum parameters and the linear-spectrum parameters through a neural network vocoder to synthesize the singing audio.

Fig. 16 schematically illustrates a prediction process diagram of an end-to-end neural network model according to an embodiment of the present disclosure.

As shown in fig. 16, in the model prediction stage, music features are extracted from the score information provided by a user, and text features are extracted from the lyric information provided by the user. The extracted music features and text features are input to the MusicTactron model described above to obtain a predicted mel spectrum and a predicted linear spectrum. The predicted mel spectrum and/or linear spectrum are then input to the neural network vocoder, which generates a dry singing voice whose lyrics and melody both meet the user's requirements.

The audio synthesis method provided by the embodiment of the disclosure improves the structure of the end-to-end neural network model in the related art. On the one hand, a MIDI embedded vector and an Interval embedded vector are added to the information input to the attention mechanism module, so that a model originally used for synthesizing ordinary speech can be used for synthesizing a dry singing voice. On the other hand, by adding a duration value encoder, the model can synthesize a dry singing voice whose note durations are consistent with the music score. This provides an end-to-end singing synthesis technique that can ultimately synthesize a dry singing voice at the level of a real singer, with a synthesis effect far better than that of the DNN technical scheme in terms of naturalness and timbre fidelity.

The new end-to-end model structure provided by the embodiment of the disclosure achieves very good results in the field of singing synthesis and can synthesize a high-quality dry singing voice from a music score. Compared with the DNN technical scheme in the related art, naturalness and timbre fidelity are greatly improved. The MOS (Mean Opinion Score, a voice quality evaluation index) evaluation results from 50 people are shown in FIG. 17:

The MOS score of the present scheme is 4.1, far higher than the 3.6 of the DNN technical scheme. The MOS score of real singing is 4.3, which shows that the dry singing voice synthesized by the present scheme is very close to the singing level of a real person.

The technical scheme provided by the embodiment of the disclosure is a middle-platform technology and can be used to support products such as music and social networking products. Fig. 18 to 21 schematically show application scenarios to which the singing synthesis method proposed by the embodiment of the present disclosure is applied. Fig. 18 shows the home page of the application, through which the user can write lyrics, or even write new lyrics for an existing song, and then let the application sing them automatically.

As shown in FIG. 19, the "museum" in the application of FIG. 18 is opened, where the user may select a song. As shown in fig. 20, the main category of free authoring may include three subcategories: free authoring, subject lyric writing, and hot recommendations. When the user enters a keyword or a sentence in the subject field and clicks the confirmation virtual button, the application can complete the context from the entered keyword or sentence and automatically generate the complete lyrics of a song, so that the user does not need to enter all the lyrics manually. As shown in fig. 21, the user opens the free authoring subcategory and can directly enter lyrics to replace the original lyrics, so that the music score of the original song is reused with the lyrics the user wants.

As shown in fig. 22, an audio synthesis apparatus 2200 provided in the embodiments of the present disclosure may include: a music score lyric obtaining module 2210, a music feature extracting module 2220, a text feature extracting module 2230, a frequency spectrum information obtaining module 2240, and a singing audio synthesizing module 2250.

The score lyric acquiring module 2210 may be configured to acquire the to-be-processed score and the to-be-processed lyrics thereof. The music feature extraction module 2220 may be configured to extract music features from the to-be-processed score. The text feature extraction module 2230 may be configured to extract text features from the lyrics to be processed. The spectrum information obtaining module 2240 may be configured to process the music feature and the text feature through an end-to-end neural network model to obtain spectrum information. The singing audio synthesizing module 2250 may be configured to synthesize a singing audio corresponding to the to-be-processed music score and the to-be-processed lyrics thereof according to the frequency spectrum information.

In an exemplary embodiment, the music features may include a musical instrument digital interface (MIDI) feature and a duration feature. The music feature extraction module 2220 may include: a musical instrument digital interface feature obtaining sub-module, which may be configured to obtain the musical instrument digital interface feature according to the pitches in the music score to be processed; and a duration feature obtaining sub-module, which may be configured to normalize the note lengths in the music score to be processed to obtain the duration feature.

In an exemplary embodiment, the end-to-end neural network model may include a text encoder, a musical instrument digital interface encoder, and a duration encoder. The spectrum information obtaining module 2240 may include: a text embedded vector obtaining sub-module, which may be configured to process the text features through the text encoder to obtain a text embedded vector; a musical instrument digital interface embedded vector obtaining sub-module, which may be configured to process the musical instrument digital interface feature through the musical instrument digital interface encoder to obtain a musical instrument digital interface embedded vector; a duration embedded vector obtaining sub-module, which may be configured to process the duration feature through the duration encoder to obtain a duration embedded vector; a fused embedded vector obtaining sub-module, which may be configured to obtain a fused embedded vector according to the text embedded vector, the musical instrument digital interface embedded vector, and the duration embedded vector; and a spectrum information obtaining sub-module, which may be configured to obtain the spectrum information according to the fused embedded vector.
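The fusion of the three embedded vectors can be sketched as follows. This passage does not state how the fusion is performed, so the step-wise concatenation used here, the assumption that the three sequences are already aligned to the same number of time steps, and the function name `fuse_embeddings` are all illustrative assumptions.

```python
def fuse_embeddings(text_emb, midi_emb, duration_emb):
    """Concatenate the three embedded vectors at every time step.

    Each argument is a list of per-step vectors (lists of floats).
    ASSUMPTION: fusion by concatenation over aligned sequences; the
    disclosure only says the fused vector is obtained "according to"
    the three embedded vectors.
    """
    if not (len(text_emb) == len(midi_emb) == len(duration_emb)):
        raise ValueError("all three sequences must have the same number of steps")
    return [t + m + d for t, m, d in zip(text_emb, midi_emb, duration_emb)]


fused = fuse_embeddings(
    text_emb=[[1.0, 2.0], [3.0, 4.0]],   # 2 steps, text embedding of width 2
    midi_emb=[[5.0], [6.0]],             # width 1
    duration_emb=[[0.5], [1.0]],         # width 1
)
print(fused)  # [[1.0, 2.0, 5.0, 0.5], [3.0, 4.0, 6.0, 1.0]]
```

Concatenation keeps each source of information in its own slice of the fused vector, which is one simple way to let a downstream attention module see text, pitch, and duration together.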

In an exemplary embodiment, the musical instrument digital interface encoder may include a first dense neural network. The musical instrument digital interface embedded vector obtaining sub-module may include: a first dense vector obtaining unit, which may be configured to process the musical instrument digital interface feature through the first dense neural network to obtain a first dense vector; and a musical instrument digital interface embedded vector obtaining unit, which may be configured to obtain the musical instrument digital interface embedded vector according to the first dense vector.

In an exemplary embodiment, the musical instrument digital interface encoder may further include a second dense neural network, a forward gated recurrent neural network, and a reverse gated recurrent neural network. The musical instrument digital interface embedded vector obtaining unit may include: a second dense vector obtaining subunit, which may be configured to process the first dense vector through the second dense neural network to obtain a second dense vector; a first feature map obtaining subunit, which may be configured to process the second dense vector through the forward gated recurrent neural network to obtain a first feature map; a second feature map obtaining subunit, which may be configured to process the first feature map through the reverse gated recurrent neural network to obtain a second feature map; and a musical instrument digital interface embedded vector obtaining subunit, which may be configured to concatenate the second feature map and the second dense vector to obtain the musical instrument digital interface embedded vector.
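The data flow of this encoder (two dense layers, a forward recurrent pass, a reverse recurrent pass, then concatenation with the second dense vector) can be sketched in plain Python. Everything beyond that ordering is an assumption: the recurrent layers are taken to be standard GRU cells without bias terms, the dense layers use ReLU, the weights are random, the layer sizes are arbitrary, and the input is a single scaled MIDI number per step.

```python
import math
import random


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(w, x):
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def dense(x, w, b):
    """Fully connected layer with ReLU (the activation is an assumption)."""
    return [max(0.0, v + bi) for v, bi in zip(matvec(w, x), b)]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


class GRU:
    """Minimal GRU layer (biases omitted for brevity)."""

    def __init__(self, in_dim, hid_dim, rng):
        self.hid_dim = hid_dim
        # Input and recurrent weights for the update (z), reset (r), candidate (h) paths.
        self.w = {g: rand_matrix(hid_dim, in_dim, rng) for g in "zrh"}
        self.u = {g: rand_matrix(hid_dim, hid_dim, rng) for g in "zrh"}

    def step(self, x, h):
        z = [sigmoid(a + b) for a, b in zip(matvec(self.w["z"], x), matvec(self.u["z"], h))]
        r = [sigmoid(a + b) for a, b in zip(matvec(self.w["r"], x), matvec(self.u["r"], h))]
        rh = [ri * hi for ri, hi in zip(r, h)]
        cand = [math.tanh(a + b) for a, b in zip(matvec(self.w["h"], x), matvec(self.u["h"], rh))]
        return [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]

    def run(self, seq, reverse=False):
        """Return one hidden state per step, visiting steps in forward or reverse order."""
        order = reversed(range(len(seq))) if reverse else range(len(seq))
        h = [0.0] * self.hid_dim
        out = [None] * len(seq)
        for t in order:
            h = self.step(seq[t], h)
            out[t] = h
        return out


def midi_encoder(midi_features, params):
    d1 = [dense(x, *params["dense1"]) for x in midi_features]  # first dense vector
    d2 = [dense(x, *params["dense2"]) for x in d1]             # second dense vector
    fmap1 = params["fwd"].run(d2)                              # first feature map
    fmap2 = params["bwd"].run(fmap1, reverse=True)             # second feature map
    return [f + d for f, d in zip(fmap2, d2)]                  # concatenate per step


rng = random.Random(0)
params = {
    "dense1": (rand_matrix(4, 1, rng), [0.0] * 4),
    "dense2": (rand_matrix(4, 4, rng), [0.0] * 4),
    "fwd": GRU(4, 3, rng),
    "bwd": GRU(3, 3, rng),
}
embedding = midi_encoder([[60 / 127], [64 / 127], [67 / 127]], params)
print(len(embedding), len(embedding[0]))  # 3 7
```

Concatenating the second feature map with the second dense vector acts like a skip connection: the embedded vector carries both the context-aware recurrent output and the per-note dense representation.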

In an exemplary embodiment, the duration encoder may include a third dense neural network. The duration embedded vector obtaining sub-module may include: a duration embedded vector obtaining unit, which may be configured to process the duration feature through the third dense neural network to obtain the duration embedded vector.

In an exemplary embodiment, the end-to-end neural network model may further include an attention mechanism module and a spectral decoder. The spectrum information obtaining sub-module may include: an attention context vector obtaining unit, which may be configured to process the fused embedded vector through the attention mechanism module to obtain an attention context vector; and a spectrum information obtaining unit, which may be configured to process the attention context vector through the spectral decoder to obtain the spectrum information.
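How an attention mechanism turns a sequence of fused embedded vectors into a single attention context vector can be sketched as below. The dot-product scoring, the softmax weighting, and the names `attention_context`, `query`, and `memory` are assumptions for illustration; this passage does not specify which attention variant the model uses.

```python
import math


def attention_context(query, memory):
    """Return (context, weights) for one decoder step.

    query:  decoder state vector (hypothetical name).
    memory: list of fused embedded vectors, one per encoder step.
    ASSUMPTION: dot-product scores followed by a softmax.
    """
    scores = [sum(q * m for q, m in zip(query, vec)) for vec in memory]
    peak = max(scores)                        # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(memory[0])
    # The context vector is the attention-weighted sum of the memory vectors.
    context = [sum(w * vec[i] for w, vec in zip(weights, memory)) for i in range(dim)]
    return context, weights


memory = [[1.0, 0.0], [0.0, 1.0]]
context, weights = attention_context([0.0, 0.0], memory)
print(weights)  # [0.5, 0.5]
print(context)  # [0.5, 0.5]
```

A spectral decoder would consume one such context vector per output frame, with the query updated from the decoder's own state at each step.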

In an exemplary embodiment, the spectral information may include mel-spectrum parameters and linear-spectrum parameters. The singing audio synthesis module 2250 may include: a singing audio synthesizing sub-module, which may be configured to process the mel-spectrum parameters and the linear-spectrum parameters through a neural network vocoder to synthesize the singing audio.

In an exemplary embodiment, the audio synthesizing apparatus 2200 may further include: a sample information obtaining module, which may be configured to obtain a sample music score, sample lyrics, and the corresponding sample singing audio; a sample music feature extraction module, which may be configured to extract sample music features from the sample music score; a sample text feature extraction module, which may be configured to extract sample text features from the sample lyrics; a sample spectrum information obtaining module, which may be configured to obtain sample spectrum information according to the sample singing audio; a spectrum prediction module, which may be configured to process the sample music features and the sample text features through the end-to-end neural network model to obtain predicted spectrum information; and a model training module, which may be configured to train the end-to-end neural network model according to the sample spectrum information and the predicted spectrum information.

The specific implementation of each module, sub-module, unit and sub-unit in the audio synthesis apparatus provided in the embodiment of the present disclosure may refer to the content in the audio synthesis method, and is not described herein again.

It should be noted that although several modules, sub-modules, units and sub-units of the apparatus for action execution are mentioned in the above detailed description, such division is not mandatory. Indeed, the features and functions of two or more modules, sub-modules, units and sub-units described above may be embodied in one module, sub-module, unit or sub-unit, in accordance with embodiments of the present disclosure. Conversely, the features and functions of one module, sub-module, unit or sub-unit described above may be further divided so as to be embodied by a plurality of modules, sub-modules, units and sub-units.

Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a USB flash drive, a removable hard disk, etc.) or on a network, and which includes several instructions to enable a computing device (which may be a personal computer, a server, a touch terminal, a network device, etc.) to execute the method according to the embodiments of the present disclosure.

Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.

It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims

1. An audio synthesis method, comprising:

obtaining a music score to be processed and lyrics to be processed;

extracting music characteristics from the music score to be processed;

extracting text features from the lyrics to be processed;

processing the music characteristic and the text characteristic through an end-to-end neural network model to obtain frequency spectrum information;

and synthesizing singing audio corresponding to the music score to be processed and the lyrics to be processed according to the frequency spectrum information.

2. The method of claim 1, wherein the music features include a musical instrument digital interface feature and a duration feature; wherein extracting music features from the music score to be processed comprises:

obtaining the musical instrument digital interface feature according to pitches in the music score to be processed;

and normalizing the note lengths in the music score to be processed to obtain the duration feature.

3. The method of claim 2, wherein the end-to-end neural network model comprises a text encoder, a musical instrument digital interface encoder, and a duration encoder; wherein processing the music features and the text features through the end-to-end neural network model to obtain frequency spectrum information comprises:

processing the text features through the text encoder to obtain a text embedded vector;

processing the musical instrument digital interface feature through the musical instrument digital interface encoder to obtain a musical instrument digital interface embedded vector;

processing the duration feature through the duration encoder to obtain a duration embedded vector;

obtaining a fused embedded vector according to the text embedded vector, the musical instrument digital interface embedded vector and the duration embedded vector;

and obtaining the frequency spectrum information according to the fused embedded vector.

4. The method of claim 3, wherein the musical instrument digital interface encoder comprises a first dense neural network; wherein processing the musical instrument digital interface feature through the musical instrument digital interface encoder to obtain the musical instrument digital interface embedded vector comprises:

processing the musical instrument digital interface features through the first dense neural network to obtain a first dense vector;

and obtaining the musical instrument digital interface embedded vector according to the first dense vector.

5. The method of claim 4, wherein the musical instrument digital interface encoder further comprises a second dense neural network, a forward gated recurrent neural network, and a reverse gated recurrent neural network; wherein obtaining the musical instrument digital interface embedded vector according to the first dense vector comprises:

processing the first dense vector through the second dense neural network to obtain a second dense vector;

processing the second dense vector through the forward gated recurrent neural network to obtain a first feature map;

processing the first feature map through the reverse gated recurrent neural network to obtain a second feature map;

and concatenating the second feature map and the second dense vector to obtain the musical instrument digital interface embedded vector.

6. The method of claim 3, wherein the duration encoder comprises a third dense neural network; wherein processing the duration feature through the duration encoder to obtain the duration embedded vector comprises:

processing the duration feature through the third dense neural network to obtain the duration embedded vector.

7. The method of claim 3, wherein the end-to-end neural network model further comprises an attention mechanism module and a spectral decoder; wherein obtaining the frequency spectrum information according to the fused embedded vector comprises:

processing the fused embedded vector through the attention mechanism module to obtain an attention context vector;

and processing the attention context vector through the spectral decoder to obtain the frequency spectrum information.

8. The method of claim 1, wherein the frequency spectrum information comprises mel-spectrum parameters and linear-spectrum parameters; wherein synthesizing the singing audio corresponding to the music score to be processed and the lyrics to be processed according to the frequency spectrum information comprises:

processing the mel-spectrum parameters and the linear-spectrum parameters through a neural network vocoder to synthesize the singing audio.

9. The method of claim 1, further comprising:

obtaining a sample music score, sample lyrics and sample singing audio thereof;

extracting sample music features from the sample score;

extracting sample text features from the sample lyrics;

obtaining sample frequency spectrum information according to the sample singing audio;

processing the sample music characteristics and the sample text characteristics through the end-to-end neural network model to obtain predicted spectrum information;

and training the end-to-end neural network model according to the sample spectrum information and the predicted spectrum information.

10. An audio synthesizing apparatus, comprising:

a music score and lyric obtaining module configured to obtain a music score to be processed and lyrics to be processed;

a music feature extraction module configured to extract music features from the music score to be processed;

a text feature extraction module configured to extract text features from the lyrics to be processed;

a frequency spectrum information obtaining module configured to process the music features and the text features through an end-to-end neural network model to obtain frequency spectrum information;

and a singing audio synthesis module configured to synthesize the singing audio corresponding to the music score to be processed and the lyrics to be processed according to the frequency spectrum information.

11. An electronic device, comprising:

one or more processors;

a storage device configured to store one or more programs that, when executed by the one or more processors, cause the one or more processors to implement the audio synthesis method of any of claims 1 to 9.

12. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out an audio synthesis method according to any one of claims 1 to 9.