WO2023024501A1 - Audio data processing method, apparatus, device, and storage medium - Google Patents
Audio data processing method, apparatus, device, and storage medium
- Publication number
- WO2023024501A1 (application PCT/CN2022/082305)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- fundamental frequency
- audio
- audio data
- data
- human voice
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/003—Changing voice quality, e.g. pitch or formants
- G10L21/007—Changing voice quality, e.g. pitch or formants characterised by the process used
- G10L21/013—Adapting to target pitch
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10H—ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
- G10H1/00—Details of electrophonic musical instruments
- G10H1/02—Means for controlling the tone frequencies, e.g. attack or decay; Means for producing special musical effects, e.g. vibratos or glissandos
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0272—Voice signal separating
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10H—ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
- G10H1/00—Details of electrophonic musical instruments
- G10H1/36—Accompaniment arrangements
- G10H1/361—Recording/reproducing of accompaniment for use with an external source, e.g. karaoke systems
- G10H1/366—Recording/reproducing of accompaniment for use with an external source, e.g. karaoke systems with means for modifying or correcting the external signal, e.g. pitch correction, reverberation, changing a singer's voice
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
- G10L13/04—Details of speech synthesis systems, e.g. synthesiser structure or memory management
- G10L13/047—Architecture of speech synthesisers
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
- G10L19/02—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using spectral analysis, e.g. transform vocoders or subband vocoders
- G10L19/032—Quantisation or dequantisation of spectral components
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L21/0216—Noise filtering characterised by the method used for estimating noise
- G10L21/0232—Processing in the frequency domain
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0272—Voice signal separating
- G10L21/0308—Voice signal separating characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/09—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being zero crossing rates
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/24—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/93—Discriminating between voiced and unvoiced parts of speech signals
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10H—ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
- G10H2210/00—Aspects or methods of musical processing having intrinsic musical character, i.e. involving musical theory or musical parameters or relying on musical knowledge, as applied in electrophonic musical tools or instruments
- G10H2210/005—Musical accompaniment, i.e. complete instrumental rhythm synthesis added to a performed melody, e.g. as output by drum machines
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10H—ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
- G10H2210/00—Aspects or methods of musical processing having intrinsic musical character, i.e. involving musical theory or musical parameters or relying on musical knowledge, as applied in electrophonic musical tools or instruments
- G10H2210/021—Background music, e.g. for video sequences or elevator music
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10H—ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
- G10H2210/00—Aspects or methods of musical processing having intrinsic musical character, i.e. involving musical theory or musical parameters or relying on musical knowledge, as applied in electrophonic musical tools or instruments
- G10H2210/031—Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal
- G10H2210/041—Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal based on mfcc [mel -frequency spectral coefficients]
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10H—ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
- G10H2210/00—Aspects or methods of musical processing having intrinsic musical character, i.e. involving musical theory or musical parameters or relying on musical knowledge, as applied in electrophonic musical tools or instruments
- G10H2210/155—Musical effects
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10H—ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
- G10H2210/00—Aspects or methods of musical processing having intrinsic musical character, i.e. involving musical theory or musical parameters or relying on musical knowledge, as applied in electrophonic musical tools or instruments
- G10H2210/325—Musical pitch modification
- G10H2210/331—Note pitch correction, i.e. modifying a note pitch or replacing it by the closest one in a given scale
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10H—ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
- G10H2250/00—Aspects of algorithms or signal processing methods without intrinsic musical character, yet specifically adapted for or used in electrophonic musical processing
- G10H2250/311—Neural networks for electrophonic musical instruments or musical processing, e.g. for musical recognition or control, automatic composition or improvisation
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10H—ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
- G10H2250/00—Aspects of algorithms or signal processing methods without intrinsic musical character, yet specifically adapted for or used in electrophonic musical processing
- G10H2250/315—Sound category-dependent sound synthesis processes [Gensound] for musical use; Sound category-specific synthesis-controlling parameters or control means therefor
- G10H2250/455—Gensound singing voices, i.e. generation of human voices for musical applications, vocal singing sounds or intelligible words at a desired pitch or with desired vocal effects, e.g. by phoneme synthesis
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D30/00—Reducing energy consumption in communication networks
- Y02D30/70—Reducing energy consumption in communication networks in wireless communication networks
Definitions
- the present disclosure relates to the technical field of audio processing, in particular to the technical field of speech synthesis.
- the electronic sound effect can be used to adjust and beautify the sound, and it is widely used in scenes such as karaoke works or short video works.
- High-quality electronic sound effects can improve the sound quality of the work.
- product competitiveness can be enhanced
- product gameplay can be enriched
- user interest can be increased.
- the disclosure provides an audio data processing method, device, equipment, storage medium and program product.
- an audio data processing method including: decomposing the original audio data to obtain human voice audio data and background audio data; performing electronic sound processing on the human voice audio data to obtain electronic human voice data; and synthesizing the electronic human voice data and the background audio data to obtain target audio data.
- an audio data processing device including: a decomposition module for decomposing the original audio data to obtain human voice audio data and background audio data; an electronic sound processing module for performing electronic sound processing on the human voice audio data to obtain electronic human voice data; and a synthesis module for synthesizing the electronic human voice data and the background audio data to obtain target audio data.
- Another aspect of the present disclosure provides an electronic device, including: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can execute the method shown in the embodiments of the present disclosure.
- a non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are used to make the computer execute the method shown in the embodiments of the present disclosure.
- a computer program product including computer programs/instructions, which is characterized in that, when the computer program/instructions are executed by a processor, the steps of the methods shown in the embodiments of the present disclosure are implemented.
- Fig. 1 schematically shows a flow chart of an audio data processing method according to an embodiment of the present disclosure
- Fig. 2 schematically shows a flow chart of a method for decomposing raw audio data according to an embodiment of the present disclosure
- Fig. 3 schematically shows a flow chart of a method for performing electroacoustic processing on human voice audio data according to an embodiment of the present disclosure
- Fig. 4 schematically shows a flowchart of an audio data processing method according to another embodiment of the present disclosure
- Fig. 5 schematically shows a block diagram of an audio data processing device according to an embodiment of the present disclosure.
- Fig. 6 schematically shows a block diagram of an example electronic device that may be used to implement embodiments of the present disclosure.
- the audio data processing method of the embodiment of the present disclosure will be described below with reference to FIG. 1 . It should be noted that in the technical solution of the present disclosure, the collection, storage, use, processing, transmission, provision, and disclosure of audio data and other data involved are all in compliance with relevant laws and regulations, and do not violate public order and good customs.
- FIG. 1 is a flowchart of an audio data processing method according to an embodiment of the present disclosure.
- the audio data processing method 100 includes decomposing the original audio data to obtain vocal audio data and background audio data in operation S110.
- the human voice audio data is electronically processed to obtain electronic human voice data.
- the electronic vocal data and background audio data are synthesized to obtain target audio data.
- the original audio data may include, for example, human voice information and background sound information, wherein the human voice may be, for example, a singing voice, and the background sound may, for example, be accompaniment music.
- a sound source separation algorithm can be used to separate the vocal information and background information in the original audio data to obtain vocal audio data containing vocal information and background audio data containing background sound information.
- in the disclosed embodiment, the human voice information in the original audio data is separated from the background sound information, the human voice information is converted into electronic sound, and the electronic-sound human voice information is then synthesized with the background sound information. This makes it possible to apply the electronic sound effect to audio data that contains both background sound information and human voice information.
- a neural network may be used to implement a sound source separation algorithm to decompose original audio data.
- the input of the neural network may be audio data with background sound information and human voice information
- the output of the neural network may be human voice audio data including human voice information and background audio data including background sound information.
- the music file and the human voice file can be obtained in advance, and the music file and the human voice file can be cut into segments of equal length to obtain multiple music segments X and multiple vocal segments Y.
- Each music segment X can be synthesized with a corresponding human voice segment Y to obtain an original audio data Z.
- Each original audio data Z is used as the input of the neural network, and the music segment X and the human voice segment Y corresponding to the original audio data Z are used as the expected output to train the neural network.
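- the construction of training samples described above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the function names are hypothetical, audio is represented as plain lists of samples, and "synthesizing" a music segment with a vocal segment is assumed to mean sample-wise addition, which the text does not specify.

```python
def cut_segments(samples, segment_len):
    """Cut a sample sequence into equal-length segments, dropping any short tail."""
    count = len(samples) // segment_len
    return [samples[k * segment_len:(k + 1) * segment_len] for k in range(count)]


def build_training_pairs(music, vocal, segment_len):
    """Return (Z, X, Y) triples: mixture Z is the input, X and Y the expected outputs."""
    pairs = []
    music_segments = cut_segments(music, segment_len)
    vocal_segments = cut_segments(vocal, segment_len)
    for x, y in zip(music_segments, vocal_segments):
        # Assumption: mixing is modelled as sample-wise addition of X and Y.
        z = [xi + yi for xi, yi in zip(x, y)]
        pairs.append((z, x, y))
    return pairs


# Toy data: five samples each, cut into segments of two samples.
pairs = build_training_pairs([0.1, 0.2, 0.3, 0.4, 0.5],
                             [0.0, 0.1, 0.0, 0.1, 0.0], 2)
print(len(pairs))  # two complete segments fit into five samples
```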
- the music segment X, the vocal segment Y and the original audio data Z can be preprocessed into mel spectra.
- the output result of the neural network is accordingly also in the form of a mel spectrum.
- the output result in the form of a mel spectrum can be converted back into the corresponding audio data through an algorithm such as the Griffin-Lim algorithm.
- Fig. 2 schematically shows a flowchart of a method for decomposing original audio data according to an embodiment of the present disclosure.
- the method 210 for decomposing original audio data includes determining original mel spectrum data corresponding to the original audio data in operation S211.
- background mel spectrum data and vocal mel spectrum data corresponding to the original mel spectrum data are determined using a neural network.
- the background mel spectrum data may include background sound information in the original mel spectrum data
- the vocal mel spectrum data may include vocal information in the original mel spectrum data
- background audio data is generated based on the background mel spectrum data
- vocal audio data is generated based on the vocal mel spectrum data
- the background audio data can be generated according to the background mel spectrum data through an algorithm such as the Griffin-Lim algorithm, and the vocal audio data can be generated according to the vocal mel spectrum data.
- the electronic sound processing of human voice audio data can be realized by quantizing the fundamental frequency of the human voice data.
- the fundamental frequency, spectral envelope and aperiodic parameters of the human voice data can be determined.
- the fundamental frequency represents the vibration frequency of the vocal cords during pronunciation, which is perceived in the audio as pitch.
- the fundamental frequency can be quantized, and the human voice data can be re-synthesized according to the quantized fundamental frequency, the spectral envelope and the aperiodic parameters, so as to realize the electronic sound processing of the human voice audio data.
- the resynthesized human voice data is electronic vocal data, which includes human voice information with electronic sound effects.
- Fig. 3 schematically shows a flowchart of a method for electronically processing human voice audio data according to an embodiment of the present disclosure.
- the method 320 for electronically processing human voice audio data may include, in operation S321 , extracting the original fundamental frequency of the human voice audio data.
- the original fundamental frequency can be extracted from human voice audio data according to algorithms such as DIO and Harvest.
- the original fundamental frequency is corrected to obtain the first fundamental frequency.
- human voice audio data may be divided into multiple audio segments. Then, for each audio segment of the plurality of audio segments, the energy and zero-crossing rate of the audio segment are determined. Based on the energy and the zero-crossing rate, it is determined whether the audio segment is a voiced audio segment. A linear interpolation algorithm is then used to correct the fundamental frequency of the voiced audio segments.
- the vocal audio data can be divided into multiple audio segments with a preset unit length, and the length of each audio segment is a preset unit length.
- the preset unit length can be set according to actual needs.
- the preset unit length may be any value from 10ms to 40ms.
- each audio segment is provided with a plurality of sampling points.
- the energy of an audio clip can be determined based on the value of each sample point in the audio clip. For example, the energy of an audio clip can be calculated according to the following formula:
- x_i represents the value of the i-th sampling point
- n is the number of sampling points.
- the number n of sampling points may be determined according to the length and sampling rate of the audio segment. Taking the preset unit length of 10 ms as an example, the number n of sampling points can be calculated as $n = \frac{10}{1000} \times sr$, where
- sr represents the sampling rate of the audio.
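- the segmentation and energy calculation can be sketched as follows. The energy formula is not reproduced in the text above, so the sum of squared sample values used here is an assumption (a common definition of short-time energy), and all function names are hypothetical.

```python
def samples_per_segment(sample_rate, unit_ms=10):
    """Number of sampling points n in one segment: n = unit_ms / 1000 * sr."""
    return sample_rate * unit_ms // 1000


def split_segments(samples, n):
    """Split samples into consecutive segments of n points (the last may be shorter)."""
    return [samples[i:i + n] for i in range(0, len(samples), n)]


def segment_energy(segment):
    """Assumed definition: sum of the squared values x_i of all n sampling points."""
    return sum(x * x for x in segment)


n = samples_per_segment(16000)          # 10 ms at a 16 kHz sampling rate
segments = split_segments([0.0] * 400, n)
print(n, len(segments))
```

With a 16 kHz sampling rate and a 10 ms unit length, each segment holds 160 sampling points.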
- the zero-crossing rate of an audio clip can be calculated according to the following formula:
- ZCR is the zero-crossing rate of the audio clip
- n is the number of sampling points in the audio clip
- x_i represents the value of the i-th sampling point in the audio clip
- x_{i-1} represents the value of the (i-1)-th sampling point in the audio clip.
- as above, the number n of sampling points may be determined according to the length and sampling rate of the audio segment. Taking the preset unit length of 10 ms as an example, the number n of sampling points can be calculated as $n = \frac{10}{1000} \times sr$, where
- sr represents the sampling rate of the audio.
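- a zero-crossing rate over the sampling points x_i and x_{i-1} can be sketched as follows. The formula itself is not reproduced in the text above, so this uses one common definition as an assumption: the fraction of adjacent sample pairs whose signs differ. The function name is hypothetical.

```python
def zero_crossing_rate(segment):
    """Fraction of adjacent pairs (x_{i-1}, x_i) whose signs differ.

    Assumption: zero counts as non-negative; other conventions exist.
    """
    n = len(segment)
    if n < 2:
        return 0.0
    crossings = sum(
        1 for i in range(1, n)
        if (segment[i] >= 0) != (segment[i - 1] >= 0)
    )
    return crossings / (n - 1)


print(zero_crossing_rate([1, -1, 1, -1]))  # every adjacent pair changes sign
```

A noise-like unvoiced segment changes sign often and yields a high rate, while a voiced segment dominated by a low fundamental frequency yields a low one.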
- the fundamental frequency can be corrected by using the above characteristics.
- for each audio segment, if the energy E of the audio segment is less than the threshold e_min and the zero-crossing rate ZCR of the audio segment is greater than the threshold zcr_max, the audio segment is an unvoiced audio segment and its fundamental frequency is 0. Otherwise, the audio segment is a voiced audio segment with a non-zero fundamental frequency.
- e_min and zcr_max can be set according to actual needs.
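- the voiced/unvoiced decision rule above can be written as a small predicate. The threshold values below are hypothetical placeholders chosen only for illustration, since the text leaves e_min and zcr_max to be set according to actual needs.

```python
E_MIN = 0.01     # hypothetical energy threshold e_min
ZCR_MAX = 0.3    # hypothetical zero-crossing-rate threshold zcr_max


def is_unvoiced(energy, zcr, e_min=E_MIN, zcr_max=ZCR_MAX):
    """Unvoiced only when energy is low AND the zero-crossing rate is high."""
    return energy < e_min and zcr > zcr_max


def is_voiced(energy, zcr, e_min=E_MIN, zcr_max=ZCR_MAX):
    """Every segment that is not unvoiced is treated as voiced."""
    return not is_unvoiced(energy, zcr, e_min, zcr_max)


print(is_voiced(0.5, 0.05))    # loud, few sign changes
print(is_voiced(0.001, 0.6))   # quiet, many sign changes
```

Note that both conditions must hold for a segment to be unvoiced, so a quiet segment with a low zero-crossing rate is still classified as voiced under this rule.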
- for an unvoiced audio segment, the fundamental frequency may be set to 0.
- the fundamental frequency of each voiced audio segment can be extracted with algorithms such as DIO and Harvest, and the fundamental frequency value of each voiced audio segment is then checked one by one for a value of 0.
- if the fundamental frequency value of a voiced audio segment is 0, a linear interpolation algorithm can be applied to the fundamental frequency values of nearby voiced audio segments to obtain a non-zero value, which is used as the fundamental frequency of that segment.
- for example, suppose there are six voiced audio segments whose fundamental frequency values are 100, 100, 0, 0, 160, and 100; that is, the fundamental frequency values of the 3rd and 4th voiced audio segments are 0. Linear interpolation can then be performed based on the nearest non-zero values on either side, namely the 2nd fundamental frequency value of 100 and the 5th fundamental frequency value of 160.
- the fundamental frequency values of the 3rd and 4th voiced audio segments are thereby obtained as 120 and 140, so the six corrected fundamental frequency values are 100, 100, 120, 140, 160, and 100.
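- the interpolation in this example can be sketched as follows. The function name is hypothetical, and the handling of zeros at the very start or end of the sequence (copying the nearest non-zero value) is an assumption, since the text only covers zeros that have non-zero neighbours on both sides.

```python
def fill_zero_f0(f0):
    """Replace zero F0 values by linear interpolation between non-zero neighbours."""
    result = list(f0)
    known = [i for i, v in enumerate(f0) if v != 0]
    if not known:
        return result  # nothing to interpolate from
    for i, v in enumerate(f0):
        if v != 0:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            result[i] = f0[right]      # assumption: leading zeros copy the next value
        elif right is None:
            result[i] = f0[left]       # assumption: trailing zeros copy the last value
        else:
            span = f0[right] - f0[left]
            result[i] = f0[left] + span * (i - left) / (right - left)
    return result


print(fill_zero_f0([100, 100, 0, 0, 160, 100]))
```

For the six values in the example, the two zeros lie one third and two thirds of the way between 100 and 160, which gives 120 and 140.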
- the first fundamental frequency is adjusted according to predetermined electronic sound parameters to obtain a second fundamental frequency.
- the predetermined electronic sound parameter may include, for example, an electronic sound degree parameter and/or an electronic sound pitch parameter.
- the electronic sound degree parameter can be used to control the degree of the electronic sound.
- the electronic sound pitch parameter can be used to control the pitch.
- the electronic sound degree parameter may take values such as 1, 1.2, and 1.4; the greater the electronic sound degree parameter, the more obvious the electronic sound effect.
- the electronic sound pitch parameter may take values such as -3, -2, -1, +1, +2, and +3, where -1, -2, and -3 represent lowering the pitch by 1, 2, and 3 tones, respectively, and +1, +2, and +3 represent raising the pitch by 1, 2, and 3 tones, respectively.
- without adjustable parameters, the electronic sound effect is fixed and limited to a single style.
- two parameters, the electronic sound degree parameter and the electronic sound pitch parameter, are therefore provided to control the electronic sound effect, which can meet different user demands.
- the fundamental frequency variance and/or the fundamental frequency average value may be determined according to the fundamental frequencies of all voiced audio segments. The modified fundamental frequency variance is determined according to the electronic sound degree parameter and the fundamental frequency variance, and/or the corrected fundamental frequency average value is determined according to the electronic sound pitch parameter and the fundamental frequency average value. Then, according to the modified fundamental frequency variance and/or the corrected fundamental frequency average value, the first fundamental frequency is adjusted to obtain the second fundamental frequency.
- the variance of the fundamental frequencies of all voiced sound audio segments may be calculated as the fundamental frequency variance, and the average value of the fundamental frequencies of all the voiced sound audio segments may be calculated as the average fundamental frequency.
- the modified fundamental frequency variance can be calculated according to the following formula:
- new_var is the modified fundamental frequency variance
- var is the fundamental frequency variance
- a is the electronic sound degree parameter.
- the corrected fundamental frequency average can be calculated according to the following formula:
- new_mean is the corrected fundamental frequency average value
- mean is the fundamental frequency average value
- b is the electronic sound pitch parameter.
- the second fundamental frequency can be calculated according to the following formula:
- F0' is the second fundamental frequency.
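- one way to adjust the first fundamental frequency toward a target variance and average value is sketched below. The formulas for new_var, new_mean and F0' are not reproduced in the text above, so this is an assumption, not the disclosed method: it standardizes the voiced values and rescales them so that their variance and mean become new_var and new_mean. The function names are hypothetical, and the example values for new_var and new_mean are illustrative only.

```python
import math


def mean_and_variance(values):
    """Population mean and variance of a non-empty list."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, var


def match_moments(f0, new_mean, new_var):
    """Rescale voiced F0 values (> 0) to the given mean and variance.

    Assumption: F0' = (F0 - mean) * sqrt(new_var / var) + new_mean.
    Unvoiced values (0) are left at 0.
    """
    voiced = [v for v in f0 if v > 0]
    if not voiced:
        return list(f0)
    mean, var = mean_and_variance(voiced)
    gain = math.sqrt(new_var / var) if var > 0 else 1.0
    return [(v - mean) * gain + new_mean if v > 0 else 0.0 for v in f0]


# Illustrative targets only: keep the mean, quadruple the variance.
print(match_moments([100.0, 0.0, 200.0], new_mean=150.0, new_var=10000.0))
```

Widening the variance spreads the pitch contour over more distinct target frequencies before quantization, which is one plausible reading of why a larger degree parameter makes the effect more obvious.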
- each key frequency of the piano can be used as the target frequency to quantize the fundamental frequency of the vocal data.
- the frequency range can be determined according to the following formula:
- scale is the frequency range
- F0' is the second fundamental frequency
- the third fundamental frequency can be determined according to the following formula:
- F0" is the third fundamental frequency.
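- quantizing a fundamental frequency to piano key frequencies can be sketched as follows. The formulas are not reproduced in the text above, so this is an assumption based on standard 88-key equal-temperament tuning, in which key 49 (A4) is 440 Hz and each key differs from its neighbour by a factor of 2^(1/12). Choosing the nearest key on a logarithmic scale is also an assumption, and the function names are hypothetical.

```python
import math

A4_HZ = 440.0   # frequency of piano key 49 (A4) in standard tuning
A4_KEY = 49
NUM_KEYS = 88


def key_frequency(key):
    """Frequency in Hz of piano key number `key` (1..88)."""
    return A4_HZ * 2.0 ** ((key - A4_KEY) / 12.0)


def quantize_to_piano(f0):
    """Snap a fundamental frequency to the nearest piano key frequency.

    A value of 0 marks an unvoiced segment and is returned unchanged.
    """
    if f0 <= 0:
        return 0.0
    key = round(A4_KEY + 12.0 * math.log2(f0 / A4_HZ))
    key = min(max(key, 1), NUM_KEYS)   # clamp to the range of the keyboard
    return key_frequency(key)


print(quantize_to_piano(450.0))   # snaps down to A4
print(quantize_to_piano(455.0))   # snaps up to the next key
```

Because every output is one of a fixed set of frequencies, a smoothly gliding pitch contour turns into discrete steps, which produces the characteristic electronic sound.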
- the spectral envelope and the aperiodic parameters can be determined according to the human voice audio data and the first fundamental frequency. Then the electronic vocal data can be determined according to the third fundamental frequency, spectrum envelope and aperiodic parameters.
- Fig. 4 schematically shows a flowchart of an audio data processing method according to another embodiment of the present disclosure.
- the audio data processing method 400 includes, in operation S401 , determining whether the audio data (referred to as audio) contains accompaniment music (referred to as accompaniment). If the accompaniment is included, perform operation S402. If only human voice is included but no accompaniment is included, perform operation S403.
- in operation S402, the human voice is separated from the accompaniment by using a sound source separation algorithm. Operation S403 is then performed on the separated human voice.
- the fundamental frequency is corrected based on the zero-crossing rate and the energy to obtain F0.
- the spectrum envelope SP and the aperiodic parameter AP are calculated using the human voice and the corrected fundamental frequency F0.
- the fundamental frequency is adjusted and F0' is obtained according to the electronic sound level parameter a and the electronic sound pitch parameter b set by the user.
- in operation S409, if the audio has an accompaniment, operation S410 is performed. Otherwise, operation S411 is performed.
- the accompaniment is also mixed into the human voice to generate the final audio with electronic sound effect.
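- the final mixing step can be sketched as follows. The text does not state how the accompaniment is mixed into the human voice, so sample-wise addition with clipping to the range [-1, 1] is an assumption, as are the function name and the zero-padding of the shorter signal.

```python
def mix(vocal, accompaniment=None):
    """Mix electronic vocal samples with optional accompaniment samples.

    Assumptions: samples are floats in [-1, 1], the shorter signal is
    padded with silence, and the sum is clipped to avoid overflow.
    """
    if accompaniment is None:
        return list(vocal)             # voice-only input: nothing to mix in
    length = max(len(vocal), len(accompaniment))
    mixed = []
    for i in range(length):
        v = vocal[i] if i < len(vocal) else 0.0
        a = accompaniment[i] if i < len(accompaniment) else 0.0
        mixed.append(max(-1.0, min(1.0, v + a)))
    return mixed


print(mix([0.5, 0.8], [0.25, 0.5, -0.25]))
```

When the input contained no accompaniment, the function returns the electronic vocal samples unchanged, mirroring the branch taken in operation S409.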
- Fig. 5 schematically shows a block diagram of an audio data processing device according to an embodiment of the present disclosure.
- the audio data processing device 500 includes a decomposition module 510 , an electronic sound processing module 520 and a synthesis module 530 .
- the decomposition module 510 is configured to decompose the original audio data to obtain vocal audio data and background audio data.
- the electronic sound processing module 520 is configured to perform electronic sound processing on the human voice audio data to obtain electronic sound human voice data.
- the synthesis module 530 is used for synthesizing the electronic vocal data and the background audio data to obtain the target audio data.
- the decomposition module may include a mel spectrum determination submodule, a decomposition submodule and a generation submodule.
- the mel spectrum determination submodule can be used to determine the original mel spectrum data corresponding to the original audio data.
- the decomposition sub-module can be used to determine background mel spectrum data and vocal mel spectrum data corresponding to the original mel spectrum data by using a neural network.
- the generation sub-module can be used to generate background audio data according to the background Mel spectrum data, and generate human voice audio data according to the vocal Mel spectrum data.
- the electronic sound processing module may include an extraction sub-module, a correction sub-module, an adjustment sub-module, a quantization sub-module and an electronic sound determination sub-module.
- the extraction sub-module can be used to extract the original fundamental frequency of the human voice audio data.
- the correction sub-module can be used to correct the original fundamental frequency to obtain the first fundamental frequency.
- the adjustment sub-module can be used to adjust the first fundamental frequency according to predetermined electronic sound parameters to obtain the second fundamental frequency.
- the quantization sub-module can be used to perform quantization processing on the second fundamental frequency to obtain the third fundamental frequency.
- the electronic sound determination sub-module can be used to determine electronic sound human voice data according to the third fundamental frequency.
- the correction submodule may include: a segmentation unit, an energy determination unit, a zero-crossing rate determination unit, a voiced sound determination unit, and a correction unit.
- the segmentation unit can be used to divide the human voice audio data into multiple audio segments.
- the energy determination unit may be configured to determine the energy of the audio segment for each audio segment in the plurality of audio segments.
- the zero-crossing rate determination unit can be used for determining the zero-crossing rate of the audio segment for each audio segment in the plurality of audio segments.
- the voiced sound judging unit can be used to determine whether the type of the audio segment is a voiced sound audio segment according to the energy and the zero-crossing rate.
- the correction unit can be used to correct the fundamental frequency of the voiced sound audio segment by using a linear interpolation algorithm.
- an audio segment is provided with a plurality of sampling points.
- the energy determination unit can also be used to determine the energy of the audio segment according to the value of each sampling point in the audio segment.
- the zero-crossing rate determination unit can also be used to determine whether the values of every two adjacent sampling points in the audio segment have opposite signs, and then to take, as the zero-crossing rate, the ratio of the number of times adjacent sampling points have opposite signs to the total number of sampling points.
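The units of the correction sub-module can be sketched as below. Several details are assumptions, since the text does not fix them: energy is taken as the sum of squared sample values, a segment counts as voiced when its energy is above one threshold and its zero-crossing rate is below another, and "correction" is read as filling missing fundamental frequency values (marked 0.0) in voiced segments by linear interpolation between the nearest known values. All function names and thresholds are hypothetical.

```python
from typing import List


def segment_energy(segment: List[float]) -> float:
    # Assumed definition: sum of squared sample values.
    return sum(x * x for x in segment)


def zero_crossing_rate(segment: List[float]) -> float:
    # Sign changes between adjacent samples, divided by the number of samples.
    if not segment:
        return 0.0
    crossings = sum(1 for a, b in zip(segment, segment[1:]) if a * b < 0)
    return crossings / len(segment)


def is_voiced(segment: List[float], energy_threshold: float, zcr_threshold: float) -> bool:
    # Assumed rule: voiced sound is high in energy and low in zero-crossing rate.
    return (segment_energy(segment) > energy_threshold
            and zero_crossing_rate(segment) < zcr_threshold)


def correct_f0(f0: List[float], voiced: List[bool]) -> List[float]:
    """Fill missing F0 values (0.0) of voiced segments by linear interpolation."""
    corrected = list(f0)
    known = [i for i, value in enumerate(f0) if value > 0]
    if not known:
        return corrected
    for i, (value, segment_is_voiced) in enumerate(zip(f0, voiced)):
        if not segment_is_voiced or value > 0:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            corrected[i] = f0[right]   # no earlier value: copy the next one
        elif right is None:
            corrected[i] = f0[left]    # no later value: copy the previous one
        else:
            t = (i - left) / (right - left)
            corrected[i] = f0[left] + t * (f0[right] - f0[left])
    return corrected
```

For example, `correct_f0([100.0, 0.0, 120.0], [True, True, True])` fills the middle value with the midpoint of its neighbours.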
- the predetermined electronic sound parameters may include electronic sound degree parameters and/or electronic sound tone parameters.
- the adjustment submodule may include a first determination unit, a second determination unit and an adjustment unit.
- the first determination unit may be configured to determine the variance of the fundamental frequency and/or the average value of the fundamental frequency according to the fundamental frequency of the voiced audio segment.
- the second determination unit may be configured to determine the modified fundamental frequency variance according to the electronic sound degree parameter and the fundamental frequency variance, and/or determine the modified fundamental frequency average value according to the electronic sound degree parameter and the fundamental frequency average value.
- the adjustment unit may be configured to adjust the first fundamental frequency according to the modified fundamental frequency variance and/or the modified fundamental frequency average value to obtain the second fundamental frequency.
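One way to picture the adjustment is as a rescaling of the fundamental frequency contour around its mean. The text does not state how the parameters modify the variance and the average value, so the multiplicative form below is purely an assumption: `degree` scales the variance and `tone` scales the mean, computed over voiced frames only (frames with F0 of 0.0 are treated as unvoiced and left untouched). The function name `adjust_f0` is hypothetical.

```python
from math import sqrt
from typing import List


def adjust_f0(f0: List[float], degree: float = 1.0, tone: float = 1.0) -> List[float]:
    """Rescale a first-F0 contour to a modified variance and mean (assumed form)."""
    voiced = [v for v in f0 if v > 0]
    if not voiced:
        return list(f0)

    mean = sum(voiced) / len(voiced)
    variance = sum((v - mean) ** 2 for v in voiced) / len(voiced)

    modified_mean = tone * mean            # assumption: tone parameter scales the mean
    modified_variance = degree * variance  # assumption: degree parameter scales the variance

    # Ratio of standard deviations; 1.0 when the contour is already flat.
    scale = sqrt(modified_variance / variance) if variance > 0 else 1.0

    return [modified_mean + (v - mean) * scale if v > 0 else 0.0 for v in f0]
```

Under this assumption, `degree=0.0` flattens the contour to a single pitch, while `degree=1.0, tone=1.0` returns the input unchanged.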
- the quantization submodule may include a frequency range determination unit and a third fundamental frequency determination unit.
- the frequency range determination unit can be used to determine the frequency range according to the following formula:
- scale is a frequency range
- F0' is the second fundamental frequency
- the third fundamental frequency determination unit can be used to determine the third fundamental frequency based on the frequency range according to the following formula:
- F0" is the third fundamental frequency.
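As a generic illustration of pitch quantization, and not a reconstruction of the formulas referred to above, the sketch below snaps a frequency to the nearest equal-tempered semitone. The 440 Hz reference, the 12-tone scale, and the function name `quantize_f0` are assumptions chosen for the example.

```python
import math


def quantize_f0(f0_hz: float, reference_hz: float = 440.0) -> float:
    """Snap a frequency to the nearest equal-tempered semitone (assumed scheme)."""
    if f0_hz <= 0:
        # Unvoiced frames (F0 of 0.0) are passed through unchanged.
        return 0.0
    # Integer number of semitones between the input and the reference pitch.
    semitones = round(12 * math.log2(f0_hz / reference_hz))
    # Convert the semitone index back to a frequency in Hz.
    return reference_hz * 2 ** (semitones / 12)
```

For instance, 450 Hz snaps down to 440 Hz, while 460 Hz snaps up to the next semitone at about 466.16 Hz. Removing the continuous glide between notes in this way is what produces the characteristic stepped pitch of an electronic vocal effect.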
- the above-mentioned audio data processing apparatus may further include a determination module, which may be configured to determine a spectrum envelope and aperiodic parameters according to the vocal audio data and the first fundamental frequency.
- the electronic sound determining submodule may also be used to determine electronic sound human voice data according to the third fundamental frequency, spectrum envelope and aperiodic parameters.
- the present disclosure also provides an electronic device, a readable storage medium, and a computer program product.
- FIG. 6 schematically shows a block diagram of an example electronic device 600 that may be used to implement embodiments of the present disclosure.
- Electronic device is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other suitable computers.
- Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smart phones, wearable devices, and other similar computing devices.
- the components shown herein, their connections and relationships, and their functions, are by way of example only, and are not intended to limit implementations of the disclosure described and/or claimed herein.
- the device 600 includes a computing unit 601 that can perform various appropriate actions and processing according to a computer program stored in a read-only memory (ROM) 602 or loaded from a storage unit 608 into a random-access memory (RAM) 603. The RAM 603 can also store various programs and data necessary for the operation of the device 600.
- the computing unit 601, ROM 602, and RAM 603 are connected to each other through a bus 604.
- An input/output (I/O) interface 605 is also connected to the bus 604 .
- multiple components of the device 600 are connected to the I/O interface 605, including: an input unit 606, such as a keyboard, a mouse, etc.; an output unit 607, such as various types of displays, speakers, etc.; a storage unit 608, such as a magnetic disk, an optical disk, etc.; and a communication unit 609, such as a network card, a modem, a wireless communication transceiver, and the like.
- the communication unit 609 allows the device 600 to exchange information/data with other devices over a computer network such as the Internet and/or various telecommunication networks.
- the computing unit 601 may be any of various general-purpose and/or special-purpose processing components having processing and computing capabilities. Some examples of the computing unit 601 include, but are not limited to, central processing units (CPUs), graphics processing units (GPUs), various dedicated artificial intelligence (AI) computing chips, various computing units that run machine learning model algorithms, digital signal processors (DSPs), and any suitable processor, controller, microcontroller, etc.
- the computing unit 601 executes various methods and processes described above, such as audio data processing methods.
- the audio data processing method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as storage unit 608 .
- part or all of the computer program may be loaded and/or installed on the device 600 via the ROM 602 and/or the communication unit 609.
- when the computer program is loaded into the RAM 603 and executed by the computing unit 601, one or more steps of the audio data processing method described above may be performed.
- the computing unit 601 may be configured to execute the audio data processing method in any other suitable way (for example, by means of firmware).
- Various implementations of the systems and techniques described herein can be implemented in digital electronic circuit systems, integrated circuit systems, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof.
- the programmable processor may be a special-purpose or general-purpose programmable processor, and may receive data and instructions from a storage system, at least one input device, and at least one output device, and transmit data and instructions to the storage system, the at least one input device, and the at least one output device.
- Program codes for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general-purpose computer, a special-purpose computer, or other programmable data processing devices, so that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowcharts and/or block diagrams to be implemented.
- the program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
- a machine-readable medium may be a tangible medium that may contain or store a program for use by or in conjunction with an instruction execution system, apparatus, or device.
- a machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium.
- a machine-readable medium may include, but is not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatus, or devices, or any suitable combination of the foregoing.
- machine-readable storage media would include one or more wire-based electrical connections, portable computer discs, hard drives, random access memory (RAM), read only memory (ROM), erasable programmable read only memory (EPROM or flash memory), optical fiber, compact disk read only memory (CD-ROM), optical storage, magnetic storage, or any suitable combination of the foregoing.
- to provide interaction with a user, the systems and techniques described herein can be implemented on a computer having a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user, and a keyboard and pointing device (e.g., a mouse or a trackball) through which the user can provide input to the computer.
- Other kinds of devices can also be used to provide interaction with the user; for example, the feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback), and input from the user can be received in any form (including acoustic input, speech input, or tactile input).
- the systems and techniques described herein can be implemented in a computing system that includes back-end components (e.g., as a data server), or a computing system that includes middleware components (e.g., an application server), or a computing system that includes front-end components (e.g., a user computer having a graphical user interface or web browser through which a user can interact with embodiments of the systems and techniques described herein), or a computing system that includes any combination of such back-end, middleware, or front-end components.
- the components of the system can be interconnected by any form or medium of digital data communication, e.g., a communication network. Examples of communication networks include: Local Area Network (LAN), Wide Area Network (WAN) and the Internet.
- a computer system may include clients and servers.
- Clients and servers are generally remote from each other and typically interact through a communication network.
- the relationship of client and server arises by computer programs running on the respective computers and having a client-server relationship to each other.
- the server can be a cloud server, a server of a distributed system, or a server combined with a blockchain.
- steps may be reordered, added or deleted using the various forms of flow shown above.
- the steps described in the present disclosure may be executed in parallel, sequentially, or in a different order, as long as the desired result of the technical solution disclosed in the present disclosure can be achieved; no limitation is imposed herein.
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Multimedia (AREA)
- Acoustics & Sound (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Signal Processing (AREA)
- Quality & Reliability (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Circuit For Audible Band Transducer (AREA)
Claims (21)
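- An audio data processing method, comprising: decomposing original audio data to obtain human voice audio data and background audio data; performing electronic sound processing on the human voice audio data to obtain electronic sound human voice data; and synthesizing the electronic sound human voice data and the background audio data to obtain target audio data.
- The method according to claim 1, wherein the decomposing original audio data to obtain background audio data and human voice audio data comprises: determining original Mel spectrum data corresponding to the original audio data; determining, by using a neural network, background Mel spectrum data and human voice Mel spectrum data corresponding to the original Mel spectrum data; and generating the background audio data according to the background Mel spectrum data, and generating the human voice audio data according to the human voice Mel spectrum data.
- The method according to claim 1, wherein the performing electronic sound processing on the human voice audio data to obtain electronic sound human voice data comprises: extracting an original fundamental frequency of the human voice audio data; correcting the original fundamental frequency to obtain a first fundamental frequency; adjusting the first fundamental frequency according to a predetermined electronic sound parameter to obtain a second fundamental frequency; performing quantization processing on the second fundamental frequency to obtain a third fundamental frequency; and determining the electronic sound human voice data according to the third fundamental frequency.
- The method according to claim 3, wherein the correcting the original fundamental frequency to obtain a first fundamental frequency comprises: dividing the human voice audio data into a plurality of audio segments; for each audio segment of the plurality of audio segments, determining an energy and a zero-crossing rate of the audio segment; determining, according to the energy and the zero-crossing rate, whether the audio segment is a voiced audio segment; and correcting a fundamental frequency of the voiced audio segment by using a linear interpolation algorithm.
- The method according to claim 4, wherein the audio segment is provided with a plurality of sampling points; and the determining the energy of the audio segment comprises: determining the energy of the audio segment according to a value of each sampling point in the audio segment.
- The method according to claim 4, wherein the audio segment comprises a plurality of sampling points; and the determining the zero-crossing rate of the audio segment comprises: determining whether values of every two adjacent sampling points in the audio segment have opposite signs; and determining, as the zero-crossing rate, a ratio of the number of times adjacent sampling points in the audio segment have opposite signs to the total number of sampling points.
- The method according to claim 4, wherein the predetermined electronic sound parameter comprises an electronic sound degree parameter and/or an electronic sound tone parameter; and the adjusting the first fundamental frequency according to a predetermined electronic sound parameter to obtain a second fundamental frequency comprises: determining a fundamental frequency variance and/or a fundamental frequency average value according to the fundamental frequency of the voiced audio segment; determining a modified fundamental frequency variance according to the electronic sound degree parameter and the fundamental frequency variance, and/or determining a modified fundamental frequency average value according to the electronic sound tone parameter and the fundamental frequency average value; and adjusting the first fundamental frequency according to the modified fundamental frequency variance and/or the modified fundamental frequency average value to obtain the second fundamental frequency.
- The method according to any one of claims 3-7, further comprising: determining a spectrum envelope and an aperiodic parameter according to the human voice audio data and the first fundamental frequency; wherein the determining the electronic sound human voice data according to the third fundamental frequency comprises: determining the electronic sound human voice data according to the third fundamental frequency, the spectrum envelope and the aperiodic parameter.
- An audio data processing apparatus, comprising: a decomposition module configured to decompose original audio data to obtain human voice audio data and background audio data; an electronic sound processing module configured to perform electronic sound processing on the human voice audio data to obtain electronic sound human voice data; and a synthesis module configured to synthesize the electronic sound human voice data and the background audio data to obtain target audio data.
- The apparatus according to claim 10, wherein the decomposition module comprises: a Mel spectrum determination sub-module configured to determine original Mel spectrum data corresponding to the original audio data; a decomposition sub-module configured to determine, by using a neural network, background Mel spectrum data and human voice Mel spectrum data corresponding to the original Mel spectrum data; and a generation sub-module configured to generate the background audio data according to the background Mel spectrum data, and generate the human voice audio data according to the human voice Mel spectrum data.
- The apparatus according to claim 10, wherein the electronic sound processing module comprises: an extraction sub-module configured to extract an original fundamental frequency of the human voice audio data; a correction sub-module configured to correct the original fundamental frequency to obtain a first fundamental frequency; an adjustment sub-module configured to adjust the first fundamental frequency according to a predetermined electronic sound parameter to obtain a second fundamental frequency; a quantization sub-module configured to perform quantization processing on the second fundamental frequency to obtain a third fundamental frequency; and an electronic sound determination sub-module configured to determine the electronic sound human voice data according to the third fundamental frequency.
- The apparatus according to claim 12, wherein the correction sub-module comprises: a segmentation unit configured to divide the human voice audio data into a plurality of audio segments; an energy determination unit configured to determine, for each audio segment of the plurality of audio segments, an energy of the audio segment; a zero-crossing rate determination unit configured to determine, for each audio segment of the plurality of audio segments, a zero-crossing rate of the audio segment; a voiced sound determination unit configured to determine, according to the energy and the zero-crossing rate, whether the type of the audio segment is a voiced audio segment; and a correction unit configured to correct a fundamental frequency of the voiced audio segment by using a linear interpolation algorithm.
- The apparatus according to claim 13, wherein the audio segment is provided with a plurality of sampling points; and the energy determination unit is further configured to: determine the energy of the audio segment according to a value of each sampling point in the audio segment.
- The apparatus according to claim 13, wherein the audio segment comprises a plurality of sampling points; and the zero-crossing rate determination unit is further configured to: determine whether values of every two adjacent sampling points in the audio segment have opposite signs; and determine, as the zero-crossing rate, a ratio of the number of times adjacent sampling points in the audio segment have opposite signs to the total number of sampling points.
- The apparatus according to claim 13, wherein the predetermined electronic sound parameter comprises an electronic sound degree parameter and/or an electronic sound tone parameter; and the adjustment sub-module comprises: a first determination unit configured to determine a fundamental frequency variance and/or a fundamental frequency average value according to the fundamental frequency of the voiced audio segment; a second determination unit configured to determine a modified fundamental frequency variance according to the electronic sound degree parameter and the fundamental frequency variance, and/or determine a modified fundamental frequency average value according to the electronic sound degree parameter and the fundamental frequency average value; and an adjustment unit configured to adjust the first fundamental frequency according to the modified fundamental frequency variance and/or the modified fundamental frequency average value to obtain the second fundamental frequency.
- The apparatus according to any one of claims 12-16, further comprising: a determination module configured to determine a spectrum envelope and an aperiodic parameter according to the human voice audio data and the first fundamental frequency; wherein the electronic sound determination sub-module is further configured to: determine the electronic sound human voice data according to the third fundamental frequency, the spectrum envelope and the aperiodic parameter.
- An electronic device, comprising: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the method according to any one of claims 1-9.
- A non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are configured to cause the computer to perform the method according to any one of claims 1-9.
- A computer program product, comprising a computer program/instructions, characterized in that the computer program/instructions, when executed by a processor, implement the steps of the method according to any one of claims 1-9.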
Priority Applications (3)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US17/915,624 US20240212703A1 (en) | 2021-08-24 | 2022-03-22 | Method of processing audio data, device, and storage medium |
| JP2022560146A JP7465992B2 (ja) | 2021-08-24 | 2022-03-22 | オーディオデータ処理方法、装置、機器、記憶媒体及びプログラム |
| EP22773390.4A EP4167226B1 (en) | 2021-08-24 | 2022-03-22 | Audio data processing method and apparatus, and device and storage medium |
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202110978065.3A CN113689837B (zh) | 2021-08-24 | 2021-08-24 | 音频数据处理方法、装置、设备以及存储介质 |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2023024501A1 (zh) | 2023-03-02 |
Family
ID=78582118
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/CN2022/082305 Ceased WO2023024501A1 (zh) | 2021-08-24 | 2022-03-22 | 音频数据处理方法、装置、设备以及存储介质 |
Country Status (5)
| Country | Link |
|---|---|
| US (1) | US20240212703A1 (zh) |
| EP (1) | EP4167226B1 (zh) |
| JP (1) | JP7465992B2 (zh) |
| CN (1) | CN113689837B (zh) |
| WO (1) | WO2023024501A1 (zh) |
Families Citing this family (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN113689837B (zh) * | 2021-08-24 | 2023-08-29 | 北京百度网讯科技有限公司 | 音频数据处理方法、装置、设备以及存储介质 |
| CN114449339B (zh) * | 2022-02-16 | 2024-04-12 | 深圳万兴软件有限公司 | 背景音效的转换方法、装置、计算机设备及存储介质 |
| CN115054915B (zh) * | 2022-07-12 | 2026-03-20 | 北京字跳网络技术有限公司 | 环境音频播放方法、装置、存储介质以及电子设备 |
| CN116312431B (zh) * | 2023-03-22 | 2023-11-24 | 广州资云科技有限公司 | 电音基调控制方法、装置、计算机设备和存储介质 |
| CN119922371A (zh) * | 2023-10-31 | 2025-05-02 | 北京字跳网络技术有限公司 | 一种多媒体资源处理方法、装置、设备及存储介质 |
Citations (9)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JP2004212473A (ja) * | 2002-12-27 | 2004-07-29 | Matsushita Electric Ind Co Ltd | カラオケ装置及びカラオケ再生方法 |
| CN103440862A (zh) * | 2013-08-16 | 2013-12-11 | 北京奇艺世纪科技有限公司 | 一种语音与音乐合成的方法、装置以及设备 |
| CN108417228A (zh) * | 2018-02-02 | 2018-08-17 | 福州大学 | 乐器音色迁移下的人声音色相似性度量方法 |
| CN108922506A (zh) * | 2018-06-29 | 2018-11-30 | 广州酷狗计算机科技有限公司 | 歌曲音频生成方法、装置和计算机可读存储介质 |
| CN111370019A (zh) * | 2020-03-02 | 2020-07-03 | 字节跳动有限公司 | 声源分离方法及装置、神经网络的模型训练方法及装置 |
| CN111899706A (zh) * | 2020-07-30 | 2020-11-06 | 广州酷狗计算机科技有限公司 | 音频制作方法、装置、设备及存储介质 |
| CN112086085A (zh) * | 2020-08-18 | 2020-12-15 | 珠海市杰理科技股份有限公司 | 音频信号的和声处理方法、装置、电子设备和存储介质 |
| CN212660311U (zh) * | 2020-08-27 | 2021-03-05 | 深圳市十盏灯科技有限责任公司 | 具有耳返功能的k歌耳机 |
| CN113689837A (zh) * | 2021-08-24 | 2021-11-23 | 北京百度网讯科技有限公司 | 音频数据处理方法、装置、设备以及存储介质 |
Family Cites Families (18)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JPH04340600A (ja) * | 1991-05-17 | 1992-11-26 | Mitsubishi Electric Corp | 音声復号化装置 |
| JP3266819B2 (ja) * | 1996-07-30 | 2002-03-18 | 株式会社エイ・ティ・アール人間情報通信研究所 | 周期信号変換方法、音変換方法および信号分析方法 |
| KR0176623B1 (ko) * | 1996-10-28 | 1999-04-01 | 삼성전자주식회사 | 연속 음성의 유성음부와 무성자음부의 자동 추출방법 및 장치 |
| US6078880A (en) * | 1998-07-13 | 2000-06-20 | Lockheed Martin Corporation | Speech coding system and method including voicing cut off frequency analyzer |
| US7567898B2 (en) * | 2005-07-26 | 2009-07-28 | Broadcom Corporation | Regulation of volume of voice in conjunction with background sound |
| JP5085700B2 (ja) * | 2010-08-30 | 2012-11-28 | 株式会社東芝 | 音声合成装置、音声合成方法およびプログラム |
| JP5589767B2 (ja) | 2010-10-29 | 2014-09-17 | ヤマハ株式会社 | 音声処理装置 |
| JP5830364B2 (ja) | 2011-12-01 | 2015-12-09 | 日本放送協会 | 韻律変換装置およびそのプログラム |
| CN111465982B (zh) | 2017-12-12 | 2024-10-15 | 索尼公司 | 信号处理设备和方法、训练设备和方法以及程序 |
| CN110164469B (zh) | 2018-08-09 | 2023-03-10 | 腾讯科技(深圳)有限公司 | 一种多人语音的分离方法和装置 |
| CN109166593B (zh) * | 2018-08-17 | 2021-03-16 | 腾讯音乐娱乐科技(深圳)有限公司 | 音频数据处理方法、装置及存储介质 |
| CN109346109B (zh) * | 2018-12-05 | 2020-02-07 | 百度在线网络技术(北京)有限公司 | 基频提取方法和装置 |
| JP7309155B2 (ja) | 2019-01-10 | 2023-07-18 | グリー株式会社 | コンピュータプログラム、サーバ装置、端末装置及び音声信号処理方法 |
| CN110706679B (zh) * | 2019-09-30 | 2022-03-29 | 维沃移动通信有限公司 | 一种音频处理方法及电子设备 |
| CN111243619B (zh) * | 2020-01-06 | 2023-09-22 | 平安科技(深圳)有限公司 | 语音信号分割模型的训练方法、装置和计算机设备 |
| CN111724757A (zh) * | 2020-06-29 | 2020-09-29 | 腾讯音乐娱乐科技(深圳)有限公司 | 一种音频数据处理方法及相关产品 |
| CN113178183B (zh) * | 2021-04-30 | 2024-05-14 | 杭州网易云音乐科技有限公司 | 音效处理方法、装置、存储介质和计算设备 |
| CN114360587A (zh) * | 2021-12-27 | 2022-04-15 | 北京百度网讯科技有限公司 | 识别音频的方法、装置、设备、介质及产品 |
2021
- 2021-08-24: CN, application CN202110978065.3A, document CN113689837B (zh), status Active
2022
- 2022-03-22: WO, application PCT/CN2022/082305, document WO2023024501A1 (zh), status Ceased
- 2022-03-22: EP, application EP22773390.4A, document EP4167226B1 (en), status Active
- 2022-03-22: US, application US17/915,624, document US20240212703A1 (en), status Pending
- 2022-03-22: JP, application JP2022560146A, document JP7465992B2 (ja), status Active
Non-Patent Citations (1)
| Title |
|---|
| See also references of EP4167226A4 |
Also Published As
| Publication number | Publication date |
|---|---|
| EP4167226B1 (en) | 2026-02-11 |
| US20240212703A1 (en) | 2024-06-27 |
| EP4167226A4 (en) | 2024-09-18 |
| CN113689837B (zh) | 2023-08-29 |
| JP2023542760A (ja) | 2023-10-12 |
| JP7465992B2 (ja) | 2024-04-11 |
| EP4167226A1 (en) | 2023-04-19 |
| CN113689837A (zh) | 2021-11-23 |
Similar Documents
| Publication | Title | Publication Date |
|---|---|---|
| CN113689837B (zh) | 音频数据处理方法、装置、设备以及存储介质 | |
| CN114141228B (zh) | 语音合成模型的训练方法、语音合成方法和装置 | |
| KR102611024B1 (ko) | 음성 합성 방법, 장치, 기기 및 컴퓨터 기록 매체 | |
| CN114495956B (zh) | 语音处理方法、装置、设备及存储介质 | |
| WO2020215666A1 (zh) | 语音合成方法、装置、计算机设备及存储介质 | |
| CN113963679B (zh) | 一种语音风格迁移方法、装置、电子设备及存储介质 | |
| EP0970466A2 (en) | Voice conversion system and methodology | |
| EP4276822A1 (en) | Method and apparatus for processing audio, electronic device and storage medium | |
| CN112967732B (zh) | 调整均衡器的方法、装置、设备和计算机可读存储介质 | |
| CN111261177A (zh) | 语音转换方法、电子装置及计算机可读存储介质 | |
| CN113421584A (zh) | 音频降噪方法、装置、计算机设备及存储介质 | |
| CN113160849A (zh) | 歌声合成方法、装置及电子设备和计算机可读存储介质 | |
| Dinther et al. | Perception of acoustic scale and size in musical instrument sounds | |
| CN114203155A (zh) | 训练声码器和语音合成的方法和装置 | |
| CN117672254A (zh) | 语音转换方法、装置、计算机设备及存储介质 | |
| CN113889073A (zh) | 语音处理方法、装置、电子设备和存储介质 | |
| CN119360810A (zh) | 一种音乐生成方法及装置 | |
| CN114999440B (zh) | 虚拟形象生成方法、装置、设备、存储介质以及程序产品 | |
| KR102611003B1 (ko) | 음성 처리 방법, 장치, 기기 및 컴퓨터 기록 매체 | |
| CN116129839A (zh) | 音频数据处理方法及装置、电子设备和介质 | |
| CN113066472A (zh) | 合成语音处理方法及相关装置 | |
| CN118609598A (zh) | 一种信息展示方法、装置及相关设备 | |
| US20230206943A1 (en) | Audio recognizing method, apparatus, device, medium and product | |
| CN114420141B (zh) | 声码器的训练方法、装置、设备和存储介质 | |
| CN117542346A (zh) | 一种语音评价方法、装置、设备及存储介质 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | ENP | Entry into the national phase | Ref document number: 2022560146; Country of ref document: JP; Kind code of ref document: A |
| | ENP | Entry into the national phase | Ref document number: 2022773390; Country of ref document: EP; Effective date: 20220929 |
| | WWE | Wipo information: entry into national phase | Ref document number: 17915624; Country of ref document: US |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | WWG | Wipo information: grant in national office | Ref document number: 2022773390; Country of ref document: EP |
