WO2023217003A1 - Audio processing method, apparatus, device and storage medium - Google Patents
- Publication number
- WO2023217003A1 (PCT/CN2023/092377)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- audio
- accompaniment
- control
- response
- interface
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10H—ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
- G10H1/00—Details of electrophonic musical instruments
- G10H1/36—Accompaniment arrangements
- G10H1/361—Recording/reproducing of accompaniment for use with an external source, e.g. karaoke systems
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10H—ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
- G10H1/00—Details of electrophonic musical instruments
- G10H1/0008—Associated control or indicating means
- G10H1/0025—Automatic or semi-automatic music composition, e.g. producing random music, applying rules from music theory or modifying a musical piece
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10H—ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
- G10H1/00—Details of electrophonic musical instruments
- G10H1/36—Accompaniment arrangements
- G10H1/361—Recording/reproducing of accompaniment for use with an external source, e.g. karaoke systems
- G10H1/366—Recording/reproducing of accompaniment for use with an external source, e.g. karaoke systems with means for modifying or correcting the external signal, e.g. pitch correction, reverberation, changing a singer's voice
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10H—ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
- G10H1/00—Details of electrophonic musical instruments
- G10H1/0091—Means for obtaining special acoustic effects
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10H—ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
- G10H1/00—Details of electrophonic musical instruments
- G10H1/36—Accompaniment arrangements
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10H—ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
- G10H1/00—Details of electrophonic musical instruments
- G10H1/36—Accompaniment arrangements
- G10H1/361—Recording/reproducing of accompaniment for use with an external source, e.g. karaoke systems
- G10H1/368—Recording/reproducing of accompaniment for use with an external source, e.g. karaoke systems displaying animated or moving pictures synchronized with the music or audio part
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/003—Changing voice quality, e.g. pitch or formants
- G10L21/007—Changing voice quality, e.g. pitch or formants characterised by the process used
- G10L21/01—Correction of time axis
-
- G—PHYSICS
- G11—INFORMATION STORAGE
- G11B—INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
- G11B27/00—Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
- G11B27/02—Editing, e.g. varying the order of information signals recorded on, or reproduced from, record carriers
- G11B27/031—Electronic editing of digitised analogue information signals, e.g. audio or video signals
-
- G—PHYSICS
- G11—INFORMATION STORAGE
- G11B—INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
- G11B27/00—Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
- G11B27/10—Indexing; Addressing; Timing or synchronising; Measuring tape travel
- G11B27/34—Indicating arrangements
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10H—ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
- G10H2210/00—Aspects or methods of musical processing having intrinsic musical character, i.e. involving musical theory or musical parameters or relying on musical knowledge, as applied in electrophonic musical tools or instruments
- G10H2210/005—Musical accompaniment, i.e. complete instrumental rhythm synthesis added to a performed melody, e.g. as output by drum machines
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10H—ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
- G10H2210/00—Aspects or methods of musical processing having intrinsic musical character, i.e. involving musical theory or musical parameters or relying on musical knowledge, as applied in electrophonic musical tools or instruments
- G10H2210/031—Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal
- G10H2210/056—Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal for extraction or identification of individual instrumental parts, e.g. melody, chords, bass; Identification or separation of instrumental parts by their characteristic voices or timbres
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10H—ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
- G10H2210/00—Aspects or methods of musical processing having intrinsic musical character, i.e. involving musical theory or musical parameters or relying on musical knowledge, as applied in electrophonic musical tools or instruments
- G10H2210/031—Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal
- G10H2210/071—Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal for rhythm pattern analysis or rhythm style recognition
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10H—ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
- G10H2210/00—Aspects or methods of musical processing having intrinsic musical character, i.e. involving musical theory or musical parameters or relying on musical knowledge, as applied in electrophonic musical tools or instruments
- G10H2210/101—Music Composition or musical creation; Tools or processes therefor
- G10H2210/125—Medley, i.e. linking parts of different musical pieces in one single piece, e.g. sound collage, DJ mix
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10H—ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
- G10H2210/00—Aspects or methods of musical processing having intrinsic musical character, i.e. involving musical theory or musical parameters or relying on musical knowledge, as applied in electrophonic musical tools or instruments
- G10H2210/341—Rhythm pattern selection, synthesis or composition
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10H—ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
- G10H2220/00—Input/output interfacing specifically adapted for electrophonic musical tools or instruments
- G10H2220/005—Non-interactive screen display of musical or status data
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10H—ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
- G10H2220/00—Input/output interfacing specifically adapted for electrophonic musical tools or instruments
- G10H2220/091—Graphical user interface [GUI] specifically adapted for electrophonic musical instruments, e.g. interactive musical displays, musical instrument icons or menus; Details of user interactions therewith
- G10H2220/096—Graphical user interface [GUI] specifically adapted for electrophonic musical instruments, e.g. interactive musical displays, musical instrument icons or menus; Details of user interactions therewith using a touch screen
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/04—Time compression or expansion
- G10L21/055—Time compression or expansion for synchronising with other signals, e.g. video signals
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04R—LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; ELECTRIC HEARING AIDS; PUBLIC ADDRESS SYSTEMS
- H04R2430/00—Signal processing covered by H04R, not provided for in its groups
Definitions
- the embodiments of the present disclosure relate to the field of human-computer interaction technology, and in particular, to an audio processing method, device, equipment and storage medium.
- Audio editing is a common way to create media content.
- Embodiments of the present disclosure provide an audio processing method, device, equipment and storage medium to improve audio processing efficiency and meet users' personalized needs for audio production.
- an embodiment of the present disclosure provides an audio processing method, including:
- the vocal and the accompaniment are mixed to obtain target audio.
- an audio processing device including:
- an acquisition module configured to acquire the human voice in response to the first instruction
- the acquisition module is also used to acquire the accompaniment in response to the second instruction
- a processing module configured to mix the vocal and the accompaniment in response to the third instruction to obtain target audio.
- embodiments of the present disclosure provide an electronic device, including: a processor and a memory;
- the memory stores computer execution instructions
- the processor executes the computer execution instructions stored in the memory, so that the processor performs the audio processing method described in the above first aspect and various possible designs of the first aspect.
- embodiments of the present disclosure provide a computer-readable storage medium.
- Computer-executable instructions are stored in the computer-readable storage medium.
- When a processor executes the computer-executable instructions, the audio processing method described in the above first aspect and various possible designs of the first aspect is implemented.
- embodiments of the present disclosure provide a computer program product, including a computer program that, when executed by a processor, implements the audio processing method described in the first aspect and various possible designs of the first aspect.
- embodiments of the present disclosure provide a computer program that, when executed by a processor, implements the audio processing method described in the first aspect and various possible designs of the first aspect.
- Figure 1 is a schematic diagram of an application scenario of the audio processing method provided by an embodiment of the present disclosure
- Figure 2 is a first schematic flowchart of the audio processing method provided by an embodiment of the present disclosure.
- Figure 3 is a second schematic flowchart of the audio processing method provided by an embodiment of the present disclosure.
- Figure 4 is a schematic diagram of a user interface provided by an embodiment of the present disclosure.
- Figure 5 is a first schematic diagram of user interface changes provided by an embodiment of the present disclosure.
- Figure 6a is a second schematic diagram of the user interface provided by an embodiment of the present disclosure.
- Figure 6b is a third schematic diagram of the user interface provided by an embodiment of the present disclosure.
- Figure 7a is a second schematic diagram of user interface changes provided by an embodiment of the present disclosure.
- Figure 7b is a third schematic diagram of user interface changes provided by an embodiment of the present disclosure.
- Figure 7c is a fourth schematic diagram of user interface changes provided by an embodiment of the present disclosure.
- Figure 8 is a fifth schematic diagram of user interface changes provided by an embodiment of the present disclosure.
- Figure 9 is a fourth schematic diagram of the user interface provided by an embodiment of the present disclosure.
- Figure 10a is a sixth schematic diagram of user interface changes provided by an embodiment of the present disclosure.
- Figure 10b is a seventh schematic diagram of user interface changes provided by an embodiment of the present disclosure.
- Figure 11 is a structural block diagram of an audio processing device provided by an embodiment of the present disclosure.
- Figure 12 is a structural block diagram of an electronic device provided by an embodiment of the present disclosure.
- Embodiments of the present disclosure propose an audio processing method that provides a visual, intelligent audio processing flow, can automatically fuse the human voice and the accompaniment into the target audio, and allows audio editing to be performed directly after the intelligent processing.
- the material package can be packaged and output to meet the personalized needs of different users and improve the user's audio production experience.
- FIG. 1 is a schematic diagram of an application scenario of the audio processing method provided by an embodiment of the present disclosure.
- the application scenario provided by this embodiment includes a terminal device 101 and a server 102, and the terminal device 101 is communicatively connected with the server 102.
- the terminal device 101 is preset with an audio processing application APP, which provides the user with one or more of the following functions: recording studio editing function, accompaniment separation function, audio mixing function, style synthesis function, and audio optimization function.
- the user accesses the server 102 through the terminal device 101, for example, uploads two pieces of audio data through the terminal device 101.
- The server 102 first performs sound source separation on these two pieces of audio data (including separation of human voices, musical instruments, etc.), for example, obtaining the human voice in one piece of audio data and the accompaniment in the other piece of audio data; secondly, it performs segment recognition on the human voice and the accompaniment to obtain the target segments of the human voice and the accompaniment (such as the climax segments); finally, it performs rhythm detection and rhythm alignment on the target segments of the human voice and the accompaniment to generate the mixed target audio.
- the server 102 sends the target audio to the terminal device 101 so that the user can listen to, save, share the target audio, or perform post-processing on the target audio.
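- The server-side flow described above (source separation, segment recognition, rhythm alignment, mixing) could be organized roughly as follows. This is a minimal sketch under stated assumptions, not the disclosed implementation: the `Stages` container, the `process` function and the equal-weight averaging in `mix` are hypothetical names and choices introduced for illustration, and every stage is passed in as a callable because the disclosure does not specify the underlying models.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Samples = List[float]  # mono audio as a plain list of samples (assumption)


@dataclass
class Stages:
    """Pluggable processing stages; all names here are hypothetical."""
    separate_vocal: Callable[[Samples], Samples]
    separate_accompaniment: Callable[[Samples], Samples]
    find_target_segment: Callable[[Samples], Tuple[int, int]]  # (start, end) indices
    align_rhythm: Callable[[Samples, Samples], Tuple[Samples, Samples]]


def mix(a: Samples, b: Samples) -> Samples:
    """Average two tracks sample by sample, truncated to the shorter one."""
    n = min(len(a), len(b))
    return [0.5 * (a[i] + b[i]) for i in range(n)]


def process(audio_1: Samples, audio_2: Samples, stages: Stages) -> Samples:
    # 1. Sound source separation: vocal from one upload, accompaniment from the other.
    vocal = stages.separate_vocal(audio_1)
    accompaniment = stages.separate_accompaniment(audio_2)
    # 2. Segment recognition: locate the target segment (e.g. the climax) of each track.
    v_start, v_end = stages.find_target_segment(vocal)
    a_start, a_end = stages.find_target_segment(accompaniment)
    # 3. Rhythm detection and alignment on the two target segments.
    v_seg, a_seg = stages.align_rhythm(vocal[v_start:v_end],
                                       accompaniment[a_start:a_end])
    # 4. Mix the aligned segments into the target audio.
    return mix(v_seg, a_seg)
```

Passing the stages in as callables keeps the orchestration independent of whichever separation or recognition model is actually used.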
- the terminal device in this embodiment can be any electronic device with information display function, including but not limited to smartphones, laptops, tablets, smart vehicle equipment, smart wearable devices, smart screens, etc.
- the server in this embodiment can be an ordinary server or a cloud server.
- the cloud server is also called a cloud computing server or a cloud host, and is a host product in the cloud computing service system.
- the server can also be a distributed system server or a server combined with a blockchain.
- the product implementation form of the present disclosure is program code included in platform software and deployed on electronic devices (which may also be hardware with computing capabilities such as computing clouds or mobile terminals).
- the program code of the present disclosure may be stored inside an electronic device.
- the program code runs in the electronic device's host memory and/or GPU memory.
- Embodiments of the present disclosure provide an audio processing method, device, equipment and storage medium.
- The method includes: in response to the first instruction, obtaining the human voice in a piece of audio uploaded by the user; in response to the second instruction, obtaining the accompaniment in another piece of audio uploaded by the user; and in response to the third instruction, automatically mixing the human voice and the accompaniment from the two pieces of audio, thereby improving audio processing efficiency and meeting the user's personalized needs for audio production.
- FIG. 2 is a schematic flowchart 1 of an audio processing method provided by an embodiment of the present disclosure. As shown in Figure 2, the method of this embodiment can be applied to terminal devices or servers.
- the audio processing method includes:
- Step 201 In response to the first instruction, obtain a human voice.
- obtaining the human voice includes: in response to the first instruction, obtaining audio data containing only the human voice.
- a first instruction is generated, and the human voice is acquired according to the first instruction.
- the first instruction is not only used to instruct the acquisition of audio data containing human voice, but also used to trigger the extraction of the human voice part of the audio data.
- Step 202 In response to the second instruction, obtain the accompaniment.
- obtaining the accompaniment includes: in response to the second instruction, obtaining audio data containing only the accompaniment.
- a second instruction is generated, and the accompaniment is obtained according to the second instruction.
- the second instruction not only instructs to obtain the audio data containing the accompaniment, but is also used to trigger the extraction of the accompaniment part of the audio data.
- Step 203 In response to the third instruction, mix the vocal and the accompaniment to obtain the target audio.
- a third instruction is generated, and the vocal and accompaniment are mixed according to the third instruction to obtain the target audio.
- Mashing up the human voice and the accompaniment can also be described as mixing the human voice and the accompaniment.
- The user uploads the two pieces of audio data to be mixed through interface touch or voice control. The human voice of one piece of audio data and the accompaniment of the other piece are extracted respectively, realizing automatic mixing and matching of the human voice and accompaniment from the two pieces of audio, improving audio processing efficiency and meeting the user's personalized needs for audio production.
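- The three-instruction flow of steps 201 to 203 can be pictured as a small stateful session in which the third instruction is only meaningful once both the human voice and the accompaniment have been obtained. The sketch below is illustrative only: the `MixSession` class, its method names and the injected `extract_*` and `mix` callables are assumptions, not part of the disclosure.

```python
from typing import Callable, List, Optional

Samples = List[float]


class MixSession:
    """Hypothetical holder for the state built up by the three instructions."""

    def __init__(self,
                 extract_vocal: Callable[[Samples], Samples],
                 extract_accompaniment: Callable[[Samples], Samples],
                 mix: Callable[[Samples, Samples], Samples]) -> None:
        self._extract_vocal = extract_vocal
        self._extract_accompaniment = extract_accompaniment
        self._mix = mix
        self.vocal: Optional[Samples] = None
        self.accompaniment: Optional[Samples] = None

    def on_first_instruction(self, audio: Samples) -> None:
        # Step 201: obtain the human voice from the first uploaded audio.
        self.vocal = self._extract_vocal(audio)

    def on_second_instruction(self, audio: Samples) -> None:
        # Step 202: obtain the accompaniment from the second uploaded audio.
        self.accompaniment = self._extract_accompaniment(audio)

    def on_third_instruction(self) -> Samples:
        # Step 203: mix, but only if both inputs are already available.
        if self.vocal is None or self.accompaniment is None:
            raise RuntimeError("both the vocal and the accompaniment must be obtained first")
        return self._mix(self.vocal, self.accompaniment)
```

The same handlers could be invoked from a touch event, a recognized voice command or a physical button, since each of these only needs to produce the corresponding instruction.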
- FIG. 3 is a schematic flowchart 2 of an audio processing method provided by an embodiment of the present disclosure. As shown in Figure 3, the method of this embodiment can be applied to terminal devices or servers.
- the audio processing method includes:
- Step 301 In response to a touch operation on the first control on the first interface, import the first audio, and separate the human voice from the first audio.
- Step 302 In response to the touch operation on the second control on the first interface, import the second audio, and separate the accompaniment from the second audio.
- Step 303 In response to the touch operation on the third control on the first interface, mix the vocal and the accompaniment to obtain the target audio.
- the first interface can also be described as an audio import interface.
- The touch operations on the first control, the second control and the third control on the first interface include but are not limited to click operations.
- FIG. 4 is a schematic diagram of a user interface provided by an embodiment of the present disclosure.
- the first interface 400 includes: a first control 401 , a second control 402 , and a third control 403 .
- the first control 401 is used to extract the human voice from the audio data
- the second control 402 is used to extract the accompaniment from the audio data
- the third control 403 is used to automatically mash (mix) the extracted human voice and accompaniment.
- interface controls in this embodiment include but are not limited to icons, buttons, drop-down boxes, sliders, etc.
- touch operations include but are not limited to click operations, long press operations, double click operations, sliding operations, etc.
- the first voice input by the user is obtained, the first instruction is generated through speech recognition, the first audio is imported according to the first instruction, and the human voice is separated from the first audio.
- the second voice input by the user is obtained, the second instruction is generated through speech recognition, the second audio is imported according to the second instruction, and the accompaniment is separated from the second audio.
- a third voice input by the user is obtained, a third instruction is generated through speech recognition, and the human voice and the accompaniment are mixed according to the third instruction to obtain the target audio.
- the user can also input control voice through physical buttons of the device, such as side buttons on a smartphone, to import the above-mentioned first audio or second audio or perform audio mixing.
- FIG. 5 is a schematic diagram 1 of user interface changes provided by an embodiment of the present disclosure
- FIG. 6a is a schematic diagram 2 of a user interface provided by an embodiment of the present disclosure.
- the user imports audio data by clicking the first control 401 of the first interface 400.
- the user can choose to import audio data from a file or video album.
- the user selects Audio 1 on the video album interface 404.
- the human voice part is extracted from the audio data
- the human voice is obtained, and the human voice is visually displayed in the first interface 400, such as the human voice track shown in Figure 5 or Figure 6a.
- the user can listen to the extracted vocals.
- The user imports another piece of audio data by clicking the second control 402 of the first interface 400. While the audio data is imported, its accompaniment part is extracted to obtain the accompaniment, and the accompaniment is visualized in the first interface 400, such as the accompaniment track shown in Figure 6a.
- The user can audition the extracted accompaniment.
- the user automatically mixes the extracted vocals and accompaniment by clicking the third control 403 of the first interface 400 to obtain the target audio.
- the user uploads the recorded playing and singing audio, and at the same time uploads the existing finished musical work.
- The extracted vocal and accompaniment are mixed to obtain the target audio, which combines the user's vocals with the existing accompaniment.
- the above audio processing process greatly facilitates users to create personalized music and meets the music creation needs of different users.
- FIG. 6b is a schematic diagram 3 of the user interface provided by an embodiment of the present disclosure.
- the interface shown in Figure 6b can be regarded as an optimized version of the interface shown in Figure 6a, including more functional controls.
- the first interface 400 includes: a first playback control, a first delete control and a first replacement control associated with the human voice.
- the first playback control is used to listen to the human voice
- the first delete control is used to delete the vocal
- the first replacement control is used to replace the vocal
- The first interface 400 also includes a second playback control, a second delete control and a second replacement control associated with the accompaniment. The second playback control is used to listen to the accompaniment
- the second delete control is used to delete the accompaniment
- the second replacement control is used to replace the accompaniment.
- the first interface 400 also includes: a fourth control 405 and a fifth control 406.
- The fourth control 405 is used to trigger customized processing of the vocal and/or accompaniment, where the customized processing includes audio clipping of the vocal and/or accompaniment.
- The fifth control 406 is used to trigger audio editing of the vocal and/or accompaniment (entering the recording studio for audio editing or processing); see below for details.
- mixing the human voice and the accompaniment to obtain the target audio specifically includes: obtaining the vocal segment of the human voice and the accompaniment segment of the accompaniment; mixing the vocal segment and the accompaniment segment, to get the target audio. That is, when mixing vocals and accompaniment, you can first extract mixable vocal segments and accompaniment segments from the vocals and accompaniment respectively, and then perform audio mixing based on the vocal segments and accompaniment segments to obtain the target audio.
- the vocal clips and accompaniment clips can be obtained through the following implementation:
- The vocal and the accompaniment are respectively input into the paragraph recognition model to obtain the vocal segment of the human voice and the accompaniment segment of the accompaniment.
- the paragraph recognition model is used to identify target segments of audio.
- the human voice is input into the paragraph recognition model to obtain the target segment of the human voice.
- the target segment can be a chorus segment, climax segment, or other segment of the audio.
- the target segment is a repeated segment in a song.
- the paragraph recognition model can be trained using a deep learning model.
- This embodiment does not limit the structure of the deep learning model.
- This implementation implements intelligent extraction of vocal segments and accompaniment segments through training models, which can improve audio processing efficiency and accuracy.
- the training process of the paragraph recognition model includes: obtaining a training data set.
- the training data set includes multiple audio samples and annotation information of each audio sample.
- the annotation information is used to indicate the target segment corresponding to the audio sample.
- Multiple audio samples in the training data set are used as the input of the paragraph recognition model, and the annotation information of each audio sample in the training data set is used as the expected output of the paragraph recognition model.
- The paragraph recognition model is trained until its loss function converges. At that point, training is stopped and the model parameters of the trained paragraph recognition model are obtained.
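- The "train until the loss function converges" stopping rule can be illustrated with a deliberately tiny stand-in model. The disclosure uses a deep learning model of unspecified structure; the sketch below substitutes a one-feature logistic regression trained by gradient descent, and the feature (a per-frame loudness-like value), the tolerance `tol` and all function names are assumptions made purely for illustration.

```python
import math
from typing import List, Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def train_until_converged(samples: List[float], labels: List[int],
                          lr: float = 0.5, tol: float = 1e-6,
                          max_epochs: int = 10000) -> Tuple[float, float, float]:
    """Fit p(target | x) = sigmoid(w*x + b); stop when the loss stops changing."""
    w, b = 0.0, 0.0
    prev_loss = float("inf")
    loss = prev_loss
    n = len(samples)
    for _ in range(max_epochs):
        grad_w = grad_b = total = 0.0
        for x, y in zip(samples, labels):
            p = sigmoid(w * x + b)
            p = min(max(p, 1e-12), 1.0 - 1e-12)  # avoid log(0)
            total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
            grad_w += (p - y) * x
            grad_b += p - y
        loss = total / n
        # Convergence test (assumption): loss change below a small tolerance.
        if abs(prev_loss - loss) < tol:
            break
        prev_loss = loss
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b, loss
```

Here the annotation information is reduced to a 0/1 label per sample (1 meaning "belongs to the target segment"); a real model would consume richer features and annotations.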
- The paragraph recognition model can be used to analyze information such as the rhythm and loudness changes of the input audio, and can identify the intro, verse, chorus, interlude, bridge, outro, silence and other segments of the audio, and extract the most likely chorus, i.e., the climax. Specifically, the start and end timestamps of the different segments are extracted, the audio is then trimmed accordingly, and the target audio segment is finally output.
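- Once the model has produced labeled segments with start and end timestamps, selecting and trimming the target segment is straightforward. The following is a minimal sketch assuming the model output is available as a list of labeled segments with a confidence score; the `Segment` type, its `score` field and both function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    label: str      # e.g. "intro", "verse", "chorus", "outro"
    start_s: float  # start timestamp in seconds
    end_s: float    # end timestamp in seconds
    score: float    # model confidence (assumed to be available)


def pick_target_segment(segments: List[Segment], label: str = "chorus") -> Segment:
    """Return the highest-scoring segment carrying the requested label."""
    candidates = [s for s in segments if s.label == label]
    if not candidates:
        raise ValueError(f"no segment labeled {label!r}")
    return max(candidates, key=lambda s: s.score)


def trim(samples: List[float], segment: Segment, sample_rate: int) -> List[float]:
    """Cut the samples between the segment's start and end timestamps."""
    start = int(round(segment.start_s * sample_rate))
    end = int(round(segment.end_s * sample_rate))
    return samples[start:end]
```

For example, with a 10 Hz toy sample rate, a chorus spanning 3.0 s to 4.0 s maps to sample indices 30 to 40.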
- The vocal track and the accompaniment track are displayed on the second interface; the vocal segment is obtained in response to a clipping operation on the vocal track, and the accompaniment segment is obtained in response to a clipping operation on the accompaniment track.
- In this implementation, the user clips segments on the interface to obtain the target segments of the vocal and the accompaniment for subsequent audio mixing. This adds user-defined processing of the imported vocal and accompaniment, which can increase the user's participation in audio production and meet the audio production needs of different users.
- Mixing the vocal and the accompaniment to obtain the target audio includes: obtaining the first rhythm of the vocal and the second rhythm of the accompaniment; performing rhythm alignment on the first rhythm of the vocal and the second rhythm of the accompaniment; and mixing the aligned vocal and accompaniment to obtain the target audio.
- the second rhythm of the accompaniment is adjusted so that the first rhythm of the human voice and the second rhythm of the accompaniment are consistent.
- the first rhythm of the vocal is adjusted based on the second rhythm of the accompaniment, so that the first rhythm of the vocal is consistent with the second rhythm of the accompaniment.
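- Adjusting one track to the other's rhythm amounts to changing its duration by some ratio. As a rough illustration only, the sketch below changes duration by linear-interpolation resampling. This is a swapped-in, naive technique: plain resampling also shifts pitch, so a practical system would more likely use a pitch-preserving time-stretch (for example a phase vocoder or WSOLA). The function name and the `ratio` parameter are assumptions; the disclosure does not name the stretching algorithm.

```python
from typing import List


def resample_linear(samples: List[float], ratio: float) -> List[float]:
    """Return the input stretched to round(len * ratio) samples.

    ratio > 1 lengthens (slows down) the audio; ratio < 1 shortens it.
    """
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    if not samples:
        return []
    n_out = max(1, round(len(samples) * ratio))
    if len(samples) == 1 or n_out == 1:
        return [samples[0]] * n_out
    # Map output index i onto a fractional position in the input.
    step = (len(samples) - 1) / (n_out - 1)
    out = []
    for i in range(n_out):
        pos = i * step
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1.0 - frac) + samples[hi] * frac)
    return out
```

With the accompaniment as the reference, the vocal would be passed through such a function with a ratio derived from the two detected tempos, then mixed with the unchanged accompaniment.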
- mixing the human voice and the accompaniment to obtain the target audio includes: mixing the vocal segment of the human voice and the accompaniment segment of the accompaniment to obtain the target audio. Specifically, the first rhythm of the vocal segment and the second rhythm of the accompaniment segment are obtained, the first rhythm of the vocal segment and the second rhythm of the accompaniment segment are rhythmically aligned, and the aligned vocal segment and accompaniment segment are mixed to obtain the target audio.
- the first rhythm of the vocal segment is used as a reference to adjust the second rhythm of the accompaniment segment so that the first rhythm of the vocal segment and the second rhythm of the accompaniment segment are consistent.
- the first rhythm of the vocal segment is adjusted based on the second rhythm of the accompaniment segment, so that the first rhythm of the vocal segment is consistent with the second rhythm of the accompaniment segment.
- the target audio is obtained by: obtaining the first rhythm of the third audio and the second rhythm of the fourth audio; rhythmically aligning the first rhythm of the third audio and the second rhythm of the fourth audio; and mixing the aligned third audio and fourth audio to obtain the target audio. Specifically, based on the first rhythm of the third audio, the second rhythm of the fourth audio is adjusted so that the rhythms of the third audio and the fourth audio are consistent.
- the third audio may be one of the human voice and the accompaniment, and correspondingly, the fourth audio may be the other of the human voice and the accompaniment.
- the third audio may be one of the vocal segment and the accompaniment segment, and the fourth audio may be the other audio of the vocal segment and the accompaniment segment.
- Rhythm detection is used to detect the times of the downbeats (strong beats) and to infer the tempo of the entire audio or audio segment. Adjusting the audio rhythm includes stretching or compressing the audio in time. Typically, the rhythm of the vocal track is aligned to the accompaniment track, and the vocal track file is processed through audio stretching or compression.
- the vocals and accompaniment in the mixed target audio are better integrated and the audio processing effect is improved.
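- The tempo inference and stretch-or-compress decision described above can be sketched as follows. This is a minimal illustration under stated assumptions: beat times are taken as already detected, tempo is estimated from the median inter-beat interval (a choice made here, not one stated in the disclosure), and the function names are hypothetical. Resampling the audio itself (for example with a phase vocoder) is outside the sketch; only the duration multiplier is computed.

```python
from statistics import median
from typing import List, Sequence


def estimate_bpm(beat_times_s: Sequence[float]) -> float:
    """Estimate tempo in beats per minute from detected beat times (seconds)."""
    if len(beat_times_s) < 2:
        raise ValueError("need at least two beat times")
    intervals = [b - a for a, b in zip(beat_times_s, beat_times_s[1:])]
    return 60.0 / median(intervals)


def stretch_factor(source_bpm: float, reference_bpm: float) -> float:
    """Duration multiplier for the source track.

    A value above 1 stretches (slows down) the source; a value below 1
    compresses (speeds up) it so that its tempo matches the reference.
    """
    return source_bpm / reference_bpm


def rescale_beats(beat_times_s: Sequence[float], factor: float) -> List[float]:
    """Beat times of the source track after applying the duration multiplier."""
    return [t * factor for t in beat_times_s]
```

- For example, a vocal at 120 beats per minute aligned to an accompaniment at 100 beats per minute gets a factor of 1.2, so vocal beats at 0.5 s intervals land on the accompaniment's 0.6 s grid.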
- in response to the touch operation on the third control on the first interface, the interface jumps to the third interface; the third interface includes a third playback control, which is used to trigger playback of the target audio.
- the third interface is the audio mixing preview interface. The following is a graphical explanation of the user interface changes to obtain the target audio after the user imports two pieces of audio.
- FIG. 7a is a second schematic diagram of user interface changes provided by an embodiment of the present disclosure.
- the vocal and accompaniment are visually displayed in the first interface 400.
- the user can directly click the third control 403 to automatically mix and match the vocal and accompaniment, and the third interface 701 visually displays the target audio after audio mixing.
- the user can listen to, export, and share the mixed target audio on the third interface 701, or choose to play it again, or import it to the recording studio for further audio processing.
- FIG. 7b is a schematic diagram 3 of user interface changes provided by an embodiment of the present disclosure.
- the user can also click the fourth control 405 of the first interface 400 after uploading two pieces of audio data to trigger customized processing of the vocal and/or accompaniment, and the interface jumps to the second interface 700.
- the user can perform audio editing on the vocal and/or accompaniment on the second interface 700, such as intercepting the climax clip of the vocal and the climax clip of the accompaniment.
- the user can also listen to the edited vocal or accompaniment climax clip on the second interface 700.
- After completing the audio editing, the user jumps to the third interface 701 by clicking the "automatic mix and match" control on the second interface 700.
- FIG. 7c is a schematic diagram 4 of user interface changes provided by an embodiment of the present disclosure.
- after uploading two pieces of audio data, the user directly clicks the third control 403 of the first interface 400 to automatically mix and match the vocals and accompaniment. In the third interface 701 shown in Figure 7c, the user can listen to, export, and share the mixed target audio, choose to cancel, or import it to the recording studio for further audio processing, and can also set the cover of the target audio.
- the third interface 701 shown in Figure 7c can be regarded as an optimized version of the third interface 701 shown in Figure 7a.
- in response to a touch operation on the cover editing control on the third interface, the first window is displayed; in response to a control selection operation on the first window, the target cover is obtained.
- the first window includes a cover import control, one or more preset static cover controls, and one or more preset animation effect controls.
- the target cover is a static cover or a dynamic cover.
- obtaining the target cover in response to a control selection operation on the first window includes: obtaining a static cover and an animation effect in response to a control selection operation on the first window; and generating, based on the audio characteristics of the target audio, the static cover and the animation effect, a dynamic cover that changes with the audio characteristics of the target audio.
- the audio characteristics include audio beat and/or volume.
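- One way an animation effect could follow the volume of the target audio is sketched below. This is an illustrative assumption, not the disclosed implementation: volume is approximated by the root-mean-square (RMS) level of fixed-size frames, and each frame's level is mapped linearly to a scale factor for the animation layer. The function names and the default scale range of 1.0 to 1.5 are hypothetical.

```python
import math
from typing import List, Sequence


def frame_rms(samples: Sequence[float], frame_size: int) -> List[float]:
    """RMS level of each consecutive frame (the last frame may be shorter)."""
    if frame_size <= 0:
        raise ValueError("frame_size must be positive")
    levels = []
    for start in range(0, len(samples), frame_size):
        frame = samples[start:start + frame_size]
        levels.append(math.sqrt(sum(x * x for x in frame) / len(frame)))
    return levels


def animation_scales(levels: Sequence[float],
                     min_scale: float = 1.0,
                     max_scale: float = 1.5) -> List[float]:
    """Map each level linearly to a scale between min_scale and max_scale."""
    peak = max(levels, default=0.0)
    if peak == 0.0:
        # Silent audio: keep the animation at its resting size.
        return [min_scale for _ in levels]
    span = max_scale - min_scale
    return [min_scale + span * (level / peak) for level in levels]
```

- A renderer would then draw the animation layer at the scale computed for the frame currently being played, so the effect grows on loud passages and shrinks on quiet ones.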
- FIG. 8 is a schematic diagram 5 of user interface changes provided by an embodiment of the present disclosure.
- the user clicks the cover editing control 705 on the third interface 701, and the first window 800 pops up at the bottom of the third interface 701.
- the first window 800 includes a cover import control 801, a plurality of preset static covers, such as Cover 1 to Cover 3 in Figure 8, and multiple animation effects, such as Animation 1 to Animation 3 in Figure 8.
- the user can import a custom picture from the local photo album by clicking the cover import control 801, and use the custom picture as a static cover, or directly select a preset static cover. Users can directly select a preset animation or not set animation.
- export the target audio and the generated target cover to the album or file system, or share it to a designated application, or import it to the recording studio for further audio processing.
- FIG. 9 is a schematic diagram 4 of the user interface provided by an embodiment of the present disclosure.
- the user sets the static cover and animation effects of the target audio through Figure 8, and can preview the synthesized target cover in the audio mixing preview interface.
- the target cover includes a static cover and animation effects that change with the audio characteristics of the target audio.
- the animation effect can be seen as adding an animation special effects layer at the bottom of the static cover.
- the animation effect can dynamically change at any position around the static cover.
- This embodiment provides users with the function of setting an audio cover, enabling different users to edit the cover in a personalized manner, thereby improving the user's audio production experience.
- data associated with the target audio is exported to the target location.
- the target location includes a photo album or file system.
- the user triggers the first selection window by clicking the export control 702 on the third interface 701, and in the first selection window the user can choose to export the data associated with the target audio to a photo album or the file system.
- the fourth voice input by the user is obtained, and an export instruction is generated through speech recognition; the export instruction is used to export the data associated with the target audio to the target location.
- the data associated with the target audio is shared to the target application.
- the user triggers the second selection window by clicking the sharing control 704 on the third interface 701, and in the second selection window the user can choose to share the data associated with the target audio to the target application or to a specified user in the target application.
- the fifth voice input by the user is obtained, and a sharing instruction is generated through speech recognition; the sharing instruction is used to share the data associated with the target audio to the target application, or to a specified user in the target application.
- the data associated with the target audio includes at least one of the following: target audio, vocal, accompaniment, vocal segment of the vocal, accompaniment segment of the accompaniment, static cover of the target audio, and dynamic cover of the target audio.
- the data exported or shared by the user can only contain the target audio, or it can also contain all intermediate data in the process of obtaining the target audio.
- the data can be compressed first, and then the compressed data can be exported locally or shared with other users. If the shared data received by other users contains all the intermediate data in the process of obtaining the target audio, then in addition to playing the target audio, those users can also query or re-edit the intermediate data and generate new target audio. This enables multi-person collaboration in audio production, increases interaction between users and improves the user experience.
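- The compress-then-export step can be sketched with a ZIP archive built in memory. This is only an illustration: the disclosure does not name a compression format, and the file names, the `manifest.json` entry and both function names are assumptions introduced here.

```python
import io
import json
import zipfile
from typing import Dict


def build_export_bundle(items: Dict[str, bytes]) -> bytes:
    """Compress the target audio and any intermediate data into one archive."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        # A small manifest lets the receiver list the contents before unpacking.
        archive.writestr("manifest.json", json.dumps({"items": sorted(items)}))
        for name, payload in items.items():
            archive.writestr(name, payload)
    return buffer.getvalue()


def read_export_bundle(bundle: bytes) -> Dict[str, bytes]:
    """Unpack a bundle so the receiver can play or re-edit its contents."""
    with zipfile.ZipFile(io.BytesIO(bundle)) as archive:
        return {
            name: archive.read(name)
            for name in archive.namelist()
            if name != "manifest.json"
        }
```

- A bundle holding only the target audio suits plain playback, while one that also carries the vocal, the accompaniment and the covers lets the receiver re-edit and produce new target audio.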
- the interface jumps from the third interface to the fourth interface, and the fourth interface includes audio processing function controls.
- the fourth interface is an interface for audio post-processing, which can also be described as a recording studio interface. The user can perform audio post-processing on the vocals and accompaniment in the target audio on the fourth interface.
- the interface jumps from the third interface to the fourth interface, and the fourth interface includes a trigger control associated with the audio processing function controls; the trigger control is used to trigger the display of the audio processing function controls.
- audio processing function controls include one or more of the following:
- an audio optimization control, used to trigger editing of the audio to optimize the audio;
- an accompaniment separation control, used to trigger separation of the vocal and/or accompaniment from the audio;
- a style synthesis control, used to trigger separation of the vocal from the audio, and mixing and editing of the separated vocal with a preset accompaniment;
- an audio mashup control, used to trigger separation of the vocal from the first audio and of the accompaniment from the second audio, and mixing and editing of the separated vocal with the separated accompaniment.
- audio optimization includes optimizing the vocals and/or accompaniment of the user's playing-and-singing audio, for example optimization for a male voice with guitar, a female voice with guitar, a male voice with piano, and a female voice with piano.
- accompaniment separation includes separation processing such as vocal removal and instrument removal.
- style synthesis includes style optimization such as car music, classic pop, heart-warming moments, relaxing moments, childhood memories, etc.
- audio mashup includes optimization processing such as rhythm alignment and pitch transposition.
- Figure 10a is a schematic diagram 6 of user interface changes provided by an embodiment of the present disclosure.
- the user clicks the audio editing control 703 on the third interface 701, and the interface jumps to the fourth interface 1000.
- the fourth interface 1000 directly displays the target audio after mixing the vocal and accompaniment in track 1, and displays a plurality of optional audio processing controls in the audio processing window 1004 of the fourth interface 1000 .
- Figure 10b is a schematic diagram 7 of user interface changes provided by an embodiment of the present disclosure.
- the user clicks the audio editing control 703 on the third interface 701, and the interface jumps to the fourth interface 1000.
- the user can perform audio post-processing on the vocals and accompaniment in the target audio on the fourth interface 1000.
- track 1 of the fourth interface 1000 corresponds to the vocal
- track 2 of the fourth interface 1000 corresponds to the accompaniment.
- the user can also enter the fifth interface 1002 by clicking the interface switching control 1001 on the fourth interface 1000, or by sliding left or right.
- the fifth interface 1002 includes a trigger control 1003 associated with the audio processing function control.
- the trigger control 1003 is used to trigger the display of audio processing function controls.
- Audio processing controls include multiple selectable controls shown in the audio processing window 1004 of Figure 10b.
- the user can add effects, perform further audio processing, adjust the volume, etc. on the vocals of track 1 and the accompaniment of track 2 respectively on the fifth interface 1002.
- the user can also adjust the overall volume of the vocals and accompaniment on the fifth interface 1002.
- the effects include reverb, equalization, electronic music, phase shifter, flanger, filter, etc.
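- The per-track and overall volume adjustments can be illustrated with a simple sample-wise mix. This is a sketch under assumptions, not the disclosed mixer: samples are floats in the range -1.0 to 1.0, the shorter track is padded with silence, the sum is hard-clipped, and the function name and gain parameters are hypothetical.

```python
from typing import List, Sequence


def mix_tracks(vocal: Sequence[float],
               accompaniment: Sequence[float],
               vocal_gain: float = 1.0,
               accompaniment_gain: float = 1.0,
               master_gain: float = 1.0) -> List[float]:
    """Mix two tracks with per-track gains and an overall (master) gain."""
    length = max(len(vocal), len(accompaniment))
    mixed = []
    for i in range(length):
        v = vocal[i] if i < len(vocal) else 0.0                   # pad with silence
        a = accompaniment[i] if i < len(accompaniment) else 0.0   # pad with silence
        sample = master_gain * (vocal_gain * v + accompaniment_gain * a)
        mixed.append(max(-1.0, min(1.0, sample)))                 # hard clip
    return mixed
```

- Here `vocal_gain` and `accompaniment_gain` correspond to the volume of track 1 and track 2, and `master_gain` to the overall volume. A production mixer would more likely use a limiter than hard clipping.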
- the user imports the first audio by touching the first control on the first interface, and separates the human voice from the first audio; and then imports the second audio by touching the second control on the first interface. , and separate the accompaniment from the second audio; finally, by touching the third control on the first interface, mix the vocal and accompaniment to obtain the target audio.
- the above process realizes the automatic mixing and matching of vocals and accompaniment in two pieces of audio, improves the audio processing effect, and meets the user's personalized needs for audio production.
- FIG. 11 is a structural block diagram of an audio processing device provided by an embodiment of the present disclosure.
- the audio processing device 1100 provided in this embodiment includes: an acquisition module 1101 and a processing module 1102.
- Acquisition module 1101 configured to acquire human voices in response to the first instruction
- the acquisition module 1101 is also configured to acquire the accompaniment in response to the second instruction
- the processing module 1102 is configured to mix the vocal and the accompaniment in response to the third instruction to obtain target audio.
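- The division of work between the acquisition module and the processing module can be sketched as a small class that reacts to the three instructions. This is an illustrative assumption only: the class name, method names and the injected separation and mixing functions are hypothetical, and audio is represented as a plain list of samples.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

Audio = List[float]


@dataclass
class AudioProcessingDevice:
    """Hypothetical device: acquisition on instructions 1 and 2, mixing on 3."""
    separate_vocal: Callable[[Audio], Audio]          # assumed separation step
    separate_accompaniment: Callable[[Audio], Audio]  # assumed separation step
    mix: Callable[[Audio, Audio], Audio]              # assumed mixing step
    vocal: Optional[Audio] = None
    accompaniment: Optional[Audio] = None

    def on_first_instruction(self, first_audio: Audio) -> None:
        """Acquisition module: obtain the vocal from the first audio."""
        self.vocal = self.separate_vocal(first_audio)

    def on_second_instruction(self, second_audio: Audio) -> None:
        """Acquisition module: obtain the accompaniment from the second audio."""
        self.accompaniment = self.separate_accompaniment(second_audio)

    def on_third_instruction(self) -> Audio:
        """Processing module: mix the vocal and accompaniment into target audio."""
        if self.vocal is None or self.accompaniment is None:
            raise RuntimeError("both vocal and accompaniment must be acquired first")
        return self.mix(self.vocal, self.accompaniment)
```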
- the acquisition module 1101 is configured to import the first audio in response to a touch operation on the first control on the first interface, and separate the human voice from the first audio;
- the acquisition module 1101 is also configured to import the second audio in response to the touch operation on the second control on the first interface, and separate the accompaniment from the second audio.
- the processing module 1102 is configured to mix the vocal and the accompaniment in response to a touch operation on the third control on the first interface to obtain the target audio.
- processing module 1102 is used to:
- the vocal segment and the accompaniment segment are mixed to obtain the target audio.
- processing module 1102 is used to:
- paragraph recognition model is used to identify target segments of audio.
- the audio processing device 1100 further includes: a display module 1103;
- the display module 1103 is configured to display the vocal track and the accompaniment track on the second interface in response to a touch operation on the fourth control on the first interface;
- the acquisition module 1101 is configured to acquire the vocal segment in response to the editing operation of the vocal track; and in response to the editing operation of the accompaniment track, acquire the accompaniment segment.
- processing module 1102 is used to:
- the third audio is one of the human voice and the accompaniment, and the fourth audio is the other of the human voice and the accompaniment; or
- the third audio is one of the vocal segment and the accompaniment segment, and the fourth audio is the other of the vocal segment and the accompaniment segment.
- the processing module 1102 is configured to adjust the second rhythm of the fourth audio based on the first rhythm of the third audio, so that the third audio has the same rhythm as the fourth audio.
- the first interface includes:
- a first playback control, a first deletion control and a first replacement control associated with the human voice, where the first playback control is used to listen to the human voice, the first deletion control is used to delete the human voice, and the first replacement control is used to replace the human voice;
- a second playback control, a second deletion control and a second replacement control associated with the accompaniment, where the second playback control is used to listen to the accompaniment, the second deletion control is used to delete the accompaniment, and the second replacement control is used to replace the accompaniment.
- the processing module 1102 is configured to jump to a third interface in response to a touch operation on the third control on the first interface, where the third interface includes a third playback control, and the third playback control is used to trigger playback of the target audio.
- the display module 1103 is configured to display a first window in response to a touch operation on the cover editing control on the third interface, where the first window includes a cover import control, one or more preset static cover controls and one or more preset animation effect controls;
- the acquisition module 1101 is configured to acquire a target cover in response to a control selection operation on the first window; the target cover is a static cover or a dynamic cover.
- the acquisition module 1101 is configured to acquire static cover and animation effects in response to a control selection operation on the first window;
- the processing module 1102 is configured to generate a dynamic cover that changes with the audio characteristics of the target audio according to the audio characteristics of the target audio, the static cover, and the animation effect;
- the audio characteristics include audio beat and/or volume.
- the processing module 1102 is configured to export data associated with the target audio to a target location in response to an export instruction on the third interface; the target location includes a photo album or a file system.
- the processing module 1102 is configured to share data associated with the target audio to a target application in response to a sharing instruction on the third interface.
- the data associated with the target audio includes at least one of the following:
- the target audio, the human voice, the accompaniment, the vocal segment of the human voice, the accompaniment segment of the accompaniment, the static cover of the target audio, and the dynamic cover of the target audio.
- the processing module 1102 is configured to jump from the third interface to the fourth interface in response to a touch operation on the audio editing control on the third interface,
- the fourth interface includes an audio processing function control or a trigger control associated with the audio processing function control, the trigger control being used to trigger display of the audio processing function control;
- the audio processing function controls include one or more of the following:
- Audio optimization controls for triggering editing of audio to optimize said audio
- Accompaniment separation controls for triggering the separation of vocals and/or accompaniment from the audio
- Audio mashup controls that trigger the separation of the vocal from the first audio, the accompaniment from the second audio, and the mixing and editing of the separated vocal with the separated accompaniment.
- the audio processing device provided in this embodiment can be used to execute the technical solutions of the above method embodiments. Its implementation principles and technical effects are similar, and will not be described again in this embodiment.
- FIG. 12 is a structural block diagram of an electronic device provided by an embodiment of the present disclosure.
- the electronic device 1200 may be a terminal device or a server.
- the terminal devices may include, but are not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, personal digital assistants (Personal Digital Assistant, PDA for short), tablet computers (Portable Android Device, PAD for short), portable multimedia players (Portable Media Player, PMP for short) and vehicle-mounted terminals (such as vehicle-mounted navigation terminals), and fixed terminals such as digital TVs and desktop computers.
- the electronic device shown is only an example and should not bring any limitations to the functions and scope of use of the embodiments of the present disclosure.
- the electronic device 1200 may include a processing device (such as a central processing unit, a graphics processor, etc.) 1201, which may perform various appropriate actions and processing according to a program stored in a read-only memory (Read Only Memory, ROM for short) 1202 or a program loaded from a storage device 1208 into a random access memory (Random Access Memory, RAM for short) 1203. In the RAM 1203, various programs and data required for the operation of the electronic device 1200 are also stored.
- the processing device 1201, ROM 1202 and RAM 1203 are connected to each other via a bus 1204.
- An input/output (I/O for short) interface 1205 is also connected to bus 1204.
- the following devices can be connected to the I/O interface 1205: input devices 1206 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 1207 including, for example, a liquid crystal display (LCD), a speaker, a vibrator, etc.; storage devices 1208 including, for example, a magnetic tape, a hard disk, etc.; and a communication device 1209.
- the communication device 1209 may allow the electronic device 1200 to communicate wirelessly or wiredly with other devices to exchange data.
- Although FIG. 12 illustrates the electronic device 1200 with various means, it should be understood that implementing or providing all of the illustrated means is not required. More or fewer means may alternatively be implemented or provided.
- embodiments of the present disclosure include a computer program product including a computer program carried on a computer-readable medium, the computer program containing program code for performing the method illustrated in the flowchart.
- the computer program may be downloaded and installed from the network via communication device 1209, or from storage device 1208, or from ROM 1202.
- When the computer program is executed by the processing device 1201, the above-mentioned functions defined in the method of the embodiment of the present disclosure are performed.
- the computer-readable medium mentioned above in the present disclosure may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the above two.
- the computer-readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus or device, or any combination thereof.
- Computer-readable storage media may include, but are not limited to: an electrical connection having one or more wires, a portable computer disk, a hard drive, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (Erasable Programmable Read-Only Memory, referred to as EPROM or flash memory), optical fiber, portable compact disk read-only memory (Compact Disc Read-Only Memory, referred to as CD-ROM), optical storage device, magnetic storage device, or any suitable combination of the above.
- a computer-readable storage medium may be any tangible medium that contains or stores a program for use by or in connection with an instruction execution system, apparatus, or device.
- a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code therein. Such propagated data signals may take many forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the above.
- a computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device.
- Program code embodied on a computer-readable medium may be transmitted using any suitable medium, including but not limited to: wire, optical cable, RF (radio frequency), etc., or any suitable combination of the above.
- the above-mentioned computer-readable medium may be included in the above-mentioned electronic device; it may also exist independently without being assembled into the electronic device.
- the computer-readable medium carries one or more programs.
- When the one or more programs are executed by the electronic device, the electronic device performs the method shown in the above embodiment.
- Computer program code for performing the operations of the present disclosure may be written in one or more programming languages, including object-oriented programming languages such as Java, Smalltalk and C++, and conventional procedural programming languages such as "C" or similar programming languages.
- the program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server.
- the remote computer can be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or it can be connected to an external computer (e.g., via the Internet using an Internet service provider).
- each block in the flowchart or block diagram may represent a module, segment, or portion of code that contains one or more executable instructions for implementing the specified logical functions.
- the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown one after another may actually execute substantially in parallel, or they may sometimes execute in the reverse order, depending on the functionality involved.
- each block of the block diagram and/or flowchart illustration, and combinations of blocks in the block diagram and/or flowchart illustration, can be implemented by special-purpose hardware-based systems that perform the specified functions or operations, or can be implemented using a combination of special-purpose hardware and computer instructions.
- the units involved in the embodiments of the present disclosure can be implemented in software or hardware.
- the name of the unit does not constitute a limitation on the unit itself under certain circumstances.
- for example, the first acquisition unit can also be described as "the unit that acquires at least two Internet Protocol addresses."
- exemplary types of hardware logic components include: Field Programmable Gate Array (FPGA), Application Specific Integrated Circuit (ASIC), Application Specific Standard Parts (ASSP for short), System on Chip (SOC for short), Complex Programmable Logic Device (CPLD for short), etc.
- a machine-readable medium may be a tangible medium that may contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
- the machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium.
- Machine-readable media may include, but are not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatuses or devices, or any suitable combination of the foregoing.
- machine-readable storage media would include one or more wire-based electrical connections, laptop disks, hard drives, random access memory (RAM), read only memory (ROM), erasable programmable read only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage device, magnetic storage device, or any suitable combination of the above.
- an audio processing method including:
- the vocal and the accompaniment are mixed to obtain target audio.
- obtaining the human voice in response to the first instruction includes:
- the step of obtaining the accompaniment in response to the second instruction includes:
- the mixing the vocal and the accompaniment in response to the third instruction to obtain the target audio includes:
- the vocal and the accompaniment are mixed to obtain the target audio.
- mixing the vocal and the accompaniment to obtain target audio includes:
- the vocal segment and the accompaniment segment are mixed to obtain the target audio.
- obtaining the vocal segment of the human voice and the accompaniment segment of the accompaniment includes:
- paragraph recognition model is used to identify target segments of audio.
- obtaining the vocal segment of the human voice and the accompaniment segment of the accompaniment includes:
- the accompaniment segment is obtained in response to a clipping operation on an audio track of the accompaniment.
- mixing the vocal and the accompaniment to obtain target audio includes:
- the third audio is one of the human voice and the accompaniment, and the fourth audio is the other of the human voice and the accompaniment; or
- the third audio is one of the vocal segment and the accompaniment segment, and the fourth audio is the other of the vocal segment and the accompaniment segment.
- rhythmically aligning the first rhythm of the third audio and the second rhythm of the fourth audio includes:
- the second rhythm of the fourth audio is adjusted so that the rhythm of the third audio is consistent with that of the fourth audio.
- the first interface includes:
- a first playback control, a first deletion control and a first replacement control associated with the human voice, where the first playback control is used to listen to the human voice, the first deletion control is used to delete the human voice, and the first replacement control is used to replace the human voice;
- the second playback control is used to listen to the accompaniment
- the second delete control is used to delete the accompaniment
- the second replacement control is used to replace the accompaniment.
- jumping to the third interface, where the third interface includes a third playback control, and the third playback control is used to trigger playback of the target audio.
- a first window is displayed, where the first window includes a cover import control, one or more preset static cover controls and one or more preset animation effect controls;
- the target cover is a static cover or a dynamic cover.
- obtaining the target cover in response to a control selection operation on the first window includes:
- a dynamic cover that changes with the audio characteristics of the target audio is generated according to those audio characteristics, the static cover and the animation effect
- the audio characteristics include audio beat and/or volume.
- data associated with the target audio is exported to a target location; the target location includes a photo album or a file system.
- data associated with the target audio is shared to a target application in response to a sharing instruction on the third interface.
- data associated with the target audio includes at least one of the following:
- the target audio, the human voice, the accompaniment, the vocal segment of the human voice, the accompaniment segment of the accompaniment, the static cover of the target audio, and the dynamic cover of the target audio.
- in response to a touch operation on the audio editing control on the third interface, jumping from the third interface to a fourth interface; the fourth interface includes audio processing function controls or trigger controls associated with the audio processing function controls, the trigger controls being used to trigger display of the audio processing function controls;
- the audio processing function controls include one or more of the following:
- Audio optimization controls, for triggering editing of audio to optimize the audio
- Accompaniment separation controls, for triggering the separation of vocals and/or accompaniment from audio
- Style synthesis controls, for triggering the separation of the vocal from audio and the mixing and editing of the separated vocal with a preset accompaniment
- Audio mashup controls, for triggering the separation of the vocal from the first audio, the separation of the accompaniment from the second audio, and the mixing and editing of the separated vocal with the separated accompaniment.
- an audio processing device including:
- an acquisition module configured to acquire the human voice in response to the first instruction
- the acquisition module is also used to acquire the accompaniment in response to the second instruction
- a processing module configured to mix the vocal and the accompaniment in response to the third instruction to obtain target audio.
- the acquisition module is configured to import the first audio in response to a touch operation on the first control on the first interface, and separate the human voice from the first audio;
- the acquisition module is further configured to import the second audio in response to a touch operation on the second control on the first interface, and separate the accompaniment from the second audio.
- the processing module is configured to mix the vocal and the accompaniment in response to a touch operation on a third control on the first interface to obtain the target audio.
- the processing module is used to:
- the vocal segment and the accompaniment segment are mixed to obtain the target audio.
- the processing module is used to:
- paragraph recognition model is used to identify target segments of audio.
- the audio processing device further includes: a display module
- the display module is configured to display the vocal track and the accompaniment track on the second interface in response to a touch operation on the fourth control on the first interface;
- the acquisition module is configured to acquire the vocal segment in response to an editing operation on the audio track of the human voice; and acquire the accompaniment segment in response to an editing operation on the audio track of the accompaniment.
- the processing module is used to:
- the third audio is one audio of the human voice and the accompaniment
- the fourth audio is the other audio of the human voice and the accompaniment
- the third audio is the audio of the human voice and the accompaniment
- One audio of the vocal segment and the accompaniment segment, and the fourth audio is the other audio of the vocal segment and the accompaniment segment.
- the processing module is configured to adjust the second rhythm of the fourth audio based on the first rhythm of the third audio, so that the third audio has the same rhythm as the fourth audio.
- the first interface includes:
- a first playback control, a first delete control and a first replacement control associated with the human voice; the first playback control is used to listen to the human voice, the first delete control is used to delete the human voice, and the first replacement control is used to replace the human voice;
- a second playback control, a second delete control and a second replacement control associated with the accompaniment; the second playback control is used to listen to the accompaniment, the second delete control is used to delete the accompaniment, and the second replacement control is used to replace the accompaniment.
- the processing module is configured to jump to a third interface in response to a touch operation on the third control on the first interface, and the third interface includes a third playback control.
- the third playback control is used to trigger playback of the target audio.
- the display module is configured to display a first window in response to a touch operation on the cover editing control on the third interface, the first window including a cover import control, one or more preset static cover controls and one or more preset animation effect controls;
- the acquisition module is configured to acquire a target cover in response to a control selection operation on the first window;
- the target cover is a static cover or a dynamic cover.
- the acquisition module is configured to acquire static cover and animation effects in response to a control selection operation on the first window
- the processing module is configured to generate a dynamic cover that changes with the audio characteristics of the target audio according to the audio characteristics of the target audio, the static cover, and the animation effect;
- the audio characteristics include audio beat and/or volume.
- the processing module is configured to export data associated with the target audio to a target location in response to an export instruction on the third interface; the target location includes a photo album or a file system.
- the processing module is configured to share data associated with the target audio to a target application in response to a sharing instruction on the third interface.
- data associated with the target audio includes at least one of the following:
- the target audio, the human voice, the accompaniment, the vocal segment of the human voice, the accompaniment segment of the accompaniment, the static cover of the target audio, and the dynamic cover of the target audio.
- the processing module is configured to jump from the third interface to the fourth interface in response to a touch operation on the audio editing control on the third interface;
- the fourth interface includes an audio processing function control or a trigger control associated with the audio processing function control, the trigger control being used to trigger display of the audio processing function control;
- the audio processing function controls include one or more of the following:
- Audio optimization controls, for triggering editing of audio to optimize the audio
- Accompaniment separation controls, for triggering the separation of vocals and/or accompaniment from audio
- Style synthesis controls, for triggering the separation of the vocal from audio and the mixing and editing of the separated vocal with a preset accompaniment
- Audio mashup controls, for triggering the separation of the vocal from the first audio, the separation of the accompaniment from the second audio, and the mixing and editing of the separated vocal with the separated accompaniment.
- an electronic device including: at least one processor and a memory;
- the memory stores computer execution instructions
- the at least one processor executes the computer execution instructions stored in the memory, so that the at least one processor executes the audio processing method described in the above first aspect and various possible designs of the first aspect.
- a computer-readable storage medium is provided.
- Computer-executable instructions are stored in the computer-readable storage medium.
- when a processor executes the computer-executable instructions, the audio processing method described in the above first aspect and various possible designs of the first aspect is implemented.
- a computer program product is provided, including a computer program that, when executed by a processor, implements the audio processing method described in the above first aspect and various possible designs of the first aspect.
- embodiments of the present disclosure provide a computer program that, when executed by a processor, implements the audio processing method described in the first aspect and various possible designs of the first aspect.
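
The rhythm alignment and mixing steps listed above can be illustrated with a minimal Python sketch. It is not the implementation disclosed in the application: the function names (`stretch_ratio`, `resample_linear`, `mix`) are hypothetical, the tempos are assumed to be already known in beats per minute, and the speed change is a naive linear resampling that also shifts pitch, whereas a production system would more likely use a pitch-preserving time-stretch.

```python
from typing import List


def stretch_ratio(reference_bpm: float, other_bpm: float) -> float:
    """Playback-rate factor that brings other_bpm to reference_bpm.

    The third audio's rhythm is the reference; the fourth audio is adjusted.
    """
    if reference_bpm <= 0 or other_bpm <= 0:
        raise ValueError("tempos must be positive")
    return reference_bpm / other_bpm


def resample_linear(samples: List[float], rate: float) -> List[float]:
    """Change playback speed by `rate` using linear interpolation.

    Assumption: a naive stand-in for time-stretching (pitch changes too).
    """
    if rate <= 0:
        raise ValueError("rate must be positive")
    if not samples:
        return []
    out_len = max(1, int(len(samples) / rate))
    last = len(samples) - 1
    out = []
    for i in range(out_len):
        pos = i * rate
        lo = min(int(pos), last)
        hi = min(lo + 1, last)
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out


def mix(a: List[float], b: List[float],
        gain_a: float = 0.5, gain_b: float = 0.5) -> List[float]:
    """Sum two mono tracks sample by sample, padding the shorter with silence
    and clipping the result to the range [-1.0, 1.0]."""
    n = max(len(a), len(b))
    out = []
    for i in range(n):
        x = a[i] if i < len(a) else 0.0
        y = b[i] if i < len(b) else 0.0
        out.append(max(-1.0, min(1.0, gain_a * x + gain_b * y)))
    return out


def align_and_mix(third: List[float], third_bpm: float,
                  fourth: List[float], fourth_bpm: float) -> List[float]:
    """Adjust the fourth audio to the third audio's tempo, then mix them."""
    rate = stretch_ratio(third_bpm, fourth_bpm)
    return mix(third, resample_linear(fourth, rate))
```

For example, with a vocal at 120 BPM as the reference and an accompaniment at 100 BPM, `stretch_ratio(120, 100)` gives a playback rate of 1.2 for the accompaniment before the two tracks are summed.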
Landscapes
- Engineering & Computer Science (AREA)
- Multimedia (AREA)
- Acoustics & Sound (AREA)
- Physics & Mathematics (AREA)
- Quality & Reliability (AREA)
- Signal Processing (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Theoretical Computer Science (AREA)
- Reverberation, Karaoke And Other Acoustics (AREA)
- Electrophonic Musical Instruments (AREA)
Claims (21)
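- An audio processing method, comprising: acquiring a human voice in response to a first instruction; acquiring an accompaniment in response to a second instruction; and mixing the human voice and the accompaniment in response to a third instruction to obtain target audio.
- The method according to claim 1, wherein acquiring the human voice in response to the first instruction comprises: importing first audio in response to a touch operation on a first control on a first interface, and separating the human voice from the first audio; and acquiring the accompaniment in response to the second instruction comprises: importing second audio in response to a touch operation on a second control on the first interface, and separating the accompaniment from the second audio.
- The method according to claim 2, wherein mixing the human voice and the accompaniment in response to the third instruction to obtain the target audio comprises: mixing the human voice and the accompaniment in response to a touch operation on a third control on the first interface to obtain the target audio.
- The method according to any one of claims 1 to 3, wherein mixing the human voice and the accompaniment to obtain the target audio comprises: obtaining a vocal segment of the human voice and an accompaniment segment of the accompaniment; and mixing the vocal segment and the accompaniment segment to obtain the target audio.
- The method according to claim 4, wherein obtaining the vocal segment of the human voice and the accompaniment segment of the accompaniment comprises: inputting the human voice and the accompaniment separately into a paragraph recognition model to obtain the vocal segment of the human voice and the accompaniment segment of the accompaniment; wherein the paragraph recognition model is used to identify target segments of audio.
- The method according to claim 4, wherein obtaining the vocal segment of the human voice and the accompaniment segment of the accompaniment comprises: displaying audio tracks of the human voice and the accompaniment on a second interface in response to a touch operation on a fourth control on a first interface; obtaining the vocal segment in response to a clipping operation on the audio track of the human voice; and obtaining the accompaniment segment in response to a clipping operation on the audio track of the accompaniment.
- The method according to any one of claims 1 to 3, wherein mixing the human voice and the accompaniment to obtain the target audio comprises: obtaining a first rhythm of third audio and a second rhythm of fourth audio; rhythmically aligning the first rhythm of the third audio and the second rhythm of the fourth audio; and obtaining the target audio based on the aligned third audio and fourth audio; wherein the third audio is one of the human voice and the accompaniment, and the fourth audio is the other of the human voice and the accompaniment; or the third audio is one of a vocal segment of the human voice and an accompaniment segment of the accompaniment, and the fourth audio is the other of the vocal segment and the accompaniment segment.
- The method according to claim 7, wherein rhythmically aligning the first rhythm of the third audio and the second rhythm of the fourth audio comprises: adjusting the second rhythm of the fourth audio with the first rhythm of the third audio as a reference, so that the rhythm of the third audio is consistent with that of the fourth audio.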
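- The method according to any one of claims 2 to 3 and 6, wherein the first interface comprises: a first playback control, a first delete control and a first replacement control associated with the human voice, the first playback control being used to listen to the human voice, the first delete control being used to delete the human voice, and the first replacement control being used to replace the human voice; and a second playback control, a second delete control and a second replacement control associated with the accompaniment, the second playback control being used to listen to the accompaniment, the second delete control being used to delete the accompaniment, and the second replacement control being used to replace the accompaniment.
- The method according to claim 3, further comprising: jumping to a third interface in response to the touch operation on the third control on the first interface, wherein the third interface comprises a third playback control, and the third playback control is used to trigger playback of the target audio.
- The method according to claim 3, further comprising: displaying a first window in response to a touch operation on a cover editing control on a third interface, wherein the first window comprises a cover import control, one or more preset static cover controls and one or more preset animation effect controls; and obtaining a target cover in response to a control selection operation on the first window; wherein the target cover is a static cover or a dynamic cover.
- The method according to claim 11, wherein, if the target cover is a dynamic cover, obtaining the target cover in response to the control selection operation on the first window comprises: obtaining a static cover and an animation effect in response to the control selection operation on the first window; and generating, according to audio characteristics of the target audio, the static cover and the animation effect, a dynamic cover that changes with the audio characteristics of the target audio; wherein the audio characteristics comprise audio beat and/or volume.
- The method according to claim 3, further comprising: exporting data associated with the target audio to a target location in response to an export instruction on a third interface; wherein the target location comprises a photo album or a file system.
- The method according to claim 3, further comprising: sharing data associated with the target audio to a target application in response to a sharing instruction on a third interface.
- The method according to claim 13 or 14, wherein the data associated with the target audio comprises at least one of the following: the target audio, the human voice, the accompaniment, a vocal segment of the human voice, an accompaniment segment of the accompaniment, a static cover of the target audio, and a dynamic cover of the target audio.
- The method according to claim 3, further comprising: jumping from a third interface to a fourth interface in response to a touch operation on an audio editing control on the third interface, wherein the fourth interface comprises an audio processing function control or a trigger control associated with the audio processing function control, the trigger control being used to trigger display of the audio processing function control; and the audio processing function control comprises one or more of the following: an audio optimization control, used to trigger editing of audio to optimize the audio; an accompaniment separation control, used to trigger separation of a human voice and/or an accompaniment from audio; a style synthesis control, used to trigger separation of a human voice from audio, and mixing and editing of the separated human voice with a preset accompaniment; and an audio mashup control, used to trigger separation of a human voice from first audio, separation of an accompaniment from second audio, and mixing and editing of the separated human voice with the separated accompaniment.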
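- An audio processing device, comprising: an acquisition module, configured to acquire a human voice in response to a first instruction; the acquisition module being further configured to acquire an accompaniment in response to a second instruction; and a processing module, configured to mix the human voice and the accompaniment in response to a third instruction to obtain target audio.
- An electronic device, comprising: at least one processor and a memory; wherein the memory stores computer-executable instructions; and the at least one processor executes the computer-executable instructions stored in the memory, so that the at least one processor performs the audio processing method according to any one of claims 1 to 16.
- A computer-readable storage medium, wherein computer-executable instructions are stored in the computer-readable storage medium, and when a processor executes the computer-executable instructions, the audio processing method according to any one of claims 1 to 16 is implemented.
- A computer program product, comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1 to 16.
- A computer program which, when executed by a processor, implements the method according to any one of claims 1 to 16.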
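
The dynamic cover claim says only that the cover changes with audio characteristics such as beat and/or volume; it does not say how. One possible reading, offered as a minimal Python sketch under stated assumptions, maps per-frame volume to a scale factor for the static cover. The names `frame_rms` and `cover_scales`, the RMS volume measure, and the `base`/`depth` parameters are all hypothetical and do not come from the application.

```python
import math
from typing import List


def frame_rms(samples: List[float], frame_size: int) -> List[float]:
    """Root-mean-square volume of each consecutive frame of `frame_size`
    samples (the last frame may be shorter)."""
    if frame_size <= 0:
        raise ValueError("frame_size must be positive")
    volumes = []
    for start in range(0, len(samples), frame_size):
        frame = samples[start:start + frame_size]
        volumes.append(math.sqrt(sum(s * s for s in frame) / len(frame)))
    return volumes


def cover_scales(samples: List[float], frame_size: int,
                 base: float = 1.0, depth: float = 0.2) -> List[float]:
    """One scale factor per frame for the static cover image.

    Assumption: a silent frame keeps the cover at `base` size, and a
    full-scale frame (RMS >= 1.0) enlarges it to `base + depth`.
    """
    return [base + depth * min(1.0, v) for v in frame_rms(samples, frame_size)]
```

Each returned factor would drive one animation frame, so the cover pulses as the target audio gets louder; a beat-driven variant could substitute beat timestamps for the RMS values.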
Priority Applications (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US18/552,347 US20250087188A1 (en) | 2022-05-07 | 2023-05-05 | Audio processing method and apparatus, device and storage medium |
| EP23802775.9A EP4524952A4 (en) | 2022-05-07 | 2023-05-05 | AUDIO PROCESSING METHOD AND APPARATUS, STORAGE DEVICE AND MEDIA |
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202210495456.4A CN117059055A (zh) | 2022-05-07 | 2022-05-07 | Audio processing method and apparatus, device and storage medium |
| CN202210495456.4 | 2022-05-07 |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2023217003A1 (zh) | 2023-11-16 |
Family
ID=88655974
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/CN2023/092377 Ceased WO2023217003A1 (zh) | 2022-05-07 | 2023-05-05 | 音频处理方法、装置、设备及存储介质 |
Country Status (4)
| Country | Link |
|---|---|
| US (1) | US20250087188A1 (zh) |
| EP (1) | EP4524952A4 (zh) |
| CN (1) | CN117059055A (zh) |
| WO (1) | WO2023217003A1 (zh) |
Cited By (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN118283015A (zh) * | 2024-05-30 | 2024-07-02 | 江西扬声电子有限公司 | Multi-channel audio transmission method and system based on cabin Ethernet |
Families Citing this family (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20240233694A9 (en) * | 2022-10-20 | 2024-07-11 | Tuttii Inc. | System and method for enhanced audio data transmission and digital audio mashup automation |
Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN109714671A (zh) * | 2017-10-26 | 2019-05-03 | 张德明 | Wireless karaoke sound system |
| WO2020034227A1 (zh) * | 2018-08-17 | 2020-02-20 | 华为技术有限公司 | Multimedia content synchronization method and electronic device |
| CN111554329A (zh) * | 2020-04-08 | 2020-08-18 | 咪咕音乐有限公司 | Audio editing method, server and storage medium |
| CN112967705A (zh) * | 2021-02-24 | 2021-06-15 | 腾讯音乐娱乐科技(深圳)有限公司 | Mixed song generation method, apparatus, device and storage medium |
| WO2022062979A1 (zh) * | 2020-09-23 | 2022-03-31 | 华为技术有限公司 | Audio processing method, computer-readable storage medium, and electronic device |
Family Cites Families (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN111916039B (zh) * | 2019-05-08 | 2022-09-23 | 北京字节跳动网络技术有限公司 | Music file processing method, apparatus, terminal and storage medium |
| US11532317B2 (en) * | 2019-12-18 | 2022-12-20 | Munster Technological University | Audio interactive decomposition editor method and system |
| US11475867B2 (en) * | 2019-12-27 | 2022-10-18 | Spotify Ab | Method, system, and computer-readable medium for creating song mashups |
| CN113411516B (zh) * | 2021-05-14 | 2023-06-20 | 北京达佳互联信息技术有限公司 | Video processing method and apparatus, electronic device and storage medium |
| CN114023287B (zh) * | 2021-11-02 | 2025-09-02 | 广州酷狗计算机科技有限公司 | Audio file mixing processing method, apparatus, terminal and storage medium |
2022
- 2022-05-07 CN CN202210495456.4A patent/CN117059055A/zh active Pending
2023
- 2023-05-05 US US18/552,347 patent/US20250087188A1/en active Pending
- 2023-05-05 EP EP23802775.9A patent/EP4524952A4/en active Pending
- 2023-05-05 WO PCT/CN2023/092377 patent/WO2023217003A1/zh not_active Ceased
Patent Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN109714671A (zh) * | 2017-10-26 | 2019-05-03 | 张德明 | Wireless karaoke sound system |
| WO2020034227A1 (zh) * | 2018-08-17 | 2020-02-20 | 华为技术有限公司 | Multimedia content synchronization method and electronic device |
| CN111554329A (zh) * | 2020-04-08 | 2020-08-18 | 咪咕音乐有限公司 | Audio editing method, server and storage medium |
| WO2022062979A1 (zh) * | 2020-09-23 | 2022-03-31 | 华为技术有限公司 | Audio processing method, computer-readable storage medium, and electronic device |
| CN112967705A (zh) * | 2021-02-24 | 2021-06-15 | 腾讯音乐娱乐科技(深圳)有限公司 | Mixed song generation method, apparatus, device and storage medium |
Non-Patent Citations (1)
| Title |
|---|
| See also references of EP4524952A4 * |
Cited By (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
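| CN118283015A (zh) * | 2024-05-30 | 2024-07-02 | 江西扬声电子有限公司 | Multi-channel audio transmission method and system based on cabin Ethernet |
| CN118283015B (zh) * | 2024-05-30 | 2024-08-20 | 江西扬声电子有限公司 | Multi-channel audio transmission method and system based on cabin Ethernet |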
Also Published As
| Publication number | Publication date |
|---|---|
| US20250087188A1 (en) | 2025-03-13 |
| CN117059055A (zh) | 2023-11-14 |
| EP4524952A1 (en) | 2025-03-19 |
| EP4524952A4 (en) | 2026-04-29 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US10062367B1 (en) | | Vocal effects control system |
| WO2020113733A1 (zh) | | Animation generation method and apparatus, electronic device, and computer-readable storage medium |
| CN111046226B (zh) | | Music tuning method and apparatus |
| WO2023217003A1 (zh) | | Audio processing method and apparatus, device and storage medium |
| US20120072841A1 (en) | | Browser-Based Song Creation |
| JP2014520352A (ja) | | Enhanced media recording and playback |
| WO2016112841A1 (zh) | | Information processing method, client and computer storage medium |
| EP4715802A1 (en) | | Music generation method and apparatus, and electronic device and storage medium |
| JP2023538943A (ja) | | Audio data processing method, apparatus, device and storage medium |
| WO2023010949A1 (zh) | | Audio data processing method and apparatus |
| WO2022160603A1 (zh) | | Song recommendation method and apparatus, electronic device and storage medium |
| US20250299657A1 (en) | | Dj performance data conversion |
| WO2023051246A1 (zh) | | Video recording method, apparatus, device and storage medium |
| WO2023216999A1 (zh) | | Audio processing method, apparatus, device and storage medium |
| US20250036265A1 (en) | | Audio processing method and apparatus, device and storage medium |
| US9705953B2 (en) | | Local control of digital signal processing |
| CN119137652A (zh) | | Music generation method, music generation apparatus and computer-readable storage medium |
| WO2023160713A1 (zh) | | Music generation method, apparatus, device, storage medium and program |
| CN115440178A (zh) | | Audio recording method, apparatus and storage medium |
| WO2022143530A1 (zh) | | Audio processing method and apparatus, computer device and storage medium |
| Jago | | Adobe Audition CC Classroom in a Book |
| US12585419B2 (en) | | Audio processing method and apparatus, and electronic device |
| Adobe Creative Team et al. | | Adobe Audition CS6 Classroom in a Book |
| US12586606B1 (en) | | Audio alignment systems and techniques |
| WO2020124679A1 (zh) | | Method and apparatus for preconfiguring video processing parameter information, and electronic device |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | WWE | Wipo information: entry into national phase | Ref document number: 18552347; Country of ref document: US |
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 23802775; Country of ref document: EP; Kind code of ref document: A1 |
| | WWE | Wipo information: entry into national phase | Ref document number: 2023802775; Country of ref document: EP |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | ENP | Entry into the national phase | Ref document number: 2023802775; Country of ref document: EP; Effective date: 20241209 |
| | WWP | Wipo information: published in national office | Ref document number: 18552347; Country of ref document: US |