WO2023045730A1 - Audio and video processing method, apparatus, device, and storage medium - Google Patents
- Publication number
- WO2023045730A1 (application PCT/CN2022/116650)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- audio
- video
- edited
- text data
- preset
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G11—INFORMATION STORAGE
- G11B—INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
- G11B27/00—Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
- G11B27/02—Editing, e.g. varying the order of information signals recorded on, or reproduced from, record carriers
- G11B27/031—Electronic editing of digitised analogue information signals, e.g. audio or video signals
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/80—Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
- H04N21/85—Assembly of content; Generation of multimedia applications
- H04N21/854—Content authoring
- H04N21/8547—Content authoring involving timestamps for synchronizing content
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/16—Sound input; Sound output
- G06F3/165—Management of the audio stream, e.g. setting of volume, audio stream path
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
-
- G—PHYSICS
- G11—INFORMATION STORAGE
- G11B—INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
- G11B27/00—Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
- G11B27/10—Indexing; Addressing; Timing or synchronising; Measuring tape travel
- G11B27/34—Indicating arrangements
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/4302—Content synchronisation processes, e.g. decoder synchronisation
- H04N21/4307—Synchronising the rendering of multiple content streams or additional data on devices, e.g. synchronisation of audio on a mobile phone with the video output on the TV screen
- H04N21/43072—Synchronising the rendering of multiple content streams or additional data on devices, e.g. synchronisation of audio on a mobile phone with the video output on the TV screen of multiple content streams on the same device
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/439—Processing of audio elementary streams
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/44—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
Definitions
- the present disclosure relates to the field of data processing, and in particular to an audio and video processing method, device, equipment and storage medium.
- an embodiment of the present disclosure provides an audio and video processing method, which can improve the accuracy of audio and video editing and simplify user operations.
- the present disclosure provides an audio and video processing method, the method comprising:
- the audio and video segment corresponding to the target audio and video time stamp in the audio and video to be edited is processed.
- the method also includes:
- the preset keywords or the preset mute segments in the text data are displayed according to a preset second display style.
- the first editing entry corresponds to a first editing card, and a one-key delete control is set on the first editing card; after the preset keywords or the preset mute segments in the text data are displayed according to the preset second display style in response to a trigger operation on the first editing entry, the method further includes:
- in response to a trigger operation on the one-key delete control, the preset keyword or the preset mute segment is deleted from the text data.
- the method also includes:
- the human voice in the audio and video to be edited is enhanced.
- the method also includes:
- normalization processing is performed on the loudness of the volume in the audio and video to be edited.
- the method also includes:
- in response to a trigger operation on the smart clip control, the music volume and the human voice volume in the audio and video segment within a preset time period of the audio and video to be edited are adjusted to obtain a volume-adjusted audio and video segment; wherein the volume of the music in the volume-adjusted audio and video segment is inversely proportional to the volume of the human voice.
- the preset operation includes a selection operation, and processing, based on the preset operation, the audio and video segment corresponding to the target audio and video timestamp in the audio and video to be edited includes:
- the audio and video segment corresponding to the target audio and video timestamp in the audio and video to be edited is displayed.
- the preset operation includes a delete operation, and processing, based on the preset operation, the audio and video segment corresponding to the target audio and video timestamp in the audio and video to be edited includes:
- the audio and video segment corresponding to the target audio and video timestamp in the audio and video to be edited is deleted.
- the preset operation includes a modification operation, and processing, based on the preset operation, the audio and video segment corresponding to the target audio and video timestamp in the audio and video to be edited includes:
- the audio and video segment corresponding to the target audio and video timestamp in the audio and video to be edited is replaced.
- the method also includes:
- a first audio and video clip is generated based on the first text data and the timbre information in the audio and video to be edited;
- the first audio and video segment is added to the audio and video to be edited.
- the present disclosure also provides an audio and video processing device, the device comprising:
- the first display module is used to display the text data corresponding to the audio and video to be edited; wherein, the text data has a mapping relationship with the audio and video timestamp of the audio and video to be edited;
- the second display module is used to display the audio and video to be edited according to the time axis track;
- a determining module configured to determine the audio and video timestamp corresponding to the target text data as the target audio and video timestamp in response to a preset operation triggered for the target text data in the text data;
- the editing module is configured to process the audio and video segment corresponding to the target audio and video time stamp in the audio and video to be edited based on the preset operation.
- the present disclosure provides a computer-readable storage medium, where instructions are stored in the computer-readable storage medium, and when the instructions are run on a terminal device, the terminal device is made to implement the above method.
- the present disclosure provides a device, including: a memory, a processor, and a computer program stored on the memory and operable on the processor, when the processor executes the computer program, Implement the above method.
- the present disclosure provides a computer program product, where the computer program product includes a computer program/instruction, and when the computer program/instruction is executed by a processor, the above method is implemented.
- the embodiment of the present disclosure provides an audio and video processing method that displays the text data corresponding to the audio and video to be edited; in response to a preset operation triggered for target text data in the text data, determines the audio and video timestamp corresponding to the target text data as the target audio and video timestamp; and, based on the preset operation, processes the audio and video segment corresponding to the target audio and video timestamp in the audio and video to be edited. The method can therefore improve the accuracy of audio and video editing, simplify user operations, and lower the threshold for user operations.
- FIG. 1 is a flowchart of an audio and video processing method provided by an embodiment of the present disclosure
- FIG. 2 is a schematic diagram of an audio and video processing interface provided by an embodiment of the present disclosure
- FIG. 3 is a schematic diagram of another audio and video processing interface provided by an embodiment of the present disclosure.
- FIG. 4 is a flowchart of another audio and video processing method provided by an embodiment of the present disclosure.
- FIG. 5 is a schematic diagram of another audio and video processing interface provided by an embodiment of the present disclosure.
- FIG. 6 is a schematic diagram of another audio and video processing interface provided by an embodiment of the present disclosure.
- FIG. 7 is a schematic structural diagram of an audio and video processing device provided by an embodiment of the present disclosure.
- Fig. 8 is a schematic structural diagram of an audio and video processing device provided by an embodiment of the present disclosure.
- FIG. 1 is a flowchart of an audio and video processing method provided by an embodiment of the present disclosure. The method includes:
- S101 Display text data corresponding to audio and video to be edited.
- the text data has a mapping relationship with the audio and video time stamp of the audio and video to be edited, and the audio and video time stamp is used to indicate the playing time of each frame of the audio and video.
- audio and video to be edited include but are not limited to recorded audio and video, audio and video obtained based on a script, and the like.
- the text data can be obtained by speech recognition of the audio and video to be edited, or it can be a script. When the text data is a script, the text data can be matched against the audio and video to be edited to obtain the mapping relationship between the text data and the audio and video timestamps of the audio and video to be edited.
- speech recognition methods include, but are not limited to, ASR (Automatic Speech Recognition) technology.
- the text data can be displayed on the interface.
- the interface is shown in FIG. 2
- area P in FIG. 2 shows the displayed text data.
- the text data of different users can be determined, such as the text data of user a and user b shown in FIG. 2 .
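- As an illustration of the mapping relationship described for S101, the following minimal sketch stores each recognized word together with the playback interval it maps to. The `Token` type, its field names, and the sample transcript are assumptions made for illustration; the disclosure does not specify a data structure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """One unit of text data and the playback interval it maps to (seconds)."""
    text: str
    start: float
    end: float
    speaker: str  # hypothetical field; lets the interface group text by user


# Hypothetical speech-recognition output for a short two-speaker clip.
TRANSCRIPT = [
    Token("hello", 0.0, 0.4, "a"),
    Token("um", 0.5, 0.7, "a"),
    Token("world", 0.9, 1.3, "b"),
]


def display_text(tokens):
    """Return the text shown in the text area (area P in FIG. 2)."""
    return " ".join(t.text for t in tokens)


def text_by_speaker(tokens):
    """Group the displayed text by speaker, preserving order."""
    grouped = {}
    for t in tokens:
        grouped.setdefault(t.speaker, []).append(t.text)
    return grouped
```

- Keeping the start and end time on every token is what later allows an edit on the text to be translated into an edit on the time axis track.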
- S102 Display the audio and video to be edited according to the time axis track.
- the audio and video to be edited can be displayed on the interface according to the time axis track.
- area Q in FIG. 2 shows the displayed audio and video to be edited.
- step 102 is not specifically limited.
- S103 In response to a preset operation triggered for target text data in the text data, determine the audio and video timestamp corresponding to the target text data as the target audio and video timestamp.
- the preset operations include, but are not limited to, selection operations, deletion operations, and modification operations. Since the text data has a mapping relationship with the audio and video timestamps of the audio and video to be edited, the target audio and video timestamp corresponding to the target text data can be determined according to the mapping relationship.
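- The lookup from target text data to a target audio and video timestamp can be sketched as follows. This is a minimal sketch that assumes the text data is a list of `(text, start, end)` tuples and that the selection is given as an inclusive index range; both are illustrative assumptions, not details from the disclosure.

```python
def target_timestamp(tokens, first, last):
    """Map a selected run of tokens to the time interval it covers.

    tokens: list of (text, start_seconds, end_seconds) tuples.
    first, last: inclusive indices of the selected (target) text data.
    Returns (start, end) of the corresponding audio and video segment.
    """
    if not (0 <= first <= last < len(tokens)):
        raise IndexError("selection is outside the text data")
    return tokens[first][1], tokens[last][2]


# Hypothetical transcript used only for this example.
tokens = [("hello", 0.0, 0.4), ("um", 0.5, 0.7), ("world", 0.9, 1.3)]
selected = target_timestamp(tokens, 1, 2)  # user selects "um world"
```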
- S104 Based on the preset operation, process the audio and video segment corresponding to the target audio and video time stamp in the audio and video to be edited.
- the corresponding audio and video segment in the audio and video to be edited can be determined from the audio and video timestamp. Processing the audio and video segment corresponding to the target audio and video timestamp implements text-based audio and video editing: editing the text edits the linked audio and video segment, so the audio and video can be edited with high accuracy.
- the preset operation includes a selection operation, and processing, based on the preset operation, the audio and video segment corresponding to the target audio and video timestamp in the audio and video to be edited includes: displaying, according to a preset first display style, the audio and video segment corresponding to the target audio and video timestamp in the audio and video to be edited.
- the first display style is, for example, highlighting.
- FIG. 3 shows a schematic diagram of another interface. Referring to FIG. 3, based on the selection operation, the target text data can be highlighted, and the audio and video segment corresponding to the target audio and video timestamp can be highlighted on the time axis track; the highlighted parts are shown by the dotted lines in FIG. 3.
- the preset operation includes a delete operation, and processing, based on the preset operation, the audio and video segment corresponding to the target audio and video timestamp in the audio and video to be edited includes: deleting, based on the delete operation, the audio and video segment corresponding to the target audio and video timestamp in the audio and video to be edited.
- the target text data may be deleted, and the audio and video segment corresponding to the target audio and video time stamp may be deleted.
- a delete control may be displayed, and in response to a trigger operation on the delete control, the target text data and the audio and video segment corresponding to the target audio and video timestamp are deleted.
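- The linked deletion of text and media can be sketched as below. The sketch assumes audio is a flat list of samples at a fixed sample rate and that later tokens are shifted earlier by the removed duration so the mapping stays valid; the function names and this shifting policy are assumptions, since the disclosure only states that both the text and the segment are deleted.

```python
def delete_selection(tokens, first, last):
    """Delete tokens[first..last] and return (remaining_tokens, cut_interval).

    tokens: list of (text, start_seconds, end_seconds) tuples.
    Tokens after the cut are shifted earlier by the removed duration.
    """
    start, end = tokens[first][1], tokens[last][2]
    removed = end - start
    remaining = list(tokens[:first])
    for text, s, e in tokens[last + 1:]:
        remaining.append((text, s - removed, e - removed))
    return remaining, (start, end)


def cut_samples(samples, rate, start, end):
    """Remove the samples that fall inside [start, end) seconds."""
    a, b = round(start * rate), round(end * rate)
    return samples[:a] + samples[b:]


# Hypothetical data: three words and six audio samples at 4 samples/second.
tokens = [("hello", 0.0, 0.5), ("um", 0.5, 1.0), ("world", 1.0, 1.5)]
remaining, cut = delete_selection(tokens, 1, 1)
audio = cut_samples(list(range(6)), 4, *cut)
```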
- the preset operation includes a modification operation, and processing, based on the preset operation, the audio and video segment corresponding to the target audio and video timestamp in the audio and video to be edited includes: obtaining the modified text data corresponding to the modification operation; generating an audio and video segment based on the modified text data and the timbre information in the audio and video to be edited, as the audio and video segment to be modified; and using the audio and video segment to be modified to replace the audio and video segment corresponding to the target audio and video timestamp in the audio and video to be edited.
- the target text data can be modified.
- a modification control can be displayed; in response to a trigger operation on the modification control, modified text data is generated according to the received modification content, and the audio and video segment corresponding to the target audio and video timestamp is replaced with the audio and video segment to be modified, so as to realize the modification of the audio and video to be edited.
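- The replacement flow for a modification operation can be sketched as follows. The `synthesize` function is a placeholder that returns silence; it stands in for whatever timbre-based speech generation an implementation would use and is not an API named in the disclosure. The sample-list audio model and parameter names are likewise assumptions.

```python
def synthesize(text, timbre, rate):
    """Placeholder for timbre-based generation: 0.1 s of silence per character.

    A real system would produce speech in the given timbre; this stub only
    returns a clip of a predictable length so the splice can be demonstrated.
    """
    return [0.0] * (len(text) * rate // 10)


def replace_segment(samples, rate, start, end, new_text, timbre):
    """Replace the samples in [start, end) seconds with a generated clip."""
    a, b = round(start * rate), round(end * rate)
    clip = synthesize(new_text, timbre, rate)
    return samples[:a] + clip + samples[b:]


# Hypothetical 4-second clip at 10 samples/second; replace second 1.0 to 2.0.
original = [1.0] * 40
edited = replace_segment(original, 10, 1.0, 2.0, "hi", timbre="user_a")
```

- Because the generated clip can be shorter or longer than the segment it replaces, timestamps of the text after the edit would also need to be shifted by the difference in duration.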
- the audio and video processing method provided by the embodiments of the present disclosure, by displaying the text data corresponding to the audio and video to be edited, in response to the preset operation triggered for the target text data in the text data, determine the audio and video timestamp corresponding to the target text data, As the target audio and video timestamp, and based on preset operations, the audio and video segment corresponding to the target audio and video timestamp in the audio and video to be edited is processed. It can be seen that the audio and video processing method provided by the embodiments of the present disclosure can edit audio and video based on text. Since there is a mapping relationship between text and audio and video timestamps, by editing text and associated audio and video clips, audio and video can be edited.
- invalid modal particles such as "um", "uh", and "that", as well as silent segments, commonly appear in dialogue. To keep the dialogue continuous, there is therefore a need to delete these invalid modal particles and silent segments from the audio and video to be edited.
- the audio and video processing method of the embodiment of the present disclosure also includes:
- Step 401: Display the first editing entry for preset keywords or preset mute segments.
- the text data corresponding to the audio and video to be edited can be detected to determine the preset keywords or preset mute segments in the text data; when preset keywords or preset mute segments exist in the text data, the first editing entry is displayed.
- the control shown in area A in FIG. 3 is the first editing entry, and the information of "modification suggestion 01: remove invalid modal particles" is displayed on the first editing entry.
- the preset keywords can include vocabulary such as invalid modal particles, and there are many ways to determine the preset keywords in the text data; for example, natural language processing technology can be used to determine the preset keywords in the text data.
- the preset mute segment is determined according to the interval between the audio and video timestamps corresponding to two adjacent characters; for example, when the interval is greater than a preset threshold, it is determined that a preset mute segment exists between the two adjacent characters.
- the mute segment can be displayed in the form of spaces on the interface.
- the display length of the mute segment can be determined according to the value of the interval.
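- The interval-based detection of mute segments, and their display as spaces whose count depends on the interval, can be sketched as follows. The threshold of 0.5 seconds and the scale of one extra space per 0.25 seconds are illustrative assumptions; the disclosure only says that a preset threshold is used and that the display length depends on the interval.

```python
def find_mute_segments(tokens, threshold=0.5):
    """Return (start, end) for every gap between adjacent tokens > threshold.

    tokens: list of (text, start_seconds, end_seconds) tuples in play order.
    """
    gaps = []
    for prev, cur in zip(tokens, tokens[1:]):
        if cur[1] - prev[2] > threshold:
            gaps.append((prev[2], cur[1]))
    return gaps


def render_with_gaps(tokens, threshold=0.5, seconds_per_space=0.25):
    """Render the text, widening the spacing where a mute segment exists."""
    if not tokens:
        return ""
    parts = [tokens[0][0]]
    for prev, cur in zip(tokens, tokens[1:]):
        gap = cur[1] - prev[2]
        spaces = 1  # normal word separator
        if gap > threshold:
            spaces += int(gap / seconds_per_space)
        parts.append(" " * spaces + cur[0])
    return "".join(parts)


# Hypothetical transcript with a one-second pause before "okay".
tokens = [("so", 0.0, 0.5), ("yes", 0.75, 1.0), ("okay", 2.0, 2.5)]
```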
- Step 402: In response to a trigger operation on the first editing entry, display the preset keywords or preset mute segments in the text data according to the preset second display style.
- the trigger operation on the first editing entry includes, but is not limited to, a click operation, a voice instruction, and a touch track.
- the second display style may be highlighting or other display styles, which are not specifically limited here.
- Fig. 5 shows a schematic diagram of an interface.
- the preset keywords "uh", "um", and "that" are highlighted in the interface, as shown by the dotted lines.
- Step 403: In response to a trigger operation on the one-key delete control, delete the preset keyword or the preset mute segment from the text data.
- the first editing entry corresponds to the first editing card, and a one-key delete control is set on the first editing card.
- the first editing card is displayed in a manner that includes, but is not limited to, a drop-down option, a floating window, and the like.
- the first editing card is shown in area B in FIG. 5; the number of occurrences of each preset keyword can be counted, and the preset keywords and their occurrence counts can be displayed in the first editing card.
- in response to a trigger operation on a target keyword among the preset keywords, the target keyword is removed from the preset keywords, and the occurrence counts displayed in the first editing card are updated synchronously. This lets users exclude, by clicking or similar operations, keywords that are not actually invalid modal particles, so that they are not removed by the one-key deletion.
- the deletion operation of preset keywords or preset silent segments can be presented in the form of an editing card, providing one-click operation, saving editing time, simplifying user operations, and lowering the threshold for users to use.
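- The counting, exclusion, and one-key deletion of preset keywords can be sketched as follows. The keyword set and the exact-match rule on lowercase words are assumptions for illustration; the disclosure leaves the detection method open.

```python
from collections import Counter

# Assumed preset keyword list, taken from the examples in the text.
PRESET_KEYWORDS = {"um", "uh", "that"}


def count_keywords(words, keywords):
    """Count occurrences of each preset keyword, for display on the card."""
    return Counter(w.lower() for w in words if w.lower() in keywords)


def one_key_delete(words, keywords):
    """Remove every word that matches a preset keyword."""
    return [w for w in words if w.lower() not in keywords]


words = "um I think that uh that works".split()
counts = count_keywords(words, PRESET_KEYWORDS)

# The user taps "that" on the card to exclude it before deleting.
active = PRESET_KEYWORDS - {"that"}
cleaned = one_key_delete(words, active)
```

- In a full implementation each deleted word would also trigger deletion of the audio and video segment at its mapped timestamp.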
- the audio and video processing method further includes: displaying a voice enhancement control on the second editing card; in response to a trigger operation on the voice enhancement control, performing enhancement processing on the human voice in the audio and video to be edited.
- the second editing entry for the audio and video to be edited is displayed, and the second editing entry corresponds to the second editing card, and the voice enhancement control is set on the second editing card.
- noise detection can be performed based on audio and video to be edited, and when noise is detected, a second editing entry is displayed.
- the control shown in area C in FIG. 2 is the second editing entry, and the message "Enhancement Suggestion: Speech Enhancement" is displayed on the second editing entry.
- a second edit card is displayed.
- the second editing card is shown in area D in Fig. 6.
- the voice enhancement control "Enhanced Voice" is displayed in the second editing card.
- in response to a trigger operation on the voice enhancement control, the human voice is enhanced; the trigger operation includes, but is not limited to, a click operation, a voice command, and a touch track.
- the voice enhancement operation can be presented in the form of an editing card that provides a one-key operation, which enhances the human voice to improve the listening experience, simplifies user operations, and lowers the threshold for use.
- the audio and video processing method further includes: determining the soundtrack corresponding to the audio and video to be edited based on the music genre of the audio and video to be edited and/or the content in the text data corresponding to the audio and video to be edited; and adding the soundtrack to the audio and video segment to be edited.
- multiple tags can be preset, and there is a mapping relationship between each tag and one or more soundtracks, based on the music genre of the audio and video to be edited and/or the content in the text data corresponding to the audio and video to be edited, Determine the tag corresponding to the music genre and/or the content in the text data, and determine the corresponding soundtrack of the audio and video to be edited based on the mapping relationship between the tag and the soundtrack.
- for example, the theme of the content in the text data is determined to be "sports" based on natural language processing technology; the soundtrack corresponding to the "sports" tag is then taken as the soundtrack corresponding to the audio and video to be edited and is added to the audio and video segment to be edited.
- the corresponding label is determined, and the soundtrack corresponding to the label is used as the soundtrack corresponding to the audio and video to be edited, and the soundtrack is added to the audio and video segment to be edited.
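- The tag-based soundtrack selection can be sketched as follows. Simple keyword overlap is swapped in here for the natural language processing step, and the tag names, keyword sets, and file names are all hypothetical.

```python
# Hypothetical mapping from tags to one or more soundtracks.
SOUNDTRACKS_BY_TAG = {
    "sports": ["upbeat_drums.mp3"],
    "news": ["calm_piano.mp3"],
}

# Hypothetical keyword sets that stand in for a real theme classifier.
TAG_KEYWORDS = {
    "sports": {"match", "goal", "team"},
    "news": {"report", "headline", "election"},
}


def pick_tag(text):
    """Return the tag whose keywords overlap the text most, or None."""
    words = set(text.lower().split())
    scores = {tag: len(words & kws) for tag, kws in TAG_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def recommend_soundtracks(text):
    """Map the text content to a tag, then the tag to its soundtracks."""
    return SOUNDTRACKS_BY_TAG.get(pick_tag(text), [])
```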
- the soundtrack can be intelligently recommended based on the content and genre of the text data to meet the scene requirements for adding soundtracks, enrich the diversity of listening experience, improve the listening experience, simplify user operations, and lower the threshold for users to use.
- the audio and video processing method further includes: displaying a loudness equalization control on the third editing card; in response to a trigger operation on the loudness equalization control, performing normalization processing on the loudness of the volume in the audio and video to be edited .
- a third editing entry for the audio and video to be edited is displayed, the third editing entry corresponds to the third editing card, and a loudness equalization control is set on the third editing card.
- the volume loudness detection can be performed based on the audio and video to be edited, and a third editing entry is displayed when it is detected that the audio and video to be edited does not satisfy the preset loudness equalization condition.
- the third editing card is displayed, and in response to a trigger operation on the loudness equalization control, the loudness of the volume in the audio and video to be edited is normalized, for example, so that the loudness of the volume in the audio and video to be edited falls within a preset range.
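- Loudness normalization can be sketched as scaling the signal to a target level. This minimal sketch uses root-mean-square (RMS) amplitude as a simple stand-in for loudness and a target of 0.1; both are assumptions, and the disclosure does not name a loudness measure or a target range.

```python
import math


def rms(samples):
    """Root-mean-square amplitude of a list of samples in [-1.0, 1.0]."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def normalize_loudness(samples, target_rms=0.1):
    """Scale samples so their RMS equals target_rms, clipping to [-1, 1]."""
    current = rms(samples)
    if current == 0.0:
        return list(samples)  # silence stays silence
    gain = target_rms / current
    return [max(-1.0, min(1.0, s * gain)) for s in samples]


def needs_equalization(samples, low=0.05, high=0.2):
    """Check the assumed preset condition: RMS outside [low, high]."""
    level = rms(samples)
    return not (low <= level <= high)


loud = [0.5, -0.5, 0.5, -0.5]
balanced = normalize_loudness(loud)
```

- The `needs_equalization` check mirrors the idea that the third editing entry is shown only when the preset loudness condition is not met.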
- the loudness equalization operation can be presented in the form of an editing card, providing one-click operation, which can improve the listening experience, simplify user operations, and lower the threshold for users to use.
- the audio and video processing method further includes: displaying the smart clip control on the fourth editing card; and, in response to a trigger operation on the smart clip control, adjusting the music volume and the human voice volume in the audio and video segment within a preset time period of the audio and video to be edited, to obtain the volume-adjusted audio and video segment.
- the fourth editing entry for the audio and video to be edited is displayed, and the fourth editing entry corresponds to the fourth editing card, and the smart clip control is set on the fourth editing card.
- the fourth editing card is displayed, and in response to a trigger operation on the smart clip control, the music volume and the human voice volume are adjusted, for example, by increasing the human voice volume by a first volume value and decreasing the music volume by a second volume value, or by decreasing the music volume by a third volume value in the audio and video segment where a human voice is detected, to obtain the volume-adjusted audio and video segment.
- the volume of the music in the audio and video segment is inversely proportional to the volume of the human voice.
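- The inverse relationship between music volume and human voice volume can be sketched as a gain rule. The constant `k`, the gain ceiling, and the per-frame voice levels are illustrative assumptions; the disclosure states only that the music volume is inversely proportional to the voice volume.

```python
def music_gain(voice_level, k=0.05, max_gain=1.0):
    """Gain applied to the music for a given voice level (0.0 to 1.0).

    The gain is k / voice_level, capped at max_gain so that the music plays
    at full volume when no voice (or a very quiet voice) is present.
    """
    if voice_level <= 0.0:
        return max_gain
    return min(max_gain, k / voice_level)


def duck_music(music_frames, voice_levels):
    """Scale each music frame by the gain derived from the voice level."""
    return [m * music_gain(v) for m, v in zip(music_frames, voice_levels)]


# Hypothetical per-frame levels: silence, quiet speech, loud speech.
voice = [0.0, 0.1, 0.5]
ducked = duck_music([1.0, 1.0, 1.0], voice)
```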
- title generation can also be realized: for example, in response to a trigger operation on the smart clip control, the currently selected second text data and its corresponding second audio and video segment are determined, and the second text data and the second audio and video segment are copied and pasted to the preset title area to achieve the clip effect.
- the smart clip function can be presented in the form of an editing card, providing one-click operation, realizing the effect of clips, simplifying user operations, and lowering the threshold for users to use.
- the audio and video processing method further includes: when an adding operation for first text data in the text data is received, generating a first audio and video segment based on the first text data and the timbre information in the audio and video to be edited; determining the first audio and video timestamp corresponding to the first text data based on the position information of the first text data in the text data; and adding the first audio and video segment to the audio and video to be edited based on the first audio and video timestamp.
- the first text data may be obtained in response to an input operation, or may be obtained based on copying existing text data.
- the timbre information of each user can be obtained according to the audio and video to be edited.
- the corresponding first audio and video timestamp is determined according to the position information of the first text data in the text data, and the first audio and video segment is added at the position of the first audio and video timestamp.
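- Deriving the insertion timestamp from the position of the added text, and splicing a generated clip in at that point, can be sketched as follows. The rule used here (insert at the end time of the preceding token, or at 0.0 when the text is added at the very beginning) is an assumption; the disclosure only states that the timestamp follows from the position information.

```python
def insertion_timestamp(tokens, position):
    """Timestamp at which text inserted before tokens[position] should play.

    tokens: list of (text, start_seconds, end_seconds) tuples.
    position: index in 0..len(tokens) where the new text is inserted.
    """
    if not (0 <= position <= len(tokens)):
        raise IndexError("position is outside the text data")
    if position == 0:
        return 0.0
    return tokens[position - 1][2]


def insert_clip(samples, rate, timestamp, clip):
    """Splice a generated clip into the samples at the given timestamp."""
    a = round(timestamp * rate)
    return samples[:a] + clip + samples[a:]


# Hypothetical data: new text is typed between "hello" and "world".
tokens = [("hello", 0.0, 0.5), ("world", 0.5, 1.0)]
when = insertion_timestamp(tokens, 1)
spliced = insert_clip([1, 1, 1, 1], 4, when, [0, 0])
```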
- the aforementioned editing entry can be automatically displayed based on the detection result, or can be displayed on the interface in response to a trigger operation.
- timbre cloning and voice broadcasting technologies are used to clone the timbre for the added text and intelligently generate audio and video segments. This realizes the addition of audio and video segments based on text input, reduces the time and editing costs caused by re-recording, and simplifies user operations.
- the present disclosure also provides an audio and video processing device.
- FIG. 7 is a schematic structural diagram of an audio and video processing device provided by an embodiment of the present disclosure.
- the device includes:
- the first display module 701 is configured to display the text data corresponding to the audio and video to be edited; wherein, the text data has a mapping relationship with the audio and video time stamp of the audio and video to be edited.
- the second display module 702 is configured to display the audio and video to be edited according to the time axis track.
- the determining module 703 is configured to determine an audio and video time stamp corresponding to the target text data as a target audio and video time stamp in response to a preset operation triggered for the target text data in the text data.
- the editing module 704 is configured to process the audio and video segment corresponding to the target audio and video time stamp in the audio and video to be edited based on the preset operation.
- the audio and video processing device also includes:
- the first processing module is configured to display a first editing entry for a preset keyword or a preset mute segment, and, in response to a trigger operation on the first editing entry, display the preset keyword or the preset mute segment in the text data according to a preset second display style.
- the first editing entry corresponds to the first editing card, and the first editing card is provided with a one-key delete control; the first processing module is further configured to delete the preset keyword or the preset mute segment from the text data in response to a trigger operation on the one-key delete control.
- the audio and video processing device also includes:
- the second processing module is configured to display a voice enhancement control on the second editing card; in response to a trigger operation on the voice enhancement control, perform enhancement processing on the human voice in the audio and video to be edited.
- the audio and video processing device also includes:
- the first adding module is configured to determine the soundtrack corresponding to the audio and video to be edited based on the music genre of the audio and video to be edited and/or the content in the text data corresponding to the audio and video to be edited, and to add the soundtrack to the audio and video segment to be edited.
- the audio and video processing device also includes:
- the third processing module is configured to display the loudness equalization control on the third editing card; in response to a trigger operation on the loudness equalization control, normalize the loudness of the volume in the audio and video to be edited.
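Loudness normalization can be illustrated with a simple RMS-based gain. This is a sketch under stated assumptions: the disclosure does not say which loudness measure is used, samples are assumed to be floats in [-1.0, 1.0], and `target_rms` is an arbitrary illustrative value. Production systems often use a perceptual measure instead of plain RMS.

```python
import math
from typing import List


def rms(samples: List[float]) -> float:
    """Root-mean-square level of a block of samples (0.0 for an empty block)."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def normalize_loudness(samples: List[float], target_rms: float = 0.1) -> List[float]:
    """Scale samples so their RMS level matches target_rms, clipping to [-1.0, 1.0]."""
    current = rms(samples)
    if current == 0.0:
        return list(samples)  # pure silence: nothing to scale
    gain = target_rms / current
    return [max(-1.0, min(1.0, s * gain)) for s in samples]
```

Applying the same target to every segment brings quiet and loud passages to a comparable level, which is the effect the loudness equalization control describes.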
- the audio and video processing device also includes:
- the fourth processing module is configured to display the smart cutout control on the fourth editing card; and, in response to a trigger operation for the smart cutout control, adjust the music volume and the human-voice volume in the audio and video segment within the initial preset time period of the audio and video to be edited, to obtain a volume-adjusted audio and video segment; wherein the music volume in the volume-adjusted audio and video segment is inversely proportional to the human-voice volume.
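The inverse relationship between music volume and human-voice volume can be sketched as a per-frame gain applied to the music track. The constant `k`, the gain ceiling and the level floor are assumptions chosen for the example; the disclosure only states that the two volumes are inversely proportional.

```python
from typing import List


def music_gain(voice_level: float, k: float = 0.05,
               max_gain: float = 1.0, floor: float = 1e-3) -> float:
    """Gain for the music track: k / voice_level, capped at max_gain.

    voice_level is an assumed per-frame voice level in [0, 1]; the floor
    avoids division by zero when nobody is speaking.
    """
    return min(max_gain, k / max(voice_level, floor))


def duck_music(music_frames: List[List[float]], voice_levels: List[float],
               k: float = 0.05) -> List[List[float]]:
    """Scale each music frame by the gain derived from the matching voice level."""
    if len(music_frames) != len(voice_levels):
        raise ValueError("need exactly one voice level per music frame")
    return [[s * music_gain(v, k) for s in frame]
            for frame, v in zip(music_frames, voice_levels)]
```

With these assumed values the music plays at full level while the voice is silent and is attenuated as the voice gets louder.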
- the preset operation includes a selection operation, and the editing module 704 is specifically configured to: display, according to a preset first display style, the audio and video segment corresponding to the target audio and video timestamp in the audio and video to be edited.
- the preset operation includes a delete operation, and the editing module 704 is specifically configured to: based on the delete operation, delete the audio and video segment corresponding to the target audio and video timestamp in the audio and video to be edited.
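Deleting the segment that corresponds to a target timestamp amounts to cutting a sample range out of the media. The sketch below works on a flat list of audio samples only; the function name, the sample-rate parameter and the rounding of seconds to sample indices are illustrative assumptions, and video frames would need an equivalent cut.

```python
from typing import List


def delete_segment(samples: List[float], sample_rate: int,
                   start: float, end: float) -> List[float]:
    """Return samples with the span [start, end) in seconds removed."""
    if not (0.0 <= start <= end):
        raise ValueError("need 0 <= start <= end")
    lo = min(len(samples), round(start * sample_rate))
    hi = min(len(samples), round(end * sample_rate))
    return samples[:lo] + samples[hi:]
```

After such a cut, the timestamps mapped to any later text would have to be shifted earlier by `end - start` so that the mapping stays valid; that bookkeeping is omitted here.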
- the preset operation includes a modification operation, and the editing module 704 is specifically configured to: obtain the modified text data corresponding to the modification operation; generate an audio and video segment, as the audio and video segment to be modified, based on the modified text data and the timbre information in the audio and video to be edited; and use the audio and video segment to be modified to replace the audio and video segment corresponding to the target audio and video timestamp in the audio and video to be edited.
- the audio and video processing device also includes:
- the second adding module is configured to: when receiving an adding operation for first text data in the text data, generate a first audio and video segment based on the first text data and the timbre information in the audio and video to be edited; determine a first audio and video timestamp corresponding to the first text data based on the position information of the first text data in the text data; and add the first audio and video segment to the audio and video to be edited based on the first audio and video timestamp.
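The modification and adding operations both reduce to splicing a newly generated segment into the media at a timestamp derived from the text position. The sketch below assumes the generated segment is already available as a list of samples (timbre-based synthesis is outside its scope) and assumes that newly added text takes the end time of the text unit preceding it; both function names are hypothetical.

```python
from typing import List, Tuple


def insertion_timestamp(token_spans: List[Tuple[float, float]], position: int) -> float:
    """Timestamp for text inserted before token index `position`.

    Assumption: new text starts where the preceding token ends, or at 0.0
    when it is inserted at the very beginning.
    """
    if not 0 <= position <= len(token_spans):
        raise IndexError("position is outside the transcript")
    return 0.0 if position == 0 else token_spans[position - 1][1]


def splice(samples: List[float], sample_rate: int, start: float, end: float,
           new_samples: List[float]) -> List[float]:
    """Replace the span [start, end) with new_samples; start == end is a pure insert."""
    if not (0.0 <= start <= end):
        raise ValueError("need 0 <= start <= end")
    lo = min(len(samples), round(start * sample_rate))
    hi = min(len(samples), round(end * sample_rate))
    return samples[:lo] + new_samples + samples[hi:]
```

Calling `splice` with `start == end` models the adding operation, while a non-empty span models the replacement performed for a modification operation.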
- the audio and video processing device provided by the embodiments of the present disclosure displays the text data corresponding to the audio and video to be edited; in response to a preset operation triggered for target text data in the text data, determines the audio and video timestamp corresponding to the target text data as the target audio and video timestamp; and, based on the preset operation, processes the audio and video segment corresponding to the target audio and video timestamp in the audio and video to be edited. The audio and video processing method provided by the embodiments of the present disclosure can therefore edit audio and video based on text: because the text has a mapping relationship with the audio and video timestamps, editing the text edits the associated audio and video segments.
- an embodiment of the present disclosure also provides a computer-readable storage medium, where instructions are stored in the computer-readable storage medium, and when the instructions are run on a terminal device, the terminal device implements the audio and video processing method described in the embodiments of the present disclosure.
- the embodiment of the present disclosure also provides a computer program product, the computer program product includes a computer program/instruction, and when the computer program/instruction is executed by a processor, the audio and video processing method described in the embodiment of the present disclosure is implemented.
- an embodiment of the present disclosure also provides an audio and video processing device, as shown in FIG. 8, which may include:
- a processor 801, a memory 802, an input device 803 and an output device 804. The number of processors 801 in the audio and video processing device may be one or more, and one processor is taken as an example in FIG. 8.
- the processor 801 , the memory 802 , the input device 803 and the output device 804 may be connected through a bus or in other ways, wherein connection through a bus is taken as an example in FIG. 8 .
- the memory 802 can be used to store software programs and modules, and the processor 801 executes various functional applications and data processing of the audio and video processing device by running the software programs and modules stored in the memory 802 .
- the memory 802 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required by at least one function, and the like.
- the memory 802 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid-state storage devices.
- the input device 803 can be used to receive input digital or character information, and generate signal input related to user settings and function control of the audio and video processing equipment.
- the processor 801 loads the executable files corresponding to the processes of one or more application programs into the memory 802 according to the following instructions, and runs the application programs stored in the memory 802, so as to implement the various functions of the above-mentioned audio and video processing device.
Claims (15)
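- An audio and video processing method, the method comprising: displaying text data corresponding to audio and video to be edited, wherein the text data has a mapping relationship with audio and video timestamps of the audio and video to be edited; displaying the audio and video to be edited according to a time axis track; in response to a preset operation triggered for target text data in the text data, determining the audio and video timestamp corresponding to the target text data as a target audio and video timestamp; and, based on the preset operation, processing the audio and video segment corresponding to the target audio and video timestamp in the audio and video to be edited.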
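- The method according to claim 1, wherein the method further comprises: displaying a first edit entry for a preset keyword or a preset mute segment; and, in response to a trigger operation for the first edit entry, displaying the preset keyword or the preset mute segment in the text data according to a preset second display style.
- The method according to claim 2, wherein the first edit entry corresponds to a first edit card, and a one-key delete control is provided on the first edit card; and after the displaying, in response to the trigger operation for the first edit entry, of the preset keyword or the preset mute segment in the text data according to the preset second display style, the method further comprises: in response to a trigger operation for the one-key delete control, deleting the preset keyword or the preset mute segment from the text data.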
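- The method according to claim 1, wherein the method further comprises: displaying a voice enhancement control on a second editing card; and, in response to a trigger operation for the voice enhancement control, performing enhancement processing on the human voice in the audio and video to be edited.
- The method according to claim 1, wherein the method further comprises: determining a soundtrack corresponding to the audio and video to be edited based on the music genre of the audio and video to be edited and/or the content of the text data corresponding to the audio and video to be edited; and adding the soundtrack to the audio and video segment to be edited.
- The method according to claim 1, wherein the method further comprises: displaying a loudness equalization control on a third editing card; and, in response to a trigger operation for the loudness equalization control, normalizing the loudness of the volume in the audio and video to be edited.
- The method according to claim 1, wherein the method further comprises: displaying a smart cutout control on a fourth editing card; and, in response to a trigger operation for the smart cutout control, adjusting the music volume and the human-voice volume in the audio and video segment within the initial preset time period of the audio and video to be edited, to obtain a volume-adjusted audio and video segment; wherein the music volume in the volume-adjusted audio and video segment is inversely proportional to the human-voice volume.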
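- The method according to claim 1, wherein the preset operation includes a selection operation, and the processing, based on the preset operation, of the audio and video segment corresponding to the target audio and video timestamp in the audio and video to be edited comprises: displaying, according to a preset first display style, the audio and video segment corresponding to the target audio and video timestamp in the audio and video to be edited.
- The method according to claim 1, wherein the preset operation includes a delete operation, and the processing, based on the preset operation, of the audio and video segment corresponding to the target audio and video timestamp in the audio and video to be edited comprises: based on the delete operation, deleting the audio and video segment corresponding to the target audio and video timestamp in the audio and video to be edited.
- The method according to claim 1, wherein the preset operation includes a modification operation, and the processing, based on the preset operation, of the audio and video segment corresponding to the target audio and video timestamp in the audio and video to be edited comprises: obtaining modified text data corresponding to the modification operation; generating an audio and video segment, as an audio and video segment to be modified, based on the modified text data and the timbre information in the audio and video to be edited; and using the audio and video segment to be modified to replace the audio and video segment corresponding to the target audio and video timestamp in the audio and video to be edited.
- The method according to claim 1, wherein the method further comprises: when an adding operation for first text data in the text data is received, generating a first audio and video segment based on the first text data and the timbre information in the audio and video to be edited; determining a first audio and video timestamp corresponding to the first text data based on position information of the first text data in the text data; and adding the first audio and video segment to the audio and video to be edited based on the first audio and video timestamp.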
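- An audio and video processing apparatus, the apparatus comprising: a first display module, configured to display text data corresponding to audio and video to be edited, wherein the text data has a mapping relationship with audio and video timestamps of the audio and video to be edited; a second display module, configured to display the audio and video to be edited according to a time axis track; a determining module, configured to, in response to a preset operation triggered for target text data in the text data, determine the audio and video timestamp corresponding to the target text data as a target audio and video timestamp; and an editing module, configured to process, based on the preset operation, the audio and video segment corresponding to the target audio and video timestamp in the audio and video to be edited.
- A computer-readable storage medium, wherein instructions are stored in the computer-readable storage medium, and when the instructions are run on a terminal device, the terminal device is caused to implement the method according to any one of claims 1-11.
- A device, comprising: a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the method according to any one of claims 1-11.
- A computer program product, wherein the computer program product includes a computer program/instructions, and the computer program/instructions, when executed by a processor, implement the method according to any one of claims 1-11.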
Priority Applications (4)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| KR1020237044829A KR102919002B1 (ko) | 2021-09-22 | 2022-09-02 | Audio/video processing method and apparatus, device, and storage medium |
| JP2023578889A JP7764507B2 (ja) | 2021-09-22 | 2022-09-02 | Audio/video processing method, apparatus, device, and storage medium |
| EP22871780.7A EP4344225A4 (en) | 2021-09-22 | 2022-09-02 | AUDIO/VIDEO PROCESSING METHOD AND APPARATUS, DEVICE AND STORAGE MEDIUM |
| US18/395,118 US20240127860A1 (en) | 2021-09-22 | 2023-12-22 | Audio/video processing method and apparatus, device, and storage medium |
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202111109213.4 | 2021-09-22 | | |
| CN202111109213.4A CN115914734A (zh) | 2021-09-22 | 2021-09-22 | Audio and video processing method, apparatus, device, and storage medium |
Related Child Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US18/395,118 Continuation US20240127860A1 (en) | 2021-09-22 | 2023-12-22 | Audio/video processing method and apparatus, device, and storage medium |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2023045730A1 (zh) | 2023-03-30 |
Family
ID=85719279
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/CN2022/116650 Ceased WO2023045730A1 (zh) | 2021-09-22 | 2022-09-02 | Audio and video processing method, apparatus, device, and storage medium |
Country Status (6)
| Country | Link |
|---|---|
| US (1) | US20240127860A1 (zh) |
| EP (1) | EP4344225A4 (zh) |
| JP (1) | JP7764507B2 (zh) |
| KR (1) | KR102919002B1 (zh) |
| CN (1) | CN115914734A (zh) |
| WO (1) | WO2023045730A1 (zh) |
Families Citing this family (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
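| CN114915836A (zh) * | 2022-05-06 | 2022-08-16 | 北京字节跳动网络技术有限公司 | Method, apparatus, device and storage medium for editing audio |
| CN116866670A (zh) * | 2023-05-15 | 2023-10-10 | 维沃移动通信有限公司 | Video editing method, apparatus, electronic device and storage medium |
| CN119545081B (zh) * | 2023-08-31 | 2026-01-13 | 北京字跳网络技术有限公司 | Video processing method, apparatus, electronic device and storage medium |
| CN120786154B (zh) * | 2025-09-10 | 2025-12-26 | 杭州云智创心网络有限公司 | Video editing method and system based on multimodal large-model collaboration |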
Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JP2005260283A (ja) * | 2004-02-13 | 2005-09-22 | Matsushita Electric Ind Co Ltd | Network playback method for AV content |
| CN103442300A (zh) * | 2013-08-27 | 2013-12-11 | Tcl集团股份有限公司 | Audio and video jump playback method and device |
| CN105744346A (zh) * | 2014-12-12 | 2016-07-06 | 深圳Tcl数字技术有限公司 | Subtitle switching method and device |
| CN108259965A (zh) * | 2018-03-31 | 2018-07-06 | 湖南广播电视台广播传媒中心 | Video editing method and editing system |
| CN112231498A (zh) * | 2020-09-29 | 2021-01-15 | 北京字跳网络技术有限公司 | Interactive information processing method, apparatus, device and medium |
Family Cites Families (21)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JPH10191248A (ja) * | 1996-10-22 | 1998-07-21 | Hitachi Denshi Ltd | Video editing method and recording medium recording the procedure of the method |
| JP2004287193A (ja) * | 2003-03-24 | 2004-10-14 | Equos Research Co Ltd | Data creation device, data creation program, and in-vehicle device |
| KR20060130692A (ko) * | 2004-03-31 | 2006-12-19 | 마쯔시다덴기산교 가부시키가이샤 | Music data editing apparatus and music data editing method |
| JP2006227363A (ja) * | 2005-02-18 | 2006-08-31 | Nhk Computer Service:Kk | Broadcast speech dictionary creation device and broadcast speech dictionary creation program |
| JP2009507453A (ja) * | 2005-09-07 | 2009-02-19 | ポータルビデオ・インコーポレーテッド | Time estimation of text position in video editing method and apparatus |
| JP4741406B2 (ja) * | 2006-04-25 | 2011-08-03 | 日本放送協会 | Nonlinear editing device and program therefor |
| US9870796B2 (en) * | 2007-05-25 | 2018-01-16 | Tigerfish | Editing video using a corresponding synchronized written transcript by selection from a text viewer |
| CN106598996B (zh) * | 2015-10-19 | 2021-01-01 | 广州酷狗计算机科技有限公司 | Multimedia poster generation method and device |
| EP3776410A4 (en) * | 2018-04-06 | 2021-12-22 | Korn Ferry | INTERVIEW TRAINING SYSTEM AND PROCESS WITH SYNCHRONIZED FEEDBACK |
| US12231745B1 (en) * | 2019-01-23 | 2025-02-18 | Amazon Technologies, Inc. | Automated video summary generation using textual quotes |
| CN110401878A (zh) * | 2019-07-08 | 2019-11-01 | 天脉聚源(杭州)传媒科技有限公司 | Video editing method, system and storage medium |
| CN112243151A (zh) * | 2019-07-16 | 2021-01-19 | 腾讯科技(深圳)有限公司 | Audio playback control method, apparatus, device and medium |
| US20210043174A1 (en) * | 2019-08-09 | 2021-02-11 | Auxbus, Inc. | System and method for semi-automated guided audio production and distribution |
| CN112752047A (zh) * | 2019-10-30 | 2021-05-04 | 北京小米移动软件有限公司 | Video recording method, apparatus, device and readable storage medium |
| KR102177768B1 (ko) * | 2020-01-23 | 2020-11-11 | 장형순 | System for providing a customized video production service using cloud-based voice combination |
| CN112822542B (zh) * | 2020-08-27 | 2026-02-17 | 腾讯科技(深圳)有限公司 | Video synthesis method, apparatus, computer device and storage medium |
| CN112102841B (zh) * | 2020-09-14 | 2024-08-30 | 北京搜狗科技发展有限公司 | Audio editing method and apparatus, and apparatus for audio editing |
| CN113365133B (zh) * | 2021-06-02 | 2022-10-18 | 北京字跳网络技术有限公司 | Video sharing method, apparatus, device and medium |
| US12119027B2 (en) * | 2021-08-27 | 2024-10-15 | Logitech Europe S.A. | Method and apparatus for simultaneous video editing |
| US11770590B1 (en) * | 2022-04-27 | 2023-09-26 | VoyagerX, Inc. | Providing subtitle for video content in spoken language |
| TWI892389B (zh) * | 2023-12-27 | 2025-08-01 | 瑞昱半導體股份有限公司 | Method and processing circuit for performing wake-up control on a voice-controlled device by detecting voice features of user-defined words |
2021
- 2021-09-22 CN CN202111109213.4A patent/CN115914734A/zh active Pending
2022
- 2022-09-02 WO PCT/CN2022/116650 patent/WO2023045730A1/zh not_active Ceased
- 2022-09-02 EP EP22871780.7A patent/EP4344225A4/en active Pending
- 2022-09-02 KR KR1020237044829A patent/KR102919002B1/ko active Active
- 2022-09-02 JP JP2023578889A patent/JP7764507B2/ja active Active
2023
- 2023-12-22 US US18/395,118 patent/US20240127860A1/en active Pending
Non-Patent Citations (1)
| Title |
|---|
| See also references of EP4344225A4 * |
Also Published As
| Publication number | Publication date |
|---|---|
| JP2024523464A (ja) | 2024-06-28 |
| US20240127860A1 (en) | 2024-04-18 |
| JP7764507B2 (ja) | 2025-11-05 |
| CN115914734A (zh) | 2023-04-04 |
| KR102919002B1 (ko) | 2026-01-28 |
| KR20240013879A (ko) | 2024-01-30 |
| EP4344225A4 (en) | 2024-10-02 |
| EP4344225A1 (en) | 2024-03-27 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 22871780; Country of ref document: EP; Kind code of ref document: A1 |
| | WWE | Wipo information: entry into national phase | Ref document number: 202327087401; Country of ref document: IN. Ref document number: 11202309775Y; Country of ref document: SG |
| | ENP | Entry into the national phase | Ref document number: 2023578889; Country of ref document: JP; Kind code of ref document: A |
| | WWE | Wipo information: entry into national phase | Ref document number: 2022871780; Country of ref document: EP |
| | ENP | Entry into the national phase | Ref document number: 20237044829; Country of ref document: KR; Kind code of ref document: A |
| | WWE | Wipo information: entry into national phase | Ref document number: 1020237044829; Country of ref document: KR |
| | REG | Reference to national code | Ref country code: BR; Ref legal event code: B01A; Ref document number: 112023027241; Country of ref document: BR |
| | ENP | Entry into the national phase | Ref document number: 2022871780; Country of ref document: EP; Effective date: 20231222 |
| | WWE | Wipo information: entry into national phase | Ref document number: 11202309775Y; Country of ref document: SG |
| | ENP | Entry into the national phase | Ref document number: 112023027241; Country of ref document: BR; Kind code of ref document: A2; Effective date: 20231222 |
| | NENP | Non-entry into the national phase | Ref country code: DE |