WO2023045716A1 - Video processing method, device, medium and program product - Google Patents

Video processing method, device, medium and program product

Info

Publication number
WO2023045716A1
Authority
WO
WIPO (PCT)
Prior art keywords
video
text
segment
preset
image sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/CN2022/115722
Other languages
English (en)
French (fr)
Inventor
孟凡博
刘金锁
朱伟基
张永哲
丰添
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sogou Technology Development Co Ltd
Original Assignee
Beijing Sogou Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sogou Technology Development Co Ltd
Priority to JP2023554305A (JP7697027B2, ja)
Priority to EP22871767.4A (EP4404574A4, en)
Publication of WO2023045716A1 (zh)
Priority to US18/365,296 (US20240022772A1, en)
Legal status: Ceased

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/439 Processing of audio elementary streams
    • H04N21/4394 Processing of audio elementary streams involving operations for analysing the audio stream, e.g. detecting features or characteristics in audio streams
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20 Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23 Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/234 Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
    • H04N21/23424 Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs involving splicing one content stream with another content stream, e.g. for inserting or substituting an advertisement
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T13/00 Animation
    • G06T13/20 Three-dimensional [3D] animation
    • G06T13/40 Three-dimensional [3D] animation of characters, e.g. humans, animals or virtual beings
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00 Image enhancement or restoration
    • G06T5/70 Denoising; Smoothing
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/10 Segmentation; Edge detection
    • G06T7/194 Segmentation; Edge detection involving foreground-background segmentation
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/20 Scenes; Scene-specific elements in augmented reality scenes
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161 Detection; Localisation; Normalisation
    • G06V40/165 Detection; Localisation; Normalisation using facial parts and geometric relationships
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174 Facial expression recognition
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/57 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for processing of video signals
    • G PHYSICS
    • G11 INFORMATION STORAGE
    • G11B INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
    • G11B27/00 Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
    • G11B27/02 Editing, e.g. varying the order of information signals recorded on, or reproduced from, record carriers
    • G11B27/031 Electronic editing of digitised analogue information signals, e.g. audio or video signals
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/439 Processing of audio elementary streams
    • H04N21/4398 Processing of audio elementary streams involving reformatting operations of audio signals
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/44016 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving splicing one content stream with another content stream, e.g. for substituting a video clip
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80 Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/83 Generation or processing of protective or descriptive data associated with content; Content structuring
    • H04N21/845 Structuring of content, e.g. decomposing content into time segments
    • H04N21/8456 Structuring of content, e.g. decomposing content into time segments by decomposing the content in the time domain, e.g. in time segments
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N5/00 Details of television systems
    • H04N5/222 Studio circuitry; Studio devices; Studio equipment
    • H04N5/262 Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects; Cameras specially adapted for the electronic generation of special effects
    • H04N5/272 Means for inserting a foreground image in a background image, i.e. inlay, outlay
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems

Definitions

  • the present application relates to the technical field of communications, and in particular to a video processing method, device, medium and program product.
  • virtual objects can be widely used in application scenarios such as broadcasting scenarios, teaching scenarios, medical scenarios, and customer service scenarios.
  • virtual objects usually need to express text, and correspondingly, a video corresponding to the virtual object can be generated and played.
  • the video can characterize the process of virtual objects expressing text.
  • the video generation process generally includes: a speech generation link and an image sequence generation link.
  • speech generation usually adopts speech synthesis technology.
  • the link of image sequence generation usually adopts image processing technology.
  • the present application discloses a video processing method executed in an electronic device, the method comprising:
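  • acquiring a first video segment, wherein the first video segment corresponds to the template text in the first text of the video to be generated, the first video segment includes a video sub-segment with a speech pause, and the position of the video sub-segment corresponds to the boundary position between the template text and the variable text to be processed in the first text;
  • generating a second video segment corresponding to the variable text to be processed; and
  • splicing the first video segment and the second video segment to obtain a video corresponding to the first text.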
  • the present application discloses a video processing device, comprising:
  • an acquisition module configured to acquire a first video segment, wherein the first video segment corresponds to the template text in the first text of the video to be generated, the first video segment includes a video sub-segment with a speech pause, and the position of the video sub-segment corresponds to the boundary position between the template text and the variable text to be processed in the first text;
  • a generating module configured to generate a second video segment corresponding to the variable text to be processed; and
  • a splicing module configured to splice the first video segment and the second video segment to obtain a video corresponding to the first text.
  • the present application discloses a device for video processing, including a memory and one or more programs, wherein the one or more programs are stored in the memory and, when executed by one or more processors, implement the steps of the aforementioned method.
  • the embodiment of the present application discloses one or more machine-readable media on which instructions are stored; when the instructions are executed by one or more processors, a device is caused to execute one or more of the aforementioned methods.
  • the embodiment of the present application discloses a computer program product that includes computer instructions stored in a computer-readable storage medium; when a processor executes the computer instructions, the processor is caused to execute the video processing method of the embodiments of the present application.
  • FIG. 1A is a schematic diagram of an application scenario according to an embodiment of the present application.
  • FIG. 1B is a flowchart of a video processing method according to an embodiment of the present application.
  • FIG. 2 is a flowchart of a video processing method according to an embodiment of the present application.
  • FIG. 3 is a structural block diagram of a video processing device according to an embodiment of the present application.
  • FIG. 4 is a structural block diagram of a device for video processing according to an embodiment of the present application.
  • FIG. 5 is a structural block diagram of a server in some embodiments of the present application.
  • the virtual object is a vivid and natural virtual object close to the real object obtained through object modeling, motion capture and other technologies.
  • the virtual object can possess the ability of cognition, understanding, or expression.
  • the virtual object specifically includes: a virtual character, or a virtual animal, or a two-dimensional cartoon object, or a three-dimensional cartoon object.
  • virtual objects can replace, for example, media workers for news broadcasting or game commentary.
  • in a medical scenario, virtual objects can replace, for example, medical workers to provide medical guidance.
  • virtual objects can express text.
  • a video corresponding to text and virtual objects can be generated.
  • the video may specifically include: a voice sequence corresponding to the text, and an image frame sequence corresponding to the voice sequence.
  • the text of the video to be generated specifically includes: template text and variable text.
  • the template text is relatively fixed, and the variable text usually changes according to preset factors such as user input.
  • variable text can be determined from user input. Taking the medical scene as an example, the corresponding variable text can be determined according to the name of the disease contained in the user input.
  • the fields corresponding to the variable text specifically include: a disease name field, a food type field, a food quantity field, etc. These fields can be determined according to the disease name included in the user input.
  • variable text in the text may be determined according to actual application requirements, and the embodiment of the present application does not limit the specific manner of determining the variable text.
  • to make the video quality meet requirements, the related technology usually regenerates a complete video for the complete changed text whenever the variable text changes. However, generating a complete video for the complete changed text usually takes a lot of time, resulting in low video processing efficiency.
  • an embodiment of the present application provides a video processing solution, which specifically includes: acquiring a first video segment, where the first video segment corresponds to the template text in the first text of the video to be generated, the first video segment includes a video sub-segment with a speech pause, the position of the video sub-segment corresponds to the boundary position between the template text and the variable text to be processed in the first text, and the first text includes the template text and the variable text to be processed; generating a second video segment corresponding to the variable text to be processed; and splicing the first video segment and the second video segment to obtain a video corresponding to the first text.
  • the first video segment corresponding to the template text is spliced with the second video segment corresponding to the variable text to be processed.
  • the first video clip may be a pre-saved video clip.
  • a second video clip corresponding to the variable text to be processed may be generated during video processing. Since the variable text to be processed is shorter than the complete text, the embodiment of the present application can shorten the length of the video that has to be generated and the corresponding time cost, thus improving video processing efficiency.
  • the first video segment in the embodiment of the present application includes a video sub-segment in which speech is paused.
  • the voice pause refers to the cessation of voice, for example, the virtual object does not speak.
  • the position of the video sub-segment corresponds to the boundary position between the template text and the variable text to be processed in the first text.
  • the video sub-segments of the first video segment where the voice is paused help to overcome the problem of jumping or jittering at the splicing position, so the continuity at the splicing position can be improved.
  • FIG. 1A shows a schematic diagram of an application scenario according to an embodiment of the present application.
  • the client and the server are located in a wired or wireless network, and the client and the server perform data interaction through the wired or wireless network.
  • the client and the server may be collectively referred to as an electronic device.
  • clients include but are not limited to: smartphones, tablet computers, e-book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop computers, in-vehicle computers, desktop computers, set-top boxes, smart TVs, wearable devices, etc.
  • the server is, for example, a standalone hardware server, a virtual server, or a server cluster.
  • the client refers to a program that corresponds to the server and provides local services for users.
  • the client in this embodiment of the application may receive user input and provide a video corresponding to the user input.
  • the video can be generated by the client or the server, and this embodiment of the present application does not limit the specific generation subject of the video.
  • the client may receive user input and upload the user input to the server, so that the server generates a video corresponding to the user input.
  • the server can determine the variable text to be processed according to user input, generate a second video segment corresponding to the variable text to be processed, and splice the pre-saved first video segment with the second video segment to obtain the video corresponding to the template text and the variable text to be processed.
  • FIG. 1B shows a flow chart of a video processing method of the present application, which may specifically include the following steps.
  • the video processing method can be executed by electronic equipment, for example.
  • Step 101: obtain the first video segment, where the first video segment corresponds to the template text in the first text of the video to be generated, the first video segment includes a video sub-segment with a speech pause, and the position of the video sub-segment corresponds to the boundary position between the template text and the variable text to be processed in the first text.
  • Step 102: generate a second video segment corresponding to the variable text to be processed.
  • Step 103: splice the first video segment and the second video segment to obtain a video corresponding to the first text.
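  • The flow of steps 101 to 103 can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the application's implementation: the `Clip` type, the `FIRST_SEGMENT_CACHE` dictionary, the `generate_second_segment` function, and the 300 ms-per-character duration are all hypothetical stand-ins for real speech synthesis and image generation.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Clip:
    """Hypothetical stand-in for a video segment (voice plus image sequence)."""
    label: str
    duration_ms: int


# Hypothetical store of pre-generated first video segments, keyed by template text.
# Each stored clip is assumed to already end with the speech-pause sub-segment.
FIRST_SEGMENT_CACHE: Dict[str, Clip] = {
    "about": Clip(label="template:about", duration_ms=900),
}


def generate_second_segment(variable_text: str) -> Clip:
    """Placeholder for the expensive generation step (assumed 300 ms per character)."""
    return Clip(label=f"variable:{variable_text}", duration_ms=300 * len(variable_text))


def build_video(template_text: str, variable_text: str) -> List[Clip]:
    first = FIRST_SEGMENT_CACHE[template_text]        # step 101: obtain the saved segment
    second = generate_second_segment(variable_text)   # step 102: generate only the variable part
    return [first, second]                            # step 103: splice in playback order


video = build_video("about", "diabetes")
total_ms = sum(clip.duration_ms for clip in video)
```

  • In this sketch only the short variable part is generated per request, while the template part is looked up from storage.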
  • the first video segment corresponding to the template text may be generated and saved in advance.
  • the first video segment includes a video sub-segment with a speech pause.
  • the voice pause means that the voice is stopped or the voice is not output temporarily.
  • a video sub-segment with a pause in speech may be considered a video sub-segment without speech.
  • the position of the video sub-segment corresponds to the boundary position between the template text and the variable text to be processed in the first text, and the video sub-segment can improve the continuity at the splicing position.
  • the structure of the text in the embodiment of the present application specifically includes: template text and variable text. Boundaries can be used to divide adjacent template text and variable text.
  • the process of determining the first video segment may include: generating a preset video according to the template text, the preset variable text, and the pause information at the corresponding boundary position; and intercepting, from the preset video, the first video segment corresponding to the template text.
  • the preset variable text may be any variable text, or the preset variable text may be any instance of the variable text.
  • a preset video can be generated according to the preset complete text corresponding to the template text and the preset variable text, wherein the pause information at the boundary position can be considered during the generation of the preset video.
  • the pause information indicates, for example, a speech pause of a predetermined duration.
  • the preset video may include: a preset voice corresponding to the voice part and a preset image sequence corresponding to the image part.
  • TTS (Text To Speech) technology can be used to convert the preset complete text into the preset voice.
  • the preset voice may be represented in the form of a waveform.
  • the conversion of the preset complete text into the preset voice in the embodiment of the present application specifically includes: a language analysis link and an acoustic system link.
  • the language analysis link is used to generate corresponding linguistic information according to the preset complete text and its corresponding pause information;
  • the acoustic system link mainly generates the corresponding preset voice from the linguistic information provided by the language analysis link, thereby realizing the vocalization function.
  • the processing of the language analysis link may specifically include: text structure and language judgment, text standardization, text-to-phoneme conversion, and prosody prediction.
  • Linguistic information may be the output of the language analysis link.
  • text structure and language judgment are used to judge the language of the preset complete text, such as Chinese, English, Tibetan, Uighur, etc., to divide the preset complete text into sentences according to the grammatical rules of the corresponding language, and to pass the segmented sentences to the subsequent processing module.
  • Text standardization is used to standardize the segmented sentences according to the set rules.
  • Text-to-phoneme conversion is used to determine the phoneme features corresponding to each sentence.
  • prosody prediction can be used to determine where a sentence needs a pause, how long the pause is, which word or phrase needs to be stressed, which word needs to be read lightly, and so on, thereby giving the voice its rises, falls, and cadence.
  • the prosody prediction technology may be used first to determine the prosody prediction result, and then the prosody prediction result may be updated according to the pause information.
  • the pause information can be: add pause information of a preset duration between the template text "about" and the variable text "<diabetes>". The prosody prediction result is then updated accordingly. Specifically, preset pause information is added between the phoneme features "guan", "yu" of the template text and the phoneme features "tang", "niao", "bing" of the variable text "<diabetes>", so the updated prosody prediction result can be: "guan", "yu", "pause for N milliseconds", "tang", "niao", "bing", etc., where N may be a natural number greater than 0, and the value of N may be determined by those skilled in the art according to actual application requirements.
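  • The update of the prosody prediction result with pause information can be illustrated with a minimal Python sketch. The token representation (`("phoneme", ...)` and `("pause", ...)` tuples), the function name `insert_boundary_pause`, and the value N = 200 are assumptions chosen for illustration only.

```python
from typing import List, Sequence, Tuple, Union

# A token is either ("phoneme", "guan") or ("pause", 200); this layout is hypothetical.
Token = Tuple[str, Union[str, int]]


def insert_boundary_pause(
    template_phonemes: Sequence[str],
    variable_phonemes: Sequence[str],
    pause_ms: int,
) -> List[Token]:
    """Place a pause of pause_ms milliseconds at the template/variable boundary."""
    if pause_ms <= 0:
        raise ValueError("pause_ms must be a natural number greater than 0")
    tokens: List[Token] = [("phoneme", p) for p in template_phonemes]
    tokens.append(("pause", pause_ms))  # the boundary between template and variable text
    tokens.extend(("phoneme", p) for p in variable_phonemes)
    return tokens


updated = insert_boundary_pause(["guan", "yu"], ["tang", "niao", "bing"], pause_ms=200)
```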
  • the acoustic system link can obtain preset voices that meet the needs according to the speech synthesis parameters.
  • the speech synthesis parameters may include: timbre parameters.
  • the timbre parameters can refer to the distinctive waveform characteristics that the frequencies of different sounds exhibit. Usually, different sounding bodies have different timbres. Therefore, according to the timbre parameters, a speech sequence matching the timbre of the target sounding body can be obtained.
  • the target sounding body can be specified by the user; for example, the target sounding body can be a designated medical worker or the like. In practical applications, the timbre parameters of the target sounding body can be obtained from audio of a preset length produced by the target sounding body.
  • the preset image sequence corresponding to the image part can be obtained on the basis of the virtual object image.
  • the embodiment of the present application can assign sub-state features to the virtual object image to obtain the preset image sequence.
  • the virtual object image can be specified by the user, for example, the virtual object image can be an image of a well-known person (such as a presenter).
  • Expression, that is, the expressing of emotion and affection, can refer to thoughts and feelings shown on the face.
  • Expression features are usually for the entire face.
  • the lip features can be specific to the lips, and are related to the text content, voice, and pronunciation of the text, so the naturalness of the expression corresponding to the preset image sequence can be improved.
  • Body features can convey the thoughts of the character through the coordinated movement of the head, eyes, neck, hands, elbows, arms, body, hips, feet and other body parts, helping to express ideas vividly.
  • Body features can include: turning the head, shrugging the shoulders, gestures, etc., which can enrich the expression conveyed by the image sequence. For example, at least one arm hangs down naturally when speaking, at least one arm rests naturally on the abdomen when not speaking, etc.
  • in the process of generating the image part of the preset video, the image parameters can be determined according to the preset complete text and the pause information, where the image parameters represent the state characteristics of the virtual object; the preset image sequence corresponding to the image part can then be generated according to the image parameters.
  • the image parameters may include: pause image parameters, and the pause image parameters may represent pause state features corresponding to the pause information.
  • the pause image parameter represents the state characteristics of the virtual object in terms of body, expression, etc. when the virtual object stops speaking.
  • the preset image sequence may include: an image sequence corresponding to the pause state feature.
  • the characteristics of the pause state may include: a neutral expression, a closed lip state, and a drooping state of an arm.
  • the preset voice and the preset image sequence can be fused to obtain a corresponding preset video.
  • the first video segment corresponding to the template text can be intercepted from the preset video. Specifically, the first video segment may be intercepted according to the start position and end position of the preset variable text in the preset video.
  • because the pause information at the boundary position is used when the preset video is generated, the first video segment before T1 contains the pause (that is, the first video segment includes a video sub-segment with a speech pause), so the continuity at the splicing position in the subsequent splicing process can be improved.
  • the first video clips corresponding to multiple template texts can be respectively extracted from the preset video.
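  • Intercepting the template portions from the preset video, given the start and end positions of the preset variable text, might look like the sketch below. The function name `template_ranges` and the use of millisecond offsets are assumptions for illustration; a real system would cut the actual audio and image streams at these positions.

```python
from typing import List, Tuple


def template_ranges(total_ms: int, var_start_ms: int, var_end_ms: int) -> List[Tuple[int, int]]:
    """Return the [start, end) time ranges of the preset video that belong to template text.

    var_start_ms and var_end_ms are the start and end positions of the preset
    variable text within the preset video.
    """
    if not 0 <= var_start_ms <= var_end_ms <= total_ms:
        raise ValueError("variable text positions must lie inside the preset video")
    ranges: List[Tuple[int, int]] = []
    if var_start_ms > 0:
        ranges.append((0, var_start_ms))        # template text before the variable text
    if var_end_ms < total_ms:
        ranges.append((var_end_ms, total_ms))   # template text after the variable text
    return ranges


# Template text on both sides of the variable text yields two first video segments.
both_sides = template_ranges(total_ms=5000, var_start_ms=1200, var_end_ms=2600)
# Variable text at the very end yields a single first video segment.
leading_only = template_ranges(total_ms=3000, var_start_ms=1200, var_end_ms=3000)
```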
  • not only the voice of the video sub-segment in the first video segment is paused, but also the virtual object in the image of the video sub-segment is in a state of not speaking.
  • the video sub-segment is a sub-segment obtained after pause processing.
  • Pause processing for video sub-segments including:
  • a method of obtaining the first video segment may include: generating the first video according to the template text and the preset variable text; intercepting the first video segment corresponding to the template text from the first video; and performing pause processing on the first video segment at the boundary position.
  • the speech signal sub-segment at the boundary position of the video clip and the mute signal can be weighted to realize the pause processing of the speech part.
  • the image subsequence at the boundary position of the video segment and the image sequence corresponding to the target state characteristics of the pause information can be weighted to realize the pause processing of the image part.
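  • The weighting of the speech signal sub-segment with a mute signal can be sketched as a fade toward silence. The linear weights, the function name `fade_to_silence`, and the assumption that the boundary lies at the end of the segment are illustrative choices; the text above only states that the two signals are weighted.

```python
from typing import List, Sequence


def fade_to_silence(samples: Sequence[float], fade_len: int) -> List[float]:
    """Blend the last fade_len samples with a mute (all-zero) signal.

    The speech weight falls linearly from just below 1 to exactly 0 at the
    boundary, and the mute signal receives the complementary weight.
    """
    n = len(samples)
    fade_len = max(0, min(fade_len, n))
    out = list(samples)
    start = n - fade_len
    for i in range(fade_len):
        speech_weight = 1.0 - (i + 1) / fade_len
        mute_sample = 0.0
        out[start + i] = speech_weight * samples[start + i] + (1.0 - speech_weight) * mute_sample
    return out


faded = fade_to_silence([1.0, 1.0, 1.0, 1.0, 1.0, 1.0], fade_len=4)
```

  • The same weighting idea could be applied to the image subsequence by blending image parameters toward those of the pause state, although how images are blended is not specified here.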
  • the first video clip After the first video clip is obtained, the first video clip can be saved, so that when the variable text changes, the second video clip corresponding to the first video clip and the changed variable text (hereinafter referred to as the variable text to be processed) can be saved. Video clips are stitched together.
  • step 102 the variable text to be processed can be obtained according to user input. It can be understood that the embodiment of the present application does not limit the specific manner of determining the variable text to be processed.
  • generating the second video segment corresponding to the variable text to be processed specifically includes: determining the corresponding voice parameters and image parameters for the sentence where the variable text to be processed is located in the first text, wherein the image parameters represent the state characteristics of the virtual object that will appear in the video corresponding to the first text, and the voice parameters characterize the parameters corresponding to speech synthesis; extracting, from the voice parameters and image parameters, the target voice parameters and target image parameters corresponding to the variable text to be processed; and generating, according to the target voice parameters and target image parameters, a second video clip corresponding to the variable text to be processed.
  • Technical solution 1 first determines the corresponding speech parameters and image parameters based on the sentence where the variable text to be processed is located, and then extracts the target speech parameter and target image parameter corresponding to the variable text to be processed from the speech parameters and image parameters.
  • a sentence is a grammatically self-contained unit consisting of a word or a syntactically related group of words expressing a claim, question, command, wish or exclamation.
  • the sentence usually contains both the template text and the variable text to be processed. Because the voice parameters and image parameters corresponding to the sentence have a certain continuity, the target voice parameters and target image parameters extracted from them for the variable text to be processed are continuous, to a certain extent, with the voice parameters and image parameters corresponding to the template text in the sentence. On this basis, the continuity between the second video segment corresponding to the variable text to be processed and the first video segment corresponding to the template text in the sentence can be improved, thereby improving the continuity at the splicing position.
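  • The extraction step of technical solution 1 can be sketched as a slice over sentence-level parameters. The sketch assumes, purely for illustration, that the sentence-level parameters hold one value per frame and that a per-character frame count is available; a real system would more likely align on phonemes. The name `extract_target_params` is hypothetical.

```python
def extract_target_params(sentence, variable_text, frames_per_char, sentence_params):
    """Slice the frames belonging to `variable_text` out of sentence-level parameters.

    Assumptions for illustration: `frames_per_char` has one entry per character
    of `sentence`, and `sentence_params` holds one parameter value per frame.
    """
    if len(frames_per_char) != len(sentence):
        raise ValueError("need one frame count per character")
    start_char = sentence.find(variable_text)
    if start_char < 0:
        raise ValueError("variable text not found in sentence")
    end_char = start_char + len(variable_text)
    start_frame = sum(frames_per_char[:start_char])
    end_frame = sum(frames_per_char[:end_char])
    return sentence_params[start_frame:end_frame]


sentence = "about tea"
params = list(range(18))  # stand-in for per-frame speech or image parameters
target = extract_target_params(sentence, "tea", [2] * len(sentence), params)
```

  • Because the parameters are computed for the whole sentence before slicing, the values at the edges of the slice already reflect the neighbouring template text.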
  • the speech parameters may represent parameters corresponding to speech synthesis.
  • Speech parameters may include: linguistic features and/or acoustic features.
  • Linguistic features may include: phoneme features.
  • a phoneme is the smallest unit of speech divided according to the natural properties of speech. It is analyzed according to the pronunciation actions in a syllable, and an action constitutes a phoneme.
  • Phonemes can include: vowels and consonants.
  • Acoustic features can characterize the characteristics of speech from the perspective of vocalization.
  • Acoustic features may include, but are not limited to, the following:
  • Prosodic features specifically including duration-related features, fundamental frequency-related features, energy-related features, etc.
  • Spectrum-based correlation analysis features which are the embodiment of the correlation between vocal tract shape changes and vocalization movements.
  • spectrum-based correlation features mainly include: Linear Prediction Cepstral Coefficients (LPCC), Mel Frequency Cepstral Coefficients (MFCC), and so on.
  • speech synthesis may be performed on the variable text to be processed according to the target speech parameters, so as to convert the variable text to be processed into the target speech.
  • the image parameters may be parameters corresponding to the generation of the image sequence.
  • the image parameters may be used to determine the state characteristics corresponding to the virtual object, or the image parameters may include: the state characteristics corresponding to the virtual object.
  • image parameters may include lip features.
  • state features corresponding to target image parameters may be assigned to the virtual object image to obtain a target image sequence.
  • the target voice and the target image sequence are fused to obtain the second video segment.
  • the second video clip corresponding to the variable text to be processed is generated, which specifically includes: smoothing the target image parameters corresponding to the variable text to be processed according to the preset image parameters of the preset variable text at the boundary position, so as to improve the The continuity between the target image parameter and the image parameter of the template text at the boundary position; according to the smoothed target image parameter, generate a second video segment corresponding to the variable text to be processed.
  • Technical solution 2 performs smoothing processing on the target image parameters corresponding to the variable text to be processed according to the preset image parameters of the preset variable text at the boundary position. Since the preset image parameters of the preset variable text at the boundary position and the image parameters of the template text at the boundary position have a certain continuity, the above smoothing process can improve the continuity at the boundary position between the smoothed target image parameters and the image parameters of the template text; on this basis, the continuity between the second video segment corresponding to the variable text to be processed and the first video segment corresponding to the template text in the sentence can be improved, which in turn improves the continuity at the splicing position.
  • a window function such as a Hanning window may be used to perform smoothing processing on target image parameters corresponding to the variable text to be processed according to preset image parameters. It can be understood that the embodiment of the present application does not limit the specific smoothing process.
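  • One possible form of such window-based smoothing is sketched below. It is an assumption-laden illustration, since the application does not limit the smoothing process: the falling half of a Hann window is used as the weight of the preset image parameters over the first `k` frames, and the name `smooth_with_preset` is hypothetical. The end boundary could be handled with the mirrored window.

```python
import math


def smooth_with_preset(target, preset, k):
    """Cross-fade the first k target values from the preset values.

    The preset weight follows the falling half of a Hann window: it is 1.0 at
    the boundary frame (i = 0) and decays toward 0 over k frames.
    """
    if k <= 0 or k > len(target) or k > len(preset):
        raise ValueError("k must fit inside both sequences")
    out = list(target)
    for i in range(k):
        w = 0.5 * (1.0 + math.cos(math.pi * i / k))  # weight of the preset value
        out[i] = w * preset[i] + (1.0 - w) * target[i]
    return out


# A lip feature that would otherwise jump from 0.0 (preset) to 1.0 (target).
smoothed = smooth_with_preset([1.0, 1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 0.0], k=2)
```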
  • in the process of generating the image part of the preset video, the embodiment of the present application can determine the image parameters according to the preset complete text and the pause information, extract from these image parameters the preset image parameters of the preset variable text at the boundary position, and save the preset image parameters.
  • the image sequence corresponding to the video includes: a background image sequence and a moving image sequence. In this case, generating a second video segment corresponding to the variable text to be processed specifically includes: generating a target moving image sequence corresponding to the variable text to be processed; determining, according to a preset background image sequence, a target background image sequence corresponding to the variable text to be processed; and fusing the target moving image sequence and the target background image sequence to obtain a second video segment corresponding to the variable text to be processed.
  • the image sequence corresponding to the video can be decomposed into two parts.
  • the first part is: a moving image sequence, which can be used to represent the moving part of the virtual object when it is expressed, usually corresponding to preset parts such as lips, eyes, and arms.
  • the second part is: the background image sequence, which can be used to characterize the relatively static part of the virtual object when it is expressed, usually corresponding to parts other than the preset parts.
  • the background image sequence may be obtained from a preset.
  • a preset background image sequence with a preset duration may be preset, and the preset background image sequence may be arranged cyclically in the image sequence (also referred to as cyclic appearance).
  • a moving image sequence can be generated according to target image parameters corresponding to the variable text to be processed.
  • the moving image sequence and the background image sequence can be fused to obtain an image sequence.
  • a moving image sequence can be pasted over a background image sequence to obtain an image sequence.
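  • The pasting step can be sketched as a frame-by-frame overlay. The sketch assumes images are plain 2D lists of pixel values and that the moving part occupies a fixed rectangle at (`top`, `left`); both the representation and the names `paste` and `fuse_sequences` are illustrative assumptions.

```python
def paste(background, patch, top, left):
    """Return a copy of `background` with `patch` pasted at row `top`, column `left`."""
    out = [row[:] for row in background]
    for r, patch_row in enumerate(patch):
        for c, pixel in enumerate(patch_row):
            out[top + r][left + c] = pixel
    return out


def fuse_sequences(background_seq, moving_seq, top, left):
    """Paste each moving image over the background image with the same index."""
    if len(background_seq) != len(moving_seq):
        raise ValueError("sequences must have the same number of frames")
    return [paste(b, m, top, left) for b, m in zip(background_seq, moving_seq)]


background = [[0, 0, 0], [0, 0, 0]]
fused = fuse_sequences([background], [[[7]]], top=1, left=2)
```

  • The length check reflects why the number of background images has to be matched to the number of moving images before fusion.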
  • the information of the preset variable text corresponding to the preset background image sequence may be recorded.
  • the information of the preset background image sequence may include: a start frame identifier and an end frame identifier of the preset background image sequence in the preset video.
  • the information of the preset background image sequence may include: a start frame number 100, an end frame number 125, and the like.
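  • A small record type is enough to hold this information. The sketch below uses a hypothetical `BackgroundInfo` class; the only arithmetic involved is that an inclusive range from frame 100 to frame 125 contains 125 - 100 + 1 = 26 images.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackgroundInfo:
    """Start and end frame identifiers of a preset background image sequence."""

    start_frame: int
    end_frame: int

    @property
    def count(self):
        # Both ends are inclusive, hence the + 1.
        return self.end_frame - self.start_frame + 1

    def frame_numbers(self):
        return list(range(self.start_frame, self.end_frame + 1))


info = BackgroundInfo(start_frame=100, end_frame=125)
```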
  • the background images at the first and last positions of the target background image sequence match the background images at the first and last positions of the preset background image sequence.
  • the first position may refer to a start position
  • the tail position may refer to an end position.
  • the background image at the first position of the target background image sequence matches the background image at the first position of the preset background image sequence.
  • the background image at the end position of the target background image sequence matches the background image at the end position of the preset background image sequence.
  • when the target background image sequence matches the preset background image sequence at the boundary position, the matching degree and continuity between the target background image sequence and the background image sequence corresponding to the template text at the splicing position can also be improved.
  • the above-mentioned determining method for determining the target background image sequence corresponding to the variable text to be processed may specifically include:
  • Determination mode 1: when the number N1 of images corresponding to the preset background image sequence matches the number N2 of images corresponding to the target moving image sequence, determining the preset background image sequence as the target background image sequence; or
  • Determination mode 2: when the number N1 of images corresponding to the preset background image sequence is greater than the number N2 of images corresponding to the target moving image sequence, discarding first background images located in the middle position from the preset background image sequence; in the case of discarding at least two frames of first background images, the at least two frames of first background images are discontinuously distributed in the preset background image sequence; or
  • Determination mode 3: when the number N1 of images corresponding to the preset background image sequence is smaller than the number N2 of images corresponding to the target moving image sequence, adding second background images on the basis of the preset background image sequence.
  • the preset background image sequence is determined as the target background image sequence, which can realize the matching of the target background image sequence and the preset background image sequence at the boundary position.
  • the number N2 of images corresponding to the target moving image sequence can be determined according to the speech duration information corresponding to the variable text to be processed.
  • the speech duration information may be determined according to the speech parameters corresponding to the variable text to be processed, or the speech duration information may be determined according to the duration of the speech segment corresponding to the variable text to be processed.
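  • Converting the speech duration into an image count is a single multiplication by the frame rate. The sketch below assumes a frame rate of 25 frames per second, which the text does not specify, and uses the hypothetical name `frames_for_duration`.

```python
def frames_for_duration(duration_s, fps=25):
    """Number of image frames N2 needed to cover `duration_s` seconds of speech.

    `fps` is an assumed frame rate; the text does not specify one.
    """
    if duration_s <= 0 or fps <= 0:
        raise ValueError("duration and fps must be positive")
    return max(1, round(duration_s * fps))


n2 = frames_for_duration(1.2)  # 1.2 s of speech at the assumed 25 fps
```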
  • the first background image located in the middle position is discarded from the preset background image sequence, which can realize the matching of the target background image sequence and the preset background image sequence at the boundary position.
  • the middle position can be different from the first or last position.
  • the discarded at least two frames of the first background image are discontinuously distributed in the preset background image sequence; in this way, the problem of poor continuity of the background image caused by discarding continuous background images can be avoided to a certain extent.
  • the number of the first background images may match the difference between N1 and N2.
  • for example, the information of the preset background image sequence may include start frame number 100 and end frame number 125, so the value of N1 is 26. Assuming that the number N2 of images corresponding to the target moving image sequence is 24, two frames of first background images that are located in the middle and whose positions are discontinuous are discarded from the preset background image sequence.
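  • Determination mode 2 can be sketched as follows. The application only requires that the discarded frames lie in the middle and are not adjacent to one another; spreading them evenly over the interior, as done here, is one assumed way to satisfy that, and the name `drop_middle_frames` is hypothetical.

```python
def drop_middle_frames(frames, n2):
    """Shorten `frames` to n2 items by dropping non-adjacent interior frames.

    The first and last frames are always kept, so the result still matches the
    preset sequence at both boundary positions.
    """
    n1 = len(frames)
    k = n1 - n2          # number of frames to discard
    interior = n1 - 2    # frames that are neither first nor last
    if k < 0:
        raise ValueError("n2 must not exceed the number of frames")
    if k == 0:
        return list(frames)
    if 2 * k > interior:
        raise ValueError("too many frames to drop without adjacent gaps")
    step = interior / k  # spacing of at least 2 keeps dropped indices apart
    drop = {1 + int(j * step + step / 2) for j in range(k)}
    return [f for i, f in enumerate(frames) if i not in drop]


preset = list(range(100, 126))        # frame numbers 100..125, N1 = 26
kept = drop_middle_frames(preset, 24)  # N2 = 24
```

  • With these inputs the sketch discards frames 107 and 119, which are in the middle and not adjacent.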
  • when N1 is smaller than N2, adding second background images on the basis of the preset background image sequence can realize the matching of the target background image sequence and the preset background image sequence at the boundary position.
  • the second background image may originate from a preset background image sequence, in other words, the second background image to be added may be determined from the preset background image sequence.
  • the preset background image sequence may first be determined, in forward order, as the first part of the target background image sequence; then the preset background image sequence may be determined, in reverse order, as the second part of the target background image sequence; then, in forward order, the preset background image sequence is determined as the third part of the target background image sequence; wherein the end frame of the third part matches the end frame of the preset background image sequence.
  • for example, the information of the preset background image sequence may include start frame number 100 and end frame number 125, so the value of N1 is 26. Assuming that the number N2 of images corresponding to the target moving image sequence is 30, the frame numbers corresponding to the first part of the target background image sequence may be 100→125, the frame numbers corresponding to the second part may be 125→124, and the frame numbers corresponding to the third part may be 124→125.
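  • The forward, reverse, forward arrangement can be sketched as a short "ping-pong" over the tail of the preset sequence. The name `extend_ping_pong` is hypothetical, and repeating the end frame once when N2 - N1 is odd is an assumption of this sketch that the text does not cover.

```python
def extend_ping_pong(frames, n2):
    """Lengthen `frames` to n2 items by replaying its tail backward, then forward.

    The result starts with the full preset sequence and ends on its end frame.
    """
    if not frames:
        raise ValueError("frames must not be empty")
    n1 = len(frames)
    extra = n2 - n1
    if extra < 0:
        raise ValueError("n2 must not be smaller than the number of frames")
    half, rem = divmod(extra, 2)
    if half > n1:
        raise ValueError("extension longer than this sketch supports")
    tail = list(frames[n1 - half:])  # last `half` frames; empty when half == 0
    # Assumption: an odd remainder is absorbed by holding the end frame once.
    return list(frames) + tail[::-1] + tail + [frames[-1]] * rem


preset = list(range(100, 126))          # frame numbers 100..125, N1 = 26
extended = extend_ping_pong(preset, 30)  # N2 = 30
```

  • For N1 = 26 and N2 = 30 this yields 100→125, then 125→124, then 124→125, so no frame is adjacent to a frame more than one step away from it.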
  • the second background image may originate from a background image sequence other than the preset background image sequence; for example, the second background image may be determined from the background image sequence following the preset background image sequence.
  • the preset background image sequence may first be determined, in forward order, as the first part of the target background image sequence; then the background image sequence following the preset background image sequence may be determined, in forward order, as the second part of the target background image sequence; then, in reverse order, the background image sequence following the preset background image sequence and the end frame of the preset background image sequence are determined as the third part of the target background image sequence; wherein the end frame of the third part matches the end frame of the preset background image sequence.
  • for example, the information of the preset background image sequence may include start frame number 100 and end frame number 125, so the value of N1 is 26. Assuming that the number N2 of images corresponding to the target moving image sequence is 30, the frame numbers corresponding to the first part of the target background image sequence may be 100→125, the frame numbers corresponding to the second part may be 126→127, and the frame numbers corresponding to the third part may be 127→125.
  • an inverted target background image sequence may also be determined.
  • the corresponding determination process may include: first, in reverse order, determining the preset background image sequence as the first part of the target background image sequence; then, in forward order, determining the preset background image sequence as the second part of the target background image sequence; then, in reverse order, determining the preset background image sequence as the third part of the target background image sequence; wherein the starting frame of the third part matches the starting frame of the preset background image sequence.
  • for example, the information of the preset background image sequence may include start frame number 100 and end frame number 125, so the value of N1 is 26. Assuming that the number N2 of images corresponding to the target moving image sequence is 30, the frame numbers corresponding to the first part of the target background image sequence may be 125→100, the frame numbers corresponding to the second part may be 100→101, and the frame numbers corresponding to the third part may be 101→100.
  • the frame numbers of the target background image sequence obtained may be: 100→101→101→100→100→125.
  • step 103 the first video clip and the second video clip are spliced to obtain a video corresponding to the first text.
  • the first video segment may specifically include: a first audio segment
  • the second video segment may specifically include: a second audio segment
  • the above-mentioned splicing of the first video segment and the second video segment may specifically include: smoothing the respective speech sub-segments of the first speech segment and the second speech segment at the splicing position; and splicing the smoothed first speech segment and the smoothed second speech segment.
  • smoothing is first performed on the speech sub-segments of the first speech segment and the second speech segment at splicing positions, and then the smoothed first speech segment and the smoothed second speech segment are spliced.
  • the above smoothing process can improve the continuity between the smoothed first speech segment and the second speech segment, and thus can improve the continuity of the first video segment and the second video segment at the splicing position.
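  • The smooth-then-splice order described above can be sketched as follows. The raised-cosine ramp and the name `taper_and_splice` are assumptions of this sketch; the text does not fix a particular smoothing function. The last `n` samples of the first speech segment are faded out, the first `n` samples of the second are faded in, and the two are then concatenated.

```python
import math


def taper_and_splice(first, second, n):
    """Taper both speech segments at the splicing position, then concatenate them."""
    if n <= 0 or n > len(first) or n > len(second):
        raise ValueError("n must fit inside both segments")
    # Rising raised-cosine ramp with values strictly between 0 and 1.
    ramp = [0.5 * (1.0 - math.cos(math.pi * (i + 1) / (n + 1))) for i in range(n)]
    a, b = list(first), list(second)
    for i in range(n):
        b[i] *= ramp[i]               # fade in the head of the second segment
        a[len(a) - 1 - i] *= ramp[i]  # fade out the tail of the first segment
    return a + b


spliced = taper_and_splice([1.0] * 4, [1.0] * 4, n=2)
```

  • Since the first video segment already ends in a speech pause, the tapered samples are close to silence in practice, which is why a simple taper is plausible here.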
  • the spliced video may be output, for example, to a user.
  • the corresponding variable text to be processed can be determined according to the disease name included in the user input, and the video can be obtained by using the method embodiment shown in FIG. 1B and provided to the user.
  • the first video segment corresponding to the template text and the second video segment corresponding to the variable text to be processed are spliced.
  • the first video clip may be a pre-saved video clip
  • a second video clip corresponding to the variable text to be processed may be generated during video processing. Since the length of the variable text to be processed is shorter than the length of the complete text, the embodiment of the present application can shorten the length of the generated video and the corresponding time cost, thus improving the video processing efficiency.
  • the first video segment in the embodiment of the present application is provided with a paused video sub-segment at the boundary position between the template text and the variable text.
  • the above-mentioned pause processing can overcome the jump or jitter problem at the splicing position to a certain extent, and thus can improve the continuity at the splicing position.
  • FIG. 2 shows a flow chart of a video processing method according to an embodiment of the present application, which may specifically include the following steps.
  • Step 201 according to the template text, the preset variable text, and the corresponding pause information at the boundary position, generate a preset video, where the pause information indicates a voice pause of a predetermined duration;
  • Step 202 intercepting the first video segment corresponding to the template text from the preset video, and saving the first video segment;
  • Step 203 according to the information of the preset video, save the preset image parameters of the preset variable text at the boundary position and the information of the preset background image sequence corresponding to the preset variable text;
  • Steps 201 to 203 can be used to pre-save the first video segment, the preset image parameters of the preset variable text at the boundary position, and the preset background image sequence corresponding to the preset variable text based on the generated preset video.
  • Steps 204 to 211 can be used to generate a second video clip corresponding to the variable text to be processed according to the pre-saved information; and splicing the pre-saved first video clip and the second video clip.
  • Step 204 for the sentence where the variable text to be processed is located, determine the corresponding speech parameters and image parameters;
  • Step 205 extracting target speech parameters and target image parameters corresponding to the variable text to be processed from the speech parameters and image parameters;
  • Step 206 perform smoothing processing on the target image parameters corresponding to the variable text to be processed according to the preset image parameters;
  • Step 207 according to the target voice parameter and the smoothed target image parameter, generate the target moving image sequence corresponding to the variable text to be processed;
  • Step 208 according to the preset background image sequence, determine the target background image sequence corresponding to the variable text to be processed;
  • Step 209 merging the target moving image sequence and the target background image sequence to obtain a second video clip corresponding to the variable text to be processed;
  • Step 210 smoothing the voice sub-segments at the boundary positions of the first voice segment in the first video segment and the second voice segment in the second video segment;
  • Step 211 Splice the first video clip and the second video clip according to the smoothed first speech clip and the smoothed second speech clip.
  • the preset complete text is the aforementioned text A
  • the preset variable text is "<diabetes>", "<fruit>", "<1800>", etc. in the text A
  • the preset video can be generated according to text A and the corresponding pause information, and the first video segment in the preset video, the preset image parameters of the preset variable text at the boundary position, and the information of the preset background image sequence corresponding to the preset variable text can be saved.
  • variable text may change. For example, after text A changes to text B "About <coronary heart disease> and <vegetables>, I am still researching. I think this <coronary heart disease> dietary advice may also be helpful to you; it contains recommendations and taboos for about <900> ingredients, please click to view", the variable text to be processed may include "<coronary heart disease>", "<vegetables>", "<900>", and so on in text B.
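  • Separating template text from variable text can be sketched with a regular expression, assuming that variable text is marked with angle brackets as in the examples above. The strings below are abbreviated versions of text A and text B, and the name `split_template` is hypothetical.

```python
import re

# A variable slot is any run of characters enclosed in angle brackets.
SLOT = re.compile(r"<([^<>]+)>")


def split_template(text):
    """Return (template_parts, variable_texts) for a text with <...> slots."""
    variables = SLOT.findall(text)
    # re.split with one capture group alternates template parts and captures,
    # so the even positions are the template parts.
    template_parts = SLOT.split(text)[::2]
    return template_parts, variables


parts_a, vars_a = split_template("About <diabetes> and <fruit>, I am still researching.")
parts_b, vars_b = split_template("About <coronary heart disease> and <vegetables>, I am still researching.")
```

  • The template parts of the two texts are identical, which is what allows the saved first video segments to be reused when only the variable text changes.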
  • the second video segment corresponding to the variable text to be processed can be generated. For example, you can first determine the acoustic parameters and lip features of the sentence where the variable text to be processed is located; then, extract the target acoustic parameters and target lip features corresponding to the variable text to be processed, and generate speech segments corresponding to the variable text to be processed and the target image sequence.
  • the target image sequence may include: a target moving image sequence and a target background image sequence.
  • step 206 may be used to smooth the target lip features, so as to improve the continuity of the lip features at the splicing position.
  • Step 208 can be used to generate the target background image sequence to achieve the matching of the target background image sequence and the preset background image sequence at the boundary position, so as to improve the continuity of the background image sequence at the stitching position.
  • each speech sub-segment at the boundary position is smoothed, and the first video clip and the second video clip are then spliced according to the smoothed first speech clip and the smoothed second speech clip.
  • the video processing method of the embodiment of the present application adds a pause of preset duration at the splicing position of the first video segment, which helps to overcome the jump or jitter problem at the splicing position and can therefore improve the continuity at the splicing position.
  • the sentence in which the variable text to be processed is located is used as a unit to determine the corresponding speech parameters and image parameters, and then from the speech parameters and image parameters, the target speech parameter and target image corresponding to the variable text to be processed are extracted parameter.
  • the voice parameter corresponding to the sentence and the image parameter have certain continuity, so the target voice parameter corresponding to the variable text to be processed and the target image parameter extracted therefrom have certain continuity with the voice parameter and the image parameter corresponding to the template text in the sentence;
  • the continuity between the second video segment corresponding to the variable text to be processed and the first video segment corresponding to the template text in the sentence can be improved, and the continuity at the splicing position can be further improved.
  • smoothing is performed on the target image parameters corresponding to the variable text to be processed according to the preset image parameters of the preset variable text at the boundary position. Since the preset image parameters of the preset variable text at the boundary position and the image parameters of the template text at the boundary position have a certain continuity, the above smoothing process can improve the continuity at the boundary position between the smoothed target image parameters and the image parameters of the template text; on this basis, the continuity between the second video segment corresponding to the variable text to be processed and the first video segment corresponding to the template text in the sentence can be improved, which in turn improves the continuity at the splicing position.
  • the embodiment of the present application generates the target background image sequence according to the preset background image sequence, which can realize the matching of the target background image sequence and the preset background image sequence at the boundary position, so as to improve the continuity of the background image sequence at the splicing position .
  • the speech sub-segments at the boundary position are smoothed.
  • the above smoothing process can improve the continuity between the smoothed first speech segment and the second speech segment, and thus can improve the continuity of the first video segment and the second video segment at the splicing position.
  • FIG. 3 shows a structural block diagram of an embodiment of a video processing device of the present application, which may specifically include:
  • the providing module 301 is configured to obtain a first video segment, where the first video segment corresponds to the template text in the first text of the video to be generated, the first video segment includes a video sub-segment with a speech pause, and the position of the video sub-segment corresponds to the boundary position between the template text and the variable text to be processed in the first text;
  • a generating module 302 configured to generate a second video clip corresponding to the variable text to be processed
  • the splicing module 303 is configured to splice the first video clip and the second video clip to obtain a video corresponding to the first text.
  • the above-mentioned device may also include:
  • the preset video generation module is used to generate a preset video according to the template text, the preset variable text, and the corresponding pause information at the boundary position, and the pause information represents a voice pause of a predetermined duration;
  • An intercepting module configured to intercept the first video segment corresponding to the template text from the preset video.
  • the generating module 302 may include:
  • the parameter determination module is configured to determine the corresponding speech parameters and image parameters for the sentence where the variable text to be processed is located in the first text, wherein the image parameters represent the state characteristics of the virtual object that will appear in the video corresponding to the first text, and the speech parameters characterize the parameters corresponding to speech synthesis;
  • a parameter extraction module configured to extract target speech parameters and target image parameters corresponding to the variable text to be processed from the speech parameters and image parameters;
  • the first segment generation module is used to generate the second video segment corresponding to the variable text to be processed according to the target voice parameter and the target image parameter.
  • the generating module 302 may include:
  • the first smoothing processing module is configured to perform smoothing processing on the target image parameters corresponding to the variable text to be processed according to the preset image parameters at the boundary positions of the variable text to be processed, so as to improve the continuity between the target image parameters and the image parameters of the template text at the boundary position;
  • the second segment generating module is configured to generate a second video segment corresponding to the variable text to be processed according to the smoothed target image parameters.
  • the above-mentioned first video clip may include: a first audio clip
  • the above-mentioned second video clip may include: a second audio clip
  • the splicing module 303 may include:
  • the second smoothing processing module is configured to smooth the respective speech sub-segments of the first speech segment and the second speech segment at the splicing position;
  • the post-smoothing splicing module is configured to splice the smoothed first speech segment and the smoothed second speech segment.
  • the image sequence corresponding to the above video may include: a background image sequence and a moving image sequence;
  • Generation module 302 may include:
  • a moving image sequence generating module configured to generate a target moving image sequence corresponding to the variable text to be processed
  • a background image sequence generation module configured to determine the target background image sequence corresponding to the variable text to be processed according to the preset background image sequence
  • the fusion module is configured to fuse the target moving image sequence and the target background image sequence to obtain the second video segment corresponding to the variable text to be processed.
  • the background images at the first and last positions of the target background image sequence match the background images at the first and last positions of the preset background image sequence.
  • the above-mentioned background image sequence generation module may include:
  • the first background image sequence generation module is configured to determine the above-mentioned preset background image sequence as the target background image sequence when the number of images corresponding to the preset background image sequence matches the number of images corresponding to the target moving image sequence; or
  • the second background image sequence generation module is configured to discard first background images located in the middle position from the preset background image sequence when the number of images corresponding to the preset background image sequence is greater than the number of images corresponding to the target moving image sequence; in the case of discarding at least two frames of first background images, the at least two frames of first background images are discontinuously distributed in the preset background image sequence; or
  • the third background image sequence generating module is configured to add a second background image to the preset background image sequence when the number of images corresponding to the preset background image sequence is smaller than the number of images corresponding to the target moving image sequence.
  • because the apparatus embodiment is basically similar to the method embodiment, its description is relatively simple; for related parts, refer to the description of the method embodiment.
  • Fig. 4 is a structural block diagram of an apparatus 900 for video processing according to an exemplary embodiment.
  • the apparatus 900 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device, a personal digital assistant, and the like.
  • device 900 may include one or more of the following components: processing component 902, memory 904, power supply component 906, multimedia component 908, audio component 910, input/output (I/O) interface 912, sensor component 914, and communication component 916 .
  • the processing component 902 generally controls the overall operations of the device 900, such as those associated with display, incoming phone calls, data communications, camera operations, and recording operations.
  • the processing element 902 may include one or more processors 920 to execute instructions to complete all or part of the steps of the above method.
  • processing component 902 may include one or more modules that facilitate interaction between processing component 902 and other components.
  • the processing component 902 may include a multimedia module to facilitate interaction between the multimedia component 908 and the processing component 902.
  • the memory 904 is configured to store various types of data to support operations at the device 900 . Examples of such data include instructions for any application or method operating on the device 900, contact data, phonebook data, messages, pictures, videos, and the like.
  • the memory 904 can be implemented by any type of volatile or non-volatile memory device or their combination, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable Programmable Read Only Memory (EPROM), Programmable Read Only Memory (PROM), Read Only Memory (ROM), Magnetic Memory, Flash Memory, Magnetic or Optical Disk.
  • the power supply component 906 provides power to the various components of the device 900 .
  • Power components 906 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for device 900 .
  • the multimedia component 908 includes a screen that provides an output interface between the device 900 and the user.
  • the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from a user.
  • the touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensor may not only sense a boundary of a touch or swipe motion action, but also detect duration and pressure associated with the touch or swipe operation.
  • the multimedia component 908 includes a front camera and/or a rear camera. When the device 900 is in an operation mode, such as a shooting mode or a video mode, the front camera and/or the rear camera can receive external multimedia data. Each front camera and rear camera can be a fixed optical lens system or have focal length and optical zoom capability.
  • the audio component 910 is configured to output and/or input audio signals.
  • the audio component 910 includes a microphone (MIC) configured to receive external audio signals when the device 900 is in operation modes, such as call mode, recording mode and voice recognition mode. Received audio signals may be further stored in memory 904 or sent via communication component 916 .
  • the audio component 910 also includes a speaker for outputting audio signals.
  • the I/O interface 912 provides an interface between the processing component 902 and a peripheral interface module.
  • the peripheral interface module may be a keyboard, a click wheel, a button, and the like. These buttons may include, but are not limited to: a home button, volume buttons, start button, and lock button.
  • Sensor assembly 914 includes one or more sensors for providing status assessments of various aspects of device 900 .
  • the sensor component 914 can detect the open/closed state of the device 900, the relative positioning of components, such as the display and keypad of the device 900, and the sensor component 914 can also detect a change in the position of the device 900 or a component of the device 900 , the presence or absence of user contact with the device 900 , the device 900 orientation or acceleration/deceleration and the temperature change of the device 900 .
  • Sensor assembly 914 may include a proximity sensor configured to detect the presence of nearby objects in the absence of any physical contact.
  • Sensor assembly 914 may also include an optical sensor, such as a CMOS or CCD image sensor, for use in imaging applications.
  • the sensor component 914 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor or a temperature sensor.
  • the communication component 916 is configured to facilitate wired or wireless communication between the apparatus 900 and other devices.
  • the device 900 can access wireless networks based on communication standards, such as WiFi, 2G or 3G, or a combination thereof.
  • the communication component 916 receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel.
  • the communication component 916 also includes a near field communication (NFC) module to facilitate short-range communication.
  • the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra Wide Band (UWB) technology, Bluetooth (BT) technology and other technologies.
  • apparatus 900 may be implemented by one or more application specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components, for performing the methods described above.
  • in an exemplary embodiment, a non-transitory computer-readable storage medium including instructions is also provided, such as the memory 904 including instructions, which can be executed by the processor 920 of the device 900 to implement the above method.
  • the non-transitory computer readable storage medium may be ROM, random access memory (RAM), CD-ROM, magnetic tape, floppy disk, optical data storage device, and the like.
  • Fig. 5 is a structural block diagram of a server in some embodiments of the present application.
  • the server 1900 may vary considerably depending on configuration or performance, and may include one or more central processing units (CPUs) 1922 (for example, one or more processors), memory 1932, and one or more storage media 1930 (for example, one or more mass storage devices) storing application programs 1942 or data 1944.
  • the memory 1932 and the storage medium 1930 may be temporary storage or persistent storage.
  • the program stored in the storage medium 1930 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations on the server.
  • the central processing unit 1922 may be configured to communicate with the storage medium 1930 , and execute a series of instruction operations in the storage medium 1930 on the server 1900 .
  • the server 1900 may also include one or more power sources 1926, one or more wired or wireless network interfaces 1950, one or more input and output interfaces 1958, one or more keyboards 1956, and/or one or more operating systems 1941, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™ and so on.
  • a non-transitory computer-readable storage medium is provided; when the instructions in the storage medium are executed by the processor of an apparatus (a device or a server), the apparatus can execute the video processing method according to the embodiments of the present application.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • Signal Processing (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • General Health & Medical Sciences (AREA)
  • Business, Economics & Management (AREA)
  • Marketing (AREA)
  • Social Psychology (AREA)
  • Geometry (AREA)
  • Psychiatry (AREA)
  • Processing Or Creating Images (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)
  • Television Signal Processing For Recording (AREA)

Abstract

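A video processing method, apparatus, medium, and program product. The method includes: obtaining a first video segment, where the first video segment corresponds to the template text in a first text of a video to be generated and includes a video sub-segment with a speech pause, the position of the video sub-segment corresponding to the boundary position between the template text and the variable text to be processed in the first text (101); generating a second video segment corresponding to the variable text to be processed (102); and splicing the first video segment and the second video segment to obtain the video corresponding to the first text (103). The embodiments of this application can improve video processing efficiency.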

Description

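Video Processing Method, Apparatus, Medium, and Program Product
This application claims priority to Chinese Patent Application No. 202111124169.4, entitled "Video processing method, apparatus, and medium" and filed with the China Patent Office on September 24, 2021, which is incorporated herein by reference in its entirety.
Technical Field
This application relates to the field of communication technologies, and in particular to a video processing method, apparatus, medium, and program product.
Background
With the development of communication technologies, virtual objects can be widely used in application scenarios such as broadcasting, teaching, medical care, and customer service. In these scenarios, a virtual object usually needs to express a text, and accordingly a video corresponding to the virtual object can be generated and played. The video represents the process in which the virtual object expresses the text. Generating the video usually involves a speech generation stage and an image sequence generation stage. The speech generation stage usually uses speech synthesis technology, and the image sequence generation stage usually uses image processing technology.
In implementing the embodiments of this application, the inventors found that the related art generates a complete video for a complete text, which usually takes considerable time and results in low video processing efficiency.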
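Summary
How to improve video processing efficiency is a technical problem that those skilled in the art need to solve. In view of this problem, the embodiments of this application propose a video processing method, apparatus, medium, and program product that overcome the problem or at least partially solve it.
To solve the above problem, this application discloses a video processing method performed in an electronic device, the method including:
obtaining a first video segment, where the first video segment corresponds to the template text in a first text of a video to be generated and includes a video sub-segment with a speech pause, the position of the video sub-segment corresponding to the boundary position between the template text and the variable text to be processed in the first text;
generating a second video segment corresponding to the variable text to be processed; and
splicing the first video segment and the second video segment to obtain the video corresponding to the first text.
In another aspect, this application discloses a video processing apparatus, including:
a providing module, configured to obtain a first video segment, where the first video segment corresponds to the template text in a first text of a video to be generated and includes a video sub-segment with a speech pause, the position of the video sub-segment corresponding to the boundary position between the template text and the variable text to be processed in the first text;
a generation module, configured to generate a second video segment corresponding to the variable text to be processed; and
a splicing module, configured to splice the first video segment and the second video segment to obtain the video corresponding to the first text.
In yet another aspect, this application discloses an apparatus for video processing, including a memory and one or more programs, where the one or more programs are stored in the memory and, when executed by one or more processors, implement the steps of the foregoing method.
In still another aspect, the embodiments of this application disclose one or more machine-readable media storing instructions that, when executed by one or more processors, cause an apparatus to perform one or more of the foregoing methods.
In still another aspect, the embodiments of this application disclose a computer program product including computer instructions stored in a computer-readable storage medium; when a processor executes the computer instructions, the processor performs the video processing method of the embodiments of this application.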
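Brief Description of the Drawings
Fig. 1A is a schematic diagram of an application scenario according to an embodiment of this application;
Fig. 1B is a flowchart of a video processing method according to an embodiment of this application;
Fig. 2 is a flowchart of a video processing method according to an embodiment of this application;
Fig. 3 is a structural block diagram of a video processing apparatus according to an embodiment of this application;
Fig. 4 is a structural block diagram of an apparatus for video processing according to an embodiment of this application; and
Fig. 5 is a structural block diagram of a server in some embodiments of this application.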
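Detailed Description
To make the above objectives, features, and advantages of this application clearer and easier to understand, this application is described in further detail below with reference to the drawings and specific implementations.
In the embodiments of this application, a virtual object is a vivid, natural object close to a real object, obtained through technologies such as object modeling and motion capture. Artificial intelligence technologies such as speech recognition and natural language understanding give the virtual object abilities such as cognition, understanding, or expression. Virtual objects include virtual characters, virtual animals, two-dimensional cartoon objects, three-dimensional cartoon objects, and the like.
For example, in a broadcasting scenario, a virtual object may replace a media worker in news broadcasting or game commentary. As another example, in a medical scenario, a virtual object may replace a medical worker in giving medical guidance.
In a specific implementation, a virtual object may express a text, and the embodiments of this application may generate a video corresponding to the text and the virtual object. The video may include a speech sequence corresponding to the text and an image frame sequence corresponding to the speech sequence.
In some application scenarios, the text of the video to be generated includes template text and variable text. The template text is relatively fixed, while the variable text usually changes according to preset factors such as user input.
For example, the variable text may be determined from user input. Taking a medical scenario as an example, the corresponding variable text may be determined from the disease name contained in the user input. Optionally, the fields corresponding to the variable text include a disease name field, a food type field, an ingredient quantity field, and so on, and these fields may be determined from the disease name contained in the user input.
It can be understood that those skilled in the art may determine the variable text in a text according to actual application requirements; the embodiments of this application do not limit the specific way of determining the variable text.
To keep video quality acceptable, when the variable text changes, the related art usually generates a complete video for the changed complete text. However, doing so usually takes considerable time and results in low video processing efficiency.
To address the technical problem of how to improve video processing efficiency, the embodiments of this application provide a video processing solution that includes: obtaining a first video segment, where the first video segment corresponds to the template text in a first text of a video to be generated and includes a video sub-segment with a speech pause, the position of the video sub-segment corresponding to the boundary position between the template text and the variable text to be processed in the first text, and the first text including the template text and the variable text to be processed; generating a second video segment corresponding to the variable text to be processed; and splicing the first video segment and the second video segment to obtain the video corresponding to the first text.
The embodiments of this application splice the first video segment corresponding to the template text with the second video segment corresponding to the variable text to be processed. The first video segment may be a video segment saved in advance, and the second video segment corresponding to the variable text to be processed may be generated during video processing. Because the variable text to be processed is shorter than the complete text, the embodiments can shorten the length of the video that must be generated and the corresponding time cost, and can therefore improve video processing efficiency.
In addition, the first video segment in the embodiments of this application includes a video sub-segment with a speech pause. Here a speech pause means that speech stops, for example the virtual object does not speak. The position of the video sub-segment corresponds to the boundary position between the template text and the variable text to be processed in the first text. This video sub-segment helps overcome jumps or jitter at the splicing position and can therefore improve continuity at the splicing position.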
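The video processing method provided in the embodiments of this application can be applied in scenarios involving a client and a server. For example, Fig. 1A is a schematic diagram of an application scenario according to an embodiment of this application. The client and the server are located in a wired or wireless network, through which they exchange data.
The client and the server may be collectively referred to as electronic devices. Clients include, but are not limited to: smartphones, tablet computers, e-book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop computers, in-vehicle computers, desktop computers, set-top boxes, smart TVs, wearable devices, and so on. The server is, for example, a standalone hardware server, a virtual server, or a server cluster.
The client is a program that corresponds to the server and provides local services to the user. The client in the embodiments of this application may receive user input and provide the video corresponding to that input. The video may be generated by the client or the server; the embodiments of this application do not limit which entity generates the video.
In one embodiment of this application, the client may receive user input and upload it to the server so that the server generates the corresponding video. The server may determine the variable text to be processed from the user input, generate the second video segment corresponding to the variable text to be processed, and splice the pre-saved first video segment with the second video segment to obtain the video corresponding to the template text and the variable text to be processed.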
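Method Embodiment 1
Referring to Fig. 1B, a flowchart of a video processing method of this application is shown, which may include the following steps. The video processing method may be performed, for example, by an electronic device.
Step 101: obtain a first video segment, where the first video segment corresponds to the template text in a first text of a video to be generated and includes a video sub-segment with a speech pause, the position of the video sub-segment corresponding to the boundary position between the template text and the variable text to be processed in the first text.
Step 102: generate a second video segment corresponding to the variable text to be processed.
Step 103: splice the first video segment and the second video segment to obtain the video corresponding to the first text.
In one embodiment, in step 101, the first video segment corresponding to the template text may be generated and saved in advance. The first video segment includes a video sub-segment with a speech pause. Here a speech pause means that speech stops or is temporarily not output, so the video sub-segment with a speech pause can be regarded as a video sub-segment without speech. The position of the video sub-segment corresponds to the boundary position between the template text and the variable text to be processed in the first text, and the video sub-segment can improve continuity at the splicing position.
The structure of a text in the embodiments of this application includes template text and variable text. A boundary position separates adjacent template text and variable text.
Take text A as an example: "关于<糖尿病>和<水果>的问题,我还在研究。我想这份<糖尿病>的饮食建议可能也对你有帮助,里面包含了约<1800>种食材的推荐、禁忌,请你点击查看" ("Regarding the question of <diabetes> and <fruit>, I am still researching it. I think this dietary advice for <diabetes> may also help you; it contains recommendations and contraindications for about <1800> ingredients. Please click to view."). Text A contains multiple boundary positions. For example, there is a boundary position between the template text "关于" ("regarding") and the variable text "<糖尿病>" ("<diabetes>"), between the variable text "<糖尿病>" and the template text "和" ("and"), between the template text "和" and the variable text "<水果>" ("<fruit>"), between the variable text "<水果>" and the template text "的" ("of"), and so on.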
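As an illustration of how boundary positions arise from a text such as text A, the following minimal sketch splits a string into template and variable parts. It assumes, as in the example, that variable text is wrapped in angle brackets; the function names `split_template` and `boundaries` are hypothetical and not part of this application.

```python
import re

def split_template(text):
    """Split text into (kind, content) parts; variables are wrapped in <...>."""
    parts = []
    for piece in re.split(r"(<[^<>]*>)", text):
        if not piece:
            continue
        if piece.startswith("<") and piece.endswith(">"):
            parts.append(("variable", piece[1:-1]))
        else:
            parts.append(("template", piece))
    return parts

def boundaries(parts):
    """Indices i where a boundary lies between parts[i] and parts[i + 1]."""
    return [i for i in range(len(parts) - 1) if parts[i][0] != parts[i + 1][0]]

# Shortened English stand-in for text A (an assumption for readability).
parts = split_template("Regarding <diabetes> and <fruit>, I am still researching.")
```

Each index returned by `boundaries(parts)` marks one place where a pre-saved template segment would meet a freshly generated variable segment.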
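In one implementation, determining the first video segment may include: generating a preset video according to the template text, a preset variable text, and pause information at the corresponding boundary position; and cutting the first video segment corresponding to the template text out of the preset video.
The preset variable text may be any variable text, or any instance of the variable text.
The embodiments of this application may generate the preset video from the preset complete text formed by the template text and the preset variable text, taking the pause information at the boundary position into account during generation. The pause information indicates, for example, a speech pause of a predetermined duration.
In practical applications, the preset video may include preset speech for the speech part and a preset image sequence for the image part.
In a specific implementation, TTS (Text To Speech) technology may be used to convert the preset complete text into the preset speech. The preset speech may be represented as a waveform.
Converting the preset complete text into the preset speech includes a language analysis stage and an acoustic system stage. The language analysis stage generates linguistic information from the preset complete text and its pause information; the acoustic system stage generates the preset speech from the linguistic information provided by the language analysis stage, thereby producing sound.
In one implementation, the language analysis stage may include: text structure and language identification, text normalization, text-to-phoneme conversion, and prosody prediction. The linguistic information may be the output of the language analysis stage.
Text structure and language identification determines the language of the preset complete text, such as Chinese, English, Tibetan, or Uyghur, splits the preset complete text into sentences according to the grammar rules of that language, and passes the sentences to the subsequent processing modules.
Text normalization normalizes the split sentences according to configured rules.
Text-to-phoneme conversion determines the phoneme features of each sentence.
Because people usually speak with tone and emotion, and speech synthesis often aims to imitate a real human voice, prosody prediction determines where a sentence should pause and for how long, which characters or words should be stressed, and which should be read lightly, so that the voice rises, falls, and varies in cadence.
The embodiments of this application may first determine a prosody prediction result using prosody prediction technology, and then update the result according to the pause information.
Taking text A as an example, the pause information may specify a pause of a preset duration between the template text "关于" and the variable text "<糖尿病>". Updating the prosody prediction result may then include adding the pause between the phoneme features "guan", "yu" of the template text "关于" and the phoneme features "tang", "niao", "bing" of the variable text "<糖尿病>". The updated prosody prediction result may be: "guan", "yu", "pause N milliseconds", "tang", "niao", "bing", and so on. N may be a natural number greater than 0, and its value may be determined by those skilled in the art according to actual application requirements.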
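The update of the prosody prediction result can be sketched as inserting a pause token at the boundary. This is only an illustration under assumed data structures: phonemes are plain strings, the pause is a marker string, and `insert_pause` is a hypothetical helper, not an API of any TTS system.

```python
def insert_pause(template_phonemes, variable_phonemes, pause_ms):
    """Return the phoneme sequence with a pause token at the boundary."""
    if pause_ms <= 0:
        raise ValueError("pause_ms must be a natural number greater than 0")
    pause_token = f"<pause {pause_ms}ms>"
    return list(template_phonemes) + [pause_token] + list(variable_phonemes)

# The value 200 is an arbitrary choice for N, which the text leaves open.
updated = insert_pause(["guan", "yu"], ["tang", "niao", "bing"], 200)
```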
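The acoustic system stage may produce preset speech that meets requirements according to speech synthesis parameters.
Optionally, the speech synthesis parameters may include a timbre parameter. Timbre refers to the distinctive waveform characteristics of the frequencies of different sounds. Different sound sources usually have different timbres, so a speech sequence matching the timbre of a target speaker can be obtained according to the timbre parameter. The target speaker may be specified by the user, for example a designated medical worker. In practical applications, the timbre parameter of the target speaker may be obtained from audio of the target speaker of a preset length.
The preset image sequence for the image part may be obtained on the basis of a virtual object image. In other words, the embodiments of this application may give state features to the virtual object image to obtain the preset image sequence. The virtual object image may be specified by the user, for example the image of a well-known person such as a host.
The state features may include at least one of the following:
expression features;
lip features; and
body features.
Expression conveys feelings and sentiments, and may refer to thoughts and feelings shown on the face.
Expression features usually concern the whole face. Lip features concern the lips specifically and are related to the content of the text, the speech, and the manner of pronunciation, so they can make the expression conveyed by the preset image sequence more natural.
Body features convey a character's thoughts through the coordinated movement of body parts such as the head, eyes, neck, hands, elbows, arms, torso, hips, and feet. Body features may include head turns, shrugs, gestures, and so on, and can enrich the expression conveyed by the image sequence. For example, at least one arm hangs down naturally while speaking, and at least one arm rests naturally on the abdomen while not speaking.
When generating the image part of the preset video, the embodiments of this application may determine image parameters according to the preset complete text and the pause information, where the image parameters characterize the state features of the virtual object, and generate the preset image sequence for the image part according to the image parameters.
The image parameters may include pause image parameters, which characterize the pause state features corresponding to the pause information. In other words, the pause image parameters describe the posture, expression, and other state features the virtual object shows when it stops speaking. Accordingly, the preset image sequence may include an image sequence corresponding to the pause state features. For example, the pause state features may include a neutral expression, closed lips, and arms hanging down.
After the preset speech and the preset image sequence are generated, they may be fused to obtain the preset video.
After the preset video is obtained, the first video segment corresponding to the template text may be cut out of it. Specifically, the cut may be made according to the start position and end position of the preset variable text in the preset video.
Taking text A as an example, suppose the start position of the preset variable text "<糖尿病>" in the text corresponds to start position T1 in the preset video, and its end position corresponds to end position T2 in the preset video. The video segment before T1 may then be cut from the preset video as the first video segment corresponding to the template text "关于". Note that because the pause information at the boundary position was used when generating the preset video, the first video segment before T1 carries the pause (that is, it includes a video sub-segment with a speech pause), which improves continuity at the splicing position during later splicing.
Again taking text A as an example, suppose the start position of the preset variable text "<水果>" in the text corresponds to start position T3 in the preset video, and its end position corresponds to end position T4 in the preset video. The video segment between T2 and T3 may then be cut from the preset video as the first video segment corresponding to the template text "和".
Because the template text in the preset complete text is divided into several pieces by the preset variable text, in practical applications the first video segments corresponding to the several pieces of template text may each be extracted from the preset video.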
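The cutting of several first video segments can be sketched as computing the time intervals of the preset video that are not covered by any preset variable text. The sketch assumes the variable spans (T1, T2), (T3, T4), ... are already known as times in seconds; `template_intervals` is a hypothetical name.

```python
def template_intervals(total, variable_spans):
    """Return the intervals of [0, total] not covered by the variable spans."""
    intervals, cursor = [], 0.0
    for start, end in sorted(variable_spans):
        if start > cursor:
            intervals.append((cursor, start))  # template text before this variable
        cursor = max(cursor, end)
    if cursor < total:
        intervals.append((cursor, total))      # trailing template text
    return intervals

# Hypothetical times: T1=1.0, T2=2.0, T3=2.5, T4=3.5 in a 10-second preset video.
segments = template_intervals(10.0, [(1.0, 2.0), (2.5, 3.5)])
```

Here the first interval plays the role of the segment before T1 and the second the segment between T2 and T3.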
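It can be understood that obtaining the first video segment by using the pause information at the boundary position while generating the preset video is only an optional embodiment; those skilled in the art may use other ways of obtaining it according to actual application requirements.
In one embodiment, the video sub-segment in the first video segment not only has a speech pause, but the virtual object in its images is also in a non-speaking state.
In one embodiment, the video sub-segment is a sub-segment obtained through pause processing.
The pause processing of the video sub-segment includes:
weighting the speech signal sub-segment of the first video segment at the splicing position corresponding to the boundary position with a silence signal, to obtain a speech signal sub-segment with a speech pause; and
weighting the image sub-sequence of the first video segment at the splicing position with an image sequence of a target state feature, to obtain an image sub-sequence in which the virtual object is in a non-speaking state, where the target state feature is a feature indicating that the virtual object is in a non-speaking state. The speech signal sub-segment with a speech pause and the image sub-sequence in which the virtual object is not speaking together form the video sub-segment.
In one embodiment, one way of obtaining the first video segment may include: generating a first video according to the template text and the preset variable text; cutting the first video segment corresponding to the template text out of the first video; and performing pause processing on the first video segment at the boundary position.
Taking pause processing of the speech part as an example, the speech signal sub-segment of the video segment at the boundary position may be weighted with a silence signal. Taking pause processing of the image part as an example, the image sub-sequence of the video segment at the boundary position may be weighted with the image sequence of the target state feature corresponding to the pause information.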
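The weighting described above can be sketched as a blend whose weight shifts toward the target as the boundary approaches. The linear weight schedule is an assumption, since the text only says "weighted"; `blend_toward` is a hypothetical helper, and each frame is modeled as a list of floats (one audio sample, or a vector of image parameters).

```python
def blend_toward(frames, target, n):
    """Blend the last n frames toward target; the final frame equals target."""
    out = [list(frame) for frame in frames]
    n = min(n, len(out))
    for k in range(n):
        w = (k + 1) / n                    # target weight, 1.0 at the boundary
        i = len(out) - n + k
        out[i] = [(1 - w) * a + w * b for a, b in zip(out[i], target)]
    return out

# Speech part: the silence signal is the target [0.0].
paused_audio = blend_toward([[1.0], [1.0], [1.0], [1.0]], [0.0], 2)
# Image part: the target is a non-speaking state, e.g. lip opening 0.0 (assumed).
paused_lips = blend_toward([[0.8], [0.6]], [0.0], 1)
```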
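After the first video segment is obtained, it may be saved so that, when the variable text changes, it can be spliced with the second video segment corresponding to the changed variable text (hereinafter, the variable text to be processed).
In step 102, the variable text to be processed may be obtained from user input. It can be understood that the embodiments of this application do not limit the specific way of determining the variable text to be processed.
The embodiments of this application may provide the following technical solutions for generating the second video segment corresponding to the variable text to be processed:
Technical solution 1
In technical solution 1, generating the second video segment corresponding to the variable text to be processed includes: determining speech parameters and image parameters for the sentence of the first text in which the variable text to be processed is located, where the image parameters characterize the state features of the virtual object that will appear in the video corresponding to the first text, and the speech parameters characterize the parameters used in speech synthesis; extracting, from the speech parameters and image parameters, the target speech parameters and target image parameters corresponding to the variable text to be processed; and generating the second video segment according to the target speech parameters and target image parameters.
Technical solution 1 first determines the speech parameters and image parameters at the level of the sentence containing the variable text to be processed, and then extracts from them the target speech parameters and target image parameters corresponding to the variable text to be processed.
A sentence is a grammatically self-contained unit consisting of one word or a group of syntactically related words, expressing a statement, question, command, wish, or exclamation.
When the variable text to be processed is a word, the sentence usually contains both template text and the variable text to be processed. Because the speech parameters and image parameters of a sentence have a certain continuity, the target parameters extracted for the variable text to be processed are continuous with the parameters of the template text in the same sentence. This improves continuity between the second video segment corresponding to the variable text to be processed and the first video segment corresponding to the template text in the sentence, and therefore improves continuity at the splicing position.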
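The extraction step in technical solution 1 can be sketched as slicing sentence-level parameters by the frame span of the variable text. The per-frame list layout, the span table, and the name `extract_target_params` are assumptions made for illustration only.

```python
def extract_target_params(sentence_params, spans, name):
    """Slice the per-frame parameters that belong to one text fragment."""
    start, end = spans[name]              # half-open frame range [start, end)
    return sentence_params[start:end]

# Hypothetical per-frame lip-opening values predicted for a whole sentence.
sentence_params = [0.0, 0.2, 0.6, 0.8, 0.5, 0.3, 0.1]
spans = {"template:regarding": (0, 2), "variable:disease": (2, 5)}
target = extract_target_params(sentence_params, spans, "variable:disease")
```

Because the values were predicted for the sentence as a whole, the first extracted frame already follows on from the template frames before it.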
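In practical applications, the speech parameters characterize the parameters used in speech synthesis. They may include linguistic features and/or acoustic features.
The linguistic features may include phoneme features. A phoneme is the smallest unit of speech divided according to the natural attributes of speech; analyzed by the articulatory actions within a syllable, one action constitutes one phoneme. Phonemes include vowels and consonants.
Acoustic features characterize speech from the perspective of sound production.
Acoustic features may include, but are not limited to, the following:
prosodic features (suprasegmental or paralinguistic features), including duration-related, fundamental-frequency-related, and energy-related features;
voice quality features; and
spectrum-based correlation features, which reflect the correlation between changes in vocal tract shape and articulatory movement; currently these mainly include linear prediction cepstral coefficients (LPCC) and Mel-frequency cepstral coefficients (MFCC).
It can be understood that the above speech parameters are only examples; the embodiments of this application do not limit the specific speech parameters.
In a specific implementation, speech synthesis may be performed on the variable text to be processed according to the target speech parameters, converting it into target speech.
The image parameters may be the parameters used in generating the image sequence. They may be used to determine the state features of the virtual object, or they may include those state features. For example, the image parameters may include lip features.
In a specific implementation, the virtual object image may be given the state features corresponding to the target image parameters to obtain a target image sequence. Fusing the target speech with the target image sequence yields the second video segment.
Technical solution 2
In technical solution 2, generating the second video segment corresponding to the variable text to be processed includes: smoothing the target image parameters corresponding to the variable text to be processed according to the preset image parameters of the preset variable text at the boundary position, to improve the continuity between the target image parameters and the image parameters of the template text at the boundary position; and generating the second video segment according to the smoothed target image parameters.
Technical solution 2 smooths the target image parameters of the variable text to be processed according to the preset image parameters of the preset variable text at the boundary position. Because those preset image parameters are continuous with the image parameters of the template text at the boundary position, the smoothing improves the continuity between the smoothed target image parameters and the image parameters of the template text at the boundary position. This improves continuity between the second video segment and the first video segment corresponding to the template text in the sentence, and therefore improves continuity at the splicing position.
In a specific implementation, a window function such as a Hanning window may be used to smooth the target image parameters of the variable text to be processed according to the preset image parameters. It can be understood that the embodiments of this application do not limit the specific smoothing process.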
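One possible reading of the Hanning-window smoothing is sketched below: the rising half of a Hann window is used as the weight of the target parameters, so that each edge of the target sequence starts near the preset value and settles into the target values. The text does not fix the exact formula, so this weighting, the scalar parameter model, and the names `hann_ramp` and `smooth_edges` are all assumptions.

```python
import math

def hann_ramp(n):
    """Rising half of a Hann window: n weights strictly between 0 and 1."""
    return [0.5 * (1 - math.cos(math.pi * (k + 1) / (n + 1))) for k in range(n)]

def smooth_edges(target, preset_first, preset_last, n):
    """Pull the first and last n target values toward the preset edge values."""
    out = list(target)
    n = min(n, len(out) // 2)
    for k, w in enumerate(hann_ramp(n)):
        out[k] = (1 - w) * preset_first + w * out[k]            # leading edge
        out[-1 - k] = (1 - w) * preset_last + w * out[-1 - k]   # trailing edge
    return out

# Hypothetical lip-opening values; preset edge values are 0.0 (closed lips).
smoothed = smooth_edges([1.0] * 6, 0.0, 0.0, 1)
```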
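As described above, when generating the image part of the preset video, the embodiments of this application may determine image parameters according to the preset complete text and the pause information. The preset image parameters of the preset variable text at the boundary position may be extracted from these image parameters and saved.
Taking text A as an example, suppose the start position of the preset variable text "<糖尿病>" corresponds to start position T1 in the preset video and its end position corresponds to end position T2 in the preset video. The image parameters between T1 and T2 may then be extracted as the preset image parameters of the preset variable text "<糖尿病>" at the boundary position.
Technical solution 3
In technical solution 3, the image sequence of the video includes a background image sequence and a moving image sequence, and generating the second video segment corresponding to the variable text to be processed includes: generating a target moving image sequence corresponding to the variable text to be processed; determining a target background image sequence corresponding to the variable text to be processed according to a preset background image sequence; and fusing the target moving image sequence with the target background image sequence to obtain the second video segment.
In practical applications, the image sequence of a video may be decomposed into two parts. The first part is the moving image sequence, which represents the parts of the virtual object that move during expression, usually preset parts such as the lips, eyes, and arms. The second part is the background image sequence, which represents the parts of the virtual object that remain relatively still during expression, usually everything other than the preset parts.
In a specific implementation, the background image sequence may be preset. For example, a preset background image sequence of a preset duration may be prepared and arranged cyclically (that is, made to recur) within the image sequence. The moving image sequence may be generated according to the target image parameters of the variable text to be processed.
In practical applications, the moving image sequence and the background image sequence may be fused to obtain the image sequence. For example, the moving image sequence may be pasted on top of the background image sequence.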
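The pasting of a moving image onto a background image can be sketched as follows. Frames are modeled as 2D lists of pixel values, and a fixed paste position is assumed for all frames; both are simplifications, and `paste` and `fuse` are hypothetical names.

```python
def paste(background, patch, top, left):
    """Return a copy of background with patch pasted at row top, column left."""
    frame = [row[:] for row in background]
    for r, patch_row in enumerate(patch):
        for c, value in enumerate(patch_row):
            frame[top + r][left + c] = value
    return frame

def fuse(backgrounds, patches, top, left):
    """Fuse two equally long sequences frame by frame."""
    if len(backgrounds) != len(patches):
        raise ValueError("the two sequences must contain the same number of images")
    return [paste(b, p, top, left) for b, p in zip(backgrounds, patches)]

blank = [[0, 0, 0], [0, 0, 0]]
fused = fuse([blank, blank], [[[1]], [[2]]], 1, 2)
```

The length check is the reason the target background image sequence must contain as many images as the target moving image sequence.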
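Technical solution 3 determines the target background image sequence of the variable text to be processed according to the preset background image sequence corresponding to the variable text. This improves the match between the target background image sequence and the preset background image sequence, and in turn the match and continuity between the target background image sequence and the background image sequence of the template text.
As described above, when generating the image part of the preset video, the embodiments of this application may record information about the preset background image sequence corresponding to the preset variable text. For example, this information may include the start frame identifier and end frame identifier of the preset background image sequence in the preset video, such as start frame number 100 and end frame number 125.
In one implementation, to improve the match between the target background image sequence and the preset background image sequence at the start or end position, the background images at the first and last positions of the target background image sequence match the background images at the first and last positions of the preset background image sequence.
The first position is the start position, and the last position is the end position. Specifically, the background image at the first position of the target background image sequence matches the background image at the first position of the preset background image sequence; or the background image at the last position of the target background image sequence matches the background image at the last position of the preset background image sequence.
Because the preset background image sequence matches and is continuous with the background image sequence of the template text at the boundary position, when the target background image sequence matches the preset background image sequence at the boundary position, the match and continuity between the target background image sequence and the background image sequence of the template text at the splicing position are also improved.
To make the target background image sequence match the preset background image sequence at the boundary position, the target background image sequence may be determined in one of the following ways:
Determination mode 1: when the number of images N1 in the preset background image sequence matches the number of images N2 in the target moving image sequence, determine the preset background image sequence as the target background image sequence; or
Determination mode 2: when N1 is greater than N2, discard first background images located at middle positions from the preset background image sequence; when at least two first background images are discarded, they are discontinuously distributed in the preset background image sequence; or
Determination mode 3: when N1 is less than N2, add second background images to the preset background image sequence.
For determination mode 1, when N1 equals N2, determining the preset background image sequence as the target background image sequence makes the two sequences match at the boundary position.
In practical applications, N2 may be determined from the speech duration information of the variable text to be processed. The speech duration information may be determined from the speech parameters of the variable text to be processed, or from the duration of its speech segment.
For determination mode 2, when N1 is greater than N2, discarding first background images at middle positions from the preset background image sequence makes the target background image sequence match the preset background image sequence at the boundary position.
A middle position is any position other than the first or last position. The discarded first background images are discontinuously distributed in the preset background image sequence, which to some extent avoids the poor continuity that discarding consecutive background images would cause.
In practical applications, the number of first background images may match the difference between N1 and N2. For example, if the preset background image sequence has start frame number 100 and end frame number 125, N1 is 26; assuming N2 is 24, two first background images at non-adjacent middle positions may be discarded from the preset background image sequence.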
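Determination mode 2 can be sketched as dropping evenly spaced interior frames so that no two dropped frames are adjacent and the first and last frames are kept. Even spacing is an assumption; the text only requires middle positions and non-adjacency. `drop_middle_frames` is a hypothetical name.

```python
def drop_middle_frames(frames, target_count):
    """Drop non-adjacent interior frames until target_count frames remain."""
    drop = len(frames) - target_count
    if drop < 0:
        raise ValueError("target_count exceeds the number of frames")
    if drop == 0:
        return list(frames)
    interior = len(frames) - 2             # first and last frames are kept
    step = interior / drop
    if step < 2:
        raise ValueError("too many frames to drop without adjacent gaps")
    dropped = {1 + int(j * step + step / 2) for j in range(drop)}
    return [f for i, f in enumerate(frames) if i not in dropped]

# Frame numbers 100..125 (N1 = 26) reduced to N2 = 24, as in the example.
kept = drop_middle_frames(list(range(100, 126)), 24)
```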
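For determination mode 3, when N1 is less than N2, adding second background images to the preset background image sequence makes the target background image sequence match the preset background image sequence at the boundary position.
In an optional embodiment of this application, the second background images may come from the preset background image sequence; in other words, the second background images to be added may be selected from the preset background image sequence.
In one implementation, the preset background image sequence in forward order is first taken as the first part of the target background image sequence; the preset background image sequence in reverse order is then taken as the second part; and the preset background image sequence in forward order is taken as the third part, where the end frame of the third part matches the end frame of the preset background image sequence.
For example, if the preset background image sequence has start frame number 100 and end frame number 125, N1 is 26; assuming N2 is 30, the frame numbers of the first part of the target background image sequence may be 100→125, those of the second part may be 125→124, and those of the third part may be 124→125.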
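The forward, reverse, forward construction can be sketched as follows and reproduces the frame numbers of the example. How to handle an odd number of extra frames is not stated in the text; holding the turnaround frame once more is an assumption, and `extend_ping_pong` is a hypothetical name.

```python
def extend_ping_pong(frames, target_count):
    """Extend frames to target_count by stepping back from the end and returning."""
    extra = target_count - len(frames)
    if extra < 0:
        raise ValueError("target_count is smaller than the number of frames")
    half, odd = divmod(extra, 2)
    if half > len(frames):
        raise ValueError("not enough frames to step back through")
    back = [frames[-1 - k] for k in range(half)]          # second part, reversed
    hold = [back[-1] if back else frames[-1]] * odd       # assumed odd-count fix
    return list(frames) + back + hold + back[::-1]        # third part ends on last frame

# N1 = 26 frames (100..125) extended to N2 = 30.
extended = extend_ping_pong(list(range(100, 126)), 30)
```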
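In another optional embodiment of this application, the second background images may come from a background image sequence other than the preset background image sequence; for example, they may be selected from the background image sequence that follows the preset background image sequence.
In one implementation, the preset background image sequence in forward order is first taken as the first part of the target background image sequence; the background image sequence that follows the preset background image sequence, in forward order, is then taken as the second part; and that following sequence together with the end frame of the preset background image sequence, in reverse order, is taken as the third part, where the end frame of the third part matches the end frame of the preset background image sequence.
For example, if the preset background image sequence has start frame number 100 and end frame number 125, N1 is 26; assuming N2 is 30, the frame numbers of the first part of the target background image sequence may be 100→125, those of the second part may be 126→127, and those of the third part may be 127→125.
It can be understood that the above ways of adding second background images to the preset background image sequence are only examples. Those skilled in the art may use other implementations according to actual application requirements, and any implementation that makes the target background image sequence match the preset background image sequence at the boundary position falls within the protection scope of the embodiments of this application.
For example, in another implementation, a reversed target background image sequence may be determined. The process may include: first taking the preset background image sequence in reverse order as the first part of the target background image sequence; then taking the preset background image sequence in forward order as the second part; and then taking the preset background image sequence in reverse order as the third part, where the start frame of the third part matches the start frame of the preset background image sequence.
For example, if the preset background image sequence has start frame number 100 and end frame number 125, N1 is 26; assuming N2 is 30, the frame numbers of the first part of the target background image sequence may be 125→100, those of the second part may be 100→101, and those of the third part may be 101→100. In this case, the frame numbers of the resulting target background image sequence may be 100→101→101→100→100→125.
The process of generating the second video segment corresponding to the variable text to be processed has been described in detail above through technical solutions 1 to 3. It can be understood that those skilled in the art may use any one or a combination of technical solutions 1 to 3 according to actual application requirements; the embodiments of this application do not limit the specific process of generating the second video segment.
In step 103, the first video segment and the second video segment are spliced to obtain the video corresponding to the first text.
In an optional embodiment of this application, the first video segment may include a first speech segment, and the second video segment may include a second speech segment.
Splicing the first video segment and the second video segment may then include: smoothing the speech sub-segments of the first speech segment and the second speech segment at the splicing position; and splicing the smoothed first speech segment with the smoothed second speech segment.
The embodiments of this application first smooth the speech sub-segments of the first speech segment and the second speech segment at the splicing position, and then splice the smoothed segments. The smoothing improves continuity between the first and second speech segments, and therefore improves continuity between the first video segment and the second video segment at the splicing position.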
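The smoothing of the two speech sub-segments before splicing can be sketched as a fade-out on the tail of the first segment and a fade-in on the head of the second. Linear gains are an assumption, since the text does not specify the smoothing function; `fade` and `splice` are hypothetical names, and samples are plain floats.

```python
def fade(samples, n, fade_in):
    """Scale n edge samples with linear gains that shrink toward the splice."""
    out = list(samples)
    n = min(n, len(out))
    for k in range(n):
        gain = (k + 1) / (n + 1)           # smallest gain right at the splice
        if fade_in:
            out[k] *= gain
        else:
            out[-1 - k] *= gain
    return out

def splice(first, second, n):
    """Smooth both speech sub-segments at the splice, then concatenate."""
    return fade(first, n, fade_in=False) + fade(second, n, fade_in=True)

joined = splice([1.0, 1.0, 1.0], [1.0, 1.0, 1.0], 1)
```

A pause already present at the end of the first segment makes this step gentler, since the samples being faded are close to silence.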
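In practical applications, the spliced video may be output, for example to the user. Taking a medical scenario as an example, the variable text to be processed may be determined from the disease name contained in the user input, the video may be obtained using the method embodiment shown in Fig. 1B, and the video may be provided to the user.
In summary, the video processing method of the embodiments of this application splices the first video segment corresponding to the template text with the second video segment corresponding to the variable text to be processed. The first video segment may be a video segment saved in advance, and the second video segment may be generated during video processing. Because the variable text to be processed is shorter than the complete text, the embodiments can shorten the length of the video that must be generated and the corresponding time cost, and can therefore improve video processing efficiency.
In addition, the first video segment in the embodiments of this application has a pause-processed video sub-segment at the boundary position between the template text and the variable text. The pause processing can to some extent overcome jumps or jitter at the splicing position and can therefore improve continuity at the splicing position.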
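Method Embodiment 2
Referring to Fig. 2, a flowchart of a video processing method according to an embodiment of this application is shown, which may include the following steps.
Step 201: generate a preset video according to the template text, the preset variable text, and the pause information at the boundary position, where the pause information indicates a speech pause of a predetermined duration;
Step 202: cut the first video segment corresponding to the template text out of the preset video, and save the first video segment;
Step 203: according to information about the preset video, save the preset image parameters of the preset variable text at the boundary position and the information about the preset background image sequence corresponding to the preset variable text;
Steps 201 to 203 are used to save in advance, based on the generated preset video, the first video segment, the preset image parameters of the preset variable text at the boundary position, and the information about the preset background image sequence corresponding to the preset variable text.
Steps 204 to 211 are used to generate the second video segment corresponding to the variable text to be processed according to the pre-saved information, and to splice the pre-saved first video segment with the second video segment.
Step 204: determine the speech parameters and image parameters for the sentence in which the variable text to be processed is located;
Step 205: extract, from the speech parameters and image parameters, the target speech parameters and target image parameters corresponding to the variable text to be processed;
Step 206: smooth the target image parameters corresponding to the variable text to be processed according to the preset image parameters;
Step 207: generate the target moving image sequence corresponding to the variable text to be processed according to the target speech parameters and the smoothed target image parameters;
Step 208: determine the target background image sequence corresponding to the variable text to be processed according to the preset background image sequence;
Step 209: fuse the target moving image sequence with the target background image sequence to obtain the second video segment corresponding to the variable text to be processed;
Step 210: smooth the speech sub-segments, at the boundary position, of the first speech segment in the first video segment and the second speech segment in the second video segment;
Step 211: splice the first video segment and the second video segment according to the smoothed first speech segment and the smoothed second speech segment.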
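In an application example of this application, suppose the preset complete text is text A described above, and the preset variable texts are "<糖尿病>", "<水果>", "<1800>", and so on in text A. A preset video may then be generated according to text A and the corresponding pause information, and the first video segments in the preset video, the preset image parameters of the preset variable text at the boundary position, and the information about the preset background image sequence corresponding to the preset variable text may be saved.
In practical applications, factors such as user input may change the variable text. For example, when text A becomes text B, "关于<冠心病>和<蔬菜>的问题,我还在研究。我想这份<冠心病>的饮食建议可能也对你有帮助,里面包含了约<900>种食材的推荐、禁忌,请你点击查看" ("Regarding the question of <coronary heart disease> and <vegetables>, I am still researching it. I think this dietary advice for <coronary heart disease> may also help you; it contains recommendations and contraindications for about <900> ingredients. Please click to view."), the variable texts to be processed may include "<冠心病>", "<蔬菜>", "<900>", and so on in text B.
The embodiments of this application may generate the second video segment corresponding to the variable text to be processed. For example, the acoustic parameters and lip features of the sentence containing the variable text to be processed may be determined first; the target acoustic parameters and target lip features of the variable text to be processed may then be extracted from them, and the speech segment and target image sequence of the variable text to be processed may be generated respectively. The target image sequence may include a target moving image sequence and a target background image sequence.
When generating the target moving image sequence, step 206 may be used to smooth the target lip features, to improve the continuity of the lip features at the splicing position.
Step 208 may be used to generate the target background image sequence so that it matches the preset background image sequence at the boundary position, to improve the continuity of the background image sequence at the splicing position.
Before the first video segment and the second video segment are spliced, the speech sub-segments at the boundary position of the first speech segment in the first video segment and the second speech segment in the second video segment may be smoothed first; the first video segment and the second video segment are then spliced according to the smoothed first and second speech segments.
In summary, the video processing method of the embodiments of this application adds a pause of a preset duration at the splicing position of the first video segment, which helps overcome jumps or jitter at the splicing position and can therefore improve continuity there.
In addition, the embodiments of this application determine the speech parameters and image parameters at the level of the sentence containing the variable text to be processed, and then extract the target speech parameters and target image parameters of the variable text to be processed from them. Because the parameters of a sentence have a certain continuity, the extracted target parameters are continuous with the parameters of the template text in the sentence. This improves continuity between the second video segment and the first video segment corresponding to the template text in the sentence, and further improves continuity at the splicing position.
Furthermore, the embodiments of this application smooth the target image parameters of the variable text to be processed according to the preset image parameters of the preset variable text at the boundary position. Because those preset image parameters are continuous with the image parameters of the template text at the boundary position, the smoothing improves the continuity between the smoothed target image parameters and the image parameters of the template text at the boundary position, and therefore continuity at the splicing position.
Moreover, the embodiments of this application generate the target background image sequence according to the preset background image sequence, so that the two match at the boundary position, which improves the continuity of the background image sequence at the splicing position.
Further, before splicing the first video segment and the second video segment, the embodiments of this application smooth the speech sub-segments at the boundary position of the first speech segment in the first video segment and the second speech segment in the second video segment. The smoothing improves continuity between the smoothed first and second speech segments, and therefore continuity between the first and second video segments at the splicing position.
It should be noted that, for simplicity, the method embodiments are described as a series of combined actions. Those skilled in the art should understand, however, that the embodiments of this application are not limited by the described order of actions, because according to the embodiments some steps may be performed in other orders or simultaneously. Those skilled in the art should also understand that the embodiments described in the specification are preferred embodiments, and the actions involved are not necessarily required by the embodiments of this application.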
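Apparatus Embodiment
Referring to Fig. 3, a structural block diagram of a video processing apparatus embodiment of this application is shown, which may include:
a providing module 301, configured to obtain a first video segment, where the first video segment corresponds to the template text in a first text of a video to be generated and includes a video sub-segment with a speech pause, the position of the video sub-segment corresponding to the boundary position between the template text and the variable text to be processed in the first text;
a generation module 302, configured to generate a second video segment corresponding to the variable text to be processed; and
a splicing module 303, configured to splice the first video segment and the second video segment to obtain the video corresponding to the first text.
Optionally, the apparatus may further include:
a preset video generation module, configured to generate a preset video according to the template text, a preset variable text, and the pause information at the boundary position, where the pause information indicates a speech pause of a predetermined duration; and
a cutting module, configured to cut the first video segment corresponding to the template text out of the preset video.
Optionally, the generation module 302 may include:
a parameter determination module, configured to determine speech parameters and image parameters for the sentence of the first text in which the variable text to be processed is located, where the image parameters characterize the state features of the virtual object that will appear in the video corresponding to the first text, and the speech parameters characterize the parameters used in speech synthesis;
a parameter extraction module, configured to extract, from the speech parameters and image parameters, the target speech parameters and target image parameters corresponding to the variable text to be processed; and
a first segment generation module, configured to generate the second video segment corresponding to the variable text to be processed according to the target speech parameters and target image parameters.
Optionally, the generation module 302 may include:
a first smoothing module, configured to smooth the target image parameters corresponding to the variable text to be processed according to the preset image parameters of the variable text to be processed at the boundary position, to improve the continuity between the target image parameters and the image parameters of the template text at the boundary position; and
a second segment generation module, configured to generate the second video segment corresponding to the variable text to be processed according to the smoothed target image parameters.
Optionally, the first video segment may include a first speech segment, and the second video segment may include a second speech segment;
the splicing module 303 may include:
a second smoothing module, configured to smooth the speech sub-segments of the first speech segment and the second speech segment at the splicing position; and
a post-smoothing splicing module, configured to splice the smoothed first speech segment and the smoothed second speech segment.
Optionally, the image sequence of the video may include a background image sequence and a moving image sequence;
the generation module 302 may include:
a moving image sequence generation module, configured to generate the target moving image sequence corresponding to the variable text to be processed;
a background image sequence generation module, configured to determine the target background image sequence corresponding to the variable text to be processed according to the preset background image sequence; and
a fusion module, configured to fuse the target moving image sequence with the target background image sequence to obtain the second video segment corresponding to the variable text to be processed.
Optionally, the background images at the first and last positions of the target background image sequence match the background images at the first and last positions of the preset background image sequence.
Optionally, the background image sequence generation module may include:
a first background image sequence generation module, configured to determine the preset background image sequence as the target background image sequence when the number of images in the preset background image sequence matches the number of images in the target moving image sequence; or
a second background image sequence generation module, configured to discard first background images located at middle positions from the preset background image sequence when the number of images in the preset background image sequence is greater than the number of images in the target moving image sequence, where, when at least two first background images are discarded, they are discontinuously distributed in the preset background image sequence; or
a third background image sequence generation module, configured to add second background images to the preset background image sequence when the number of images in the preset background image sequence is less than the number of images in the target moving image sequence.
Because the apparatus embodiment is basically similar to the method embodiment, its description is relatively simple; for related parts, refer to the description of the method embodiment.
The embodiments in this specification are described progressively. Each embodiment focuses on its differences from the others, and the same or similar parts of the embodiments may be cross-referenced.
For the apparatus in the above embodiment, the specific way in which each module performs its operations has been described in detail in the method embodiments and is not elaborated here.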
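Fig. 4 is a structural block diagram of an apparatus 900 for video processing according to an exemplary embodiment. For example, the apparatus 900 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device, a personal digital assistant, and the like.
Referring to Fig. 4, the apparatus 900 may include one or more of the following components: a processing component 902, a memory 904, a power supply component 906, a multimedia component 908, an audio component 910, an input/output (I/O) interface 912, a sensor component 914, and a communication component 916.
The processing component 902 generally controls the overall operation of the apparatus 900, such as operations associated with display, phone calls, data communication, camera operation, and recording. The processing component 902 may include one or more processors 920 to execute instructions to complete all or part of the steps of the above method. In addition, the processing component 902 may include one or more modules that facilitate interaction between the processing component 902 and other components. For example, the processing component 902 may include a multimedia module to facilitate interaction between the multimedia component 908 and the processing component 902.
The memory 904 is configured to store various types of data to support operation of the apparatus 900. Examples of such data include instructions for any application or method operating on the apparatus 900, contact data, phonebook data, messages, pictures, videos, and so on. The memory 904 may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk, or an optical disc.
The power supply component 906 provides power to the components of the apparatus 900. It may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the apparatus 900.
The multimedia component 908 includes a screen that provides an output interface between the apparatus 900 and the user. In some embodiments, the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensors may sense not only the boundary of a touch or swipe action but also the duration and pressure associated with it. In some embodiments, the multimedia component 908 includes a front camera and/or a rear camera. When the apparatus 900 is in an operation mode, such as a shooting mode or a video mode, the front camera and/or the rear camera can receive external multimedia data. Each front camera and rear camera may be a fixed optical lens system or have focal length and optical zoom capability.
The audio component 910 is configured to output and/or input audio signals. For example, the audio component 910 includes a microphone (MIC) configured to receive external audio signals when the apparatus 900 is in an operation mode, such as a call mode, a recording mode, or a speech recognition mode. The received audio signals may be further stored in the memory 904 or sent via the communication component 916. In some embodiments, the audio component 910 also includes a speaker for outputting audio signals.
The I/O interface 912 provides an interface between the processing component 902 and peripheral interface modules, which may be a keyboard, a click wheel, buttons, and the like. The buttons may include, but are not limited to, a home button, volume buttons, a start button, and a lock button.
The sensor component 914 includes one or more sensors for providing status assessments of various aspects of the apparatus 900. For example, the sensor component 914 can detect the open/closed state of the apparatus 900 and the relative positioning of components, such as the display and keypad of the apparatus 900. The sensor component 914 can also detect a change in position of the apparatus 900 or one of its components, the presence or absence of user contact with the apparatus 900, the orientation or acceleration/deceleration of the apparatus 900, and temperature changes of the apparatus 900. The sensor component 914 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor component 914 may also include an optical sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor component 914 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 916 is configured to facilitate wired or wireless communication between the apparatus 900 and other devices. The apparatus 900 can access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof. In an exemplary embodiment, the communication component 916 receives broadcast signals or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 916 also includes a near field communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on radio frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the apparatus 900 may be implemented by one or more application specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components, for performing the above method.
In an exemplary embodiment, a non-transitory computer-readable storage medium including instructions is also provided, such as the memory 904 including instructions, which can be executed by the processor 920 of the apparatus 900 to complete the above method. For example, the non-transitory computer-readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
Fig. 5 is a structural block diagram of a server in some embodiments of this application. The server 1900 may vary considerably depending on configuration or performance, and may include one or more central processing units (CPUs) 1922 (for example, one or more processors), memory 1932, and one or more storage media 1930 (for example, one or more mass storage devices) storing application programs 1942 or data 1944. The memory 1932 and the storage medium 1930 may be transient or persistent storage. The program stored in the storage medium 1930 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations on the server. Further, the central processing unit 1922 may be configured to communicate with the storage medium 1930 and execute, on the server 1900, the series of instruction operations in the storage medium 1930.
The server 1900 may also include one or more power supplies 1926, one or more wired or wireless network interfaces 1950, one or more input/output interfaces 1958, one or more keyboards 1956, and/or one or more operating systems 1941, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and so on.
A non-transitory computer-readable storage medium is provided; when the instructions in the storage medium are executed by the processor of an apparatus (a device or a server), the apparatus can perform the video processing method according to the embodiments of this application.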
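Other embodiments of this application will readily occur to those skilled in the art after considering the specification and practicing the invention disclosed here. This application is intended to cover any variations, uses, or adaptations of this application that follow its general principles and include common knowledge or customary technical means in the art not disclosed herein. The specification and embodiments are to be regarded as exemplary only, with the true scope and spirit of this application indicated by the following claims.
It should be understood that this application is not limited to the precise structures described above and shown in the drawings, and that various modifications and changes may be made without departing from its scope. The scope of this application is limited only by the appended claims.
The above are only preferred embodiments of this application and are not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of this application shall fall within its protection scope.
The video processing method, the video processing apparatus, and the apparatus for video processing provided in the embodiments of this application have been described in detail above. Specific examples are used herein to explain the principles and implementations of this application, and the description of the above embodiments is only intended to help understand the method of this application and its core ideas. Meanwhile, those of ordinary skill in the art may, based on the ideas of this application, make changes to the specific implementations and the scope of application. In summary, the content of this specification should not be construed as limiting this application.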

Claims (18)

  1. 一种视频处理方法,在电子设备中执行,所述方法包括:
    获取第一视频片段,所述第一视频片段与待生成视频的第一文本中模板文本对应,并且所述第一视频片段包括语音停顿的视频子片段,所述视频子片段的位置对应于所述模板文本与所述第一文本中待处理变量文本之间的分界位置;
    生成所述待处理变量文本对应的第二视频片段;
    对所述第一视频片段和所述第二视频片段进行拼接,以得到所述第一文本对应的视频。
  2. 根据权利要求1所述的方法,其中,所述方法还包括:
    根据模板文本、预设变量文本、以及所述分界位置处对应的停顿信息,生成预设视频,所述停顿信息表示预定时长的语音停顿;
    从所述预设视频中截取所述模板文本对应的第一视频片段。
  3. 根据权利要求1所述的方法,其特征在于,所述视频子片段的图像中虚拟对象处于不说话的状态。
  4. 根据权利要求1-3中任一项所述的方法,其特征在于,所述视频子片段为经过停顿处理后得到的子片段;
    其中,对所述视频子片段的停顿处理,包括:
    所述第一视频片段中与所述分界位置对应的拼接位置处的语音信号子片段与静音信号进行加权处理,以得到语音停顿的语音信号子片段;
    所述第一视频片段在所述拼接位置处的图像子序列与目标状态特征的图像序列进行加权处理,以得到虚拟对象处于不说话的状态的所述图像子序列,其中,所述目标状态特征为表示虚拟对象处于不说话状态的特征。
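  5. The method according to any one of claims 1 to 3, wherein the generating a second video segment corresponding to the to-be-processed variable text comprises:
    determining, for the sentence in which the to-be-processed variable text is located in the first text, corresponding speech parameters and image parameters, wherein the image parameters represent state features of a virtual object that is to appear in the video corresponding to the first text, and the speech parameters represent parameters for speech synthesis;
    extracting, from the speech parameters and the image parameters, target speech parameters and target image parameters corresponding to the to-be-processed variable text; and
    generating, according to the target speech parameters and the target image parameters, the second video segment corresponding to the to-be-processed variable text.
  6. The method according to any one of claims 1 to 3, wherein the generating a second video segment corresponding to the to-be-processed variable text comprises:
    smoothing, according to preset image parameters of the to-be-processed variable text at a boundary position, target image parameters corresponding to the to-be-processed variable text, to improve continuity between the target image parameters and image parameters of the template text at the boundary position; and
    generating, according to the smoothed target image parameters, the second video segment corresponding to the to-be-processed variable text.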
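  7. The method according to any one of claims 1 to 3, wherein the first video segment comprises a first speech segment, and the second video segment comprises a second speech segment;
    the splicing the first video segment and the second video segment comprises:
    smoothing the respective speech sub-segments of the first speech segment and the second speech segment at a splicing position; and
    splicing the smoothed first speech segment and the smoothed second speech segment.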
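  8. The method according to any one of claims 1 to 3, wherein an image sequence corresponding to the video comprises a background image sequence and a moving image sequence;
    the generating a second video segment corresponding to the to-be-processed variable text comprises:
    generating a target moving image sequence corresponding to the to-be-processed variable text;
    determining, according to a preset background image sequence, a target background image sequence corresponding to the to-be-processed variable text; and
    fusing the target moving image sequence and the target background image sequence to obtain the second video segment corresponding to the to-be-processed variable text.
  9. The method according to claim 8, wherein background images at head and tail positions of the target background image sequence match background images at head and tail positions of the preset background image sequence.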
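  10. The method according to claim 8, wherein the determining, according to a preset background image sequence, a target background image sequence corresponding to the to-be-processed variable text comprises:
    when the number of images in the preset background image sequence matches the number of images in the target moving image sequence, determining the preset background image sequence as the target background image sequence; or
    when the number of images in the preset background image sequence is greater than the number of images in the target moving image sequence, discarding, from the preset background image sequence, first background images located at intermediate positions, wherein when at least two frames of first background images are discarded, the at least two frames of first background images are distributed non-consecutively in the preset background image sequence; or
    when the number of images in the preset background image sequence is less than the number of images in the target moving image sequence, adding second background images to the preset background image sequence.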
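  11. A video processing apparatus, comprising:
    a providing module, configured to obtain a first video segment, the first video segment corresponding to template text in first text of a video to be generated, and the first video segment comprising a video sub-segment with a speech pause, a position of the video sub-segment corresponding to a boundary position between the template text and to-be-processed variable text in the first text;
    a generation module, configured to generate a second video segment corresponding to the to-be-processed variable text; and
    a splicing module, configured to splice the first video segment and the second video segment to obtain a video corresponding to the first text.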
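  12. The apparatus according to claim 9, further comprising:
    a preset video generation module, configured to generate a preset video according to the template text, preset variable text, and pause information corresponding to the boundary position, the pause information representing a speech pause of a predetermined duration; and
    an extraction module, configured to extract, from the preset video, the first video segment corresponding to the template text.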
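  13. The apparatus according to claim 9 or 10, wherein the generation module comprises:
    a parameter determining module, configured to determine, for the sentence in which the to-be-processed variable text is located in the first text, corresponding speech parameters and image parameters, wherein the image parameters represent state features of a virtual object that is to appear in the video corresponding to the first text, and the speech parameters represent parameters for speech synthesis;
    a parameter extraction module, configured to extract, from the speech parameters and the image parameters, target speech parameters and target image parameters corresponding to the to-be-processed variable text; and
    a first segment generation module, configured to generate, according to the target speech parameters and the target image parameters, the second video segment corresponding to the to-be-processed variable text.
  14. The apparatus according to claim 9 or 10, wherein the generation module comprises:
    a first smoothing module, configured to smooth, according to preset image parameters of the to-be-processed variable text at a boundary position, target image parameters corresponding to the to-be-processed variable text, to improve continuity between the target image parameters and image parameters of the template text at the boundary position; and
    a second segment generation module, configured to generate, according to the smoothed target image parameters, the second video segment corresponding to the to-be-processed variable text.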
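  15. The apparatus according to claim 9 or 10, wherein the first video segment comprises a first speech segment, and the second video segment comprises a second speech segment;
    the splicing module comprises:
    a second smoothing module, configured to smooth the respective speech sub-segments of the first speech segment and the second speech segment at a splicing position; and
    a post-smoothing splicing module, configured to splice the smoothed first speech segment and the smoothed second speech segment.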
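  16. An apparatus for video processing, comprising a memory and one or more programs, wherein the one or more programs are stored in the memory and, when executed by one or more processors, implement the steps of the method according to any one of claims 1 to 10.
  17. A machine-readable medium storing instructions that, when executed by one or more processors, cause an apparatus to perform the video processing method according to one or more of claims 1 to 10.
  18. A computer program product, comprising computer instructions stored in a computer-readable storage medium, wherein when a processor executes the computer instructions, the processor is caused to perform the method according to any one of claims 1 to 10.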
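
For illustration only, the pause processing of claim 4 (weighting a speech sub-segment against a silence signal) and the background-sequence matching of claims 9 and 10 might be sketched as below. This is a sketch under stated assumptions, not the applicant's implementation: the function names `fade_to_silence` and `fit_background`, the linear weighting ramp, the centered every-other-frame drop pattern, and padding by repeating the middle frame are all hypothetical choices that the claims do not specify.

```python
from typing import List, Sequence, TypeVar

Frame = TypeVar("Frame")


def fade_to_silence(samples: Sequence[float]) -> List[float]:
    """Weight a speech sub-segment against an all-zero (silent) signal.

    Assumption: the weight on the speech falls linearly from 1 to 0,
    so the sub-segment ends in silence at the splicing position.
    """
    n = len(samples)
    if n == 1:
        return [0.0]
    out = []
    for i, s in enumerate(samples):
        w = 1.0 - i / (n - 1)                # weight on the speech sample
        out.append(w * s + (1.0 - w) * 0.0)  # remaining weight on silence
    return out


def fit_background(preset: Sequence[Frame], target_len: int) -> List[Frame]:
    """Match a preset background sequence to a target frame count.

    The first and last frames are always kept, in the spirit of claim 9.
    """
    frames = list(preset)
    n = len(frames)
    if n == 0 or target_len <= 0:
        raise ValueError("need a non-empty preset and a positive target length")
    if n == target_len:
        return frames                        # counts already match
    if n > target_len:
        k = n - target_len                   # number of frames to discard
        span = 2 * k - 1                     # width of an every-other-frame run
        if span > n - 2:
            raise ValueError("cannot drop that many non-adjacent middle frames")
        start = 1 + (n - 2 - span) // 2      # center the run in the interior
        dropped = {start + 2 * j for j in range(k)}  # never adjacent indices
        return [f for i, f in enumerate(frames) if i not in dropped]
    # Fewer preset frames than needed: repeat the middle frame (assumption).
    mid = n // 2
    extra = target_len - n
    return frames[:mid] + [frames[mid]] * (extra + 1) + frames[mid + 1:]
```

For example, under these assumptions a six-frame preset fitted to four frames discards the second and fourth frames, which are non-adjacent, and keeps both end frames.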
PCT/CN2022/115722 2021-09-24 2022-08-30 Video processing method, apparatus, medium, and program product Ceased WO2023045716A1 (zh)

Priority Applications (3)

Application Number Priority Date Filing Date Title
JP2023554305A JP7697027B2 (ja) 2021-09-24 2022-08-30 ビデオ処理方法、装置、媒体、及びコンピュータプログラム
EP22871767.4A EP4404574A4 (en) 2021-09-24 2022-08-30 VIDEO PROCESSING METHOD AND APPARATUS AS WELL AS MEDIUM AND PROGRAM PRODUCT
US18/365,296 US20240022772A1 (en) 2021-09-24 2023-08-04 Video processing method and apparatus, medium, and program product

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111124169.4 2021-09-24
CN202111124169.4A CN113891150B (zh) 2021-09-24 2021-09-24 一种视频处理方法、装置和介质

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US18/365,296 Continuation US20240022772A1 (en) 2021-09-24 2023-08-04 Video processing method and apparatus, medium, and program product

Publications (1)

Publication Number Publication Date
WO2023045716A1 true WO2023045716A1 (zh) 2023-03-30

Family

ID=79006490

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/115722 Ceased WO2023045716A1 (zh) 2021-09-24 2022-08-30 Video processing method, apparatus, medium, and program product

Country Status (5)

Country Link
US (1) US20240022772A1 (zh)
EP (1) EP4404574A4 (zh)
JP (1) JP7697027B2 (zh)
CN (1) CN113891150B (zh)
WO (1) WO2023045716A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2025130608A1 (zh) * 2023-12-20 2025-06-26 华为技术有限公司 一种数字人生成方法及相关装置

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113891150B (zh) * 2021-09-24 2024-10-11 北京搜狗科技发展有限公司 一种视频处理方法、装置和介质
CN114707019A (zh) * 2022-03-29 2022-07-05 北京拥抱在线科技有限公司 用于阅读的信息处理方法及装置
CN116510308A (zh) * 2023-04-27 2023-08-01 北京字跳网络技术有限公司 一种内容生成方法、装置、计算机设备及存储介质
WO2025258824A1 (ko) * 2024-06-13 2025-12-18 삼성전자주식회사 비디오 프레임의 추가 영역을 생성하기 위한 전자 장치, 방법, 및 비일시적 컴퓨터 판독 가능 저장 매체

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109637518A (zh) * 2018-11-07 2019-04-16 北京搜狗科技发展有限公司 虚拟主播实现方法及装置
CN110381266A (zh) * 2019-07-31 2019-10-25 百度在线网络技术(北京)有限公司 一种视频生成方法、装置以及终端
CN110611840A (zh) * 2019-09-03 2019-12-24 北京奇艺世纪科技有限公司 一种视频生成方法、装置、电子设备及存储介质
US20200098396A1 (en) * 2018-09-20 2020-03-26 Autochartis Limited Automated video generation from financial market analysis
CN111885416A (zh) * 2020-07-17 2020-11-03 北京来也网络科技有限公司 一种音视频的修正方法、装置、介质及计算设备
CN112995706A (zh) * 2019-12-19 2021-06-18 腾讯科技(深圳)有限公司 基于人工智能的直播方法、装置、设备及存储介质
US20210201912A1 (en) * 2020-09-14 2021-07-01 Beijing Baidu Netcom Science And Technology Co., Ltd. Virtual Object Image Display Method and Apparatus, Electronic Device and Storage Medium
CN113891150A (zh) * 2021-09-24 2022-01-04 北京搜狗科技发展有限公司 一种视频处理方法、装置和介质

Family Cites Families (33)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH11231899A (ja) * 1998-02-12 1999-08-27 Matsushita Electric Ind Co Ltd 音声・動画像合成装置及び音声・動画像データベース
US7277855B1 (en) * 2000-06-30 2007-10-02 At&T Corp. Personalized text-to-speech services
WO2010045736A1 (en) * 2008-10-22 2010-04-29 Xtranormal Technology Inc. Reduced-latency rendering for a text-to-movie system
CN102123252B (zh) * 2010-01-07 2016-05-04 新奥特(北京)视频技术有限公司 一种图文包装应用中随动关联播出的实现方法和装置
US8818175B2 (en) * 2010-03-08 2014-08-26 Vumanity Media, Inc. Generation of composited video programming
WO2012154618A2 (en) * 2011-05-06 2012-11-15 Seyyer, Inc. Video generation based on text
US20120308211A1 (en) * 2011-06-01 2012-12-06 Xerox Corporation Asynchronous personalization of records using dynamic scripting
GB2503878A (en) * 2012-07-09 2014-01-15 Nds Ltd Generating interstitial scripts for video content, based on metadata related to the video content
US9620124B2 (en) * 2014-02-28 2017-04-11 Comcast Cable Communications, Llc Voice enabled screen reader
WO2017137947A1 (en) * 2016-02-10 2017-08-17 Vats Nitin Producing realistic talking face with expression using images text and voice
US10204274B2 (en) * 2016-06-29 2019-02-12 Cellular South, Inc. Video to data
US10056083B2 (en) * 2016-10-18 2018-08-21 Yen4Ken, Inc. Method and system for processing multimedia content to dynamically generate text transcript
US10546409B1 (en) * 2018-08-07 2020-01-28 Adobe Inc. Animation production system
CN109635154B (zh) * 2018-12-14 2022-11-29 成都索贝数码科技股份有限公司 一种基于文稿和新闻节目自动生成互联网图文稿件的方法
US11024071B2 (en) * 2019-01-02 2021-06-01 Espiritu Technologies, Llc Method of converting phoneme transcription data into lip sync animation data for 3D animation software
CN113383384A (zh) * 2019-01-25 2021-09-10 索美智能有限公司 语音动画的实时生成
EP3921770B1 (en) * 2019-02-05 2025-07-16 Igentify Ltd. System and methodology for modulation of dynamic gaps in speech
CN109979457A (zh) * 2019-05-29 2019-07-05 南京硅基智能科技有限公司 一种应用于智能对话机器人的千人千面的方法
CN110324709A (zh) * 2019-07-24 2019-10-11 新华智云科技有限公司 一种视频生成的处理方法、装置、终端设备及存储介质
CN111508466A (zh) * 2019-09-12 2020-08-07 马上消费金融股份有限公司 一种文本处理方法、装置、设备及计算机可读存储介质
CN110534088A (zh) * 2019-09-25 2019-12-03 招商局金融科技有限公司 语音合成方法、电子装置及存储介质
US11638049B2 (en) * 2019-10-16 2023-04-25 Dish Network L.L.C. Systems and methods for content item recognition and adaptive packet transmission
CN110866968B (zh) * 2019-10-18 2025-02-28 平安科技(深圳)有限公司 基于神经网络生成虚拟人物视频的方法及相关设备
KR102267673B1 (ko) * 2019-12-23 2021-06-23 극동대학교 산학협력단 사용자 체험형 동영상 컨텐츠 자동제작방법 및 시스템
CN111460785B (zh) * 2020-03-31 2023-02-28 北京市商汤科技开发有限公司 交互对象的驱动方法、装置、设备以及存储介质
US11012737B1 (en) * 2020-04-27 2021-05-18 Dish Network L.L.C. Systems and methods for audio adaptation of content items to endpoint media devices
CN111652678B (zh) * 2020-05-27 2023-11-14 腾讯科技(深圳)有限公司 物品信息显示方法、装置、终端、服务器及可读存储介质
CN111883103B (zh) * 2020-06-19 2021-12-24 马上消费金融股份有限公司 语音合成的方法及装置
CN111741326B (zh) * 2020-06-30 2023-08-18 腾讯科技(深圳)有限公司 视频合成方法、装置、设备及存储介质
CN112543342B (zh) * 2020-11-26 2023-03-14 腾讯科技(深圳)有限公司 虚拟视频直播处理方法及装置、存储介质、电子设备
CN113051420B (zh) * 2021-04-15 2022-07-05 山东大学 一种基于文本生成视频机器人视觉人机交互方法及系统
CN113111812A (zh) * 2021-04-20 2021-07-13 深圳追一科技有限公司 一种嘴部动作驱动模型训练方法及组件
CN113421549A (zh) * 2021-06-30 2021-09-21 平安科技(深圳)有限公司 语音合成方法、装置、计算机设备及存储介质

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP4404574A4 *

Also Published As

Publication number Publication date
JP2024509873A (ja) 2024-03-05
CN113891150A (zh) 2022-01-04
JP7697027B2 (ja) 2025-06-23
US20240022772A1 (en) 2024-01-18
EP4404574A1 (en) 2024-07-24
CN113891150B (zh) 2024-10-11
EP4404574A4 (en) 2025-01-01

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22871767

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 2023554305

Country of ref document: JP

WWE Wipo information: entry into national phase

Ref document number: 202347080857

Country of ref document: IN

WWE Wipo information: entry into national phase

Ref document number: 2022871767

Country of ref document: EP

NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 2022871767

Country of ref document: EP

Effective date: 20240419