WO2023045716A1 - Video processing method, device, medium and program product
- Publication number
- WO2023045716A1 (PCT/CN2022/115722)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- video
- text
- segment
- preset
- image sequence
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/439—Processing of audio elementary streams
- H04N21/4394—Processing of audio elementary streams involving operations for analysing the audio stream, e.g. detecting features or characteristics in audio streams
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/20—Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
- H04N21/23—Processing of content or additional data; Elementary server operations; Server middleware
- H04N21/234—Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
- H04N21/23424—Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs involving splicing one content stream with another content stream, e.g. for inserting or substituting an advertisement
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T13/00—Animation
- G06T13/20—Three-dimensional [3D] animation
- G06T13/40—Three-dimensional [3D] animation of characters, e.g. humans, animals or virtual beings
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T5/00—Image enhancement or restoration
- G06T5/70—Denoising; Smoothing
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/10—Segmentation; Edge detection
- G06T7/194—Segmentation; Edge detection involving foreground-background segmentation
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/44—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/20—Scenes; Scene-specific elements in augmented reality scenes
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/161—Detection; Localisation; Normalisation
- G06V40/165—Detection; Localisation; Normalisation using facial parts and geometric relationships
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/174—Facial expression recognition
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/57—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for processing of video signals
-
- G—PHYSICS
- G11—INFORMATION STORAGE
- G11B—INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
- G11B27/00—Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
- G11B27/02—Editing, e.g. varying the order of information signals recorded on, or reproduced from, record carriers
- G11B27/031—Electronic editing of digitised analogue information signals, e.g. audio or video signals
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/439—Processing of audio elementary streams
- H04N21/4398—Processing of audio elementary streams involving reformatting operations of audio signals
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/44—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
- H04N21/44016—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving splicing one content stream with another content stream, e.g. for substituting a video clip
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/80—Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
- H04N21/83—Generation or processing of protective or descriptive data associated with content; Content structuring
- H04N21/845—Structuring of content, e.g. decomposing content into time segments
- H04N21/8456—Structuring of content, e.g. decomposing content into time segments by decomposing the content in the time domain, e.g. in time segments
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N5/00—Details of television systems
- H04N5/222—Studio circuitry; Studio devices; Studio equipment
- H04N5/262—Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects ; Cameras specially adapted for the electronic generation of special effects
- H04N5/272—Means for inserting a foreground image in a background image, i.e. inlay, outlay
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
Definitions
- the present application relates to the technical field of communications, and in particular to a video processing method, device, medium and program product.
- virtual objects can be widely used in application scenarios such as broadcasting scenarios, teaching scenarios, medical scenarios, and customer service scenarios.
- virtual objects usually need to express text, and correspondingly, a video corresponding to the virtual object can be generated and played.
- the video can characterize the process of virtual objects expressing text.
- the video generation process generally includes: a speech generation link and an image sequence generation link.
- speech generation usually adopts speech synthesis technology.
- the link of image sequence generation usually adopts image processing technology.
- the present application discloses a video processing method executed in an electronic device, the method comprising:
- acquiring a first video segment, where the first video segment corresponds to the template text in the first text of the video to be generated, the first video segment includes a video sub-segment with a speech pause, and the position of the video sub-segment corresponds to the boundary position between the template text and the variable text to be processed in the first text;
- the present application discloses a video processing device, comprising:
- a module for obtaining a first video segment, where the first video segment corresponds to the template text in the first text of the video to be generated, the first video segment includes a video sub-segment with a speech pause, and the position of the video sub-segment corresponds to the boundary position between the template text and the variable text to be processed in the first text;
- a generating module configured to generate a second video clip corresponding to the variable text to be processed; and
- a splicing module configured to splice the first video segment and the second video segment to obtain a video corresponding to the first text.
- the present application discloses a device for video processing, including a memory and one or more programs, wherein the one or more programs are stored in the memory and, when executed by one or more processors, implement the steps of the aforementioned method.
- the embodiment of the present application discloses one or more machine-readable media on which instructions are stored; when executed by one or more processors, the instructions cause a device to execute one or more of the aforementioned methods.
- the embodiment of the present application discloses a computer program product that includes computer instructions stored in a computer-readable storage medium; when a processor executes the computer instructions, the processor is caused to perform the video processing method of the embodiments of the present application.
- FIG. 1A shows a schematic diagram of an application scenario according to an embodiment of the present application.
- FIG. 1B is a flowchart of a video processing method according to an embodiment of the present application.
- FIG. 2 is a flowchart of a video processing method according to an embodiment of the present application.
- FIG. 3 is a structural block diagram of a video processing device according to an embodiment of the present application.
- FIG. 4 is a structural block diagram of a device for video processing according to an embodiment of the present application.
- FIG. 5 is a structural block diagram of a server in some embodiments of the present application.
- the virtual object is a vivid and natural virtual object close to the real object obtained through object modeling, motion capture and other technologies.
- the virtual object can possess the ability of cognition, understanding, or expression.
- the virtual object specifically includes: a virtual character, or a virtual animal, or a two-dimensional cartoon object, or a three-dimensional cartoon object.
- virtual objects can replace, for example, media workers for news broadcasting or game commentary.
- in a medical scene, virtual objects can replace, for example, medical workers for medical guidance.
- virtual objects can express text.
- a video corresponding to text and virtual objects can be generated.
- the video may specifically include: a voice sequence corresponding to the text, and an image frame sequence corresponding to the voice sequence.
- the text of the video to be generated specifically includes: template text and variable text.
- the template text is relatively fixed, and the variable text usually changes according to preset factors such as user input.
- variable text can be determined from user input. Taking the medical scene as an example, the corresponding variable text can be determined according to the name of the disease contained in the user input.
- the fields corresponding to the variable text specifically include: a disease name field, a food type field, a food quantity field, etc. These fields can be determined according to the disease name included in the user input.
- variable text in the text may be determined according to actual application requirements, and the embodiment of the present application does not limit the specific manner of determining the variable text.
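As an illustration of the template/variable structure described above, the following minimal sketch splits a text into template parts and variable fields and then fills the fields from user input. The angle-bracket placeholder syntax, the field names `disease` and `food`, and the helper names `split_text` and `fill_text` are assumptions made for this example; the application does not prescribe any notation.

```python
import re

# Hypothetical placeholder syntax: a variable field is written as <name>.
FIELD = re.compile(r"(<[^<>]+>)")


def split_text(pattern):
    """Split a text pattern into ("template", text) and ("variable", field) parts."""
    parts = []
    for piece in FIELD.split(pattern):
        if not piece:
            continue  # re.split yields empty strings next to adjacent fields
        kind = "variable" if FIELD.fullmatch(piece) else "template"
        parts.append((kind, piece))
    return parts


def fill_text(pattern, values):
    """Replace every <name> field with values[name], e.g. taken from user input."""
    return re.sub(r"<([^<>]+)>", lambda m: values[m.group(1)], pattern)


pattern = "About <disease>, eat less <food>."
parts = split_text(pattern)
first_text = fill_text(pattern, {"disease": "diabetes", "food": "sugar"})
```

Each change from a "template" part to a "variable" part in `parts` marks one boundary position in the sense used in this application.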
- in order to make the video quality meet the requirements, the related technology usually generates a complete video for the complete changed text whenever the variable text changes. However, generating a complete video for the complete changed text usually takes a lot of time, resulting in low video processing efficiency.
- an embodiment of the present application provides a video processing solution, which specifically includes: acquiring a first video segment, where the first video segment corresponds to the template text in the first text of the video to be generated, the first video segment includes a video sub-segment with a speech pause, the position of the video sub-segment corresponds to the boundary position between the template text and the variable text to be processed in the first text, and the first text includes the template text and the variable text to be processed; generating a second video segment corresponding to the variable text to be processed; and splicing the first video segment and the second video segment to obtain a video corresponding to the first text.
- the first video segment corresponding to the template text is spliced with the second video segment corresponding to the variable text to be processed.
- the first video clip may be a pre-saved video clip.
- a second video clip corresponding to the variable text to be processed may be generated during video processing. Since the variable text to be processed is shorter than the complete text, the embodiment of the present application can shorten the length of the video that has to be generated and the corresponding time cost, thus improving the video processing efficiency.
- the first video segment in the embodiment of the present application includes a video sub-segment in which speech is paused.
- the voice pause refers to the cessation of voice, for example, the virtual object does not speak.
- the position of the video sub-segment corresponds to the boundary position between the template text and the variable text to be processed in the first text.
- the video sub-segments of the first video segment where the voice is paused help to overcome the problem of jumping or jittering at the splicing position, so the continuity at the splicing position can be improved.
- FIG. 1A shows a schematic diagram of an application scenario according to an embodiment of the present application.
- the client and the server are located in a wired or wireless network, and the client and the server perform data interaction through the wired or wireless network.
- the client and the server may be collectively referred to as an electronic device.
- clients include but are not limited to: smartphones, tablet computers, e-book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (MPEG-4 Part 14) players, laptop computers, in-vehicle computers, desktop computers, set-top boxes, smart TVs, wearable devices, etc.
- the server is, for example, a standalone hardware server, a virtual server, or a server cluster.
- the client refers to a program that corresponds to the server and provides local services for users.
- the client in this embodiment of the application may receive user input and provide a video corresponding to the user input.
- the video can be generated by the client or the server, and this embodiment of the present application does not limit the specific generation subject of the video.
- the client may receive user input and upload the user input to the server, so that the server generates a video corresponding to the user input.
- the server can determine the variable text to be processed according to user input, generate a second video segment corresponding to the variable text to be processed, and splice the pre-saved first video segment and the second video segment to obtain the video corresponding to the template text and the variable text to be processed.
- FIG. 1B shows a flow chart of a video processing method of the present application, which may specifically include the following steps.
- the video processing method can be executed by electronic equipment, for example.
- Step 101: obtain the first video segment, where the first video segment corresponds to the template text in the first text of the video to be generated, the first video segment includes a video sub-segment with a speech pause, and the position of the video sub-segment corresponds to the boundary position between the template text and the variable text to be processed in the first text.
- Step 102: generate a second video clip corresponding to the variable text to be processed.
- Step 103: splice the first video segment and the second video segment to obtain a video corresponding to the first text.
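The three steps can be sketched end to end as follows. This is only an illustration under simplifying assumptions: a video segment is modeled as a list of audio samples plus a list of frame labels, the pre-saved first segments live in a dictionary keyed by template text, and `Segment`, `process_video` and `fake_generate` are hypothetical names, not part of the application.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Segment:
    """Toy stand-in for a video segment: audio samples plus image frames."""
    audio: List[float]
    frames: List[str]


def splice(first: Segment, second: Segment) -> Segment:
    """Step 103: concatenate the voice parts and the image parts."""
    return Segment(first.audio + second.audio, first.frames + second.frames)


def process_video(template_text: str, variable_text: str,
                  saved: Dict[str, Segment],
                  generate: Callable[[str], Segment]) -> Segment:
    first = saved[template_text]      # Step 101: pre-saved segment ending in a pause
    second = generate(variable_text)  # Step 102: only the variable text is synthesized
    return splice(first, second)      # Step 103


# Hypothetical data: the last sample and frame of the saved segment are the pause.
saved = {"About": Segment([0.3, 0.2, 0.0], ["f0", "f1", "pause"])}


def fake_generate(text: str) -> Segment:
    # Placeholder generator: one sample and one frame per character.
    return Segment([0.1] * len(text), [f"{text}:{i}" for i in range(len(text))])


video = process_video("About", "flu", saved, fake_generate)
```

Only `fake_generate` runs per request in this sketch, which mirrors the efficiency argument: the cost scales with the variable text rather than with the complete text.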
- the first video segment corresponding to the template text may be generated and saved in advance.
- the first video segment includes a video sub-segment with a speech pause.
- the voice pause means that the voice is stopped or the voice is not output temporarily.
- a video sub-segment with a pause in speech may be considered a video sub-segment without speech.
- the position of the video sub-segment corresponds to the boundary position between the template text and the variable text to be processed in the first text, and the video sub-segment can improve the continuity at the splicing position.
- the structure of the text in the embodiment of the present application specifically includes: template text and variable text. Boundaries can be used to divide adjacent template text and variable text.
- the process of determining the first video segment may include: generating a preset video according to the template text, the preset variable text, and the pause information at the corresponding boundary position; and intercepting, from the preset video, the first video segment corresponding to the template text.
- the preset variable text may be any variable text, or the preset variable text may be any instance of the variable text.
- a preset video can be generated according to the preset complete text corresponding to the template text and the preset variable text, wherein the pause information at the boundary position can be considered during the generation of the preset video.
- the pause information indicates, for example, a speech pause of a predetermined duration.
- the preset video may include: a preset voice corresponding to the voice part and a preset image sequence corresponding to the image part.
- TTS (Text To Speech) technology may be used to convert the preset complete text into the preset voice.
- the preset voice may be represented in the form of a waveform.
- the conversion of the preset complete text into the preset voice in the embodiment of the present application specifically includes: a language analysis link and an acoustic system link.
- the language analysis link is used to generate corresponding linguistic information according to the preset complete text and its corresponding pause information.
- the acoustic system link mainly generates the corresponding preset voice based on the linguistic information provided by the language analysis link, thereby realizing the vocalization function.
- the processing of the language analysis link may specifically include: text structure and language judgment, text standardization, text-to-phoneme conversion, and prosody prediction.
- Linguistic information may be the result of the language analysis link.
- text structure and language judgment are used to determine the language of the preset complete text, such as Chinese, English, Tibetan, Uighur, etc., to divide the preset complete text into sentences according to the grammatical rules of that language, and to pass the segmented sentences to the subsequent processing module.
- Text standardization is used to standardize the segmented sentences according to the set rules.
- Text-to-phoneme conversion is used to determine the phoneme features corresponding to the sentence.
- prosody prediction can be used to determine where a sentence needs a pause, how long the pause is, which word or phrase needs to be stressed, which word needs to be read lightly, etc., so that the voice has a natural rise and fall in pitch and rhythm.
- the prosody prediction technology may be used first to determine the prosody prediction result, and then the prosody prediction result may be updated according to the pause information.
- the pause information can be, for example: add a pause of a preset duration between the template text "about" and the variable text "<diabetes>". Updating the prosody prediction result can then specifically include: adding the preset pause information between the phoneme features "guan", "yu" of the template text and the phoneme features "tang", "niao", "bing" of the variable text "<diabetes>". The updated prosody prediction result can be: "guan", "yu", "pause for N milliseconds", "tang", "niao", "bing", etc. N may be a natural number greater than 0, and the value of N may be determined by those skilled in the art according to actual application requirements.
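The update of the prosody prediction result can be sketched as a simple list operation. The function name `add_boundary_pause`, the tuple used as a pause marker, and the value of 200 milliseconds are assumptions chosen for illustration; the application only requires N to be greater than 0.

```python
def add_boundary_pause(template_phonemes, variable_phonemes, pause_ms):
    """Return the updated prosody sequence with a pause marker at the boundary."""
    if pause_ms <= 0:
        raise ValueError("pause_ms must be greater than 0")
    pause = ("pause", pause_ms)  # hypothetical marker for "pause for N milliseconds"
    return list(template_phonemes) + [pause] + list(variable_phonemes)


updated = add_boundary_pause(["guan", "yu"], ["tang", "niao", "bing"], pause_ms=200)
```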
- the acoustic system link can obtain preset voices that meet the needs according to the speech synthesis parameters.
- the speech synthesis parameters may include: timbre parameters.
- the timbre parameters can refer to the distinctive characteristics that the frequencies of different sounds show in their waveforms. Usually, different sound emitting bodies have different timbres. Therefore, according to the timbre parameters, a speech sequence matching the timbre of the target sound emitting body can be obtained.
- the target sound emitting body can be specified by the user; for example, it can be a designated medical worker. In practical applications, the timbre parameters of the target sound emitting body can be obtained from audio of a preset length recorded from the target sound emitting body.
- the preset image sequence corresponding to the image part can be obtained on the basis of the virtual object image.
- the embodiment of the present application can assign state features to the virtual object image to obtain the preset image sequence.
- the virtual object image can be specified by the user, for example, the virtual object image can be an image of a well-known person (such as a presenter).
- Expression, that is, the expressing of emotion, can refer to thoughts and feelings shown on the face.
- Expression features are usually for the entire face.
- the lip features can be specific to the lips, and are related to the text content, voice, and pronunciation of the text, so the naturalness of the expression corresponding to the preset image sequence can be improved.
- Body features can convey the thoughts of a character through the coordinated activity of the head, eyes, neck, hands, elbows, arms, torso, hips, feet and other body parts, and so express ideas vividly.
- Body features can include turning the head, shrugging the shoulders, gestures, etc., which can improve the richness of expression of the image sequence. For example, at least one arm hangs down naturally when speaking, and at least one arm rests naturally on the abdomen when not speaking.
- in the process of generating the image part of the preset video, the image parameters can be determined according to the preset complete text and the pause information, where the image parameters represent the state characteristics of the virtual object; the preset image sequence corresponding to the image part can then be generated according to the image parameters.
- the image parameters may include: pause image parameters, and the pause image parameters may represent pause state features corresponding to the pause information.
- the pause image parameter represents the state characteristics of the virtual object in terms of body, expression, etc. when the virtual object stops speaking.
- the preset image sequence may include: an image sequence corresponding to the pause state feature.
- the characteristics of the pause state may include: a neutral expression, a closed lip state, and a drooping state of an arm.
- the preset voice and the preset image sequence can be fused to obtain a corresponding preset video.
- the first video segment corresponding to the template text can be intercepted from the preset video. Specifically, the first video segment may be intercepted according to the start position and end position of the preset variable text in the preset video.
- because the pause information at the boundary position is used, the first video segment before T1 contains the pause (that is, the first video segment includes a video sub-segment with a speech pause), so the continuity at the splicing position in the subsequent splicing process can be improved.
- the first video clips corresponding to multiple template texts can be respectively extracted from the preset video.
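Intercepting by the start and end positions of the preset variable text can be sketched as two list slices. The sketch assumes the positions are known as times in seconds and that the preset image sequence has a constant frame rate; `cut_template_segments` and its parameters are hypothetical names. The voice part would be cut at the same two times.

```python
def cut_template_segments(frames, fps, var_start_s, var_end_s):
    """Return the frames before and after the preset variable text."""
    start = round(var_start_s * fps)  # first frame of the preset variable text
    end = round(var_end_s * fps)      # first frame after the preset variable text
    if not 0 <= start <= end <= len(frames):
        raise ValueError("variable text span lies outside the preset video")
    return frames[:start], frames[end:]


preset_frames = list(range(10))  # toy preset image sequence: 10 frames at 5 fps
before, after = cut_template_segments(preset_frames, fps=5,
                                      var_start_s=0.8, var_end_s=1.4)
```

When the template text surrounds the variable text, both returned slices are first video clips; when the variable text ends the sentence, the second slice is simply empty.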
- not only the voice of the video sub-segment in the first video segment is paused, but also the virtual object in the image of the video sub-segment is in a state of not speaking.
- the video sub-segment is a sub-segment obtained after pause processing.
- Pause processing for video sub-segments may proceed as follows:
- a method of obtaining the first video segment may include: generating a first video according to the template text and the preset variable text; intercepting the first video segment corresponding to the template text from the first video; and performing pause processing on the first video segment at the boundary position.
- the speech signal sub-segment at the boundary position of the video clip and the mute signal can be weighted to realize the pause processing of the speech part.
- the image subsequence at the boundary position of the video segment and the image sequence corresponding to the target state characteristics of the pause information can be weighted to realize the pause processing of the image part.
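One way to read the weighting in both cases is a linear cross-fade toward the pause target, sketched below. The linear weights, the function name `weight_toward`, the use of 0.0 as the mute signal, and the single scalar "mouth opening" parameter are all assumptions for illustration; the application does not fix a weighting scheme.

```python
def weight_toward(values, target, n):
    """Blend the last n values toward target with linearly decreasing weights."""
    n = min(n, len(values))
    out = list(values)
    for i in range(n):
        w = (n - 1 - i) / n  # weight of the original value, reaching 0 at the end
        idx = len(values) - n + i
        out[idx] = w * values[idx] + (1.0 - w) * target
    return out


# Voice part: weight the speech samples at the boundary with a mute signal (0.0).
speech = [1.0, 1.0, 1.0, 1.0]
paused_speech = weight_toward(speech, target=0.0, n=4)

# Image part: weight a hypothetical mouth-opening parameter with the closed-lip state.
mouth_open = [0.8, 0.8, 0.8, 0.8]
paused_lips = weight_toward(mouth_open, target=0.0, n=2)
```

Because the last weight is 0, the segment ends exactly in the pause state, which is what the following segment is spliced onto.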
- after the first video clip is obtained, it can be saved, so that when the variable text changes, the first video clip can be spliced with the second video clip corresponding to the changed variable text (hereinafter referred to as the variable text to be processed).
- in step 102, the variable text to be processed can be obtained according to user input. It can be understood that the embodiment of the present application does not limit the specific manner of determining the variable text to be processed.
- generating the second video segment corresponding to the variable text to be processed specifically includes: determining the corresponding voice parameters and image parameters for the sentence in which the variable text to be processed is located in the first text, wherein the image parameters represent the state characteristics of the virtual object that is to appear in the video corresponding to the first text, and the voice parameters represent the parameters corresponding to speech synthesis; extracting, from the voice parameters and image parameters, the target voice parameters and target image parameters corresponding to the variable text to be processed; and generating, according to the target voice parameters and target image parameters, the second video clip corresponding to the variable text to be processed.
- Technical solution 1 first determines the corresponding speech parameters and image parameters based on the sentence where the variable text to be processed is located, and then extracts the target speech parameter and target image parameter corresponding to the variable text to be processed from the speech parameters and image parameters.
- a sentence is a grammatically self-contained unit consisting of a word or a syntactically related group of words expressing a claim, question, command, wish or exclamation.
- the sentence usually contains both the template text and the variable text to be processed. Because the voice parameters and image parameters of a sentence are continuous to a certain degree, the target voice parameters and target image parameters extracted for the variable text to be processed are continuous with the voice parameters and image parameters of the template text in the same sentence. On this basis, the continuity between the second video segment corresponding to the variable text to be processed and the first video segment corresponding to the template text in the sentence can be improved, thereby improving the continuity at the splicing position.
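The extraction in technical solution 1 can be sketched as selecting, from parameters computed over the whole sentence, only the units that fall inside the variable text. The per-unit dictionaries, the character-offset alignment, and the example values `dur_ms` and `lip` are hypothetical; a real system would align at phoneme or frame level.

```python
def extract_target_params(units, var_start, var_end):
    """Keep the units whose character span lies inside [var_start, var_end)."""
    return [u for u in units if var_start <= u["start"] and u["end"] <= var_end]


sentence = "About diabetes"
# Parameters determined for the whole sentence, one entry per unit (assumed values).
units = [
    {"text": "About", "start": 0, "end": 5,
     "speech": {"dur_ms": 300}, "image": {"lip": 0.4}},
    {"text": "diabetes", "start": 6, "end": 14,
     "speech": {"dur_ms": 600}, "image": {"lip": 0.7}},
]

var_start = sentence.index("diabetes")
target = extract_target_params(units, var_start, var_start + len("diabetes"))
```

The design point is that the parameters are computed with the template text as context and only sliced afterwards, rather than computed from the variable text in isolation.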
- the speech parameters may represent parameters corresponding to speech synthesis.
- Speech parameters may include: linguistic features and/or acoustic features.
- Linguistic features may include: phoneme features.
- a phoneme is the smallest unit of speech divided according to the natural properties of speech. It is analyzed according to the pronunciation actions in a syllable, and an action constitutes a phoneme.
- Phonemes can include: vowels and consonants.
- Acoustic features can characterize the characteristics of speech from the perspective of vocalization.
- Acoustic features may include, but are not limited to, the following:
- Prosodic features, specifically including duration-related features, fundamental-frequency-related features, energy-related features, etc.
- Spectrum-based correlation analysis features, which reflect the correlation between changes in vocal tract shape and articulatory movements.
- spectrum-based correlation features mainly include: Linear Prediction Cepstral Coefficients (LPCC), Mel Frequency Cepstral Coefficients (MFCC), and so on.
- speech synthesis may be performed on the variable text to be processed according to the target speech parameters, so as to convert the variable text to be processed into the target speech.
- the image parameters may be parameters corresponding to the generation of the image sequence.
- the image parameters may be used to determine the state characteristics corresponding to the virtual object, or the image parameters may include: the state characteristics corresponding to the virtual object.
- image parameters may include lip features.
- state features corresponding to target image parameters may be assigned to the virtual object image to obtain a target image sequence.
- the target voice and the target image sequence are fused to obtain the second video segment.
- generating the second video segment corresponding to the variable text to be processed specifically includes: smoothing the target image parameters corresponding to the variable text to be processed according to the preset image parameters of the preset variable text at the boundary position, so as to improve the continuity between the target image parameters and the image parameters of the template text at the boundary position; and generating, according to the smoothed target image parameters, a second video segment corresponding to the variable text to be processed.
- Technical solution 2 smooths the target image parameters corresponding to the variable text to be processed according to the preset image parameters of the preset variable text at the boundary position. Since the preset image parameters of the preset variable text at the boundary position are continuous to some degree with the image parameters of the template text at the boundary position, this smoothing can improve the continuity between the smoothed target image parameters and the image parameters of the template text at the boundary position.
- on this basis, the continuity between the second video segment corresponding to the variable text to be processed and the first video segment corresponding to the template text in the sentence can be improved, which in turn improves the continuity at the splicing position.
- a window function such as a Hanning window may be used to perform smoothing processing on target image parameters corresponding to the variable text to be processed according to preset image parameters. It can be understood that the embodiment of the present application does not limit the specific smoothing process.
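- One possible reading of the window-based smoothing is a short blend from the saved preset boundary value into the newly generated target parameters, weighted by the falling half of a Hann window. The sketch below assumes a scalar parameter per frame (for example one lip feature) and a blend length `k`; both, and the function names, are illustrative assumptions, since the application does not limit the specific smoothing process.

```python
import math
from typing import List


def hann_fade_weights(k: int) -> List[float]:
    """Falling half of a Hann window sampled at k interior points (from near 1 to near 0)."""
    return [0.5 * (1.0 + math.cos(math.pi * (i + 1) / (k + 1))) for i in range(k)]


def smooth_start(target: List[float], preset_boundary: float, k: int) -> List[float]:
    """Blend the first k target values toward the preset boundary value."""
    k = min(k, len(target))
    smoothed = list(target)
    for i, w in enumerate(hann_fade_weights(k)):
        # w is the share of the preset value; it shrinks as we move away from the boundary.
        smoothed[i] = w * preset_boundary + (1.0 - w) * target[i]
    return smoothed
```

- The same blend can be mirrored at the end of the variable text by reversing the list before and after the call.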
- in the process of generating the image part of the preset video, the embodiment of the present application can determine the image parameters according to the preset complete text and the pause information, extract from those image parameters the preset image parameters of the preset variable text at the boundary position, and save the preset image parameters.
- the image sequence corresponding to the video includes a background image sequence and a moving image sequence; generating a second video segment corresponding to the variable text to be processed then specifically includes: generating a target moving image sequence corresponding to the variable text to be processed; determining, according to a preset background image sequence, a target background image sequence corresponding to the variable text to be processed; and fusing the target moving image sequence with the target background image sequence to obtain the second video segment corresponding to the variable text to be processed.
- the image sequence corresponding to the video can be decomposed into two parts.
- the first part is: a moving image sequence, which can be used to represent the moving part of the virtual object when it is expressed, usually corresponding to preset parts such as lips, eyes, and arms.
- the second part is: the background image sequence, which can be used to characterize the relatively static part of the virtual object when it is expressed, usually corresponding to parts other than the preset parts.
- the background image sequence may be obtained from a preset.
- a preset background image sequence with a preset duration may be preset, and the preset background image sequence may be arranged cyclically in the image sequence (also referred to as cyclic appearance).
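- The cyclic arrangement of a preset background image sequence can be sketched as a modulo lookup. The frame numbers 100 and 125 used in the test are borrowed from the example later in this description; the function name is a hypothetical helper.

```python
def cyclic_background_frame(i: int, start_frame: int, end_frame: int) -> int:
    """Return the preset background frame number shown at position i of the image sequence."""
    length = end_frame - start_frame + 1
    # Positions beyond the preset duration wrap around to the start of the preset sequence.
    return start_frame + i % length
```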
- a moving image sequence can be generated according to target image parameters corresponding to the variable text to be processed.
- the moving image sequence and the background image sequence can be fused to obtain an image sequence.
- a moving image sequence can be pasted over a background image sequence to obtain an image sequence.
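- Pasting a moving image sequence over a background image sequence can be sketched frame by frame as below. Frames are modelled as plain nested lists of pixel values, and the paste position (`top`, `left`) is an assumed parameter; a real system would work on image arrays and would likely blend the edges of the pasted region.

```python
from typing import List

Frame = List[List[int]]


def paste(background: Frame, moving: Frame, top: int, left: int) -> Frame:
    """Copy the background frame and overwrite one rectangle with the moving part."""
    fused = [row[:] for row in background]
    for dy, row in enumerate(moving):
        for dx, pixel in enumerate(row):
            fused[top + dy][left + dx] = pixel
    return fused


def fuse_sequences(backgrounds: List[Frame], movings: List[Frame],
                   top: int, left: int) -> List[Frame]:
    """Fuse two equally long sequences into one image sequence."""
    if len(backgrounds) != len(movings):
        raise ValueError("both sequences must contain the same number of frames")
    return [paste(b, m, top, left) for b, m in zip(backgrounds, movings)]
```

- The length check is the reason the target background image sequence has to be brought to the same frame count as the target moving image sequence before fusion.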
- the information of the preset variable text corresponding to the preset background image sequence may be recorded.
- the information of the preset background image sequence may include: a start frame identifier and an end frame identifier of the preset background image sequence in the preset video.
- the information of the preset background image sequence may include: a start frame number 100, an end frame number 125, and the like.
- the background images at the first and last positions of the target background image sequence match the background images at the first and last positions of the preset background image sequence.
- the first position may refer to a start position
- the tail position may refer to an end position.
- the background image at the first position of the target background image sequence matches the background image at the first position of the preset background image sequence.
- the background image at the end position of the target background image sequence matches the background image at the end position of the preset background image sequence.
- when the target background image sequence matches the preset background image sequence at the boundary position, the matching degree and continuity at the splicing position between the target background image sequence and the background image sequence corresponding to the template text can also be improved.
- the above-mentioned determining method for determining the target background image sequence corresponding to the variable text to be processed may specifically include:
- Determination mode 1: when the number N1 of images corresponding to the preset background image sequence matches the number N2 of images corresponding to the target moving image sequence, determine the preset background image sequence as the target background image sequence; or
- Determination mode 2: when the number N1 of images corresponding to the preset background image sequence is greater than the number N2 of images corresponding to the target moving image sequence, discard first background images located in the middle position from the preset background image sequence; when at least two frames of first background images are discarded, the at least two frames are discontinuously distributed in the preset background image sequence; or
- Determination mode 3: when the number N1 of images corresponding to the preset background image sequence is smaller than the number N2 of images corresponding to the target moving image sequence, add a second background image on the basis of the preset background image sequence.
- the preset background image sequence is determined as the target background image sequence, which can realize the matching of the target background image sequence and the preset background image sequence at the boundary position.
- the number N2 of images corresponding to the target moving image sequence can be determined according to the speech duration information corresponding to the variable text to be processed.
- the speech duration information may be determined according to the speech parameters corresponding to the variable text to be processed, or the speech duration information may be determined according to the duration of the speech segment corresponding to the variable text to be processed.
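- The relation between speech duration and the image count N2 can be sketched as a frame-rate conversion. The frame rate of 25 frames per second is an assumption for illustration; the application does not state one.

```python
def image_count_from_duration(duration_seconds: float, fps: float = 25.0) -> int:
    """Number of images N2 needed to cover a speech segment of the given duration."""
    if duration_seconds <= 0 or fps <= 0:
        raise ValueError("duration and fps must be positive")
    # At least one image is kept so that very short speech still has a frame.
    return max(1, round(duration_seconds * fps))
```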
- the first background image located in the middle position is discarded from the preset background image sequence, which can realize the matching of the target background image sequence and the preset background image sequence at the boundary position.
- the middle position can be different from the first or last position.
- the discarded at least two frames of the first background image are discontinuously distributed in the preset background image sequence; in this way, the problem of poor continuity of the background image caused by discarding continuous background images can be avoided to a certain extent.
- the number of the first background images may match the difference between N1 and N2.
- for example, the information of the preset background image sequence may include start frame number 100 and end frame number 125, so the value of N1 is 26; assuming that the number of images N2 corresponding to the target moving image sequence is 24, two frames of first background images that are located in the middle and are not adjacent to each other are discarded from the preset background image sequence.
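- Determination mode 2 can be sketched as dropping evenly spaced interior frames, which keeps the first and last frames and never removes two neighbouring frames. The even spacing rule is one possible choice made here for illustration; the application only requires that the discarded frames lie in the middle and are not contiguous.

```python
from typing import List


def drop_middle_frames(frames: List[int], target_count: int) -> List[int]:
    """Shorten a background sequence to target_count frames by dropping interior frames."""
    n1 = len(frames)
    drop = n1 - target_count
    if drop < 0:
        raise ValueError("target_count exceeds the preset length; use padding instead")
    if drop == 0:
        return list(frames)
    spacing = (n1 - 1) / (drop + 1)
    if spacing < 2:
        raise ValueError("too many frames to drop without removing adjacent frames")
    # Evenly spaced interior indices; spacing >= 2 keeps them non-adjacent
    # and away from the first and last positions.
    dropped = {round((j + 1) * spacing) for j in range(drop)}
    return [f for i, f in enumerate(frames) if i not in dropped]
```

- With frames 100 to 125 and a target of 24 images, this sketch drops frames 108 and 117.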
- when N1 is smaller than N2, adding a second background image on the basis of the preset background image sequence can realize the matching of the target background image sequence and the preset background image sequence at the boundary position.
- the second background image may originate from a preset background image sequence, in other words, the second background image to be added may be determined from the preset background image sequence.
- the preset background image sequence may first be taken in forward order as the first part of the target background image sequence; then the preset background image sequence may be taken in reverse order as the second part of the target background image sequence; then the preset background image sequence may be taken in forward order as the third part of the target background image sequence; wherein the end frame of the third part matches the end frame of the preset background image sequence.
- for example, the information of the preset background image sequence may include start frame number 100 and end frame number 125, so the value of N1 is 26; assuming that the number of images N2 corresponding to the target moving image sequence is 30, then:
- the frame numbers corresponding to the first part of the target background image sequence may be: 100→125
- the frame numbers corresponding to the second part of the target background image sequence may be: 125→124
- the frame numbers corresponding to the third part of the target background image sequence may be: 124→125.
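- The forward, reverse, forward padding in this example can be sketched as follows. The sketch only handles an even difference between N2 and N1, as in the example (30 versus 26); how an odd difference would be handled is not stated in the source and is left out here. The function name is a hypothetical helper.

```python
from typing import List


def pad_background_tail(start_frame: int, end_frame: int, target_count: int) -> List[int]:
    """Extend a preset background sequence at its tail so that it still ends on end_frame."""
    preset = list(range(start_frame, end_frame + 1))
    extra = target_count - len(preset)
    if extra < 0 or extra % 2 != 0 or extra // 2 > len(preset):
        raise ValueError("this sketch only supports a small, even number of extra frames")
    half = extra // 2
    tail = preset[len(preset) - half:]  # last `half` frames, forward order
    # first part: forward; second part: reverse of the tail; third part: tail forward again
    return preset + tail[::-1] + tail
```

- For frames 100 to 125 and a target of 30 images, the last four frame numbers are 125, 124, 124, 125, so the sequence still starts on frame 100 and ends on frame 125.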
- the second background image may originate from a background image sequence other than the preset background image sequence; for example, the second background image may be determined from the background image sequence that follows the preset background image sequence.
- the preset background image sequence may first be taken in forward order as the first part of the target background image sequence; then the background image sequence following the preset background image sequence may be taken in forward order as the second part of the target background image sequence; then, in reverse order, the background image sequence following the preset background image sequence together with the end frame of the preset background image sequence may be taken as the third part of the target background image sequence; wherein the end frame of the third part matches the end frame of the preset background image sequence.
- for example, the information of the preset background image sequence may include start frame number 100 and end frame number 125, so the value of N1 is 26; assuming that the number of images N2 corresponding to the target moving image sequence is 30, then:
- the frame numbers corresponding to the first part of the target background image sequence may be: 100→125
- the frame numbers corresponding to the second part of the target background image sequence may be: 126→127
- the frame numbers corresponding to the third part of the target background image sequence may be: 127→125.
- an inverted target background image sequence may also be determined.
- the corresponding determination process may include: first, in reverse order, determine the preset background image sequence as the first part of the target background image sequence; then, in forward order, determine the preset background image sequence as the second part of the target background image sequence; then, in reverse order, determine the preset background image sequence as the third part of the target background image sequence; wherein the starting frame of the third part matches the starting frame of the preset background image sequence.
- for example, the information of the preset background image sequence may include start frame number 100 and end frame number 125, so the value of N1 is 26; assuming that the number of images N2 corresponding to the target moving image sequence is 30, then:
- the frame numbers corresponding to the first part of the target background image sequence may be: 125→100
- the frame numbers corresponding to the second part of the target background image sequence may be: 100→101
- the frame numbers corresponding to the third part of the target background image sequence may be: 101→100.
- the frame numbers of the target background image sequence obtained may be: 100→101→101→100→100→125.
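- The inverted variant can be sketched by building the sequence in reverse and then flipping it, which places the extra frames at the head instead of the tail. As with the tail variant, only an even number of extra frames is handled, and the function name is hypothetical.

```python
from typing import List


def pad_background_head(start_frame: int, end_frame: int, target_count: int) -> List[int]:
    """Extend a preset background sequence at its head so that it still starts on start_frame."""
    preset = list(range(start_frame, end_frame + 1))
    extra = target_count - len(preset)
    if extra < 0 or extra % 2 != 0 or extra // 2 > len(preset):
        raise ValueError("this sketch only supports a small, even number of extra frames")
    head = preset[: extra // 2]
    # Inverted sequence: preset in reverse, then the head forward, then the head in reverse.
    inverted = preset[::-1] + head + head[::-1]
    # Flipping the inverted sequence gives the playback order.
    return inverted[::-1]
```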
- step 103 the first video clip and the second video clip are spliced to obtain a video corresponding to the first text.
- the first video segment may specifically include: a first speech segment
- the second video segment may specifically include: a second speech segment
- the above-mentioned splicing of the first video segment and the second video segment may specifically include: smoothing the speech sub-segments of the first speech segment and of the second speech segment at the splicing position, respectively; and splicing the smoothed first speech segment and the smoothed second speech segment.
- smoothing is first performed on the speech sub-segments of the first speech segment and the second speech segment at splicing positions, and then the smoothed first speech segment and the smoothed second speech segment are spliced.
- the above smoothing process can improve the continuity between the smoothed first speech segment and the second speech segment, and thus can improve the continuity of the first video segment and the second video segment at the splicing position.
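- A minimal sketch of smoothing the speech sub-segments at the splicing position and then splicing is given below. It assumes mono samples stored as floats and tapers `k` samples on each side of the splice with a half Hann window; the taper shape and length are illustrative choices, not details taken from the application.

```python
import math
from typing import List


def splice_speech(first: List[float], second: List[float], k: int) -> List[float]:
    """Taper the end of `first` and the start of `second`, then concatenate them."""
    k = min(k, len(first), len(second))
    # Falling half of a Hann window: weights decrease toward the splice point.
    fade = [0.5 * (1.0 + math.cos(math.pi * (i + 1) / (k + 1))) for i in range(k)]
    first_s, second_s = list(first), list(second)
    for i, w in enumerate(fade):
        first_s[len(first) - k + i] *= w  # fade out toward the splice
        second_s[k - 1 - i] *= w          # fade in away from the splice
    return first_s + second_s
```

- Since the first video segment carries a speech pause at the boundary, tapering toward silence on both sides is consistent with the pause described in this embodiment.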
- the spliced video may be output, for example, to a user.
- the corresponding variable text to be processed can be determined according to the disease name included in the user input, and the video can be obtained by using the method embodiment shown in FIG. 1B and provided to the user.
- the first video segment corresponding to the template text and the second video segment corresponding to the variable text to be processed are spliced.
- the first video clip may be a pre-saved video clip
- a second video clip corresponding to the variable text to be processed may be generated during video processing. Since the length of the variable text to be processed is shorter than the length of the complete text, the embodiment of the present application can shorten the length of the generated video and the corresponding time cost, thus improving the video processing efficiency.
- the first video segment in the embodiment of the present application is provided with a paused video sub-segment at the boundary position between the template text and the variable text.
- the above-mentioned pause processing can overcome the jump or jitter problem at the splicing position to a certain extent, and thus can improve the continuity at the splicing position.
- FIG. 2 shows a flow chart of a video processing method according to an embodiment of the present application, which may specifically include the following steps.
- Step 201: generate a preset video according to the template text, the preset variable text, and the corresponding pause information at the boundary position, where the pause information represents a voice pause of a predetermined duration;
- Step 202 intercepting the first video segment corresponding to the template text from the preset video, and saving the first video segment;
- Step 203 according to the information of the preset video, save the preset image parameters of the preset variable text at the boundary position and the information of the preset background image sequence corresponding to the preset variable text;
- Steps 201 to 203 can be used to pre-save the first video segment, the preset image parameters of the preset variable text at the boundary position, and the preset background image sequence corresponding to the preset variable text based on the generated preset video.
- Steps 204 to 211 can be used to generate a second video clip corresponding to the variable text to be processed according to the pre-saved information; and splicing the pre-saved first video clip and the second video clip.
- Step 204 for the sentence where the variable text to be processed is located, determine the corresponding speech parameters and image parameters;
- Step 205 extracting target speech parameters and target image parameters corresponding to the variable text to be processed from the speech parameters and image parameters;
- Step 206 Perform smoothing processing on target image parameters corresponding to the variable text to be processed according to preset image parameters
- Step 207 according to the target voice parameter and the smoothed target image parameter, generate the target moving image sequence corresponding to the variable text to be processed;
- Step 208 according to the preset background image sequence, determine the target background image sequence corresponding to the variable text to be processed
- Step 209 merging the target moving image sequence and the target background image sequence to obtain a second video clip corresponding to the variable text to be processed
- Step 210 smoothing the voice sub-segments at the boundary positions of the first voice segment in the first video segment and the second voice segment in the second video segment;
- Step 211 Splice the first video clip and the second video clip according to the smoothed first speech clip and the smoothed second speech clip.
- the preset complete text is the aforementioned text A
- the preset variable text is "<diabetes>", "<fruit>", "<1800>", etc. in the text A
- the preset video can be generated according to the text A and the corresponding pause information, and the first video segment in the preset video, the preset image parameters of the preset variable text at the boundary position, and the information of the preset background image sequence corresponding to the preset variable text can be saved.
- the variable text may change. For example, after text A changes to text B "about <coronary heart disease> and <vegetables>, I am still researching. I think this <coronary heart disease> dietary advice may also be helpful to you, which contains recommendations and taboos for about <900> ingredients, please click to view", the variable text to be processed may include "<coronary heart disease>", "<vegetables>", "<900>", and so on in text B.
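- Separating the variable text from the template text can be sketched with a regular expression, assuming that variable fields are delimited by angle brackets as in text A and text B. The delimiter convention and the function name are assumptions for illustration.

```python
import re
from typing import List, Tuple


def split_template(text: str) -> Tuple[List[str], List[str]]:
    """Return (variable texts, template text pieces) for a text with <...> fields."""
    variables = re.findall(r"<([^<>]+)>", text)
    template_parts = re.split(r"<[^<>]+>", text)
    return variables, template_parts
```

- Each boundary between a template piece and a variable field in the result corresponds to a boundary position at which the first video segment carries a pause.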
- the second video segment corresponding to the variable text to be processed can be generated. For example, you can first determine the acoustic parameters and lip features of the sentence where the variable text to be processed is located; then, extract the target acoustic parameters and target lip features corresponding to the variable text to be processed, and generate speech segments corresponding to the variable text to be processed and the target image sequence.
- the target image sequence may include: a target moving image sequence and a target background image sequence.
- step 206 may be used to smooth the target lip features, so as to improve the continuity of the lip features at the splicing position.
- Step 208 can be used to generate the target background image sequence to achieve the matching of the target background image sequence and the preset background image sequence at the boundary position, so as to improve the continuity of the background image sequence at the stitching position.
- each speech sub-segment at the boundary position is smoothed; the first video segment and the second video segment are then spliced according to the smoothed first speech segment and the smoothed second speech segment.
- the video processing method of the embodiment of the present application adds a pause of preset duration at the splicing position of the first video segment, which helps to overcome the jump or jitter problem at the splicing position and can therefore improve the continuity at the splicing position.
- the sentence in which the variable text to be processed is located is used as a unit to determine the corresponding speech parameters and image parameters, and then from the speech parameters and image parameters, the target speech parameter and target image corresponding to the variable text to be processed are extracted parameter.
- the voice parameter corresponding to the sentence and the image parameter have certain continuity, so the target voice parameter corresponding to the variable text to be processed and the target image parameter extracted therefrom have certain continuity with the voice parameter and the image parameter corresponding to the template text in the sentence;
- the continuity between the second video segment corresponding to the variable text to be processed and the first video segment corresponding to the template text in the sentence can be improved, and the continuity at the splicing position can be further improved.
- smoothing is performed on the target image parameters corresponding to the variable text to be processed according to the preset image parameters of the preset variable text at the boundary position. Since the preset image parameters of the preset variable text at the boundary position are continuous to some degree with the image parameters of the template text at the boundary position, this smoothing can improve the continuity between the smoothed target image parameters and the image parameters of the template text at the boundary position.
- on this basis, the continuity between the second video segment corresponding to the variable text to be processed and the first video segment corresponding to the template text in the sentence can be improved, which in turn improves the continuity at the splicing position.
- the embodiment of the present application generates the target background image sequence according to the preset background image sequence, which can realize the matching of the target background image sequence and the preset background image sequence at the boundary position, so as to improve the continuity of the background image sequence at the splicing position .
- the speech sub-segments at the boundary position are smoothed.
- the above smoothing process can improve the continuity between the smoothed first speech segment and the second speech segment, and thus can improve the continuity of the first video segment and the second video segment at the splicing position.
- FIG. 3 shows a structural block diagram of an embodiment of a video processing device of the present application, which may specifically include:
- a providing module 301, configured to obtain a first video segment, where the first video segment corresponds to the template text in the first text of the video to be generated, the first video segment includes a video sub-segment with a speech pause, and the position of the video sub-segment corresponds to the boundary position between the template text and the variable text to be processed in the first text;
- a generating module 302 configured to generate a second video clip corresponding to the variable text to be processed
- the splicing module 303 is configured to splice the first video clip and the second video clip to obtain a video corresponding to the first text.
- the above-mentioned device may also include:
- the preset video generation module is used to generate a preset video according to the template text, the preset variable text, and the corresponding pause information at the boundary position, and the pause information represents a voice pause of a predetermined duration;
- An intercepting module configured to intercept the first video segment corresponding to the template text from the preset video.
- the generating module 302 may include:
- a parameter determination module, configured to determine the corresponding speech parameters and image parameters for the sentence in the first text where the variable text to be processed is located, wherein the image parameters represent the state features of the virtual object that is to appear in the video corresponding to the first text, and the speech parameters represent the parameters used for speech synthesis;
- a parameter extraction module configured to extract target speech parameters and target image parameters corresponding to the variable text to be processed from the speech parameters and image parameters;
- the first segment generation module is used to generate the second video segment corresponding to the variable text to be processed according to the target voice parameter and the target image parameter.
- the generating module 302 may include:
- a first smoothing processing module, configured to smooth the target image parameters corresponding to the variable text to be processed according to the preset image parameters of the preset variable text at the boundary position, so as to improve the continuity between the target image parameters and the image parameters of the template text at the boundary position;
- the second segment generating module is configured to generate a second video segment corresponding to the variable text to be processed according to the smoothed target image parameters.
- the above-mentioned first video clip may include: a first speech segment
- the above-mentioned second video clip may include: a second speech segment
- the splicing module 303 may include:
- a second smoothing processing module, configured to smooth the speech sub-segments of the first speech segment and of the second speech segment at the splicing position, respectively;
- a post-smoothing splicing module, configured to splice the smoothed first speech segment and the smoothed second speech segment.
- the image sequence corresponding to the above video may include: a background image sequence and a moving image sequence;
- Generation module 302 may include:
- a moving image sequence generating module configured to generate a target moving image sequence corresponding to the variable text to be processed
- a background image sequence generation module configured to determine the target background image sequence corresponding to the variable text to be processed according to the preset background image sequence
- the fusion module is configured to fuse the target moving image sequence and the target background image sequence to obtain the second video segment corresponding to the variable text to be processed.
- the background images at the first and last positions of the target background image sequence match the background images at the first and last positions of the preset background image sequence.
- the above-mentioned background image sequence generation module may include:
- a first background image sequence generation module, configured to determine the preset background image sequence as the target background image sequence when the number of images corresponding to the preset background image sequence matches the number of images corresponding to the target moving image sequence; or
- a second background image sequence generation module, configured to discard first background images located in the middle position from the preset background image sequence when the number of images corresponding to the preset background image sequence is greater than the number of images corresponding to the target moving image sequence; when at least two frames of first background images are discarded, the at least two frames are discontinuously distributed in the preset background image sequence; or
- the third background image sequence generating module is configured to add a second background image to the preset background image sequence when the number of images corresponding to the preset background image sequence is smaller than the number of images corresponding to the target moving image sequence.
- for the device embodiment, the description is relatively brief; for related parts, refer to the description of the method embodiment.
- Fig. 4 is a structural block diagram of an apparatus 900 for video processing according to an exemplary embodiment.
- the apparatus 900 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device, a personal digital assistant, and the like.
- device 900 may include one or more of the following components: processing component 902, memory 904, power supply component 906, multimedia component 908, audio component 910, input/output (I/O) interface 912, sensor component 914, and communication component 916 .
- the processing component 902 generally controls the overall operations of the device 900, such as those associated with display, incoming phone calls, data communications, camera operations, and recording operations.
- the processing component 902 may include one or more processors 920 to execute instructions to complete all or part of the steps of the above method.
- processing component 902 may include one or more modules that facilitate interaction between processing component 902 and other components.
- the processing component 902 may include a multimedia module to facilitate interaction between the multimedia component 908 and the processing component 902.
- the memory 904 is configured to store various types of data to support operations at the device 900. Examples of such data include instructions for any application or method operating on the device 900, contact data, phonebook data, messages, pictures, videos, and the like.
- the memory 904 can be implemented by any type of volatile or non-volatile memory device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, or a magnetic or optical disk.
- the power supply component 906 provides power to the various components of the device 900.
- the power supply component 906 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the device 900.
- the multimedia component 908 includes a screen that provides an output interface between the device 900 and the user.
- the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from a user.
- the touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. A touch sensor may sense not only the boundary of a touch or swipe action, but also the duration and pressure associated with the touch or swipe operation.
- the multimedia component 908 includes a front camera and/or a rear camera. When the device 900 is in an operation mode, such as a shooting mode or a video mode, the front camera and/or the rear camera can receive external multimedia data. Each front camera and rear camera can be a fixed optical lens system or have focal length and optical zoom capability.
- the audio component 910 is configured to output and/or input audio signals.
- the audio component 910 includes a microphone (MIC) configured to receive external audio signals when the device 900 is in an operation mode, such as a call mode, a recording mode, or a voice recognition mode. Received audio signals may be further stored in the memory 904 or sent via the communication component 916.
- the audio component 910 also includes a speaker for outputting audio signals.
- the I/O interface 912 provides an interface between the processing component 902 and a peripheral interface module.
- the peripheral interface module may be a keyboard, a click wheel, a button, and the like. These buttons may include, but are not limited to: a home button, volume buttons, start button, and lock button.
- the sensor component 914 includes one or more sensors for providing status assessments of various aspects of the device 900.
- the sensor component 914 can detect the open/closed state of the device 900 and the relative positioning of components, such as the display and keypad of the device 900. The sensor component 914 can also detect a change in the position of the device 900 or of a component of the device 900, the presence or absence of user contact with the device 900, the orientation or acceleration/deceleration of the device 900, and a temperature change of the device 900.
- the sensor component 914 may include a proximity sensor configured to detect the presence of nearby objects in the absence of any physical contact.
- the sensor component 914 may also include an optical sensor, such as a CMOS or CCD image sensor, for use in imaging applications.
- the sensor component 914 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor or a temperature sensor.
- the communication component 916 is configured to facilitate wired or wireless communication between the apparatus 900 and other devices.
- the device 900 can access wireless networks based on communication standards, such as WiFi, 2G or 3G, or a combination thereof.
- the communication component 916 receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel.
- the communication component 916 also includes a near field communication (NFC) module to facilitate short-range communication.
- the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra Wide Band (UWB) technology, Bluetooth (BT) technology and other technologies.
- the apparatus 900 may be implemented by one or more application specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors or other electronic components, for performing the methods described above.
- a non-transitory computer-readable storage medium including instructions is also provided, such as the memory 904 including instructions, which can be executed by the processor 920 of the device 900 to implement the above method.
- the non-transitory computer readable storage medium may be ROM, random access memory (RAM), CD-ROM, magnetic tape, floppy disk, optical data storage device, and the like.
- Fig. 5 is a structural block diagram of a server in some embodiments of the present application.
- the server 1900 may vary considerably depending on configuration or performance, and may include one or more central processing units (CPUs) 1922 (for example, one or more processors), memory 1932, and one or more storage media 1930 (for example, one or more mass storage devices) storing application programs 1942 or data 1944.
- the memory 1932 and the storage medium 1930 may be temporary storage or persistent storage.
- the program stored in the storage medium 1930 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations on the server.
- the central processing unit 1922 may be configured to communicate with the storage medium 1930, and execute on the server 1900 a series of instruction operations stored in the storage medium 1930.
- the server 1900 may also include one or more power sources 1926, one or more wired or wireless network interfaces 1950, one or more input/output interfaces 1958, one or more keyboards 1956, and/or one or more operating systems 1941, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™ and so on.
- a non-transitory computer-readable storage medium is also provided; when the instructions in the storage medium are executed by a processor of an apparatus (a device or a server), the apparatus can execute the video processing method according to the embodiments of the present application.
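The background image sequence generation modules described earlier in this section adjust the preset background image sequence so that its image count matches the target moving image sequence. The following is a minimal sketch of one way this could be done, not the implementation of the application. The function name `fit_background`, the evenly spaced choice of discarded frames, and the use of duplicated frames as the added "second background images" are all assumptions made for illustration.

```python
from typing import List, Sequence, TypeVar

Frame = TypeVar("Frame")


def fit_background(preset: Sequence[Frame], target_len: int) -> List[Frame]:
    """Return a background sequence with exactly target_len frames.

    The first and last frames of the preset sequence are always kept, so the
    result matches the preset sequence at its head and tail positions.
    """
    frames = list(preset)
    n = len(frames)
    if n < 2 or target_len < 2:
        raise ValueError("both sequences need at least two frames")

    # Case 1: counts already match, use the preset sequence as is.
    if n == target_len:
        return frames

    # Case 2: too many background frames, discard frames from the middle.
    if n > target_len:
        drop = n - target_len
        interior = n - 2  # frames strictly between head and tail
        if interior < 2 * drop:
            # Assumption: refuse when non-adjacent discarding is impossible.
            raise ValueError("cannot discard that many non-adjacent frames")
        # Evenly spaced interior indices; consecutive ones differ by >= 2,
        # so no two discarded frames are neighbours.
        dropped = {1 + (k * interior) // drop for k in range(drop)}
        return [f for i, f in enumerate(frames) if i not in dropped]

    # Case 3: too few background frames, add frames by repeating existing
    # ones (assumption: the added frames are duplicates of their neighbours).
    return [frames[(i * (n - 1)) // (target_len - 1)] for i in range(target_len)]
```

For example, shrinking a ten-frame sequence to eight frames discards two separated interior frames, and stretching a four-frame sequence to seven frames repeats interior frames while keeping the head and tail frames in place.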
Landscapes
- Engineering & Computer Science (AREA)
- Multimedia (AREA)
- Physics & Mathematics (AREA)
- Signal Processing (AREA)
- General Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Human Computer Interaction (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Acoustics & Sound (AREA)
- Computational Linguistics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Oral & Maxillofacial Surgery (AREA)
- General Health & Medical Sciences (AREA)
- Business, Economics & Management (AREA)
- Marketing (AREA)
- Social Psychology (AREA)
- Geometry (AREA)
- Psychiatry (AREA)
- Processing Or Creating Images (AREA)
- Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)
- Television Signal Processing For Recording (AREA)
Claims (18)
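- A video processing method, performed in an electronic device, the method comprising: acquiring a first video segment, the first video segment corresponding to template text in a first text of a video to be generated, and the first video segment comprising a video sub-segment with a speech pause, the position of the video sub-segment corresponding to a boundary position between the template text and to-be-processed variable text in the first text; generating a second video segment corresponding to the to-be-processed variable text; and splicing the first video segment and the second video segment to obtain a video corresponding to the first text.
- The method according to claim 1, wherein the method further comprises: generating a preset video according to the template text, preset variable text, and pause information corresponding to the boundary position, the pause information indicating a speech pause of a predetermined duration; and extracting, from the preset video, the first video segment corresponding to the template text.
- The method according to claim 1, characterized in that a virtual object in the images of the video sub-segment is in a non-speaking state.
- The method according to any one of claims 1-3, characterized in that the video sub-segment is a sub-segment obtained through pause processing; wherein the pause processing of the video sub-segment comprises: weighting a speech signal sub-segment of the first video segment, at a splicing position corresponding to the boundary position, with a silence signal to obtain a speech signal sub-segment with a speech pause; and weighting an image sub-sequence of the first video segment at the splicing position with an image sequence of a target state feature to obtain the image sub-sequence in which the virtual object is in a non-speaking state, wherein the target state feature is a feature indicating that the virtual object is in a non-speaking state.
- The method according to any one of claims 1-3, wherein generating the second video segment corresponding to the to-be-processed variable text comprises: determining corresponding speech parameters and image parameters for the sentence in which the to-be-processed variable text is located in the first text, wherein the image parameters represent state features of a virtual object to appear in the video corresponding to the first text, and the speech parameters represent parameters for speech synthesis; extracting, from the speech parameters and image parameters, target speech parameters and target image parameters corresponding to the to-be-processed variable text; and generating, according to the target speech parameters and target image parameters, the second video segment corresponding to the to-be-processed variable text.
- The method according to any one of claims 1-3, wherein generating the second video segment corresponding to the to-be-processed variable text comprises: smoothing the target image parameters corresponding to the to-be-processed variable text according to preset image parameters of the to-be-processed variable text at the boundary position, so as to improve the continuity between the target image parameters and the image parameters of the template text at the boundary position; and generating, according to the smoothed target image parameters, the second video segment corresponding to the to-be-processed variable text.
- The method according to any one of claims 1-3, wherein the first video segment comprises a first speech segment, and the second video segment comprises a second speech segment; and splicing the first video segment and the second video segment comprises: smoothing the speech sub-segments of the first speech segment and the second speech segment at the splicing position; and splicing the smoothed first speech segment and the smoothed second speech segment.
- The method according to any one of claims 1-3, wherein the image sequence corresponding to the video comprises a background image sequence and a moving image sequence; and generating the second video segment corresponding to the to-be-processed variable text comprises: generating a target moving image sequence corresponding to the to-be-processed variable text; determining, according to a preset background image sequence, a target background image sequence corresponding to the to-be-processed variable text; and fusing the target moving image sequence and the target background image sequence to obtain the second video segment corresponding to the to-be-processed variable text.
- The method according to claim 8, wherein the background images at the head and tail positions of the target background image sequence match the background images at the head and tail positions of the preset background image sequence.
- The method according to claim 8, wherein determining, according to the preset background image sequence, the target background image sequence corresponding to the to-be-processed variable text comprises: when the number of images in the preset background image sequence matches the number of images in the target moving image sequence, determining the preset background image sequence as the target background image sequence; or when the number of images in the preset background image sequence is greater than the number of images in the target moving image sequence, discarding a first background image located at a middle position from the preset background image sequence, wherein, where at least two frames of the first background image are discarded, the discarded frames are distributed discontinuously in the preset background image sequence; or when the number of images in the preset background image sequence is smaller than the number of images in the target moving image sequence, adding a second background image to the preset background image sequence.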
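- A video processing apparatus, comprising: a providing module, configured to acquire a first video segment, the first video segment corresponding to template text in a first text of a video to be generated, and the first video segment comprising a video sub-segment with a speech pause, the position of the video sub-segment corresponding to a boundary position between the template text and to-be-processed variable text in the first text; a generating module, configured to generate a second video segment corresponding to the to-be-processed variable text; and a splicing module, configured to splice the first video segment and the second video segment to obtain a video corresponding to the first text.
- The apparatus according to claim 9, wherein the apparatus further comprises: a preset video generation module, configured to generate a preset video according to the template text, preset variable text, and pause information corresponding to the boundary position, the pause information indicating a speech pause of a predetermined duration; and an extraction module, configured to extract, from the preset video, the first video segment corresponding to the template text.
- The apparatus according to claim 9 or 10, wherein the generating module comprises: a parameter determination module, configured to determine corresponding speech parameters and image parameters for the sentence in which the to-be-processed variable text is located in the first text, wherein the image parameters represent state features of a virtual object to appear in the video corresponding to the first text, and the speech parameters represent parameters for speech synthesis; a parameter extraction module, configured to extract, from the speech parameters and image parameters, target speech parameters and target image parameters corresponding to the to-be-processed variable text; and a first segment generation module, configured to generate, according to the target speech parameters and target image parameters, the second video segment corresponding to the to-be-processed variable text.
- The apparatus according to claim 9 or 10, wherein the generating module comprises: a first smoothing module, configured to smooth the target image parameters corresponding to the to-be-processed variable text according to preset image parameters of the to-be-processed variable text at the boundary position, so as to improve the continuity between the target image parameters and the image parameters of the template text at the boundary position; and a second segment generation module, configured to generate, according to the smoothed target image parameters, the second video segment corresponding to the to-be-processed variable text.
- The apparatus according to claim 9 or 10, wherein the first video segment comprises a first speech segment, and the second video segment comprises a second speech segment; and the splicing module comprises: a second smoothing module, configured to smooth the speech sub-segments of the first speech segment and the second speech segment at the splicing position; and a post-smoothing splicing module, configured to splice the smoothed first speech segment and the smoothed second speech segment.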
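- An apparatus for video processing, comprising a memory and one or more programs, wherein the one or more programs are stored in the memory and, when executed by one or more processors, implement the steps of the method according to any one of claims 1 to 10.
- A machine-readable medium storing instructions that, when executed by one or more processors, cause an apparatus to perform the video processing method according to one or more of claims 1 to 10.
- A computer program product, comprising computer instructions stored in a computer-readable storage medium; when a processor executes the computer instructions, the processor is caused to perform the method according to any one of claims 1 to 10.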
Priority Applications (3)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| JP2023554305A JP7697027B2 (ja) | 2021-09-24 | 2022-08-30 | ビデオ処理方法、装置、媒体、及びコンピュータプログラム |
| EP22871767.4A EP4404574A4 (en) | 2021-09-24 | 2022-08-30 | VIDEO PROCESSING METHOD AND APPARATUS AS WELL AS MEDIUM AND PROGRAM PRODUCT |
| US18/365,296 US20240022772A1 (en) | | 2023-08-04 | Video processing method and apparatus, medium, and program product |
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202111124169.4 | 2021-09-24 | ||
| CN202111124169.4A CN113891150B (zh) | 2021-09-24 | 2021-09-24 | 一种视频处理方法、装置和介质 |
Related Child Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| US18/365,296 Continuation US20240022772A1 (en) | Video processing method and apparatus, medium, and program product | | 2023-08-04 |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2023045716A1 (zh) | 2023-03-30 |
Family
ID=79006490
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/CN2022/115722 Ceased WO2023045716A1 (zh) | 2021-09-24 | 2022-08-30 | 视频处理方法、装置、介质和程序产品 |
Country Status (5)
| Country | Link |
|---|---|
| US (1) | US20240022772A1 (zh) |
| EP (1) | EP4404574A4 (zh) |
| JP (1) | JP7697027B2 (zh) |
| CN (1) | CN113891150B (zh) |
| WO (1) | WO2023045716A1 (zh) |
Cited By (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2025130608A1 (zh) * | 2023-12-20 | 2025-06-26 | 华为技术有限公司 | 一种数字人生成方法及相关装置 |
Families Citing this family (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN113891150B (zh) * | 2021-09-24 | 2024-10-11 | 北京搜狗科技发展有限公司 | 一种视频处理方法、装置和介质 |
| CN114707019A (zh) * | 2022-03-29 | 2022-07-05 | 北京拥抱在线科技有限公司 | 用于阅读的信息处理方法及装置 |
| CN116510308A (zh) * | 2023-04-27 | 2023-08-01 | 北京字跳网络技术有限公司 | 一种内容生成方法、装置、计算机设备及存储介质 |
| WO2025258824A1 (ko) * | 2024-06-13 | 2025-12-18 | 삼성전자주식회사 | 비디오 프레임의 추가 영역을 생성하기 위한 전자 장치, 방법, 및 비일시적 컴퓨터 판독 가능 저장 매체 |
Citations (8)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN109637518A (zh) * | 2018-11-07 | 2019-04-16 | 北京搜狗科技发展有限公司 | 虚拟主播实现方法及装置 |
| CN110381266A (zh) * | 2019-07-31 | 2019-10-25 | 百度在线网络技术(北京)有限公司 | 一种视频生成方法、装置以及终端 |
| CN110611840A (zh) * | 2019-09-03 | 2019-12-24 | 北京奇艺世纪科技有限公司 | 一种视频生成方法、装置、电子设备及存储介质 |
| US20200098396A1 (en) * | 2018-09-20 | 2020-03-26 | Autochartis Limited | Automated video generation from financial market analysis |
| CN111885416A (zh) * | 2020-07-17 | 2020-11-03 | 北京来也网络科技有限公司 | 一种音视频的修正方法、装置、介质及计算设备 |
| CN112995706A (zh) * | 2019-12-19 | 2021-06-18 | 腾讯科技(深圳)有限公司 | 基于人工智能的直播方法、装置、设备及存储介质 |
| US20210201912A1 (en) * | 2020-09-14 | 2021-07-01 | Beijing Baidu Netcom Science And Technology Co., Ltd. | Virtual Object Image Display Method and Apparatus, Electronic Device and Storage Medium |
| CN113891150A (zh) * | 2021-09-24 | 2022-01-04 | 北京搜狗科技发展有限公司 | 一种视频处理方法、装置和介质 |
Family Cites Families (33)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JPH11231899A (ja) * | 1998-02-12 | 1999-08-27 | Matsushita Electric Ind Co Ltd | 音声・動画像合成装置及び音声・動画像データベース |
| US7277855B1 (en) * | 2000-06-30 | 2007-10-02 | At&T Corp. | Personalized text-to-speech services |
| WO2010045736A1 (en) * | 2008-10-22 | 2010-04-29 | Xtranormal Technology Inc. | Reduced-latency rendering for a text-to-movie system |
| CN102123252B (zh) * | 2010-01-07 | 2016-05-04 | 新奥特(北京)视频技术有限公司 | 一种图文包装应用中随动关联播出的实现方法和装置 |
| US8818175B2 (en) * | 2010-03-08 | 2014-08-26 | Vumanity Media, Inc. | Generation of composited video programming |
| WO2012154618A2 (en) * | 2011-05-06 | 2012-11-15 | Seyyer, Inc. | Video generation based on text |
| US20120308211A1 (en) * | 2011-06-01 | 2012-12-06 | Xerox Corporation | Asynchronous personalization of records using dynamic scripting |
| GB2503878A (en) * | 2012-07-09 | 2014-01-15 | Nds Ltd | Generating interstitial scripts for video content, based on metadata related to the video content |
| US9620124B2 (en) * | 2014-02-28 | 2017-04-11 | Comcast Cable Communications, Llc | Voice enabled screen reader |
| WO2017137947A1 (en) * | 2016-02-10 | 2017-08-17 | Vats Nitin | Producing realistic talking face with expression using images text and voice |
| US10204274B2 (en) * | 2016-06-29 | 2019-02-12 | Cellular South, Inc. | Video to data |
| US10056083B2 (en) * | 2016-10-18 | 2018-08-21 | Yen4Ken, Inc. | Method and system for processing multimedia content to dynamically generate text transcript |
| US10546409B1 (en) * | 2018-08-07 | 2020-01-28 | Adobe Inc. | Animation production system |
| CN109635154B (zh) * | 2018-12-14 | 2022-11-29 | 成都索贝数码科技股份有限公司 | 一种基于文稿和新闻节目自动生成互联网图文稿件的方法 |
| US11024071B2 (en) * | 2019-01-02 | 2021-06-01 | Espiritu Technologies, Llc | Method of converting phoneme transcription data into lip sync animation data for 3D animation software |
| CN113383384A (zh) * | 2019-01-25 | 2021-09-10 | 索美智能有限公司 | 语音动画的实时生成 |
| EP3921770B1 (en) * | 2019-02-05 | 2025-07-16 | Igentify Ltd. | System and methodology for modulation of dynamic gaps in speech |
| CN109979457A (zh) * | 2019-05-29 | 2019-07-05 | 南京硅基智能科技有限公司 | 一种应用于智能对话机器人的千人千面的方法 |
| CN110324709A (zh) * | 2019-07-24 | 2019-10-11 | 新华智云科技有限公司 | 一种视频生成的处理方法、装置、终端设备及存储介质 |
| CN111508466A (zh) * | 2019-09-12 | 2020-08-07 | 马上消费金融股份有限公司 | 一种文本处理方法、装置、设备及计算机可读存储介质 |
| CN110534088A (zh) * | 2019-09-25 | 2019-12-03 | 招商局金融科技有限公司 | 语音合成方法、电子装置及存储介质 |
| US11638049B2 (en) * | 2019-10-16 | 2023-04-25 | Dish Network L.L.C. | Systems and methods for content item recognition and adaptive packet transmission |
| CN110866968B (zh) * | 2019-10-18 | 2025-02-28 | 平安科技(深圳)有限公司 | 基于神经网络生成虚拟人物视频的方法及相关设备 |
| KR102267673B1 (ko) * | 2019-12-23 | 2021-06-23 | 극동대학교 산학협력단 | 사용자 체험형 동영상 컨텐츠 자동제작방법 및 시스템 |
| CN111460785B (zh) * | 2020-03-31 | 2023-02-28 | 北京市商汤科技开发有限公司 | 交互对象的驱动方法、装置、设备以及存储介质 |
| US11012737B1 (en) * | 2020-04-27 | 2021-05-18 | Dish Network L.L.C. | Systems and methods for audio adaptation of content items to endpoint media devices |
| CN111652678B (zh) * | 2020-05-27 | 2023-11-14 | 腾讯科技(深圳)有限公司 | 物品信息显示方法、装置、终端、服务器及可读存储介质 |
| CN111883103B (zh) * | 2020-06-19 | 2021-12-24 | 马上消费金融股份有限公司 | 语音合成的方法及装置 |
| CN111741326B (zh) * | 2020-06-30 | 2023-08-18 | 腾讯科技(深圳)有限公司 | 视频合成方法、装置、设备及存储介质 |
| CN112543342B (zh) * | 2020-11-26 | 2023-03-14 | 腾讯科技(深圳)有限公司 | 虚拟视频直播处理方法及装置、存储介质、电子设备 |
| CN113051420B (zh) * | 2021-04-15 | 2022-07-05 | 山东大学 | 一种基于文本生成视频机器人视觉人机交互方法及系统 |
| CN113111812A (zh) * | 2021-04-20 | 2021-07-13 | 深圳追一科技有限公司 | 一种嘴部动作驱动模型训练方法及组件 |
| CN113421549A (zh) * | 2021-06-30 | 2021-09-21 | 平安科技(深圳)有限公司 | 语音合成方法、装置、计算机设备及存储介质 |
2021
- 2021-09-24 CN CN202111124169.4A patent/CN113891150B/zh active Active
2022
- 2022-08-30 EP EP22871767.4A patent/EP4404574A4/en active Pending
- 2022-08-30 JP JP2023554305A patent/JP7697027B2/ja active Active
- 2022-08-30 WO PCT/CN2022/115722 patent/WO2023045716A1/zh not_active Ceased
2023
- 2023-08-04 US US18/365,296 patent/US20240022772A1/en active Pending
Non-Patent Citations (1)
| Title |
|---|
| See also references of EP4404574A4 * |
Also Published As
| Publication number | Publication date |
|---|---|
| JP2024509873A (ja) | 2024-03-05 |
| CN113891150A (zh) | 2022-01-04 |
| JP7697027B2 (ja) | 2025-06-23 |
| US20240022772A1 (en) | 2024-01-18 |
| EP4404574A1 (en) | 2024-07-24 |
| CN113891150B (zh) | 2024-10-11 |
| EP4404574A4 (en) | 2025-01-01 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN113891150B (zh) | 一种视频处理方法、装置和介质 | |
| US20200279553A1 (en) | Linguistic style matching agent | |
| CN113689879B (zh) | 实时驱动虚拟人的方法、装置、电子设备及介质 | |
| CN110097890B (zh) | 一种语音处理方法、装置和用于语音处理的装置 | |
| CN110210310B (zh) | 一种视频处理方法、装置和用于视频处理的装置 | |
| US20250252282A1 (en) | Method and apparatus for driving digital human, and electronic device | |
| CN114121006A (zh) | 虚拟角色的形象输出方法、装置、设备以及存储介质 | |
| CN114999441B (zh) | 虚拟形象生成方法、装置、设备、存储介质以及程序产品 | |
| WO2021196645A1 (zh) | 交互对象的驱动方法、装置、设备以及存储介质 | |
| CN110162598B (zh) | 一种数据处理方法和装置、一种用于数据处理的装置 | |
| CN113689880B (zh) | 实时驱动虚拟人的方法、装置、电子设备及介质 | |
| CN110992942B (zh) | 一种语音识别方法、装置和用于语音识别的装置 | |
| CN110148406B (zh) | 一种数据处理方法和装置、一种用于数据处理的装置 | |
| CN113053364B (zh) | 一种语音识别方法、装置和用于语音识别的装置 | |
| CN110166844B (zh) | 一种数据处理方法和装置、一种用于数据处理的装置 | |
| CN108628819B (zh) | 处理方法和装置、用于处理的装置 | |
| CN113870828B (zh) | 音频合成方法、装置、电子设备和可读存储介质 | |
| CN112151072B (zh) | 语音处理方法、装置和介质 | |
| JPWO2018079294A1 (ja) | 情報処理装置及び情報処理方法 | |
| CN114155849A (zh) | 一种虚拟对象的处理方法、装置和介质 | |
| CN113674731A (zh) | 语音合成处理方法、装置和介质 | |
| CN115730048A (zh) | 一种会话处理方法、装置、电子设备及可读存储介质 | |
| CN112837668A (zh) | 一种语音处理方法、装置和用于处理语音的装置 | |
| CN116366872A (zh) | 基于中之人和人工智能的直播方法、装置及系统 | |
| CN114049873A (zh) | 语音克隆方法、训练方法、装置和介质 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 22871767; Country of ref document: EP; Kind code of ref document: A1 |
| | WWE | Wipo information: entry into national phase | Ref document number: 2023554305; Country of ref document: JP |
| | WWE | Wipo information: entry into national phase | Ref document number: 202347080857; Country of ref document: IN |
| | WWE | Wipo information: entry into national phase | Ref document number: 2022871767; Country of ref document: EP |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | ENP | Entry into the national phase | Ref document number: 2022871767; Country of ref document: EP; Effective date: 20240419 |