WO2023088080A1 - Method and apparatus for generating a speaking video, electronic device, and storage medium - Google Patents
- Publication number
- WO2023088080A1 (application PCT/CN2022/128584)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- key point
- image
- face
- point information
- facial
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/24—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T13/00—Animation
- G06T13/20—Three-dimensional [3D] animation
- G06T13/205—Three-dimensional [3D] animation driven by audio data
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T13/00—Animation
- G06T13/20—Three-dimensional [3D] animation
- G06T13/40—Three-dimensional [3D] animation of characters, e.g. humans, animals or virtual beings
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/20—Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
- H04N21/23—Processing of content or additional data; Elementary server operations; Server middleware
- H04N21/233—Processing of audio elementary streams
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/20—Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
- H04N21/23—Processing of content or additional data; Elementary server operations; Server middleware
- H04N21/234—Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/439—Processing of audio elementary streams
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/44—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
Definitions
- The present disclosure relates to the technical field of computer vision, and in particular to a method, apparatus, electronic device, and storage medium for generating a speaking video.
- Talking video generation technology is an important technology used in voice-driven character images and cross-modal video generation, and plays a key role in the commercialization of virtual digital objects.
- the corresponding mouth shape image is usually determined for each voice frame, giving a series of mouth shape images corresponding to the output voice, from which a speaking video is generated.
- the mouth shapes of the speaker in videos generated this way have low accuracy and change abruptly.
- An embodiment of the present disclosure provides a solution for generating a talking video.
- a method for generating a speaking video, comprising: acquiring phoneme features and acoustic features of sound driving data, the sound driving data including at least one of audio and text; obtaining, according to the phoneme features and the acoustic features, at least one set of facial key point information of a target object in a first image; obtaining, according to the at least one set of facial key point information and a second image containing the face of the target object, at least one target facial image corresponding to the sound driving data, wherein a set area including a specific part of the target object in the second image is occluded; and obtaining a speaking video of the target object according to the sound driving data and the at least one target facial image.
- acquiring the phoneme features and acoustic features of the sound driving data includes: acquiring the phonemes contained in the audio corresponding to the sound driving data and the timestamp corresponding to each phoneme, to obtain the phoneme features of the sound driving data; and performing feature extraction on the audio corresponding to the sound driving data, to obtain the acoustic features of the sound driving data.
- obtaining at least one set of facial key point information of the target object in the first image according to the phoneme features and the acoustic features includes: acquiring multiple sub-phoneme features contained in the phoneme features and the sub-acoustic features corresponding to those sub-phoneme features; and inputting each sub-phoneme feature and its corresponding sub-acoustic feature into a facial key point extraction network, to obtain the facial key point information corresponding to that sub-phoneme feature and sub-acoustic feature.
- the facial key point information includes 3D facial key point information, and before obtaining the at least one target facial image corresponding to the sound driving data according to the at least one set of facial key point information and the second image containing the face of the target object, the method further includes: projecting the 3D facial key point information onto a 2D plane to obtain 2D facial key point information corresponding to the 3D facial key point information; and updating the facial key point information with the 2D facial key point information.
- the method further includes: filtering multiple sets of facial key point information, so that the amount of change between the facial key point information of each image frame and that of its adjacent frames satisfies a set condition.
- obtaining at least one target facial image corresponding to the sound driving data according to the at least one set of facial key point information and the second image containing the face of the target object includes: inputting each set of facial key point information and the second image into a face completion network to obtain a target facial image corresponding to that facial key point information, wherein the face completion network is used to complete the occluded set area in the second image according to the facial key point information.
- obtaining the speaking video of the target object according to the sound driving data and the at least one target facial image includes: fusing the at least one target facial image with a set background image to obtain a first image sequence; and obtaining the speaking video of the target object according to the first image sequence and the audio corresponding to the sound driving data.
- the facial key point extraction network is trained using phoneme feature samples and corresponding acoustic feature samples, wherein the phoneme feature samples and the acoustic feature samples include labeled facial key point information.
- the facial key point extraction network is trained in the following manner: an initial facial key point extraction network is trained according to the phoneme feature samples and the corresponding acoustic feature samples, and training is complete when the change in the network loss meets a convergence condition, yielding the facial key point extraction network, wherein the network loss includes the difference between the facial key point information predicted by the initial facial key point extraction network and the labeled facial key point information.
- the phoneme feature sample and the acoustic feature sample are obtained by marking the object's facial key point information on the phoneme feature and the acoustic feature of an object's audio.
- the phoneme feature samples and the acoustic feature samples are obtained in the following manner: acquiring a speaking video of the object; acquiring, according to the speaking video, multiple facial images and at least one audio frame corresponding to each facial image; acquiring the phoneme features and acoustic features of the at least one audio frame corresponding to each facial image; and obtaining facial key point information from the multiple facial images and labeling the phoneme features and the acoustic features with that facial key point information, to obtain the phoneme feature samples and the acoustic feature samples.
- the face completion network is trained by means of a generative adversarial network that includes the face completion network and a first discriminative network. The network loss used in training includes: a first loss, which indicates the difference between the face completion image output by the face completion network and a complete facial image, wherein the complete facial image is the facial image corresponding to the facial key point information; and a second loss, which indicates the difference between the classification result output by the first discriminative network for an input image and the annotation information of that input image, wherein the annotation information indicates whether the input image is a face completion image output by the face completion network or a real facial image.
- the generative adversarial network further includes a second discriminative network, and the network loss used in training further includes: a third loss, which indicates the difference between the second discriminative network's result for the correspondence between the face completion image and the phoneme features, and the true correspondence.
- an apparatus for generating a speaking video, comprising: a first acquisition unit, configured to acquire phoneme features and acoustic features of sound driving data, the sound driving data including at least one of audio and text; a second acquisition unit, configured to obtain at least one set of facial key point information of a target object in a first image according to the phoneme features and the acoustic features; a first obtaining unit, configured to obtain at least one target facial image corresponding to the sound driving data according to the at least one set of facial key point information and a second image containing the face of the target object, wherein a set area including a specific part of the target object in the second image is occluded; and a second obtaining unit, configured to obtain the speaking video of the target object according to the sound driving data and the at least one target facial image.
- an electronic device, including a memory and a processor, wherein the memory is used to store computer instructions executable on the processor, and the processor, when executing the computer instructions, implements the method for generating a speaking video described in any implementation provided by the present disclosure.
- a computer-readable storage medium on which a computer program is stored, and when the program is executed by a processor, the method for generating a talking video in any implementation manner provided by the present disclosure is implemented.
- the speaking video generation method, apparatus, electronic device, and computer-readable storage medium of one or more embodiments of the present disclosure obtain at least one set of facial key point information of the target object in the first image according to the phoneme features and acoustic features of the sound driving data; obtain at least one target facial image corresponding to the sound driving data according to the at least one set of facial key point information and the second image containing the face of the target object, wherein a set area including a specific part of the target object in the second image is occluded; and finally obtain a speaking video of the target object according to the sound driving data and the at least one target facial image.
- the target facial image is generated from the facial key point information of the target object corresponding to the sound driving data and an image of the target object in which a specific part is occluded. In the resulting speaking video, the mouth shape of the target object closely matches the sound driving data and changes coherently, and the speaking state of the target object looks real and natural.
- Fig. 1 is a flowchart of a method for generating a speaking video proposed by at least one embodiment of the present disclosure;
- Fig. 2 is a flowchart of a facial key point extraction network training method proposed by at least one embodiment of the present disclosure;
- Fig. 3 is a flowchart of a sample acquisition method proposed by at least one embodiment of the present disclosure;
- Fig. 4 is a flowchart of another method for generating a speaking video proposed by at least one embodiment of the present disclosure;
- Fig. 5 is a schematic diagram of the speaking video generation method shown in Fig. 4;
- Fig. 6 is a schematic diagram of acquiring facial key point information in a method for generating a speaking video proposed by at least one embodiment of the present disclosure;
- Fig. 7 is a schematic structural diagram of a speaking video generation apparatus proposed by at least one embodiment of the present disclosure;
- Fig. 8 is a schematic structural diagram of an electronic device proposed by at least one embodiment of the present disclosure.
- At least one embodiment of the present disclosure provides a method for generating a talking video, and the method may be executed by an electronic device such as a terminal device or a server.
- the terminal device may be a fixed terminal or a mobile terminal, such as a mobile phone, a tablet computer, a game machine, a desktop computer, an advertising machine, an all-in-one machine, a vehicle-mounted device, etc.
- the server includes a local server or a cloud server.
- the method can also be implemented by a processor invoking computer-readable instructions stored in a memory.
- Fig. 1 shows a flowchart of a method for generating a speaking video according to at least one embodiment of the present disclosure. As shown in Fig. 1, the method includes steps 101-104.
- In step 101, phoneme features and acoustic features of the sound driving data are acquired.
- a phoneme is the smallest unit of speech constituting a syllable.
- the audio corresponding to the sound driving data may contain one or more phonemes, and phoneme features may include features representing the start and end times of pronunciation of each phoneme.
- the phoneme features of the sound driving data may include, for example: n[0,0.2], i3[0.2,0.4], h[0.5,0.7], ao3[0.7,1.2], where the values in [] indicate the start and end times of the pronunciation of the corresponding phoneme, in seconds for example.
- the phoneme features of the sound driving data may be obtained by acquiring the phonemes contained in the audio corresponding to the sound driving data and the time stamps corresponding to each phoneme.
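The phoneme-with-timestamp representation above can be illustrated with a minimal Python sketch. It expands the example intervals into one phoneme label per frame. The function name `phonemes_to_frames`, the frame rate, and the `"sil"` label for uncovered time are assumptions made for illustration; the disclosure does not specify them.

```python
from typing import List, Tuple

# (phoneme, start_seconds, end_seconds); values copied from the example in the text
PhonemeInterval = Tuple[str, float, float]


def phonemes_to_frames(
    intervals: List[PhonemeInterval],
    duration: float,
    frame_rate: float = 25.0,   # assumed frame rate, not given in the source
    silence: str = "sil",       # assumed label for time not covered by any phoneme
) -> List[str]:
    """Return one phoneme label per frame, sampled at each frame's centre time."""
    n_frames = int(round(duration * frame_rate))
    labels = []
    for i in range(n_frames):
        t = (i + 0.5) / frame_rate  # centre of frame i, in seconds
        label = silence
        for phoneme, start, end in intervals:
            if start <= t < end:
                label = phoneme
                break
        labels.append(label)
    return labels


intervals = [("n", 0.0, 0.2), ("i3", 0.2, 0.4), ("h", 0.5, 0.7), ("ao3", 0.7, 1.2)]
frames = phonemes_to_frames(intervals, duration=1.2, frame_rate=10.0)
print(frames)
```

At 10 frames per second the gap between 0.4 s and 0.5 s in the example falls on one frame, which receives the assumed silence label.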
- the acoustic features mainly describe the pronunciation characteristics of the audio, and include but are not limited to at least one of linear prediction parameters, Mel-frequency cepstral coefficients, perceptual linear prediction coefficients, and the like.
- the acoustic features are, for example, Mel-frequency cepstral coefficients.
- the acoustic features of the sound driving data may be obtained by performing feature extraction on the audio corresponding to the sound driving data.
- the sound driving data may be pre-stored in the electronic device that executes the speaking video generation method or in another device, or may be collected on-site by a sound collection device, and so on. The present disclosure does not restrict the source of the sound driving data.
- the sound driving data may include at least one of audio and text.
- where the sound driving data includes audio, the corresponding text (for example, text information) can be determined by performing speech recognition on the audio;
- where the sound driving data includes text, the text information can be converted into audio (for example, a speech segment) by performing speech synthesis on the text;
- the audio and the text correspond to the same pronunciation. For example, if the text is "Hello", the audio in the sound driving data is a speech segment that pronounces "Hello".
- the phoneme features of the sound driving data can be obtained by performing an alignment operation on the audio and text corresponding to the sound driving data.
- the alignment operation refers to aligning each speech segment in the audio with the phoneme in the text corresponding to the speech segment, that is, determining when the pronunciation corresponding to the text in the audio starts to be pronounced.
- the phoneme features of the sound driving data can be obtained.
- the phoneme features of the sound driving data may also be acquired in other ways, which is not limited in this embodiment of the present disclosure.
- the target object may be driven by either audio or text to generate a speaking video of the target object.
- In step 102, at least one set of facial key point information of the target object in the first image is obtained according to the phoneme features and the acoustic features.
- As the target object produces different pronunciations, its mouth shape changes accordingly; the positions of the facial key points in the mouth area, or in a set area containing the mouth area, therefore change as well.
- the phoneme features and acoustic features of an audio frame correspond to a set of facial key point information of the target object.
- the facial key point information includes the position information of the target object's facial key points (for example, key points corresponding to the facial features and the facial contour) in an image containing the face of the target object (for example, the first image).
- the information of each facial key point at the same time may be referred to as a set of facial key point information.
- the facial key point information sequence includes multiple sets of facial key point information arranged in chronological order.
- the embodiment of the present disclosure uses the acoustic features in addition to the phoneme features of the sound driving data, so that the acquired facial key point information better matches the pronunciation characteristics of the corresponding audio and the subsequently generated speaking video is more realistic.
- In step 103, at least one target facial image corresponding to the sound driving data is obtained according to the at least one set of facial key point information and the second image containing the face of the target object.
- the second image is an image containing the face of the target object; it can be obtained by performing occlusion processing on the first image, or on another image different from the first image.
- For example, if the first image is a facial image of target object A smiling, the other image may be a facial image of target object A curling their lips.
- In the second image, a set area including a specific part (for example, the mouth) of the target object is occluded. The set area includes the area in which the positions of the target object's facial key points change when the target object speaks; for example, it may be the lower half of the face, the facial area below the forehead, or the mouth area.
- the specific area may be blocked.
- the second image in which the set area is blocked may be generated by filling the set area with noise.
- performing noise filling on the set area refers to setting each pixel in the set area with randomly generated pixel values.
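A minimal sketch of the noise filling described above, assuming a grayscale image stored as a list of rows and a rectangular set area. The function name `occlude_with_noise`, the rectangle parameters, and the 0-255 value range are illustrative assumptions, not details from the disclosure.

```python
import random
from typing import List, Optional


def occlude_with_noise(
    image: List[List[int]],
    top: int,
    left: int,
    height: int,
    width: int,
    seed: Optional[int] = None,
) -> List[List[int]]:
    """Return a copy of `image` whose rectangular set area holds random pixel values."""
    rng = random.Random(seed)
    occluded = [row[:] for row in image]  # copy so the first image is left untouched
    for y in range(top, top + height):
        for x in range(left, left + width):
            occluded[y][x] = rng.randint(0, 255)  # assumed 8-bit grayscale range
    return occluded


# Hypothetical 4x4 "face" image; the lower half plays the role of the set area.
first_image = [[0] * 4 for _ in range(4)]
second_image = occlude_with_noise(first_image, top=2, left=0, height=2, width=4, seed=0)
```

Copying the input keeps the first image available, which matters if the same image is also used elsewhere in the pipeline.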
- the blocking of the set area can also be performed in other ways, which is not limited in the present disclosure.
- the occluded part (that is, the set area) of the second image can be completed, so that the distribution of facial key points in the occluded set area is consistent with the phoneme features and acoustic features of the sound driving data.
- In step 104, a speaking video of the target object is obtained according to the sound driving data and the at least one target facial image.
- In the speaking video, the sound output by the target object is the audio corresponding to the sound driving data, and the facial key point information of the target object corresponds to the phoneme features and acoustic features of that output sound.
- the mouth shapes and speaking expressions of the target object in the generated speaking video are consistent with the pronunciation, so that the audience can feel that the target object is speaking.
- At least one target facial image is generated from at least one set of facial key point information of the target object corresponding to the sound driving data and an image of the target object in which a specific part is occluded. In the resulting speaking video, the mouth shape of the target object closely matches the sound driving data and changes coherently, and the speaking state of the target object looks real and natural.
- the at least one target facial image can be fused with a set background image to obtain a first video, and the speaking video of the target object can be obtained according to the first video and the audio corresponding to the sound driving data.
- the pixels of the face area in the target face image may be used as foreground pixels to be superimposed on the set background image, so as to realize the fusion of the target face image and the set background image.
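The foreground superposition described above can be sketched as a per-pixel selection. This is a minimal sketch assuming a binary face mask is already available; the name `composite` and the mask input are assumptions, since the disclosure does not say how the face area is delimited.

```python
from typing import List


def composite(
    face: List[List[int]],
    face_mask: List[List[int]],
    background: List[List[int]],
) -> List[List[int]]:
    """Use face pixels where the mask is non-zero and background pixels elsewhere."""
    return [
        [f if m else b for f, m, b in zip(face_row, mask_row, bg_row)]
        for face_row, mask_row, bg_row in zip(face, face_mask, background)
    ]


# Hypothetical 2x2 example: 9 marks face pixels, 0 marks background pixels.
fused = composite(
    face=[[9, 9], [9, 9]],
    face_mask=[[1, 0], [0, 1]],
    background=[[0, 0], [0, 0]],
)
```

A soft (fractional) mask with alpha blending would be a natural variation, but the text only describes superimposing foreground pixels.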
- the speaking video of the target object in any background can be generated, which enriches the application scenarios of the method for generating the speaking video.
- At least one set of facial key point information of the target object in the first image corresponding to the phoneme feature and the acoustic feature may be obtained by using a facial key point extraction network.
- multiple sub-phoneme features included in the phoneme features, and the sub-acoustic features corresponding to them, may be obtained by sliding a window over the phoneme features and the acoustic features of the sound driving data.
- the phoneme feature and the acoustic feature can be divided into multiple sub-phoneme features and sub-acoustic features according to the length of the time window.
- the phoneme features and acoustic features within the time window obtained after each sliding-window step can be used as a sub-phoneme feature and a sub-acoustic feature, and the sub-phoneme feature and sub-acoustic feature in the same time window correspond to the same speech segment.
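The sliding-window split can be sketched as follows, assuming both feature streams have already been sampled at the same frame rate. The function name `sliding_windows` and the `window` and `hop` parameters are assumptions; the disclosure only states that a time window of some length is slid over both features.

```python
from typing import List, Sequence, Tuple


def sliding_windows(
    phoneme_frames: Sequence,
    acoustic_frames: Sequence,
    window: int,
    hop: int,
) -> List[Tuple[Sequence, Sequence]]:
    """Cut two aligned per-frame sequences into (sub-phoneme, sub-acoustic) pairs."""
    if len(phoneme_frames) != len(acoustic_frames):
        raise ValueError("phoneme and acoustic features must cover the same frames")
    pairs = []
    for start in range(0, len(phoneme_frames) - window + 1, hop):
        end = start + window
        # Both slices use the same time window, so they describe the same speech segment.
        pairs.append((phoneme_frames[start:end], acoustic_frames[start:end]))
    return pairs


# Hypothetical per-frame features: phoneme labels and one-dimensional acoustic vectors.
phonemes = ["a", "a", "b", "b", "c", "c"]
acoustics = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
pairs = sliding_windows(phonemes, acoustics, window=4, hop=2)
```

Each returned pair corresponds to one input to the facial key point extraction network in the description that follows.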
- the sub-phoneme features and corresponding sub-acoustic features are input to the trained facial key point extraction network to obtain facial key point information corresponding to the sub-phoneme features and the sub-acoustic features.
- multiple sub-phoneme features and corresponding multiple sub-acoustic features may be input into the facial key point extraction network in time sequence in the form of multiple sub-phoneme feature-sub-acoustic feature pairs.
- the facial key point extraction network is used to determine a corresponding set of facial key point information according to each sub-phoneme feature-sub-acoustic feature pair. After inputting all the sub-phoneme feature-sub-acoustic feature pairs into the facial key point extraction network, multiple sets of facial key point information corresponding to the voice driving data can be obtained.
- In this way, the facial key point information corresponding to each sub-phoneme feature-sub-acoustic feature pair can be obtained, so that the target object's pronunciation matches its mouth shape and speaking expression well.
- the facial key point extraction network may be a three-dimensional (3D) facial key point extraction network, in which case the output is 3D facial key point information that also includes the depth of each facial key point; it may also be a two-dimensional (2D) facial key point extraction network, in which case the output is 2D facial key point information.
- when the facial key point information is 3D facial key point information, the method further includes: projecting the 3D facial key point information onto a 2D plane to obtain the 2D facial key point information corresponding to the 3D facial key point information; and updating the facial key point information with the 2D facial key point information.
- At least one target facial image corresponding to the sound driving data is then obtained from the updated facial key point information; finally, a speaking video of the target object is obtained according to the sound driving data and the at least one target facial image.
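As a hedged illustration of the 3D-to-2D step, the sketch below uses a scaled orthographic projection, which simply drops the depth coordinate. The disclosure does not say which projection is used, so the projection model, the function name `project_orthographic`, and the `scale` and `offset` parameters are all assumptions.

```python
from typing import List, Tuple

Point3D = Tuple[float, float, float]
Point2D = Tuple[float, float]


def project_orthographic(
    points_3d: List[Point3D],
    scale: float = 1.0,
    offset: Point2D = (0.0, 0.0),
) -> List[Point2D]:
    """Project 3D key points onto the image plane by discarding depth (assumed model)."""
    return [(scale * x + offset[0], scale * y + offset[1]) for x, y, _depth in points_3d]


# Hypothetical set of two 3D facial key points.
keypoints_3d = [(1.0, 2.0, 3.0), (0.0, -1.0, 5.0)]
keypoints_2d = project_orthographic(keypoints_3d, scale=2.0, offset=(1.0, 1.0))
```

A perspective projection with camera intrinsics would be another plausible choice; the rest of the pipeline only needs the resulting 2D positions.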
- multiple sets of facial key point information can be filtered so that, in the finally obtained speaking video, the amount of change between the facial key point information of each image frame and that of its adjacent frames (the previous frame and the next frame) satisfies a set condition. The set condition may be, for example, that the change between the position of each facial key point in an image frame and the position of the corresponding facial key point in the adjacent frames is smaller than a set threshold. Jittering frames with large changes in facial key point information can be filtered out in this way, avoiding sudden changes of mouth shape in the generated speaking video.
- the moving average processing of the consecutive frames corresponding to the multiple sets of facial key point information may be implemented by performing Gaussian filtering on the multiple sets of facial key point information in a time window.
- the moving average processing refers to carrying out a weighted average of the value of the facial key point of each frame and the value of the facial key point of the adjacent frame, and updating the value of the facial key point of the frame by using the result of the weighted average.
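The Gaussian-weighted moving average and the adjacent-frame condition can be sketched together. This is a minimal sketch for 2D key points; the function names, the window radius, the value of sigma, and the renormalisation at the sequence boundaries are assumptions not stated in the disclosure.

```python
import math
from typing import List, Tuple

Frame = List[Tuple[float, float]]  # one (x, y) position per facial key point


def gaussian_kernel(radius: int, sigma: float) -> List[float]:
    """Normalised Gaussian weights for offsets -radius..radius."""
    weights = [math.exp(-(k * k) / (2.0 * sigma * sigma)) for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth_keypoints(frames: List[Frame], radius: int = 1, sigma: float = 1.0) -> List[Frame]:
    """Replace each key point by a Gaussian-weighted average over neighbouring frames."""
    kernel = gaussian_kernel(radius, sigma)
    smoothed = []
    for t in range(len(frames)):
        frame = []
        for k in range(len(frames[t])):
            x_acc = y_acc = w_acc = 0.0
            for offset, w in zip(range(-radius, radius + 1), kernel):
                s = t + offset
                if 0 <= s < len(frames):  # skip neighbours outside the sequence
                    x_acc += w * frames[s][k][0]
                    y_acc += w * frames[s][k][1]
                    w_acc += w
            frame.append((x_acc / w_acc, y_acc / w_acc))
        smoothed.append(frame)
    return smoothed


def max_adjacent_change(frames: List[Frame]) -> float:
    """Largest displacement of any key point between two consecutive frames."""
    worst = 0.0
    for prev, cur in zip(frames, frames[1:]):
        for (px, py), (cx, cy) in zip(prev, cur):
            worst = max(worst, math.hypot(cx - px, cy - py))
    return worst


# Hypothetical sequence with a single jittering frame at index 2.
raw = [[(0.0, 0.0)], [(0.0, 0.0)], [(10.0, 0.0)], [(0.0, 0.0)], [(0.0, 0.0)]]
smooth = smooth_keypoints(raw, radius=1, sigma=1.0)
```

Comparing `max_adjacent_change(smooth)` against a chosen threshold is one way to check the set condition after filtering.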
- At least one target facial image corresponding to the sound driving data may be obtained by inputting each set of facial key point information and the second image containing the face of the target object into a face completion network, which outputs a target facial image corresponding to that facial key point information; the face completion network is used to complete the occluded set area in the second image according to the facial key point information.
- Because the face completion network completes the occluded set area of the second image according to the facial key point information, the facial key points in the set area are consistent with the input facial key point information, so the target object's mouth shape and speaking expression match the voice; completing the occluded area with the face completion network can also generate high-definition target facial images.
- the facial key point extraction network can be obtained by using phoneme feature samples and acoustic feature samples for training.
- the training method may be executed by a server, and the server executing the training method may be different from the device executing the above-mentioned talking video generation method.
- FIG. 2 shows a training method for a facial key point extraction network proposed by at least one embodiment of the present disclosure. As shown in FIG. 2 , the training method includes steps 201-202.
- In step 201, a phoneme feature sample and a corresponding acoustic feature sample are acquired; the phoneme feature sample and the acoustic feature sample include labeled facial key point information of the target object.
- the phoneme feature sample and the corresponding acoustic feature sample are obtained based on the same speech segment, and the facial key point information marked in the phoneme feature sample and the corresponding acoustic feature sample are the same.
- In step 202, the initial facial key point extraction network is trained according to the phoneme feature sample and the corresponding acoustic feature sample, and training is completed when the change in the network loss meets the convergence condition, yielding the facial key point extraction network.
- the network loss includes the difference between the facial key point information predicted by the initial neural network and the labeled facial key point information.
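As a rough illustration of the loss and the stopping rule in step 202, the sketch below measures the difference between predicted and labeled key points as a mean squared error and treats training as converged once the loss stops changing. The choice of mean squared error, the tolerance `tol`, the `patience` count, and both function names are assumptions; the text only says that the loss includes "the difference" and that training stops when the change in loss meets a convergence condition.

```python
def keypoint_loss(predicted, labeled):
    """Mean squared error over all coordinates of all key points."""
    if len(predicted) != len(labeled):
        raise ValueError("predicted and labeled must have the same length")
    total = 0.0
    count = 0
    for pred_point, label_point in zip(predicted, labeled):
        for a, b in zip(pred_point, label_point):
            total += (a - b) ** 2
            count += 1
    return total / count


def has_converged(loss_history, tol=1e-4, patience=3):
    """True when the last `patience` changes in loss are all below `tol`."""
    if len(loss_history) < patience + 1:
        return False
    recent = loss_history[-(patience + 1):]
    return all(abs(recent[i + 1] - recent[i]) < tol for i in range(patience))
```

In a training loop, `keypoint_loss` would be appended to `loss_history` once per epoch and training would stop as soon as `has_converged` returns true.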
- the phoneme feature samples and the acoustic feature samples are obtained by marking the object's facial key point information on the phoneme features and acoustic features of the audio of an object.
- the phoneme feature samples and corresponding acoustic feature samples may be acquired through the method shown in FIG. 3 .
- In step 301, a talking video of the object is acquired.
- the object may be the above-mentioned target object whose talking video is to be generated, or may be a different object from the target object.
- the existing speaking video of the target object is acquired to obtain phoneme feature samples and acoustic feature samples.
- In step 302, a plurality of facial images and at least one audio frame corresponding to each of the facial images are acquired according to the talking video.
- a speech segment corresponding to the speaking video and a plurality of facial images included in the speaking video are obtained.
- the multiple audio frames in the speech segment have a corresponding relationship with the multiple facial images.
- In step 303, phoneme features and acoustic features of the at least one audio frame corresponding to each of the facial images are acquired.
- the phoneme features and acoustic features of at least one audio frame corresponding to any facial image are acquired.
- In step 304, facial key point information is obtained according to the plurality of facial images, and the phoneme features and the acoustic features are labeled according to the facial key point information to obtain the phoneme feature samples and the acoustic feature samples described above.
- in this way, phoneme feature samples and acoustic feature samples can be accurately established for the speech spoken by the target object.
- establishing the association between acoustic features and facial key point information makes it easier to train the facial key point extraction network.
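The sample construction of steps 302 to 304 can be sketched as pairing each facial image with the audio frames that fall inside its display interval, then attaching that image's key points as the label. Everything specific here is an assumption for illustration: the integer `video_fps` and `audio_rate` (audio feature frames per second), the dictionary layout of a sample, and the function names.

```python
def ceil_div(a, b):
    """Integer ceiling division for non-negative a and positive b."""
    return -(-a // b)


def audio_frame_indices(image_index, video_fps, audio_rate, num_audio_frames):
    """Indices of audio frames whose start time lies within one video frame.

    Video frame i covers [i / video_fps, (i + 1) / video_fps); audio frame j
    starts at j / audio_rate. Integer arithmetic avoids rounding drift.
    """
    first = ceil_div(image_index * audio_rate, video_fps)
    end = ceil_div((image_index + 1) * audio_rate, video_fps)
    return list(range(first, min(end, num_audio_frames)))


def build_samples(phoneme_feats, acoustic_feats, keypoints_per_image,
                  video_fps, audio_rate):
    """Label per-frame phoneme and acoustic features with face key points."""
    samples = []
    for i, keypoints in enumerate(keypoints_per_image):
        idx = audio_frame_indices(i, video_fps, audio_rate, len(acoustic_feats))
        if not idx:
            continue  # no audio left for this image
        samples.append({
            "phoneme": [phoneme_feats[j] for j in idx],
            "acoustic": [acoustic_feats[j] for j in idx],
            "keypoints": keypoints,  # the same label for both feature types
        })
    return samples
```

With 25 images per second and 100 audio feature frames per second, each image is paired with four consecutive audio frames.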
- the face completion network can be trained using a generative adversarial network.
- the training method may be executed by a server, and the server executing the training method may be different from the device executing the above-mentioned speaking video generation method.
- the generative adversarial network includes the face completion network and a first discrimination network.
- the face completion network is used to complete the input occluded face image (that is, a face image with an occluded area) according to the facial key point information to generate a face completion image, wherein the occluded face image is obtained by occluding a set area including a specific part (for example, the mouth) in a complete face image, and the complete face image may be the face image corresponding to the facial key point information; the generated face completion images and real face images are randomly input into the first discrimination network, and the first discrimination network outputs a discrimination result for the input image, that is, it judges whether the input image is a face completion image or a real face image.
- the loss used when training the face completion network with the generative adversarial network includes:
- a first loss, which indicates the difference between the face completion image output by the face completion network and the complete face image, wherein the complete face image is the face image corresponding to the facial key point information;
- a second loss, which indicates the difference between the classification result output by the first discrimination network for the input image and the annotation information of the input image, wherein the annotation information indicates whether the input image is a face completion image output by the face completion network or a real face image.
- the training is completed when the variation of the training loss satisfies the convergence condition, and the face completion network is obtained.
- training the face completion network with a generative adversarial network can improve the accuracy of the face completion images it outputs, which helps improve the image quality of the generated talking video of the target object.
- a second discrimination network, which judges whether the face completion image is aligned with the phoneme features, can also be added to assist the training of the face completion network.
- the face completion image output by the face completion network is input to the second discrimination network.
- the loss for this training also includes a third loss, which indicates the difference between the second discrimination network's discrimination result on whether the face completion image corresponds to (for example, is aligned with) the phoneme features and the true correspondence result.
- in this way, the alignment between phoneme features and facial key points is further improved, which helps improve the quality of the talking video.
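A minimal sketch of how the three losses might be computed and combined is given below. The source only describes each loss as a "difference"; the use of a mean absolute error for the first loss, binary cross-entropy for the second and third losses, the weights `w1` to `w3`, and all function names are assumptions chosen for illustration.

```python
import math


def reconstruction_loss(completed_pixels, complete_pixels):
    """First loss: mean absolute difference between the face completion
    image and the complete face image (both as flat pixel lists)."""
    diffs = [abs(a - b) for a, b in zip(completed_pixels, complete_pixels)]
    return sum(diffs) / len(diffs)


def binary_cross_entropy(prob, label, eps=1e-7):
    """Second or third loss: difference between a discriminator output
    probability and its 0/1 annotation."""
    p = min(max(prob, eps), 1.0 - eps)  # keep log() finite
    return -(label * math.log(p) + (1.0 - label) * math.log(1.0 - p))


def total_loss(first, second, third=0.0, w1=1.0, w2=1.0, w3=1.0):
    """Weighted sum of the losses; the third term is optional."""
    return w1 * first + w2 * second + w3 * third
```

For the second loss the label would encode "real face image" versus "face completion image"; for the third it would encode whether the image and the phoneme features truly correspond.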
- In step 401, an alignment operation is performed on the audio and the text corresponding to the sound driving data to obtain phoneme features of the sound driving data.
- for example, as shown in FIG. 6, the audio corresponding to the sound driving data is a speech segment of "Hello", and the text corresponding to the sound driving data is the text "Hello".
- the phoneme features of the sound driving data are obtained.
- where the sound driving data includes only audio, the text corresponding to the audio can be determined by performing speech recognition on the audio; where the sound driving data includes only text, audio corresponding to the text can be obtained by performing speech synthesis on the text.
- In step 402, feature extraction is performed on the audio corresponding to the sound driving data to obtain the Mel cepstrum features of the sound driving data, that is, Mel-frequency cepstral coefficients.
- In step 403, a time window is slid over the phoneme features and the acoustic features of the sound driving data to obtain a plurality of sub-phoneme features contained in the phoneme features and the sub-acoustic features corresponding to the plurality of sub-phoneme features.
- the time window is shown as a dotted box in FIG. 6 , and the arrow shows the sliding direction of the time window.
- the phoneme features and acoustic features obtained in each time window are sub-phoneme features and sub-acoustic features, and the sub-phoneme features and sub-acoustic features in the same time window correspond to the same speech segment.
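The sliding time window can be sketched as below. The example assumes that the phoneme features and acoustic features are frame-aligned sequences of equal length, and that the window length and stride are expressed in frames; these details, like the function name, are not specified in the source.

```python
def sliding_windows(phoneme_feats, acoustic_feats, window, stride=1):
    """Cut aligned feature sequences into (sub-phoneme, sub-acoustic) pairs.

    Both slices in a pair cover the same frame range, so they correspond
    to the same speech segment.
    """
    if len(phoneme_feats) != len(acoustic_feats):
        raise ValueError("feature sequences must be frame-aligned")
    pairs = []
    for start in range(0, len(phoneme_feats) - window + 1, stride):
        end = start + window
        pairs.append((phoneme_feats[start:end], acoustic_feats[start:end]))
    return pairs
```

Each returned pair is one input to the facial key point extraction network, which then outputs the key point information for that window.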
- In step 404, the sub-phoneme features and the corresponding sub-acoustic features are input to the trained facial key point extraction network to obtain facial key point information corresponding to the sub-phoneme features and the sub-acoustic features.
- the face key point extraction network outputs face key point information corresponding to the time window.
- the facial key point extraction network is a 3D facial key point extraction network, and correspondingly, the obtained facial key point information is 3D facial key point information.
- In step 405, 2D facial key point information corresponding to the 3D facial key point information is acquired.
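One simple way to obtain 2D key points from 3D key points is an orthographic projection that drops the depth coordinate. The source does not state which projection is used, so the orthographic model, the `scale` and `offset` parameters, and the function name below are assumptions.

```python
def project_to_2d(points_3d, scale=1.0, offset=(0.0, 0.0)):
    """Orthographically project (x, y, z) key points onto the image plane.

    The depth z is discarded; scale and offset map model coordinates to
    pixel coordinates.
    """
    return [(scale * x + offset[0], scale * y + offset[1])
            for x, y, _z in points_3d]
```

For example, `project_to_2d([(1.0, 2.0, 3.0)], scale=2.0, offset=(10.0, 20.0))` yields `[(12.0, 24.0)]`.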
- In step 406, filtering is performed on the multiple sets of 2D facial key point information, so that the change between the 2D facial key point information of each image frame and that of adjacent frames meets the set condition.
- In step 407, each set of filtered 2D facial key point information and the second image are input to the face completion network to obtain a target facial image corresponding to the 2D facial key point information, wherein the second image is an occluded face image in which the lower half of the face is filled with noise for occlusion.
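Preparing the occluded second image can be sketched as overwriting the lower half of a face crop with random noise. The example below is only an illustration: it assumes a grayscale image stored as a list of pixel rows with values 0 to 255, uniform noise, and a split at the middle row, none of which is specified in the source.

```python
import random


def occlude_lower_half(image, seed=None):
    """Return a copy of `image` whose lower half is filled with noise.

    image: list of rows, each a list of integer pixel values in 0..255.
    The input image is left unchanged.
    """
    rng = random.Random(seed)
    height = len(image)
    occluded = [row[:] for row in image]
    for r in range(height // 2, height):
        occluded[r] = [rng.randint(0, 255) for _ in occluded[r]]
    return occluded
```

The upper half keeps the identity and pose cues of the target object, while the noisy region is what the face completion network fills in from the key points.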
- In step 408, the multiple frames of target facial images obtained in step 407 (for example, the speaker's facial images) are fused with the background image to obtain a first image sequence.
- In step 409, a talking video of the target object is obtained according to the first image sequence and the audio corresponding to the sound driving data.
- Fig. 7 is a schematic structural diagram of a talking video generating device proposed by at least one embodiment of the present disclosure. As shown in Fig. 7, the device includes: a first acquiring unit 701, configured to acquire phoneme features and acoustic features of sound driving data, where the sound driving data includes at least one of audio and text; a second acquiring unit 702, configured to acquire at least one set of facial key point information of the target object in the first image according to the phoneme features and the acoustic features; a first obtaining unit 703, configured to obtain at least one target facial image corresponding to the sound driving data according to the at least one set of facial key point information and the second image containing the target object's face, where a set area including a specific part of the target object in the second image is occluded; and a second obtaining unit 704, configured to obtain a talking video of the target object according to the sound driving data and the at least one target facial image.
- the first acquiring unit is specifically configured to: acquire the phonemes contained in the audio corresponding to the sound driving data and the timestamp corresponding to each phoneme to obtain the phoneme features of the sound driving data; and perform feature extraction on the audio corresponding to the sound driving data to obtain the acoustic features of the sound driving data.
- the second acquiring unit is specifically configured to: acquire a plurality of sub-phoneme features included in the phoneme features and sub-acoustic features corresponding to the plurality of sub-phoneme features; and input the sub-phoneme features and the corresponding sub-acoustic features to the facial key point extraction network to obtain facial key point information corresponding to the sub-phoneme features and the sub-acoustic features.
- the facial key point information includes 3D facial key point information, and the device further includes a projection unit configured to: before the at least one target facial image corresponding to the sound driving data is obtained according to the at least one set of facial key point information and the second image containing the face of the target object, project the 3D facial key point information onto a 2D plane to obtain 2D facial key point information corresponding to the 3D facial key point information; and update the facial key point information with the 2D facial key point information.
- the device further includes a filtering unit configured to: before the at least one target facial image corresponding to the sound driving data is obtained according to the at least one set of facial key point information and the second image containing the face of the target object, filter the multiple sets of facial key point information, so that the change between the facial key point information of each image frame and that of adjacent frames satisfies the set condition.
- the first obtaining unit is specifically configured to: input each set of facial key point information and the second image to the face completion network to obtain the target face image corresponding to the facial key point information, wherein the face completion network is used to complete the occluded set area in the second image according to the facial key point information.
- the second obtaining unit is specifically configured to: fuse the at least one target face image with a set background image to obtain a first image sequence; and obtain the talking video of the target object according to the first image sequence and the audio corresponding to the sound driving data.
- the facial key point extraction network is trained using phoneme feature samples and corresponding acoustic feature samples, wherein the phoneme feature samples and the acoustic feature samples include labeled facial key point information.
- the facial key point extraction network is trained in the following manner: the initial facial key point extraction network is trained according to the phoneme feature samples and the corresponding acoustic feature samples, and training is completed when the change in the network loss meets the convergence condition, yielding the facial key point extraction network, wherein the network loss includes the difference between the facial key point information predicted by the initial neural network and the labeled facial key point information.
- the phoneme feature sample and the acoustic feature sample are obtained by marking the object's facial key point information on the phoneme feature and the acoustic feature of an object's audio.
- the phoneme feature sample and the acoustic feature sample are obtained in the following manner: acquiring a talking video of the object; acquiring, according to the talking video, a plurality of facial images and at least one audio frame corresponding to each facial image; obtaining phoneme features and acoustic features of the at least one audio frame corresponding to each facial image; and obtaining facial key point information according to the plurality of facial images, and labeling the phoneme features and the acoustic features according to the facial key point information to obtain the phoneme feature sample and the acoustic feature sample.
- the face completion network is trained using a generative adversarial network, the generative adversarial network includes the face completion network and a first discrimination network, and the network loss used in training includes: a first loss, which indicates the difference between the face completion image output by the face completion network and the complete face image, wherein the complete face image is the face image corresponding to the facial key point information; and a second loss, which indicates the difference between the classification result output by the first discrimination network for the input image and the annotation information of the input image, wherein the annotation information indicates whether the input image is a face completion image output by the face completion network or a real face image.
- the generative adversarial network further includes a second discrimination network, and the network loss used in training further includes a third loss, which indicates the difference between the second discrimination network's discrimination result on whether the face completion image corresponds to the phoneme features and the true correspondence result.
- At least one embodiment of the present disclosure also provides an electronic device. As shown in FIG. 8, the device includes a memory and a processor, the memory is used to store computer instructions that can be run on the processor, and the processor is used to implement the method for generating a talking video in any embodiment of the present disclosure when executing the computer instructions.
- At least one embodiment of the present disclosure further provides a computer-readable storage medium, on which a computer program is stored, and when the program is executed by a processor, the method for generating a talking video in any embodiment of the present disclosure is implemented.
- one or more embodiments of this specification may be provided as a method, system or computer program product. Accordingly, one or more embodiments of the present description may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, one or more embodiments of the present description may take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
- each embodiment in this specification is described in a progressive manner, the same and similar parts of each embodiment can be referred to each other, and each embodiment focuses on the differences from other embodiments.
- the description is relatively simple, and for relevant parts, please refer to part of the description of the method embodiment.
- Embodiments of the subject matter and functional operations described in this specification can be implemented in digital electronic circuitry, tangibly embodied computer software or firmware, computer hardware including the structures disclosed in this specification and their structural equivalents, or in a combination of one or more of them.
- Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, that is, one or more modules of computer program instructions encoded on a tangible, non-transitory program carrier for execution by, or to control the operation of, data processing apparatus.
- the program instructions may be encoded on an artificially generated propagated signal, such as a machine-generated electrical, optical or electromagnetic signal, which is generated to encode information for transmission to a suitable receiver device for execution by the data processing apparatus.
- a computer storage medium may be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them.
- the processes and logic flows described in this specification can be performed by one or more programmable computers executing one or more computer programs to perform corresponding functions by operating on input data and generating output.
- the processes and logic flows can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, such as an FPGA (Field Programmable Gate Array) or an ASIC (Application Specific Integrated Circuit).
- Computers suitable for the execution of a computer program include, for example, general and/or special purpose microprocessors, or any other type of central processing unit.
- a central processing unit will receive instructions and data from a read only memory and/or a random access memory.
- the basic components of a computer include a central processing unit for implementing or executing instructions and one or more memory devices for storing instructions and data.
- a computer will also include, or be operatively coupled to, one or more mass storage devices for storing data, such as magnetic or magneto-optical disks, or optical disks, to receive data from them, transmit data to them, or both.
- a computer may be embedded in another device such as a mobile phone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device such as a Universal Serial Bus (USB) flash drive, to name a few.
- Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media, and memory devices, including, for example, semiconductor memory devices (such as EPROM, EEPROM, and flash memory devices), magnetic disks (such as internal hard disks or removable disks), magneto-optical disks, and CD ROM and DVD-ROM disks.
- the processor and memory can be supplemented by, or incorporated in, special purpose logic circuitry.
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Multimedia (AREA)
- Theoretical Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- General Physics & Mathematics (AREA)
- Signal Processing (AREA)
- Data Mining & Analysis (AREA)
- Acoustics & Sound (AREA)
- Human Computer Interaction (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- General Engineering & Computer Science (AREA)
- Biophysics (AREA)
- Software Systems (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Mathematical Physics (AREA)
- Computing Systems (AREA)
- Molecular Biology (AREA)
- General Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Evolutionary Biology (AREA)
- Processing Or Creating Images (AREA)
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202111386695.8A CN114093384B (zh) | 2021-11-22 | 2021-11-22 | 说话视频生成方法、装置、设备以及存储介质 |
| CN202111386695.8 | 2021-11-22 | | |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2023088080A1 (fr) | 2023-05-25 |
Family
ID=80302733
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/CN2022/128584 Ceased WO2023088080A1 (fr) | 2021-11-22 | 2022-10-31 | Procédé et appareil de génération de vidéo parlante, dispositif électronique et support de stockage |
Country Status (2)
| Country | Link |
|---|---|
| CN (1) | CN114093384B (fr) |
| WO (1) | WO2023088080A1 (fr) |
Cited By (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN117079664A (zh) * | 2023-08-16 | 2023-11-17 | 北京百度网讯科技有限公司 | 口型驱动及其模型训练方法、装置、设备和介质 |
| CN120125724A (zh) * | 2025-03-03 | 2025-06-10 | 广州趣丸网络科技有限公司 | 一种虚拟对象的口型驱动方法、装置、设备和介质 |
| WO2025213838A1 (fr) * | 2024-04-08 | 2025-10-16 | 北京字跳网络技术有限公司 | Procédé et appareil de traitement de données, dispositif, support et produit |
Families Citing this family (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN114093384B (zh) * | 2021-11-22 | 2025-07-18 | 上海商汤科技开发有限公司 | 说话视频生成方法、装置、设备以及存储介质 |
| CN115620371A (zh) * | 2022-10-25 | 2023-01-17 | 贝壳找房(北京)科技有限公司 | 说话视频生成模型的训练方法、装置、电子设备及存储介质 |
| CN115984427B (zh) * | 2022-12-08 | 2024-05-17 | 上海积图科技有限公司 | 基于音频的动画合成方法、装置、设备及存储介质 |
| CN117373455B (zh) * | 2023-12-04 | 2024-03-08 | 翌东寰球(深圳)数字科技有限公司 | 一种音视频的生成方法、装置、设备及存储介质 |
| CN120451033A (zh) * | 2024-02-08 | 2025-08-08 | 北京字跳网络技术有限公司 | 一种图像生成方法、装置、设备、介质、产品 |
| CN118695051B (zh) * | 2024-08-26 | 2024-11-22 | 腾讯科技(深圳)有限公司 | 数据生成方法、装置、产品、设备和介质 |
Citations (13)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| EP0992933A2 (fr) * | 1998-10-09 | 2000-04-12 | Mitsubishi Denki Kabushiki Kaisha | Méthode pour générer directement des animations faciales réalistes directement à partir de la parole en utilisant des chaínes de Markov cachées |
| CN109308731A (zh) * | 2018-08-24 | 2019-02-05 | 浙江大学 | 级联卷积lstm的语音驱动唇形同步人脸视频合成算法 |
| CN110555507A (zh) * | 2019-10-22 | 2019-12-10 | 深圳追一科技有限公司 | 虚拟机器人的交互方法、装置、电子设备及存储介质 |
| CN110677598A (zh) * | 2019-09-18 | 2020-01-10 | 北京市商汤科技开发有限公司 | 视频生成方法、装置、电子设备和计算机存储介质 |
| CN111432233A (zh) * | 2020-03-20 | 2020-07-17 | 北京字节跳动网络技术有限公司 | 用于生成视频的方法、装置、设备和介质 |
| CN111741326A (zh) * | 2020-06-30 | 2020-10-02 | 腾讯科技(深圳)有限公司 | 视频合成方法、装置、设备及存储介质 |
| CN112215926A (zh) * | 2020-09-28 | 2021-01-12 | 北京华严互娱科技有限公司 | 一种语音驱动的人脸动作实时转移方法和系统 |
| CN112562722A (zh) * | 2020-12-01 | 2021-03-26 | 新华智云科技有限公司 | 基于语义的音频驱动数字人生成方法及系统 |
| CN112667068A (zh) * | 2019-09-30 | 2021-04-16 | 北京百度网讯科技有限公司 | 虚拟人物的驱动方法、装置、设备及存储介质 |
| CN113228163A (zh) * | 2019-01-18 | 2021-08-06 | 斯纳普公司 | 基于文本和音频的实时面部再现 |
| US20210312915A1 (en) * | 2020-04-06 | 2021-10-07 | Hi Auto LTD. | System and method for audio-visual multi-speaker speech separation with location-based selection |
| CN113542624A (zh) * | 2021-05-28 | 2021-10-22 | 阿里巴巴新加坡控股有限公司 | 生成商品对象讲解视频的方法及装置 |
| CN114093384A (zh) * | 2021-11-22 | 2022-02-25 | 上海商汤科技开发有限公司 | 说话视频生成方法、装置、设备以及存储介质 |
Family Cites Families (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN108234735A (zh) * | 2016-12-14 | 2018-06-29 | 中兴通讯股份有限公司 | 一种媒体显示方法及终端 |
| CN111369967B (zh) * | 2020-03-11 | 2021-03-05 | 北京字节跳动网络技术有限公司 | 基于虚拟人物的语音合成方法、装置、介质及设备 |
| CN112750187B (zh) * | 2021-01-19 | 2025-04-01 | 腾讯科技(深圳)有限公司 | 一种动画生成方法、装置、设备及计算机可读存储介质 |
- 2021-11-22: CN, application CN202111386695.8A, patent CN114093384B (zh), status Active
- 2022-10-31: WO, application PCT/CN2022/128584, patent WO2023088080A1 (fr), status Ceased
Patent Citations (16)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| EP0992933A2 (fr) * | 1998-10-09 | 2000-04-12 | Mitsubishi Denki Kabushiki Kaisha | Méthode pour générer directement des animations faciales réalistes directement à partir de la parole en utilisant des chaínes de Markov cachées |
| JP2000123192A (ja) * | 1998-10-09 | 2000-04-28 | Mitsubishi Electric Inf Technol Center America Inc | 顔面アニメ―ション生成方法 |
| US6735566B1 (en) * | 1998-10-09 | 2004-05-11 | Mitsubishi Electric Research Laboratories, Inc. | Generating realistic facial animation from speech |
| JP3633399B2 (ja) * | 1998-10-09 | 2005-03-30 | ミツビシ・エレクトリック・リサーチ・ラボラトリーズ・インコーポレイテッド | 顔面アニメーション生成方法 |
| CN109308731A (zh) * | 2018-08-24 | 2019-02-05 | 浙江大学 | 级联卷积lstm的语音驱动唇形同步人脸视频合成算法 |
| CN113228163A (zh) * | 2019-01-18 | 2021-08-06 | 斯纳普公司 | 基于文本和音频的实时面部再现 |
| CN110677598A (zh) * | 2019-09-18 | 2020-01-10 | 北京市商汤科技开发有限公司 | 视频生成方法、装置、电子设备和计算机存储介质 |
| CN112667068A (zh) * | 2019-09-30 | 2021-04-16 | 北京百度网讯科技有限公司 | 虚拟人物的驱动方法、装置、设备及存储介质 |
| CN110555507A (zh) * | 2019-10-22 | 2019-12-10 | 深圳追一科技有限公司 | 虚拟机器人的交互方法、装置、电子设备及存储介质 |
| CN111432233A (zh) * | 2020-03-20 | 2020-07-17 | 北京字节跳动网络技术有限公司 | 用于生成视频的方法、装置、设备和介质 |
| US20210312915A1 (en) * | 2020-04-06 | 2021-10-07 | Hi Auto LTD. | System and method for audio-visual multi-speaker speech separation with location-based selection |
| CN111741326A (zh) * | 2020-06-30 | 2020-10-02 | 腾讯科技(深圳)有限公司 | 视频合成方法、装置、设备及存储介质 |
| CN112215926A (zh) * | 2020-09-28 | 2021-01-12 | 北京华严互娱科技有限公司 | 一种语音驱动的人脸动作实时转移方法和系统 |
| CN112562722A (zh) * | 2020-12-01 | 2021-03-26 | 新华智云科技有限公司 | 基于语义的音频驱动数字人生成方法及系统 |
| CN113542624A (zh) * | 2021-05-28 | 2021-10-22 | 阿里巴巴新加坡控股有限公司 | 生成商品对象讲解视频的方法及装置 |
| CN114093384A (zh) * | 2021-11-22 | 2022-02-25 | 上海商汤科技开发有限公司 | 说话视频生成方法、装置、设备以及存储介质 |
Also Published As
| Publication number | Publication date |
|---|---|
| CN114093384B (zh) | 2025-07-18 |
| CN114093384A (zh) | 2022-02-25 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| WO2023088080A1 (fr) | Procédé et appareil de génération de vidéo parlante, dispositif électronique et support de stockage | |
| CN113077537B (zh) | 一种视频生成方法、存储介质及设备 | |
| CN113299312B (zh) | 一种图像生成方法、装置、设备以及存储介质 | |
| US11587548B2 (en) | Text-driven video synthesis with phonetic dictionary | |
| CN112529992B (zh) | 虚拟形象的对话处理方法、装置、设备及存储介质 | |
| US11968433B2 (en) | Systems and methods for generating synthetic videos based on audio contents | |
| JP7227395B2 (ja) | インタラクティブ対象の駆動方法、装置、デバイス、及び記憶媒体 | |
| CN111459454B (zh) | 交互对象的驱动方法、装置、设备以及存储介质 | |
| TW202248994A (zh) | 互動對象驅動和音素處理方法、設備以及儲存媒體 | |
| WO2013031677A1 (fr) | Dispositif de visualisation de mouvement de prononciation et dispositif d'apprentissage de prononciation | |
| CN115497448A (zh) | 语音动画的合成方法、装置、电子设备及存储介质 | |
| JP2015038725A (ja) | 発話アニメーション生成装置、方法、及びプログラム | |
| CN115550744A (zh) | 一种语音生成视频的方法和装置 | |
| Kadam et al. | A Survey of Audio Synthesis and Lip-syncing for Synthetic Video Generation. | |
| CN114255737B (zh) | 语音生成方法、装置、电子设备 | |
| Kolivand et al. | Realistic lip syncing for virtual character using common viseme set | |
| Hussen Abdelaziz et al. | Speaker-independent speech-driven visual speech synthesis using domain-adapted acoustic models | |
| CN112634861A (zh) | 数据处理方法、装置、电子设备和可读存储介质 | |
| CN113223555A (zh) | 视频生成方法、装置、存储介质及电子设备 | |
| Mahavidyalaya | Phoneme and viseme based approach for lip synchronization | |
| CN115174826A (zh) | 一种音视频合成方法及装置 | |
| CN115797515A (zh) | 一种语音生成和表情驱动方法、客户端及服务端 | |
| Narwekar et al. | PRAV: A Phonetically Rich Audio Visual Corpus. | |
| HK40061479A (en) | Method, equipment, device, and storage medium for generating speech videos | |
| Medina | Talking us into the Metaverse: Towards Realistic Streaming Speech-to-Face Animation |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 22894617; Country of ref document: EP; Kind code of ref document: A1 |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | 122 | Ep: pct application non-entry in european phase | Ref document number: 22894617; Country of ref document: EP; Kind code of ref document: A1 |
| | 32PN | Ep: public notification in the ep bulletin as address of the adressee cannot be established | Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 05.11.2024) |
| | 122 | Ep: pct application non-entry in european phase | Ref document number: 22894617; Country of ref document: EP; Kind code of ref document: A1 |