WO2024248736A1 - Video Generation Method, Method and Device for Training Video Generation Model - Google Patents
Video Generation Method, Method and Device for Training Video Generation Model
- Publication number
- WO2024248736A1 (PCT/SG2024/050361)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- condition
- local
- sequence
- feature representation
- sample
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T9/00—Image coding
- G06T9/002—Image coding using neural networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G06N3/0455—Auto-encoder networks; Encoder-decoder networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/047—Probabilistic or stochastic networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0475—Generative networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/088—Non-supervised learning, e.g. competitive learning
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/0895—Weakly supervised learning, e.g. semi-supervised or self-supervised learning
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/094—Adversarial learning
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T9/00—Image coding
- G06T9/001—Model-based coding, e.g. wire frame
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/62—Extraction of image or video features relating to a temporal dimension, e.g. time-based feature extraction; Pattern tracking
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/774—Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
Definitions
- Video Generation Method, Method and Device for Training Video Generation Model
- Cross-reference
- This disclosure claims priority to Chinese patent application No. 202310618707.8, filed with the Chinese Patent Office on May 29, 2023 and entitled "Video Generation Method, Method and Device for Training Video Generation Model", the entire contents of which are incorporated into this disclosure by reference.
- Technical Field
- The present disclosure relates to the field of computer vision technology, and in particular to a video generation method and a method and device for training a video generation model.
- Background
- Text-based video generation technology provides many content creators with brand-new tools, making the creation of video content that originally required professionals and expensive equipment easier and more cost-effective.
- the present disclosure provides a video generation method, a method and device for training a video generation model, so as to improve the quality of the generated videos.
- a video generation method comprising: obtaining local conditions, the local conditions comprising spatial conditions and/or temporal conditions; encoding the local conditions to obtain local condition feature representation; integrating a noise sequence with the local condition feature representation to obtain a noise latent vector sequence; performing denoising on the noise latent vector sequence using a diffusion model to obtain a denoised latent vector sequence; performing decoding on the denoised latent vector sequence to obtain a video.
- the method further comprises: obtaining global conditions, the global conditions comprising at least one of text conditions, style conditions and color conditions; encoding the global conditions to obtain a global condition feature representation; the denoising of the noise latent vector sequence using a diffusion model comprises: the diffusion model performs cross-attention processing on the noise latent vector sequence using the global condition feature representation to predict noise, and performs denoising using the predicted noise.
- the spatial conditions comprise at least one of a single image and a single semantic sketch; the temporal conditions comprise at least one of a motion vector sequence, a depth map sequence, a mask map sequence, a semantic sketch sequence and a grayscale map sequence.
- encoding the global condition to obtain a global condition feature representation includes: text encoding the text condition to obtain a text feature representation, and image encoding the style condition and/or color condition to obtain an image feature representation; integrating the text feature representation and the image feature representation to obtain a global condition feature representation.
- encoding the local condition to obtain a local condition feature representation includes: encoding each local condition using each spatiotemporal condition encoder respectively to obtain a feature tensor corresponding to each local condition, wherein the spatiotemporal condition encoder corresponds to the local condition one by one; fusing the feature tensors corresponding to each local condition to obtain the local condition feature representation.
- the encoding of each local condition respectively using a spatiotemporal condition encoder includes: using a spatiotemporal condition encoder to perform spatial feature encoding on the local condition to obtain a spatial feature representation of the local condition; if the local condition is a sequence, performing a temporal self-attention process on the spatial feature representation of the local condition to obtain a feature tensor corresponding to the local condition; otherwise, copying the spatial feature representation of the local condition in a temporal sequence to generate a spatial feature representation in a temporal sequence, performing a temporal self-attention process on the spatial feature representation in the temporal sequence, and obtaining a feature tensor corresponding to the local condition.
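The branch described above (sequence versus single-frame condition) can be sketched as follows. This is a minimal illustration, not the disclosed encoder: `spatial_encoder` is a hypothetical stand-in for the spatial feature extractor, the temporal self-attention uses a single head with identity query/key/value projections, and fusion by element-wise summation is an assumption, since the text does not name the fusion operator.

```python
import math

def temporal_self_attention(frames):
    """Single-head self-attention along time; identity Q/K/V projections (assumption)."""
    dim = len(frames[0])
    attended = []
    for query in frames:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in frames]
        peak = max(scores)  # subtract the maximum for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        attended.append([
            sum(w * value[i] for w, value in zip(weights, frames)) / total
            for i in range(dim)
        ])
    return attended

def encode_local_condition(condition, is_sequence, num_frames, spatial_encoder):
    """Return a feature tensor shaped [num_frames][dim] for one local condition."""
    if is_sequence:
        spatial = [spatial_encoder(frame) for frame in condition]
    else:
        # A single image or single sketch is copied along the time axis.
        feature = spatial_encoder(condition)
        spatial = [list(feature) for _ in range(num_frames)]
    return temporal_self_attention(spatial)

def fuse(feature_tensors):
    """Fuse per-condition tensors by element-wise sum (assumption)."""
    num_frames, dim = len(feature_tensors[0]), len(feature_tensors[0][0])
    return [
        [sum(tensor[t][d] for tensor in feature_tensors) for d in range(dim)]
        for t in range(num_frames)
    ]

def identity_encoder(frame):
    """Hypothetical stand-in: treats the raw frame values as spatial features."""
    return [float(x) for x in frame]

single_image = [1.0, 2.0]
depth_sequence = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]
image_features = encode_local_condition(single_image, False, 3, identity_encoder)
depth_features = encode_local_condition(depth_sequence, True, 3, identity_encoder)
local_condition_features = fuse([image_features, depth_features])
```

Because the replicated single image yields identical frames, its attention output is the same feature at every time step, which is what keeps it aligned with the sequence conditions before fusion.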
- a method for training a video generation model comprising: obtaining first training data including a plurality of first training samples, the first training samples comprising local condition samples, the local condition samples comprising spatial condition samples and/or temporal condition samples; using the first training data to train a video generation model, the video generation model comprising: a spatiotemporal condition encoder, a diffusion model and a decoder; wherein the spatiotemporal condition encoder encodes the local condition samples to obtain a local condition feature representation; the diffusion model performs denoising on a noise latent vector sequence to obtain a denoised latent vector sequence, the noise latent vector sequence being obtained by integrating the noise sequence with the local condition feature representation; the decoder performs decoding processing using the denoised latent vector sequence to obtain a video; the training objective comprises: minimizing the difference between the noise predicted by the diffusion model when performing denoising and the Gaussian noise.
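The training objective above can be illustrated with the noise-prediction loss commonly used for diffusion models. The sketch below assumes a mean squared error as the "difference" and the usual forward-noising rule z_t = sqrt(a_t) * z_0 + sqrt(1 - a_t) * eps; neither choice is stated in the text, and `predict_noise` is a hypothetical placeholder for the diffusion model.

```python
import math
import random

def add_noise(clean_latent, noise, alpha_bar):
    """Forward diffusion: mix the clean latent with Gaussian noise (standard form, assumed)."""
    keep, spread = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [keep * z + spread * e for z, e in zip(clean_latent, noise)]

def noise_prediction_loss(predicted_noise, true_noise):
    """Mean squared error between predicted and sampled noise (assumed form of the difference)."""
    return sum((p - e) ** 2 for p, e in zip(predicted_noise, true_noise)) / len(true_noise)

rng = random.Random(0)
clean_latent = [0.5, -1.0, 2.0, 0.0]
true_noise = [rng.gauss(0.0, 1.0) for _ in clean_latent]
noisy_latent = add_noise(clean_latent, true_noise, alpha_bar=0.6)

def predict_noise(noisy, alpha_bar):
    """Hypothetical oracle that knows the clean latent; a real model is a trained network."""
    keep, spread = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [(n - keep * z) / spread for n, z in zip(noisy, clean_latent)]

perfect_loss = noise_prediction_loss(predict_noise(noisy_latent, 0.6), true_noise)
zero_guess_loss = noise_prediction_loss([0.0] * len(true_noise), true_noise)
```

A predictor that recovers the sampled noise exactly drives the loss to (numerically) zero, while a predictor that always outputs zero is left with the mean squared magnitude of the noise.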
- the first training sample also comprises a global condition sample
- the global condition sample comprises at least one of a text condition sample, a style condition sample and a color condition sample
- the video generation model further includes a global encoder, and the global encoder encodes the global condition sample to obtain a global condition feature representation
- the diffusion model performs denoising on the noise latent vector sequence, including: the diffusion model performs cross-attention processing on the noise latent vector sequence using the global condition feature representation to predict noise, and performs denoising using the predicted noise.
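The cross-attention step can be sketched as below: each noisy latent token acts as a query, and the global condition features act as keys and values. A single head with identity projections and matching feature dimensions are simplifying assumptions, and the variable names are illustrative only.

```python
import math

def cross_attention(latent_tokens, condition_tokens):
    """Each latent token (query) attends over the global condition tokens (keys and values)."""
    dim = len(condition_tokens[0])
    outputs = []
    for query in latent_tokens:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in condition_tokens
        ]
        peak = max(scores)  # subtract the maximum for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(w * value[i] for w, value in zip(weights, condition_tokens)) / total
            for i in range(dim)
        ])
    return outputs

noisy_latents = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
global_condition = [[4.0, 0.0], [0.0, 4.0]]
attended = cross_attention(noisy_latents, global_condition)
```

In this toy input the first latent token aligns with the first condition token and therefore draws most of its output from it, while the third token is equally similar to both and receives their average.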
- the global encoder and the decoder use parameters obtained by pre-training, and in each round of iteration of the training, update the parameters of the spatiotemporal condition encoder and the diffusion model using a loss function corresponding to the training target.
- the acquisition of first training data including a plurality of first training samples includes: acquiring a video sample; acquiring a description text, a style image and/or a color histogram of the video sample as the text condition sample, the style condition sample and/or the color condition sample respectively; extracting at least one of a single image and a single semantic sketch as the spatial condition sample; extracting at least one of a motion vector sequence, a depth map sequence, a mask map sequence, a semantic sketch sequence and/or a grayscale map sequence as the temporal condition sample.
- the pre-training of the diffusion model includes: obtaining second training data including a plurality of second training samples, the second training samples including: extracting description text from a video sample as a text sample; encoding the text sample using a global encoder to obtain a text feature representation; inputting the text feature representation and a noise sequence into a diffusion model to train the diffusion model, the diffusion model performing denoising on the noise sequence using the text feature representation; the training objectives include: minimizing the difference between the noise predicted by the diffusion model when performing denoising at each time step and the Gaussian noise.
- the spatiotemporal condition encoder encodes the local condition sample to obtain the local condition feature representation, including: using each spatiotemporal condition encoder to encode each local condition sample to obtain a feature tensor corresponding to each local condition sample, wherein the spatiotemporal condition encoder corresponds to the local condition; fusing the feature tensors corresponding to each local condition sample to obtain the local condition feature representation.
- the encoding of each local condition sample includes: using a spatiotemporal condition encoder to perform spatial feature encoding on the local condition sample to obtain a spatial feature representation of the local condition sample; if the local condition sample is a sequence, performing temporal self-attention on the spatial feature representation of the local condition sample to obtain a feature tensor corresponding to the local condition sample; otherwise, copying the spatial feature representation of the local condition sample along the time dimension to generate a spatial feature representation in time sequence, and performing temporal self-attention on that representation to obtain a feature tensor corresponding to the local condition sample.
- a video generation method which is executed by a cloud server, and the method includes: obtaining a local condition from a user terminal, wherein the local condition includes a spatial condition and/or a temporal condition; encoding the local condition to obtain a local condition feature representation; integrating a noise sequence with the local condition feature representation to obtain a noise latent vector sequence; performing a denoising process on the noise latent vector sequence using a diffusion model to obtain a denoised latent vector sequence; performing a decoding process on the denoised latent vector sequence to obtain a video; and sending the video to the user terminal for display.
- a video generating device comprising: a condition acquisition unit, configured to acquire local conditions, the local conditions comprising spatial conditions and/or temporal conditions; a video generating unit, configured to encode the local conditions to obtain local condition feature representation; integrate a noise sequence with the local condition feature representation to obtain a noise latent vector sequence; denoise the noise latent vector sequence using a diffusion model to obtain a denoised latent vector sequence; and decode the denoised latent vector sequence to obtain a video.
- a device for training a video generation model comprising: a sample acquisition unit, configured to acquire first training data comprising multiple first training samples, the first training samples comprising local condition samples, the local condition samples comprising spatial condition samples and/or temporal condition samples; a model training unit, configured to train a video generation model using the first training data, the video generation model comprising: a spatiotemporal condition encoder, a diffusion model and a decoder; wherein the spatiotemporal condition encoder encodes the local condition samples to obtain a local condition feature representation; the diffusion model denoises a noise latent vector sequence to obtain a denoised latent vector sequence, the noise latent vector sequence is obtained by integrating the noise sequence with the local condition feature representation; the decoder performs decoding processing using the denoised latent vector sequence to obtain a video; the training objectives include: minimizing the difference between the noise predicted by the diffusion model when performing denoising and the Gaussian noise.
- a computer-readable storage medium on which a computer program is stored, and when the program is executed by a processor, the steps of the method described in any one of the first to third aspects are implemented.
- an electronic device comprising: one or more processors; and a memory associated with the one or more processors, the memory being configured to store program instructions, and when the program instructions are read and executed by the one or more processors, the steps of the method described in any one of the first to third aspects are implemented.
- the present disclosure incorporates spatial conditions and/or temporal conditions as local conditions, integrates local condition feature representation with noise sequence to obtain noise latent vector sequence, and performs denoising on the noise latent vector sequence to obtain denoised latent vector sequence, and then decodes to obtain video.
- This method makes video generation no longer limited to text conditions, but introduces spatial conditions and/or temporal conditions to guide video generation, thereby generating videos more flexibly and diversely, and improving video quality.
- the present disclosure can generate the required video by combining multiple global conditions with multiple local conditions, the latter comprising spatial conditions and temporal conditions; the spatiotemporal condition encoder encodes, aligns and fuses the local conditions, and integrating the result with the noise sequence gives local control of the video; the global encoder encodes and integrates the global conditions, and the diffusion model applies cross-attention to the noise latent vector sequence using the global condition feature representation, which gives global control of the video.
- This method greatly improves the controllability of video generation, and the generated video is more flexible and diverse.
- the present invention innovatively introduces motion vector sequences in the temporal condition.
- the video generation model is able to capture the inter-frame dynamics, thereby achieving control over the internal motion in the video.
- the spatiotemporal condition encoder provided by the present disclosure first extracts local spatial information and then performs temporal modeling, which promotes explicit embedding of the temporal domain. It also provides a unified interface for different local conditions and enhances consistency between frames. Non-temporal spatial conditions, such as a single image or a single semantic sketch, are replicated along the time dimension before fusion so that they are consistent with the temporal conditions. This encoding and fusion process gives the subsequently generated video better controllability in both temporal and spatial perception.
- the present disclosure can first use the training data of the text-generated video to pre-train the diffusion model, and further train the video generation model based on the parameters of the diffusion model obtained by pre-training, thereby improving the effect and efficiency of the video generation model.
- Figure 1 is a system architecture diagram applicable to the embodiment of the present disclosure
- Figure 2 is a flow chart of the video generation method provided by the embodiment of the present disclosure
- Figure 3 is a schematic diagram of the conditions required for generating a video provided by the embodiment of the present disclosure
- Figure 4 is a schematic diagram of the principle of the video generation model provided by the embodiment of the present disclosure
- Figure 5 is a schematic diagram of the structure of the spatiotemporal conditional encoder provided by the embodiment of the present disclosure
- Figure 6 is a schematic diagram of the principle of the diffusion model provided by the embodiment of the present disclosure at each time step
- Figures 7a to 7d are four example diagrams of generating videos provided by the embodiment of the present disclosure
- Figure 8 is a flow chart of the method for training the video generation model provided by the embodiment of the present disclosure
- Figure 9 is a schematic block diagram of the video generation device provided by the embodiment of the present disclosure
- Figure 10 is a schematic block diagram of the device for training the video generation model provided by the embodiment of the present disclosure
- Figure 11 is
- FIG. 1 shows an exemplary system architecture to which an embodiment of the present disclosure can be applied.
- the system architecture includes a user terminal, and a model training device and a video generation device located at a server.
- the model training device of the server adopts the method provided by the embodiment of the present disclosure to perform model training in the offline stage to obtain a video generation model.
- the user terminal can exchange information with the video generation device of the server through the network.
- the user terminal can send the condition information used for generating the video input by the user to the video generation device through the network, wherein the condition information may include global conditions and local conditions, and the specific content will be described in detail in the subsequent embodiments.
- the video generation device can use the trained video generation model to generate a video under the constraints of global conditions and local conditions, and send the generated video to the user terminal through the network for the user terminal to display the video.
- the above-mentioned user terminal may include, but is not limited to, smart mobile terminals, smart home devices, wearable devices, smart medical devices, PCs (Personal Computers), etc.
- Smart mobile terminals may include, for example, mobile phones, tablet computers, laptop computers, PDAs (Personal Digital Assistants), Internet-connected cars, etc.
- Smart home devices may include smart home appliances, such as smart TVs, smart refrigerators, and other home appliances with video playback functions.
- Wearable devices may include, for example, smart watches, smart glasses, smart bracelets, VR (Virtual Reality) devices, AR (Augmented Reality) devices, mixed reality devices (i.e., devices that can support both virtual reality and augmented reality), etc.
- the model training device and the video generation device may be set as independent servers, or may be set on the same server or server group, or may be set on an independent or the same cloud server.
- a cloud server, also known as a cloud computing server or cloud host, is a host product in a cloud computing service system that addresses the difficult management and weak service scalability of traditional physical hosts and virtual private server (VPS) services.
- the model training device and the video generation device may also be set on a computer terminal with strong computing power. It should be noted that, in addition to performing video generation online, the above-mentioned video generation device may also perform video generation offline. It should be understood that the number of user terminals, model training devices, video generation devices, and video generation models in FIG. 1 is schematic. According to the implementation requirements, there may be any number of user terminals, model training devices, video generation devices, and video generation models. FIG.
- Step 202 obtaining local conditions, the local conditions including spatial conditions and/or temporal conditions.
- Step 204 encoding the local conditions to obtain local condition feature representations.
- Step 206 integrating the noise sequence with the local condition feature representation to obtain a noise latent vector sequence; denoising the noise latent vector sequence using a diffusion model to obtain a denoised latent vector sequence.
- Step 208 performing decoding processing using the denoised latent vector sequence to obtain a video.
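Steps 202 to 208 can be read as a short pipeline. The sketch below only illustrates the data flow: `encode`, `predict_noise` and `decode` are hypothetical placeholders for the trained components, "integrating" is assumed to mean per-frame concatenation, and the update rule is a simplified subtraction rather than a real diffusion sampler.

```python
import random

LATENT_DIM = 2  # hypothetical latent size per frame

def generate_video(local_conditions, encode, predict_noise, decode,
                   num_frames, num_steps, seed=0):
    rng = random.Random(seed)
    # Step 204: encode the local conditions into one feature vector per frame.
    condition_features = encode(local_conditions, num_frames)
    # Step 206, first half: sample a noise sequence and integrate it with the
    # condition features (per-frame concatenation is an assumption).
    noise = [[rng.gauss(0.0, 1.0) for _ in range(LATENT_DIM)] for _ in range(num_frames)]
    latents = [n + c for n, c in zip(noise, condition_features)]
    # Step 206, second half: iterative denoising; only the noise part is updated.
    for step in reversed(range(num_steps)):
        predicted = predict_noise(latents, step)
        latents = [
            [z - p for z, p in zip(frame[:LATENT_DIM], pred)] + frame[LATENT_DIM:]
            for frame, pred in zip(latents, predicted)
        ]
    # Step 208: decode the denoised latent part of each frame.
    return [decode(frame[:LATENT_DIM]) for frame in latents]

def encode(conditions, num_frames):
    """Placeholder encoder: one scalar feature, copied to every frame."""
    mean_value = sum(conditions) / len(conditions)
    return [[mean_value] for _ in range(num_frames)]

def predict_noise(latents, step):
    """Placeholder denoiser: treats half of the current latent as noise."""
    return [[0.5 * z for z in frame[:LATENT_DIM]] for frame in latents]

def decode(latent):
    """Placeholder decoder: returns rounded latent values as the 'frame'."""
    return [round(z, 3) for z in latent]

video = generate_video([0.2, 0.4], encode, predict_noise, decode,
                       num_frames=4, num_steps=3)
```

The output has one decoded entry per frame, and fixing the seed makes the sampled noise sequence, and therefore the result, reproducible.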
- the present disclosure incorporates spatial conditions and/or temporal conditions as local conditions, integrates the local condition feature representation with the noise sequence to obtain a noise latent vector sequence, and uses the global condition feature representation to denoise the noise latent vector sequence to obtain the denoised latent vector sequence, and then decodes to obtain the video.
- This method makes video generation no longer limited to text conditions, but introduces spatial conditions and/or temporal conditions to guide video generation, so as to generate videos more flexibly and diversely and improve the quality of the video.
- the following describes each step in the above process in detail. First, the above step 202, i.e., "obtaining local conditions", is described in detail in conjunction with the embodiment.
- Existing video generation methods usually generate videos under the guidance of text conditions, and have poor flexibility and diversity.
- new combination conditions are introduced to improve the controllability of video generation.
- the global conditions mainly include text conditions, style conditions and/or color conditions.
- the text conditions are mainly a description of the video content, which globally reflects the content of the video to be generated.
- the style condition can usually adopt a style image, and the purpose of the condition is to generate a video with the same style features as the style image, that is, the video has the style expressed in the style image as a whole.
- the color condition can be in the form of a color histogram, etc., and the purpose of the condition is to guide the color distribution of the generated video.
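As an illustration of a color condition, the sketch below computes a normalized per-channel histogram from the RGB pixels of one frame. The text does not say how the histogram is built; the per-channel layout, the 8-bit value range and the four bins per channel are assumptions chosen to keep the example small.

```python
def color_histogram(pixels, bins_per_channel=4):
    """Normalized per-channel histogram of 8-bit RGB pixels."""
    bin_width = 256 // bins_per_channel
    histogram = [[0.0] * bins_per_channel for _ in range(3)]
    for pixel in pixels:
        for channel, value in enumerate(pixel):
            # Clamp so the top values still land in the last bin.
            index = min(value // bin_width, bins_per_channel - 1)
            histogram[channel][index] += 1.0
    count = len(pixels)
    return [[bin_count / count for bin_count in channel] for channel in histogram]

frame_pixels = [(255, 0, 0), (250, 10, 5), (0, 0, 255), (10, 200, 30)]
histogram = color_histogram(frame_pixels)
```

Each channel's bins sum to 1, so histograms taken from frames of different sizes remain comparable.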
- the local condition mainly includes spatial conditions and/or temporal conditions.
- the spatial condition can include at least one of a single image, a single semantic sketch, etc.
- a single image refers to one standalone image, which is used to constrain the content and structure of the image frames in the video.
- the semantic sketch is also called a hand-drawn sketch, a sketch, etc., which is a way to describe the basic semantics of an object in an image. It is a generalized expression method that mainly depicts the edge (or contour) features of the object, can ignore the details and redundant features of the object, and retain the main information.
- the spatial condition mainly guides the video content from the spatial features.
- the temporal condition includes at least one of a motion vector sequence, a depth map sequence, a mask map sequence, a semantic sketch sequence and a grayscale map sequence. Since the video is an image sequence, the temporal condition is a more refined guidance of the video along the temporal dimension.
- the motion vector sequence contains motion vectors between adjacent frames.
- the motion vector indicates the direction of movement of each Token (element) between adjacent frames, including horizontal and vertical directions. Therefore, the motion vector can be expressed as a two-dimensional vector. It explicitly expresses the movement of each Token between two adjacent image frames, so that the video generated by the video generation model has motion controllability.
- each Token of the image refers to the element that constitutes the image.
- the image is divided into non-overlapping blocks, and the blocks and start symbols in the image are all Tokens.
- the depth map sequence is composed of the depth map of each frame.
- the depth map contains the depth information of each Token in the image frame, which is used to guide the depth information of each frame in the video.
- the mask map sequence is composed of the mask map of each frame.
- the mask map masks the content of a part of the image frame, so that the model has the ability to predict the content of the masked part.
- the mask area can be specified by the user or determined randomly.
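The two ways of choosing a mask area can be sketched as follows. The function names, the rectangular shape of the user-specified area, the mask ratio and the fill value are all illustrative assumptions; the text only states that the area is user-specified or random.

```python
import random

def random_mask(height, width, mask_ratio, seed=0):
    """Binary map in which 1 marks a masked (hidden) position, chosen at random."""
    rng = random.Random(seed)
    positions = [(r, c) for r in range(height) for c in range(width)]
    hidden = set(rng.sample(positions, int(mask_ratio * len(positions))))
    return [[1 if (r, c) in hidden else 0 for c in range(width)] for r in range(height)]

def box_mask(height, width, top, left, bottom, right):
    """User-specified rectangle covering rows top..bottom-1 and columns left..right-1."""
    return [
        [1 if top <= r < bottom and left <= c < right else 0 for c in range(width)]
        for r in range(height)
    ]

def apply_mask(frame, mask, fill=0):
    """Replace masked positions of a frame with a fill value."""
    return [
        [fill if hidden else value for value, hidden in zip(frame_row, mask_row)]
        for frame_row, mask_row in zip(frame, mask)
    ]

frame = [[r * 4 + c + 1 for c in range(4)] for r in range(4)]
random_map = random_mask(4, 4, mask_ratio=0.25)
user_map = box_mask(4, 4, top=1, left=1, bottom=3, right=3)
masked_frame = apply_mask(frame, user_map)
```

Applying one such map per frame yields a mask map sequence of the same length as the video.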
- the semantic sketch sequence is composed of semantic sketches of each frame. Compared with the single semantic sketch in the spatial condition, the sketch sequence can provide more control details.
- the grayscale image sequence is composed of grayscale images of each frame.
- the grayscale image contains the grayscale information of each token in the image frame, which is used to guide the grayscale information of each frame in the video.
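A grayscale map sequence can be derived from RGB frames as sketched below. The ITU-R BT.601 luma weights (0.299, 0.587, 0.114) are a common convention and an assumption here, as the text does not specify the conversion; the frame layout (rows of RGB tuples) is also illustrative.

```python
def to_grayscale(frame):
    """Convert one RGB frame (rows of (r, g, b) tuples) to 8-bit grayscale."""
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in row]
        for row in frame
    ]

def grayscale_sequence(video):
    """Apply the conversion frame by frame, preserving the sequence length."""
    return [to_grayscale(frame) for frame in video]

video = [
    [[(255, 255, 255), (0, 0, 0)]],  # frame 0: a white and a black pixel
    [[(255, 0, 0), (0, 0, 255)]],    # frame 1: a red and a blue pixel
]
gray_frames = grayscale_sequence(video)
```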
- the sequences in the above temporal conditions all have the same length, which equals the length of the generated video and is denoted, for example, as T1.
- the text condition is "a child standing next to the Christmas tree"
- the style condition is a style image, which is used to limit the overall style of the generated video, that is, to transfer the style features of the style image to the generated video
- the color condition is a color histogram, as shown in Figure 3, which is used to limit the color distribution of the generated video.
- the single image in the spatial condition can be an image containing a child and a Christmas tree, which is used to limit the appearance of the child and the Christmas tree in the generated video, etc., as shown in Figure 3.
- the semantic sketch can be a picture describing the basic lines of the child and the Christmas tree, which is used to limit the basic shape and position of the child and the Christmas tree, etc.
- the motion vector sequence in the time condition includes the motion vectors between adjacent frames, which is schematically represented by small arrows in the figure.
- the depth map sequence includes the depth map of each frame
- the mask map sequence includes the mask map of each frame
- the semantic sketch sequence includes the semantic sketch of each frame.
- the length of each sequence is T1, and T1 is the length of the video to be generated.
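The conditions listed above can be grouped in a small container that checks the stated length constraint, namely that every temporal sequence has length T1. This is only a sketch: the class name `LocalConditions`, its fields, and the `validate` method are hypothetical names chosen for illustration, not part of the disclosure.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class LocalConditions:
    """Hypothetical container for the local conditions described above."""

    # Spatial conditions: a single image and/or a single semantic sketch.
    single_image: Optional[Any] = None
    single_sketch: Optional[Any] = None
    # Temporal conditions: name -> per-frame sequence
    # (motion vectors, depth maps, mask maps, semantic sketches, grayscale images).
    sequences: Dict[str, List[Any]] = field(default_factory=dict)

    def validate(self, t1: int) -> None:
        """Check that every temporal sequence has length T1 (the video length)."""
        for name, seq in self.sequences.items():
            if len(seq) != t1:
                raise ValueError(f"{name}: expected {t1} frames, got {len(seq)}")
```

For example, `LocalConditions(sequences={"depth": frames}).validate(6)` raises `ValueError` unless `frames` holds exactly six depth maps.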
- a condition input interface can be provided to the user, and the user can choose to input the above-mentioned global conditions and local conditions on the interface.
- the user can enter the text condition in the text input box.
- the user can use the image drawing or editing tool provided on the interface to input a single image and/or a single semantic sketch as a spatial condition, and input at least one of a motion vector sequence, a depth map sequence, a mask map sequence, and a semantic sketch sequence as a time condition.
- the style image can also be input by uploading a self-selected image or selecting an image from the image library provided by the interface.
- the user can select a specific area in the image and generate a moving track by a specific gesture, mouse track, etc. to indicate the moving route of the specific area in the video, and the server automatically generates a motion vector sequence based on the moving track.
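The disclosure does not say how the server converts a moving track into a motion vector sequence. One simple possibility is sketched below: the displacement between consecutive track points is assigned to every pixel of the selected region. The function name `track_to_motion_vectors`, the array layouts, and the choice to give the last frame zero motion (so that the sequence has length T1) are all assumptions made for illustration.

```python
import numpy as np


def track_to_motion_vectors(track, region_mask):
    """Turn a user-drawn track into a dense motion vector sequence.

    track:       (T1, 2) positions (x, y) of the selected region, one per frame.
    region_mask: (H, W) boolean array marking the selected region.
    Returns:     (T1, H, W, 2) array of (dx, dy) vectors.

    Simplification (assumption): the region mask stays fixed instead of
    following the track, and the last frame gets zero motion.
    """
    track = np.asarray(track, dtype=np.float32)
    t1 = track.shape[0]
    h, w = region_mask.shape
    # Displacement between adjacent frames.
    deltas = np.zeros((t1, 2), dtype=np.float32)
    deltas[:-1] = track[1:] - track[:-1]
    # Assign each frame's displacement to all pixels inside the region.
    vectors = np.zeros((t1, h, w, 2), dtype=np.float32)
    vectors[:, region_mask] = deltas[:, None, :]
    return vectors
```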
- in addition to the condition input methods listed above, other condition input methods can also be used; they are not listed exhaustively here.
- the following is a detailed description of the above step 204, i.e., "encoding the local condition to obtain the local condition feature representation, and encoding the global condition to obtain the global condition feature representation", in conjunction with the embodiment.
- Steps 204 to 208 involved in FIG. 2 can be implemented by a video generation model obtained by pre-training.
- the video generation model can include a spatiotemporal condition encoder, a diffusion model and a decoder, and can further include a global encoder.
- This step is mainly performed by the spatiotemporal condition encoder, and if the global condition is used, it is further performed by the global encoder.
- the global encoder is used to encode the global condition to obtain the global condition feature representation. If the global condition contains only a text condition, the text condition can be encoded by a text encoder to obtain a text feature representation, and the text feature representation is used as the global condition feature representation.
- the text condition can be encoded by a text encoder to obtain a text feature representation
- the style condition and/or color condition can be encoded by an image encoder to obtain an image feature representation
- the text feature representation and the image feature representation are then integrated to obtain a global condition feature representation.
- the text encoder can be implemented by a pre-trained language model, such as BERT (Bidirectional Encoder Representations from Transformers), XLNet (an autoregressive model that captures bidirectional context through permutation language modeling), a GPT (Generative Pre-Training) model, CLIP (an encoding model that can implement multimodal encoding), etc.
- the text encoder can actually obtain the semantic embedding of the text condition, that is, the text feature representation.
- the image encoder can use ViT (Vision Transformer), CLIP, etc., and the image encoder can actually obtain the semantic embedding of the style image, that is, the image feature representation.
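The disclosure states that the text and image feature representations are "integrated" but does not say how. A common choice, assumed here, is to concatenate the two sets of embeddings along the token axis so that the diffusion model can attend to both. The function name `integrate_global` and the array shapes are hypothetical, and the embeddings would come from encoders such as those listed above.

```python
import numpy as np


def integrate_global(text_feat, image_feat):
    """Integrate text and image features into one global condition representation.

    text_feat:  (N_text, D) text embeddings.
    image_feat: (N_img, D) style/color image embeddings.
    Assumption: integration is concatenation along the token axis.
    """
    if text_feat.shape[1] != image_feat.shape[1]:
        raise ValueError("text and image features must share the embedding size D")
    return np.concatenate([text_feat, image_feat], axis=0)
```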
- the spatiotemporal condition encoder is used to encode the local conditions to obtain the local condition feature representation. The local conditions include spatial conditions and/or temporal conditions, and the temporal conditions are sequential conditions, so the local conditions contain rich and complex spatiotemporal relationships, which poses a challenge to the controllable guidance of the video. In view of this, in the embodiment of the present disclosure, a corresponding spatiotemporal condition encoder is arranged for each local condition.
- the local conditions and the spatiotemporal condition encoder are in a one-to-one correspondence.
- Each spatiotemporal condition encoder is used to encode each local condition respectively to obtain a feature tensor corresponding to each local condition, and then the feature tensors corresponding to each local condition are fused to obtain the local condition feature representation.
- the structure of the spatiotemporal condition encoder can be shown in Figure 5.
- the spatiotemporal condition encoder first encodes the spatial features of the local conditions to obtain the spatial feature representation of the local conditions.
- the spatiotemporal condition encoder can be composed of two two-dimensional convolutions (Conv2D), two activation layers (for example, SiLU activation function can be used) and an average pooling layer.
- if the local condition is a sequence, the spatial feature representation of the local condition is subjected to temporal self-attention processing to obtain the feature tensor corresponding to the local condition. This part is performed by the temporal Transformer shown in FIG. 5.
- otherwise, the spatial feature representation of the local condition is first copied along the time sequence to generate a spatial feature representation in the time sequence, so that it is aligned in time with the spatial feature representations of the sequence-type local conditions. This part is performed by the replication operation shown in FIG. 5.
- the temporal self-attention processing is performed on the spatial feature representation in the temporal sequence through the temporal Transformer to obtain the feature tensor corresponding to the local condition.
- when the feature tensors corresponding to each local condition are fused, the feature tensors can be added element by element.
- the spatiotemporal condition encoder actually extracts local spatial information first and then performs temporal modeling, which promotes the explicit embedding of the time domain. It also provides a unified interface for different local conditions and enhances the consistency between frames.
- non-temporal spatial conditions such as a single image and a single semantic sketch
- the fusion of local conditions is achieved by replicating them in the time dimension to ensure consistency with the time condition.
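The encoding path described above (spatial encoding, replication of non-sequence conditions along time, temporal self-attention, element-wise fusion) can be sketched in NumPy. This is a simplified stand-in, not the disclosed network: the spatial encoder is reduced to average pooling (the Conv2D and SiLU layers are omitted), the attention uses identity query/key/value projections, and all function names are hypothetical.

```python
import numpy as np


def spatial_encode(frames, pool=2):
    """Stand-in spatial encoder: average pooling only.
    frames: (T, H, W, C) -> (T, H // pool, W // pool, C)."""
    t, h, w, c = frames.shape
    x = frames[:, :h - h % pool, :w - w % pool]
    return x.reshape(t, h // pool, pool, w // pool, pool, c).mean(axis=(2, 4))


def temporal_self_attention(x):
    """Self-attention along the time axis, independently per spatial position.
    Assumption: identity Q/K/V projections (a trained model learns them)."""
    t, h, w, c = x.shape
    seq = x.transpose(1, 2, 0, 3).reshape(h * w, t, c)        # (HW, T, C)
    scores = seq @ seq.transpose(0, 2, 1) / np.sqrt(c)         # (HW, T, T)
    scores = scores - scores.max(axis=-1, keepdims=True)       # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)    # softmax over time
    out = weights @ seq                                         # (HW, T, C)
    return out.reshape(h, w, t, c).transpose(2, 0, 1, 3)       # (T, H, W, C)


def encode_local_condition(cond, t1):
    """cond: (T1, H, W, C) sequence, or (H, W, C) single image/sketch."""
    is_sequence = cond.ndim == 4
    feats = spatial_encode(cond if is_sequence else cond[None])
    if not is_sequence:
        feats = np.repeat(feats, t1, axis=0)  # copy along the time axis
    return temporal_self_attention(feats)


def fuse(feature_tensors):
    """Fuse per-condition feature tensors by element-wise addition."""
    return np.sum(feature_tensors, axis=0)
```

Because a single image is replicated to T1 identical frames before attention, its feature tensor has the same shape as that of a sequence condition, which is what allows the element-wise sum.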
- a noise sequence ε can be randomly generated.
- the randomly generated noise sequence conforms to the normal distribution, that is, Gaussian noise, and its length is consistent with the length of the video to be generated, both of which are T1.
- after the spatiotemporal condition encoder encodes and integrates the local conditions, the local condition feature representation has the same spatial shape as ε. Then the noise sequence is integrated with the local condition feature representation, for example, the two are spliced along the channel dimension to obtain a noise latent vector sequence z_t, which is used as a control signal for video generation.
- diffusion models have been widely used in the field of image generation because of their more stable training and generation flexibility, but have not been well utilized in the field of video generation.
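Splicing the noise sequence with the local condition feature representation along the channel dimension can be illustrated as follows. The function name `build_noise_latents`, the channels-last layout, and the number of noise channels are assumptions for this sketch.

```python
import numpy as np


def build_noise_latents(local_feat, noise_channels, rng):
    """Concatenate Gaussian noise with local condition features.

    local_feat: (T1, H, W, C_cond) fused local condition feature representation.
    Returns:    (T1, H, W, noise_channels + C_cond) noise latent vector sequence.
    """
    t1, h, w, _ = local_feat.shape
    # Gaussian noise with the same length T1 and spatial shape as the features.
    noise = rng.standard_normal((t1, h, w, noise_channels))
    return np.concatenate([noise, local_feat], axis=-1)
```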
- LDM (Latent Diffusion Model)
- the initial video is projected to the latent representation, and then the latent representation is mapped back to the pixel space through the decoder to obtain the final video.
- the initial video is the noise sequence
- the final video is the generated video.
- in addition to the LDM, other types of diffusion models can also be used.
- the processing of the diffusion model can be understood as predicting the noise of the normal distribution, and performing denoising at each time step to restore the real video content. This process simulates the reverse process of a Markov chain of length T, where T is the total number of time steps of the diffusion model.
- the diffusion model can use the noise latent vector sequence obtained in the previous time step at each time step (in the first time step, the noise latent vector sequence input to the diffusion model is used) to predict the noise of the current time step, and use the predicted noise to denoise the noise latent vector sequence to obtain the noise latent vector sequence of the current time step. Furthermore, if the input conditions include global conditions, the diffusion model uses the global condition feature representation to perform cross-attention processing on the noise latent vector sequence to predict the noise, and uses the predicted noise to perform denoising.
- at the first time step, the diffusion model uses the global condition feature representation to perform cross-attention processing on the noise latent vector sequence to predict the noise of the current time step, and uses the predicted noise to denoise the noise latent vector sequence to obtain the noise latent vector sequence of the current time step.
- at each other time step, the noise latent vector sequence obtained at the previous time step is cross-attention processed using the global condition feature representation to predict the noise of the current time step, and the predicted noise is used to denoise the noise latent vector sequence obtained at the previous time step to obtain the noise latent vector sequence of the current time step.
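The step-by-step denoising described above can be sketched as a loop. The disclosure does not give the update rule, so this sketch swaps in the standard DDPM ancestral sampling step with a linear beta schedule; the noise predictor is passed in as a callable, and the names `ddpm_denoise`, `predict_noise`, and `num_steps` are hypothetical.

```python
import numpy as np


def ddpm_denoise(z_start, predict_noise, cond, num_steps=50, rng=None):
    """Iteratively denoise a latent sequence.

    predict_noise(z, cond, t) -> predicted noise with the same shape as z.
    Assumption: DDPM sampling with a linear beta schedule and variance beta_t.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    betas = np.linspace(1e-4, 0.02, num_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    z = z_start
    for t in range(num_steps - 1, -1, -1):
        # Predict the noise from the result of the previous time step.
        eps_hat = predict_noise(z, cond, t)
        # Remove the predicted noise (DDPM posterior mean).
        mean = (z - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
        # No fresh noise is added at the final step.
        fresh = rng.standard_normal(z.shape) if t > 0 else 0.0
        z = mean + np.sqrt(betas[t]) * fresh
    return z
```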
- the diffusion model can use the three-dimensional UNet as the backbone network.
- the UNet network is an encoder-decoder network that introduces skip connections.
- the above process can be regarded as applying the denoising function ε_θ(·, ·, t) to the noisy latent vector sequence z_t and the condition c (for example, including global conditions and local conditions), where t = 1, ..., T.
- in the denoising process, the cross-attention mechanism is used to inject global conditions, so that the global conditions guide the video generation in the overall semantics.
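Cross-attention injects the global condition by letting the latent tokens act as queries and the global condition features act as keys and values. The sketch below is a single-head version in NumPy; the projection matrices `w_q`, `w_k`, `w_v` and the token layout are assumptions, since the disclosure does not specify them.

```python
import numpy as np


def cross_attention(latent_tokens, cond_tokens, w_q, w_k, w_v):
    """Single-head cross-attention.

    latent_tokens: (N, D_lat) tokens of the noise latent vector sequence.
    cond_tokens:   (M, D_cond) global condition feature representation.
    Returns:       (N, D_out) latent tokens updated with condition information.
    """
    q = latent_tokens @ w_q                    # queries from the latents
    k = cond_tokens @ w_k                      # keys from the global condition
    v = cond_tokens @ w_v                      # values from the global condition
    scores = q @ k.T / np.sqrt(q.shape[-1])    # (N, M) scaled dot products
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over conditions
    return weights @ v
```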
- the denoised latent vector sequence obtained at the last time step of the diffusion model is input into the decoder in step 208, and the decoder decodes the denoised latent vector sequence to obtain a video.
- the decoder can use the decoder of the existing video generation model, which will not be described in detail here.
- various conditions can be flexibly combined to generate more diverse videos.
- Example 1 The text condition in the global condition and the time condition in the local condition can be combined.
- as shown in Figure 7a, the user inputs the text condition "a rotating perspective of a long-haired woman standing in the forest" and a semantic sketch sequence.
- the sequence length is 6 frames.
- the final generated video is shown in Figure 7a, which shows the frames of the video.
- Example 2 The spatial condition and the temporal condition in the local condition can be combined.
- Example 3 The style condition in the global condition and the depth map sequence and semantic sketch sequence in the local condition can be combined.
- the user inputs a style image, a depth map sequence, and a semantic sketch sequence, and the resulting video is shown in FIG7c.
- Example 4 The text condition, the style condition in the global condition, and a semantic sketch and a motion vector sequence in the local condition can be combined.
- the user inputs the text condition "a moving golden moon", a style image, and a semantic sketch, and the user can hand-draw the direction of the movement of the moon on the semantic sketch.
- the video generation device on the server side can automatically generate a motion vector sequence according to the direction of the movement of the moon hand-drawn by the user on the semantic sketch, thereby generating a video using the method provided by the embodiment of the present disclosure.
- the moon in the generated video moves along the direction of the movement hand-drawn by the user.
- FIG. 8 is a flowchart of the method for training a video generation model provided by an embodiment of the present disclosure. The method can be executed by the model training device in the system architecture shown in FIG. 1.
- the method may include the following steps: Step 802: Obtain first training data including multiple first training samples; the first training samples include local condition samples, and the local condition samples include spatial condition samples and/or temporal condition samples. Furthermore, the first training samples may also include global condition samples, and the global condition samples may include at least one of text condition samples, style condition samples, and color condition samples.
- a video sample may be first obtained, and a description text of the video sample may be obtained as a text condition sample.
- a style image of the video sample may be obtained as a style condition sample, and a color histogram of the video sample may be obtained as a color condition sample.
- an existing video description text generation model may be used to obtain the description text of the video sample, or the description text of the video sample may be obtained from the web page from which the video sample is derived, or the description text of the video sample may be manually added, and so on.
- the first frame image of the video sample may be used as a style image.
- the color histogram corresponding to the first frame image of the video sample may be used as a color condition sample.
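A color histogram of the first frame can be computed as follows. The disclosure does not specify the histogram layout, so this sketch assumes a normalized per-channel RGB histogram; the function name `color_histogram` and the bin count are illustrative choices.

```python
import numpy as np


def color_histogram(frame, bins=8):
    """Per-channel color histogram of one frame.

    frame:   (H, W, 3) uint8 RGB image, e.g. the first frame of the video sample.
    Returns: (3, bins) array; each row sums to 1.
    """
    hist = np.stack([
        np.histogram(frame[..., ch], bins=bins, range=(0, 256))[0]
        for ch in range(3)
    ]).astype(np.float64)
    return hist / hist.sum(axis=1, keepdims=True)
```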
- at least one of a single image and a single semantic sketch is extracted from the video sample as a spatial condition sample.
- the first frame image of the video sample can be used as the above-mentioned single image.
- the edge extraction can be performed on the first frame image in the video sample, and a single semantic sketch can be formed using the edge information extracted from the image.
- a single semantic sketch can be generated for the first frame image in the video sample using a hand-drawn drawing generation tool.
- at least one of a motion vector sequence, a depth map sequence, a mask map sequence, a semantic sketch sequence, and a grayscale map sequence is extracted from the video sample as a temporal condition sample.
- the motion vectors between adjacent frames can be extracted from the video sample to form a motion vector sequence.
- the depth map of each frame image is extracted from the video sample to form a depth map sequence.
- Masking is performed on each frame in the video sample to obtain a mask map sequence.
- a semantic sketch is generated for each frame image in the video sample to obtain a semantic sketch sequence.
- a grayscale map is generated for each frame image in the video sample to obtain a grayscale map sequence.
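Several of the temporal condition samples above can be derived directly from the video frames. The sketch below covers three of them under simplifying assumptions: grayscale maps use the ITU-R BT.601 luma weights, masks are one random square per frame, and the "semantic sketch" is approximated by thresholded finite-difference gradients, which is a crude stand-in for a real edge or sketch extractor. Motion vector and depth extraction are omitted because they need dedicated tools. All function names are hypothetical.

```python
import numpy as np


def grayscale_sequence(video):
    """video: (T, H, W, 3) float RGB in [0, 1] -> (T, H, W) grayscale maps."""
    return video @ np.array([0.299, 0.587, 0.114])


def random_mask_sequence(video, rng, size=4):
    """Hide one random size x size square per frame.
    Returns (masked_video, mask); mask is True where content is hidden."""
    t, h, w, _ = video.shape
    mask = np.zeros((t, h, w), dtype=bool)
    for i in range(t):
        top = rng.integers(0, h - size + 1)
        left = rng.integers(0, w - size + 1)
        mask[i, top:top + size, left:left + size] = True
    masked = np.where(mask[..., None], 0.0, video)
    return masked, mask


def sketch_sequence(gray, threshold=0.1):
    """Crude per-frame edge map from spatial gradients of the grayscale maps."""
    gy, gx = np.gradient(gray, axis=(1, 2))
    return (np.hypot(gx, gy) > threshold).astype(np.float32)
```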
- other methods can also be used to obtain the above-mentioned condition samples, which are not listed here one by one.
- qualifiers such as "first" and "second" in the embodiments of the present disclosure impose no restriction on size, order, quantity, etc., and are used only to distinguish names. For example, "first training sample" and "second training sample" are used to distinguish two training samples in name.
- Step 804 Use the first training data to train a video generation model, and the video generation model includes: a spatiotemporal conditional encoder, a diffusion model, and a decoder; wherein the spatiotemporal conditional encoder encodes the local conditional sample to obtain a local conditional feature representation; the diffusion model denoises the noise latent vector sequence to obtain a denoised latent vector sequence, and the noise latent vector sequence is obtained by integrating the noise sequence with the local conditional feature representation; the decoder uses the denoised latent vector sequence to perform decoding processing to obtain a video; the training objectives include: minimizing the difference between the noise predicted by the diffusion model when performing denoising processing and the Gaussian noise. Furthermore, the video generation model may further include a global encoder.
- the global encoder encodes the global condition sample to obtain a global condition feature representation.
- the diffusion model may use the global condition feature representation to perform cross-attention processing on the noise latent vector sequence to predict the noise, and use the predicted noise to perform denoising processing.
- the global encoder may use a text encoder to perform text encoding on the text condition sample to obtain a text feature representation, and use the text feature representation as the global condition feature representation.
- the global encoder may use a text encoder to perform text encoding on the text condition sample to obtain a text feature representation, and use an image encoder to perform image encoding on the style condition sample and/or the color condition sample to obtain an image feature representation; and then integrate the text feature representation and the image feature representation to obtain a global condition feature representation.
- each spatiotemporal condition encoder may be used to encode each local condition sample respectively to obtain a feature tensor corresponding to each local condition sample, wherein the spatiotemporal condition encoders correspond to the local conditions one to one; then the feature tensors corresponding to each local condition sample are fused to obtain the local condition feature representation. The spatiotemporal condition encoder may be used to perform spatial feature encoding on the local condition sample to obtain the spatial feature representation of the local condition sample.
- if the local condition sample is a sequence, the spatial feature representation of the local condition sample is subjected to a time domain self-attention process to obtain the feature tensor corresponding to the local condition sample; otherwise, the spatial feature representation of the local condition sample is copied in the time sequence to generate a spatial feature representation in the time sequence, and the spatial feature representation in the time sequence is subjected to a time domain self-attention process to obtain the feature tensor corresponding to the local condition sample.
- the diffusion model can use the global condition feature representation to perform cross-attention processing on the noise latent vector sequence at the first time step to predict the noise of the current time step, and use the predicted noise to perform denoising processing on the noise latent vector sequence to obtain the noise latent vector sequence at the current time step.
- the diffusion model uses the global condition feature representation to perform cross-attention processing on the noise latent vector sequence obtained at the previous time step at other time steps to predict the noise of the current time step, and uses the predicted noise to perform denoising processing on the noise latent vector sequence obtained at the previous time step to obtain the noise latent vector sequence at the current time step.
- the loss function can be constructed according to the above training objectives, and in each iteration of training the video generation model, the value of the loss function is used to update the model parameters by a method such as gradient descent until the preset training end condition is met.
- the training end condition may include, for example, the value of the loss function is less than or equal to the preset loss function threshold, the number of iterations reaches the preset number threshold, etc.
- other loss functions used within the spirit and principle are also within the protection scope of the present disclosure.
- in each iteration, the differences between the noise predicted at multiple selected time steps and the Gaussian noise are used to obtain the loss function.
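The training objective (minimizing the difference between predicted noise and Gaussian noise over several sampled time steps) and the end conditions can be sketched as below. The forward noising formula is the standard DDPM one and the difference is measured as mean squared error; both are assumptions, since the disclosure's own loss formula is not reproduced in this passage. The names `noise_prediction_loss` and `should_stop` are hypothetical.

```python
import numpy as np


def noise_prediction_loss(z0, predict_noise, cond, alpha_bars, rng, num_timesteps=4):
    """Average MSE between predicted and true Gaussian noise over sampled time steps.

    z0:            clean latent sequence.
    predict_noise: callable (z_t, cond, t) -> predicted noise.
    alpha_bars:    cumulative products of (1 - beta_t), assumed DDPM schedule.
    """
    losses = []
    for _ in range(num_timesteps):
        t = int(rng.integers(0, len(alpha_bars)))
        eps = rng.standard_normal(z0.shape)                  # true Gaussian noise
        z_t = np.sqrt(alpha_bars[t]) * z0 + np.sqrt(1.0 - alpha_bars[t]) * eps
        eps_hat = predict_noise(z_t, cond, t)
        losses.append(np.mean((eps_hat - eps) ** 2))
    return float(np.mean(losses))


def should_stop(loss, iteration, loss_threshold, max_iterations):
    """Training ends when the loss is small enough or the iteration budget is used."""
    return loss <= loss_threshold or iteration >= max_iterations
```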
- the global encoder and decoder can use the parameters obtained by pre-training, and in each round of training iteration, the parameters of the spatiotemporal condition encoder and the diffusion model are updated using the loss function corresponding to the training target.
- the pre-training of the global encoder can use other tasks, such as image generation tasks, text classification tasks, etc.
- the pre-training of the decoder can also use other tasks, such as image generation tasks, image classification tasks, etc. These are currently available training tasks and will not be described in detail here.
- the parameters of the diffusion model in an image generation model can be used for initialization. This method alleviates the training difficulty to a certain extent and speeds up training, but it is still difficult to learn and process temporal features and generate videos under multiple conditions.
- the present disclosure provides a more preferred implementation, that is, a two-stage training strategy is adopted. Before the video generation model is trained using the first training data, the diffusion model is first pre-trained.
- the pre-training lets the diffusion model learn from the process of generating videos based on text conditions. Then, based on the parameters of the diffusion model obtained by the pre-training, the video generation model is further trained using the first training data (i.e., including multiple conditions such as global conditions and local conditions).
- the process of pre-training the diffusion model may include the following steps: first, second training data including multiple second training samples is obtained.
- the second training sample includes a text sample, namely description text extracted from a video sample.
- the description text of the video sample may be obtained by using an existing video description text generation model, or the description text of the video sample may be obtained from the webpage from which the video sample is obtained, or the description text of the video sample may be manually added, and so on.
- the text sample is encoded using a global encoder to obtain a text feature representation.
- the obtained text feature representation and noise sequence are input into the diffusion model to train the diffusion model, where the diffusion model uses the text feature representation to denoise the noise sequence;
- the training objectives include: minimizing the difference between the noise predicted by the diffusion model and the Gaussian noise when performing denoising at each time step.
- the loss function used in this part of the pre-training is the same as the above formula (1), but the included condition C is different.
- the condition C involved in the pre-training process includes the text condition part. That is, the process of generating a video when the input condition is a text condition is used to train the diffusion model.
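The two-stage strategy can be summarized as a small configuration that records, for each stage, which conditions are used and which modules are updated. The second stage follows the description above (pre-trained global encoder and decoder, updated spatiotemporal condition encoder and diffusion model). Treating the global encoder as fixed during the first stage is an assumption, and the stage and module names are hypothetical labels.

```python
MODULES = [
    "global_encoder",
    "spatiotemporal_condition_encoder",
    "diffusion_model",
    "decoder",
]

STAGES = [
    {
        # Stage 1: pre-train the diffusion model on text-to-video generation.
        "name": "pretrain_diffusion",
        "conditions": ["text"],
        "trainable": ["diffusion_model"],
    },
    {
        # Stage 2: train with global and local conditions, starting from stage 1.
        "name": "train_video_generation_model",
        "conditions": ["global", "local"],
        "trainable": ["spatiotemporal_condition_encoder", "diffusion_model"],
    },
]


def frozen_modules(stage):
    """Modules whose parameters are not updated in the given stage."""
    return [m for m in MODULES if m not in stage["trainable"]]
```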
- FIG. 9 shows a schematic block diagram of a video generation device according to an embodiment, which corresponds to the video generation device in the system shown in Figure 1.
- the device 900 includes: a condition acquisition unit 901 and a video generation unit 902.
- the main functions of each component unit are as follows: the condition acquisition unit 901 is configured to acquire local conditions, which include spatial conditions and/or temporal conditions; the video generation unit 902 is configured to encode the local conditions to obtain local condition feature representation; integrate the noise sequence with the local condition feature representation to obtain a noise latent vector sequence; perform denoising on the noise latent vector sequence using a diffusion model to obtain a denoised latent vector sequence; perform decoding on the denoised latent vector sequence to obtain a video.
- the condition acquisition unit 901 can also be configured to acquire global conditions, which include at least one of text conditions, style conditions and color conditions.
- the video generation unit 902 can also be configured to encode the global conditions to obtain a global condition feature representation.
- the diffusion model can perform cross-attention processing on the noise latent vector sequence using the global condition feature representation to predict noise, and perform denoising using the predicted noise.
- the above-mentioned spatial condition may include at least one of a single image and a single semantic sketch.
- the temporal condition includes at least one of a motion vector sequence, a depth map sequence, a mask map sequence, a semantic sketch sequence, and a grayscale map sequence.
- the video generation unit 902 can be implemented by the video generation model shown in Figure 3.
- the global encoder is used to encode the global condition to obtain a global condition feature representation.
- the spatiotemporal condition encoder is used to encode the local condition to obtain a local condition feature representation.
- the diffusion model is used to denoise the noise latent vector sequence to obtain a denoised latent vector sequence.
- the decoder uses the denoised latent vector sequence for decoding to obtain a video.
- the global encoder in the video generation model performs text encoding on the text condition to obtain a text feature representation, and performs image encoding on the style condition and/or color condition to obtain an image feature representation; it then integrates the text feature representation and the image feature representation to obtain the global condition feature representation.
- each spatiotemporal condition encoder encodes each local condition respectively to obtain the feature tensor corresponding to each local condition, wherein the spatiotemporal condition encoder corresponds to the local condition one by one; fuse the feature tensors corresponding to each local condition to obtain the local condition feature representation.
- the spatiotemporal condition encoder can perform spatial feature encoding on the local condition to obtain the spatial feature representation of the local condition; if the local condition is a sequence, it performs temporal self-attention processing on the spatial feature representation of the local condition to obtain the feature tensor corresponding to the local condition; otherwise, it copies the spatial feature representation of the local condition in time sequence to generate a spatial feature representation in time sequence, and performs temporal self-attention processing on the spatial feature representation in time sequence to obtain the feature tensor corresponding to the local condition.
- a device for training a video generation model is provided.
- FIG10 shows a schematic block diagram of a device for training a video generation model according to an embodiment.
- the device 1000 includes: a sample acquisition unit 1001 and a model training unit 1002, and may further include a pre-training unit 1003.
- the main functions of each component unit are as follows: the sample acquisition unit 1001 is configured to acquire first training data including a plurality of first training samples, the first training samples include local condition samples, and the local condition samples include spatial condition samples and/or temporal condition samples.
- the model training unit 1002 is configured to train a video generation model using the first training data, and the video generation model includes: a spatiotemporal condition encoder, a diffusion model, and a decoder; wherein the spatiotemporal condition encoder performs encoding processing on the local condition samples to obtain a local condition feature representation; the diffusion model performs denoising processing on the noise latent vector sequence to obtain a denoised latent vector sequence, and the noise latent vector sequence is obtained by integrating the noise sequence with the local condition feature representation; the decoder performs decoding processing using the denoised latent vector sequence to obtain a video; the training objectives include: minimizing the difference between the noise predicted by the diffusion model when performing denoising processing and the Gaussian noise.
- the first training sample also includes a global condition sample
- the global condition sample includes at least one of a text condition sample, a style condition sample, and a color condition sample.
- the video generation model also includes a global encoder, and the global encoder encodes the global condition sample to obtain a global condition feature representation.
- the diffusion model uses the global condition feature representation to perform cross-attention processing on the noise latent vector sequence to predict the noise, and uses the predicted noise to perform denoising processing.
- the global encoder and decoder use the parameters obtained by pre-training, and in each round of training iteration, the parameters of the spatiotemporal condition encoder and the diffusion model are updated using the loss function corresponding to the training target.
- the sample acquisition unit 1001 can acquire a video sample; acquire the description text, style image and/or color histogram of the video sample as a text condition sample, a style condition sample and/or a color condition sample respectively; extract at least one of a single image and a single semantic sketch as a spatial condition sample; and extract at least one of a motion vector sequence, a depth map sequence, a mask map sequence, a semantic sketch sequence and a grayscale map sequence as a temporal condition sample.
- the pre-training unit 1003 can be configured to pre-train the diffusion model in the following manner: obtain second training data including a plurality of second training samples, the second training samples including: extracting description text from a video sample as a text sample; encoding the text sample using a global encoder to obtain a text feature representation; inputting the text feature representation and the noise sequence into the diffusion model to train the diffusion model, and the diffusion model uses the text feature representation to denoise the noise sequence; the training objectives include: minimizing the difference between the noise predicted by the diffusion model and the Gaussian noise when the denoising process is performed at each time step.
- the model training unit 1002 further uses the first training data to train the video generation model based on the parameters of the diffusion model pre-trained by the pre-training unit 1003.
- each spatiotemporal condition encoder can be used to encode each local condition sample respectively to obtain the feature tensor corresponding to each local condition sample, wherein the spatiotemporal condition encoders correspond to the local conditions one to one; the feature tensors corresponding to each local condition sample are fused to obtain the local condition feature representation.
- the spatiotemporal condition encoder can perform spatial feature encoding on the local condition sample to obtain the spatial feature representation of the local condition sample; if the local condition sample is a sequence, the spatial feature representation of the local condition sample is subjected to time domain self-attention processing to obtain the feature tensor corresponding to the local condition sample; otherwise, the spatial feature representation of the local condition sample is copied in time sequence to generate the spatial feature representation in time sequence, and the spatial feature representation in time sequence is subjected to time domain self-attention processing to obtain the feature tensor corresponding to the local condition sample.
- the description is relatively brief; for relevant details, refer to the corresponding parts of the method embodiment description.
- the device embodiment described above is merely illustrative: a unit described as a separate component may or may not be physically separate, and a component displayed as a unit may or may not be a physical unit; that is, it may be located in one place or distributed across multiple network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the scheme of this embodiment. A person of ordinary skill in the art can understand and implement it without creative effort.
- the user information including but not limited to user device information, user personal information, etc.
- data including but not limited to data for analysis, stored data, displayed data, etc.
- the processing needs to comply with the relevant laws, regulations and standards of the relevant countries and regions, and provide corresponding operation entrances for users to choose to authorize or refuse.
- the embodiment of the present disclosure also provides a computer-readable storage medium, on which a computer program is stored, and the program is executed by a processor to implement the steps of any of the methods in the aforementioned method embodiments.
- an electronic device including: one or more processors; and a memory associated with the one or more processors, the memory is configured to store program instructions, and the program instructions are read and executed by the one or more processors to execute the steps of any of the methods in the aforementioned method embodiments.
- the present disclosure also provides a computer program product, including a computer program, which implements the steps of any of the methods in the aforementioned method embodiments when executed by a processor.
- Figure 11 exemplarily shows the architecture of an electronic device, which may specifically include a processor 1110, a video display adapter 1111, a disk drive 1112, an input/output interface 1113, a network interface 1114, and a memory 1120.
- the processor 1110, the video display adapter 1111, the disk drive 1112, the input/output interface 1113, the network interface 1114, and the memory 1120 can be communicatively connected via a communication bus 1130.
- the processor 1110 can be implemented in the form of a general-purpose CPU, a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits, and is configured to execute relevant programs to implement the technical solutions provided by the present disclosure.
- the memory 1120 can be implemented in the form of a ROM (Read Only Memory), a RAM (Random Access Memory), a static storage device, a dynamic storage device, etc.
- the memory 1120 may store an operating system 1121 configured to control the operation of the electronic device 1100, and a basic input/output system (BIOS) 1122 configured to control the low-level operation of the electronic device 1100.
- a web browser 1123, a data storage management system 1124, and a video generation device/model training device 1125, etc. may also be stored.
- the above-mentioned video generation device/model training device 1125 may be an application program for specifically implementing the operations of the aforementioned steps in the embodiment of the present disclosure.
- the relevant program code is stored in the memory 1120 and is called and executed by the processor 1110.
- the input/output interface 1113 is configured to connect an input/output module to implement information input and output.
- the input/output module may be configured in the device as a component (not shown in the figure), or may be externally connected to the device to provide corresponding functions.
- the input device may include a keyboard, a mouse, a touch screen, a microphone, various sensors, etc., and the output device may include a display, a speaker, a vibrator, an indicator light, etc.
- the network interface 1114 is configured to connect to a communication module (not shown) to implement communication interaction between the device and other devices.
- the communication module can implement communication via a wired method (such as USB, network cable, etc.) or a wireless method (such as mobile network, Wi-Fi, Bluetooth, etc.).
- the bus 1130 includes a path for transmitting information between various components of the device (e.g., the processor 1110, the video display adapter 1111, the disk drive 1112, the input/output interface 1113, the network interface 1114, and the memory 1120).
- the device may also include other components necessary for normal operation.
- the above device may also include only the components necessary for implementing the scheme of the present disclosure, and need not include all the components shown in the figure.
- the present disclosure can be implemented by means of software plus a necessary general hardware platform.
- the technical solution of the present disclosure can be embodied in the form of a computer program product, which can be stored in a storage medium, such as ROM/RAM, a disk, an optical disk, etc., and includes several instructions for enabling a computer device (which can be a personal computer, a server, or a network device, etc.) to execute the methods of various embodiments of the present disclosure or certain parts of the embodiments.
- the embodiment of the present disclosure provides a video generation method, including: obtaining local conditions, the local conditions including spatial conditions and/or temporal conditions; encoding the local conditions to obtain a local condition feature representation; integrating a noise sequence with the local condition feature representation to obtain a noise latent vector sequence; denoising the noise latent vector sequence using a diffusion model to obtain a denoised latent vector sequence; and decoding the denoised latent vector sequence to obtain a video.
- the present disclosure incorporates spatial conditions and/or temporal conditions as local conditions, integrates the local condition feature representation with the noise sequence to obtain a noise latent vector sequence, denoises the noise latent vector sequence to obtain a denoised latent vector sequence, and then decodes to obtain a video. This method makes video generation no longer limited to text conditions, but introduces spatial conditions and/or temporal conditions to guide video generation, thereby generating videos more flexibly and diversely, and improving the quality of videos.
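The steps described above (per-condition spatiotemporal encoding, fusion, integration with the noise sequence, denoising, and the noise-prediction training objective) can be illustrated with a minimal Python sketch. Everything concrete in it is an assumption made for illustration and is not taken from the disclosure: the function names, the placeholder `spatial_encode` and `toy_denoiser`, fusion and integration by element-wise addition, attention without learned weights, and the mean-squared-error form of the objective. Decoding the latent sequence into frames is omitted.

```python
import math
import random


def spatial_encode(frame):
    # Placeholder spatial encoder (assumption): a fixed scaling of the input vector.
    return [0.5 * x for x in frame]


def temporal_self_attention(seq):
    # Single-head self-attention along the time axis, with no learned weights.
    d = len(seq[0])
    out = []
    for q in seq:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in seq]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w / total * v[j] for w, v in zip(weights, seq)) for j in range(d)])
    return out


def encode_local_condition(cond, num_frames, is_sequence):
    # Sequence conditions are encoded frame by frame; a single (non-sequence)
    # condition is encoded once and replicated along the time dimension.
    if is_sequence:
        feats = [spatial_encode(f) for f in cond]
    else:
        feats = [spatial_encode(cond)] * num_frames
    return temporal_self_attention(feats)


def fuse(tensors):
    # Fuse per-condition feature tensors by element-wise sum (assumption).
    return [[sum(vals) for vals in zip(*frames)] for frames in zip(*tensors)]


def integrate(noise, cond_feat):
    # Integrate the noise sequence with the condition features by addition (assumption).
    return [[n + c for n, c in zip(nf, cf)] for nf, cf in zip(noise, cond_feat)]


def toy_denoiser(latents, t):
    # Hypothetical stand-in for the diffusion model's noise prediction at step t.
    return [[0.1 * x for x in f] for f in latents]


def generate(conditions, num_frames, dim, steps=4, seed=0):
    rng = random.Random(seed)
    noise = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_frames)]
    tensors = [encode_local_condition(c, num_frames, is_seq) for c, is_seq in conditions]
    latents = integrate(noise, fuse(tensors))
    for t in reversed(range(steps)):
        eps = toy_denoiser(latents, t)
        latents = [[x - e for x, e in zip(f, ef)] for f, ef in zip(latents, eps)]
    return latents  # a real system would decode these latents into video frames


def noise_prediction_loss(pred, target):
    # Mean squared error between predicted noise and the Gaussian noise that was added.
    n = sum(len(f) for f in target)
    return sum((p - g) ** 2 for pf, gf in zip(pred, target) for p, g in zip(pf, gf)) / n


# Hypothetical inputs: one single-image condition and one per-frame condition.
image_cond = [1.0, 0.0, -1.0]
depth_seq = [[0.1 * t, 0.2, 0.3] for t in range(4)]
latents = generate([(image_cond, False), (depth_seq, True)], num_frames=4, dim=3)
```

Replicating a single condition before temporal attention lets both kinds of condition pass through the same time-domain step and come out with the same shape, which is what makes the fusion step a simple element-wise operation in this sketch.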
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Software Systems (AREA)
- Computing Systems (AREA)
- Biophysics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Data Mining & Analysis (AREA)
- General Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Mathematical Physics (AREA)
- Biomedical Technology (AREA)
- Molecular Biology (AREA)
- Multimedia (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Databases & Information Systems (AREA)
- Medical Informatics (AREA)
- Probability & Statistics with Applications (AREA)
- Compression Or Coding Systems Of Tv Signals (AREA)
- Image Processing (AREA)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| EP24816025.1A EP4722982A4 (en) | 2023-05-29 | 2024-05-29 | METHOD AND APPARATUS FOR VIDEO GENERATION, METHOD AND APPARATUS FOR DRIVING A VIDEO GENERATION MODEL |
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202310618707.8A CN116863003A (zh) | 2023-05-29 | 2023-05-29 | Video generation method, method for training a video generation model, and apparatus |
| CN202310618707.8 | 2023-05-29 | | |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2024248736A1 (zh) | 2024-12-05 |
Family
ID=88217934
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/SG2024/050361 Ceased WO2024248736A1 (zh) | 2023-05-29 | 2024-05-29 | Video generation method, method for training a video generation model, and apparatus |
Country Status (3)
| Country | Link |
|---|---|
| EP (1) | EP4722982A4 (zh) |
| CN (1) | CN116863003A (zh) |
| WO (1) | WO2024248736A1 (zh) |
Cited By (10)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
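| CN119364135A (zh) * | 2024-12-26 | 2025-01-24 | 北京生数科技有限公司 | Method, apparatus, device, and medium for generating video from images |
| CN119693971A (zh) * | 2024-12-17 | 2025-03-25 | 山东大学 | Diffusion-model-based cross-view gait recognition method and system under missing views |
| CN119741222A (zh) * | 2024-12-26 | 2025-04-01 | 北京生数科技有限公司 | Video generation method, apparatus, and electronic device |
| CN120070638A (zh) * | 2025-02-24 | 2025-05-30 | 中国科学技术大学 | Text-guided zero-shot transparent layer and layered image generation method |
| CN120216684A (zh) * | 2025-03-11 | 2025-06-27 | 北京合力亿捷科技股份有限公司 | Text processing method, apparatus, device, storage medium, and program product |
| CN120318827A (zh) * | 2025-04-18 | 2025-07-15 | 华南理工大学 | Continual anomaly detection method based on generative replay |
| CN120434477A (zh) * | 2025-07-08 | 2025-08-05 | 北京生数科技有限公司 | Layout-controllable video generation method, apparatus, device, medium, and product |
| CN120434485A (zh) * | 2025-07-09 | 2025-08-05 | 北京达佳互联信息技术有限公司 | Content generation method, training method for content generation model, and corresponding apparatus |
| CN120499471A (zh) * | 2025-05-16 | 2025-08-15 | 合肥孪生宇宙科技有限公司 | Digital human video generation system based on a diffusion Transformer architecture |
| CN121280441A (zh) * | 2025-12-09 | 2026-01-06 | 山东科技大学 | Method for accurately generating industrial part defect samples based on a conditional diffusion model |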
Families Citing this family (14)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
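| CN120111318A (zh) * | 2023-12-04 | 2025-06-06 | 北京字跳网络技术有限公司 | Video generation method and related device |
| CN117830483B (zh) * | 2023-12-27 | 2024-10-18 | 北京智象未来科技有限公司 | Image-based video generation method, apparatus, device, and storage medium |
| CN120416680A (zh) | 2024-01-30 | 2025-08-01 | 北京有竹居网络技术有限公司 | Method, apparatus, electronic device, and computer program product for generating video |
| CN118714417B (zh) * | 2024-02-07 | 2026-01-27 | 浙江天猫技术有限公司 | Video generation method, system, electronic device, and storage medium |
| CN118365730B (zh) * | 2024-04-29 | 2025-06-03 | 上海人工智能创新中心 | Text-to-image method, apparatus, device, and storage medium |
| CN118233714B (zh) * | 2024-05-23 | 2024-08-13 | 北京大学深圳研究生院 | Panoramic video generation method, apparatus, device, and storage medium |
| CN119091016A (zh) * | 2024-07-18 | 2024-12-06 | 浙江师范大学 | Diffusion-model-based dynamic video generation method and system |
| CN120512504B (zh) * | 2024-08-23 | 2025-12-16 | 北京极佳视界科技有限公司 | Video generation method, apparatus, device, and storage medium |
| CN119313788A (zh) * | 2024-09-18 | 2025-01-14 | 广东因赛品牌营销集团股份有限公司 | Character-centered marketing video generation method |
| CN119475470B (zh) * | 2024-11-01 | 2025-09-02 | 湖南大学 | Two-stage product design generation method, system, device, and medium |
| CN119629432B (zh) * | 2024-11-22 | 2025-10-17 | 平安科技(深圳)有限公司 | Video generation method and apparatus, electronic device, and storage medium |
| CN119697456B (zh) * | 2024-12-03 | 2025-10-10 | 电子科技大学(深圳)高等研究院 | Text-to-video generation method, apparatus, and storage medium |
| CN119940424B (zh) * | 2025-01-02 | 2025-09-26 | 湘潭大学 | Stratospheric wind field simulation method |
| CN120765497A (zh) * | 2025-06-30 | 2025-10-10 | 中国科学院杭州医学研究所 | Digital pathology image virtual staining system, method, and computer-readable storage medium |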
Citations (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2022265992A1 (en) * | 2021-06-14 | 2022-12-22 | Google Llc | Diffusion models having improved accuracy and reduced consumption of computational resources |
| CN115965791A (zh) * | 2022-12-19 | 2023-04-14 | 北京字跳网络技术有限公司 | Image generation method, apparatus, and electronic device |
Family Cites Families (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
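| CN115174824B (zh) * | 2021-03-19 | 2025-06-03 | 阿里巴巴创新公司 | Video generation method and apparatus, and promotional video generation method and apparatus |
| CN114973410B (zh) * | 2022-05-20 | 2025-08-22 | 北京沃东天骏信息技术有限公司 | Method and apparatus for extracting action features from video frames |
| CN115601485B (zh) * | 2022-12-15 | 2023-04-07 | 阿里巴巴(中国)有限公司 | Data processing method for a task processing model and virtual character animation generation method |
| CN115861131B (zh) * | 2023-02-03 | 2023-05-26 | 北京百度网讯科技有限公司 | Image-based video generation and model training methods, apparatus, and electronic device |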
- 2023-05-29 CN CN202310618707.8A patent/CN116863003A/zh active Pending
- 2024-05-29 WO PCT/SG2024/050361 patent/WO2024248736A1/zh not_active Ceased
- 2024-05-29 EP EP24816025.1A patent/EP4722982A4/en active Pending
Patent Citations (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2022265992A1 (en) * | 2021-06-14 | 2022-12-22 | Google Llc | Diffusion models having improved accuracy and reduced consumption of computational resources |
| CN115965791A (zh) * | 2022-12-19 | 2023-04-14 | 北京字跳网络技术有限公司 | Image generation method, apparatus, and electronic device |
Non-Patent Citations (2)
| Title |
|---|
| NI HAOMIAO; SHI CHANGHAO; LI KAI; HUANG SHARON X.; MIN MARTIN RENQIANG: "Conditional Image-to-Video Generation with Latent Flow Diffusion Models", 2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), IEEE, 17 June 2023 (2023-06-17), pages 18444 - 18455, XP034401708, DOI: 10.1109/CVPR52729.2023.01769 * |
| See also references of EP4722982A4 * |
Cited By (12)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
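| CN119693971A (zh) * | 2024-12-17 | 2025-03-25 | 山东大学 | Diffusion-model-based cross-view gait recognition method and system under missing views |
| CN119364135A (zh) * | 2024-12-26 | 2025-01-24 | 北京生数科技有限公司 | Method, apparatus, device, and medium for generating video from images |
| CN119741222A (zh) * | 2024-12-26 | 2025-04-01 | 北京生数科技有限公司 | Video generation method, apparatus, and electronic device |
| CN119741222B (zh) * | 2024-12-26 | 2025-06-17 | 北京生数科技有限公司 | Video generation method, apparatus, and electronic device |
| CN120070638A (zh) * | 2025-02-24 | 2025-05-30 | 中国科学技术大学 | Text-guided zero-shot transparent layer and layered image generation method |
| CN120216684A (zh) * | 2025-03-11 | 2025-06-27 | 北京合力亿捷科技股份有限公司 | Text processing method, apparatus, device, storage medium, and program product |
| CN120318827A (zh) * | 2025-04-18 | 2025-07-15 | 华南理工大学 | Continual anomaly detection method based on generative replay |
| CN120499471A (zh) * | 2025-05-16 | 2025-08-15 | 合肥孪生宇宙科技有限公司 | Digital human video generation system based on a diffusion Transformer architecture |
| CN120434477A (zh) * | 2025-07-08 | 2025-08-05 | 北京生数科技有限公司 | Layout-controllable video generation method, apparatus, device, medium, and product |
| CN120434485A (zh) * | 2025-07-09 | 2025-08-05 | 北京达佳互联信息技术有限公司 | Content generation method, training method for content generation model, and corresponding apparatus |
| CN121280441A (zh) * | 2025-12-09 | 2026-01-06 | 山东科技大学 | Method for accurately generating industrial part defect samples based on a conditional diffusion model |
| CN121280441B (zh) * | 2025-12-09 | 2026-02-17 | 山东科技大学 | Method for accurately generating industrial part defect samples based on a conditional diffusion model |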
Also Published As
| Publication number | Publication date |
|---|---|
| EP4722982A1 (en) | 2026-04-08 |
| CN116863003A (zh) | 2023-10-10 |
| EP4722982A4 (en) | 2026-04-22 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
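| WO2024248736A1 (zh) | | Video generation method, method for training a video generation model, and apparatus |
| CN107979764B (zh) | | Video caption generation method based on semantic segmentation and a multi-layer attention framework |
| CN116611496B (zh) | | Text-to-image generation model optimization method, apparatus, device, and storage medium |
| JP2023541119A (ja) | | Character recognition model training method, character recognition method, apparatus, electronic device, storage medium, and computer program |
| US11409791B2 (en) | | Joint heterogeneous language-vision embeddings for video tagging and search |
| CN118632070B (zh) | | Video generation method, apparatus, electronic device, storage medium, and program product |
| CN112819933A (zh) | | Data processing method, apparatus, electronic device, and storage medium |
| CN117593400A (zh) | | Image generation method, model training method, and corresponding apparatus |
| JP2021501416A (ja) | | Deep reinforcement learning framework for characterizing video content |
| CN116975347B (zh) | | Image generation model training method and related apparatus |
| CN118627582A (zh) | | Method, system, and medium for model training |
| CN116957932A (zh) | | Image generation method, apparatus, electronic device, and storage medium |
| WO2025256268A1 (zh) | | Multimodal data processing method, apparatus, electronic device, computer-readable storage medium, and computer program product |
| CN116977457A (zh) | | Data processing method, device, and computer-readable storage medium |
| Sun et al. | | Beyond talking–generating holistic 3d human dyadic motion for communication |
| WO2026045738A1 (zh) | | Image processing method, apparatus, device, medium, and program product |
| WO2025167981A1 (zh) | | Video generation method, apparatus, device, and medium |
| Walsh et al. | | Using sign language production as data augmentation to enhance sign language translation |
| CN118429755A (zh) | | Text-to-image model training method, image prediction method, apparatus, device, and medium |
| CN118823153A (zh) | | Image generation method, apparatus, device, and storage medium |
| Jeon et al. | | Multimodal audiovisual speech recognition architecture using a three‐feature multi‐fusion method for noise‐robust systems |
| Siniukov et al. | | Ditailistener: Controllable high fidelity listener video generation with diffusion |
| Dhanyalakshmi et al. | | A Survey on Face‐Swapping Methods for Identity Manipulation in Deepfake Applications |
| Dhanyalakshmi et al. | | A survey on deep learning based reenactment methods for deepfake applications |
| CN119646181A (zh) | | Content generation method, apparatus, electronic device, and storage medium |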
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 24816025; Country of ref document: EP; Kind code of ref document: A1 |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | WWE | Wipo information: entry into national phase | Ref document number: 2024816025; Country of ref document: EP |
| | ENP | Entry into the national phase | Ref document number: 2024816025; Country of ref document: EP; Effective date: 20260102 |
| | WWP | Wipo information: published in national office | Ref document number: 2024816025; Country of ref document: EP |
