WO2024248736A1 - Video generation method, method for training a video generation model, and device

Video generation method, method for training a video generation model, and device

Info

Publication number
WO2024248736A1
Authority
WO
WIPO (PCT)
Prior art keywords
condition
local
sequence
feature representation
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/SG2024/050361
Other languages
English (en)
French (fr)
Inventor
陈大友
张士伟
张迎亚
赵德丽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Innovation Private Ltd
Original Assignee
Alibaba Innovation Private Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Innovation Private Ltd
Priority to EP24816025.1A (EP4722982A4)
Publication of WO2024248736A1

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T9/00Image coding
    • G06T9/002Image coding using neural networks
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • G06N3/0455Auto-encoder networks; Encoder-decoder networks
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0475Generative networks
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/088Non-supervised learning, e.g. competitive learning
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/0895Weakly supervised learning, e.g. semi-supervised or self-supervised learning
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/094Adversarial learning
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T9/00Image coding
    • G06T9/001Model-based coding, e.g. wire frame
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/62Extraction of image or video features relating to a temporal dimension, e.g. time-based feature extraction; Pattern tracking
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content

Definitions

  • Video Generation Method, Method and Device for Training Video Generation Model
  • Cross-reference
  • This disclosure claims priority to the Chinese patent application filed with the Chinese Patent Office on May 29, 2023, with priority number 202310618707.8 and titled "Video Generation Method, Method and Device for Training Video Generation Model", the entire contents of which are incorporated by reference into this disclosure.
  • Technical Field
  • The present disclosure relates to the field of computer vision technology, and in particular to a video generation method, and a method and device for training a video generation model.
  • Background
  • Text-based video generation technology provides many content creators with brand-new tools, making the creation of video content that originally required professionals and expensive equipment easier and more cost-effective.
  • the present disclosure provides a video generation method, a method and device for training a video generation model, so as to improve the quality of the generated videos.
  • a video generation method comprising: obtaining local conditions, the local conditions comprising spatial conditions and/or temporal conditions; encoding the local conditions to obtain local condition feature representation; integrating a noise sequence with the local condition feature representation to obtain a noise latent vector sequence; performing denoising on the noise latent vector sequence using a diffusion model to obtain a denoised latent vector sequence; performing decoding on the denoised latent vector sequence to obtain a video.
  • the method further comprises: obtaining global conditions, the global conditions comprising at least one of text conditions, style conditions and color conditions; encoding the global conditions to obtain a global condition feature representation; the denoising of the noise latent vector sequence using a diffusion model comprises: the diffusion model performs cross-attention processing on the noise latent vector sequence using the global condition feature representation to predict noise, and performs denoising using the predicted noise.
  • the spatial conditions comprise at least one of a single image and a single semantic sketch; the temporal conditions comprise at least one of a motion vector sequence, a depth map sequence, a mask map sequence, a semantic sketch sequence and a grayscale map sequence.
  • encoding the global condition to obtain a global condition feature representation includes: text encoding the text condition to obtain a text feature representation, and image encoding the style condition and/or color condition to obtain an image feature representation; integrating the text feature representation and the image feature representation to obtain a global condition feature representation.
  • encoding the local condition to obtain a local condition feature representation includes: encoding each local condition with its own spatiotemporal condition encoder to obtain a feature tensor corresponding to each local condition, wherein the spatiotemporal condition encoders correspond one-to-one to the local conditions; fusing the feature tensors corresponding to the local conditions to obtain the local condition feature representation.
  • the encoding of each local condition respectively using a spatiotemporal condition encoder includes: using a spatiotemporal condition encoder to perform spatial feature encoding on the local condition to obtain a spatial feature representation of the local condition; if the local condition is a sequence, performing a temporal self-attention process on the spatial feature representation of the local condition to obtain a feature tensor corresponding to the local condition; otherwise, copying the spatial feature representation of the local condition in a temporal sequence to generate a spatial feature representation in a temporal sequence, performing a temporal self-attention process on the spatial feature representation in the temporal sequence, and obtaining a feature tensor corresponding to the local condition.
  • a method for training a video generation model comprising: obtaining first training data including a plurality of first training samples, the first training samples comprising local condition samples, the local condition samples comprising spatial condition samples and/or temporal condition samples; using the first training data to train a video generation model, the video generation model comprising: a spatiotemporal condition encoder, a diffusion model and a decoder; wherein the spatiotemporal condition encoder encodes the local condition samples to obtain a local condition feature representation; the diffusion model performs denoising on a noise latent vector sequence to obtain a denoised latent vector sequence, the noise latent vector sequence being obtained by integrating the noise sequence with the local condition feature representation; the decoder performs decoding processing using the denoised latent vector sequence to obtain a video; the training objective comprises: minimizing the difference between the noise predicted by the diffusion model when performing denoising and the Gaussian noise.
  • the first training sample also comprises a global condition sample
  • the global condition sample comprises at least one of a text condition sample, a style condition sample and a color condition sample
  • the video generation model further includes a global encoder, and the global encoder encodes the global condition sample to obtain a global condition feature representation
  • the diffusion model performs denoising on the noise latent vector sequence, including: the diffusion model performs cross-attention processing on the noise latent vector sequence using the global condition feature representation to predict noise, and performs denoising using the predicted noise.
  • the global encoder and the decoder use parameters obtained by pre-training, and in each round of iteration of the training, update the parameters of the spatiotemporal condition encoder and the diffusion model using a loss function corresponding to the training target.
  • the acquisition of first training data including a plurality of first training samples includes: acquiring a video sample; acquiring a description text, a style image and/or a color histogram of the video sample as the text condition sample, the style condition sample and/or the color condition sample respectively; extracting at least one of a single image and a single semantic sketch as the spatial condition sample; extracting at least one of a motion vector sequence, a depth map sequence, a mask map sequence, a semantic sketch sequence and/or a grayscale map sequence as the temporal condition sample.
  • the pre-training of the diffusion model includes: obtaining second training data including a plurality of second training samples, the second training samples including: extracting description text from a video sample as a text sample; encoding the text sample using a global encoder to obtain a text feature representation; inputting the text feature representation and a noise sequence into a diffusion model to train the diffusion model, the diffusion model performing denoising on the noise sequence using the text feature representation; the training objectives include: minimizing the difference between the noise predicted by the diffusion model when performing denoising at each time step and the Gaussian noise.
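The training objective described above (minimizing the difference between the predicted noise and the Gaussian noise) can be illustrated with a minimal sketch. The squared-error form of the "difference" and the DDPM-style forward noising with a cumulative coefficient `alpha_bar_t` are assumptions made for illustration; the disclosure does not specify either, and the function names are hypothetical.

```python
import numpy as np

def add_noise(z0, eps, alpha_bar_t):
    # Assumed DDPM-style forward process: mix the clean latent z0 with
    # Gaussian noise eps according to the cumulative coefficient alpha_bar_t.
    return np.sqrt(alpha_bar_t) * z0 + np.sqrt(1.0 - alpha_bar_t) * eps

def noise_prediction_loss(eps_pred, eps):
    # Assumed form of the objective: mean squared error between the noise
    # predicted by the diffusion model and the Gaussian noise that was added.
    return float(np.mean((eps_pred - eps) ** 2))
```

In a training step, `eps` would be sampled from a standard normal distribution, `add_noise` would produce the noisy latent fed to the model, and the loss would compare the model output against `eps`.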
  • the spatiotemporal condition encoder encodes the local condition sample to obtain the local condition feature representation, including: using each spatiotemporal condition encoder to encode each local condition sample to obtain a feature tensor corresponding to each local condition sample, wherein the spatiotemporal condition encoder corresponds to the local condition; fusing the feature tensors corresponding to each local condition sample to obtain the local condition feature representation.
  • the encoding of each local condition sample includes: using a spatiotemporal condition encoder to perform spatial feature encoding on the local condition sample to obtain a spatial feature representation of the local condition sample; if the local condition sample is a sequence, performing temporal self-attention on the spatial feature representation of the local condition sample to obtain a feature tensor corresponding to the local condition sample; otherwise, replicating the spatial feature representation of the local condition sample along the time dimension to generate a temporal sequence of spatial feature representations, and performing temporal self-attention on that sequence to obtain a feature tensor corresponding to the local condition sample.
  • a video generation method which is executed by a cloud server, and the method includes: obtaining a local condition from a user terminal, wherein the local condition includes a spatial condition and/or a temporal condition; encoding the local condition to obtain a local condition feature representation; integrating a noise sequence with the local condition feature representation to obtain a noise latent vector sequence; performing a denoising process on the noise latent vector sequence using a diffusion model to obtain a denoised latent vector sequence; performing a decoding process on the denoised latent vector sequence to obtain a video; and sending the video to the user terminal for display.
  • a video generating device comprising: a condition acquisition unit, configured to acquire local conditions, the local conditions comprising spatial conditions and/or temporal conditions; a video generating unit, configured to encode the local conditions to obtain local condition feature representation; integrate a noise sequence with the local condition feature representation to obtain a noise latent vector sequence; denoise the noise latent vector sequence using a diffusion model to obtain a denoised latent vector sequence; and decode the denoised latent vector sequence to obtain a video.
  • a device for training a video generation model comprising: a sample acquisition unit, configured to acquire first training data comprising multiple first training samples, the first training samples comprising local condition samples, the local condition samples comprising spatial condition samples and/or temporal condition samples; a model training unit, configured to train a video generation model using the first training data, the video generation model comprising: a spatiotemporal condition encoder, a diffusion model and a decoder; wherein the spatiotemporal condition encoder encodes the local condition samples to obtain a local condition feature representation; the diffusion model denoises a noise latent vector sequence to obtain a denoised latent vector sequence, the noise latent vector sequence is obtained by integrating the noise sequence with the local condition feature representation; the decoder performs decoding processing using the denoised latent vector sequence to obtain a video; the training objectives include: minimizing the difference between the noise predicted by the diffusion model when performing denoising and the Gaussian noise.
  • a computer-readable storage medium on which a computer program is stored, and when the program is executed by a processor, the steps of the method described in any one of the first to third aspects are implemented.
  • an electronic device comprising: one or more processors; and a memory associated with the one or more processors, the memory being configured to store program instructions, and when the program instructions are read and executed by the one or more processors, the steps of the method described in any one of the first to third aspects are implemented.
  • the present disclosure incorporates spatial conditions and/or temporal conditions as local conditions, integrates local condition feature representation with noise sequence to obtain noise latent vector sequence, and performs denoising on the noise latent vector sequence to obtain denoised latent vector sequence, and then decodes to obtain video.
  • This method makes video generation no longer limited to text conditions, but introduces spatial conditions and/or temporal conditions to guide video generation, thereby generating videos more flexibly and diversely, and improving video quality.
  • The present disclosure can generate the required video by combining multiple conditions: one or more global conditions, together with spatial conditions and temporal conditions among the local conditions. The encoding, alignment and fusion of multiple local conditions are performed by the spatiotemporal condition encoder, and local control of the video is achieved by integrating the result with the noise sequence; the global conditions are encoded and integrated by the global encoder, and the diffusion model applies cross-attention to the noise latent vector sequence using the global condition feature representation to achieve global control of the video.
  • This method greatly improves the controllability of video generation, and the generated video is more flexible and diverse.
  • The present disclosure introduces motion vector sequences as a temporal condition.
  • the video generation model is able to capture the inter-frame dynamics, thereby achieving control over the internal motion in the video.
  • The spatiotemporal condition encoder provided by the present disclosure first extracts local spatial information and then performs temporal modeling, which promotes explicit embedding of the time domain. It also provides a unified interface for different local conditions and enhances consistency between frames. Non-temporal spatial conditions, such as a single image or a single semantic sketch, are replicated along the time dimension before fusion so that they are consistent with the temporal conditions. This encoding and fusion process gives the subsequently generated video better controllability in both temporal and spatial perception.
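The encoder behaviour described above (spatial encoding first, replication along time for single-frame conditions, then temporal self-attention, then fusion) can be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the spatial encoder is replaced by a hypothetical per-pixel linear projection, the self-attention is single-head with no learned projections, and fusion is assumed to be an element-wise sum. All function names are hypothetical.

```python
import numpy as np

def spatial_encode(cond, weight):
    # Hypothetical stand-in for the spatial feature encoder: a per-pixel
    # linear projection from C input channels to D feature channels.
    # cond: (..., H, W, C), weight: (C, D) -> (..., H, W, D)
    return cond @ weight

def temporal_self_attention(x):
    # x: (T, H, W, D). Attention runs along the time axis, independently
    # for every spatial position (single head, no learned projections).
    T, H, W, D = x.shape
    seq = x.transpose(1, 2, 0, 3).reshape(H * W, T, D)
    scores = seq @ seq.transpose(0, 2, 1) / np.sqrt(D)
    scores = scores - scores.max(axis=-1, keepdims=True)
    attn = np.exp(scores)
    attn = attn / attn.sum(axis=-1, keepdims=True)
    out = attn @ seq
    return out.reshape(H, W, T, D).transpose(2, 0, 1, 3)

def encode_local_condition(cond, weight, num_frames):
    # cond is either a single frame (H, W, C) or a sequence (T, H, W, C).
    feat = spatial_encode(cond, weight)
    if cond.ndim == 3:
        # Non-sequence condition: replicate the spatial features along time.
        feat = np.repeat(feat[None], num_frames, axis=0)
    return temporal_self_attention(feat)

def fuse(feature_tensors):
    # Assumed fusion: element-wise sum of the per-condition feature tensors.
    return np.sum(feature_tensors, axis=0)
```

Because every encoder returns a tensor of the same shape `(T, H, W, D)`, single images and sequences can be fused through the same interface.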
  • The present disclosure can first pre-train the diffusion model using text-to-video training data, and then further train the video generation model starting from the pre-trained diffusion model parameters, thereby improving the effect and efficiency of the video generation model.
  • Figure 1 is a system architecture diagram applicable to the embodiment of the present disclosure
  • Figure 2 is a flow chart of the video generation method provided by the embodiment of the present disclosure
  • Figure 3 is a schematic diagram of the conditions required for generating a video provided by the embodiment of the present disclosure
  • Figure 4 is a schematic diagram of the principle of the video generation model provided by the embodiment of the present disclosure
  • Figure 5 is a schematic diagram of the structure of the spatiotemporal conditional encoder provided by the embodiment of the present disclosure
  • Figure 6 is a schematic diagram of the principle of the diffusion model provided by the embodiment of the present disclosure at each time step
  • Figures 7a to 7d are four example diagrams of generating videos provided by the embodiment of the present disclosure
  • Figure 8 is a flow chart of the method for training the video generation model provided by the embodiment of the present disclosure
  • Figure 9 is a schematic block diagram of the video generation device provided by the embodiment of the present disclosure
  • Figure 10 is a schematic block diagram of the device for training the video generation model provided by the embodiment of the present disclosure
  • Figure 11 is
  • FIG. 1 shows an exemplary system architecture to which an embodiment of the present disclosure can be applied.
  • the system architecture includes a user terminal, and a model training device and a video generation device located at a server.
  • the model training device of the server adopts the method provided by the embodiment of the present disclosure to perform model training in the offline stage to obtain a video generation model.
  • the user terminal can exchange information with the video generation device of the server through the network.
  • the user terminal can send the condition information used for generating the video input by the user to the video generation device through the network, wherein the condition information may include global conditions and local conditions, and the specific content will be described in detail in the subsequent embodiments.
  • the video generation device can use the trained video generation model to generate a video under the constraints of global conditions and local conditions, and send the generated video to the user terminal through the network for the user terminal to display the video.
  • the above-mentioned user terminal may include, but is not limited to, such as: smart mobile terminals, smart home devices, wearable devices, smart medical devices, PCs (Personal Computers), etc.
  • Smart mobile terminals may include, for example, mobile phones, tablet computers, laptop computers, PDAs (Personal Digital Assistants), connected cars, etc.
  • Smart home devices may include smart home appliances, such as smart TVs, smart refrigerators, and other home appliances with video playback functions.
  • Wearable devices may include, for example, smart watches, smart glasses, smart bracelets, VR (Virtual Reality) devices, AR (Augmented Reality) devices, mixed reality devices (i.e., devices that can support both virtual reality and augmented reality), etc.
  • the model training device and the video generation device may be set as independent servers, or may be set on the same server or server group, or may be set on an independent or the same cloud server.
  • A cloud server, also known as a cloud computing server or cloud host, is a host product in a cloud computing service system that addresses the difficult management and weak service scalability of traditional physical hosts and virtual private server (VPS) services.
  • the model training device and the video generation device may also be set on a computer terminal with strong computing power. It should be noted that, in addition to performing video generation online, the above-mentioned video generation device may also perform video generation offline. It should be understood that the number of user terminals, model training devices, video generation devices, and video generation models in FIG. 1 is schematic. According to the implementation requirements, there may be any number of user terminals, model training devices, video generation devices, and video generation models. FIG.
  • Step 202 obtaining local conditions, the local conditions including spatial conditions and/or temporal conditions.
  • Step 204 encoding the local conditions to obtain local condition feature representations.
  • Step 206 integrating the noise sequence with the local condition feature representation to obtain a noise latent vector sequence; denoising the noise latent vector sequence using a diffusion model to obtain a denoised latent vector sequence.
  • Step 208 performing decoding processing using the denoised latent vector sequence to obtain a video.
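Steps 206 and 208 can be sketched end to end, taking as input the local condition feature representation produced by Step 204. This is a minimal sketch under stated assumptions: "integrating" is assumed to be channel-wise concatenation, and the noise predictor and decoder are hypothetical placeholders rather than trained models, so the loop only shows how data flows between the steps and is not a real diffusion sampler.

```python
import numpy as np

def integrate(noise_seq, local_feat):
    # Assumed integration: concatenate the noise sequence and the local
    # condition feature representation along the channel axis.
    return np.concatenate([noise_seq, local_feat], axis=-1)

def toy_noise_predictor(noisy_latents, latent_channels):
    # Hypothetical placeholder for the diffusion model's noise prediction:
    # it treats a fixed fraction of the latent channels as noise.
    return 0.5 * noisy_latents[..., :latent_channels]

def toy_decoder(latents):
    # Hypothetical placeholder for the decoder: squash latents into [0, 1].
    return 1.0 / (1.0 + np.exp(-latents))

def generate_video(local_feat, latent_channels=4, num_steps=10, seed=0):
    rng = np.random.default_rng(seed)
    T, H, W, _ = local_feat.shape
    latents = rng.standard_normal((T, H, W, latent_channels))  # noise sequence
    for _ in range(num_steps):
        noisy = integrate(latents, local_feat)       # Step 206: integrate
        eps = toy_noise_predictor(noisy, latent_channels)
        latents = latents - eps                      # Step 206: denoise
    return toy_decoder(latents)                      # Step 208: decode
```

A call such as `generate_video(np.zeros((5, 4, 4, 8)))` returns an array of shape `(5, 4, 4, 4)`, one decoded frame per time step.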
  • the present disclosure incorporates spatial conditions and/or temporal conditions as local conditions, integrates the local condition feature representation with the noise sequence to obtain a noise latent vector sequence, and uses the global condition feature representation to denoise the noise latent vector sequence to obtain the denoised latent vector sequence, and then decodes to obtain the video.
  • This method makes video generation no longer limited to text conditions, but introduces spatial conditions and/or temporal conditions to guide video generation, so as to generate videos more flexibly and diversely and improve the quality of the video.
  • the following describes each step in the above process in detail. First, the above step 202, i.e., "obtaining local conditions", is described in detail in conjunction with the embodiment.
  • Existing video generation methods usually generate videos under the guidance of text conditions, and have poor flexibility and diversity.
  • new combination conditions are introduced to improve the controllability of video generation.
  • the global conditions mainly include text conditions, style conditions and/or color conditions.
  • the text conditions are mainly a description of the video content, which globally reflects the content of the video to be generated.
  • the style condition can usually adopt a style image, and the purpose of the condition is to generate a video with the same style features as the style image, that is, the video has the style expressed in the style image as a whole.
  • the color condition can be in the form of a color histogram, etc., and the purpose of the condition is to guide the color distribution of the generated video.
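As an illustration of a color condition, the sketch below computes a normalized color histogram from a stack of frames. The per-channel binning, the bin count and the `color_histogram` name are assumptions for illustration; the disclosure only states that the color condition can take the form of a color histogram.

```python
import numpy as np

def color_histogram(frames, bins=8):
    # frames: (T, H, W, 3) array of 8-bit RGB values.
    # Build one histogram per channel over all frames, concatenate them,
    # and normalize so the entries sum to 1.
    hists = [
        np.histogram(frames[..., c], bins=bins, range=(0, 256))[0]
        for c in range(3)
    ]
    hist = np.concatenate(hists).astype(np.float64)
    return hist / hist.sum()
```

The resulting vector has `3 * bins` entries and could be extracted from a video sample during training or supplied directly by a user at generation time.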
  • the local condition mainly includes spatial conditions and/or temporal conditions.
  • the spatial condition can include at least one of a single image, a single semantic sketch, etc.
  • A single image is one still image, which is used to constrain the content and structure of the image frames in the video.
  • the semantic sketch is also called a hand-drawn sketch, a sketch, etc., which is a way to describe the basic semantics of an object in an image. It is a generalized expression method that mainly depicts the edge (or contour) features of the object, can ignore the details and redundant features of the object, and retain the main information.
  • the spatial condition mainly guides the video content from the spatial features.
  • the temporal condition includes at least one of a motion vector sequence, a depth map sequence, a mask map sequence, a semantic sketch sequence and a grayscale map sequence. Since the video is an image sequence, the temporal condition provides finer-grained guidance of the video along the temporal dimension.
  • the motion vector sequence contains motion vectors between adjacent frames.
  • the motion vector indicates the direction of movement of each Token (element) between adjacent frames, including horizontal and vertical directions. Therefore, the motion vector can be expressed as a two-dimensional vector. It explicitly expresses the movement of each Token between two adjacent image frames, so that the video generated by the video generation model has motion controllability.
  • each Token of the image refers to the element that constitutes the image.
  • the image is divided into non-overlapping blocks, and the blocks and start symbols in the image are all Tokens.
  • the depth map sequence is composed of the depth map of each frame.
  • the depth map contains the depth information of each Token in the image frame, which is used to guide the depth information of each frame in the video.
  • the mask map sequence is composed of the mask map of each frame.
  • the mask map masks the content of a part of the image frame, so that the model has the ability to predict the content of the masked part.
  • the mask area can be specified by the user or can be determined randomly.
  • the semantic sketch sequence is composed of semantic sketches of each frame. Compared with the single semantic sketch in the spatial condition, the sketch sequence can provide more control details.
  • the grayscale image sequence is composed of grayscale images of each frame.
  • the grayscale image contains the grayscale information of each token in the image frame, which is used to guide the grayscale information of each frame in the video.
  • the sequences in the above temporal conditions all have the same length, which is equal to the length of the generated video and is denoted, for example, as T1.
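  • As an illustration of a temporal condition of length T1, the following minimal sketch builds a randomly determined mask map sequence and applies it to a video. The rectangular mask shape, the size range and the names `random_mask_sequence` and `apply_masks` are assumptions made for this example and are not taken from the disclosure.

```python
import numpy as np

def random_mask_sequence(t1, height, width, rng, min_frac=0.25, max_frac=0.5):
    """Return a (t1, height, width) array of 0/1 masks, one random rectangle per frame.

    A value of 1 marks a masked (hidden) pixel. The rectangle size range is an
    illustrative assumption.
    """
    masks = np.zeros((t1, height, width), dtype=np.uint8)
    for t in range(t1):
        h = int(rng.integers(int(height * min_frac), int(height * max_frac) + 1))
        w = int(rng.integers(int(width * min_frac), int(width * max_frac) + 1))
        top = int(rng.integers(0, height - h + 1))
        left = int(rng.integers(0, width - w + 1))
        masks[t, top:top + h, left:left + w] = 1
    return masks

def apply_masks(video, masks):
    """Zero out the masked region of each frame; video has shape (t1, H, W, C)."""
    return video * (1 - masks)[..., None]

rng = np.random.default_rng(0)
video = rng.random((6, 32, 32, 3))            # toy video with T1 = 6 frames
masks = random_mask_sequence(6, 32, 32, rng)  # mask map sequence, also of length T1
masked_video = apply_masks(video, masks)
```

  • The mask sequence has one map per frame, so its length matches the video length T1 as required for every temporal condition.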
  • the text condition is "a child standing next to the Christmas tree".
  • the style condition is a style image, which is used to limit the overall style of the generated video, that is, to transfer the style features of the style image to the generated video
  • the color condition is a color histogram, as shown in Figure 3, which is used to limit the color distribution of the generated video.
  • the single image in the spatial condition can be an image containing a child and a Christmas tree, which is used to limit the appearance of the child and the Christmas tree in the generated video, etc., as shown in Figure 3.
  • the semantic sketch can be a picture describing the basic lines of the child and the Christmas tree, which is used to limit the basic shape and position of the child and the Christmas tree, etc.
  • the motion vector sequence in the time condition includes the motion vectors between adjacent frames, which is schematically represented by small arrows in the figure.
  • the depth map sequence includes the depth map of each frame
  • the mask map sequence includes the mask map of each frame
  • the semantic sketch sequence includes the semantic sketch of each frame.
  • the length of each sequence is T1, and T1 is the length of the video to be generated.
  • a condition input interface can be provided to the user, and the user can choose to input the above-mentioned global conditions and local conditions on the interface.
  • the user can enter the text condition in the text input box.
  • the user can use the image drawing or editing tool provided on the interface to input a single image and/or a single semantic sketch as a spatial condition, and input at least one of a motion vector sequence, a depth map sequence, a mask map sequence, and a semantic sketch sequence as a temporal condition.
  • the style image can also be input by uploading a self-selected image or selecting an image from the image library provided by the interface.
  • the user can select a specific area in the image and generate a moving track by a specific gesture, mouse track, etc. to indicate the moving route of the specific area in the video, and the server automatically generates a motion vector sequence based on the moving track.
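  • One possible way to derive a motion vector sequence from such a moving track is sketched below. It assumes the track has already been sampled to one (x, y) position per frame and that the selected area moves rigidly; the function name `trajectory_to_motion_vectors` and the array layout are hypothetical. A track with N positions yields N - 1 fields, so how the track is sampled to match the video length T1 is left open here.

```python
import numpy as np

def trajectory_to_motion_vectors(region_mask, trajectory):
    """Turn a 2D trajectory into dense motion-vector fields.

    region_mask: (H, W) boolean array marking the user-selected area in the first frame.
    trajectory:  (N, 2) sequence of (x, y) positions of that area, one per frame.
    Returns an (N - 1, H, W, 2) array; entry [t, y, x] is the (dx, dy) displacement
    from frame t to frame t + 1, and is zero outside the moving area.
    """
    trajectory = np.asarray(trajectory, dtype=np.float32)
    steps = np.diff(trajectory, axis=0)                  # (N - 1, 2) per-frame displacement
    height, width = region_mask.shape
    fields = np.zeros((len(steps), height, width, 2), dtype=np.float32)
    ys, xs = np.nonzero(region_mask)
    for t, step in enumerate(steps):
        # Shift the selected area by the displacement accumulated up to frame t.
        offset = np.rint(trajectory[t] - trajectory[0]).astype(int)
        moved_x = np.clip(xs + offset[0], 0, width - 1)
        moved_y = np.clip(ys + offset[1], 0, height - 1)
        fields[t, moved_y, moved_x] = step
    return fields

region = np.zeros((16, 16), dtype=bool)
region[2:5, 2:5] = True                       # area selected by the user
path = [(3, 3), (5, 3), (7, 4), (9, 6)]       # hand-drawn track, one point per frame
motion = trajectory_to_motion_vectors(region, path)
```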
  • in addition to the condition input methods listed above, other condition input methods can also be used, which are not exhaustively listed here.
  • the following is a detailed description of the above step 204, i.e., "encoding the local condition to obtain the local condition feature representation, and encoding the global condition to obtain the global condition feature representation", in conjunction with the embodiment.
  • Steps 204 to 208 involved in FIG. 2 can be implemented by a video generation model obtained by pre-training.
  • the video generation model can include a spatiotemporal condition encoder, a diffusion model and a decoder, and can further include a global encoder.
  • This step is mainly performed by the spatiotemporal condition encoder, and if the global condition is used, it is further performed by the global encoder.
  • the global encoder is used to encode the global condition to obtain the global condition feature representation. If the global condition contains a text condition, the text condition can be encoded by a text encoder to obtain a text feature representation, and the text feature representation is used as the global condition feature representation.
  • the text condition can be encoded by a text encoder to obtain a text feature representation
  • the style condition and/or color condition can be encoded by an image encoder to obtain an image feature representation
  • the text feature representation and the image feature representation are then integrated to obtain a global condition feature representation.
  • the text encoder can be implemented by a pre-trained language model, such as BERT (Bidirectional Encoder Representations from Transformers), XLNet (an autoregressive model that captures bidirectional context through permutation language modeling), a GPT (Generative Pre-Training) model, CLIP (an encoding model that supports multimodal encoding), etc.
  • the text encoder can actually obtain the semantic embedding of the text condition, that is, the text feature representation.
  • the image encoder can use ViT (Vision Transformer), CLIP, etc., and the image encoder can actually obtain the semantic embedding of the style image, that is, the image feature representation.
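  • The disclosure states that the text feature representation and the image feature representation are integrated but does not fix the operation. The sketch below assumes concatenation along the token axis and assumes that all embeddings have already been projected to the same dimension; the random arrays are placeholders for real encoder outputs, and `integrate_global_conditions` is a hypothetical name.

```python
import numpy as np

def integrate_global_conditions(text_tokens, image_embeddings):
    """Concatenate text and image embeddings along the token axis (an assumption).

    text_tokens:      (num_text_tokens, dim) output of a text encoder.
    image_embeddings: list of (dim,) or (n, dim) arrays, e.g. style and color embeddings.
    """
    parts = [np.asarray(text_tokens)]
    for emb in image_embeddings:
        parts.append(np.atleast_2d(emb))      # promote a single vector to one token
    return np.concatenate(parts, axis=0)

rng = np.random.default_rng(0)
text_feat = rng.standard_normal((77, 64))     # placeholder for a text encoder output
style_feat = rng.standard_normal(64)          # placeholder for a style-image embedding
color_feat = rng.standard_normal(64)          # placeholder for a color-histogram embedding
global_feat = integrate_global_conditions(text_feat, [style_feat, color_feat])
```

  • With this layout the global condition feature representation is simply a longer token sequence, which suits the cross-attention injection described later.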
  • the spatiotemporal condition encoder is used to encode the local conditions to obtain the local condition feature representation. Because the local conditions include spatial conditions and/or temporal conditions, and the temporal conditions are sequential, the local conditions contain rich and complex spatiotemporal relationships, which poses a challenge for controllable guidance of the video. In view of this, in the embodiment of the present disclosure, a corresponding spatiotemporal condition encoder is arranged for each local condition.
  • the local conditions and the spatiotemporal condition encoder are in a one-to-one correspondence.
  • Each spatiotemporal condition encoder is used to encode each local condition respectively to obtain a feature tensor corresponding to each local condition, and then the feature tensors corresponding to each local condition are fused to obtain the local condition feature representation.
  • the structure of the spatiotemporal condition encoder can be shown in Figure 5.
  • the spatiotemporal condition encoder first encodes the spatial features of the local conditions to obtain the spatial feature representation of the local conditions.
  • the spatiotemporal condition encoder can be composed of two two-dimensional convolutions (Conv2D), two activation layers (for example, SiLU activation function can be used) and an average pooling layer.
  • if the local condition is a sequence, the spatial feature representation of the local condition is subjected to temporal self-attention processing to obtain the feature tensor corresponding to the local condition. This part is performed by the temporal Transformer shown in FIG. 5.
  • otherwise, the spatial feature representation of the local condition can first be replicated along the temporal dimension to generate a spatial feature representation in the temporal sequence, so that it is aligned in time with the spatial feature representations of the sequence-type local conditions. This part is performed by the replication operation shown in FIG. 5.
  • the temporal self-attention processing is performed on the spatial feature representation in the temporal sequence through the temporal Transformer to obtain the feature tensor corresponding to the local condition.
  • when the feature tensors corresponding to the local conditions are fused, the feature tensors can be added element by element.
  • the spatiotemporal condition encoder actually extracts local spatial information first and then performs temporal modeling, which promotes the explicit embedding of temporal information. It also provides a unified interface for different local conditions and enhances the consistency between frames.
  • for non-temporal spatial conditions such as a single image and a single semantic sketch, fusion with the other local conditions is achieved by replicating them along the time dimension to ensure consistency with the temporal conditions.
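  • The encode, replicate, attend and fuse flow described above can be sketched as follows. This is a simplified illustration under several assumptions: the Conv2D and SiLU layers are replaced by plain average pooling, the temporal self-attention is single-head with no learned weights, and every condition is given the same channel count so that the tensors can be added. All function names are hypothetical.

```python
import numpy as np

def spatial_encode(frames, pool=4):
    """Placeholder spatial encoder: average-pool each (H, W) frame by `pool`.

    frames: (T, H, W, C) with H and W divisible by `pool`.
    """
    t, h, w, c = frames.shape
    return frames.reshape(t, h // pool, pool, w // pool, pool, c).mean(axis=(2, 4))

def temporal_self_attention(x):
    """Single-head self-attention over the time axis, independently per location.

    x: (T, H, W, C). Queries, keys and values are x itself (no learned weights).
    """
    t, h, w, c = x.shape
    seq = x.transpose(1, 2, 0, 3).reshape(h * w, t, c)      # (locations, T, C)
    scores = seq @ seq.transpose(0, 2, 1) / np.sqrt(c)      # (locations, T, T)
    scores -= scores.max(axis=-1, keepdims=True)            # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    out = weights @ seq                                     # (locations, T, C)
    return out.reshape(h, w, t, c).transpose(2, 0, 1, 3)

def encode_local_condition(cond, t1):
    """Encode one local condition into a (t1, H', W', C) feature tensor."""
    if cond.ndim == 3:                       # single image or sketch: (H, W, C)
        cond = cond[None]
    feat = spatial_encode(cond)
    if feat.shape[0] == 1:                   # non-sequence: replicate along time
        feat = np.repeat(feat, t1, axis=0)
    return temporal_self_attention(feat)

def fuse(feature_tensors):
    """Fuse per-condition feature tensors by element-wise addition."""
    return np.sum(feature_tensors, axis=0)

rng = np.random.default_rng(0)
t1 = 6
single_image = rng.random((32, 32, 3))       # spatial condition
depth_seq = rng.random((t1, 32, 32, 3))      # temporal condition of length T1
image_feat = encode_local_condition(single_image, t1)
depth_feat = encode_local_condition(depth_seq, t1)
local_feat = fuse([image_feat, depth_feat])
```

  • Because a replicated single image is identical in every frame, temporal attention leaves it unchanged here, while sequence conditions are mixed across frames.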
  • a noise sequence ε can be randomly generated.
  • the randomly generated noise sequence conforms to the normal distribution, that is, Gaussian noise, and its length is consistent with the length of the video to be generated, both of which are T1.
  • After the spatiotemporal condition encoder encodes and integrates the local conditions, the local condition feature representation has the same spatial shape as ε. The noise sequence is then integrated with the local condition feature representation, for example, the two are spliced along the channel dimension, to obtain a noise latent vector sequence z_t, which is used as a control signal for video generation. Diffusion models have been widely used in the field of image generation because of their more stable training and generation flexibility, but have not been well utilized in the field of video generation.
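  • The splicing of the noise sequence with the local condition feature representation along the channel dimension can be illustrated with the short sketch below. The tensor sizes, the channels-last layout and the name `build_noise_latent` are assumptions chosen for the example; the random condition features stand in for the output of the spatiotemporal condition encoders.

```python
import numpy as np

def build_noise_latent(noise, local_feat):
    """Splice noise and local condition features along the channel (last) axis."""
    # Both inputs must share the same (T1, H, W) layout before splicing.
    if noise.shape[:3] != local_feat.shape[:3]:
        raise ValueError("noise and local condition features must share (T1, H, W)")
    return np.concatenate([noise, local_feat], axis=-1)

rng = np.random.default_rng(0)
t1, height, width, channels = 6, 8, 8, 4
noise = rng.standard_normal((t1, height, width, channels))       # Gaussian noise sequence
local_feat = rng.standard_normal((t1, height, width, channels))  # placeholder condition features
z = build_noise_latent(noise, local_feat)                        # (T1, H, W, 2 * channels)
```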
  • for example, an LDM (Latent Diffusion Model) can be used.
  • the initial video is projected to the latent representation, and then the latent representation is mapped back to the pixel space through the decoder to obtain the final video.
  • the initial video is the noise sequence
  • the final video is the generated video.
  • in addition to the LDM, other types of diffusion models can also be used.
  • the processing of the diffusion model can be understood as predicting the noise of the normal distribution and performing denoising at each time step to restore the real video content. This process simulates the reverse process of a Markov chain of length T, where T is the total number of time steps of the diffusion model.
  • the diffusion model can use the noise latent vector sequence obtained in the previous time step at each time step (in the first time step, the noise latent vector sequence input to the diffusion model is used) to predict the noise of the current time step, and use the predicted noise to denoise the noise latent vector sequence to obtain the noise latent vector sequence of the current time step. Furthermore, if the input conditions include global conditions, the diffusion model uses the global condition feature representation to perform cross-attention processing on the noise latent vector sequence to predict the noise, and uses the predicted noise to perform denoising.
  • the diffusion model uses the global condition feature representation to perform cross-attention processing on the noise latent vector sequence at the first time step to predict the noise at the current time step, and uses the predicted noise to denoise the noise latent vector sequence to obtain the noise latent vector sequence of the current time step.
  • the noisy latent vector sequence obtained at the previous time step is cross-attention processed using the global condition feature representation to predict the noise of the current time step, and the predicted noise is used to denoise the noisy latent vector sequence obtained at the previous time step to obtain the noisy latent vector sequence of the current time step.
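  • The step-by-step denoising described above can be sketched as a loop in which each time step predicts noise from the previous result and removes it. The disclosure does not specify the sampler or the noise schedule, so the sketch assumes a DDPM-style reverse step with a linear schedule; `toy_predictor` is a hypothetical stand-in for the diffusion model's noise prediction and carries no learned behavior.

```python
import numpy as np

def make_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule (an assumption) and the derived alpha terms."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas = 1.0 - betas
    return betas, alphas, np.cumprod(alphas)

def denoise(z, predict_noise, global_feat, num_steps, rng):
    """Run the reverse process from t = num_steps down to t = 1."""
    betas, alphas, alpha_bars = make_schedule(num_steps)
    for t in range(num_steps, 0, -1):
        i = t - 1
        # Predict the noise of the current time step from the previous result.
        eps = predict_noise(z, global_feat, t)
        mean = (z - betas[i] / np.sqrt(1.0 - alpha_bars[i]) * eps) / np.sqrt(alphas[i])
        if t > 1:
            z = mean + np.sqrt(betas[i]) * rng.standard_normal(z.shape)
        else:
            z = mean                         # no noise is added at the final step
    return z

calls = []

def toy_predictor(z, global_feat, t):
    """Hypothetical noise predictor used only to make the loop runnable."""
    calls.append(t)
    return 0.1 * z

rng = np.random.default_rng(0)
z_start = rng.standard_normal((6, 8, 8, 4))  # noise latent vector sequence
z_final = denoise(z_start, toy_predictor, None, num_steps=10, rng=rng)
```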
  • the diffusion model can use the three-dimensional UNet as the backbone network.
  • the UNet network is an encoder-decoder network that introduces skip connections.
  • the above process can be regarded as applying the denoising function ε_θ(·, ·, t) to the noisy latent vector sequence z_t and the condition c (for example, including global conditions and local conditions), where t ∈ {1, ..., T}.
  • in the denoising process, the cross-attention mechanism is used to inject the global conditions, so that the global conditions guide the video generation in terms of overall semantics.
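  • A minimal sketch of injecting global conditions through cross-attention is given below: the queries come from the latent tokens and the keys and values come from the global condition feature representation. The single-head form, the projection sizes, the residual connection and the random weights are all assumptions for illustration.

```python
import numpy as np

def cross_attention(latent_tokens, cond_tokens, w_q, w_k, w_v):
    """Single-head cross-attention; returns the attended values and the weights.

    latent_tokens: (num_latent, d_latent) flattened noise latent vector sequence.
    cond_tokens:   (num_cond, d_cond) global condition feature representation.
    """
    q = latent_tokens @ w_q                          # (num_latent, d_attn)
    k = cond_tokens @ w_k                            # (num_cond, d_attn)
    v = cond_tokens @ w_v                            # (num_cond, d_latent)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over condition tokens
    return weights @ v, weights

rng = np.random.default_rng(0)
latent = rng.standard_normal((6 * 4 * 4, 8))         # T1 * H * W latent tokens
cond = rng.standard_normal((79, 64))                 # global condition tokens
w_q = rng.standard_normal((8, 16))
w_k = rng.standard_normal((64, 16))
w_v = rng.standard_normal((64, 8))
attended, attn_weights = cross_attention(latent, cond, w_q, w_k, w_v)
conditioned_latent = latent + attended               # residual connection (assumption)
```

  • Every latent token attends to the same set of condition tokens, which is why this mechanism suits conditions that apply to the whole video rather than to individual frames.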
  • the denoised latent vector sequence obtained in the last time step of the diffusion model is input into the decoder in step 208, and the decoder uses the denoised latent vector sequence for decoding processing to obtain a video.
  • the decoder can use the decoder of the existing video generation model, which will not be described in detail here.
  • various conditions can be flexibly combined to generate more diverse videos.
  • Example 1 The text condition in the global condition and the time condition in the local condition can be combined.
  • as shown in Figure 7a, the user inputs the text condition "a rotating perspective of a long-haired woman standing in the forest" and a semantic sketch sequence.
  • the sequence length is 6 frames.
  • the final generated video is shown in Figure 7a, which shows the frames of the video.
  • Example 2 The spatial condition and the temporal condition in the local condition can be combined.
  • Example 3 The style condition in the global condition and the depth map sequence and semantic sketch sequence in the local condition can be combined.
  • the user inputs a style image, a depth map sequence, and a semantic sketch sequence, and the resulting video is shown in FIG7c.
  • Example 4 The text condition, the style condition in the global condition, and a semantic sketch and a motion vector sequence in the local condition can be combined.
  • the user inputs the text condition "a moving golden moon", a style image, and a semantic sketch, and the user can hand-draw the direction of the movement of the moon on the semantic sketch.
  • the video generation device on the server side can automatically generate a motion vector sequence according to the direction of the movement of the moon hand-drawn by the user on the semantic sketch, thereby generating a video using the method provided by the embodiment of the present disclosure.
  • the moon in the generated video moves along the direction of the movement hand-drawn by the user.
  • FIG. 8 is a flowchart of the method for training a video generation model provided in an embodiment of the present disclosure. The method can be executed by the model training device in the system architecture shown in FIG. 1.
  • the method may include the following steps: Step 802: Obtain first training data including multiple first training samples, the first training samples include local condition samples, and the local condition samples include spatial condition samples and/or temporal condition samples. Furthermore, the first training samples may also include global condition samples, and the global condition samples may include at least one of text condition samples, style condition samples, and color condition samples.
  • a video sample may be first obtained, and a description text of the video sample may be obtained as a text condition sample.
  • a style image of the video sample may be obtained as a style condition sample, and a color histogram of the video sample may be obtained as a color condition sample.
  • an existing video description text generation model may be used to obtain the description text of the video sample, or the description text of the video sample may be obtained from the web page from which the video sample is derived, or the description text of the video sample may be manually added, and so on.
  • the first frame image of the video sample may be used as a style image.
  • the color histogram corresponding to the first frame image of the video sample may be used as a color condition sample.
  • at least one of a single image and a single semantic sketch is extracted from the video sample as a spatial condition sample.
  • the first frame image of the video sample can be used as the above-mentioned single image.
  • the edge extraction can be performed on the first frame image in the video sample, and a single semantic sketch can be formed using the edge information extracted from the image.
  • a single semantic sketch can be generated for the first frame image in the video sample using a hand-drawn drawing generation tool.
  • at least one of a motion vector sequence, a depth map sequence, a mask map sequence, a semantic sketch sequence, and a grayscale map sequence is extracted from the video sample as a temporal condition sample.
  • the motion vectors between adjacent frames can be extracted from the video sample to form a motion vector sequence.
  • the depth map of each frame image is extracted from the video sample to form a depth map sequence.
  • Masking is performed on each frame in the video sample to obtain a mask map sequence.
  • a semantic sketch is generated for each frame image in the video sample to obtain a semantic sketch sequence.
  • a grayscale map is generated for each frame image in the video sample to obtain a grayscale map sequence.
  • other methods can also be used to obtain the above-mentioned condition samples, which are not listed here one by one.
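  • A few of the extraction steps above can be sketched with basic array operations. The histogram bin count, the luma weights, and the gradient-threshold edge detector used for the semantic sketch are illustrative assumptions; the disclosure leaves the concrete extraction tools open, and motion vectors and depth maps would need dedicated estimators that are not shown.

```python
import numpy as np

def color_histogram(frame, bins=8):
    """Per-channel histogram of an (H, W, 3) uint8 frame, normalised to sum to 1."""
    hist = [np.histogram(frame[..., c], bins=bins, range=(0, 256))[0] for c in range(3)]
    hist = np.concatenate(hist).astype(np.float64)
    return hist / hist.sum()

def to_grayscale(frames):
    """Weighted channel sum with ITU-R BT.601 luma weights; frames has shape (..., 3)."""
    return frames.astype(np.float64) @ np.array([0.299, 0.587, 0.114])

def edge_sketch(gray, threshold=30.0):
    """Binary edge map from the gradient magnitude of an (H, W) grayscale frame."""
    gy, gx = np.gradient(gray)
    return (np.hypot(gx, gy) > threshold).astype(np.uint8)

# Toy video sample: 6 frames containing a bright square on a dark background.
video_sample = np.zeros((6, 32, 32, 3), dtype=np.uint8)
video_sample[:, 8:24, 8:24] = 200

first_frame = video_sample[0]
color_sample = color_histogram(first_frame)           # color condition sample
gray_seq = to_grayscale(video_sample)                 # grayscale map sequence
single_sketch = edge_sketch(gray_seq[0])              # single semantic sketch
sketch_seq = np.stack([edge_sketch(g) for g in gray_seq])  # semantic sketch sequence
```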
  • the limitations such as “first” and “second” involved in the embodiments of the present disclosure do not have restrictions on size, order, quantity, etc., and are used to distinguish them in name. For example, “first training sample” and “second training sample” are used to distinguish two training samples in name.
  • Step 804 Use the first training data to train a video generation model, and the video generation model includes: a spatiotemporal conditional encoder, a diffusion model, and a decoder; wherein the spatiotemporal conditional encoder encodes the local conditional sample to obtain a local conditional feature representation; the diffusion model denoises the noise latent vector sequence to obtain a denoised latent vector sequence, and the noise latent vector sequence is obtained by integrating the noise sequence with the local conditional feature representation; the decoder uses the denoised latent vector sequence to perform decoding processing to obtain a video; the training objectives include: minimizing the difference between the noise predicted by the diffusion model when performing denoising processing and the Gaussian noise. Furthermore, the video generation model may further include a global encoder.
  • the global encoder encodes the global condition sample to obtain a global condition feature representation.
  • the diffusion model may use the global condition feature representation to perform cross-attention processing on the noise latent vector sequence to predict the noise, and use the predicted noise to perform denoising processing.
  • the global encoder may use a text encoder to perform text encoding on the text condition sample to obtain a text feature representation, and use the text feature representation as the global condition feature representation.
  • the global encoder may use a text encoder to perform text encoding on the text condition sample to obtain a text feature representation, and use an image encoder to perform image encoding on the style condition sample and/or the color condition sample to obtain an image feature representation; and then integrate the text feature representation and the image feature representation to obtain a global condition feature representation.
  • each spatiotemporal condition encoder may be used to encode each local condition sample respectively to obtain a feature tensor corresponding to each local condition sample, wherein the spatiotemporal condition encoders correspond to the local conditions one to one; then the feature tensors corresponding to the local condition samples are fused to obtain the local condition feature representation. The spatiotemporal condition encoder may be used to perform spatial feature encoding on the local condition sample to obtain the spatial feature representation of the local condition sample.
  • if the local condition sample is a sequence, the spatial feature representation of the local condition sample is subjected to a time domain self-attention process to obtain the feature tensor corresponding to the local condition sample; otherwise, the spatial feature representation of the local condition sample is copied in the time sequence to generate a spatial feature representation in the time sequence, and the spatial feature representation in the time sequence is subjected to a time domain self-attention process to obtain the feature tensor corresponding to the local condition sample.
  • the diffusion model can use the global condition feature representation to perform cross-attention processing on the noise latent vector sequence at the first time step to predict the noise of the current time step, and use the predicted noise to perform denoising processing on the noise latent vector sequence to obtain the noise latent vector sequence at the current time step.
  • the diffusion model uses the global condition feature representation to perform cross-attention processing on the noise latent vector sequence obtained at the previous time step at other time steps to predict the noise of the current time step, and uses the predicted noise to perform denoising processing on the noise latent vector sequence obtained at the previous time step to obtain the noise latent vector sequence at the current time step.
  • the loss function can be constructed according to the above training objectives, and the value of the loss function is used in each round of iteration of the training video generation model to update the model parameters by a method such as gradient descent until the preset training end condition is met.
  • the training end condition may include, for example, the value of the loss function is less than or equal to the preset loss function threshold, the number of iterations reaches the preset number threshold, etc.
  • other loss functions used within the spirit and principle are also within the protection scope of the present disclosure.
  • in each round of iteration, multiple time steps are selected, and the differences between the noise predicted at those time steps and the Gaussian noise are used to obtain the loss function.
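  • The training objective of minimizing the difference between the predicted noise and the Gaussian noise can be sketched as follows. The sketch assumes the common DDPM forward-noising rule and a mean squared error; it is not a reproduction of the loss formula of the disclosure, and the predictor functions are hypothetical stand-ins for the diffusion model.

```python
import numpy as np

def q_sample(x0, t, alpha_bars, noise):
    """Add Gaussian noise to the clean latent x0 at time step t (1-indexed)."""
    a = alpha_bars[t - 1]
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * noise

def noise_prediction_loss(predict_noise, x0, cond, alpha_bars, timesteps, rng):
    """Mean squared error between predicted and true noise over the chosen time steps."""
    losses = []
    for t in timesteps:
        noise = rng.standard_normal(x0.shape)        # Gaussian noise target
        z_t = q_sample(x0, t, alpha_bars, noise)
        eps_hat = predict_noise(z_t, cond, t)
        losses.append(np.mean((eps_hat - noise) ** 2))
    return float(np.mean(losses))

num_steps = 100
alpha_bars = np.cumprod(1.0 - np.linspace(1e-4, 0.02, num_steps))
rng = np.random.default_rng(0)
x0 = rng.standard_normal((6, 8, 8, 4))               # latent of a training video sample

def zero_predictor(z_t, cond, t):
    """Untrained baseline that always predicts zero noise."""
    return np.zeros_like(z_t)

def oracle_predictor(z_t, cond, t):
    """Recovers the exact noise using x0; only possible in this toy setting."""
    a = alpha_bars[t - 1]
    return (z_t - np.sqrt(a) * x0) / np.sqrt(1.0 - a)

steps = [5, 40, 90]                                  # time steps selected in this iteration
loss_zero = noise_prediction_loss(zero_predictor, x0, None, alpha_bars, steps, rng)
loss_oracle = noise_prediction_loss(oracle_predictor, x0, None, alpha_bars, steps, rng)
```

  • A predictor that recovers the noise exactly drives the loss to zero, while the zero predictor leaves it near the variance of the Gaussian noise, which is the gap that training is meant to close.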
  • the global encoder and decoder can use the parameters obtained by pre-training, and in each round of training iteration, the parameters of the spatiotemporal condition encoder and the diffusion model are updated using the loss function corresponding to the training target.
  • the pre-training of the global encoder can use other tasks, such as image generation tasks, text classification tasks, etc.
  • the pre-training of the decoder can also use other tasks, such as image generation tasks, image classification tasks, etc. These are currently available training tasks and will not be described in detail here.
  • the parameters of the diffusion model in the image generation model can be used for initialization. This method alleviates the training difficulty to a certain extent and speeds up training, but it remains difficult for the model to learn temporal features and to generate videos under multiple conditions.
  • the present disclosure provides a more preferred implementation, that is, a two-stage training strategy is adopted. Before the video generation model is trained using the first training data, the diffusion model is first pre-trained.
  • the pre-training is to use the process of generating videos based on text conditions to make the diffusion model learn. Then, based on the parameters of the diffusion model obtained by the pre-training, the video generation model is further trained using the first training data (i.e., including multiple conditions such as global conditions and local conditions).
  • the process of pre-training the diffusion model may include the following steps: first, second training data including multiple second training samples is obtained.
  • the second training sample includes a text sample, which is the description text extracted for a video sample.
  • the description text of the video sample may be obtained by using an existing video description text generation model, or the description text of the video sample may be obtained from the webpage from which the video sample is obtained, or the description text of the video sample may be manually added, and so on.
  • the text sample is encoded using a global encoder to obtain a text feature representation.
  • the obtained text feature representation and noise sequence are input into the diffusion model to train the diffusion model, where the diffusion model uses the text feature representation to denoise the noise sequence;
  • the training objectives include: minimizing the difference between the noise predicted by the diffusion model and the Gaussian noise when performing denoising at each time step.
  • the loss function used in this part of the pre-training is the same as the above formula (1), but the included condition C is different.
  • the condition C involved in the pre-training process includes the text condition part. That is, the process of generating a video when the input condition is a text condition is used to train the diffusion model.
  • FIG. 9 shows a schematic block diagram of a video generation device according to an embodiment, which corresponds to the model training device in the system shown in Figure 1.
  • the device 900 includes: a condition acquisition unit 901 and a video generation unit 902.
  • the main functions of each component unit are as follows: the condition acquisition unit 901 is configured to acquire local conditions, which include spatial conditions and/or temporal conditions; the video generation unit 902 is configured to encode the local conditions to obtain local condition feature representation; integrate the noise sequence with the local condition feature representation to obtain a noise latent vector sequence; perform denoising on the noise latent vector sequence using a diffusion model to obtain a denoised latent vector sequence; perform decoding on the denoised latent vector sequence to obtain a video.
  • the condition acquisition unit 901 can also be configured to acquire global conditions, which include at least one of text conditions, style conditions and color conditions.
  • the video generation unit 902 can also be configured to encode the global conditions to obtain a global condition feature representation.
  • the diffusion model can perform cross-attention processing on the noise latent vector sequence using the global condition feature representation to predict noise, and perform denoising using the predicted noise.
  • the above-mentioned spatial condition may include at least one of a single image and a single semantic sketch.
  • the temporal condition includes at least one of a motion vector sequence, a depth map sequence, a mask map sequence, a semantic sketch sequence, and a grayscale map sequence.
  • the video generation unit 902 can be implemented by the video generation model shown in Figure 3.
  • the global encoder is used to encode the global condition to obtain a global condition feature representation.
  • the spatiotemporal condition encoder is used to encode the local condition to obtain a local condition feature representation.
  • the diffusion model is used to denoise the noise latent vector sequence to obtain a denoised latent vector sequence.
  • the decoder uses the denoised latent vector sequence for decoding to obtain a video.
  • the global encoder in the video generation model performs text encoding on the text condition to obtain a text feature representation, performs image encoding on the style condition and/or color condition to obtain an image feature representation, and integrates the text feature representation and the image feature representation to obtain the global condition feature representation.
  • each spatiotemporal condition encoder encodes each local condition respectively to obtain the feature tensor corresponding to each local condition, wherein the spatiotemporal condition encoder corresponds to the local condition one by one; fuse the feature tensors corresponding to each local condition to obtain the local condition feature representation.
  • the spatiotemporal condition encoder can perform spatial feature encoding on the local condition to obtain the spatial feature representation of the local condition; if the local condition is a sequence, perform temporal self-attention processing on the spatial feature representation of the local condition to obtain the feature tensor corresponding to the local condition; otherwise, copy the spatial feature representation of the local condition in time sequence to generate a spatial feature representation in time sequence, and perform temporal self-attention processing on the spatial feature representation in time sequence to obtain the feature tensor corresponding to the local condition.
  • a device for training a video generation model is provided.
  • FIG10 shows a schematic block diagram of a device for training a video generation model according to an embodiment.
  • the device 1000 includes: a sample acquisition unit 1001 and a model training unit 1002, and may further include a pre-training unit 1003.
  • the main functions of each component unit are as follows: the sample acquisition unit 1001 is configured to acquire first training data including a plurality of first training samples, the first training samples include local condition samples, and the local condition samples include spatial condition samples and/or temporal condition samples.
  • the model training unit 1002 is configured to train a video generation model using the first training data, and the video generation model includes: a spatiotemporal condition encoder, a diffusion model, and a decoder; wherein the spatiotemporal condition encoder performs encoding processing on the local condition samples to obtain a local condition feature representation; the diffusion model performs denoising processing on the noise latent vector sequence to obtain a denoised latent vector sequence, and the noise latent vector sequence is obtained by integrating the noise sequence with the local condition feature representation; the decoder performs decoding processing using the denoised latent vector sequence to obtain a video; the training objectives include: minimizing the difference between the noise predicted by the diffusion model when performing denoising processing and the Gaussian noise.
  • the first training sample also includes a global condition sample
  • the global condition sample includes at least one of a text condition sample, a style condition sample, and a color condition sample.
  • the video generation model also includes a global encoder, and the global encoder encodes the global condition sample to obtain a global condition feature representation.
  • the diffusion model uses the global condition feature representation to perform cross-attention processing on the noise latent vector sequence to predict the noise, and uses the predicted noise to perform denoising processing.
  • the global encoder and decoder use the parameters obtained by pre-training, and in each round of training iteration, the parameters of the spatiotemporal condition encoder and the diffusion model are updated using the loss function corresponding to the training target.
  • the sample acquisition unit 1001 can acquire a video sample; acquire the description text, style image and/or color histogram of the video sample as a text condition sample, a style condition sample and/or a color condition sample respectively; extract at least one of a single image and a single semantic sketch as a spatial condition sample; and extract at least one of a motion vector sequence, a depth map sequence, a mask map sequence, a semantic sketch sequence and a grayscale map sequence as a temporal condition sample.
  • the pre-training unit 1003 can be configured to pre-train the diffusion model in the following manner: obtain second training data including a plurality of second training samples, the second training samples including: extracting description text from a video sample as a text sample; encoding the text sample using a global encoder to obtain a text feature representation; inputting the text feature representation and the noise sequence into the diffusion model to train the diffusion model, and the diffusion model uses the text feature representation to denoise the noise sequence; the training objectives include: minimizing the difference between the noise predicted by the diffusion model and the Gaussian noise when the denoising process is performed at each time step.
  • the model training unit 1002 further uses the first training data to train the video generation model based on the parameters of the diffusion model pre-trained by the pre-training unit 1003.
  • each local condition sample can be encoded by its own spatiotemporal condition encoder to obtain the feature tensor corresponding to that local condition sample, the spatiotemporal condition encoders corresponding one-to-one to the local conditions; the feature tensors corresponding to the local condition samples are then fused to obtain the local condition feature representation.
  • the spatiotemporal condition encoder can perform spatial feature encoding on the local condition sample to obtain the spatial feature representation of the local condition sample; if the local condition sample is a sequence, the spatial feature representation of the local condition sample is subjected to time domain self-attention processing to obtain the feature tensor corresponding to the local condition sample; otherwise, the spatial feature representation of the local condition sample is copied in time sequence to generate the spatial feature representation in time sequence, and the spatial feature representation in time sequence is subjected to time domain self-attention processing to obtain the feature tensor corresponding to the local condition sample.
  • because the device embodiments are basically similar to the method embodiments, they are described relatively briefly; for related details, refer to the corresponding description of the method embodiments.
  • the device embodiments described above are schematic: units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units, that is, they may be located in one place or distributed over multiple network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the solution of this embodiment. A person of ordinary skill in the art can understand and implement it without creative effort.
  • the user information (including but not limited to user device information, user personal information, etc.) and data (including but not limited to data for analysis, stored data, displayed data, etc.) involved in the present disclosure are information and data authorized by the user or fully authorized by all parties; the collection, use and processing of the relevant data must comply with the relevant laws, regulations and standards of the relevant countries and regions, and corresponding operation entrances are provided for users to grant or refuse authorization.
  • the embodiment of the present disclosure also provides a computer-readable storage medium, on which a computer program is stored, and the program is executed by a processor to implement the steps of any of the methods in the aforementioned method embodiments.
  • an electronic device including: one or more processors; and a memory associated with the one or more processors, the memory is configured to store program instructions, and the program instructions are read and executed by the one or more processors to execute the steps of any of the methods in the aforementioned method embodiments.
  • the present disclosure also provides a computer program product, including a computer program, which implements the steps of any of the methods in the aforementioned method embodiments when executed by a processor.
  • Figure 11 exemplarily shows the architecture of an electronic device, which may specifically include a processor 1110, a video display adapter 1111, a disk drive 1112, an input/output interface 1113, a network interface 1114, and a memory 1120.
  • the processor 1110, the video display adapter 1111, the disk drive 1112, the input/output interface 1113, the network interface 1114, and the memory 1120 can be communicatively connected via a communication bus 1130.
  • the processor 1110 can be implemented in the form of a general-purpose CPU, a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits, and is configured to execute relevant programs to implement the technical solutions provided by the present disclosure.
  • the memory 1120 can be implemented in the form of a ROM (Read Only Memory), a RAM (Random Access Memory), a static storage device, a dynamic storage device, etc.
  • the memory 1120 may store an operating system 1121 configured to control the operation of the electronic device 1100, and a basic input/output system (BIOS) 1122 configured to control the low-level operation of the electronic device 1100.
  • the memory 1120 may also store a web browser 1123, a data storage management system 1124, a video generation device/model training device 1125, and so on.
  • the above-mentioned video generation device/model training device 1125 may be an application program for specifically implementing the operations of the aforementioned steps in the embodiment of the present disclosure.
  • the relevant program code is stored in the memory 1120 and is called and executed by the processor 1110.
  • the input/output interface 1113 is configured to connect an input/output module to implement information input and output.
  • the input/output module may be configured in the device as a component (not shown in the figure), or may be externally connected to the device to provide corresponding functions.
  • the input device may include a keyboard, a mouse, a touch screen, a microphone, various sensors, etc., and the output device may include a display, a speaker, a vibrator, an indicator light, etc.
  • the network interface 1114 is configured to connect to a communication module (not shown) to implement communication interaction between the device and other devices.
  • the communication module can communicate in a wired manner (such as USB, network cable, etc.) or in a wireless manner (such as mobile network, Wi-Fi, Bluetooth, etc.).
  • the bus 1130 includes a path for transmitting information between various components of the device (e.g., the processor 1110, the video display adapter 1111, the disk drive 1112, the input/output interface 1113, the network interface 1114, and the memory 1120).
  • the device may also include other components necessary for normal operation.
  • the above device may also include only the components necessary for implementing the solution of the present disclosure, and need not include all the components shown in the figure.
  • the present disclosure can be implemented by means of software plus a necessary general hardware platform.
  • the technical solution of the present disclosure can be embodied in the form of a computer program product, which can be stored in a storage medium, such as ROM/RAM, a disk, an optical disk, etc., and includes several instructions for enabling a computer device (which can be a personal computer, a server, or a network device, etc.) to execute the methods of various embodiments of the present disclosure or certain parts of the embodiments.
  • the embodiment of the present disclosure provides a video generation method, including: obtaining local conditions, the local conditions include spatial conditions and/or temporal conditions, encoding the local conditions to obtain local condition feature representation, integrating the noise sequence with the local condition feature representation to obtain a noise latent vector sequence, denoising the noise latent vector sequence using a diffusion model to obtain a denoised latent vector sequence, and decoding the denoised latent vector sequence to obtain a video.
  • the present disclosure incorporates spatial conditions and/or temporal conditions as local conditions, integrates the local condition feature representation with the noise sequence to obtain a noise latent vector sequence, denoises the noise latent vector sequence to obtain a denoised latent vector sequence, and then decodes to obtain a video. This method makes video generation no longer limited to text conditions, but introduces spatial conditions and/or temporal conditions to guide video generation, thereby generating videos more flexibly and diversely, and improving the quality of videos.
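The method summarized in the last two items of the list above can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: each frame is reduced to a short channel vector, and `toy_noise_predictor` and `toy_decode` are hypothetical stand-ins for the diffusion backbone and the decoder. The only steps taken from the text are the channel-wise integration of noise with the local condition features, the step-by-step denoising, and the final decoding.

```python
import random

def sample_noise(num_frames, channels, rng):
    """Gaussian noise sequence: one channel vector per frame."""
    return [[rng.gauss(0.0, 1.0) for _ in range(channels)] for _ in range(num_frames)]

def integrate(noise, local_feat):
    """Concatenate noise and local-condition features along the channel axis."""
    if len(noise) != len(local_feat):
        raise ValueError("noise and local features must cover the same frames")
    return [n + f for n, f in zip(noise, local_feat)]

def toy_noise_predictor(z, t, channels):
    """Hypothetical stand-in for the diffusion backbone: treats the gap between
    the latent channels and the condition channels, divided by t, as noise."""
    return [[(frame[i] - frame[channels + i]) / t for i in range(channels)]
            for frame in z]

def toy_decode(latents):
    """Hypothetical decoder: a fixed scaling back to 'pixel' values."""
    return [[2.0 * v for v in frame] for frame in latents]

def generate(local_feat, steps, rng):
    channels = len(local_feat[0])
    z = integrate(sample_noise(len(local_feat), channels, rng), local_feat)
    for t in range(steps, 0, -1):
        eps_hat = toy_noise_predictor(z, t, channels)
        # Remove the predicted noise from the latent channels only; the
        # condition channels are carried along unchanged (an assumption).
        z = [[frame[i] - eps[i] for i in range(channels)] + frame[channels:]
             for frame, eps in zip(z, eps_hat)]
    return toy_decode([frame[:channels] for frame in z])
```

With this toy predictor the latents are pulled onto the condition features by the last step, which only serves to make the effect of the conditioning visible; a trained model would behave differently.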

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)
  • Image Processing (AREA)

Abstract

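Embodiments of the present disclosure disclose a video generation method and a method and apparatus for training a video generation model. The main technical solution includes: obtaining local conditions, the local conditions including spatial conditions and/or temporal conditions; encoding the local conditions to obtain a local condition feature representation; integrating a noise sequence with the local condition feature representation to obtain a noise latent vector sequence; denoising the noise latent vector sequence using a diffusion model to obtain a denoised latent vector sequence; and decoding the denoised latent vector sequence to obtain a video.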

Description

视频生成方法、 训练祝频生成模型的方法及装置 交叉援引 本公开要求于 2023年 05月 29 日提交中国专利局、 优先权号为 202310618707.8、 发明名称为 "视频生成方法、 训练视频生成模型的方法及装置” 的中国专利公开的优 先权, 其全部内容通过引用结合在本公开中。 技术领域 本公开涉及计算机视觉技术领域, 特别是涉及一种视频生成方法、 训练视频生成 模型的方法及装置。 背景技术 人工智能技术的不断发展给计算机视觉技术提供了新的契机, 其中自动生成视频 的能力越来越强大。 基于文本的视频生成技术为很多内容创作者提供了全新的工具, 让原本需要专业人员和昂贵设备制作的视频内容创作变得更加容易和低成本。 然而, 目前 ■大多数技术以文本作为引导条件, 生成的视频缺乏多样性和灵活性, 导致视频的 质量较低。 发明内容 有鉴于此, 本公开提供了一种视频生成方法、 训练视频生成模型的方法及装置, 以便于提高生成的视频质量。 本公开提供了如下方案: 第一方面, 提供了一种视频生成方法, 所述方法包括: 获取局部条件, 所述局部条件包括空间条件和 /或时间条件; 对所述局部条件进行编码处理得到局部条件特征表示; 将噪声序列与所述局部条件特征表示进行整合, 得到噪声隐向量序列; 利用扩散模型对所述噪声隐向量序列进行去噪声处理, 得到去噪后的隐向量序列 ; 利用所述去噪后的隐向量序列进行解码处理, 得到视频。 根据本公开实施例中一可实现的方式, 所述方法还包括: 获取全局条件, 所述全 局条件包括文本条件、 风格条件和颜色条件中的至少一种; 对所述全局条件进行编码 处理得到全局条件特征表示; 所述利用扩散模型对所述噪声隐向量序列进行去噪声处理包括: 所述扩散模型利 用所述全局条件特征表示对所述噪声隐向量序列进行交叉注意力处理以预测噪声, 并 利用预测的噪声进行去噪声处理。 根据本公开实施例中一可实现的方式, 所述空间条件包括单幅图像和单幅语义草 图中的至少一种; 所述时间条件包括运动矢量序列、 深度图序列、 掩膜图序列、 语义草图序列和灰 度图序列中的至少一种。 根据本公开实施例中一可实现的方式, 对所述全局条件进行编码处理得到全局条 件特征表示包括: 对所述文本条件进行文本编码得到文本特征表示, 以及对所述风格 条件和 /或颜色条件进行图像编码得到图像特征表示;对所述文本特征表示和所述图像 特征表示进行整合, 得到全局条件特征表示。 根据本公开实施例中一可实现的方式, 对所述局部条件进行编码处理得到局部条 件特征表示包括: 利用各时空条件编码器分别对各局部条件进行编码, 得到各局部条件对应的特征 张量, 其中时空条件编码器与局部条件一一对应; 将各局部条件对应的特征张量进行融合处理, 得到所述局部条件特征表示。 根据本公开实施例中一可实现的方式, 所述利用时空条件编码器分别对各局部条 件进行编码包括: 利用时空条件编码器对局部条件进行空间特征编码, 得到该局部条件的空间特征 表示; 若该局部条件为序列, 则对该局部条件的空间特征表示进行时域自注意力处理, 得到该局部条件对应的特征张量; 否则, 对该局部条件的空间特征表示在时序上复制 产生时序上的空间特征表示, 对所述时序上的空间特征表示进行时域自注意力处理, 得到该局部条件对应的特征张量。 第二方面, 提供了一种训练视频生成模型的方法, 所述方法包括: 获取包括多个第一训练样本的第一训练数据, 所述第一训练样本包括局部条件样 本, 所述局部条件样本包括空间条件样本和 /或时间条件样本; 利用所述第一训练数据训练视频生成模型, 所述视频生成模型包括: 时空条件编 码器、 扩散模型和解码器; 其中, 时空条件编码器对所述局部条件样本进行编码处理 得到局部条件特征表示; 扩散模型对噪声隐向量序列进行去噪声处理, 得到去噪后的 隐向量序列, 所述噪声隐向量序列是将噪声序列与所述局部条件特征表示进行整合后 得到的; 所述解码器利用所述去噪后的隐向量序列进行解码处理, 得到视频; 所述训练的目标包括: 最小化扩散模型进行去噪声处理时预测的噪声与高斯噪声 之间的差异。 根据本公开实施例中一可实现的方式, 所述第一训练样本还包括全局条件样本, 所述全局条件样本包括文本条件样本、 风格条件样本和颜色条件样本中的至少一种; 所述视频生成模型还包括全局编码器, 所述全局编码器对所述全局条件样本进行 编码处理得到全局条件特征表示; 所述扩散模型对噪声隐向量序列进行去噪声处理包括: 所述扩散模型利用所述全 局条件特征表示对所述噪声隐向量序列进行交叉注意力处理以预测噪声, 并利用预测 的噪声进行去噪声处理。 根据本公开实施例中一可实现的方式, 所述全局编码器和所述解码器采用预训练 得到的参数, 
在所述训练的每一轮迭代中, 利用与所述训练的目标对应的损失函数更 新所述时空条件编码器和扩散模型的参数。 根据本公开实施例中一可实现的方式, 所述获取包括多个第一训练样本的第一训 练数据包括: 获取视频样本; 获取所述视频样本的描述文本、风格图像和 /或颜色直方图分别作为所述文本条件 样本、 风格条件样本和 /或颜色条件样本; 提取单幅图像和单幅语义草图中的至少一种作为所述空间条件样本; 提取运动矢量序列、 深度图序列、 掩膜图序列、 语义草图序列和/或灰度图序列中 的至少一种作为所述时间条件样本。 根据本公开实施例中一可实现的方式, 在利用所述第一训练数据训练视频生成模 型之前, 还包括预训练所述扩散模型; 在预训练得到的扩散模型的参数基础上进一步 利用所述第一训练数据训练视频生成模型; 其中, 预训练所述扩散模型包括: 获取包括多个第二训练样本的第二训练数据, 所述第二训练样本包括: 从视频样 本中提取描述文本作为文本样本; 利用全局编码器对所述文本样本进行编码处理得到文本特征表示; 将所述文本特征表示和噪声序列输入扩散模型以对所述扩散模型进行训练, 所述 扩散模型利用所述文本特征表示对所述噪声序列进行去噪声处理; 所述训练的目标包 括: 最小化扩散模型在各时间步进行去噪声处理时预测的噪声与高斯噪声之间的差异。 根据本公开实施例中一可实现的方式, 时空条件编码器对所述局部条件样本进行 编码处理得到局部条件特征表示包括: 利用各时空条件编码器分别对各局部条件样本进行编码, 得到各局部条件样本对 应的特征张量, 其中时空条件编码器与局部条件 对应; 将各局部条件样本对应的特征张量进行融合处理, 得到所述局部条件特征表示。 根据本公开实施例中一可实现的方式, 所述利用各时空条件编码器分别对各局部 条件样本进行编码包括: 利用时空条件编码器对局部条件样本进行空间特征编码, 得到该局部条件样本的 空间特征表示; 若该局部条件样本为序列, 则对该局部条件样本的空间特征表示进行时域自注意 力处理, 得到该局部条件样本对应的特征张量; 否则, 对该局部条件样本的空间特征 表示在时序上复制产生时序上的空间特征表示, 对所述时序上的空间特征表示进行时 域自注意力处理, 得到该局部条件样本对应的特征张量。 第三方面, 提供了一种视频生成方法, 由云端服务器执行, 所述方法包括: 获取来自用户终端的局部条件, 所述局部条件包括空间条件和 /或时间条件; 对所述局部条件进行编码处理得到局部条件特征表示; 将噪声序列与所述局部条件特征表示进行整合, 得到噪声隐向量序列; 利用扩散模型对所述噪声隐向量序列进行去噪声处理, 得到去噪后的隐向量序列; 利用所述去噪后的隐向量序列进行解码处理, 得到视频; 将所述视频发送给所述用户终端以进行展示。 第四方面, 提供了一种视频生成装置, 所述装置包括: 条件获取单元, 被配置为获取局部条件, 所述局部条件包括空间条件和/或时间条 件; 视频生成单元, 被配置为对所述局部条件进行编码处理得到局部条件特征表示; 将噪声序列与所述局部条件特征表示进行整合, 得到噪声隐向量序列; 利用扩散模型 对所述噪声隐向量序列进行去噪声处理, 得到去噪后的隐向量序列; 利用所述.去噪后 的隐向量序列进行解码处理, 得到视频。 第五方面, 提供了一种训练视频生成模型的装置, 所述装置包括: 样本获取单元, 被配置为获取包括多个第一训练样本的第一训练数据, 所述第一 训练样本包括局部条件样本,所述局部条件样本包括空间条件样本和 /或时间条件样本; 模型训练单元, 被配置为利用所述第一训练数据训练视频生成模型, 所述视频生 成模型包括: 时空条件编码器、 扩散模型和解码器; 其中, 时空条件编码器对所述局 部条件样本进行编码处理得到局部条件特征表示; 扩散模型对噪声隐向量序列进行去 噪声处理, 得到去噪后的隐向量序列, 所述噪声隐向量序列是将噪声序列与所述局部 条件特征表示进行整合后得到的; 所述解码器利用所述去噪后的隐向量序列进行解码 处理, 得到视频; 所述训练的目标包括: 最小化扩散模型进行去噪声处理时预测的噪声与高斯噪声 之间的差异。 根据第六方面, 提供了一种计算机可读存储介质, 其上存储有计算机程序, 该程 序被处理器执行时实现上述第一方面至第三方面中任一项所述的方法的步骤。 根据第七方面, 提供了一种电子设备, 包括: 一个或多个处理器; 以及 与所述一个或多个处理器关联的存储器,所述存储器被设置为存储程序指令,所述 程序指令在被所述一个或多个处理器读取执行时, 执行上述第一方面至第三方面中任 一项所述的方法的步骤。 根据本公开提供的具体实施例, 本公开公开了以下技术效果:
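1) The present disclosure incorporates spatial conditions and/or temporal conditions as local conditions, integrates the local condition feature representation with the noise sequence to obtain a noise latent vector sequence, denoises the noise latent vector sequence to obtain a denoised latent vector sequence, and then decodes it to obtain a video. Video generation is thus no longer limited to text conditions; spatial and/or temporal conditions are introduced to guide generation, so that videos are generated more flexibly and diversely and video quality is improved.
2) The present disclosure can generate the required video by combining multiple global conditions with multiple local conditions, i.e. spatial conditions and temporal conditions. The spatiotemporal condition encoders encode, align and fuse the multiple local conditions, and the result is integrated with the noise sequence to achieve local control of the video; the global encoder encodes and integrates the global conditions, and the diffusion model uses the global condition feature representation to apply cross-attention to the noise latent vector sequence, achieving global control of the video. This greatly improves the controllability of video generation and makes the generated videos more flexible and diverse.
3) The present disclosure innovatively introduces a motion vector sequence among the temporal conditions. By incorporating the control signal of the motion vector sequence, the video generation model can capture inter-frame dynamics and thereby control motion within the video.
4) The spatiotemporal condition encoder provided by the present disclosure first extracts local spatial information and then performs temporal modeling, which promotes explicit embedding of temporal information. It also provides a unified interface for different local conditions and enhances inter-frame consistency. For non-sequential spatial conditions such as a single image or a single semantic sketch, replication along the time dimension ensures consistency with the temporal conditions so that the local conditions can be fused. This encoding and fusion gives the subsequently generated video better controllability in both temporal and spatial perception.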
5) 本公开在训练视频生成装置之前, 可以先利用文本生成视频的训练数据对扩 散模型进行预训练, 在预训练得到的扩散模型的参数基础上进一步训练视频生成模型, 从而提高视频生成模型的效果和效率。 当然, 实施本公开的任一产品并不一定需要同时达到以上所述的所有优点。 附图说明 为了更清楚地说明本公开实施例或现有技术中的技术方案, 下面将对实施例中所 需要使用的附图作简单地介绍, 显而易见地, 下面描述中的附图是本公开的一些实施 例, 对于本领域普通技术人员来讲, 在不付出创造性劳动的前提下, 还可以根据这些 附图获得其他的附图。 图 1为是本公开实施例所适用的系统架构图; 图 2为本公开实施例提供的视频生成方法流程图; 图 3为本公开实施例提供的生成视频所需要条件的示意图; 图 4为本公开实施例提供的视频生成模型的原理性示意图; 图 5为本公开实施例提供的时空条件编码器的结构示意图; 图 6为本公开实施例提供的扩散模型在各时间步的原理性示意图; 图 72~图 7d为本公开实施例提供的四个生成视频的实例图; 图 8为本公开实施例提供的训练视频生成模型的方法流程图; 图 9为本公开实施例提供的视频生成装置的示意性框图; 图 10为本公开实施例提供的训练视频生成模型的装置的示意性框图; 图 11为本公开实施例提供的电子设备的示意性框图。 具体实施方式 下面将结合本公开实施例中的附图, 对本公开实施例中的技术方案进行清楚、 完 整地描述, 显然, 所描述的实施例是本公开一部分实施例, 而不是全部的实施例。 基 于本公开中的实施例, 本领域普通技术人员所获得的所有其他实施例, 都属于本公开 保护的范围。 在本发明实施例中使用的术语是出于描述特定实施例的目的, 而非旨在限制本发 明。 在本发明实施例和所附权利要求书中所使用的单数形式的 “一种 " 、 “所述” 和 “该" 也旨在包括多数形式, 除非上下文清楚地表示其他含义。 应当理解, 本文中使用的术语 “和 /或”是一种描述关联对象的关联关系, 表示可 以存在三种关系, 例如, A和 /或 B , 可以表示: 单独存在 A, 同时存在 A和 B , 单独 存在 B 这三种情况。 另外, 本文中字符 " /” , 一般表示前后关联对象是一种 “或” 的关系。 取决于语境,如在此所使用的词语 “如果 ”可以被解释成为 “在 …时 ”或 “当 … 时” 或 “响应于确定” 或 “响应于检测” 。 类似地, 取决于语境, 短语 “如果确定” 或 “如果检测 (陈述的条件或事件) ”可以被解释成为 “当确定时 ”或"响应于确定” 或 “当检测 (陈述的条件或事件) 时” 或 “响应于检测 (陈述的条件或事件) ” 。 为了方便对本公开的理解, 首先对本公开所适用的系统架构进行简单描述。 图 1 示出了可以应用本公开实施例的示例性系统架构, 如图 1 中所示, 该系统架构包括用 户终端, 以及位于服务端的模型训练装置和视频生成装置。 其中, 服务端的模型训练装置采用本公开实施例提供的方法, 在离线阶段进行模 型训练, 得到视频生成模型。 用户终端可以通过网络与服务端的视频生成装置进行信息交互。 例如, 用户终端 可以将用户输入的生成视频所使用的条件信息通过网络发送给视频生成装置, 其中条 件信息可以包括全局条件和局部条件, 具体内容将在后续实施例中详述。 视频生成装置可以利用已经训练得到的视频生成模型, 在全局条件和局部条件的 限制下生成视频, 并将生成的视频通过网络发送给用户终端, 以供用户终端展示该视 频。 上述用户终端可以包括但不限于诸如: 智能移动终端、 智能家居设备、 可穿戴式 设备、 智能医疗设备、 PC ( Personal Computer , 个人计算机) 等。 其中智能,移动设备 可以包括诸如手机、 平板电脑、 笔记本电脑、 PDA (Personal Digital Assistant, 个人 数字助理) 、 互联网汽车等。 智能家居设备可以包括智能家电设备, 诸如智能电视、 智能冰箱等等具备视频播放功能的家电设备。 可穿戴式设备可以包括诸如智能手表、 智能眼镜、智能手环、 VR(Virtual Reality , 虚拟现实)设备、 AR(Augmented Reality, 增强现实设备) 、 混合现实设备 (即可以支持虚拟现实和增强现实的设备) 等等。 模型训练装置和视频生成装置可以分别设置为独立的服务器, 也可以设置于同一 个服务器或服务器群组, 还可以设置于独立的或者同一云服务器。 云服务器又称为云 计算服务器或云主机, 是云计算服务体系中的一项主机产品, 以解决传统物理主机与 虚 ■拟专用服务器 (VPS , Virtual Private Server) 
服务中存在的管理难度大, 服务扩展 性弱的缺陷。 模型训练装置和视频生成装置还可以设置于具有较强计算能力的计算机 终端。 需要说明的是, 上述视频生成装置除了在线上进行视频生成之外, 也可以采用离 线的方式进行视频生成。 应该理解, 图 1 中的用户终端、 模型训练装置、 视频生成装置以及视频生成模型 的数目是示意性的。 根据实现需要, 可以具有任意数目的用户终端、 模型训练装置、 视频生成装置以及视频生成模型。 图 2为本公开实施例提供的视频生成方法流程图, 该方法可以由图 1所示系统架 构中的视频生成装置执行。 如图 2中所示, 该方法可以包括以下步骤: 步骤 202: 获取局部条件, 局部条件包括空间条件和/或时间条件。 步骤 204 : 对局部条件进行编码处理得到局部条件特征表示。 步骤 206: 将噪声序列与局部条件特征表示进行整合, 得到噪声隐向量序列; 利 用扩散模型对噪声隐向量序列进行去噪声处理, 得到去噪后的隐向量序列。 步骤 208: 利用去噪后的隐向量序列进行解码处理, 得到视频。 由上述流程可以看出, 本公开融入了空间条件和 /或时间条件作为局部条件, 利用 局部条件特征表示与噪声序列进行整合得到噪声隐向量序列, 并利用全局条件特征表 示对噪声隐向量序列进行去噪声处理,得到去噪后的隐向量序列,进而解码得到视频。 这种方式使得视频生成不再局限于在文本条件,而是引入空间条件和/或时间条件来引 导视频生成, 从而更加灵活、 多样化地生成视频, 提高视频的质量。 下面对上述流程中的各步骤分别进行详细描述。 首先结合实施例对上述步骤 202 即 "获取局部条件 " 进行详细描述。 现有的视频生成方法通常在文本条件的引导下, 生成视频, 灵活性和多样性差。 本公开实施例中通过引入新的组合性条件来提高视频生成的可控性。 从引入方式上来 看, 至少包括局部条件, 还可以进一步包括全局条件。 其中, 全局条件主要包括文本条件、 风格条件和 /或颜色条件。 文本条件主要是对 视频内容的描述, 从全局上体现出要生成视频的内容。 风格条件通常可以采用一幅风 格图像, 作为条件目的是生成具有与该风格图像所具有的风格特征一致的视频, 即视 频在整体上具有风格图像中所表达的风格。 颜色条件可以采用颜色直方图等形式, 作 为条件的目标是为了对生成的视频的颜色分布进行指导。 局部条件主要包括空间条件和 /或时间条件。 其中, 空间条件可以包括单幅图像、 单幅语义草图等中的至少一种。 其中单幅图像指的是单一图像, 用以限制视频中图像 帧的内容和结构。 语义草图也称为手绘草图、 素描图等等, 是一种用来描述图像中对 象基本语义的方式,是一种概括的表述手法,主要描绘出对象的边缘(或轮廓)特征, 能够忽略对象的细节和冗余特征, 保留主要信息。 可以看出, 空间条件主要是从空间 特征上对视频内容进行指导。 时间条件包括运动矢量序列、 深度图序列、 掩膜图序列和语义草图序列中的至少 一种。 由于视频是图像序列, 相应地, 时间条件是沿着时间维度对视频进行的更精细 的引导。 运动矢量序列包含各相邻帧之间的运动矢量, 运动矢量指示各 Token (元素) 在 相邻帧之间的运动方向, 包括水平方向和垂直方向, 因此运动矢量可以表示为二维矢 量。 它显式地表达了相邻两个图像帧之间各 Token的移动, 用以使得视频生成模型生 成的视频具有运动可控性。 其中, 图像的各 Token指的是构成图像的元素。 对于图像 而言, 将图像切分成不重叠的区块, 则图像中的区块以及起始符等均为 Token。 深度图序列由各帧的深度图构成, 深度图包含图像帧中各 Token的深度信息, 用 以指导视频中各帧的深度信息。 掩膜图序列由各帧的掩膜图构成, 掩膜图是将图像帧中部分区域的内容进行掩膜 , 从而使得模型具备预测掩膜部分内容的能力。 其中掩膜的区域可以是用户指定的, 也 可以是随机确定的。 语义草图序列由各帧的语义草图构成, 与空间条件中的单幅语义草图相比, 草图 序列能够提供更多的控制细节。 灰度图序列由各帧的灰度图构成, 灰度图包含图像帧中各 Token的灰度信息, 用 以指导视频中各帧的灰度信息。 上述时间条件中各序列的长度是一致的,都等于生成视频的长度,例如表示为 T1。 例如, 文本条件为 “一个站在圣涎树旁边的孩子” , 风格条件为一幅风格图像, 用以限制生成的视频的整体风格, 即将风格图像的风格特征迁移至生成的视频, 颜色 条件为一幅颜色直方图, 如图 3中所示, 用以限制所生成视频的颜色分布。 空间条件 中的单幅图像就可以是包含一个孩子和圣诞树的图像, 用以限制生成的视频中孩子和 圣诞树的样子等,如图 
3中所示。语义草图可以是描述孩子、圣诞树的基本线条的图, 用以限制孩子和圣诞树的基本形状和位置等。 时间条件中的运动矢量序列包含相邻帧 之间的运动矢量, 图中示意性的以小箭头表示, 深度图序列包含各帧的深度图, 掩膜 图序列包含各帧的掩膜图,语义草图序列包含各帧的语义草图。各序列的长度均为 T1 , T1为要生成的视频的长度。 作为其中一种可实现的方式, 可以向用户提供条件输入界面, 用户可以在界面上 选择输入上述全局条件和局部条件。 例如, 用户可以在文本输入框输入文本条件。 再 例如, 用户可以通过界面上提供的图像绘制或编辑工具, 输入单幅图像和 /或单幅语义 草图作为空间条件, 输入运动矢量序列、 深度图序列、 掩膜图序列和语义图序列等中 的至少一种作为时间条件, 还可以通过上传自选图像或者从界面提供的图像库中选择 图像的方式输入风格图像。 其中在输入运动矢量序列时, 用户可以通过框选图像中的 特定区域, 并通过特定手势、 鼠标轨迹等方式产生移动轨迹来指示特定区域在视频中 的运动路线, 服务端依据该移动轨迹自动生成运动矢量序列。 除了上述列举的条件输 入方式之外, 也可以采用其他条件输入方式, 在此不做穷举。 下面结合实施例对上述步骤 204即 “对局部条件进行编码处理得到局部条件特征 表示对全局条件进行编码处理得到全局条件特征表示, 以及对局部条件进行编码处理 得到局部条件特征表示” 进行详细描述。 图 2中涉及的步骤 204~步骤 208可以通过预先训练得到的视频生成模型实现。如 图 4中所示, 视频生成模型可以包括时空条件编码器、 扩散模型和解码器, 还可以进 一步包括全局编码器。 本步骤主要由时空条件编码器执行, 如果使用了全局条件, 则 进一步全局编码器执行。 全局编码器用以对全局条件进行编码处理得到全局条件特征表示。 若全局条件中包含文本条件, 则可以采用文本编码器对文本条件进行文本编码, 得到文本特征表示, 将该文本特征表示作为全局条件特征表示。 若全局条件中包含文本条件和风格图像, 则可以利用文本编码器对文本条件进行 文本编码得到文本特征表示,以及利用图像编码器对风格条件和/或颜色条件进行图像 编码得到图像特征表示; 再对文本特征表示和图像特征表示进行整合, 得到全局条件 特征表示。 其中, 文本编码器可以采用预训练语言模型实现, 例如 BERT (Bidirectional Encoder Representation from Transformers, 基于转换的双向编码表示) 、 XLNet(一种 通过排列语言模型实现双向上下文信息的自回归模型)、 GPT(Generative Pre-Training , 生成式预训练) 模型、 CLIP (是一种可以实现多模态编码的编码模型) 等。 通过文本 编码器实际上可以获得文本条件的语义嵌入即文本特征表示。 图像编码器可以使用诸如 VIT (Vision Transformer, 视觉转换器) 、 CLIP等, 通 过 图像编码器实际上可以获取风格图像的语义嵌入即图像特征表示。
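The spatiotemporal condition encoder encodes the local conditions to obtain the local condition feature representation. The local conditions include spatial conditions and/or temporal conditions, and temporal conditions are sequence-type conditions, so the local conditions contain rich and complex spatiotemporal relationships, which makes controllable guidance of the video challenging. In view of this, the embodiments of the present disclosure deploy a corresponding spatiotemporal condition encoder for each local condition; in one implementation, local conditions and spatiotemporal condition encoders are in one-to-one correspondence.
Each spatiotemporal condition encoder encodes its local condition to obtain the feature tensor corresponding to that local condition, and the feature tensors of all local conditions are then fused to obtain the local condition feature representation.
The structure of the spatiotemporal condition encoder may be as shown in Figure 5. The encoder first performs spatial feature encoding on the local condition to obtain the spatial feature representation of that local condition. As shown in Figure 5, the spatiotemporal condition encoder may include, for this purpose, two 2D convolutions (Conv2D), two activation layers (for example, using the SiLU activation function) and one average pooling layer.
If the local condition input to the spatiotemporal condition encoder is a sequence, time-domain self-attention is applied to the spatial feature representation of that local condition (which already extends along the time axis) to obtain the feature tensor corresponding to the local condition; this part is performed by the temporal Transformer shown in Figure 5.
If the local condition input to the spatiotemporal condition encoder is not a sequence, for example a single image or a single semantic sketch, the spatial feature representation of the local condition is first replicated along the time axis to produce a spatial feature representation over time, so that it is temporally aligned with the spatial feature representations of the sequence-type local conditions; this part is performed by the block marked "齿" in Figure 5. The temporal Transformer then applies time-domain self-attention to the replicated spatial feature representation to obtain the feature tensor corresponding to the local condition.
In one implementation, the feature tensors corresponding to the local conditions may be fused by element-wise addition.
It can be seen that the spatiotemporal condition encoder first extracts local spatial information and then performs temporal modeling, which promotes explicit embedding of temporal information. It also provides a unified interface for different local conditions and enhances inter-frame consistency. For non-sequential spatial conditions such as a single image or a single semantic sketch, replication along the time dimension ensures consistency with the temporal conditions so that the local conditions can be fused. This encoding and fusion gives the subsequently generated video better controllability in both temporal and spatial perception. In particular, the motion vector sequence introduced among the temporal conditions captures inter-frame dynamics and thus enables direct control of motion within the video.
Step 206, namely "integrating the noise sequence with the local condition feature representation to obtain a noise latent vector sequence; denoising the noise latent vector sequence using a diffusion model to obtain a denoised latent vector sequence", is described in detail below with reference to embodiments.
In the embodiments of the present disclosure, a noise sequence ε may be generated randomly. The randomly generated noise sequence follows a normal distribution, i.e. it is Gaussian noise, and the length of ε equals the length of the video to be generated, namely T1. After the spatiotemporal condition encoders have encoded and fused the local conditions, the local condition feature representation has the same spatial shape as ε. The noise sequence ε is then integrated with the local condition feature representation, for example by concatenating the two along the channel dimension, to obtain the noise latent vector sequence z_t, which serves as one control signal for video generation.
Diffusion models have been widely applied to image generation because of their more stable training and flexible generation, but they have not yet been well exploited for video generation. To process video data effectively, the embodiments of the present disclosure introduce an LDM (Latent Diffusion Model) to maintain local fidelity: the initial video is projected into a latent representation, and a decoder then maps the latent representation back to pixel space to obtain the final video. Here the initial video is the noise sequence and the final video is the generated video. Diffusion models other than an LDM may also be used.
The processing of the diffusion model can be understood as predicting normally distributed noise and denoising at every time step, with the aim of recovering real video content. This process simulates the reverse process of a Markov chain of length T, where T is the total number of time steps of the diffusion model. A larger T gives better denoising but costs more computation, so the two must be traded off. T may be set empirically or experimentally, for example to 1000.
At each time step, the diffusion model may use the noise latent vector sequence obtained at the previous time step (at the first time step, the noise latent vector sequence input to the diffusion model) to predict the noise of the current time step, and use the predicted noise to denoise that noise latent vector sequence, obtaining the noise latent vector sequence of the current time step.
Further, if the input conditions include global conditions, the diffusion model uses the global condition feature representation to apply cross-attention to the noise latent vector sequence to predict the noise, and uses the predicted noise for denoising.
Specifically, as shown in Figure 6, at the first time step the diffusion model uses the global condition feature representation to apply cross-attention to the noise latent vector sequence to predict the noise of the current time step, and uses the predicted noise to denoise the noise latent vector sequence, obtaining the noise latent vector sequence of the current time step.
At every other time step, the diffusion model uses the global condition feature representation to apply cross-attention to the noise latent vector sequence obtained at the previous time step to predict the noise of the current time step, and uses the predicted noise to denoise the noise latent vector sequence obtained at the previous time step, obtaining the noise latent vector sequence of the current time step.
The diffusion model may use a 3D UNet as its backbone network. A UNet is an encoder-decoder network with skip connections, named for its U-shaped structure, which combines deep high-level features with shallow low-level features. Since it is a known network, it is not described in detail here.
The above process can be viewed as applying a denoising function ε_θ(·, ·, t), with t ∈ {1, ..., T}, to the noise latent vector sequence z_t and the conditions c (for example, including global conditions and local conditions). During denoising, the global conditions are injected through a cross-attention mechanism so that they guide video generation in terms of overall semantics.
Finally, the denoised latent vector sequence obtained at the last step of the diffusion model is input to the decoder in step 208, and the decoder decodes it to obtain the video. The decoder may be that of an existing video generation model and is not described in detail here.
With the approach provided by the embodiments of the present disclosure, various conditions can be flexibly combined to generate more diverse videos. Several examples illustrate the controllability of the present disclosure in video generation.
Example 1: a text condition among the global conditions can be combined with a temporal condition among the local conditions. As shown in Figure 7a, the user inputs the text condition "a rotating view of a long-haired woman standing in a forest" together with a semantic sketch sequence; the sequence length is 6 frames in this example. The generated video is shown in Figure 7a, which shows the individual frames of the video.
Example 2: a spatial condition and a temporal condition among the local conditions can be combined. The user inputs a single image and a depth map sequence, and the generated video is shown in Figure 7b.
Example 3: a style condition among the global conditions can be combined with a depth map sequence and a semantic sketch sequence among the local conditions. The user inputs a style image, a depth map sequence and a semantic sketch sequence, and the generated video is shown in Figure 7c.
Example 4: a text condition and a style condition among the global conditions can be combined with a single semantic sketch and a motion vector sequence among the local conditions. The user inputs the text condition "a moving golden moon", a style image and a semantic sketch, and may hand-draw the moon's direction of motion on the semantic sketch. The video generation apparatus on the server automatically generates a motion vector sequence from the hand-drawn direction, and the video is then generated in the manner provided by the embodiments of the present disclosure. As shown in Figure 7d, the moon in the generated video moves along the direction drawn by the user.
The training process of the video generation model is described in detail below with reference to embodiments. Figure 8 is a flowchart of the method for training a video generation model provided by an embodiment of the present disclosure; the method may be performed by the model training apparatus in the system architecture shown in Figure 1. As shown in Figure 8, the method may include the following steps.
Step 802: obtain first training data including a plurality of first training samples, each first training sample including local condition samples, and the local condition samples including spatial condition samples and/or temporal condition samples.
Further, the first training samples may also include global condition samples, which may include at least one of a text condition sample, a style condition sample and a color condition sample.
In one implementation, a video sample is obtained first. The description text of the video sample is obtained as the text condition sample, a style image of the video sample as the style condition sample, and a color histogram of the video sample as the color condition sample. For example, the description text may be obtained using an existing video captioning model, taken from the web page the video sample comes from, or added manually. For example, the first frame of the video sample may be used as the style image. As another example, the color histogram of the first frame of the video sample may be used as the color condition sample.
At least one of a single image and a single semantic sketch is then extracted from the video sample as the spatial condition sample. For example, the first frame of the video sample may be used as the single image. As another example, edge extraction may be performed on the first frame of the video sample, and the extracted edge information used to form the single semantic sketch. Alternatively, a hand-drawing generation tool may be used to generate the single semantic sketch from the first frame of the video sample.
At least one of a motion vector sequence, a depth map sequence, a mask map sequence, a semantic sketch sequence and a grayscale map sequence is then extracted from the video sample as the temporal condition sample. For example, the motion vectors between adjacent frames of the video sample may be extracted to form the motion vector sequence; the depth map of each frame may be extracted to form the depth map sequence; each frame may be masked to obtain the mask map sequence; a semantic sketch may be generated for each frame to obtain the semantic sketch sequence; and a grayscale map may be generated for each frame to obtain the grayscale map sequence.
The condition samples may also be obtained in other ways, which are not enumerated here.
It should be noted that qualifiers such as "first" and "second" in the embodiments of the present disclosure imply no limitation of size, order or quantity and serve only to distinguish names; for example, "first training sample" and "second training sample" distinguish two kinds of training samples by name.
Step 804: train the video generation model using the first training data. The video generation model includes a spatiotemporal condition encoder, a diffusion model and a decoder. The spatiotemporal condition encoder encodes the local condition samples to obtain the local condition feature representation; the diffusion model denoises the noise latent vector sequence, which is obtained by integrating the noise sequence with the local condition feature representation, to obtain a denoised latent vector sequence; and the decoder decodes the denoised latent vector sequence to obtain a video. The training objective includes minimizing the difference between the noise predicted by the diffusion model during denoising and the Gaussian noise.
Further, the video generation model may also include a global encoder, which encodes the global condition samples to obtain the global condition feature representation. Correspondingly, the diffusion model may use the global condition feature representation to apply cross-attention to the noise latent vector sequence to predict the noise, and use the predicted noise for denoising.
In one implementation, if the global condition samples include a text condition sample, the global encoder may use a text encoder to encode the text condition sample into a text feature representation, which is used as the global condition feature representation.
In another implementation, if the global condition samples include a style condition sample and/or a color condition sample in addition to the text condition sample, the global encoder may use a text encoder to encode the text condition sample into a text feature representation and an image encoder to encode the style condition sample and/or color condition sample into an image feature representation, and then integrate the text feature representation and the image feature representation to obtain the global condition feature representation.
In one implementation, each spatiotemporal condition encoder encodes its local condition sample to obtain the feature tensor corresponding to that local condition sample, the spatiotemporal condition encoders corresponding one-to-one to the local conditions; the feature tensors corresponding to the local condition samples are then fused to obtain the local condition feature representation.
The spatiotemporal condition encoder first performs spatial feature encoding on the local condition sample to obtain its spatial feature representation. If the local condition sample is a sequence, time-domain self-attention is applied to its spatial feature representation to obtain the corresponding feature tensor; otherwise, the spatial feature representation is replicated along the time axis to produce a spatial feature representation over time, and time-domain self-attention is applied to that representation to obtain the corresponding feature tensor.
In one implementation, at the first time step the diffusion model uses the global condition feature representation to apply cross-attention to the noise latent vector sequence to predict the noise of the current time step, and uses the predicted noise to denoise the noise latent vector sequence, obtaining the noise latent vector sequence of the current time step.
At every other time step, the diffusion model uses the global condition feature representation to apply cross-attention to the noise latent vector sequence obtained at the previous time step to predict the noise of the current time step, and uses the predicted noise to denoise the noise latent vector sequence obtained at the previous time step, obtaining the noise latent vector sequence of the current time step.
For the specific structure and a detailed description of the video generation model, see the description of Figures 4, 5 and 6 in the preceding method embodiment; it is not repeated here.
In the embodiments of the present disclosure, a loss function may be constructed from the above training objective. In each iteration of training the video generation model, the value of the loss function is used to update the model parameters, for example by gradient descent, until a preset training end condition is met. The training end condition may include, for example, the value of the loss function being less than or equal to a preset loss threshold, or the number of iterations reaching a preset threshold.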
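The iteration scheme just described (build a loss from the noise-prediction objective, update parameters by gradient descent, stop on a loss threshold or an iteration cap) can be illustrated with a deliberately tiny sketch. Everything concrete here is an assumption made for illustration: the "model" is a single trainable weight `w` predicting noise as `w * z_t`, the clean latent is fixed to zero so that `z_t = 0.5 * eps`, and the batch size, learning rate and thresholds are arbitrary. It is not the training procedure of the disclosure.

```python
import random

def train(max_iters=500, loss_threshold=0.05, lr=0.1, seed=0):
    """Toy noise-prediction training loop with two stopping conditions."""
    rng = random.Random(seed)
    w = 0.0                      # the only trainable parameter (hypothetical model)
    loss = float("inf")
    iteration = 0
    for iteration in range(1, max_iters + 1):
        eps = [rng.gauss(0.0, 1.0) for _ in range(64)]   # Gaussian noise targets
        z_t = [0.5 * e for e in eps]                     # noisy latents (clean latent = 0)
        pred = [w * z for z in z_t]                      # predicted noise
        # Mean squared difference between predicted noise and the Gaussian noise.
        loss = sum((p - e) ** 2 for p, e in zip(pred, eps)) / len(eps)
        if loss <= loss_threshold:                       # end condition 1: loss threshold
            break
        grad = sum(2.0 * (p - e) * z for p, e, z in zip(pred, eps, z_t)) / len(eps)
        w -= lr * grad                                   # gradient-descent update
    return w, loss, iteration                            # end condition 2: iteration cap
```

In the setting described by the text, only the spatiotemporal condition encoder and the diffusion model would receive such updates, while a pre-trained global encoder and decoder would stay fixed.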
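[Equation (1): the loss function corresponding to the training objective; the equation image is not reproduced in this text.]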
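Equation (1) is referenced by the surrounding text, but its image did not survive extraction. A plausible form, assuming the standard noise-prediction objective of latent diffusion models and reusing the symbols that appear in the description (denoising function ε_θ, noise latent vector sequence z_t, conditions c, time step t ∈ {1, ..., T}), would be the following. This is a reconstruction under those assumptions, and the published equation may differ in notation or weighting.

```latex
\mathcal{L} \;=\; \mathbb{E}_{z_0,\; c,\; \epsilon \sim \mathcal{N}(0, I),\; t}
\left[ \, \bigl\lVert \epsilon - \epsilon_\theta(z_t, c, t) \bigr\rVert_2^2 \, \right]
\tag{1}
```

Here ε is the Gaussian noise, ε_θ(z_t, c, t) is the noise predicted by the diffusion model at time step t, and z_0 (a symbol introduced here, not in the source) denotes the clean latent from which z_t is derived.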
另外, 除了上述公式 (1) 中示出的损失函数之外, 在此精神原则之内采用的其 他损失函数同样在本公开的保护范围之内。 例如在每一轮迭代中选择多个时间步预测 得到的噪声与高斯噪声之间的差异得到损失函数。 作为其中一种可实现的方式, 在视频生成模型的训练过程中, 全局编码器和解码 器可以采用预训练得到的参数, 在训练的每一轮迭代中, 利用与训练的目标对应的损 失函数更新时空条件编码器和扩散模型的参数。 其中全局编码器的预训练可以采用其他任务, 例如图像生成任务、 文本分类任务 等等。解码器的预训练也可以采用其他任务,例如图像生成任务、图像分类任务等等。 这些是目前已有的训练任务, 在此不做详述。 在视频生成模型的训练之前, 可以使用图像生成模型中的扩散模型的参数进行初 始化, 这种方式在一定程度上缓解了训练难度, 加快了训练速度, 但仍在学习处理时 序特征和多条件生成视频方面存在难度。 有鉴于此, 本公开提供了一种更为优选的实 施方式, 即采用两阶段的训练策略, 在利用第一训练数据训练视频生成模型之前, 首 先对扩散模型进行预训练, 该预训练是利用基于文本条件生成视频的过程来使得扩散 模型进行学习, 然后在预训练得到的扩散模型的参数基础上进一步利用第一训练数据 (即包含全局条件和局部条件等多种条件) 训练视频生成模型。 其中, 预训练扩散模型的过程可以包括以下步骤: 首先获取包括多个第二训练样本的第二训练数据, 第二训练样本包括: 从视频样 本中提取描述文本作为文本样本。 其中可以采用已有的视频描述文本生成模型来获取 视频样本的描述文本, 也可以从视频样本所来源网页上获取视频样本的描述文本, 也 可以人工添加视频样本的描述文本, 等等。 然后利用全局编码器对文本样本进行编码处理得到文本特征表示。 再将得到的文 本特征表示和噪声序列输入扩散模型以对扩散模型进行训练, 其中扩散模型利用文本 特征表示对噪声序列进行去噪声处理; 训练的目标包括: 最小化扩散模型在各时间步 进行去噪声处理时预测的噪声与高斯噪声之间的差异。 该部分预训练采用的损失函数 与上述公式 (1) 相同, 是包含的条件 C不同, 预训练过程中涉及的条件 C包含文本条 件。 也就是说,输入条件有文本条件的情况下,利用生成视频的过程来训练扩散模型。 然后将训练得到的扩散模型的参数作为步骤 804中训练视频生成模型时扩散模型采用 的初始化参数。 上述对本说明书特定实施例进行了描述。 其它实施例在所附权利要求书的范围内。 在一些情况下, 在权利要求书中记载的动作或步骤可以按照不同于实施例中的顺序来 执行并且仍然可以实现期望的结果。 另外, 在附图中描绘的过程不一定要求示出的特 定顺序或者连续顺序才能实现期望的结果。 在某些实施方式中, 多任务处理和并行处 理也是可以的或者可能是有利的。 根据另一方面的实施例, 提供了一种视频生成装置。 图 9示出根据一个实施例的 视频生成装置的示意性框图, 该装置对应图 1所示系统中的模型训练装置。 如图 9所 示, 该装置 900包括: 条件获取单元 901和视频生成单元 902 o 其中各组成单元的主 要功能如下: 条件获取单元 901 , 被配置为获取局部条件, 局部条件包括空间条件和 /或时间条 件; 视频生成单元 902, 被配置为对局部条件进行编码处理得到局部条件特征表示; 将噪声序列与局部条件特征表示进行整合, 得到噪声隐向量序列; 利用扩散模型对噪 声隐向量序列进行去噪声处理, 得到去噪后的隐向量序列; 利用去噪后的隐向量序列 进行解码处理, 得到视频。 更进一步地, 条件获取单元 901 , 还可以被配置为获取全局条件, 全局条件包括 文本条件、 风格条件和颜色条件中的至少一种。 相应地, 视频生成单元 902, 还可以别配置为对全局条件进行编码处理得到全局 条件特征表示。 扩散模型可以利用全局条件特征表示对噪声隐向量序列进行交叉注意 力处理以预测噪声, 并利用预测的噪声进行去噪声处理。 上述空间条件可以包括单幅图像和单幅语义草图中的至少一种。 时间条件包括运 动矢量序列、 深度图序列、 掩膜图序列、 语义草图序列和灰度图序列中的至少一种。 其中, 视频生成单元 902可以通过图 3中所示的视频生成模型实现。 全局编码器 用以对全局条件进行编码处理得到全局条件特征表示。 时空条件编码器用以对局部条 件进行编码处理得到局部条件特征表示。 扩散模型用以对噪声隐向量序列进行去噪声 处理, 得到去噪后的隐向量序列。 解码器利用去噪后的隐向量序列进行解码处理, 得 到视频。 作为其中一种可实现的方式, 视频生成模型中的全局编码器对文本条件进行文本 编码得到文本特征表示,以及对风格条件和 /或颜色条件进行图像编码得到图像特征表 示; 
对文本特征表示和图像特征表示进行整合, 得到全局条件特征表示。 作为其中一种可实现的方式, 各时空条件编码器分别对各局部条件进行编码, 得 到各局部条件对应的特征张量, 其中时空条件编码器与局部条件一一对应; 将各局部 条件对应的特征张量进行融合处理, 得到局部条件特征表示。 具体地, 时空条件编码器可以利用时空条件编码器对局部条件进行空间特征编码 , 得到该局部条件的空间特征表示; 若该局部条件为序列, 则对该局部条件的空间特征 表示进行时域自注意力处理, 得到该局部条件对应的特征张量; 否则, 对该局部条件 的空间特征表示在时序上复制产生时序上的空间特征表示, 对时序上的空间特征表示 进行时域自注意力处理, 得到该局部条件对应的特征张量。 根据另一方面的实施例, 提供了一种训练视频生成模型的装置。 图 10 示出根据 一个实施例的训练视频生成模型的装置的示意性框图。 如图 10所示, 该装置 1000包 括: 样本获取单元 1001 和模型训练单元 1002, 还可以进一步包括预训练单元 1003。 其中各组成单元的主要功能如下: 样本获取单元 1001 , 被配置为获取包括多个第一训练样本的第一训练数据, 第一 训练样本包括局部条件样本, 局部条件样本包括空间条件样本和 /或时间条件样本。 模型训练单元 1002, 被配置为利用第一训练数据训练视频生成模型, 视频生成模 型包括: 时空条件编码器、 扩散模型和解码器; 其中, 时空条件编码器对局部条件样 本进行编码处理得到局部条件特征表示; 扩散模型对噪声隐向量序列进行去噪声处理, 得到去噪后的隐向量序列, 噪声隐向量序列是将噪声序列与局部条件特征表示进行整 合后得到的; 解码器利用去噪后的隐向量序列进行解码处理, 得到视频; 训练的目标包括: 最小化扩散模型进行去噪声处理时预测的噪声与高斯噪声之间 的差异。 更进一步地, 上述第一训练样本还包括全局条件样本, 全局条件样本包括文本条 件样本、 风格条件样本和颜色条件样本中的至少一种。 视频生成模型还包括全局编码器, 全局编码器对全局条件样本进行编码处理得到 全局条件特征表示。 相应地, 扩散模型利用全局条件特征表示对噪声隐向量序列进行 交叉注意力处理以预测噪声, 并利用预测的噪声进行去噪声处理。 作为其中一种可实现的方式, 全局编码器和解码器采用预训练得到的参数, 在训 练的每一轮迭代中, 利用与训练的目标对应的损失函数更新时空条件编码器和扩散模 型的参数。 作为其中一种可实现的方式, 样本获取单元 1001 可以获取视频样本; 获取视频 样本的描述文本、风格图像和 /或颜色直方图分别作为文本条件样本、 风格条件样本和 /或颜色条件样本;提取单幅图像和单幅语义草图中的至少一种作为空间条件样本;提 取运动矢量序列、 深度图序列、 掩膜图序列、 语义草图序列和 /或灰度图序列中的至少 一种作为时间条件样本。 作为其中一种可实现的方式, 预训练单元 1003 可以被配置为采用如下方式预训 练扩散模型: 获取包括多个第二训练样本的第二训练数据, 第二训练样本包括: 从视频样本中 提取描述文本作为文本样本; 利用全局编码器对文本样本进行编码处理得到文本特征表示; 将文本特征表示和噪声序列输入扩散模型以对扩散模型进行训练, 扩散模型利用 文本特征表示对噪声序列进行去噪声处理; 训练的目标包括: 最小化扩散模型在各时 间步进行去噪声处理时预测的噪声与高斯噪声之间的差异。 相应地, 模型训练单元 1002在预训练单元 1003预训练得到的扩散模型的参数基 础上进一步利用第一训练数据训练视频生成模型。 作为其中一种可实现的方式, 时空条件编码器可以利用各时空条件编码器分别对 各局部条件样本进行编码, 得到各局部条件样本对应的特征张量, 其中时空条件编码 器与局部条件一一对应; 将各局部条件样本对应的特征张量进行融合处理, 得到局部 条件特征表示。 作为其中一种可实现的方式, 时空条件编码器可以对局部条件样本进行空间特征 编码, 得到该局部条件样本的空间特征表示; 若该局部条件样本为序列, 则对该局部 条件样本的空间特征表示进行时域自注意力处理, 得至 ']该局部条件样本对应的特征张 量; 否则, 对该局部条件样本的空间特征表示在时序上复制产生时序上的空间特征表 示, 对时序上的空间特征表示进行时域自注意力处理, 得到该局部条件样本对应的特 征张量。 本说明书中的各个实施例均采用递进的方式描述, 各个实施例之间相同相似的部 分互相参见即可, 每个实施例重点说明的都是与其他实施例的不同之处。 尤其, 对于 装置实施例而言, 由于其基本相似于方法实施例, 所以描述得比较简单, 相关之处参 见方法实施例的部分说明即可。 
以上所描述的装置实施例是示意性的, 其中所述作为 分离部件说明的单元可以是或者也可以不是物理上分开的, 作为单元显示的部件可以 是或者也可以不是物理单元, 即可以位于一个地方, 或者也可以分布到多个网络单元 上。 可以根据实际的需要选择其中的部分或者全部模块来实现本实施例方案的目的。 本领域普通技术人员在不付出创造性劳动的情况下, 即可以理解并实施。 需要说明的是, 本公开所涉及的用户信息 (包括但不限于用户设备信息、 用户个 人信息等) 和数据 (包括但不限于用于分析的数据、 存储的数据、 展示的数据等) , 均为经用户授权或者经过各方充分授权的信息和数据, 并且相关数据的收集、 使用和 处理需要遵守相关国家和地区的相关法律法规和标准, 并提供有相应的操作入口, 供 用户选择授权或者拒绝。 另外, 本公开实施例还提供了一种计算机可读存储介质, 其上存储有计算机程序, 该程序被处理器执行时实现前述方法实施例中任一项所述的方法的步骤。 以及一种电子设备, 包括: 一个或多个处理器; 以及 与所述一个或多个处理器关联的存储器,所述存储器被设置为存储程序指令,所述 程序指令在被所述一个或多个处理器读取执行时, 执行前述方法实施例中任 ■一项所述 的方法的步骤。 本公开还提供了一种计算机程序产品, 包括计算机程序, 该计算机程序在被处理 器执行时实现前述方法实施例中任一项所述的方法的步骤。 其中, 图 1 1 示例性的展示出了电子设备的架构, 具体可以包括处理器 1 1 10, 视 频显示适配器 1 1 1 1 , 磁盘驱动器 1 1 12, 输入 /输出接口 1 1 13 , 网络接口 1 1 14, 以及存 储器 1120。 上述处理器 1110、 视频显示适配器 1111、 磁盘驱动器 1112、 输入 /输出接 口 1113、 网络接口 1114, 与存储器 1120之间可以通过通信总线 1130进行通信连接。 其中, 处理器 1110 可以采用通 .用的 CPU、 微处理器、 应用专用集成电路 (Application Specific Integrated Circuit, ASIC) 、 或者一个或多个集成电路等方式实 现, 被设置为执行相关程序, 以实现本公开所提供的技术方案。 存储器 1120可以采用 ROM(Read Only Memory, 只读存储器) 、 RAM (Random Access Memory , 随机存取存储器) 、 静态存储设备, 动态存储设备等形式实现。 存 储器 1120可以存储被设置为控制电子设备 1100运行的操作系统 1121 ,被设置为控制 电子设备 1100的低级别操作的基本输入输出系统(BIOS) 1122o 另外, 还可以存储网 页浏览器 1123 ,数据存储管理系统 1124,以及视频生成装置 /模型训练装置 1125等等。 上述视频生成装置 /模型训练装置 1125就可以是本公开实施例中具体实现前述各步骤 操作的应用程序。 总之, 在通过软件或者固件来实现本公开所提供的技术方案时, 相 关的程序代码保存在存储器 1 120中, 并由处理器 1 1 10来调用执行。 输入 /输出接口 1113被设置为连接输入 /输出模块, 以实现信息输入及输出。 输入 输出 /模块可以作为组件配置在设备中 (图中未示出) , 也可以外接于设备以提供相应 功能。 其中输入设备可以包括键盘、 鼠标、 触摸屏、 麦克风、 各类传感器等, 输出设 备可以包括显示器、 扬声器、 振动器、 指示灯等。 网络接口 1114 被设置为连接通信模块 (图中未示出) , 以实现本设备与其他设 备的通信交互。 其中通信模块可以通过有线方式 (例如 USB、 网线等) 实现通信, 也 可以通过无线方式 (例如移动网络、 WIFE 蓝牙等) 实现通信。 总线 1 130包括一通路, 在设备的各个组件 (例如处理器 1 1 10、 视频显示适配器 1111、 磁盘驱动器 1112、 输入/输出接口 1113、 网络接口 1114, 与存储器 1120) 之间 传输信息。 需要说明的是, 尽管上述设备示出了处理器 1110、 视频显示适配器 1111、 磁盘 驱动器 1112、 输入 /输出接口 1113、 网络接口 1114, 存储器 112。, 总线 1130等, 但 是在具体实施过程中, 该设备还可以包括实现正常运行所必需的其他组件。 此外, 本 领域的技术人员可以理解的是, 上述设备中也可以包含实现本公开方案所必需的组件, 而不必包含图中所示的全部组件。 通过以上的实施方式的描述可知, 本领域的技术人员可以清楚地了解到本公开可 借助软件加必需的通用硬件平台的方式来实现。 基于这样的理解, 本公开的技术方案 
本质上或者说对现有技术做出贡献的部分可以以计算机程序产品的形式体现出来, 该 计算机程序产品可以存储在存储介质中, 如 ROM/RAM、 磁碟、 光盘等, 包括若干指 令用以使得一台计算机设备 (可以是个人计算机, 服务器, 或者网络设备等) 执行本 公开各个实施例或者实施例的某些部分所述的方法。 以上对本公开所提供的技术方案进行了详细介绍, 本文中应用了具体个例对本公 开的原理及实施方式进行了阐述, 以上实施例的说明只是用于帮助理解本公开的方法 及其核心思想; 同时, 对于本领域的一般技术人员, 依据本公开的思想, 在具体实施 方式及应用范围上均会有改变之处。 综上所述, 本说明书内容不应理解为对本公开的 限制。 工业实用性 本公开实施例提供了一种视频生成方法, 包括: 获取局部条件, 局部条件包括空 间条件和 /或时间条件, 对局部条件进行编码处理得到局部条件特征表示, 将噪声序列 与局部条件特征表示进行整合, 得到噪声隐向量序列, 利用扩散模型对噪声隐向量序 列进行去噪声处理, 得到去噪后的隐向量序列, 利用所述去噪后的隐向量序列进行解 码处理, 得到视频。 本公开融入了空间条件和/或时间条件作为局部条件, 利用局部条 件特征表示与噪声序列进行整合得到噪声隐向量序列, 并对噪声隐向量序列进行去噪 声处理, 得到去噪后的隐向量序列, 进而解码得到视频。 这种方式使得视频生成不再 局限于在文本条件,而是引入空间条件和 /或时间条件来引导视频生成,从而更加灵活、 多样化地生成视频, 提高视频的质量。

Claims

1.一种视频生成方法, 所述方法包括: 获取局部条件, 所述局部条件包括空间条件和 /或时间条件; 对所述局部条件进行编码处理得到局部条件特征表示; 将噪声序列与所述局部条件特征表示进行整合, 得到噪声隐向量序列; 利用扩散模型对所述噪声隐向量序列进行去噪声处理, 得到去噪后的隐向量序列 ; 利用所述去噪后的隐向量序列进行解码处理, 得到视频。
2.根据权利要求 1所述的方法, 其中, 所述方法还包括: 获取全局条件, 所述全 局条件包括文本条件、 风格条件和颜色条件中的至少一种; 对所述全局条件进行编码 处理得到全局条件特征表示; 所述利用扩散模型对所述噪声隐向量序列进行去噪声处理包括: 所述扩散模型利 用所述全局条件特征表示对所述噪声隐向量序列进行交叉注意力处理以预测噪声, 并 利用预测的噪声进行去噪声处理。
3.根据权利要求 1所述的方法, 其中, 所述空间条件包括单幅图像和单幅语义草 图中的至少一种; 所述时间条件包括运动矢量序列、 深度图序列、 掩膜图序列、 语义草图序列和灰 度图序列中的至少一种。
4.根据权利要求 2所述的方法, 其中, 对所述全局条件进行编码处理得到全局条 件特征表示包括: 对所述文本条件进行文本编码得到文本特征表示, 以及对所述风格 条件和 /或颜色条件进行图像编码得到图像特征表示;对所述文本特征表示和所述图像 特征表示进行整合, 得到全局条件特征表示。
5.根据权利要求 1至 4中任一项所述的方法, 其中, 对所述局部条件进行编码处 理得到局部条件特征表示包括: 利用各时空条件编码器分别对各局部条件进行编码, 得到各局部条件对应的特征 张量, 其中时空条件编码器与局部条件一一对应; 将各局部条件对应的特征张量进行融合处理, 得到所述局部条件特征表示。
6.根据权利要求 5所述的方法, 其中, 所述利用时空条件编码器分别对各局部条 件进行编码包括: 利用时空条件编码器对局部条件进行空间特征编码, 得到该局部条件的空间特征 表示; 若该局部条件为序列, 则对该局部条件的空间特征表示进行时域自注意力处理, 得到该局部条件对应的特征张量; 否则, 对该局部条件的空间特征表示在时序上复制 产生时序上的空间特征表示, 对所述时序上的空间特征表示进行时域自注意力处理, 得到该局部条件对应的特征张量。
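7. A method for training a video generation model, the method comprising:
obtaining first training data comprising a plurality of first training samples, each first training sample comprising a local condition sample, the local condition sample comprising a spatial condition sample and/or a temporal condition sample; and
training a video generation model using the first training data, the video generation model comprising a spatio-temporal condition encoder, a diffusion model, and a decoder; wherein the spatio-temporal condition encoder encodes the local condition sample to obtain a local condition feature representation; the diffusion model denoises a noisy latent vector sequence to obtain a denoised latent vector sequence, the noisy latent vector sequence being obtained by integrating a noise sequence with the local condition feature representation; and the decoder decodes the denoised latent vector sequence to obtain a video;
wherein the training objective comprises: minimizing the difference between the noise predicted by the diffusion model during denoising and Gaussian noise.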
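8. The method according to claim 7, wherein the first training sample further comprises a global condition sample, the global condition sample comprising at least one of a text condition sample, a style condition sample, and a color condition sample;
the video generation model further comprises a global encoder, the global encoder encoding the global condition sample to obtain a global condition feature representation; and
the diffusion model denoising the noisy latent vector sequence comprises: the diffusion model performing cross-attention on the noisy latent vector sequence using the global condition feature representation to predict noise, and denoising using the predicted noise.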
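9. The method according to claim 8, wherein the global encoder and the decoder use parameters obtained by pre-training, and in each training iteration the parameters of the spatio-temporal condition encoder and the diffusion model are updated using a loss function corresponding to the training objective.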
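10. The method according to claim 8, wherein obtaining the first training data comprising the plurality of first training samples comprises:
obtaining a video sample;
obtaining a descriptive text, a style image, and/or a color histogram of the video sample as the text condition sample, the style condition sample, and/or the color condition sample, respectively;
extracting at least one of a single image and a single semantic sketch as the spatial condition sample; and
extracting at least one of a motion vector sequence, a depth map sequence, a mask sequence, a semantic sketch sequence, and a grayscale image sequence as the temporal condition sample.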
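11. The method according to any one of claims 7 to 10, further comprising, before training the video generation model using the first training data, pre-training the diffusion model, wherein the video generation model is further trained using the first training data on the basis of the parameters of the pre-trained diffusion model;
wherein pre-training the diffusion model comprises:
obtaining second training data comprising a plurality of second training samples, each second training sample comprising a descriptive text extracted from a video sample as a text sample;
encoding the text sample with a global encoder to obtain a text feature representation; and
inputting the text feature representation and a noise sequence into the diffusion model to train the diffusion model, the diffusion model denoising the noise sequence using the text feature representation; wherein the training objective comprises: minimizing the difference between the noise predicted by the diffusion model when denoising at each time step and Gaussian noise.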
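12. The method according to any one of claims 7 to 10, wherein the spatio-temporal condition encoder encoding the local condition sample to obtain the local condition feature representation comprises:
encoding each local condition sample with a respective spatio-temporal condition encoder to obtain a feature tensor corresponding to each local condition sample, wherein the spatio-temporal condition encoders correspond one-to-one to the local conditions; and
fusing the feature tensors corresponding to the local condition samples to obtain the local condition feature representation.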
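13. The method according to claim 12, wherein encoding each local condition sample with a respective spatio-temporal condition encoder comprises:
performing spatial feature encoding on the local condition sample with the spatio-temporal condition encoder to obtain a spatial feature representation of the local condition sample; and
if the local condition sample is a sequence, performing temporal self-attention on the spatial feature representation of the local condition sample to obtain the feature tensor corresponding to the local condition sample; otherwise, replicating the spatial feature representation of the local condition sample along the time dimension to produce a time-series spatial feature representation, and performing temporal self-attention on the time-series spatial feature representation to obtain the feature tensor corresponding to the local condition sample.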
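14. A video generation method, performed by a cloud server, the method comprising:
obtaining a local condition from a user terminal, the local condition comprising a spatial condition and/or a temporal condition;
encoding the local condition to obtain a local condition feature representation;
integrating a noise sequence with the local condition feature representation to obtain a noisy latent vector sequence;
denoising the noisy latent vector sequence with a diffusion model to obtain a denoised latent vector sequence;
decoding the denoised latent vector sequence to obtain a video; and
sending the video to the user terminal for display.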
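15. A video generation apparatus, the apparatus comprising:
a condition obtaining unit configured to obtain a local condition, the local condition comprising a spatial condition and/or a temporal condition; and
a video generation unit configured to: encode the local condition to obtain a local condition feature representation; integrate a noise sequence with the local condition feature representation to obtain a noisy latent vector sequence; denoise the noisy latent vector sequence with a diffusion model to obtain a denoised latent vector sequence; and decode the denoised latent vector sequence to obtain a video.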
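16. An apparatus for training a video generation model, the apparatus comprising:
a sample obtaining unit configured to obtain first training data comprising a plurality of first training samples, each first training sample comprising a local condition sample, the local condition sample comprising a spatial condition sample and/or a temporal condition sample; and
a model training unit configured to train a video generation model using the first training data, the video generation model comprising a spatio-temporal condition encoder, a diffusion model, and a decoder; wherein the spatio-temporal condition encoder encodes the local condition sample to obtain a local condition feature representation; the diffusion model denoises a noisy latent vector sequence to obtain a denoised latent vector sequence, the noisy latent vector sequence being obtained by integrating a noise sequence with the local condition feature representation; and the decoder decodes the denoised latent vector sequence to obtain a video;
wherein the training objective comprises: minimizing the difference between the noise predicted by the diffusion model during denoising and Gaussian noise.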
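17. A computer-readable storage medium storing a computer program, wherein the program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 14.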
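18. An electronic device, comprising:
one or more processors; and
a memory associated with the one or more processors, the memory being configured to store program instructions which, when read and executed by the one or more processors, perform the steps of the method according to any one of claims 1 to 14.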
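The training objective recited in claims 7, 11, and 16 (minimizing the difference between the predicted noise and Gaussian noise) can be illustrated with a short sketch. The claims do not specify a noise schedule or a distance measure, so the linear beta schedule, the closed-form noising step, and the mean-squared-error loss below are assumptions borrowed from the common DDPM formulation; all function names are hypothetical, and conditioning on local or global features is omitted for brevity.

```python
import math
import random


def make_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    # Cumulative products of (1 - beta_t) for an assumed linear beta schedule.
    alpha_bars, prod = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / max(num_steps - 1, 1)
        prod *= 1.0 - beta
        alpha_bars.append(prod)
    return alpha_bars


def add_noise(x0, eps, alpha_bar):
    # Assumed forward step: x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps.
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [a * x + b * e for x, e in zip(x0, eps)]


def noise_prediction_loss(eps_pred, eps):
    # Mean squared error between predicted noise and the sampled Gaussian noise.
    return sum((p - e) ** 2 for p, e in zip(eps_pred, eps)) / len(eps)


def training_loss(x0, predict_noise, alpha_bars, rng):
    # One training example: pick a time step, noise the clean latent,
    # and score the model's noise prediction against the true noise.
    t = rng.randrange(len(alpha_bars))
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = add_noise(x0, eps, alpha_bars[t])
    return noise_prediction_loss(predict_noise(x_t, t), eps)
```

Here `predict_noise(x_t, t)` stands for whatever network is being trained; a predictor that recovers the sampled noise exactly drives the loss to zero, which is the sense in which training minimizes the gap between predicted and Gaussian noise.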
PCT/SG2024/050361 2023-05-29 2024-05-29 Video generation method, and method and apparatus for training video generation model Ceased WO2024248736A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
EP24816025.1A EP4722982A4 (en) 2023-05-29 2024-05-29 METHOD AND APPARATUS FOR VIDEO GENERATION, METHOD AND APPARATUS FOR TRAINING A VIDEO GENERATION MODEL

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202310618707.8A CN116863003A (zh) 2023-05-29 2023-05-29 Video generation method, and method and apparatus for training video generation model
CN202310618707.8 2023-05-29

Publications (1)

Publication Number Publication Date
WO2024248736A1 (zh) 2024-12-05

Family

ID=88217934

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/SG2024/050361 Ceased WO2024248736A1 (zh) 2023-05-29 2024-05-29 Video generation method, and method and apparatus for training video generation model

Country Status (3)

Country Link
EP (1) EP4722982A4 (zh)
CN (1) CN116863003A (zh)
WO (1) WO2024248736A1 (zh)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
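CN119364135A (zh) * 2024-12-26 2025-01-24 北京生数科技有限公司 Method, apparatus, device, and medium for generating video from images
CN119693971A (zh) * 2024-12-17 2025-03-25 山东大学 Diffusion-model-based cross-view gait recognition method and system under missing views
CN119741222A (zh) * 2024-12-26 2025-04-01 北京生数科技有限公司 Video generation method, apparatus, and electronic device
CN120070638A (zh) * 2025-02-24 2025-05-30 中国科学技术大学 Text-guided zero-shot transparent layer and layered image generation method
CN120216684A (zh) * 2025-03-11 2025-06-27 北京合力亿捷科技股份有限公司 Text processing method, apparatus, device, storage medium, and program product
CN120318827A (zh) * 2025-04-18 2025-07-15 华南理工大学 Continual anomaly detection method based on generative replay
CN120434477A (zh) * 2025-07-08 2025-08-05 北京生数科技有限公司 Layout-controllable video generation method, apparatus, device, medium, and product
CN120434485A (zh) * 2025-07-09 2025-08-05 北京达佳互联信息技术有限公司 Content generation method, training method for content generation model, and corresponding apparatuses
CN120499471A (zh) * 2025-05-16 2025-08-15 合肥孪生宇宙科技有限公司 Digital human video generation system based on diffusion Transformer architecture
CN121280441A (zh) * 2025-12-09 2026-01-06 山东科技大学 Method for precise generation of industrial part defect samples based on conditional diffusion model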

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
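CN120111318A (zh) * 2023-12-04 2025-06-06 北京字跳网络技术有限公司 Video generation method and related device
CN117830483B (zh) * 2023-12-27 2024-10-18 北京智象未来科技有限公司 Image-based video generation method, apparatus, device, and storage medium
CN120416680A (zh) 2024-01-30 2025-08-01 北京有竹居网络技术有限公司 Method, apparatus, electronic device, and computer program product for generating video
CN118714417B (zh) * 2024-02-07 2026-01-27 浙江天猫技术有限公司 Video generation method, system, electronic device, and storage medium
CN118365730B (zh) * 2024-04-29 2025-06-03 上海人工智能创新中心 Text-to-image method, apparatus, device, and storage medium
CN118233714B (zh) * 2024-05-23 2024-08-13 北京大学深圳研究生院 Panoramic video generation method, apparatus, device, and storage medium
CN119091016A (zh) * 2024-07-18 2024-12-06 浙江师范大学 Dynamic video generation method and system based on diffusion model
CN120512504B (zh) * 2024-08-23 2025-12-16 北京极佳视界科技有限公司 Video generation method, apparatus, device, and storage medium
CN119313788A (zh) * 2024-09-18 2025-01-14 广东因赛品牌营销集团股份有限公司 Character-centered marketing video generation method
CN119475470B (zh) * 2024-11-01 2025-09-02 湖南大学 Two-stage product design generation method, system, device, and medium
CN119629432B (zh) * 2024-11-22 2025-10-17 平安科技(深圳)有限公司 Video generation method and apparatus, electronic device, and storage medium
CN119697456B (zh) * 2024-12-03 2025-10-10 电子科技大学(深圳)高等研究院 Text-to-video generation method, apparatus, and storage medium
CN119940424B (zh) * 2025-01-02 2025-09-26 湘潭大学 Stratospheric wind field simulation method
CN120765497A (zh) * 2025-06-30 2025-10-10 中国科学院杭州医学研究所 Virtual staining system and method for digital pathology images, and computer-readable storage medium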

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022265992A1 (en) * 2021-06-14 2022-12-22 Google Llc Diffusion models having improved accuracy and reduced consumption of computational resources
CN115965791A (zh) * 2022-12-19 2023-04-14 北京字跳网络技术有限公司 Image generation method, apparatus, and electronic device

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
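CN115174824B (zh) * 2021-03-19 2025-06-03 阿里巴巴创新公司 Video generation method and apparatus, and promotional video generation method and apparatus
CN114973410B (zh) * 2022-05-20 2025-08-22 北京沃东天骏信息技术有限公司 Method and apparatus for extracting action features from video frames
CN115601485B (zh) * 2022-12-15 2023-04-07 阿里巴巴(中国)有限公司 Data processing method for task processing model and virtual character animation generation method
CN115861131B (zh) * 2023-02-03 2023-05-26 北京百度网讯科技有限公司 Image-based video generation method, model training method, apparatus, and electronic device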

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
NI HAOMIAO; SHI CHANGHAO; LI KAI; HUANG SHARON X.; MIN MARTIN RENQIANG: "Conditional Image-to-Video Generation with Latent Flow Diffusion Models", 2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), IEEE, 17 June 2023 (2023-06-17), pages 18444 - 18455, XP034401708, DOI: 10.1109/CVPR52729.2023.01769 *
See also references of EP4722982A4 *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
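CN119693971A (zh) * 2024-12-17 2025-03-25 山东大学 Diffusion-model-based cross-view gait recognition method and system under missing views
CN119364135A (zh) * 2024-12-26 2025-01-24 北京生数科技有限公司 Method, apparatus, device, and medium for generating video from images
CN119741222A (zh) * 2024-12-26 2025-04-01 北京生数科技有限公司 Video generation method, apparatus, and electronic device
CN119741222B (zh) * 2024-12-26 2025-06-17 北京生数科技有限公司 Video generation method, apparatus, and electronic device
CN120070638A (zh) * 2025-02-24 2025-05-30 中国科学技术大学 Text-guided zero-shot transparent layer and layered image generation method
CN120216684A (zh) * 2025-03-11 2025-06-27 北京合力亿捷科技股份有限公司 Text processing method, apparatus, device, storage medium, and program product
CN120318827A (zh) * 2025-04-18 2025-07-15 华南理工大学 Continual anomaly detection method based on generative replay
CN120499471A (zh) * 2025-05-16 2025-08-15 合肥孪生宇宙科技有限公司 Digital human video generation system based on diffusion Transformer architecture
CN120434477A (zh) * 2025-07-08 2025-08-05 北京生数科技有限公司 Layout-controllable video generation method, apparatus, device, medium, and product
CN120434485A (zh) * 2025-07-09 2025-08-05 北京达佳互联信息技术有限公司 Content generation method, training method for content generation model, and corresponding apparatuses
CN121280441A (zh) * 2025-12-09 2026-01-06 山东科技大学 Method for precise generation of industrial part defect samples based on conditional diffusion model
CN121280441B (zh) * 2025-12-09 2026-02-17 山东科技大学 Method for precise generation of industrial part defect samples based on conditional diffusion model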

Also Published As

Publication number Publication date
EP4722982A1 (en) 2026-04-08
CN116863003A (zh) 2023-10-10
EP4722982A4 (en) 2026-04-22

Similar Documents

Publication Publication Date Title
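WO2024248736A1 (zh) Video generation method, and method and apparatus for training video generation model
CN107979764B (zh) Video caption generation method based on semantic segmentation and multi-layer attention framework
CN116611496B (zh) Text-to-image generation model optimization method, apparatus, device, and storage medium
JP2023541119A (ja) Character recognition model training method, character recognition method, apparatus, electronic device, storage medium, and computer program
US11409791B2 (en) Joint heterogeneous language-vision embeddings for video tagging and search
CN118632070B (zh) Video generation method, apparatus, electronic device, storage medium, and program product
CN112819933A (zh) Data processing method, apparatus, electronic device, and storage medium
CN117593400A (zh) Image generation method, model training method, and corresponding apparatuses
JP2021501416A (ja) Deep reinforcement learning framework for characterizing video content
CN116975347B (zh) Image generation model training method and related apparatus
CN118627582A (zh) Method, system, and medium for model training
CN116957932A (zh) Image generation method, apparatus, electronic device, and storage medium
WO2025256268A1 (zh) Multimodal data processing method, apparatus, electronic device, computer-readable storage medium, and computer program product
CN116977457A (zh) Data processing method, device, and computer-readable storage medium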
Sun et al. Beyond talking–generating holistic 3d human dyadic motion for communication
WO2026045738A1 (zh) Image processing method, apparatus, device, medium, and program product
WO2025167981A1 (zh) Video generation method, apparatus, device, and medium
Walsh et al. Using sign language production as data augmentation to enhance sign language translation
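CN118429755A (zh) Text-to-image model training method, image prediction method, apparatus, device, and medium
CN118823153A (zh) Image generation method, apparatus, device, and storage medium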
Jeon et al. Multimodal audiovisual speech recognition architecture using a three‐feature multi‐fusion method for noise‐robust systems
Siniukov et al. Ditailistener: Controllable high fidelity listener video generation with diffusion
Dhanyalakshmi et al. A Survey on Face‐Swapping Methods for Identity Manipulation in Deepfake Applications
Dhanyalakshmi et al. A survey on deep learning based reenactment methods for deepfake applications
CN119646181A (zh) Content generation method, apparatus, electronic device, and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 24816025

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

WWE Wipo information: entry into national phase

Ref document number: 2024816025

Country of ref document: EP

ENP Entry into the national phase

Ref document number: 2024816025

Country of ref document: EP

Effective date: 20260102

WWP Wipo information: published in national office

Ref document number: 2024816025

Country of ref document: EP