WO2024100564A1 - A system and a method for obtaining a processed output image having quality index selectable by an user - Google Patents
- Publication number
- WO2024100564A1 (PCT/IB2023/061253)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- backbone
- quality index
- gan
- image
- earlier
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T5/00—Image enhancement or restoration
- G06T5/60—Image enhancement or restoration using machine learning, e.g. neural networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0475—Generative networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/082—Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/094—Adversarial learning
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20084—Artificial neural networks [ANN]
Definitions
- GANs Generative adversarial networks
- GANs are a class of generative frameworks based on the competition between two neural networks, namely a generator and a discriminator. While the latter performs a classification task (decides whether a generated image is real or not), the former synthesizes an image from a target distribution.
- Conditional GANs are a variation of the original framework.
- Neural head avatars allow for reenacting a face with a given expression and pose. Such models can be divided into two groups: those with latent geometry and those with a 3D prior, e.g., a head mesh.
- Generative DNNs are a powerful tool for image synthesis, but they are limited by their computational load. On the other hand, given a trained model and a task, e.g. face generation within a range of characteristics, the output image quality index will be unevenly distributed among images with different characteristics. It follows that it is possible to restrain the model’s complexity on some instances while maintaining a high quality index.
- Image synthesis by GANs has received great attention in recent years; its applications span from image-to-image translation to text-to-image rendering, neural head avatar generation and many more. However, this approach suffers from a heavy computational burden when challenged with producing photorealistic images.
- Early exits are a computation-saving strategy employed mainly in classification tasks. They are characterized by the addition of outputs to the DNN, from which an approximation of the final result can be obtained at a lower computational cost. They were rediscovered through the years as a standalone approach, despite being natively implemented in architectures such as Inception as a countermeasure to overfitting. Occasionally, this approach has also been called cascade learning, adaptive neural network or simply branching. Proposed implementations differ on three design choices: exits’ architecture, i.e.
- a system for obtaining output images having a quality index selectable by a user includes: an electronic device including at least one processor and a memory operably connected to the processor and storing input images, a plurality of generative artificial neural networks (GANs) and a plurality of predictors, the at least one processor being configured to implement the GANs and the predictors to perform artificial neural network operations; wherein each GAN is selectable by the user from the plurality of GANs stored in the memory, for obtaining an image with a predefined quality index, each GAN is pre-trained, and each GAN includes: a plurality of calculating modules forming a backbone, and a plurality of Earlier Exit branches each of which is connected after each calculating module, except for a last calculating module of the backbone, each Earlier Exit branch containing as many calculating modules as
- the electronic device may include a display.
- Quality indexes of the output images generated by the plurality of Earlier Exit branches may increase as proximity to the backbone exit increases.
- the system may further include a database storing guide data.
- Each GAN may be further configured to fetch, from the database, guide data corresponding to the input image and to concatenate with input image data inputted to the GAN.
- the guide data may be concatenated with data from one of the plurality of calculating modules before the Earlier Exit branch, and the resulting concatenated data are fed into the Earlier Exit branch for further processing.
- the guide data may be image patches.
- the guide data may be image features.
- the guide data may be feature patches.
- the method may include storing in the memory the output image. The method may include displaying, on a display of an electronic device, the output image. The quality indexes of output images generated by the plurality of Earlier Exit branches may increase as proximity to the backbone exit increases. The quality index may be expressed in Fréchet inception distance (FID) units.
- FID Fréchet inception distance
- a method for obtaining an output image with a quality index selected by a user includes: selecting, from a memory by the user, an input image, a pre-trained generative artificial neural network (GAN), and a pre-trained predictor corresponding to the pre-trained GAN, the pre-trained GAN comprising a plurality of calculating modules forming a backbone, a plurality of Earlier Exit branches each of which is connected after each calculating module, except for a last calculating module of the backbone, each Earlier Exit branch containing as many calculating modules as remain in the backbone from a connection point of that Earlier Exit branch to a backbone exit, each calculating module of each Earlier Exit branch performing a same function as a corresponding remaining calculating module in the backbone, and a computational budget of each Earlier Exit branch being less than a computational budget of corresponding remaining calculating modules of the backbone to the backbone exit; selecting, by the user, the quality index for the output image;
- GAN generative artificial neural networks
- the method further includes processing the GAN input data by the pre-trained GAN with the one Earlier Exit branch, and during the processing: fetching, from a database storing guide data, fetched guide data corresponding to the input image, concatenating the fetched guide data with data output from one of the calculating modules preceding the one Earlier Exit branch to generate concatenated data, and feeding the concatenated data into the one Earlier Exit branch or into the backbone for further processing.
- the method further includes obtaining, on exit of the one Earlier Exit branch, the output image.
- the method may further include displaying, on a display of an electronic device, the output image.
- Quality indexes of output images generated by the plurality of Earlier Exit branches may increase as proximity to the backbone exit increases.
- the quality index may be expressed in Fréchet inception distance (FID) units.
- the guide data may be image patches.
- the guide data may be features.
- the guide data may be feature patches.
- the resulting processed image can be, for example, displayed on the display of the electronic device.
- the images can be stored into the database by the user in advance based on the original images selected from the memory.
- Proposed is a computer-readable medium storing instructions for performing any of the proposed methods by an electronic device.
- At least one of the plurality of calculating modules may be implemented through an AI model.
- the processor may include one or a plurality of processors.
- one or a plurality of processors may be a general purpose processor, such as a central processing unit (CPU), an application processor (AP), a graphics processing unit (GPU), a visual processing unit (VPU), and/or an AI-dedicated processor such as a neural processing unit (NPU), or the like.
- the one or a plurality of processors control the processing of the input data in accordance with a predefined operating rule or artificial intelligence (AI) model stored in the non-volatile memory and the volatile memory.
- the predefined operating rule or artificial intelligence model is provided through training or learning.
- the learning may be performed in the device itself in which AI according to an embodiment is performed, and/or may be implemented through a separate server/system.
- the AI model may consist of a plurality of neural network layers. Each layer has a plurality of weight values and performs a layer operation by applying the plurality of weights to the calculation result of the previous layer.
- neural networks include, but are not limited to, convolutional neural network (CNN), deep neural network (DNN), recurrent neural network (RNN), restricted Boltzmann Machine (RBM), deep belief network (DBN), bidirectional recurrent deep neural network (BRDNN), generative adversarial networks (GAN), and deep Q-networks.
- CNN convolutional neural network
- DNN deep neural network
- RNN recurrent neural network
- RBM restricted Boltzmann Machine
- DBN deep belief network
- BRDNN bidirectional recurrent deep neural network
- GAN generative adversarial networks
- the artificial intelligence model may be obtained by training.
- “obtained by training” means that a predefined operation rule or artificial intelligence model configured to perform a desired feature (or purpose) is obtained by training a basic artificial intelligence model with multiple pieces of training data by a training algorithm.
- the artificial intelligence model may include a plurality of neural network layers.
- Each of the plurality of neural network layers includes a plurality of network parameter values and performs neural network computation by computation between a result of computation by a previous layer and the plurality of the parameter values.
- Visual understanding is a technique for recognizing and processing things as does human vision and includes, e.g., object recognition, object tracking, image retrieval, human recognition, scene recognition, 3D reconstruction/localization, or image enhancement.
- Prediction of reasoning in Artificial Intelligence is a technique of logically reasoning and predicting by determining information and includes, e.g., knowledge-based reasoning, optimization prediction, preference-based planning, or recommendation.
- FIG. 1A illustrates an electronic device according to one or more embodiments
- FIG. 1B illustrates an example of a system for obtaining a processed image having a quality index selectable by a user, according to one or more embodiments
- FIG. 2 illustrates the relation between quality index (expressed in FID units) and computations for all branches at different scale factors of the OASIS implementation, with the use of the guiding database, according to one or more embodiments
- FIG. 3 illustrates examples of branches’ outputs for the OASIS pipeline, according to one or more embodiments
- FIG. 4 illustrates distribution of computations among branches of the OASIS backbone for a range of imposed Learned Perceptual Image Patch Similarity (LPIPS) thresholds, according to one or more embodiments
- FIG. 5 illustrates examples of branches’ outputs for the MegaPortraits pipeline, according to one or more embodiments
- FIG. 6 illustrates relation between quality index (expressed in LPIPS units) and computations for all branches, according to one or more embodiments;
- LPIPS Learned Perceptual Image Patch Similarity
- FIG. 7 illustrates distribution of computations among branches of the MegaPortraits backbone for a range of imposed LPIPS thresholds, according to one or more embodiments;
- FIG. 8 illustrates distribution of images routed to different branches in relation to their head rotation angle (in OASIS pipeline), according to one or more embodiments;
- FIG. 9 illustrates comparison between the quality index distribution of single OASIS branches, and the quality index distribution obtained by use of the predictor (P), according to one or more embodiments;
- FIG. 10 illustrates comparison between quality index and computations of the OASIS pipeline, according to one or more embodiments;
- FIG. 11 illustrates comparison between the efficacy of different scale factors (in OASIS pipeline), according to one or more embodiments;
- FIG. 12 illustrates comparison between the efficacy of different scale factors (in OASIS pipeline), according to one or more embodiments;
- FIG. 13 illustrates comparison between the efficacy of different scale factors (in MegaPortraits pipeline), according to one or more embodiments;
- FIG. 14 illustrates comparison between the database effect to quality index distribution for different scale factors (in OASIS pipeline), according to one or more embodiments; and
- FIG. 15 illustrates distribution of images routed to different branches in relation to their head rotation angle (in MegaPortraits pipeline), according to one or more embodiments.
- DNNs deep neural networks
- One or more embodiments of the disclosure may accelerate the operation of generative neural networks, enable neural networks to work faster or to lower energy consumption when the desired output frequency is reached.
- One or more embodiments may enable lowering power consumption, electrical consumption and heating of devices used for running the neural network, such as discrete graphic cards for PC, servers or notebooks, or smartphone’s System on Chip (SoC).
- SoC System on Chip
- the disclosure is applicable to any neural networks designed to generate images based on a latent vector whose architecture consists of a plurality of blocks. Blocks l1, l2, l3, l4 in FIG. 1B are modules of the neural network architecture.
- Each block performs the function of a small neural network, transforming inputs according to multiplication by weights, adding biases, and applying a non-linear activation function and other operations.
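The block operation described above can be sketched as follows. This is a minimal illustration, assuming a fully connected block with a ReLU activation; the source names no specific layer type or activation, and the helper `dense_block` and its example weights are hypothetical.

```python
from typing import List


def dense_block(x: List[float], weights: List[List[float]], biases: List[float]) -> List[float]:
    """Apply y = relu(W @ x + b) for one block (hypothetical helper)."""
    out = []
    for row, b in zip(weights, biases):
        # Weighted sum of the inputs plus the bias term.
        z = sum(w * xi for w, xi in zip(row, x)) + b
        # Non-linear activation; ReLU is an assumption, not taken from the source.
        out.append(max(0.0, z))
    return out


# Two inputs, two output units.
y = dense_block([1.0, 2.0], [[0.5, -1.0], [1.0, 1.0]], [0.0, 0.5])
```

Real generator blocks would typically use convolutions and normalization as well; the sketch only shows the multiply, add-bias, activate pattern.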
- the disclosure can be applied to any multilayer neural network, including any GAN.
- One or more embodiments of the disclosure make it possible to speed up the operation of neural networks, or to reduce power consumption when the required real-time output frequency is reached. The disclosure also makes it possible to reduce the load, power consumption and heating of the devices used to execute the neural network. The disclosure can be used in devices such as a discrete video card for a PC, a server, or a laptop, an SoC for a smartphone, and any electronic device that has a central processing unit.
- a computer-readable medium stores computer code for executing the method by a computer or suitable electronic device.
- the electronic device may comprise a display, a memory storing images and a set of artificial neural networks (ANN), also the electronic device is configured for performing operations of an artificial neural network.
- ANN artificial neural networks
- One or more embodiments of the disclosure allow diminishing computations by adding so-called early exit branches to the original architecture (the backbone) and dynamically switching the computational path depending on how difficult it will be to render the output.
- the backbone is any generative model that uses a decoder, namely, a neural network that takes a latent vector and emits a result, usually after signal processing by several layers.
- Difficulty is determined by quality index of the original image (sets of input data). The worse the quality index of the original image, the higher the difficulty. Quality index is measured by common image metrics (FID, LPIPS).
- FID Fréchet inception distance
- LPIPS Learned Perceptual Image Patch Similarity
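For reference, FID is commonly defined as the Fréchet distance between two Gaussians fitted to Inception-network features of real and generated images. The symbols below are introduced here for illustration and do not appear in the source: $\mu_r, \Sigma_r$ are the mean and covariance of the real-image features, and $\mu_g, \Sigma_g$ those of the generated-image features.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

Lower FID values indicate that the generated distribution is closer to the real one; LPIPS likewise assigns lower values to perceptually closer image pairs.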
- Several paths with different numbers of parameters are created, and a neural network, called the Predictor, is trained to predict what quality index of the images will be obtained from each path. When the user requests a lower quality index for the generated images, the Predictor proposes the computationally cheapest path that satisfies this requirement.
- the neural network operates with data tensors, therefore the images to be input in the neural network and processed within the neural network should be converted into data tensors.
- the term “input image” implies use of the term “data tensors” obtained by conversion of the input images for further processing by the neural network.
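The conversion of an input image into a data tensor might look like the sketch below. The channel-first layout and the scaling to [0, 1] are assumptions made for illustration; the source does not specify a tensor layout or value range, and `image_to_tensor` is a hypothetical helper.

```python
from typing import List

Pixel = List[int]            # [R, G, B], each in 0..255
Image = List[List[Pixel]]    # height x width x channels


def image_to_tensor(image: Image) -> List[List[List[float]]]:
    """Convert an HxWxC 8-bit image to a CxHxW float tensor in [0, 1]."""
    height, width = len(image), len(image[0])
    channels = len(image[0][0])
    return [
        [[image[r][c][ch] / 255.0 for c in range(width)] for r in range(height)]
        for ch in range(channels)
    ]


# A 1x2 RGB image: one red pixel followed by one green pixel.
tensor = image_to_tensor([[[255, 0, 0], [0, 255, 0]]])
```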
- One or more embodiments of the disclosure show that it is possible to output images with a custom, lower predefined quality index, and the approach can be applied wherever there is a model that generates images using a decoder. Two example applications are considered: image generation from a semantic map (using a model known from the related art) and cross reenactment of face expressions.
- Generation of an image from a semantic map is the task of generating an image with a list of all pixels belonging to a class as input. For example, for a photo of a street, the semantic map contains a list of all the pixels that should contain the road, trees, buildings, and so on.
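As a small illustration of the semantic map described above, the sketch below groups pixel coordinates by class. The per-pixel label grid used as input and the function name `semantic_map_from_labels` are assumptions made for this example only.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def semantic_map_from_labels(labels: List[List[str]]) -> Dict[str, List[Tuple[int, int]]]:
    """Group pixel coordinates (row, col) by their class label."""
    class_pixels: Dict[str, List[Tuple[int, int]]] = defaultdict(list)
    for r, row in enumerate(labels):
        for c, label in enumerate(row):
            class_pixels[label].append((r, c))
    return dict(class_pixels)


# A toy 2x2 "street" scene.
street = [
    ["tree", "building"],
    ["road", "road"],
]
smap = semantic_map_from_labels(street)
```

Here `smap["road"]` lists every pixel that the generator should fill with road.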
- The cross reenactment of face expressions is a task in which two portraits are input: one specifies the personality, and the second specifies the expression and head position to be transferred to the first personality.
- LPIPS Learned Perceptual Image Patch Similarity
- The method according to one or more embodiments employs an “Early Exit” strategy for image synthesis, dynamically routing the computational flow towards the needed Early Exit branches according to the images’ complexity, thereby reducing computational redundancy while maintaining the desired quality index predefined by the user.
- Exit branches of the Early Exit are attached to the original main independent artificial neural network (referred to as a backbone, consisting of an input, calculating modules and an exit), as portrayed in FIG. 1B.
- Calculating modules of the Early Exits are marked with corresponding numbers in ascending order of quality index (marked as 1, 2, 3 in circles; 4 in a circle is the exit of the backbone).
- These calculating modules are built of lightweight version, i.e.
- FIG. 1A illustrates an electronic device according to one or more embodiments
- FIG. 1B illustrates an example of a system for obtaining a processed image having quality index selectable by an user (predefined quality index), according to one or more embodiments.
- the electronic device includes a memory 10, a display 30, and a processor 20 operatively connected to the memory 10 and the display 30.
- the processor 20 may be a plurality of processors.
- the system contains artificial neural networks (ANN), in particular including GANs.
- the memory of the electronic device contains a plurality of GANs for different tasks.
- Each GAN stored in the memory contains N calculating modules, forming a backbone, and a number of Earlier Exit branches, each of which is connected after each calculating module of the backbone, except for the last calculating module of the backbone.
- Each Earlier Exit branch contains as many calculating modules as remain in the backbone after the connection point of the Earlier Exit branch up to the backbone exit.
- Each calculating module of each Earlier Exit branch performs the same function as the corresponding remaining calculating module in the backbone.
- The computational budget of each Earlier Exit branch is less than the computational budget of the corresponding remaining calculating modules of the backbone up to the backbone exit.
- the backbone generator is composed of calculating modules l1 through l4.
- An Earlier Exit branch is connected after each calculating module of the backbone except the last (exits 1, 2, 3 in circles in FIG. 1B).
- Three Early Exit branches are illustrated in FIG. 1B, thus adding early exits 1, 2, 3.
- Each branch has a different depth (the depth is the number of calculating modules) and is composed of lightweight calculating modules, that is, calculating modules that contain fewer parameters than the calculating modules of the backbone, although their structures are similar.
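The branch layout described above can be sketched as a small bookkeeping routine: one branch after every backbone module except the last, with a depth equal to the number of backbone modules that remain. The names `ExitBranch`, `build_exit_branches` and `branch_is_cheaper`, and the cost figures in the usage, are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ExitBranch:
    attach_after: int   # 1-based index of the backbone module the branch follows
    depth: int          # number of lightweight modules in the branch


def build_exit_branches(num_backbone_modules: int) -> List[ExitBranch]:
    """Create one branch after every backbone module except the last."""
    return [
        ExitBranch(attach_after=i, depth=num_backbone_modules - i)
        for i in range(1, num_backbone_modules)
    ]


def branch_is_cheaper(branch_costs: List[float], remaining_backbone_costs: List[float]) -> bool:
    """Check the budget constraint: a branch must cost less than the backbone tail it replaces."""
    return sum(branch_costs) < sum(remaining_backbone_costs)


# Backbone l1..l4 as in FIG. 1B gives three branches with depths 3, 2 and 1.
branches = build_exit_branches(4)
```

The cost unit passed to `branch_is_cheaper` is left open here; parameter counts or multiply-accumulate operations would both fit.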
- The input data for the backbone are two images selected by the user from the memory of the electronic device: one image whose personality needs to be preserved and a second image whose facial expressions need to be preserved. The output should be the first personality with the second image’s facial expressions.
- The memory contains a pre-prepared backbone with connected pre-trained Early Exit branches, at the output of which the quality index of the resulting image is inferior to the image quality index obtained during backbone operation, and the computational budget of each branch is much less than the computational budget of the original backbone.
- The user retrieves from the memory of the electronic device the GAN, that is, the backbone with connected Early Exit branches, wherein the backbone is suitable for processing images selected by the user from the memory. The selected backbone can produce a resultant image having the highest quality index, in which case the calculating modules of the backbone, which have a high computational budget, are used.
- Memory of the electronic device also contains a set of predictors, each being an artificial neural network.
- Each predictor is generated and pre-trained for a particular GAN stored in the memory.
- The predictor is configured to predict the processed image quality index for each output of each Earlier Exit branch of the particular GAN based on the original image, which the user intends to apply to the input of the particular GAN.
- The predictors are grouped in the memory with the corresponding backbone and its connected Early Exit branches. For example, the user selects two images from the memory of the electronic device.
- One is a source: an image of a person whose appearance the user wants to preserve (the human on the left in images (a) or (b) in FIG. 1B). The other is a driver: any other photograph of a face whose facial expressions the user wants to convey (the human on the right in images (a) or (b) in FIG. 1B). That is, in the example shown in FIG. 1B, the human on the left is taken as the personality that is displayed with the facial expressions of the human on the right.
- FIG. 1B shows, for example, the computational path for two distinct inputs:
- first input (images (a)): the source is the human on the left (image (a)), the driver is the human on the right (image (a));
- second input (images (b)): the source is the human on the left (image (b)), the driver is the human on the right (image (b)).
- Images (a) and (b) are each treated separately; in FIG. 1B they are shown together for illustration purposes only.
- The user selects image (a) from the memory.
- The already prepared and pre-trained GAN (the backbone with the connected Early Exit branches), as well as the corresponding pre-trained predictor, are already stored in memory for solving such an image processing task. Therefore, the user selects the suitable GAN along with a predictor from memory after selecting images.
- The user selects a desired quality index for the processed output image. The image of the human on the right from images (a) is fed to the predictor.
- The predictor predicts the image quality index for the image generated by the backbone and by each Earlier Exit branch of the selected GAN.
- The Earlier Exit branch that generates a processed output image having the quality index most closely matching (or matching) the quality index selected by the user will be used.
- the backbone will be used without using the Early Exit branches.
- the whole backbone will be used without using any Earlier Exit branches (exit number 4).
- The backbone with Early Exit branch number 2 will be used, since, according to the predictor, the quality index of the output image of the Earlier Exit branch with exit number 2 most closely matches, in this case, the quality index selected by the user.
- Examples of images coming out after l1 or l3 are not shown.
- the task of the neural network is to replace the image of the head of the second person (the driver) in the second image or video with the head of the first person (the source).
- When a video is submitted, the network replaces the head of the driver with the head of the source not only for one image, but throughout the entire video.
- A resulting processed image, outputted from the exit of the backbone or of the one Earlier Exit branch that provides the predefined quality index, is displayed, for example, on the display of the electronic device. The more complex the image, the more calculations are needed to obtain the required predefined quality index.
- The predictor is a DNN trained in advance on the outputs of the proposed branches, and capable of indicating the exit needed for outputting an image of a predefined quality index.
- the predictor is trained by supervised learning, imposing minimum squared error loss between its predictions and the actual quality index.
- The predictor is trained on examples and can predict what quality index each Early Exit connected with the backbone will give for given input images.
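The supervised training of the predictor can be illustrated with a deliberately tiny stand-in. The sketch below is not the predictor of the disclosure, which is a DNN operating on images: it assumes a single scalar input feature per sample, a linear model with one output per exit, and a mean squared error loss (one reading of the squared-error loss mentioned above). All names and the toy data are hypothetical.

```python
from typing import List, Tuple


def train_linear_predictor(
    features: List[float],
    targets: List[List[float]],   # targets[n][k]: measured quality index of exit k for sample n
    lr: float = 0.1,
    epochs: int = 2000,
) -> Tuple[List[float], List[float]]:
    """Fit quality_k ~ w[k] * feature + b[k] by gradient descent on the mean squared error."""
    num_exits = len(targets[0])
    w = [0.0] * num_exits
    b = [0.0] * num_exits
    n = len(features)
    for _ in range(epochs):
        for k in range(num_exits):
            # Gradients of (1/n) * sum((prediction - target)^2) with respect to w[k] and b[k].
            errors = [w[k] * x + b[k] - t[k] for x, t in zip(features, targets)]
            grad_w = sum(2 * e * x for e, x in zip(errors, features)) / n
            grad_b = sum(2 * e for e in errors) / n
            w[k] -= lr * grad_w
            b[k] -= lr * grad_b
    return w, b


def predict(w: List[float], b: List[float], feature: float) -> List[float]:
    """Predicted quality index of every exit for one input feature."""
    return [wk * feature + bk for wk, bk in zip(w, b)]


# Toy data: two exits, three training samples with made-up quality values.
features = [0.0, 0.5, 1.0]
targets = [[0.30, 0.10], [0.40, 0.15], [0.50, 0.20]]
w, b = train_linear_predictor(features, targets)
```

The point of the sketch is the training signal: the loss compares the predicted quality index of each exit with the quality index actually measured on that exit's output.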
- output 4 is assigned to a more complex image (a) in order to maintain the required quality index.
- the bottom input image (b) (solid line), instead, needs only exit 2 to maintain the required quality index predefined by the user.
- In operation, the image of the driver is passed through the predictor, and the sequence numbers of the Early Exits are received which, according to the predictor’s prediction, can provide the required predefined quality index of the output image, set in advance.
- the blocks for example l1,l2,l3,l4 in FIG. 1B
- the Early Exit branches ensure that the final image exits the network earlier than a normal exit of the backbone without the Early Exit branches.
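The routing through a chosen exit can be sketched as follows: the shared backbone modules run up to the attachment point, and the lightweight branch finishes the computation. Scalars stand in for feature tensors, and the function name `run_with_exit` and the toy modules are assumptions made for illustration.

```python
from typing import Callable, Sequence

Module = Callable[[float], float]


def run_with_exit(
    x: float,
    backbone: Sequence[Module],
    branches: Sequence[Sequence[Module]],
    exit_index: int,
) -> float:
    """Run the backbone up to the attachment point, then finish in the chosen branch.

    exit_index is 1-based; exit_index == len(backbone) selects the full backbone.
    branches[i] is the branch attached after backbone module i + 1.
    """
    n = len(backbone)
    if not 1 <= exit_index <= n:
        raise ValueError("exit_index out of range")
    shared = backbone if exit_index == n else backbone[:exit_index]
    tail = [] if exit_index == n else branches[exit_index - 1]
    for module in list(shared) + list(tail):
        x = module(x)
    return x


# Toy modules: backbone modules add 1.0, lightweight branch modules add 0.5.
backbone = [lambda v: v + 1.0] * 4
branches = [[lambda v: v + 0.5] * (4 - i) for i in range(1, 4)]
full = run_with_exit(0.0, backbone, branches, 4)
early = run_with_exit(0.0, backbone, branches, 2)
```

With exit 2, only two backbone modules and two lightweight modules run, instead of four backbone modules.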
- the image quality index is calculated using common measures (LPIPS, FID), and the predictor is an auxiliary network that has been trained on examples and the predefined quality index of the generated image.
- The predictor calculates what generation quality index each output will have. Thus, it becomes possible to choose the fastest exit (in terms of computations) among those that satisfy the given quality index condition.
- The predictor is a fully independent neural network. Its output is a list of the quality indexes of the images generated by all Early Exits for a given input. The predictor is used by the method to select the output with the quality index most closely matching the quality index selected by the user. The predictor determines the output quality index in the form of an LPIPS metric (known in the art).
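The selection step can be sketched as a short routine. It assumes that the user's quality index is given as a maximum acceptable LPIPS value (lower LPIPS meaning a closer match), that the exits are ordered from cheapest to the full backbone, and that the backbone is used when no exit qualifies. The function name `select_exit` and the example predictions are hypothetical.

```python
from typing import Sequence


def select_exit(predicted_lpips: Sequence[float], max_lpips: float) -> int:
    """Return the 1-based index of the cheapest exit whose predicted LPIPS meets the threshold.

    predicted_lpips[i] is the predictor's estimate for exit i + 1; the last entry
    corresponds to the backbone exit, which is also the fallback.
    """
    for index, value in enumerate(predicted_lpips, start=1):
        if value <= max_lpips:
            return index
    return len(predicted_lpips)


# Hypothetical predictor output for exits 1..4 of one input image.
predictions = [0.40, 0.22, 0.18, 0.15]
chosen = select_exit(predictions, max_lpips=0.25)
```

For the same predictions, a stricter threshold of 0.10 falls back to the backbone exit, and a loose threshold of 0.50 selects exit 1.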
- the number of the calculating modules i.e.
- The depth of the Early-Exit varies according to the number of backbone modules left after the point to which the Early-Exit is attached. In this way, intermediate backbone logits are fairly processed, and the calculations are faster, since calculating modules of the Early-Exit have fewer parameters than backbone calculating modules.
- The number of calculating modules of the Early-Exit is equal to the number of calculating modules remaining in the backbone, that is, the unclaimed calculating modules of the backbone. For example, if there are 4 calculating modules left from the attachment point of the Early-Exit to the end of the main path, the Early-Exit will have 4 calculating modules.
- The system can further contain a database of guiding data (examples), from which guiding examples, for example an image of a person that matches the person in the input source image (the human on the left in images (a) or (b) in FIG. 1B), are extracted and fed to each branch.
- the person’s pose in the guiding image is closest to the person’s pose in the driver’s input image, but has the appearance of the source (a human on the left (images (a) or (b))in FIG. 1B).
- one example is extracted, that is, one example per pass.
- Example databases are formed in advance for each task, for example by the user. [0099]
- a guiding image of the human on the left (images (a) or (b)) is retrieved from the database in order to improve the generation quality index.
- the guide data are concatenated with data from a calculating module before the Earlier Exit branch, and concatenated data are fed into the Earlier Exit branch for further processing.
- For example, if a user wants to obtain an image of himself in the pose of another person depicted in another image, the user first compiles a database of guiding examples from his own photos in different poses (photos at different angles), and then feeds any image of himself (source image) and an image of the other person (driver image) to the input of the neural network with Early Exits.
- the guiding example is selected from the database of guiding examples formed by the user as the image in which the user is depicted in the pose that most closely matches the pose in the driver image. If the user wants to obtain a video with his own appearance (source images) but with the poses of another person (driver images), the neural network with early exits processes each image of the video sequence separately, extracting from the database of guiding examples, for each frame, the user’s image (source image) with the pose closest to the person’s pose in the corresponding frame (driver image) of the video sequence currently submitted to the input of the neural network with early exits.
- The presence of the database of guiding examples yields a quality index gain for the Earlier Exit branches at the expense of a small amount of memory and computations, thus harmonizing the exits’ output quality index.
- This is extremely handy for settings where real-time rendering is needed and guiding examples can be readily provided, such as neural avatar generation.
- This is the task of generating an avatar - a virtual image of a person. For example, a user wants in real time, during a digital conference, to impose a different personality on his face, while maintaining all his reproducible facial expressions. In this case, the user can take an image of his face in advance from different angles and upload it to the database. This will greatly help the generation, but is not an absolutely necessary operation.
- the method is applicable to both untrained and already trained models, but requires additional training for the newly introduced components.
- the backbone is always fixed, and the predictor only indicates the exit at which the quality index required by the user will be obtained. During operation of the GAN (backbone and Earlier Exit branches), only one Early Exit branch is used: the one whose output image has the required quality, i.e. the quality index most closely matching the one selected by the user. To this end, any suitable switch mechanism known from the related art is implemented in the code, so that all other Early Exit branches, which are not required for executing the selected early exit, are not used.
- the method can be applied to any generation tasks, with the presence of a generator, i.e. a neural network that creates images from a latent vector.
- the main result may be summarized as follows: the method is easily applicable to already existing and trained generative models containing a generator (backbone), i.e. a neural network that creates images from a latent vector.
- the method is capable of outputting images with a custom lower quality index threshold by routing easier images to shorter computational paths. The main gain in terms of saved computations per unit of quality index loss is $1.2 \times 10^{3}$ and $1.3 \times 10^{3}$ GFLOPs/LPIPS for the two applications, respectively.
- GANs are composed of two competing DNNs: a generator G and a discriminator D.
- the generator G is designed to synthesize arbitrary images when given a low-dimensional random vector of features, $G: z \to g$, where z is the input and g is the generated image.
- the discriminator D learns to distinguish between the distribution of the generated images and that of the original examples.
- the minimax game is known from the related art and denotes a decision rule for minimizing the possible losses that the decision maker cannot prevent in the worst-case scenario. [0108] In the minimax objective, the loss function is the quantity that the weights of the generator G have to minimize and the weights of the discriminator D have to maximize; the expectation is the expected value of the enclosed expression when the random variable x is drawn with probability distribution p; D(x) and G(z) are the outputs of the discriminator D and the generator G. [0109] By providing conditions c (e.g. in the form of labels) to both the generator and the discriminator, the generator can learn to synthesize images from a subspace of $p_g$.
- G(x) is the generator’s output; $p_z$ and $p_g$ are the probability distributions of the input noise and the output images; and c is the conditioning parameter.
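For reference, the textbook form of the GAN minimax objective and of its conditional variant can be written as below. This is the standard formulation from the literature, given here only because it agrees with the symbols defined above (D, G, z, $p_z$, c); the name $p_{\mathrm{data}}$ for the distribution of the original examples is an assumption and may differ from the notation of the original filing.

```latex
% Standard (unconditional) GAN objective
\min_{G} \max_{D} V(D, G)
  = \mathbb{E}_{x \sim p_{\mathrm{data}}}\left[\log D(x)\right]
  + \mathbb{E}_{z \sim p_{z}}\left[\log\left(1 - D(G(z))\right)\right]

% Conditional variant, with the condition c given to both networks
\min_{G} \max_{D} V(D, G)
  = \mathbb{E}_{x \sim p_{\mathrm{data}}}\left[\log D(x \mid c)\right]
  + \mathbb{E}_{z \sim p_{z}}\left[\log\left(1 - D(G(z \mid c))\right)\right]
```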
- Any GAN generator is composed of a series of convolutional modules labeled $l_i$. The output of each module constitutes a candidate for an early exit, but it is not a rendered image. For this reason, it is processed by a series of additional convolutions before an image can be retrieved from it. These new convolutional calculating modules constitute what is called a branch, i.e. an Early Exit.
- For a backbone built out of N calculating modules, a branch of length $N - k$ is appended after calculating module k.
- the branches’ calculating modules are less complex than the backbone’s: their width, i.e. the number of channels, is decreased. In this way, the image retrieved at the output of each branch is rendered with fewer computations than the image retrieved at the backbone’s output.
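The branch-length rule can be illustrated with a minimal sketch. The function name `branch_lengths` is hypothetical; the numbers in the example (9 backbone blocks, branches after blocks 2, 4 and 6) are the MegaPortraits configuration reported elsewhere in this description.

```python
def branch_lengths(num_backbone_modules, attach_points):
    """Map each attachment point k to the branch length N - k.

    num_backbone_modules: N, the number of calculating modules in the backbone.
    attach_points: backbone module indices (1-based) after which a branch starts.
    """
    lengths = {}
    for k in attach_points:
        if not 1 <= k < num_backbone_modules:
            raise ValueError(f"attachment point {k} is outside 1..{num_backbone_modules - 1}")
        # The branch replaces the N - k backbone modules that were not executed.
        lengths[k] = num_backbone_modules - k
    return lengths


print(branch_lengths(9, [2, 4, 6]))  # {2: 7, 4: 5, 6: 3}
```

The result matches the branch lengths 7, 5 and 3 stated for the MegaPortraits pipeline.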
- Each Early Exit branch is trained in advance with an adversarial loss against copies of the backbone’s original discriminator. [0112] During the inference phase, given a set of trained branches, each image can be synthesized through a different exit. Given a predefined quality index, the branch that achieves it while performing the fewest possible calculations is selected.
- the branch that will output images with equal or higher quality index is selected (i.e. quality index most matching the quality index selected by the user), while performing the least possible amount of calculations.
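The selection rule above can be sketched as follows, assuming the predictor has already produced one LPIPS score per exit (lower LPIPS meaning higher quality). The function name `select_exit` and the GFLOPs figures are illustrative assumptions; the LPIPS values are the ones quoted for the OASIS branches in FIG. 3, with the backbone taken as the reference (score 0.0).

```python
def select_exit(predicted_lpips, gflops, lpips_threshold):
    """Return the index of the cheapest exit whose predicted LPIPS meets the threshold.

    predicted_lpips: predicted LPIPS per exit; the last entry is the backbone.
    gflops: computational cost per exit, in the same order.
    lpips_threshold: maximum LPIPS accepted by the user (lower is better).
    """
    if len(predicted_lpips) != len(gflops):
        raise ValueError("one cost per exit is required")
    admissible = [i for i, score in enumerate(predicted_lpips) if score <= lpips_threshold]
    if not admissible:
        # Assumed fallback: no exit is good enough, so use the full backbone.
        return len(predicted_lpips) - 1
    return min(admissible, key=lambda i: gflops[i])


scores = [0.13, 0.11, 0.10, 0.07, 0.0]   # branches 1-4, then backbone
costs = [50.0, 80.0, 120.0, 160.0, 200.0]  # hypothetical GFLOPs per route
print(select_exit(scores, costs, 0.105))  # 2, i.e. branch 3
print(select_exit(scores, costs, 0.05))   # 4, i.e. the backbone
```

Only the selected route needs to be executed afterwards, which is where the computational saving comes from.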
- the predictor P employed for this selection is constituted by convolutional and fully connected layers.
- the predictor is trained by supervised learning, using input conditions c as training examples, and vectors of LPIPS scores S for images generated by branches as labels. It should be noted that training to create a semantic map and training to create a cross-reproduction of facial expressions are no different.
- Data processed by the calculating modules of the backbone that are located before the calculating modules of the Early Exit branch are referred to as under-processed tensors.
- the under-processed tensors are concatenated with the data patches before being fed into the Early Exit branch.
- This ensures an increase in quality index more prominent in earlier exits, which are the fastest, but suffer the most from the quality index decrease due to their lower number of parameters.
- By adding a moderate amount of memory and computations, better results are achieved, harmonizing the output quality index of the different branches.
- a collection of tensor pairs, called key-value pairs, is stored as the guide data.
- the guide data are generated by the backbone when the database is formed.
- the guide data are already in the database and are retrieved from it during GAN operation when generating the image.
- Keys are obtained by applying to the original data all the trained layers of the backbone prior to the first Early Exit branch, and cutting the obtained tensors, called guide features, into non-overlapping patches.
- the keys are obtained as follows: data of the original images is fed into the backbone, the result before the first Early Exit branch is divided into patches and these patches are the keys. Values are obtained by applying the trained layers of the backbone to the original images and cutting the resulting features into data patches, by dividing into N patches.
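Cutting a feature map into non-overlapping patches can be sketched in plain Python as below. This is only an illustration on a single 2D channel stored as nested lists; the function name `cut_into_patches` and the tiny 4 x 4 example are assumptions, and a real implementation would operate on multi-channel tensors.

```python
def cut_into_patches(feature_map, patch_height, patch_width):
    """Split a 2D feature map (list of rows) into non-overlapping patches.

    Patches are returned in row-major order; the map size must be a
    multiple of the patch size so that no element is left over.
    """
    height, width = len(feature_map), len(feature_map[0])
    if height % patch_height or width % patch_width:
        raise ValueError("feature map size must be divisible by the patch size")
    patches = []
    for top in range(0, height, patch_height):
        for left in range(0, width, patch_width):
            patch = [row[left:left + patch_width]
                     for row in feature_map[top:top + patch_height]]
            patches.append(patch)
    return patches


features = [[r * 4 + c for c in range(4)] for r in range(4)]
patches = cut_into_patches(features, 2, 2)
print(len(patches))  # 4
print(patches[0])    # [[0, 1], [4, 5]]
```

Each patch would then be stored as a key (or value) of the guiding database.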
- a mean quality index gain of $1.3 \times 10^{3}$ GFLOPs/LPIPS is achieved, meaning that relaxing the quality index threshold by 0.01 LPIPS yields a decrease of 13 GFLOPs.
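The arithmetic behind this figure is a simple product of the gain and the threshold relaxation. The helper name `saved_gflops` is hypothetical, and the linear relation is an approximation that holds only as far as the reported mean gain does.

```python
def saved_gflops(gain_gflops_per_lpips, lpips_relaxation):
    """Estimate the computation saved when the LPIPS threshold is relaxed.

    gain_gflops_per_lpips: mean gain, in GFLOPs per unit of LPIPS.
    lpips_relaxation: how much the accepted LPIPS threshold is raised.
    """
    return gain_gflops_per_lpips * lpips_relaxation


# 1.3e3 GFLOPs/LPIPS times 0.01 LPIPS gives approximately 13 GFLOPs.
print(round(saved_gflops(1.3e3, 0.01), 6))
```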
- branches are appended, one after each of the backbone modules $l_1$ to $l_4$.
- the branches’ calculating modules were SPADE ResNet modules as well, and their length varied in order to preserve the relation $\mathrm{len} = N - k$, where len is the number of calculating modules of the Early Exit, k is the exit number and N is the number of backbone calculating modules. For example, if the exit number k is 2, then the number of calculating modules of the Early Exit branch is 4.
- SF scale factor
- GFLOPs floating point operations
- Table 1 compares the GFLOPs of all 5 computational routes: branches 1-4 and the OASIS backbone (BB, rightmost column). Different rows correspond to different scale factors (SF). As can be seen from the table, the SF does not affect all calculating modules equally, since a minimum number of channels equal to 64 is imposed, below which no further scaling is applied. It should be noted that the number 64 is set arbitrarily and changes in subsequent tests.
- "channel" is standard terminology: RGB images have 3 channels, and convolutional networks create further channels depending on their architecture. In general, the number of channels is a network parameter that needs to be reduced. [0128] To implement the method, all branches and the predictor are trained.
- Table 2 describes quantitative results for the OASIS pipeline.
- the minimum number of channels is 64. At that, the minimum number of channels sets the lower quality index threshold.
- SF scale factors
- Table 3 describes quantitative results for the OASIS pipeline at different scale factors. The minimum number of channels is 32.
- SF scale factor
- the pipeline is tested with and without the guiding database, shown in the Bank column as a tick or a cross.
- FID Frechet inception distance
- mIOU mean intersection over union
- the learning rate is a parameter generally accepted in the related art that is needed for training; a user who wants to reproduce the experiments with high accuracy should know this coefficient.
- the choice of the training set for the predictor was not trivial, since the pipeline inputs consist of a semantic map concatenated to a 3D noise tensor. Due to the high dimensionality of the noise space, sampling uniformly from it does not guarantee convergence of the learning process. Instead, 100 3D noise tensors were randomly extracted and combined with 500 semantic maps, thus obtaining 50000 examples. This technique was then tested using 100, 300 and 500 noise tensors. Once the predictor was trained, its error was measured using 500 semantic maps combined both with the same noises used for training and with new noises.
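The construction of the predictor training set, pairing every fixed noise tensor with every semantic map, can be sketched as a Cartesian product. The identifiers below are placeholders standing in for the actual tensors and maps, and the function name `build_predictor_inputs` is hypothetical.

```python
from itertools import product


def build_predictor_inputs(noise_ids, semantic_map_ids):
    """Pair every fixed noise tensor with every semantic map.

    Both arguments are iterables of identifiers; each returned pair
    (noise, semantic_map) stands for one input fed to the pipeline.
    """
    return list(product(noise_ids, semantic_map_ids))


pairs = build_predictor_inputs(range(100), range(500))
print(len(pairs))  # 50000
```

Fixing a small set of noises in this way keeps the training set finite while still covering every semantic map with every noise.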
- Table 4 describes the validation error for the OASIS predictor.
- the validation set was created joining the noises (random signal) used for the training to 500 semantic maps.
- the first column indicates the quantity of noises used to train the predictor, while columns B1 through B4 indicate the error obtained by individual branches 1 through 4 when validated.
- the last column indicates the average of all errors.
- the table shows how the error decreases as the number of noises used for training increases.
- [0143] Table 5 describes the test error for the OASIS predictor.
- the test set was created joining random noises to 500 semantic maps
- the first column indicates the quantity of noises used to train the predictor, while columns B1 through B4 indicate the error obtained by individual branches 1 through 4 when tested.
- the last column indicates the average of all errors.
- the table shows how the error decreases as the number of noises used for training increases.
- In order to implement the database for guiding image generation, it is populated with 500 images randomly extracted from the train dataset.
- For each of the randomly extracted images, 100 different inputs are created using a fixed set of 3D noises (noises structured as a three-dimensional tensor rather than a two-dimensional matrix). The inputs are fed into the first 2D convolutional layer and the subsequent ResNet module of the backbone.
- the values were extracted by processing the inputs up to the third ResNet module of the backbone (OASIS architecture calculating module) and cutting the obtained features into the same data patches.
- the database is populated once at the beginning of the training phase.
- FPS sampling to them (since many images can be extremely similar and it makes no sense to store everything, the FPS sampling algorithm is used, which selects only one image from each similarity cluster). During the forward phase, after an input was processed through the first 2D convolutional layer and the subsequent ResNet layer, it was divided into 128 identical data patches (128 is an arbitrary number preserving proportions).
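The text does not expand the abbreviation FPS; assuming it stands for farthest point sampling, a common technique for picking a diverse subset, a minimal sketch looks as follows. The function name, the Euclidean distance and the 2D example points are all illustrative assumptions rather than details taken from the source.

```python
import math


def farthest_point_sampling(points, num_samples, start_index=0):
    """Greedily select indices of points that are far apart from each other.

    At each step the point with the largest distance to the already
    selected set is added, so near-duplicates are rarely kept together.
    """
    if not 0 < num_samples <= len(points):
        raise ValueError("num_samples must be between 1 and len(points)")
    selected = [start_index]
    # Distance from every point to its nearest selected point so far.
    nearest = [math.dist(p, points[start_index]) for p in points]
    while len(selected) < num_samples:
        candidate = max(range(len(points)), key=lambda i: nearest[i])
        selected.append(candidate)
        for i, p in enumerate(points):
            nearest[i] = min(nearest[i], math.dist(p, points[candidate]))
    return selected


points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0)]
print(farthest_point_sampling(points, 3))  # [0, 4, 2]
```

In the example, one representative is kept from each of the three groups of nearby points, which mirrors the stated goal of storing only one image per similarity cluster.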
- FIG. 2 describes relation between quality index (expressed in FID units) and computations for all branches at different scale factors of the OASIS implementation, with the use of the guiding database.
- the three curves on the plot connect the FID values scored by exits 1 through 4 (branches 1 through 4) for different scale factors. Squares indicate scale factor 1/2, dots indicate scale factor 1/3, and triangles indicate scale factor 1/4. A higher scaling saves computations at the cost of a lower quality index.
- the asterisk indicates the original quality index and computations for the OASIS pipeline. As mentioned above, the larger the FID, the lower the quality index of the output image and the lower the computational cost. Finally, the pipeline comprising all generating branches and the backbone, together with the database guidance, was used to produce the dataset for training the predictor.
- the OASIS input consists of a semantic map and a high-dimensional random noise space (a set of multidimensional vectors consisting of random numbers).
- the training is restricted to 100 fixed noise vectors in combination with the Cityscape train set.
- FIG.3 illustrates examples of branches’ outputs for the OASIS pipeline.
- Top left image represents the semantic map used as input to the pipeline
- Top middle image is the output of the original OASIS model (Backbone)
- top right image is the output obtained by the first branch with a corresponding quality index of 0.13 LPIPS
- bottom left image is the second branch’s output with a corresponding quality index of 0.11 LPIPS
- bottom middle image is the third branch’s output with a corresponding quality index of 0.10 LPIPS
- bottom right image is the fourth branch’s output (Early Exit) with a corresponding quality index of 0.07 LPIPS.
- FIG.3 illustrates the resulting image obtained at each branch. It can be seen how the quality index deteriorates as the output order decreases, i.e. the first output has the worst quality index.
- FIG. 4 illustrates distribution of computations among branches of the OASIS backbone for a range of imposed LPIPS thresholds (the lower limit of quality index, measured in terms of LPIPS metric, is generally accepted in the related art).
- Branch 1 is “a”
- Branch 2 is “b”
- Branch 3 is “c”
- Branch 4 is “d”
- Backbone is “e”.
- the predictor routes the computation towards one of five possible exits based on the input’s complexity it learned. As quality index requirements decrease, the use of the first branches becomes more prominent. All distributions were obtained sampling the same 500 test images and using scale factor 1/4.
- the neural head avatar implementation is based on the MegaPortraits generating method for 512 × 512 pixel images.
- This pipeline consists of multiple operations ensuring the transfer of traits from a source face to a driver face, i.e. the one with the desired orientation and expression.
- the backbone calculating modules used are $l_i$, $i \in [1, 9]$, its final set of calculating modules comprising 9 residual blocks, which amount to a total of 213 GFLOPs. 3 branches are attached, one after each of the backbone’s blocks number 2, 4 and 6. Their calculating modules were the same residual blocks, and their respective depth, i.e.
- FIG.5 illustrates examples of branches’ outputs for the MegaPortraits pipeline.
- the top row uses as source and driver respectively the first and second image.
- the source’s appearance is imposed on the driver’s expression.
- the third image is the output given by the original MegaPortraits pipeline, the fourth is the image retrieved from the database. Following are outputs from branch 1 with LPIPS score 0.1, branch 2 with LPIPS 0.08, and branch 3 with LPIPS 0.05.
- the bottom row mirrors the top row, only changing source (image of the human on the left in FIG.1B (b)) and driver images (image of the human on the right in FIG.1B (b)).
- FIG.6 illustrates relation between quality index (expressed in LPIPS units) and computations for all branches at different scale factors of the MegaPortraits implementation, with the use of the guiding database.
- FIG. 6: the three curves on the plot connect FID values scored by exits 1 through 3 (branches 1 through 3) for different scale factors. Dots indicate scale factor 1/3, squares indicate 1/6, and squares indicate 1/15. It can be seen how a higher scaling saves computations at the cost of a lower quality index. The asterisk indicates the original quality index and computations for the OASIS pipeline. [0156] Finally, the predictor is trained. Afterwards, it was possible to impose any quality index threshold, and the predictor was able to choose the path that satisfied it with the least computation. The overall results for the whole pipeline are summarized in FIG. 7. Branch 1 is “a”; Branch 2 is “b”; Branch 3 is “c”; Backbone is “d”.
- FIG. 7 illustrates the distribution of computations among branches of the MegaPortraits backbone for a range of imposed LPIPS thresholds.
- The number of images on the y-axis indicates the number of images with the LPIPS quality index shown on the x-axis.
- The comparison between the quality index distributions of images obtained from single branches and those obtained with the predictor, set to enforce a threshold equal to the branches’ mean quality index, is shown in FIG. 9 (solid line: with the predictor; dotted line: without the predictor). The number of images on the y-axis indicates the number of images with the LPIPS quality index shown on the x-axis. It can clearly be seen how the predictor enforces the quality index threshold by routing difficult images towards the later branches, thus shifting the distribution.
- FIG. 9 illustrates comparison between quality index distributions of single OASIS branches, and quality index distributions obtained by use of the predictor (P). The predictor was set to enforce thresholds equal to the branches’ mean quality index.
- FIG.10 illustrates comparison between number of images routed to different branches in relation to their head rotation. The greater the angle between the two images the higher the difficulty gets, as reported in FIG.10.
- the x-axis shows the distance, in degrees, meaning the angle between the head from the database and the head of the driver.
- the whole pipeline is applicable only to architectures that include a decoder; it cannot be applied as is to transformers and other synthesis algorithms that do not include a decoder.
- The authors chose to populate the database randomly, but this may not be the best choice.
- Saving computations is useful for the exploitation of complex algorithms, which yield state-of-the-art outputs, but are mostly implemented on “heavy machinery”.
- Table 7 describes the dimensions of modules for all branches in the form of (input channels, output channels, image height, image width).
- the table is divided in three subtables according to the scale factor applied.
- the first column indicates the module’s type, i.e. the transformation applied to input data; columns 2 through 5 indicate branches 1 through 4 calculating modules’ dimensions.
- At the bottom of each subtable is the total count of parameters with and without the addition of the auxiliary database.
- the guiding features are taken after the first Conv2D and ResNet blocks of the backbone. Then, for each of the $N \in [1, 35]$ semantic classes present in the input, these features are cut into 128 data patches and their 1024-dimensional space is scanned in order to find the closest key in the database with the corresponding semantic class. This search is performed quite rapidly thanks to the FAISS library, and thus does not burden computations. [0177] Once all 128 data patches have been retrieved, a guiding feature is constructed by gluing them together. This feature is concatenated to the input of each branch, and for this reason the branches’ number of input channels must be increased.
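The per-class nearest-key lookup can be illustrated with a brute-force search in plain Python. This is a stand-in for the FAISS-based search mentioned above, not a reproduction of it; the function name `nearest_value`, the dictionary layout and the 2-dimensional keys (instead of 1024-dimensional patches) are assumptions made for brevity.

```python
import math


def nearest_value(query, database, semantic_class):
    """Return the value whose key is closest to the query within one class.

    database: dict mapping a semantic class to a list of (key, value) pairs,
    where each key is a sequence of floats with the same length as the query.
    Returns None when the class has no stored pairs.
    """
    pairs = database.get(semantic_class, [])
    if not pairs:
        return None
    # Exhaustive Euclidean search; a library index would replace this loop.
    best_key, best_value = min(pairs, key=lambda kv: math.dist(query, kv[0]))
    return best_value


database = {
    "road": [((0.0, 0.0), "patch_a"), ((1.0, 1.0), "patch_b")],
    "car": [((0.9, 0.9), "patch_c")],
}
print(nearest_value((0.8, 0.7), database, "road"))  # patch_b
print(nearest_value((0.8, 0.7), database, "sky"))   # None
```

Repeating the lookup for each of the 128 query patches and joining the retrieved values would give the guiding feature described above.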
- Table 8 describes architecture of the MegaPortraits and OASIS predictors. Dimensions are in the form (input channels, output channels). In both subtables, the left column indicates what kind of layers the neural network is composed of, while the right column reports its input and output dimensions. The bottom rows indicate the total number of parameters composing the networks and the number of operations needed to execute them once.
- the original MegaPortraits generative DNN for images of resolution 512 × 512 pixels consists of a set of calculating modules predicting a volumetric representation and another set, called G2D, that renders an output image from a processed volume. Its total number of parameters is 32M. Branches are appended after ResBlock2D modules 2, 4 and 6. Their respective lengths are 7, 5 and 3.
- lighter computational paths are created by scaling down all channels uniformly.
- the new channel numbers were obtained by multiplying the original ones by a scale factor.
- the effect of this scaling is restricted by imposing a minimum number of channels equal to 24, below which no further scaling was forced. This parameter was selected without strict justification; it is not a necessary part of the method and can be replaced by another value, as further analysis shows.
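The uniform channel scaling with a lower bound can be sketched as follows. The function name `scale_channels`, the rounding rule and the example channel widths are assumptions; the source only states that channels are multiplied by the scale factor and that a minimum (24 for MegaPortraits, 64 or 32 for OASIS) stops further scaling.

```python
def scale_channels(channels, scale_factor, min_channels):
    """Scale every channel count by scale_factor, never going below min_channels.

    Layers that are already narrower than min_channels are left unchanged,
    so scaling never widens a layer.
    """
    scaled = []
    for original in channels:
        target = max(min_channels, round(original * scale_factor))
        scaled.append(min(original, target))
    return scaled


widths = [512, 256, 128, 64]  # hypothetical backbone channel widths
print(scale_channels(widths, 1 / 4, 24))  # [128, 64, 32, 24]
print(scale_channels(widths, 1 / 4, 64))  # [128, 64, 64, 64]
```

The second call shows why the scale factor does not affect all modules equally: narrow layers hit the floor and stop shrinking.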
- Table 9 describes the MegaPortraits pipeline: the dimensions of the calculating modules for all branches, in the form (input channels, output channels). The table is divided into four subtables according to the scale factor applied. The first column indicates the calculating module’s type, i.e. the transformation applied to the input data; columns 2, 3 and 4 indicate the dimensions of the calculating modules of branches 1 through 3. At the bottom of each subtable is the total count of parameters, including the addition of the auxiliary database.
- the Res-Block2D are made of layers BatchNorm2D, h-swish, Conv2D, BatchNorm2D, h-swish, Conv2D, Conv2D with skipped connections.
- 2D bilinear upsampling When employing the database, all input channel numbers must be increased by 3.
- a database containing 960 key-value pairs is used. The values consisted of RGB images of the source subject, uniformly covering the space of head rotations and expressions.
- the keys were obtained using the MegaPortraits initial calculating modules, the so-called encoders, which yield the Euler angles at which a head is rotated, as well as a multitude of parameters encoding facial expressions. Each key encoded 3 angles and a 512-dimensional vector for the expressions. [0185] The total size of the stored parameters is therefore $10^{9}$.
- the database was searched for the closest key during the inference phase with the aid of the FAISS library. Each retrieved image was subsequently concatenated to the input of all ResBlock2D modules in every branch; thus, when employing the database, 3 channels must be added to all input channels in Table 9.
- the architecture of the MegaPortraits predictor is summarized in Table 10.
- Table 10 describes the architecture of the MegaPortraits and OASIS predictors. Dimensions are in the form (input channels, output channels). In both subtables, the left column indicates what kind of layers the neural network is composed of, while the right column reports its input and output dimensions. The bottom rows indicate the total number of parameters composing the networks and the number of operations needed to execute them once. [0187] Training details [0188] For the MegaPortraits pipeline, the branches are trained using a hinge adversarial loss, each branch competing against a copy of the multi-scale data patch discriminator. Additionally, feature matching, VGG19 perceptual, L1 and MS-SSIM losses are imposed.
- Table 11 describes quantitative results for the MegaPortraits pipeline, cross- reenactment.
- SF first column
- the pipeline is tested with and without the guiding database, shown in the Bank column as a tick or a cross.
- Three columns, one for each branch, contain the FID (Fréchet inception distance) and mIOU (mean intersection over union) scores; at the bottom are reported these two values for the Backbone.
- SF 1/2, 1/3, 1/4, 1/6.
- FIG. 14 illustrates, for the OASIS pipeline, a comparison of the effect of the database on the quality index distribution for different scale factors.
- Each branch’s quality index distribution is numbered differently 1, 1’, 2, 2’, 3, 3’, 4, 4’.
- Distributions without the database usage are shown by a dotted curve (1’, 2’, 3’, 4’), while distributions obtained with the database usage are shown with a solid curve (1, 2, 3, 4).
- the minimum number of channels is 64.
- SF 1/2, 1/3, 1/4.
- On the X axis, the quality index is shown in LPIPS units, while the Y axis shows the number of images output with said quality index.
- The quality index distributions of different branches are plotted in different colors.
- Quality index distributions obtained without the database are shown by a dotted curve, while those obtained with the database are plotted with a solid line. It can clearly be seen that the first branches are affected the most by the database, since the quality index distributions for these first branches are shifted the most towards better values.
- FIG. 15 illustrates MegaPortraits pipeline, distribution of images routed to different branches in relation to their head rotation angle.
- First row SF 1/8
- second row SF 1/15.
- On the X axis, the angle between the reference’s head and the output’s head is reported, while on the Y axis the number of images output with said angle is reported.
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- Biomedical Technology (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- Life Sciences & Earth Sciences (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Image Analysis (AREA)
- Image Processing (AREA)
Priority Applications (3)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| EP23888205.4A EP4519796A4 (en) | 2022-11-09 | 2023-11-08 | System and method for obtaining a processed output image with a user-selectable quality index |
| CN202380077935.7A CN120226021A (en) | 2022-11-09 | 2023-11-08 | System and method for obtaining a processed output image with a quality index selectable by a user |
| US18/435,776 US20240177273A1 (en) | 2022-11-09 | 2024-02-07 | System and a method for obtaining a processed output image having quality index selectable by an user |
Applications Claiming Priority (4)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| RU2022129040 | 2022-11-09 | | |
| RU2022129040 | 2022-11-09 | | |
| RU2023115413A RU2823750C1 (en) | 2023-06-13 | | System and method for obtaining processed output image having user-selectable quality factor |
| RU2023115413 | 2023-06-13 | | |
Related Child Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US18/435,776 Continuation US20240177273A1 (en) | 2022-11-09 | 2024-02-07 | System and a method for obtaining a processed output image having quality index selectable by an user |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2024100564A1 (en) | 2024-05-16 |
Family
ID=91032044
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/IB2023/061253 Ceased WO2024100564A1 (en) | 2022-11-09 | 2023-11-08 | A system and a method for obtaining a processed output image having quality index selectable by an user |
Country Status (4)
| Country | Link |
|---|---|
| US (1) | US20240177273A1 (en) |
| EP (1) | EP4519796A4 (en) |
| CN (1) | CN120226021A (en) |
| WO (1) | WO2024100564A1 (en) |
Citations (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20210089903A1 (en) * | 2019-09-24 | 2021-03-25 | Naver Corporation | Neural network for generating images trained with a generative adversarial network |
| CN112906721A (en) * | 2021-05-07 | 2021-06-04 | 腾讯科技(深圳)有限公司 | Image processing method, device, equipment and computer readable storage medium |
| WO2022098203A1 (en) | 2020-11-09 | 2022-05-12 | Samsung Electronics Co., Ltd. | Method and apparatus for image segmentation |
-
2023
- 2023-11-08 EP EP23888205.4A patent/EP4519796A4/en active Pending
- 2023-11-08 WO PCT/IB2023/061253 patent/WO2024100564A1/en not_active Ceased
- 2023-11-08 CN CN202380077935.7A patent/CN120226021A/en active Pending
-
2024
- 2024-02-07 US US18/435,776 patent/US20240177273A1/en active Pending
Patent Citations (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20210089903A1 (en) * | 2019-09-24 | 2021-03-25 | Naver Corporation | Neural network for generating images trained with a generative adversarial network |
| WO2022098203A1 (en) | 2020-11-09 | 2022-05-12 | Samsung Electronics Co., Ltd. | Method and apparatus for image segmentation |
| CN112906721A (en) * | 2021-05-07 | 2021-06-04 | 腾讯科技(深圳)有限公司 | Image processing method, device, equipment and computer readable storage medium |
Non-Patent Citations (3)
| Title |
|---|
| See also references of EP4519796A4 |
| STEFANOS LASKARIDIS; ALEXANDROS KOURIS; NICHOLAS D. LANE: "Adaptive Inference through Early-Exit Networks: Design, Challenges and Directions", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 9 June 2021 (2021-06-09), 201 Olin Library Cornell University Ithaca, NY 14853, XP081987122, DOI: 10.1145/3469116.3470012 * |
| TEERAPITTAYANON SURAT; MCDANEL BRADLEY; KUNG H.T.: "BranchyNet: Fast inference via early exiting from deep neural networks", 2016 23RD INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR), IEEE, 4 December 2016 (2016-12-04), pages 2464 - 2469, XP033085956, DOI: 10.1109/ICPR.2016.7900006 * |
Also Published As
| Publication number | Publication date |
|---|---|
| US20240177273A1 (en) | 2024-05-30 |
| EP4519796A4 (en) | 2025-08-13 |
| EP4519796A1 (en) | 2025-03-12 |
| CN120226021A (en) | 2025-06-27 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| Wang et al. | | Efficient video transformers with spatial-temporal token selection |
| US12299573B2 (en) | | Attention-based decoder-only sequence transduction neural networks |
| US12062227B2 (en) | | Systems and methods for progressive learning for machine-learned models to optimize training speed |
| EP3973459B1 (en) | | Generative adversarial networks with temporal and spatial discriminators for efficient video generation |
| US20260079926A1 (en) | | Prompt Tuning Using One or More Machine-Learned Models |
| US12517977B2 (en) | | Apparatus and method of performing matrix multiplication operation of neural network |
| US20230359865A1 (en) | | Modeling Dependencies with Global Self-Attention Neural Networks |
| KR20210029785A (en) | | Neural network acceleration and embedding compression system and method including activation sparse |
| CN114548423B (en) | | Machine learning attention model featuring omnidirectional processing |
| CN116912367B (en) | | Method and system for generating image based on lightweight dynamic refinement text |
| US20230124177A1 (en) | | System and method for training a sparse neural network whilst maintaining sparsity |
| CN110162993A (en) | | Desensitization process method, model training method, device and computer equipment |
| Chen et al. | | Coupled end-to-end transfer learning with generalized fisher information |
| CN115803753A (en) | | Multi-Stage Machine Learning Model Synthesis for Efficient Inference |
| CN114049527A (en) | | Self-knowledge distillation method and system based on online cooperation and fusion |
| CN118568227B (en) | | A human-computer collaborative topic classification search mode method, device and storage medium |
| WO2025171219A2 (en) | | Inverted bottleneck architecture search and efficient attention mechanism for machine-learned models |
| CN117011943A (en) | | Action recognition method based on decoupled 3D network of multi-scale self-attention mechanism |
| CN121072619A (en) | | Model quantification realization method, model and computer equipment |
| CN115082840B (en) | | Action video classification method and device based on data combination and channel correlation |
| CN114925774B (en) | | A method for generating image description sentences based on convolutional neural networks |
| CN120823459A (en) | | A method for generating grain images of titanium alloy microstructure based on potential diffusion model |
| US20240177273A1 (en) | | System and a method for obtaining a processed output image having quality index selectable by an user |
| CN120580557A (en) | | A diffusion model optimization method and system based on U-Net parameter enhancement |
| CN118734899A (en) | | Segmentation model optimization method and device based on memory efficient attention mechanism |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 23888205; Country of ref document: EP; Kind code of ref document: A1 |
| | WWE | Wipo information: entry into national phase | Ref document number: 2023888205; Country of ref document: EP |
| | ENP | Entry into the national phase | Ref document number: 2023888205; Country of ref document: EP; Effective date: 20241203 |
| | WWE | Wipo information: entry into national phase | Ref document number: CN2023800779357; Country of ref document: CN; Ref document number: 202380077935.7; Country of ref document: CN |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | WWP | Wipo information: published in national office | Ref document number: 202380077935.7; Country of ref document: CN |
