WO2024099004A1 - Image processing model training method, apparatus, electronic device, computer-readable storage medium, and computer program product - Google Patents
- Publication number
- WO2024099004A1 (PCT/CN2023/123450)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- image
- processing model
- image processing
- face
- swapped
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T9/00—Image coding
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T5/00—Image enhancement or restoration
- G06T5/50—Image enhancement or restoration using two or more images, e.g. averaging or subtraction
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T5/00—Image enhancement or restoration
- G06T5/60—Image enhancement or restoration using machine learning, e.g. neural networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T5/00—Image enhancement or restoration
- G06T5/70—Denoising; Smoothing
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/44—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/74—Image or video pattern matching; Proximity measures in feature spaces
- G06V10/75—Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
- G06V10/751—Comparing pixel values or logical combinations thereof, or feature values having positional relevance, e.g. template matching
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/764—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/774—Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20081—Training; Learning
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20084—Artificial neural networks [ANN]
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20112—Image segmentation details
- G06T2207/20132—Image cropping
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20212—Image combination
- G06T2207/20221—Image fusion; Image merging
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/30—Subject of image; Context of image processing
- G06T2207/30196—Human being; Person
- G06T2207/30201—Face
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Definitions
- the embodiments of the present application relate to machine learning technology, and in particular, to an image processing model training method, device, electronic device, computer-readable storage medium, and computer program product.
- the embodiments of the present application provide an image processing model training method, device, electronic device, computer-readable storage medium and computer program product, which can use a pre-trained second image processing model to assist in training a first image processing model that has a re-parameterized structure, balancing model performance against the model's computational complexity.
- the present application embodiment provides an image processing model training method, which is performed by an electronic device and includes:
- the first training sample set includes at least one triplet training sample
- the triplet training sample includes: a source image, a template image, and a true value image
- the first image processing model is trained according to the fusion loss function, and when the training convergence condition of the first image processing model is reached, the model parameters of the first image processing model are determined.
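The training step above can be illustrated with a rough sketch of a fusion loss. It combines a reconstruction term (first face-swapped image against the true value image) with a distillation term (first face-swapped image against the output of the pre-trained second model). The L1 distance, the weights `w_rec` and `w_distill`, and all function names are assumptions made for illustration; the text only states that the fusion loss is computed from these three inputs.

```python
from typing import Sequence


def l1_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean absolute difference between two flattened images."""
    if len(a) != len(b):
        raise ValueError("images must have the same number of pixels")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def fusion_loss(student_out, teacher_out, true_image, w_rec=1.0, w_distill=1.0):
    """Hypothetical fusion loss: weighted sum of two L1 terms (assumed form)."""
    rec = l1_distance(student_out, true_image)       # supervision from the true image
    distill = l1_distance(student_out, teacher_out)  # supervision from the second model
    return w_rec * rec + w_distill * distill


# Toy flattened "images" with three pixels each.
student = [0.2, 0.4, 0.6]  # output of the first (re-parameterized) model
teacher = [0.2, 0.5, 0.6]  # output of the pre-trained second model
truth = [0.0, 0.4, 0.6]    # true value image
loss = fusion_loss(student, teacher, truth)
```

Training would repeat this computation over the triplet samples and update the first model until a chosen convergence condition is met, for example a loss threshold or an iteration limit.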
- the present application also provides an image processing model training device, the training device comprising:
- a data transmission module is configured to obtain a first training sample set, wherein the first training sample set includes at least one triplet training sample, and the triplet training sample includes: a source image, a template image, and a true value image;
- an image processing model training module configured to perform face swapping on the source image and the template image through a first image processing model to obtain a first face swapped image, wherein the first image processing model is a re-parameterized structure;
- the image processing model training module is configured to obtain a second image processing model corresponding to the first image processing model, wherein the second image processing model is a pre-trained image processing model;
- the image processing model training module is configured to calculate a fusion loss function of the first image processing model according to the second image processing model, the first face-swapped image and the true value image;
- the image processing model training module is configured to train the first image processing model according to the fusion loss function, and determine the model parameters of the first image processing model when the training convergence condition of the first image processing model is reached.
- the present application also provides an electronic device, the electronic device comprising:
- a memory for storing computer executable instructions
- the processor is used to implement the image processing model training method provided in the embodiment of the present application when running the computer executable instructions stored in the memory.
- An embodiment of the present application also provides a computer-readable storage medium storing computer-executable instructions, which, when executed by a processor, implement the image processing model training method provided in the embodiment of the present application.
- An embodiment of the present application also provides a computer program product, including computer executable instructions, characterized in that when the computer executable instructions are executed by a processor, the image processing model training method provided in the embodiment of the present application is implemented.
- the embodiment of the present application obtains a triplet training sample including a source image, a template image and a true image, and performs face-swapping on the source image and the template image through a first image processing model to obtain a first face-swapping image.
- the first image processing model is a re-parameterized structure. Due to the characteristics of the re-parameterized structure, the first image processing model will be more lightweight in the application stage, thereby reducing resource consumption when the model is applied; a pre-trained second image processing model corresponding to the first image processing model is obtained, and a fusion loss function of the first image processing model is calculated according to the second image processing model, the first face-swapping image and the true image.
- the first image processing model is trained.
- the model parameters of the first image processing model are determined.
- the first image processing model finally obtained can achieve lightweight application and can have high accuracy.
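Why a re-parameterized structure is lighter in the application stage can be sketched with plain matrices: a multi-branch block used during training is folded into one equivalent single-branch operation for inference. The two linear branches, the identity shortcut and every name below are illustrative assumptions; the actual model presumably uses convolutional branches, which fold in the same way because convolution is linear.

```python
from typing import List

Matrix = List[List[float]]


def matvec(w: Matrix, x: List[float]) -> List[float]:
    """Multiply matrix w by vector x."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def train_time_block(w_a: Matrix, w_b: Matrix, x: List[float]) -> List[float]:
    """Multi-branch form: two linear branches plus an identity shortcut."""
    ya, yb = matvec(w_a, x), matvec(w_b, x)
    return [a + b + xi for a, b, xi in zip(ya, yb, x)]


def merge_branches(w_a: Matrix, w_b: Matrix) -> Matrix:
    """Fold both branches and the identity into one matrix: W_a + W_b + I."""
    n = len(w_a)
    return [
        [w_a[i][j] + w_b[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
        for i in range(n)
    ]


w_a = [[1.0, 2.0], [0.0, 1.0]]
w_b = [[0.5, 0.0], [1.0, -1.0]]
x = [2.0, 3.0]

multi_branch = train_time_block(w_a, w_b, x)          # three operations per input
single_branch = matvec(merge_branches(w_a, w_b), x)   # one operation per input
```

Both forms give the same output, but the merged form needs a single matrix product at inference time.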
- FIG. 1 is a schematic diagram of a use environment of an image processing model training method provided in an embodiment of the present application.
- FIG. 2 is a schematic diagram of the composition structure of an image processing model training device provided in an embodiment of the present application.
- FIG. 3 is a schematic diagram of generating image processing results in a related solution.
- FIG. 4 is a flow chart of an image processing model training method provided in an embodiment of the present application.
- FIG. 5 is a schematic diagram of a facial image acquisition process in an embodiment of the present application.
- FIG. 6 is a schematic diagram of a facial image acquisition process in an embodiment of the present application.
- FIG. 7 is a schematic diagram of a model structure of a first image processing model in an embodiment of the present application.
- FIG. 8 is a schematic diagram of a test process of a first image processing model in an embodiment of the present application.
- FIG. 9 is a schematic diagram of the working process of a trained image processing model in an embodiment of the present application.
- FIG. 10 is a schematic diagram of the face-swapping effect in an embodiment of the present application.
- the terms "first/second/third" are merely used to distinguish similar objects and do not represent a specific ordering of the objects. It can be understood that "first/second/third" can be interchanged in a specific order or sequence where permitted, so that the embodiments of the present application described here can be implemented in an order other than that illustrated or described here.
- Video face swapping: the face in the input source image is swapped onto the template face, and the face in the output image retains the expression, angle, background and other information of the template face. The information in the output image other than the face is the same as the source image.
- Neural network: in the field of machine learning and cognitive science, a mathematical or computational model that imitates the structure and function of biological neural networks and is used to estimate or approximate functions.
- Model parameters: quantities that use general variables to establish the relationship between functions and variables. In artificial neural networks, model parameters are usually real-valued matrices.
- Knowledge transfer: using the outputs that the teacher network produces for the training sample data, at an intermediate network layer or the final network layer, to assist in training a student network that is faster but less accurate, thereby transferring the knowledge of the well-performing teacher network to the student network.
- Knowledge distillation: the technique of using the smoothed category posterior probabilities output by the teacher network to train the student network in classification problems.
- Teacher network: a high-performance neural network that provides more accurate supervision information to the student network during the knowledge transfer process.
- Student network: a single neural network with fast computing speed but poorer performance, suitable for deployment in practical application scenarios with high real-time requirements. Compared with the teacher network, the student network has a larger computing throughput and fewer model parameters.
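The "smoothed category posterior probabilities" in the knowledge distillation definition are commonly obtained by applying a softmax with a temperature greater than 1 to the teacher's logits. The sketch below shows that smoothing step only; the temperature value and the example logits are assumptions, not values from the application.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Softmax over logits / temperature; a higher temperature gives a smoother output."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


teacher_logits = [4.0, 1.0, 0.0]                        # hypothetical teacher outputs
hard = softmax_with_temperature(teacher_logits, 1.0)    # sharply peaked distribution
soft = softmax_with_temperature(teacher_logits, 4.0)    # smoothed targets for the student
```

The smoothed distribution keeps the same ranking of classes but assigns more probability to the non-maximal classes, which gives the student network more information per sample than a one-hot label.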
- Downsampling: one sample is taken every several sample values in a sample sequence, so that the new sequence is a downsampled version of the original sequence. For example, an image I of size M*N that is downsampled by a factor of s yields a low-resolution image of size (M/s)*(N/s), where s is a common divisor of M and N.
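A minimal sketch of the downsampling definition, keeping every s-th row and every s-th column of an M*N image stored as a nested list. The function name and the toy image are illustrative assumptions.

```python
from typing import List


def downsample(image: List[List[int]], s: int) -> List[List[int]]:
    """Keep every s-th pixel in both directions; s must divide both dimensions."""
    rows, cols = len(image), len(image[0])
    if s <= 0 or rows % s or cols % s:
        raise ValueError("s must be a positive common divisor of M and N")
    return [row[::s] for row in image[::s]]


# A 4*4 image whose pixel values are 0..15 in row-major order.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
small = downsample(image, 2)  # result has size (4/2)*(4/2)
```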
- Generative Adversarial Network (GAN): a deep learning model that produces better outputs through the mutual game learning of at least two modules in the framework: the generative model G and the discriminative model D.
- G is a model for generating high-resolution images (also called reconstructed images in this document).
- D is a model for detecting whether an image is an original natural image.
- the goal of G is to make D unable to judge whether the high-resolution image generated by G is an unnatural image.
- D should try its best to distinguish whether the input image is an original natural image or an unnatural image generated by G.
- the parameters of G and D are iteratively updated until the generative adversarial network meets the convergence conditions.
- Generator network: used to generate high-resolution images from low-resolution images.
- the generator can be a convolutional neural network based on deep learning.
- Discriminator network: used to determine whether the input image x is an unnatural image generated by the generator or a natural image.
- the discriminator outputs a probability value D1(x) in the range of 0 to 1.
- when D1(x) is 0, the image x input to the discriminator is a natural image.
- when D1(x) is 1, the image x input to the discriminator is an unnatural image.
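Under the labeling convention stated above (0 for a natural image, 1 for a generated image), a discriminator can be scored with binary cross-entropy. The sketch below is one common choice, not a loss taken from the application; the function name and the `eps` guard are assumptions.

```python
import math


def discriminator_loss(d1_natural: float, d1_generated: float, eps: float = 1e-12) -> float:
    """Binary cross-entropy with target 0 for natural and 1 for generated images."""
    loss_natural = -math.log(max(1.0 - d1_natural, eps))  # wants D1(natural) near 0
    loss_generated = -math.log(max(d1_generated, eps))    # wants D1(generated) near 1
    return loss_natural + loss_generated


good = discriminator_loss(d1_natural=0.1, d1_generated=0.9)  # mostly correct discriminator
bad = discriminator_loss(d1_natural=0.9, d1_generated=0.1)   # mostly wrong discriminator
```

A discriminator that separates the two kinds of image well gets a lower loss than one that confuses them.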
- Three-primary-color encoding: also known as the RGB color mode, it is the color standard in the industry. It obtains various colors by changing the three color channels of red (R), green (G), and blue (B) and superimposing them on each other. RGB represents the colors of the three channels of red, green, and blue. This standard covers almost all colors that can be perceived by human vision and is one of the most widely used color systems at present.
- Face swapping: using the target part of the object in the image to be processed to replace the part of the object in other images that corresponds to the target part.
- FIG1 is a schematic diagram of a related art super-resolution processing of an image based on a super-resolution generative adversarial network.
- the structure of the super-resolution generative adversarial network is shown in FIG1, and includes a generator network 301 and a discriminator network 302.
- the generator network 301 and the discriminator network 302 are deep neural network models.
- a high-definition image is used as a training sample image and down-sampled to form a low-resolution (relative to the high-definition image) training sample image.
- the generator network 301 in the super-resolution generative adversarial network model reconstructs the low-resolution training sample image to form a reconstructed image. The discriminator network 302 identifies the reconstructed image, and the parameters of the generator network 301 and/or the discriminator network 302 are adjusted according to the identification results until the two networks reach Nash equilibrium. Training of the super-resolution generative adversarial network model is then complete, and the model can reconstruct a lower-resolution input image into a higher-resolution image.
- the above-mentioned solutions of the related art have the following problems: generating high-resolution images requires the model to have a very large number of parameters; for example, the Pix2PixHD model has about 100 million parameters.
- the disadvantage of such a large-scale model is that inference is slow and the model is difficult to deploy on mobile devices, so the image processing model needs to be compressed.
- the time consumption of the face-swapping model is often not considered in the related art.
- the supervised face-swapping model is trained with a complex network structure, so its computational complexity is too high for it to run on mobile devices.
- the embodiments of the present application address the problem that the model in the related art is too computationally complex to run on mobile devices.
- the embodiments of the present application provide an image processing model training method, device, electronic device, computer-readable storage medium and computer program product, which, by trimming the model structure and introducing structural re-parameterization and knowledge distillation, can reduce the model's floating-point operation count (FLOPs) to 544 M, about 94% less than the 9373 M of the related art.
- the frame rate can reach 17 to 20 frames per second (FPS), so the time consumption basically meets the real-time requirements of the mobile terminal.
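The quoted figures can be checked with a short calculation. It assumes that 544 and 9373 are both operation counts in the same unit (millions of floating-point operations), and it converts the 17 to 20 frames per second into a per-frame time budget; the variable names are illustrative.

```python
related_art_mflops = 9373  # computation of the related-art model, as quoted
proposed_mflops = 544      # computation of the optimized model, as quoted

# Relative reduction in computation.
reduction = 1 - proposed_mflops / related_art_mflops

# Time available per frame at the quoted frame rates, in milliseconds.
budget_ms_at_17_fps = 1000 / 17
budget_ms_at_20_fps = 1000 / 20

print(f"reduction: {reduction:.1%}")
print(f"per-frame budget: {budget_ms_at_20_fps:.1f} to {budget_ms_at_17_fps:.1f} ms")
```

The reduction comes out at roughly 94%, consistent with the figure in the text, and the frame rates correspond to about 50 to 59 ms per frame.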
- the image processing model training method provided in the embodiment of the present application can be implemented by the terminal or the server alone, or by the terminal and the server in collaboration. For example, the terminal alone undertakes the following image processing model training method, or the terminal sends a training request to the server, and the server executes the image processing model training method according to the received training request.
- the terminal sends an image processing request to the server.
- the server generates an image processing result for the target image to be processed by calling the generator network in the set image processing model, and returns the image processing result to the terminal.
- the electronic device for executing the image processing model training method provided in the embodiment of the present application can be various types of terminal devices or servers, wherein the server can be an independent physical server, or a server cluster or distributed system composed of multiple physical servers, or a cloud server providing cloud computing services; the terminal can be a smart phone, a tablet computer, a laptop computer, a desktop computer, a smart speaker, a smart watch, etc., but is not limited thereto.
- the terminal and the server can be directly or indirectly connected via wired or wireless communication, and this application does not limit this.
- the server can be a server cluster deployed in the cloud, opening artificial intelligence cloud services (AI as a Service, AIaaS) to users.
- the AIaaS platform will split several common AI services and provide independent or packaged services in the cloud. This service model is similar to an AI theme mall. All users can access one or more artificial intelligence services provided by the AIaaS platform through an application programming interface.
- FIG. 2 is a schematic diagram of the use scenario of the image processing model training method provided in an embodiment of the present application.
- a client of an image processing software is set on the terminal (including terminal 10-1 and terminal 10-2), and the user can input the corresponding image to be processed through the set image processing software client.
- the image processing client can also receive the corresponding image processing result and display the received image processing result to the user;
- the terminal is connected to the server 200 through the network 300, and the network 300 can be a wide area network or a local area network, or a combination of the two, and a wireless link is used to realize data transmission.
- the server 200 is used to set up an image processing model and train the image processing model to iteratively update the generator parameters and discriminator parameters of the image processing model, so as to generate an image processing result for the target image to be processed through the generator network in the image processing model, and display the image processing result corresponding to the image to be processed generated by the image processing model through the terminal (terminal 10-1 and/or terminal 10-2).
- the image processing model needs to be trained. After the parameters of the image processing model are determined, it is deployed in the mobile terminal for the user to use, and can also be saved in the cloud server network waiting for the user to download and use.
- the image processing model training method provided in the embodiment of the present application can be implemented based on artificial intelligence.
- Artificial Intelligence is the theory, method, technology and application system that uses digital computers or machines controlled by digital computers to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain the best results.
- artificial intelligence is a comprehensive technology in computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can respond in a similar way to human intelligence.
- Artificial intelligence studies the design principles and implementation methods of various intelligent machines so that the machines have the functions of perception, reasoning and decision-making.
- Artificial intelligence technology is a comprehensive discipline that covers a wide range of fields, including both hardware-level and software-level technologies.
- Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technology, operating/interactive systems, mechatronics, and other technologies.
- Artificial intelligence software technologies mainly include computer vision technology, speech processing technology, natural language processing technology, and machine learning/deep learning.
- FIG3 is a schematic diagram of the composition structure of the image processing model training device provided in the embodiment of the present application. It should be understood that FIG3 only shows an exemplary structure of the image processing model training device rather than the entire structure, and part or all of the structure shown in FIG3 may be implemented as needed.
- the image processing model training device includes: at least one processor 201, a memory 202, a user interface 203 and at least one network interface 204.
- the various components in the image processing model training device 20 are coupled together through a bus system 205.
- the bus system 205 is used to realize the connection and communication between these components.
- the bus system 205 also includes a power bus, a control bus and a status signal bus.
- the various buses are all labeled as the bus system 205 in Figure 2.
- the user interface 203 may include a display, a keyboard, a mouse, a trackball, a click wheel, keys, buttons, a touch pad or a touch screen.
- the memory 202 can be a volatile memory or a non-volatile memory, and can also include both volatile and non-volatile memories.
- the memory 202 in the embodiment of the present application can store data to support the operation of the terminal (such as 10-1). Examples of these data include: any computer program for operating on the terminal (such as 10-1), such as an operating system and an application.
- the operating system includes various system programs, such as a framework layer, a core library layer, a driver layer, etc., which are used to implement various basic services and process hardware-based tasks.
- the application can include various applications.
- the image processing model training device provided in the embodiment of the present application can be implemented in a combination of software and hardware.
- the image processing model training device provided in the embodiment of the present application can be a processor in the form of a hardware decoding processor, which is programmed to execute the image processing model training method provided in the embodiment of the present application.
- the processor in the form of a hardware decoding processor can adopt one or more application specific integrated circuits (ASICs), DSPs, programmable logic devices (PLDs), complex programmable logic devices (CPLDs), field programmable gate arrays (FPGAs) or other electronic components.
- the image processing model training device provided in an embodiment of the present application can be directly embodied as a combination of software modules executed by a processor 201, and the software module can be located in a storage medium, and the storage medium is located in a memory 202.
- the processor 201 reads the executable instructions included in the software module in the memory 202, and combines with necessary hardware (for example, including a processor 201 and other components connected to a bus 205) to complete the image processing model training method provided in an embodiment of the present application.
- processor 201 can be an integrated circuit chip with signal processing capabilities, such as a general-purpose processor, a digital signal processor (DSP), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc., where the general-purpose processor can be a microprocessor or any conventional processor, etc.
- the device provided in the embodiment of the present application can be directly executed by a processor 201 in the form of a hardware decoding processor.
- the image processing model training method provided in the embodiment of the present application can be implemented by one or more application specific integrated circuits (ASICs), DSPs, programmable logic devices (PLDs), complex programmable logic devices (CPLDs), field programmable gate arrays (FPGAs), or other electronic components.
- the memory 202 in the embodiment of the present application is used to store various types of data to support the operation of the image processing model training device 20.
- Examples of such data include any executable instructions for operating on the image processing model training device 20; the program for implementing the image processing model training method of the embodiment of the present application can be included in these executable instructions.
- the image processing model training device provided in the embodiment of the present application can be implemented in a software manner.
- FIG. 3 shows a storage
- the image processing model training device 255 stored in the memory 250 can be software in the form of a program or a plug-in, and includes the following software modules: a data transmission module 2551, an image processing model training module 2552. These modules are logical, and thus can be arbitrarily combined or further split according to the functions implemented. The functions of each module will be described below.
- FIG. 4 is a flow chart of the image processing model training method provided in an embodiment of the present application. It can be understood that the steps shown in FIG. 4 can be performed by various electronic devices running an image processing model training device, such as a terminal running a mini program with face image detection and adjustment functions, or a terminal with an image processing model training function. The steps shown in FIG. 4 are described below.
- Step 401: Obtain a first training sample set, wherein the first training sample set includes at least one triplet training sample, and the triplet training sample includes a source image, a template image and a true image.
- a face image in the environment where the mobile terminal is located can be collected as the source image, where the source image can be image A including object A, the template image can be image B including object B, and the true image can be an image in which the face of object B in image B is replaced with the face of object A.
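The triplet training sample described in Step 401 can be represented with a small data structure. This is only a sketch: the class name, the field names and the requirement that all three images share one size are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List

Image = List[List[float]]  # grayscale image as rows of pixel values (assumed layout)


@dataclass(frozen=True)
class TripletSample:
    """One training sample: source image A, template image B, and the true image."""
    source: Image      # image A, containing object A
    template: Image    # image B, containing object B
    true_image: Image  # image B with the face of object B replaced by that of object A

    def __post_init__(self) -> None:
        # Assumption: the three images are required to have the same height and width.
        shapes = {(len(img), len(img[0])) for img in (self.source, self.template, self.true_image)}
        if len(shapes) != 1:
            raise ValueError("source, template and true image must share one size")


blank = [[0.0, 0.0], [0.0, 0.0]]
sample = TripletSample(source=blank, template=blank, true_image=blank)
first_training_sample_set = [sample]  # the set holds at least one triplet
```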
- FIG. 5 is a schematic diagram of the facial image acquisition process in an embodiment of the present application.
- dark channel dehazing processing can be performed on the facial image to form an enhanced image.
- the enhanced image formed may include facial features and/or limb features.
- the dark channel defogging process here is as follows: determine the dark channel value of the facial image, the grayscale value of the facial image, and the defogging adjustment value; determine the atmospheric light value of the facial image based on the dark channel value, the defogging adjustment value, and the grayscale value of the facial image; process the facial image according to the atmospheric light value and the light adjustment value of the facial image to form an enhanced image.
- The dark channel is the grayscale image obtained by taking, for each pixel of the acquired facial image, the minimum of its three RGB channel values, and then applying minimum-value filtering to that grayscale image.
- the defogging adjustment value can be obtained; after converting the acquired facial image into a grayscale image, the grayscale value of the facial image and the dark channel value can be obtained.
- the dark channel value be Dark_channel
- the grayscale value of the facial image be Mean_H and Mean_V
- the atmospheric light value of the facial image be AirLight
- the dehazing adjustment value be P
- the light adjustment value be A
- the facial image to be enhanced be Input
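- As a non-limiting sketch, the dark channel described above (per-pixel minimum over the RGB channels followed by minimum-value filtering) could be computed as follows. The image representation (nested lists of RGB tuples), the function name `dark_channel` and the window parameter `radius` are assumptions made for illustration; only the dark channel itself is shown, not the atmospheric light estimate.

```python
def dark_channel(image, radius=1):
    """Compute a dark channel map.

    image: list of rows, each row a list of (r, g, b) tuples (assumed layout).
    radius: half-width of the square minimum-filter window (hypothetical parameter).
    """
    height, width = len(image), len(image[0])
    # Step 1: per-pixel minimum over the three RGB channels.
    min_rgb = [[min(pixel) for pixel in row] for row in image]
    # Step 2: minimum-value filtering over a (2*radius+1) square window,
    # with the window clipped at the image borders.
    dark = []
    for y in range(height):
        row = []
        for x in range(width):
            window = [
                min_rgb[j][i]
                for j in range(max(0, y - radius), min(height, y + radius + 1))
                for i in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            row.append(min(window))
        dark.append(row)
    return dark
```

With `radius=0` the result is simply the per-pixel channel minimum; larger windows spread low values to neighbouring pixels.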
- a face image collected by a terminal in the use environment of the first image processing model can be acquired through a mobile terminal; image augmentation processing is performed on the face image; based on the image augmentation processing result, a corresponding face position is determined through a face detection algorithm, and a face image including a background image is intercepted based on the face position; the face image including the background image is cropped to obtain the source image.
- the face detection algorithm may be an algorithm for detecting the position of a face in an image, such as a face detection and face alignment method based on deep learning.
- FIG. 6 is a schematic diagram of the facial image acquisition process in an embodiment of the present application. Since the position of the image acquisition device is fixed and the height of the target object is different, the comprehensiveness of the acquired facial image is also different (it may be that the target object is too short or too tall and an accurate facial image cannot be acquired).
- the acquired facial image can be augmented; based on the processing result of the image augmentation, the corresponding face position is determined by a face detection algorithm, and the facial image including the background image is captured; the facial image including the background image is cleared to form the corresponding facial image of the target user.
- the face detection technology can be used to frame the area where the user's face is located, and then expand it by 2 times with this area as the center, as shown in Figure 6, the detection area of the detection frame 601 is adjusted to the detection area of the detection frame 602, so as to obtain more background content and crop the facial image including the background content; for example, the following methods can be used: using a face detection algorithm to frame the face position of the target object; using a facial feature positioning algorithm to mark the feature points of the eyes, mouth, nose, etc.; and capturing the facial image including the background content according to the detected face position.
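- As a non-limiting sketch, enlarging the detection frame about its center (as from detection frame 601 to detection frame 602) could look like the following. It is assumed here that "expand by 2 times" means doubling the width and height of the box; the function name `expand_box`, the `(x_min, y_min, x_max, y_max)` layout and the optional clipping to the image bounds are hypothetical.

```python
def expand_box(box, scale=2.0, image_size=None):
    """Scale a box about its center.

    box: (x_min, y_min, x_max, y_max) in pixels (assumed layout).
    scale: factor applied to width and height (2.0 doubles each side).
    image_size: optional (width, height); if given, the result is clipped to it.
    """
    x_min, y_min, x_max, y_max = box
    cx, cy = (x_min + x_max) / 2.0, (y_min + y_max) / 2.0
    half_w = (x_max - x_min) * scale / 2.0
    half_h = (y_max - y_min) * scale / 2.0
    new_box = [cx - half_w, cy - half_h, cx + half_w, cy + half_h]
    if image_size is not None:
        width, height = image_size
        # Keep the enlarged box inside the image.
        new_box[0] = max(0.0, new_box[0])
        new_box[1] = max(0.0, new_box[1])
        new_box[2] = min(float(width), new_box[2])
        new_box[3] = min(float(height), new_box[3])
    return tuple(new_box)
```

The enlarged box can then be used to crop a facial image that still contains background content.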
- the deep processing network may include but is not limited to: LeNet, AlexNet, VGG, Inception series networks, ResNet networks; by extracting features of the face image (for example, extracting grayscale-based features such as mean and variance and features based on distribution histograms, features based on correlation matrices such as GLCM and GLRLM, or signal features after Fourier transform of the image), and performing background removal processing based on the extracted features, a real face cropped by the deep processing network is obtained, and a depth map corresponding to the face is calculated.
- the facial image calculated by using a real human face includes a depth map, while the depth map corresponding to the attack image (such as a face photo) is a black background map.
- the facial image of the target object can be obtained by restoring the depth map.
- the obtained facial image of the target object does not include the background image, which can make the face-changing function processing result of the image processing model more accurate.
- Step 402 Process the first training sample set through a first image processing model to obtain a first face-swapped image, wherein the first image processing model is a re-parameterized structure.
- the reparameterized structure means that the first image processing model is obtained based on the structural reparameterization technology.
- Structural reparameterization means first constructing a series of structures (generally used for training) and then converting their parameters into another set of parameters (generally used for inference), thereby converting this series of structures into another series of structures.
- the structure used during training is larger and has a desirable property (higher accuracy or another useful property, such as sparsity), while the converted structure used during inference is smaller and retains this property (the same accuracy or other useful property).
- the original meaning of the term "structural reparameterization" is: converting a set of parameters of a structure into another set of parameters, and using the converted parameters to parameterize another structure. As long as the parameter conversion is equivalent, the replacement of the two structures is equivalent.
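- As a non-limiting illustration of an equivalent parameter conversion, consider a training-time structure with two parallel linear branches and an identity branch, y = A·x + B·x + x. Its parameters can be converted into a single matrix M = A + B + I that parameterizes a one-branch structure with the same output. The function names below are hypothetical, and the example is a toy case rather than the model of the present application.

```python
def matvec(matrix, vector):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def train_forward(a, b, x):
    """Training-time structure: two linear branches plus an identity branch."""
    ya, yb = matvec(a, x), matvec(b, x)
    return [p + q + r for p, q, r in zip(ya, yb, x)]


def merge_branches(a, b):
    """Convert the three-branch parameters into one matrix: A + B + I."""
    n = len(a)
    return [
        [a[i][j] + b[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
        for i in range(n)
    ]


def infer_forward(merged, x):
    """Inference-time structure: a single linear map."""
    return matvec(merged, x)
```

Because the conversion is exact, replacing the three-branch structure with the merged one does not change the output for any input.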
- the first training sample set can be denoised to save the training time of the first image processing model and improve the training accuracy of the first image processing model.
- the usage environment of the trained first image processing model includes: film and television production scenes, game image production scenes, live virtual image production scenes, and ID photo production scenes.
- an image containing a character object can be used as a source image, and an image containing a game character can be used as a template image.
- the source image and the template image are input into the trained first image processing model, and a corresponding face-changing image is output.
- the output face-changing image replaces the identity of the game character in the template image with the identity of the character object in the source image. It can be seen that a unique game character can be designed for a character through a face-changing image.
- the image containing the virtual image can be used as the source image, and each image frame containing the human object in the live video can be used as the template image and input into the trained first image processing model respectively with the source image, and the corresponding face-changing image is output.
- the output face-changing image replaces the identity of the human object in the template image with the virtual image. It can be seen that the virtual image can be used to replace the identity in the live broadcast scene to enhance the fun of the live broadcast scene.
- the image of the object for which the ID photo needs to be made can be used as the source image, the source image and the ID photo template image are input into the trained first image processing model, and the corresponding face-changing image is output.
- the output face-changing image replaces the identity of the template object in the ID photo template image with the object for which the ID photo needs to be made. It can be seen that through the face-changing image, the object for which the ID photo needs to be made can directly make the ID photo by providing an image without shooting, which greatly reduces the production cost of the ID photo.
- a dynamic noise threshold that matches the usage environment of the first image processing model can be determined; the first training sample set is denoised according to the dynamic noise threshold to form a second training sample set that matches the dynamic noise threshold, thereby ensuring the training accuracy of the image processing model.
- a fixed noise threshold corresponding to the second image processing model is determined, and the first training sample set is denoised according to the fixed noise threshold to form a second training sample set that matches the fixed noise threshold, which can further compress the training time of the image processing model.
- Step 403 Obtain a second image processing model corresponding to the first image processing model, wherein the second image processing model is a pre-trained image processing model, and the model parameters of the second image processing model remain unchanged during subsequent training.
- the second image processing model may be a trained neural network, and the second image processing model may be a large-scale neural network, for example, the number of network parameters of the second image processing model is greater than a certain value, but the embodiment of the present application does not limit this.
- the second image processing model may be a convolutional neural network (CNN), a deep neural network (DNN), a recurrent neural network (RNN), etc., and the embodiment of the present application does not limit the type of the second image processing model.
- the second image processing model may be a neural network applicable to different computer vision tasks, such as: target recognition tasks, target classification tasks, target detection tasks or posture estimation tasks, etc.
- the second image processing model may also be a neural network applicable to different application scenarios, such as: security detection scenarios, face unlocking scenarios, intelligent driving or remote sensing scenarios, etc., and the embodiment of the present application does not limit the scope of application of the first image processing model.
- the network structure of the second image processing model may be designed according to computer vision tasks, or the network structure of the second image processing model may adopt at least a part of an existing network structure, such as a deep residual network, a Visual Geometry Group network (VGGNet), etc.
- the first image processing model may be a neural network to be trained, and the first image processing model may be a smaller-scale neural network whose low floating-point operation count makes it convenient to deploy on a mobile terminal.
- the number of network parameters of the first image processing model is less than a certain value (at least the number of network parameters of the first image processing model is less than the number of network parameters of the second image processing model), but the embodiment of the present application is not limited to this.
- the network scale of the second image processing model is larger than the network scale of the first image processing model.
- the second image processing model can be a teacher network, and the first image processing model can be a student network. Using the teacher network to train the student network can improve the performance of the trained student network.
- the first image processing model can be trained using knowledge distillation methods or other methods, which is not limited to this in the embodiment of the present application.
- FIG. 7 is a schematic diagram of a model structure of the first image processing model in an embodiment of the present application, wherein the encoder and decoder of the first image processing model use the re-parameterized structure RepVGG (Visual Geometry Group).
- A in FIG. 7 represents the original ResNet network, which contains a Conv1*1 residual structure and an Identity residual structure. These residual structures address the gradient vanishing problem in deep networks and make the network easier to converge.
- B in FIG. 7 represents the RepVGG network architecture in the training phase, in which the main body of the network contains residual structures. The residual blocks in the RepVGG network do not cross layers, and the network contains two residual structures.
- training the first image processing model with the structure shown in FIG. 7 is similar to training multiple networks and integrating them into one network, which gives higher training efficiency.
- C in FIG. 7 represents the RepVGG network in the test phase.
- the structure of this network is very simple: the entire network is a stack of Conv3*3+ReLU layers, which makes the model easy to test and accelerate.
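- As a non-limiting sketch of how a training-phase block with a Conv3*3 branch, a Conv1*1 branch and an Identity branch can be collapsed into a single Conv3*3+ReLU layer for the test phase, the single-channel example below pads the 1*1 weight and the identity into the center of the 3*3 kernel. It assumes stride 1 and zero padding 1, and it omits the batch normalization layers that RepVGG also folds into the kernel; all function names are hypothetical.

```python
def conv3x3(image, kernel):
    """Single-channel 3x3 convolution with stride 1 and zero padding 1."""
    height, width = len(image), len(image[0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < height and 0 <= xx < width:
                        out[y][x] += kernel[dy + 1][dx + 1] * image[yy][xx]
    return out


def relu(feature_map):
    return [[max(0.0, v) for v in row] for row in feature_map]


def training_block(image, k3, k1):
    """Training phase: 3x3 branch + 1x1 branch + identity branch, then ReLU."""
    y3 = conv3x3(image, k3)
    summed = [
        [y3[y][x] + k1 * image[y][x] + image[y][x] for x in range(len(image[0]))]
        for y in range(len(image))
    ]
    return relu(summed)


def fuse_kernels(k3, k1):
    """Fold the 1x1 weight and the identity into the center of a copy of k3."""
    fused = [row[:] for row in k3]
    fused[1][1] += k1 + 1.0
    return fused


def inference_block(image, fused_kernel):
    """Test phase: a single Conv3*3 followed by ReLU."""
    return relu(conv3x3(image, fused_kernel))
```

The two blocks produce the same feature map, so the simpler test-phase structure can replace the multi-branch training-phase structure without changing the result.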
- Step 404 Calculate the fusion loss function of the first image processing model based on the second image processing model and the first face-swapped image.
- the fusion loss function of the first image processing model is composed of a combination of different loss functions, and a second face-changing image output by the second image processing model is obtained, and the reconstruction loss function of the first image processing model is calculated using the first face-changing image and the second face-changing image;
- the feature loss function of the first image processing model is calculated based on the first face-changing image and the second face-changing image;
- the estimation loss function of the first image processing model is calculated based on the first face-changing image and the source image;
- the adversarial loss function of the first image processing model is calculated based on the first face-changing image and the true image;
- the sum of the reconstruction loss function, the feature loss function, the estimation loss function and the adversarial loss function is calculated to obtain the fusion loss function of the first image processing model;
- the training effect of the first image processing model can be improved from multiple dimensions, thereby improving the face-changing accuracy of the first image processing model.
- Reconstruction_loss is the reconstruction loss function
- LPIPS_loss is the feature loss function
- ID_loss is the estimation loss function
- D_loss is the discriminator loss
- G_loss is the generator loss
- (D_loss+G_loss) constitutes the adversarial loss function.
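- As a minimal sketch, the fusion loss described above, namely the sum of the reconstruction loss, the feature loss, the estimation loss and the adversarial loss (D_loss+G_loss), can be written as follows. The function name `fusion_loss` is hypothetical, and the terms are summed without weights because no weighting is described here.

```python
def fusion_loss(reconstruction_loss, lpips_loss, id_loss, d_loss, g_loss):
    """Sum the four loss terms; the adversarial term is D_loss + G_loss."""
    adversarial_loss = d_loss + g_loss
    return reconstruction_loss + lpips_loss + id_loss + adversarial_loss
```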
- the second face-changing image calculated by the second image processing model can be represented as BigModel_fake, and the first face-changing image calculated by the first image processing model can be represented as fake; through the embodiments of the present application, the training effect of the second image processing model can be transferred to the first image processing model, thereby playing a teaching role.
- BigModel_fake is the second face-swapped image
- BigModel_swap represents the forward processing of the second image processing model
- source is the source image
- template is the template image
- Reconstruction_loss is the reconstruction loss function
- fake is the first face-swapped image.
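- As a non-limiting sketch, the reconstruction loss Reconstruction_loss=|BigModel_fake-fake| can be computed as a pixel-level absolute difference. The images are assumed to be flattened lists of pixel values of equal length, and averaging over pixels (rather than summing) is an assumption made for illustration.

```python
def reconstruction_loss(big_model_fake, fake):
    """Mean absolute pixel difference between two flattened images."""
    if len(big_model_fake) != len(fake):
        raise ValueError("images must have the same number of pixels")
    total = sum(abs(t - s) for t, s in zip(big_model_fake, fake))
    return total / len(fake)
```

The loss is zero only when the first face-swapped image matches the second face-swapped image pixel for pixel.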
- LPIPS_loss
- alexnet_feature(fake) means inputting the first face-swapped image (fake) into the alexnet network model and obtaining the features output by the four feature extraction layers (corresponding to different levels) of the alexnet network model
- result_fea1, result_fea2, result_fea3 and result_fea4 are the decoded face features of the first face-changing image output by each of the four feature extraction layers.
- alexnet_feature(gt_img) means inputting the second face-swapped image gt_img into the alexnet network model and outputting the features of gt_img output by the four feature extraction layers (corresponding to different levels) of the alexnet network model.
- gt_img_fea1, gt_img_fea2, gt_img_fea3 and gt_img_fea4 are the standard facial features of the second face-swapped image gt_img output by each of the four feature extraction layers.
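- As a non-limiting sketch, the feature loss can be computed by summing the absolute differences between corresponding feature levels, following LPIPS_loss=|result_fea1-gt_img_fea1|+...+|result_fea4-gt_img_fea4|. The feature extraction network is not run here: each level is assumed to be supplied as a flattened list of values, and the function names are hypothetical.

```python
def l1_distance(features_a, features_b):
    """Sum of absolute differences between two flattened feature maps."""
    if len(features_a) != len(features_b):
        raise ValueError("feature maps must have the same size")
    return sum(abs(p - q) for p, q in zip(features_a, features_b))


def feature_loss(result_feas, gt_img_feas):
    """Add up the per-level distances over all feature levels (four in the text)."""
    if len(result_feas) != len(gt_img_feas):
        raise ValueError("both images need the same number of feature levels")
    return sum(l1_distance(r, g) for r, g in zip(result_feas, gt_img_feas))
```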
- $$\text{ID\_loss} = 1 - \text{cosine\_similarity}(\text{fake\_id\_features}, \text{source\_id\_features}) \tag{4}$$
- ID_loss is the estimation loss function
- fake_id_features is the feature vector of the first face-swapped image
- source_id_features is the feature vector of the source image
- cosine_similarity is the cosine similarity
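- As a minimal sketch, the estimation loss of formula (4) is one minus the cosine similarity between the feature vector of the first face-swapped image and the feature vector of the source image. The feature vectors are assumed to be plain lists of numbers produced by some identity feature extractor that is not shown; the parameter name `source_id_features` denotes the source image feature vector.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def id_loss(fake_id_features, source_id_features):
    """ID_loss = 1 - cosine_similarity; 0 when the directions coincide."""
    return 1.0 - cosine_similarity(fake_id_features, source_id_features)
```

The loss ranges from 0 (identical direction) to 2 (opposite direction), so minimizing it pulls the identity features of the face-swapped image toward those of the source image.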
- the generator network which can be called G
- the discriminator network which can be called D
- D the discriminator network
- a high-resolution image x is input and a number D(x) in the range of 0-1 is output.
- D(x) is used to determine whether the input image is generated by the generator, where 0 means no and 1 means yes.
- D_loss is the discriminator loss
- G_loss is the generator loss
- (D_loss+G_loss) constitutes the adversarial loss function loss.
- the calculation of the adversarial loss function refers to formula (5):
- D_loss is the discriminator loss
- G_loss is the generator loss
- D(gt_img) is the discriminant result of the discriminator output for the true value image
- D(fake) is the discriminant result of the discriminator output for the first face-swapped image
- loss is the adversarial loss function.
- the discriminant result here can be a probability, that is, the probability of belonging to the real image.
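- Formula (5) is not reproduced in this text. Purely as an assumed illustration, the sketch below uses the common binary cross-entropy form of the adversarial loss, with D(gt_img) and D(fake) interpreted as probabilities of belonging to the real image; the exact form used in the embodiment may differ, and the function names and the clamping constant are hypothetical.

```python
import math


def _clamp(p, eps=1e-7):
    """Keep a probability away from 0 and 1 so that log() stays finite."""
    return min(max(p, eps), 1.0 - eps)


def discriminator_loss(d_gt_img, d_fake):
    """Assumed form: -log D(gt_img) - log(1 - D(fake))."""
    return -math.log(_clamp(d_gt_img)) - math.log(1.0 - _clamp(d_fake))


def generator_loss(d_fake):
    """Assumed form: -log D(fake)."""
    return -math.log(_clamp(d_fake))


def adversarial_loss(d_gt_img, d_fake):
    """loss = D_loss + G_loss, as stated in the text."""
    return discriminator_loss(d_gt_img, d_fake) + generator_loss(d_fake)
```

Under this assumed form, the discriminator loss falls as D(gt_img) approaches 1 and D(fake) approaches 0, while the generator loss falls as D(fake) approaches 1.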
- Step 405 Train the first image processing model according to the fusion loss function, and when the training convergence condition of the first image processing model is reached, determine the model parameters of the first image processing model.
- the training convergence condition here may be that a set number of training times is reached, or that the fusion loss function converges to a minimum value.
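- As a non-limiting sketch, the two convergence conditions mentioned above (a set number of training iterations, or the fusion loss settling at a minimum) could be checked as follows. The function name `has_converged` and the parameters `tolerance` and `patience` are hypothetical; treating "converges to a minimum" as "changes by less than a tolerance over several consecutive steps" is an assumption.

```python
def has_converged(loss_history, max_steps, tolerance=1e-4, patience=3):
    """Return True when training should stop.

    loss_history: fusion loss value recorded after each training step.
    max_steps: the set number of training steps.
    tolerance, patience: stop if the loss changed by less than `tolerance`
    in each of the last `patience` consecutive steps (assumed criterion).
    """
    if len(loss_history) >= max_steps:
        return True
    if len(loss_history) <= patience:
        return False
    recent = loss_history[-(patience + 1):]
    return all(abs(recent[i] - recent[i + 1]) < tolerance for i in range(patience))
```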
- through steps 401 to 405, the training of the first image processing model is completed, and the parameters of the first image processing model are determined.
- the trained first image processing model can be deployed in the mobile terminal to perform the face-changing function.
- the floating-point operation count (FLOPs) of the first image processing model is optimized to 544 million, which is 94% less than the 9373 million FLOPs of the second image processing model.
- a frame rate of 17 to 20 frames per second can be reached, so that the time consumption of the face-changing function meets the real-time requirements of the mobile terminal.
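- The figures above can be checked with a short calculation: the relative reduction is 1 - 544/9373, and a frame rate of 17 to 20 frames per second corresponds to a per-frame time budget of roughly 59 to 50 milliseconds. The helper names below are hypothetical.

```python
def relative_reduction(small, large):
    """Fraction by which `small` is lower than `large`."""
    return 1.0 - small / large


def frame_budget_ms(frames_per_second):
    """Time available per frame, in milliseconds, at a given frame rate."""
    return 1000.0 / frames_per_second


# FLOPs of the first model versus the second model (same unit for both).
flops_reduction = relative_reduction(544, 9373)
```

Here `flops_reduction` is a little over 0.94, which matches the stated 94% reduction.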
- FIG. 9 is a schematic diagram of the working process of the trained image processing model in an embodiment of the present application, specifically including the following steps:
- Step 901 When the image processing model training is completed and deployed in the mobile terminal, the target face image (corresponding to the source image) and the face image to be replaced (corresponding to the template image) are obtained.
- Step 902 Encode the target face image and the face image to be replaced through the encoder network of the image processing model to obtain a face image vector.
- Step 903 Decode the face image vector through the decoder network of the image processing model to obtain the face-swapped image.
- the generator uses asymmetric input and output resolutions. Since the screen of the mobile terminal is small, the output resolution of the decoder is reduced from 512 pixels to 256 pixels, and the input resolution is set to 128 pixels, to suit use on mobile terminals.
- the encoder network uses convolution to halve the input and gradually increase the number of channels. Specifically, the input is encoded from 128*128*6 (the target face image and the face image to be replaced, each with 3 RGB channels) to 64*64*32, 32*32*64, 16*16*128, and so on.
- the decoder network gradually doubles the resolution through deconvolution operations and decodes it to 32*32*64, 64*64*32, 128*128*16, 256*256*3, and finally obtains the face-changing result.
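- As a non-limiting sketch, the tensor sizes listed above can be traced stage by stage: each encoder stage halves the resolution and doubles the channel count, and each decoder stage doubles the resolution and halves the channel count until the final 3-channel output. The function names are hypothetical, and a bottleneck of 16*16*128 (three encoder stages) is assumed because the decoder's first listed size is 32*32*64.

```python
def encoder_shapes(size=128, in_channels=6, first_out_channels=32, stages=3):
    """List (height, width, channels) after the input and each encoder stage."""
    shapes = [(size, size, in_channels)]
    channels = first_out_channels
    for _ in range(stages):
        size //= 2  # convolution halves the resolution
        shapes.append((size, size, channels))
        channels *= 2  # channel count grows stage by stage
    return shapes


def decoder_shapes(start=(16, 16, 128), stages=4, out_channels=3):
    """List (height, width, channels) after each decoder stage."""
    size, _, channels = start
    shapes = []
    for stage in range(stages):
        size *= 2  # deconvolution doubles the resolution
        # The last stage emits the RGB image; earlier stages halve the channels.
        channels = out_channels if stage == stages - 1 else channels // 2
        shapes.append((size, size, channels))
    return shapes
```

The trace makes the asymmetry visible: the input is 128*128 while the output is 256*256.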
- FIG10 is a schematic diagram of the face-changing effect in an embodiment of the present application.
- the target source facial image may be, for example, the facial image (a) in FIG10
- the target template facial image may be, for example, the facial image (b) in FIG10
- the facial replacement image may be, for example, the facial image (c) in FIG10.
- the facial image (c) is obtained by replacing the face in the facial image (a) with the face in the facial image (b). It can be seen from the facial image (c) that the identity and additional image of the facial image (c) are consistent with those of the facial image (b), that is, the facial image (c) and the facial image (b) are of the same person, and the facial image (c) includes the same glasses as the facial image (b).
- the attributes of face image (c) are consistent with those of face image (a). For example, it can be seen from face image (c) that the hairstyle of face image (c) is consistent with that of face image (a), and the angle of the mouth opening of face image (c) is larger than that of face image (b), thereby meeting the angle of the mouth opening of face image (a), thereby achieving the face-changing processing effect required by the user.
- the software module stored in the image processing model training device in the memory may include: a data transmission module 2081, configured to obtain a first training sample set, wherein the first training sample set includes at least one triple training sample, and the triple training sample includes: a source image, a template image, and a true value image; an image processing model training module 2082, configured to perform face-swapping on the source image and the template image through a first image processing model to obtain a first face-swapping image, wherein the first image processing model is a re-parameterized structure; the image processing model training module 2082 is configured to obtain a second image processing model corresponding to the first image processing model, wherein the second image processing model is a pre-trained image processing model; the image processing model training module 2082 is configured to calculate the fusion loss function of the first image processing model according to the second image processing
- the image processing model training module 2082 is further configured to determine a dynamic noise threshold that matches the usage environment of the first image processing model; denoise the first training sample set according to the dynamic noise threshold to form a second training sample set that matches the dynamic noise threshold; or, determine a fixed noise threshold corresponding to the second image processing model, and denoise the first training sample set according to the fixed noise threshold to form a second training sample set that matches the fixed noise threshold.
- the image processing model training module 2082 is also configured to obtain a facial image collected by a terminal in the use environment of the first image processing model; perform image augmentation processing on the facial image; determine the corresponding facial position based on the image augmentation processing result, and capture the facial image including the background image based on the facial position; and crop the facial image including the background image to obtain the source image.
- the image processing model training module 2082 is further configured to obtain a second face-swapped image output by the second image processing model, and use the first face-swapped image and the second face-swapped image to calculate the reconstruction loss function of the first image processing model; calculate the feature loss function of the first image processing model; calculate the estimation loss function of the first image processing model; calculate the adversarial loss function of the first image processing model; and fuse the reconstruction loss function, the feature loss function, the estimation loss function and the adversarial loss function to obtain the fusion loss function of the first image processing model.
- the image processing model training module 2082 is further configured to calculate the pixel level difference between the first face-swapped image and the second face-swapped image; and determine the reconstruction loss function of the first image processing model based on the pixel level difference.
- the image processing model training module 2082 is further configured to extract features from the first face-swapped image through a pre-trained feature extraction network to obtain features of multiple levels of the first face-swapped image;
- the second face-swapped image is subjected to feature extraction to obtain features of multiple levels of the second face-swapped image; and a feature loss function of the first image processing model is determined based on the difference between the features of multiple levels of the first face-swapped image and the features of multiple levels of the second face-swapped image.
- the image processing model training module 2082 is further configured to extract a first face-swapped image feature vector of the first face-swapped image; extract a source image feature vector of the source image; and calculate an estimated loss function of the first image processing model using the similarity between the first face-swapped image feature vector and the source image feature vector.
- the image processing model training module 2082 is also configured to obtain a target face image and a face image to be replaced when the first image processing model is trained and deployed in a mobile terminal; encode the target face image and the face image to be replaced through the encoder network of the first image processing model to obtain a face image vector; and decode the face image vector through the decoder network of the first image processing model to obtain a third face-changing image.
- the embodiment of the present application provides a computer program product, which includes a computer program or a computer executable instruction, and the computer executable instruction is stored in a computer readable storage medium.
- the processor of the electronic device reads the computer executable instruction from the computer readable storage medium, and the processor executes the computer executable instruction, so that the electronic device executes the image processing model training method described in the embodiment of the present application.
- An embodiment of the present application provides a computer-readable storage medium storing computer-executable instructions.
- when the computer-executable instructions are executed by a processor, the processor will execute the image processing model training method provided by the embodiment of the present application.
- the computer-readable storage medium may be a memory such as FRAM, ROM, PROM, EPROM, EEPROM, flash memory, magnetic surface storage, optical disk, or CD-ROM; or it may be various devices including one or any combination of the above memories.
- computer executable instructions may be in the form of a program, software, software module, script or code, written in any form of programming language (including compiled or interpreted languages, or declarative or procedural languages), and may be deployed in any form, including as a stand-alone program or as a module, component, subroutine or other unit suitable for use in a computing environment.
- computer-executable instructions may, but do not necessarily, correspond to a file in a file system, may be stored as part of a file that stores other programs or data, such as, for example, in one or more scripts in a HyperText Markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files storing one or more modules, subroutines, or code portions).
- computer executable instructions may be deployed to be executed on one electronic device, or on multiple electronic devices located at one site, or on multiple electronic devices distributed at multiple sites and interconnected by a communication network.
- the embodiment of the present application obtains a first training sample set, wherein the first training sample set includes at least one triplet training sample, and the triplet training sample includes: a source image, a template image, and a true value image; the first training sample set is processed by a first image processing model to obtain a first face-changing image, wherein the first image processing model is a re-parameterized structure, and a second image processing model corresponding to the first image processing model is obtained, wherein the second image processing model is a pre-trained image processing model, and the model parameters of the second image processing model are fixed; the fusion loss function of the first image processing model is calculated according to the second image processing model and the first face-changing image; the first image processing model is trained according to the fusion loss function, and when the convergence condition of the first image processing model is reached, the model parameters of the first image processing model are determined.
- the first image processing model is a re-parameterized structure
- the structure of the first image processing model is complex when it is trained.
- the first image processing model has strong processing capabilities and can learn complex data, while the structure is simple during testing, which can reduce the time consumption during testing and reduce the amount of floating-point operations, making it easier to deploy on mobile terminals.
- using the second image processing model for training guidance can steadily improve the accuracy of smaller-scale image processing models without increasing the total amount of training samples and without the need for retraining. It is also generally applicable to most neural network models and data.
- the training of smaller-scale image processing models takes into account the training accuracy while reducing the overfitting of the neural network model and enhancing the generalization ability of the neural network model, making it easier to deploy the image processing model in mobile terminals and realize large-scale application of image processing models.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Multimedia (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- Software Systems (AREA)
- Databases & Information Systems (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Medical Informatics (AREA)
- Computing Systems (AREA)
- Life Sciences & Earth Sciences (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Biology (AREA)
- General Engineering & Computer Science (AREA)
- Image Analysis (AREA)
Abstract
Description
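$$\text{BigModel\_fake} = \text{BigModel\_swap}(\text{source}, \text{template})$$

$$\text{Reconstruction\_loss} = \left|\text{BigModel\_fake} - \text{fake}\right| \tag{2}$$

$$\text{LPIPS\_loss} = \left|\text{result\_fea1} - \text{gt\_img\_fea1}\right| + \left|\text{result\_fea2} - \text{gt\_img\_fea2}\right| + \left|\text{result\_fea3} - \text{gt\_img\_fea3}\right| + \left|\text{result\_fea4} - \text{gt\_img\_fea4}\right| \tag{3}$$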
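Equations (2) and (3) can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: images and feature maps are represented as flat lists of floats, the absolute difference is averaged over elements (the text does not say whether a sum or a mean is used), and `fake` is assumed to be the output of the smaller model. The second function is only LPIPS-style; the published LPIPS metric also normalizes features and applies learned per-channel weights, which are omitted here.

```python
def l1_distance(a, b):
    """Mean absolute difference between two equally sized flat sequences."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def reconstruction_loss(big_model_fake, fake):
    """Equation (2): distance between the large model's output and `fake`."""
    return l1_distance(big_model_fake, fake)


def lpips_style_loss(result_feats, gt_img_feats):
    """Equation (3): sum of per-level feature distances (four levels in the text)."""
    if len(result_feats) != len(gt_img_feats):
        raise ValueError("both inputs need the same number of feature levels")
    return sum(l1_distance(r, g) for r, g in zip(result_feats, gt_img_feats))


if __name__ == "__main__":
    print(reconstruction_loss([1.0, 2.0], [1.0, 4.0]))
    print(lpips_style_loss([[0.0, 1.0], [2.0]], [[1.0, 1.0], [0.0]]))
```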
Claims (12)
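- An image processing model training method, performed by an electronic device, the method comprising: obtaining a first training sample set, wherein the first training sample set comprises at least one triplet training sample, and the triplet training sample comprises a source image, a template image, and a ground-truth image; performing face swapping on the source image and the template image through a first image processing model to obtain a first face-swapped image, wherein the first image processing model has a re-parameterized structure; obtaining a second image processing model corresponding to the first image processing model, wherein the second image processing model is a pre-trained image processing model; calculating a fusion loss function of the first image processing model according to the second image processing model, the first face-swapped image, and the ground-truth image; and training the first image processing model according to the fusion loss function, and determining model parameters of the first image processing model when a training convergence condition of the first image processing model is reached.
- The method according to claim 1, further comprising: determining a dynamic noise threshold that matches the usage environment of the first image processing model, and denoising the first training sample set according to the dynamic noise threshold to form a second training sample set that matches the dynamic noise threshold; or determining a fixed noise threshold corresponding to the second image processing model, and denoising the first training sample set according to the fixed noise threshold to form a second training sample set that matches the fixed noise threshold.
- The method according to claim 1, further comprising: obtaining a face image captured by a terminal in the usage environment of the first image processing model; performing image augmentation on the face image; determining a corresponding face position based on the result of the image augmentation, and extracting, based on the face position, a face image that includes a background image; and cropping the face image that includes the background image to obtain the source image.
- The method according to claim 1, wherein calculating the fusion loss function of the first image processing model according to the second image processing model, the first face-swapped image, and the ground-truth image comprises: obtaining a second face-swapped image output by the second image processing model, and calculating a reconstruction loss function of the first image processing model using the first face-swapped image and the second face-swapped image; calculating a feature loss function of the first image processing model based on the first face-swapped image and the second face-swapped image; calculating an estimation loss function of the first image processing model based on the first face-swapped image and the source image; calculating an adversarial loss function of the first image processing model based on the first face-swapped image and the ground-truth image; and fusing the reconstruction loss function, the feature loss function, the estimation loss function, and the adversarial loss function to obtain the fusion loss function of the first image processing model.
- The method according to claim 4, wherein calculating the reconstruction loss function of the first image processing model using the first face-swapped image and the second face-swapped image comprises: calculating a pixel-level difference between the first face-swapped image and the second face-swapped image; and determining the reconstruction loss function of the first image processing model according to the pixel-level difference.
- The method according to claim 4, wherein calculating the feature loss function of the first image processing model based on the first face-swapped image and the second face-swapped image comprises: performing feature extraction on the first face-swapped image through a pre-trained feature extraction network to obtain features of the first face-swapped image at multiple levels; performing feature extraction on the second face-swapped image through the pre-trained feature extraction network to obtain features of the second face-swapped image at multiple levels; and determining the feature loss function of the first image processing model based on the differences between the multi-level features of the first face-swapped image and the multi-level features of the second face-swapped image.
- The method according to claim 4, wherein calculating the estimation loss function of the first image processing model based on the first face-swapped image and the source image comprises: extracting a first face-swapped image feature vector from the first face-swapped image; extracting a source image feature vector from the source image; and calculating the estimation loss function of the first image processing model using the similarity between the first face-swapped image feature vector and the source image feature vector.
- The method according to claim 1, further comprising: when training of the first image processing model is complete and the model is deployed on a mobile terminal, obtaining a target face image and a to-be-replaced face image; encoding the target face image and the to-be-replaced face image through an encoder network of the first image processing model to obtain a face image vector; and decoding the face image vector through a decoder network of the first image processing model to obtain a third face-swapped image.
- An image processing model training apparatus, comprising: a data transmission module, configured to obtain a first training sample set, wherein the first training sample set comprises at least one triplet training sample, and the triplet training sample comprises a source image, a template image, and a ground-truth image; and an image processing model training module, configured to perform face swapping on the source image and the template image through a first image processing model to obtain a first face-swapped image, wherein the first image processing model has a re-parameterized structure; the image processing model training module being further configured to obtain a second image processing model corresponding to the first image processing model, wherein the second image processing model is a pre-trained image processing model; the image processing model training module being further configured to calculate a fusion loss function of the first image processing model according to the second image processing model, the first face-swapped image, and the ground-truth image; and the image processing model training module being further configured to train the first image processing model according to the fusion loss function, and to determine model parameters of the first image processing model when a training convergence condition of the first image processing model is reached.
- An electronic device, comprising: a memory, configured to store computer-executable instructions; and a processor, configured to implement the image processing model training method according to any one of claims 1 to 8 when running the computer-executable instructions stored in the memory.
- A computer program product, comprising computer-executable instructions that, when executed by a processor, implement the image processing model training method according to any one of claims 1 to 8.
- A computer-readable storage medium, storing computer-executable instructions that, when executed by a processor, implement the image processing model training method according to any one of claims 1 to 8.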
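Claims 4 and 7 describe an estimation loss computed from the similarity of two feature vectors and a fusion of four loss terms. The sketch below shows one plausible reading and is not taken from the patent: it assumes cosine similarity as the similarity measure, defines the estimation loss as one minus that similarity, and fuses the terms as a weighted sum. The function names and the default weights are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def estimation_loss(swapped_vec, source_vec):
    """Assumed form: 0 when the two feature vectors point the same way."""
    return 1.0 - cosine_similarity(swapped_vec, source_vec)


def fusion_loss(reconstruction, feature, estimation, adversarial,
                weights=(1.0, 1.0, 1.0, 1.0)):
    """Weighted sum of the four loss terms; the weights are illustrative."""
    terms = (reconstruction, feature, estimation, adversarial)
    return sum(w * t for w, t in zip(weights, terms))


if __name__ == "__main__":
    est = estimation_loss([1.0, 0.0], [0.0, 1.0])
    print(fusion_loss(1.0, 2.0, est, 0.25))
```

In practice the weights would be tuned so that no single term dominates the gradient; the patent text quoted here does not give their values.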
Priority Applications (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| EP23887696.5A EP4560588A4 (en) | 2022-11-09 | 2023-10-08 | IMAGE PROCESSING MODEL TRAINING METHOD AND APPARATUS, AND ELECTRONIC DEVICE, COMPUTER-READABLE STORAGE MEDIUM, AND COMPUTER PROGRAM PRODUCT |
| US18/813,622 US20240420288A1 (en) | 2022-11-09 | 2024-08-23 | Method and apparatus for training image processing model, electronic device, computer-readable storage medium, and computer program product |
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202211397807.4A CN117011665A (zh) | 2022-11-09 | 2022-11-09 | Image processing model training method and apparatus, electronic device, and storage medium |
| CN202211397807.4 | 2022-11-09 |
Related Child Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US18/813,622 Continuation US20240420288A1 (en) | 2022-11-09 | 2024-08-23 | Method and apparatus for training image processing model, electronic device, computer-readable storage medium, and computer program product |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2024099004A1 (zh) | 2024-05-16 |
Family
ID=88569795
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/CN2023/123450 Ceased WO2024099004A1 (zh) | 2022-11-09 | 2023-10-08 | Image processing model training method and apparatus, electronic device, computer-readable storage medium, and computer program product |
Country Status (4)
| Country | Link |
|---|---|
| US (1) | US20240420288A1 (zh) |
| EP (1) | EP4560588A4 (zh) |
| CN (1) | CN117011665A (zh) |
| WO (1) | WO2024099004A1 (zh) |
Cited By (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
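| CN119600104A (zh) * | 2024-11-20 | 2025-03-11 | 深圳市大成机电技术有限公司 | Machine-learning-based visual positioning method and system for industrial robots |
| CN120807296A (zh) * | 2025-09-16 | 2025-10-17 | 济南大学 | Knowledge-distillation-based medical image super-resolution reconstruction method and system |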
Families Citing this family (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN119443195A (zh) * | 2023-07-30 | 2025-02-14 | 鸿海精密工业股份有限公司 | Machine learning method |
| US20250245886A1 (en) * | 2024-01-31 | 2025-07-31 | Google Llc | Optimization of overall editing vector to achieve target expression photo editing effect |
Citations (10)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
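| CN110826593A (zh) * | 2019-09-29 | 2020-02-21 | 腾讯科技(深圳)有限公司 | Training method for fused image processing model, image processing method, apparatus, and storage medium |
| CN111783603A (zh) * | 2020-06-24 | 2020-10-16 | 有半岛(北京)信息科技有限公司 | Generative adversarial network training method, image face-swapping and video face-swapping methods, and apparatus |
| CN111860043A (zh) * | 2019-04-26 | 2020-10-30 | 北京陌陌信息技术有限公司 | Face-swapping model training method, apparatus, device, and medium |
| CN113411425A (zh) * | 2021-06-21 | 2021-09-17 | 深圳思谋信息科技有限公司 | Video super-resolution model construction and processing method, apparatus, computer device, and medium |
| US20220004803A1 (en) * | 2020-06-29 | 2022-01-06 | L'oreal | Semantic relation preserving knowledge distillation for image-to-image translation |
| CN114004772A (zh) * | 2021-09-30 | 2022-02-01 | 阿里巴巴(中国)有限公司 | Image processing method, and method, system, and device for determining an image synthesis model |
| CN114387656A (zh) * | 2022-01-14 | 2022-04-22 | 平安科技(深圳)有限公司 | Artificial-intelligence-based face-swapping method, apparatus, device, and storage medium |
| CN114611700A (zh) * | 2022-01-23 | 2022-06-10 | 杭州领见数字农业科技有限公司 | Method and apparatus for improving model inference speed based on structural re-parameterization |
| WO2022179401A1 (zh) * | 2021-02-26 | 2022-09-01 | 腾讯科技(深圳)有限公司 | Image processing method and apparatus, computer device, storage medium, and program product |
| CN115294423A (zh) * | 2022-08-15 | 2022-11-04 | 网易(杭州)网络有限公司 | Model determination method, image processing method, apparatus, device, and storage medium |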
Family Cites Families (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN111881838B (zh) * | 2020-07-29 | 2023-09-26 | 清华大学 | Method and device for analyzing movement disorder assessment videos with privacy protection |
| US12452385B2 (en) * | 2022-03-29 | 2025-10-21 | Disney Enterprises, Inc. | Method and system for deep learning based face swapping with multiple encoders |
| US12211178B2 (en) * | 2022-04-21 | 2025-01-28 | Adobe Inc. | Transferring faces between digital images by combining latent codes utilizing a blending network |
2022
- 2022-11-09 CN CN202211397807.4A patent/CN117011665A/zh active Pending
2023
- 2023-10-08 EP EP23887696.5A patent/EP4560588A4/en active Pending
- 2023-10-08 WO PCT/CN2023/123450 patent/WO2024099004A1/zh not_active Ceased
2024
- 2024-08-23 US US18/813,622 patent/US20240420288A1/en active Pending
Non-Patent Citations (1)
| Title |
|---|
| See also references of EP4560588A4 * |
Also Published As
| Publication number | Publication date |
|---|---|
| EP4560588A4 (en) | 2025-10-22 |
| CN117011665A (zh) | 2023-11-07 |
| EP4560588A1 (en) | 2025-05-28 |
| US20240420288A1 (en) | 2024-12-19 |
Similar Documents
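| Publication | Publication Date | Title |
|---|---|---|
| Ren et al. | | Low-light image enhancement via a deep hybrid network |
| Wan et al. | | CoRRN: Cooperative reflection removal network |
| US10614574B2 (en) | | Generating image segmentation data using a multi-branch neural network |
| WO2024099004A1 (zh) | | Image processing model training method and apparatus, electronic device, computer-readable storage medium, and computer program product |
| CN112528764B (zh) | | Facial expression recognition method, system, and apparatus, and readable storage medium |
| CN111079764B (zh) | | Deep-learning-based low-illumination license plate image recognition method and apparatus |
| CN111833360B (zh) | | Image processing method, apparatus, device, and computer-readable storage medium |
| EP4668206A1 (en) | | Image enhancement method and apparatus, electronic device, computer-readable storage medium, and computer program product |
| CN112750176B (zh) | | Image processing method and apparatus, electronic device, and storage medium |
| CN112329752B (zh) | | Training method for human-eye image processing model, image processing method, and apparatus |
| CN113762032B (zh) | | Image processing method and apparatus, electronic device, and storage medium |
| CN113658065A (zh) | | Image denoising method and apparatus, computer-readable medium, and electronic device |
| EP4617999A1 (en) | | Image enhancement method and apparatus, electronic device, computer-readable storage medium and computer program product |
| CN119693632B (zh) | | Object recognition method and apparatus, storage medium, and electronic device |
| CN111382654A (zh) | | Image processing method and apparatus, and storage medium |
| CN114694065B (zh) | | Video processing method and apparatus, computer device, and storage medium |
| CN116958306A (zh) | | Image synthesis method and apparatus, storage medium, and electronic device |
| CN115861637A (zh) | | Salient object detection method based on a dual-branch network |
| CN113822117B (zh) | | Data processing method, device, and computer-readable storage medium |
| CN115965839A (zh) | | Image recognition method, storage medium, and device |
| CN118411320A (zh) | | Image reconstruction method, electronic device, and storage medium |
| CN116797466B (zh) | | Image processing method, apparatus, device, and readable storage medium |
| CN116958758A (zh) | | Data processing method and apparatus, device, storage medium, and program product |
| CN116777766A (zh) | | Image enhancement method, electronic device, and storage medium |
| Li et al. | | RSID: A remote sensing image dehazing network |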
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | EP: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 23887696; Country of ref document: EP; Kind code of ref document: A1 |
| | WWE | WIPO information: entry into national phase | Ref document number: 2023887696; Country of ref document: EP |
| | ENP | Entry into the national phase | Ref document number: 2023887696; Country of ref document: EP; Effective date: 20250219 |
| | WWE | WIPO information: entry into national phase | Ref document number: 11202500820S; Country of ref document: SG |
| | WWP | WIPO information: published in national office | Ref document number: 11202500820S; Country of ref document: SG |
| | WWP | WIPO information: published in national office | Ref document number: 2023887696; Country of ref document: EP |
| | NENP | Non-entry into the national phase | Ref country code: DE |