WO2025053602A1 - Method for generating a partial region of an image using a generative model, and electronic device performing the same - Google Patents
- Publication number
- WO2025053602A1 (application PCT/KR2024/013315)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- image
- present disclosure
- electronic device
- encoder
- generated image
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T11/00—Two-dimensional [2D] image generation
- G06T11/60—Creating or editing images; Combining images with text
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G06N3/0455—Auto-encoder networks; Encoder-decoder networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0475—Generative networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/088—Non-supervised learning, e.g. competitive learning
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/094—Adversarial learning
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T5/00—Image enhancement or restoration
- G06T5/60—Image enhancement or restoration using machine learning, e.g. neural networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T5/00—Image enhancement or restoration
- G06T5/70—Denoising; Smoothing
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T5/00—Image enhancement or restoration
- G06T5/77—Retouching; Inpainting; Scratch removal
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2210/00—Indexing scheme for image generation or computer graphics
- G06T2210/12—Bounding box
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2211/00—Image generation
- G06T2211/40—Computed tomography
- G06T2211/441—AI-based methods, deep learning or artificial neural networks
Definitions
- the present disclosure relates to a method for processing an image, and more particularly, to a method for generating an image using a generative model and an electronic device performing the same.
- Generative AI technology refers to a technology that learns the patterns and structures of massive training data and, based on what it has learned, generates new data similar to the input data. Using generative AI technology, an image corresponding to text can be obtained, or an image can be expanded to an area that was not included in the original image.
- Generative AI technology can be applied to the field of image processing to support outpainting or inpainting.
- Outpainting is expanding an image while maintaining the style and content of the image, and inpainting is generating an image to fill a specific area within the image.
- a method for generating a partial region of an image using a generative model may be provided.
- the method may include a step of obtaining an image including information of the partial region, a step of obtaining an intermediate generated image including first image information for the partial region by inputting the image into a first generative model, and a step of obtaining a final generated image including second image information, the second image information being at least partially different from the first image information by inputting the image and the intermediate generated image into a second generative model.
- an electronic device may be provided.
- the electronic device may include a memory storing at least one instruction, and at least one processor, whereby the at least one processor executes at least one instruction stored in the memory, such that the electronic device may obtain an image including information of a portion of the region, input the image into a first generation model, thereby obtaining an intermediate generation image including first image information for the portion of the region, and input the image and the intermediate generation image into a second generation model, thereby obtaining a final generation image including second image information at least partially different from the first image information.
- a computer-readable recording medium having recorded thereon a program for performing a method of generating a portion of an image using a generative model on a computer may be provided.
- FIG. 1 is a conceptual diagram illustrating an electronic device for generating a portion of an image using a generative model according to one embodiment of the present disclosure.
- FIG. 2 is a conceptual diagram illustrating an encoder and a decoder according to one embodiment of the present disclosure.
- FIG. 3 is a conceptual diagram illustrating a mask map according to one embodiment of the present disclosure.
- FIG. 4 is a conceptual diagram for explaining the operation of a second generation model according to one embodiment of the present disclosure.
- FIG. 5a is a conceptual diagram for explaining text guidance input to a second generation model according to one embodiment of the present disclosure.
- FIG. 5b is a conceptual diagram for explaining image guidance input to a second generation model according to one embodiment of the present disclosure.
- FIG. 5c is a conceptual diagram for explaining text guidance and image guidance input to a second generation model according to one embodiment of the present disclosure.
- FIG. 5d is a conceptual diagram illustrating an embodiment in which the output of the first generation model according to one embodiment of the present disclosure is used as image guidance.
- FIG. 5e is a conceptual diagram illustrating an embodiment in which the output of the first generation model according to one embodiment of the present disclosure is used as image guidance.
- FIG. 5f is a conceptual diagram illustrating an embodiment in which the output of the first generative model is used as image guidance together with text guidance according to one embodiment of the present disclosure.
- FIG. 5g is a conceptual diagram illustrating an embodiment in which the output of the first generation model is used as image guidance together with text guidance according to one embodiment of the present disclosure.
- FIG. 6 is a conceptual diagram for explaining a learning method of a first generation model according to one embodiment of the present disclosure.
- FIGS. 7A and 7B are conceptual diagrams illustrating the configuration of a second generation model according to one embodiment of the present disclosure.
- FIG. 8 is a conceptual diagram for explaining the configuration of an interpreter according to one embodiment of the present disclosure.
- FIGS. 9A and 9B are conceptual diagrams illustrating an embodiment of adding noise to an intermediate generated image according to one embodiment of the present disclosure.
- FIGS. 10A to 10C are conceptual diagrams illustrating an example of determining a denoising intensity according to the quality of an intermediate generated image according to one embodiment of the present disclosure.
- FIG. 11 is a conceptual diagram illustrating an electronic device for generating a portion of an image using a generative model according to one embodiment of the present disclosure.
- FIG. 19 is a flowchart for explaining detailed steps of step S1430 of FIG. 14.
- a function related to 'Artificial Intelligence' is operated through a processor and a memory.
- the processor may be composed of one or more processors.
- one or more processors may be a general-purpose processor such as a CPU, an AP, a DSP (Digital Signal Processor), a graphic-only processor such as a GPU, a VPU (Vision Processing Unit), or an AI-only processor such as an NPU.
- One or more processors control input data to be processed according to a predefined operation rule or AI model stored in a memory.
- the AI-only processor may be designed with a hardware structure specialized for processing a specific AI model.
- the 'artificial intelligence model' may include a neural network model.
- the neural network model may be composed of a plurality of neural network layers.
- Each of the plurality of neural network layers has a plurality of weight values, and performs a neural network operation through an operation between the operation result of the previous layer and the plurality of weights.
- the plurality of weights of the plurality of neural network layers may be optimized by the learning result of the artificial intelligence model. For example, the plurality of weights may be updated so that a loss value or a cost value obtained from the artificial intelligence model during the learning process is reduced or minimized.
- a 'generative model' may represent a model that generates new data based on given input data.
- the generative model may receive various types of data (e.g., text, images, videos, sounds, random vector data, etc.) as input.
- the generative model may generate new text, images, videos, sounds, or combinations thereof.
- the generative model may be an artificial intelligence model.
- the generative model may include a generative adversarial network (GAN) model, a variational autoencoder (VAE) model, a diffusion model, a transformer, etc., but is not limited to the examples described above.
- 'inpainting' may refer to the task of performing inference on an internal region of an original image.
- 'outpainting' may refer to the task of performing inference on an external area of an original image.
- a 'mask' may represent an area (or an unknown area) that requires inference among the entire area of a specific image.
- a 'masked image' may include pixel information corresponding to an unmasked area and boundary information for an area corresponding to the mask.
- the pixel information may include location information (e.g., coordinate values) and color information (e.g., RGB values) of a pixel.
- a 'mask map' may represent data that distinguishes a masked area and an unmasked area among the entire area of an image.
- the mask map may be a binarization map.
- a masked area may be expressed by a first value, and an unmasked area may be expressed by a second value.
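As a minimal sketch of such a binarization map, the snippet below builds a mask map from a bounding box. The function name `make_mask_map`, the box layout `(top, left, bottom, right)`, and the choice of 1 and 0 as the first and second values are assumptions made for illustration; the disclosure does not fix any of them.

```python
def make_mask_map(height, width, box, masked=1, unmasked=0):
    """Return a height x width binary mask map as a list of rows.

    box is (top, left, bottom, right); bottom and right are exclusive.
    Pixels inside the box get the `masked` value, all others `unmasked`.
    """
    top, left, bottom, right = box
    return [
        [masked if top <= y < bottom and left <= x < right else unmasked
         for x in range(width)]
        for y in range(height)
    ]


# A 4 x 6 image whose unknown region is a 2 x 3 box.
mask_map = make_mask_map(4, 6, box=(1, 2, 3, 5))
for row in mask_map:
    print(row)
```

A box that extends past the image border would simply be clipped by the range checks, which is one way an outpainting region could be handled after the canvas has been enlarged.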
- 'denoising' may be defined as a single operation in which data is input to a second generative model and an output is obtained from the second generative model.
- 'denoising order' may be defined as the number of times that denoising is repeated.
- 'denoising total order' may be defined as a hyperparameter for how many times denoising is repeated. In one embodiment of the present disclosure, the denoising total order may vary depending on the settings of a user or a manufacturer.
- 'guidance' may represent information indicating an image to be inferred.
- the information may be any information indicating an image to be inferred.
- the guidance may include text guidance generated from text or image guidance generated from an image, but the type of guidance is not limited thereto.
- the guidance or guidance information may be input to the second generative model while performing a denoising operation.
- FIG. 1 is a conceptual diagram illustrating an electronic device that generates a portion of an image using a generation model according to one embodiment of the present disclosure.
- the electronic device (1000) can obtain a final generated image from a masked image using a first generation model (1100) and a second generation model (1200).
- each of the first generation model (1100) and the second generation model (1200) can generate an entire image inferring an unknown region based on an image in which an unknown region exists.
- the first generation model (1100) can generate a first image based on an input image, and the second generation model (1200) can generate a second image based on the input image.
- the unknown region may be a region in which image information is not included in the input image or image information intended by a user is not included.
- Each of the first generation model (1100) and the second generation model (1200) can perform inpainting or outpainting on the input image. According to one embodiment of the present disclosure, the speed and/or performance of image inference can be dramatically improved by utilizing the output of the first generative model in the inference operation of the second generative model.
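The two-stage flow described here can be sketched as a function that takes the two models as callables. Everything in the snippet is hypothetical scaffolding: `generate_partial_region`, `toy_first_model`, and `toy_second_model` are illustrative names, the image is a toy grayscale grid, and the toy models stand in for the GAN and diffusion models only to show how the intermediate generated image is passed along.

```python
def generate_partial_region(masked_image, mask_map, first_model, second_model):
    """Run the first model, then feed its output to the second model."""
    intermediate = first_model(masked_image, mask_map)
    return second_model(masked_image, mask_map, intermediate)


def toy_first_model(image, mask):
    """Fast, coarse guess: fill masked pixels with the mean of known pixels."""
    known = [p for row, mrow in zip(image, mask)
             for p, m in zip(row, mrow) if m == 0]
    fill = sum(known) / len(known)
    return [[fill if m == 1 else p for p, m in zip(row, mrow)]
            for row, mrow in zip(image, mask)]


def toy_second_model(image, mask, intermediate):
    """Refinement: smooth each masked pixel with its horizontal neighbours."""
    out = []
    for row, mrow, irow in zip(image, mask, intermediate):
        new_row = []
        for x, (p, m) in enumerate(zip(row, mrow)):
            if m == 1:
                neighbours = irow[max(0, x - 1): x + 2]
                new_row.append(sum(neighbours) / len(neighbours))
            else:
                new_row.append(p)  # known pixels are left untouched
        out.append(new_row)
    return out


image = [[0.2, 0.4, 0.0, 0.0],
         [0.6, 0.8, 0.0, 0.0]]
mask = [[0, 0, 1, 1],
        [0, 0, 1, 1]]
final = generate_partial_region(image, mask, toy_first_model, toy_second_model)
print(final)
```

The point of the structure is that the second model receives both the original masked image and the intermediate result, so its output for the masked region can differ from the first model's guess.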
- the electronic device (1000) may be implemented in various forms.
- the electronic device (1000) may include a personal computer (PC), a terminal, a portable telephone, a smart phone, a tablet PC, a handheld device, a wearable device, a server device, etc.
- the electronic device (1000) may acquire an image including information of a portion of an image.
- the portion of the image may be predefined according to a user or manufacturer's setting.
- the portion of the image may be referred to as a target portion or a candidate portion.
- the information of the portion of the image may also be referred to as information about the portion of the image, information related to the portion of the image, or information corresponding to the portion of the image.
- the information of the portion of the image may include location information corresponding to the portion of the image and/or boundary information between the portion of the image and the portion excluding the portion of the image.
- the information of the portion of the image may include location information of a bounding box on the image.
- the information of the portion of the image may include pixel coordinate values on the image corresponding to the portion of the image.
- the portion of the image may also be referred to as a mask.
- the electronic device (1000) may include a sensor and a display.
- the electronic device (1000) may obtain an original image of an object by using the sensor.
- the sensor may include, but is not limited to, a charge-coupled device (CCD) sensor and a complementary metal-oxide-semiconductor (CMOS) sensor.
- the sensor may be referred to as a camera.
- the original image may include, but is not limited to, an RGB image.
- the electronic device (1000) may display the original image through the display.
- the electronic device (1000) may obtain a user input for a portion of the original image through a user interface.
- the portion may include at least one of an inner portion and an outer portion of the original image.
- the electronic device (1000) may generate an image including information of the portion of the area based on the user input.
- the image including information of the portion of the area may also be referred to as a masked image.
- an image including information of a portion of a region may be configured in the form of an original image with a mask added thereto.
- the electronic device (1000) may generate a mask map based on at least one of an image including information of a portion of a region and a user input.
- the electronic device (1000) may receive at least one of an original image, information of a portion of a region, and a mask map from an external server.
- the electronic device (1000) can input an image including information of a certain area into the first generation model (1100).
- the first generation model (1100) can include a pre-trained GAN model.
- the first generation model (1100) can include a generator model among the pre-trained GAN models.
- the electronic device (1000) can input an image pair consisting of a masked image and a mask map into the first generation model.
- the electronic device (1000) can obtain an intermediate generated image including first image information for a portion of the region from the first generation model (1100).
- the intermediate generated image can include color information of at least one pixel corresponding to the portion of the region.
- the intermediate generated image can include color information of pixels corresponding to the entire region together with color information of at least one pixel corresponding to the portion of the region.
- the present disclosure is not limited thereto, and the intermediate generated image can include color information of at least one pixel corresponding to the portion of the region and not include color information of pixels excluding the portion of the region.
- the electronic device (1000) can input an image and an intermediate generated image including information of a portion of the region to the second generation model (1200).
- the second generation model (1200) can include an artificial intelligence model that restores the image from noise.
- the second generation model (1200) can include a pre-learned diffusion model.
- the first generation model (1100) may be composed of fewer layers and/or fewer weight values than the second generation model (1200). In one embodiment of the present disclosure, the processing speed of the first generation model (1100) may be faster than the processing speed of the second generation model. In one embodiment of the present disclosure, the memory capacity occupied by the first generation model (1100) may be smaller than the memory capacity occupied by the second generation model (1200).
- the intermediate generated image may be preprocessed before being input to the second generation model (1200).
- the intermediate generated image may be converted into an image embedding and/or a text embedding corresponding to the image.
- 'embedding' may represent low-dimensional data converted from high-dimensional data.
- the embedding may also be referred to as an embedding vector, a feature vector, a feature representation, a latent vector, or a latent representation.
- the present disclosure is not limited thereto, and the embedding may include low-dimensional data converted from high-dimensional data in another manner.
- the electronic device (1000) can generate a connected image based on an image including information of a portion of the region and noise information. For example, the electronic device (1000) can concatenate an image including information of a portion of the region to a predefined initial noise.
- the electronic device (1000) can input the connected image to a second generation model (1200).
- the electronic device (1000) can concatenate the image including information of a portion of the region to an output of the second generation model (1200).
- the electronic device (1000) can input the connected image to the second generation model (1200).
- the electronic device (1000) can repeat the operation of inputting the connected image to the second generation model (1200). For example, the electronic device (1000) can repeatedly perform the operation of inputting the connected image to the second generation model (1200) as many times as the total number of predefined denoising orders.
- the electronic device (1000) can determine whether the operation has been repeated a predefined total number of denoising orders. Based on determining that the operation has been repeated a predefined total number of denoising orders, the electronic device (1000) can obtain a final generated image from the second generation model (1200).
- the second generation model (1200) may include at least one layer.
- the electronic device (1000) may input an intermediate generation image or image information (e.g., image embedding) corresponding to the intermediate generation image into at least one layer of the second generation model (1200).
- the electronic device (1000) can obtain a denoising strength for an intermediate generated image.
- the denoising strength can represent a numerical value indicating how strongly noise is to be added to the image. For example, assuming that the denoising strength has a value between 0 and 1, no noise is added to the image when the denoising strength is 0, and the image can be changed into completely random noise when the denoising strength is 1.
- the electronic device (1000) can determine the amount of noise to be added to the intermediate generated image based on the denoising strength.
- the amount of noise can represent the degree of noise to be added to the intermediate generated image.
- the denoising strength can be predefined according to a setting of a user or a manufacturer.
- the electronic device (1000) can obtain a user input corresponding to a denoising intensity through a user interface.
- the electronic device (1000) can determine a denoising intensity based on the user input.
- the electronic device (1000) can add noise to the intermediate generated image based on the denoising intensity.
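One way to realise the two endpoints described above (strength 0 leaves the image unchanged, strength 1 yields pure random noise) is a variance-preserving blend. This blend, the function name `add_noise`, and the flat list of pixel values are assumptions for illustration; the disclosure does not specify the noise schedule.

```python
import math
import random


def add_noise(pixels, strength, rng):
    """Blend pixel values with Gaussian noise according to `strength`.

    strength = 0.0 returns the input unchanged; strength = 1.0 returns
    pure Gaussian noise. Intermediate values mix the two so that the
    squared weights sum to one (an assumed, not disclosed, schedule).
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must lie in [0, 1]")
    keep = math.sqrt(1.0 - strength)
    mix = math.sqrt(strength)
    return [keep * p + mix * rng.gauss(0.0, 1.0) for p in pixels]


intermediate_pixels = [0.1, 0.5, 0.9]
print(add_noise(intermediate_pixels, 0.0, random.Random(0)))
print(add_noise(intermediate_pixels, 0.5, random.Random(0)))
```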
- the second generation model (1200) may receive the intermediate generated image.
- the second generation model (1200) may obtain a denoising intensity from the intermediate generated image.
- the second generation model (1200) may add noise to the intermediate generated image based on the denoising intensity.
- the intermediate generated image with added noise may be input to at least one layer of the second generation model (1200).
- the electronic device (1000) can measure the quality of an intermediate generated image. For example, the electronic device (1000) can obtain a predicted confidence value based on the intermediate generated image. The predicted confidence value can indicate a degree to which the intermediate generated image output (or predicted, inferred) by the first generation model (1100) can be trusted.
- the electronic device (1000) can determine a denoising intensity based on the predicted confidence value. For example, denoising intensities corresponding to a plurality of threshold ranges can be mapped in advance.
- a first denoising intensity can be mapped to a first threshold range, and a second denoising intensity can be mapped to a second threshold range.
- the electronic device (1000) can determine a threshold range corresponding to a predicted confidence value among the plurality of threshold ranges.
- the electronic device (1000) can add noise to the intermediate generated image with a denoising intensity mapped to a determined threshold range.
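The pre-mapped threshold ranges can be sketched as a sorted lookup. The boundary values, the intensities, and the direction of the mapping (lower confidence leading to stronger noise) are assumptions chosen for illustration; `intensity_for_confidence` is a hypothetical helper name.

```python
from bisect import bisect_right

# Hypothetical threshold ranges over a confidence value in [0, 1]:
#   [0.0, 0.3) -> 0.9, [0.3, 0.7) -> 0.5, [0.7, 1.0] -> 0.2
CONFIDENCE_BOUNDS = [0.3, 0.7]
DENOISING_INTENSITIES = [0.9, 0.5, 0.2]


def intensity_for_confidence(confidence):
    """Return the denoising intensity mapped to the range containing `confidence`."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    return DENOISING_INTENSITIES[bisect_right(CONFIDENCE_BOUNDS, confidence)]


for c in (0.1, 0.3, 0.95):
    print(c, intensity_for_confidence(c))
```

Under this assumed direction, an intermediate generated image that is judged trustworthy receives little added noise and therefore stays close to the first model's output.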
- the electronic device (1000) can determine a target denoising order corresponding to an intermediate generated image with added noise among a predefined total number of denoising orders based on a denoising intensity.
- the electronic device (1000) can set a current denoising order as the target denoising order.
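A minimal sketch of choosing the target denoising order follows. It assumes a linear mapping from denoising intensity to order, with order T corresponding to pure noise and order 0 to a clean image; the disclosure does not state the mapping, and `target_denoising_order` is a hypothetical name.

```python
def target_denoising_order(intensity, total_orders):
    """Map a denoising intensity in [0, 1] to a starting order in [0, T].

    Assumed linear mapping: intensity 1.0 starts from order T (pure noise),
    intensity 0.0 starts from order 0 (no denoising needed).
    """
    if not 0.0 <= intensity <= 1.0:
        raise ValueError("intensity must lie in [0, 1]")
    if total_orders < 0:
        raise ValueError("total_orders must be non-negative")
    return min(total_orders, max(0, round(intensity * total_orders)))


T = 50
current_order = target_denoising_order(0.4, T)
print(current_order)  # denoising then runs from this order down to 0
```

Starting at an order below T means fewer denoising passes are needed than when starting from pure noise, which is consistent with the speed benefit the disclosure attributes to reusing the first model's output.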
- the electronic device (1000) can post-process the image output by the second generation model (1200) to obtain a final generated image.
- the electronic device (1000) can obtain a final generated image based on (i) image information corresponding to the partial region (i.e., the unknown portion or target portion) of the image output by the second generation model (1200) and (ii) image information corresponding to the remaining region (i.e., the known portion) of the initially input image (which may also be referred to as an image including information of the partial region).
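This post-processing step can be sketched as a per-pixel selection driven by the mask map: generated pixels are used where the mask is set, and original pixels elsewhere. The function name `composite` and the convention that 1 marks the masked region are assumptions for illustration.

```python
def composite(original, generated, mask_map):
    """Take generated pixels in the masked region and original pixels elsewhere."""
    return [
        [g if m == 1 else o for o, g, m in zip(orow, grow, mrow)]
        for orow, grow, mrow in zip(original, generated, mask_map)
    ]


original = [[1, 2], [3, 4]]
generated = [[9, 8], [7, 6]]
mask_map = [[0, 1], [1, 0]]
print(composite(original, generated, mask_map))
```

Copying the known pixels back from the input guarantees that the region outside the mask is bit-identical to the original image, whatever the second model produced there.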
- FIG. 2 is a conceptual diagram illustrating an encoder and a decoder according to one embodiment of the present disclosure. Any content that overlaps with that described in FIG. 1 is omitted.
- the electronic device (1000) may include a first encoder (1310) and a decoder (1400).
- the first encoder (1310) may encode an image.
- the decoder (1400) may output an image by decoding an image embedding.
- the first encoder (1310) and the decoder (1400) may be pre-learned artificial intelligence models.
- the first encoder (1310) and the decoder (1400) can be implemented as a variational autoencoder (VAE) model or a vector quantized generative adversarial network (VQGAN).
- the first encoder (1310) can be an encoder part of a variational autoencoder or a vector quantized generative adversarial network.
- the decoder (1400) can be a decoder part of a variational autoencoder or a vector quantized generative adversarial network.
- the first encoder (1310) and the decoder (1400) can be trained in a learning manner of a variational autoencoder model or a vector quantized generative adversarial network.
- the output of the first encoder (1310) can be input to the decoder (1400).
- the first encoder (1310) and the decoder (1400) can be trained together by comparing the input of the first encoder (1310) and the output of the decoder (1400).
- the present disclosure is not limited thereto, and thus, according to other embodiments, the first encoder (1310) and/or the decoder (1400) may be implemented in other ways.
- the first encoder (1310) can output a latent vector corresponding to an input image.
- the latent vector can represent a probability value based on a Gaussian probability distribution expressed by a mean and a variance.
- the first encoder (1310) can transmit the latent vector to the second generative model.
- the second generation model (1200) can output a latent vector (hereinafter, also referred to as final noise).
- the decoder (1400) can output a final generated image by decoding the latent vector, which is an output of the second generation model (1200).
- FIG. 3 is a conceptual diagram for explaining a mask map according to one embodiment of the present disclosure. Any content overlapping with that described in FIGS. 1 and 2 is omitted.
- the electronic device (1000) can obtain a mask map including location information of a portion of an image (e.g., an unknown portion or a target portion to be inferred). For example, the electronic device (1000) can obtain a mask map corresponding to an image from an external server. For example, the electronic device (1000) can generate a mask map based on the image. For example, the electronic device (1000) can generate a mask map in which a portion of an image (e.g., an unknown portion or a target portion) is expressed as a first value (e.g., a white portion of FIG.
- the mask map can be a binary image composed of the first value or the second value.
- the present disclosure is not limited thereto, and thus the mask map can be defined in another format.
- the electronic device (1000) can concatenate an image and a mask map.
- the image can be expressed as a three-channel image with each of the R value, the G value, and the B value of the RGB value as one channel.
- the mask map can be expressed as a one-channel image.
- the image to which the mask map is connected can have a total of four channels.
- the present disclosure is not limited thereto, and the number of channels of the image or the number of channels of the mask map is not limited thereto.
- a channel can represent one dimension of three-dimensional input data.
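The channel arithmetic above (three image channels plus one mask channel giving four) can be illustrated with a channel-first layout. The layout, the helper name `concat_image_and_mask`, and the use of nested lists instead of a tensor library are assumptions made to keep the sketch self-contained.

```python
def concat_image_and_mask(image_rgb, mask_map):
    """Stack a channel-first RGB image (3 x H x W) with a 1-channel mask map."""
    height, width = len(mask_map), len(mask_map[0])
    for channel in image_rgb:
        if len(channel) != height or any(len(row) != width for row in channel):
            raise ValueError("image and mask map must share the same H x W")
    # The mask map becomes the last channel of the linked image.
    return list(image_rgb) + [mask_map]


r = [[255, 0], [0, 0]]
g = [[0, 255], [0, 0]]
b = [[0, 0], [255, 0]]
mask_map = [[0, 0], [0, 1]]
linked = concat_image_and_mask([r, g, b], mask_map)
print(len(linked))  # number of channels in the linked image
```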
- the electronic device (1000) can transmit the linked image to the first generation model (1100).
- the electronic device (1000) can transmit the linked image to the second generation model (1200).
- FIG. 4 is a conceptual diagram for explaining the operation of a second generation model according to one embodiment of the present disclosure. Any content that overlaps with that described in FIGS. 1 to 3 is omitted.
- the electronic device (1000) can transmit an image to a first encoder (1310).
- the first encoder (1310) can encode the image.
- the electronic device (1000) can obtain an encoded image (e.g., a latent vector) (Z_image).
- the electronic device (1000) can adjust the mask map.
- the electronic device (1000) can adjust the mask map to a specific size.
- the specific size may be a predefined size.
- the predefined size may be the same as the size of the encoded image (Z_image).
- the number of channels (e.g., 3) of the encoded image (Z_image) and the number of channels (e.g., 1) of the adjusted mask map (M_r) may be different.
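Adjusting the mask map to the spatial size of the encoded image could be done in several ways; the sketch below uses nearest-neighbour sampling, which keeps the map binary. The sampling method and the name `resize_mask_nearest` are assumptions, since the disclosure only says the mask map is adjusted to a predefined size.

```python
def resize_mask_nearest(mask_map, out_h, out_w):
    """Resize a 2D mask map to out_h x out_w by nearest-neighbour sampling."""
    in_h, in_w = len(mask_map), len(mask_map[0])
    return [
        [mask_map[y * in_h // out_h][x * in_w // out_w] for x in range(out_w)]
        for y in range(out_h)
    ]


# 4 x 4 mask whose right half is unknown, reduced to a 2 x 2 latent grid.
mask_map = [[0, 0, 1, 1] for _ in range(4)]
adjusted = resize_mask_nearest(mask_map, 2, 2)
print(adjusted)
```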
- the electronic device (1000) can obtain current noise information (Z_t).
- the number of channels and the size of the current noise information (Z_t) may be the same as the number of channels and the size of the encoded image (Z_image).
- the present disclosure is not limited thereto, and the number of channels and the size of the current noise information (Z_t) may be different from the number of channels and the size of the encoded image (Z_image).
- t is defined as a current denoising order of the second generation model (1200).
- the current denoising order may represent the number of times the second generation model (1200) repeats input and output.
- t may be expressed as an integer greater than or equal to 0 and less than or equal to the total denoising order (e.g., T).
- Z_T may represent initial noise information that is first input to the second generation model (1200).
- the initial noise information may be composed of random values.
- the initial noise information may be composed of Gaussian noise following a Gaussian distribution, but the present disclosure is not limited thereto.
- the electronic device (1000) can connect the encoded image (Z image ), the adjusted mask map (M r ), and the current noise information (Z t ).
- the electronic device (1000) can transmit the data (hereinafter, input data) in which the encoded image (Z image ), the adjusted mask map (M r ), and the current noise information (Z t ) are connected to the second generation model (1200).
- the second generation model (1200) can perform a denoising operation based on the input data.
- the denoising operation can refer to an operation of removing a predetermined noise from the input noise.
- the second generation model (1200) can generate the next noise information (Z t-1 ) by performing the denoising operation.
- the electronic device (1000) can determine whether the denoising operation has been performed as many times as the total number of denoising orders. For example, the electronic device (1000) can determine whether the next noise information (Z t-1 ) is the final noise information (Z 0 ). If it is determined that the denoising operation has been performed as many times as the total number of denoising orders, the electronic device (1000) can transmit the final noise information (Z 0 ) to the decoder (1400). The decoder (1400) can generate a final generated image based on the final noise information (Z 0 ).
- the electronic device (1000) can concatenate the encoded image (Z image ), the adjusted mask map (M r ), and the next noise information (Z t-1 ).
- the electronic device (1000) can repeat the denoising operation by inputting the concatenated encoded image (Z image ), adjusted mask map (M r ), and next noise information (Z t-1 ) into the second generation model (1200).
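The iterative procedure described above (concatenate, denoise, check the order, repeat) can be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: tensors are modeled as lists of flattened channels, `toy_denoiser` is a hypothetical stand-in for the second generation model (1200), and the initial noise is assumed to have the same shape as the encoded image. The decoder (1400) step is left out.

```python
import random

def concat_channels(*tensors):
    """Concatenate tensors along the channel axis (each tensor is a list of channels)."""
    out = []
    for t in tensors:
        out.extend(t)
    return out

def toy_denoiser(input_data, t, noise_channels):
    """Hypothetical stand-in for the second generation model (1200).
    It reads the noise channels (the last `noise_channels` entries) and halves them."""
    z_t = input_data[-noise_channels:]
    return [[v * 0.5 for v in channel] for channel in z_t]

def run_denoising(z_image, m_r, total_steps, seed=0):
    """Run T denoising steps and return the final noise information Z_0."""
    rng = random.Random(seed)
    # Z_T: Gaussian initial noise, assumed here to match the shape of Z_image.
    z_t = [[rng.gauss(0.0, 1.0) for _ in channel] for channel in z_image]
    for t in range(total_steps, 0, -1):
        # The concatenation order is fixed: encoded image, mask map, noise.
        input_data = concat_channels(z_image, m_r, z_t)
        z_t = toy_denoiser(input_data, t, noise_channels=len(z_image))
    return z_t  # Z_0, which would then be passed to the decoder (1400)
```

The loop counter `t` plays the role of the current denoising order, running from T down to 1 so that the last output corresponds to Z 0.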
- the first encoder (1310) and the decoder (1400) may be omitted in the electronic device (1000).
- the electronic device (1000) may concatenate the image, the mask map, and the current noise information (Z t ).
- the electronic device (1000) may repeat the denoising operation by inputting the concatenated image, the mask map, and the current noise information (Z t ) into the second generation model (1200).
- the second generation model (1200) may output the final noise information (Z 0 ).
- the final noise information (Z 0 ) may include the final generated image.
- one or more processors of the electronic device (1000) may repeatedly perform the denoising operation by combining the image, the mask map, and the current noise information (Z t ) and inputting the combined image, the mask map, and the current noise information (Z t ) into the second generation model (1200).
- the second generation model (1200) can output final noise information (Z 0 ).
- the final noise information (Z 0 ) can include the final generated image.
- FIG. 5a is a conceptual diagram for explaining text guidance input to a second generation model according to one embodiment of the present disclosure. Any content that overlaps with that described in FIGS. 1 to 4 is omitted.
- the electronic device (1000) can obtain a text input from an external server. In one embodiment of the present disclosure, the electronic device (1000) can obtain a text input from a user through a user interface. In one embodiment of the present disclosure, the electronic device (1000) can include a sound-to-text converter. The sound-to-text converter can include a speech-to-text converter. The electronic device (1000) can obtain a user voice input from a user through the user interface. The electronic device (1000) can convert the user voice input into a text input by using the sound-to-text converter. In one embodiment of the present disclosure, the electronic device (1000) can include an image-to-text converter. The electronic device (1000) can obtain a text input representing an image by inputting an image into the image-to-text converter.
- the second encoder (1320) can encode the text input.
- the second encoder (1320) can be an artificial intelligence model that is trained to encode the text input.
- the second encoder (1320) can pass the encoded text input to the second generative model (1200).
- the encoded text input can also be referred to as text guidance, text embedding, or guidance information.
- the electronic device (1000) can pass the encoded text input to at least one layer of the second generative model (1200).
- the second generative model (1200) can output a final generated image based on the image, the intermediate generated image, and the encoded text input.
- FIG. 5b is a conceptual diagram for explaining image guidance input to a second generation model according to one embodiment of the present disclosure. Any content overlapping with that explained in FIGS. 1 to 5a is omitted.
- the electronic device (1000) may include a second encoder (1320) and a third encoder (1330).
- the function and operation of the second encoder (1320) correspond to the function and operation of the second encoder (1320) of FIG. 5a, so that the overlapping content is omitted.
- the function and operation of the third encoder (1330) correspond to the function and operation of the third encoder (1330) of FIG. 5b, so that the overlapping content is omitted.
- the second encoder (1320) may output text guidance information based on text input.
- the third encoder (1330) may output image guidance information based on an image.
- the second generation model (1200) may include a first neural network (1210), a second neural network (1220), and a noise blender (1230).
- the first neural network (1210) may output a first noise (N1) based on an image, an intermediate generated image, and image guidance information.
- the second neural network (1220) may output a second noise (N2) based on the image, the intermediate generated image, and text guidance information.
- the noise blender (1230) may blend the first noise and the second noise.
- the noise blender (1230) may output a blended noise by weighting the first noise and the second noise.
- the second generation model (1200) may output a final generated image based on the blended noise.
- the weights between the first noise and the second noise may be different depending on the settings of the user or the manufacturer.
- the electronic device (1000) may obtain the weights from the user through the user interface.
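The weighted blending performed by the noise blender (1230) can be illustrated with a short sketch. The disclosure does not state how the weights are normalized; the version below assumes a single weight `w` in [0, 1] applied to the first noise, with `1 - w` applied to the second noise, and represents each noise as a flat list of values.

```python
def blend_noise(n1, n2, w=0.5):
    """Blend two noise predictions element-wise.

    n1: first noise (N1), from the image-guided network
    n2: second noise (N2), from the text-guided network
    w:  weight for N1; N2 receives 1 - w (an assumed normalization)
    """
    if len(n1) != len(n2):
        raise ValueError("N1 and N2 must have the same shape")
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must be in [0, 1]")
    return [w * a + (1.0 - w) * b for a, b in zip(n1, n2)]
```

With `w` close to 1 the result follows the image guidance more closely, and with `w` close to 0 it follows the text guidance.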
- FIG. 5d is a conceptual diagram for explaining an embodiment in which the output of the first generation model according to one embodiment of the present disclosure is used as image guidance. Any content that overlaps with that described in FIGS. 1 to 5c is omitted.
- the electronic device (1000) may include a fourth encoder (1340).
- the electronic device (1000) may transmit an intermediate generated image to the fourth encoder (1340).
- the fourth encoder (1340) may encode the intermediate generated image.
- the fourth encoder (1340) may be an artificial intelligence model that has been trained to encode an image.
- the configuration, operation, and function of the fourth encoder (1340) may correspond to the configuration, operation, and function of the third encoder (1330).
- the fourth encoder (1340) can transmit the encoded intermediate generated image to the second generation model (1200).
- the encoded intermediate generated image may also be referred to as image guidance, image embedding, or guidance information.
- the electronic device (1000) can transmit the encoded intermediate generated image to at least one layer of the second generation model (1200).
- the second generation model (1200) can output a final generated image based on the image and the encoded intermediate generated image.
- a second generation model that outputs an image suited to the image context can be provided.
- FIG. 5e is a conceptual diagram for explaining an embodiment in which the output of the first generation model according to one embodiment of the present disclosure is used as image guidance. Any content that overlaps with that described in FIGS. 1 to 5c is omitted.
- the electronic device (1000) may include a fourth encoder (1340).
- the function and operation of the fourth encoder (1340) correspond to the function and operation of the fourth encoder (1340) of FIG. 5d, so that redundant content is omitted.
- the fourth encoder (1340) may output image guidance information based on an intermediate generated image.
- the electronic device (1000) can transmit the intermediate generated image, which is the output of the first generation model (1100), to the second generation model (1200).
- the second generation model (1200) can infer the final generated image based on the intermediate generated image and the image guidance information.
- a description of an embodiment in which the second generation model (1200) infers the final generated image based on the intermediate generated image is omitted because it overlaps with the contents described in FIGS. 1 to 5c.
- a description of an embodiment in which the second generation model (1200) infers the final generated image based on the image guidance information is omitted because it overlaps with the contents described in FIG. 5d.
- the second generation model (1200) can receive an encoded text input.
- the electronic device (1000) can receive a text input and encode the text input using an encoder such as the second encoder (1320) of FIG. 5a.
- the second generation model (1200) can take as input the encoded text input (text guidance), an image, an intermediate generated image, and an encoded intermediate generated image (image guidance) and output a final generated image.
- the second generation model (1200) may improve the performance of inpainting and/or outpainting by not only referring to the image guidance, but also inferring the final generated image based on the intermediate generated image.
- FIG. 5f is a conceptual diagram for explaining an embodiment in which the output of the first generation model is used as image guidance together with text guidance according to one embodiment of the present disclosure. Any content that overlaps with that described in FIGS. 1 to 5d is omitted.
- the electronic device (1000) may include a second encoder (1320) and a fourth encoder (1340).
- the function and operation of the second encoder (1320) correspond to the function and operation of the second encoder (1320) of FIG. 5c, so that the overlapping content is omitted.
- the function and operation of the fourth encoder (1340) correspond to the function and operation of the fourth encoder (1340) of FIG. 5d, so that the overlapping content is omitted.
- the second encoder (1320) may output text guidance information based on text input.
- the fourth encoder (1340) may output image guidance information based on an intermediate generated image.
- the second generation model (1200) may include a first neural network (1210), a second neural network (1220), and a noise blender (1230).
- the configuration, function, and operation of the first neural network (1210), the second neural network (1220), and the noise blender (1230) correspond to the configuration, function, and operation of the first neural network (1210), the second neural network (1220), and the noise blender (1230) of FIG. 5C, so that redundant descriptions are omitted.
- the electronic device (1000) may transmit image guidance information from the fourth encoder (1340) to the first neural network (1210).
- the electronic device (1000) may transmit text guidance information from the second encoder (1320) to the second neural network (1220).
- the performance of inpainting and/or outpainting can be improved by inferring the final generated image by referring to text guidance as well as image guidance.
- FIG. 5g is a conceptual diagram illustrating an embodiment in which the output of the first generation model is used as image guidance together with text guidance according to one embodiment of the present disclosure.
- the electronic device (1000) may include a second encoder (1320) and a fourth encoder (1340).
- the functions and operations of the second encoder (1320) and the fourth encoder (1340) correspond to the functions and operations of the second encoder (1320) and the fourth encoder (1340) of FIG. 5f, so that overlapping content is omitted.
- the second encoder (1320) may output text guidance information based on text input.
- the fourth encoder (1340) may output image guidance information based on an intermediate generated image.
- the second generation model (1200) may include a first neural network (1210), a second neural network (1220), and a noise blender (1230).
- the configuration, function, and operation of the first neural network (1210), the second neural network (1220), and the noise blender (1230) correspond to the configuration, function, and operation of the first neural network (1210), the second neural network (1220), and the noise blender (1230) of FIGS. 5C and 5F, so that redundant content is omitted.
- the electronic device (1000) may transmit image guidance information from the fourth encoder (1340) to the first neural network (1210).
- the electronic device (1000) may transmit text guidance information from the second encoder (1320) to the second neural network (1220).
- the electronic device (1000) can transmit an intermediate generated image, which is an output of the first generation model (1100), to a second generation model (1200).
- the second generation model (1200) can infer a final generated image based on the intermediate generated image, the image guidance information, and the text guidance information.
- a description of an embodiment in which the second generation model (1200) infers a final generated image based on the intermediate generated image is omitted because it overlaps with the contents described in FIGS. 1 to 5c and 5e.
- a description of an embodiment in which the second generation model (1200) infers a final generated image based on the image guidance information and the text guidance information is omitted because it overlaps with the contents described in FIGS. 5c and 5f.
- the second generation model (1200) may improve the performance of inpainting and/or outpainting by not only referring to image guidance and/or text guidance, but also inferring the final generated image based on the intermediate generated image.
- In FIGS. 5a to 5g, the image is illustrated as being input to the first generation model (1100) and the second generation model (1200), but, unlike what is illustrated, a mask map may be concatenated with the image and input to the first generation model (1100) and the second generation model (1200).
- In FIGS. 5a to 5g, the first encoder (1310) and the decoder (1400) of FIGS. 2 and 4 are omitted from the illustration, but, unlike what is illustrated, the image and/or mask map input to the second generation model (1200) may be encoded by the first encoder (1310), and the final output of the second generation model (1200) may be decoded by the decoder (1400).
- Fig. 6 is a conceptual diagram for explaining a learning method of a first generation model according to one embodiment of the present disclosure. Any content overlapping with that explained in Figs. 1 to 5g is omitted.
- the model learning system (10) may include a first generation model (1100), a discriminator (12), and a loss function (13).
- the model learning system (10) may be a system that trains a generative adversarial network (GAN) model.
- the first generation model (1100) may be referred to as a generator (1100).
- the generative adversarial network model may represent a model in which the generator (1100) and the discriminator (12) are trained in competition, each improving the other's performance.
- Each of the generator (1100) and the discriminator (12) may include at least one layer.
- the layer may include a filter composed of weight values for extracting features from input data.
- the generator (1100) can be trained to take a data set (DS) as input and output fake data (FD).
- the data set (DS) can be a set of data including a plurality of images.
- the data set (DS) can include a mask map for each of the plurality of images.
- the fake data (FD) can represent fake image data.
- the real data DB (11) can include a set of real data (RD).
- the discriminator (12) can be trained to take fake data (FD) or real data (RD) as input and determine whether the input data is fake; the determination is the discrimination result (DR).
- the loss function (13) can calculate the loss function value based on the discrimination result (DR).
- the loss function value can be transmitted to the discriminator (12) and the generator (1100) through backpropagation.
- the weights of at least one layer included in the discriminator (12) and the generator (1100) can be updated based on the loss function value.
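As an illustration of how a loss function value could be computed from the discrimination result (DR), the sketch below uses the binary cross-entropy form commonly used for adversarial training. The disclosure does not specify the loss function (13), so this choice is an assumption, as is the convention that the discriminator outputs the probability that its input is real.

```python
import math

def bce(p, label):
    """Binary cross-entropy for one probability p and a 0/1 label."""
    eps = 1e-12
    p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))

def discriminator_loss(dr_real, dr_fake):
    """Loss that rewards labeling real data (RD) as real and fake data (FD) as fake.
    dr_real / dr_fake: assumed probabilities of 'real' for RD and FD inputs."""
    real_term = sum(bce(p, 1) for p in dr_real) / len(dr_real)
    fake_term = sum(bce(p, 0) for p in dr_fake) / len(dr_fake)
    return real_term + fake_term

def generator_loss(dr_fake):
    """Loss that rewards the generator when the discriminator rates FD as real."""
    return sum(bce(p, 1) for p in dr_fake) / len(dr_fake)
```

In an actual training loop, each loss value would be backpropagated to update the weights of the corresponding network, which is outside the scope of this sketch.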
- At least a portion of the functions of the model learning system (10) may be performed in the electronic device (1000) described in FIGS. 1 to 5g, but the present disclosure is not limited thereto, and may be performed in an external server device other than the electronic device (1000).
- the generator (1100) learned through the model learning system (10) may correspond to the first generation model (1100) described in FIGS. 1 to 5g.
- FIGS. 7A and 7B are conceptual diagrams for explaining the configuration of a second generation model according to one embodiment of the present disclosure. Any content overlapping with that described in FIGS. 1 to 6 is omitted.
- the second generation model (1200) may include a first neural network (1210).
- the first neural network (1210) may include at least one layer.
- at least one layer may perform cross attention.
- the present disclosure is not limited thereto, and at least one layer may include a self-attention layer or a residual block, etc.
- a layer among the at least one layer that performs cross attention may also be referred to as a cross attention layer.
- the electronic device (1000) can transmit guidance information to the cross attention layer of the first neural network (1210).
- the first neural network (1210) can reflect weights on the guidance information based on the correlation between the input image and the guidance information.
- the first neural network (1210) can perform a cross-attention operation using a query, a key, and a value as operands.
- the query may include current noise information.
- the key and the value may include guidance information.
- the present disclosure is not limited thereto, and the query, the key, and the value may include different information.
- the cross-attention operation can be performed in a cross-attention layer.
- the first neural network (1210) can transmit the cross-attention operation result to the next layer.
- the first neural network (1210) may include a self-attention layer.
- the query, the key, and the value may include current noise information.
- the first neural network (1210) may perform a self-attention operation using the query, the key, and the value as operands.
- the self-attention operation may be performed in the self-attention layer.
- the first neural network (1210) may transmit the result of the self-attention operation to the next layer.
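The attention operations above can be sketched with scaled dot-product attention. This is a simplified illustration: the learned projection matrices that usually produce the query, key, and value are omitted, and the scaling by the square root of the key dimension is the conventional choice rather than something stated in the disclosure.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors.

    Cross attention: queries come from the current noise information,
    keys and values come from the guidance information.
    Self-attention: all three come from the current noise information.
    """
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return out
```

Each output row is a weighted average of the value vectors, where the weights reflect how strongly the query correlates with each key.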
- the second generation model (1200) may include an interpreter (1240).
- the second generation model (1200) may be a pre-trained model that inputs text guidance information obtained from text into a cross-attention layer.
- the interpreter (1240) may convert guidance information so that it has the properties and/or form of text guidance information, so that the performance of the second generation model (1200) is maintained even when guidance information obtained from a data form other than text is input into the cross-attention layer.
- the interpreter (1240) may convert image guidance information so that it has the properties and/or form of text guidance information.
- the interpreter (1240) can convert the image guidance information output by the third encoder (1330) and the fourth encoder (1340) illustrated in FIGS. 5b to 5g so that it has the properties and/or form of text guidance information.
- various data can be utilized as guidance merely by adding an interpreter, without retraining or fine-tuning the second generative model, whose training cost is high.
- the interpreter (1240) may be omitted.
- the second generation model (1200) may be pre-trained through a process in which arbitrary guidance information (e.g., image guidance information) is input to the cross attention layer.
- the performance of a task can be improved by fine-tuning a pre-trained first neural network (1210).
- the electronic device (1000) can calculate a weight change (ΔW) of at least some predefined layers according to additional training data input, while keeping all parameters (e.g., weights) of the pre-trained first neural network (1210) fixed.
- the electronic device (1000) can determine new weights by adding the weight change (ΔW) to the fixed weights of the at least some predefined layers. According to one embodiment of the present disclosure, the performance of a task corresponding to the additional training data input can be improved by fine-tuning the first neural network (1210).
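The weight update described above can be sketched as follows. The disclosure only states that a weight change (ΔW) is added to fixed weights; representing ΔW as a product of two small matrices `b` and `a` (a low-rank form, as in adapter-style fine-tuning) is an assumption made for illustration, and all names are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][p] * b[p][j] for p in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

def fine_tuned_weights(w_frozen, delta_w):
    """Return W + ΔW without modifying the frozen pre-trained weights."""
    return [
        [w + d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(w_frozen, delta_w)
    ]

# Frozen pre-trained weights of one layer (2 x 2).
w = [[1.0, 0.0], [0.0, 1.0]]
# Trainable factors (assumed low-rank form): ΔW = b @ a, with rank 1.
b = [[1.0], [2.0]]
a = [[0.5, 0.5]]
delta_w = matmul(b, a)
w_new = fine_tuned_weights(w, delta_w)
```

Only `b` and `a` would be updated during additional training, so the original weights can be restored by discarding ΔW.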
- Fig. 8 is a conceptual diagram for explaining the configuration of an interpreter according to one embodiment of the present disclosure. Any content overlapping with that described in Figs. 1 to 7b is omitted.
- the interpreter (1240) may include at least one single-layer perceptron.
- the interpreter (1240) may include a first single-layer perceptron (1241) and a second single-layer perceptron (1242).
- the interpreter (1240) may be configured as a multi-layer perceptron including the first single-layer perceptron (1241) and the second single-layer perceptron (1242).
- the first single-layer perceptron (1241) and the second single-layer perceptron (1242) may be connected to each other and may be pre-trained.
- n is a natural number.
- m may be a natural number greater than or equal to n, but the present disclosure is not limited thereto, and m may be less than or equal to n.
- the second embedding may be a matrix having a size of k x n.
- the second single-layer perceptron (1242) can take the second embedding as input and output the third embedding.
- the third embedding may also be referred to as the result image embedding.
- the second single-layer perceptron (1242) may include a third layer and a fourth layer.
- the third layer may include k nodes.
- the fourth layer may include l nodes, where l is a natural number. In one embodiment of the present disclosure, k may be less than or equal to l, but the present disclosure is not limited thereto, and k may be greater than or equal to l.
- the third embedding may be a matrix having a size of l x n.
- the third embedding may be input to at least one layer, such as a cross-attention layer, of the second generative model (1200 of FIGS. 7a and 7b) (or the first neural network (1210 of FIGS. 7a and 7b)).
- the interpreter (1240) may include two or more single-layer perceptrons. Although only two single-layer perceptrons (a first single-layer perceptron (1241) and a second single-layer perceptron (1242)) are illustrated in FIG. 8, the present disclosure is not limited thereto, and the interpreter (1240) may include three or more single-layer perceptrons.
- the next single-layer perceptron may take as input the embedding output by the previous single-layer perceptron and output the next embedding.
- the embedding output by the last single-layer perceptron may be referred to as a result image embedding.
- the result image embedding may be input to at least one layer, such as a cross-attention layer, of the second generative model (1200 of FIGS. 7a and 7b) (or the first neural network (1210 of FIGS. 7a and 7b)).
- the interpreter (1240) may include a single-layer perceptron.
- the embedding output by the single-layer perceptron may be referred to as a result image embedding.
- the result image embedding may be input to at least one layer, such as a cross-attention layer, of the second generative model (1200 of FIGS. 7a and 7b) (or the first neural network (1210 of FIGS. 7a and 7b)).
- the interpreter (1240) can perform a function of converting a dimension of image guidance into a dimension of text guidance. According to one embodiment of the present disclosure, the interpreter (1240) can perform a function of interpreting or converting a difference in properties between image guidance and text guidance. According to one embodiment of the present disclosure, by performing the functions described above by the interpreter (1240), a second generative model that has been previously trained to understand only text guidance can understand various guidance information such as image guidance.
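The dimension conversion performed by the interpreter (1240) can be sketched as a chain of linear maps. The sketch assumes the input image embedding is an m x n matrix, that each single-layer perceptron is a bias-free linear layer acting on the first axis (m to k, then k to l), and that no activation function is used; none of these details are confirmed by the text above.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][p] * b[p][j] for p in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

def single_layer_perceptron(weights, embedding):
    """Map an (in x n) embedding to an (out x n) embedding with (out x in) weights."""
    return matmul(weights, embedding)

def interpreter(embedding, layer_weights):
    """Apply each single-layer perceptron in order; the last output is the
    result image embedding passed to the cross-attention layer."""
    for weights in layer_weights:
        embedding = single_layer_perceptron(weights, embedding)
    return embedding

# Example with m = 3, n = 2, k = 2, l = 4 (hypothetical sizes and weights).
image_embedding = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]                # m x n
w1 = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]                                # k x m
w2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]                  # l x k
result_image_embedding = interpreter(image_embedding, [w1, w2])        # l x n
```

Passing a list with one weight matrix, or with three or more, corresponds to the variants with a single perceptron or with more than two perceptrons.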
- FIGS. 9A and 9B are conceptual diagrams for explaining an embodiment of adding noise to an intermediate generated image according to one embodiment of the present disclosure. Any content overlapping with that described in FIGS. 1 to 8 is omitted.
- the electronic device (1000) can obtain a denoising strength.
- the denoising strength can correspond to an amount of noise to be added to an intermediate generated image.
- the denoising strength can be a specific value.
- the specific value can be a value preset according to a setting of a user or a manufacturer.
- the denoising strength can be determined by considering a performance indicator of the first generation model (1100).
- the electronic device (1000) may include a noise generator (1500).
- the noise generator (1500) may add noise to an intermediate generated image based on the denoising strength.
- the electronic device (1000) may identify a denoising order mapped to the denoising strength.
- the electronic device (1000) may set the denoising order of the current noise information of the second generation model (1200) to the identified denoising order.
- the intermediate generated image (Z n ) to which noise has been added may be utilized as the current noise information having the identified denoising order (e.g., n) of the second generation model.
- the second generation model (1200) may obtain the next noise information (Z n-1 ) by taking as input an image (e.g., a masked image), the intermediate generated image (Z n ) to which noise has been added, and guidance information.
- the electronic device (1000) can encode an image using the first encoder (1310).
- the first encoder (1310) can output an encoded image (Z i2 ).
- the electronic device (1000) can adjust a mask map to a predefined size.
- the electronic device (1000) can encode an intermediate generated image using the first encoder (1310).
- the first encoder (1310) can output an encoded intermediate generated image (Z i1 ).
- the noise generator (1500) can add noise to the encoded intermediate generated image (Z i1 ) based on the denoising strength.
- the electronic device (1000) can determine a denoising order (n) corresponding to the denoising strength.
- the noise generator (1500) can set the intermediate generated image (Z n ) with added noise as current noise information.
- the electronic device (1000) can concatenate the encoded image (Z i2 ), the adjusted mask map (M r ), and the intermediate generated image (Z n ) with added noise.
- the order in which the encoded image (Z i2 ), the adjusted mask map (M r ), and the intermediate generated image (Z n ) with added noise are concatenated can be arbitrarily determined, but the order used during training of the second generation model (1200) and the order used during denoising (or inference) must be the same.
- the electronic device (1000) can input the concatenated data (hereinafter, input data), in which the encoded image (Z i2 ), the adjusted mask map (M r ), and the intermediate generated image (Z n ) with added noise are joined, to the second generation model (1200).
- the second generation model (1200) can output the next noise information (Z n-1 ) based on the input data.
- the electronic device (1000) can concatenate the encoded image (Z i2 ), the adjusted mask map (M r ), and the next noise information (Z n-1 ), and input the concatenated data into the second generation model (1200).
- the electronic device (1000) can repeat the denoising operation until the output of the second generation model (1200) becomes the final noise information (Z 0 ).
- the electronic device (1000) can obtain the final generated image by inputting the final noise information into the decoder (1400).
- the first generative model has lower computational cost and faster inference speed than the second generative model.
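- because the second generative model starts denoising from the intermediate generated image (Z n ) with added noise at denoising order n, rather than from initial noise at the total denoising order, the computational cost of the second generative model can be reduced and the inference speed can be improved.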
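The mapping from denoising strength to denoising order, and the noise addition that follows, can be sketched as below. The disclosure does not give the mapping or the noise schedule; the sketch assumes a strength in [0, 1] mapped linearly to an order n in [0, T], and a simple variance-preserving mix with Gaussian noise whose share grows linearly with n. All function names are hypothetical.

```python
import math
import random

def order_from_strength(strength, total_steps):
    """Map a denoising strength in [0, 1] to a denoising order n in [0, T]."""
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be in [0, 1]")
    return int(round(strength * total_steps))

def add_noise(z_i1, n, total_steps, rng):
    """Hypothetical noise generator (1500): mix the encoded intermediate
    generated image with Gaussian noise; a larger n means more noise."""
    keep = 1.0 - n / total_steps  # assumed linear schedule
    return [
        math.sqrt(keep) * v + math.sqrt(1.0 - keep) * rng.gauss(0.0, 1.0)
        for v in z_i1
    ]

total_steps = 50
n = order_from_strength(0.5, total_steps)
z_n = add_noise([0.3, -0.7], n, total_steps, random.Random(0))
steps_to_run = n  # denoise from order n down to 0 instead of from T
```

Under these assumptions a strength of 0.5 with T = 50 leaves 25 denoising steps for the second generation model, which is where the reduction in computation comes from.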
- FIGS. 10A to 10C are conceptual diagrams for explaining an example of determining a denoising strength according to the quality of an intermediate generated image according to one embodiment of the present disclosure. Any content that overlaps with that explained in FIGS. 1 to 9B is omitted.
- the electronic device (1000) may include a denoising strength determiner (1550).
- the denoising strength determiner (1550) may obtain a predicted confidence value of the intermediate generated image based on the intermediate generated image. For example, the predicted confidence value may be determined based on a confidence score of the first generation model (1100) for the input image. However, the present disclosure is not limited thereto, and the predicted confidence value may be determined by any technique for measuring the quality of the image.
- the denoising strength determiner (1550) may determine the denoising strength based on the predicted confidence value.
- the denoising strength determiner (1550) can identify the size and/or shape of a partial region of an image (e.g., an unknown region or a masked region).
- the denoising strength determiner (1550) can determine the denoising strength based on the identified size and/or shape. For example, as the size of the unknown region increases, the prediction performance of the first generation model (1100) can decrease.
- the denoising strength determiner (1550) can increase the denoising strength as the size of the unknown region increases. For example, when the shape of the unknown region is a specific shape, the prediction performance of the first generation model (1100) can decrease.
- the denoising strength determiner (1550) can determine the denoising strength differently depending on the identified shape.
- a pre-trained classification model that classifies the shape of the region can be used.
- the electronic device (1000) may include a fourth encoder (1340).
- the configuration, function, and operation of the fourth encoder (1340) correspond to the configuration, function, and operation of the fourth encoder (1340) of FIGS. 5d to 5f, and therefore, redundant descriptions are omitted.
- the second generation model (1200) may include a first neural network (1210), a second neural network (1220), and a noise blender (1230).
- the configuration, function, and operation of the first neural network (1210), the second neural network (1220), and the noise blender (1230) correspond to the configuration, function, and operation of the first neural network (1210), the second neural network (1220), and the noise blender (1230) of FIGS. 5f and 5g, and therefore, overlapping content is omitted.
- the fourth encoder (1340) can encode an intermediate generated image.
- the encoded intermediate generated image may also be referred to as image guidance information or image embedding.
- the electronic device (1000) may input the image guidance information into at least one layer of the first neural network (1210).
- the electronic device (1000) may input text guidance information (which may also be referred to as text embedding) into at least one layer of the second neural network (1220).
- the noise generator (1500) can add noise to the intermediate generated image based on the denoising strength.
- the intermediate generated image (Z n ) with added noise can be utilized as current noise information having the identified denoising order (e.g., n) of the first neural network (1210) and/or the second neural network (1220).
- the intermediate generated image (Z n ) with added noise is illustrated as being input to both the first neural network (1210) and the second neural network (1220), but the present disclosure is not limited thereto.
- the intermediate generated image (Z n ) with added noise can be input to at least one of the first neural network (1210) and the second neural network (1220).
- the first neural network (1210) and/or the second neural network (1220) can obtain the next noise information (e.g., Z n-1 ) by taking as input an image (e.g., a masked image), the intermediate generated image (Z n ) with added noise, and guidance information (e.g., image guidance information and/or text guidance information).
- the denoising strength determiner (1550) can identify the size and/or shape of a masked region in the image from the mask map.
- the denoising strength determiner (1550) can determine the denoising strength based on at least one of the predicted confidence value, the size of the partial region, and the shape of the partial region.
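One way the denoising strength determiner (1550) could combine these inputs is sketched below. The disclosure states only the direction of the relationships (a larger unknown region or a less reliable intermediate image calls for a higher strength); the linear formula, the coefficients, and the binary mask convention (1 marks a masked pixel) are hypothetical choices for illustration. Shape-based adjustment is omitted.

```python
def masked_fraction(mask_map):
    """Fraction of pixels marked as masked (1) in a binary mask map."""
    total = sum(len(row) for row in mask_map)
    masked = sum(sum(row) for row in mask_map)
    return masked / total

def determine_strength(confidence, mask_map, base=0.3):
    """Return a denoising strength in [0, 1].

    confidence: predicted confidence value of the intermediate generated
                image, assumed to lie in [0, 1]
    mask_map:   binary mask map, 1 = masked (unknown) pixel
    base:       hypothetical minimum strength
    """
    size_term = 0.2 * masked_fraction(mask_map)    # larger region, more noise
    quality_term = 0.5 * (1.0 - confidence)        # lower confidence, more noise
    strength = base + quality_term + size_term
    return min(1.0, max(0.0, strength))
```

The returned strength could then be mapped to a denoising order, so that low-quality intermediate images are re-noised more heavily before the second generation model refines them.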
- FIG. 11 is a conceptual diagram illustrating an electronic device for generating a portion of an image using a generation model according to one embodiment of the present disclosure. Any content that overlaps with that described in FIGS. 1 to 10c is omitted.
- the electronic device (1000) may include a fifth encoder (1350) and a second generation model (1200).
- the electronic device (1000) may obtain an image including information of a preset portion of an area.
- the portion of an area may also be referred to as an unknown area or a masked area.
- the image may include color information for an area excluding the portion of an area.
- an image may include a mask map as described in FIG. 3.
- the electronic device (1000) may obtain a mask map that separates a portion (or target portion) from an entire portion of the image.
- the portion may be a preset portion.
- the electronic device (1000) may generate a mask map based on the image.
- the electronic device (1000) may connect the mask map to the image.
- the electronic device (1000) can transmit an image including information of a portion of the region to the fifth encoder (1350).
- the fifth encoder (1350) can be an artificial intelligence model that has been trained to encode an image.
- the encoded image can be used as guidance information for the second generation model (1200).
- the second generation model (1200) can take as input an image including information of a portion of the region and an image encoded by the fifth encoder (1350) and output a final generated image.
- the electronic device (1000) may include a first encoder (1310) and a decoder (1400).
- the electronic device (1000) may encode an image including information of a portion of the region using the first encoder (1310).
- the electronic device (1000) may decode an output of the second generation model using the decoder (1400).
- FIGS. 12A to 12B are conceptual diagrams for explaining a learning method of a fifth encoder according to an embodiment of the present disclosure.
- FIGS. 13A to 13B are conceptual diagrams for explaining a learning method of a fifth encoder according to an embodiment of the present disclosure. Any content overlapping with that described in FIGS. 1 to 11 is omitted.
- the encoder learning system (20) may include a fifth encoder (1350), a sixth encoder (21), and a seventh encoder (22).
- the fifth encoder (1350) may be an artificial intelligence model whose learning has not been completed.
- the encoder learning system (20) may train the fifth encoder (1350).
- the fifth encoder (1350) may be a pre-learned artificial intelligence model.
- the encoder learning system (20) may train the fifth encoder (1350) through an additional learning technique such as fine-tuning.
- the encoder learning system (20) can obtain a learning image (also referred to as a first image) that includes information (e.g., image information) for a portion of the entire area.
- the learning image can include image information for an area excluding a portion of the entire area.
- the encoder learning system (20) can input the learning image to the fifth encoder (1350).
- the fifth encoder (1350) can output the first image embedding by encoding the learning image.
- the encoder learning system (20) can obtain a ground truth image (which may also be referred to as a second image) including image information for the entire area.
- the ground truth image may represent an answer image to be inferred from the learning image.
- the encoder learning system (20) can input the ground truth image to the sixth encoder (21).
- the sixth encoder (21) can output a second image embedding by encoding the ground truth image.
- the sixth encoder (21) may be a pre-trained artificial intelligence model.
- the sixth encoder (21) may be an image encoder part of a pre-trained Contrastive Language-Image Pretraining (CLIP) model.
- the sixth encoder (21) may not be trained further beyond its pre-trained state.
- the sixth encoder (21) may have fixed, pre-trained parameters.
- the parameters of the sixth encoder (21) may no longer be updated.
- the encoder learning system (20) can obtain the first loss based on the first image embedding and the second image embedding. For example, the encoder learning system (20) can calculate the similarity between the first image embedding and the second image embedding. The encoder learning system (20) can obtain the first loss based on the similarity. For example, the larger the similarity, the smaller the first loss.
- the encoder learning system (20) can obtain ground truth text representing a ground truth image.
- the ground truth text can represent a sentence describing a correct image to be inferred from a learning image.
- the encoder learning system (20) can input the ground truth text to the seventh encoder (22).
- the seventh encoder (22) can output a text embedding by encoding the ground truth text.
- the seventh encoder (22) can be a pre-trained artificial intelligence model.
- the seventh encoder (22) can be a text encoder part of a pre-trained Contrastive Language-Image Pretraining (CLIP) model.
- the seventh encoder (22) may not be trained further beyond its pre-trained state.
- the seventh encoder (22) may have fixed, pre-trained parameters.
- the parameters of the seventh encoder (22) may no longer be updated.
- the encoder learning system (20) can obtain the second loss based on the first image embedding and the text embedding. For example, the encoder learning system (20) can calculate the similarity between the first image embedding and the text embedding. The encoder learning system (20) can obtain the second loss based on the similarity. For example, the larger the similarity, the smaller the second loss. In one embodiment of the present disclosure, the second loss can be obtained using a contrastive loss technique, but the present disclosure is not limited thereto, and the second loss can use any loss calculation technique (or loss function) to maximize the similarity between the first image embedding and the text embedding.
- the encoder learning system (20) can update at least one parameter (e.g., weights and/or bias) of the fifth encoder (1350) based on the first loss and the second loss.
- At least one parameter of the sixth encoder (21) may not be updated, but the present disclosure is not limited thereto.
- the encoder learning system (20) may update at least one parameter of the sixth encoder (21) based on the first loss and/or the second loss.
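The two losses described above can be sketched as follows. The sketch assumes cosine similarity as the similarity measure and `1 - similarity` as the loss, so that a larger similarity gives a smaller loss. The disclosure leaves the loss function open (a contrastive loss is one option), so this choice, the unweighted sum, and the toy embedding values are illustrative assumptions.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def similarity_loss(a: np.ndarray, b: np.ndarray) -> float:
    """Assumed loss: 1 - cosine similarity, which shrinks as embeddings align."""
    return 1.0 - cosine_similarity(a, b)

# Toy embeddings (hypothetical values, not encoder outputs).
first_image_embedding = np.array([1.0, 0.0, 1.0])   # from the fifth encoder
second_image_embedding = np.array([1.0, 0.0, 1.0])  # from the sixth encoder
text_embedding = np.array([0.0, 1.0, 0.0])          # from the seventh encoder

first_loss = similarity_loss(first_image_embedding, second_image_embedding)
second_loss = similarity_loss(first_image_embedding, text_embedding)

# The parameter update would be driven by both losses; a plain sum is assumed.
total_loss = first_loss + second_loss
```

In this toy case the first image embedding matches the second image embedding, so the first loss is approximately zero, while it is orthogonal to the text embedding, so the second loss is approximately one.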
- the fifth encoder (1350) can receive, as input, data in which a mask map is connected to a learning image, and output a first image embedding.
- the fifth encoder (1350) can support an input data format of a total of four channels, including three channels corresponding to the learning image and one channel corresponding to the mask map.
- the sixth encoder (21) can support an input data format of a total of three channels corresponding to the ground truth image.
- however, the present disclosure is not limited thereto; the fifth encoder (1350) can support an input data format that has more channels than that of the sixth encoder (21) by the number of channels of the mask map.
- the encoder learning system (20) may include a fifth encoder (1350) and a seventh encoder (22).
- the sixth encoder (21) of FIGS. 12a and 12b may be the same encoder as the fifth encoder (1350).
- the fifth encoder (1350) can receive a learning image as input at an arbitrary learning stage (e.g., iteration) and output a first image embedding. At the same learning stage in which the first image embedding is output, the fifth encoder (1350) can receive a ground truth image as input and output a second image embedding.
- the encoder learning system (20) can obtain a first loss based on the first image embedding and the second image embedding.
- the seventh encoder (22) can receive a ground truth text as input and output a text embedding.
- the encoder learning system (20) can obtain a second loss based on the first image embedding and the text embedding.
- the encoder learning system (20) can update at least one parameter (e.g., weights and/or bias) of the fifth encoder (1350) based on the first loss and the second loss.
- the fifth encoder (1350) can receive, as input, data in which a first mask map is connected to a learning image, and output a first image embedding.
- the first mask map can be composed of binary values and coordinate values that distinguish a part of the entire area (e.g., an unknown area or a masked area).
- the fifth encoder (1350) can receive, as input, data in which a second mask map is connected to a ground truth image, and output a second image embedding. Since the second mask map has no unknown area, its entire area can be composed of a single value.
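The following sketch illustrates how the two inputs to a shared fifth encoder could be assembled so that both have the same four-channel format. It interprets "connecting" a mask map as channel-wise concatenation, and it assumes a channel-first layout, a mask value of 1 for the unknown area, and 0 as the single value of the second mask map. None of these conventions is fixed by the disclosure.

```python
import numpy as np

def with_mask_channel(image: np.ndarray, mask_map: np.ndarray) -> np.ndarray:
    """Concatenate a (3, H, W) image and a (1, H, W) mask map into (4, H, W)."""
    return np.concatenate([image, mask_map], axis=0)

rng = np.random.default_rng(0)
ground_truth_image = rng.random((3, 8, 8)).astype(np.float32)

# First mask map: 1 marks the unknown (masked) area, 0 the known area (assumed).
first_mask_map = np.zeros((1, 8, 8), dtype=np.float32)
first_mask_map[:, 2:6, 2:6] = 1.0
learning_image = ground_truth_image * (1.0 - first_mask_map)

# Second mask map: no unknown area, so every entry holds the same value.
second_mask_map = np.zeros((1, 8, 8), dtype=np.float32)

learning_input = with_mask_channel(learning_image, first_mask_map)
ground_truth_input = with_mask_channel(ground_truth_image, second_mask_map)
```

Because both inputs share one format, the same encoder can produce the first and second image embeddings within a single learning stage.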
- a fifth encoder (1350) trained by the encoder learning system (20) can receive and encode a third image (e.g., an image including information of a portion of a region).
- the learned fifth encoder (1350) can output a first image embedding corresponding to the third image.
- the first image embedding may also be referred to as guidance information or image guidance information.
- the first image embedding can be input to at least one layer of a second generation model (1200), which takes the third image as input and infers a fourth image (e.g., a final generated image) that differs from the third image in at least a portion thereof.
- the memory capacity otherwise required for the sixth encoder (21) can be saved, the learning cost of the fifth encoder (1350) can be reduced, and the learning speed can be improved.
- a method for generating a partial region of an image using a generation model may include steps S1410 to S1430.
- steps S1410 to S1430 may be performed by an electronic device (1000) or a processor (not shown) of the electronic device (1000).
- the present disclosure is not limited thereto, and steps S1410 to S1430 may be performed by any electronic device.
- a method for generating a partial region of an image using a generation model according to one embodiment of the present disclosure is not limited to that illustrated in FIG. 14, and any one of the steps illustrated in FIG. 14 may be omitted, or steps not illustrated in FIG. 14 may be further included.
- the electronic device (1000) can obtain an image including information of a portion of the area.
- the electronic device (1000) can obtain a mask map corresponding to the portion of the area.
- the electronic device (1000) can connect the mask map to the image including information of the portion of the area.
- the electronic device (1000) can obtain a final generated image including second image information that is at least partially different from the first image information by using a second generation model (1200) that takes as input an image including information of a partial region and an intermediate generated image.
- the electronic device (1000) can obtain second pixel information from the second generation model (1200).
- the electronic device (1000) can obtain the final generated image by performing a blending operation between the second pixel information and the original pixel information of the image for an area other than the partial region.
- the second generation model (1200) can obtain the final generated image by repeating a denoising operation as many times as a predefined total number of denoising orders.
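The blending operation mentioned above can be sketched as a per-pixel selection controlled by the mask map: generated pixels are kept inside the partial region and original pixels are kept elsewhere. The mask convention (1 for the partial region, 0 elsewhere), the channel-first layout, and the constant pixel values are assumptions for illustration.

```python
import numpy as np

def blend(original: np.ndarray, generated: np.ndarray, mask_map: np.ndarray) -> np.ndarray:
    """Keep generated pixels where mask_map == 1 and original pixels where it == 0."""
    return mask_map * generated + (1.0 - mask_map) * original

original = np.full((3, 4, 4), 0.2, dtype=np.float32)   # original pixel information
generated = np.full((3, 4, 4), 0.8, dtype=np.float32)  # second pixel information
mask_map = np.zeros((1, 4, 4), dtype=np.float32)
mask_map[:, 1:3, 1:3] = 1.0  # the partial region to be generated

final_image = blend(original, generated, mask_map)
```

A soft mask with values between 0 and 1 near the boundary would work with the same function and would produce a gradual transition, although the disclosure does not specify this.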
- step S1410 of FIG. 14 may include steps S1510 to S1520.
- steps S1510 to S1520 may be performed by the electronic device (1000) or a processor (not shown) of the electronic device (1000).
- steps S1510 to S1520 may be performed by any electronic device.
- the detailed steps of step S1410 according to the present disclosure are not limited to those illustrated in FIG. 15, and any one of the steps illustrated in FIG. 15 may be omitted, and steps not illustrated in FIG. 15 may be further included.
- the electronic device (1000) can obtain a mask map that distinguishes a certain area from an entire area of an image including information of a certain area.
- the electronic device (1000) can obtain the mask map from an external server.
- the mask map can be generated by distinguishing a masked area of the image as a first value and an unmasked area of the image as a second value.
- the electronic device (1000) can connect a mask map to an image including information of a portion of the region.
- the electronic device (1000) can encode an image including information of a portion of the region.
- the electronic device (1000) can connect a mask map to the encoded image.
- the electronic device (1000) can further connect current noise information to the connected image.
- the electronic device (1000) can input data connected up to the current noise information into the second generation model (1200).
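As a rough sketch of the data preparation described above, the example below builds a binary mask map, encodes the masked image, resizes the mask map to the encoded resolution, and connects the encoded image, the mask map, and the current noise information along the channel axis. The average-pooling `downsample` function is only a stand-in for an encoder, and the mask values, shapes, and channel order are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_mask_map(height, width, box, masked_value=1.0, unmasked_value=0.0):
    """Mask map holding a first value in the masked box and a second value elsewhere."""
    top, left, bottom, right = box
    mask = np.full((1, height, width), unmasked_value, dtype=np.float32)
    mask[:, top:bottom, left:right] = masked_value
    return mask

def downsample(x: np.ndarray, factor: int) -> np.ndarray:
    """Stand-in for an encoder: average-pool each channel by `factor`."""
    c, h, w = x.shape
    return x.reshape(c, h // factor, factor, w // factor, factor).mean(axis=(2, 4))

image = rng.random((3, 16, 16)).astype(np.float32)
mask_map = make_mask_map(16, 16, box=(4, 4, 12, 12))
masked_image = image * (1.0 - mask_map)  # color information outside the region only

encoded_image = downsample(masked_image, 2)  # shape (3, 8, 8)
resized_mask = downsample(mask_map, 2)       # shape (1, 8, 8)
current_noise = rng.standard_normal(encoded_image.shape).astype(np.float32)

# Data connected up to the current noise information, ready for the generation model.
model_input = np.concatenate([encoded_image, resized_mask, current_noise], axis=0)
```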
- Fig. 16 is a flowchart for explaining detailed steps of step S1430 of Fig. 14. Contents overlapping with those explained in Figs. 1 to 15 are omitted. For convenience of explanation, Fig. 16 is explained with reference to Figs. 5d and 5e.
- step S1430 of FIG. 14 may include steps S1610 to S1620.
- steps S1610 to S1620 may be performed by the electronic device (1000) or a processor (not shown) of the electronic device (1000).
- steps S1610 to S1620 may be performed by any electronic device.
- the detailed steps of step S1430 according to the present disclosure are not limited to those shown in FIG. 16, and any one of the steps shown in FIG. 16 may be omitted, and steps not shown in FIG. 16 may be further included.
- the electronic device (1000) can encode an intermediate generated image.
- the electronic device (1000) can encode the intermediate generated image using a pre-learned encoder.
- the encoded intermediate generated image can be input to at least one layer of the second generation model (1200).
- the encoded intermediate generated image can be utilized as guidance information in the second generation model (1200).
- the electronic device (1000) can obtain a final generated image including second image information that is at least partially different from the first image information by using a second generation model that takes as input an image including information of a portion of the region and an encoded intermediate generated image.
- the electronic device (1000) can obtain conversion data by inputting the encoded intermediate generated image into an interpreter.
- the electronic device (1000) can input the conversion data into at least one layer of the second generation model.
- the electronic device (1000) can transmit the intermediate generated image to the second generation model (1200).
- the electronic device (1000) can obtain a final generated image including second image information that is at least partially different from the first image information by using the second generation model that takes as input an image including information of a portion of the region, the intermediate generated image, and the encoded intermediate generated image.
- Fig. 17 is a flowchart for explaining detailed steps of step S1430 of Fig. 14. Contents overlapping with those explained in Figs. 1 to 16 are omitted. For convenience of explanation, Fig. 17 is explained with reference to Fig. 5a.
- step S1430 of FIG. 14 may include steps S1710 to S1730.
- steps S1710 to S1730 may be performed by the electronic device (1000) or a processor (not shown) of the electronic device (1000).
- steps S1710 to S1730 may be performed by any electronic device.
- the detailed steps of step S1430 according to the present disclosure are not limited to those shown in FIG. 17, and any one of the steps shown in FIG. 17 may be omitted, and steps not shown in FIG. 17 may be further included.
- the electronic device (1000) can obtain a text input.
- the electronic device (1000) can obtain a text input from an external server.
- the electronic device (1000) can obtain a text input from a user interface.
- the text input can include a sentence describing the image including information of a part of the region and/or the final generated image.
- the electronic device (1000) can encode a text input.
- the electronic device (1000) can encode the text input using a pre-learned encoder.
- the encoded text input can be input to at least one layer of the second generation model (1200).
- the encoded text input can be utilized as guidance information in the second generation model (1200).
- the electronic device (1000) may obtain a final generated image including second image information that is at least partially different from the first image information by using a second generation model (1200) that takes as input an encoded text input, an image including information of a portion of the region, and an intermediate generated image.
- the electronic device (1000) may generate current noise information based on the intermediate generated image.
- the electronic device (1000) may input the image including information of a portion of the region and the current noise information into the second generation model (1200).
- the electronic device (1000) may input the encoded text input into at least one layer of the second generation model (1200).
- the electronic device (1000) may obtain the next noise information from the second generation model (1200).
- Fig. 18 is a flowchart for explaining detailed steps of step S1430 of Fig. 14. Contents overlapping with those explained in Figs. 1 to 17 are omitted. For convenience of explanation, Fig. 18 is explained with reference to Figs. 9a and 9b.
- step S1430 of FIG. 14 may include steps S1810 to S1830.
- steps S1810 to S1830 may be performed by the electronic device (1000) or a processor (not shown) of the electronic device (1000).
- steps S1810 to S1830 may be performed by any electronic device.
- the detailed steps of step S1430 according to the present disclosure are not limited to those shown in FIG. 18, and any one of the steps shown in FIG. 18 may be omitted, and steps not shown in FIG. 18 may be further included.
- the electronic device (1000) can obtain a denoising intensity for the intermediate generated image.
- the denoising intensity can correspond to an amount of noise to be added to the intermediate generated image.
- the denoising intensity can be predefined.
- the electronic device (1000) can obtain a predicted confidence value based on the intermediate generated image.
- the electronic device (1000) can determine the denoising intensity based on the predicted confidence value.
- the electronic device (1000) can identify a size and/or shape of a portion of the region (e.g., a masked region) based on a mask map.
- the electronic device (1000) can determine the denoising intensity based on the size and/or shape of the portion of the region (e.g., a masked region).
- the electronic device (1000) may add noise to the intermediate generated image based on the denoising intensity.
- the higher the denoising intensity, the greater the amount of noise added.
- in step S1830, the electronic device (1000) can obtain a final generated image including second image information that is at least partially different from the first image information by using a second generation model (1200) that takes as input an image including information of a portion of the region and an intermediate generated image with added noise.
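The noise-addition step can be sketched as follows. The mixing rule (a variance-preserving blend of the image and Gaussian noise) and the mapping from denoising intensity to a starting denoising order are assumptions borrowed from common diffusion practice. The disclosure only states that a higher intensity adds more noise.

```python
import numpy as np

def add_noise(intermediate_image: np.ndarray, denoising_intensity: float,
              rng: np.random.Generator) -> np.ndarray:
    """Mix the image with Gaussian noise; intensity 0 keeps it, 1 replaces it."""
    s = float(np.clip(denoising_intensity, 0.0, 1.0))
    noise = rng.standard_normal(intermediate_image.shape)
    return np.sqrt(1.0 - s) * intermediate_image + np.sqrt(s) * noise

def starting_order(denoising_intensity: float, total_orders: int) -> int:
    """Hypothetical mapping from intensity to the denoising order n to resume from."""
    s = float(np.clip(denoising_intensity, 0.0, 1.0))
    return int(round(s * total_orders))

rng = np.random.default_rng(0)
intermediate = np.ones((3, 8, 8))
lightly_noised = add_noise(intermediate, 0.1, rng)
heavily_noised = add_noise(intermediate, 0.9, rng)
```

Under these assumptions, a low intensity leaves the intermediate generated image mostly intact, so the second generation model only refines it, while a high intensity lets the model regenerate most of the content.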
- Fig. 19 is a flowchart for explaining detailed steps of step S1430 of Fig. 14. Contents overlapping with those explained in Figs. 1 to 18 are omitted. For convenience of explanation, Fig. 19 is explained with reference to Figs. 9a and 9b.
- the electronic device (1000) can obtain current noise information.
- the electronic device (1000) can obtain current noise information from the output of the second generation model (1200).
- there may be a case in which the output of the second generation model (1200) does not yet exist.
- in that case, the electronic device (1000) can generate current noise information composed of random values.
- the electronic device (1000) can generate current noise information composed of random values according to Gaussian noise.
- the electronic device (1000) may connect the current noise information to the image (or the encoded image) including information of a portion of the region.
- the electronic device (1000) may further connect, to the connected current noise information and image, the mask map or a mask map resized to match the dimensions of the image including information of a portion of the region.
- in step S1930, the electronic device (1000) can input the connected image to the second generation model (1200).
- the second generation model (1200) can output the next noise information based on the connected image by performing a denoising operation.
- in step S1940, the electronic device (1000) can obtain the next noise information, which is the output of the second generation model (1200).
- in step S1950, the electronic device (1000) can determine whether the denoising operation has been repeated as many times as the predefined total number of denoising orders. Based on determining that the denoising operation has been repeated that many times (Yes), the electronic device (1000) can generate a final generated image based on the next noise information. Based on determining that the denoising operation has not yet been repeated that many times (No), the procedure returns to step S1910.
- the next noise information can be the current noise information of the next denoising order.
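The iterative procedure described above can be sketched as a simple loop. The `toy_generation_model` function is a placeholder that merely pulls the noise toward the conditioning image. It stands in for the second generation model (1200) and is not the disclosed network. The channel order of the connected data and all shapes are likewise assumptions.

```python
import numpy as np

def toy_generation_model(connected: np.ndarray) -> np.ndarray:
    """Placeholder denoiser: moves the current noise halfway toward the image."""
    current_noise, image = connected[:3], connected[3:6]
    return 0.5 * current_noise + 0.5 * image

def generate(image, mask_map, total_orders, rng, model=toy_generation_model):
    current_noise = None
    for _ in range(total_orders):
        # S1910: with no model output yet, start from Gaussian random values.
        if current_noise is None:
            current_noise = rng.standard_normal(image.shape)
        # Connect current noise, image, and mask map along the channel axis.
        connected = np.concatenate([current_noise, image, mask_map], axis=0)
        # S1930 and S1940: run the model and take its output as next noise information.
        next_noise = model(connected)
        # The next noise information becomes the current noise of the next order.
        current_noise = next_noise
    # S1950 (Yes): the loop has run total_orders times.
    return current_noise

rng = np.random.default_rng(0)
image = rng.random((3, 8, 8))
mask_map = np.zeros((1, 8, 8))
mask_map[:, 2:6, 2:6] = 1.0
final_image = generate(image, mask_map, total_orders=20, rng=rng)
```

In a latent-space variant, the result of the loop would still have to be decoded (e.g., by the decoder (1400)) before it becomes the final generated image.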
- FIG. 20 is a flowchart for explaining a method of learning an encoder for encoding an image according to one embodiment of the present disclosure. Any content overlapping with that described in FIGS. 1 to 19 is omitted.
- FIG. 20 will be explained with reference to FIGS. 12a and 13b.
- the configuration, function, and operation of the fifth encoder (1350), the sixth encoder (21), and the seventh encoder (22) of FIGS. 12a and 13b may correspond to the configuration, function, and operation of the first encoder, the second encoder, and the third encoder in FIG. 20, respectively.
- a method for learning an encoder for encoding an image may include steps S2010 to S2070.
- steps S2010 to S2070 may be performed by an encoder learning system (20), and at least some of the functions of the encoder learning system (20) may be performed by any electronic device or a processor of any electronic device.
- a method for learning an encoder for encoding an image according to one embodiment of the present disclosure is not limited to that illustrated in FIG. 20, and any one of the steps illustrated in FIG. 20 may be omitted, and steps not illustrated in FIG. 20 may be further included.
- the encoder learning system (20) can obtain a first image including information about a portion of the entire area, a second image including image information about the entire area, and text representing the second image.
- the encoder learning system (20) can connect a first mask map that distinguishes a portion of the entire area to the first image.
- in step S2020, the encoder learning system (20) can obtain a first image embedding using a first encoder that takes a first image as input.
- the encoder learning system (20) can obtain a second image embedding by using a second encoder that takes a second image as input.
- the second encoder may be a pre-trained artificial intelligence model.
- the second encoder may not be trained further beyond its pre-trained state.
- the second encoder may have fixed parameters that are no longer updated.
- the first encoder and the second encoder may be the same encoder. In this case, the encoders corresponding to the first encoder and the second encoder may be encoders for which learning has not been completed.
- the encoders corresponding to the first encoder and the second encoder may be untrained, but the present disclosure is not limited thereto, and the encoders may be pre-trained artificial intelligence models.
- the pre-trained encoder can be further trained using additional training techniques such as fine-tuning.
- the encoder learning system (20) can connect, to the second image, a second mask map in which the entire area is composed of a single value.
- the encoder learning system (20) can obtain text embedding by using a third encoder that takes text as input.
- the third encoder may be a pre-trained artificial intelligence model.
- the third encoder may not be trained further beyond its pre-trained state.
- the third encoder may have fixed parameters that are no longer updated.
- in step S2050, the encoder learning system (20) can obtain a first loss based on the first image embedding and the second image embedding.
- in step S2060, the encoder learning system (20) can obtain a second loss based on the first image embedding and the text embedding.
- the encoder learning system (20) can update at least one parameter of the first encoder based on the first loss and the second loss. In one embodiment of the present disclosure, the encoder learning system (20) can update at least one parameter of the second encoder and/or the third encoder based on the first loss and the second loss. In one embodiment of the present disclosure, the encoder learning system (20) can repeatedly update at least one parameter of the first encoder, the second encoder, and/or the third encoder for a predefined number of learning rounds.
- FIG. 21 is a flowchart for explaining a method for generating a portion of an image using a generation model according to one embodiment of the present disclosure. Any content overlapping with that described in FIGS. 1 to 20 is omitted. For convenience of explanation, FIG. 21 will be described with reference to FIG. 11.
- the configuration, function, and operation of the fifth encoder (1350) and the second generation model (1200) of FIG. 11 may correspond to the configuration, function, and operation of the first encoder and the generation model of FIG. 21.
- a method for generating a partial region of an image using a generation model may include steps S2110 to S2130.
- steps S2110 to S2130 may be performed by an electronic device (1000) or a processor (not shown) of the electronic device (1000).
- the present disclosure is not limited thereto, and steps S2110 to S2130 may be performed by any electronic device.
- a method for generating a partial region of an image using a generation model according to one embodiment of the present disclosure is not limited to that illustrated in FIG. 21, and any one of the steps illustrated in FIG. 21 may be omitted, or steps not illustrated in FIG. 21 may be further included.
- in step S2110, the electronic device (1000) can obtain an image including information of a certain area.
- the electronic device (1000) may obtain a target image embedding by using a first encoder that takes as input an image including information of a partial region.
- the first encoder may be an encoder trained by a procedure that includes: obtaining a first learning image including location information for a partial region among an entire region, a second learning image including image information for the entire region, and learning text representing the second learning image; obtaining a first image embedding by using the first encoder, which takes the first learning image as input; obtaining a second image embedding by using a second encoder, which takes the second learning image as input; obtaining a text embedding by using a third encoder, which takes the learning text as input; obtaining a first loss based on the first image embedding and the second image embedding; obtaining a second loss based on the first image embedding and the text embedding; and updating at least one parameter of the first encoder based on the first loss and the second loss.
- the electronic device (1000) can update at least one parameter of the second encoder and/or
- in step S2130, the electronic device (1000) can obtain a final generated image by using a generation model that takes as input the image including information of a portion of the region and the target image embedding.
- the generative model may include a first neural network that outputs a final generated image based on an image including information of a portion of the region.
- the electronic device (1000) may input a target image embedding to at least one layer of the first neural network.
- the generative model may include an interpreter that transforms the target image embedding and transmits it to at least one layer of the first neural network.
- the interpreter of the generative model may include a first single-layer perceptron and a second single-layer perceptron.
- the interpreter of the generative model may include three or more single-layer perceptrons.
- the interpreter of the generative model may include a single single-layer perceptron.
- the electronic device (1000) can obtain an intermediate image embedding by using a first single-layer perceptron having a target image embedding as an input.
- the electronic device (1000) can obtain a result image embedding by using a second single-layer perceptron having the intermediate image embedding as an input.
- the electronic device (1000) can input the result image embedding to at least one layer of the first neural network.
- the electronic device (1000) can chain multiple single-layer perceptrons, each of which takes as input the image embedding output by the previous single-layer perceptron, and obtain a result image embedding using the last single-layer perceptron. In one embodiment of the present disclosure, the electronic device (1000) can obtain a result image embedding directly using the first single-layer perceptron.
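The interpreter described above can be sketched as a chain of single-layer perceptrons in which each layer consumes the embedding produced by the previous one. The class names, embedding sizes, random initialization, and ReLU activation are assumptions for illustration. The disclosure does not specify them.

```python
import numpy as np

class SingleLayerPerceptron:
    """One affine layer followed by an (assumed) ReLU activation."""
    def __init__(self, in_dim: int, out_dim: int, rng: np.random.Generator):
        self.weight = rng.standard_normal((in_dim, out_dim)) / np.sqrt(in_dim)
        self.bias = np.zeros(out_dim)

    def __call__(self, x: np.ndarray) -> np.ndarray:
        return np.maximum(x @ self.weight + self.bias, 0.0)

class Interpreter:
    """Chain of perceptrons; dims lists the embedding size before and after each layer."""
    def __init__(self, dims, rng: np.random.Generator):
        self.layers = [SingleLayerPerceptron(i, o, rng)
                       for i, o in zip(dims[:-1], dims[1:])]

    def __call__(self, embedding: np.ndarray) -> np.ndarray:
        for layer in self.layers:
            embedding = layer(embedding)
        return embedding

rng = np.random.default_rng(0)
target_image_embedding = rng.standard_normal(16)

# Two perceptrons: target -> intermediate (32) -> result (8).
two_layer = Interpreter([16, 32, 8], rng)
result_image_embedding = two_layer(target_image_embedding)

# A single perceptron yields the result embedding directly.
one_layer = Interpreter([16, 8], rng)
direct_result = one_layer(target_image_embedding)
```

The same class covers the one-, two-, and three-or-more-perceptron variants by changing the length of `dims`.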
- the generative model can output a first noise based on an image including information of a portion of the region.
- the generative model can output a second noise based on an image including information of a portion of the region.
- the generative model can include an interpreter that transforms a target image embedding and passes it to at least one layer of the first neural network.
- the generative model may include a first neural network and a second neural network.
- the first neural network may output a first noise using at least one of an image including information of a portion of a region, an intermediate generated image, and image guidance information.
- the second neural network may output a second noise using at least one of an image including information of a portion of a region, an intermediate generated image, and text guidance information.
- the electronic device (1000) can obtain a target text corresponding to an image including information of a portion of the image.
- the electronic device (1000) can obtain a target text embedding based on the target text.
- the electronic device (1000) can input the target text embedding to at least one layer of the second neural network.
- the electronic device (1000) can obtain a final generated image based on the first noise and the second noise. In one embodiment of the present disclosure, the electronic device (1000) can obtain the next noise information through a weighted sum of the first noise and the second noise.
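The weighted sum of the two noise outputs can be sketched as below. The parameter name `image_weight` and the choice of weights that sum to one are assumptions. The disclosure only states that a weighted sum is used.

```python
import numpy as np

def combine_noise(first_noise: np.ndarray, second_noise: np.ndarray,
                  image_weight: float = 0.5) -> np.ndarray:
    """Weighted sum of the image-guided and text-guided noise (weights sum to 1)."""
    return image_weight * first_noise + (1.0 - image_weight) * second_noise

first_noise = np.zeros((3, 4, 4))   # output of the first neural network
second_noise = np.ones((3, 4, 4))   # output of the second neural network
next_noise = combine_noise(first_noise, second_noise, image_weight=0.25)
```

Raising `image_weight` would make the image guidance dominate the next noise information, while lowering it would favor the text guidance.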
- FIG. 22 is a block diagram for explaining the configuration of a user device according to one embodiment of the present disclosure. Any content overlapping with that described in FIGS. 1 to 21 is omitted.
- the configuration, function, and operation of the electronic device (1000) of FIGS. 1 to 11 may correspond to the configuration, function, and operation of the user device (2000) of FIG. 22.
- the user device (2000) may include a communication interface (2100), a user interface (2200), a camera (2300), a processor (2400), and a memory (2500).
- a communication interface (2100) may be implemented with more components than the illustrated components, or may be implemented with fewer components.
- the communication interface (2100) may include one or more components that perform communication between the user device (2000) and a server device (not shown), the user device (2000) and any electronic device (not shown), and the user device (2000) and another user device (not shown).
- the user device (2000) can receive an image including a portion (or a masked region) from a server device via the communication interface (2100).
- the user device (2000) can receive a mask map from the server device via the communication interface (2100).
- the user device (2000) can receive a text input from the server device via the communication interface (2100).
- the user device (2000) can receive various hyperparameters (e.g., total denoising order, etc.) necessary for inferring a final generated image from another electronic device via the communication interface (2100).
- the user device (2000) can receive a pre-trained generative model and/or a pre-trained encoder from the server device via the communication interface (2100).
- the user interface (2200) may include an input interface and an output interface.
- the input interface is for receiving input from a user (hereinafter, user input).
- the input interface may be at least one of a key pad, a dome switch, a touch pad (contact electrostatic capacitance type, pressure resistive film type, infrared detection type, surface ultrasonic conduction type, integral tension measurement type, piezo effect type, etc.), a jog wheel, a jog switch, and a microphone, but is not limited thereto.
- the user device (2000) can receive hyperparameters, etc. set by the user through the input interface.
- the user device (2000) can receive images and/or texts through the input interface.
- the user device (2000) can obtain, through a microphone, an audio signal generated from the user's voice.
- the user device (2000) can convert the audio signal into text.
- the output interface is for outputting audio signals or video signals, and may include, for example, a display or a speaker.
- the user device (2000) can display an image through the display.
- the user device (2000) can display a GUI corresponding to the input interface through the display.
- the user device (2000) can display an image through the display.
- the user device (2000) can receive a user input for specifying a certain area of the image displayed on the display through the input interface.
- the user device (2000) can mask a certain area of the image based on the user input.
- the user device (2000) can receive a user input for rotating and/or resizing an image displayed on a display through an input interface.
- the user device (2000) can mask an area where there is no image information within a predefined image size based on the user input.
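As a minimal sketch of masking the area that has no image information after a resize, the following builds a mask map for an image placed on a fixed-size canvas. The function name, the rectangular placement, and the convention that 1 marks the area to be generated are assumptions for illustration; rotation is not handled here.

```python
def mask_for_placed_image(canvas_h, canvas_w, top, left, img_h, img_w):
    """Build a mask map for a canvas of the predefined image size.

    A cell is 1 where the canvas has no image information (to be generated)
    and 0 where the resized image covers it. This 0/1 convention is assumed.
    """
    mask = []
    for y in range(canvas_h):
        row = []
        for x in range(canvas_w):
            covered = top <= y < top + img_h and left <= x < left + img_w
            row.append(0 if covered else 1)
        mask.append(row)
    return mask


# A 2x3 image placed at row 1, column 2 of a 4x6 canvas.
mask = mask_for_placed_image(4, 6, top=1, left=2, img_h=2, img_w=3)
```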
- the user device (2000) can receive a user input (e.g., an arbitrary line or shape) drawn on an image displayed on a display through an input interface.
- the user device (2000) can mask an area corresponding to the user input.
- the user device (2000) may segment at least one object area in an image using an artificial intelligence model that performs object segmentation.
- the artificial intelligence model may be stored in the memory (2500) of the user device (2000).
- the processor (2400) of the user device (2000) may input an image into the artificial intelligence model to output a segmentation result.
- the segmentation result of the artificial intelligence model may be received from a device (e.g., a server) external to the user device (2000).
- the user device (2000) may display the image and the segmentation result together on the display.
- the user device (2000) may receive a user input for selecting at least one object displayed on the display through an input interface.
- the user device (2000) may determine an object corresponding to the user input among the objects according to the segmentation result.
- the user device (2000) may mask the area of the determined object.
- the present disclosure is not limited thereto, and the user device (2000) may merge and mask the area of the determined object and the area corresponding to the user input (e.g., any line or shape) drawn on the image displayed on the display.
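The object-selection flow above can be sketched as follows, assuming the segmentation result is a label map with one integer label per pixel and the user input is a tapped pixel position. The names `mask_from_selection`, `label_map`, and `drawn_mask` are hypothetical.

```python
def mask_from_selection(label_map, tap, drawn_mask=None):
    """Mask the object under the tapped pixel, optionally merged with a drawn area.

    label_map: 2D list of integer object labels (assumed segmentation output).
    tap: (row, col) position selected by the user.
    drawn_mask: optional 2D list of 0/1 values for a user-drawn line or shape.
    """
    row, col = tap
    selected = label_map[row][col]
    mask = [[1 if label == selected else 0 for label in labels]
            for labels in label_map]
    if drawn_mask is not None:
        # Merge the object area and the drawn area (logical OR).
        mask = [[a | b for a, b in zip(mask_row, drawn_row)]
                for mask_row, drawn_row in zip(mask, drawn_mask)]
    return mask


label_map = [[0, 0, 1],
             [0, 2, 1],
             [2, 2, 1]]
object_mask = mask_from_selection(label_map, tap=(1, 1))
merged_mask = mask_from_selection(
    label_map, tap=(1, 1),
    drawn_mask=[[1, 0, 0], [0, 0, 0], [0, 0, 0]],
)
```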
- the user device (2000) may display an image through the display.
- the user device (2000) may receive a user input corresponding to at least one position value of the image displayed on the display through an input interface.
- the at least one position value may be composed of coordinate values of an image pixel.
- the at least one position value may be composed of coordinate values of an image pixel corresponding to a boundary of a specific area within the image.
- the user device (2000) may obtain a segmentation map that distinguishes an area of an object corresponding to the at least one position value and other areas by using an artificial intelligence model that inputs at least one position value.
- the user device (2000) may mask an area of an object corresponding to the user input based on the segmentation map.
- the display may include at least one of a liquid crystal display, a thin film transistor-liquid crystal display, a light-emitting diode (LED), an organic light-emitting diode, a flexible display, a 3D display, and an electrophoretic display. Depending on the implementation of the user device (2000), the user device (2000) may include two or more displays.
- the speaker can output an audio signal received from a communication interface (2100) or stored in a memory (2500).
- the camera (2300) can capture a surrounding space to generate an image.
- the camera (2300) can include an image sensor.
- the user device (2000) can train the first generation model (1100) and/or the second generation model (1200) based on the image captured by the camera (2300).
- the user device (2000) can obtain a final generated image by inputting the image captured by the camera (2300) into the first generation model (1100) and/or the second generation model (1200).
- the processor (2400) can control the overall operation of the user device (2000) using a program or information stored in the memory (2500).
- the processor (2400) may be implemented through a combination of a general-purpose processor such as an application processor (AP), a central processing unit (CPU), or a graphic processing unit (GPU) and software.
- In the case of a dedicated processor, the processor may include a memory for implementing an embodiment of the present disclosure or a memory processing unit for using an external memory.
- the processor (2400) may be composed of a plurality of processors. In this case, it may be implemented through a combination of dedicated processors, or it may be implemented through a combination of a plurality of general-purpose processors such as an AP, a CPU, or a GPU, and software.
- the processor (2400) may include an artificial intelligence (AI) dedicated processor.
- the AI dedicated processor may be manufactured in the form of a dedicated hardware chip for artificial intelligence (AI), or may be manufactured as a part of an existing general-purpose processor (e.g., a CPU or an application processor) or a graphics dedicated processor (e.g., a GPU) and mounted on the user device (2000).
- the AI dedicated processor may perform inference and/or learning operations related to at least one of the first generation model (1100), the second generation model (1200), the encoder (1300), and the decoder (1400).
- the processor (2400) may infer a designated area of an image based on image information excluding a designated area of the image using the first generation model (1100) and/or the second generation model (1200).
- the designated area may mean an unknown area.
- the processor (2400) can train at least one of the first generative model (1100), the second generative model (1200), the encoder (1300), and the decoder (1400) using a training data set stored in the memory (2500).
- the processor (2400) can store the trained first generative model (1100), the second generative model (1200), the encoder (1300), and/or the decoder (1400) in the memory (2500).
- the memory (2500) may store a program for processing by the processor (2400) and may also store input/output data.
- the memory (2500) may include at least one type of storage medium among a flash memory type, a hard disk type, a multimedia card micro type, a card type memory (for example, an SD or XD memory, etc.), a RAM (Random Access Memory), a SRAM (Static Random Access Memory), a ROM (Read-Only Memory), an EEPROM (Electrically Erasable Programmable Read-Only Memory), a PROM (Programmable Read-Only Memory), a magnetic memory, a magnetic disk, and an optical disk.
- the programs stored in the memory (2500) may be classified into a plurality of modules according to their functions.
- the memory (2500) may include a first generation model (1100), a second generation model (1200), an encoder (1300), a decoder (1400), a noise generator (1500), and a denoising strength determiner (1550).
- the configuration, function, and operation of the first generation model (1100), the second generation model (1200), the decoder (1400), the noise generator (1500), and the denoising strength determiner (1550) may correspond to the configuration, function, and operation of the first generation model (1100), the second generation model (1200), the decoder (1400), the noise generator (1500), and the denoising strength determiner (1550) of FIGS. 1 to 11.
- the encoder (1300) may include a plurality of encoders.
- the configuration, function, and operation of the plurality of encoders may correspond to the configuration, function, and operation of the first to seventh encoders (1310, 1320, 1330, 1340, 1350, 21, 22) of FIGS. 1 to 13b.
- Each of the multiple encoders can be the encoder part of a different autoencoder.
- FIG. 23 is a block diagram for explaining the configuration of a user device and a server device according to one embodiment of the present disclosure.
- Any content overlapping with that described in FIGS. 1 to 21 is omitted.
- the configuration, function, and operation of the electronic device (1000) of FIGS. 1 to 11 may correspond to the configuration, function, and operation of the server device (3000) of FIG. 23.
- the configuration, function, and operation of the user device (2000) of FIG. 22 may correspond to the configuration, function, and operation of the user device (2000) of FIG. 23.
- the server device (3000) may include a communication interface (3100), a processor (3200), and a memory (3300). However, not all of the illustrated components are essential components.
- the server device (3000) may be implemented with more components than the illustrated components, or may be implemented with fewer components.
- the communication interface (3100) may include one or more components that perform communication between the server device (3000) and the user device (2000), the server device (3000) and any electronic device (not shown), and the server device (3000) and an external server device (not shown).
- the server device (3000) can receive an image including a portion (or a masked region) from the user device (2000) through the communication interface (3100).
- the server device (3000) can receive a mask map from the user device (2000) through the communication interface (3100).
- the server device (3000) can receive a text input from the user device (2000) through the communication interface (3100).
- the server device (3000) can receive various hyperparameters (e.g., total denoising order, etc.) required to infer a final generated image from the user device (2000) through the communication interface (3100).
- the processor (3200) can control the overall operation of the server device (3000) using a program or information stored in the memory (3300).
- the processor (3200) may be implemented through a combination of a general-purpose processor such as an application processor (AP), a central processing unit (CPU), or a graphic processing unit (GPU) and software.
- In the case of a dedicated processor, the processor may include a memory for implementing an embodiment of the present disclosure or a memory processing unit for using an external memory.
- the processor (3200) may be composed of a plurality of processors. In this case, it may be implemented through a combination of dedicated processors, or it may be implemented through a combination of a plurality of general-purpose processors such as an AP, a CPU, or a GPU, and software.
- the processor (3200) may include an artificial intelligence (AI) dedicated processor.
- the AI dedicated processor may be manufactured in the form of a dedicated hardware chip for artificial intelligence (AI), or may be manufactured as a part of an existing general-purpose processor (e.g., a CPU or an application processor) or a graphics dedicated processor (e.g., a GPU) and mounted on the server device (3000).
- the AI dedicated processor may perform inference and/or learning operations related to at least one of the first generation model (1100), the second generation model (1200), the encoder (1300), and the decoder (1400).
- the processor (3200) may receive a request signal for generating an image and a portion of the image from the user device (2000) through the communication interface (3100).
- the processor (3200) may infer a final generated image in which a portion of the image is generated by inputting the image into the learned first generation model (1100) and/or the second generation model (1200) in response to the request signal.
- the processor (3200) may transmit the final generated image to the user device (2000) through the communication interface (3100).
- the user device (2000) may receive the final generated image.
- the user device (2000) may display the final generated image through the user interface (2200).
- the processor (3200) can infer a designated region of an image based on image information excluding a designated region of the image using the first generation model (1100) and/or the second generation model (1200).
- the processor (3200) can train at least one of the first generative model (1100), the second generative model (1200), the encoder (1300), and the decoder (1400) using a training data set stored in the memory (3300).
- the processor (3200) can store the trained first generative model (1100), the second generative model (1200), the encoder (1300), and/or the decoder (1400) in the memory (3300).
- the memory (3300) may store a program for processing by the processor (3200) and may also store input/output data.
- the memory (3300) may include at least one type of storage medium among a flash memory type, a hard disk type, a multimedia card micro type, a card type memory (for example, an SD or XD memory, etc.), a RAM (Random Access Memory), a SRAM (Static Random Access Memory), a ROM (Read-Only Memory), an EEPROM (Electrically Erasable Programmable Read-Only Memory), a PROM (Programmable Read-Only Memory), a magnetic memory, a magnetic disk, and an optical disk.
- the programs stored in the memory (3300) may be classified into a plurality of modules according to their functions.
- the memory (3300) may include a first generation model (1100), a second generation model (1200), an encoder (1300), a decoder (1400), a noise generator (1500), and a denoising strength determiner (1550).
- the configuration, function, and operation of the first generation model (1100), the second generation model (1200), the decoder (1400), the noise generator (1500), and the denoising strength determiner (1550) may correspond to the configuration, function, and operation of the first generation model (1100), the second generation model (1200), the decoder (1400), the noise generator (1500), and the denoising strength determiner (1550) of FIGS. 1 to 11.
- the encoder (1300) may include a plurality of encoders.
- the configuration, function, and operation of the plurality of encoders may correspond to the configuration, function, and operation of the first to seventh encoders (1310, 1320, 1330, 1340, 1350, 21, 22) of FIGS. 1 to 13b.
- Each of the multiple encoders can be the encoder part of a different autoencoder.
- the user device (2000) may infer an intermediate generated image using the first generation model (1100) stored in the memory (2500).
- the server device (3000) may receive the intermediate generated image via the communication interface (3100).
- the server device (3000) may infer a final generated image based on the intermediate generated image using the second generation model (1200) stored in the memory (3300).
- this is only an example, and at least some of the first generation model (1100), the second generation model (1200), the encoder (1300), the decoder (1400), the noise generator (1500), and the denoising strength determiner (1550) may be executed on the user device (2000) or on the server device (3000).
- a method for generating a partial region of an image using a generative model may be provided.
- the method may include a step of obtaining an image including information of the partial region.
- the method may include a step of obtaining an intermediate generated image including first image information for the partial region by using a first generative model having as input an image including information of the partial region.
- the method may include a step of obtaining a final generated image including second image information, which is at least partially different from the first image information, by using a second generative model having as input an image including information of the partial region and the intermediate generated image.
- the step of obtaining an image including information of the partial region may include a step of obtaining a mask map that demarcates the partial region from an entire area of the image including information of the partial region.
- the step of obtaining an image including information of the partial region may include a step of concatenating the mask map to the image including information of the partial region.
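Concatenating the mask map to the image can be sketched as appending the mask value to each pixel as an extra channel. The channel-last nested-list layout and the function name `concat_mask_map` are assumptions for illustration; the disclosure does not fix a tensor layout.

```python
def concat_mask_map(image, mask_map):
    """Append the mask value to every pixel as one additional channel.

    image: H x W x C nested lists (channel-last layout is an assumption).
    mask_map: H x W nested lists, e.g. 1 for the partial region to generate.
    """
    same_height = len(image) == len(mask_map)
    same_width = all(len(r) == len(m) for r, m in zip(image, mask_map))
    if not (same_height and same_width):
        raise ValueError("image and mask map must share height and width")
    return [
        [list(pixel) + [mask_value]
         for pixel, mask_value in zip(image_row, mask_row)]
        for image_row, mask_row in zip(image, mask_map)
    ]


image = [[[255, 0, 0], [0, 255, 0]],
         [[0, 0, 255], [0, 0, 0]]]
mask_map = [[0, 0],
            [0, 1]]
model_input = concat_mask_map(image, mask_map)  # each pixel now has 4 channels
```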
- the step of obtaining the final generated image may include the step of encoding the intermediate generated image.
- the step of obtaining the final generated image may include the step of obtaining a final generated image including second image information, which is at least partially different from the first image information, by using a second generation model that takes as input an image including information of the partial region and the encoded intermediate generated image.
- the step of obtaining the final generated image may include the step of obtaining a final generated image including second image information, which is at least partially different from the first image information, by using a second generation model that takes as input the image including information of the partial region, the encoded intermediate generated image, and the intermediate generated image.
- the step of obtaining the final generated image may include the step of obtaining a text input.
- the step of obtaining the final generated image may include the step of encoding the text input.
- the step of obtaining the final generated image may include the step of obtaining a final generated image including second image information, which is at least partially different from the first image information, by using a second generation model that has as input the encoded text input, an image including information of the partial region, and the intermediate generated image.
- the step of obtaining the final generated image may include the step of obtaining a final generated image including second image information, which is at least partially different from the first image information, by using a second generation model that takes as input the encoded text input, the image including information of the partial region, the intermediate generated image, and the encoded intermediate generated image.
- the step of obtaining the final generated image may include the step of obtaining a denoising intensity for the intermediate generated image.
- the step of obtaining the final generated image may include the step of adding noise to the intermediate generated image based on the denoising intensity.
- the step of obtaining the final generated image may include the step of obtaining a final generated image including second image information, which is at least partially different from the first image information, by using a second generation model that takes as input an image including information of the partial region and the intermediate generated image to which the noise has been added.
- the step of obtaining a denoising intensity for the intermediate generated image may include a step of obtaining a predicted confidence value based on the intermediate generated image.
- the step of obtaining the denoising intensity for the intermediate generated image may include a step of determining the denoising intensity based on at least one of the predicted confidence value, the size of the partial region, and the shape of the partial region.
- the step of obtaining the final generated image may include a step of obtaining current noise information.
- the step of obtaining the final generated image may include a step of concatenating the current noise information with the image including information of the partial region.
- the step of obtaining the final generated image may include a step of inputting the concatenated result into the second generation model.
- the step of obtaining the final generated image may include a step of obtaining next noise information from the second generation model.
- the current noise information may correspond to an intermediate generated image to which the noise has been added.
- the step of obtaining the final generated image may include a step of determining a target denoising order corresponding to the intermediate generated image to which noise has been added among a predefined total number of denoising orders based on the denoising intensity.
- the step of obtaining the final generated image may include a step of setting the denoising order of the current noise information to the determined target denoising order.
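The relationship between the denoising intensity, the target denoising order, and the noise added to the intermediate generated image can be sketched as follows. The linear mapping from intensity to step and the linear beta schedule are assumptions borrowed from common diffusion practice; the disclosure does not specify either, and all names here are hypothetical.

```python
import math
import random


def target_denoising_step(denoising_intensity, total_steps):
    """Map an intensity in [0, 1] to a step among the total denoising steps.

    The linear mapping is an assumption made for illustration.
    """
    if not 0.0 <= denoising_intensity <= 1.0:
        raise ValueError("denoising_intensity must lie in [0, 1]")
    return round(denoising_intensity * total_steps)


def add_noise(pixels, step, total_steps, rng):
    """Noise the intermediate image up to `step` (assumed linear beta schedule)."""
    betas = [1e-4 + (0.02 - 1e-4) * i / max(1, total_steps - 1)
             for i in range(total_steps)]
    alpha_bar = 1.0
    for beta in betas[:step]:
        alpha_bar *= 1.0 - beta
    signal_scale = math.sqrt(alpha_bar)
    noise_scale = math.sqrt(1.0 - alpha_bar)
    return [signal_scale * p + noise_scale * rng.gauss(0.0, 1.0)
            for p in pixels]


total_steps = 1000
intermediate_image = [0.1, -0.3, 0.8]  # flattened toy pixel values
start_step = target_denoising_step(0.5, total_steps)
noisy_image = add_noise(intermediate_image, start_step, total_steps,
                        random.Random(0))
```

Under these assumptions, the second generation model would begin denoising from `start_step` instead of from the last step, so a low intensity preserves more of the intermediate generated image.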
- the first generative model may be a generative adversarial network (GAN) model.
- the second generative model may be a diffusion model.
- an electronic device may be provided.
- the electronic device may include a memory storing at least one instruction.
- the electronic device may include at least one processor executing the at least one instruction.
- the at least one processor may obtain an image including information of a portion of an area.
- the at least one processor may obtain an intermediate generated image including first image information for the portion of an area by using a first generation model having as input an image including information of the portion of an area.
- the at least one processor may obtain a final generated image including second image information, at least a portion of which is different from the first image information, by using a second generation model having as input an image including information of the portion of an area and the intermediate generated image.
- a method for training an encoder that encodes an image may be provided.
- the method may include a step of obtaining a first image including information about a portion of an entire region, a second image including image information about the entire region, and a text representing the second image.
- the method may include a step of obtaining a first image embedding using a first encoder having the first image as an input.
- the method may include a step of obtaining a second image embedding using a second encoder having the second image as an input.
- the method may include a step of obtaining a text embedding using a third encoder having the text as an input.
- the method may include a step of obtaining a first loss based on the first image embedding and the second image embedding.
- the method may include a step of obtaining a second loss based on the first image embedding and the text embedding.
- the method may include updating at least one parameter of the first encoder based on the first loss and the second loss.
- the second encoder and the third encoder may be pre-trained artificial intelligence models with fixed parameters.
- the method may include updating at least one parameter of the second encoder and the third encoder based on the first loss and the second loss.
- the method may include a step of concatenating a first mask map, which separates the partial region from the entire region, to the first image.
- the first encoder and the second encoder may be the same encoder.
- the method may include a step of concatenating a second mask map, which represents the entire area with a single value, to the second image.
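The two training losses can be sketched with toy embeddings. The disclosure does not state the form of either loss, so the cosine distance used here, the `text_weight` parameter, and the function names are all assumptions for illustration.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def encoder_losses(first_image_emb, second_image_emb, text_emb, text_weight=1.0):
    """Compute the first loss, the second loss, and their weighted total.

    first loss: first image embedding vs. second image embedding.
    second loss: first image embedding vs. text embedding.
    Cosine distance and `text_weight` are assumed, not taken from the source.
    """
    first_loss = cosine_distance(first_image_emb, second_image_emb)
    second_loss = cosine_distance(first_image_emb, text_emb)
    return first_loss, second_loss, first_loss + text_weight * second_loss


# Toy embeddings: the partial-image embedding aligns with the full-image
# embedding but is orthogonal to the text embedding.
first_loss, second_loss, total_loss = encoder_losses(
    [1.0, 0.0], [2.0, 0.0], [0.0, 3.0]
)
```

In a real training loop, only the first encoder's parameters would receive gradients from `total_loss` when the second and third encoders are kept fixed.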
- a method for generating a portion of an image using a generative model may be provided.
- the method may include a step of obtaining an image including information of the portion of the image.
- the method may include a step of obtaining a target image embedding using a first encoder that inputs an image including information of the portion of the image.
- the method may include a step of obtaining a final generated image using a generative model that inputs the image including information of the portion of the image and the target image embedding.
- the first encoder may be trained by obtaining a first training image including position information for a portion of the entire region, a second training image including image information for the entire region, and training text representing the second training image, obtaining a first image embedding using the first encoder having the first training image as an input, obtaining a second image embedding using the second encoder having the second training image as an input, obtaining a text embedding using the third encoder having the training text as an input, obtaining a first loss based on the first image embedding and the second image embedding, obtaining a second loss based on the first image embedding and the text embedding, and updating at least one parameter of the first encoder based on the first loss and the second loss.
- the second encoder and the third encoder may be pre-trained artificial intelligence models with fixed parameters.
- the first training image may be concatenated with a first mask map that separates the partial region from the entire region.
- the first encoder and the second encoder may be the same encoder.
- the second training image may be concatenated with a second mask map that represents the entire area with a single value.
- the generative model may include a first neural network that outputs the final generated image based on an image including information of the partial region.
- the method may include a step of inputting the target image embedding into at least one layer of the first neural network.
- the generative model may include an interpreter that transforms the target image embedding and passes it to at least one layer of the first neural network.
- the generative model may include at least one single-layer perceptron.
- At least one single-layer perceptron may include a first single-layer perceptron and a second single-layer perceptron.
- the interpreter may be composed of three or more single-layer perceptrons. In one embodiment of the present disclosure, the interpreter may be composed of a single single-layer perceptron.
- the method may include a step of obtaining an intermediate image embedding using the first single-layer perceptron having the target image embedding as an input.
- the method may include a step of obtaining a result image embedding using the second single-layer perceptron having the intermediate image embedding as an input.
- the method may include a step of inputting the result image embedding to at least one layer of the first neural network.
- the method may include a step of causing a next single-layer perceptron, which takes as input the image embedding output by the previous single-layer perceptron, to output a next image embedding.
- the method may repeat the step of causing a next single-layer perceptron, which takes as input the image embedding output by the previous single-layer perceptron, to output a next image embedding.
- the method may include a step of obtaining a result image embedding using a last single-layer perceptron among the at least one single-layer perceptron.
- the method may include a step of inputting the result image embedding to at least one layer of the first neural network.
- the at least one single-layer perceptron may consist of only one single-layer perceptron.
- the method may comprise a step of obtaining a result image embedding using that single-layer perceptron.
- the method may comprise a step of inputting the result image embedding to at least one layer of the first neural network.
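The interpreter described above, whether it has one, two, or more single-layer perceptrons, can be sketched as a chain in which each perceptron consumes the embedding produced by the previous one. The ReLU activation, the weight values, and the names below are assumptions for illustration.

```python
def single_layer_perceptron(weights, bias):
    """Return a function computing relu(W x + b); ReLU is an assumed activation."""
    def forward(embedding):
        return [
            max(0.0, sum(w * v for w, v in zip(row, embedding)) + b)
            for row, b in zip(weights, bias)
        ]
    return forward


def interpret(target_image_embedding, perceptrons):
    """Pass the embedding through each perceptron in order.

    The output of the last perceptron is the result image embedding that
    would be fed to at least one layer of the first neural network.
    """
    embedding = target_image_embedding
    for perceptron in perceptrons:
        embedding = perceptron(embedding)
    return embedding


first = single_layer_perceptron([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
                                [0.0, 0.0, -1.0])
second = single_layer_perceptron([[1.0, 1.0, 1.0]], [0.5])

intermediate_embedding = interpret([1.0, 2.0], [first])
result_embedding = interpret([1.0, 2.0], [first, second])
```

Passing a list with a single perceptron covers the embodiment with only one single-layer perceptron without changing `interpret`.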
- the generative model may include a first neural network that outputs a first noise based on an image including information of the partial region.
- the generative model may include a second neural network that outputs a second noise based on an image including information of the partial region.
- the generative model may include an interpreter that transforms the target image embedding and transmits it to at least one layer of the first neural network.
- the first neural network can receive an image including information of the partial region and the target image embedding as input and output a first noise.
- the first neural network can receive an image including information of the partial region, the target image embedding, and an intermediate generated image as input and output a first noise.
- the method may include a step of obtaining a target text corresponding to an image including information of the partial region.
- the method may include a step of obtaining a target text embedding based on the target text.
- the method may include a step of inputting the target text embedding to at least one layer of the second neural network.
- the second neural network can receive an image including information of the partial region and a target text embedding as input and output a second noise.
- the second neural network can receive an image including information of the partial region, a target text embedding, and an intermediate generated image as input and output a second noise.
- the method may include a step of obtaining the final generated image based on the first noise and the second noise.
- an electronic device may be provided.
- the electronic device may include a memory storing at least one instruction.
- the electronic device may include at least one processor executing the at least one instruction.
- An electronic device wherein the at least one processor obtains a first image including information about a portion of an entire area, a second image including image information about the entire area, and text representing the second image, the at least one processor obtains a first image embedding using a first encoder having the first image as an input, the at least one processor obtains a second image embedding using a second encoder having the second image as an input, the at least one processor obtains a second text embedding using a third encoder having the text as an input, the at least one processor obtains a first loss based on the first image embedding and the second image embedding, the at least one processor obtains a second loss based on the first image embedding and the text embedding, and the at least one processor updates at least one parameter of the first encoder based on the first loss and the
- The method according to an embodiment of the present disclosure may be implemented as program commands that are executable by various computer means and recorded on a computer-readable medium.
- The computer-readable medium may include program commands, data files, data structures, and the like, alone or in combination.
- The program commands recorded on the medium may be specially designed and configured for the present disclosure, or may be known and available to those skilled in the art of computer software.
- Examples of the computer-readable recording medium include magnetic media such as hard disks, floppy disks, and magnetic tapes, optical media such as CD-ROMs and DVDs, magneto-optical media such as floptical disks, and hardware devices specially configured to store and execute program commands such as ROMs, RAMs, flash memories, etc.
- Examples of the program commands include not only machine language codes generated by a compiler, but also high-level language codes that can be executed by a computer using an interpreter, etc.
- Computer-readable media may be any available media that can be accessed by a computer, and includes both volatile and nonvolatile media, removable and non-removable media.
- Computer-readable media may include both computer storage media and communication media.
- Computer storage media includes both volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information, such as computer-readable instructions, data structures, program modules, or other data.
- Communication media typically includes computer-readable instructions, data structures, program modules, or other data in a modulated data signal, such as a carrier wave, or other transport mechanism, and includes any information delivery media.
- Some embodiments of the present disclosure may also be implemented as a computer program or computer program product containing computer-executable instructions, such as a computer program executed by a computer.
- The device-readable storage medium may be provided in the form of a non-transitory storage medium.
- The term 'non-transitory storage medium' means only that the medium is a tangible device and does not include a signal (e.g., an electromagnetic wave); the term does not distinguish between data stored semi-permanently in the storage medium and data stored temporarily.
- The 'non-transitory storage medium' may include a buffer in which data is temporarily stored.
- The method according to various embodiments disclosed in this document may be provided in a computer program product.
- The computer program product may be traded between a seller and a buyer as a commodity.
- The computer program product may be distributed in the form of a machine-readable storage medium (e.g., a compact disc read-only memory (CD-ROM)), or may be distributed online (e.g., downloaded or uploaded) through an application store or directly between two user devices (e.g., smartphones).
- At least a part of the computer program product may be at least temporarily stored or temporarily generated in a machine-readable storage medium, such as a memory of a manufacturer's server, a server of an application store, or an intermediary server.
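The encoder-training bullet above names two losses: a first loss between the partial-image embedding and the full-image embedding, and a second loss between the partial-image embedding and the text embedding. The sketch below is one possible reading, not the disclosed implementation. The source does not state the form of either loss or how they are combined (the sentence is cut off), so the cosine-distance loss, the weighted sum, and the names `alignment_loss`, `encoder_training_loss`, and `text_weight` are all assumptions made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero vectors")
    return dot / (norm_a * norm_b)


def alignment_loss(a, b):
    """Assumed loss form: cosine distance (0 when the vectors point the same way)."""
    return 1.0 - cosine_similarity(a, b)


def encoder_training_loss(first_image_emb, second_image_emb, text_emb,
                          text_weight=1.0):
    """Return (first_loss, second_loss, total) for one training example.

    first_image_emb:  embedding of the partial image (first encoder)
    second_image_emb: embedding of the full image (second encoder)
    text_emb:         embedding of the text describing the full image
    text_weight:      hypothetical weight on the second loss (assumption)
    """
    first_loss = alignment_loss(first_image_emb, second_image_emb)
    second_loss = alignment_loss(first_image_emb, text_emb)
    total = first_loss + text_weight * second_loss  # assumed combination
    return first_loss, second_loss, total


if __name__ == "__main__":
    print(encoder_training_loss([0.6, 0.8], [0.8, 0.6], [1.0, 0.0]))
```

In a real training loop, only the first encoder's parameters would be updated from the total, which matches the bullet's statement that the first encoder is the one being trained.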
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- General Engineering & Computer Science (AREA)
- General Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Biophysics (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Biomedical Technology (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Image Processing (AREA)
Claims (15)
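- A method of generating a partial region of an image by using a generative model, the method comprising: obtaining an image including information of the partial region; obtaining an intermediate generated image including first image information for the partial region by inputting the image to a first generative model (1100); and obtaining a final generated image including second image information, at least part of which differs from the first image information, by inputting the image and the intermediate generated image to a second generative model (1200).
- The method of claim 1, wherein the obtaining of the image including the information of the partial region further comprises: obtaining a mask map that distinguishes the partial region within an entire region of the image; and concatenating the mask map to the image.
- The method of any one of claims 1 and 2, wherein the obtaining of the final generated image comprises: encoding the intermediate generated image; and obtaining the final generated image by inputting the image and the encoded intermediate generated image to the second generative model.
- The method of any one of claims 1 to 3, wherein the obtaining of the final generated image comprises: obtaining a text input; encoding the text input; and obtaining the final generated image by inputting the encoded text input, the image, and the intermediate generated image to the second generative model.
- The method of any one of claims 1 to 4, wherein the obtaining of the final generated image comprises: obtaining a denoising strength for the intermediate generated image; adding noise to the intermediate generated image based on the denoising strength; and obtaining the final generated image by inputting the image and the noise-added intermediate generated image to the second generative model.
- The method of any one of claims 1 to 5, wherein the obtaining of the denoising strength for the intermediate generated image comprises: obtaining a prediction confidence value based on the intermediate generated image; and determining the denoising strength based on at least one of the prediction confidence value, a size of the partial region, and a shape of the partial region.
- The method of any one of claims 1 to 6, wherein the obtaining of the final generated image comprises: obtaining current noise information; concatenating the current noise information and the image; inputting the concatenated image to the second generative model; and obtaining next noise information from the second generative model.
- The method of any one of claims 1 to 7, wherein the obtaining of the final generated image comprises: determining, based on the denoising strength, a target denoising step, among a predefined total number of denoising steps, that corresponds to the noise-added intermediate generated image; and setting a denoising step of the current noise information to the determined target denoising step.
- The method of any one of claims 1 to 8, wherein the first generative model is a generative adversarial network (GAN) model, and the second generative model is a diffusion model.
- An electronic device comprising: a memory (2500; 3300) storing at least one instruction; and at least one processor (2400; 3200), wherein, by the at least one processor (2400; 3200) executing the at least one instruction stored in the memory, the electronic device: obtains an image including information of a partial region; obtains an intermediate generated image including first image information for the partial region by inputting the image to a first generative model (1100); and obtains a final generated image including second image information, at least part of which differs from the first image information, by inputting the image and the intermediate generated image to a second generative model (1200).
- The electronic device of claim 10, wherein, by the at least one processor (2400; 3200) executing the at least one instruction stored in the memory, the electronic device: obtains a mask map that distinguishes the partial region within an entire region of the image; and concatenates the mask map to the image.
- The electronic device of any one of claims 10 and 11, wherein, by the at least one processor (2400; 3200) executing the at least one instruction stored in the memory, the electronic device: encodes the intermediate generated image; and obtains the final generated image by inputting the image and the encoded intermediate generated image to the second generative model.
- The electronic device of any one of claims 10 to 12, wherein, by the at least one processor (2400; 3200) executing the at least one instruction stored in the memory, the electronic device: obtains a text input; encodes the text input; and obtains the final generated image by inputting the encoded text input, the image, and the intermediate generated image to the second generative model.
- The electronic device of any one of claims 10 to 13, wherein, by the at least one processor (2400; 3200) executing the at least one instruction stored in the memory, the electronic device: obtains a denoising strength for the intermediate generated image; adds noise to the intermediate generated image based on the denoising strength; and obtains the final generated image by inputting the image and the noise-added intermediate generated image to the second generative model.
- A computer-readable recording medium having recorded thereon a program for performing, on a computer, the method of any one of claims 1 to 9.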
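The two-stage flow in the claims (mask concatenation, a first model producing an intermediate image, a denoising strength that sets how much noise is added and at which step the second model starts) can be sketched as follows. This is a hedged illustration on flat lists of pixel values, not the claimed implementation. The strength heuristic, the cosine-style noise schedule, and every function name here (`denoising_strength`, `target_step`, `alpha_bar`, `add_noise`, `two_stage_inpaint`, and the stub models) are assumptions. The claims say only that the strength depends on at least one of a prediction confidence value and the size and shape of the partial region.

```python
import math
import random


def denoising_strength(confidence, mask_fraction):
    """Hypothetical heuristic: low confidence or a large masked area
    leads to a stronger denoising pass. Result is clamped to [0, 1]."""
    raw = 0.5 * (1.0 - confidence) + 0.5 * mask_fraction
    return min(1.0, max(0.0, raw))


def target_step(strength, total_steps):
    """Map a strength in [0, 1] to the denoising step to start from."""
    return min(total_steps, max(0, round(strength * total_steps)))


def alpha_bar(step, total_steps):
    """Fraction of signal kept at a step (assumed cosine-style schedule)."""
    return math.cos(0.5 * math.pi * step / total_steps) ** 2


def add_noise(pixels, step, total_steps, rng):
    """Blend each pixel with Gaussian noise according to the schedule."""
    keep = alpha_bar(step, total_steps)
    return [math.sqrt(keep) * p + math.sqrt(1.0 - keep) * rng.gauss(0.0, 1.0)
            for p in pixels]


def two_stage_inpaint(image, mask, first_model, second_model,
                      confidence, total_steps=50, seed=0):
    """Run first_model, noise its output, then refine with second_model."""
    rng = random.Random(seed)
    # Per-pixel concatenation of the image value and its mask flag.
    conditioned = list(zip(image, mask))
    intermediate = first_model(conditioned)
    strength = denoising_strength(confidence, sum(mask) / len(mask))
    step = target_step(strength, total_steps)
    noisy = add_noise(intermediate, step, total_steps, rng)
    return second_model(conditioned, noisy, step)


def stub_first_model(conditioned):
    """Stand-in for the first generative model: fills masked pixels with 0.5."""
    return [0.5 if m else p for p, m in conditioned]


def stub_second_model(conditioned, noisy, step):
    """Stand-in for the second generative model: keeps known pixels unchanged."""
    return [n if m else p for (p, m), n in zip(conditioned, noisy)]


if __name__ == "__main__":
    result = two_stage_inpaint([0.1, 0.2, 0.3, 0.4], [0, 0, 1, 1],
                               stub_first_model, stub_second_model,
                               confidence=0.8)
    print(result)
```

Passing the models as callables keeps the sketch independent of any particular GAN or diffusion library. A strength of 0 leaves the intermediate image untouched, while a strength of 1 starts the second model from nearly pure noise.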
Priority Applications (3)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| EP24863182.2A EP4733997A4 (en) | 2023-09-04 | 2024-09-04 | METHOD FOR GENERATING A PARTIAL IMAGE REGION USING A GENERATE MODEL, AND ELECTRONIC DEVICE FOR ITS IMPLEMENTATION |
| CN202480056645.9A CN121889806A (zh) | 2023-09-04 | 2024-09-04 | Method for generating a partial region of an image by using a generative model, and electronic device for performing the method |
| US18/906,885 US20250078366A1 (en) | 2023-09-04 | 2024-10-04 | Method of generating partial area of image by using generative model and electronic device for performing the method |
Applications Claiming Priority (6)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| KR20230117238 | 2023-09-04 | ||
| KR10-2023-0117238 | 2023-09-04 | ||
| KR20230182370 | 2023-12-14 | ||
| KR10-2023-0182370 | 2023-12-14 | ||
| KR1020240006753A KR20250034864A (ko) | 2023-09-04 | 2024-01-16 | Method of generating partial area of image by using generative model and electronic device for performing the method |
| KR10-2024-0006753 | 2024-01-16 |
Related Child Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| US18/906,885 Continuation US20250078366A1 (en) | Method of generating partial area of image by using generative model and electronic device for performing the method | 2023-09-04 | 2024-10-04 |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2025053602A1 (ko) | 2025-03-13 |
Family
ID=94924365
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/KR2024/013315 Pending WO2025053602A1 (ko) | Method of generating partial area of image by using generative model and electronic device for performing the method | 2023-09-04 | 2024-09-04 |
Country Status (1)
| Country | Link |
|---|---|
| WO (1) | WO2025053602A1 (ko) |
Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
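| KR102260628B1 (ko) * | 2020-02-13 | 2021-06-03 | 이인현 | Image generation system and method using collaborative style transfer technology |
| KR102287407B1 (ko) * | 2020-12-18 | 2021-08-06 | Yeungnam University Industry-Academic Cooperation Foundation | Learning apparatus and method for image generation, and image generation apparatus and method |
| KR102479965B1 (ko) * | 2017-04-10 | 2022-12-20 | Samsung Electronics Co., Ltd. | Method and system for deep learning of image super-resolution |
| US20230095092A1 (en) * | 2021-09-30 | 2023-03-30 | Nvidia Corporation | Denoising diffusion generative adversarial networks |
| CN115908187A (zh) * | 2022-12-07 | 2023-04-04 | Beihang University | Image feature analysis and generation method based on a fast denoising diffusion probabilistic model |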
- 2024-09-04 WO PCT/KR2024/013315 patent/WO2025053602A1/ko active Pending
Patent Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
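| KR102479965B1 (ko) * | 2017-04-10 | 2022-12-20 | Samsung Electronics Co., Ltd. | Method and system for deep learning of image super-resolution |
| KR102260628B1 (ko) * | 2020-02-13 | 2021-06-03 | 이인현 | Image generation system and method using collaborative style transfer technology |
| KR102287407B1 (ko) * | 2020-12-18 | 2021-08-06 | Yeungnam University Industry-Academic Cooperation Foundation | Learning apparatus and method for image generation, and image generation apparatus and method |
| US20230095092A1 (en) * | 2021-09-30 | 2023-03-30 | Nvidia Corporation | Denoising diffusion generative adversarial networks |
| CN115908187A (zh) * | 2022-12-07 | 2023-04-04 | Beihang University | Image feature analysis and generation method based on a fast denoising diffusion probabilistic model |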
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| WO2020190112A1 (en) | | Method, apparatus, device and medium for generating captioning information of multimedia data |
| WO2020050499A1 (ko) | | Method for acquiring object information and apparatus for performing the same |
| WO2022154457A1 (en) | | Action localization method, device, electronic equipment, and computer-readable storage medium |
| EP3908943A1 (en) | | Method, apparatus, electronic device and computer readable storage medium for image searching |
| WO2018088794A2 (ko) | | Method for correcting an image by a device, and the device |
| WO2019135621A1 (ko) | | Video playback device and control method thereof |
| WO2018117619A1 (en) | | Display apparatus, content recognizing method thereof, and non-transitory computer readable recording medium |
| WO2021167210A1 (ko) | | Server, electronic device, and control methods thereof |
| WO2020153717A1 (en) | | Electronic device and controlling method of electronic device |
| WO2020091253A1 (ko) | | Electronic device and method for controlling the electronic device |
| WO2020017827A1 (ko) | | Electronic device and method for controlling the electronic device |
| WO2019203421A1 (ko) | | Display device and method for controlling the display device |
| EP3997623A1 (en) | | Electronic device and control method thereof |
| WO2024005513A1 (en) | | Image processing method, apparatus, electronic device and storage medium |
| WO2021040490A1 (en) | | Speech synthesis method and apparatus |
| WO2025053602A1 (ko) | | Method of generating partial area of image by using generative model and electronic device for performing the method |
| WO2025053605A1 (ko) | | Method for training an encoder that encodes an image, method for generating a partial region of an image by using a generative model, and electronic device for performing these methods |
| WO2022265467A1 (ko) | | Electronic device and method for detecting an object in an image |
| WO2025005544A1 (ko) | | Method and electronic device for providing a personalized image |
| WO2025005552A1 (ko) | | Method and electronic device for providing a customized screen |
| WO2023008678A1 (ko) | | Image processing device and operating method thereof |
| WO2025188066A1 (ko) | | Electronic device and operating method thereof |
| WO2022065561A1 (ko) | | Method and computer program for classifying the intent of a character string |
| WO2025063427A1 (en) | | Method and electronic device for image matting |
| WO2025230384A1 (ko) | | Method for image editing and electronic device for performing the same |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | EP: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 24863182; Country of ref document: EP; Kind code of ref document: A1 |
| | WWE | WIPO information: entry into national phase | Ref document number: 2024863182; Country of ref document: EP |
| | ENP | Entry into the national phase | Ref document number: 2024863182; Country of ref document: EP; Effective date: 20260120 |