WO2024191234A1 - Method and apparatus for processing an image - Google Patents

Method and apparatus for processing an image

Info

Publication number
WO2024191234A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
depth
incomplete
masked
original
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/KR2024/095121
Other languages
French (fr)
Inventor
Isaac Hisanao KASAHARA
Shubham Agrawal
Kazim Selim ENGIN
Nikhil Narsingh Chavan Dafle
Shuran Song
Ibrahim Volkan Isler
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Samsung Electronics Co Ltd
Original Assignee
Samsung Electronics Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Samsung Electronics Co Ltd
Priority to EP24771233.4A (EP4599405A4)
Priority to CN202480007064.6A (CN120677505A)
Publication of WO2024191234A1
Anticipated expiration
Legal status: Ceased

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00 Image enhancement or restoration
    • G06T5/60 Image enhancement or restoration using machine learning, e.g. neural networks
    • G PHYSICS
    • G05 CONTROLLING; REGULATING
    • G05D SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00 Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/60 Intended control result
    • G05D1/617 Safety or protection, e.g. defining protection zones around obstacles or avoiding hazards
    • G05D1/622 Obstacle avoidance
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T15/00 Three-dimensional [3D] image rendering
    • G06T15/10 Geometric effects
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T15/00 Three-dimensional [3D] image rendering
    • G06T15/10 Geometric effects
    • G06T15/40 Hidden part removal
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00 Three-dimensional [3D] modelling for computer graphics
    • G06T17/20 Finite element generation, e.g. wire-frame surface description, tesselation
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T19/00 Manipulating three-dimensional [3D] models or images for computer graphics
    • G06T19/20 Editing of three-dimensional [3D] images, e.g. changing shapes or colours, aligning objects or positioning parts
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00 Image enhancement or restoration
    • G06T5/77 Retouching; Inpainting; Scratch removal
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/10 Segmentation; Edge detection
    • G06T7/13 Edge detection
    • G PHYSICS
    • G05 CONTROLLING; REGULATING
    • G05D SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D2101/00 Details of software or hardware architectures used for the control of position
    • G05D2101/10 Details of software or hardware architectures used for the control of position using artificial intelligence [AI] techniques
    • G05D2101/15 Details of software or hardware architectures used for the control of position using artificial intelligence [AI] techniques using machine learning, e.g. neural networks
    • G PHYSICS
    • G05 CONTROLLING; REGULATING
    • G05D SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D2105/00 Specific applications of the controlled vehicles
    • G05D2105/10 Specific applications of the controlled vehicles for cleaning, vacuuming or polishing
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10024 Color image
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10028 Range image; Depth image; 3D point clouds
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20084 Artificial neural networks [ANN]
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2210/00 Indexing scheme for image generation or computer graphics
    • G06T2210/56 Particle system, point based geometry or rendering
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2219/00 Indexing scheme for manipulating 3D models or images for computer graphics
    • G06T2219/004 Annotating, labelling
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2219/00 Indexing scheme for manipulating 3D models or images for computer graphics
    • G06T2219/20 Indexing scheme for editing of 3D models
    • G06T2219/2012 Colour editing, changing, or manipulating; Use of colour codes
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2219/00 Indexing scheme for manipulating 3D models or images for computer graphics
    • G06T2219/20 Indexing scheme for editing of 3D models
    • G06T2219/2016 Rotation, translation, scaling

Definitions

  • the disclosure relates to a method for processing an image, and an apparatus for the same, and more particularly to a method for performing masking and inpainting for generalizable scene completion, and an apparatus for the same.
  • 3D structures of scenes may be important for many applications, for example robot navigation, planning, manipulation, and interaction. Improvements in 3D perception capabilities have accompanied the increasing availability of depth sensors on smartphones and robots. However, a complete and coherent reconstruction is challenging when only partial observation of the scene is available.
  • Scene completion is an important task which may allow for better robot action planning such as grasp planning, path planning, and long-horizon task planning. Scene completion may also be useful in contexts such as autonomous navigation and image generation for augmented reality (AR) and virtual reality (VR) devices.
  • AR augmented reality
  • VR virtual reality
  • a single view of the environment may capture only limited information of the scene, which presents a major challenge for scene completion.
  • Example embodiments address at least the above problems and/or disadvantages and other disadvantages not described above. Also, the example embodiments are not required to overcome the disadvantages described above, and may not overcome any of the problems described above.
  • a method for processing image data for scene completion may include obtaining an original image from an original viewpoint corresponding to a first direction, wherein the original image includes an object and a background, wherein a first surface of the object is an image of the object corresponding to the first direction.
  • the method may include receiving an original image from an original viewpoint corresponding to a first direction, wherein the original image includes an object and a background, wherein a first surface of the object is an image of the object corresponding to the first direction.
  • the method may include obtaining a first image from a new viewpoint corresponding to a second direction different from the first direction by rotating the original image based on 3-dimensional (3D) information generated from 2-dimensional (2D) information which is obtained from the original image.
  • the method may include determining an area within the first image for generating a second surface of the object based on depth information about a depth between the object and the background of the original image.
  • the method may include obtaining a second image by inputting the first image and the determined area to an artificial intelligence (AI) inpainting model, wherein the AI inpainting model generates the second surface of the object which occupies a portion of the determined area in the second image.
  • AI artificial intelligence
  • an electronic device for processing image data for scene completion may include at least one memory configured to store instructions.
  • the electronic device for processing image data for scene completion may include at least one processor configured to execute the instructions to receive an original image from an original viewpoint corresponding to a first direction, wherein the original image includes an object and a background, wherein a first surface of the object is an image of the object corresponding to the first direction.
  • the electronic device for processing image data for scene completion may include at least one processor configured to execute the instructions to obtain an original image from an original viewpoint corresponding to a first direction, wherein the original image includes an object and a background, wherein a first surface of the object is an image of the object corresponding to the first direction.
  • the electronic device for processing image data for scene completion may include at least one processor configured to execute the instructions to obtain a first image from a new viewpoint corresponding to a second direction different from the first direction by rotating the original image based on 3-dimensional (3D) information generated based on 2-dimensional (2D) information which is obtained from the original image.
  • the electronic device for processing image data for scene completion may include at least one processor configured to execute the instructions to determine an area within the first image for generating a second surface of the object based on depth information about a depth between the object and the background of the original image.
  • the electronic device for processing image data for scene completion may include at least one processor configured to execute the instructions to obtain a second image by inputting the first image and the determined area to an artificial intelligence (AI) inpainting model, wherein the AI inpainting model generates the second surface of the object which occupies a portion of the determined area in the second image.
  • FIG. 1 is a diagram showing a viewpoint module, according to an embodiment of the present disclosure.
  • FIG. 3B is a diagram illustrating a process for generating a merged point cloud, according to an embodiment of the present disclosure.
  • FIG. 5 is a diagram showing an example of generating a mask without using surface-aware masking, according to an embodiment of the present disclosure.
  • FIGS. 6A to 6C illustrate results of performing scene completion based on a mask generated according to FIG. 5, according to an embodiment of the present disclosure.
  • FIGS. 7A to 7D are diagrams showing an example of generating a mask using surface-aware masking, according to an embodiment of the present disclosure.
  • FIGS. 8A to 8C illustrate results of performing scene completion based on a mask generated according to FIGS. 7A to 7D, according to an embodiment of the present disclosure.
  • FIG. 9 is a flowchart illustrating a method of performing surface-aware masking for scene completion, according to an embodiment of the present disclosure.
  • FIGS. 10A to 10C show further examples of a surface-aware masking process, according to an embodiment of the present disclosure.
  • FIGS. 11A to 11C show further examples of a surface-aware masking process, according to an embodiment of the present disclosure.
  • FIGS. 12A and 12B are flowcharts illustrating use applications of scene completion methods, according to an embodiment of the present disclosure.
  • FIGS. 13A and 13B are flowcharts illustrating a method of processing an image to perform scene completion, according to an embodiment of the present disclosure.
  • FIG. 14 is a diagram of electronic devices for performing scene completion according to an embodiment of the present disclosure.
  • FIG. 15 is a diagram of components of one or more electronic devices of FIG. 12 according to an embodiment of the present disclosure.
  • The expression "at least one of a, b, and c" should be understood as including only a, only b, only c, both a and b, both a and c, both b and c, all of a, b, and c, or any variations of the aforementioned examples.
  • The term "module" is intended to be broadly construed as hardware, software, firmware, or any combination thereof.
  • Embodiments may relate to methods, systems, and apparatuses for performing scene completion.
  • Embodiments may provide a method, system, or apparatus which may obtain an input image of a scene, for example an RGB-D image, and may generate a completed 3D representation of the scene, for example a completed scene point cloud, which may include regions which are unobservable or occluded in the input image.
  • a point cloud may be a multidimensional set of points which represent at least one of an object and a space.
  • each point may represent geometric coordinates of a single point on a surface of an object, and may further represent information such as texture information and color information corresponding to the single point.
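  • As a minimal sketch of the point cloud described above, the following Python example stores one point per row, with geometric coordinates followed by color information. The (N, 6) column layout and the helper name make_point_cloud are assumptions made for illustration and are not taken from the disclosure.

```python
import numpy as np


def make_point_cloud(xyz, rgb):
    """Stack coordinates (N, 3) and colors (N, 3) into one (N, 6) array.

    Hypothetical layout: columns are x, y, z, r, g, b.
    """
    xyz = np.asarray(xyz, dtype=np.float64)
    rgb = np.asarray(rgb, dtype=np.float64)
    if xyz.ndim != 2 or xyz.shape[1] != 3 or xyz.shape != rgb.shape:
        raise ValueError("xyz and rgb must both have shape (N, 3)")
    return np.hstack([xyz, rgb])


# Two points on object surfaces, each with a position and a color.
cloud = make_point_cloud(
    xyz=[[0.0, 0.0, 1.0], [0.1, 0.0, 1.2]],
    rgb=[[255, 0, 0], [0, 255, 0]],
)
```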
  • The scene may include one or more objects and a background.
  • The completed scene point cloud may include both depth information and texture information about the scene, including the one or more objects and the background in the scene.
  • embodiments are not limited thereto.
  • embodiments may relate to any multidimensional representations of objects and spaces, for example mesh representations, voxel grid representations, implicit surface representations, distance field representations, and any other type of representation.
  • the reconstruction of the completed scene point cloud may be performed in two general steps, for example a step of scene view completion, and a step of lifting the scene from a two-dimensional representation to a three-dimensional representation.
  • an embodiment may apply the generalization capability of large language models to inpaint the missing areas of color images rendered from different viewpoints. Then, these inpainted images may be converted from two-dimensional (2D) images to three-dimensional (3D) representations, for example point clouds, by predicting per-pixel depth values using a combination of a trained network and depth information in the input image.
  • this lifting process may be referred to as deprojection.
  • an entire completed scene point cloud for a scene may be reconstructed based on a single image of the scene, for example a single RGB-D image. For example, based on the single image, the entire scene layout may be reconstructed in a globally-consistent fashion.
  • Some related-art methods may be confined to task-specific models which often do not generalize appropriately to distributions beyond the training data, which may limit their applicability.
  • an embodiment of the present disclosure may provide generalization to unseen scenes, objects, and categories by leveraging inpainted features.
  • An embodiment may utilize the generalizable aspects of machine learning (ML) and artificial intelligence (AI) models, for example visual language models (VLMs) for completing novel views and depth maps.
  • ML machine learning
  • VLMs visual language models
  • the present disclosure is not limited in this regard, and an embodiment may utilize other types of ML and AI models.
  • the integrated pipeline provided by an embodiment may be used for scene completion of unseen objects with occlusion and clutter.
  • the generalization capabilities of large VLMs with respect to 2D images may be leveraged to lift the information contained in the 2D images into 3D space for practical robotics applications. Accordingly, an embodiment may provide consistent scene completion in new environments, and with unseen objects.
  • FIG. 1 is an example of a diagram showing a viewpoint module for performing scene completion, according to an embodiment of the present disclosure.
  • a viewpoint module 100 may include an image rotation module 102, a surface-aware masking (SAM) module 104, an inpainting model 106, one or more depth estimation models 108, for example normal estimation model 108A and boundary estimation model 108B, a depth completion module 110, and a deprojection module 112.
  • SAM surface-aware masking
  • The viewpoint module 100 may obtain (e.g., receive, capture, download) an original image, for example an RGB-D image of a scene, as input, and may output one or more estimated point clouds, where N is the number of predicted points in the scene, and H and W denote dimensions of the RGB-D image.
  • The RGB-D image may include an input color image and an input depth image.
  • A color image may be referred to as an RGB image or a texture image, and the like.
  • The image rotation module 102, the SAM module 104, and the inpainting model 106 may be referred to as an inpainting pipeline, which may obtain the RGB-D image from an original viewpoint, and may output an incomplete depth image and an inpainted color image from a new viewpoint.
  • The original viewpoint may correspond to a view of the scene from a first direction.
  • The new viewpoint may correspond to a view of the scene from a second direction which is different from the first direction.
  • The one or more depth estimation models 108 and the depth completion module 110 may be referred to as a depth completion pipeline, which may obtain the incomplete depth image and the inpainted color image, and may output an estimated depth image.
  • The deprojection module 112 may generate 3D information about the scene based on 2D information which is obtained from the RGB-D image.
  • The 2D information may include at least one from among boundary information, texture information, color information, and depth information included in the RGB-D image.
  • The 3D information may include a 3D representation of the scene, for example a point cloud as discussed above.
  • The deprojection module 112 may obtain the inpainted color image and the estimated depth image, and may obtain an estimated point cloud corresponding to the new viewpoint.
  • a process of generating a 2D image from a 3D representation may be referred to as projecting the 2D image from the point cloud.
  • A process of generating a 3D representation such as a point cloud from a 2D image may be referred to as deprojecting the point cloud from the 2D image.
  • Given a depth image, which is a 2D image that has a depth value at every pixel, and given camera information used to capture the 2D image (for example, focal length), it may be possible to deproject each pixel using the camera information and the depth information at that 2D pixel location.
  • this may be similar to drawing a line or ray from the camera through the 2D pixel location, and placing a point along the line at a distance corresponding to the depth information for the pixel. If the depth image is available, then the deprojection may be performed without an algorithm or model. However, if no depth image is available, or only a partial depth image is available, an AI model such as the one or more depth estimation models 108 may be used to predict the depth image.
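  • The deprojection described above can be sketched with a pinhole camera model. This is an illustrative example, not the disclosed implementation: the function name deproject, the intrinsic parameters fx, fy, cx, and cy, and the convention that depth is measured along the optical axis are all assumptions.

```python
import numpy as np


def deproject(depth, fx, fy, cx, cy):
    """Deproject a depth image (H, W) into an (N, 3) array of 3D points.

    Assumes a pinhole camera with focal lengths fx, fy and principal point
    (cx, cy), and depth measured along the optical axis. Pixels with missing
    depth (NaN or non-positive values) are skipped.
    """
    depth = np.asarray(depth, dtype=np.float64)
    height, width = depth.shape
    v, u = np.mgrid[0:height, 0:width]  # v is the row index, u the column index
    valid = np.isfinite(depth) & (depth > 0)
    z = depth[valid]
    x = (u[valid] - cx) * z / fx  # cast a ray through pixel (u, v) ...
    y = (v[valid] - cy) * z / fy  # ... and place the point at depth z
    return np.stack([x, y, z], axis=1)


# A 2x2 depth image in which one pixel has no depth measurement.
points = deproject([[1.0, 2.0], [0.0, 4.0]], fx=1.0, fy=1.0, cx=0.0, cy=0.0)
```

  • When only a partial depth image is available, the skipped pixels are the ones a depth estimation model would have to predict before they can be deprojected.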
  • FIG. 2 is a flowchart illustrating a method 200 of processing an image to perform scene completion, according to an embodiment of the present disclosure.
  • one or more operations of the method 200 of FIG. 2 may be performed by or using the viewpoint module 100 and any of the elements included therein, and any other element described herein.
  • The incomplete color image and the incomplete depth image may be referred to as "incomplete" because they may be missing information about one or more areas of the scene which are obscured or occluded by an object in the deprojected point cloud.
  • For example, when the point cloud is rotated, some points in the rotated point cloud may correspond to occluded areas of the scene which are obscured by a surface of an object which is present in the RGB-D image.
  • The occluded areas of the scene may be regions which include at least one of a portion of a background of the original image (from the new viewpoint), and a portion of a surface of an object (from the new viewpoint).
  • This portion of the surface of the object may be referred to as an "object area". Therefore, when the rotated point cloud is used to generate a 2D image, this 2D image may also be missing information, and therefore may be referred to as an incomplete image. Because the rotation of the point cloud may correspond to changing the viewpoint, the incomplete color image and the incomplete depth image may correspond to a new viewpoint.
  • The incomplete color image and the incomplete depth image may be missing color information and depth information corresponding to areas of the scene which are occluded or otherwise not visible in the original RGB-D image.
  • The incomplete color image and the incomplete depth image may be referred to as, or included in, an incomplete RGB-D image.
  • A process for generating the incomplete RGB-D image from the new viewpoint based on information in the original image may be referred to as "rotating" the original image.
  • The process of deprojecting the image into the point cloud, rotating the deprojected point cloud, and reprojecting to render the incomplete color image and the incomplete depth image described above with respect to operations S202 and S203 may be referred to as "rotating" the original image.
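  • The rotate-and-reproject step described above can be sketched as follows. This is a simplified illustration under stated assumptions: the helper names rotation_about_y and reproject are hypothetical, the rotation is taken about the camera origin and the y axis only, projection uses a pinhole model, and a simple z-buffer keeps the nearest point per pixel. Pixels onto which no point projects keep a NaN depth, which is what makes the rendered images incomplete.

```python
import numpy as np


def rotation_about_y(angle_rad):
    """Rotation matrix for a rotation of angle_rad about the y axis."""
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])


def reproject(points, colors, rotation, fx, fy, cx, cy, height, width):
    """Rotate a colored point cloud and render color and depth images.

    Returns (color_img, depth_img); depth is NaN where no point projects.
    """
    rotated = np.asarray(points, dtype=np.float64) @ rotation.T
    colors = np.asarray(colors, dtype=np.float64)
    depth_img = np.full((height, width), np.nan)
    color_img = np.zeros((height, width, 3))
    for (x, y, z), color in zip(rotated, colors):
        if z <= 0:  # point is behind the camera
            continue
        u = int(round(fx * x / z + cx))
        v = int(round(fy * y / z + cy))
        if 0 <= u < width and 0 <= v < height:
            # z-buffer: keep only the point nearest to the camera
            if np.isnan(depth_img[v, u]) or z < depth_img[v, u]:
                depth_img[v, u] = z
                color_img[v, u] = color
    return color_img, depth_img
```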
  • The new viewpoint may be selected based on a context ratio, which may be determined based on Equation 1 below:
  • $$\text{context ratio} = \frac{\text{number of context pixels in an image}}{\text{number of all pixels in the image}}$$
  • In Equation 1 above, the numerator may denote a number of context pixels in an image, and the denominator may denote a number of all pixels in the image.
  • The context ratio may provide an indication about how accurately an inpainting model such as the inpainting model 106 may be able to fill in missing areas in an image. For example, a low value of the context ratio may indicate that many areas are unknown, and that an inpainting model may struggle to fill in missing areas, and a high value of the context ratio may indicate that an inpainting model may more easily fill in missing areas, but may only fill in limited information.
  • The image rotation module 102 may start from the original viewpoint, and may rotate the deprojected point cloud in various directions to various new viewpoints.
  • An image may be projected based on the rotated point cloud, and a context ratio of the projected image may be calculated.
  • When the context ratio of the projected image satisfies predetermined criteria, the corresponding viewpoint may be selected as the new viewpoint.
  • The predetermined criteria may be satisfied when the context ratio of a projected image is closest to a context threshold from among the context ratios of a plurality of projected images corresponding to a plurality of new viewpoints. This process may be repeated to obtain a plurality of evenly spaced new viewpoints, but embodiments are not limited thereto.
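  • A minimal sketch of the viewpoint selection described above is given below. It assumes, for illustration only, that a "context pixel" is a pixel onto which some point of the rotated point cloud was projected (a non-NaN entry in the projected depth image); the function names and the dictionary of candidate views are hypothetical.

```python
import numpy as np


def context_ratio(depth_img):
    """Fraction of pixels that carry projected information (non-NaN depth)."""
    depth_img = np.asarray(depth_img, dtype=np.float64)
    return float(np.count_nonzero(~np.isnan(depth_img))) / depth_img.size


def select_viewpoint(candidates, threshold):
    """Return the label of the candidate whose context ratio is closest to threshold.

    candidates maps a viewpoint label to the depth image projected from it.
    """
    return min(candidates, key=lambda k: abs(context_ratio(candidates[k]) - threshold))


nan = np.nan
views = {
    "a": np.array([[1.0, 1.0], [1.0, nan]]),  # context ratio 0.75
    "b": np.array([[1.0, nan], [1.0, nan]]),  # context ratio 0.50
    "c": np.array([[nan, nan], [1.0, nan]]),  # context ratio 0.25
}
chosen = select_viewpoint(views, threshold=0.6)
```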
  • preprocessing steps may be applied to increase the quality of the inpainting results.
  • the incomplete color image may be preprocessed to fill in relatively small holes which are produced as a result of the reprojecting described above.
  • a naive inpainting filter that works with relatively small areas of missing values may be applied.
  • the naive inpainting filter may be a general inpainting filter or inpainting model which is trained using a general image dataset that is not specific to the particular scene. Starting at boundaries of missing pixels, a weighted average of the nearest ground truth pixels may be determined. The naive inpainting filter may then work inward to fill larger holes.
  • the naive inpainting filter may be used to fill relatively small holes of missing information in order to produce a denser image that gives more context for the inpainting model 106.
  • the naive inpainting filter may produce unrealistic results for relatively large missing areas.
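  • The boundary-inward filling described above can be sketched for a single-channel image as follows. This is an illustrative stand-in, not the filter used in the disclosure: the function name naive_fill, the 8-pixel neighborhood, the inverse-distance weights, and the max_passes limit are all assumptions. Missing pixels are marked with NaN, and each pass fills only the missing pixels that touch at least one known pixel, so the fill proceeds from hole boundaries inward.

```python
import numpy as np


def naive_fill(image, max_passes=10):
    """Fill small NaN holes with a weighted average of known neighbors."""
    img = np.array(image, dtype=np.float64)
    height, width = img.shape
    d = 1.0 / np.sqrt(2.0)  # diagonal neighbors are farther, so weigh them less
    offsets = [(-1, 0, 1.0), (1, 0, 1.0), (0, -1, 1.0), (0, 1, 1.0),
               (-1, -1, d), (-1, 1, d), (1, -1, d), (1, 1, d)]
    for _ in range(max_passes):
        missing = np.argwhere(np.isnan(img))
        if missing.size == 0:
            break
        updated = img.copy()
        for r, c in missing:
            total = 0.0
            weight_sum = 0.0
            for dr, dc, weight in offsets:
                rr, cc = r + dr, c + dc
                if 0 <= rr < height and 0 <= cc < width and not np.isnan(img[rr, cc]):
                    total += weight * img[rr, cc]
                    weight_sum += weight
            if weight_sum > 0:  # only pixels on the hole boundary are filled
                updated[r, c] = total / weight_sum
        img = updated
    return img
```

  • Because every filled value is an average of its neighbors, large holes converge toward a smooth blur, which is consistent with the observation above that such a filter is unsuitable for large missing areas.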
  • The SAM module 104 may generate a mask which indicates the large missing areas.
  • The missing areas may include areas in which no pixel information is available when the original image is rotated. In an embodiment, even if there is pixel information available when the original image is rotated (some of which may correspond to the background), the SAM module 104 may determine that an area predicted as the surface area of the object should be masked. An example of a method for generating the mask is provided below with respect to FIGS. 4 to 9C.
  • The SAM module 104 may mask the incomplete color image to obtain a masked color image, and may mask the incomplete depth image to obtain a masked depth image.
  • The viewpoint module 100 may provide the masked color image, or for example the mask and the incomplete color image, to the inpainting model 106 to obtain an inpainted color image.
  • The inpainting model 106 may generate predicted image information corresponding to portions of the incomplete color image which are masked by the mask, and the inpainted color image may be generated by applying the predicted image information to the incomplete color image.
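  • Applying predicted image information to the masked portions of an image can be sketched as a per-pixel selection. The function name apply_inpainting and the array shapes are assumptions for illustration; the predicted array stands in for the output of an inpainting model, which is not implemented here.

```python
import numpy as np


def apply_inpainting(incomplete, predicted, mask):
    """Take predicted pixels where mask is True, and keep the rest unchanged.

    incomplete, predicted: (H, W, 3) color images; mask: (H, W) boolean array.
    """
    mask = np.asarray(mask, dtype=bool)
    return np.where(mask[..., None], predicted, incomplete)


incomplete = np.zeros((2, 2, 3))
predicted = np.ones((2, 2, 3))   # placeholder for the model's prediction
mask = [[True, False], [False, False]]
inpainted = apply_inpainting(incomplete, predicted, mask)
```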
  • the inpainted color image may be referred to as a predicted image.
  • the inpainting model 106 may be or may include an AI or ML model, for example at least one of a diffusion model and a VLM such as DALL-E 2.
  • the inpainting model 106 may obtain the masked color image and an input prompt P that describes the context of the original RGB-D image in words or text.
  • the prompt may include "household objects on a table”.
  • the prompt may include "room with carpet and furniture”.
  • the prompt may include any additional known information about the scene, such as "a baseball and glove on a table” if these objects are known to be on the table, or "top-down view of household objects on a table” if the viewpoint is known to be from a top-down perspective.
  • the additional known information may be at least one of information that was previously provided or confirmed by a user, information that is associated with the image such as information included in tags or metadata, and information obtained using image analysis or view analysis, for example using an image analysis algorithm or model.
  • the prompt may include any other information.
  • The original RGB-D image may be provided to an automatic captioning model, and the output of the automatic captioning model may be used as the prompt.
  • the output of the automatic captioning model may be a proposed prompt such as "household objects on a table". This output may be provided to the user, and the user may then revise or modify this proposed prompt to obtain a revised prompt.
  • the revised prompt may be "household objects such as a dish, cloth, cutlery, and a pot on a table", or "household objects such as drinking glasses and dinner plates on a white marble dining table” (in which text in italics indicates modifications to the proposed prompt which are input by the user).
  • The user may input an original prompt, and then, based on the output of the inpainting model 106, may modify the original prompt to obtain a revised prompt, and may request a new inpainted image to be generated based on the revised prompt.
  • The user may originally input "a baseball and glove on a table" as the original prompt.
  • The user may input a revised prompt such as "a baseball and a leather baseball glove on a wooden table" (in which text in italics indicates revisions to the original prompt which are input by the user).
  • A user may input any prompt as desired, for example to change the style of the original RGB-D image to another style.
  • The appearance or visual style of the original RGB-D image may be modified using a neural style transfer (NST) model, for example by modifying style features of the original RGB-D image while maintaining content features of the original RGB-D image.
  • NST neural style transfer
  • The inpainting model 106 may output the inpainted color image, which may contain estimated areas corresponding to areas of the incomplete color image which are masked by the mask.
  • The term "prompt" may refer to text used to initiate interaction with a generative model that generates images for electronic devices.
  • a prompt may include one or more words, phrases, and/or sentences.
  • the inpainting model 106 may be, may include, or may be similar to such a generative model.
  • a prompt may contain natural language text that carries various information that the generative model can use to generate images, such as context, intent, task, constraints, and more.
  • Electronic devices may process natural language text using natural language processing (NLP) models.
  • NLP natural language processing
  • prompts and revised prompts can be received from users.
  • electronic devices may receive text input from users, or they can receive voice input and perform automatic speech recognition (ASR) to convert the user's voice input into text.
  • ASR automatic speech recognition
  • the present disclosure is not limited in this regard, and electronic devices may receive other types of input from users.
  • prompts may be generated by electronic devices using various techniques, such as image captioning.
  • electronic devices can receive image input from users and extract text descriptions from the images.
  • prompts may be replaced with a similar expression that represents the same concept.
  • prompts can be replaced with terms like “input,” “user input,” “input phrase,” “user command,” “directive,” “starting sentence,” “task query,” “trigger sentence,” “message,” and others, not limited to the examples mentioned.
  • The inpainting model 106 may be used to generate multiple candidate inpainted color images based on the same masked color image. These candidate inpainted color images may then be compared against the input prompt P by encoding them into an embedding space, and the candidate inpainted color image having the highest similarity to the prompt may be chosen as the inpainted color image.
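  • The candidate selection described above can be sketched with cosine similarity between embedding vectors. The disclosure does not name an encoder or a similarity measure, so both are assumptions here: the embeddings are taken as given vectors from some hypothetical joint image-text encoder, and cosine similarity is one common choice. The vectors are assumed to be non-zero.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine similarity between two non-zero embedding vectors."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def pick_best_candidate(candidate_embeddings, prompt_embedding):
    """Return (index, scores) for the candidate most similar to the prompt."""
    scores = [cosine_similarity(e, prompt_embedding) for e in candidate_embeddings]
    return int(np.argmax(scores)), scores


# Toy 2D embeddings standing in for encoded candidate images and the prompt.
best_index, scores = pick_best_candidate(
    candidate_embeddings=[[0.0, 1.0], [1.0, 1.0], [2.0, 0.1]],
    prompt_embedding=[1.0, 0.0],
)
```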
  • the inpainted color image may be provided to one or more depth estimation models 108.
  • the one or more depth estimation models 108 may be ML or AI models.
  • the inpainted color image may be provided to the normal estimation model 108A, which may be trained to estimate normals.
  • the inpainted color image may be provided to the boundary estimation model 108B, which may be trained to estimate occlusion boundaries.
  • the one or more depth estimation models 108 may be trained or optimized for a specific category of scenes, for example a scene including objects on a tabletop, or a scene including a room to be vacuumed by a robotic vacuum cleaner.
  • an estimated normal(s) may be, for example, geometric normal(s).
  • “normal” or “geometric normal” may refer to a vector associated with a point on a surface of a 3D object in computer graphics and 3D computer modeling, and may represent a direction in which a surface is facing at each point on the surface (e.g., the direction that is perpendicular to a tangent plane of the surface at that point).
  • the depth completion module 110 may generate an estimated depth image based on the masked depth image and the output of the one or more depth estimation models. For example, depth information for areas with missing depths in the masked depth image may be computed by tracing along the estimated normal(s) from areas of known depth, and the estimated occlusion boundaries may act as barriers which the estimated normal(s) should not be traced across. As an example, in an embodiment a system of equations may be solved to minimize an error E , where E is defined according to Equation 2 below:
  • Equation 2 above may include a term which denotes the distance between the ground truth and estimated depth, a term which denotes the influences of nearby pixels to have similar depths, and a term which denotes the consistency of estimated depth and estimated normal values.
  • each of these terms may have a corresponding constant or weight value.
  • a further weight value may be applied to the estimated normal values based on the probability that a boundary is present.
  • the value of this further weight may be obtained based on the estimated occlusion boundaries discussed above.
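As a hedged sketch, one weighted-sum form consistent with the terms described above is shown below. All symbol names are assumptions introduced for illustration: $E_D$ for the distance between ground truth and estimated depth, $E_S$ for the smoothness between nearby pixels, $E_N$ for the consistency between estimated depth and estimated normals, $\lambda_D$, $\lambda_S$, $\lambda_N$ for the corresponding weights, and $B$ for the boundary-based weight on the normal term.

```latex
E = \lambda_D E_D + \lambda_S E_S + \lambda_N E_N B
```

Under this assumed form, a value of $B$ near zero at a likely occlusion boundary would suppress the normal term there, so that depth is not propagated across the boundary.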
  • the deprojection module 112 may generate an estimated point cloud corresponding to the viewpoint by deprojecting the inpainted color image and the estimated depth image .
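Deprojection of a color image and a depth image into a point cloud can be sketched as follows. This is a minimal illustration that assumes a pinhole camera model with intrinsics `fx`, `fy`, `cx`, `cy`; the camera model and all names are assumptions, not details taken from the disclosure.

```python
def deproject(depth, colors, fx, fy, cx, cy):
    """Deproject a depth image into colored 3D points (assumed pinhole model).

    depth  -- rows of depth values; zero or negative means missing
    colors -- rows of color tuples with the same layout as depth
    """
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:
                continue  # skip pixels with no depth
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append(((x, y, z), colors[v][u]))
    return points

# Toy 2x2 image with one missing depth value (hypothetical data).
depth = [[2.0, 0.0], [2.0, 4.0]]
colors = [[(255, 0, 0), (0, 0, 0)], [(0, 255, 0), (0, 0, 255)]]
cloud = deproject(depth, colors, fx=1.0, fy=1.0, cx=0.5, cy=0.5)
```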
  • the estimated point cloud may be a completed scene point cloud, and may be used to perform other tasks such as robot action planning, autonomous navigation, and image generation for AR devices and VR devices.
  • the method 200 may be performed multiple times based on multiple new viewpoints, and the resulting estimated point clouds may be merged to obtain the completed scene point cloud. An example of a merging process is described below with reference to FIGS. 3A-3B.
  • FIG. 3A is a diagram showing a scene completion system.
  • FIG. 3B is a diagram illustrating a process for generating a merged point cloud, according to an embodiment of the present disclosure.
  • a scene completion system 300 may include the viewpoint module 100 discussed above, and a merging module 302. The scene completion system 300 may obtain the RGB-D image as input, and may output a completed scene point cloud which is obtained based on multiple estimated point clouds.
  • the method 200 discussed above may be performed on the original RGB-D image by rotating the point cloud by angle
  • the method 200 may be performed again, this time rotating the point cloud by angle to obtain estimated point cloud corresponding to a viewpoint .
  • the method 200 may then be performed two more times by rotating the point cloud by and to obtain estimated point cloud corresponding to a viewpoint , and estimated point cloud corresponding to a viewpoint . Accordingly, as shown in FIG. 3A, four novel views of the scene are obtained, complete with RGB and depth information.
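Obtaining several novel viewpoints by rotating the point cloud can be sketched as follows. This is an illustration only: the rotation axis (vertical, through a pivot point) and the four angle values are assumptions, since the disclosure does not fix them in this passage.

```python
import math

def rotate_about_y(points, angle_rad, pivot=(0.0, 0.0, 0.0)):
    """Rotate 3D points about a vertical (y) axis through `pivot` (assumed axis)."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    px, _, pz = pivot
    rotated = []
    for x, y, z in points:
        dx, dz = x - px, z - pz
        rotated.append((px + c * dx + s * dz, y, pz - s * dx + c * dz))
    return rotated

# Four hypothetical rotation angles, one per novel viewpoint.
hypothetical_angles_deg = [-30.0, -15.0, 15.0, 30.0]
cloud = [(1.0, 0.0, 0.0), (0.0, 1.0, 2.0)]
views = [rotate_about_y(cloud, math.radians(a)) for a in hypothetical_angles_deg]
quarter_turn = rotate_about_y([(1.0, 0.0, 0.0)], math.pi / 2)[0]
```

Each rotated cloud in `views` would then be projected, masked, inpainted, and depth-completed as in the method 200.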
  • the merging module 302 may combine the estimated point clouds while enforcing consistency across them. For example, when inpainting real objects, completion of objects may be inconsistent, and hallucinated objects that are not in the original scene may be created by the inpainting model 106 and included in the inpainted color image.
  • the merging module 302 may compare the original point cloud and at least one of the estimated point clouds, may determine points which intersect among multiple point clouds, and may add the intersecting points to the merged point cloud, while discarding points which are present in only one point cloud.
  • the merged point cloud may only include points which are present in more than two point clouds, or points which are present in all of the point clouds.
  • the merging module 302 may discard points which do not directly intersect, or may only discard points which are not within a certain threshold distance from points in other point clouds.
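The threshold-based variant of the merging step can be sketched as follows. This is a brute-force illustration under stated assumptions: points are plain coordinate tuples, "support" is counted as the number of clouds containing a point within `threshold`, and the names `merge_point_clouds` and `min_support` are hypothetical.

```python
import math

def merge_point_clouds(clouds, threshold, min_support=2):
    """Keep points supported by at least `min_support` clouds.

    A point is supported by another cloud if that cloud contains a point
    within `threshold` of it. Near-duplicate points from different clouds
    are all kept; deduplication is left out of this sketch.
    """
    merged = []
    for i, cloud in enumerate(clouds):
        for p in cloud:
            support = 1  # the point's own cloud
            for j, other in enumerate(clouds):
                if j != i and any(math.dist(p, q) <= threshold for q in other):
                    support += 1
            if support >= min_support:
                merged.append(p)
    return merged

# Toy clouds: only the points near the origin appear in two clouds.
clouds = [[(0, 0, 0), (5, 5, 5)], [(0, 0, 0.01)], [(9, 9, 9)]]
merged = merge_point_clouds(clouds, threshold=0.05)
```

Raising `min_support` to the number of clouds corresponds to keeping only points present in all of the point clouds. For large clouds, a spatial index would be a more practical choice than this quadratic search.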
  • the merged point cloud may be a completed scene point cloud, and may be used to perform other tasks such as robot action planning, autonomous navigation, and image generation for AR devices and VR devices.
  • the method 200 may be performed multiple times based on multiple RGB-D images, and the resulting estimated point clouds may be merged to generate the completed scene point cloud.
  • the point cloud may be determined by deprojecting multiple RGB-D images, and the other steps of the method 200 may be performed based on the point cloud .
  • the method 200 may be performed based on the one or more additional or updated RGB-D images, and the resulting estimated point clouds may be merged with the previously-completed scene point cloud to obtain an updated point cloud.
  • FIG. 4 is a diagram showing an example configuration of the SAM module 104, according to an embodiment of the present disclosure.
  • the SAM module 104 may include a mask generation module 402, and an image masking module 404.
  • the mask generation module 402 may generate the mask , which may indicate areas to be inpainted by the inpainting model 106.
  • the inpainting model 106 may inadvertently use background pixels when performing inpainting on an occluded surface of an object.
  • an original RGB-D image may show a surface 502 of a foreground object, and background surfaces 504 and 506.
  • some inappropriate background pixels 508 from the background surfaces 504 and 506, which would not actually be visible from the viewpoint, may be inadvertently included in the incomplete color image, and may therefore be mistakenly used by the inpainting model 106 to perform inpainting corresponding to the object.
  • an image of a surface 510 in the inpainted color image may be generated based on the inappropriate background pixels.
  • FIG. 6A shows an example of an incomplete color image that shows background pixels which are inappropriately included in areas which would be covered by objects.
  • FIG. 6B shows a mask generated based on the incomplete color image of FIG. 6A.
  • FIG. 6C shows an example inpainted color image in which the inappropriate background pixels were used for inpainting.
  • the SAM module 104 may perform surface-aware masking.
  • the mask generation module 402 may generate a 3D mesh, which may for example have a shape of a frustum, based on the input color image and an input depth image , and may use this 3D mesh to generate the mask .
  • FIGS. 7A-7D show example operations which may be included in a surface-aware masking process, according to an embodiment of the present disclosure.
  • a ray may be cast from the viewpoint through each point in the deprojected point cloud .
  • the mask generation module 402 may perform this process for every ray to obtain an occlusion point cloud 702 which shows the potential space that could possibly be filled in the completed scene point cloud by objects corresponding to the surfaces.
  • the mask generation module 402 may convert this occlusion point cloud to the mesh 700, and when the point cloud is rotated to the new viewpoint, the mesh 700 may be rotated as well, as shown in FIG. 7D. Then, when the SAM module 104 projects the incomplete color image and the incomplete depth image from the rotated point cloud, the SAM module 104 may discard points which are occluded by the mesh 700. Accordingly, the incomplete color image may be prevented from including inappropriate pixels, as shown by the dashed boxes in FIG. 7D. For example, as can be seen in FIG. 7D, the incomplete color image does not include the inappropriate background pixels 508 shown in FIG. 5. After these pixels are discarded, the blank pixels in the incomplete color image and the incomplete depth image may be used as the mask. Based on the mask, an inpainted color image may be generated to include, for example, an image of a surface 704 in which the inappropriate background pixels are not included.
  • FIG. 8A shows an example of an incomplete color image in which surface-aware masking is performed according to the process described above with respect to FIGS. 7A to 7D.
  • the mesh 700 may prevent inappropriate pixels from being included in the incomplete color image .
  • FIG. 8B shows a mask generated based on the incomplete color image of FIG. 8A.
  • FIG. 8C shows an example inpainted color image in which the inappropriate background pixels are not included.
  • FIG. 9 is a flowchart illustrating a method 900 of performing surface-aware masking, according to an embodiment of the present disclosure.
  • one or more operations of the method 900 may correspond to the surface-aware masking process discussed above with respect to FIGS. 7A-7D.
  • the mask generation module 402 may generate a plurality of points which extend beyond a surface included in the original RGB-D image .
  • the mask generation module 402 may subsample pixels from a uniform grid in the input RGB-D image to obtain a set of points .
  • the mask generation module 402 may initialize an empty point set , and for every point in , may deproject the point to a point in the point cloud , and generate additional points which are then added to the point set .
  • the mask generation module 402 may add a predetermined number of additional points for each point , and the additional points may be equally spaced. In an embodiment, the number of additional points and the spacing therebetween may vary based on the scene.
  • the mask generation module 402 may use fewer points which are more closely spaced than would be used for a scene including a room to be vacuumed by a robotic vacuum cleaner.
  • embodiments are not limited thereto, and the number of additional points and the spacing therebetween may be determined in any manner.
  • the point set may correspond to the points shown in FIG. 7B.
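Generating the additional, equally spaced points behind each deprojected surface point can be sketched as follows. This is an illustration under stated assumptions: the viewpoint sits at the origin of the camera frame unless `origin` is given, and the names `extend_along_ray`, `num_extra`, and `spacing` are hypothetical.

```python
import math

def extend_along_ray(point, num_extra, spacing, origin=(0.0, 0.0, 0.0)):
    """Return `num_extra` points beyond `point` along the ray from `origin`."""
    direction = tuple(p - o for p, o in zip(point, origin))
    length = math.sqrt(sum(d * d for d in direction))
    if length == 0.0:
        return []  # the ray direction is undefined at the viewpoint itself
    unit = tuple(d / length for d in direction)
    return [
        tuple(p + u * spacing * k for p, u in zip(point, unit))
        for k in range(1, num_extra + 1)
    ]

def build_occlusion_points(surface_points, num_extra, spacing):
    """Collect the extra points for every deprojected surface point."""
    occlusion = []
    for p in surface_points:
        occlusion.extend(extend_along_ray(p, num_extra, spacing))
    return occlusion

extra = extend_along_ray((0.0, 0.0, 2.0), num_extra=3, spacing=0.5)
occlusion = build_occlusion_points([(0.0, 0.0, 2.0), (0.0, 0.0, 4.0)], 2, 1.0)
```

The values chosen for `num_extra` and `spacing` would depend on the scene, as noted above.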
  • the mask generation module 402 may generate a mesh based on the plurality of points.
  • the mesh may be generated by performing surface triangulation on the points in the point set .
  • this mesh may correspond to the mesh 700 discussed above.
  • the mask generation module 402 may set the pixel to one ("1") if the estimated depth for the pixel in the incomplete depth image is equal to zero ("0") or is otherwise not present, or if the estimated depth for the pixel in the incomplete depth image is greater than the depth indicated for the pixel by the depth map representing the mesh.
  • the pixels which are set to one ("1") may correspond to the masked areas and/or the points which are discarded when generating the masked color image and the masked depth image.
  • the incomplete depth image includes an estimated depth for a particular pixel that is greater than the depth indicated for that same pixel by the depth map, this may indicate that the pixel corresponds to an area of the scene that was occluded or obscured in the original RGB-D image by a surface corresponding to the mesh. Accordingly, information corresponding to that pixel in the incomplete depth image and in the incomplete color image may be determined to be unreliable, and the pixel may therefore be masked and/or discarded when the masked color image and the masked depth image are generated.
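The per-pixel masking rule described above can be sketched as follows. This is a minimal illustration that assumes both depth maps are given as rows of values, that a missing depth is encoded as zero or `None`, and that `None` in the mesh depth map means the mesh does not cover that pixel; these encodings and the function name are assumptions.

```python
def surface_aware_mask(incomplete_depth, mesh_depth):
    """Return a mask with 1 where a pixel should be masked, else 0."""
    mask = []
    for depth_row, mesh_row in zip(incomplete_depth, mesh_depth):
        mask_row = []
        for d, m in zip(depth_row, mesh_row):
            missing = d is None or d == 0
            # A pixel lying behind the mesh was occluded in the original view.
            behind_mesh = (not missing) and m is not None and d > m
            mask_row.append(1 if missing or behind_mesh else 0)
        mask.append(mask_row)
    return mask

# Toy 2x2 example (hypothetical depths).
incomplete_depth = [[0.0, 1.0], [3.0, 2.0]]
mesh_depth = [[1.5, 1.5], [1.5, None]]
mask = surface_aware_mask(incomplete_depth, mesh_depth)
```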
  • FIGS. 10A-10C and FIGS. 11A-11C show further examples of a surface-aware masking process, according to an embodiment of the present disclosure.
  • an original RGB-D image may show a surface 1011 of a first foreground object and a surface 1012, and background surfaces 1013, 1014 and 1015.
  • some inappropriate background pixels from the background surfaces 1013, 1014 and 1015, which would not actually be visible from the new viewpoints, may be inadvertently included in the incomplete color images, and may therefore be mistakenly used by the inpainting model 106 to perform inpainting corresponding to the object.
  • a ray may be cast from the viewpoint through each point in the deprojected point cloud .
  • once the ray has passed through its respective point, for example by passing through one of the surfaces 1011, 1012, 1013, 1014, and 1015, it may be used to generate a list of points along the ray from that depth onward.
  • the mask generation module 402 may perform this process for every ray to obtain an occlusion point cloud 1100 which shows the potential space that could possibly be filled in the completed scene point cloud by objects corresponding to the surfaces. As shown in FIGS.
  • the mask generation module 402 may convert this occlusion point cloud to the mesh 1101, and when the point cloud is rotated to the new viewpoints and , the mesh 1101 may be rotated as well. Then, when the SAM module 104 projects the incomplete color image and the incomplete depth image from the rotated point cloud, the SAM module 104 may discard points which are occluded by the mesh 1101. Accordingly, the incomplete color image may be prevented from including inappropriate pixels.
  • the mesh 700 and the mask corresponding to the new viewpoint may be generated based on depth information included in the original image , and then the incomplete color image and the incomplete depth image may be generated, for example by rotating and reprojecting the deprojected point cloud .
  • an embodiment described above may be used by at least one of an AR device and a VR device to perform scene completion of an environment surrounding a user in order to generate appropriate AR and VR images in anticipation of movements by the user.
  • an embodiment described above may be used to perform scene completion to reconstruct areas which are not immediately visible to the user, but which the user may wish to see later.
  • the completed scene point cloud may then be used to construct a plurality of potential AR/VR images to be displayed to the user, which may help to reduce latency in images provided to the user. Accordingly, images displayed by the AR device or the VR device may seamlessly transition according to a user's head movements.
  • FIG. 12A is a flowchart illustrating a method 1200A of performing scene completion in at least one of an AR device and a VR device, according to an embodiment of the present disclosure.
  • one or more operations of the method 1200A may be performed by or using at least one of the viewpoint module 100, the scene completion system 300, and any of the elements included therein, and any other element described herein.
  • the method 1200A may include obtaining an image corresponding to a current viewpoint of a user.
  • the image may correspond to the original RGB-D image described above.
  • the method 1200A may include performing scene completion to obtain a completed 3D representation of the environment of the user, for example a completed scene point cloud of a scene included in the environment.
  • the scene completion may correspond to any of the scene completion methods described above.
  • the method 1200A may include obtaining a plurality of potential AR/VR images corresponding to a plurality of potential viewpoints based on the completed point cloud.
  • the completed point cloud may correspond to at least one of the estimated point cloud and the merged point cloud described above.
  • the plurality of potential AR/VR images may be AR images or VR images which are generated based on the at least one of the estimated point cloud and the merged point cloud .
  • the plurality of potential AR/VR images may be or may include a potential AR image which presents information corresponding to objects in the environment of the user from the perspective of a viewpoint which the user has not yet viewed, or in an area which is hidden from the field of view of the user.
  • the plurality of potential AR/VR images may be or may include a potential VR image which corresponds to a portion of the environment from the perspective of a viewpoint which the user has not yet viewed, or in an area which is hidden from the field of view of the user.
  • the potential VR image may include a VR object, obstacle, or boundary which corresponds to a real object in a portion of the environment, from the perspective of a viewpoint which the user has not yet viewed.
  • the method 1200A may include, based on the user moving from a position corresponding to the current viewpoint to a position corresponding to a potential viewpoint, displaying a transition between a current AR/VR image and a potential AR/VR image to the user.
  • the current AR/VR image may be an AR or VR image corresponding to the current viewpoint of the user, and the potential AR/VR image may be selected from among the plurality of potential AR/VR images obtained in operation S1213. Accordingly, a seamless transition from the current AR/VR image may be provided by the plurality of AR/VR images.
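Selecting a pre-generated potential image when the user moves can be sketched as follows. This is an illustration only: it assumes each potential viewpoint is stored as a 3D position mapped to an identifier of its pre-generated image, and that the nearest stored viewpoint is the one selected. The selection criterion and all names are assumptions, not taken from the disclosure.

```python
import math

def nearest_potential_view(user_position, potential_views):
    """Return the pre-generated image whose viewpoint is closest to the user.

    potential_views -- dict mapping viewpoint position -> image identifier
    """
    best = min(potential_views, key=lambda vp: math.dist(vp, user_position))
    return potential_views[best]

# Hypothetical cache of pre-generated images keyed by viewpoint position.
potential_views = {
    (0.0, 0.0, 0.0): "current",
    (1.0, 0.0, 0.0): "right",
    (-1.0, 0.0, 0.0): "left",
}
selected = nearest_potential_view((0.8, 0.0, 0.1), potential_views)
```

Because the images are generated before the movement occurs, the lookup at display time is cheap, which is one way the latency reduction described above could be realized.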
  • an embodiment described above may be used to manipulate or generate images in a device such as at least one of an AR device, a VR device, a mobile device, a camera, and a computer such as a personal computer, a laptop computer, and a tablet computer.
  • an embodiment described above may be used to generate a completed 3D representation of a scene based on a 2D image captured by a camera or an application or other computer program, for example a camera application. Based on the completed 3D representation, a user may generate one or more 2D images from different viewpoints or directions.
  • the original image used to generate the completed 3D representation may correspond to only a portion of the 2D image.
  • one or more objects may be extracted from the 2D image, and an embodiment described above may be used to generate 3D representations of the one or more objects, and new 2D images of the one or more objects may be generated based on input received from a user.
  • the input from the user may be used to select new directions or viewpoints used to generate the 3D representation and the new 2D images.
  • the user input may correspond to a manipulation of the 3D representation, and the new 2D images may be generated based on the manipulation being stopped.
  • the user may provide an input such as a dragging gesture which may be used to rotate the 3D representation, and based on the dragging gesture being stopped, one or more new 2D images may be generated based on the rotated 3D representation.
  • one or more new directions or viewpoints may be predicted in advance, and corresponding new 2D images may be created in advance, and each time the user provides an input such as a dragging gesture, a corresponding 2D image may be displayed to the user.
  • an embodiment described above may be used to perform scene completion in order to assist with tasks performed by a robot.
  • an embodiment described above may be used to plan actions such as grasping for a robotic arm, or to plan movements by a robotic vacuum cleaner.
  • FIG. 12B is a flowchart illustrating a method 1200B of performing scene completion for a robot, according to an embodiment of the present disclosure.
  • one or more operations of the method 1200B may be performed by or using at least one of the viewpoint module 100, the scene completion system 300, and any of the elements included therein, and any other element described herein.
  • the method 1200B may include obtaining an image of an environment of the robot.
  • this image may correspond to the original RGB-D image described above.
  • the robot may include a robotic vacuum cleaner, and the environment may include a room which is to be vacuumed by the robotic vacuum cleaner.
  • the robot may include a drone device such as a flying drone, and the environment may include a scene including an object which is to be observed or picked up by the drone, or an area in which the drone is to place an object.
  • the robot may include a robotic arm, and the environment may include a tabletop scene which includes an object to be grasped by the robotic arm.
  • the present disclosure is not limited in this regard.
  • the method 1200B may include performing scene completion to obtain a completed 3D representation of the environment of the robot, for example a completed scene point cloud of a scene included in the environment.
  • the scene completion may correspond to any of the scene completion methods described above.
  • the completed 3D representation may include predicted areas which are hidden from view in the original RGB-D image.
  • the original RGB-D image may be captured from the perspective of a robotic vacuum cleaner with a limited vertical field of view, and these predicted areas may be an upper portion of the scene which is not visible to the robotic vacuum cleaner.
  • the original RGB-D image may be captured from the perspective of a drone device with a limited vertical field of view, and these predicted areas may be a lower portion of the scene which is not visible to the drone device.
  • the original RGB-D image may be captured from the perspective of a robotic arm with a limited horizontal field of view, and these predicted areas may be a left and/or right portion of the scene which is not visible to the robotic arm.
  • these are provided only as examples, and embodiments are not limited thereto.
  • the method 1200B may include planning a movement of the robot.
  • planning the movement may include planning a route to be taken by the robotic vacuum cleaner in order to vacuum the room.
  • planning the movement may include planning a movement to position the robotic arm to grasp the object.
  • the robot may determine a new viewpoint or a portion of the new viewpoint based on a desired rotation direction for the robot, and an embodiment described above may be used to generate a 2D image of the new viewpoint.
  • the robot may determine a portion of a viewpoint that it expects to see based on anticipating another aspect of the recognized object based on the desired rotation direction, and an embodiment described above may be used to generate an image of that portion.
  • planning the movement may include planning a movement based on the completed 3D representation.
  • FIG. 13A is a flowchart illustrating a method 1300A of performing scene completion, according to an embodiment of the present disclosure.
  • one or more operations of the method 1300A may be performed by or using at least one of the viewpoint module 100, the scene completion system 300, and any of the elements included therein, and any other element described herein.
  • the method 1300A may include obtaining an original image of a scene from an original viewpoint corresponding to a first direction, wherein the scene includes an object and a background, wherein a first surface of the object is an image of the object corresponding to the first direction.
  • the original image may correspond to the RGB-D image discussed above.
  • the first surface may correspond to the surface 502 in FIG. 5 and the surface 1011 in FIG. 10A.
  • the method 1300A may include obtaining a first image from a new viewpoint corresponding to a second direction different from the first direction by rotating the original image based on 3D information generated from 2D information which is obtained from the original image.
  • the 3D information may correspond to the deprojected point cloud discussed above.
  • the first image may correspond to the incomplete color image and the incomplete depth image, and the new viewpoint may correspond to the new viewpoint discussed above.
  • the method 1300A may include determining an area within the first image for generating a second surface of the object based on depth information about a depth between the object and the background of the original image.
  • the area may correspond to the inappropriate background pixels 508 in FIG. 5 as discussed above.
  • the area within the first image for generating the second surface of the object may correspond to at least one of the mesh 700 and the mesh 1101 discussed above, and may correspond to the masked area produced by the SAM module 104, which performs surface-aware masking.
  • the area within the first image may correspond to at least a portion of the background of the original image.
  • the area may be inadvertently considered to be included in a surface of the object, and the AI inpainting model may therefore inadvertently use inappropriate background pixels when inpainting the area within the first image, as discussed above with respect to FIGS. 6A to 6C.
  • this area may be one or more areas as discussed above with respect to FIGS. 10A-10C and FIGS. 11A-11C.
  • the area may be masked, and the masked area may be provided to the AI inpainting model. Accordingly, the masked area within the first image may be considered to be the surface of the object and may not be considered to be a background area. Therefore, the masked area may be inpainted as a surface of the object.
  • the method 1300A may include generating a second image by inputting the first image and the determined area to an AI inpainting model, wherein the AI inpainting model generates the second surface of the object which occupies a portion of the determined area in the second image.
  • the second image may correspond to the inpainted image discussed above.
  • the second surface of the object may correspond to the surface 704 in FIG. 7D.
  • the area in the second image may correspond to the second surface of the object.
  • the second image and the second surface of the object may correspond to a second direction different from the first direction.
  • FIG. 13B is a flowchart illustrating a method 1300B of performing scene completion, according to an embodiment of the present disclosure.
  • one or more operations of the method 1300B may be performed by or using at least one of the viewpoint module 100, the scene completion system 300, and any of the elements included therein, and any other element described herein.
  • the method 1300B may include obtaining an original image of a scene from an original viewpoint corresponding to a first direction, wherein the scene includes an object and a background, wherein a first surface of the object is an image of the object corresponding to the first direction.
  • the original image may correspond to the RGB-D image discussed above.
  • the first surface may correspond to the surface 502 in FIG. 5 and the surface 1011 in FIG. 10A as discussed above.
  • the method 1300B may include determining an area for generating a second surface of the object based on depth information about a depth between the object and the background of the original image.
  • the area may correspond to the inappropriate background pixels 508 in FIG. 5 as discussed above.
  • the area within the first image for generating the second surface of the object may correspond to at least one of the mesh 700 and the mesh 1101, and may correspond to the masked area produced by the SAM module 104, which performs surface-aware masking.
  • the area may correspond to at least a portion of the background of the original image.
  • the area may be inadvertently considered to be included in a surface of the object, and the AI inpainting model may therefore inadvertently use inappropriate background pixels when performing inpainting on the area as discussed above with respect to FIGS. 6A to 6C.
  • this area may be one or more areas as discussed above with respect to FIGS. 10A-10C and FIGS. 11A-11C.
  • the area may be masked, and the masked area may be provided to the AI inpainting model. Accordingly, the masked area may be considered to be the surface of the object and may not be considered to be a background area. Therefore, the masked area may be inpainted as a surface of the object.
  • the method 1300B may include obtaining a first image from a new viewpoint corresponding to a second direction different from the first direction by rotating the original image based on 3D information generated from 2D information which is obtained from the original image.
  • the 3D information may correspond to the deprojected point cloud discussed above.
  • the first image may correspond to the incomplete color image and the incomplete depth image, and the new viewpoint may correspond to the new viewpoint discussed above.
  • the method 1300B may include generating a second image by inputting the first image and the determined area to an AI inpainting model, wherein the AI inpainting model generates the second surface of the object which occupies a portion of the determined area in the second image.
  • the second image may correspond to the inpainted image discussed above.
  • the second surface of the object may correspond to the surface 704 in FIG. 7D.
  • the area in the second image may correspond to the second surface of the object.
  • the second image and the second surface of the object may correspond to a second direction different from the first direction.
  • FIG. 14 is a diagram of devices for performing a scene completion task according to an embodiment.
  • FIG. 14 includes a user device 1410, a server 1420, and a communication network 1430.
  • the user device 1410 and the server 1420 may interconnect via wired connections, wireless connections, or a combination of wired and wireless connections.
  • the user device 1410 may include one or more devices (e.g., a processor 1411 and a data storage 1412) configured to retrieve an image corresponding to a search query.
  • the user device 1410 may include a computing device (e.g., a desktop computer, a laptop computer, a tablet computer, a handheld computer, a smart speaker, a server, etc.), a mobile phone (e.g., a smart phone, a radiotelephone, etc.), a camera device, a wearable device (e.g., a pair of smart glasses, a smart watch, etc.), a home appliance (e.g., a robot vacuum cleaner, a smart refrigerator, etc.), or a similar device.
  • the data storage 1412 of the user device 1410 may include one or more of the viewpoint module 100 and the scene completion system 300, or any of the elements included therein.
  • the user device 1410 may store one or more of the viewpoint module 100 and the scene completion system 300, or any of the elements included therein, or vice versa.
  • the server 1420 may include one or more devices (e.g., a processor 1421 and a data storage 1422) configured to implement one or more of the viewpoint module 100 and the scene completion system 300, or any of the elements included therein.
  • the data storage 1422 of the server 1420 may include one or more of the viewpoint module 100 and the scene completion system 300, or any of the elements included therein.
  • the user device 1410 may store one or more of the viewpoint module 100 and the scene completion system 300, or any of the elements included therein.
  • the communication network 1430 may include one or more wired and/or wireless networks.
  • the communication network 1430 may include a cellular network, a public land mobile network (PLMN), a local area network (LAN), a wide area network (WAN), a metropolitan area network (MAN), a telephone network (e.g., the Public Switched Telephone Network (PSTN)), a private network, an ad hoc network, an intranet, the Internet, a fiber optic-based network, or the like, and/or a combination of these or other types of networks.
  • the number and arrangement of devices and networks shown in FIG. 14 are provided as an example. In practice, there may be additional devices and/or networks, fewer devices and/or networks, different devices and/or networks, or differently arranged devices and/or networks than those shown in FIG. 14. Furthermore, two or more devices shown in FIG. 14 may be implemented within a single device, or a single device shown in FIG. 14 may be implemented as multiple, distributed devices. Additionally, or alternatively, a set of devices (e.g., one or more devices) may perform one or more functions described as being performed by another set of devices.
  • FIG. 15 is a diagram of components of one or more electronic devices of FIG. 14 according to an embodiment.
  • An electronic device 1500 in FIG. 15 may correspond to the user device 1410 and/or the server 1420.
  • FIG. 15 is for illustration only, and other embodiments of the electronic device 1500 could be used without departing from the scope of this disclosure.
  • the electronic device 1500 may correspond to a client device or a server.
  • the electronic device 1500 includes a bus 1510, a processor 1520, a memory 1530, an interface 1540, and a display 1550.
  • the bus 1510 includes a circuit for connecting the components 1520 to 1550 with one another.
  • the bus 1510 functions as a communication system for transferring data between the components 1520 to 1550 or between electronic devices.
  • the bus 1510 may be a communication bus, a cross-over bar, a network, or the like.
  • although the bus 1510 is depicted as a single line in FIG. 15, the bus 1510 may be implemented using multiple (e.g., two or more) connections between the set of components of the electronic device 1500. The present disclosure is not limited in this regard.
  • the processor 1520 includes one or more of a central processing unit (CPU), a graphics processing unit (GPU), an accelerated processing unit (APU), a many integrated core (MIC), a field-programmable gate array (FPGA), or a digital signal processor (DSP).
  • the processor 1520 is able to perform control of any one or any combination of the other components of the electronic device 1500, and/or perform an operation or data processing relating to communication. For example, the processor 1520 may perform the methods discussed above.
  • the processor 1520 executes one or more programs stored in the memory 1530.
  • the memory 1530 may include a volatile and/or non-volatile memory.
  • the memory 1530 may include volatile memory such as, but not limited to, random access memory (RAM), static RAM (SRAM), dynamic RAM (DRAM), and the like.
  • the memory 1530 may include non-volatile memory such as, but not limited to, read only memory (ROM), electrically erasable programmable ROM (EEPROM), NAND flash memory, phase-change RAM (PRAM), magnetic RAM (MRAM), resistive RAM (RRAM), ferroelectric RAM (FRAM), magnetic memory, optical memory, and the like.
  • the present disclosure is not limited in this regard, and the memory 1530 may include other types of dynamic and/or static memory storage.
  • the memory 1530 may store information and/or instructions for use (e.g., execution) by the processor 1520.
  • the memory 1530 stores information, such as one or more of commands, data, programs (one or more instructions), application(s) 1534, etc., which are related to at least one other component of the electronic device 1500 and for driving and controlling the electronic device 1500.
  • commands and/or data may formulate an operating system (OS) 1532.
  • Information stored in the memory 1530 may be executed by the processor 1520.
  • the application(s) 1534 may include the above-discussed embodiments. These functions can be performed by a single application or by multiple applications that each carry out one or more of these functions.
  • the application(s) 1534 may include an artificial intelligence (AI) model for performing the methods discussed above.
  • the display 1550 may include, for example, a liquid crystal display (LCD), a light emitting diode (LED) display, an organic light emitting diode (OLED) display, a quantum-dot light emitting diode (QLED) display, a microelectromechanical systems (MEMS) display, or an electronic paper display.
  • the display 1550 can also be a depth-aware display, such as a multi-focal display.
  • the display 1550 is able to present, for example, various contents, such as text, images, videos, icons, and symbols.
  • the communication interface 1544 may enable communication between the electronic device 1500 and other external devices, via a wired connection, a wireless connection, or a combination of wired and wireless connections.
  • the communication interface 1544 may permit the electronic device 1500 to obtain information from another device and/or provide information to another device.
  • the communication interface 1544 may include an Ethernet interface, an optical interface, a coaxial interface, an infrared interface, a radio frequency (RF) interface, a universal serial bus (USB) interface, a Wi-Fi interface, a cellular network interface, or the like.
  • the communication interface 1544 may obtain videos and/or video frames from an external device, such as a server.
  • the sensor(s) 1546 of the interface 1540 can meter a physical quantity or detect an activation state of the electronic device 1500 and convert metered or detected information into an electrical signal.
  • the sensor(s) 1546 can include one or more cameras or other imaging sensors for capturing images of scenes.
  • the sensor(s) 1546 can also include any one or any combination of a microphone, a keyboard, a mouse, and one or more buttons for touch input.
  • the sensor(s) 1546 can further include an inertial measurement unit.
  • the sensor(s) 1546 can include a control circuit for controlling at least one of the sensors included herein. Any of these sensor(s) 1546 can be located within or coupled to the electronic device 1500.
  • the sensor(s) 1546 may obtain a text and/or a voice signal that contains one or more queries.
  • the scene completion processes and methods described above may be written as computer-executable programs or instructions that may be stored in a medium.
  • the medium may continuously store the computer-executable programs or instructions, or temporarily store the computer-executable programs or instructions for execution or downloading.
  • the medium may be any one of various recording media or storage media in which a single piece or plurality of pieces of hardware are combined, and the medium is not limited to a medium directly connected to electronic device 1500, but may be distributed on a network.
  • Examples of the medium include magnetic media, such as a hard disk, a floppy disk, and a magnetic tape, optical recording media, such as CD-ROM and DVD, magneto-optical media such as a floptical disk, and ROM, RAM, and a flash memory, which are configured to store program instructions.
  • Other examples of the medium include recording media and storage media managed by application stores distributing applications or by websites, servers, and the like supplying or distributing other various types of software.
  • a computer program product may include a product (for example, a downloadable application) in a form of a software program electronically distributed through a manufacturer or an electronic market. For electronic distribution, at least a part of the software program may be stored in a storage medium or may be temporarily generated.
  • a model related to the neural networks described above may be implemented via a software module.
  • when the model is implemented via a software module (for example, a program module including instructions), the model may be stored in a computer-readable recording medium.
  • the model may be a part of the electronic device 1500 described above by being integrated in a form of a hardware chip.
  • the model may be manufactured in a form of a dedicated hardware chip for artificial intelligence, or may be manufactured as a part of an existing general-purpose processor (for example, a CPU or application processor) or a graphic-dedicated processor (for example a GPU).
  • the model may be provided in a form of downloadable software.
  • the software program may be stored in a storage medium or may be temporarily generated.
  • the storage medium may be a server of the manufacturer or electronic market, or a storage medium of a relay server.
  • a method may include rendering an incomplete color image and an incomplete depth image corresponding to the new viewpoint based on the 3D information.
  • the method may include masking a portion of the incomplete color image based on the 3D information and the incomplete depth image to obtain a masked color image, wherein the masked portion of the incomplete color image corresponds to the determined area and indicates that the masked portion of the incomplete color image is obscured by the object when the scene is viewed from the new viewpoint.
  • the method may include inpainting the masked color image to obtain the second image.
  • the obtaining of the second image may include inpainting the masked color image based on the AI inpainting model.
  • the method may include obtaining an image caption by providing the second image to an AI caption model.
  • the method may include determining whether to re-inpaint the second image by comparing an embedding of the image caption and an embedding of the prompt.
  • the method may include masking a portion of the incomplete depth image based on the 3D information and the incomplete depth image to obtain a masked depth image, wherein the masked portion of the incomplete depth image corresponds to the determined area.
  • the method may include providing the second image to an AI depth estimation model.
  • the method may include generating an estimated depth image based on the masked depth image and an output of the AI depth estimation model.
  • the method may include generating a completed 3D representation based on the second image and the estimated depth image.
  • the generating of the estimated depth image may include obtaining at least one estimated normal and at least one estimated occlusion boundary by providing the second image to the AI depth estimation model.
  • the generating of the estimated depth image may include obtaining the estimated depth image based on the incomplete depth image, the at least one estimated normal, and the at least one estimated occlusion boundary.
  • the method may include rendering a plurality of incomplete color images and a plurality of incomplete depth images from a plurality of new viewpoints based on the 3D information.
  • the method may include masking the plurality of incomplete color images to obtain a plurality of masked color images, and masking the plurality of incomplete depth images to obtain a plurality of masked depth images.
  • the method may include obtaining a plurality of second images by providing the plurality of masked color images to the AI inpainting model.
  • the method may include providing the plurality of second images to the AI depth estimation model.
  • the method may include obtaining a plurality of estimated depth images based on the plurality of masked depth images and a plurality of outputs of the AI depth estimation model, wherein the completed 3D representation is generated based on the plurality of second images and the plurality of estimated depth images.
  • the generating of the completed 3D representation may include generating a plurality of estimated point clouds based on the second image, the estimated depth image, the plurality of second images, and the plurality of estimated depth images.
  • the method may include merging the plurality of estimated point clouds by discarding points which are not included in at least two estimated point clouds from among the plurality of estimated point clouds to obtain a completed scene point cloud representing the scene.
  • the masking may include generating a plurality of points which extend beyond a surface included in the original image.
  • the masking may include generating a mesh based on the plurality of points.
  • the masking may include rendering a depth map representing the mesh from the new viewpoint.
  • the masking may include generating a mask based on a comparison between the incomplete depth image and the depth map.
  • the masking may include applying the mask to the incomplete color image.
  • the mask may indicate a plurality of pixels which are not used for generating the second image.
  • the plurality of pixels may include a first plurality of pixels for which a depth is not indicated by the incomplete depth image, and a second plurality of pixels for which a depth indicated by the incomplete depth image is greater than a depth indicated by the depth map.
  • the original image may be captured by at least one of an augmented reality (AR) device and a virtual reality (VR) device.
  • the original viewpoint may comprise a current viewpoint of a user, and the original image may correspond to a current AR/VR image displayed to the user.
  • the method may include obtaining a completed 3D representation of the scene based on the second image.
  • the method may include obtaining a potential AR/VR image based on the completed 3D representation, wherein the potential AR/VR image corresponds to a potential viewpoint of the user.
  • the method may include, based on the user moving from a position corresponding to the current viewpoint to a position corresponding to the potential viewpoint, displaying a transition between the current AR/VR image and the potential AR/VR image to the user.
  • the original image may be captured by a robot.
  • the method may include planning a movement path for the robot based on the second image.
  • an electronic device may include at least one processor configured to execute the instructions to render an incomplete color image and an incomplete depth image corresponding to the new viewpoint based on the 3D information.
  • the electronic device may include at least one processor configured to execute the instructions to mask a portion of the incomplete color image based on the 3D information and the incomplete depth image to obtain a masked color image, wherein the masked portion of the incomplete color image corresponds to the determined area and indicates that the masked portion of the incomplete color image is obscured by the object when the scene is viewed from the new viewpoint.
  • the electronic device may include at least one processor configured to execute the instructions to inpaint the masked color image to obtain the second image.
  • to inpaint the masked color image, the at least one processor may be configured to execute the instructions to inpaint the masked color image based on the AI inpainting model to obtain the second image.
  • the electronic device may include at least one processor configured to execute the instructions to obtain an image caption by providing the second image to an AI caption model.
  • the electronic device may include at least one processor configured to execute the instructions to determine whether to re-inpaint the second image by comparing an embedding of the image caption and an embedding of the prompt.
  • the electronic device may include at least one processor configured to execute the instructions to mask a portion of the incomplete depth image based on the 3D information and the incomplete depth image to obtain a masked depth image, wherein the masked portion of the incomplete depth image corresponds to the determined area.
  • the electronic device may include at least one processor configured to execute the instructions to provide the second image to an AI depth estimation model.
  • the electronic device may include at least one processor configured to execute the instructions to generate an estimated depth image based on the masked depth image and an output of the AI depth estimation model.
  • the electronic device may include at least one processor configured to execute the instructions to generate a completed 3D representation based on the second image and the estimated depth image.
  • the electronic device may include at least one processor configured to execute the instructions to obtain at least one estimated normal and at least one estimated occlusion boundary by providing the second image to the AI depth estimation model.
  • to generate the estimated depth image, the at least one processor may be configured to execute the instructions to obtain the estimated depth image based on the incomplete depth image, the at least one estimated normal, and the at least one estimated occlusion boundary.
  • the electronic device may include at least one processor configured to execute the instructions to render a plurality of incomplete color images and a plurality of incomplete depth images from a plurality of new viewpoints based on the 3D information.
  • the electronic device may include at least one processor configured to execute the instructions to mask the plurality of incomplete color images to obtain a plurality of masked color images, and to mask the plurality of incomplete depth images to obtain a plurality of masked depth images.
  • the electronic device may include at least one processor configured to execute the instructions to obtain a plurality of second images by providing the plurality of masked color images to the AI inpainting model.
  • the electronic device may include at least one processor configured to execute the instructions to provide the plurality of second images to the AI depth estimation model.
  • the electronic device may include at least one processor configured to execute the instructions to obtain a plurality of estimated depth images based on the plurality of masked depth images and a plurality of outputs of the AI depth estimation model.
  • the completed 3D representation may be generated based on the plurality of second images and the plurality of estimated depth images.
  • the electronic device may include at least one processor configured to execute the instructions to generate a plurality of estimated point clouds based on the second image, the estimated depth image, the plurality of second images, and the plurality of estimated depth images.
  • to generate the completed 3D representation, the at least one processor may be configured to execute the instructions to merge the plurality of estimated point clouds by discarding points which are not included in at least two estimated point clouds from among the plurality of estimated point clouds.
  • to mask the incomplete color image, the at least one processor may be configured to execute the instructions to generate a plurality of points which extend beyond a surface included in the original image.
  • to mask the incomplete color image, the at least one processor may be configured to execute the instructions to generate a mesh based on the plurality of points.
  • to mask the incomplete color image, the at least one processor may be configured to execute the instructions to render a depth map representing the mesh from the new viewpoint.
  • to mask the incomplete color image, the at least one processor may be configured to execute the instructions to generate a mask based on a comparison between the incomplete depth image and the depth map.
  • to mask the incomplete color image, the at least one processor may be configured to execute the instructions to apply the mask to the incomplete color image.
  • the mask may indicate a plurality of pixels which are not used for generating the second image.
  • the plurality of pixels may include a first plurality of pixels for which a depth is not indicated by the incomplete depth image, and a second plurality of pixels for which a depth indicated by the incomplete depth image is greater than a depth indicated by the depth map.
  • the original image may be captured by at least one of an augmented reality (AR) device and a virtual reality (VR) device.
  • the original viewpoint may comprise a current viewpoint of a user, and the original image may correspond to a current AR/VR image displayed to the user.
  • the electronic device may include at least one processor which is configured to execute the instructions to obtain a completed 3D representation of the scene based on the second image.
  • the electronic device may include at least one processor which is configured to execute the instructions to obtain a potential AR/VR image based on the completed 3D representation, wherein the potential AR/VR image corresponds to a potential viewpoint of the user.
  • the electronic device may include at least one processor which is configured to execute the instructions, based on the user moving from a position corresponding to the current viewpoint to a position corresponding to the potential viewpoint, to display a transition between the current AR/VR image and the potential AR/VR image to the user.
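As a non-limiting illustration, two of the operations listed above can be sketched in a few lines: the mask rule (a pixel is masked when the incomplete depth image indicates no depth, or indicates a depth greater than the depth map rendered from the mesh) and the merging rule (a point is kept only when it is included in at least two estimated point clouds). The function names, the use of `None` for a missing depth, and the voxel-grid test for deciding that two points coincide are assumptions made for this sketch and are not taken from the disclosure.

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple

Depth = Optional[float]            # None marks a pixel with no rendered depth
Point = Tuple[float, float, float]
Cell = Tuple[int, int, int]


def surface_aware_mask(incomplete_depth: Sequence[Sequence[Depth]],
                       surface_depth: Sequence[Sequence[Depth]]) -> List[List[bool]]:
    """Return a mask in which True marks pixels left for the inpainting model.

    A pixel is masked when its depth is missing, or when its depth is greater
    than the depth of the rendered mesh (the depth map) at the same pixel.
    """
    mask = []
    for depth_row, surface_row in zip(incomplete_depth, surface_depth):
        mask_row = []
        for d, s in zip(depth_row, surface_row):
            missing = d is None
            behind_surface = d is not None and s is not None and d > s
            mask_row.append(missing or behind_surface)
        mask.append(mask_row)
    return mask


def merge_point_clouds(clouds: Sequence[Sequence[Point]],
                       voxel: float = 0.5,
                       min_clouds: int = 2) -> List[Cell]:
    """Keep only voxel cells occupied by at least `min_clouds` point clouds.

    Assumption: two points count as the same point when they fall into the
    same voxel; the result is the sorted list of surviving voxel indices.
    """
    votes: Dict[Cell, int] = {}
    for cloud in clouds:
        # Each cloud votes at most once per cell.
        cells = {tuple(int(math.floor(c / voxel)) for c in p) for p in cloud}
        for cell in cells:
            votes[cell] = votes.get(cell, 0) + 1
    return sorted(cell for cell, n in votes.items() if n >= min_clouds)
```

In this sketch the mask is applied to the incomplete color image by treating every `True` pixel as unknown. The voxel size trades robustness to small depth errors against the risk of fusing distinct surfaces.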

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Software Systems (AREA)
  • Architecture (AREA)
  • Computer Hardware Design (AREA)
  • General Engineering & Computer Science (AREA)
  • Remote Sensing (AREA)
  • Automation & Control Theory (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Image Generation (AREA)
  • Processing Or Creating Images (AREA)

Abstract

Methods and devices for processing image data for scene completion, including obtaining an original image from an original viewpoint corresponding to a first direction, wherein the original image includes an object; obtaining a first image from a new viewpoint corresponding to a second direction by rotating the original image based on 3-dimensional information generated from 2-dimensional information which is obtained from the original image; determining an area within the first image for generating a second surface of the object based on depth information about a depth between the object and the background of the original image, wherein the determined area is expected to include an object area; and obtaining a second image by inputting the first image and the determined area to an AI inpainting model, wherein the AI inpainting model generates the second surface of the object which occupies a portion of the determined area in the second image.

Description

METHOD AND APPARATUS FOR PROCESSING AN IMAGE
The disclosure relates to a method for processing an image, and an apparatus for the same, and more particularly to a method for performing masking and inpainting for generalizable scene completion, and an apparatus for the same.
Building three-dimensional (3D) structures of scenes may be important for many applications, for example robot navigation, planning, manipulation, and interaction. Improvements in 3D perception capabilities have accompanied the increasing availability of depth sensors on smartphones and robots. However, a complete and coherent reconstruction is challenging when only partial observation of the scene is available.
The task of estimating the full 3D geometry of a scene containing unseen objects, from a single red, green, blue plus depth (RGB-D) image may be referred to as general or generalizable scene completion. Scene completion is an important task which may allow for better robot action planning such as grasp planning, path planning, and long-horizon task planning. Scene completion may also be useful in contexts such as autonomous navigation and image generation for augmented reality (AR) and virtual reality (VR) devices. However, a single view of the environment may capture only limited information of the scene, which presents a major challenge for scene completion.
Example embodiments address at least the above problems and/or disadvantages and other disadvantages not described above. Also, the example embodiments are not required to overcome the disadvantages described above, and may not overcome any of the problems described above.
According to an aspect of the present disclosure, a method for processing image data for scene completion may include obtaining an original image from an original viewpoint corresponding to a first direction, wherein the original image includes an object and a background, wherein a first surface of the object is an image of the object corresponding to the first direction. The method may include receiving an original image from an original viewpoint corresponding to a first direction, wherein the original image includes an object and a background, wherein a first surface of the object is an image of the object corresponding to the first direction. The method may include obtaining a first image from a new viewpoint corresponding to a second direction different from the first direction by rotating the original image based on 3-dimensional (3D) information generated from 2-dimensional (2D) information which is obtained from the original image. The method may include determining an area within the first image for generating a second surface of the object based on depth information about a depth between the object and the background of the original image. The method may include obtaining a second image by inputting the first image and the determined area to an artificial intelligence (AI) inpainting model, wherein the AI inpainting model generates the second surface of the object which occupies a portion of the determined area in the second image.
According to an embodiment of the disclosure, an electronic device for processing image data for scene completion may include at least one memory configured to store instructions. The electronic device for processing image data for scene completion may include at least one processor configured to execute the instructions to receive an original image from an original viewpoint corresponding to a first direction, wherein the original image includes an object and a background, wherein a first surface of the object is an image of the object corresponding to the first direction. The electronic device for processing image data for scene completion may include at least one processor configured to execute the instructions to obtain an original image from an original viewpoint corresponding to a first direction, wherein the original image includes an object and a background, wherein a first surface of the object is an image of the object corresponding to the first direction. The electronic device for processing image data for scene completion may include at least one processor configured to execute the instructions to obtain a first image from a new viewpoint corresponding to a second direction different from the first direction by rotating the original image based on 3-dimensional (3D) information generated based on 2-dimensional (2D) information which is obtained from the original image. The electronic device for processing image data for scene completion may include at least one processor configured to execute the instructions to determine an area within the first image for generating a second surface of the object based on depth information about a depth between the object and the background of the original image.
The electronic device for processing image data for scene completion may include at least one processor configured to execute the instructions to obtain a second image by inputting the first image and the determined area to an artificial intelligence (AI) inpainting model, wherein the AI inpainting model generates the second surface of the object which occupies a portion of the determined area in the second image.
According to an embodiment of the present disclosure, a computer-readable storage medium which is configured to store instructions is provided. The instructions, when executed by at least one processor of a device, may cause the at least one processor to perform the corresponding method.
The above and other aspects and features of embodiments of the disclosure will be more apparent from the following description taken in conjunction with the accompanying drawings, in which:
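As a non-limiting illustration of obtaining a first image from a new viewpoint, the following sketch back-projects a depth image into 3D points, rotates them, and re-renders a depth image with a z-buffer. The pinhole camera model, the intrinsic parameters `fx`, `fy`, `cx`, `cy`, the restriction to a rotation about the vertical axis, and the function name are all assumptions made for this sketch; the disclosure does not prescribe them. Pixels that receive no point remain `None`, which corresponds to the incomplete regions that are later masked and inpainted.

```python
import math
from typing import List, Optional, Sequence

Depth = Optional[float]  # None marks a pixel with no depth


def reproject_depth(depth: Sequence[Sequence[Depth]],
                    fx: float, fy: float, cx: float, cy: float,
                    yaw_rad: float) -> List[List[Depth]]:
    """Render an incomplete depth image for a camera rotated by yaw_rad."""
    h, w = len(depth), len(depth[0])
    out: List[List[Depth]] = [[None] * w for _ in range(h)]
    c, s = math.cos(yaw_rad), math.sin(yaw_rad)
    for v in range(h):
        for u in range(w):
            z = depth[v][u]
            if z is None:
                continue
            # Back-project the pixel to a 3D point (pinhole model).
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            # Express the point in the rotated camera frame.
            xn = c * x - s * z
            zn = s * x + c * z
            if zn <= 0:
                continue  # the point lies behind the new camera
            # Project onto the new image plane.
            un = int(round(fx * xn / zn + cx))
            vn = int(round(fy * y / zn + cy))
            if 0 <= un < w and 0 <= vn < h:
                # Z-buffer: keep the nearest surface at each pixel.
                if out[vn][un] is None or zn < out[vn][un]:
                    out[vn][un] = zn
    return out
```

With a rotation of zero the input is returned unchanged; any other rotation generally leaves some pixels empty, since surfaces not visible from the original viewpoint contribute no points.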
FIG. 1 is a diagram showing a viewpoint module, according to an embodiment of the present disclosure;
FIG. 2 is a flowchart illustrating a method of processing an image to perform scene completion, according to an embodiment of the present disclosure;
FIG. 3A is a diagram showing a scene completion system, according to an embodiment of the present disclosure;
FIG. 3B is a diagram illustrating a process for generating a merged point cloud, according to an embodiment of the present disclosure;
FIG. 4 is a diagram showing an example configuration of a surface-aware masking module, according to an embodiment of the present disclosure;
FIG. 5 is a diagram showing an example of generating a mask without using surface-aware masking, according to an embodiment of the present disclosure;
FIGS. 6A to 6C illustrate results of performing scene completion based on a mask generated according to FIG. 5, according to an embodiment of the present disclosure;
FIGS. 7A-7D are diagrams showing an example of generating a mask using surface-aware masking, according to an embodiment of the present disclosure;
FIGS. 8A to 8C illustrate results of performing scene completion based on a mask generated according to FIGS. 7A-7D, according to an embodiment of the present disclosure;
FIG. 9 is a flowchart illustrating a method of performing surface-aware masking for scene completion, according to an embodiment of the present disclosure;
FIGS. 10A to 10C show further examples of a surface-aware masking process, according to an embodiment of the present disclosure;
FIGS. 11A to 11C show further examples of a surface-aware masking process, according to an embodiment of the present disclosure;
FIGS. 12A and 12B are flowcharts illustrating use applications of scene completion methods, according to an embodiment of the present disclosure;
FIGS. 13A and 13B are flowcharts illustrating a method of processing an image to perform scene completion, according to an embodiment of the present disclosure;
FIG. 14 is a diagram of electronic devices for performing scene completion according to an embodiment of the present disclosure; and
FIG. 15 is a diagram of components of one or more electronic devices of FIG. 14 according to an embodiment of the present disclosure.
Example embodiments are described in greater detail below with reference to the accompanying drawings.
In the following description, like drawing reference numerals are used for like elements, even in different drawings. The matters defined in the description, such as detailed construction and elements, are provided to assist in a comprehensive understanding of the example embodiments. However, it is apparent that the example embodiments can be practiced without those specifically defined matters. Also, well-known functions or constructions are not described in detail since they would obscure the description with unnecessary detail.
Expressions such as "at least one of," when preceding a list of elements, modify the entire list of elements and do not modify the individual elements of the list. For example, the expression, "at least one of a, b, and c," should be understood as including only a, only b, only c, both a and b, both a and c, both b and c, all of a, b, and c, or any variations of the aforementioned examples.
While such terms as "first," "second," etc., may be used to describe various elements, such elements must not be limited to the above terms. The above terms may be used only to distinguish one element from another.
The term "module" is intended to be broadly construed as hardware, software, firmware, or any combination thereof.
It will be apparent that systems and/or methods, described herein, may be implemented in different forms of hardware, firmware, or a combination of hardware and software. The actual specialized control hardware or software code used to implement these systems and/or methods is not limiting of the implementations. Thus, the operation and behavior of the systems and/or methods were described herein without reference to specific software code―it being understood that software and hardware may be designed to implement the systems and/or methods based on the description herein.
Even though particular combinations of features are recited in the claims and/or disclosed in the specification, these combinations are not intended to limit the disclosure of possible implementations. In fact, many of these features may be combined in ways not specifically recited in the claims and/or disclosed in the specification. Although each dependent claim listed below may directly depend on only one claim, the disclosure of possible implementations includes each dependent claim in combination with every other claim in the claim set.
No element, act, or instruction used herein should be construed as critical or essential unless explicitly described as such. Also, as used herein, the articles "a" and "an" are intended to include one or more items, and may be used interchangeably with "one or more." Furthermore, as used herein, the term "set" is intended to include one or more items (e.g., related items, unrelated items, a combination of related and unrelated items, etc.), and may be used interchangeably with "one or more." Where only one item is intended, the term "one" or similar language is used. Also, as used herein, the terms "has," "have," "having," or the like are intended to be open-ended terms. Further, the phrase "based on" is intended to mean "based, at least in part, on" unless explicitly stated otherwise.
Embodiments may relate to methods, systems, and apparatuses for performing scene completion. Embodiments may provide a method, system, or apparatus which may obtain an input image of a scene, for example an RGB-D image, and may generate a completed 3D representation of the scene, for example a completed scene point cloud, which may include regions which are unobservable or occluded in the input image. In embodiments, a point cloud may be a multidimensional set of points which represent at least one of an object and a space. For example, each point may represent geometric coordinates of a single point on a surface of an object, and may further represent information such as texture information and color information corresponding to the single point. In embodiments, the scene may include one or more objects and a background, and the completed scene point cloud may include both depth information and texture information about the scene, including the one or more objects and the background included in the scene. Although examples are provided herein in terms of point clouds, embodiments are not limited thereto. For example, embodiments may relate to any multidimensional representations of objects and spaces, for example mesh representations, voxel grid representations, implicit surface representations, distance field representations, and any other type of representation.
According to an embodiment, the reconstruction of the completed scene point cloud may be performed in two general steps, for example a step of scene view completion, and a step of lifting the scene from a two-dimensional representation to a three-dimensional representation. For example, an embodiment may apply the generalization capability of large language models to inpaint the missing areas of color images rendered from different viewpoints. Then, these inpainted images may be converted from two-dimensional (2D) images to three-dimensional (3D) representations, for example point clouds, by predicting per-pixel depth values using a combination of a trained network and depth information in the input image. In an embodiment, this lifting process may be referred to as deprojection.
According to an embodiment, an entire completed scene point cloud for a scene may be reconstructed based on a single image of the scene, for example a single RGB-D image. For example, based on the single image, the entire scene layout may be reconstructed in a globally-consistent fashion. Some related-art methods may be confined to task-specific models which often do not generalize appropriately to distributions beyond the training data, which may limit their applicability. In contrast, an embodiment of the present disclosure may provide generalization to unseen scenes, objects, and categories by leveraging inpainted features. An embodiment may utilize the generalizable aspects of machine learning (ML) and artificial intelligence (AI) models, for example visual language models (VLMs) for completing novel views and depth maps. However, the present disclosure is not limited in this regard, and an embodiment may utilize other types of ML and AI models. The integrated pipeline provided by an embodiment may be used for scene completion of unseen objects with occlusion and clutter.
For example, according to an embodiment, the generalization capabilities of large VLMs with respect to 2D images may be leveraged to lift the information contained in the 2D images into 3D space for practical robotics applications. Accordingly, an embodiment may provide consistent scene completion in new environments, and with unseen objects.
FIG. 1 is an example of a diagram showing a viewpoint module for performing scene completion, according to an embodiment of the present disclosure.
As shown in FIG. 1, a viewpoint module 100 may include an image rotation module 102, a surface-aware masking (SAM) module 104, an inpainting model 106, one or more depth estimation models 108, for example normal estimation model 108A and boundary estimation model 108B, a depth completion module 110, and a deprojection module 112.
According to an embodiment, the viewpoint module 100 may obtain (e.g. receive, capture, download) an original image, for example an RGB-D image
Figure PCTKR2024095121-appb-img-000001
of a scene, as input, and may output one or more estimated point clouds
Figure PCTKR2024095121-appb-img-000002
, where N is the number of predicted points in the scene, and H and W denote dimensions of the RGB-D image. In an embodiment, the RGB-D image
Figure PCTKR2024095121-appb-img-000003
may include an input color image
Figure PCTKR2024095121-appb-img-000004
and an input depth image
Figure PCTKR2024095121-appb-img-000005
. In an embodiment, a color image may be referred to as an RGB image or a texture image, and the like.
In an embodiment, the image rotation module 102, the SAM module 104, and the inpainting model 106 may be referred to as an inpainting pipeline, which may obtain the RGB-D image
Figure PCTKR2024095121-appb-img-000006
from an original viewpoint
Figure PCTKR2024095121-appb-img-000007
, and may output an incomplete depth image
Figure PCTKR2024095121-appb-img-000008
and an inpainted color image
Figure PCTKR2024095121-appb-img-000009
from a new viewpoint
Figure PCTKR2024095121-appb-img-000010
. For example, in an embodiment, the original viewpoint
Figure PCTKR2024095121-appb-img-000011
may correspond to a view of the scene from a first direction, and the new viewpoint
Figure PCTKR2024095121-appb-img-000012
may correspond to a view of the scene from a second direction which is different from the first direction. The one or more depth estimation models 108 and the depth completion module 110 may be referred to as a depth completion pipeline, which may obtain the incomplete depth image
Figure PCTKR2024095121-appb-img-000013
and the inpainted color image
Figure PCTKR2024095121-appb-img-000014
, and may output an estimated depth image
Figure PCTKR2024095121-appb-img-000015
.
The deprojection module 112 may generate 3D information about the scene based on 2D information which is obtained from the RGB-D image
Figure PCTKR2024095121-appb-img-000016
. In an embodiment, the 2D information may include at least one from among boundary information, texture information, color information, and depth information included in the RGB-D image
Figure PCTKR2024095121-appb-img-000017
. In an embodiment, the 3D information may include a 3D representation of the scene, for example a point cloud as discussed above. In an embodiment, the deprojection module 112 may obtain the inpainted color image
Figure PCTKR2024095121-appb-img-000018
and the estimated depth image
Figure PCTKR2024095121-appb-img-000019
, and may obtain an estimated point cloud
Figure PCTKR2024095121-appb-img-000020
corresponding to the viewpoint
Figure PCTKR2024095121-appb-img-000021
.
For example, a process of generating a 2D image from a 3D representation, such as a point cloud, may be referred to as projecting the 2D image from the point cloud. Similarly, a process of generating a 3D representation such as a point cloud from a 2D image may be referred to as deprojecting the point cloud from the 2D image. For example, given a depth image which is a 2D image that has a depth value at every pixel, and also given camera information used to capture the 2D image (for example focal length, etc.), it may be possible to deproject each pixel using the camera information and the depth information at that 2D pixel location. In an embodiment, this may be similar to drawing a line or ray from the camera through the 2D pixel location, and placing a point along the line at a distance corresponding to the depth information for the pixel. If the depth image is available, then the deprojection may be performed without an algorithm or model. However, if no depth image is available, or only a partial depth image is available, an AI model such as the one or more depth estimation models 108 may be used to predict the depth image.
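As a non-limiting illustration, the deprojection described above may be sketched as follows under an assumed pinhole camera model. The function name `deproject`, the intrinsic parameters `fx`, `fy`, `cx`, and `cy`, and the convention that the depth value is measured along the optical axis are assumptions made for this sketch and are not specified by the disclosure.

```python
def deproject(depth, fx, fy, cx, cy):
    """Lift a depth image (list of rows) to a list of 3D points.

    Assumes a pinhole camera: fx, fy are focal lengths in pixels and
    (cx, cy) is the principal point. Pixels whose depth is None or
    non-positive carry no information and are skipped.
    """
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z is None or z <= 0:
                continue
            # Follow the ray through pixel (u, v) out to depth z.
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points


# A 2x2 depth image with one missing pixel (0.0).
depth = [[1.0, 1.0],
         [0.0, 2.0]]
cloud = deproject(depth, fx=1.0, fy=1.0, cx=0.5, cy=0.5)
```

Color values, when available, may be carried alongside each point in the same loop.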
FIG. 2 is a flowchart illustrating a method 200 of processing an image to perform scene completion, according to an embodiment of the present disclosure. In an embodiment, one or more operations of the method 200 of FIG. 2 may be performed by or using the viewpoint module 100 and any of the elements included therein, and any other element described herein.
Referring to FIG. 2, at operation S201 the image rotation module 102 may obtain (e.g. receive, capture, download) the RGB-D image
Figure PCTKR2024095121-appb-img-000022
and information about the image
Figure PCTKR2024095121-appb-img-000023
, for example intrinsic information about a camera or other device used to capture the image
Figure PCTKR2024095121-appb-img-000024
(such as focal length, etc.). Then, at operation S202, the image rotation module 102 may deproject the image
Figure PCTKR2024095121-appb-img-000025
into a point cloud
Figure PCTKR2024095121-appb-img-000026
corresponding to the original viewpoint
Figure PCTKR2024095121-appb-img-000027
. At operation S203, the image rotation module 102 may then rotate the deprojected point cloud
Figure PCTKR2024095121-appb-img-000028
by an angle
Figure PCTKR2024095121-appb-img-000029
about its center point, and the rotated point cloud may be reprojected to render an incomplete color image
Figure PCTKR2024095121-appb-img-000030
and an incomplete depth image
Figure PCTKR2024095121-appb-img-000031
. In an embodiment, the incomplete color image
Figure PCTKR2024095121-appb-img-000032
and the incomplete depth image
Figure PCTKR2024095121-appb-img-000033
may be referred to as "incomplete" because they may be missing information about one or more areas of the scene which are obscured or occluded by an object in the deprojected point cloud
Figure PCTKR2024095121-appb-img-000034
. For example, when the point cloud
Figure PCTKR2024095121-appb-img-000035
is rotated, some points in the rotated point cloud may correspond to occluded areas of the scene which are obscured by a surface of an object which is present in the RGB-D image
Figure PCTKR2024095121-appb-img-000036
.
For example, the occluded areas of the scene may be regions which include at least one of a portion of a background of the original image (from the new viewpoint
Figure PCTKR2024095121-appb-img-000037
), and a portion of a surface of an object (from the new viewpoint
Figure PCTKR2024095121-appb-img-000038
). In an embodiment, this portion of the surface of the object may be referred to as an "object area". Therefore, when the rotated point cloud is used to generate a 2D image, this 2D image may also be missing information, and therefore may be referred to as an incomplete image. Because the rotation of the point cloud
Figure PCTKR2024095121-appb-img-000039
may correspond to changing the viewpoint, the incomplete color image
Figure PCTKR2024095121-appb-img-000040
and the incomplete depth image
Figure PCTKR2024095121-appb-img-000041
may correspond to a new viewpoint
Figure PCTKR2024095121-appb-img-000042
. In an embodiment, the incomplete color image
Figure PCTKR2024095121-appb-img-000043
and the incomplete depth image
Figure PCTKR2024095121-appb-img-000044
may be missing color information and depth information corresponding to areas of the scene which are occluded or otherwise not visible in the original RGB-D image
Figure PCTKR2024095121-appb-img-000045
. In an embodiment, the incomplete color image
Figure PCTKR2024095121-appb-img-000046
and the incomplete depth image
Figure PCTKR2024095121-appb-img-000047
may be referred to as, or included in, an incomplete RGB-D image
Figure PCTKR2024095121-appb-img-000048
.
In an embodiment, a process for generating the incomplete RGB-D image
Figure PCTKR2024095121-appb-img-000049
from the new viewpoint
Figure PCTKR2024095121-appb-img-000050
based on information in the original image
Figure PCTKR2024095121-appb-img-000051
may be referred to as "rotating" the original image
Figure PCTKR2024095121-appb-img-000052
. For example, the process of deprojecting the image
Figure PCTKR2024095121-appb-img-000053
into the point cloud
Figure PCTKR2024095121-appb-img-000054
, rotating the deprojected point cloud
Figure PCTKR2024095121-appb-img-000055
, and reprojecting to render the incomplete color image
Figure PCTKR2024095121-appb-img-000056
and the incomplete depth image
Figure PCTKR2024095121-appb-img-000057
described above with respect to operations S202 and S203 may be referred to as "rotating" the original image
Figure PCTKR2024095121-appb-img-000058
.
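As a non-limiting illustration, the "rotating" of operations S202 and S203 may be sketched as a rotation of the deprojected points about their centroid followed by a z-buffered reprojection. The choice of the vertical (y) axis as the rotation axis, the pinhole intrinsics, and the function names are assumptions for this sketch. Pixels that receive no point remain `None`, which is what makes the rendered image incomplete.

```python
import math


def rotate_about_center(points, angle_rad):
    """Rotate points about the vertical (y) axis through their centroid."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    mz = sum(p[2] for p in points) / n
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    rotated = []
    for x, y, z in points:
        dx, dz = x - mx, z - mz
        rotated.append((mx + c * dx + s * dz, y, mz - s * dx + c * dz))
    return rotated


def project(points, fx, fy, cx, cy, height, width):
    """Render a depth image; pixels hit by no point stay None."""
    depth = [[None] * width for _ in range(height)]
    for x, y, z in points:
        if z <= 0:
            continue  # behind the camera
        u = int(round(fx * x / z + cx))
        v = int(round(fy * y / z + cy))
        if 0 <= u < width and 0 <= v < height:
            # Keep the nearest surface when several points share a pixel.
            if depth[v][u] is None or z < depth[v][u]:
                depth[v][u] = z
    return depth


cloud = [(-0.5, 0.0, 2.0), (0.5, 0.0, 2.0), (0.0, 0.0, 3.0)]
rotated = rotate_about_center(cloud, math.radians(30))
incomplete_depth = project(rotated, fx=2.0, fy=2.0, cx=1.0, cy=1.0,
                           height=3, width=3)
```

An incomplete color image may be rendered in the same pass by storing each point's color wherever its depth wins the z-buffer comparison.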
In an embodiment, the new viewpoint
Figure PCTKR2024095121-appb-img-000059
may be selected based on a context ratio
Figure PCTKR2024095121-appb-img-000060
, which may be determined based on Equation 1 below:
Figure PCTKR2024095121-appb-img-000061
Equation 1
In Equation 1 above,
Figure PCTKR2024095121-appb-img-000062
may denote a number of context pixels in an image, and
Figure PCTKR2024095121-appb-img-000063
may denote a number of all pixels in an image. The context ratio
Figure PCTKR2024095121-appb-img-000064
may provide an indication about how accurately an inpainting model such as the inpainting model 106 may be able to fill in missing areas in an image. For example, a low value of the context ratio
Figure PCTKR2024095121-appb-img-000065
may indicate that many areas are unknown, and that an inpainting model may struggle to fill in missing areas, and a high value of the context ratio
Figure PCTKR2024095121-appb-img-000066
may indicate that an inpainting model may more easily fill in missing areas, but may only fill in limited information.
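As a non-limiting illustration, and reading only from the definitions given above for Equation 1, the context ratio may be written as the fraction of pixels in an image that are context pixels. The symbols $c$, $N_{\text{context}}$, and $N_{\text{total}}$ are placeholders assumed for this sketch; the notation of the original equation is not reproduced here.

```latex
c = \frac{N_{\text{context}}}{N_{\text{total}}}, \qquad 0 \le c \le 1
```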
When selecting the new viewpoint
Figure PCTKR2024095121-appb-img-000067
, the image rotation module 102 may start from the original viewpoint
Figure PCTKR2024095121-appb-img-000068
, and may rotate the deprojected point cloud
Figure PCTKR2024095121-appb-img-000069
in various directions to various new viewpoints. At each step in the rotation, an image may be projected based on the rotated point cloud, and a context ratio
Figure PCTKR2024095121-appb-img-000070
of the projected image may be calculated. Based on the context ratio
Figure PCTKR2024095121-appb-img-000071
of a projected image satisfying a predetermined criterion, the corresponding viewpoint may be selected as the new viewpoint
Figure PCTKR2024095121-appb-img-000072
. In an embodiment, the predetermined criterion may be satisfied when the context ratio
Figure PCTKR2024095121-appb-img-000073
of a projected image is closest to a context threshold
Figure PCTKR2024095121-appb-img-000074
from among the context ratios of a plurality of projected images corresponding to a plurality of new viewpoints. This process may be repeated to obtain a plurality of evenly spaced new viewpoints, but embodiments are not limited thereto.
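As a non-limiting illustration, the viewpoint selection described above may be sketched as follows. The sketch assumes that a "context pixel" is a pixel that holds a known value after reprojection (represented here as any value other than `None`), and that candidate viewpoints are identified by a rotation angle. The function names and the example threshold are likewise assumptions.

```python
def context_ratio(image):
    """Fraction of pixels that hold known (non-None) values."""
    total = sum(len(row) for row in image)
    known = sum(1 for row in image for px in row if px is not None)
    return known / total


def select_viewpoint(candidates, threshold):
    """Pick the candidate whose context ratio is closest to the threshold.

    candidates maps a rotation angle to the image projected at that angle.
    """
    return min(candidates,
               key=lambda angle: abs(context_ratio(candidates[angle]) - threshold))


# Projected images for three hypothetical rotation angles (degrees).
candidates = {
    10: [[1.0, 1.0], [1.0, None]],    # context ratio 0.75
    20: [[1.0, 1.0], [None, None]],   # context ratio 0.50
    30: [[1.0, None], [None, None]],  # context ratio 0.25
}
new_viewpoint = select_viewpoint(candidates, threshold=0.5)
```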
In an embodiment, before the incomplete color image
Figure PCTKR2024095121-appb-img-000075
is inpainted, preprocessing steps may be applied to increase the quality of the inpainting results. For example, the incomplete color image
Figure PCTKR2024095121-appb-img-000076
may be preprocessed to fill in relatively small holes which are produced as a result of the reprojecting described above. For example, a naive inpainting filter that works with relatively small areas of missing values may be applied. In an embodiment, the naive inpainting filter may be a general inpainting filter or inpainting model which is trained using a general image dataset that is not specific to the particular scene. Starting at boundaries of missing pixels, a weighted average of the nearest ground truth pixels may be determined. The naive inpainting filter may then work inward to fill larger holes. In an embodiment, the naive inpainting filter may be used to fill relatively small holes of missing information in order to produce a denser image that gives more context for the inpainting model 106. However, the naive inpainting filter may produce unrealistic results for relatively large missing areas.
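As a non-limiting illustration, a naive hole-filling filter of the kind described above may be sketched as an iterative pass that fills each missing pixel bordering known pixels with a distance-weighted average of its known neighbors, and then works inward. The 8-neighbor window, the inverse-distance weights, and the `max_passes` limit (which leaves large holes for the inpainting model) are assumptions for this sketch. After the first pass, previously filled values are also used as neighbors, which is a simplification.

```python
import math


def fill_small_holes(image, max_passes=2):
    """Fill missing (None) pixels from the hole boundary inward.

    Each pass fills only pixels with at least one known 8-neighbor, so
    holes wider than about 2 * max_passes pixels are left partly unfilled.
    """
    h, w = len(image), len(image[0])
    out = [row[:] for row in image]
    for _ in range(max_passes):
        updates = {}
        for v in range(h):
            for u in range(w):
                if out[v][u] is not None:
                    continue
                total, weight = 0.0, 0.0
                for dv in (-1, 0, 1):
                    for du in (-1, 0, 1):
                        nv, nu = v + dv, u + du
                        if (dv or du) and 0 <= nv < h and 0 <= nu < w \
                                and out[nv][nu] is not None:
                            wgt = 1.0 / math.hypot(dv, du)
                            total += wgt * out[nv][nu]
                            weight += wgt
                if weight > 0:
                    updates[(v, u)] = total / weight
        if not updates:
            break
        # Apply after the scan so one pass advances exactly one pixel inward.
        for (v, u), value in updates.items():
            out[v][u] = value
    return out


small_hole = [[4.0, 4.0, 4.0],
              [4.0, None, 4.0],
              [4.0, 4.0, 4.0]]
filled = fill_small_holes(small_hole)
```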
Therefore, at operation S204, the SAM module 104 may generate a mask
Figure PCTKR2024095121-appb-img-000077
which indicates the large missing areas. In an embodiment, the missing areas may include areas in which no pixel information is available when the original image is rotated. In an embodiment, even if there is pixel information available when the original image is rotated (some of which may correspond to the background), the SAM module 104 may determine that an area predicted as the surface area of the object should be masked. An example of a method for generating the mask is provided below with respect to FIGS. 4 to 9. At operation S205, the SAM module 104 may mask the incomplete color image
Figure PCTKR2024095121-appb-img-000078
to obtain a masked color image, and may mask the incomplete depth image
Figure PCTKR2024095121-appb-img-000079
to obtain a masked depth image.
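As a non-limiting illustration, operation S205 may be sketched as applying one mask to both the incomplete color image and the incomplete depth image. The representation of the mask as a grid of booleans (where `True` marks a pixel to be inpainted) and the use of `None` for masked pixels are assumptions for this sketch.

```python
def apply_mask(image, mask):
    """Return a copy of image with masked pixels (mask value True) set to None."""
    return [[None if m else px for px, m in zip(img_row, mask_row)]
            for img_row, mask_row in zip(image, mask)]


color = [[(255, 0, 0), (0, 255, 0)],
         [(0, 0, 255), (9, 9, 9)]]
depth = [[1.0, 1.2],
         [1.1, 0.4]]
# The lower-right pixel holds background data but is predicted to be
# covered by an object surface from the new viewpoint, so it is masked.
mask = [[False, False],
        [False, True]]
masked_color = apply_mask(color, mask)
masked_depth = apply_mask(depth, mask)
```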
At operation S206, the viewpoint module 100 may provide the masked color image, or for example the mask
Figure PCTKR2024095121-appb-img-000080
and the incomplete color image
Figure PCTKR2024095121-appb-img-000081
, to the inpainting model 106 to obtain an inpainted color image
Figure PCTKR2024095121-appb-img-000082
. For example, the inpainting model 106 may generate predicted image information corresponding to portions of the incomplete color image
Figure PCTKR2024095121-appb-img-000083
which are masked by the mask
Figure PCTKR2024095121-appb-img-000084
, and the inpainted color image
Figure PCTKR2024095121-appb-img-000085
may be generated by applying the predicted image information to the incomplete color image
Figure PCTKR2024095121-appb-img-000086
. In an embodiment, the inpainted color image
Figure PCTKR2024095121-appb-img-000087
may be referred to as a predicted image. In an embodiment, the inpainting model 106 may be or may include an AI or ML model, for example at least one of a diffusion model and a VLM such as DALL-E 2.
In an embodiment, the inpainting model 106 may obtain the masked color image and an input prompt P that describes the context of the original RGB-D image
Figure PCTKR2024095121-appb-img-000088
in words or text. For example, based on the original scene including objects on a tabletop, the prompt
Figure PCTKR2024095121-appb-img-000089
may include "household objects on a table". As an example, based on the original scene including a room to be vacuumed by a robotic vacuum cleaner, the prompt
Figure PCTKR2024095121-appb-img-000090
may include "room with carpet and furniture". As further examples, the prompt
Figure PCTKR2024095121-appb-img-000091
may include any additional known information about the scene, such as "a baseball and glove on a table" if these objects are known to be on the table, or "top-down view of household objects on a table" if the viewpoint is known to be from a top-down perspective. For example, the additional known information may be at least one of information that was previously provided or confirmed by a user, information that is associated with the image such as information included in tags or metadata, and information obtained using image analysis or view analysis, for example using an image analysis algorithm or model. However, embodiments are not limited thereto, and the prompt
Figure PCTKR2024095121-appb-img-000092
may include any other information. For example, in an embodiment the original RGB-D image
Figure PCTKR2024095121-appb-img-000093
may be provided to an automatic captioning model, and the output of the automatic captioning model may be used as the prompt
Figure PCTKR2024095121-appb-img-000094
. In detail, based on the scene including objects on a tabletop, the output of the automatic captioning model may be a proposed prompt such as "household objects on a table". This output may be provided to the user, and the user may then revise or modify this proposed prompt to obtain a revised prompt. Based on the example above, the revised prompt may be "household objects such as a dish, cloth, cutlery, and a pot on a table", or "household objects such as drinking glasses and dinner plates on a white marble dining table" (in which text in italics indicates modifications to the proposed prompt which are input by the user).
As an example, the user may input an original prompt
Figure PCTKR2024095121-appb-img-000095
, and then based on the output of the inpainting model 106, may modify the original prompt
Figure PCTKR2024095121-appb-img-000096
to obtain a revised prompt, and may request a new inpainted image to be generated based on the revised prompt. For example, the user may originally input "a baseball and glove on a table" as the original prompt
Figure PCTKR2024095121-appb-img-000097
. After reviewing the inpainted image output by the inpainting model 106, the user may input a revised prompt such as "a baseball and a leather baseball glove on a wooden table" (in which text in italics indicates revisions to the original prompt
Figure PCTKR2024095121-appb-img-000098
which are input by the user).
As an example, a user may input any prompt as desired, for example to change the style of the original RGB-D image
Figure PCTKR2024095121-appb-img-000099
to another style. For example, the appearance or visual style of the original RGB-D image
Figure PCTKR2024095121-appb-img-000100
may be modified using a neural style transfer (NST) model, for example by modifying style features of the original RGB-D image
Figure PCTKR2024095121-appb-img-000101
while maintaining content features of the original RGB-D image
Figure PCTKR2024095121-appb-img-000102
.
The inpainting model 106 may output the inpainted color image
Figure PCTKR2024095121-appb-img-000103
, which may contain estimated areas corresponding to areas of the incomplete color image
Figure PCTKR2024095121-appb-img-000104
which are masked by the mask
Figure PCTKR2024095121-appb-img-000105
.
The term "prompt" may refer to text used to initiate interaction with a generative model that generates images for electronic devices. A prompt may include one or more words, phrases, and/or sentences. In an embodiment, the inpainting model 106 may be, may include, or may be similar to such a generative model. In one example, a prompt may contain natural language text that carries various information that the generative model can use to generate images, such as context, intent, task, constraints, and more. Electronic devices may process natural language text using natural language processing (NLP) models.
In one scenario, prompts and revised prompts can be received from users. For instance, electronic devices may receive text input from users, or they can receive voice input and perform automatic speech recognition (ASR) to convert the user's voice input into text. However, the present disclosure is not limited in this regard, and electronic devices may receive other types of input from users.
In an example, prompts may be generated by electronic devices using various techniques, such as image captioning. For instance, electronic devices can receive image input from users and extract text descriptions from the images.
Additionally, the term "prompt" may be replaced with a similar expression that represents the same concept. For example, prompts can be replaced with terms like "input," "user input," "input phrase," "user command," "directive," "starting sentence," "task query," "trigger sentence," "message," and others, not limited to the examples mentioned.
Due to the randomized nature of inpainting, some inpainted color images
Figure PCTKR2024095121-appb-img-000106
which may be generated by the inpainting model 106 may vary in terms of their perceived realism. Therefore, in an embodiment, the inpainting model 106 may be used to generate multiple candidate inpainted color images based on the same masked color image. Then, these candidate inpainted color images may be compared against the input prompt P by encoding the candidates and the prompt into a shared embedding space, and the candidate inpainted color image having the highest similarity to the input prompt P may be chosen as the inpainted color image
Figure PCTKR2024095121-appb-img-000107
.
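As a non-limiting illustration, the candidate selection described above may be sketched as follows. The sketch assumes that each candidate image and the prompt have already been encoded into vectors in a common embedding space by some joint image-text encoder, which the disclosure does not name, and that similarity is measured by cosine similarity. Both are assumptions, and the embedding values below are invented toy numbers.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def pick_best_candidate(candidate_embeddings, prompt_embedding):
    """Return the index of the candidate most similar to the prompt, and all scores."""
    scores = [cosine_similarity(e, prompt_embedding)
              for e in candidate_embeddings]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Toy 2D embeddings standing in for encoder outputs.
prompt_embedding = [1.0, 0.0]
candidate_embeddings = [[0.0, 1.0], [1.0, 0.1], [-1.0, 0.0]]
best_index, scores = pick_best_candidate(candidate_embeddings, prompt_embedding)
```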
At operation S207, the inpainted color image
Figure PCTKR2024095121-appb-img-000108
may be provided to one or more depth estimation models 108. In an embodiment, the one or more depth estimation models 108 may be ML or AI models. For example, the inpainted color image
Figure PCTKR2024095121-appb-img-000109
may be provided to the normal estimation model 108A, which may be trained to estimate normals, and the inpainted color image
Figure PCTKR2024095121-appb-img-000110
may be provided to the boundary estimation model 108B, which may be trained to estimate occlusion boundaries. In an embodiment, the one or more depth estimation models 108 may be trained or optimized for a specific category of scenes, for example a scene including objects on a tabletop, or a scene including a room to be vacuumed by a robotic vacuum cleaner. In an embodiment, an estimated normal(s) may be, for example, geometric normal(s). The term "normal" or "geometric normal" may refer to a vector associated with a point on a surface of a 3D object in computer graphics and 3D computer modeling, and may represent a direction in which a surface is facing at each point on the surface (e.g., the direction that is perpendicular to a tangent plane of the surface at that point).
At operation S208, the depth completion module 110 may generate an estimated depth image
Figure PCTKR2024095121-appb-img-000111
based on the masked depth image and the output of the one or more depth estimation models. For example, depth information for areas with missing depths in the masked depth image may be computed by tracing along the estimated normal(s) from areas of known depth, and the estimated occlusion boundaries may act as barriers which the estimated normal(s) should not be traced across. As an example, in an embodiment a system of equations may be solved to minimize an error E, where E is defined according to Equation 2 below:
Figure PCTKR2024095121-appb-img-000112
Equation 2
In Equation 2 above,
Figure PCTKR2024095121-appb-img-000113
may denote the distance between the ground truth and estimated depth,
Figure PCTKR2024095121-appb-img-000114
may denote a smoothness term which encourages nearby pixels to have similar depths, and
Figure PCTKR2024095121-appb-img-000115
may denote the consistency of estimated depth and estimated normal values. In addition,
Figure PCTKR2024095121-appb-img-000116
,
Figure PCTKR2024095121-appb-img-000117
, and
Figure PCTKR2024095121-appb-img-000118
may denote constants or weight values corresponding to
Figure PCTKR2024095121-appb-img-000119
,
Figure PCTKR2024095121-appb-img-000120
, and
Figure PCTKR2024095121-appb-img-000121
, respectively.
Further,
Figure PCTKR2024095121-appb-img-000122
may denote a weight value corresponding to the estimated normal values based on the probability that a boundary is present. In an embodiment, the value of
Figure PCTKR2024095121-appb-img-000123
may be obtained based on the estimated occlusion boundaries discussed above.
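As a non-limiting illustration, one form of Equation 2 that is consistent with the term descriptions above is a weighted sum in which the normal-consistency term is scaled by the boundary-based weight. The symbols used here are placeholders assumed for this sketch: $E_D$ for the depth distance term, $E_S$ for the smoothness term, $E_N$ for the depth-normal consistency term, $\lambda_D$, $\lambda_S$, and $\lambda_N$ for the corresponding weights, and $B$ for the boundary-based weight. The notation of the original equation is not reproduced here.

```latex
E = \lambda_D E_D + \lambda_S E_S + \lambda_N E_N B
```

Under this reading, a value of $B$ near zero at a likely occlusion boundary suppresses the normal-consistency term there, which matches the description of boundaries acting as barriers that normals should not be traced across.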
At operation S209, the deprojection module 112 may generate an estimated point cloud
Figure PCTKR2024095121-appb-img-000124
corresponding to the viewpoint
Figure PCTKR2024095121-appb-img-000125
by deprojecting the inpainted color image
Figure PCTKR2024095121-appb-img-000126
and the estimated depth image
Figure PCTKR2024095121-appb-img-000127
. In an embodiment, the estimated point cloud
Figure PCTKR2024095121-appb-img-000128
may be a completed scene point cloud, and may be used to perform other tasks such as robot action planning, autonomous navigation, and image generation for AR devices and VR devices. In an embodiment, the method 200 may be performed multiple times based on multiple new viewpoints, and the resulting estimated point clouds may be merged to obtain the completed scene point cloud. An example of a merging process is described below with reference to FIGS. 3A-3B.
FIG. 3A is a diagram showing a scene completion system, and FIG. 3B is a diagram illustrating a process for generating a merged point cloud, according to an embodiment of the present disclosure. According to an embodiment, a scene completion system 300 may include the viewpoint module 100 discussed above, and a merging module 302. The scene completion system 300 may obtain the RGB-D image
Figure PCTKR2024095121-appb-img-000129
as input, and may output a completed scene point cloud which is obtained based on multiple estimated point clouds.
For example, the method 200 discussed above may be performed on the original RGB-D image
Figure PCTKR2024095121-appb-img-000130
by rotating the point cloud
Figure PCTKR2024095121-appb-img-000131
by angle
, to obtain estimated point cloud
Figure PCTKR2024095121-appb-img-000132
corresponding to a viewpoint
Figure PCTKR2024095121-appb-img-000133
. Then, the method 200 may be performed again, this time rotating the point cloud
Figure PCTKR2024095121-appb-img-000134
by angle
Figure PCTKR2024095121-appb-img-000135
to obtain estimated point cloud
Figure PCTKR2024095121-appb-img-000136
corresponding to a viewpoint
Figure PCTKR2024095121-appb-img-000137
. The method 200 may then be performed two more times by rotating the point cloud
Figure PCTKR2024095121-appb-img-000138
by
Figure PCTKR2024095121-appb-img-000139
and
Figure PCTKR2024095121-appb-img-000140
to obtain estimated point cloud
Figure PCTKR2024095121-appb-img-000141
corresponding to a viewpoint
Figure PCTKR2024095121-appb-img-000142
, and estimated point cloud
Figure PCTKR2024095121-appb-img-000143
corresponding to a viewpoint
Figure PCTKR2024095121-appb-img-000144
. Accordingly, as shown in FIG. 3A, four novel views of the scene are obtained, complete with RGB and depth information.
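The rotation and reprojection that produce each novel view can be sketched as follows. This is only an illustrative sketch under stated assumptions: the rotation is taken about the vertical (y) axis through a chosen center, the camera is a pinhole camera with hypothetical intrinsics `fx`, `fy`, `cx`, `cy`, and the function names are not taken from the present disclosure. Pixels that receive no point are left at zero, which corresponds to the blank regions of an incomplete depth image.

```python
import numpy as np

def rotate_about_y(points, angle_rad, center):
    """Rotate an (N, 3) point cloud about the y axis through `center`."""
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    rot = np.array([[c, 0.0, s],
                    [0.0, 1.0, 0.0],
                    [-s, 0.0, c]])
    return (points - center) @ rot.T + center

def project_depth(points, fx, fy, cx, cy, h, w):
    """Project a point cloud to an (h, w) depth image, nearest point wins."""
    depth = np.zeros((h, w))
    z = points[:, 2]
    front = z > 0  # keep only points in front of the camera
    pts, z = points[front], z[front]
    u = np.round(pts[:, 0] * fx / z + cx).astype(int)
    v = np.round(pts[:, 1] * fy / z + cy).astype(int)
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    u, v, z = u[inside], v[inside], z[inside]
    # Write far points first so that nearer points overwrite them.
    order = np.argsort(-z)
    depth[v[order], u[order]] = z[order]
    return depth
```

Calling `rotate_about_y` once per angle and projecting each rotated cloud gives one incomplete depth image per new viewpoint; the color image can be produced the same way by writing colors instead of depths.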
The merging module 302 may combine the estimated point clouds
Figure PCTKR2024095121-appb-img-000145
,
Figure PCTKR2024095121-appb-img-000146
,
Figure PCTKR2024095121-appb-img-000147
, and
Figure PCTKR2024095121-appb-img-000148
while enforcing consistency across them. For example, when inpainting real objects, completion of objects may be inconsistent, and hallucinated objects that are not in the original scene may be created by the inpainting model 106 and included in the inpainted color image
Figure PCTKR2024095121-appb-img-000149
.
To combat this issue, filtering may be performed for consistent predictions across viewpoints. For example, the merging module 302 may compare the original point cloud
Figure PCTKR2024095121-appb-img-000150
and at least one of the estimated point clouds
Figure PCTKR2024095121-appb-img-000151
,
Figure PCTKR2024095121-appb-img-000152
,
Figure PCTKR2024095121-appb-img-000153
, and
Figure PCTKR2024095121-appb-img-000154
, may determine points which intersect among multiple point clouds, and may add the intersecting points to the merged point cloud
Figure PCTKR2024095121-appb-img-000155
, while discarding points which are present in only one point cloud. However, embodiments are not limited thereto. For example, in an embodiment the merged point cloud
Figure PCTKR2024095121-appb-img-000156
may only include points which are present in more than two point clouds, or points which are present in all of the point clouds. As an example, the merging module 302 may discard points which do not directly intersect, or may only discard points which are not within a certain threshold distance from points in other point clouds.
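The threshold-based variant of this filtering can be sketched as below. The sketch is an assumption-laden illustration, not the merging module 302 itself: the function name `merge_consistent`, the parameters `threshold` and `min_views`, and the brute-force nearest-neighbor search are all hypothetical choices. A point is kept when at least `min_views` point clouds, counting its own, contain a point within `threshold` of it.

```python
import numpy as np

def merge_consistent(clouds, threshold=0.01, min_views=2):
    """Merge (N_i, 3) point clouds, keeping only cross-view consistent points."""
    kept = []
    for i, pts in enumerate(clouds):
        support = np.ones(len(pts), dtype=int)  # each point supports itself
        for j, other in enumerate(clouds):
            if i == j or len(other) == 0:
                continue
            # Brute-force pairwise distances; a spatial index would scale better.
            dist = np.linalg.norm(pts[:, None, :] - other[None, :, :], axis=2)
            support += (dist.min(axis=1) <= threshold).astype(int)
        kept.append(pts[support >= min_views])
    return np.concatenate(kept, axis=0)
```

With `min_views=2` a point seen in only one cloud is discarded, and setting `min_views` to the number of clouds keeps only points present in all of them. The sketch does not deduplicate, so a surface seen in several clouds contributes one point per cloud.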
In an embodiment, the merged point cloud
Figure PCTKR2024095121-appb-img-000157
may be a completed scene point cloud, and may be used to perform other tasks such as robot action planning, autonomous navigation, and image generation for AR devices and VR devices.
Although an embodiment is described above as generating a completed scene point cloud based on a single RGB-D image
Figure PCTKR2024095121-appb-img-000158
, embodiments are not limited thereto. For example, the method 200 may be performed multiple times based on multiple RGB-D images, and the resulting estimated point clouds may be merged to generate the completed scene point cloud. As an example, the point cloud
Figure PCTKR2024095121-appb-img-000159
may be determined by deprojecting multiple RGB-D images, and the other steps of the method 200 may be performed based on the point cloud
Figure PCTKR2024095121-appb-img-000160
. As an example, after the completed scene point cloud is generated, one or more additional or updated RGB-D images may be obtained, the method 200 may be performed based on the one or more additional or updated RGB-D images, and the resulting estimated point clouds may be merged with the previously-completed scene point cloud to obtain an updated point cloud.
FIG. 4 is a diagram showing an example configuration of the SAM module 104, according to an embodiment of the present disclosure.
As shown in FIG. 4, the SAM module 104 may include a mask generation module 402, and an image masking module 404. As discussed above, after the original point cloud
Figure PCTKR2024095121-appb-img-000161
is rotated to the new viewpoint
Figure PCTKR2024095121-appb-img-000162
, any 3D space for which reconstruction is possible may be represented as being available for inpainting in the incomplete color image
Figure PCTKR2024095121-appb-img-000163
. In order to do so, the mask generation module 402 may generate the mask
Figure PCTKR2024095121-appb-img-000164
, which may indicate areas to be inpainted by the inpainting model 106.
In an embodiment, if a mask is generated without taking into account surfaces shown in the original RGB-D image
Figure PCTKR2024095121-appb-img-000165
, the inpainting model 106 may inadvertently use background pixels when performing inpainting on an occluded surface of an object. For example, as shown in FIG. 5, an original RGB-D image
Figure PCTKR2024095121-appb-img-000166
may show a surface 502 of a foreground object, and background surfaces 504 and 506. When the point cloud
Figure PCTKR2024095121-appb-img-000167
is rotated to new viewpoint
Figure PCTKR2024095121-appb-img-000168
, some inappropriate background pixels 508 from the background surfaces 504 and 506, which would not actually be visible from the viewpoint
Figure PCTKR2024095121-appb-img-000169
, may be inadvertently included in the incomplete color image
Figure PCTKR2024095121-appb-img-000170
, and may therefore be mistakenly used by the inpainting model 106 to perform inpainting corresponding to the object. For example, an image of a surface 510 in the inpainted color image
Figure PCTKR2024095121-appb-img-000171
may be generated based on the inappropriate background pixels.
FIG. 6A shows an example of an incomplete color image
Figure PCTKR2024095121-appb-img-000172
that shows background pixels which are inappropriately included in areas which would be covered by objects. FIG. 6B shows a mask generated based on the incomplete color image
Figure PCTKR2024095121-appb-img-000173
of FIG. 6A, and FIG. 6C shows an example inpainted color image
Figure PCTKR2024095121-appb-img-000174
in which the inappropriate background pixels were used for inpainting.
Therefore, in order to prevent inappropriate pixels from being included in the incomplete color image
Figure PCTKR2024095121-appb-img-000175
, the SAM module 104 may perform surface-aware masking. For example, the mask generation module 402 may generate a 3D mesh, which may for example have a shape of a frustum, based on the input color image
Figure PCTKR2024095121-appb-img-000176
and an input depth image
Figure PCTKR2024095121-appb-img-000177
, and may use this 3D mesh to generate the mask
Figure PCTKR2024095121-appb-img-000178
.
FIGS. 7A-7D show example operations which may be included in a surface-aware masking process, according to an embodiment of the present disclosure.
As shown in FIG. 7A, for every pixel in the input color image
Figure PCTKR2024095121-appb-img-000179
and the input depth image
Figure PCTKR2024095121-appb-img-000180
, a ray may be cast from the viewpoint
Figure PCTKR2024095121-appb-img-000181
through each point in the deprojected point cloud
Figure PCTKR2024095121-appb-img-000182
. Once the ray has passed through its respective point, for example by passing through one of the surfaces 502, 504, and 506, it may be used to generate a list of points along the ray from that depth onward. As shown in FIG. 7B, the mask generation module 402 may perform this process for every ray to obtain an occlusion point cloud 702 which shows the space that could potentially be filled in the completed scene point cloud by objects corresponding to the surfaces. As shown in FIG. 7C, the mask generation module 402 may convert this occlusion point cloud to the mesh 700, and when the point cloud
Figure PCTKR2024095121-appb-img-000183
is rotated to the new viewpoint
Figure PCTKR2024095121-appb-img-000184
, the mesh 700 may be rotated as well, as shown in FIG. 7D. Then, when the SAM module 104 projects the incomplete color image
Figure PCTKR2024095121-appb-img-000185
and the incomplete depth image
Figure PCTKR2024095121-appb-img-000186
from the rotated point cloud, the SAM module 104 may discard points which are occluded by the mesh 700. Accordingly, the incomplete color image
Figure PCTKR2024095121-appb-img-000187
may be prevented from including inappropriate pixels, as shown by the dashed boxes in FIG. 7D. For example, as can be seen in FIG. 7D the incomplete color image
Figure PCTKR2024095121-appb-img-000188
does not include the inappropriate background pixels 508 shown in FIG. 5. After these pixels are discarded, the blank pixels in the incomplete color image
Figure PCTKR2024095121-appb-img-000189
and the incomplete depth image
Figure PCTKR2024095121-appb-img-000190
may be used as the mask
Figure PCTKR2024095121-appb-img-000191
. Based on the mask
Figure PCTKR2024095121-appb-img-000192
, an inpainted color image
Figure PCTKR2024095121-appb-img-000193
may be generated to include, for example, an image of a surface 704 in which the inappropriate background pixels are not included.
FIG. 8A shows an example of an incomplete color image
Figure PCTKR2024095121-appb-img-000194
in which surface-aware masking is performed according to the process described above with respect to FIGS. 7A to 7D. As can be seen in FIG. 8A, the mesh 700 may prevent inappropriate pixels from being included in the incomplete color image
Figure PCTKR2024095121-appb-img-000195
. FIG. 8B shows a mask generated based on the incomplete color image
Figure PCTKR2024095121-appb-img-000196
of FIG. 8A, and FIG. 8C shows an example inpainted color image
Figure PCTKR2024095121-appb-img-000197
in which the inappropriate background pixels are not included.
FIG. 9 is a flowchart illustrating a method 900 of performing surface-aware masking, according to an embodiment of the present disclosure. In an embodiment, one or more operations of the method 900 may correspond to the surface-aware masking process discussed above with respect to FIGS. 7A-7D.
As shown in FIG. 9, at operation S901 the mask generation module 402 may generate a plurality of points which extend beyond a surface included in the original RGB-D image
Figure PCTKR2024095121-appb-img-000198
. For example, the mask generation module 402 may subsample pixels from a uniform grid in the input RGB-D image
Figure PCTKR2024095121-appb-img-000199
to obtain a set of points
Figure PCTKR2024095121-appb-img-000200
. Then, the mask generation module 402 may initialize an empty point set
Figure PCTKR2024095121-appb-img-000201
, and for every point
Figure PCTKR2024095121-appb-img-000202
in
Figure PCTKR2024095121-appb-img-000203
, may deproject the point
Figure PCTKR2024095121-appb-img-000204
to a point
Figure PCTKR2024095121-appb-img-000205
in the point cloud
Figure PCTKR2024095121-appb-img-000206
, and generate additional points which are then added to the point set
Figure PCTKR2024095121-appb-img-000207
. In an embodiment, the mask generation module 402 may add a predetermined number of additional points for each point
Figure PCTKR2024095121-appb-img-000208
, and the additional points may be equally spaced. In an embodiment, the number of additional points and the spacing therebetween may vary based on the scene. For example, based on the scene including objects on a tabletop, the mask generation module 402 may use fewer points which are more closely spaced than would be used for a scene including a room to be vacuumed by a robot vacuum cleaner. However, embodiments are not limited thereto, and the number of additional points and the spacing therebetween may be determined in any manner. In an embodiment, the point set
Figure PCTKR2024095121-appb-img-000209
may correspond to the points shown in FIG. 7B.
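Operation S901 can be illustrated with the following sketch, which is an assumption-based example and not the disclosed implementation. It takes already-deprojected surface points and, for each one, generates a fixed number of equally spaced additional points along the ray from the camera origin through that point. The names `occlusion_points`, `n_extra`, and `spacing` are hypothetical, and the sketch returns only the additional points.

```python
import numpy as np

def occlusion_points(points, origin, n_extra=5, spacing=0.05):
    """Return equally spaced points behind each surface point along its ray.

    points: (N, 3) deprojected surface points.
    origin: (3,) camera position of the original viewpoint.
    """
    dirs = points - origin
    dirs = dirs / np.linalg.norm(dirs, axis=1, keepdims=True)  # unit ray directions
    steps = spacing * np.arange(1, n_extra + 1)  # distances past the surface
    extra = points[:, None, :] + steps[None, :, None] * dirs[:, None, :]
    return extra.reshape(-1, 3)  # shape (N * n_extra, 3)
```

A tabletop scene might use a small `n_extra` with a small `spacing`, while a room-scale scene might use more points spaced further apart, in line with the scene-dependent choice described above.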
Referring again to FIG. 9, at operation S902, the mask generation module 402 may generate a mesh based on the plurality of points. For example, the mesh may be generated by performing surface triangulation on the points in the point set
Figure PCTKR2024095121-appb-img-000210
. In an embodiment, this mesh may correspond to the mesh 700 discussed above.
Then, as discussed above, the method 900 may include discarding points which are occluded by the mesh. For example, at operation S903, the mask generation module 402 may render a depth map representing the mesh from the new viewpoint
Figure PCTKR2024095121-appb-img-000211
. Then, at operation S904, based on a comparison between the incomplete depth image
Figure PCTKR2024095121-appb-img-000212
and the depth map, the mask generation module 402 may generate the mask
Figure PCTKR2024095121-appb-img-000213
. For example, the mask generation module 402 may initialize all pixels of the mask
Figure PCTKR2024095121-appb-img-000214
as zeros ("0"s). Then, for each pixel in the mask
Figure PCTKR2024095121-appb-img-000215
, the mask generation module 402 may set the pixel to one ("1") if the estimated depth for the pixel in the incomplete depth image
Figure PCTKR2024095121-appb-img-000216
is equal to zero ("0") or is otherwise not present, or if the estimated depth for the pixel in the incomplete depth image
Figure PCTKR2024095121-appb-img-000217
is greater than the depth indicated for the pixel by the depth map representing the mesh. In the final mask
Figure PCTKR2024095121-appb-img-000218
, the pixels which are set to one ("1") may correspond to the masked areas and/or the points which are discarded when generating the masked color image and the masked depth image.
For example, in an embodiment, if the incomplete depth image
Figure PCTKR2024095121-appb-img-000219
includes an estimated depth for a particular pixel that is greater than the depth indicated for that same pixel by the depth map, this may indicate that the pixel corresponds to an area of the scene that was occluded or obscured in the original RGB-D image
Figure PCTKR2024095121-appb-img-000220
by a surface corresponding to the mesh. Accordingly, information corresponding to that pixel in the incomplete depth image
Figure PCTKR2024095121-appb-img-000221
and in the incomplete color image
Figure PCTKR2024095121-appb-img-000222
may be determined to be unreliable, and the pixel may therefore be masked and/or discarded when the masked color image and the masked depth image are generated.
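The mask rule of operation S904 can be sketched as follows. This is a minimal illustration under assumptions: the function name `surface_aware_mask` is hypothetical, missing depth is represented as zero or NaN, and pixels not covered by the rendered mesh are assumed to hold `np.inf` in the mesh depth map so that they never trigger the occlusion test.

```python
import numpy as np

def surface_aware_mask(incomplete_depth, mesh_depth):
    """Return a mask with 1 where a pixel should be inpainted, else 0.

    A pixel is masked when its depth is missing, or when its depth is
    greater than the depth of the rendered mesh at that pixel.
    """
    mask = np.zeros(incomplete_depth.shape, dtype=np.uint8)  # initialize to 0
    missing = ~np.isfinite(incomplete_depth) | (incomplete_depth == 0)
    with np.errstate(invalid="ignore"):  # NaN comparisons evaluate to False
        occluded = incomplete_depth > mesh_depth
    mask[missing | occluded] = 1
    return mask
```

Pixels set to one are the ones that would be blanked in the masked color image and the masked depth image before inpainting.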
FIGS. 10A-10C and FIGS. 11A-11C show further examples of a surface-aware masking process, according to an embodiment of the present disclosure.
As shown in FIG. 10A, an original RGB-D image
Figure PCTKR2024095121-appb-img-000223
may show a surface 1011 of a first foreground object and a surface 1012, and background surfaces 1013, 1014 and 1015. When the point cloud
Figure PCTKR2024095121-appb-img-000224
is rotated to new viewpoints
Figure PCTKR2024095121-appb-img-000225
and
Figure PCTKR2024095121-appb-img-000226
as shown in FIGS. 10B and 10C, some inappropriate background pixels from the background surfaces 1013, 1014 and 1015, which would not actually be visible from the viewpoints
Figure PCTKR2024095121-appb-img-000227
and
Figure PCTKR2024095121-appb-img-000228
, may be inadvertently included in the incomplete color images
Figure PCTKR2024095121-appb-img-000229
and
Figure PCTKR2024095121-appb-img-000230
, and may therefore be mistakenly used by the inpainting model 106 to perform inpainting corresponding to the object.
Therefore, as shown in FIG. 11A, for every pixel in the input color image
Figure PCTKR2024095121-appb-img-000231
and the input depth image
Figure PCTKR2024095121-appb-img-000232
, a ray may be cast from the viewpoint
Figure PCTKR2024095121-appb-img-000233
through each point in the deprojected point cloud
Figure PCTKR2024095121-appb-img-000234
. Once the ray has passed through its respective point, for example by passing through one of the surfaces 1011, 1012, 1013, 1014, and 1015, it may be used to generate a list of points along the ray from that depth onward. The mask generation module 402 may perform this process for every ray to obtain an occlusion point cloud 1100 which shows the space that could potentially be filled in the completed scene point cloud by objects corresponding to the surfaces. As shown in FIGS. 11B and 11C, the mask generation module 402 may convert this occlusion point cloud to the mesh 1101, and when the point cloud
Figure PCTKR2024095121-appb-img-000235
is rotated to the new viewpoints
Figure PCTKR2024095121-appb-img-000236
and
Figure PCTKR2024095121-appb-img-000237
, the mesh 1101 may be rotated as well. Then, when the SAM module 104 projects the incomplete color image
Figure PCTKR2024095121-appb-img-000238
and the incomplete depth image
Figure PCTKR2024095121-appb-img-000239
from the rotated point cloud, the SAM module 104 may discard points which are occluded by the mesh 1101. Accordingly, the incomplete color image
Figure PCTKR2024095121-appb-img-000240
may be prevented from including inappropriate pixels.
Although an embodiment discussed above shows that the mask
Figure PCTKR2024095121-appb-img-000241
is obtained after the incomplete color image
Figure PCTKR2024095121-appb-img-000242
and the incomplete depth image
Figure PCTKR2024095121-appb-img-000243
, embodiments are not limited thereto. For example, in an embodiment, the mesh 700 and the mask
Figure PCTKR2024095121-appb-img-000244
corresponding to the new viewpoint
Figure PCTKR2024095121-appb-img-000245
may be generated based on depth information included in the original image
Figure PCTKR2024095121-appb-img-000246
, and then the incomplete color image
Figure PCTKR2024095121-appb-img-000247
and the incomplete depth image
Figure PCTKR2024095121-appb-img-000248
may be generated, for example by rotating and reprojecting the deprojected point cloud
Figure PCTKR2024095121-appb-img-000249
.
Embodiments described above may be useful in many different applications. For example, an embodiment described above may be used by at least one of an AR device and a VR device to perform scene completion of an environment surrounding a user in order to generate appropriate AR and VR images in anticipation of movements by the user. For example, during a time period in which the user is stationary, an embodiment described above may be used to perform scene completion to reconstruct areas which are not immediately visible to the user, but which the user may wish to see later. The completed scene point cloud may then be used to construct a plurality of potential AR/VR images to be displayed to the user, which may help to reduce latency in images provided to the user. Accordingly, images displayed by the AR device or the VR device may seamlessly transition according to a user's head movements.
FIG. 12A is a flowchart illustrating a method 1200A of performing scene completion in at least one of an AR device and a VR device, according to an embodiment of the present disclosure. In an embodiment, one or more operations of the method 1200A may be performed by or using at least one of the viewpoint module 100, the scene completion system 300, and any of the elements included therein, and any other element described herein.
As shown in FIG. 12A, at operation S1211, the method 1200A may include obtaining an image corresponding to a current viewpoint of a user. In an embodiment, the image may correspond to the original RGB-D image
Figure PCTKR2024095121-appb-img-000250
described above.
As further shown in FIG. 12A, at operation S1212, the method 1200A may include performing scene completion to obtain a completed 3D representation of the environment of the user, for example a completed scene point cloud of a scene included in the environment. In an embodiment, the scene completion may correspond to any of the scene completion methods described above.
As further shown in FIG. 12A, at operation S1213, the method 1200A may include obtaining a plurality of potential AR/VR images corresponding to a plurality of potential viewpoints based on the completed point cloud. In an embodiment, the completed point cloud may correspond to at least one of the estimated point cloud
Figure PCTKR2024095121-appb-img-000251
and the merged point cloud
Figure PCTKR2024095121-appb-img-000252
described above. In an embodiment, the plurality of potential AR/VR images may be AR images or VR images which are generated based on the at least one of the estimated point cloud
Figure PCTKR2024095121-appb-img-000253
and the merged point cloud
Figure PCTKR2024095121-appb-img-000254
. For example, the plurality of potential AR/VR images may be or may include a potential AR image which presents information corresponding to objects in the environment of the user from the perspective of a viewpoint which the user has not yet viewed, or in an area which is hidden from the field of view of the user. As an example, the plurality of potential AR/VR images may be or may include a potential VR image which corresponds to a portion of the environment from the perspective of a viewpoint which the user has not yet viewed, or in an area which is hidden from the field of view of the user. For example, the potential VR image may include a VR object, obstacle, or boundary which corresponds to a real object in a portion of the environment, from the perspective of a viewpoint which the user has not yet viewed.
As further shown in FIG. 12A, at operation S1214, the method 1200A may include, based on the user moving from a position corresponding to the current viewpoint to a position corresponding to a potential viewpoint, displaying a transition between a current AR/VR image and a potential AR/VR image to the user. In an embodiment, the current AR/VR image may be an AR or VR image corresponding to the current viewpoint of the user, and the potential AR/VR image may be selected from among the plurality of potential AR/VR images obtained in operation S1213. Accordingly, a seamless transition from the current AR/VR image may be provided by the plurality of potential AR/VR images.
As an example, an embodiment described above may be used to manipulate or generate images in a device such as at least one of an AR device, a VR device, a mobile device, a camera, and a computer such as a personal computer, a laptop computer, and a tablet computer. For example, an embodiment described above may be used to generate a completed 3D representation of a scene based on a 2D image captured by a camera or an application or other computer program, for example a camera application. Based on the completed 3D representation, a user may generate one or more 2D images from different viewpoints or directions.
In an embodiment, the original image used to generate the completed 3D representation may correspond to only a portion of the 2D image. For example, one or more objects may be extracted from the 2D image, and an embodiment described above may be used to generate 3D representations of the one or more objects, and new 2D images of the one or more objects may be generated based on input received from a user. For example, the input from the user may be used to select new directions or viewpoints used to generate the 3D representation and the new 2D images.
For example, in an embodiment, the user input may correspond to a manipulation of the 3D representation, and the new 2D images may be generated based on the manipulation being stopped. For example, the user may provide an input such as a dragging gesture which may be used to rotate the 3D representation, and based on the dragging gesture being stopped, one or more new 2D images may be generated based on the rotated 3D representation. As an example, one or more new directions or viewpoints may be predicted in advance, and corresponding new 2D images may be created in advance, and each time the user provides an input such as a dragging gesture, a corresponding 2D image may be displayed to the user.
In addition, an embodiment described above may be used to perform scene completion in order to assist with tasks performed by a robot. For example, an embodiment described above may be used to plan actions such as grasping for a robotic arm, or to plan movements by a robotic vacuum cleaner.
FIG. 12B is a flowchart illustrating a method 1200B of performing scene completion for a robot, according to an embodiment of the present disclosure. In an embodiment, one or more operations of the method 1200B may be performed by or using at least one of the viewpoint module 100, the scene completion system 300, and any of the elements included therein, and any other element described herein.
As shown in FIG. 12B, at operation S1221, the method 1200B may include obtaining an image of an environment of a robot. In an embodiment, this image may correspond to the original RGB-D image
Figure PCTKR2024095121-appb-img-000255
described above. In an embodiment, the robot may include a robotic vacuum cleaner, and the environment may include a room which is to be vacuumed by the robotic vacuum cleaner. In an embodiment, the robot may include a drone device such as a flying drone, and the environment may include a scene including an object which is to be observed or picked up by the drone, or an area in which the drone is to place an object. In an embodiment, the robot may include a robotic arm, and the environment may include a tabletop scene which includes an object to be grasped by the robotic arm. However, the present disclosure is not limited in this regard.
As further shown in FIG. 12B, at operation S1222, the method 1200B may include performing scene completion to obtain a completed 3D representation of the environment of the robot, for example a completed scene point cloud of a scene included in the environment. In an embodiment, the scene completion may correspond to any of the scene completion methods described above.
In an embodiment, the completed 3D representation may include predicted areas which are hidden from view in the original RGB-D image
Figure PCTKR2024095121-appb-img-000256
. For example, the original RGB-D image
Figure PCTKR2024095121-appb-img-000257
may be captured from the perspective of a robotic vacuum cleaner with a limited vertical field of view, and these predicted areas may be an upper portion of the scene which is not visible to the robotic vacuum cleaner. As an example, the original RGB-D image
Figure PCTKR2024095121-appb-img-000258
may be captured from the perspective of a drone device with a limited vertical field of view, and these predicted areas may be a lower portion of the scene which is not visible to the drone device. As an example, the original RGB-D image
Figure PCTKR2024095121-appb-img-000259
may be captured from the perspective of a robotic arm with a limited horizontal field of view, and these predicted areas may be a left and/or right portion of the scene which is not visible to the robotic arm. However, these are provided only as examples, and embodiments are not limited thereto.
As further shown in FIG. 12B, at operation S1223, the method 1200B may include planning a movement of the robot. In an embodiment, planning the movement may include planning a route to be taken by the robotic vacuum cleaner in order to vacuum the room. In an embodiment, planning the movement may include planning a movement to position the robotic arm to grasp the object.
As an example, based on a robot recognizing an object, the robot may determine a new viewpoint or a portion of the new viewpoint based on a desired rotation direction for the robot, and an embodiment described above may be used to generate a 2D image of the new viewpoint.
As yet another example, based on a robot recognizing an object, the robot may determine a portion of a viewpoint that it expects to see based on anticipating another aspect of the recognized object based on the desired rotation direction, and an embodiment described above may be used to generate an image of that portion.
In an embodiment, planning the movement may include planning a movement based on the completed 3D representation.
FIG. 13A is a flowchart illustrating a method 1300A of performing scene completion, according to an embodiment of the present disclosure. In an embodiment, one or more operations of the method 1300A may be performed by or using at least one of the viewpoint module 100, the scene completion system 300, and any of the elements included therein, and any other element described herein.
As shown in FIG. 13A, at operation S1311, the method 1300A may include obtaining an original image of a scene from an original viewpoint corresponding to a first direction, wherein the scene includes an object and a background, wherein a first surface of the object is an image of the object corresponding to the first direction. In an embodiment, the original image may correspond to the RGB-D image
Figure PCTKR2024095121-appb-img-000260
discussed above. In an embodiment, the first surface may correspond to the surface 502 in FIG. 5 and the surface 1011 in FIG. 10A.
As further shown in FIG. 13A, at operation S1312, the method 1300A may include obtaining a first image from a new viewpoint corresponding to a second direction different from the first direction by rotating the original image based on 3D information generated from 2D information which is obtained from the original image. In an embodiment, the 3D information may correspond to the deprojected point cloud discussed above. In an embodiment, the first image may correspond to the incomplete color image
Figure PCTKR2024095121-appb-img-000261
and the incomplete depth image
Figure PCTKR2024095121-appb-img-000262
and the new viewpoint may correspond to the new viewpoint
Figure PCTKR2024095121-appb-img-000263
discussed above.
As further shown in FIG. 13A, at operation S1313, the method 1300A may include determining an area within the first image for generating a second surface of the object based on depth information about a depth between the object and the background of the original image. In an embodiment, the area may correspond to the inappropriate background pixels 508 in FIG. 5 as discussed above. In an embodiment, the area within the first image for generating the second surface of the object may correspond to at least one of the mesh 700 and the mesh 1101 discussed above, and may correspond to the area masked by the SAM module 104, which performs surface-aware masking. In an embodiment, the area within the first image may correspond to at least a portion of the background of the original image. Without masking the area, the area may be inadvertently considered to be included in a surface of the object, and the AI inpainting model may therefore inadvertently use inappropriate background pixels when inpainting the area within the first image, as discussed above with respect to FIGS. 6A to 6C. In an embodiment, this area may be one or more areas as discussed above with respect to FIGS. 10A-10C and FIGS. 11A-11C. In an embodiment, the area may be masked, and the masked area may be provided to the AI inpainting model. Accordingly, the masked area within the first image may be considered to be the surface of the object and may not be considered to be a background area. Therefore, the masked area may be inpainted as a surface of the object.
As further shown in FIG. 13A, at operation S1314, the method 1300A may include generating a second image by inputting the first image and the determined area to an AI inpainting model, wherein the AI inpainting model generates the second surface of the object which occupies a portion of the determined area in the second image. In an embodiment, the second image may correspond to the inpainted image discussed above. In an embodiment, the second surface of the object may correspond to the surface 704 in FIG. 7D. In an embodiment, the area in the second image may correspond to the second surface of the object. In an embodiment, the second image and the second surface of the object may correspond to a second direction different from the first direction.
FIG. 13B is a flowchart illustrating a method 1300B of performing scene completion, according to an embodiment of the present disclosure. In an embodiment, one or more operations of the method 1300B may be performed by or using at least one of the viewpoint module 100, the scene completion system 300, and any of the elements included therein, and any other element described herein.
As shown in FIG. 13B, at operation S1321, the method 1300B may include obtaining an original image from an original viewpoint corresponding to a first direction, wherein the scene includes an object and a background, wherein a first surface of the object is an image of the object corresponding to the first direction. In an embodiment, the original image may correspond to the RGB-D image (symbol: Figure PCTKR2024095121-appb-img-000264) discussed above. In an embodiment, the first surface may correspond to the surface 502 in Fig. 5 and the surface 1011 in Fig. 10A as discussed above.
As further shown in FIG. 13B, at operation S1322, the method 1300B may include determining an area for generating a second surface of the object based on depth information about a depth between the object and the background of the original image. In an embodiment, the area may correspond to the inappropriate background pixels 508 in Fig. 5 as discussed above. In an embodiment, the area within the first image for generating the second surface of the object may correspond to at least one of the mesh 700 and the mesh 1100 and may correspond to the masking area produced by the SAM module 104, which performs surface-aware masking. In an embodiment, the area may correspond to at least a portion of the background of the original image. Without masking the area, the area may be inadvertently considered to be included in a surface of the object, and the AI inpainting model may therefore inadvertently use inappropriate background pixels when performing inpainting on the area, as discussed above with respect to FIGS. 6A to 6C. In an embodiment, this area may be one or more areas as discussed above with respect to FIGS. 10A-10C and FIGS. 11A-11C. In an embodiment, the area may be masked, and the masked area may be provided to the AI inpainting model. Accordingly, the masked area may be considered to be the surface of the object and may not be considered to be a background area. Therefore, the masked area may be inpainted as a surface of the object.
As further shown in FIG. 13B, at operation S1323, the method 1300B may include obtaining a first image from a new viewpoint corresponding to a second direction different from the first direction by rotating the original image based on 3D information generated from 2D information which is obtained from the original image. In an embodiment, the 3D information may correspond to the deprojected point cloud discussed above. In an embodiment, the first image may correspond to the (symbol: Figure PCTKR2024095121-appb-img-000265) and the incomplete depth image (symbol: Figure PCTKR2024095121-appb-img-000266), and the new viewpoint may correspond to the new viewpoint (symbol: Figure PCTKR2024095121-appb-img-000267) discussed above.
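As a non-limiting illustration, the deprojection of a depth image into 3D information and the rendering of an incomplete depth image from a new viewpoint may be sketched as follows. The sketch assumes a pinhole camera model; the function names, the intrinsic parameters (fx, fy, cx, cy), and the convention that a depth of 0 marks a pixel without depth are assumptions made for illustration and are not taken from the disclosure.

```python
import numpy as np

def deproject(depth, fx, fy, cx, cy):
    """Lift a depth image (H, W) to an (N, 3) point cloud in camera coordinates."""
    v, u = np.nonzero(depth > 0)          # only pixels that carry a depth value
    z = depth[v, u]
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)

def render_depth(points, rotation, translation, fx, fy, cx, cy, shape):
    """Project points into a new view; 0 marks pixels with no depth (holes)."""
    h, w = shape
    p = points @ rotation.T + translation  # move points into the new camera frame
    p = p[p[:, 2] > 0]                     # keep points in front of the camera
    u = np.round(p[:, 0] * fx / p[:, 2] + cx).astype(int)
    v = np.round(p[:, 1] * fy / p[:, 2] + cy).astype(int)
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    u, v, z = u[inside], v[inside], p[inside, 2]
    depth = np.full((h, w), np.inf)
    np.minimum.at(depth, (v, u), z)        # z-buffer: nearest point wins per pixel
    depth[np.isinf(depth)] = 0.0           # pixels never hit remain holes
    return depth
```

Pixels left at 0 in the rendered result are the holes that make the image from the new viewpoint incomplete.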
As further shown in FIG. 13B, at operation S1324, the method 1300B may include generating a second image by inputting the first image and the determined area to an AI inpainting model, wherein the AI inpainting model generates the second surface of the object which occupies a portion of the determined area in the second image. In an embodiment, the second image may correspond to the inpainted image discussed above. In an embodiment, the second surface of the object may correspond to the surface 704 in Fig. 7D. In an embodiment, the area in the second image may correspond to the second surface of the object. In an embodiment, the second image and the second surface of the object may correspond to a second direction different from the first direction.
FIG. 14 is a diagram of devices for performing a scene completion task according to an embodiment. FIG. 14 includes a user device 1410, a server 1420, and a communication network 1430. The user device 1410 and the server 1420 may interconnect via wired connections, wireless connections, or a combination of wired and wireless connections.
The user device 1410 may include one or more devices (e.g., a processor 1411 and a data storage 1412) configured to retrieve an image corresponding to a search query. For example, the user device 1410 may include a computing device (e.g., a desktop computer, a laptop computer, a tablet computer, a handheld computer, a smart speaker, a server, etc.), a mobile phone (e.g., a smart phone, a radiotelephone, etc.), a camera device, a wearable device (e.g., a pair of smart glasses, a smart watch, etc.), a home appliance (e.g., a robot vacuum cleaner, a smart refrigerator, etc.), or a similar device. The data storage 1412 of the user device 1410 may include one or more of the viewpoint module 100 and the scene completion system 300, or any of the elements included therein. Alternatively, the user device 1410 may store one or more of the viewpoint module 100 and the scene completion system 300, or any of the elements included therein, or vice versa.
The server 1420 may include one or more devices (e.g., a processor 1421 and a data storage 1422) configured to implement one or more of the viewpoint module 100 and the scene completion system 300, or any of the elements included therein. The data storage 1422 of the server 1420 may include one or more of the viewpoint module 100 and the scene completion system 300, or any of the elements included therein. Alternatively, the user device 1410 may store one or more of the viewpoint module 100 and the scene completion system 300, or any of the elements included therein.
The communication network 1430 may include one or more wired and/or wireless networks. For example, the communication network 1430 may include a cellular network, a public land mobile network (PLMN), a local area network (LAN), a wide area network (WAN), a metropolitan area network (MAN), a telephone network (e.g., the Public Switched Telephone Network (PSTN)), a private network, an ad hoc network, an intranet, the Internet, a fiber optic-based network, or the like, and/or a combination of these or other types of networks.
The number and arrangement of devices and networks shown in FIG. 14 are provided as an example. In practice, there may be additional devices and/or networks, fewer devices and/or networks, different devices and/or networks, or differently arranged devices and/or networks than those shown in FIG. 14. Furthermore, two or more devices shown in FIG. 14 may be implemented within a single device, or a single device shown in FIG. 14 may be implemented as multiple, distributed devices. Additionally, or alternatively, a set of devices (e.g., one or more devices) may perform one or more functions described as being performed by another set of devices.
FIG. 15 is a diagram of components of one or more electronic devices of FIG. 14 according to an embodiment. An electronic device 1500 in FIG. 15 may correspond to the user device 1410 and/or the server 1420.
FIG. 15 is for illustration only, and other embodiments of the electronic device 1500 could be used without departing from the scope of this disclosure. For example, the electronic device 1500 may correspond to a client device or a server.
The electronic device 1500 includes a bus 1510, a processor 1520, a memory 1530, an interface 1540, and a display 1550.
The bus 1510 includes a circuit for connecting the components 1520 to 1550 with one another. The bus 1510 functions as a communication system for transferring data between the components 1520 to 1550 or between electronic devices. For example, the bus 1510 may be a communication bus, a cross-over bar, a network, or the like. Although the bus 1510 is depicted as a single line in FIG. 15, the bus 1510 may be implemented using multiple (e.g., two or more) connections between the set of components of the electronic device 1500. The present disclosure is not limited in this regard.
The processor 1520 includes one or more of a central processing unit (CPU), a graphics processing unit (GPU), an accelerated processing unit (APU), a many integrated core (MIC), a field-programmable gate array (FPGA), or a digital signal processor (DSP). The processor 1520 is able to perform control of any one or any combination of the other components of the electronic device 1500, and/or perform an operation or data processing relating to communication. For example, the processor 1520 may perform the methods discussed above. The processor 1520 executes one or more programs stored in the memory 1530.
The memory 1530 may include a volatile and/or non-volatile memory. In an embodiment, the memory 1530 may include volatile memory such as, but not limited to, random access memory (RAM), static RAM (SRAM), dynamic RAM (DRAM), and the like. In an embodiment, the memory 1530 may include non-volatile memory such as, but not limited to, read only memory (ROM), electrically erasable programmable ROM (EEPROM), NAND flash memory, phase-change RAM (PRAM), magnetic RAM (MRAM), resistive RAM (RRAM), ferroelectric RAM (FRAM), magnetic memory, optical memory, and the like. However, the present disclosure is not limited in this regard, and the memory 1530 may include other types of dynamic and/or static memory storage. In an embodiment, the memory 1530 may store information and/or instructions for use (e.g., execution) by the processor 1520. The memory 1530 stores information, such as one or more of commands, data, programs (one or more instructions), application(s) 1534, etc., which are related to at least one other component of the electronic device 1500 and for driving and controlling the electronic device 1500. For example, commands and/or data may formulate an operating system (OS) 1532. Information stored in the memory 1530 may be executed by the processor 1520.
The application(s) 1534 may include the above-discussed embodiments. These functions can be performed by a single application or by multiple applications that each carry out one or more of these functions. For example, the application(s) 1534 may include an artificial intelligence (AI) model for performing the methods discussed above.
The display 1550 may include, for example, a liquid crystal display (LCD), a light emitting diode (LED) display, an organic light emitting diode (OLED) display, a quantum-dot light emitting diode (QLED) display, a microelectromechanical systems (MEMS) display, or an electronic paper display. The display 1550 can also be a depth-aware display, such as a multi-focal display. The display 1550 is able to present, for example, various contents, such as text, images, videos, icons, and symbols.
The interface 1540 may include input/output (I/O) interface 1542, communication interface 1544, and/or one or more sensors 1546. The I/O interface 1542 serves as an interface that can, for example, transfer commands and/or data between a user and/or other external devices and other component(s) of the electronic device 1500.
The communication interface 1544 may enable communication between the electronic device 1500 and other external devices, via a wired connection, a wireless connection, or a combination of wired and wireless connections. The communication interface 1544 may permit the electronic device 1500 to obtain information from another device and/or provide information to another device. For example, the communication interface 1544 may include an Ethernet interface, an optical interface, a coaxial interface, an infrared interface, a radio frequency (RF) interface, a universal serial bus (USB) interface, a Wi-Fi interface, a cellular network interface, or the like. The communication interface 1544 may obtain videos and/or video frames from an external device, such as a server.
The sensor(s) 1546 of the interface 1540 can meter a physical quantity or detect an activation state of the electronic device 1500 and convert metered or detected information into an electrical signal. For example, the sensor(s) 1546 can include one or more cameras or other imaging sensors for capturing images of scenes. The sensor(s) 1546 can also include any one or any combination of a microphone, a keyboard, a mouse, and one or more buttons for touch input. The sensor(s) 1546 can further include an inertial measurement unit. The sensor(s) 1546 can include a control circuit for controlling at least one of the sensors included herein. Any of these sensor(s) 1546 can be located within or coupled to the electronic device 1500. The sensor(s) 1546 may obtain a text and/or a voice signal that contains one or more queries.
The scene completion processes and methods described above may be written as computer-executable programs or instructions that may be stored in a medium.
The medium may continuously store the computer-executable programs or instructions, or temporarily store the computer-executable programs or instructions for execution or downloading. Also, the medium may be any one of various recording media or storage media in which a single piece or plurality of pieces of hardware are combined, and the medium is not limited to a medium directly connected to electronic device 1500, but may be distributed on a network. Examples of the medium include magnetic media, such as a hard disk, a floppy disk, and a magnetic tape, optical recording media, such as CD-ROM and DVD, magneto-optical media such as a floptical disk, and ROM, RAM, and a flash memory, which are configured to store program instructions. Other examples of the medium include recording media and storage media managed by application stores distributing applications or by websites, servers, and the like supplying or distributing other various types of software.
The scene completion methods and processes may be provided in a form of downloadable software. A computer program product may include a product (for example, a downloadable application) in a form of a software program electronically distributed through a manufacturer or an electronic market. For electronic distribution, at least a part of the software program may be stored in a storage medium or may be temporarily generated.
The foregoing disclosure provides illustration and description, but is not intended to be exhaustive or to limit the implementation to the precise form disclosed. Modifications and variations are possible in light of the above disclosure or may be acquired from practice of the implementation.
It will be apparent that systems and/or methods, described herein, may be implemented in different forms of hardware, firmware, or a combination of hardware and software. The actual specialized control hardware or software code used to implement these systems and/or methods is not limiting of the implementations. Thus, the operation and behavior of the systems and/or methods were described herein without reference to specific software code, it being understood that software and hardware may be designed to implement the systems and/or methods based on the description herein.
Even though particular combinations of features are recited in the claims and/or disclosed in the specification, these combinations are not intended to limit the disclosure of possible implementations. In fact, many of these features may be combined in ways not specifically recited in the claims and/or disclosed in the specification. Although each dependent claim listed below may directly depend on only one claim, the disclosure of possible implementations includes each dependent claim in combination with every other claim in the claim set.
A model related to the neural networks described above may be implemented via a software module. When the model is implemented via a software module (for example, a program module including instructions), the model may be stored in a computer-readable recording medium.
Also, the model may be a part of the electronic device 1500 described above by being integrated in a form of a hardware chip. For example, the model may be manufactured in a form of a dedicated hardware chip for artificial intelligence, or may be manufactured as a part of an existing general-purpose processor (for example, a CPU or application processor) or a graphic-dedicated processor (for example, a GPU).
Also, the model may be provided in a form of downloadable software. A computer program product may include a product (for example, a downloadable application) in a form of a software program electronically distributed through a manufacturer or an electronic market. For electronic distribution, at least a part of the software program may be stored in a storage medium or may be temporarily generated. In this case, the storage medium may be a server of the manufacturer or electronic market, or a storage medium of a relay server.
According to an aspect of the present disclosure, a method may include rendering an incomplete color image and an incomplete depth image corresponding to the new viewpoint based on the 3D information. The method may include masking a portion of the incomplete color image based on the 3D information and the incomplete depth image to obtain a masked color image, wherein the masked portion of the incomplete color image corresponds to the determined area and indicates that the masked portion of the incomplete color image is obscured by the object when the scene is viewed from the new viewpoint. The method may include inpainting the masked color image to obtain the second image.
According to an embodiment of the disclosure, the obtaining of the second image may include inpainting the masked color image based on the AI inpainting model to obtain the second image.
According to an embodiment of the disclosure, the method may include obtaining an image caption by providing the second image to an AI caption model. The method may include determining whether to re-inpaint the second image by comparing an embedding of the image caption and an embedding of the prompt.
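As a non-limiting illustration, the comparison between the embedding of the image caption and the embedding of the prompt may be sketched as a similarity check. The use of cosine similarity, the function name, and the threshold value of 0.8 are assumptions made for illustration; the disclosure does not specify the comparison metric or any threshold.

```python
import numpy as np

def should_reinpaint(caption_embedding, prompt_embedding, threshold=0.8):
    """Return True when the caption drifts too far from the prompt.

    Cosine similarity and the default threshold are illustrative assumptions.
    """
    a = np.asarray(caption_embedding, dtype=float)
    b = np.asarray(prompt_embedding, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        return True                      # degenerate embedding: treat as a mismatch
    similarity = float(a @ b / denom)
    return similarity < threshold
```

In such a sketch, a result of True would trigger another inpainting pass on the second image.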
According to an embodiment of the disclosure, the method may include masking a portion of the incomplete depth image based on the 3D information and the incomplete depth image to obtain a masked depth image, wherein the masked portion of the incomplete depth image corresponds to the determined area. The method may include providing the second image to an AI depth estimation model. The method may include generating an estimated depth image based on the masked depth image and an output of the AI depth estimation model. The method may include generating a completed 3D representation based on the second image and the estimated depth image.
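As a non-limiting illustration, one simple way to generate an estimated depth image from the masked depth image and a model output is sketched below. The sketch assumes that the AI depth estimation model outputs a dense relative depth map, which is aligned to the known depths by a least-squares scale and shift; this is a stand-in technique chosen for brevity and differs from embodiments in which the model outputs normals and occlusion boundaries. The function name and the convention that a depth of 0 marks a masked or missing pixel are likewise assumptions.

```python
import numpy as np

def fuse_depth(masked_depth, predicted_depth):
    """Fill masked pixels of a depth image using an aligned model prediction."""
    masked_depth = np.asarray(masked_depth, dtype=float)
    predicted_depth = np.asarray(predicted_depth, dtype=float)
    known = masked_depth > 0                       # pixels with trusted depth
    if known.sum() < 2:
        return predicted_depth.copy()              # not enough data to align
    # Solve masked ~= scale * predicted + shift over the known pixels.
    A = np.stack([predicted_depth[known], np.ones(known.sum())], axis=1)
    (scale, shift), *_ = np.linalg.lstsq(A, masked_depth[known], rcond=None)
    fused = scale * predicted_depth + shift
    fused[known] = masked_depth[known]             # keep trusted depth unchanged
    return fused
```

Keeping the known pixels unchanged means that only the masked region is replaced by estimated values.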
According to an embodiment of the disclosure, the generating of the estimated depth image may include obtaining at least one estimated normal and at least one estimated occlusion boundary by providing the second image to the AI depth estimation model. The generating of the estimated depth image may include obtaining the estimated depth image based on the incomplete depth image, the at least one estimated normal, and the at least one estimated occlusion boundary.
According to an embodiment of the disclosure, the method may include rendering a plurality of incomplete color images and a plurality of incomplete depth images from a plurality of new viewpoints based on the 3D information. The method may include masking the plurality of incomplete color images to obtain a plurality of masked color images, and masking the plurality of incomplete depth images to obtain a plurality of masked depth images. The method may include obtaining a plurality of second images by providing the plurality of masked color images to the AI inpainting model. The method may include providing the plurality of second images to the AI depth estimation model. The method may include obtaining a plurality of estimated depth images based on the plurality of masked depth images and a plurality of outputs of the AI depth estimation model, wherein the completed 3D representation is generated based on the plurality of second images and the plurality of estimated depth images.
According to an embodiment of the disclosure, the generating of the completed 3D representation may include generating a plurality of estimated point clouds based on the second image, the estimated depth image, the plurality of second images, and the plurality of estimated depth images. The method may include merging the plurality of estimated point clouds by discarding points which are not included in at least two estimated point clouds from among the plurality of estimated point clouds to obtain a completed scene point cloud representing the scene.
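As a non-limiting illustration, the merging of estimated point clouds by discarding points which are not included in at least two estimated point clouds may be sketched with a voxel grid. Treating two points as "the same" when they fall into the same voxel, the voxel size of 0.05, and the function name are assumptions made for illustration; the disclosure does not specify how inclusion in multiple point clouds is tested.

```python
import numpy as np

def merge_point_clouds(clouds, voxel_size=0.05, min_views=2):
    """Keep only points whose voxel is occupied in at least `min_views` clouds."""
    # One set of occupied voxels per cloud, so each cloud votes once per voxel.
    keys_per_cloud = [
        np.unique(np.floor(c / voxel_size).astype(np.int64), axis=0)
        for c in clouds
    ]
    all_keys = np.concatenate(keys_per_cloud, axis=0)
    unique_keys, counts = np.unique(all_keys, axis=0, return_counts=True)
    supported = {tuple(k) for k in unique_keys[counts >= min_views]}

    kept = []
    for c in clouds:
        keys = np.floor(c / voxel_size).astype(np.int64)
        keep = np.array([tuple(k) in supported for k in keys], dtype=bool)
        kept.append(c[keep])
    return np.concatenate(kept, axis=0)
```

Points generated by only one inpainted view have no support from another view in this scheme and are therefore dropped from the completed scene point cloud.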
According to an embodiment of the disclosure, the masking may include generating a plurality of points which extend beyond a surface included in the original image. The masking may include generating a mesh based on the plurality of points. The masking may include rendering a depth map representing the mesh from the new viewpoint. The masking may include generating a mask based on a comparison between the incomplete depth image and the depth map. The masking may include applying the mask to the incomplete color image.
According to an embodiment of the disclosure, the mask may indicate a plurality of pixels which are not used for generating the second image. The plurality of pixels may include a first plurality of pixels for which a depth is not indicated by the incomplete depth image, and a second plurality of pixels for which a depth indicated by the incomplete depth image is greater than a depth indicated by the depth map.
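As a non-limiting illustration, the generation of the mask from the comparison between the incomplete depth image and the depth map rendered from the mesh, and the application of the mask to the incomplete color image, may be sketched as follows. The function names, the convention that a depth of 0 marks a pixel without depth in either image, and the fill value are assumptions made for illustration.

```python
import numpy as np

def surface_aware_mask(incomplete_depth, mesh_depth):
    """Boolean mask, True for pixels that are not used for generating the second image."""
    missing = incomplete_depth <= 0                 # no depth in the incomplete image
    mesh_valid = mesh_depth > 0                     # the mesh covers this pixel
    behind_mesh = ~missing & mesh_valid & (incomplete_depth > mesh_depth)
    return missing | behind_mesh

def apply_mask(color, mask, fill=0):
    """Return a copy of an (H, W, 3) color image with masked pixels set to `fill`."""
    masked = color.copy()
    masked[mask] = fill
    return masked
```

The two terms of the mask correspond to the first plurality of pixels (no depth indicated) and the second plurality of pixels (depth greater than the depth indicated by the depth map), respectively.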
According to an embodiment of the disclosure, the original image may be captured by at least one of an augmented reality (AR) device and a virtual reality (VR) device. The original viewpoint may comprise a current viewpoint of a user, and the original image may correspond to a current AR/VR image displayed to the user. The method may include obtaining a completed 3D representation of the scene based on the second image. The method may include obtaining a potential AR/VR image based on the completed 3D representation, wherein the potential AR/VR image corresponds to a potential viewpoint of the user. The method may include, based on the user moving from a position corresponding to the current viewpoint to a position corresponding to the potential viewpoint, displaying a transition between the current AR/VR image and the potential AR/VR image to the user.
According to an embodiment of the disclosure, the original image may be captured by a robot. The method may include planning a movement path for the robot based on the second image.
According to an embodiment of the disclosure, an electronic device may include at least one processor configured to execute the instructions to render an incomplete color image and an incomplete depth image corresponding to the new viewpoint based on the 3D information. The electronic device may include at least one processor configured to execute the instructions to mask a portion of the incomplete color image based on the 3D information and the incomplete depth image to obtain a masked color image, wherein the masked portion of the incomplete color image corresponds to the determined area and indicates that the masked portion of the incomplete color image is obscured by the object when the scene is viewed from the new viewpoint. The electronic device may include at least one processor configured to execute the instructions to inpaint the masked color image to obtain the second image.
According to an embodiment of the disclosure, the electronic device, to inpaint the masked color image, may include at least one processor configured to execute the instructions to inpaint the masked color image based on the AI inpainting model to obtain the second image.
According to an embodiment of the disclosure, the electronic device may include at least one processor configured to execute the instructions to obtain an image caption by providing the second image to an AI caption model. The electronic device may include at least one processor configured to execute the instructions to determine whether to re-inpaint the second image by comparing an embedding of the image caption and an embedding of the prompt.
According to an embodiment of the disclosure, the electronic device may include at least one processor configured to execute the instructions to mask a portion of the incomplete depth image based on the 3D information and the incomplete depth image to obtain a masked depth image, wherein the masked portion of the incomplete depth image corresponds to the determined area. The electronic device may include at least one processor configured to execute the instructions to provide the second image to an AI depth estimation model. The electronic device may include at least one processor configured to execute the instructions to generate an estimated depth image based on the masked depth image and an output of the AI depth estimation model. The electronic device may include at least one processor configured to execute the instructions to generate a completed 3D representation based on the second image and the estimated depth image.
According to an embodiment of the disclosure, the electronic device, to generate the estimated depth image, may include at least one processor configured to execute the instructions to obtain at least one estimated normal and at least one estimated occlusion boundary by providing the second image to the AI depth estimation model. The electronic device, to generate the estimated depth image, may include at least one processor configured to execute the instructions to obtain the estimated depth image based on the incomplete depth image, the at least one estimated normal, and the at least one estimated occlusion boundary.
According to an embodiment of the disclosure, the electronic device may include at least one processor configured to execute the instructions to render a plurality of incomplete color images and a plurality of incomplete depth images from a plurality of new viewpoints based on the 3D information. The electronic device may include at least one processor configured to execute the instructions to mask the plurality of incomplete color images to obtain a plurality of masked color images, and to mask the plurality of incomplete depth images to obtain a plurality of masked depth images. The electronic device may include at least one processor configured to execute the instructions to obtain a plurality of second images by providing the plurality of masked color images to the AI inpainting model. The electronic device may include at least one processor configured to execute the instructions to provide the plurality of second images to the AI depth estimation model. The electronic device may include at least one processor configured to execute the instructions to obtain a plurality of estimated depth images based on the plurality of masked depth images and a plurality of outputs of the AI depth estimation model. The completed 3D representation may be generated based on the plurality of second images and the plurality of estimated depth images.
According to an embodiment of the disclosure, the electronic device, to generate the completed 3D representation, may include at least one processor configured to execute the instructions to generate a plurality of estimated point clouds based on the second image, the estimated depth image, the plurality of second images, and the plurality of estimated depth images. The electronic device, to generate the completed 3D representation, may include at least one processor configured to execute the instructions to merge the plurality of estimated point clouds by discarding points which are not included in at least two estimated point clouds from among the plurality of estimated point clouds.
According to an embodiment of the disclosure, the electronic device, to mask the incomplete color image, may include at least one processor configured to execute the instructions to generate a plurality of points which extend beyond a surface included in the original image. The electronic device, to mask the incomplete color image, may include at least one processor configured to execute the instructions to generate a mesh based on the plurality of points. The electronic device, to mask the incomplete color image, may include at least one processor configured to execute the instructions to render a depth map representing the mesh from the new viewpoint. The electronic device, to mask the incomplete color image, may include at least one processor configured to execute the instructions to generate a mask based on a comparison between the incomplete depth image and the depth map. The electronic device, to mask the incomplete color image, may include at least one processor configured to execute the instructions to apply the mask to the incomplete color image.
According to an embodiment of the disclosure, the mask may indicate a plurality of pixels which are not used for generating the second image. The plurality of pixels may include a first plurality of pixels for which a depth is not indicated by the incomplete depth image, and a second plurality of pixels for which a depth indicated by the incomplete depth image is greater than a depth indicated by the depth map.
According to an embodiment of the disclosure, the original image may be captured by at least one of an augmented reality (AR) device and a virtual reality (VR) device. The original viewpoint may comprise a current viewpoint of a user, and the original image may correspond to a current AR/VR image displayed to the user. The electronic device may include at least one processor which is configured to execute the instructions to obtain a completed 3D representation of the scene based on the second image. The electronic device may include at least one processor which is configured to execute the instructions to obtain a potential AR/VR image based on the completed 3D representation, wherein the potential AR/VR image corresponds to a potential viewpoint of the user. The electronic device may include at least one processor which is configured to execute the instructions, based on the user moving from a position corresponding to the current viewpoint to a position corresponding to the potential viewpoint, to display a transition between the current AR/VR image and the potential AR/VR image to the user.
According to an embodiment of the disclosure, the original image may be captured by a robot. The electronic device may include at least one processor which is configured to execute the instructions to plan a movement path for the robot based on the second image.
While the embodiments of the disclosure have been described with reference to the figures, it will be understood by those of ordinary skill in the art that various changes in form and details may be made therein without departing from the spirit and scope as defined by the following claims.

Claims (15)

  1. A method for processing image data for scene completion comprising:
    obtaining an original image from an original viewpoint corresponding to a first direction, wherein the original image includes an object and a background, wherein a first surface of the object is an image of the object corresponding to the first direction;
    obtaining a first image from a new viewpoint corresponding to a second direction different from the first direction by rotating the original image based on 3-dimensional (3D) information generated from 2-dimensional (2D) information which is obtained from the original image;
    determining an area within the first image for generating a second surface of the object based on depth information about a depth between the object and the background of the original image; and
    obtaining a second image by inputting the first image and the determined area to an artificial intelligence (AI) inpainting model, wherein the AI inpainting model generates the second surface of the object which occupies a portion of the determined area in the second image.
  2. The method of claim 1, further comprising:
    rendering an incomplete color image and an incomplete depth image corresponding to the new viewpoint based on the 3D information;
    masking a portion of the incomplete color image based on the 3D information and the incomplete depth image to obtain a masked color image, wherein the masked portion of the incomplete color image corresponds to the determined area and indicates that the masked portion of the incomplete color image is obscured by the object when the scene is viewed from the new viewpoint; and
    inpainting the masked color image to obtain the second image.
  3. The method of any one of claims 1 to 2, wherein the obtaining the second image comprises:
    inpainting the masked color image based on the AI inpainting model to obtain the second image.
  4. The method of any one of claims 1 to 3, further comprising:
    obtaining an image caption by providing the second image to an AI caption model; and
    determining whether to re-inpaint the second image by comparing an embedding of the image caption and an embedding of the prompt.
  5. The method of any one of claims 1 to 4, further comprising:
    masking a portion of the incomplete depth image based on the 3D information and the incomplete depth image to obtain a masked depth image, wherein the masked portion of the incomplete depth image corresponds to the determined area;
    providing the second image to an AI depth estimation model;
    generating an estimated depth image based on the masked depth image and an output of the AI depth estimation model; and
    generating a completed 3D representation based on the second image and the estimated depth image.
  6. The method of any one of claims 1 to 5, wherein the masking comprises:
    generating a plurality of points which extend beyond a surface included in the original image;
    generating a mesh based on the plurality of points;
    rendering a depth map representing the mesh from the new viewpoint;
    generating a mask based on a comparison between the incomplete depth image and the depth map; and
    applying the mask to the incomplete color image.
  7. The method of any one of claims 1 to 6, wherein the mask indicates a plurality of pixels which are not used for generating the second image, and
    wherein the plurality of pixels includes a first plurality of pixels for which a depth is not indicated by the incomplete depth image, and a second plurality of pixels for which a depth indicated by the incomplete depth image is greater than a depth indicated by the depth map.
  8. An electronic device for processing image data for scene completion, the electronic device comprising:
    at least one memory configured to store instructions; and
    at least one processor configured to execute the instructions to:
    obtain an original image from an original viewpoint corresponding to a first direction, wherein the original image includes an object and a background, wherein a first surface of the object is an image of the object corresponding to the first direction,
    obtain a first image from a new viewpoint corresponding to a second direction different from the first direction by rotating the original image based on 3-dimensional (3D) information generated based on 2-dimensional information which is obtained from the original image,
    determine an area within the first image for generating a second surface of the object based on depth information about a depth between the object and the background of the original image; and
    obtain a second image by inputting the first image and the determined area to an artificial intelligence (AI) inpainting model, wherein the AI inpainting model generates the second surface of the object which occupies a portion of the determined area in the second image.
  9. The electronic device of claim 8, wherein the at least one processor is further configured to execute the instructions to:
    render an incomplete color image and an incomplete depth image corresponding to the new viewpoint based on the 3D information,
    mask a portion of the incomplete color image based on the 3D information and the incomplete depth image to obtain a masked color image, wherein the masked portion of the incomplete color image corresponds to the determined area and indicates that the masked portion of the incomplete color image is obscured by the object when the scene is viewed from the new viewpoint, and
    inpaint the masked color image to obtain the second image.
  10. The electronic device of any one of claims 8 to 9, wherein to inpaint the masked color image, the at least one processor is further configured to execute the instructions to:
    inpaint the masked color image based on the AI inpainting model to obtain the second image.
  11. The electronic device of any one of claims 8 to 10, wherein the at least one processor is further configured to execute the instructions to:
    obtain an image caption by providing the second image to an AI caption model; and
    determine whether to re-inpaint the second image by comparing an embedding of the image caption and an embedding of the prompt.
  12. The electronic device of any one of claims 8 to 11, wherein the at least one processor is further configured to execute the instructions to:
    mask a portion of the incomplete depth image based on the 3D information and the incomplete depth image to obtain a masked depth image, wherein the masked portion of the incomplete depth image corresponds to the determined area;
    provide the second image to an AI depth estimation model;
    generate an estimated depth image based on the masked depth image and an output of the AI depth estimation model; and
    generate a completed 3D representation based on the second image and the estimated depth image.
  13. The electronic device of any one of claims 8 to 12, wherein to mask the incomplete color image, the at least one processor is further configured to execute the instructions to:
    generate a plurality of points which extend beyond a surface included in the original image;
    generate a mesh based on the plurality of points;
    render a depth map representing the mesh from the new viewpoint;
    generate a mask based on a comparison between the incomplete depth image and the depth map; and
    apply the mask to the incomplete color image.
  14. The electronic device of any one of claims 8 to 13, wherein the mask indicates a plurality of pixels which are not used for generating the second image, and
    wherein the plurality of pixels includes a first plurality of pixels for which a depth is not indicated by the incomplete depth image, and a second plurality of pixels for which a depth indicated by the incomplete depth image is greater than a depth indicated by the depth map.
  15. A computer-readable medium configured to store instructions which, when executed by at least one processor of a device, cause the at least one processor to perform the method of any one of claims 1 to 7.
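
For illustration only, the mask construction recited in claims 6 and 7 and the re-inpainting check recited in claim 4 may be sketched as follows. The sketch assumes several things the claims do not state: depth images are NumPy arrays, a pixel with no indicated depth is stored as NaN, the rendered mesh depth map holds +inf where the mesh is not visible, the caption and prompt embeddings are already available as vectors, and the embeddings are compared by cosine similarity against a hypothetical threshold of 0.5. The function names are likewise hypothetical.

```python
import numpy as np


def build_inpaint_mask(incomplete_depth, mesh_depth):
    """Return a boolean mask that is True for pixels to be masked.

    A pixel is masked when (a) the incomplete depth image indicates no
    depth for it (assumed here to be encoded as NaN or inf), or (b) the
    depth it indicates is greater than the depth of the rendered mesh.
    """
    missing = ~np.isfinite(incomplete_depth)
    # Replace missing values so the comparison below never touches NaN.
    safe_depth = np.where(missing, 0.0, incomplete_depth)
    behind_mesh = safe_depth > mesh_depth
    return missing | behind_mesh


def apply_mask(color, mask, fill=0):
    """Return a copy of an (H, W, C) color image with masked pixels set to fill."""
    masked = color.copy()
    masked[mask] = fill
    return masked


def should_reinpaint(caption_embedding, prompt_embedding, threshold=0.5):
    """Return True when the caption and prompt embeddings disagree.

    Cosine similarity and the threshold value are assumptions; the claims
    only recite comparing the two embeddings.
    """
    a = np.asarray(caption_embedding, dtype=float)
    b = np.asarray(prompt_embedding, dtype=float)
    denom = max(np.linalg.norm(a) * np.linalg.norm(b), 1e-12)
    similarity = float(np.dot(a, b) / denom)
    return similarity < threshold


if __name__ == "__main__":
    depth = np.array([[1.0, np.nan], [3.0, 2.0]])
    mesh = np.array([[2.0, 2.0], [2.0, np.inf]])
    mask = build_inpaint_mask(depth, mesh)
    image = np.full((2, 2, 3), 255, dtype=np.uint8)
    print(mask)
    print(apply_mask(image, mask)[..., 0])
```

In this sketch the masked color image returned by `apply_mask`, together with the mask itself, would be what is handed to an inpainting model. How that model consumes the mask is outside the scope of the example.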
PCT/KR2024/095121 2023-03-14 2024-02-14 Method and apparatus for processing an image Ceased WO2024191234A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP24771233.4A EP4599405A4 (en) 2023-03-14 2024-02-14 METHOD AND DEVICE FOR PROCESSING AN IMAGE
CN202480007064.6A CN120677505A (en) 2023-03-14 2024-02-14 Method and apparatus for processing image

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US202363452059P 2023-03-14 2023-03-14
US63/452,059 2023-03-14
US18/400,889 US20240312166A1 (en) 2023-03-14 2023-12-29 Rotation, inpainting and completion for generalizable scene completion
US18/400,889 2023-12-29

Publications (1)

Publication Number Publication Date
WO2024191234A1 true WO2024191234A1 (en) 2024-09-19

Family

ID=92714507

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/KR2024/095121 Ceased WO2024191234A1 (en) 2023-03-14 2024-02-14 Method and apparatus for processing an image

Country Status (4)

Country Link
US (1) US20240312166A1 (en)
EP (1) EP4599405A4 (en)
CN (1) CN120677505A (en)
WO (1) WO2024191234A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US12608879B2 (en) * 2022-12-12 2026-04-21 Adobe Inc. Generation of a 360-degree object view by leveraging available images on an online platform
US12592030B2 (en) 2023-08-17 2026-03-31 Adobe Inc. Interactive three-dimension aware text-to-image generation
US20250117995A1 (en) * 2023-10-05 2025-04-10 Adobe Inc. Image and depth map generation using a conditional machine learning

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20140021766A (en) * 2012-08-10 2014-02-20 광운대학교 산학협력단 A boundary noise removal and hole filling method for virtual viewpoint image generation
KR20200063367A (en) * 2018-11-23 2020-06-05 네이버웹툰 주식회사 Method and apparatus of converting 3d video image from video image using deep learning
US20200410746A1 (en) * 2019-06-27 2020-12-31 Electronics And Telecommunications Research Institute Method and apparatus for generating 3d virtual viewpoint image
WO2021042134A1 (en) * 2019-08-28 2021-03-04 Snap Inc. Generating 3d data in a messaging system
KR20220140402A (en) * 2021-04-08 2022-10-18 구글 엘엘씨 Neural blending for new view synthesis

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10726560B2 (en) * 2014-10-31 2020-07-28 Fyusion, Inc. Real-time mobile device capture and generation of art-styled AR/VR content
WO2023014368A1 (en) * 2021-08-05 2023-02-09 Google Llc Single image 3d photography with soft-layering and depth-aware inpainting

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP4599405A4 *

Also Published As

Publication number Publication date
CN120677505A (en) 2025-09-19
US20240312166A1 (en) 2024-09-19
EP4599405A1 (en) 2025-08-13
EP4599405A4 (en) 2026-01-07

Similar Documents

Publication Publication Date Title
WO2024191234A1 (en) Method and apparatus for processing an image
US10732725B2 (en) Method and apparatus of interactive display based on gesture recognition
CN110310175A (en) System and method for mobile augmented reality
WO2023051289A1 (en) Navigation method and apparatus for unmanned device, medium, and unmanned device
US20120124509A1 (en) Information processor, processing method and program
WO2019231130A1 (en) Electronic device and control method therefor
WO2019059505A1 (en) Method and apparatus for recognizing object
WO2024090989A1 (en) Multi-view segmentation and perceptual inpainting with neural radiance fields
US12354385B2 (en) Image processing method and apparatus, electronic device, and computer-readable storage medium for identifying two-dimensional shapes using a depth image
WO2020138602A1 (en) Method for identifying user's real hand and wearable device therefor
WO2017099555A1 (en) Handwritten signature authentication system and method based on time division segment block
WO2015199502A1 (en) Apparatus and method for providing augmented reality interaction service
KR102275682B1 (en) SLAM-based mobile scan backpack system for rapid real-time building scanning
WO2025028912A1 (en) Electronic device for generating virtual object and operation method thereof
CN121043130A (en) A method, apparatus, equipment, medium, and product for controlling a robotic arm.
WO2020204355A1 (en) Electronic device and control method therefor
WO2023239035A1 (en) Electronic device for obtaining image data related to hand gesture and operation method therefor
WO2024002065A1 (en) Video encoding method and apparatus, electronic device, and medium
WO2019245320A1 (en) Mobile robot device for correcting position by fusing image sensor and plurality of geomagnetic sensors, and control method
WO2019207875A1 (en) Information processing device, information processing method, and program
WO2023090808A1 (en) Representing 3d shapes with probabilistic directed distance fields
WO2023224326A1 (en) Augmented reality device for acquiring depth information, and operating method therefor
WO2017171142A1 (en) System and method for detecting facial feature point
WO2023063570A1 (en) Electronic device for obtaining image data relating to hand motion and method for operating same
WO2022270683A1 (en) Depth map image generation method and computing device for same

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 24771233

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 2024771233

Country of ref document: EP

ENP Entry into the national phase

Ref document number: 2024771233

Country of ref document: EP

Effective date: 20250509

WWE Wipo information: entry into national phase

Ref document number: 202480007064.6

Country of ref document: CN

WWP Wipo information: published in national office

Ref document number: 2024771233

Country of ref document: EP

WWP Wipo information: published in national office

Ref document number: 202480007064.6

Country of ref document: CN

NENP Non-entry into the national phase

Ref country code: DE