WO2024186013A1 - Method and apparatus for removing and rendering an image - Google Patents

Method and apparatus for removing and rendering an image

Info

Publication number
WO2024186013A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
nerf
scene
images
viewpoint
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/KR2024/001943
Other languages
French (fr)
Inventor
Ashkan MIRZAEI
Tristan Ty Aumentado-Armstrong
Konstantinos G. DERPANIS
Igor Gilitschenski
Aleksai Levinshtein
Marcus Brubaker
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Samsung Electronics Co Ltd
Original Assignee
Samsung Electronics Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Samsung Electronics Co Ltd
Priority to EP24767302.3A (patent EP4616371A4)
Priority to CN202480006898.5A (patent CN120476430A)
Publication of WO2024186013A1

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T15/00 Three-dimensional [3D] image rendering
    • G06T15/10 Geometric effects
    • G06T15/20 Perspective computation
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T15/00 Three-dimensional [3D] image rendering
    • G06T15/08 Volume rendering
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00 Three-dimensional [3D] modelling for computer graphics
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T19/00 Manipulating three-dimensional [3D] models or images for computer graphics
    • G06T19/20 Editing of three-dimensional [3D] images, e.g. changing shapes or colours, aligning objects or positioning parts
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00 Image enhancement or restoration
    • G06T5/50 Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00 Image enhancement or restoration
    • G06T5/77 Retouching; Inpainting; Scratch removal
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20084 Artificial neural networks [ANN]
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2219/00 Indexing scheme for manipulating 3D models or images for computer graphics
    • G06T2219/20 Indexing scheme for editing of 3D models
    • G06T2219/2021 Shape modification

Definitions

  • This application is related to synthesizing a view of a 3D scene from a novel viewpoint.
  • The use of Neural Radiance Fields (NeRFs) for view synthesis has led to a desire for NeRF editing tools.
  • Using existing NeRF techniques to provide a scene representation comes with technical problems.
  • a method may include obtaining a plurality of images from a user, wherein the plurality of images were acquired by an electronic device viewing a first scene and each of the plurality of images is associated with a corresponding viewpoint of the first scene.
  • the method may include obtaining a first indication identifying a first image of the plurality of images, wherein the first image is associated with a first viewpoint of the first scene.
  • the method may include obtaining a second indication of a first object to be removed from the first image.
  • the method may include removing the first object from the first image to obtain a reference image.
  • the method may include obtaining a third indication of a second viewpoint from the user, wherein the second viewpoint is different from each of the respective viewpoints of the plurality of images.
  • the method may include rendering, using a neural radiance field (NeRF), a second image that corresponds to a 3D scene as seen from the second viewpoint, wherein the 3D scene has been inpainted into the NeRF.
  • the method may include displaying the second image on a display of the electronic device.
  • NeRF neural radiance field
  • an apparatus may include one or more processors.
  • the apparatus may include one or more memories, the one or more memories storing instructions configured to cause the apparatus to at least receive a plurality of images from a user, wherein the plurality of images were acquired by the apparatus viewing a first scene and each of the plurality of images is associated with a corresponding viewpoint of the first scene.
  • the apparatus may include one or more memories, the one or more memories storing instructions configured to cause the apparatus to at least obtain a first indication identifying a first image of the plurality of images, wherein the first image is associated with a first viewpoint of the first scene.
  • the apparatus may include one or more memories, the one or more memories storing instructions configured to cause the apparatus to at least obtain a second indication of a first object to be removed from the first image.
  • the apparatus may include one or more memories, the one or more memories storing instructions configured to cause the apparatus to at least remove the first object from the first image to obtain a reference image.
  • the apparatus may include one or more memories, the one or more memories storing instructions configured to cause the apparatus to at least obtain a third indication of a second viewpoint from the user, wherein the second viewpoint does not correspond to any of the plurality of images.
  • the apparatus may include one or more memories, the one or more memories storing instructions configured to cause the apparatus to at least render, using a neural radiance field (NeRF), a second image that corresponds to a 3D scene as seen from the second viewpoint, wherein the 3D scene has been inpainted into the NeRF.
  • the apparatus may include one or more memories, the one or more memories storing instructions configured to cause the apparatus to at least display the second image on a display of the apparatus.
  • a computer-readable storage medium may store instructions.
  • the instructions when executed by at least one processor, may cause the at least one processor to obtain a plurality of images from a user, wherein the plurality of images were acquired by an electronic device viewing a first scene and each of the plurality of images is associated with a corresponding viewpoint of the first scene.
  • the instructions when executed by at least one processor, may cause the at least one processor to obtain a first indication identifying a first image of the plurality of images, wherein the first image is associated with a first viewpoint of the first scene.
  • the instructions when executed by at least one processor, may cause the at least one processor to obtain a second indication of a first object to be removed from the first image.
  • the instructions when executed by at least one processor, may cause the at least one processor to remove the first object from the first image to obtain a reference image.
  • the instructions when executed by at least one processor, may cause the at least one processor to obtain a third indication of a second viewpoint from the user, wherein the second viewpoint is different from each of the respective viewpoints of the plurality of images.
  • the instructions when executed by at least one processor, may cause the at least one processor to render, using a neural radiance field (NeRF), a second image that corresponds to a 3D scene as seen from the second viewpoint, wherein the 3D scene has been inpainted into the NeRF.
  • the instructions when executed by at least one processor, may cause the at least one processor to display the second image on a display of the electronic device.
  • a method including receiving a plurality of images from a user, wherein the plurality of images were acquired by an electronic device viewing a first scene and each of the plurality of images is associated with a corresponding viewpoint of the first scene; receiving a first indication identifying a first image of the plurality of images, wherein the first image is associated with a first viewpoint of the first scene; receiving a second indication of a first object to be removed from the first image; removing the first object from the first image to obtain a reference image; receiving a third indication of a second viewpoint from the user, wherein the second viewpoint does not correspond to any of the plurality of images; rendering, using a neural radiance field (NeRF), a second image that corresponds to a 3D scene as seen from the second viewpoint, wherein the 3D scene has been inpainted into the NeRF; and displaying the second image on a display of
  • FIG. 1A illustrates an example of logic for rendering an inpainted 3D scene from a novel viewpoint, according to an embodiment.
  • FIG. 1B illustrates an example of adding a selected object to a 3D scene, according to an embodiment.
  • FIG. 1C illustrates an example of a rendering of the inpainted 3D scene of FIG. 1B from the novel viewpoint, according to an embodiment.
  • FIG. 2A illustrates an example of a system for providing the novel view, according to an embodiment.
  • FIG. 2B illustrates an example of logic for training a NeRF to represent an inpainted 3D scene and using the NeRF to obtain the novel view, according to an embodiment.
  • FIG. 3 illustrates an example of training the NeRF to represent an inpainted 3D scene, according to an embodiment.
  • FIG. 4 illustrates an example of further details of training the NeRF of FIG. 3 to represent the inpainted 3D scene, according to an embodiment.
  • FIG. 5 illustrates an example of geometry related to a view substitution technique.
  • FIG. 6 represents an example of an input image.
  • FIG. 7 illustrates an example of removing a backpack from the input image of FIG. 6 and obtaining a text command to inpaint a red fence, and inpainting the red fence.
  • FIG. 8 illustrates an example of removing a backpack from the input image of FIG. 6 and obtaining a text command to inpaint a rubber duck, and inpainting the rubber duck.
  • FIG. 9 illustrates an example of removing a backpack from the input image of FIG. 6 and obtaining a text command to inpaint a flower pot, and inpainting the flower pot.
  • FIG. 10 illustrates an example of an input image with a backpack as an object to be removed.
  • FIG. 11 illustrates an example of the red fence replacing the backpack using 2D inpainting.
  • FIG. 12, related to FIG. 10, illustrates an example of replacing the backpack by pasting an image of a mailbox.
  • FIG. 13, related to FIG. 10, illustrates an example of replacing the backpack by inpainting the red fence and then manually pasting a shrub.
  • FIG. 14 illustrates an example of a reference image in which an object has been removed.
  • FIG. 15 illustrates an example of a set of input images.
  • FIG. 16 illustrates an example of a set of masks corresponding to the input images of FIG. 15.
  • FIG. 17 illustrates an example of an initial target view with distortion in the area corresponding to the inpainting in the reference view.
  • FIG. 18 illustrates an example of a residual with respect to the target view of FIG. 17.
  • FIG. 19 illustrates an example of an updated rendering of the target view based on the residual of FIG. 18.
  • FIG. 20 illustrates an example of disocclusion processing to improve the inpainted 3D scene represented by the NeRF for views other than the reference view.
  • FIG. 21 illustrates exemplary hardware for implementation of computing devices for implementing the systems and algorithms described by the figures, according to an embodiment.
  • the present disclosure provides methods, apparatuses, and computer-readable mediums for inpainting an unwanted object in one of several 2D images forming a complete 3D scene representation.
  • the unwanted object is removed from any viewpoint within the 3D scene image.
  • Obtaining may include receiving, accessing, acquiring and the like.
  • NeRF techniques may be used to inpaint unwanted regions in a view-consistent manner, allowing users to exercise control over the generated scene through a single inpainted image.
  • NeRFs are an implicit neural field representation (e.g., coordinate mapping) for 3D scenes and objects, generally fit to multiview posed image sets.
  • the basic constituents are (i) a field, $f_\theta$, that maps a 3D coordinate, $x$, and a view direction, $d$, to a color, $c$, and density, $\sigma$, via learnable parameters $\theta$, and (ii) a rendering operator that produces color and depth for a given view pixel.
  • the field, $f_\theta$, can be constructed in a variety of ways; the rendering operator is implemented as the classical volume rendering integral, approximated via quadrature, where a ray, $r$, is divided into $N$ sections between $t_n$ and $t_f$ (the near and far bounds), with $t_i$ sampled from the i-th section. The estimated color is then given by Equation 1.
  • Equation 1: $\hat{C}(r) = \sum_{i=1}^{N} T_i \left(1 - \exp(-\sigma_i \delta_i)\right) c_i$, where $T_i = \exp\left(-\sum_{j<i} \sigma_j \delta_j\right)$ is the transmittance, $\delta_i = t_{i+1} - t_i$, and $c_i$ and $\sigma_i$ are the color and density at $t_i$.
  • Replacing $c_i$ in Equation 1 with $t_i$ or $1/t_i$ estimates depth, $\hat{\zeta}(r)$, and disparity (inverse depth), $\hat{D}(r)$, instead.
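  • The quadrature in Equation 1 can be sketched in NumPy as follows. This is a minimal illustration, not the disclosed implementation: the function name `render_ray`, the array layout, and the choice of the section midpoint as the sample location are assumptions made for the example.

```python
import numpy as np

def render_ray(sigmas, colors, t_vals):
    """Quadrature approximation of the volume rendering integral for one ray.

    sigmas: (N,) densities at the sample points
    colors: (N, 3) colors at the sample points
    t_vals: (N + 1,) section boundaries from the near bound to the far bound
    """
    deltas = np.diff(t_vals)                    # (N,) section lengths
    t_mid = 0.5 * (t_vals[:-1] + t_vals[1:])    # one sample per section (midpoint is an assumption)
    alphas = 1.0 - np.exp(-sigmas * deltas)     # opacity contributed by each section
    # Transmittance T_i = exp(-sum_{j<i} sigma_j * delta_j): light surviving up to section i.
    trans = np.exp(-np.concatenate([[0.0], np.cumsum(sigmas * deltas)[:-1]]))
    weights = trans * alphas
    color = weights @ colors                    # estimated color
    depth = weights @ t_mid                     # colors replaced by sample distances
    disparity = weights @ (1.0 / t_mid)         # colors replaced by inverse distances
    return color, depth, disparity
```

  • The same weights are reused for color, depth, and disparity, so the three quantities stay consistent with one another for a given ray.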
  • the inputs are n input images, {I_i}, their camera transform matrices, and their corresponding masks, {M_i}, delineating the unwanted region.
  • the inputs also include a single inpainted reference view, I_ref, which provides the information that an embodiment maps, or extrapolates, into a 3D inpainting of the scene represented by the NeRF.
  • An embodiment uses I_ref not only to inpaint the NeRF, but also to generate 3D details and view-dependent effects (VDEs) from other viewpoints.
  • Training may include gaining experience with respect to a task and attempting to improve performance of the task at a future time after the training.
  • training is based on the following four losses: i) L_unmasked, ii) L_depth, iii) L_substituted and iv) L_occluded. These four losses represent the unmasked appearance loss, masked geometry loss, view-dependent masked color loss, and dis-occlusion loss, respectively.
  • The overall objective for inpainted NeRF fitting is given by Equation 2 (including weights on the last three terms).
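  • One way to write an objective of the form described (four losses, with weights on the last three terms) is shown below. The weight symbols $\lambda_{\text{depth}}$, $\lambda_{\text{sub}}$ and $\lambda_{\text{occ}}$ are assumed names introduced for illustration; the source text does not give them.

```latex
\mathcal{L} = \mathcal{L}_{\text{unmasked}}
  + \lambda_{\text{depth}}\,\mathcal{L}_{\text{depth}}
  + \lambda_{\text{sub}}\,\mathcal{L}_{\text{substituted}}
  + \lambda_{\text{occ}}\,\mathcal{L}_{\text{occluded}}
```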
  • Supervision is computed modulo an iteration count. For example, supervision for the respective summands of Equation 2 is computed every N_unmasked, N_depth, N_sub and N_occluded iterations. A particular loss is not used until the appropriate number of iterations has passed.
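  • One possible reading of this schedule is sketched below: a loss term is supervised only on iterations that are multiples of its period, which also means it is never used before that many iterations have passed. The function name `active_losses` and the numeric periods are hypothetical; the source gives no values.

```python
def active_losses(step, periods):
    """Return the names of the loss terms supervised at a 1-based iteration `step`.

    periods maps a loss name to its period N. A term is used only on iterations
    that are multiples of N, so it is never used before N iterations have passed.
    """
    return [name for name, n in periods.items() if step % n == 0]

# Hypothetical periods, chosen only to make the example concrete.
PERIODS = {"unmasked": 1, "depth": 2, "substituted": 4, "occluded": 8}
```

  • With these example periods, iteration 4 supervises the unmasked, depth and substituted terms, and the occluded term first appears at iteration 8.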
  • Equation 3 is evaluated over the set of rays corresponding to the pixels in the unmasked part of the image (the part not affected by the mask), in contrast to the set of rays in the masked part, and uses the ground truth (GT) color for the ray, r.
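  • A reconstruction loss restricted to the unmasked pixels could look like the sketch below. The body of Equation 3 is not reproduced in this text, so the squared-error form, the function name `unmasked_loss`, and the image-shaped inputs are assumptions.

```python
import numpy as np

def unmasked_loss(rendered, ground_truth, mask):
    """Mean squared color error over pixels outside the mask.

    rendered, ground_truth: (h, w, 3) float arrays
    mask: (h, w) bool array, True where the unwanted region is
    """
    keep = ~mask                                   # pixels not affected by the mask
    diff = rendered[keep] - ground_truth[keep]     # (n_unmasked, 3)
    return float(np.mean(np.sum(diff ** 2, axis=-1)))
```

  • Pixels inside the mask contribute nothing, so arbitrary content there does not affect the value.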
  • The loss for the masked portion based on depth is developed by Equations 4, 5 and 6.
  • scalars h and w are the height and width of the input images.
  • the monocular depth estimation of the masked region from the reference image is expressed in terms of disparity.
  • a corresponding disparity is obtained from the NeRF model.
  • The coefficients in Equation 4 are found by optimization, with F being the objective (Equation 5).
  • In Equation 4, J is the all-ones matrix.
  • Equation 6 the expectation is over .
  • Equation 6 is a variable obtained by optimizing to encourage greater smoothness around the mask.
  • An example smoothing technique minimizes the total variation of around mask boundaries.
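  • The disparity alignment and the resulting depth loss can be sketched as follows. Several points are assumptions, since Equations 4 to 6 are not reproduced in this text: the alignment is taken to be affine (a scale a0 and an offset a1 multiplying the all-ones matrix J), the objective F is taken to be least squares over the unmasked pixels, and the loss is taken to be a squared disparity error over the masked pixels. The smoothing step is omitted, and all function names are hypothetical.

```python
import numpy as np

def align_disparity(mono_disp, nerf_disp, mask):
    """Fit a0, a1 so that a0 * mono_disp + a1 * J matches nerf_disp outside the mask."""
    keep = ~mask
    # Design matrix with one column for the monocular disparity and one for the ones (J).
    A = np.stack([mono_disp[keep], np.ones(keep.sum())], axis=1)
    (a0, a1), *_ = np.linalg.lstsq(A, nerf_disp[keep], rcond=None)
    aligned = a0 * mono_disp + a1 * np.ones_like(mono_disp)
    return aligned, (a0, a1)

def depth_loss(nerf_disp, aligned_disp, mask):
    """Mean squared disparity error over the masked pixels of the reference view."""
    return float(np.mean((nerf_disp[mask] - aligned_disp[mask]) ** 2))
```

  • Fitting on the unmasked pixels calibrates the monocular estimate to the NeRF's scale, so that the aligned values inside the mask can guide the geometry there.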
  • Equation 9 The expectation in Equation 9 is over .
  • x_i is a shading point position on a ray emanating from the reference camera, and the view-substituted direction is the corresponding ray direction that intersects x_i from a target-image camera (at o_t).
  • is an inpainted residual is a reference view, is a view-substituted image, is a target color is a mask, is a bilateral solver.
  • Equation 10 A loss to solve for occluded areas in the reference image which are however visible in a non-reference image is provided by Equation 10.
  • Equation 10 the expectation is over , , , and color and disparity are and .
  • L_unmasked: this is a NeRF reconstruction loss over the unmasked area of the K input images. See Equation 3.
  • L_depth: this loss is based on monocular depth estimation to predict an uncalibrated disparity of the reference image and guide the geometry. See Equation 6.
  • VDEs: view-dependent effects.
  • L_occluded: the overall algorithm is focused on the reference view, and pixels which are visible in target views but not visible in the reference view are called dis-occluded pixels (they are occluded in the reference view, and become dis-occluded when the scene is viewed from other viewpoints). This loss supervises the NeRF training so that the NeRF produces plausible results with respect to these dis-occluded pixels. See Equation 10.
  • I_ref: the reference image, constructed by inpainting a portion of I_in.
  • I_novel: an image of the 3D scene inpainted into the NeRF, as seen from a user-requested viewpoint and produced by the NeRF.
  • Equation 8 the view-substituted image with VDEs from res target after using Equation 8.
  • FIG. 1A illustrates an example of a flowchart of a method L1 for rendering an inpainted 3D scene from a novel viewpoint.
  • a user has a camera.
  • the method may include capturing, by the user, several pictures, possibly as a video sequence.
  • the method may include selecting, by the user, one of the images as an input image.
  • the method may include selecting an undesired object to be removed from the input image.
  • the electronic device may perform the selection by recommending objects to be erased by the device.
  • the electronic device may select portions of the images with features such as, but not limited to, many light reflections, blurry portions, or portions identified by the electronic device as background objects.
  • the method may include performing the selection by the user.
  • the selection may be performed by the user selecting an area around an object, and the electronic device analyzing the identified area and selecting along the object outline.
  • An embodiment may include an additional selection by the electronic device based on user-selected information.
  • the electronic device may analyze the selected object and recommend whether other objects of a similar type to the object selected by the user should also be selected and erased from the images.
  • the method may include removing the undesired object from the images using masks.
  • a device may obtain information about the object to be inpainted into an image from the user.
  • the device may be an electronic device.
  • the obtaining method may vary, such as, but not limited to, text, voice, click, or touch. An image or video corresponding to the text or voice input may be shown, and those images and videos can be inserted into the desired input location.
  • the device used by the user is possibly a mobile terminal which includes the camera.
  • the identification may be by various methods, such as, but not limited to, voice command, text command, touch command, click command or from an image or a video submitted to the device.
  • the desired object is inpainted to a reference image.
  • in an embodiment, the method includes allowing the user to communicate the new object not only by text or voice, but also by providing an image of the desired object, for example by manually inserting an image.
  • the inserted image in an embodiment is downloaded from the Internet (something the user found appealing), or the inserted image is from the user's photo gallery or another photo gallery.
  • An embodiment may be configured to allow a user to select from the multiple images indicated in a list shown at the bottom of the electronic device user interface display or on the side of the electronic device user interface display.
  • An embodiment also allows the device to move the image part as desired once the image that corresponds to the multiple texts is selected, and that image part enters the inpainted region.
  • the device may remove the undesired object and fill in the gap in the 3D scene with the desired object; this creates I_ref.
  • I_ref is an inpainted reference view, providing the information that a user expects to be extrapolated into a 3D inpainting of the scene which is the subject of the images {I_i}.
  • the method may include training a neural radiance field (NeRF) to represent the inpainted 3D scene. See FIGS. 3-4 and Equations 1-10.
  • the method may include obtaining, from the user, a viewpoint from which to view the 3D scene.
  • the device may render the novel viewpoint and display it to the user. See FIG. 1C.
  • the method may include choosing, by the user, another object to inpaint or to view the 3D scene from yet another viewpoint.
  • FIG. 1B illustrates adding a selected object to a 3D scene, according to an embodiment.
  • FIG. 1B illustrates an example of a mobile device displaying a first image I_in.
  • Examples of a mobile device may be a smartphone with a camera, a tablet PC with a camera and the like.
  • a mobile device is an example and embodiments are not limited to mobile devices.
  • An embodiment is applicable to electronic devices, such as, but not limited to, an AR headset, smart glasses, or a smartphone.
  • the method may include selecting an object (for example, a flowerpot in FIG. 1B) and adding it to the first image I_in to obtain an inpainted version of the first image, the reference image I_ref.
  • FIG. 1C illustrates an example of a rendering of the inpainted 3D scene of FIG. 1B from the novel viewpoint to obtain the image I_novel.
  • FIG. 2A illustrates an example of the overall system for providing the novel view.
  • K views, K masks, the reference view with an additional object inpainted, and a request for a rendering from a novel viewpoint are provided to the NeRF.
  • the training of the NeRF may occur at a mobile terminal, at a server and the like.
  • the NeRF may provide a novel view I novel of an inpainted 3D scene.
  • FIG. 2B illustrates an example of a method L2 for training a NeRF to represent an inpainted 3D scene and using the NeRF to obtain the novel view I novel .
  • the method may include obtaining K views of a scene as an input.
  • the i-th view may be denoted as image I_i.
  • the method may include segmenting an undesired object to remove it from the scene in each view. This may result in a mask for each view.
  • the i-th mask may be denoted M_i.
  • the method may include selecting one of the images from the set {I_i} as the input image from which to create the reference image I_ref.
  • the method may include training an inpainting neural radiance field to represent an inpainted 3D scene.
  • the NeRF may be a neural network specific to the scene.
  • the method may include using the NeRF to render the inpainted 3D scene from a novel viewpoint, to obtain I novel .
  • FIG. 3 illustrates an example of method L3 for training the NeRF to represent an inpainted 3D scene in terms of four training epochs.
  • Each training phase in the figure has a predefined number of iterations inside it.
  • Each training iteration in NeRF training samples random rays from the input views in the scene, renders them using the current NeRF network, and updates the NeRF parameters by minimizing the corresponding losses.
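  • The ray-sampling step of a training iteration can be sketched as drawing random pixel indices across the input views, where each (view, row, column) triple identifies one ray. The function name `sample_ray_batch` and the uniform sampling are assumptions for illustration; rendering and the parameter update are left out.

```python
import numpy as np

def sample_ray_batch(rng, n_views, height, width, batch_size):
    """Pick random (view, row, col) pixel indices; each one defines a training ray."""
    views = rng.integers(0, n_views, size=batch_size)
    rows = rng.integers(0, height, size=batch_size)
    cols = rng.integers(0, width, size=batch_size)
    return np.stack([views, rows, cols], axis=1)   # (batch_size, 3)
```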
  • the loss L_unmasked may be used at operation A1. See Equation 3. Operation A1 may be performed once every N_unmasked iterations. Input views and camera parameters, masks, and the (inpainted) reference view may be used as input at operation A1. At operation A1, the method may include training the NeRF for the unmasked portion using the loss L_unmasked. At operation A1, the losses may be cumulative. The method may include training with available losses.
  • the losses L_depth and L_unmasked may be used at operation A2. See Equations 3 and 6. Operation A2 may be performed once every N_depth iterations. At operation A2, the method may include a depth estimation of the masked portion. The depth estimation of the masked portion may include training using L_depth and L_unmasked. At operation A2, the losses may be cumulative. The method may include training with available losses.
  • the losses L_substituted, L_depth and L_unmasked may be used at operation A3. See Equations 3, 6 and 9. Operation A3 may be performed once every N_substituted iterations. K-1 target views, the (inpainted) reference view, and the result of operation A2 may be used as input at operation A3. At operation A3, the method may include view substitution training using L_substituted, L_depth and L_unmasked. At operation A3, the losses may be cumulative. The method may include training with available losses.
  • the losses L_occluded, L_substituted, L_depth and L_unmasked may be used at operation A4. See Equations 3, 6, 9 and 10. Operation A4 may be performed once every N_occluded iterations.
  • the method may include training on dis-occluded pixels in target views using L_occluded, L_substituted, L_depth and L_unmasked.
  • the losses may be cumulative.
  • the method may include training with available losses.
  • Operation A4 may output the trained NeRF representing the inpainted 3D scene.
  • one or more of A2, A3 and A4 may not be used at all in training the NeRF.
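  • The cumulative use of losses across operations A1 to A4, including the option of leaving some operations out, can be sketched as a simple lookup. The names `LOSS_ADDED` and `losses_through` are hypothetical and only illustrate the bookkeeping.

```python
# Loss term that each operation adds on top of the earlier ones.
LOSS_ADDED = {"A1": "unmasked", "A2": "depth", "A3": "substituted", "A4": "occluded"}

def losses_through(operation, enabled=("A1", "A2", "A3", "A4")):
    """List the losses in use at `operation`, accumulating over enabled earlier operations."""
    names = []
    for op in ("A1", "A2", "A3", "A4"):
        if op in enabled:
            names.append(LOSS_ADDED[op])
        if op == operation:
            break
    return names
```

  • For example, with operation A2 disabled, operation A4 trains with the unmasked, substituted and occluded losses only.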
  • FIG. 4 illustrates an example of further details of training the NeRF of FIG. 3 to represent the inpainted 3D scene.
  • the NeRF may be trained for the unmasked portion of the images {I_i}.
  • training may be performed using L_unmasked.
  • depth of the masked portion in the reference image may be obtained.
  • disparity alignment and smoothing may be performed.
  • training may be performed using L_unmasked and L_depth.
  • colors along a ray from the reference camera may be obtained but with view directions from target cameras. This is referred to as view-substitution.
  • a comparison between I ref and I ref target may be made with the reference view to get a residual, .
  • view dependent effects VDEs
  • the bilateral solver may treat I ref as reference input.
  • confidence may be zero inside the mask. See Equation 8.
  • target colors, which include the VDEs for this view, may be obtained.
  • training may be performed using L_unmasked, L_depth and L_substituted. See Equation 9.
  • disoccluded pixels may be determined by reprojecting all pixels from the reference view into a target view.
  • the disoccluded pixels may be inpainted for view t using leftmost, rightmost and topmost target images.
  • a disparity version of the disoccluded pixels may be inpainted using a bilateral solver.
  • training of the NeRF may be performed using L_unmasked, L_depth, L_substituted, and L_occluded. See Equations 3, 6, 9, 10.
  • Operation A4-4 may output the trained NeRF representing the inpainted 3D scene.
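  • The reprojection step used to find disoccluded pixels can be sketched as below: every reference pixel is lifted to 3D with its depth, moved into the target camera frame, and projected; target pixels that receive no projection are marked as disoccluded. The pinhole model, the shared intrinsics K, nearest-pixel rounding, and the function name `disoccluded_mask` are assumptions made for this illustration, and no z-buffering is performed.

```python
import numpy as np

def disoccluded_mask(ref_depth, K, ref_to_target):
    """Mark target pixels that no reference pixel lands on after reprojection.

    ref_depth: (h, w) depth along the camera z-axis for each reference pixel
    K: (3, 3) pinhole intrinsics shared by both cameras (an assumption)
    ref_to_target: (4, 4) rigid transform from reference to target camera coordinates
    """
    h, w = ref_depth.shape
    cols, rows = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([cols, rows, np.ones_like(cols)], axis=-1).reshape(-1, 3).astype(float)
    # Unproject every reference pixel to a 3D point in the reference camera frame.
    pts_ref = (np.linalg.inv(K) @ pix.T) * ref_depth.reshape(1, -1)
    # Move the points into the target camera frame and project them.
    pts_tgt = ref_to_target[:3, :3] @ pts_ref + ref_to_target[:3, 3:4]
    in_front = pts_tgt[2] > 1e-8
    proj = K @ pts_tgt[:, in_front]
    u = np.rint(proj[0] / proj[2]).astype(int)
    v = np.rint(proj[1] / proj[2]).astype(int)
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    covered = np.zeros((h, w), dtype=bool)
    covered[v[inside], u[inside]] = True
    return ~covered
```

  • With an identity transform every target pixel is covered; a sideways camera shift leaves a strip of uncovered pixels, which are the ones that would then be inpainted.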
  • FIG. 5 illustrates an example of geometry related to a view substitution technique.
  • the view substitution technique disclosed herein may enable rendering from the reference viewpoint, but with the view-dependent effects of a target viewpoint, by substituting the directional input to the per-shading-point neural color field.
  • the upper portion of FIG. 5 (510) illustrates that, given a shading point position, x_i, on a ray emanating from the reference camera, an embodiment may obtain the corresponding ray direction that intersects x_i from a target-image camera (at o_t). See Equation 7.
  • elements 520 and 530 illustrate the two ways of querying the NeRF. At 520, standard inputs may be used to query the NeRF for the color at shading point x_i.
  • at 530 of FIG. 5, view-substituted inputs may be used to query the NeRF, obtaining a view-substituted color instead.
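  • The directional substitution can be sketched as computing, for each shading point on a reference ray, the unit direction of the ray that reaches that point from the target camera. Equation 7 is not reproduced in this text, so the normalized-offset form and the function name `substituted_directions` are assumptions.

```python
import numpy as np

def substituted_directions(points, target_origin):
    """Unit directions of the rays that reach each shading point from the target camera.

    points: (N, 3) shading point positions x_i sampled along a reference-camera ray
    target_origin: (3,) target camera position o_t
    """
    offsets = points - target_origin
    return offsets / np.linalg.norm(offsets, axis=-1, keepdims=True)
```

  • A color field would then be queried with the same shading points but with these directions in place of the reference ray direction, which is what yields a reference-viewpoint render carrying the target view's view-dependent effects.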
  • the NeRF may provide 3D information (3D point color and density), which then have to be integrated along a ray to get rendered (i.e. get a view).
  • the output from a NeRF network may be 3D.
  • FIGS. 6-11 present some example results at the level of image changes.
  • FIG. 6 represents an example of an input image, I in .
  • FIG. 7 illustrates an example of removing a backpack from the input image of FIG. 6 and obtaining a text command to inpaint a red fence, and inpainting the red fence.
  • FIG. 8 illustrates an example of removing a backpack from the input image of FIG. 6 and obtaining a text command to inpaint a rubber duck, and inpainting the rubber duck.
  • FIG. 9 illustrates an example of removing a backpack from the input image of FIG. 6 and obtaining a text command to inpaint a flower pot, and inpainting the flower pot.
  • FIG. 10 illustrates an example of an input image with a backpack as an undesired object to be removed.
  • FIG. 11 illustrates an example of the red fence replacing, in an inpainted region, the backpack using 2D inpainting.
  • a text command is an example and embodiments are not limited to text commands.
  • An embodiment can obtain information about the object to be inpainted into an image in various forms, such as, but not limited to, text, voice, click, or touch. An image or video corresponding to the text or voice input may be shown, and those images and videos can be inserted into the desired input location.
  • a device may obtain (e.g. receive, capture, download) a plurality of images or a short video, while moving a camera around a scene.
  • the device may then interactively segment the object of interest from the scene, using well known techniques (e.g. SPIn-NeRF).
  • reference-guided controllable 3D scene inpainting may be performed.
  • the method may include selecting a view and using a controllable 2D inpainting method to inpaint the object.
  • the controllable inpainting method may be, for one example, stable diffusion inpainting guided by text input.
  • the method may include creating the inpainted image by first inpainting it with the background using any 2D inpainting method and then overlaying an object of interest manually in the inpainted region.
  • An inpainting NeRF may then be trained, guided by the single inpainted view.
  • the inpainted NeRF may be used to render the inpainted 3D scene from arbitrary views.
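  • The manual overlay step described above (inpainting the background first, then placing an object of interest in the inpainted region) can be illustrated with a simple alpha composite. This is a minimal sketch, not the disclosed implementation: the function name `paste_object`, the RGBA float layout with values in [0, 1], and the assumption that the object fits entirely inside the background are all illustrative.

```python
import numpy as np

def paste_object(background, obj_rgba, top, left):
    """Alpha-composite an RGBA object onto an RGB background.

    background: (H, W, 3) float array in [0, 1] (assumed already inpainted).
    obj_rgba:   (h, w, 4) float array in [0, 1]; channel 3 is opacity.
    top, left:  pixel offset of the object's top-left corner (hypothetical
                parameters; the object is assumed to fit in the image).
    """
    out = background.astype(np.float64).copy()
    h, w = obj_rgba.shape[:2]
    alpha = obj_rgba[..., 3:4]                      # keep last axis for broadcasting
    region = out[top:top + h, left:left + w]
    out[top:top + h, left:left + w] = alpha * obj_rgba[..., :3] + (1.0 - alpha) * region
    return out
```

  • The composited image would then play the role of the single inpainted reference view that guides the NeRF training.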
  • FIG. 12, related to FIG. 10, illustrates an example of replacing the backpack by pasting an image of a mailbox.
  • FIG. 13, related to FIG. 10, illustrates an example of replacing the backpack by inpainting the red fence and then manually pasting a shrub.
  • the method may include obtaining an indication of a selection of an object to be inpainted into the first image.
  • FIGS. 14 to 19 illustrate examples of images that describe a method for training NeRFs with view substitution.
  • FIG. 14 illustrates an example of a reference image, I ref , in which an object has been removed.
  • FIG. 15 illustrates an example of a set of input images. The images of {I i} other than I in are referred to as target images.
  • the undesired object, UO, in FIG. 15 is a music book on a piano stand.
  • FIG. 16 illustrates an example of a set of masks M i corresponding to the input images of FIG. 15.
  • FIG. 17 illustrates an example of an initial target view, I ref,target with distortion in the area corresponding to the inpainting in the reference view.
  • FIG. 18 illustrates an example of a residual, res target with respect to the target view of FIG. 17.
  • FIG. 19 illustrates an example of an updated rendering of the target view based on the residual of FIG. 18.
  • An embodiment may provide view-dependent effects as follows. For each target, t, the scene may be rendered from the reference camera with target colors to get the view-substituted image (FIG. 17).
  • a bilateral solver may inpaint the residual between the reference view and the view-substituted image (see Equation 8), resulting in the inpainted residual, res target (FIG. 18), which is subtracted from the reference view to get the target color (FIG. 19).
  • the discrepancy between the target colors and the view-substituted images may provide supervision for the masked region.
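  • The residual step described above can be sketched as follows. This is an illustration under stated assumptions, not the disclosed implementation: the residual is taken as the reference view minus the view-substituted image, and a simple 4-neighbour diffusion fill (`diffuse_fill`, a hypothetical helper) stands in for the bilateral solver, which would additionally respect image edges.

```python
import numpy as np

def diffuse_fill(values, mask, iters=200):
    """Fill masked pixels by repeated 4-neighbour averaging.

    Stand-in for a bilateral solver (assumption). values: (H, W, C),
    mask: (H, W) bool, True where the residual is unknown.
    """
    filled = values.astype(np.float64).copy()
    filled[mask] = filled[~mask].mean(axis=0)       # initialise from known pixels
    for _ in range(iters):
        p = np.pad(filled, ((1, 1), (1, 1), (0, 0)), mode="edge")
        avg = (p[:-2, 1:-1] + p[2:, 1:-1] + p[1:-1, :-2] + p[1:-1, 2:]) / 4.0
        filled[mask] = avg[mask]                    # only masked pixels are updated
    return filled

def target_colors(i_ref, view_sub, mask):
    """Target color = reference view minus the inpainted residual."""
    residual = i_ref - view_sub                     # trusted only outside the mask
    residual_inpainted = diffuse_fill(residual, mask)
    return i_ref - residual_inpainted
```

  • Outside the mask the result reduces to the view-substituted image itself; inside the mask the reference inpainting is shifted by the propagated residual, which is how view-dependent appearance would be carried into the masked region in this sketch.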
  • While single-reference inpainting may prevent problems incurred by view-inconsistent inpaintings, it is missing multiview information in the inpainted region. For example, when inserting a duck into the scene (see FIG. 20), viewing the scene from another perspective may naturally unveil new details on and around the duck, due to dis-occlusions (see the dark areas marked in the image second from left in FIG. 20). An embodiment may construct these missing details.
  • An embodiment may identify pixels in the target view that are not visible from the reference view, to build a dis-occlusion mask. Using this mask, an embodiment may then inpaint the masked color image (see the upper right image in FIG. 20). This is followed by in-filling a disparity rendered image, using bilateral guidance to ensure consistency. The inpainted color image and the disparity image of FIG. 20 are arguments for terms in L_occluded of Equation 10. Finally, these inpainted dis-occluded values may be used for supervision (see A4 of FIG. 3).
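  • One way to build such a dis-occlusion mask is a depth reprojection test, sketched below. The approach, the pinhole camera model, the shared intrinsics `K`, the camera-to-world pose matrices, and the tolerance `tol` are all assumptions made for illustration; the source does not specify how visibility is determined.

```python
import numpy as np

def disocclusion_mask(depth_tgt, depth_ref, K, c2w_tgt, c2w_ref, tol=1e-2):
    """Mark target pixels whose 3D points are not visible from the reference view.

    depth_*: (H, W) z-depth maps; K: 3x3 intrinsics; c2w_*: 4x4 camera-to-world.
    """
    h, w = depth_tgt.shape
    v, u = np.mgrid[0:h, 0:w]
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(np.float64)
    # Back-project target pixels to target-camera space, then to world space.
    cam_pts = (np.linalg.inv(K) @ pix.T).T * depth_tgt.reshape(-1, 1)
    world = cam_pts @ c2w_tgt[:3, :3].T + c2w_tgt[:3, 3]
    # World space to reference-camera space: R^T (X - t), written for row vectors.
    ref_pts = (world - c2w_ref[:3, 3]) @ c2w_ref[:3, :3]
    z = ref_pts[:, 2]
    proj = (K @ ref_pts.T).T
    safe_z = np.where(z > 0, z, 1.0)                # avoid dividing by non-positive depth
    ur = np.round(proj[:, 0] / safe_z).astype(int)
    vr = np.round(proj[:, 1] / safe_z).astype(int)
    in_frame = (z > 0) & (ur >= 0) & (ur < w) & (vr >= 0) & (vr < h)
    visible = np.zeros(h * w, dtype=bool)
    idx = np.flatnonzero(in_frame)
    # Visible only if the point is not behind the surface seen by the reference view.
    visible[idx] = z[idx] <= depth_ref[vr[idx], ur[idx]] + tol
    return ~visible.reshape(h, w)
```

  • Pixels that project outside the reference frame, or that lie behind the reference depth, are flagged; these are the pixels a 2D inpainter would then fill.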
  • An embodiment with stable diffusion (SD) performs best by both metrics.
  • In a quantitative full-reference (FR) evaluation, an embodiment may provide the best performance on both FR metrics.
  • Combining Masked-NeRF with DreamFusion performs slightly better. This indicates some utility of the diffusion prior; however, while DreamFusion can generate impressive 3D entities in isolation, it does not produce sufficiently realistic outputs for inpainting real scenes.
  • SPIn-NeRF-SD obtains a similarly poor LPIPS, though with a better FID. It is unable to cope with the greater mismatches of the SD generations. NeRF-In outperforms the aforementioned models. Still, the use of a pixelwise loss leads to blurry outputs.
  • FIG. 21 illustrates an exemplary apparatus 21-1 for implementation of an embodiment disclosed herein.
  • FIG. 21 illustrates hardware for performing the embodiments provided.
  • the apparatus 21-1 may be a server, a computer, a laptop computer, a handheld device, or a tablet computer device, for example.
  • the NeRF of FIG. 2A performing the method L2 of FIG. 2B is located on the electronic device, and the method L2 may process the information obtained from an input unit of the electronic device.
  • the NeRF of FIG. 2A performing the method L2 of FIG. 2B is located on a server, and the images are server images.
  • An input value of the server image (area select, obtaining object information to be inpainted, content obtained from text, voice and the like) may be obtained from the communication unit of the server and applied using the method L2 of FIG. 2B.
  • Apparatus 21-1 may include one or more hardware processors 21-9.
  • the one or more hardware processors 21-9 may include an ASIC (application specific integrated circuit), CPU (for example CISC or RISC device), and/or custom hardware.
  • An embodiment can be deployed on various GPUs. As an example, a provider of GPUs is Nvidia TM , Santa Clara, California. For example, an embodiment may have been deployed on Nvidia TM A6000 GPUs with 48GB of GDDR6 memory.
  • Lambda TM is a workstation company in San Francisco, California. Experiments using embodiments have been conducted on a Lambda TM Vector Workstation.
  • Apparatus 21-1 also may include a user interface 21-5 (for example a display screen and/or keyboard and/or pointing device such as a mouse).
  • Apparatus 21-1 may include one or more volatile memories 21-2.
  • Apparatus 21-1 may include one or more non-volatile memories 21-3.
  • the one or more non-volatile memories 21-3 may include a computer readable medium storing instructions for execution by the one or more hardware processors 21-9 to cause apparatus 21-1 to perform any of the methods of embodiments disclosed herein.
  • Apparatus 21-1 may include wired and/or wireless interfaces 21-4.
  • the wired and/or wireless interfaces 21-4 may include a receiver component, a transmitter component, and/or a transceiver component.
  • the wired and/or wireless interfaces 21-4 may enable the apparatus 21-1 to establish connections and/or transfer communications with other devices (e.g., a server, another device).
  • the communications may be effected via a wired connection, a wireless connection, or a combination of wired and wireless connections.
  • the wired and/or wireless interfaces 21-4 may permit the apparatus 21-1 to receive information from another device and/or provide information to another device.
  • the wired and/or wireless interfaces 21-4 may provide for communications with another device via a network, such as, but not limited to a local area network (LAN), a wide area network (WAN), a metropolitan area network (MAN), a private network, an ad hoc network, an intranet, the Internet, a fiber optic-based network, a cellular network (e.g., a fifth generation (5G) network, a long-term evolution (LTE) network, a third generation (3G) network, a code division multiple access (CDMA) network, and the like), a public land mobile network (PLMN), a telephone network (e.g., the Public Switched Telephone Network (PSTN)), or the like, and/or a combination of these or other types of networks.
  • the wired and/or wireless interfaces 21-4 may provide for communications with another device via a device-to-device (D2D) communication link, such as, but not limited to FlashLinQ, WiMedia, Bluetooth, Bluetooth Low Energy (BLE), ZigBee, Institute of Electrical and Electronics Engineers (IEEE) 802.11x (Wi-Fi), LTE, 5G, and the like.
  • the wired and/or wireless interfaces 21-4 may include an Ethernet interface, an optical interface, a coaxial interface, an infrared interface, a radio frequency (RF) interface, a USB interface, an IEEE 1394 (FireWire) interface, or the like.
  • Apparatus 21-1 may include a display device 21-6.
  • the display device 21-6 may include one or more components that may serve to present information from the set of components of the apparatus 21-1.
  • the display device 21-6 may be a computer monitor, a smartphone screen, a television (TV), a tablet screen, a digital watch, an AR headset, and the like. The present disclosure is not limited in this regard.
  • Apparatus 21-1 may include a bus 21-7.
  • the set of components of the apparatus 21-1 may be communicatively coupled via the bus 21-7.
  • the bus 21-7 may include one or more components that may permit communication among the set of components of the apparatus 21-1.
  • the bus 21-7 may be a communication bus, a cross-over bar, a network, or the like.
  • although the bus 21-7 is depicted as a single line in FIG. 21, the bus 21-7 may be implemented using multiple (e.g., two or more) connections between the set of components of the apparatus 21-1. The present disclosure is not limited in this regard.
  • An embodiment provides an approach to inpaint NeRFs, via a single inpainted reference image.
  • An embodiment may use a monocular depth estimator, aligning its output to the coordinate system of the inpainted NeRF to back-project the inpainted material from the reference view into 3D space.
  • An embodiment uses bilateral solvers to add view-dependent effects (VDEs) to the inpainted region, and uses 2D inpainters to fill dis-occluded areas. Table 1 and Table 2, using multiple evaluation metrics, illustrate the superiority of an embodiment over prior 3D inpainting methods.
  • an embodiment includes a controllability advantage enabling users to easily alter a generated 3D scene through a single guidance image (I ref ).
  • An embodiment of the present disclosure may solve one or more technical problems.
  • An embodiment may use a single inpainted reference, thus avoiding view inconsistencies.
  • an embodiment may use an optimization-based formulation with monocular depth estimation.
  • An embodiment may obtain view dependent effects (VDEs) of non-reference views from the reference viewpoint. This may enable a guided inpainting approach, propagating non-reference colors (with VDEs) into the mask area of the 3D scene represented by the NeRF.
  • An embodiment may also inpaint disoccluded appearance and geometry in a consistent manner.
  • An embodiment may be provided for inpainting regions in a view-consistent and controllable manner.
  • an embodiment may require only a single inpainted view of the scene, e.g., a reference view.
  • An embodiment may use monocular depth estimators to back-project the inpainted view to the correct 3D positions.
  • a bilateral solver of an embodiment may construct view-dependent effects in non-reference views, making the inpainted region appear consistent from any view.
  • an embodiment may provide a method based on image inpainters to guide both the geometry and appearance.
  • An embodiment may show superior performance to NeRF inpainting baselines, with the additional advantage that a user can control the generated scene via a single inpainted image.
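  • The alignment of a monocular depth estimate to the NeRF coordinate system, mentioned above, can be illustrated with a least-squares fit. This is a minimal sketch assuming an affine (scale and shift) relationship fitted on unmasked pixels; the source only states that the output is aligned through an optimization-based formulation, so the affine model and the name `align_depth` are assumptions.

```python
import numpy as np

def align_depth(mono_depth, nerf_depth, unmasked):
    """Fit scale and shift so that scale * mono_depth + shift matches the
    NeRF-rendered depth on unmasked pixels, then apply them everywhere.

    mono_depth, nerf_depth: (H, W) arrays; unmasked: (H, W) bool.
    """
    d = mono_depth[unmasked].ravel()
    A = np.stack([d, np.ones_like(d)], axis=1)      # design matrix [d, 1]
    (scale, shift), *_ = np.linalg.lstsq(A, nerf_depth[unmasked].ravel(), rcond=None)
    return scale * mono_depth + shift
```

  • The aligned depth inside the mask could then be used to place the inpainted reference pixels at 3D positions consistent with the rest of the scene.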
  • a component may be, but is not limited to being, a process running on a processor, a processor, an object, an executable, a thread of execution, a program, and/or a computer.
  • an application running on a computing device and the computing device can be a component.
  • One or more components can reside within a process and/or thread of execution and a component can be localized on one computer and/or distributed between two or more computers.
  • these components can execute from various computer readable media having various data structures stored thereon.
  • the components can communicate by way of local and/or remote processes such as in accordance with a signal having one or more data packets, such as data from one component interacting with another component in a local system, distributed system, and/or across a network such as the Internet with other systems by way of the signal.
  • An embodiment may relate to a system, a method, and/or a computer readable medium at any possible technical detail level of integration.
  • the computer readable medium may include a computer-readable storage medium (or media) having computer readable program instructions thereon for causing a processor to carry out operations.
  • computer-readable media may exclude transitory signals.
  • the computer readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device.
  • the computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing.
  • a non-exhaustive list of more specific examples of the computer readable storage medium includes the following: a portable computer diskette, a hard disk, a RAM, a ROM, an erasable programmable read-only memory (EPROM or Flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a DVD, a memory stick, a floppy disk, a mechanically encoded device such as punch-cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing.
  • a computer readable storage medium is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (e.g., light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire.
  • Computer readable program instructions described herein can be downloaded to respective computing/processing devices from a computer readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network and/or a wireless network.
  • the network may include copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers.
  • a network adapter card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium within the respective computing/processing device.
  • Computer readable program code/instructions for carrying out operations may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state-setting data, configuration data for integrated circuitry, or either source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++, or the like, and procedural programming languages, such as the "C" programming language or similar programming languages.
  • the computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server.
  • the remote computer may be connected to the user's computer through any type of network, including a LAN or a WAN, or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider (ISP)).
  • electronic circuitry including, for example, programmable logic circuitry, FPGAs, or programmable logic arrays (PLAs) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects or operations.
  • These computer readable program instructions may be provided to a processor of a general-purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
  • These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, a programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable storage medium having instructions stored therein includes an article of manufacture including instructions which implement aspects of the function/act specified in the flowchart and/or block diagram block or blocks.
  • the computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable apparatus or other device to produce a computer implemented process, such that the instructions which execute on the computer, other programmable apparatus, or other device implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
  • At least one of the components, elements, modules or units may be embodied as various numbers of hardware, software and/or firmware structures that execute respective functions described above, according to an example embodiment.
  • at least one of these components may use a direct circuit structure, such as a memory, a processor, a logic circuit, a look-up table, and the like, that may execute the respective functions through controls of one or more microprocessors or other control apparatuses.
  • at least one of these components may be specifically embodied by a module, a program, or a part of code, which contains one or more executable instructions for performing specified logic functions, and executed by one or more microprocessors or other control apparatuses.
  • At least one of these components may include or may be implemented by a processor such as a CPU that performs the respective functions, a microprocessor, or the like. Two or more of these components may be combined into one single component which performs all operations or functions of the combined two or more components. Also, at least part of functions of at least one of these components may be performed by another of these components. Functional aspects of the above example embodiments may be implemented in algorithms that execute on one or more processors. Furthermore, the components represented by a block or processing steps may employ any number of related art techniques for electronics configuration, signal processing and/or control, data processing and the like.
  • each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which includes one or more executable instructions for implementing the specified logical functions.
  • the method, computer system, and computer readable medium may include additional blocks, fewer blocks, different blocks, or differently arranged blocks than those depicted in the Figures.
  • the functions noted in the blocks may occur out of the order noted in the Figures. For example, two blocks shown in succession may, in fact, be executed concurrently or substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
  • each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration can be implemented by special purpose hardware-based systems that perform the specified functions or acts or carry out combinations of special purpose hardware and computer instructions.
  • if an element (e.g., a first element) is referred to as coupled with or connected to another element (e.g., a second element), the element may be coupled with the other element directly (e.g., wired), wirelessly, or via a third element.
  • a method may include obtaining a plurality of images from a user, wherein the plurality of images were acquired by an electronic device viewing a first scene and each of the plurality of images is associated with a corresponding viewpoint of the first scene.
  • the method may include obtaining a first indication identifying a first image of the plurality of images, wherein the first image is associated with a first viewpoint of the first scene.
  • the method may include obtaining a second indication of a first object to be removed from the first image.
  • the method may include removing the first object from the first image to obtain a reference image.
  • the method may include obtaining a third indication of a second viewpoint from the user, wherein the second viewpoint is different from each of the respective viewpoints of the plurality of images.
  • the method may include rendering, using a neural radiance field (NeRF), a second image that corresponds to a 3D scene as seen from the second viewpoint, wherein the 3D scene has been inpainted into the NeRF.
  • the method may include displaying the second image on a display of the electronic device.
  • the removing of the first object may include performing a first inpainting on the first image by applying a mask to the first object, to obtain the reference image.
  • the method further may include inpainting the 3D scene into the NeRF in part by adjusting a first size of the mask according to a second size of the first object that appears in the second image and applying the mask with the adjusted size to the second image.
  • the method may include based on a user input requesting an image of the 3D scene seen from the second viewpoint, inputting the reference image, and information of the second viewpoint to the NeRF to provide the second image corresponding to the 3D scene seen from the second viewpoint.
  • the method may include obtaining a fourth indication from the user, wherein the fourth indication is associated with a second object to be inpainted into the first image.
  • the method may include updating, before training the NeRF, the first image to remove the first object from the first image by using a mask, wherein the first image includes an unmasked portion and a masked portion.
  • the method may include training the NeRF after the first object is removed from the first image, wherein the NeRF is trained to output an inpainted 3D scene from an unobserved view point by accepting as input a reference inpainted view image that is obtained by selecting one of a plurality of views of a scene and applying a mask to inpaint an object into the reference view image.
  • the training may be performed at the electronic device. According to an embodiment of the disclosure, the training may be performed at a server.
  • the method may include obtaining, after the displaying, a fifth indication from the user, wherein the fifth indication is a selection of a second object to be inpainted into the first image.
  • the method may include obtaining a second representative image by inpainting the second object into the first image.
  • the method may include updating the training of the NeRF based on the second representative image.
  • the method may include rendering, using the NeRF, a third image.
  • the method may include displaying the third image on the display of the electronic device.
  • the training the NeRF may include training the NeRF, based on the reference image and the plurality of images, using a first loss associated with the unmasked portion.
  • the training the NeRF may include training the NeRF using a second loss based on the masked portion and an estimated depth, wherein the estimated depth is associated with a first geometry of the first scene in the masked portion.
  • the training the NeRF may include identifying a plurality of disoccluded pixels, wherein the plurality of disoccluded pixels are present in the target image and are associated with the second viewpoint.
  • the method may include determining a fourth loss, wherein the fourth loss is associated with a second inpainting of the plurality of disoccluded pixels of the target image.
  • the method may include training the NeRF using the fourth loss.
  • the method may include, when the first object is removed from the reference image using a first mask and a second size of the first object in other images differs from a first size in the reference image, adjusting mask sizes of respective masks in the other images proportionally to the respective object sizes of the first object in the other images.
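  • The proportional mask adjustment described above can be sketched for the simple case of a rectangular mask. The box representation, the scalar notion of object size, and the name `scale_mask_box` are illustrative assumptions; an actual mask may be an arbitrary pixel region.

```python
def scale_mask_box(box, ref_size, other_size):
    """Scale a rectangular mask about its centre by other_size / ref_size.

    box: (x0, y0, x1, y1) mask bounds in the reference image.
    ref_size, other_size: apparent object size (e.g., in pixels) in the
    reference image and in another image (hypothetical measurements).
    """
    x0, y0, x1, y1 = box
    s = other_size / ref_size
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    half_w, half_h = (x1 - x0) / 2.0 * s, (y1 - y0) / 2.0 * s
    return (cx - half_w, cy - half_h, cx + half_w, cy + half_h)
```

  • For instance, an object that appears half as large in another image would receive a mask with half the width and half the height. In practice the mask centre would also be moved to the object's location in that image, which this sketch does not model.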
  • a method of training a neural radiance field may include initially training the neural radiance field using a first loss associated with a plurality of unmasked regions respectively associated with a reference image and a plurality of target images, wherein the reference image is associated with a reference viewpoint and each target of the plurality of target images is associated with a respective target viewpoint.
  • the method may include updating the training of the neural radiance field using a second loss associated with a depth estimate of a masked region in the reference image.
  • the method may include updating the training of the neural radiance field using a third loss associated with a plurality of view-substituted images, wherein each view-substituted image of the plurality of view-substituted images is associated with the respective target view of the plurality of target images, each view-substituted image is a volume rendering from the reference viewpoint across pixels with view-substituted target colors, and the third loss is based on the plurality of view-substituted images.
  • the method may include additionally updating the training of the neural radiance field with a fourth loss, wherein the fourth loss is associated with dis-occluded pixels in each target image of the plurality of target images.
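  • The staged use of the four losses described above can be sketched as a schedule in which each term becomes active once its start step is reached. The loss names, the start steps, the unweighted sum, and the mean-squared-error form of the per-region loss are assumptions for illustration; the source does not give these details here.

```python
import numpy as np

def masked_mse(pred, target, region):
    """Mean squared error restricted to pixels where region is True."""
    if not region.any():
        return 0.0
    return float(np.mean((pred[region] - target[region]) ** 2))

def staged_loss(step, terms, start_steps):
    """Sum the loss terms whose (hypothetical) start step has been reached.

    terms:       dict mapping a loss name to its current scalar value.
    start_steps: dict mapping the same names to the step at which each
                 loss starts contributing to training.
    """
    return sum(value for name, value in terms.items() if step >= start_steps[name])
```

  • With start steps such as `{"unmasked": 0, "depth": 100, "view_sub": 200, "occluded": 300}`, training would begin with the unmasked-region loss alone and add the depth, view-substitution, and dis-occlusion terms in turn.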
  • a method of rendering an image with depth information may include obtaining image data that comprises a plurality of images that show a first scene from different viewpoints.
  • the method of rendering an image with depth information may include, based on a first user input identifying a target object from one of the plurality of images, performing a first inpainting on the one of the plurality of images to obtain a reference image by applying a mask to the target object.
  • the method of rendering an image with depth information may include inpainting a 3D scene into a neural radiance field (NeRF), based on the reference image, by adjusting a first size of the mask according to a second size of the target object in each of remaining images other than the one of the plurality of images to obtain a plurality of adjusted masks, and applying the plurality of adjusted masks to respective ones of the remaining images.
  • the method of rendering an image with depth information may include, based on a second user input requesting a first image of the 3D scene seen from a requested viewpoint, inputting the reference image and the requested viewpoint to a neural radiance field (NeRF) model to provide the first image, wherein the first image corresponds to the 3D scene seen from the requested viewpoint.
  • an apparatus may include one or more processors.
  • the apparatus may include one or more memories, the one or more memories storing instructions configured to cause the apparatus to obtain a plurality of images from a user, wherein the plurality of images were acquired by the apparatus viewing a first scene and each of the plurality of images is associated with a corresponding viewpoint of the first scene.
  • the apparatus may include one or more memories, the one or more memories storing instructions configured to cause the apparatus to obtain a first indication identifying a first image of the plurality of images, wherein the first image is associated with a first viewpoint of the first scene.
  • the apparatus may include one or more memories, the one or more memories storing instructions configured to cause the apparatus to obtain a second indication of a first object to be removed from the first image.
  • the apparatus may include one or more memories, the one or more memories storing instructions configured to cause the apparatus to remove the first object from the first image to obtain a reference image.
  • the apparatus may include one or more memories, the one or more memories storing instructions configured to cause the apparatus to obtain a third indication of a second viewpoint from the user, wherein the second viewpoint does not correspond to any of the plurality of images.
  • the apparatus may include one or more memories, the one or more memories storing instructions configured to cause the apparatus to render, using a neural radiance field (NeRF), a second image that corresponds to a 3D scene as seen from the second viewpoint, wherein the 3D scene has been inpainted into the NeRF.
  • the apparatus may include one or more memories, the one or more memories storing instructions configured to cause the apparatus to display the second image on a display of the apparatus.
  • the apparatus may include the instructions, configured to cause the apparatus to remove the first object by performing a first inpainting on the first image by applying a mask to the first object, to obtain the reference image.
  • the apparatus may include the instructions, configured to cause the apparatus to inpaint the 3D scene into the NeRF in part by adjusting a first size of the mask according to a second size of the first object that appears in the second image and applying the mask with the adjusted size to the second image.
  • the apparatus may include the instructions, configured to cause the apparatus to, based on a user input requesting an image of the 3D scene seen from the second viewpoint, input the reference image, and information of the second viewpoint to the NeRF to provide the second image corresponding to the 3D scene seen from the second viewpoint.
  • the apparatus may include the instructions, configured to cause the apparatus to obtain a fourth indication from the user, wherein the fourth indication is associated with a second object to be inpainted into the first image.
  • the apparatus may include the instructions, configured to cause the apparatus to update, before a training of the NeRF, the first image to remove the first object from the first image by using a mask, wherein the first image includes an unmasked portion and a masked portion.
  • the apparatus may include the instructions, configured to cause the apparatus to train the NeRF after the first object is removed from the first image.
  • the apparatus may be a mobile device.
  • the apparatus may include the instructions, configured to cause the apparatus to obtain the NeRF from a server after a training of the NeRF, wherein the NeRF has been trained at the server.
  • a computer-readable storage medium may store instructions.
  • the instructions when executed by at least one processor, may cause the at least one processor to obtain a plurality of images from a user, wherein the plurality of images were acquired by an electronic device viewing a first scene and each of the plurality of images is associated with a corresponding viewpoint of the first scene.
  • the instructions when executed by at least one processor, may cause the at least one processor to obtain a first indication identifying a first image of the plurality of images, wherein the first image is associated with a first viewpoint of the first scene.
  • the instructions when executed by at least one processor, may cause the at least one processor to obtain a second indication of a first object to be removed from the first image.
  • the instructions when executed by at least one processor, may cause the at least one processor to remove the first object from the first image to obtain a reference image.
  • the instructions when executed by at least one processor, may cause the at least one processor to obtain a third indication of a second viewpoint from the user, wherein the second viewpoint is different from each of the respective viewpoints of the plurality of images.
  • the instructions when executed by at least one processor, may cause the at least one processor to render, using a neural radiance field (NeRF), a second image that corresponds to a 3D scene as seen from the second viewpoint, wherein the 3D scene has been inpainted into the NeRF.
  • the instructions when executed by at least one processor, may cause the at least one processor to display the second image on a display of the electronic device.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Architecture (AREA)
  • Computer Hardware Design (AREA)
  • General Engineering & Computer Science (AREA)
  • Processing Or Creating Images (AREA)
  • Image Processing (AREA)

Abstract

The present disclosure provides methods and apparatuses for training a neural radiance field and producing a rendering of a 3D scene from a novel viewpoint with view-dependent effects. The neural radiance field is initially trained using a first loss associated with a plurality of unmasked regions associated with a reference image and a plurality of target images. The training may also be updated using a second loss associated with a depth estimate of a masked region in the reference image. The training may also be further updated using a third loss associated with a view-substituted image associated with a respective target image. The view-substituted image is a volume rendering from the reference viewpoint across pixels with view-substituted target colors. In an embodiment, the neural radiance field is additionally trained with a fourth loss. The fourth loss is associated with dis-occluded pixels in a target image.

Description

METHOD AND APPARATUS FOR REMOVING AND RENDERING AN IMAGE
This application is related to synthesizing a view of a 3D scene from a novel viewpoint.
The popularity of Neural Radiance Fields (NeRFs) for view synthesis has led to a desire for NeRF editing tools.
Using existing NeRF techniques to provide a scene representation comes with technical problems. First, the black box nature of implicit neural representations makes it infeasible to simply edit the underlying data structure based on geometric understanding. There is no explainability at the internal node level in a NeRF neural network. Second, because NeRFs are trained from images, special considerations are required for maintaining multiview consistency. Independently inpainting images of a scene using 2D inpainters yields viewpoint-inconsistent imagery. Training a standard NeRF to reconstruct these 3D-inconsistent images would result in blurry inpainting.
According to an embodiment of the disclosure, a method may include obtaining a plurality of images from a user, wherein the plurality of images were acquired by an electronic device viewing a first scene and each of the plurality of images is associated with a corresponding viewpoint of the first scene. The method may include obtaining a first indication identifying a first image of the plurality of images, wherein the first image is associated with a first viewpoint of the first scene. The method may include obtaining a second indication of a first object to be removed from the first image. The method may include removing the first object from the first image to obtain a reference image. The method may include obtaining a third indication of a second viewpoint from the user, wherein the second viewpoint is different from each of the respective viewpoints of the plurality of images. The method may include rendering, using a neural radiance field (NeRF), a second image that corresponds to a 3D scene as seen from the second viewpoint, wherein the 3D scene has been inpainted into the NeRF. The method may include displaying the second image on a display of the electronic device.
According to an embodiment of the disclosure, an apparatus may include one or more processors. The apparatus may include one or more memories, the one or more memories storing instructions configured to cause the apparatus to at least receive a plurality of images from a user, wherein the plurality of images were acquired by the apparatus viewing a first scene and each of the plurality of images is associated with a corresponding viewpoint of the first scene. The apparatus may include one or more memories, the one or more memories storing instructions configured to cause the apparatus to at least obtain a first indication identifying a first image of the plurality of images, wherein the first image is associated with a first viewpoint of the first scene. The apparatus may include one or more memories, the one or more memories storing instructions configured to cause the apparatus to at least obtain a second indication of a first object to be removed from the first image. The apparatus may include one or more memories, the one or more memories storing instructions configured to cause the apparatus to at least remove the first object from the first image to obtain a reference image. The apparatus may include one or more memories, the one or more memories storing instructions configured to cause the apparatus to at least obtain a third indication of a second viewpoint from the user, wherein the second viewpoint does not correspond to any of the plurality of images. The apparatus may include one or more memories, the one or more memories storing instructions configured to cause the apparatus to at least render, using a neural radiance field (NeRF), a second image that corresponds to a 3D scene as seen from the second viewpoint, wherein the 3D scene has been inpainted into the NeRF. 
The apparatus may include one or more memories, the one or more memories storing instructions configured to cause the apparatus to at least display the second image on a display of the apparatus.
According to an aspect of the present disclosure, a computer-readable storage medium storing instructions is provided. The instructions, when executed by at least one processor, may cause the at least one processor to obtain a plurality of images from a user, wherein the plurality of images were acquired by an electronic device viewing a first scene and each of the plurality of images is associated with a corresponding viewpoint of the first scene. The instructions, when executed by at least one processor, may cause the at least one processor to obtain a first indication identifying a first image of the plurality of images, wherein the first image is associated with a first viewpoint of the first scene. The instructions, when executed by at least one processor, may cause the at least one processor to obtain a second indication of a first object to be removed from the first image. The instructions, when executed by at least one processor, may cause the at least one processor to remove the first object from the first image to obtain a reference image. The instructions, when executed by at least one processor, may cause the at least one processor to obtain a third indication of a second viewpoint from the user, wherein the second viewpoint is different from each of the respective viewpoints of the plurality of images. The instructions, when executed by at least one processor, may cause the at least one processor to render, using a neural radiance field (NeRF), a second image that corresponds to a 3D scene as seen from the second viewpoint, wherein the 3D scene has been inpainted into the NeRF.
The instructions, when executed by at least one processor, may cause the at least one processor to display the second image on a display of the electronic device.
Provided herein is a method including receiving a plurality of images from a user, wherein the plurality of images were acquired by an electronic device viewing a first scene and each of the plurality of images is associated with a corresponding viewpoint of the first scene; receiving a first indication identifying a first image of the plurality of images, wherein the first image is associated with a first viewpoint of the first scene; receiving a second indication of a first object to be removed from the first image; removing the first object from the first image to obtain a reference image; receiving a third indication of a second viewpoint from the user, wherein the second viewpoint does not correspond to any of the plurality of images; rendering, using a neural radiance field (NeRF), a second image that corresponds to a 3D scene as seen from the second viewpoint, wherein the 3D scene has been inpainted into the NeRF; and displaying the second image on a display of the electronic device.
The text and figures are provided solely as examples to aid the reader in understanding the invention. They are not intended and are not to be construed as limiting the scope of this invention in any manner. Although certain embodiments and examples have been provided, it will be apparent to those skilled in the art based on the disclosures herein that changes in the embodiments and examples shown may be made without departing from the scope of embodiments provided herein.
FIG. 1A illustrates an example of logic for rendering an inpainted 3D scene from a novel viewpoint, according to an embodiment.
FIG. 1B illustrates an example of adding a selected object to a 3D scene, according to an embodiment.
FIG. 1C illustrates an example of a rendering of the inpainted 3D scene of FIG. 1B from the novel viewpoint, according to an embodiment.
FIG. 2A illustrates an example of a system for providing the novel view, according to an embodiment.
FIG. 2B illustrates an example of logic for training a NeRF to represent an inpainted 3D scene and using the NeRF to obtain the novel view, according to an embodiment.
FIG. 3 illustrates an example of training the NeRF to represent an inpainted 3D scene, according to an embodiment.
FIG. 4 illustrates an example of further details of training the NeRF of FIG. 3 to represent the inpainted 3D scene, according to an embodiment.
FIG. 5 illustrates an example of geometry related to a view substitution technique.
FIG. 6 represents an example of an input image.
FIG. 7 illustrates an example of removing a backpack from the input image of FIG. 6 and obtaining a text command to inpaint a red fence, and inpainting the red fence.
FIG. 8 illustrates an example of removing a backpack from the input image of FIG. 6 and obtaining a text command to inpaint a rubber duck, and inpainting the rubber duck.
FIG. 9 illustrates an example of removing a backpack from the input image of FIG. 6 and obtaining a text command to inpaint a flower pot, and inpainting the flower pot.
FIG. 10 illustrates an example of an input image with a backpack as an object to be removed.
FIG. 11 illustrates an example of the red fence replacing the backpack using 2D inpainting.
FIG. 12, related to FIG. 10, illustrates an example of replacing the backpack by pasting an image of a mailbox.
FIG. 13, related to FIG. 10, illustrates an example of replacing the backpack by inpainting the red fence and then manually pasting a shrub.
FIG. 14 illustrates an example of a reference image in which an object has been removed.
FIG. 15 illustrates an example of a set of input images.
FIG. 16 illustrates an example of a set of masks corresponding to the input images of FIG. 15.
FIG. 17 illustrates an example of an initial target view with distortion in the area corresponding to the inpainting in the reference view.
FIG. 18 illustrates an example of a residual with respect to the target view of FIG. 17.
FIG. 19 illustrates an example of an updated rendering of the target view based on the residual of FIG. 18.
FIG. 20 illustrates an example of dis-occlusion processing to improve the inpainted 3D scene represented by the NeRF for views other than the reference view.
FIG. 21 illustrates exemplary hardware for implementation of computing devices for implementing the systems and algorithms described by the figures, according to an embodiment.
The detailed description set forth below in connection with the appended drawings is intended as a description of various configurations and is not intended to represent the only configurations in which the concepts described herein may be practiced. The detailed description includes specific details for the purpose of providing a thorough understanding of various concepts. However, it is to be apparent to those skilled in the art that these concepts may be practiced without these specific details. In some instances, well known structures and components are shown in block diagram form in order to avoid obscuring such concepts. In the descriptions that follow, like parts are marked throughout the specification and drawings with the same numerals, respectively.
The following description provides examples, and is not limiting of the scope, applicability, or embodiments set forth in the claims. Changes may be made in the function and/or arrangement of elements discussed without departing from the scope of the present disclosure. Various examples may omit, substitute, or add various procedures or components as appropriate. For example, the methods described may be performed in an order different from that described, and various steps may be added, omitted, and/or combined. In an embodiment, features described with reference to some examples may be combined in other examples.
Various aspects and/or features may be presented in terms of systems that may include a number of devices, components, modules, and the like. It is to be understood and appreciated that the various systems may include additional devices, components, modules, and the like and/or may not include all of the devices, components, modules, and the like discussed in connection with the figures. A combination of these approaches may also be used.
As a general introduction to the subject matter described in more detail below, the present disclosure provides methods, apparatuses, and computer-readable mediums for inpainting an unwanted object in one of several 2D images forming a complete 3D scene representation. The unwanted object is removed from any viewpoint within the 3D scene image. Obtaining may include receiving, accessing, acquiring and the like.
NeRF techniques may be used to inpaint unwanted regions in a view-consistent manner, allowing users to exercise control over the generated scene through a single inpainted image.
NeRFs are an implicit neural field representation (e.g., coordinate mapping) for 3D scenes and objects, generally fit to multiview posed image sets. The basic constituents are (i) a field, $f_\theta : (x, d) \mapsto (c, \sigma)$, that maps a 3D coordinate, $x \in \mathbb{R}^3$, and a view direction, $d \in \mathbb{S}^2$, to a color, $c \in \mathbb{R}^3$, and density, $\sigma \in \mathbb{R}^+$, via learnable parameters $\theta$, and (ii) a rendering operator that produces color and depth for a given view pixel. The field, $f_\theta$, can be constructed in a variety of ways; the rendering operator is implemented as the classical volume rendering integral, approximated via quadrature, where a ray, r, is divided into N sections between $t_n$ and $t_f$ (the near and far bounds), with $t_i$ sampled from the i-th section. The estimated color is then given by Equation 1.
$$\hat{C}(r) = \sum_{i=1}^{N} T_i \left(1 - \exp(-\sigma_i \delta_i)\right) c_i$$
Equation 1
where $T_i = \exp\left(-\sum_{j=1}^{i-1} \sigma_j \delta_j\right)$ is the transmittance, $\delta_i = t_{i+1} - t_i$, and $c_i$ and $\sigma_i$ are the color and density at $t_i$. Replacing $c_i$ with $t_i$ or $t_i^{-1}$ in Equation 1 estimates depth, $\hat{\zeta}(r)$, and disparity (inverse depth), $\hat{D}(r)$, instead.
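The quadrature described above can be sketched as follows. This is a minimal illustration that assumes the standard NeRF form, in which sample i contributes with weight T_i (1 - exp(-sigma_i * delta_i)); the function name `render_ray` and the choice to close the last interval at the far bound are assumptions made for the example, not details taken from the disclosure.

```python
import numpy as np

def render_ray(t, sigma, color, t_far):
    """Approximate the volume rendering integral along one ray.

    t:      (N,) sample positions, sorted ascending, between the near and far bounds
    sigma:  (N,) densities at the samples
    color:  (N, 3) colors at the samples
    t_far:  far bound, used to close the last interval (an assumption of this sketch)
    """
    delta = np.diff(t, append=t_far)                 # delta_i = t_{i+1} - t_i
    alpha = 1.0 - np.exp(-sigma * delta)             # opacity of each section
    # Transmittance T_i = exp(-sum_{j<i} sigma_j * delta_j); T_1 = 1.
    accumulated = np.concatenate([[0.0], np.cumsum(sigma * delta)[:-1]])
    weights = np.exp(-accumulated) * alpha
    rgb = weights @ color                            # estimated color
    depth = weights @ t                              # color replaced by t_i
    disparity = weights @ (1.0 / t)                  # color replaced by 1 / t_i
    return rgb, depth, disparity
```

The same weights produce color, depth, and disparity, which is why a single rendering pass can supervise both appearance and geometry.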
The inputs are n input images,
Figure PCTKR2024001943-appb-img-000021
, their camera transform matrices,
Figure PCTKR2024001943-appb-img-000022
, and their corresponding masks,
Figure PCTKR2024001943-appb-img-000023
, delineating the unwanted region. The inputs also include a single inpainted reference view,
Figure PCTKR2024001943-appb-img-000024
, where
Figure PCTKR2024001943-appb-img-000025
, which provides the information which an embodiment maps, or extrapolates, into a 3D inpainting of the scene represented by the NeRF.
An embodiment uses Iref not only to inpaint the NeRF, but also to generate 3D details and view-dependent effects (VDEs) from other viewpoints.
Below, the following topics are discussed: i) the use of monocular depth estimators to guide the geometry of the inpainted region, according to the depth of the reference image,
Figure PCTKR2024001943-appb-img-000026
(see FIG. 4 items A2-1, A2-2 and A2-3), ii) the use of bilateral solvers, in conjunction with a view-substitution technique of an embodiment, to add VDEs to views other than the reference view (see FIG. 5 and FIGS. 14-19 for a depiction of the geometry supervision and VDE handling), and iii) since not all the masked target pixels are visible in the reference, an embodiment provides supervision during training for such dis-occluded pixels, via additional inpaintings (see FIG. 20). However, the present disclosure is not limited in this regard, and other methods may be utilized without departing from the scope of the present disclosure.
Training may include gaining experience with respect to a task in an attempt to improve performance of the task at a future time after the training.
In an embodiment, training is based on the following four losses: i) L_unmasked, ii) L_depth, iii) L_substituted and iv) L_occluded. These four losses represent the unmasked appearance loss, masked geometry loss, view-dependent masked color loss, and dis-occlusion loss, respectively.
The overall objective for inpainted NeRF fitting is given by Equation 2 (including weights
Figure PCTKR2024001943-appb-img-000027
on the last three terms).
Figure PCTKR2024001943-appb-img-000028
Equation 2
Supervision is computed modulo an iteration count. For example, supervision for the respective summands of Equation 2 is computed every Nunmasked, Ndepth, Nsub and Noccluded iterations. A particular loss is not used until the appropriate number of iterations has passed.
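As a rough illustration of the overall objective, the sketch below combines the four losses as a weighted sum, with weights only on the last three terms, and treats a loss that is not yet in use as contributing nothing. The function name `total_loss` and the dictionary keys are hypothetical, and the weight values are placeholders rather than values from the disclosure.

```python
def total_loss(terms, weights):
    """Weighted sum of the losses available at the current iteration.

    terms:   dict mapping a loss name to its value; a loss that is not yet
             in use is simply absent from the dict
    weights: dict with weights for "depth", "substituted" and "occluded"
    """
    total = terms.get("unmasked", 0.0)               # first term carries no weight
    for name in ("depth", "substituted", "occluded"):
        if name in terms:
            total += weights[name] * terms[name]
    return total
```

For example, early in training only the unmasked term may be present, and later calls pass additional terms as they become available.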
In the first stage of training,
Figure PCTKR2024001943-appb-img-000029
is supervised on the unmasked pixels for Nunmasked iterations, via a NeRF reconstruction loss shown in Equation 3.
Figure PCTKR2024001943-appb-img-000030
Equation 3
In Equation 3,
Figure PCTKR2024001943-appb-img-000031
(in contrast to
Figure PCTKR2024001943-appb-img-000032
) is the set of rays corresponding to the pixels in the unmasked part of the image (the part not affected by the mask) and
Figure PCTKR2024001943-appb-img-000033
is the ground truth (GT) color for the ray, r.
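A reconstruction loss restricted to the unmasked rays can be sketched as below. The sketch assumes a mean squared color error, which is the usual NeRF reconstruction loss; the exact form of Equation 3 is not reproduced in the text here, and `unmasked_loss` is a hypothetical name.

```python
import numpy as np

def unmasked_loss(pred_rgb, gt_rgb, mask):
    """Mean squared color error over pixels outside the mask.

    pred_rgb, gt_rgb: (h, w, 3) rendered and ground-truth colors
    mask:             (h, w) boolean, True inside the unwanted region
    """
    keep = ~mask                                     # rays in the unmasked set
    diff = pred_rgb[keep] - gt_rgb[keep]             # (num_unmasked, 3)
    return float(np.mean(np.sum(diff ** 2, axis=-1)))
```

Pixels inside the mask have no influence on this loss, so the unwanted object never supervises the field.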
The loss for the masked portion based on depth is developed by Equations 4, 5 and 6.
Figure PCTKR2024001943-appb-img-000034
Equation 4
Figure PCTKR2024001943-appb-img-000035
Equation 5
Figure PCTKR2024001943-appb-img-000036
Equation 6
Above, scalars h and w are the height and width of the input images.
Concerning the matrices H and V, for a pixel p at position (px, py), H(p) = px and V(p) = py.
The monocular depth estimation of the masked region from the reference image, in terms of disparity, is
Figure PCTKR2024001943-appb-img-000037
. The disparity from the NeRF model is
Figure PCTKR2024001943-appb-img-000038
.
The coefficients
Figure PCTKR2024001943-appb-img-000039
in Equation 4 are found by optimization, with F being the objective (Equation 5).
In Equation 4, J is the all-ones matrix.
The inverse of the distance between p and the mask is
Figure PCTKR2024001943-appb-img-000040
.
In Equation 6, the expectation is over
Figure PCTKR2024001943-appb-img-000041
.
Also, in Equation 6,
Figure PCTKR2024001943-appb-img-000042
is a variable obtained by optimizing
Figure PCTKR2024001943-appb-img-000043
to encourage greater smoothness around the mask. An example smoothing technique minimizes the total variation of
Figure PCTKR2024001943-appb-img-000044
around mask boundaries.
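The alignment of the monocular disparity to the NeRF disparity can be sketched as a weighted least-squares fit. This is only one plausible reading: it assumes the aligned disparity is a linear combination of J, H, V and the monocular disparity with four coefficients, that the objective F is a weighted squared error, and that the per-pixel weights (for example, the inverse distance to the mask, set to zero inside the mask) are supplied by the caller. The name `align_disparity` is hypothetical, and the smoothing step is omitted.

```python
import numpy as np

def align_disparity(mono_disp, nerf_disp, weights):
    """Fit coefficients so that a0*J + a1*H + a2*V + a3*mono_disp matches nerf_disp.

    mono_disp: (h, w) uncalibrated disparity from a monocular depth estimator
    nerf_disp: (h, w) disparity rendered by the NeRF
    weights:   (h, w) non-negative per-pixel weights (zero where no supervision)
    """
    h, w = mono_disp.shape
    V, H = np.mgrid[0:h, 0:w].astype(np.float64)     # H(p) = px, V(p) = py
    J = np.ones((h, w))                              # all-ones matrix
    basis = np.stack([J, H, V, mono_disp], axis=-1).reshape(-1, 4)
    sw = np.sqrt(weights).reshape(-1, 1)             # sqrt weights for least squares
    coeffs, *_ = np.linalg.lstsq(basis * sw, nerf_disp.reshape(-1) * sw[:, 0], rcond=None)
    aligned = (basis @ coeffs).reshape(h, w)
    return coeffs, aligned
```

The aligned disparity can then serve as the target for the masked rays, since it is expressed in the same scale as the NeRF's own disparity.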
A loss to obtain VDEs for the masked portion is developed by Equations 7, 8 and 9.
Figure PCTKR2024001943-appb-img-000045
Equation 7
Figure PCTKR2024001943-appb-img-000046
Equation 8
Figure PCTKR2024001943-appb-img-000047
Equation 9
The expectation in Equation 9 is over
Figure PCTKR2024001943-appb-img-000048
.
Above, xi is a shading point position, on a ray emanating from the reference camera (with direction
Figure PCTKR2024001943-appb-img-000049
),
Figure PCTKR2024001943-appb-img-000050
is a corresponding ray direction that intersects xi from a target-image camera (at ot).
Figure PCTKR2024001943-appb-img-000051
is an inpainted residual,
Figure PCTKR2024001943-appb-img-000052
is a reference view,
Figure PCTKR2024001943-appb-img-000053
is a view-substituted image,
Figure PCTKR2024001943-appb-img-000054
is a target color,
Figure PCTKR2024001943-appb-img-000055
is a mask,
Figure PCTKR2024001943-appb-img-000056
is a bilateral solver.
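The geometry behind view substitution can be sketched as follows: shading points are placed along a ray from the reference camera, and for each point the direction of the ray from the target camera that passes through that point is computed. Those substituted directions would then be used when querying the field for color, while the compositing still happens along the reference ray. The function name and argument layout are assumptions made for the example.

```python
import numpy as np

def substituted_directions(o_ref, d_ref, t, o_target):
    """Directions from the target camera through shading points on a reference ray.

    o_ref:    (3,) origin of the reference camera
    d_ref:    (3,) direction of the reference ray
    t:        (N,) distances of the shading points along the reference ray
    o_target: (3,) origin of the target camera
    """
    x = o_ref + t[:, None] * d_ref                   # shading points x_i
    v = x - o_target                                 # from target camera to x_i
    d_target = v / np.linalg.norm(v, axis=-1, keepdims=True)
    return x, d_target
```

Because the positions stay on the reference ray, the rendered pixel still lines up with the reference image, while its colors reflect how the surface would appear from the target viewpoint.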
A loss that supervises areas which are occluded in the reference image but visible in a non-reference image is provided by Equation 10.
Figure PCTKR2024001943-appb-img-000057
Equation 10
In Equation 10, the expectation is over
Figure PCTKR2024001943-appb-img-000058
,
Figure PCTKR2024001943-appb-img-000059
,
Figure PCTKR2024001943-appb-img-000060
, and color and disparity are
Figure PCTKR2024001943-appb-img-000061
and
Figure PCTKR2024001943-appb-img-000062
.
The above equations are discussed with reference to the drawings. Before discussing the drawings, a partial list of identifiers with comments is provided here.
L_unmasked: this is a NeRF reconstruction loss over the unmasked area of the K input images. See Equation 3.
L_depth: this loss is based on monocular depth estimation
Figure PCTKR2024001943-appb-img-000063
to predict an uncalibrated disparity of the reference image and guide the geometry. See Equation 6.
L_substituted: this loss accounts for view-dependent effects (VDEs) such as specularities and surfaces which are not rough (do not deflect light in every direction). See Equation 9.
L_occluded: the overall algorithm is focused on the reference view, and pixels which are visible in target views but not visible in the reference view are called dis-occluded pixels (they are occluded in the reference view, and become dis-occluded when the scene is viewed from other viewpoints). This loss supervises the NeRF training so that the NeRF produces plausible results with respect to these dis-occluded pixels. See Equation 10.
Figure PCTKR2024001943-appb-img-000064
: the input image chosen as the basis for the reference image.
Figure PCTKR2024001943-appb-img-000065
: the set of input images, excluding
Figure PCTKR2024001943-appb-img-000066
.
Figure PCTKR2024001943-appb-img-000067
: the reference image, constructed by inpainting a portion of
Figure PCTKR2024001943-appb-img-000068
.
Figure PCTKR2024001943-appb-img-000069
: an image of the 3D scene inpainted into the NeRF;
Figure PCTKR2024001943-appb-img-000070
is from a user-requested viewpoint, and
Figure PCTKR2024001943-appb-img-000071
is produced by the NeRF.
Figure PCTKR2024001943-appb-img-000072
: a view-substituted image produced by the NeRF and associated with one of the target viewpoints.
Figure PCTKR2024001943-appb-img-000073
: the view-substituted image with VDEs from restarget after using Equation 8.
Figure PCTKR2024001943-appb-img-000074
: confidences used by a bilateral solver in dis-occlusion processing.
Figure PCTKR2024001943-appb-img-000075
: a target view, exhibiting dis-occluded pixels.
Figure PCTKR2024001943-appb-img-000076
: a disparity image produced by the NeRF during dis-occlusion processing.
Figure PCTKR2024001943-appb-img-000077
: an inpainted version of the target view exhibiting dis-occluded pixels.
Figure PCTKR2024001943-appb-img-000078
: a disparity image obtained using bilateral guidance applied to
Figure PCTKR2024001943-appb-img-000079
.
Figure PCTKR2024001943-appb-img-000080
: a residual used in obtaining the VDEs for one of the target viewpoints.
Obtaining the novel view from the 3D scene inpainted into the NeRF is now described with respect to the figures.
FIG. 1A illustrates an example of a flowchart of a method L1 for rendering an inpainted 3D scene from a novel viewpoint.
For example, a user has a camera. At operation S1-1, the method may include capturing, by the user, several pictures, possibly as a video sequence.
At operation S1-2, the method may include selecting, by the user, one of the images as an input image.
At operation S1-3, the method may include selecting an undesired object to be removed from the input image.
In an embodiment, the electronic device may perform the selection by recommending objects to be erased by the device. The electronic device may select portions of the images with features such as, but not limited to, many light reflections, blurry portions, or portions identified by the electronic device as background objects.
In an embodiment, the method may include performing the selection by the user. For example, the user may select an area around an object, and the electronic device may analyze the identified area and select the outline of the object.
An embodiment may include an additional selection by the electronic device based on user-selected information. In this embodiment, the electronic device may analyze the selected object and recommend whether other objects of a similar type to the object selected by the user should also be selected and erased from the images.
The method may include removing the undesired object from the images using masks.
In an embodiment, at operation S1-4, a device may obtain, from the user, information about the object to be inpainted into an image. The device may be an electronic device. The information may be obtained in various ways, such as, but not limited to, by text, by voice, by click, or by touch. An image or a video corresponding to the text or to the voice input may be shown, and those images and videos can be inserted into the desired input location. As one example, the device used by the user (possibly a mobile terminal which includes the camera) may determine the identity of a desired object from the user. The identification may be by various methods, such as, but not limited to, a voice command, a text command, a touch command, a click command, or an image or a video submitted to the device. The desired object is inpainted into a reference image. As an example, in an embodiment, the method includes allowing the user to communicate the new object not only by text or voice, but also by providing an image of the desired object, for example by manual insertion of an image, and the like. However, the present disclosure is not limited in this regard, and other communication methods may be utilized without departing from the scope of the present disclosure. The inserted image, in an embodiment, is downloaded from the Internet (something the user found appealing), or the inserted image is from the user's photo gallery or another photo gallery.
In an embodiment, there may be multiple images corresponding to the text when a user enters text. An embodiment may be configured to allow a user to select from the multiple images indicated in a list shown at the bottom of the electronic device user interface display or on the side of the electronic device user interface display. An embodiment also allows the device to move the image part as desired once the image that corresponds to the multiple texts is selected, and that image part enters the inpainted region.
At operation S1-5, the device, using the NeRF, may remove the undesired object and fill in the gap in the 3D scene with the desired object; this creates Iref. Methods for performing this inpainting are known to practitioners working in this field. Iref is an inpainted reference view, providing the information that a user expects to be extrapolated into a 3D inpainting of the scene which is the subject of the images {Ii}.
At operation S1-6, the method may include training a neural radiance field (NeRF) to represent the inpainted 3D scene. See FIGS. 3-4 and Equations 1-10.
At operation S1-7, the method may include providing, by the user, a viewpoint from which to view the 3D scene.
At operation S1-8, the device may render the novel viewpoint and display it to the user. See FIG. 1C.
At operation S1-9, the method may include choosing, by the user, to inpaint another object or to view the 3D scene from yet another viewpoint.
FIG. 1B illustrates adding a selected object to a 3D scene, according to an embodiment.
FIG. 1B illustrates an example of a mobile device displaying a first image Iin. Examples of a mobile device may be a smartphone with a camera, a tablet PC with a camera and the like. A mobile device is an example and embodiments are not limited to mobile devices. An embodiment is applicable to electronic devices, such as, but not limited to, an AR headset, smart glasses, or a smartphone. The method may include selecting an object (for example, a flowerpot in FIG. 1B) and adding it to the first image Iin to obtain an inpainted version of the first image, the reference image Iref.
FIG. 1C illustrates an example of a rendering of the inpainted 3D scene of FIG. 1B from the novel viewpoint to obtain the image Inovel.
FIG. 2A illustrates an example of the overall system for providing the novel view. K views, K masks, the reference view with an additional object inpainted, and a request for a rendering from a novel viewpoint are provided to the NeRF. The training of the NeRF may occur at a mobile terminal, at a server and the like. The NeRF may provide a novel view Inovel of an inpainted 3D scene.
FIG. 2B illustrates an example of a method L2 for training a NeRF to represent an inpainted 3D scene and using the NeRF to obtain the novel view Inovel. In an embodiment, the method may include obtaining K views of a scene as an input. The ith view may be denoted as image Ii.
At operation S2-1, the method may include segmenting an undesired object to remove it from the scene in each view. This may result in a mask for each view. The ith mask may be denoted Mi.
At operation S2-2, the method may include selecting one of the images from the set {Ii} as the input image from which to create the reference image Iref. At operation S2-3, the method may include training an inpainting neural radiance field to represent an inpainted 3D scene. The NeRF may be a neural network specific to the scene.
At operation S2-4, the method may include using the NeRF to render the inpainted 3D scene from a novel viewpoint, to obtain Inovel.
FIG. 3 illustrates an example of method L3 for training the NeRF to represent an inpainted 3D scene in terms of four training epochs.
Each training phase in the figure has a predefined number of iterations inside it. Each training iteration in NeRF training samples random rays from the input views in the scene, renders them using the current NeRF network, and updates the NeRF parameters by minimizing the corresponding losses.
The loss L_unmasked may be used at operation A1. See Equation 3. Operation A1 may be performed once every Nunmasked iterations. Input view and camera parameters, masks, (inpainted) reference view may be used as input at operation A1. At operation A1, the method may include training the NeRF for the unmasked portion using the loss L_unmasked. At operation A1, the losses may be cumulative. The method may include training with available losses.
The losses L_depth and L_unmasked may be used at operation A2. See Equations 3 and 6. Operation A2 may be performed once every Ndepth iterations. At operation A2, the method may include a depth estimation of the masked portion. The depth estimation of the masked portion may include training using L_depth and L_unmasked. At operation A2, the losses may be cumulative. The method may include training with available losses.
The losses L_substituted, L_depth and L_unmasked may be used at operation A3. See Equations 3, 6 and 9. Operation A3 may be performed once every Nsubstituted iterations. K-1 target views, the (inpainted) reference view, and the result of operation A2 may be used as input at operation A3. At operation A3, the method may include view substitution training using L_substituted, L_depth and L_unmasked. At operation A3, the losses may be cumulative. The method may include training with available losses.
The losses L_occluded, L_substituted, L_depth and L_unmasked may be used at operation A4. See Equations 3, 6, 9 and 10. Operation A4 may be performed once every Noccluded iterations. At operation A4, the method may include training on dis-occluded pixels in the target views using L_occluded, L_substituted, L_depth and L_unmasked. At operation A4, the losses may be cumulative. The method may include training with available losses. Operation A4 may output a trained NeRF representing the inpainted 3D scene.
In an embodiment, one or more of A2, A3 and A4 may be omitted entirely when training the NeRF.
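The cumulative use of losses across operations A1 to A4 can be sketched as a small lookup. This is only an illustrative sketch: the names `PHASE_LOSSES`, `INTRODUCED_BY`, `active_losses` and `total_loss`, as well as the per-loss weights, are hypothetical and are not taken from the disclosure. It assumes that when an operation is omitted, the loss it introduces is simply unavailable to later operations.

```python
# Hypothetical sketch of the cumulative loss sets of FIG. 3 (names are assumptions).
PHASE_LOSSES = {
    "A1": ["unmasked"],
    "A2": ["unmasked", "depth"],
    "A3": ["unmasked", "depth", "substituted"],
    "A4": ["unmasked", "depth", "substituted", "occluded"],
}

# Operation that first makes each loss available.
INTRODUCED_BY = {
    "unmasked": "A1",
    "depth": "A2",
    "substituted": "A3",
    "occluded": "A4",
}


def active_losses(phase, skipped=()):
    """Return the loss names used in `phase`, dropping losses of skipped operations."""
    return [name for name in PHASE_LOSSES[phase] if INTRODUCED_BY[name] not in skipped]


def total_loss(values, phase, weights=None, skipped=()):
    """Sum the available losses; `weights` are assumed, defaulting to 1.0 each."""
    weights = weights or {}
    return sum(weights.get(name, 1.0) * values[name]
               for name in active_losses(phase, skipped))
```

For example, `active_losses("A4", skipped=("A2",))` would drop the depth term while keeping the other three, which is one way to read "training with available losses".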
FIG. 4 illustrates an example of further details of training the NeRF of FIG. 3 to represent the inpainted 3D scene.
At operation A1-1, the NeRF may be trained for the unmasked portion of the images {Ii}. At operation A1-1, training may be performed using L_unmasked.
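As an illustration of supervising only the unmasked portion, the following sketch computes a mean squared error over pixels outside the mask. The exact form of L_unmasked is given by Equation 3 of the disclosure; the squared-error form, the flat-list image layout, and the function name `unmasked_mse` are assumptions made for this example.

```python
def unmasked_mse(rendered, observed, mask):
    """Mean squared error over pixels where mask == 0 (outside the object mask).

    rendered, observed: flat lists of pixel intensities (assumed layout).
    mask: flat list of 0/1 values, 1 marking the undesired-object region.
    """
    total = 0.0
    count = 0
    for r, o, m in zip(rendered, observed, mask):
        if m == 0:  # masked pixels contribute nothing to this loss
            total += (r - o) ** 2
            count += 1
    return total / count if count else 0.0
```

Pixels under the mask do not influence this term, so the masked region has to be supervised by the other losses.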
At operation A2-1, depth may be obtained of the masked portion in the reference image. At operation A2-2, disparity alignment and smoothing may be performed. At operation A2-3, training may be performed using L_unmasked and L_depth.
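One common way to align a monocular disparity estimate with the disparity rendered by a NeRF is a least-squares scale and shift fitted on the unmasked pixels. The sketch below assumes that form of alignment; the disclosure does not spell it out in this passage, the smoothing step is not shown, and the names `fit_scale_shift` and `apply_scale_shift` are hypothetical.

```python
def fit_scale_shift(estimated, rendered, mask):
    """Fit a, b minimizing sum((a * estimated + b - rendered)^2) where mask == 0."""
    xs = [e for e, m in zip(estimated, mask) if m == 0]
    ys = [r for r, m in zip(rendered, mask) if m == 0]
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    denom = n * sxx - sx * sx
    if n < 2 or denom == 0:
        raise ValueError("need at least two distinct unmasked disparities")
    a = (n * sxy - sx * sy) / denom
    b = (sy - a * sx) / n
    return a, b


def apply_scale_shift(estimated, a, b):
    """Map every estimated disparity (masked ones included) into the NeRF's scale."""
    return [a * e + b for e in estimated]
```

The fitted parameters are then applied to the whole estimate, so the masked region receives disparities expressed in the same coordinate system as the rest of the scene.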
At operation A3-1, colors along a ray from the reference camera may be obtained, but with view directions from target cameras. This is referred to as view substitution. At operation A3-2, the view-substituted rendering Iref,target may be compared with the reference view Iref to get a residual,
Figure PCTKR2024001943-appb-img-000081
. At operation A3-3, view-dependent effects (VDEs) may be obtained by using a bilateral solver. The bilateral solver may treat Iref as reference input. At operation A3-3, confidence may be zero inside the mask. See Equation 8. At operation A3-4, target colors that include the VDEs for this view may be obtained. At operation A3-5, training may be performed using L_unmasked, L_depth and L_substituted. See Equation 9.
At operation A4-1, disoccluded pixels may be determined by reprojecting all pixels from the reference view into a target view. At operation A4-2, the disoccluded pixels may be inpainted for view t using leftmost, rightmost and topmost target images. At operation A4-3, a disparity version of the disoccluded pixels may be inpainted using a bilateral solver. At operation A4-4, training of the NeRF may be performed using L_unmasked, L_depth, L_substituted, and L_occluded. See Equations 3, 6, 9 and 10. Operation A4-4 may output a trained NeRF representing the inpainted 3D scene.
FIG. 5 illustrates an example of geometry related to a view substitution technique. The view substitution technique disclosed herein may enable rendering from the reference viewpoint, but with the view-dependent effects of a target viewpoint, by substituting the directional input to the per-shading-point neural color field. The upper portion of FIG. 5, 510 illustrates that, given a shading point position, xi, on a ray emanating from the reference camera (with direction
Figure PCTKR2024001943-appb-img-000082
), an embodiment may obtain the corresponding ray direction,
Figure PCTKR2024001943-appb-img-000083
, that intersects xi from a target-image camera (at ot). See Equation 7. The lower portion of FIG. 5, 520 and 530, illustrates, at 520, that standard inputs may be used to query the NeRF for the color,
Figure PCTKR2024001943-appb-img-000084
, at shading point xi. Portion 530 of FIG. 5 shows that view-substituted inputs may be used to query the NeRF, obtaining
Figure PCTKR2024001943-appb-img-000085
as the color instead.
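The geometry of view substitution can be illustrated with a few lines of vector arithmetic. The sketch assumes that the substituted direction is the unit vector pointing from the target camera center ot to the shading point xi, which is the natural reading of the description; the exact expression is given by Equation 7. The function names `shading_point` and `substituted_direction` are hypothetical.

```python
import math


def shading_point(o_r, d_r, t):
    """Point at distance t along the reference ray o_r + t * d_r."""
    return [o + t * d for o, d in zip(o_r, d_r)]


def substituted_direction(x_i, o_t):
    """Unit direction from the target camera center o_t through shading point x_i.

    Assumed convention: ray directions point from the camera into the scene.
    """
    diff = [x - o for x, o in zip(x_i, o_t)]
    norm = math.sqrt(sum(c * c for c in diff))
    if norm == 0.0:
        raise ValueError("shading point coincides with the target camera center")
    return [c / norm for c in diff]
```

Querying the color field at xi with this direction, instead of the reference ray direction, keeps the pixel layout of the reference view while taking the view-dependent appearance of the target view.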
The NeRF (for example, in FIG. 5) may provide 3D information (3D point color and density), which then has to be integrated along a ray to be rendered (i.e., to get a view). The output from a NeRF network may be 3D.
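The integration of per-point color and density along a ray is commonly done with the standard NeRF quadrature rule, sketched below. The sketch assumes samples are ordered from near to far and uses the usual alpha-compositing weights; the function name `render_ray` and the list-based data layout are choices made for this example only.

```python
import math


def render_ray(colors, densities, deltas):
    """Composite per-sample RGB colors along one ray into a single pixel color.

    colors: list of [r, g, b] per shading point, ordered near to far.
    densities: list of non-negative volume densities (sigma) per shading point.
    deltas: list of distances between consecutive samples.
    """
    transmittance = 1.0  # fraction of light not yet absorbed along the ray
    pixel = [0.0, 0.0, 0.0]
    for color, sigma, delta in zip(colors, densities, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this segment
        weight = transmittance * alpha
        pixel = [p + weight * c for p, c in zip(pixel, color)]
        transmittance *= 1.0 - alpha
    return pixel
```

A sample with zero density contributes nothing, and a very dense sample hides everything behind it, which is why the network's 3D output only becomes a view after this per-ray accumulation.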
FIGS. 6-11 present some example results at the level of image changes. FIG. 6 represents an example of an input image, Iin. FIG. 7 illustrates an example of removing a backpack from the input image of FIG. 6 and obtaining a text command to inpaint a red fence, and inpainting the red fence. FIG. 8 illustrates an example of removing a backpack from the input image of FIG. 6 and obtaining a text command to inpaint a rubber duck, and inpainting the rubber duck. FIG. 9 illustrates an example of removing a backpack from the input image of FIG. 6 and obtaining a text command to inpaint a flower pot, and inpainting the flower pot. FIG. 10 illustrates an example of an input image with a backpack as an undesired object to be removed. FIG. 11 illustrates an example of the red fence replacing, in an inpainted region, the backpack using 2D inpainting. A text command is an example, and embodiments are not limited to text commands. An embodiment can obtain information about the object to be inpainted into an image in various forms, such as, but not limited to, text, voice, a click, or a touch. An image or video corresponding to the text or the voice may be shown, and that image or video can be inserted at the desired input location.
In an embodiment, a device may obtain (e.g. receive, capture, download) a plurality of images or a short video, while moving a camera around a scene. The device may then interactively segment the object of interest from the scene, using well known techniques (e.g. SPIn-NeRF).
In an embodiment, reference-guided controllable 3D scene inpainting may be performed. The method may include selecting a view and using a controllable 2D inpainting method to inpaint the object. The controllable inpainting method may be, for example, stable diffusion inpainting guided by text input. Alternatively, the method may include creating the inpainted image by first inpainting it with the background using any 2D inpainting method and then overlaying an object of interest manually in the inpainted region. An inpainting NeRF may then be trained, guided by the single inpainted view. The inpainted NeRF may be used to render the inpainted 3D scene from arbitrary views.
For example, FIG. 12, related to FIG. 10, illustrates an example of replacing the backpack by pasting an image of a mailbox. FIG. 13, related to FIG. 10, illustrates an example of replacing the backpack by inpainting the red fence and then manually pasting a shrub. In an embodiment, the method may include obtaining an indication of a selection of an object to be inpainted into the first image.
However, the present disclosure is not limited in this regard, and other methods or examples may be utilized without departing from the scope of the present disclosure.
FIGS. 14 to 19 illustrate examples of images that describe a method for training NeRFs with view substitution. FIG. 14 illustrates an example of a reference image, Iref, in which an object has been removed. FIG. 15 illustrates an example of a set of input images. The images of {Ii} other than Iin are referred to as target images. The undesired object, UO, in FIG. 15 is a music book on a piano stand. FIG. 16 illustrates an example of a set of masks Mi corresponding to the input images of FIG. 15. FIG. 17 illustrates an example of an initial target view, Iref,target with distortion in the area corresponding to the inpainting in the reference view. FIG. 18 illustrates an example of a residual, restarget with respect to the target view of FIG. 17.
FIG. 19 illustrates an example of an updated rendering,
Figure PCTKR2024001943-appb-img-000086
of the target view based on the residual of FIG. 18.
An embodiment may provide view-dependent effects as follows. For each target, t, the scene may be rendered from the reference camera with target colors to get the view-substituted image,
Figure PCTKR2024001943-appb-img-000087
(FIG. 17). A bilateral solver may inpaint the residual between the reference view and the view-substituted image, see Equation 8, resulting in the inpainted residual, restarget (FIG. 18), which is subtracted from the reference view to get the target color,
Figure PCTKR2024001943-appb-img-000088
(FIG. 19). The discrepancy between the target colors and the view-substituted images may provide supervision for the masked region.
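The residual arithmetic described above can be sketched on a row of grayscale pixels. The bilateral solver of Equation 8 is not reproduced here; as a loudly labeled stand-in, the masked residual is filled with the mean of the unmasked residual, which mimics giving zero confidence to values inside the mask. The function name `target_colors` and the flat-list layout are hypothetical.

```python
def target_colors(i_ref, i_sub, mask):
    """Return target colors: reference view minus the inpainted residual.

    i_ref: reference-view pixels; i_sub: view-substituted pixels (same layout).
    mask: 0/1 per pixel, 1 marking the inpainted region (zero confidence).
    """
    # Residual between the reference view and the view-substituted image.
    residual = [r - s for r, s in zip(i_ref, i_sub)]
    trusted = [res for res, m in zip(residual, mask) if m == 0]
    # Stand-in for the bilateral solver: fill masked residuals with the
    # mean of the trusted ones (an assumption, not the disclosed solver).
    fill = sum(trusted) / len(trusted) if trusted else 0.0
    inpainted = [fill if m else res for res, m in zip(residual, mask)]
    # Subtracting the inpainted residual from the reference gives the target.
    return [r - res for r, res in zip(i_ref, inpainted)]
```

Outside the mask the result equals the view-substituted image itself; inside the mask it is the reference color shifted by the propagated residual, which is how the view-dependent change is carried into the inpainted region.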
After obtaining the view substituted images
Figure PCTKR2024001943-appb-img-000089
(after at least Nsubstituted iterations), the training may be able to supervise the masked appearances of the target images. Each such image
Figure PCTKR2024001943-appb-img-000090
may look at the scene via the reference source camera (e.g., has the image structure of Iref), but may have the colors (in particular, VDEs) of Itarget. An embodiment may use those colors, obtained by the bilateral solver of Equation 8, to supervise the target view appearance under the mask (that is, in Rmask). An embodiment may render each view-substituted image inside the mask (obtaining
Figure PCTKR2024001943-appb-img-000091
as in FIG. 17), and compute a reconstruction loss by comparing it to the bilaterally inpainted output,
Figure PCTKR2024001943-appb-img-000092
as shown in Equation 9.
FIG. 20 illustrates an example of dis-occlusion processing to improve the inpainted 3D scene represented by the NeRF for views other than the reference view.
While single-reference inpainting may prevent problems incurred by view-inconsistent inpaintings, it is missing multiview information in the inpainted region. For example, when inserting a duck into the scene (see FIG. 20), viewing the scene from another perspective naturally may unveil new details on and around the duck, due to dis-occlusions (see the dark areas marked as
Figure PCTKR2024001943-appb-img-000093
in the image second from left in FIG. 20). An embodiment may construct these missing details.
An embodiment may identify pixels in the target view,
Figure PCTKR2024001943-appb-img-000094
(also referred to as
Figure PCTKR2024001943-appb-img-000095
), that are not visible from the reference view, to build a dis-occlusion mask,
Figure PCTKR2024001943-appb-img-000096
. From
Figure PCTKR2024001943-appb-img-000097
, an embodiment then may inpaint a
Figure PCTKR2024001943-appb-img-000098
-masked color, see the upper right image in FIG. 20(
Figure PCTKR2024001943-appb-img-000099
). This is followed by in-filling a disparity rendered image, using bilateral guidance to ensure consistency. See the upper right image in FIG. 20 (
Figure PCTKR2024001943-appb-img-000100
) and the disparity image
Figure PCTKR2024001943-appb-img-000101
of FIG. 20 which are arguments for terms in L_occluded of Equation 10. Finally, these inpainted disoccluded values may be used for supervision. See A4 of FIG. 3.
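Building a dis-occlusion mask by reprojection can be sketched with a pinhole camera model. The sketch below assumes that both views share the same intrinsics (focal length `f`, principal point `cx`, `cy`), that the relative pose maps reference-camera coordinates to target-camera coordinates as X_t = R X_r + t, and that nearest-pixel splatting is accurate enough. All of these, and the function name `disocclusion_mask`, are assumptions for illustration.

```python
def disocclusion_mask(ref_depth, f, cx, cy, R, t):
    """Mark target pixels that receive no reprojected reference pixel.

    ref_depth: 2D list of z-depths for the reference view.
    R: 3x3 rotation (nested lists); t: translation of length 3.
    Returns a 2D list of booleans, True where the target pixel is dis-occluded.
    """
    h, w = len(ref_depth), len(ref_depth[0])
    covered = [[False] * w for _ in range(h)]
    for v in range(h):
        for u in range(w):
            z = ref_depth[v][u]
            # Back-project the reference pixel to a 3D point.
            p = ((u - cx) * z / f, (v - cy) * z / f, z)
            # Move the point into the target camera frame.
            q = [sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3)]
            if q[2] <= 0:
                continue  # behind the target camera
            ut = round(f * q[0] / q[2] + cx)
            vt = round(f * q[1] / q[2] + cy)
            if 0 <= ut < w and 0 <= vt < h:
                covered[vt][ut] = True
    return [[not c for c in row] for row in covered]
```

Target pixels left uncovered are the ones not visible from the reference view; in the described method these are the pixels whose color and disparity are subsequently inpainted and used for supervision.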
Table 1 shows a quantitative full-reference (FR) evaluation of 3D inpainting techniques on the inpainted areas of held-out views from the SPIn-NeRF dataset. Columns show distance from known ground-truth images of the scene (without the target object), based on learned perceptual image patch similarity (LPIPS) and Fréchet inception distance (FID), a feature-based statistical distance. Lower values are better for both metrics.
An embodiment with stable diffusion (SD) performs best by both metrics.
Table 1. Quantitative full-reference (FR) evaluation of 3D inpainting techniques
Method | LPIPS | FID
NeRF + LaMa (2D) | 0.5369 | 174.61
Object NeRF | 0.6829 | 271.80
L_unmasked | 0.6030 | 294.69
L_unmasked + DreamFusion | 0.5934 | 264.71
NeRF-In, multiple | 0.5699 | 238.33
NeRF-In, single | 0.4884 | 183.23
SPIn-NeRF-SD | 0.5701 | 186.48
SPIn-NeRF-LaMa | 0.4654 | 156.64
An embodiment (FIGS. 3-4 and Equations 1-10), using stable diffusion | 0.4532 | 116.24
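The gap between the last two rows of Table 1 can be expressed as a relative reduction. The short sketch below only restates arithmetic on the tabulated values; the helper name `relative_reduction` is hypothetical.

```python
def relative_reduction(baseline, value):
    """Fraction by which `value` is lower than `baseline`."""
    return (baseline - value) / baseline


# Values copied from Table 1: SPIn-NeRF-LaMa versus the embodiment using stable diffusion.
fid_reduction = relative_reduction(156.64, 116.24)
lpips_reduction = relative_reduction(0.4654, 0.4532)
```

With these inputs, the FID reduction comes out at roughly 0.26 and the LPIPS reduction at roughly 0.026, so the improvement is far larger in FID than in LPIPS.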
As seen in Table 1, an embodiment may provide the best performance on both FR metrics. The Object-NeRF and Masked-NeRF approaches, which perform object removal without altering the newly revealed areas, perform the worst. Combining Masked-NeRF with DreamFusion performs slightly better. This indicates some utility of the diffusion prior; however, while DreamFusion can generate impressive 3D entities in isolation, it does not produce sufficiently realistic outputs for inpainting real scenes. SPIn-NeRF-SD obtains a similarly poor LPIPS, though with better FID. It is unable to cope with the greater mismatches of the SD generations. NeRF-In outperforms the aforementioned models. Still, the use of a pixelwise loss leads to blurry outputs. Finally, an embodiment outperforms the second-best model (SPIn-NeRF-LaMa) considerably in terms of FID, reducing it by ~25%. An embodiment is also applicable to videos. Table 2 provides an indication of the technical improvement. SD and LaMa are known inpainters.
Table 2. Quantitative full-reference (FR) evaluation of 3D inpainting techniques on videos
Method | Sharpness | MUSIQ
SPIn-NeRF-LaMa | 354.31 | 58.10
An embodiment, using LaMa | 394.55 | 62.0
An embodiment, using SD | 398.56 | 61.47
FR measures are limited by their use of a single ground-truth (GT) target image. No-reference (NR) performance is therefore also examined, demonstrating improvements over SPIn-NeRF in terms of both sharpness (by 11.2%) and MUSIQ (by 5.8%); see Table 2. Table 2 indicates that embodiments provide a novel view which is numerically sharper and more realistic.
FIG. 21 illustrates an exemplary apparatus 21-1 for implementation of an embodiment disclosed herein. FIG. 21 illustrates hardware for performing embodiments provided. The apparatus 21-1 may be a server, a computer, a laptop computer, a handheld device, or a tablet computer device, for example.
As an example, the NeRF of FIG. 2A performing the method L2 of FIG. 2B is located on the electronic device, and the method L2 may process information obtained from an input unit of the electronic device.
As an example, the NeRF of FIG. 2A performing the method L2 of FIG. 2B is located on a server, and the images are server images. An input value for the server image (area selection, object information to be inpainted, content obtained from text, voice, and the like) may be obtained from the communication unit of the server and applied using the method L2 of FIG. 2B.
Apparatus 21-1 may include one or more hardware processors 21-9. The one or more hardware processors 21-9 may include an ASIC (application specific integrated circuit), a CPU (for example, a CISC or RISC device), and/or custom hardware. An embodiment can be deployed on various GPUs. As an example, a provider of GPUs is Nvidia™, Santa Clara, California. For example, an embodiment may have been deployed on Nvidia™ A6000 GPUs with 48GB of GDDR6 memory.
An embodiment may be deployed on various computers, servers or workstations. Lambda™ is a workstation company in San Francisco, California. Experiments using embodiments have been conducted on a Lambda™ Vector Workstation.
Apparatus 21-1 also may include a user interface 21-5 (for example a display screen and/or keyboard and/or pointing device such as a mouse). Apparatus 21-1 may include one or more volatile memories 21-2. Apparatus 21-1 may include one or more non-volatile memories 21-3. The one or more non-volatile memories 21-3 may include a computer readable medium storing instructions for execution by the one or more hardware processors 21-9 to cause apparatus 21-1 to perform any of the methods of embodiments disclosed herein.
Apparatus 21-1 may include wired and/or wireless interfaces 21-4. The wired and/or wireless interfaces 21-4 may include a receiver component, a transmitter component, and/or a transceiver component. The wired and/or wireless interfaces 21-4 may enable the apparatus 21-1 to establish connections and/or transfer communications with other devices (e.g., a server, another device). The communications may be effected via a wired connection, a wireless connection, or a combination of wired and wireless connections. The wired and/or wireless interfaces 21-4 may permit the apparatus 21-1 to receive information from another device and/or provide information to another device. In an embodiment, the wired and/or wireless interfaces 21-4 may provide for communications with another device via a network, such as, but not limited to a local area network (LAN), a wide area network (WAN), a metropolitan area network (MAN), a private network, an ad hoc network, an intranet, the Internet, a fiber optic-based network, a cellular network (e.g., a fifth generation (5G) network, a long-term evolution (LTE) network, a third generation (3G) network, a code division multiple access (CDMA) network, and the like), a public land mobile network (PLMN), a telephone network (e.g., the Public Switched Telephone Network (PSTN)), or the like, and/or a combination of these or other types of networks. In an embodiment, the wired and/or wireless interfaces 21-4 may provide for communications with another device via a device-to-device (D2D) communication link, such as, but not limited to FlashLinQ, WiMedia, Bluetooth®, Bluetooth® Low Energy (BLE), ZigBee, Institute of Electrical and Electronics Engineers (IEEE) 802.11x (Wi-Fi), LTE, 5G, and the like. In an embodiment, the wired and/or wireless interfaces 21-4 may include an Ethernet interface, an optical interface, a coaxial interface, an infrared interface, a radio frequency (RF) interface, a USB interface, an IEEE 1394 (FireWire) interface, or the like.
Apparatus 21-1 may include a display device 21-6. The display device 21-6 may include one or more components that serve to present information from the set of components of the apparatus 21-1. For example, the display device 21-6 may be a computer monitor, a smartphone screen, a television (TV), a tablet screen, a digital watch, an AR headset, and the like. The present disclosure is not limited in this regard.
Apparatus 21-1 may include a bus 21-7. The set of components of the apparatus 21-1 may be communicatively coupled via the bus 21-7. The bus 21-7 may include one or more components that may permit communication among the set of components of the apparatus 21-1. For example, the bus 21-7 may be a communication bus, a cross-over bar, a network, or the like. Although the bus 21-7 is depicted as a single line in FIG. 21, the bus 21-7 may be implemented using multiple (e.g., two or more) connections between the set of components of the apparatus 21-1. The present disclosure is not limited in this regard.
An embodiment provides an approach to inpaint NeRFs via a single inpainted reference image. An embodiment may use a monocular depth estimator, aligning its output to the coordinate system of the inpainted NeRF to back-project the inpainted material from the reference view into 3D space. An embodiment uses bilateral solvers to add VDEs to the inpainted region, and uses 2D inpainters to fill dis-occluded areas. Table 1 and Table 2, using multiple evaluation metrics, illustrate the superiority of an embodiment over prior 3D inpainting methods.
Finally, an embodiment includes a controllability advantage, enabling users to easily alter a generated 3D scene through a single guidance image (Iref). However, the present disclosure is not limited in this regard, and other advantages may be realized without departing from the scope of the present disclosure.
An embodiment of the present disclosure may solve one or more technical problems.
An embodiment may use a single inpainted reference, thus avoiding view inconsistencies. To geometrically supervise the inpainted area, an embodiment may use an optimization-based formulation with monocular depth estimation. An embodiment may obtain view dependent effects (VDEs) of non-reference views from the reference viewpoint. This may enable a guided inpainting approach, propagating non-reference colors (with VDEs) into the mask area of the 3D scene represented by the NeRF. An embodiment may also inpaint disoccluded appearance and geometry in a consistent manner.
An embodiment may be provided for inpainting regions in a view-consistent and controllable manner. In addition to the typical NeRF inputs and masks delineating the unwanted region in each view, an embodiment may require only a single inpainted view of the scene, e.g., a reference view. An embodiment may use monocular depth estimators to back-project the inpainted view to the correct 3D positions. Then, via a novel rendering technique, a bilateral solver of an embodiment may construct view-dependent effects in non-reference views, making the inpainted region appear consistent from any view. For non-reference disoccluded regions, which cannot be supervised by the single reference view, an embodiment may provide a method based on image inpainters to guide both the geometry and appearance. An embodiment may show superior performance to NeRF inpainting baselines, with the additional advantage that a user can control the generated scene via a single inpainted image.
The foregoing disclosure provides illustration and description, but is not intended to be exhaustive or to limit the implementations to the precise form disclosed. Modifications and variations are possible in light of the above disclosure or may be acquired from practice of the implementations.
As used herein, the terms "component," "module," "system" and the like are intended to include a computer-related entity, such as but not limited to hardware, firmware, a combination of hardware and software, software, or software in execution. For example, a component may be, but is not limited to being, a process running on a processor, a processor, an object, an executable, a thread of execution, a program, and/or a computer. By way of illustration, both an application running on a computing device and the computing device can be a component. One or more components can reside within a process and/or thread of execution and a component can be localized on one computer and/or distributed between two or more computers. In addition, these components can execute from various computer readable media having various data structures stored thereon. The components can communicate by way of local and/or remote processes such as in accordance with a signal having one or more data packets, such as data from one component interacting with another component in a local system, distributed system, and/or across a network such as the Internet with other systems by way of the signal.
An embodiment may relate to a system, a method, and/or a computer readable medium at any possible technical detail level of integration. The computer readable medium may include a computer-readable storage medium (or media) having computer readable program instructions thereon for causing a processor to carry out operations. Computer-readable media may exclude transitory signals.
The computer readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. A non-exhaustive list of more specific examples of the computer readable storage medium includes the following: a portable computer diskette, a hard disk, a RAM, a ROM, an erasable programmable read-only memory (EPROM or Flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a DVD, a memory stick, a floppy disk, a mechanically encoded device such as punch-cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing. A computer readable storage medium, as used herein, is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (e.g., light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire.
Computer readable program instructions described herein can be downloaded to respective computing/processing devices from a computer readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network and/or a wireless network. The network may include copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. A network adapter card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium within the respective computing/processing device.
Computer readable program code/instructions for carrying out operations may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state-setting data, configuration data for integrated circuitry, or either source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++, or the like, and procedural programming languages, such as the "C" programming language or similar programming languages. The computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a LAN or a WAN, or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider (ISP)). In an embodiment, electronic circuitry including, for example, programmable logic circuitry, FPGAs, or programmable logic arrays (PLAs) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects or operations.
These computer readable program instructions may be provided to a processor of a general-purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, a programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable storage medium having instructions stored therein includes an article of manufacture including instructions which implement aspects of the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable apparatus or other device to produce a computer implemented process, such that the instructions which execute on the computer, other programmable apparatus, or other device implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
At least one of the components, elements, modules or units (collectively "components" in this paragraph) represented by a block in the drawings may be embodied as various numbers of hardware, software and/or firmware structures that execute respective functions described above, according to an example embodiment. According to example embodiments, at least one of these components may use a direct circuit structure, such as a memory, a processor, a logic circuit, a look-up table, and the like, that may execute the respective functions through controls of one or more microprocessors or other control apparatuses. Also, at least one of these components may be specifically embodied by a module, a program, or a part of code, which contains one or more executable instructions for performing specified logic functions, and executed by one or more microprocessors or other control apparatuses. Further, at least one of these components may include or may be implemented by a processor such as a CPU that performs the respective functions, a microprocessor, or the like. Two or more of these components may be combined into one single component which performs all operations or functions of the combined two or more components. Also, at least part of functions of at least one of these components may be performed by another of these components. Functional aspects of the above example embodiments may be implemented in algorithms that execute on one or more processors. Furthermore, the components represented by a block or processing steps may employ any number of related art techniques for electronics configuration, signal processing and/or control, data processing and the like.
The flowchart and block diagrams in the drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer readable media according to various embodiments. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which includes one or more executable instructions for implementing the specified logical functions. The method, computer system, and computer readable medium may include additional blocks, fewer blocks, different blocks, or differently arranged blocks than those depicted in the Figures. In some alternative implementations, the functions noted in the blocks may occur out of the order noted in the Figures. For example, two blocks shown in succession may, in fact, be executed concurrently or substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It may also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts or carry out combinations of special purpose hardware and computer instructions.
It will be apparent that systems and/or methods, described herein, may be implemented in different forms of hardware, firmware, or a combination of hardware and software. The actual specialized control hardware or software code used to implement these systems and/or methods is not limiting of the implementations. Thus, the operation and behavior of the systems and/or methods were described herein without reference to specific software code, it being understood that software and hardware may be designed to implement the systems and/or methods based on the description herein.
No element, act, or instruction used herein should be construed as critical or essential unless explicitly described as such. Also, as used herein, the articles "a" and "an" are intended to include one or more items, and may be used interchangeably with "one or more." Furthermore, as used herein, the term "set" is intended to include one or more items (e.g., related items, unrelated items, a combination of related and unrelated items, and the like), and may be used interchangeably with "one or more." Where only one item is intended, the term "one" or similar language is used. Also, as used herein, the terms "has," "have," "having," "includes," "including," or the like are intended to be open-ended terms. Further, the phrase "based on" is intended to mean "based, at least in part, on" unless explicitly stated otherwise. In addition, expressions such as "at least one of [A] and [B]" or "at least one of [A] or [B]" are to be understood as including only A, only B, or both A and B.
Reference throughout this specification to "one embodiment," "an embodiment," or similar language means that a particular feature, structure, or characteristic described in connection with the indicated embodiment is included in at least one embodiment of the present solution. Thus, the phrases "in one embodiment," "in an embodiment," and similar language throughout this specification may, but do not necessarily, all refer to the same embodiment. As used herein, such terms as "1st" and "2nd," or "first" and "second" may be used to simply distinguish a corresponding component from another, and do not limit the components in other aspects (e.g., importance or order). It is to be understood that if an element (e.g., a first element) is referred to, with or without the term "operatively" or "communicatively," as "coupled with," "coupled to," "connected with," or "connected to" another element (e.g., a second element), it means that the element may be coupled with the other element directly (e.g., wired), wirelessly, or via a third element.
It is to be understood that when an element or layer is referred to as being "over," "above," "on," "below," "under," "beneath," "connected to" or "coupled to" another element or layer, it can be directly over, above, on, below, under, beneath, connected or coupled to the other element or layer or intervening elements or layers may be present. In contrast, when an element is referred to as being "directly over," "directly above," "directly on," "directly below," "directly under," "directly beneath," "directly connected to" or "directly coupled to" another element or layer, there are no intervening elements or layers present.
The descriptions of the various aspects and embodiments have been presented for purposes of illustration, but are not intended to be exhaustive or limited to the embodiments disclosed. Even though combinations of features are recited in the claims and/or disclosed in the specification, these combinations are not intended to limit the disclosure of possible implementations. In fact, many of these features may be combined in ways not specifically recited in the claims and/or disclosed in the specification. Although each dependent claim listed below may directly depend on only one claim, the disclosure of possible implementations includes each dependent claim in combination with every other claim in the claim set. Many modifications and variations may be apparent to those of ordinary skill in the art without departing from the scope of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, the practical application or technical improvement over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.
It is to be understood that the specific order or hierarchy of blocks in the processes/flowcharts disclosed is an illustration of exemplary approaches. Based upon design preferences, it is understood that the specific order or hierarchy of blocks in the processes/flowcharts may be rearranged. Further, some blocks may be combined or omitted. The accompanying claims present elements of the various blocks in a sample order, and are not meant to be limited to the specific order or hierarchy presented.
Furthermore, the described features, advantages, and characteristics of the present disclosure may be combined in any suitable manner in one or more embodiments. One skilled in the relevant art may recognize, in light of the description herein, that the present disclosure can be practiced without one or more of the specific features or advantages of a particular embodiment. In other instances, additional features and advantages may be recognized in certain embodiments that may not be present in all embodiments of the present disclosure.
According to an aspect of the present disclosure, a method may include obtaining a plurality of images from a user, wherein the plurality of images were acquired by an electronic device viewing a first scene and each of the plurality of images is associated with a corresponding viewpoint of the first scene. The method may include obtaining a first indication identifying a first image of the plurality of images, wherein the first image is associated with a first viewpoint of the first scene. The method may include obtaining a second indication of a first object to be removed from the first image. The method may include removing the first object from the first image to obtain a reference image. The method may include obtaining a third indication of a second viewpoint from the user, wherein the second viewpoint is different from each of the respective viewpoints of the plurality of images. The method may include rendering, using a neural radiance field (NeRF), a second image that corresponds to a 3D scene as seen from the second viewpoint, wherein the 3D scene has been inpainted into the NeRF. The method may include displaying the second image on a display of the electronic device.
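As a non-limiting illustration of the rendering step, the following sketch composites samples along a single camera ray using the standard volume rendering rule commonly used with neural radiance fields. The function name `render_ray` and the sample values are assumptions made for illustration only; a trained NeRF would supply the densities and colors, and the sketch is not asserted to be the implementation of the disclosure.

```python
import math


def render_ray(densities, colors, deltas):
    """Composite samples along one ray into a single RGB color.

    densities: non-negative volume density at each sample (illustrative values)
    colors:    (r, g, b) tuple at each sample
    deltas:    length of the ray segment represented by each sample
    """
    transmittance = 1.0  # fraction of light not yet absorbed along the ray
    pixel = [0.0, 0.0, 0.0]
    for sigma, rgb, delta in zip(densities, colors, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this segment
        weight = transmittance * alpha          # contribution of this sample
        for channel in range(3):
            pixel[channel] += weight * rgb[channel]
        transmittance *= 1.0 - alpha
    return tuple(pixel)
```

Rendering the second image would, under these assumptions, repeat this compositing for one ray per pixel of the second viewpoint.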
According to an embodiment of the disclosure, the removing of the first object may include performing a first inpainting on the first image by applying a mask to the first object, to obtain the reference image. The method may further include inpainting the 3D scene into the NeRF in part by adjusting a first size of the mask according to a second size of the first object that appears in the second image and applying the mask with the adjusted size to the second image. The method may include, based on a user input requesting an image of the 3D scene seen from the second viewpoint, inputting the reference image and information of the second viewpoint to the NeRF to provide the second image corresponding to the 3D scene seen from the second viewpoint.
According to an embodiment of the disclosure, the method may include obtaining a fourth indication from the user, wherein the fourth indication is associated with a second object to be inpainted into the first image.
The method may include updating, before training the NeRF, the first image to remove the first object from the first image by using a mask, wherein the first image includes an unmasked portion and a masked portion.
The method may include training the NeRF after the first object is removed from the first image, wherein the NeRF is trained to output an inpainted 3D scene from an unobserved view point by accepting as input a reference inpainted view image that is obtained by selecting one of a plurality of views of a scene and applying a mask to inpaint an object into the reference view image.
According to an embodiment of the disclosure, the training may be performed at the electronic device. According to an embodiment of the disclosure, the training may be performed at a server.
According to an embodiment of the disclosure, the method may include obtaining, after the displaying, a fifth indication from the user, wherein the fifth indication is a selection of a second object to be inpainted into the first image. The method may include obtaining a second representative image by inpainting the second object into the first image. The method may include updating the training of the NeRF based on the second representative image. The method may include rendering, using the NeRF, a third image. The method may include displaying the third image on the display of the electronic device. The training the NeRF may include training the NeRF, based on the reference image and the plurality of images, using a first loss associated with the unmasked portion.
According to an embodiment of the disclosure, the training the NeRF may include training the NeRF using a second loss based on the masked portion and an estimated depth, wherein the estimated depth is associated with a first geometry of the first scene in the masked portion.
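As a non-limiting illustration, the first loss (over the unmasked portion) and the second loss (over the masked portion, against an estimated depth) may be sketched as below. The use of a mean squared error, the flat per-pixel list layout, and the function names are assumptions made for illustration; the disclosure does not fix a particular error measure here.

```python
def unmasked_color_loss(rendered, target, mask):
    """Mean squared color error over pixels where mask is 0 (unmasked)."""
    total, count = 0.0, 0
    for pred_px, true_px, masked in zip(rendered, target, mask):
        if masked:
            continue  # masked pixels are supervised by other losses
        total += sum((p - t) ** 2 for p, t in zip(pred_px, true_px))
        count += 1
    return total / count if count else 0.0


def masked_depth_loss(rendered_depth, estimated_depth, mask):
    """Mean squared depth error over pixels where mask is 1 (masked)."""
    errors = [
        (d - e) ** 2
        for d, e, masked in zip(rendered_depth, estimated_depth, mask)
        if masked
    ]
    return sum(errors) / len(errors) if errors else 0.0
```

In this sketch each image is a flat list of pixels and `mask` holds one 0 or 1 entry per pixel, so the two losses partition the image between observed content and the inpainted region.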
According to an embodiment of the disclosure, the training the NeRF may include identifying a plurality of disoccluded pixels, wherein the plurality of disoccluded pixels are present in the target image and are associated with the second viewpoint. The method may include determining a fourth loss, wherein the fourth loss is associated with a second inpainting of the plurality of disoccluded pixels of the target image. The method may include training the NeRF using the fourth loss.
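As a non-limiting illustration, disoccluded pixels may be found as the masked pixels of a target image that no reference-view pixel projects onto. The sketch below assumes a rectified camera pair (the target camera is the reference camera translated along the x axis) so that projection reduces to a horizontal disparity shift; this simplification, the function names, and the parameters `focal` and `baseline` are assumptions for illustration only.

```python
def project_to_target(ref_pixels, focal, baseline):
    """Shift each (x, y, depth) reference pixel horizontally by its disparity.

    Assumes the target camera is the reference camera translated by
    `baseline` along the x axis (a rectified pair).
    """
    covered = set()
    for x, y, depth in ref_pixels:
        disparity = focal * baseline / depth
        covered.add((round(x - disparity), y))
    return covered


def disoccluded_pixels(target_mask_pixels, ref_pixels, focal, baseline):
    """Target-view masked pixels that no reference pixel projects onto."""
    covered = project_to_target(ref_pixels, focal, baseline)
    return set(target_mask_pixels) - covered
```

The pixels returned by `disoccluded_pixels` are those for which the reference image carries no information, which is why a separate inpainting and a separate loss would be needed for them.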
According to an embodiment of the disclosure, the method may include, when the first object is removed from the reference image using a first mask and a second size of the first object in other images differs from a first size in the reference image, adjusting mask sizes of respective masks in the other images proportionally to the respective object sizes of the first object in the other images.
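As a non-limiting illustration of the proportional adjustment, the sketch below scales a rectangular mask by the ratio of the object's size in another image to its size in the reference image. The rectangular mask shape, the function name `scale_mask_box`, and the pixel values are assumptions for illustration; masks in practice may have arbitrary shapes.

```python
def scale_mask_box(ref_box, ref_object_size, other_object_size, other_center):
    """Return a mask box for another image, scaled to the object's size there.

    ref_box:           (width, height) of the mask in the reference image
    ref_object_size:   object extent in pixels in the reference image
    other_object_size: object extent in pixels in the other image
    other_center:      (cx, cy) of the object in the other image
    """
    scale = other_object_size / ref_object_size
    width, height = ref_box[0] * scale, ref_box[1] * scale
    cx, cy = other_center
    # Box is returned as (left, top, right, bottom).
    return (cx - width / 2, cy - height / 2, cx + width / 2, cy + height / 2)
```

For example, an object that appears half as large in another image would receive a mask with half the width and half the height of the reference mask, centered on the object in that image.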
According to an embodiment of the disclosure, a method of training a neural radiance field may include initially training the neural radiance field using a first loss associated with a plurality of unmasked regions respectively associated with a reference image and a plurality of target images, wherein the reference image is associated with a reference viewpoint and each target image of the plurality of target images is associated with a respective target viewpoint. The method may include updating the training of the neural radiance field using a second loss associated with a depth estimate of a masked region in the reference image. The method may include updating the training of the neural radiance field using a third loss associated with a plurality of view-substituted images, wherein each view-substituted image of the plurality of view-substituted images is associated with the respective target view of the plurality of target images, each view-substituted image is a volume rendering from the reference viewpoint across pixels with view-substituted target colors, and the third loss is based on the plurality of view-substituted images.
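As a non-limiting illustration of a view-substituted rendering, the sketch below marches along a ray cast from the reference viewpoint, so the geometry is that of the reference view, but evaluates the color of each sample using the direction from a target camera to that sample. The toy radiance function `toy_color`, the helper names, and the camera positions are hypothetical and stand in for a trained neural radiance field.

```python
import math


def toy_color(point, view_dir):
    """Hypothetical view-dependent color: gray plus a highlight along +z."""
    highlight = max(0.0, view_dir[2])
    return (0.5 + 0.5 * highlight, 0.5, 0.5)


def normalize(v):
    norm = math.sqrt(sum(c * c for c in v))
    return tuple(c / norm for c in v)


def view_substituted_color(ref_origin, ref_dir, target_origin, depths, densities):
    """Volume-render along the reference ray using target-view directions."""
    transmittance, pixel, previous = 1.0, [0.0, 0.0, 0.0], 0.0
    for t, sigma in zip(depths, densities):
        point = tuple(o + t * d for o, d in zip(ref_origin, ref_dir))
        # Substitute the viewing direction: observe the sample from the target camera.
        sub_dir = normalize(tuple(p - o for p, o in zip(point, target_origin)))
        alpha = 1.0 - math.exp(-sigma * (t - previous))
        weight = transmittance * alpha
        rgb = toy_color(point, sub_dir)
        for channel in range(3):
            pixel[channel] += weight * rgb[channel]
        transmittance *= 1.0 - alpha
        previous = t
    return tuple(pixel)
```

When the target camera coincides with the reference camera, the result equals an ordinary rendering from the reference viewpoint; moving the target camera changes only the view-dependent part of the color, which is the quantity a loss based on view-substituted colors could supervise.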
According to an embodiment of the disclosure, the method may include additionally updating the training of the neural radiance field with a fourth loss, wherein the fourth loss is associated with dis-occluded pixels in each target image of the plurality of target images.
According to an embodiment of the disclosure, rendering an image with depth information may include obtaining image data that comprises a plurality of images that show a first scene from different viewpoints. The rendering may include, based on a first user input identifying a target object from one of the plurality of images, performing a first inpainting on the one of the plurality of images to obtain a reference image by applying a mask to the target object. The rendering may include inpainting a 3D scene into a neural radiance field (NeRF), based on the reference image, by adjusting a first size of the mask according to a second size of the target object in each of remaining images other than the one of the plurality of images to obtain a plurality of adjusted masks, and applying the plurality of adjusted masks to respective ones of the remaining images. The rendering may include, based on a second user input requesting a first image of the 3D scene seen from a requested view point, inputting the reference image and the requested view point to a neural radiance field (NeRF) model to provide the first image, wherein the first image corresponds to the 3D scene seen from the requested view point.
According to an embodiment of the disclosure, an apparatus may include one or more processors and one or more memories storing instructions. The instructions may be configured to cause the apparatus to obtain a plurality of images from a user, wherein the plurality of images were acquired by the apparatus viewing a first scene and each of the plurality of images is associated with a corresponding viewpoint of the first scene. The instructions may be configured to cause the apparatus to obtain a first indication identifying a first image of the plurality of images, wherein the first image is associated with a first viewpoint of the first scene. The instructions may be configured to cause the apparatus to obtain a second indication of a first object to be removed from the first image. The instructions may be configured to cause the apparatus to remove the first object from the first image to obtain a reference image. The instructions may be configured to cause the apparatus to obtain a third indication of a second viewpoint from the user, wherein the second viewpoint does not correspond to any of the plurality of images. The instructions may be configured to cause the apparatus to render, using a neural radiance field (NeRF), a second image that corresponds to a 3D scene as seen from the second viewpoint, wherein the 3D scene has been inpainted into the NeRF. The instructions may be configured to cause the apparatus to display the second image on a display of the apparatus.
According to an embodiment of the disclosure, the instructions may be configured to cause the apparatus to remove the first object by performing a first inpainting on the first image by applying a mask to the first object, to obtain the reference image. The instructions may be configured to cause the apparatus to inpaint the 3D scene into the NeRF in part by adjusting a first size of the mask according to a second size of the first object that appears in the second image and applying the mask with the adjusted size to the second image. The instructions may be configured to cause the apparatus to, based on a user input requesting an image of the 3D scene seen from the second viewpoint, input the reference image and information of the second viewpoint to the NeRF to provide the second image corresponding to the 3D scene seen from the second viewpoint.
According to an embodiment of the disclosure, the instructions may be configured to cause the apparatus to obtain a fourth indication from the user, wherein the fourth indication is associated with a second object to be inpainted into the first image. The instructions may be configured to cause the apparatus to update, before a training of the NeRF, the first image to remove the first object from the first image by using a mask, wherein the first image includes an unmasked portion and a masked portion. The instructions may be configured to cause the apparatus to train the NeRF after the first object is removed from the first image.
According to an embodiment of the disclosure, the apparatus may be a mobile device.
According to an embodiment of the disclosure, the instructions may be configured to cause the apparatus to obtain the NeRF from a server after a training of the NeRF, wherein the NeRF has been trained at the server.
According to an aspect of the present disclosure, a computer-readable storage medium storing instructions is provided. The instructions, when executed by at least one processor, may cause the at least one processor to obtain a plurality of images from a user, wherein the plurality of images were acquired by an electronic device viewing a first scene and each of the plurality of images is associated with a corresponding viewpoint of the first scene. The instructions, when executed by at least one processor, may cause the at least one processor to obtain a first indication identifying a first image of the plurality of images, wherein the first image is associated with a first viewpoint of the first scene. The instructions, when executed by at least one processor, may cause the at least one processor to obtain a second indication of a first object to be removed from the first image. The instructions, when executed by at least one processor, may cause the at least one processor to remove the first object from the first image to obtain a reference image. The instructions, when executed by at least one processor, may cause the at least one processor to obtain a third indication of a second viewpoint from the user, wherein the second viewpoint is different from each of the respective viewpoints of the plurality of images. The instructions, when executed by at least one processor, may cause the at least one processor to render, using a neural radiance field (NeRF), a second image that corresponds to a 3D scene as seen from the second viewpoint, wherein the 3D scene has been inpainted into the NeRF. The instructions, when executed by at least one processor, may cause the at least one processor to display the second image on a display of the electronic device.

Claims (15)

  1. A method comprising:
    obtaining a plurality of images from a user, wherein the plurality of images were acquired by an electronic device viewing a first scene and each of the plurality of images is associated with a corresponding viewpoint of the first scene;
    obtaining a first indication identifying a first image of the plurality of images, wherein the first image is associated with a first viewpoint of the first scene;
    obtaining a second indication of a first object to be removed from the first image;
    removing the first object from the first image to obtain a reference image;
    obtaining a third indication of a second viewpoint from the user, wherein the second viewpoint is different from each of the respective viewpoints of the plurality of images;
    rendering, using a neural radiance field (NeRF), a second image that corresponds to a 3D scene as seen from the second viewpoint, wherein the 3D scene has been inpainted into the NeRF; and
    displaying the second image on a display of the electronic device.
  2. The method of claim 1, wherein the removing of the first object comprises performing a first inpainting on the first image by applying a mask to the first object, to obtain the reference image, wherein the method further comprises:
    inpainting the 3D scene into the NeRF in part by adjusting a first size of the mask according to a second size of the first object that appears in the second image and applying the mask with the adjusted size to the second image; and
    based on a user input requesting an image of the 3D scene seen from the second viewpoint, inputting the reference image, and information of the second viewpoint to the NeRF to provide the second image corresponding to the 3D scene seen from the second viewpoint.
  3. The method of any one of claims 1 to 2, wherein the method further comprises:
    obtaining a fourth indication from the user, wherein the fourth indication is associated with a second object to be inpainted into the first image;
    updating, before training the NeRF, the first image to remove the first object from the first image by using a mask, wherein the first image includes an unmasked portion and a masked portion; and
    training the NeRF after the first object is removed from the first image, wherein the NeRF is trained to output an inpainted 3D scene from an unobserved view point by accepting as input a reference inpainted view image that is obtained by selecting one of a plurality of views of a scene and applying a mask to inpaint an object into the reference view image.
  4. The method of any one of claims 1 to 3, further comprising:
    obtaining, after the displaying, a fifth indication from the user, wherein the fifth indication is a selection of a second object to be inpainted into the first image;
    obtaining a second representative image by inpainting the second object into the first image;
    updating the training of the NeRF based on the second representative image;
    rendering, using the NeRF, a third image; and
    displaying the third image on the display of the electronic device.
  5. The method of any one of claims 1 to 4, wherein the training the NeRF is performed at a server, and
    wherein the training the NeRF comprises training the NeRF, based on the reference image and the plurality of images, using a first loss associated with the unmasked portion.
  6. The method of any one of claims 1 to 5, wherein the training the NeRF further comprises training the NeRF using a second loss based on the masked portion and an estimated depth, wherein the estimated depth is associated with a first geometry of the first scene in the masked portion.
  7. The method of any one of claims 1 to 6, wherein the training the NeRF further comprises:
    performing a view substitution of a target image to obtain a view substituted image, wherein the view substituted image comprises view dependent effects (VDEs) from a third viewpoint different from the first viewpoint associated with the first image, whereby view substituted colors are obtained associated with the third viewpoint, wherein a second geometry of the first scene underlying the view substituted image is that of the reference image, wherein the plurality of images comprises the target image and the target image is not the first image; and
    training the NeRF using a third loss based on the view substituted colors.
  8. The method of any one of claims 1 to 7, wherein the training the NeRF further comprises:
    identifying a plurality of disoccluded pixels, wherein the plurality of disoccluded pixels are present in the target image and are associated with the second viewpoint;
    determining a fourth loss, wherein the fourth loss is associated with a second inpainting of the plurality of disoccluded pixels of the target image;
    and
    training the NeRF using the fourth loss.
  9. The method of any one of claims 1 to 8, further comprising, when the first object is removed from the reference image using a first mask, and if a second size of the first object in other images differs from a first size in the reference image, adjusting mask sizes of respective masks in the other images proportionally to the respective object sizes of the first object in the other images.
  10. The method of any one of claims 1 to 9, wherein rendering an image with depth information further comprises:
    obtaining image data that comprises a plurality of images that show a first scene from different viewpoints;
    based on a first user input identifying a target object from one of the plurality of images, performing a first inpainting on the one of the plurality of images to obtain a reference image by applying a mask to the target object;
    inpainting a 3D scene into a neural radiance field (NeRF), based on the reference image, by adjusting a first size of the mask according to a second size of the target object in each of remaining images other than the one of the plurality of images to obtain a plurality of adjusted masks, and applying the plurality of adjusted masks to respective ones of the remaining images; and
    based on a second user input requesting a first image of the 3D scene seen from a requested view point, inputting the reference image, and the requested view point to a neural radiance field (NeRF) model to provide the first image, wherein the first image corresponds to the 3D scene seen from the requested view point.
  11. An apparatus comprising:
    one or more processors; and
    one or more memories, the one or more memories storing instructions configured to cause the apparatus to at least:
    obtain a plurality of images from a user, wherein the plurality of images were acquired by the apparatus viewing a first scene and each of the plurality of images is associated with a corresponding viewpoint of the first scene;
    obtain a first indication identifying a first image of the plurality of images, wherein the first image is associated with a first viewpoint of the first scene;
    obtain a second indication of a first object to be removed from the first image;
    remove the first object from the first image to obtain a reference image;
    obtain a third indication of a second viewpoint from the user, wherein the second viewpoint does not correspond to any of the plurality of images;
    render, using a neural radiance field (NeRF), a second image that corresponds to a 3D scene as seen from the second viewpoint, wherein the 3D scene has been inpainted into the NeRF; and
    display the second image on a display of the apparatus.
  12. The apparatus of claim 11, wherein the instructions are further configured to cause the apparatus to remove the first object by performing a first inpainting on the first image by applying a mask to the first object, to obtain the reference image, and wherein the instructions are further configured to cause the apparatus to:
    inpaint the 3D scene into the NeRF in part by adjusting a first size of the mask according to a second size of the first object that appears in the second image and applying the mask with the adjusted size to the second image; and
    based on a user input requesting an image of the 3D scene seen from the second viewpoint, input the reference image, and information of the second viewpoint to the NeRF to provide the second image corresponding to the 3D scene seen from the second viewpoint.
  13. The apparatus of any one of claims 11 to 12, wherein the instructions are further configured to cause the apparatus to:
    obtain a fourth indication from the user, wherein the fourth indication is associated with a second object to be inpainted into the first image;
    update, before a training of the NeRF, the first image to remove the first object from the first image by using a mask, wherein the first image includes an unmasked portion and a masked portion; and
    train the NeRF after the first object is removed from the first image.
  14. The apparatus of any one of claims 11 to 13, wherein the instructions are further configured to cause the apparatus to:
    obtain the NeRF from a server after a training of the NeRF, wherein the NeRF has been trained at the server.
  15. A computer readable medium storing instructions, wherein the instructions, when executed by at least one processor, cause the at least one processor to perform the method of any one of claims 1 to 10.
PCT/KR2024/001943 2023-03-08 2024-02-08 Method and apparatus for removing and rendering an image Ceased WO2024186013A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP24767302.3A EP4616371A4 (en) 2023-03-08 2024-02-08 METHOD AND DEVICE FOR REMOVING AND REPRODUCING AN IMAGE
CN202480006898.5A CN120476430A (en) 2023-03-08 2024-02-08 Method and apparatus for removing and rendering images

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US202363450739P 2023-03-08 2023-03-08
US63/450,739 2023-03-08
US18/389,072 US20240303789A1 (en) 2023-03-08 2023-11-13 Reference-based nerf inpainting
US18/389,072 2023-11-13

Publications (1)

Publication Number Publication Date
WO2024186013A1 true WO2024186013A1 (en) 2024-09-12

Family

ID=92635773

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/KR2024/001943 Ceased WO2024186013A1 (en) 2023-03-08 2024-02-08 Method and apparatus for removing and rendering an image

Country Status (4)

Country Link
US (1) US20240303789A1 (en)
EP (1) EP4616371A4 (en)
CN (1) CN120476430A (en)
WO (1) WO2024186013A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20250245866A1 (en) * 2024-01-31 2025-07-31 Adobe Inc. Text-guided video generation
WO2026030772A2 (en) * 2024-10-24 2026-02-05 Futurewei Technologies, Inc. Video-guided free-view video generation with robust object control by gaussian editing
CN121258846B (en) * 2025-12-04 2026-02-06 厦门理工学院 A method and system for restoring historical documents based on implicit interpolation network enhancement

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170094254A1 (en) * 2014-05-20 2017-03-30 Medit Corp. Method and apparatus for acquiring three-dimensional image, and computer readable recording medium
US20200327718A1 (en) * 2019-04-09 2020-10-15 Facebook Technologies, Llc Three-dimensional Modeling Volume for Rendering Images
US20210158606A1 (en) * 2019-11-27 2021-05-27 Electronics And Telecommunications Research Institute Apparatus and method for generating three-dimensional model
US20220122311A1 (en) * 2020-10-21 2022-04-21 Samsung Electronics Co., Ltd. 3d texturing via a rendering loss

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
MILDENHALL BEN, SRINIVASAN PRATUL P., TANCIK MATTHEW, BARRON JONATHAN T., RAMAMOORTHI RAVI, NG REN: "NeRF : representing scenes as neural radiance fields for view synthesis", ARXIV:2003.08934V1, UNITED STATES, 19 March 2020 (2020-03-19), United States , XP093208291 *
See also references of EP4616371A4
WEDER SILVAN ET AL., REMOVING OBJECTS FROM NEURAL RADIANCE FIELDS

Also Published As

Publication number Publication date
US20240303789A1 (en) 2024-09-12
EP4616371A4 (en) 2025-11-12
CN120476430A (en) 2025-08-12
EP4616371A1 (en) 2025-09-17

Similar Documents

Publication Publication Date Title
WO2024186013A1 (en) Method and apparatus for removing and rendering an image
WO2020096403A1 (en) Textured neural avatars
WO2015188685A1 (en) Depth camera-based human-body model acquisition method and network virtual fitting system
WO2024029793A1 (en) Method of distortion calibration in video see-through augmented-reality system
WO2018090455A1 (en) Method and device for processing panoramic image of terminal, and terminal
JP4831514B2 (en) Setting parameter optimization device and program thereof
WO2013168998A1 (en) Apparatus and method for processing 3d information
WO2016145602A1 (en) Apparatus and method for focal length adjustment and depth map determination
WO2016003253A1 (en) Method and apparatus for image capturing and simultaneous depth extraction
WO2019156428A1 (en) Electronic device and method for correcting images using external electronic device
WO2021006482A1 (en) Apparatus and method for generating image
WO2023055033A1 (en) Method and apparatus for enhancing texture details of images
WO2026010149A1 (en) Method and apparatus for three-dimensional reconstruction of a scene, electronic device, and storage medium
WO2025100673A1 (en) Final view generation using offset and/or angled see-through cameras in video see-through (vst) extended reality (xr)
EP4434219A1 (en) Standard dynamic range (sdr) to high dynamic range (hdr) inverse tone mapping using machine learning
CN116385507A (en) A registration method and system for multi-source point cloud data based on different scales
WO2019059635A1 (en) Electronic device for providing function by using rgb image and ir image acquired through one image sensor
WO2024228495A1 (en) Artificial intelligence-based tooth shade diagnosis system and operating method therefor
WO2025183299A1 (en) Registration and parallax error correction for video see-through (vst) extended reality (xr)
EP4356341A1 (en) Adaptive sub-pixel spatial temporal interpolation for color filter array
WO2020149527A1 (en) Apparatus and method for encoding in structured depth camera system
WO2020076026A1 (en) Method for acquiring three-dimensional object by using artificial lighting photograph and device thereof
WO2020111382A1 (en) Apparatus and method for optimizing inverse tone mapping on basis of single image, and recording medium for performing method
WO2023229431A1 (en) Method for correcting image by using neural network model, and computing device for executing neural network model for image correction
WO2023219189A1 (en) Electronic device for compositing images on basis of depth map and method therefor

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application. Ref document number: 24767302; Country of ref document: EP; Kind code of ref document: A1
WWE WIPO information: entry into national phase. Ref document number: 2024767302; Country of ref document: EP
ENP Entry into the national phase. Ref document number: 2024767302; Country of ref document: EP; Effective date: 20250610
WWE WIPO information: entry into national phase. Ref document number: 202480006898.5; Country of ref document: CN
WWP WIPO information: published in national office. Ref document number: 202480006898.5; Country of ref document: CN
WWP WIPO information: published in national office. Ref document number: 2024767302; Country of ref document: EP
NENP Non-entry into the national phase. Ref country code: DE