WO2024186013A1

WO2024186013A1 - Method and apparatus for removing and rendering an image

Info

Publication number: WO2024186013A1
Application number: PCT/KR2024/001943
Authority: WO
Inventors: Ashkan MIRZAEI; Tristan Ty Aumentado-Armstrong; Konstantinos G. DERPANIS; Igor Gilitschenski; Aleksai Levinshtein; Marcus Brubaker
Original assignee: Samsung Electronics Co Ltd
Current assignee: Samsung Electronics Co Ltd
Priority date: 2023-03-08
Filing date: 2024-02-08
Publication date: 2024-09-12
Anticipated expiration: 2025-09-08
Also published as: US20240303789A1; EP4616371A4; CN120476430A; EP4616371A1

Abstract

The present disclosure provides methods and apparatuses for training a neural radiance field and producing a rendering of a 3D scene from a novel viewpoint with view-dependent effects. The neural radiance field is initially trained using a first loss associated with a plurality of unmasked regions associated with a reference image and a plurality of target images. The training may also be updated using a second loss associated with a depth estimate of a masked region in the reference image. The training may also be further updated using a third loss associated with a view-substituted image associated with a respective target image. The view-substituted image is a volume rendering from the reference viewpoint across pixels with view-substituted target colors. In an embodiment, the neural radiance field is additionally trained with a fourth loss. The fourth loss is associated with dis-occluded pixels in a target image.

Description

METHOD AND APPARATUS FOR REMOVING AND RENDERING AN IMAGE

This application is related to synthesizing a view of a 3D scene from a novel viewpoint.

The popularity of Neural Radiance Fields (NeRFs) for view synthesis has led to a desire for NeRF editing tools.

Using existing NeRF techniques to provide a scene representation comes with technical problems. First, the black-box nature of implicit neural representations makes it infeasible to simply edit the underlying data structure based on geometric understanding. There is no explainability at the internal node level in a NeRF neural network. Second, because NeRFs are trained from images, special considerations are required for maintaining multiview consistency. Independently inpainting images of a scene using 2D inpainters yields viewpoint-inconsistent imagery. Training a standard NeRF to reconstruct these 3D-inconsistent images would result in blurry inpainting.

According to an embodiment of the disclosure, a method may include obtaining a plurality of images from a user, wherein the plurality of images were acquired by an electronic device viewing a first scene and each of the plurality of images is associated with a corresponding viewpoint of the first scene. The method may include obtaining a first indication identifying a first image of the plurality of images, wherein the first image is associated with a first viewpoint of the first scene. The method may include obtaining a second indication of a first object to be removed from the first image. The method may include removing the first object from the first image to obtain a reference image. The method may include obtaining a third indication of a second viewpoint from the user, wherein the second viewpoint is different from each of the respective viewpoints of the plurality of images. The method may include rendering, using a neural radiance field (NeRF), a second image that corresponds to a 3D scene as seen from the second viewpoint, wherein the 3D scene has been inpainted into the NeRF. The method may include displaying the second image on a display of the electronic device.

According to an embodiment of the disclosure, an apparatus may include one or more processors. The apparatus may include one or more memories, the one or more memories storing instructions configured to cause the apparatus to at least receive a plurality of images from a user, wherein the plurality of images were acquired by the apparatus viewing a first scene and each of the plurality of images is associated with a corresponding viewpoint of the first scene. The apparatus may include one or more memories, the one or more memories storing instructions configured to cause the apparatus to at least obtain a first indication identifying a first image of the plurality of images, wherein the first image is associated with a first viewpoint of the first scene. The apparatus may include one or more memories, the one or more memories storing instructions configured to cause the apparatus to at least obtain a second indication of a first object to be removed from the first image. The apparatus may include one or more memories, the one or more memories storing instructions configured to cause the apparatus to at least remove the first object from the first image to obtain a reference image. The apparatus may include one or more memories, the one or more memories storing instructions configured to cause the apparatus to at least obtain a third indication of a second viewpoint from the user, wherein the second viewpoint does not correspond to any of the plurality of images. The apparatus may include one or more memories, the one or more memories storing instructions configured to cause the apparatus to at least render, using a neural radiance field (NeRF), a second image that corresponds to a 3D scene as seen from the second viewpoint, wherein the 3D scene has been inpainted into the NeRF. 
The apparatus may include one or more memories, the one or more memories storing instructions configured to cause the apparatus to at least display the second image on a display of the apparatus.

According to an aspect of the present disclosure, a computer-readable storage medium storing instructions is provided. The instructions, when executed by at least one processor, may cause the at least one processor to obtain a plurality of images from a user, wherein the plurality of images were acquired by an electronic device viewing a first scene and each of the plurality of images is associated with a corresponding viewpoint of the first scene. The instructions, when executed by at least one processor, may cause the at least one processor to obtain a first indication identifying a first image of the plurality of images, wherein the first image is associated with a first viewpoint of the first scene. The instructions, when executed by at least one processor, may cause the at least one processor to obtain a second indication of a first object to be removed from the first image. The instructions, when executed by at least one processor, may cause the at least one processor to remove the first object from the first image to obtain a reference image. The instructions, when executed by at least one processor, may cause the at least one processor to obtain a third indication of a second viewpoint from the user, wherein the second viewpoint is different from each of the respective viewpoints of the plurality of images. The instructions, when executed by at least one processor, may cause the at least one processor to render, using a neural radiance field (NeRF), a second image that corresponds to a 3D scene as seen from the second viewpoint, wherein the 3D scene has been inpainted into the NeRF. The instructions, when executed by at least one processor, may cause the at least one processor to display the second image on a display of the electronic device.

Provided herein is a method including receiving a plurality of images from a user, wherein the plurality of images were acquired by an electronic device viewing a first scene and each of the plurality of images is associated with a corresponding viewpoint of the first scene; receiving a first indication identifying a first image of the plurality of images, wherein the first image is associated with a first viewpoint of the first scene; receiving a second indication of a first object to be removed from the first image; removing the first object from the first image to obtain a reference image; receiving a third indication of a second viewpoint from the user, wherein the second viewpoint does not correspond to any of the plurality of images; rendering, using a neural radiance field (NeRF), a second image that corresponds to a 3D scene as seen from the second viewpoint, wherein the 3D scene has been inpainted into the NeRF; and displaying the second image on a display of the electronic device.

The text and figures are provided solely as examples to aid the reader in understanding the invention. They are not intended and are not to be construed as limiting the scope of this invention in any manner. Although certain embodiments and examples have been provided, it will be apparent to those skilled in the art based on the disclosures herein that changes in the embodiments and examples shown may be made without departing from the scope of embodiments provided herein.

FIG. 1A illustrates an example of logic for rendering an inpainted 3D scene from a novel viewpoint, according to an embodiment.

FIG. 1B illustrates an example of adding a selected object to a 3D scene, according to an embodiment.

FIG. 1C illustrates an example of a rendering of the inpainted 3D scene of FIG. 1B from the novel viewpoint, according to an embodiment.

FIG. 2A illustrates an example of a system for providing the novel view, according to an embodiment.

FIG. 2B illustrates an example of logic for training a NeRF to represent an inpainted 3D scene and using the NeRF to obtain the novel view, according to an embodiment.

FIG. 3 illustrates an example of training the NeRF to represent an inpainted 3D scene, according to an embodiment.

FIG. 4 illustrates an example of further details of training the NeRF of FIG. 3 to represent the inpainted 3D scene, according to an embodiment.

FIG. 5 illustrates an example of geometry related to a view substitution technique.

FIG. 6 represents an example of an input image.

FIG. 7 illustrates an example of removing a backpack from the input image of FIG. 6 and obtaining a text command to inpaint a red fence, and inpainting the red fence.

FIG. 8 illustrates an example of removing a backpack from the input image of FIG. 6 and obtaining a text command to inpaint a rubber duck, and inpainting the rubber duck.

FIG. 9 illustrates an example of removing a backpack from the input image of FIG. 6 and obtaining a text command to inpaint a flower pot, and inpainting the flower pot.

FIG. 10 illustrates an example of an input image with a backpack as an object to be removed.

FIG. 11 illustrates an example of the red fence replacing the backpack using 2D inpainting.

FIG. 12, related to FIG. 10, illustrates an example of replacing the backpack by pasting an image of a mailbox.

FIG. 13, related to FIG. 10, illustrates an example of replacing the backpack by inpainting the red fence and then manually pasting a shrub.

FIG. 14 illustrates an example of a reference image in which an object has been removed.

FIG. 15 illustrates an example of a set of input images.

FIG. 16 illustrates an example of a set of masks corresponding to the input images of FIG. 15.

FIG. 17 illustrates an example of an initial target view with distortion in the area corresponding to the inpainting in the reference view.

FIG. 18 illustrates an example of a residual with respect to the target view of FIG. 17.

FIG. 19 illustrates an example of an updated rendering of the target view based on the residual of FIG. 18.

FIG. 20 illustrates an example of disocclusion processing to improve the inpainted 3D scene represented by the NeRF for views other than the reference view.

FIG. 21 illustrates exemplary hardware for implementation of computing devices for implementing the systems and algorithms described by the figures, according to an embodiment.

The detailed description set forth below in connection with the appended drawings is intended as a description of various configurations and is not intended to represent the only configurations in which the concepts described herein may be practiced. The detailed description includes specific details for the purpose of providing a thorough understanding of various concepts. However, it is to be apparent to those skilled in the art that these concepts may be practiced without these specific details. In some instances, well known structures and components are shown in block diagram form in order to avoid obscuring such concepts. In the descriptions that follow, like parts are marked throughout the specification and drawings with the same numerals, respectively.

The following description provides examples, and is not limiting of the scope, applicability, or embodiments set forth in the claims. Changes may be made in the function and/or arrangement of elements discussed without departing from the scope of the present disclosure. Various examples may omit, substitute, or add various procedures or components as appropriate. For example, the methods described may be performed in an order different from that described, and various steps may be added, omitted, and/or combined. In an embodiment, features described with reference to some examples may be combined in other examples.

Various aspects and/or features may be presented in terms of systems that may include a number of devices, components, modules, and the like. It is to be understood and appreciated that the various systems may include additional devices, components, modules, and the like and/or may not include all of the devices, components, modules, and the like discussed in connection with the figures. A combination of these approaches may also be used.

As a general introduction to the subject matter described in more detail below, the present disclosure provides methods, apparatuses, and computer-readable mediums for inpainting an unwanted object in one of several 2D images forming a complete 3D scene representation. The unwanted object is removed from any viewpoint within the 3D scene image. Obtaining may include receiving, accessing, acquiring and the like.

NeRF techniques may be used to inpaint unwanted regions in a view-consistent manner, allowing users to exercise control over the generated scene through a single inpainted image.

NeRFs are an implicit neural field representation (e.g., coordinate mapping) for 3D scenes and objects, generally fit to multiview posed image sets. The basic constituents are (i) a field, $f_\theta$, that maps a 3D coordinate, $x$, and a view direction, $d$, to a color, $c$, and density, $\sigma$, via learnable parameters $\theta$, and (ii) a rendering operator that produces color and depth for a given view pixel. The field, $f_\theta$, can be constructed in a variety of ways; the rendering operator is implemented as the classical volume rendering integral, approximated via quadrature, where a ray, r, is divided into N sections between $t_n$ and $t_f$ (the near and far bounds), with $t_i$ sampled from the i-th section. The estimated color is then given by Equation 1.

Equation 1

$$\hat{C}(r) = \sum_{i=1}^{N} T_i \left(1 - \exp(-\sigma_i \delta_i)\right) c_i$$

where $T_i = \exp\left(-\sum_{j=1}^{i-1} \sigma_j \delta_j\right)$ is the transmittance, $\delta_i = t_{i+1} - t_i$, and $c_i$ and $\sigma_i$ are the color and density at $t_i$. Replacing $c_i$ with $t_i$ or $t_i^{-1}$ in Equation 1 estimates depth, $\hat{\zeta}(r)$, and disparity (inverse depth), $\hat{D}(r)$, instead.
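The quadrature described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the disclosed implementation: the function name `composite_ray`, its arguments, and the choice to close the last section at the far bound are all hypothetical, and depth and disparity are obtained by compositing the sample distance or its inverse in place of the color.

```python
import numpy as np

def composite_ray(sigmas, colors, t_vals, t_far):
    """Quadrature approximation of the volume rendering integral for one ray.

    sigmas: (N,) densities at the sampled distances
    colors: (N, 3) colors at the sampled distances
    t_vals: (N,) increasing sample distances along the ray
    t_far:  far bound, used here to close the last section (assumption)
    """
    # Section lengths: distance from each sample to the next one.
    deltas = np.diff(np.append(t_vals, t_far))
    # Opacity contributed by each section.
    alphas = 1.0 - np.exp(-sigmas * deltas)
    # Transmittance: fraction of light surviving all earlier sections.
    optical_depth = np.cumsum(sigmas * deltas)
    trans = np.exp(-np.concatenate([[0.0], optical_depth[:-1]]))
    weights = trans * alphas
    color = weights @ colors              # composited color
    depth = weights @ t_vals              # distance composited instead of color
    disparity = weights @ (1.0 / t_vals)  # inverse distance composited instead
    return color, depth, disparity, weights
```

A ray that hits a nearly opaque sample returns essentially that sample's color and distance, because the transmittance drops to zero for every sample behind it.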

The inputs are n input images, $\{I_i\}_{i=1}^{n}$, their camera transform matrices, $\{\Pi_i\}_{i=1}^{n}$, and their corresponding masks, $\{M_i\}_{i=1}^{n}$, delineating the unwanted region. The inputs also include a single inpainted reference view, $I_{\mathrm{ref}}$, where $\mathrm{ref} \in \{1, 2, \ldots, n\}$, which provides the information that an embodiment maps, or extrapolates, into a 3D inpainting of the scene represented by the NeRF.

An embodiment uses I_ref not only to inpaint the NeRF, but also to generate 3D details and view-dependent effects (VDEs) from other viewpoints.

Below, the following topics are discussed: i) the use of monocular depth estimators to guide the geometry of the inpainted region, according to the depth of the reference image, I_ref (see FIG. 4 items A2-1, A2-2 and A2-3); ii) the use of bilateral solvers, in conjunction with a view-substitution technique of an embodiment, to add VDEs to views other than the reference view (see FIG. 5 and FIGS. 14-19 for a depiction of the geometry supervision and VDE handling); and iii) since not all the masked target pixels are visible in the reference, the supervision that an embodiment provides during training for such dis-occluded pixels, via additional inpaintings (see FIG. 20). However, the present disclosure is not limited in this regard, and other methods may be utilized without departing from the scope of the present disclosure.

Training may include gaining experience with respect to a task in an attempt to improve performance of the task at a future time after the training.

In an embodiment, training is based on the following four losses: i) L_unmasked, ii) L_depth, iii) L_substituted and iv) L_occluded. These four losses represent the unmasked appearance loss, masked geometry loss, view-dependent masked color loss, and dis-occlusion loss, respectively.

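The overall objective for inpainted NeRF fitting is given by Equation 2 (including weights $\gamma_{\mathrm{depth}}$, $\gamma_{\mathrm{sub}}$ and $\gamma_{\mathrm{occ}}$ on the last three terms).

Equation 2

$$\mathcal{L} = \mathcal{L}_{\mathrm{unmasked}} + \gamma_{\mathrm{depth}} \mathcal{L}_{\mathrm{depth}} + \gamma_{\mathrm{sub}} \mathcal{L}_{\mathrm{substituted}} + \gamma_{\mathrm{occ}} \mathcal{L}_{\mathrm{occluded}}$$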

Supervision is computed modulo an iteration count. For example, supervision for the respective summands of Equation 2 is computed every N_unmasked, N_depth, N_substituted and N_occluded iterations. A particular loss is not used until the appropriate number of iterations has passed.
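One possible reading of this schedule is sketched below. It is an illustration only: the helper names `active_losses` and `total_loss`, the dictionary layout, and the rule that a loss is due when the iteration count is a positive multiple of its period are assumptions, and the period and weight values used with it would be chosen by the practitioner.

```python
def active_losses(iteration, periods):
    """Return the names of the losses whose supervision is due at this iteration.

    periods maps a loss name to its period N. Under the assumed rule, a loss
    is due when the iteration count is a positive multiple of N, so a loss
    with period N is first used once N iterations have passed.
    """
    return [name for name, n in periods.items()
            if iteration > 0 and iteration % n == 0]

def total_loss(values, weights):
    """Weighted sum of the available loss values; missing weights default to 1."""
    return sum(weights.get(name, 1.0) * value for name, value in values.items())
```

With hypothetical periods of 1, 4, 8 and 16 for the four losses, iteration 4 would supervise only the unmasked and depth terms, while iteration 16 would supervise all four.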

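In the first stage of training, the field, $f_\theta$, is supervised on the unmasked pixels for N_unmasked iterations, via a NeRF reconstruction loss shown in Equation 3.

Equation 3

$$\mathcal{L}_{\mathrm{unmasked}} = \mathbb{E}_{r \in \mathcal{R}_{\mathrm{unmasked}}} \left\| \hat{C}(r) - C_{\mathrm{GT}}(r) \right\|^2$$

In Equation 3, $\hat{C}(r)$ is the color estimated for the ray by Equation 1, $\mathcal{R}_{\mathrm{unmasked}}$ (in contrast to $\mathcal{R}_{\mathrm{masked}}$) is the set of rays corresponding to the pixels in the unmasked part of the image (the part not affected by the mask) and $C_{\mathrm{GT}}(r)$ is the ground truth (GT) color for the ray, r.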
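As a concrete illustration of a reconstruction loss restricted to unmasked pixels, the following sketch averages the squared color error over the pixels outside the mask. The function name `unmasked_loss`, the image-shaped inputs, and the use of a squared error are assumptions made for the example; a training loop would normally evaluate this on sampled rays rather than whole images.

```python
import numpy as np

def unmasked_loss(rendered, ground_truth, mask):
    """Mean squared color error over pixels outside the mask.

    rendered, ground_truth: (h, w, 3) color images
    mask: (h, w) boolean array, True where the unwanted region lies
    """
    keep = ~mask                                  # pixels not affected by the mask
    diff = rendered[keep] - ground_truth[keep]    # (num_unmasked, 3)
    return float(np.mean(np.sum(diff ** 2, axis=-1)))
```

Errors inside the masked region do not contribute, which is why the masked geometry and color need the separate losses described below.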

The loss for the masked portion based on depth is developed by Equations 4, 5 and 6.

Equation 4

Equation 5

Equation 6

Above, scalars h and w are the height and width of the input images.

Concerning the matrices H and V, for a pixel p at position (p_x, p_y), H(p) = p_x and V=p_y.

The monocular depth estimation of the masked region from the reference image, in terms of disparity, is

. The disparity from the NeRF model is

.

The coefficients

in Equation 4 are found by optimization, with F being the objective (Equation 5).

In Equation 4, J is the all-ones matrix.

The inverse of the distance between p and the mask is

.

In Equation 6, the expectation is over

.

Also, in Equation 6,

is a variable obtained by optimizing

to encourage greater smoothness around the mask. An example smoothing technique minimizes the total variation of

around mask boundaries.
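The alignment step can be illustrated with a small weighted least-squares fit. The sketch below assumes that the aligned disparity is a linear combination of J, the monocular disparity, H and V, and that the objective F is a weighted sum of squared differences to the NeRF disparity; both forms are assumptions for illustration. The function name `align_disparity` and its arguments are hypothetical, the per-pixel weights are passed in rather than derived from the mask, and the smoothing step is omitted.

```python
import numpy as np

def align_disparity(d_mono, d_nerf, weights):
    """Fit coefficients a0..a3 so that a0*J + a1*d_mono + a2*H + a3*V matches d_nerf.

    d_mono:  (h, w) uncalibrated disparity from a monocular estimator
    d_nerf:  (h, w) disparity rendered from the NeRF
    weights: (h, w) non-negative per-pixel weights for the fit
    """
    height, width = d_mono.shape
    # V holds the row index p_y and H the column index p_x of every pixel.
    V, H = np.mgrid[0:height, 0:width].astype(float)
    J = np.ones((height, width))                  # all-ones matrix
    A = np.stack([J, d_mono, H, V], axis=-1).reshape(-1, 4)
    # Scaling rows by sqrt(weight) turns ordinary least squares into weighted.
    sw = np.sqrt(weights).reshape(-1, 1)
    coeffs, *_ = np.linalg.lstsq(A * sw, d_nerf.reshape(-1, 1) * sw, rcond=None)
    coeffs = coeffs.ravel()
    aligned = (A @ coeffs).reshape(height, width)
    return coeffs, aligned
```

The terms in H and V let the fit absorb a planar drift between the two disparity maps, which a scale and offset alone could not.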

A loss to obtain VDEs for the masked portion is developed by Equations 7, 8 and 9.

Equation 7

Equation 8

Equation 9

The expectation in Equation 9 is over

.

Above, x_i is a shading point position, on a ray emanating from the reference camera (with direction

),

is a corresponding ray direction that intersects x_i from a target-image camera (at o_t).

is an inpainted residual,

is a reference view,

is a view-substituted image,

is a target color

is a mask,

is a bilateral solver.

A loss to solve for occluded areas in the reference image which are however visible in a non-reference image is provided by Equation 10.

Equation 10

In Equation 10, the expectation is over

,

, and color and disparity are

and

.

The above equations are discussed with reference to the drawings. Before discussing the drawings, a partial list of identifiers with comments is provided here.

L_unmasked: this is a NeRF reconstruction loss over the unmasked area of the K input images. See Equation 3.

L_depth: this loss is based on monocular depth estimation

to predict an uncalibrated disparity of the reference image and guide the geometry. See Equation 6.

L_substituted: this loss accounts for view-dependent effects (VDEs) such as specularities and surfaces which are not rough (do not deflect light in every direction). See Equation 9.

L_occluded: the overall algorithm is focused on the reference view, and pixels which are visible in target views but not visible in the reference view are called dis-occluded pixels (they are occluded in the reference view, and become dis-occluded when the scene is viewed from other viewpoints). This loss supervises the NeRF training so that the NeRF produces plausible results with respect to these dis-occluded pixels. See Equation 10.

I_in: the input image chosen as the basis for the reference image.

: the set of input images, excluding

.

I_ref: the reference image, constructed by inpainting a portion of I_in.

I_novel: an image of the 3D scene inpainted into the NeRF; I_novel is from a user-requested viewpoint, and I_novel is produced by the NeRF.

I_{ref, target}: a view-substituted image produced by the NeRF and associated with one of the target viewpoints.

: the view-substituted image with VDEs from res_target after using Equation 8.

: confidences used by a bilateral solver in dis-occlusion processing.

: a target view, exhibiting dis-occluded pixels.

: a disparity image produced by the NeRF during dis-occlusion processing.

: an inpainted version of the target view exhibiting dis-occluded pixels.

: a disparity image obtained using bilateral guidance applied to

.

res_target: a residual used in obtaining the VDEs for one of the target viewpoints.

Obtaining the novel view from the 3D scene inpainted into the NeRF is now described with respect to the figures.

FIG. 1A illustrates an example of a flowchart of a method L1 for rendering an inpainted 3D scene from a novel viewpoint.

For example, a user has a camera. At operation S1-1, the method may include capturing, by the user, several pictures, possibly as a video sequence.

At operation S1-2, the method may include selecting, by the user, one of the images as an input image.

At operation S1-3, the method may include selecting an undesired object to be removed from the input image.

In an embodiment, the electronic device may perform the selection by recommending objects to be erased by the device. The electronic device may select portions of the images with features such as, but not limited to, many light reflections, blurry portions, or portions identified by the electronic device as background objects.

In an embodiment, the method may include performing the selection by the user. The selection may be performed by the user selecting an area around an object, and the electronic device analyzing the identified area and selecting around the object outline.

An embodiment may include an additional selection by the electronic device based on user-selected information. In this embodiment, the electronic device may analyze the selected object and recommend whether other objects of a similar type to the object selected by the user should also be selected and erased from the images.

The method may include removing the undesired object from the images using masks.

In an embodiment, at operation S1-4, a device may obtain information about the object to be inpainted into an image from the user. The device may be an electronic device. The obtaining method may vary, such as, but not limited to, text, voice, click, or touch. An image or a video corresponding to the text or to the voice may be shown, and those images and videos can be inserted into the desired input location. As one example, the device used by the user (possibly a mobile terminal which includes the camera) may determine the identity of a desired object from the user. The identification may be by various methods, such as, but not limited to, a voice command, a text command, a touch command, a click command, or an image or a video submitted to the device. The desired object is inpainted into a reference image. As an example, in an embodiment, the method includes allowing the user to communicate the new object not only by text or voice, but also by providing an image of the desired object, for example to perform manual insertion of an image, and the like. However, the present disclosure is not limited in this regard, and other communication methods may be utilized without departing from the scope of the present disclosure. In an embodiment, the inserted image is downloaded from the Internet (something the user found appealing), or the inserted image is from the user's photo gallery or another photo gallery.

In an embodiment, there may be multiple images corresponding to the text when a user enters text. An embodiment may be configured to allow a user to select from the multiple images indicated in a list shown at the bottom of the electronic device user interface display or on the side of the electronic device user interface display. An embodiment also allows the device to move the image part as desired once the image that corresponds to the multiple texts is selected, and that image part enters the inpainted region.

At operation S1-5, the device, using the NeRF, may remove the undesired object and fill in the gap in the 3D scene with the desired object; this creates I_ref. Methods for performing this inpainting are known to practitioners working in this field. I_ref is an inpainted reference view, providing the information that a user expects to be extrapolated into a 3D inpainting of the scene which is the subject of the images {I_i}.

At operation S1-6, the method may include training a neural radiance field (NeRF) to represent the inpainted 3D scene. See FIGS. 3-4 and Equations 1-10.

At operation S1-7, the method may include obtaining, from the user, a viewpoint from which to view the 3D scene.

At operation S1-8, the device may render the novel viewpoint and display it to the user. See FIG. 1C.

At operation S1-9, the method may include choosing, by the user, another object to inpaint or to view the 3D scene from yet another viewpoint.

FIG. 1B illustrates adding a selected object to a 3D scene, according to an embodiment.

FIG. 1B illustrates an example of a mobile device displaying a first image I_in. Examples of a mobile device may be a smartphone with a camera, a tablet PC with a camera, and the like. A mobile device is an example and embodiments are not limited to mobile devices. An embodiment is applicable to electronic devices, such as, but not limited to, an AR headset, smart glasses, or a smartphone. The method may include selecting an object (for example, a flowerpot in FIG. 1B) and adding it to the first image I_in to obtain an inpainted version of the first image, the reference image I_ref.

FIG. 1C illustrates an example of a rendering of the inpainted 3D scene of FIG. 1B from the novel viewpoint to obtain the image I_novel.

FIG. 2A illustrates an example of the overall system for providing the novel view. K views, K masks, the reference view with an additional object inpainted, and a request for a rendering from a novel viewpoint are provided to the NeRF. The training of the NeRF may occur at a mobile terminal, at a server and the like. The NeRF may provide a novel view I_novel of an inpainted 3D scene.

FIG. 2B illustrates an example of a method L2 for training a NeRF to represent an inpainted 3D scene and using the NeRF to obtain the novel view I_novel. In an embodiment, the method may include receiving K views of a scene as an input. The i-th view may be denoted as image I_i.

At operation S2-1, the method may include segmenting an undesired object to remove it from the scene in each view. This may result in a mask for each view. The i-th mask may be denoted M_i.

At operation S2-2, the method may include selecting one of the images from the set {I_i} as the input image from which to create the reference image I_ref. At operation S2-3, the method may include training an inpainting neural radiance field to represent an inpainted 3D scene. The NeRF may be a neural network specific to the scene.

At operation S2-4, the method may include using the NeRF to render the inpainted 3D scene from a novel viewpoint, to obtain I_novel.

FIG. 3 illustrates an example of method L3 for training the NeRF to represent an inpainted 3D scene in terms of four training epochs.

Each training phase in the figure has a predefined number of iterations inside it. Each training iteration in NeRF training samples random rays from the input views in the scene, renders them using the current NeRF network, and updates the NeRF parameters by minimizing the corresponding losses.

The loss L_unmasked may be used at operation A1. See Equation 3. Operation A1 may be performed once every N_unmasked iterations. The input views, camera parameters, masks, and the (inpainted) reference view may be used as input at operation A1. At operation A1, the method may include training the NeRF for the unmasked portion using the loss L_unmasked. At operation A1, the losses may be cumulative. The method may include training with available losses.

The losses L_depth and L_unmasked may be used at operation A2. See Equations 3 and 6. Operation A2 may be performed once every N_depth iterations. At operation A2, the method may include a depth estimation of the masked portion. The depth estimation of the masked portion may include training using L_depth and L_unmasked. At operation A2, the losses may be cumulative. The method may include training with available losses.

The losses L_substituted, L_depth and L_unmasked may be used at operation A3. See Equations 3, 6 and 9. Operation A3 may be performed once every N_substituted iterations. K-1 target views, the (inpainted) reference view, and the result of operation A2 may be used as input at operation A3. At operation A3, the method may include view substitution training using L_substituted, L_depth and L_unmasked. At operation A3, the losses may be cumulative. The method may include training with available losses.

The losses L_occluded, L_substituted, L_depth and L_unmasked may be used at operation A4. See Equations 3, 6, 9 and 10. Operation A4 may be performed once every N_occluded iterations. At operation A4, the method may include training on dis-occluded pixels in target views using L_occluded, L_substituted, L_depth and L_unmasked. At operation A4, the losses may be cumulative. The method may include training with available losses. Operation A4 may output the trained NeRF representing the inpainted 3D scene.

In an embodiment, one or more of A2, A3 and A4 may be omitted from training the NeRF.

FIG. 4 illustrates an example of further details of training the NeRF of FIG. 3 to represent the inpainted 3D scene.

At operation A1-1, the NeRF may be trained for the unmasked portion of the images {I_i}. At operation A1-1, training may be performed using L_unmasked.

At operation A2-1, a depth of the masked portion in the reference image may be obtained. At operation A2-2, disparity alignment and smoothing may be performed. At operation A2-3, training may be performed using L_unmasked and L_depth.

At operation A3-1, colors along a ray from the reference camera may be obtained but with view directions from target cameras. This is referred to as view-substitution. At operation A3-2, the view-substituted image I_{ref, target} may be compared with the reference view I_ref to get a residual, res_target. At operation A3-3, view-dependent effects (VDEs) may be obtained by using a bilateral solver. The bilateral solver may treat I_ref as reference input. At operation A3-3, confidence may be zero inside the mask. See Equation 8. At operation A3-4, target colors which include the VDEs for this view may be obtained. At operation A3-5, training may be performed using L_unmasked, L_depth and L_substituted. See Equation 9.

At operation A4-1, disoccluded pixels may be determined by reprojecting all pixels from the reference view into a target view. At operation A4-2, the disoccluded pixels may be inpainted for view t using leftmost, rightmost and topmost target images. At operation A4-3, a disparity version of the disoccluded pixels may be inpainted using a bilateral solver. At operation A4-4, training of the NeRF may be performed using L_unmasked, L_depth, L_substituted, and L_occluded. See Equations 3, 6, 9 and 10. Operation A4-4 may output the trained NeRF representing the inpainted 3D scene.
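The reprojection at operation A4-1 can be illustrated with a pinhole-camera sketch. Everything here is an assumption made for the example rather than the disclosed procedure: the function name `disoccluded_mask`, a single intrinsic matrix shared by both views, depth measured along the optical axis, a rigid transform (R, t) from reference-camera to target-camera coordinates, and nearest-pixel rounding.

```python
import numpy as np

def disoccluded_mask(depth_ref, K, R, t, target_mask):
    """Mark masked target pixels that no reference pixel reprojects onto.

    depth_ref:   (h, w) depth of every reference pixel
    K:           (3, 3) pinhole intrinsics shared by both views (assumption)
    R, t:        rotation (3, 3) and translation (3,) from reference to target
    target_mask: (h, w) boolean mask of the inpainted region in the target view
    """
    h, w = depth_ref.shape
    v, u = np.mgrid[0:h, 0:w]
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(float)
    # Back-project reference pixels to 3D, then move them into the target frame.
    pts_ref = (pix @ np.linalg.inv(K).T) * depth_ref.reshape(-1, 1)
    pts_tgt = pts_ref @ R.T + t
    pts_tgt = pts_tgt[pts_tgt[:, 2] > 0]           # keep points in front of the camera
    proj = pts_tgt @ K.T
    uv = np.rint(proj[:, :2] / proj[:, 2:3]).astype(int)
    inside = (uv[:, 0] >= 0) & (uv[:, 0] < w) & (uv[:, 1] >= 0) & (uv[:, 1] < h)
    covered = np.zeros((h, w), dtype=bool)
    covered[uv[inside, 1], uv[inside, 0]] = True   # target pixels seen from the reference
    return target_mask & ~covered
```

Pixels returned as True are the ones the reference view cannot explain, so they are the candidates for the additional inpainting described at operation A4-2.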

FIG. 5 illustrates an example of geometry related to a view substitution technique. The view substitution technique disclosed herein may enable rendering from the reference viewpoint, but with the view-dependent effects of a target viewpoint, by substituting the directional input to the per-shading-point neural color field. The upper portion of FIG. 5, 510, illustrates that, given a shading point position, x_i, on a ray emanating from the reference camera, an embodiment may obtain the corresponding ray direction that intersects x_i from a target-image camera (at o_t). See Equation 7. The lower portion of FIG. 5, 520 and 530, illustrates, at 520, that standard inputs (the position x_i and the reference ray direction) may be used to query the NeRF for the color at shading point x_i. At 530, FIG. 5 shows that view-substituted inputs (the position x_i and the target ray direction) may be used to query the NeRF, obtaining a view-substituted color instead.
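The geometry of FIG. 5 can be illustrated with a short sketch. It assumes that the substituted direction is the unit vector pointing from the target camera center o_t toward the shading point x_i; this is a natural reading of the figure description, but the exact form of Equation 7 is not reproduced here, and the function names are hypothetical.

```python
import math


def shading_points(o_ref, d_ref, depths):
    """Sample points x_i = o_ref + t * d_ref along a reference-camera ray."""
    return [[o + t * d for o, d in zip(o_ref, d_ref)] for t in depths]


def view_substituted_direction(x_i, o_t):
    """Unit direction of the ray from target camera center o_t through x_i.

    Assumption: the direction is normalized; the NeRF color query would then
    use (x_i, this direction) instead of (x_i, reference ray direction).
    """
    diff = [a - b for a, b in zip(x_i, o_t)]
    norm = math.sqrt(sum(c * c for c in diff))
    if norm == 0.0:
        raise ValueError("shading point coincides with the target camera center")
    return [c / norm for c in diff]


if __name__ == "__main__":
    points = shading_points((0.0, 0.0, 0.0), (0.0, 0.0, 1.0), [1.0, 2.0])
    for x_i in points:
        print(x_i, view_substituted_direction(x_i, (3.0, 0.0, -2.0)))
```

The shading point positions are unchanged by the substitution; only the directional input to the color query differs, so the density (and hence the geometry seen from the reference camera) is unaffected.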

The NeRF (for example, in FIG. 5) may provide 3D information (3D point color and density), which then have to be integrated along a ray to get rendered (i.e. get a view). The output from a NeRF network may be 3D.
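The integration along a ray mentioned above can be sketched with the standard NeRF volume rendering quadrature, in which each sample contributes its color weighted by its opacity and by the transmittance accumulated in front of it. This is a generic sketch rather than the specific renderer of any embodiment; the function name and the list-based inputs are assumptions for illustration.

```python
import math


def render_ray(colors, densities, deltas):
    """Composite per-sample colors and densities into one pixel color.

    colors:    list of (r, g, b) tuples, one per shading point along the ray
    densities: list of non-negative densities sigma_i
    deltas:    list of distances between adjacent samples
    """
    transmittance = 1.0          # fraction of light not yet absorbed
    pixel = [0.0, 0.0, 0.0]
    for color, sigma, delta in zip(colors, densities, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)   # opacity of this segment
        weight = transmittance * alpha
        pixel = [p + weight * c for p, c in zip(pixel, color)]
        transmittance *= 1.0 - alpha
    return pixel


if __name__ == "__main__":
    # An empty first sample followed by a nearly opaque green sample.
    print(render_ray([(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)], [0.0, 1e6], [0.1, 0.1]))
```

A view-substituted rendering would use the same compositing, with the per-sample colors queried using target view directions.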

FIGS. 6-11 present some example results at the level of image changes. FIG. 6 represents an example of an input image, I_in. FIG. 7 illustrates an example of removing a backpack from the input image of FIG. 6 and obtaining a text command to inpaint a red fence, and inpainting the red fence. FIG. 8 illustrates an example of removing a backpack from the input image of FIG. 6 and obtaining a text command to inpaint a rubber duck, and inpainting the rubber duck. FIG. 9 illustrates an example of removing a backpack from the input image of FIG. 6 and obtaining a text command to inpaint a flower pot, and inpainting the flower pot. FIG. 10 illustrates an example of an input image with a backpack as an undesired object to be removed. FIG. 11 illustrates an example of the red fence replacing, in an inpainted region, the backpack using 2D inpainting. A text command is an example and embodiments are not limited to text commands. An embodiment can obtain information about the object to be inpainted into an image in various forms, such as, but not limited to, text, voice, click, or touch. An image or video corresponding to the text or to the voice may be shown, and those images and videos can be inserted at the desired input location.

In an embodiment, a device may obtain (e.g. receive, capture, download) a plurality of images or a short video, while moving a camera around a scene. The device may then interactively segment the object of interest from the scene, using well known techniques (e.g. SPIn-NeRF).

In an embodiment, reference-guided controllable 3D scene inpainting may be performed. The method may include selecting a view and using a controllable 2D inpainting method to inpaint the object. The controllable inpainting method may be, for one example, stable diffusion inpainting guided by text input. Alternatively, the method may include creating the inpainted image by first inpainting it with the background using any 2D inpainting method and then overlaying an object of interest manually in the inpainted region. An inpainting NeRF may be then trained guided by the single inpainted view. The inpainted NeRF may be used to render the inpainted 3D scene from arbitrary views.

For example, FIG. 12, related to FIG. 10, illustrates an example of replacing the backpack by pasting an image of a mailbox. FIG. 13, related to FIG. 10, illustrates an example of replacing the backpack by inpainting the red fence and then manually pasting a shrub. In an embodiment, the method may include obtaining an indication of a selection of an object to be inpainted into the first image.

However, the present disclosure is not limited in this regard, and other methods or examples may be utilized without departing from the scope of the present disclosure.

FIGS. 14 to 19 illustrate examples of images that describe a method for training NeRFs with view substitution. FIG. 14 illustrates an example of a reference image, I_ref, in which an object has been removed. FIG. 15 illustrates an example of a set of input images. The images of {I_i} other than I_in are referred to as target images. The undesired object, UO, in FIG. 15 is a music book on a piano stand. FIG. 16 illustrates an example of a set of masks M_i corresponding to the input images of FIG. 15. FIG. 17 illustrates an example of an initial target view, I_{ref, target}, with distortion in the area corresponding to the inpainting in the reference view. FIG. 18 illustrates an example of a residual, res_target, with respect to the target view of FIG. 17.

FIG. 19 illustrates an example of an updated rendering of the target view based on the residual of FIG. 18.

An embodiment may provide view-dependent effects as follows. For each target, t, the scene may be rendered from the reference camera with target colors to get the view-substituted image, I_{ref, target} (FIG. 17). A bilateral solver may inpaint the residual between the reference view and the view-substituted image, see Equation 8, resulting in the inpainted residual, res_target (FIG. 18), which is subtracted from the reference view to get the target color (FIG. 19). The discrepancy between the target colors and the view-substituted images may provide supervision for the masked region.
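The residual arithmetic described above can be illustrated with a small sketch. It does not implement a bilateral solver: as a loudly labeled stand-in, the residual inside the mask (where confidence is zero) is filled with the mean of the residual outside the mask. The function names, the flat grayscale image representation, and the mean fill are all assumptions made for the example.

```python
def inpaint_residual_mean_fill(i_ref, i_ref_target, mask):
    """Residual between reference view and view-substituted image.

    mask[k] is True inside the inpainted region, where the residual is not
    trusted. STAND-IN: masked entries are replaced by the mean residual of
    the unmasked pixels; the text uses a bilateral solver here (Equation 8).
    """
    residual = [r - s for r, s in zip(i_ref, i_ref_target)]
    known = [v for v, m in zip(residual, mask) if not m]
    fill = sum(known) / len(known) if known else 0.0
    return [fill if m else v for v, m in zip(residual, mask)]


def target_colors(i_ref, res_target):
    """Subtract the inpainted residual from the reference view."""
    return [r - v for r, v in zip(i_ref, res_target)]


if __name__ == "__main__":
    i_ref = [0.5, 0.6, 0.9, 0.4]          # reference view (third pixel is inpainted)
    i_ref_target = [0.4, 0.5, 0.1, 0.3]   # view-substituted rendering
    mask = [False, False, True, False]
    res = inpaint_residual_mean_fill(i_ref, i_ref_target, mask)
    print(target_colors(i_ref, res))
```

Outside the mask, the target color equals the view-substituted rendering by construction; inside the mask, the reference content is kept and shifted by the propagated residual, which is how the view-dependent change is carried into the inpainted region.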

After obtaining the view-substituted images (after at least N_substitute iterations), the training may be able to supervise the masked appearances of the target images. Each such image may look at the scene via the reference source camera (e.g., has the image structure of I_ref), but may have the colors (in particular, VDEs) of I_target. An embodiment may use those colors, obtained by the bilateral solver of Equation 8, to supervise the target view appearance under the mask (that is, in R_mask). An embodiment may render each view-substituted image inside the mask (as in FIG. 17), and compute a reconstruction loss by comparing it to the bilaterally inpainted output, as shown in Equation 9.

FIG. 20 illustrates an example of dis-occlusion processing to improve the inpainted 3D scene represented by the NeRF for views other than the reference view.

While single-reference inpainting may prevent problems incurred by view-inconsistent inpaintings, it is missing multiview information in the inpainted region. For example, when inserting a duck into the scene (see FIG. 20), viewing the scene from another perspective naturally may unveil new details on and around the duck, due to dis-occlusions (see the dark areas marked in the image second from left in FIG. 20). An embodiment may construct these missing details.

An embodiment may identify pixels in the target view that are not visible from the reference view, to build a dis-occlusion mask. An embodiment then may inpaint a color image masked by the dis-occlusion mask; see the upper right image in FIG. 20. This is followed by in-filling a disparity rendered image, using bilateral guidance to ensure consistency. See the upper right image in FIG. 20 and the disparity image of FIG. 20, which are arguments for terms in L_occluded of Equation 10. Finally, these inpainted disoccluded values may be used for supervision. See A4 of FIG. 3.
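Building a dis-occlusion mask by reprojection can be sketched as follows. The sketch makes strong simplifying assumptions that are not taken from the disclosure: both cameras are pinhole cameras with identical intrinsics and identity rotation (they differ only in position), reprojected pixels are rounded to the nearest target pixel, and no visibility (z-buffer) test is applied. All names are hypothetical.

```python
def disocclusion_mask(depth_ref, o_ref, o_tgt, focal, cx, cy):
    """Mark target pixels that no reference pixel reprojects onto.

    depth_ref: 2D list of depths (along +z) for every reference pixel
    o_ref, o_tgt: camera centers; both cameras look along +z (assumption)
    Returns a 2D list of booleans, True where the target pixel is dis-occluded.
    """
    height, width = len(depth_ref), len(depth_ref[0])
    covered = [[False] * width for _ in range(height)]
    for v in range(height):
        for u in range(width):
            z = depth_ref[v][u]
            # Back-project the reference pixel to a 3D point.
            point = (o_ref[0] + (u - cx) / focal * z,
                     o_ref[1] + (v - cy) / focal * z,
                     o_ref[2] + z)
            # Express the point relative to the target camera.
            px, py, pz = (point[k] - o_tgt[k] for k in range(3))
            if pz <= 0:
                continue  # behind the target camera
            ut = round(focal * px / pz + cx)
            vt = round(focal * py / pz + cy)
            if 0 <= ut < width and 0 <= vt < height:
                covered[vt][ut] = True
    return [[not c for c in row] for row in covered]


if __name__ == "__main__":
    depth = [[1.0, 1.0, 1.0, 1.0]]
    print(disocclusion_mask(depth, (0.0, 0.0, 0.0), (1.0, 0.0, 0.0), 1.0, 0.0, 0.0))
```

In practice, rounding can leave small sampling holes that are not true dis-occlusions, so a real pipeline would likely add splatting or morphological cleanup; that choice is outside the scope of this sketch.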

A quantitative full-reference (FR) evaluation of 3D inpainting techniques on the inpainted areas of held-out views from the SPIn-NeRF dataset is shown in Table 1. Columns show distance from known ground-truth images of the scene (without the target object), based on learned perceptual image patch similarity (LPIPS) and a feature-based statistical distance (Fréchet inception distance, FID).

An embodiment with stable diffusion (SD) performs best by both metrics.

Table 1. Quantitative full-reference (FR) evaluation of 3D inpainting techniques (lower is better for both metrics)

Method	LPIPS	FID
NeRF + LaMa (2D)	0.5369	174.61
Object NeRF	0.6829	271.80
L_unmasked	0.6030	294.69
L_unmasked + DreamFusion	0.5934	264.71
NeRF-In, multiple	0.5699	238.33
NeRF-In, single	0.4884	183.23
SPIn-NeRF-SD	0.5701	186.48
SPIn-NeRF-LaMa	0.4654	156.64
An embodiment (FIGS. 3-4 and Equations 1-10), using stable diffusion	0.4532	116.24

As seen in Table 1, an embodiment may provide the best performance on both FR metrics. The Object-NeRF and Masked-NeRF approaches, which perform object removal without altering the newly revealed areas, perform the worst. Combining Masked-NeRF with DreamFusion performs slightly better. This indicates some utility of the diffusion prior; however, while DreamFusion can generate impressive 3D entities in isolation, it does not produce sufficiently realistic outputs for inpainting real scenes. SPIn-NeRF-SD obtains a similarly poor LPIPS, though with better FID, as it is unable to cope with the greater mismatches of the SD generations. NeRF-In outperforms the aforementioned models. Still, the use of a pixelwise loss leads to blurry outputs. Finally, an embodiment outperforms the second-best model (SPIn-NeRF-LaMa) considerably in terms of FID, reducing it by approximately 26% (from 156.64 to 116.24). An embodiment is also applicable to videos. Table 2 provides an indication of the technical improvement. SD and LaMa are known inpainters.
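The relative improvement over the second-best method can be checked directly from the Table 1 rows. The following sketch only performs that arithmetic; the helper name is hypothetical, and the numbers are the SPIn-NeRF-LaMa and embodiment (stable diffusion) entries as printed in the table.

```python
def relative_reduction(baseline, value):
    """Percentage by which `value` is lower than `baseline`."""
    return (baseline - value) / baseline * 100.0


# Table 1 entries: SPIn-NeRF-LaMa versus an embodiment using stable diffusion.
fid_reduction = relative_reduction(156.64, 116.24)
lpips_reduction = relative_reduction(0.4654, 0.4532)

if __name__ == "__main__":
    print(f"FID reduction:   {fid_reduction:.1f}%")
    print(f"LPIPS reduction: {lpips_reduction:.1f}%")
```

Because both metrics are distances to ground truth, a positive reduction indicates an improvement.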

Table 2. Quantitative no-reference (NR) evaluation of 3D inpainting techniques on videos (higher is better for both metrics)

Method	Sharpness	MUSIQ
SPIn-NeRF-LaMa	354.31	58.10
An embodiment, using LaMa	394.55	62.0
An embodiment, using SD	398.56	61.47

FR measures are limited by their use of a single ground-truth (GT) target image. No-reference (NR) performance is therefore also examined, demonstrating improvements over SPIn-NeRF in terms of both sharpness (by 11.2%) and MUSIQ (by 5.8%); see Table 2. Table 2 indicates that embodiments provide a novel view which is numerically sharper and more realistic.

FIG. 21 illustrates an exemplary apparatus 21-1 for implementation of an embodiment disclosed herein. FIG. 21 illustrates hardware for performing the embodiments provided. The apparatus 21-1 may be a server, a computer, a laptop computer, a handheld device, or a tablet computer device, for example.

As an example, the NeRF of FIG. 2A performing the method L2 of FIG. 2B is located on the electronic device, and the method L2 may process the information obtained from an input unit of the electronic device.

As an example, the NeRF of FIG. 2A performing the method L2 of FIG. 2B is located on a server, and the images are server images. An input value for the server image (area selection, object information to be inpainted, content obtained from text, voice and the like) may be obtained from the communication unit of the server and applied using the method L2 of FIG. 2B.

Apparatus 21-1 may include one or more hardware processors 21-9. The one or more hardware processors 21-9 may include an ASIC (application specific integrated circuit), CPU (for example CISC or RISC device), and/or custom hardware. An embodiment can be deployed on various GPUs. As an example, a provider of GPUs is Nvidia^TM, Santa Clara, California. For example, an embodiment may have been deployed on Nvidia^TM A6000 GPUs with 48GB of GDDR6 memory.

An embodiment may be deployed on various computers, servers or workstations. Lambda^TM is a workstation company in San Francisco, California. Experiments using embodiments have been conducted on a Lambda^TM Vector Workstation.

Apparatus 21-1 also may include a user interface 21-5 (for example a display screen and/or keyboard and/or pointing device such as a mouse). Apparatus 21-1 may include one or more volatile memories 21-2. Apparatus 21-1 may include one or more non-volatile memories 21-3. The one or more non-volatile memories 21-3 may include a computer readable medium storing instructions for execution by the one or more hardware processors 21-9 to cause apparatus 21-1 to perform any of the methods of embodiments disclosed herein.

Apparatus 21-1 may include wired and/or wireless interfaces 21-4. The wired and/or wireless interfaces 21-4 may include a receiver component, a transmitter component, and/or a transceiver component. The wired and/or wireless interfaces 21-4 may enable the apparatus 21-1 to establish connections and/or transfer communications with other devices (e.g., a server, another device). The communications may be effected via a wired connection, a wireless connection, or a combination of wired and wireless connections. The wired and/or wireless interfaces 21-4 may permit the apparatus 21-1 to receive information from another device and/or provide information to another device. In an embodiment, the wired and/or wireless interfaces 21-4 may provide for communications with another device via a network, such as, but not limited to a local area network (LAN), a wide area network (WAN), a metropolitan area network (MAN), a private network, an ad hoc network, an intranet, the Internet, a fiber optic-based network, a cellular network (e.g., a fifth generation (5G) network, a long-term evolution (LTE) network, a third generation (3G) network, a code division multiple access (CDMA) network, and the like), a public land mobile network (PLMN), a telephone network (e.g., the Public Switched Telephone Network (PSTN)), or the like, and/or a combination of these or other types of networks. In an embodiment, the wired and/or wireless interfaces 21-4 may provide for communications with another device via a device-to-device (D2D) communication link, such as, but not limited to FlashLinQ, WiMedia, Bluetooth, Bluetooth Low Energy (BLE), ZigBee, Institute of Electrical and Electronics Engineers (IEEE) 802.11x (Wi-Fi), LTE, 5G, and the like. In an embodiment, the wired and/or wireless interfaces 21-4 may include an Ethernet interface, an optical interface, a coaxial interface, an infrared interface, a radio frequency (RF) interface, a USB interface, an IEEE 1394 (FireWire) interface, or the like.

Apparatus 21-1 may include a display device 21-6. The display device 21-6 may include one or more components that may serve to present information from the set of components of the apparatus 21-1. For example, the display device 21-6 may be a computer monitor, a smartphone screen, a television (TV), a tablet screen, a digital watch, an AR headset, and the like. The present disclosure is not limited in this regard.

Apparatus 21-1 may include a bus 21-7. The set of components of the apparatus 21-1 may be communicatively coupled via the bus 21-7. The bus 21-7 may include one or more components that may permit communication among the set of components of the apparatus 21-1. For example, the bus 21-7 may be a communication bus, a cross-over bar, a network, or the like. Although the bus 21-7 is depicted as a single line in FIG. 21, the bus 21-7 may be implemented using multiple (e.g., two or more) connections between the set of components of the apparatus 21-1. The present disclosure is not limited in this regard.

An embodiment provides an approach to inpaint NeRFs, via a single inpainted reference image. An embodiment may use a monocular depth estimator, aligning its output to the coordinate system of the inpainted NeRF to back-project the inpainted material from the reference view into 3D space. An embodiment uses bilateral solvers to add VDEs to the inpainted region, and uses 2D inpainters to fill dis-occluded areas. Table 1 and Table 2, using multiple evaluation metrics, illustrate the superiority of an embodiment over prior 3D inpainting methods.

Finally, an embodiment includes a controllability advantage enabling users to easily alter a generated 3D scene through a single guidance image (I_ref). However, the present disclosure is not limited in this regard, and other advantages may be realized without departing from the scope of the present disclosure.

An embodiment of the present disclosure may solve one or more technical problems.

An embodiment may use a single inpainted reference, thus avoiding view inconsistencies. To geometrically supervise the inpainted area, an embodiment may use an optimization-based formulation with monocular depth estimation. An embodiment may obtain view dependent effects (VDEs) of non-reference views from the reference viewpoint. This may enable a guided inpainting approach, propagating non-reference colors (with VDEs) into the mask area of the 3D scene represented by the NeRF. An embodiment may also inpaint disoccluded appearance and geometry in a consistent manner.

An embodiment may be provided for inpainting regions in a view-consistent and controllable manner. In addition to the typical NeRF inputs and masks delineating the unwanted region in each view, an embodiment may require only a single inpainted view of the scene, e.g., a reference view. An embodiment may use monocular depth estimators to back-project the inpainted view to the correct 3D positions. Then, via a novel rendering technique, a bilateral solver of an embodiment may construct view-dependent effects in non-reference views, making the inpainted region appear consistent from any view. For non-reference disoccluded regions, which cannot be supervised by the single reference view, an embodiment may provide a method based on image inpainters to guide both the geometry and appearance. An embodiment may show superior performance to NeRF inpainting baselines, with the additional advantage that a user can control the generated scene via a single inpainted image.

The foregoing disclosure provides illustration and description, but is not intended to be exhaustive or to limit the implementations to the precise form disclosed. Modifications and variations are possible in light of the above disclosure or may be acquired from practice of the implementations.

As used herein, the terms "component," "module," "system" and the like are intended to include a computer-related entity, such as but not limited to hardware, firmware, a combination of hardware and software, software, or software in execution. For example, a component may be, but is not limited to being, a process running on a processor, a processor, an object, an executable, a thread of execution, a program, and/or a computer. By way of illustration, both an application running on a computing device and the computing device can be a component. One or more components can reside within a process and/or thread of execution and a component can be localized on one computer and/or distributed between two or more computers. In addition, these components can execute from various computer readable media having various data structures stored thereon. The components can communicate by way of local and/or remote processes such as in accordance with a signal having one or more data packets, such as data from one component interacting with another component in a local system, distributed system, and/or across a network such as the Internet with other systems by way of the signal.

An embodiment may relate to a system, a method, and/or a computer readable medium at any possible technical detail level of integration. The computer readable medium may include a computer-readable storage medium (or media) having computer readable program instructions thereon for causing a processor to carry out operations. Computer-readable media may exclude transitory signals.

The computer readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. A non-exhaustive list of more specific examples of the computer readable storage medium includes the following: a portable computer diskette, a hard disk, a RAM, a ROM, an erasable programmable read-only memory (EPROM or Flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a DVD, a memory stick, a floppy disk, a mechanically encoded device such as punch-cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing. A computer readable storage medium, as used herein, is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (e.g., light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire.

Computer readable program instructions described herein can be downloaded to respective computing/processing devices from a computer readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network and/or a wireless network. The network may include copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. A network adapter card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium within the respective computing/processing device.

Computer readable program code/instructions for carrying out operations may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state-setting data, configuration data for integrated circuitry, or either source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++, or the like, and procedural programming languages, such as the "C" programming language or similar programming languages. The computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a LAN or a WAN, or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider (ISP)). In an embodiment, electronic circuitry including, for example, programmable logic circuitry, FPGAs, or programmable logic arrays (PLAs) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects or operations.

These computer readable program instructions may be provided to a processor of a general-purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, a programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable storage medium having instructions stored therein includes an article of manufacture including instructions which implement aspects of the function/act specified in the flowchart and/or block diagram block or blocks.

The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable apparatus or other device to produce a computer implemented process, such that the instructions which execute on the computer, other programmable apparatus, or other device implement the functions/acts specified in the flowchart and/or block diagram block or blocks.

At least one of the components, elements, modules or units (collectively "components" in this paragraph) represented by a block in the drawings may be embodied as various numbers of hardware, software and/or firmware structures that execute respective functions described above, according to an example embodiment. According to example embodiments, at least one of these components may use a direct circuit structure, such as a memory, a processor, a logic circuit, a look-up table, and the like, that may execute the respective functions through controls of one or more microprocessors or other control apparatuses. Also, at least one of these components may be specifically embodied by a module, a program, or a part of code, which contains one or more executable instructions for performing specified logic functions, and executed by one or more microprocessors or other control apparatuses. Further, at least one of these components may include or may be implemented by a processor such as a CPU that performs the respective functions, a microprocessor, or the like. Two or more of these components may be combined into one single component which performs all operations or functions of the combined two or more components. Also, at least part of functions of at least one of these components may be performed by another of these components. Functional aspects of the above example embodiments may be implemented in algorithms that execute on one or more processors. Furthermore, the components represented by a block or processing steps may employ any number of related art techniques for electronics configuration, signal processing and/or control, data processing and the like.

The flowchart and block diagrams in the drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer readable media according to various embodiments. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which includes one or more executable instructions for implementing the specified logical functions. The method, computer system, and computer readable medium may include additional blocks, fewer blocks, different blocks, or differently arranged blocks than those depicted in the Figures. In some alternative implementations, the functions noted in the blocks may occur out of the order noted in the Figures. For example, two blocks shown in succession may, in fact, be executed concurrently or substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It may also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts or carry out combinations of special purpose hardware and computer instructions.

It is to be apparent that systems and/or methods, described herein, may be implemented in different forms of hardware, firmware, or a combination of hardware and software. The actual specialized control hardware or software code used to implement these systems and/or methods is not limiting of the implementations. Thus, the operation and behavior of the systems and/or methods were described herein without reference to specific software code, it being understood that software and hardware may be designed to implement the systems and/or methods based on the description herein.

No element, act, or instruction used herein should be construed as critical or essential unless explicitly described as such. Also, as used herein, the articles "a" and "an" are intended to include one or more items, and may be used interchangeably with "one or more." Furthermore, as used herein, the term "set" is intended to include one or more items (e.g., related items, unrelated items, a combination of related and unrelated items, and the like), and may be used interchangeably with "one or more." Where only one item is intended, the term "one" or similar language is used. Also, as used herein, the terms "has," "have," "having," "includes," "including," or the like are intended to be open-ended terms. Further, the phrase "based on" is intended to mean "based, at least in part, on" unless explicitly stated otherwise. In addition, expressions such as "at least one of [A] and [B]" or "at least one of [A] or [B]" are to be understood as including only A, only B, or both A and B.

Reference throughout this specification to "one embodiment," "an embodiment," or similar language means that a particular feature, structure, or characteristic described in connection with the indicated embodiment is included in at least one embodiment of the present solution. Thus, the phrases "in one embodiment", "in an embodiment," and similar language throughout this specification may, but do not necessarily, all refer to the same embodiment. As used herein, such terms as "1st" and "2nd," or "first" and "second" may be used to simply distinguish a corresponding component from another, and do not limit the components in other aspects (e.g., importance or order). It is to be understood that if an element (e.g., a first element) is referred to, with or without the term "operatively" or "communicatively", as "coupled with," "coupled to," "connected with," or "connected to" another element (e.g., a second element), it means that the element may be coupled with the other element directly (e.g., wired), wirelessly, or via a third element.

It is to be understood that when an element or layer is referred to as being "over," "above," "on," "below," "under," "beneath," "connected to" or "coupled to" another element or layer, it can be directly over, above, on, below, under, beneath, connected or coupled to the other element or layer or intervening elements or layers may be present. In contrast, when an element is referred to as being "directly over," "directly above," "directly on," "directly below," "directly under," "directly beneath," "directly connected to" or "directly coupled to" another element or layer, there are no intervening elements or layers present.

The descriptions of the various aspects and embodiments have been presented for purposes of illustration, but are not intended to be exhaustive or limited to the embodiments disclosed. Even though combinations of features are recited in the claims and/or disclosed in the specification, these combinations are not intended to limit the disclosure of possible implementations. In fact, many of these features may be combined in ways not specifically recited in the claims and/or disclosed in the specification. Although each dependent claim listed below may directly depend on only one claim, the disclosure of possible implementations includes each dependent claim in combination with every other claim in the claim set. Many modifications and variations may be apparent to those of ordinary skill in the art without departing from the scope of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, the practical application or technical improvement over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

It is to be understood that the specific order or hierarchy of blocks in the processes/flowcharts disclosed is an illustration of exemplary approaches. Based upon design preferences, it is understood that the specific order or hierarchy of blocks in the processes/flowcharts may be rearranged. Further, some blocks may be combined or omitted. The accompanying claims present elements of the various blocks in a sample order, and are not meant to be limited to the specific order or hierarchy presented.

Furthermore, the described features, advantages, and characteristics of the present disclosure may be combined in any suitable manner in one or more embodiments. One skilled in the relevant art may recognize, in light of the description herein, that the present disclosure can be practiced without one or more of the specific features or advantages of a particular embodiment. In other instances, additional features and advantages may be recognized in certain embodiments that may not be present in all embodiments of the present disclosure.

According to an aspect of the present disclosure, a method may include obtaining a plurality of images from a user, wherein the plurality of images were acquired by an electronic device viewing a first scene and each of the plurality of images is associated with a corresponding viewpoint of the first scene. The method may include obtaining a first indication identifying a first image of the plurality of images, wherein the first image is associated with a first viewpoint of the first scene. The method may include obtaining a second indication of a first object to be removed from the first image. The method may include removing the first object from the first image to obtain a reference image. The method may include obtaining a third indication of a second viewpoint from the user, wherein the second viewpoint is different from each of the respective viewpoints of the plurality of images. The method may include rendering, using a neural radiance field (NeRF), a second image that corresponds to a 3D scene as seen from the second viewpoint, wherein the 3D scene has been inpainted into the NeRF. The method may include displaying the second image on a display of the electronic device.

According to an embodiment of the disclosure, the removing of the first object may include performing a first inpainting on the first image by applying a mask to the first object, to obtain the reference image. The method may further include inpainting the 3D scene into the NeRF in part by adjusting a first size of the mask according to a second size of the first object that appears in the second image and applying the mask with the adjusted size to the second image. The method may include, based on a user input requesting an image of the 3D scene seen from the second viewpoint, inputting the reference image and information of the second viewpoint to the NeRF to provide the second image corresponding to the 3D scene seen from the second viewpoint.

According to an embodiment of the disclosure, the method may include obtaining a fourth indication from the user, wherein the fourth indication is associated with a second object to be inpainted into the first image.

The method may include updating, before training the NeRF, the first image to remove the first object from the first image by using a mask, wherein the first image includes an unmasked portion and a masked portion.

The method may include training the NeRF after the first object is removed from the first image, wherein the NeRF is trained to output an inpainted 3D scene from an unobserved view point by accepting as input a reference inpainted view image that is obtained by selecting one of a plurality of views of a scene and applying a mask to inpaint an object into the reference view image.

According to an embodiment of the disclosure, the training may be performed at the electronic device. According to an embodiment of the disclosure, the training may be performed at a server.

According to an embodiment of the disclosure, the method may include obtaining, after the displaying, a fifth indication from the user, wherein the fifth indication is a selection of a second object to be inpainted into the first image. The method may include obtaining a second representative image by inpainting the second object into the first image. The method may include updating the training of the NeRF based on the second representative image. The method may include rendering, using the NeRF, a third image. The method may include displaying the third image on the display of the electronic device. The training the NeRF may include training the NeRF, based on the reference image and the plurality of images, using a first loss associated with the unmasked portion.

According to an embodiment of the disclosure, the training the NeRF may include training the NeRF using a second loss based on the masked portion and an estimated depth, wherein the estimated depth is associated with a first geometry of the first scene in the masked portion.
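The first loss (over the unmasked portion) and the second loss (over the masked portion, against an estimated depth) may be illustrated with a minimal sketch. The function names, the flat per-pixel lists, and the choice of a mean squared error are assumptions made for illustration only; the disclosure does not fix a particular error norm.

```python
def unmasked_color_loss(rendered, target, mask):
    """Mean squared color error over unmasked pixels (mask value 0).

    rendered, target: lists of (r, g, b) tuples, one per pixel.
    mask: list of 0/1 flags, 1 marking the masked (removed-object) region.
    """
    total, count = 0.0, 0
    for pred, gt, m in zip(rendered, target, mask):
        if m:  # masked pixels do not contribute to the first loss
            continue
        total += sum((p - g) ** 2 for p, g in zip(pred, gt))
        count += 1
    return total / count if count else 0.0


def masked_depth_loss(rendered_depth, estimated_depth, mask):
    """Mean squared depth error over masked pixels (mask value 1)."""
    total, count = 0.0, 0
    for d, e, m in zip(rendered_depth, estimated_depth, mask):
        if not m:  # only the masked region is supervised by the depth estimate
            continue
        total += (d - e) ** 2
        count += 1
    return total / count if count else 0.0
```

In this sketch the two losses partition the pixels: color supervision comes only from observed (unmasked) content, while the geometry of the masked region is guided by the depth estimate.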

According to an embodiment of the disclosure, the training the NeRF may include identifying a plurality of disoccluded pixels, wherein the plurality of disoccluded pixels are present in the target image and are associated with the second viewpoint. The method may include determining a fourth loss, wherein the fourth loss is associated with a second inpainting of the plurality of disoccluded pixels of the target image. The method may include training the NeRF using the fourth loss.

According to an embodiment of the disclosure, the method may include, when the first object is removed from the reference image using a first mask and a second size of the first object in other images differs from a first size in the reference image, adjusting mask sizes of respective masks in the other images in proportion to the respective object sizes of the first object in the other images.
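The proportional mask adjustment may be sketched as follows. The helper name `scale_mask_size` and the use of width/height pairs in pixels are hypothetical; the disclosure states only that mask sizes follow the object sizes proportionally.

```python
def scale_mask_size(ref_mask_size, ref_object_size, other_object_size):
    """Scale a (width, height) mask size by the object's relative size.

    ref_mask_size: mask size used in the reference image.
    ref_object_size: size of the object in the reference image.
    other_object_size: size of the same object in another image.
    """
    ref_mask_w, ref_mask_h = ref_mask_size
    ref_obj_w, ref_obj_h = ref_object_size
    obj_w, obj_h = other_object_size
    if ref_obj_w <= 0 or ref_obj_h <= 0:
        raise ValueError("reference object size must be positive")
    # Each mask dimension keeps the same ratio to the object as in the reference.
    return (
        round(ref_mask_w * obj_w / ref_obj_w),
        round(ref_mask_h * obj_h / ref_obj_h),
    )


# An object that appears half as large receives a mask half as large.
print(scale_mask_size((120, 60), (100, 50), (50, 25)))  # (60, 30)
```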

According to an embodiment of the disclosure, a method of training a neural radiance field may include initially training the neural radiance field using a first loss associated with a plurality of unmasked regions respectively associated with a reference image and a plurality of target images, wherein the reference image is associated with a reference viewpoint and each target image of the plurality of target images is associated with a respective target viewpoint. The method may include updating the training of the neural radiance field using a second loss associated with a depth estimate of a masked region in the reference image. The method may include updating the training of the neural radiance field using a third loss associated with a plurality of view-substituted images, wherein each view-substituted image of the plurality of view-substituted images is associated with the respective target view of the plurality of target images, each view-substituted image is a volume rendering from the reference viewpoint across pixels with view-substituted target colors, and the third loss is based on the plurality of view-substituted images.
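One possible reading of a view-substituted volume rendering is sketched below: sample points and densities are taken along a ray of the reference viewpoint, while each sample's color is queried with the direction from the target camera to that sample. The callables `density_fn` and `color_fn` and the parameter `target_origin` are hypothetical stand-ins for queries to the radiance field, and the compositing is the usual alpha-compositing form of volume rendering rather than a formula quoted from this disclosure.

```python
import math


def composite_ray(sigmas, deltas, colors):
    """Alpha-composite per-sample colors along one ray."""
    transmittance = 1.0
    pixel = [0.0, 0.0, 0.0]
    for sigma, delta, color in zip(sigmas, deltas, colors):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this segment
        weight = transmittance * alpha
        pixel = [p + weight * c for p, c in zip(pixel, color)]
        transmittance *= 1.0 - alpha  # light remaining after this segment
    return pixel


def unit_direction(origin, point):
    """Unit vector pointing from origin to point."""
    diff = [p - o for p, o in zip(point, origin)]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        raise ValueError("sample point coincides with camera origin")
    return [d / norm for d in diff]


def view_substituted_pixel(points, deltas, target_origin, density_fn, color_fn):
    """Render one reference-view pixel using target-view colors.

    points: 3D samples along the reference ray (geometry of the reference view).
    deltas: segment lengths between consecutive samples.
    target_origin: camera center of the target viewpoint (assumed input).
    """
    sigmas = [density_fn(p) for p in points]
    # Colors are evaluated as seen from the target camera, not the reference one.
    colors = [color_fn(p, unit_direction(target_origin, p)) for p in points]
    return composite_ray(sigmas, deltas, colors)
```

Under this reading, the third loss would compare such pixels against reference-image colors, so that geometry stays tied to the reference view while view-dependent effects follow the target view.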

According to an embodiment of the disclosure, the method may include additionally updating the training of the neural radiance field with a fourth loss, wherein the fourth loss is associated with dis-occluded pixels in each target image of the plurality of target images.
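The staged use of the first through fourth losses may be sketched as a simple schedule. The step thresholds `depth_start` and `view_sub_start`, the loss names, and the weighted sum are assumptions for illustration; the disclosure states only that training is initially performed with the first loss and then updated with the further losses.

```python
def active_losses(step, depth_start, view_sub_start, use_disocclusion=False):
    """Return the names of the losses applied at a given training step."""
    names = ["unmasked_color"]  # first loss: used from the start
    if step >= depth_start:
        names.append("masked_depth")  # second loss
    if step >= view_sub_start:
        names.append("view_substituted")  # third loss
        if use_disocclusion:
            names.append("disocclusion")  # optional fourth loss
    return names


def total_loss(values, weights, active):
    """Weighted sum of the currently active loss terms."""
    return sum(weights[name] * values[name] for name in active)


# Example with hypothetical thresholds: depth from step 100, view substitution from 200.
print(active_losses(150, depth_start=100, view_sub_start=200))
# ['unmasked_color', 'masked_depth']
```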

According to an embodiment of the disclosure, rendering an image with depth information may include obtaining image data that comprises a plurality of images that show a first scene from different viewpoints. The rendering may include, based on a first user input identifying a target object from one of the plurality of images, performing a first inpainting on the one of the plurality of images to obtain a reference image by applying a mask to the target object. The rendering may include inpainting a 3D scene into a neural radiance field (NeRF), based on the reference image, by adjusting a first size of the mask according to a second size of the target object in each of remaining images other than the one of the plurality of images to obtain a plurality of adjusted masks, and applying the plurality of adjusted masks to respective ones of the remaining images. The rendering may include, based on a second user input requesting a first image of the 3D scene seen from a requested view point, inputting the reference image and the requested view point to a neural radiance field (NeRF) model to provide the first image, wherein the first image corresponds to the 3D scene seen from the requested view point.

According to an embodiment of the disclosure, an apparatus may include one or more processors. The apparatus may include one or more memories, the one or more memories storing instructions configured to cause the apparatus to obtain a plurality of images from a user, wherein the plurality of images were acquired by the apparatus viewing a first scene and each of the plurality of images is associated with a corresponding viewpoint of the first scene. The apparatus may include one or more memories, the one or more memories storing instructions configured to cause the apparatus to obtain a first indication identifying a first image of the plurality of images, wherein the first image is associated with a first viewpoint of the first scene. The apparatus may include one or more memories, the one or more memories storing instructions configured to cause the apparatus to obtain a second indication of a first object to be removed from the first image. The apparatus may include one or more memories, the one or more memories storing instructions configured to cause the apparatus to remove the first object from the first image to obtain a reference image. The apparatus may include one or more memories, the one or more memories storing instructions configured to cause the apparatus to obtain a third indication of a second viewpoint from the user, wherein the second viewpoint does not correspond to any of the plurality of images. The apparatus may include one or more memories, the one or more memories storing instructions configured to cause the apparatus to render, using a neural radiance field (NeRF), a second image that corresponds to a 3D scene as seen from the second viewpoint, wherein the 3D scene has been inpainted into the NeRF. The apparatus may include one or more memories, the one or more memories storing instructions configured to cause the apparatus to display the second image on a display of the apparatus.

According to an embodiment of the disclosure, the apparatus may include the instructions, configured to cause the apparatus to remove the first object by performing a first inpainting on the first image by applying a mask to the first object, to obtain the reference image. The apparatus may include the instructions, configured to cause the apparatus to inpaint the 3D scene into the NeRF in part by adjusting a first size of the mask according to a second size of the first object that appears in the second image and applying the mask with the adjusted size to the second image. The apparatus may include the instructions, configured to cause the apparatus to, based on a user input requesting an image of the 3D scene seen from the second viewpoint, input the reference image and information of the second viewpoint to the NeRF to provide the second image corresponding to the 3D scene seen from the second viewpoint.

According to an embodiment of the disclosure, the apparatus may include the instructions, configured to cause the apparatus to obtain a fourth indication from the user, wherein the fourth indication is associated with a second object to be inpainted into the first image. The apparatus may include the instructions, configured to cause the apparatus to update, before a training of the NeRF, the first image to remove the first object from the first image by using a mask, wherein the first image includes an unmasked portion and a masked portion. The apparatus may include the instructions, configured to cause the apparatus to train the NeRF after the first object is removed from the first image.

According to an embodiment of the disclosure, the apparatus may be a mobile device.

According to an embodiment of the disclosure, the apparatus may include the instructions, configured to cause the apparatus to obtain the NeRF from a server after a training of the NeRF, wherein the NeRF has been trained at the server.

According to an aspect of the present disclosure, a computer-readable storage medium storing instructions is provided. The instructions, when executed by at least one processor, may cause the at least one processor to obtain a plurality of images from a user, wherein the plurality of images were acquired by an electronic device viewing a first scene and each of the plurality of images is associated with a corresponding viewpoint of the first scene. The instructions, when executed by at least one processor, may cause the at least one processor to obtain a first indication identifying a first image of the plurality of images, wherein the first image is associated with a first viewpoint of the first scene. The instructions, when executed by at least one processor, may cause the at least one processor to obtain a second indication of a first object to be removed from the first image. The instructions, when executed by at least one processor, may cause the at least one processor to remove the first object from the first image to obtain a reference image. The instructions, when executed by at least one processor, may cause the at least one processor to obtain a third indication of a second viewpoint from the user, wherein the second viewpoint is different from each of the respective viewpoints of the plurality of images. The instructions, when executed by at least one processor, may cause the at least one processor to render, using a neural radiance field (NeRF), a second image that corresponds to a 3D scene as seen from the second viewpoint, wherein the 3D scene has been inpainted into the NeRF. The instructions, when executed by at least one processor, may cause the at least one processor to display the second image on a display of the electronic device.

Claims

A method comprising:

obtaining a plurality of images from a user, wherein the plurality of images were acquired by an electronic device viewing a first scene and each of the plurality of images is associated with a corresponding viewpoint of the first scene;

obtaining a first indication identifying a first image of the plurality of images, wherein the first image is associated with a first viewpoint of the first scene;

obtaining a second indication of a first object to be removed from the first image;

removing the first object from the first image to obtain a reference image;

obtaining a third indication of a second viewpoint from the user, wherein the second viewpoint is different from each of the respective viewpoints of the plurality of images;

rendering, using a neural radiance field (NeRF), a second image that corresponds to a 3D scene as seen from the second viewpoint, wherein the 3D scene has been inpainted into the NeRF; and

displaying the second image on a display of the electronic device.
The method of claim 1, wherein the removing of the first object comprises performing a first inpainting on the first image by applying a mask to the first object, to obtain the reference image, wherein the method further comprises:

inpainting the 3D scene into the NeRF in part by adjusting a first size of the mask according to a second size of the first object that appears in the second image and applying the mask with the adjusted size to the second image; and

based on a user input requesting an image of the 3D scene seen from the second viewpoint, inputting the reference image, and information of the second viewpoint to the NeRF to provide the second image corresponding to the 3D scene seen from the second viewpoint.
The method of any one of claims 1 to 2, wherein the method further comprises:

obtaining a fourth indication from the user, wherein the fourth indication is associated with a second object to be inpainted into the first image;

updating, before training the NeRF, the first image to remove the first object from the first image by using a mask, wherein the first image includes an unmasked portion and a masked portion; and

training the NeRF after the first object is removed from the first image, wherein the NeRF is trained to output an inpainted 3D scene from an unobserved view point by accepting as input a reference inpainted view image that is obtained by selecting one of a plurality of views of a scene and applying a mask to inpaint an object into the reference view image.
The method of any one of claims 1 to 3, further comprising:

obtaining, after the displaying, a fifth indication from the user, wherein the fifth indication is a selection of a second object to be inpainted into the first image;

obtaining a second representative image by inpainting the second object into the first image;

updating the training of the NeRF based on the second representative image;

rendering, using the NeRF, a third image; and

displaying the third image on the display of the electronic device.
The method of any one of claims 1 to 4, wherein the training the NeRF is performed at a server, and

wherein the training the NeRF comprises training the NeRF, based on the reference image and the plurality of images, using a first loss associated with the unmasked portion.
The method of any one of claims 1 to 5, wherein the training the NeRF further comprises training the NeRF using a second loss based on the masked portion and an estimated depth, wherein the estimated depth is associated with a first geometry of the first scene in the masked portion.
The method of any one of claims 1 to 6, wherein the training the NeRF further comprises:

performing a view substitution of a target image to obtain a view substituted image, wherein the view substituted image comprises view dependent effects (VDEs) from a third viewpoint different from the first viewpoint associated with the first image, whereby view substituted colors are obtained associated with the third viewpoint, wherein a second geometry of the first scene underlying the view substituted image is that of the reference image, wherein the plurality of images comprises the target image and the target image is not the first image; and

training the NeRF using a third loss based on the view substituted colors.
The method of any one of claims 1 to 7, wherein the training the NeRF further comprises:

identifying a plurality of disoccluded pixels, wherein the plurality of disoccluded pixels are present in the target image and are associated with the second viewpoint;

determining a fourth loss, wherein the fourth loss is associated with a second inpainting of the plurality of disoccluded pixels of the target image; and

training the NeRF using the fourth loss.
The method of any one of claims 1 to 8, further comprising, when the first object is removed from the reference image using a first mask, and if a second size of the first object in other images differs from a first size in the reference image, adjusting proportionally mask sizes of respective masks in the other images proportionally to the respective object sizes of the first object in the other images.
The method of any one of claims 1 to 9, wherein rendering an image with depth information further comprises:

obtaining image data that comprises a plurality of images that show a first scene from different viewpoints;

based on a first user input identifying a target object from one of the plurality of images, performing a first inpainting on the one of the plurality of images to obtain a reference image by applying a mask to the target object;

inpainting a 3D scene into a neural radiance field (NeRF), based on the reference image, by adjusting a first size of the mask according to a second size of the target object in each of remaining images other than the one of the plurality of images to obtain a plurality of adjusted masks, and applying the plurality of adjusted masks to respective ones of the remaining images; and

based on a second user input requesting a first image of the 3D scene seen from a requested view point, inputting the reference image, and the requested view point to a neural radiance field (NeRF) model to provide the first image, wherein the first image corresponds to the 3D scene seen from the requested view point.
An apparatus comprising:

one or more processors; and

one or more memories, the one or more memories storing instructions configured to cause the apparatus to at least:

obtain a plurality of images from a user, wherein the plurality of images were acquired by the apparatus viewing a first scene and each of the plurality of images is associated with a corresponding viewpoint of the first scene;

obtain a first indication identifying a first image of the plurality of images, wherein the first image is associated with a first viewpoint of the first scene;

obtain a second indication of a first object to be removed from the first image;

remove the first object from the first image to obtain a reference image;

obtain a third indication of a second viewpoint from the user, wherein the second viewpoint does not correspond to any of the plurality of images;

render, using a neural radiance field (NeRF), a second image that corresponds to a 3D scene as seen from the second viewpoint, wherein the 3D scene has been inpainted into the NeRF; and

display the second image on a display of the apparatus.
The apparatus of claim 11, wherein the instructions are further configured to cause the apparatus to remove the first object by performing a first inpainting on the first image by applying a mask to the first object, to obtain the reference image, and wherein the instructions are further configured to cause the apparatus to:

inpaint the 3D scene into the NeRF in part by adjusting a first size of the mask according to a second size of the first object that appears in the second image and applying the mask with the adjusted size to the second image; and

based on a user input requesting an image of the 3D scene seen from the second viewpoint, input the reference image, and information of the second viewpoint to the NeRF to provide the second image corresponding to the 3D scene seen from the second viewpoint.
The apparatus of any one of claims 11 to 12, wherein the instructions are further configured to cause the apparatus to:

obtain a fourth indication from the user, wherein the fourth indication is associated with a second object to be inpainted into the first image;

update, before a training of the NeRF, the first image to remove the first object from the first image by using a mask, wherein the first image includes an unmasked portion and a masked portion; and

train the NeRF after the first object is removed from the first image.
The apparatus of any one of claims 11 to 13, wherein the instructions are further configured to cause the apparatus to:

obtain the NeRF from a server after a training of the NeRF, wherein the NeRF has been trained at the server.
A computer readable medium storing instructions, wherein the instructions, when executed by at least one processor, cause the at least one processor to perform the method of any one of claims 1 to 10.