WO2020174215A1 - Joint shape and texture decoders for three-dimensional rendering - Google Patents
Joint shape and texture decoders for three-dimensional rendering
- Publication number
- WO2020174215A1 (PCT/GB2020/050372, GB2020050372W)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- visual data
- neural network
- mesh
- shape
- dimensional
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T17/00—Three-dimensional [3D] modelling for computer graphics
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T15/00—Three-dimensional [3D] image rendering
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/088—Non-supervised learning, e.g. competitive learning
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20081—Training; Learning
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20084—Artificial neural networks [ANN]
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/30—Subject of image; Context of image processing
- G06T2207/30196—Human being; Person
- G06T2207/30201—Face
Definitions
- This specification relates to methods of rendering three dimensional visual data.
- In particular, this specification relates to methods of training and using a mesh decoder neural network for shape and texture modelling in three-dimensional rendering.
- Three-dimensional morphable models are a class of statistical models that are used to represent facial texture and shape variations, and are used to reconstruct three-dimensional faces from two-dimensional facial images.
- Current methods used to generate three-dimensional morphable models typically use large neural networks, with a large number of parameters, and are thus slow and inefficient to use and train.
Summary
- a method of training a mesh decoder neural network for shape and texture modelling in three-dimensional rendering comprising training the mesh decoder neural network on a set of visual data to generate a shape and texture map of the input visual data, wherein the training comprises: generating embedding parameters of input visual data from the set of visual data, the embedding parameters representing shape and texture of the input visual data; applying the mesh decoder neural network to the embedding parameters to generate a shape and texture map of the input visual data, wherein the mesh decoder neural network comprises one or more geometric convolutional layers; and updating parameters of the mesh decoder based on a comparison of the input facial visual data with output visual data derived from the generated shape and texture map.
- the mesh decoder may comprise a plurality of geometric convolutional layers and a plurality of upscaling layers.
- the plurality of upscaling layers may be interlaced with the plurality of geometric convolutional layers.
- the set of visual data may comprise a set of known shape and texture maps and a set of two-dimensional images.
- the training may comprise jointly training the mesh decoder neural network on the set of known shape and texture maps and the set of two-dimensional images. Updating parameters of the mesh decoder may comprise using a joint loss function to compare the input facial visual data with the output three-dimensional visual data derived from the generated shape and texture map.
- the method may be iterated until a threshold condition is met.
- the threshold condition may be a threshold number of iterations.
- the threshold condition may be a threshold number of training epochs.
- the input visual data may comprise a known shape and texture map.
- generating the embedding parameters may comprise applying a mesh encoder neural network to the known shape and texture map.
- the mesh encoder may comprise one or more geometric convolutional layers.
- the mesh encoder may comprise a plurality of geometric convolutional layers and a plurality of down-sampling layers.
- the plurality of down-sampling layers may be interlaced with the plurality of geometric convolutional layers.
- the output three-dimensional visual data may comprise the generated shape and texture map.
- the method may further comprise updating parameters of the mesh encoder neural network based on a comparison of the input visual data with the output three-dimensional visual data derived from the generated shape and texture map.
- the input visual data may comprise a two-dimensional image.
- Generating the embedding parameters may comprise applying an image encoder neural network to the two-dimensional image.
- the image encoder neural network may comprise one or more convolutional layers.
- the image encoder neural network may further output a rendering parameter comprising one or more of: a camera position; a camera orientation; a camera upright direction; a field of view; and/or one or more lighting parameters.
- the output three-dimensional visual data may comprise a rendered three-dimensional image generated in dependence on the rendering parameter using a rendering model.
- the method may further comprise updating parameters of the image encoder neural network based on a comparison of the input visual data with the output three-dimensional data derived from the generated shape and texture map.
- the visual data is facial visual data.
- the facial visual data may comprise a two-dimensional image comprising a part or a whole of a face.
- the facial visual data may comprise a shape and texture map comprising a part or a whole of a face.
- a method of generating a three-dimensional shape and texture map from input two dimensional visual data comprising: generating embedding parameters of the input two-dimensional visual data, the embedding parameters representing shape and texture of the input two-dimensional visual data; and applying a mesh decoder neural network to the embedding parameters to generate a shape and texture map of the input two dimensional visual data, wherein the mesh decoder neural network comprises one or more geometric convolutional layers, wherein parameters of the mesh decoder neural network have been determined using any of the training methods described herein.
- apparatus comprising: one or more processors; and a memory, the memory comprising computer readable instructions which, when executed by the one or more processors, cause the apparatus to perform one or more of the methods described herein.
- a computer program product comprising computer readable instructions which, when executed by a computer, cause the computer to perform one or more of the methods described herein.
- visual data is preferably used to connote one or more of: an image; a part of an image; a shape map; a texture map; a joint shape and texture map; a frame of video; a part of a frame of video; a three dimensional model and/or a computer rendered image.
- the visual data may be a two-dimensional image, for example described by a plurality of pixels in a two-dimensional space, or a three-dimensional image, for example, described by a shape and texture map and/or a mesh.
- Figure 1 shows an overview of a system/method for shape and texture modelling in three-dimensional rendering
- Figure 2 shows an example flow chart of a method of training a mesh decoder neural network for generating a shape and texture map of input visual data
- Figure 3 shows an overview of an embodiment of a training method for a mesh decoder neural network
- Figure 4 shows an example of a structure for an image encoder neural network
- Figure 5 shows an example of a structure for a mesh encoder neural network
- Figure 6 shows an example of a structure for a mesh decoder neural network
- Figure 7 shows a schematic example of a system/apparatus for performing any of the methods described herein
- Figure 1 shows an overview of a system/method for shape and texture modelling in three-dimensional rendering.
- the method constructs a three dimensional shape and texture map 114 from a two-dimensional image 102 using a neural network.
- the neural network comprises a plurality of geometric/mesh convolutional layers.
- the two-dimensional image 102 is shown as facial visual data, though other types of visual data may alternatively or additionally be used.
- the two-dimensional image may comprise one or more parts of a human body 102a, such as, for example, a hand 102b.
- the shape and texture mesh 114 corresponds to one or more parts of the input visual data 102.
- the shape and texture map 114 is a facial shape and texture map, though other types of shape and texture map may alternatively or additionally be output.
- the shape and texture map may comprise one or more parts of a human body 102a, such as, for example, a hand 102b.
- Other examples of two-dimensional images 102 and shape and texture maps 114 may be used.
- geometrical convolutional layers can allow the mesh decoder neural network to be a lightweight model (i.e. described with a low number of parameters).
- the use of geometrical convolutional layers can also allow the shape and texture to be processed jointly, which can result in more accurate shape and texture models (where, for example, accuracy may be defined in terms of a reconstruction loss).
- the runtime of the trained models may be reduced when compared to current three dimensional modelling methods. While the methods described herein are suitable for shape and texture modelling of objects in general, the use of the methods when modelling facial visual data can result in one or more further advantages.
- the system 100 comprises an encoder 104 configured to take two-dimensional visual data 102 as input.
- the encoder 104 is configured to process the visual data 102 and output embedding parameters, $f_{SA}$, 106 that represent the shape and texture of the visual data 102.
- the embedding parameters form a low dimensional representation of features of the two-dimensional visual data 102.
- the encoder 104 may, for example, be a neural network, such as a convolutional neural network.
- the visual data undergoes a pre-processing step, in which a feature detector is applied to the visual data 102 to isolate relevant parts of the visual data 102.
- a feature detector may be applied to an input image to isolate one or more objects of a given type within the image, for example faces.
- the encoder 104 may also be configured to output a set of rendering parameters 108 that represent properties of a camera used to capture the visual data 102. Examples of such rendering parameters include camera position, camera orientation and camera "up" direction in a coordinate system 110.
- the system further comprises a mesh decoder neural network 112.
- the mesh decoder neural network 112 comprises a plurality of layers of nodes, each node associated with one or more parameters.
- the parameters of each node of the neural network may comprise one or more weights and/or biases.
- the nodes take as input one or more outputs of nodes in the previous layer.
- the one or more outputs of nodes in the previous layer are used by the node to generate an activation value using an activation function and the parameters of the neural network.
- the mesh decoder neural network 112 comprises one or more geometric convolutional layers (also referred to herein as mesh convolutional layers).
- a geometric convolutional layer is a type of convolutional filter layer of the neural network used in geometric deep learning that can be applied directly in the mesh domain.
- the mesh decoder 112 is configured to take as input the embedding parameters 106 and to process the input using a series of neural network layers to output a shape map and a texture map 114 of the two dimensional image 102.
- the shape and texture map 114 comprises three dimensional shape and texture information relating to the input visual data 102.
- the output is a set of parameters describing (x, y, z) coordinates of points in a mesh and corresponding (r, g, b) values for each point in the mesh.
- the output may, in some embodiments, be a coloured mesh within a unit sphere.
- the joint shape and texture map 114 may undergo further processing (not shown) to generate a rendered representation of the input two-dimensional visual data 102.
- the representation may comprise a rendered three-dimensional model of the input two-dimensional image 102.
- the representation may comprise a two-dimensional projection of the joint shape and texture map 114 onto an image plane.
- the set of rendering parameters 108 may be used in the further processing to render the representation of the two-dimensional visual data, for example to define the
- a mesh may be defined in terms of an undirected connected graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V} \in \mathbb{R}^{n \times 6}$ is a set of n vertices containing joint shape (e.g. (x, y, z)) and texture (e.g. (r, g, b)) information and $\mathcal{E} \in \{0,1\}^{n \times n}$ is an adjacency matrix defining the connection status between vertices.
- alternative representations of a mesh may be used.
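- As an illustration of the graph representation described above, the following minimal sketch builds a toy mesh as an n-by-6 vertex array and a binary adjacency matrix. The tetrahedron, the names `vertices`, `edges` and `adjacency_matrix`, and the use of NumPy are assumptions made for the example and do not come from the specification.

```python
import numpy as np

# Hypothetical toy mesh: a tetrahedron with 4 vertices and 6 edges.
# Each row holds one vertex as (x, y, z, r, g, b).
vertices = np.array([
    [0.0, 0.0, 0.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 1.0, 1.0, 1.0],
])
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]


def adjacency_matrix(n, edges):
    """Build a symmetric n x n matrix with 1 where two vertices are connected."""
    adjacency = np.zeros((n, n), dtype=np.int64)
    for i, j in edges:
        adjacency[i, j] = 1
        adjacency[j, i] = 1  # undirected graph: mirror every edge
    return adjacency


adjacency = adjacency_matrix(len(vertices), edges)
shape = vertices[:, :3]    # (x, y, z) per vertex
texture = vertices[:, 3:]  # (r, g, b) per vertex
```

- Storing shape and texture in the same per-vertex row is what lets a single mesh convolution process both jointly.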
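- a convolutional operator on a graph/mesh can be defined by formulating mesh filtering with a kernel $g_\theta$ using a recursive polynomial, such as a Chebyshev polynomial.
- to define a convolutional operator on a mesh it is useful to define some intermediate variables in the following way: the normalised graph Laplacian is $L = I_n - D^{-1/2} \mathcal{E} D^{-1/2}$, where $\mathcal{E}$ is the adjacency matrix, $D$ is the diagonal degree matrix with $D_{ii} = \sum_j \mathcal{E}_{ij}$, and $I_n$ is the identity matrix.
- the Laplacian can be diagonalised by the Fourier bases $U$ such that $L = U \Lambda U^T$, where $\Lambda$ is the diagonal matrix of eigenvalues.
- the graph Fourier representation of a given mesh $x$ can then be defined as $\hat{x} = U^T x$, with inverse $x = U \hat{x}$.
- an example of a convolutional operator, $g_\theta$, on a graph/mesh can be defined.
- the filter $g_\theta$ can be parameterised as a truncated Chebyshev polynomial expansion of order K as $g_\theta(\Lambda) = \sum_{k=0}^{K-1} \theta_k T_k(\tilde{\Lambda})$, where $\theta_k$ are the Chebyshev coefficients, $\tilde{\Lambda} = 2\Lambda/\lambda_{max} - I_n$ is the rescaled eigenvalue matrix, and $T_k(z) = 2 z T_{k-1}(z) - T_{k-2}(z)$ with $T_0 = 1$ and $T_1 = z$.
- a spectral convolution can then be defined as $y = g_\theta(L)\,x = \sum_{k=0}^{K-1} \theta_k T_k(\tilde{L})\,x$, with $\tilde{L} = 2L/\lambda_{max} - I_n$.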
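- A Chebyshev mesh convolution of the kind described above can be sketched in a few lines of NumPy. This is a minimal sketch of the standard Chebyshev graph convolution, not the implementation of the specification: the function names, the dense matrices and the coefficient layout `theta[k]` of shape (F_in, F_out) are assumptions.

```python
import numpy as np


def scaled_laplacian(adjacency):
    """Normalised Laplacian I - D^{-1/2} A D^{-1/2}, rescaled so its spectrum lies in [-1, 1]."""
    a = adjacency.astype(float)
    degree = a.sum(axis=1)
    d_inv_sqrt = np.where(degree > 0, 1.0 / np.sqrt(np.maximum(degree, 1e-12)), 0.0)
    laplacian = np.eye(len(a)) - d_inv_sqrt[:, None] * a * d_inv_sqrt[None, :]
    lam_max = np.linalg.eigvalsh(laplacian).max()
    return 2.0 * laplacian / lam_max - np.eye(len(a))


def chebyshev_conv(x, l_tilde, theta):
    """Apply y = sum_k T_k(l_tilde) x theta[k].

    x: (n, F_in) per-vertex features, theta: (K, F_in, F_out) coefficients.
    """
    order = theta.shape[0]
    t_prev, t_curr = x, l_tilde @ x          # T_0 x and T_1 x
    out = t_prev @ theta[0]
    if order > 1:
        out = out + t_curr @ theta[1]
    for k in range(2, order):
        # Chebyshev recurrence: T_k = 2 L T_{k-1} - T_{k-2}
        t_prev, t_curr = t_curr, 2.0 * (l_tilde @ t_curr) - t_prev
        out = out + t_curr @ theta[k]
    return out
```

- The recurrence avoids an explicit eigendecomposition at inference time; only repeated multiplication by the (typically sparse) Laplacian is needed, which is one reason such layers can stay lightweight.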
- FIG. 2 shows an example flow chart of a method of training a mesh decoder neural network for generating a shape and texture map of input visual data.
- the mesh decoder neural network is jointly trained on two dimensional visual data and three-dimensional shape and texture maps.
- the mesh decoder neural network is trained on input two-dimensional visual data.
- the mesh decoder neural network is trained on input shape and texture maps.
- a set of embedding parameters are generated from input visual data selected from a set of training visual data.
- the embedding parameters form a low-dimensional representation of shape and texture information of the input visual data.
- the embedding parameters may be represented as an N-dimensional vector. N may, for example, be sixty-four, one-hundred and twenty-eight or two-hundred and fifty-six.
- the input visual data may comprise a two dimensional image, for example of a face.
- An image encoder neural network may be used to generate a set of embedding parameters.
- the image encoder neural network may comprise a plurality of convolutional layers.
- An example of an image encoder neural network is described below in relation to Figure 4.
- the input visual data may alternatively or additionally comprise a three dimensional shape and texture map, for example of a face.
- the three dimensional shape and texture map may be a mesh comprising a set of vertices, each associated with a position (x, y, z) and RGB texture value (r, g, b).
- a mesh encoder neural network may be used to generate the set of embedding parameters.
- the mesh encoder neural network may comprise a plurality of geometrical convolutional layers.
- the mesh encoder neural network may comprise one or more downsampling layers interlaced with the geometrical convolutional layers.
- the mesh encoder neural network may comprise a plurality of mesh downsampling layers.
- the mesh encoder neural network may comprise one or more fully connected layers. An example of a mesh encoder neural network is described below in relation to Figure 5.
- The combination of the mesh encoder and mesh decoder may form an autoencoder.
- a set of rendering parameters may also be generated from the input visual data.
- the rendering parameters may comprise one or more of: a camera position; a camera orientation; and/or one or more lighting parameters, where the camera is the camera that captured the image.
- the rendering parameters may be generated, for example, by the image encoder neural network. Alternatively, they may be generated by a separate method, such as a separate neural network trained to generate the rendering parameters using input two-dimensional images with known rendering parameters.
- a mesh decoder neural network is applied to the embedding parameters to generate a shape and texture map of the input visual data.
- the mesh decoder neural network comprises a plurality of geometric convolutional layers.
- the mesh decoder neural network may comprise one or more upsampling layers interlaced with the geometric convolutional layers.
- the mesh decoder neural network comprises a plurality of upscaling layers.
- the mesh decoder may comprise one or more fully connected layers.
- parameters of the mesh decoder neural network are updated based on a comparison of the input visual data to output visual data derived from the generated shape and texture map.
- the parameters of the mesh decoder neural network may comprise, for example, the weights and biases of nodes in the mesh decoder neural network.
- the parameters of the mesh decoder neural network may comprise, for example, Chebyshev coefficients of a geometric convolution, as described above.
- Output visual data derived from the generated shape and texture map may, in some embodiments, comprise a two-dimensional image derived from the shape and texture map.
- the two dimensional image may be derived from the shape and texture map using a renderer that projects the output shape and texture map to an image plane.
- Rendering parameters generated from the input visual data may be used to perform the rendering.
- the two dimensional image may be compared to a corresponding input two dimensional visual data that was input to an image encoder.
- the comparison may be performed using a loss function, such as a reconstruction loss function.
- Output visual data derived from the generated shape and texture map may, in some embodiments, comprise the shape and texture map itself.
- the output shape and texture map may be compared to a corresponding input shape and texture map using a loss function. For example, an autoencoder loss function may be used to perform the comparison.
- a joint loss function may combine an autoencoder loss function with a rendering loss function.
- a different data source may be used for the two-dimensional visual data and the known (input) shape and texture maps.
- Updating the parameters of the mesh decoder neural network may comprise applying an optimisation procedure to a loss function used to make the comparison.
- the aim of the optimisation procedure may be to maximise/minimise the loss function to within a threshold value, depending on the definition of the loss function.
- An example of such an optimisation procedure is gradient descent/ascent.
- Backpropagation may be used to determine gradients of the loss function with respect to the parameters of the mesh decoder neural network, and updated parameters determined based on the determined gradients.
- Other optimisation procedures may alternatively be used, such as gradient free methods.
- Parameters of the image encoder neural network and/or mesh encoder neural network may also be updated based on the comparisons.
- the same methods used to update the mesh decoder neural network may be used to update parameters of the mesh encoder neural network and/or the image encoder neural network.
- the training method may be iterated until a threshold condition is met.
- the threshold condition may be a threshold number of training epochs.
- the threshold number of epochs may be between one-hundred and three-hundred epochs, such as two-hundred epochs.
- a different learning rate may be used at each epoch.
- the learning rate may decay by a predetermined factor after each epoch.
- FIG. 3 shows an overview of an embodiment of a training method for a mesh decoder neural network.
- the training method 300 comprises training a mesh decoder neural network 112 using training two-dimensional visual data 302 and training three dimensional visual data 304.
- the joint training may be described as comprising two elements.
- One element comprises an "under-control" joint shape and texture autoencoder training process 306, indicated by the process below the dashed line in Figure 3.
- This trains a mesh encoder neural network 310 and the mesh decoder neural network 112 together as an autoencoder process.
- the other element comprises a self-supervised training process 308, indicated by the process above the dashed line in Figure 3.
- This trains an image encoder neural network 316 and the mesh decoder neural network 112 together.
- These processes 306, 308 share a mesh decoder 112.
- the joint shape and texture autoencoder training process 306 and the self-supervised training process 308 may be trained end-to-end jointly. Each process may use a different source of training data 302, 304.
- the joint shape and texture autoencoder training process 306 takes as input a known shape and texture map 304 from a training set of known shape and texture maps.
- a mesh encoder neural network 310 is applied to the input shape and texture map 304 to generate a set of embedding parameters 106a, $f_{SA}$, that form a low dimensional representation of features of the shape and texture map.
- the action of the mesh encoder neural network 310 on an input shape and texture map 304, X, can be described symbolically by: $f_{SA} = E_M(X; \theta_{EM})$, where $\theta_{EM}$ are the parameters of the mesh encoder neural network.
- the mesh decoder neural network 112 is then applied to the embedding parameters 106a to generate an output joint shape and texture map 312a.
- the action of the mesh decoder 112 on the embedding parameters 106 may be described symbolically by: $Y = D(f_{SA}; \theta_D)$, where Y is the output shape and texture map and $\theta_D$ are the parameters of the mesh decoder neural network.
- the output shape and texture map 312a, Y is compared to the input shape and texture map 304, X, to determine parameter updates for the mesh decoder neural network 112 and/or the mesh encoder neural network 310.
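- An autoencoder loss 314 may be used to perform the comparison.
- An example of such a function is given by: $L_{AE} = \sum_i \left( \| S_i - \hat{S}_i \|_n + \| A_i - \hat{A}_i \|_m \right)$
- $S_i$ and $A_i$ denote the shape part and the texture part of the input shape and texture map 304, and $\hat{S}_i$ and $\hat{A}_i$ denote the corresponding parts of the output shape and texture map 312a.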
- the sum is taken over an ensemble of training data comprising one or more training examples that is indexed by the label i.
- the respective norms, n and m, of the shape part and the texture part of the loss function may be the same. Alternatively, different norms may be used for the shape part and the texture part of the loss function.
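- The structure of such an autoencoder loss, with a separate norm for the shape part and the texture part, can be sketched as follows. The function name, the (batch, vertices, 6) array layout and the default choice of an L2 norm for shape and an L1 norm for texture are assumptions made for illustration; the specification only states that the two norms may be the same or different.

```python
import numpy as np


def autoencoder_loss(x_in, x_out, shape_norm=2, texture_norm=1):
    """Sum over training examples of shape error plus texture error.

    x_in, x_out: arrays of shape (batch, n_vertices, 6) holding (x, y, z, r, g, b).
    shape_norm, texture_norm: orders of the vector norms (assumed defaults).
    """
    loss = 0.0
    for inp, out in zip(x_in, x_out):
        shape_err = (inp[:, :3] - out[:, :3]).ravel()
        texture_err = (inp[:, 3:] - out[:, 3:]).ravel()
        loss += np.linalg.norm(shape_err, ord=shape_norm)
        loss += np.linalg.norm(texture_err, ord=texture_norm)
    return loss
```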
- parameters of the mesh decoder neural network 112 and/or the mesh encoder neural network 310 may be updated.
- An optimisation procedure, such as gradient descent, may be applied to the autoencoder loss to determine the parameter updates.
- the self-supervised training process 308 takes as input a two-dimensional image 302 from a set of two-dimensional images.
- the set of two dimensional images may comprise "in-the-wild" images.
- One or more random perturbations may be applied to a given input image 302 during the training. For example, one or more of: a random rotation (e.g. ±30 degrees); a random flipping; a random scaling; and/or a random cropping may be applied.
- the input image 302 is input into an image encoder neural network 316.
- the image encoder neural network generates a set of embedding parameters 106b, $f_{SA}$, from the input image 302.
- the set of embedding parameters 106b forms a low dimensional representation of features of the input image 302.
- the action of the image encoder neural network 316 on an input image 302, I, can be described symbolically by: $f_{SA} = E_I(I; \theta_{EI})$, where $\theta_{EI}$ are the parameters of the image encoder neural network.
- the image encoder neural network is further configured to output one or more rendering parameters 108.
- the rendering parameters 108 may comprise one or more of: a camera position; a camera orientation; and/or one or more lighting parameters, the camera being the camera used to capture the input image 302.
- the camera position, camera orientation; and/or one or more lighting parameters may be defined in terms of a coordinate system 110.
- the mesh decoder neural network 112 is then applied to the embedding parameters 106b to generate an output joint shape and texture map 312b. As with the joint shape and texture autoencoder training process 306, the action of the mesh decoder 112 on the embedding parameters 106 may be described symbolically by: $Y = D(f_{SA}; \theta_D)$, where Y is the output shape and texture map and $\theta_D$ are the parameters of the mesh decoder neural network.
- the output shape and texture map 312b may be converted into an output two-dimensional image 314 in an image plane.
- the output shape and texture map 312b may be in the form of a coloured mesh within a unit sphere.
- the camera model may project the three dimensional shape and texture mesh 312b from object-centred Cartesian coordinates to an image plane in the same coordinates 318 and a renderer 320 may be used to render the image to form the output two dimensional image.
- the rendering parameters 108 may be used in this projection.
- a pinhole camera model is used to perform the projection onto the image plane.
- the pinhole camera model may utilise a perspective transformation model.
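- A pinhole projection with a perspective transformation can be sketched as below, using a camera position, a look-at target, an "up" direction and a field of view. This is a minimal sketch under assumed conventions (camera looking along its +z axis, normalised image coordinates, hypothetical function names `look_at` and `project`); it is not the parameterisation used in the specification.

```python
import numpy as np


def look_at(cam_pos, target, up):
    """World-to-camera rotation R and translation t for a camera at cam_pos."""
    forward = target - cam_pos
    forward = forward / np.linalg.norm(forward)
    right = np.cross(up, forward)
    right = right / np.linalg.norm(right)
    true_up = np.cross(forward, right)
    rotation = np.stack([right, true_up, forward])  # rows are the camera axes
    return rotation, -rotation @ cam_pos


def project(points, cam_pos, target, up, fov_deg):
    """Project (N, 3) world points to (N, 2) normalised image-plane coordinates."""
    rotation, translation = look_at(cam_pos, target, up)
    cam = points @ rotation.T + translation        # camera coordinates
    focal = 1.0 / np.tan(np.radians(fov_deg) / 2.0)
    return focal * cam[:, :2] / cam[:, 2:3]        # perspective divide by depth
```

- For a mesh inside a unit sphere, placing the camera outside the sphere keeps every depth positive, so the perspective divide is well defined.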
- the parameters of the projection model, c can be described as:
- the renderer may be a differentiable renderer, allowing the whole pipeline to be trained end-to-end.
- the renderer (also referred to as a rasteriser) generates barycentric coordinates and corresponding triangle IDs for each pixel at the image plane.
- the rasterising procedure may involve Phong shading and/or interpolating colours and normals between vertices according to the barycentric coordinates.
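- The per-pixel interpolation step can be illustrated with a small sketch that computes the barycentric coordinates of a pixel inside a projected triangle and uses them to blend the vertex colours. The function names and the 2D triangle are assumptions for the example; a real rasteriser would also perform visibility tests and shading.

```python
import numpy as np


def barycentric(p, a, b, c):
    """Barycentric coordinates of the 2D point p with respect to triangle (a, b, c)."""
    edges = np.column_stack([b - a, c - a])
    w_b, w_c = np.linalg.solve(edges, p - a)
    return np.array([1.0 - w_b - w_c, w_b, w_c])


def interpolate_colour(p, tri_xy, tri_rgb):
    """Blend the three vertex colours using the barycentric weights of p."""
    weights = barycentric(p, *tri_xy)
    return weights @ tri_rgb
```

- Because the interpolated colour is a linear function of the vertex colours, gradients of an image loss can flow back to the per-vertex texture values.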
- the camera projections and lighting may be computed in-graph.
- the loss may be back-propagated through the rasteriser, preventing rendering becoming a bottleneck for the training.
- the output image 314, Z may be described in terms of the projection and rendering operations as:
- the output image 314, Z, is compared to the input image 302, I, to determine parameter updates for the mesh decoder neural network 112 and/or the image encoder neural network 316.
- a reconstruction loss 322 may be used to perform the comparison.
- An example of such a function is given by:
- parameters of the mesh decoder neural network 112 and/or the image encoder neural network 316 may be updated.
- An optimisation procedure, such as gradient descent, may be applied to the reconstruction loss to determine the parameter updates.
- the parameters $\theta_D$, $\theta_{EI}$ and m may be the targets of the optimisation procedure.
- the reconstruction loss 322 and autoencoder loss 314 may be combined into a joint loss function, V, using a hyperparameter controlling the relative importance of the self-supervised training process 308 and the joint shape and texture autoencoder training process 306. The hyperparameter may be varied during the training process. For example, it may start at an initial value (e.g. 0.01) and gradually increase to a final value (e.g. 1.0) during the training process.
- Both the "in the wild" self-supervised model (i.e. the part trained during the self-supervised training process 308) and the under control shape and texture autoencoder (the part trained during the joint shape and texture autoencoder training process 306) may be trained end-to-end jointly while using a different data source.
- Both models may have the same learning rate for the optimisation procedure. This may be initialised at an initial value (e.g. 1e-4).
- a learning decay may be applied after each training epoch (e.g. a decay with a rate of 0.98).
- the model is trained for at least two-hundred epochs.
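- The training schedule described above (an initial learning rate decayed after each epoch, and a loss-weighting hyperparameter that increases during training) can be sketched as two small functions. The example values 1e-4, 0.98, 0.01, 1.0 and two hundred epochs come from the text; the exponential form of the decay and the linear shape of the ramp are assumptions, as are the function names.

```python
def learning_rate(epoch, initial=1e-4, decay=0.98):
    """Learning rate after `epoch` completed epochs, decayed by a fixed factor each epoch."""
    return initial * decay ** epoch


def loss_weight(epoch, num_epochs=200, start=0.01, end=1.0):
    """Loss-weighting hyperparameter, ramped linearly (an assumption) from start to end."""
    fraction = min(epoch / max(num_epochs - 1, 1), 1.0)
    return start + fraction * (end - start)


# Print a few points of the schedule for inspection.
for epoch in (0, 50, 100, 199):
    print(epoch, learning_rate(epoch), loss_weight(epoch))
```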
- the training datasets may comprise a training dataset of two-dimensional visual data.
- the two-dimensional visual data may comprise a set of two-dimensional facial images. Examples of such a dataset include the CelebA dataset and the 300W-LP dataset.
- the training datasets may comprise a training dataset of three-dimensional visual data.
- the three-dimensional visual data may comprise a set of three-dimensional facial data, such as three-dimensional facial scans. Examples of such a dataset include the MeIn3D dataset and the AFLW2000-3D dataset.
- Training visual data taken from the training datasets may be cropped into bounding boxes of given landmarks, such as facial landmarks. Random perturbations may be applied to simulate a coarse detector.
- training data with a given attribute (e.g. a plurality of facial images with a given facial expression, such as a smile) can be fed into the trained encoder networks in order to generate a set of embedding parameters
- a mean shape and texture model of visual data with the given attribute can be generated. Based on principal component analysis on the set of embedding parameters one or more variables may be identified
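- The attribute analysis outlined above can be sketched as computing the mean embedding for a set of samples sharing an attribute and running principal component analysis on the embeddings. This is a minimal sketch with a hypothetical function name; how the resulting mean and directions would be decoded or used is not specified here and is left out.

```python
import numpy as np


def attribute_model(embeddings, n_components=2):
    """Mean embedding and leading principal directions of a set of embeddings.

    embeddings: (num_samples, N) array of embedding vectors for one attribute.
    Returns (mean, components, variances) with components of shape (n_components, N).
    """
    mean = embeddings.mean(axis=0)
    centred = embeddings - mean
    # SVD of the centred data gives the principal directions in the rows of vt.
    _, singular_values, vt = np.linalg.svd(centred, full_matrices=False)
    variances = singular_values ** 2 / (len(embeddings) - 1)
    return mean, vt[:n_components], variances[:n_components]
```

- Applying the mesh decoder to the mean embedding would give a mean shape and texture for the attribute, and moving along a principal direction before decoding would vary it.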
- Figure 4 shows an example of a structure for an image encoder neural network.
- the image encoder neural network 400 takes as input a two dimensional image 302 and outputs embedding parameters 106 that represent the shape and texture of the input image 302.
- the output may further comprise one or more rendering parameters (not shown).
- the image encoder neural network 400 comprises a plurality of convolutional layers 402.
- Each convolutional layer comprises a convolutional filter.
- the convolutional filter has a filter size. In some embodiments, the filter size is a three-by-three convolutional filter. Other filter sizes are possible.
- the convolutional filter of each layer is also associated with a stride. In convolutional layers 402 with a stride equal to one, the output of said convolutional layer is the same size as the input layer. In convolutional layers 402 with a stride greater than one, the output of said convolutional layer has a smaller size than the input layer (i.e. it is downsampled).
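- The effect of the stride on the spatial size can be checked with the standard convolution output-size formula. The sketch below assumes a three-by-three filter with a padding of one pixel, which is what makes a stride of one preserve the size; the padding value and the function name are assumptions, since the text does not state them.

```python
def conv_output_size(size, kernel=3, stride=1, padding=1):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


# Alternating stride-1 and stride-2 layers on a 112-pixel input (illustrative only).
size = 112
for stride in (1, 2, 1, 2):
    size = conv_output_size(size, stride=stride)
    print(stride, size)
```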
- the image encoder neural network may alternate between convolutional filters with a stride of one, and convolutional filters with a stride of greater than one.
- the image encoder neural network 400 comprises ten convolutional layers. It may, however, have greater or fewer convolutional layers.
- one or more of the convolutional layers 402 are followed by a batch normalisation layer and/or a ReLU activation function (not shown). Each convolutional layer 402 may be followed by a batch normalisation layer and/or a ReLU activation function in this way.
- the image encoder neural network 400 may comprise one or more fully connected layers 404.
- the final layer of the image encoder neural network 400 i.e. the layer that outputs the embedding 106 and, optionally, the rendering parameters
- An example structure of an image encoder neural network 400 is given by:
- “Conv” indicates a convolutional layer 402 and “FC” indicates a fully connected layer 404.
- The input comprises the RGB channels of a 112x112 pixel image, though other image sizes may be used.
- The output is a 256 component embedding vector 106 and a twenty-two component rendering vector 108. Other sizes of embedding vector 106 and rendering vector 108 may alternatively be used.
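The effect of stride on feature-map size can be checked with the standard convolution output-size formula. The sketch below is illustrative only: a padding of one pixel and a strict stride-one, stride-two alternation starting with stride one are assumptions, since the example layer table is not reproduced here.

```python
def conv_output_size(size, kernel=3, stride=1, padding=1):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def encoder_spatial_sizes(input_size=112, num_layers=10, kernel=3, padding=1):
    """Feature-map sizes for layers alternating stride 1 and stride 2.

    The alternation pattern and padding are assumptions for illustration.
    """
    sizes = [input_size]
    for layer in range(num_layers):
        stride = 1 if layer % 2 == 0 else 2
        sizes.append(conv_output_size(sizes[-1], kernel, stride, padding))
    return sizes
```

With a three-by-three filter and one pixel of padding, a stride of one preserves the spatial size and a stride of two roughly halves it, which matches the behaviour described for the convolutional layers 402.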
- Figure 5 shows an example of a structure for a mesh encoder neural network.
- The mesh encoder neural network 500 takes as input a three-dimensional shape and texture mesh 304 and outputs embedding parameters 106 that represent the shape and texture of the input shape and texture mesh 304.
- The mesh encoder neural network 500 comprises a plurality of geometric convolutional layers 502.
- The geometric convolutional layers 502 are configured to take a mesh as input and to perform a geometric/mesh convolution on the mesh. For example, a geometric/mesh convolution as defined above in relation to Figure 1 may be performed.
- The geometric convolutional layers 502 may be followed by a ReLU activation function (not shown).
- The mesh down-sampling layers 504 are configured to take a mesh as input and to output a mesh with a reduced number of vertices.
- Mesh down-sampling layers 504 may alternate with geometric convolutional layers 502.
- Mesh down-sampling from a mesh of m vertices to a mesh of n vertices may be performed using a binary transform matrix $Q_d \in \{0,1\}^{n \times m}$.
- $Q_d$ may be determined by iteratively contracting vertex pairs under the constraint of minimising quadratic error.
- The barycentric coordinates of the discarded vertices with regard to the down-sampled mesh are stored. These can be used during up-sampling to add new vertices using the barycentric location information.
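To make the down-sampling step concrete, the following sketch builds a binary selection matrix for a given set of retained vertices and records barycentric coordinates for a discarded vertex. It does not implement the iterative pair contraction that decides which vertices to keep; the list of kept indices and the enclosing triangle are supplied by hand, and all names are hypothetical.

```python
def down_sampling_matrix(kept, m):
    """Binary n x m matrix with a single 1 per row selecting a kept vertex."""
    return [[1 if col == k else 0 for col in range(m)] for k in kept]


def apply_matrix(matrix, vertices):
    """Multiply a (rows x m) matrix by an (m x features) vertex array."""
    features = len(vertices[0])
    return [[sum(w * v[c] for w, v in zip(row, vertices))
             for c in range(features)] for row in matrix]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def barycentric(p, a, b, c):
    """Barycentric weights (for a, b, c) of p projected onto triangle abc.

    Raises ZeroDivisionError for a degenerate (zero-area) triangle.
    """
    v0 = [bi - ai for ai, bi in zip(a, b)]
    v1 = [ci - ai for ai, ci in zip(a, c)]
    v2 = [pi - ai for ai, pi in zip(a, p)]
    d00, d01, d11 = dot(v0, v0), dot(v0, v1), dot(v1, v1)
    d20, d21 = dot(v2, v0), dot(v2, v1)
    denom = d00 * d11 - d01 * d01
    v = (d11 * d20 - d01 * d21) / denom
    w = (d00 * d21 - d01 * d20) / denom
    return (1.0 - v - w, v, w)


# Four vertices; vertex 3 is discarded and lies inside triangle (0, 1, 2).
vertices = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0],
            [0.0, 1.0, 0.0], [0.25, 0.25, 0.0]]
q_d = down_sampling_matrix(kept=[0, 1, 2], m=4)
v_d = apply_matrix(q_d, vertices)
stored = barycentric(vertices[3], v_d[0], v_d[1], v_d[2])
```

Dense lists keep the example short; for a mesh with tens of thousands of vertices a sparse representation would be the natural choice.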
- The mesh encoder neural network 500 may comprise one or more fully connected layers 506.
- The final layer of the mesh encoder neural network 500 i.e. the layer that outputs the embedding 106
- An example structure of a mesh encoder neural network 500 is given by:
- The input comprises a joint shape and texture mesh 304 with 28431 vertices, each associated with six values - the (x, y, z) position (i.e. shape) of the vertex and the (r, g, b) values (i.e. texture) of the vertex.
- The down-sampling layers reduce the size of the mesh by a factor of four at each stage, though they may alternatively be configured to provide other down-sampling factors.
- The output is a 256 component embedding vector 106. Other sizes of embedding vector 106 may alternatively be used.
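The vertex counts implied by repeated four-fold down-sampling can be tabulated with a few lines of arithmetic. The number of stages and the rounding rule (rounding up) are assumptions made for this sketch; the text above fixes only the starting count of 28431 vertices, the six values per vertex, and the factor of four.

```python
def vertex_counts(num_vertices=28431, factor=4, stages=4):
    """Vertex count after each down-sampling stage, rounding up.

    `stages` and the ceiling rounding are hypothetical choices.
    """
    counts = [num_vertices]
    for _ in range(stages):
        counts.append(-(-counts[-1] // factor))  # integer ceiling division
    return counts


def input_values(num_vertices=28431, values_per_vertex=6):
    """Total scalar inputs: (x, y, z, r, g, b) for every vertex."""
    return num_vertices * values_per_vertex
```

Under these assumptions four stages take the mesh from 28431 vertices down to 112, at which point a fully connected layer 506 could map the remaining features to the embedding.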
- Figure 6 shows an example of a structure for a mesh decoder neural network.
- The mesh decoder neural network 600 takes as input an embedding vector 106 that represents the shape and texture of an input shape and texture mesh 304 or input two-dimensional image 302 and outputs a three-dimensional shape and texture mesh 312.
- The mesh decoder neural network 600 comprises a plurality of geometric convolutional layers 606.
- The geometric convolutional layers 606 are configured to take a mesh as input and to perform a geometric/mesh convolution on the mesh. For example, a geometric/mesh convolution as defined above in relation to Figure 1 may be performed.
- The geometric convolutional layers 606 may be followed by a ReLU activation function (not shown).
- The mesh up-sampling layers 604 are configured to take a mesh as input and to output a mesh with an increased number of vertices.
- Mesh up-sampling layers 604 may alternate with geometric convolutional layers 606.
- Mesh up-sampling from a mesh of n vertices to a mesh of m vertices may be performed using a transform matrix $Q_u \in \mathbb{R}^{m \times n}$.
- The transform matrix may correspond to the down-sampling matrix $Q_d$.
- Vertices retained during down-sampling may be directly retained during up-sampling.
- Vertices discarded during down-sampling may be mapped into the mesh using recorded barycentric coordinates.
- Other mesh up-sampling methods may be used.
- The up-sampled mesh $V_u$ is predicted from a lower-dimensional mesh $V_d$ by a sparse matrix multiplication $V_u = Q_u V_d$, where $Q_u$ is the up-sampling transform matrix.
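The up-sampling step can be sketched as building a transform matrix (written here as `q_u`, a name assumed for illustration) whose rows either copy a retained vertex or blend three vertices of the low-resolution mesh using stored barycentric weights. The data layout of `discarded` is hypothetical, and dense lists stand in for a sparse matrix.

```python
def up_sampling_matrix(m, kept, discarded):
    """Build an m x n up-sampling matrix.

    kept:      original indices of retained vertices, in V_d row order.
    discarded: {original index: ((i, j, k), (u, v, w))}, where i, j, k are
               V_d row indices of the enclosing triangle and u, v, w are the
               barycentric weights recorded at down-sampling time.
    """
    n = len(kept)
    q_u = [[0.0] * n for _ in range(m)]
    for vd_row, original in enumerate(kept):
        q_u[original][vd_row] = 1.0  # retained vertices are copied directly
    for original, (triangle, weights) in discarded.items():
        for col, weight in zip(triangle, weights):
            q_u[original][col] = weight  # discarded vertices are interpolated
    return q_u


def up_sample(q_u, v_d):
    """Compute V_u = Q_u V_d for per-vertex feature rows."""
    features = len(v_d[0])
    return [[sum(w * v[c] for w, v in zip(row, v_d))
             for c in range(features)] for row in q_u]


# Three low-resolution vertices with (x, y, z, r, g, b) values each.
v_d = [[0.0, 0.0, 0.0, 1.0, 0.0, 0.0],
       [1.0, 0.0, 0.0, 0.0, 1.0, 0.0],
       [0.0, 1.0, 0.0, 0.0, 0.0, 1.0]]
q_u = up_sampling_matrix(
    m=4, kept=[0, 1, 2],
    discarded={3: ((0, 1, 2), (0.5, 0.25, 0.25))})
v_u = up_sample(q_u, v_d)
```

Because the same weights are applied to all six values, both the position and the colour of a re-inserted vertex are interpolated from its enclosing triangle.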
- The mesh decoder neural network 600 may comprise one or more fully connected layers 602.
- The initial layer of the mesh decoder neural network 600 i.e. the layer that takes the embedding 106 as an input
- An example structure of a mesh decoder neural network 600 is given by:
- “Convolution” indicates a geometric convolutional layer 606 and “Up-sampling” indicates an up-sampling layer 604.
- The input comprises a 256 component embedding vector 106. Other sizes of embedding vector 106 may alternatively be used.
- The up-sampling layers increase the size of the mesh by a factor of four at each stage, though they may alternatively be configured to provide other up-sampling factors.
- The output is a joint shape and texture mesh 304 with 28431 vertices, each associated with six values - the (x, y, z) position (i.e. shape) of the vertex and the (r, g, b) values (i.e. texture) of the vertex.
- Figure 7 shows a schematic example of a system/apparatus for performing any of the methods described herein.
- The system/apparatus shown is an example of a computing device. It will be appreciated by the skilled person that other types of computing devices/systems, such as a distributed computing system, may alternatively be used to implement the methods described herein.
- The apparatus (or system) 700 comprises one or more processors 702.
- The one or more processors 702 control operation of other components of the system/apparatus 700.
- The one or more processors 702 may, for example, comprise a general purpose processor.
- The one or more processors 702 may be a single core device or a multiple core device.
- The one or more processors 702 may comprise a central processing unit (CPU) or a graphics processing unit (GPU).
- The one or more processors 702 may comprise specialised processing hardware, for instance a RISC processor or
- The system/apparatus comprises a working or volatile memory 704.
- The one or more processors 702 may access the volatile memory 704 in order to process data and may control the storage of data in memory.
- The volatile memory 704 may comprise RAM of any type, for example Static RAM (SRAM) or Dynamic RAM (DRAM), or it may comprise Flash memory, such as an SD-Card.
- The system/apparatus comprises a non-volatile memory 706.
- The non-volatile memory 706 stores a set of operating instructions 708 for controlling the operation of the processors 702 in the form of computer readable instructions.
- The non-volatile memory 706 may be a memory of any kind, such as a Read Only Memory (ROM), a Flash memory or a magnetic drive memory.
- The one or more processors 702 are configured to execute the operating instructions 708 to cause the system/apparatus to perform any of the methods described herein.
- The operating instructions 708 may comprise code (i.e. drivers) relating to the hardware components of the system/apparatus 700, as well as code relating to the basic operation of the system/apparatus 700.
- The one or more processors 702 execute one or more instructions of the operating instructions 708, which are stored permanently or semi-permanently in the non-volatile memory 706, using the volatile memory 704 to temporarily store data generated during execution of said operating instructions 708.
- Implementations of the methods described herein may be realised in digital electronic circuitry, integrated circuitry, specially designed ASICs (application specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These may include computer program products (such as software stored on e.g. magnetic disks, optical disks, memory, Programmable Logic Devices) comprising computer readable instructions that, when executed by a computer, such as that described in relation to Figure 7, cause the computer to perform one or more of the methods described herein.
- Any system feature as described herein may also be provided as a method feature, and vice versa.
- Means-plus-function features may be expressed alternatively in terms of their corresponding structure. In particular, method aspects may be applied to system aspects, and vice versa.
Abstract
Joint shape and texture decoders for three-dimensional rendering. The present invention relates to methods of rendering three-dimensional visual data. In particular, the present invention relates to methods of training and using a mesh decoder neural network for shape and texture modelling in three-dimensional rendering. According to a first aspect, the invention provides a method of training a mesh decoder neural network for shape and texture modelling in three-dimensional rendering, the method comprising training the mesh decoder neural network on a set of visual data to generate a shape and texture map of input visual data, the training comprising: generating embedding parameters of input visual data from the set of visual data, the embedding parameters representing the shape and texture of the input visual data; applying the mesh decoder neural network to the embedding parameters to generate a shape and texture map of the input visual data, the mesh decoder neural network comprising one or more geometric convolutional layers; and updating parameters of the mesh decoder based on a comparison of the input facial visual data with output visual data derived from the generated shape and texture map.
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| GB1902524.6A GB2581536B (en) | 2019-02-25 | 2019-02-25 | Joint shape and texture decoders for three-dimensional rendering |
| GB1902524.6 | 2019-02-25 |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2020174215A1 (fr) | 2020-09-03 |
Family
ID=65999046
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/GB2020/050372 Ceased WO2020174215A1 (fr) | 2019-02-25 | 2020-02-17 | Joint shape and texture decoders for three-dimensional rendering |
Country Status (2)
| Country | Link |
|---|---|
| GB (1) | GB2581536B (fr) |
| WO (1) | WO2020174215A1 (fr) |
Cited By (8)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN112669441A (zh) * | 2020-12-09 | 2021-04-16 | 北京达佳互联信息技术有限公司 | 一种对象重建方法、装置、电子设备和存储介质 |
| CN112700481A (zh) * | 2020-12-23 | 2021-04-23 | 杭州群核信息技术有限公司 | 基于深度学习的纹理图自动生成方法、装置、计算机设备和存储介质 |
| WO2021165628A1 (fr) * | 2020-02-17 | 2021-08-26 | Ariel Ai Ltd | Génération de modèles tridimensionnels d'objets à partir d'images bidimensionnelles |
| CN114445584A (zh) * | 2020-11-04 | 2022-05-06 | 复旦大学 | 基于彩色点云生成带纹理三维网格模型的方法及装置 |
| CN115147508A (zh) * | 2022-06-30 | 2022-10-04 | 北京百度网讯科技有限公司 | 服饰生成模型的训练、生成服饰图像的方法和装置 |
| CN115147526A (zh) * | 2022-06-30 | 2022-10-04 | 北京百度网讯科技有限公司 | 服饰生成模型的训练、生成服饰图像的方法和装置 |
| US20240144578A1 (en) * | 2022-10-14 | 2024-05-02 | Electronics And Telecommunications Research Institute | Apparatus and method for generating texture map of 3-dimensional mesh |
| CN119544979A (zh) * | 2025-01-21 | 2025-02-28 | 长沙超创电子科技有限公司 | 一种用于神经网络的图像编解码及传输方法及系统 |
Families Citing this family (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN110288697A (zh) * | 2019-06-24 | 2019-09-27 | 天津大学 | 基于多尺度图卷积神经网络的3d人脸表示与重建方法 |
| CN114119827B (zh) * | 2021-11-18 | 2025-01-03 | 北京蔚领时代科技有限公司 | 一种静态物体网络模型的渲染方法及装置 |
| CN114724227B (zh) * | 2022-04-26 | 2025-04-04 | 中国矿业大学 | 基于变形自编码器和解耦交换的面部动作单元迁移方法及装置 |
| CN115775298B (zh) * | 2022-12-21 | 2025-06-10 | 南京理工大学 | 一种单视角图像的三维风格迁移方法 |
| CN119810294B (zh) * | 2024-12-05 | 2025-10-10 | 北京百度网讯科技有限公司 | 服饰纹理贴图生成方法、装置、设备以及存储介质 |
Citations (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2018080702A1 (fr) * | 2016-10-31 | 2018-05-03 | Google Llc | Reconstitution d'un visage à partir d'une intégration apprise |
| US20180365874A1 (en) * | 2017-06-14 | 2018-12-20 | Adobe Systems Incorporated | Neural face editing with intrinsic image disentangling |
Family Cites Families (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN108521820B (zh) * | 2017-06-06 | 2021-08-20 | 美的集团股份有限公司 | 使用深度神经网络的从粗略到精细的手部检测方法 |
| CN108830812B (zh) * | 2018-06-12 | 2021-08-31 | 福建帝视信息科技有限公司 | 一种基于网格结构深度学习的视频高帧率重制方法 |
- 2019-02-25: GB application GB1902524.6A, patent GB2581536B (en), status: Active
- 2020-02-17: WO application PCT/GB2020/050372, publication WO2020174215A1 (fr), status: Ceased
Non-Patent Citations (2)
| Title |
|---|
| TRAN LUAN ET AL: "Nonlinear 3D Face Morphable Model", 2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, IEEE, 18 June 2018 (2018-06-18), pages 7346 - 7355, XP033473654, DOI: 10.1109/CVPR.2018.00767 * |
| YUXIANG ZHOU ET AL: "Dense 3D Face Decoding over 2500FPS: Joint Texture & Shape Convolutional Mesh Decoders", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 6 April 2019 (2019-04-06), XP081165962 * |
Cited By (14)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2021165628A1 (fr) * | 2020-02-17 | 2021-08-26 | Ariel Ai Ltd | Génération de modèles tridimensionnels d'objets à partir d'images bidimensionnelles |
| US12340467B2 (en) | 2020-02-17 | 2025-06-24 | Snap Inc. | Generating three-dimensional object models from two-dimensional images |
| CN114445584A (zh) * | 2020-11-04 | 2022-05-06 | 复旦大学 | 基于彩色点云生成带纹理三维网格模型的方法及装置 |
| CN114445584B (zh) * | 2020-11-04 | 2025-05-13 | 复旦大学 | 基于彩色点云生成带纹理三维网格模型的方法及装置 |
| CN112669441B (zh) * | 2020-12-09 | 2023-10-17 | 北京达佳互联信息技术有限公司 | 一种对象重建方法、装置、电子设备和存储介质 |
| CN112669441A (zh) * | 2020-12-09 | 2021-04-16 | 北京达佳互联信息技术有限公司 | 一种对象重建方法、装置、电子设备和存储介质 |
| CN112700481A (zh) * | 2020-12-23 | 2021-04-23 | 杭州群核信息技术有限公司 | 基于深度学习的纹理图自动生成方法、装置、计算机设备和存储介质 |
| CN115147508B (zh) * | 2022-06-30 | 2023-09-22 | 北京百度网讯科技有限公司 | 服饰生成模型的训练、生成服饰图像的方法和装置 |
| CN115147526B (zh) * | 2022-06-30 | 2023-09-26 | 北京百度网讯科技有限公司 | 服饰生成模型的训练、生成服饰图像的方法和装置 |
| CN115147526A (zh) * | 2022-06-30 | 2022-10-04 | 北京百度网讯科技有限公司 | 服饰生成模型的训练、生成服饰图像的方法和装置 |
| CN115147508A (zh) * | 2022-06-30 | 2022-10-04 | 北京百度网讯科技有限公司 | 服饰生成模型的训练、生成服饰图像的方法和装置 |
| US20240144578A1 (en) * | 2022-10-14 | 2024-05-02 | Electronics And Telecommunications Research Institute | Apparatus and method for generating texture map of 3-dimensional mesh |
| US12374020B2 (en) * | 2022-10-14 | 2025-07-29 | Electronics And Telecommunications Research Institute | Apparatus and method for generating texture map of 3-dimensional mesh |
| CN119544979A (zh) * | 2025-01-21 | 2025-02-28 | 长沙超创电子科技有限公司 | 一种用于神经网络的图像编解码及传输方法及系统 |
Also Published As
| Publication number | Publication date |
|---|---|
| GB2581536B (en) | 2024-01-17 |
| GB2581536A (en) | 2020-08-26 |
| GB201902524D0 (en) | 2019-04-10 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 20707798; Country of ref document: EP; Kind code of ref document: A1 |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | 122 | Ep: pct application non-entry in european phase | Ref document number: 20707798; Country of ref document: EP; Kind code of ref document: A1 |