WO2025200079A1 - Learnable encoder converting point cloud to grid for visual recognition - Google Patents
Learnable encoder converting point cloud to grid for visual recognition
Info
- Publication number
- WO2025200079A1 (application PCT/CN2024/090762)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- point cloud
- graph
- generating
- nodes
- cloud graph
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/764—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/60—Type of objects
- G06V20/64—Three-dimensional [3D] objects
Definitions
- the CNN 150 may output information indicating conditions of objects.
- Example conditions may include classification, gesture, pose, movement, action, mood, orientation, interest, traffic-related condition, other types of conditions, or some combination thereof.
- Conditions of objects may be used in various applications, such as human pose lifting, skeleton-based human action recognition, 3D mesh reconstruction, traffic navigation, social network analysis, recommender systems, scientific computing, and so on.
- the training module 120 may jointly train at least part of the PC2G encoder 140 and at least part of the CNN 150.
- the training module 120 may train the up-sampling model 170, the reshaping module 180, and the CNN 150 through the same training process.
- the training module 120 may input one or more training samples into the visual recognition module 110, e.g., directly input into the up-sampling model 170.
- the training module 120 may cause executions of the up-sampling model 170, the reshaping module 180, and the CNN 150 on the one or more training samples.
- the CNN 150 may also be further trained in the second training stage.
- the values of the internal parameters of the CNN 150 may start with the values that are learned in the first training stage.
- the training module 120 may further adjust the values of the internal parameters of the CNN 150 to further train the CNN 150 in the second training stage.
- the up-sampling model 170, the reshaping module 180, and the CNN 150 are jointly trained in the second training stage.
- the training module 120 also determines hyperparameters for training the visual recognition module 110.
- Hyperparameters are variables specifying the training process. Hyperparameters are different from parameters inside the visual recognition module 110 (“internal parameters,” e.g., internal parameters of the up-sampling model 170, internal parameters of the reshaping module 180, weights for convolution operations in the CNN 150, etc.).
- hyperparameters include variables determining the architecture of at least part of the visual recognition module 110, such as number of hidden layers in the CNN 150, and so on. Hyperparameters also include variables which determine how the visual recognition module 110 is trained, such as batch size, number of epochs, etc.
- the training module 120 also adds an activation function to a hidden layer or the output layer.
- An activation function of a layer transforms the weighted sum of the input of the layer to an output of the layer.
- the activation function may be, for example, a rectified linear unit (ReLU) activation function, a tangent activation function, or other types of activation functions.
- the training module 120 may train the visual recognition module 110 for a predetermined number of epochs.
- the number of epochs is a hyperparameter that defines the number of times that the DL algorithm will work through the entire training dataset.
- One epoch means that each sample in the training dataset has had an opportunity to update the internal parameters of the visual recognition module 110.
- the training module 120 may stop updating the internal parameters of the visual recognition module 110, and the visual recognition module 110 is considered trained.
- the width W may be the total number of elements in a row of the feature map 300.
- the height H may be the total number of elements in a column of the feature map 300.
- Each element is represented by a dark circle in FIG. 3.
- an element may be a node in the up-sampled graph, e.g., a node 210 or a node 215.
- the up-sampled graph is input into the reshaping module 180, and the reshaping module 180 outputs the feature map 300.
- the feature map 300 is a 2D tensor.
- the feature map 300 may be a 3D tensor, e.g., a tensor with a spatial size H × W × C.
- C may be 2, 3, or a larger number.
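The relationship between the nodes of the up-sampled graph and an H × W × C feature map can be sketched as follows. This is only an illustration under stated assumptions: the function name `nodes_to_feature_map` is hypothetical, and the fixed row-major ordering used here is a stand-in, since the source describes the node arrangement as something the reshaping module 180 learns.

```python
import numpy as np

def nodes_to_feature_map(node_features, height, width):
    """Arrange N = height * width node feature vectors into an H x W x C tensor.

    Assumption: nodes are placed in row-major order. The source instead
    describes a learned arrangement; this fixed ordering is illustrative only.
    """
    num_nodes, channels = node_features.shape
    if num_nodes != height * width:
        raise ValueError("number of nodes must equal height * width")
    return node_features.reshape(height, width, channels)

# 25 nodes with C = 3 features each (e.g., x, y, z coordinates).
nodes = np.arange(25 * 3, dtype=np.float32).reshape(25, 3)
feature_map = nodes_to_feature_map(nodes, height=5, width=5)
print(feature_map.shape)
```

With H = W = 5 and C = 3, each grid element holds the feature vector of exactly one node.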
- FIG. 4 illustrates a convolution 400 executed on a grid 410, in accordance with various embodiments.
- the grid 410 may be generated by reshaping a graph.
- the grid 410 is two-dimensional and has a spatial size of 6 × 6, i.e., there are six elements in each row and six elements in each column. Every element of the grid 410 is represented by a black circle in FIG. 4. An element may be a node in the graph.
- the convolution 400 may be a deep learning operation in a CNN, e.g., the CNN 150.
- the convolution 400 has a kernel 420 with a spatial size of 3 × 3, i.e., the kernel 420 has nine weights arranged in three rows and three columns.
- the grid 410 is used as an IFM of the convolution 400.
- multiply-accumulate (MAC) operations are performed as the kernel 420 slides through the grid 410, as indicated by the arrows in FIG. 4.
- the stride for applying the kernel 420 on the grid 410 is 1, meaning the kernel slides one data element at a time. In other embodiments, the stride may be more than one.
- the convolution 400 generates an OFM 430, which is a 6 × 6 tensor and has the same spatial size as the grid 410.
- the OFM 430 may have a different spatial size.
- the convolution 400 may include padding, through which additional data elements are added to the grid 410 before the kernel is applied on the grid 410. In an example where the padding factor is 1, one additional row is added to the top of the grid 410, one additional row is added to the bottom of the grid 410, one additional column is added to the right of the grid 410, and one additional column is added to the left of the grid 410. The size of the grid 410 after the padding would become 8 × 8.
- the OFM 430 may be processed in additional deep learning operations in the CNN, e.g., another convolution, activation function, pooling operation, linear transformation, and so on.
- the CNN may output information indicating one or more conditions of the object 310.
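The size arithmetic for the convolution 400 can be checked with a small sketch. It assumes zero padding and a plain sliding-window MAC loop; the function name `conv2d` and the averaging kernel values are hypothetical and not taken from the source.

```python
import numpy as np

def conv2d(grid, kernel, stride=1, padding=0):
    """Slide a 2D kernel over a 2D grid, performing multiply-accumulate per patch."""
    padded = np.pad(grid, padding)  # assumption: zero padding on all four sides
    k_h, k_w = kernel.shape
    out_h = (padded.shape[0] - k_h) // stride + 1
    out_w = (padded.shape[1] - k_w) // stride + 1
    ofm = np.zeros((out_h, out_w), dtype=np.float64)
    for i in range(out_h):
        for j in range(out_w):
            top, left = i * stride, j * stride
            patch = padded[top:top + k_h, left:left + k_w]
            ofm[i, j] = np.sum(patch * kernel)  # MAC over the kernel-sized patch
    return ofm

grid = np.arange(36, dtype=np.float64).reshape(6, 6)  # 6 x 6 input feature map
kernel = np.ones((3, 3)) / 9.0                        # illustrative 3 x 3 weights

same = conv2d(grid, kernel, stride=1, padding=1)   # padded to 8 x 8 first
valid = conv2d(grid, kernel, stride=1, padding=0)  # no padding
print(same.shape, valid.shape)
```

With a padding factor of 1 the padded grid is 8 × 8 and the output keeps the 6 × 6 size of the input; without padding the same 3 × 3 kernel yields a 4 × 4 output, which is one way the OFM may end up with a different spatial size.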
- the real matrix 501 may be denoted as […]. The real matrix 501 may be a continuous approximation of the binary matrix 503.
- the real matrix 501 may be used to assist the learning of the binary matrix 503, which may be denoted as ⁇ .
- directly learning the binary matrix 503 would cut off the gradient flow in the backward path 520, which can make the training non-differentiable.
- Straight-through estimator (STE) may be used for parameter update to solve this problem.
- the binarization module 502 may convert the real matrix 501 to the binary matrix 503 by binarizing the real matrix 501 row by row, according to:
- the binary matrix 503 may then be used to convert an up-sampled graph 504 to a grid 505.
- the grid 505 is a 3D tensor that has a spatial size of 5 × 5 × 3. In other embodiments, the grid may have a different shape or dimension. For instance, the grid 505 may have one or more larger dimensions. Each element in the grid 505 may be a different node in the up-sampled graph 504.
- the grid 505 is input into a CNN 506 and may be processed by the CNN 506 as an IFM.
- the CNN 506 may be an example of the CNN 150 in FIG. 1.
- the CNN 506 may generate a label that indicates visual recognition of an object represented by the up-sampled graph 504.
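The interplay between the real matrix, the binary matrix, and the straight-through estimator can be sketched as below. The binarization formula itself is not reproduced in this text, so the rule used here (each row becomes a one-hot vector at its largest entry) is an assumption, as are all function names and the learning rate. The sketch only shows the general STE idea: binarize in the forward pass, and pass the gradient through the binarization unchanged in the backward pass.

```python
import numpy as np

def binarize_rows(real):
    """Assumed rule: turn each row into a one-hot vector at its largest entry."""
    binary = np.zeros_like(real)
    binary[np.arange(real.shape[0]), np.argmax(real, axis=1)] = 1.0
    return binary

def forward(real, node_features):
    """Forward path: binarize, then select one node per grid cell."""
    binary = binarize_rows(real)
    return binary @ node_features, binary

def backward_ste(grad_grid, node_features):
    """Backward path: gradient w.r.t. the binary matrix, reused for the real matrix.

    The STE treats d(binary)/d(real) as the identity, so the gradient is not cut off.
    """
    return grad_grid @ node_features.T

rng = np.random.default_rng(0)
num_cells, num_nodes, channels = 25, 25, 3          # 5 x 5 grid, 25 nodes
real = rng.normal(size=(num_cells, num_nodes))      # learnable real matrix
nodes = rng.normal(size=(num_nodes, channels))      # up-sampled graph features

flat_grid, binary = forward(real, nodes)
grid = flat_grid.reshape(5, 5, channels)            # 5 x 5 x 3 grid for the CNN

grad_grid = np.ones_like(flat_grid)                 # placeholder upstream gradient
grad_real = backward_ste(grad_grid, nodes)
real -= 0.1 * grad_real                             # hypothetical update step
```

Note that a row-wise one-hot rule alone does not guarantee that every grid cell selects a different node; any such constraint would have to come from the training procedure, which this sketch does not model.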
- the multiplication applied between a kernel-sized patch of the IFM 640 and a kernel may be a dot product.
- a dot product is the elementwise multiplication between the kernel-sized patch of the IFM 640 and the corresponding kernel, which is then summed, always resulting in a single value. Because it results in a single value, the operation is often referred to as the “scalar product. ”
- Using a kernel smaller than the IFM 640 is intentional as it allows the same kernel (set of weights) to be multiplied by the IFM 640 multiple times at different points on the IFM 640. Specifically, the kernel is applied systematically to each overlapping part or kernel-sized patch of the IFM 640, left to right, top to bottom.
- In the depthwise convolution 683, the input channels are not combined. Rather, MAC operations are performed on an individual input channel and an individual kernel and produce an output channel. As shown in FIG. 6, the depthwise convolution 683 produces a depthwise output tensor 680.
- the depthwise output tensor 680 is represented by a 5 × 5 × 3 3D matrix.
- the depthwise output tensor 680 includes three output channels, each of which is represented by a 5 × 5 2D matrix.
- the 5 × 5 2D matrix includes five output elements in each row and five output elements in each column.
- Each output channel is a result of MAC operations of an input channel of the IFM 640 and a kernel of the filter 650.
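A minimal sketch of a depthwise convolution follows. The size of the IFM 640 is not stated in this text, so a 7 × 7 × 3 input with 3 × 3 kernels, stride 1, and no padding is assumed here because it reproduces the 5 × 5 × 3 output described above; the function name and weight values are hypothetical.

```python
import numpy as np

def depthwise_conv(ifm, kernels):
    """Convolve each input channel with its own kernel; channels are never mixed."""
    height, width, channels = ifm.shape
    k_h, k_w, k_c = kernels.shape
    if k_c != channels:
        raise ValueError("one kernel per input channel is required")
    out = np.zeros((height - k_h + 1, width - k_w + 1, channels))
    for ch in range(channels):
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                patch = ifm[i:i + k_h, j:j + k_w, ch]
                out[i, j, ch] = np.sum(patch * kernels[:, :, ch])  # per-channel MAC
    return out

ifm = np.ones((7, 7, 3))  # assumed input size
# One 3 x 3 kernel per channel, filled with 1, 2, and 3 respectively.
kernels = np.stack([np.full((3, 3), w) for w in (1.0, 2.0, 3.0)], axis=-1)
depthwise_out = depthwise_conv(ifm, kernels)
print(depthwise_out.shape)
```

Because each output channel depends on exactly one input channel and one kernel, the three output channels here differ only by the kernel weight used.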
- the OFM 660 is then passed to the next layer in the sequence.
- the OFM 660 is passed through an activation function.
- An example activation function is ReLU.
- ReLU is a calculation that returns the value provided as input directly, or the value zero if the input is zero or less.
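The ReLU definition above amounts to an elementwise maximum with zero. A short sketch, with hypothetical sample values standing in for an OFM:

```python
import numpy as np

def relu(x):
    """Return the input where it is positive, and zero where it is zero or less."""
    return np.maximum(x, 0.0)

ofm = np.array([[-2.0, 0.0],
                [1.5, -0.5]])
activated = relu(ofm)
print(activated)
```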
- the convolutional layer 610 may receive several images as input and calculate the convolution of each of them with each of the kernels. This process can be repeated several times. For instance, the OFM 660 is passed to the subsequent convolutional layer 610 (i.e., the convolutional layer 610 following the convolutional layer 610 generating the OFM 660 in the sequence).
- the subsequent convolutional layers 610 perform a convolution on the OFM 660 with new kernels and generate a new feature map.
- the new feature map may also be normalized and resized.
- the new feature map can be kernelled again by a further subsequent convolutional layer 610, and so on.
- the pooling layers 620 down-sample feature maps generated by the convolutional layers, e.g., by summarizing the presence of features in the patches of the feature maps.
- a pooling layer 620 is placed between two convolutional layers 610: a preceding convolutional layer 610 (the convolutional layer 610 preceding the pooling layer 620 in the sequence of layers) and a subsequent convolutional layer 610 (the convolutional layer 610 subsequent to the pooling layer 620 in the sequence of layers).
- a pooling layer 620 is added after a convolutional layer 610, e.g., after an activation function (e.g., ReLU, etc.) has been applied to the OFM 660.
- a pooling layer 620 receives feature maps generated by the preceding convolution layer 610 and applies a pooling operation to the feature maps.
- the pooling operation reduces the size of the feature maps while preserving their important characteristics. Accordingly, the pooling operation improves the efficiency of the CNN and helps avoid overfitting.
- the pooling layers 620 may perform the pooling operation through average pooling (calculating the average value for each patch on the feature map), max pooling (calculating the maximum value for each patch of the feature map), or a combination of both.
- the size of the pooling operation is smaller than the size of the feature maps.
- the pooling operation is 2 × 2 pixels applied with a stride of two pixels, so that the pooling operation reduces the size of a feature map by a factor of 2, e.g., the number of pixels or values in the feature map is reduced to one quarter the size.
- a pooling layer 620 applied to a feature map of 6 × 6 results in an output pooled feature map of 3 × 3.
- the output of the pooling layer 620 is inputted into the subsequent convolution layer 610 for further feature extraction.
- the pooling layer 620 operates upon each feature map separately to create a new set of the same number of pooled feature maps.
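The 2 × 2, stride-2 pooling example can be sketched as follows. The function name `pool2d` and the `mode` parameter are hypothetical; only max and average pooling on a single feature map are shown.

```python
import numpy as np

def pool2d(feature_map, size=2, stride=2, mode="max"):
    """Down-sample one 2D feature map by summarizing each size x size patch."""
    out_h = (feature_map.shape[0] - size) // stride + 1
    out_w = (feature_map.shape[1] - size) // stride + 1
    reduce_fn = np.max if mode == "max" else np.mean
    pooled = np.empty((out_h, out_w), dtype=np.float64)
    for i in range(out_h):
        for j in range(out_w):
            top, left = i * stride, j * stride
            pooled[i, j] = reduce_fn(feature_map[top:top + size, left:left + size])
    return pooled

feature_map = np.arange(36, dtype=np.float64).reshape(6, 6)
pooled_max = pool2d(feature_map, mode="max")
pooled_avg = pool2d(feature_map, mode="average")
print(pooled_max.shape, pooled_avg.shape)
```

A 6 × 6 map becomes 3 × 3, so each side is halved and the number of values drops from 36 to 9, one quarter of the original. Applying the same function to each feature map separately gives the same number of pooled maps as inputs.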
- the fully-connected layers 630 are the last layers of the CNN.
- the fully-connected layers 630 may be convolutional or not.
- the fully-connected layers 630 may also be referred to as linear layers.
- a fully-connected layer 630 (e.g., the first fully-connected layer in the CNN 600) may receive an input operand.
- the input operand may define the output of the convolutional layers 610 and pooling layers 620 and includes the values of the last feature map generated by the last pooling layer 620 in the sequence.
- the fully-connected layer 630 may apply a linear transformation to the input operand through a weight matrix.
- the weight matrix may be a kernel of the fully-connected layer 630.
- the linear transformation may include a tensor multiplication between the input operand and the weight matrix.
- the result of the linear transformation may be an output operand.
- the fully-connected layer may further apply a non-linear transformation (e.g., by using a non-linear activation function) on the result of the linear transformation to generate an output operand.
- the output operand may contain as many elements as there are classes: element i represents the probability that the image belongs to class i. Each element is therefore between 0 and 1, and the elements sum to one. These probabilities are calculated by the last fully-connected layer 630 by using a logistic function (binary classification) or a SoftMax function (multi-class classification) as an activation function.
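The linear transformation followed by SoftMax can be sketched as below. The input shape, the number of classes, the bias term, and the random weights are all assumptions chosen for illustration.

```python
import numpy as np

def softmax(logits):
    """Numerically stable SoftMax: outputs lie in [0, 1] and sum to one."""
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()

def fully_connected(input_operand, weight_matrix, bias):
    """Linear transformation of a flattened input through a weight matrix."""
    return weight_matrix @ input_operand + bias

rng = np.random.default_rng(0)
last_feature_map = rng.normal(size=(3, 3, 4))   # assumed output of the last pooling layer
input_operand = last_feature_map.reshape(-1)    # flatten to a vector of 36 values

num_classes = 5                                  # hypothetical class count
weight_matrix = rng.normal(size=(num_classes, input_operand.size))
bias = np.zeros(num_classes)

probabilities = softmax(fully_connected(input_operand, weight_matrix, bias))
print(probabilities, probabilities.sum())
```

For binary classification a logistic (sigmoid) function on a single output would play the same role as SoftMax does here.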
- the input tensor 710 has a spatial size H_in × W_in × C_in, where H_in is the height of the 3D matrix (i.e., the length along the Y axis, which indicates the number of activations in a column in the 3D matrix of each input channel), W_in is the width of the 3D matrix (i.e., the length along the X axis, which indicates the number of activations in a row in the 3D matrix of each input channel), and C_in is the depth of the 3D matrix (i.e., the length along the Z axis, which indicates the number of input channels).
- the input tensor 710 has a spatial size of 7 × 7 × 3, i.e., the input tensor 710 includes three input channels and each input channel has a 7 × 7 2D matrix.
- Each input element in the input tensor 710 may be represented by a (X, Y, Z) coordinate. In other embodiments, the height, width, or depth of the input tensor 710 may be different.
- Each filter 720 includes weights arranged in a 3D matrix. The values of the weights may be determined through training the DNN.
- a filter 720 has a spatial size H_f × W_f × C_f, where H_f is the height of the filter (i.e., the length along the Y axis, which indicates the number of weights in a column in each kernel), W_f is the width of the filter (i.e., the length along the X axis, which indicates the number of weights in a row in each kernel), and C_f is the depth of the filter (i.e., the length along the Z axis, which indicates the number of channels). In some embodiments, C_f equals C_in.
- each filter 720 in FIG. 7 has a spatial size of 3 × 3 × 3, i.e., the filter 720 includes three convolutional kernels with a spatial size of 3 × 3.
- the height, width, or depth of the filter 720 may be different.
- the spatial size of the convolutional kernels is smaller than the spatial size of the 2D matrix of each input channel in the input tensor 710.
- each filter 720 slides across the input tensor 710 and generates a 2D matrix for an output channel in the output tensor 730.
- the 2D matrix has a spatial size of 5 × 5.
- the output tensor 730 includes activations (also referred to as “output activations,” “elements,” or “output elements”) arranged in a 3D matrix.
- An output activation is a data point in the output tensor 730.
- MAC operations can be performed on a 3 × 3 × 3 subtensor 715 (which is highlighted with a dotted pattern in FIG. 7) in the input tensor 710 and each filter 720.
- the result of the MAC operations on the subtensor 715 and one filter 720 is an output activation.
- an output activation may include 8 bits, e.g., one byte.
- an output activation may include more than one byte. For instance, an output element may include two bytes.
- a vector 735 is produced.
- the vector 735 is highlighted with slashes in FIG. 7.
- the vector 735 includes a sequence of output activations, which are arranged along the Z axis.
- the output activations in the vector 735 have the same (X, Y) coordinate, but the output activations correspond to different output channels and have different Z coordinates.
- the dimension of the vector 735 along the Z axis may equal the total number of output channels in the output tensor 730.
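The shapes involved in this multi-filter convolution can be sketched as follows. The number of filters is not given in this text, so four filters are assumed; stride 1 and no padding are also assumptions, chosen because they map a 7 × 7 input to the 5 × 5 output described above. The function name is hypothetical.

```python
import numpy as np

def conv_layer(input_tensor, filters):
    """Standard convolution: each filter spans all input channels and yields one output channel."""
    h_in, w_in, c_in = input_tensor.shape
    num_filters, h_f, w_f, c_f = filters.shape
    if c_f != c_in:
        raise ValueError("filter depth must equal the number of input channels")
    h_out, w_out = h_in - h_f + 1, w_in - w_f + 1
    output = np.zeros((h_out, w_out, num_filters))
    for y in range(h_out):
        for x in range(w_out):
            subtensor = input_tensor[y:y + h_f, x:x + w_f, :]  # H_f x W_f x C_in patch
            # MAC of the patch with every filter gives one vector along the Z axis.
            output[y, x, :] = np.tensordot(filters, subtensor, axes=([1, 2, 3], [0, 1, 2]))
    return output

input_tensor = np.ones((7, 7, 3))   # H_in x W_in x C_in
filters = np.ones((4, 3, 3, 3))     # assumed: 4 filters, each 3 x 3 x 3
output_tensor = conv_layer(input_tensor, filters)
z_vector = output_tensor[0, 0, :]   # activations sharing one (X, Y) coordinate
print(output_tensor.shape, z_vector.shape)
```

Each output activation is the result of 27 multiply-accumulate operations (3 × 3 × 3), and the vector along the Z axis has one entry per filter, i.e., per output channel.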
- FIG. 8 illustrates an AI-based visual recognition environment 800, in accordance with various embodiments.
- the AI-based visual recognition environment 800 includes a visual recognition module 810, client devices 820 (individually referred to as client device 820), and a third-party system 830.
- the AI-based visual recognition environment 800 may include fewer, more, or different components.
- the AI-based visual recognition environment 800 may include a different number of client devices 820 or more than one third-party system 830.
- the visual recognition module 810 performs visual recognition tasks, e.g., detection of conditions of objects. For instance, the visual recognition module 810 may track 3D motions of an object by estimating 3D poses of the object. In some embodiments, the visual recognition module 810 may receive one or more point clouds captured by one or more sensors placed in a local area where an object is located. The visual recognition module 810 may receive the point clouds from one or more client devices 820 or the third-party system 830. Also, the visual recognition module 810 may transmit information indicating visual recognition of the object to one or more client devices 820 or the third-party system 830. Additionally or alternatively, the visual recognition module 810 may transmit content items generated using the estimated 3D poses of the object to one or more client devices 820 or the third-party system 830. An example of the visual recognition module 810 is the visual recognition module 110 in FIG. 1.
- the client devices 820 are in communication with the visual recognition module 810.
- the client device 820 may receive 3D pose graphical representations from the visual recognition module 810 and display the 3D pose graphical representations to one or more users associated with the client device 820.
- a client device 820 may facilitate an interface with one or more depth cameras in a local area and may send commands to the depth cameras to capture depth images to be used by the visual recognition module 810.
- the client device 820 may facilitate an interface with one or more projectors in a local area and may provide content items to the projectors for the projectors to present the content items in the local area.
- the client device 820 may generate the content items using motion tracking results from the visual recognition module 810.
- a client device may have one or more users, whose motions may be tracked by the visual recognition module 810.
- a client device 820 may execute one or more applications allowing one or more users of the client device 820 to interact with the visual recognition module 810.
- a client device 820 executes a browser application to enable interaction between the client device 820 and the visual recognition module 810.
- a client device 820 interacts with the visual recognition module 810 through an application programming interface (API) running on a native operating system of the client device 820, such as ANDROID™.
- a client device 820 may be one or more computing devices capable of receiving user input as well as transmitting and/or receiving data via the network 840.
- a client device 820 is a conventional computer system, such as a desktop or a laptop computer.
- a client device 820 may be a device having computer functionality, such as a personal digital assistant (PDA), a mobile telephone, a smartphone, an autonomous vehicle, or another suitable device.
- a client device 820 is configured to communicate via the network 840.
- a client device 820 is an integrated computing device that operates as a standalone network-enabled device.
- the client device 820 includes a display, speakers, a microphone, a camera, and an input device.
- a client device 820 is a computing device for coupling to an external media device such as a television or other external display and/or audio output system.
- the client device 820 may couple to the external media device via a wireless interface or wired interface and may utilize various functions of the external media device such as its display, speakers, microphone, camera, and input devices.
- the client device 820 may be configured to be compatible with a generic external media device that does not have specialized software, firmware, or hardware specifically for interacting with the client device 820.
- the third-party system 830 is an online system that may communicate with the visual recognition module 810 or at least one of the client devices 820.
- the third-party system 830 may provide data to the visual recognition module 810 for 3D pose estimation.
- the data may include depth images, data for training DNNs, data for validating DNNs, and so on.
- the third-party system 830 may be a social media system, an online image gallery, an online searching system, and so on. Additionally or alternatively, the third-party system 830 may use results of 3D pose estimation in various applications. For instance, the third-party system 830 may use motion tracking results from the visual recognition module 810 for action recognition, sport analysis, virtual reality, augmented reality, film and game production, telepresence, and so on.
- the visual recognition module 810, client devices 820, and third-party system 830 are connected through a network 840.
- the network 840 may comprise any combination of local area and/or wide area networks, using both wired and/or wireless communication systems.
- the network 840 may use standard communications technologies and/or protocols.
- the network 840 may include communication links using technologies such as Ethernet, 802.11, worldwide interoperability for microwave access (WiMAX), 3G, 4G, code division multiple access (CDMA), digital subscriber line (DSL), etc.
- networking protocols used for communicating via the network 840 may include multiprotocol label switching (MPLS), transmission control protocol/Internet protocol (TCP/IP), hypertext transfer protocol (HTTP), simple mail transfer protocol (SMTP), and file transfer protocol (FTP).
- Data exchanged over the network 840 may be represented using any suitable format, such as hypertext markup language (HTML) or extensible markup language (XML).
- all or some of the communication links of the network 840 may be encrypted using any suitable technique or techniques.
- FIG. 9 is a flowchart showing a method 900 of visual recognition, in accordance with various embodiments.
- the method 900 may be a method of 3D visual recognition.
- the method 900 may be performed by the visual recognition module 110 in FIG. 1.
- Although the method 900 is described with reference to the flowchart illustrated in FIG. 9, many other methods for visual recognition may alternatively be used.
- the order of execution of the steps in FIG. 9 may be changed.
- some of the steps may be changed, eliminated, or combined.
- the visual recognition module 110 generates 910 a point cloud graph by removing one or more points from a point cloud capturing an object.
- the point cloud graph comprises a first group of nodes.
- a node encodes a feature in the object, such as a portion of the object.
- the point cloud graph also includes one or more edges. An edge connects two adjacent nodes. In some embodiments, the edge represents a topological connection between two features in the object that are encoded by the two nodes.
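One way to picture this step is the sketch below, which drops points from a point cloud and connects the remaining nodes. The source does not say which points are removed or how edges are chosen, so both rules here (keep every second point, connect each node to its nearest neighbor) are assumptions, and the function name is hypothetical.

```python
import numpy as np

def point_cloud_to_graph(points, keep_every=2, k=1):
    """Build a graph by removing points and linking each node to its k nearest nodes."""
    nodes = points[::keep_every]                        # assumed removal rule
    diff = nodes[:, None, :] - nodes[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)                # pairwise Euclidean distances
    np.fill_diagonal(dist, np.inf)                      # a node is not its own neighbor
    nearest = np.argsort(dist, axis=1)[:, :k]
    edges = {tuple(sorted((i, int(j))))                 # undirected, deduplicated
             for i in range(len(nodes)) for j in nearest[i]}
    return nodes, sorted(edges)

# Eight points along the x axis with uneven spacing (hypothetical data).
xs = [0.0, 0.5, 1.0, 1.5, 3.0, 3.5, 7.0, 7.5]
points = np.array([[x, 0.0, 0.0] for x in xs])
graph_nodes, graph_edges = point_cloud_to_graph(points, keep_every=2, k=1)
print(graph_nodes.shape, graph_edges)
```

In a real system the edges would more plausibly follow the topology of the object (for example, a skeleton), which a purely distance-based rule cannot capture.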
- the visual recognition module 110 generates the feature map in two stages. In some embodiments, the visual recognition module 110 generates an additional point cloud graph from the point cloud graph, the additional point cloud graph comprising the second group of nodes. The visual recognition module 110 transforms the additional point cloud graph into the feature map by rearranging the second group of nodes into the grid structure. In some embodiments, the visual recognition module 110 generates the additional point cloud graph by interpolating one or more new nodes between two nodes in the point cloud graph. The two nodes are connected through an edge in the point cloud graph. In some embodiments, the visual recognition module 110 generates the additional point cloud graph by applying an up-sampling matrix and an adjacency matrix on the point cloud graph.
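Interpolating new nodes along edges can be written as a matrix product, as sketched below. The source does not specify the form of the up-sampling matrix, so this sketch assumes one midpoint node per edge; the function name and sample coordinates are hypothetical. The edge list plays the role of the adjacency information.

```python
import numpy as np

def midpoint_upsampling_matrix(num_nodes, edges):
    """Matrix that keeps every original node and adds one midpoint node per edge."""
    matrix = np.zeros((num_nodes + len(edges), num_nodes))
    matrix[:num_nodes, :num_nodes] = np.eye(num_nodes)  # original nodes pass through
    for k, (a, b) in enumerate(edges):
        matrix[num_nodes + k, a] = 0.5                  # new node is the average of
        matrix[num_nodes + k, b] = 0.5                  # the two connected nodes
    return matrix

nodes = np.array([[0.0, 0.0, 0.0],
                  [2.0, 0.0, 0.0],
                  [2.0, 2.0, 0.0]])
edges = [(0, 1), (1, 2)]                                # pairs of connected nodes

upsampling = midpoint_upsampling_matrix(len(nodes), edges)
upsampled_nodes = upsampling @ nodes                    # 3 nodes become 5
print(upsampled_nodes)
```

Up-sampling this way lets the node count be raised until it matches the H × W grid the reshaping stage needs.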
- the visual recognition module 110 inputs the point cloud graph into a trained model.
- the trained model comprises a learnable binary matrix.
- the visual recognition module 110 generates the feature map using the learnable binary matrix.
- the trained model is trained by updating one or more parameters of a real matrix and binarizing the real matrix row by row to obtain the binary matrix.
- the visual recognition module 110 executes 930 one or more deep learning operations in a neural network on the up-sampled grid representation of the object.
- the neural network may be a CNN, e.g., the CNN 150 or the CNN 600.
- the one or more deep learning operations comprise a convolution.
- the convolution is executed on the up-sampled grid representation of the object.
- the convolution has a kernel. The kernel has a smaller size than the feature map.
- the visual recognition module 110 determines 940 a condition of the object based on an output of the neural network.
- the neural network outputs information describing the condition of the object.
- the condition of the object may be a pose, movement, gesture, orientation, mood, color, shape, size, or other types of conditions of the object.
- FIG. 10 is a block diagram of an example computing device 1000, in accordance with various embodiments.
- the computing device 1000 can be used as at least part of the computer vision system 100.
- a number of components are illustrated in FIG. 10 as included in the computing device 1000, but any one or more of these components may be omitted or duplicated, as suitable for the application.
- some or all of the components included in the computing device 1000 may be attached to one or more motherboards. In some embodiments, some or all of these components are fabricated onto a single system on a chip (SoC) die.
- the computing device 1000 may not include one or more of the components illustrated in FIG. 10, but the computing device 1000 may include interface circuitry for coupling to the one or more components.
- the computing device 1000 may not include a display device 1006, but may include display device interface circuitry (e.g., a connector and driver circuitry) to which a display device 1006 may be coupled.
- the computing device 1000 may not include an audio input device 1018 or an audio output device 1008, but may include audio input or output device interface circuitry (e.g., connectors and supporting circuitry) to which an audio input device 1018 or audio output device 1008 may be coupled.
- the computing device 1000 may include a processing device 1002 (e.g., one or more processing devices).
- the processing device 1002 processes electronic data from registers and/or memory to transform that electronic data into other electronic data that may be stored in registers and/or memory.
- the computing device 1000 may include a memory 1004, which may itself include one or more memory devices such as volatile memory (e.g., DRAM), nonvolatile memory (e.g., read-only memory (ROM)), high bandwidth memory (HBM), flash memory, solid state memory, and/or a hard drive.
- the memory 1004 may include memory that shares a die with the processing device 1002.
- the memory 1004 includes one or more non-transitory computer-readable media storing instructions executable to perform operations for performing 3D visual recognition, e.g., the method 900 described above in conjunction with FIG. 9 or some operations performed by the computer vision system 100 described above in conjunction with FIG. 1.
- the instructions stored in the one or more non-transitory computer-readable media may be executed by the processing device 1002.
- the computing device 1000 may include a communication chip 1012 (e.g., one or more communication chips).
- the communication chip 1012 may be configured for managing wireless communications for the transfer of data to and from the computing device 1000.
- wireless and its derivatives may be used to describe circuits, devices, systems, methods, techniques, communications channels, etc., that may communicate data through the use of modulated electromagnetic radiation through a nonsolid medium. The term does not imply that the associated devices do not contain any wires, although in some embodiments they might not.
- the communication chip 1012 may implement any of a number of wireless standards or protocols, including but not limited to Institute of Electrical and Electronics Engineers (IEEE) standards including Wi-Fi (IEEE 802.11 family), IEEE 802.16 standards (e.g., IEEE 802.16-2005 Amendment), Long-Term Evolution (LTE) project along with any amendments, updates, and/or revisions (e.g., advanced LTE project, ultramobile broadband (UMB) project (also referred to as "3GPP2"), etc.).
- WiMAX Broadband Wireless Access
- the communication chip 1012 may operate in accordance with a Global System for Mobile Communication (GSM), General Packet Radio Service (GPRS), Universal Mobile Telecommunications System (UMTS), High Speed Packet Access (HSPA), Evolved HSPA (E-HSPA), or LTE network.
- the communication chip 1012 may operate in accordance with Enhanced Data for GSM Evolution (EDGE), GSM EDGE Radio Access Network (GERAN), Universal Terrestrial Radio Access Network (UTRAN), or Evolved UTRAN (E-UTRAN).
- the communication chip 1012 may operate in accordance with Code Division Multiple Access (CDMA), Time Division Multiple Access (TDMA), Digital Enhanced Cordless Telecommunications (DECT), Evolution-Data Optimized (EV-DO), and derivatives thereof, as well as any other wireless protocols that are designated as 3G, 4G, 5G, and beyond.
- the communication chip 1012 may operate in accordance with other wireless protocols in other embodiments.
- the computing device 1000 may include an antenna 1022 to facilitate wireless communications and/or to receive other wireless communications (such as AM or FM radio transmissions).
- the communication chip 1012 may manage wired communications, such as electrical, optical, or any other suitable communication protocols (e.g., Ethernet).
- the communication chip 1012 may include multiple communication chips. For instance, a first communication chip 1012 may be dedicated to shorter-range wireless communications such as Wi-Fi or Bluetooth, and a second communication chip 1012 may be dedicated to longer-range wireless communications such as global positioning system (GPS), EDGE, GPRS, CDMA, WiMAX, LTE, EV-DO, or others.
- GPS global positioning system
- a first communication chip 1012 may be dedicated to wireless communications.
- a second communication chip 1012 may be dedicated to wired communications.
- the computing device 1000 may include a display device 1006 (or corresponding interface circuitry, as discussed above).
- the display device 1006 may include any visual indicators, such as a heads-up display, a computer monitor, a projector, a touchscreen display, a liquid crystal display (LCD), a light-emitting diode display, or a flat panel display, for example.
- LCD liquid crystal display
- the computing device 1000 may include an audio output device 1008 (or corresponding interface circuitry, as discussed above).
- the audio output device 1008 may include any device that generates an audible indicator, such as speakers, headsets, or earbuds, for example.
- the computing device 1000 may include an audio input device 1018 (or corresponding interface circuitry, as discussed above).
- the audio input device 1018 may include any device that generates a signal representative of a sound, such as microphones, microphone arrays, or digital instruments (e.g., instruments having a musical instrument digital interface (MIDI) output).
- MIDI musical instrument digital interface
- the computing device 1000 may include a GPS device 1016 (or corresponding interface circuitry, as discussed above).
- the GPS device 1016 may be in communication with a satellite-based system and may receive a location of the computing device 1000, as known in the art.
- the computing device 1000 may include another output device 1010 (or corresponding interface circuitry, as discussed above).
- Examples of the other output device 1010 may include an audio codec, a video codec, a printer, a wired or wireless transmitter for providing information to other devices, or an additional storage device.
- the computing device 1000 may include another input device 1020 (or corresponding interface circuitry, as discussed above).
- Examples of the other input device 1020 may include an accelerometer, a gyroscope, a compass, an image capture device, a keyboard, a cursor control device such as a mouse, a stylus, a touchpad, a bar code reader, a Quick Response (QR) code reader, any sensor, or a radio frequency identification (RFID) reader.
- the computing device 1000 may have any desired form factor, such as a handheld or mobile computer system (e.g., a cell phone, a smart phone, a mobile internet device, a music player, a tablet computer, a laptop computer, a netbook computer, an ultrabook computer, a PDA, an ultramobile personal computer, etc.), a desktop computer system, a server or other networked computing component, a printer, a scanner, a monitor, a set-top box, an entertainment control unit, a vehicle control unit, a digital camera, a digital video recorder, or a wearable computer system.
- the computing device 1000 may be any other electronic device that processes data.
- Example 1 provides a method, including generating a point cloud graph by removing one or more points from a point cloud capturing at least part of an object, the point cloud graph including a first group of nodes; generating a feature map from the point cloud graph, the feature map including a second group of nodes that are arranged in a grid structure, the second group of nodes including more nodes than the first group of nodes; executing one or more deep learning operations in a neural network on the feature map; and determining a condition of the object based on an output of the neural network.
- Example 4 provides the method of any one of examples 1-3, in which generating the point cloud graph includes selecting a point in the point cloud based on a state value of the point, the state value indicating a degree of invalidity of the point; and generating a node in the point cloud graph from the selected point.
- Example 5 provides the method of any one of examples 1-4, in which generating the feature map includes generating an additional point cloud graph from the point cloud graph, the additional point cloud graph including the second group of nodes; and transforming the additional point cloud graph into the feature map by rearranging the second group of nodes into the grid structure.
- Example 7 provides the method of example 5 or 6, in which generating the additional point cloud graph includes applying an up-sampling matrix and an adjacency matrix on the point cloud graph.
- Example 8 provides the method of any one of examples 1-7, in which generating the feature map includes inputting the point cloud graph into a trained model, the trained model including a learnable binary matrix and generating the feature map using the learnable binary matrix.
- Example 9 provides the method of example 8, in which the trained model is trained by: updating one or more parameters of a real matrix; and binarizing the real matrix row by row to obtain the binary matrix.
- Example 10 provides the method of any one of examples 1-9, in which the one or more deep learning operations include a convolution.
- Example 11 provides one or more non-transitory computer-readable media storing instructions executable to perform operations, the operations including generating a point cloud graph by removing one or more points from a point cloud capturing at least part of an object, the point cloud graph including a first group of nodes; generating a feature map from the point cloud graph, the feature map including a second group of nodes that are arranged in a grid structure, the second group of nodes including more nodes than the first group of nodes; executing one or more deep learning operations in a neural network on the feature map; and determining a condition of the object based on an output of the neural network.
- Example 12 provides the one or more non-transitory computer-readable media of example 11, in which generating the point cloud graph includes dividing the point cloud into one or more regions; and for each region of the point cloud, removing one or more points in the region.
- Example 13 provides the one or more non-transitory computer-readable media of example 11 or 12, in which generating the point cloud graph includes dividing the point cloud into one or more regions; and for each region of the point cloud, generating a node in the point cloud graph by averaging points in the region.
- Example 20 provides the one or more non-transitory computer-readable media of any one of examples 11-19, in which the one or more deep learning operations include a convolution.
- Example 21 provides an apparatus, including a computer processor for executing computer program instructions; and a non-transitory computer-readable memory storing computer program instructions executable by the computer processor to perform operations including generating a point cloud graph by removing one or more points from a point cloud capturing at least part of an object, the point cloud graph including a first group of nodes, generating a feature map from the point cloud graph, the feature map including a second group of nodes that are arranged in a grid structure, the second group of nodes including more nodes than the first group of nodes, executing one or more deep learning operations in a neural network on the feature map, and determining a condition of the object based on an output of the neural network.
- Example 22 provides the apparatus of example 21, in which generating the point cloud graph includes dividing the point cloud into one or more regions; and for each region of the point cloud, removing one or more points in the region.
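The region-based graph generation in Examples 12, 13, and 22 can be illustrated with a minimal sketch. It assumes NumPy, cubic regions of a fixed edge length, and plain coordinate averaging as in Example 13; the function name `region_average_graph` and the parameter `cell_size` are hypothetical and do not come from the application, which does not specify how regions are formed.

```python
import numpy as np


def region_average_graph(points, cell_size):
    """Average the points that fall into each cubic region of a point cloud.

    points:    (N, 3) array of xyz coordinates.
    cell_size: edge length of a cubic region (hypothetical parameter).
    Returns an (M, 3) array with one node per non-empty region, M <= N.
    """
    points = np.asarray(points, dtype=float)
    # Integer region index of every point along each axis.
    cells = np.floor(points / cell_size).astype(np.int64)
    # Map each distinct region to a compact id in 0..M-1.
    _, region_id = np.unique(cells, axis=0, return_inverse=True)
    region_id = region_id.reshape(-1)  # some NumPy versions return a 2-D inverse
    counts = np.bincount(region_id)
    nodes = np.zeros((counts.size, points.shape[1]))
    # Sum the coordinates per region, then divide by the region population.
    np.add.at(nodes, region_id, points)
    return nodes / counts[:, None]


cloud = np.array([[0.1, 0.1, 0.1], [0.3, 0.3, 0.3], [1.5, 0.2, 0.2]])
sparse_nodes = region_average_graph(cloud, cell_size=1.0)
```

Here the first two points share a region and collapse into one node, so the resulting graph has fewer nodes than the input cloud has points.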
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Evolutionary Computation (AREA)
- Multimedia (AREA)
- General Physics & Mathematics (AREA)
- Physics & Mathematics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- General Health & Medical Sciences (AREA)
- Medical Informatics (AREA)
- Software Systems (AREA)
- Databases & Information Systems (AREA)
- Computing Systems (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Image Analysis (AREA)
Abstract
A learnable point cloud to grid (PC2G) encoder and a neural network may be used to perform point cloud-based visual recognition tasks. The PC2G encoder may receive a point cloud capturing at least part of an object. The PC2G encoder may remove one or more points from the point cloud and generate a sparse point cloud graph. The PC2G encoder may convert the sparse point cloud graph into an up-sampled point cloud graph that has more nodes than the sparse point cloud graph. The PC2G encoder may then transform the up-sampled point cloud graph into a feature map with a grid structure by rearranging the nodes in the up-sampled point cloud graph. The feature map may then be processed by the neural network, which may be a convolutional neural network. The neural network may output data indicating one or more conditions (e.g., pose, movement, gesture, orientation, shape, color, etc.) of the object.
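The up-sampling and grid-rearrangement steps can be sketched as follows. This is an illustration under stated assumptions, not the applicant's implementation: it assumes NumPy, it interprets the row-by-row binarization of Example 9 as keeping the largest entry of each row (one-hot rows), and it omits both the adjacency matrix of Example 7 and the training loop that updates the real matrix. The names `binarize_rows` and `graph_to_grid` are hypothetical.

```python
import numpy as np


def binarize_rows(real_matrix):
    """Binarize a real matrix row by row: the largest entry of each row becomes 1."""
    binary = np.zeros(real_matrix.shape, dtype=float)
    rows = np.arange(real_matrix.shape[0])
    binary[rows, real_matrix.argmax(axis=1)] = 1.0
    return binary


def graph_to_grid(node_features, real_matrix, height, width):
    """Up-sample M sparse nodes to height*width nodes and arrange them in a grid.

    node_features: (M, C) features of the sparse point cloud graph.
    real_matrix:   (height*width, M) real-valued matrix (assumed already trained).
    Returns a (height, width, C) feature map.
    """
    if real_matrix.shape != (height * width, node_features.shape[0]):
        raise ValueError("real_matrix must have shape (height*width, M)")
    binary = binarize_rows(real_matrix)
    # Each output node copies the features of exactly one sparse node.
    upsampled = binary @ node_features
    # Rearranging the up-sampled nodes row-major yields the grid structure.
    return upsampled.reshape(height, width, -1)


features = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
weights = np.array([[0.9, 0.1, 0.0],
                    [0.2, 0.7, 0.1],
                    [0.0, 0.1, 0.8],
                    [0.6, 0.3, 0.1]])
feature_map = graph_to_grid(features, weights, height=2, width=2)
```

The resulting `feature_map` has four grid cells built from three sparse nodes, so it could be passed to an ordinary 2D convolution.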
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN2024084402 | 2024-03-28 | | |
| CNPCT/CN2024/084402 | 2024-03-28 | | |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2025200079A1 (fr) | 2025-10-02 |
Family
ID=97216013
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/CN2024/090762 Pending WO2025200079A1 (fr) | Learnable encoder converting point cloud to grid for visual recognition | 2024-03-28 | 2024-04-30 |
Country Status (1)
| Country | Link |
|---|---|
| WO (1) | WO2025200079A1 (fr) |
Citations (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20150213646A1 (en) * | 2014-01-28 | 2015-07-30 | Siemens Aktiengesellschaft | Method and System for Constructing Personalized Avatars Using a Parameterized Deformable Mesh |
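| CN109964222A (zh) * | 2016-11-03 | 2019-07-02 | Mitsubishi Electric Corporation | System and method for processing an input point cloud having a plurality of points |
| CN113970922A (zh) * | 2020-07-22 | 2022-01-25 | SenseTime Group Limited | Point cloud data processing method, intelligent driving control method, and apparatus |
| US20220277514A1 (en) * | 2021-02-26 | 2022-09-01 | Adobe Inc. | Reconstructing three-dimensional scenes portrayed in digital images utilizing point cloud machine-learning models |
| CN115909319A (zh) * | 2022-12-15 | 2023-04-04 | Nanjing Tech University | Method for 3D object detection on point clouds based on a hierarchical graph network |
| CN116310104A (zh) * | 2023-03-08 | 2023-06-23 | Wuhan Textile University | Method, system, and storage medium for three-dimensional human body reconstruction in complex scenes |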
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US20230010142A1 (en) | | Generating Pretrained Sparse Student Model for Transfer Learning |
| US12554988B2 (en) | | System and method for compressing convolutional neural networks |
| US20230016455A1 (en) | | Decomposing a deconvolution into multiple convolutions |
| WO2023220878A1 (fr) | | Neural network training via dense-connection-based knowledge distillation |
| US20220101091A1 (en) | | Near memory sparse matrix computation in deep neural network |
| US12572800B2 (en) | | Transposing memory layout of weights in deep neural networks (DNNs) |
| EP4195104A1 (fr) | | System and method for pruning filters in deep neural networks |
| WO2024040601A1 (fr) | | Head architecture for deep neural network (DNN) |
| US20230298322A1 (en) | | Out-of-distribution detection using a neural network |
| EP4354348A1 (fr) | | Sparsity processing on unpacked data |
| US20230071760A1 (en) | | Calibrating confidence of classification models |
| US20230059976A1 (en) | | Deep neural network (dnn) accelerator facilitating quantized inference |
| WO2024040546A1 (fr) | | Point grid network with learnable semantic grid transformation |
| WO2023220888A1 (fr) | | Modeling graph-structured data with point grid convolution |
| WO2025200079A1 (fr) | | Learnable encoder converting point cloud to grid for visual recognition |
| WO2024072472A1 (fr) | | Gradient-free efficient class activation map generation |
| WO2025097349A1 (fr) | | Graph-based computer vision using a progressive grid learner and a convolutional neural network |
| WO2025123208A1 (fr) | | Annotation network for three-dimensional pose estimation |
| WO2025102256A1 (fr) | | Neural network training with contrastive knowledge distillation |
| WO2024077463A1 (fr) | | Sequential modeling with a memory containing multi-range arrays |
| WO2025200078A1 (fr) | | Face tracking based on spatio-temporal aggregation and rigid prior |
| US20240346293A1 (en) | | Out-of-distribution detection using autoencoder appended to neural network |
| WO2023220867A1 (fr) | | Neural network with point grid convolutional layer |
| WO2025107244A1 (fr) | | Training a neural network with multi-source, single-destination generative feature distillation |
| US20250363591A1 (en) | | High resolution patch management system in an early-stage image |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | EP: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 24932703; Country of ref document: EP; Kind code of ref document: A1 |