WO2024098941A1 - Image processing method, apparatus, electronic device and storage medium - Google Patents
- Publication number
- WO2024098941A1 (PCT/CN2023/118093)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- query
- feature
- task
- tensor
- query feature
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/50—Context or environment of the image
- G06V20/56—Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/44—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
- G06V10/443—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
- G06V10/449—Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
- G06V10/451—Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
- G06V10/454—Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
Definitions
- the present disclosure relates to computer vision technology, and in particular to an image processing method, device, electronic device and storage medium.
- Embodiments of the present disclosure provide an image processing method, an apparatus, an electronic device, and a storage medium.
- an image processing method comprising: determining first image features corresponding to each of the perspectives based on an image to be processed corresponding to each of the perspectives in at least one perspective; determining a first bird's-eye view feature based on the first image features corresponding to each of the perspectives; determining at least one task query feature among a static element task query feature, a dynamic object task query feature and a motion trajectory task query feature based on the first bird's-eye view feature; and determining a task processing result corresponding to each of the task query features based on each of the at least one task query feature.
- an image processing device comprising: a first processing module, for determining first image features corresponding to each of the at least one perspective based on an image to be processed corresponding to each of the perspectives; a second processing module, for determining first bird's-eye view features based on the first image features corresponding to each of the perspectives; a third processing module, for determining at least one task query feature among static element task query features, dynamic object task query features and motion trajectory task query features based on the first bird's-eye view features; and a fourth processing module, for determining task processing results corresponding to each of the task query features based on each of the at least one task query feature.
- a computer-readable storage medium wherein the storage medium stores a computer program, and the computer program is used to execute the image processing method described in any of the above embodiments of the present disclosure.
- an electronic device comprising: a processor; a memory for storing executable instructions of the processor; the processor is configured to read the executable instructions from the memory and execute the instructions to implement the image processing method described in any of the above embodiments of the present disclosure.
- the image processing method, device, electronic device and storage medium provided by the above-mentioned embodiments of the present disclosure can determine the bird's-eye view features based on the image features corresponding to each perspective determined by the images to be processed from each perspective, determine at least one task query feature based on the bird's-eye view features, and then obtain the task processing results corresponding to each task based on the task query features.
- End-to-end single-task or multi-task processing can be achieved based on multi-perspective environmental images, which can avoid or reduce dependence on high-precision maps, so that accurate surrounding environment information can be effectively obtained even in the absence of high-precision maps, which helps to improve versatility and reduce costs.
- FIG. 1 is an exemplary application scenario of the image processing method provided by the present disclosure.
- FIG. 2 is a schematic flow chart of an image processing method provided by an exemplary embodiment of the present disclosure.
- FIG. 3 is a schematic flow chart of an image processing method provided by another exemplary embodiment of the present disclosure.
- FIG. 4 is a flow chart of step 2031a provided by an exemplary embodiment of the present disclosure.
- FIG. 5 is a schematic diagram of a network structure of a first decoding network provided by an exemplary embodiment of the present disclosure.
- FIG. 6 is a flow chart of step 2031b provided by an exemplary embodiment of the present disclosure.
- FIG. 7 is a flow chart of step 2031c provided by an exemplary embodiment of the present disclosure.
- FIG. 8 is a schematic diagram of the structure of a third decoding network provided by an exemplary embodiment of the present disclosure.
- FIG. 9 is a flow chart of an image processing method provided by yet another exemplary embodiment of the present disclosure.
- FIG. 10 is a flow chart of step 301 provided by an exemplary embodiment of the present disclosure.
- FIG. 11 is a schematic diagram of the principle of determining initial motion trajectory query features provided by an exemplary embodiment of the present disclosure.
- FIG. 12 is a flow chart of step 2021 provided by an exemplary embodiment of the present disclosure.
- FIG. 13 is a schematic diagram of a network structure of an encoder network provided by an exemplary embodiment of the present disclosure.
- FIG. 14 is a schematic diagram of the overall structure of a network model for image processing provided by an exemplary embodiment of the present disclosure.
- FIG. 15 is a schematic diagram of the structure of an image processing device provided by an exemplary embodiment of the present disclosure.
- FIG. 16 is a schematic diagram of the structure of an image processing device provided by another exemplary embodiment of the present disclosure.
- FIG. 17 is a schematic diagram of the structure of a third processing module 503 provided by an exemplary embodiment of the present disclosure.
- FIG. 18 is a schematic diagram of the structure of an application embodiment of the electronic device disclosed herein.
- the inventors discovered that in the field of autonomous driving, how to rely on multi-perspective environmental images to efficiently understand environmental information is an extremely important technical issue. If the understanding of the surrounding environmental information is achieved based on multi-perspective environmental images combined with high-precision maps, it is easy to lead to poor accuracy of environmental information obtained in the absence of high-precision maps.
- FIG. 1 is an exemplary application scenario of the image processing method provided by the present disclosure.
- an on-board surround view camera (which may include a camera with multiple viewing angles) may be used to collect images of the vehicle's surrounding environment as images to be processed corresponding to each viewing angle respectively.
- the image processing device of the present invention may be used to execute the image processing method of the present invention. Based on the images to be processed corresponding to each viewing angle in at least one viewing angle respectively, first image features corresponding to each viewing angle respectively may be determined. Based on the first image features corresponding to each viewing angle respectively, a first bird's-eye view feature may be determined.
- the first bird's-eye view feature is a feature in a grid coordinate system corresponding to a bird's-eye view (Bird's Eye View, abbreviated as: BEV).
- a static element task query feature, a dynamic object task query feature, and a motion trajectory task query feature may be determined. Then, based on each of these task query features, the corresponding task processing results are determined (for example, but not limited to, task processing result 1, task processing result 2, and task processing result 3): the static element task processing result is determined based on the static element task query feature (for example, a static element detection result for the vehicle's surrounding environment in an autonomous driving scene); the dynamic object task processing result is determined based on the dynamic object task query feature (for example, a three-dimensional target detection result); and the motion trajectory task processing result is determined based on the motion trajectory task query feature (for example, a motion trajectory prediction result for dynamic objects). This achieves end-to-end single-task or multi-task processing based on multi-view environmental images without the need to combine high-precision maps, which helps to avoid or reduce dependence on high-precision maps.
- the image processing method disclosed in the present invention is not limited to the above-mentioned autonomous driving scenarios, and can be applied to any other possible scenarios according to actual needs, such as security monitoring scenarios in a certain area.
- the bird's-eye view features of the area can be obtained through images collected by cameras from various perspectives, and end-to-end task processing of static elements, dynamic objects and/or motion trajectories of dynamic objects in the area can be achieved.
- the specific scenarios can be set according to actual needs.
- FIG2 is a flowchart of an image processing method provided by an exemplary embodiment of the present disclosure. This embodiment can be applied to electronic devices, such as a vehicle-mounted computing platform, as shown in FIG2, including the following steps:
- Step 201 determining first image features corresponding to each viewing angle respectively based on the to-be-processed images corresponding to each viewing angle respectively in at least one viewing angle.
- the number of viewing angles can be set according to actual needs.
- the number of viewing angles is the number of surround view cameras set on the vehicle, and each camera corresponds to a viewing angle.
- for example, a four-way surround view system consisting of a left front camera, a left rear camera, a right front camera, and a right rear camera provides four viewing angles; the number of viewing angles is not specifically limited.
- the first image feature can be obtained by any feasible feature extraction method, such as extracting features from each image to be processed based on a pre-trained feature extraction network to obtain the first image features corresponding to each viewing angle.
- the feature extraction network can be set according to actual needs.
- a convolutional neural network can be used as a feature extraction network.
- step 201 may be executed by the processor calling a corresponding instruction stored in a memory, or may be executed by a first processing module executed by the processor.
- Step 202 Determine a first bird's-eye view feature based on the first image features corresponding to each viewing angle.
- the first bird's-eye view feature is a BEV feature in a grid coordinate system corresponding to the bird's-eye view.
- the first image features of each perspective can be encoded based on a pre-trained encoder network to obtain the first bird's-eye view feature.
- the encoder network can be set according to actual needs.
- step 202 may be executed by the processor calling corresponding instructions stored in the memory, or may be executed by a second processing module executed by the processor.
- Step 203 Determine at least one task query feature of a static element task query feature, a dynamic object task query feature, and a motion track task query feature based on the first bird's-eye view feature.
- the static element task query feature is a task query feature related to the static element extracted from the first bird's-eye view feature.
- the dynamic object task query feature is a task query feature related to the dynamic object extracted from the first bird's-eye view feature
- the motion trajectory task query feature is a task query feature related to the motion trajectory of the dynamic object extracted from the static element task query feature.
- Which type or types of task query features need to be obtained can be set according to actual needs. For example, any one, any two, or all three can be obtained at the same time.
- the first bird's-eye view feature can be decoded using a decoding network corresponding to the task obtained by pre-training. The specific decoding network can be set according to actual needs.
- step 203 may be executed by the processor calling a corresponding instruction stored in a memory, or may be executed by a third processing module executed by the processor.
- Step 204 determining task processing results corresponding to each task query feature based on each task query feature of the at least one task query feature.
- a head network corresponding to each task can be set up and trained; the trained head network takes the task query feature corresponding to the task as input and outputs the task processing result corresponding to that task query feature.
- the specific network structure of the head network can be set according to actual needs, for example, it can be implemented by a multilayer perceptron (MLP).
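As a rough illustration of a head network implemented as a multilayer perceptron, the sketch below maps one task query feature to one output vector. The layer sizes, the ReLU activation, and the names `linear`, `relu`, and `mlp_head` are assumptions chosen for illustration; the source only states that an MLP may be used.

```python
from typing import List, Sequence, Tuple

Vector = List[float]
Matrix = List[Vector]


def linear(x: Vector, w: Matrix, b: Vector) -> Vector:
    # w has shape (out_dim, in_dim); returns w @ x + b.
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]


def relu(x: Vector) -> Vector:
    return [max(0.0, v) for v in x]


def mlp_head(query: Vector, layers: Sequence[Tuple[Matrix, Vector]]) -> Vector:
    # Apply each (weight, bias) pair in turn, with ReLU between layers
    # but not after the last one (assumed design, not from the source).
    h = query
    for i, (w, b) in enumerate(layers):
        h = linear(h, w, b)
        if i < len(layers) - 1:
            h = relu(h)
    return h


# Hypothetical two-layer head with hand-picked weights.
toy_layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]),
    ([[1.0, 1.0]], [0.5]),
]
toy_result = mlp_head([2.0, 3.0], toy_layers)
```

In a trained model the weights would come from training, and the output dimension would depend on the task (for example, class scores or box parameters).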
- step 204 may be executed by the processor calling corresponding instructions stored in the memory, or may be executed by a fourth processing module executed by the processor.
- the image processing method provided in this embodiment can determine the bird's-eye view feature based on the image features corresponding to each perspective, which are determined from the image to be processed at each perspective. Based on the bird's-eye view feature, at least one task query feature is determined, and then based on the task query features, the task processing results corresponding to each task are obtained. End-to-end single-task or multi-task processing can be achieved by relying only on multi-view environmental images, without the need to combine with high-precision maps, which can avoid or reduce dependence on high-precision maps, thereby obtaining accurate surrounding environment information even in the absence of high-precision maps, which helps to improve versatility and reduce costs.
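The data flow of steps 201 to 204 can be sketched as follows. This is only an orchestration skeleton: `process_images` and every callable passed to it are hypothetical stand-ins for the trained feature extraction network, encoder network, decoding networks, and head networks, and plain lists of floats stand in for tensors. The motion trajectory branch, which additionally consumes other task query features, is left out for brevity.

```python
from typing import Callable, Dict, List, Sequence

Feature = List[float]  # stand-in for a real tensor type


def process_images(
    images: Sequence[Feature],
    extract: Callable[[Feature], Feature],
    encode_bev: Callable[[List[Feature]], Feature],
    decoders: Dict[str, Callable[[Feature], Feature]],
    heads: Dict[str, Callable[[Feature], Feature]],
) -> Dict[str, Feature]:
    # Step 201: one first image feature per viewing angle.
    image_features = [extract(img) for img in images]
    # Step 202: fuse the per-view features into one bird's-eye view feature.
    bev = encode_bev(image_features)
    # Step 203: one task query feature per enabled task.
    task_queries = {task: decode(bev) for task, decode in decoders.items()}
    # Step 204: each head maps its task query feature to a task result.
    return {task: heads[task](q) for task, q in task_queries.items()}


# Toy run with two views and two tasks; all functions are placeholders.
results = process_images(
    images=[[1.0, 2.0], [3.0, 4.0]],
    extract=lambda img: [2 * x for x in img],
    encode_bev=lambda feats: [sum(col) for col in zip(*feats)],
    decoders={"static": lambda bev: bev, "dynamic": lambda bev: [-x for x in bev]},
    heads={"static": lambda q: [sum(q)], "dynamic": lambda q: [max(q)]},
)
```

Enabling a single task or several tasks only changes which entries are present in `decoders` and `heads`, which mirrors the single-task or multi-task choice described above.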
- FIG. 3 is a schematic flow chart of an image processing method provided by another exemplary embodiment of the present disclosure.
- step 203 may specifically include the following steps:
- Step 2031a based on the first bird's-eye view feature and the initial static element query feature, the static element task query feature is determined using the pre-trained first decoding network, where the initial static element query feature includes initial query features corresponding to each static element in at least one static element.
- the initial static element query feature can be set according to actual needs, such as the initial static element query feature obtained by initialization based on the first initialization rule.
- the first initialization rule can be set according to actual needs, such as randomly initializing N2 D-dimensional static queries (static query represents the initial query feature corresponding to the static element) as the initial static element query feature, N2 is the number of static queries, that is, the number of static elements, N2 can be set according to actual needs, and D represents the dimension of the initial query feature corresponding to each static element.
- the first decoding network may include at least one decoder, which is used to query the task query feature related to the static element from the first bird's-eye view feature based on the initial static element query feature to obtain the static element task query feature.
- the specific network structure of the first decoding network can be set according to actual needs.
- step 2031a may be executed by the processor calling corresponding instructions stored in the memory, or may be executed by the first processing unit executed by the processor.
- the disclosed embodiment utilizes the first decoding network obtained through training to obtain static element task query features related to static elements from the first bird's-eye view features based on the initial static element query features, thereby providing more accurate and effective feature data for the implementation of subsequent static element detection tasks, so as to realize end-to-end task processing based on multi-view images.
- static map information can be generated online without relying on high-precision maps generated offline, thereby further improving versatility.
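The random initialization of the initial static element query feature mentioned above can be sketched as below. The Gaussian distribution, the seed, and the example values N2 = 4 and D = 8 are assumptions; the source only says that N2 D-dimensional static queries are randomly initialized. In a trained model such queries would typically be learnable parameters.

```python
import random
from typing import List


def init_queries(num_queries: int, dim: int, seed: int = 0) -> List[List[float]]:
    # Draw num_queries vectors of length dim from a standard normal
    # distribution (assumed; any random initialization rule could be used).
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_queries)]


# N2 = 4 static queries of dimension D = 8 (illustrative values only).
static_queries = init_queries(num_queries=4, dim=8)
```

The same helper would apply to the N1 dynamic queries of the second initialization rule, with a different count.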
- step 203 may specifically include the following steps:
- Step 2031b based on the first bird's-eye view feature and the initial dynamic object query feature, a dynamic object task query feature is determined using a pre-trained second decoding network, where the initial dynamic object query feature includes initial query features corresponding to each dynamic object in at least one dynamic object.
- the initial dynamic object query feature can be set according to actual needs, such as the initial dynamic object query feature obtained by initialization based on the second initialization rule.
- the second initialization rule can be set according to actual needs, such as randomly initializing N1 D-dimensional dynamic queries (dynamic query represents the initial query feature corresponding to the dynamic object) as the initial dynamic object query feature, N1 is the number of dynamic queries, that is, the number of dynamic objects, N1 can be set according to actual needs, and D represents the dimension of the initial query feature corresponding to each dynamic object.
- the second decoding network may include at least one decoder for querying task query features related to dynamic objects from the first bird's-eye view feature based on the initial dynamic object query feature to obtain dynamic object task query features.
- the specific network structure of the second decoding network can be set according to actual needs.
- step 2031b may be executed by the processor calling corresponding instructions stored in the memory, or may be executed by a second processing unit executed by the processor.
- the disclosed embodiment utilizes the second decoding network obtained through training to obtain dynamic object task query features related to dynamic objects from the first bird's-eye view features based on the initial dynamic object query features, thereby providing more accurate and effective feature data for the implementation of subsequent dynamic object detection tasks, so as to realize end-to-end three-dimensional target detection task processing based on multi-view images, avoid tracking of dynamic objects, thereby helping to reduce the computational complexity of the network model, and at the same time avoid the impact of target tracking errors on subsequent applications.
- step 203 may specifically include the following steps:
- Step 2031c based on the static element task query feature and the initial motion trajectory query feature, the motion trajectory task query feature is determined using a pre-trained third decoding network, where the initial motion trajectory query feature includes initial trajectory query features corresponding to each dynamic object in at least one dynamic object.
- the initial motion trajectory query feature needs to be determined in combination with the dynamic object task query feature and the modal query feature.
- the modal query feature is used to characterize the motion trend of the dynamic object. Different modalities focus on different future motion types (such as fast straight driving, slow straight driving, left turn, right turn, etc.).
- the dynamic object task query feature is used to characterize the features related to the dynamic object. By combining the modal query feature and the dynamic object task query feature, the initial trajectory query feature related to the motion trajectory of the dynamic object can be determined, and then interacted with the static element task query feature in the third decoding network to decode the motion trajectory task query feature.
- the third decoding network can be set according to actual needs.
- step 2031c may be executed by the processor calling corresponding instructions stored in the memory, or may be executed by a third processing unit executed by the processor.
- the disclosed embodiment utilizes the third decoding network obtained through training to obtain motion trajectory task query features related to the motion trajectory of dynamic objects from static element task query features based on initial motion trajectory query features, thereby providing more accurate and effective feature data for the implementation of subsequent motion trajectory prediction tasks of dynamic objects, so as to realize end-to-end motion trajectory prediction task processing based on multi-view images.
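One possible way to combine the modal query features with the dynamic object task query features is sketched below. The text above only states that the two are combined; the element-wise addition of every mode to every object, the resulting N1 x M x D layout, and the name `init_trajectory_queries` are assumptions made for illustration.

```python
from typing import List

Vector = List[float]


def init_trajectory_queries(
    object_queries: List[Vector], modal_queries: List[Vector]
) -> List[List[Vector]]:
    # For each dynamic object and each motion mode, add the two D-dimensional
    # vectors. Result is indexed as [object][mode][channel].
    return [
        [[o + m for o, m in zip(obj, mode)] for mode in modal_queries]
        for obj in object_queries
    ]


# Two objects and three modes, D = 2 (toy values).
toy_objects = [[1.0, 2.0], [3.0, 4.0]]
toy_modes = [[0.0, 0.0], [10.0, 20.0], [-1.0, -1.0]]
toy_trajectory_queries = init_trajectory_queries(toy_objects, toy_modes)
```

Each object then carries one candidate query per motion type, which can interact with the static element task query feature in the third decoding network.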
- step 203 may include at least two of the above steps 2031a-2031c, which can be set according to actual needs, so that end-to-end multi-task processing can be achieved relying only on multi-view images, and task processing results corresponding to multiple tasks can be obtained at the same time, which can avoid dependence on high-precision maps and lidar, and help to further improve versatility.
- FIG. 4 is a flowchart of step 2031a provided by an exemplary embodiment of the present disclosure.
- step 2031a determines the static element task query feature based on the first bird's-eye view feature and the initial static element query feature using a pre-trained first decoding network, including:
- Step 20311a based on the initial static element query feature, determine the first query tensor, the first key tensor and the first value tensor.
- the initial static element query feature can be mapped to the first query tensor based on the first query mapping rule, such as mapping based on the first query mapping matrix.
- the initial static element query feature can be mapped to the first key tensor based on the first key mapping rule, and the initial static element query feature can be mapped to the first value tensor based on the first value mapping rule.
- the specific mapping principle will not be repeated here.
- Step 20312a based on the first query tensor, the first key tensor and the first value tensor, using the first self-attention network of the first decoder in the first decoding network, determine the first self-attention result.
- the first self-attention network is a network based on the self-attention mechanism, which can be set according to actual needs to complete the self-attention operation of the first query tensor, the first key tensor and the first value tensor. Specifically, a self-attention operation is performed based on the first query tensor and the first key tensor to obtain a first weight, and then a weighted sum is performed on the first value tensor based on the first weight to obtain a first self-attention result. The principle of the specific self-attention mechanism will not be repeated.
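A minimal single-head sketch of steps 20311a and 20312a follows: the query features are mapped to query, key, and value tensors by mapping matrices, weights are computed from the query and key tensors, and the value tensor is summed with those weights. The scaled dot-product form with softmax is the common formulation and is assumed here, as are the names `matmul`, `softmax`, and `self_attention`.

```python
import math
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    # Plain matrix product a @ b.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row: List[float]) -> List[float]:
    m = max(row)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x: Matrix, w_q: Matrix, w_k: Matrix, w_v: Matrix) -> Matrix:
    # Map the same input to query, key and value tensors.
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    scale = math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]
    # Attention weights from query and key, then weighted sum of values.
    weights = [softmax([s / scale for s in row]) for row in matmul(q, k_t)]
    return matmul(weights, v)


identity = [[1.0, 0.0], [0.0, 1.0]]
attn_out = self_attention([[1.0, 0.0], [0.0, 1.0]], identity, identity, identity)
```

With identity mapping matrices each output row is a convex combination of the input rows, weighted toward the row that matches the query.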
- Step 20313a based on the first self-attention result and the initial static element query feature, determine the first intermediate result using the first additive normalization network of the first decoder in the first decoding network.
- the first additive normalization network has two functions of addition and normalization.
- the addition is to add the first self-attention result to the initial static element query feature to obtain a first addition result, and then normalize the first addition result to obtain a first intermediate result.
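The addition and normalization can be sketched as a residual sum followed by per-query normalization. Layer normalization without learnable gain and bias is assumed here; the source does not name the normalization type, and `add_and_norm` is a hypothetical helper.

```python
import math
from typing import List

Matrix = List[List[float]]


def add_and_norm(x: Matrix, sublayer_out: Matrix, eps: float = 1e-5) -> Matrix:
    out = []
    for x_row, s_row in zip(x, sublayer_out):
        # Addition: residual sum of the input and the sub-layer output.
        added = [a + b for a, b in zip(x_row, s_row)]
        # Normalization: zero mean and (approximately) unit variance per row.
        mean = sum(added) / len(added)
        var = sum((v - mean) ** 2 for v in added) / len(added)
        out.append([(v - mean) / math.sqrt(var + eps) for v in added])
    return out


normed = add_and_norm([[0.0, 0.0]], [[1.0, 3.0]])
```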
- Step 20314a determine the second query tensor based on the first intermediate result.
- the first intermediate result can be used as the second query tensor, or the first intermediate result can be mapped to the second query tensor based on the corresponding mapping rule, which can be specifically set according to actual needs.
- Step 20315a based on the first bird's-eye view feature, determine a second key tensor and a second value tensor.
- the determination principle of the second key tensor and the second value tensor can be found in the above content and will not be repeated here.
- Step 20316a based on the second query tensor, the second key tensor and the second value tensor, using the first deformable cross attention network of the first decoder in the first decoding network, determine a first cross attention result.
- the first deformable cross-attention network is a cross-attention network based on deformable convolution, and its function is to extract features in the local area near the position corresponding to the initial static element query feature from the first bird's-eye view feature, further improving the accuracy and effectiveness of the extracted features.
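The local sampling idea can be illustrated for a single query and a single head as follows. In deformable attention the sampling offsets and attention weights are normally predicted from the query by learned layers; here they are passed in directly, and bilinear interpolation with zero padding is assumed. The names `bilinear_sample` and `deformable_sample` are hypothetical, and the value projection is omitted.

```python
import math
from typing import List, Sequence, Tuple

Grid = List[List[List[float]]]  # bird's-eye view feature indexed as [H][W][C]


def bilinear_sample(bev: Grid, x: float, y: float) -> List[float]:
    h, w, c = len(bev), len(bev[0]), len(bev[0][0])
    x0, y0 = math.floor(x), math.floor(y)
    out = [0.0] * c
    for yi, wy in ((y0, 1.0 - (y - y0)), (y0 + 1, y - y0)):
        for xi, wx in ((x0, 1.0 - (x - x0)), (x0 + 1, x - x0)):
            if 0 <= xi < w and 0 <= yi < h:  # zero padding outside the grid
                for ch in range(c):
                    out[ch] += wx * wy * bev[yi][xi][ch]
    return out


def deformable_sample(
    bev: Grid,
    ref: Tuple[float, float],
    offsets: Sequence[Tuple[float, float]],
    logits: Sequence[float],
) -> List[float]:
    # Softmax over the per-sample logits gives the attention weights.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    weights = [e / sum(exps) for e in exps]
    out = [0.0] * len(bev[0][0])
    # Only a few points near the reference position are read (sparse attention).
    for (dx, dy), a in zip(offsets, weights):
        sample = bilinear_sample(bev, ref[0] + dx, ref[1] + dy)
        for ch, v in enumerate(sample):
            out[ch] += a * v
    return out


toy_bev = [[[0.0], [1.0]], [[2.0], [3.0]]]  # 2 x 2 grid, 1 channel
toy_local = deformable_sample(toy_bev, (0.0, 0.0), [(0.0, 0.0), (1.0, 1.0)], [0.0, 0.0])
```

Because each query reads only a handful of sampled locations rather than every grid cell, the cost grows with the number of sampling points instead of the size of the bird's-eye view grid.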
- Step 20317a determining static element task query features based on the first cross-attention result and the first intermediate result.
- the first decoder in the first decoding network may also include other related networks after the first deformable cross-attention network, such as an addition normalization network (Add&Norm), a feedforward network (Feed Forward), etc. Therefore, after obtaining the first cross-attention result, the first cross-attention result and the first intermediate result can be added and normalized, and then passed through other related networks to finally obtain the decoding result of the first decoder.
- the decoding result of the first decoder can also be used as the input of the second decoder, and decoding is performed according to the decoding process of the first decoder, and so on, until the decoding of all decoders is completed, and the final decoding result of the first decoding network is obtained, and the final decoding result is used as the static element task query feature.
- FIG5 is a schematic diagram of the network structure of the first decoding network provided by an exemplary embodiment of the present disclosure.
- the first decoding network includes 6 decoders.
- ×6 indicates that the first decoding network includes 6 of the decoders shown in the dashed box, and the decoder in the dashed box takes the first decoder as an example. Q1, K1, and V1 respectively represent the first query tensor, the first key tensor, and the first value tensor; Self Attention represents the first self-attention network; Add&Norm represents an additive normalization network, and the Add&Norm connected to the first self-attention network is the first additive normalization network; Q2, K2, and V2 respectively represent the second query tensor, the second key tensor, and the second value tensor; Deformable Cross Attention represents the first deformable cross-attention network; and Feed Forward represents the feedforward network.
- the initial static element query feature is mapped to the first query tensor, the first key tensor and the first value tensor, and then self-interacts in the first self-attention network to obtain a first self-attention result.
- the first self-attention result is added to the initial static element query feature and normalized to obtain a first intermediate result.
- the first intermediate result is mapped to the second query tensor.
- the first bird's-eye view feature is mapped to the second key tensor and the second value tensor.
- the second query tensor, the second key tensor and the second value tensor are cross-attended in the first deformable cross-attention network to realize the interaction between the first bird's-eye view feature and the initial static element query feature to obtain a first cross-attention result.
- the first cross-attention result is added to the first intermediate result and normalized to obtain a first normalized result.
- the first normalized result is passed through a feedforward network and another addition normalization network to obtain the decoding result of the first decoder.
- the decoding result is then decoded in turn by the remaining 5 decoders to obtain the static element task query feature.
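The order of operations inside one decoder and the chaining of 6 decoders can be sketched as below. The sub-networks are passed in as hypothetical callables, lists of floats stand in for tensors, and the same callables are reused for every decoder purely for brevity; in a real network each decoder would normally have its own weights.

```python
from typing import Callable, List

Queries = List[float]  # toy stand-in for an (N, D) query tensor


def decoder_layer(
    queries: Queries,
    bev: float,
    self_attn: Callable[[Queries], Queries],
    cross_attn: Callable[[Queries, float], Queries],
    feed_forward: Callable[[Queries], Queries],
    add_norm: Callable[[Queries, Queries], Queries],
) -> Queries:
    x = add_norm(queries, self_attn(queries))  # first intermediate result
    x = add_norm(x, cross_attn(x, bev))        # first normalized result
    return add_norm(x, feed_forward(x))        # decoding result of this decoder


def decoding_network(queries: Queries, bev: float, num_layers: int = 6, **sublayers) -> Queries:
    # The output of each decoder is the input of the next one.
    for _ in range(num_layers):
        queries = decoder_layer(queries, bev, **sublayers)
    return queries


# Toy sub-networks: only the cross-attention stand-in changes the queries.
toy_queries = decoding_network(
    [0.0, 1.0],
    1.0,
    self_attn=lambda q: [0.0] * len(q),
    cross_attn=lambda q, bev: [bev] * len(q),
    feed_forward=lambda q: [0.0] * len(q),
    add_norm=lambda x, y: [a + b for a, b in zip(x, y)],  # addition only, no normalization
)
```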
- the first self-attention network can be a multi-head self-attention network
- the first deformable cross-attention network can be a multi-head deformable cross-attention network, which can be set according to actual needs.
- steps 20311a to 20317a can be executed by the processor calling the corresponding instructions stored in the memory, or can be executed by the first processing unit executed by the processor.
- the disclosed embodiment realizes self-interaction of static element query features through the self-attention network in the first decoding network, which can capture the internal correlation of static element query features, and then interacts with the bird's-eye view features in the deformable cross-attention network to realize sparse attention on the data, and flexibly captures the features of related local areas, which helps to reduce the amount of calculation while ensuring the acquisition of accurate and effective related features, thereby improving the network reasoning speed.
- FIG. 6 is a flowchart of step 2031b provided by an exemplary embodiment of the present disclosure.
- step 2031b determines the dynamic object task query feature based on the first bird's-eye view feature and the initial dynamic object query feature using a pre-trained second decoding network, including:
- Step 20311b based on the initial dynamic object query features, determine a third query tensor, a third key tensor, and a third value tensor.
- Step 20312b based on the third query tensor, the third key tensor and the third value tensor, using the second self-attention network of the first decoder in the second decoding network, determine a second self-attention result.
- Step 20313b based on the second self-attention result and the initial dynamic object query feature, determine the second intermediate result using the second additive normalization network of the first decoder in the second decoding network.
- Step 20314b determine the fourth query tensor based on the second intermediate result.
- Step 20315b determine a fourth key tensor and a fourth value tensor based on the first bird's-eye view feature.
- Step 20316b based on the fourth query tensor, the fourth key tensor and the fourth value tensor, using the second deformable cross attention network of the first decoder in the second decoding network, determine a second cross attention result.
- Step 20317b determine the dynamic object task query feature based on the second cross-attention result and the second intermediate result.
- steps 20311b to 20317b are the same as or similar to the aforementioned steps 20311a to 20317a, except that step 20311b starts from the initial dynamic object query feature rather than the initial static element query feature used in step 20311a; they will not be described in detail here.
- the network structure of the second decoding network is the same as or similar to that of the first decoding network, and will not be described in detail here.
- steps 20311b to 20317b can be executed by the processor calling the corresponding instructions stored in the memory, or can be executed by the second processing unit executed by the processor.
- FIG. 7 is a flowchart of step 2031c provided by an exemplary embodiment of the present disclosure.
- step 2031c determines the motion trajectory task query feature based on the static element task query feature and the initial motion trajectory query feature using a pre-trained third decoding network, including:
- Step 20311c based on the initial motion trajectory query feature, determine the fifth query tensor, the fifth key tensor and the fifth value tensor.
- Step 20312c based on the fifth query tensor, the fifth key tensor and the fifth value tensor, using the third self-attention network of the first decoder in the third decoding network, determine a third self-attention result.
- Step 20313c based on the third self-attention result and the initial motion trajectory query feature, use the third additive normalization network of the first decoder in the third decoding network to determine the third intermediate result.
- Step 20314c determine the sixth query tensor based on the third intermediate result.
- step 20311c to step 20314c are the same or similar to the aforementioned steps 20311a to 20314a, and will not be repeated here.
- Step 20315c based on the static element task query feature, determine the sixth key tensor and the sixth value tensor.
- the sixth key tensor and the sixth value tensor of this step are obtained by mapping the static element task query feature obtained in the above example. For the mapping principle, see the above content.
- Step 20316c based on the sixth query tensor, the sixth key tensor and the sixth value tensor, using the first cross-attention network of the first decoder in the third decoding network, determine the third cross-attention result.
- the first cross-attention network can adopt any feasible cross-attention network, which can be set according to actual needs.
- the first cross-attention network can adopt the cross-attention network structure in a conventional visual Transformer, without specific limitation.
- Step 20317c based on the third cross-attention result and the third intermediate result, determine the motion trajectory task query feature.
- the specific operating principle of this step can be found in the aforementioned step 20317a, which will not be repeated here.
- FIG. 8 is a schematic diagram of the structure of the third decoding network provided by an exemplary embodiment of the present disclosure.
- the initial motion trajectory query feature includes multiple trajectory queries (the trajectory query represents the initial trajectory query feature corresponding to the dynamic object), Q5, K5, and V5 represent the fifth query tensor, the fifth key tensor, and the fifth value tensor, respectively, and Q6, K6, and V6 represent the sixth query tensor, the sixth key tensor, and the sixth value tensor, respectively.
- the meaning of other symbols and the reasoning process refer to the aforementioned content and will not be repeated here.
- steps 20311c to 20317c can be executed by the processor calling the corresponding instructions stored in the memory, or can be executed by a third processing unit executed by the processor.
- the disclosed embodiment realizes self-interaction of motion trajectory query features through the self-attention network in the third decoding network, which can capture the internal correlation of the motion trajectory query features, and then interact with the static element task query features in the cross-attention network to achieve effective capture of motion trajectory related features of dynamic objects, thereby implicitly understanding the surrounding static information (such as surrounding road information), which helps to provide accurate and effective feature data for accurately predicting a more reasonable future motion trajectory of dynamic objects, and realize end-to-end motion trajectory prediction based on multi-view image features.
- FIG. 9 is a schematic flow chart of an image processing method provided by yet another exemplary embodiment of the present disclosure.
- the method before determining the motion trajectory task query feature based on the static element task query feature and the initial motion trajectory query feature using the pre-trained third decoding network in step 2031c, the method further includes:
- Step 301 based on the dynamic object task query feature and the modal query feature, determine the initial motion trajectory query feature. The modal query feature includes a first modal query feature corresponding to each modality of at least one modality, and the first modal query feature corresponding to a modality is used to characterize one motion trend of the dynamic object.
- the modal query feature can be set according to actual needs, such as being initialized through the third initialization rule. Since the modal query feature can characterize the motion trend of the dynamic object, combined with the dynamic object task query feature that characterizes the position of the dynamic object, the initial motion trajectory query feature can be determined.
- N3 D-dimensional modal queries (modal query represents the first modal query feature corresponding to the modality) can be randomly initialized as modal query features. N3 can be set according to actual needs.
- the modal query feature is fused with the dynamic object task query feature to form N1 × N3 D-dimensional trajectory queries as the initial motion trajectory query feature, that is, each dynamic object has N3 modalities, characterizing its N3 motion trends.
- step 301 may be executed by the processor calling corresponding instructions stored in the memory, or may be executed by a fourth processing unit executed by the processor.
- the disclosed embodiment obtains the initial motion trajectory query feature based on the fusion of the updated dynamic object task query feature and the modal query feature, so that the initial motion trajectory query feature can include the task query features corresponding to each dynamic object and the query features of multiple modalities of each dynamic object.
- Different modalities can focus on different future motion types (such as fast straight driving, slow straight driving, left turn, right turn, etc.), which helps to provide effective data support for the subsequent decoding of the motion trajectory task query feature through the third decoding network.
- FIG. 10 is a flowchart of step 301 provided by an exemplary embodiment of the present disclosure.
- the dynamic object task query feature includes at least one dynamic object task query feature; and determining the initial motion trajectory query feature based on the dynamic object task query feature and the modal query feature in step 301 includes:
- Step 3011 for each task query feature corresponding to a dynamic object, a first number of the task query features is determined based on the task query feature, where the first number is the number of first modal query features included in the modal query feature.
- each dynamic object can be assigned a first number of first modal query features to characterize the first number of movement trends of the dynamic object. Therefore, in order to be able to merge the task query features of the object with the modal query features, the number of task query features of the dynamic object can be transformed to be the same as the number of modal query features. Therefore, based on the task query features of the dynamic object, the first number of task query features can be determined.
- the modal query feature includes N3 D-dimensional modal queries.
- for the task query feature (which may be referred to as the task query) of each dynamic object, the task query feature may be copied N3 times to obtain N3 identical D-dimensional task query features.
- step 3011 may be executed by the processor calling corresponding instructions stored in the memory, or may be executed by a fourth processing unit executed by the processor.
- Step 3012 add the first number of task query features respectively to the first modal query features corresponding to each modality in the modal query feature to obtain the initial trajectory query features corresponding to the dynamic object.
- N3 identical task query features may be added to N3 first modal query features (modal query) to form N3 motion trajectory query features.
- step 3012 may be executed by the processor calling corresponding instructions stored in the memory, or may be executed by a fourth processing unit executed by the processor.
- Step 3013 determining initial motion trajectory query features based on the initial trajectory query features corresponding to each dynamic object.
- Figure 11 is a schematic diagram of the determination principle of the initial motion trajectory query feature provided by an exemplary embodiment of the present disclosure.
- the number of dynamic objects is N1
- the task query represents the task query feature corresponding to the dynamic object
- the number of the first modal query feature (modal query) of each dynamic object is N3
- the finally obtained initial motion trajectory query feature includes N1 × N3 D-dimensional initial trajectory query features.
- step 3013 may be executed by the processor calling corresponding instructions stored in the memory, or may be executed by a fourth processing unit executed by the processor.
- the disclosed embodiment obtains an initial motion trajectory query feature by fusing the dynamic object task query feature with the modal query feature, so that the initial motion trajectory query feature includes dynamic object task query-related features and information related to the motion trend of the dynamic object. Furthermore, the third decoding network obtained through training can obtain accurate and effective motion trajectory task query features.
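- Steps 3011 to 3013 above (copy each dynamic object's task query N3 times, add the N3 modal queries, and collect the results for all N1 objects) can be sketched as follows. This is a minimal sketch, assuming the queries are plain lists of D floats; the function names `init_modal_queries` and `build_initial_trajectory_queries`, the Gaussian initialization and its scale are hypothetical choices, since the text only states that the modal queries can be randomly initialized.

```python
import random


def init_modal_queries(num_modalities, dim, seed=0):
    """Randomly initialize N3 D-dimensional modal queries (distribution is an assumption)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(num_modalities)]


def build_initial_trajectory_queries(task_queries, modal_queries):
    """Fuse N1 task queries with N3 modal queries into N1 * N3 trajectory queries.

    The output is grouped by object: entries [i * N3, (i + 1) * N3) belong to object i.
    """
    trajectory_queries = []
    for task_query in task_queries:
        # Step 3011: copy the object's task query once per modality.
        copies = [list(task_query) for _ in modal_queries]
        # Step 3012: add each copy to the modal query of the matching modality.
        for copy, modal_query in zip(copies, modal_queries):
            trajectory_queries.append([t + m for t, m in zip(copy, modal_query)])
    # Step 3013: the collected queries form the initial motion trajectory query feature.
    return trajectory_queries
```

For N1 = 2 objects and N3 = 3 modalities the result holds 6 queries, so each dynamic object carries one query per motion trend into the third decoding network.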
- determining the first bird's-eye view feature based on the first image features corresponding to each viewing angle in step 202 includes:
- Step 2021 determining the first bird's-eye view feature based on the first image features corresponding to each viewing angle, the initial bird's-eye view query feature, and the bird's-eye view feature of the previous frame obtained before the first bird's-eye view feature.
- the initial bird's-eye view query feature is a feature initialized under the bird's-eye view, and its size can represent the size of the bird's-eye view.
- the specific initialization rules can be set according to actual needs. For example, based on the required bird's-eye view size, H × W D-dimensional bird's-eye view queries are initialized as the initial bird's-eye view query features, where H and W are the height and width of the bird's-eye view, respectively, and D is the feature dimension of each bird's-eye view query.
- Each bird's-eye view query can correspond to a set of three-dimensional coordinates in a physical space, such as three-dimensional coordinates in a world coordinate system with the vehicle as the origin, which can be set according to actual needs.
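- The initialization of H × W D-dimensional bird's-eye view queries, each tied to a position in a vehicle-centered coordinate system, can be sketched as follows. This is an illustrative sketch only: the function name `init_bev_queries`, the parameter `cell_size_m` (meters covered by one grid cell), the Gaussian initialization and the choice of placing the vehicle at the grid center are assumptions not stated in the text.

```python
import random


def init_bev_queries(height, width, dim, cell_size_m, seed=0):
    """Return H * W query vectors and the (x, y) cell centers they correspond to.

    Coordinates are in meters with the vehicle at the origin (assumed grid center).
    The height coordinate is left open here and would be sampled later.
    """
    rng = random.Random(seed)
    queries, centers = [], []
    for row in range(height):
        for col in range(width):
            queries.append([rng.gauss(0.0, 0.02) for _ in range(dim)])
            x = (col + 0.5 - width / 2) * cell_size_m
            y = (row + 0.5 - height / 2) * cell_size_m
            centers.append((x, y))
    return queries, centers
```

Calling `init_bev_queries(4, 6, 8, 0.5)` yields 24 queries of dimension 8, stored in row-major order so that index `row * width + col` recovers the grid position.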
- the bird's-eye view feature of the previous frame is the bird's-eye view feature obtained in the image processing pass preceding the current one, and its specific processing process is consistent with that of the first bird's-eye view feature.
- step 2021 may be executed by the processor calling a corresponding instruction stored in a memory, or may be executed by a first determination unit executed by the processor.
- the bird's-eye view features of the previous frame include relevant historical information, such as the static elements and dynamic objects in the image captured in the previous frame. Combining the bird's-eye view features of the previous frame with the initial bird's-eye view query features and the first image features makes it possible not only to determine the relevant features of static elements and dynamic objects in the current image to be processed, but also to capture the changes of dynamic objects relative to the previous frame, thereby facilitating the tracking of dynamic objects.
- FIG. 12 is a flowchart of step 2021 provided by an exemplary embodiment of the present disclosure.
- determining the first bird's-eye view feature based on the first image features corresponding to each perspective, the initial bird's-eye view query feature, and the bird's-eye view feature of the previous frame obtained before the first bird's-eye view feature in step 2021 includes:
- Step 20211 based on the bird's-eye view features of the previous frame and the initial bird's-eye view query features, the temporal self-attention network of the first encoder in the pre-trained encoder network is used to determine the temporal self-attention result.
- the encoder network is used to encode the first image features corresponding to each extracted perspective based on the bird's-eye view features of the previous frame and the initial bird's-eye view query features to obtain the first bird's-eye view features under the bird's-eye view.
- the encoder network may include one or more encoders, each of which may include a temporal self-attention network.
- the temporal self-attention network lets the initial bird's-eye view query feature find, according to the movement of the vehicle, its corresponding position in the historical bird's-eye view of the previous frame. This position serves as a reference position for the subsequent extraction of the features corresponding to the reference position area in the first image feature.
- Step 20212 based on the temporal self-attention results and the initial bird's-eye view query features, determine the fourth intermediate result using the fourth additive normalization network of the first encoder.
- Step 20213 based on the first image features corresponding to each perspective and the fourth intermediate result, the spatial cross-attention network in the first encoder is used to determine the spatial cross-attention result.
- the spatial cross-attention network is used to uniformly sample the fourth intermediate result in height to obtain a set of three-dimensional coordinates, and then map the three-dimensional coordinates to the corresponding positions in the first image features corresponding to each perspective according to the camera's internal and external parameters. Then, the features of the corresponding positions in the first image features can be extracted based on deformable convolution, and after subsequent processing, the first bird's-eye view features of the current frame are obtained.
- Step 20214 determine the first bird's-eye view feature based on the spatial cross-attention result and the fourth intermediate result.
- in addition to the fourth additive normalization network and the spatial cross-attention network, the first encoder can also include other related networks, such as an additive normalization network, a feedforward network, and another additive normalization network after the spatial cross-attention network, which can be set according to actual needs. Therefore, after the spatial cross-attention result is obtained, it can be added to and normalized with the fourth intermediate result, and then passed through the other related networks to obtain the encoding result of the first encoder. If the encoder network includes multiple encoders, the encoding result output by the first encoder can be further encoded by the subsequent encoders to finally obtain the first bird's-eye view feature.
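- The geometric part of the spatial cross-attention described above (sample heights uniformly, then map the resulting three-dimensional coordinates into each view using the camera's internal and external parameters) can be sketched with a pinhole camera model. This is a sketch under assumptions: the function names `uniform_heights` and `project_pillar`, the intrinsics layout `(fx, fy, cx, cy)`, the 3×4 vehicle-to-camera extrinsic matrix and the axis conventions are all hypothetical, and the feature extraction at the projected positions is not shown.

```python
def uniform_heights(z_min, z_max, count):
    """Sample `count` heights uniformly between z_min and z_max (inclusive)."""
    if count == 1:
        return [(z_min + z_max) / 2]
    step = (z_max - z_min) / (count - 1)
    return [z_min + i * step for i in range(count)]


def project_pillar(x, y, heights, intrinsics, extrinsics, image_size):
    """Project the points (x, y, z) for each sampled height z into one camera view.

    intrinsics: (fx, fy, cx, cy) of an assumed pinhole camera.
    extrinsics: 3x4 matrix [R | t] mapping vehicle coordinates to camera coordinates,
                where the camera looks along its +z axis.
    Returns the pixel positions (u, v) that fall inside the image.
    """
    fx, fy, cx, cy = intrinsics
    image_width, image_height = image_size
    hits = []
    for z in heights:
        point = (x, y, z, 1.0)
        xc, yc, zc = (sum(r * p for r, p in zip(row, point)) for row in extrinsics)
        if zc <= 0:
            continue  # the point lies behind the camera, so this view cannot see it
        u = fx * xc / zc + cx
        v = fy * yc / zc + cy
        if 0 <= u < image_width and 0 <= v < image_height:
            hits.append((u, v))
    return hits
```

A bird's-eye view position typically projects into only some of the views, so running `project_pillar` once per camera and keeping the non-empty results gives the views whose first image features are relevant to that position.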
- Figure 13 is a schematic diagram of the network structure of an encoder network provided by an exemplary embodiment of the present disclosure.
- BEV B(t-1) represents the bird's-eye view feature of the previous frame
- BEV queries Q represents the initial bird's-eye view query feature
- Temporal Self-Attention represents the temporal self-attention network
- Spatial Cross-Attention represents the spatial cross-attention network
- the meanings of other symbols refer to the above content.
- the previous bird's-eye view feature BEV B(t-1) interacts with the initial bird's-eye view query feature BEV queries Q in the temporal self-attention network.
- the corresponding position of the initial bird's-eye view query feature in the previous frame bird's-eye view is found as the reference position to obtain the temporal self-attention result.
- the temporal self-attention result is added and normalized with the initial bird's-eye view query feature to obtain the fourth intermediate result.
- the fourth intermediate result and the first image features of each perspective are subjected to a spatial cross-attention operation in the spatial cross-attention network.
- the spatial cross-attention network uses deformable convolution to extract, according to the reference position, the features of the corresponding position from the first image features to obtain the spatial cross-attention result.
- the spatial cross-attention result is added and normalized with the fourth intermediate result to obtain the fifth intermediate result.
- the fifth intermediate result is passed through the feedforward network (Feed Forward) and the addition normalization network (Add&Norm) to obtain the encoding result of the first encoder.
- the encoding result of the first encoder is then encoded by the subsequent encoder to obtain the final encoding result, which is the first bird's-eye view feature.
- the spatial position encoding can also be embedded (Embedding) for each first image feature, which is not limited in the present disclosure.
- the above steps 20211 to 20214 can be executed by the processor calling the corresponding instructions stored in the memory, or can be executed by the first determination unit executed by the processor.
- the disclosed embodiment achieves position matching of the initial bird's-eye view query features of the current frame with the bird's-eye view of the previous frame through the temporal self-attention network in the encoder network, establishes the temporal correlation between the current frame and the previous frame, and then extracts local features near the reference position from the first image features based on the spatial cross-attention network, which helps to reduce the amount of calculation on the basis of extracting effective features, thereby improving image processing efficiency.
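- The idea of locating a current bird's-eye view position in the previous frame's bird's-eye view according to the movement of the vehicle can be illustrated with a rigid 2D transform. The text does not specify how the temporal self-attention network performs this matching, so the following is only a geometric sketch under assumptions: the names `reference_position_in_previous_bev` and `to_grid_index`, the ego-motion parameters (pose of the current vehicle expressed in the previous vehicle frame) and the grid convention (vehicle at the grid center) are hypothetical.

```python
import math


def reference_position_in_previous_bev(x, y, ego_dx, ego_dy, ego_dyaw):
    """Map a point from the current vehicle frame into the previous vehicle frame.

    ego_dx, ego_dy: position of the current vehicle in the previous frame (meters).
    ego_dyaw: heading change of the vehicle since the previous frame (radians).
    """
    c, s = math.cos(ego_dyaw), math.sin(ego_dyaw)
    return (c * x - s * y + ego_dx, s * x + c * y + ego_dy)


def to_grid_index(x, y, height, width, cell_size_m):
    """Convert metric coordinates to a (row, col) cell, or None if outside the grid."""
    col = math.floor(x / cell_size_m + width / 2)
    row = math.floor(y / cell_size_m + height / 2)
    if 0 <= row < height and 0 <= col < width:
        return (row, col)
    return None
```

If the vehicle has driven 2 m straight ahead, a point 1 m in front of it now was 3 m in front of it in the previous frame, and `to_grid_index` then gives the previous-frame cell that could serve as the reference position.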
- determining the task processing results corresponding to each task query feature based on each task query feature of the at least one task query feature in step 204 includes:
- Step 2041a based on the static element task query feature, using the pre-trained static element detection head network, determine the static element detection result.
- the static element detection head network can adopt any implementable head network, such as a head network based on a multi-layer perceptron.
- step 2041a may be executed by the processor calling a corresponding instruction stored in a memory, or may be executed by a second determination unit executed by the processor.
- Step 2041b based on the dynamic object task query feature, using the pre-trained dynamic object detection head network, determine the dynamic object detection result.
- the dynamic object detection head network may adopt any implementable head network, such as a head network based on a multi-layer perceptron.
- step 2041b may be executed by the processor calling a corresponding instruction stored in a memory, or may be executed by a third determination unit executed by the processor.
- Step 2041c based on the motion trajectory task query features, using the pre-trained motion trajectory prediction head network, determine the motion trajectory prediction result.
- the motion trajectory prediction head network can adopt any feasible head network, such as a head network based on a multi-layer perceptron.
- step 2041c may be executed by the processor calling a corresponding instruction stored in a memory, or may be executed by a fourth determining unit executed by the processor.
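- A head network based on a multi-layer perceptron, as mentioned for steps 2041a to 2041c, can be sketched as a two-layer perceptron applied to each task query. This is a minimal sketch with assumptions: the names `init_mlp`, `mlp_head` and `predict_trajectories`, the ReLU activation, the random weights and the output layout (one (x, y) waypoint per future time step) are hypothetical, since the text leaves the head design open.

```python
import random


def init_mlp(in_dim, hidden_dim, out_dim, seed=0):
    """Random placeholder parameters; a trained head would supply these."""
    rng = random.Random(seed)

    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

    return {
        "w1": mat(in_dim, hidden_dim), "b1": [0.0] * hidden_dim,
        "w2": mat(hidden_dim, out_dim), "b2": [0.0] * out_dim,
    }


def linear(x, weight, bias):
    """Affine map of one vector; weight is stored as in_dim rows of out_dim values."""
    return [sum(xi * wi for xi, wi in zip(x, col)) + b
            for col, b in zip(zip(*weight), bias)]


def mlp_head(query, params):
    """Two-layer perceptron applied to a single D-dimensional task query."""
    hidden = [max(0.0, h) for h in linear(query, params["w1"], params["b1"])]
    return linear(hidden, params["w2"], params["b2"])


def predict_trajectories(trajectory_queries, params, num_steps):
    """Decode each motion trajectory task query into num_steps (x, y) waypoints."""
    trajectories = []
    for query in trajectory_queries:
        flat = mlp_head(query, params)  # expected length: 2 * num_steps
        trajectories.append([(flat[2 * i], flat[2 * i + 1]) for i in range(num_steps)])
    return trajectories
```

The static element and dynamic object heads could reuse `mlp_head` with a different output size, for example class scores plus geometry parameters, which is again an assumption rather than something the text prescribes.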
- the disclosed embodiment obtains a first bird's-eye view feature by encoding the first image feature corresponding to each perspective, decodes the task query features of different tasks based on the first bird's-eye view feature and the decoding networks corresponding to different tasks, and then obtains the task processing results corresponding to the different tasks based on the head networks corresponding to the different tasks.
- This can achieve end-to-end multi-task processing based on multi-frame multi-perspective surround view images, helps to improve task processing efficiency, avoid or reduce dependence on offline generated high-precision maps and lidars, and simultaneously achieves static element detection, three-dimensional detection of dynamic objects, and prediction of motion trajectories, which helps to reduce costs.
- FIG. 14 is a schematic diagram of the overall structure of a network model for image processing provided by an exemplary embodiment of the present disclosure.
- a feature extraction network can be used to obtain the first image features corresponding to each perspective, and the first image features can be passed through an encoder network to obtain a first bird's-eye view feature. The first bird's-eye view feature can be passed through a first decoding network to obtain a static element task query feature, from which a static element detection head network can produce a static element detection result. The first bird's-eye view feature can also be passed through a second decoding network to obtain a dynamic object task query feature, from which a dynamic object detection head network can produce a dynamic object detection result. The dynamic object task query feature is fused with the modal query feature to obtain an initial motion trajectory query feature; based on the initial motion trajectory query feature and the static element task query feature obtained by the first decoding network, a third decoding network can be used to obtain a motion trajectory task query feature, from which a motion trajectory prediction head network can produce a motion trajectory prediction result.
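- The data flow of the overall network model can be summarized as a wiring sketch in which every sub-network is an opaque callable. This is only an illustration of the dependency order described above: the class name `MultiTaskModel`, its field names and the result keys are hypothetical, and the learned initial query features are assumed to be held inside the decoder callables.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class MultiTaskModel:
    backbone: Callable[[Any], Any]                 # feature extraction network
    encoder: Callable[[Any, Any], Any]             # encoder network producing the BEV feature
    static_decoder: Callable[[Any], Any]           # first decoding network
    dynamic_decoder: Callable[[Any], Any]          # second decoding network
    trajectory_decoder: Callable[[Any, Any], Any]  # third decoding network
    fuse_modal: Callable[[Any], Any]               # fusion with the modal query feature
    static_head: Callable[[Any], Any]
    dynamic_head: Callable[[Any], Any]
    trajectory_head: Callable[[Any], Any]

    def forward(self, images, previous_bev):
        # One first image feature per perspective.
        image_features = [self.backbone(image) for image in images]
        bev = self.encoder(image_features, previous_bev)
        static_queries = self.static_decoder(bev)
        dynamic_queries = self.dynamic_decoder(bev)
        # The trajectory branch depends on both of the other branches.
        initial_trajectory_queries = self.fuse_modal(dynamic_queries)
        trajectory_queries = self.trajectory_decoder(
            static_queries, initial_trajectory_queries
        )
        results = {
            "static_elements": self.static_head(static_queries),
            "dynamic_objects": self.dynamic_head(dynamic_queries),
            "trajectories": self.trajectory_head(trajectory_queries),
        }
        # The BEV feature is returned so it can serve as the previous frame next time.
        return results, bev
```

Returning the bird's-eye view feature alongside the results reflects the use of the previous frame's bird's-eye view feature by the encoder network when the next frame is processed.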
- the network model can be obtained by pre-training.
- multiple tasks can be trained together, or they can be trained separately first and then trained comprehensively.
- the specific settings can be made according to actual needs. For example, in order to ensure better performance in motion trajectory prediction, static element tasks and dynamic object tasks can be trained first to obtain a basic model, and then based on the basic model, the three tasks can be trained together. The specific training principle will not be repeated here.
- the disclosed embodiment can realize task processing of static elements, dynamic objects and motion trajectories using only multi-view images. Compared with LiDAR, it can obtain richer environmental information, has a low hardware cost, and is easy to deploy. Moreover, the disclosed embodiment can realize online generation of static map information through static element detection, does not rely on high-precision maps generated offline, and has a wider range of application scenarios. In addition, the disclosed embodiment does not need to track dynamic targets explicitly, which helps to reduce the complexity of model calculation and can avoid the impact of tracking module errors on subsequent processing, thereby further improving the accuracy of task processing results.
- the features of the images from each perspective can be fused with the data collected by the lidar for end-to-end single-task or multi-task processing to improve the richness of the feature information, thereby helping to further improve the model performance.
- Any image processing method provided in the embodiments of the present disclosure may be executed by any appropriate device with data processing capabilities, including but not limited to: a terminal device and a server.
- any image processing method provided in the embodiments of the present disclosure may be executed by a processor, such as the processor executing any image processing method mentioned in the embodiments of the present disclosure by calling corresponding instructions stored in a memory. This will not be described in detail below.
- FIG. 15 is a schematic diagram of the structure of an image processing device provided by an exemplary embodiment of the present disclosure.
- the device of this embodiment can be used to implement the corresponding method embodiment of the present disclosure.
- the device shown in FIG. 15 includes: a first processing module 501, a second processing module 502, a third processing module 503 and a fourth processing module 504.
- the first processing module 501 is used to determine the first image features corresponding to each perspective based on the image to be processed corresponding to each perspective in at least one perspective; the second processing module 502 is used to determine the first bird's-eye view features based on the first image features corresponding to each perspective; the third processing module 503 is used to determine at least one task query feature among static element task query features, dynamic object task query features and motion trajectory task query features based on the first bird's-eye view features; the fourth processing module 504 is used to determine the task processing results corresponding to each task query feature based on each task query feature in at least one task query feature.
- FIG. 16 is a schematic diagram of the structure of an image processing apparatus provided by another exemplary embodiment of the present disclosure.
- the third processing module 503 includes:
- the first processing unit 5031 is used to determine the static element task query feature based on the first bird's-eye view feature and the initial static element query feature using a pre-trained first decoding network, where the initial static element query feature includes initial query features corresponding to each static element in at least one static element.
- the third processing module 503 includes:
- the second processing unit 5032 is used to determine the dynamic object task query feature based on the first bird's-eye view feature and the initial dynamic object query feature using a pre-trained second decoding network, where the initial dynamic object query feature includes initial query features corresponding to each dynamic object in at least one dynamic object.
- the third processing module 503 includes:
- the third processing unit 5033 is used to determine the motion trajectory task query feature based on the static element task query feature and the initial motion trajectory query feature using a pre-trained third decoding network, where the initial motion trajectory query feature includes initial trajectory query features corresponding to each dynamic object in at least one dynamic object.
- the third processing module 503 may include at least two of the first processing unit 5031 , the second processing unit 5032 and the third processing unit 5033 , which may be specifically configured according to actual needs.
- the first processing unit 5031 is specifically configured to:
- based on the initial static element query feature, determine a first query tensor, a first key tensor and a first value tensor; based on the first query tensor, the first key tensor and the first value tensor, determine a first self-attention result by using a first self-attention network of a first decoder in a first decoding network; based on the first self-attention result and the initial static element query feature, determine a first intermediate result by using a first additive normalization network of the first decoder in the first decoding network; based on the first intermediate result, determine a second query tensor; based on the first bird's-eye view feature, determine a second key tensor and a second value tensor; based on the second query tensor, the second key tensor and the second value tensor, determine a first cross-attention result by using
- the second processing unit 5032 is specifically configured to:
- based on the initial dynamic object query feature, determine a third query tensor, a third key tensor and a third value tensor; based on the third query tensor, the third key tensor and the third value tensor, use the second self-attention network of the first decoder in the second decoding network to determine a second self-attention result; based on the second self-attention result and the initial dynamic object query feature, use the second additive normalization network of the first decoder in the second decoding network to determine a second intermediate result; based on the second intermediate result, determine a fourth query tensor; based on the first bird's-eye view feature, determine a fourth key tensor and a fourth value tensor; based on the fourth query tensor, the fourth key tensor and the fourth value tensor, use the second deformable cross-attention network of the first de
- the third processing unit 5033 is specifically configured to:
- based on the initial motion trajectory query features, determine the fifth query tensor, the fifth key tensor and the fifth value tensor; based on the fifth query tensor, the fifth key tensor and the fifth value tensor, use the third self-attention network of the first decoder in the third decoding network to determine a third self-attention result; based on the third self-attention result and the initial motion trajectory query features, use the third additive normalization network of the first decoder in the third decoding network to determine a third intermediate result; based on the third intermediate result, determine the sixth query tensor; based on the static element task query features, determine the sixth key tensor and the sixth value tensor; based on the sixth query tensor, the sixth key tensor and the sixth value tensor, use the first cross-attention network of the first decoder in the third decoding network to determine a third cross-
- FIG. 17 is a schematic diagram of the structure of a third processing module 503 provided by an exemplary embodiment of the present disclosure.
- the third processing module 503 further includes:
- the fourth processing unit 5034 is used to determine the initial motion trajectory query feature based on the dynamic object task query feature and the modal query feature.
- the modal query feature includes a first modal query feature corresponding to each modality of at least one modality.
- the first modal query feature corresponding to the modality is used to characterize a motion trend of the dynamic object.
- the dynamic object task query feature includes at least one dynamic object task query feature; the fourth processing unit 5034 is specifically configured to:
- for the task query feature corresponding to each dynamic object, determine a first number of copies of that task query feature, where the first number is the number of first modal query features included in the modal query feature; add the first number of copies respectively to the first modal query features corresponding to each modality in the modal query feature to obtain the initial trajectory query feature corresponding to that dynamic object; and determine the initial motion trajectory query feature based on the initial trajectory query features corresponding to the dynamic objects.
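The replicate-and-add step performed by the fourth processing unit 5034 can be illustrated with a short sketch. It assumes each query feature is a plain list of floats and that all features share the same length; the function name `initial_trajectory_queries` and the nested-list layout of the result are illustrative choices, not taken from the disclosure.

```python
def initial_trajectory_queries(object_task_queries, modal_queries):
    """Build one initial trajectory query per (dynamic object, modality) pair.

    object_task_queries: one task query vector per dynamic object.
    modal_queries: one first modal query vector per modality.
    Returns a list indexed as [object][modality][channel].
    """
    first_number = len(modal_queries)  # number of first modal query features
    result = []
    for obj_query in object_task_queries:
        # Replicate the object's task query feature once per modality.
        copies = [list(obj_query) for _ in range(first_number)]
        # Add each copy to the first modal query feature of its modality.
        per_object = [[a + b for a, b in zip(copy, modal)]
                      for copy, modal in zip(copies, modal_queries)]
        result.append(per_object)
    return result


objects = [[1.0, 2.0], [0.0, -1.0]]              # two dynamic objects
modes = [[0.1, 0.1], [0.5, -0.5], [1.0, 0.0]]    # three motion modalities
queries = initial_trajectory_queries(objects, modes)
print(len(queries), len(queries[0]))  # 2 objects, 3 modalities each
```

With N dynamic objects and M modalities this yields N × M trajectory queries, so each object can be decoded into one candidate trajectory per motion trend.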
- the second processing module 502 includes:
- the first determining unit 5021 is used to determine the first bird's-eye view feature based on the first image features corresponding to each viewing angle, the initial bird's-eye view query feature, and the bird's-eye view feature of the previous frame obtained before the first bird's-eye view feature.
- the first determining unit 5021 is specifically configured to:
- based on the bird's-eye view feature of the previous frame and the initial bird's-eye view query feature, use the temporal self-attention network of the first encoder in the pre-trained encoder network to determine a temporal self-attention result; based on the temporal self-attention result and the initial bird's-eye view query feature, use the fourth additive normalization network of the first encoder to determine a fourth intermediate result; based on the first image features corresponding to each viewing angle and the fourth intermediate result, use the spatial cross-attention network in the first encoder to determine a spatial cross-attention result; based on the spatial cross-attention result and the fourth intermediate result, determine the first bird's-eye view feature.
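The ordering of the encoder steps can be sketched as below. This is a simplified stand-in under several assumptions: both attention networks are modeled as dense scaled dot-product attention without learned projections, the temporal step attends over the previous-frame bird's-eye view feature concatenated with the current queries, the image features of all viewing angles are flattened into one token list, and the last step is modeled as another additive normalization. The names `attend`, `add_norm` and `bev_encoder_layer` are hypothetical.

```python
import math


def attend(queries, memory):
    """Dense scaled dot-product attention; memory acts as keys and values."""
    scale = math.sqrt(len(memory[0]))
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, m)) / scale for m in memory]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w * m[j] for w, m in zip(weights, memory)) / total
                    for j in range(len(memory[0]))])
    return out


def add_norm(x, residual, eps=1e-5):
    """Residual sum plus per-vector layer normalization (no learned parameters)."""
    out = []
    for xv, rv in zip(x, residual):
        summed = [a + b for a, b in zip(xv, rv)]
        mean = sum(summed) / len(summed)
        var = sum((a - mean) ** 2 for a in summed) / len(summed)
        out.append([(a - mean) / math.sqrt(var + eps) for a in summed])
    return out


def bev_encoder_layer(bev_queries, prev_bev, view_features):
    # Temporal self-attention: current queries look at the previous-frame
    # bird's-eye view feature (and at themselves, by assumption).
    temporal = attend(bev_queries, prev_bev + bev_queries)
    fourth_intermediate = add_norm(temporal, bev_queries)
    # Spatial cross-attention: queries look at image features of every view.
    image_tokens = [token for view in view_features for token in view]
    spatial = attend(fourth_intermediate, image_tokens)
    # Assumed final step producing the first bird's-eye view feature.
    return add_norm(spatial, fourth_intermediate)
```

Each bird's-eye view grid cell is one query vector, so the output has the same number of cells and channels as the initial bird's-eye view query feature.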
- the fourth processing module 504 includes:
- the second determination unit 5041 is used to determine the static element detection result based on the static element task query feature by using the pre-trained static element detection head network;
- the third determination unit 5042 is used to determine the dynamic object detection result based on the dynamic object task query feature by using the pre-trained dynamic object detection head network;
- the fourth determination unit 5043 is used to determine the motion trajectory prediction result based on the motion trajectory task query feature by using the pre-trained motion trajectory prediction head network.
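How the three determination units route task query features to their heads can be sketched as a simple dispatch. Because only some of the task query features may have been determined, the sketch skips tasks whose feature is absent. The head functions here are placeholders (`toy_head` is hypothetical and merely averages each query vector); in the disclosure the heads are pre-trained networks.

```python
def run_task_heads(task_queries, heads):
    """Apply each head to its matching task query feature, if present."""
    results = {}
    for task, feature in task_queries.items():
        if feature is None:
            continue  # this task query feature was not determined
        results[task] = heads[task](feature)
    return results


def toy_head(label):
    """Build a placeholder head that scores each query by its mean activation."""
    def head(queries):
        return [{"task": label, "score": sum(q) / len(q)} for q in queries]
    return head


# Placeholder heads keyed by task name (assumed naming).
HEADS = {
    "static_element": toy_head("static_element"),
    "dynamic_object": toy_head("dynamic_object"),
    "motion_trajectory": toy_head("motion_trajectory"),
}
```

Keeping the heads separate from the decoders means a task can be enabled or disabled without changing the shared bird's-eye view encoder.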
- the units of the present disclosure described above may be further divided at a finer granularity according to actual needs, for example by dividing a unit into multiple sub-units.
- FIG. 18 is a schematic structural diagram of an application embodiment of the electronic device of the present disclosure.
- the electronic device 10 includes one or more processors 11 and a memory 12.
- the processor 11 may be a central processing unit (CPU) or other forms of processing units having data processing capabilities and/or instruction execution capabilities, and may control other components in the electronic device 10 to perform desired functions.
- the memory 12 may include one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory.
- the volatile memory may include, for example, a random access memory (RAM) and/or a cache memory (cache), etc.
- the non-volatile memory may include, for example, a read-only memory (ROM), a hard disk, a flash memory, etc.
- One or more computer program instructions may be stored on the computer-readable storage medium, and the processor 11 may run the program instructions to implement the methods of the various embodiments of the present disclosure described above and/or other desired functions.
- Various contents such as input signals, signal components, noise components, etc. may also be stored in the computer-readable storage medium.
- the electronic device 10 may further include: an input device 13 and an output device 14, and these components are interconnected via a bus system and/or other forms of connection mechanisms (not shown).
- the input device 13 may also include, for example, a keyboard, a mouse, etc.
- the output device 14 can output various information to the outside, and the output device 14 can include, for example, a display, a speaker, a printer, a communication network and a remote output device connected thereto, and the like.
- FIG. 18 shows only some of the components of the electronic device 10 that are related to the present disclosure, omitting components such as a bus and an input/output interface.
- the electronic device 10 may further include any other appropriate components according to specific application scenarios.
- an embodiment of the present disclosure may also be a computer program product, which includes computer program instructions, which, when executed by a processor, enable the processor to execute the steps of the method according to various embodiments of the present disclosure described in the above-mentioned "Exemplary Method" section of this specification.
- the program code of the computer program product for performing the operations of the embodiments of the present disclosure may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, and conventional procedural programming languages such as "C" or similar programming languages.
- the program code may be executed entirely on the user computing device, partially on the user device, as a separate software package, partially on the user computing device and partially on a remote computing device, or entirely on a remote computing device or server.
- an embodiment of the present disclosure may also be a computer-readable storage medium having computer program instructions stored thereon, which, when executed by a processor, enable the processor to execute the steps of the method according to various embodiments of the present disclosure described in the above “Exemplary Method” section of this specification.
- the computer readable storage medium can adopt any combination of one or more readable media.
- the readable medium can be a readable signal medium or a readable storage medium.
- the readable storage medium can include, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above.
- readable storage media include: an electrical connection with one or more wires, a portable disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Multimedia (AREA)
- Evolutionary Computation (AREA)
- General Physics & Mathematics (AREA)
- Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- General Health & Medical Sciences (AREA)
- Computing Systems (AREA)
- Software Systems (AREA)
- Medical Informatics (AREA)
- Databases & Information Systems (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biodiversity & Conservation Biology (AREA)
- Biomedical Technology (AREA)
- Molecular Biology (AREA)
- Image Analysis (AREA)
Claims (12)
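- An image processing method, comprising: determining, based on the to-be-processed images respectively corresponding to each viewing angle of at least one viewing angle, first image features respectively corresponding to each viewing angle; determining a first bird's-eye view feature based on the first image features respectively corresponding to each viewing angle; determining, based on the first bird's-eye view feature, at least one task query feature among a static element task query feature, a dynamic object task query feature and a motion trajectory task query feature; and determining, based on each task query feature of the at least one task query feature, a task processing result respectively corresponding to each task query feature.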
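- The method according to claim 1, wherein the determining, based on the first bird's-eye view feature, at least one task query feature among a static element task query feature, a dynamic object task query feature and a motion trajectory task query feature comprises: determining the static element task query feature based on the first bird's-eye view feature and an initial static element query feature by using a pre-trained first decoding network, wherein the initial static element query feature includes an initial query feature respectively corresponding to each static element of at least one static element; and/or determining the dynamic object task query feature based on the first bird's-eye view feature and an initial dynamic object query feature by using a pre-trained second decoding network, wherein the initial dynamic object query feature includes an initial query feature respectively corresponding to each dynamic object of at least one dynamic object; and/or determining the motion trajectory task query feature based on the static element task query feature and an initial motion trajectory query feature by using a pre-trained third decoding network, wherein the initial motion trajectory query feature includes an initial trajectory query feature respectively corresponding to each dynamic object of at least one dynamic object.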
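- The method according to claim 2, wherein the determining the static element task query feature based on the first bird's-eye view feature and the initial static element query feature by using the pre-trained first decoding network comprises: determining a first query tensor, a first key tensor and a first value tensor based on the initial static element query feature; determining a first self-attention result based on the first query tensor, the first key tensor and the first value tensor by using a first self-attention network of a first decoder in the first decoding network; determining a first intermediate result based on the first self-attention result and the initial static element query feature by using a first additive normalization network of the first decoder in the first decoding network; determining a second query tensor based on the first intermediate result; determining a second key tensor and a second value tensor based on the first bird's-eye view feature; determining a first cross-attention result based on the second query tensor, the second key tensor and the second value tensor by using a first deformable cross-attention network of the first decoder in the first decoding network; and determining the static element task query feature based on the first cross-attention result and the first intermediate result; and/or, the determining the dynamic object task query feature based on the first bird's-eye view feature and the initial dynamic object query feature by using the pre-trained second decoding network comprises: determining a third query tensor, a third key tensor and a third value tensor based on the initial dynamic object query feature; determining a second self-attention result based on the third query tensor, the third key tensor and the third value tensor by using a second self-attention network of a first decoder in the second decoding network; determining a second intermediate result based on the second self-attention result and the initial dynamic object query feature by using a second additive normalization network of the first decoder in the second decoding network; determining a fourth query tensor based on the second intermediate result; determining a fourth key tensor and a fourth value tensor based on the first bird's-eye view feature; determining a second cross-attention result based on the fourth query tensor, the fourth key tensor and the fourth value tensor by using a second deformable cross-attention network of the first decoder in the second decoding network; and determining the dynamic object task query feature based on the second cross-attention result and the second intermediate result.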
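- The method according to claim 2, wherein the determining the motion trajectory task query feature based on the static element task query feature and the initial motion trajectory query feature by using the pre-trained third decoding network comprises: determining a fifth query tensor, a fifth key tensor and a fifth value tensor based on the initial motion trajectory query feature; determining a third self-attention result based on the fifth query tensor, the fifth key tensor and the fifth value tensor by using a third self-attention network of a first decoder in the third decoding network; determining a third intermediate result based on the third self-attention result and the initial motion trajectory query feature by using a third additive normalization network of the first decoder in the third decoding network; determining a sixth query tensor based on the third intermediate result; determining a sixth key tensor and a sixth value tensor based on the static element task query feature; determining a third cross-attention result based on the sixth query tensor, the sixth key tensor and the sixth value tensor by using a first cross-attention network of the first decoder in the third decoding network; and determining the motion trajectory task query feature based on the third cross-attention result and the third intermediate result.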
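- The method according to claim 2, wherein before the determining the motion trajectory task query feature based on the static element task query feature and the initial motion trajectory query feature by using the pre-trained third decoding network, the method further comprises: determining the initial motion trajectory query feature based on the dynamic object task query feature and a modal query feature, wherein the modal query feature includes a first modal query feature respectively corresponding to each modality of at least one modality, and the first modal query feature corresponding to a modality is used to characterize one motion trend of a dynamic object.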
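- The method according to claim 5, wherein the dynamic object task query feature includes a task query feature of at least one dynamic object; and the determining the initial motion trajectory query feature based on the dynamic object task query feature and the modal query feature comprises: for the task query feature corresponding to each dynamic object, determining, based on that task query feature, a first number of copies of that task query feature, wherein the first number is the number of first modal query features included in the modal query feature; adding the first number of copies of that task query feature respectively to the first modal query features respectively corresponding to each modality in the modal query feature to obtain the initial trajectory query feature corresponding to that dynamic object; and determining the initial motion trajectory query feature based on the initial trajectory query features respectively corresponding to the dynamic objects.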
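- The method according to any one of claims 1-6, wherein the determining a first bird's-eye view feature based on the first image features respectively corresponding to each viewing angle comprises: determining the first bird's-eye view feature based on the first image features respectively corresponding to each viewing angle, an initial bird's-eye view query feature, and a bird's-eye view feature of a previous frame obtained before the first bird's-eye view feature.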
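- The method according to claim 7, wherein the determining the first bird's-eye view feature based on the first image features respectively corresponding to each viewing angle, the initial bird's-eye view query feature, and the bird's-eye view feature of the previous frame obtained before the first bird's-eye view feature comprises: determining a temporal self-attention result based on the bird's-eye view feature of the previous frame and the initial bird's-eye view query feature by using a temporal self-attention network of a first encoder in a pre-trained encoder network; determining a fourth intermediate result based on the temporal self-attention result and the initial bird's-eye view query feature by using a fourth additive normalization network of the first encoder; determining a spatial cross-attention result based on the first image features respectively corresponding to each viewing angle and the fourth intermediate result by using a spatial cross-attention network in the first encoder; and determining the first bird's-eye view feature based on the spatial cross-attention result and the fourth intermediate result.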
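- The method according to any one of claims 1-6, wherein the determining, based on each task query feature of the at least one task query feature, a task processing result respectively corresponding to each task query feature comprises: determining a static element detection result based on the static element task query feature by using a pre-trained static element detection head network; determining a dynamic object detection result based on the dynamic object task query feature by using a pre-trained dynamic object detection head network; and determining a motion trajectory prediction result based on the motion trajectory task query feature by using a pre-trained motion trajectory prediction head network.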
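- An image processing apparatus, comprising: a first processing module configured to determine, based on the to-be-processed images respectively corresponding to each viewing angle of at least one viewing angle, first image features respectively corresponding to each viewing angle; a second processing module configured to determine a first bird's-eye view feature based on the first image features respectively corresponding to each viewing angle; a third processing module configured to determine, based on the first bird's-eye view feature, at least one task query feature among a static element task query feature, a dynamic object task query feature and a motion trajectory task query feature; and a fourth processing module configured to determine, based on each task query feature of the at least one task query feature, a task processing result respectively corresponding to each task query feature.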
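- A computer-readable storage medium storing a computer program, wherein the computer program is used to execute the image processing method according to any one of claims 1-9.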
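- An electronic device, comprising: a processor; and a memory for storing instructions executable by the processor; wherein the processor is configured to read the executable instructions from the memory and execute the instructions to implement the image processing method according to any one of claims 1-9.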
Priority Applications (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| EP23887635.3A EP4610945A4 (en) | 2022-11-11 | 2023-09-11 | IMAGE PROCESSING METHOD AND APPARATUS, AS WELL AS ELECTRONIC DEVICE AND STORAGE MEDIA |
| JP2025526866A JP2025539067A (ja) | 2022-11-11 | 2023-09-11 | 画像の処理方法、装置、電子機器及び記憶媒体 |
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202211417346.2 | 2022-11-11 | | |
| CN202211417346.2A CN115719476A (zh) | 2022-11-11 | 2022-11-11 | 图像的处理方法、装置、电子设备和存储介质 |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2024098941A1 (zh) | 2024-05-16 |
Family
ID=85255051
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/CN2023/118093 Ceased WO2024098941A1 (zh) | 2022-11-11 | 2023-09-11 | 图像的处理方法、装置、电子设备和存储介质 |
Country Status (4)
| Country | Link |
|---|---|
| EP (1) | EP4610945A4 (zh) |
| JP (1) | JP2025539067A (zh) |
| CN (1) | CN115719476A (zh) |
| WO (1) | WO2024098941A1 (zh) |
Cited By (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN118397067A (zh) * | 2024-06-25 | 2024-07-26 | 中国科学技术大学 | 一种自适应深度编码的多视图稀疏查询3d目标检测方法 |
| CN120014605A (zh) * | 2025-04-21 | 2025-05-16 | 智驾大陆(上海)智能科技有限公司 | 图像处理方法、可读存储介质、程序产品及车载设备 |
Families Citing this family (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN115719476A (zh) * | 2022-11-11 | 2023-02-28 | 北京地平线信息技术有限公司 | 图像的处理方法、装置、电子设备和存储介质 |
| CN117272207B (zh) * | 2023-10-10 | 2026-01-02 | 江苏衡新数智科技有限公司 | 数据中心异常分析方法及系统 |
Citations (7)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN113569868A (zh) * | 2021-06-11 | 2021-10-29 | 北京旷视科技有限公司 | 一种目标检测方法、装置及电子设备 |
| US20220207270A1 (en) * | 2020-12-31 | 2022-06-30 | Toyota Research Institute, Inc. | Using a bird's eye view feature map, augmented with semantic information, to detect an object in an environment |
| CN114723955A (zh) * | 2022-03-30 | 2022-07-08 | 上海人工智能创新中心 | 图像处理方法、装置、设备和计算机可读存储介质 |
| CN114882465A (zh) * | 2022-06-01 | 2022-08-09 | 北京地平线信息技术有限公司 | 视觉感知方法、装置、存储介质和电子设备 |
| CN114898315A (zh) * | 2022-05-05 | 2022-08-12 | 北京鉴智科技有限公司 | 驾驶场景信息确定方法、对象信息预测模型训练方法及装置 |
| CN115273022A (zh) * | 2022-06-27 | 2022-11-01 | 重庆长安汽车股份有限公司 | 车辆的鸟瞰图生成方法、装置、车辆及存储介质 |
| CN115719476A (zh) * | 2022-11-11 | 2023-02-28 | 北京地平线信息技术有限公司 | 图像的处理方法、装置、电子设备和存储介质 |
Family Cites Families (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JP3950085B2 (ja) * | 2003-06-10 | 2007-07-25 | 株式会社つくばマルチメディア | 地図誘導全方位映像システム |
| US11554785B2 (en) * | 2019-05-07 | 2023-01-17 | Foresight Ai Inc. | Driving scenario machine learning network and driving environment simulation |
| US12175775B2 (en) * | 2020-06-11 | 2024-12-24 | Toyota Research Institute, Inc. | Producing a bird's eye view image from a two dimensional image |
| CN114463553B (zh) * | 2022-02-09 | 2026-01-23 | 北京地平线信息技术有限公司 | 图像处理方法和装置、电子设备和存储介质 |
2022
- 2022-11-11 CN CN202211417346.2A patent/CN115719476A/zh active Pending
2023
- 2023-09-11 EP EP23887635.3A patent/EP4610945A4/en active Pending
- 2023-09-11 WO PCT/CN2023/118093 patent/WO2024098941A1/zh not_active Ceased
- 2023-09-11 JP JP2025526866A patent/JP2025539067A/ja active Pending
Non-Patent Citations (1)
| Title |
|---|
| See also references of EP4610945A4 |
Also Published As
| Publication number | Publication date |
|---|---|
| JP2025539067A (ja) | 2025-12-03 |
| EP4610945A4 (en) | 2026-02-18 |
| EP4610945A1 (en) | 2025-09-03 |
| CN115719476A (zh) | 2023-02-28 |
Similar Documents
| Publication | Title | Publication Date |
|---|---|---|
| WO2024098941A1 (zh) | 图像的处理方法、装置、电子设备和存储介质 | |
| Tang et al. | Perception and navigation in autonomous systems in the era of learning: A survey | |
| Qin et al. | Unifusion: Unified multi-view fusion transformer for spatial-temporal representation in bird's-eye-view | |
| Li et al. | Voxformer: Sparse voxel transformer for camera-based 3d semantic scene completion | |
| Chen et al. | Deep learning for visual localization and mapping: A survey | |
| Li et al. | Omnifusion: 360 monocular depth estimation via geometry-aware fusion | |
| Hou et al. | Multiview detection with feature perspective transformation | |
| Yoon et al. | Predictively encoded graph convolutional network for noise-robust skeleton-based action recognition | |
| Jiang et al. | Rodyn-slam: Robust dynamic dense rgb-d slam with neural radiance fields | |
| Ding et al. | Object detection method based on lightweight YOLOv4 and attention mechanism in security scenes | |
| Zeng et al. | ARF-YOLOv8: a novel real-time object detection model for UAV-captured images detection | |
| Liu et al. | Event-based monocular dense depth estimation with recurrent transformers | |
| US20250378390A1 (en) | Image Processing Method and Related Device | |
| Xie et al. | S4-driver: Scalable self-supervised driving multimodal large language model with spatio-temporal visual representation | |
| Mohan et al. | Progressive multi-modal fusion for robust 3d object detection | |
| CN116399360A (zh) | 车辆路径规划方法 | |
| Yan et al. | RigNet++: Semantic Assisted Repetitive Image Guided Network for Depth Completion | |
| Du et al. | PointDMIG: a dynamic motion-informed graph neural network for 3D action recognition | |
| CN114943747A (zh) | 图像分析方法及其装置、视频编辑方法及其装置、介质 | |
| Zhang et al. | Occloff: Learning optimized feature fusion for 3d occupancy prediction | |
| Shi et al. | Lane detection by variational auto-encoder with normalizing flow for autonomous driving | |
| Hou et al. | Towards real-time embodied AI agent: a bionic visual encoding framework for mobile robotics | |
| Yang et al. | SA‐FlowNet: Event‐based self‐attention optical flow estimation with spiking‐analogue neural networks | |
| CN119964205A (zh) | 一种基于隐编码神经网络表示的动物姿态估计方法及系统 | |
| Liu et al. | Weakly but deeply supervised occlusion-reasoned parametric road layouts |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 23887635; Country of ref document: EP; Kind code of ref document: A1 |
| | ENP | Entry into the national phase | Ref document number: 2025526866; Country of ref document: JP; Kind code of ref document: A |
| | WWE | Wipo information: entry into national phase | Ref document number: 2025526866; Country of ref document: JP |
| | WWE | Wipo information: entry into national phase | Ref document number: 2023887635; Country of ref document: EP |
| | ENP | Entry into the national phase | Ref document number: 2023887635; Country of ref document: EP; Effective date: 20250527 |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | WWP | Wipo information: published in national office | Ref document number: 2023887635; Country of ref document: EP |