WO2024098941A1 - Image processing method, apparatus, electronic device and storage medium - Google Patents

Image processing method, apparatus, electronic device and storage medium

Info

Publication number
WO2024098941A1
Authority
WO
WIPO (PCT)
Prior art keywords
query
feature
task
tensor
query feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/CN2023/118093
Other languages
English (en)
French (fr)
Inventor
蒋博
陈少宇
廖本成
程天恒
陈嘉杰
周贺龙
张骞
黄畅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Horizon Information Technology Co Ltd
Original Assignee
Beijing Horizon Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Horizon Information Technology Co Ltd
Priority to EP23887635.3A (EP4610945A4)
Priority to JP2025526866A (JP2025539067A)
Publication of WO2024098941A1

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/50 Context or environment of the image
    • G06V20/56 Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V10/449 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V10/451 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V10/454 Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Definitions

  • the present disclosure relates to computer vision technology, and in particular to an image processing method, device, electronic device and storage medium.
  • Embodiments of the present disclosure provide an image processing method, an apparatus, an electronic device, and a storage medium.
  • an image processing method comprising: determining first image features corresponding to each of the perspectives based on an image to be processed corresponding to each of the perspectives in at least one perspective; determining a first bird's-eye view feature based on the first image features corresponding to each of the perspectives; determining at least one task query feature among a static element task query feature, a dynamic object task query feature and a motion trajectory task query feature based on the first bird's-eye view feature; and determining a task processing result corresponding to each of the task query features based on each of the at least one task query feature.
  • an image processing device comprising: a first processing module, for determining first image features corresponding to each of the at least one perspective based on an image to be processed corresponding to each of the perspectives; a second processing module, for determining first bird's-eye view features based on the first image features corresponding to each of the perspectives; a third processing module, for determining at least one task query feature among static element task query features, dynamic object task query features and motion trajectory task query features based on the first bird's-eye view features; and a fourth processing module, for determining task processing results corresponding to each of the task query features based on each of the at least one task query feature.
  • a computer-readable storage medium wherein the storage medium stores a computer program, and the computer program is used to execute the image processing method described in any of the above embodiments of the present disclosure.
  • an electronic device comprising: a processor; a memory for storing executable instructions of the processor; the processor is configured to read the executable instructions from the memory and execute the instructions to implement the image processing method described in any of the above embodiments of the present disclosure.
  • the image processing method, device, electronic device and storage medium provided by the above-mentioned embodiments of the present disclosure can determine the bird's-eye view features based on the image features corresponding to each perspective determined by the images to be processed from each perspective, determine at least one task query feature based on the bird's-eye view features, and then obtain the task processing results corresponding to each task based on the task query features.
  • End-to-end single-task or multi-task processing can be achieved based on multi-perspective environmental images, which can avoid or reduce dependence on high-precision maps, so that accurate surrounding environment information can be effectively obtained even in the absence of high-precision maps, which helps to improve versatility and reduce costs.
  • FIG. 1 is an exemplary application scenario of the image processing method provided by the present disclosure.
  • FIG. 2 is a schematic flow chart of an image processing method provided by an exemplary embodiment of the present disclosure.
  • FIG. 3 is a schematic flow chart of an image processing method provided by another exemplary embodiment of the present disclosure.
  • FIG. 4 is a flow chart of step 2031a provided by an exemplary embodiment of the present disclosure.
  • FIG. 5 is a schematic diagram of a network structure of a first decoding network provided by an exemplary embodiment of the present disclosure.
  • FIG. 6 is a flow chart of step 2031b provided by an exemplary embodiment of the present disclosure.
  • FIG. 7 is a flow chart of step 2031c provided by an exemplary embodiment of the present disclosure.
  • FIG. 8 is a schematic diagram of the structure of a third decoding network provided by an exemplary embodiment of the present disclosure.
  • FIG. 9 is a flow chart of an image processing method provided by yet another exemplary embodiment of the present disclosure.
  • FIG. 10 is a flow chart of step 301 provided by an exemplary embodiment of the present disclosure.
  • FIG. 11 is a schematic diagram of the principle of determining initial motion trajectory query features provided by an exemplary embodiment of the present disclosure.
  • FIG. 12 is a flow chart of step 2021 provided by an exemplary embodiment of the present disclosure.
  • FIG. 13 is a schematic diagram of a network structure of an encoder network provided by an exemplary embodiment of the present disclosure.
  • FIG. 14 is a schematic diagram of the overall structure of a network model for image processing provided by an exemplary embodiment of the present disclosure.
  • FIG. 15 is a schematic diagram of the structure of an image processing device provided by an exemplary embodiment of the present disclosure.
  • FIG. 16 is a schematic diagram of the structure of an image processing device provided by another exemplary embodiment of the present disclosure.
  • FIG. 17 is a schematic diagram of the structure of a third processing module 503 provided by an exemplary embodiment of the present disclosure.
  • FIG. 18 is a schematic diagram of the structure of an application embodiment of the electronic device disclosed herein.
  • the inventors discovered that in the field of autonomous driving, how to rely on multi-perspective environmental images to efficiently understand environmental information is an extremely important technical issue. If the understanding of the surrounding environmental information is achieved based on multi-perspective environmental images combined with high-precision maps, it is easy to lead to poor accuracy of environmental information obtained in the absence of high-precision maps.
  • FIG. 1 is an exemplary application scenario of the image processing method provided by the present disclosure.
  • an on-board surround view camera (which may include a camera with multiple viewing angles) may be used to collect images of the vehicle's surrounding environment as images to be processed corresponding to each viewing angle respectively.
  • the image processing device of the present invention may be used to execute the image processing method of the present invention. Based on the images to be processed corresponding to each viewing angle in at least one viewing angle respectively, first image features corresponding to each viewing angle respectively may be determined. Based on the first image features corresponding to each viewing angle respectively, a first bird's-eye view feature may be determined.
  • the first bird's-eye view feature is a feature in a grid coordinate system corresponding to a bird's-eye view (Bird's Eye View, abbreviated as: BEV).
  • Based on the first bird's-eye view feature, a static element task query feature, a dynamic object task query feature, and a motion trajectory task query feature may be determined. The corresponding task processing results (for example, but not limited to, task processing result 1, task processing result 2, and task processing result 3) are then determined from these features respectively: the static element task processing result is determined from the static element task query feature (for example, the static element detection result for the vehicle's surrounding environment in an autonomous driving scene), the dynamic object task processing result from the dynamic object task query feature (for example, a three-dimensional target detection result), and the motion trajectory task processing result from the motion trajectory task query feature (for example, the predicted motion trajectories of dynamic objects). This achieves end-to-end single-task or multi-task processing based on multi-view environmental images without the need to combine high-precision maps, which helps to avoid or reduce dependence on high-precision maps.
  • the image processing method disclosed in the present invention is not limited to the above-mentioned autonomous driving scenarios, and can be applied to any other possible scenarios according to actual needs, such as security monitoring scenarios in a certain area.
  • the bird's-eye view features of the area can be obtained through images collected by cameras from various perspectives, and end-to-end task processing of static elements, dynamic objects and/or motion trajectories of dynamic objects in the area can be achieved.
  • the specific scenarios can be set according to actual needs.
  • FIG. 2 is a flowchart of an image processing method provided by an exemplary embodiment of the present disclosure. This embodiment can be applied to an electronic device, such as a vehicle-mounted computing platform. As shown in FIG. 2, the method includes the following steps:
  • Step 201 determining first image features corresponding to each viewing angle respectively based on the to-be-processed images corresponding to each viewing angle respectively in at least one viewing angle.
  • the number of viewing angles can be set according to actual needs.
  • the number of viewing angles is the number of surround view cameras set on the vehicle, and each camera corresponds to a viewing angle.
  • a four-way surround view system consisting of a left front camera, a left rear camera, a right front camera, and a right rear camera includes four viewing angles, and there is no specific limitation.
  • the first image feature can be obtained by any feasible feature extraction method, such as extracting features from each image to be processed based on a pre-trained feature extraction network to obtain the first image features corresponding to each viewing angle.
  • the feature extraction network can be set according to actual needs.
  • a convolutional neural network can be used as a feature extraction network.
  • step 201 may be executed by the processor calling a corresponding instruction stored in a memory, or may be executed by a first processing module executed by the processor.
  • Step 202 Determine a first bird's-eye view feature based on the first image features corresponding to each viewing angle.
  • the first bird's-eye view feature is a BEV feature in a grid coordinate system corresponding to the bird's-eye view.
  • the first image features of each perspective can be encoded based on a pre-trained encoder network to obtain the first bird's-eye view feature.
  • the encoder network can be set according to actual needs.
  • step 202 may be executed by the processor calling corresponding instructions stored in the memory, or may be executed by a second processing module executed by the processor.
  • Step 203 Determine at least one task query feature of a static element task query feature, a dynamic object task query feature, and a motion track task query feature based on the first bird's-eye view feature.
  • the static element task query feature is a task query feature related to the static element extracted from the first bird's-eye view feature.
  • the dynamic object task query feature is a task query feature related to the dynamic object extracted from the first bird's-eye view feature
  • the motion trajectory task query feature is a task query feature related to the motion trajectory of the dynamic object extracted from the static element task query feature.
  • Which type or types of task query features are obtained can be set according to actual needs. For example, any one, any two, or all three can be obtained at the same time.
  • the first bird's-eye view feature can be decoded using a decoding network corresponding to the task obtained by pre-training. The specific decoding network can be set according to actual needs.
  • step 203 may be executed by the processor calling a corresponding instruction stored in a memory, or may be executed by a third processing module executed by the processor.
  • Step 204 determining task processing results corresponding to each task query feature based on each task query feature of the at least one task query feature.
  • a head network corresponding to each task can be set up and trained; the trained head network takes the task query feature corresponding to the task as input and outputs the task processing result corresponding to that task query feature.
  • the specific network structure of the head network can be set according to actual needs, for example, it can be implemented by a multilayer perceptron (MLP).
  • step 204 may be executed by the processor calling corresponding instructions stored in the memory, or may be executed by a fourth processing module executed by the processor.
  • the image processing method provided in this embodiment determines the bird's-eye view feature from the image features of each perspective, which are in turn determined from the image to be processed at each perspective. Based on the bird's-eye view feature, at least one task query feature is determined, and the task processing result corresponding to each task is then obtained from the task query features. End-to-end single-task or multi-task processing can thus be achieved by relying only on multi-view environmental images, without the need to combine them with high-precision maps. This avoids or reduces dependence on high-precision maps, so that accurate surrounding environment information can be obtained even in the absence of high-precision maps, which helps to improve versatility and reduce costs.
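The four steps above can be sketched as a toy data flow. This is only an illustration under stated assumptions, not the disclosed networks: the pooling-based `extract_image_features`, the averaging `encode_bev`, the dense-attention `decode_task_queries`, the two-layer `mlp_head`, and all shapes and weights are hypothetical stand-ins for the trained feature extraction network, encoder network, decoding network, and head network.

```python
import numpy as np

rng = np.random.default_rng(0)

def extract_image_features(image, dim=8):
    """Step 201 stand-in: average-pool the image into a 4x4 grid and
    project each cell to `dim` channels (a real system would use a CNN)."""
    h, w, c = image.shape
    cells = image.reshape(4, h // 4, 4, w // 4, c).mean(axis=(1, 3))  # (4, 4, c)
    proj = np.linspace(-1.0, 1.0, c * dim).reshape(c, dim)  # fixed toy weights
    return cells @ proj  # (4, 4, dim)

def encode_bev(view_features):
    """Step 202 stand-in: fuse per-view features into one BEV grid by
    averaging (the source uses a trained encoder network instead)."""
    return np.mean(np.stack(view_features), axis=0)  # (4, 4, dim)

def decode_task_queries(bev, init_queries):
    """Step 203 stand-in: every query attends over all BEV cells."""
    flat = bev.reshape(-1, bev.shape[-1])  # (16, dim)
    scores = init_queries @ flat.T / np.sqrt(flat.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return init_queries + weights @ flat  # (N, dim)

def mlp_head(queries, w1, w2):
    """Step 204 stand-in: two-layer perceptron with ReLU."""
    return np.maximum(queries @ w1, 0.0) @ w2

views = [rng.random((32, 32, 3)) for _ in range(4)]  # four camera views
bev = encode_bev([extract_image_features(v) for v in views])
static_queries = decode_task_queries(bev, rng.normal(size=(5, 8)))
static_result = mlp_head(static_queries,
                         rng.normal(size=(8, 16)),
                         rng.normal(size=(16, 3)))
print(bev.shape, static_queries.shape, static_result.shape)
```

The same `decode_task_queries` and `mlp_head` pattern would be repeated per task, each with its own initial queries and weights.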
  • FIG. 3 is a schematic flow chart of an image processing method provided by another exemplary embodiment of the present disclosure.
  • step 203 may specifically include the following steps:
  • Step 2031a based on the first bird's-eye view feature and the initial static element query feature, the static element task query feature is determined using the pre-trained first decoding network, where the initial static element query feature includes initial query features corresponding to each static element in at least one static element.
  • the initial static element query feature can be set according to actual needs, such as the initial static element query feature obtained by initialization based on the first initialization rule.
  • the first initialization rule can be set according to actual needs, such as randomly initializing N2 D-dimensional static queries (static query represents the initial query feature corresponding to the static element) as the initial static element query feature, N2 is the number of static queries, that is, the number of static elements, N2 can be set according to actual needs, and D represents the dimension of the initial query feature corresponding to each static element.
  • the first decoding network may include at least one decoder, which is used to query the task query feature related to the static element from the first bird's-eye view feature based on the initial static element query feature to obtain the static element task query feature.
  • the specific network structure of the first decoding network can be set according to actual needs.
  • step 2031a may be executed by the processor calling corresponding instructions stored in the memory, or may be executed by the first processing unit executed by the processor.
  • the disclosed embodiment utilizes the first decoding network obtained through training to obtain static element task query features related to static elements from the first bird's-eye view features based on the initial static element query features, thereby providing more accurate and effective feature data for the implementation of subsequent static element detection tasks, so as to realize end-to-end task processing based on multi-view images.
  • static map information can be generated online without relying on high-precision maps generated offline, thereby further improving versatility.
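A minimal sketch of the random initialization mentioned above, assuming a standard normal distribution; the helper name `init_queries` and the values of `N2` and `D` are illustrative choices, not values given in the source. In a trained model these queries would typically be learnable parameters rather than fixed samples.

```python
import numpy as np

def init_queries(num_queries, dim, seed=0):
    """Randomly initialise `num_queries` query vectors of size `dim`.
    The normal distribution is an assumption; the source only says 'random'."""
    rng = np.random.default_rng(seed)
    return rng.normal(loc=0.0, scale=1.0, size=(num_queries, dim))

N2, D = 50, 256  # hypothetical number of static queries and feature dimension
initial_static_queries = init_queries(N2, D)
print(initial_static_queries.shape)
```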
  • step 203 may specifically include the following steps:
  • Step 2031b based on the first bird's-eye view feature and the initial dynamic object query feature, a dynamic object task query feature is determined using a pre-trained second decoding network, where the initial dynamic object query feature includes initial query features corresponding to each dynamic object in at least one dynamic object.
  • the initial dynamic object query feature can be set according to actual needs, such as the initial dynamic object query feature obtained by initialization based on the second initialization rule.
  • the second initialization rule can be set according to actual needs, such as randomly initializing N1 D-dimensional dynamic queries (dynamic query represents the initial query feature corresponding to the dynamic object) as the initial dynamic object query feature, N1 is the number of dynamic queries, that is, the number of dynamic objects, N1 can be set according to actual needs, and D represents the dimension of the initial query feature corresponding to each dynamic object.
  • the second decoding network may include at least one decoder for querying task query features related to dynamic objects from the first bird's-eye view feature based on the initial dynamic object query feature to obtain dynamic object task query features.
  • the specific network structure of the second decoding network can be set according to actual needs.
  • step 2031b may be executed by the processor calling corresponding instructions stored in the memory, or may be executed by a second processing unit executed by the processor.
  • the disclosed embodiment utilizes the second decoding network obtained through training to obtain dynamic object task query features related to dynamic objects from the first bird's-eye view features based on the initial dynamic object query features, thereby providing more accurate and effective feature data for the implementation of subsequent dynamic object detection tasks, so as to realize end-to-end three-dimensional target detection task processing based on multi-view images, avoid tracking of dynamic objects, thereby helping to reduce the computational complexity of the network model, and at the same time avoid the impact of target tracking errors on subsequent applications.
  • step 203 may specifically include the following steps:
  • Step 2031c based on the static element task query feature and the initial motion trajectory query feature, the motion trajectory task query feature is determined using a pre-trained third decoding network, where the initial motion trajectory query feature includes initial trajectory query features corresponding to each dynamic object in at least one dynamic object.
  • the initial motion trajectory query feature needs to be determined in combination with the dynamic object task query feature and the modal query feature.
  • the modal query feature is used to characterize the motion trend of the dynamic object. Different modalities focus on different future motion types (such as fast straight driving, slow straight driving, left turn, right turn, etc.).
  • the dynamic object task query feature is used to characterize the features related to the dynamic object. By combining the modal query feature and the dynamic object task query feature, the initial trajectory query feature related to the motion trajectory of the dynamic object can be determined, and then interacted with the static element task query feature in the third decoding network to decode the motion trajectory task query feature.
  • the third decoding network can be set according to actual needs.
  • step 2031c may be executed by the processor calling corresponding instructions stored in the memory, or may be executed by a third processing unit executed by the processor.
  • the disclosed embodiment utilizes the third decoding network obtained through training to obtain motion trajectory task query features related to the motion trajectory of dynamic objects from static element task query features based on initial motion trajectory query features, thereby providing more accurate and effective feature data for the implementation of subsequent motion trajectory prediction tasks of dynamic objects, so as to realize end-to-end motion trajectory prediction task processing based on multi-view images.
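The relationship between modal queries, dynamic object task queries, and static element task queries can be sketched as follows. The passage above does not state how the modal and dynamic object features are combined, so broadcast addition is used here purely as an assumption, and a single plain cross-attention step stands in for the third decoding network. All names and sizes (`N1`, `N2`, `M`, `D`) are hypothetical.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
N1, N2, M, D = 3, 5, 4, 8  # dynamic objects, static elements, modalities, dim
dynamic_task_queries = rng.normal(size=(N1, D))
static_task_queries = rng.normal(size=(N2, D))
modal_queries = rng.normal(size=(M, D))  # e.g. straight, left turn, right turn

# Assumed combination rule: one trajectory query per (object, modality) pair,
# formed by broadcast addition.
initial_traj_queries = (dynamic_task_queries[:, None, :]
                        + modal_queries[None, :, :])  # (N1, M, D)

# Stand-in for the third decoding network: trajectory queries attend to the
# static element task queries, with a residual connection.
q = initial_traj_queries.reshape(N1 * M, D)
attn = softmax(q @ static_task_queries.T / np.sqrt(D))  # (N1*M, N2)
traj_task_queries = (q + attn @ static_task_queries).reshape(N1, M, D)
print(traj_task_queries.shape)
```

Keeping a separate modality axis means a downstream head could predict one candidate trajectory per modality for each dynamic object.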
  • step 203 may include at least two of the above steps 2031a-2031c, which can be set according to actual needs, so that end-to-end multi-task processing can be achieved relying only on multi-view images, and task processing results corresponding to multiple tasks can be obtained at the same time, which can avoid dependence on high-precision maps and lidar, and help to further improve versatility.
  • FIG. 4 is a flowchart of step 2031a provided by an exemplary embodiment of the present disclosure.
  • step 2031a determines the static element task query feature based on the first bird's-eye view feature and the initial static element query feature using a pre-trained first decoding network, including:
  • Step 20311a based on the initial static element query feature, determine the first query tensor, the first key tensor and the first value tensor.
  • the initial static element query feature can be mapped to the first query tensor based on the first query mapping rule, such as mapping based on the first query mapping matrix.
  • the initial static element query feature can be mapped to the first key tensor based on the first key mapping rule, and the initial static element query feature can be mapped to the first value tensor based on the first value mapping rule.
  • the specific mapping principle will not be repeated here.
  • Step 20312a based on the first query tensor, the first key tensor and the first value tensor, using the first self-attention network of the first decoder in the first decoding network, determine the first self-attention result.
  • the first self-attention network is a network based on the self-attention mechanism, which can be set according to actual needs to complete the self-attention operation of the first query tensor, the first key tensor and the first value tensor. Specifically, a self-attention operation is performed based on the first query tensor and the first key tensor to obtain a first weight, and then a weighted sum is performed on the first value tensor based on the first weight to obtain a first self-attention result. The principle of the specific self-attention mechanism will not be repeated.
  • Step 20313a based on the first self-attention result and the initial static element query feature, determine the first intermediate result using the first additive normalization network of the first decoder in the first decoding network.
  • the first additive normalization network has two functions of addition and normalization.
  • the addition is to add the first self-attention result to the initial static element query feature to obtain a first addition result, and then normalize the first addition result to obtain a first intermediate result.
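Steps 20311a to 20313a can be sketched as below. The sketch assumes single-head scaled dot-product attention and layer normalization without learned scale or shift; the source names neither, and the mapping matrices `W_q`, `W_k`, `W_v` are random placeholders for trained weights.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    """Normalise each query vector to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
N2, D = 5, 8  # hypothetical number of static queries and feature dimension
queries = rng.normal(size=(N2, D))  # initial static element query feature
W_q, W_k, W_v = (rng.normal(size=(D, D)) / np.sqrt(D) for _ in range(3))

# Step 20311a: map the queries to the first query, key and value tensors.
Q1, K1, V1 = queries @ W_q, queries @ W_k, queries @ W_v

# Step 20312a: attention weights from Q1 and K1, then a weighted sum of V1.
weights = softmax(Q1 @ K1.T / np.sqrt(D))  # (N2, N2), rows sum to 1
self_attn = weights @ V1

# Step 20313a: add the residual input, then normalise.
intermediate = layer_norm(self_attn + queries)
print(intermediate.shape)
```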
  • Step 20314a determine the second query tensor based on the first intermediate result.
  • the first intermediate result can be used as the second query tensor, or the first intermediate result can be mapped to the second query tensor based on the corresponding mapping rule, which can be specifically set according to actual needs.
  • Step 20315a based on the first bird's-eye view feature, determine a second key tensor and a second value tensor.
  • the determination principle of the second key tensor and the second value tensor can be found in the above content and will not be repeated here.
  • Step 20316a based on the second query tensor, the second key tensor and the second value tensor, using the first deformable cross attention network of the first decoder in the first decoding network, determine a first cross attention result.
  • the first deformable cross-attention network is a cross-attention network based on deformable convolution, and its function is to extract features in the local area near the position corresponding to the initial static element query feature from the first bird's-eye view feature, further improving the accuracy and effectiveness of the extracted features.
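One common way to realise this kind of local, sparse attention is to let each query predict a few sampling offsets around a reference point and a weight per sample, then bilinearly interpolate the BEV value map at those points. The sketch below follows that pattern for a single query and a single head. It is an assumption about the mechanism, not the disclosed network: the reference point, the projection matrices `W_off` and `W_attn`, and the number of samples `K` are hypothetical, and no explicit key tensor appears in this simplified form.

```python
import numpy as np

def bilinear_sample(feature_map, x, y):
    """Bilinearly interpolate feature_map (H, W, D) at continuous (x, y)."""
    H, W, _ = feature_map.shape
    x = float(np.clip(x, 0, W - 1))
    y = float(np.clip(y, 0, H - 1))
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, W - 1), min(y0 + 1, H - 1)
    dx, dy = x - x0, y - y0
    top = (1 - dx) * feature_map[y0, x0] + dx * feature_map[y0, x1]
    bottom = (1 - dx) * feature_map[y1, x0] + dx * feature_map[y1, x1]
    return (1 - dy) * top + dy * bottom

def deformable_cross_attention(query, ref_point, value_map, W_off, W_attn):
    """Sample K points near ref_point and combine them with softmax weights."""
    K = W_attn.shape[1]
    offsets = (query @ W_off).reshape(K, 2)  # predicted (dx, dy) per sample
    logits = query @ W_attn                  # one logit per sample
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    samples = np.stack([
        bilinear_sample(value_map, ref_point[0] + ox, ref_point[1] + oy)
        for ox, oy in offsets
    ])                                       # (K, D)
    return weights @ samples                 # (D,)

rng = np.random.default_rng(0)
H, W, D, K = 6, 6, 8, 4
bev_values = rng.normal(size=(H, W, D))  # stand-in for the second value tensor
query = rng.normal(size=D)               # one row of the second query tensor
W_off = rng.normal(size=(D, 2 * K)) * 0.5
W_attn = rng.normal(size=(D, K))
out = deformable_cross_attention(query, (2.5, 3.0), bev_values, W_off, W_attn)
print(out.shape)
```

Because each query touches only `K` locations instead of all `H * W` cells, the cost per query does not grow with the size of the BEV grid.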
  • Step 20317a determining static element task query features based on the first cross-attention result and the first intermediate result.
  • the first decoder in the first decoding network may also include other related networks after the first deformable cross-attention network, such as an addition normalization network (Add&Norm), a feedforward network (Feed Forward), etc. Therefore, after obtaining the first cross-attention result, the first cross-attention result and the first intermediate result can be added and normalized, and then passed through other related networks to finally obtain the decoding result of the first decoder.
  • the decoding result of the first decoder can also be used as the input of the second decoder, and decoding is performed according to the decoding process of the first decoder, and so on, until the decoding of all decoders is completed, and the final decoding result of the first decoding network is obtained, and the final decoding result is used as the static element task query feature.
  • FIG. 5 is a schematic diagram of the network structure of the first decoding network provided by an exemplary embodiment of the present disclosure.
  • the first decoding network includes 6 decoders.
  • ×6 indicates that the first decoding network includes 6 decoders, each with the structure shown in the dashed box; the decoder in the dashed box takes the first decoder as an example. Q1, K1, and V1 respectively represent the first query tensor, the first key tensor, and the first value tensor. Self Attention represents the first self-attention network. Add&Norm represents an additive normalization network, and the Add&Norm connected to the first self-attention network represents the first additive normalization network. Q2, K2, and V2 respectively represent the second query tensor, the second key tensor, and the second value tensor. Deformable Cross Attention represents the first deformable cross-attention network, and Feed Forward represents the feedforward network.
  • the initial static element query feature is mapped to the first query tensor, the first key tensor and the first value tensor, and then self-interacts in the first self-attention network to obtain a first self-attention result.
  • the first self-attention result is added to the initial static element query feature and normalized to obtain a first intermediate result.
  • the first intermediate result is mapped to the second query tensor.
  • the first bird's-eye view feature is mapped to the second key tensor and the second value tensor.
  • the second query tensor, the second key tensor and the second value tensor are cross-attended in the first deformable cross-attention network to realize the interaction between the first bird's-eye view feature and the initial static element query feature to obtain a first cross-attention result.
  • the first cross-attention result is added to the first intermediate result and normalized to obtain a first normalized result.
  • the first normalized result is passed through a feedforward network and another addition normalization network to obtain the decoding result of the first decoder.
  • The decoding result of the first decoder is then decoded in turn by the remaining 5 decoders to obtain the static element task query feature.
  • the first self-attention network can be a multi-head self-attention network
  • the first deformable cross-attention network can be a multi-head deformable cross-attention network, which can be set according to actual needs.
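  • The data flow through one decoder of FIG. 5 and the stacking of 6 decoders can be sketched as follows. This is only an illustrative sketch under simplifying assumptions: the learned mappings to Q, K and V are replaced by identity mappings, ordinary scaled dot-product attention stands in for both the multi-head self-attention network and the deformable cross-attention network, and the feedforward network is reduced to a ReLU. All function names are hypothetical and do not come from the disclosure.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one feature vector to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def add_norm(residual, update):
    """Add&Norm: residual addition followed by layer normalization."""
    return [layer_norm([r + u for r, u in zip(rv, uv)])
            for rv, uv in zip(residual, update)]

def attention(queries, keys, values):
    """Plain scaled dot-product attention over lists of vectors.

    Assumption: used here as a stand-in for the (deformable) attention
    networks; it attends to all keys instead of a sparse set of samples.
    """
    scale = math.sqrt(len(queries[0]))
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w / total * v[d] for w, v in zip(weights, values))
                    for d in range(len(values[0]))])
    return out

def feed_forward(x):
    """Placeholder feedforward network: element-wise ReLU, no weights."""
    return [[max(0.0, v) for v in row] for row in x]

def decoder_layer(queries, bev):
    # Self-attention: Q1, K1 and V1 all come from the query features.
    self_out = attention(queries, queries, queries)
    intermediate = add_norm(queries, self_out)      # first intermediate result
    # Cross-attention: Q2 from the intermediate result, K2/V2 from the BEV feature.
    cross_out = attention(intermediate, bev, bev)
    normalized = add_norm(intermediate, cross_out)  # first normalized result
    return add_norm(normalized, feed_forward(normalized))

def decode(queries, bev, num_layers=6):
    """Feed each decoder's result into the next decoder."""
    for _ in range(num_layers):
        queries = decoder_layer(queries, bev)
    return queries
```

  • Calling decode(queries, bev) with N query vectors and any number of bird's-eye view vectors of the same dimension returns N updated query vectors, which play the role of the task query feature in this sketch.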
  • steps 20311a to 20317a can be executed by the processor calling the corresponding instructions stored in the memory, or can be executed by the first processing unit executed by the processor.
  • the disclosed embodiment realizes self-interaction of static element query features through the self-attention network in the first decoding network, which can capture the internal correlation of static element query features, and then interacts with the bird's-eye view features in the deformable cross-attention network to realize sparse attention on the data, and flexibly captures the features of related local areas, which helps to reduce the amount of calculation while ensuring the acquisition of accurate and effective related features, thereby improving the network reasoning speed.
  • FIG. 6 is a flowchart of step 2031b provided by an exemplary embodiment of the present disclosure.
  • step 2031b determines the dynamic object task query feature based on the first bird's-eye view feature and the initial dynamic object query feature using a pre-trained second decoding network, including:
  • Step 20311b based on the initial dynamic object query features, determine a third query tensor, a third key tensor, and a third value tensor.
  • Step 20312b based on the third query tensor, the third key tensor and the third value tensor, using the second self-attention network of the first decoder in the second decoding network, determine a second self-attention result.
  • Step 20313b based on the second self-attention result and the initial dynamic object query feature, determine the second intermediate result using the second additive normalization network of the first decoder in the second decoding network.
  • Step 20314b determine the fourth query tensor based on the second intermediate result.
  • Step 20315b determine a fourth key tensor and a fourth value tensor based on the first bird's-eye view feature.
  • Step 20316b based on the fourth query tensor, the fourth key tensor and the fourth value tensor, using the second deformable cross attention network of the first decoder in the second decoding network, determine a second cross attention result.
  • Step 20317b determine the dynamic object task query feature based on the second cross-attention result and the second intermediate result.
  • The specific operation of steps 20311b to 20317b is the same as or similar to that of the aforementioned steps 20311a to 20317a, except that step 20311b is based on the initial dynamic object query feature rather than the initial static element query feature used in step 20311a, and will not be described in detail here.
  • the network structure of the second decoding network is the same or similar to the first decoding network, and will not be described in detail here.
  • steps 20311b to 20317b can be executed by the processor calling the corresponding instructions stored in the memory, or can be executed by a second processing unit executed by the processor.
  • FIG. 7 is a flowchart of step 2031c provided by an exemplary embodiment of the present disclosure.
  • step 2031c determines the motion trajectory task query feature based on the static element task query feature and the initial motion trajectory query feature using a pre-trained third decoding network, including:
  • Step 20311c based on the initial motion trajectory query feature, determine the fifth query tensor, the fifth key tensor and the fifth value tensor.
  • Step 20312c based on the fifth query tensor, the fifth key tensor and the fifth value tensor, using the third self-attention network of the first decoder in the third decoding network, determine a third self-attention result.
  • Step 20313c based on the third self-attention result and the initial motion trajectory query feature, use the third additive normalization network of the first decoder in the third decoding network to determine the third intermediate result.
  • Step 20314c determine the sixth query tensor based on the third intermediate result.
  • step 20311c to step 20314c are the same or similar to the aforementioned steps 20311a to 20314a, and will not be repeated here.
  • Step 20315c based on the static element task query feature, determine the sixth key tensor and the sixth value tensor.
  • The sixth key tensor and the sixth value tensor of this step are obtained by mapping the static element task query feature obtained in the above example. For the mapping principle, see the above content.
  • Step 20316c based on the sixth query tensor, the sixth key tensor and the sixth value tensor, using the first cross-attention network of the first decoder in the third decoding network, determine the third cross-attention result.
  • the first cross-attention network can adopt any feasible cross-attention network, which can be set according to actual needs.
  • the first cross-attention network can adopt the cross-attention network structure in a conventional visual Transformer, without specific limitation.
  • Step 20317c based on the third cross-attention result and the third intermediate result, determine the motion trajectory task query feature.
  • The specific operating principle of this step can be found in the aforementioned step 20317a, and will not be repeated here.
  • FIG8 is a schematic diagram of the structure of the third decoding network provided by an exemplary embodiment of the present disclosure.
  • the initial motion trajectory query feature includes multiple trajectory queries (the trajectory query represents the initial trajectory query feature corresponding to the dynamic object), Q5, K5, and V5 represent the fifth query tensor, the fifth key tensor, and the fifth value tensor, respectively, and Q6, K6, and V6 represent the sixth query tensor, the sixth key tensor, and the sixth value tensor, respectively.
  • the meaning of other symbols and the reasoning process refer to the aforementioned content and will not be repeated here.
  • steps 20311c to 20317c can be executed by the processor calling the corresponding instructions stored in the memory, or can be executed by a third processing unit executed by the processor.
  • the disclosed embodiment realizes self-interaction of motion trajectory query features through the self-attention network in the third decoding network, which can capture the internal correlation of the motion trajectory query features, and then interact with the static element task query features in the cross-attention network to achieve effective capture of motion trajectory related features of dynamic objects, thereby implicitly understanding the surrounding static information (such as surrounding road information), which helps to provide accurate and effective feature data for accurately predicting a more reasonable future motion trajectory of dynamic objects, and realize end-to-end motion trajectory prediction based on multi-view image features.
  • FIG. 9 is a schematic flow chart of an image processing method provided by yet another exemplary embodiment of the present disclosure.
  • the method before determining the motion trajectory task query feature based on the static element task query feature and the initial motion trajectory query feature using the pre-trained third decoding network in step 2031c, the method further includes:
  • Step 301, based on the dynamic object task query feature and the modal query feature, determine the initial motion trajectory query feature. The modal query feature includes the first modal query feature corresponding to each modality of at least one modality, and the first modal query feature corresponding to a modality is used to characterize one motion trend of the dynamic object.
  • the modal query feature can be set according to actual needs, such as being initialized through the third initialization rule. Since the modal query feature can characterize the motion trend of the dynamic object, combined with the dynamic object task query feature that characterizes the position of the dynamic object, the initial motion trajectory query feature can be determined.
  • N3 D-dimensional modal queries (modal query represents the first modal query feature corresponding to the modality) can be randomly initialized as modal query features. N3 can be set according to actual needs.
  • The modal query feature is fused with the dynamic object task query feature to form N1×N3 D-dimensional trajectory queries as the initial motion trajectory query feature, that is, each dynamic object has N3 modalities, characterizing its N3 motion trends.
  • step 301 may be executed by the processor calling corresponding instructions stored in the memory, or may be executed by a fourth processing unit executed by the processor.
  • the disclosed embodiment obtains the initial motion trajectory query feature based on the fusion of the updated dynamic object task query feature and the modal query feature, so that the initial motion trajectory query feature can include the task query features corresponding to each dynamic object and the query features of multiple modalities of each dynamic object.
  • Different modalities can focus on different future motion types (such as fast straight driving, slow straight driving, left turn, right turn, etc.), which helps to provide effective data support for the subsequent decoding of the motion trajectory task query feature through the third decoding network.
  • FIG. 10 is a flowchart of step 301 provided by an exemplary embodiment of the present disclosure.
  • the dynamic object task query feature includes at least one dynamic object task query feature; and determining the initial motion trajectory query feature based on the dynamic object task query feature and the modal query feature in step 301 includes:
  • Step 3011, for the task query feature corresponding to each dynamic object, determine a first number of the task query features based on that task query feature, where the first number is the number of first modal query features included in the modal query feature.
  • each dynamic object can be assigned a first number of first modal query features to characterize the first number of movement trends of the dynamic object. Therefore, in order to be able to merge the task query features of the object with the modal query features, the number of task query features of the dynamic object can be transformed to be the same as the number of modal query features. Therefore, based on the task query features of the dynamic object, the first number of task query features can be determined.
  • the modal query feature includes N3 D-dimensional modal queries.
  • For the task query feature (which may be referred to as a task query) of each dynamic object, the task query feature may be copied N3 times to obtain N3 identical D-dimensional task query features.
  • step 3011 may be executed by the processor calling corresponding instructions stored in the memory, or may be executed by a fourth processing unit executed by the processor.
  • Step 3012, add the first number of task query features respectively to the first modal query features corresponding to each modality in the modal query feature, to obtain the initial trajectory query features corresponding to the dynamic object.
  • N3 identical task query features may be added to N3 first modal query features (modal query) to form N3 motion trajectory query features.
  • step 3012 may be executed by the processor calling corresponding instructions stored in the memory, or may be executed by a fourth processing unit executed by the processor.
  • Step 3013 determining initial motion trajectory query features based on the initial trajectory query features corresponding to each dynamic object.
  • Figure 11 is a schematic diagram of the determination principle of the initial motion trajectory query feature provided by an exemplary embodiment of the present disclosure.
  • the number of dynamic objects is N1
  • the task query represents the task query feature corresponding to the dynamic object
  • the number of the first modal query feature (modal query) of each dynamic object is N3
  • the finally obtained initial motion trajectory query feature includes N1×N3 D-dimensional initial trajectory query features.
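  • The fusion of steps 3011 to 3013 can be sketched as follows. This is a minimal sketch that assumes each query is a plain list of D numbers; the function name build_trajectory_queries is hypothetical. Copying a task query N3 times and adding one modal query to each copy is written here as a direct element-wise sum.

```python
def build_trajectory_queries(task_queries, modal_queries):
    """Return N1 * N3 initial trajectory queries, grouped per dynamic object.

    task_queries:  N1 task query features, one per dynamic object.
    modal_queries: N3 first modal query features, one per modality.
    """
    trajectory_queries = []
    for task_query in task_queries:          # N1 dynamic objects
        for modal_query in modal_queries:    # N3 modalities per object
            # One copy of the task query plus one modal query.
            trajectory_queries.append(
                [t + m for t, m in zip(task_query, modal_query)])
    return trajectory_queries
```

  • With N1 = 2 task queries and N3 = 3 modal queries, the function returns 6 trajectory queries, the first 3 of which belong to the first dynamic object.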
  • step 3013 may be executed by the processor calling corresponding instructions stored in the memory, or may be executed by a fourth processing unit executed by the processor.
  • the disclosed embodiment obtains an initial motion trajectory query feature by fusing the dynamic object task query feature with the modal query feature, so that the initial motion trajectory query feature includes dynamic object task query-related features and information related to the motion trend of the dynamic object. Furthermore, the third decoding network obtained through training can obtain accurate and effective motion trajectory task query features.
  • determining the first bird's-eye view feature based on the first image features corresponding to each viewing angle in step 202 includes:
  • Step 2021 determining the first bird's-eye view feature based on the first image features corresponding to each viewing angle, the initial bird's-eye view query feature, and the bird's-eye view feature of the previous frame obtained before the first bird's-eye view feature.
  • the initial bird's-eye view query feature is a feature initialized under the bird's-eye view, and its size can represent the size of the bird's-eye view.
  • the specific initialization rules can be set according to actual needs. For example, based on the required bird's-eye view size, H×W D-dimensional bird's-eye view queries are initialized as the initial bird's-eye view query features, where H and W are the height and width of the bird's-eye view, respectively, and D is the feature dimension of each bird's-eye view query.
  • Each bird's-eye view query can correspond to a set of three-dimensional coordinates in a physical space, such as three-dimensional coordinates in a world coordinate system with the vehicle as the origin, which can be set according to actual needs.
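  • One possible way to initialize such a grid of bird's-eye view queries is sketched below. The sketch assumes random Gaussian initialization, a square cell of side cell_size metres, and the vehicle located at the centre of the grid; these choices, and the name init_bev_queries, are illustrative assumptions rather than details given in the disclosure.

```python
import random

def init_bev_queries(height, width, dim, cell_size=0.5, seed=0):
    """Create H*W D-dimensional queries and the coordinates they stand for.

    Returns (queries, centers): queries[i] is a D-dimensional vector and
    centers[i] is the (x, y, z) position of that grid cell in a frame whose
    origin is the vehicle (assumed to sit at the grid centre, z = 0).
    """
    rng = random.Random(seed)
    queries, centers = [], []
    for row in range(height):
        for col in range(width):
            queries.append([rng.gauss(0.0, 1.0) for _ in range(dim)])
            x = (col - (width - 1) / 2) * cell_size
            y = (row - (height - 1) / 2) * cell_size
            centers.append((x, y, 0.0))
    return queries, centers
```

  • For example, init_bev_queries(2, 4, 3) yields 8 queries of dimension 3 whose cell centres are arranged symmetrically around the vehicle.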
  • The bird's-eye view feature of the previous frame is the bird's-eye view feature obtained when the images of the previous frame were processed, and it is obtained in the same way as the first bird's-eye view feature.
  • step 2021 may be executed by the processor calling a corresponding instruction stored in a memory, or may be executed by a first determination unit executed by the processor.
  • Because the bird's-eye view features of the previous frame include relevant historical information, such as the static elements and dynamic objects in the images captured in the previous frame, combining the bird's-eye view features of the previous frame with the initial bird's-eye view query features and the first image features makes it possible not only to determine the relevant features of static elements and dynamic objects in the current image to be processed, but also to capture the changes of dynamic objects relative to the previous frame, thereby facilitating the tracking of dynamic objects.
  • FIG. 12 is a flowchart of step 2021 provided by an exemplary embodiment of the present disclosure.
  • determining the first bird's-eye view feature based on the first image features corresponding to each perspective, the initial bird's-eye view query feature, and the bird's-eye view feature of the previous frame obtained before the first bird's-eye view feature in step 2021 includes:
  • Step 20211 based on the bird's-eye view features of the previous frame and the initial bird's-eye view query features, the temporal self-attention network of the first encoder in the pre-trained encoder network is used to determine the temporal self-attention result.
  • the encoder network is used to encode the first image features corresponding to each extracted perspective based on the bird's-eye view features of the previous frame and the initial bird's-eye view query features to obtain the first bird's-eye view features under the bird's-eye view.
  • the encoder network may include one or more encoders, each of which may include a temporal self-attention network.
  • the temporal self-attention network is used for the initial bird's-eye view query feature to find the corresponding position in the historical bird's-eye view of the previous frame according to the movement of the vehicle, as a reference position, for subsequent extraction of features corresponding to the reference position area in the first image feature.
  • Step 20212 based on the temporal self-attention results and the initial bird's-eye view query features, determine the fourth intermediate result using the fourth additive normalization network of the first encoder.
  • Step 20213 based on the first image features corresponding to each perspective and the fourth intermediate result, the spatial cross-attention network in the first encoder is used to determine the spatial cross-attention result.
  • the spatial cross-attention network is used to uniformly sample the fourth intermediate result in height to obtain a set of three-dimensional coordinates, and then map the three-dimensional coordinates to the corresponding positions in the first image features corresponding to each perspective according to the camera's internal and external parameters. Then, the features of the corresponding positions in the first image features can be extracted based on deformable convolution, and after subsequent processing, the first bird's-eye view features of the current frame are obtained.
  • Step 20214 determine the first bird's-eye view feature based on the spatial cross-attention result and the fourth intermediate result.
  • In addition to the temporal self-attention network, the fourth additive normalization network and the spatial cross-attention network, the first encoder may also include some other related networks, such as an additive normalization network, a feedforward network, and another additive normalization network after the spatial cross-attention network, which can be set according to actual needs. Therefore, after the spatial cross-attention result is obtained, it can be added to the fourth intermediate result and normalized, and then passed through the other related networks to obtain the encoding result of the first encoder. If the encoder network includes multiple encoders, the encoding result output by the first encoder can be further encoded by the subsequent encoders to finally obtain the first bird's-eye view feature.
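  • The height sampling and camera projection described for the spatial cross-attention network can be sketched as follows. The sketch assumes a pinhole camera model with a 3×4 extrinsic matrix and a 3×3 intrinsic matrix per camera, and a height range of -1 m to 3 m; the function names and default values are hypothetical. The subsequent feature extraction at the projected positions is not shown.

```python
def sample_pillar(x, y, z_min=-1.0, z_max=3.0, num_points=4):
    """Uniformly sample heights above one bird's-eye view cell.

    Assumes num_points >= 2; the height range is an illustrative default.
    """
    step = (z_max - z_min) / (num_points - 1)
    return [(x, y, z_min + i * step) for i in range(num_points)]

def project_point(point, extrinsic, intrinsic):
    """Project a 3D point in the vehicle frame to pixel coordinates.

    extrinsic: 3x4 matrix [R | t] mapping the vehicle frame to the camera frame.
    intrinsic: 3x3 pinhole camera matrix.
    Returns (u, v), or None if the point lies behind the camera.
    """
    homogeneous = list(point) + [1.0]
    cam = [sum(r * p for r, p in zip(row, homogeneous)) for row in extrinsic]
    if cam[2] <= 0:
        return None
    pix = [sum(k * c for k, c in zip(row, cam)) for row in intrinsic]
    return pix[0] / cam[2], pix[1] / cam[2]

def reference_points(x, y, cameras, image_size):
    """For one cell, collect the projections that land inside each image.

    cameras: dict mapping a camera name to (extrinsic, intrinsic).
    image_size: (width, height) in pixels, assumed equal for all cameras.
    """
    width, height = image_size
    hits = {}
    for name, (extrinsic, intrinsic) in cameras.items():
        for point in sample_pillar(x, y):
            uv = project_point(point, extrinsic, intrinsic)
            if uv is not None and 0 <= uv[0] < width and 0 <= uv[1] < height:
                hits.setdefault(name, []).append(uv)
    return hits
```

  • Only the views in which at least one sampled point falls inside the image contribute to a given bird's-eye view cell, which is what keeps the attention sparse in this sketch.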
  • Figure 13 is a schematic diagram of the network structure of an encoder network provided by an exemplary embodiment of the present disclosure.
  • BEV B(t-1) represents the bird's-eye view feature of the previous frame
  • BEV queries Q represents the initial bird's-eye view query feature
  • Temporal Self-Attention represents the temporal self-attention network
  • Spatial Cross-Attention represents the spatial cross-attention network
  • the meanings of other symbols refer to the above content.
  • the previous bird's-eye view feature BEV B(t-1) interacts with the initial bird's-eye view query feature BEV queries Q in the temporal self-attention network.
  • the corresponding position of the initial bird's-eye view query feature in the previous frame bird's-eye view is found as the reference position to obtain the temporal self-attention result.
  • the temporal self-attention result is added and normalized with the initial bird's-eye view query feature to obtain the fourth intermediate result.
  • the fourth intermediate result and the first image features of each perspective are subjected to a spatial cross-attention operation in the spatial cross-attention network.
  • the spatial cross-attention network is based on deformable convolution to extract the features of the corresponding position from the first image features based on the reference position to obtain the spatial cross-attention result.
  • the spatial cross-attention result is added and normalized with the fourth intermediate result to obtain the fifth intermediate result.
  • The fifth intermediate result is passed through the feedforward network (Feed Forward) and the addition normalization network (Add&Norm) to obtain the encoding result of the first encoder.
  • the encoding result of the first encoder is then encoded by the subsequent encoder to obtain the final encoding result, which is the first bird's-eye view feature.
  • the spatial position encoding can also be embedded (Embedding) for each first image feature, which is not limited in the present disclosure.
  • the above steps 20211 to 20214 can be executed by the processor calling the corresponding instructions stored in the memory, or can be executed by the first determination unit executed by the processor.
  • the disclosed embodiment achieves position matching of the initial bird's-eye view query features of the current frame with the bird's-eye view of the previous frame through the temporal self-attention network in the encoder network, establishes the temporal correlation between the current frame and the previous frame, and then extracts local features near the reference position from the first image features based on the spatial cross-attention network, which helps to reduce the amount of calculation on the basis of extracting effective features, thereby improving image processing efficiency.
  • determining the task processing results corresponding to each task query feature based on each task query feature of the at least one task query feature in step 204 includes:
  • Step 2041a based on the static element task query feature, using the pre-trained static element detection head network, determine the static element detection result.
  • the static element detection head network can adopt any implementable head network, such as a head network based on a multi-layer perceptron.
  • step 2041a may be executed by the processor calling a corresponding instruction stored in a memory, or may be executed by a second determination unit executed by the processor.
  • Step 2041b based on the dynamic object task query feature, using the pre-trained dynamic object detection head network, determine the dynamic object detection result.
  • the dynamic object detection head network may adopt any implementable head network, such as a head network based on a multi-layer perceptron.
  • step 2041b may be executed by the processor calling a corresponding instruction stored in a memory, or may be executed by a third determination unit executed by the processor.
  • Step 2041c based on the motion trajectory task query features, using the pre-trained motion trajectory prediction head network, determine the motion trajectory prediction result.
  • the motion trajectory prediction head network can adopt any feasible head network, such as a head network based on a multi-layer perceptron.
  • step 2041c may be executed by the processor calling a corresponding instruction stored in a memory, or may be executed by a fourth determining unit executed by the processor.
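  • A head network based on a multi-layer perceptron, as mentioned for steps 2041a to 2041c, can be sketched as follows. This is only a sketch: the class name MLPHead, the single hidden layer, and the randomly initialized weights (standing in for pre-trained parameters) are assumptions, and the meaning of the output vector (for example class scores, box parameters, or trajectory points) depends on the task.

```python
import random

class MLPHead:
    """Two-layer perceptron mapping one task query feature to an output vector."""

    def __init__(self, in_dim, hidden_dim, out_dim, seed=0):
        rng = random.Random(seed)
        # Random placeholder weights; a real head would load trained values.
        self.w1 = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
                   for _ in range(hidden_dim)]
        self.b1 = [0.0] * hidden_dim
        self.w2 = [[rng.uniform(-0.1, 0.1) for _ in range(hidden_dim)]
                   for _ in range(out_dim)]
        self.b2 = [0.0] * out_dim

    def __call__(self, query):
        # Hidden layer with ReLU activation.
        hidden = [max(0.0, sum(w * x for w, x in zip(row, query)) + b)
                  for row, b in zip(self.w1, self.b1)]
        # Linear output layer.
        return [sum(w * h for w, h in zip(row, hidden)) + b
                for row, b in zip(self.w2, self.b2)]

def run_head(head, task_queries):
    """Apply the same head independently to every task query feature."""
    return [head(query) for query in task_queries]
```

  • Because the head is applied to each task query feature independently, the number of outputs equals the number of queries, for example one trajectory prediction per trajectory query.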
  • the disclosed embodiment obtains a first bird's-eye view feature by encoding the first image feature corresponding to each perspective, decodes the task query features of different tasks based on the first bird's-eye view feature and the decoding networks corresponding to different tasks, and then obtains the task processing results corresponding to the different tasks based on the head networks corresponding to the different tasks.
  • This can achieve end-to-end multi-task processing based on multi-frame multi-perspective surround view images, helps to improve task processing efficiency, avoid or reduce dependence on offline generated high-precision maps and lidars, and simultaneously achieves static element detection, three-dimensional detection of dynamic objects, and prediction of motion trajectories, which helps to reduce costs.
  • FIG14 is a schematic diagram of the overall structure of a network model for image processing provided by an exemplary embodiment of the present disclosure.
  • A feature extraction network can be used to obtain the first image features corresponding to each perspective, and the first image features can be passed through an encoder network to obtain the first bird's-eye view feature. From the first bird's-eye view feature, a static element task query feature can be obtained through a first decoding network, and a static element detection result can then be obtained using a static element detection head network; likewise, a dynamic object task query feature can be obtained through a second decoding network, and a dynamic object detection result can then be obtained using a dynamic object detection head network. The dynamic object task query feature is fused with the modal query feature to obtain an initial motion trajectory query feature. Based on the initial motion trajectory query feature and the static element task query feature obtained by the first decoding network, a motion trajectory task query feature can be obtained using a third decoding network, and a motion trajectory prediction head network can then be used to obtain the motion trajectory prediction result.
  • the network model can be obtained by pre-training.
  • multiple tasks can be trained together, or they can be trained separately first and then trained comprehensively.
  • the specific settings can be made according to actual needs. For example, in order to ensure better performance in motion trajectory prediction, static element tasks and dynamic object tasks can be trained first to obtain a basic model, and then based on the basic model, the three tasks can be trained together. The specific training principle will not be repeated here.
  • The disclosed embodiment can realize task processing of static elements, dynamic objects and motion trajectories using only multi-view images. Compared with LiDAR, images provide richer environmental information, and the hardware is low in cost and easy to deploy. Moreover, the disclosed embodiment can generate static map information online through static element detection, does not rely on high-precision maps generated offline, and therefore has a wider range of application scenarios. In addition, the disclosed embodiment does not need to explicitly track dynamic targets, which helps to reduce the complexity of model calculation and avoids the impact of tracking module errors on subsequent processing, thereby further improving the accuracy of task processing results.
  • the features of the images from each perspective can be fused with the data collected by the lidar for end-to-end single-task or multi-task processing to improve the richness of the feature information, thereby helping to further improve the model performance.
  • Any image processing method provided in the embodiments of the present disclosure may be executed by any appropriate device with data processing capabilities, including but not limited to: a terminal device and a server.
  • any image processing method provided in the embodiments of the present disclosure may be executed by a processor, such as the processor executing any image processing method mentioned in the embodiments of the present disclosure by calling corresponding instructions stored in a memory. This will not be described in detail below.
  • FIG15 is a schematic diagram of the structure of an image processing device provided by an exemplary embodiment of the present disclosure.
  • the device of this embodiment can be used to implement the corresponding method embodiment of the present disclosure.
  • the device shown in FIG15 includes: a first processing module 501, a second processing module 502, a third processing module 503 and a fourth processing module 504.
  • the first processing module 501 is used to determine the first image features corresponding to each perspective based on the image to be processed corresponding to each perspective in at least one perspective; the second processing module 502 is used to determine the first bird's-eye view features based on the first image features corresponding to each perspective; the third processing module 503 is used to determine at least one task query feature among static element task query features, dynamic object task query features and motion trajectory task query features based on the first bird's-eye view features; the fourth processing module 504 is used to determine the task processing results corresponding to each task query feature based on each task query feature in at least one task query feature.
  • FIG. 16 is a schematic diagram of the structure of an image processing apparatus provided by another exemplary embodiment of the present disclosure.
  • the third processing module 503 includes:
  • the first processing unit 5031 is used to determine the static element task query feature based on the first bird's-eye view feature and the initial static element query feature using a pre-trained first decoding network, where the initial static element query feature includes initial query features corresponding to each static element in at least one static element.
  • the third processing module 503 includes:
  • the second processing unit 5032 is used to determine the dynamic object task query feature based on the first bird's-eye view feature and the initial dynamic object query feature using a pre-trained second decoding network, where the initial dynamic object query feature includes initial query features corresponding to each dynamic object in at least one dynamic object.
  • the third processing module 503 includes:
  • the third processing unit 5033 is used to determine the motion trajectory task query feature based on the static element task query feature and the initial motion trajectory query feature using a pre-trained third decoding network, where the initial motion trajectory query feature includes initial trajectory query features corresponding to each dynamic object in at least one dynamic object.
  • the third processing module 503 may include at least two of the first processing unit 5031 , the second processing unit 5032 and the third processing unit 5033 , which may be specifically configured according to actual needs.
  • the first processing unit 5031 is specifically configured to:
  • Based on the initial static element query feature, determine a first query tensor, a first key tensor and a first value tensor; based on the first query tensor, the first key tensor and the first value tensor, determine a first self-attention result by using a first self-attention network of a first decoder in a first decoding network; based on the first self-attention result and the initial static element query feature, determine a first intermediate result by using a first additive normalization network of the first decoder in the first decoding network; based on the first intermediate result, determine a second query tensor; based on the first bird's-eye view feature, determine a second key tensor and a second value tensor; based on the second query tensor, the second key tensor and the second value tensor, determine a first cross-attention result by using
  • the second processing unit 5032 is specifically configured to:
  • Based on the initial dynamic object query feature, determine a third query tensor, a third key tensor and a third value tensor; based on the third query tensor, the third key tensor and the third value tensor, use the second self-attention network of the first decoder in the second decoding network to determine a second self-attention result; based on the second self-attention result and the initial dynamic object query feature, use the second additive normalization network of the first decoder in the second decoding network to determine a second intermediate result; based on the second intermediate result, determine a fourth query tensor; based on the first bird's-eye view feature, determine a fourth key tensor and a fourth value tensor; based on the fourth query tensor, the fourth key tensor and the fourth value tensor, use the second deformable cross-attention network of the first de
  • the third processing unit 5033 is specifically configured to:
  • Based on the initial motion trajectory query feature, determine the fifth query tensor, the fifth key tensor and the fifth value tensor; based on the fifth query tensor, the fifth key tensor and the fifth value tensor, use the third self-attention network of the first decoder in the third decoding network to determine a third self-attention result; based on the third self-attention result and the initial motion trajectory query feature, use the third additive normalization network of the first decoder in the third decoding network to determine a third intermediate result; based on the third intermediate result, determine the sixth query tensor; based on the static element task query feature, determine the sixth key tensor and the sixth value tensor; based on the sixth query tensor, the sixth key tensor and the sixth value tensor, use the first cross-attention network of the first decoder in the third decoding network to determine a third cross-attention result; based on the third cross-attention result and the third intermediate result, determine the motion trajectory task query feature.
  • FIG. 17 is a schematic diagram of the structure of a third processing module 503 provided by an exemplary embodiment of the present disclosure.
  • the third processing module 503 further includes:
  • the fourth processing unit 5034 is used to determine the initial motion trajectory query feature based on the dynamic object task query feature and the modal query feature.
  • the modal query feature includes a first modal query feature corresponding to each modality of at least one modality.
  • the first modal query feature corresponding to the modality is used to characterize a motion trend of the dynamic object.
  • the dynamic object task query feature includes at least one dynamic object task query feature; the fourth processing unit 5034 is specifically configured to:
  • for the task query feature corresponding to each dynamic object, determine a first number of that task query feature based on that task query feature, where the first number is the number of first modal query features included in the modal query feature; add the first number of that task query feature to the first modal query features corresponding to the respective modalities in the modal query feature to obtain the initial trajectory query features corresponding to that dynamic object; and determine the initial motion trajectory query feature based on the initial trajectory query features corresponding to the respective dynamic objects.
  • the second processing module 502 includes:
  • the first determining unit 5021 is used to determine the first bird's-eye view feature based on the first image features corresponding to each viewing angle, the initial bird's-eye view query feature, and the bird's-eye view feature of the previous frame obtained before the first bird's-eye view feature.
  • the first determining unit 5021 is specifically configured to:
  • based on the previous-frame bird's-eye view feature and the initial bird's-eye view query feature, the temporal self-attention result is determined by using the temporal self-attention network of the first encoder in the pre-trained encoder network; based on the temporal self-attention result and the initial bird's-eye view query feature, the fourth additive normalization network of the first encoder is used to determine the fourth intermediate result; based on the first image features corresponding to each viewing angle and the fourth intermediate result, the spatial cross-attention result is determined by using the spatial cross-attention network in the first encoder; based on the spatial cross-attention result and the fourth intermediate result, the first bird's-eye view feature is determined.
  • the fourth processing module 504 includes:
  • the second determination unit 5041 is used to determine the static element detection result based on the static element task query feature by using the pre-trained static element detection head network;
  • the third determination unit 5042 is used to determine the dynamic object detection result based on the dynamic object task query feature by using the pre-trained dynamic object detection head network;
  • the fourth determination unit 5043 is used to determine the motion trajectory prediction result based on the motion trajectory task query feature by using the pre-trained motion trajectory prediction head network.
  • the above-mentioned units of the present disclosure can also be divided into finer granularity according to actual needs, such as dividing the unit into multiple sub-units, which can be specifically set according to actual needs.
  • FIG. 18 is a schematic structural diagram of an application embodiment of the electronic device of the present disclosure.
  • the electronic device 10 includes one or more processors 11 and a memory 12.
  • the processor 11 may be a central processing unit (CPU) or other forms of processing units having data processing capabilities and/or instruction execution capabilities, and may control other components in the electronic device 10 to perform desired functions.
  • the memory 12 may include one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory.
  • the volatile memory may include, for example, a random access memory (RAM) and/or a cache memory (cache), etc.
  • the non-volatile memory may include, for example, a read-only memory (ROM), a hard disk, a flash memory, etc.
  • One or more computer program instructions may be stored on the computer-readable storage medium, and the processor 11 may run the program instructions to implement the methods of the various embodiments of the present disclosure described above and/or other desired functions.
  • Various contents such as input signals, signal components, noise components, etc. may also be stored in the computer-readable storage medium.
  • the electronic device 10 may further include: an input device 13 and an output device 14, and these components are interconnected via a bus system and/or other forms of connection mechanisms (not shown).
  • the input device 13 may also include, for example, a keyboard, a mouse, etc.
  • the output device 14 can output various information to the outside, and the output device 14 can include, for example, a display, a speaker, a printer, a communication network and a remote output device connected thereto, and the like.
  • for simplicity, FIG. 18 shows only some of the components of the electronic device 10 that are related to the present disclosure, and omits components such as a bus and an input/output interface.
  • the electronic device 10 may further include any other appropriate components according to specific application scenarios.
  • an embodiment of the present disclosure may also be a computer program product, which includes computer program instructions, which, when executed by a processor, enable the processor to execute the steps of the method according to various embodiments of the present disclosure described in the above-mentioned "Exemplary Method" section of this specification.
  • the program code of the computer program product for performing the operations of the embodiments of the present disclosure may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, and conventional procedural programming languages such as "C" or similar programming languages.
  • the program code may be executed entirely on the user computing device, partially on the user device, as a separate software package, partially on the user computing device and partially on a remote computing device, or entirely on a remote computing device or server.
  • an embodiment of the present disclosure may also be a computer-readable storage medium having computer program instructions stored thereon, which, when executed by a processor, enable the processor to execute the steps of the method according to various embodiments of the present disclosure described in the above “Exemplary Method” section of this specification.
  • the computer readable storage medium can adopt any combination of one or more readable media.
  • the readable medium can be a readable signal medium or a readable storage medium.
  • the readable storage medium can include, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus or device, or any combination of the above.
  • more specific examples (a non-exhaustive list) of readable storage media include: an electrical connection with one or more wires, a portable disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Image Analysis (AREA)

Abstract

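Embodiments of the present disclosure disclose an image processing method and apparatus, an electronic device, and a storage medium. The method includes: determining, based on a to-be-processed image corresponding to each of at least one viewing angle, a first image feature corresponding to each viewing angle; determining a first bird's-eye view feature based on the first image features corresponding to the respective viewing angles; determining, based on the first bird's-eye view feature, at least one task query feature among a static element task query feature, a dynamic object task query feature, and a motion trajectory task query feature; and determining, based on each of the at least one task query feature, a task processing result corresponding to that task query feature. The embodiments can perform end-to-end single-task or multi-task processing relying only on multi-view environment images, can obtain accurate information about the surrounding environment even without a high-definition map, greatly improve generality, and effectively reduce cost.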

Description

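Image processing method and apparatus, electronic device, and storage medium
The present disclosure claims priority to Chinese patent application No. CN202211417346.2, entitled “Image processing method and apparatus, electronic device, and storage medium”, filed with the China National Intellectual Property Administration on November 11, 2022, the entire contents of which are incorporated into the present disclosure by reference.
Technical Field
The present disclosure relates to computer vision technology, and in particular to an image processing method and apparatus, an electronic device, and a storage medium.
Background
In the field of autonomous driving, how to efficiently understand environmental information from multi-view environment images is a very important technical problem.
Summary
Embodiments of the present disclosure provide an image processing method and apparatus, an electronic device, and a storage medium.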
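According to one aspect of the embodiments of the present disclosure, an image processing method is provided, including: determining, based on a to-be-processed image corresponding to each of at least one viewing angle, a first image feature corresponding to each viewing angle; determining a first bird's-eye view feature based on the first image features corresponding to the respective viewing angles; determining, based on the first bird's-eye view feature, at least one task query feature among a static element task query feature, a dynamic object task query feature, and a motion trajectory task query feature; and determining, based on each of the at least one task query feature, a task processing result corresponding to that task query feature.
According to another aspect of the embodiments of the present disclosure, an image processing apparatus is provided, including: a first processing module, configured to determine, based on a to-be-processed image corresponding to each of at least one viewing angle, a first image feature corresponding to each viewing angle; a second processing module, configured to determine a first bird's-eye view feature based on the first image features corresponding to the respective viewing angles; a third processing module, configured to determine, based on the first bird's-eye view feature, at least one task query feature among a static element task query feature, a dynamic object task query feature, and a motion trajectory task query feature; and a fourth processing module, configured to determine, based on each of the at least one task query feature, a task processing result corresponding to that task query feature.
According to a further aspect of the embodiments of the present disclosure, a computer-readable storage medium is provided, the storage medium storing a computer program for executing the image processing method of any of the above embodiments of the present disclosure.
According to yet another aspect of the embodiments of the present disclosure, an electronic device is provided, including: a processor; and a memory for storing instructions executable by the processor; the processor being configured to read the executable instructions from the memory and execute them to implement the image processing method of any of the above embodiments of the present disclosure.
With the image processing method and apparatus, electronic device, and storage medium provided by the above embodiments, a bird's-eye view feature can be determined from the image features extracted from the to-be-processed images of the respective viewing angles, at least one task query feature can be determined from the bird's-eye view feature, and a task processing result for each task can then be obtained from the corresponding task query feature. End-to-end single-task or multi-task processing is thus achieved from multi-view environment images, which avoids or reduces reliance on a high-definition map, so that accurate information about the surrounding environment can be obtained even when no high-definition map is available; this helps improve generality and can reduce cost.
The technical solutions of the present disclosure are described in further detail below with reference to the accompanying drawings and embodiments.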
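Brief Description of the Drawings
FIG. 1 is an exemplary application scenario of the image processing method provided by the present disclosure;
FIG. 2 is a schematic flowchart of the image processing method provided by an exemplary embodiment of the present disclosure;
FIG. 3 is a schematic flowchart of the image processing method provided by another exemplary embodiment of the present disclosure;
FIG. 4 is a schematic flowchart of step 2031a provided by an exemplary embodiment of the present disclosure;
FIG. 5 is a schematic diagram of the network structure of the first decoding network provided by an exemplary embodiment of the present disclosure;
FIG. 6 is a schematic flowchart of step 2031b provided by an exemplary embodiment of the present disclosure;
FIG. 7 is a schematic flowchart of step 2031c provided by an exemplary embodiment of the present disclosure;
FIG. 8 is a schematic diagram of the structure of the third decoding network provided by an exemplary embodiment of the present disclosure;
FIG. 9 is a schematic flowchart of the image processing method provided by yet another exemplary embodiment of the present disclosure;
FIG. 10 is a schematic flowchart of step 301 provided by an exemplary embodiment of the present disclosure;
FIG. 11 is a schematic diagram of the principle of determining the initial motion trajectory query feature provided by an exemplary embodiment of the present disclosure;
FIG. 12 is a schematic flowchart of step 2021 provided by an exemplary embodiment of the present disclosure;
FIG. 13 is a schematic diagram of the network structure of the encoder network provided by an exemplary embodiment of the present disclosure;
FIG. 14 is a schematic diagram of the overall structure of the network model used for image processing provided by an exemplary embodiment of the present disclosure;
FIG. 15 is a schematic diagram of the structure of the image processing apparatus provided by an exemplary embodiment of the present disclosure;
FIG. 16 is a schematic diagram of the structure of the image processing apparatus provided by another exemplary embodiment of the present disclosure;
FIG. 17 is a schematic diagram of the structure of the third processing module 503 provided by an exemplary embodiment of the present disclosure;
FIG. 18 is a schematic structural diagram of an application embodiment of the electronic device of the present disclosure.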
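Detailed Description
To explain the present disclosure, exemplary embodiments of the present disclosure are described in detail below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present disclosure, and it should be understood that the present disclosure is not limited by the exemplary embodiments.
It should be noted that, unless specifically stated otherwise, the relative arrangement of components and steps, the numerical expressions, and the numerical values set forth in these embodiments do not limit the scope of the present disclosure.
Overview of the Present Disclosure
In the course of implementing the present disclosure, the inventors found that, in the field of autonomous driving, how to efficiently understand environmental information from multi-view environment images is a very important technical problem. If the surrounding environment is understood by combining multi-view environment images with a high-definition map, the environmental information obtained when no high-definition map is available tends to be inaccurate.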
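Exemplary Overview
FIG. 1 is an exemplary application scenario of the image processing method provided by the present disclosure.
In an autonomous driving scenario, images of the vehicle's surroundings can be captured by on-board surround-view cameras (which may include cameras at multiple viewing angles) and used as the to-be-processed images corresponding to the respective viewing angles. The image processing apparatus of the present disclosure executes the image processing method of the present disclosure: based on the to-be-processed image corresponding to each of at least one viewing angle, it determines the first image feature corresponding to each viewing angle, and based on those first image features it determines the first bird's-eye view feature, which is a feature in the grid coordinate system corresponding to the bird's-eye view (Bird's Eye View, BEV for short). Based on the first bird's-eye view feature, the static element task query feature, the dynamic object task query feature, and the motion trajectory task query feature can be determined, and the corresponding task processing results (for example, but not limited to, task processing result 1, task processing result 2, and task processing result 3) are then determined from each of them. For instance, a static element task processing result (such as a detection result for static elements around the vehicle in an autonomous driving scenario) is determined from the static element task query feature, a dynamic object task processing result (such as a three-dimensional object detection result) is determined from the dynamic object task query feature, and a motion trajectory task processing result (such as a motion trajectory prediction result for a dynamic object) is determined from the motion trajectory task query feature. This achieves end-to-end single-task or multi-task processing based on multi-view environment images without a high-definition map, which helps avoid or reduce reliance on a high-definition map, so that accurate information about the surrounding environment can be obtained even when no high-definition map is available; this helps improve generality and can reduce cost. Static elements may include static objects such as lane lines, zebra crossings, and curbs; dynamic objects may include objects with motion attributes such as surrounding vehicles and pedestrians; and a motion trajectory refers to the motion trajectory of a dynamic object.
It should be noted that the image processing method of the present disclosure is not limited to the autonomous driving scenario above and can be applied to any other possible scenario as needed, such as security monitoring of a given area, where the bird's-eye view feature of the area is obtained from images captured by cameras at the respective viewing angles in order to carry out end-to-end task processing for static elements, dynamic objects, and/or dynamic object motion trajectories in that area. The specific scenario can be set according to actual needs.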
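Exemplary Method
FIG. 2 is a schematic flowchart of the image processing method provided by an exemplary embodiment of the present disclosure. This embodiment can be applied to an electronic device, such as an in-vehicle computing platform. As shown in FIG. 2, it includes the following steps:
Step 201: determine, based on a to-be-processed image corresponding to each of at least one viewing angle, a first image feature corresponding to each viewing angle.
The number of viewing angles can be set according to actual needs. For example, in an autonomous driving scenario it is the number of surround-view cameras installed on the vehicle, with each camera corresponding to one viewing angle; a four-camera surround-view system consisting of front-left, rear-left, front-right, and rear-right cameras has 4 viewing angles. No specific limitation is imposed. The first image features can be obtained by any practicable feature extraction approach, for example by applying a pre-trained feature extraction network to each to-be-processed image to obtain the first image feature corresponding to each viewing angle. The feature extraction network can be set according to actual needs; for example, a convolutional neural network can be used.
In an optional example, step 201 may be executed by a processor invoking corresponding instructions stored in a memory, or by a first processing module run by the processor.
Step 202: determine a first bird's-eye view feature based on the first image features corresponding to the respective viewing angles.
The first bird's-eye view feature is a BEV feature in the grid coordinate system corresponding to the bird's-eye view. It can be obtained by encoding the first image features of the respective viewing angles with a pre-trained encoder network. The encoder network can be set according to actual needs.
In an optional example, step 202 may be executed by a processor invoking corresponding instructions stored in a memory, or by a second processing module run by the processor.
Step 203: determine, based on the first bird's-eye view feature, at least one task query feature among a static element task query feature, a dynamic object task query feature, and a motion trajectory task query feature.
The static element task query feature is a task query feature related to static elements that is extracted from the first bird's-eye view feature; likewise, the dynamic object task query feature is a task query feature related to dynamic objects that is extracted from the first bird's-eye view feature; and the motion trajectory task query feature is a task query feature related to the motion trajectories of dynamic objects that is extracted from the static element task query feature. Which task query feature or features to obtain can be set according to actual needs: any one, any two, or all three. Any one of them can be obtained by decoding the first bird's-eye view feature with the pre-trained decoding network corresponding to that task. The specific decoding network can be set according to actual needs.
In an optional example, step 203 may be executed by a processor invoking corresponding instructions stored in a memory, or by a third processing module run by the processor.
Step 204: determine, based on each of the at least one task query feature, a task processing result corresponding to that task query feature.
For any task, a head network corresponding to that task can be set up and trained, and the trained head network is used to perform output projection on the task query feature of that task to obtain the corresponding task processing result. The specific structure of the head network can be set according to actual needs; for example, it can be implemented with a multilayer perceptron (Multilayer Perceptron, MLP for short).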
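The output projection performed by a head network can be illustrated with a small multilayer perceptron. The following is only a minimal sketch: the helper names `make_mlp` and `mlp_head`, the ReLU activation, the layer sizes, and the random (untrained) weights are all assumptions for illustration and are not specified by the disclosure.

```python
import random

def make_mlp(sizes, seed=0):
    """Create randomly initialized (weights, bias) pairs for consecutive layer sizes.

    Hypothetical helper: a real head network would load trained parameters instead.
    """
    rng = random.Random(seed)
    return [([[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)],
             [0.0] * n_out)
            for n_in, n_out in zip(sizes, sizes[1:])]

def mlp_head(query, layers):
    """Project one D-dimensional task query feature to a task output vector."""
    x = query
    for i, (weights, bias) in enumerate(layers):
        # Affine layer: one output value per weight row.
        x = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]
        if i < len(layers) - 1:
            x = [max(0.0, v) for v in x]  # ReLU on hidden layers only (assumed)
    return x

# Usage: a 4-dimensional query projected to 3 output values.
layers = make_mlp([4, 8, 3])
output = mlp_head([1.0, 2.0, 3.0, 4.0], layers)
```

In practice one such head would be applied to every query of a task, with a separate head per task.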
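In an optional example, step 204 may be executed by a processor invoking corresponding instructions stored in a memory, or by a fourth processing module run by the processor.
With the image processing method provided by this embodiment, a bird's-eye view feature can be determined from the image features extracted from the to-be-processed images of the respective viewing angles, at least one task query feature can be determined from the bird's-eye view feature, and the task processing result for each task can then be obtained from the corresponding task query feature. End-to-end single-task or multi-task processing is achieved relying only on multi-view environment images, without a high-definition map, which avoids or reduces reliance on a high-definition map so that accurate information about the surrounding environment can be obtained even when no high-definition map is available; this helps improve generality and can reduce cost.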
图3是本公开另一示例性实施例提供的图像的处理方法的流程示意图。
在一个可选示例中,步骤203具体可以包括以下步骤:
步骤2031a,基于第一鸟瞰图特征及初始静态元素查询特征,利用预先训练获得的第一解码网络,确定静态元素任务查询特征,初始静态元素查询特征包括至少一个静态元素中各静态元素分别对应的初始查询特征。
其中,初始静态元素查询特征可以根据实际需求设置,比如基于第一初始化规则进行初始化获得的初始静态元素查询特征,第一初始化规则可以根据实际需求设置,比如随机初始化N2个D维的静态query(静态query表示静态元素对应的初始查询特征)作为初始静态元素查询特征,N2为静态query的数量,也即静态元素的数量,N2可以根据实际需求设置,D表示每个静态元素对应的初始查询特征的维度。第一解码网络可以包括至少一个解码器,用于基于初始静态元素查询特征从第一鸟瞰图特征中查询与静态元素相关的任务查询特征,获得静态元素任务查询特征。第一解码网络的具体网络结构可以根据实际需求设置。
在一个可选示例中,该步骤2031a可以由处理器调用存储器存储的相应指令执行,也可以由被处理器运行的第一处理单元执行。
本公开实施例通过利用训练获得的第一解码网络,基于初始静态元素查询特征从第一鸟瞰图特征中查询获得与静态元素相关的静态元素任务查询特征,为后续静态元素检测任务的实现提供更加准确有效的特征数据,以实现基于多视角图像的端到端的任务处理,当应用于地图重建场景时,可以不依赖离线生成的高精地图即可在线生成静态地图信息,进一步提高通用性。
在一个可选示例中,步骤203具体可以包括以下步骤:
步骤2031b,基于第一鸟瞰图特征及初始动态对象查询特征,利用预先训练获得的第二解码网络,确定动态对象任务查询特征,初始动态对象查询特征包括至少一个动态对象中各动态对象分别对应的初始查询特征。
其中,初始动态对象查询特征可以根据实际需求设置,比如基于第二初始化规则进行初始化获得的初始动态对象查询特征,第二初始化规则可以根据实际需求设置,比如随机初始化N1个D维的动态query(动态query表示动态对象对应的初始查询特征)作为初始动态对象查询特征,N1为动态query的数量,也即动态对象的数量,N1可以根据实际需求设置,D表示每个动态对象对应的初始查询特征的维度。第二解码网络可以包括至少一个解码器,用于基于初始动态对象查询特征从第一鸟瞰图特征中查询与动态对象相关的任务查询特征,获得动态对象任务查询特征。第二解码网络的具体网络结构可以根据实际需求设置。
在一个可选示例中,该步骤2031b可以由处理器调用存储器存储的相应指令执行,也可以由被处理器运行的第二处理单元执行。
本公开实施例通过利用训练获得的第二解码网络,基于初始动态对象查询特征从第一鸟瞰图特征中查询获得与动态对象相关的动态对象任务查询特征,为后续动态对象检测任务的实现提供更加准确有效的特征数据,以实现基于多视角图像的端到端的三维目标检测任务处理,可以避免再进行动态对象的跟踪,从而有助于降低网络模型的计算复杂度,同时可以避免目标跟踪误差对后续应用产生的影响。
在一个可选示例中,步骤203具体可以包括以下步骤:
步骤2031c,基于静态元素任务查询特征及初始运动轨迹查询特征,利用预先训练获得的第三解码网络,确定运动轨迹任务查询特征,初始运动轨迹查询特征包括至少一个动态对象中各动态对象分别对应的初始轨迹查询特征。
其中,初始运动轨迹查询特征需要结合动态对象任务查询特征及模态查询特征确定,模态查询特征用于表征动态对象的运动趋势,不同模态关注不同的未来运动类型(比如快速直行、低速直行、左转、右转,等等),动态对象任务查询特征用于表征动态对象相关的特征,结合模态查询特征和动态对象任务查询特征,可以确定出与动态对象运动轨迹相关的初始轨迹查询特征,进而在第三解码网络中与静态元素任务查询特征交互,解码出运动轨迹任务查询特征。第三解码网络可以根据实际需求设置。
在一个可选示例中,该步骤2031c可以由处理器调用存储器存储的相应指令执行,也可以由被处理器运行的第三处理单元执行。
本公开实施例通过利用训练获得的第三解码网络,基于初始运动轨迹查询特征从静态元素任务查询特征中查询获得与动态对象运动轨迹相关的运动轨迹任务查询特征,为后续动态对象的运动轨迹预测任务的实现提供更加准确有效的特征数据,以实现基于多视角图像的端到端的运动轨迹预测任务处理。
在一个可选示例中,步骤203可以包括上述2031a-2031c中的至少两个步骤,具体可以根据实际需求设置,从而可以仅依赖多视角图像实现端到端的多任务处理,同时获得多个任务分别对应的任务处理结果,可以避免对高精地图及激光雷达的依赖,有助于进一步提高通用性。
图4是本公开一示例性实施例提供的步骤2031a的流程示意图。
在一个可选示例中,步骤2031a的基于第一鸟瞰图特征及初始静态元素查询特征,利用预先训练获得的第一解码网络,确定静态元素任务查询特征,包括:
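Step 20311a: determine the first query tensor, the first key tensor, and the first value tensor based on the initial static element query feature.
The initial static element query feature can be mapped to the first query tensor based on a first query mapping rule, for example by a first query mapping matrix; likewise, it can be mapped to the first key tensor based on a first key mapping rule and to the first value tensor based on a first value mapping rule. The mapping principle is not detailed here.
Step 20312a: determine the first self-attention result based on the first query tensor, the first key tensor, and the first value tensor by using the first self-attention network of the first decoder in the first decoding network.
The first self-attention network is a network based on the self-attention mechanism and can be set according to actual needs; it performs the self-attention operation on the first query tensor, the first key tensor, and the first value tensor. Specifically, the self-attention operation is performed on the first query tensor and the first key tensor to obtain first weights, and the first value tensor is then weighted and summed with the first weights to obtain the first self-attention result. The principle of the self-attention mechanism is not detailed here.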
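The mapping and weighting described in steps 20311a and 20312a can be sketched as single-head scaled dot-product self-attention. This is an illustrative sketch only: the softmax and the division by the square root of the dimension follow the common Transformer formulation and are assumptions here, as are the names `self_attention`, `w_q`, `w_k`, and `w_v`; the disclosure itself only states that weights are derived from the query and key tensors and applied to the value tensor.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, w_q, w_k, w_v):
    """Single-head self-attention over N query features of dimension D."""
    q = matmul(queries, w_q)  # first query tensor
    k = matmul(queries, w_k)  # first key tensor
    v = matmul(queries, w_v)  # first value tensor
    scale = math.sqrt(len(q[0]))
    # Similarity of every query with every key (q times k transposed).
    scores = matmul(q, [list(col) for col in zip(*k)])
    weights = [softmax([s / scale for s in row]) for row in scores]  # first weights
    # Weighted sum of the value tensor.
    return matmul(weights, v), weights

# Usage with identity mapping matrices and two 2-dimensional queries.
identity = [[1.0, 0.0], [0.0, 1.0]]
result, weights = self_attention([[1.0, 0.0], [0.0, 1.0]], identity, identity, identity)
```

Each row of `weights` sums to 1, so every output row is a convex combination of the value vectors.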
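Step 20313a: determine the first intermediate result based on the first self-attention result and the initial static element query feature by using the first additive normalization network of the first decoder in the first decoding network.
The first additive normalization network (Add&Norm) performs two functions, addition and normalization: the first self-attention result is added to the initial static element query feature to obtain a first addition result, which is then normalized to obtain the first intermediate result.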
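The add-then-normalize operation of step 20313a can be sketched as a residual addition followed by normalization of each query vector. The sketch below assumes layer normalization without learned scale and shift parameters, which is the usual Transformer choice but is not stated in the disclosure; the function name `add_and_norm` and the `eps` value are likewise illustrative.

```python
import math

def add_and_norm(sublayer_out, residual, eps=1e-5):
    """Add the residual input to the sublayer output, then normalize each vector."""
    result = []
    for out_vec, res_vec in zip(sublayer_out, residual):
        summed = [o + r for o, r in zip(out_vec, res_vec)]  # first addition result
        mean = sum(summed) / len(summed)
        var = sum((s - mean) ** 2 for s in summed) / len(summed)
        # Zero mean and approximately unit variance per query vector.
        result.append([(s - mean) / math.sqrt(var + eps) for s in summed])
    return result

# Usage: one self-attention output vector added to its residual input.
intermediate = add_and_norm([[1.0, 2.0, 3.0]], [[0.0, 0.0, 3.0]])
```

The same operation would apply after the cross-attention and feed-forward sublayers, each time with that sublayer's input as the residual.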
步骤20314a,基于第一中间结果,确定第二查询张量。
其中,可以将第一中间结果作为第二查询张量,或者基于相应的映射规则将第一中间结果映射为第二查询张量,具体可以根据实际需求设置。
步骤20315a,基于第一鸟瞰图特征,确定第二键张量和第二值张量。
其中,第二键张量和第二值张量的确定原理参见前述内容,在此不再赘述。
步骤20316a,基于第二查询张量、第二键张量和第二值张量,利用第一解码网络中第一个解码器的第一可变形交叉注意力网络,确定第一交叉注意力结果。
其中,第一可变形交叉注意力网络为基于可变形卷积的交叉注意力网络,其功能是可以从第一鸟瞰图特征中提取与初始静态元素查询特征对应位置附近的局部区域内的特征,进一步提高提取特征的准确性和有效性。
步骤20317a,基于第一交叉注意力结果和第一中间结果,确定静态元素任务查询特征。
其中,在第一解码网络中的第一个解码器中在第一可变形交叉注意力网络之后还可以包括其他相关网络,比如相加归一化网络(Add&Norm)、前馈网络(Feed Forward)等,因此在获得第一交叉注意力结果后,可以将第一交叉注意力结果与第一中间结果相加并归一化后,再通过其他相关网络,最终获得第一个解码器的解码结果,当第一解码网络中包括多个解码器时,第一个解码器的解码结果还可以作为第二个解码器的输入,按照上述第一个解码器的解码流程再进行解码,以此类推,直至完成所有解码器的解码,获得第一解码网络的最终解码结果,该最终解码结果作为静态元素任务查询特征。
在一个可选示例中,图5是本公开一示例性实施例提供的第一解码网络的网络结构示意图。如图5所示,第一解码网络包括6个解码器。其中,×6表示第一解码网络包括6个虚框内的解码器,该虚框内的解码器以第一个解码器为例,Q1、K1、V1分别表示第一查询张量、第一键张量和第一值张量,Self Attention表示第一自注意力网络,Add&Norm表示相加归一化网络,与第一自注意力网络连接的Add&Norm表示第一相加归一化网络,Q2、K2、V2分别表示第二查询张量、第二键张量和第二值张量;Deformable Cross Attention表示第一可变形交叉注意力网络,Feed Forward表示前馈网络。初始静态元素查询特征映射为第一查询张量、第一键张量和第一值张量后在第一自注意力网络进行自我交互,获得第一自注意力结果,第一自注意力结果与初始静态元素查询特征相加并归一化后获得第一中间结果,第一中间结果映射为第二查询张量,同时,第一鸟瞰图特征映射为第二键张量和第二值张量,第二查询张量、第二键张量和第二值张量在第一可变形交叉注意力网络进行交叉注意力,实现第一鸟瞰图特征与初始静态元素查询特征的交互,获得第一交叉注意力结果,第一交叉注意力结果与第一中间结果相加并归一化后获得第一归一化结果,第一归一化结果经前馈网络和又一个相加归一化网络后获得第一个解码器的解码结果,该解码结果再经过5个解码器的解码获得静态元素任务查询特征。
在一个可选示例中第一自注意力网络可以是多头自注意力网络,第一可变形交叉注意力网络也可以是多头可变形交叉注意力网络,具体可以根据实际需求设置。
在一个可选示例中,上述步骤20311a至步骤20317a可以由处理器调用存储器存储的相应指令执行,也可以由被处理器运行的第一处理单元执行。
本公开实施例通过第一解码网络中的自注意力网络实现静态元素查询特征的自我交互,可以捕获静态元素查询特征的内部相关性,进而在可变形交叉注意力网络与鸟瞰图特征进行交互,实现对数据的稀疏注意力,灵活地捕获相关局部区域的特征,有助于在保证获得准确有效的相关特征基础上,降低计算量,从而可以提高网络推理速度。
图6是本公开一示例性实施例提供的步骤2031b的流程示意图。
在一个可选示例中,步骤2031b的基于第一鸟瞰图特征及初始动态对象查询特征,利用预先训练获得的第二解码网络,确定动态对象任务查询特征,包括:
步骤20311b,基于初始动态对象查询特征,确定第三查询张量、第三键张量和第三值张量。
步骤20312b,基于第三查询张量、第三键张量和第三值张量,利用第二解码网络中第一个解码器的第二自注意力网络,确定第二自注意力结果。
步骤20313b,基于第二自注意力结果和初始动态对象查询特征,利用第二解码网络中的第一个解码器的第二相加归一化网络,确定第二中间结果。
步骤20314b,基于第二中间结果,确定第四查询张量。
步骤20315b,基于第一鸟瞰图特征,确定第四键张量和第四值张量。
步骤20316b,基于第四查询张量、第四键张量和第四值张量,利用第二解码网络中第一个解码器的第二可变形交叉注意力网络,确定第二交叉注意力结果。
步骤20317b,基于第二交叉注意力结果和第二中间结果,确定动态对象任务查询特征。
步骤20311b-20317b的具体操作原理与前述步骤20311a-20317a相同或相似,不同之处在于步骤20311b中基于的是初始动态对象查询特征,与步骤20311a中的初始静态元素查询特征不同,在此不再一一赘述。基于此,第二解码网络的网络结构与第一解码网络相同或相似,在此不再赘述。
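In an optional example, steps 20311b to 20317b may be executed by a processor invoking corresponding instructions stored in a memory, or by a second processing unit run by the processor.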
图7是本公开一示例性实施例提供的步骤2031c的流程示意图。
在一个可选示例中,步骤2031c的基于静态元素任务查询特征及初始运动轨迹查询特征,利用预先训练获得的第三解码网络,确定运动轨迹任务查询特征,包括:
步骤20311c,基于初始运动轨迹查询特征,确定第五查询张量、第五键张量和第五值张量。
步骤20312c,基于第五查询张量、第五键张量和第五值张量,利用第三解码网络中第一个解码器的第三自注意力网络,确定第三自注意力结果。
步骤20313c,基于第三自注意力结果和初始运动轨迹查询特征,利用第三解码网络中的第一个解码器的第三相加归一化网络,确定第三中间结果。
步骤20314c,基于第三中间结果,确定第六查询张量。
步骤20311c-步骤20314c的具体操作原理与前述步骤20311a-20314a相同或相似,在此不再赘述。
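Step 20315c: determine the sixth key tensor and the sixth value tensor based on the static element task query feature.
The sixth key tensor and the sixth value tensor in this step are obtained by mapping the static element task query feature obtained in the foregoing examples; for the mapping principle, see the foregoing description.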
步骤20316c,基于第六查询张量、第六键张量和第六值张量,利用第三解码网络中第一个解码器的第一交叉注意力网络,确定第三交叉注意力结果。
其中,第一交叉注意力网络可以采用任意可实施的交叉注意力网络,具体可以根据实际需求设置,示例性地,第一交叉注意力网络可以采用常规的视觉Transformer中的交叉注意力网络结构,具体不作限定。
步骤20317c,基于第三交叉注意力结果和第三中间结果,确定运动轨迹任务查询特征。
该步骤的具体操作原理参见前述步骤20317a,在此不再赘述。
示例性的,图8是本公开一示例性实施例提供的第三解码网络的结构示意图。其中,初始运动轨迹查询特征包括多个轨迹query(轨迹query表示动态对象对应的初始轨迹查询特征),Q5、K5、V5分别表示第五查询张量、第五键张量和第五值张量,Q6、K6、V6分别表示第六查询张量、第六键张量和第六值张量,其他符号含义及推理过程参见前述内容,在此不再赘述。
在一个可选示例中,上述步骤20311c至步骤20317c可以由处理器调用存储器存储的相应指令执行,也可以由被处理器运行的第三处理单元执行。
本公开实施例通过第三解码网络中的自注意力网络实现运动轨迹查询特征的自我交互,可以捕获运动轨迹查询特征的内部相关性,进而在交叉注意力网络与静态元素任务查询特征进行交互,实现对动态对象的运动轨迹相关特征的有效捕获,从而隐式地了解到周围的静态信息(比如周围道路信息),有助于为准确预测出动态对象的更合理的未来运动轨迹提供准确有效的特征数据,实现基于多视角图像特征的端到端运动轨迹预测。
图9是本公开再一示例性实施例提供的图像的处理方法的流程示意图。
在一个可选示例中,在步骤2031c的基于静态元素任务查询特征及初始运动轨迹查询特征,利用预先训练获得的第三解码网络,确定运动轨迹任务查询特征之前,还包括:
步骤301,基于动态对象任务查询特征及模态查询特征,确定初始运动轨迹查询特征,模态查询特征包括至少一种模态中各模态分别对应的第一模态查询特征,模态对应的第一模态查询特征用于表征动态对象的一种运动趋势。
其中,模态查询特征可以根据实际需求设置,比如通过第三初始化规则进行初始化获得,由于模态查询特征可以表征动态对象的运动趋势,结合表征动态对象位置的动态对象任务查询特征,可以确定出初始运动轨迹查询特征。
示例性的,可以随机初始化N3个D维的模态query(模态query表示模态对应的第一模态查询特征),作为模态查询特征。N3可以根据实际需求设置。模态查询特征与动态对象任务查询特征融合,形成N1×N3个D维的轨迹query,作为初始运动轨迹查询特征,即每个动态对象具有N3个模态,表征其N3种运动趋势。
在一个可选示例中,该步骤301可以由处理器调用存储器存储的相应指令执行,也可以由被处理器运行的第四处理单元执行。
本公开实施例基于更新获得的动态对象任务查询特征和模态查询特征融合,获得初始运动轨迹查询特征,从而使得初始运动轨迹查询特征可以包括各动态对象分别对应的任务查询特征以及每个动态对象的多个模态的查询特征,不同模态可以关注不同的未来运动类型(比如快速直行、低速直行、左转、右转,等等),有助于为后续通过第三解码网络解码出运动轨迹任务查询特征提供有效的数据支撑。
图10是本公开一示例性实施例提供的步骤301的流程示意图。
在一个可选示例中,动态对象任务查询特征包括至少一个动态对象的任务查询特征;步骤301的基于动态对象任务查询特征及模态查询特征,确定初始运动轨迹查询特征,包括:
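Step 3011: for the task query feature corresponding to each dynamic object, determine, based on that task query feature, a first number of that task query feature, where the first number is the number of first modal query features included in the modal query feature.
Each dynamic object can be assigned the first number of first modal query features, which characterize that number of motion trends of the dynamic object. To fuse the object's task query feature with the modal query feature, the number of task query features of that dynamic object is made equal to the number of modal query features; the first number of that task query feature can therefore be determined from the task query feature of the dynamic object.
For example, if the modal query feature includes N3 D-dimensional modal queries, then for the D-dimensional task query feature of each dynamic object (which may be called a task query), the task query feature can be copied N3 times to obtain N3 identical D-dimensional task query features.
In an optional example, step 3011 may be executed by a processor invoking corresponding instructions stored in a memory, or by a fourth processing unit run by the processor.
Step 3012: add the first number of that task query feature to the first modal query features corresponding to the respective modalities in the modal query feature to obtain the initial trajectory query features corresponding to that dynamic object.
For example, the N3 identical task query features can be added (add) to the N3 first modal query features (modal queries) to form N3 trajectory query features for that dynamic object.
In an optional example, step 3012 may be executed by a processor invoking corresponding instructions stored in a memory, or by a fourth processing unit run by the processor.
Step 3013: determine the initial motion trajectory query feature based on the initial trajectory query features corresponding to the respective dynamic objects.
For example, FIG. 11 is a schematic diagram of the principle of determining the initial motion trajectory query feature provided by an exemplary embodiment of the present disclosure. In this example, the number of dynamic objects is N1, a task query denotes the task query feature corresponding to a dynamic object, the number of first modal query features (modal queries) for each dynamic object is N3, and the resulting initial motion trajectory query feature includes N1×N3 D-dimensional initial trajectory query features.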
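Steps 3011 to 3013 amount to replicating each object's task query N3 times and adding one modal query to each copy. A minimal sketch, with the function name `build_trajectory_queries` and the toy values chosen purely for illustration:

```python
def build_trajectory_queries(object_queries, modal_queries):
    """Combine N1 object task queries with N3 modal queries into N1*N3 trajectory queries."""
    trajectory_queries = []
    for obj_q in object_queries:
        # Step 3011: N3 identical copies of this object's task query.
        copies = [list(obj_q) for _ in modal_queries]
        # Step 3012: element-wise addition of each copy and one modal query.
        for copy, modal_q in zip(copies, modal_queries):
            trajectory_queries.append([c + m for c, m in zip(copy, modal_q)])
    # Step 3013: the per-object results together form the initial trajectory queries.
    return trajectory_queries

# Usage: N1 = 2 objects, N3 = 3 modalities, D = 2.
objects = [[1.0, 2.0], [3.0, 4.0]]
modalities = [[10.0, 0.0], [0.0, 10.0], [1.0, 1.0]]
initial_trajectory_queries = build_trajectory_queries(objects, modalities)
```

The result is ordered object by object, so entries `i * N3` through `i * N3 + N3 - 1` belong to dynamic object `i`.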
在一个可选示例中,该步骤3013可以由处理器调用存储器存储的相应指令执行,也可以由被处理器运行的第四处理单元执行。
本公开实施例通过动态对象任务查询特征与模态查询特征融合获得初始运动轨迹查询特征,使得初始运动轨迹查询特征包含动态对象任务查询相关特征和动态对象的运动趋势相关信息,进一步通过训练获得的第三解码网络可以获得准确有效的运动轨迹任务查询特征。
在一个可选示例中,步骤202的基于各视角分别对应的第一图像特征,确定第一鸟瞰图特征,包括:
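Step 2021: determine the first bird's-eye view feature based on the first image features corresponding to the respective viewing angles, the initial bird's-eye view query feature, and the previous-frame bird's-eye view feature obtained before the first bird's-eye view feature.
The initial bird's-eye view query feature is a feature initialized in the bird's-eye view, and its size can represent the size of the bird's-eye view; the initialization rule can be set according to actual needs. For example, H×W D-dimensional bird's-eye view queries are initialized according to the required bird's-eye view size and used as the initial bird's-eye view query feature, where H and W are the height and width of the bird's-eye view and D is the feature dimension of each bird's-eye view query. Each bird's-eye view query can correspond to a set of three-dimensional coordinates in physical space, for example three-dimensional coordinates in a world coordinate system with the ego vehicle as the origin, which can be set according to actual needs. The previous-frame bird's-eye view feature is the bird's-eye view feature obtained in an earlier run of the image processing flow, and it is produced in the same way as the first bird's-eye view feature.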
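The correspondence between an H×W grid of bird's-eye view queries and ego-centered physical coordinates can be sketched as follows. The cell size in meters, the placement of the ego vehicle at the grid center, and the function name `bev_grid_centers` are assumptions made for illustration; the disclosure leaves the coordinate convention open.

```python
def bev_grid_centers(height, width, cell_size):
    """Return the (x, y) center of every BEV cell, with the ego vehicle at the grid center.

    The list is ordered row by row, so the query at (row, col) has index row * width + col.
    """
    centers = []
    for row in range(height):
        for col in range(width):
            x = (col + 0.5 - width / 2) * cell_size
            y = (row + 0.5 - height / 2) * cell_size
            centers.append((x, y))
    return centers

# Usage: a 4x4 grid with 0.5 m cells covers a 2 m x 2 m area around the ego vehicle.
centers = bev_grid_centers(4, 4, 0.5)
```

A height coordinate is added per cell when the queries are lifted to three dimensions, as in the spatial cross-attention step.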
在一个可选示例中,该步骤2021可以由处理器调用存储器存储的相应指令执行,也可以由被处理器运行的第一确定单元执行。
由于在前帧鸟瞰图特征中包含了在前帧采集的图像中静态元素、动态对象等相关历史信息,结合在前帧鸟瞰图特征、初始鸟瞰图查询特征和第一图像特征,既可以实现当前的待处理图像中静态元素和动态对象相关特征的确定,还可以实现动态对象相对在前帧的变化,从而便于实现动态对象的跟踪。
图12是本公开一示例性实施例提供的步骤2021的流程示意图。
在一个可选示例中,步骤2021的基于各视角分别对应的第一图像特征、初始鸟瞰图查询特征、及在第一鸟瞰图特征之前获得的在前帧鸟瞰图特征,确定第一鸟瞰图特征,包括:
步骤20211,基于在前帧鸟瞰图特征和初始鸟瞰图查询特征,利用预先训练获得的编码器网络中第一个编码器的时序自注意力网络,确定时序自注意力结果。
其中,编码器网络用于基于在前帧鸟瞰图特征和初始鸟瞰图查询特征对提取的各视角的分别对应的第一图像特征进行编码,获得鸟瞰图下的第一鸟瞰图特征。编码器网络可以包括一个或多个编码器,每个编码器可以包括时序自注意力网络,时序自注意力网络用于初始鸟瞰图查询特征根据自车的运动情况,找到其在历史的在前帧鸟瞰图中对应的位置,作为参考位置,用于后续在第一图像特征中提取该参考位置区域对应的特征。通过多个编码器的不断编码,可以获得当前帧的第一鸟瞰图特征。
步骤20212,基于时序自注意力结果和初始鸟瞰图查询特征,利用第一个编码器的第四相加归一化网络,确定第四中间结果。
其中,第四相加归一化网络的具体原理参见前述内容,在此不再赘述。
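Step 20213: determine the spatial cross-attention result based on the first image features corresponding to the respective viewing angles and the fourth intermediate result by using the spatial cross-attention network in the first encoder.
The spatial cross-attention network uniformly samples the fourth intermediate result along the height direction to obtain a set of three-dimensional coordinates, and then maps the three-dimensional coordinates to the corresponding positions in the first image features of the respective viewing angles according to the intrinsic and extrinsic parameters of the cameras; the features at those positions in the first image features can then be extracted based on deformable convolution, and after subsequent processing the first bird's-eye view feature of the current frame is obtained.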
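The height sampling and camera projection described for the spatial cross-attention network can be sketched with a pinhole camera model. This is a simplified illustration under stated assumptions: the names `sample_pillar` and `project_point`, the height range, and the distortion-free pinhole model are not taken from the disclosure, and the deformable feature sampling that follows the projection is omitted.

```python
def sample_pillar(x, y, z_min, z_max, num_points):
    """Uniformly sample num_points heights above the BEV cell centered at (x, y)."""
    if num_points == 1:
        return [(x, y, (z_min + z_max) / 2)]
    step = (z_max - z_min) / (num_points - 1)
    return [(x, y, z_min + i * step) for i in range(num_points)]

def project_point(point, rotation, translation, fx, fy, cx, cy):
    """Project an ego-frame 3D point into pixel coordinates with a pinhole model.

    rotation (3x3) and translation (length 3) map ego coordinates to camera
    coordinates (the extrinsics); fx, fy, cx, cy are the intrinsics.
    Returns None when the point lies behind the camera.
    """
    cam = [sum(r * p for r, p in zip(row, point)) + t
           for row, t in zip(rotation, translation)]
    if cam[2] <= 0:
        return None
    return (fx * cam[0] / cam[2] + cx, fy * cam[1] / cam[2] + cy)

# Usage: sample five heights for one cell and project them into one camera.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
pillar = sample_pillar(1.0, 2.0, 2.0, 6.0, 5)
pixels = [project_point(p, identity, [0.0, 0.0, 0.0], 100.0, 100.0, 50.0, 60.0)
          for p in pillar]
```

Points that return `None`, or that fall outside the image bounds, would simply not contribute for that viewing angle.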
步骤20214,基于空间交叉注意力结果和第四中间结果,确定第一鸟瞰图特征。
其中,在每个编码器中,除了上述的时序自注意力网络、第四相加归一化网络和空间交叉注意力网络之外还包括一些其他相关网络,比如在空间交叉注意力网络之后还包括相加归一化网络、前馈网络、再一相加归一化网络,等等,具体可以根据实际需求设置。因此,在获得空间交叉注意力结果之后,还可以将空间交叉注意力结果与第四中间结果相加并归一化,然后再经过其他相关网络,获得第一个编码器的编码结果,若编码器网络包括多个编码器,则第一个编码器输出的编码结果还可以经后续的各编码器继续进行编码,最终获得第一鸟瞰图特征。
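For example, FIG. 13 is a schematic diagram of the network structure of the encoder network provided by an exemplary embodiment of the present disclosure. BEV B(t-1) denotes the previous-frame bird's-eye view feature, BEV queries Q denotes the initial bird's-eye view query feature, Temporal Self-Attention denotes the temporal self-attention network, and Spatial Cross-Attention denotes the spatial cross-attention network; for the meaning of the other symbols, see the foregoing description. The previous-frame bird's-eye view feature BEV B(t-1) and the initial bird's-eye view query feature BEV queries Q interact in the temporal self-attention network: according to the vehicle's motion, the position in the previous-frame bird's-eye view that corresponds to the initial bird's-eye view query feature is found and used as the reference position, giving the temporal self-attention result. The temporal self-attention result is added to the initial bird's-eye view query feature and normalized to give the fourth intermediate result. The fourth intermediate result and the first image features of the respective viewing angles undergo the spatial cross-attention operation in the spatial cross-attention network, which uses deformable convolution to extract, based on the reference position, the features at the corresponding positions in the first image features, giving the spatial cross-attention result. The spatial cross-attention result is added to the fourth intermediate result and normalized to give a fifth intermediate result, which passes through the feed-forward network (Feed Forward) and an additive normalization network (Add&Norm) to give the encoding result of the first encoder. That encoding result is further encoded by the subsequent encoders to give the final encoding result, which is the first bird's-eye view feature. It can be understood that, during network inference, a spatial position encoding may also be embedded (Embedding) for each first image feature, which the present disclosure does not limit.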
在一个可选示例中,上述步骤20211至步骤20214可以由处理器调用存储器存储的相应指令执行,也可以由被处理器运行的第一确定单元执行。
本公开实施例通过编码器网络中的时序自注意力网络实现当前帧初始鸟瞰图查询特征与在前帧鸟瞰图的位置匹配,建立当前帧与在前帧的时序相关性,进而基于空间交叉注意力网络从第一图像特征中提取出参考位置附近的局部特征,有助于在提取出有效特征的基础上,降低计算量,从而可以提高图像处理效率。
在一个可选示例中,步骤204的基于至少一种任务查询特征中的各任务查询特征,确定各任务查询特征分别对应的任务处理结果,包括:
步骤2041a,基于静态元素任务查询特征,利用预先训练获得的静态元素检测头网络,确定静态元素检测结果。
其中,静态元素检测头网络可以采用任意可实施的头网络,比如基于多层感知机的头网络。
在一个可选示例中,该步骤2041a可以由处理器调用存储器存储的相应指令执行,也可以由被处理器运行的第二确定单元执行。
步骤2041b,基于动态对象任务查询特征,利用预先训练获得的动态对象检测头网络,确定动态对象检测结果。
其中,动态对象检测头网络可以采用任意可实施的头网络,比如基于多层感知机的头网络。
在一个可选示例中,该步骤2041b可以由处理器调用存储器存储的相应指令执行,也可以由被处理器运行的第三确定单元执行。
步骤2041c,基于运动轨迹任务查询特征,利用预先训练获得的运动轨迹预测头网络,确定运动轨迹预测结果。
其中,运动轨迹预测头网络可以采用任意可实施的头网络,比如基于多层感知机的头网络。
在一个可选示例中,该步骤2041c可以由处理器调用存储器存储的相应指令执行,也可以由被处理器运行的第四确定单元执行。
本公开实施例通过基于各视角分别对应的第一图像特征进行编码获得第一鸟瞰图特征,基于第一鸟瞰图特征及不同任务分别对应的解码网络解码出不同任务的任务查询特征,进而基于不同任务分别对应的头网络,获得不同任务分别对应的任务处理结果,可以实现基于多帧多视角环视图像的端到端的多任务处理,有助于提高任务处理效率,可以避免或降低对离线生成的高精地图及激光雷达的依赖,同时实现静态元素检测、动态对象三维检测、运动轨迹的预测,有助于降低成本。
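In an optional example, FIG. 14 is a schematic diagram of the overall structure of the network model used for image processing provided by an exemplary embodiment of the present disclosure. Based on the multi-view to-be-processed images, the feature extraction network yields the first image feature corresponding to each viewing angle, and the encoder network turns the first image features into the first bird's-eye view feature. The first bird's-eye view feature passes through the first decoding network to give the static element task query feature, from which the static element detection head network gives the static element detection result; the first bird's-eye view feature passes through the second decoding network to give the dynamic object task query feature, from which the dynamic object detection head network gives the dynamic object detection result. The dynamic object task query feature is fused with the modal query feature to give the initial motion trajectory query feature; based on the initial motion trajectory query feature and the static element task query feature obtained by the first decoding network, the third decoding network gives the motion trajectory task query feature, from which the motion trajectory prediction head network gives the motion trajectory prediction result.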
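The data flow of the overall model in FIG. 14 can be summarized as a short wiring sketch. Every stage is passed in as a callable; the dictionary keys such as `"backbone"` and `"static_decoder"` and the function name `run_pipeline` are hypothetical names introduced here, and the stages themselves are not implemented.

```python
def run_pipeline(images, nets, static_init, dynamic_init, modal_queries):
    """Wire the stages together; `nets` maps hypothetical stage names to callables."""
    image_feats = [nets["backbone"](img) for img in images]   # one feature per viewing angle
    bev = nets["encoder"](image_feats)                        # first bird's-eye view feature
    static_q = nets["static_decoder"](bev, static_init)       # first decoding network
    dynamic_q = nets["dynamic_decoder"](bev, dynamic_init)    # second decoding network
    # Fuse each dynamic object query with every modal query (element-wise addition).
    traj_init = [[d + m for d, m in zip(obj_q, modal_q)]
                 for obj_q in dynamic_q for modal_q in modal_queries]
    # The third decoding network attends to the static queries, not to the BEV feature.
    traj_q = nets["trajectory_decoder"](static_q, traj_init)
    return {
        "static": nets["static_head"](static_q),
        "dynamic": nets["dynamic_head"](dynamic_q),
        "trajectory": nets["trajectory_head"](traj_q),
    }

# Usage with stand-in stages that pass queries through and heads that count them.
passthrough = lambda context, queries: queries
stub_nets = {
    "backbone": lambda img: img, "encoder": lambda feats: feats,
    "static_decoder": passthrough, "dynamic_decoder": passthrough,
    "trajectory_decoder": passthrough,
    "static_head": len, "dynamic_head": len, "trajectory_head": len,
}
counts = run_pipeline(["front", "rear"], stub_nets,
                      [[0.0]] * 3, [[1.0], [2.0]], [[10.0], [20.0], [30.0]])
```

The sketch makes the dependency order visible: the trajectory branch can only run after both the static and the dynamic decoding branches have produced their task queries.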
在一个可选示例中,网络模型可以通过预先训练获得。当网络模型同时包括多个任务时,可以多个任务一起训练,也可以先单独训练,再综合训练,具体可以根据实际需求设置,比如,为了保证运动轨迹预测性能更好,可以先训练静态元素任务和动态对象任务,获得基础模型,再基于基础模型,进行三种任务的一起训练,具体训练原理不再赘述。
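Embodiments of the present disclosure can perform the static element, dynamic object, and motion trajectory tasks using only multi-view images. Compared with lidar, this can yield richer environmental information at lower hardware cost and is easy to deploy. In addition, static element detection enables online generation of static map information without relying on an offline-generated high-definition map, which broadens the range of application scenarios. Furthermore, no explicit dynamic object tracking is required, which helps reduce the computational complexity of the model and avoids the effect of tracking-module errors on subsequent processing, so the accuracy of the task processing results can be further improved.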
在一个可选示例中,还可以将各视角图像与激光雷达采集的数据进行特征融合,用于端到端的单任务或多任务处理,提高特征信息的丰富度,从而有助于进一步提高模型性能。
本公开上述各实施例或可选示例可以单独实施也可以在不冲突的情况下以任意组合方式结合实施,具体可以根据实际需求设置,本公开不作限定。
本公开实施例提供的任一种图像的处理方法可以由任意适当的具有数据处理能力的设备执行,包括但不限于:终端设备和服务器等。或者,本公开实施例提供的任一种图像的处理方法可以由处理器执行,如处理器通过调用存储器存储的相应指令来执行本公开实施例提及的任一种图像的处理方法。下文不再赘述。
本领域普通技术人员可以理解:实现上述方法实施例的全部或部分步骤可以通过程序指令相关的硬件来完成,前述的程序可以存储于一计算机可读取存储介质中,该程序在执行时,执行包括上述方法实施例的步骤;而前述的存储介质包括:ROM、RAM、磁碟或者光盘等各种可以存储程序代码的介质。
示例性装置
图15是本公开一示例性实施例提供的图像的处理装置的结构示意图。该实施例的装置可用于实现本公开相应的方法实施例,如图15所示的装置包括:第一处理模块501、第二处理模块502、第三处理模块503和第四处理模块504。
第一处理模块501,用于基于至少一个视角中各视角分别对应的待处理图像,确定各视角分别对应的第一图像特征;第二处理模块502,用于基于各视角分别对应的第一图像特征,确定第一鸟瞰图特征;第三处理模块503,用于基于第一鸟瞰图特征,确定静态元素任务查询特征、动态对象任务查询特征和运动轨迹任务查询特征中的至少一种任务查询特征;第四处理模块504,用于基于至少一种任务查询特征中的各任务查询特征,确定各任务查询特征分别对应的任务处理结果。
图16是本公开另一示例性实施例提供的图像的处理装置的结构示意图。
在一个可选示例中,第三处理模块503包括:
第一处理单元5031,用于基于第一鸟瞰图特征及初始静态元素查询特征,利用预先训练获得的第一解码网络,确定静态元素任务查询特征,初始静态元素查询特征包括至少一个静态元素中各静态元素分别对应的初始查询特征。
在一个可选示例中,第三处理模块503包括:
第二处理单元5032,用于基于第一鸟瞰图特征及初始动态对象查询特征,利用预先训练获得的第二解码网络,确定动态对象任务查询特征,初始动态对象查询特征包括至少一个动态对象中各动态对象分别对应的初始查询特征。
在一个可选示例中,第三处理模块503包括:
第三处理单元5033,用于基于静态元素任务查询特征及初始运动轨迹查询特征,利用预先训练获得的第三解码网络,确定运动轨迹任务查询特征,初始运动轨迹查询特征包括至少一个动态对象中各动态对象分别对应的初始轨迹查询特征。
在一个可选示例中,第三处理模块503可以包括上述第一处理单元5031、第二处理单元5032和第三处理单元5033中的至少两个,具体可以根据实际需求设置。
在一个可选示例中,第一处理单元5031具体用于:
基于初始静态元素查询特征,确定第一查询张量、第一键张量和第一值张量;基于第一查询张量、第一键张量和第一值张量,利用第一解码网络中第一个解码器的第一自注意力网络,确定第一自注意力结果;基于第一自注意力结果和初始静态元素查询特征,利用第一解码网络中的第一个解码器的第一相加归一化网络,确定第一中间结果;基于第一中间结果,确定第二查询张量;基于第一鸟瞰图特征,确定第二键张量和第二值张量;基于第二查询张量、第二键张量和第二值张量,利用第一解码网络中第一个解码器的第一可变形交叉注意力网络,确定第一交叉注意力结果;基于第一交叉注意力结果和第一中间结果,确定静态元素任务查询特征。
在一个可选示例中,第二处理单元5032具体用于:
基于初始动态对象查询特征,确定第三查询张量、第三键张量和第三值张量;基于第三查询张量、第三键张量和第三值张量,利用第二解码网络中第一个解码器的第二自注意力网络,确定第二自注意力结果;基于第二自注意力结果和初始动态对象查询特征,利用第二解码网络中的第一个解码器的第二相加归一化网络,确定第二中间结果;基于第二中间结果,确定第四查询张量;基于第一鸟瞰图特征,确定第四键张量和第四值张量;基于第四查询张量、第四键张量和第四值张量,利用第二解码网络中第一个解码器的第二可变形交叉注意力网络,确定第二交叉注意力结果;基于第二交叉注意力结果和第二中间结果,确定动态对象任务查询特征。
在一个可选示例中,第三处理单元5033具体用于:
基于初始运动轨迹查询特征,确定第五查询张量、第五键张量和第五值张量;基于第五查询张量、第五键张量和第五值张量,利用第三解码网络中第一个解码器的第三自注意力网络,确定第三自注意力结果;基于第三自注意力结果和初始运动轨迹查询特征,利用第三解码网络中的第一个解码器的第三相加归一化网络,确定第三中间结果;基于第三中间结果,确定第六查询张量;基于静态元素任务查询特征,确定第六键张量和第六值张量;基于第六查询张量、第六键张量和第六值张量,利用第三解码网络中第一个解码器的第一交叉注意力网络,确定第三交叉注意力结果;基于第三交叉注意力结果和第三中间结果,确定运动轨迹任务查询特征。
图17是本公开一示例性实施例提供的第三处理模块503的结构示意图。
在一个可选示例中,第三处理模块503还包括:
第四处理单元5034,用于基于动态对象任务查询特征及模态查询特征,确定初始运动轨迹查询特征,模态查询特征包括至少一种模态中各模态分别对应的第一模态查询特征,模态对应的第一模态查询特征用于表征动态对象的一种运动趋势。
在一个可选示例中,动态对象任务查询特征包括至少一个动态对象的任务查询特征;第四处理单元5034具体用于:
对于每个动态对象对应的任务查询特征,基于该任务查询特征,确定第一数量的该任务查询特征,第一数量为模态查询特征中包括的第一模态查询特征数量;将第一数量的该任务查询特征分别与模态查询特征中各模态分别对应的第一模态查询特征相加,获得该动态对象对应的初始轨迹查询特征;基于各动态对象分别对应的初始轨迹查询特征,确定初始运动轨迹查询特征。
在一个可选示例中,第二处理模块502包括:
第一确定单元5021,用于基于各视角分别对应的第一图像特征、初始鸟瞰图查询特征、及在第一鸟瞰图特征之前获得的在前帧鸟瞰图特征,确定第一鸟瞰图特征。
在一个可选示例中,第一确定单元5021具体用于:
基于在前帧鸟瞰图特征和初始鸟瞰图查询特征,利用预先训练获得的编码器网络中第一个编码器的时序自注意力网络,确定时序自注意力结果;基于时序自注意力结果和初始鸟瞰图查询特征,利用第一个编码器的第四相加归一化网络,确定第四中间结果;基于各视角分别对应的第一图像特征和第四中间结果,利用第一个编码器中的空间交叉注意力网络,确定空间交叉注意力结果;基于空间交叉注意力结果和第四中间结果,确定第一鸟瞰图特征。
在一个可选示例中,第四处理模块504包括:
第二确定单元5041,用于基于静态元素任务查询特征,利用预先训练获得的静态元素检测头网络,确定静态元素检测结果;第三确定单元5042,用于基于动态对象任务查询特征,利用预先训练获得的动态对象检测头网络,确定动态对象检测结果;第四确定单元5043,用于基于运动轨迹任务查询特征,利用预先训练获得的运动轨迹预测头网络,确定运动轨迹预测结果。
在一个可选示例中,本公开上述各单元还可以根据实际需求进行更细粒度的划分,比如将单元划分成多个子单元,具体可以根据实际需求设置。
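The above embodiments or optional examples of the present disclosure can be implemented individually or combined in any manner where no conflict arises, which can be set according to actual needs and is not limited by the present disclosure.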
本装置示例性实施例对应的有益技术效果可以参见上述示例性方法部分的相应有益技术效果,在此不再赘述。
示例性电子设备
图18是本公开电子设备一个应用实施例的结构示意图。本实施例中,该电子设备10包括一个或多个处理器11和存储器12。
处理器11可以是中央处理单元(CPU)或者具有数据处理能力和/或指令执行能力的其他形式的处理单元,并且可以控制电子设备10中的其他组件以执行期望的功能。
存储器12可以包括一个或多个计算机程序产品,所述计算机程序产品可以包括各种形式的计算机可读存储介质,例如易失性存储器和/或非易失性存储器。所述易失性存储器例如可以包括随机存取存储器(RAM)和/或高速缓冲存储器(cache)等。所述非易失性存储器例如可以包括只读存储器(ROM)、硬盘、闪存等。在所述计算机可读存储介质上可以存储一个或多个计算机程序指令,处理器11可以运行所述程序指令,以实现上文所述的本公开的各个实施例的方法以及/或者其他期望的功能。在所述计算机可读存储介质中还可以存储诸如输入信号、信号分量、噪声分量等各种内容。
在一个示例中,电子设备10还可以包括:输入装置13和输出装置14,这些组件通过总线系统和/或其他形式的连接机构(未示出)互连。
此外,该输入装置13还可以包括例如键盘、鼠标等等。
该输出装置14可以向外部输出各种信息,该输出装置14可以包括例如显示器、扬声器、打印机、以及通信网络及其所连接的远程输出设备等等。
当然,为了简化,图18中仅示出了该电子设备10中与本公开有关的组件中的一些,省略了诸如总线、输入/输出接口等等的组件。除此之外,根据具体应用情况,电子设备10还可以包括任何其他适当的组件。
示例性计算机程序产品和计算机可读存储介质
除了上述方法和设备以外,本公开的实施例还可以是计算机程序产品,其包括计算机程序指令,所述计算机程序指令在被处理器运行时使得所述处理器执行本说明书上述“示例性方法”部分中描述的根据本公开各种实施例的方法中的步骤。
所述计算机程序产品可以以一种或多种程序设计语言的任意组合来编写用于执行本公开实施例操作的程序代码,所述程序设计语言包括面向对象的程序设计语言,诸如Java、C++等,还包括常规的过程式程序设计语言,诸如“C”语言或类似的程序设计语言。程序代码可以完全地在用户计算设备上执行、部分地在用户设备上执行、作为一个独立的软件包执行、部分在用户计算设备上部分在远程计算设备上执行、或者完全在远程计算设备或服务器上执行。
此外,本公开的实施例还可以是计算机可读存储介质,其上存储有计算机程序指令,所述计算机程序指令在被处理器运行时使得所述处理器执行本说明书上述“示例性方法”部分中描述的根据本公开各种实施例的方法中的步骤。
所述计算机可读存储介质可以采用一个或多个可读介质的任意组合。可读介质可以是可读信号介质或者可读存储介质。可读存储介质例如可以包括但不限于电、磁、光、电磁、红外线、或半导体的系统、装置或器件,或者任意以上的组合。可读存储介质的更具体的例子(非穷举的列表)包括:具有一个或多个导线的电连接、便携式盘、硬盘、随机存取存储器(RAM)、只读存储器(ROM)、可擦式可编程只读存储器(EPROM或闪存)、光纤、便携式紧凑盘只读存储器(CD-ROM)、光存储器件、磁存储器件、或者上述的任意合适的组合。
以上结合具体实施例描述了本公开的基本原理,但是,在本公开中提及的优点、优势、效果等仅是示例而非限制,不能认为其是本公开的各个实施例必须具备的。另外,上述公开的具体细节仅是为了示例的作用和便于理解的作用,而非限制,上述细节并不限制本公开为必须采用上述具体的细节来实现。
本领域的技术人员可以对本公开进行各种改动和变型而不脱离本申请的精神和范围。这样,倘若本申请的这些修改和变型属于本公开权利要求及其等同技术的范围之内,则本公开也意图包含这些改动和变型在内。

Claims (12)

  1. 一种图像的处理方法,包括:
    基于至少一个视角中各所述视角分别对应的待处理图像,确定各所述视角分别对应的第一图像特征;
    基于各所述视角分别对应的所述第一图像特征,确定第一鸟瞰图特征;
    基于所述第一鸟瞰图特征,确定静态元素任务查询特征、动态对象任务查询特征和运动轨迹任务查询特征中的至少一种任务查询特征;
    基于所述至少一种任务查询特征中的各所述任务查询特征,确定各所述任务查询特征分别对应的任务处理结果。
  2. 根据权利要求1所述的方法,其中,所述基于所述第一鸟瞰图特征,确定静态元素任务查询特征、动态对象任务查询特征和运动轨迹任务查询特征中的至少一种任务查询特征,包括:
    基于所述第一鸟瞰图特征及初始静态元素查询特征,利用预先训练获得的第一解码网络,确定所述静态元素任务查询特征,所述初始静态元素查询特征包括至少一个静态元素中各所述静态元素分别对应的初始查询特征;和/或,
    基于所述第一鸟瞰图特征及初始动态对象查询特征,利用预先训练获得的第二解码网络,确定所述动态对象任务查询特征,所述初始动态对象查询特征包括至少一个动态对象中各所述动态对象分别对应的初始查询特征;和/或,
    基于所述静态元素任务查询特征及初始运动轨迹查询特征,利用预先训练获得的第三解码网络,确定所述运动轨迹任务查询特征,所述初始运动轨迹查询特征包括至少一个动态对象中各所述动态对象分别对应的初始轨迹查询特征。
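  3. The method according to claim 2, wherein the determining the static element task query feature by using a pre-trained first decoding network, based on the first bird's-eye-view feature and an initial static element query feature, comprises:
    determining a first query tensor, a first key tensor, and a first value tensor based on the initial static element query feature;
    determining a first self-attention result by using a first self-attention network of the first decoder in the first decoding network, based on the first query tensor, the first key tensor, and the first value tensor;
    determining a first intermediate result by using a first add-and-normalize network of the first decoder in the first decoding network, based on the first self-attention result and the initial static element query feature;
    determining a second query tensor based on the first intermediate result;
    determining a second key tensor and a second value tensor based on the first bird's-eye-view feature;
    determining a first cross-attention result by using a first deformable cross-attention network of the first decoder in the first decoding network, based on the second query tensor, the second key tensor, and the second value tensor; and
    determining the static element task query feature based on the first cross-attention result and the first intermediate result; and/or
    the determining the dynamic object task query feature by using a pre-trained second decoding network, based on the first bird's-eye-view feature and an initial dynamic object query feature, comprises:
    determining a third query tensor, a third key tensor, and a third value tensor based on the initial dynamic object query feature;
    determining a second self-attention result by using a second self-attention network of the first decoder in the second decoding network, based on the third query tensor, the third key tensor, and the third value tensor;
    determining a second intermediate result by using a second add-and-normalize network of the first decoder in the second decoding network, based on the second self-attention result and the initial dynamic object query feature;
    determining a fourth query tensor based on the second intermediate result;
    determining a fourth key tensor and a fourth value tensor based on the first bird's-eye-view feature;
    determining a second cross-attention result by using a second deformable cross-attention network of the first decoder in the second decoding network, based on the fourth query tensor, the fourth key tensor, and the fourth value tensor; and
    determining the dynamic object task query feature based on the second cross-attention result and the second intermediate result.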
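The decoder steps recited in claim 3 (self-attention, add-and-normalize, cross-attention against the BEV feature, final combination) can be sketched as below. This is an illustration under several assumptions rather than the claimed networks: dense scaled dot-product attention is swapped in for the deformable cross-attention named in the claim, the query, key, and value tensors are taken as identity projections of their inputs, the normalization has no learned scale or shift, and the final combination is assumed to be a second add-and-normalize step (the claim only says "based on"). The function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Single-head scaled dot-product attention over lists of row vectors."""
    scale = math.sqrt(len(q[0]))
    out = []
    for q_row in q:
        weights = softmax([sum(a * b for a, b in zip(q_row, k_row)) / scale for k_row in k])
        out.append([sum(w * v_row[c] for w, v_row in zip(weights, v)) for c in range(len(v[0]))])
    return out


def add_norm(x, y, eps=1e-5):
    """Residual addition followed by per-row layer normalization."""
    out = []
    for x_row, y_row in zip(x, y):
        row = [a + b for a, b in zip(x_row, y_row)]
        mean = sum(row) / len(row)
        var = sum((a - mean) ** 2 for a in row) / len(row)
        out.append([(a - mean) / math.sqrt(var + eps) for a in row])
    return out


def decoder_layer(initial_queries, bev_tokens):
    """One decoder pass: initial queries in, task query features out."""
    # Self-attention: Q, K, and V all come from the initial query features.
    self_result = attention(initial_queries, initial_queries, initial_queries)
    # Add-and-normalize yields the intermediate result.
    intermediate = add_norm(self_result, initial_queries)
    # Cross-attention: Q from the intermediate result, K and V from the BEV feature.
    cross_result = attention(intermediate, bev_tokens, bev_tokens)
    # Assumed final combination of cross-attention result and intermediate result.
    return add_norm(cross_result, intermediate)
```

The same layer shape serves both halves of the claim: only the initial queries differ (static elements versus dynamic objects), while the BEV feature supplies keys and values in both cases.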
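  4. The method according to claim 2, wherein the determining the motion trajectory task query feature by using a pre-trained third decoding network, based on the static element task query feature and an initial motion trajectory query feature, comprises:
    determining a fifth query tensor, a fifth key tensor, and a fifth value tensor based on the initial motion trajectory query feature;
    determining a third self-attention result by using a third self-attention network of the first decoder in the third decoding network, based on the fifth query tensor, the fifth key tensor, and the fifth value tensor;
    determining a third intermediate result by using a third add-and-normalize network of the first decoder in the third decoding network, based on the third self-attention result and the initial motion trajectory query feature;
    determining a sixth query tensor based on the third intermediate result;
    determining a sixth key tensor and a sixth value tensor based on the static element task query feature;
    determining a third cross-attention result by using a first cross-attention network of the first decoder in the third decoding network, based on the sixth query tensor, the sixth key tensor, and the sixth value tensor; and
    determining the motion trajectory task query feature based on the third cross-attention result and the third intermediate result.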
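What distinguishes claim 4 is the source of the keys and values: the trajectory queries attend to the static element task query features rather than to the BEV feature. The sketch below isolates that cross-attention step for a single trajectory query. It assumes identity projections for the sixth query, key, and value tensors and a single attention head; `cross_attention_weights` and `attend_to_static` are hypothetical names.

```python
import math


def cross_attention_weights(trajectory_query, static_queries):
    """Softmax-normalized dot-product scores of one query against all keys."""
    scale = math.sqrt(len(trajectory_query))
    scores = [
        sum(q * k for q, k in zip(trajectory_query, key)) / scale
        for key in static_queries
    ]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend_to_static(trajectory_query, static_queries):
    """Weighted sum of the static element features, which also act as values."""
    weights = cross_attention_weights(trajectory_query, static_queries)
    dim = len(static_queries[0])
    return [sum(w * s[c] for w, s in zip(weights, static_queries)) for c in range(dim)]
```

A trajectory query that is aligned with one static element feature receives a larger weight on that element, so the result is pulled toward it. One reading of the claim is that this lets predicted motion take map elements into account, though the claim itself does not state that purpose.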
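  5. The method according to claim 2, further comprising, before the determining the motion trajectory task query feature by using a pre-trained third decoding network, based on the static element task query feature and an initial motion trajectory query feature:
    determining the initial motion trajectory query feature based on the dynamic object task query feature and a modal query feature, wherein the modal query feature comprises a first modal query feature corresponding to each of at least one mode, and the first modal query feature corresponding to a mode represents one motion tendency of a dynamic object.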
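  6. The method according to claim 5, wherein the dynamic object task query feature comprises a task query feature of at least one dynamic object, and the determining the initial motion trajectory query feature based on the dynamic object task query feature and a modal query feature comprises:
    for the task query feature corresponding to each dynamic object, determining a first number of copies of that task query feature based on that task query feature, wherein the first number is the number of first modal query features included in the modal query feature;
    adding the first number of copies of that task query feature respectively to the first modal query features corresponding to the respective modes in the modal query feature, to obtain the initial trajectory query features corresponding to that dynamic object; and
    determining the initial motion trajectory query feature based on the initial trajectory query features respectively corresponding to the dynamic objects.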
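The construction in claim 6 amounts to a broadcast addition: each dynamic object's task query feature is replicated once per mode, and each copy is offset by that mode's query feature. A minimal sketch, in which the names `agent_queries` and `mode_queries` and the nested-list layout are assumptions made for illustration:

```python
def init_trajectory_queries(agent_queries, mode_queries):
    """Return, per dynamic object, one initial trajectory query per mode."""
    trajectory_queries = []
    for agent_query in agent_queries:
        # One copy of the object's query per mode, each offset by that mode's query.
        per_object = [
            [a + m for a, m in zip(agent_query, mode_query)]
            for mode_query in mode_queries
        ]
        trajectory_queries.append(per_object)
    return trajectory_queries


# Two dynamic objects, three modes, feature dimension 2.
queries = init_trajectory_queries(
    agent_queries=[[1.0, 2.0], [10.0, 20.0]],
    mode_queries=[[0.0, 0.0], [0.5, -0.5], [1.0, 1.0]],
)
```

With N dynamic objects and M modes the result holds N x M query vectors, so the number of modes (the "first number" of the claim) directly sets how many candidate trajectories are considered per object.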
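  7. The method according to any one of claims 1 to 6, wherein the determining a first bird's-eye-view feature based on the first image features respectively corresponding to each view comprises:
    determining the first bird's-eye-view feature based on the first image features respectively corresponding to each view, an initial bird's-eye-view query feature, and a previous-frame bird's-eye-view feature obtained before the first bird's-eye-view feature.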
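  8. The method according to claim 7, wherein the determining the first bird's-eye-view feature based on the first image features respectively corresponding to each view, an initial bird's-eye-view query feature, and a previous-frame bird's-eye-view feature obtained before the first bird's-eye-view feature comprises:
    determining a temporal self-attention result by using a temporal self-attention network of the first encoder in a pre-trained encoder network, based on the previous-frame bird's-eye-view feature and the initial bird's-eye-view query feature;
    determining a fourth intermediate result by using a fourth add-and-normalize network of the first encoder, based on the temporal self-attention result and the initial bird's-eye-view query feature;
    determining a spatial cross-attention result by using a spatial cross-attention network in the first encoder, based on the first image features respectively corresponding to each view and the fourth intermediate result; and
    determining the first bird's-eye-view feature based on the spatial cross-attention result and the fourth intermediate result.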
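The encoder steps of claim 8 can be sketched as follows. The claim does not fix how queries, keys, and values are formed, so this sketch assumes that the initial BEV queries act as queries and the previous-frame BEV feature supplies keys and values in the temporal step, that image tokens from all views are concatenated for the spatial step, that both attention steps are plain dense dot-product attention, and that the final combination is an add-and-normalize step. All function names are hypothetical.

```python
import math


def attend(queries, memory):
    """Single-head dot-product attention; `memory` serves as keys and values."""
    scale = math.sqrt(len(queries[0]))
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, m)) / scale for m in memory]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        out.append([sum(e / total * m[c] for e, m in zip(exps, memory))
                    for c in range(len(memory[0]))])
    return out


def add_norm(x, y, eps=1e-5):
    """Residual addition followed by per-row layer normalization."""
    out = []
    for x_row, y_row in zip(x, y):
        row = [a + b for a, b in zip(x_row, y_row)]
        mean = sum(row) / len(row)
        std = math.sqrt(sum((a - mean) ** 2 for a in row) / len(row) + eps)
        out.append([(a - mean) / std for a in row])
    return out


def bev_encoder_layer(bev_queries, prev_bev, view_features):
    """One encoder pass producing the BEV feature for the current frame."""
    # Temporal self-attention: current queries look at the previous-frame BEV feature.
    temporal = attend(bev_queries, prev_bev)
    # Add-and-normalize gives the (fourth) intermediate result.
    intermediate = add_norm(temporal, bev_queries)
    # Spatial cross-attention over image tokens gathered from every view.
    image_tokens = [token for view in view_features for token in view]
    spatial = attend(intermediate, image_tokens)
    # Assumed final combination of spatial result and intermediate result.
    return add_norm(spatial, intermediate)
```

Because the views are flattened into one token list, the number of cameras can change without changing the layer, which fits the "at least one view" wording of claim 1.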
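  9. The method according to any one of claims 1 to 6, wherein the determining, based on each task query feature of the at least one type of task query feature, a task processing result corresponding to each task query feature comprises:
    determining a static element detection result by using a pre-trained static element detection head network, based on the static element task query feature;
    determining a dynamic object detection result by using a pre-trained dynamic object detection head network, based on the dynamic object task query feature; and
    determining a motion trajectory prediction result by using a pre-trained motion trajectory prediction head network, based on the motion trajectory task query feature.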
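Claim 9 leaves the internal form of the head networks open. As one hedged illustration, a detection head could map a single task query feature to a class index and a set of regressed values through two affine maps. The two-branch layout, the function names, and the weights below are all assumptions made for the example, not details taken from the disclosure.

```python
def linear(x, weight, bias):
    """Affine map: one output value per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def detection_head(query, cls_weight, cls_bias, reg_weight, reg_bias):
    """Map one task query feature to (class index, regressed values)."""
    logits = linear(query, cls_weight, cls_bias)
    # The predicted class is the index of the largest logit.
    label = max(range(len(logits)), key=lambda i: logits[i])
    return label, linear(query, reg_weight, reg_bias)


# Toy weights: two classes, one regressed value.
label, values = detection_head(
    query=[1.0, 2.0],
    cls_weight=[[1.0, 0.0], [0.0, 1.0]],
    cls_bias=[0.0, 0.0],
    reg_weight=[[0.5, 0.5]],
    reg_bias=[1.0],
)
```

A trajectory prediction head could reuse the regression branch with one output per future waypoint coordinate, again as an assumption rather than the disclosed design.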
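  10. An image processing apparatus, comprising:
    a first processing module, configured to determine, based on to-be-processed images respectively corresponding to each of at least one view, first image features respectively corresponding to each view;
    a second processing module, configured to determine a first bird's-eye-view feature based on the first image features respectively corresponding to each view;
    a third processing module, configured to determine, based on the first bird's-eye-view feature, at least one type of task query feature among a static element task query feature, a dynamic object task query feature, and a motion trajectory task query feature; and
    a fourth processing module, configured to determine, based on each task query feature of the at least one type of task query feature, a task processing result corresponding to each task query feature.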
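  11. A computer-readable storage medium storing a computer program, wherein the computer program is used to perform the image processing method according to any one of claims 1 to 9.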
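  12. An electronic device, comprising:
    a processor; and
    a memory for storing instructions executable by the processor;
    wherein the processor is configured to read the executable instructions from the memory and execute the instructions to implement the image processing method according to any one of claims 1 to 9.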
PCT/CN2023/118093 2022-11-11 2023-09-11 Image processing method and apparatus, electronic device, and storage medium Ceased WO2024098941A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP23887635.3A EP4610945A4 (en) 2022-11-11 2023-09-11 IMAGE PROCESSING METHOD AND APPARATUS, AS WELL AS ELECTRONIC DEVICE AND STORAGE MEDIA
JP2025526866A JP2025539067A (ja) 2022-11-11 2023-09-11 Image processing method, apparatus, electronic device, and storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211417346.2 2022-11-11
CN202211417346.2A CN115719476A (zh) 2022-11-11 2022-11-11 Image processing method and apparatus, electronic device, and storage medium

Publications (1)

Publication Number Publication Date
WO2024098941A1 (zh) 2024-05-16

Family

ID=85255051

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/118093 Ceased WO2024098941A1 (zh) 2022-11-11 2023-09-11 Image processing method and apparatus, electronic device, and storage medium

Country Status (4)

Country Link
EP (1) EP4610945A4 (zh)
JP (1) JP2025539067A (zh)
CN (1) CN115719476A (zh)
WO (1) WO2024098941A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
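CN118397067A (zh) * 2024-06-25 2024-07-26 中国科学技术大学 Multi-view sparse-query 3D object detection method with adaptive depth encoding
CN120014605A (zh) * 2025-04-21 2025-05-16 智驾大陆(上海)智能科技有限公司 Image processing method, readable storage medium, program product, and in-vehicle device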

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115719476A (zh) * 2022-11-11 2023-02-28 北京地平线信息技术有限公司 Image processing method and apparatus, electronic device, and storage medium
CN117272207B (zh) * 2023-10-10 2026-01-02 江苏衡新数智科技有限公司 Data center anomaly analysis method and system

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
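CN113569868A (zh) * 2021-06-11 2021-10-29 北京旷视科技有限公司 Target detection method and apparatus, and electronic device
US20220207270A1 (en) * 2020-12-31 2022-06-30 Toyota Research Institute, Inc. Using a bird's eye view feature map, augmented with semantic information, to detect an object in an environment
CN114723955A (zh) * 2022-03-30 2022-07-08 上海人工智能创新中心 Image processing method, apparatus, device, and computer-readable storage medium
CN114882465A (zh) * 2022-06-01 2022-08-09 北京地平线信息技术有限公司 Visual perception method and apparatus, storage medium, and electronic device
CN114898315A (zh) * 2022-05-05 2022-08-12 北京鉴智科技有限公司 Driving scene information determination method, and object information prediction model training method and apparatus
CN115273022A (zh) * 2022-06-27 2022-11-01 重庆长安汽车股份有限公司 Vehicle bird's-eye-view generation method and apparatus, vehicle, and storage medium
CN115719476A (zh) * 2022-11-11 2023-02-28 北京地平线信息技术有限公司 Image processing method and apparatus, electronic device, and storage medium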

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP3950085B2 (ja) * 2003-06-10 2007-07-25 株式会社つくばマルチメディア Map-guided omnidirectional video system
US11554785B2 (en) * 2019-05-07 2023-01-17 Foresight Ai Inc. Driving scenario machine learning network and driving environment simulation
US12175775B2 (en) * 2020-06-11 2024-12-24 Toyota Research Institute, Inc. Producing a bird's eye view image from a two dimensional image
CN114463553B (zh) * 2022-02-09 2026-01-23 北京地平线信息技术有限公司 Image processing method and apparatus, electronic device, and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP4610945A4

Also Published As

Publication number Publication date
JP2025539067A (ja) 2025-12-03
EP4610945A4 (en) 2026-02-18
EP4610945A1 (en) 2025-09-03
CN115719476A (zh) 2023-02-28

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application. Ref document number: 23887635; Country of ref document: EP; Kind code of ref document: A1
ENP Entry into the national phase. Ref document number: 2025526866; Country of ref document: JP; Kind code of ref document: A
WWE Wipo information: entry into national phase. Ref document number: 2025526866; Country of ref document: JP
WWE Wipo information: entry into national phase. Ref document number: 2023887635; Country of ref document: EP
ENP Entry into the national phase. Ref document number: 2023887635; Country of ref document: EP; Effective date: 20250527
NENP Non-entry into the national phase. Ref country code: DE
WWP Wipo information: published in national office. Ref document number: 2023887635; Country of ref document: EP