WO2024254861A1 - Intelligent driving method and device - Google Patents

Intelligent driving method and device

Info

Publication number
WO2024254861A1
Authority
WO
WIPO (PCT)
Prior art keywords
network
voxels
information
scene
obstacle
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/CN2023/100778
Other languages
English (en)
French (fr)
Inventor
苏鹏
李世勇
黄青虬
许春景
叶超强
董思远
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Yinwang Intelligent Technology Co Ltd
Original Assignee
Shenzhen Yinwang Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Yinwang Intelligent Technology Co Ltd
Priority to CN202380088064.9A (CN120418139A, zh)
Priority to PCT/CN2023/100778 (WO2024254861A1, zh)
Priority to AU2023456637A (AU2023456637A1, en)
Priority to KR1020267001236A (KR20260023647A, ko)
Priority to EP23941094.7A (EP4714766A1, en)
Publication of WO2024254861A1 (zh)
Priority to US19/419,891 (US20260103189A1, en)
Anticipated expiration
Ceased (current legal status)

Classifications

    • B PERFORMING OPERATIONS; TRANSPORTING
    • B60 VEHICLES IN GENERAL
    • B60W CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W30/00 Purposes of road vehicle drive control systems not related to the control of a particular sub-unit, e.g. of systems using conjoint control of vehicle sub-units
    • B60W30/08 Active safety systems predicting or avoiding probable or impending collision or attempting to minimise its consequences
    • B60W30/09 Taking automatic action to avoid collision, e.g. braking and steering
    • B PERFORMING OPERATIONS; TRANSPORTING
    • B60 VEHICLES IN GENERAL
    • B60W CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W10/00 Conjoint control of vehicle sub-units of different type or different function
    • B60W10/22 Conjoint control of vehicle sub-units of different type or different function including control of suspension systems
    • B PERFORMING OPERATIONS; TRANSPORTING
    • B60 VEHICLES IN GENERAL
    • B60W CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W30/00 Purposes of road vehicle drive control systems not related to the control of a particular sub-unit, e.g. of systems using conjoint control of vehicle sub-units
    • B60W30/08 Active safety systems predicting or avoiding probable or impending collision or attempting to minimise its consequences
    • B60W30/085 Taking automatic action to adjust vehicle attitude in preparation for collision, e.g. braking for nose dropping
    • B PERFORMING OPERATIONS; TRANSPORTING
    • B60 VEHICLES IN GENERAL
    • B60W CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W30/00 Purposes of road vehicle drive control systems not related to the control of a particular sub-unit, e.g. of systems using conjoint control of vehicle sub-units
    • B60W30/08 Active safety systems predicting or avoiding probable or impending collision or attempting to minimise its consequences
    • B60W30/095 Predicting travel path or likelihood of collision
    • B60W30/0956 Predicting travel path or likelihood of collision the prediction being responsive to traffic or environmental parameters
    • B PERFORMING OPERATIONS; TRANSPORTING
    • B60 VEHICLES IN GENERAL
    • B60W CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W40/00 Estimation or calculation of non-directly measurable driving parameters for road vehicle drive control systems not related to the control of a particular sub unit, e.g. by using mathematical models
    • B60W40/02 Estimation or calculation of non-directly measurable driving parameters for road vehicle drive control systems not related to the control of a particular sub unit, e.g. by using mathematical models related to ambient conditions
    • B PERFORMING OPERATIONS; TRANSPORTING
    • B60 VEHICLES IN GENERAL
    • B60W CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W50/00 Details of control systems for road vehicle drive control not related to the control of a particular sub-unit, e.g. process diagnostic or vehicle driver interfaces
    • B60W50/08 Interaction between the driver and the control system
    • B60W50/14 Means for informing the driver, warning the driver or prompting a driver intervention
    • B PERFORMING OPERATIONS; TRANSPORTING
    • B60 VEHICLES IN GENERAL
    • B60W CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W60/00 Drive control systems specially adapted for autonomous road vehicles
    • B60W60/001 Planning or execution of driving tasks
    • B60W60/0011 Planning or execution of driving tasks involving control alternatives for a single driving scenario, e.g. planning several paths to avoid obstacles
    • B PERFORMING OPERATIONS; TRANSPORTING
    • B60 VEHICLES IN GENERAL
    • B60W CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W60/00 Drive control systems specially adapted for autonomous road vehicles
    • B60W60/001 Planning or execution of driving tasks
    • B60W60/0015 Planning or execution of driving tasks specially adapted for safety
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/50 Context or environment of the image
    • G06V20/56 Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G06V20/58 Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads
    • B PERFORMING OPERATIONS; TRANSPORTING
    • B60 VEHICLES IN GENERAL
    • B60W CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W2420/00 Indexing codes relating to the type of sensors based on the principle of their operation
    • B60W2420/40 Photo, light or radio wave sensitive means, e.g. infrared sensors
    • B60W2420/403 Image sensing, e.g. optical camera
    • B PERFORMING OPERATIONS; TRANSPORTING
    • B60 VEHICLES IN GENERAL
    • B60W CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W2420/00 Indexing codes relating to the type of sensors based on the principle of their operation
    • B60W2420/40 Photo, light or radio wave sensitive means, e.g. infrared sensors
    • B60W2420/408 Radar; Laser, e.g. lidar
    • B PERFORMING OPERATIONS; TRANSPORTING
    • B60 VEHICLES IN GENERAL
    • B60W CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W2554/00 Input parameters relating to objects
    • B60W2554/40 Dynamic objects, e.g. animals, windblown objects
    • B60W2554/404 Characteristics
    • B60W2554/4045 Intention, e.g. lane change or imminent movement

Definitions

  • the present application relates to the field of intelligent driving, and in particular to an intelligent driving method and device.
  • the perception system of some vehicles uses a purely visual method to detect obstacles in the surrounding environment. This method is highly dependent on training materials (such as white lists): the perception system must be trained on an obstacle before it can recognize that obstacle.
  • other vehicles use lidar or millimeter-wave radar to detect obstacles in the surrounding environment, but this type of detection is easily affected by the weather. For example, the accuracy of obstacle detection is low in rainy and snowy weather.
  • the present application discloses an intelligent driving method and device, which can enhance the vehicle's perception of surrounding objects, help improve the accuracy of obstacle detection, and avoid collisions.
  • the present application provides an intelligent driving method, which includes: obtaining data collected by a sensor for a first scene, the sensor including at least one of a camera and a radar; inputting the collected data into a perception detection network, and outputting perception information, wherein the perception information is used to indicate voxels of obstacles in the first scene; and controlling vehicle driving based at least on the perception information.
  • obstacles refer to entities that the vehicle does not expect to collide with during driving, and the entities can be static or dynamic.
  • static entities can be, for example, cartons on the road, road construction signs, road dividing railings, dirt piles, tires, overturned vehicles, lying people, animals, buildings beside the road, trees, parked vehicles, road signs, electric poles, roadside isolation strips and other stationary objects with volume and mass.
  • dynamic entities can be, for example, pedestrians (such as walking pedestrians, pedestrians riding bicycles, etc.), animals, vehicles, vehicles carrying goods (such as loaded with cartons, branches or other goods) and other moving objects with volume and mass.
  • the application does not limit the appearance of obstacles in the physical world.
  • a vehicle can be in the form of tires touching the ground when driving or parking, or in the form of overturning after a collision, or in the form of a vehicle with cargo (such as branches, cartons, etc.) loaded in the rear box, or in the form of multiple vehicles connected.
  • the type of vehicle is not limited when the obstacle is a vehicle.
  • the type of vehicle can be, for example, a car, a truck, a bus, a trailer, an incomplete vehicle, a motorcycle, a bicycle, etc.
  • the radar includes at least one of a laser radar and a millimeter wave radar.
  • the first scene can be understood as an environment space that can be detected by a sensor on the vehicle during driving. It can be understood that during driving, each moment can correspond to a scene, and the scenes corresponding to multiple moments include the scenes corresponding to each moment in the multiple moments.
  • the perception detection network is trained based on the sensor data set and the label information corresponding to the sensor data set generated by 4D reconstruction (i.e., spatiotemporal reconstruction including dynamic and static targets).
  • the label information provides the perception detection network with ground-truth values for its prediction results. It can be understood that 4D reconstruction describes how physical objects in three-dimensional space change over time.
  • vehicle travel can also be controlled by combining at least one of navigation map information, high-precision map information, roadside equipment, and live traffic information broadcast by surrounding vehicles.
  • the method can be applied to a vehicle or a component (such as a chip or integrated circuit) in the vehicle for intelligent driving control.
  • the vehicle is equipped with an automatic driving system, which may be, without limitation, a fully automatic driving system, a highly automatic driving system, a conditional automatic driving system, or a partially automatic driving system.
  • the scene data is collected by pure vision or a combination of vision and radar, and the scene data is processed by the perception detection network to output the perception information of the voxels indicating the obstacles, which can enhance the vehicle's perception of surrounding objects, realize the perception of obstacles unrelated to the semantic category, and improve the generalization ability and accuracy of obstacle detection in the scene.
  • controlling the driving of the vehicle based on the perception information can improve the safety of vehicle driving.
  • the method further includes: displaying the obstacle based on the perception information, the obstacle being marked with a polygonal box; and/or displaying voxels of the obstacle based on the perception information, the voxels of the obstacle being marked with a polygonal box.
  • the polygonal box may be two-dimensional or three-dimensional.
  • dynamic obstacles and static obstacles at the current moment can be distinguished by different colors, or by displaying an arrow on the dynamic obstacle to distinguish dynamic obstacles from static obstacles, and the arrow on the dynamic obstacle is used to indicate the movement direction of the dynamic obstacle.
  • marking obstacles with polygonal boxes is more consistent with the shape of the obstacles themselves.
  • the user can clearly and intuitively understand the vehicle's current perception of the surrounding environment.
  • the perception information includes at least one item of the following information: the occupancy status of voxels of the first scene, speed information of voxels of the first scene, visibility status of voxels of the first scene, and corner point information of a polygonal box corresponding to the obstacle; wherein the polygonal box corresponding to the obstacle is associated with the voxels of the obstacle.
  • the visible state of a voxel can be divided into "visible" and "invisible". For example, in a scene where a vehicle is located at the current moment, if a certain voxel in the scene is not touched by an observation signal of any sensor (including a camera and a radar) on the vehicle at the current moment, the visible state of the voxel is invisible; if the voxel is touched by an observation signal of at least one sensor on the vehicle, the visible state of the voxel is visible.
  • the occupancy state of a voxel can be, for example, classified as "occupied" or "empty" (i.e., unoccupied). For example, in the scene where the vehicle is currently located, if a certain voxel in the scene has a physical entity at the corresponding spatial position in the physical world where the scene is located, the occupancy state of the voxel is occupied; if the voxel does not have a physical entity at the corresponding spatial position in the physical world where the scene is located, the occupancy state of the voxel is empty. It can be understood that air is not a physical entity.
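
The occupancy and visibility definitions above can be illustrated with a minimal Python sketch. It marks a voxel as occupied when a sensor return falls inside it, and as visible when a ray from the sensor to a return passes through it. The voxel size, the helper names (`voxel_index`, `occupancy_and_visibility`) and the coarse half-voxel ray sampling are assumptions made for illustration; the application does not specify how these states are computed.

```python
import math

VOXEL_SIZE = 0.5  # metres per voxel edge (assumed value, not from the application)


def voxel_index(point, voxel_size=VOXEL_SIZE):
    """Map a 3D point (x, y, z) to the integer index of the voxel that contains it."""
    return tuple(math.floor(c / voxel_size) for c in point)


def occupancy_and_visibility(sensor_origin, returns, voxel_size=VOXEL_SIZE):
    """Return (occupied, visible) sets of voxel indices for one sensor.

    occupied: voxels containing at least one return, i.e. a physical entity.
    visible:  voxels touched by an observation ray from the sensor to a return.
    """
    occupied, visible = set(), set()
    for hit in returns:
        occupied.add(voxel_index(hit, voxel_size))
        # Sample the ray every half voxel; a coarse stand-in for exact ray traversal.
        length = math.dist(sensor_origin, hit)
        steps = max(1, int(length / (voxel_size / 2)))
        for i in range(steps + 1):
            t = i / steps
            sample = tuple(o + t * (h - o) for o, h in zip(sensor_origin, hit))
            visible.add(voxel_index(sample, voxel_size))
    return occupied, visible
```

Voxels behind the first return along a ray never enter the `visible` set, which corresponds to the blind-spot notion used in the text.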
  • the association between the polygonal box corresponding to the obstacle and the voxel of the obstacle can be understood as: the corner point information of the polygonal box corresponding to the obstacle is obtained based on the index information of the voxel of the obstacle.
  • the corner point information of the polygonal box corresponding to the obstacle can be calculated based on the index information of the voxel of the obstacle using a convex hull algorithm, for example.
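
As a sketch of the convex hull step mentioned above, the following Python computes the corner points of a two-dimensional polygonal box from the indices of an obstacle's voxels using Andrew's monotone chain algorithm. Projecting the voxel indices onto the (x, y) plane and the function name `polygon_corners` are assumptions for illustration; the application names a convex hull algorithm only as one example and does not fix a particular variant.

```python
def cross(o, a, b):
    """Z-component of the cross product (a - o) x (b - o)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def polygon_corners(voxel_indices):
    """Return the convex hull of the (x, y) voxel indices, counter-clockwise."""
    # Drop the z index and duplicates (bird's-eye projection, an assumption).
    pts = sorted({(x, y) for x, y, *_ in voxel_indices})
    if len(pts) <= 2:
        return pts
    lower = []
    for p in pts:
        # Pop points that would make a clockwise or collinear turn.
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other chain.
    return lower[:-1] + upper[:-1]
```

Interior voxels and voxels lying on a hull edge are discarded, so only the corner points of the box remain.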
  • the blind spots of the vehicles in the current scene can be known based on the visible state of the voxels.
  • from the occupancy state of the voxels, it can be known that the vehicle in the current scene should avoid the areas where voxels in the "occupied" state are located, so as to avoid collisions.
  • the corner point information of the polygonal box corresponding to the obstacles can be used to quickly locate the obstacles in the scene.
  • the speed information of the obstacle in the current scene can be determined from the speed information of the voxels and the corner point information of the polygonal box corresponding to the obstacle.
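
One possible way to combine per-voxel speed information with the corner points of an obstacle's box is sketched below in Python: select the voxels whose (x, y) index lies inside the box and average their velocity vectors. The averaging rule, the 2D convex counter-clockwise polygon, and the names `inside_convex` and `obstacle_velocity` are assumptions; the application does not state how the obstacle speed is derived.

```python
def inside_convex(poly, p):
    """True if p is inside or on the boundary of a counter-clockwise convex polygon."""
    n = len(poly)
    for i in range(n):
        a, b = poly[i], poly[(i + 1) % n]
        # A negative cross product means p is to the right of edge a->b, i.e. outside.
        if (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0]) < 0:
            return False
    return True


def obstacle_velocity(corners, voxel_velocity):
    """Average the (vx, vy) of all voxels whose (x, y) index falls inside the box.

    corners:        counter-clockwise corner points of the obstacle's box.
    voxel_velocity: dict mapping (x, y) voxel index -> (vx, vy).
    Returns None if no voxel with speed information lies inside the box.
    """
    selected = [v for idx, v in voxel_velocity.items() if inside_convex(corners, idx)]
    if not selected:
        return None
    n = len(selected)
    return (sum(v[0] for v in selected) / n, sum(v[1] for v in selected) / n)
```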
  • the perception information is also used to indicate voxels of a road surface of the first scene
  • controlling vehicle driving at least based on the perception information includes: generating road surface geometry information of the first scene at least based on the perception information; and adjusting a suspension in the vehicle based on the road surface geometry information.
  • the road geometry information is used to indicate the road condition of the first scene (e.g., whether there are potholes on the road, whether there are bumps on the road, etc.).
  • the vehicle can obtain the road condition in front of the vehicle in advance based on the perception information.
  • the vehicle has enough time to adjust the suspension of the vehicle in time to keep the vehicle as level and stable as possible during driving, thereby reducing the vibration caused by the undulations in the road and improving the comfort of riding in the vehicle.
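
The idea of deriving road surface geometry from road voxels and reacting with the suspension can be sketched as follows. This Python example builds a height profile along the driving direction, labels deviations from the median height as bumps or potholes, and picks a damping mode. The threshold, the median reference, and the "soft"/"normal" modes are hypothetical choices for illustration only; the application does not describe the control law.

```python
def road_profile(road_voxels):
    """Map each longitudinal index x to the highest road-surface voxel index z at that x."""
    profile = {}
    for x, _y, z in road_voxels:
        profile[x] = max(profile.get(x, z), z)
    return profile


def classify_road(profile, threshold=2):
    """Label each x as 'flat', 'bump' or 'pothole' relative to the median height.

    Heights and the threshold are expressed in voxel units (assumed).
    """
    if not profile:
        return {}
    heights = sorted(profile.values())
    reference = heights[len(heights) // 2]
    labels = {}
    for x, z in profile.items():
        if z - reference >= threshold:
            labels[x] = "bump"
        elif reference - z >= threshold:
            labels[x] = "pothole"
        else:
            labels[x] = "flat"
    return labels


def suspension_setting(labels):
    """Choose a hypothetical damping mode: soften if any irregularity lies ahead."""
    return "soft" if any(label != "flat" for label in labels.values()) else "normal"
```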
  • controlling the vehicle driving at least based on the perception information includes: adjusting the driving path of the vehicle at least according to the perception information, the adjusted driving path does not pass through the area where the voxels of the obstacle are located.
  • this implementation avoids collisions between the vehicle and obstacles during driving, which helps improve the safety of vehicle driving.
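
The requirement that the adjusted driving path does not pass through the area where obstacle voxels are located can be checked with a short sketch. The Python below maps path waypoints to 2D grid cells and returns the first candidate path that touches no obstacle cell. The waypoint representation, the cell size, and the function names are assumptions; the check only inspects the waypoints themselves, so it presumes they are sampled at least as densely as the grid.

```python
import math


def path_cells(path, voxel_size):
    """Return the set of (x, y) grid cells visited by the waypoints of a path."""
    return {(math.floor(x / voxel_size), math.floor(y / voxel_size)) for x, y in path}


def first_clear_path(candidates, obstacle_cells, voxel_size=0.5):
    """Return the first candidate path that avoids every obstacle cell, else None."""
    for path in candidates:
        if path_cells(path, voxel_size).isdisjoint(obstacle_cells):
            return path
    return None
```

For example, with an obstacle in cell `(2, 0)`, a straight path along `y = 0.25` is rejected, while a detour through `y = 0.75` is accepted.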
  • the collected data includes image data and point cloud data
  • the perception detection network includes an image feature extraction network, a point cloud feature extraction network, a feature fusion network, and an output network, wherein:
  • the image feature extraction network is used to extract 3D image features of the image data
  • the point cloud feature extraction network is used to extract point cloud features of voxels corresponding to the point cloud data
  • the feature fusion network is used to fuse the 3D image features and the point cloud features of the voxels corresponding to the point cloud data to obtain fused features of the voxels of the first scene;
  • the output network is used to process the fused features of the voxels of the first scene and output the perception information.
  • the voxels of the first scene refer to the voxels fused by the feature fusion network.
  • the raw data of the multimodal sensor on the vehicle is used to perceive obstacles in the surrounding environment, integrating the advantages of different sensors (such as providing texture semantic information of the image and providing depth information of the point cloud), which is conducive to enhancing the vehicle's perception of the surrounding environment and improving the generalization ability and accuracy of obstacle detection.
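
To make the data flow of the feature fusion step concrete, the following Python sketch fuses per-voxel image features and per-voxel point cloud features by concatenation, padding a missing modality with zeros. Concatenation, zero padding, and the dictionary representation are stand-ins chosen for illustration; in the application the fusion is performed by a network, whose operator is not specified.

```python
def fuse_voxel_features(image_feats, point_feats, image_dim, point_dim):
    """Concatenate image and point cloud features for every voxel seen by either modality.

    image_feats / point_feats: dict mapping voxel index -> list of floats.
    A voxel missing from one modality gets a zero vector for that modality.
    """
    fused = {}
    for idx in image_feats.keys() | point_feats.keys():
        img = image_feats.get(idx, [0.0] * image_dim)
        pts = point_feats.get(idx, [0.0] * point_dim)
        fused[idx] = list(img) + list(pts)
    return fused
```

The output network would then consume `fused` to produce per-voxel perception information such as occupancy, speed and visibility.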
  • the method further includes: inputting text query information and fusion features of the voxels of the obstacle into an attribute recognition network, and outputting category information of the obstacle; the text query information is used to request a query category; and displaying the category information of the obstacle; wherein the fusion features of the voxels of the obstacle are determined based on corner point information of a polygonal box corresponding to the obstacle and fusion features of voxels of the first scene, and the polygonal box corresponding to the obstacle is associated with the voxels of the obstacle.
  • the text query information is used to request a query of Q categories, assuming that the number of categories of obstacles in a certain scene is P, where Q and P are positive integers and Q is greater than P.
  • the number of categories that the attribute recognition network actually supports is greater than the number of categories of obstacles in any scene, so that it can be ensured that the attribute recognition network avoids omissions when performing category recognition of obstacles in any scene.
  • the fusion features of the voxels of the obstacle are determined based on the corner point information of the polygonal box corresponding to the obstacle and the fusion features of the voxels of the first scene, which means that since the corner point information of the polygonal box corresponding to the obstacle corresponds to the index information of the voxels of the obstacle, the fusion features of the voxels of the obstacle can be determined from the fusion features of the voxels of the first scene based on the index information of the voxels of the obstacle.
  • associating the polygonal box corresponding to the obstacle with the voxels of the obstacle means that the corner point information of the polygonal box corresponding to the obstacle is obtained based on the index information of the voxels of the obstacle.
  • the attribute recognition network includes a text encoding network and an attribute decoding network, wherein the text encoding network is used to extract word vector features of text query information; and the attribute decoding network is used to output category information of the obstacle based on the word vector features and the fusion features of the voxels of the obstacle.
  • the vehicle can not only detect obstacles in the surrounding environment during driving, but also identify the category of obstacles, so that the vehicle can not only see objects but also understand them.
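
The text-query-based category recognition can be illustrated with a toy Python sketch: each queried category has a word vector, the obstacle's voxel features are pooled into one vector, and the category with the highest cosine similarity is returned. Mean pooling, cosine similarity, and the hand-written embeddings are assumptions standing in for the text encoding network and the attribute decoding network, whose internals the application does not detail.

```python
import math


def mean_pool(features):
    """Average a non-empty list of equal-length feature vectors into one vector."""
    n = len(features)
    return [sum(col) / n for col in zip(*features)]


def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify_obstacle(obstacle_voxel_feats, query_embeddings):
    """Return the queried category whose word vector best matches the obstacle.

    query_embeddings: dict mapping category name -> word vector (hypothetical values).
    """
    pooled = mean_pool(obstacle_voxel_feats)
    return max(query_embeddings, key=lambda name: cosine(pooled, query_embeddings[name]))
```

Because the categories come from the query rather than from a fixed output layer, the queried set can be larger than the set of categories actually present in a scene, as the text requires.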
  • the method also includes: obtaining multiple planned paths for a vehicle; inputting the multiple planned paths of the vehicle and the fusion features of the voxels of the first scene into a path evaluation network, and outputting recommendation coefficients of the multiple planned paths and recommended paths among the multiple planned paths, wherein the recommended paths are associated with the recommendation coefficients of the multiple planned paths; and displaying the recommended paths.
  • the path evaluation network includes a path encoding network, a feature interaction network and an evaluation output network, wherein the path encoding network is used to extract path features of each planned path among multiple planned paths; the feature interaction network is used to obtain risk features of each planned path based on the path features of each planned path and the fusion features of the voxels of the first scene; and the evaluation output network is used to output recommendation coefficients of the multiple planned paths and recommended paths among the multiple planned paths based on the risk features of the multiple planned paths.
  • the recommended path is a planned path corresponding to the highest recommendation coefficient among multiple planned paths.
  • the recommendation coefficient of the planned path can be obtained based on at least one of the risk coefficient, comfort and traffic efficiency of the planned path.
  • the risk coefficient of the planned path is related to at least one of the factors such as the distance between the planned path and obstacles (including visible obstacles and obstacles currently in blind spots), whether the planned path conflicts with the paths of other traffic participants on the road (for example, whether there will be a collision at the current moment or in the future), etc.
  • the traffic efficiency of the planned path is related to at least one of the factors such as the length of the planned path, the estimated travel time corresponding to the planned path, the number of traffic lights along the planned path, and the area of the drivable area where the planned path is located.
  • the comfort of the planned path is related to at least one of the factors such as the magnitude and frequency of the steering acceleration of the planned path, the rate of change of the acceleration of the planned path, the flatness of the road surface along the planned path, the number of traffic lights along the planned path, the type of road where the planned path is located, and whether the area along the planned path is shady.
  • the lower the risk coefficient of the planned path the higher the recommendation coefficient of the planned path
  • the higher the comfort of the planned path the higher the recommendation coefficient of the planned path
  • the higher the traffic efficiency of the planned path the higher the recommendation coefficient of the planned path.
  • route recommendation can be achieved by deploying the path evaluation network, which is beneficial to improving the safety and comfort of vehicle driving.
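
The monotonic relationships described above (lower risk, higher comfort and higher traffic efficiency each raise the recommendation coefficient) can be sketched with a simple weighted score. The linear form, the weights, and the assumption that all three inputs are normalized to the range 0 to 1 are illustrative only; in the application the coefficients are output by a network.

```python
def recommendation_coefficient(risk, comfort, efficiency, weights=(0.5, 0.25, 0.25)):
    """Combine risk, comfort and efficiency (each assumed in [0, 1]) into one score.

    The score rises as risk falls and as comfort or efficiency rise.
    """
    w_risk, w_comfort, w_efficiency = weights
    return w_risk * (1.0 - risk) + w_comfort * comfort + w_efficiency * efficiency


def recommend(paths):
    """Return (best_path_name, coefficients) for a dict of path name -> metrics."""
    coeffs = {name: recommendation_coefficient(**metrics) for name, metrics in paths.items()}
    best = max(coeffs, key=coeffs.get)
    return best, coeffs
```

A usage example with two hypothetical planned paths:

```python
paths = {
    "keep_lane": {"risk": 0.8, "comfort": 0.9, "efficiency": 0.9},
    "change_left": {"risk": 0.2, "comfort": 0.6, "efficiency": 0.8},
}
best, coeffs = recommend(paths)  # best is "change_left": its lower risk outweighs its lower comfort
```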
  • the present application provides a system for intelligent driving, the system comprising: a perception detection network, for outputting perception information based on data collected by a sensor for a first scene, the perception information being used to indicate voxels of obstacles in the first scene, and the sensor comprising at least one of a camera and a radar; an attribute recognition network, for outputting category information of the obstacle based on text query information and fusion features of the voxels of the obstacle, the fusion features of the voxels of the obstacle being determined based on corner point information of a polygonal box corresponding to the obstacle and fusion features of the voxels of the first scene, the polygonal box corresponding to the obstacle being associated with the voxels of the obstacle, and the fusion features of the voxels of the first scene being obtained by the perception detection network through temporal and/or spatial fusion based on at least one of the 3D image features and the point cloud features of the voxels extracted from the collected data; and a path evaluation network, for outputting the recommendation coefficients of multiple planned paths and the recommended paths among the multiple planned paths according to the multiple planned paths and the fusion features of the voxels of the first scene, the recommended paths being associated with the recommendation coefficients of the multiple planned paths.
  • the system can be deployed in a vehicle or a component in a vehicle for intelligent driving control, which component can be, for example, a chip or an integrated circuit.
  • for details of the vehicle, refer to the description in the first aspect above; they are not repeated here.
  • the perception detection network can enhance the system's ability to perceive the surrounding environment for intelligent driving, thereby avoiding collisions between the deployment end of the system and obstacles, thereby improving the safety of the system;
  • the attribute recognition network enables the system to identify the category of obstacles based on perceived obstacles, thereby improving the intelligence of the system;
  • the path evaluation network can recommend low-risk paths, thereby providing convenience for intelligent travel.
  • the perception information includes at least one item of the following information: the occupancy status of voxels of the first scene, speed information of voxels of the first scene, visibility status of voxels of the first scene, and corner point information of a polygonal box corresponding to the obstacle; wherein the polygonal box corresponding to the obstacle is associated with the voxels of the obstacle.
  • the collected data includes image data and point cloud data
  • the perception detection network includes an image feature extraction network, a point cloud feature extraction network, a feature fusion network, and an output network, wherein:
  • the image feature extraction network is used to extract 3D image features of the image data
  • the point cloud feature extraction network is used to extract point cloud features of voxels corresponding to the point cloud data
  • the feature fusion network is used to fuse the 3D image features and the point cloud features of the voxels corresponding to the point cloud data to obtain fused features of the voxels of the first scene;
  • the output network is used to process the fused features of the voxels of the first scene and output the perception information.
  • the attribute recognition network includes a text encoding network and an attribute decoding network, wherein the text encoding network is used to extract word vector features of text query information; and the attribute decoding network is used to output category information of the obstacle based on the word vector features and the fusion features of the voxels of the obstacle.
  • the path evaluation network includes a path encoding network, a feature interaction network, and an evaluation output network, wherein:
  • the path encoding network is used to extract path features of each planned path among the multiple planned paths;
  • the feature interaction network is used to obtain the risk feature of each planned path according to the path feature of each planned path and the fusion feature of the voxels of the first scene;
  • the evaluation output network is used to output the recommendation coefficients of the multiple planned paths and the recommended paths among the multiple planned paths according to the risk features of the multiple planned paths.
  • the present application provides a device for intelligent driving, comprising: a receiving unit, for acquiring data collected by a sensor for a first scene, the sensor comprising at least one of a camera and a radar; a processing unit, for inputting the collected data into a perception detection network, and outputting perception information, wherein the perception information is used to indicate voxels of obstacles in the first scene; the processing unit is also used to control vehicle driving based at least on the perception information.
  • the device further includes a display unit, the display unit being configured to: display the obstacle based on the perception information, the obstacle being marked with a polygonal box; and/or display voxels of the obstacle based on the perception information, the voxels of the obstacle being marked with a polygonal box.
  • the perception information includes at least one item of the following information: the occupancy status of voxels of the first scene, speed information of voxels of the first scene, visibility status of voxels of the first scene, and corner point information of a polygonal box corresponding to the obstacle; wherein the polygonal box corresponding to the obstacle is associated with the voxels of the obstacle.
  • the perception information is also used to indicate voxels of the road surface of the first scene
  • the processing unit is specifically used to: generate road surface geometry information of the first scene at least based on the perception information; and adjust the suspension in the vehicle based on the road surface geometry information.
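  • One possible reading of "road surface geometry information" is a height map built from the road-surface voxels. The Python sketch below works under that assumption; the voxel size, the z origin at zero, and the `max_step` bump measure are hypothetical choices made for illustration, not details given in this application.

```python
def road_height_map(road_voxels, voxel_size=0.2):
    """Map each (ix, iy) column to the top height (m) of its road-surface voxels."""
    heights = {}
    for ix, iy, iz in road_voxels:
        top = (iz + 1) * voxel_size  # upper face of the voxel, z origin assumed at 0
        key = (ix, iy)
        if key not in heights or top > heights[key]:
            heights[key] = top
    return heights

def max_step(heights):
    """Largest height change between columns adjacent along x (a crude bump measure)."""
    step = 0.0
    for (ix, iy), h in heights.items():
        neighbor = heights.get((ix + 1, iy))
        if neighbor is not None:
            step = max(step, abs(neighbor - h))
    return step

# Assumed toy road: flat except for one raised column at ix = 2.
road_voxels = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (2, 0, 1), (3, 0, 0)]
heights = road_height_map(road_voxels)
bump = max_step(heights)
```

  • A suspension controller could, for instance, compare `bump` against a threshold before the wheels reach the raised column; that control policy is outside the scope of this sketch.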
  • the processing unit is specifically used to: adjust the driving path of the vehicle at least according to the perception information, and the adjusted driving path does not pass through the area where the voxels of the obstacle are located.
  • the collected data includes image data and point cloud data
  • the perception detection network includes an image feature extraction network, a point cloud feature extraction network, a feature fusion network, and an output network, wherein:
  • the image feature extraction network is used to extract 3D image features of the image data
  • the point cloud feature extraction network is used to extract point cloud features of voxels corresponding to the point cloud data
  • the feature fusion network is used to fuse the point cloud features of the voxels corresponding to the 3D image features and the point cloud data to obtain fused features of the voxels of the first scene;
  • the output network is used to process the fused features of the voxels of the first scene and output the perception information.
  • the processing unit is further used to: input text query information and fusion features of the voxels of the obstacle into an attribute recognition network, and output category information of the obstacle; the text query information is used to request a query category; the display unit is further used to display the category information of the obstacle; wherein the fusion features of the voxels of the obstacle are determined based on corner point information of a polygonal box corresponding to the obstacle and fusion features of voxels of the first scene, and the polygonal box corresponding to the obstacle is associated with the voxels of the obstacle.
  • the receiving unit is also used to: obtain multiple planned paths for the vehicle; the processing unit is also used to: input the multiple planned paths of the vehicle and the fusion features of the voxels of the first scene into the path evaluation network, output the recommendation coefficients of the multiple planned paths and the recommended paths among the multiple planned paths, and the recommended paths are associated with the recommendation coefficients of the multiple planned paths; the display unit is also used to: display the recommended paths.
  • the present application provides a device for intelligent driving, comprising a processor and a memory, wherein the memory is used to store program instructions; the processor calls the program instructions in the memory so that the device executes the method in the first aspect or any possible implementation of the first aspect.
  • the present application provides a vehicle, comprising a system as described in the second aspect or any possible implementation of the second aspect, or a device as described in the third aspect or any possible implementation of the third aspect, or a device as described in the fourth aspect.
  • the present application provides a computer-readable storage medium, comprising computer instructions, which, when executed by a processor, implement the method in the above-mentioned first aspect or any possible implementation manner of the first aspect.
  • the present application provides a computer program product, which, when executed by a processor, implements the method in the first aspect or any possible embodiment of the first aspect.
  • the computer program product for example, may be a software installation package, and when the method provided by any possible design of the first aspect is needed, the computer program product may be downloaded and executed on a processor to implement the method in the first aspect or any possible embodiment of the first aspect.
  • FIG1 is a schematic diagram of a communication system provided in an embodiment of the present application.
  • FIG2 is a system schematic diagram of a perception model for intelligent driving provided in an embodiment of the present application.
  • FIG3 is a schematic diagram of feature extraction of a perception detection network provided in an embodiment of the present application.
  • FIG4 is a flow chart of an intelligent driving method provided by an embodiment of the present application.
  • FIG5 is a schematic diagram of some scenarios provided by embodiments of the present application.
  • FIG6A is a schematic diagram of marking obstacles in a scene with a polygonal frame provided by an embodiment of the present application.
  • FIG6B is a schematic diagram showing a voxel of an obstacle provided by an embodiment of the present application.
  • FIG6C is a schematic diagram showing voxels of a road surface in a scene provided by an embodiment of the present application.
  • FIG7A is a flow chart of a method for training a perception detection network provided in an embodiment of the present application.
  • FIG7B is a schematic diagram of a training process of a perception detection network provided in an embodiment of the present application.
  • FIG8 is a schematic diagram of a chip hardware structure provided in an embodiment of the present application.
  • FIG9A is a schematic diagram of the structure of a computing device provided in an embodiment of the present application.
  • FIG9B is a schematic diagram of the structure of a training device provided in an embodiment of the present application.
  • FIG. 10 is a schematic diagram of the structure of a processing device provided in an embodiment of the present application.
  • the number of described objects is not limited by the prefix and can be one or more. Taking “first device” as an example, the number of "devices" can be one or more.
  • the objects modified by different prefixes can be the same or different. For example, if the object being described is a "device”, then the "first device” and the “second device” can be the same device, the same type of device, or different types of devices; for another example, if the object being described is "information”, then the "first information” and the “second information” can be information of the same content or information of different contents.
  • the use of prefixes used to distinguish the described objects in the embodiments of the present application does not constitute a limitation on the described objects. For the statement of the described objects, refer to the description in the context of the claims or embodiments, and no unnecessary limitation should be constituted due to the use of such prefixes.
  • the description methods such as "at least one of a1, a2, ... and an" used in the embodiments of the present application include the situation where any one of a1, a2, ... and an exists alone, and also include any combination of any multiple of a1, a2, ... and an, and each situation can exist alone.
  • the description method of "at least one of a, b and c" includes the following situations: a alone, b alone, c alone, a combination of a and b, a combination of a and c, a combination of b and c, or a combination of a, b and c.
  • Autonomous driving also known as intelligent driving or assisted driving
  • intelligent driving is an important direction for the development of vehicle intelligence.
  • driving automation classification standards including driving levels L0 to L5, where L0 is no automation, and the human driver has full control over the vehicle.
  • the driver can get warnings or assistance from the driving system, such as autonomous emergency braking (AEB), blind spot monitoring (BSM) or lane departure warning (LDW).
  • L1 is driving assistance, and the driving operation is completed by the human driver and the driving system.
  • the driving system can provide driving assistance for either steering or acceleration/deceleration according to the driving environment, and the other driving operations are performed by the human driver; examples include adaptive cruise control (ACC) and lane keep assistance/support (LKA/LKS).
  • L2 is partial automation: the driving system provides driving assistance for both steering and acceleration/deceleration according to the driving environment, and the other driving actions are performed by the human driver; an example is the combination of adaptive cruise control (ACC) and lane keep assistance (LKA/LKS).
  • Level 3 is conditional automation, where the driving system can complete all driving operations, but the human driver needs to respond to the request of the driving system at the appropriate time, that is, the human driver needs to be prepared to take over the driving system;
  • Level 4 is highly automated, where the driving system can complete all driving operations, and the human driver does not necessarily need to respond to the request of the driving system. For example, when the road and environmental conditions permit (such as closed parks, highways, urban roads or fixed driving routes, etc.), the human driver may not take over the driving;
  • Level 5 is fully automated, where the driving system can autonomously complete driving operations under various road and environmental conditions that human drivers can cope with.
  • at L0 to L2, the driving system mainly provides support for the driver, and the driver still needs to supervise the driving and steer, brake or accelerate as needed to ensure safety.
  • at L3 to L5, the driving system can complete all driving operations on behalf of the driver.
  • at L3, the driver must be prepared to take over the driving.
  • at L4 and L5, the driving system can perform all driving under some conditions (L4) or under all conditions (L5), and the driver can choose whether to take over.
  • the above classification is an example. With the evolution of technology or different regulations in different countries or regions, the above classification may change.
  • the vehicle automation classification proposed by the Ministry of Industry and Information Technology of China includes 6 levels of vehicle driving automation, among which 0-2 is driving assistance, the system assists humans in performing dynamic driving tasks, and the driving subject is still the driver; 3-5 is automatic driving, the system replaces humans in performing dynamic driving tasks under the designed operating conditions, and when the function is activated, the driving subject is the system.
  • Level 0 driving automation (emergency assistance)
  • the system cannot continuously perform the lateral or longitudinal motion control of the vehicle in dynamic driving tasks, but has the ability to continuously perform partial target and event detection and response in dynamic driving tasks.
  • Level 1 driving automation (partial driver assistance): the system continuously performs the lateral or longitudinal motion control of the vehicle in dynamic driving tasks under its designed operating conditions (also called the operational design domain, ODD), and has the ability to detect and respond to partial targets and events that are compatible with the lateral or longitudinal motion control of the vehicle being performed.
  • Level 2 driving automation (combined driver assistance)
  • Level 3 driving automation (conditionally automated driving) systems continuously perform all dynamic driving tasks under their designed operating conditions.
  • Level 4 driving automation (highly automated driving)
  • Level 5 driving automation (fully automated driving) systems continuously perform all dynamic driving tasks under any drivable conditions and automatically implement minimum risk strategies.
  • lateral control is mainly used for vehicle steering control, for example, controlling the steering wheel torque or angle to control the direction of the vehicle;
  • longitudinal control is mainly used for vehicle speed control, for example, controlling the brake pedal, accelerator pedal, or gear position to control the acceleration/deceleration, braking, etc. of the vehicle.
  • an obstacle refers to an entity that the vehicle does not expect to collide with during driving, and the entity can be static or dynamic.
  • static entities can be, for example, cartons on the road, road construction signs, road dividing railings, dirt piles, tires, overturned vehicles, lying people, animals, buildings beside the road, trees, parked vehicles, road signs, electric poles, roadside isolation strips and other stationary objects with volume and mass;
  • dynamic entities can be, for example, pedestrians (such as walking pedestrians, pedestrians riding bicycles, etc.), animals, vehicles, vehicles carrying goods (such as loaded with cartons, branches or other goods) and other moving objects with volume and mass.
  • a scene refers to an environment space that can be detected by sensors on the vehicle during driving. It can be understood that during driving, each moment corresponds to a scene, and the scenes corresponding to multiple moments include the scenes corresponding to each moment in the multiple moments.
  • A voxel, also known as a stereo pixel or volume element, is the smallest unit of segmentation in three-dimensional space, analogous to the pixel in two-dimensional space. Voxels can be used to divide 3D space into a grid and assign features to each grid cell. In this case, a voxel represents a value on a regular grid in three-dimensional space, and the location of a voxel can be inferred from its position relative to other voxels.
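  • The statement that a voxel's location follows from its position in the grid can be made concrete with a small Python sketch. The grid origin, the edge length, and the function names below are assumptions chosen for illustration.

```python
import math

def point_to_voxel(point, origin, voxel_size):
    """Return the integer (ix, iy, iz) index of the voxel containing a 3D point."""
    return tuple(math.floor((p - o) / voxel_size) for p, o in zip(point, origin))

def voxel_center(index, origin, voxel_size):
    """Return the centre of a voxel, recovered from its grid index alone."""
    return tuple(o + (i + 0.5) * voxel_size for i, o in zip(index, origin))

origin = (-50.0, -50.0, -3.0)  # assumed grid corner in the vehicle frame, metres
voxel_size = 0.5               # assumed edge length, metres
index = point_to_voxel((1.2, -0.3, 0.1), origin, voxel_size)
center = voxel_center(index, origin, voxel_size)
```

  • Because the grid is regular, only the index needs to be stored per voxel; the metric position is recomputed from the origin and the edge length when needed.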
  • Figure 1 is a schematic diagram of a communication system provided in an embodiment of the present application.
  • the system includes a network side device and a vehicle, wherein the network side device and the vehicle communicate with each other in a wireless manner.
  • the network side device is a device with computing capabilities.
  • the network side device may be, for example, a server deployed on the network side (e.g., a server for intelligent driving processing), or a component or chip in the server.
  • the network side device may also be a system-level device or a computing device cluster composed of multiple servers.
  • the network side device may be deployed in a cloud environment or an edge environment, and the embodiments of the present application are not specifically limited.
  • the vehicle refers to a vehicle equipped with an automatic driving system.
  • the automatic driving system includes, but is not limited to, a fully automatic driving system, a highly automatic driving system, a conditional automatic driving system, or a partially automatic driving system. Those skilled in the art can understand that any driving system that is not fully manual and that provides intelligent driving is covered by this concept.
  • the vehicle may be a new energy vehicle or a traditional vehicle, etc.
  • a traditional vehicle refers to a fuel vehicle, such as a gasoline vehicle, a diesel vehicle, etc.
  • a new energy vehicle may be, for example, an electric vehicle (EV), a hybrid electric vehicle (HEV), an extended-range electric vehicle (range extended EV), a plug-in hybrid vehicle (Plug-in HEV), a fuel cell vehicle or other new energy vehicles, without specific limitation herein.
  • cameras and radars are deployed on the vehicle, wherein the camera is used to collect image data of the vehicle's current surroundings, and the radar is used to collect point cloud data of the vehicle's current surroundings.
  • the radar includes at least one of a lidar (laser radar), a millimeter-wave radar, and the like.
  • based on the installation position, the camera can be classified as, for example, a front-view camera, a surround-view camera, a rear-view camera, or a side-view camera; based on its structure, the camera can be classified as, for example, a monocular camera, a binocular camera, or a wide-angle camera.
  • the embodiment of the present application does not limit the number of cameras configured for the vehicle.
  • the camera on the vehicle needs to be able to collect 360-degree image data around the vehicle body.
  • a perception model is deployed on a network-side device, and the network-side device uses training data to train the perception model, wherein the training data includes sensor data obtained from a data source device (e.g., a collection vehicle fleet), and the sensor data includes image data collected by a vehicle-mounted camera and point cloud data collected by a vehicle-mounted radar.
  • after the network-side device has trained the perception model, it can provide the trained perception model to the vehicle for use.
  • the specific training process of the perception model can refer to the description of the corresponding content in the following method embodiment, which will not be repeated here.
  • the vehicle can obtain a perception model (i.e., a trained perception model) from a network-side device.
  • the vehicle collects data about the environment (or scene) within a certain range from the vehicle through its own sensors (such as cameras, radars, etc.) to obtain collected data.
  • the collected data includes, for example, image data and point cloud data collected for the scene.
  • the vehicle uses the perception model to process the collected data to output perception information of the scene.
  • the perception information is used to indicate the voxels of obstacles in the scene.
  • the vehicle can control its own driving at least based on the perception information.
  • the communication between the network-side device and the vehicle may use cellular communication technology, for example: 2G cellular communication, such as global system for mobile communication (GSM) or general packet radio service (GPRS); 3G cellular communication, such as wideband code division multiple access (WCDMA), time division-synchronous code division multiple access (TD-SCDMA), or code division multiple access (CDMA); 4G cellular communication, such as long term evolution (LTE) or LTE-vehicle to everything (V2X) PC5 communication; 5G cellular communication, such as new radio (NR)-V2X PC5 communication; or other evolved cellular communication technologies.
  • the wireless communication system may also utilize non-cellular communication technologies, such as Wi-Fi and wireless local area network (WLAN), which are not specifically limited here.
  • FIG1 is only an exemplary architecture diagram, but does not limit the number of network elements included in the system shown in FIG1. Although not shown in FIG1, FIG1 may also include other functional entities in addition to the functional entities shown in FIG1.
  • the method provided in the embodiment of the present application can be applied to the communication system shown in FIG1. Of course, the method provided in the embodiment of the present application can also be applied to other communication systems, and the embodiment of the present application is not limited to this.
  • FIG. 2 is a system schematic diagram of a perception model for intelligent driving provided in an embodiment of the present application.
  • the perception model includes a perception detection network, which is used to output perception information based on the sensor data collected from the scene (e.g., image data and point cloud data), and the perception information is used to indicate the voxels of obstacles in the scene.
  • the perception information is also used to indicate the voxels of the road surface in the scene.
  • the perception information can be used to assist the driving of the vehicle.
  • the framework of the perception detection network is introduced below.
  • the perception detection network includes an image feature extraction network, a point cloud feature extraction network, a feature fusion network and an output network
  • the image feature extraction network is used to extract 3D image features of the image data from the image data and output the features to the feature fusion network
  • the point cloud feature extraction network is used to extract point cloud features of voxels corresponding to the point cloud data from the point cloud data and output the features to the feature fusion network
  • the feature fusion network is used to fuse the 3D image features of the image data and the point cloud features of the voxels corresponding to the point cloud data, obtain the fused features of the voxels of the corresponding scene and output the features to the output network
  • the output network makes predictions based on the fused features of the voxels of the corresponding scene and outputs the perception information of the scene.
  • the feature fusion network in FIG2 may only perform spatial feature fusion.
  • for example, if the image data is the image collected by the camera at time t and the point cloud data is the data collected by the radar at time t, then the feature fusion network only needs to spatially fuse the 3D image features of the image data at time t and the point cloud features of the voxels corresponding to the point cloud data at time t.
  • the feature fusion network in FIG2 can perform spatial and temporal feature fusion.
  • the image data is the image data collected by the camera at n moments
  • the point cloud data is the point cloud data collected by the radar at these n moments.
  • the feature fusion network can first spatially fuse the 3D image features corresponding to each moment in the n moments and the point cloud features of the voxels corresponding to the moment to obtain the spatial fusion features of the voxels corresponding to each moment, and then temporally fuse the spatial fusion features of the voxels corresponding to each moment in the n moments.
  • the point cloud feature extraction module in the perception detection network shown in Figure 2 can be omitted. If the image data is image data collected by the camera at n moments, the feature fusion network can temporally fuse the 3D image features corresponding to each of the n moments.
  • the image feature extraction module in the perception detection network shown in Figure 2 can be omitted. If the point cloud data is point cloud data collected by the radar at n moments, the feature fusion network can temporally fuse the point cloud features of the voxels corresponding to each of the n moments.
  • the feature fusion network can adopt a network structure of a recurrent neural network (RNN) or a recurrent convolutional neural network (RCNN).
  • an RNN can be, for example, a long short-term memory (LSTM) network or a gated recurrent unit (GRU) network.
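  • The order of operations described above, spatial fusion per moment followed by temporal fusion across the n moments, can be sketched in plain Python. The sketch swaps in two deliberately simple stand-ins that are assumptions, not the networks of this application: spatial fusion is a per-voxel concatenation, and temporal fusion is a fixed recurrent blend h_t = alpha * x_t + (1 - alpha) * h_(t-1) instead of a learned LSTM or GRU.

```python
def spatial_fuse(image_feats, point_feats, image_dim, point_dim):
    """Concatenate per-voxel 3D image features and point cloud features."""
    fused = {}
    for voxel in set(image_feats) | set(point_feats):
        img = image_feats.get(voxel, [0.0] * image_dim)  # zero-fill a missing modality
        pts = point_feats.get(voxel, [0.0] * point_dim)
        fused[voxel] = list(img) + list(pts)
    return fused

def temporal_fuse(frames, alpha=0.5):
    """Blend per-voxel features over time; frames are ordered oldest to newest."""
    state = {}
    for frame in frames:
        for voxel, feat in frame.items():
            prev = state.get(voxel, feat)  # first observation initialises the state
            state[voxel] = [alpha * x + (1 - alpha) * h for x, h in zip(feat, prev)]
    return state

# Two moments, one voxel, one feature channel per sensor (assumed toy values).
image_t = [{(0, 0, 0): [1.0]}, {(0, 0, 0): [3.0]}]
points_t = [{(0, 0, 0): [2.0]}, {(0, 0, 0): [4.0]}]
spatial = [spatial_fuse(i, p, 1, 1) for i, p in zip(image_t, points_t)]
fused_state = temporal_fuse(spatial)
```

  • When only one sensor type is available, the same `temporal_fuse` step can be applied directly to that sensor's per-moment features, matching the camera-only and radar-only variants described above.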
  • the image feature extraction network includes a camera backbone network and a stereo conversion network, wherein the camera backbone network is used to extract 2D image features of image data, and the stereo conversion network is used to convert the 2D image features of image data into 3D image features of image data.
  • the stereo conversion network can realize the conversion of 2D image features into 3D image features in the vehicle body coordinate system, and the features extracted from the radar point cloud data are themselves 3D features in the vehicle body coordinate system, so that it is convenient for the subsequent feature fusion network to fuse features from different sensors, which is conducive to eliminating heterogeneous differences between multimodal sensors.
  • the 2D image features of the image data include, but are not limited to, color features, shape features, texture features, and spatial relationship features of the image data.
  • the camera backbone network may adopt a convolutional neural network (CNN) (e.g., residual network Resnet), a transformer network, a vision transformer (ViT) network, or a network structure of other backbone networks.
  • the stereo conversion network may adopt a transformer network or a lift-splat-shoot (LSS) network structure.
  • the point cloud feature extraction network includes a radar coding network and a point backbone network, wherein the radar coding network is used to voxelize the point cloud data to establish a correspondence between points and voxels in the point cloud data, thereby obtaining the features of the voxels corresponding to the point cloud data, and the point backbone network is used to extract the point cloud features (i.e., 3D features) of the voxels corresponding to the point cloud data based on the features of the voxels corresponding to the point cloud data.
  • the radar coding network and the point backbone network can be combined into one network to extract the point cloud features of the voxels corresponding to the point cloud data, which is not specifically limited here.
  • the radar encoding network can adopt a network structure such as a voxel feature encoding (VFE) network or a pillar feature encoding (PFE) network.
  • the point backbone network can adopt a network structure such as a convolutional neural network (e.g., U-Net) or a transformer network.
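  • The voxelization step attributed to the radar coding network, which establishes the correspondence between points and voxels, can be illustrated with a heavily simplified Python sketch. The hand-written feature (mean point position plus point count) is an assumption standing in for a learned VFE or PFE encoder, and the grid origin is assumed to be at zero.

```python
import math
from collections import defaultdict

def voxelize(points, voxel_size):
    """Group (x, y, z) points by the index of the voxel that contains them."""
    groups = defaultdict(list)
    for point in points:
        index = tuple(math.floor(c / voxel_size) for c in point)
        groups[index].append(point)
    return groups

def encode_voxels(groups):
    """Per-voxel feature: mean x, y, z of its points followed by the point count."""
    features = {}
    for index, pts in groups.items():
        n = len(pts)
        mean = [sum(p[axis] for p in pts) / n for axis in range(3)]
        features[index] = mean + [float(n)]
    return features

# Assumed toy point cloud: two points share a voxel, one point is on its own.
points = [(0.25, 0.25, 0.0), (0.75, 0.25, 0.5), (2.5, 0.5, 0.5)]
features = encode_voxels(voxelize(points, voxel_size=1.0))
```

  • The resulting per-voxel features are what a point backbone network would then refine into the point cloud features of the voxels.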
  • the output network is the detection head of the perception detection network.
  • the output network includes at least one head network, and the number of head networks in the output network is determined based on the number of types of prediction results in the perception information output by the output network.
  • the perception information includes the occupancy state of voxels, the velocity information of voxels, the visible state of voxels, and the corner point information of the polygonal box corresponding to the obstacle, wherein the polygonal box corresponding to the obstacle is associated with the voxels of the obstacle.
  • the perception information contains 4 kinds of prediction results
  • the output network includes four head networks, namely head network 1, head network 2, head network 3 and head network 4, wherein head network 1 is used to output the corner point information of the polygonal box corresponding to the obstacle, head network 2 is used to output the occupancy state of the voxel, head network 3 is used to output the velocity information of the voxel, and head network 4 is used to output the visible state of the voxel.
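  • The arrangement of one head network per type of prediction result can be sketched as a container with four fields and a dispatcher that feeds the same fused features to every head. In the Python sketch below, the `PerceptionInfo` class, the field names, and the threshold-based heads are hypothetical stand-ins for learned head networks.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Voxel = Tuple[int, int, int]
Features = Dict[Voxel, List[float]]

@dataclass
class PerceptionInfo:
    box_corners: list   # head network 1: corner points of obstacle polygons
    occupancy: dict     # head network 2: voxel -> occupied or empty
    velocity: dict      # head network 3: voxel -> velocity
    visibility: dict    # head network 4: voxel -> visible or invisible

def run_output_network(fused: Features,
                       heads: Dict[str, Callable[[Features], object]]) -> PerceptionInfo:
    """Feed the same fused features to every head and collect the results."""
    return PerceptionInfo(**{name: head(fused) for name, head in heads.items()})

# Toy heads: each reads assumed channels of the fused feature vector.
heads = {
    "box_corners": lambda f: [],  # placeholder; a learned head would regress corners
    "occupancy": lambda f: {v: x[0] > 0.5 for v, x in f.items()},
    "velocity": lambda f: {v: (x[1], x[2]) for v, x in f.items()},
    "visibility": lambda f: {v: x[3] > 0.0 for v, x in f.items()},
}
fused = {(0, 0, 0): [0.9, 1.5, 0.0, 1.0], (1, 0, 0): [0.1, 0.0, 0.0, 0.0]}
info = run_output_network(fused, heads)
```

  • Adding or removing a prediction type then amounts to adding or removing one entry in `heads` and one field in the container, which matches the rule that the number of head networks follows the number of prediction types.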
  • the visible state of a voxel means: in the scene where the vehicle is located at the current moment, if a voxel in the scene is not touched by the observation signal of any sensor (including cameras and radars) on the vehicle at the current moment, then the visible state of the voxel is invisible; if the voxel is touched by the observation signal of at least one sensor, then the visible state of the voxel is visible.
  • the occupancy state of a voxel means: in the scene where the vehicle is currently located, if a certain voxel in the scene has an entity at the corresponding spatial position in the physical world where the scene is located, then the occupancy state of the voxel is occupied; if the voxel does not have an entity at the corresponding spatial position in the physical world where the scene is located, then the occupancy state of the voxel is empty (i.e., unoccupied).
  • an entity can be understood as an object with a certain volume and mass. It can be understood that air is not an entity.
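  • The two definitions above are independent of each other: a voxel can be occupied yet invisible, for example when it is occluded. A minimal Python sketch makes this explicit; the function name and the sets of sensor-touched and entity-containing voxels are assumed inputs for illustration.

```python
def voxel_states(all_voxels, observed_by_sensor, occupied):
    """Return {voxel: (visible, occupied)} following the two definitions above."""
    # A voxel is visible if the observation signal of at least one sensor touches it.
    touched = set().union(*observed_by_sensor.values())
    return {v: (v in touched, v in occupied) for v in all_voxels}

all_voxels = [(0, 0, 0), (1, 0, 0), (2, 0, 0)]
observed_by_sensor = {
    "camera": {(0, 0, 0)},
    "lidar": {(0, 0, 0), (1, 0, 0)},
}
occupied = {(1, 0, 0), (2, 0, 0)}  # voxels that contain an entity
states = voxel_states(all_voxels, observed_by_sensor, occupied)
```

  • Here voxel (2, 0, 0) is occupied but touched by no sensor, so its visible state is invisible while its occupancy state is occupied.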
  • any head network in the output network may adopt a network structure of a convolutional neural network CNN or a transformer network.
  • the internal network structures of different head networks in the output network may be the same or different. It can be understood that different head networks process the same input features differently.
  • in order to reduce the consumption of computing power, a neural sampling network can also be set between the feature fusion network and the output network in the perception detection network shown in FIG. 2; that is, the feature fusion network outputs the fusion features of the voxels of the scene to the neural sampling network, and the neural sampling network processes these features at different resolutions according to the importance of the region in which the voxels are located.
  • the neural sampling network can realize fine-grained processing of voxels in key areas of the scene, and coarse-grained processing of voxels in non-key areas of the scene, which can greatly save computing power, is conducive to improving the data processing efficiency of the perception detection network, and is also conducive to reducing the deployment cost of hardware.
  • for example, the importance of area 1 is greater than the importance of area 2 when the volume of obstacles in area 1 is greater than the volume of obstacles in area 2.
  • the neural sampling network can adopt the network structure of a neural network, a multi-layer perceptron (MLP), or a transformer network.
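  • The idea of fine-grained processing in key areas and coarse-grained processing elsewhere can be sketched without any learning. In the Python sketch below, the importance test is a caller-supplied predicate and non-key voxels are averaged into blocks of `factor` voxels per axis; both choices are assumptions standing in for a learned neural sampling network.

```python
from collections import defaultdict

def sample_by_importance(fused, is_key_region, factor=2):
    """Keep key-region voxels as they are; average the rest into coarse blocks."""
    fine = {}
    blocks = defaultdict(list)
    for voxel, feat in fused.items():
        if is_key_region(voxel):
            fine[voxel] = feat
        else:
            coarse_index = tuple(i // factor for i in voxel)
            blocks[coarse_index].append(feat)
    # Average the features of all voxels that fall into the same coarse block.
    coarse = {
        index: [sum(col) / len(feats) for col in zip(*feats)]
        for index, feats in blocks.items()
    }
    return fine, coarse

fused = {
    (0, 0, 0): [1.0], (1, 0, 0): [3.0],  # assumed non-key region
    (4, 0, 0): [5.0], (5, 0, 0): [7.0],  # assumed key region
}
fine, coarse = sample_by_importance(fused, lambda v: v[0] >= 4)
```

  • The output network then handles three feature vectors instead of four in this toy case; with larger blocks in three dimensions the reduction in non-key regions grows with the cube of `factor`.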
  • Figure 3 is a schematic diagram of feature extraction of a perception detection network provided by an embodiment of the present application.
  • the 2D image features of the image data can be extracted through the above-mentioned camera backbone network.
  • the 2D image features of the image data are passed through the stereo conversion network to obtain the 3D image features of the image data.
  • the point cloud data collected by the radar can be used to extract the features of the voxels corresponding to the point cloud data (i.e., 3D features) through the radar encoding network.
  • the features of the voxels corresponding to the point cloud data can be used to extract the point cloud features of the voxels corresponding to the point cloud data (i.e., 3D features) through the point backbone network.
  • the 3D image features of the above image data and the point cloud features of the voxels corresponding to the point cloud data are fused through the feature fusion network to output the fused features of the voxels.
  • the output network takes the fused features of the voxels and outputs the occupancy state of the voxels, the speed information of the voxels, the visible state of the voxels, and the corner point information of the polygonal box corresponding to the obstacle.
  • FIG3 is only an example of the feature extraction process of the perception detection network, and does not limit the feature extraction process in the perception detection network to that shown in FIG3 .
  • the perception model also includes an attribute recognition network, which can be used to identify the category of the obstacle.
  • the attribute recognition network is used to output the category information of the obstacle based on the text query information and the fusion features of the voxels of the obstacle, wherein the text query information is used to request the query category, and the fusion features of the voxels of the obstacle are determined based on the corner point information of the polygonal box corresponding to the obstacle and the fusion features of the voxels of the scene, and the polygonal box corresponding to the obstacle is associated with the voxels of the obstacle.
  • the corner point information of the polygonal box corresponding to the obstacle comes from the output network in the perception detection network (specifically, the head network 1 in the output network), and the fusion features of the voxels of the scene are the output of the feature fusion network in the perception detection network.
  • the association between the polygonal box corresponding to the obstacle and the voxel of the obstacle can be understood as follows: the corner point information of the polygonal box corresponding to the obstacle is obtained based on the index information of the voxel of the obstacle.
  • the corner point information of the polygonal box corresponding to the obstacle can be obtained by predicting the index information of the voxel of the obstacle based on the learned rules by the head network 1 in FIG. 2, or by calculating the corner point information of the polygonal box corresponding to the obstacle based on the index information of the voxel of the obstacle using a convex hull algorithm, which is not specifically limited here.
  • the polygonal frame corresponding to the obstacle may be two-dimensional or three-dimensional, and is not specifically limited here.
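  • As an illustration of the convex hull option mentioned above, the following sketch computes the corner points of a two-dimensional polygonal box from the index information of the voxels of an obstacle. It uses Andrew's monotone chain algorithm; the choice of that particular algorithm, the projection onto the ground plane, and the function names are assumptions made for the example.

```python
from typing import Iterable, List, Tuple

Cell = Tuple[int, int]
VoxelIndex = Tuple[int, int, int]


def cross(o: Cell, a: Cell, b: Cell) -> int:
    """Z component of (a - o) x (b - o); positive for a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points: Iterable[Cell]) -> List[Cell]:
    """Andrew's monotone chain; returns hull vertices counter-clockwise."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower: List[Cell] = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper: List[Cell] = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other chain.
    return lower[:-1] + upper[:-1]


def obstacle_corners(voxel_indices: Iterable[VoxelIndex]) -> List[Cell]:
    """Project voxel indices (ix, iy, iz) onto the ground plane and take the hull."""
    footprint = {(ix, iy) for ix, iy, _ in voxel_indices}
    return convex_hull(footprint)
```

  • A three-dimensional box could be formed in the same spirit by computing one such group of corner points per height level, but that extension is not shown here.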
  • the attribute recognition network includes a text encoding network and an attribute decoding network, wherein the text encoding network is used to extract word vector features of text query information; the attribute decoding network is used to output category information of the obstacle based on the fusion features of the word vector features and the voxels of the obstacle.
  • the text query information is used to request to query Q categories.
  • the number of categories of obstacles in a scene is P, where Q and P are positive integers and Q is greater than P.
  • the number of categories that the attribute recognition network actually supports is greater than the number of categories of obstacles in any scene, so that the attribute recognition network can avoid omissions when identifying the categories of obstacles in any scene.
  • the text query information includes K pieces of text query information such as "Is it a car?", "Is it a pedestrian?", "Is it a telephone pole?", "Is it a road sign?", "Is it a road dividing railing?", etc.
  • the text encoding network in the attribute recognition network performs feature extraction on the K pieces of text query information to obtain the word vector features corresponding to each piece of text query information, wherein the word vector features corresponding to each piece of text query information can represent the image semantic features of the category indicated by the text query information.
  • obstacle 1 is any obstacle in the scene
  • the attribute decoding network calculates the similarity between the fused features of the voxels of obstacle 1 and the word vector features corresponding to each piece of text query information in the K pieces of text query information, and determines that the category corresponding to the word vector feature with the highest similarity to the fused features of the voxels of obstacle 1 is the category of obstacle 1, so that the category information of obstacle 1 can be output.
  • the fusion features of the voxels of obstacle 1 are determined based on the corner point information of the polygonal box corresponding to obstacle 1 and the fusion features of the voxels of the scene as follows: since the corner point information of the polygonal box corresponding to obstacle 1 corresponds to the index information of the voxels of obstacle 1, that index information can be used to look up, among the fusion features of the voxels of the scene, the fusion features of the voxels of obstacle 1.
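  • The lookup and similarity matching described above can be sketched as follows. Several details are assumptions made for illustration: the fused features of the obstacle voxels are mean-pooled into one vector, cosine similarity stands in for the unspecified similarity measure, and the names `pool_obstacle_feature` and `classify_obstacle` are hypothetical.

```python
import math
from typing import Dict, Iterable, List, Sequence, Tuple

VoxelIndex = Tuple[int, int, int]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def pool_obstacle_feature(
    scene_features: Dict[VoxelIndex, Sequence[float]],
    obstacle_voxels: Iterable[VoxelIndex],
) -> List[float]:
    """Gather the fused features of the obstacle voxels by index and average them."""
    rows = [scene_features[idx] for idx in obstacle_voxels]
    return [sum(col) / len(rows) for col in zip(*rows)]


def classify_obstacle(
    obstacle_feature: Sequence[float],
    query_features: Dict[str, Sequence[float]],
) -> str:
    """Return the category whose word vector feature is most similar."""
    return max(
        query_features,
        key=lambda category: cosine(obstacle_feature, query_features[category]),
    )
```

  • Because the category set is defined by the text queries rather than by a fixed output layer, new categories can in principle be queried by adding entries to `query_features`.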
  • the text encoding network and the attribute decoding network can both adopt the network structure of a convolutional neural network or a transformer network. It can be understood that the text encoding network and the attribute decoding network can adaptively adjust the relevant parameters of the network according to their own functions.
  • the perception model further includes a path evaluation network, which can be used to determine a recommended path for the vehicle.
  • the path evaluation network is used to output recommendation coefficients of the multiple planned paths and a recommended path among the multiple planned paths based on the fusion features of the multiple planned paths of the vehicle and the voxels of the scene.
  • the path evaluation network includes a path encoding network, a feature interaction network and an evaluation output network, wherein the path encoding network is used to extract the path features of each planned path among multiple planned paths of the vehicle; the feature interaction network is used to obtain the risk features of each planned path based on the path features of each planned path and the fusion features of the voxels of the scene; the evaluation output network is used to output the recommendation coefficients of the multiple planned paths and the recommended paths among the multiple planned paths based on the risk features of the multiple planned paths.
  • the recommendation coefficient of the planned path can be obtained based on at least one of the risk coefficient, comfort and traffic efficiency of the planned path.
  • the risk coefficient of the planned path is related to at least one of the factors such as the distance between the planned path and obstacles (including visible obstacles and obstacles currently in blind spots), whether the planned path conflicts with the paths of other traffic participants on the road (for example, whether there will be a collision at the current moment or in the future), etc.
  • the traffic efficiency of the planned path is related to at least one of the factors such as the length of the planned path, the estimated travel time corresponding to the planned path, the number of traffic lights along the planned path, and the area of the drivable area where the planned path is located.
  • the comfort of the planned path is related to at least one of the factors such as the magnitude and frequency of the steering acceleration of the planned path, the rate of change of the acceleration of the planned path, the flatness of the road surface along the planned path, the number of traffic lights along the planned path, the type of road where the planned path is located, and whether the area along the planned path is shady.
  • the recommended path is the planned path corresponding to the highest recommendation coefficient among the multiple planned paths.
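  • To make the relationship between the recommendation coefficient and the recommended path concrete, the sketch below combines a risk coefficient, comfort and traffic efficiency into one score and picks the highest-scoring path. In the application these quantities come from the path evaluation network; the hand-written weighted sum, the weights, the 0 to 1 value ranges and all names here are assumptions used only for illustration.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class PlannedPath:
    name: str
    risk: float        # assumed range 0..1, lower is safer
    comfort: float     # assumed range 0..1, higher is more comfortable
    efficiency: float  # assumed range 0..1, higher is more efficient


def recommendation_coefficient(
    path: PlannedPath,
    w_risk: float = 0.5,
    w_comfort: float = 0.2,
    w_efficiency: float = 0.3,
) -> float:
    """Weighted sum standing in for the learned evaluation output (assumption)."""
    return (
        w_risk * (1.0 - path.risk)
        + w_comfort * path.comfort
        + w_efficiency * path.efficiency
    )


def recommend(paths: Iterable[PlannedPath]) -> PlannedPath:
    """The recommended path is the one with the highest recommendation coefficient."""
    return max(paths, key=recommendation_coefficient)
```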
  • the path encoding network can adopt the network structure of a convolutional neural network, a transformer network, a graph neural network (GNN), or a graph convolution neural network (GCNNs).
  • the feature interaction network can adopt the network structure of a graph neural network or a transformer network.
  • the evaluation output network can adopt the network structure of a neural network or a multi-layer perceptron (MLP).
  • the training of the perception detection network, the attribute recognition network, and the path evaluation network can be separate, for example, the perception detection network is trained first, and after the perception detection network training is completed, the attribute recognition network and the path evaluation network are trained in sequence.
  • the training of the perception detection network, the attribute recognition network, and the path evaluation network can also be carried out simultaneously, which is not specifically limited here.
  • the training process of each network in the perception model can refer to the description of the corresponding content in the following embodiments, which will not be repeated here.
  • FIG 4 is a flow chart of an intelligent driving method provided by an embodiment of the present application.
  • the method can be applied to the vehicle in Figure 1 above or a component (such as a chip or integrated circuit, etc.) on the vehicle for automatic driving control, and the vehicle is at least equipped with the above-mentioned perception detection network.
  • the method includes but is not limited to the following steps:
  • S401 Acquire data collected by a sensor on a first scene, where the sensor includes at least one of a camera and a radar.
  • the first scene can be understood as the environment space that can be detected by the sensor while the vehicle is driving.
  • the sensor is deployed on the vehicle.
  • the camera can be divided into a front-view camera, a surround-view camera, a rear-view camera, and a side-view camera, etc.
  • the radar includes at least one of a laser radar and a millimeter-wave radar.
  • the embodiment of the present application does not limit the number of cameras and radars configured on the vehicle.
  • the camera is used to collect image data
  • the radar is used to collect point cloud data
  • the above-mentioned collected data includes at least one of image data and point cloud data.
  • a vehicle may be equipped with multiple cameras, and different cameras have different field of view angles, and the field of view angles of these multiple cameras may cover a 360-degree field of view centered on the vehicle.
  • the field of view angles of adjacent cameras in the multiple cameras may partially overlap, so that data in the same environment space can be collected by multiple sensors at the same time, which is conducive to improving the confidence of data observation.
  • the sensor includes a camera and a radar.
  • the number of cameras on the vehicle is m
  • the m cameras collect image data for the first scene.
  • each camera collects an image at each moment, which means that the collected data corresponding to each moment includes image data corresponding to the m images collected by the camera and point cloud data collected by the radar.
  • S402 Input the collected data into a perception detection network, and output perception information, where the perception information is used to indicate voxels of obstacles in the first scene.
  • obstacles refer to entities that the vehicle does not expect to collide with during driving, and the entities can be static or dynamic.
  • static entities can be, for example, cartons on the road, road construction signs, road dividing railings, dirt piles, tires, overturned vehicles, lying people, animals, buildings beside the road, trees, parked vehicles, road signs, electric poles, roadside isolation strips and other stationary objects with volume and mass
  • dynamic entities can be, for example, pedestrians (such as walking pedestrians, pedestrians riding bicycles, etc.), animals, vehicles, vehicles carrying goods (such as loaded with cartons, branches or other goods) and other moving objects with volume and mass.
  • the perception detection network is a trained perception detection network deployed on the vehicle side.
  • the perception detection network is used to output perception information based on the data collected by the sensor for the first scene.
  • the perception detection network is obtained by the network-side device shown in FIG. 1 through training based on the sensor data set and the label information corresponding to the sensor data set, where the label information is generated by 4D reconstruction.
  • the label information corresponding to the sensor data set can be generated by the network-side device using a self-supervised method to perform 4D reconstruction based on the sensor data set, and the label information is used to provide the perception detection network with the true value information of its prediction results during the training process of the perception detection network.
  • the prediction tasks of the perception detection network include predicting the occupancy state of voxels, the velocity information of voxels, the visible state of voxels, and the corner point information of the polygonal box corresponding to the obstacle.
  • the perception detection network performs the above four prediction tasks on the input data and outputs the predicted perception information (i.e., the prediction result).
  • the label information includes the true value information of the prediction results corresponding to the image data at time t and the point cloud data at time t.
  • the collected data includes image data and point cloud data
  • the perception detection network includes an image feature extraction network, a point cloud feature extraction network, a feature fusion network, and an output network.
  • the processing process of the perception detection network can refer to the following steps A1-A4, for example:
  • A1: the image feature extraction network extracts the 3D image features of the image data;
  • A2: the point cloud feature extraction network extracts the point cloud features of the voxels corresponding to the point cloud data;
  • A3: the feature fusion network fuses the 3D image features with the point cloud features of the voxels corresponding to the point cloud data to obtain the fusion features of the voxels of the first scene;
  • A4: the output network processes the fusion features of the voxels of the first scene and outputs the perception information.
  • for the inference process of the perception detection network, reference may be made to the description of the perception detection network in the embodiment of FIG. 2 above; for the image feature extraction network, point cloud feature extraction network, feature fusion network and output network, reference may be made to the description of the corresponding content in the embodiment of FIG. 2 above, which will not be repeated here. It can be understood that the above example does not limit the framework of the perception detection network.
  • the perception information includes at least one of the following information: the occupancy status of the voxels of the first scene, the speed information of the voxels of the first scene, the visibility status of the voxels of the first scene, and the corner point information of the polygonal box corresponding to the obstacle of the first scene; wherein the polygonal box corresponding to the obstacle of the first scene is associated with the voxels of the obstacle.
  • the output network includes four head networks, each head network corresponds to a prediction task.
  • the perception information includes the occupancy state of the voxels of the first scene, the speed information of the voxels of the first scene, the visible state of the voxels of the first scene, and the corner point information of the polygonal box corresponding to the obstacles of the first scene.
  • the occupancy state of a voxel can be one of two types, namely "occupied" and "empty".
  • for the occupancy state of a voxel, reference may be made to the above description of the occupancy state of a voxel, which will not be repeated here.
  • voxel 1 of the scene corresponds to vehicle A in the physical world where the scene is located, and the occupancy state of voxel 1 is "occupied"
  • voxel 2 of the scene corresponds to the air in the physical world where the scene is located, and the occupancy state of voxel 2 is "empty".
  • the visible state of a voxel can also be one of two types, namely "visible" and "invisible".
  • for the visible state of a voxel, reference may be made to the above description of the visible state of a voxel, which will not be repeated here.
  • FIG5 is a schematic diagram of some scenarios provided by an embodiment of the present application.
  • FIG5 (1) shows scene 1 corresponding to time t1.
  • vehicle 1 is the main vehicle (i.e., the above-mentioned perception detection network is deployed on vehicle 1). It can be seen that vehicles 1, 2, and 3 are located in the same lane, and vehicle 2 is currently performing a lane change operation. Assuming that the vehicle size of vehicle 2 is larger than the vehicle size of vehicle 3 in front, vehicle 3 is completely blocked by vehicle 2 from the perspective of vehicle 1. Vehicle 3 is located in the blind spot of vehicle 1. Therefore, at time t1, the observation signal of any sensor on vehicle 1 cannot reach the voxel of vehicle 3.
  • the visible state of the voxels of vehicle 2 at time t2 is "visible" and the visible state of the voxels of vehicle 3 at time t2 is also "visible". It can also be seen that inputting the collected data at multiple moments into the perception detection network can not only complete the observation information, but also help to restore the physical world where the scene is located more realistically from multiple directions and angles.
  • S403 Control vehicle driving based at least on the perception information.
  • controlling the driving of the vehicle includes at least one of the following operations: changing lanes, adjusting the driving speed, adjusting the driving path, turning on the warning lights, and adjusting the suspension of the vehicle.
  • controlling the vehicle based at least on the perception information facilitates real-time decision-making and improves the safety of the vehicle during driving.
  • controlling vehicle travel based at least on the perception information includes: adjusting a driving path of the vehicle based at least on the perception information, wherein the adjusted driving path does not pass through an area where voxels of obstacles are located.
  • the vehicle's current driving path can be adjusted in time so that the adjusted driving path does not pass through the area where the voxels of the obstacle are located. In this way, the vehicle can be prevented from colliding with obstacles during driving, which is beneficial to improving the safety of vehicle driving.
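  • The constraint that the adjusted driving path does not pass through the area where the voxels of obstacles are located can be checked with a simple voxel lookup, sketched below. The voxel size of 0.5 m, the representation of a path as a list of waypoints in metres, and the function names are assumptions; the check only tests the waypoints themselves, not the segments between them.

```python
import math
from typing import Iterable, List, Optional, Set, Tuple

Point = Tuple[float, float, float]
VoxelIndex = Tuple[int, int, int]

VOXEL_SIZE_M = 0.5  # assumed edge length of one voxel, in metres


def voxel_index(point: Point, voxel_size: float = VOXEL_SIZE_M) -> VoxelIndex:
    """Map a metric point to the index of the voxel that contains it."""
    x, y, z = point
    return (
        math.floor(x / voxel_size),
        math.floor(y / voxel_size),
        math.floor(z / voxel_size),
    )


def path_is_clear(waypoints: Iterable[Point], obstacle_voxels: Set[VoxelIndex]) -> bool:
    """True if no waypoint falls inside a voxel that belongs to an obstacle."""
    return all(voxel_index(p) not in obstacle_voxels for p in waypoints)


def pick_clear_path(
    candidates: Iterable[List[Point]], obstacle_voxels: Set[VoxelIndex]
) -> Optional[List[Point]]:
    """Return the first candidate path that avoids all obstacle voxels, if any."""
    for path in candidates:
        if path_is_clear(path, obstacle_voxels):
            return path
    return None
```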
  • the perception information is also used to indicate voxels of the road surface of the first scene, and the vehicle driving is controlled at least based on the perception information, including: generating road surface geometry information of the first scene at least based on the perception information; and adjusting the suspension in the vehicle based on the road surface geometry information.
  • the road geometry information is used to indicate the road condition of the first scene (for example, whether there are potholes on the road, whether there are bumps on the road, etc.). Based on the perception information, the vehicle can obtain the road condition in front of the vehicle in advance. When the road surface is detected to be undulating, the vehicle has enough time to adjust the vehicle's suspension in time to keep the vehicle as level and stable as possible during driving, reducing the vibration caused by road undulations and improving the comfort of riding in the vehicle.
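  • One possible way to derive road surface geometry information from the voxels of the road surface is a height map, sketched below. The voxel size of 0.1 m, the height threshold, the assumption that the x index increases along the driving direction, and the function names are all illustrative choices rather than details from the application.

```python
from typing import Dict, Iterable, Set, Tuple

Cell = Tuple[int, int]
VoxelIndex = Tuple[int, int, int]


def road_height_map(
    road_voxels: Iterable[VoxelIndex], voxel_size: float = 0.1
) -> Dict[Cell, float]:
    """Height of the top face of the highest road voxel in each (ix, iy) column."""
    heights: Dict[Cell, float] = {}
    for ix, iy, iz in road_voxels:
        top = (iz + 1) * voxel_size
        heights[(ix, iy)] = max(heights.get((ix, iy), top), top)
    return heights


def undulating_cells(heights: Dict[Cell, float], threshold: float = 0.15) -> Set[Cell]:
    """Cells whose height differs from the previous cell along x by more than threshold."""
    flagged: Set[Cell] = set()
    for (ix, iy), height in heights.items():
        next_height = heights.get((ix + 1, iy))
        if next_height is not None and abs(next_height - height) > threshold:
            flagged.add((ix + 1, iy))
    return flagged
```

  • A suspension controller could use the flagged cells ahead of the vehicle as an early warning of potholes or bumps; how the suspension is actually adjusted is outside the scope of this sketch.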
  • controlling vehicle travel based on perception information may also include: determining the blind spot in the scene and obstacle information in the blind spot (e.g., speed information of the obstacle, corner point information of the polygonal box corresponding to the obstacle, etc.) based on the perception information; when the vehicle approaches the blind spot, controlling the vehicle to slow down, stop, or turn based on the obstacle information in the blind spot.
  • the obstacles in the blind spot may be static or dynamic, which is not specifically limited here.
  • controlling the vehicle to be in a deceleration state, a parking state, or a turning state can avoid collision between the vehicle and obstacles in the blind spot, thereby improving the safety of vehicle driving.
  • the blind spot is, for example, the area where the voxels whose visible state is "invisible" at the current moment in the perception information are located.
  • the blind spot includes the area that the observation signal of the sensor on the vehicle at the current moment could have reached but could not reach due to other obstacles and the detection blind spot of the sensor itself.
  • the speed information of the obstacle may be obtained based on the speed information of the voxels of the obstacle, for example.
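  • The blind spot definition and the derivation of obstacle speed from voxel speed can be sketched as follows. The `VoxelState` record, the two-dimensional velocity, and the use of the mean voxel velocity as the obstacle velocity are assumptions made for this example.

```python
import math
from dataclasses import dataclass
from typing import Iterable, List, Set, Tuple

VoxelIndex = Tuple[int, int, int]


@dataclass(frozen=True)
class VoxelState:
    index: VoxelIndex
    occupied: bool                 # occupancy state: True for "occupied"
    visible: bool                  # visible state: True for "visible"
    velocity: Tuple[float, float]  # assumed (vx, vy) in m/s


def blind_spot_indices(voxels: Iterable[VoxelState]) -> Set[VoxelIndex]:
    """The blind spot is the set of voxels whose visible state is "invisible"."""
    return {v.index for v in voxels if not v.visible}


def hidden_obstacle_voxels(voxels: Iterable[VoxelState]) -> List[VoxelState]:
    """Occupied voxels that lie inside the blind spot."""
    return [v for v in voxels if v.occupied and not v.visible]


def obstacle_speed(obstacle_voxels: Iterable[VoxelState]) -> float:
    """Speed of an obstacle as the magnitude of its mean voxel velocity (assumption)."""
    voxels = list(obstacle_voxels)
    if not voxels:
        return 0.0
    vx = sum(v.velocity[0] for v in voxels) / len(voxels)
    vy = sum(v.velocity[1] for v in voxels) / len(voxels)
    return math.hypot(vx, vy)
```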
  • the vehicle in addition to controlling the driving of the vehicle based on the perception information output by the vehicle itself, can also control the driving of the vehicle in combination with at least one of navigation map information, high-precision map information, live traffic information broadcast by roadside equipment, and live traffic information broadcast by other surrounding vehicles.
  • the roadside equipment can be, for example, a road side unit (RSU), multi-access edge computing (MEC), or a sensor, or a component or chip inside these devices, or a system-level device composed of an RSU and a MEC, or a system-level device composed of an RSU and a sensor, or a system-level device composed of an RSU, a MEC, and a sensor.
  • the above-mentioned intelligent driving method also includes: displaying obstacles of the first scene based on the perception information, wherein the obstacles of the first scene are marked with polygonal boxes; and/or displaying voxels of the obstacles of the first scene based on the perception information.
  • the obstacle or the voxel of the obstacle can be presented on a display device of the vehicle.
  • the display device can be a vehicle-mounted tablet, a vehicle-mounted display, a head-up display (HUD) system, or an augmented reality head-up display (AR-HUD) system, etc., which are not specifically limited here.
  • Figure 6A is a schematic diagram of marking obstacles in a scene with a polygonal box provided by an embodiment of the present application.
  • Figure 6A shows the obstacles in the scene where the ego vehicle is located at the current moment, wherein the obstacles are marked with polygonal boxes.
  • the vehicle at the bottom of the center in Figure 6A is the ego vehicle.
  • the obstacles in the environment surrounding the ego vehicle in this scene are marked and displayed with polygonal boxes.
  • the obstacles in the scene include at least vehicles, buildings, etc.
  • the polygonal box can be two-dimensional or three-dimensional.
  • the polygonal box when the polygonal box is displayed in 2D, the polygonal box can be connected by 10 corner points indicated by one group of corner point information; when the polygonal box is displayed in 3D, the polygonal box can be connected by multiple groups of corner point information, wherein each group of corner point information indicates 10 corner points.
  • an arrow can also be added to the polygonal box corresponding to a dynamic obstacle: the arrow indicates that the obstacle is a dynamic obstacle, the direction of the arrow indicates the direction of movement of the obstacle, and the length of the arrow indicates the speed of the obstacle.
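  • A display layer could derive such an arrow from the velocity of the obstacle as sketched below. The scale factor, the minimum speed below which an obstacle is drawn without an arrow, and the function name `velocity_arrow` are assumptions for illustration.

```python
import math
from typing import Optional, Tuple


def velocity_arrow(
    vx: float,
    vy: float,
    scale: float = 0.5,
    min_speed: float = 0.1,
) -> Optional[Tuple[float, float]]:
    """Return (heading in degrees, arrow length) for a dynamic obstacle.

    Returns None when the speed is below `min_speed`, i.e. the obstacle is
    treated as static and drawn without an arrow (assumed convention).
    """
    speed = math.hypot(vx, vy)
    if speed < min_speed:
        return None
    heading_deg = math.degrees(math.atan2(vy, vx))
    # Arrow length grows linearly with speed.
    return heading_deg, speed * scale
```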
  • FIG6A is only an example of the marking display of obstacles in the scene where the vehicle is located at a certain moment, and should not constitute a limitation on the marking display of obstacles in the scene where the vehicle is located.
  • Figure 6B is a schematic diagram of the display of voxels of an obstacle provided in an embodiment of the present application.
  • Figure 6B shows the voxels of the obstacle in the scene where the vehicle is located at the current moment.
  • the voxels of the obstacle are composed of multiple voxels in the scene, and the voxel can be understood as the three-dimensional grid of the smallest unit in Figure 6B.
  • dynamic obstacles and static obstacles can be distinguished and displayed by different colors (that is, obstacles of different speeds can be distinguished by different colors), and different obstacles can also be distinguished by different colors, which is not specifically limited here.
  • Figure 6B is only an example of the display of voxels of obstacles in the scene where the vehicle is located at a certain moment, and should not limit the display of voxels of obstacles in the scene where the vehicle is located.
  • the voxels of the road surface in the scene where the vehicle is currently located can also be displayed.
  • FIG. 6C is a schematic diagram showing the voxels of the road surface in the scene provided in an embodiment of the present application.
  • FIG. 6C not only displays the voxels of the obstacles in the scene at the current moment, but also displays the voxels of the road surface in the scene at the current moment.
  • the degree of undulation of the road surface ahead can be seen based on FIG. 6C. It can be understood that FIG. 6C is only an example of displaying the voxels of obstacles and the voxels of the road surface in the scene where the vehicle is located at a certain moment, and should not limit the display of the voxels of obstacles and the voxels of the road surface in the scene where the vehicle is located.
  • an attribute recognition network may also be deployed on the vehicle, wherein the attribute recognition network is used to identify the category of obstacles. In this way, the vehicle can not only detect obstacles in the surrounding environment during driving, but also identify the category of obstacles, so that the vehicle can not only see objects but also understand them.
  • the intelligent driving method further includes: obtaining text query information; inputting the text query information and the fusion features of the voxels of the obstacle into the attribute recognition network, and outputting the category information of the obstacle; wherein the text query information is used to request the query category; and displaying the category information of the obstacle; wherein the fusion features of the voxels of the obstacle are determined based on the corner point information of the polygonal box corresponding to the obstacle and the fusion features of the voxels of the first scene, and the polygonal box corresponding to the obstacle is associated with the voxels of the obstacle.
  • the corner point information of the polygonal box corresponding to the obstacle and the fusion features of the voxels of the first scene are both from the perception detection network.
  • the corner point information of the polygonal box corresponding to the obstacle comes from the output network in the perception detection network, and the fusion features of the voxels of the first scene come from the feature fusion network in the perception detection network.
  • for details of the attribute recognition network, reference may be made to the description of the attribute recognition network in the embodiment of FIG. 2 above.
  • further fusion processing can be performed in combination with the detection results of the detection algorithm configured in the camera, the detection results of the detection algorithm configured in the radar, or the detection results of other models. In this way, when the same obstacle can be perceived in multiple different ways, the confidence in detecting the obstacle is also higher.
  • a path evaluation network may also be deployed on the vehicle, wherein the path evaluation network is used to recommend the lowest risk path for the vehicle, which is conducive to improving driving safety and the accuracy of driving decisions.
  • the intelligent driving method further includes: obtaining multiple planned paths of the vehicle; inputting the fusion features of the multiple planned paths of the vehicle and the voxels of the first scene into the path evaluation network, outputting the recommendation coefficients of the multiple planned paths and the recommended paths among the multiple planned paths, wherein the recommended paths are associated with the recommendation coefficients of the multiple planned paths; and displaying the recommended paths.
  • the fusion features of the voxels of the first scene come from the perception detection network. From the description of the perception detection network, it can be known that the fusion features of the voxels of the first scene are provided by the feature fusion network in the perception detection network.
  • multiple planned paths are generated by the vehicle, for example, the vehicle generates multiple planned paths based on navigation map information.
  • for details of the path evaluation network, reference may be made to the description of the path evaluation network in the embodiment of Figure 2 above. For brevity, it is not repeated here.
  • the recommended path output by the path evaluation network includes at least two planned paths among multiple planned paths.
  • in this case, the recommended path can also be presented to the user; feedback information is then received from the user, where the feedback information indicates the path selected by the user from the at least two planned paths, and the ego vehicle is controlled to travel along the path selected by the user.
  • the recommended path includes multiple planned paths
  • the recommendation coefficients of the planned paths included in the recommended path are similar or the same, but some planned paths are the shortest paths and some planned paths are the most comfortable paths, etc. In this case, users can freely choose according to their own needs, providing users with a good riding experience.
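  • Selecting several planned paths whose recommendation coefficients are similar or the same can be sketched as a tolerance filter around the best coefficient. The tolerance value and the function name `recommended_paths` are assumptions; the application does not state how "similar" is decided.

```python
from typing import Dict, List


def recommended_paths(coefficients: Dict[str, float], tolerance: float = 0.05) -> List[str]:
    """Names of all paths whose coefficient is within `tolerance` of the best one."""
    best = max(coefficients.values())
    return sorted(
        name for name, coefficient in coefficients.items()
        if best - coefficient <= tolerance
    )
```

  • When the returned list contains more than one name, the candidates can be shown to the user, and the vehicle then follows the path indicated by the user's feedback.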
  • the above-mentioned perception detection network, attribute recognition network and path evaluation network may also be deployed on the vehicle at the same time.
  • the vehicle's ability to perceive the surrounding environment can be enhanced, so that the vehicle can perceive surrounding obstacles during driving, thereby avoiding collisions and improving the safety of the vehicle.
  • the vehicle can recognize the category of the obstacle based on the perception of the obstacle, thereby improving the intelligence of the vehicle.
  • FIG. 7A is a flow chart of a training method for a perception detection network provided in an embodiment of the present application.
  • the method can be applied to the network-side device shown in Figure 1 above or a component in the network-side device (such as a chip or integrated circuit, etc.).
  • the perception detection network shown in Figure 2 includes an image feature extraction network, a point cloud feature extraction network, a feature fusion network, and an output network, wherein the output network includes multiple head networks.
  • the method includes but is not limited to the following steps:
  • S701 In each training process, feature extraction is performed on image data at each moment in a batch of sensor data through an image feature extraction network to obtain 3D image features of image data at K moments.
  • a batch of sensor data used in each training process is located in a sensor dataset.
  • a batch of sensor data includes image data at K moments, wherein the image data at K moments come from at least one camera of the vehicle, and K is a positive integer.
  • the image data at K moments in a batch of sensor data is input into the image feature extraction network.
  • the image feature extraction network obtains the 3D image features of the image data at each moment based on the image data at that moment. Therefore, the image feature extraction network will obtain the 3D image features of the image data at K moments and output them to the feature fusion network.
  • for the framework of the image feature extraction network, refer to the description of the corresponding content in the embodiment of FIG. 2, which is not repeated here.
  • a batch of sensor data also includes the point cloud data at these K moments, wherein the point cloud data at these K moments come from at least one radar of the vehicle.
  • the point cloud data of K moments in a batch of sensor data is input into the point cloud feature extraction network.
  • the point cloud feature extraction network obtains, based on the point cloud data at each moment, the point cloud features of the voxels corresponding to the point cloud data at that moment. Therefore, the point cloud feature extraction network will obtain the point cloud features of the voxels corresponding to the point cloud data at K moments and output them to the feature fusion network.
  • S703 Perform feature fusion on the 3D image features of the image data at K moments and the point cloud features of the voxels corresponding to the point cloud data at K moments through a feature fusion network to obtain fused features of the voxels of the scene at K moments.
  • the voxels of the scene refer to the voxels after feature fusion is performed by the feature fusion network.
  • the feature fusion network spatially fuses the 3D image features of the image data at each moment and the point cloud features of the voxels corresponding to the point cloud data at that moment to obtain the spatial fusion features of the voxels of the scene at that moment; the feature fusion network then temporally fuses the spatial fusion features of the voxels of the scene at K moments to obtain the fusion features of the voxels of the scene at K moments.
  • the fusion features of the voxels of the scene at each moment can be called the spatiotemporal fusion features of the voxels of the scene at that moment.
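The two-stage fusion described above (spatial fusion within each moment, then temporal fusion across the K moments) can be sketched as follows. This is only an illustrative sketch: the source does not name the fusion operators, so per-voxel concatenation for the spatial step and an exponential moving average for the temporal step are assumptions, as are the names `spatial_fuse`, `temporal_fuse`, `fuse_batch` and the `decay` parameter.

```python
from typing import List

Vector = List[float]       # feature vector of one voxel
VoxelGrid = List[Vector]   # one feature vector per voxel (flattened grid)


def spatial_fuse(image_feats: VoxelGrid, cloud_feats: VoxelGrid) -> VoxelGrid:
    """Fuse image and point cloud features of the same voxels at ONE moment.

    Assumption: fusion is a per-voxel concatenation of the two vectors.
    """
    if len(image_feats) != len(cloud_feats):
        raise ValueError("both modalities must cover the same voxels")
    return [img + pc for img, pc in zip(image_feats, cloud_feats)]


def temporal_fuse(spatial: List[VoxelGrid], decay: float = 0.5) -> List[VoxelGrid]:
    """Fuse spatial features across K moments; returns one grid per moment.

    Assumption: each moment blends its own spatial features with the fused
    features of the previous moment (exponential moving average).
    """
    fused: List[VoxelGrid] = []
    for t, grid in enumerate(spatial):
        if t == 0:
            fused.append([list(v) for v in grid])  # first moment has no history
            continue
        prev = fused[-1]
        fused.append([
            [decay * p + (1.0 - decay) * c for p, c in zip(pv, cv)]
            for pv, cv in zip(prev, grid)
        ])
    return fused


def fuse_batch(image_feats: List[VoxelGrid],
               cloud_feats: List[VoxelGrid],
               decay: float = 0.5) -> List[VoxelGrid]:
    """Spatial fusion per moment, followed by temporal fusion over K moments."""
    spatial = [spatial_fuse(i, c) for i, c in zip(image_feats, cloud_feats)]
    return temporal_fuse(spatial, decay)
```

The output keeps one fused grid per moment, which matches the statement that fused features of the voxels of the scene are obtained for each of the K moments.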
  • the fused features of the voxels of the scene at K moments correspond to each head network in the output network, and each head network in the output network performs a prediction task.
  • the output network shown in Figure 2 includes head network 1, head network 2, head network 3 and head network 4, wherein head network 1 is used to predict the corner point information of the polygonal box corresponding to the obstacle of the scene, head network 2 is used to predict the occupancy state of the voxel of the scene, head network 3 is used to predict the velocity information of the voxel of the scene, and head network 4 is used to predict the visible state of the voxel of the scene.
  • each head network makes a prediction based on the fused features of the voxels of the scene at each moment to obtain a prediction result corresponding to the scene at that moment, so that each head network can obtain K prediction results.
  • the head network 1 outputs prediction result 1 based on the fusion features of the voxels of the scene at moment t1 , and the prediction result 1 includes the corner point information of the polygonal box corresponding to the obstacle in the scene at moment t1 ; the head network 1 outputs prediction result 2 based on the fusion features of the voxels of the scene at moment t2 , and the prediction result 2 includes the corner point information of the polygonal box corresponding to the obstacle in the scene at moment t2 ; ..., in this way, the head network 1 will output K prediction results based on the fusion features of the voxels of the scenes at K moments.
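As a minimal sketch of how each head network yields K prediction results, the snippet below applies every head to the fused features of each moment in turn. The two toy heads are hypothetical stand-ins for trained networks (they are not the heads of the source), and `run_heads` is an illustrative name.

```python
from typing import Callable, Dict, List

Features = List[List[float]]  # fused features of the voxels at one moment


def occupancy_head(voxels: Features) -> List[bool]:
    """Toy stand-in: a voxel counts as occupied if its features sum above 0."""
    return [sum(v) > 0.0 for v in voxels]


def velocity_head(voxels: Features) -> List[float]:
    """Toy stand-in: read the first feature channel as a scalar velocity."""
    return [v[0] for v in voxels]


def run_heads(heads: Dict[str, Callable[[Features], list]],
              fused_per_moment: List[Features]) -> Dict[str, List[list]]:
    """Apply every head to each of the K moments; each head returns K results."""
    return {
        name: [head(features) for features in fused_per_moment]
        for name, head in heads.items()
    }


if __name__ == "__main__":
    fused = [[[1.0, 0.5]], [[-2.0, 0.5]]]  # K = 2 moments, one voxel each
    results = run_heads({"occupancy": occupancy_head, "velocity": velocity_head}, fused)
    print(results)
```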
  • S705 Obtain the loss value of each head network in the output network according to the label information corresponding to the batch of sensor data and the K prediction results output by each head network in the output network.
  • the label information corresponding to the batch of sensor data is generated by 4D reconstruction based on the batch of sensor data.
  • the label information corresponding to the batch of sensor data is used to provide the perception detection network with true value information corresponding to the prediction result of the perception detection network on the batch of sensor data.
  • the label information corresponding to the batch of sensor data includes the true value information of the corner points of the polygonal box corresponding to the obstacles of the scene at K moments, the true value information of the occupancy state of the voxels of the scene at these K moments, the true value information of the speed of the voxels of the scene at these K moments, and the true value information of the visible state of the voxels of the scene at these K moments, wherein the true value information of the corner points of the polygonal box corresponding to the obstacles of the scene at K moments corresponds to the K prediction results output by the above-mentioned head network 1, the true value information of the occupancy state of the voxels of the scene at these K moments corresponds to the K prediction results output by the above-mentioned head network 2, the true value information of the speed of the voxels of the scene at these K moments corresponds to the K prediction results output by the above-mentioned head network 3, and the true value information of the visible state of the vo
  • the loss value of head network 1 is obtained based on the true value information of the corner points of the polygonal box corresponding to the obstacles of the scene at K moments in the above label information and the K prediction results output by head network 1.
  • the loss value of head network 1 at that moment can be obtained based on the true value information of the corner points of the polygonal box corresponding to the obstacle in the scene at each moment and the prediction result at that moment among the K prediction results output by head network 1 (i.e., including the corner point information of the polygonal box corresponding to the obstacle in the scene at that moment), and then the loss value of head network 1 can be obtained based on the loss value of head network 1 at each moment among the K moments.
  • other head networks in the output network can also use this method to calculate the loss value of their own head networks, so that the loss value of each head network in the output network can be obtained.
  • S706 Weighting the loss values of each head network in the output network to obtain a loss value corresponding to each training process; and using the loss value to update the parameters in the perception detection network.
  • the weight of each head network in the output network can be user-defined.
  • the loss value is used to update the parameters in the perception detection network (for example, each head network + feature fusion network + image feature extraction network + point cloud feature extraction network in the output network).
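Steps S705 and S706 can be illustrated with a short sketch: a per-moment loss is computed for each head, combined over the K moments, and the head losses are then weighted into a single training loss. The source does not specify the per-moment loss or how the K values are combined, so mean squared error and a plain average are assumptions here, and the function names are illustrative.

```python
from typing import Callable, Dict, List, Sequence


def mse(prediction: Sequence[float], target: Sequence[float]) -> float:
    """Assumed per-moment loss: mean squared error."""
    if len(prediction) != len(target):
        raise ValueError("prediction and target must have the same length")
    return sum((p - t) ** 2 for p, t in zip(prediction, target)) / len(target)


def head_loss(predictions: List[Sequence[float]],
              targets: List[Sequence[float]],
              per_moment_loss: Callable[[Sequence[float], Sequence[float]], float] = mse) -> float:
    """Loss of one head: per-moment losses averaged over the K moments (assumed)."""
    if len(predictions) != len(targets):
        raise ValueError("need one target per prediction result")
    losses = [per_moment_loss(p, t) for p, t in zip(predictions, targets)]
    return sum(losses) / len(losses)


def total_loss(head_losses: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of the head losses; weights are user-defined."""
    return sum(weights[name] * loss for name, loss in head_losses.items())
```

In a real training loop the resulting scalar would be back-propagated through the output network, the feature fusion network and both feature extraction networks; that step depends on the chosen framework and is omitted here.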
  • Figure 7A is an example of training the above-mentioned perception detection network alone, and the training process of the perception detection network is not limited to the form shown in Figure 7A.
  • each head network in the output network of the above-mentioned perception detection network can also be trained separately.
  • the above-mentioned S706 is not a necessary execution step.
  • the perception detection network can also be jointly trained with a neural radiance field (NeRF) network to further improve the detection accuracy and training efficiency, which is not specifically limited here.
  • the label information corresponding to a batch of sensor data in each training process does not need to be generated through manual annotation, which not only saves manpower consumption but also improves the efficiency of obtaining label information.
  • the perception detection network adopts self-supervised training, through which it learns to accurately predict, from the input data of the vehicle at any moment, the perception information of the scene where the vehicle is located at that moment.
  • FIG. 7B is a schematic diagram of the training process of a perception detection network provided in an embodiment of the present application.
  • a batch of sensor data in the sensor data set is input into the image feature extraction network and the point cloud feature extraction network in the perception detection network.
  • the image feature extraction network performs feature extraction on the image data in the batch of sensor data to obtain 3D image features of the image data.
  • the point cloud feature extraction network performs feature extraction on the point cloud data in the batch of sensor data to obtain point cloud features of voxels corresponding to the point cloud data.
  • the feature fusion network in the perception detection network performs feature fusion on the 3D image features of the image data from the image feature extraction network and the point cloud features of the voxels corresponding to the point cloud data from the point cloud feature extraction network to obtain fused features of the voxels of the scene and input them into the output network in the perception detection network.
  • Each head network in the output network obtains corresponding prediction results based on the fused features of the voxels of the scene. Based on the prediction results output by each head network in the output network and the label information corresponding to the sensor data of the batch, the loss value of each head network in the output network is obtained, and the loss value of each head network in the output network is weighted to obtain a loss value corresponding to the training process, and the loss value is used for back propagation to sequentially update the parameters of the above-mentioned output network, feature fusion network, and feature extraction network (including image feature extraction network and point cloud feature extraction network).
  • Figure 7B is only an illustration of the training process of the perception detection network, and does not limit the training process of the perception detection network to only that shown in Figure 7B.
  • each head network in the output network can also be trained separately.
  • for the training of the perception detection network, refer to the relevant description of the embodiment of FIG. 7A above, which is not described in detail here.
  • the attribute recognition network may be trained.
  • the training process of the attribute recognition network is described below.
  • the attribute recognition network includes a text encoding network and an attribute decoding network.
  • the text encoding network can directly use a trained word vector feature extractor (or be obtained through text-image pre-training), so the embodiment of the present application may train only the attribute decoding network.
  • the trained word vector feature extractor can extract the word vector features of the text query information based on the input text query information, and the word vector features of the text query information can represent the image semantic features of the category indicated by the text query information.
  • the process of obtaining a text encoding network using text-image pre-training can be: obtaining a massive amount of text-image training data, wherein the text-image training data includes multiple text-image data groups, each text-image data group includes text information indicating category information and an image corresponding to the text information, for example, text-image training group 1 includes text information indicating a car and an image of a car, the text information in the text-image training data is input into the text encoder to respectively extract word vector features of each text information, the image in the text information in the text-image training data is input into the image encoder to respectively extract image features of each image, the parameters of the text encoder and the parameters of the image encoder are adjusted based on the training idea of making the word vector features of the text information belonging to the same text-image data group as close to the image features of the image as possible, and the word vector features of the text information belonging to different text-image data groups as far away from the image features of the image as possible, then the trained text encoder can be directly used as the
  • the training of the attribute decoding network can be as follows: the text encoding network inputs the word vector features of each text query information to the attribute decoding network based on the received multiple text query information, the attribute decoding network predicts the category information of each obstacle based on the word vector features of the multiple text query information and the fusion features of the voxels of the obstacle provided by the perception detection network, the loss value of the attribute decoding network for this training is obtained based on the predicted category information of each obstacle and the category labeling information of each obstacle, and finally the parameters of the attribute decoding network are reversely updated based on the loss value of the attribute decoding network for this training.
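One way to picture the attribute decoding step is as a match between the features of an obstacle and the word vector features of each text query, followed by a classification loss. The sketch below is an assumption, not the decoder of the source: it pools the obstacle into a single feature vector, scores it against every query with cosine similarity, turns the scores into probabilities with a softmax, and uses cross-entropy against the category label. The names and the `temperature` parameter are hypothetical.

```python
import math
from typing import Dict, List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two feature vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_category(obstacle_feat: List[float],
                     query_feats: Dict[str, List[float]],
                     temperature: float = 0.1) -> Dict[str, float]:
    """Probability of each queried category for one obstacle."""
    names = list(query_feats)
    scores = [cosine(obstacle_feat, query_feats[n]) / temperature for n in names]
    return dict(zip(names, softmax(scores)))


def cross_entropy(probs: Dict[str, float], label: str) -> float:
    """Loss for one obstacle given its category label."""
    return -math.log(probs[label])


if __name__ == "__main__":
    queries = {"car": [1.0, 0.0], "pedestrian": [0.0, 1.0]}
    probs = predict_category([0.9, 0.1], queries)
    print(probs, cross_entropy(probs, "car"))
```

Averaging `cross_entropy` over all obstacles in a training step would give a loss of the kind used to update the attribute decoding network in reverse.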
  • the path evaluation network may be trained.
  • the training process of the path evaluation network is described below.
  • the path evaluation network includes a path encoding network, a feature interaction network, and an evaluation output network.
  • the training process of the path evaluation network may be, for example, obtaining path training data, the path training data including multiple paths planned by the vehicle within the above-mentioned K time ranges and the recommendation coefficient annotation information of the multiple paths, inputting the multiple paths into the path encoding network, the path encoding network outputs the extracted path features of each path to the feature interaction network, the feature interaction network outputs the risk features of each path according to the path features of each path and the fusion features of the voxels of the scenes at the above-mentioned K time (from the perception detection network), the evaluation output network outputs the predicted recommendation coefficients of the multiple paths based on the risk features of the multiple paths and determines the predicted recommended paths from the multiple paths based on the predicted recommendation coefficients of the multiple paths, obtaining the loss values of the multiple paths according to the predicted recommendation coefficients of the multiple paths and the recommendation coefficient annotation information of the
  • the training of the above-mentioned perception detection network, attribute recognition network and path evaluation network is carried out separately and independently.
  • the perception detection network, attribute recognition network and path evaluation network can also be trained jointly.
  • the loss value corresponding to each training process is obtained by weighting the loss value of the perception detection network in this training process, the loss value of the attribute recognition network in this training process (that is, the loss value of the attribute decoding network in this training process) and the loss value of the path evaluation network in this training process.
  • the loss value of this training process can be used to update the parameters in the above-mentioned perception detection network, the attribute decoding network in the attribute recognition network and the above-mentioned path evaluation network.
  • Figure 8 is a schematic diagram of a chip hardware structure provided in an embodiment of the present application, which can be used to execute the intelligent driving method and/or training method in the embodiment of the present application.
  • the neural-networks processing unit (NPU) 80 is mounted on the host CPU (Host CPU) as a coprocessor, and the Host CPU assigns tasks to execute the relevant processes of the intelligent driving method in the aforementioned embodiment or the training method in the aforementioned embodiment.
  • the core part of the NPU is the operation circuit 803, and the controller 804 controls the operation circuit 803 to extract data from the memory (weight memory or input memory) and perform operations.
  • the operation circuit 803 includes multiple processing engines (PEs) inside.
  • the operation circuit 803 is a two-dimensional systolic array.
  • the operation circuit 803 can also be a one-dimensional systolic array or other electronic circuits capable of performing mathematical operations such as multiplication and addition.
  • the operation circuit 803 is a general-purpose matrix processor.
  • the operation circuit takes the corresponding data of matrix B from the weight memory 802 and caches it on each PE in the operation circuit.
  • the operation circuit takes the matrix A data from the input memory 801 and performs matrix operation with matrix B, and the partial result or final result of the matrix is stored in the accumulator 808.
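The matrix operation described above can be mimicked in software as a multiply-accumulate loop in which partial results build up in an accumulator until the final result is complete. This is a plain illustrative sketch, not a model of the actual circuit timing; `matmul_accumulate` is a hypothetical name.

```python
from typing import List

Matrix = List[List[float]]


def matmul_accumulate(a: Matrix, b: Matrix) -> Matrix:
    """Compute A x B by accumulating partial products step by step.

    B plays the role of the weight data held by the processing engines;
    A is the input data streamed in; `acc` plays the role of the accumulator.
    """
    rows, inner, cols = len(a), len(b), len(b[0])
    if any(len(row) != inner for row in a):
        raise ValueError("columns of A must match rows of B")
    acc = [[0.0] * cols for _ in range(rows)]
    for k in range(inner):            # each step adds one partial result
        for i in range(rows):
            for j in range(cols):
                acc[i][j] += a[i][k] * b[k][j]
    return acc                        # final result after the last step


if __name__ == "__main__":
    print(matmul_accumulate([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```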
  • the vector calculation unit 807 can further process the output of the operation circuit, such as vector multiplication, vector addition, exponential operation, logarithmic operation, size comparison, etc.
  • the vector calculation unit 807 can be used for network calculations of non-convolutional/non-FC layers in a neural network, such as pooling, batch normalization, local response normalization, etc.
  • the vector calculation unit 807 can store the processed output vector to the unified memory 806.
  • the vector calculation unit 807 can apply a nonlinear function to the output of the operation circuit 803, such as a vector of accumulated values, to generate an activation value.
  • the vector calculation unit 807 generates a normalized value, a merged value, or both.
  • the processed output vector can be used as an activation input to the operation circuit 803, such as for use in a subsequent layer in a neural network.
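The post-processing performed by the vector calculation unit can be sketched with a few element-wise helpers applied to a vector of accumulated values. The choice of ReLU as the nonlinear function and the window size are assumptions for illustration, and `normalize` is a simplified stand-in for batch normalization that uses the statistics of a single vector rather than of a batch.

```python
import math
from typing import List


def relu(values: List[float]) -> List[float]:
    """Nonlinear function applied to accumulated values to get activations."""
    return [max(0.0, v) for v in values]


def max_pool(values: List[float], window: int) -> List[float]:
    """Non-overlapping max pooling over a 1D vector."""
    if window <= 0:
        raise ValueError("window must be positive")
    return [max(values[i:i + window]) for i in range(0, len(values), window)]


def normalize(values: List[float], eps: float = 1e-5) -> List[float]:
    """Zero-mean, unit-variance scaling of one vector (simplified normalization)."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]


if __name__ == "__main__":
    accumulated = [-1.0, 2.0, -3.0, 4.0]
    activations = relu(accumulated)       # can feed a subsequent layer
    print(max_pool(activations, 2), normalize(activations))
```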
  • the unified memory 806 is used to store input data and output data.
  • the direct memory access controller 805 moves the input data in the external memory to the input memory 801 and/or the unified memory 806, stores the weight data in the external memory into the weight memory 802, and stores the data in the unified memory 806 into the external memory.
  • the bus interface unit (BIU) 810 is used to realize the interaction between the host CPU, the direct memory access controller (DMAC) 805 and the instruction fetch buffer 809 through the bus.
  • the instruction fetch buffer 809 connected to the controller 804 is used to store instructions used by the controller 804.
  • the controller 804 is used to call the instructions cached in the instruction fetch buffer 809 to control the working process of the computing accelerator.
  • the unified memory 806, the input memory 801, the weight memory 802 and the instruction fetch buffer 809 are all on-chip memories, and the external memory is a memory outside the NPU, which can be a double data rate synchronous dynamic random access memory (DDR SDRAM), a high bandwidth memory (HBM) or other readable and writable memory.
  • FIG. 9A is a schematic diagram of a structure of a computing device provided in an embodiment of the present application.
  • the computing device 30 includes a receiving unit 310 and a processing unit 312.
  • the computing device 30 may be implemented by hardware, software, or a combination of hardware and software.
  • the receiving unit 310 is used to obtain the collected data of the sensor on the first scene, and the sensor includes at least one of a camera and a radar; the processing unit 312 is used to input the collected data into the perception detection network and output perception information, and the perception information is used to indicate the voxels of the obstacles in the first scene; the processing unit 312 is also used to display the voxels of the obstacles based on at least the perception information.
  • the computing device 30 also includes a display unit 314 (not shown), which is used to: display the above-mentioned obstacle based on the perception information, and the obstacle is marked with a polygonal box; and/or display the voxels of the above-mentioned obstacle based on the perception information.
  • the computing device 30 may be used to implement the method described in the embodiment of Fig. 4.
  • the receiving unit 310 may be used to execute S401
  • the processing unit 312 may be used to execute S402 and S403.
  • the training device 40 includes an encoding unit 410, a decoding unit 412 and an updating unit 414.
  • the training device 40 can be implemented by hardware, software or a combination of hardware and software.
  • the encoding unit 410 is used to extract features of the image data at each moment in a batch of sensor data through an image feature extraction network during each training process to obtain 3D image features of the image data at K moments, where K is a positive integer; extract features of the point cloud data at each moment in a batch of sensor data through a point cloud feature extraction network to obtain point cloud features of voxels corresponding to the point cloud data at K moments; and fuse the 3D image features of the image data at K moments and the point cloud features of the voxels corresponding to the point cloud data at K moments through a feature fusion network to obtain fused features of the voxels of the scene at K moments.
  • the decoding unit 412 is used to output K prediction results according to the fusion features of the voxels of the scene at K moments through each head network in the output network, wherein each of the K prediction results corresponds to the scene at one moment;
  • the updating unit 414 is used to obtain the loss value of each head network in the output network according to the label information corresponding to the batch of sensor data and the K prediction results output by each head network in the output network; and to weight the loss values of each head network in the output network to obtain a loss value corresponding to each training process; and to use the loss value to update the parameters in the perception detection network.
  • the training device 40 can be used to implement the method described in the embodiment of Figure 7A.
  • the encoding unit 410 can be used to execute S701-S703
  • the decoding unit 412 can be used to execute S704
  • the updating unit 414 can be used to execute S705 and S706.
  • the division of the units in the above devices is only a division of logical functions. In actual implementation, they can be fully or partially integrated into one physical entity, or they can be physically separated.
  • the units in the device can be implemented in the form of a processor calling software; for example, the device includes a processor, the processor is connected to a memory, the memory stores instructions, and the processor calls the instructions stored in the memory to implement any of the above methods or to implement the functions of the units of the device, wherein the processor is, for example, a general-purpose processor, such as a central processing unit (CPU) or a microprocessor, and the memory is a memory inside the device or a memory outside the device.
  • the units in the device can be implemented in the form of hardware circuits, and the functions of some or all of the units can be implemented by designing the hardware circuits, where the hardware circuits can be understood as one or more processors; for example, in one implementation, the hardware circuit is an application-specific integrated circuit (ASIC), and the functions of some or all of the above units are implemented by designing the logical relationship of the components in the circuit; for example, in another implementation, the hardware circuit can be implemented by a programmable logic device (PLD). Taking a field programmable gate array (FPGA) as an example, it can include a large number of logic gate circuits, and the connection relationship between the logic gate circuits is configured through configuration files, so as to realize the functions of some or all of the above units.
  • All units of the above device can be implemented in the form of software called by the processor, or in the form of hardware circuits, or in part in the form of software called by the processor, and the rest in the form of hardware circuits.
  • the processor is a circuit with the ability to process signals.
  • the processor may be a circuit with the ability to read and run instructions, such as a central processing unit (CPU), a microprocessor, a graphics processing unit (GPU) (which can be understood as a microprocessor), or a digital signal processor (DSP); in another implementation, the processor may implement certain functions through the logical relationship of a hardware circuit, and the logical relationship of the hardware circuit is fixed or reconfigurable, such as a hardware circuit implemented by an application-specific integrated circuit (ASIC) or a programmable logic device (PLD), such as an FPGA.
  • the process of the processor loading a configuration document to implement the hardware circuit configuration can be understood as the process of the processor loading instructions to implement the functions of some or all of the above units.
  • it can also be a hardware circuit designed for artificial intelligence, which can be understood as an ASIC, such as a neural network processing unit (NPU), a tensor processing unit (TPU), a deep learning processing unit (DPU), etc.
  • each unit in the above device can be one or more processors (or processing circuits) configured to implement the above method, such as: CPU, GPU, NPU, TPU, DPU, microprocessor, DSP, ASIC, FPGA, or a combination of at least two of these processor forms.
  • the system-on-a-chip (SOC) may include at least one processor for implementing any of the above methods or implementing the functions of each unit of the device.
  • the type of the at least one processor may be different, for example, including a CPU and an FPGA, a CPU and an artificial intelligence processor, a CPU and a GPU, etc.
  • FIG. 10 is a schematic diagram of the structure of a processing device provided in an embodiment of the present application.
  • the processing device 50 includes: a processor 501, a communication interface 502, a memory 503, and a bus 504.
  • the processor 501, the memory 503, and the communication interface 502 communicate with each other via the bus 504. It should be understood that the present application does not limit the number of processors and memories in the processing device 50.
  • the processing device 50 is a component (such as a chip or integrated circuit, etc.) used for automatic driving control in the vehicle.
  • the vehicle is equipped with an automatic driving system, and the automatic driving system is not limited to a fully automatic driving system, a highly automatic driving system, a conditional automatic driving system, or a partially automatic driving system, etc.
  • the processing device 50 may be a network-side device.
  • a network-side device is a device with computing capabilities.
  • the network-side device may be, for example, a server deployed on the network side (e.g., a server for intelligent driving processing), or a component or chip in the server.
  • the network-side device may also be a system-level device composed of multiple servers.
  • the network-side device may be deployed in a cloud environment or an edge environment, which is not specifically limited herein.
  • the bus 504 may be a peripheral component interconnect (PCI) bus or an extended industry standard architecture (EISA) bus.
  • the bus may be divided into an address bus, a data bus, a control bus, etc.
  • for ease of representation, FIG. 10 shows only one line, but this does not mean that there is only one bus or one type of bus.
  • Bus 504 may include a path for transmitting information between various components of processing device 50 (e.g., memory 503, processor 501, communication interface 502).
  • the processor 501 can refer to the related description of the processor in the above embodiment, which will not be repeated here.
  • the memory 503 is used to provide a storage space, and the storage space can store data such as an operating system and a computer program.
  • the memory 503 can be a random access memory (RAM), an erasable programmable read-only memory (EPROM), a read-only memory (ROM), or a compact disc read-only memory (CD-ROM), etc., or a combination of multiple types.
  • the memory 503 can exist alone or be integrated into the processor 501.
  • the communication interface 502 may be used to provide information input or output for the processor 501.
  • the communication interface 502 may be used to receive data sent externally and/or send data externally, and may be a wired link interface such as an Ethernet cable, or a wireless link interface (such as Wi-Fi, Bluetooth, general wireless transmission, etc.).
  • the communication interface 502 may also include a transmitter (such as a radio frequency transmitter, an antenna, etc.) coupled to the interface, or a receiver, etc.
  • the processing device 50 further includes a display 505.
  • the display 505 is connected or coupled to the processor 501 via a bus 504.
  • the display 505 can be used to display the polygon instance of the first scene.
  • the display 505 can be a display screen, and the display screen can be a liquid crystal display (LCD), an organic or inorganic light-emitting diode (OLED), an active matrix organic light-emitting diode panel (AMOLED), etc.
  • the display 505 can also be an in-vehicle tablet, an in-vehicle display, a head-up display (HUD) system, or an augmented reality head-up display (AR-HUD) system, etc.
  • the processor 501 in the processing device 50 is used to read the computer program stored in the memory 503 to execute the aforementioned method, such as the method described in FIG. 4 or FIG. 7A .
  • the processing device 50 may be one or more modules in an execution body for executing the method shown in FIG. 4 , and the processor 501 may be used to read one or more computer programs stored in a memory to perform the following operations:
  • the collected data is input into a perception detection network, and perception information is output, where the perception information is used to indicate voxels of obstacles in the first scene; and the voxels of the obstacles are displayed at least based on the perception information.
  • the processing device 50 may be one or more modules in an execution body for executing the method shown in FIG. 7A , and the processor 501 may be used to read one or more computer programs stored in a memory to perform the following operations:
  • the encoding unit 410 extracts features from the image data at each moment in a batch of sensor data through an image feature extraction network to obtain 3D image features of the image data at K moments, where K is a positive integer; during each training process, extracts features from the point cloud data at each moment in a batch of sensor data through a point cloud feature extraction network to obtain point cloud features of voxels corresponding to the point cloud data at K moments; and fuses the 3D image features of the image data at K moments and the point cloud features of the voxels corresponding to the point cloud data at K moments through a feature fusion network to obtain fused features of voxels of the scene at K moments;
  • the update unit 414 obtains the loss value of each head network in the output network according to the label information corresponding to the batch of sensor data and the K prediction results output by each head network in the output network; and the loss values of each head network in the output network are weighted to obtain a loss value corresponding to each training process; and the loss value is used to update the parameters in the perception detection network.
  • the storage medium includes read-only memory (ROM), random access memory (RAM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), one-time programmable read-only memory (OTPROM), electrically-erasable programmable read-only memory (EEPROM), compact disc read-only memory (CD-ROM) or other optical disc storage, magnetic disk storage, magnetic tape storage, or any other computer-readable medium that can be used to carry or store data.
  • the essence of the technical solution of the present application or the part that makes the contribution or all or part of the technical solution can be embodied in the form of a software product.
  • the software product is stored in a storage medium and includes a number of instructions for enabling a device (which can be a personal computer, a server, a network device, a robot, a single-chip microcomputer, a chip, etc.) to execute all or part of the steps of the method described in each embodiment of the present application.

Landscapes

  • Engineering & Computer Science (AREA)
  • Automation & Control Theory (AREA)
  • Transportation (AREA)
  • Mechanical Engineering (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Chemical & Material Sciences (AREA)
  • Combustion & Propulsion (AREA)
  • Traffic Control Systems (AREA)

Abstract

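An intelligent driving method and apparatus are disclosed. The method includes: obtaining data collected from a scene by sensors of a vehicle, where the sensors include at least one of a camera and a radar; inputting the collected data into a perception detection network and outputting perception information, where the perception information indicates the voxels of obstacles in a first scene; and controlling the driving of the vehicle based at least on the perception information. This enhances the vehicle's ability to perceive its surroundings, helps improve obstacle detection accuracy, and helps avoid collisions.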

Description

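Intelligent driving method and apparatus
Technical Field
This application relates to the field of intelligent driving, and in particular to an intelligent driving method and apparatus.
Background
The ability of an autonomous vehicle's perception system to perceive the surrounding environment is closely tied to safe driving.
Some current perception systems detect obstacles in the surrounding environment using vision only. This approach depends heavily on training material (for example, a whitelist): the perception system can recognize an obstacle only after it has been trained on that obstacle.
Other current perception systems detect obstacles using lidar or millimeter-wave radar, but such detection is easily affected by weather; for example, obstacle detection accuracy is low in rain or snow.
Summary
This application discloses an intelligent driving method and apparatus that enhance a vehicle's ability to perceive surrounding objects, help improve obstacle detection accuracy, and help avoid collisions.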
第一方面,本申请提供了一种智能驾驶方法,所述方法包括:获取传感器对第一场景的采集数据,所述传感器包括摄像头和雷达中的至少一种;将所述采集数据输入至感知检测网络,输出感知信息,所述感知信息用于指示所述第一场景的障碍物的体素;至少基于所述感知信息控制车辆行驶。
这里,障碍物是指车辆在行驶过程中不期望与之发生碰撞的实体,该实体可以是静态的,也可以是动态的。其中,静态的实体例如可以是道路中的纸箱、道路施工牌、道路分界栏杆、土堆、轮胎、侧翻的车辆、躺着的人、动物、道路旁的建筑物、树木、停放的车辆、道路指示牌、电线杆、路边隔离带等具有体积和质量的静止物体;动态的实体例如可以是行人(例如行走的行人、骑自行车的行人等)、动物、车辆、载物的车辆(例如装载了纸箱、树枝或其他货物)等具有体积和质量的运动物体。
可以理解,本申请并不限定障碍物在物理世界中的呈现形态,以车辆为例,其可以是行驶或停车时的轮胎着地的形态,也可以是经过碰撞事故后的处于翻倒的形态,也可以是车辆的后箱装载了货物(例如树枝、纸箱等)时的形态,还可以是多节车连接时的形态。这里,也不限定障碍物为车辆时该车辆的种类,车辆的种类例如可以是轿车、货车、客车、挂车、非完整车辆、摩托车、自行车等。
这里,雷达包括激光雷达、毫米波雷达中的至少一种。
示例性地,第一场景可以理解为车辆在行驶过程中车辆上的传感器可以探测到的环境空间。可以理解,车辆在行驶过程中,每个时刻可以对应一个场景,多个时刻对应的场景包括这多个时刻中各个时刻对应的场景。
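For example, the perception detection network is obtained by training on a sensor data set and on label information for that data set generated by 4D reconstruction (that is, spatio-temporal reconstruction covering both dynamic and static targets). During training, the label information provides the perception detection network with ground truth for its predictions. It can be understood that 4D reconstruction describes, along the time dimension, how physical objects in three-dimensional space change.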
示例性地,除了可以基于感知信息控制车辆行驶,还可以结合导航地图信息、高精地图信息、路侧设备以及周围其车辆广播的交通实况信息等中的至少一项控制车辆的行驶。
示例性地,该方法可以应用于车辆或者车辆内用于智能驾驶控制的组件(例如芯片或者集成电路)。该车辆配置有自动驾驶系统,这里,自动驾驶系统并不局限于完全自动驾驶系统、高度自动驾驶系统、有条件自动驾驶系统、或部分自动驾驶系统等,本领域技术人员可以理解,提供智能驾驶的非完全人工驾驶系统都可以涵盖在本概念之下。
上述方法中,采用纯视觉或者视觉与雷达结合的方式采集场景的数据,通过感知检测网络对场景的数据进行处理以输出指示了障碍物的体素的感知信息,能够增强车辆对周围物体的感知能力,实现了与语义类别无关的障碍物的感知,提高了场景中障碍物的检测的泛化能力以及准确率。另外,基于感知信息控制车辆的驾驶,可以提高车辆行车的安全性。
在第一方面的一种可能的实现方式中,所述方法还包括:基于所述感知信息显示所述障碍物,所述障碍物以多边形框进行标记;和/或基于所述感知信息显示所述障碍物的体素,所述障碍物的体素以多边形框进行标记。
示例性地,多边形框可以是二维,也可以是三维的。
示例性地,对障碍物或者障碍物的体素进行显示时,可以通过不同的颜色区分当前时刻下动态的障碍物和静态的障碍物,也可以通过在动态的障碍物上另外显示一个箭头来区分动态的障碍物和静态的障碍物,动态的障碍物上的箭头用于指示该动态的障碍物的运动方向。
实施上述实现方式,以多边形框标记障碍物更加贴合障碍物本身的形状,通过对障碍物和/或障碍物的体素进行呈现,用户能清晰直观地了解到车辆当前时刻对周围环境的感知情况。
在第一方面的一种可能的实现方式中,所述感知信息包括以下信息的至少一项:所述第一场景的体素的占据状态、所述第一场景的体素的速度信息、所述第一场景的体素的可见状态和所述障碍物对应的多边形框的角点信息;其中,所述障碍物对应的多边形框与所述障碍物的体素关联。
这里,体素的可见状态例如可以分为“可见”和“不可见”。例如,车辆在当前时刻所处的场景中,如场景中的某个体素在当前时刻未被车辆上的任何一个传感器(包括摄像头和雷达)的观测信号所触及,则该体素的可见状态为不可见;如果该体素被该车辆上的至少一个传感器的观测信号触及,则该体素的可见状态为可见。
这里,体素的占据状态例如可以分为“占据”或“空”(即未被占据)。例如,车辆在当前时刻所处的场景中,如场景中的某个体素在该场景所在的物理世界中对应的空间位置上存在物理实体,则该体素的占据状态为占据;如该体素在该场景所在的物理世界中对应的空间位置上不存在物理实体,则体素的占据状态为空。可以理解,空气不是物理实体。
示例性地,障碍物对应的多边形框与所述障碍物的体素关联可以理解为:障碍物对应的多边形框的角点信息基于该障碍物的体素的索引信息获得。障碍物对应的多边形框的角点信息例如可以是采用凸包算法基于障碍物的体素的索引信息计算获得。
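With this implementation, the visible state of the voxels reveals the vehicle's blind zones in the current scene; the occupancy state of the voxels tells the vehicle which regions (those whose voxels are in the "occupied" state) to avoid in order to prevent collisions; the corner-point information of the polygon box corresponding to an obstacle allows obstacles in the scene to be located quickly; and the velocity information of the voxels, together with the corner-point information of the polygon box corresponding to an obstacle, makes it possible to determine the obstacle's velocity in the current scene.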
在第一方面的一种可能的实现方式中,所述感知信息还用于指示所述第一场景的路面的体素,所述至少基于所述感知信息控制车辆行驶,包括:至少根据所述感知信息,生成所述第一场景的路面几何信息;根据所述路面几何信息,调整所述车辆内的悬架。
示例性地,路面几何信息用于指示第一场景的路面状况(例如路面是否有坑洼、路面是否有凸起等)。
实施上述实现方式,基于感知信息车辆可以提前获取车辆前方的路面状况,在监测到前方路面有起伏时,车辆有足够的时间可以及时调整车辆的悬架,以使车辆在行驶过程中尽可能始终保持水平且平稳的状态,减少因路面起伏带来的振动感,提高了乘坐车辆的舒适性。
在第一方面的一种可能的实现方式中,所述至少基于所述感知信息控制车辆行驶,包括:至少根据所述感知信息,调整所述车辆的行驶路径,所述调整后的行驶路径不途径所述障碍物的体素所在的区域。
实施上述实现方式,可以避免车辆在行驶过程与障碍物发生碰撞,有利于提高车辆行驶的安全性。
在第一方面的一种可能的实现方式中,所述采集数据包括图像数据和点云数据,所述感知检测网络包括图像特征提取网络、点云特征提取网络、特征融合网络和输出网络,其中,
所述图像特征提取网络,用于提取所述图像数据的3D图像特征;
所述点云特征提取网络,用于提取所述点云数据对应的体素的点云特征;
所述特征融合网络,用于根据所述3D图像特征和所述点云数据对应的体素的点云特征进行融合,获得所述第一场景的体素的融合特征;
所述输出网络,用于处理所述第一场景的体素的融合特征并输出所述感知信息。
这里,区别于点云数据对应的体素,在本申请中,第一场景的体素是指经过特征融合网络融合后的体素。
实施上述实现方式,通过车辆上的多模态传感器的原始数据实现对周围环境的障碍物的感知,融合了不同传感器的优势(例如提供了图像的纹理语义信息、提供了点云的深度信息),有利于增强车辆对周围环境的感知能力,提高了障碍物检测的泛化能力和精度。
在第一方面的一种可能的实现方式中,所述方法还包括:将文本查询信息和所述障碍物的体素的融合特征输入至属性识别网络,输出所述障碍物的类别信息;所述文本查询信息用于请求查询类别;显示所述障碍物的类别信息;其中,所述障碍物的体素的融合特征基于所述障碍物对应的多边形框的角点信息和所述第一场景的体素的融合特征确定,所述障碍物对应的多边形框与所述障碍物的体素关联。
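For example, the text query information requests a query over Q categories. Suppose the number of obstacle categories in a given scene is P, where Q and P are positive integers and Q is greater than P. In other words, the number of categories that the attribute recognition network can actually recognize is greater than the number of obstacle categories present in any one scene, which keeps the network from missing a category when it classifies the obstacles in a scene.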
示例性地,障碍物的体素的融合特征基于所述障碍物对应的多边形框的角点信息和所述第一场景的体素的融合特征确定是指:由于障碍物对应的多边形框的角点信息与该障碍物的体素的索引信息对应,基于该障碍物的体素的索引信息可以从第一场景的体素的融合特征中确定该障碍物的体素的融合特征。
这里,障碍物对应的多边形框与所述障碍物的体素关联是指:障碍物对应的多边形框的角点信息基于该障碍物的体素的索引信息获得。
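For example, the attribute recognition network includes a text encoding network and an attribute decoding network. The text encoding network extracts word-vector features from the text query information; the attribute decoding network outputs the category information of the obstacle based on those word-vector features and the fused features of the obstacle's voxels.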
实施上述实现方式,在感知检测网络的基础上,通过部署属性识别网络使得车辆在行驶过程中不仅可以检测到周围环境中的障碍物,还可以识别出障碍物的类别,实现了车辆不仅看的到物还能看的懂物。
在第一方面的一种可能的实现方式中,所述方法还包括:获取车辆的多条规划路径;将所述车辆的多条规划路径和所述第一场景的体素的融合特征输入至路径评估网络,输出所述多条规划路径的推荐系数和所述多条规划路径中的推荐路径,所述推荐路径与所述多条规划路径的推荐系数关联;显示所述推荐路径。
示例性地,路径评估网络包括路径编码网络、特征交互网络和评估输出网络,其中,路径编码网络用于提取多条规划路径中每条规划路径的路径特征;特征交互网络用于根据每条规划路径的路径特征和第一场景的体素的融合特征获得每条规划路径的风险特征;评估输出网络用于根据这多条规划路径的风险特征输出这多条规划路径的推荐系数和这多条规划路径中的推荐路径。
示例性地,推荐路径为多条规划路径中最高推荐系数对应的规划路径。
示例性地,规划路径的推荐系数可以基于该规划路径的风险系数、舒适度和通行效率中的至少一项获得。其中,该规划路径的风险系数与该规划路径与障碍物(包括可见的障碍物以及当前处于盲区的障碍物)的距离、该规划路径与道路中其他交通参与者的路径是否有冲突(例如当前时刻或者未来时刻是否会发生碰撞)等因素中的至少一项有关,该规划路径的通行效率与该规划路径的长度、该规划路径对应的预估通行时长、该规划路径途径的红绿灯的数量、该规划路径所在的可行驶区域的面积等因素中的至少一项有关,该规划路径的舒适度与该规划路径的转向加速度的大小及转向频率、该规划路径的加速度的变化率、该规划路径途径的路面的平整度、该规划路径途径的红绿灯的数量、该规划路径所在的道路的类型、该规划路径途径区域是否阴凉等因素中的至少一项有关。
示例性地,在其他因素不变的情况下,该规划路径的风险系数越低,则该规划路径的推荐系数越高;在其他因素不变的情况下,该规划路径的舒适度越高,则该规划路径的推荐系数越高;在其他因素不变的情况下,该规划路径的通行效率越高,则该规划路径的推荐系数越高。
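With this implementation, deploying a path evaluation network on top of the perception detection network enables path recommendation, which helps improve the safety and comfort of driving.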
第二方面,本申请提供了一种用于智能驾驶的系统,所述系统包括:感知检测网络,用于根据传感器对第一场景的采集数据输出感知信息,所述感知信息用于指示所述第一场景的障碍物的体素;所述传感器包括摄像头和雷达中的至少一种;属性识别网络,用于根据文本查询信息和所述障碍物的体素的融合特征输出所述障碍物的类别信息,所述障碍物的体素的融合特征基于所述障碍物对应的多边形框的角点信息和所述第一场景的体素的融合特征确定,所述障碍物对应的多边形框与所述障碍物的体素关联,所述第一场景的体素的融合特征为所 述感知检测网络基于从所述采集数据中提取的3D图像特征和体素的点云特征中的至少一项进行时间和/或空间上的融合获得;路径评估网络,用于根据多条规划路径和所述第一场景的体素的融合特征输出所述多条规划路径的推荐系数和所述多条规划路径中的推荐路径,所述推荐路径与所述多条规划路径的推荐系数关联。
示例性地,该系统可以部署在车辆或者车辆内用于智能驾驶控制的组件,该组件例如可以是芯片或者集成电路。车辆具体可以参考上述第一方面对车辆的叙述,在此不再赘述。
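In the above system, the perception detection network enhances the ability of the intelligent driving system to perceive its surroundings, which helps the platform on which the system is deployed avoid colliding with obstacles and improves the safety of the system; the attribute recognition network lets the system identify the category of an obstacle in addition to perceiving it, which makes the system more intelligent; and the path evaluation network enables the recommendation of low-risk paths, which makes intelligent travel more convenient.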
下述第二方面的任一特征的有益效果可以参考上述第一方面相应特征的有益效果的描述,在此不再赘述。
在第二方面的一种可能的实现方式中,所述感知信息包括以下信息的至少一项:所述第一场景的体素的占据状态、所述第一场景的体素的速度信息、所述第一场景的体素的可见状态和所述障碍物对应的多边形框的角点信息;其中,所述障碍物对应的多边形框与所述障碍物的体素关联。
在第二方面的一种可能的实现方式中,所述采集数据包括图像数据和点云数据,所述感知检测网络包括图像特征提取网络、点云特征提取网络、特征融合网络和输出网络,其中,
所述图像特征提取网络,用于提取所述图像数据的3D图像特征;
所述点云特征提取网络,用于提取所述点云数据对应的体素的点云特征;
所述特征融合网络,用于根据所述3D图像特征和所述点云数据对应的体素的点云特征进行融合,获得所述第一场景的体素的融合特征;
所述输出网络,用于处理所述第一场景的体素的融合特征并输出所述感知信息。
在第二方面的一种可能的实现方式中,所述属性识别网络包括文本编码网络和属性解码网络,其中,所述文本编码网络,用于提取文本查询信息的词向量特征;所述属性解码网络,用于根据所述词向量特征和所述障碍物的体素的融合特征输出所述障碍物的类别信息。
在第二方面的一种可能的实现方式中,所述路径评估网络包括路径编码网络、特征交互网络和评估输出网络,其中,
所述路径编码网络,用于提取多条规划路径中每条规划路径的路径特征;
所述特征交互网络,用于根据每条规划路径的路径特征和所述第一场景的体素的融合特征获得所述每条规划路径的风险特征;
所述评估输出网络,用于根据所述多条规划路径的风险特征输出所述多条规划路径的推荐系数和所述多条规划路径中的推荐路径。
第三方面,本申请提供了一种用于智能驾驶的装置,该装置包括:接收单元,用于获取传感器对第一场景的采集数据,所述传感器包括摄像头和雷达中的至少一种;处理单元,用于将所述采集数据输入至感知检测网络,输出感知信息,所述感知信息用于指示所述第一场景的障碍物的体素;该处理单元,还用于至少基于所述感知信息控制车辆行驶。
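In a possible implementation of the third aspect, the apparatus further includes a display unit. The display unit is configured to display the obstacle based on the perception information, where the obstacle is marked with a polygon box; and/or to display the voxels of the obstacle based on the perception information.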
在第三方面的一种可能的实现方式中,所述感知信息包括以下信息的至少一项:所述第一场景的体素的占据状态、所述第一场景的体素的速度信息、所述第一场景的体素的可见状态和所述障碍物对应的多边形框的角点信息;其中,所述障碍物对应的多边形框与所述障碍物的体素关联。
在第三方面的一种可能的实现方式中,所述感知信息还用于指示所述第一场景的路面的体素,所述处理单元具体用于:至少根据所述感知信息,生成所述第一场景的路面几何信息;根据所述路面几何信息,调整所述车辆内的悬架。
在第三方面的一种可能的实现方式中,所述处理单元具体用于:至少根据所述感知信息,调整所述车辆的行驶路径,所述调整后的行驶路径不途径所述障碍物的体素所在的区域。
在第三方面的一种可能的实现方式中,所述采集数据包括图像数据和点云数据,所述感知检测网络包括图像特征提取网络、点云特征提取网络、特征融合网络和输出网络,其中,
所述图像特征提取网络,用于提取所述图像数据的3D图像特征;
所述点云特征提取网络,用于提取所述点云数据对应的体素的点云特征;
所述特征融合网络,用于根据所述3D图像特征和所述点云数据对应的体素的点云特征进行融合,获得所述第一场景的体素的融合特征;
所述输出网络,用于处理所述第一场景的体素的融合特征并输出所述感知信息。
在第三方面的一种可能的实现方式中,处理单元还用于:将文本查询信息和所述障碍物的体素的融合特征输入至属性识别网络,输出所述障碍物的类别信息;所述文本查询信息用于请求查询类别;显示单元还用于显示所述障碍物的类别信息;其中,所述障碍物的体素的融合特征基于所述障碍物对应的多边形框的角点信息和所述第一场景的体素的融合特征确定,所述障碍物对应的多边形框与所述障碍物的体素关联。
在第三方面的一种可能的实现方式中,接收单元还用于:获取车辆的多条规划路径;处理单元还用于:将所述车辆的多条规划路径和所述第一场景的体素的融合特征输入至路径评估网络,输出所述多条规划路径的推荐系数和所述多条规划路径中的推荐路径,所述推荐路径与所述多条规划路径的推荐系数关联;显示单元还用于:显示所述推荐路径。
第四方面,本申请提供了一种用于智能驾驶的装置,该装置包括处理器和存储器,其中,存储器用于存储程序指令;所述处理器调用所述存储器中的程序指令,使得装置执行第一方面或者第一方面的任一可能的实现方式中的方法。
第五方面,本申请提供了一种车辆,该车辆包括如上述第二方面或者第二方面的任一可能的实现方式的系统,或者包括如上述第三方面或者第三方面的任一可能的实现方式的装置,或者包括上述第四方面的装置。
第六方面,本申请提供了一种计算机可读存储介质,包括计算机指令,当所述计算机指令在被处理器运行时,实现上述第一方面或者第一方面的任一可能的实现方式中的方法。
第七方面,本申请提供了一种计算机程序产品,当该计算机程序产品被处理器执行时,实现上述第一方面或者第一方面的任一可能的实施例中的所述方法。该计算机程序产品,例如可以为一个软件安装包,在需要使用上述第一方面的任一种可能的设计提供的方法的情况下,可以下载该计算机程序产品并在处理器上执行该计算机程序产品,以实现第一方面或者第一方面的任一可能的实施例中的所述方法。
附图说明
图1是本申请实施例提供的一种通信系统的示意图;
图2是本申请实施例提供的一种用于智能驾驶的感知模型的系统示意图;
图3是本申请实施例提供的一种感知检测网络的特征提取的示意图;
图4是本申请实施例提供的一种智能驾驶方法的流程图;
图5是本申请实施例提供的一些场景示意图;
图6A是本申请实施例提供的一种以多边形框标记场景中的障碍物的示意图;
图6B是本申请实施例提供的一种障碍物的体素的显示示意图;
图6C是本申请实施例提供的一种显示了场景中路面的体素的示意图;
图7A是本申请实施例提供的一种感知检测网络的训练方法的流程图;
图7B是本申请实施例提供的一种感知检测网络的训练过程示意图;
图8是本申请实施例提供的一种芯片硬件结构示意图;
图9A是本申请实施例提供的一种计算装置的结构示意图;
图9B是本申请实施例提供的一种训练装置的结构示意图;
图10是本申请实施例提供的一种处理设备的结构示意图。
具体实施方式
需要说明的是,本申请中采用诸如“第一”、“第二”的前缀词,仅仅为了区分不同的描述对象,对被描述对象的位置、顺序、优先级、数量或内容等没有任何限定作用。例如,被描述对象为“字段”,则“第一字段”和“第二字段”中“字段”之前的序数词并不限制“字段”之间的位置或顺序,“第一”和“第二”并不限制其修饰的“字段”是否在同一个消息中,也不限制“第一字段”和“第二字段”的先后顺序。再如,被描述对象为“等级”,则“第一等级”和“第二等级”中“等级”之前的序数词并不限制“等级”之间的优先级。再如,被描述对象的数量并不受前缀词的限制,可以是一个或者多个,以“第一设备”为例,其中“设备”的数量可以是一个或者多个。此外,不同前缀词修饰的对象可以相同或不同,例如,被描述对象为“设备”,则“第一设备”和“第二设备”可以是同一个设备、相同类型的设备或者不同类型的设备;再如,被描述对象为“信息”,则“第一信息”和“第二信息”可以是相同内容的信息或者不同内容的信息。总之,本申请实施例中对用于区分描述对象的前缀词的使用不构成对所描述对象的限制,对所描述对象的陈述参见权利要求或实施例中上下文的描述,不应因为使用这种前缀词而构成多余的限制。
需要说明的是,本申请实施例中采用诸如“a1、a2、……和an中的至少一项(或至少一个)”等的描述方式,包括了a1、a2、……和an中任意一个单独存在的情况,也包括了a1、a2、……和an中任意多个的任意组合情况,每种情况可以单独存在。例如,“a、b和c中的至少一项”的描述方式,包括了单独a、单独b、单独c、a和b组合、a和c组合、b和c组合,或abc三者组合的情况。
为了便于理解,下面先对本申请实施例可能涉及的相关术语等进行介绍。
(1)自动驾驶
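Autonomous driving, also referred to as intelligent driving or assisted driving, is an important direction in the development of vehicle intelligence. As perception technology advances and chip capability improves, intelligent driving offers an increasingly rich set of driving functions and progressively delivers driving experiences at different levels. The Society of Automotive Engineers (SAE) provides a classification of driving automation with levels L0 to L5. L0 is no automation: the human driver has full control of the vehicle and may receive warnings or assistance from the driving system while driving, such as autonomous emergency braking (AEB), blind spot monitoring (BSM), or lane departure warning (LDW). L1 is driver assistance: driving is shared between the human driver and the driving system. Based on the driving environment, the system supports either steering or acceleration/deceleration, and the human driver performs all other driving operations; examples are adaptive cruise control (ACC) and lane keep assistance/support (LKA/LKS). L2 is partial automation: based on the driving environment, the system supports several of the steering and acceleration/deceleration operations, and the human driver performs the rest; an example is a car-following function that combines ACC and LKA. L3 is conditional automation: the driving system can perform all driving operations, but the human driver must respond to requests from the system at the appropriate time, that is, the driver must be ready to take over from the driving system. L4 is high automation: the driving system can perform all driving operations, and the human driver does not necessarily need to respond to its requests; for example, where road and environmental conditions permit (such as closed campuses, highways, urban roads, or fixed routes), the driver need not take over. L5 is full automation: the driving system can independently perform the driving operations under all road and environmental conditions that a human driver could handle. Thus, at L0 to L2 the driving system mainly supports the driver, who must still supervise the driving and steer, brake, or accelerate as needed to ensure safety. At L3 to L5 the driving system can perform all driving operations in place of the driver: at L3 the driver must be ready to take over, while at L4 and L5 the system can drive fully under some conditions and under all conditions respectively, and the driver can choose whether to take over.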
以上分级是一种示例,随着技术的演进或者在不同国家或地区的规定不同,以上分级可以变化,例如,中国工业和信息化部提出的车辆自动化分级包括在车辆驾驶自动化的6个等级,其中0-2级为驾驶辅助,系统辅助人类执行动态驾驶任务,驾驶主体仍为驾驶员;3-5级为自动驾驶,系统在设计运行条件下代替人类执行动态驾驶任务,当功能激活时,驾驶主体是系统。各级名称及定义如下:0级驾驶自动化(应急辅助,emergency assistance)系统不能持续执行动态驾驶任务中的车辆横向或纵向运动控制,但具备持续执行动态驾驶任务中的部分目标和事件探测与响应的能力。1级驾驶自动化(部分驾驶辅助,partial driver assistance)系统在其设计运行条件(或称为设计运行范围ODD)下持续地执行动态驾驶任务中的车辆横向或纵向运动控制,且具备与所执行的车辆横向或纵向运动控制相适应的部分目标和事件探测与响应的能力。2级驾驶自动化(组合驾驶辅助,combined driver assistance)系统在其设计运行条件下持续地执行动态驾驶任务中的车辆横向和纵向运动控制,且具备与所执行的车辆横向和纵向运动控制相适应的部分目标和事件探测与响应的能力。3级驾驶自动化(有条件自动驾驶,conditionally automated driving)系统在其设计运行条件下持续地执行全部动态驾驶任务。4级驾驶自动化(高度自动驾驶,highly automated driving)系统在其设计运行条件下持续地执行全部动态驾驶任务并自动执行最小风险策略。5级驾驶自动化(完全自动驾驶,fully automated driving)系统在任何可行驶条件下持续地执行全部动态驾驶任务并自动执行最小风险策略。其中,横向控制主要用于车辆转向的控制,例如,控制方向盘扭矩或角度以控制车辆的方向;纵向控制主要用于车辆的速度控制,例如控制制动踏板、加速踏板、或档位等以控制车辆的加/减速、刹车等。
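Whichever classification is used, the description in the embodiments of this application applies to any of the above autonomous driving systems that take part, partially or fully, in driving the vehicle.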
(2)障碍物
在本申请实施例中,障碍物是指车辆在行驶过程中不期望与之发生碰撞的实体,该实体可以是静态的,也可以是动态的。其中,静态的实体例如可以是道路中的纸箱、道路施工牌、道路分界栏杆、土堆、轮胎、侧翻的车辆、躺着的人、动物、道路旁的建筑物、树木、停放的车辆、道路指示牌、电线杆、路边隔离带等具有体积和质量的静止物体;动态的实体例如可以是行人(例如行走的行人、骑自行车的行人等)、动物、车辆、载物的车辆(例如装载了纸箱、树枝或其他货物)等具有体积和质量的运动物体。
(3)场景
场景是指车辆在行驶过程中车辆上的传感器可以探测到的环境空间。可以理解,车辆在行驶过程中,每个时刻对应一个场景,多个时刻对应的场景包括这多个时刻中各个时刻对应的场景。
(4)体素
体素(voxel),也可以称为立体像素或体积元素。体素是三维空间上分割的最小单位,类似于二维空间的最小单位-像素。通过体素可以对3D空间进行网格划分并赋予每个网格特征,在此情况下,体素表示三维空间中规则网格上的值,基于体素相对于其他体素的位置可以推断该体素的定位。
下面将结合附图,对本申请实施例中的技术方案进行描述。
参见图1,图1是本申请实施例提供的一种通信系统的示意图。如图1所示,该系统包括网络侧设备和车辆,其中,网络侧设备与车辆之间以无线的方式进行通信。
这里,网络侧设备是具有计算能力的设备。网络侧设备例如可以是部署在网络侧的服务器(例如用于智能驾驶处理的服务器),或者为该服务器中的组件或者芯片。在一些可能的实施例中,网络侧设备也可以是由多个服务器组成的系统级设备或者计算设备集群。网络侧设备可以部署在云环境或者边缘环境中,本申请实施例不做具体限定。
这里,车辆是指配置有自动驾驶系统的车辆。自动驾驶系统并不局限于完全自动驾驶系统、高度自动驾驶系统、有条件自动驾驶系统、或部分自动驾驶系统等,本领域技术人员可以理解,提供智能驾驶的非完全人工驾驶系统都可以涵盖在本概念之下。
示例性地,依据车辆的动力来源的不同,车辆例如可以是新能源车辆或传统车辆等,其中,传统车辆是指燃油类车辆,例如可以是汽油车辆、柴油车辆等,新能源车辆例如可以是电动车辆(electric vehicle,EV)、混合动力车辆(hybrid electric vehicle,HEV)、增程式电动车辆(range extended EV)、插电式混合动力车辆(Plug-in HEV)、燃料电池车辆或其他新能源车辆,在此不作具体限定。
其中,车辆上部署有摄像头和雷达,其中,摄像头用于采集车辆当前周围环境的图像数据,雷达用于采集车辆当前周围环境的点云数据。雷达包括激光雷达Lidar、毫米波雷达Radar等中的至少一种。基于摄像头在车辆的安装位置,摄像头例如可以分为前视摄像头、环视摄像头、后视摄像头和侧视摄像头等;基于摄像头的结构划分,摄像头例如可以分为单目摄像头、双目摄像头、广角摄像头等。这里,本申请实施例不限定车辆配置的摄像头的数量,出于安全考虑,车辆上的摄像头需要能采集到车身周围360度的图像数据。
示例性地,网络侧设备上部署有感知模型,网络侧设备使用训练数据对感知模型进行训练,其中,训练数据包括从数据源设备(例如,采集车队)获取传感器数据,传感器数据包括车载摄像头采集的图像数据以及车载雷达采集的点云数据。网络侧设备将感知模型训练好后可以将训练好后的感知模型提供给车辆使用。感知模型的训练过程具体可参考下述方法实施例中相应内容的叙述,在此不再赘述。
进一步地,车辆可以从网络侧设备获取感知模型(即训练好的感知模型)。在车辆的行驶过程中,车辆通过自身搭载的传感器(例如摄像头、雷达等)对距离自车一定范围内的环境(或称为场景)进行数据采集获得采集数据,该采集数据例如包括针对该场景采集的图像数据以及点云数据,车辆使用该感知模型对采集数据进行处理以输出该场景的感知信息,感知信息用于指示该场景内的障碍物的体素,车辆至少基于感知信息可以控制自身的行驶。
感知模型具体可参考下述图2实施例的相关叙述,在此不再赘述。
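In the system shown in FIG. 1, communication between the network-side device and the vehicle may use cellular communication technology, for example: 2G cellular communication, such as the global system for mobile communication (GSM) and general packet radio service (GPRS); 3G cellular communication, such as wideband code division multiple access (WCDMA), time division-synchronous code division multiple access (TD-SCDMA), and code division multiple access (CDMA); 4G cellular communication, such as long term evolution (LTE) and LTE vehicle-to-everything (V2X) PC5 communication; 5G cellular communication, such as new radio (NR) V2X PC5 communication; or other evolved cellular communication technologies. The wireless communication system may also use non-cellular communication technologies, such as Wi-Fi and wireless local area network (WLAN) communication, which is not specifically limited here.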
可以理解,图1仅为示例性架构图,但不限定图1所示系统包括的网元的数量。虽然图1未示出,但除图1所示的功能实体外,图1还可以包括其他功能实体。另外,本申请实施例提供的方法可以应用于图1所示的通信系统,当然本申请实施例提供的方法也可以适用其他通信系统,本申请实施例对此不予限制。
参见图2,图2是本申请实施例提供的一种用于智能驾驶的感知模型的系统示意图。
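In FIG. 2, the perception model includes a perception detection network, which outputs perception information based on the data that the sensors collect from a scene (for example, image data and point cloud data). The perception information indicates the voxels of obstacles in the scene. In some possible embodiments, the perception information also indicates the voxels of the road surface in the scene. The perception information can be used to assist the driving of the vehicle.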
下面介绍感知检测网络的框架。
一种实现方式中,在采集数据包括图像数据和点云数据的情况下,感知检测网络包括图像特征提取网络、点云特征提取网络、特征融合网络和输出网络,其中,图像特征提取网络用于从图像数据中提取该图像数据的3D图像特征并将该特征输出至特征融合网络,点云特征提取网络用于从点云数据中提取该点云数据对应的体素的点云特征并将该特征输出至特征融合网络,特征融合网络用于对图像数据的3D图像特征和点云数据对应的体素的点云特征进行融合,获得对应场景的体素的融合特征并将该特征输出至输出网络,输出网络根据该对应场景的体素的融合特征进行预测,输出该场景的感知信息。
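For example, the feature fusion network in FIG. 2 may perform only spatial feature fusion. If the image data is an image captured by the camera at moment t and the point cloud data is data collected by the radar at moment t, the feature fusion network only needs to spatially fuse the 3D image features of the image data at moment t with the point cloud features of the voxels corresponding to the point cloud data at moment t.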
示例性地,图2中的特征融合网络可以执行空间和时间上的特征融合。例如,图像数据为摄像头在n个时刻采集的图像数据,点云数据为雷达在这n个时刻采集的点云数据,则特征融合网络可以先对n个时刻中每个时刻对应的3D图像特征和该时刻对应的体素的点云特征进行空间上的融合获得每个时刻对应的体素的空间融合特征,再将n个时刻中各时刻对应的体素的空间融合特征进行时间上的融合。
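The two-stage order described above (spatial fusion per moment, then temporal fusion across the n moments) can be sketched as follows. This is a minimal illustration, not the network in the application: concatenation stands in for spatial fusion, an exponential recurrence stands in for the recurrent temporal fusion, and the names `spatial_fuse`, `temporal_fuse`, and `alpha` are hypothetical. It also assumes both modalities provide features for the same voxel indices.

```python
def spatial_fuse(image_feats, point_feats):
    """Concatenate per-voxel 3D image features and point cloud features for one moment.

    Assumption: both dicts are keyed by the same voxel indices.
    """
    return {idx: image_feats[idx] + point_feats[idx] for idx in image_feats}


def temporal_fuse(per_moment_feats, alpha=0.5):
    """Blend spatially fused features over time with a simple recurrence.

    h_t = alpha * h_{t-1} + (1 - alpha) * x_t is a hand-written stand-in
    for a learned recurrent network; it is not taken from the source.
    """
    state = {}
    for feats in per_moment_feats:  # oldest moment first
        for idx, x in feats.items():
            if idx in state:
                h = state[idx]
                state[idx] = [alpha * h_i + (1 - alpha) * x_i for h_i, x_i in zip(h, x)]
            else:
                state[idx] = list(x)
    return state


# Two moments, one voxel, one feature channel per modality.
image_seq = [{(0, 0, 0): [1.0]}, {(0, 0, 0): [3.0]}]
point_seq = [{(0, 0, 0): [2.0]}, {(0, 0, 0): [4.0]}]
spatial_seq = [spatial_fuse(i, p) for i, p in zip(image_seq, point_seq)]
fused = temporal_fuse(spatial_seq)
```

Keeping the two stages as separate functions mirrors the text: the spatial-only case simply skips `temporal_fuse`.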
在一些可能的实施例中,如采集数据只包括图像数据,则图2所示感知检测网络内的点云特征提取模块可以缺省,如果图像数据为摄像头在n个时刻采集的图像数据,则特征融合网络对n个时刻中各时刻对应的3D图像特征进行时间上的融合即可。
在一些可能的实施例中,如果采集数据只包括点云数据,则图2所示感知检测网络内的图像特征提取模块可以缺省,如点云数据为雷达在n个时刻采集的点云数据,则特征融合网络对n个时刻中各时刻对应的体素的点云特征进行时间上的融合即可。
For example, the feature fusion network may adopt the structure of a recurrent neural network (RNN) or a recurrent convolutional neural network (recurrent CNN, RCNN). The RNN may be, for example, a long short-term memory (LSTM) network or a gated recurrent unit (GRU) network.
进一步地,图像特征提取网络包括相机主干网络和立体转换网络,其中,相机主干网络用于提取图像数据的2D图像特征,立体转换网络用于将图像数据的2D图像特征转换为图像数据的3D图像特征。这里,立体转换网络可以实现将2D图像特征转换为车体坐标系下的3D图像特征,而从雷达的点云数据中提取出的特征本身就是车体坐标下的3D特征,如此,方便后续特征融合网络对来自不同传感器的特征进行特征融合,有利于消除多模态传感器之间的异构差异。
示例性地,图像数据的2D图像特征包括但不限于该图像数据的颜色特征、形状特征、纹理特征和空间关系特征等。
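For example, the camera backbone network may adopt a convolutional neural network (CNN) such as the residual network ResNet, a transformer network, a vision transformer (ViT) network, or another backbone structure. The stereo conversion network may adopt a transformer network or a lift-splat-shoot (LSS) network structure.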
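Further, the point cloud feature extraction network includes a radar encoding network and a point backbone network. The radar encoding network voxelizes the point cloud data to establish the correspondence between the points in the point cloud and the voxels, and thereby obtains the features of the voxels corresponding to the point cloud data. The point backbone network extracts, from those voxel features, the point cloud features (that is, 3D features) of the voxels corresponding to the point cloud data. In some possible embodiments, where computing power allows, the radar encoding network and the point backbone network may also be merged into a single network that extracts the point cloud features of the voxels corresponding to the point cloud data; this is not specifically limited here.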
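The voxelization step, which establishes the point-to-voxel correspondence, can be illustrated with a small sketch. It assumes a regular grid with cubic voxels; the function names, the `origin` parameter, and the hand-written feature (centroid plus point count) are illustrative stand-ins for what a learned encoder would produce.

```python
import math
from collections import defaultdict


def voxelize(points, voxel_size, origin=(0.0, 0.0, 0.0)):
    """Group 3D points by the integer index of the voxel that contains them."""
    voxels = defaultdict(list)
    for x, y, z in points:
        idx = (
            math.floor((x - origin[0]) / voxel_size),
            math.floor((y - origin[1]) / voxel_size),
            math.floor((z - origin[2]) / voxel_size),
        )
        voxels[idx].append((x, y, z))
    return dict(voxels)


def simple_voxel_features(voxels):
    """Per-voxel feature: (centroid_x, centroid_y, centroid_z, point_count).

    A learned encoder would replace this hand-written summary.
    """
    feats = {}
    for idx, pts in voxels.items():
        n = len(pts)
        centroid = tuple(sum(p[i] for p in pts) / n for i in range(3))
        feats[idx] = centroid + (n,)
    return feats


cloud = [(0.1, 0.2, 0.3), (0.4, 0.4, 0.4), (1.2, 0.0, -0.1)]
grid = voxelize(cloud, voxel_size=0.5)
features = simple_voxel_features(grid)
```

`math.floor` is used rather than `int()` so that points with negative coordinates fall into the correct voxel.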
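For example, the radar encoding network may adopt a voxel feature encoding (VFE) network or a pillar feature encoding (PFE) network structure. The point backbone network may adopt a convolutional neural network (for example, U-Net) or a transformer network structure.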
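The output network is the detection head of the perception detection network. It includes at least one head network, and the number of head networks is determined by the number of kinds of prediction results in the perception information that the output network outputs. For example, as shown in FIG. 2, the perception information includes the occupancy state of the voxels, the velocity information of the voxels, the visible state of the voxels, and the corner-point information of the polygon boxes corresponding to obstacles, where the polygon box corresponding to an obstacle is associated with the voxels of that obstacle. The perception information thus contains four kinds of prediction results, so the output network includes four head networks: head network 1 outputs the corner-point information of the polygon boxes corresponding to obstacles, head network 2 outputs the occupancy state of the voxels, head network 3 outputs the velocity information of the voxels, and head network 4 outputs the visible state of the voxels.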
这里,体素的可见状态是指:车辆在当前时刻所处的场景中,如场景中的某个体素在当前时刻未被车辆上的任何一个传感器(包括摄像头和雷达)的观测信号所触及,则该体素的可见状态为不可见;如果该体素被至少一个传感器的观测信号触及,则该体素的可见状态为可见。
示例性地,体素的占据状态是指:车辆在当前时刻所处的场景中,如场景中的某个体素在该场景所在的物理世界中对应的空间位置上存在实体,则该体素的占据状态为占据;如该体素在该场景所在的物理世界中对应的空间位置上不存在实体,则体素的占据状态为空(即未被占据)。这里,实体可以理解为具有一定体积和质量的物体。可以理解,空气不是实体。
示例性地,输出网络中的任一头网络可以采用卷积神经网络CNN或者变换transformer网络的网络结构。这里,输出网络中不同头网络的内部网络结构可以相同也可以不同,可以理解,不同头网络对相同的输入特征的处理方式不同。
在一些可能的实施例中,为了减少算力的消耗,在图2所示的感知检测网络中,还可以在特征融合网络与输出网络之间设置神经采样网络,即特征融合网络将该场景的体素的融合特征输出至神经采样网络,神经采样网络根据场景的体素中所在区域的重要度采用不同分辨率对该场景的体素的融合特征进行处理,例如,该场景中区域一的重要度大于该场景中区域二的重要度,则以第一分辨率对该区域一中体素的融合特征进行处理,以第二分辨率对区域二中的体素的融合特征进行处理,其中,第一分辨率大于第二分辨率。如此,神经采样网络可以实现对场景中关键区域的体素进行细粒度的处理,以及对场景中非关键区域的体素进行粗粒度的处理,如此,可以大大节省算力,有利于提高感知检测网络的数据处理效率,也有利于降低硬件的部署成本。
示例性地,上述区域一和区域二满足下述条件中的至少一项时,区域一的重要度大于区域二的重要度:
(1)区域一距离车辆的距离小于区域二距离车辆的距离;
(2)区域一内障碍物的数量大于区域二内障碍物的数量;
(3)区域一内动态的障碍物的数量大于区域二内动态的障碍物的数量;和
(4)区域一内障碍物的体积大于区域二内障碍物的体积。
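The four conditions above can be written as a simple predicate. This is only a sketch of the stated rule: the `Region` fields, the function names, and the two voxel sizes are hypothetical. Because the rule is "at least one condition holds", two regions can each count as more important than the other, so a real system would need a tie-breaking policy that the source does not specify.

```python
from dataclasses import dataclass


@dataclass
class Region:
    distance_m: float          # distance from the vehicle
    num_obstacles: int         # obstacles inside the region
    num_dynamic: int           # dynamic obstacles inside the region
    obstacle_volume_m3: float  # total obstacle volume inside the region


def more_important(a: Region, b: Region) -> bool:
    """True if region a meets at least one of the four listed conditions against b."""
    return (
        a.distance_m < b.distance_m
        or a.num_obstacles > b.num_obstacles
        or a.num_dynamic > b.num_dynamic
        or a.obstacle_volume_m3 > b.obstacle_volume_m3
    )


def voxel_size_for(a: Region, b: Region, fine=0.2, coarse=0.8) -> float:
    """Pick a finer voxel size (higher resolution) for the more important region.

    The sizes 0.2 m and 0.8 m are placeholders, not values from the source.
    """
    return fine if more_important(a, b) else coarse


near = Region(distance_m=5.0, num_obstacles=1, num_dynamic=0, obstacle_volume_m3=2.0)
far = Region(distance_m=40.0, num_obstacles=0, num_dynamic=0, obstacle_volume_m3=0.0)
near_size = voxel_size_for(near, far)
far_size = voxel_size_for(far, near)
```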
示例性地,神经采样网络可以采用神经网络、多层感知器(multi-layer perceptron,MLP)或者变换transformer网络的网络结构。
为了更清楚地显示感知检测网络的特征提取的流程,参见图3,图3是本申请实施例提供的一种感知检测网络的特征提取的示意图。在图3中,基于n个摄像头采集的图像数据经过上述相机主干网络可以提取出图像数据的2D图像特征,图像数据的2D图像特征经过立体 转换网络可以提取出该图像数据的3D图像特征,雷达采集的点云数据经过雷达编码网络可以提取出点云数据对应的体素的特征(即3D特征),点云数据对应的体素的特征经过点主干网络可以提取出点云数据对应的体素的点云特征(即3D特征),上述图像数据的3D图像特征和点云数据对应的体素的点云特征经过特征融合网络的融合输出体素的融合特征,最后,体素的融合特征经过上述输出网络分别输出体素的占据状态、体素的速度信息、体素的可见状态和障碍物对应的多边形框的角点信息。
可以理解,图3只是对感知检测网络的特征提取过程的一种示例,并不限定感知检测网络中特征提取流程仅为图3所示。
在一些可能的实施例中,感知模型还包括属性识别网络,属性识别网络可以用于识别障碍物的类别。示例性地,属性识别网络用于根据文本查询信息和障碍物的体素的融合特征输出障碍物的类别信息,其中,文本查询信息用于请求查询类别,障碍物的体素的融合特征基于障碍物对应的多边形框的角点信息和场景的体素的融合特征确定,障碍物对应的多边形框与该障碍物的体素关联。由图2可以知晓,障碍物对应的多边形框的角点信息来自感知检测网络中的输出网络(具体为输出网络中的头网络1),场景的体素的融合特征为感知检测网络中的特征融合网络的输出。
这里,障碍物对应的多边形框与该障碍物的体素关联可以理解为:障碍物对应的多边形框的角点信息基于该障碍物的体素的索引信息获得。障碍物对应的多边形框的角点信息例如可以是图2中的头网络1基于学习到的规则对该障碍物的体素的索引信息进行预测获得,也可以是采用凸包算法基于障碍物的体素的索引信息计算获得该障碍物对应的多边形框的角点信息,在此不作具体限定。
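As an illustration of the convex-hull option mentioned above, the sketch below computes the corners of a 2D polygon box from an obstacle's voxel indices using Andrew's monotone chain algorithm. Projecting the voxels onto the ground plane (dropping the third index) and the name `polygon_corners` are assumptions made for this example; the source does not say which convex-hull algorithm is used.

```python
def cross(o, a, b):
    """Z component of the cross product of vectors o->a and o->b."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def polygon_corners(voxel_indices):
    """Convex-hull corners (counter-clockwise) of the voxels' ground-plane footprint."""
    pts = sorted({(i, j) for i, j, _k in voxel_indices})
    if len(pts) <= 2:
        return pts
    lower = []
    for p in pts:
        # Drop points that would make a clockwise or straight turn.
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other.
    return lower[:-1] + upper[:-1]


# A 3 x 3 footprint of voxels, two voxels tall.
obstacle_voxels = [(i, j, k) for i in range(3) for j in range(3) for k in range(2)]
corners = polygon_corners(obstacle_voxels)
```

Interior and collinear edge voxels are discarded, so the box is described by its four corners only.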
这里,障碍物对应的多边形框可以是二维的,也可以是三维的,在此不作具体限定。
一种实现方式中,属性识别网络包括文本编码网络和属性解码网络,其中,文本编码网络,用于提取文本查询信息的词向量特征;属性解码网络,用于根据该词向量特征和障碍物的体素的融合特征输出该障碍物的类别信息。
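Here, the text query information requests a query over Q categories. Suppose the number of obstacle categories in a given scene is P, where Q and P are positive integers and Q is greater than P. In other words, the number of categories that the attribute recognition network can actually recognize is greater than the number of obstacle categories present in any one scene, which keeps the network from missing a category when it classifies the obstacles in a scene.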
例如,在类别的推理过程中,文本查询信息例如包括“是车吗”、“是行人吗”、“是电线杆吗”、“是道路指示牌吗”、“是道路分界栏杆吗”、……等K条文本查询信息,属性识别网络中的文本编码网络对K条文本查询信息进行特征提取获得每条文本查询信息对应的词向量特征,其中,每条文本查询信息对应的词向量特征可以表征该文本查询信息指示的类别的图像语义特征,以属性识别网络中的属性解码网络对障碍物1的类型识别为例,障碍物1为场景中的任意一个障碍物,属性解码网络将障碍物1的体素的融合特征与K条文本查询信息中每条文本查询信息对应的词向量特征进行相似度计算,确定与障碍物1的体素的融合特征相似度最高的词向量特征对应的类别为障碍物1的类别,从而可以输出该障碍物1的类别信息。
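The similarity step in this example can be sketched as follows. Cosine similarity and mean pooling of the obstacle's voxel features are assumptions: the source only says that a similarity is computed and the most similar query wins. The toy three-dimensional embeddings and the function names are made up for illustration.

```python
import math


def pool_obstacle_feature(scene_features, voxel_indices):
    """Average the fused features of the voxels that belong to one obstacle."""
    rows = [scene_features[idx] for idx in voxel_indices]
    return [sum(col) / len(rows) for col in zip(*rows)]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def classify(obstacle_feature, query_embeddings):
    """Return the query category whose embedding is most similar to the obstacle feature."""
    return max(query_embeddings, key=lambda name: cosine(obstacle_feature, query_embeddings[name]))


# Toy stand-ins for the text encoder output and the fused voxel features.
queries = {"vehicle": [1.0, 0.0, 0.0], "pedestrian": [0.0, 1.0, 0.0], "utility pole": [0.0, 0.0, 1.0]}
scene = {(0, 0, 0): [1.0, 0.0, 0.0], (1, 0, 0): [0.8, 0.4, 0.2], (5, 5, 0): [0.0, 1.0, 0.0]}
obstacle_1 = pool_obstacle_feature(scene, [(0, 0, 0), (1, 0, 0)])
category = classify(obstacle_1, queries)
```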
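Here, the fused features of the voxels of obstacle 1 are determined from the corner-point information of the polygon box corresponding to obstacle 1 and the fused features of the voxels of the scene as follows: because the corner-point information of the polygon box corresponding to obstacle 1 corresponds to the index information of the voxels of obstacle 1, the fused features of the voxels of obstacle 1 can be selected from the fused features of the voxels of the scene using that index information.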
示例性地,文本编码网络、属性解码网络均可以采用卷积神经网络或者变换transformer网络的网络结构。可以理解,文本编码网络、属性解码网络可以根据自身的功能自适应调整网络的相关参数。
在一些可能的实施例中,感知模型还包括路径评估网络,路径评估网络可以用于为车辆确定推荐路径。示例性地,路径评估网络用于根据车辆的多条规划路径和场景的体素的融合特征输出这多条规划路径的推荐系数以及这多条规划路径中的推荐路径。
一种实现方式中,路径评估网络包括路径编码网络、特征交互网络和评估输出网络,其中,路径编码网络用于提取车辆的多条规划路径中每条规划路径的路径特征;特征交互网络用于根据每条规划路径的路径特征和场景的体素的融合特征获得每条规划路径的风险特征;评估输出网络用于根据这多条规划路径的风险特征输出这多条规划路径的推荐系数和这多条规划路径中的推荐路径。
示例性地,规划路径的推荐系数可以基于该规划路径的风险系数、舒适度和通行效率中的至少一项获得。其中,该规划路径的风险系数与该规划路径与障碍物(包括可见的障碍物以及当前处于盲区的障碍物)的距离、该规划路径与道路中其他交通参与者的路径是否有冲突(例如当前时刻或者未来时刻是否会发生碰撞)等因素中的至少一项有关,该规划路径的通行效率与该规划路径的长度、该规划路径对应的预估通行时长、该规划路径途径的红绿灯的数量、该规划路径所在的可行驶区域的面积等因素中的至少一项有关,该规划路径的舒适度与该规划路径的转向加速度的大小及转向频率、该规划路径的加速度的变化率、该规划路径途径的路面的平整度、该规划路径途径的红绿灯的数量、该规划路径所在的道路的类型、该规划路径途径区域是否阴凉等因素中的至少一项有关。
示例性地,推荐路径为这多条规划路径中最高推荐系数对应的规划路径。
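To make the relationship between the three factors and the final choice concrete, here is a hand-written stand-in for the learned evaluation output network. The linear weighting, the weights themselves, and the normalization of every factor to the range 0 to 1 are all assumptions; the source only states that a lower risk coefficient, higher comfort, and higher passing efficiency each raise the recommendation coefficient, and that the path with the highest coefficient is recommended.

```python
def recommendation_coefficient(risk, comfort, efficiency, weights=(0.5, 0.25, 0.25)):
    """Combine risk, comfort and efficiency (each assumed to lie in [0, 1]).

    Lower risk and higher comfort or efficiency raise the score.
    """
    w_risk, w_comfort, w_eff = weights
    return w_risk * (1.0 - risk) + w_comfort * comfort + w_eff * efficiency


def recommend(paths):
    """Return (name of best path, all scores) for a dict of per-path factor values."""
    scores = {name: recommendation_coefficient(**factors) for name, factors in paths.items()}
    best = max(scores, key=scores.get)
    return best, scores


planned_paths = {
    "A": {"risk": 0.1, "comfort": 0.6, "efficiency": 0.7},
    "B": {"risk": 0.6, "comfort": 0.9, "efficiency": 0.9},
}
best_path, path_scores = recommend(planned_paths)
```

With these placeholder weights, path A wins despite being less comfortable and slower, because its much lower risk dominates.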
示例性地,路径编码网络可以采用卷积神经网络、变换transformer网络、图神经网络(graph neural network,GNN)、图卷积神经网络(graph convolution neural networks,GCNNs)的网络结构。特征交互网络可以采用图神经网络或者变换transformer网络的网络结构。评估输出网络可以采用神经网络或多层感知器MLP的网络结构。
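It can be understood that the perception model framework shown in FIG. 2 is only one feasible example given in the embodiments of this application and should not be construed as limiting the framework of the perception model.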
示例性地,在感知模型包括感知检测网络、属性识别网络和路径评估网络的情况下,感知检测网络、属性识别网络和路径评估网络的训练可以是分开的,例如先训练感知检测网络,感知检测网络训练完成后再依次训练属性识别网络和路径评估网络。感知检测网络、属性识别网络和路径评估网络的训练也可以是同时进行,在此不作具体限定。感知模型中各个网络的训练过程可参考下述实施例中的相应内容的叙述,在此不再赘述。
参见图4,图4是本申请实施例提供的一种智能驾驶方法的流程图。该方法可以应用于上述图1中的车辆或者车辆上用于自动驾驶控制的组件(例如芯片或集成电路等),该车辆上至少部署有上述感知检测网络。该方法包括但不限于以下步骤:
S401:获取传感器对第一场景的采集数据,传感器包括摄像头和雷达中的至少一种。
这里,第一场景可以理解为车辆在行驶过程中传感器可以探测到的环境空间。
这里,传感器部署在车辆上。其中,基于摄像头在车辆的安装位置,摄像头例如可以分为前视摄像头、环视摄像头、后视摄像头和侧视摄像头等。雷达包括激光雷达、毫米波雷达中的至少一项。本申请实施例不限定车辆上配置的摄像头的数量以及雷达的数量。
其中,摄像头用于采集图像数据,雷达用于采集点云数据,故上述采集数据包括图像数据和点云数据中的至少一种。
示例性地,车辆上可以配置多个摄像头,不同摄像头的视场角不同,这多个摄像头的视场角可以覆盖以车辆为中心的360度的视野范围。示例性地,多个摄像头中相邻摄像头的视场角范围可以存在部分重叠,如此,同一环境空间内的数据可以同时被多个传感器采集到,有利于提高数据观测的置信度。
示例性地,传感器包括摄像头和雷达,假设车辆上摄像头的数量为m,m个摄像头对第一场景进行图像数据的采集,假设每个时刻每个摄像头采集一张图像,即意味着每个时刻对应的采集数据均包括摄像头采集的m张图像对应的图像数据和雷达采集的点云数据。
S402:将采集数据输入至感知检测网络,输出感知信息,感知信息用于指示第一场景的障碍物的体素。
这里,障碍物是指车辆在行驶过程中不期望与之发生碰撞的实体,该实体可以是静态的,也可以是动态的。其中,静态的实体例如可以是道路中的纸箱、道路施工牌、道路分界栏杆、土堆、轮胎、侧翻的车辆、躺着的人、动物、道路旁的建筑物、树木、停放的车辆、道路指示牌、电线杆、路边隔离带等具有体积和质量的静止物体;动态的实体例如可以是行人(例如行走的行人、骑自行车的行人等)、动物、车辆、载物的车辆(例如装载了纸箱、树枝或其他货物)等具有体积和质量的运动物体。
这里,感知检测网络为部署在车端的已训练好的感知检测网络。感知检测网络用于根据传感器对第一场景的采集数据输出感知信息。例如,感知检测网络为图1所示的网络侧设备基于传感器数据集和采用4D重建生成的该传感器数据集对应的标签信息进行训练获得。传感器数据集对应的标签信息可以是网络侧设备采用自监督方式基于该传感器数据集进行4D重建生成,该标签信息用于在感知检测网络的训练过程中为感知检测网络提供其预测结果的真值信息。
例如,感知检测网络的预测任务包括预测体素的占据状态、体素的速度信息、体素的可见状态以及障碍物对应的多边形框的角点信息这四种预测任务,在感知检测网络的训练过程中,假设感知检测网络当前的输入数据为t时刻的图像数据和t时刻的点云数据,则感知检测网络对该输入数据执行上述四种预测任务的处理并输出预测的感知信息(即预测结果),相应地,标签信息包括该t时刻的图像数据和t时刻的点云数据二者对应的预测结果的真值信息。
一种实现方式中,采集数据包括图像数据和点云数据,感知检测网络包括图像特征提取网络、点云特征提取网络、特征融合网络和输出网络,在此情况下,感知检测网络的处理过程例如可以参考下述步骤A1-A4:
A1:图像特征提取网络提取该图像数据的3D图像特征;
A2:点云特征提取网络提取该点云数据对应的体素的点云特征;
A3:特征融合网络根据所上述3D图像特征和点云数据对应的体素的点云特征进行融合,获得第一场景的体素的融合特征;
A4:输出网络处理第一场景的体素的融合特征并输出感知信息。
这里,感知检测网络的推理过程具体可参考上述图2实施例中对感知检测网络的叙述,上述图像特征提取网络、点云特征提取网络、特征融合网络和输出网络参考上述图2实施例中相应内容的叙述,在此不再赘述。可以理解,上述示例不对感知检测网络的框架构成限制。
在本申请实施例中,感知信息包括以下信息的至少一项:第一场景的体素的占据状态、第一场景的体素的速度信息、第一场景的体素的可见状态和第一场景的障碍物对应的多边形框的角点信息;其中,第一场景的障碍物对应的多边形框与该障碍物的体素关联。
例如,参见图2所示的感知检测网络的框架,可知输出网络包括四个头网络,每个头网络对应一种预测任务,在此情况下,感知信息包括第一场景的体素的占据状态、第一场景的体素的速度信息、第一场景的体素的可见状态和第一场景的障碍物对应的多边形框的角点信息。
这里,体素的占据状态可以分为两种,即“占据”和“空”。有关体素的占据状态可参考前述对体素的占据状态的相关叙述,在此不再赘述。
例如,场景的体素1与该场景所在的物理世界中的车辆A对应,则体素1的占据状态为“占据”;场景的体素2与该场景所在的物理世界中的空气对应,则体素2的占据状态为“空”。
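Here, the visible state of a voxel can likewise be one of two values: "visible" or "invisible". For the visible state of a voxel, refer to the earlier description, which is not repeated here.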
示例性地,体素的可见状态是可以变化的。参见图5,图5是本申请实施例提供的一些场景示意图。图5的(1)示出了t1时刻对应的场景1,在图5的(1)中,车辆1为主车(即上述感知检测网络部署在车辆1上),可以看出,车辆1、车辆2和车辆3位于同一车道,且车辆2当前正在执行换道操作,假设车辆2的车辆体形大于前方车辆3的车辆体形,导致从车辆1的视角看车辆3被车辆2完全遮挡,车辆3位于车辆1的视线盲区内,故在t1时刻车辆1上的任意一个传感器的观测信号均无法触及车辆3的体素,因此,车辆1输出的感知信息中,车辆2的体素在t1时刻的可见状态为“可见”但车辆3的体素在t1时刻的可见状态均为“不可见”。图5的(2)示出了t2时刻对应的场景2,可以看出,车辆2当前已完成换道操作,假设车辆2和车辆3均出现在车辆1的传感器的采集视野范围内,即意味着车辆2的体素以及车辆3的体素在t2时刻均可以被车辆1上的至少一个传感器的观测信号触及,因此,车辆1输出的感知信息中,车辆2的体素在t2时刻的可见状态为“可见”且车辆3的体素在t2时刻的可见状态也为“可见”。由此也可以看出,将多个时刻的采集数据输入至感知检测网络,不仅可以补齐观测信息,也有利于从多方位、多角度更真实地对场景所在的物理世界进行还原。
S403:至少基于感知信息控制车辆行驶。
这里,控制车辆的行驶包括以下操作中的至少一项:变换车道、调整行驶速度、调整行驶路径、开启警示灯和调整车辆的悬架。如此,车辆至少基于感知信息方便实时决策,以提高自身行驶时过程中的安全性。
一种实现方式中,至少基于感知信息控制车辆行驶,包括:至少根据感知信息,调整车辆的行驶路径,其中,调整后的行驶路径不途径障碍物的体素所在的区域。
例如,根据感知信息可以确定车辆当前的行驶路径在当前时刻以及未来时刻与该场景中相应时刻下的障碍物的体素是否会发生碰撞,在预测到有碰撞的情况下,可以及时调整车辆当前的行驶路径,使得调整后的行驶路径不途径障碍物的体素所在的区域,如此,可以避免车辆在行驶过程与障碍物发生碰撞,有利于提高车辆行驶的安全性。
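A minimal sketch of this check, assuming the driving path is given as a list of 3D waypoints and the obstacle voxels as a set of integer indices. It tests only the waypoints themselves, not the volume swept by the vehicle body, and the function names are hypothetical.

```python
import math


def to_voxel(point, voxel_size):
    """Integer index of the voxel that contains a 3D point."""
    return tuple(math.floor(c / voxel_size) for c in point)


def path_is_clear(waypoints, occupied_voxels, voxel_size):
    """True if no waypoint falls inside a voxel occupied by an obstacle."""
    return all(to_voxel(p, voxel_size) not in occupied_voxels for p in waypoints)


def first_clear_path(candidates, occupied_voxels, voxel_size):
    """Return the first candidate path that avoids all obstacle voxels, or None."""
    for path in candidates:
        if path_is_clear(path, occupied_voxels, voxel_size):
            return path
    return None


occupied = {(2, 0, 0)}
current_path = [(0.5, 0.5, 0.5), (1.5, 0.5, 0.5), (2.5, 0.5, 0.5)]
detour_path = [(0.5, 0.5, 0.5), (1.5, 1.5, 0.5), (2.5, 1.5, 0.5)]
chosen = first_clear_path([current_path, detour_path], occupied, voxel_size=1.0)
```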
一种实现方式中,感知信息还用于指示第一场景的路面的体素,则至少基于感知信息控制车辆行驶,包括:至少根据感知信息,生成第一场景的路面几何信息;根据该路面几何信息,调整车辆内的悬架。
这里,路面几何信息用于指示第一场景的路面状况(例如路面是否有坑洼、路面是否有凸起等),基于感知信息车辆可以提前获取车辆前方的路面状况,在监测到路面有起伏时,车辆有足够的时间可以及时调整车辆的悬架,以使车辆在行驶过程中尽可能始终保持水平且平稳的状态,减少因路面起伏带来的振动感,提高了乘坐车辆的舒适性。
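One way to derive road geometry from road-surface voxels is to build a height map and compare each cell with a reference height, as sketched below. The source does not describe how the geometry is computed, so the median reference, the threshold, and the labels are all assumptions for illustration.

```python
from statistics import median


def road_height_profile(road_voxels, voxel_size):
    """Map each ground cell (i, j) to the height of the top face of its highest road voxel."""
    heights = {}
    for i, j, k in road_voxels:
        top = (k + 1) * voxel_size
        if (i, j) not in heights or top > heights[(i, j)]:
            heights[(i, j)] = top
    return heights


def label_cells(heights, threshold):
    """Label each cell as 'bump', 'pothole' or 'flat' relative to the median road height."""
    reference = median(heights.values())
    labels = {}
    for cell, h in heights.items():
        if h - reference > threshold:
            labels[cell] = "bump"
        elif reference - h > threshold:
            labels[cell] = "pothole"
        else:
            labels[cell] = "flat"
    return labels


# Road voxels along one lane: a raised cell at i=2 and a sunken cell at i=4.
road = [(0, 0, 0), (1, 0, 0), (2, 0, 1), (3, 0, 0), (4, 0, -1)]
profile = road_height_profile(road, voxel_size=0.1)
surface = label_cells(profile, threshold=0.05)
```

A suspension controller could then look up the labels of the cells ahead of each wheel; that part is outside the scope of this sketch.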
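For example, controlling the vehicle based on the perception information may also be: determining, from the perception information, the blind zones in the scene and the information about obstacles inside them (for example, the obstacle's velocity information and the corner-point information of the polygon box corresponding to the obstacle), and, when the vehicle approaches a blind zone, controlling the vehicle to decelerate, stop, or steer based on the information about the obstacles in that blind zone. The type and form of the obstacles in a blind zone are not limited, and they may be static or dynamic. In this way, when the vehicle approaches a blind zone of the current scene, it is kept decelerating, stopped, or steering, which avoids collisions with obstacles in the blind zone and improves driving safety.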
这里,盲区例如为感知信息中当前时刻可见状态为“不可见”的体素所在的区域。示例性地,盲区包括当前时刻车辆上的传感器的观测信号本可以触及的区域中因其他障碍物遮挡未能触及的区域和传感器本身的探测盲区。
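The blind-zone definition above (voxels whose visible state is "invisible") and the slow-down behaviour can be sketched as follows. The distance measure (ground-plane distance in voxel units), the radius, and the speed values are placeholders, and only the deceleration option is shown.

```python
import math


def blind_zone_voxels(visible_state):
    """Collect the indices of voxels whose visible state is False (invisible)."""
    return {idx for idx, visible in visible_state.items() if not visible}


def target_speed(ego_index, blind_voxels, current_speed, caution_radius, cautious_speed):
    """Cap the speed when any blind-zone voxel lies within caution_radius of the ego voxel.

    Distance is measured on the ground plane, in voxel units.
    """
    near_blind_zone = any(
        math.dist(ego_index[:2], idx[:2]) <= caution_radius for idx in blind_voxels
    )
    return min(current_speed, cautious_speed) if near_blind_zone else current_speed


visibility = {(3, 0, 0): False, (10, 0, 0): True}
blind = blind_zone_voxels(visibility)
slow = target_speed((0, 0, 0), blind, current_speed=15.0, caution_radius=5, cautious_speed=5.0)
unchanged = target_speed((0, 0, 0), blind, current_speed=15.0, caution_radius=2, cautious_speed=5.0)
```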
这里,障碍物的速度信息例如可以基于该障碍物的体素的速度信息获得。
在一些可能的实施例中,车辆除了可以基于车辆自身输出的感知信息控制车辆的行驶,还可以结合导航地图信息、高精地图信息、路侧设备广播的交通实况信息以及周围其他车辆广播的交通实况信息等中的至少一项控制车辆的行驶。这里,路侧设备例如可以是路侧单元(road side unit,RSU)、多接入边缘计算(multi-access edge computing,MEC)或者传感器等装置,或者是这些装置内部的组件或者芯片,也可以是由RSU和MEC组成的系统级设备,或者是由RSU和传感器组成的系统级设备,还可以是由RSU、MEC和传感器组成的系统级设备。
可选地,在一些可能的实施例中,上述智能驾驶方法还包括:基于感知信息显示第一场景的障碍物,其中,第一场景的障碍物以多边形框进行标记;和/或基于感知信息显示第一场景的障碍物的体素。
示例性地,可以在车辆的显示装置上呈现障碍物或者障碍物的体素。例如,显示装置可以是车端设备的车机平板、车载显示器、抬头显示(head up display,HUD)系统或者增强抬头显示AR-HUD系统等,在此不作具体限定。
参见图6A,图6A是本申请实施例提供的一种以多边形框标记场景中的障碍物的示意图。图6A显示了当前时刻自车所在场景中的障碍物,其中,障碍物以多边形框进行标记。图6A中处于中心下方的车辆为自车,可以看出,该场景下自车周围环境中的障碍物被用多边形框进行了标记显示,基于多边形框的形状可以看出该场景的障碍物至少包括车辆、建筑等。示例性地,多边形框可以是二维的,也可以是三维的。以图6A中自车右侧距离自车最近的多变形框为例,该多边形框以2D显示时,该多边形框可以由1组角点信息指示的10个角点连接而成;该多边形框以3D显示时,该多边形框可以由多组角点信息指示的角点连接而成,其中,每组角点信息指示10个角点。在一些可能的实施例中,对于场景中动态的障碍物,还可以在该动态的障碍物对应的多边形框上添加箭头,该箭头表示该障碍物为动态的障碍物且 该箭头的方向指示了该障碍物的运动方向,该箭头的长度表示该障碍物的速度的大小。可以理解,图6A仅为某个时刻车辆所在场景的障碍物的标记显示的一种示例,并不应对车辆所在场景的障碍物的标记显示构成限定。
参见图6B,图6B是本申请实施例提供的一种障碍物的体素的显示示意图。图6B示出了当前时刻自车所在场景中的障碍物的体素,可以看出,障碍物的体素由该场景中的多个体素组成,体素可以理解为图6B中最小单元的立体方格。示例性地,在图6B中,可以通过不同的颜色将动态的障碍物和静态的障碍物进行区分显示(即可以通过不同颜色区分不同速度的障碍物),也可以通过不同的颜色将不同的障碍物进行区分,在此不作具体限定。可以理解,图6B仅为某个时刻车辆所在场景的障碍物的体素的一种显示示例,并不应对车辆所在场景的障碍物的体素的显示构成限定。
在一些可能的实施例中,还可以显示车辆当前所在场景中路面的体素。参见图6C,图6C是本申请实施例提供的一种显示了场景中路面的体素的示意图。图6C不仅对当前时刻该场景中障碍物的体素进行了显示,还对当前时刻该场景中路面的体素也进行了显示,如此,基于图6C可以看出前方路面的起伏程度。可以理解,图6C仅为某个时刻车辆所在场景的障碍物的体素以及路面的体素的一种显示示例,并不应对车辆所在场景中障碍物的体素以及路面的体素的显示构成限定。
在一些可能的实施例中,车辆上除了部署有上述感知检测网络,还可以部署属性识别网络,其中,属性识别网络用于识别障碍物的类别。如此,车辆在行驶过程中不仅可以检测到周围环境中的障碍物,还可以识别出障碍物的类别,实现了车辆不仅看的到物还能看的懂物。
进一步地,上述智能驾驶方法还包括:获取文本查询信息;将文本查询信息和障碍物的体素的融合特征输入至属性识别网络,输出该障碍物的类别信息;其中,文本查询信息用于请求查询类别;显示该障碍物的类别信息;其中,该障碍物的体素的融合特征基于该障碍物对应的多边形框的角点信息和第一场景的体素的融合特征确定,该障碍物对应的多边形框与该障碍物的体素关联。其中,障碍物对应的多边形框的角点信息和第一场景的体素的融合特征均来自感知检测网络,进一步地,结合上述图2所示的感知检测网络可以知晓,障碍物对应的多边形框的角点信息来自感知检测网络中的输出网络,第一场景的体素的融合特征来自上述感知检测网络中的特征融合网络。此实施例具体可参考上述图2实施例对属性识别网络的相关说明,为了说明书的简洁,在此不再赘述。
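In some possible embodiments, after the perception information is obtained, it may be further fused with the detection results of detection algorithms configured in the camera, of detection algorithms configured in the radar, or of other models. When the same obstacle is perceived by several different methods, the confidence that the obstacle has been detected is correspondingly higher.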
在一些可能的实施例中,车辆上除了部署有上述感知检测网络,还可以部署路径评估网络,其中,路径评估网络用于为车辆推荐最低风险路径。如此,有利于提高驾驶的安全性和驾驶决策的准确率。
进一步地,上述智能驾驶方法还包括:获取车辆的多条规划路径;将车辆的多条规划路径和第一场景的体素的融合特征输入至路径评估网络,输出这多条规划路径的推荐系数和这多条规划路径中的推荐路径,其中,推荐路径与这多条规划路径的推荐系数关联;显示所述推荐路径。可以知晓,第一场景的体素的融合特征来自感知检测网络,结合上述图2对感知 检测网络的叙述可以知第一场景的体素的融合特征由感知检测网络中的特征融合网络提供。这里,多条规划路径由车辆生成,例如车辆基于导航地图信息生成多条规划路径。此实施例具体可参考上述图2实施例对路径评估网络的相关说明,为了说明书的简洁,在此不再赘述。
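For example, the recommended path output by the path evaluation network may include at least two of the planned paths. In that case, in a human-machine co-driving scenario, the recommended path may be presented to the user, feedback information may be received from the user indicating which of the at least two planned paths the user has selected, and the vehicle may be controlled to drive along the selected path.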
可以理解,推荐路径包括的规划路径的数量为多个的情况下,可以理解为推荐路径包括的各规划路径的推荐系数相近或相同,但有的规划路径是耗时最短的路径,有的规划路径是舒适度最高的路径,有的规划路径是距离最短的路径等,在此情况下,可供用户根据自身的需求自由选择,为用户提供了良好的乘车体验感。
在一些可能的实施例中,车辆上也可以同时部署上述感知检测网络、属性识别网络和路径评估网络,相应描述可以参考相应实施例的描述,在此不再赘述。
可以看到,实施本申请实施例,通过在车辆在部署上述感知检测网络,能够增强车辆对周围环境的感知能力,使得车辆在行驶过程中能感知到周围的障碍物,从而能避免碰撞的发生,提高了车辆的安全性。另外,通过在车辆上部署上述属性识别网络,车辆在感知到障碍物的基础上,还能识别出障碍物的类别,提高了车辆的智能性。
参见图7A,图7A是本申请实施例提供的一种感知检测网络的训练方法的流程图。该方法可以应用于上述图1所示的网络侧设备或者网络侧设备内的组件(例如芯片或集成电路等)。以图2所示的感知检测网络为例,感知检测网络包括图像特征提取网络、点云特征提取网络、特征融合网络和输出网络,其中,输出网络包括多个头网络。该方法包括但不限于以下步骤:
S701:在每次训练过程中,通过图像特征提取网络对一批次传感器数据中每个时刻的图像数据进行特征提取,获得K个时刻的图像数据的3D图像特征。
示例性地,每次训练过程中使用的一个批次的传感器数据位于传感器数据集中。
例如,一个批次的传感器数据包括K个时刻的图像数据,其中,K个时刻的图像数据来自车辆的至少一个摄像头,K为正整数。
具体地,该一个批次的传感器数据中K个时刻的图像数据输入至图像特征提取网络,图像特征提取网络基于每个时刻的图像数据得到该时刻的图像数据的3D图像特征,故图像特征提取网络将获得K个时刻的图像数据的3D图像特征,并将其输出至特征融合网络。
这里,图像特征提取网络的框架具体可参考图2实施例中相应内容的叙述,在此不再赘述。
S702:在每次训练过程中,通过点云特征提取网络对一批次传感器数据中每个时刻的点云数据进行特征提取,获得K个时刻的点云数据对应的体素的点云特征。
这里,一个批次的传感器数据还包括这K个时刻的点云数据,其中,这K个时刻的点云数据来自该车辆的至少一个雷达。
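Specifically, the point cloud data at the K moments in the batch of sensor data is input to the point cloud feature extraction network, which derives from the point cloud data at each moment the point cloud features of the voxels corresponding to that moment's point cloud data. The point cloud feature extraction network thus obtains the point cloud features of the voxels corresponding to the point cloud data at the K moments and outputs them to the feature fusion network.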
S703:通过特征融合网络对K个时刻的图像数据的3D图像特征和K个时刻的点云数据对应的体素的点云特征进行特征融合,获得K个时刻的场景的体素的融合特征。
这里,场景的体素是指经过特征融合网络执行特征融合后的体素。
示例性地,特征融合网络根据每个时刻的图像数据的3D图像特征和该时刻的点云数据对应的体素的点云特征进行空间上的融合,获得该时刻的场景的体素的空间融合特征;特征融合网络再根据K个时刻的场景的体素的空间融合特征进行时间上的融合,获得K个时刻的场景的体素的融合特征,在此情况下,每个时刻的场景的体素的融合特征可以称为该时刻的场景的体素的时空融合特征。
S704:通过输出网络中的每个头网络根据K个时刻的场景的体素的融合特征输出K个预测结果,其中,K个预测结果中的每个预测结果对应一个时刻的场景。
其中,K个时刻的场景的体素的融合特征与输出网络中的每个头网络对应,输出网络中的每个头网络执行一种预测任务。
例如,图2所示的输出网络包含头网络1、头网络2、头网络3和头网络4,其中,头网络1用于预测场景的障碍物对应的多边形框的角点信息,头网络2用于预测场景的体素的占据状态,头网络3用于预测场景的体素的速度信息,头网络4用于预测场景的体素的可见状态。
具体地,在输出网络中,每个头网络基于每个时刻的场景的体素的融合特征进行预测,获得该时刻的场景对应的预测结果,从而每个头网络可以获得K个预测结果。
以图2所示的输出网络中的头网络1为例,假设K个时刻包括t1时刻、t2时刻、……、tK时刻,头网络1基于t1时刻的场景的体素的融合特征输出预测结果1,预测结果1包括t1时刻的场景中障碍物对应的多边形框的角点信息;头网络1基于t2时刻的场景的体素的融合特征输出预测结果2,预测结果2包括t2时刻的场景中障碍物对应的多边形框的角点信息;……,如此,头网络1基于K个时刻的场景的体素的融合特征将输出K个预测结果。
S705:根据该批次的传感器数据对应的标签信息和输出网络中每个头网络输出的K个预测结果获得输出网络中每个头网络的损失值。
其中,该批次的传感器数据对应的标签信息是通过4D重建基于该批次的传感器数据生成。该批次的传感器数据对应的标签信息用于为感知检测网络提供感知检测网络对该批次的传感器数据的预测结果对应的真值信息。
以图2所示的输出网络为例,则该批次的传感器数据对应的标签信息包括K个时刻的场景的障碍物对应的多边形框的角点真值信息、这K个时刻的场景的体素的占据状态真值信息、这K个时刻的场景的体素的速度真值信息和这K个时刻的场景的体素的可见状态真值信息,其中,K个时刻的场景的障碍物对应的多边形框的角点真值信息与上述头网络1输出的K个预测结果对应,这K个时刻的场景的体素的占据状态真值信息与上述头网络2输出的K个预测结果对应,这K个时刻的场景的体素的速度真值信息与上述头网络3输出的K个预测结果对应,以及这K个时刻的场景的体素的可见状态真值信息与上述头网络4输出的K个预测结果对应。
以头网络1的损失值的计算为例,基于上述标签信息中K个时刻的场景的障碍物对应的多边形框的角点真值信息和头网络1输出的K个预测结果,获得头网络1的损失值。示例性 地,可以先根据每个时刻的场景的障碍物对应的多边形框的角点真值信息和头网络1输出的K个预测结果中该时刻的预测结果(即包括该时刻的场景中障碍物对应的多边形框的角点信息)获得头网络1在该时刻的损失值,然后再根据头网络1在K个时刻中各时刻的损失值获得头网络1的损失值。同理,输出网络中的其他头网络也可以采用此方式进行自身头网络的损失值的计算,如此,可以获得该输出网络中每个头网络的损失值。
S706:对输出网络中各个头网络的损失值进行加权,获得每次训练过程对应的一个损失值;利用该损失值对感知检测网络中的参数进行更新。
这里,输出网络中每个头网络的权重可以是用户自定义设置。
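Steps S705 and S706 amount to a two-level aggregation: per-moment losses are combined into one loss per head network, and the per-head losses are then weighted into a single training loss. The sketch below assumes the per-head loss is the mean over the K moments, which the source does not specify; the head names, loss values, and weights are made up for illustration.

```python
def head_loss(per_moment_losses):
    """Combine a head network's losses over the K moments (here: a plain mean)."""
    return sum(per_moment_losses) / len(per_moment_losses)


def total_loss(per_head_losses, weights):
    """Weighted sum of the per-head losses, using user-defined weights."""
    return sum(weights[name] * head_loss(losses) for name, losses in per_head_losses.items())


# K = 2 moments, four head networks as in FIG. 2.
losses = {
    "corners": [0.4, 0.6],
    "occupancy": [0.2, 0.4],
    "velocity": [1.0, 1.0],
    "visibility": [0.1, 0.3],
}
head_weights = {"corners": 1.0, "occupancy": 2.0, "velocity": 0.5, "visibility": 1.0}
batch_loss = total_loss(losses, head_weights)
```

In an actual training framework this scalar would be backpropagated through the output network, the feature fusion network, and the two feature extraction networks.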
在得到每次训练过程对应的一个损失值后,利用该损失值对感知检测网络(例如输出网络中的每个头网络+特征融合网络+图像特征提取网络+点云特征提取网络)中的参数进行更新。
可以理解,图7A为上述感知检测网络单独进行训练的一种示例,并不限定感知检测网络的训练流程仅为图7A所示形式。在一些可能的实施例中,上述感知检测网络中输出网络中的各个头网络也可以单独训练的,在此情况下,上述S706不是必要执行步骤。在一些可能的实施例中,感知检测网络还可以与神经辐射场NeRF网络进行联合训练,以进一步提高检测的精准率以及训练效率,在此不作具体限定。
实施本申请实施例,每次训练过程中一个批次的传感器数据对应的标签信息无需通过人工标注生成,不仅节省了人力的消耗,也提高了标签信息的获取效率。感知检测网络采用自监督训练这种方式,能够学习到基于任一时刻车辆的输入数据可以准确预测该时刻下车辆所在场景中的感知信息。
参见图7B,图7B是本申请实施例提供的一种感知检测网络的训练过程示意图。
如图7B所示,将传感器数据集中一个批次的传感器数据输入至感知检测网络中的图像特征提取网络和点云特征提取网络,图像特征提取网络对该批次的传感器数据中的图像数据进行特征提取,获得该图像数据的3D图像特征,点云特征提取网络对该批次的传感器数据中的点云数据进行特征提取,获得该点云数据对应的体素的点云特征,感知检测网络中的特征融合网络对来自图像特征提取网络的该图像数据的3D图像特征和来自点云特征提取网络的该点云数据对应的体素的点云特征进行特征融合,获得场景的体素的融合特征并将其输入至感知检测网络中的输出网络中,输出网络中的每个头网络基于场景的体素的融合特征获得对应的预测结果。基于输出网络中每个头网络输出的预测结果和该批次的传感器数据对应的标签信息获得输出网络中每个头网络的损失值,将输出网络中各头网络的损失值进行加权,该次训练过程对应的一个损失值,并利用该损失值进行反向传播,实现依次更新上述输出网络、特征融合网络、特征提取网络(包括图像特征提取网络和点云特征提取网络)的参数。可以理解,图7B仅为感知检测网络的训练过程的一种示意,并不限定感知检测网络的训练过程仅为图7B所示,例如输出网络中的各个头网络也可以单独训练。
这里,感知检测网络的训练具体可参考上述图7A实施例的相关叙述,此处不赘述。
在一些可能的实施例中,在完成对感知检测网络的训练后,可以对属性识别网络进行训练。下面来说明属性识别网络的训练过程。
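Take the attribute recognition network shown in FIG. 2 as an example; it includes a text encoding network and an attribute decoding network. For example, the text encoding network may directly use a trained word-vector feature extractor (or one obtained through text-image pre-training), in which case the embodiments of this application may train only the attribute decoding network. The trained word-vector feature extractor can extract the word-vector features of input text query information, and those features can represent the image semantic features of the category that the text query information indicates.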
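For example, the text encoding network may be obtained through text-image pre-training as follows. A large amount of text-image training data is obtained, comprising multiple text-image data groups; each group contains text information indicating a category and the image corresponding to that text. For example, text-image data group 1 contains text indicating a car and an image of a car. The text in the training data is input to a text encoder to extract the word-vector features of each piece of text, and the images in the training data are input to an image encoder to extract the image features of each image. The parameters of the text encoder and the image encoder are adjusted so that the word-vector features of the text and the image features of the image are as close as possible when they belong to the same text-image data group, and as far apart as possible when they belong to different groups. The trained text encoder can then be used directly as the text encoding network in the attribute recognition network.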
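Training of the attribute decoding network may, for example, proceed as follows. Based on multiple pieces of received text query information, the text encoding network inputs the word-vector features of each piece of text query information to the attribute decoding network. The attribute decoding network predicts the category information of each obstacle from the word-vector features of these text queries and the fused features of the obstacle voxels provided by the perception detection network. The loss value of the attribute decoding network for this training iteration is obtained from the predicted category information of each obstacle and the category annotation information of each obstacle. Finally, the parameters of the attribute decoding network are updated by back-propagation based on this loss value.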
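One way to picture the prediction and loss steps above is to score every obstacle feature against every text-query feature and treat the scores as classification logits. The sketch below does this with a dot product and a softmax. It is a hypothetical stand-in: the source does not specify the internal structure of the attribute decoding network, and `decode_categories`, `classification_loss`, and the toy features are assumptions.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def decode_categories(query_feats: Dict[str, Vector],
                      obstacle_feats: List[Vector]) -> List[Tuple[str, List[float]]]:
    """For each obstacle, return (predicted category, probabilities per query)."""
    names = list(query_feats)
    results = []
    for feat in obstacle_feats:
        # Dot-product similarity between the obstacle's fused feature and each query.
        scores = [sum(a * b for a, b in zip(query_feats[n], feat)) for n in names]
        probs = softmax(scores)
        best = max(range(len(names)), key=probs.__getitem__)
        results.append((names[best], probs))
    return results


def classification_loss(results: List[Tuple[str, List[float]]],
                        names: List[str], labels: List[str]) -> float:
    """Mean cross-entropy between predictions and category annotations."""
    return sum(-math.log(probs[names.index(label)])
               for (_, probs), label in zip(results, labels)) / len(labels)


queries = {"car": [1.0, 0.0], "pedestrian": [0.0, 1.0]}   # word-vector features
obstacles = [[2.0, 0.1], [0.2, 3.0]]                      # fused obstacle features
predictions = decode_categories(queries, obstacles)
loss = classification_loss(predictions, list(queries), ["car", "pedestrian"])
```

Because the categories come from the text queries rather than a fixed output layer, the same decoding step can be run with a different set of queries without changing the code.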
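In some possible embodiments, after training of the perception detection network is complete, the path evaluation network may be trained. The training process of the path evaluation network is described below.
Taking the path evaluation network shown in FIG. 2 as an example, the path evaluation network includes a path encoding network, a feature interaction network, and an evaluation output network. The training process of the path evaluation network may, for example, be as follows. Path training data is obtained, which includes multiple paths planned for the vehicle within the range of the K moments above and the recommendation-coefficient annotation information of these paths. The paths are input to the path encoding network, which outputs the extracted path features of each path to the feature interaction network. The feature interaction network outputs the risk features of each path based on the path features of that path and the fused features of the voxels of the scenes at the K moments (from the perception detection network). The evaluation output network outputs the predicted recommendation coefficients of the paths based on their risk features and, based on these predicted recommendation coefficients, determines the predicted recommended path from among the paths. The loss value of the multiple paths is obtained from their predicted recommendation coefficients and their recommendation-coefficient annotation information, where this loss value is obtained by weighting the loss values of the individual paths. Finally, the parameters of the path evaluation network are updated based on the loss value of the multiple paths.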
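The last two steps, choosing the recommended path and computing the weighted loss over all paths, can be sketched as follows. This is a minimal illustration with assumptions: the helper names are hypothetical, picking the highest coefficient is one plausible reading of "recommended path", and squared error is an assumed per-path loss since the source does not state which loss is used.

```python
from typing import List


def recommended_path(coefficients: List[float]) -> int:
    """Index of the path with the highest predicted recommendation coefficient."""
    return max(range(len(coefficients)), key=coefficients.__getitem__)


def paths_loss(predicted: List[float], annotated: List[float],
               path_weights: List[float]) -> float:
    """Weighted sum of per-path losses (squared error is an assumption)."""
    per_path = [(p - a) ** 2 for p, a in zip(predicted, annotated)]
    return sum(w * l for w, l in zip(path_weights, per_path))


# Three candidate paths; all numbers are illustrative.
predicted = [0.2, 0.9, 0.5]        # output of the evaluation output network
annotated = [0.0, 1.0, 0.5]        # recommendation-coefficient annotations
weights = [0.5, 0.25, 0.25]        # per-path weights

best = recommended_path(predicted)
loss = paths_loss(predicted, annotated, weights)
```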
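As can be seen, the perception detection network, the attribute recognition network, and the path evaluation network above are trained separately and independently. In some possible embodiments, the three networks may also be trained jointly. In that case, the loss value for each training iteration is obtained by weighting the loss value of the perception detection network in that iteration, the loss value of the attribute recognition network in that iteration (that is, the loss value of the attribute decoding network in that iteration), and the loss value of the path evaluation network in that iteration. This loss value can then be used to update the parameters of the perception detection network, of the attribute decoding network in the attribute recognition network, and of the path evaluation network.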
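Referring to FIG. 8, FIG. 8 is a schematic diagram of a chip hardware structure provided by an embodiment of this application, which may be used to perform the intelligent driving method and/or the training method in the embodiments of this application.
As shown in FIG. 8, a neural-network processing unit (NPU) 80 is mounted on a host CPU as a coprocessor, and the host CPU assigns tasks to it so as to perform the relevant processes of the intelligent driving method or the training method in the foregoing embodiments.
The core part of the NPU is an arithmetic circuit 803. A controller 804 controls the arithmetic circuit 803 to fetch data from memory (the weight memory or the input memory) and perform operations.
In some implementations, the arithmetic circuit 803 internally includes multiple processing engines (PEs). In some implementations, the arithmetic circuit 803 is a two-dimensional systolic array. The arithmetic circuit 803 may also be a one-dimensional systolic array or other electronic circuitry capable of performing mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuit 803 is a general-purpose matrix processor.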
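For example, assume an input matrix A, a weight matrix B, and an output matrix C. The arithmetic circuit fetches the data corresponding to matrix B from the weight memory 802 and buffers it on each PE in the arithmetic circuit. The arithmetic circuit then fetches the data of matrix A from the input memory 801, performs a matrix operation with matrix B, and stores the partial or final results of the resulting matrix in an accumulator 808.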
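The data flow in this example can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a model of the NPU: the function name `matmul_accumulate` and the `tile` parameter are hypothetical, and the sketch reproduces only how partial results build up in an accumulator, not the timing or PE layout of a systolic array.

```python
from typing import List

Matrix = List[List[float]]


def matmul_accumulate(a: Matrix, b: Matrix, tile: int = 2) -> Matrix:
    """Compute C = A x B by accumulating partial products tile by tile."""
    rows, inner, cols = len(a), len(b), len(b[0])
    # acc plays the role of the accumulator: it holds partial results
    # until the last tile has been processed, then the final result.
    acc = [[0.0] * cols for _ in range(rows)]
    for start in range(0, inner, tile):
        stop = min(start + tile, inner)
        for i in range(rows):
            for j in range(cols):
                acc[i][j] += sum(a[i][k] * b[k][j] for k in range(start, stop))
    return acc


A = [[1, 2, 3],
     [4, 5, 6]]          # input matrix (from the input memory)
B = [[1, 0],
     [0, 1],
     [1, 1]]             # weight matrix (from the weight memory)
C = matmul_accumulate(A, B)
```

With `tile=2` the inner dimension of size 3 is processed in two passes, so `acc` contains a partial result after the first pass and the final matrix C after the second.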
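The vector computation unit 807 may further process the output of the arithmetic circuit, for example by vector multiplication, vector addition, exponential operations, logarithmic operations, and magnitude comparison. For example, the vector computation unit 807 may be used for network computations in the non-convolutional/non-FC layers of a neural network, such as pooling, batch normalization, and local response normalization.
In some implementations, the vector computation unit 807 can store the processed output vector in the unified memory 806. For example, the vector computation unit 807 may apply a nonlinear function to the output of the arithmetic circuit 803, such as a vector of accumulated values, to generate activation values. In some implementations, the vector computation unit 807 generates normalized values, merged values, or both. In some implementations, the processed output vector can be used as an activation input to the arithmetic circuit 803, for example for use in a subsequent layer of the neural network.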
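The kinds of post-processing attributed to the vector computation unit can be illustrated with three small functions applied to a vector of accumulated values. This is a sketch only: the function names are hypothetical, ReLU is chosen as one example of a nonlinear function, and the normalization omits the learned scale and shift that a full batch normalization layer would normally have.

```python
import math
from typing import List


def relu(values: List[float]) -> List[float]:
    """Nonlinear activation applied element-wise to accumulated values."""
    return [max(0.0, v) for v in values]


def max_pool_1d(values: List[float], window: int) -> List[float]:
    """Non-overlapping max pooling over a 1-D vector."""
    return [max(values[i:i + window])
            for i in range(0, len(values) - window + 1, window)]


def batch_normalize(values: List[float], eps: float = 1e-5) -> List[float]:
    """Normalize to zero mean and (approximately) unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]


accumulated = [-1.0, 3.0, 2.0, 5.0, -4.0, 0.5]   # e.g. one row of accumulator output
activated = relu(accumulated)                    # [0.0, 3.0, 2.0, 5.0, 0.0, 0.5]
pooled = max_pool_1d(activated, 2)               # [3.0, 5.0, 0.5]
normalized = batch_normalize(pooled)
```

The result of such a chain is what the text describes as being written to the unified memory or fed back to the arithmetic circuit as the activation input of the next layer.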
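The unified memory 806 is used to store input data and output data.
A direct memory access controller (DMAC) 805 transfers input data from external memory to the input memory 801 and/or the unified memory 806, stores weight data from external memory into the weight memory 802, and stores data from the unified memory 806 into external memory.
A bus interface unit (BIU) 810 is used for interaction among the host CPU, the DMAC, and the instruction fetch buffer 809 over a bus.
The instruction fetch buffer 809, connected to the controller 804, is used to store the instructions used by the controller 804.
The controller 804 is used to invoke the instructions buffered in the instruction fetch buffer 809 to control the working process of the computing accelerator.
Generally, the unified memory 806, the input memory 801, the weight memory 802, and the instruction fetch buffer 809 are all on-chip memories. The external memory is memory external to the NPU and may be double data rate synchronous dynamic random access memory (DDR SDRAM), high bandwidth memory (HBM), or other readable and writable memory.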
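Referring to FIG. 9A, FIG. 9A is a schematic structural diagram of a computing device provided by an embodiment of this application. The computing device 30 includes a receiving unit 310 and a processing unit 312. The computing device 30 may be implemented in hardware, in software, or in a combination of the two.
The receiving unit 310 is used to obtain data collected by a sensor for a first scene, the sensor including at least one of a camera and a radar. The processing unit 312 is used to input the collected data to the perception detection network and output perception information, the perception information indicating the voxels of the obstacles in the first scene. The processing unit 312 is further used to display the voxels of the obstacles based at least on the perception information.
In some possible embodiments, the computing device 30 further includes a display unit 314 (not shown), which is used to: display the obstacle based on the perception information, the obstacle being marked with a polygon box; and/or display the voxels of the obstacle based on the perception information.
The computing device 30 may be used to implement the method described in the FIG. 4 embodiment. In the FIG. 4 embodiment, the receiving unit 310 may be used to perform S401, and the processing unit 312 may be used to perform S402 and S403.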
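Referring to FIG. 9B, FIG. 9B is a schematic structural diagram of a training device provided by an embodiment of this application. The training device 40 includes an encoding unit 410, a decoding unit 412, and an updating unit 414. The training device 40 may be implemented in hardware, in software, or in a combination of the two.
The encoding unit 410 is used to: in each training iteration, perform feature extraction through the image feature extraction network on the image data at each moment in a batch of sensor data to obtain the 3D image features of the image data at K moments, K being a positive integer; in each training iteration, perform feature extraction through the point cloud feature extraction network on the point cloud data at each moment in the batch of sensor data to obtain the point cloud features of the voxels corresponding to the point cloud data at the K moments; and fuse, through the feature fusion network, the 3D image features of the image data at the K moments with the point cloud features of the voxels corresponding to the point cloud data at the K moments to obtain the fused features of the voxels of the scenes at the K moments. The decoding unit 412 is used to output, through each head network in the output network, K prediction results based on the fused features of the voxels of the scenes at the K moments, where each of the K prediction results corresponds to the scene at one moment. The updating unit 414 is used to: obtain the loss value of each head network in the output network from the label information corresponding to the batch of sensor data and the K prediction results output by each head network; weight the loss values of the head networks in the output network to obtain one loss value for each training iteration; and use this loss value to update the parameters of the perception detection network.
The training device 40 may be used to implement the method described in the FIG. 7A embodiment. In the FIG. 7A embodiment, the encoding unit 410 may be used to perform S701-S703, the decoding unit 412 may be used to perform S704, and the updating unit 414 may be used to perform S705 and S706.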
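It should be understood that the division of the above devices (for example, the computing device 30 and the training device 40) into units is only a division of logical functions; in an actual implementation the units may be fully or partially integrated into one physical entity, or may be physically separate. In addition, the units in a device may be implemented in the form of a processor invoking software. For example, the device includes a processor connected to a memory in which instructions are stored, and the processor invokes the instructions stored in the memory to implement any of the above methods or the functions of the units of the device, where the processor is, for example, a general-purpose processor such as a central processing unit (CPU) or a microprocessor, and the memory is a memory inside or outside the device. Alternatively, the units in a device may be implemented in the form of hardware circuits, and the functions of some or all of the units may be realized through the design of the hardware circuit, which may be understood as one or more processors. For example, in one implementation the hardware circuit is an application-specific integrated circuit (ASIC), and the functions of some or all of the above units are realized through the design of the logical relationships among the elements in the circuit. As another example, in another implementation the hardware circuit may be implemented by a programmable logic device (PLD); taking a field programmable gate array (FPGA) as an example, it may include a large number of logic gate circuits whose interconnections are configured through a configuration file, thereby realizing the functions of some or all of the above units. All units of the above devices may be implemented entirely in the form of a processor invoking software, entirely in the form of hardware circuits, or partly in the form of a processor invoking software with the remainder in the form of hardware circuits.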
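In the embodiments of this application, a processor is a circuit with signal processing capability. In one implementation, the processor may be a circuit capable of reading and executing instructions, such as a central processing unit (CPU), a microprocessor, a graphics processing unit (GPU) (which may be understood as a kind of microprocessor), or a digital signal processor (DSP). In another implementation, the processor may realize certain functions through the logical relationships of a hardware circuit, which are either fixed or reconfigurable; for example, the processor is a hardware circuit implemented as an application-specific integrated circuit (ASIC) or a programmable logic device (PLD), such as an FPGA. In a reconfigurable hardware circuit, the process in which the processor loads a configuration document to configure the hardware circuit may be understood as the process in which the processor loads instructions to realize the functions of some or all of the above units. The processor may also be a hardware circuit designed for artificial intelligence, which may be understood as a kind of ASIC, such as a neural network processing unit (NPU), a tensor processing unit (TPU), or a deep learning processing unit (DPU).
It can be seen that each unit in the above devices may be one or more processors (or processing circuits) configured to implement the above methods, for example a CPU, GPU, NPU, TPU, DPU, microprocessor, DSP, ASIC, or FPGA, or a combination of at least two of these processor forms.
In addition, the units in the above devices may be fully or partially integrated together, or may be implemented independently. In one implementation, these units are integrated together and implemented in the form of a system-on-a-chip (SOC). The SOC may include at least one processor for implementing any of the above methods or the functions of the units of the device, and the processors may be of different types, for example a CPU and an FPGA, a CPU and an artificial intelligence processor, or a CPU and a GPU.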
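Referring to FIG. 10, FIG. 10 is a schematic structural diagram of a processing device provided by an embodiment of this application. As shown in FIG. 10, the processing device 50 includes a processor 501, a communication interface 502, a memory 503, and a bus 504. The processor 501, the memory 503, and the communication interface 502 communicate with one another through the bus 504. It should be understood that this application does not limit the number of processors or memories in the processing device 50.
In one implementation, the processing device 50 is a component in a vehicle used for automated driving control (for example, a chip or an integrated circuit). The vehicle is equipped with an automated driving system. Here, the automated driving system is not limited to a fully automated, highly automated, conditionally automated, or partially automated driving system; a person skilled in the art will understand that any driving system that provides intelligent driving and is not fully manual falls under this concept.
In another implementation, the processing device 50 may be a network-side device, that is, a device with computing capability. The network-side device may be, for example, a server deployed on the network side (such as a server used for intelligent driving processing), or a component or chip in such a server. In some possible embodiments, the network-side device may also be a system-level device composed of multiple servers. The network-side device may be deployed in a cloud environment or an edge environment, which is not specifically limited here.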
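The bus 504 may be a peripheral component interconnect (PCI) bus, an extended industry standard architecture (EISA) bus, or the like. Buses may be classified into address buses, data buses, control buses, and so on. For ease of illustration, the bus is represented by only one line in FIG. 10, but this does not mean that there is only one bus or one type of bus. The bus 504 may include a path for transferring information among the components of the processing device 50 (for example, the memory 503, the processor 501, and the communication interface 502).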
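For the processor 501, refer to the description of processors in the above embodiments; details are not repeated here.
The memory 503 is used to provide storage space, in which data such as an operating system and computer programs may be stored. The memory 503 may be one or a combination of random access memory (RAM), erasable programmable read-only memory (EPROM), read-only memory (ROM), compact disc read-only memory (CD-ROM), and the like. The memory 503 may exist separately or may be integrated inside the processor 501.
The communication interface 502 may be used to provide information input or output for the processor 501. Alternatively, the communication interface 502 may be used to receive data sent from outside and/or send data to the outside, and may be a wired link interface such as an Ethernet cable interface or a wireless link interface (such as Wi-Fi, Bluetooth, or general wireless transmission). Alternatively, the communication interface 502 may also include a transmitter (such as a radio frequency transmitter or an antenna) or a receiver coupled to the interface.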
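In some possible embodiments, the processing device 50 further includes a display 505. The display 505 is connected or coupled to the processor 501 through the bus 504. The display 505 may be used to display polygon instances of the first scene. The display 505 may be a display screen, such as a liquid crystal display (LCD), an organic or inorganic light-emitting diode display (organic light-emitting diode, OLED), or an active-matrix organic light-emitting diode (AMOLED) panel. The display 505 may also be an in-vehicle tablet, an on-board display, a head up display (HUD) system, an augmented reality head up display (AR-HUD) system, or the like.
The processor 501 in the processing device 50 is used to read the computer program stored in the memory 503 to perform the foregoing methods, for example the method described in FIG. 4 or FIG. 7A.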
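In one possible design, the processing device 50 may be one or more modules of the executing entity that performs the method shown in FIG. 4, and the processor 501 may be used to read one or more computer programs stored in the memory to perform the following operations:
obtaining, through the receiving unit 310, data collected by a sensor for a first scene, the sensor including at least one of a camera and a radar;
inputting the collected data to the perception detection network and outputting perception information, the perception information indicating the voxels of the obstacles in the first scene; and displaying the voxels of the obstacles based at least on the perception information.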
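In another possible design, the processing device 50 may be one or more modules of the executing entity that performs the method shown in FIG. 7A, and the processor 501 may be used to read one or more computer programs stored in the memory to perform the following operations:
through the encoding unit 410, in each training iteration, performing feature extraction through the image feature extraction network on the image data at each moment in a batch of sensor data to obtain the 3D image features of the image data at K moments, K being a positive integer; in each training iteration, performing feature extraction through the point cloud feature extraction network on the point cloud data at each moment in the batch of sensor data to obtain the point cloud features of the voxels corresponding to the point cloud data at the K moments; and fusing, through the feature fusion network, the 3D image features of the image data at the K moments with the point cloud features of the voxels corresponding to the point cloud data at the K moments to obtain the fused features of the voxels of the scenes at the K moments;
through the decoding unit 412, outputting, through each head network in the output network, K prediction results based on the fused features of the voxels of the scenes at the K moments, where each of the K prediction results corresponds to the scene at one moment;
through the updating unit 414, obtaining the loss value of each head network in the output network from the label information corresponding to the batch of sensor data and the K prediction results output by each head network; weighting the loss values of the head networks in the output network to obtain one loss value for each training iteration; and using this loss value to update the parameters of the perception detection network.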
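In the embodiments described above, each embodiment has its own emphasis; for parts not described in detail in one embodiment, refer to the relevant descriptions of the other embodiments. In addition, in the embodiments of this application, unless otherwise specified or logically conflicting, the terms and/or descriptions in different embodiments are consistent and may be cross-referenced, and the technical features of different embodiments may be combined according to their inherent logical relationships to form new embodiments.
It should be noted that a person of ordinary skill in the art will appreciate that all or some of the steps of the methods in the above embodiments may be completed by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium, which includes read-only memory (ROM), random access memory (RAM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), one-time programmable read-only memory (OTPROM), electrically-erasable programmable read-only memory (EEPROM), compact disc read-only memory (CD-ROM) or other optical disc storage, magnetic disk storage, magnetic tape storage, or any other computer-readable medium that can be used to carry or store data.
The technical solution of this application, in essence, or the part that makes a contribution, or all or part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and includes a number of instructions for causing a device (which may be a personal computer, a server, a network device, a robot, a single-chip microcomputer, a chip, or the like) to perform all or some of the steps of the methods described in the embodiments of this application.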

Claims (17)

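  1. An intelligent driving method, characterized in that the method comprises:
    obtaining data collected by a sensor for a first scene, the sensor comprising at least one of a camera and a radar;
    inputting the collected data to a perception detection network and outputting perception information, the perception information being used to indicate the voxels of an obstacle in the first scene;
    controlling the driving of a vehicle based at least on the perception information.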
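  2. The method according to claim 1, characterized in that the method further comprises:
    displaying the obstacle based on the perception information, the obstacle being marked with a polygon box; and/or
    displaying the voxels of the obstacle based on the perception information.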
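  3. The method according to claim 1 or 2, characterized in that the perception information comprises at least one of the following:
    the occupancy state of the voxels of the first scene, the velocity information of the voxels of the first scene, the visibility state of the voxels of the first scene, and the corner information of the polygon box corresponding to the obstacle;
    wherein the polygon box corresponding to the obstacle is associated with the voxels of the obstacle.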
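  4. The method according to any one of claims 1-3, characterized in that the perception information is further used to indicate the voxels of the road surface of the first scene, and the controlling the driving of a vehicle based at least on the perception information comprises:
    generating road surface geometry information of the first scene based at least on the perception information;
    adjusting the suspension in the vehicle according to the road surface geometry information.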
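  5. The method according to any one of claims 1-4, characterized in that the controlling the driving of a vehicle based at least on the perception information comprises: adjusting the driving path of the vehicle based at least on the perception information, the adjusted driving path not passing through the region in which the voxels of the obstacle are located.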
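  6. The method according to any one of claims 1-5, characterized in that the collected data comprises image data and point cloud data, and the perception detection network comprises an image feature extraction network, a point cloud feature extraction network, a feature fusion network, and an output network, wherein
    the image feature extraction network is used to extract the 3D image features of the image data;
    the point cloud feature extraction network is used to extract the point cloud features of the voxels corresponding to the point cloud data;
    the feature fusion network is used to fuse the 3D image features and the point cloud features of the voxels corresponding to the point cloud data to obtain the fused features of the voxels of the first scene;
    the output network is used to process the fused features of the voxels of the first scene and output the perception information.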
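  7. The method according to claim 6, characterized in that the method further comprises:
    inputting text query information and the fused features of the voxels of the obstacle to an attribute recognition network and outputting the category information of the obstacle, the text query information being used to request a category query;
    displaying the category information of the obstacle;
    wherein the fused features of the voxels of the obstacle are determined based on the corner information of the polygon box corresponding to the obstacle and the fused features of the voxels of the first scene, and the polygon box corresponding to the obstacle is associated with the voxels of the obstacle.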
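  8. The method according to claim 6 or 7, characterized in that the method further comprises:
    obtaining multiple planned paths of the vehicle;
    inputting the multiple planned paths of the vehicle and the fused features of the voxels of the first scene to a path evaluation network, and outputting the recommendation coefficients of the multiple planned paths and a recommended path among the multiple planned paths, the recommended path being associated with the recommendation coefficients of the multiple planned paths;
    displaying the recommended path.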
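  9. A system for intelligent driving, characterized in that the system comprises:
    a perception detection network, used to output perception information according to data collected by a sensor for a first scene, the perception information being used to indicate the voxels of an obstacle in the first scene, and the sensor comprising at least one of a camera and a radar;
    an attribute recognition network, used to output the category information of the obstacle according to text query information and the fused features of the voxels of the obstacle, wherein the fused features of the voxels of the obstacle are determined based on the corner information of the polygon box corresponding to the obstacle and the fused features of the voxels of the first scene, the polygon box corresponding to the obstacle is associated with the voxels of the obstacle, and the fused features of the voxels of the first scene are obtained by the perception detection network through temporal and/or spatial fusion based on at least one of the 3D image features and the voxel point cloud features extracted from the collected data;
    a path evaluation network, used to output, according to multiple planned paths and the fused features of the voxels of the first scene, the recommendation coefficients of the multiple planned paths and a recommended path among the multiple planned paths, the recommended path being associated with the recommendation coefficients of the multiple planned paths.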
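  10. The system according to claim 9, characterized in that the perception information comprises at least one of the following:
    the occupancy state of the voxels of the first scene, the velocity information of the voxels of the first scene, the visibility state of the voxels of the first scene, and the corner information of the polygon box corresponding to the obstacle;
    wherein the polygon box corresponding to the obstacle is associated with the voxels of the obstacle.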
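  11. The system according to claim 9 or 10, characterized in that the collected data comprises image data and point cloud data, and the perception detection network comprises an image feature extraction network, a point cloud feature extraction network, a feature fusion network, and an output network, wherein
    the image feature extraction network is used to extract the 3D image features of the image data;
    the point cloud feature extraction network is used to extract the point cloud features of the voxels corresponding to the point cloud data;
    the feature fusion network is used to fuse the 3D image features and the point cloud features of the voxels corresponding to the point cloud data to obtain the fused features of the voxels of the first scene;
    the output network is used to process the fused features of the voxels of the first scene and output the perception information.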
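  12. The system according to any one of claims 9-11, characterized in that the attribute recognition network comprises a text encoding network and an attribute decoding network, wherein
    the text encoding network is used to extract the word-vector features of the text query information;
    the attribute decoding network is used to output the category information of the obstacle according to the word-vector features and the fused features of the voxels of the obstacle.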
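  13. The system according to any one of claims 9-12, characterized in that the path evaluation network comprises a path encoding network, a feature interaction network, and an evaluation output network, wherein
    the path encoding network is used to extract the path features of each of the multiple planned paths;
    the feature interaction network is used to obtain the risk features of each planned path according to the path features of that planned path and the fused features of the voxels of the first scene;
    the evaluation output network is used to output, according to the risk features of the multiple planned paths, the recommendation coefficients of the multiple planned paths and the recommended path among the multiple planned paths.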
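  14. A device for intelligent driving, characterized in that the device comprises:
    a receiving unit, used to obtain data collected by a sensor for a first scene, the sensor comprising at least one of a camera and a radar;
    a processing unit, used to input the collected data to a perception detection network and output perception information, the perception information being used to indicate the voxels of an obstacle in the first scene;
    the processing unit being further used to control the driving of a vehicle based at least on the perception information.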
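  15. A computing device, characterized in that the computing device comprises a memory and a processor, the memory being used to store program instructions; when the processor executes the program instructions in the memory, the computing device performs the method according to any one of claims 1-8.
  16. A vehicle, characterized in that the vehicle comprises the system according to any one of claims 9-13, or comprises the device according to claim 14 or 15.
  17. A computer-readable storage medium, characterized in that the computer-readable storage medium stores program instructions, the program instructions being used to implement the method according to any one of claims 1-8.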
PCT/CN2023/100778 2023-06-16 2023-06-16 一种智能驾驶方法及装置 Ceased WO2024254861A1 (zh)

Priority Applications (6)

Application Number Priority Date Filing Date Title
CN202380088064.9A CN120418139A (zh) 2023-06-16 2023-06-16 一种智能驾驶方法及装置
PCT/CN2023/100778 WO2024254861A1 (zh) 2023-06-16 2023-06-16 一种智能驾驶方法及装置
AU2023456637A AU2023456637A1 (en) 2023-06-16 2023-06-16 Intelligent driving method and apparatus
KR1020267001236A KR20260023647A (ko) 2023-06-16 2023-06-16 지능형 주행 방법 및 장치
EP23941094.7A EP4714766A1 (en) 2023-06-16 2023-06-16 Intelligent driving method and apparatus
US19/419,891 US20260103189A1 (en) 2023-06-16 2025-12-15 Intelligent Driving Method and Apparatus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2023/100778 WO2024254861A1 (zh) 2023-06-16 2023-06-16 一种智能驾驶方法及装置

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US19/419,891 Continuation US20260103189A1 (en) 2023-06-16 2025-12-15 Intelligent Driving Method and Apparatus

Publications (1)

Publication Number Publication Date
WO2024254861A1 true WO2024254861A1 (zh) 2024-12-19

Family

ID=93851211

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/100778 Ceased WO2024254861A1 (zh) 2023-06-16 2023-06-16 一种智能驾驶方法及装置

Country Status (6)

Country Link
US (1) US20260103189A1 (zh)
EP (1) EP4714766A1 (zh)
KR (1) KR20260023647A (zh)
CN (1) CN120418139A (zh)
AU (1) AU2023456637A1 (zh)
WO (1) WO2024254861A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN120510419A (zh) * 2025-04-28 2025-08-19 合肥瑞徽人工智能研究院有限公司 一种用于自动驾驶传感器故障的3d目标检测方法
CN120922179A (zh) * 2025-10-15 2025-11-11 北京理工大学前沿技术研究院 一种矿用卡车自动驾驶控制方法和系统

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109145680A (zh) * 2017-06-16 2019-01-04 百度在线网络技术(北京)有限公司 一种获取障碍物信息的方法、装置、设备和计算机存储介质
CN111665852A (zh) * 2020-06-30 2020-09-15 中国第一汽车股份有限公司 一种障碍物避让方法、装置、车辆及存储介质
CN112703144A (zh) * 2020-12-21 2021-04-23 华为技术有限公司 控制方法、相关设备及计算机可读存储介质
CN113859267A (zh) * 2021-10-27 2021-12-31 广州小鹏自动驾驶科技有限公司 路径决策方法、装置及车辆
KR20230079855A (ko) * 2021-11-29 2023-06-07 주식회사 와이즈오토모티브 자율 주행 차량의 장애물 인지 성능 평가 장치 및 방법

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN120510419A (zh) * 2025-04-28 2025-08-19 合肥瑞徽人工智能研究院有限公司 一种用于自动驾驶传感器故障的3d目标检测方法
CN120510419B (zh) * 2025-04-28 2025-11-07 合肥瑞徽人工智能研究院有限公司 一种用于自动驾驶传感器故障的3d目标检测方法
CN120922179A (zh) * 2025-10-15 2025-11-11 北京理工大学前沿技术研究院 一种矿用卡车自动驾驶控制方法和系统

Also Published As

Publication number Publication date
US20260103189A1 (en) 2026-04-16
AU2023456637A1 (en) 2026-01-08
CN120418139A (zh) 2025-08-01
EP4714766A1 (en) 2026-03-25
KR20260023647A (ko) 2026-02-20

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23941094

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 202380088064.9

Country of ref document: CN

WWP Wipo information: published in national office

Ref document number: 202380088064.9

Country of ref document: CN

WWE Wipo information: entry into national phase

Ref document number: AU2023456637

Country of ref document: AU

WWE Wipo information: entry into national phase

Ref document number: 2023941094

Country of ref document: EP

ENP Entry into the national phase

Ref document number: 2023456637

Country of ref document: AU

Date of ref document: 20230616

Kind code of ref document: A

ENP Entry into the national phase

Ref document number: 2023941094

Country of ref document: EP

Effective date: 20251217

ENP Entry into the national phase

Ref document number: 1020267001236

Country of ref document: KR

Free format text: ST27 STATUS EVENT CODE: A-0-1-A10-A15-NAP-PA0105 (AS PROVIDED BY THE NATIONAL OFFICE)

WWE Wipo information: entry into national phase

Ref document number: 1020267001236

Country of ref document: KR

NENP Non-entry into the national phase

Ref country code: DE

WWP Wipo information: published in national office

Ref document number: 1020267001236

Country of ref document: KR

WWP Wipo information: published in national office

Ref document number: 2023941094

Country of ref document: EP