WO2022134996A1

WO2022134996A1 - Lane line detection method based on deep learning, and apparatus

Info

Publication number: WO2022134996A1
Application number: PCT/CN2021/132554
Authority: WO
Inventors: Xuefeng Yang
Original assignee: Zhejiang Dahua Technology Co Ltd
Current assignee: Zhejiang Dahua Technology Co Ltd
Priority date: 2020-12-25
Filing date: 2021-11-23
Publication date: 2022-06-30
Anticipated expiration: 2023-06-25
Also published as: CN112287912B; EP4252148B1; CN112287912A; EP4252148A4; EP4252148A1

Abstract

Disclosed are a lane line detection method based on deep learning, an apparatus, a storage medium, and an electronic device. The method includes: obtaining a first picture that is planned to be detected; obtaining a target feature map by inputting the first picture into a target neural network model; wherein the target neural network model includes a neural network model generated based on a multi-scale attention mechanism and a deep separable convolution model, and the target feature map is configured to represent a probability of each pixel in the first picture being a lane line pixel; and obtaining a target detection result by performing an image post-processing on the target feature map; wherein the target detection result is configured to indicate a detected lane line in the first picture.

Description

LANE LINE DETECTION METHOD BASED ON DEEP LEARNING, AND APPARATUS

CROSS REFERENCE

The present application claims foreign priority of China Patent Application No. 202011555482.9 filed on December 25, 2020, in the China National Intellectual Property Administration, the entire contents of which are hereby incorporated by reference.

TECHNICAL FIELD

The present disclosure relates to the field of communication technologies, and in particular to a lane line detection method based on deep learning, an apparatus, a storage medium, and an electronic device.

BACKGROUND

With the rapid development of technology and the economy, more and more cars are on the road, which facilitates travel but also brings more traffic accidents and traffic congestion. Autonomous driving and intelligent assisted driving systems can help a driver process most road information, provide precise guidance for the driver, and reduce the probability of traffic accidents. An intelligent transportation system can identify the number of vehicles in a lane, determine whether the lane is congested, plan a more reasonable travel route for the driver, and alleviate traffic congestion. In autonomous driving, intelligent assisted driving, and intelligent transportation systems, vision-based lane line detection is critical: it is the basis and core technology for realizing lane departure warning and lane congestion warning.

In the current related technologies, the method for lane line detection requires a large number of training samples and a complex neural network model, resulting in a technical problem that the efficiency of detecting lane lines is very low.

Aiming at the technical problem of low efficiency of detecting lane lines in the related technologies, no effective solution has been proposed yet.

SUMMARY OF THE DISCLOSURE

The present disclosure provides a lane line detection method based on deep learning, an apparatus, a storage medium, and an electronic device, to solve the technical problem of low detection accuracy of lane lines in the related art.

In a first aspect, the present disclosure provides a lane line detection method based on deep learning, comprising: obtaining a first picture that is planned to be detected; obtaining a target feature map by inputting the first picture into a target neural network model; wherein the target neural network model comprises a neural network model generated based on a multi-scale attention mechanism and a deep separable convolution model, and the target feature map is configured to represent a probability of each pixel in the first picture being a lane line pixel; and obtaining a target detection result by performing an image post-processing on the target feature map; wherein the target detection result is configured to indicate a detected lane line in the first picture.

In a second aspect, the present disclosure provides a lane line detection apparatus based on deep learning, comprising: an obtaining module, configured to obtain a first picture that is planned to be detected; a first processing module, configured to obtain a target feature map by inputting the first picture into a target neural network model; wherein the target neural network model comprises a neural network model generated based on a multi-scale attention mechanism and a deep separable convolution model, and the target feature map is configured to represent a probability of each pixel in the first picture being a lane line pixel; and a second processing module, configured to obtain a target detection result by performing an image post-processing on the target feature map; wherein the target detection result is configured to indicate a detected lane line in the first picture.

In a third aspect, the present disclosure provides a computer-readable storage medium, storing a computer program; wherein the computer program is configured to perform the method as described above when executed.

In a fourth aspect, the present disclosure provides an electronic device, comprising a processor, a memory, and a computer program stored in the memory and executable on the processor; wherein the processor is configured to perform the method as described above when executing the computer program.

In the present disclosure, a first picture that is planned to be detected is obtained; a target feature map is obtained by inputting the first picture into a target neural network model; wherein the target neural network model includes a neural network model generated based on a multi-scale attention mechanism and a deep separable convolution model, and the target feature map is configured to represent a probability of each pixel in the first picture being a lane line pixel; and a target detection result is obtained by performing an image post-processing on the target feature map; wherein the target detection result is configured to indicate a detected lane line in the first picture. Therefore, the problem of low detection accuracy of lane lines in the related art may be solved, thereby increasing the detection efficiency of lane lines, increasing the detection accuracy of lane lines, and reducing the detection cost.

BRIEF DESCRIPTION OF THE DRAWINGS

The drawings described here are to provide a further understanding of the present disclosure and constitute a part of the present disclosure. The exemplary embodiments and their descriptions of the present disclosure are to explain the present disclosure, and do not constitute an improper limitation of the present disclosure.

FIG. 1 is a block view of a hardware structure of a mobile terminal performing a lane line detection method based on deep learning according to an embodiment of the present disclosure.

FIG. 2 is a flowchart of a lane line detection method based on deep learning according to an embodiment of the present disclosure.

FIG. 3 is a schematic view of a lane line detection method based on deep learning according to an embodiment of the present disclosure.

FIG. 4 is a schematic view of a lane line detection method based on deep learning according to another embodiment of the present disclosure.

FIG. 4a is a schematic view of a lane line detection method based on deep learning according to further another embodiment of the present disclosure.

FIG. 4b is a schematic view of a lane line detection method based on deep learning according to further another embodiment of the present disclosure.

FIG. 4c is a schematic view of a lane line detection method based on deep learning according to further another embodiment of the present disclosure.

FIG. 5 is a schematic view of a lane line detection method based on deep learning according to further another embodiment of the present disclosure.

FIG. 6 is a schematic view of a lane line detection method based on deep learning according to further another embodiment of the present disclosure.

FIG. 7 is a schematic view of a lane line detection method based on deep learning according to further another embodiment of the present disclosure.

FIG. 8 is a schematic view of a lane line detection method based on deep learning according to further another embodiment of the present disclosure.

FIG. 9 is a schematic view of a lane line detection method based on deep learning according to further another embodiment of the present disclosure.

FIG. 10 is a structural block view of a lane line detection apparatus based on deep learning according to an embodiment of the present disclosure.

DETAILED DESCRIPTION

The embodiments of the present disclosure will be described in detail with reference to the drawings and in conjunction with the embodiments.

It should be noted that the terms “first” and “second” in the specification and claims of the present disclosure and the drawings are used to distinguish similar objects, and not necessarily to describe a specific sequence or order.

The method embodiments provided in the embodiments of the present application may be executed in a mobile terminal, a computer terminal or a similar computing device. Taking running on a mobile terminal as an example, FIG. 1 is a block view of a hardware structure of a mobile terminal performing a lane line detection method based on deep learning according to an embodiment of the present disclosure. As shown in FIG. 1, the mobile terminal may include one or more processors 102 (only one is shown in FIG. 1; the processor 102 may include, but is not limited to, a processing device such as a microprocessor MCU or a programmable logic device FPGA) and a memory 104 for storing data. The mobile terminal may also include a transmission device 106 and an input/output device 108 for communication functions. Those skilled in the art can understand that the structure shown in FIG. 1 is only for illustration and does not limit the structure of the mobile terminal. For example, the mobile terminal may also include more or fewer components than shown in FIG. 1, or have a different configuration from that shown in FIG. 1.

The memory 104 may be configured to store computer programs, for example, software programs and modules of application software. Specifically, the memory may store computer programs corresponding to the lane line detection method based on deep learning in the embodiments of the present disclosure. The processor 102 runs the computer programs stored in the memory 104 to perform various functional applications and data processing, that is, to achieve the above method. The memory 104 may include a high-speed random access memory, and may also include a non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include a memory remotely arranged relative to the processor 102, and the remote memory may be connected to the mobile terminal through a network. Examples of the network include, but are not limited to, the Internet, corporate intranet, local area network, mobile communication network, and any combination thereof.

The transmission device 106 is configured to receive or send data via a network. Specific examples of the network may include a wireless network provided by a communication provider of the mobile terminal. In some examples, the transmission device 106 includes a network interface controller (NIC), which can be connected to other network devices through a base station to communicate with the Internet. In some examples, the transmission device 106 may be a radio frequency (RF) module configured to communicate with the Internet in a wireless manner.

In the embodiments, a lane line detection method based on deep learning running on a mobile terminal, a computer terminal or a similar computing device is provided. FIG. 2 is a flowchart of a lane line detection method based on deep learning according to an embodiment of the present disclosure. The method may include the operations at the following blocks.

At block S202: A first picture that is planned to be detected is obtained.

At block S204: A target feature map is obtained by inputting the first picture into a target neural network model; wherein the target neural network model includes a neural network model generated based on a multi-scale attention mechanism and a deep separable convolution model, and the target feature map is configured to represent a probability of each pixel in the first picture being a lane line pixel.

At block S206: A target detection result is obtained by performing an image post-processing on the target feature map; wherein the target detection result is configured to indicate a detected lane line in the first picture.

In some embodiments, the first picture may include, but is not limited to, a picture collected by an image or video capture device in an automatic driving system, and/or an intelligent assisted driving system, and/or an intelligent transportation system. The first picture may also include but is not limited to a picture pre-stored in a database, or collected through other methods.

In some embodiments, the target neural network model includes, but is not limited to, a convolutional neural network model, a recurrent neural network model, and a combination of one or more neural network models. Specifically, the target neural network model includes, but is not limited to, the neural network model generated based on the multi-scale attention mechanism and the deep separable convolution model.

In some embodiments, the target feature map includes, but is not limited to, a picture generated after feature information is extracted from the first picture. The value of each pixel in the target feature map may include, but is not limited to, a probability of the corresponding position in the first picture being a target object. The value of each pixel in the target feature map may also include, but is not limited to, the probability of the corresponding pixel in the first picture being the lane line pixel or a probability of the corresponding pixel in the first picture being an image background pixel.

It should be noted that the lane line pixel and the image background pixel may include, but are not limited to, two target objects that are mutually exclusive. In other words, when a pixel is not configured to represent a lane line, the pixel is configured to represent an image background.

In some embodiments, a marked picture may include, but is not limited to, a picture collected by an image or video capture device in an automatic driving system, and/or an intelligent assisted driving system, and/or an intelligent transportation system. The marked picture may also include but is not limited to a picture pre-stored in a database, or collected through other methods.

In some embodiments, the image post-processing may include, but is not limited to, normalization, image smoothing, image sharpening, image dilation, image erosion, another image processing method, or a combination thereof.

The foregoing is only an exemplary description, and the embodiments do not make any specific limitations.

Through the present disclosure, a first picture that is planned to be detected is obtained, and a target feature map is obtained by inputting the first picture into a target neural network model. The target neural network model is a model obtained by training a to-be-trained initial neural network model with a set of marked pictures. Each marked picture includes a marked image background pixel and a marked lane line pixel. The target feature map is configured to represent a probability of each pixel in the first picture being a lane line pixel, and the target neural network model is configured to determine a distribution of the probability in the first picture. A target detection result is obtained by performing image post-processing on the target feature map. The target detection result is configured to indicate a detected lane line in the first picture. Therefore, the technical problem of low detection efficiency of lane lines in related technologies may be solved, and the technical effects of improving the detection efficiency of lane lines, increasing the detection accuracy of lane lines, and reducing detection costs may be achieved.

In some embodiments, the obtaining the target feature map by inputting the first picture into the target neural network model includes: obtaining a first feature map by inputting the first picture into a first convolutional layer; obtaining a second feature map by inputting the first feature map into a second convolutional layer, wherein the target neural network model includes the second convolutional layer, and the second convolutional layer is configured to increase a weight of a preset region in the first feature map based on the multi-scale attention mechanism; and obtaining the target feature map by performing a first preset processing on the second feature map, wherein the first preset processing includes an up-sampling operation.

In some embodiments, the first preset processing may include, but is not limited to, the up-sampling operation, a feature extraction operation, and other processing methods for generating the target feature map from the second feature map.

In some embodiments, the target neural network model includes the first convolutional layer, and the first convolutional layer is configured to adjust the resolution of the first picture and to extract the lane line pixel of the first picture after the resolution adjustment. The first convolutional layer may also include, but is not limited to, a convolutional neural network for feature extraction, for example, a lightweight convolutional neural network. The lightweight convolutional neural network may perform operations including, but not limited to, a convolution operation with a convolutional kernel of size 1×1, 3×3, 5×5, 7×7, or another size. The convolution operation may be performed with a preset sliding step size. For example, the step size may be 1, 2, etc.

In some embodiments, the obtaining the first feature map by inputting the first picture into the first convolutional layer includes: obtaining the first feature map by inputting the first picture with a resolution less than or equal to a first preset resolution into the first convolutional layer.

In some embodiments, the first preset resolution may be manually set by an operator, and may also be set flexibly based on a computational power of the target neural network model or an original resolution of an image or video stream to which the first picture corresponds.

In some embodiments, the obtaining the first feature map by inputting the first picture with a resolution less than or equal to the first preset resolution may include, but is not limited to, inputting a picture with a lower resolution into the first convolutional layer to achieve the technical effect of improving the efficiency of feature extraction.

In some embodiments, the obtaining the first feature map by inputting the first picture into the first convolutional layer includes: obtaining a first sub-feature map by inputting the first picture into a first sub-convolutional layer, wherein the first sub-convolutional layer is configured to perform a convolution operation, and the first convolutional layer includes the first sub-convolutional layer; obtaining a second sub-feature map by inputting the first sub-feature map into a second sub-convolutional layer, wherein the second sub-convolutional layer is configured to perform a depth separable convolution operation to extract feature information of the first sub-feature map, and the first convolutional layer includes the second sub-convolutional layer; in response to a resolution of the second sub-feature map being greater than a second preset resolution, reducing a resolution of the first sub-feature map by re-inputting the second sub-feature map into the first sub-convolutional layer; in response to the resolution of the second sub-feature map being less than or equal to the second preset resolution, determining the second sub-feature map to be the first feature map.

In some embodiments, the first sub-convolutional layer is configured to reduce the resolution of a feature map to improve the efficiency of feature extraction, and the second sub-convolutional layer is configured to perform the depth separable convolution operation to extract the feature information in the first sub-feature map. The second preset resolution may be the same as or different from the first preset resolution.
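
The loop described above can be illustrated by tracking only the feature map resolution. The following is a minimal sketch, not the disclosed implementation: the kernel size, stride, and padding of the first sub-convolutional layer are hypothetical values, and the second sub-convolutional layer is assumed to preserve resolution.

```python
def conv_output_size(size, kernel, stride, padding):
    """Spatial size after one strided convolution (standard formula)."""
    return (size + 2 * padding - kernel) // stride + 1


def downsample_until(height, width, max_height, max_width,
                     kernel=3, stride=2, padding=1):
    """Repeat the two sub-layers until the resolution is small enough.

    kernel, stride and padding are hypothetical settings for the first
    sub-convolutional layer; the source does not specify them.
    Returns the resolution after each pass through both sub-layers.
    """
    stages = []
    while True:
        # First sub-convolutional layer: reduces the resolution.
        height = conv_output_size(height, kernel, stride, padding)
        width = conv_output_size(width, kernel, stride, padding)
        # Second sub-convolutional layer: depth separable convolution,
        # assumed here to keep the resolution unchanged.
        stages.append((height, width))
        # Stop once the second sub-feature map is at or below the
        # second preset resolution; otherwise re-input it.
        if height <= max_height and width <= max_width:
            return stages


stages = downsample_until(720, 1280, max_height=100, max_width=200)
```

With these assumed settings, a 720×1280 input passes through the two sub-layers three times before the second sub-feature map is taken as the first feature map.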

In some embodiments, the obtaining the second sub-feature map by inputting the first sub-feature map into the second sub-convolutional layer includes: obtaining the second sub-feature map by extracting the feature information of the first sub-feature map with a 1×n convolutional kernel and by extracting the feature information of the first sub-feature map with an n×1 convolutional kernel, wherein the n is a positive odd number greater than 1.
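
To illustrate why a 1×n convolution followed by an n×1 convolution covers the same n×n neighbourhood as a single n×n kernel, the sketch below pushes a single non-zero pixel through both one-dimensional convolutions. It is an illustration only: the all-ones kernels, the zero padding, and the function names are assumptions, not part of the disclosure.

```python
import numpy as np


def conv_rows(image, kernel):
    """Apply a 1×n convolution: convolve every row with a 1-D kernel."""
    return np.apply_along_axis(
        lambda row: np.convolve(row, kernel, mode="same"), 1, image)


def conv_cols(image, kernel):
    """Apply an n×1 convolution: convolve every column with a 1-D kernel."""
    return np.apply_along_axis(
        lambda col: np.convolve(col, kernel, mode="same"), 0, image)


def factorized_response(size, n):
    """Response of 1×n then n×1 all-ones kernels to a centred impulse."""
    impulse = np.zeros((size, size))
    impulse[size // 2, size // 2] = 1.0
    kernel = np.ones(n)  # hypothetical weights, chosen to expose the footprint
    return conv_cols(conv_rows(impulse, kernel), kernel)


response = factorized_response(size=9, n=5)
footprint = int(np.count_nonzero(response))  # number of positions reached
```

The non-zero footprint is an n×n square centred on the impulse, which is the receptive field an n×n kernel would have.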

In some embodiments, the first convolutional layer may include, but is not limited to, a special combination of a traditional convolutional neural network and a lightweight convolutional neural network. At the bottom of the target neural network, because the resolution of the input image is relatively large, usually only a smaller convolutional kernel can be used, which gives the bottom convolutional kernel a smaller receptive field, whereas a larger convolutional kernel leads to excessive computational cost. By adopting the depth separable convolution with a large convolutional kernel in the second sub-convolutional layer of the first convolutional layer, and by using 1×n and n×1 convolutions instead of n×n convolutions, the computational cost of the neural network model is reduced while the receptive field of the target neural network model is increased. For an n×n convolutional kernel, assuming that the number of input channels is c1 and the number of output channels is c2, the number of parameters of the convolutional layer is n×n×c1×c2. When the two convolutions of 1×n and n×1 are used instead, the number of parameters of the convolutional layer is 2×n×c1×c2, so the number of parameters is reduced by (n×n−2×n)×c1×c2. Therefore, the greater the value of n, the more obvious the reduction in the number of parameters, while the receptive field remains unchanged.
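
The parameter counts stated above can be checked numerically. The sketch below simply evaluates the two formulas from the description (bias terms are ignored, as in the text); the function names and the example values of n, c1, and c2 are hypothetical.

```python
def params_full(n, c1, c2):
    """Parameters of one n×n convolution: n×n×c1×c2."""
    return n * n * c1 * c2


def params_factorized(n, c1, c2):
    """Parameters of the 1×n plus n×1 pair, as counted in the text: 2×n×c1×c2."""
    return 2 * n * c1 * c2


def params_saved(n, c1, c2):
    """Reduction stated in the text: (n×n − 2×n)×c1×c2."""
    return (n * n - 2 * n) * c1 * c2


# Example values are illustrative only.
for n in (3, 5, 7):
    full = params_full(n, 32, 64)
    fact = params_factorized(n, 32, 64)
    print(n, full, fact, full - fact)
```

For n = 3 the factorized pair needs two thirds of the parameters of the full kernel, and for n = 7 only two sevenths, which matches the statement that the saving grows with n.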

In some embodiments, the obtaining the second feature map by inputting the first feature map into the second convolutional layer includes: obtaining a first type feature map and a plurality of second type feature maps by inputting the first feature map into the second convolutional layer and by performing a convolution operation on the first feature map with a plurality of convolutional kernels of different sizes included in the second convolution layer; determining a plurality of third type feature maps from the plurality of second type feature maps through a preset statistical method, wherein the plurality of third type feature maps are configured to be performed with a second preset processing such that sizes of the plurality of third type feature maps match each other, the plurality of third type feature maps are configured to be performed with a third preset processing to obtain an attention feature map, and a size of the attention feature map matches a size of the first type feature map; and obtaining the second feature map by performing the third preset processing on the attention feature map and the first type feature map.

In some embodiments, the plurality of second type feature maps may include, but are not limited to, second type feature maps obtained by performing a convolution operation on the first feature map with a plurality of convolutional kernels of different sizes.

In some embodiments, the preset statistical method may include, but is not limited to, norm formulas in statistics. Specifically, the preset statistical method may include, but is not limited to, vector norms, matrix norms, etc.

In some embodiments, the second preset processing may include, but is not limited to, calling a function to adjust the number of channels of a feature map, for example, a reshape operation to adjust the number of rows or columns of a feature vector corresponding to the feature map.

In some embodiments, the obtaining the second feature map by inputting the first feature map into the second convolutional layer may also include, but is not limited to, increasing a weight of an important region in the first feature map by the multi-scale attention mechanism in a region close to an output layer of the neural network model (for example, the weight of the important region in the first feature map includes but is not limited to a weight of a related region at which a lane line is prone to appear), thereby improving the detection accuracy of the neural network model.

Specifically, assume that the resolution of the first feature map is w×h and the number of channels is c. First, convolution operations are performed on the first feature map with convolutional kernels including, but not limited to, three convolutional kernels of different scales (1×1, 3×3, and 5×5, corresponding to the aforementioned plurality of convolutional kernels of different sizes). The different convolutional kernels are used to fuse information of lane line elements at different receptive field scales. Three feature maps are output: two second type feature maps and one first type feature map. The resolution remains unchanged, and the numbers of channels are 0.5×c, 0.5×c, and c, respectively. The 3×3 and 5×5 second type feature maps are configured to calculate a correlation between elements to determine the importance of each pixel position in the feature map for global inference. Then the element value at each position is counted across the channel dimension using a statistical method. A calculation formula may be as follows.

where x is the counted element value, and x_i denotes the element values on different channels at the same location. After the calculation, the resolution of the output feature map remains unchanged, and the number of channels becomes 1. Then, operations may include, but are not limited to: processing the output feature maps into (w×h)×1 and 1×(w×h) (corresponding to the aforementioned third type feature maps) through the reshape function (corresponding to the second preset processing); obtaining a matrix with a dimension of (w×h)×(w×h) through matrix multiplication (corresponding to the third preset processing) of the (w×h)×1 and 1×(w×h) feature maps; obtaining the attention feature map by performing a softmax operation on the matrix; processing the feature map generated by the 1×1 convolutional kernel (corresponding to the aforementioned first type feature map) into a dimension of c×(w×h) through the reshape function; obtaining another matrix with a dimension of c×(w×h) by performing matrix multiplication of the processed feature map and the attention feature map; processing the another matrix into a c×w×h feature map through the reshape function; and taking the resulting c×w×h feature map as the second feature map.
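
The reshape and matrix multiplication steps above may be sketched as follows. This is a hedged illustration under several assumptions that are not confirmed by the text: the three convolution outputs are taken as given inputs rather than computed, the channel statistic is assumed to be an L2 norm (the description only names vector and matrix norms as candidates), and the softmax is assumed to run over the last axis of the (w×h)×(w×h) matrix. All function names are hypothetical.

```python
import numpy as np


def channel_statistic(fmap):
    """Collapse (channels, h, w) to (h, w). L2 norm is an ASSUMED choice."""
    return np.sqrt((fmap ** 2).sum(axis=0))


def softmax(matrix, axis=-1):
    """Numerically stable softmax along one axis."""
    shifted = matrix - matrix.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def multi_scale_attention(first_type, second_type_a, second_type_b):
    """first_type: (c, h, w) map from the 1×1 kernel.
    second_type_a / second_type_b: (c/2, h, w) maps from the 3×3 and 5×5 kernels.
    """
    c, h, w = first_type.shape
    # Third type feature maps: one column vector and one row vector.
    col = channel_statistic(second_type_a).reshape(h * w, 1)
    row = channel_statistic(second_type_b).reshape(1, h * w)
    # (w×h)×(w×h) matrix, then softmax (axis is an assumption).
    attention = softmax(col @ row, axis=-1)
    # c×(w×h) times (w×h)×(w×h), reshaped back to c×h×w.
    out = first_type.reshape(c, h * w) @ attention
    return out.reshape(c, h, w), attention


rng = np.random.default_rng(0)
second_map, attention_map = multi_scale_attention(
    rng.normal(size=(4, 3, 5)),
    rng.normal(size=(2, 3, 5)),
    rng.normal(size=(2, 3, 5)),
)
```

Note that the attention matrix grows with the square of the number of pixel positions, which is consistent with the description applying this mechanism close to the output layer, where the feature map resolution is small.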

It should be noted that, in the middle and top layers of the target neural network, the dimension of the output feature is greatly reduced compared to the bottom of the target neural network. Ordinary convolution is used for feature extraction, and a larger convolutional kernel is adopted to maintain the receptive field. 1×n and n×1 convolutions are adopted instead of an n×n convolution to reduce the computational cost of the network.

Through the embodiments, the target neural network generated based on the attention mechanism enables the output target feature map to reflect more effectively and accurately the pixel at each corresponding position in the input image, thereby improving the detection accuracy of lane lines.

In some embodiments, before the obtaining the target feature map by inputting the first picture into the target neural network model, the method further includes: obtaining a first sample picture and a label picture corresponding to the first sample picture; and obtaining the target neural network model by training a to-be-trained initial neural network model with the first sample picture and the label picture, wherein the target neural network model is configured to be trained through a loss function as follows.

loss = C(l_p, l_t) + αC(b_p, b_t)

where C represents a cross entropy loss function, l_p represents the lane line pixel obtained by inputting the first sample picture into the to-be-trained initial neural network model, l_t represents the lane line pixel marked by the label picture, b_p represents the image background pixel obtained by inputting the first sample picture into the to-be-trained initial neural network model, b_t represents the image background pixel marked by the label picture, and α is a preset parameter greater than 0.

In some embodiments, since the number of the image background pixels is usually far greater than the number of the lane line pixels, the preset α is configured to alleviate the impact of category imbalance. During the training process, the α may be set to be 0.1.
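
A minimal sketch of the weighted loss is given below. It assumes a two-class, per-pixel form in which the background probability is one minus the lane line probability and each cross entropy term is averaged over all pixels; the description does not fix these details, and the function name is hypothetical.

```python
import numpy as np


def lane_background_loss(lane_prob, lane_label, alpha=0.1, eps=1e-7):
    """loss = C(l_p, l_t) + alpha * C(b_p, b_t), in an ASSUMED binary form.

    lane_prob:  predicted probability of each pixel being a lane line pixel.
    lane_label: 1 for marked lane line pixels, 0 for marked background pixels.
    alpha:      down-weights the far more numerous background pixels.
    """
    prob = np.clip(lane_prob, eps, 1.0 - eps)  # avoid log(0)
    lane_term = -(lane_label * np.log(prob)).mean()
    background_term = -((1 - lane_label) * np.log(1.0 - prob)).mean()
    return lane_term + alpha * background_term


# An all-background label with an uninformative prediction of 0.5:
# only the background term contributes, scaled by alpha.
loss_value = lane_background_loss(np.full((4, 4), 0.5), np.zeros((4, 4)))
```

With α = 0.1, an error on a background pixel costs one tenth of the same error on a lane line pixel, which is how the parameter offsets the category imbalance.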

In some embodiments, during the training process of the initial neural network model, operations may include, but are not limited to, configuring an initial learning rate to a preset value, for example, 0.01; configuring a learning rate attenuation strategy to a preset strategy, for example, every 10,000 iterations, a learning rate is multiplied by 0.1; and configuring a total number of iterations to another preset value, for example, 60,000 iterations.
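
The example attenuation strategy above corresponds to a step decay schedule. The sketch below uses the example values from the description; the function name is hypothetical.

```python
def learning_rate(iteration, initial_lr=0.01, decay_every=10_000, factor=0.1):
    """Step decay: multiply the learning rate by `factor` every `decay_every` iterations."""
    return initial_lr * factor ** (iteration // decay_every)


TOTAL_ITERATIONS = 60_000  # example total from the description

# Learning rate at the start of each 10,000-iteration stage.
schedule = [learning_rate(i) for i in range(0, TOTAL_ITERATIONS, 10_000)]
```

Over 60,000 iterations the learning rate passes through six stages, from 0.01 in the first stage down to roughly 1e-7 in the last.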

In some embodiments, the obtaining the target detection result by performing the image post-processing on the target feature map includes: obtaining a lane line result segmentation map by performing a binarization processing on the target feature map; obtaining a processed lane line result segmentation map containing a plurality of connected domains by preprocessing the lane line result segmentation map with an image erosion operation and an image dilation operation; obtaining a target detection result by deleting at least one of the plurality of connected domains that does not meet a preset condition and fitting the remaining connected domains that meet the preset condition to obtain a fitted connected domain, wherein the fitted connected domain included in the target detection result represents a detected lane line in the first picture.

In some embodiments, the binarization processing may include, but is not limited to, obtaining the lane line result segmentation map by marking an original image based on the target feature map; wherein a part predicted to be the lane line pixel is marked as 1, and another part predicted not to be a lane line, that is, to be the background pixel, is marked as 0.
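
The binarization followed by erosion and dilation may be sketched as below. The 0.5 threshold, the 3×3 structuring element, and the treatment of positions outside the image as background are all assumptions chosen for illustration; the description does not specify them.

```python
import numpy as np


def binarize(prob_map, threshold=0.5):
    """Mark predicted lane line pixels as 1 and background pixels as 0."""
    return (prob_map >= threshold).astype(np.uint8)


def _neighbourhoods(mask):
    """Stack the nine 3×3-shifted copies of the mask (outside = background)."""
    padded = np.pad(mask, 1, mode="constant", constant_values=0)
    h, w = mask.shape
    return np.stack([padded[dy:dy + h, dx:dx + w]
                     for dy in range(3) for dx in range(3)])


def erode(mask):
    """A pixel survives only if its whole 3×3 neighbourhood is foreground."""
    return _neighbourhoods(mask).min(axis=0)


def dilate(mask):
    """A pixel is set if any pixel in its 3×3 neighbourhood is foreground."""
    return _neighbourhoods(mask).max(axis=0)


# A 3×3 lane-like block plus one isolated noisy pixel.
prob = np.zeros((7, 7))
prob[2:5, 2:5] = 0.9
prob[0, 6] = 0.9
segmentation = binarize(prob)
cleaned = dilate(erode(segmentation))
```

Erosion removes the isolated pixel, and the subsequent dilation restores the surviving block to its original extent.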

It should be noted that FIG. 3 is a schematic view of a lane line detection method based on deep learning according to an embodiment of the present disclosure. As shown in FIG. 3, the lane line pixel may be filtered by performing operations including but not limited to the following operations.

Operation S1: A width and a height of a circumscribed rectangle of each connected domain 302 are calculated. When the width and the height are each less than a corresponding threshold, the connected domain is unlikely to be a lane line pixel position, and the connected domain is directly deleted. The scene before the lane line pixels are filtered may be as shown in FIG. 4a, and the result of the filtering may be as shown in FIG. 4b.

Operation S2: On the processed probability map, each connected domain may be marked with a distinct numeric ID through the scikit-image (skimage) library, with each ID representing a different lane line. The result after this processing may be as shown in FIG. 4c, in which lane line pixels shown with different brightness belong to different lane lines.
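Operations S1 and S2 may be sketched with scikit-image as follows. This is an illustrative sketch only: the function name filter_and_label, the threshold values, and the use of 8-connectivity are assumptions, and the connected domains are labeled once beforehand so that their circumscribed rectangles can be measured.

```python
import numpy as np
from skimage.measure import label, regionprops


def filter_and_label(mask, min_width=3, min_height=3):
    """Delete small connected domains, then give each remaining domain its own ID."""
    labeled = label(mask, connectivity=2)  # 8-connected domains, 0 = background
    kept = np.zeros_like(mask, dtype=np.uint8)
    for region in regionprops(labeled):
        min_row, min_col, max_row, max_col = region.bbox  # max values are exclusive
        height, width = max_row - min_row, max_col - min_col
        # Operation S1: delete only when width and height are each below threshold.
        if width < min_width and height < min_height:
            continue
        kept[labeled == region.label] = 1
    # Operation S2: one integer ID per remaining connected domain.
    return label(kept, connectivity=2)


mask = np.zeros((10, 10), dtype=np.uint8)
mask[1:9, 2] = 1   # tall, thin domain (height 8): kept
mask[5, 7] = 1     # isolated pixel: deleted
ids = filter_and_label(mask)
```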

Operation S3: FIG. 5 is a schematic view of a lane line detection method based on deep learning according to yet another embodiment of the present disclosure. As shown in FIG. 5, after the connected domains of different IDs are obtained, a function y = kx + b is applied to fit each connected domain separately, where k is the slope and b is the intercept, thereby obtaining a lane line model corresponding to each connected domain as a candidate lane line 502.
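The fitting in Operation S3 may be sketched, for example, with a least-squares fit. The disclosure does not name a fitting routine, so the use of numpy.polyfit is an assumption, as are the function name fit_lane_lines and the convention that image columns are x and image rows are y.

```python
import numpy as np


def fit_lane_lines(labeled):
    """Return {domain_id: (k, b)} from a least-squares fit of y = k*x + b."""
    lines = {}
    for domain_id in np.unique(labeled):
        if domain_id == 0:  # 0 is assumed to be the background ID
            continue
        ys, xs = np.nonzero(labeled == domain_id)  # rows are y, columns are x
        if np.unique(xs).size < 2:
            # A perfectly vertical domain has no finite slope in y = k*x + b.
            continue
        k, b = np.polyfit(xs, ys, deg=1)
        lines[int(domain_id)] = (float(k), float(b))
    return lines


labeled = np.zeros((8, 4), dtype=int)
for x in range(4):
    labeled[2 * x + 1, x] = 1  # pixels lying on y = 2x + 1
candidates = fit_lane_lines(labeled)
```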

For example, due to the perspective effect, when parallel lane lines in a three-dimensional space are projected into a two-dimensional space, an angle between each lane line and a horizontal direction will be within a certain range. FIG. 6 is a schematic view of a lane line detection method based on deep learning according to yet another embodiment of the present disclosure. As shown in FIG. 6, the range of the angle needs to be determined according to the camera’s installation position and focal length parameters. Then, according to the slope of each candidate lane line, the candidate lane lines whose angle is not within the range are further filtered out.
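The angle-based filtering may be sketched as follows. The angle range of 20 to 80 degrees is a placeholder only, since the actual range depends on the camera installation position and focal length; the function name filter_by_angle and the use of the absolute angle are likewise assumptions.

```python
import math


def filter_by_angle(lines, min_deg, max_deg):
    """Keep candidate lines whose angle to the horizontal lies in [min_deg, max_deg]."""
    kept = {}
    for line_id, (k, b) in lines.items():
        # atan(k) gives the signed angle of y = k*x + b to the horizontal axis.
        angle = abs(math.degrees(math.atan(k)))
        if min_deg <= angle <= max_deg:
            kept[line_id] = (k, b)
    return kept


candidates = {1: (1.0, 0.0), 2: (0.05, 40.0)}  # about 45 degrees and about 3 degrees
remaining = filter_by_angle(candidates, min_deg=20.0, max_deg=80.0)
```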

In some embodiments, the preprocessing of the image erosion operation and the image dilation operation may include but is not limited to one or more operations to obtain the target feature map containing the plurality of connected domains. The preset condition may include but is not limited to a determination based on the size, length and other shape characteristics of the connected domains. For example, by setting a preset length threshold and a preset width threshold, a connected domain is deleted in response to the length of the connected domain being less than the preset length threshold and/or the width of the connected domain being less than the preset width threshold, and the target feature map containing the connected domains that meet the preset condition is retained to determine the final target detection result.

For example, in some complex road scenes, the lane line features recorded in the target feature map output by the aforementioned target neural network model may show some discontinuities, as shown in FIG. 4a. Scattered, small-area lane line prediction pixels are likely to be false detections and may be eliminated using, for example but not limited to, the image erosion operation in OpenCV. The prediction pixel area of the remaining lane lines is then enlarged using the image dilation operation in OpenCV, and finally, the connected domains that do not belong to lane lines are filtered out by configuring a size threshold for the connected domains.

In some embodiments, the method further includes:

re-determining the detected lane line in the first picture by like-for-like merging the fitted connected domains contained in the target detection result by a clustering algorithm.

In some embodiments, a plurality of candidate lane lines may remain after the non-lane lines are filtered out. Some candidate lane lines belong to different parts of the same lane line, as shown on the left side of FIG. 5: the two candidate lane lines on the left logically belong to the same lane line, and these candidate lane lines need to be merged. Clustering is performed using the slope of each candidate lane line. The clustering algorithm used may include, but is not limited to, a mean shift clustering algorithm, which may be implemented by the sklearn.cluster.MeanShift class. A clustering radius needs to be determined based on the camera installation location and focal length parameters. The number of clustering centers obtained is the number of lane lines, and the equation of each lane line after merging is as follows.

where n indicates the number of candidate lane lines being merged into the lane line, and k_i and b_i indicate the slope and intercept of the i-th candidate lane line, respectively.
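The clustering and merging may be sketched with scikit-learn as follows. Two points are assumptions and are not taken from the text: the bandwidth of 0.2 stands in for the clustering radius, and each merged lane line is taken to use the mean of the n slopes k_i and the mean of the n intercepts b_i, which is one plausible reading of the definitions of n, k_i and b_i given above. The function name merge_candidates is hypothetical.

```python
import numpy as np
from sklearn.cluster import MeanShift


def merge_candidates(lines, bandwidth=0.2):
    """Cluster candidate (k, b) pairs by slope and merge each cluster into one line."""
    params = np.asarray(lines, dtype=float)   # shape (num_candidates, 2)
    slopes = params[:, :1]                    # MeanShift expects a 2-D array
    labels = MeanShift(bandwidth=bandwidth).fit(slopes).labels_
    merged = []
    for cluster_id in np.unique(labels):
        members = params[labels == cluster_id]
        # Assumption: the merged line uses the mean slope and the mean intercept.
        k, b = members.mean(axis=0)
        merged.append((float(k), float(b)))
    return merged


candidates = [(1.00, 10.0), (1.04, 14.0), (-0.90, 300.0)]
merged = sorted(merge_candidates(candidates))
```

Here the first two candidates have nearly equal slopes and fall into one cluster, so two lane lines remain after merging.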

In some embodiments, the target neural network model includes a neural network model including a lightweight convolutional neural network.

The following will further explain the embodiments in combination with specific examples.

The target neural network model architecture may adopt a combination of a traditional convolutional neural network model and a lightweight convolutional neural network model. At the bottom of the target neural network model, because the resolution of the input image is relatively large, usually only a smaller convolutional kernel can be used, which results in a smaller receptive field for the bottom convolutional kernel, whereas using a larger convolutional kernel leads to excessive computational cost. In the present disclosure, a depth separable convolution with a large convolutional kernel is used at the bottom of the target neural network model, such that the computational cost of the neural network model is reduced while the receptive field of the target neural network model is increased. For an n×n convolutional kernel, assuming that the number of input channels is c1 and the number of output channels is c2, the number of parameters of the convolutional layer is n×n×c1×c2. When the two convolutions of 1×n and n×1 are used instead, the number of parameters of the convolutional layer is 2×n×c1×c2, and the number of parameters is reduced by (n×n-2×n)×c1×c2. Therefore, the greater the value of n, the more pronounced the reduction in the number of parameters, while the receptive field remains unchanged.
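The parameter counts given above can be checked with a short calculation. The channel numbers c1 = c2 = 64 and the function names are chosen only for illustration, and bias terms are ignored, as in the text.

```python
def conv_params(n, c1, c2):
    """Number of weights in one n x n convolutional layer: n*n*c1*c2."""
    return n * n * c1 * c2


def factorized_params(n, c1, c2):
    """Number of weights, counted as in the text, for 1 x n plus n x 1: 2*n*c1*c2."""
    return 2 * n * c1 * c2


# The saving (n*n - 2*n)*c1*c2 grows with the kernel size n.
for n in (3, 5, 7):
    full = conv_params(n, 64, 64)
    split = factorized_params(n, 64, 64)
    print(n, full, split, full - split)
```

For n = 7 this gives 200,704 parameters for the n×n kernel against 57,344 for the 1×n and n×1 pair, a reduction of 143,360.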

In the middle and top layers of the target neural network model, the dimensionality of the feature map is greatly reduced compared to the bottom of the target neural network model. Ordinary convolution is used for feature extraction, and a larger convolutional kernel is adopted to maintain the receptive field. 1×n and n×1 convolutions are adopted instead of an n×n convolution to reduce the computational cost of the target neural network model.

The target neural network model structure is shown in Table 1.

Table 1

Among them, the above dewConvSP convolution operation may include but is not limited to the content shown in Table 2.

Table 2

In some embodiments, near the output layer of the target neural network model, a multi-scale attention mechanism is configured to increase the weight of the important region on the feature map (corresponding to the attention convolutional layer in Table 1), so as to improve the detection accuracy of the target neural network model. The calculation flowchart of the proposed attention mechanism is shown in FIG. 7.

S702: The feature map output by a previous convolutional layer (corresponding to the aforementioned first feature map) is input, assuming that the resolution of the first feature map is w×h and the number of channels is c.

S704: A convolution operation is performed on the input feature map using three convolutional kernels of different scales, wherein the different convolutional kernels are used to fuse information of lane line elements at different receptive field scales. Three feature maps are output, the resolution remains unchanged, and the numbers of channels are 0.5×c, 0.5×c, and c (from top to bottom).

S706: The first two feature maps are configured to calculate the correlation between elements, in order to determine the importance of each location to the global inference. The element value at each position is then counted along the channel dimension using a statistical method. A calculation formula may be as follows.

A schematic diagram of the foregoing calculation process may be, but is not limited to, as shown in FIG. 8.

In the formula, x is the counted element value, and x_i denotes the element values on the different channels at the same location. Taking i equal to 4 as an example, after the calculation, the resolution of the output feature map remains unchanged, and the number of channels becomes 1. The two output feature maps are then reshaped into (w×h)×1 and 1×(w×h) through the reshape function, a matrix with a dimension of (w×h)×(w×h) is obtained through matrix multiplication, and the attention feature map is obtained by performing a softmax operation on the matrix.

S708: The feature map generated by H(x) is reshaped into a dimension of c×(w×h) through the reshape function, a matrix with a dimension of c×(w×h) is obtained by matrix multiplication of the reshaped feature map and the attention feature map, this matrix is reshaped into a c×w×h feature map through the reshape function, and the resulting feature map is taken as the second feature map.
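The shape manipulations of S706 and S708 may be sketched in NumPy as follows. Several points are assumptions made for illustration and are not taken from the text: the channel statistic is taken to be the mean, the softmax is applied along each row of the (w×h)×(w×h) matrix, H(x) is taken to be the c-channel feature map output in S704, and the function names are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def attention_output(f1, f2, h_x):
    """f1, f2: (c/2, h, w) feature maps; h_x: (c, h, w) feature map, all from S704."""
    c, h, w = h_x.shape
    # Assumption: the per-location channel statistic is the mean over channels.
    s1 = f1.mean(axis=0).reshape(h * w, 1)    # (w*h) x 1
    s2 = f2.mean(axis=0).reshape(1, h * w)    # 1 x (w*h)
    attention = softmax(s1 @ s2, axis=-1)     # (w*h) x (w*h), each row sums to 1
    out = h_x.reshape(c, h * w) @ attention   # c x (w*h)
    return out.reshape(c, h, w)               # second feature map, c x h x w


rng = np.random.default_rng(0)
f1 = rng.normal(size=(2, 3, 4))
f2 = rng.normal(size=(2, 3, 4))
h_x = rng.normal(size=(4, 3, 4))
second_feature_map = attention_output(f1, f2, h_x)
```

Because the attention matrix has (w×h)×(w×h) entries, its memory cost grows quadratically with the number of spatial positions.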

In some embodiments, for all convolution operations in the target neural network model (corresponding to Table 1), whenever the sliding step size is 1, the feature map is zero-padded around its border before the convolution operation, ensuring that the resolution of the output feature map remains unchanged after the convolution operation. The input of the network is an RGB three-channel color image. The original image is resized to 320×184×3 using bilinear interpolation, and the label image is resized to 320×184 using a nearest neighbor interpolation algorithm. After a series of convolution operations, the network outputs a two-channel lane line segmentation result (corresponding to a picture in which each pixel indicates the probability that the location of that pixel is a lane line) with the same resolution as the input image. The two channels represent the image background pixel location information and the lane line pixel location information, respectively. The loss function used for model training is a weighted cross entropy, with the following equation.

loss = C(l_p, l_t) + αC(b_p, b_t)

It should be noted that since background pixels usually far outnumber lane line pixels, α is configured to alleviate the impact of category imbalance, and α may be set to 0.1 during training. The initial learning rate used in training is 0.01. The learning rate decay strategy is that the learning rate is multiplied by 0.1 every 10,000 iterations, and the total number of iterations is set to 60,000.
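A minimal NumPy sketch of the weighted loss is given below. It assumes that the network output has already been converted into per-pixel probabilities with channel 0 for the background and channel 1 for the lane line, and that each term C is the mean negative log-likelihood over its own class of pixels; the normalization of C and the function name weighted_cross_entropy are assumptions and are not specified in the text.

```python
import numpy as np


def weighted_cross_entropy(probs, label, alpha=0.1, eps=1e-12):
    """probs: (2, H, W) probabilities, channel 0 = background, channel 1 = lane line.
    label: (H, W) array with 1 for lane line pixels and 0 for background pixels."""
    lane = label == 1
    background = ~lane
    # Assumption: each term is averaged over the pixels of its own class.
    lane_term = -np.log(probs[1][lane] + eps).mean() if lane.any() else 0.0
    background_term = (
        -np.log(probs[0][background] + eps).mean() if background.any() else 0.0
    )
    # loss = C(l_p, l_t) + alpha * C(b_p, b_t)
    return lane_term + alpha * background_term


probs = np.full((2, 4, 4), 0.5)        # a maximally uncertain prediction
label = np.zeros((4, 4), dtype=int)
label[:, 1] = 1                        # one column of lane line pixels
loss = weighted_cross_entropy(probs, label)
```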

After the aforementioned target neural network model outputs the probability map of the lane line segmentation result, image post-processing and statistical methods are used to filter out falsely detected lane lines and to merge the lane lines. The processing flow chart is shown in FIG. 9. The specific steps may be as follows.

S902: the target neural network model outputs the target feature map;

S904: erosion and dilation processing are performed on the target feature map;

S906: the connected domains are filtered;

S908: the connected domains are each marked with an ID;

S910: the lane lines are fitted according to the connected domains;

S912: the lane lines are filtered according to the angle range;

S914: clustering is performed to obtain the final lane line;

S916: the detection result is output.

In some complex road scenes, the lane line features recorded in the target feature map output by the aforementioned target neural network model may show some discontinuities, as shown in FIG. 4a. Scattered, small-area lane line prediction pixels are likely to be false detections and may be eliminated using, for example but not limited to, the image erosion operation in OpenCV. The prediction pixel area of the remaining lane lines is then enlarged using the image dilation operation in OpenCV, and finally, the connected domains that do not belong to lane lines are filtered out by configuring a size threshold for the connected domains.

It should be noted that the above lane line pixels can be filtered by methods including, but not limited to, the following:

S1: A width and a height of a circumscribed rectangle of each connected domain 302 as shown in FIG. 3 are calculated. When the width and the height are each less than a corresponding threshold, the connected domain is unlikely to be a lane line pixel position, and the connected domain is directly deleted. The result after the lane line pixels are processed may be as shown in FIG. 4b. The connected domains shown in FIG. 4b do not include some of the connected domains included in FIG. 4a.

S2: On the processed probability map, each connected domain may be marked with a distinct numeric ID through the scikit-image (skimage) library, with each ID representing a different lane line. The result after this processing may be as shown in FIG. 4c, in which lane line pixels shown with different brightness belong to different lane lines.

S3: After the connected domains of different IDs are obtained, a function y = kx + b is applied to fit each connected domain separately, where k is the slope and b is the intercept, thereby obtaining a lane line model corresponding to each connected domain as a candidate lane line 502 as shown in FIG. 5.

For example, due to the perspective effect, when parallel lane lines in a three-dimensional space are projected into a two-dimensional space, an angle between each lane line and a horizontal direction will be within a certain range. As shown in FIG. 6, the range of the angle needs to be determined according to the camera’s installation position and focal length parameters. Then, according to the slope of each candidate lane line, the candidate lane lines whose angle is not within the range are further filtered out.

A plurality of candidate lane lines may remain after the non-lane lines are filtered out. Some candidate lane lines belong to different parts of the same lane line, as shown on the left side of FIG. 5: the two candidate lane lines on the left logically belong to the same lane line, and these candidate lane lines need to be merged. Clustering is performed using the slope of each candidate lane line. The clustering algorithm used may include, but is not limited to, a mean shift clustering algorithm, which may be implemented by the sklearn.cluster.MeanShift class. A clustering radius needs to be determined based on the camera installation location and focal length parameters. The number of clustering centers obtained is the number of lane lines, and the equation of each lane line after merging is as follows.

With this embodiment, a target neural network model based on a multi-scale attention mechanism is used for the lane line detection task. A lightweight lane line detection network is designed based on the characteristics of each layer of the deep neural network, so that a lightweight deep neural network can be used in the lane line detection task. This reduces the complexity and computational cost of the network, is advantageous when the network is deployed on embedded devices with limited storage capacity and computational power, increases the receptive field of the target neural network model, and alleviates the problem of predicted lane line pixel discontinuity in the lane line segmentation task. Moreover, in combination with the output format of the target neural network model, a set of post-processing algorithms may be designed that can accurately detect lane lines based on the coarse extraction of lane line features by the deep neural network, reducing the network’s requirement for training datasets. For example, the training can be performed on some general datasets, and the requirement of roughly extracting lane line features can be met without specifically collecting and labeling the corresponding road scenes, which reduces the cost of practical application of the algorithm.

From the description of the above embodiments, it will be clear to those skilled in the art that the method according to the above embodiments can be implemented by means of software plus a necessary general-purpose hardware platform, or of course by means of hardware, but in many cases the former is the better implementation. Based on this understanding, the technical solution of the present disclosure, in essence or in the part that contributes over the prior art, may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, disk, CD-ROM) and includes a number of instructions to enable a terminal device (which may be a cell phone, a computer, a server, a network device, etc.) to perform the method described in the various embodiments of the present disclosure.

The present disclosure also provides a lane line detection apparatus based on deep learning, which is configured to implement the aforementioned embodiments and preferred implementations, and what has already been explained will not be repeated. As used below, the term “module” may refer to a combination of software and/or hardware with predetermined functions. Although the apparatus described in the following embodiments is preferably implemented by software, implementation by hardware or by a combination of software and hardware is also possible and conceived.

FIG. 10 is a structural block view of a lane line detection apparatus based on deep learning according to an embodiment of the present disclosure. As shown in FIG. 10, the apparatus includes:

an obtaining module 1002, configured to obtain a first picture that is planned to be detected;

a first processing module 1004, configured to obtain a target feature map by inputting the first picture into a target neural network model; wherein the target neural network model includes a neural network model generated based on a multi-scale attention mechanism and a deep separable convolution model, and the target feature map is configured to represent a probability of each pixel in the first picture being a lane line pixel;

a second processing module 1006, configured to obtain a target detection result by performing an image post-processing on the target feature map; wherein the target detection result is configured to indicate a detected lane line in the first picture.

In some embodiments, the first processing module 1004 includes:

a first calculation unit, configured to obtain a first feature map by inputting the first picture into a first convolutional layer; wherein the target neural network model includes the first convolutional layer, and the first convolutional layer is configured to adjust a resolution of the first picture and to extract the lane line pixel of the first image after the resolution adjustment;

a second calculation unit, configured to obtain a second feature map by inputting the first feature map into a second convolutional layer; wherein the target neural network model includes the second convolutional layer, and the second convolutional layer is configured to increase a weight of a preset region in the first feature map based on the multi-scale attention mechanism; and

a first processing unit, configured to obtain the target feature map by performing a first preset processing on the second feature map; wherein the first preset processing includes an up-sampling operation.

In some embodiments, the aforementioned apparatus is configured to input the first picture into the first convolutional layer in the following manner to obtain the first feature map: obtaining the first feature map by inputting the first picture with a resolution less than or equal to a first preset resolution into the first convolutional layer.

In some embodiments, the aforementioned apparatus is configured to input the first picture into the first convolutional layer in the following manner to obtain the first feature map: obtaining a first sub-feature map by inputting the first picture into a first sub-convolutional layer, wherein the first sub-convolutional layer is configured to perform a convolution operation, and the first convolutional layer includes the first sub-convolutional layer; obtaining a second sub-feature map by inputting the first sub-feature map into a second sub-convolutional layer, wherein the second sub-convolutional layer is configured to perform a depth separable convolution operation to extract feature information of the first sub-feature map, and the first convolutional layer includes the second sub-convolutional layer; in response to a resolution of the second sub-feature map being greater than a second preset resolution, reducing a resolution of the first sub-feature map by re-inputting the second sub-feature map into the first sub-convolutional layer; in response to the resolution of the second sub-feature map being less than or equal to the second preset resolution, determining the second sub-feature map to be the first feature map.

In some embodiments, the first calculation unit is configured to input the first sub-feature map into the second sub-convolutional layer in the following manner to obtain the second sub-feature map: obtaining the second sub-feature map by extracting the feature information of the first sub-feature map with a 1×n convolutional kernel and by extracting the feature information of the first sub-feature map with an n×1 convolutional kernel, wherein the n is a positive odd number greater than 1.

In some embodiments, the second calculation unit is configured to input the first feature map into the second convolutional layer in the following manner to obtain the second feature map: obtaining a first type feature map and a plurality of second type feature maps by inputting the first feature map into the second convolutional layer and by performing a convolution operation on the first feature map with a plurality of convolutional kernels of different sizes included in the second convolution layer; determining a plurality of third type feature maps from the plurality of second type feature maps through a preset statistical method, wherein the plurality of third type feature maps are configured to be performed with a second preset processing such that sizes of the plurality of third type feature maps match each other, the plurality of third type feature maps are configured to be performed with a third preset processing to obtain an attention feature map, and a size of the attention feature map matches a size of the first type feature map; and obtaining the second feature map by performing the third preset processing on the attention feature map and the first type feature map.

In some embodiments, the apparatus is further configured to obtain a first sample picture and a label picture corresponding to the first sample picture; and

obtain the target neural network model by training a to-be-trained initial neural network model with the first sample picture and the label picture, wherein the target neural network model is configured to be trained through a loss function as follows.

loss = C(l_p, l_t) + αC(b_p, b_t)

In some embodiments, the second processing module 1006 includes:

a second processing unit, configured to obtain a lane line result segmentation map by performing a binarization processing on the target feature map;

a third processing unit, configured to obtain a processed lane line result segmentation map containing a plurality of connected domains by preprocessing the lane line result segmentation map with an image erosion operation and an image dilation operation;

a fourth processing unit, configured to obtain a target detection result by deleting at least one of the plurality of connected domains that does not meet a preset condition and fitting the remaining connected domains that meet the preset condition to obtain a fitted connected domain, wherein the fitted connected domain included in the target detection result represents a detected lane line in the first picture.

In some embodiments, the apparatus is further configured to:

re-determine the detected lane line in the first picture by like-for-like merging the fitted connected domains contained in the target detection result by a clustering algorithm.

It should be noted that each of the above modules can be implemented by software or hardware. For the latter, it can be implemented in the following way, but not limited to: the above modules are all located in the same processor; or, each of the above modules is located in a different processor in any combination.

An embodiment of the present disclosure also provides a computer-readable storage medium in which a computer program is stored, wherein the computer program is configured to execute the steps in any one of the foregoing method embodiments when executed.

In this embodiment, the computer-readable storage medium may be configured to store a computer program for executing the following steps:

S1: A first picture that is planned to be detected is obtained.

S2: A target feature map is obtained by inputting the first picture into a target neural network model; wherein the target neural network model includes a neural network model generated based on a multi-scale attention mechanism and a deep separable convolution model, and the target feature map is configured to represent a probability of each pixel in the first picture being a lane line pixel.

S3: A target detection result is obtained by performing an image post-processing on the target feature map; wherein the target detection result is configured to indicate a detected lane line in the first picture.

In some embodiments, the foregoing computer-readable storage medium may include, but is not limited to, a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a mobile hard drive, a magnetic disk, an optical disk, and other media that can store computer programs.

An embodiment of the present disclosure also provides an electronic device, including a memory and a processor, the memory stores a computer program, and the processor is configured to run the computer program to execute the steps in any of the foregoing method embodiments.

In some embodiments, the aforementioned electronic device may further include a transmission device and an input-output device, wherein the transmission device is connected to the processor, and the input-output device is connected to the processor.

In some embodiments, the processor may be configured to execute the following steps through a computer program:

S1: A first picture that is planned to be detected is obtained.

For specific examples in this embodiment, reference may be made to the examples described in the aforementioned embodiments and exemplary implementations, which will not be repeated here.

Clearly, it should be understood by those skilled in the art that the modules or steps of the present disclosure described above may be implemented with a general-purpose computing device. They may be centralized on a single computing device or distributed over a network of multiple computing devices, and they may be implemented with program code executable by the computing device, such that they may be stored in a storage device to be executed by the computing device. In some cases, the steps shown or described may be executed in a different order than herein, or they may be implemented separately as individual integrated circuit modules, or multiple modules or steps thereof may be implemented as a single integrated circuit module. In this way, the present disclosure is not limited to any particular combination of hardware and software.

The foregoing is only preferred embodiments of the present disclosure and is not intended to limit the present disclosure. To those skilled in the art, the present disclosure is subject to various modifications and variations. Any modification, equivalent replacement, improvement, etc. made within the principles of the present disclosure shall be included in the scope of the present disclosure.

Claims

A lane line detection method based on deep learning, comprising:

obtaining a first picture that is planned to be detected;

obtaining a target feature map by inputting the first picture into a target neural network model; wherein the target neural network model comprises a neural network model generated based on a multi-scale attention mechanism and a deep separable convolution model, and the target feature map is configured to represent a probability of each pixel in the first picture being a lane line pixel; and

obtaining a target detection result by performing an image post-processing on the target feature map; wherein the target detection result is configured to indicate a detected lane line in the first picture.
The method according to claim 1, wherein the obtaining the target feature map by inputting the first picture into the target neural network model comprises:

obtaining a first feature map by inputting the first picture into a first convolutional layer;

obtaining a second feature map by inputting the first feature map into a second convolutional layer, wherein the target neural network model comprises the second convolutional layer, and the second convolutional layer is configured to increase a weight of a preset region in the first feature map based on the multi-scale attention mechanism; and

obtaining the target feature map by performing a first preset processing on the second feature map, wherein the first preset processing comprises an up-sampling operation.
The method according to claim 2, wherein the obtaining the first feature map by inputting the first picture into the first convolutional layer comprises:

obtaining the first feature map by inputting the first picture with a resolution less than or equal to a first preset resolution into the first convolutional layer.
The method according to claim 2, wherein the obtaining the first feature map by inputting the first picture into the first convolutional layer comprises:

obtaining a first sub-feature map by inputting the first picture into a first sub-convolutional layer, wherein the first sub-convolutional layer is configured to perform a convolution operation, and the first convolutional layer comprises the first sub-convolutional layer;

obtaining a second sub-feature map by inputting the first sub-feature map into a second sub-convolutional layer, wherein the second sub-convolutional layer is configured to perform a depth separable convolution operation to extract feature information of the first sub-feature map, and the first convolutional layer comprises the second sub-convolutional layer;

in response to a resolution of the second sub-feature map being greater than a second preset resolution, reducing a resolution of the first sub-feature map by re-inputting the second sub-feature map into the first sub-convolutional layer; and

in response to the resolution of the second sub-feature map being less than or equal to the second preset resolution, determining the second sub-feature map to be the first feature map.
The method according to claim 4, wherein the obtaining the second sub-feature map by inputting the first sub-feature map into the second sub-convolutional layer comprises:

obtaining the second sub-feature map by extracting the feature information of the first sub-feature map with a 1×n convolutional kernel and by extracting the feature information of the first sub-feature map with an n×1 convolutional kernel, wherein the n is a positive odd number greater than 1.
The method according to claim 2, wherein the obtaining the second feature map by inputting the first feature map into the second convolutional layer comprises:

obtaining a first type feature map and a plurality of second type feature maps by inputting the first feature map into the second convolutional layer and by performing a convolution operation on the first feature map with a plurality of convolutional kernels of different sizes included in the second convolution layer;

determining a plurality of third type feature maps from the plurality of second type feature maps through a preset statistical method, wherein the plurality of third type feature maps are subjected to a second preset processing such that sizes of the plurality of third type feature maps match each other, the plurality of third type feature maps are subjected to a third preset processing to obtain an attention feature map, and a size of the attention feature map matches a size of the first type feature map; and

obtaining the second feature map by performing the third preset processing on the attention feature map and the first type feature map.
The method according to claim 1, before the obtaining the target feature map by inputting the first picture into the target neural network model, further comprising:

obtaining a first sample picture and a label picture corresponding to the first sample picture; and

obtaining the target neural network model by training a to-be-trained initial neural network model with the first sample picture and the label picture, wherein the target neural network model is configured to be trained through a loss function as follows:

loss = C(l_p, l_t) + αC(b_p, b_t)

where C represents a cross entropy loss function, l_p represents a lane line pixel obtained by inputting the first sample picture into the to-be-trained initial neural network model, l_t represents a lane line pixel marked by the label picture, b_p represents an image background pixel obtained by inputting the first sample picture into the to-be-trained initial neural network model, b_t represents an image background pixel marked by the label picture, and α is a preset parameter greater than 0.
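
One possible reading of this loss can be sketched as follows. The sketch assumes that the model outputs a per-pixel lane probability, that the label picture marks each pixel as 1 (lane line) or 0 (background), and that each cross entropy term is averaged over its own pixel group; the function names are hypothetical and none of these details are fixed by the claim.

```python
import math


def cross_entropy(true_class_probs):
    """Mean negative log-likelihood of the probability assigned to the
    correct class; returns 0.0 for an empty group."""
    eps = 1e-12  # guards against log(0)
    if not true_class_probs:
        return 0.0
    return -sum(math.log(max(p, eps)) for p in true_class_probs) / len(true_class_probs)


def lane_loss(pred, label, alpha=1.0):
    """loss = C(lane term) + alpha * C(background term).

    pred  : predicted lane probability per pixel, flattened
    label : 1 for a lane line pixel, 0 for a background pixel
    alpha : preset weight (> 0) on the background term
    """
    lane = [p for p, t in zip(pred, label) if t == 1]
    background = [1.0 - p for p, t in zip(pred, label) if t == 0]
    return cross_entropy(lane) + alpha * cross_entropy(background)


pred = [0.9, 0.8, 0.2, 0.1, 0.3]
label = [1, 1, 0, 0, 0]
print(lane_loss(pred, label, alpha=0.5))
```

Because lane line pixels are typically far fewer than background pixels, separating the two terms and weighting the background term by α lets the balance between them be tuned.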
The method according to claim 1, wherein the obtaining the target detection result by performing the image post-processing on the target feature map comprises:

obtaining a lane line result segmentation map by performing a binarization processing on the target feature map;

obtaining a processed lane line result segmentation map containing a plurality of connected domains by preprocessing the lane line result segmentation map with an image erosion operation and an image dilation operation; and

obtaining the target detection result by deleting at least one of the plurality of connected domains that does not meet a preset condition and fitting each remaining connected domain that meets the preset condition, wherein a fitted connected domain included in the target detection result represents the detected lane line in the first picture.
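
The post-processing steps above can be sketched end to end as follows. This is an illustrative sketch under stated assumptions: the threshold of 0.5, the 3×3 structuring element, 4-connectivity, a minimum-area test standing in for the unspecified preset condition, and a straight-line least-squares fit of column against row are all choices made for the example, and every function name is hypothetical.

```python
from collections import deque


def binarize(prob_map, threshold=0.5):
    return [[1 if p >= threshold else 0 for p in row] for row in prob_map]


def _morph(mask, combine):
    """Apply min (erosion) or max (dilation) over each 3 x 3 window;
    pixels outside the image count as 0."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            window = [
                mask[rr][cc] if 0 <= rr < h and 0 <= cc < w else 0
                for rr in range(r - 1, r + 2)
                for cc in range(c - 1, c + 2)
            ]
            out[r][c] = combine(window)
    return out


def erode(mask):
    return _morph(mask, min)


def dilate(mask):
    return _morph(mask, max)


def connected_domains(mask):
    """Collect 4-connected foreground regions with a breadth-first search."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    domains = []
    for r in range(h):
        for c in range(w):
            if mask[r][c] and not seen[r][c]:
                seen[r][c] = True
                queue, pixels = deque([(r, c)]), []
                while queue:
                    y, x = queue.popleft()
                    pixels.append((y, x))
                    for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                        if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
                domains.append(pixels)
    return domains


def fit_line(pixels):
    """Least-squares fit of column = slope * row + intercept."""
    n = len(pixels)
    mean_y = sum(y for y, _ in pixels) / n
    mean_x = sum(x for _, x in pixels) / n
    var_y = sum((y - mean_y) ** 2 for y, _ in pixels)
    cov = sum((y - mean_y) * (x - mean_x) for y, x in pixels)
    slope = cov / var_y if var_y else 0.0
    return slope, mean_x - slope * mean_y


def detect_lanes(prob_map, threshold=0.5, min_area=10):
    """Binarize, erode then dilate, drop small domains, fit the rest."""
    mask = dilate(erode(binarize(prob_map, threshold)))
    return [fit_line(d) for d in connected_domains(mask) if len(d) >= min_area]


# A 12 x 12 probability map with one 3-pixel-wide vertical lane and one noise pixel.
prob_map = [[0.0] * 12 for _ in range(12)]
for r in range(12):
    for c in (4, 5, 6):
        prob_map[r][c] = 0.9
prob_map[2][10] = 0.8
print(detect_lanes(prob_map))
```

In this example the isolated noise pixel disappears during erosion, while the lane region is restored by the subsequent dilation and fitted as a single line.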
The method according to claim 8, further comprising:

re-determining the detected lane line in the first picture by merging, through a clustering algorithm, fitted connected domains of a same type included in the target detection result.
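
The claim does not name a particular clustering algorithm. As one hypothetical possibility, the sketch below greedily groups fitted lines, each represented as a (slope, intercept) pair, whose parameters lie within preset tolerances of a cluster mean, and replaces each group by its average. The representation, the tolerance values, and the function name are assumptions made for the example.

```python
def merge_similar_lines(lines, slope_tol=0.1, intercept_tol=5.0):
    """Greedy clustering of (slope, intercept) pairs; each cluster is
    replaced by the mean of its members."""
    clusters = []
    for slope, intercept in lines:
        for cluster in clusters:
            mean_slope = sum(s for s, _ in cluster) / len(cluster)
            mean_intercept = sum(b for _, b in cluster) / len(cluster)
            if (abs(slope - mean_slope) <= slope_tol
                    and abs(intercept - mean_intercept) <= intercept_tol):
                cluster.append((slope, intercept))
                break
        else:
            # No existing cluster is close enough: start a new one.
            clusters.append([(slope, intercept)])
    return [
        (sum(s for s, _ in c) / len(c), sum(b for _, b in c) / len(c))
        for c in clusters
    ]


# Two fragments of the same lane plus one distinct lane.
fitted = [(0.0, 5.0), (0.0, 7.0), (1.0, 40.0)]
print(merge_similar_lines(fitted))
```

Merging in this way joins fragments of a lane line that was split into several connected domains, for example a dashed line or a line partially occluded by a vehicle.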
The method according to any one of claims 1-9, wherein the target neural network model comprises a neural network model comprising a lightweight convolutional neural network.
A lane line detection apparatus based on deep learning, comprising:

an obtaining module, configured to obtain a first picture that is planned to be detected;

a first processing module, configured to obtain a target feature map by inputting the first picture into a target neural network model; wherein the target neural network model comprises a neural network model generated based on a multi-scale attention mechanism and a deep separable convolution model, and the target feature map is configured to represent a probability of each pixel in the first picture being a lane line pixel; and

a second processing module, configured to obtain a target detection result by performing an image post-processing on the target feature map; wherein the target detection result is configured to indicate a detected lane line in the first picture.
A computer-readable storage medium, storing a computer program; wherein the computer program is configured to perform the method according to any one of claims 1-10 when executed.
An electronic device, comprising a processor, a memory, and a computer program stored in the memory and executable on the processor; wherein the processor is configured to perform the method according to any one of claims 1-10 when executing the computer program.