Disclosure of Invention
In order to solve the technical problem that a sufficiently large gesture data set is difficult to obtain in the prior art, the invention provides a gesture data set generation method, a gesture data set generation system, and a storage medium.
The gesture data set generation method provided by the invention comprises the following steps:
Step 1, acquiring an original image by using an infrared camera, wherein the original image comprises gesture analysis data, gesture target data and background data;
Step 2, preprocessing the original image to obtain segmented gesture analysis data, segmented gesture target data and background data to be synthesized;
Step 3, adjusting the segmented gesture target data according to the segmented gesture analysis data to generate a target number of gesture target data to be synthesized;
Step 4, synthesizing the gesture target data to be synthesized and the background data to be synthesized to obtain an initial gesture synthesized image.
Further, the method comprises a step 5 of post-processing the initial gesture synthesized image to obtain the final gesture synthesized data.
Further, the infrared camera is used to collect a preset number of gesture images at different distances within the effective range, at different positions in the same field of view, and against a blank background, to form the gesture analysis data;
and/or the infrared camera is used to collect a preset number of gesture images at a preset distance within the effective range, at the same central position of the field of view, at different single angles, and against a blank background, to form the gesture target data;
and/or the infrared camera is used to collect a preset number of images of various scenes containing no gesture within the effective range, to form the background data.
Further, the preprocessing includes at least one of: converting the gesture analysis data and gesture target data so that the pixel values of the original image fall within the threshold interval [0, 255]; resizing the original image; and filtering, smoothing, or noise-reduction processing of the original image.
Further, the preprocessing includes one or a combination of rotation, cropping, scaling, contrast adjustment, and brightness adjustment of the background data, to obtain the target number of background images to be synthesized.
Further, the step 3 includes:
Step 31, performing statistics on the segmented gesture analysis data to obtain the correspondence between the mean and the standard deviation of the gesture data, and determining their value ranges to form a reference standard condition;
Step 32, selecting a piece of segmented gesture target data, and calculating the standard deviation S1 and the mean M1 of the gesture data to be adjusted;
Step 33, randomly selecting a mean M2 and a standard deviation S2 from the reference standard condition, adjusting the gray value of each pixel of the gesture data to be adjusted point by point according to the formula H2 = (H1 - M1)/S1 × S2 + M2, and, if an adjusted gray value falls outside the threshold interval, assigning it the nearest threshold;
Step 34, repeating step 33, or steps 32 and 33, to adjust the gray values of the pixels of each segmented gesture target data image until the target number of gesture target images to be synthesized is reached.
Further, the correspondence between the mean and the standard deviation of the gesture data is
max(0.28465×M - 25.27581, 0.1) ≤ S ≤ 0.28465×M + 15.27581,
where M is the mean and S is the standard deviation.
Further, in step 4, during the synthesis of the initial gesture synthesized image, when the gesture target image to be synthesized is attached, a closed gesture frame can be formed around its periphery. The position of the gesture frame on the background image is denoted by L, and this position L is the labeling information of the initial gesture synthesized image.
Further, the step 5 includes:
The maximum attenuation value is randomly taken at different fields of view, a fit is made according to the distance from the image center, and the initial gesture synthesized image is multiplied element by element with the matrix A, so that gesture data at different positions in the lens are fitted as the final gesture synthesized data.
Further, at least one of scattered noise, linear noise, and curved noise, or a combination thereof, is randomly added to the fitted gesture data at different positions in the lens, and the result is used as the final gesture synthesized data.
The gesture data set generating system of the invention obtains gesture synthesized data by using the gesture data set generation method described above, and comprises:
the acquisition module is used for acquiring the original image;
the preprocessing module is used for preprocessing the original image;
the adjusting module is used for adjusting the processing result of the preprocessing module;
and the synthesis module, used for synthesizing the gesture target data to be synthesized with the background data to be synthesized.
The computer-readable storage medium of the present invention stores a computer program which, when run, executes the gesture data set generation method described above to obtain gesture synthesized data.
With the gesture data set generation method of the invention, a gesture data set based on infrared camera acquisition can be established conveniently and quickly, without investing large amounts of manpower, material, or financial resources. A small amount of IR image data is acquired with an infrared camera; after preprocessing, the gesture target data, the background data, and the required reference adjustment conditions are obtained. A much larger amount of gesture target data is then generated rapidly and combined with the background data into original gesture synthesized data, and post-processing makes the synthesized data resemble actually acquired data. At the same time, labeling information is generated automatically, which avoids the workload of manual labeling and provides a fuller data guarantee for deep learning. Compared with traditional methods, the invention greatly improves data acquisition and labeling efficiency and can obtain the required training data set with little investment.
Detailed Description
In order to make the technical problems to be solved, the technical solutions, and the beneficial effects clearer, the invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
Thus, the appearance of a feature throughout this specification describes one embodiment of the invention and does not imply that every embodiment of the invention must include that feature. Furthermore, it should be noted that this specification describes a number of features. Although certain features may be combined together to illustrate a possible system design, such features may also be used in other combinations not explicitly described. Thus, unless otherwise indicated, the illustrated combinations are not intended to be limiting.
According to the gesture data set generation method of the invention, a large number of gesture pictures can be quickly synthesized from a small amount of collected IR gesture data to form a new data set. This provides enriched training data for training a gesture classification network, provides coordinate information of the gesture frame for training a gesture detection network, and greatly improves the diversity and collection efficiency of the training data set.
As shown in fig. 1, the gesture data set generating method of the present invention includes at least 4 steps, and in a preferred embodiment, includes 5 steps.
In step 1 (S1), an original image is acquired by using an infrared camera, wherein the original image includes gesture analysis data, gesture target data and background data.
The gesture analysis data are acquired by using the infrared camera to collect a preset number of gesture images at different distances within the effective range, at different positions in the same field of view, and against a blank background; these images form the original gesture analysis data set.
The gesture target data are acquired by using the infrared camera to collect a preset number of gesture images at a middle distance within the effective range, at the central position of the same field of view, at different single angles, and against a blank background; these images form the original gesture target data set.
The background data are acquired by using the infrared camera to collect a preset number of images of various scenes that contain no gesture within the effective range; these images form the original background data set.
The original images may be acquired in any order.
In step 2 (S2), the original image is preprocessed to obtain the segmented gesture analysis data, the segmented gesture target data, and the background data to be synthesized. Specifically, the preprocessing includes at least one of: converting the gesture analysis data and gesture target data so that the pixel values of the original image fall within the threshold interval [0, 255]; resizing the original image; and filtering, smoothing, or noise-reduction processing of the original image.
The gesture analysis data are preprocessed because, in the original gesture analysis image data, the data values of the pixels vary widely, so the data type must be converted. In this embodiment, the infrared camera used is a TOF camera, and statistics show the raw data range to be [0, 1023]. Since this range exceeds the value interval [0, 255] and is therefore inconvenient for subsequent calculation, the raw acquired data must be normalized.
The normalization is as follows: if the acquired gray value of a pixel is P, its normalized gray value is set to P/1023 × 255. The gesture frame is then cropped and the picture resized, keeping the aspect ratio of the gesture frame unchanged, so that the longest side has length 180; this yields uniform gesture analysis pictures.
Median filtering is then performed to remove noise extremes and smooth the gesture boundary, after which the gesture region is segmented with an adaptive threshold method: the largest connected component is taken as the valid gesture region, and small connected components are removed as noise. Finally, the background is filled with the gray value 255, and if any point inside the valid gesture has the gray value 255, it is changed to 254; this yields the segmented gesture analysis data.
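The preprocessing above can be sketched as follows. This is a minimal illustration assuming the raw frame is a 2-D numpy array; the helper name is hypothetical, the nearest-neighbour resize stands in for a proper image-resize routine, and a fixed threshold stands in for the adaptive-threshold plus largest-connected-component segmentation described in the text.

```python
import numpy as np

def preprocess_gesture_image(raw, longest_side=180):
    """Sketch of the step-2 preprocessing (hypothetical helper).

    `raw` is assumed to be a 2-D uint16 TOF frame with values in [0, 1023]."""
    # Normalize [0, 1023] -> [0, 255]: P_new = P / 1023 * 255.
    img = (raw.astype(np.float64) / 1023.0 * 255.0).astype(np.uint8)

    # Resize so the longest side is `longest_side`, keeping the aspect ratio.
    h, w = img.shape
    scale = longest_side / max(h, w)
    new_h, new_w = max(1, round(h * scale)), max(1, round(w * scale))
    # Nearest-neighbour resize via index sampling (avoids external deps).
    rows = (np.arange(new_h) / scale).astype(int).clip(0, h - 1)
    cols = (np.arange(new_w) / scale).astype(int).clip(0, w - 1)
    img = img[rows][:, cols]

    # Fill the background with 255 and demote in-gesture 255s to 254 so that
    # the value 255 uniquely marks background. A fixed threshold is used here
    # as an assumed stand-in for the adaptive segmentation.
    mask = img < 200  # assumed foreground criterion
    return np.where(mask, np.minimum(img, 254), 255).astype(np.uint8)
```

In a real pipeline the fixed threshold would be replaced by adaptive thresholding and connected-component filtering, and median filtering would precede the segmentation.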
The gesture target data are preprocessed with the same steps and process as the gesture analysis data.
The background images are preprocessed in order to adapt to more real scenes: one or a combination of rotation, cropping, scaling, contrast adjustment, and brightness adjustment can be applied, so that more background images to be synthesized are obtained.
In step 3 (S3), the segmented gesture target data is adjusted according to the segmented gesture analysis data to generate a target number of gesture target data to be synthesized, which specifically includes the following steps.
In step 31, statistics are performed on the segmented gesture analysis data to obtain the correspondence between the mean and the standard deviation of the gesture data, and their value ranges are determined to form a reference standard condition.
In an actual camera, the gray levels of gestures at different distances differ: a gesture close to the camera is bright and its brightness varies widely, while a gesture far from the camera is dark and its brightness varies little. This behaviour can serve as the reference for adjusting the gesture targets, so a corresponding reference standard condition must be found. Specifically, statistics are performed on the segmented gesture analysis data to obtain the distribution and correspondence of the gesture mean and standard deviation (see figs. 2 and 3), namely, the following relation is satisfied:
max(0.28465×M - 25.27581, 0.1) ≤ S ≤ 0.28465×M + 15.27581
where M is the mean and S is the standard deviation.
From fig. 2 it can be seen that when a single gesture is analyzed, the distribution of the gesture mean and standard deviation is concentrated; from fig. 3 it can be seen that when multiple gestures are acquired, the distribution is more scattered, but the correspondence between mean and standard deviation still satisfies the above relation, indicating a high correlation between them. That is, the gesture analysis data set is effective and can be applied to the gesture target data set. Considering that the acquired gesture analysis data set cannot cover every real situation, and to ensure that the training set covers as many real situations as possible, the value ranges of the mean M and standard deviation S may be appropriately relaxed. In this embodiment, 45 ≤ M ≤ 225 and S ≥ 0.1.
Referring to the distribution in fig. 3, the value of M is on the horizontal axis and the value of S on the vertical axis; to keep the standard deviation from reaching 0 or below, the minimum value of S is 0.1. The data required for training should be distributed between the two diagonal lines in fig. 3.
Then a piece of segmented gesture target data (one segmented gesture target image) is selected; let its mean be M1 and its standard deviation S1. A mean M2 and a standard deviation S2 are then randomly selected from the reference standard condition corresponding to the gesture target adjustment, and the adjustment is carried out point by point according to the following relation:
(H1-M1)/S1=(H2-M2)/S2
where H1 is the original gray value of a pixel in the image and H2 is its adjusted gray value.
That is, steps 32 and 33 are performed. Step 32 selects a piece of segmented gesture target data and calculates the standard deviation S1 and mean M1 of the gesture data to be adjusted. Step 33 randomly selects a mean M2 and a standard deviation S2 from the reference standard condition and adjusts the gray value of each pixel of the gesture data to be adjusted point by point according to the formula H2 = (H1 - M1)/S1 × S2 + M2; if an adjusted gray value falls outside the threshold interval, it is assigned the nearest threshold. For example, after adjustment, H2 = 0 if H2 < 0, and H2 = 254 if H2 > 254.
Step 34 adjusts the gray values of the pixels of each segmented gesture target data image by repeating step 33, or steps 32 and 33, until the target number of gesture target images to be synthesized is reached. By repeatedly adjusting each segmented gesture target image, more gesture target images to be synthesized, as required by different scenes, can be obtained.
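The point-by-point adjustment of steps 32 and 33 can be sketched as follows. This is a minimal illustration assuming a numpy array of foreground gray values; the function name is hypothetical, and S is treated as the standard deviation, consistent with the formula H2 = (H1 - M1)/S1 × S2 + M2.

```python
import numpy as np

def adjust_gesture(gesture, m2, s2):
    """Map segmented gesture gray values so their mean/std become (m2, s2).

    Applies H2 = (H1 - M1) / S1 * S2 + M2 element-wise, then clips to the
    valid interval [0, 254] (255 is reserved for background)."""
    m1, s1 = gesture.mean(), gesture.std()
    adjusted = (gesture - m1) / s1 * s2 + m2
    return np.clip(adjusted, 0.0, 254.0)
```

Repeating this with freshly sampled (M2, S2) pairs from the reference standard condition yields the target number of gesture target images to be synthesized.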
In step 4 (S4), the gesture target data to be synthesized and the background data to be synthesized are combined to obtain an initial gesture synthesized image. Specifically, the maximum and minimum suitable side lengths of the gesture target image are counted, a number is randomly generated in that interval as the longest side, and the gesture target image to be synthesized is enlarged or shrunk with its aspect ratio unchanged. It is then randomly rotated and/or mirrored, and the portion of the gesture target image whose data value is not 255 is attached as foreground to a background image to be synthesized, while the position L of the gesture frame is recorded. This yields the initial gesture synthesized image, and the labeling information of the gesture in it is obtained from the position L, which can be used in subsequent deep learning. In some special cases, the initial gesture synthesized images can already serve as a gesture data set with a larger number of samples. During the synthesis of the initial gesture synthesized image (the fourth image set), when the gesture target image to be synthesized is attached, a closed loop can be formed around its periphery; this loop is called the gesture frame (for example, a rectangular frame drawn to contain the gesture target). The position of this frame on the background image is denoted L, and this position L is the labeling information of the initial gesture synthesized image.
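The attachment and gesture-frame recording of step 4 can be sketched as follows. This is an illustrative sketch, not the patented implementation: the function and variable names are hypothetical, and the gesture frame L is represented as a (top, left, height, width) tuple.

```python
import numpy as np

def composite(gesture, background, top, left):
    """Paste the gesture foreground (pixels != 255) onto the background.

    Returns the composite image and the gesture-frame position L as
    (top, left, height, width); 255 marks background in the gesture image."""
    out = background.copy()
    h, w = gesture.shape
    region = out[top:top + h, left:left + w]
    fg = gesture != 255          # foreground mask: everything not background
    region[fg] = gesture[fg]     # attach only the foreground pixels
    return out, (top, left, h, w)
```

The recorded tuple can be stored alongside the image as detection labels, or used to crop the gesture region for a classification data set.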
In a preferred embodiment, the method may further include step 5 (S5), in which the initial synthesized image is post-processed to obtain the final gesture synthesized data. As shown in fig. 4, an image captured by a camera is generally brightest at the center and grows darker in concentric circles toward the edges because of the lens, an effect known as LENS SHADING (lens uniformity). The gesture target images to be synthesized do not exhibit this effect, which could make the initial synthesized image look unnatural, contrary to the user's intention. To obtain images similar to those actually acquired by a camera, in step 5 a matrix A is constructed with the same numbers of rows and columns as the initial gesture synthesized image. The maximum attenuation value is randomly selected at different fields of view (this can be understood as randomly selecting a point or region: when images of different fields of view are processed, the maximum attenuation value is that of the point or region in the currently selected field-of-view image), a fit is made according to the distance from the image center, and the initial gesture synthesized image is multiplied element by element with the matrix A, so that gesture data at different positions in the lens are fitted as the final gesture synthesized data. The gesture data produced by step 5 are fitted to different positions in the lens and can therefore satisfy more diverse requirements.
Specifically, taking the center point of the image as the reference, the attenuation increases gradually from the center of the lens to the periphery according to a linear relation; the maximum attenuation value is then taken randomly at different fields of view, a fit is made according to the distance from the center point, and the initial gesture synthesized image is multiplied element by element with the matrix A, so that gesture data at different positions in the lens are fitted.
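The attenuation matrix A under the linear center-to-edge model above can be sketched as follows. This is a minimal illustration: the linear falloff and the maximum attenuation value of 0.4 are assumptions standing in for the randomly selected, fitted values described in the text.

```python
import numpy as np

def lens_shading_matrix(shape, max_attenuation=0.4):
    """Sketch of the step-5 matrix A: gain 1.0 at the image center, falling
    off linearly with distance to (1 - max_attenuation) at the farthest
    corner. Both the linear model and the 0.4 value are assumptions."""
    h, w = shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    yy, xx = np.mgrid[0:h, 0:w]
    dist = np.hypot(yy - cy, xx - cx)           # distance to the center
    return 1.0 - max_attenuation * dist / dist.max()

# Applied element by element: shaded = image * lens_shading_matrix(image.shape)
```

Multiplying the initial gesture synthesized image by A element by element darkens the periphery, approximating the lens-shading behaviour of a real camera.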
In a further embodiment, at least one of scattered noise, linear noise, and curved noise, or a combination thereof, is randomly added to the fitted gesture data at different positions in the lens, and the result is used as the final gesture synthesized data. The resulting gesture data set contains samples of many more situations and can essentially satisfy all of the different data requirements for gestures. Because one image can be synthesized into many data, overfitting can easily occur, which hinders the deep learning network from detecting gestures of different people and from handling the influence of objects such as rings and bracelets. To improve the robustness of the network, so that it recognizes gestures of different people better and is less affected by such objects, noise is added. Specifically, a number of pixels (U) are randomly selected and scattered noise is added, with parameters such as the radius and the brightness attenuation from center to periphery determined randomly; a number of straight lines (V) are randomly generated and linear noise is added; and parameters such as the radius and brightness of curves are randomly generated to add curved noise, restoring the image to a synthesized image with more noise, as produced under external conditions.
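The noise post-processing can be sketched as follows. This is an illustrative sketch only: the function name, counts, and amplitude are hypothetical, brightened rows stand in for arbitrary straight lines, and curved noise is omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded for reproducibility of the sketch

def add_scatter_and_line_noise(img, n_points=20, n_lines=2, amplitude=30):
    """Add scattered noise at U random pixels and linear noise along V
    stand-in lines (here: rows), then clip back to the valid gray range."""
    out = img.astype(np.int16)   # widen so additions cannot wrap around
    h, w = out.shape
    # Scattered noise: bump n_points randomly chosen pixels.
    ys = rng.integers(0, h, n_points)
    xs = rng.integers(0, w, n_points)
    out[ys, xs] += amplitude
    # Linear noise: brighten n_lines random rows as stand-in lines.
    for r in rng.integers(0, h, n_lines):
        out[r, :] += amplitude // 2
    return np.clip(out, 0, 255).astype(np.uint8)
```

A fuller implementation would draw lines and curves at random angles with random radius and brightness profiles, as described above.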
After the post-processing, the final gesture synthesis data set is obtained. This data set can be used for training: if applied to gesture detection, it can serve as the detection data set together with the corresponding position coordinates L recorded in step 4; if applied to gesture classification, the corresponding regions can be cropped according to the position coordinates L recorded in step 4 to serve as the classification data set.
The invention also protects a corresponding gesture data set generating system, which obtains gesture synthesized data by using the gesture data set generation method described above and comprises:
the acquisition module is used for acquiring the original image;
the preprocessing module is used for preprocessing the original image;
the adjusting module is used for adjusting the processing result of the preprocessing module;
and the synthesis module, used for synthesizing the gesture target data to be synthesized with the background data to be synthesized.
The computer-readable storage medium of the present invention stores a computer program which, when run, executes the gesture data set generation method described above to obtain gesture synthesized data.
The foregoing description of the preferred embodiments of the invention is not intended to be limiting, but rather is intended to cover all modifications, equivalents, and alternatives falling within the spirit and principles of the invention.