WO2023061102A1 - Video behavior recognition method, apparatus, computer device and storage medium - Google Patents
Video behavior recognition method, apparatus, computer device and storage medium
- Publication number
- WO2023061102A1 (PCT/CN2022/116947, CN2022116947W)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- feature
- features
- time
- behavior recognition
- video
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/62—Extraction of image or video features relating to a temporal dimension, e.g. time-based feature extraction; Pattern tracking
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/74—Image or video pattern matching; Proximity measures in feature spaces
- G06V10/761—Proximity, similarity or dissimilarity measures
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/764—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/46—Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
Definitions
- the present application relates to the field of computer technology, in particular to a video behavior recognition method, device, computer equipment, storage medium and computer program product.
- Video behavior recognition is one of the important topics in the field of computer vision. Based on video behavior recognition, it is possible to recognize the action behavior of the target object in a given video, such as eating, running, talking and other actions.
- Behavior recognition is mostly performed by extracting features from videos, but the features extracted in traditional video behavior recognition processing cannot effectively reflect the behavior information in the video, resulting in low accuracy of video behavior recognition.
- a video behavior recognition method performed by a computer device, said method comprising:
- based on prior information, the temporal feature of the intermediate image feature and the cohesive feature corresponding to the temporal feature are fused to obtain a fusion feature; the prior information is obtained according to the change information of the intermediate image feature in the time dimension; the cohesive feature is obtained by performing attention processing on the temporal feature;
- a video behavior recognition device comprising:
- Video image feature extraction module for extracting video image features from at least two frames of target video images
- the spatial feature contribution adjustment module is used to adjust the contribution of the spatial feature of the video image feature to obtain the intermediate image feature
- the feature fusion module is used to fuse, based on prior information, the temporal features of the intermediate image features and the cohesive features corresponding to the temporal features to obtain fusion features; the prior information is obtained according to the change information of the intermediate image features in the time dimension; the cohesive features are obtained by performing attention processing on the temporal features;
- a temporal feature contribution adjustment module is used to adjust the temporal feature contribution to the fusion feature to obtain behavior recognition features
- the video behavior recognition module is used for performing video behavior recognition based on behavior recognition features.
- a computer device comprising a memory and a processor, the memory stores computer-readable instructions, and the processor implements the following steps when executing the computer-readable instructions:
- based on prior information, the temporal feature of the intermediate image feature and the cohesive feature corresponding to the temporal feature are fused to obtain a fusion feature; the prior information is obtained according to the change information of the intermediate image feature in the time dimension; the cohesive feature is obtained by performing attention processing on the temporal feature;
- a computer-readable storage medium on which computer-readable instructions are stored, and when the computer-readable instructions are executed by a processor, the following steps are implemented:
- based on prior information, the temporal feature of the intermediate image feature and the cohesive feature corresponding to the temporal feature are fused to obtain a fusion feature; the prior information is obtained according to the change information of the intermediate image feature in the time dimension; the cohesive feature is obtained by performing attention processing on the temporal feature;
- a computer program product comprising computer readable instructions which, when executed by a processor, implement the following steps:
- based on prior information, the temporal feature of the intermediate image feature and the cohesive feature corresponding to the temporal feature are fused to obtain a fusion feature; the prior information is obtained according to the change information of the intermediate image feature in the time dimension; the cohesive feature is obtained by performing attention processing on the temporal feature;
- Fig. 1 is an application environment diagram of the video behavior recognition method in an embodiment.
- Fig. 2 is a schematic flow chart of the video behavior recognition method in an embodiment.
- Fig. 3 is a schematic flow chart of cohesive processing of time features in an embodiment.
- Fig. 4 is a schematic structural diagram of the video behavior recognition model in an embodiment.
- Fig. 5 is a schematic flow chart of weighted fusion with structural parameters in an embodiment.
- Fig. 6 is a schematic diagram of the process of determining structural parameters in an embodiment.
- Fig. 7 is a schematic flow chart of feature fusion based on prior information in an embodiment.
- Fig. 8 is a schematic flow chart of high-cohesion processing in an embodiment.
- Fig. 9 is a structural block diagram of the video behavior recognition device in an embodiment.
- Fig. 10 is a diagram of the internal structure of the computer device in an embodiment.
- the video behavior recognition method provided in this application can be applied to the application environment shown in FIG. 1 .
- the terminal 102 communicates with the server 104 through the network.
- the terminal 102 can shoot the target object to obtain a video, and send the obtained video to the server 104
- the server 104 extracts at least two frames of target video images from the video, extracts video image features from them, and adjusts the contribution of the spatial features of those video image features to obtain intermediate image features.
- based on prior information obtained from the change information of the intermediate image features in the time dimension, the server 104 fuses the temporal features of the intermediate image features with the cohesive features obtained by performing attention processing on the temporal features.
- the server 104 then adjusts the temporal feature contribution of the resulting fusion features, performs video behavior recognition based on the resulting behavior recognition features, and may feed the video behavior recognition result back to the terminal 102.
- the video behavior recognition method can also be executed by the server 104 alone, for example, the server 104 can obtain at least two frames of target video images from the database, and perform video behavior recognition processing based on the obtained at least two frames of target video images.
- the video behavior recognition method can also be executed by the terminal 102. Specifically, after capturing the video, the terminal 102 extracts at least two frames of target video images from the captured video and performs video behavior recognition processing based on those target video images.
- the terminal 102 can be, but not limited to, various personal computers, notebook computers, smart phones, tablet computers, vehicle-mounted devices, and portable wearable devices
- the server 104 can be implemented by an independent server or a server cluster composed of multiple servers.
- a video behavior recognition method is provided, and the method is applied to the server 104 in FIG. 1 as an example for illustration, including the following steps:
- Step 202 extract video image features from at least two frames of target video images.
- the target video image is an image from a video that requires behavior recognition processing, and specifically may be an image extracted from a video that requires behavior recognition processing.
- the target video image may be an image extracted from the basketball video.
- the target video image has more than one frame, so that the video can be processed for behavior recognition based on the time information between frames.
- the target video image may be a multi-frame image continuously extracted from the video, for example, it may be 5 consecutive frames or 10 frames.
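- The frame selection described above can be sketched as follows. This is a minimal illustration only: the function name `sample_consecutive_frames`, the string stand-ins for decoded images, and the chosen window are assumptions, not part of the application.

```python
def sample_consecutive_frames(frames, start, count):
    """Return `count` consecutive frames beginning at index `start`."""
    if count < 2:
        # The method needs inter-frame time information, so one frame is not enough.
        raise ValueError("behavior recognition needs at least two frames")
    if start < 0 or start + count > len(frames):
        raise ValueError("requested window exceeds the video length")
    return frames[start:start + count]


# Stand-in for a decoded video; real frames would be image arrays.
video = [f"frame_{i}" for i in range(30)]
clip = sample_consecutive_frames(video, start=10, count=5)
```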
- the video image feature is obtained by feature extraction of the target video image, which is used to reflect the image characteristics of the target video image.
- the video image feature can be an image feature extracted by any of various image feature extraction methods; for example, each frame of target video image can be passed through an artificial neural network to extract its image features.
- the server 104 acquires at least two frames of target video images, the target video images are extracted from the video captured by the terminal 102, and the target video images may be multiple frames of images continuously extracted from the video.
- the server 104 extracts video image features from at least two frames of target video images.
- the server 104 may perform image feature extraction processing on at least two frames of target video images, such as inputting them into an artificial neural network, respectively, to obtain video image features corresponding to each frame of target video images.
- Step 204 adjusting the contribution of the spatial features of the video image features to obtain intermediate image features.
- the spatial feature is used to reflect the spatial information of the target video image, and the spatial information may include pixel value distribution information of each pixel in the target video image, that is, the characteristics of the image itself in the target video image.
- the spatial feature can characterize the static feature of the object included in the target video image.
- the spatial feature can be further extracted from the video image feature, so as to obtain the feature reflecting the spatial information in the target video image from the video image feature.
- feature extraction may be performed on the video image features in the spatial dimension to obtain the spatial features of the video image features.
- the contribution adjustment is used to adjust the contribution degree of the spatial feature.
- the contribution degree of the spatial feature refers to the influence degree of the spatial feature on the behavior recognition result when the video behavior recognition is performed based on the characteristics of the target video image.
- the greater the contribution of spatial features the greater the impact of spatial features on video behavior recognition processing, that is, the closer the result of video behavior recognition is to the behavior reflected by spatial features.
- the contribution adjustment can be realized by adjusting the spatial features through preset weight parameters to obtain intermediate image features, which are image features obtained after adjusting the contribution degree of the spatial features of video image features in video behavior recognition.
- the server 104 adjusts the contribution of the spatial features of the video image features corresponding to each frame of target video image. Specifically, the server 104 can perform spatial feature extraction on each video image feature to extract its spatial features, and then adjust the contribution of those spatial features based on the spatial weight parameters to obtain intermediate image features.
- the spatial weight parameter can be set in advance, specifically, it can be obtained through pre-training with video image samples carrying behavior labels.
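- As a minimal sketch of Step 204, assuming the contribution adjustment is a plain multiplication of each spatial feature vector by one scalar weight: the function name, the list-of-lists feature layout, and the value 0.5 are illustrative assumptions, and in the application the weight would come from pre-training rather than being hard-coded.

```python
def adjust_spatial_contribution(spatial_features, spatial_weight):
    """Scale every per-frame spatial feature vector by a single weight.

    spatial_features: list of per-frame feature vectors (lists of floats).
    spatial_weight:   hypothetical pre-trained spatial weight parameter.
    """
    return [[spatial_weight * value for value in frame]
            for frame in spatial_features]


spatial = [[1.0, 2.0], [3.0, 4.0]]          # two frames, two channels each
intermediate = adjust_spatial_contribution(spatial, spatial_weight=0.5)
```

- A larger weight makes the spatial (static appearance) information count for more in the final recognition result; a smaller weight suppresses it.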
- Step 206, based on the prior information, the temporal feature of the intermediate image feature and the cohesive feature corresponding to the temporal feature are fused to obtain the fusion feature; the prior information is obtained according to the change information of the intermediate image feature in the time dimension; the cohesive feature is obtained by performing attention processing on the temporal feature.
- the prior information reflects the prior knowledge of the target video image in the time dimension, and the prior information is obtained according to the change information of the intermediate image features in the time dimension, specifically according to the similarity of the intermediate image features in the time dimension.
- the prior information can include the weight parameters of each feature to be fused when performing feature fusion; the similarity in the time dimension can be calculated for the intermediate image features corresponding to each frame of target video image, and prior information including the weight parameters can be obtained from that similarity.
- the time feature is used to reflect the time information of the target video images in the video, and the time information may include the correlation information between the target video images in the video, that is, the characteristics of the time sequence of the target video images in the video.
- the temporal feature can characterize the dynamic feature of the object included in the target video image, so as to realize the dynamic behavior recognition of the object.
- the temporal features can be further extracted from the intermediate image features to obtain features reflecting temporal information in the target video image from the intermediate image features.
- feature extraction may be performed on the features of the intermediate image in the time dimension to obtain the temporal features of the features of the intermediate image.
- the cohesive feature corresponding to the temporal feature is obtained by paying attention to the temporal feature.
- attention processing means attending to those parts of the temporal feature that are conducive to video behavior recognition so as to highlight them, yielding cohesive features with low redundancy and strong cohesion.
- the algorithm based on the attention mechanism can pay attention to the time features of the intermediate image features, and obtain the cohesive features corresponding to the time features.
- because the cohesive feature is obtained by performing attention processing on the temporal feature, it has high cohesion: its focal temporal information is prominent, its redundancy is low, and its validity is high, so it can accurately express the information of the target video image in the time dimension, which is conducive to improving the accuracy of video behavior recognition.
- the time feature of the intermediate image feature and the cohesive feature corresponding to the time feature are fused through the prior information, so as to fuse the time feature and the cohesive feature according to the prior knowledge in the prior information to obtain the fusion feature.
- Fusion features are obtained by fusing temporal features and cohesive features according to the prior knowledge in the prior information, which ensures the cohesion of temporal information in the fusion features and enhances the expression of important features in the time dimension, thereby improving the accuracy of video behavior recognition.
- the prior information can include the weight parameters of each feature to be fused when performing feature fusion, that is, the weight parameters of the temporal feature and of the cohesive feature corresponding to the temporal feature; the temporal feature and its cohesive feature are weighted and fused through these weight parameters to obtain the fusion feature.
- the server 104 may acquire prior information, which is obtained according to the change information of the intermediate image features in the time dimension, specifically, according to the cosine similarity of the intermediate image features in the time dimension.
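- The following sketch shows one way prior information could be derived from cosine similarity in the time dimension. The text above only states that cosine similarity is used; the mapping from similarity to a pair of fusion weights (averaging adjacent-frame similarities and rescaling from [-1, 1] to [0, 1]) is an assumption made purely for illustration, as are the function names.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def prior_weights(frame_features):
    """Return (w_temporal, w_cohesive) from adjacent-frame similarity.

    ASSUMPTION: the mean similarity of neighbouring frames is rescaled to
    [0, 1] and used as the temporal weight; the cohesive weight is its
    complement. The application does not specify this mapping here.
    """
    if len(frame_features) < 2:
        raise ValueError("need at least two frames to measure change over time")
    sims = [cosine_similarity(frame_features[t], frame_features[t + 1])
            for t in range(len(frame_features) - 1)]
    mean_sim = sum(sims) / len(sims)
    w_temporal = (mean_sim + 1.0) / 2.0
    return w_temporal, 1.0 - w_temporal


features = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]   # three frames
w_temporal, w_cohesive = prior_weights(features)
```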
- the server 104 fuses the time features of the intermediate image features and the cohesive features corresponding to the time features based on the prior information.
- the server 104 may perform feature extraction in the time dimension on the intermediate image features to obtain the temporal features of the intermediate image features, and further determine the cohesive features corresponding to the temporal features.
- the cohesive features corresponding to the temporal features are obtained by performing attention processing on the temporal features.
- the server 104 may also perform attention processing on the temporal features based on an attention mechanism algorithm, so as to obtain the cohesive features corresponding to the temporal features.
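- As an illustration of attention processing over temporal features, the sketch below applies scaled dot-product self-attention across frames. The application only says "an attention mechanism algorithm"; using the temporal features directly as queries, keys and values (with no learned projections) is an assumption chosen to keep the example small.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cohesive_features(temporal):
    """Self-attention over time: each frame becomes a weighted mix of all frames.

    temporal: list of per-frame temporal feature vectors of equal length.
    ASSUMPTION: queries, keys and values are the temporal features themselves.
    """
    dim = len(temporal[0])
    output = []
    for query in temporal:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in temporal]
        weights = softmax(scores)
        output.append([sum(w * value[i] for w, value in zip(weights, temporal))
                       for i in range(dim)])
    return output


temporal = [[1.0, 2.0], [1.0, 2.0]]
cohesive = cohesive_features(temporal)
```

- Frames that resemble the query frame receive larger attention weights, so the output emphasizes the temporal information the frames share and damps the rest.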
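- the server 104 fuses the temporal features of the intermediate image features and the cohesive features corresponding to the temporal features according to the prior information; specifically, the temporal features and the cohesive features can be weighted and fused according to the weight parameters in the prior information to obtain the fusion features.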
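- The weighted fusion itself can be sketched as an element-wise weighted sum. The function name and the example weights 0.75 and 0.25 are hypothetical; in the method they would be supplied by the prior information.

```python
def fuse(temporal, cohesive, w_temporal, w_cohesive):
    """Element-wise weighted sum of temporal and cohesive features per frame."""
    if len(temporal) != len(cohesive):
        raise ValueError("temporal and cohesive features must cover the same frames")
    return [[w_temporal * t + w_cohesive * c for t, c in zip(t_row, c_row)]
            for t_row, c_row in zip(temporal, cohesive)]


fused = fuse(temporal=[[2.0, 4.0]], cohesive=[[4.0, 8.0]],
             w_temporal=0.75, w_cohesive=0.25)
```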
- Step 208 adjust the time feature contribution to the fusion feature to obtain the behavior recognition feature.
- the temporal feature contribution adjustment is used to adjust the contribution degree of the fusion feature in the time dimension.
- the contribution degree of the temporal feature refers to the degree to which the time-dimension characteristics of the fusion feature influence the behavior recognition result when video behavior recognition is performed based on the features of the target video image.
- the greater the contribution of the fusion feature in the time dimension, the greater its impact on video behavior recognition processing, that is, the closer the video behavior recognition result is to the behavior reflected by the fusion feature in the time dimension.
- the temporal feature contribution adjustment can specifically be realized by adjusting the features of the fusion feature in the time dimension through preset weight parameters to obtain behavior recognition features, which can be used for video behavior recognition.
- the server 104 adjusts the temporal feature contribution of the fusion feature. Specifically, the server 104 can adjust the contribution of the fusion feature in the time dimension according to the time weight parameter to obtain the behavior recognition features.
- the time weight parameter can be set in advance, specifically, it can be obtained through pre-training with video image samples carrying behavior labels.
- Step 210 perform video behavior recognition based on behavior recognition features.
- the behavior recognition feature is a feature used for video behavior recognition, specifically behavior classification can be carried out based on the behavior recognition feature, to determine the video behavior recognition result corresponding to the target video image.
- the server 104 can perform video behavior recognition based on the obtained behavior recognition features.
- the behavior recognition features can be input into a classifier for classification, and a video behavior recognition result can be obtained according to the classification result, so as to realize effective recognition of video behaviors.
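- A minimal sketch of the classification step, assuming a linear layer followed by softmax. The application does not specify the classifier; the weight matrix below is made up for illustration, and the labels reuse the example actions mentioned earlier (eating, running, talking).

```python
import math


def classify(feature, weights, labels):
    """Return (predicted_label, probabilities) for one behavior recognition feature.

    weights: one row of hypothetical classifier weights per label.
    """
    logits = [sum(w * x for w, x in zip(row, feature)) for row in weights]
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(labels)), key=probs.__getitem__)
    return labels[best], probs


labels = ["eating", "running", "talking"]
weights = [[0.1, 0.0], [2.0, 0.0], [0.5, 0.0]]   # illustrative values only
label, probs = classify([1.0, 0.0], weights, labels)
```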
- in the above video behavior recognition method, the contribution of the spatial features of the video image features extracted from at least two frames of target video images is adjusted; based on prior information obtained from the change information of the resulting intermediate image features in the time dimension, the temporal features of the intermediate image features are fused with the cohesive features obtained by performing attention processing on the temporal features; the temporal feature contribution of the resulting fusion features is then adjusted, and video behavior recognition is performed based on the resulting behavior recognition features.
- adjusting the contribution of the spatial features of the video image features and the temporal feature contribution of the fusion features adjusts the contribution of temporal information and spatial information in the behavior recognition features, which enhances the ability of the behavior recognition features to express behavior information.
- fusing the temporal features with the cohesive features based on the prior information effectively focuses the temporal information in the behavior recognition features, so that the behavior recognition features effectively reflect the behavior information in the video, thereby improving the accuracy of video behavior recognition.
- adjusting the contribution of the spatial features of the video image features to obtain the intermediate image features includes: performing spatial feature extraction on the video image features to obtain the spatial features of the video image features; and adjusting the contribution of the spatial features through the spatial structure parameters in the structural parameters to obtain the intermediate image features; the structural parameters are obtained through training with video image samples carrying behavior labels.
- spatial feature extraction is used to extract spatial features from video image features to adjust the contribution of spatial features.
- Spatial feature extraction can be realized through a feature extraction module, for example, the convolution module in a convolutional neural network model can be used to perform convolution operations on video image features to achieve spatial feature extraction.
- the structural parameters may include weight parameters to adjust the weight of various operations on image features.
- the structural parameters can be the weight parameters of the various operations defined in the operation space of the convolutional neural network model, for example weight parameters that weight operations such as convolution, sampling, and pooling.
- Structural parameters can include spatial structure parameters and temporal structure parameters, which adjust the contribution of the spatial features in the spatial dimension and of the temporal features in the temporal dimension respectively. This adjusts the spatiotemporal information in the video image features and enhances the ability of the behavior recognition features to express behavior information, which is beneficial to improving the accuracy of video behavior recognition.
- Structural parameters can be pre-trained through video image samples carrying behavior tags.
- Video image samples can be video images carrying behavior tags. Based on video image samples, structural parameters can be trained to effectively adjust the weight of various operations.
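- To illustrate how structural parameters could weight the operations of an operation space, the sketch below mixes the outputs of two candidate operations using softmax-normalized weights, in the style of differentiable architecture search. The softmax normalization, the two candidate operations, and all names are assumptions for illustration; the application only states that structural parameters weight operations such as convolution, sampling, and pooling.

```python
import math


def identity(x):
    """Candidate operation 1: pass the 1-D feature through unchanged."""
    return list(x)


def avg_pool3(x):
    """Candidate operation 2: width-3 average pooling, edges use a shorter window."""
    out = []
    for i in range(len(x)):
        window = x[max(0, i - 1): i + 2]
        out.append(sum(window) / len(window))
    return out


def mixed_operation(x, operations, structural_params):
    """Weighted sum of candidate operation outputs.

    ASSUMPTION: structural parameters are turned into weights with softmax.
    """
    peak = max(structural_params)
    exps = [math.exp(a - peak) for a in structural_params]
    total = sum(exps)
    weights = [e / total for e in exps]
    outputs = [op(x) for op in operations]
    return [sum(w * out[i] for w, out in zip(weights, outputs))
            for i in range(len(x))]


mixed = mixed_operation([0.0, 3.0, 0.0], [identity, avg_pool3], [0.0, 0.0])
```

- With equal structural parameters both operations contribute equally; training would shift the parameters so that more useful operations dominate the mix.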
- the server 104 performs spatial feature extraction on the video image features corresponding to each frame of the target video image.
- the video image features can be extracted through the pre-trained video behavior recognition model.
- the spatial features of video image features are extracted through the convolutional layer structure in the video behavior recognition model, and the spatial features of video image features are obtained.
- the server 104 determines the structural parameters obtained through training with video image samples carrying behavior labels, and adjusts the contribution of the spatial features through the spatial structure parameters in the structural parameters.
- the spatial structure parameter can be a weight parameter; the server 104 can weight the spatial features with the weight parameter corresponding to the spatial structure parameter, thereby adjusting the degree to which the spatial features of the video image features influence the recognition result during video behavior recognition, so as to adjust the contribution of the spatial features and obtain the intermediate image features.
- the intermediate image feature is the image feature obtained after adjusting the contribution degree of the spatial features of the video image features in video behavior recognition.
- time feature contribution adjustment is performed on the fusion feature to obtain the behavior recognition feature, including: adjusting the contribution of the fusion feature through the time structure parameter in the structure parameter to obtain the behavior recognition feature.
- the structural parameters may be weight parameters of various operations defined in the operation space of the convolutional neural network model, and the structural parameters include time structure parameters for adjusting the contribution to the characteristics of the time dimension. Specifically, after obtaining the fusion feature, the server 104 adjusts the time feature contribution of the fusion feature through the time structure parameter in the structure parameter to obtain the behavior recognition feature used for video behavior processing.
- the time structure parameter can be a weight parameter; the server 104 can weight the fusion feature with the weight parameter corresponding to the time structure parameter, thereby adjusting the degree to which the time-dimension characteristics of the fusion feature influence the recognition result during video behavior recognition, so as to adjust the contribution of the fusion feature in the time dimension and obtain the behavior recognition features.
- the server 104 can perform video behavior recognition processing based on the obtained behavior recognition features to obtain the video behavior recognition result.
- the spatial structure parameters and temporal structure parameters in the structural parameters, obtained through training with video image samples carrying behavior labels, are used to adjust the contribution of the spatial features of the video image features and of the fusion features in their respective feature dimensions. Adjusting the contribution of temporal and spatial information in the behavior recognition features in this way achieves effective entanglement of spatiotemporal features and makes the spatiotemporal features of the behavior recognition features more expressive, that is, it enhances their ability to express behavior information and thereby improves the accuracy of video behavior recognition.
- the video behavior recognition method further includes: determining the structural parameters to be trained; adjusting, through the spatial structure parameters in the structural parameters to be trained, the contribution of the spatial sample features of the video image sample features to obtain intermediate sample features, the video image sample features being extracted from the video image samples;
- fusing, based on prior sample information, the temporal sample features of the intermediate sample features and the cohesive sample features corresponding to the temporal sample features to obtain fusion sample features, the cohesive sample features being obtained by performing attention processing on the temporal sample features and the prior sample information being obtained according to the change information of the intermediate sample features in the time dimension;
- adjusting the contribution of the fusion sample features through the time structure parameters in the structural parameters to be trained to obtain behavior recognition sample features; and performing video behavior recognition based on the behavior recognition sample features, updating the structural parameters to be trained according to the behavior recognition result and the behavior labels corresponding to the video image samples, and continuing training until training ends to obtain the structural parameters.
- training is carried out through video image samples carrying behavior labels, and structural parameters including temporal structural parameters and spatial structural parameters are obtained at the end of the training.
- the structure parameter to be trained may be an initial value during each iterative training, and the spatial structure parameter in the structure parameter to be trained is used to adjust the contribution of the spatial sample feature of the video image sample feature to obtain the intermediate sample feature.
- the intermediate sample feature is the result of adjusting the contribution of the spatial sample feature of the video image sample feature.
- the video image sample feature is extracted from the video image sample. Specifically, feature extraction can be performed on the video image sample through an artificial neural network model to obtain the video image sample features of the video image sample.
- the prior sample information is obtained based on the change information of the intermediate sample features in the time dimension, which can be obtained according to the similarity of the intermediate sample features in the time dimension; the cohesive sample features are obtained by paying attention to the time sample features, specifically based on the attention The mechanism pays attention to the time sample features and obtains the cohesive sample features corresponding to the time sample features.
- the fusion sample features are obtained by fusing the time sample features of the intermediate sample features and the cohesive sample features corresponding to the time sample features according to the prior sample information.
- the cohesive sample features are weighted and fused to obtain the fused sample features.
- Behavior recognition sample features are used for video behavior recognition processing, which is obtained by adjusting the contribution of the time structure parameters in the structure parameters to be trained to the fusion sample features. The degree of contribution of the feature of dimension in the process of video action recognition. Behavior recognition results are obtained through video behavior recognition based on behavior recognition sample features.
- the structural parameters to be trained can be evaluated, and the structural parameters to be trained are updated according to the evaluation results, and then iterative training continues until At the end of the training, if the number of training times reaches the preset training times threshold, the behavior recognition results meet the recognition accuracy requirements, and the objective function meets the end conditions, etc., the structural parameters of the training completion can be obtained after the training is completed. Based on the structural parameters of the training completion, the video image features can be analyzed. Spatial features and fused features are respectively adjusted for contribution to realize video action recognition processing.
- The structural parameters can be trained by the server 104, or trained by another training device and then transplanted to the server 104.
- Taking training by the server 104 as an example, the server 104 determines the structural parameters to be trained, which are the initial values for the current training iteration.
- Using the spatial structural parameters among the structural parameters to be trained, the server 104 adjusts the contribution of the spatial sample features of the video image sample features to obtain the intermediate sample features.
- The server 104 fuses the time sample features of the intermediate sample features with the cohesive sample features corresponding to the time sample features, based on the prior sample information, to obtain the fused sample features.
- After obtaining the fused sample features, the server 104 adjusts their contribution through the time structural parameters among the structural parameters to be trained to obtain the behavior recognition sample features, and performs video behavior recognition based on the behavior recognition sample features to obtain the behavior recognition result.
- The server 104 updates the structural parameters to be trained based on the behavior recognition result and the behavior labels corresponding to the video image samples, and continues iterative training with the updated structural parameters until the training end condition is met, obtaining the structural parameters.
- The structural parameters can be used to weight and adjust the operations applied to the features of the target video image in the spatio-temporal dimensions during video behavior recognition, so as to realize effective entanglement of the spatio-temporal features of the target video image and enhance the behavior information expressiveness of the behavior recognition features, thereby improving the accuracy of video behavior recognition.
- In this embodiment, the structural parameters obtained through training realize effective entanglement of the spatio-temporal features of the target video image and enhance the behavior information expressiveness of the behavior recognition features, thereby improving the accuracy of video behavior recognition.
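As a rough illustration of how structural parameters might weight the operations applied to a feature, the sketch below mixes the outputs of several candidate operations using normalized weights. The softmax normalization, the three candidate operations, and the function name `adjust_contribution` are assumptions made for illustration; the text only states that the structural parameters weight and adjust the operations.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def adjust_contribution(feature, operations, structural_params):
    """Weighted sum of operation outputs; weights are softmax(structural_params).

    Assumption: each operation maps a feature vector to a vector of equal length.
    """
    weights = softmax(structural_params)
    outputs = [op(feature) for op in operations]
    return [
        sum(w * out[i] for w, out in zip(weights, outputs))
        for i in range(len(feature))
    ]


# Hypothetical candidate operations on a 1-D feature vector.
def identity(x):
    return list(x)


def double(x):
    return [2.0 * v for v in x]


def zero(x):
    return [0.0 for _ in x]


feature = [1.0, 2.0, 3.0]
spatial_params = [0.0, 0.0, 0.0]  # equal structural parameters give equal weights
intermediate = adjust_contribution(feature, [identity, double, zero], spatial_params)
```

With equal parameters each operation receives weight 1/3, so the mixed output here equals the input feature; training would shift the weights toward the operations that help recognition.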
- The video behavior recognition method is implemented by a video behavior recognition model, and the structural parameters to be trained are parameters of the video behavior recognition model during training.
- Updating the structural parameters to be trained and continuing training until training ends to obtain the structural parameters includes: obtaining the behavior recognition result output by the video behavior recognition model; determining the difference between the behavior recognition result and the behavior label corresponding to the video image sample; updating the model parameters of the video behavior recognition model and the structural parameters to be trained according to the difference; and continuing training based on the updated video behavior recognition model until training ends, and obtaining the structural parameters from the trained video behavior recognition model.
- the video behavior recognition method is implemented through a video behavior recognition model, that is, the steps of the video behavior recognition method are realized through a pre-trained video behavior recognition model.
- The video behavior recognition model can be an artificial neural network model based on any of various neural network algorithms, such as a convolutional neural network model, a deep learning network model, a recurrent neural network model, a perceptron network model, or a generative adversarial network model.
- The structural parameters to be trained are parameters of the video behavior recognition model during training; that is, the structural parameters are the parameters in the video behavior recognition model that adjust the contribution of the model's operations.
- The behavior recognition result is the recognition result obtained by performing video behavior recognition based on the behavior recognition sample features.
- the model performs video behavior recognition based on the target video image, and outputs behavior recognition results.
- the difference between the behavior recognition result and the behavior label corresponding to the video image sample can be determined by comparing the behavior recognition result and the behavior label.
- Model parameters refer to the parameters corresponding to the network structure of each layer in the video behavior recognition model.
- model parameters may include, but are not limited to, various parameters such as convolution kernel parameters, pooling parameters, and up-down sampling parameters for each layer of convolution.
- By updating the model parameters and the structural parameters to be trained in the video behavior recognition model according to the difference between the behavior recognition result and the behavior label, joint training of the model parameters and the structural parameters in the video behavior recognition model is realized.
- When the trained video behavior recognition model is obtained at the end of training, the structural parameters can be determined from the trained video behavior recognition model.
- The server 104 jointly trains the model parameters and the structural parameters through the video behavior recognition model, and the trained structural parameters can be determined from the trained video behavior recognition model. Specifically, after the server 104 inputs a video image sample into the video behavior recognition model, the model performs video behavior recognition and outputs a behavior recognition result. The server 104 determines the difference between the behavior recognition result output by the model and the behavior label corresponding to the video image sample, and updates the parameters of the video behavior recognition model according to the difference, specifically updating both the model parameters and the structural parameters to be trained, to obtain the updated video behavior recognition model.
- The server 104 continues training on video image samples based on the updated video behavior recognition model until training ends, for example when the training end condition is met, and obtains the trained video behavior recognition model.
- The server 104 can determine the trained structural parameters from the trained video behavior recognition model. The trained structural parameters can apply weight adjustments to the operations of each layer of the network structure in the video behavior recognition model, so as to adjust the degree to which each layer contributes to video behavior recognition and obtain expressive features for video behavior recognition, which improves the accuracy of video behavior recognition.
- In this embodiment, the model parameters and structural parameters are jointly trained through the video behavior recognition model, and the trained structural parameters can be determined from the trained model. The trained structural parameters realize effective entanglement of the spatio-temporal features of the target video image and enhance the behavior information expressiveness of the behavior recognition features, thereby improving the accuracy of video behavior recognition.
- Updating the structural parameters to be trained and continuing training until training ends to obtain the structural parameters includes: determining the behavior recognition loss between the behavior recognition result and the behavior label corresponding to the video image sample; obtaining a reward value according to the behavior recognition loss and the previous behavior recognition loss; and updating the structural parameters to be trained according to the reward value, and continuing training with the updated structural parameters until the objective function meets the end condition, obtaining the structural parameters. The objective function is obtained based on the reward values produced during training.
- the behavior recognition loss is used to represent the degree of difference between the behavior recognition result and the behavior label corresponding to the video image sample, and the form of the behavior recognition loss can be set according to actual needs, such as cross-entropy loss.
- the previous behavior recognition loss is the behavior recognition loss correspondingly determined for the previous frame of video image samples.
- the reward value is used to update the structural parameters to be trained.
- the reward value is determined according to the behavior recognition loss and the previous behavior recognition loss.
- the reward value can guide the structural parameters to be trained to update in the direction that meets the training requirements. After the structural parameters to be trained are updated, the training is continued through the updated structural parameters to be trained until the objective function meets the end condition, and the training is ended, and the trained structural parameters are obtained.
- the objective function is obtained based on the reward values in the training process, that is, the objective function is obtained according to the reward values corresponding to each frame of video image samples.
- The objective function can be constructed as the sum of the reward values corresponding to each frame of video image samples, so that the end of structural parameter training can be judged according to the objective function and structural parameters that meet the contribution adjustment requirements can be obtained.
- The server 104 performs video behavior recognition based on the behavior recognition sample features. After obtaining the behavior recognition result, the server 104 determines the behavior recognition loss between the behavior recognition result and the behavior label corresponding to the video image sample; for example, the cross-entropy loss between the behavior recognition result and the behavior label can be used as the behavior recognition loss. The server 104 obtains a reward value based on this behavior recognition loss and the previous behavior recognition loss corresponding to the previous frame of video image samples. Specifically, the reward value can be determined from the difference between the behavior recognition loss and the previous behavior recognition loss.
- The server 104 updates the structural parameters to be trained according to the reward value.
- The structural parameters to be trained can be updated according to the sign or the magnitude of the reward value to obtain the updated structural parameters to be trained.
- The server 104 continues training with the updated structural parameters to be trained until the objective function meets the end condition, then ends training and obtains the structural parameters. The objective function is obtained based on the reward values produced during training.
- The objective function can be constructed as the sum of the reward values corresponding to each frame of video image samples, and the end of structural parameter training is judged according to the objective function. For example, training ends when the objective function reaches an extreme value, yielding structural parameters that meet the contribution adjustment requirements.
- The reward value is obtained from the difference between the behavior recognition losses corresponding to successive frames of video image samples, and the behavior recognition loss is determined from the behavior recognition result and the behavior label corresponding to the video image sample. After the structural parameters to be trained are updated by the reward value, training continues until the objective function obtained from the reward values corresponding to each frame of video image samples meets the end condition; training then ends and the trained structural parameters are obtained.
- Updating the structural parameters to be trained with a reward value obtained from the difference between the behavior recognition losses of successive frames of video image samples can improve the training efficiency of the structural parameters.
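The reward and objective described above can be sketched as follows. The sign convention (reward is positive when the loss decreases), the toy predictions, and the helper names are assumptions for illustration; the text only says the reward is determined from the difference between the current and previous behavior recognition losses, and that the objective can be the sum of the reward values.

```python
import math


def cross_entropy(probabilities, label):
    """Cross-entropy loss for one sample given class probabilities and the true label."""
    return -math.log(max(probabilities[label], 1e-12))


def reward_from_losses(loss, previous_loss):
    """Assumed convention: a drop in loss relative to the previous frame is rewarded."""
    return previous_loss - loss


def objective(rewards):
    """Objective built as the sum of per-frame reward values."""
    return sum(rewards)


# Hypothetical per-frame class probabilities; the true behavior label is class 0.
predictions = [[0.5, 0.5], [0.7, 0.3], [0.9, 0.1]]
labels = [0, 0, 0]

losses = [cross_entropy(p, y) for p, y in zip(predictions, labels)]
rewards = [reward_from_losses(losses[i], losses[i - 1]) for i in range(1, len(losses))]
total = objective(rewards)
```

Under this convention the summed rewards telescope to the first loss minus the last loss, so maximizing the objective corresponds to reducing the recognition loss over the course of training.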
- updating the structural parameters to be trained according to the reward value includes: updating the model parameters of the policy gradient network model according to the reward value; and updating the structural parameters to be trained by the updated policy gradient network model.
- The policy gradient (Policy Gradient) network model is a network model based on the policy gradient method; its input is a state and its output is an action.
- The policy gradient network model can take an action according to the current state so as to obtain a higher reward value.
- The model parameters of the policy gradient network model can be regarded as the state, and the structural parameters that the policy gradient network model outputs from the input structural parameters in this state are the action. The policy gradient network model can thus predict the next action, that is, the next structural parameters, from the input structural parameters and the current model parameters, which realizes the update of the structural parameters during training.
- The server 104 updates the model parameters of the policy gradient network model according to the reward value, specifically by adjusting each model parameter of the policy gradient network model based on the reward value, so that the updated policy gradient network model can perform the next structural parameter prediction.
- the server 104 uses the updated policy gradient network model to update the structural parameters to be trained.
- The updated policy gradient network model can perform structural parameter prediction based on the updated network state and the structural parameters to be trained to obtain predicted structural parameters; the structural parameters predicted by the policy gradient network model are the updated values of the structural parameters to be trained.
- In this embodiment, the policy gradient network model is updated according to the reward value, and the structural parameters to be trained are updated through the updated policy gradient network model.
- Optimizing the structural parameters through the policy gradient method can ensure the training quality of the structural parameters, which is beneficial to improving the accuracy of video behavior recognition.
- Updating the structural parameters to be trained through the updated policy gradient network model includes: performing, through the updated policy gradient network model, structural parameter prediction based on the updated model parameters and the structural parameters to be trained to obtain predicted structural parameters; and obtaining, according to the predicted structural parameters, the updated values of the structural parameters to be trained.
- the updated policy gradient network model is obtained by updating the model parameters of the policy gradient network model, that is, after adjusting and updating the model parameters of the policy gradient network model through reward values, an updated policy gradient network model is obtained.
- After the policy gradient network model is updated, the server takes the model parameters of the updated policy gradient network model as the state and predicts the structural parameters in this state; specifically, structural parameter prediction is performed based on the updated model parameters and the structural parameters to be trained to obtain the predicted structural parameters.
- the server uses the current network state of the updated policy gradient network model to predict the structural parameters by using the structural parameters to be trained, and obtains the predicted structural parameters.
- the server updates the structural parameters according to the predicted structural parameters, and obtains the structural parameters after the structural parameters to be trained are updated.
- the server may directly use the predicted structural parameters output by the updated policy gradient network model through structural parameter prediction as the structural parameters after the structural parameters to be trained are updated, so as to realize the updating of the structural parameters to be trained.
- In this embodiment, the server performs structural parameter prediction on the structural parameters to be trained through the updated policy gradient network model and obtains their updated values from the predicted structural parameters. Optimizing the structural parameters by means of the policy gradient can ensure the training quality of the structural parameters and is beneficial to improving the accuracy of video behavior recognition.
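A minimal sketch of this update loop is given below, assuming a linear Gaussian policy in place of the policy gradient network model and a REINFORCE-style update. The class `GaussianPolicy`, the learning rate, the noise scale, and the fixed reward value are all hypothetical; the text does not specify the network architecture or the exact update rule.

```python
import random


class GaussianPolicy:
    """Hypothetical stand-in for the policy gradient network model.

    Its model parameters (w, b) play the role of the state; given the current
    structural parameters it proposes the next structural parameters (the action).
    """

    def __init__(self, size, sigma=0.1):
        self.w = [1.0] * size
        self.b = [0.0] * size
        self.sigma = sigma

    def mean(self, params):
        """Deterministic prediction of the next structural parameters."""
        return [w * p + b for w, p, b in zip(self.w, params, self.b)]

    def sample(self, params, rng):
        """Stochastic action: Gaussian noise around the predicted mean."""
        return [rng.gauss(m, self.sigma) for m in self.mean(params)]

    def update(self, params, action, reward, lr=0.01):
        """REINFORCE step: theta += lr * reward * grad log pi(action | params)."""
        means = self.mean(params)
        for i, (m, a, p) in enumerate(zip(means, action, params)):
            grad_mean = (a - m) / self.sigma ** 2  # d log N(a; m, sigma^2) / d m
            self.w[i] += lr * reward * grad_mean * p
            self.b[i] += lr * reward * grad_mean


rng = random.Random(0)
policy = GaussianPolicy(size=2)
structural_params = [0.5, -0.5]

action = policy.sample(structural_params, rng)    # candidate structural parameters
reward = 0.2                                      # e.g. previous loss minus current loss
policy.update(structural_params, action, reward)  # update the policy with the reward
structural_params = policy.mean(structural_params)  # predicted (updated) parameters
```

A positive reward moves the policy's prediction toward the sampled action, and a zero reward leaves the policy unchanged.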
- the video behavior recognition method further includes: determining the similarity of intermediate image features in the time dimension; and correcting the initial prior information based on the similarity to obtain the prior information.
- the time dimension is the dimension of the sequence of each frame of the target video image in the video to which it belongs. According to the time feature of the time dimension, it can assist in the accurate identification of the video behavior.
- the similarity can represent the distance between each feature. The higher the similarity, the closer the distance.
- the similarity of the intermediate image features in the time dimension can reflect the change degree of the intermediate image features in the time dimension.
- The initial prior information may be preset prior information, specifically prior information obtained in advance by training on sample data. The initial prior information is corrected according to the similarity, so that the fusion of the temporal features and cohesive features of the intermediate image features can be weighted and adjusted according to the degree of change of each frame of the target video image in the time dimension. This enhances the cohesion of the fusion features, that is, it highlights the focal features of the fusion features and reduces their redundant information.
- the initial prior information can be corrected according to the change degree of each frame of the target video image in the time dimension, to obtain the corresponding prior information.
- The server 104 determines the similarity of the intermediate image features in the time dimension. Specifically, the cosine similarity can be calculated in the time dimension for the intermediate image features corresponding to each frame of the target video image, and the degree of change of each frame of the target video image in the time dimension can be measured by the cosine similarity.
- The server 104 corrects the initial prior information according to the similarity of the intermediate image features in the time dimension.
- The initial prior information can be divided into positive and negative parameters based on the similarity. After the initial prior information is corrected by the positive and negative parameters, the corrected initial prior information is combined with the initial prior information through a residual connection to obtain the prior information.
- In this embodiment, the initial prior information is corrected according to the similarity of the intermediate image features in the time dimension, which reflects the degree of change of each frame of the target video image in the time dimension. The corresponding prior knowledge is thereby obtained from the degree of change of each frame in the time dimension, so that the temporal features and cohesive features can be fused based on the prior knowledge and the temporal information in the behavior recognition features can be effectively focused. The resulting behavior recognition features effectively reflect the behavior information in the video, which improves the accuracy of video behavior recognition.
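The cosine similarity measurement can be sketched as follows. Comparing each frame with the next frame is an assumption about how the similarity is taken "in the time dimension", and the function names and toy frame features are illustrative only.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def temporal_similarity(frame_features):
    """Similarity between each pair of consecutive frames (assumed pairing)."""
    return [
        cosine_similarity(frame_features[t], frame_features[t + 1])
        for t in range(len(frame_features) - 1)
    ]


# Hypothetical per-frame intermediate features: frames 0 and 1 are identical,
# frame 2 is orthogonal to frame 1.
frames = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
similarities = temporal_similarity(frames)
```

A similarity near 1 indicates that consecutive frames barely change, while a similarity near 0 indicates a large change in the time dimension.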
- The initial prior information includes a first initial prior parameter and a second initial prior parameter. Correcting the initial prior information based on the similarity to obtain the prior information includes: dynamically adjusting the similarity according to the first initial prior parameter, the second initial prior parameter, and a preset threshold; correcting the first initial prior parameter and the second initial prior parameter respectively through the dynamically adjusted similarity to obtain a first prior parameter and a second prior parameter; and obtaining the prior information according to the first prior parameter and the second prior parameter.
- The first initial prior parameter and the second initial prior parameter included in the initial prior information serve respectively as the fusion weight parameters of the time feature of the intermediate image feature and of the cohesive feature.
- the preset threshold can be dynamically set according to actual needs, so as to dynamically correct prior information according to actual needs.
- The prior information includes the first prior parameter and the second prior parameter, which serve respectively as the fusion weight parameters of the time feature of the intermediate image feature and of the cohesive feature.
- The server 104 determines a preset threshold and dynamically adjusts the similarity according to the first initial prior parameter, the second initial prior parameter, and the preset threshold.
- The server 104 corrects the first initial prior parameter and the second initial prior parameter in the initial prior information through the dynamically adjusted similarity to obtain the first prior parameter and the second prior parameter, and obtains the prior information from the first prior parameter and the second prior parameter.
- The prior information can be used to perform weighted fusion of the temporal features and cohesive features, so that they are fused according to the prior knowledge in the prior information to obtain the fusion features. Fusing the time features and cohesive features in this way ensures the cohesion of the time information in the fusion features and enhances the expression of important features in the time dimension, thereby improving the accuracy of video behavior recognition.
- In this embodiment, the first initial prior parameter and the second initial prior parameter are respectively corrected based on the dynamically adjusted similarity to obtain the first prior parameter and the second prior parameter, and the prior information is obtained from the first prior parameter and the second prior parameter.
- the obtained prior information reflects the prior knowledge of the target video image in the time dimension. Based on the prior information, the fusion of temporal features and cohesive features can effectively focus on the temporal information in the behavior recognition features, so that the obtained behavior recognition Features can effectively reflect the behavior information in the video, thus improving the accuracy of video behavior recognition.
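The text does not give the exact correction rule, so the sketch below uses a hypothetical one purely to show the data flow: the similarity is shifted by the preset threshold, the shift is applied with opposite signs to the two initial prior parameters, and the correction is added back to the initial values in residual form. The rule itself, the function names, and all numeric values are assumptions, not the method of the source.

```python
def correct_prior(initial_first, initial_second, similarity, threshold):
    """Hypothetical correction rule (not specified in the text).

    The threshold-shifted similarity raises the first prior parameter and
    lowers the second; each result is initial value plus correction (residual form).
    """
    adjusted = similarity - threshold
    first = initial_first + adjusted * initial_first
    second = initial_second - adjusted * initial_second
    return first, second


def fuse(time_feature, cohesive_feature, first, second):
    """Weighted fusion of the time feature and the cohesive feature."""
    return [first * t + second * c for t, c in zip(time_feature, cohesive_feature)]


# Illustrative values only.
first, second = correct_prior(0.5, 0.5, similarity=0.9, threshold=0.5)
fused = fuse([1.0, 2.0], [3.0, 4.0], first, second)
```

Whatever the actual rule, the two corrected parameters act as fusion weights on the time feature and the cohesive feature, as in the `fuse` function.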
- the video behavior recognition method further includes performing cohesive processing on temporal features to obtain corresponding cohesive features, specifically including:
- Step 302, determine the current basis vector.
- The current basis vector is the basis vector currently used to perform cohesive processing on the time feature; cohesive processing of the time feature can be realized through the basis vector.
- The server 104 determines the current basis vector, for example a basis vector of size B×C×K, where B is the batch size, C is the number of channels of the intermediate image features, and K is the dimension of the basis vector.
- Step 304, perform feature reconstruction on the time features of the intermediate image features through the current basis vector to obtain reconstructed features.
- The time feature is reconstructed through the current basis vector; the reconstructed feature can be obtained by fusing the current basis vector with the time feature of the intermediate image feature.
- The server 104 may perform matrix multiplication on the current basis vector and the time feature of the intermediate image feature and then perform normalized mapping, so as to reconstruct the time feature and obtain the reconstructed feature.
- Step 306, generate the basis vector for the next attention processing according to the reconstructed feature and the time feature.
- The basis vector for the next attention processing is the basis vector used the next time attention processing, that is, cohesive processing, is performed on the time features.
- The server 104 generates the basis vector for the next attention processing according to the reconstructed feature and the time feature; for example, matrix multiplication of the reconstructed feature and the time feature may be performed to obtain the basis vector for the next attention processing.
- The basis vector for the next attention processing is used, in the next attention processing, to perform feature reconstruction on the corresponding time features.
- Step 308, obtain the cohesive feature corresponding to the time feature according to the basis vector for the next attention processing, the current basis vector, and the time feature.
- After obtaining the basis vector for the next attention processing, the server 104 obtains the cohesive feature corresponding to the time feature according to the basis vector for the next attention processing, the current basis vector, and the time feature, so as to realize cohesive processing of the time feature.
- The basis vector for the next attention processing, the current basis vector, and the time feature may be fused to generate the cohesive feature corresponding to the time feature.
- In this embodiment, the time features of the intermediate image features are reconstructed through the basis vector, a new basis vector is generated from the reconstructed features and the time features, and the cohesive features corresponding to the time features are obtained from the new basis vector, the old basis vector, and the time features. Attention is thereby focused on the time features to highlight the important features in the time dimension, yielding highly cohesive features that accurately express the information of the target video image in the time dimension, which is conducive to improving the accuracy of video behavior recognition.
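Steps 302 to 308 can be sketched for a single sample as follows, with the batch dimension B omitted. Using a row-wise softmax as the normalized mapping, averaging the current and next basis vectors, and adding the projection back to the time feature are assumptions chosen to make the example concrete; the text only says these quantities are multiplied, normalized, and fused.

```python
import math


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def transpose(m):
    return [list(row) for row in zip(*m)]


def softmax_rows(m):
    """Row-wise softmax, used here as the assumed normalized mapping."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def attention_step(time_feature, basis):
    """One cohesive-processing step; time_feature is C x N, basis is C x K."""
    # Step 304: reconstruct the time feature against the current basis (N x K).
    reconstruction = softmax_rows(matmul(transpose(time_feature), basis))
    # Step 306: basis for the next attention processing (C x K).
    next_basis = matmul(time_feature, reconstruction)
    # Step 308 (assumed fusion): average both bases, project back, add the time feature.
    blended = [
        [0.5 * (b + n) for b, n in zip(row_b, row_n)]
        for row_b, row_n in zip(basis, next_basis)
    ]
    projected = matmul(blended, transpose(reconstruction))  # C x N
    cohesive = [
        [x + p for x, p in zip(row_x, row_p)]
        for row_x, row_p in zip(time_feature, projected)
    ]
    return reconstruction, next_basis, cohesive


time_feature = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]]  # C = 2 channels, N = 3 positions
basis = [[1.0, 0.0], [0.0, 1.0]]                   # Step 302: C = 2, K = 2
reconstruction, next_basis, cohesive = attention_step(time_feature, basis)
```

The cohesive feature keeps the shape of the time feature (C×N), while the basis keeps its shape (C×K) from one attention step to the next.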
- Generating the basis vector for the next attention processing according to the reconstructed feature and the time feature includes: fusing the reconstructed feature and the time feature to generate an attention feature; regularizing the attention feature to obtain a regularized feature; and performing a moving average update on the regularized feature to generate the basis vector for the next attention processing.
- The attention feature is obtained by fusing the reconstructed feature and the time feature; applying regularization and then a moving average update to the attention feature makes the update of the basis vector more stable.
- When generating the basis vector for the next attention processing according to the reconstructed feature and the time feature, the server 104 fuses the reconstructed feature and the time feature to obtain the attention feature.
- The server 104 further regularizes the attention feature; for example, L2 regularization may be applied to the attention feature to obtain the regularized feature.
- The server 104 performs a moving average update on the regularized feature to generate the basis vector for the next attention processing.
- A moving average can be used to estimate the local mean of a variable, so that the update of the variable depends on its historical values over a period of time.
- The basis vector for the next attention processing is the basis vector used the next time attention processing, that is, cohesive processing, is performed on the time features.
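The regularization and moving average update can be sketched as follows. Interpreting "L2 regularization" as L2 normalization of each basis column, and the momentum value 0.9, are assumptions; the attention feature values are illustrative.

```python
import math


def l2_normalize_columns(matrix, eps=1e-12):
    """Scale each column to unit L2 norm (assumed meaning of L2 regularization)."""
    columns = list(zip(*matrix))
    norms = [math.sqrt(sum(v * v for v in col)) + eps for col in columns]
    return [[v / norms[j] for j, v in enumerate(row)] for row in matrix]


def moving_average_update(current_basis, regularized, momentum=0.9):
    """Blend the current basis with the new estimate so updates stay smooth."""
    return [
        [momentum * c + (1.0 - momentum) * r for c, r in zip(row_c, row_r)]
        for row_c, row_r in zip(current_basis, regularized)
    ]


# Hypothetical attention feature (C x K) obtained by fusing the reconstructed
# feature with the time feature.
attention_feature = [[3.0, 0.0], [4.0, 2.0]]
regularized = l2_normalize_columns(attention_feature)

current_basis = [[1.0, 0.0], [0.0, 1.0]]
next_basis = moving_average_update(current_basis, regularized)
```

Because most of the weight stays on the current basis, a single noisy batch changes the basis only slightly, which is the stabilizing effect the text describes.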
- The dimensions of the current basis vector include the batch size, the number of channels of the intermediate image features, and the dimension of the basis vector. Performing feature reconstruction on the time feature of the intermediate image feature through the current basis vector to obtain the reconstructed feature includes: performing matrix multiplication and then normalized mapping on the current basis vector and the time feature of the intermediate image feature to obtain the reconstructed feature.
- The batch size is the amount of data processed in each batch during batch processing.
- The current basis vector may be of size B×C×K, where B is the batch size, C is the number of channels of the intermediate image features, and K is the dimension of the basis vector.
- When the server performs feature reconstruction on the time features of the intermediate image features, it can perform matrix multiplication between the current basis vector and the time features of the intermediate image features and apply normalized mapping to the result, so as to reconstruct the time features and obtain the reconstructed features.
- generating the basis vector for the next attention process according to the reconstruction feature and the time feature includes: performing matrix multiplication on the reconstruction feature and the time feature to obtain the basis vector for the next attention process.
- the server performs matrix multiplication processing on the reconstruction feature and the time feature to obtain the basis vector for the next attention process.
- the basis vector for the next attention process is used, when attention processing is next performed, to reconstruct the corresponding time feature.
- obtaining the cohesive feature corresponding to the time feature includes: fusing the basis vector for the next attention process, the current basis vector, and the time feature to obtain the cohesive feature corresponding to the time feature.
- the server fuses the basis vector for the next attention process, the current basis vector, and the time feature, so that the effective information of all three is combined into the cohesive feature corresponding to the time feature.
- the time feature of the intermediate image feature is reconstructed using a basis vector whose dimensions include the batch data size, the number of channels of the intermediate image feature, and the basis vector dimension: matrix multiplication and normalized mapping are performed in sequence to obtain the reconstructed feature, a new basis vector is generated by matrix multiplication of the reconstructed feature and the time feature, and the cohesive feature corresponding to the time feature is obtained by fusing the new basis vector, the old basis vector, and the time feature. Attending to the time feature in this way highlights the important features in the time dimension, and the resulting highly cohesive feature accurately expresses the information of the target video image in the time dimension, which helps improve the accuracy of video behavior recognition.
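
A single-sample sketch of the two matrix products may make the shapes concrete. It assumes that the normalized mapping is a softmax over the K basis dimensions (softmax is the mapping named later in the description) and drops the batch dimension B for brevity; the helper names and toy values are illustrative only, and the final fusion step is omitted because the text does not specify its form.

```python
import math


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax_rows(m):
    """Normalize each row so that its entries are positive and sum to 1."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def reconstruct(time_feature, basis):
    """(N x C) times (C x K), then softmax over K: the reconstructed feature, N x K."""
    return softmax_rows(matmul(time_feature, basis))


def update_basis(time_feature, reconstructed):
    """(C x N) times (N x K): the basis vector for the next attention process, C x K."""
    return matmul(transpose(time_feature), reconstructed)


time_feature = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # N = 3 positions, C = 2 channels
basis = [[1.0, 0.0], [0.0, 1.0]]                     # C = 2, K = 2
z = reconstruct(time_feature, basis)
new_basis = update_basis(time_feature, z)
```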
- fusing the time feature of the intermediate image feature and the cohesive feature corresponding to the time feature to obtain the fusion feature includes: determining the prior information; performing temporal feature extraction on the intermediate image feature to obtain the time feature of the intermediate image feature; and performing weighted fusion of the time feature and its corresponding cohesive feature through the prior information to obtain the fusion feature.
- the prior information reflects the prior knowledge of the target video image in the time dimension, and the prior information is obtained according to the change information of the intermediate image features in the time dimension, specifically according to the similarity of the intermediate image features in the time dimension.
- the time feature is used to reflect the time information of the target video image in the video.
- the time feature and the cohesive feature corresponding to the time feature are weighted and fused through the prior information. For example, when the prior information includes a first prior parameter and a second prior parameter, the two parameters are used to weight the time feature and its corresponding cohesive feature respectively, and the weighted results are fused to obtain the fusion feature.
- the server 104 determines the prior information, which is obtained according to the change information of the intermediate image features in the time dimension, and specifically can be obtained according to the similarity of the intermediate image features in the time dimension.
- the server 104 performs temporal feature extraction on the intermediate image features, specifically, feature extraction may be performed on the temporal dimension of the intermediate image features to obtain the temporal features of the intermediate image features.
- the server 104 performs weighted fusion on the time feature and the cohesive feature corresponding to the time feature based on the prior information to obtain the fusion feature.
- the fusion feature is obtained by fusing the time feature and the cohesive feature based on the prior knowledge in the prior information, which preserves the cohesion of the time information in the fusion feature and strengthens the expression of important features in the time dimension, thereby improving the accuracy of video behavior recognition.
- the prior information includes a first prior parameter and a second prior parameter; performing weighted fusion of the time feature and the cohesive feature corresponding to the time feature through the prior information to obtain the fusion feature includes: weighting the time feature by the first prior parameter to obtain a weighted time feature; weighting the cohesive feature corresponding to the time feature by the second prior parameter to obtain a weighted cohesive feature; and fusing the weighted time feature with the weighted cohesive feature to obtain the fusion feature.
- the first prior parameter and the second prior parameter serve as the weights of the time feature and of its corresponding cohesive feature, respectively.
- the server performs weighting processing on the time feature by using the first prior parameter in the prior information, and obtains the weighted time feature.
- the first prior parameter may be k1
- the time feature may be M
- the weighted time feature may be k1*M.
- the server weights the cohesive feature corresponding to the time feature using the second prior parameter in the prior information, and obtains the weighted cohesive feature.
- the second prior parameter may be k2
- the cohesive feature corresponding to the time feature may be N
- the weighted cohesive feature may be k2*N.
- the server fuses the weighted time features and the weighted cohesion features to obtain fusion features.
- the fusion features obtained by server fusion may be k1*M+k2*N.
- the fusion feature is obtained by fusing the time feature and the cohesive feature based on the first prior parameter and the second prior parameter in the prior information, which preserves the cohesion of the time information in the fusion feature and strengthens the expression of important features in the time dimension, thereby improving the accuracy of video behavior recognition.
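
The expression k1*M+k2*N given above can be written element-wise as in the following sketch. The function name and the numeric values are illustrative assumptions; in the described method k1 and k2 come from the prior information rather than being fixed constants.

```python
def weighted_fusion(time_feature, cohesive_feature, k1, k2):
    """Element-wise k1*M + k2*N for two features of equal length."""
    if len(time_feature) != len(cohesive_feature):
        raise ValueError("features must have the same length")
    return [k1 * m + k2 * n for m, n in zip(time_feature, cohesive_feature)]


M = [1.0, 2.0, 3.0]  # hypothetical time feature
N = [0.5, 0.5, 0.5]  # hypothetical cohesive feature
fused = weighted_fusion(M, N, k1=0.75, k2=0.25)
```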
- before the time feature of the intermediate image feature and the cohesive feature corresponding to the time feature are fused based on the prior information to obtain the fusion feature, the method further includes: standardizing the intermediate image feature to obtain a standardized feature; and performing nonlinear mapping on the standardized feature to obtain a mapped intermediate image feature.
- standardization normalizes the intermediate image features, which helps mitigate vanishing and exploding gradients and helps maintain the learning speed of the network. It can be implemented by batch normalization. Nonlinear mapping introduces nonlinearity into the intermediate image features, which makes their representation more flexible.
- the server 104 performs normalization processing on the intermediate image features.
- the intermediate image features can be standardized through a BN (Batch Normalization, batch normalization) layer structure to obtain standardized features.
- the server 104 performs nonlinear mapping on the standardized features, for example, an activation function may be used to perform nonlinear mapping on the standardized features to obtain mapped intermediate image features.
- fusing the time feature of the intermediate image feature and the cohesive feature corresponding to the time feature based on the prior information to obtain the fusion feature includes: fusing, based on the prior information, the time feature of the mapped intermediate image feature and the cohesive feature corresponding to that time feature to obtain the fusion feature; the prior information is obtained according to the change information of the mapped intermediate image feature in the time dimension.
- the server 104 fuses the time feature of the mapped intermediate image feature and the cohesive feature corresponding to the time feature based on the prior information to obtain the fusion feature.
- the prior information is obtained according to the change information of the mapped intermediate image features in the time dimension
- the cohesive feature is obtained by performing attention processing on the time feature of the mapped intermediate image feature.
- performing normalization processing on intermediate image features to obtain standardized features includes: performing normalization processing on intermediate image features through batch normalization layer structure to obtain standardized features.
- the batch normalization layer structure is a BN layer structure, which can standardize the intermediate image features in batches.
- the server can perform batch normalization processing on the intermediate image features through the batch normalization layer structure to obtain standardized features, thereby ensuring the processing efficiency of the standardization.
- performing nonlinear mapping according to the standardized features to obtain mapped intermediate image features includes: performing nonlinear mapping on the standardized features through an activation function to obtain mapped intermediate image features.
- the activation function is used to introduce nonlinear factors to achieve nonlinear mapping to standardized features.
- the specific form of the activation function can be set according to actual needs, for example, a ReLU function can be set, so that the server can perform non-linear mapping on the standardized features through the activation function to obtain the mapped intermediate image features.
- the intermediate image features are standardized and nonlinearly mapped through the batch normalization layer structure and the activation function, which strengthens their feature expression and improves processing efficiency.
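
The standardize-then-map step can be illustrated for one channel as follows. This is a simplified sketch: it normalizes a flat list of values with batch statistics and omits the running statistics a trained BN layer keeps for inference; the `scale` and `shift` arguments stand for the learnable affine parameters of batch normalization and their defaults are assumptions.

```python
import math


def batch_normalize(values, scale=1.0, shift=0.0, eps=1e-5):
    """Standardize values to zero mean and unit variance, then apply an affine map."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [scale * (v - mean) / math.sqrt(var + eps) + shift for v in values]


def relu(values):
    """ReLU activation: negative entries are set to zero."""
    return [max(0.0, v) for v in values]


channel = [1.0, 2.0, 3.0, 4.0]  # hypothetical values of one channel across a batch
standardized = batch_normalize(channel)
mapped = relu(standardized)
```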
- the present application also provides an application scenario, where the above video behavior recognition method is applied.
- the application of the video behavior recognition method in this application scenario is as follows:
- spatio-temporal information modeling is one of the core issues of video action recognition.
- mainstream methods mainly include behavior recognition methods based on two-stream networks and behavior recognition methods based on 3D (three-dimensional) convolutional networks.
- the former extracts RGB and optical flow features through two parallel networks, and the latter models temporal and spatial information simultaneously through 3D convolution.
- however, the large number of model parameters and the high computational cost limit their efficiency.
- subsequent improved methods mainly decompose the 3D convolution into a 2D spatial convolution and a 1D temporal convolution to model spatial and temporal information separately, thereby improving the efficiency of the model.
- a network structure search strategy can be used to adaptively adjust the weights of temporal and spatial information; according to their different contributions to behavior recognition, the deep correlation between temporal and spatial information is mined and their interaction is learned jointly. At the same time, a rhythm regulator is designed to obtain a highly cohesive expression of temporal information from the prior information of the action rhythm and the structural parameters of the temporal convolution, so as to adjust actions of different rhythms. This addresses the differences in feature expression caused by the same action being performed at different rhythms, and improves the accuracy of video action recognition.
- the video behavior recognition method includes: extracting video image features from at least two frames of target video images, for example by inputting the frames into an artificial neural network and taking the features it extracts; adjusting the contribution of the spatial features of the video image features, specifically through pre-trained structural parameters, to obtain intermediate image features; fusing, based on the prior information, the temporal features of the intermediate image features and the cohesive features corresponding to the temporal features, so that the rhythm regulator adjusts the rhythm of the behavior and the fusion feature is obtained; adjusting the contribution of the temporal features of the fusion feature, specifically through the structural parameters, to obtain the behavior recognition feature; and finally performing video behavior recognition based on the behavior recognition feature to obtain the behavior recognition result.
- the video behavior recognition method in this embodiment is implemented based on a video behavior recognition model, as shown in FIG. 4 , which is a schematic diagram of the network structure of the video behavior recognition model in this embodiment.
- X is the video image feature extracted from at least two frames of the target video image
- spatial feature extraction is performed on X by a 1×3×3 2D convolution to obtain the spatial feature
- the contribution of the spatial feature is adjusted by the spatial structure parameter α1 among the structural parameters to obtain the intermediate image feature.
- the intermediate image features are sequentially processed through batch normalization and nonlinear mapping of activation functions. Specifically, batch normalization and nonlinear mapping of intermediate image features can be realized through the BN layer structure and the ReLU layer structure.
- temporal feature extraction is performed on the mapped feature A by two 3×1×1 1D convolutions, one of which is a highly cohesive 1D convolution, so that the cohesive feature corresponding to the temporal feature of the intermediate image feature can be extracted.
- the outputs of the two branches are weighted by the weight parameters β1 and β2 in the prior information, respectively, and the weighted results of the two branches are fused.
- the weight parameters β1 and β2 can be structural parameters obtained by training the policy gradient Agent network.
- the initial weight parameters β1 and β2 are corrected by a residual, and the residual-corrected weight parameters β1 and β2 weight the extraction results of the 1D convolutions. After the results of the two 1D convolution branches are fused, the temporal feature contribution of the fusion feature is adjusted by the temporal structure parameter α2 among the structural parameters, and the behavior recognition feature is obtained by downsampling the contribution-adjusted fusion feature.
- the behavior recognition feature is used for video behavior recognition to obtain the behavior recognition result.
- a structural parameter is the weight parameter of an operation, such as a convolution, defined in the operation space; it is a concept from network structure search.
- the structural parameters corresponding to the temporal and spatial convolutions to be fused, namely α1 and α2, can be optimized and updated by two structural parameter update methods, the differential method and the policy gradient method; the outputs of the high-cohesion temporal convolution module and the 1D temporal convolution module can likewise be fused by weighting with the pre-trained structural parameters β1 and β2.
- the structural parameters for fusing the temporal and spatial convolutions include α1 and α2
- the structural parameters for weighted fusion of the two temporal convolution branches include β1 and β2.
- spatial feature extraction is performed on the video image features extracted from the target video image by a 1×d×d 2D convolution, and the contribution of the extraction result is adjusted by the spatial structure parameter α1.
- the feature extraction result is multiplied by the structural parameter to achieve the contribution adjustment, after which batch normalization and nonlinear mapping by the activation function are performed in sequence.
- temporal feature extraction is performed on the mapped result by two t×1×1 1D convolutions, the extraction results are weighted by the structural parameters β1 and β2 respectively and fused, and the temporal feature contribution of the weighted fusion result is adjusted by the temporal structure parameter α2 to obtain the behavior recognition feature used for video behavior recognition.
- a multi-dimensional structural parameter is pre-defined, such as a multi-dimensional structural parameter vector, specifically a two-dimensional vector, which carries gradients during the differential-mode update.
- the dimensions of the structural parameters represent the structural parameters corresponding to the spatial convolution and temporal convolution, respectively.
- the structural parameters are applied to the spatial convolution and the temporal convolution to fuse the features of the two: α1 is applied to the spatial convolution and α2 to the temporal convolution to adjust their contributions. An error value is calculated from the predicted and ground-truth results of the video behavior recognition model, the structural parameters are updated with the gradient descent algorithm, and the trained structural parameters are obtained at the end of training.
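
The differential-mode update can be illustrated with a deliberately tiny stand-in. In this sketch each convolution is replaced by multiplication with a fixed scalar weight (`w_spatial`, `w_temporal`), the loss is a squared error, and only the two structural parameters are trained; all names and values are assumptions chosen for illustration and do not reflect the actual network.

```python
def forward(x, alpha1, alpha2, w_spatial, w_temporal):
    """Spatial op scaled by alpha1, followed by temporal op scaled by alpha2."""
    spatial = alpha1 * w_spatial * x
    output = alpha2 * w_temporal * spatial
    return spatial, output


def train_structure_params(x, target, w_spatial=0.5, w_temporal=2.0, lr=0.05, steps=200):
    """Update alpha1 and alpha2 by gradient descent on the squared prediction error."""
    alpha1, alpha2 = 1.0, 1.0
    losses = []
    for _ in range(steps):
        spatial, output = forward(x, alpha1, alpha2, w_spatial, w_temporal)
        error = output - target
        losses.append(error ** 2)
        # Analytic gradients of (output - target)^2 with respect to each parameter.
        grad_alpha1 = 2.0 * error * alpha2 * w_temporal * w_spatial * x
        grad_alpha2 = 2.0 * error * w_temporal * spatial
        alpha1 -= lr * grad_alpha1
        alpha2 -= lr * grad_alpha2
    return alpha1, alpha2, losses


alpha1, alpha2, losses = train_structure_params(x=1.0, target=2.0)
```

In the described method the model weights are trained as well; they are held fixed here only so that the effect of the structural parameters is visible in isolation.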
- the operation space in the network structure search technology is denoted as O, and o is a specific operation.
- a node refers to a collection of basic operation units in the network structure search method.
- let i and j be two sequentially adjacent nodes.
- the weights of the group of candidate operations between them are denoted as α_ij
- P is the corresponding probability distribution.
- the candidate operation with the maximum probability between nodes i and j is obtained through the max function, and the final network structure is formed by stacking the operations obtained by searching between different nodes, as shown in the following formula (1):
- N is the number of nodes.
- L_train(w, α) is the objective function of the network structure
- w is the model parameter of the network structure
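
The selection rule described around formula (1), taking the candidate operation with the maximum probability between each pair of adjacent nodes, can be sketched as below. The softmax mapping from weights to probabilities is the usual choice in differentiable structure search and is an assumption here, as are the operation names and weight values.

```python
import math


def softmax(weights):
    """Turn a list of operation weights into a probability distribution."""
    peak = max(weights)
    exps = [math.exp(w - peak) for w in weights]
    total = sum(exps)
    return [e / total for e in exps]


def select_operations(edge_weights, operations):
    """For every edge (i, j), keep the candidate operation with the highest probability."""
    chosen = {}
    for edge, weights in edge_weights.items():
        probs = softmax(weights)
        best = max(range(len(probs)), key=probs.__getitem__)
        chosen[edge] = operations[best]
    return chosen


operations = ["conv_1xdxd", "conv_tx1x1", "identity"]  # hypothetical candidate set
edge_weights = {(0, 1): [2.0, 0.5, -1.0], (1, 2): [0.1, 1.5, 0.3]}
architecture = select_operations(edge_weights, operations)
```

Stacking the operations chosen for successive edges gives the final network structure.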
- the blocks of this embodiment are defined between two nodes.
- these nodes represent the output of the previous block and the input of the next block.
- sequentially connected 1×d×d and t×1×1 convolutions are defined inside the block.
- structural parameters are applied on top of these two convolutions to tune their strength.
- ⁇ 2j ... ⁇ 2m ⁇ 2 , ⁇ 1n is determined as the structural parameter ⁇ 1 in Fig. 6
- ⁇ 21 is the structural parameter ⁇ 2 .
- o(·) is an operation defined in the search space O and acting on the input x
- the weight vector between node i and node j is α^(i,j)
- the following formula (3) can be obtained
- F is the linear map of the weight vector
- y (i, j) is the sum of the linear maps of all weight vectors in the search space
- F can be set as a fully connected layer
- each cell is defined as a (2+1)D convolutional block, so o^(i,j) is fixed. Therefore, the learning objective can be further simplified as the following formula (4),
- w_α is the structural parameter of the network
- w_n is the model parameter of the network
- y is the output of the (2+1)D convolutional block.
- the structural parameters w_α and the model parameters w_n of the network are trained synchronously, and gradient descent is performed on the objective function L_val to obtain structural parameters w_α and model parameters w_n that meet the requirements, which completes network training.
- a multi-dimensional structural parameter is pre-defined, such as a multi-dimensional structural parameter vector, specifically a two-dimensional vector, and the gradient information is truncated in the update process in the policy gradient mode.
- the dimensions of the structural parameters represent the structural parameters corresponding to the spatial convolution and temporal convolution, respectively.
- a policy gradient agent network is pre-defined to generate the next structural parameter according to the current structural parameters and the network state of the policy gradient agent network.
- the generated structural parameters are applied to spatial convolution and temporal convolution to fuse the features of the two.
- the network parameters of the Agent are updated, and then the new Agent predicts the next structural parameters, so as to realize the updating of the structural parameters.
- policy gradient is a reinforcement learning method in which the policy refers to the actions taken in different states. The goal is to optimize the policy by gradient-based updates, so that the trained policy gradient Agent network takes better actions for the current state and obtains a higher reward.
- for the policy gradient Agent network, the parameters of the current Agent network serve as the state, the structural parameters output by the network serve as the action, and the loss of the current backbone network (that is, the video behavior recognition model) together with a reward constant serve as components of the reward function.
- in the forward pass, the initial structural parameters are first input to the Agent network, which then predicts the next structural parameters, that is, the action, from the current Agent network parameters and the input structural parameters. In backpropagation, the currently obtainable reward is maximized, and the parameters of the Agent network are updated through the reward. Let s be the current state, a the current action, and θ the parameters of the network; the cross-entropy loss CE is then given by formula (6),
- the reward function can be designed based on the smoothed CE value, so that the search for structural parameters and the learning of the backbone network of the video behavior recognition model assist each other.
- the smoothed CE is given by formula (7),
- i, j and N are the correct category, the other categories and the total number of categories respectively, and ε is a very small constant. Further, if the value SCE_n obtained at the later time step n is greater than the value SCE_m obtained at the earlier time step m, a positive reward value is given; otherwise the reward is the negative of that value.
- f is the reward value
- ⁇ is the set variable
- f(s, a) is the network prediction output.
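
Since formulas (6) to (9) are not reproduced in this text, the following is only a sketch of one common way to write a smoothed cross-entropy and the sign-based reward described above. The label-smoothing form (weight 1 - ε on the correct category, ε spread evenly over the other N - 1 categories), the function names, and the reward magnitude are all assumptions; the comparison direction follows the wording of the description.

```python
import math


def smoothed_cross_entropy(probs, correct, eps=0.1):
    """Cross-entropy against a smoothed target: 1 - eps on the correct class, eps shared by the rest."""
    n = len(probs)
    loss = -(1.0 - eps) * math.log(probs[correct])
    for j, p in enumerate(probs):
        if j != correct:
            loss -= eps / (n - 1) * math.log(p)
    return loss


def step_reward(sce_now, sce_prev, magnitude=1.0):
    """Positive reward if the newer SCE exceeds the earlier one, as stated in the text; negative otherwise."""
    return magnitude if sce_now > sce_prev else -magnitude


probs = [0.7, 0.2, 0.1]  # hypothetical predicted class probabilities
plain = smoothed_cross_entropy(probs, correct=0, eps=0.0)
smoothed = smoothed_cross_entropy(probs, correct=0, eps=0.1)
reward = step_reward(sce_now=0.5, sce_prev=0.4)
```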
- the multi-layer perceptrons (MLPs) corresponding to the structural parameters of the two parts, namely the importance of spatio-temporal information and the prior excitation module for narrowing intra-class differences, are three-layer neural networks with 6 and 4 hidden-layer neurons respectively; a ReLU activation function is added between layers, and the last layer uses a softplus activation function. Since the policy gradient mechanism requires a complete state-action sequence, the lack of feedback in intermediate states leads to a poor overall training effect.
- one method is to set the sequence to 1 epoch, that is, every 2 epochs the reward of the most recent epoch is calculated; the other is to treat it as optimization within an iteration, which is more conducive to optimization.
- the parameters of the network and the parameters of the Agent are separated and optimized independently. Different optimizers can be used for the two sets of parameters: the Agent uses the Adam optimizer, the network parameters use stochastic gradient descent (SGD), and the two are updated alternately during optimization.
- the Auto(2+1)D convolution structure, that is, a structure of 2D convolution + 1D convolution, fuses the spatiotemporal information in the video image features.
- Auto(2+1)D is composed of sequentially connected 2D convolution and 1D convolution, their corresponding structural parameters, and activation functions.
- 2D convolution and 1D convolution are used to decouple the time and space information in the feature, and independent modeling is performed, that is, spatial feature extraction is performed through 2D convolution, and temporal feature extraction is performed through 1D convolution.
- the decoupled information is adaptively fused through the structural parameters, and the nonlinear expression ability of the model is increased through the activation function.
- 2D convolution and 1D convolution form a basic convolution block, which can be used as the basic block structure in the network, such as the Block structure in ResNet (Residual Neural Network, residual network).
- the rhythm regulator contains a prior excitation module and a high-cohesion temporal expression module.
- the prior excitation module can set a limit value (Margin) for the current structural parameters according to the similarity of the features in the time dimension, so as to promote the optimization of the structural parameters.
- the high-cohesion temporal expression module can increase the cohesion of the information in the time dimension through an efficient attention mechanism. Specifically, the feature map output by the previous layer is input into the 2D convolution to extract spatial features.
- the features output by the 2D convolution are input into the prior excitation module, which calculates their similarity in the time dimension and sets an appropriate Margin for the structural parameters according to the similarity value.
- the features output by the 2D convolution are also input into the high-cohesion temporal module and the 1D temporal convolution module, each of which outputs a feature map; the weights of the two output feature maps are adaptively adjusted according to the prior-information structural parameters, and the weighted feature maps are fused to obtain the fused feature.
- the 3×1×1 temporal convolution branch is changed into two branches: a 3×1×1 temporal convolution and a 3×1×1 temporal convolution with expectation-maximization attention.
- the prior excitation module mainly acts on the features through excitation that optimizes the prior parameters β1 and β2. As shown in Figure 7, spatial feature extraction is performed on the video image features extracted from the target video image by a 1×3×3 2D convolution, and the contribution of the extraction result is adjusted by α1. After the contribution adjustment, batch normalization and nonlinear mapping by the activation function are performed. The mapped result is processed by the prior excitation module.
- the prior excitation module calculates the similarity of the mapped result in the time dimension, corrects the initial prior parameters β1 and β2 based on the similarity, and uses the corrected prior parameters β1 and β2 to weight and fuse the results of temporal feature extraction by the two t×1×1 1D convolutions; the temporal feature contribution of the weighted fusion result is then adjusted by the structural parameter α2 to obtain the behavior recognition feature used for video behavior recognition.
- the arrows represent the flow direction of the feature maps: modules are connected by inputting the feature map output by the previous module into the next module, and the feature map obtained after the prior similarity excitation module is input in parallel into the following convolution branches; the final output concatenates the feature maps of the two branches and reduces the dimensionality.
- the degree of change, and based on the threshold of the degree of change, the current prior parameters are divided into positive and negative parameters.
- the excitation-corrected prior parameters and the originally input prior parameters are merged through a residual connection to form the final prior parameters.
- the element values of the tensor often do not have a large variance and are uniformly small.
- the current similarity prior information can be adjusted by setting the margin and dynamically adjusting the threshold, which gives the following formula (10),
- Sim represents the similarity value
- Thres is the threshold
- β1 and β2 are prior parameters.
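
Formula (10) is not reproduced in this text, so the following is only a hedged sketch of the mechanism described: measure similarity along the time dimension, compare it with a threshold, and correct the prior parameters by a margin through a residual connection. The use of cosine similarity between adjacent frames, the direction of the correction (high similarity shifts weight toward the second branch), and the values of `thres` and `margin` are all assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-12)


def temporal_similarity(frames):
    """Mean similarity between features of adjacent frames (the value called Sim)."""
    sims = [cosine(frames[t], frames[t + 1]) for t in range(len(frames) - 1)]
    return sum(sims) / len(sims)


def excite_priors(beta1, beta2, sim, thres=0.5, margin=0.1):
    """Residual correction: original prior parameter plus a signed margin (assumed direction)."""
    sign = 1.0 if sim > thres else -1.0
    return beta1 - sign * margin, beta2 + sign * margin


frames = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.1]]  # hypothetical per-frame features
sim = temporal_similarity(frames)
beta1, beta2 = excite_priors(0.5, 0.5, sim)
```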
- the high-cohesion temporal module obtains a highly cohesive temporal expression based on an attention mechanism optimized by the EM (Expectation-Maximization) algorithm. For each sample, the features are reconstructed through a fixed number of optimization iterations. As shown in Figure 8, this process is divided into an E step and an M step. After the feature map is down-sampled, it is processed by the E step and the M step respectively and then fused to obtain the high-cohesion feature.
- B is the batch size, that is, the data size of batch processing
- C is the number of channels corresponding to the original input video image features
- K is the base vector dimension.
- in step E, the basis vector is matrix-multiplied with the B×(H×W)×C spatial feature obtained by spatial feature extraction, and softmax is then applied to reconstruct the original feature, giving a feature map of size B×(H×W)×K.
- in step M, the reconstructed feature map of size B×(H×W)×K is multiplied by the original feature map of size B×(H×W)×C to obtain a new basis vector of size B×C×K.
- to keep the update of the basis vector stable, L2 regularization is performed on it, and a moving average update of the basis vector is added during training, as shown in the following formula (11),
- mu is the base vector
- mu_mean is its mean
- momentum is the momentum
- finally, the basis vector obtained in step M and the attention map obtained in step E are matrix-multiplied to obtain the final reconstructed feature map with global information.
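
The tensor shapes stated above can be summarized compactly. This is a sketch consistent with those shapes, not a reproduction of the original formulas: the symbols X, Z, μ, μ' and X̃ are labels chosen here, the L2 norm is assumed to be taken along the channel axis C, and the assignment of the first product to the E step and the second to the M step follows the usual EM-attention convention.

```latex
\begin{aligned}
Z &= \operatorname{softmax}_{K}\!\left(X\,\mu\right),
  & X &\in \mathbb{R}^{B\times(HW)\times C},\quad
    \mu \in \mathbb{R}^{B\times C\times K},\quad
    Z \in \mathbb{R}^{B\times(HW)\times K} \\
\mu' &= \frac{X^{\top} Z}{\lVert X^{\top} Z \rVert_{2}},
  & \mu' &\in \mathbb{R}^{B\times C\times K} \\
\tilde{X} &= Z\,{\mu'}^{\top},
  & \tilde{X} &\in \mathbb{R}^{B\times(HW)\times C}
\end{aligned}
```

The first line is the attention map, the second is the regularized new basis vector, and the third is the reconstructed feature map with the same shape as the input.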
- The video behavior recognition method provided in this embodiment is applied in the field of video recognition, where three-dimensional convolution is widely used but is difficult to scale because of its large number of parameters.
- Some improved methods decompose a 3D convolution into a 2D spatial convolution and a 1D temporal convolution to obtain low computational cost, small memory requirements, and high performance.
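As a rough worked example of why the decomposition reduces cost, the weight counts of a full 3D convolution and of a 2D spatial convolution followed by a 1D temporal convolution can be compared. The channel widths and the 3×3×3 kernel are illustrative assumptions, not values taken from this application, and the helper names are hypothetical.

```python
def conv3d_params(c_in, c_out, kt, kh, kw):
    """Weights in a full 3D convolution (bias ignored)."""
    return c_in * c_out * kt * kh * kw


def conv2plus1d_params(c_in, c_mid, c_out, kt, kh, kw):
    """Weights in a 1 x kh x kw spatial conv followed by a kt x 1 x 1 temporal conv."""
    spatial = c_in * c_mid * kh * kw
    temporal = c_mid * c_out * kt
    return spatial + temporal


full = conv3d_params(64, 64, 3, 3, 3)              # 110592 weights
split = conv2plus1d_params(64, 64, 64, 3, 3, 3)    # 49152 weights
```

With these assumed sizes the decomposed form uses fewer than half the weights of the full 3D kernel.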
- However, the industry has overlooked the fact that the spatial and temporal cues in a video affect different action categories differently.
- The adaptive spatiotemporal entanglement network in this embodiment automatically fuses the decomposed spatiotemporal information based on importance analysis to obtain a more powerful spatiotemporal representation.
- Auto(2+1)D convolution uses network structure search to adaptively recombine and decouple the spatiotemporal convolution filters, modeling the unequal contributions of spatial and temporal information and mining the deep correlation between them. By integrating spatiotemporal information with different weights, it learns spatiotemporal interaction information, which strengthens the model's ability to model temporal and spatial information.
- The Rhythm Regulator uses the EM-based attention mechanism to extract high-cohesion features in the time dimension, and can adjust the temporal information of actions with different rhythms according to the prior information about the action rhythm and the structural parameters of the temporal convolution. This yields a highly cohesive representation of temporal information that handles the different durations of different action classes and can improve the accuracy of video action recognition.
- At least some of the steps in FIGS. 2-3 may include multiple steps or stages. These steps or stages are not necessarily executed at the same time and may be executed at different times. Nor are they necessarily executed sequentially; they may be executed in turn or alternately with other steps, or with at least some of the steps or stages within other steps.
- A video behavior recognition device 900 is provided.
- The device may be implemented as a software module, a hardware module, or a combination of the two, as part of a computer device.
- The device specifically includes a video image feature extraction module 902, a spatial feature contribution adjustment module 904, a feature fusion module 906, a temporal feature contribution adjustment module 908, and a video behavior recognition module 910, wherein:
- the video image feature extraction module 902 is used to extract video image features from at least two frames of target video images;
- the spatial feature contribution adjustment module 904 is used to adjust the contribution of the spatial features of the video image features to obtain intermediate image features;
- the feature fusion module 906 is used to fuse, based on prior information, the temporal features of the intermediate image features and the cohesive features corresponding to the temporal features to obtain fusion features; the prior information is obtained from the change information of the intermediate image features in the time dimension; the cohesive features are obtained by applying attention processing to the temporal features;
- the temporal feature contribution adjustment module 908 is used to perform temporal feature contribution adjustment on the fusion features to obtain behavior recognition features;
- the video behavior recognition module 910 is used to perform video behavior recognition based on the behavior recognition features.
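The data flow through modules 902 to 910 can be sketched as a simple chain of callables. This is only an architectural illustration: the function name `recognize` and the five callables are hypothetical placeholders, and the stand-in stages in the usage example do no real feature processing.

```python
from typing import Callable, Sequence


def recognize(frames: Sequence, extract: Callable, adjust_spatial: Callable,
              fuse: Callable, adjust_temporal: Callable, classify: Callable):
    """Chain the five stages in the order the device description lists them."""
    if len(frames) < 2:
        raise ValueError("at least two frames of target video images are required")
    video_features = extract(frames)                  # module 902
    intermediate = adjust_spatial(video_features)     # module 904
    fused = fuse(intermediate)                        # module 906
    recognition_features = adjust_temporal(fused)     # module 908
    return classify(recognition_features)             # module 910


# Toy stand-ins for the five stages, used only to show the call order.
label = recognize(
    [1, 2],
    extract=list,
    adjust_spatial=lambda feats: [2 * f for f in feats],
    fuse=sum,
    adjust_temporal=lambda value: value + 1,
    classify=lambda value: "run" if value > 5 else "walk",
)
```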
- The spatial feature contribution adjustment module 904 is also used to perform spatial feature extraction on the video image features to obtain their spatial features, and to adjust the contribution of the spatial features through the spatial structure parameters among the structural parameters to obtain the intermediate image features; the structural parameters are obtained by training with video image samples carrying behavior labels. The temporal feature contribution adjustment module 908 is also used to adjust the contribution of the fusion features through the temporal structure parameters among the structural parameters to obtain the behavior recognition features.
- The device also includes a module for determining parameters to be trained, an intermediate sample feature acquisition module, a fusion sample feature acquisition module, a behavior recognition sample feature acquisition module, and an iteration module, wherein: the module for determining parameters to be trained is used to determine the structural parameters to be trained; the intermediate sample feature acquisition module is used to adjust the contribution of the spatial sample features of the video image sample features through the spatial structure parameters among the structural parameters to be trained, obtaining intermediate sample features, where the video image sample features are extracted from the video image samples; the fusion sample feature acquisition module is used to fuse, based on prior sample information, the temporal sample features of the intermediate sample features and the cohesive sample features corresponding to the temporal sample features, obtaining fusion sample features, where the cohesive sample features are obtained by applying attention processing to the temporal sample features and the prior sample information is obtained from the change information of the intermediate sample features in the time dimension; the behavior recognition sample feature acquisition module is used to adjust the contribution of the fusion sample features through the temporal structure parameters among the structural parameters to be trained, obtaining behavior recognition sample features; the iteration module is used to determine the
- The video behavior recognition device is implemented by a video behavior recognition model, and the structural parameters to be trained are parameters of the video behavior recognition model during training. The iteration module also includes a recognition result acquisition module, a difference determination module, a structural parameter update module, and a structural parameter acquisition module, wherein: the recognition result acquisition module is used to obtain the behavior recognition result output by the video behavior recognition model; the difference determination module is used to determine the difference between the behavior recognition result and the behavior label corresponding to the video image sample; the structural parameter update module is used to update the model parameters of the video behavior recognition model and the structural parameters to be trained according to the difference; the structural parameter acquisition module is used to continue training based on the updated video behavior recognition model until training ends, and to obtain the structural parameters from the trained video behavior recognition model.
- The iteration module further includes a recognition loss determination module, a reward value acquisition module, and a reward value processing module, wherein: the recognition loss determination module is used to determine the behavior recognition loss between the behavior recognition result and the behavior label corresponding to the video image sample; the reward value acquisition module is used to obtain a reward value from the behavior recognition loss and the previous behavior recognition loss; the reward value processing module is used to update the structural parameters to be trained according to the reward value and to continue training with the updated structural parameters until the objective function satisfies the end condition, obtaining the structural parameters; the objective function is obtained from the reward values in the training process.
- The reward value acquisition module is further used to update the model parameters of the policy gradient network model according to the reward value; the updated policy gradient network model then updates the structural parameters to be trained.
- The reward value acquisition module is also used to predict structural parameters through the updated policy gradient network model, based on the updated model parameters and the structural parameters to be trained, obtaining predicted structural parameters, and to obtain the updated structural parameters from the predicted structural parameters.
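The reward-driven update can be illustrated with a single policy gradient (REINFORCE) step. The text states only that the reward is obtained from the current and previous recognition losses and that a policy gradient network updates the structural parameters; the specific reward (previous loss minus current loss), the softmax policy over three discrete candidates, the learning rate, and all function names below are assumptions made for this sketch.

```python
import math


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reward_from_losses(loss, previous_loss):
    """Assumed reward: positive when the recognition loss went down."""
    return previous_loss - loss


def policy_gradient_step(logits, action, reward, lr=0.1):
    """One REINFORCE update: logits += lr * reward * grad log pi(action)."""
    probs = softmax(logits)
    return [w + lr * reward * ((1.0 if i == action else 0.0) - p)
            for i, (w, p) in enumerate(zip(logits, probs))]


logits = [0.0, 0.0, 0.0]    # policy over three hypothetical candidate settings
reward = reward_from_losses(loss=0.8, previous_loss=1.0)
updated = policy_gradient_step(logits, action=1, reward=reward)
```

A positive reward raises the probability of the sampled candidate, and a negative reward lowers it.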
- The device also includes a similarity determination module and a prior information correction module, wherein: the similarity determination module is used to determine the similarity of the intermediate image features in the time dimension; the prior information correction module is used to correct the initial prior information based on the similarity to obtain the prior information.
- The initial prior information includes a first initial prior parameter and a second initial prior parameter. The prior information correction module includes a similarity adjustment module, a prior parameter correction module, and a prior information acquisition module, wherein: the similarity adjustment module is used to dynamically adjust the similarity according to the first initial prior parameter, the second initial prior parameter, and a preset threshold; the prior parameter correction module is used to correct the first initial prior parameter and the second initial prior parameter respectively through the dynamically adjusted similarity, obtaining the first prior parameter and the second prior parameter; the prior information acquisition module is used to obtain the prior information from the first prior parameter and the second prior parameter.
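The similarity step that feeds the prior correction can be sketched as follows. The text does not say which similarity measure is used, so cosine similarity between consecutive per-frame feature vectors is an assumption here, as are the function names and toy inputs. The correction of the prior parameters itself (formula (10)) is not reproduced.

```python
import math


def cosine(u, v, eps=1e-8):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / (norm + eps)


def temporal_similarity(frames):
    """Mean cosine similarity between consecutive per-frame feature vectors."""
    pairs = list(zip(frames, frames[1:]))
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)


slow = [[1.0, 0.0], [1.0, 0.1], [1.0, 0.2]]   # features change little over time
fast = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]   # features change a lot over time
slow_sim = temporal_similarity(slow)
fast_sim = temporal_similarity(fast)
```

Under this measure a slowly changing clip scores close to 1 and a rapidly changing clip scores much lower, which is the kind of signal a rhythm-dependent prior could use.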
- The device also includes a base vector determination module, a feature reconstruction module, a base vector update module, and a cohesive feature acquisition module, wherein: the base vector determination module is used to determine the current base vector; the feature reconstruction module is used to reconstruct the temporal features of the intermediate image features through the current base vector to obtain reconstructed features; the base vector update module is used to generate the base vector for the next attention processing from the reconstructed features and the temporal features; the cohesive feature acquisition module is used to obtain the cohesive features corresponding to the temporal features from the base vector for the next attention processing, the base vector, and the temporal features.
- The base vector update module also includes an attention feature module, a regularization processing module, and a moving-average update module, wherein: the attention feature module is used to fuse the reconstructed features and the temporal features to generate attention features; the regularization processing module is used to regularize the attention features to obtain regularized features; the moving-average update module is used to apply a moving-average update to the regularized features to generate the base vector for the next attention processing.
- The current base vector includes the batch size, the number of channels of the intermediate image features, and the dimension of the base vector. The feature reconstruction module is also used to apply matrix multiplication and then normalized mapping to the current base vector and the temporal features of the intermediate image features to obtain the reconstructed features; the base vector update module is also used to matrix-multiply the reconstructed features and the temporal features to obtain the base vector for the next attention processing; the cohesive feature acquisition module is also used to fuse the base vector for the next attention processing, the base vector, and the temporal features to obtain the cohesive features corresponding to the temporal features.
- The feature fusion module 906 is also used to determine the prior information; perform temporal feature extraction on the intermediate image features to obtain their temporal features; and, through the prior information, perform weighted fusion of the temporal features and the cohesive features corresponding to the temporal features to obtain the fusion features.
- The prior information includes a first prior parameter and a second prior parameter. The feature fusion module 906 is also used to weight the temporal features by the first prior parameter to obtain weighted temporal features; weight the cohesive features corresponding to the temporal features by the second prior parameter to obtain weighted cohesive features; and fuse the weighted temporal features and the weighted cohesive features to obtain the fusion features.
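The weighted fusion can be illustrated with a small element-wise example. The text says the two weighted features are fused without specifying the operation, so element-wise addition is an assumption, as are the function name `fuse` and the sample values.

```python
def fuse(temporal, cohesive, prior_1, prior_2):
    """Element-wise prior_1 * temporal + prior_2 * cohesive (addition is assumed)."""
    if len(temporal) != len(cohesive):
        raise ValueError("feature vectors must have the same length")
    return [prior_1 * t + prior_2 * c for t, c in zip(temporal, cohesive)]


fused = fuse([1.0, 2.0], [3.0, 4.0], prior_1=0.5, prior_2=0.5)   # [2.0, 3.0]
```

Raising the second prior parameter relative to the first shifts the fused feature toward the cohesive (attention-based) representation.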
- The device also includes a normalization processing module and a nonlinear mapping module, wherein: the normalization processing module is used to standardize the intermediate image features to obtain standardized features; the nonlinear mapping module is used to apply a nonlinear mapping to the standardized features to obtain the mapped intermediate image features. The feature fusion module 906 is also used to fuse, based on prior information, the temporal features of the mapped intermediate image features and the cohesive features corresponding to the temporal features to obtain the fusion features; the prior information is obtained from the change information of the mapped intermediate image features in the time dimension.
- The normalization processing module is also used to standardize the intermediate image features through a batch normalization layer structure to obtain the standardized features; the nonlinear mapping module is also used to apply a nonlinear mapping to the standardized features through an activation function to obtain the mapped intermediate image features.
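A minimal sketch of standardization followed by a nonlinear mapping is given below. It standardizes each channel across the batch without the learned scale and shift that a full batch normalization layer would have, and it uses ReLU as the activation; the text names no specific activation function, so ReLU, the function names, and the toy values are assumptions.

```python
import math


def batch_standardize(batch, eps=1e-5):
    """Standardize each channel across the batch (no learned scale or shift)."""
    n = len(batch)
    columns = list(zip(*batch))
    means = [sum(col) / n for col in columns]
    variances = [sum((v - m) ** 2 for v in col) / n
                 for col, m in zip(columns, means)]
    return [[(v - m) / math.sqrt(var + eps)
             for v, m, var in zip(row, means, variances)]
            for row in batch]


def relu(batch):
    """Assumed activation: clamp negative values to zero."""
    return [[max(0.0, v) for v in row] for row in batch]


features = [[1.0, 10.0], [3.0, 30.0]]   # 2 samples, 2 channels
mapped = relu(batch_standardize(features))
```

After standardization both channels are on the same scale even though their raw magnitudes differ by a factor of ten.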
- Each module in the above video behavior recognition device can be implemented fully or partially in software, hardware, or a combination of the two.
- The modules can be embedded in, or be independent of, the processor of the computer device in hardware form, or stored in the memory of the computer device in software form, so that the processor can invoke them and execute the corresponding operations.
- a computer device is provided.
- the computer device may be a server or a terminal, and its internal structure may be as shown in FIG. 10 .
- the computer device includes a processor, memory and a network interface connected by a system bus. Wherein, the processor of the computer device is used to provide calculation and control capabilities.
- the memory of the computer device includes a non-volatile storage medium and an internal memory.
- the non-volatile storage medium stores an operating system, computer readable instructions and a database.
- the internal memory provides an environment for the execution of the operating system and computer readable instructions in the non-volatile storage medium.
- the database of the computer device is used to store model data.
- the network interface of the computer device is used to communicate with an external terminal via a network connection. When the computer-readable instructions are executed by the processor, a video behavior recognition method is realized.
- FIG. 10 is only a block diagram of the part of the structure related to the solution of this application and does not limit the computer device to which the solution is applied.
- A specific computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
- a computer device including a memory and a processor, where computer-readable instructions are stored in the memory, and the processor implements the steps in the foregoing method embodiments when executing the computer-readable instructions.
- a computer-readable storage medium which stores computer-readable instructions, and when the computer-readable instructions are executed by a processor, the steps in the foregoing method embodiments are implemented.
- a computer program product or computer program comprising computer readable instructions stored in a computer readable storage medium.
- the processor of the computer device reads the computer-readable instructions from the computer-readable storage medium, and the processor executes the computer-readable instructions, so that the computer device executes the steps in the foregoing method embodiments.
- Non-volatile memory may include read-only memory (Read-Only Memory, ROM), magnetic tape, floppy disk, flash memory or optical memory, etc.
- Volatile memory can include Random Access Memory (RAM) or external cache memory.
- RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM).
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Multimedia (AREA)
- Evolutionary Computation (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Computing Systems (AREA)
- Databases & Information Systems (AREA)
- Artificial Intelligence (AREA)
- Medical Informatics (AREA)
- Software Systems (AREA)
- Psychiatry (AREA)
- Social Psychology (AREA)
- Human Computer Interaction (AREA)
- Image Analysis (AREA)
Claims (20)
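- A video behavior recognition method, performed by a computer device, characterized in that the method comprises: extracting video image features from at least two frames of target video images; performing contribution adjustment on the spatial features of the video image features to obtain intermediate image features; fusing, based on prior information, the temporal features of the intermediate image features and the cohesive features corresponding to the temporal features to obtain fusion features, wherein the prior information is obtained from change information of the intermediate image features in the time dimension, and the cohesive features are obtained by performing attention processing on the temporal features; performing temporal feature contribution adjustment on the fusion features to obtain behavior recognition features; and performing video behavior recognition based on the behavior recognition features.
- The method according to claim 1, characterized in that performing contribution adjustment on the spatial features of the video image features to obtain intermediate image features comprises: performing spatial feature extraction on the video image features to obtain the spatial features of the video image features; and performing contribution adjustment on the spatial features through spatial structure parameters among structural parameters to obtain the intermediate image features, wherein the structural parameters are obtained by training with video image samples carrying behavior labels; and performing temporal feature contribution adjustment on the fusion features to obtain behavior recognition features comprises: performing contribution adjustment on the fusion features through temporal structure parameters among the structural parameters to obtain the behavior recognition features.
- The method according to claim 2, characterized in that the method further comprises: determining structural parameters to be trained; performing contribution adjustment on spatial sample features of video image sample features through spatial structure parameters among the structural parameters to be trained to obtain intermediate sample features, wherein the video image sample features are extracted from the video image samples; fusing, based on prior sample information, temporal sample features of the intermediate sample features and cohesive sample features corresponding to the temporal sample features to obtain fusion sample features, wherein the cohesive sample features are obtained by performing attention processing on the temporal sample features, and the prior sample information is obtained from change information of the intermediate sample features in the time dimension; performing contribution adjustment on the fusion sample features through temporal structure parameters among the structural parameters to be trained to obtain behavior recognition sample features; and performing video behavior recognition based on the behavior recognition sample features, and, according to the behavior recognition result and the behavior labels corresponding to the video image samples, updating the structural parameters to be trained and continuing training until training ends, to obtain the structural parameters.
- The method according to claim 3, characterized in that the method is implemented by a video behavior recognition model, and the structural parameters to be trained are parameters of the video behavior recognition model during training; and updating the structural parameters to be trained according to the behavior recognition result and the behavior labels corresponding to the video image samples and continuing training until training ends, to obtain the structural parameters, comprises: obtaining the behavior recognition result output by the video behavior recognition model; determining the difference between the behavior recognition result and the behavior label corresponding to the video image sample; updating the model parameters of the video behavior recognition model and the structural parameters to be trained according to the difference; and continuing training based on the updated video behavior recognition model until training ends, and obtaining the structural parameters from the trained video behavior recognition model.
- The method according to claim 3, characterized in that updating the structural parameters to be trained according to the behavior recognition result and the behavior labels corresponding to the video image samples and continuing training until training ends, to obtain the structural parameters, comprises: determining a behavior recognition loss between the behavior recognition result and the behavior label corresponding to the video image sample; obtaining a reward value from the behavior recognition loss and the previous behavior recognition loss; and updating the structural parameters to be trained according to the reward value, and continuing training with the updated structural parameters to be trained until an objective function satisfies an end condition, to obtain the structural parameters, wherein the objective function is obtained from the reward values in the training process.
- The method according to claim 5, characterized in that updating the structural parameters to be trained according to the reward value comprises: updating model parameters of a policy gradient network model according to the reward value; and updating the structural parameters to be trained by the updated policy gradient network model.
- The method according to claim 6, characterized in that updating the structural parameters to be trained by the updated policy gradient network model comprises: performing structural parameter prediction through the updated policy gradient network model, based on the updated model parameters and the structural parameters to be trained, to obtain predicted structural parameters; and obtaining, from the predicted structural parameters, the structural parameters that result from updating the structural parameters to be trained.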
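- The method according to claim 1, characterized in that the method further comprises: determining the similarity of the intermediate image features in the time dimension; and correcting initial prior information based on the similarity to obtain the prior information.
- The method according to claim 8, characterized in that the initial prior information includes a first initial prior parameter and a second initial prior parameter; and correcting the initial prior information based on the similarity to obtain the prior information comprises: dynamically adjusting the similarity according to the first initial prior parameter, the second initial prior parameter, and a preset threshold; correcting the first initial prior parameter and the second initial prior parameter respectively through the dynamically adjusted similarity to obtain a first prior parameter and a second prior parameter; and obtaining the prior information from the first prior parameter and the second prior parameter.
- The method according to claim 1, characterized in that the method further comprises: determining a current base vector; performing feature reconstruction on the temporal features of the intermediate image features through the current base vector to obtain reconstructed features; generating a base vector for the next attention processing from the reconstructed features and the temporal features; and obtaining the cohesive features corresponding to the temporal features from the base vector for the next attention processing, the base vector, and the temporal features.
- The method according to claim 10, characterized in that generating the base vector for the next attention processing from the reconstructed features and the temporal features comprises: fusing the reconstructed features and the temporal features to generate attention features; regularizing the attention features to obtain regularized features; and applying a moving-average update to the regularized features to generate the base vector for the next attention processing.
- The method according to claim 10, characterized in that the current base vector includes the batch size, the number of channels of the intermediate image features, and the dimension of the base vector; performing feature reconstruction on the temporal features of the intermediate image features through the current base vector to obtain reconstructed features comprises: applying matrix multiplication and then normalized mapping to the current base vector and the temporal features of the intermediate image features to obtain the reconstructed features; generating the base vector for the next attention processing from the reconstructed features and the temporal features comprises: matrix-multiplying the reconstructed features and the temporal features to obtain the base vector for the next attention processing; and obtaining the cohesive features corresponding to the temporal features from the base vector for the next attention processing, the base vector, and the temporal features comprises: fusing the base vector for the next attention processing, the base vector, and the temporal features to obtain the cohesive features corresponding to the temporal features.
- The method according to any one of claims 1 to 12, characterized in that fusing, based on prior information, the temporal features of the intermediate image features and the cohesive features corresponding to the temporal features to obtain fusion features comprises: determining the prior information; performing temporal feature extraction on the intermediate image features to obtain the temporal features of the intermediate image features; and performing, through the prior information, weighted fusion of the temporal features and the cohesive features corresponding to the temporal features to obtain the fusion features.
- The method according to claim 13, characterized in that the prior information includes a first prior parameter and a second prior parameter; and performing, through the prior information, weighted fusion of the temporal features and the cohesive features corresponding to the temporal features to obtain the fusion features comprises: weighting the temporal features by the first prior parameter to obtain weighted temporal features; weighting the cohesive features corresponding to the temporal features by the second prior parameter to obtain weighted cohesive features; and fusing the weighted temporal features and the weighted cohesive features to obtain the fusion features.
- The method according to claim 1, characterized in that, before fusing, based on prior information, the temporal features of the intermediate image features and the cohesive features corresponding to the temporal features to obtain fusion features, the method further comprises: standardizing the intermediate image features to obtain standardized features; and applying a nonlinear mapping to the standardized features to obtain mapped intermediate image features; and fusing, based on prior information, the temporal features of the intermediate image features and the cohesive features corresponding to the temporal features to obtain fusion features comprises: fusing, based on prior information, the temporal features of the mapped intermediate image features and the cohesive features corresponding to the temporal features to obtain the fusion features, wherein the prior information is obtained from change information of the mapped intermediate image features in the time dimension.
- The method according to claim 15, characterized in that standardizing the intermediate image features to obtain standardized features comprises: standardizing the intermediate image features through a batch normalization layer structure to obtain the standardized features; and applying a nonlinear mapping to the standardized features to obtain mapped intermediate image features comprises: applying a nonlinear mapping to the standardized features through an activation function to obtain the mapped intermediate image features.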
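- A video behavior recognition apparatus, characterized in that the apparatus comprises: a video image feature extraction module, used to extract video image features from at least two frames of target video images; a spatial feature contribution adjustment module, used to perform contribution adjustment on the spatial features of the video image features to obtain intermediate image features; a feature fusion module, used to fuse, based on prior information, the temporal features of the intermediate image features and the cohesive features corresponding to the temporal features to obtain fusion features, wherein the prior information is obtained from change information of the intermediate image features in the time dimension, and the cohesive features are obtained by performing attention processing on the temporal features; a temporal feature contribution adjustment module, used to perform temporal feature contribution adjustment on the fusion features to obtain behavior recognition features; and a video behavior recognition module, used to perform video behavior recognition based on the behavior recognition features.
- A computer device, comprising a memory and a processor, the memory storing computer-readable instructions, characterized in that the processor implements the steps of the method according to any one of claims 1 to 16 when executing the computer-readable instructions.
- A computer-readable storage medium storing computer-readable instructions, characterized in that the steps of the method according to any one of claims 1 to 16 are implemented when the computer-readable instructions are executed by a processor.
- A computer program product, comprising computer-readable instructions, characterized in that the steps of the method according to any one of claims 1 to 16 are implemented when the computer-readable instructions are executed by a processor.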
Priority Applications (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| EP22880046.2A EP4287144A4 (en) | 2021-10-15 | 2022-09-05 | VIDEO BEHAVIOR RECOGNITION METHOD AND APPARATUS AND COMPUTER DEVICE AND STORAGE MEDIUM |
| US18/201,635 US20230316733A1 (en) | 2021-10-15 | 2023-05-24 | Video behavior recognition method and apparatus, and computer device and storage medium |
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202111202734.4A CN114332670A (zh) | 2021-10-15 | 2021-10-15 | Video behavior recognition method and apparatus, computer device, and storage medium |
| CN202111202734.4 | 2021-10-15 |
Related Child Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| US18/201,635 Continuation US20230316733A1 (en) | 2021-10-15 | 2023-05-24 | Video behavior recognition method and apparatus, and computer device and storage medium |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2023061102A1 (zh) | 2023-04-20 |
Family
ID=81044868
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/CN2022/116947 Ceased WO2023061102A1 (zh) | Video behavior recognition method and apparatus, computer device, and storage medium | 2021-10-15 | 2022-09-05 |
Country Status (4)
| Country | Link |
|---|---|
| US (1) | US20230316733A1 (zh) |
| EP (1) | EP4287144A4 (zh) |
| CN (1) | CN114332670A (zh) |
| WO (1) | WO2023061102A1 (zh) |
Families Citing this family (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
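| CN114332670A (zh) * | 2021-10-15 | 2022-04-12 | 腾讯科技(深圳)有限公司 | Video behavior recognition method and apparatus, computer device, and storage medium |
| CN114882223B (zh) * | 2022-06-01 | 2024-08-16 | 安徽农业大学 | Lightweight semantic segmentation model for leafy vegetable seedlings and methods for testing and using it |
| CN115240271B (zh) * | 2022-07-08 | 2025-08-05 | 北方工业大学 | Video behavior recognition method and system based on spatiotemporal modeling |
| CN116189028B (zh) * | 2022-11-29 | 2024-06-21 | 北京百度网讯科技有限公司 | Image recognition method and apparatus, electronic device, and storage medium |
| CN116189281B (zh) * | 2022-12-13 | 2024-04-02 | 北京交通大学 | End-to-end human behavior classification method and system based on spatiotemporal adaptive fusion |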
Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
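| CN110096950A (zh) * | 2019-03-20 | 2019-08-06 | 西北大学 | Key-frame-based multi-feature fusion behavior recognition method |
| KR20200036093A (ko) * | 2018-09-21 | 2020-04-07 | 네이버웹툰 주식회사 | Method and apparatus for recognizing actions in video images |
| CN111950444A (zh) * | 2020-08-10 | 2020-11-17 | 北京师范大学珠海分校 | Video behavior recognition method based on a spatiotemporal feature fusion deep learning network |
| CN113378600A (zh) * | 2020-03-09 | 2021-09-10 | 北京灵汐科技有限公司 | Behavior recognition method and system |
| CN114332670A (zh) * | 2021-10-15 | 2022-04-12 | 腾讯科技(深圳)有限公司 | Video behavior recognition method and apparatus, computer device, and storage medium |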
Family Cites Families (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
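| CN111178319A (zh) * | 2020-01-06 | 2020-05-19 | 山西大学 | Video behavior recognition method based on a compressed reward-and-punishment mechanism |
| CN113435430B (zh) * | 2021-08-27 | 2021-11-09 | 中国科学院自动化研究所 | Video behavior recognition method, system, and device based on adaptive spatiotemporal entanglement |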
2021
- 2021-10-15 CN CN202111202734.4A patent/CN114332670A/zh active Pending
2022
- 2022-09-05 WO PCT/CN2022/116947 patent/WO2023061102A1/zh not_active Ceased
- 2022-09-05 EP EP22880046.2A patent/EP4287144A4/en active Pending
2023
- 2023-05-24 US US18/201,635 patent/US20230316733A1/en active Pending
Non-Patent Citations (1)
| Title |
|---|
| See also references of EP4287144A4 |
Cited By (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
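| CN116524542A (zh) * | 2023-05-08 | 2023-08-01 | 杭州像素元科技有限公司 | Cross-modal pedestrian re-identification method and apparatus based on fine-grained features |
| CN116524542B (zh) * | 2023-05-08 | 2023-10-31 | 杭州像素元科技有限公司 | Cross-modal pedestrian re-identification method and apparatus based on fine-grained features |
| CN116524419A (zh) * | 2023-07-03 | 2023-08-01 | 南京信息工程大学 | Video prediction method and system based on spatiotemporal decoupling and self-attention differential LSTM |
| CN116524419B (zh) * | 2023-07-03 | 2023-11-07 | 南京信息工程大学 | Video prediction method and system based on spatiotemporal decoupling and self-attention differential LSTM |
| CN118694899A (zh) * | 2024-07-18 | 2024-09-24 | 远洋亿家物业服务股份有限公司 | Artificial-intelligence-based property security prevention and control method and system |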
Also Published As
| Publication number | Publication date |
|---|---|
| EP4287144A1 (en) | 2023-12-06 |
| CN114332670A (zh) | 2022-04-12 |
| EP4287144A4 (en) | 2024-09-04 |
| US20230316733A1 (en) | 2023-10-05 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 22880046; Country of ref document: EP; Kind code of ref document: A1 |
| | WWE | Wipo information: entry into national phase | Ref document number: 2022880046; Country of ref document: EP |
| | ENP | Entry into the national phase | Ref document number: 2022880046; Country of ref document: EP; Effective date: 20230831 |
| | WWE | Wipo information: entry into national phase | Ref document number: 11202306161T; Country of ref document: SG |
| | NENP | Non-entry into the national phase | Ref country code: DE |


