WO2023061102A1 - Video behavior recognition method, apparatus, computer device, and storage medium - Google Patents

Info

Publication number
WO2023061102A1
Authority
WO
WIPO (PCT)
Prior art keywords
feature
features
time
behavior recognition
video
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/CN2022/116947
Other languages
English (en)
French (fr)
Inventor
胡益珲
杨伟东
陈宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to EP22880046.2A (EP4287144A4)
Publication of WO2023061102A1 (zh)
Priority to US18/201,635 (US20230316733A1)
Anticipated expiration
Ceased

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/62 Extraction of image or video features relating to a temporal dimension, e.g. time-based feature extraction; Pattern tracking
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74 Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/761 Proximity, similarity or dissimilarity measures
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition

Definitions

  • The present application relates to the field of computer technology, and in particular to a video behavior recognition method, apparatus, computer device, storage medium, and computer program product.
  • Video behavior recognition is one of the important topics in the field of computer vision. Based on video behavior recognition, it is possible to recognize the action behavior of the target object in a given video, such as eating, running, talking and other actions.
  • Behavior recognition is mostly performed by extracting features from videos, but the features extracted in traditional video behavior recognition processing cannot effectively reflect the behavior information in the video, which results in low recognition accuracy.
  • a video behavior recognition method performed by a computer device, said method comprising:
  • based on prior information, the temporal feature of the intermediate image feature and the cohesive feature corresponding to the temporal feature are fused to obtain a fusion feature; the prior information is obtained according to change information of the intermediate image feature in the time dimension; the cohesive feature is obtained by performing attention processing on the temporal feature;
  • a video behavior recognition device comprising:
  • Video image feature extraction module for extracting video image features from at least two frames of target video images
  • the spatial feature contribution adjustment module is used to adjust the contribution of the spatial feature of the video image feature to obtain the intermediate image feature
  • a feature fusion module, used to fuse, based on prior information, the temporal feature of the intermediate image feature and the cohesive feature corresponding to the temporal feature to obtain a fusion feature; the prior information is obtained according to change information of the intermediate image feature in the time dimension; the cohesive feature is obtained by performing attention processing on the temporal feature;
  • a temporal feature contribution adjustment module is used to adjust the temporal feature contribution to the fusion feature to obtain behavior recognition features
  • the video behavior recognition module is used for performing video behavior recognition based on behavior recognition features.
  • a computer device comprising a memory and a processor, the memory stores computer-readable instructions, and the processor implements the following steps when executing the computer-readable instructions:
  • based on prior information, the temporal feature of the intermediate image feature and the cohesive feature corresponding to the temporal feature are fused to obtain a fusion feature; the prior information is obtained according to change information of the intermediate image feature in the time dimension; the cohesive feature is obtained by performing attention processing on the temporal feature;
  • a computer-readable storage medium on which computer-readable instructions are stored, and when the computer-readable instructions are executed by a processor, the following steps are implemented:
  • based on prior information, the temporal feature of the intermediate image feature and the cohesive feature corresponding to the temporal feature are fused to obtain a fusion feature; the prior information is obtained according to change information of the intermediate image feature in the time dimension; the cohesive feature is obtained by performing attention processing on the temporal feature;
  • a computer program product comprising computer readable instructions which, when executed by a processor, implement the following steps:
  • based on prior information, the temporal feature of the intermediate image feature and the cohesive feature corresponding to the temporal feature are fused to obtain a fusion feature; the prior information is obtained according to change information of the intermediate image feature in the time dimension; the cohesive feature is obtained by performing attention processing on the temporal feature;
  • FIG. 1 is an application environment diagram of the video behavior recognition method in one embodiment;
  • FIG. 2 is a schematic flow chart of the video behavior recognition method in one embodiment;
  • FIG. 3 is a schematic flow chart of cohesive processing of temporal features in one embodiment;
  • FIG. 4 is a schematic structural diagram of a video behavior recognition model in one embodiment;
  • FIG. 5 is a schematic flow chart of structural parameter weighted fusion in one embodiment;
  • FIG. 6 is a schematic diagram of the process of determining structural parameters in one embodiment;
  • FIG. 7 is a schematic flow chart of feature fusion based on prior information in one embodiment;
  • FIG. 8 is a schematic flow chart of high cohesion processing in one embodiment;
  • FIG. 9 is a structural block diagram of a video behavior recognition apparatus in one embodiment;
  • FIG. 10 is a diagram of the internal structure of a computer device in one embodiment.
  • the video behavior recognition method provided in this application can be applied to the application environment shown in FIG. 1 .
  • the terminal 102 communicates with the server 104 through the network.
  • The terminal 102 can film a target object to obtain a video and send the video to the server 104.
  • The server 104 extracts at least two frames of target video images from the video and extracts video image features from them. It adjusts the contribution of the spatial features of the video image features to obtain intermediate image features. Using prior information obtained from the change information of the intermediate image features in the time dimension, it fuses the temporal features of the intermediate image features with the cohesive features obtained by performing attention processing on those temporal features.
  • The server 104 then performs temporal feature contribution adjustment on the resulting fusion features, performs video behavior recognition based on the resulting behavior recognition features, and may feed the video behavior recognition result back to the terminal 102.
  • the video behavior recognition method can also be executed by the server 104 alone, for example, the server 104 can obtain at least two frames of target video images from the database, and perform video behavior recognition processing based on the obtained at least two frames of target video images.
  • The video behavior recognition method can also be executed by the terminal 102. Specifically, after capturing the video, the terminal 102 extracts at least two frames of target video images from it and performs video behavior recognition processing based on those frames.
  • The terminal 102 can be, but is not limited to, a personal computer, notebook computer, smart phone, tablet computer, vehicle-mounted device, or portable wearable device.
  • The server 104 can be implemented as an independent server or as a server cluster composed of multiple servers.
  • A video behavior recognition method is provided. It is described here using its application to the server 104 in FIG. 1 as an example, and includes the following steps:
  • Step 202: extract video image features from at least two frames of target video images.
  • the target video image is an image from a video that requires behavior recognition processing, and specifically may be an image extracted from a video that requires behavior recognition processing.
  • For example, if the video to be recognized is a basketball video, the target video image may be an image extracted from that basketball video.
  • the target video image has more than one frame, so that the video can be processed for behavior recognition based on the time information between frames.
  • the target video image may be a multi-frame image continuously extracted from the video, for example, it may be 5 consecutive frames or 10 frames.
  • the video image feature is obtained by feature extraction of the target video image, which is used to reflect the image characteristics of the target video image.
  • The video image features can be obtained by any of various image feature extraction methods; for example, an artificial neural network can perform feature extraction on each frame of the target video image to extract its image features.
  • the server 104 acquires at least two frames of target video images, the target video images are extracted from the video captured by the terminal 102, and the target video images may be multiple frames of images continuously extracted from the video.
  • the server 104 extracts video image features from at least two frames of target video images.
  • the server 104 may perform image feature extraction processing on at least two frames of target video images, such as inputting them into an artificial neural network, respectively, to obtain video image features corresponding to each frame of target video images.
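Step 202 can be illustrated with a minimal sketch. It assumes a video stored as a NumPy array of frames and uses a linear projection followed by ReLU as a stand-in for the artificial neural network; the function names, array shapes, and random weights are hypothetical and are not taken from the application.

```python
import numpy as np

def sample_consecutive_frames(video, start, num_frames):
    """Return `num_frames` consecutive frames beginning at index `start`."""
    if num_frames < 2:
        raise ValueError("behavior recognition needs at least two frames")
    if start < 0 or start + num_frames > len(video):
        raise ValueError("requested frames fall outside the video")
    return video[start:start + num_frames]

def extract_frame_features(frames, projection):
    """Stand-in extractor: flatten each frame, then linear map + ReLU.

    A real system would run a neural network backbone here instead.
    """
    flat = frames.reshape(frames.shape[0], -1)   # (T, H*W*C)
    return np.maximum(flat @ projection, 0.0)    # (T, D)

rng = np.random.default_rng(0)
video = rng.random((30, 8, 8, 3))                  # hypothetical 30-frame clip
frames = sample_consecutive_frames(video, start=10, num_frames=5)
projection = rng.standard_normal((8 * 8 * 3, 16))  # hypothetical weights
video_image_features = extract_frame_features(frames, projection)
print(video_image_features.shape)
```

Each row of `video_image_features` corresponds to one target video image, which matches the per-frame features the later steps operate on.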
  • Step 204: adjust the contribution of the spatial features of the video image features to obtain intermediate image features.
  • the spatial feature is used to reflect the spatial information of the target video image, and the spatial information may include pixel value distribution information of each pixel in the target video image, that is, the characteristics of the image itself in the target video image.
  • the spatial feature can characterize the static feature of the object included in the target video image.
  • the spatial feature can be further extracted from the video image feature, so as to obtain the feature reflecting the spatial information in the target video image from the video image feature.
  • feature extraction may be performed on the video image features in the spatial dimension to obtain the spatial features of the video image features.
  • the contribution adjustment is used to adjust the contribution degree of the spatial feature.
  • the contribution degree of the spatial feature refers to the influence degree of the spatial feature on the behavior recognition result when the video behavior recognition is performed based on the characteristics of the target video image.
  • The greater the contribution of the spatial features, the greater their influence on video behavior recognition, that is, the closer the recognition result is to the behavior reflected by the spatial features.
  • the contribution adjustment can be realized by adjusting the spatial features through preset weight parameters to obtain intermediate image features, which are image features obtained after adjusting the contribution degree of the spatial features of video image features in video behavior recognition.
  • The server 104 adjusts the contribution of the spatial features of the video image features corresponding to each frame of the target video image. Specifically, the server 104 can perform spatial feature extraction on each video image feature to obtain its spatial features, and then adjust the contribution of those spatial features based on a spatial weight parameter to obtain the intermediate image features.
  • the spatial weight parameter can be set in advance, specifically, it can be obtained through pre-training with video image samples carrying behavior labels.
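The contribution adjustment of step 204 can be sketched as a per-frame spatial operation followed by scaling with a weight. The 3x3 average filter below is only a placeholder for the spatial feature extraction, and `spatial_weight` stands for a pre-trained spatial weight parameter; both are assumptions made for illustration.

```python
import numpy as np

def spatial_features(feature_maps):
    """Per-frame 3x3 average filter; it only mixes values within a frame."""
    t, h, w = feature_maps.shape
    padded = np.pad(feature_maps, ((0, 0), (1, 1), (1, 1)), mode="edge")
    out = np.zeros_like(feature_maps)
    for dy in range(3):
        for dx in range(3):
            out += padded[:, dy:dy + h, dx:dx + w]
    return out / 9.0

def adjust_spatial_contribution(feature_maps, spatial_weight):
    """Scale the spatial features by a weight to get intermediate features."""
    return spatial_weight * spatial_features(feature_maps)

rng = np.random.default_rng(0)
feature_maps = rng.random((5, 6, 6))   # hypothetical (frames, height, width)
intermediate = adjust_spatial_contribution(feature_maps, spatial_weight=0.7)
print(intermediate.shape)
```

A larger `spatial_weight` lets the static, within-frame information influence the final recognition result more strongly.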
  • Step 206: based on prior information, fuse the temporal feature of the intermediate image feature and the cohesive feature corresponding to the temporal feature to obtain a fusion feature; the prior information is obtained according to the change information of the intermediate image feature in the time dimension; the cohesive feature is obtained by performing attention processing on the temporal feature.
  • the prior information reflects the prior knowledge of the target video image in the time dimension, and the prior information is obtained according to the change information of the intermediate image features in the time dimension, specifically according to the similarity of the intermediate image features in the time dimension.
  • The prior information can include a weight parameter for each feature being fused. In that case, the similarity in the time dimension can be calculated for the intermediate image features corresponding to each frame of the target video image, and prior information including the weight parameters can be obtained from the computed similarity.
  • the time feature is used to reflect the time information of the target video images in the video, and the time information may include the correlation information between the target video images in the video, that is, the characteristics of the time sequence of the target video images in the video.
  • the temporal feature can characterize the dynamic feature of the object included in the target video image, so as to realize the dynamic behavior recognition of the object.
  • the temporal features can be further extracted from the intermediate image features to obtain features reflecting temporal information in the target video image from the intermediate image features.
  • feature extraction may be performed on the features of the intermediate image in the time dimension to obtain the temporal features of the features of the intermediate image.
  • the cohesive feature corresponding to the temporal feature is obtained by paying attention to the temporal feature.
  • Attention processing focuses on the features within the temporal features that are conducive to video behavior recognition and highlights them, which yields features with low redundancy and strong cohesion.
  • For example, an algorithm based on the attention mechanism can perform attention processing on the temporal features of the intermediate image features to obtain the cohesive features corresponding to the temporal features.
  • Because the cohesive feature is obtained by attention processing of the temporal feature, it has high cohesion: its key temporal information is prominent, its redundancy is low, and its validity is high. It can therefore accurately express the information of the target video images in the time dimension, which helps improve the accuracy of video behavior recognition.
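The text does not fix a particular attention algorithm at this point, so the sketch below assumes scaled dot-product self-attention across frames as one possible way to turn temporal features into cohesive features. The function names and the `(T, D)` layout are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def cohesive_features(temporal_features):
    """Self-attention over the time axis: (T, D) -> (T, D).

    Each output frame is a weighted mix of all frames, with larger
    weights on frames whose features are more similar to it.
    """
    d = temporal_features.shape[-1]
    scores = temporal_features @ temporal_features.T / np.sqrt(d)  # (T, T)
    attention = softmax(scores, axis=-1)                           # rows sum to 1
    return attention @ temporal_features

rng = np.random.default_rng(0)
temporal = rng.standard_normal((5, 16))   # hypothetical temporal features
cohesive = cohesive_features(temporal)
print(cohesive.shape)
```

Because every attention row sums to one, each cohesive feature stays in the span of the original temporal features while emphasizing mutually consistent frames.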
  • The temporal feature of the intermediate image feature and the cohesive feature corresponding to the temporal feature are fused according to the prior knowledge contained in the prior information to obtain the fusion feature.
  • Fusing the temporal features and cohesive features through this prior knowledge ensures the cohesion of the temporal information in the fusion features and strengthens the expression of important features in the time dimension, thereby improving the accuracy of video behavior recognition.
  • The prior information can include a weight parameter for each feature being fused, that is, weight parameters for the temporal feature and for the cohesive feature corresponding to the temporal feature. The temporal feature and its cohesive feature are weighted by these parameters and fused to obtain the fusion feature.
  • the server 104 may acquire prior information, which is obtained according to the change information of the intermediate image features in the time dimension, specifically, according to the cosine similarity of the intermediate image features in the time dimension.
  • the server 104 fuses the time features of the intermediate image features and the cohesive features corresponding to the time features based on the prior information.
  • The server 104 may perform feature extraction in the time dimension on the intermediate image features to obtain their temporal features, and then determine the cohesive features corresponding to the temporal features.
  • the cohesive features corresponding to the temporal features are obtained by performing attention processing on the temporal features.
  • the server 104 may also perform attention processing on the temporal features based on an attention mechanism algorithm, so as to obtain the cohesive features corresponding to the temporal features.
  • The server 104 fuses the temporal features of the intermediate image features and the cohesive features corresponding to the temporal features according to the prior information; for example, the temporal features and the cohesive features are weighted according to the prior information and fused to obtain the fusion features.
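The prior-guided fusion can be sketched as follows. The text states that the prior comes from the cosine similarity of the intermediate image features in the time dimension and that it supplies fusion weights, but it does not give the mapping from similarity to weights here. The mapping below (mean adjacent-frame similarity rescaled to [0, 1], with the two weights summing to 1) is purely an assumption for illustration.

```python
import numpy as np

def cosine_similarity_prior(intermediate_features, eps=1e-8):
    """Mean cosine similarity of adjacent frames, rescaled to [0, 1].

    ASSUMPTION: this rescaling is a hypothetical way to turn the
    similarity into a fusion weight; the source does not specify it.
    """
    a, b = intermediate_features[:-1], intermediate_features[1:]
    num = (a * b).sum(axis=1)
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + eps
    mean_similarity = float(np.mean(num / den))   # lies in [-1, 1]
    return (mean_similarity + 1.0) / 2.0

def fuse(temporal, cohesive, prior_weight):
    """Weighted fusion of temporal and cohesive features.

    ASSUMPTION: the weights are prior_weight and 1 - prior_weight.
    """
    return prior_weight * temporal + (1.0 - prior_weight) * cohesive

rng = np.random.default_rng(0)
intermediate = rng.random((5, 16))        # hypothetical (T, D) features
temporal = intermediate                   # stand-in temporal features
cohesive = intermediate.mean(axis=0, keepdims=True).repeat(5, axis=0)
prior_weight = cosine_similarity_prior(intermediate)
fusion = fuse(temporal, cohesive, prior_weight)
print(fusion.shape)
```

Any other monotone mapping from similarity to weights would fit the same interface; only `cosine_similarity_prior` would need to change.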
  • Step 208: perform temporal feature contribution adjustment on the fusion feature to obtain the behavior recognition feature.
  • The temporal feature contribution adjustment adjusts the contribution degree of the fusion feature in the time dimension. That contribution degree is the degree to which the characteristics of the fusion feature in the time dimension influence the behavior recognition result when video behavior recognition is performed based on the features of the target video images.
  • The greater the contribution of the fusion feature in the time dimension, the greater its influence on video behavior recognition, that is, the closer the recognition result is to the behavior reflected by the fusion feature in the time dimension.
  • The temporal feature contribution adjustment can be realized by adjusting the characteristics of the fusion feature in the time dimension through preset weight parameters, which yields behavior recognition features that can be used for video behavior recognition.
  • Specifically, the server 104 can adjust the contribution of the fusion feature in the time dimension according to a time weight parameter to obtain the behavior recognition features.
  • the time weight parameter can be set in advance, specifically, it can be obtained through pre-training with video image samples carrying behavior labels.
  • Step 210: perform video behavior recognition based on the behavior recognition features.
  • the behavior recognition feature is a feature used for video behavior recognition, specifically behavior classification can be carried out based on the behavior recognition feature, to determine the video behavior recognition result corresponding to the target video image.
  • the server 104 can perform video behavior recognition based on the obtained behavior recognition features.
  • the behavior recognition features can be input into a classifier for classification, and a video behavior recognition result can be obtained according to the classification result, so as to realize effective recognition of video behaviors.
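Steps 208 and 210 can be sketched together: the fusion features are scaled by a time weight parameter, pooled over frames, and passed to a classifier. The mean pooling, the linear softmax classifier, and all parameter values below are assumptions; the text only says that the features are input into a classifier. The labels reuse the example behaviors mentioned earlier (eating, running, talking).

```python
import numpy as np

def recognize(fusion_features, temporal_weight, class_weights, class_bias, labels):
    """Scale fusion features, pool over time, and classify with softmax."""
    behavior_features = temporal_weight * fusion_features   # step 208
    clip_feature = behavior_features.mean(axis=0)           # pooling (assumption)
    logits = class_weights @ clip_feature + class_bias      # linear classifier
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return labels[int(np.argmax(probs))], probs

labels = ["eating", "running", "talking"]
fusion_features = np.array([[0.0, 2.0, 0.0],
                            [0.0, 4.0, 0.0]])   # hypothetical (T, D) features
class_weights = np.eye(3)                       # hypothetical classifier weights
class_bias = np.zeros(3)
behavior, probs = recognize(fusion_features, 1.0, class_weights, class_bias, labels)
print(behavior, probs.round(3))
```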
  • The contribution of the spatial features of the video image features extracted from at least two frames of target video images is adjusted to obtain intermediate image features. Using prior information obtained from the change information of those intermediate image features in the time dimension, the temporal features of the intermediate image features are fused with the cohesive features obtained by performing attention processing on the temporal features. Temporal feature contribution adjustment is then performed on the resulting fusion features, and video behavior recognition is performed based on the resulting behavior recognition features.
  • Adjusting the contribution of the spatial features of the video image features and the temporal feature contribution of the fusion features adjusts the contribution of temporal and spatial information in the behavior recognition features, which strengthens their ability to express behavior information.
  • Fusing the temporal features with the cohesive features according to the prior information effectively focuses the temporal information in the behavior recognition features, so that these features reflect the behavior information in the video and the accuracy of video behavior recognition improves.
  • Adjusting the contribution of the spatial features of the video image features to obtain the intermediate image features includes: performing spatial feature extraction on the video image features to obtain the spatial features of the video image features; and adjusting the contribution of the spatial features through the spatial structure parameters in the structural parameters to obtain the intermediate image features; the structural parameters are obtained through training with video image samples carrying behavior labels.
  • spatial feature extraction is used to extract spatial features from video image features to adjust the contribution of spatial features.
  • Spatial feature extraction can be realized through a feature extraction module, for example, the convolution module in a convolutional neural network model can be used to perform convolution operations on video image features to achieve spatial feature extraction.
  • the structural parameters may include weight parameters to adjust the weight of various operations on image features.
  • The structural parameters can be the weight parameters of the various operations defined in the operation space of the convolutional neural network model, for example weight parameters that weight operations such as convolution, sampling, and pooling.
  • Structural parameters can include spatial structure parameters and temporal structure parameters, which adjust the contribution of the spatial features in the spatial dimension and the temporal features in the time dimension respectively. This adjusts the spatiotemporal information in the video image features and strengthens the behavior information expressiveness of the behavior recognition features, which helps improve the accuracy of video behavior recognition.
  • Structural parameters can be pre-trained through video image samples carrying behavior tags.
  • Video image samples can be video images carrying behavior tags. Based on video image samples, structural parameters can be trained to effectively adjust the weight of various operations.
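One common way to let structural parameters weight a set of candidate operations is to normalize them with a softmax and sum the weighted operation outputs. The sketch below follows that pattern as an assumption; the text says only that structural parameters are weight parameters over operations in the operation space. The candidate operations and names are hypothetical.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def identity(x):
    return x

def avg_pool_time(x):
    """3-frame average along the time axis, edges padded by repetition."""
    padded = np.pad(x, ((1, 1), (0, 0)), mode="edge")
    return (padded[:-2] + padded[1:-1] + padded[2:]) / 3.0

def max_pool_time(x):
    """3-frame maximum along the time axis, edges padded by repetition."""
    padded = np.pad(x, ((1, 1), (0, 0)), mode="edge")
    return np.maximum(np.maximum(padded[:-2], padded[1:-1]), padded[2:])

def mixed_operation(x, structure_params, operations):
    """Weighted sum of operation outputs; weights = softmax(structure_params).

    ASSUMPTION: softmax normalization is an illustrative choice.
    """
    weights = softmax(np.asarray(structure_params, dtype=float))
    return sum(w * op(x) for w, op in zip(weights, operations))

operations = [identity, avg_pool_time, max_pool_time]
x = np.arange(12.0).reshape(4, 3)       # hypothetical (T, D) features
out = mixed_operation(x, [0.2, 1.5, -0.3], operations)
print(out.shape)
```

Training then only has to update the entries of `structure_params` to shift influence between operations, without changing the operations themselves.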
  • the server 104 performs spatial feature extraction on the video image features corresponding to each frame of the target video image.
  • the video image features can be extracted through the pre-trained video behavior recognition model.
  • the spatial features of video image features are extracted through the convolutional layer structure in the video behavior recognition model, and the spatial features of video image features are obtained.
  • The server 104 determines the structural parameters obtained through training with video image samples carrying behavior labels, and adjusts the contribution of the spatial features through the spatial structure parameters in the structural parameters.
  • For example, when the spatial structure parameter is a weight parameter, the spatial features are weighted by the corresponding weight parameter. The spatial structure parameter thereby adjusts how strongly the spatial features of the video image features influence the recognition result during video behavior recognition, which realizes the contribution adjustment of the spatial features and yields the intermediate image features.
  • The intermediate image features are the image features obtained after adjusting the contribution degree of the spatial features of the video image features in video behavior recognition.
  • time feature contribution adjustment is performed on the fusion feature to obtain the behavior recognition feature, including: adjusting the contribution of the fusion feature through the time structure parameter in the structure parameter to obtain the behavior recognition feature.
  • the structural parameters may be weight parameters of various operations defined in the operation space of the convolutional neural network model, and the structural parameters include time structure parameters for adjusting the contribution to the characteristics of the time dimension. Specifically, after obtaining the fusion feature, the server 104 adjusts the time feature contribution of the fusion feature through the time structure parameter in the structure parameter to obtain the behavior recognition feature used for video behavior processing.
  • The time structure parameter can be a weight parameter. The server 104 can weight the fusion feature by the weight parameter corresponding to the time structure parameter, so that the time structure parameter adjusts how strongly the characteristics of the fusion feature in the time dimension influence the recognition result during video behavior recognition. This realizes the contribution adjustment of the fusion feature in the time dimension and yields the behavior recognition features.
  • The server 104 can then perform video behavior recognition processing based on the obtained behavior recognition features to obtain the video behavior recognition result.
  • The spatial structure parameters and temporal structure parameters, obtained through training with video image samples carrying behavior labels, adjust the contribution of the spatial features of the video image features and of the fusion features in their respective feature dimensions. The contribution of temporal and spatial information in the behavior recognition features is thus adjusted according to the spatial and temporal structure parameters, which realizes effective entanglement of spatiotemporal features and makes the spatiotemporal features of the behavior recognition features more expressive. In other words, the behavior information expressiveness of the behavior recognition features is strengthened, and the accuracy of video behavior recognition improves.
  • The video behavior recognition method further includes: determining the structural parameters to be trained; adjusting, through the spatial structure parameters in the structural parameters to be trained, the contribution of the spatial sample features of the video image sample features to obtain intermediate sample features, where the video image sample features are extracted from the video image samples;
  • fusing, based on prior sample information, the time sample features of the intermediate sample features and the cohesive sample features corresponding to the time sample features to obtain fusion sample features, where the cohesive sample features are obtained by performing attention processing on the time sample features and the prior sample information is obtained according to the change information of the intermediate sample features in the time dimension;
  • adjusting the contribution of the fusion sample features through the time structure parameters in the structural parameters to be trained to obtain behavior recognition sample features; and performing video behavior recognition based on the behavior recognition sample features, updating the structural parameters to be trained according to the behavior recognition results and the behavior labels corresponding to the video image samples, and continuing training until training ends to obtain the structural parameters.
  • Training is carried out on video image samples carrying behavior labels, and structural parameters including temporal structure parameters and spatial structure parameters are obtained when the training ends.
  • The structural parameters to be trained are the initial values for the current training iteration. The spatial structure parameters in the structural parameters to be trained are used to adjust the contribution of the spatial sample features of the video image sample features to obtain the intermediate sample features.
  • The intermediate sample features are the result of adjusting the contribution of the spatial sample features of the video image sample features.
  • The video image sample features are extracted from the video image samples. Specifically, feature extraction can be performed on the video image samples through an artificial neural network model to obtain the video image sample features.
  • The prior sample information is obtained from the change information of the intermediate sample features in the time dimension, and specifically can be obtained from the similarity of the intermediate sample features in the time dimension. The cohesive sample features are obtained by performing attention processing on the time sample features; specifically, the time sample features are processed based on an attention mechanism to obtain the cohesive sample features corresponding to the time sample features.
  • The fusion sample features are obtained by fusing the time sample features of the intermediate sample features and the cohesive sample features corresponding to the time sample features according to the prior sample information.
  • Specifically, the time sample features and the cohesive sample features can be weighted and fused according to the prior sample information to obtain the fusion sample features.
  • The behavior recognition sample features are used for video behavior recognition and are obtained by adjusting the contribution of the fusion sample features through the time structure parameters in the structural parameters to be trained. The time structure parameters adjust the degree to which the features in the time dimension contribute to video behavior recognition. The behavior recognition results are obtained by performing video behavior recognition based on the behavior recognition sample features.
  • Based on the behavior recognition results and the behavior labels corresponding to the video image samples, the structural parameters to be trained can be evaluated and then updated according to the evaluation results, after which iterative training continues until the training ends, for example when the number of training iterations reaches a preset threshold, the behavior recognition results meet the recognition accuracy requirements, or the objective function meets the end condition. The trained structural parameters are obtained when the training ends. Based on the trained structural parameters, the contributions of the spatial features and fusion features of the video image features can be adjusted respectively to carry out video behavior recognition.
  • The structural parameters can be trained by the server 104, or trained by another training device and then transferred to the server 104.
  • The following takes training of the structural parameters by the server 104 as an example.
  • When training the structural parameters, the server 104 determines the structural parameters to be trained, which are the initial values for the current training iteration. Using the spatial structure parameters in the structural parameters to be trained, the server 104 adjusts the contribution of the spatial sample features of the video image sample features to obtain the intermediate sample features.
  • The server 104 fuses the time sample features of the intermediate sample features and the cohesive sample features corresponding to the time sample features based on the prior sample information to obtain the fusion sample features.
  • After obtaining the fusion sample features, the server 104 adjusts their contribution through the time structure parameters in the structural parameters to be trained to obtain the behavior recognition sample features, and performs video behavior recognition based on the behavior recognition sample features to obtain the behavior recognition results.
  • The server 104 updates the structural parameters to be trained based on the behavior recognition results and the behavior labels corresponding to the video image samples, and continues iterative training with the updated structural parameters until the training end condition is met, at which point the structural parameters are obtained.
  • The structural parameters can be used to weight the various operations applied to the features of the target video image in the spatio-temporal dimensions during video behavior recognition, so as to achieve effective entanglement of the spatio-temporal features of the target video image and enhance how well the behavior recognition features express behavior information, thereby improving the accuracy of video behavior recognition.
  • The trained structural parameters achieve effective entanglement of the spatio-temporal features of the target video image and enhance how well the behavior recognition features express behavior information, thereby improving the accuracy of video behavior recognition.
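The contribution adjustment described above can be illustrated with a minimal Python sketch. It assumes that each structure parameter is attached to one candidate operation (branch) and that the parameters are normalized with a softmax before weighting; the softmax, the function names, and the toy feature vectors are illustrative assumptions, not details stated in the source.

```python
import math


def softmax(values):
    """Normalize raw structure parameters into weights that sum to 1 (assumed normalization)."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def adjust_contribution(branch_features, structure_params):
    """Weight each branch's feature vector by its normalized structure parameter and sum."""
    weights = softmax(structure_params)
    dim = len(branch_features[0])
    return [
        sum(w * feat[i] for w, feat in zip(weights, branch_features))
        for i in range(dim)
    ]


# Two hypothetical spatial operations produced these feature vectors.
spatial_branches = [[1.0, 2.0], [3.0, 4.0]]
spatial_structure_params = [0.0, 0.0]  # equal parameters give equal contribution
intermediate = adjust_contribution(spatial_branches, spatial_structure_params)
```

The same helper could be applied a second time with temporal structure parameters to weight the fusion features. A larger parameter raises the share of its branch in the result.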
  • The video behavior recognition method is implemented by a video behavior recognition model, and the structural parameters to be trained are parameters of the video behavior recognition model during training. Updating the structural parameters to be trained according to the behavior recognition results and the behavior labels corresponding to the video image samples, and continuing training until the training ends to obtain the structural parameters, includes: obtaining the behavior recognition result output by the video behavior recognition model; determining the difference between the behavior recognition result and the behavior label corresponding to the video image sample; updating the model parameters in the video behavior recognition model and the structural parameters to be trained according to the difference; and continuing training based on the updated video behavior recognition model until the training ends, and obtaining the structural parameters from the trained video behavior recognition model.
  • The video behavior recognition method is implemented through a video behavior recognition model, that is, the steps of the video behavior recognition method are carried out by a pre-trained video behavior recognition model.
  • The video behavior recognition model can be an artificial neural network model based on various neural network algorithms, such as a convolutional neural network model, a deep learning network model, a recurrent neural network model, a perceptron network model, or a generative adversarial network model.
  • The structural parameters to be trained are parameters of the video behavior recognition model during training, that is, the structural parameters are the parameters in the video behavior recognition model that adjust the contribution of the model's operations.
  • The behavior recognition result is the recognition result obtained by performing video behavior recognition based on the behavior recognition sample features.
  • The video behavior recognition model performs video behavior recognition based on the target video image and outputs the behavior recognition result.
  • The difference between the behavior recognition result and the behavior label corresponding to the video image sample can be determined by comparing the behavior recognition result with the behavior label.
  • Model parameters refer to the parameters corresponding to the network structure of each layer in the video behavior recognition model.
  • For example, the model parameters may include, but are not limited to, convolution kernel parameters, pooling parameters, and upsampling and downsampling parameters of each convolutional layer.
  • By updating the model parameters and the structural parameters to be trained in the video behavior recognition model according to the difference between the behavior recognition results and the behavior labels, the model parameters and the structural parameters in the video behavior recognition model are trained jointly.
  • When the trained video behavior recognition model is obtained at the end of training, the structural parameters can be determined from the trained video behavior recognition model.
  • The server 104 jointly trains the model parameters and the structural parameters through the video behavior recognition model, and the trained structural parameters can be determined from the trained video behavior recognition model. Specifically, after the server 104 inputs the video image sample into the video behavior recognition model, the model performs video behavior recognition and outputs a behavior recognition result. The server 104 determines the difference between the behavior recognition result output by the model and the behavior label corresponding to the video image sample, and updates the parameters of the video behavior recognition model according to the difference, specifically updating both the model parameters and the structural parameters to be trained, to obtain the updated video behavior recognition model.
  • The server 104 continues training on the video image samples with the updated video behavior recognition model until the training ends, for example when the training end condition is met, and obtains a trained video behavior recognition model.
  • The server 104 can determine the trained structural parameters from the trained video behavior recognition model. The trained structural parameters weight the operations of each layer of the network structure in the video behavior recognition model, adjusting how much each layer contributes to video behavior recognition, so that expressive features are obtained for video behavior recognition and its accuracy is improved.
  • Because the model parameters and structural parameters are trained jointly through the video behavior recognition model, the trained structural parameters can be determined from the trained model. The trained structural parameters achieve effective entanglement of the spatio-temporal features of the target video image and enhance how well the behavior recognition features express behavior information, thereby improving the accuracy of video behavior recognition.
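As a rough illustration of joint training, the following Python sketch updates one model parameter and one structural parameter from the same difference between prediction and label. The toy scalar model, the squared-error loss, and the plain gradient-descent update are assumptions made for the example; the source does not specify the optimizer or the loss used in this embodiment.

```python
def train_jointly(x, label, weight, structure_param, lr=0.1, steps=50):
    """Gradient descent on a toy model: prediction = structure_param * weight * x."""
    losses = []
    for _ in range(steps):
        prediction = structure_param * weight * x
        difference = prediction - label          # recognition result vs. label
        losses.append(difference ** 2)           # squared error as a stand-in loss
        grad_weight = 2.0 * difference * structure_param * x
        grad_structure = 2.0 * difference * weight * x
        weight -= lr * grad_weight               # update the model parameter
        structure_param -= lr * grad_structure   # update the structural parameter
    return weight, structure_param, losses


weight, structure_param, losses = train_jointly(
    x=1.0, label=2.0, weight=1.0, structure_param=1.0
)
```

Both parameters are read from the returned values once training stops, which mirrors determining the structural parameters from the trained model.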
  • Updating the structural parameters to be trained according to the behavior recognition results and the behavior labels corresponding to the video image samples, and continuing training until the training ends to obtain the structural parameters, includes: determining the behavior recognition loss between the behavior recognition result and the behavior label corresponding to the video image sample; obtaining a reward value according to the behavior recognition loss and the previous behavior recognition loss; and updating the structural parameters to be trained according to the reward value, and continuing training with the updated structural parameters until the objective function meets the end condition, at which point the structural parameters are obtained. The objective function is obtained based on the reward values in the training process.
  • The behavior recognition loss represents the degree of difference between the behavior recognition result and the behavior label corresponding to the video image sample. Its form can be set according to actual needs, for example a cross-entropy loss.
  • The previous behavior recognition loss is the behavior recognition loss determined for the previous frame of video image samples.
  • The reward value is used to update the structural parameters to be trained.
  • The reward value is determined according to the behavior recognition loss and the previous behavior recognition loss.
  • The reward value guides the structural parameters to be trained to update in the direction that meets the training requirements. After the structural parameters to be trained are updated, training continues with the updated structural parameters until the objective function meets the end condition; the training then ends and the trained structural parameters are obtained.
  • The objective function is obtained based on the reward values in the training process, that is, according to the reward values corresponding to each frame of video image samples.
  • Specifically, the objective function can be constructed as the sum of the reward values corresponding to each frame of video image samples, so that the end of structural parameter training is judged according to the objective function and structural parameters that meet the contribution adjustment requirements are obtained.
  • The server 104 performs video behavior recognition based on the behavior recognition sample features. After obtaining the behavior recognition result, the server 104 determines the behavior recognition loss between the behavior recognition result and the behavior label corresponding to the video image sample; specifically, the cross-entropy loss between the behavior recognition result and the behavior label can be computed to obtain the behavior recognition loss. The server 104 obtains a reward value based on this behavior recognition loss and the previous behavior recognition loss corresponding to the previous frame of video image samples. Specifically, the reward value can be determined according to the difference between the behavior recognition loss and the previous behavior recognition loss.
  • The server 104 updates the structural parameters to be trained according to the reward value.
  • Specifically, the structural parameters to be trained can be updated according to the sign or the magnitude of the reward value to obtain the updated structural parameters to be trained.
  • The server 104 continues training with the updated structural parameters until the objective function meets the end condition, then ends the training and obtains the structural parameters. The objective function is obtained based on the reward values in the training process.
  • Specifically, the objective function can be constructed as the sum of the reward values corresponding to each frame of video image samples, and the end of structural parameter training is judged according to the objective function. For example, the training ends when the objective function reaches an extreme value, and structural parameters that meet the contribution adjustment requirements are obtained.
  • The reward value is obtained from the difference between the behavior recognition losses corresponding to the video image samples of each frame, and the behavior recognition loss is determined from the behavior recognition results and the behavior labels corresponding to the video image samples. After the structural parameters to be trained are updated through the reward value, training continues until the objective function, obtained from the reward values corresponding to each frame of video image samples, meets the end condition; the training then ends and the trained structural parameters are obtained.
  • Updating the structural parameters to be trained through reward values obtained from the difference between the behavior recognition losses of successive frames of video image samples improves the training efficiency of the structural parameters.
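The reward and objective described above can be sketched in a few lines of Python. The sign convention (previous loss minus current loss, so a falling loss yields a positive reward) is an assumption; the source only states that the reward is determined from the difference between the two losses. The function names are illustrative.

```python
import math


def cross_entropy(probabilities, label_index):
    """Behavior recognition loss for one sample: negative log-probability of the labeled class."""
    return -math.log(probabilities[label_index])


def reward_values(losses):
    """Reward for each step after the first: previous loss minus current loss (assumed sign)."""
    return [prev - cur for prev, cur in zip(losses, losses[1:])]


def objective(rewards):
    """Objective built as the sum of the reward values collected during training."""
    return sum(rewards)


losses = [1.0, 0.5, 0.75]
rewards = reward_values(losses)   # [0.5, -0.25]
total = objective(rewards)        # 0.25
```

Under this convention the objective equals the first loss minus the last loss, so it grows as training reduces the behavior recognition loss.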
  • Updating the structural parameters to be trained according to the reward value includes: updating the model parameters of the policy gradient network model according to the reward value; and updating the structural parameters to be trained through the updated policy gradient network model.
  • The policy gradient network model is a network model based on the policy gradient method; its input is a state and its output is an action.
  • The policy gradient network model can take an action according to the current state so as to obtain a higher reward value.
  • Specifically, the model parameters of the policy gradient network model can be treated as the state, and the structural parameters that the policy gradient network model outputs for the input structural parameters in this state are the action. The policy gradient network model can thus predict the next action, that is, the next structural parameters, from the input structural parameters and its current model parameters, which realizes the update of the structural parameters during training.
  • The server 104 updates the model parameters of the policy gradient network model according to the reward value, specifically adjusting each model parameter in the policy gradient network model based on the reward value, so that the updated policy gradient network model performs the next structural parameter prediction.
  • The server 104 uses the updated policy gradient network model to update the structural parameters to be trained.
  • Specifically, the updated policy gradient network model can predict structural parameters based on its updated network state and the structural parameters to be trained. The structural parameters predicted by the policy gradient network model are the updated structural parameters to be trained.
  • The policy gradient network model is updated according to the reward value, and the structural parameters to be trained are updated through the updated policy gradient network model.
  • Optimizing the structural parameters with the policy gradient method ensures the training quality of the structural parameters and helps improve the accuracy of video behavior recognition.
  • Updating the structural parameters to be trained through the updated policy gradient network model includes: predicting structural parameters through the updated policy gradient network model, based on the updated model parameters and the structural parameters to be trained, to obtain predicted structural parameters; and obtaining, according to the predicted structural parameters, the updated structural parameters to be trained.
  • The updated policy gradient network model is obtained by updating the model parameters of the policy gradient network model, that is, by adjusting those model parameters through the reward value.
  • After the updated policy gradient network model is obtained, the server takes the model parameters in the updated policy gradient network model as the state and predicts structural parameters in this state, specifically based on the updated model parameters and the structural parameters to be trained, to obtain the predicted structural parameters.
  • In other words, the server uses the current network state of the updated policy gradient network model, together with the structural parameters to be trained, to predict structural parameters.
  • The server updates the structural parameters according to the predicted structural parameters to obtain the updated structural parameters to be trained.
  • For example, the server may directly use the predicted structural parameters output by the updated policy gradient network model as the updated structural parameters to be trained.
  • The server predicts structural parameters from the structural parameters to be trained through the updated policy gradient network model and obtains the updated structural parameters from the prediction. Optimizing the structural parameters with the policy gradient method ensures their training quality and helps improve the accuracy of video behavior recognition.
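A minimal Python sketch of one policy gradient step follows. It assumes a very simple policy whose model parameters are per-parameter offsets added to the current structural parameters, a Gaussian distribution with fixed standard deviation around that prediction, and the standard REINFORCE update. None of these choices come from the source, which does not describe the architecture of the policy gradient network model; all names are hypothetical.

```python
def predict_structure(structure_params, policy_params):
    """Policy mean: the next structural parameters predicted from the current ones."""
    return [s + p for s, p in zip(structure_params, policy_params)]


def update_policy(policy_params, structure_params, sampled_action, reward,
                  sigma=1.0, lr=0.1):
    """One REINFORCE step for a Gaussian policy with fixed standard deviation sigma."""
    mean = predict_structure(structure_params, policy_params)
    # Gradient of log N(action | mean, sigma^2) with respect to the mean,
    # which here equals the gradient with respect to each offset.
    score = [(a - m) / sigma ** 2 for a, m in zip(sampled_action, mean)]
    return [p + lr * reward * g for p, g in zip(policy_params, score)]


structure_params = [0.0, 1.0]
policy_params = [0.0, 0.0]
sampled_action = [0.5, 0.5]   # structural parameters that were tried in this step
reward = 2.0                  # e.g. derived from the change in recognition loss

# Update the policy's model parameters from the reward, then predict
# the updated structural parameters with the updated policy.
policy_params = update_policy(policy_params, structure_params, sampled_action, reward)
structure_params = predict_structure(structure_params, policy_params)
```

With a positive reward the policy shifts toward the sampled action; with a negative reward it shifts away from it.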
  • The video behavior recognition method further includes: determining the similarity of the intermediate image features in the time dimension; and correcting the initial prior information based on the similarity to obtain the prior information.
  • The time dimension is the dimension along which the frames of the target video image are ordered in the video to which they belong. Temporal features in the time dimension help recognize video behavior accurately.
  • The similarity characterizes the distance between features: the higher the similarity, the closer the features.
  • The similarity of the intermediate image features in the time dimension reflects the degree to which the intermediate image features change in the time dimension.
  • The initial prior information may be preset prior information, specifically prior information obtained in advance by training on sample data. The initial prior information is corrected according to the similarity, so that the fusion of the temporal features and cohesive features of the intermediate image features can be weighted according to the degree of change of each frame of the target video image in the time dimension. This enhances the cohesion of the fusion features, that is, it highlights the focal features of the fusion features and reduces their redundant information.
  • The initial prior information can be corrected according to the degree of change of each frame of the target video image in the time dimension to obtain the corresponding prior information.
  • The server 104 determines the similarity of the intermediate image features in the time dimension. Specifically, the cosine similarity can be calculated in the time dimension for the intermediate image features corresponding to each frame of the target video image, and the degree of change of each frame in the time dimension can be measured by the cosine similarity.
  • The server 104 corrects the initial prior information according to the similarity of the intermediate image features in the time dimension.
  • Specifically, the initial prior information can be divided into positive and negative parameters based on the similarity. After the initial prior information is corrected through the positive and negative parameters, the corrected initial prior information is combined with the initial prior information through a residual connection to obtain the prior information.
  • The initial prior information is corrected according to the similarity of the intermediate image features in the time dimension, which reflects the degree of change of each frame of the target video image in the time dimension. Prior knowledge that matches this degree of change is thereby obtained, so that the temporal features and cohesive features can be fused based on the prior knowledge and the temporal information in the behavior recognition features can be focused effectively. The resulting behavior recognition features effectively reflect the behavior information in the video, which improves the accuracy of video behavior recognition.
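The cosine similarity measurement mentioned above can be sketched as follows. Comparing each frame with the frame immediately before it is an assumption made for the example; the source states only that cosine similarity is computed in the time dimension. The function names are illustrative.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def temporal_similarity(frame_features):
    """Similarity between each pair of adjacent frames along the time dimension."""
    return [
        cosine_similarity(prev, cur)
        for prev, cur in zip(frame_features, frame_features[1:])
    ]


# Three frames with two-channel features: no change, then a large change.
frames = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
similarities = temporal_similarity(frames)
```

A similarity close to 1 indicates that a frame barely changed from its predecessor, while a value close to 0 indicates a large change.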
  • The initial prior information includes a first initial prior parameter and a second initial prior parameter. Correcting the initial prior information based on the similarity to obtain the prior information includes: dynamically adjusting the similarity according to the first initial prior parameter, the second initial prior parameter and a preset threshold; correcting the first initial prior parameter and the second initial prior parameter respectively through the dynamically adjusted similarity to obtain a first prior parameter and a second prior parameter; and obtaining the prior information according to the first prior parameter and the second prior parameter.
  • The initial prior information includes the first initial prior parameter and the second initial prior parameter, which serve respectively as the fusion weight parameters of the time feature of the intermediate image feature and of the cohesive feature.
  • The preset threshold can be set dynamically according to actual needs, so that the prior information is corrected dynamically according to actual needs.
  • The prior information includes the first prior parameter and the second prior parameter, which serve respectively as the fusion weight parameters of the time feature of the intermediate image feature and of the cohesive feature.
  • The server 104 determines a preset threshold and dynamically adjusts the similarity according to the first initial prior parameter, the second initial prior parameter, and the preset threshold.
  • The server 104 corrects the first initial prior parameter and the second initial prior parameter in the initial prior information through the dynamically adjusted similarity to obtain the first prior parameter and the second prior parameter, and obtains the prior information from the first prior parameter and the second prior parameter.
  • The prior information can be used for weighted fusion of the temporal features and cohesive features, so that they are fused according to the prior knowledge in the prior information to obtain the fusion features. Fusing the temporal features and cohesive features in this way ensures the cohesion of the temporal information in the fusion features and strengthens the expression of important features in the time dimension, thereby improving the accuracy of video behavior recognition.
  • The first initial prior parameter and the second initial prior parameter are respectively corrected based on the dynamically adjusted similarity to obtain the first prior parameter and the second prior parameter, and the prior information is obtained from the first prior parameter and the second prior parameter.
  • The obtained prior information reflects prior knowledge about the target video image in the time dimension. Fusing the temporal features and cohesive features based on the prior information effectively focuses the temporal information in the behavior recognition features, so that the resulting behavior recognition features effectively reflect the behavior information in the video, which improves the accuracy of video behavior recognition.
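The weighted fusion driven by the two prior parameters can be illustrated with a short Python sketch. It takes the already corrected first and second prior parameters as given, because the source does not state the exact correction formula; the element-wise weighted sum and the function name are assumptions for illustration.

```python
def fuse(temporal, cohesive, first_prior, second_prior):
    """Weighted fusion: first_prior scales the temporal feature, second_prior the cohesive feature."""
    return [first_prior * t + second_prior * c for t, c in zip(temporal, cohesive)]


temporal_feature = [1.0, 2.0]
cohesive_feature = [3.0, 4.0]
fused = fuse(temporal_feature, cohesive_feature, first_prior=0.25, second_prior=0.75)
```

A larger second prior parameter gives the cohesive feature more weight in the fusion feature, and a larger first prior parameter keeps more of the original temporal feature.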
  • The video behavior recognition method further includes performing cohesive processing on the temporal features to obtain the corresponding cohesive features, specifically including:
  • Step 302: determine the current basis vector.
  • The current basis vector is the basis vector currently used to perform cohesive processing on the time feature; cohesive processing of the time feature is realized through the basis vector.
  • The server 104 determines the current basis vector, for example of size B×C×K, where B is the batch size, C is the number of channels of the intermediate image features, and K is the dimension of the basis vector.
  • Step 304: perform feature reconstruction on the time feature of the intermediate image feature through the current basis vector to obtain the reconstructed feature.
  • The time feature is reconstructed through the current basis vector. Specifically, the reconstructed feature can be obtained by fusing the current basis vector with the time feature of the intermediate image feature.
  • For example, the server 104 may perform matrix multiplication on the current basis vector and the time feature of the intermediate image feature, and then perform normalized mapping, to reconstruct the time feature and obtain the reconstructed feature.
  • Step 306: generate the basis vector for the next attention process according to the reconstructed feature and the time feature.
  • The basis vector for the next attention process is the basis vector used the next time attention processing, that is, cohesive processing, is performed on the time feature.
  • The server 104 generates the basis vector for the next attention process according to the reconstructed feature and the time feature; for example, matrix multiplication of the reconstructed feature and the time feature may be performed to obtain the basis vector for the next attention process.
  • In the next attention process, this basis vector is used as the current basis vector to perform feature reconstruction on the corresponding time feature.
  • Step 308: obtain the cohesive feature corresponding to the time feature according to the basis vector for the next attention process, the current basis vector, and the time feature.
  • After obtaining the basis vector for the next attention process, the server 104 obtains the cohesive feature corresponding to the time feature according to the basis vector for the next attention process, the current basis vector, and the time feature, thereby realizing the cohesive processing of the time feature.
  • For example, the basis vector for the next attention process, the current basis vector, and the time feature may be fused to generate the cohesive feature corresponding to the time feature.
  • The time feature of the intermediate image feature is reconstructed through the basis vector, a new basis vector is generated from the reconstructed feature and the time feature, and the cohesive feature corresponding to the time feature is obtained from the new basis vector, the old basis vector and the time feature. The time feature is thereby focused so that the important features in the time dimension are highlighted, yielding highly cohesive features that accurately express the information of the target video image in the time dimension, which helps improve the accuracy of video behavior recognition.
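Steps 302 to 308 can be sketched in plain Python for a single sample (the batch dimension B is omitted). The time feature is taken as a C×N matrix and the basis as a C×K matrix. Using a softmax as the normalized mapping, and forming the cohesive feature as the new basis multiplied by the transposed reconstruction weights, are assumptions made for this sketch; the source says only that the three inputs are fused. All function names are illustrative.

```python
import math


def matmul(a, b):
    """Multiply an m x p matrix by a p x n matrix (lists of rows)."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def transpose(m):
    return [list(row) for row in zip(*m)]


def softmax_rows(m):
    """Normalized mapping applied to each row (assumed to be a softmax)."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def cohesive_processing(temporal, basis):
    """temporal: C x N time feature, basis: C x K current basis vectors."""
    # Step 304: matrix multiplication followed by normalized mapping -> N x K.
    recon = softmax_rows(matmul(transpose(temporal), basis))
    # Step 306: basis for the next attention process -> C x K.
    next_basis = matmul(temporal, recon)
    # Step 308 (assumed fusion): new basis combined with the weights that
    # were computed from the current basis and the time feature -> C x N.
    cohesive = matmul(next_basis, transpose(recon))
    return recon, next_basis, cohesive


temporal = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]]   # C = 2 channels, N = 3 positions
basis = [[1.0, 0.0], [0.0, 1.0]]                # K = 2 basis vectors
recon, next_basis, cohesive = cohesive_processing(temporal, basis)
```

The cohesive feature has the same C×N shape as the time feature, so the two can be fused element-wise afterwards.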
  • In one embodiment, generating the basis vector for the next attention process according to the reconstruction feature and the time feature includes: fusing the reconstruction feature and the time feature to generate an attention feature; regularizing the attention feature to obtain a regularized feature; and performing a moving average update on the regularized feature to generate the basis vector for the next attention process.
  • The attention feature is obtained by fusing the reconstruction feature and the time feature, and sequentially applying regularization and a moving average update to the attention feature keeps the update of the basis vector more stable.
  • Specifically, when generating the basis vector for the next attention process according to the reconstruction feature and the time feature, the server 104 fuses the reconstruction feature and the time feature to obtain the attention feature.
  • The server 104 then regularizes the attention feature, for example by L2 regularization, to obtain the regularized feature.
  • The server 104 performs a moving average update on the regularized feature to generate the basis vector for the next attention process.
  • A moving average estimates the local mean of a variable, so that each update of the variable depends on its historical values over a period of time.
  • The basis vector for the next attention process is the basis vector used the next time attention processing, that is, cohesive processing of the time feature, is performed.
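  • The regularization and moving average steps can be sketched as follows. This is a minimal illustration rather than the claimed implementation: the function names, the flat list layout, the sample values, and the momentum of 0.9 are all assumptions, and the fusion that produces the attention feature is taken as given.

```python
import math

def l2_normalize(vec, eps=1e-6):
    """Scale a vector to unit L2 norm (eps avoids division by zero)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / (norm + eps) for v in vec]

def moving_average_update(old_basis, new_basis, momentum=0.9):
    """Blend the previous basis with the newly estimated one (assumed form)."""
    return [momentum * o + (1.0 - momentum) * n
            for o, n in zip(old_basis, new_basis)]

# Hypothetical values: a single basis vector with C = 3 channels.
current_basis = [1.0, 0.0, 0.0]
attention_feature = [3.0, 4.0, 0.0]   # assumed result of fusing reconstruction and time features

regularized = l2_normalize(attention_feature)
next_basis = moving_average_update(current_basis, regularized)
```

  • With a momentum close to 1, the next basis vector stays near the current one, which is one way to read the stability remark above.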
  • In one embodiment, the dimensions of the current basis vector comprise the data size of the batch processing, the number of channels of the intermediate image features, and the dimension of the basis vector; reconstructing the time feature of the intermediate image features through the current basis vector to obtain the reconstructed feature includes: sequentially performing matrix multiplication and normalized mapping on the current basis vector and the time feature of the intermediate image features to obtain the reconstructed feature.
  • the data size of batch processing is the size of data volume processed in each batch when batch processing is performed.
  • The current basis vector may have size B×C×K, where B is the data size of the batch processing, C is the number of channels of the intermediate image features, and K is the dimension of the basis vector.
  • Specifically, when the server reconstructs the time features of the intermediate image features, it can perform matrix multiplication between the current basis vector and the time features of the intermediate image features, and then perform normalized mapping on the matrix multiplication result, thereby reconstructing the time features and obtaining the reconstructed features.
  • generating the basis vector for the next attention process according to the reconstruction feature and the time feature includes: performing matrix multiplication on the reconstruction feature and the time feature to obtain the basis vector for the next attention process.
  • the server performs matrix multiplication processing on the reconstruction feature and the time feature to obtain the basis vector for the next attention process.
  • The basis vector for the next attention process is used, the next time attention processing is performed, to reconstruct the corresponding time feature.
  • Obtaining the cohesive feature corresponding to the time feature according to the basis vector for the next attention process, the current basis vector, and the time feature includes: fusing the basis vector for the next attention process, the current basis vector, and the time feature to obtain the cohesive feature corresponding to the time feature.
  • Specifically, the server fuses the three, so that the effective information in the basis vector for the next attention process, the current basis vector, and the time feature is combined into the cohesive feature corresponding to the time feature.
  • In this embodiment, the time feature of the intermediate image features is reconstructed using a basis vector whose dimensions comprise the data size of the batch processing, the number of channels of the intermediate image features, and the dimension of the basis vector. The reconstruction feature is obtained by sequentially performing matrix multiplication and normalized mapping; the new basis vector is generated by matrix multiplication of the reconstruction feature and the time feature; and the cohesive feature corresponding to the time feature is obtained by fusing the new basis vector, the old basis vector, and the time feature. Focusing the time feature in this way highlights the important features in the time dimension and yields highly cohesive features that accurately express the information of the target video image in the time dimension, which helps improve the accuracy of video behavior recognition.
  • In one embodiment, fusing the time feature of the intermediate image features and the cohesive feature corresponding to the time feature to obtain the fusion feature includes: determining the prior information; performing temporal feature extraction on the intermediate image features to obtain the time feature of the intermediate image features; and performing, through the prior information, weighted fusion of the time feature and the cohesive feature corresponding to the time feature to obtain the fusion feature.
  • the prior information reflects the prior knowledge of the target video image in the time dimension, and the prior information is obtained according to the change information of the intermediate image features in the time dimension, specifically according to the similarity of the intermediate image features in the time dimension.
  • the time feature is used to reflect the time information of the target video image in the video.
  • The time feature and the cohesive feature corresponding to the time feature are weighted and fused through the prior information. For example, when the prior information includes a first prior parameter and a second prior parameter, the time feature and the cohesive feature corresponding to the time feature are weighted and fused through the first prior parameter and the second prior parameter to obtain the fusion feature.
  • the server 104 determines the prior information, which is obtained according to the change information of the intermediate image features in the time dimension, and specifically can be obtained according to the similarity of the intermediate image features in the time dimension.
  • the server 104 performs temporal feature extraction on the intermediate image features, specifically, feature extraction may be performed on the temporal dimension of the intermediate image features to obtain the temporal features of the intermediate image features.
  • the server 104 performs weighted fusion on the time feature and the cohesive feature corresponding to the time feature based on the prior information to obtain the fusion feature, thereby realizing the weighted fusion of the time feature and the cohesive feature corresponding to the time feature.
  • In this embodiment, the fusion feature is obtained by fusing the time feature and the cohesive feature based on the prior knowledge in the prior information, which preserves the cohesion of the time information in the fusion feature and strengthens the expression of important features in the time dimension, thereby improving the accuracy of video behavior recognition.
  • In one embodiment, the prior information includes a first prior parameter and a second prior parameter; performing, through the prior information, weighted fusion of the time feature and the cohesive feature corresponding to the time feature to obtain the fusion feature includes: weighting the time feature through the first prior parameter to obtain a weighted time feature; weighting the cohesive feature corresponding to the time feature through the second prior parameter to obtain a weighted cohesive feature; and fusing the weighted time feature with the weighted cohesive feature to obtain the fusion feature.
  • The first prior parameter and the second prior parameter are the weights of the time feature and of the cohesive feature corresponding to the time feature, respectively.
  • the server performs weighting processing on the time feature by using the first prior parameter in the prior information, and obtains the weighted time feature.
  • the first prior parameter may be k1
  • the time feature may be M
  • the weighted time feature may be k1*M.
  • The server weights the cohesive feature corresponding to the time feature by using the second prior parameter in the prior information, and obtains the weighted cohesive feature.
  • the second prior parameter may be k2
  • the cohesive feature corresponding to the time feature may be N
  • the weighted cohesive feature may be k2*N.
  • the server fuses the weighted time features and the weighted cohesion features to obtain fusion features.
  • the fusion features obtained by server fusion may be k1*M+k2*N.
  • In this embodiment, the fusion feature is obtained by fusing the time feature and the cohesive feature based on the first prior parameter and the second prior parameter in the prior information, which preserves the cohesion of the time information in the fusion feature and strengthens the expression of important features in the time dimension, thereby improving the accuracy of video behavior recognition.
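  • The weighted fusion k1*M+k2*N can be illustrated with a short sketch. The function name, the flat list representation of the features, and the sample values are assumptions made for illustration only.

```python
def weighted_fusion(time_feature, cohesive_feature, k1, k2):
    """Return k1*M + k2*N element-wise for two equal-length feature vectors."""
    if len(time_feature) != len(cohesive_feature):
        raise ValueError("features must have the same length")
    return [k1 * m + k2 * n for m, n in zip(time_feature, cohesive_feature)]

# Hypothetical values for the time feature M and the cohesive feature N.
M = [1.0, 2.0, 3.0]
N = [0.5, 0.5, 0.5]
fused = weighted_fusion(M, N, k1=0.5, k2=2.0)
print(fused)  # [1.5, 2.0, 2.5]
```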
  • In one embodiment, before fusing the time feature of the intermediate image features and the cohesive feature corresponding to the time feature based on the prior information to obtain the fusion feature, the method further includes: standardizing the intermediate image features to obtain standardized features; and performing nonlinear mapping on the standardized features to obtain mapped intermediate image features.
  • Standardization normalizes the intermediate image features, which helps with vanishing and exploding gradients and keeps the network learning rate stable. It can be achieved by batch normalization. Nonlinear mapping introduces nonlinear factors that de-linearize the intermediate image features, which helps make their expression more flexible.
  • the server 104 performs normalization processing on the intermediate image features.
  • the intermediate image features can be standardized through a BN (Batch Normalization, batch normalization) layer structure to obtain standardized features.
  • the server 104 performs nonlinear mapping on the standardized features, for example, an activation function may be used to perform nonlinear mapping on the standardized features to obtain mapped intermediate image features.
  • Fusing the time feature of the intermediate image features and the cohesive feature corresponding to the time feature to obtain the fusion feature includes: fusing, based on the prior information, the time feature of the mapped intermediate image features and the cohesive feature corresponding to the time feature to obtain the fusion feature, where the prior information is obtained according to the change information of the mapped intermediate image features in the time dimension.
  • the server 104 fuses the time feature of the mapped intermediate image feature and the cohesive feature corresponding to the time feature based on the prior information to obtain the fusion feature.
  • The prior information is obtained according to the change information of the mapped intermediate image features in the time dimension, and the cohesive feature is obtained by focusing the time features of the mapped intermediate image features.
  • performing normalization processing on intermediate image features to obtain standardized features includes: performing normalization processing on intermediate image features through batch normalization layer structure to obtain standardized features.
  • the batch normalization layer structure is a BN layer structure, which can standardize the intermediate image features in batches.
  • the server can perform batch normalization processing on the intermediate image features through the batch normalization layer structure to obtain standardized features, thereby ensuring the processing efficiency of the standardization.
  • performing nonlinear mapping according to the standardized features to obtain mapped intermediate image features includes: performing nonlinear mapping on the standardized features through an activation function to obtain mapped intermediate image features.
  • the activation function is used to introduce nonlinear factors to achieve nonlinear mapping to standardized features.
  • the specific form of the activation function can be set according to actual needs, for example, a ReLU function can be set, so that the server can perform non-linear mapping on the standardized features through the activation function to obtain the mapped intermediate image features.
  • the intermediate image features are further subjected to normalization processing and nonlinear mapping through batch normalization layer structure and activation function, so as to enhance the feature expression of the intermediate image features and improve processing efficiency.
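  • The standardization and nonlinear mapping steps can be sketched as follows. This is a simplified illustration under stated assumptions: it normalizes one channel over a small batch of scalars, omits the learnable scale and shift that a BN layer normally carries, and the function names and sample values are hypothetical.

```python
import math

def batch_norm(values, eps=1e-5):
    """Normalize a batch of scalars to zero mean and (near) unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]

def relu(values):
    """Nonlinear mapping: negative entries are set to zero."""
    return [max(0.0, v) for v in values]

channel = [1.0, 2.0, 3.0, 4.0]       # hypothetical intermediate feature values
standardized = batch_norm(channel)   # standardized features
mapped = relu(standardized)          # mapped intermediate image features
```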
  • the present application also provides an application scenario, where the above video behavior recognition method is applied.
  • the application of the video behavior recognition method in this application scenario is as follows:
  • spatio-temporal information modeling is one of the core issues of video action recognition.
  • Mainstream methods mainly include behavior recognition methods based on two-stream networks and behavior recognition methods based on 3D (three-dimensional) convolutional networks.
  • The former extracts RGB and optical flow features through two parallel networks, and the latter models temporal and spatial information simultaneously through 3D convolution.
  • However, the large number of model parameters and the heavy computational cost limit the efficiency of these methods.
  • Subsequent improved methods mainly decompose the 3D convolution into a 2D spatial convolution and a 1D temporal convolution that model spatial and temporal information separately, thereby improving the efficiency of the model.
  • In this embodiment, a network structure search strategy can be used to adaptively adjust the weights of temporal and spatial information; according to their different contributions in the behavior recognition process, the deep association between temporal and spatial information is mined and their interaction is learned jointly. In addition, a rhythm regulator is designed that obtains a highly cohesive expression of temporal information according to the prior information of the action rhythm and the structural parameters of the temporal convolution, so as to adjust actions of different rhythms. This addresses the differences in feature expression caused by the same action being performed at different rhythms and improves the accuracy of video action recognition.
  • Specifically, the video behavior recognition method includes: extracting video image features from at least two frames of target video images, for example by inputting the at least two frames of target video images into an artificial neural network and obtaining the video image features it extracts; adjusting the contribution of the spatial features of the video image features, specifically through pre-trained structural parameters, to obtain intermediate image features; fusing, based on the prior information, the time feature of the intermediate image features and the cohesive feature corresponding to the time feature, so that the rhythm of the behavior is adjusted by the rhythm regulator, to obtain the fusion feature; adjusting the contribution of the time feature of the fusion feature, specifically through the structural parameters, to obtain the behavior recognition feature; and finally performing video behavior recognition based on the behavior recognition feature to obtain the behavior recognition result.
  • the video behavior recognition method in this embodiment is implemented based on a video behavior recognition model, as shown in FIG. 4 , which is a schematic diagram of the network structure of the video behavior recognition model in this embodiment.
  • X is the video image feature extracted from at least two frames of the target video image. Spatial features are extracted from X by a 1×3×3 2D convolution, and the contribution of the spatial features is adjusted through the spatial structure parameter α1 among the structural parameters to obtain the intermediate image features.
  • the intermediate image features are sequentially processed through batch normalization and nonlinear mapping of activation functions. Specifically, batch normalization and nonlinear mapping of intermediate image features can be realized through the BN layer structure and the ReLU layer structure.
  • The mapped feature A is then passed through two 3×1×1 1D convolutions for temporal feature extraction, one of which is a high-cohesion 1D convolution, so that the cohesive feature corresponding to the time feature of the intermediate image features can be extracted.
  • The outputs of the two branches are weighted by the weight parameters β1 and β2 in the prior information, respectively, and the weighted results of the two branches are fused.
  • The weight parameters β1 and β2 can be structural parameters obtained by training the policy gradient Agent network.
  • The initial weight parameters β1 and β2 are corrected through residuals, and the extraction results of the 1D convolutions are weighted by the residual-corrected weight parameters β1 and β2. After the results of the two 1D convolution branches are fused, the contribution of the time feature of the fusion feature is adjusted through the temporal structure parameter α2 among the structural parameters, and the contribution-adjusted fusion feature is downsampled to obtain the behavior recognition feature. Video behavior recognition is performed based on the behavior recognition feature to obtain the behavior recognition result.
  • the structure parameter refers to the weight parameters of operations such as convolution defined in the operation space, which is a concept in the network structure search technology.
  • The structural parameters corresponding to the temporal and spatial convolutions to be fused, including α1 and α2, can be optimized and updated through two structural parameter update methods, the differential method and the policy gradient method; the outputs of the high-cohesion temporal convolution module and the 1D temporal convolution module can likewise be weighted and fused through the pre-trained structural parameters β1 and β2.
  • The structural parameters for fusing the temporal and spatial convolutions include α1 and α2, and the structural parameters for the weighted fusion of the two temporal convolution branches include β1 and β2.
  • The video image features extracted from the target video image undergo spatial feature extraction through a 1×d×d 2D convolution, and the contribution of the extraction result is adjusted through the spatial structure parameter α1.
  • Specifically, the feature extraction result is multiplied by the structural parameter to achieve the contribution adjustment, after which batch normalization and nonlinear mapping by an activation function are performed in sequence.
  • The mapped result undergoes temporal feature extraction through two t×1×1 1D convolutions; the two extraction results are weighted by the structural parameters β1 and β2, respectively, and fused; and the contribution of the time feature of the weighted fusion result is adjusted through the temporal structure parameter α2 to obtain the behavior recognition feature used for video behavior recognition.
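  • The data flow through one such block can be sketched as follows. This is only a schematic under stated assumptions: the convolutions are passed in as placeholder callables rather than implemented, the names alpha1, alpha2, beta1, and beta2 stand for the spatial and temporal structural parameters and the two branch weights, and all sample values are hypothetical.

```python
def auto_2plus1d_block(x, spatial_op, temporal_op, cohesive_temporal_op,
                       normalize_and_activate, alpha1, alpha2, beta1, beta2):
    """Schematic forward pass: spatial -> scale -> BN/activation -> two temporal branches -> scale."""
    # Spatial feature extraction, contribution adjusted by multiplication.
    spatial = [alpha1 * v for v in spatial_op(x)]
    # Batch normalization followed by nonlinear mapping.
    mapped = normalize_and_activate(spatial)
    # Two temporal branches: plain temporal op and high-cohesion temporal op.
    plain = temporal_op(mapped)
    cohesive = cohesive_temporal_op(mapped)
    # Weighted fusion of the two branches.
    fused = [beta1 * p + beta2 * c for p, c in zip(plain, cohesive)]
    # Temporal contribution adjustment.
    return [alpha2 * v for v in fused]

# Toy stand-ins for the real layers (assumptions, not actual convolutions).
identity = lambda v: list(v)
double = lambda v: [2.0 * e for e in v]
relu_only = lambda v: [max(0.0, e) for e in v]

out = auto_2plus1d_block([1.0, -2.0], identity, identity, double, relu_only,
                         alpha1=0.5, alpha2=2.0, beta1=1.0, beta2=0.5)
print(out)  # [2.0, 0.0]
```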
  • In the differential update mode, a multi-dimensional structural parameter is predefined, for example a multi-dimensional structural parameter vector, specifically a two-dimensional vector, which carries a gradient during the update.
  • The dimensions of the structural parameter correspond to the spatial convolution and the temporal convolution, respectively.
  • The structural parameters are applied to the spatial convolution and the temporal convolution to fuse the features of the two: α1 is applied to the spatial convolution to adjust its contribution, and α2 is applied to the temporal convolution to adjust its contribution. An error value is calculated from the predicted results and the real results of the video behavior recognition model, the structural parameters are updated using the gradient descent algorithm, and the trained structural parameters are obtained at the end of training.
  • the operation space in the network structure search technology is denoted as O, and o is a specific operation.
  • a node refers to a collection of basic operation units in the network structure search method.
  • Let i and j be two sequentially adjacent nodes, let α_ij denote the weights of the group of candidate operations between them, and let P be the corresponding probability distribution.
  • the candidate operation with the maximum probability between nodes i and j is obtained through the max function, and the final network structure is formed by stacking the operations obtained by searching between different nodes, as shown in the following formula (1):
  • N is the number of nodes.
  • L_train(w, α) is the objective function of the network structure, and w is the model parameter of the network structure.
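  • Selecting the candidate operation with the maximum probability between two nodes can be sketched as follows. The sketch assumes that the probability distribution P is a softmax over the candidate weights, which is common in differentiable structure search but is not stated in the text; the operation names and weight values are hypothetical.

```python
import math

def softmax(weights):
    """Turn raw candidate weights into a probability distribution."""
    peak = max(weights)
    exps = [math.exp(w - peak) for w in weights]
    total = sum(exps)
    return [e / total for e in exps]

def select_operation(candidate_ops, weights):
    """Return the candidate with the highest probability, plus all probabilities."""
    probs = softmax(weights)
    best = max(range(len(probs)), key=probs.__getitem__)
    return candidate_ops[best], probs

# Hypothetical candidates between nodes i and j.
ops = ["conv_1x3x3", "conv_3x1x1", "skip"]
chosen, probs = select_operation(ops, [0.2, 1.5, -0.3])
print(chosen)  # conv_3x1x1
```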
  • the blocks of this embodiment are defined between two nodes.
  • these nodes represent the output of the previous block and the input of the next block.
  • Sequentially connected 1 ⁇ d ⁇ d convolutions and t ⁇ 1 ⁇ 1 convolutions are defined inside the block.
  • Structure parameters are used on top of these two convolutions to tune their strength.
  • ⁇ 2j ... ⁇ 2m ⁇ 2 , ⁇ 1n is determined as the structural parameter ⁇ 1 in Fig. 6
  • ⁇ 21 is the structural parameter ⁇ 2 .
  • Let o(·) be an operation defined in the search space O and acting on the input x, and let α(i,j) be the weight vector between node i and node j; the following formula (3) can then be obtained:
  • where F is the linear map of the weight vector, y(i,j) is the sum of the linear maps of all weight vectors in the search space, and F can be set as a fully connected layer.
  • each cell unit is defined as a (2 +1)D convolutional blocks, so ⁇ o (i,j) is fixed. Therefore, the learning objective can be further simplified as the following formula (4),
  • where w_α is the structural parameter of the network, w_n is the model parameter of the network, and y is the output of the (2+1)D convolutional block.
  • The structural parameters w_α and the model parameters w_n of the network are trained synchronously, and gradient descent optimization is performed based on the objective function L_val to obtain structural parameters w_α and model parameters w_n that meet the requirements, thereby completing network training.
  • In the policy gradient update mode, a multi-dimensional structural parameter is predefined, for example a multi-dimensional structural parameter vector, specifically a two-dimensional vector, whose gradient information is truncated during the update.
  • The dimensions of the structural parameter correspond to the spatial convolution and the temporal convolution, respectively.
  • a policy gradient agent network is pre-defined to generate the next structural parameter according to the current structural parameters and the network state of the policy gradient agent network.
  • the generated structural parameters are applied to spatial convolution and temporal convolution to fuse the features of the two.
  • the network parameters of the Agent are updated, and then the new Agent predicts the next structural parameters, so as to realize the updating of the structural parameters.
  • Policy gradient descent is a reinforcement learning method in which the policy refers to the actions taken in different states. The goal is to perform gradient-based optimization of the policy, so that the trained policy gradient network Agent takes better actions according to the current state and obtains a higher reward value.
  • In the policy gradient Agent network, the parameters of the current policy gradient Agent network are used as the state, the structural parameters output by the network are used as the action, and the loss of the current backbone network, that is, the video behavior recognition model, together with a reward constant, are used as components of the reward function.
  • In the forward pass, the initial structural parameters are first input into the Agent network, which then predicts the next structural parameters, that is, the action, according to the current Agent network parameters and the input structural parameters. In backpropagation, the currently obtainable reward value is maximized, and the parameters of the Agent network are updated through the reward value. Assuming that the current state is s, that a denotes the current action, and that θ denotes the parameters of the network, the cross-entropy loss CE is given by the following formula (6):
  • The reward function can be designed based on the smoothed CE value, so that the search for structural parameters and the learning of the backbone network of the video behavior recognition model assist each other.
  • The smoothed CE is given by the following formula (7):
  • where i, j, and N are the correct category, the other categories, and the total number of categories, respectively, and ε is a very small constant. Further, if the value SCE_n obtained at the next time step n is greater than the value SCE_m obtained at the previous time step m, a positive reward value γ is given; otherwise the reward is -γ.
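  • The sign rule for the reward can be sketched as follows. The sketch follows the comparison exactly as worded above and takes the smoothed CE values as given inputs, since the formula for computing them is not reproduced here; the function name, the reward magnitude, and the sample history are hypothetical.

```python
def step_reward(sce_next, sce_previous, reward_magnitude=1.0):
    """Positive reward if the later smoothed CE value is greater, negative otherwise."""
    return reward_magnitude if sce_next > sce_previous else -reward_magnitude

# Hypothetical smoothed CE values recorded at three consecutive time steps.
history = [0.9, 1.1, 1.0]
rewards = [step_reward(later, earlier, reward_magnitude=0.5)
           for earlier, later in zip(history, history[1:])]
print(rewards)  # [0.5, -0.5]
```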
  • f is the reward value
  • is the set variable
  • f(s, a) is the network prediction output.
  • The multi-layer perceptrons (MLPs) corresponding to the structural parameters of the two parts, namely the importance of spatio-temporal information and the prior excitation module for narrowing intra-class differences, are 3-layer neural networks with 6 and 4 hidden-layer neurons, respectively. A ReLU activation function is added between layers, and the last layer uses a softplus activation function. Because the policy gradient mechanism requires a complete sequence of states and actions, the lack of feedback in intermediate states leads to a poor overall training effect.
  • Two settings are possible: in one, the interval is set to 1 epoch, that is, every 2 epochs the reward of the most recent epoch is calculated; in the other, the update is treated as an optimization within an iteration, which is more conducive to optimization.
  • The parameters of the network and the parameters of the Agent are separated and optimized independently, and a different optimizer can be used for each: the Agent is optimized with the Adam optimizer, and the network parameters are optimized with stochastic gradient descent (SGD). The two are updated alternately during optimization.
  • The Auto(2+1)D convolution structure, that is, a structure of 2D convolution followed by 1D convolution, is used to fuse the spatio-temporal information in the video image features.
  • Auto(2+1)D is composed of sequentially connected 2D convolution and 1D convolution, their corresponding structural parameters, and activation functions.
  • 2D convolution and 1D convolution are used to decouple the time and space information in the feature, and independent modeling is performed, that is, spatial feature extraction is performed through 2D convolution, and temporal feature extraction is performed through 1D convolution.
  • the decoupled information is adaptively fused through the structural parameters, and the nonlinear expression ability of the model is increased through the activation function.
  • 2D convolution and 1D convolution form a basic convolution block, which can be used as the basic block structure in the network, such as the Block structure in ResNet (Residual Neural Network, residual network).
  • The rhythm regulator contains a prior excitation module and a high-cohesion temporal expression module.
  • The prior excitation module can set a margin for the current structural parameters according to the similarity of the features in the time dimension, so as to promote the optimization of the structural parameters.
  • The high-cohesion temporal expression module can increase the cohesion of the information in the time dimension through an efficient attention mechanism. Specifically, the feature map output by the previous layer is input into the 2D convolution to extract spatial features.
  • The features output by the 2D convolution are input into the prior excitation module, which calculates their similarity in the time dimension and sets an appropriate margin for the structural parameters according to the similarity value.
  • The features output by the 2D convolution are also input into the high-cohesion temporal module and the 1D temporal convolution module, each of which outputs a feature map; the weights of the two output feature maps are adaptively adjusted according to the structural parameters carrying the prior information, and the feature maps are fused to obtain the fused features.
  • The 3×1×1 temporal convolution branch is replaced by two branches: a 3×1×1 temporal convolution and a 3×1×1 temporal convolution with expectation-maximization attention.
  • The prior excitation module mainly acts on the features through the excitation-optimized prior parameters β1 and β2. As shown in FIG. 7, the video image features extracted from the target video image undergo spatial feature extraction through a 1×3×3 2D convolution, and the contribution of the extraction result is adjusted through α1. After the contribution adjustment, batch normalization and nonlinear mapping by an activation function are performed, and the mapped result is processed by the prior excitation module.
  • In the prior excitation module, the similarity of the mapped result in the time dimension is calculated, and the initial prior parameters β1 and β2 are corrected based on the similarity. The results of temporal feature extraction by the two t×1×1 1D convolutions are weighted and fused through the corrected prior parameters β1 and β2, and the contribution of the time feature of the weighted fusion result is adjusted through the structural parameter α2 to obtain the behavior recognition feature used for video behavior recognition.
  • The arrows represent the flow direction of the feature map: modules are connected by feeding the feature map output by the previous module into the next module. The feature map obtained after the prior similarity excitation module is input in parallel into the next convolution branches, and the final output is obtained by concatenating the feature maps of the two branches and reducing the dimensionality.
  • the degree of change, and based on the threshold of the degree of change, the current prior parameters are divided into positive and negative parameters.
  • the prior parameters after excitation correction and the original input prior parameters are merged in the way of residual connection as the final prior parameters.
  • The element values of the tensor often do not have a large variance and are uniformly small. Therefore, the current similarity prior information can be set by setting a margin and dynamically adjusting the threshold, and the following formula (10) can be obtained:
  • where Sim represents the similarity value, Thres is the threshold, and β1 and β2 are the prior parameters.
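  • One possible way to compute a similarity value Sim in the time dimension and compare it with a threshold Thres is sketched below. The text does not specify the similarity measure, so the cosine similarity between adjacent frames, the averaging, the threshold of 0.4, and all names and values are assumptions; how the prior parameters are then corrected is not shown.

```python
import math

def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def temporal_similarity(frames):
    """Average cosine similarity between each pair of adjacent frames."""
    pairs = list(zip(frames, frames[1:]))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)

# Hypothetical per-frame feature vectors (three frames, two channels).
frames = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
sim = temporal_similarity(frames)   # (1.0 + 0.0) / 2
above_threshold = sim > 0.4         # assumed threshold value
```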
  • The high-cohesion temporal module obtains a highly cohesive temporal expression based on an attention mechanism optimized by the EM (Expectation-Maximization) algorithm. For each sample, the features are reconstructed through a fixed number of iterative optimizations. As shown in FIG. 8, this process can be divided into an E step and an M step: after the feature map is downsampled, it is processed by the E step and the M step, respectively, and the results are fused to obtain the high-cohesion features.
  • B is the batch size, that is, the data size of batch processing
  • C is the number of channels corresponding to the original input video image features
  • K is the base vector dimension.
  • in step E, the base vectors are matrix-multiplied with the B×(H×W)×C spatial feature vectors obtained by spatial feature extraction, and softmax is then applied to reconstruct the original features, yielding a feature map of size B×(H×W)×K.
  • the reconstructed feature map of size B×(H×W)×K is multiplied by the original feature map of size B×(H×W)×C to obtain new base vectors of size B×C×K.
  • L2 regularization is performed on it, and the sliding average update of the base vector is added during training, as shown in the following formula (11),
  • mu is the base vector
  • mu_mean is its mean
  • momentum is the momentum
  • the base vector obtained in step M and the attention map obtained in step E are matrix-multiplied to obtain the final reconstructed feature map with global information.
  • the video behavior recognition method provided in this embodiment is applied in the field of video recognition, and in the field of video recognition, three-dimensional convolution is currently widely used, but it is difficult to expand due to the limitation of its high parameter amount.
  • Some improved methods decompose 3D convolutions into 2D spatial convolutions and 1D temporal convolutions on the basis of low computational cost, small memory requirements, and high performance.
  • the industry has not paid attention to the fact that the spatial and temporal cues in the video have different effects on different action categories.
  • the adaptive spatiotemporal entanglement network involved automatically fuses the decomposed spatiotemporal information based on importance analysis to obtain a more powerful spatiotemporal representation.
  • Auto(2+1)D convolution adaptively recombines and decouples spatiotemporal convolution filters through network structure search to model the unequal contributions of spatial and temporal information, and mines the deep correlations between them.
  • by integrating spatiotemporal information with different weights to learn spatiotemporal interaction information, the model's ability to model temporal and spatial information is enhanced.
  • the Rhythm Regulator uses the efficient attention mechanism of the EM algorithm to extract high cohesion features in the time dimension, and can adjust the temporal information of actions with different rhythms according to the prior information of the action rhythm and the structural parameters of the temporal convolution. Obtaining highly cohesive expressions of temporal information in this way addresses the problem of different durations in different action classes and can improve the accuracy of video action recognition.
  • FIGS. 2-3 may include multiple steps or stages. These steps or stages are not necessarily executed at the same time, but may be executed at different times. Nor are they necessarily executed sequentially; they may be executed in turn or alternately with other steps, or with at least a part of the steps or stages in other steps.
  • a video behavior recognition device 900 is provided.
  • the device may be implemented as a software module, a hardware module, or a combination of the two that forms part of a computer device.
  • the device specifically includes: a video image feature extraction module 902, a spatial feature contribution adjustment module 904, a feature fusion module 906, a temporal feature contribution adjustment module 908 and a video behavior recognition module 910, wherein:
  • Video image feature extraction module 902 for extracting video image features from at least two frames of target video images
  • the spatial feature contribution adjustment module 904 is used to adjust the contribution of the spatial feature of the video image feature to obtain the intermediate image feature
  • the feature fusion module 906 is used to fuse the time features of the intermediate image features and the cohesive features corresponding to the time features based on the prior information to obtain the fusion features; the prior information is obtained according to the change information of the intermediate image features in the time dimension; The cohesive feature is obtained by focusing on the time feature;
  • the temporal feature contribution adjustment module 908 is used to adjust the temporal feature contribution to the fusion feature to obtain behavior recognition features
  • the video behavior recognition module 910 is configured to perform video behavior recognition based on behavior recognition features.
  • the spatial feature contribution adjustment module 904 is also used to extract the spatial features of the video image features to obtain the spatial features of the video image features; adjust the contribution of the spatial features through the spatial structure parameters in the structural parameters to obtain the intermediate Image features; structural parameters are obtained through training of video image samples carrying behavior tags; the temporal feature contribution adjustment module 908 is also used to adjust the contribution of the fusion features through the temporal structure parameters in the structural parameters to obtain behavior recognition features.
  • it also includes a parameter determination module to be trained, an intermediate sample feature acquisition module, a fusion sample feature acquisition module, a behavior recognition sample feature acquisition module, and an iterative module; wherein: the parameter determination module to be trained is used to determine the structure to be trained Parameters; the intermediate sample feature acquisition module is used to adjust the contribution of the spatial sample features of the video image sample features through the spatial structure parameters in the structural parameters to be trained to obtain intermediate sample features; the video image sample features are extracted from the video image samples The fusion sample feature acquisition module is used to fuse the time sample feature of the intermediate sample feature and the cohesive sample feature corresponding to the time sample feature based on the prior sample information to obtain the fusion sample feature; the cohesive sample feature is the time sample feature Obtained by attention processing; the prior sample information is obtained according to the change information of the intermediate sample features in the time dimension; the behavior recognition sample feature acquisition module is used to adjust the contribution of the fusion sample features through the time structure parameters in the structure parameters to be trained , to obtain the behavior recognition sample features; the iteration module is used to determine the
  • the video behavior recognition device is realized by a video behavior recognition model, and the structural parameters to be trained are the parameters of the video behavior recognition model in training;
  • the iteration module also includes a recognition result acquisition module, a difference determination module, a structural parameter update module and Structural parameter acquisition module; wherein: the identification result acquisition module is used to obtain the behavior recognition result output by the video behavior recognition model; the difference determination module is used to determine the difference between the behavior recognition result and the corresponding behavior label of the video image sample; the structural parameter The update module is used to update the model parameters in the video behavior recognition model and the structural parameters to be trained according to the differences; the structural parameter acquisition module is used to continue training based on the updated video behavior recognition model until the end of the training, and according to the completed training The video action recognition model gets the structural parameters.
  • the iteration module further includes a recognition loss determination module, a reward value acquisition module, and a reward value processing module; wherein: the recognition loss determination module is used to determine the behavior recognition loss between the behavior recognition result and the behavior label corresponding to the video image sample; the reward value acquisition module is used to obtain a reward value according to the behavior recognition loss and the previous behavior recognition loss; the reward value processing module is used to update the structural parameters to be trained according to the reward value, and to continue training with the updated structural parameters to be trained until the objective function satisfies the end condition, to obtain the structural parameters; the objective function is obtained based on each reward value in the training process.
  • the reward value obtaining module is further configured to update the model parameters of the policy gradient network model according to the reward value; the updated policy gradient network model updates the structure parameters to be trained.
  • the reward value obtaining module is also used to predict the structure parameters based on the updated model parameters and the structure parameters to be trained through the updated policy gradient network model, and obtain the predicted structure parameters; and according to the predicted structure Parameters to obtain the updated structural parameters of the structural parameters to be trained.
  • it also includes a similarity determination module and a priori information correction module; wherein: a similarity determination module is used to determine the similarity of the intermediate image features in the time dimension; a priori information correction module is used for based on the similarity Correct the initial prior information to obtain the prior information.
  • the initial prior information includes a first initial prior parameter and a second initial prior parameter
  • the prior information correction module includes a similarity adjustment module, a prior parameter correction module and a prior information acquisition module; wherein: the similarity adjustment module is used to dynamically adjust the similarity according to the first initial prior parameter, the second initial prior parameter and the preset threshold; the prior parameter correction module is used to correct, through the dynamically adjusted similarity,
  • the first initial prior parameter and the second initial prior parameter, to obtain the first prior parameter and the second prior parameter;
  • the prior information acquisition module is used to obtain the prior information according to the first prior parameter and the second prior parameter.
  • it also includes a base vector determination module, a feature reconstruction module, a base vector update module and a cohesive feature acquisition module; wherein: the base vector determination module is used to determine the current base vector; the feature reconstruction module is used for The time feature of the intermediate image feature is reconstructed through the current base vector to obtain the reconstructed feature; the base vector update module is used to generate the base vector for the next attention process according to the reconstructed feature and the time feature; the cohesive feature acquisition module, It is used to obtain the cohesive feature corresponding to the time feature according to the basis vector, basis vector and time feature of the next attention process.
  • the base vector update module also includes an attention feature module, a regularization processing module, and a sliding average update module; wherein: the attention feature module is used to fuse reconstruction features and time features to generate attention features; regularization The processing module is used to perform regularization processing on attention features to obtain regularization features; the sliding average update module is used to perform sliding average update on regularization features to generate a basis vector for next attention processing.
  • the current base vector includes the data size of batch processing, the number of channels of the intermediate image feature, and the dimension of the base vector;
  • the feature reconstruction module is also used to perform, in order, matrix multiplication and normalized mapping processing on the current base vector and the time feature of the intermediate image feature to obtain reconstructed features;
  • the base vector update module is also used to perform matrix multiplication of reconstructed features and time features to obtain the base vector for the next attention process;
  • the cohesive feature acquisition module is also used to fuse the base vector for the next attention process, the current base vector and the time feature to obtain the cohesive feature corresponding to the time feature.
  • the feature fusion module 906 is also used to determine prior information; perform temporal feature extraction on intermediate image features to obtain the temporal features of intermediate image features; and, through the prior information, weight and fuse the temporal features and the cohesive features corresponding to the temporal features to obtain the fused features.
  • the prior information includes the first prior parameter and the second prior parameter; the feature fusion module 906 is also used to weight the time feature through the first prior parameter, and obtain the weighted time feature ; weighting the cohesive features corresponding to the time features through the second prior parameter to obtain the weighted cohesive features; and fusing the weighted time features and the weighted cohesive features to obtain the fusion features .
  • it also includes a normalization processing module and a nonlinear mapping module; wherein: the normalization processing module is used to standardize the intermediate image features to obtain standardized features; the nonlinear mapping module is used to perform nonlinear mapping on the standardized features to obtain the mapped intermediate image features; the feature fusion module 906 is also used to fuse the time features of the mapped intermediate image features and the cohesive features corresponding to the time features based on prior information to obtain fusion features; the prior information is obtained according to the change information of the mapped intermediate image features in the time dimension.
  • the normalization processing module is also used to perform normalization processing on the intermediate image features through the batch normalization layer structure to obtain standardized features;
  • the nonlinear mapping module is also used to perform nonlinear mapping on the standardized features through the activation function to obtain the mapped intermediate image features.
  • Each module in the above-mentioned video behavior recognition device can be fully or partially realized by software, hardware and a combination thereof.
  • the above-mentioned modules can be embedded in or independent of the processor in the computer device in the form of hardware, and can also be stored in the memory of the computer device in the form of software, so that the processor can invoke and execute the corresponding operations of the above-mentioned modules.
  • a computer device is provided.
  • the computer device may be a server or a terminal, and its internal structure may be as shown in FIG. 10 .
  • the computer device includes a processor, memory and a network interface connected by a system bus. Wherein, the processor of the computer device is used to provide calculation and control capabilities.
  • the memory of the computer device includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium stores an operating system, computer readable instructions and a database.
  • the internal memory provides an environment for the execution of the operating system and computer readable instructions in the non-volatile storage medium.
  • the database of the computer device is used to store model data.
  • the network interface of the computer device is used to communicate with an external terminal via a network connection. When the computer-readable instructions are executed by the processor, a video behavior recognition method is realized.
  • FIG. 10 is only a block diagram of a part of the structure related to the solution of this application, and does not constitute a limitation to the computer equipment on which the solution of this application is applied.
  • the specific computer equipment may include more or fewer components than shown in the figures, may combine some components, or may have a different arrangement of components.
  • a computer device including a memory and a processor, where computer-readable instructions are stored in the memory, and the processor implements the steps in the foregoing method embodiments when executing the computer-readable instructions.
  • a computer-readable storage medium which stores computer-readable instructions, and when the computer-readable instructions are executed by a processor, the steps in the foregoing method embodiments are implemented.
  • a computer program product or computer program comprising computer readable instructions stored in a computer readable storage medium.
  • the processor of the computer device reads the computer-readable instructions from the computer-readable storage medium, and the processor executes the computer-readable instructions, so that the computer device executes the steps in the foregoing method embodiments.
  • Non-volatile memory may include read-only memory (Read-Only Memory, ROM), magnetic tape, floppy disk, flash memory or optical memory, etc.
  • Volatile memory can include Random Access Memory (RAM) or external cache memory.
  • RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM).
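
The E step and M step described in the excerpts above (softmax attention map of size (H×W)×K, new base vectors of size C×K, L2 normalization, moving-average update) can be sketched for a single sample as follows. This is a minimal illustration in plain Python, not the patent's implementation: the function names, the iteration count, the column-wise L2 normalization and the exponential form of the moving average are assumptions, since formula (11) is not reproduced in this excerpt. Only `mu`, `mu_mean` and `momentum` are names taken from the text.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def softmax_rows(m):
    """Apply a numerically stable softmax to every row."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def l2_normalize_columns(m):
    """Scale every column (one base vector) to unit L2 norm."""
    cols = []
    for col in zip(*m):
        norm = math.sqrt(sum(v * v for v in col)) or 1.0
        cols.append([v / norm for v in col])
    return transpose(cols)

def em_attention(x, mu, iterations=3):
    """x: N x C features of one sample (N = H*W); mu: C x K base vectors.

    Returns the N x C reconstructed features and the updated C x K bases.
    """
    z = None
    for _ in range(iterations):
        z = softmax_rows(matmul(x, mu))                     # E step: N x K attention map
        mu = l2_normalize_columns(matmul(transpose(x), z))  # M step: C x K bases
    x_hat = matmul(z, transpose(mu))                        # N x C reconstruction
    return x_hat, mu

def moving_average(mu_mean, mu, momentum=0.9):
    """Assumed exponential moving average of the bases during training."""
    return [[momentum * a + (1.0 - momentum) * b for a, b in zip(r1, r2)]
            for r1, r2 in zip(mu_mean, mu)]
```

A batch dimension B would simply repeat `em_attention` per sample; it is omitted here to keep the shapes easy to follow.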

Abstract

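A video behavior recognition method, performed by a computer device, comprising: extracting video image features from at least two frames of target video images (202); performing contribution adjustment on the spatial features of the video image features to obtain intermediate image features (204); fusing, based on prior information, the temporal features of the intermediate image features with the cohesive features corresponding to the temporal features to obtain fused features, where the prior information is obtained from change information of the intermediate image features in the time dimension and the cohesive features are obtained by performing attention processing on the temporal features (206); performing temporal feature contribution adjustment on the fused features to obtain behavior recognition features (208); and performing video behavior recognition based on the behavior recognition features (210).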

Description

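Video behavior recognition method and apparatus, computer device, and storage medium
This application claims priority to Chinese Patent Application No. 2021112027344, entitled "Video behavior recognition method and apparatus, computer device, and storage medium" and filed with the China Patent Office on October 15, 2021, which is incorporated herein by reference in its entirety.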
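Technical Field
This application relates to the field of computer technology, and in particular to a video behavior recognition method and apparatus, a computer device, a storage medium, and a computer program product.
Background
With the development of computer technology, computer vision has been widely applied in fields such as industry, healthcare, social networking, and navigation. Through computer vision, a computer can take the place of the human eye in visual perception tasks such as recognizing and measuring targets, thereby simulating biological vision. Video behavior recognition is one of the important topics in computer vision: it identifies the actions of a target object in a given video, such as eating, running, or talking.
At present, video behavior recognition mostly works by extracting features from the video and recognizing behavior from them. However, the features extracted in conventional video behavior recognition cannot effectively reflect the behavior information in the video, which results in low recognition accuracy.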
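Summary
According to various embodiments provided in this application, a video behavior recognition method and apparatus, a computer device, a storage medium, and a computer program product are provided.
A video behavior recognition method, performed by a computer device, the method comprising:
extracting video image features from at least two frames of target video images;
performing contribution adjustment on the spatial features of the video image features to obtain intermediate image features;
fusing, based on prior information, the temporal features of the intermediate image features with the cohesive features corresponding to the temporal features to obtain fused features, the prior information being obtained from change information of the intermediate image features in the time dimension, and the cohesive features being obtained by performing attention processing on the temporal features;
performing temporal feature contribution adjustment on the fused features to obtain behavior recognition features; and
performing video behavior recognition based on the behavior recognition features.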
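A video behavior recognition apparatus, the apparatus comprising:
a video image feature extraction module, configured to extract video image features from at least two frames of target video images;
a spatial feature contribution adjustment module, configured to perform contribution adjustment on the spatial features of the video image features to obtain intermediate image features;
a feature fusion module, configured to fuse, based on prior information, the temporal features of the intermediate image features with the cohesive features corresponding to the temporal features to obtain fused features, the prior information being obtained from change information of the intermediate image features in the time dimension, and the cohesive features being obtained by performing attention processing on the temporal features;
a temporal feature contribution adjustment module, configured to perform temporal feature contribution adjustment on the fused features to obtain behavior recognition features; and
a video behavior recognition module, configured to perform video behavior recognition based on the behavior recognition features.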
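A computer device, comprising a memory and a processor, the memory storing computer-readable instructions, and the processor implementing the following steps when executing the computer-readable instructions:
extracting video image features from at least two frames of target video images;
performing contribution adjustment on the spatial features of the video image features to obtain intermediate image features;
fusing, based on prior information, the temporal features of the intermediate image features with the cohesive features corresponding to the temporal features to obtain fused features, the prior information being obtained from change information of the intermediate image features in the time dimension, and the cohesive features being obtained by performing attention processing on the temporal features;
performing temporal feature contribution adjustment on the fused features to obtain behavior recognition features; and
performing video behavior recognition based on the behavior recognition features.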
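A computer-readable storage medium, storing computer-readable instructions that, when executed by a processor, implement the following steps:
extracting video image features from at least two frames of target video images;
performing contribution adjustment on the spatial features of the video image features to obtain intermediate image features;
fusing, based on prior information, the temporal features of the intermediate image features with the cohesive features corresponding to the temporal features to obtain fused features, the prior information being obtained from change information of the intermediate image features in the time dimension, and the cohesive features being obtained by performing attention processing on the temporal features;
performing temporal feature contribution adjustment on the fused features to obtain behavior recognition features; and
performing video behavior recognition based on the behavior recognition features.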
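A computer program product, comprising computer-readable instructions that, when executed by a processor, implement the following steps:
extracting video image features from at least two frames of target video images;
performing contribution adjustment on the spatial features of the video image features to obtain intermediate image features;
fusing, based on prior information, the temporal features of the intermediate image features with the cohesive features corresponding to the temporal features to obtain fused features, the prior information being obtained from change information of the intermediate image features in the time dimension, and the cohesive features being obtained by performing attention processing on the temporal features;
performing temporal feature contribution adjustment on the fused features to obtain behavior recognition features; and
performing video behavior recognition based on the behavior recognition features.
Details of one or more embodiments of this application are set forth in the accompanying drawings and the description below. Other features, objects, and advantages of this application will become apparent from the specification, the drawings, and the claims.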
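Brief Description of the Drawings
To describe the technical solutions in the embodiments of this application or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are merely some embodiments of this application, and a person of ordinary skill in the art may derive other drawings from them without creative effort.
FIG. 1 is a diagram of the application environment of a video behavior recognition method in one embodiment;
FIG. 2 is a schematic flowchart of a video behavior recognition method in one embodiment;
FIG. 3 is a schematic flowchart of cohesion processing of temporal features in one embodiment;
FIG. 4 is a schematic structural diagram of a video behavior recognition model in one embodiment;
FIG. 5 is a schematic flowchart of weighted fusion with structural parameters in one embodiment;
FIG. 6 is a schematic diagram of the processing for determining structural parameters in one embodiment;
FIG. 7 is a schematic flowchart of feature fusion based on prior information in one embodiment;
FIG. 8 is a schematic flowchart of high cohesion processing in one embodiment;
FIG. 9 is a structural block diagram of a video behavior recognition apparatus in one embodiment;
FIG. 10 is a diagram of the internal structure of a computer device in one embodiment.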
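Detailed Description
To make the objectives, technical solutions, and advantages of this application clearer, this application is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here only explain this application and do not limit it.
The video behavior recognition method provided in this application can be applied in the application environment shown in FIG. 1, in which a terminal 102 communicates with a server 104 over a network. The terminal 102 may film a target object to obtain a video and send the video to the server 104. The server 104 extracts at least two frames of target video images from the video and performs contribution adjustment on the spatial features of the video image features extracted from those frames. Using prior information obtained from the change information, in the time dimension, of the intermediate image features produced by the contribution adjustment, the server fuses the temporal features of the intermediate image features with the cohesive features obtained by performing attention processing on the temporal features. It then performs temporal feature contribution adjustment on the fused features and performs video behavior recognition based on the resulting behavior recognition features. The server 104 may feed the video behavior recognition result back to the terminal 102.
In some embodiments, the video behavior recognition method may also be performed by the server 104 alone; for example, the server 104 may obtain at least two frames of target video images from a database and perform video behavior recognition on them. In some embodiments, the method may also be performed by the terminal 102: after capturing a video, the terminal 102 extracts at least two frames of target video images from it and performs video behavior recognition based on those frames.
The terminal 102 may be, but is not limited to, a personal computer, a notebook computer, a smartphone, a tablet computer, an in-vehicle device, or a portable wearable device. The server 104 may be implemented as an independent server or as a server cluster composed of multiple servers.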
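In one embodiment, as shown in FIG. 2, a video behavior recognition method is provided. The method is described using its application to the server 104 in FIG. 1 as an example, and includes the following steps:
Step 202: extract video image features from at least two frames of target video images.
The target video images are images from the video on which behavior recognition is to be performed, specifically images extracted from that video. For example, if the video to be processed is a basketball video captured by the terminal 102, the target video images may be images extracted from the basketball video. There is more than one frame of target video images, so that behavior recognition can use the temporal information between frames. In general, some actions in video behavior recognition do not need temporal information, that is, the association between multiple frames; spatial information alone is enough to recognize behaviors such as drinking water or eating. More fine-grained behaviors, however, require the association between frames, that is, the temporal information reflected across multiple frames. For instance, distinguishing between bouncing the ball downward and catching the ball upward in basketball requires considering multiple video frames together. In a specific application, the target video images may be multiple frames extracted consecutively from the video, such as 5 or 10 consecutive frames.
The video image features are obtained by performing feature extraction on the target video images and reflect their image characteristics. They may be image features extracted by any of various image feature extraction methods, for example by passing each frame of the target video images through an artificial neural network.
Specifically, the server 104 obtains at least two frames of target video images, which are extracted from the video captured by the terminal 102 and may be multiple consecutively extracted frames. The server 104 extracts video image features from these frames; for example, it may perform image feature extraction on each frame separately, such as by inputting each frame into an artificial neural network, to obtain the video image features corresponding to each frame.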
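Step 204: perform contribution adjustment on the spatial features of the video image features to obtain intermediate image features.
The spatial features reflect the spatial information of the target video images. The spatial information may include the distribution of pixel values across the pixels of a target video image, that is, the characteristics of the image itself. Spatial features can represent the static characteristics of the objects contained in the target video images. They can be further extracted from the video image features, so as to obtain from the video image features those features that reflect the spatial information of the target video images; in a specific implementation, feature extraction may be performed on the video image features in the spatial dimension. Contribution adjustment adjusts the degree of contribution of the spatial features, which is the degree to which the spatial features influence the recognition result when video behavior recognition is performed based on the features of the target video images. The greater the contribution of the spatial features, the greater their influence on video behavior recognition, that is, the closer the recognition result is to the behavior reflected by the spatial features. Contribution adjustment may be implemented by adjusting the spatial features with preset weight parameters to obtain the intermediate image features, which are the image features obtained after adjusting the degree of contribution of the spatial features of the video image features to video behavior recognition.
Specifically, after obtaining the video image features, the server 104 performs contribution adjustment on the spatial features of the video image features corresponding to each frame of the target video images. The server 104 may perform spatial feature extraction on each video image feature to extract its spatial features, and then adjust the contribution of those spatial features based on spatial weight parameters to obtain the intermediate image features. The spatial weight parameters may be preset, for example obtained in advance by training on video image samples carrying behavior labels.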
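Step 206: fuse, based on prior information, the temporal features of the intermediate image features with the cohesive features corresponding to the temporal features to obtain fused features; the prior information is obtained from change information of the intermediate image features in the time dimension, and the cohesive features are obtained by performing attention processing on the temporal features.
The prior information reflects prior knowledge about the target video images in the time dimension. It is obtained from the change information of the intermediate image features in the time dimension, specifically from the similarity of the intermediate image features in the time dimension. For example, the prior information may include the weight parameters applied to each feature being fused; in that case, the similarity in the time dimension can be computed over the intermediate image features corresponding to each frame, and the prior information containing the weight parameters is derived from that similarity. The temporal features reflect the temporal information of the target video images within the video. This temporal information may include the association between the target video images in the video, that is, the characteristics of their temporal order. Temporal features can represent the dynamic characteristics of the objects contained in the target video images, enabling recognition of the objects' dynamic behavior. They can be further extracted from the intermediate image features, so as to obtain from them the features that reflect the temporal information of the target video images; in a specific implementation, feature extraction may be performed on the intermediate image features in the time dimension. The cohesive features corresponding to the temporal features are obtained by performing attention processing on the temporal features. Attention processing attends to those parts of the temporal features that help video behavior recognition and highlights them, yielding cohesive features with low redundancy and strong cohesion; for example, an attention-mechanism-based algorithm may be applied to the temporal features of the intermediate image features. Because the cohesive features are highly cohesive, meaning that the focal features of the temporal information are prominent, redundancy is low, and feature effectiveness is high, they can accurately express the information of the target video images in the time dimension, which helps improve the accuracy of video behavior recognition.
The temporal features of the intermediate image features and the corresponding cohesive features are fused using the prior information, that is, according to the prior knowledge it contains, to obtain the fused features. This ensures the cohesion of the temporal information in the fused features and strengthens the expression of important features in the time dimension, which improves the accuracy of video behavior recognition. In a specific implementation, the prior information may include separate weight parameters for the temporal features and for the corresponding cohesive features, and the two are weighted and fused using these weight parameters to obtain the fused features.
Specifically, after obtaining the intermediate image features, the server 104 may obtain the prior information, which is derived from the change information of the intermediate image features in the time dimension, for example from their cosine similarity in the time dimension. The server 104 fuses the temporal features of the intermediate image features with the corresponding cohesive features based on the prior information. To do so, the server 104 may perform feature extraction on the intermediate image features in the time dimension to obtain their temporal features, and then determine the corresponding cohesive features. The cohesive features are obtained by performing attention processing on the temporal features; the server 104 may, for example, apply an attention-mechanism algorithm to the temporal features. The server 104 then fuses the temporal features and the cohesive features according to the prior information; for example, it may weight and fuse them according to the weight parameters in the prior information to obtain the fused features.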
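Step 208: perform temporal feature contribution adjustment on the fused features to obtain behavior recognition features.
Temporal feature contribution adjustment adjusts the degree of contribution of the fused features in the time dimension, which is the degree to which the time-dimension features of the fused features influence the recognition result when video behavior recognition is performed based on the features of the target video images. The greater this contribution, the greater the influence of the time-dimension features of the fused features on video behavior recognition, that is, the closer the recognition result is to the behavior reflected by those features. Temporal feature contribution adjustment may be implemented by adjusting the time-dimension features of the fused features with preset weight parameters to obtain the behavior recognition features, which can be used for video behavior recognition.
Specifically, after obtaining the fused features, the server 104 performs temporal feature contribution adjustment on them. The server 104 may adjust the contribution of the fused features in the time dimension according to temporal weight parameters to obtain the behavior recognition features. The temporal weight parameters may be preset, for example obtained in advance by training on video image samples carrying behavior labels.
Step 210: perform video behavior recognition based on the behavior recognition features.
The behavior recognition features are the features used for video behavior recognition; behavior classification can be performed on them to determine the video behavior recognition result corresponding to the target video images. Specifically, the server 104 may perform video behavior recognition based on the behavior recognition features, for example by inputting them into a classifier and deriving the video behavior recognition result from the classification result, thereby recognizing the video behavior effectively.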
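In the video behavior recognition method above, contribution adjustment is performed on the spatial features of the video image features extracted from at least two frames of target video images. Using prior information obtained from the change information, in the time dimension, of the intermediate image features produced by that adjustment, the temporal features of the intermediate image features are fused with the cohesive features obtained by performing attention processing on the temporal features. Temporal feature contribution adjustment is then performed on the fused features, and video behavior recognition is performed based on the resulting behavior recognition features. Adjusting the contribution of the spatial features of the video image features and the temporal contribution of the fused features makes it possible to adjust how much temporal and spatial information contribute to the behavior recognition features, which strengthens their expressiveness with respect to behavior information. Fusing the temporal features with the cohesive features under the prior information focuses the temporal information in the behavior recognition features effectively, so that these features reflect the behavior information in the video and the accuracy of video behavior recognition improves.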
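
Steps 202 to 210 can be read as a five-stage pipeline. The sketch below is a toy illustration in plain Python under loud assumptions: every stage is a hypothetical stand-in (pixel copying instead of a neural network, frame differences as temporal features, an energy softmax as attention, scalar weights as contribution adjustment, a linear scorer as classifier). None of these choices come from the source; only the order of the stages does.

```python
import math

def extract_features(frames):
    # Step 202 stand-in: copy pixel values; a real system would use a neural network.
    return [[float(v) for v in frame] for frame in frames]

def scale(features, weight):
    # Contribution adjustment modelled as multiplication by one scalar weight.
    return [[weight * v for v in row] for row in features]

def temporal_features(features):
    # Stand-in temporal extraction: difference between consecutive frames.
    return [[b - a for a, b in zip(prev, cur)]
            for prev, cur in zip(features, features[1:])]

def cohesive_features(temporal):
    # Stand-in attention: softmax over time of each step's energy reweights the steps.
    energy = [sum(v * v for v in row) for row in temporal]
    peak = max(energy)
    exps = [math.exp(e - peak) for e in energy]
    total = sum(exps)
    return [[(e / total) * v for v in row] for e, row in zip(exps, temporal)]

def fuse(temporal, cohesive, prior):
    # Weighted fusion with the two weights carried by the prior information.
    w_t, w_c = prior
    return [[w_t * t + w_c * c for t, c in zip(t_row, c_row)]
            for t_row, c_row in zip(temporal, cohesive)]

def classify(features, class_weights):
    # Average over time, then pick the class with the highest linear score.
    pooled = [sum(col) / len(features) for col in zip(*features)]
    scores = {label: sum(w * p for w, p in zip(ws, pooled))
              for label, ws in class_weights.items()}
    return max(scores, key=scores.get)

def recognize(frames, spatial_w, temporal_w, prior, class_weights):
    if len(frames) < 2:
        raise ValueError("at least two frames are required")
    video = extract_features(frames)              # step 202
    intermediate = scale(video, spatial_w)        # step 204
    temporal = temporal_features(intermediate)    # step 206
    cohesive = cohesive_features(temporal)
    fused = fuse(temporal, cohesive, prior)
    recognition = scale(fused, temporal_w)        # step 208
    return classify(recognition, class_weights)   # step 210
```

For example, three two-pixel frames whose first pixel grows over time are labelled by whichever hypothetical class weights respond to change in that pixel.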
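In one embodiment, performing contribution adjustment on the spatial features of the video image features to obtain intermediate image features includes: performing spatial feature extraction on the video image features to obtain the spatial features of the video image features; and performing contribution adjustment on the spatial features through the spatial structural parameters among the structural parameters to obtain the intermediate image features, where the structural parameters are obtained by training on video image samples carrying behavior labels.
Spatial feature extraction extracts the spatial features from the video image features so that their contribution can be adjusted. It can be implemented by a feature extraction module; for example, the convolution module of a convolutional neural network model may apply convolution operations to the video image features. The structural parameters may include weight parameters that apply weighted adjustment to the various operations performed on image features. For a convolutional neural network model, for example, the structural parameters may be the weight parameters of the operations defined in the model's operation space, such as weights applied to convolution, sampling, and pooling operations. The structural parameters may include spatial structural parameters and temporal structural parameters, which adjust the contribution of spatial features in the spatial dimension and of temporal features in the time dimension, respectively. This adjusts the spatiotemporal information in the video image features, strengthening the expressiveness of the behavior recognition features with respect to behavior information and helping improve recognition accuracy. The structural parameters may be obtained in advance by training on video image samples, that is, video images carrying behavior labels, so that the various operations can be weighted effectively.
Specifically, after obtaining the video image features, the server 104 performs spatial feature extraction on the video image features corresponding to each frame of the target video images. This may be done with a pre-trained video behavior recognition model, for example through the convolutional layers of that model, to obtain the spatial features. The server 104 then determines the structural parameters obtained by training on video image samples carrying behavior labels, and adjusts the contribution of the spatial features through the spatial structural parameters. When the spatial structural parameters are weight parameters, for example, the spatial features can be weighted by them, so that the spatial structural parameters adjust how strongly the spatial features of the video image features influence the recognition result. This realizes the contribution adjustment of the spatial features and yields the intermediate image features, which are the image features obtained after adjusting the degree of contribution of the spatial features to video behavior recognition.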
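
The idea that structural parameters act as weights on the operations of an operation space can be illustrated as below. This is a sketch under assumptions: the two candidate operations, the softmax normalization of the structural parameters and all function names are hypothetical and chosen only to show weighted mixing of operation outputs; the source does not specify how the weights are normalized.

```python
import math

def softmax(values):
    """Turn raw structural parameters into positive weights that sum to 1."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def identity(x):
    return list(x)

def average_pool(x):
    """Average each element with its neighbours (window of 3, edges clamped)."""
    n = len(x)
    return [(x[max(i - 1, 0)] + x[i] + x[min(i + 1, n - 1)]) / 3.0 for i in range(n)]

def mix_operations(x, operations, structural_params):
    """Weight every operation's output by its normalized structural parameter."""
    weights = softmax(structural_params)
    outputs = [op(x) for op in operations]
    return [sum(w * out[i] for w, out in zip(weights, outputs))
            for i in range(len(x))]
```

Raising one structural parameter relative to the others increases the contribution of the corresponding operation to the mixed output, which is the sense of "contribution adjustment" used in this sketch.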
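Further, performing temporal feature contribution adjustment on the fused features to obtain behavior recognition features includes: performing contribution adjustment on the fused features through the temporal structural parameters among the structural parameters to obtain the behavior recognition features.
The structural parameters may be the weight parameters of the operations defined in the operation space of a convolutional neural network model, and include temporal structural parameters that adjust the contribution of features in the time dimension. Specifically, after obtaining the fused features, the server 104 performs temporal feature contribution adjustment on them through the temporal structural parameters to obtain the behavior recognition features used for video behavior recognition. In a specific implementation, the temporal structural parameters may be weight parameters, in which case the server 104 weights the fused features by them. The temporal structural parameters thereby adjust how strongly the time-dimension features of the fused features influence the recognition result, which realizes the contribution adjustment in the time dimension and yields the behavior recognition features. The server 104 can then perform video behavior recognition based on the behavior recognition features to obtain the video behavior recognition result.
In this embodiment, the spatial and temporal structural parameters, obtained by training on video image samples carrying behavior labels, adjust the contribution of the spatial features of the video image features and of the fused features in their respective feature dimensions. The contributions of temporal and spatial information in the behavior recognition features are thus adjusted according to the spatial and temporal structural parameters, achieving effective entanglement of spatiotemporal features. The resulting behavior recognition features express spatiotemporal characteristics strongly, that is, their expressiveness with respect to behavior information is enhanced, which improves the accuracy of video behavior recognition.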
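In one embodiment, the video behavior recognition method further includes: determining structural parameters to be trained; performing contribution adjustment on the spatial sample features of video image sample features through the spatial structural parameters among the structural parameters to be trained, to obtain intermediate sample features, the video image sample features being extracted from video image samples; fusing, based on prior sample information, the temporal sample features of the intermediate sample features with the cohesive sample features corresponding to the temporal sample features to obtain fused sample features, the cohesive sample features being obtained by performing attention processing on the temporal sample features, and the prior sample information being obtained from change information of the intermediate sample features in the time dimension; performing contribution adjustment on the fused sample features through the temporal structural parameters among the structural parameters to be trained, to obtain behavior recognition sample features; and performing video behavior recognition based on the behavior recognition sample features, updating the structural parameters to be trained according to the behavior recognition result and the behavior labels corresponding to the video image samples, and continuing training until training ends, to obtain the structural parameters.
In this embodiment, training is performed on video image samples carrying behavior labels, and structural parameters comprising temporal and spatial structural parameters are obtained when training ends. The structural parameters to be trained may be the initial values at each training iteration. The spatial structural parameters among them adjust the contribution of the spatial sample features of the video image sample features to give the intermediate sample features, which are the result of that adjustment. The video image sample features are extracted from the video image samples, for example by an artificial neural network model. The prior sample information is obtained from the change information of the intermediate sample features in the time dimension, for example from their similarity in the time dimension. The cohesive sample features are obtained by performing attention processing on the temporal sample features, for example with an attention mechanism.
The fused sample features are obtained by fusing the temporal sample features of the intermediate sample features with the corresponding cohesive sample features according to the prior sample information, for example by weighted fusion based on the prior sample information. The behavior recognition sample features are used for video behavior recognition and are obtained by adjusting the contribution of the fused sample features through the temporal structural parameters among the structural parameters to be trained; specifically, the temporal structural parameters adjust the weights of the fused sample features, and thereby the contribution of their time-dimension features to video behavior recognition. The behavior recognition result is obtained by performing video behavior recognition based on the behavior recognition sample features. The structural parameters to be trained can be evaluated against the behavior recognition result and the behavior labels carried by the video image samples, updated according to the evaluation, and trained further until training ends, for example when the number of training iterations reaches a preset threshold, the behavior recognition result meets the accuracy requirement, or the objective function satisfies the end condition. The trained structural parameters obtained at that point can then be used to adjust the contribution of the spatial features of the video image features and of the fused features, respectively, in video behavior recognition.
Specifically, the structural parameters may be trained by the server 104, or trained by another training device and then ported to the server 104. Taking training by the server 104 as an example: the server 104 determines the structural parameters to be trained, which are the initial values for the current training iteration, and uses the spatial structural parameters among them to adjust the contribution of the spatial sample features of the video image sample features, obtaining the intermediate sample features. The server 104 then fuses the temporal sample features of the intermediate sample features with the corresponding cohesive sample features based on the prior sample information to obtain the fused sample features. Next, it adjusts the contribution of the fused sample features through the temporal structural parameters among the structural parameters to be trained to obtain the behavior recognition sample features, and performs video behavior recognition on them to obtain the behavior recognition result. Based on the behavior recognition result and the behavior labels corresponding to the video image samples, the server 104 updates the structural parameters to be trained and returns to continue iterative training with the updated parameters until the training end condition is met, obtaining the structural parameters. During video behavior recognition, the structural parameters apply weighted adjustment to the operations performed on the spatiotemporal features of the target video images, achieving effective entanglement of those features, strengthening the expressiveness of the behavior recognition features, and improving recognition accuracy.
In this embodiment, the structural parameters are trained on video image samples carrying behavior labels. The trained structural parameters achieve effective entanglement of the spatiotemporal features of the target video images and strengthen the expressiveness of the behavior recognition features with respect to behavior information, which improves the accuracy of video behavior recognition.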
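In one embodiment, the video behavior recognition method is implemented by a video behavior recognition model, and the structural parameters to be trained are parameters of the video behavior recognition model during training. Updating the structural parameters to be trained according to the behavior recognition result and the behavior labels corresponding to the video image samples and continuing training until training ends, to obtain the structural parameters, includes: obtaining the behavior recognition result output by the video behavior recognition model; determining the difference between the behavior recognition result and the behavior labels corresponding to the video image samples; updating the model parameters of the video behavior recognition model and the structural parameters to be trained according to the difference; and continuing training based on the updated video behavior recognition model until training ends, and obtaining the structural parameters from the trained video behavior recognition model.
In this embodiment, the video behavior recognition method is implemented by a video behavior recognition model, that is, the steps of the method are carried out by a pre-trained video behavior recognition model. The model may be an artificial neural network model built on any of various neural network algorithms, such as a convolutional neural network model, a deep learning network model, a recurrent neural network model, a perceptron network model, or a generative adversarial network model. The structural parameters to be trained are parameters of the video behavior recognition model during training; in other words, the structural parameters are the parameters in the model that adjust the contribution of the model's operations.
The behavior recognition result is the result of video behavior recognition based on the behavior recognition sample features and is output by the video behavior recognition model: at least two frames of target video images are input into the model, which performs video behavior recognition on them and outputs the result. The difference between the behavior recognition result and the behavior labels corresponding to the video image samples can be determined by comparing the two. The model parameters are the parameters of each layer of the network structure in the video behavior recognition model. For a convolutional neural network model, for example, they may include, but are not limited to, the convolution kernel parameters of each convolutional layer, pooling parameters, and up- and down-sampling parameters. Updating both the model parameters and the structural parameters to be trained according to the difference between the behavior recognition result and the behavior labels trains the model parameters and the structural parameters jointly. Once training ends, the structural parameters can be determined from the trained video behavior recognition model.
The server 104 trains the model parameters and the structural parameters jointly through the video behavior recognition model, and the trained structural parameters can be determined from the trained model. Specifically, the server 104 inputs the video image samples into the video behavior recognition model, which performs video behavior recognition and outputs the behavior recognition result. The server 104 determines the difference between this result and the behavior labels corresponding to the video image samples and updates the parameters of the video behavior recognition model accordingly, including both the model parameters and the structural parameters to be trained, to obtain the updated model. The server 104 continues training the updated model on the video image samples until training ends, for example when the training condition is met, to obtain the trained video behavior recognition model. The server 104 can then determine the trained structural parameters from the trained model. These parameters adjust the weights of the operations of each layer of the network structure in the model, and thereby the contribution of each layer to video behavior recognition, so that highly expressive features are obtained for recognition and recognition accuracy improves.
In this embodiment, the model parameters and the structural parameters are trained jointly through the video behavior recognition model, and the trained structural parameters can be determined from the trained model. The trained structural parameters achieve effective entanglement of the spatiotemporal features of the target video images and strengthen the expressiveness of the behavior recognition features with respect to behavior information, which improves the accuracy of video behavior recognition.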
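In one embodiment, updating the structural parameters to be trained according to the behavior recognition result and the behavior labels corresponding to the video image samples and continuing training until training ends, to obtain the structural parameters, includes: determining the behavior recognition loss between the behavior recognition result and the behavior labels corresponding to the video image samples; obtaining a reward value according to the behavior recognition loss and the previous behavior recognition loss; and updating the structural parameters to be trained according to the reward value, and continuing training with the updated structural parameters to be trained until the objective function satisfies the end condition, to obtain the structural parameters, the objective function being obtained from the reward values in the training process.
The behavior recognition loss represents the degree of difference between the behavior recognition result and the behavior labels corresponding to the video image samples. Its form can be chosen as needed; for example, it can be a cross-entropy loss. The previous behavior recognition loss is the behavior recognition loss determined for the previous frame of video image samples. The reward value is used to update the structural parameters to be trained; it is determined from the behavior recognition loss and the previous behavior recognition loss, and it guides the structural parameters to be trained toward values that satisfy the training requirements. After the structural parameters to be trained are updated, training continues with the updated parameters until the objective function satisfies the end condition, at which point training ends and the trained structural parameters are obtained. The objective function is obtained from the reward values in the training process, that is, from the reward values corresponding to each frame of video image samples; for example, it can be constructed from the sum of those reward values. The objective function is used to decide when the training of the structural parameters ends, so as to obtain structural parameters that satisfy the contribution adjustment requirements.
Specifically, after the server 104 performs video behavior recognition based on the behavior recognition sample features and obtains the behavior recognition result, it determines the behavior recognition loss between the result and the behavior labels corresponding to the video image samples, for example as the cross-entropy loss between the two. The server 104 obtains the reward value from this behavior recognition loss and the previous behavior recognition loss corresponding to the previous frame of video image samples, for example from the difference between the two losses. For example, if the behavior recognition loss is greater than the previous behavior recognition loss, a reward value with a positive value may be obtained to provide positive feedback; if the behavior recognition loss is less than the previous behavior recognition loss, a reward value with a negative value may be obtained to provide negative feedback. This guides the update of the structural parameters to be trained. The server 104 updates the structural parameters to be trained according to the reward value, for example according to its sign or magnitude, to obtain the updated structural parameters to be trained. The server 104 continues training with the updated parameters until the objective function satisfies the end condition, at which point training ends and the structural parameters are obtained. The objective function is obtained from the reward values in the training process, for example constructed from the sum of the reward values corresponding to each frame of video image samples. The end of training is decided from the objective function; for example, training ends when the objective function reaches an extremum, yielding structural parameters that satisfy the contribution adjustment requirements.
In this embodiment, the reward value is obtained from the difference between the behavior recognition losses corresponding to the frames of video image samples, where each behavior recognition loss is determined from the behavior recognition result and the behavior labels corresponding to the video image samples. The structural parameters to be trained are updated through the reward value and training continues until the objective function, obtained from the reward values corresponding to the frames of video image samples, satisfies the end condition, at which point the trained structural parameters are obtained. Updating the structural parameters to be trained through reward values derived from the differences between these behavior recognition losses can improve the training efficiency of the structural parameters.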
在一个实施例中,根据奖励值对待训练结构参数进行更新,包括:根据奖励值对策略梯度网络模型的模型参数进行更新;及由更新后的策略梯度网络模型对待训练结构参数进行更新。
其中,策略梯度(Policy Gradient)网络模型为基于策略梯度的网络模型,其输入为状态,输出为动作,策略即指在不同的状态下采取不同的动作,通过基于策略进行梯度下降,以训练策略梯度网络模型能够根据当前状态做出对应的动作,获得更高的奖励值。具体地,策略梯度网络模型的模型参数可以作为状态,而该状态下策略梯度网络模型根据输入的结构参数输出的结构参数为动作,从而策略梯度网络模型可以根据输入的结构参数和当前的模型参数预测输出下一个动作,即下一个结构参数,从而实现在训练中对结构参数的更新。
具体地,在根据奖励值对待训练结构参数进行更新时,服务器104根据奖励值对策略梯度网络模型的模型参数进行更新,具体基于奖励值对策略梯度网络模型中的各模型参数进行调整,以由更新后的策略梯度网络模型进行下一次的结构参数预测。对策略梯度网络模型进行更新后,服务器104通过更新后的策略梯度网络模型对待训练结构参数进行更新,具体可以由更新后的策略梯度网络模型基于更新后的网络状态和待训练结构参数进行结构参数预测,获得预测的结构参数,策略梯度网络模型预测的结构参数即为对待训练结构参数进行更新后的结构参数。
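In this embodiment, the policy gradient network model is updated according to the reward value, and the updated policy gradient network model then updates the structural parameters to be trained. Optimizing the structural parameters by policy gradient in this way helps ensure their training quality and helps improve the accuracy of video behavior recognition.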
在一个实施例中,由更新后的策略梯度网络模型对待训练结构参数进行更新,包括:通过更新后的策略梯度网络模型,基于更新后的模型参数和待训练结构参数进行结构参数预测,获得预测的结构参数;及根据预测的结构参数,得到对待训练结构参数进行更新后的结构参数。
其中,更新后的策略梯度网络模型,通过对策略梯度网络模型的模型参数进行更新后得到,即通过奖励值对策略梯度网络模型的模型参数进行调整更新后,得到更新后的策略梯度网络模型。
具体地,对策略梯度网络模型进行更新,得到更新后的策略梯度网络模型后,服务器以更新后的策略梯度网络模型中的模型参数作为状态,在该状态下对结构参数进行预测,具体可以基于更新后的模型参数和待训练结构参数进行结构参数预测,得到预测的结构参数。具体应用中,服务器以更新后的策略梯度网络模型的当前网络状态,利用待训练结构参数进行结构参数预测,得到预测的结构参数。服务器根据预测的结构参数进行结构参数更新,得到对待训练结构参数进行更新后的结构参数。例如,服务器可以直接将更新后的策略梯度网络模型通过结构参数预测输出的预测的结构参数,作为对待训练结构参数进行更新后的结构参数,从而实现对待训练结构参数的更新。
本实施例中,服务器通过更新后的策略梯度网络模型对待训练结构参数进行结构参数预测,并根据预测的结构参数得到对待训练结构参数进行更新后的结构参数,可以通过策略梯度方式来优化结构参数,能够确保结构参数的训练质量,有利于提高视频行为识别处理的准确率。
在一个实施例中,视频行为识别方法还包括:确定中间图像特征在时间维度的相似度;及基于相似度对初始先验信息进行修正,得到先验信息。
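Here, the time dimension is the dimension along which the target video frames are ordered within their video; temporal features along this dimension assist in recognizing video behavior accurately. The similarity characterizes the distance between features: the higher the similarity, the smaller the distance. The similarity of the intermediate image features in the time dimension therefore reflects how much those features change over time. The initial prior information may be preset prior information, specifically prior information obtained in advance by training on sample data. Correcting the initial prior information according to the similarity allows the fusion of the temporal features and the cohesive features of the intermediate image features to be weighted according to how much the target video frames change over time. This strengthens the cohesiveness of the fused features, that is, it highlights their focal features and reduces redundant information.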
具体地,服务器104在基于先验信息对中间图像特征的时间特征和时间特征对应的内聚特征进行融合前,可以根据各帧目标视频图像在时间维度的变化程度对初始先验信息进行修正,以获得对应的先验信息。服务器104确定中间图像特征在时间维度的相似度,具体可以对各帧目标视频图像分别对应的中间图像特征在时间维度计算余弦相似度,通过余弦相似度衡量各帧目标视频图像在时间维度的变化程度。服务器104根据中间图像特征在时间维度的相似度对初始先验信息进行修正,具体可以基于相似度将初始先验信息划分为正负参数,通过正负参数对初始先验信息进行修正后,将修正后的初始先验信息与初始先验信息以残差连接方式合并后得到先验信息,从而实现对先验信息的确定处理。
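The similarity computation described above can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes each frame's intermediate feature has been flattened to a vector, and it assumes that "similarity in the time dimension" is taken as the mean cosine similarity between consecutive frames. The function names `cosine_similarity` and `temporal_similarity` are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity of two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)

def temporal_similarity(frame_features):
    """Mean cosine similarity between consecutive frames (an assumed pairing)."""
    pairs = list(zip(frame_features, frame_features[1:]))
    return sum(cosine_similarity(u, v) for u, v in pairs) / len(pairs)

# A nearly static clip: frames point in the same direction.
static_clip = [[1.0, 2.0, 3.0], [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]]
# A rapidly changing clip: consecutive frames are orthogonal.
moving_clip = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
```

A high value indicates little change between frames (much redundant information), while a low value indicates rapid change.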
本实施例中,根据中间图像特征在时间维度的相似度对初始先验信息进行修正,通过反映了各帧目标视频图像在时间维度的变化程度的相似度对初始先验信息进行修正,可以有效利用各帧目标视频图像在时间维度的变化程度得到相应的先验知识,从而基于该先验知识对时间特征和内聚特征进行融合,可以对行为识别特征中时间信息进行有效聚焦,使获得的行为识别特征能够有效反映视频中的行为信息,从而提高了视频行为识别的准确率。
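In one embodiment, the initial prior information includes a first initial prior parameter and a second initial prior parameter. Correcting the initial prior information based on the similarity to obtain the prior information includes: dynamically adjusting the similarity according to the first initial prior parameter, the second initial prior parameter, and a preset threshold; correcting the first initial prior parameter and the second initial prior parameter respectively with the dynamically adjusted similarity to obtain a first prior parameter and a second prior parameter; and obtaining the prior information from the first prior parameter and the second prior parameter.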
其中,初始先验信息包括第一初始先验参数和第二初始先验参数,第一初始先验参数和第二初始先验参数分别作为中间图像特征的时间特征以及内聚特征的融合权重参数。预设阈值可以根据实际需要进行动态设置,以根据实际需要动态修正先验信息。第一先验参数和第二先验参数分别作为中间图像特征的时间特征以及内聚特征的融合权重参数,先验信息包括第一先验参数和第二先验参数。
具体地,在对初始先验信息进行修正时,服务器104确定预设阈值,并根据第一初始先验参数、第二初始先验参数及预设阈值,对相似度进行动态调整。服务器104通过动态调整后的相似度分别对初始先验信息中的第一初始先验参数和第二初始先验参数进行修正,得到第一先验参数和第二先验参数,并根据第一先验参数和第二先验参数得到先验信息。先验信息可以对时间特征和内聚特征进行加权融合处理,以按照先验信息中的先验知识将时间特征和内聚特征进行融合,得到融合特征。融合特征通过先验信息中的先验知识将时间特征和内聚特征融合得到,可以确保融合特征中时间信息的内聚性,增强时间维度中重要特征的表达,从而能够提高视频行为识别的准确率。
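In this embodiment, the similarity is dynamically adjusted according to the initial prior information and the preset threshold, and the first and second initial prior parameters are then each corrected with the adjusted similarity to obtain the first and second prior parameters, from which the prior information is obtained. The resulting prior information reflects prior knowledge of the target video frames in the time dimension. Fusing the temporal features and the cohesive features on the basis of this prior information focuses the temporal information in the behavior recognition features so that they effectively reflect the behavior in the video, which improves the accuracy of video behavior recognition.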
在一个实施例中,如图3所示,视频行为识别方法还包括对时间特征进行内聚处理,得到对应的内聚特征的处理,具体包括:
步骤302,确定当前基向量。
其中,当前基向量为当前对时间特征进行内聚处理的基向量,通过基向量可以实现对时间特征的内聚处理。具体地,在对时间特征进行内聚处理时,服务器104确定当前基向量,如可以为B×C×K,其中,B为批次处理的数据大小,C为中间图像特征的通道数,K为基向量的维度。
步骤304,通过当前基向量对中间图像特征的时间特征进行特征重构,得到重构特征。
其中,由当前基向量对时间特征进行特征重构,具体可以通过当前基向量与中间图像特征的时间特征进行融合,得到重构特征。具体实现时,服务器104可以通过当前基向量与中间图像特征的时间特征进行矩阵相乘后进行归一化映射后,实现对时间特征的重构,得到重构特征。
步骤306,根据重构特征和时间特征生成下一关注处理的基向量。
下一关注处理的基向量为下一次进行关注处理,即下一次对时间特征进行内聚处理时的基向量。具体地,服务器104根据重构特征和时间特征生成下一关注处理的基向量,如可以将重构特征和时间特征进行矩阵相乘后得到下一关注处理的基向量。下一关注处理的基向量将作为下一次进行关注处理时的基向量对相应的时间特征进行特征重构。
步骤308,根据下一关注处理的基向量、基向量和时间特征,得到时间特征对应的内聚特征。
得到下一关注处理的基向量后,服务器104根据下一关注处理的基向量、基向量和时间特征,获得时间特征对应的内聚特征,从而实现对时间特征的内聚处理。具体可以将下一关注处理的基向量、基向量和时间特征进行融合后,生成时间特征对应的内聚特征。
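In this embodiment, the temporal features of the intermediate image features are reconstructed with a basis vector, a new basis vector is generated from the reconstructed features and the temporal features, and the cohesive features corresponding to the temporal features are obtained from the new basis vector, the old basis vector, and the temporal features. This focuses the temporal features and highlights the important focal features in the time dimension, yielding highly cohesive features that accurately express the temporal information of the target video frames and help improve the accuracy of video behavior recognition.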
在一个实施例中,根据重构特征和时间特征生成下一关注处理的基向量,包括:融合重构特征和时间特征,生成注意力特征;对注意力特征进行正则化处理,得到正则化特征;及对正则化特征进行滑动平均更新,生成下一关注处理的基向量。
其中,注意力特征通过融合重构特征和时间特征得到,通过对注意力特征依次进行正则化处理和滑动平均更新,可以确保基向量的更新更加稳定。具体地,根据重构特征和时间特征生成下一关注处理的基向量时,服务器104融合重构特征和时间特征得到注意力特征。服务器104进一步对注意力特征进行正则化处理,如可以对注意力特征进行L2正则化处理,得到正则化特征。服务器104对获得的正则化特征进行滑动平均更新,生成下一关注处理的基向量。滑动平均,或者叫做指数加权平均,可以用来估计变量的局部均值,使得变量的更新与一段时间内的历史取值有关。下一关注处理的基向量为下一次进行关注处理,即下一次对时间特征进行内聚处理时的基向量。
本实施例中,通过对融合重构特征和时间特征得到的注意力特征依次进行正则化处理和滑动平均更新,可以确保基向量的更新更加稳定,以确保内聚特征的高内聚性,可以准确表达目标视频图像在时间维度的信息,有利于提高视频行为识别的准确率。
在一个实施例中,当前基向量包括批次处理的数据大小、中间图像特征的通道数以及基向量的维度;通过当前基向量对中间图像特征的时间特征进行特征重构,得到重构特征,包括:将当前基向量与中间图像特征的时间特征,依次进行矩阵相乘及归一化映射处理,得到重构特征。
其中,批次处理的数据大小为在进行批次处理时,每个批次处理的数据量大小。例如,当前基向量可以为B×C×K,其中,B为批次处理的数据大小,C为中间图像特征的通道数,K为基向量的维度。具体地,服务器对中间图像特征的时间特征进行特征重构时,可以将当前基向量与中间图像特征的时间特征进行矩阵相乘,并针对矩阵相乘的结果进行归一化映射处理,实现对时间特征的重构,得到重构特征。
进一步地,根据重构特征和时间特征生成下一关注处理的基向量,包括:将重构特征和时间特征进行矩阵相乘,得到下一关注处理的基向量。
具体地,服务器将重构特征和时间特征进行矩阵相乘处理,获得下一关注处理的基向量。下一关注处理的基向量将作为下一次进行关注处理时的基向量对相应的时间特征进行特征重构。
进一步地,根据下一关注处理的基向量、基向量和时间特征,得到时间特征对应的内聚特征,包括:将下一关注处理的基向量、基向量和时间特征进行融合,得到时间特征对应的内聚特征。
具体地,服务器将下一关注处理的基向量、基向量和时间特征进行融合,从而融合下一关注处理的基向量、基向量和时间特征的有效信息,得到时间特征对应的内聚特征。
本实施例中,通过包括批次处理的数据大小、中间图像特征的通道数以及基向量的维度的基向量,对中间图像特征的时间特征进行特征重构,具体依次进行矩阵相乘及归一化映射处理,得到重构特征,并根据重构特征和时间特征进行矩阵相乘生成新的基向量,融合新的基向量、旧的基向量和时间特征得到时间特征对应的内聚特征,从而对时间特征进行聚焦,以突出在时间维度的重要焦点特征,获得具有高内聚性的内聚特征,可以准确表达目标视频图像在时间维度的信息,有利于提高视频行为识别的准确率。
在一个实施例中,基于先验信息对中间图像特征的时间特征和时间特征对应的内聚特征进行融合,得到融合特征,包括:确定先验信息;对中间图像特征进行时间特征提取,得到中间图像特征的时间特征;及通过先验信息,对时间特征和时间特征对应的内聚特征进行加权融合,得到融合特征。
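Here, the prior information reflects prior knowledge of the target video frames in the time dimension. It is obtained from the change information of the intermediate image features in the time dimension, specifically from their similarity in the time dimension. The temporal features reflect the temporal information of the target video frames within the video and are obtained by performing temporal feature extraction on the intermediate image features. The temporal features and their corresponding cohesive features are fused by weighting with the prior information. For example, when the prior information includes a first prior parameter and a second prior parameter, the temporal features and the cohesive features are weighted by the first and second prior parameters respectively and then fused to obtain the fused features.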
具体地,服务器104确定先验信息,先验信息根据中间图像特征在时间维度的变化信息得到,具体可以根据中间图像特征在时间维度的相似度得到。服务器104对中间图像特征进行时间特征提取,具体可以对中间图像特征中时间维度进行特征提取,以得到中间图像特征的时间特征。进一步地,服务器104基于先验信息对时间特征和时间特征对应的内聚特征进行加权融合,得到融合特征,从而实现对时间特征和时间特征对应的内聚特征的加权融合,融合特征通过先验信息中的先验知识将时间特征和内聚特征融合得到,可以确保融合特征中时间信息的内聚性,增强时间维度中重要特征的表达,从而能够提高视频行为识别的准确率。
本实施例中,融合特征基于先验信息中的先验知识将时间特征和内聚特征融合得到,可以确保融合特征中时间信息的内聚性,增强时间维度中重要特征的表达,从而能够提高视频行为识别的准确率。
在一个实施例中,先验信息包括第一先验参数和第二先验参数;通过先验信息,对时间特征和时间特征对应的内聚特征进行加权融合,得到融合特征,包括:通过第一先验参数对时间特征进行加权处理,获得加权处理后的时间特征;通过第二先验参数对时间特征对应的内聚特征进行加权处理,得到加权处理后的内聚特征;及将加权处理后的时间特征和加权处理后的内聚特征进行融合,得到融合特征。
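Here, the prior information includes a first prior parameter and a second prior parameter, which are the weights of the temporal features and of the corresponding cohesive features respectively. Specifically, the server weights the temporal features with the first prior parameter to obtain weighted temporal features. For example, if the first prior parameter is k1 and the temporal feature is M, the weighted temporal feature may be k1*M. The server weights the cohesive features corresponding to the temporal features with the second prior parameter to obtain weighted cohesive features. For example, if the second prior parameter is k2 and the cohesive feature is N, the weighted cohesive feature may be k2*N. The server then fuses the weighted temporal features and the weighted cohesive features to obtain the fused features, for example k1*M+k2*N.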
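The weighted fusion k1*M+k2*N can be illustrated with a short sketch. It assumes the temporal feature M and the cohesive feature N are flattened to equal-length lists and that the fusion is an element-wise sum; the function name `weighted_fusion` is hypothetical.

```python
def weighted_fusion(temporal, cohesive, k1, k2):
    """Element-wise k1 * M + k2 * N for two equal-length feature vectors."""
    if len(temporal) != len(cohesive):
        raise ValueError("features must have the same length")
    return [k1 * m + k2 * n for m, n in zip(temporal, cohesive)]

# M = [1, 2], N = [3, 4], with prior parameters k1 = 0.5 and k2 = 0.25.
fused = weighted_fusion([1.0, 2.0], [3.0, 4.0], k1=0.5, k2=0.25)
```

Raising k2 relative to k1 gives the cohesive branch more influence on the fused feature.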
本实施例中,融合特征基于先验信息中的第一先验参数和第二先验参数将时间特征和内聚特征融合得到,可以确保融合特征中时间信息的内聚性,增强时间维度中重要特征的表达,从而能够提高视频行为识别的准确率。
在一个实施例中,在基于先验信息对中间图像特征的时间特征和时间特征对应的内聚特征进行融合,得到融合特征之前,还包括:对中间图像特征进行标准化处理,得到标准化特征;及根据标准化特征进行非线性映射,获得映射后的中间图像特征。
其中,标准化处理可以对中间图像特征进行规范化,有利于解决梯度消失和梯度爆炸问题,能够确保网络学习速率。标准化处理可以通过批量标准化处理实现。非线性映射可以引入非线性因素,从而对中间图像特征进行去线性,有利于增强中间图像特征的灵活表达。具体地,得到中间图像特征后,服务器104对中间图像特征进行标准化处理,如可以通过BN(Batch Normalization,批量标准化)层结构对中间图像特征进行标准化处理,得到标准化特征。进一步地,服务器104对标准化特征进行非线性映射,如可以通过激活函数对标准化特征进行非线性映射,得到映射后的中间图像特征。
进一步地,基于先验信息对中间图像特征的时间特征和时间特征对应的内聚特征进行融合,得到融合特征,包括:基于先验信息对映射后的中间图像特征的时间特征和时间特征对应的内聚特征进行融合,得到融合特征;先验信息是根据映射后的中间图像特征在时间维度的变化信息得到的。
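Specifically, after obtaining the mapped intermediate image features, the server 104 fuses the temporal features of the mapped intermediate image features with the cohesive features corresponding to those temporal features on the basis of the prior information to obtain the fused features. The prior information is obtained from the change information of the mapped intermediate image features in the time dimension, and the cohesive features are obtained by applying attention processing to the temporal features of the mapped intermediate image features.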
本实施例中,在得到中间图像特征后,进一步对中间图像特征进行标准化处理和非线性映射,以增强中间图像特征的特征表达,并基于映射后的中间图像特征进行视频行为识别处理,可以进一步提高行为识别特征的行为信息表现力,从而有利于提高视频行为识别的准确率。
在一个实施例中,对中间图像特征进行标准化处理,得到标准化特征,包括:通过批量标准化层结构,对中间图像特征进行标准化处理,得到标准化特征。
其中,批量标准化层结构为BN层结构,可以对中间图像特征批量进行标准化处理。具体地,服务器可以通过批量标准化层结构对中间图像特征批量进行标准化处理,得到标准化特征,从而能够确保标准化的处理效率。
进一步地,根据标准化特征进行非线性映射,获得映射后的中间图像特征,包括:通过激活函数对标准化特征进行非线性映射,获得映射后的中间图像特征。
其中,激活函数用于引入非线性因素,以实现对标准化特征的非线性映射。激活函数的具体形式可以根据实际需要进行设置,如可以设置ReLU函数,以由服务器通过激活函数对标准化特征进行非线性映射,得到映射后的中间图像特征。
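The normalization followed by nonlinear mapping can be sketched on a batch of scalars. This is a simplified illustration under stated assumptions: a real BN layer also has a learnable scale and shift and normalizes per channel, both of which are omitted here, and the helper names are hypothetical.

```python
import math

def batch_normalize(values, eps=1e-5):
    """Normalize a batch of scalars to zero mean and (nearly) unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]

def relu(values):
    """ReLU activation: negative entries are mapped to zero."""
    return [max(0.0, v) for v in values]

batch = [1.0, 2.0, 3.0, 4.0]
normalized = batch_normalize(batch)
mapped = relu(normalized)
```

After normalization the values are centered on zero, so ReLU zeroes the below-average entries and keeps the above-average ones.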
本实施例中,在得到中间图像特征后,进一步通过批量标准化层结构、激活函数对中间图像特征,依次进行标准化处理和非线性映射,以增强中间图像特征的特征表达,并提高处理效率。
本申请还提供一种应用场景,该应用场景应用上述的视频行为识别方法。具体地,该视频行为识别方法在该应用场景的应用如下:
对于视频行为识别处理,时空信息建模是视频行为识别的核心问题之一。近年来主流方法主要有基于双流网络的行为识别方法和基于3D(3-Dimensional,三维)卷积网络的行为识别方法。前者通过平行的两个网络分别提取RGB和光流特征,后者通过3D卷积同时建模时间和空间信息。然而,大量的模型参数和算力损耗限制了其效率,基于此,后续的改进方法主要通过将三维卷积分解为二维空间卷积和一维时间卷积的方式来分别对时间和空间信息建模,进而提升模型的效率。
通过设计不同的网络结构来提取更好的时空特征,但忽略了时空线索对不同动作类的差异化影响。例如,有些动作即使没有时间信息的帮助,也很容易仅用一张图片来判别,这是因为它们在不同的场景中,具有显著的空间信息,此时可以作为具有高度可信度的动作类别进行预测。然而,时间信息对细粒度动作识别是必不可少的,例如,“拉小提琴”动作中的推弦弓和拉弦弓动作的判别,需要时间信息才可以针对推弦弓和拉弦弓动作进行准确识别。视频中通常包含丰富的时间相互关联的内容,在这样多维的信息中,仅仅对时空特征进行独立分解建模,而时空信息的相关性在不同的动作类别之间存在很大的差异,在识别过程中对时空信息的贡献不同,导致时空信息无法有效反映出视频中的行为信息。此外,视频中动作的时间边界不明确,即动作的开始时间和结束时间不明确、持续时间不确定,导致视频行为识别的准确率较低。
基于此,本实施例中通过上述的视频行为识别方法,可以采用网络结构搜索策略自适应地调整时间和空间信息的权重,根据行为识别过程中贡献的不同,挖掘时间空间信息之间的深层关联、共同学习时空的相互作用;同时设计了一个节奏调节器,根据动作节律的先验信息和时间卷积的结构参数,得到时间信息的高内聚性表达,以此来调整不同节奏的动作,从而解决相同动作却具有不同节奏造成的特征表达差异的问题,提高了视频行为识别的准确率。
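Specifically, the video behavior recognition method includes: extracting video image features from at least two target video frames, for example by inputting the frames into an artificial neural network that extracts the video image features; performing contribution adjustment on the spatial features of the video image features to obtain intermediate image features, specifically by means of pre-trained structural parameters; fusing, on the basis of prior information, the temporal features of the intermediate image features with the cohesive features corresponding to those temporal features, so that the rhythm adjuster adjusts the rhythm of the behavior and fused features are obtained; performing temporal feature contribution adjustment on the fused features to obtain behavior recognition features, specifically by means of the structural parameters; and finally performing video behavior recognition on the basis of the behavior recognition features to obtain a behavior recognition result.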
本实施例的视频行为识别方法基于视频行为识别模型实现,如图4所示,为本实施例中视频行为识别模型的网络结构示意图。其中,X为至少两帧目标视频图像提取到的视频图像特征,通过1×3×3的2D卷积进行空间特征提取,得到空间特征,并通过结构参数中的空间结构参数α1对空间特征进行贡献调整,得到中间图像特征。中间图像特征依次通过批量标准化处理和激活函数的非线性映射处理,具体可以通过BN层结构和ReLU层结构实现对中间图像特征的批量标准化处理和非线性映射处理。获得的映射后的特征A分别通过两个3×1×1的1D卷积进行时间特征提取,其中一个分支为高内聚Cohesive的1D卷积处理,从而可以提取得到中间图像特征的时间特征对应的内聚特征。对于1D卷积进行时间特征提取的结果,通过先验信息中的权重参数β1和β2分别进行加权调整,并对两个分支的加权调整结果进行融合。权重参数β1和β2可以为基于策略梯度Agent网络训练得到的结构参数,通过确定特征A在时间维度的相似度,以对初始的权重参数β1和β2进行残差修正,并基于残差修正后的权重参数β1和β2对1D卷积的提取结果进行加权处理。两个1D卷积分支的结果进行融合后,通过结构参数中的时间结构参数α2对融合特征进行时间特征贡献调整,对贡献调整后的融合特征进行下采样后得到行为识别特征,行为识别特征用于视频行为识别,得到行为识别结果。
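Here, a structural parameter is the weight parameter of an operation, such as a convolution, defined in the operation space; it is a concept from network architecture search. In this embodiment, the structural parameters for the temporal and spatial convolutions to be fused, α1 and α2, can be optimized with either of two update schemes: a differentiable scheme or a policy gradient scheme. The fusion of the highly cohesive temporal convolution module and the 1D temporal convolution module can likewise use pre-trained structural parameters β1 and β2 for weighted fusion. As shown in FIG. 5, the structural parameters that fuse the temporal and spatial convolutions are α1 and α2, and the structural parameters that weight and fuse the two temporal convolution branches are β1 and β2. Specifically, the video image features extracted from the target video frames pass through a 1×d×d 2D convolution for spatial feature extraction. The extraction result is contribution-adjusted by the spatial structural parameter α1, that is, the feature extraction result is multiplied by the structural parameter, and is then batch-normalized and passed through the nonlinear mapping of an activation function. The mapped result passes through two t×1×1 1D convolutions for temporal feature extraction; the two results are weighted by β1 and β2 and fused, and the fused result is contribution-adjusted by the temporal structural parameter α2 to obtain the behavior recognition features used for video behavior recognition.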
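The data flow through one block (α1 on the spatial branch, normalization and activation, β1/β2 fusion of the two temporal branches, then α2) can be sketched schematically. This is only an illustration of the ordering of operations: the convolutions are replaced by caller-supplied stand-in functions on flat lists, the normalization and activation are folded into one `activation` callable, and all names are hypothetical.

```python
def auto_2plus1d_block(x, spatial_op, temporal_op, cohesive_op,
                       alpha1, alpha2, beta1, beta2, activation):
    """Schematic data flow of one block; every *_op is a stand-in callable."""
    # Spatial feature extraction followed by contribution adjustment with alpha1.
    spatial = [alpha1 * v for v in spatial_op(x)]
    # Stand-in for batch normalization plus nonlinear mapping.
    mapped = activation(spatial)
    # Two parallel temporal branches: plain and cohesive.
    temporal = temporal_op(mapped)
    cohesive = cohesive_op(mapped)
    # Weighted fusion of the two branches with beta1 and beta2.
    fused = [beta1 * t + beta2 * c for t, c in zip(temporal, cohesive)]
    # Temporal contribution adjustment with alpha2.
    return [alpha2 * v for v in fused]

out = auto_2plus1d_block(
    [1.0, -2.0, 3.0],
    spatial_op=lambda f: list(f),                 # identity stand-in for the 2D convolution
    temporal_op=lambda f: [2.0 * v for v in f],   # stand-in for the plain 1D convolution
    cohesive_op=lambda f: [max(f)] * len(f),      # stand-in for the cohesive branch
    alpha1=0.5, alpha2=0.5, beta1=1.0, beta2=2.0,
    activation=lambda f: [max(0.0, v) for v in f],
)
```

In a real model the stand-ins would be convolution layers and the four scalars would be learned structural parameters.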
具体地,在训练结构参数时,对基于微分方式更新的处理,预先定义一个多维的结构参数,如可以为多维的结构参数向量,具体为二维向量,在微分方式更新处理中具有梯度。其中,结构参数的维度分别代表空间卷积和时间卷积对应的结构参数。将结构参数作用于空间卷积和时间卷积来融合两者的特征,具体通过α1作用于空间卷积进行贡献调整,通过α2作用于时间卷积进行贡献调整。根据视频行为识别模型的预测结果和真实结果计算误差值,利用梯度下降算法对结构参数进行更新,在训练结束时得到训练完成的结构参数。
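Further, when the error is computed from the prediction of the video behavior recognition model and the ground truth and the structural parameters are updated by gradient descent, the optimization is performed in a differentiable manner. Denote the operation space in network architecture search by $O$ and a specific operation by $o$. A node is a set of basic operation units in the architecture search method. Let $i$ and $j$ be two sequentially adjacent nodes, let $\alpha_{ij}$ denote the weights of the group of candidate operations between them, and let $P$ be the corresponding probability distribution. The candidate operation with the highest probability between nodes $i$ and $j$ is obtained with the max function, and the final network structure is formed by stacking the operations found between the different nodes, as in formula (1):
[Formula (1): image PCTCN2022116947-appb-000001, not reproduced in this text]
where $N$ is the number of nodes.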
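Viewed horizontally, this amounts to learning which specific operation to select. The operation space is restricted to cascaded 2D and 1D convolutions and is optimized directly by gradient to search for the corresponding network structure, as in formula (2):
[Formula (2): image PCTCN2022116947-appb-000002, not reproduced in this text]
where [symbol image PCTCN2022116947-appb-000003, not reproduced in this text] denotes the gradient optimization step, $L_{train}(w,\alpha)$ is the objective function of the network structure, and $w$ denotes the model parameters of the network structure.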
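Viewed vertically, this amounts to using the structural parameters to strengthen or weaken the importance of the 2D spatial convolution features and the 1D temporal convolution features during feature learning. As shown in FIG. 6, a block in this embodiment is defined between two nodes. For a ResNet structure, for example, these nodes represent the output of the previous block and the input of the next block. The sequentially connected 1×d×d convolution and t×1×1 convolution are defined inside the block, and the structural parameters act on these two convolutions to adjust their strength. Training searches $\alpha_{11} \dots \alpha_{1i} \dots \alpha_{1n}$ for the structural parameter $\alpha_1$ that meets the contribution adjustment requirement of the 2D convolution, and searches $\alpha_{21} \dots \alpha_{2j} \dots \alpha_{2m}$ for the structural parameter $\alpha_2$ that meets the contribution adjustment requirement of the 1D convolution; in FIG. 6, $\alpha_{1n}$ is selected as $\alpha_1$ and $\alpha_{21}$ as $\alpha_2$. Let $o(\cdot)$ be an operation defined in the search space $O$ and applied to the input $x$, and let $\alpha^{(i,j)}$ be the weight vector between node $i$ and node $j$. Then formula (3) follows:
$y^{(i,j)} = \sum_{o \in O} F_{i,j}\left(w_{O_o}^{(i,j)}\right) o(x)$      (3)
where $F$ is a linear mapping of the weight vector and $y^{(i,j)}$ is the sum of the linear mappings of all weight vectors in the search space. $F$ may be implemented as a fully connected layer. Each cell is defined as one (2+1)D convolution block, so $\alpha_o^{(i,j)}$ is fixed. The learning objective can therefore be simplified to formula (4):
$y = g(w_\alpha, w_n, x)$       (4)
where $w_\alpha$ denotes the structural parameters of the network, $w_n$ denotes the model parameters of the network, and $y$ is the output of the (2+1)D convolution block. Thanks to the lightweight search space, the structural parameters and the model parameters can be trained end to end at the same time, with one set of structural parameters learned for each (2+1)D convolution block. The resulting optimization is given by formula (5):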
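[Formula (5): image PCTCN2022116947-appb-000004, not reproduced in this text]
That is, the structural parameters $w_\alpha$ and the model parameters $w_n$ of the network are trained synchronously, with gradient descent on the objective function $L_{val}$, until they meet the requirements, which completes the training of the network.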
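The idea of training structural parameters and model parameters together by gradient descent on one loss can be shown with a toy example. This is not formula (5), which is not reproduced in this text. It is a hypothetical one-dimensional stand-in in which a scalar structural parameter `alpha` scales a scalar model weight `w`, and both are updated in the same step from the same squared error.

```python
def joint_train(alpha, w, x, target, lr=0.1, steps=100):
    """Jointly update a structural parameter and a model weight by gradient descent."""
    for _ in range(steps):
        err = alpha * w * x - target           # prediction error of the toy block
        grad_alpha = 2.0 * err * w * x         # d(err^2)/d(alpha)
        grad_w = 2.0 * err * alpha * x         # d(err^2)/d(w)
        # Both parameters are updated from gradients of the same loss.
        alpha, w = alpha - lr * grad_alpha, w - lr * grad_w
    return alpha, w

alpha, w = joint_train(alpha=1.0, w=1.0, x=1.0, target=2.0)
```

In a real network the same pattern applies with an optimizer updating both parameter groups from one backward pass.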
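For the policy gradient update, a multi-dimensional structural parameter is defined in advance, for example a multi-dimensional structural parameter vector, specifically a two-dimensional vector, whose gradient information is cut off in the policy gradient update. The dimensions of the structural parameter correspond to the spatial convolution and the temporal convolution respectively. A policy gradient agent network is defined in advance to generate the next structural parameter from the current structural parameter and the network state of the agent. The generated structural parameter is applied to the spatial and temporal convolutions to fuse their features. The agent's network parameters are updated according to the reward value for the agent's current network state, and the updated agent then predicts the next structural parameter, which realizes the update of the structural parameters.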
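Specifically, policy gradient descent is a reinforcement learning method. A policy is the action taken in each state, and the aim is to perform gradient descent on the policy so that the trained policy gradient agent takes a suitable action for the current state and obtains a higher reward. When the structural parameters are optimized by policy gradient, a multilayer perceptron (MLP) can serve as the agent: the agent's current parameters are the state, the structural parameters output by the network are the action, and the loss of the current backbone network, that is, of the video behavior recognition module, together with a reward constant forms the reward function. In the forward pass, the initial structural parameters are input to the agent network, which predicts the next structural parameters, that is, the action, from its current parameters and the input structural parameters. In backpropagation, the currently obtainable reward is maximized, and the agent's parameters are updated with the reward value. Let $s$ be the current state, $a$ the current action, and $\theta$ the parameters of the network; the cross-entropy loss CE is then given by formula (6):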
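[Formula (6): image PCTCN2022116947-appb-000005, not reproduced in this text]
where [symbol image PCTCN2022116947-appb-000006, not reproduced in this text] is the model's predicted output and $y$ is the ground-truth label. To ensure that the structural parameter search has a positive effect on the overall learning of the network, the reward function can be designed on the basis of the smoothed CE value, so that the searched structural parameters and the learning of the backbone network of the video behavior recognition model reinforce each other. The smoothed CE is given by formula (7):
[Formula (7): image PCTCN2022116947-appb-000007, not reproduced in this text]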
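where $i$, $j$, and $N$ are the correct class, the other classes, and the total number of classes respectively, and $\varepsilon$ is a very small constant. Further, if the value $SCE_n$ obtained at the later time step $n$ is greater than the value $SCE_m$ obtained at the earlier step $m$, a positive reward $\gamma$ is given; otherwise the reward is $-\gamma$, as in formula (8):
$f = -\gamma \cdot \mathrm{sgn}(SCE_m - SCE_n)$      (8)
where $f$ is the reward value and $\gamma$ is a preset variable.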
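Formula (8) can be written out directly. The sketch below assumes scalar SCE values for the earlier step m and the later step n; the helper names `sign` and `reward` are hypothetical.

```python
def sign(x):
    """Return -1, 0 or 1 according to the sign of x."""
    return (x > 0) - (x < 0)

def reward(sce_prev, sce_curr, gamma):
    """f = -gamma * sgn(SCE_m - SCE_n), with m the earlier and n the later step."""
    return -gamma * sign(sce_prev - sce_curr)

r_up = reward(sce_prev=0.4, sce_curr=0.6, gamma=0.1)    # SCE_n > SCE_m
r_down = reward(sce_prev=0.6, sce_curr=0.4, gamma=0.1)  # SCE_n < SCE_m
r_same = reward(sce_prev=0.5, sce_curr=0.5, gamma=0.1)  # SCE_n == SCE_m
```

Note that with the sign function the reward is zero when the two values are equal, whereas the prose describes only the two outcomes γ and −γ; how ties are treated in practice is not stated in the text.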
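The overall objective function is given by formula (9), where $f(s,a)$ is the network's predicted output.
$L = \sum \log \pi(a \mid s, \theta)\, f(s,a)$       (9)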
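Formula (9) sums, over the recorded steps, the log-probability of the chosen action weighted by its reward. The following sketch evaluates that sum for given numbers; it assumes the action probabilities π(a|s,θ) and the rewards f(s,a) have already been computed, and the function name `policy_objective` is hypothetical.

```python
import math

def policy_objective(action_probs, rewards):
    """Sum of log pi(a|s, theta) * f(s, a) over the recorded steps."""
    return sum(math.log(p) * f for p, f in zip(action_probs, rewards))

# Two recorded steps: probabilities of the chosen actions and their rewards.
probs = [0.5, 0.25]
rewards = [1.0, -1.0]
objective = policy_objective(probs, rewards)
```

Here the value is log(0.5) - log(0.25) = log(2). In training, the agent's parameters θ would be adjusted by gradient steps on this quantity.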
具体地,针对时空信息重要度和缩小类内差异性的先验激励模块两部分的结构参数对应的多层感知机MLP分别是具有6个隐层神经元和4个隐层神经元的3层神经网络,同时在各层之间添加了ReLU激活函数,且最后一层为softplus激活函数。由于policy gradient机制需要完整的状态行为序列,则会使得中间状态缺少反馈进而导致整体训练效果不佳,因此对于状态序列长度,一种方法可以将其设置为1个时期epoch,即每2个时期epoch计算一次最近时期epoch的reward;另一种则可以将其看为一个迭代(iteration)内的优化,这样会更有利于优化。在优化时将网络的参数和Agent的参数进行剥离,分开优化。针对两种参数可以采用不同的优化器,其中Agent优化器采用Adam优化器,网络参数优化采用随机梯度下降(Stochastic Gradient Descent,SGD)进行优化处理,在优化时两者交替更新。
在将结构参数作用于空间卷积和时间卷积来融合两者的特征时,具体根据结构参数,使用Auto(2+1)D卷积结构,即2D卷积+1D卷积的结构将视频图像特征中的时空信息进行融合。其中,Auto(2+1)D是由顺序连接的2D卷积和1D卷积、各自对应的结构参数,以及激活函数组成。通过2D卷积和1D卷积来分别解耦特征中的时间和空间信息,进行独立建模,即通过2D卷积进行空间特征提取,通过1D卷积进行时间特征提取。在训练结构参数时,通过结构参数来自适应地对解耦后的信息进行融合,并通过激活函数增加模型的非线性表达能力。2D卷积和1D卷积组成一个基本的卷积块,可以作为网络中的基础块结构,如可以作为ResNet(Residual Neural Network,残差网络)中的Block结构。
在根据所提特征在时间维度的相似度和先验信息对应的结构参数,具体包括先验参数β1和β2,使用节奏调节器调整行为的节奏的处理过程中,节奏调节器包含先验激励模块和高内聚时间表达模块。先验激励模块可以根据特征的时间维度上的相似度来为当前的结构参数设置界限值Margin,以此促进结构参数的优化。高内聚的时间表达模块可以通过高效的注意力机制来增加时间维度信息的内聚性。具体地,将上一层输出的特征图输入2D卷积,进行空间特征的提取。将2D卷积输出的特征输入先验激励模块,计算其在时间维度上的相似度,并根据相似度值为结构参数设置合适的Margin。另一方面,将2D卷积输出的特征输入高内聚时间模块和1D时间卷积模块并输出特征图,根据先验信息结构参数,自适应地调整高内聚时间模块和1D时间卷积模块输出的特征图的权重并进行融合,获得融合后的特征。
具体地,为了实现通过先验信息激励网络朝着高内聚时间特征的方向优化,将3x1x1这一条时间卷积分支改为3x1x1时间卷积和带有期望最大化注意力的3x1x1时间卷积两个分支。先验激励模块主要是通过对先验参数β1和β2优化的激励作用于特征。如图7所示,目标视频图像提取到的视频图像特征通过1×3×3的2D卷积进行空间特征提取,提取结果通过α1进行贡献调整,贡献调整后依次进行批量标准化处理和激活函数的非线性映射。映射后的结果通过先验激励模块进行处理。在先验激励模块中,计算映射后的结果在时间维度的相似度,基于相似度对初始的先验参数β1和β2进行修正,并通过修改后的先验参数β1和β2,对通过两个t×1×1的1D卷积进行时间特征提取得到的结果进行加权融合,加权融合的结果通过结构参数α2进行时间特征贡献调整,得到进行视频行为识别处理的行为识别特征。
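In FIG. 7, the arrows indicate the flow of the feature maps: the feature map output by one module is input to the next. The feature map produced by the prior similarity excitation module is fed in parallel into the next convolution blocks, and the final output concatenates the feature maps of the two branches and reduces the dimension. To let the prior information steer the network toward highly cohesive temporal features or toward highly static characteristics, the cosine similarity along the time dimension is first computed on the feature map produced by the 1x3x3 convolution to measure how much the sample changes over time, and the current prior parameters are divided into positive and negative parameters according to a threshold on that degree of change. In practice, a video with a slow action rhythm has much redundant information between its target video frames, so the cohesive features need to be strengthened; their weight can be increased to highlight the focal features for behavior recognition and thereby improve recognition accuracy. Specifically, the prior parameters corrected by the excitation are merged with the originally input prior parameters through a residual connection to form the final prior parameters. Once the network has been optimized to a certain extent, the element values of the tensor tend to be uniformly small with little variance, so in practice a margin can be set and the threshold adjusted dynamically to determine the current similarity prior, giving formula (10):
$Sim = \max\left(0,\; Sim - (Thres + |\beta_1 - \beta_2|)\right)$      (10)
where $Sim$ is the similarity value, $Thres$ is the threshold, and $\beta_1$ and $\beta_2$ are the prior parameters.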
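Formula (10) is a clamped subtraction and can be sketched in one function. The sketch assumes scalar inputs; the function name `adjust_similarity` is hypothetical.

```python
def adjust_similarity(sim, thres, beta1, beta2):
    """Sim = max(0, Sim - (Thres + |beta1 - beta2|))."""
    return max(0.0, sim - (thres + abs(beta1 - beta2)))

# Similarity above the dynamic margin: a positive remainder is kept.
high = adjust_similarity(sim=0.9, thres=0.5, beta1=0.6, beta2=0.4)
# Similarity below the margin: the result is clamped to zero.
low = adjust_similarity(sim=0.3, thres=0.5, beta1=0.5, beta2=0.5)
```

The margin grows with the gap between the two prior parameters, so the similarity must clear a higher bar when β1 and β2 already differ strongly.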
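Further, the highly cohesive temporal module obtains a highly cohesive temporal representation through an attention mechanism optimized with the EM (Expectation-Maximization) algorithm. For each sample, the features are reconstructed through a fixed number of optimization iterations. As shown in FIG. 8, the process is divided into an E step and an M step: after downsampling, the feature map is processed by the E step and the M step, and the results are fused to obtain the highly cohesive features. First, assume basis vectors of size B×C×K, where B is the batch size, that is, the amount of data processed per batch, C is the number of channels of the originally input video image features, and K is the dimension of the basis vectors. In the E step, the basis vectors are matrix-multiplied with the spatial feature vectors of size B×(H×W)×C obtained from spatial feature extraction, and a softmax is applied to reconstruct the original features, giving a feature map of size B×(H×W)×K. In the M step, the reconstructed feature map of size B×(H×W)×K is multiplied with the original feature map of size B×(H×W)×C to obtain new basis vectors of size B×C×K. To keep the basis vector update stable, L2 normalization is applied to the basis vectors, and a moving average update of the basis vectors is added during training, as in formula (11):
mu = mu * momentum + mu_mean * (1 - momentum)    (11)
where mu is the basis vector, mu_mean is its mean, and momentum is the momentum.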
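The L2 normalization and the moving average update of formula (11) can be sketched on a single basis vector. This is an illustration under assumptions: `mu_mean` is taken here to be the normalized new estimate from the current iteration, the `eps` guard and the default momentum of 0.9 are illustrative choices, and the helper names are hypothetical.

```python
import math

def l2_normalize(vec, eps=1e-6):
    """Scale a vector to (nearly) unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / (norm + eps) for v in vec]

def moving_average_update(mu, mu_mean, momentum=0.9):
    """mu = mu * momentum + mu_mean * (1 - momentum), element-wise."""
    return [m * momentum + mm * (1.0 - momentum) for m, mm in zip(mu, mu_mean)]

mu = [1.0, 0.0]                      # stored basis vector
mu_mean = l2_normalize([3.0, 4.0])   # new estimate, roughly [0.6, 0.8]
mu = moving_average_update(mu, mu_mean, momentum=0.9)
```

With a momentum close to 1 the stored basis moves only slightly toward each new estimate, which is what keeps the update stable.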
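Finally, the basis vectors obtained in the M step are matrix-multiplied with the attention map obtained in the E step to obtain the final reconstructed feature map carrying global information.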
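The E step, M step, and final reconstruction can be sketched for a single sample using plain lists. This is a minimal sketch under assumptions, not the patented implementation: the batch dimension B is dropped, features are an N×C matrix with N = H×W, bases are a C×K matrix, the iteration count of 3 is illustrative, and all helper names are hypothetical.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def l2_normalize_columns(m, eps=1e-6):
    """L2-normalize each column (each basis vector) of a C x K matrix."""
    cols = transpose(m)
    normed = [[v / (math.sqrt(sum(x * x for x in col)) + eps) for v in col] for col in cols]
    return transpose(normed)

def em_attention(x, mu, iterations=3):
    """x: N x C features, mu: C x K bases. Returns (reconstruction, bases, attention map)."""
    for _ in range(iterations):
        # E step: features times bases, then softmax over the K bases -> N x K.
        z = [softmax(row) for row in matmul(x, mu)]
        # M step: attention map times original features -> new C x K bases.
        mu = l2_normalize_columns(matmul(transpose(x), z))
    # Reconstruction: attention map times bases -> N x C feature map.
    return matmul(z, transpose(mu)), mu, z

x = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]   # N = 4 positions, C = 2 channels
mu0 = [[1.0, 0.0], [0.0, 1.0]]                         # C = 2, K = 2 initial bases
x_rec, mu, z = em_attention(x, mu0)
```

Each row of `z` is a distribution over the K bases, and `x_rec` expresses every position as a mixture of the same few bases, which is the source of the cohesive, low-redundancy representation.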
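The video behavior recognition method provided in this embodiment applies to the field of video recognition, where 3D convolution is widely used but is hard to scale because of its large number of parameters. Some improved methods decompose the 3D convolution into a 2D spatial convolution and a 1D temporal convolution, achieving low computational cost, small memory requirements, and high performance. Much subsequent work has focused on designing different network structures to obtain more expressive features, but the field has paid little attention to the fact that spatial and temporal cues in video affect different action classes differently. The adaptive spatio-temporal entanglement network used by the method in this embodiment automatically fuses the decomposed spatio-temporal information on the basis of importance analysis to obtain a stronger spatio-temporal representation. In this method, the Auto(2+1)D convolution uses network architecture search to adaptively recombine the decoupled spatio-temporal convolution filters and thereby model the unequal contributions of spatial and temporal information. It uncovers the deep correlation between spatial and temporal information, learns their interaction, and, by integrating spatio-temporal information with different weights, strengthens the model's ability to model temporal and spatial information. The rhythm adjuster uses the efficient attention mechanism of the EM algorithm to extract highly cohesive features in the time dimension. It adjusts the temporal information of actions with different rhythms according to the prior information about action rhythm and the structural parameters of the temporal convolution, and thus obtains a highly cohesive representation of temporal information that handles the differing durations found in different action classes, which can improve the accuracy of video behavior recognition.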
应该理解的是,虽然图2-图3的流程图中的各个步骤按照箭头的指示依次显示,但是这些步骤并不是必然按照箭头指示的顺序依次执行。除非本文中有明确的说明,这些步骤的执行并没有严格的顺序限制,这些步骤可以以其它的顺序执行。而且,图2-图3中的至少一部分步骤可以包括多个步骤或者多个阶段,这些步骤或者阶段并不必然是在同一时刻执行完成,而是可以在不同的时刻执行,这些步骤或者阶段的执行顺序也不必然是依次进行,而是可以与其它步骤或者其它步骤中的步骤或者阶段的至少一部分轮流或者交替地执行。
在一个实施例中,如图9所示,提供了一种视频行为识别装置900,该装置可以采用软件模块或硬件模块,或者是二者的结合成为计算机设备的一部分,该装置具体包括:视频图像特征提取模块902、空间特征贡献调整模块904、特征融合模块906、时间特征贡献调整模块908和视频行为识别模块910,其中:
视频图像特征提取模块902,用于从至少两帧目标视频图像提取视频图像特征;
空间特征贡献调整模块904,用于将视频图像特征的空间特征进行贡献调整,得到中间图像特征;
特征融合模块906,用于基于先验信息对中间图像特征的时间特征和时间特征对应的内聚特征进行融合,得到融合特征;先验信息是根据中间图像特征在时间维度的变化信息得到的;内聚特征是对时间特征进行关注处理得到的;
时间特征贡献调整模块908,用于对融合特征进行时间特征贡献调整,得到行为识别特征;
视频行为识别模块910,用于基于行为识别特征进行视频行为识别。
在一个实施例中,空间特征贡献调整模块904,还用于将视频图像特征进行空间特征提取,得到视频图像特征的空间特征;通过结构参数中的空间结构参数对空间特征进行贡献调整,得到中间图像特征;结构参数是通过携带行为标签的视频图像样本训练得到的;时间特征贡献调整模块908,还用于通过结构参数中的时间结构参数对融合特征进行贡献调整,得到行为识别特征。
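In one embodiment, the apparatus further includes a to-be-trained parameter determination module, an intermediate sample feature obtaining module, a fused sample feature obtaining module, a behavior recognition sample feature obtaining module, and an iteration module. The to-be-trained parameter determination module is configured to determine the structural parameters to be trained. The intermediate sample feature obtaining module is configured to perform contribution adjustment on the spatial sample features of the video image sample features by means of the spatial structural parameter among the structural parameters to be trained, to obtain intermediate sample features; the video image sample features are extracted from the video image samples. The fused sample feature obtaining module is configured to fuse, on the basis of prior sample information, the temporal sample features of the intermediate sample features with the cohesive sample features corresponding to the temporal sample features, to obtain fused sample features; the cohesive sample features are obtained by applying attention processing to the temporal sample features, and the prior sample information is obtained from the change information of the intermediate sample features in the time dimension. The behavior recognition sample feature obtaining module is configured to perform contribution adjustment on the fused sample features by means of the temporal structural parameter among the structural parameters to be trained, to obtain behavior recognition sample features. The iteration module is configured to perform video behavior recognition on the basis of the behavior recognition sample features and, according to the behavior recognition result and the behavior labels of the video image samples, update the structural parameters to be trained and continue training until training ends, to obtain the structural parameters.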
在一个实施例中,视频行为识别装置通过视频行为识别模型实现,待训练结构参数是视频行为识别模型在训练中的参数;迭代模块还包括识别结果获得模块、差异确定模块、结构参数更新模块和结构参数获得模块;其中:识别结果获得模块,用于获得视频行为识别模型输出的行为识别结果;差异确定模块,用于确定行为识别结果与视频图像样本对应的行为标签之间的差异;结构参数更新模块,用于根据差异对视频行为识别模型中的模型参数和待训练结构参数进行更新;结构参数获得模块,用于基于更新后的视频行为识别模型继续训练直至训练结束,并根据训练完成的视频行为识别模型得到结构参数。
在一个实施例中,迭代模块还包括识别损失确定模块、奖励值获得模块和奖励值处理模块;其中:识别损失确定模块,用于确定行为识别结果和视频图像样本对应的行为标签之间的行为识别损失;奖励值获得模块,用于根据行为识别损失和前一行为识别损失得到奖励值;奖励值处理模块,用于根据奖励值对待训练结构参数进行更新,通过更新后的待训练结构参数继续训练直至目标函数满足结束条件时,获得结构参数;目标函数基于训练过程中的各奖励值得到。
在一个实施例中,奖励值获得模块,还用于根据奖励值对策略梯度网络模型的模型参数进行更新;由更新后的策略梯度网络模型对待训练结构参数进行更新。
在一个实施例中,奖励值获得模块,还用于通过更新后的策略梯度网络模型,基于更新后的模型参数和待训练结构参数进行结构参数预测,获得预测的结构参数;及根据预测的结构参数,得到对待训练结构参数进行更新后的结构参数。
在一个实施例中,还包括相似度确定模块和先验信息修正模块;其中:相似度确定模块,用于确定中间图像特征在时间维度的相似度;先验信息修正模块,用于基于相似度对初始先验信息进行修正,得到先验信息。
在一个实施例中,初始先验信息包括第一初始先验参数和第二初始先验参数;先验信息修正模块包括相似度调整模块、先验参数修正模块和先验信息获得模块;其中:相似度调整模块,用于根据第一初始先验参数、第二初始先验参数及预设阈值,对相似度进行动态调整;先验参数修正模块,用于通过动态调整后的相似度分别对第一初始先验参数和第二初始先验参数进行修正,得到第一先验参数和第二先验参数;先验信息获得模块,用于根据第一先验参数和第二先验参数得到先验信息。
在一个实施例中,还包括基向量确定模块、特征重构模块、基向量更新模块和内聚特征获得模块;其中:基向量确定模块,用于确定当前基向量;特征重构模块,用于通过当前基向量对中间图像特征的时间特征进行特征重构,得到重构特征;基向量更新模块,用于根据重构特征和时间特征生成下一关注处理的基向量;内聚特征获得模块,用于根据下一关注处理的基向量、基向量和时间特征,得到时间特征对应的内聚特征。
在一个实施例中,基向量更新模块还包括注意力特征模块、正则化处理模块和滑动平均更新模块;其中:注意力特征模块,用于融合重构特征和时间特征,生成注意力特征;正则化处理模块,用于对注意力特征进行正则化处理,得到正则化特征;滑动平均更新模块,用于对正则化特征进行滑动平均更新,生成下一关注处理的基向量。
在一个实施例中,当前基向量包括批次处理的数据大小、中间图像特征的通道数以及基向量的维度;特征重构模块,还用于将当前基向量与中间图像特征的时间特征,依次进行矩阵相乘及归一化映射处理,得到重构特征;基向量更新模块,还用于将重构特征和时间特征进行矩阵相乘,得到下一关注处理的基向量;内聚特征获得模块,还用于将下一关注处理的基向量、基向量和时间特征进行融合,得到时间特征对应的内聚特征。
在一个实施例中,特征融合模块906,还用于确定先验信息;对中间图像特征进行时间特征提取,得到中间图像特征的时间特征;通过先验信息,对时间特征和时间特征对应的内聚特征进行加权融合,得到融合特征。
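In one embodiment, the prior information includes a first prior parameter and a second prior parameter. The feature fusion module 906 is further configured to: weight the temporal features with the first prior parameter to obtain weighted temporal features; weight the cohesive features corresponding to the temporal features with the second prior parameter to obtain weighted cohesive features; and fuse the weighted temporal features and the weighted cohesive features to obtain the fused features.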
在一个实施例中,还包括标准化处理模块和非线性映射模块;其中:标准化处理模块,用于对中间图像特征进行标准化处理,得到标准化特征;非线性映射模块,用于根据标准化特征进行非线性映射,获得映射后的中间图像特征;特征融合模块906,还用于基于先验信息对映射后的中间图像特征的时间特征和时间特征对应的内聚特征进行融合,得到融合特征;先验信息是根据映射后的中间图像特征在时间维度的变化信息得到的。
在一个实施例中,标准化处理模块,还用于通过批量标准化层结构,对中间图像特征进行标准化处理,得到标准化特征;非线性映射模块,还用于通过激活函数对标准化特征进行非线性映射,获得映射后的中间图像特征。
关于视频行为识别装置的具体限定可以参见上文中对于视频行为识别方法的限定。上述视频行为识别装置中的各个模块可全部或部分通过软件、硬件及其组合来实现。上述各模块可以硬件形式内嵌于或独立于计算机设备中的处理器中,也可以以软件形式存储于计算机设备中的存储器中,以便于处理器调用执行以上各个模块对应的操作。
在一个实施例中,提供了一种计算机设备,该计算机设备可以是服务器或终端,其内部结构图可以如图10所示。该计算机设备包括通过系统总线连接的处理器、存储器和网络接口。其中,该计算机设备的处理器用于提供计算和控制能力。该计算机设备的存储器包括非易失性存储介质、内存储器。该非易失性存储介质存储有操作系统、计算机可读指令和数据库。该内存储器为非易失性存储介质中的操作系统和计算机可读指令的运行提供环境。该计算机设备的数据库用于存储模型数据。该计算机设备的网络接口用于与外部的终端通过网络连接通信。该计算机可读指令被处理器执行时以实现一种视频行为识别方法。
本领域技术人员可以理解,图10中示出的结构,仅仅是与本申请方案相关的部分结构的框图,并不构成对本申请方案所应用于其上的计算机设备的限定,具体的计算机设备可以包括比图中所示更多或更少的部件,或者组合某些部件,或者具有不同的部件布置。
在一个实施例中,还提供了一种计算机设备,包括存储器和处理器,存储器中存储有计算机可读指令,该处理器执行计算机可读指令时实现上述各方法实施例中的步骤。
在一个实施例中,提供了一种计算机可读存储介质,存储有计算机可读指令,该计算机可读指令被处理器执行时实现上述各方法实施例中的步骤。
在一个实施例中,提供了一种计算机程序产品或计算机程序,该计算机程序产品或计算机程序包括计算机可读指令,该计算机可读指令存储在计算机可读存储介质中。计算机设备的处理器从计算机可读存储介质读取该计算机可读指令,处理器执行该计算机可读指令,使得该计算机设备执行上述各方法实施例中的步骤。
本领域普通技术人员可以理解实现上述实施例方法中的全部或部分流程,是可以通过计算机可读指令来指令相关的硬件来完成,所述的计算机可读指令可存储于一非易失性计算机可读取存储介质中,该计算机可读指令在执行时,可包括如上述各方法的实施例的流程。其中,本申请所提供的各实施例中所使用的对存储器、存储、数据库或其它介质的任何引用,均可包括非易失性和易失性存储器中的至少一种。非易失性存储器可包括只读存储器(Read-Only Memory,ROM)、磁带、软盘、闪存或光存储器等。易失性存储器可包括随机存取存储器(Random Access Memory,RAM)或外部高速缓冲存储器。作为说明而非局限,RAM可以是多种形式,比如静态随机存取存储器(Static Random Access Memory,SRAM)或动态随机存取存储器(Dynamic Random Access Memory,DRAM)等。
以上实施例的各技术特征可以进行任意的组合,为使描述简洁,未对上述实施例中的各个技术特征所有可能的组合都进行描述,然而,只要这些技术特征的组合不存在矛盾,都应当认为是本说明书记载的范围。
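The above embodiments express only several implementations of this application. Their description is relatively specific and detailed, but it should not be construed as limiting the scope of the patent. It should be noted that a person of ordinary skill in the art may make several variations and improvements without departing from the concept of this application, and these all fall within the protection scope of this application. The protection scope of this patent application shall therefore be subject to the appended claims.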

Claims (20)

  1. 一种视频行为识别方法,由计算机设备执行,其特征在于,所述方法包括:
    从至少两帧目标视频图像提取视频图像特征;
    将所述视频图像特征的空间特征进行贡献调整,得到中间图像特征;
    基于先验信息对所述中间图像特征的时间特征和所述时间特征对应的内聚特征进行融合,得到融合特征;所述先验信息是根据所述中间图像特征在时间维度的变化信息得到的;所述内聚特征是对所述时间特征进行关注处理得到的;
    对所述融合特征进行时间特征贡献调整,得到行为识别特征;及
    基于所述行为识别特征进行视频行为识别。
  2. 根据权利要求1所述的方法,其特征在于,所述将所述视频图像特征的空间特征进行贡献调整,得到中间图像特征,包括:
    将所述视频图像特征进行空间特征提取,得到所述视频图像特征的空间特征;及
    通过结构参数中的空间结构参数对所述空间特征进行贡献调整,得到中间图像特征;所述结构参数是通过携带行为标签的视频图像样本训练得到的;
    所述对所述融合特征进行时间特征贡献调整,得到行为识别特征,包括:
    通过所述结构参数中的时间结构参数对所述融合特征进行贡献调整,得到行为识别特征。
  3. 根据权利要求2所述的方法,其特征在于,所述方法还包括:
    确定待训练结构参数;
    通过所述待训练结构参数中的空间结构参数,对视频图像样本特征的空间样本特征进行贡献调整,得到中间样本特征;所述视频图像样本特征是从所述视频图像样本提取得到的;
    基于先验样本信息对所述中间样本特征的时间样本特征和所述时间样本特征对应的内聚样本特征进行融合,得到融合样本特征;所述内聚样本特征是对所述时间样本特征进行关注处理得到的;所述先验样本信息是根据所述中间样本特征在时间维度的变化信息得到的;
    通过所述待训练结构参数中的时间结构参数对所述融合样本特征进行贡献调整,得到行为识别样本特征;及
    基于所述行为识别样本特征进行视频行为识别,并根据行为识别结果和所述视频图像样本对应的行为标签,对所述待训练结构参数进行更新后继续训练直至训练结束,获得所述结构参数。
  4. 根据权利要求3所述的方法,其特征在于,所述方法通过视频行为识别模型实现,所述待训练结构参数是所述视频行为识别模型在训练中的参数;所述根据行为识别结果和所述视频图像样本对应的行为标签,对所述待训练结构参数进行更新后继续训练直至训练结束,获得所述结构参数,包括:
    获得所述视频行为识别模型输出的行为识别结果;
    确定所述行为识别结果与所述视频图像样本对应的行为标签之间的差异;
    根据所述差异对所述视频行为识别模型中的模型参数和所述待训练结构参数进行更新;及
    基于更新后的视频行为识别模型继续训练直至训练结束,并根据训练完成的视频行为识别模型得到所述结构参数。
  5. 根据权利要求3所述的方法,其特征在于,所述根据行为识别结果和所述视频图像样本对应的行为标签,对所述待训练结构参数进行更新后继续训练直至训练结束,获得所述结构参数,包括:
    确定行为识别结果和所述视频图像样本对应的行为标签之间的行为识别损失;
    根据所述行为识别损失和前一行为识别损失得到奖励值;及
    根据所述奖励值对所述待训练结构参数进行更新,通过更新后的待训练结构参数继续训练直至目标函数满足结束条件时,获得所述结构参数;所述目标函数基于训练过程中的各奖励值得到。
  6. 根据权利要求5所述的方法,其特征在于,所述根据所述奖励值对所述待训练结构参数进行更新,包括:
    根据所述奖励值对策略梯度网络模型的模型参数进行更新;及
    由更新后的策略梯度网络模型对所述待训练结构参数进行更新。
  7. 根据权利要求6所述的方法,其特征在于,所述由更新后的策略梯度网络模型对所述待训练结构参数进行更新,包括:
    通过更新后的策略梯度网络模型,基于更新后的模型参数和待训练结构参数进行结构参数预测,获得预测的结构参数;及
    根据所述预测的结构参数,得到对所述待训练结构参数进行更新后的结构参数。
  8. 根据权利要求1所述的方法,其特征在于,所述方法还包括:
    确定所述中间图像特征在时间维度的相似度;及
    基于所述相似度对初始先验信息进行修正,得到先验信息。
  9. 根据权利要求8所述的方法,其特征在于,所述初始先验信息包括第一初始先验参数和第二初始先验参数;所述基于所述相似度对初始先验信息进行修正,得到先验信息,包括:
    根据所述第一初始先验参数、所述第二初始先验参数及预设阈值,对所述相似度进行动态调整;
    通过动态调整后的相似度分别对所述第一初始先验参数和所述第二初始先验参数进行修正,得到第一先验参数和第二先验参数;及
    根据所述第一先验参数和所述第二先验参数得到先验信息。
  10. 根据权利要求1所述的方法,其特征在于,所述方法还包括:
    确定当前基向量;
    通过所述当前基向量对所述中间图像特征的时间特征进行特征重构,得到重构特征;
    根据所述重构特征和所述时间特征生成下一关注处理的基向量;及
    根据所述下一关注处理的基向量、所述基向量和所述时间特征,得到所述时间特征对应的内聚特征。
  11. 根据权利要求10所述的方法,其特征在于,所述根据所述重构特征和所述时间特征生成下一关注处理的基向量,包括:
    融合所述重构特征和所述时间特征,生成注意力特征;
    对所述注意力特征进行正则化处理,得到正则化特征;及
    对所述正则化特征进行滑动平均更新,生成下一关注处理的基向量。
  12. 根据权利要求10所述的方法,其特征在于,所述当前基向量包括批次处理的数据大小、中间图像特征的通道数以及基向量的维度;所述通过所述当前基向量对所述中间图像特征的时间特征进行特征重构,得到重构特征,包括:
    将所述当前基向量与所述中间图像特征的时间特征,依次进行矩阵相乘及归一化映射处理,得到重构特征;
    所述根据所述重构特征和所述时间特征生成下一关注处理的基向量,包括:
    将所述重构特征和所述时间特征进行矩阵相乘,得到下一关注处理的基向量;
    所述根据所述下一关注处理的基向量、所述基向量和所述时间特征,得到所述时间特征对应的内聚特征,包括:
    将所述下一关注处理的基向量、所述基向量和所述时间特征进行融合,得到所述时间特征对应的内聚特征。
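  13. The method according to any one of claims 1 to 12, wherein fusing, on the basis of prior information, the temporal features of the intermediate image features with the cohesive features corresponding to the temporal features to obtain the fused features comprises:
    determining the prior information;
    performing temporal feature extraction on the intermediate image features to obtain the temporal features of the intermediate image features; and
    performing weighted fusion of the temporal features and the cohesive features corresponding to the temporal features by means of the prior information to obtain the fused features.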
  14. 根据权利要求13所述的方法,其特征在于,所述先验信息包括第一先验参数和第二先验参数;所述通过所述先验信息,对所述时间特征和所述时间特征对应的内聚特征进行加权融合,得到融合特征,包括:
    通过所述第一先验参数对所述时间特征进行加权处理,获得加权处理后的时间特征;
    通过所述第二先验参数对所述时间特征对应的内聚特征进行加权处理,得到加权处理后的内聚特征;及
    将所述加权处理后的时间特征和所述加权处理后的内聚特征进行融合,得到融合特征。
  15. 根据权利要求1所述的方法,其特征在于,在所述基于先验信息对所述中间图像特征的时间特征和所述时间特征对应的内聚特征进行融合,得到融合特征之前,还包括:
    对所述中间图像特征进行标准化处理,得到标准化特征;及
    根据所述标准化特征进行非线性映射,获得映射后的中间图像特征;
    所述基于先验信息对所述中间图像特征的时间特征和所述时间特征对应的内聚特征进行融合,得到融合特征,包括:
    基于先验信息对所述映射后的中间图像特征的时间特征和所述时间特征对应的内聚特征进行融合,得到融合特征;所述先验信息是根据所述映射后的中间图像特征在时间维度的变化信息得到的。
  16. 根据权利要求15所述的方法,其特征在于,所述对所述中间图像特征进行标准化处理,得到标准化特征,包括:
    通过批量标准化层结构,对所述中间图像特征进行标准化处理,得到标准化特征;
    所述根据所述标准化特征进行非线性映射,获得映射后的中间图像特征,包括:
    通过激活函数对所述标准化特征进行非线性映射,获得映射后的中间图像特征。
  17. 一种视频行为识别装置,其特征在于,所述装置包括:
    视频图像特征提取模块,用于从至少两帧目标视频图像提取视频图像特征;
    空间特征贡献调整模块,用于将所述视频图像特征的空间特征进行贡献调整,得到中间图像特征;
    特征融合模块,用于基于先验信息对所述中间图像特征的时间特征和所述时间特征对应的内聚特征进行融合,得到融合特征;所述先验信息是根据所述中间图像特征在时间维度的变化信息得到的;所述内聚特征是对所述时间特征进行关注处理得到的;
    时间特征贡献调整模块,用于对所述融合特征进行时间特征贡献调整,得到行为识别特征;及
    视频行为识别模块,用于基于所述行为识别特征进行视频行为识别。
  18. 一种计算机设备,包括存储器和处理器,所述存储器存储有计算机可读指令,其特征在于,所述处理器执行所述计算机可读指令时实现权利要求1至16中任一项所述的方法的步骤。
  19. 一种计算机可读存储介质,存储有计算机可读指令,其特征在于,所述计算机可读指令被处理器执行时实现权利要求1至16中任一项所述的方法的步骤。
  20. 一种计算机程序产品,包括计算机可读指令,其特征在于,所述计算机可读指令被处理器执行时实现权利要求1至16任一项所述的方法的步骤。
PCT/CN2022/116947 2021-10-15 2022-09-05 视频行为识别方法、装置、计算机设备和存储介质 Ceased WO2023061102A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP22880046.2A EP4287144A4 (en) 2021-10-15 2022-09-05 VIDEO BEHAVIOR RECOGNITION METHOD AND APPARATUS AND COMPUTER DEVICE AND STORAGE MEDIUM
US18/201,635 US20230316733A1 (en) 2021-10-15 2023-05-24 Video behavior recognition method and apparatus, and computer device and storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111202734.4A CN114332670A (zh) 2021-10-15 2021-10-15 视频行为识别方法、装置、计算机设备和存储介质
CN202111202734.4 2021-10-15

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US18/201,635 Continuation US20230316733A1 (en) 2021-10-15 2023-05-24 Video behavior recognition method and apparatus, and computer device and storage medium

Publications (1)

Publication Number Publication Date
WO2023061102A1 true WO2023061102A1 (zh) 2023-04-20

Family

ID=81044868

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/116947 Ceased WO2023061102A1 (zh) 2021-10-15 2022-09-05 视频行为识别方法、装置、计算机设备和存储介质

Country Status (4)

Country Link
US (1) US20230316733A1 (zh)
EP (1) EP4287144A4 (zh)
CN (1) CN114332670A (zh)
WO (1) WO2023061102A1 (zh)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116524419A (zh) * 2023-07-03 2023-08-01 南京信息工程大学 基于时空解耦与自注意力差分lstm的视频预测方法、系统
CN116524542A (zh) * 2023-05-08 2023-08-01 杭州像素元科技有限公司 一种基于细粒度特征的跨模态行人重识别方法及装置
CN118694899A (zh) * 2024-07-18 2024-09-24 远洋亿家物业服务股份有限公司 基于人工智能的物业安全防控方法及系统

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114332670A (zh) * 2021-10-15 2022-04-12 腾讯科技(深圳)有限公司 视频行为识别方法、装置、计算机设备和存储介质
CN114882223B (zh) * 2022-06-01 2024-08-16 安徽农业大学 轻量化叶菜苗语义分割模型及其测试和使用方法
CN115240271B (zh) * 2022-07-08 2025-08-05 北方工业大学 基于时空建模的视频行为识别方法与系统
CN116189028B (zh) * 2022-11-29 2024-06-21 北京百度网讯科技有限公司 图像识别方法、装置、电子设备以及存储介质
CN116189281B (zh) * 2022-12-13 2024-04-02 北京交通大学 基于时空自适应融合的端到端人体行为分类方法及系统

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110096950A (zh) * 2019-03-20 2019-08-06 西北大学 一种基于关键帧的多特征融合行为识别方法
KR20200036093A (ko) * 2018-09-21 2020-04-07 네이버웹툰 주식회사 비디오 영상 내의 행동 인식 방법 및 장치
CN111950444A (zh) * 2020-08-10 2020-11-17 北京师范大学珠海分校 一种基于时空特征融合深度学习网络的视频行为识别方法
CN113378600A (zh) * 2020-03-09 2021-09-10 北京灵汐科技有限公司 一种行为识别方法及系统
CN114332670A (zh) * 2021-10-15 2022-04-12 腾讯科技(深圳)有限公司 视频行为识别方法、装置、计算机设备和存储介质

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111178319A (zh) * 2020-01-06 2020-05-19 山西大学 基于压缩奖惩机制的视频行为识别方法
CN113435430B (zh) * 2021-08-27 2021-11-09 中国科学院自动化研究所 基于自适应时空纠缠的视频行为识别方法、系统、设备

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP4287144A4

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116524542A (zh) * 2023-05-08 2023-08-01 杭州像素元科技有限公司 一种基于细粒度特征的跨模态行人重识别方法及装置
CN116524542B (zh) * 2023-05-08 2023-10-31 杭州像素元科技有限公司 一种基于细粒度特征的跨模态行人重识别方法及装置
CN116524419A (zh) * 2023-07-03 2023-08-01 南京信息工程大学 基于时空解耦与自注意力差分lstm的视频预测方法、系统
CN116524419B (zh) * 2023-07-03 2023-11-07 南京信息工程大学 基于时空解耦与自注意力差分lstm的视频预测方法、系统
CN118694899A (zh) * 2024-07-18 2024-09-24 远洋亿家物业服务股份有限公司 基于人工智能的物业安全防控方法及系统

Also Published As

Publication number Publication date
EP4287144A1 (en) 2023-12-06
CN114332670A (zh) 2022-04-12
EP4287144A4 (en) 2024-09-04
US20230316733A1 (en) 2023-10-05

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22880046

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 2022880046

Country of ref document: EP

ENP Entry into the national phase

Ref document number: 2022880046

Country of ref document: EP

Effective date: 20230831

WWE Wipo information: entry into national phase

Ref document number: 11202306161T

Country of ref document: SG

NENP Non-entry into the national phase

Ref country code: DE