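WO2024217164A1 - Processing method and apparatus for video denoising model, computer device, and storage medium - Google Patents

Processing method and apparatus for video denoising model, computer device, and storage medium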

Info

Publication number
WO2024217164A1
Authority
WO
WIPO (PCT)
Prior art keywords
video
video frame
downsampled
features
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/CN2024/079883
Other languages
English (en)
French (fr)
Inventor
陈艺云
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to EP24791752.9A (EP4632666A4)
Publication of WO2024217164A1
Priority to US19/193,267 (US20250272803A1)

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N5/00 Details of television systems
    • H04N5/222 Studio circuitry; Studio devices; Studio equipment
    • H04N5/262 Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects; Cameras specially adapted for the electronic generation of special effects
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00 Image enhancement or restoration
    • G06T5/50 Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00 Image enhancement or restoration
    • G06T5/60 Image enhancement or restoration using machine learning, e.g. neural networks
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00 Image enhancement or restoration
    • G06T5/70 Denoising; Smoothing
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/30 Noise filtering
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10016 Video; Image sequence
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Definitions

  • the present application relates to the field of computer technology, and in particular to a processing method, device, computer equipment and storage medium for a video denoising model.
  • video denoising technology has gradually become a research hotspot in the field of improving video quality.
  • the video denoising model based on deep learning has obvious advantages in denoising effect and speed, and has broad application prospects.
  • the existing single-frame based video denoising model cannot fully consider the correlation and continuity of the video in the temporal dimension, and cannot extract better features.
  • the multi-frame based video denoising model is also unable to extract better features when computing resources are limited, resulting in poor denoising effect of the existing video denoising model on the video.
  • a processing method, apparatus, computer device, computer-readable storage medium, and computer program product for a video denoising model that can improve the video denoising effect are provided.
  • the present application provides a method for processing a video denoising model, which is executed by a computer device, and the method includes:
  • the parameters in the video denoising model are adjusted to obtain a target video denoising model;
  • the reference video frame is a video frame in the reference video corresponding to the target video frame;
  • the target video denoising model is used to denoise the video to be denoised.
  • the present application also provides a processing device for a video denoising model.
  • the device comprises:
  • a video frame acquisition module used to acquire a target video frame in a video frame sequence of a sample video, and to acquire a reference video corresponding to the sample video;
  • a detail feature extraction module used to extract image detail features of the target video frame through a first branch of a video denoising model;
  • a fusion feature extraction module used for downsampling the video frame sequence to obtain a downsampled video frame sequence, and extracting features from the downsampled video frame sequence through the second branch of the video denoising model to obtain an image fusion feature;
  • a prediction module used for generating a predicted video frame based on the image fusion feature and the image detail feature;
  • a parameter adjustment module used to adjust the parameters in the video denoising model according to the loss value between the predicted video frame and the reference video frame to obtain a target video denoising model;
  • the reference video frame is a video frame in the reference video corresponding to the target video frame;
  • the target video denoising model is used to perform denoising on the video to be denoised.
  • the present application further provides a computer device, wherein the computer device comprises a memory and a processor, wherein the memory stores computer-readable instructions, and when the processor executes the computer-readable instructions, the steps of the processing method of the video denoising model are implemented.
  • the present application further provides a computer-readable storage medium having computer-readable instructions stored thereon, which implement the steps of the processing method of the video denoising model when executed by a processor.
  • the present application further provides a computer program product, which includes computer-readable instructions, and when the computer-readable instructions are executed by a processor, the steps of the processing method of the video denoising model are implemented.
  • FIG1 is a diagram of an application environment of a processing method of a video denoising model in one embodiment
  • FIG2a is a schematic flow chart of a method for processing a video denoising model in one embodiment
  • FIG2b is a schematic flow chart of a method for processing a video denoising model in another embodiment
  • FIG3 is a schematic diagram of denoising a noisy video frame in one embodiment
  • FIG4 is a schematic diagram of adding noise to a video frame in one embodiment
  • FIG5 is a schematic diagram of a real noise image in one embodiment
  • FIG6 is a schematic diagram of a flow chart of an image fusion feature extraction step in one embodiment
  • FIG7 is a schematic diagram of a process flow of a video denoising step in one embodiment
  • FIG8 is a flow chart of a method for processing a video denoising model in another embodiment
  • FIG9 is a schematic diagram of sample data processing in one embodiment
  • FIG10 is a schematic diagram of a video denoising model structure in one embodiment
  • FIG11 is a schematic diagram of a noisy video frame in another embodiment
  • FIG12 is a schematic diagram of a denoised video frame in one embodiment
  • FIG13 is a structural block diagram of a processing device for a video denoising model in one embodiment
  • FIG14 is a structural block diagram of a processing device for a video denoising model in another embodiment
  • FIG15 is a diagram showing the internal structure of a computer device in one embodiment
  • FIG. 16 is a diagram showing the internal structure of a computer device in another embodiment.
  • the terms "first", "second", and "third" are used only to distinguish similar objects and do not denote a specific order of the objects. It is understandable that, where permitted, the specific order or sequence of "first", "second", and "third" can be interchanged, so that the embodiments of the present application described herein can be implemented in an order other than that illustrated or described.
  • the processing method of the video denoising model provided in the embodiment of the present application can be applied in the application environment shown in FIG1.
  • the terminal 102 communicates with the server 104 through a network.
  • the data storage system can store the data that the server 104 needs to process.
  • the data storage system can be integrated on the server 104, or it can be placed on the cloud or other servers.
  • the processing method of the video denoising model is executed by the terminal 102 or the server 104 alone, or by the terminal 102 and the server 104 in collaboration.
  • the processing method of the video denoising model is executed by terminal 102, which obtains a target video frame in a video frame sequence of a sample video, and obtains a reference video corresponding to the sample video; extracts image detail features of the target video frame through a first branch of the video denoising model; downsamples the video frame sequence to obtain a downsampled video frame sequence, and extracts features of the downsampled video frame sequence through a second branch of the video denoising model to obtain image fusion features; generates a predicted video frame based on the image fusion features and the image detail features; adjusts parameters in the video denoising model according to a loss value between the predicted video frame and the reference video frame to obtain a target video denoising model; wherein the reference video frame is a video frame in the reference video corresponding to the target video frame, and the target video denoising model is used to denoise the video to be denoised.
  • the terminal 102 can be, but is not limited to, various desktop computers, laptops, smart phones, tablet computers, Internet of Things devices and portable wearable devices.
  • the Internet of Things devices can be smart speakers, smart TVs, smart air conditioners, smart car-mounted devices, etc.
  • Portable wearable devices can be smart watches, smart bracelets, head-mounted devices, etc.
  • the server 104 can be an independent physical server, or a server cluster or distributed system composed of multiple physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, CDN, and big data and artificial intelligence platforms.
  • the terminal 102 and the server 104 can be directly or indirectly connected via wired or wireless communications, and this application does not limit this.
  • a method for processing a video denoising model is provided, which is described by taking the method applied to the computer device (terminal 102 or server 104) in FIG. 1 as an example, and includes the following steps:
  • the sample video is the video data used to train the machine learning model.
  • the sample video usually consists of multiple video frames, and each video frame contains information about the video content, such as color, shape, action, etc.
  • the sample video can come from various sources, such as real-life recordings, simulated videos, videos on the Internet, etc.
  • the sample video is a video with noise, and the reference video is a noise-free or extremely low-noise video corresponding to the sample video. It is usually used as the "real" or "ideal" state in the video denoising task.
  • the reference video provides a target standard for evaluating the denoising effect and model performance.
  • the sample videos in the embodiments of the present application include static videos with real noise and noisy dynamic videos.
  • Static video refers to the video data generated when the camera is fixed and the subject is not moving. Since the camera is not moving, the real noise in the static video is usually caused by factors such as the noise of the camera itself, uneven lighting, and sensor noise. Therefore, static videos with real noise can better reflect the video noise situation in actual applications; dynamic video refers to the video data generated when the camera or the subject is moving.
  • noisy dynamic video refers to video obtained by adding noise to original dynamic video data in order to simulate the noise found in actual application scenarios. The robustness and performance of a video denoising algorithm or model can be better tested and evaluated with noisy dynamic video.
  • the reference video in the embodiments of the present application includes a clear static video obtained by smoothing the static video and a clear dynamic video without noise.
  • the terminal extracts a video frame sequence from the sample video at a certain time interval, and obtains the current target video frame to be processed from the extracted video frame sequence.
  • the video frame sequence extracted from the sample video by the terminal includes 10 video frames, and the current target video frame to be processed is the second frame, then the video frame of the second frame is obtained from the video frame sequence.
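The frame extraction described above can be sketched as follows. This is a minimal illustration only: the function name `extract_sequence`, the `fps` and `interval_seconds` parameters, and the use of a plain Python list to stand in for decoded frames are all assumptions, not details given in the application.

```python
def extract_sequence(video_frames, fps, interval_seconds):
    """Keep one frame per `interval_seconds` from a list of decoded frames.

    `video_frames` is any sequence of frames; `fps` and `interval_seconds`
    are hypothetical parameters chosen for this illustration.
    """
    step = max(1, round(fps * interval_seconds))
    return video_frames[::step]


# Stand-in for a decoded sample video: 100 frames identified by their index.
video = list(range(100))
sequence = extract_sequence(video, fps=25, interval_seconds=0.4)

# The second frame of the extracted sequence is taken as the target video frame.
target_frame = sequence[1]
```

With these example values the extracted sequence contains 10 frames, matching the example in the text.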
  • the video denoising model refers to a computer vision model or algorithm used to remove noise from the video.
  • Video noise is usually caused by factors such as imperfections in the acquisition equipment, interference in signal transmission, and compression algorithms. Therefore, in many video applications, such as video conferencing and video encoding, denoising is an important preprocessing step.
  • the task of the video denoising model is to restore the clearest and most noise-free video from the input noisy video, while retaining the details and quality of the input noisy video as much as possible.
  • the first branch of the video denoising model can specifically be a high-resolution branch, which is used to process the target video frame of the original resolution. It can be understood that the original resolution of the target video frame is high resolution. High resolution means that the resolution of the image reaches a specific resolution threshold. The resolution threshold can be set according to needs.
  • the target video frame with high resolution usually carries more noise and richer detail information. By performing feature processing on the target video frame through the first branch of the video denoising model, richer image detail features can be obtained.
  • Image detail features refer to the features of the detailed parts of an image, such as texture, edges, corners, etc. By extracting image detail features, noise and signals can be distinguished more accurately, and more detail information can be restored, thereby improving the quality and clarity of the image.
  • the terminal inputs the target video frame into the first branch of the video denoising model, processes the target video frame through each network layer of the first branch, and obtains image detail features of the target video frame.
  • the downsampled video frame sequence refers to the video frame sequence obtained by downsampling the video frame sequence of the sample video.
  • downsampling refers to reducing the resolution of the image, thereby reducing the size of the image and the detail information it contains. It is usually used to reduce the amount of calculation and memory usage, while accelerating the training and inference of the model.
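As a rough illustration of downsampling a video frame sequence, the sketch below averages non-overlapping blocks of pixels. The application does not specify the downsampling method; average pooling, the `factor` parameter, and the `(T, H, W, C)` array layout are assumptions made for this example.

```python
import numpy as np


def downsample_sequence(frames: np.ndarray, factor: int = 2) -> np.ndarray:
    """Downsample a (T, H, W, C) frame sequence by average pooling.

    Average pooling is an assumed choice; the source only states that the
    resolution of each frame is reduced.
    """
    t, h, w, c = frames.shape
    # Crop so that height and width are divisible by the factor.
    h2, w2 = h - h % factor, w - w % factor
    cropped = frames[:, :h2, :w2, :]
    # Split each spatial axis into (blocks, factor) and average within blocks.
    blocks = cropped.reshape(t, h2 // factor, factor, w2 // factor, factor, c)
    return blocks.mean(axis=(2, 4))


frames = np.arange(2 * 4 * 4 * 1, dtype=np.float64).reshape(2, 4, 4, 1)
small = downsample_sequence(frames, factor=2)
```

Each output pixel is the mean of a `factor` by `factor` block, so a 4x4 frame becomes 2x2.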
  • the second branch of the video denoising model can specifically be a low-resolution branch, which is used to process the downsampled video frame sequence.
  • the resolution of each downsampled video frame in the downsampled video frame sequence is low resolution.
  • Low resolution means that the resolution of the image does not reach a specific resolution threshold.
  • the resolution threshold can be set according to demand.
  • the size of each downsampled video frame in the low-resolution downsampled video frame sequence is reduced or the detail information is reduced.
  • Processing the downsampled video frame sequence through the second branch of the video denoising model can effectively reduce the amount of calculation and improve the operating efficiency of the model. At the same time, it can also enhance the generalization ability of the model, making it more suitable for processing videos of different resolutions.
  • Image fusion features are feature representations obtained by fusing features of at least two downsampled video frames in a downsampled video frame sequence. It is understandable that for video data with noise, it is often difficult to obtain good denoising effects by using only one frame of image for denoising, because a single frame of image may have too much noise and distortion and cannot provide sufficient information. In addition, the feature representations obtained after feature extraction of each downsampled video frame in the downsampled video frame sequence may have information loss. Fusing the features of multiple downsampled video frames can improve the expressiveness of the features, thereby improving the denoising effect of the model.
  • after obtaining the video frame sequence, the terminal performs downsampling processing on each video frame in the video frame sequence to obtain a downsampled video frame sequence, and inputs the downsampled video frame sequence into the second branch of the video denoising model. Each sub-branch of the second branch processes a respective downsampled video frame in the downsampled video frame sequence to obtain image fusion features.
  • the predicted video frame refers to a video frame generated by denoising the input video in the video denoising model.
  • the terminal fuses the image fusion feature and the image detail feature to obtain the global image feature, and generates a predicted video frame based on the global image feature.
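One possible way to combine the two feature maps is sketched below. The application does not state how the fusion is performed; nearest-neighbour upsampling followed by channel-wise concatenation, the function name `fuse_features`, and the `(C, H, W)` layout are all assumptions for illustration.

```python
import numpy as np


def fuse_features(detail: np.ndarray, fusion: np.ndarray) -> np.ndarray:
    """Combine high-resolution detail features with low-resolution fusion features.

    detail: (C1, H, W) features from the first (high-resolution) branch.
    fusion: (C2, H // s, W // s) features from the second (low-resolution) branch.
    Returns an assumed "global image feature" of shape (C1 + C2, H, W).
    """
    scale_h = detail.shape[1] // fusion.shape[1]
    scale_w = detail.shape[2] // fusion.shape[2]
    # Nearest-neighbour upsampling: repeat each value along both spatial axes.
    upsampled = fusion.repeat(scale_h, axis=1).repeat(scale_w, axis=2)
    # Concatenate along the channel axis.
    return np.concatenate([detail, upsampled], axis=0)


detail_features = np.zeros((3, 4, 4))
fusion_features = np.ones((2, 2, 2))
global_features = fuse_features(detail_features, fusion_features)
```

In a real model the concatenated features would be passed to further layers that produce the predicted video frame; that part is omitted here.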
  • the reference video frame is the video frame in the reference video corresponding to the target video frame
  • the loss value is used to evaluate the degree of difference between the predicted video frame obtained by the video denoising model after denoising the input video and the corresponding video frame in the reference video.
  • the smaller the loss value the smaller the difference between the result predicted by the model and the actual result, and the better the prediction accuracy and effect of the model.
  • the target video denoising model is a trained machine learning model used to denoise the target video.
  • after obtaining the predicted video frame, the terminal obtains the video frame corresponding to the target video frame from the reference video (this video frame can also be called a reference video frame), determines a loss value based on the predicted video frame and the corresponding reference video frame, and adjusts the parameters in the video denoising model based on the determined loss value. Training stops when the convergence condition is met, yielding the target video denoising model.
  • convergence means that the training process of the video denoising model has become stable, that is, the video denoising model has learned the characteristics of the data and no longer has significant improvement.
  • the convergence conditions include a fixed number of training rounds, a fixed threshold of the loss function, etc. When the model reaches this condition, the training is stopped to avoid overfitting.
  • the terminal adjusts the values of the weight parameters and bias parameters in the video denoising model based on the loss value to obtain the adjusted video denoising model, and re-executes step S202 until the training stops when the convergence condition is met to obtain the target video denoising model.
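The adjust-until-convergence loop can be illustrated with a deliberately tiny stand-in model. The sketch below is not the video denoising model: it fits a single weight and bias by gradient descent on a mean squared error, and stops when either a loss threshold or a maximum number of rounds is reached. The learning rate, thresholds, and function name `train` are assumptions.

```python
import numpy as np


def train(noisy, clean, lr=0.1, max_rounds=2000, loss_threshold=1e-6):
    """Toy training loop with two convergence conditions (assumed values).

    The "model" is prediction = w * noisy + b, standing in for a network
    whose weight and bias parameters are adjusted from the loss value.
    """
    w, b = 0.0, 0.0
    loss = float("inf")
    rounds_used = 0
    for rounds_used in range(1, max_rounds + 1):
        error = (w * noisy + b) - clean
        loss = float(np.mean(error ** 2))
        if loss < loss_threshold:  # convergence: loss threshold reached
            break
        # Gradient descent update of the weight and bias parameters.
        w -= lr * float(np.mean(2.0 * error * noisy))
        b -= lr * float(np.mean(2.0 * error))
    return w, b, loss, rounds_used


noisy_input = np.linspace(0.0, 1.0, 11)
clean_target = 0.5 * noisy_input + 0.1
w, b, final_loss, rounds_used = train(noisy_input, clean_target)
```

The loop exits early once the loss falls below the threshold, which mirrors stopping training when a convergence condition is met rather than always running a fixed number of rounds.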
  • the terminal may determine the loss value based on the following formula:
  • $L$ represents the loss value
  • $I^{LQ}$ represents the video frame sequence in the sample video
  • $T$ represents the number of video frames in the video frame sequence
  • $F(I^{LQ})_i$ represents the predicted video frame corresponding to the $i$-th video frame (target video frame) in the video frame sequence
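The formula itself is not reproduced in this text, so its exact form is unknown here. As a hedged illustration only, the sketch below assumes a mean absolute error between each predicted frame and its reference frame, averaged over the T frames of the sequence; the actual loss in the application may differ.

```python
import numpy as np


def sequence_loss(predicted, reference) -> float:
    """Average per-frame mean absolute error over T frames.

    The L1 form is an assumption; the source lists only the symbols
    (loss value, frame sequence, frame count, predicted frame).
    """
    predicted = np.asarray(predicted, dtype=np.float64)
    reference = np.asarray(reference, dtype=np.float64)
    t = predicted.shape[0]
    # Flatten each frame, take its mean absolute difference, then average over frames.
    per_frame = np.abs(predicted - reference).reshape(t, -1).mean(axis=1)
    return float(per_frame.mean())


predicted_frames = np.zeros((2, 2, 2))
reference_frames = np.ones((2, 2, 2))
loss_value = sequence_loss(predicted_frames, reference_frames)
```

A smaller value indicates that the predicted frames are closer to the reference frames.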
  • the terminal extracts the image detail features of the target video frame through the first branch of the video denoising model.
  • features are extracted from the downsampled video frame sequence through the second branch of the video denoising model to obtain image fusion features, and a predicted video frame is generated based on the image fusion features and the image detail features.
  • the parameters in the video denoising model can be adjusted according to the loss value between the predicted video frame and the video frame corresponding to the target video frame in the reference video, so as to obtain a target video denoising model with better denoising effect.
  • the sample video includes a static video with real noise and a dynamic video with added noise;
  • the reference video includes a clear static video obtained by smoothing a static video, and a dynamic video without added noise.
  • the static video also carries the added noise
  • the processing method of the above-mentioned video denoising model also includes the following steps: performing video capture on the static object to obtain the original static video carrying the real noise; performing noise addition processing on the original static video to obtain the static video; the static video carries the added noise and the real noise; and performing smoothing processing on the original static video to obtain a clear static video.
  • added noise is noise that is artificially added to the video
  • the types of noise include Gaussian noise, salt and pepper noise, pseudo-random noise, etc.
  • static objects refer to objects that remain motionless.
  • Smoothing is an image processing method, and its main purpose is to reduce the noise of the image. In video processing, smoothing can be applied to each frame of the video. By smoothing each frame, the video can be made smoother and more natural, and the noise can be reduced. Smoothing usually needs to be applied to each frame, so for video, smoothing can also be called time domain filtering.
  • the terminal keeps the video acquisition device still and shoots a static object to obtain a static video, which is the original static video carrying real noise.
  • a preset noise addition algorithm is used to perform noise addition processing on the original static video to obtain a static video, which carries both the added noise and the real noise.
  • a preset smoothing algorithm is used to smooth the original static video to obtain a clear static video.
  • the smoothing algorithms include Gaussian blur, median filtering, mean filtering, etc.
  • Gaussian blur can reduce the noise of the image by taking the weighted average of the pixels around each pixel.
  • Median filtering and mean filtering can reduce the noise of the image by calculating the median or average of the pixels around each pixel.
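A minimal sketch of two of the smoothing filters named above, applied to a single grayscale frame. The 3x3 window, edge padding, and function name `smooth_frame` are assumptions; the application does not fix these details.

```python
import numpy as np


def smooth_frame(frame: np.ndarray, method: str = "median") -> np.ndarray:
    """Apply a 3x3 median or mean filter to a 2D grayscale frame.

    Window size and edge handling are illustrative assumptions.
    """
    frame = np.asarray(frame, dtype=np.float64)
    h, w = frame.shape
    padded = np.pad(frame, 1, mode="edge")
    # Stack the nine shifted views that make up each pixel's 3x3 neighbourhood.
    windows = np.stack(
        [padded[dy:dy + h, dx:dx + w] for dy in range(3) for dx in range(3)]
    )
    if method == "median":
        return np.median(windows, axis=0)
    if method == "mean":
        return windows.mean(axis=0)
    raise ValueError(f"unknown method: {method}")


noisy_frame = np.full((5, 5), 10.0)
noisy_frame[2, 2] = 255.0  # a single impulse-like noisy pixel
median_smoothed = smooth_frame(noisy_frame, "median")
mean_smoothed = smooth_frame(noisy_frame, "mean")
```

The median filter removes the isolated outlier entirely, while the mean filter only spreads it over the neighbourhood.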
  • the noise level of the clear static video is significantly lower than that of the original static video. Therefore, the clear static video can also be approximated as a video without noise, so that it can be used as a reference video without noise during model training.
  • the process in which the terminal uses a preset smoothing algorithm to smooth the original static video and obtain a clear static video specifically includes the following steps: determining the frame difference between adjacent original static video frames in the original static video, determining the area where the frame difference reaches a frame difference threshold as the noise area in the corresponding original static video frame, and smoothing the noise area in each original static video frame to obtain a clear static video.
  • the acquisition device may not be absolutely stable during video acquisition. There may be some very small jitters, and air flow in the environment may cause slight movement of the static object, so the original static video is not absolutely static, but relatively static.
  • the frame difference between adjacent video frames should be 0.
  • FIG. 3 shows three adjacent noisy video frames
  • (a) in Figure 3 is a schematic diagram of the frame difference between two adjacent noisy video frames, and after smoothing the three noisy video frames, a clear video frame as shown in (c) in Figure 3 is obtained
  • (d) in Figure 3 is a schematic diagram of the frame difference between two adjacent clear video frames.
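The frame-difference step can be sketched as follows. Marking a pixel as noise in both frames of an adjacent pair, and replacing marked pixels with the temporal median over the sequence, are assumptions made for this illustration; the application states only that the noise area is smoothed.

```python
import numpy as np


def smooth_noise_regions(frames, diff_threshold):
    """Smooth pixels whose adjacent-frame difference reaches a threshold.

    frames: (T, H, W) grayscale static video.
    Returns the smoothed frames and the boolean noise-area mask.
    """
    frames = np.asarray(frames, dtype=np.float64)
    # Absolute frame difference between each pair of adjacent frames: (T-1, H, W).
    diffs = np.abs(np.diff(frames, axis=0))
    reached = diffs >= diff_threshold
    # A pixel is treated as a noise area in both frames of the pair (assumption).
    mask = np.zeros(frames.shape, dtype=bool)
    mask[:-1] |= reached
    mask[1:] |= reached
    # Assumed smoothing: replace noise-area pixels with the temporal median.
    temporal_median = np.broadcast_to(np.median(frames, axis=0), frames.shape)
    smoothed = frames.copy()
    smoothed[mask] = temporal_median[mask]
    return smoothed, mask


static_video = np.full((3, 2, 2), 100.0)
static_video[1, 0, 0] = 130.0  # one noisy pixel in the middle frame
clear_video, noise_mask = smooth_noise_regions(static_video, diff_threshold=10.0)
```

After smoothing, the frame difference between adjacent frames of `clear_video` is zero in this example.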
  • the terminal acquires the original static video with real noise by collecting the video of the static object.
  • noise is added to the original static video to obtain the static video.
  • the static video carries both the added noise and the real noise.
  • the original static video is smoothed to obtain a clear static video. Therefore, the static video containing the real noise can be used as the sample video.
  • the clear static video is used as the reference video to train the video denoising model, which can better simulate the noise situation in the real scene and improve the denoising effect of the target video denoising model.
  • the process in which the terminal performs noise addition processing on the original static video to obtain the static video specifically includes the following steps: obtaining partial pixels from each noisy video frame of the original static video; generating corresponding first pixel images according to the partial pixels of each noisy video frame; generating first initial noise images corresponding to each noisy video frame; fusing the first initial noise images with the first pixel images respectively to obtain first noise images corresponding to each noisy video frame; and fusing each first noise image into the corresponding noisy video frame to obtain a static video.
  • the partial pixels are a subset of the pixels in the noisy video frame, which can be randomly selected from the noisy video frame.
  • the first pixel image is used to describe the distribution of the partial pixels. Specifically, the grayscale value at each position corresponding to the partial pixels in the first pixel image is 1, which means that noise is added at that position. The grayscale value at every other position is 0, which means that no noise is added at that position.
  • after obtaining the original static video, the terminal obtains each noisy video frame from it. For any noisy video frame, the terminal randomly selects partial pixels from the frame and generates a first pixel image of the same size as the frame, in which the grayscale value at positions corresponding to the partial pixels can be 1 and the grayscale value at all other positions can be 0. The terminal then uses a preset noise generation algorithm to generate a first initial noise image, multiplies the first pixel image element-wise by the first initial noise image to obtain a first noise image, and fuses the first noise image into the noisy video frame to obtain the corresponding noisy static video frame. It can be understood that performing the above noise addition processing on each noisy video frame in the original static video yields a noisy static video.
  • the preset noise generation algorithm may be a random distribution algorithm, such as a Gaussian distribution algorithm.
  • the Gaussian distribution algorithm is used to process the corresponding noisy video frame to obtain a first initial noise image.
  • the terminal fuses the first noise image into the corresponding noisy video frame, and specifically can use pixel-by-pixel weighted averaging to achieve image fusion, which specifically includes the following steps: obtaining a first weight corresponding to the first noise image and a second weight corresponding to the noisy video frame, determining a weighted pixel value corresponding to each target pixel based on the first weight and the pixel value of each pixel in the first noise image, and the second weight and the pixel value of each pixel in the noisy video frame, and generating a noisy static video frame based on the weighted pixel value of each target pixel.
  • the target pixel refers to a pixel in the noisy static video frame.
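  • The noise addition steps above (random pixel selection, a 0/1 pixel image, a Gaussian initial noise image, dot-multiplication, and weighted pixel-by-pixel fusion) can be sketched as follows. This is a minimal illustration, not the implementation of the embodiment: the function name `add_masked_noise` and the parameters `select_ratio`, `sigma`, `noise_weight` and `frame_weight` are hypothetical, and the default values are arbitrary assumptions.

```python
import numpy as np

def add_masked_noise(frame, select_ratio=0.3, sigma=25.0,
                     noise_weight=0.5, frame_weight=1.0, rng=None):
    """Add Gaussian noise only at randomly selected pixel positions.

    frame: array of shape (H, W) or (H, W, C) with values in [0, 255].
    Returns (noisy_frame, pixel_image). All defaults are illustrative.
    """
    rng = np.random.default_rng() if rng is None else rng
    frame = np.asarray(frame, dtype=np.float64)
    h, w = frame.shape[:2]
    # Pixel image: 1 where noise is added, 0 where no noise is added.
    pixel_image = (rng.random((h, w)) < select_ratio).astype(np.float64)
    # Initial noise image drawn from a Gaussian distribution.
    initial_noise = rng.normal(0.0, sigma, size=frame.shape)
    # Element-wise (dot) multiplication keeps noise only at selected pixels.
    mask = pixel_image if frame.ndim == 2 else pixel_image[..., None]
    noise_image = mask * initial_noise
    # Pixel-by-pixel weighted fusion of the noise image and the frame.
    noisy = noise_weight * noise_image + frame_weight * frame
    return np.clip(noisy, 0.0, 255.0), pixel_image
```

  • With `frame_weight=1.0`, positions whose pixel image value is 0 keep their original pixel value, so changing `select_ratio` or `noise_weight` yields different noisy frames from the same frame and the same noise distribution.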
  • The first row in FIG. 4 shows a traditional noise adding method: a noise image is randomly generated and fused directly onto the image to be noised (the clean image) to obtain the corresponding noisy image. As the noisy image shows, the noise is added uniformly across the clean image. However, as shown in FIG. 5, in a real image the noise (represented by the dots in the figure) is not evenly distributed across pixel positions.
  • The noise adding method used in the embodiment of the present application is shown in the second row or the third row in FIG. 4. It first randomly selects some pixels from the image to be noised (the clean image) and generates a pixel image based on the selected pixels; the pixel image is then fused with the corresponding noise image to obtain the noisy image. The pixel image is a matrix of only 0s and 1s with the same length and width as the image to be noised, where 0 indicates that no noise is added at that pixel position and 1 indicates that noise is added at that pixel position.
  • The images to be noised (clean images) in the second and third rows of FIG. 4 are the same, and the randomly generated noise images are also the same, but the generated pixel images differ and the noise adding coefficients used also differ, so the resulting noisy images differ. The noise adding coefficient can be determined from the weight corresponding to the noise image and the weight corresponding to the clean image.
  • the terminal obtains partial pixels from each noisy video frame of the original static video; generates corresponding first pixel images according to the partial pixels of each noisy video frame; generates first initial noise images corresponding to each noisy video frame; fuses the first initial noise images with the first pixel images respectively to obtain first noise images corresponding to each noisy video frame; and fuses each first noise image into the corresponding noisy video frame to obtain a static video, so that the obtained static video can more accurately simulate the distribution of noise in the actual image, and also increase the diversity of noise.
  • the static video is used to train the video denoising model, which can further improve the denoising effect of the video denoising model.
  • the processing method of the video denoising model further includes the following steps: obtaining an unnoised dynamic video from a video database; and performing noise addition processing on the unnoised dynamic video to obtain a noisy dynamic video.
  • dynamic videos contain moving and changing content, such as people walking, vehicles driving, etc. Such videos can show the movement and changes of dynamic objects from multiple angles.
  • the video database can be a public video data set, for example a clear video data set such as REDS or DAVIS.
  • the video database can also be a clear video library obtained by denoising self-captured videos. It should be noted that "clear" in the embodiment of the present application can be treated as approximately noise-free, that is, a clear video refers to a noise-free video.
  • the terminal can directly obtain a clear dynamic video from the video database, that is, a dynamic video without noise, and use a preset noise adding algorithm to perform noise adding processing on the obtained dynamic video to obtain a noisy dynamic video.
  • the terminal obtains unnoised dynamic video from the video database, performs noise processing on the unnoised dynamic video, and obtains noisy dynamic video, so that the noisy dynamic video can be used as a sample video, and the unnoised dynamic video can be used as a reference video to train the video denoising model, which can better simulate the noise situation in the real scene, thereby improving the denoising effect of the target video denoising model.
  • the video frames in the unnoised dynamic video are clear video frames
  • the terminal performs noise processing on the unnoised dynamic video to obtain a noisy dynamic video.
  • the process includes the following steps: selecting part of the pixels from each clear video frame; generating corresponding second pixel images according to the part of the pixels of each clear video frame; generating second initial noise images corresponding to each clear video frame; fusing each second initial noise image with the corresponding second pixel image to obtain a second noise image corresponding to each clear video frame; fusing each second noise image into the corresponding clear video frame to obtain a noisy dynamic video.
  • partial pixels refer to partial pixels in a clear video frame, which can be randomly selected from a clear video frame.
  • the second pixel image is used to describe the distribution of partial pixels.
  • the grayscale value at the position corresponding to the partial pixels in the second pixel image is 1, and 1 means that noise is added at the position corresponding to this pixel.
  • the grayscale value at other positions other than partial pixels is 0, and 0 means that no noise is added at the position corresponding to this pixel.
  • After obtaining the unnoised dynamic video, the terminal obtains each clear video frame from it. For any clear video frame, the terminal randomly selects some pixels from the frame and generates a second pixel image of the same size as the clear video frame based on the selected pixels, where the grayscale value at the positions corresponding to the selected pixels can be 1 and the grayscale value at all other positions can be 0. A preset noise generation algorithm is then used to generate a second initial noise image, the second pixel image is dot-multiplied by the second initial noise image to obtain a second noise image, and the second noise image is fused into the clear video frame to obtain a noisy dynamic video frame.
  • the above noise addition processing is performed on each clear video frame in the unnoised dynamic video to obtain a noisy dynamic video.
  • the preset noise generation algorithm may be a random distribution algorithm, such as a Gaussian distribution algorithm, etc.
  • a Gaussian distribution algorithm is used to process the corresponding clear video frame to obtain a second initial noise image.
  • the terminal fuses the second noise image into the corresponding clear video frame, and specifically can use pixel-by-pixel weighted averaging to achieve image fusion, which specifically includes the following steps: obtaining the third weight corresponding to the second noise image and the fourth weight corresponding to the clear video frame, determining the weighted pixel value corresponding to each target pixel based on the third weight and the pixel value of each pixel in the second noise image, and the fourth weight and the pixel value of each pixel in the clear video frame, and generating a noisy dynamic video frame based on the weighted pixel value of each target pixel.
  • the target pixel refers to the pixel in the noisy dynamic video frame.
  • the terminal selects part of pixels from each clear video frame; generates corresponding second pixel images according to part of pixels of each clear video frame; generates second initial noise images corresponding to each clear video frame; fuses each second initial noise image with the corresponding second pixel image to obtain a second noise image corresponding to each clear video frame; fuses each second noise image into the corresponding clear video frame to obtain a noisy dynamic video, so that the obtained noisy dynamic video can more accurately simulate the distribution of noise in the actual image, and also increase the diversity of noise.
  • the noisy dynamic video is used to train the video denoising model, which can further improve the denoising effect of the video denoising model.
  • the second branch includes an optical flow network, a target frame sub-branch and other frame sub-branches.
  • the terminal extracts features from the downsampled video frame sequence through the second branch of the video denoising model, and the process of obtaining the image fusion features specifically includes the following steps:
  • the optical flow network is a neural network model used to estimate optical flow information, which can be specifically the optical flow network SpyNet; optical flow information refers to the information about pixel position changes between adjacent video frames. It can be understood that in the video, there may be movement of objects or cameras between adjacent video frames, and these movements cause the pixel positions between adjacent frames to be different, and the optical flow information is the information used to describe the pixel position changes between adjacent frames.
  • the optical flow information in the embodiment of the present application may include the optical flow information between the downsampled target video frame and the corresponding adjacent downsampled video frame in the downsampled video frame sequence, and may also include the optical flow information between any two adjacent downsampled video frames in the downsampled video frame sequence.
  • the optical flow information may also be referred to as an optical flow vector, which may represent the pixel displacement between adjacent video frames and may be used for subsequent frame alignment and feature fusion.
  • a downsampled video frame sequence refers to a video frame sequence obtained after downsampling each video frame in a video frame sequence, and may specifically include a downsampled target video frame and a downsampled continuous video frame, wherein the downsampled continuous video frame includes at least one of a downsampled preceding video frame and a downsampled succeeding video frame.
  • the downsampled video frame sequence includes 5 downsampled video frames.
  • the downsampled target video frame is the 3rd frame in the downsampled video frame sequence
  • the other downsampled video frames except the 3rd frame in the downsampled video frame sequence are downsampled continuous video frames, wherein the 1st frame and the 2nd frame are the downsampled preceding video frames, and the 4th frame and the 5th frame are the downsampled succeeding video frames
  • the downsampled target video frame is the 1st frame in the downsampled video frame sequence
  • the 2nd frame to the 5th frame in the downsampled video frame sequence are the downsampled succeeding video frames of the downsampled target video frame
  • the downsampled target video frame is the 5th frame in the downsampled video frame sequence
  • the 1st frame to the 4th frame in the downsampled video frame sequence are the downsampled preceding video frames of the downsampled target video frame.
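  • The three cases above (target frame in the middle, at the start, or at the end of the sequence) can be illustrated with a small helper that splits a downsampled frame sequence around the target frame. This is a sketch only; the name `split_sequence` is hypothetical, and indices are 0-based, so the "3rd frame" corresponds to index 2.

```python
def split_sequence(frames, target_index):
    """Split a frame sequence into (preceding, target, succeeding).

    frames: list of frames (any objects); target_index: 0-based index.
    Either the preceding or the succeeding list may be empty.
    """
    if not 0 <= target_index < len(frames):
        raise IndexError("target_index is outside the frame sequence")
    preceding = frames[:target_index]        # frames before the target
    target = frames[target_index]            # the downsampled target frame
    succeeding = frames[target_index + 1:]   # frames after the target
    return preceding, target, succeeding
```

  • For a 5-frame sequence, `target_index=2` gives two preceding and two succeeding frames, while `target_index=0` and `target_index=4` give only succeeding or only preceding frames respectively.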
  • After obtaining the downsampled video frame sequence, the terminal inputs each downsampled video frame in the downsampled video frame sequence into the optical flow network, and determines the optical flow information between any two adjacent downsampled video frames in the downsampled video frame sequence through the optical flow network, thereby obtaining the optical flow information between the downsampled target video frame and the corresponding adjacent downsampled video frames.
  • the adjacent downsampled video frames include at least one of a downsampled preceding video frame and a downsampled succeeding video frame;
  • the optical flow information includes at least one of first optical flow information and second optical flow information, and when the downsampled continuous video frames include the downsampled preceding video frame, the terminal determines the first optical flow information between the adjacent first downsampled video frames through the optical flow network; when the downsampled continuous video frames include the downsampled succeeding video frame, the terminal determines the second optical flow information between the adjacent second downsampled video frames through the optical flow network;
  • the first optical flow information is information between adjacent first downsampled video frames
  • the second optical flow information is information between second downsampled video frames
  • the first downsampled video frame is a downsampled video frame between a downsampled target video frame and a downsampled preceding video frame
  • the second downsampled video frame is a downsampled video frame between a downsampled target video frame and a downsampled subsequent video frame.
  • For example, the downsampled video frame sequence includes 5 downsampled video frames, and the downsampled target video frame is the 3rd frame in the downsampled video frame sequence.
  • the 1st frame and the 2nd frame are the downsampled preceding video frames
  • the 4th frame and the 5th frame are the downsampled succeeding video frames
  • the first downsampled video frames are the 1st frame, the 2nd frame and the 3rd frame in the downsampled video frame sequence
  • the first optical flow information includes the optical flow information from the 1st frame to the 2nd frame, and the optical flow information from the 2nd frame to the 3rd frame
  • the second downsampled video frames are the 3rd frame, the 4th frame and the 5th frame in the downsampled video frame sequence
  • the second optical flow information includes the optical flow information from the 5th frame to the 4th frame, and the optical flow information from the 4th frame to the 3rd frame.
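  • One way to enumerate which adjacent frame pairs are passed to the optical flow network is sketched below. It assumes that each pair points toward the target frame, which matches the 5th-to-4th and 4th-to-3rd ordering given for the second optical flow information. The helper name `adjacent_flow_pairs` is hypothetical, and indices are 0-based.

```python
def adjacent_flow_pairs(num_frames, target_index):
    """Return (first_pairs, second_pairs) of (source, destination) indices.

    first_pairs covers the preceding frames up to the target frame;
    second_pairs covers the succeeding frames back to the target frame.
    """
    if not 0 <= target_index < num_frames:
        raise IndexError("target_index is outside the frame sequence")
    # Preceding side: 0 -> 1, 1 -> 2, ..., (target - 1) -> target.
    first_pairs = [(i, i + 1) for i in range(target_index)]
    # Succeeding side: last -> last - 1, ..., (target + 1) -> target.
    second_pairs = [(i, i - 1) for i in range(num_frames - 1, target_index, -1)]
    return first_pairs, second_pairs
```

  • When the target frame is the first or last frame of the sequence, one of the two lists is empty, which corresponds to the case where only the second or only the first optical flow information is determined.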
  • the terminal when the downsampled continuous video frames include downsampled preceding video frames, the terminal inputs each downsampled preceding video frame and the downsampled target video frame in the downsampled video frame sequence into the optical flow network, and determines the optical flow information between any two adjacent downsampled video frames in the downsampled preceding video frame and the downsampled target video frame through the optical flow network, that is, determines the optical flow information between adjacent first downsampled video frames, and determines the optical flow information as the first optical flow information between the downsampled target video frame and the corresponding adjacent downsampled video frame; when the downsampled continuous video frames include downsampled subsequent video frames, the terminal inputs each downsampled subsequent video frame and the downsampled target video frame in the downsampled video frame sequence into the optical flow network, and determines the optical flow information between any two adjacent downsampled video frames in the downsampled subsequent video frame and the downsampled target video frame through the optical flow network, that is, determines the
  • other frame sub-branches are used to perform feature extraction on downsampled video frames other than the downsampled target video frames in the downsampled video frame sequence to obtain continuous video frame features corresponding to the downsampled target video frames.
  • the other frame sub-branches include at least one of the preceding frame sub-branch and the succeeding frame sub-branch.
  • the preceding frame sub-branch is used to perform feature extraction on the downsampled preceding video frame to obtain the preceding video frame features
  • the succeeding frame sub-branch is used to perform feature extraction on the downsampled succeeding video frame to obtain the succeeding video frame features.
  • the terminal inputs the downsampled continuous video frames in the downsampled video frame sequence into the other frame sub-branches, performs feature extraction on the input downsampled continuous video frames through the other frame sub-branches, and obtains continuous video frame features corresponding to the downsampled target video frame.
  • adjacent downsampled video frames include at least one of a downsampled preceding video frame and a downsampled succeeding video frame; continuous video frame features include at least one of a preceding video frame feature and a succeeding video frame feature; when the downsampled continuous video frames include the downsampled preceding video frames, the terminal extracts features of the downsampled preceding video frames through a forward network layer of a preceding frame sub-branch to obtain the preceding video frame features; when the downsampled continuous video frames include the downsampled succeeding video frames, the terminal extracts features of the downsampled succeeding video frames through a backward network layer of a succeeding frame sub-branch to obtain the succeeding video frame features.
  • the forward network layer refers to the forward U-type network
  • the backward network layer refers to the backward U-type network.
  • the forward U-type network is a U-type network used to extract features from the downsampled preceding video frames
  • the backward U-type network is a U-type network used to extract features from the downsampled subsequent video frames.
  • the U-type network is a convolutional neural network structure used for image processing tasks, which consists of a downsampling module and an upsampling module, and usually there are some convolutional layers and pooling layers in the middle.
  • when the downsampled continuous video frames include downsampled preceding video frames, the terminal inputs each downsampled preceding video frame in the downsampled video frame sequence into the preceding frame sub-branch, and extracts features of each downsampled preceding video frame through the forward network layer of the preceding frame sub-branch to obtain the features of the preceding video frame; when the downsampled continuous video frames include downsampled subsequent video frames, the terminal inputs each downsampled subsequent video frame in the downsampled video frame sequence into the subsequent frame sub-branch, and extracts features of each downsampled subsequent video frame through the backward network layer of the subsequent frame sub-branch to obtain the features of the subsequent video frame.
  • in the video frame sequence, there is usually a correlation between the preceding and succeeding frames.
  • the downsampled video frame sequence includes 5 downsampled video frames
  • the downsampled target video frame is the 3rd frame in the downsampled video frame sequence
  • the preceding frame sub-branch 1 is used to perform feature extraction on the 1st downsampled video frame in the downsampled video frame sequence
  • the preceding frame sub-branch 2 is used to perform feature extraction on the 2nd downsampled video frame in the downsampled video frame sequence
  • the subsequent frame sub-branch 3 is used to perform feature extraction on the 4th downsampled video frame in the downsampled video frame sequence
  • the subsequent frame sub-branch 4 is used to perform feature extraction on the 5th downsampled video frame in the downsampled video frame sequence.
  • S606 Align the continuous video frame features with the downsampled target video frame based on the optical flow information to obtain aligned video frame features.
  • alignment refers to matching the features of continuous video frames with the content of the downsampled target video frames. It can be understood that in a video frame sequence, there is a certain motion relationship between adjacent video frames. Through optical flow information, the downsampled target video frames can be aligned with the corresponding continuous video frame features. In this way, in subsequent processing, they can be regarded as video frames and video frame features at the same moment, thereby improving the accuracy of the model.
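  • Flow-based alignment of this kind is commonly implemented as a backward warp with bilinear interpolation. The sketch below illustrates that general technique with NumPy; it is not taken from the embodiment. The function name `warp_features` and the flow convention (channel 0 is the horizontal displacement, channel 1 the vertical displacement, both measured from a target position to the position sampled in the neighboring frame's features) are assumptions.

```python
import numpy as np

def warp_features(features, flow):
    """Backward-warp features of shape (C, H, W) with flow of shape (2, H, W).

    output[:, y, x] is sampled from features at (y + flow[1], x + flow[0])
    using bilinear interpolation; sample positions are clamped to the border.
    """
    _, h, w = features.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    src_x = np.clip(xs + flow[0], 0, w - 1)
    src_y = np.clip(ys + flow[1], 0, h - 1)
    # Integer corners surrounding each sampling position.
    x0 = np.floor(src_x).astype(int)
    y0 = np.floor(src_y).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)
    # Fractional offsets act as interpolation weights.
    wx = src_x - x0
    wy = src_y - y0
    top = features[:, y0, x0] * (1 - wx) + features[:, y0, x1] * wx
    bottom = features[:, y1, x0] * (1 - wx) + features[:, y1, x1] * wx
    return top * (1 - wy) + bottom * wy
```

  • With a zero flow field the features are returned unchanged; a constant horizontal flow of 1 shifts the content by one column, with the border column repeated because of the clamping.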
  • adjacent downsampled video frames include at least one of a downsampled preceding video frame and a downsampled succeeding video frame;
  • the optical flow information includes at least one of first optical flow information and second optical flow information;
  • the continuous video frame features include at least one of the preceding video frame features and the succeeding video frame features;
  • the aligned video frame features include at least one of the preceding aligned video frame features and the succeeding aligned video frame features;
  • the terminal extracts a feature vector of a preset position from the features of the preceding video frames, determines the target position corresponding to the preset position in the downsampled target video frame based on the first optical flow information and the extracted feature vector, and aligns the features of the preceding video frame with the features of the downsampled target video frame based on the feature vector of the preset position and the corresponding target position in the downsampled target video frame using an interpolation method to obtain the preceding aligned video frame features;
  • the terminal extracts a feature vector of a preset position from the features of the subsequent video frames, determines the target position corresponding to the preset position in the downsampled target video frame based on the second optical flow information and the extracted feature vector, and aligns the features of the subsequent video frame with the features of the downsampled target video frame based on the feature
  • the target sub-branch is used to perform feature processing on the downsampled target video frame in the downsampled video frame sequence to obtain image fusion features corresponding to the downsampled target video frame.
  • the terminal inputs the aligned video frame features into the target sub-branch, performs feature processing on the aligned video frame features through the target sub-branch, and obtains image fusion features.
  • when the downsampled continuous video frames include the downsampled preceding video frames, the terminal processes the preceding aligned video frame features through the forward network layer of the target sub-branch to obtain the preceding image fusion features; when the downsampled continuous video frames include the downsampled subsequent video frames, the terminal processes the subsequent aligned video frame features through the backward network layer of the target sub-branch to obtain the subsequent image fusion features; and the image fusion features are determined based on at least one of the preceding image fusion features and the subsequent image fusion features.
  • the forward network layer refers to the forward U-type network
  • the backward network layer refers to the backward U-type network
  • the forward U-type network of the target sub-branch is a U-type network for feature processing of the preceding aligned video frame features
  • the backward U-type network of the target sub-branch is a U-type network for feature processing of the subsequent aligned video frame features.
  • the U-type network is a convolutional neural network structure for image processing tasks, which consists of a downsampling module and an upsampling module, and usually there are some convolutional layers and pooling layers in the middle.
  • when the downsampled continuous video frames include the downsampled preceding video frames, the terminal inputs the features of the preceding aligned video frames into the forward network layer of the target sub-branch, and performs feature processing on them through the forward network layer of the target sub-branch to obtain the preceding image fusion features; when the downsampled continuous video frames include the downsampled subsequent video frames, the terminal inputs the features of the subsequent aligned video frames into the backward network layer of the target sub-branch, and performs feature processing on them through the backward network layer of the target sub-branch to obtain the subsequent image fusion features; when the downsampled continuous video frames only include the downsampled preceding video frames, the preceding image fusion features are directly determined as the image fusion features; when the downsampled continuous video frames only include the downsampled subsequent video frames, the subsequent image fusion features are directly determined as the image fusion features; when the downsampled continuous video
  • the terminal determines the optical flow information between the downsampled target video frame and the corresponding adjacent downsampled video frame in the downsampled video frame sequence through the optical flow network of the second branch, and performs feature processing on the downsampled video frame sequence through other frame sub-branches of the second branch to obtain continuous video frame features corresponding to the downsampled target video frame, so that the continuous frame information and optical flow information in the video sequence can be used to better understand the movement and changes in the video, thereby obtaining an accurate video feature representation.
  • the process of the terminal determining the image fusion feature based on the previous image fusion feature and the subsequent image fusion feature specifically includes the following steps: splicing the previous image fusion feature and the subsequent image fusion feature to obtain the spliced image feature, and performing convolution processing on the spliced image feature to obtain the image fusion feature.
  • the terminal splices the fusion features of the preceding image and the succeeding image to obtain the spliced image features, and inputs the spliced image features into the convolution layer of the target sub-branch.
  • the spliced image features are convolved by the convolution layer to obtain higher-level feature information, which is the image fusion feature.
  • the terminal can effectively fuse the information of the previous and subsequent video frames by splicing the fusion features of the previous image and the fusion features of the subsequent image, and make full use of the correlation between the consecutive frames in the previous and subsequent video frames, so as to obtain accurate video feature representation.
  • convolution processing is performed on the spliced image features to further extract and enhance the features, so as to obtain more accurate image fusion features, and then based on the image fusion features, subsequent image reconstruction can be made more accurate, thereby improving the denoising effect of the target video denoising model.
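  • The splice-then-convolve step can be sketched as a channel-wise concatenation followed by a convolution. The kernel size is not stated here, so the example below assumes a 1x1 convolution purely for brevity; the function name `fuse_by_concat_and_conv` and the `weight` and `bias` parameters are hypothetical, and in a trained model the weights would be learned.

```python
import numpy as np

def fuse_by_concat_and_conv(preceding_feat, succeeding_feat, weight, bias=None):
    """Concatenate two (C, H, W) feature maps and apply a 1x1 convolution.

    weight: array of shape (C_out, 2 * C); bias: optional array of shape (C_out,).
    Returns the fused feature map of shape (C_out, H, W).
    """
    # Splice along the channel axis: (2C, H, W).
    spliced = np.concatenate([preceding_feat, succeeding_feat], axis=0)
    # A 1x1 convolution is a per-pixel linear map over channels.
    fused = np.einsum("oc,chw->ohw", weight, spliced)
    if bias is not None:
        fused = fused + bias[:, None, None]
    return fused
```

  • Concatenation keeps the preceding and succeeding information in separate channels, and the convolution then decides how to mix them, rather than fixing the mix in advance as a plain average would.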
  • the process of generating a predicted video frame based on the image fusion feature and the image detail feature by the terminal specifically includes the following steps: fusing the image fusion feature with the image detail feature to obtain a global image feature; reconstructing the image based on the global image feature to obtain a predicted video frame.
  • after obtaining the image fusion feature and the image detail feature, the terminal obtains a first fusion coefficient corresponding to the image fusion feature and a second fusion coefficient corresponding to the image detail feature, fuses the image fusion feature and the image detail feature based on the first fusion coefficient and the second fusion coefficient to obtain a global image feature, and performs a deconvolution operation on the global image feature to obtain a predicted video frame of the same size as the target video frame.
  • the deconvolution operation is used to gradually enlarge the global image features to the original size to obtain a predicted video frame of the same size as the target video frame.
  • the terminal obtains the global image feature by fusing the image fusion feature with the image detail feature, and can comprehensively utilize the information of the image fusion feature and the image detail feature to more comprehensively describe the image content of the target video frame, thereby reconstructing the image based on the global image feature to obtain the predicted video frame, and can also have a better denoising effect, thereby improving the denoising effect of the target video denoising model.
  • the terminal fuses the image fusion feature with the image detail feature to obtain the global image feature, and the process specifically includes the following steps: upsampling the image fusion feature to obtain the upsampled image fusion feature; fusing the upsampled image fusion feature with the image detail feature to obtain the global image feature.
  • after obtaining the image fusion feature, the terminal performs a deconvolution operation on the image fusion feature to obtain an upsampled image fusion feature, obtains a first fusion coefficient corresponding to the upsampled image fusion feature and a second fusion coefficient corresponding to the image detail feature, and fuses the upsampled image fusion feature with the image detail feature based on the first fusion coefficient and the second fusion coefficient to obtain a global image feature.
  • the upsampled image fusion feature and the image detail feature may be weightedly fused based on the first fusion coefficient and the second fusion coefficient.
  • the terminal upsamples the image fusion features to obtain upsampled image fusion features with the same resolution as the target video frame, and fuses the upsampled image fusion features with the image detail features to obtain global image features.
  • the respective advantages of the two features can be fully utilized to further improve the expression ability of the global image features, thereby improving the denoising effect of the target video denoising model.
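  • The upsample-then-fuse step can be illustrated as follows. Note the substitution: the embodiment upsamples with a deconvolution operation, which has learned weights, whereas this sketch uses nearest-neighbor upsampling so that it stays self-contained. The names `upsample_nearest` and `fuse_global` and the default coefficients are assumptions.

```python
import numpy as np

def upsample_nearest(features, scale):
    """Nearest-neighbor upsampling of a (C, H, W) feature map by an integer scale."""
    return features.repeat(scale, axis=1).repeat(scale, axis=2)

def fuse_global(fusion_feat, detail_feat, first_coef=0.5, second_coef=0.5, scale=2):
    """Weighted fusion of the upsampled fusion feature and the detail feature.

    fusion_feat: (C, H, W) low-resolution image fusion feature.
    detail_feat: (C, H * scale, W * scale) image detail feature.
    """
    upsampled = upsample_nearest(fusion_feat, scale)
    if upsampled.shape != detail_feat.shape:
        raise ValueError("upsampled fusion feature and detail feature differ in shape")
    # Weighted fusion with the first and second fusion coefficients.
    return first_coef * upsampled + second_coef * detail_feat
```

  • The shape check reflects the requirement that the upsampled image fusion feature has the same resolution as the image detail feature before the two are fused.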
  • the terminal may also use the target video denoising model to perform denoising processing on a video to be denoised.
  • the process specifically includes the following steps:
  • S702 Determine a current video frame to be denoised in a sequence of video frames to be denoised of the video to be denoised.
  • the terminal obtains the video to be denoised, extracts a sequence of video frames to be denoised from the video to be denoised, and determines the current video frame to be denoised from the sequence of video frames to be denoised.
  • for example, the sequence of video frames to be denoised extracted by the terminal from the video to be denoised contains 10 video frames, and the current video frame to be denoised is the second frame in that sequence.
  • the target video denoising model refers to a trained video denoising model obtained by training the video denoising model, and the first branch of the target video denoising model may specifically be a high-resolution branch, which is used to process the current video frame to be denoised at the original resolution.
  • the terminal inputs the current video frame to be denoised into the first branch of the target video denoising model, processes the current video frame to be denoised through each network layer of the first branch, and obtains the image detail features to be denoised of the video frame to be denoised.
  • the downsampled video frame sequence to be denoised refers to the video frame sequence obtained by downsampling the video frame sequence to be denoised.
  • downsampling refers to reducing the resolution of the image, thereby reducing the size of the image and reducing the detail information in the image. It is usually used to reduce the amount of calculation and memory usage, while accelerating the prediction process of the model.
  • the second branch of the target video denoising model can specifically be a low-resolution branch, which is used to process the downsampled video frame sequence to be denoised. It can be understood that the resolution of each downsampled video frame to be denoised in the downsampled video frame sequence to be denoised is low resolution, and the size of each downsampled video frame to be denoised in the low-resolution downsampled video frame sequence to be denoised is reduced or the detail information is reduced.
  • the amount of calculation can be effectively reduced, the operating efficiency of the model can be improved, and the generalization ability of the model can be enhanced, making it more suitable for processing videos of different resolutions.
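As an illustration of the downsampling described above, the following sketch reduces the resolution of a single-channel frame by average pooling. Average pooling is an assumption made for illustration; the source does not specify the downsampling operator, and `downsample_avg` is a hypothetical name.

```python
def downsample_avg(frame, factor=2):
    """Downsample a 2D frame (list of rows) by averaging factor x factor blocks."""
    h, w = len(frame), len(frame[0])
    assert h % factor == 0 and w % factor == 0, "frame size must be divisible by factor"
    out = []
    for y in range(0, h, factor):
        row = []
        for x in range(0, w, factor):
            block = [frame[y + dy][x + dx]
                     for dy in range(factor) for dx in range(factor)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out


frame = [[1, 3, 5, 7],
         [1, 3, 5, 7],
         [2, 2, 4, 4],
         [2, 2, 4, 4]]
small = downsample_avg(frame)  # 4x4 frame becomes 2x2
```

With a factor of 2 the number of pixels drops to a quarter, which is the source of the reduced computation on the low-resolution branch.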
  • the fused features of the image to be denoised refer to the feature representation obtained by fusing the features of at least two downsampled video frames to be denoised in the downsampled video frame sequence to be denoised. It can be understood that for video data with noise, it is often difficult to obtain a good denoising effect by using only one frame of image for denoising, because a single frame image may have too much noise and distortion and cannot provide sufficient information. By fusing the features of multiple downsampled video frames to be denoised, the expressiveness of the features can be improved, thereby improving the denoising effect of the target video denoising model.
  • in addition, the feature representation extracted from each downsampled video frame to be denoised in the downsampled video frame sequence to be denoised may lose information, and fusing the features of multiple downsampled video frames to be denoised compensates for this loss.
  • the terminal downsamples each of the video frames to be denoised in the video frame sequence to obtain a downsampled video frame sequence to be denoised, and inputs the downsampled video frame sequence to be denoised into the second branch of the target video denoising model, and processes each of the downsampled video frames to be denoised in the downsampled video frame sequence to be denoised through each sub-branch of the second branch to obtain the image fusion features to be denoised.
  • the second branch includes an optical flow network, a target frame sub-branch and other frame sub-branches
  • S706 specifically includes the following steps: determining the optical flow information between the current downsampled video frame to be denoised and the corresponding adjacent downsampled video frame to be denoised in the downsampled video frame sequence to be denoised through the optical flow network; performing feature extraction on the downsampled video frame sequence to be denoised through other frame sub-branches to obtain features of the continuous video frames to be denoised corresponding to the current downsampled video frame to be denoised; aligning the features of the continuous video frames to be denoised with the current downsampled video frame to be denoised based on the optical flow information to obtain features of the aligned video frames to be denoised; processing the features of the aligned video frames to be denoised through the target sub-branch to obtain features of the image fusion to be denoised.
  • the downsampled video frame sequence to be denoised includes the current downsampled video frame to be denoised and the downsampled continuous video frame to be denoised
  • the downsampled continuous video frame to be denoised includes at least one of the downsampled preceding video frame to be denoised and the downsampled subsequent video frame to be denoised
  • the other frame sub-branches include at least one of the preceding frame sub-branches and the subsequent frame sub-branches.
  • the features of the continuous video frames to be denoised include at least one of the features of the preceding video frames to be denoised and the features of the subsequent video frames to be denoised
  • the features of the aligned video frames to be denoised include at least one of the features of the preceding aligned video frames to be denoised and the features of the subsequent aligned video frames to be denoised
  • the terminal determines, through the optical flow network, the optical flow information between the current downsampled video frame to be denoised and the corresponding adjacent downsampled video frames in the downsampled video frame sequence to be denoised, and the process specifically includes the following steps: determining, through the optical flow network, the third optical flow information between the current downsampled video frame to be denoised and the adjacent downsampled video frames in the downsampled preceding video frames to be denoised; determining, through the optical flow network, the fourth optical flow information between the current downsampled video frame to be denoised and the adjacent downsampled video frames in the downsamp
  • the process in which the terminal performs feature extraction on the downsampled video frame sequence to be denoised through other frame sub-branches to obtain the features of the continuous video frames to be denoised corresponding to the current downsampled video frame to be denoised includes the following steps: performing feature extraction on the downsampled preceding video frame to be denoised through the forward network layer of the preceding frame sub-branch to obtain the features of the preceding video frame to be denoised; performing feature extraction on the downsampled subsequent video frame to be denoised through the backward network layer of the subsequent frame sub-branch to obtain the features of the subsequent video frame to be denoised.
  • the process in which the terminal aligns the features of the continuous video frames to be denoised with the current downsampled video frame to be denoised based on the optical flow information to obtain the features of the aligned video frames to be denoised includes the following steps: aligning the features of the preceding video frames to be denoised with the current downsampled video frame to be denoised based on the third optical flow information to obtain the features of the preceding aligned video frames to be denoised; aligning the features of the subsequent video frames to be denoised with the current downsampled video frame to be denoised based on the fourth optical flow information to obtain the features of the subsequent aligned video frames to be denoised.
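Flow-based alignment of this kind is commonly implemented as backward warping. The sketch below is one possible illustration, not the method of the application: it assumes that the flow stores, for each pixel of the current frame, a `(dx, dy)` displacement pointing to the matching position in the neighbouring frame, and it uses nearest-neighbour sampling with border clamping where a real model would typically use bilinear sampling. The name `warp_features` is hypothetical.

```python
def warp_features(feature, flow):
    """Align a neighbouring frame's feature map to the current frame.

    feature: 2D list of feature values from the neighbouring frame.
    flow:    2D list of (dx, dy) displacements, one per current-frame pixel
             (assumed convention: current pixel -> position in neighbour).
    """
    h, w = len(feature), len(feature[0])
    aligned = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            dx, dy = flow[y][x]
            # Nearest-neighbour sampling, clamped to the image border.
            sx = min(max(x + round(dx), 0), w - 1)
            sy = min(max(y + round(dy), 0), h - 1)
            aligned[y][x] = feature[sy][sx]
    return aligned


neighbour = [[0, 1, 2],
             [3, 4, 5],
             [6, 7, 8]]
# Every pixel's content sits one column to the right in the neighbouring frame.
flow = [[(1, 0)] * 3 for _ in range(3)]
aligned = warp_features(neighbour, flow)
```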
  • the process in which the terminal processes the features of the aligned video frames to be denoised through the target sub-branch to obtain the fusion features of the image to be denoised includes the following steps: processing the features of the preceding aligned video frame to be denoised through the forward network layer of the target sub-branch to obtain the fusion features of the preceding image to be denoised; processing the features of the subsequent aligned video frame to be denoised through the backward network layer of the target sub-branch to obtain the fusion features of the subsequent image to be denoised; and determining the fusion features of the image to be denoised based on at least one of the fusion features of the preceding image to be denoised and the fusion features of the subsequent image to be denoised.
  • the process of determining the fusion features of the image to be denoised based on at least one of the fusion features of the preceding image to be denoised and the fusion features of the subsequent image to be denoised by the terminal specifically includes the following steps: when the downsampled continuous video frames to be denoised only include the downsampled preceding video frames to be denoised, directly determining the fusion features of the preceding image to be denoised as the fusion features of the image to be denoised; when the downsampled continuous video frames to be denoised only include the downsampled subsequent video frames to be denoised, directly determining the fusion features of the subsequent image to be denoised as the fusion features of the image to be denoised; when the downsampled continuous video frames to be denoised include the downsampled preceding video frames to be denoised and the downsampled subsequent video frames to be denoised, splicing the fusion features of the preceding image to be denoised and the fusion features of
  • the terminal fuses the fused features of the image to be denoised and the detailed features of the image to be denoised to obtain the global image features to be denoised, and generates a predicted video frame based on the global image features to be denoised.
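The case analysis for choosing the fusion features (only preceding frames, only subsequent frames, or both) can be sketched as follows. Features are represented as a flat list of pixels, each pixel being a list of channel values. When both directions are present, the two are spliced along the channel axis and mixed by a convolution; the sketch assumes a 1x1 convolution for simplicity, and `combine_fusion`, `conv1x1`, and the weight values are hypothetical.

```python
def conv1x1(features, weights):
    """1x1 convolution: each output channel is a weighted sum of input channels."""
    return [[sum(w * c for w, c in zip(w_vec, pixel)) for w_vec in weights]
            for pixel in features]


def combine_fusion(preceding=None, subsequent=None, weights=None):
    """Pick or merge directional fusion features depending on what is available."""
    if preceding is None and subsequent is None:
        raise ValueError("at least one direction is required")
    if subsequent is None:   # only preceding frames exist
        return preceding
    if preceding is None:    # only subsequent frames exist
        return subsequent
    # Both directions: splice along the channel axis, then convolve.
    spliced = [p + s for p, s in zip(preceding, subsequent)]
    return conv1x1(spliced, weights)


prev_feat = [[1.0], [2.0]]   # two pixels, one channel each
next_feat = [[3.0], [4.0]]
merged = combine_fusion(prev_feat, next_feat, weights=[[0.5, 0.5]])
```

The single-direction branches matter at the ends of a sequence, where the first frame has no preceding frames and the last frame has no subsequent ones.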
  • the terminal determines the current video frame to be denoised in the video frame sequence to be denoised of the video to be denoised; extracts the image detail features to be denoised of the video frame to be denoised through the first branch of the target video denoising model; after obtaining the downsampled video frame sequence to be denoised corresponding to the video frame sequence to be denoised, extracts the features of the downsampled video frame sequence to be denoised through the second branch of the target video denoising model to obtain the image fusion features to be denoised; generates the denoised video frame corresponding to the video frame to be denoised based on the image detail features to be denoised and the image fusion features to be denoised, which fully considers the video
  • the correlation and continuity in the time dimension can effectively reduce the amount of calculation and improve the operating efficiency of the model. Therefore, even when computing resources are limited, the features of the video frames to be denoised can be better extracted, thereby improving the denoising effect of the target video denoising
  • a method for processing a video denoising model is provided, which is described by taking the method applied to the computer device in FIG1 as an example, and includes the following steps:
  • S806 The static video with added noise and real noise and the dynamic video with added noise are determined as sample videos, and the clear static video and the dynamic video without added noise are determined as reference videos.
  • the reference video frame is a video frame in the reference video corresponding to the target video frame; the target video denoising model is used to denoise the video to be denoised, and the reference video includes a clear static video obtained by smoothing the static video and a dynamic video without noise.
  • the present application also provides an application scenario, which uses the processing method of the above-mentioned video denoising model, and the method includes the following steps:
  • the training data comes from two sources: manually captured static videos with real noise, and a public set of clear videos.
  • noise is artificially added to both the videos with real noise and the clear videos to obtain low-quality noisy videos (LQ).
  • the video with real noise is time-domain smoothed, and the clear video is copied to obtain a high-quality clear video (GT).
  • the low-quality noisy video (LQ) is used as a sample video
  • the corresponding high-quality clear video (GT) is used as a reference video to construct a paired data set, and the constructed paired data set is used to train the video denoising model.
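A minimal sketch of how such LQ/GT pairs might be built for the static videos is shown below. It assumes that time-domain smoothing is a per-pixel mean over all frames of the static clip and that the added noise is Gaussian; neither choice is stated in the source, and the function names are hypothetical. For the clear dynamic videos, the GT would simply be a copy of the original frame.

```python
import random


def temporal_mean(frames):
    """Assumed time-domain smoothing: average each pixel over all frames."""
    n = len(frames)
    h, w = len(frames[0]), len(frames[0][0])
    return [[sum(f[y][x] for f in frames) / n for x in range(w)]
            for y in range(h)]


def add_gaussian_noise(frame, sigma, rng):
    """Assumed noise model: independent Gaussian noise on every pixel."""
    return [[v + rng.gauss(0.0, sigma) for v in row] for row in frame]


def build_static_pairs(noisy_frames, sigma, seed=0):
    """Return (LQ, GT) pairs: LQ = frame with added noise, GT = smoothed frame."""
    rng = random.Random(seed)
    gt = temporal_mean(noisy_frames)  # the scene is static, so one GT serves all frames
    return [(add_gaussian_noise(f, sigma, rng), gt) for f in noisy_frames]


static_clip = [[[1.0, 2.0]], [[3.0, 4.0]]]  # two 1x2 frames of the same scene
pairs = build_static_pairs(static_clip, sigma=0.1)
```

Averaging only works as a reference because the captured object does not move; for moving content the same operation would blur the frames.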
  • the video denoising model includes a high-resolution branch and a low-resolution branch.
  • the low-resolution branch includes an optical flow network and multiple sub-branches.
  • Each sub-branch includes a forward U-type network and a backward U-type network.
  • the terminal obtains a target video frame in the video frame sequence of the sample video, and extracts the image detail features of the target video frame through the high-resolution branch of the video denoising model. After downsampling the video frame sequence to obtain a downsampled video frame sequence, the downsampled video frame sequence is input into the low-resolution branch.
  • the optical flow network of the second branch of the video denoising model is used to determine the optical flow information between adjacent downsampled video frames in the downsampled video frame sequence.
  • the downsampled video frames corresponding to the target sub-branch in the low-resolution branch and the optical flow information are processed respectively.
  • a video frame sequence includes 10 video frames, and the target video frame is the i-th frame.
  • the 10 downsampled video frames are input into the low-resolution branch of the video denoising model.
  • Each downsampled video frame corresponds to a sub-branch in the low-resolution branch.
  • the pre-trained optical flow network SpyNet is first used to determine the first optical flow information from the i+1th frame to the i-th frame and the second optical flow information from the i-1th frame to the i-th frame, and features of the i+1th frame are extracted through the backward U-shaped network layer of the sub-branch corresponding to the i+1th frame to obtain the subsequent video frame features.
  • the forward U-shaped network layer of the sub-branch corresponding to the i-1-th frame is used to extract features of the i-1-th frame to obtain the previous video frame features, and the previous video frame features and the subsequent video frame features are respectively aligned with the i-th frame based on the first optical flow information and the second optical flow information to obtain the previous aligned video frame features and the subsequent aligned video frame features, the forward U-shaped network layer of the sub-branch corresponding to the i-th frame is used to perform feature processing on the previous aligned video frame features to obtain the previous image fusion features, the backward U-shaped network layer of the sub-branch corresponding to the i-th frame is used to perform feature processing on the subsequent aligned video frame features to obtain the subsequent image fusion features, the previous image fusion features and the subsequent image fusion features are spliced to obtain the spliced image features, and the spliced image features are convoluted by the convolution layer of the sub-bra
  • the preceding video frame features corresponding to the i-1th frame can be specifically determined based on the image of the i-1th frame and the video frame features of the i-2th frame
  • the succeeding video frame features corresponding to the i+1th frame can be specifically determined based on the image of the i+1th frame and the video frame features of the i+2th frame.
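The dependency described above, where the features of one frame are computed from that frame's image and the features of its neighbour further from the target, amounts to a recurrent pass over the sequence in each direction. The sketch below illustrates only that recurrence: `toy_step` is a hypothetical stand-in for a U-shaped network layer, frames are scalars for brevity, and the flow-based alignment between steps is omitted.

```python
def propagate(frames, step, reverse=False):
    """Run a recurrent feature pass over the frames.

    reverse=False: frame t uses the features of frame t-1 (forward pass).
    reverse=True:  frame t uses the features of frame t+1 (backward pass).
    """
    n = len(frames)
    order = range(n - 1, -1, -1) if reverse else range(n)
    feats = [None] * n
    prev = None
    for i in order:
        feats[i] = step(frames[i], prev)
        prev = feats[i]
    return feats


def toy_step(frame, prev_feat):
    """Hypothetical layer: blend the current frame with the propagated features."""
    return frame if prev_feat is None else 0.5 * frame + 0.5 * prev_feat


frames = [0.0, 2.0, 4.0]
forward_feats = propagate(frames, toy_step)
backward_feats = propagate(frames, toy_step, reverse=True)
```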
  • Figure 11 is a video frame to be denoised of the video to be denoised, and the video frame to be denoised contains a lot of noise.
  • Figure 12 is a clear video frame obtained after denoising the video frame to be denoised using the target video denoising model trained by the solution of the present application.
  • the embodiment of the present application also provides a processing device for a video denoising model for implementing the processing method of the video denoising model involved above.
  • the implementation scheme for solving the problem provided by the device is similar to the implementation scheme recorded in the above method, so the specific limitations in the embodiments of the processing device for one or more video denoising models provided below can refer to the limitations of the processing method for the video denoising model above, and will not be repeated here.
  • a processing device for a video denoising model comprising: a video frame acquisition module 1302, a detail feature extraction module 1304, a fusion feature extraction module 1306, a prediction module 1308 and a parameter adjustment module 1310, wherein:
  • a video frame acquisition module 1302 is used to acquire a target video frame in a video frame sequence of a sample video
  • a detail feature extraction module 1304 is used to extract image detail features of a target video frame through a first branch of a video denoising model
  • a fusion feature extraction module 1306 is used to downsample the video frame sequence to obtain a downsampled video frame sequence, and extract features from the downsampled video frame sequence through the second branch of the video denoising model to obtain an image fusion feature;
  • a prediction module 1308, configured to generate a predicted video frame based on the image fusion feature and the image detail feature
  • the parameter adjustment module 1310 is used to adjust the parameters in the video denoising model according to the loss value between the predicted video frame and the reference video frame to obtain the target video denoising model;
  • the reference video frame is the video frame in the reference video corresponding to the target video frame;
  • the target video denoising model is used to denoise the video to be denoised.
  • the image detail features of the target video frame are extracted through the first branch of the video denoising model.
  • the downsampled video frame sequence is subjected to feature extraction through the second branch of the video denoising model to obtain the image fusion features.
  • the predicted video frame is generated based on the image fusion features and the image detail features. This not only fully considers the correlation and continuity of the video in the time dimension, but also can effectively reduce the amount of calculation and improve the operation efficiency of the model.
  • the parameters in the video denoising model can be adjusted according to the loss value between the predicted video frame and the video frame corresponding to the target video frame in the reference video to obtain a target video denoising model with better denoising effect.
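Parameter adjustment from a loss value can be illustrated with a deliberately tiny example. The "model" here is a single gain `w` applied to every pixel, the loss is mean squared error, and the update is plain gradient descent; all three are assumptions for illustration, since this passage does not specify the loss function or the optimizer.

```python
def mse(pred, ref):
    """Mean squared error between two frames given as lists of rows."""
    n = sum(len(row) for row in pred)
    return sum((p - r) ** 2
               for p_row, r_row in zip(pred, ref)
               for p, r in zip(p_row, r_row)) / n


def predict(sample, w):
    """Toy model: scale every pixel of the sample frame by the parameter w."""
    return [[w * x for x in row] for row in sample]


def train_gain(sample, reference, lr=0.05, steps=200):
    """Adjust w by gradient descent on the MSE between prediction and reference."""
    w = 0.0
    n = sum(len(row) for row in sample)
    for _ in range(steps):
        # d/dw mean((w*x - r)^2) = mean(2 * (w*x - r) * x)
        grad = sum(2 * (w * x - r) * x
                   for s_row, r_row in zip(sample, reference)
                   for x, r in zip(s_row, r_row)) / n
        w -= lr * grad
    return w


sample_frame = [[1.0, 2.0], [3.0, 4.0]]
reference_frame = [[0.5, 1.0], [1.5, 2.0]]  # exactly half of the sample frame
w = train_gain(sample_frame, reference_frame)
```

A real denoising network has millions of parameters and relies on automatic differentiation, but the loop has the same shape: predict, compare with the reference frame, and update.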
  • the sample video includes a static video carrying real noise and a noisy dynamic video.
  • the sample video includes a static video with real noise and a dynamic video with added noise
  • the reference video includes a clear static video obtained by smoothing the static video and a dynamic video without added noise.
  • the apparatus further includes a sample video acquisition module 1312 and a reference video acquisition module 1314, wherein: the sample video acquisition module 1312 is used to perform video capture on a static object to obtain an original static video carrying real noise; the original static video is subjected to noise addition processing to obtain a static video; the static video carries added noise and real noise; the reference video acquisition module 1314 is used to perform smoothing processing on the original static video to obtain a clear static video.
  • the sample video acquisition module 1312 is further used to acquire partial pixels from each noisy video frame of the original static video; generate corresponding first pixel images according to the partial pixels of each noisy video frame; generate first initial noise images corresponding to each noisy video frame; fuse the first initial noise images with the first pixel images to obtain first noise images corresponding to each noisy video frame; fuse each first noise image into the corresponding noisy video frame to obtain the static video.
  • the reference video acquisition module 1314 is further used to acquire a non-noised dynamic video from a video database; the sample video acquisition module 1312 is further used to perform noise processing on the non-noised dynamic video to obtain a noisy dynamic video.
  • the video frames in the unnoised dynamic video are clear video frames;
  • the sample video acquisition module 1312 is also used to select part of the pixels from each clear video frame; generate corresponding second pixel images according to the part of the pixels of each clear video frame; generate second initial noise images corresponding to each clear video frame; fuse each second initial noise image with the corresponding second pixel image to obtain a second noise image corresponding to each clear video frame; fuse each second noise image into the corresponding clear video frame to obtain a noisy dynamic video.
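The four noise-synthesis steps listed above can be sketched as follows. The source does not define how pixels are selected or what "fuse" means at each step, so this sketch makes loud assumptions: pixels are selected at random with probability `fraction`, the initial noise is Gaussian, the noise image is the element-wise product of the initial noise and the pixel image (which makes the noise signal-dependent), and the final fusion is addition. The name `add_synthetic_noise` is hypothetical.

```python
import random


def add_synthetic_noise(frame, fraction, sigma, seed=0):
    """Return (noisy_frame, noise_image) for a 2D frame given as a list of rows."""
    rng = random.Random(seed)
    # Step 1: select part of the pixels; unselected positions become 0 (assumption).
    pixel_image = [[v if rng.random() < fraction else 0.0 for v in row]
                   for row in frame]
    # Step 2: initial noise image, one Gaussian sample per pixel (assumption).
    initial_noise = [[rng.gauss(0.0, sigma) for _ in row] for row in frame]
    # Step 3: fuse initial noise with the pixel image; assumed element-wise product.
    noise_image = [[n * p for n, p in zip(n_row, p_row)]
                   for n_row, p_row in zip(initial_noise, pixel_image)]
    # Step 4: fuse the noise image into the frame; assumed addition.
    noisy = [[v + n for v, n in zip(f_row, n_row)]
             for f_row, n_row in zip(frame, noise_image)]
    return noisy, noise_image


clear_frame = [[0.2, 0.8], [0.5, 0.1]]
noisy_frame, noise_image = add_synthetic_noise(clear_frame, fraction=0.5, sigma=0.1, seed=1)
```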
  • the second branch includes an optical flow network, a target frame sub-branch and other frame sub-branches; the fusion feature extraction module 1306 is also used to: determine the optical flow information between the downsampled target video frame in the downsampled video frame sequence and the corresponding adjacent downsampled video frame through the optical flow network; extract features of the downsampled video frame sequence through other frame sub-branches to obtain continuous video frame features corresponding to the downsampled target video frame; align the continuous video frame features with the downsampled target video frame based on the optical flow information to obtain aligned video frame features; process the aligned video frame features through the target sub-branch to obtain image fusion features.
  • adjacent downsampled video frames include downsampled preceding video frames and downsampled succeeding video frames
  • the optical flow information includes first optical flow information and second optical flow information
  • the continuous video frame features include preceding video frame features and succeeding video frame features
  • the aligned video frame features include preceding aligned video frame features and succeeding aligned video frame features; the fusion feature extraction module 1306 is also used to: determine the first optical flow information between adjacent first downsampled video frames through the optical flow network; determine the second optical flow information between adjacent second downsampled video frames through the optical flow network
  • the first downsampled video frame is a downsampled video frame between the downsampled target video frame and the downsampled preceding video frame
  • the second downsampled video frame is a downsampled video frame between the downsampled target video frame and the downsampled succeeding video frame
  • the fusion feature extraction module 1306 is used to: splice the fusion features of the preceding image and the fusion features of the succeeding image to obtain spliced image features; and perform convolution processing on the spliced image features to obtain image fusion features.
  • the prediction module 1308 is further used to: fuse the image fusion feature with the image detail feature to obtain the global image feature; and reconstruct the image based on the global image feature to obtain a predicted video frame.
  • the prediction module is further used to: upsample the image fusion feature to obtain the upsampled image fusion feature; and fuse the upsampled image fusion feature with the image detail feature to obtain the global image feature.
  • the video frame acquisition module 1302 is further used to determine the current video frame to be denoised in the sequence of video frames to be denoised of the video to be denoised; the detail feature extraction module 1304 is further used to extract the detail features of the image to be denoised of the video frame to be denoised through the first branch of the target video denoising model; the fusion feature extraction module 1306 is further used to, after obtaining the downsampled video frame sequence to be denoised corresponding to the video frame sequence to be denoised, perform feature extraction on the downsampled video frame sequence to be denoised through the second branch of the target video denoising model to obtain the fusion features of the image to be denoised; the prediction module 1308 is further used to generate a denoised video frame corresponding to the video frame to be denoised based on the detail features of the image to be denoised and the fusion features of the image to be denoised.
  • Each module in the processing device of the above video denoising model can be implemented in whole or in part by software, hardware, or a combination thereof.
  • Each module can be embedded in or independent of a processor in a computer device in the form of hardware, or can be stored in a memory in a computer device in the form of software, so that the processor can call and execute operations corresponding to each module above.
  • a computer device which may be a server, and its internal structure diagram may be shown in FIG15.
  • the computer device includes a processor, a memory, an input/output interface (Input/Output, referred to as I/O) and a communication interface.
  • the processor, the memory and the input/output interface are connected through a system bus, and the communication interface is connected to the system bus through the input/output interface.
  • the processor of the computer device is used to provide computing and control capabilities.
  • the memory of the computer device includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium stores an operating system, computer-readable instructions and a database.
  • the internal memory provides an environment for the operation of the operating system and the computer-readable instructions in the non-volatile storage medium.
  • the database of the computer device is used to store video data.
  • the input/output interface of the computer device is used to exchange information between the processor and an external device.
  • the communication interface of the computer device is used to communicate with an external terminal through a network connection.
  • a computer device which may be a terminal, and its internal structure diagram may be shown in FIG16.
  • the computer device includes a processor, a memory, an input/output interface, a communication interface, a display unit, and an input device.
  • the processor, the memory, and the input/output interface are connected via a system bus, and the communication interface, the display unit, and the input device are connected to the system bus via the input/output interface.
  • the processor of the computer device is used to provide computing and control capabilities.
  • the memory of the computer device includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium stores an operating system and computer-readable instructions.
  • the internal memory provides an environment for the operation of the operating system and computer-readable instructions in the non-volatile storage medium.
  • the input/output interface of the computer device is used to exchange information between the processor and an external device.
  • the communication interface of the computer device is used to communicate with an external terminal in a wired or wireless manner, and the wireless manner can be implemented through WIFI, a mobile cellular network, NFC (near field communication) or other technologies.
  • when the computer-readable instructions are executed by the processor, a processing method for a video denoising model is implemented.
  • the display unit of the computer device is used to form a visually visible image, and can be a display screen, a projection device or a virtual reality imaging device.
  • the display screen can be a liquid crystal display screen or an electronic ink display screen.
  • the input device of the computer device can be a touch layer covered on the display screen, or a button, trackball or touchpad set on the computer device casing, or an external keyboard, touchpad or mouse, etc.
  • FIG. 15 or FIG. 16 is merely a block diagram of a partial structure related to the scheme of the present application, and does not constitute a limitation on the computer device to which the scheme of the present application is applied.
  • the specific computer device may include more or fewer components than shown in the figure, or combine certain components, or have a different arrangement of components.
  • a computer device including a memory and a processor, wherein the memory stores computer-readable instructions, and the processor implements the steps in the above-mentioned method embodiments when executing the computer-readable instructions.
  • a computer-readable storage medium on which computer-readable instructions are stored; when the computer-readable instructions are executed by a processor, the steps in the above-mentioned method embodiments are implemented.
  • a computer program product comprising computer-readable instructions, which implement the steps in the above-mentioned method embodiments when executed by a processor.
  • user information including but not limited to user device information, user personal information, etc.
  • data including but not limited to data used for analysis, stored data, displayed data, etc.
  • any reference to the memory, database or other medium used in the embodiments provided in the present application can include at least one of non-volatile and volatile memory.
  • Non-volatile memory can include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetoresistive random access memory (MRAM), ferroelectric random access memory (FRAM), phase change memory (PCM), graphene memory, etc.
  • Volatile memory can include random access memory (RAM) or external cache memory, etc.
  • RAM can be in various forms, such as static random access memory (SRAM) or dynamic random access memory (DRAM).
  • the database involved in each embodiment provided in this application may include at least one of a relational database and a non-relational database.
  • Non-relational databases may include distributed databases based on blockchains, etc., but are not limited to this.
  • the processor involved in each embodiment provided in this application may be a general-purpose processor, a central processing unit, a graphics processor, a digital signal processor, a programmable logic unit, a data processing logic unit based on quantum computing, etc., but is not limited thereto.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Signal Processing (AREA)
  • Image Processing (AREA)

Abstract

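The present application relates to a processing method and device for a video denoising model, a computer device, a storage medium, and a computer program product. The method is applicable to the field of artificial intelligence and includes: obtaining a target video frame from a video frame sequence of a sample video, and obtaining a reference video corresponding to the sample video (S202); extracting image detail features of the target video frame through a first branch of a video denoising model (S204); performing feature extraction on a downsampled video frame sequence through a second branch of the video denoising model to obtain image fusion features (S206); generating a predicted video frame based on the image fusion features and the image detail features (S208); and adjusting parameters in the video denoising model according to a loss value between the predicted video frame and the video frame in the reference video corresponding to the target video frame, to obtain a target video denoising model (S210). The method can improve the denoising effect of the target video denoising model.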

Description

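Processing method and device for a video denoising model, computer device, and storage medium
Related Application
This application claims priority to Chinese patent application No. 2023104577981, filed on April 18, 2023 and entitled "Processing method and device for a video denoising model, computer device, and storage medium", which is incorporated herein by reference in its entirety.
Technical Field
The present application relates to the field of computer technology, and in particular to a processing method and device for a video denoising model, a computer device, and a storage medium.
Background
With the development of computer vision technology, video denoising has gradually become a research hotspot in the field of video quality improvement. Video denoising models based on deep learning have clear advantages in both denoising effect and speed, and have broad application prospects.
However, existing single-frame video denoising models cannot fully consider the correlation and continuity of a video in the time dimension and therefore cannot extract good features, while multi-frame video denoising models also fail to extract good features when computing resources are limited. As a result, existing video denoising models denoise videos poorly.
Summary
According to various embodiments provided in the present application, a processing method and device for a video denoising model, a computer device, a computer-readable storage medium, and a computer program product that can improve the video denoising effect are provided.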
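In a first aspect, the present application provides a processing method for a video denoising model, executed by a computer device, the method including:
obtaining a target video frame from a video frame sequence of a sample video, and obtaining a reference video corresponding to the sample video;
extracting image detail features of the target video frame through a first branch of a video denoising model;
downsampling the video frame sequence to obtain a downsampled video frame sequence, and performing feature extraction on the downsampled video frame sequence through a second branch of the video denoising model to obtain image fusion features;
generating a predicted video frame based on the image fusion features and the image detail features;
adjusting parameters in the video denoising model according to a loss value between the predicted video frame and a reference video frame, to obtain a target video denoising model; the reference video frame is the video frame in the reference video corresponding to the target video frame; the target video denoising model is used to denoise a video to be denoised.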
第二方面,本申请还提供了一种视频去噪模型的处理装置。所述装置包括:
视频帧获取模块,用于在样本视频的视频帧序列中获取目标视频帧,以及获取所述样本视频对应的参考视频;
细节特征提取模块,用于通过视频去噪模型的第一分支提取所述目标视频帧的图像细节特征;
融合特征提取模块,用于对所述视频帧序列进行下采样得到下采样视频帧序列,通过所述视频去噪模型的第二分支对所述下采样视频帧序列进行特征提取,得到图像融合特征;
预测模块,用于基于所述图像融合特征和所述图像细节特征生成预测视频帧;
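A parameter adjustment module, configured to adjust the parameters of the video denoising model according to the loss value between the predicted video frame and a reference video frame, to obtain a target video denoising model; the reference video frame is the video frame in the reference video that corresponds to the target video frame; the target video denoising model is used to denoise a video to be denoised.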
第三方面,本申请还提供了一种计算机设备。所述计算机设备包括存储器和处理器,所述存储器存储有计算机可读指令,所述处理器执行所述计算机可读指令时实现所述视频去噪模型的处理方法的步骤。
第四方面,本申请还提供了一种计算机可读存储介质。所述计算机可读存储介质,其上存储有计算机可读指令,所述计算机可读指令被处理器执行时实现所述视频去噪模型的处理方法的步骤。
第五方面,本申请还提供了一种计算机程序产品。所述计算机程序产品,包括计算机可读指令,该计算机可读指令被处理器执行时实现所述视频去噪模型的处理方法的步骤。
本申请的一个或多个实施例的细节在下面的附图和描述中提出。本申请的其它特征、目的和优点将从说明书、附图以及权利要求书变得明显。
附图说明
为了更清楚地说明本申请实施例或传统技术中的技术方案,下面将对实施例或传统技术描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本申请的实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据公开的附图获得其他的附图。
图1为一个实施例中视频去噪模型的处理方法的应用环境图;
图2a为一个实施例中视频去噪模型的处理方法的流程示意图;
图2b为另一个实施例中视频去噪模型的处理方法的流程示意图;
图3为一个实施例中带噪视频帧去噪示意图;
图4为一个实施例中视频帧加噪示意图;
图5为一个实施例中真实噪声图像示意图;
图6为一个实施例中图像融合特征提取步骤的流程示意图;
图7为一个实施例中视频去噪步骤的流程示意图;
图8为另一个实施例中视频去噪模型的处理方法的流程示意图;
图9为一个实施例中样本数据处理示意图;
图10为一个实施例中视频去噪模型结构示意图;
图11为另一个实施例中带噪视频帧示意图;
图12为一个实施例中去噪后视频帧示意图;
图13为一个实施例中视频去噪模型的处理装置的结构框图;
图14为另一个实施例中视频去噪模型的处理装置的结构框图;
图15为一个实施例中计算机设备的内部结构图;
图16为另一个实施例中计算机设备的内部结构图。
具体实施方式
为了使本申请的目的、技术方案及优点更加清楚明白,以下结合附图及实施例,对本申请进行进一步详细说明。应当理解,此处描述的具体实施例仅仅用以解释本申请,并不用于限定本申请。
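It should be noted that, in the following description, the terms "first, second, and third" merely distinguish similar objects and do not imply a particular ordering of those objects. Where permitted, the specific order or sequence of "first, second, and third" may be interchanged, so that the embodiments of this application described here can be implemented in an order other than that illustrated or described here.
The processing method for a video denoising model provided in the embodiments of this application can be applied in the application environment shown in FIG. 1, in which a terminal 102 communicates with a server 104 over a network. A data storage system may store the data that the server 104 needs to process; it may be integrated on the server 104, or placed on the cloud or on another server. The method may be executed by the terminal 102 or the server 104 alone, or by the terminal 102 and the server 104 in cooperation. In some embodiments, the method is executed by the terminal 102: the terminal 102 obtains a target video frame from the video frame sequence of a sample video and obtains a reference video corresponding to the sample video; extracts image detail features of the target video frame through the first branch of the video denoising model; downsamples the video frame sequence to obtain a downsampled video frame sequence, and performs feature extraction on the downsampled video frame sequence through the second branch of the video denoising model to obtain image fusion features; generates a predicted video frame based on the image fusion features and the image detail features; and adjusts the parameters of the video denoising model according to the loss value between the predicted video frame and the reference video frame, to obtain the target video denoising model. Here, the reference video frame is the video frame in the reference video that corresponds to the target video frame, and the target video denoising model is used to denoise a video to be denoised.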
其中,终端102可以但不限于是各种台式计算机、笔记本电脑、智能手机、平板电脑、物联网设备和便携式可穿戴设备,物联网设备可为智能音箱、智能电视、智能空调、智能车载设备等。便携式可穿戴设备可为智能手表、智能手环、头戴设备等。服务器104可以是独立的物理服务器,也可以是多个物理服务器构成的服务器集群或者分布式系统,还可以是提供云服务、云数据库、云计算、云函数、云存储、网络服务、云通信、中间件服务、域名服务、安全服务、CDN、以及大数据和人工智能平台等基础云计算服务的云服务器。终端102以及服务器104可以通过有线或无线通信方式进行直接或间接地连接,本申请在此不做限制。
在一个实施例中,如图2a和图2b所示,提供了一种视频去噪模型的处理方法,以该方法应用于图1中的计算机设备(终端102或服务器104)为例进行说明,包括以下步骤:
S202,在样本视频的视频帧序列中获取目标视频帧,以及获取样本视频对应的参考视频。
其中,样本视频是用于对机器学习模型进行训练的视频数据,样本视频通常由多个视频帧组成,并且每个视频帧都包含有关视频内容的信息,例如颜色、形状、动作等,样本视频可以来自于各种来源,例如现实生活中的录像、模拟生成的视频、互联网上的视频等。样本视频是携带有噪声的视频,参考视频是与样本视频相对应的无噪声或噪声水平极低的视频,通常被用作视频去噪任务中的“真实”或“理想”状态,在视频去噪模型的训练和评估过程中,参考视频提供了一个目标标准,用于评估去噪效果和模型性能。
需要说明的是,本申请实施例中样本视频包括携带真实噪声的静态视频和加噪的动态视频。静态视频是指相机固定不动,被拍摄对象不运动的情况下,产生的视频数据,由于相机不动,所以静态视频中真实噪声通常是由相机本身的噪声、光照不均匀、传感器噪声等因素引起的,因此,携带真实噪声的静态视频可以更好地反映实际应用中的视频噪声情况;动态视频是指相机或拍摄对象在运动的情况下,产生的视频数据,加噪的动态视频是指在原始视频数据的基础上,通过在视频数据中添加噪声来模拟实际应用场景中的噪声情况,通过加噪的动态视频,可以更好地测试和评估视频去噪算法或模型的鲁棒性和性能。本申请实施例中的参考视频包括对静态视频进行平滑处理所得的清晰静态视频和未加噪的清晰动态视频。
具体的,终端从样本视频中按照一定的时间间隔抽取出视频帧序列,并从所抽取出的视频帧序列中获取当前待处理的目标视频帧。例如,终端从样本视频中抽取出的视频帧序列包含10个视频帧,当前待处理的目标视频帧为第2帧,则从视频帧序列中获取第2帧的视频帧。
S204,通过视频去噪模型的第一分支提取目标视频帧的图像细节特征。
其中,视频去噪模型是指用于去除视频中的噪声的计算机视觉模型或算法。视频噪声通常由于采集设备的不完美、信号传输中的干扰、压缩算法等因素引起,因此在很多视频应用中,如视频会议、视频编码等,去噪处理是一个重要的预处理步骤,视频去噪模型的任务是从输入的噪声视频中恢复出尽可能清晰、无噪声的视频,且同时尽量保留输入的噪声视频中的细节和质量。
视频去噪模型的第一分支具体可以是高分辨率分支,用于对原始分辨率的目标视频帧进行处理,可以理解的是,目标视频帧的原始分辨率为高分辨率,高分辨率是指图像的分辨率达到了特定的分辨率阈值,分辨率阈值可以根据需求进行设定,高分辨率的目标视频帧通常会携带更多的噪声和更丰富的细节信息,通过视频去噪模型的第一分支对目标视频帧进行特征处理,可以得到更加丰富的图像细节特征。
图像细节特征是指图像中细节部分的特征,例如纹理、边缘、角点等,通过提取图像细节特征,可以更准确地区分噪声和信号,并且可以还原更多的细节信息,从而提高图像的质量和清晰度。
具体的,终端在得到目标视频帧之后,将目标视频帧输入视频去噪模型的第一分支,通过第一分支的各网络层对目标视频帧进行处理,得到目标视频帧的图像细节特征。
S206,对视频帧序列进行下采样得到下采样视频帧序列,通过视频去噪模型的第二分支对下采样视频帧序列进行特征提取,得到图像融合特征。
其中,下采样视频帧序列是指对样本视频的视频帧序列进行下采样所得到的视频帧序列,在图像处理中,下采样是指将图像的分辨率降低,从而使图像的尺寸减小,同时减少图像中的细节信息,通常用于降低计算量和内存占用,同时加速模型的训练和推理过程。
视频去噪模型的第二分支具体可以是低分辨率分支,用于对下采样视频帧序列进行处理,可以理解的是,下采样视频帧序列中的各个下采样视频帧的分辨率是低分辨率,低分辨率是指图像的分辨率未达到特定的分辨率阈值,分辨率阈值可以根据需求进行设定,低分辨率的下采样视频帧序列中各下采样视频帧的尺寸减小或者细节信息减少,通过视频去噪模型的第二分支对下采样视频帧序列进行处理,能够有效地降低计算量,提高模型的运行效率,同时还能够增强模型的泛化能力,使其更适合处理不同分辨率的视频。
图像融合特征是下采样视频帧序列中至少两个下采样视频帧的特征进行融合得到的特征表示,可以理解的是,对于存在噪声的视频数据,单独使用一帧图像进行去噪往往难以获得良好的去噪效果,因为单帧图像可能存在过多的噪声和失真,无法提供足够的信息,通过融合多个下采样视频帧的特征可以提高特征的表达能力,从而可以提高模型的去噪效果,此外下采样视频帧序列中的各个下采样视频帧经过特征提取后得到的特征表示可能存在信息损失,融合多个下采样视频帧的特征可以提高特征的表达能力,从而可以提高模型的去噪效果。
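Specifically, after obtaining the video frame sequence, the terminal downsamples each video frame in the sequence to obtain the downsampled video frame sequence, inputs the downsampled video frame sequence into the second branch of the video denoising model, and processes each downsampled video frame through the corresponding sub-branch of the second branch to obtain the image fusion features.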
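The downsampling step can be illustrated with a minimal sketch. The source does not say which downsampling operator is used; the 2x2 average pooling below, the function names `downsample_2x` and `downsample_sequence`, and the representation of a frame as a nested list of grayscale values are all assumptions made for illustration.

```python
def downsample_2x(frame):
    """Halve height and width by averaging each non-overlapping 2x2 block.

    `frame` is a list of rows of numeric pixel values (hypothetical layout).
    A trailing odd row or column is dropped.
    """
    h, w = len(frame), len(frame[0])
    return [
        [
            (frame[y][x] + frame[y][x + 1]
             + frame[y + 1][x] + frame[y + 1][x + 1]) / 4.0
            for x in range(0, w - 1, 2)
        ]
        for y in range(0, h - 1, 2)
    ]


def downsample_sequence(frames):
    """Apply the same downsampling to every frame of a sequence."""
    return [downsample_2x(f) for f in frames]
```

Each output pixel summarizes four input pixels, so the second branch would see a quarter of the data per frame, which is where the reduction in computation comes from.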
S208,基于图像融合特征和图像细节特征生成预测视频帧。
预测视频帧是指在视频去噪模型中通过对输入的视频进行去噪处理后所生成的视频帧。
具体的,终端在得到图像融合特征和图像细节特征之后,对图像融合特征和图像细节特征进行融合,得到全局图像特征,并基于全局图像特征生成预测视频帧。
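S210: adjust the parameters of the video denoising model according to the loss value between the predicted video frame and the reference video frame, to obtain the target video denoising model.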
其中,参考视频帧是参考视频中与目标视频帧对应的视频帧,损失值用于评估视频去噪模型在对输入视频进行去噪处理后所得到的预测视频帧和参考视频中对应的视频帧之间的差异程度,通常,损失值越小,代表模型预测的结果和真实结果之间的差异越小,模型的预测准确度和效果就越好。
目标视频去噪模型是训练好的用于对待去噪视频进行去噪处理的机器学习模型。
在一个实施例中,终端在得到预测视频帧之后,从参考视频中获取与目标视频帧对应的视频帧,该视频帧也可以称为参考视频帧,基于预测视频帧和对应的参考视频帧确定损失值,并基于所确定的损失值对视频去噪模型中的参数进行调整,直至满足收敛条件时停止训练,得到目标视频去噪模型。
其中,收敛是指视频去噪模型的训练过程已经趋于稳定,即视频去噪模型已经学习到了数据的特征,并且不再有显著的改善,收敛条件包括固定的训练轮数、固定损失函数的阈值等,当模型在达到该条件时停止训练,以避免过度拟合。
具体的,终端在得到损失值之后,基于损失值调整视频去噪模型中的权重参数和偏置参数的值,得到调整后视频去噪模型,并重新执行步骤S202直至训练满足收敛条件时停止训练,得到目标视频去噪模型。
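The stopping rule described above (a fixed number of training rounds or a loss threshold) can be sketched as a small loop. This is not the model of this application: the one-weight toy model, the names `train`, `toy_step` and `state`, and the learning rate are hypothetical stand-ins chosen so the example runs without any dependencies.

```python
def train(step_fn, max_epochs, loss_threshold):
    """Run training steps until the loss drops below the threshold
    or the epoch budget is used up; return the recorded losses."""
    history = []
    for _ in range(max_epochs):
        loss = step_fn()
        history.append(loss)
        if loss < loss_threshold:  # convergence condition reached
            break
    return history


# Toy stand-in for the model: a single weight w fitted so that w * x == y.
state = {"w": 0.0}


def toy_step(x=2.0, y=4.0, lr=0.1):
    """One gradient-descent update on a squared-error loss."""
    pred = state["w"] * x
    loss = (pred - y) ** 2
    grad = 2.0 * (pred - y) * x
    state["w"] -= lr * grad
    return loss


history = train(toy_step, max_epochs=100, loss_threshold=1e-6)
```

In this toy run the loss threshold triggers long before the epoch budget is exhausted, which mirrors the intent of stopping once further training no longer brings a significant improvement.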
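In one embodiment, the terminal may determine the loss value based on the following formula:
where $L$ denotes the loss value, $I^{LQ}$ denotes the video frame sequence in the sample video, $T$ denotes the number of video frames in the sequence, $F(I^{LQ})_i$ denotes the predicted video frame corresponding to the $i$-th video frame (the target video frame) of the sequence, and $I^{GT}_i$ denotes the $i$-th video frame of the reference video, that is, the reference video frame corresponding to the target video frame.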
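The formula itself is not reproduced in this text. As a minimal sketch consistent with the symbol definitions above, and assuming a mean absolute (L1) difference averaged over the T frames, which the source does not confirm, the loss could be computed as follows. The function name `mean_l1_loss` and the nested-list frame layout are illustrative assumptions.

```python
def mean_l1_loss(predicted_frames, reference_frames):
    """Average absolute pixel difference per frame, averaged over T frames.

    Both arguments are lists of T frames; each frame is a list of rows.
    The L1 distance is an assumption, not taken from the source.
    """
    total = 0.0
    for pred, ref in zip(predicted_frames, reference_frames):
        diffs = [
            abs(p - r)
            for pred_row, ref_row in zip(pred, ref)
            for p, r in zip(pred_row, ref_row)
        ]
        total += sum(diffs) / len(diffs)
    return total / len(predicted_frames)
```

A squared-error or Charbonnier distance would fit the same definitions equally well; only the per-pixel term inside the list comprehension would change.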
上述实施例中,终端在样本视频的视频帧序列中获取目标视频帧之后,通过视频去噪模型的第一分支提取目标视频帧的图像细节特征,在获得视频帧序列对应的下采样视频帧序列后,通过视频去噪模型的第二分支对下采样视频帧序列进行特征提取,得到图像融合特征,基于图像融合特征和图像细节特征生成预测视频帧,既充分考虑了视频在时间维度上的相关性和连续性,又能够有效地降低计算量,提高模型的运行效率,从而在计算资源有限的情况下,也能够根据预测视频帧和参考视频中与目标视频帧对应的视频帧之间的损失值,对视频去噪模型中的参数进行调整,得到去噪效果较好的目标视频去噪模型。
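In one embodiment, the sample video includes a static video carrying real noise and a noise-added dynamic video, and the reference video includes a clear static video obtained by smoothing the static video and the dynamic video without added noise. Using a static video containing real noise and a noise-added dynamic video as sample videos, and using the smoothed clear static video and the dynamic video without added noise as references, better simulates the noise found in real scenes and further improves the denoising effect of the target video denoising model.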
在一个实施例中,静态视频还携带有加噪噪声,上述视频去噪模型的处理方法还包括以下步骤:对静态对象进行视频采集,得到携带真实噪声的原始静态视频;对原始静态视频进行加噪处理,得到静态视频;静态视频携带有加噪噪声和真实噪声;对原始静态视频进行平滑处理,得到清晰静态视频。
其中,加噪噪声是以人工方式添加到视频中的噪声,加噪噪声的种类包括高斯噪声、椒盐噪声、伪随机噪声等;静态对象是指保持不运动的对象。平滑处理是一种图像处理方法,其主要目的是降低图像的噪声,在视频处理中,平滑处理可以应用于视频的每一帧图像中,通过对每一帧图像进行平滑操作,可以使得视频更加平滑和自然,降低噪声,平滑处理通常需要应用到每一帧图像上,因此对于视频而言,平滑处理也可以称为时域滤波。
具体的,终端保持视频采集设备不运动,对静态对象进行拍摄,得到静态视频,该静态视频即为携带真实噪声的原始静态视频,一方面采用预设的加噪算法对原始静态视频进行加噪处理,得到静态视频,该静态视频即携带有加噪噪声和真实噪声,另一方面,采用预设的平滑处理算法对该原始静态视频进行平滑处理,得到清晰静态视频。
其中,平滑处理算法包括高斯模糊、中值滤波、均值滤波等,高斯模糊可以通过对每个像素点的周围像素点进行加权平均的方式,来降低图像的噪声,中值滤波和均值滤波通过对每个像素点周围的像素点进行中值或者平均值的计算来降低图像的噪声。
可以理解的是,清晰静态视频相比于原始静态视频的噪声水平明显降低,因此也可以将清晰静态视频近似为未携带噪声的视频,以便在模型训练时将其作为未携带噪声的参考视频。
在一个实施例中,终端采用预设的平滑处理算法对该原始静态视频进行平滑处理,得到清晰静态视频的过程具体包括以下步骤:确定原始静态视频中相邻原始静态视频帧之间的帧差,将帧差达到帧差阈值的区域确定为相应原始静态视频帧中的噪声区域,对各原始静态视频帧中的噪声区域进行平滑处理,得到清晰静态视频。
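The frame-difference step can be sketched as below. This is a simplification under stated assumptions: a pixel position is treated as a noise region for the whole clip if any pair of adjacent frames differs there by at least the threshold, and smoothing is done with a plain temporal mean. The source names neither the smoothing operator for this step nor the threshold value; `smooth_noise_regions` and `diff_threshold` are hypothetical names.

```python
def smooth_noise_regions(frames, diff_threshold):
    """Replace temporally unstable pixels with their mean over time.

    `frames` is a list of T frames, each a list of rows of pixel values.
    Pixels whose adjacent-frame difference never reaches the threshold
    are left untouched.
    """
    t, h, w = len(frames), len(frames[0]), len(frames[0][0])
    out = [[row[:] for row in frame] for frame in frames]
    for y in range(h):
        for x in range(w):
            values = [frames[k][y][x] for k in range(t)]
            noisy = any(
                abs(values[k + 1] - values[k]) >= diff_threshold
                for k in range(t - 1)
            )
            if noisy:
                mean = sum(values) / t
                for k in range(t):
                    out[k][y][x] = mean
    return out
```

Because the scene is static, any large difference between adjacent frames can be attributed to noise rather than motion, which is why a temporal mean is a reasonable stand-in here.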
需要说明的是,虽然携带真实噪声的原始静态视频是对静态对象进行视频采集所得到的,但是在视频采集时,采集设备可能并非绝对的稳定,可能存在一些非常小的抖动,以及环境中气体流动导致静态对象轻微运动等,从而导致所得到原始静态视频并非绝对的静态,而是相对的静态,对于绝对静态的静态视频如果相邻视频帧之间不存在噪声,那么相邻视频帧之间的帧差应当为0。
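Referring to FIG. 3, part (b) of FIG. 3 shows three adjacent noisy video frames, and part (a) shows the frame difference between two adjacent noisy video frames. After the three noisy video frames are smoothed, the clear video frames shown in part (c) are obtained, and part (d) shows the frame difference between two adjacent clear video frames. As FIG. 3 shows, a single noisy video frame of the original static video contains no obvious noise, but the frame difference between two adjacent noisy video frames is relatively large, so the original static video exhibits obvious flicker noise during playback; this flicker noise is inter-frame noise. After temporal smoothing of the original static video, the frame difference between two adjacent clear video frames is markedly smaller, indicating that the inter-frame noise has been greatly reduced.
In the above embodiment, the terminal captures video of a static object to obtain an original static video carrying real noise, adds noise to the original static video to obtain a static video carrying both added noise and real noise, and smooths the original static video to obtain a clear static video. The static video containing real noise can then serve as the sample video and the clear static video as the reference video for training the video denoising model, which better simulates the noise found in real scenes and improves the denoising effect of the target video denoising model.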
在一个实施例中,终端对原始静态视频进行加噪处理,得到静态视频的过程具体包括以下步骤:从原始静态视频的各带噪视频帧中获取部分像素;根据各带噪视频帧的部分像素分别生成对应的第一像素图像;生成与各带噪视频帧对应的第一初始噪声图像;将第一初始噪声图像分别与第一像素图像进行融合,得到各带噪视频帧对应的第一噪声图像;将各第一噪声图像分别融合至对应的带噪视频帧中,得到静态视频。
其中,部分像素是指带噪视频帧中的部分像素点,具体可以是从带噪视频帧中随机选取出的,第一像素图像用于描述部分像素点的分布,具体的第一像素图像中部分像素点所对应的位置处的灰度值为1,1表示在此像素点所对应的位置处添加噪声,部分像素点之外的其他位置处的灰度值为0,0表示在此像素点所对应的位置处不添加噪声。
具体的,终端在得到原始静态视频后,从原始静态视频中获取各个带噪视频帧,针对任意一个带噪视频帧,从该带噪视频帧中随机选取部分像素,并基于所选取的部分像素生成与带噪视频帧大小相同的第一像素图像,其中该第一像素图像中部分像素对应位置处的灰度值可以为1,部分像素之外的其他位置处的灰度值可以为0,并采用预设的噪声生成算法生成第一初始噪声图像,将第一像素图像与第一初始噪声图像进行点乘,得到第一噪声图像,将该第一噪声图像融合至该带噪视频帧中,得到对应的加噪后的静态视频帧,可以理解的是对原始静态视频中的各个带噪视频帧均进行以上加噪处理,可以得到加噪后的静态视频。
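The mask-then-multiply construction described above can be sketched in a few lines. The selection ratio, the noise standard deviation, and the names `make_pixel_mask` and `make_masked_noise` are assumptions for illustration; the source only states that pixels are selected at random and that the noise may follow a Gaussian distribution.

```python
import random


def make_pixel_mask(height, width, ratio, rng):
    """Binary pixel image: 1 where noise will be added, 0 elsewhere."""
    return [
        [1 if rng.random() < ratio else 0 for _ in range(width)]
        for _ in range(height)
    ]


def make_masked_noise(mask, sigma, rng):
    """Element-wise product of the mask and a Gaussian initial noise image."""
    return [[m * rng.gauss(0.0, sigma) for m in row] for row in mask]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
mask = make_pixel_mask(height=4, width=6, ratio=0.5, rng=rng)
noise = make_masked_noise(mask, sigma=10.0, rng=rng)
```

Drawing a fresh mask for every frame means the noise lands on a different subset of pixels each time, which is the source of the added diversity mentioned in the text.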
其中,预设的噪声生成算法可以是随机分布算法,例如高斯分布算法,则采用高斯分布算法对相应的带噪视频帧进行处理,得到第一初始噪声图像。
在一个实施例中,终端将第一噪声图像融合至对应的带噪视频帧中,具体可以采用逐像素加权平均的方式实现图像融合,具体包括以下步骤:获取第一噪声图像对应的第一权重和带噪视频帧对应的第二权重,基于第一权重和第一噪声图像中各像素点的像素值、以及第二权重和带噪视频帧中各像素点的像素值,确定对应各目标像素点的加权像素值,基于各目标像素点的加权像素值生成加噪后的静态视频帧。其中目标像素点是指加噪后静态视频帧中的像素点。
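A minimal sketch of the per-pixel weighted fusion follows. The source does not give the weight values or say whether the weights are normalized, so this version simply forms `noise_weight * noise + frame_weight * pixel` and clips to an assumed 8-bit range; the function name `blend` and the clipping bounds are illustrative assumptions.

```python
def blend(noise_image, frame, noise_weight, frame_weight):
    """Fuse a noise image into a frame pixel by pixel.

    Each output pixel is a weighted sum of the noise value and the
    frame value, clipped to [0, 255] (assumed 8-bit pixel range).
    """
    return [
        [
            max(0.0, min(255.0, noise_weight * n + frame_weight * p))
            for n, p in zip(noise_row, frame_row)
        ]
        for noise_row, frame_row in zip(noise_image, frame)
    ]
```

Pixels where the masked noise image is zero pass through unchanged when `frame_weight` is 1, so only the randomly selected positions are disturbed.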
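Referring to FIG. 4, the first row of FIG. 4 shows the conventional way of adding noise: a noise image is first generated at random and then fused directly onto the image to be noised (the clean image) to obtain the corresponding noisy image, in which the noise is visibly spread evenly over the clean image. However, as FIG. 5 shows, in a real image the noise (represented by the dots in the figure) is not evenly distributed over every pixel position. The way of adding noise used in the embodiments of this application is shown in the second and third rows of FIG. 4: some pixels are first selected at random from the image to be noised (the clean image), a pixel image is generated from the selected pixels, and the pixel image is fused with the corresponding noise image to obtain the final noise image. The pixel image is a matrix of 0s and 1s with the same height and width as the image to be noised, where 0 means that no noise is added at that pixel position and 1 means that noise is added there. In the second and third rows of FIG. 4, the image to be noised (the clean image) is the same and the randomly generated noise image is the same, but the generated pixel images differ and the noise coefficients used also differ, so the resulting noisy images differ as well. The noise coefficient may be determined from the weight of the noise image and the weight of the clean image.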
上述实施例中,终端通过从原始静态视频的各带噪视频帧中获取部分像素;根据各带噪视频帧的部分像素分别生成对应的第一像素图像;生成与各带噪视频帧对应的第一初始噪声图像;将第一初始噪声图像分别与第一像素图像进行融合,得到各带噪视频帧对应的第一噪声图像;将各第一噪声图像分别融合至对应的带噪视频帧中,得到静态视频,从而可以使得到的静态视频能够更加准确地模拟实际图像中噪声的分布情况,同时也能够增加噪声的多样性,采用该静态视频训练视频去噪模型,可以进一步提高视频去噪模型的去噪效果。
在一个实施例中,上述视频去噪模型的处理方法还包括以下步骤:从视频数据库中获取未加噪的动态视频;对未加噪的动态视频进行加噪处理,得到加噪的动态视频。
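A dynamic video contains moving, changing content, such as a person walking or a vehicle driving, and can show the motion and change of dynamic objects from multiple angles. The video database may be a public video dataset, for example the clear video datasets REDS and DAVIS, or it may be a clear video library obtained by denoising self-captured videos. It should be noted that "clear" in the embodiments of this application can be treated as approximately noise-free; that is, a clear video is a noise-free video.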
具体的,终端可以直接从视频数据库中获取清晰的动态视频,该动态视频即为未加噪的动态视频,并采用预设的加噪算法对所获取的动态视频进行加噪处理,得到加噪的动态视频。
上述实施例中,终端通过从视频数据库中获取未加噪的动态视频,对未加噪的动态视频进行加噪处理,得到加噪的动态视频,从而可以使用加噪的动态视频作为样本视频,使用未加噪的动态视频作为参考视频对视频去噪模型进行训练,可以更好的模拟真实场景中的噪声情况,从而提高了目标视频去噪模型的去噪效果。
在一个实施例中,未加噪的动态视频中的视频帧为清晰视频帧,终端对未加噪的动态视频进行加噪处理,得到加噪的动态视频的过程包括以下步骤:从各清晰视频帧中选取部分像素;根据各清晰视频帧的部分像素分别生成对应的第二像素图像;生成各清晰视频帧对应的第二初始噪声图像;将各第二初始噪声图像分别与对应的第二像素图像进行融合,得到各清晰视频帧对应的第二噪声图像;将各第二噪声图像分别融合至对应的清晰视频帧中,得到加噪的动态视频。
其中,部分像素是指清晰视频帧中的部分像素点,具体可以是从清晰视频帧中随机选取出的,第二像素图像用于描述部分像素点的分布,具体的第二像素图像中部分像素点所对应的位置处的灰度值为1,1表示在此像素点所对应的位置处添加噪声,部分像素点之外的其他位置处的灰度值为0,0表示在此像素点所对应的位置处不添加噪声。
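Specifically, after obtaining the dynamic video without added noise, the terminal obtains each clear video frame from it. For any clear video frame, the terminal randomly selects some pixels from the frame and, based on the selected pixels, generates a second pixel image of the same size as the clear video frame, in which the grayscale value may be 1 at the positions of the selected pixels and 0 at all other positions. The terminal generates a second initial noise image with a preset noise generation algorithm, multiplies the second pixel image and the second initial noise image element-wise to obtain a second noise image, and fuses the second noise image into the clear video frame to obtain a noise-added dynamic video frame. Applying this processing to every clear video frame in the dynamic video without added noise yields the noise-added dynamic video.
The preset noise generation algorithm may be a random distribution algorithm, such as a Gaussian distribution algorithm; in that case, the Gaussian distribution algorithm is applied to the corresponding clear video frame to obtain the second initial noise image.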
在一个实施例中,终端将第二噪声图像融合至对应的清晰视频帧中,具体可以采用逐像素加权平均的方式实现图像融合,具体包括以下步骤:获取第二噪声图像对应的第三权重和清晰视频帧对应的第四权重,基于第三权重和第二噪声图像中各像素点的像素值、以及第四权重和清晰视频帧中各像素点的像素值,确定对应各目标像素点的加权像素值,基于各目标像素点的加权像素值生成加噪后的动态视频帧。其中目标像素点是指加噪后动态视频帧中的像素点。
上述实施例中,终端通过从各清晰视频帧中选取部分像素;根据各清晰视频帧的部分像素分别生成对应的第二像素图像;生成各清晰视频帧对应的第二初始噪声图像;将各第二初始噪声图像分别与对应的第二像素图像进行融合,得到各清晰视频帧对应的第二噪声图像;将各第二噪声图像分别融合至对应的清晰视频帧中,得到加噪的动态视频,从而可以使得到的加噪的动态视频能够更加准确地模拟实际图像中噪声的分布情况,同时也能够增加噪声的多样性,采用该加噪的动态视频训练视频去噪模型,可以进一步提高视频去噪模型的去噪效果。
在一个实施例中,第二分支包括光流网络、目标帧子分支和其它帧子分支,如图6所示,终端通过视频去噪模型的第二分支对下采样视频帧序列进行特征提取,得到图像融合特征的过程具体包括以下步骤:
S602,通过光流网络,确定下采样视频帧序列中的下采样目标视频帧与对应的相邻下采样视频帧之间的光流信息。
其中,光流网络是用于估计光流信息的神经网络模型,具体可以是光流网络SpyNet;光流信息是指相邻的视频帧之间像素位置变化的信息,可以理解的是,在视频中,相邻的视频帧之间可能存在着物体的运动或相机的运动,这些运动导致相邻帧之间的像素位置不同,而光流信息就是用于描述相邻帧之间像素位置变化的信息。
本申请实施例中的光流信息可以包括下采样视频帧序列中的下采样目标视频帧与对应的相邻下采样视频帧之间的光流信息,也可以包括下采样视频帧序列中任意两个相邻的下采样视频帧之间的光流信息。光流信息也可以称为光流向量,光流向量可以表示相邻的视频帧之间的像素位移,可以用于后续的帧对齐和特征融合。
下采样视频帧序列是指对视频帧序列中的各个视频帧进行下采样处理后所得到的视频帧序列,具体可以包括下采样目标视频帧和下采样连续视频帧,下采样连续视频帧包括下采样前序视频帧和下采样后序视频帧中的至少一种,例如,下采样视频帧序列中包含5个下采样视频帧,若下采样目标视频帧为下采样视频帧序列中的第3帧,则下采样视频帧序列中第3帧之外的其他下采样视频帧则为下采样连续视频帧,其中第1帧和第2帧为下采样前序视频帧,第4帧和第5帧为下采样后序视频帧;若下采样目标视频帧为下采样视频帧序列中的第1帧,则下采样视频帧序列中的第2帧至5帧则为下采样目标视频帧的下采样后序视频帧;若下采样目标视频帧为下采样视频帧序列中的第5帧,则下采样视频帧序列中的第1帧至4帧则为下采样目标视频帧的下采样前序视频帧。
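The three cases in the example above (target in the middle, at the start, or at the end of the sequence) reduce to a simple split around the target index. The helper name `split_neighbours` and the 1-based frame numbering are assumptions chosen to match the wording of the example.

```python
def split_neighbours(num_frames, target):
    """Return the 1-based frame numbers before and after the target frame."""
    preceding = list(range(1, target))
    subsequent = list(range(target + 1, num_frames + 1))
    return preceding, subsequent
```

For a 5-frame sequence this gives frames 1 and 2 as preceding and frames 4 and 5 as subsequent when the target is frame 3, and an empty preceding or subsequent list when the target is the first or last frame.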
具体的,终端在得到下采样视频帧序列之后,将下采样视频帧序列中的各个下采样视频帧输入光流网络,通过光流网络确定下采样视频帧序列中任意两个相邻的下采样视频帧之间的光流信息,从而得到下采样目标视频帧与对应的相邻下采样视频帧之间的光流信息。
在一个实施例中,相邻下采样视频帧包括下采样前序视频帧和下采样后序视频帧中的至少一种;光流信息包括第一光流信息和第二光流信息中的至少一种,当下采样连续视频帧包括下采样前序视频帧时,终端通过光流网络,确定相邻的第一下采样视频帧之间的第一光流信息;当下采样连续视频帧包括下采样后序视频帧时,终端通过光流网络,确定相邻的第二下采样视频帧之间的第二光流信息;
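Here, the first optical flow information is the information between adjacent first downsampled video frames, and the second optical flow information is the information between adjacent second downsampled video frames. The first downsampled video frames are the downsampled video frames among the downsampled target video frame and the downsampled preceding video frames; the second downsampled video frames are the downsampled video frames among the downsampled target video frame and the downsampled subsequent video frames. For example, suppose the downsampled video frame sequence contains 5 downsampled video frames and the downsampled target video frame is frame 3. Then frames 1 and 2 are the downsampled preceding video frames, and frames 4 and 5 are the downsampled subsequent video frames. The first downsampled video frames are frames 1, 2, and 3 of the downsampled video frame sequence, and the first optical flow information includes the optical flow from frame 1 to frame 2 and from frame 2 to frame 3. The second downsampled video frames are frames 3, 4, and 5 of the downsampled video frame sequence, and the second optical flow information includes the optical flow from frame 5 to frame 4 and from frame 4 to frame 3.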
具体的,当下采样连续视频帧包括下采样前序视频帧时,终端将下采样视频帧序列中的各个下采样前序视频帧和下采样目标视频帧输入光流网络,通过光流网络确定下采样前序视频帧和下采样目标视频帧中任意两个相邻的下采样视频帧之间的光流信息,即确定相邻的第一下采样视频帧之间的光流信息,并将该光流信息确定为下采样目标视频帧与对应的相邻下采样视频帧之间的第一光流信息;当下采样连续视频帧包括下采样后序视频帧时,终端将下采样视频帧序列中的各个下采样后序视频帧和下采样目标视频帧输入光流网络,通过光流网络确定下采样后序视频帧和下采样目标视频帧中任意两个相邻的下采样视频帧之间的光流信息,即确定相邻的第二下采样视频帧之间的光流信息,并将该光流信息确定为下采样目标视频帧与对应的相邻下采样视频帧之间的第二光流信息,从而可以更好地理解视频中的运动和变化,从而更精确地对齐视频帧并提取特征。
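Which frame pairs the optical flow network is asked about can be enumerated directly from the target index. The sketch below only lists the (source frame, destination frame) pairs and does not estimate any flow; `flow_pairs` is a hypothetical helper name and the 1-based numbering follows the 5-frame example in the text.

```python
def flow_pairs(num_frames, target):
    """List the adjacent frame pairs for the first and second optical flow.

    First flow: pairs running forward in time towards the target.
    Second flow: pairs running backward in time towards the target.
    """
    first = [(k, k + 1) for k in range(1, target)]
    second = [(k, k - 1) for k in range(num_frames, target, -1)]
    return first, second
```

With 5 frames and frame 3 as the target, the first list is (1, 2), (2, 3) and the second is (5, 4), (4, 3), so both chains end at the target frame.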
S604,通过其它帧子分支对下采样视频帧序列进行特征提取,得到下采样目标视频帧对应的连续视频帧特征。
其中,其它帧子分支用于对下采样视频帧序列中下采样目标视频帧之外的下采样视频帧进行特征提取,以得到下采样目标视频帧对应的连续视频帧特征,其它帧子分支包括前序帧子分支和后序帧子分支中的至少一种,前序帧子分支用于对下采样前序视频帧进行特征提取,得到前序视频帧特征,后序帧子分支用于对下采样后序视频帧进行特征提取,得到后序视频帧特征。
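Specifically, after obtaining the downsampled video frame sequence, the terminal inputs the downsampled consecutive video frames of the sequence into the other-frame sub-branch and performs feature extraction on them through the other-frame sub-branch to obtain the consecutive video frame features corresponding to the downsampled target video frame.
In one embodiment, the adjacent downsampled video frames include at least one of downsampled preceding video frames and downsampled subsequent video frames, and the consecutive video frame features include at least one of preceding video frame features and subsequent video frame features. When the downsampled consecutive video frames include downsampled preceding video frames, the terminal performs feature extraction on them through the forward network layer of the preceding-frame sub-branch to obtain the preceding video frame features; when the downsampled consecutive video frames include downsampled subsequent video frames, the terminal performs feature extraction on them through the backward network layer of the subsequent-frame sub-branch to obtain the subsequent video frame features.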
其中,前向网络层是指前向U型网络,后向网络层是指后向U型网络,前向U型网络是用于对下采样前序视频帧进行特征提取的U型网络,后向U型网络是用于对下采样后序视频帧进行特征提取的U型网络,U型网络是用于图像处理任务的卷积神经网络结构,它由下采样模块和上采样模块组成,通常在中间还会有一些卷积层和池化层。
具体的,当下采样连续视频帧包括下采样前序视频帧时,终端将下采样视频帧序列中的各个下采样前序视频帧输入前序帧子分支,通过前序帧子分支的前向网络层对各个下采样前序视频帧进行特征提取,得到前序视频帧特征;当下采样连续视频帧包括下采样后序视频帧时,终端将下采样视频帧序列中的各个下采样后序视频帧输入后序帧子分支,通过后序帧子分支的后向网络层对各个下采样后序视频帧进行特征提取,得到后序视频帧特征,在视频帧序列中,前后帧之间通常存在相关性,通过利用前序视频帧和后序视频帧的信息,可以更好地捕捉到视频序列中的时空特征,从而更精确地提取视频帧特征。
例如下采样视频帧序列中包含5个下采样视频帧,下采样目标视频帧为下采样视频帧序列中的第3帧,则前序帧子分支1用于对下采样视频帧序列中的第1帧下采样视频帧进行特征提取,前序帧子分支2用于对下采样视频帧序列中的第2帧下采样视频帧进行特征提取,后序帧子分支3用于对下采样视频帧序列中的第4帧下采样视频帧进行特征提取,后序帧子分支4用于对下采样视频帧序列中的第5帧下采样视频帧进行特征提取。
S606,基于光流信息将连续视频帧特征与下采样目标视频帧进行对齐,得到对齐后视频帧特征。
其中,对齐是指将连续视频帧特征与下采样目标视频帧的内容进行匹配,可以理解的是,在视频帧序列中,相邻的视频帧之间存在一定的运动关系,通过光流信息,可以将下采样目标视频帧与对应的连续视频帧特征进行对齐,这样在后续的处理中,就可以将它们看作是同一时刻的视频帧和视频帧特征,从而提高模型的准确度。
在一个实施例中,相邻下采样视频帧包括下采样前序视频帧和下采样后序视频帧中的至少一种;光流信息包括第一光流信息和第二光流信息中的至少一种;连续视频帧特征包括前序视频帧特征和后序视频帧特征中的至少一种;对齐后视频帧特征包括前序对齐后视频帧特征和后序对齐后视频帧特征中的至少一种;当下采样连续视频帧包括下采样前序视频帧时,终端基于第一光流信息将前序视频帧特征与下采样目标视频帧进行对齐,得到前序对齐后视频帧特征;当下采样连续视频帧包括下采样后序视频帧时,终端基于第二光流信息将后序视频帧特征与下采样目标视频帧进行对齐,得到后序对齐后视频帧特征。
具体的,当下采样连续视频帧包括下采样前序视频帧时,终端从前序视频帧特征中提取预设位置的特征向量,基于第一光流信息和所提取的特征向量确定该预设位置在下采样目标视频帧中对应的目标位置,基于预设位置的特征向量和下采样目标视频帧中对应的目标位置,采用插值法将前序视频帧特征与下采样目标视频帧的特征进行对齐,得到前序对齐后视频帧特征;当下采样连续视频帧包括下采样后序视频帧时,终端从后序视频帧特征中提取预设位置的特征向量,基于第二光流信息和所提取的特征向量确定该预设位置在下采样目标视频帧中对应的目标位置,基于预设位置的特征向量和下采样目标视频帧中对应的目标位置,采用插值法将后序视频帧特征与下采样目标视频帧的特征进行对齐,得到后序对齐后视频帧特征。其中预设位置可以是随机选取出的位置,也可以是预先指定的位置。
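Flow-guided alignment with interpolation can be sketched as a warping operation. The version below is a common backward-warping variant rather than a transcription of the procedure in the text: for every position in the target grid it looks up the neighbour feature map at the flow-displaced location and interpolates bilinearly. The single-channel feature map, the `(dy, dx)` flow layout, border clamping, and the names `bilinear_sample` and `warp` are all assumptions.

```python
import math


def bilinear_sample(feature, y, x):
    """Sample a 2D feature map at fractional (y, x), clamping to the border."""
    h, w = len(feature), len(feature[0])
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    wy, wx = y - y0, x - x0
    top = (1 - wx) * feature[y0][x0] + wx * feature[y0][x1]
    bottom = (1 - wx) * feature[y1][x0] + wx * feature[y1][x1]
    return (1 - wy) * top + wy * bottom


def warp(feature, flow):
    """Align a neighbour feature map to the target frame.

    flow[y][x] = (dy, dx) says where, in the neighbour map, the content
    for target position (y, x) is located (assumed convention).
    """
    return [
        [
            bilinear_sample(feature, y + flow[y][x][0], x + flow[y][x][1])
            for x in range(len(feature[0]))
        ]
        for y in range(len(feature))
    ]
```

Interpolation is needed because optical flow displacements are generally fractional, so the displaced location rarely falls exactly on a pixel centre.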
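It can be understood that aligning the preceding and subsequent video frame features makes more information available when extracting features for the downsampled target video frame, which improves feature extraction for the target video frame and therefore helps denoising. Using optical flow information in both the forward and backward directions further improves the quality of the extracted video frame features, so that the video denoising model can accurately estimate the noise in the aligned video frame features, which in turn improves its denoising effect.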
S608,通过目标子分支对对齐后视频帧特征进行处理,得到图像融合特征。
其中,目标子分支用于对下采样视频帧序列中的下采样目标视频帧进行特征处理,以得到下采样目标视频帧对应的图像融合特征。
具体的,终端在得到下采样目标视频帧对应的对齐后视频帧特征之后,将对齐后视频帧特征输入目标子分支,通过目标子分支对对齐后视频帧特征进行特征处理,得到图像融合特征。
在一个实施例中,当下采样连续视频帧包括下采样前序视频帧时,终端通过目标子分支的前向网络层对前序对齐后视频帧特征进行处理,得到前序图像融合特征;当下采样连续视频帧包括下采样后序视频帧时,终端通过目标子分支的后向网络层对后序对齐后视频帧特征进行处理,得到后序图像融合特征;基于前序图像融合特征和后序图像融合特征中的至少一个,确定图像融合特征。
其中,前向网络层是指前向U型网络,后向网络层是指后向U型网络,目标子分支的前向U型网络是用于对前序对齐后视频帧特征进行特征处理的U型网络,目标子分支的后向U型网络是用于对后序对齐后视频帧进行特征处理的U型网络,U型网络是用于图像处理任务的卷积神经网络结构,它由下采样模块和上采样模块组成,通常在中间还会有一些卷积层和池化层。
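Specifically, when the downsampled consecutive video frames include downsampled preceding video frames, the terminal inputs the preceding aligned video frame features into the forward network layer of the target sub-branch and processes them there to obtain the preceding image fusion features. When the downsampled consecutive video frames include downsampled subsequent video frames, the terminal inputs the subsequent aligned video frame features into the backward network layer of the target sub-branch and processes them there to obtain the subsequent image fusion features. If the downsampled consecutive video frames include only downsampled preceding video frames, the preceding image fusion features are used directly as the image fusion features; if they include only downsampled subsequent video frames, the subsequent image fusion features are used directly as the image fusion features; if they include both, the image fusion features are determined from the preceding image fusion features and the subsequent image fusion features.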
上述实施例中,终端通过第二分支的光流网络确定下采样视频帧序列中的下采样目标视频帧与对应的相邻下采样视频帧之间的光流信息,以及通过第二分支的其它帧子分支对下采样视频帧序列进行特征处理,得到下采样目标视频帧对应的连续视频帧特征,从而可以利用视频序列中的连续帧信息和光流信息,更好地理解视频中的运动和变化,从而可以得到准确的视频特征表示,同时,通过第二分支的目标子分支对对齐后的视频帧特征进行处理,可以得到更加准确的图像融合特征,进而基于图像融合特征可以使得后续的图像重建更加准确,提高了目标视频去噪模型的去噪效果。
在一个实施例中,下采样连续视频帧包括下采样前序视频帧和下采样后序视频帧时,终端基于前序图像融合特征和后序图像融合特征确定图像融合特征的过程具体包括以下步骤:将前序图像融合特征和后序图像融合特征进行拼接,得到拼接后图像特征,对拼接后图像特征进行卷积处理,得到图像融合特征。
具体的,终端在得到前序图像融合特征和后序图像融合特征之后,将前序图像融合特征和后序图像融合特征进行拼接,得到拼接后图像特征,并将拼接后图像特征输入目标子分支的卷积层,通过卷积层对拼接后图像特征进行卷积处理,得到更加高级的特征信息,该更加高级的特征信息即为图像融合特征。
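The concatenate-then-convolve step can be illustrated with a small sketch. The source does not state the kernel size or channel counts of the convolution layer; a 1x1 convolution over channel-first nested lists is assumed here purely to keep the example short, and `concat_channels` and `conv1x1` are hypothetical names.

```python
def concat_channels(forward_features, backward_features):
    """Concatenate two [C][H][W] feature tensors along the channel axis."""
    return forward_features + backward_features


def conv1x1(features, weights, bias):
    """1x1 convolution: each output channel is a weighted sum of the
    input channels at the same spatial position, plus a bias."""
    height, width = len(features[0]), len(features[0][0])
    return [
        [
            [
                sum(wc * ch[y][x] for wc, ch in zip(w_out, features)) + b
                for x in range(width)
            ]
            for y in range(height)
        ]
        for w_out, b in zip(weights, bias)
    ]
```

Concatenation keeps the preceding and subsequent information side by side, and the convolution then learns how to mix the two; with weights of 0.5 each the result is simply their average.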
上述实施例中,终端通过将前序图像融合特征和后序图像融合特征进行拼接可以有效地融合前后序视频帧的信息,充分利用前后序视频帧中连续帧之间的关联性,从而可以得到准确的视频特征表示,同时,对拼接后的图像特征进行卷积处理可以进一步提取和增强特征,从而可以得到更加准确的图像融合特征,进而基于图像融合特征可以使得后续的图像重建更加准确,提高了目标视频去噪模型的去噪效果。
在一个实施例中,终端基于图像融合特征和图像细节特征生成预测视频帧的过程具体包括以下步骤:将图像融合特征与图像细节特征进行融合,得到全局图像特征;基于全局图像特征进行图像重建,得到预测视频帧。
具体的,终端在得到图像融合特征与图像细节特征之后,获取图像融合特征对应的第一融合系数和图像细节特征对应的第二融合系数,并基于第一融合系数和第二融合系数对图像融合特征与图像细节特征进行融合,得到全局图像特征,对全局图像特征进行反卷积操作得到与目标视频帧相同大小的预测视频帧。
其中,反卷积操作用于将全局图像特征进行逐步放大到原始尺寸,以得到目标视频帧相同大小的预测视频帧。
上述实施例中,终端通过将图像融合特征与图像细节特征进行融合,得到全局图像特征,可以综合利用将图像融合特征与图像细节特征两者的信息,更全面的描述目标视频帧的图像内容,从而基于全局图像特征进行图像重建,得到预测视频帧,也能有较好的去噪效果,进而提高了目标视频去噪模型的去噪效果。
在一个实施例中,终端将图像融合特征与图像细节特征进行融合,得到全局图像特征的过程具体包括以下步骤:对图像融合特征进行上采样处理,得到上采样图像融合特征;将上采样图像融合特征与图像细节特征进行融合,得到全局图像特征。
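Specifically, after obtaining the image fusion features, the terminal applies a deconvolution operation to them to obtain the upsampled image fusion features, obtains a first fusion coefficient for the upsampled image fusion features and a second fusion coefficient for the image detail features, and fuses the upsampled image fusion features with the image detail features based on the two coefficients to obtain the global image features. The fusion may, for example, be a weighted fusion based on the first fusion coefficient and the second fusion coefficient.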
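The upsample-then-fuse step can be sketched as follows. The text specifies a deconvolution for upsampling; to stay dependency-free this sketch substitutes nearest-neighbour 2x upsampling, which is a named simplification and not the method of the source. The coefficient names `alpha` and `beta` and the function names are likewise assumptions.

```python
def upsample_2x(feature):
    """Nearest-neighbour 2x upsampling of a 2D feature map
    (a stand-in for the deconvolution described in the text)."""
    out = []
    for row in feature:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(wide[:])
    return out


def fuse(upsampled, detail, alpha, beta):
    """Weighted fusion of low-resolution-branch and detail features."""
    return [
        [alpha * u + beta * d for u, d in zip(u_row, d_row)]
        for u_row, d_row in zip(upsampled, detail)
    ]
```

Upsampling first brings the low-resolution fusion features to the same spatial size as the detail features, which is what makes a position-wise weighted sum possible.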
上述实施例中,终端通过对图像融合特征进行上采样处理,从而可以得到与目标视频帧相同分辨率的上采样图像融合特征,将上采样图像融合特征与图像细节特征进行融合,得到全局图像特征,可以充分利用两种特征的各自的优势,进一步提高全局图像特征的表达能力,进而提高了目标视频去噪模型的去噪效果。
在一个实施例中,终端在得到目标视频去噪模型之后,还可以使用目标视频去噪模型对待去噪视频进行去噪处理,如图7所示,该过程具体包括以下步骤:
S702,在待去噪视频的待去噪视频帧序列中确定当前的待去噪视频帧。
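Specifically, the terminal obtains the video to be denoised, extracts the to-be-denoised video frame sequence from it, and determines from that sequence the video frame currently to be denoised. For example, if the sequence extracted from the video to be denoised contains 10 video frames and the current frame to be denoised is frame 2, the terminal obtains frame 2 from the to-be-denoised video frame sequence.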
S704,通过目标视频去噪模型的第一分支提取待去噪视频帧的待去噪图像细节特征。
其中,目标视频去噪模型是指对视频去噪模型进行训练所得到的训练好的视频去噪模型,目标视频去噪模型的第一分支具体可以是高分辨率分支,用于对原始分辨率的当前的待去噪视频帧进行处理。
具体的,终端在得到待去噪视频的当前的待去噪视频帧后,将当前的待去噪视频帧输入目标视频去噪模型的第一分支,通过第一分支的各网络层对当前的待去噪视频帧进行处理,得到该待去噪视频帧的待去噪图像细节特征。
S706,对待去噪视频帧序列进行下采样得到下采样待去噪视频帧序列,通过目标视频去噪模型的第二分支对下采样待去噪视频帧序列进行特征提取,得到待去噪图像融合特征。
其中,下采样待去噪视频帧序列是指对待去噪视频序列进行下采样所得到的视频帧序列,在图像处理中,下采样指将图像的分辨率降低,从而使图像的尺寸减小,同时减少图像中的细节信息,通常用于降低计算量和内存占用,同时加速模型的预测过程。
目标视频去噪模型的第二分支具体可以是低分辨率分支,用于对下采样待去噪视频帧序列进行处理,可以理解的是,下采样待去噪视频帧序列中的各个下采样待去噪视频帧的分辨率是低分辨率,低分辨率的下采样待去噪视频帧序列中各下采样待去噪视频帧的尺寸减小或者细节信息减少,通过目标视频去噪模型的第二分支对下采样待去噪视频帧序列进行处理,能够有效地降低计算量,提高模型的运行效率,同时还能够增强模型的泛化能力,使其更适合处理不同分辨率的视频。
待去噪图像融合特征是指下采样待去噪视频帧序列中至少两个下采样待去噪视频帧的特征进行融合得到的特征表示,可以理解的是,对于存在噪声的视频数据,单独使用一帧图像进行去噪往往难以获得良好的去噪效果,因为单帧图像可能存在过多的噪声和失真,无法提供足够的信息,通过融合多个下采样待去噪视频帧的特征可以提高特征的表达能力,从而可以提高目标视频去噪模型的去噪效果,此外下采样待去噪视频帧序列中的各个下采样待去噪视频帧经过特征提取后得到的特征表示可能存在信息损失,融合多个下采样待去噪视频帧的特征可以提高特征的表达能力,从而可以提高目标视频去噪模型的去噪效果。
具体的,终端在得到待去噪视频帧序列之后,对待去噪视频帧序列中的各个待去噪视频帧进行下采样处理,得到下采样待去噪视频帧序列,并将下采样待去噪视频帧序列输入目标视频去噪模型的第二分支,通过第二分支的各个子分支分别对下采样待去噪视频帧序列中的各个下采样待去噪视频帧进行处理,得到待去噪图像融合特征。
在一个实施例中,第二分支包括光流网络、目标帧子分支和其它帧子分支,S706具体包括以下步骤:通过光流网络,确定下采样待去噪视频帧序列中的当前的下采样待去噪视频帧与对应的相邻下采样待去噪视频帧之间的光流信息;通过其它帧子分支对下采样待去噪视频帧序列进行特征提取,得到当前的下采样待去噪视频帧对应的待去噪连续视频帧特征;基于光流信息将待去噪连续视频帧特征与当前的下采样待去噪视频帧进行对齐,得到待去噪对齐后视频帧特征;通过目标子分支对待去噪对齐后视频帧特征进行处理,得到待去噪图像融合特征。
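In one embodiment, the downsampled to-be-denoised video frame sequence includes the current downsampled to-be-denoised video frame and downsampled to-be-denoised consecutive video frames. The downsampled to-be-denoised consecutive video frames include at least one of downsampled to-be-denoised preceding video frames and downsampled to-be-denoised subsequent video frames; the other-frame sub-branch includes at least one of a preceding-frame sub-branch and a subsequent-frame sub-branch; the to-be-denoised consecutive video frame features include at least one of to-be-denoised preceding video frame features and to-be-denoised subsequent video frame features; and the to-be-denoised aligned video frame features include at least one of to-be-denoised preceding aligned video frame features and to-be-denoised subsequent aligned video frame features. The process in which the terminal determines, through the optical flow network, the optical flow information between the current downsampled to-be-denoised video frame and its adjacent downsampled video frames includes the following steps: determining, through the optical flow network, third optical flow information between adjacent downsampled video frames among the current downsampled to-be-denoised video frame and the downsampled to-be-denoised preceding video frames; and determining, through the optical flow network, fourth optical flow information between adjacent downsampled video frames among the current downsampled to-be-denoised video frame and the downsampled to-be-denoised subsequent video frames.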
在一个实施例中,终端通过其它帧子分支对下采样待去噪视频帧序列进行特征提取,得到当前的下采样待去噪视频帧对应的待去噪连续视频帧特征的过程包括以下步骤:通过前序帧子分支的前向网络层对下采样待去噪前序视频帧进行特征提取,得到待去噪前序视频帧特征;通过后序帧子分支的后向网络层对下采样待去噪后序视频帧进行特征提取,得到待去噪后序视频帧特征。
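In one embodiment, the process in which the terminal aligns the to-be-denoised consecutive video frame features with the current downsampled to-be-denoised video frame based on the optical flow information, to obtain the to-be-denoised aligned video frame features, includes the following steps: aligning the to-be-denoised preceding video frame features with the current downsampled to-be-denoised video frame based on the third optical flow information, to obtain the to-be-denoised preceding aligned video frame features; and aligning the to-be-denoised subsequent video frame features with the current downsampled to-be-denoised video frame based on the fourth optical flow information, to obtain the to-be-denoised subsequent aligned video frame features.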
在一个实施例中,终端通过目标子分支对待去噪对齐后视频帧特征进行处理,得到图像融合特征的过程包括以下步骤:通过目标子分支的前向网络层对待去噪前序对齐后视频帧特征进行处理,得到待去噪前序图像融合特征;通过目标子分支的后向网络层对待去噪后序对齐后视频帧特征进行处理,得到待去噪后序图像融合特征;基于待去噪前序图像融合特征和待去噪后序图像融合特征中的至少一个,确定待去噪图像融合特征。
在一个实施例中,终端基于待去噪前序图像融合特征和待去噪后序图像融合特征中的至少一个,确定待去噪图像融合特征的过程具体包括以下步骤:在下采样待去噪连续视频帧仅包括下采样待去噪前序视频帧时,直接将待去噪前序图像融合特征确定为待去噪图像融合特征,在下采样待去噪连续视频帧仅包括下采样待去噪后序视频帧时,直接将待去噪后序图像融合特征确定为待去噪图像融合特征,在下采样待去噪连续视频帧包括下采样待去噪前序视频帧和下采样待去噪后序视频帧时,将待去噪前序图像融合特征和待去噪后序图像融合特征进行拼接,得到待去噪拼接后图像特征;对待去噪拼接后图像特征进行卷积处理,得到待去噪图像融合特征。
S708,基于待去噪图像细节特征和待去噪图像融合特征,生成待去噪视频帧对应的去噪视频帧。
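Specifically, after obtaining the to-be-denoised image fusion features and the to-be-denoised image detail features, the terminal fuses them to obtain to-be-denoised global image features, and generates the denoised video frame based on the to-be-denoised global image features.
In the above embodiment, the terminal determines the current to-be-denoised video frame in the to-be-denoised video frame sequence of the video to be denoised; extracts the to-be-denoised image detail features of that frame through the first branch of the target video denoising model; after obtaining the downsampled to-be-denoised video frame sequence corresponding to the to-be-denoised video frame sequence, performs feature extraction on it through the second branch of the target video denoising model to obtain the to-be-denoised image fusion features; and generates the denoised video frame corresponding to the to-be-denoised video frame based on the to-be-denoised image detail features and the to-be-denoised image fusion features. This both takes full account of the correlation and continuity of the video in the time dimension and effectively reduces computation and improves the running efficiency of the model, so that the features of the to-be-denoised video frame can be extracted well even with limited computing resources, which improves the denoising effect of the target video denoising model.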
在一个实施例中,如图8所示,提供了一种视频去噪模型的处理方法,以该方法应用于图1中的计算机设备为例进行说明,包括以下步骤:
S802,对静态对象进行视频采集,得到携带真实噪声的原始静态视频;对原始静态视频进行加噪处理,得到静态视频;静态视频携带有加噪噪声和真实噪声;对原始静态视频进行平滑处理,得到清晰静态视频。
S804,从视频数据库中获取未加噪的动态视频;对未加噪的动态视频进行加噪处理,得到加噪的动态视频。
S806,将携带有加噪噪声和真实噪声的静态视频、以及加噪的动态视频确定为样本视频,将清晰静态视频和未加噪的动态视频确定为参考视频。
S808,在样本视频的视频帧序列中获取目标视频帧。
S810,通过视频去噪模型的第一分支提取目标视频帧的图像细节特征。
S812,对视频帧序列进行下采样得到下采样视频帧序列,通过视频去噪模型的第二分支的光流网络,确定下采样视频帧序列中的下采样目标视频帧与对应的相邻下采样视频帧之间的光流信息。
S814,通过第二分支的其它帧子分支对下采样视频帧序列进行特征提取,得到下采样目标视频帧对应的连续视频帧特征。
S816,基于光流信息将连续视频帧特征与下采样目标视频帧进行对齐,得到对齐后视频帧特征。
S818,通过第二分支的目标子分支对对齐后视频帧特征进行处理,得到图像融合特征。
S820,对图像融合特征进行上采样处理,得到上采样图像融合特征。
S822,将上采样图像融合特征与图像细节特征进行融合,得到全局图像特征。
S824,基于全局图像特征进行图像重建,得到预测视频帧。
S826,根据预测视频帧和参考视频帧之间的损失值,对视频去噪模型中的参数进行调整,得到目标视频去噪模型。
其中,参考视频帧是参考视频中与目标视频帧对应的视频帧;目标视频去噪模型用于对待去噪视频进行去噪处理,参考视频包括对静态视频进行平滑处理所得的清晰静态视频和未加噪的动态视频。
本申请还提供一种应用场景,该应用场景应用上述视频去噪模型的处理方法,该方法包括以下步骤:
1、训练数据准备
参考图9所示的训练数据示意图,训练数据来源于两个部分,一部分是人工采集的画面静止的带有真实噪声的视频,另一部分是公开的清晰视频集,分别对带有真实噪声的视频和清晰视频进行人工加噪,得到低质量噪声视频(LQ),对带有真实噪声的视频进行时域平滑,以及对清晰视频进行复制,得到高质量清晰视频(GT),将低质量噪声视频(LQ)作为样本视频,将对应的高质量清晰视频(GT)作为参考视频,构建出成对的数据集,用所构建的成对的数据集来训练视频去噪模型。
2、模型训练
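Specifically, the network structure of the video denoising model is shown in FIG. 10. The model includes a high-resolution branch and a low-resolution branch; the low-resolution branch includes an optical flow network and multiple sub-branches, and each sub-branch includes a forward U-shaped network and a backward U-shaped network. The terminal obtains the target video frame from the video frame sequence of the sample video and extracts the image detail features of the target video frame through the high-resolution branch. After downsampling the video frame sequence to obtain the downsampled video frame sequence, the terminal inputs the downsampled video frame sequence into the low-resolution branch (the second branch of the video denoising model) and determines, through its optical flow network, the optical flow information between adjacent downsampled video frames. The sub-branches of the low-resolution branch other than the target sub-branch corresponding to the target video frame each process their corresponding downsampled video frame together with the optical flow information, yielding the consecutive video frame features corresponding to the downsampled target video frame. Based on the downsampled target video frame and the optical flow information between it and its adjacent downsampled video frames, the consecutive video frame features are aligned with the downsampled target video frame to obtain the aligned video frame features. The target sub-branch of the low-resolution branch processes the aligned video frame features to obtain the image fusion features; the image fusion features are upsampled to obtain the upsampled image fusion features; the upsampled image fusion features are fused with the image detail features to obtain the global image features; and image reconstruction is performed on the global image features to obtain the predicted video frame. A loss value is determined from the predicted video frame and the video frame in the reference video that corresponds to the target video frame, and the parameters of the video denoising model are adjusted based on the loss value to obtain the target video denoising model.
The process of extracting features from the downsampled video frame sequence through the low-resolution branch of the video denoising model to obtain the image fusion features is illustrated with an example in which the video frame sequence contains 10 video frames and the target video frame is frame i. After the 10 video frames are downsampled into 10 downsampled video frames, the 10 downsampled video frames are input into the low-resolution branch, where each downsampled video frame corresponds to one sub-branch. Taking frame i-1 to frame i and frame i+1 to frame i as an example: the pretrained optical flow network SpyNet first determines the first optical flow information from frame i-1 to frame i and the second optical flow information from frame i+1 to frame i. The forward U-shaped network layer of the sub-branch corresponding to frame i-1 extracts features from frame i-1 to obtain the preceding video frame features, and the backward U-shaped network layer of the sub-branch corresponding to frame i+1 extracts features from frame i+1 to obtain the subsequent video frame features. Based on the first and second optical flow information respectively, the preceding and subsequent video frame features are aligned with frame i to obtain the preceding aligned video frame features and the subsequent aligned video frame features. The forward U-shaped network layer of the sub-branch corresponding to frame i processes the preceding aligned video frame features to obtain the preceding image fusion features, and the backward U-shaped network layer of that sub-branch processes the subsequent aligned video frame features to obtain the subsequent image fusion features. The two are concatenated into the concatenated image features, which the convolution layer of the sub-branch corresponding to frame i convolves to obtain the image fusion features. The preceding video frame features of frame i-1 may be determined from the image of frame i-1 and the video frame features of frame i-2, and the subsequent video frame features of frame i+1 may be determined from the image of frame i+1 and the video frame features of frame i+2.
Referring to FIG. 11 and FIG. 12: FIG. 11 shows a frame of a video to be denoised, which contains considerable noise; FIG. 12 shows the clear video frame obtained by denoising that frame with the target video denoising model trained according to the solution of this application.
It should be understood that, although the steps in the flowcharts of the above embodiments are shown in sequence as indicated by the arrows, they are not necessarily executed in that order. Unless explicitly stated herein, there is no strict ordering constraint on these steps, and they may be executed in other orders. Moreover, at least some of the steps in those flowcharts may include multiple sub-steps or stages, which are not necessarily completed at the same moment but may be executed at different times; nor are these sub-steps or stages necessarily executed one after another, as they may be performed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.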
基于同样的发明构思,本申请实施例还提供了一种用于实现上述所涉及的视频去噪模型的处理方法的视频去噪模型的处理装置。该装置所提供的解决问题的实现方案与上述方法中所记载的实现方案相似,故下面所提供的一个或多个视频去噪模型的处理装置实施例中的具体限定可以参见上文中对于视频去噪模型的处理方法的限定,在此不再赘述。
在一个实施例中,如图13所示,提供了一种视频去噪模型的处理装置,包括:视频帧获取模块1302、细节特征提取模块1304、融合特征提取模块1306、预测模块1308和参数调整模块1310,其中:
视频帧获取模块1302,用于在样本视频的视频帧序列中获取目标视频帧;
细节特征提取模块1304,用于通过视频去噪模型的第一分支提取目标视频帧的图像细节特征;
融合特征提取模块1306,用于对视频帧序列进行下采样得到下采样视频帧序列,通过视频去噪模型的第二分支对下采样视频帧序列进行特征提取,得到图像融合特征;
预测模块1308,用于基于图像融合特征和图像细节特征生成预测视频帧;
参数调整模块1310,用于根据预测视频帧和参考视频帧之间的损失值,对视频去噪模型中的参数进行调整,得到目标视频去噪模型;参考视频帧是参考视频中与目标视频帧对应的视频帧;目标视频去噪模型用于对待去噪视频进行去噪处理。
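In the above embodiment, after the target video frame is obtained from the video frame sequence of the sample video, the image detail features of the target video frame are extracted through the first branch of the video denoising model; after the downsampled video frame sequence corresponding to the video frame sequence is obtained, feature extraction is performed on it through the second branch of the video denoising model to obtain the image fusion features; and the predicted video frame is generated based on the image fusion features and the image detail features. This both takes full account of the correlation and continuity of the video in the time dimension and effectively reduces computation and improves the running efficiency of the model, so that even with limited computing resources the parameters of the video denoising model can be adjusted according to the loss value between the predicted video frame and the corresponding video frame in the reference video, yielding a target video denoising model with a good denoising effect. In addition, the sample video includes a static video carrying real noise and a noise-added dynamic video; using these as sample videos, and using the clear static video obtained by smoothing the static video and the dynamic video without added noise as references, better simulates the noise found in real scenes and further improves the denoising effect of the target video denoising model.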
在一个实施例中,样本视频包括携带真实噪声的静态视频和加噪的动态视频;参考视频包括对静态视频进行平滑处理所得的清晰静态视频和未加噪的动态视频。
在一个实施例中,如图14所示,装置还包括样本视频获取模块1312和参考视频获取模块1314,其中:样本视频获取模块1312,用于对静态对象进行视频采集,得到携带真实噪声的原始静态视频;对原始静态视频进行加噪处理,得到静态视频;静态视频携带有加噪噪声和真实噪声;参考视频获取模块1314,用于对原始静态视频进行平滑处理,得到清晰静态视频。
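In one embodiment, the sample video acquisition module 1312 is further configured to: obtain some pixels from each noisy video frame of the original static video; generate a corresponding first pixel image from the selected pixels of each noisy video frame; generate a first initial noise image corresponding to each noisy video frame; fuse each first initial noise image with the corresponding first pixel image to obtain a first noise image for each noisy video frame; and fuse each first noise image into the corresponding noisy video frame to obtain the static video.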
在一个实施例中,参考视频获取模块1314,还用于从视频数据库中获取未加噪的动态视频;样本视频获取模块1312,还用于对未加噪的动态视频进行加噪处理,得到加噪的动态视频。
在一个实施例中,未加噪的动态视频中的视频帧为清晰视频帧;样本视频获取模块1312,还用于从各清晰视频帧中选取部分像素;根据各清晰视频帧的部分像素分别生成对应的第二像素图像;生成各清晰视频帧对应的第二初始噪声图像;将各第二初始噪声图像分别与对应的第二像素图像进行融合,得到各清晰视频帧对应的第二噪声图像;将各第二噪声图像分别融合至对应的清晰视频帧中,得到加噪的动态视频。
在一个实施例中,第二分支包括光流网络、目标帧子分支和其它帧子分支;融合特征提取模块1306,还用于:通过光流网络,确定下采样视频帧序列中的下采样目标视频帧与对应的相邻下采样视频帧之间的光流信息;通过其它帧子分支对下采样视频帧序列进行特征提取,得到下采样目标视频帧对应的连续视频帧特征;基于光流信息将连续视频帧特征与下采样目标视频帧进行对齐,得到对齐后视频帧特征;通过目标子分支对对齐后视频帧特征进行处理,得到图像融合特征。
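In one embodiment, the adjacent downsampled video frames include downsampled preceding video frames and downsampled subsequent video frames; the optical flow information includes first optical flow information and second optical flow information; the consecutive video frame features include preceding video frame features and subsequent video frame features; and the aligned video frame features include preceding aligned video frame features and subsequent aligned video frame features. The fusion feature extraction module 1306 is further configured to: determine, through the optical flow network, the first optical flow information between adjacent first downsampled video frames; determine, through the optical flow network, the second optical flow information between adjacent second downsampled video frames, where the first downsampled video frames are the downsampled video frames among the downsampled target video frame and the downsampled preceding video frames, and the second downsampled video frames are the downsampled video frames among the downsampled target video frame and the downsampled subsequent video frames; perform feature extraction on the downsampled preceding video frames through the forward network layer of the preceding-frame sub-branch to obtain the preceding video frame features; perform feature extraction on the downsampled subsequent video frames through the backward network layer of the subsequent-frame sub-branch to obtain the subsequent video frame features, where the preceding-frame sub-branch and the subsequent-frame sub-branch belong to the other-frame sub-branch; align the preceding video frame features with the downsampled target video frame based on the first optical flow information to obtain the preceding aligned video frame features; align the subsequent video frame features with the downsampled target video frame based on the second optical flow information to obtain the subsequent aligned video frame features; process the preceding aligned video frame features through the forward network layer of the target sub-branch to obtain the preceding image fusion features; process the subsequent aligned video frame features through the backward network layer of the target sub-branch to obtain the subsequent image fusion features; and determine the image fusion features based on the preceding image fusion features and the subsequent image fusion features.
In one embodiment, the fusion feature extraction module 1306 is further configured to: concatenate the preceding image fusion features and the subsequent image fusion features to obtain concatenated image features; and convolve the concatenated image features to obtain the image fusion features.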
在一个实施例中,预测模块1308,还用于:将图像融合特征与图像细节特征进行融合,得到全局图像特征;基于全局图像特征进行图像重建,得到预测视频帧。
在一个实施例中,预测模块,还用于:对图像融合特征进行上采样处理,得到上采样图像融合特征;将上采样图像融合特征与图像细节特征进行融合,得到全局图像特征。
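In one embodiment, the video frame acquisition module 1302 is further configured to determine the current to-be-denoised video frame in the to-be-denoised video frame sequence of a video to be denoised; the detail feature extraction module 1304 is further configured to extract the to-be-denoised image detail features of that frame through the first branch of the target video denoising model; the fusion feature extraction module 1306 is further configured to, after the downsampled to-be-denoised video frame sequence corresponding to the to-be-denoised video frame sequence is obtained, perform feature extraction on it through the second branch of the target video denoising model to obtain the to-be-denoised image fusion features; and the prediction module 1308 is further configured to generate the denoised video frame corresponding to the to-be-denoised video frame based on the to-be-denoised image detail features and the to-be-denoised image fusion features.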
上述视频去噪模型的处理装置中的各个模块可全部或部分通过软件、硬件及其组合来实现。上述各模块可以硬件形式内嵌于或独立于计算机设备中的处理器中,也可以以软件形式存储于计算机设备中的存储器中,以便于处理器调用执行以上各个模块对应的操作。
在一个实施例中,提供了一种计算机设备,该计算机设备可以是服务器,其内部结构图可以如图15所示。该计算机设备包括处理器、存储器、输入/输出接口(Input/Output,简称I/O)和通信接口。其中,处理器、存储器和输入/输出接口通过系统总线连接,通信接口通过输入/输出接口连接到系统总线。其中,该计算机设备的处理器用于提供计算和控制能力。该计算机设备的存储器包括非易失性存储介质和内存储器。该非易失性存储介质存储有操作系统、计算机可读指令和数据库。该内存储器为非易失性存储介质中的操作系统和计算机可读指令的运行提供环境。该计算机设备的数据库用于存储视频数据。该计算机设备的输入/输出接口用于处理器与外部设备之间交换信息。该计算机设备的通信接口用于与外部的终端通过网络连接通信。该计算机可读指令被处理器执行时以实现一种视频去噪模型的处理方法。
在一个实施例中,提供了一种计算机设备,该计算机设备可以是终端,其内部结构图可以如图16所示。该计算机设备包括处理器、存储器、输入/输出接口、通信接口、显示单元和输入装置。其中,处理器、存储器和输入/输出接口通过系统总线连接,通信接口、显示单元和输入装置通过输入/输出接口连接到系统总线。其中,该计算机设备的处理器用于提供计算和控制能力。该计算机设备的存储器包括非易失性存储介质、内存储器。该非易失性存储介质存储有操作系统和计算机可读指令。该内存储器为非易失性存储介质中的操作系统和计算机可读指令的运行提供环境。该计算机设备的输入/输出接口用于处理器与外部设备之间交换信息。该计算机设备的通信接口用于与外部的终端进行有线或无线方式的通信,无线方式可通过WIFI、移动蜂窝网络、NFC(近场通信)或其他技术实现。该计算机可读指令被处理器执行时以实现一种视频去噪模型的处理方法。该计算机设备的显示单元用于形成视觉可见的画面,可以是显示屏、投影装置或虚拟现实成像装置,显示屏可以是液晶显示屏或电子墨水显示屏,该计算机设备的输入装置可以是显示屏上覆盖的触摸层,也可以是计算机设备外壳上设置的按键、轨迹球或触控板,还可以是外接的键盘、触控板或鼠标等。
本领域技术人员可以理解,图15或图16中示出的结构,仅仅是与本申请方案相关的部分结构的框图,并不构成对本申请方案所应用于其上的计算机设备的限定,具体的计算机设备可以包括比图中所示更多或更少的部件,或者组合某些部件,或者具有不同的部件布置。
在一个实施例中,还提供了一种计算机设备,包括存储器和处理器,存储器中存储有计算机可读指令,该处理器执行计算机可读指令时实现上述各方法实施例中的步骤。
在一个实施例中,提供了一种计算机可读存储介质,其上存储有计算机可读指令,该计算机可读指令被处理器执行时实现上述各方法实施例中的步骤。
在一个实施例中,提供了一种计算机程序产品,包括计算机可读指令,该计算机可读指令被处理器执行时实现上述各方法实施例中的步骤。
需要说明的是,本申请所涉及的用户信息(包括但不限于用户设备信息、用户个人信息等)和数据(包括但不限于用于分析的数据、存储的数据、展示的数据等),均为经用户授权或者经过各方充分授权的信息和数据,且相关数据的收集、使用和处理需要遵守相关国家和地区的相关法律法规和标准。
本领域普通技术人员可以理解实现上述实施例方法中的全部或部分流程,是可以通过计算机程序来指令相关的硬件来完成,所述的计算机程序可存储于一非易失性计算机可读取存储介质中,该计算机程序在执行时,可包括如上述各方法的实施例的流程。其中,本申请所提供的各实施例中所使用的对存储器、数据库或其它介质的任何引用,均可包括非易失性和易失性存储器中的至少一种。非易失性存储器可包括只读存储器(Read-Only Memory,ROM)、磁带、软盘、闪存、光存储器、高密度嵌入式非易失性存储器、阻变存储器(ReRAM)、磁变存储器(Magnetoresistive Random Access Memory,MRAM)、铁电存储器(Ferroelectric Random Access Memory,FRAM)、相变存储器(Phase Change Memory,PCM)、石墨烯存储器等。易失性存储器可包括随机存取存储器(Random Access Memory,RAM)或外部高速缓冲存储器等。作为说明而非局限,RAM可以是多种形式,比如静态随机存取存储器(Static Random Access Memory,SRAM)或动态随机存取存储器(Dynamic Random Access Memory,DRAM)等。本申请所提供的各实施例中所涉及的数据库可包括关系型数据库和非关系型数据库中至少一种。非关系型数据库可包括基于区块链的分布式数据库等,不限于此。本申请所提供的各实施例中所涉及的处理器可为通用处理器、中央处理器、图形处理器、数字信号处理器、可编程逻辑器、基于量子计算的数据处理逻辑器等,不限于此。
以上实施例的各技术特征可以进行任意的组合,为使描述简洁,未对上述实施例中的各个技术特征所有可能的组合都进行描述,然而,只要这些技术特征的组合不存在矛盾,都应当认为是本说明书记载的范围。
以上所述实施例仅表达了本申请的几种实施方式,其描述较为具体和详细,但并不能因此而理解为对本申请专利范围的限制。应当指出的是,对于本领域的普通技术人员来说,在不脱离本申请构思的前提下,还可以做出若干变形和改进,这些都属于本申请的保护范围。因此,本申请的保护范围应以所附权利要求为准。

Claims (16)

  1. 一种视频去噪模型的处理方法,由计算机设备执行,所述方法包括:
    在样本视频的视频帧序列中获取目标视频帧,以及获取所述样本视频对应的参考视频;
    通过视频去噪模型的第一分支提取所述目标视频帧的图像细节特征;
    对所述视频帧序列进行下采样得到下采样视频帧序列,通过所述视频去噪模型的第二分支对所述下采样视频帧序列进行特征提取,得到图像融合特征;
    基于所述图像融合特征和所述图像细节特征生成预测视频帧;
    根据所述预测视频帧和参考视频帧之间的损失值,对所述视频去噪模型中的参数进行调整,得到目标视频去噪模型;所述参考视频帧是所述参考视频中与所述目标视频帧对应的视频帧;所述目标视频去噪模型用于对待去噪视频进行去噪处理。
  2. 根据权利要求1所述的方法,所述样本视频包括携带真实噪声的静态视频和加噪的动态视频;所述参考视频包括对所述静态视频进行平滑处理所得的清晰静态视频和未加噪的所述动态视频。
  3. 根据权利要求2所述的方法,所述静态视频还携带有加噪噪声;所述方法还包括:
    对静态对象进行视频采集,得到携带真实噪声的原始静态视频;
    对所述原始静态视频进行加噪处理,得到所述静态视频;所述静态视频携带有所述加噪噪声和所述真实噪声;
    对所述原始静态视频进行平滑处理,得到所述清晰静态视频。
  4. 根据权利要求3所述的方法,所述对所述原始静态视频进行加噪处理,得到所述静态视频,包括:
    从所述原始静态视频的各带噪视频帧中获取部分像素;
    根据各所述带噪视频帧的部分像素分别生成对应的第一像素图像;
    生成与各所述带噪视频帧对应的第一初始噪声图像;
    将所述第一初始噪声图像分别与所述第一像素图像进行融合,得到各所述带噪视频帧对应的第一噪声图像;
    将各所述第一噪声图像分别融合至对应的所述带噪视频帧中,得到所述静态视频。
  5. 根据权利要求1所述的方法,所述方法还包括:
    从视频数据库中获取未加噪的动态视频;
    对所述未加噪的动态视频进行加噪处理,得到加噪的动态视频。
  6. 根据权利要求5所述的方法,所述未加噪的动态视频中的视频帧为清晰视频帧;
    所述对所述未加噪的动态视频进行加噪处理,得到加噪的动态视频,包括:
    从各所述清晰视频帧中选取部分像素;
    根据各所述清晰视频帧的部分像素分别生成对应的第二像素图像;
    生成各所述清晰视频帧对应的第二初始噪声图像;
    将各所述第二初始噪声图像分别与对应的所述第二像素图像进行融合,得到各所述清晰视频帧对应的第二噪声图像;
    将各所述第二噪声图像分别融合至对应的所述清晰视频帧中,得到加噪的动态视频。
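  7. The method according to claim 1, wherein the second branch includes an optical flow network, a target-frame sub-branch, and an other-frame sub-branch; and performing feature extraction on the downsampled video frame sequence through the second branch of the video denoising model to obtain the image fusion features includes: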
    通过所述光流网络,确定所述下采样视频帧序列中的下采样目标视频帧与对应的相邻下采样视频帧之间的光流信息;
    通过所述其它帧子分支对所述下采样视频帧序列进行特征提取,得到所述下采样目标视频帧对应的连续视频帧特征;
    基于所述光流信息将所述连续视频帧特征与所述下采样目标视频帧进行对齐,得到对齐后视频帧特征;
    通过所述目标子分支对所述对齐后视频帧特征进行处理,得到图像融合特征。
  8. 根据权利要求7所述的方法,所述相邻下采样视频帧包括下采样前序视频帧和下采样后序视频帧;所述光流信息包括第一光流信息和第二光流信息;所述连续视频帧特征包括前序视频帧特征和后序视频帧特征;对齐后视频帧特征包括前序对齐后视频帧特征和后序对齐后视频帧特征;
    所述通过所述光流网络,确定所述下采样视频帧序列中的下采样目标视频帧与对应的相邻下采样视频帧之间的光流信息,包括:
    通过所述光流网络,确定相邻的第一下采样视频帧之间的第一光流信息;通过所述光流网络,确定相邻的第二下采样视频帧之间的第二光流信息;所述第一下采样视频帧是所述下采样目标视频帧与所述下采样前序视频帧中的下采样视频帧;所述第二下采样视频帧是所述下采样目标视频帧与所述下采样后序视频帧中的下采样视频帧;
    所述通过所述其它帧子分支对所述下采样视频帧序列进行特征提取,得到所述下采样目标视频帧对应的连续视频帧特征,包括:
    通过前序帧子分支的前向网络层对所述下采样前序视频帧进行特征提取,得到前序视频帧特征;通过后序帧子分支的后向网络层对所述下采样后序视频帧进行特征提取,得到后序视频帧特征;所述前序帧子分支和所述后序帧子分支属于所述其它帧子分支;
    所述基于所述光流信息将所述连续视频帧特征与所述下采样目标视频帧进行对齐,得到对齐后视频帧特征,包括:
    基于所述第一光流信息将所述前序视频帧特征与所述下采样目标视频帧进行对齐,得到前序对齐后视频帧特征;基于所述第二光流信息将所述后序视频帧特征与所述下采样目标视频帧进行对齐,得到后序对齐后视频帧特征;
    所述通过所述目标子分支对所述对齐后视频帧特征进行处理,得到图像融合特征,包括:
    通过所述目标子分支的前向网络层对所述前序对齐后视频帧特征进行处理,得到前序图像融合特征;通过所述目标子分支的后向网络层对所述后序对齐后视频帧特征进行处理,得到后序图像融合特征;
    基于所述前序图像融合特征和所述后序图像融合特征,确定图像融合特征。
  9. 根据权利要求8所述的方法,所述基于所述前序图像融合特征和所述后序图像融合特征,确定图像融合特征,包括:
    将所述前序图像融合特征和所述后序图像融合特征进行拼接,得到拼接后图像特征;
    对所述拼接后图像特征进行卷积处理,得到图像融合特征。
  10. 根据权利要求1所述的方法,所述基于所述图像融合特征和所述图像细节特征生成预测视频帧,包括:
    将所述图像融合特征与所述图像细节特征进行融合,得到全局图像特征;
    基于所述全局图像特征进行图像重建,得到预测视频帧。
  11. 根据权利要求10所述的方法,所述将所述图像融合特征与所述图像细节特征进行融合,得到全局图像特征,包括:
    对所述图像融合特征进行上采样处理,得到上采样图像融合特征;
    将所述上采样图像融合特征与所述图像细节特征进行融合,得到全局图像特征。
  12. 根据权利要求1至11中任一项所述的方法,所述方法还包括:
    在待去噪视频的待去噪视频帧序列中确定当前的待去噪视频帧;
    通过所述目标视频去噪模型的第一分支提取所述待去噪视频帧的待去噪图像细节特征;
    对所述待去噪视频帧序列进行下采样得到下采样待去噪视频帧序列,通过所述目标视频去噪模型的第二分支对所述下采样待去噪视频帧序列进行特征提取,得到待去噪图像融合特征;
    基于所述待去噪图像细节特征和所述待去噪图像融合特征,生成所述待去噪视频帧对应的去噪视频帧。
  13. 一种视频去噪模型的处理装置,所述装置包括:
    视频帧获取模块,用于在样本视频的视频帧序列中获取目标视频帧,以及获取所述样本视频对应的参考视频;
    细节特征提取模块,用于通过视频去噪模型的第一分支提取所述目标视频帧的图像细节特征;
    融合特征提取模块,用于对所述视频帧序列进行下采样得到下采样视频帧序列,通过所述视频去噪模型的第二分支对所述下采样视频帧序列进行特征提取,得到图像融合特征;
    预测模块,用于基于所述图像融合特征和所述图像细节特征生成预测视频帧;
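    a parameter adjustment module, configured to adjust the parameters of the video denoising model according to the loss value between the predicted video frame and a reference video frame, to obtain a target video denoising model; the reference video frame is the video frame in the reference video that corresponds to the target video frame; the target video denoising model is used to denoise a video to be denoised.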
  14. 一种计算机设备,包括存储器和处理器,所述存储器存储有计算机可读指令,所述处理器执行所述计算机可读指令时实现权利要求1至12中任一项所述的方法的步骤。
  15. 一种计算机可读存储介质,其上存储有计算机可读指令,所述计算机可读指令被处理器执行时实现权利要求1至12中任一项所述的方法的步骤。
  16. 一种计算机程序产品,包括计算机可读指令,其特征在于,该计算机可读指令被处理器执行时实现权利要求1至12中任一项所述的方法的步骤。
PCT/CN2024/079883 2023-04-18 2024-03-04 视频去噪模型的处理方法、装置、计算机设备和存储介质 Ceased WO2024217164A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP24791752.9A EP4632666A4 (en) 2023-04-18 2024-03-04 METHOD AND APPARATUS FOR PROCESSING VIDEO DENOISSING MODELS, COMPUTER DEVICE AND STORAGE MEDIA
US19/193,267 US20250272803A1 (en) 2023-04-18 2025-04-29 Method, computer device, and storage medium for processing video denoising model

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202310457798.1A CN116977200A (zh) 2023-04-18 2023-04-18 视频去噪模型的处理方法、装置、计算机设备和存储介质
CN202310457798.1 2023-04-18

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US19/193,267 Continuation US20250272803A1 (en) 2023-04-18 2025-04-29 Method, computer device, and storage medium for processing video denoising model

Publications (1)

Publication Number Publication Date
WO2024217164A1 (zh) 2024-10-24

Family

ID=88482158

Family Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2024/079883 Ceased WO2024217164A1 (zh) 2023-04-18 2024-03-04 Method and apparatus for processing video denoising model, computer device, and storage medium

Country Status (4)

Country Link
US (1) US20250272803A1 (zh)
EP (1) EP4632666A4 (zh)
CN (1) CN116977200A (zh)
WO (1) WO2024217164A1 (zh)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
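CN119991416A (zh) * 2025-04-17 2025-05-13 南京信息工程大学 Video style transfer method based on RAFT optical flow
CN120075372A (zh) * 2025-04-27 2025-05-30 中国科学院沈阳自动化研究所 Method and apparatus for image acquisition and fusion
CN121147028A (zh) * 2025-09-11 2025-12-16 青岛大学 Method and system for eliminating image sequence flicker based on clustering and multi-scale histogram matching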

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
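CN116977200A (zh) * 2023-04-18 2023-10-31 腾讯科技(深圳)有限公司 Method and apparatus for processing video denoising model, computer device, and storage medium
CN117495853B (zh) * 2023-12-28 2024-05-03 淘宝(中国)软件有限公司 Video data processing method, device, and storage medium
CN118714417B (zh) * 2024-02-07 2026-01-27 浙江天猫技术有限公司 Video generation method and system, electronic device, and storage medium
CN118555461B (zh) * 2024-07-29 2024-10-15 浙江天猫技术有限公司 Video generation method, apparatus, device, system, and computer program product
CN119991465A (zh) * 2025-01-23 2025-05-13 英特灵达信息技术(深圳)有限公司 Optical flow information prediction network training method, image enhancement method, apparatus, and device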

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111738952A (zh) * 2020-06-22 2020-10-02 京东方科技集团股份有限公司 Image restoration method and apparatus, and electronic device
CN112686828A (zh) * 2021-03-16 2021-04-20 腾讯科技(深圳)有限公司 Video denoising method, apparatus, device, and storage medium
CN113011562A (zh) * 2021-03-18 2021-06-22 华为技术有限公司 Model training method and apparatus
CN113034401A (zh) * 2021-04-08 2021-06-25 中国科学技术大学 Video denoising method and apparatus, storage medium, and electronic device
US11151695B1 (en) * 2019-08-16 2021-10-19 Perceive Corporation Video denoising using neural networks with spatial and temporal features
CN116977200A (zh) * 2023-04-18 2023-10-31 腾讯科技(深圳)有限公司 Method and apparatus for processing video denoising model, computer device, and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114494011B (zh) * 2021-12-31 2024-09-03 深圳市联影高端医疗装备创新研究院 Image interpolation method and apparatus, processing device, and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP4632666A4 *

Also Published As

Publication number Publication date
EP4632666A4 (en) 2026-04-15
US20250272803A1 (en) 2025-08-28
CN116977200A (zh) 2023-10-31
EP4632666A1 (en) 2025-10-15

Similar Documents

Publication Publication Date Title
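WO2024217164A1 (zh) Method and apparatus for processing video denoising model, computer device, and storage medium
TWI728465B (zh) Image processing method and apparatus, electronic device, and storage medium
CN111539879A (zh) Deep-learning-based blind video denoising method and apparatus
WO2022110638A1 (zh) Portrait restoration method and apparatus, electronic device, storage medium, and program product
CN111784578A (zh) Image processing and model training methods and apparatuses, device, and storage medium
JP2018527687A (ja) Image processing system for downscaling images using a perceptual downscaling method
CN113628115B (zh) Image reconstruction processing method and apparatus, electronic device, and storage medium
CN113902647B (zh) Image deblurring method based on a dual closed-loop network
CN106127689A (zh) Image and video super-resolution method and apparatus
CN116385283A (zh) Event-camera-based image deblurring method and system
Shrivastava et al. Video dynamics prior: An internal learning approach for robust video enhancements
CN115222606A (zh) Image processing method and apparatus, computer-readable medium, and electronic device
Fang et al. Self-enhanced convolutional network for facial video hallucination
CN120912433A (zh) Event-stream-based super-resolution reconstruction method, apparatus, device, and storage medium for blurred images
CN118608387A (zh) Method, apparatus, and device for super-resolution reconstruction of satellite video frames
WO2024131707A1 (zh) Hair enhancement method, neural network, electronic apparatus, and storage medium
Mahamud et al. Effective Super-Resolution Through Multi-Order Degradation Simulation and Efficient Training Strategies
CN106204445A (zh) Image and video super-resolution method based on structure tensor total variation
HK40097800A (zh) Method and apparatus for processing video denoising model, computer device, and storage medium
CN120912438B (zh) Image super-resolution method and apparatus based on bidirectional focus enhancement
CN121120896B (zh) Sparse-view indoor reconstruction method based on uncertainty-aware depth supervision
Xu et al. Image Restoration for Beautification
CN117557462A (zh) Image reconstruction model training and video playback methods and apparatuses, and computer device
CN118735821A (zh) Image processing method and apparatus, computer device, and readable storage medium
CN114612293A (zh) Image super-resolution processing method, apparatus, device, and storage medium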

Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application. Ref document number: 24791752; Country of ref document: EP; Kind code of ref document: A1
WWE WIPO information: entry into national phase. Ref document number: 2024791752; Country of ref document: EP
ENP Entry into the national phase. Ref document number: 2024791752; Country of ref document: EP; Effective date: 20250710
WWP WIPO information: published in national office. Ref document number: 2024791752; Country of ref document: EP
NENP Non-entry into the national phase. Ref country code: DE