WO2019237850A1 - Video processing method, device, and storage medium - Google Patents

Video processing method, device, and storage medium

Info

Publication number
WO2019237850A1
WO2019237850A1 PCT/CN2019/085606 CN2019085606W WO2019237850A1 WO 2019237850 A1 WO2019237850 A1 WO 2019237850A1 CN 2019085606 W CN2019085606 W CN 2019085606W WO 2019237850 A1 WO2019237850 A1 WO 2019237850A1
Authority
WO
WIPO (PCT)
Prior art keywords
target
video frame
region
data
barrage
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/CN2019/085606
Other languages
English (en)
French (fr)
Inventor
刘玉杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to EP19819568.7A (EP3809710A4)
Publication of WO2019237850A1
Priority to US16/937,360 (US11611809B2)
Anticipated expiration
Legal status: Ceased

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/83Generation or processing of protective or descriptive data associated with content; Content structuring
    • H04N21/84Generation or processing of descriptive data, e.g. content descriptors
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/234Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
    • H04N21/23418Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/243Classification techniques relating to the number of classes
    • G06F18/2431Multiple classes
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/22Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition
    • G06V10/235Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition based on user input or interaction
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/10Terrestrial scenes
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/41Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/234Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
    • H04N21/2343Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs involving reformatting operations of video signals for distribution or compliance with end-user requests or end-user device requirements
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/235Processing of additional data, e.g. scrambling of additional data or processing content descriptors
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/4302Content synchronisation processes, e.g. decoder synchronisation
    • H04N21/4307Synchronising the rendering of multiple content streams or additional data on devices, e.g. synchronisation of audio on a mobile phone with the video output on the TV screen
    • H04N21/43072Synchronising the rendering of multiple content streams or additional data on devices, e.g. synchronisation of audio on a mobile phone with the video output on the TV screen of multiple content streams on the same device
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/431Generation of visual interfaces for content selection or interaction; Content or additional data rendering
    • H04N21/4312Generation of visual interfaces for content selection or interaction; Content or additional data rendering involving specific graphical features, e.g. screen layout, special fonts or colors, blinking icons, highlights or animations
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/431Generation of visual interfaces for content selection or interaction; Content or additional data rendering
    • H04N21/4312Generation of visual interfaces for content selection or interaction; Content or additional data rendering involving specific graphical features, e.g. screen layout, special fonts or colors, blinking icons, highlights or animations
    • H04N21/4316Generation of visual interfaces for content selection or interaction; Content or additional data rendering involving specific graphical features, e.g. screen layout, special fonts or colors, blinking icons, highlights or animations for displaying supplemental content in a region of the screen, e.g. an advertisement in a separate window
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/435Processing of additional data, e.g. decrypting of additional data, reconstructing software from modules extracted from the transport stream
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/44008Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/4402Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving reformatting operations of video signals for household redistribution, storage or real-time display
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/45Management operations performed by the client for facilitating the reception of or the interaction with the content or administrating data related to the end-user or to the client device itself, e.g. learning user preferences for recommending movies, resolving scheduling conflicts
    • H04N21/462Content or additional data management e.g. creating a master electronic programme guide from data received from the Internet and a Head-end or controlling the complexity of a video stream by scaling the resolution or bit-rate based on the client capabilities
    • H04N21/4622Retrieving content or additional data from different sources, e.g. from a broadcast channel and the Internet
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47End-user applications
    • H04N21/475End-user interface for inputting end-user data, e.g. personal identification number [PIN], preference data
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47End-user applications
    • H04N21/478Supplemental services, e.g. displaying phone caller identification, shopping application
    • H04N21/4788Supplemental services, e.g. displaying phone caller identification, shopping application communicating with other users, e.g. chatting
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47End-user applications
    • H04N21/488Data services, e.g. news ticker
    • H04N21/4884Data services, e.g. news ticker for displaying subtitles
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47End-user applications
    • H04N21/488Data services, e.g. news ticker
    • H04N21/4886Data services, e.g. news ticker for displaying a ticker, e.g. scrolling banner for news, stock exchange, weather data
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/25Determination of region of interest [ROI] or a volume of interest [VOI]
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/255Detecting or recognising potential candidate objects based on visual cues, e.g. shapes

Definitions

  • The present application relates to the field of Internet technologies, and in particular, to a video processing method, device, and storage medium.
  • Because the barrage (bullet-screen comments) on the video playback interface is independent of the video content being played, the barrage displayed there cannot reflect the currently playing video content in real time. That is, the client terminal lacks any correlation between the barrage and the video content, which further reduces the visual display effect of the currently displayed barrage data.
  • An embodiment of the present application provides a video processing method, which is executed by a client terminal and includes:
  • obtaining, from a key information database, keyword information that matches the barrage data as target keyword information;
  • where the key information database contains keyword information set by the user and a classification recognition model of the target object corresponding to each piece of keyword information;
  • performing animation processing on the target region in the target video frame.
  • An embodiment of the present application provides a video processing apparatus, including:
  • a barrage data acquisition module, configured to play video data and obtain barrage data corresponding to the video data;
  • a keyword acquisition module, configured to obtain, from a key information database, keyword information that matches the barrage data as target keyword information;
  • where the key information database contains keyword information set by the user and a classification recognition model of the target object corresponding to each piece of keyword information;
  • a target object recognition module, configured to obtain a target video frame from a plurality of video frames of the video data, identify, based on the classification recognition model corresponding to the target keyword information, the image region of the corresponding target object in the target video frame, and use the identified image region as the target region;
  • a target area processing module, configured to perform animation processing on the target region in the target video frame when the target video frame in the video data is played.
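  • As a rough illustration of the last two modules, the sketch below treats the classification recognition model as a plain callable that returns a bounding box, and uses enlargement of that box as a stand-in for the animation processing. The `Region` layout, the `Model` type, the function names, and the 1.5 scale factor are all assumptions made for this example and are not taken from the application.

```python
from typing import Callable, Optional, Tuple

Region = Tuple[int, int, int, int]  # (x, y, width, height): an assumed layout
# Hypothetical stand-in for a classification recognition model: it maps a
# frame to the region containing its target object, or None if absent.
Model = Callable[[object], Optional[Region]]


def recognize_target_region(model: Model, frame: object) -> Optional[Region]:
    """Target object recognition: locate the target object in the frame."""
    return model(frame)


def enlarge(region: Region, factor: float = 1.5) -> Region:
    """Assumed animation step: scale the region about its center."""
    x, y, w, h = region
    new_w, new_h = round(w * factor), round(h * factor)
    return (x - (new_w - w) // 2, y - (new_h - h) // 2, new_w, new_h)


def process_target_frame(model: Model, frame: object) -> Optional[Region]:
    """Target area processing: animate the target region, if one was found."""
    region = recognize_target_region(model, frame)
    return None if region is None else enlarge(region)
```

  • A frame in which the model finds no target object is passed through without animation, which is one plausible reading of the case in which no image region is identified.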
  • An embodiment of the present application provides a video processing apparatus, including: a processor and a memory;
  • the processor is connected to a memory, where the memory is used to store program code, and the processor is used to call the program code to execute the video processing method provided in the embodiment of the present application.
  • An embodiment of the present application provides a computer storage medium that stores a computer program. The computer program includes program instructions that, when executed by a processor, perform the video processing method provided in the embodiments of the present application.
  • FIG. 1 is a schematic structural diagram of a network architecture according to an embodiment of the present application.
  • FIG. 2 is a schematic flowchart of a video processing method according to an embodiment of the present application.
  • FIG. 3 is a schematic diagram of acquiring barrage data according to an embodiment of the present application.
  • FIG. 4 is another schematic diagram of acquiring barrage data provided by an embodiment of the present application.
  • FIG. 5 is a schematic flowchart of another video processing method according to an embodiment of the present application.
  • FIG. 6 is a schematic diagram of displaying barrage data on multiple video frames according to an embodiment of the present application.
  • FIG. 7 is a schematic diagram of feature extraction provided by an embodiment of the present application.
  • FIG. 8 is a schematic diagram of selecting an optimal candidate region according to an embodiment of the present application.
  • FIG. 9 is a schematic diagram of an optimal video frame for modifying the target video frame according to an embodiment of the present application.
  • FIG. 10 is a schematic structural diagram of a video processing apparatus according to an embodiment of the present application.
  • FIG. 11 is a schematic structural diagram of another video processing apparatus according to an embodiment of the present application.
  • The embodiments of the present application propose a video processing method that can enrich the visual display effect of the barrage data and avoid the waste of device resources and network resources incurred in identifying and capturing the barrage data.
  • Refer to FIG. 1, which is a schematic structural diagram of a network architecture according to an embodiment of the present application.
  • The network architecture may include a server cluster and a client terminal cluster.
  • The client terminal cluster may include multiple client terminals, shown in FIG. 1 as client terminal 3000a, client terminal 3000b, ..., client terminal 3000n.
  • The server cluster may include a barrage server 2000a and a video source server 2000b.
  • the barrage server 2000a is configured to store barrage data in a preset time period as historical barrage data.
  • the video source server 2000b is used to store multiple video data sources.
  • the client terminal 3000a, the client terminal 3000b, ..., and the client terminal 3000n may be connected to the server cluster through a network, respectively.
  • For ease of understanding, a client terminal in the client terminal cluster may be selected as the target client terminal (client terminal 3000a is used as an example) to describe the data interaction between the client terminal 3000a and the barrage server 2000a and the video source server 2000b. While the target client terminal (client terminal 3000a) is playing video data (the data returned by the video source server 2000b in response to a video download request sent by the client terminal 3000a), it may send a barrage acquisition request to the barrage server 2000a based on the current playback progress of the video data, so that the barrage server 2000a returns historical barrage data based on the barrage acquisition request.
  • The historical barrage data may be text input data that other users entered on their own client terminals (for example, client terminal 3000b) at the current playback progress. In this case, the user of client terminal 3000a (for example, user A) and the user of client terminal 3000b (for example, user B) are watching the video data synchronously, so client terminal 3000a can synchronously display the barrage data that client terminal 3000b uploads to the barrage server 2000a at the current playback progress.
  • the barrage server 2000a may store the barrage data in the barrage database as historical barrage data corresponding to the current playback progress, and may return the historical barrage data based on the received barrage acquisition request.
  • The historical barrage data may also include barrage data that other client terminals (for example, client terminal 3000c) uploaded and that the barrage server received and stored over a period of time. In this case, compared with the client terminal 3000a and the client terminal 3000b, which play the video data synchronously, the client terminal 3000c has an earlier playback time stamp for the video data.
  • For example, the historical barrage data may be barrage data that the client terminal 3000c uploaded to the barrage server 2000a one hour ago (for example, one hour ago the client terminal 3000c may have taken the text input data obtained at a playback progress of 10% as barrage data and uploaded it to the barrage server 2000a). Therefore, when the client terminal 3000a and the client terminal 3000b, which play the video data synchronously, reach a playback progress of 10%, they can synchronously obtain the historical barrage data corresponding to that playback progress from the barrage server 2000a and use it as the barrage data corresponding to the video data.
  • the period of time may be in units of time such as minutes, hours, days, months, and years, which will not be specifically limited here.
  • When the barrage server 2000a receives the barrage acquisition request sent by the target client terminal, it looks up the historical barrage data corresponding to the request in the barrage database and sends it to the target client terminal.
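  • The store-then-look-up behaviour of the barrage server described above can be sketched as follows. This is only an illustrative in-memory model: the `Barrage` fields, the `BarrageDatabase` class, the video identifier, and the use of an exact integer percentage as the lookup key are assumptions made for the example, not details given in the application.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Barrage:
    terminal: str      # uploading client terminal, e.g. "3000c"
    text: str          # text input data entered by the user
    progress_pct: int  # playback progress (percent) when it was entered


class BarrageDatabase:
    """Hypothetical in-memory stand-in for the barrage database."""

    def __init__(self) -> None:
        # (video id, playback progress) -> historical barrage data
        self._history: Dict[Tuple[str, int], List[Barrage]] = {}

    def store(self, video_id: str, barrage: Barrage) -> None:
        """Keep uploaded barrage data as historical barrage data."""
        key = (video_id, barrage.progress_pct)
        self._history.setdefault(key, []).append(barrage)

    def acquire(self, video_id: str, progress_pct: int) -> List[Barrage]:
        """Answer a barrage acquisition request for a playback progress."""
        return list(self._history.get((video_id, progress_pct), []))
```

  • For instance, after `store("video-1", Barrage("3000c", "sample text", 10))`, a later call to `acquire("video-1", 10)` returns that entry, while `acquire("video-1", 20)` returns an empty list.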
  • The user of the target client terminal can also see, in real time on the video playback interface, the text input data that he or she enters. That is, when the target client terminal receives text input data entered by the user, it uses the text input data as the barrage data corresponding to the currently played video data and displays the barrage data on the playback interface corresponding to the video data.
  • The target client terminal may also upload the barrage data to the barrage server 2000a with which it has a network connection, so that the barrage server 2000a stores and/or forwards the barrage data. The barrage server 2000a may store the barrage data as historical barrage data corresponding to the current playback progress of the video data, and may also send the barrage data to other client terminals that are watching the video data synchronously.
  • The target keyword information in the barrage data may be extracted in the background of the target client terminal. Based on the target keyword information, the target client terminal can identify the target object corresponding to the target keyword in the currently playing video data and then animate the target region where the target object is located in the target video frame. This enriches the visual display effect of the barrage data and avoids the waste of device resources and network resources incurred in identifying and capturing the barrage data.
  • The target client terminal thus extracts target keyword information from the barrage data, identifies the target object corresponding to the target keyword in the target video frame, and performs animation processing on the target region corresponding to the target object.
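  • A minimal sketch of the keyword extraction step follows, assuming the key information database can be modelled as a mapping from each user-set keyword to the name of its classification recognition model. The database contents, the function name, and the use of a case-insensitive substring test are illustrative assumptions; a real client would more likely split the barrage text into words first, since a substring test can also match inside unrelated words.

```python
from typing import Dict, Optional, Tuple

# Hypothetical key information database: keyword set by the user -> name of
# the classification recognition model for the corresponding target object.
KEY_INFO_DB: Dict[str, str] = {
    "child": "child_model",
    "cat": "cat_model",
}


def match_target_keyword(
    barrage_text: str, key_info_db: Dict[str, str]
) -> Optional[Tuple[str, str]]:
    """Return (target keyword, model name) for the first keyword in the text."""
    lowered = barrage_text.lower()
    for keyword, model_name in key_info_db.items():
        if keyword in lowered:  # simple substring test (an assumption)
            return keyword, model_name
    return None  # no keyword in the database matches this barrage data
```

  • With this database, the barrage text "look at that child" yields the target keyword "child" together with its model name, and a barrage that mentions no stored keyword yields `None`, in which case no recognition is attempted.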
  • FIG. 2 is a schematic flowchart of a video processing method according to an embodiment of the present application. As shown in FIG. 2, the method may be executed by a client terminal and includes steps S101 to S104:
  • the client terminal may obtain the barrage data corresponding to the video data during the playback of the video data; the barrage data may be historical barrage data returned by the barrage server, or may be the client terminal Text input data input by the corresponding user on the playback interface corresponding to the video data; then, the client terminal may display the barrage data on the playback interface corresponding to the video data.
  • the client terminal may be a target client terminal in the embodiment corresponding to FIG. 1.
  • the client terminal includes a personal computer, a tablet computer, a notebook computer, a smart TV, a smart phone, and other terminal devices that carry video data playback functions.
  • The barrage server may be the barrage server 2000a in the embodiment corresponding to FIG. 1. It may be used to store the text input data that each user enters on the corresponding client terminal for the currently played video data; that is, it may store the barrage data uploaded by each client terminal. Each piece of barrage data may further be stored according to the playback progress of the video data, so that when a user watching the video data turns on the barrage on the client terminal, the corresponding barrage data is acquired based on the playback progress of the video data and displayed.
  • The specific process in which the client terminal acquires and displays the barrage data may include: playing the video data, sending a barrage acquisition request to the barrage server, receiving the historical barrage data returned by the barrage server based on the barrage acquisition request, using the historical barrage data as the barrage data corresponding to the video data, and displaying the barrage data on a playback interface of the video data.
  • For example, take the client terminal 3000a in the embodiment corresponding to FIG. 1 as the client terminal. Refer to FIG. 3, which is a schematic diagram of acquiring barrage data according to an embodiment of the present application.
  • The client terminal 3000a can obtain the barrage data corresponding to the video data when the barrage is turned on.
  • The specific process of acquiring the barrage data may be as follows: based on the current playback progress of the video data (20% in FIG. 3), the client terminal 3000a sends a barrage acquisition request to the barrage server 2000a shown in FIG. 3 and receives the historical barrage data that the barrage server 2000a returns based on the request. The historical barrage data may be the barrage data uploaded by the client terminal 3000b in the embodiment corresponding to FIG. 1 when its playback progress of the video data reached 20%; the barrage server 2000a may store that barrage data as historical barrage data in the barrage database shown in FIG. 3. The client terminal 3000a may then use the received historical barrage data as the barrage data corresponding to the video data and display it on the playback interface of the video data (the playback interface 100a shown in FIG. 3).
  • The client terminal 3000a and the client terminal 3000b in the embodiment of the present application do not play the video data synchronously. Therefore, the client terminal 3000a can obtain, from the barrage server 2000a, the barrage data uploaded by the client terminal 3000b, based on the playback progress of the video data.
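  • The request and response flow of FIG. 3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the `BarrageServer` class, its `upload` and `acquire` methods, the `tolerance` parameter, and the video identifier are all hypothetical names introduced here.

```python
from bisect import bisect_left, bisect_right
from dataclasses import dataclass, field


@dataclass
class BarrageServer:
    """Hypothetical in-memory stand-in for the barrage server and its database."""
    # video_id -> list of (progress_percent, text), kept sorted by progress
    database: dict = field(default_factory=dict)

    def upload(self, video_id: str, progress: float, text: str) -> None:
        """Store barrage data uploaded at a given playback progress as historical data."""
        entries = self.database.setdefault(video_id, [])
        entries.append((progress, text))
        entries.sort(key=lambda entry: entry[0])

    def acquire(self, video_id: str, progress: float, tolerance: float = 0.5) -> list:
        """Return historical barrage texts whose progress lies within `tolerance` percent."""
        entries = self.database.get(video_id, [])
        keys = [p for p, _ in entries]
        lo = bisect_left(keys, progress - tolerance)
        hi = bisect_right(keys, progress + tolerance)
        return [text for _, text in entries[lo:hi]]


server = BarrageServer()
# Client terminal 3000b uploads a barrage when its playback reaches 20%.
server.upload("video-1", 20.0, "look at that child")
# Client terminal 3000a later reaches 20% and requests the historical barrage data.
history = server.acquire("video-1", 20.0)
```

  • Keying the stored entries by playback progress is what lets two terminals that do not play the video synchronously still see the same barrage at the same point in the video.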
  • For the specific process by which the client terminal 3000b uploads barrage data, refer to FIG. 4, which is another schematic diagram of obtaining barrage data provided by the embodiment of the present application.
  • the user B and the user C simultaneously watch the same video data on different client terminals, where the user B holds the client terminal 3000b and the user C holds the client terminal 3000c.
  • When the client terminal (here, the client terminal 3000b shown in FIG. 4) detects a barrage trigger operation corresponding to the text input data, it uses the text input data as the barrage data of the video data and displays the barrage data, based on the barrage track, on the playback interface of the video data (i.e., the playback interface 200a).
  • the barrage track is used to represent position information of the barrage data on the playback interface 200a (for example, the barrage data may be displayed on a first line of the barrage track).
  • The client terminal 3000b may further send the barrage data (i.e., "look at that child") to the barrage server 2000a shown in FIG. 4.
  • The barrage server 2000a may send the barrage data (i.e., "look at that child") to the client terminal 3000c, which watches the video data synchronously, so that the client terminal 3000c displays the barrage data shown in FIG. 4 on its playback interface 300a of the video data.
  • The barrage data corresponding to the video data will be displayed simultaneously on the playback interfaces of the two client terminals (i.e., the client terminal 3000b and the client terminal 3000c).
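  • The synchronous case of FIG. 4 can be sketched in the same spirit. The `LiveBarrageServer` and `Client` classes and their methods are hypothetical names for illustration only; a real deployment would use network connections rather than direct method calls.

```python
class LiveBarrageServer:
    """Hypothetical relay: forwards new barrage data to clients watching in sync."""

    def __init__(self):
        self.viewers = {}  # video_id -> list of client objects

    def join(self, video_id, client):
        self.viewers.setdefault(video_id, []).append(client)

    def send(self, video_id, sender, text):
        # Forward to every synchronized viewer except the sender,
        # which has already displayed the barrage locally.
        for client in self.viewers.get(video_id, []):
            if client is not sender:
                client.display(text)


class Client:
    def __init__(self, name):
        self.name = name
        self.displayed = []  # barrage shown on this client's playback interface

    def display(self, text):
        self.displayed.append(text)

    def post(self, server, video_id, text):
        self.display(text)                 # show on own playback interface first
        server.send(video_id, self, text)  # then upload to the barrage server


server = LiveBarrageServer()
client_b, client_c = Client("3000b"), Client("3000c")
server.join("video-1", client_b)
server.join("video-1", client_c)
client_b.post(server, "video-1", "look at that child")
```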
  • The specific process in which the client terminal acquires and displays the barrage data may also include: playing the video data and obtaining text input data, using the text input data as the barrage data corresponding to the video data, displaying the barrage data based on the barrage track on the playback interface of the video data, and sending the barrage data to the barrage server, so that the barrage server sends the barrage data synchronously to the client terminals watching the video data.
  • When the client terminal is the client terminal 3000a, the barrage data obtained by the client terminal is the historical barrage data returned by the barrage server based on the barrage acquisition request sent by the client terminal; that is, the client terminal may use the historical barrage data as the barrage data. Optionally, when the client terminal is the client terminal 3000b, the barrage data obtained by the client terminal is the text input data that user B enters on the text input interface of the client terminal 3000b; that is, the client terminal may use the text input data as the barrage data.
  • Step S102 obtaining keyword information matching the barrage data in a key information database as target keyword information
  • The client terminal may obtain a key information database, split the barrage data into multiple pieces of word segmentation data using word segmentation technology, and traverse the key information database to find keyword information that matches each piece of word segmentation data. If the client terminal finds keyword information that matches a piece of word segmentation data, it may use that keyword information as the target keyword information corresponding to the barrage data, and may further obtain, from the key information database, the classification recognition model of the target object corresponding to the target keyword information.
  • the key information database includes keyword information set by a user, and a classification and recognition model of a target object corresponding to each keyword information.
  • the keyword information may be “flower”, “tree”, “
  • A classification recognition model of the target object (tree) corresponding to the keyword information "tree" may be stored in the key information database; that is, a large number of contour features of the tree exist in the key information database.
  • the word segmentation technology means that the client terminal can perform word segmentation processing on the barrage data to split the barrage data into a plurality of word segmentation data.
  • the barrage data is “that flower”
  • The client terminal can obtain the following four pieces of word segmentation data through the word segmentation technology: "that flower", "flower", "true", "beautiful".
  • the client terminal may further traverse the key information database to find keyword information that matches the four segmentation data respectively.
  • Taking the keyword information "flower" set by the user in the key information database as an example, the client terminal can find, in the key information database, the keyword information that matches the word segmentation data "flower".
  • The client terminal may further obtain, from the key information database, the classification recognition model of the target object corresponding to the target keyword information (flower); that is, a large number of contour features of the target object (flower) may be obtained, so that the client terminal can further execute step S103.
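  • Step S102 can be illustrated with a small sketch. It assumes a toy whitespace segmenter in place of a real word segmentation library, an English example sentence, and a key information database modelled as a plain dictionary whose values are placeholder strings standing in for classification recognition models; all of these names are hypothetical.

```python
def segment(barrage: str) -> list:
    """Toy segmenter: lowercase, replace punctuation with spaces, split on whitespace.
    A real system would use a proper word segmentation library."""
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " "
                      for ch in barrage.lower())
    return cleaned.split()


# Hypothetical key information database: keyword -> classification recognition model.
# Each model is represented by a placeholder string here.
KEY_INFO_DB = {
    "flower": "flower-contour-model",
    "tree": "tree-contour-model",
}


def match_keywords(barrage: str, db: dict) -> dict:
    """Return {target keyword: model} for every segment found in the database."""
    return {word: db[word] for word in segment(barrage) if word in db}


targets = match_keywords("That flower is really beautiful!", KEY_INFO_DB)
```

  • Barrage data that contains no keyword from the database yields an empty result, in which case there is no target object to look for in the video frames.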
  • The embodiment of the present application takes a single piece of target keyword information as an example to describe the following steps S103 and S104 in detail.
  • Each piece of target keyword information corresponds to the classification recognition model of one target object. For example, if the target keyword information is "cat" and "dog", the classification recognition model of cats and the classification recognition model of dogs can both be obtained from the key information database. Therefore, the number of pieces of target keyword information corresponding to the video data is not limited here, and for the process of identifying, in the target video frame, the target objects corresponding to multiple pieces of target keyword information, reference may be made to the process, described in the embodiment of the present application, of identifying in the target video frame the target object corresponding to one piece of target keyword information.
  • Step S103 Obtain a target video frame from a plurality of video frames of the video data, and identify a target object corresponding to the target keyword information in the target video frame based on a classification recognition model corresponding to the target keyword information.
  • The client terminal may obtain a target video frame from the plurality of video frames of the video data, where the target video frame is a video frame within a preset time period before and after the barrage data appears, for example, a video frame within 3 seconds before and after the barrage data appears. The client terminal further divides the target video frame into a plurality of sub-regions, performs a selective search on each sub-region, merges the sub-regions obtained by the selective search into a plurality of merged regions, and determines the sub-regions and the merged regions as regions to be processed.
  • The client terminal may further perform feature extraction on the regions to be processed based on a neural network model to obtain the image features corresponding to the regions to be processed. Then, based on the image features and the classification recognition model corresponding to the target keyword information, the client terminal may generate the recognition probability corresponding to each region to be processed and, according to the recognition probability, select among the regions to be processed the candidate regions that contain the target object corresponding to the target keyword information.
  • the client terminal may optimally select the candidate region corresponding to the target video frame based on the regression model, and determine the optimal candidate region corresponding to the selected target video frame as target area.
  • The client terminal displays the barrage data on the playback interface corresponding to the video data, and the barrage data is displayed dynamically in the barrage track of that playback interface, so that the barrage data corresponding to the video data can be displayed on different video frames of the video data. That is, during the period in which the barrage data is dynamically displayed (defined in this application as the barrage display time period), the video frames in the video stream corresponding to the barrage data (that is, the video data corresponding to the barrage data) are also played dynamically in chronological order.
  • the client terminal will obtain the target video frame in each video frame corresponding to the barrage data.
  • the client terminal may split the video data corresponding to the barrage data into multiple video frames, and may further use the currently played video frame as the target video frame in the multiple video frames, so that the client terminal can further A target object corresponding to the target keyword information is identified in the target video frame.
  • The client terminal may identify the target object corresponding to the target keyword information in the plurality of video frames corresponding to the barrage data within the barrage display time period. For the specific process of identifying the target object in each of these video frames, refer to the process by which the client terminal identifies the target object corresponding to the target keyword information in the target video frame.
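  • The preset time window around the appearance of the barrage data (for example, 3 seconds before and after) can be turned into frame indices as sketched below. The function name, the frame rate, and the clip length are assumptions chosen for illustration.

```python
def frames_in_window(barrage_time: float, fps: float, total_frames: int,
                     window_seconds: float = 3.0) -> range:
    """Indices of frames within `window_seconds` before and after the barrage appears."""
    first = max(0, int((barrage_time - window_seconds) * fps))
    last = min(total_frames - 1, int((barrage_time + window_seconds) * fps))
    return range(first, last + 1)


# A barrage appears 10 s into a 25 fps clip that is 60 s long (1500 frames).
window = frames_in_window(barrage_time=10.0, fps=25.0, total_frames=1500)
```

  • Each index in `window` is a candidate target video frame; clamping to the valid range handles barrage data that appears near the start or end of the video.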
  • The neural network model may be a convolutional neural network (CNN) model, or a combination of a CNN model and a recurrent neural network (RNN) model. The neural network model may be used to perform feature extraction on all regions to be processed that are input to it, to obtain the image features corresponding to the regions to be processed.
  • The client terminal first needs to divide the target video frame to obtain multiple sub-regions corresponding to the target video frame, perform a selective search on each sub-region, and merge the sub-regions obtained by the selective search to obtain multiple merged regions (including the merged regions produced by repeated merging). The client terminal may then use all sub-regions and all merged regions as regions to be processed, and perform feature extraction on them through the neural network model to obtain the image features corresponding to the regions to be processed.
  • the specific process for the client terminal to extract image features corresponding to the to-be-processed region is:
  • The client terminal performs convolution processing through a neural network model (for example, a convolutional neural network model, i.e., a CNN model). That is, the client terminal may randomly select a small part of the feature information in the region to be processed as a sample (i.e., a convolution kernel) and slide this sample as a window across the region to be processed, performing a convolution operation between the sample and the region to be processed to obtain the spatial feature information of the region to be processed.
  • After the convolution operation, the spatial feature information of the region to be processed is obtained, but the amount of this spatial feature information is huge. Pooling processing based on the convolutional neural network model can therefore be used to compute aggregate statistics of the spatial feature information.
  • the amount of aggregated spatial feature information is much lower than the amount of spatial feature information extracted by the convolution operation, and it will also improve the subsequent classification effect (that is, the effect of identifying the target object).
  • Commonly used pooling methods mainly include the average pooling method and the maximum pooling method.
  • the average pooling method is to calculate an average feature information in a feature information set to represent the features of the feature information set; the maximum pooling operation is to extract the largest feature information in a feature information set to represent the features of the feature information set. It can be seen that by using the above method, the spatial feature information of all regions to be processed can be extracted, and the spatial feature information can be used as image features corresponding to the regions to be processed.
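  • The convolution and pooling operations described above can be sketched in plain Python. The 5x5 region and the 2x2 kernel are illustrative values, not learned parameters, and the kernel is applied without flipping, as is usual in CNN implementations.

```python
def convolve2d(region, kernel):
    """Slide `kernel` over `region` (valid positions only) and sum the products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(region) - kh + 1, len(region[0]) - kw + 1
    return [[sum(region[i + di][j + dj] * kernel[di][dj]
                 for di in range(kh) for dj in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]


def pool2d(features, size, mode="max"):
    """Aggregate non-overlapping size x size blocks by their maximum or average."""
    pooled = []
    for i in range(0, len(features) - size + 1, size):
        row = []
        for j in range(0, len(features[0]) - size + 1, size):
            block = [features[i + di][j + dj]
                     for di in range(size) for dj in range(size)]
            row.append(max(block) if mode == "max" else sum(block) / len(block))
        pooled.append(row)
    return pooled


region = [[1, 2, 0, 1, 3],
          [0, 1, 3, 1, 0],
          [2, 1, 0, 0, 1],
          [1, 0, 2, 3, 1],
          [0, 1, 1, 0, 2]]
kernel = [[1, 0],
          [0, 1]]  # illustrative 2x2 kernel

features = convolve2d(region, kernel)     # 4x4 spatial feature map
max_pooled = pool2d(features, 2, "max")   # 2x2 after maximum pooling
avg_pooled = pool2d(features, 2, "avg")   # 2x2 after average pooling
```

  • Pooling reduces the 16 values of the feature map to 4, which is the reduction in the amount of spatial feature information that the text refers to.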
  • The client terminal may further perform time series processing through a recurrent neural network model (RNN model). That is, in the forget gate of the recurrent neural network model, the processor first calculates the information that needs to be removed from the cell state; then, in the input gate, the processor calculates the information that needs to be stored in the cell state; finally, in the output gate, the cell state is updated, that is, the processor multiplies the old cell state by the information that needs to be removed and then adds the information that needs to be stored to obtain the new cell state.
  • the spatial feature information of the region to be processed can be extracted with the linear effect between the states of multiple units, so as to extract the spatiotemporal feature information hidden in the region to be processed. It can be seen that with the above method, the spatio-temporal feature information of all regions to be processed can be extracted, and the spatio-temporal feature information is referred to as the image feature corresponding to each region to be processed.
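  • The gate computation described above resembles a long short-term memory (LSTM) cell. The sketch below uses the standard scalar LSTM formulation, which is an assumption about what the text intends; the weights and inputs are arbitrary illustrative values, not trained parameters.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, w):
    """One scalar LSTM step. `w` maps gate name -> (w_x, w_h, bias)."""
    def gate(name, activation):
        w_x, w_h, b = w[name]
        return activation(w_x * x + w_h * h_prev + b)

    f = gate("forget", sigmoid)       # how much of the old cell state to keep
    i = gate("input", sigmoid)        # how much new information to store
    g = gate("candidate", math.tanh)  # the new information itself
    o = gate("output", sigmoid)       # how much of the cell state to expose

    c = f * c_prev + i * g            # scale the old state, then add the new information
    h = o * math.tanh(c)
    return h, c


weights = {name: (0.5, 0.5, 0.0)
           for name in ("forget", "input", "candidate", "output")}
h, c = 0.0, 0.0
for x in [0.2, 0.7, 0.1]:  # e.g. one feature value per time step
    h, c = lstm_step(x, h, c, weights)
```

  • The line `c = f * c_prev + i * g` is the update the text describes: the old cell state is multiplied by the forget gate output, and the information selected by the input gate is added to it.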
  • the recognition probability is a probability used to indicate that the target object is included in the region to be processed.
  • The client terminal only needs to perform a selective search, according to a selective search algorithm, among the divided sub-regions to obtain the sub-regions after the selective search. There is a certain correlation between these sub-regions (for example, similar textures or similar colors). Therefore, the selective search algorithm allows the client terminal to greatly reduce the search area, thereby improving the efficiency of recognizing the target object.
  • Merging the sub-regions after the selective search means that the client terminal can merge two adjacent sub-regions based on a merge rule (for example, similar textures or similar colors), and, in the course of merging, perform multiple rounds of merging according to the number of sub-regions after the selective search until a merged region carrying the complete image is obtained.
  • The client terminal may divide the target video frame into multiple sub-regions in the background (for example, into 1000 sub-regions; that is, the client terminal may split the video frame image corresponding to the target video frame into multiple image blocks; the division into sub-regions is not visible to the user). Subsequently, the client terminal may perform a selective search on each sub-region to obtain a plurality of sub-regions after the selective search (for example, among the 1000 divided sub-regions, the client terminal may select 500 sub-regions carrying image features as the sub-regions after the selective search). The sub-regions after the selective search may then be merged; that is, the client terminal may merge two adjacent sub-regions according to a merge rule such as color or texture to obtain multiple merged regions.
  • the client terminal may repeatedly merge the merged merged regions based on the merge rule to obtain a merged region containing a complete image.
  • The client terminal may determine the multiple sub-regions and the multiple merged regions all as regions to be processed. That is, the client terminal may take every image region of the target video frame that has appeared during this process as a region to be processed, and input these regions to be processed into the neural network model.
  • The embodiment of the present application takes as an example the case in which a selective search over the sub-regions of the target video frame yields eight sub-regions.
  • the client terminal may split the target video frame into 100 sub-regions.
  • the client terminal may further perform a selective search on the 100 sub-regions by using a selective search algorithm to obtain the following 8 sub-regions after selective search: a-b-c-d-e-f-g-h.
  • the client terminal may merge the eight sub-areas based on the above-mentioned merge rule (that is, merging two adjacent sub-areas).
  • The merged regions after the first merge may be ab-cd-ef-gh, the merged regions after the second merge may be abcd-efgh, and the merged region after the third merge may be abcdefgh. At this point, a merged region containing the complete image has been obtained.
  • The regions to be processed include: the 100 split sub-regions, the 8 sub-regions after the selective search (a-b-c-d-e-f-g-h), the four merged regions after the first merge (ab-cd-ef-gh), the two merged regions after the second merge (abcd-efgh), and the one merged region after the third merge (abcdefgh), that is, a total of 115 regions (100 + 8 + 4 + 2 + 1).
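  • The count of 115 regions can be reproduced with a short sketch. It assumes the fixed neighbour pairing used in the example (a with b, c with d, and so on); an actual selective search would choose which adjacent regions to merge by similarity of color or texture.

```python
def merge_levels(regions):
    """Repeatedly merge adjacent pairs until one region remains; return every level."""
    levels = []
    current = list(regions)
    while len(current) > 1:
        # Pair neighbours (0,1), (2,3), ...; an unpaired last region is carried over.
        current = ["".join(current[k:k + 2]) for k in range(0, len(current), 2)]
        levels.append(current)
    return levels


searched = list("abcdefgh")      # 8 sub-regions kept by the selective search
levels = merge_levels(searched)  # three rounds of merging
split_count = 100                # sub-regions from the initial split

total = split_count + len(searched) + sum(len(level) for level in levels)
```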
  • The client terminal may further perform feature extraction on the regions to be processed based on a neural network model (which may be a CNN model or a CNN + RNN model); that is, the client terminal may input the regions to be processed into the neural network model to output the image features corresponding to them (for example, for 115 regions to be processed, the image features corresponding to each of the 115 regions may be obtained). Then, based on the image features and the classification recognition model corresponding to the target keyword information, the client terminal may generate the recognition probability corresponding to each region to be processed (for example, if the target keyword information is the child in the barrage data "look at that child", the classification recognition model contains a large number of contour features corresponding to the child).
  • Each region to be processed corresponds to one recognition probability. Therefore, the client terminal may further select, according to the recognition probability, the candidate regions containing the target object corresponding to the target keyword information among the regions to be processed (that is, the client terminal may take a region to be processed whose recognition probability is greater than a probability threshold as a candidate region of the target object corresponding to the target keyword information). The candidate region then carries image features that can completely characterize the target object.
  • The client terminal has determined 115 regions to be processed; by inputting these 115 regions into the neural network model, it can obtain the image features corresponding to the regions to be processed, that is, 115 image features.
  • The client terminal matches the 115 image features against the classification recognition model corresponding to the child; that is, the 115 image features are further input into the classifier corresponding to the neural network model (here, the classifier may be one built into the neural network model), which outputs the recognition probability corresponding to each of the 115 image features.
  • The client terminal can thus obtain the recognition probability corresponding to each of the 115 regions to be processed. Subsequently, the client terminal may select, from among the 115 regions to be processed, the regions whose recognition probability is greater than a probability threshold as candidate regions of the target object corresponding to the target keyword information. That is, once the client terminal has identified the target object in the target video frame, it can determine that the target video frame includes the target object.
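  • Selecting candidate regions by probability threshold can be sketched as follows. The region names, the probabilities, and the threshold value of 0.8 are invented for illustration; the source does not specify a threshold.

```python
def select_candidates(probabilities: dict, threshold: float = 0.8) -> list:
    """Keep regions whose recognition probability exceeds the threshold,
    highest probability first."""
    kept = [(region, p) for region, p in probabilities.items() if p > threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)


# Hypothetical recognition probabilities for a few of the regions to be processed.
recognition = {"ab": 0.12, "abcd": 0.91, "efgh": 0.35, "abcdefgh": 0.84}
candidates = select_candidates(recognition)

# The target video frame is considered to contain the target object
# as soon as at least one candidate region survives the threshold.
frame_contains_target = bool(candidates)
```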
  • the client terminal may optimally select a candidate region corresponding to the target video frame based on a regression model, and determine the selected optimal candidate region corresponding to the target video frame as a target region.
  • The regression model can be used to locate the position of the target object in the target video frame. That is, through the regression model, the client terminal can select the optimal candidate region corresponding to the target video frame from the candidate regions corresponding to the target object, and may further determine the optimal candidate region as the target region. It should be understood that the optimal candidate region is the image region, in the target video frame, of the target object corresponding to the target keyword information; therefore, the client terminal may determine the optimal candidate region as the target region.
  • The target video frame may be the currently played video frame, or a video frame other than the currently played one among the multiple video frames corresponding to the barrage data, for example, a video frame of the video data that has not yet been played. Therefore, the client terminal may identify the target object corresponding to the target keyword information in each video frame in sequence, following the time order of the video frames, and determine the target area of the target object in each video frame one by one. Alternatively, the client terminal may identify the target object in the multiple video frames corresponding to the barrage data at the same time; that is, the client terminal may use each of the multiple video frames as a target video frame and pre-process the video frames that have not yet been played, so as to identify the target object in these unplayed video frames in advance and obtain the target area of the target object in each video frame.
  • When the client terminal plays the target video frame in the video data, that is, when each video frame is played in chronological order, the target area corresponding to the target object may be displayed with animation in real time.
  • The embodiment of the present application only takes the currently played video frame as the target video frame, as an example, to identify the target object corresponding to the target keyword information in the target video frame. When each of the multiple video frames is used as the target video frame, the identification of the target object corresponding to the target keyword information in each video frame can still refer to the process, described in the embodiment of the present application, of identifying the target object in the currently played video frame, which will not be repeated here.
  • Step S104 when the target video frame in the video data is played, perform animation processing on the target region in the target video frame.
  • The video data corresponding to the barrage data is played dynamically; that is, the video frames in the video data corresponding to the barrage data are played one by one in chronological order. Therefore, when the target video frame among the plurality of video frames is played (that is, when each video frame in the video data corresponding to the barrage data is played in chronological order), the client terminal can perform animation processing on the target area in the target video frame (for example, the target area may be rendered, and the rendered target area may be enlarged and displayed).
  • the visual display effect of the barrage data can be enriched.
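  • One possible form of the enlargement mentioned in step S104 is sketched below: the target area's bounding box grows linearly over a few frames. The box coordinates, frame size, number of steps, and maximum scale are all hypothetical; the source does not prescribe a particular animation.

```python
def enlarge_box(box, scale, frame_w, frame_h):
    """Scale a (left, top, right, bottom) box about its centre and clamp it to the frame."""
    left, top, right, bottom = box
    cx, cy = (left + right) / 2, (top + bottom) / 2
    half_w, half_h = (right - left) * scale / 2, (bottom - top) * scale / 2
    return (max(0, cx - half_w), max(0, cy - half_h),
            min(frame_w, cx + half_w), min(frame_h, cy + half_h))


def animation_boxes(box, steps, max_scale, frame_w, frame_h):
    """Boxes for successive frames, growing linearly from 1.0x to `max_scale`.
    Requires steps >= 2."""
    return [enlarge_box(box, 1 + (max_scale - 1) * k / (steps - 1), frame_w, frame_h)
            for k in range(steps)]


target_region = (100, 100, 200, 180)  # hypothetical target area in a 640x360 frame
boxes = animation_boxes(target_region, steps=3, max_scale=1.5,
                        frame_w=640, frame_h=360)
```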
  • All key content in the barrage data (that is, all keyword information) can be further extracted based on the key information database. Through the classification recognition model of the target object corresponding to each piece of keyword information, the target object in a target video frame among the plurality of video frames of the video data can be identified, the specific position of the target object in the target video frame (that is, the target area of the target object in the target video frame) can be further determined, and the target area can be displayed with animation. Therefore, by associating the keyword information in the barrage data with the target object located in the target area of the target video frame, the visual display effect of the barrage data can be enriched.
  • FIG. 5 is a schematic flowchart of another video processing method according to an embodiment of the present application. As shown in FIG. 5, the method may include steps S201 to S208:
  • Step S201 play video data, and obtain barrage data corresponding to the video data
  • Step S202 Obtain keyword information matching the barrage data in the key information database as target keyword information
  • For specific implementations of steps S201 and S202, reference may be made to the description of steps S101 and S102 in the embodiment corresponding to FIG. 2 above, and details are not repeated here.
  • Step S203 Obtain a target video frame from a plurality of video frames of the video data
  • The client terminal may obtain a target video frame from the plurality of video frames of the video data corresponding to the barrage data. That is, while displaying the barrage data on the playback interface corresponding to the video data, the client terminal can obtain the video stream corresponding to the barrage data (the video stream consists of the multiple video frames corresponding to the barrage data, in order of playback time) and split the video stream into the multiple video frames corresponding to the barrage data, so that the client terminal may select one of these video frames as the target video frame.
  • FIG. 6 is a schematic diagram of displaying barrage data on multiple video frames according to an embodiment of the present application.
  • The barrage data corresponding to the video data can be displayed on different video frames of the video data, producing the dynamic display effect of the barrage data shown in FIG. 6.
  • The barrage data corresponding to the video data is displayed based on the barrage track shown in FIG. 6 (that is, the barrage data can be displayed from right to left in the barrage display area shown in FIG. 6; it should be understood that the barrage display area shown in FIG. 6 is virtual to the user).
  • The video frame at time T is the currently played video frame in the video data, and the video frames at times T + 1, T + 2, and T + 3 are the video frames that will be played next in the video data, in that order. The client terminal may use the video frame at time T as the target video frame, to identify the target object corresponding to the target keyword information (for example, a cat) in the target video frame corresponding to the barrage data.
  • each of the remaining video frames is sequentially used as a target video frame in order to identify a target object corresponding to the target keyword information (cat) in the remaining video frames. That is, for each video frame, steps S203 to S207 need to be executed cyclically based on the progress of the current playback time to identify the target object corresponding to the target keyword information in each video frame in the video data.
  • The target video frame may also be multiple video frames corresponding to the barrage data; that is, the client terminal may also use multiple video frames in the video data corresponding to the barrage data together as target video frames, to identify the target object corresponding to the target keyword information (cat) in them. After performing step S203, the client terminal can then perform the following steps S204 to S207 on each video frame synchronously, and thereby identify the target object corresponding to the target keyword information in each video frame. Therefore, for the video frames in the video data that have not yet been played, the client terminal may pre-process them in accordance with the following steps S204 to S207.
  • The embodiment of the present application only takes the case in which the target video frame is the currently played video frame among the multiple video frames as an example of identifying the target object corresponding to the target keyword information in the target video frame. When one of the remaining video frames among the multiple video frames is determined as the target video frame, the process of identifying the target object corresponding to the target keyword information in that video frame may refer to the specific process, listed in the embodiment of the present application, of identifying the target object in the currently played video frame (that is, the target video frame).
  • Step S204 Divide the target video frame into a plurality of sub-regions, perform a selective search on each sub-region, merge the sub-regions after the selective search to obtain a plurality of merged regions, and determine the plurality of sub-regions and the plurality of merged regions all as regions to be processed;
  • Merging the sub-regions after the selective search means that the client terminal can merge two adjacent sub-regions based on a merge rule (for example, similar textures or similar colors), and, in the course of merging, perform multiple rounds of merging according to the number of sub-regions after the selective search until a merged region carrying the complete image is obtained.
  • For the determination of the regions to be processed, refer to the description of the regions to be processed in the embodiment corresponding to FIG. 2 above; that is, the client terminal may determine both the multiple merged regions obtained after multiple merges and the multiple sub-regions obtained by splitting as regions to be processed.
  • For the specific process by which the client terminal splits the target video frame, performs a selective search on the split sub-regions, and merges the sub-regions after the selective search, refer to the description of the multiple sub-regions and the multiple merged regions in the embodiment corresponding to FIG. 2 above, which will not be repeated here.
  • Step S205 Perform feature extraction on the region to be processed based on a neural network model to obtain image features corresponding to the region to be processed;
  • The client terminal may scale the image blocks in the regions to be processed to the same size, use the regions to be processed having the same size as the input of the neural network model, and output, through the neural network model, the image features corresponding to the image blocks in the regions to be processed.
  • FIG. 7 is a schematic diagram of feature extraction provided by an embodiment of the present application.
  • The client terminal may perform image processing, in the image processing region C shown in FIG. 7, on the image blocks in region A to be processed and the image blocks in region B to be processed corresponding to the target video frame; that is, the image blocks in region A to be processed and the image blocks in region B to be processed can be scaled to the same size to ensure the accuracy of image feature extraction for the image blocks in each region to be processed.
  • The client terminal may further take the regions to be processed having the same size as an input of a neural network model, and output the image features corresponding to the image blocks in the regions to be processed through the neural network model (that is, the client terminal can obtain the image features of region A to be processed and the image features of region B to be processed as shown in FIG. 7).
  • Region A to be processed and region B to be processed listed in the embodiments of the present application are only some of the regions to be processed corresponding to the target video frame. The client terminal performs image processing on the image blocks in all regions to be processed, and uses all regions to be processed, which have the same size after the image processing, as inputs to the neural network model, so that the neural network model outputs the image features corresponding to the image blocks in each region to be processed (for example, if there are 1000 regions to be processed, image features corresponding one-to-one to the image blocks in the 1000 regions to be processed are extracted).
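  • The scaling step that precedes feature extraction can be sketched as follows. This is only an illustration under stated assumptions: image blocks are plain nested lists of pixel values, and nearest-neighbor resampling is used because the application does not name a scaling method; the function name `resize_nearest` is hypothetical. The neural network itself is not modeled here.

```python
def resize_nearest(block, out_h, out_w):
    """Scale a 2D image block to (out_h, out_w) by nearest-neighbor sampling."""
    in_h, in_w = len(block), len(block[0])
    return [[block[r * in_h // out_h][c * in_w // out_w]
             for c in range(out_w)]
            for r in range(out_h)]

# Two image blocks of different sizes, standing in for regions A and B.
block_a = [[1, 2],
           [3, 4]]
block_b = [[5], [6], [7]]

# After scaling, both blocks have the same size and could be fed to one model.
scaled_a = resize_nearest(block_a, 4, 4)
scaled_b = resize_nearest(block_b, 2, 2)
```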
  • Step S206: generate a recognition probability corresponding to the region to be processed based on the image features and the classification recognition model corresponding to the target keyword information, and select, according to the recognition probability, a candidate region containing the target object corresponding to the target keyword information from the regions to be processed.
  • the recognition probability is a probability used to indicate that the target object is included in the region to be processed.
  • For example, if the target keyword information of the barrage data is "cat", the classification recognition model corresponding to the target keyword information includes multiple types of trained contour features corresponding to cats. Therefore, the client terminal may, through a classifier in the neural network model (that is, the classification recognition model, for example, a support vector machine), compare the image features corresponding to region A to be processed and the image features corresponding to region B to be processed shown in FIG. 7 with the contour features in the classification recognition model, to obtain the recognition probability (for example, 90%) corresponding to region A to be processed and the recognition probability (for example, 40%) corresponding to region B to be processed.
  • The client terminal may further determine, according to the recognition probability corresponding to region A to be processed and the recognition probability corresponding to region B to be processed, that the candidate region of the target object corresponding to the target keyword information is region A to be processed; that is, the client terminal may determine region A to be processed, whose recognition probability is greater than a recognition probability threshold (for example, 70%), as a candidate region of the target object corresponding to the target keyword information. Therefore, the client terminal may determine that the target object corresponding to the target keyword information is included in region A to be processed, and consider that the target object corresponding to the target keyword information is not included in region B to be processed.
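  • The threshold-based selection of candidate regions can be sketched as below, using the example values from the text (90% for region A, 40% for region B, threshold 70%). The function name `select_candidates` and the dictionary layout are assumptions made for illustration; how the classifier produces the probabilities is outside this sketch.

```python
def select_candidates(recognition_probabilities, threshold=0.7):
    """Keep the regions whose recognition probability exceeds the threshold."""
    return [name for name, p in recognition_probabilities.items()
            if p > threshold]

# Example values taken from the description of regions A and B.
probabilities = {"A": 0.9, "B": 0.4}
candidates = select_candidates(probabilities)
```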
  • Step S207: optimally select among the candidate regions corresponding to the target video frame based on a regression model, and determine the selected optimal candidate region corresponding to the target video frame as the target region.
  • The client terminal may optimally select among the candidate regions corresponding to the target video frame based on a regression model and select the optimal candidate region corresponding to the target video frame from among them; determine the previous video frame of the target video frame as the reference video frame and obtain multiple candidate regions corresponding to the reference video frame; select estimated regions among the multiple candidate regions corresponding to the reference video frame; determine the overlap rate between each estimated region and the optimal candidate region corresponding to the target video frame; obtain the estimated region with the highest overlap rate; use the estimated region with the highest overlap rate to modify the optimal candidate region corresponding to the target video frame; and determine the modified optimal candidate region as the target region.
  • The plurality of candidate regions corresponding to the reference video frame are selected from the regions to be processed corresponding to the reference video frame through the classification recognition model corresponding to the target keyword information; the regions to be processed corresponding to the reference video frame are generated by performing a selective search on the reference video frame.
  • The client terminal may use the multiple regions to be processed that contain the target object corresponding to the target keyword information as candidate regions (it should be understood that these regions to be processed are obtained by the client terminal performing scaling processing on region X1 to be processed, region X2 to be processed, and region X3 to be processed corresponding to the target video frame shown in FIG. 8). Therefore, as shown in FIG. 8, there is a mapping relationship between the regions to be processed corresponding to the target video frame (that is, region X1 to be processed, region X2 to be processed, and region X3 to be processed) and the three candidate regions containing the target object shown in FIG. 8; that is, there is a mapping relationship between region X1 to be processed and candidate region X1, between region X2 to be processed and candidate region X2, and between region X3 to be processed and candidate region X3. Therefore, the client terminal can perform optimal selection among the three candidate regions shown in FIG. 8.
  • Specifically, refer to FIG. 8, which is a schematic diagram of selecting an optimal candidate region according to an embodiment of the present application. From the plurality of candidate regions (that is, candidate region X1, candidate region X2, and candidate region X3), an optimal candidate region corresponding to the target video frame can be selected.
  • The normalized distance from the target object in each candidate region to the candidate frame of that candidate region is calculated by a regression model; the candidate region whose normalized distance satisfies a preset condition is used as the optimal candidate region corresponding to the target video frame, and the selected optimal candidate region is determined as the target region. The preset condition is that the normalized distance between the target object and the candidate frame of the candidate region is the smallest, or that the normalized distance between the target object and the candidate frame of the candidate region is the largest.
  • The candidate frames of the three candidate regions are drawn with different line styles, and the region to be processed corresponding to the candidate frame of each line style can be found in the target video frame shown in FIG. 8.
  • The client terminal can calculate, through the regression model, the normalized distance from the target object in each candidate region to its candidate frame. This can be understood as follows: through the regression model, the position information (for example, center point position information) in the target video frame of each region to be processed input to the neural network model can be obtained, together with the length value and width value of each region to be processed. Therefore, by calculating the normalized distance corresponding to each candidate region, it can be determined that the target object is relatively far from its candidate frame in candidate region X1 shown in FIG. 8, and candidate region X2 shown in FIG. 8 may be used as the optimal candidate region corresponding to the target video frame.
  • the client terminal may determine a previous video frame of the target video frame as a reference video frame, and obtain a plurality of candidate regions corresponding to the reference video frame.
  • the reference video frame may be a previous video frame of the target video frame, that is, the reference video frame and the target video frame are both video frames in currently played video data.
  • For the multiple candidate regions corresponding to the reference video frame, refer to the specific process of determining the multiple candidate regions corresponding to the target video frame in the foregoing embodiment corresponding to FIG. 8. That is, the multiple candidate regions corresponding to the reference video frame are selected from the regions to be processed corresponding to the reference video frame through the classification recognition model corresponding to the target keyword information; the regions to be processed are generated by performing a selective search on the reference video frame.
  • the client terminal may select an estimated region among a plurality of candidate regions corresponding to the reference video frame.
  • The client terminal may first obtain the target region of the target object in the reference video frame, obtain the position information of the target region in the reference video frame as the first position information, and obtain the position information of the candidate regions corresponding to the reference video frame as the second position information.
  • The client terminal may further calculate the distance between the first position information and the second position information corresponding to the reference video frame, use the candidate regions whose distance is less than a distance threshold as the estimated regions corresponding to the target video frame, further determine the overlap rate between each estimated region and the optimal candidate region corresponding to the target video frame, obtain the estimated region with the highest overlap rate, use the estimated region with the highest overlap rate to modify the optimal candidate region corresponding to the target video frame, and determine the modified optimal candidate region as the target region.
  • The modified optimal candidate region is the image region, in the target video frame, of the target object corresponding to the target keyword information. Therefore, the client terminal may determine the modified optimal candidate region as the target region.
  • The target region corresponding to the reference video frame is obtained by modifying the optimal candidate region corresponding to the reference video frame; the optimal candidate region corresponding to the reference video frame is selected, based on the regression model, from the candidate regions corresponding to the reference video frame.
  • The client terminal may further obtain the position information of the target region in the reference video frame (that is, the center point position information of the target region; for example, the position information of the target region is Q1) and use the center point position information of the target region in the reference video frame as the first position information; secondly, the client terminal may further obtain the position information of the candidate regions corresponding to the reference video frame as the second position information. It should be understood that the target region corresponding to the reference video frame is obtained by modifying the optimal candidate region corresponding to the reference video frame, and the optimal candidate region corresponding to the reference video frame is selected, based on the regression model, from the candidate regions corresponding to the reference video frame.
  • the target region in the reference video frame can also be understood as the candidate region corresponding to the reference video frame.
  • the target region of the target object in the reference video frame and the target region in the target video frame have the same center point position information. Therefore, in the process of estimating the area where the target object is located in the target video frame through the reference video frame, the target area in the reference video frame may be used as a special candidate area.
  • FIG. 9 is a schematic diagram of modifying the optimal candidate region of the target video frame according to an embodiment of the present application.
  • Candidate region Y1 is the target region corresponding to the reference video frame. The target region corresponding to the reference video frame is obtained by modifying the optimal candidate region corresponding to the reference video frame, and the optimal candidate region corresponding to the reference video frame is selected, based on the regression model, from the candidate regions corresponding to the reference video frame. Therefore, the client terminal may determine, in the reference video frame, the position information of the four candidate regions (that is, candidate region Y1, candidate region Y2, candidate region Y3, and candidate region Y4 as shown in FIG. 9): the position information of candidate region Y1 is the position information of the target region (that is, Q1), the position information of candidate region Y2 is Q2, the position information of candidate region Y3 is Q3, and the position information of candidate region Y4 is Q4.
  • The region where the target object appears in the next video frame can be estimated in the reference video frame; that is, the reference video frame can be used to estimate the region where the target object is located in the target video frame.
  • The client terminal may further calculate the distance between the first position information and the second position information corresponding to the reference video frame (the distance may be represented by the symbol D); that is, the client terminal may further calculate the distance between Q1 and each of Q1, Q2, Q3, and Q4.
  • The client terminal may further use the candidate regions whose distance is less than a distance threshold (for example, the distance threshold is 1) as the estimated regions corresponding to the target video frame; that is, among the above four candidate regions, the distances corresponding to candidate region Y1, candidate region Y2, and candidate region Y3 are smaller than the distance threshold, so these three candidate regions are used as the estimated regions.
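  • The selection of estimated regions by distance can be sketched as follows. The coordinates assigned to Q1 to Q4 are invented for illustration, and Euclidean distance between center points is an assumption; the application only states that a distance D is computed between the first and second position information and compared with a threshold (for example, 1).

```python
import math

def select_estimated_regions(first_position, candidate_positions, threshold=1.0):
    """Return the names of candidate regions whose center lies within
    `threshold` of the reference frame's target-region center."""
    return [name for name, pos in candidate_positions.items()
            if math.dist(first_position, pos) < threshold]

# Hypothetical center points; Y1 is the target region itself, so Q1 == first position.
q1 = (0.0, 0.0)
candidates = {"Y1": q1, "Y2": (0.5, 0.0), "Y3": (0.0, 0.8), "Y4": (3.0, 4.0)}
estimated = select_estimated_regions(q1, candidates)
```

  • With these example coordinates, Y1, Y2, and Y3 fall within the threshold and Y4 does not, matching the outcome described in the text.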
  • the client terminal may further determine an overlap rate between each estimated region and an optimal candidate region corresponding to the target video frame; finally, the client terminal may obtain an estimated region with the highest overlap rate, The optimal candidate region corresponding to the target video frame is modified by using the estimated region with the highest overlap ratio, and the modified optimal candidate region is determined as the target region.
  • The overlap rate may be calculated by the following steps: first, the client terminal may obtain the length value and width value of the optimal candidate region corresponding to the target video frame, and determine, according to the length value and the width value, the area of the optimal candidate region corresponding to the target video frame as the first area; secondly, the client terminal may further obtain the length value and width value of the estimated region and determine, according to the length value and width value of the estimated region and the length value and width value of the optimal candidate region corresponding to the target video frame, the overlapping area between the estimated region and the optimal candidate region corresponding to the target video frame as the second area; then, the client terminal may determine the overlap rate between the estimated region and the optimal candidate region corresponding to the target video frame according to the first area and the second area.
  • After obtaining the estimated regions corresponding to the target video frame (that is, estimated region Y1, estimated region Y2, and estimated region Y3) as shown in FIG. 9, the client terminal may calculate, based on a shape comparison network, the overlapping area between each of the three estimated regions and the optimal candidate region corresponding to the target video frame as the second area. In addition, the area of the optimal candidate region corresponding to the target video frame is the first area. Therefore, the client terminal may further obtain the overlap rate between each estimated region and the optimal candidate region corresponding to the target video frame according to the ratio between the second area and the first area.
  • For example, the overlap rate between estimated region Y1 and the optimal candidate region corresponding to the target video frame is 50%, the overlap rate between estimated region Y2 and the optimal candidate region corresponding to the target video frame is 40%, and the overlap rate between estimated region Y3 and the optimal candidate region corresponding to the target video frame is 80%, so the client terminal can use the average of the center position information of estimated region Y3 and the center position information of the optimal candidate region as the center point position information of the target region of the target video frame.
  • the overlap between the estimated region and the optimal candidate region corresponding to the target video frame can be obtained through the shape comparison network (that is, the overlap rate between the two is calculated). Then, the client terminal may further correct the optimal candidate region corresponding to the target video frame by using the estimated region with the highest overlap ratio.
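  • The overlap-rate calculation and the correction of the optimal candidate region can be sketched as follows. The sketch assumes axis-aligned rectangles given as (center x, center y, width, height) and computes the overlap rate as second area divided by first area, as described above; it does not model the shape comparison network. Keeping the width and height of the optimal candidate region after correction is an assumption, since the text only specifies how the center point is averaged. All function names and coordinates are hypothetical.

```python
def corners(region):
    cx, cy, w, h = region
    return cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2

def overlap_rate(estimated, optimal):
    """Second area (intersection) divided by first area (optimal candidate)."""
    ex0, ey0, ex1, ey1 = corners(estimated)
    ox0, oy0, ox1, oy1 = corners(optimal)
    first_area = optimal[2] * optimal[3]
    inter_w = max(0.0, min(ex1, ox1) - max(ex0, ox0))
    inter_h = max(0.0, min(ey1, oy1) - max(ey0, oy0))
    second_area = inter_w * inter_h
    return second_area / first_area

def correct_optimal(optimal, estimated_regions):
    """Average the centers of the optimal candidate region and the
    estimated region with the highest overlap rate."""
    best = max(estimated_regions, key=lambda r: overlap_rate(r, optimal))
    cx = (best[0] + optimal[0]) / 2
    cy = (best[1] + optimal[1]) / 2
    # Assumption: the corrected region keeps the optimal candidate's size.
    return (cx, cy, optimal[2], optimal[3])

optimal = (5, 5, 10, 10)
y1 = (10, 5, 10, 10)   # overlaps half of the optimal candidate region
y3 = (7, 5, 10, 10)    # overlaps four fifths of it
target_region = correct_optimal(optimal, [y1, y3])
```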
  • If the overlap rate between estimated region Y1 and the optimal candidate region corresponding to the target video frame is 100%, it indicates that the target object is stationary across the multiple video frames corresponding to the video data. In that case, the target region of the target video frame and the target region of the reference video frame have the same center position information.
  • The target region (that is, the image region, in the target video frame, of the target object corresponding to the target keyword information) can be understood as the optimal candidate region corresponding to the target video frame; of course, the target region can also be understood as the image region obtained after the client terminal modifies the optimal candidate region corresponding to the target video frame, that is, the corrected optimal candidate region.
  • Step S208 when the target video frame in the video data is played, perform animation processing on the target area in the target video frame.
  • For the specific implementation process of step S208, reference may be made to the description of step S104 in the embodiment corresponding to FIG. 2 described above, and details will not be repeated here.
  • The embodiment of the present application can split the video data corresponding to the barrage data into multiple video frames and identify, in the currently playing video data, the target object corresponding to the target keyword information. The client terminal may convert the detection of the target object in the video data into frame-by-frame detection of the target object in each video frame; if the target object corresponding to the target keyword information exists in each video frame of the video data, the client terminal can further determine the target region of the target object in each video frame, so that an animation effect can be set for each target region and the corresponding target region can be animated when each video frame is played.
  • the visual display effect of the barrage data can be enriched by associating the target keyword information in the barrage data with the target object located at the target area in the target video frame. Therefore, it can be ensured that the user watching the video data can effectively capture the video content of interest through the set keyword information, and avoid the waste of equipment resources and network resources caused by the purpose of identifying and capturing the barrage data.
  • The user can see the movement track of the person on the playback interface corresponding to the video data through the client terminal; that is, since the video frames are played in chronological order, after the client terminal obtains the target region in each video frame, it can animate the target region in each video frame in order, so that the user holding the client terminal sees the animation display effect corresponding to each target region on the playback interface of the video data, and thereby sees the person's movement track on the playback interface.
  • All key content (that is, all keyword information) in the barrage data can be further extracted based on the key information database, and, through the classification recognition model of the target object corresponding to the keyword information, the target object can be identified in a target video frame among the multiple video frames of the video data; the specific position of the target object in the target video frame (that is, the target region of the target object in the target video frame) can be further determined, and the target region can be animated. Therefore, by associating the keyword information in the barrage data with the target object located at the target region in the target video frame, the visual display effect of the barrage data can be enriched.
  • FIG. 10 is a schematic structural diagram of a video processing apparatus according to an embodiment of the present application.
  • the video processing device 1 may be a target client terminal in the embodiment corresponding to FIG. 1 described above.
  • The video processing device 1 may include a barrage data acquisition module 10, a keyword acquisition module 20, a target object recognition module 30, and a target region processing module 40.
  • the barrage data acquisition module 10 is configured to play video data and obtain barrage data corresponding to the video data;
  • The barrage data acquisition module 10 is specifically configured to play video data, send a barrage acquisition request to a barrage server, receive historical barrage data returned by the barrage server based on the barrage acquisition request, use the historical barrage data as the barrage data corresponding to the video data, and display the barrage data on a playback interface of the video data.
  • The barrage data acquisition module 10 is specifically configured to play video data, obtain text input data, use the text input data as the barrage data corresponding to the video data, display the barrage data on the playback interface of the video data based on a barrage track, and send the barrage data to the barrage server, so that the barrage server sends the barrage data to the client terminals synchronously watching the video data.
  • the keyword obtaining module 20 is configured to obtain keyword information matching the barrage data in a key information database as target keyword information;
  • the key information database includes keyword information set by a user and a classification recognition model of the target object corresponding to each piece of keyword information;
  • the keyword acquisition module 20 includes a barrage data splitting unit 201, a keyword search unit 202, a keyword determination unit 203, and a classification model acquisition unit 204.
  • the barrage data splitting unit 201 is configured to obtain a key information database and split the barrage data into a plurality of word segmentation data;
  • the keyword search unit 202 is configured to traverse in the key information database to find keyword information that matches each segmentation data;
  • the keyword determining unit 203 is configured to, if keyword information matching the segmentation data is found, use the keyword information as target keyword information corresponding to the barrage data;
  • the classification model acquiring unit 204 is configured to acquire a classification recognition model of a target object corresponding to the target keyword information in the key information database.
  • For the specific implementation processes of the barrage data splitting unit 201, the keyword search unit 202, the keyword determination unit 203, and the classification model acquisition unit 204, refer to the description of step S102 in the embodiment corresponding to FIG. 2 above; details are not repeated here.
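  • The cooperation of units 201 to 204 can be sketched as follows. This is a simplified illustration: whitespace splitting stands in for word segmentation (real barrage text, especially Chinese, would need a proper segmenter), the key information database is modeled as a dictionary from keyword to a model identifier, and all names (`find_target_keywords`, `key_info_db`, the model identifiers) are hypothetical.

```python
def find_target_keywords(barrage_text, key_info_db):
    """Split barrage data into segments and look each one up in the database.

    Returns (keyword, classification_model) pairs for every matching segment.
    """
    segments = barrage_text.lower().split()      # stand-in for word segmentation
    return [(seg, key_info_db[seg]) for seg in segments if seg in key_info_db]

# Hypothetical key information database: keyword -> classification recognition model.
key_info_db = {"cat": "cat_model", "dog": "dog_model"}

matches = find_target_keywords("Look at that cat", key_info_db)
```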
  • the target object recognition module 30 is configured to obtain a target video frame from a plurality of video frames of the video data, identify, based on the classification recognition model corresponding to the target keyword information, the image region in the target video frame of the target object corresponding to the target keyword information, and use the identified image region as a target region;
  • the target object recognition module 30 includes a target frame acquisition unit 301, a selective search unit 302, a feature extraction unit 303, a candidate region determination unit 304, and a target region determination unit 305.
  • the target frame obtaining unit 301 is configured to obtain a target video frame from a plurality of video frames of the video data
  • the selective search unit 302 is configured to divide the target video frame into multiple sub-regions, perform a selective search on each sub-region, and merge the sub-regions after the selective search to obtain multiple merged regions. And determining the multiple sub-regions and the multiple merged regions as regions to be processed;
  • the feature extraction unit 303 is configured to perform feature extraction on the region to be processed based on a neural network model to obtain image features corresponding to the region to be processed;
  • the feature extraction unit 303 is specifically configured to scale the image blocks in the regions to be processed to the same size, use the regions to be processed having the same size as an input of a neural network model, and output, through the neural network model, the image features corresponding to the image blocks in the regions to be processed.
  • the candidate region determining unit 304 is configured to generate a recognition probability corresponding to the region to be processed based on the image features and the classification recognition model corresponding to the target keyword information, and select, according to the recognition probability, a candidate region containing the target object corresponding to the target keyword information from the regions to be processed;
  • the recognition probability is a probability used to indicate that the target object is included in the area to be processed
  • the target region determining unit 305 is configured to optimally select a candidate region corresponding to the target video frame based on a regression model, and determine the selected optimal candidate region corresponding to the target video frame as a target region.
  • the target area determination unit 305 includes: an optimal area selection subunit 3051, a reference frame determination subunit 3052, an estimated area selection subunit 3053, an overlap ratio determination subunit 3054, and a target area determination subunit 3055.
  • the optimal region selection subunit 3051 is configured to optimally select among the candidate regions corresponding to the target video frame based on a regression model, and select the optimal candidate region corresponding to the target video frame from among them;
  • the reference frame determination subunit 3052 is configured to determine a previous video frame of the target video frame as a reference video frame, and obtain multiple candidate regions corresponding to the reference video frame; the multiple candidate regions corresponding to the reference video frame are selected from the regions to be processed corresponding to the reference video frame through the classification recognition model corresponding to the target keyword information; the regions to be processed corresponding to the reference video frame are generated by performing a selective search on the reference video frame;
  • the estimation region selection subunit 3053 is configured to select an estimation region among a plurality of candidate regions corresponding to the reference video frame;
  • the estimated region selection subunit 3053 is specifically configured to obtain the target region of the target object in the reference video frame, obtain the position information of the target region in the reference video frame as the first position information, obtain the position information of the candidate regions corresponding to the reference video frame as the second position information, calculate the distance between the first position information and the second position information corresponding to the reference video frame, and use the candidate regions whose distance is less than the distance threshold as the estimated regions corresponding to the target video frame;
  • the target region corresponding to the reference video frame is obtained by modifying the optimal candidate region corresponding to the reference video frame; the optimal candidate region corresponding to the reference video frame is selected, based on the regression model, from the candidate regions corresponding to the reference video frame.
  • the overlap rate determining subunit 3054 is configured to determine an overlap rate between each estimated region and an optimal candidate region corresponding to the target video frame;
  • the overlap rate determination subunit 3054 is specifically configured to obtain the length value and width value of the optimal candidate region corresponding to the target video frame, determine, according to the length value and the width value, the area of the optimal candidate region corresponding to the target video frame as the first area, obtain the length value and width value of the estimated region, determine, according to the length value and width value of the estimated region and the length value and width value of the optimal candidate region corresponding to the target video frame, the overlapping area between the estimated region and the optimal candidate region corresponding to the target video frame as the second area, and determine, according to the first area and the second area, the overlap rate between the estimated region and the optimal candidate region corresponding to the target video frame.
  • the target region determination subunit 3055 is configured to obtain the estimated region with the highest overlap rate, use the estimated region with the highest overlap rate to modify the optimal candidate region corresponding to the target video frame, and determine the modified optimal candidate region as the target region.
  • step S207 The specific implementation process of the target frame obtaining unit 301, the selective search unit 302, the feature extraction unit 303, the candidate region determination unit 304, and the target region determination unit 305 can refer to steps S203- The description of step S207 will not be repeated here.
  • the target region processing module 40 is configured to perform animation processing on the target region in the target video frame when the target video frame in the video data is played.
  • after the barrage data is obtained during playback of the video data, all key content (that is, all keyword information) in the barrage data can be extracted based on the key information database.
  • using the classification recognition model of the target object corresponding to each piece of keyword information, the target object can be identified in the target video frame among the plurality of video frames of the video data, its specific position in the target video frame (that is, its target region) can be determined, and the target region can be animated. Associating the keyword information in the barrage data with the target object located in the target region of the target video frame therefore enriches the visual display effect of the barrage data.
  • FIG. 11 is a schematic structural diagram of another video processing apparatus according to an embodiment of the present application.
  • the video processing apparatus 1000 may be the target client terminal in the foregoing embodiment corresponding to FIG. 1.
  • the video processing apparatus 1000 may include: at least one processor 1001 (such as a CPU), at least one network interface 1004, a user interface 1003, a memory 1005, and at least one communication bus 1002.
  • the communication bus 1002 is used to implement connection and communication between these components.
  • the user interface 1003 may include a display and a keyboard, and may optionally further include a standard wired interface and a wireless interface.
  • the network interface 1004 may optionally include a standard wired interface and a wireless interface (such as a WI-FI interface).
  • the memory 1005 may be a high-speed RAM, or may be a non-volatile memory, for example, at least one magnetic disk memory.
  • the memory 1005 may optionally be at least one storage device located remotely from the foregoing processor 1001.
  • the memory 1005, as a computer storage medium, may include an operating system, a network communication module, a user interface module, and a device control application program.
  • the network interface 1004 is mainly used to connect to the barrage server and the video source server;
  • the user interface 1003 is mainly used to provide an input interface for the user;
  • the processor 1001 can be used to call the device control application program stored in the memory 1005 to implement the following:
  • keyword information matching the barrage data is obtained from a key information database as target keyword information;
  • the key information database contains keyword information set by the user and a classification recognition model of the target object corresponding to each piece of keyword information;
  • the target region in the target video frame is animated.
  • the video processing apparatus 1000 described in the embodiment of the present application may perform the video processing method described in the foregoing embodiments corresponding to FIG. 2 or FIG. 5, and may also perform the functions of the video processing apparatus 1 described in the embodiment corresponding to FIG. 10; details are not repeated here.
  • the description of the beneficial effects of using the same method is likewise not repeated.
  • the embodiment of the present application further provides a computer storage medium that stores the computer program executed by the video processing apparatus 1 mentioned above; the computer program includes program instructions.
  • when the processor executes the program instructions, the video processing method described in the foregoing embodiments corresponding to FIG. 2 or FIG. 5 can be performed, so details are not described herein again. The description of the beneficial effects of using the same method is likewise not repeated.
  • all or part of the processes in the foregoing method embodiments can be completed by a computer program instructing the relevant hardware, and the program can be stored in a computer-readable storage medium.
  • when executed, the program may include the processes of the foregoing method embodiments.
  • the storage medium may be a magnetic disk, an optical disk, a read-only memory (Read-Only Memory, ROM), or a random access memory (Random Access Memory, RAM).

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Medical Informatics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Business, Economics & Management (AREA)
  • Marketing (AREA)
  • Human Computer Interaction (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)
  • Processing Or Creating Images (AREA)

Abstract

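An embodiment of this application discloses a video processing method performed by a client terminal. The method includes: playing video data and obtaining barrage data corresponding to the video data; obtaining, from a key information database, keyword information that matches the barrage data as target keyword information, where the key information database contains keyword information set by a user and a classification recognition model of the target object corresponding to each piece of keyword information; obtaining a target video frame from among a plurality of video frames of the video data, identifying, based on the classification recognition model corresponding to the target keyword information, the image region of the corresponding target object in the target video frame, and using the identified image region as a target region; and, when the target video frame in the video data is played, performing animation processing on the target region in the target video frame.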

Description

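Video processing method, apparatus, and storage medium
This application claims priority to Chinese Patent Application No. 201810618681.6, entitled "Video processing method and apparatus" and filed with the China Patent Office on June 15, 2018, which is incorporated herein by reference in its entirety.
Technical Field
This application relates to the field of Internet technologies, and in particular to a video processing method, apparatus, and storage medium.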
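Background
While watching a video through a client terminal or a web page, users often turn on the barrage (bullet-screen comments) to view comments posted by other users. Because some videos carry a large number of barrage comments, or the comments scroll quickly, users cannot promptly and clearly pick out the key content of the comments on the video playback interface (that is, it is hard to catch the keyword information in them in time). This lowers the recognizability of the barrage data and, in turn, its visual display effect.
In addition, the barrage shown on the video playback interface is independent of the video content being played, so it cannot reflect the currently playing content in real time. In other words, the barrage in the client terminal lacks any correlation with the video content, which further weakens the visual display effect of the barrage data currently shown.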
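Summary
An embodiment of this application provides a video processing method, performed by a client terminal, including:
playing video data, and obtaining barrage data corresponding to the video data;
obtaining, from a key information database, keyword information that matches the barrage data as target keyword information, where the key information database contains keyword information set by a user and a classification recognition model of the target object corresponding to each piece of keyword information;
obtaining a target video frame from among a plurality of video frames of the video data, identifying, based on the classification recognition model corresponding to the target keyword information, the image region of the corresponding target object in the target video frame, and using the identified image region as a target region; and
when the target video frame in the video data is played, performing animation processing on the target region in the target video frame.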
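An embodiment of this application provides a video processing apparatus, including:
a barrage data obtaining module, configured to play video data and obtain barrage data corresponding to the video data;
a keyword obtaining module, configured to obtain, from a key information database, keyword information that matches the barrage data as target keyword information, where the key information database contains keyword information set by a user and a classification recognition model of the target object corresponding to each piece of keyword information;
a target object recognition module, configured to obtain a target video frame from among a plurality of video frames of the video data, identify, based on the classification recognition model corresponding to the target keyword information, the image region of the corresponding target object in the target video frame, and use the identified image region as a target region; and
a target region processing module, configured to perform animation processing on the target region in the target video frame when the target video frame in the video data is played.
An embodiment of this application provides a video processing apparatus, including a processor and a memory.
The processor is connected to the memory; the memory is configured to store program code, and the processor is configured to call the program code to perform the video processing method provided in the embodiments of this application.
An embodiment of this application provides a computer storage medium that stores a computer program. The computer program includes program instructions that, when executed by a processor, perform the video processing method provided in the embodiments of this application.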
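Brief Description of the Drawings
To describe the technical solutions in the embodiments of this application or in the related art more clearly, the drawings needed for that description are briefly introduced below. The drawings described below are merely some embodiments of this application; a person of ordinary skill in the art may derive other drawings from them without creative effort.
FIG. 1 is a schematic structural diagram of a network architecture according to an embodiment of this application;
FIG. 2 is a schematic flowchart of a video processing method according to an embodiment of this application;
FIG. 3 is a schematic diagram of obtaining barrage data according to an embodiment of this application;
FIG. 4 is another schematic diagram of obtaining barrage data according to an embodiment of this application;
FIG. 5 is a schematic flowchart of another video processing method according to an embodiment of this application;
FIG. 6 is a schematic diagram of barrage data displayed across multiple video frames according to an embodiment of this application;
FIG. 7 is a schematic diagram of feature extraction according to an embodiment of this application;
FIG. 8 is a schematic diagram of selecting an optimal candidate region according to an embodiment of this application;
FIG. 9 is a schematic diagram of correcting the optimal candidate region of the target video frame according to an embodiment of this application;
FIG. 10 is a schematic structural diagram of a video processing apparatus according to an embodiment of this application;
FIG. 11 is a schematic structural diagram of another video processing apparatus according to an embodiment of this application.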
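Detailed Description
The technical solutions in the embodiments of this application are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of this application. All other embodiments obtained by a person of ordinary skill in the art based on these embodiments without creative effort fall within the protection scope of this application.
The embodiments of this application propose a video processing method that can enrich the visual display effect of barrage data and avoid wasting device and network resources on identifying and capturing barrage data.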
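Refer to FIG. 1, a schematic structural diagram of a network architecture according to an embodiment of this application. As shown in FIG. 1, the network architecture may include a server cluster and a client terminal cluster. The client terminal cluster may include a plurality of client terminals, specifically client terminal 3000a, client terminal 3000b, ..., and client terminal 3000n.
As shown in FIG. 1, the server cluster may include a barrage server 2000a and a video source server 2000b. The barrage server 2000a stores the barrage data from a preset time period as historical barrage data, and the video source server 2000b stores a plurality of video data sources.
Client terminals 3000a, 3000b, ..., 3000n may each have a network connection to the server cluster.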
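As shown in FIG. 1, for ease of understanding, one client terminal in the client terminal cluster may be selected as the target client terminal (client terminal 3000a is used as an example), so that the data exchange between client terminal 3000a and the barrage server 2000a and the video source server 2000b can be described separately. While the target client terminal (client terminal 3000a) plays video data (data returned by the video source server 2000b in response to a video download request sent by client terminal 3000a), it may send a barrage acquisition request to the barrage server 2000a based on the current playback progress of the video data, so that the barrage server 2000a returns historical barrage data in response to the request.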
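The historical barrage data may be text input data entered by other users on their own client terminals (for example, client terminal 3000b) at the current playback progress. In this case, the user of client terminal 3000a (user A) and the user of client terminal 3000b (user B) can be understood to be watching the video data synchronously, so client terminal 3000a can display, at the current playback progress, the barrage data that client terminal 3000b uploads to the barrage server 2000a. The barrage server 2000a may store this barrage data in the barrage database as historical barrage data corresponding to the current playback progress, and return it in response to a received barrage acquisition request. The historical barrage data may also include barrage data uploaded by other client terminals (for example, client terminal 3000c) that the barrage server received and stored over a period of time. In that case, client terminal 3000c's playback timestamp for the video data is earlier than those of client terminals 3000a and 3000b, which play the video data synchronously. For example, the historical barrage data may be barrage data that client terminal 3000c uploaded to the barrage server 2000a one hour ago (say, one hour ago client terminal 3000c took the text input data obtained at a playback progress of 10% as barrage data and uploaded it to the barrage server 2000a). Client terminals 3000a and 3000b, playing the video data synchronously, can then obtain the historical barrage data corresponding to that playback progress from the barrage server 2000a when their playback progress reaches 10%, and use it as the barrage data corresponding to the video data.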
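The period of time may be measured in minutes, hours, days, months, years, and so on; it is not specifically limited here.
It should be understood that whenever the barrage server 2000a receives a barrage acquisition request from the target client terminal, it finds the historical barrage data corresponding to that request in the barrage database and delivers it to the target client terminal.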
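The user of the target client terminal can also see their own real-time text input on the video playback interface. When the target client terminal receives text input data from the user, it may use that data as barrage data for the currently playing video data and display it on the corresponding playback interface. At the same time, the target client terminal may upload the barrage data to the barrage server 2000a to which it is connected, so that the barrage server 2000a stores and/or delivers it: the barrage server 2000a may store the barrage data as historical barrage data corresponding to the current playback progress, and may also send it synchronously to other client terminals watching the video data.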
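After obtaining the barrage data, the target client terminal (client terminal 3000a) may extract the target keyword information from the barrage data in the background, identify the target object corresponding to the target keyword in the currently playing video data, and then perform animation processing on the target region where the target object is located in the target video frame. This enriches the visual display effect of the barrage data and avoids wasting device and network resources on identifying and capturing barrage data.
For the specific process by which the target client terminal extracts the target keyword information from the barrage data, identifies the corresponding target object in the target video frame, and performs animation processing on the corresponding target region, see the embodiments corresponding to FIG. 2 to FIG. 5 below.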
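Further, refer to FIG. 2, a schematic flowchart of a video processing method according to an embodiment of this application. As shown in FIG. 2, the method may be performed by a client terminal and includes steps S101 to S104:
Step S101: play video data, and obtain barrage data corresponding to the video data.
Specifically, while playing video data, the client terminal may obtain the barrage data corresponding to the video data. The barrage data may be historical barrage data returned by the barrage server, or text input data entered by the client terminal's user on the playback interface of the video data. The client terminal may then display the barrage data on the playback interface corresponding to the video data.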
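The client terminal may be the target client terminal in the embodiment corresponding to FIG. 1. Client terminals include personal computers, tablet computers, notebook computers, smart TVs, smartphones, and other terminal devices capable of playing video data.
The barrage server may be the barrage server 2000a in the embodiment corresponding to FIG. 1. It may store the text input data that each user enters on their client terminal for the currently playing video data (that is, the barrage data uploaded by each client terminal), and may store each piece of barrage data according to the playback progress of the video data. A user watching the video data with the barrage turned on can then obtain and display the barrage data corresponding to the current playback progress.
The specific process by which the client terminal obtains and displays the barrage data may include: playing video data and sending a barrage acquisition request to the barrage server; receiving the historical barrage data that the barrage server returns in response to the request; using the historical barrage data as the barrage data corresponding to the video data; and displaying the barrage data on the playback interface of the video data.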
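For ease of understanding, client terminal 3000a from the embodiment corresponding to FIG. 1 is used as the client terminal. Refer to FIG. 3, a schematic diagram of obtaining barrage data according to an embodiment of this application.
As shown in FIG. 3, while playing video data with the barrage turned on, client terminal 3000a may obtain the barrage data corresponding to the video data. Specifically, client terminal 3000a may send a barrage acquisition request to the barrage server 2000a shown in FIG. 3 based on the current playback progress of the video data (20% in FIG. 3), and receive the historical barrage data that the barrage server 2000a returns in response. (The historical barrage data may be barrage data that client terminal 3000b in the embodiment corresponding to FIG. 1 uploaded when its playback progress reached 20%; after receiving it, the barrage server 2000a stored it as historical barrage data in the barrage database shown in FIG. 3.) Client terminal 3000a may then use the received historical barrage data as the barrage data corresponding to the video data and display it on the playback interface of the video data (playback interface 100a in FIG. 3).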
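The exchange between the client terminal and the barrage server described for FIG. 3 can be sketched as follows. This is a minimal in-memory illustration under stated assumptions, not the application's implementation: the class `BarrageStore`, its method names, the integer progress percentage used as the lookup key, and the comment text are all hypothetical.

```python
from collections import defaultdict


class BarrageStore:
    """Hypothetical stand-in for the barrage database on the barrage server."""

    def __init__(self):
        # Maps a playback progress (percent) to the comments uploaded at that progress.
        self._by_progress = defaultdict(list)

    def upload(self, progress_percent, text):
        """Store an uploaded comment as historical barrage data for this progress."""
        self._by_progress[progress_percent].append(text)

    def fetch(self, progress_percent):
        """Answer a barrage acquisition request for the given playback progress."""
        return list(self._by_progress.get(progress_percent, []))


store = BarrageStore()
store.upload(20, "example comment")  # uploaded earlier, e.g. by client terminal 3000b
history = store.fetch(20)            # requested by client terminal 3000a at 20% progress
```

A real server would presumably bucket comments by a time window rather than by an exact percentage; the exact-match key here only keeps the sketch short.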
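It should be understood that client terminal 3000a and client terminal 3000b in this embodiment do not play the video data synchronously, so client terminal 3000a can obtain the barrage data uploaded by client terminal 3000b from the barrage server 2000a based on the playback progress of the video data. For the specific process by which client terminal 3000b uploads barrage data, refer to FIG. 4, another schematic diagram of obtaining barrage data according to an embodiment of this application.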
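Here, user B and user C watch the same video data synchronously on different client terminals: user B holds client terminal 3000b and user C holds client terminal 3000c. As shown in FIG. 4, when user B enters text input data (for example, "快看那个小孩", "Look at that kid") on the text input interface 200b, the client terminal (here client terminal 3000b) may, upon detecting a barrage trigger operation corresponding to the text input data, use the text input data as barrage data for the video data and display it on the playback interface of the video data (playback interface 200a in FIG. 4) based on a barrage track. The barrage track represents the position of the barrage data on playback interface 200a (for example, the barrage data may be displayed on the first row of the barrage track). At the same time, client terminal 3000b may send the barrage data to the barrage server 2000a shown in FIG. 4, which may forward it to client terminal 3000c, which is watching the video data synchronously, so that client terminal 3000c displays the barrage data on its playback interface 300a as shown in FIG. 4. Thus, for two client terminals playing the video data synchronously (client terminals 3000b and 3000c), the barrage data corresponding to the video data is displayed synchronously on both playback interfaces.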
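Optionally, the specific process by which the client terminal obtains and displays the barrage data may also include: playing video data and obtaining text input data; using the text input data as the barrage data corresponding to the video data; displaying the barrage data on the playback interface of the video data based on a barrage track; and sending the barrage data to the barrage server, so that the barrage server sends it synchronously to the client terminals watching the video data.
Thus, when the client terminal is client terminal 3000a, the barrage data it obtains is the historical barrage data returned by the barrage server in response to its barrage acquisition request, and the client terminal may use that historical barrage data as the barrage data. Alternatively, when the client terminal is client terminal 3000b, the barrage data it obtains is the text input data that user B enters on the text input interface of client terminal 3000b, and the client terminal may use that text input data as the barrage data.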
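Step S102: obtain, from the key information database, keyword information that matches the barrage data as target keyword information.
Specifically, the client terminal may obtain the key information database, split the barrage data into a plurality of word segments using word segmentation, and traverse the key information database to look for keyword information matching each segment. If matching keyword information is found, the client terminal may use it as the target keyword information corresponding to the barrage data, and may further obtain, from the key information database, the classification recognition model of the target object corresponding to the target keyword information.
The key information database contains keyword information set by the user and a classification recognition model of the target object corresponding to each piece of keyword information. For example, the keyword information may be category labels of target objects set by the user, such as "flower", "tree", or "river". For each piece of keyword information set by the user, the classification recognition model of the corresponding target object can be found in the key information database. For the keyword "tree", for instance, the database stores the classification recognition model of the target object (tree), that is, a large number of contour features of trees.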
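Word segmentation means that the client terminal splits the barrage data into a plurality of word segments. Taking the barrage data "那朵花真漂亮" ("That flower is so pretty") as an example, the client terminal obtains four segments: "那朵" (that), "花" (flower), "真" (so), and "漂亮" (pretty). It then traverses the key information database for keyword information matching each of the four segments. If the user has set the keyword "花" (flower) in the key information database, the client terminal finds keyword information matching the segment "花" and uses it as the target keyword information for the barrage data "那朵花真漂亮". The client terminal may then obtain the classification recognition model of the target object corresponding to this target keyword information from the key information database, that is, a large number of contour features of the target object (flower), so that it can proceed to step S103.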
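The segmentation and lookup just described can be sketched as below. The application does not name a segmentation algorithm, so the forward maximum matching used here is an assumption, as are the function names, the small vocabulary, and the placeholder string that stands in for a classification recognition model.

```python
def segment(text, vocabulary, max_len=4):
    """Forward maximum matching: take the longest vocabulary word at each position."""
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # Fall back to a single character when no vocabulary word matches.
            if size == 1 or piece in vocabulary:
                tokens.append(piece)
                i += size
                break
    return tokens


def match_keywords(tokens, key_info_db):
    """Return the segments that appear as user-set keywords in the database."""
    return [token for token in tokens if token in key_info_db]


# Hypothetical data: a tiny vocabulary and a database with one user-set keyword.
vocabulary = {"那朵", "花", "真", "漂亮"}
key_info_db = {"花": "flower-classification-model"}  # keyword -> model placeholder

tokens = segment("那朵花真漂亮", vocabulary)
target_keywords = match_keywords(tokens, key_info_db)
```

With this vocabulary the comment splits into the four segments given in the text, and only "花" matches a keyword in the database.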
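For ease of understanding, this embodiment takes a single piece of target keyword information as an example when describing steps S103 and S104 in detail below.
It should be understood that the barrage data may contain multiple pieces of target keyword information, each corresponding to the classification recognition model of one class of target object. For example, if the target keyword information is "cat" and "dog", the classification recognition models for cats and for dogs can both be obtained from the key information database. The number of pieces of target keyword information corresponding to the video data is therefore not limited here, and the process of identifying the target objects corresponding to multiple pieces of target keyword information in the target video frame follows the process described in this embodiment for a single piece.
Step S103: obtain a target video frame from among the plurality of video frames of the video data, identify, based on the classification recognition model corresponding to the target keyword information, the image region of the corresponding target object in the target video frame, and use the identified image region as the target region.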
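Specifically, the client terminal may obtain a target video frame from among the plurality of video frames of the video data. The target video frame is a video frame within a preset time period before and after the barrage data appears, for example within 3 seconds before and after. The client terminal then divides the target video frame into a plurality of sub-regions, performs a selective search on the sub-regions, merges the sub-regions that remain after the selective search to obtain a plurality of merged regions, and determines both the sub-regions and the merged regions as to-be-processed regions. Next, the client terminal performs feature extraction on the to-be-processed regions based on a neural network model to obtain the corresponding image features. Then, based on the image features and the classification recognition model corresponding to the target keyword information, it generates a recognition probability for each to-be-processed region and, according to the recognition probabilities, selects from the to-be-processed regions the candidate regions that contain the target object corresponding to the target keyword information. Finally, based on a regression model, the client terminal makes an optimal selection among the candidate regions corresponding to the target video frame and determines the selected optimal candidate region as the target region.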
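It should be understood that after obtaining the barrage data, the client terminal displays it on the playback interface corresponding to the video data, where it moves dynamically along the barrage track, so that the barrage data is displayed across different video frames of the video data. During the time the barrage data is displayed (defined here as the barrage display period), the video frames in the corresponding video stream (the video data corresponding to the barrage data) are played dynamically in chronological order. Therefore, as long as the target keyword information in the barrage data is displayed on the playback interface, the client terminal obtains a target video frame from the video frames corresponding to the barrage data (the client terminal may split the video data corresponding to the barrage data into a plurality of video frames and take the currently playing one as the target video frame), so that it can identify the target object corresponding to the target keyword information in that frame.
Accordingly, within the barrage display period, the client terminal may identify the target object corresponding to the target keyword information in each of the video frames corresponding to the barrage data. The identification in each video frame follows the process described for the target video frame.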
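The neural network model may be a convolutional neural network (CNN) model, or a combination of a CNN model and a recurrent neural network (RNN) model. It is used to extract features from all to-be-processed regions input to it, yielding the image feature corresponding to each region. Before feature extraction, however, the client terminal must first divide the target video frame into a plurality of sub-regions, perform a selective search on the sub-regions, and merge the sub-regions that remain, which yields a plurality of merged regions (including regions produced by repeated merging). The client terminal then treats all sub-regions and all merged regions as to-be-processed regions and extracts their image features through the neural network model.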
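For ease of understanding, take feature extraction on one of the to-be-processed regions as an example. The client terminal performs convolution through the neural network model (for example, a CNN model): it randomly selects a small part of the feature information in the to-be-processed region as a sample (the convolution kernel) and slides this sample as a window across the region, that is, it convolves the sample with the region to obtain the region's spatial feature information. The amount of spatial feature information obtained this way is large. To reduce subsequent computation, the CNN model's pooling step aggregates the spatial feature information; the aggregated information is far smaller in volume than the convolution output, and it also improves the subsequent classification (that is, the recognition of the target object). Common pooling methods are average pooling and max pooling: average pooling computes one average value to represent a set of feature information, and max pooling takes the maximum value to represent the set. In this way the spatial feature information of all to-be-processed regions can be extracted and used as the image feature of each region.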
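The two pooling methods named above can be illustrated on a small feature map. This is a plain-Python sketch with non-overlapping windows; the function name `pool2d`, the 2x2 window size, and the sample values are assumptions chosen for the example, not details from the application.

```python
def pool2d(feature_map, size, reduce_fn):
    """Aggregate each non-overlapping size x size window into a single value."""
    rows, cols = len(feature_map), len(feature_map[0])
    pooled = []
    for r in range(0, rows - size + 1, size):
        pooled_row = []
        for c in range(0, cols - size + 1, size):
            window = [feature_map[r + dr][c + dc]
                      for dr in range(size) for dc in range(size)]
            pooled_row.append(reduce_fn(window))
        pooled.append(pooled_row)
    return pooled


def mean(values):
    return sum(values) / len(values)


# Hypothetical 4x4 map of spatial feature values produced by a convolution.
feature_map = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]

max_pooled = pool2d(feature_map, 2, max)   # max pooling
avg_pooled = pool2d(feature_map, 2, mean)  # average pooling
```

Both results have 4 values instead of 16, which is the reduction in feature volume the paragraph refers to.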
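In addition, after extracting the spatial feature information of the to-be-processed region, the client terminal may perform temporal processing through the recurrent neural network (RNN) model. In the forget gate, the processor first computes the information to be removed from the cell state; in the input gate, it computes the information to be stored in the cell state; finally, in the output gate, the cell state is updated: the processor multiplies the old cell state by the information to be removed and adds the information to be stored, giving the new cell state. Through linear interaction with multiple cell states, the spatio-temporal feature information hidden in the to-be-processed region can be extracted. In this way the spatio-temporal feature information of all to-be-processed regions can be extracted and used as the image feature of each region.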
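The cell state update described above ("old state times the information to remove, plus the information to store") matches the element-wise update used in a standard LSTM cell. The sketch below assumes that standard formulation, which the application only paraphrases; the function name and all numeric values are hypothetical, and the gate activations are given directly rather than computed from weights.

```python
def update_cell_state(old_state, forget_gate, input_gate, candidate):
    """Element-wise update: new = forget_gate * old + input_gate * candidate."""
    return [f * c + i * g
            for c, f, i, g in zip(old_state, forget_gate, input_gate, candidate)]


new_state = update_cell_state(
    old_state=[1.0, -2.0],
    forget_gate=[0.5, 1.0],  # share of the old state that is kept
    input_gate=[1.0, 0.0],   # share of the candidate values that is stored
    candidate=[0.25, 3.0],   # candidate values proposed for storage
)
```

In the first component half of the old state is kept and the candidate is added in full; in the second the old state passes through unchanged because the input gate is closed.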
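The recognition probability represents the probability that the to-be-processed region contains the target object.
It should be understood that, for one target video frame, the client terminal needs to run the selective search algorithm only once over the divided sub-regions to obtain the sub-regions that remain after the search. These sub-regions are correlated with one another (for example, similar in texture or in color), so the selective search algorithm greatly reduces the search area and improves the efficiency of target object recognition.
Merging the sub-regions that remain after the selective search means that the client terminal merges pairs of adjacent sub-regions according to a merging rule (for example, similar texture or similar color). Merging is repeated, depending on the number of remaining sub-regions, until a merged region carrying the complete image is obtained.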
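For example, the client terminal may divide the target video frame into a plurality of sub-regions in the background (say, 1000 sub-regions; that is, it splits the image of the target video frame into multiple image blocks, a division that is invisible to the user). It then performs a selective search on the sub-regions to obtain those that remain (for example, selecting 500 sub-regions carrying image features from the 1000), and merges them: following merging rules such as color or texture, it merges pairs of adjacent sub-regions into merged regions. The client terminal may apply the merging rule repeatedly to the merged regions until it obtains a merged region containing the complete image, and then determines the sub-regions and the merged regions as to-be-processed regions. In other words, the client terminal treats every image region that has appeared for the target video frame as a to-be-processed region and inputs these regions into the neural network model.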
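As another example, to illustrate the merging rule, suppose that a selective search over the sub-regions of the target video frame leaves 8 sub-regions. Before the search, the client terminal splits the target video frame into 100 sub-regions. It then runs the selective search algorithm over these 100 sub-regions and obtains 8 sub-regions: a, b, c, d, e, f, g, h. The client terminal merges these 8 sub-regions according to the merging rule (merging adjacent pairs): the first merge yields ab, cd, ef, gh; the second yields abcd, efgh; and the third yields abcdefgh, a merged region containing the complete image.
The to-be-processed regions then comprise the 100 sub-regions from the split, the 8 sub-regions remaining after the selective search (a, b, c, d, e, f, g, h), the four merged regions from the first merge (ab, cd, ef, gh), the two from the second merge (abcd, efgh), and the one from the third merge (abcdefgh): 115 to-be-processed regions containing image features in total (100 + 8 + 4 + 2 + 1).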
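The merge rounds and the count of 115 can be reproduced with a short sketch. It is a deliberate simplification: regions are represented by letter labels and adjacent pairs are merged in order, whereas an actual selective search would choose which neighbors to merge by a similarity measure such as color or texture. The function name `merge_rounds` is hypothetical.

```python
def merge_rounds(regions):
    """Merge adjacent pairs repeatedly until one region covers the whole image."""
    rounds = []
    current = list(regions)
    while len(current) > 1:
        current = ["".join(current[i:i + 2]) for i in range(0, len(current), 2)]
        rounds.append(current)
    return rounds


searched = list("abcdefgh")      # the 8 sub-regions left after selective search
rounds = merge_rounds(searched)  # three rounds: 4, 2, and 1 merged regions

initial_sub_regions = 100        # sub-regions produced by the initial split
total_regions = (initial_sub_regions + len(searched)
                 + sum(len(merged) for merged in rounds))
```

Every region produced along the way is kept, which is why the total is the sum over all rounds rather than just the final merged region.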
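The client terminal may then perform feature extraction on the to-be-processed regions based on the neural network model (a CNN model or a CNN+RNN model): it inputs the to-be-processed regions into the model, which outputs the corresponding image features (for 115 regions, 115 image features). Based on the image features and the classification recognition model corresponding to the target keyword information (for example, the target keyword may be the kid in the embodiment corresponding to FIG. 4, in which case the model contains a large number of contour features for the kid), the client terminal generates a recognition probability for each to-be-processed region. According to these probabilities, it selects the candidate regions containing the target object from the to-be-processed regions (that is, regions whose recognition probability exceeds a probability threshold become candidate regions). The candidate regions carry image features that fully characterize the target object.
For example, for the target keyword (kid) in the embodiment corresponding to FIG. 4, the client terminal determines 115 to-be-processed regions. Inputting them into the neural network model yields 115 image features. The client terminal matches these features against the classification recognition model for the kid by feeding them into the classifier associated with the neural network (which may be the model's built-in classifier), which outputs a recognition probability for each feature, and hence for each of the 115 regions. The client terminal then selects the regions whose recognition probability exceeds the probability threshold as candidate regions for the target object. At this point the client terminal has recognized the target object in the target video frame, that is, it has determined that the target video frame contains the target object.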
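Finally, based on a regression model, the client terminal makes an optimal selection among the candidate regions corresponding to the target video frame and determines the selected optimal candidate region as the target region.
The regression model is used to locate the target object in the target video frame: through it, the client terminal selects the optimal candidate region for the target video frame from the candidate regions of the target object. The optimal candidate region is the image region of the target object corresponding to the target keyword information in the target video frame, so the client terminal may determine it as the target region.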
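It should be understood that the target video frame may be the currently playing video frame, or another of the video frames corresponding to the barrage data, such as one not yet played. The client terminal may therefore identify the target object in the video frames corresponding to the barrage data one by one in chronological order, determining the target region in each frame in turn. Alternatively, it may process the frames together: on obtaining the video frames corresponding to the barrage data, it may treat every one of them as a target video frame and preprocess the frames not yet played, identifying the target object in them in advance to obtain the target region in each frame. Then, when the client terminal plays the target video frames in the video data in chronological order, it can animate the target region of the target object immediately.
For ease of understanding, this embodiment takes only the currently playing video frame as the target video frame. When other video frames corresponding to the barrage data are determined as target video frames, the identification of the target object in each of them follows the process described here for the currently playing frame and is not repeated.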
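Step S104: when the target video frame in the video data is played, perform animation processing on the target region in the target video frame.
It should be understood that the video data corresponding to the barrage data is played dynamically, with each video frame played one by one in chronological order. As each of these video frames is played, the client terminal may perform animation processing on the target region in the target video frame (for example, rendering the target region and displaying the rendered region enlarged).
Associating the keyword information in the barrage data with the target object located in the target region of the target video frame thus enriches the visual display effect of the barrage data.
In this embodiment, after the barrage data is obtained during playback of the video data, all key content in the barrage data (all keyword information) can be extracted based on the key information database. Using the classification recognition model of the target object corresponding to each piece of keyword information, the target object can be identified in the target video frame among the plurality of video frames of the video data, its specific position in the target video frame (its target region) can be determined, and the target region can be animated. Associating the keyword information in the barrage data with the target object located in the target region of the target video frame thus enriches the visual display effect of the barrage data.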
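Further, refer to FIG. 5, a schematic flowchart of another video processing method according to an embodiment of this application. As shown in FIG. 5, the method may include steps S201 to S208:
Step S201: play video data, and obtain barrage data corresponding to the video data.
Step S202: obtain, from the key information database, keyword information that matches the barrage data as target keyword information.
For the specific execution of steps S201 and S202, see the description of steps S101 and S102 in the embodiment corresponding to FIG. 2; details are not repeated here.
Step S203: obtain a target video frame from among the plurality of video frames of the video data.
Specifically, the client terminal may obtain a target video frame from among the video frames of the video data corresponding to the barrage data. When the barrage data is displayed on the playback interface, the client terminal may obtain the video stream corresponding to the barrage data (composed of the video frames corresponding to the barrage data in playback order), split the stream into those video frames, and select one of them as the target video frame.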
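Further, refer to FIG. 6, a schematic diagram of barrage data displayed across multiple video frames according to an embodiment of this application. As shown in FIG. 6, while the video data is played in the client terminal, the corresponding barrage data may be displayed on different video frames, producing the dynamic display effect shown in FIG. 6. Note that the barrage data is displayed along the barrage track shown in FIG. 6 (it moves from right to left across the barrage display area, which is a virtual area from the user's point of view). On the playback progress bar in FIG. 6, at playback progress t = T the barrage data is at the far right of the barrage display area, having just appeared on the playback interface; at t = T+1 it is toward the right; at t = T+2 it is toward the left; and at t = T+3 it is at the far left, about to leave the playback interface.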
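As shown in FIG. 6, the frame at time T is the currently playing video frame, and the frames at T+1, T+2, and T+3 are the frames to be played next in order. For ease of understanding, this embodiment takes the currently playing video frame as the target video frame: of the four video frames in FIG. 6, the one at playback progress t = T is the target video frame, in which the target object corresponding to the target keyword information (for example, cat) is identified.
It should be understood that as the client terminal plays the remaining video frames in the video stream (those at T+1, T+2, and T+3) in turn, each may be taken as the target video frame in chronological order so that the target object corresponding to the target keyword information (cat) is identified in it. That is, for each video frame, steps S203 to S207 are executed in a loop based on the current playback progress.
Alternatively, the target video frame may be the multiple video frames corresponding to the barrage data: the client terminal may treat all of these frames as target video frames and identify the target object corresponding to the target keyword information (cat) in them together. After step S203, the client terminal may perform steps S204 to S207 below on every video frame at once. For video frames not yet played, the client terminal can thus preprocess them together according to steps S204 to S207.
To keep the description simple, this embodiment takes the currently playing video frame among the plurality of video frames as the target video frame. When the remaining video frames are determined as target video frames, the identification of the target object in them follows the specific process described here for the currently playing frame.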
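Step S204: divide the target video frame into a plurality of sub-regions, perform a selective search on the sub-regions, merge the sub-regions that remain after the selective search to obtain a plurality of merged regions, and determine both the sub-regions and the merged regions as to-be-processed regions.
Merging the sub-regions that remain after the selective search means that the client terminal merges pairs of adjacent sub-regions according to a merging rule (for example, similar texture or similar color). Merging is repeated, depending on the number of remaining sub-regions, until a merged region carrying the complete image is obtained.
For how the to-be-processed regions are determined, see the description in the embodiment corresponding to FIG. 2: the client terminal determines both the merged regions obtained through repeated merging and the sub-regions obtained from the split as to-be-processed regions. The splitting of the target video frame, the selective search over the sub-regions, and the merging of the remaining sub-regions are described in that embodiment and are not repeated here.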
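Step S205: perform feature extraction on the to-be-processed regions based on a neural network model to obtain the image features corresponding to the to-be-processed regions.
Specifically, the client terminal may scale the image blocks in the to-be-processed regions to the same size, use the same-size regions as input to the neural network model, and obtain from the model the image features corresponding to the image blocks.
Further, refer to FIG. 7, a schematic diagram of feature extraction according to an embodiment of this application. As shown in FIG. 7, in image processing area C the client terminal processes the image blocks in to-be-processed region A and to-be-processed region B of the target video frame, scaling both to the same size to ensure accurate feature extraction. It then feeds the same-size regions to the neural network model, which outputs the image features corresponding to the image blocks (the image feature of region A and the image feature of region B shown in FIG. 7).
It should be understood that regions A and B are only some of the to-be-processed regions of the target video frame. In practice, the client terminal processes the image blocks in all to-be-processed regions, inputs all of the same-size regions into the neural network model, and obtains the image feature corresponding to each image block (for 1000 to-be-processed regions, 1000 image features in one-to-one correspondence).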
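Scaling image blocks of different shapes to one common size can be sketched as follows. The application does not say which resampling method is used; nearest-neighbor sampling is assumed here because it is the simplest, and the function name, the 4x4 output size, and the pixel values are hypothetical.

```python
def resize_nearest(block, out_h, out_w):
    """Resize a 2D block of pixel values by nearest-neighbor sampling."""
    in_h, in_w = len(block), len(block[0])
    return [
        [block[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


# Hypothetical image blocks for to-be-processed regions A and B, of different shapes.
block_a = [[1, 2],
           [3, 4]]
block_b = [[5, 6, 7]]

# Both blocks are brought to the same 4x4 size before being fed to the network.
scaled_a = resize_nearest(block_a, 4, 4)
scaled_b = resize_nearest(block_b, 4, 4)
```

After scaling, both inputs have identical dimensions, which is what lets a single network with a fixed input size process every region.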
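Step S206: based on the image features and the classification recognition model corresponding to the target keyword information, generate the recognition probability corresponding to each to-be-processed region, and select, according to the recognition probabilities, the candidate regions that contain the target object corresponding to the target keyword information from the to-be-processed regions.
The recognition probability represents the probability that the to-be-processed region contains the target object.
For example, in the target video frame shown in FIG. 7, the target keyword information of the barrage data is "cat", and the corresponding classification recognition model contains multiple trained contour features for cats. Through the classifier in the neural network model (that is, the classification recognition model, for example a support vector machine), the client terminal compares the image features of region A and region B in FIG. 7 against the contour features in the classification recognition model, obtaining a recognition probability for region A (for example, 90%) and for region B (for example, 40%). Based on these probabilities, the client terminal determines region A as the candidate region for the target object, since its recognition probability exceeds the recognition probability threshold (for example, 70%). The client terminal therefore concludes that region A contains the target object corresponding to the target keyword information and that region B does not.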
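The threshold test in this example amounts to a simple filter over the recognition probabilities. The sketch below uses the values from the text (90% for region A, 40% for region B, threshold 70%); the function name `select_candidates` is hypothetical, and the probabilities are given directly rather than produced by a classifier.

```python
def select_candidates(probabilities, threshold):
    """Keep the regions whose recognition probability exceeds the threshold."""
    return [name for name, p in probabilities.items() if p > threshold]


recognition_probabilities = {"A": 0.90, "B": 0.40}
candidates = select_candidates(recognition_probabilities, threshold=0.70)
```

Only region A passes, so it alone is treated as containing the target object.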
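Step S207: based on a regression model, make an optimal selection among the candidate regions corresponding to the target video frame, and determine the selected optimal candidate region as the target region.
Specifically, the client terminal may, based on the regression model, make an optimal selection among the candidate regions corresponding to the target video frame to obtain the frame's optimal candidate region; determine the previous video frame as the reference video frame and obtain its candidate regions; select estimated regions from the reference frame's candidate regions; determine the overlap rate between each estimated region and the target frame's optimal candidate region; obtain the estimated region with the highest overlap rate and use it to correct the target frame's optimal candidate region; and determine the corrected optimal candidate region as the target region.
The candidate regions corresponding to the reference video frame are selected, through the classification recognition model corresponding to the target keyword information, from the to-be-processed regions of the reference video frame, which are generated by performing a selective search on the reference video frame.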
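After step S206, the client terminal may take the to-be-processed regions that contain the target object corresponding to the target keyword as candidate regions. (These are obtained by scaling to-be-processed regions X1, X2, and X3 of the target video frame shown in FIG. 8.) As shown in FIG. 8, there is therefore a mapping between the target video frame's to-be-processed regions X1, X2, and X3 and the three candidate regions containing the target object: to-be-processed region X1 maps to candidate region X1, X2 to X2, and X3 to X3. The client terminal makes the optimal selection among these three candidate regions. Refer to FIG. 8, a schematic diagram of selecting the optimal candidate region according to an embodiment of this application. As shown in FIG. 8, after obtaining the candidate regions of the target video frame (candidate regions X1, X2, and X3), the client terminal selects the optimal candidate region from among them based on the regression model.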
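In some embodiments, the regression model is used to compute, for each candidate region, the normalized distance from the target object to the candidate border of that region. The candidate region whose normalized distance satisfies a preset condition is taken as the optimal candidate region of the target video frame and determined as the target region. The preset condition is that the normalized distance from the target object to the candidate border of its candidate region is the smallest, or that it is the largest.
For ease of understanding, the candidate borders of the three candidate regions are drawn with different line styles, and each can be matched to its to-be-processed region in the target video frame shown in FIG. 8.
Because the candidate border of every to-be-processed region has been scaled to the same size, the client terminal can use the regression model to compute the normalized distance from the target object to the candidate border in each candidate region. Put differently, the regression model yields the position (for example, the center point) in the target video frame of each to-be-processed region input to the neural network model, along with its length and width. Computing the normalized distances shows that in candidate region X1 the target object is relatively far from the candidate border, in candidate region X2 it is nearest, and in candidate region X3 it is farthest. Candidate region X2 is therefore taken as the optimal candidate region of the target video frame.
The optimal candidate region in each of the video frames corresponding to the barrage data is selected in the same way as shown in FIG. 8; details are not repeated here.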
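Further, the client terminal may determine the previous video frame of the target video frame as the reference video frame and obtain the candidate regions corresponding to the reference video frame.
The reference video frame may be the frame immediately preceding the target video frame; both are frames of the currently playing video data. The candidate regions of the reference video frame are determined in the same way as those of the target video frame in the embodiment corresponding to FIG. 8: they are selected, through the classification recognition model corresponding to the target keyword information, from the to-be-processed regions of the reference video frame, which are generated by performing a selective search on the reference video frame.
The client terminal may then select estimated regions from the candidate regions corresponding to the reference video frame.
Specifically, the client terminal first obtains the target region of the target object in the reference video frame and takes the position of that target region as the first position information, and takes the positions of the reference frame's candidate regions as the second position information. It then computes the distance between the first position information and each second position information and takes the candidate regions whose distance is below a distance threshold as estimated regions for the target video frame. Next, it determines the overlap rate between each estimated region and the target frame's optimal candidate region, obtains the estimated region with the highest overlap rate, uses it to correct the optimal candidate region, and determines the corrected optimal candidate region as the target region. The corrected optimal candidate region is the image region of the target object corresponding to the target keyword information in the target video frame, which is why the client terminal may determine it as the target region.
The target region of the reference video frame is obtained by correcting the reference frame's optimal candidate region, which in turn is selected from the reference frame's candidate regions based on the regression model.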
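After obtaining the target region of the target object (for example, a cat) in the reference video frame, the client terminal may obtain the position of that target region, namely its center point (denoted Q1), as the first position information, and the positions of the reference frame's candidate regions as the second position information. Because the reference frame's target region is obtained by correcting its optimal candidate region, which is itself selected from the reference frame's candidate regions based on the regression model, the target region of the reference frame can also be regarded as one of its candidate regions. For a stationary target object, for instance, the target region in the reference video frame and the target region in the target video frame have the same center point. So when the reference video frame is used to estimate where the target object lies in the target video frame, the reference frame's target region can be treated as a special candidate region.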
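Further, refer also to FIG. 9, a schematic diagram of correcting the optimal candidate region of the target video frame according to an embodiment of this application. Of the four candidate regions shown in FIG. 9, candidate region Y1 is the target region of the reference video frame (obtained by correcting the reference frame's optimal candidate region, which was selected from its candidate regions based on the regression model). The client terminal can determine the positions of the four candidate regions Y1, Y2, Y3, and Y4 in the reference video frame: Y1 is at the target region's position Q1, Y2 at Q2, Y3 at Q3, and Y4 at Q4. Given the continuity of the target object across adjacent video frames (for example, a consistent motion trajectory), the region where the target object will appear in the next video frame can be estimated in the reference video frame: the distance between each candidate region of the reference frame and the target region is computed, and the candidate regions whose distance is below the distance threshold become the estimated regions for the target video frame. The client terminal thus computes the distance between the first position information and each second position information (denoted D): D1 between Q1 and Q1 (D1 = 0), D2 between Q1 and Q2 (for example, D2 = 0.8), D3 between Q1 and Q3 (for example, D3 = 0.5), and D4 between Q1 and Q4 (for example, D4 = 3). With a distance threshold of, say, 1, the distances for candidate regions Y1, Y2, and Y3 are below the threshold, so, as shown in FIG. 9, the client terminal takes candidate regions Y1, Y2, and Y3 as the estimated regions for the target video frame.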
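The selection of estimated regions can be sketched with the distances from this example. The application does not define the distance measure or give coordinates, so Euclidean distance between center points is assumed, and the coordinates below are hypothetical values chosen only to reproduce D1 = 0, D2 = 0.8, D3 = 0.5, and D4 = 3.

```python
import math


def select_estimated_regions(target_center, candidate_centers, distance_threshold):
    """Keep candidate regions whose center lies within the threshold of the target center."""
    estimated = []
    for name, (cx, cy) in candidate_centers.items():
        distance = math.hypot(cx - target_center[0], cy - target_center[1])
        if distance < distance_threshold:
            estimated.append(name)
    return estimated


q1 = (0.0, 0.0)                # center of the reference frame's target region (Q1)
candidate_centers = {
    "Y1": (0.0, 0.0),          # Q1, distance 0
    "Y2": (0.8, 0.0),          # Q2, distance 0.8
    "Y3": (0.0, 0.5),          # Q3, distance 0.5
    "Y4": (3.0, 0.0),          # Q4, distance 3
}

estimated_regions = select_estimated_regions(q1, candidate_centers, distance_threshold=1.0)
```

Y4 is excluded because it lies too far from where the target object was in the reference frame to be a plausible location in the next frame.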
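Further, the client terminal determines the overlap rate between each estimated region and the target frame's optimal candidate region, obtains the estimated region with the highest overlap rate, uses it to correct the optimal candidate region, and determines the corrected optimal candidate region as the target region.
The overlap rate may be computed as follows. The client terminal first obtains the length and width of the target frame's optimal candidate region and determines its area from them, as the first area. It then obtains the length and width of the estimated region and, from the lengths and widths of the estimated region and of the optimal candidate region, determines the overlapping area between the two, as the second area. Finally, it determines the overlap rate between the estimated region and the optimal candidate region from the first area and the second area.
After obtaining the estimated regions for the target video frame shown in FIG. 9 (estimated regions Y1, Y2, and Y3), the client terminal may use a shape comparison network to compute the overlapping area between each of the three estimated regions and the target frame's optimal candidate region as the second area; the area of the optimal candidate region is the first area. The ratio of the second area to the first area gives the overlap rate for each estimated region. Suppose the overlap rate is 50% for estimated region Y1, 40% for Y2, and 80% for Y3. Estimated region Y3 then has the highest overlap rate, and the client terminal may take the mean of the center position of that estimated region and the center position of the optimal candidate region as the center point of the target region of the target video frame.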
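The overlap rate and the center correction can be sketched for axis-aligned rectangles. Several things here are assumptions: boxes are written as (x, y, width, height) with (x, y) the top-left corner, the overlap is computed geometrically instead of through the shape comparison network mentioned in the text, and the box coordinates are hypothetical values chosen to reproduce the 50%, 40%, and 80% overlap rates.

```python
def overlap_area(a, b):
    """Area of the intersection of two (x, y, width, height) boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    w = min(ax + aw, bx + bw) - max(ax, bx)
    h = min(ay + ah, by + bh) - max(ay, by)
    return max(w, 0) * max(h, 0)


def overlap_rate(estimated, optimal):
    """Second area (overlap) divided by first area (optimal candidate region)."""
    first_area = optimal[2] * optimal[3]
    return overlap_area(estimated, optimal) / first_area


def center(box):
    x, y, w, h = box
    return (x + w / 2, y + h / 2)


def corrected_center(estimated_regions, optimal):
    """Average the optimal region's center with that of the best-overlapping estimate."""
    best = max(estimated_regions, key=lambda box: overlap_rate(box, optimal))
    (ex, ey), (ox, oy) = center(best), center(optimal)
    return ((ex + ox) / 2, (ey + oy) / 2)


optimal = (0, 0, 10, 10)
estimated = {"Y1": (5, 0, 10, 10), "Y2": (6, 0, 10, 10), "Y3": (2, 0, 10, 10)}

rates = {name: overlap_rate(box, optimal) for name, box in estimated.items()}
new_center = corrected_center(list(estimated.values()), optimal)
```

Note that this ratio divides by the optimal candidate region's area only, as the text describes, which differs from the intersection-over-union measure common in object detection.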
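The shape comparison network thus gives the degree of overlap between the estimated regions and the optimal candidate region of the target video frame (their overlap rate), and the client terminal uses the estimated region with the highest overlap rate to correct the optimal candidate region. If the overlap rate between estimated region Y1 and the optimal candidate region is 100%, the target object is stationary across the video frames of the video data, and the target region of the target video frame has the same center position as that of the reference video frame. In this case the target region (the image region of the target object corresponding to the target keyword information in the target video frame) can be understood as the optimal candidate region of the target video frame; equivalently, it can be understood as the image region obtained after the client terminal corrects that optimal candidate region, that is, the corrected optimal candidate region.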
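Step S208: when the target video frame in the video data is played, perform animation processing on the target region in the target video frame.
For the specific implementation of step S208, see the description of step S104 in the embodiment corresponding to FIG. 2; details are not repeated here.
Thus, during playback, this embodiment may split the video data corresponding to the barrage data into a plurality of video frames and identify the target object corresponding to the target keyword information in the currently playing video data. The client terminal converts detection of the target object in the video data into frame-by-frame detection. If the target object is present in every video frame of the video data, the client terminal can determine its target region in each frame and set an animation effect for each target region, so that the animation effect is displayed as each frame is played. Associating the target keyword information in the barrage data with the target object located in the target region of the target video frame enriches the visual display effect of the barrage data, helps users watching the video data capture the content they care about through the keyword information they have set, and avoids wasting device and network resources on identifying and capturing barrage data.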
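For example, if the recorded video data shows a person running, the user can see the person's motion trajectory on the playback interface. Because the video frames are played in chronological order, once the client terminal has obtained the target region in each frame it can animate the target regions in turn, so that the user sees the animation effect of each target region on the playback interface, and therefore the person's motion trajectory.
In this embodiment, after the barrage data is obtained during playback of the video data, all key content in the barrage data (all keyword information) can be extracted based on the key information database. Using the classification recognition model of the target object corresponding to each piece of keyword information, the target object can be identified in the target video frame among the plurality of video frames of the video data, its specific position in the target video frame (its target region) can be determined, and the target region can be animated. Associating the keyword information in the barrage data with the target object located in the target region of the target video frame thus enriches the visual display effect of the barrage data.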
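Further, refer to FIG. 10, a schematic structural diagram of a video processing apparatus according to an embodiment of this application. As shown in FIG. 10, the video processing apparatus 1 may be the target client terminal in the embodiment corresponding to FIG. 1 and may include a barrage data obtaining module 10, a keyword obtaining module 20, a target object recognition module 30, and a target region processing module 40.
The barrage data obtaining module 10 is configured to play video data and obtain barrage data corresponding to the video data.
The barrage data obtaining module 10 is specifically configured to play video data, send a barrage acquisition request to the barrage server, receive the historical barrage data that the barrage server returns in response to the request, use the historical barrage data as the barrage data corresponding to the video data, and display the barrage data on the playback interface of the video data.
Optionally, the barrage data obtaining module 10 is specifically configured to play video data, obtain text input data, use the text input data as the barrage data corresponding to the video data, display the barrage data on the playback interface of the video data based on a barrage track, and send the barrage data to the barrage server, so that the barrage server sends it synchronously to the client terminals watching the video data.
The keyword obtaining module 20 is configured to obtain, from the key information database, keyword information that matches the barrage data as target keyword information. The key information database contains keyword information set by the user and a classification recognition model of the target object corresponding to each piece of keyword information.
The keyword obtaining module 20 includes a barrage data splitting unit 201, a keyword search unit 202, a keyword determination unit 203, and a classification model obtaining unit 204.
The barrage data splitting unit 201 is configured to obtain the key information database and split the barrage data into a plurality of word segments.
The keyword search unit 202 is configured to traverse the key information database for keyword information matching each word segment.
The keyword determination unit 203 is configured to, if keyword information matching a word segment is found, use that keyword information as the target keyword information corresponding to the barrage data.
The classification model obtaining unit 204 is configured to obtain, from the key information database, the classification recognition model of the target object corresponding to the target keyword information.
For the specific implementation of the barrage data splitting unit 201, the keyword search unit 202, the keyword determination unit 203, and the classification model obtaining unit 204, see the description of step S102 in the embodiment corresponding to FIG. 2; details are not repeated here.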
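The target object recognition module 30 is configured to obtain a target video frame from among the plurality of video frames of the video data, identify, based on the classification recognition model corresponding to the target keyword information, the image region of the corresponding target object in the target video frame, and use the identified image region as the target region.
The target object recognition module 30 includes a target frame obtaining unit 301, a selective search unit 302, a feature extraction unit 303, a candidate region determination unit 304, and a target region determination unit 305.
The target frame obtaining unit 301 is configured to obtain a target video frame from among the plurality of video frames of the video data.
The selective search unit 302 is configured to divide the target video frame into a plurality of sub-regions, perform a selective search on the sub-regions, merge the sub-regions that remain after the selective search to obtain a plurality of merged regions, and determine both the sub-regions and the merged regions as to-be-processed regions.
The feature extraction unit 303 is configured to perform feature extraction on the to-be-processed regions based on a neural network model to obtain the image features corresponding to the to-be-processed regions.
Specifically, the feature extraction unit 303 is configured to scale the image blocks in the to-be-processed regions to the same size, use the same-size regions as input to the neural network model, and obtain from the model the image features corresponding to the image blocks.
The candidate region determination unit 304 is configured to generate, based on the image features and the classification recognition model corresponding to the target keyword information, the recognition probability corresponding to each to-be-processed region, and to select, according to the recognition probabilities, the candidate regions that contain the target object corresponding to the target keyword information from the to-be-processed regions.
The recognition probability represents the probability that the to-be-processed region contains the target object.
The target region determination unit 305 is configured to make, based on a regression model, an optimal selection among the candidate regions corresponding to the target video frame, and to determine the selected optimal candidate region as the target region.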
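The target region determination unit 305 includes an optimal region selection subunit 3051, a reference frame determination subunit 3052, an estimated region selection subunit 3053, an overlap rate determination subunit 3054, and a target region determination subunit 3055.
The optimal region selection subunit 3051 is configured to make, based on the regression model, an optimal selection among the candidate regions corresponding to the target video frame and select the frame's optimal candidate region from them.
The reference frame determination subunit 3052 is configured to determine the previous video frame of the target video frame as the reference video frame and obtain the candidate regions corresponding to the reference video frame. These candidate regions are selected, through the classification recognition model corresponding to the target keyword information, from the to-be-processed regions of the reference video frame, which are generated by performing a selective search on the reference video frame.
The estimated region selection subunit 3053 is configured to select estimated regions from the candidate regions corresponding to the reference video frame.
Specifically, the estimated region selection subunit 3053 is configured to obtain the target region of the target object in the reference video frame, take the position information of that target region as the first position information, take the position information of the reference frame's candidate regions as the second position information, calculate the distance between the first position information and the second position information, and use the candidate regions whose distance is less than the distance threshold as the estimated regions corresponding to the target video frame.
The target region of the reference video frame is obtained by correcting the reference frame's optimal candidate region, which is selected from the reference frame's candidate regions based on the regression model.
The overlap rate determination subunit 3054 is configured to determine the overlap rate between each estimated region and the optimal candidate region corresponding to the target video frame.
Specifically, the overlap rate determination subunit 3054 is configured to obtain the length and width of the target frame's optimal candidate region and determine its area from them, as the first area; obtain the length and width of the estimated region and, from the lengths and widths of the estimated region and of the optimal candidate region, determine the overlapping area between the two, as the second area; and determine the overlap rate between the estimated region and the optimal candidate region from the first area and the second area.
The target region determination subunit 3055 is configured to obtain the estimated region with the highest overlap rate, use it to correct the optimal candidate region corresponding to the target video frame, and determine the corrected optimal candidate region as the target region.
For the specific implementation of the optimal region selection subunit 3051, the reference frame determination subunit 3052, the estimated region selection subunit 3053, the overlap rate determination subunit 3054, and the target region determination subunit 3055, see the description of step S207 in the embodiment corresponding to FIG. 5; details are not repeated here.
For the specific implementation of the target frame obtaining unit 301, the selective search unit 302, the feature extraction unit 303, the candidate region determination unit 304, and the target region determination unit 305, see the description of steps S203 to S207 in the embodiment corresponding to FIG. 5; details are not repeated here.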
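The target region processing module 40 is configured to perform animation processing on the target region in the target video frame when the target video frame in the video data is played.
For the specific implementation of the barrage data obtaining module 10, the keyword obtaining module 20, the target object recognition module 30, and the target region processing module 40, see the description of steps S101 to S104 in the embodiment corresponding to FIG. 2; details are not repeated here.
In this embodiment, after the barrage data is obtained during playback of the video data, all key content in the barrage data (all keyword information) can be extracted based on the key information database. Using the classification recognition model of the target object corresponding to each piece of keyword information, the target object can be identified in the target video frame among the plurality of video frames of the video data, its specific position in the target video frame (its target region) can be determined, and the target region can be animated. Associating the keyword information in the barrage data with the target object located in the target region of the target video frame thus enriches the visual display effect of the barrage data.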
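Further, refer to FIG. 11, a schematic structural diagram of another video processing apparatus according to an embodiment of this application. As shown in FIG. 11, the video processing apparatus 1000 may be the target client terminal in the embodiment corresponding to FIG. 1 and may include at least one processor 1001 (for example, a CPU), at least one network interface 1004, a user interface 1003, a memory 1005, and at least one communication bus 1002. The communication bus 1002 implements connection and communication between these components. The user interface 1003 may include a display and a keyboard, and may optionally further include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (such as a WI-FI interface). The memory 1005 may be a high-speed RAM or a non-volatile memory, for example at least one magnetic disk memory. The memory 1005 may optionally also be at least one storage device located remotely from the processor 1001. As shown in FIG. 11, the memory 1005, as a computer storage medium, may include an operating system, a network communication module, a user interface module, and a device control application program.
In the video processing apparatus 1000 shown in FIG. 11, the network interface 1004 is mainly used to connect to the barrage server and the video source server, the user interface 1003 is mainly used to provide an input interface for the user, and the processor 1001 may call the device control application program stored in the memory 1005 to implement the following: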
播放视频数据,并获取所述视频数据对应的弹幕数据;
在关键信息库中获取与所述弹幕数据相匹配的关键字信息,作为目标关键字信息;所述关键信息库中包含用户设置的关键字信息,以及每个关键字信息对应的目标对象的分类识别模型;
在所述视频数据的多个视频帧中获取目标视频帧,并基于所述目标关键字信息对应的分类识别模型,识别所述目标关键字信息对应的目标对象在所述目标视频帧中的图像区域,作为目标区域;
当播放所述视频数据中的所述目标视频帧时,对所述目标视频帧中的所述目标区域进行动画处理。
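The keyword lookup and recognition operations listed above can be sketched as follows. This is an illustration under stated assumptions, not the application's implementation: the key information library is modeled as a dictionary from lowercase keyword to a hypothetical recognizer callable, barrage text is tokenized on whitespace (real Chinese barrage text would need a word segmenter), and the names `match_keywords` and `find_target_regions` are invented for the example.

```python
from typing import Callable, Dict, List, Optional, Tuple

Box = Tuple[int, int, int, int]  # assumed (x, y, width, height)

# Hypothetical stand-in for a classification recognition model: given a video
# frame, it returns the target region of its object, or None if not found.
Recognizer = Callable[[object], Optional[Box]]


def match_keywords(barrage_text: str, key_library: Dict[str, Recognizer]) -> List[str]:
    """Split the barrage text into tokens and return the tokens found in the library."""
    matched: List[str] = []
    for token in barrage_text.lower().split():  # assumption: whitespace tokenization
        if token in key_library and token not in matched:
            matched.append(token)
    return matched


def find_target_regions(
    barrage_text: str,
    frame: object,
    key_library: Dict[str, Recognizer],
) -> Dict[str, Box]:
    """Run the recognizer of every matched keyword on the target video frame."""
    regions: Dict[str, Box] = {}
    for keyword in match_keywords(barrage_text, key_library):
        region = key_library[keyword](frame)
        if region is not None:
            regions[keyword] = region  # this region would then be animated
    return regions
```

A player would call `find_target_regions` when barrage data arrives and animate each returned region while the corresponding target video frame is displayed.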
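It should be understood that the video processing apparatus 1000 described in the embodiments of this application can perform the video processing method described in the embodiment corresponding to FIG. 2 or FIG. 5 above, and can also implement the video processing apparatus 1 described in the embodiment corresponding to FIG. 10 above; details are not repeated here. The beneficial effects of using the same method are likewise not repeated.
In addition, it should be noted that an embodiment of this application further provides a computer storage medium. The computer storage medium stores the computer program executed by the video processing apparatus 1 mentioned above, and the computer program includes program instructions. When a processor executes the program instructions, it can perform the video processing method described in the embodiment corresponding to FIG. 2 or FIG. 5 above; details are therefore not repeated here. The beneficial effects of using the same method are likewise not repeated. For technical details not disclosed in the computer storage medium embodiment of this application, refer to the description of the method embodiments of this application.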
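A person of ordinary skill in the art will understand that all or part of the procedures of the methods in the foregoing embodiments may be implemented by a computer program instructing relevant hardware. The program may be stored in a computer-readable storage medium and, when executed, may include the procedures of the foregoing method embodiments. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like.
What is disclosed above is merely preferred embodiments of this application and is not intended to limit the scope of the claims of this application. Equivalent variations made in accordance with the claims of this application therefore still fall within the scope of this application.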

Claims (15)

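  1. A video processing method, performed by a client terminal, comprising:
    playing video data, and obtaining barrage data corresponding to the video data;
    obtaining, from a key information library, keyword information matching the barrage data as target keyword information, the key information library comprising keyword information set by a user and a classification recognition model of a target object corresponding to each piece of keyword information;
    obtaining a target video frame from a plurality of video frames of the video data, recognizing, based on the classification recognition model corresponding to the target keyword information, an image region of the target object corresponding to the target keyword information in the target video frame, and using the recognized image region as a target region; and
    performing animation processing on the target region in the target video frame when the target video frame in the video data is played.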
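  2. The method according to claim 1, wherein the obtaining, from a key information library, keyword information matching the barrage data as target keyword information comprises:
    obtaining the key information library, and splitting the barrage data into a plurality of pieces of word segmentation data;
    traversing the key information library to search for keyword information matching each piece of word segmentation data;
    if keyword information matching the word segmentation data is found, using the keyword information as the target keyword information corresponding to the barrage data; and
    obtaining, from the key information library, the classification recognition model of the target object corresponding to the target keyword information.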
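  3. The method according to claim 2, wherein the obtaining a target video frame from a plurality of video frames of the video data, recognizing, based on the classification recognition model corresponding to the target keyword information, an image region of the target object corresponding to the target keyword information in the target video frame, and using the recognized image region as a target region comprises:
    obtaining the target video frame from the plurality of video frames of the video data;
    dividing the target video frame into a plurality of subregions, performing selective search on the subregions, merging the subregions after the selective search to obtain a plurality of merged regions, and determining both the plurality of subregions and the plurality of merged regions as to-be-processed regions;
    performing feature extraction on the to-be-processed regions based on a neural network model, to obtain image features corresponding to the to-be-processed regions;
    generating, based on the image features and the classification recognition model corresponding to the target keyword information, a recognition probability for each to-be-processed region, and selecting from the to-be-processed regions, according to the recognition probabilities, candidate regions that contain the target object corresponding to the target keyword information, the recognition probability representing the probability that the to-be-processed region contains the target object; and
    calculating, by a regression model, for each candidate region, a normalized distance from the target object in the candidate region to the candidate bounding box of that candidate region, using the candidate region whose normalized distance meets a preset condition as the optimal candidate region corresponding to the target video frame, and determining the selected optimal candidate region as the target region, the preset condition being that the normalized distance from the target object to the candidate bounding box of its candidate region is the smallest, or that this normalized distance is the largest.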
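  4. The method according to claim 3, wherein the performing feature extraction on the to-be-processed regions based on a neural network model, to obtain image features corresponding to the to-be-processed regions comprises:
    scaling the image blocks in the to-be-processed regions to the same size, using the to-be-processed regions of the same size as input to the neural network model, and outputting, by the neural network model, image features corresponding to the image blocks in the to-be-processed regions.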
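  5. The method according to claim 3, wherein determining the selected optimal candidate region corresponding to the target video frame as the target region comprises:
    determining the video frame immediately preceding the target video frame as a reference video frame, and obtaining a plurality of candidate regions corresponding to the reference video frame, the plurality of candidate regions corresponding to the reference video frame being selected, by the classification recognition model corresponding to the target keyword information, from to-be-processed regions corresponding to the reference video frame, and the to-be-processed regions corresponding to the reference video frame being generated by performing selective search on the reference video frame;
    selecting estimated regions from the plurality of candidate regions corresponding to the reference video frame;
    determining an overlap rate between each estimated region and the optimal candidate region corresponding to the target video frame; and
    obtaining the estimated region having the highest overlap rate, correcting the optimal candidate region corresponding to the target video frame using the estimated region having the highest overlap rate, and determining the corrected optimal candidate region as the target region.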
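  6. The method according to claim 5, wherein the selecting estimated regions from the plurality of candidate regions corresponding to the reference video frame comprises:
    obtaining a target region of the target object in the reference video frame, the target region corresponding to the reference video frame being obtained by correcting an optimal candidate region corresponding to the reference video frame, and the optimal candidate region corresponding to the reference video frame being selected, based on the regression model, from the candidate regions corresponding to the reference video frame;
    obtaining position information of the target region in the reference video frame as first position information, and obtaining position information of the candidate regions corresponding to the reference video frame as second position information; and
    calculating a distance between the first position information and the second position information corresponding to the reference video frame, and using each candidate region whose distance is less than a distance threshold as an estimated region corresponding to the target video frame.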
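  7. The method according to claim 6, wherein the determining an overlap rate between each estimated region and the optimal candidate region corresponding to the target video frame comprises:
    obtaining a length value and a width value of the optimal candidate region corresponding to the target video frame, and determining, according to the length value and the width value, an area of the optimal candidate region corresponding to the target video frame as a first area;
    obtaining a length value and a width value of the estimated region, and determining, according to the length value and the width value of the estimated region and the length value and the width value of the optimal candidate region corresponding to the target video frame, an overlapping area between the estimated region and the optimal candidate region corresponding to the target video frame as a second area; and
    determining, according to the first area and the second area, the overlap rate between the estimated region and the optimal candidate region corresponding to the target video frame.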
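  8. The method according to claim 1, wherein the playing video data, and obtaining barrage data corresponding to the video data comprises:
    playing the video data, sending a barrage obtaining request to a barrage server, receiving historical barrage data returned by the barrage server based on the barrage obtaining request, using the historical barrage data as the barrage data corresponding to the video data, and displaying the barrage data on a playback interface of the video data.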
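  9. The method according to claim 1, wherein the playing video data, and obtaining barrage data corresponding to the video data comprises:
    playing the video data, obtaining text input data, using the text input data as the barrage data corresponding to the video data, displaying the barrage data on a barrage track on a playback interface of the video data, and sending the barrage data to a barrage server, so that the barrage server synchronously sends the barrage data to client terminals watching the video data.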
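  10. A video processing apparatus, comprising:
    a barrage data obtaining module, configured to play video data and obtain barrage data corresponding to the video data;
    a keyword obtaining module, configured to obtain, from a key information library, keyword information matching the barrage data as target keyword information, the key information library comprising keyword information set by a user and a classification recognition model of a target object corresponding to each piece of keyword information;
    a target object recognition module, configured to obtain a target video frame from a plurality of video frames of the video data, recognize, based on the classification recognition model corresponding to the target keyword information, an image region of the target object corresponding to the target keyword information in the target video frame, and use the recognized image region as a target region; and
    a target region processing module, configured to perform animation processing on the target region in the target video frame when the target video frame in the video data is played.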
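  11. The apparatus according to claim 10, wherein the keyword obtaining module comprises:
    a barrage data splitting unit, configured to obtain the key information library and split the barrage data into a plurality of pieces of word segmentation data;
    a keyword search unit, configured to traverse the key information library to search for keyword information matching each piece of word segmentation data;
    a keyword determining unit, configured to, if keyword information matching the word segmentation data is found, use the keyword information as the target keyword information corresponding to the barrage data; and
    a classification model obtaining unit, configured to obtain, from the key information library, the classification recognition model of the target object corresponding to the target keyword information.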
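  12. The apparatus according to claim 11, wherein the target object recognition module comprises:
    a target frame obtaining unit, configured to obtain the target video frame from the plurality of video frames of the video data;
    a selective search unit, configured to divide the target video frame into a plurality of subregions, perform selective search on the subregions, merge the subregions after the selective search to obtain a plurality of merged regions, and determine both the plurality of subregions and the plurality of merged regions as to-be-processed regions;
    a feature extraction unit, configured to perform feature extraction on the to-be-processed regions based on a neural network model, to obtain image features corresponding to the to-be-processed regions;
    a candidate region determining unit, configured to generate, based on the image features and the classification recognition model corresponding to the target keyword information, a recognition probability for each to-be-processed region, and select from the to-be-processed regions, according to the recognition probabilities, candidate regions that contain the target object corresponding to the target keyword information, the recognition probability representing the probability that the to-be-processed region contains the target object; and
    a target region determining unit, configured to perform optimal selection, based on a regression model, among the candidate regions corresponding to the target video frame, and determine the selected optimal candidate region corresponding to the target video frame as the target region.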
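  13. The apparatus according to claim 12, wherein
    the feature extraction unit is specifically configured to scale the image blocks in the to-be-processed regions to the same size, use the to-be-processed regions of the same size as input to the neural network model, and output, by the neural network model, image features corresponding to the image blocks in the to-be-processed regions.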
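  14. A video processing apparatus, comprising a processor and a memory,
    the processor being connected to the memory, wherein the memory is configured to store program code, and the processor is configured to invoke the program code to perform the method according to any one of claims 1 to 9.
  15. A computer storage medium, storing a computer program, the computer program comprising program instructions that, when executed by a processor, perform the method according to any one of claims 1 to 9.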
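The candidate selection and optimal-region selection recited in claim 3 can be illustrated with the sketch below. It is not part of the claims and rests on assumptions: recognition probabilities and normalized distances are taken as precomputed outputs of the classification recognition model and the regression model, candidates are chosen with a probability threshold (the claims only say selection is made "according to" the probabilities), and the names `ScoredRegion`, `select_candidates`, and `select_optimal` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Box = Tuple[int, int, int, int]  # assumed (x, y, width, height)


@dataclass
class ScoredRegion:
    box: Box
    probability: float          # recognition probability from the classifier
    normalized_distance: float  # output of the regression model


def select_candidates(
    regions: List[ScoredRegion], probability_threshold: float
) -> List[ScoredRegion]:
    """Keep to-be-processed regions likely to contain the target object.

    The threshold rule is an assumption made for this sketch.
    """
    return [r for r in regions if r.probability >= probability_threshold]


def select_optimal(
    candidates: List[ScoredRegion], use_minimum: bool = True
) -> Optional[ScoredRegion]:
    """Pick the candidate whose normalized distance is smallest (or largest).

    Both variants of the preset condition in claim 3 are covered by use_minimum.
    """
    if not candidates:
        return None
    chooser = min if use_minimum else max
    return chooser(candidates, key=lambda r: r.normalized_distance)
```

The `use_minimum` flag mirrors the two alternatives of the preset condition; which one applies would depend on how the regression model defines its normalized distance.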
PCT/CN2019/085606 2018-06-15 2019-05-06 Video processing method, apparatus, and storage medium Ceased WO2019237850A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP19819568.7A EP3809710A4 (en) 2018-06-15 2019-05-06 VIDEO PROCESSING PROCESS AND DEVICE, AND INFORMATION MEDIA
US16/937,360 US11611809B2 (en) 2018-06-15 2020-07-23 Video processing method and apparatus, and storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810618681.6A CN110149530B (zh) 2018-06-15 2018-06-15 Video processing method and apparatus
CN201810618681.6 2018-06-15

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US16/937,360 Continuation US11611809B2 (en) 2018-06-15 2020-07-23 Video processing method and apparatus, and storage medium

Publications (1)

Publication Number Publication Date
WO2019237850A1 true WO2019237850A1 (zh) 2019-12-19

Family

ID=67589230

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/085606 Ceased WO2019237850A1 (zh) 2018-06-15 2019-05-06 Video processing method, apparatus, and storage medium

Country Status (4)

Country Link
US (1) US11611809B2 (zh)
EP (1) EP3809710A4 (zh)
CN (1) CN110149530B (zh)
WO (1) WO2019237850A1 (zh)

Also Published As

Publication number Publication date
CN110149530B (zh) 2021-08-24
US11611809B2 (en) 2023-03-21
EP3809710A4 (en) 2021-05-05
EP3809710A1 (en) 2021-04-21
CN110149530A (zh) 2019-08-20
US20200356782A1 (en) 2020-11-12

Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 19819568; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
WWE WIPO information: entry into national phase (Ref document number: 2019819568; Country of ref document: EP)