WO2016182665A1 - Entity based temporal segmentation of video streams - Google Patents
Entity based temporal segmentation of video streams
- Publication number
- WO2016182665A1 (PCT/US2016/027330)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- entity
- video
- segment
- sample video
- time series
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Links
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/80—Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
- H04N21/85—Assembly of content; Generation of multimedia applications
- H04N21/854—Content authoring
- H04N21/8547—Content authoring involving timestamps for synchronizing content
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N5/00—Details of television systems
- H04N5/76—Television signal recording
- H04N5/91—Television signal processing therefor
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2411—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2413—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/764—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/49—Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/19—Recognition using electronic means
- G06V30/192—Recognition using electronic means using simultaneous comparisons or correlations of the image signals with a plurality of references
- G06V30/194—References adjustable by an adaptive method, e.g. learning
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10016—Video; Image sequence
Definitions
- the described embodiments relate generally to video processing, and more particularly to entity based temporal segmentation of video streams.
- a goal of shot-based temporal segmentation is to link the raw low level video data with high level semantic fields of a video stream, e.g., finding appropriate representations for the visual content which reflect the semantics of the video.
- taking the contiguous shot of an aircraft flying towards a runway and landing as an example: on the semantic level, the contiguous shot includes two scenes, one describing the aircraft flying and the other the aircraft landing.
- a shot-based segmentation may not differentiate between the two scenes if the transition between the two scenes is smooth.
- Described methods, systems and computer program products provide solutions for temporally segmenting a video based on analysis of entities identified in the video frames of the video.
- One embodiment includes a computer-implemented method for temporally segmenting a video.
- the method comprises the steps of decoding the video into multiple video frames. Multiple video frames are selected for annotation.
- the annotation process identifies entities present in a sample video frame, and each identified entity has a timestamp and a confidence score indicating the likelihood that the entity is accurately identified. For each identified entity, a time series of timestamps and corresponding confidence scores is generated and smoothed to reduce annotation noise.
- an overall temporal segmentation for the video is generated, where the overall temporal segmentation reflects the semantics of the video.
- FIG. 1 is a block diagram illustrating a system view of a video hosting service having an entity based temporal segmentation module according to one embodiment.
- FIG. 2 is an example of a video frame having a dog wearing a hat and corresponding annotation for the dog and the hat.
- FIG. 3 is a block diagram illustrating a segmentation module according to one embodiment.
- FIG. 4 illustrates an example of a time series of an identified entity in a video and corresponding confidence scores of the entity at various time instances in the video.
- FIG. 5 is an example of applying a smoothing function to a time series of an identified entity in a video.
- FIG. 6 is an example of detecting segment boundaries for an identified entity in a video.
- FIG. 7A is an example of generating an overall segmentation of a video based on individual segmentation for identified entities in the video according to one embodiment.
- FIG. 7B is an example of the overall segmentation of the video shown in FIG. 7A after sorting the individual segmentations for the identified entities.
- FIG. 8 is a flow chart of entity based temporal segmentation according to one embodiment.
- FIG. 1 is a block diagram illustrating a system view of a video hosting service 100 having an entity based temporal segmentation module 102 according to one embodiment.
- Multiple users/viewers use clients 110A-N to access services provided by the video hosting service 100, such as uploading videos to and retrieving videos from a video hosting website, and to receive the requested services from the video hosting service 100.
- the video hosting service 100 communicates with one or more clients 110A-N via a network 130.
- the video hosting service 100 receives the video hosting service requests for videos from clients 110A-N, segments and indexes the videos by the entity based temporal segmentation module 102 and returns the requested videos to the clients 110A-N.
- a client 110 is used by a user to request video hosting services.
- a user uses a client 110 to send a request for indexing or storing an uploaded video.
- the client 110 can be any type of computer device, such as a personal computer (e.g., a desktop, notebook or laptop computer), as well as devices such as a mobile telephone, personal digital assistant or IP enabled video player.
- the client 110 typically includes a processor, a display device (or output to a display device), a local storage, such as a hard drive or flash memory device, to which the client 110 stores data used by the user in performing tasks, and a network interface for coupling to the video hosting service 100 via the network 130.
- a client 110 also has a video player for playing a video stream.
- the network 130 enables communications between the clients 110 and the video hosting service 100.
- the network 130 is the Internet, and uses standardized internetworking communications technologies and protocols, known now or subsequently developed, that enable the clients 110 to communicate with the video hosting service 100.
- the video hosting service 100 comprises an entity based temporal segmentation module 102, a video server 104 and a video database 106.
- the video server 104 serves the videos from the video database 106 in response to user video hosting service requests.
- the video database 106 stores user uploaded videos, videos collected from the Internet and videos segmented by the entity based temporal segmentation module 102.
- the video database 106 stores a large video corpus for the entity based temporal segmentation module 102 to train an annotation model.
- the entity based temporal segmentation module 102 segments an input video into multiple temporal semantic segments based on analysis of one or more entities that are present in the video frames of the input video.
- An entity in a video frame represents a semantically meaningful spatial-temporal region of the video frame.
- a frame of a video of a cat playing with a dog may contain a dog, or a cat or both dog and cat, where the dog and/or the cat are the entities of the video frame.
- Two temporally adjacent semantic segments of an input video contain different scenes in terms of semantics of the segments, e.g., a dog scene versus a cat scene.
- the entity based temporal segmentation module 102 has a decoding module 140, an annotation module 150 and a segmentation module 300.
- the decoding module 140 decodes an input video, and the decoded video has multiple video frames. Any decoding scheme known to those of ordinary skill in the art can be used by the decoding module 140 at the discretion of the implementer.
- the decoding module 140 decodes the input video by performing an inversion of each stage of the corresponding encoding process that encodes the input video according to a video compression standard, including inverse transform (discrete cosine transform or wavelet transform), inverse quantization and inverse entropy encoding of the signals of the input video.
- the annotation module 150 selects multiple video frames from the decoded video and annotates each selected video frame. In one embodiment, the annotation module 150 selects the video frames based on timing information, e.g., selecting a video frame every 5 seconds of the input video, or location, e.g., selecting every tenth video frame according to a display order of the decoded video frames. To annotate a selected video frame, the annotation module 150 identifies the entities in the selected video frame and assigns a confidence score for each identified entity.
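A minimal sketch of the two sampling strategies just described, assuming decoded frames arrive as (timestamp, frame) pairs in display order; the function names are illustrative, not part of the patent's disclosure:

```python
# Illustrative sketch only; `frames` is assumed to be a list of
# (timestamp_in_seconds, frame_data) pairs in display order.

def select_by_time(frames, interval_s=5.0):
    """Pick the first frame at or after each `interval_s` boundary,
    e.g., one frame every 5 seconds of the input video."""
    selected, next_t = [], 0.0
    for ts, frame in frames:
        if ts >= next_t:
            selected.append((ts, frame))
            next_t += interval_s
    return selected

def select_by_position(frames, step=10):
    """Pick every `step`-th frame according to display order."""
    return frames[::step]
```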
- the annotation module 150 applies a trained annotation model to each video frame of the input video and generates a set of annotation parameters describing each identified entity, e.g., a class label, a bounding box containing the identified entity and a confidence score.
- the class label of an identified entity describes the entity in a human readable manner, e.g., descriptive text of the entity.
- the bounding box containing the identified entity defines an area in a video frame that contains the identified entity.
- the bounding box is defined by its size (e.g., height and width) and the coordinates of one of its corner pixels.
- the confidence score associated with an entity indicates likelihood that the entity is accurately identified, e.g., the identified dog in the video frame has a 90% probability of being a dog.
- An entity having a higher confidence score in a video frame is more likely to be present in the video frame than in another video frame where the same entity has a lower confidence score.
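A minimal sketch of the per-entity annotation record described above; the patent names only the class label, the bounding box and the confidence score, so the exact field layout is an assumption:

```python
from dataclasses import dataclass

@dataclass
class EntityAnnotation:
    label: str          # human-readable class label, e.g., "dog"
    box: tuple          # bounding box, assumed here as (x, y, width, height)
    confidence: float   # likelihood that the entity is accurately identified
    timestamp: float    # timestamp (in seconds) of the annotated frame
```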
- the annotation module 150 trains the annotation model using an annotation training framework, such as the DistBelief framework, which trains deep neural network models in a distributed manner with rapid iterations using videos stored in the video database 106.
- the annotation module 150 trains the annotation model using an asynchronous stochastic gradient descent procedure and a variety of distributed batch optimization procedures on computing clusters with thousands of machines, on a data set of 16 million images and 21 thousand categories.
- the annotation module 150 extracts visual features from the training images, learns the invariant features of the extracted visual features and builds the training model from the learning of the visual features.
- Other embodiments of the annotation module 150 may use other machine learning techniques to train the annotation model.
- FIG. 2 is an example of a video frame 210 having a dog 220 wearing a hat 230 and corresponding annotation for the dog and the hat.
- the annotation module 150 applies the trained annotation model to the video frame 210. Based on the application, the annotation module 150 identifies two entities in the video frame 210: a dog 220 and a hat 230 with a wide brim. For each identified entity, the annotation module 150 identifies the entity with a class label, e.g., a dog, a hat, and a bounding box containing the identified entity. The annotation module 150 also assigns a confidence score (not shown) for each identified entity based on the analysis of the visual features associated with the entity by the trained annotation model.
- the segmentation module 300 segments the input video into multiple temporal semantic segments based on analysis of one or more identified entities in the video frames of the input video.
- the segmentation module 300 generates a temporal segmentation for each identified entity of the input video and combines the temporal segmentations of all the identified entities to generate the overall temporal segmentation for the entire input video.
- the segmentation module 300 is further described below with reference to FIGS. 3-8.
II. Entity Based Temporal Semantic Segmentation
- FIG. 3 is a block diagram illustrating a segmentation module 300 according to one embodiment.
- the embodiment of the segmentation module 300 in FIG. 3 includes an entity module 310, a smoothing module 320, a segment detection module 330 and a scene segmentation module 340.
- the entity module 310 interacts with the annotation module 150 of the entity based temporal segmentation module 102 to receive identified entities and their corresponding confidence scores and generates a time series for each identified entity with corresponding confidence scores over the entire length of the input video.
- the entity module 310 denotes the time series of an identified entity as S_e, where the parameter e represents the identified entity in a video frame.
- the time series S_e includes a series of pairs {(t_{s_i}, f_e(t_{s_i}))}, where the parameter i refers to the frame number, t_{s_i} is the timestamp of the ith frame and f_e(t_{s_i}) refers to the confidence score of the entity e at timestamp t_{s_i}.
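Under this notation, grouping per-frame annotations into one time series S_e per entity could look like the following sketch, assuming the EntityAnnotation records sketched earlier (not the patent's actual implementation):

```python
from collections import defaultdict

def build_time_series(annotations):
    """Map each entity label e to its time series S_e: a list of
    (t_si, f_e(t_si)) pairs ordered by ascending timestamp."""
    series = defaultdict(list)
    for a in annotations:              # iterable of EntityAnnotation records
        series[a.label].append((a.timestamp, a.confidence))
    for pairs in series.values():
        pairs.sort(key=lambda p: p[0])  # ascending timestamps
    return dict(series)
```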
- FIG. 4 illustrates an example of a time series of an identified entity in an input video and corresponding confidence scores of the entity at various time instances of the input video.
- FIG. 4 shows a time series 430 of one identified entity, e.g., a dog in a video of a cat playing with the dog, over the entire length of the input video.
- the horizontal axis 410 represents the timing information of the time series 430, e.g., the length of the video and timestamps of the video frames of the video.
- the vertical axis 420 represents the confidence scores (e.g., 430a-430h) associated with the entity at each time instance.
- the frame at time instance t_1 has a confidence score 430a, which represents the likelihood of the frame at time instance t_1 having the identified entity in the video frame.
- the smoothing module 320 removes potentially spurious segments by applying a smoothing function to the time series for each identified entity of the input video.
- An entity in a video frame of a video may be misidentified based on raw visual features of the video due to noise, e.g., motion blur caused by camera shake when capturing the input video.
- the confidence scores for an identified entity over the entire length of the input video may vary significantly due to small changes in temporally subsequent frames, which may lead to spurious segments of the input video.
- the smoothing module 320 uses a moving window to smooth the time series for each identified entity to generate smoothed time series for each identified entity.
- the moving window is defined by a size and a step.
- the moving window over a time series of an entity selects the confidence scores of the entity to be smoothed.
- the smoothing module 320 averages the confidence scores within the moving window to generate an averaged confidence score, which represents the smoothed confidence score of the entity within the moving window.
- the smoothing module 320 moves the window to next portion of the time series of the entity for smoothing the confidence scores within the next portion of the time series.
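A sketch of the moving-window averaging described here; the window size and step defaults are free parameters, and anchoring each smoothed score at the window's center timestamp is an assumption:

```python
def smooth_time_series(pairs, size=5, step=1):
    """Average confidence scores inside a moving window.
    `pairs` is a list of (timestamp, confidence) tuples; returns a new list
    of (timestamp, smoothed_confidence) anchored at each window's center."""
    smoothed = []
    for start in range(0, max(len(pairs) - size + 1, 1), step):
        window = pairs[start:start + size]
        avg = sum(c for _, c in window) / len(window)
        center_ts = window[len(window) // 2][0]
        smoothed.append((center_ts, avg))
    return smoothed
```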
- FIG. 5 is an example of applying a smoothing function to a time series of an identified entity in a video.
- the raw time series for the identified entity is represented by the curve 530.
- the smoothing function is an averaging function that averages the confidence scores within a moving window 540 defined by its size and step.
- the smoothed time series for the entity is represented by the curve 550, which removes the annotation noise in the video frames of the input video.
- the segment detection module 330 detects segments for each identified entity in the input video. In one embodiment, the segment detection module 330 detects segments of an identified entity by detecting boundaries for segments containing the identified entity in the time series of the identified entity. The segment detection module 330 sorts the confidence scores associated with the smoothed time series of an identified entity in an ascending order of the timestamps of the time series, starting from the first timestamp selected by the segment detection module 330. The segment detection module 330 detects a pair of boundaries for a segment in the time series based on predefined onset and offset threshold values.
- An onset threshold value of a boundary of a segment indicates the start of the segment that contains the identified entity; an offset threshold value for the identified entity indicates the end of the segment that contains the identified entity.
- the video frames between the time instances associated with the start and end of the segment form a segment that contains the identified entity.
- the identified entity in the video frames captured between the corresponding time instances has a smoothed confidence score equal to or larger than the onset threshold value.
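Equation (1) itself does not survive in this extract. Given the surrounding description (the derivative of the smoothed confidence scores between two consecutive timestamps), a consistent finite-difference reconstruction would be:

$$\frac{df_e}{dt}\Big|_{t_{s_i}} \approx \frac{f_e(t_{s_{i+1}}) - f_e(t_{s_i})}{t_{s_{i+1}} - t_{s_i}} \qquad (1)$$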
- for each pair of consecutive timestamps in the smoothed time series, the segment detection module 330 calculates the derivative of the confidence scores according to Equation (1) and compares the calculated derivative with a first derivative threshold value (also referred to as "onset derivative threshold value"). Responsive to the calculated derivative exceeding the onset derivative threshold value, the segment detection module 330 starts a new segment for the identified entity.
- the segment detection module 330 may compare the calculated derivative with a second derivative threshold value (also referred to as "offset derivative threshold value"). Responsive to the calculated derivative being smaller than the offset derivative threshold value, the segment detection module 330 concludes a current segment for the entity.
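A sketch of the onset/offset derivative test described above, using the finite-difference derivative reconstructed as Equation (1); the default threshold values are taken from the 0.2/-0.3 example given later in this description, and timestamps are assumed strictly increasing:

```python
def detect_segments(pairs, onset_thresh=0.2, offset_thresh=-0.3):
    """Detect (start_ts, end_ts) segments from a smoothed time series.
    A segment opens when the confidence derivative exceeds `onset_thresh`
    and closes when it falls below `offset_thresh`."""
    segments, start = [], None
    for (t0, c0), (t1, c1) in zip(pairs, pairs[1:]):
        d = (c1 - c0) / (t1 - t0)            # finite-difference derivative
        if start is None and d > onset_thresh:
            start = t1                       # new segment begins at the later timestamp
        elif start is not None and d < offset_thresh:
            segments.append((start, t0))     # current segment ends at the earlier timestamp
            start = None
    if start is not None:                    # close a segment still open at video end
        segments.append((start, pairs[-1][0]))
    return segments
```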
- FIG. 6 shows an example of detecting segment boundaries for an identified entity, e.g., a dog, in a video, based on the configurable onset and offset derivative threshold values.
- the time series for the dog entity is represented by the curve 660.
- the entity at time instance t_{1+Δt} has a corresponding confidence score b, which is selected as the onset threshold value indicating the start 630 of a segment for the dog entity.
- the entity at time instance t_j has a corresponding confidence score c, which is selected as the offset threshold value indicating the end 650 of the segment for the dog entity.
- the video frames between the time instances t_{1+Δt} and t_j form a segment that contains the dog entity.
- Each dog entity in the video frames captured between the time instances t_{1+Δt} and t_j has a confidence score equal to or larger than the onset threshold value, i.e., the confidence score b.
- the segment detection module 330 calculates the derivative of the confidence scores between t_1 and t_{1+Δt} according to Equation (1) above.
- the segment detection module 330 compares the calculated derivative with a predetermined onset derivative threshold value. In the example in FIG. 6, the derivative of the confidence scores between t_1 and t_{1+Δt} exceeds the predetermined onset derivative threshold value.
- the segment detection module 330 determines that a new segment for the dog entity starts at the time instance t_{1+Δt}.
- the segment detection module 330 computes the derivative of the confidence scores between t_j and t_{j+Δt} according to Equation (1) above and compares the calculated derivative with a predetermined offset derivative threshold value.
- the derivative of the confidence scores between t_j and t_{j+Δt} is below the predetermined offset derivative threshold value.
- the segment detection module 330 determines that the segment for the dog entity ends at the time instance t_j.
- the onset derivative threshold value and the offset derivative threshold value are configurable.
- the segment detection module 330 selects the onset derivative threshold value and the offset derivative threshold value based on video segmentation experiments with selected videos stored in the video database 106, where the selected videos have known segmentation information and represent ground truth to derive onset and offset derivative threshold values.
- the segment detection module 330 selects the onset derivative threshold value based on a selected percentile of ascending ordered positive derivatives of confidence scores; the segment detection module 330 selects the offset derivative threshold value based on a selected percentile of descending ordered negative derivatives of confidence scores.
- the segment detection module 330 selects a percentile of 0.3 of the ascending ordered positive derivatives as the onset derivative threshold value and selects a percentile of 0.3 of the descending ordered negative derivatives as the offset derivative threshold value.
- the percentile of 0.3 of the ascending ordered positive derivatives sets the onset derivative threshold value to 0.2, while the percentile of 0.3 of the descending ordered negative derivatives sets the offset derivative threshold value to -0.3.
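A sketch of deriving the two thresholds from the ordered derivative statistics described above; the 0.3 percentile default matches the example, and the index arithmetic is an assumption:

```python
def pick_thresholds(derivatives, percentile=0.3):
    """Choose onset/offset derivative thresholds from derivatives observed
    on ground-truth videos. Onset: the given percentile of the ascending
    ordered positive derivatives; offset: the given percentile of the
    descending ordered negative derivatives."""
    pos = sorted(d for d in derivatives if d > 0)                  # ascending
    neg = sorted((d for d in derivatives if d < 0), reverse=True)  # descending
    if not pos or not neg:
        raise ValueError("need both positive and negative derivatives")
    onset = pos[int(percentile * (len(pos) - 1))]
    offset = neg[int(percentile * (len(neg) - 1))]
    return onset, offset
```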
- the onset derivative threshold value indicates the start of a segment for the entity and the offset derivative threshold value indicates the end of the segment for the entity.
- the segment detection module 330 calculates a percentage reduction in confidence scores between two consecutive timestamps as follows in Equation (2):
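The body of Equation (2) is missing from this extract; a percentage reduction between two consecutive timestamps, consistent with the surrounding description, would read:

$$\mathrm{Percentage\_Reduction} = \frac{f_e(t_{s_i}) - f_e(t_{s_{i+1}})}{f_e(t_{s_i})} \qquad (2)$$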
- the segment detection module 330 selects a threshold value for the percentage reduction and compares the calculated Percentage_Reduction with the selected threshold value. Responsive to the calculated Percentage_Reduction exceeding the selected threshold value, the segment detection module 330 concludes the segment at the corresponding timestamp.
- the segment detection module 330 merges segments that are temporally close during a cool-off period.
- the cool-off period can last a period of time, e.g., five seconds, depending on a variety of factors, such as the characteristics of the content of the input video, available computing resources (e.g., number of computer processors).
- During the cool-off period, a segment for an entity is allowed to continue even if the condition indicating the end of the segment described above is met.
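A sketch of the cool-off merging just described; the five-second default matches the example above, and treating the cool-off as a maximum allowed gap between consecutive segments is an interpretation of the text:

```python
def merge_close_segments(segments, cool_off_s=5.0):
    """Merge segments whose temporal gap is shorter than the cool-off period.
    `segments` is a list of (start_ts, end_ts) tuples."""
    merged = []
    for start, end in sorted(segments):
        if merged and start - merged[-1][1] <= cool_off_s:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))  # extend previous
        else:
            merged.append((start, end))
    return merged
```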
- An input video often has many video frames and lasts for some time. Each of the video frames may contain more than one entity in the video frame.
- the embodiments described above disclose generating the individual segmentation for each identified entity.
- the scene segmentation module 340 generates an overall segmentation of the entire input video based on the individual segmentation for each identified entity.
- the overall segmentation of the input video includes one or more temporal semantic segments, each of which has a set of entities; any two neighboring segments have sets of different entities.
- the segmentation module 300 has a scene segmentation module 340 for generating the overall segmentation of the input video.
- the scene segmentation module 340 obtains the individual segmentation for each identified entity of the input video from the segment detection module 330 and sorts the individual segmentations of the identified entities according to the timestamps associated with the individual segmentations.
- the scene segmentation module 340 records the start and end associated with the individual segmentation and generates segments that contain different entities.
- FIG. 7A is an example of generating an overall segmentation of an input video based on individual segmentation for identified entities in the input video according to one embodiment.
- the example in FIG. 7A has four individual segments generated by the segment detection module 330: a segment between time instance t_1 and time instance t_3 for the dog entity; a segment between time instance t_5 and time instance t_7 for another dog entity; a segment between time instance t_2 and time instance t_4 for the cat entity; and a segment between time instance t_6 and time instance t_8 for another cat entity.
- the scene segmentation module 340 orders the individual segments of the dog entity and the cat entity according to the start and end timestamps associated with the individual segments as shown in FIG. 7A.
- the scene segmentation module 340 records the 4 start timestamps, i.e., timestamps at time instances t_1, t_2, t_5 and t_6, and the 4 end timestamps, i.e., timestamps at time instances t_3, t_4, t_7 and t_8.
- the scene segmentation module 340 combines the individual segments for the dog entity and the cat entity according to the ordered start and end timestamps to generate new segments for the input video. For example, the ordered timestamps of the individual segments indicate the following six new segments (a sketch of this combination step follows the list):
- segment between timestamps t_1 and t_2, which is a dog-only segment;
- segment between timestamps t_2 and t_3, which is a cat-and-dog segment;
- segment between timestamps t_3 and t_4, which is a cat-only segment;
- segment between timestamps t_5 and t_6, which is a dog-only segment;
- segment between timestamps t_6 and t_7, which is a cat-and-dog segment;
- segment between timestamps t_7 and t_8, which is a cat-only segment.
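A sketch of this boundary-sweep combination; the interval-containment test and data layout are modeled on the FIG. 7A example, not taken from the patent's disclosure:

```python
def combine_segmentations(entity_segments):
    """entity_segments: dict mapping entity label -> list of (start, end).
    Returns (start, end, entities) pieces between consecutive recorded
    boundary timestamps, keeping only pieces where some entity is present."""
    boundaries = sorted({t for segs in entity_segments.values()
                           for s, e in segs for t in (s, e)})
    pieces = []
    for lo, hi in zip(boundaries, boundaries[1:]):
        present = frozenset(label for label, segs in entity_segments.items()
                            if any(s <= lo and hi <= e for s, e in segs))
        if present:
            pieces.append((lo, hi, present))
    return pieces

# With the FIG. 7A example (timestamps written as integers 1..8):
# combine_segmentations({"dog": [(1, 3), (5, 7)], "cat": [(2, 4), (6, 8)]})
# yields the six dog-only, cat-and-dog and cat-only pieces listed above.
```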
- the scene segmentation module 340 may further sort the new segments and delete a segment that contains the same set of entities as another one. For example, the segment between timestamps t_1 and t_2 and the segment between timestamps t_5 and t_6 are both dog-only segments. The scene segmentation module 340 may select one of these two segments, e.g., the segment between timestamps t_5 and t_6, to represent a dog-only segment of the input video. Similarly, the scene segmentation module 340 may select the segment between timestamps t_7 and t_8 to represent a cat-only segment.
- After the further sorting, the scene segmentation module 340 generates the overall segmentation of the input video, which includes three segments: a dog-only segment, a cat-only segment and a cat-and-dog segment.
- FIG. 7B shows an example of the overall segmentation of the input video after sorting.
- the scene segmentation module 340 may further sort the new segments according to the confidence score associated with an entity. For example, the scene segmentation module 340 ranks the segments of an identified entity, e.g., a dog, based on the corresponding confidence scores of the segments. Responsive to a search query on an entity, the scene segmentation module 340 may return a subset of all segments of the queried entity, each of which has a confidence score exceeding a threshold, or return all segments of the queried entity.
- FIG. 8 is a flow chart of entity based temporal segmentation according to one embodiment.
- the entity based temporal segmentation module 102 decodes 810 an input video.
- the decoded input video has multiple video frames, each of which has one or more entities.
- the entity based temporal segmentation module 102 selects 820 one or more sample video frames for segmentation. For example, the entity based temporal segmentation module 102 selects a video frame from every five video frames of the input video.
- the entity based temporal segmentation module 102 applies 830 a trained annotation model to the selected sample video frame.
- the entity based temporal segmentation module 102 identifies 840 each entity in each selected sample video frame based on the application of the trained annotation model. Each identified entity in a selected sample video frame has a timestamp, a label of the entity and a confidence score to indicate the likelihood that the entity is accurately identified.
- the entity based temporal segmentation module 102 generates 850 a time series for each identified entity, where the time series contains, at each time instance across the entire length of the input video, the corresponding confidence score of the identified entity.
- the entity based temporal segmentation module 102 applies 860 a smoothing function to the time series of each entity to eliminate noise generated during the annotation process.
- For each identified entity, the entity based temporal segmentation module 102 generates individual segments that contain the identified entity across the entire length of the input video. An individual segment for an entity has a start point and an end point, which define the length of the segment. In one embodiment, the entity based temporal segmentation module 102 detects 870 a pair of boundaries defining a segment based on predefined onset and offset threshold values. Based on the reordering and analysis of the individual segments for identified entities, the entity based temporal segmentation module 102 generates an overall segmentation for the entire input video.
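Tying the flow-chart steps together, an end-to-end driver reusing the helpers sketched earlier might look like this; all names are illustrative, and `annotate` stands in for applying the trained annotation model of steps 830-840:

```python
def segment_video(frames, annotate, step=10, window=5, cool_off_s=5.0):
    """End-to-end sketch. `frames` is a list of (timestamp, frame) pairs;
    `annotate(ts, frame)` is assumed to return EntityAnnotation records."""
    sampled = select_by_position(frames, step=step)                   # step 820
    annotations = [a for ts, f in sampled for a in annotate(ts, f)]   # steps 830-840
    per_entity = {}
    for label, pairs in build_time_series(annotations).items():      # step 850
        smoothed = smooth_time_series(pairs, size=window)             # step 860
        segments = detect_segments(smoothed)                          # step 870
        per_entity[label] = merge_close_segments(segments, cool_off_s)
    return combine_segmentations(per_entity)      # overall segmentation
```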
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Multimedia (AREA)
- General Physics & Mathematics (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Signal Processing (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Evolutionary Biology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Databases & Information Systems (AREA)
- Health & Medical Sciences (AREA)
- Medical Informatics (AREA)
- Software Systems (AREA)
- General Health & Medical Sciences (AREA)
- Computer Security & Cryptography (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Television Signal Processing For Recording (AREA)
- Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)
- Image Analysis (AREA)
Abstract
Description
Claims
Priority Applications (6)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| JP2017551249A JP6445716B2 (en) | 2015-05-14 | 2016-04-13 | Entity-based temporal segmentation of video streams |
| DE112016002175.5T DE112016002175T5 (en) | 2015-05-14 | 2016-04-13 | Entity-based temporal segmentation of video streams |
| GB1715780.1A GB2553446B8 (en) | 2015-05-14 | 2016-04-13 | Entity based temporal segmentation of video streams |
| CN201680019489.4A CN107430687B9 (en) | 2015-05-14 | 2016-04-13 | Entity-Based Time Segmentation of Video Streams |
| EP16793129.4A EP3295678A4 (en) | 2015-05-14 | 2016-04-13 | Entity based temporal segmentation of video streams |
| KR1020177028040A KR101967086B1 (en) | 2015-05-14 | 2016-04-13 | Entity-based temporal segmentation of video streams |
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US14/712,071 US9607224B2 (en) | 2015-05-14 | 2015-05-14 | Entity based temporal segmentation of video streams |
| US14/712,071 | 2015-05-14 |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2016182665A1 true WO2016182665A1 (en) | 2016-11-17 |
Family
ID=57249260
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/US2016/027330 Ceased WO2016182665A1 (en) | 2015-05-14 | 2016-04-13 | Entity based temporal segmentation of video streams |
Country Status (8)
| Country | Link |
|---|---|
| US (1) | US9607224B2 (en) |
| EP (1) | EP3295678A4 (en) |
| JP (1) | JP6445716B2 (en) |
| KR (1) | KR101967086B1 (en) |
| CN (1) | CN107430687B9 (en) |
| DE (1) | DE112016002175T5 (en) |
| GB (1) | GB2553446B8 (en) |
| WO (1) | WO2016182665A1 (en) |
Cited By (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN111738173A (en) * | 2020-06-24 | 2020-10-02 | 北京奇艺世纪科技有限公司 | Video clip detection method and device, electronic equipment and storage medium |
| EP3792818A1 (en) * | 2019-09-12 | 2021-03-17 | Beijing Xiaomi Mobile Software Co., Ltd. | Video processing method and device, and storage medium |
Families Citing this family (20)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US10051344B2 (en) * | 2016-09-27 | 2018-08-14 | Clarifai, Inc. | Prediction model training via live stream concept association |
| CN108510982B (en) * | 2017-09-06 | 2020-03-17 | 腾讯科技(深圳)有限公司 | Audio event detection method and device and computer readable storage medium |
| DE102017124600A1 (en) * | 2017-10-20 | 2019-04-25 | Connaught Electronics Ltd. | Semantic segmentation of an object in an image |
| US10417501B2 (en) | 2017-12-06 | 2019-09-17 | International Business Machines Corporation | Object recognition in video |
| CN108510493A (en) * | 2018-04-09 | 2018-09-07 | 深圳大学 | Boundary alignment method, storage medium and the terminal of target object in medical image |
| CN109145784B (en) * | 2018-08-03 | 2022-06-03 | 百度在线网络技术(北京)有限公司 | Method and apparatus for processing video |
| EP3621021A1 (en) | 2018-09-07 | 2020-03-11 | Delta Electronics, Inc. | Data search method and data search system thereof |
| CN109410145B (en) * | 2018-11-01 | 2020-12-18 | 北京达佳互联信息技术有限公司 | Time sequence smoothing method and device and electronic equipment |
| JP7126549B2 (en) * | 2018-12-05 | 2022-08-26 | ベイジン バイドゥ ネットコム サイエンス テクノロジー カンパニー リミテッド | Method and Apparatus for Identifying Target Video Clips in Video |
| US10963702B1 (en) * | 2019-09-10 | 2021-03-30 | Huawei Technologies Co., Ltd. | Method and system for video segmentation |
| CN110704681B (en) * | 2019-09-26 | 2023-03-24 | 三星电子(中国)研发中心 | Method and system for generating video |
| CN110933462B (en) * | 2019-10-14 | 2022-03-25 | 咪咕文化科技有限公司 | Video processing method, system, electronic device and storage medium |
| CN110958489A (en) * | 2019-12-11 | 2020-04-03 | 腾讯科技(深圳)有限公司 | Video processing method, video processing device, electronic equipment and computer-readable storage medium |
| CN114025216B (en) * | 2020-04-30 | 2023-11-17 | 网易(杭州)网络有限公司 | Media material processing method, device, server and storage medium |
| CN111898461B (en) * | 2020-07-08 | 2022-08-30 | 贵州大学 | Time sequence behavior segment generation method |
| KR20220090158A (en) * | 2020-12-22 | 2022-06-29 | 삼성전자주식회사 | Electronic device for editing video using objects of interest and operating method thereof |
| US11935253B2 (en) | 2021-08-31 | 2024-03-19 | Dspace Gmbh | Method and system for splitting visual sensor data |
| CN114550300A (en) * | 2022-02-25 | 2022-05-27 | 北京百度网讯科技有限公司 | Video data analysis method and device, electronic equipment and computer storage medium |
| CN117095317B (en) * | 2023-10-19 | 2024-06-25 | 深圳市森歌数据技术有限公司 | Unmanned aerial vehicle three-dimensional image entity identification and time positioning method |
| CN117994875A (en) * | 2024-01-31 | 2024-05-07 | 长春众鼎科技有限公司 | Recording device and classified playback method of intelligent video automobile data recorder |
Citations (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20010005430A1 (en) * | 1997-07-29 | 2001-06-28 | James Warnick | Uniform intensity temporal segments |
| US20010020981A1 (en) | 2000-03-08 | 2001-09-13 | Lg Electronics Inc. | Method of generating synthetic key frame and video browsing system using the same |
| US20070201558A1 (en) * | 2004-03-23 | 2007-08-30 | Li-Qun Xu | Method And System For Semantically Segmenting Scenes Of A Video Sequence |
| US20100039564A1 (en) * | 2007-02-13 | 2010-02-18 | Zhan Cui | Analysing video material |
Family Cites Families (27)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JPH07175816A (en) * | 1993-10-25 | 1995-07-14 | Hitachi Ltd | Video associative search apparatus and method |
| EP1125227A4 (en) * | 1998-11-06 | 2004-04-14 | Univ Columbia | SYSTEMS AND METHODS FOR INTEROPERABLE MULTIMEDIA CONTENTS |
| JP4404172B2 (en) * | 1999-09-02 | 2010-01-27 | 株式会社日立製作所 | Media scene information display editing apparatus, method, and storage medium storing program according to the method |
| US7042525B1 (en) * | 2000-07-06 | 2006-05-09 | Matsushita Electric Industrial Co., Ltd. | Video indexing and image retrieval system |
| JP4192703B2 (en) * | 2003-06-30 | 2008-12-10 | 日本電気株式会社 | Content processing apparatus, content processing method, and program |
| US7551234B2 (en) * | 2005-07-28 | 2009-06-23 | Seiko Epson Corporation | Method and apparatus for estimating shot boundaries in a digital video sequence |
| US7555149B2 (en) * | 2005-10-25 | 2009-06-30 | Mitsubishi Electric Research Laboratories, Inc. | Method and system for segmenting videos using face detection |
| CN1945628A (en) * | 2006-10-20 | 2007-04-11 | 北京交通大学 | Video frequency content expressing method based on space-time remarkable unit |
| US7559017B2 (en) * | 2006-12-22 | 2009-07-07 | Google Inc. | Annotation framework for video |
| DE102007028175A1 (en) * | 2007-06-20 | 2009-01-02 | Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. | Automated method for temporal segmentation of a video into scenes taking into account different types of transitions between image sequences |
| US8170342B2 (en) * | 2007-11-07 | 2012-05-01 | Microsoft Corporation | Image recognition of content |
| WO2009111699A2 (en) * | 2008-03-06 | 2009-09-11 | Armin Moehrle | Automated process for segmenting and classifying video objects and auctioning rights to interactive video objects |
| US20090278937A1 (en) * | 2008-04-22 | 2009-11-12 | Universitat Stuttgart | Video data processing |
| CN101527043B (en) * | 2009-03-16 | 2010-12-08 | 江苏银河电子股份有限公司 | Video picture segmentation method based on moving target outline information |
| CN101789124B (en) * | 2010-02-02 | 2011-12-07 | 浙江大学 | Segmentation method for space-time consistency of video sequence of parameter and depth information of known video camera |
| JP2012038239A (en) * | 2010-08-11 | 2012-02-23 | Sony Corp | Information processing equipment, information processing method and program |
| CN102402536A (en) * | 2010-09-13 | 2012-04-04 | 索尼公司 | Method and equipment for extracting key frame from video |
| CN102663015B (en) * | 2012-03-21 | 2015-05-06 | 上海大学 | Video semantic labeling method based on characteristics bag models and supervised learning |
| US9118886B2 (en) * | 2012-07-18 | 2015-08-25 | Hulu, LLC | Annotating general objects in video |
| US20140181668A1 (en) * | 2012-12-20 | 2014-06-26 | International Business Machines Corporation | Visual summarization of video for quick understanding |
| US10482777B2 (en) * | 2013-02-22 | 2019-11-19 | Fuji Xerox Co., Ltd. | Systems and methods for content analysis to support navigation and annotation in expository videos |
| US9154761B2 (en) * | 2013-08-19 | 2015-10-06 | Google Inc. | Content-based video segmentation |
| BR112016006860B8 (en) * | 2013-09-13 | 2023-01-10 | Arris Entpr Inc | APPARATUS AND METHOD FOR CREATING A SINGLE DATA STREAM OF COMBINED INFORMATION FOR RENDERING ON A CUSTOMER COMPUTING DEVICE |
| KR101507272B1 (en) * | 2014-02-12 | 2015-03-31 | 인하대학교 산학협력단 | Interface and method for semantic annotation system for moving objects in the interactive video |
| US10664687B2 (en) * | 2014-06-12 | 2020-05-26 | Microsoft Technology Licensing, Llc | Rule-based video importance analysis |
| US9805268B2 (en) * | 2014-07-14 | 2017-10-31 | Carnegie Mellon University | System and method for processing a video stream to extract highlights |
| JP2016103714A (en) * | 2014-11-27 | 2016-06-02 | 三星電子株式会社Samsung Electronics Co.,Ltd. | Video recording and reproducing device |
-
2015
- 2015-05-14 US US14/712,071 patent/US9607224B2/en active Active
-
2016
- 2016-04-13 CN CN201680019489.4A patent/CN107430687B9/en active Active
- 2016-04-13 EP EP16793129.4A patent/EP3295678A4/en not_active Ceased
- 2016-04-13 GB GB1715780.1A patent/GB2553446B8/en active Active
- 2016-04-13 WO PCT/US2016/027330 patent/WO2016182665A1/en not_active Ceased
- 2016-04-13 DE DE112016002175.5T patent/DE112016002175T5/en active Pending
- 2016-04-13 JP JP2017551249A patent/JP6445716B2/en active Active
- 2016-04-13 KR KR1020177028040A patent/KR101967086B1/en active Active
Patent Citations (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20010005430A1 (en) * | 1997-07-29 | 2001-06-28 | James Warnick | Uniform intensity temporal segments |
| US20010020981A1 (en) | 2000-03-08 | 2001-09-13 | Lg Electronics Inc. | Method of generating synthetic key frame and video browsing system using the same |
| US20070201558A1 (en) * | 2004-03-23 | 2007-08-30 | Li-Qun Xu | Method And System For Semantically Segmenting Scenes Of A Video Sequence |
| US20100039564A1 (en) * | 2007-02-13 | 2010-02-18 | Zhan Cui | Analysing video material |
Non-Patent Citations (4)
| Title |
|---|
| ALPESH DABHI ET AL.: "A Neural Network Model for Automatic Image Annotation Refinement", INTERNATIONAL JOURNAL OF EMERGING TECHNOLOGIES AND INNOVATIVE RESEARCH, vol. 1, no. 6, November 2014 (2014-11-01), XP055329065, Retrieved from the Internet <URL:http://www.jetir.org/view?paper=JERIR1406036> * |
| C-Y LIN ET AL.: "MPEG-7 Video Automatic Labeling System", PROCEEDINGS OF THE 11TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA |
| S. WU ET AL.: "Study on a New Video Scene Segmentation Algorithm", APPL. MATH. INF. SCI., vol. 9, no. 1, 2015, pages 361 - 368, XP055537071, DOI: doi:10.12785/amis/090142 |
| See also references of EP3295678A4 |
Cited By (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| EP3792818A1 (en) * | 2019-09-12 | 2021-03-17 | Beijing Xiaomi Mobile Software Co., Ltd. | Video processing method and device, and storage medium |
| US11288514B2 (en) | 2019-09-12 | 2022-03-29 | Beijing Xiaomi Mobile Software Co., Ltd. | Video processing method and device, and storage medium |
| CN111738173A (en) * | 2020-06-24 | 2020-10-02 | 北京奇艺世纪科技有限公司 | Video clip detection method and device, electronic equipment and storage medium |
Also Published As
| Publication number | Publication date |
|---|---|
| GB2553446B8 (en) | 2021-12-08 |
| KR101967086B1 (en) | 2019-04-08 |
| GB2553446B (en) | 2021-08-04 |
| EP3295678A1 (en) | 2018-03-21 |
| JP2018515006A (en) | 2018-06-07 |
| GB2553446A (en) | 2018-03-07 |
| JP6445716B2 (en) | 2018-12-26 |
| EP3295678A4 (en) | 2019-01-30 |
| GB201715780D0 (en) | 2017-11-15 |
| CN107430687A (en) | 2017-12-01 |
| US9607224B2 (en) | 2017-03-28 |
| DE112016002175T5 (en) | 2018-01-25 |
| CN107430687B (en) | 2022-03-04 |
| US20160335499A1 (en) | 2016-11-17 |
| KR20170128771A (en) | 2017-11-23 |
| CN107430687B9 (en) | 2022-04-08 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US9607224B2 (en) | Entity based temporal segmentation of video streams | |
| US12014542B2 (en) | Selecting and presenting representative frames for video previews | |
| CN113613065B (en) | Video editing method, apparatus, electronic device, and storage medium | |
| CN110235138B (en) | System and method for appearance search | |
| US9176987B1 (en) | Automatic face annotation method and system | |
| EP3477506B1 (en) | Video detection method, server and storage medium | |
| US8804999B2 (en) | Video recommendation system and method thereof | |
| US8879788B2 (en) | Video processing apparatus, method and system | |
| US10303984B2 (en) | Visual search and retrieval using semantic information | |
| US10104345B2 (en) | Data-enhanced video viewing system and methods for computer vision processing | |
| JP2001155169A (en) | Method and system for dividing, classifying and summarizing video image | |
| JP2012523641A (en) | Keyframe extraction for video content analysis | |
| US9215479B2 (en) | System and method for real-time new event detection on video streams | |
| CN112686165A (en) | Method and device for identifying target object in video, electronic equipment and storage medium | |
| CN116645624A (en) | Video content understanding method and system, computer device, and storage medium | |
| CN107247919A (en) | The acquisition methods and system of a kind of video feeling content | |
| CN103984778B (en) | A kind of video retrieval method and system | |
| Zhao et al. | Key‐Frame Extraction Based on HSV Histogram and Adaptive Clustering | |
| CN101404030B (en) | Method and system for periodic structure fragment detection in video | |
| CN117061815A (en) | Video processing method, video processing device, computer readable medium and electronic equipment | |
| Bendraou | Video shot boundary detection and key-frame extraction using mathematical models | |
| KR101212845B1 (en) | Method And System For Sampling Moving Picture | |
| Yan et al. | A preliminary study of challenges in extracting purity videos from the AV Speech Benchmark | |
| US10489654B1 (en) | Video analysis method and system | |
| Masneri et al. | Towards semi-automatic annotations for video and audio corpora |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 16793129 Country of ref document: EP Kind code of ref document: A1 |
|
| ENP | Entry into the national phase |
Ref document number: 20177028040 Country of ref document: KR Kind code of ref document: A Ref document number: 2017551249 Country of ref document: JP Kind code of ref document: A Ref document number: 201715780 Country of ref document: GB Kind code of ref document: A Free format text: PCT FILING DATE = 20160413 |
|
| REEP | Request for entry into the european phase |
Ref document number: 2016793129 Country of ref document: EP |
|
| WWE | Wipo information: entry into national phase |
Ref document number: 112016002175 Country of ref document: DE |