WO2020053861A1 - A system and a computerized method for audio lip synchronization of video content - Google Patents
- Publication number
- WO2020053861A1 (PCT/IL2019/051022)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- video
- audio
- lip sync
- scene
- content
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/4302—Content synchronisation processes, e.g. decoder synchronisation
- H04N21/4307—Synchronising the rendering of multiple content streams or additional data on devices, e.g. synchronisation of audio on a mobile phone with the video output on the TV screen
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/44—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
- H04N21/44008—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/20—Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
- H04N21/23—Processing of content or additional data; Elementary server operations; Server middleware
- H04N21/242—Synchronisation processes, e.g. processing of PCR [Programme Clock References]
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/4302—Content synchronisation processes, e.g. decoder synchronisation
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/439—Processing of audio elementary streams
- H04N21/4394—Processing of audio elementary streams involving operations for analysing the audio stream, e.g. detecting features or characteristics in audio streams
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/80—Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
- H04N21/85—Assembly of content; Generation of multimedia applications
- H04N21/854—Content authoring
- H04N21/8547—Content authoring involving timestamps for synchronizing content
Definitions
- the disclosure relates to lip synchronization (lip sync) between a video signal and its respective audio signal, and in particular to the correction of lip sync errors between the video signal and the audio signal.
- Lip synchronization error also referred to as lip sync error
- lip sync error is defined as when the timing of a video portion deviates from the timing of its respective audio portion. Such a mismatch between the video signal and the audio signal, especially when the mismatch is above a certain threshold, is bothersome to viewers and is perceived as poor quality. Unless care is taken to keep the audio and video in sync, this phenomenon may continue and even worsen as transmission continues.
- the timing differential, which may be static or dynamic, is typically referred to as the lip sync error. That is, the visual effect of the motion of a speaker’s lips is out of sync (i.e., not synchronized) with the audio heard.
- This requirement for lip synchronization may occur in broadcast and live streaming as well as video clip transmission from files.
- the prior art teaches a variety of ways to reduce the lip sync error.
- One method calls for manual adjustment of the lip sync error based on an observation made by a user of a control system. Once the observer detects a lip sync error a manual adjustment, for example, delaying the video or delaying the audio, resolves the lip sync error.
- This method has many drawbacks: it is subjective, i.e., it depends on a particular user’s experience rather than on an objective metric; it is error prone; and it is difficult to scale as the number of video channels increases exponentially over time.
- This method is deficient as this requires the use of typically an arbitrary delay factor that may or may not be suitable for a particular case. Moreover, it does not resolve any dynamic changes in the lip sync error that may occur during the delivery of a video clip to a client.
- Yet other prior art methods for detection of lip sync errors include the insertion of a video signal in sync with an audio synchronization signal, also referred to as a "pip". This allows for occasional synchronization between the video signal and the audio signal at rendezvous points.
- Yet another type of solution attempts to analyze the lip motion from its visual clues and correlate them to the audio provided by the audio track.
- these methods require specialized and mostly expensive equipment. The exponential growth of video delivery and the need to reduce costs significantly cannot be supported by such prior methods.
- Figure 1 is a schematic illustration of a system according to an embodiment.
- Figure 2A is a schematic illustration of a first unsynchronized audio and video stream by a time difference according to an embodiment.
- Figure 2B is a schematic illustration of a second unsynchronized audio and video stream by a time difference according to an embodiment.
- Figure 3 is a schematic block diagram of a system for lip sync error correction according to an embodiment.
- Figure 4 is a schematic illustration of a first flowchart for detection and correction of lip sync error according to an embodiment.
- Figure 5 is a schematic illustration of a second flowchart for detection and correction of lip sync error according to an embodiment.
- Figure 6 is a schematic illustration of a third flowchart providing details of the determination of mismatch cost for the second flowchart.
- Audiovisual content, whether provided as video clip files, streamed, or broadcast, may present a problem known as a lip sync error, i.e., the motion of a speaker’s lips does not correspond to the sound heard at the same time. To overcome the problem, the video content provided to the system is segmented according to video scene cuts.
- the audio is segmented at audio scene cuts.
- An analyzer compares the timing of the various cuts and determines whether a lip sync error has occurred and, if so, whether the system can provide a correction to overcome the problem.
- when a lip sync error is detected, based on a comparison between the video scene cuts and the audio scene cuts, a correction may be either suggested or automatically applied.
- FIG. 1 is an exemplary and non-limiting schematic illustration 100 of a synchronized audio and video stream. While reference is made herein to an audiovisual content stream, it should be understood that the invention disclosed herein applies more broadly to content that is streamed, provided from a file, or otherwise broadcast.
- the video stream 110 has various video scenes Vs1 through Vs7. According to principles of the invention, these video scenes are determined based on analysis of neighboring frames, searching, for example and without limitation, for a sudden spike in the difference between neighboring frames, or according to any of a plurality of prior art methods including, without limitation, those specified herein. These tend to change from one video scene to another.
- a cut, for example cut 111
- another scene begins. That is, in this particular example and without limitation, scene Vs1 is in a home while scene Vs2 is in the street, the cut between the scenes being at 111. Then the scene may move into a car, changing the content of the video frames abruptly and therefore suggesting a scene cut, indicated for example as cut 112. As a result, the subsequent scene Vs3 takes place within a car.
- a similar process is performed in order to slice the audio track into segments, looking for abrupt changes in the ambient sound, or according to any of the listed prior art methods.
- the audio stream 120 is perfectly aligned with the video stream 110, that is, As3 and As5 are in sync with Vs3 and Vs5 while As4 is in sync with Vs4.
- a case like this would not require any lip sync correction, as no lip sync error is actually shown.
- the division into the segments Vs1 through Vs7 and the corresponding As1 through As7 is integral to the principles of the invention, though ways of segmenting video and/or audio are found in the prior art and are outside the scope of the current invention.
- One of ordinary skill in the art would readily appreciate that even imperfect alignment between the audio and the video may be tolerable by a user if such is below a predetermined threshold.
- a threshold of a misalignment between audio and video that is up to 80 milliseconds is considered to be acceptable and therefore no lip sync error correction may be needed.
- the invention is concerned with a novel and inventive use of such segmentation.
- Fig. 2A is an exemplary and non-limiting schematic illustration 200A of a first unsynchronized audio stream 220 and video stream 210, offset by a time difference T_D.
- T_D time difference between the video stream 210 and the audio stream
- the value of T_D may also fluctuate to a certain degree around a threshold value D without departing from the scope of the disclosure herein. Therefore, when a segmentation of the video stream 210 and the audio stream 220, performed according to the principles of the invention, shows a delta value between the audio and the video, and T_D is above a predetermined threshold value D, a correction may be attempted automatically or a notification may be generated to alert an operator that an adjustment may be necessary.
- FIG. 2B is an exemplary and non-limiting schematic illustration 200B of a second unsynchronized audio and video stream, offset by a time difference. This illustration, however, differs from that shown in Fig. 2A. While the same video sequence from Vs1 through Vs7 is shown, the audio stream is different. For Vs3 through Vs5 no audio cut or segment is found; rather, a continuous audio segment As3 is detected. Thereafter the T_D values for the lip sync error continue. As will be explained herein, a decision may be made as to the lip sync error correction to apply; for example, if this occurs at a low enough frequency throughout the received audiovisual content, it may be assumed that a T_D lip sync correction should take place.
- FIG. 3 is an exemplary and non-limiting schematic block diagram of a system 300 for lip sync error correction according to an embodiment.
- An audiovisual content 302 is provided and the video content 304 is directed to a video cut analyzer 310.
- the video cut analyzer 310 is enabled to segment the video content 304 into a plurality of video segments, which it then provides to a video/audio scene delta analyzer 330 as well as to a lip sync error correction unit 340.
- the video cut analyzer 310 performs the segment cuts based on, for example but not by way of limitation, segmentation techniques known in the art.
- the audio content 306 of the audiovisual content 302 is provided to an audio cut analyzer 320.
- the audio cut analyzer 320 is enabled to segment the audio content 306 into a plurality of audio segments, which it then provides to the video/audio scene delta analyzer 330 as well as to the lip sync error correction unit 340.
- the audio cut analyzer 320 performs the segment cuts based on, for example but not by way of limitation, detection of changes in ambient noise (or sound) when changing from one scene to another.
- One out of many prior art solutions for such scene change detection is discussed in Lin et al., “Acoustic Scene Change Detection by Spectro-Temporal Filtering on Spectogram Using Chirps”.
- the video/audio scene delta analyzer (also referred to herein as the delta analyzer) 330 performs an analysis of the T_D values between the video segments, as cut by the video cut analyzer 310, and the audio segments, as cut by the audio cut analyzer 320. Assuming there is a sufficient number of both audio and video segments, the analyzer may provide several different types of notification on notification signal 335.
- the first notification is that no lip sync errors were detected, which means either that the T_D values found are below a predetermined threshold value D, or that the fraction of cases in which the T_D values exceed the threshold value D is below another predetermined threshold value K.
- the value of D is 60 milliseconds and the value of K is 10%. In such cases no lip sync error correction may be necessary.
- Both D and K threshold values may be programmable so as to allow for tighter or looser threshold values depending on the desired quality of service with respect to lip sync errors. Another case is where it is impossible to make any kind of lip sync error correction and the system 300 provides a notification on signal 335 of this case.
- the inconsistency may be determined as an inconsistency between T_D values that is above a predetermined threshold E.
- a notification may be provided on the notification signal 335 to alert an operator of the system 300 that certain manual intervention may be required as automatic lip sync error correction cannot be performed by the system 300.
- the first case is when T_D is of a consistent value above D but below a predetermined error value E.
- the second case is when T_D is of a consistently increasing or decreasing value above D but below a predetermined error value E.
- in these cases the lip sync error is correctable and lip sync error correction takes place. Such error correction is performed by the lip sync error correction unit 340, which receives the video segments from the video cut analyzer 310 and the audio segments from the audio cut analyzer 320, as well as any necessary information regarding the analysis performed by the video/audio scene delta analyzer 330.
- the information received is used by the lip sync error correction unit 340 to compensate for the T_D value. If the distribution around the T_D value is small, a correction can be made; if the distribution is large, i.e., T_D is inconsistent, a lip sync error correction is not possible using this particular solution. However, if the T_D value is constant, or tends to either increase or decrease linearly over time while remaining within the bounds of the maximum threshold E, the correction is possible using appropriate factor equations.
- Fig. 4 is an exemplary and non-limiting schematic illustration of a flowchart 400 for detection and correction of lip sync error according to an embodiment.
- in S410 audiovisual content is received. It may be received from a file or as an audiovisual stream.
- video scene cuts in the video content (for example video content 304) of the received audiovisual content are determined, using, for example but not by way of limitation, techniques described herein.
- audio scene cuts in the audio content (for example audio content 306) of the received audiovisual content are determined, using, for example but not by way of limitation, techniques described herein.
- a comparison analysis is performed to check correlations between the video scene cuts and the audio scene cuts to determine matches as well as T_D values between video segments and audio segments. It should be understood, as noted with respect to Fig. 2B, that there are cases where there is no one-to-one match between each video segment and each audio segment, and such a mismatch, as long as it is infrequent, can be overcome by system 300 by skipping to the next possible match.
- in S450 it is checked whether the lip sync error is within the correctable parameters of the system 300 and, if so, execution continues with S470; otherwise, for example but not by way of limitation, if T_D is above E and is inconsistent, as described herein in more detail, execution continues with S460, where a notification is provided noting that the system, for example system 300, cannot perform lip sync on the received audiovisual content though a lip sync problem does exist, and thereafter execution terminates.
- in S470 it is checked whether the offset between the audio segments and the video segments is smaller than a predetermined threshold, i.e., whether T_D is smaller than D; if not, execution continues with S490; otherwise, execution continues with S480, where a notification may be generated noting that no lip sync error correction is required.
- lip sync error correction is performed so as to compensate for the T_D between the video segments and the audio segments, for example using techniques discussed herein.
- the compensation may involve either of the two cases discussed herein in more detail, i.e., the first case where T_D is constant, or thereabout, and the second case where T_D continuously increases or decreases over time.
- Fig. 5 is a schematic illustration of a second flowchart 500 for detection and correction of lip sync error according to an embodiment
- Fig. 6 is a schematic illustration of a third flowchart 600 providing details of the determination of mismatch cost for the second flowchart.
- the method starts by obtaining a list of audio and video scene cuts (S505), which may be detected using prior-art solutions or other solutions that are outside the scope of the current invention. It then generates a collection of sets of start/end audio/video offsets (S510).
- Each such set points to a specific scene cut (up to a predetermined value X from the list’s start) as a possible start, for either list, and to another scene cut (up to X scene cuts from the end of the list) as its end, again for either list.
- These sets cover all the possibilities for start and end cuts, on either list, resulting in X^4 such sets.
- V_s is the selected video start time of a specific set; V_e the selected video end time; A_s the selected audio start time; and A_e the selected audio end time.
- both P_a and P_v shall be incremented (S530-25) unless one has reached the end of its list, in which case it will not be incremented.
- the mismatch counter is incremented (S530-30), and then the pointer pointing to the scene change time that is “further behind” is incremented (S530-40 or S530-45 as the case may be), unless that pointer has reached the end of its list, in which case the other one is incremented.
- the number of mismatches is evaluated (S530-55).
- the cost of this set is considered to be infinite (S530-60), and it will not be considered a good option. If the number of mismatches is below or equal to the predetermined threshold, then the accumulated cost is taken as the cost of this set (S530-65) and compared (S535) to the best accumulated cost thus far. If the cost is lower for this set, its cost is saved (S540) as the best cost, and its A, B factors are saved as the best factors thus far.
- the following options exist (S550, S560): a. The best cost is still infinity, which means that no good match was found, and therefore a notification is provided that the lip sync cannot be corrected (S555); b.
- the best cost is not infinity, the best A factor is 0, and the best B factor is 1, in which case a notification is provided that the lip sync appears to be perfect as-is and no correction is necessary (S565); or, c.
- the various embodiments disclosed herein can be implemented as hardware, firmware, software, or any combination thereof.
- the software is preferably implemented as an application program tangibly embodied on a program storage unit or computer readable medium consisting of parts, or of certain devices and/or a combination of devices.
- the application program may be uploaded to, and executed by, a machine comprising any suitable architecture.
- the machine is implemented on a computer platform having hardware such as one or more central processing units (“CPUs”), a memory, and input/output interfaces.
- CPUs central processing units
- the computer platform may also include an operating system and microinstruction code.
- a non-transitory computer readable medium is any computer readable medium except for a transitory propagating signal.
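The scene-cut matching steps described in the definitions (S510 through S540) could be sketched roughly as follows. This is only an illustration under stated assumptions, not the method as claimed: the function names `mismatch_cost` and `best_factors`, the `tolerance` and `max_mismatches` parameters, the reading of the A, B factors as a linear map `video_time = A + B * audio_time`, the use of summed absolute timing differences as the cost, and the counting of leftover cuts as mismatches are all hypothetical choices made for the example.

```python
import math
from itertools import product


def mismatch_cost(video_cuts, audio_cuts, a=0.0, b=1.0,
                  tolerance=0.040, max_mismatches=2):
    """Walk both cut lists with two pointers and accumulate a cost.

    Assumption: an audio cut at time t maps to a + b * t on the video
    timeline, and two cuts "match" when they differ by <= tolerance seconds.
    """
    p_v = p_a = 0
    mismatches = 0
    cost = 0.0
    while p_v < len(video_cuts) and p_a < len(audio_cuts):
        diff = (a + b * audio_cuts[p_a]) - video_cuts[p_v]
        if abs(diff) <= tolerance:
            cost += abs(diff)      # matched pair: advance both pointers
            p_v += 1
            p_a += 1
        else:
            mismatches += 1        # advance the pointer that is further behind
            if diff < 0:
                p_a += 1
            else:
                p_v += 1
    # Assumption: cuts left without a partner also count as mismatches.
    mismatches += (len(video_cuts) - p_v) + (len(audio_cuts) - p_a)
    return math.inf if mismatches > max_mismatches else cost


def best_factors(video_cuts, audio_cuts, x=2, **kwargs):
    """Try up to x**4 start/end combinations and keep the cheapest (cost, a, b)."""
    best = (math.inf, None, None)
    candidates = product(video_cuts[:x], video_cuts[-x:],
                         audio_cuts[:x], audio_cuts[-x:])
    for v_s, v_e, a_s, a_e in candidates:
        if v_e <= v_s or a_e <= a_s:
            continue               # degenerate set: end is not after start
        b = (v_e - v_s) / (a_e - a_s)
        a = v_s - b * a_s
        cost = mismatch_cost(video_cuts, audio_cuts, a, b, **kwargs)
        if cost < best[0]:
            best = (cost, a, b)
    return best
```

Under these assumptions, an audio track lagging the video by a constant half second yields A = -0.5 and B = 1, while a best cost that stays infinite corresponds to the "cannot be corrected" outcome.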
Landscapes
- Engineering & Computer Science (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Computer Security & Cryptography (AREA)
- Television Signal Processing For Recording (AREA)
- Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)
Abstract
Audiovisual content, whether provided as video clip files, streamed, or broadcast, may present a problem known as a lip sync error, i.e., the motion of a speaker’s lips does not correspond to the sound heard at the same time. To overcome the problem, the video content provided to the system is segmented according to video scene cuts. Similarly, the audio is segmented at audio scene cuts. An analyzer compares the timing of the various cuts and determines whether a lip sync error has occurred and, if so, whether the system can provide a correction to overcome the problem. When a lip sync error is detected, based on a comparison between the video scene cuts and the audio scene cuts, a correction may be either suggested or automatically applied.
Description
A System and a Computerized Method for Audio Lip Synchronization of Video Content
CROSS-REFERENCE TO RELATED APPLICATIONS
[0001] This application claims the benefit of U.S. Provisional Application No. 62/730,555 filed on September 13, 2019, the contents of which are hereby incorporated by reference.
TECHNICAL FIELD
[0002] The disclosure relates to lip synchronization (lip sync) between a video signal and its respective audio signal, and in particular to the correction of lip sync errors between the video signal and the audio signal.
BACKGROUND
[0003] The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section. Similarly, issues identified with respect to one or more approaches should not be assumed to have been recognized in any prior art on the basis of this section, unless otherwise indicated.
[0004] Lip synchronization error, also referred to as lip sync error, is defined as when the timing of a video portion deviates from the timing of its respective audio portion. Such a mismatch between the video signal and the audio signal, especially when the mismatch is above a certain threshold, is bothersome to viewers and is perceived as poor quality. Unless care is taken to keep the audio and video in sync, this phenomenon may continue and even worsen as transmission continues. The timing differential, which may be static or dynamic, is typically referred to as the lip sync error. That is, the visual effect of the motion of a speaker’s lips is out of sync (i.e., not synchronized) with the audio heard. This requirement for lip synchronization may occur in broadcast and live streaming as well as in video clip transmission from files.
[0005] The prior art teaches a variety of ways to reduce the lip sync error. One method calls for manual adjustment of the lip sync error based on an observation made by a user of a control system. Once the observer detects a lip sync error, a manual adjustment, for example delaying the video or delaying the audio, resolves the lip sync error. This method has many drawbacks: it is subjective, i.e., it depends on a particular user’s experience rather than on an objective metric; it is error prone; and it is difficult to scale as the number of video channels increases exponentially over time. The adjustment may also be made automatically if a previously detected delay is known and a delay factor is automatically applied. This method is deficient because it typically requires an arbitrary delay factor that may or may not be suitable for a particular case. Moreover, it does not resolve any dynamic changes in the lip sync error that may occur during the delivery of a video clip to a client. Yet other prior art methods for detection of lip sync errors include the insertion of a video signal in sync with an audio synchronization signal, also referred to as a "pip". This allows for occasional synchronization between the video signal and the audio signal at rendezvous points. Yet another type of solution attempts to analyze the lip motion from its visual cues and correlate it to the audio provided by the audio track. One of ordinary skill in the art would readily appreciate that these methods require specialized and mostly expensive equipment. The exponential growth of video delivery and the need to reduce costs significantly cannot be supported by such prior methods.
[0006] It is therefore desirable to provide a solution that will allow for affordable, simple and real-time lip sync to support the ever increasing demand to resolve the lip sync error problem.
SUMMARY
[0007] A summary of several example embodiments of the disclosure follows. This summary is provided for the convenience of the reader to provide a basic understanding of such embodiments and does not wholly define the breadth of the disclosure. This summary is not an extensive overview of all contemplated embodiments, and is intended to neither identify key or critical elements of all embodiments nor to delineate the scope of any or all aspects. Its sole purpose is to present some concepts of one or more embodiments in a simplified form as a prelude to the more detailed description that is presented later. For convenience, the term “certain embodiments” may be used herein to refer to a single embodiment or multiple embodiments of the disclosure.
[0008] Certain embodiments disclosed herein include a system for lip synchronization of audiovisual content, comprising: a video cut analyzer adapted to receive a video portion of the audiovisual content and output video segments at video scene cuts; an audio cut analyzer adapted to receive an audio portion of the audiovisual content and output audio segments at audio scene cuts; a video-audio scene delta analyzer adapted to receive the video segments and the audio segments and determine therefrom at least a time delta value between the video segments and the audio segments and determine at least a correction factor; and a lip sync error correction unit adapted to receive the video segments, the audio segments and the correction factor and output a lip sync corrected audiovisual content, wherein the correction factor is used to reduce the time delta value of the lip sync corrected audiovisual content to below a predetermined threshold value.
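As an illustration of how a delta analyzer might derive a time delta and a correction factor from already matched cuts, the following minimal sketch fits a straight line through paired audio and video cut times. The least-squares fit is an assumption chosen for the example and is not stated in the disclosure; the names `fit_linear_correction` and `correct_audio_time` are hypothetical. A constant offset shows up as a nonzero intercept with slope 1, while a linear drift shows up as a slope different from 1.

```python
def fit_linear_correction(audio_cuts, video_cuts):
    """Least-squares fit of video_time ~= a + b * audio_time.

    Both lists hold the times (in seconds) of cuts that were already
    matched pairwise; returns the pair (a, b).
    """
    n = len(audio_cuts)
    if n != len(video_cuts) or n < 2:
        raise ValueError("need at least two matched cut pairs")
    mean_a = sum(audio_cuts) / n
    mean_v = sum(video_cuts) / n
    var_a = sum((t - mean_a) ** 2 for t in audio_cuts)
    if var_a == 0:
        raise ValueError("audio cut times must not all be equal")
    cov = sum((ta - mean_a) * (tv - mean_v)
              for ta, tv in zip(audio_cuts, video_cuts))
    b = cov / var_a
    a = mean_v - b * mean_a
    return a, b


def correct_audio_time(t, a, b):
    """Map an audio timestamp onto the video timeline."""
    return a + b * t
```

For example, audio cuts at 10.5, 20.5 and 30.5 seconds paired with video cuts at 10, 20 and 30 seconds give an intercept of about -0.5 and a slope of about 1, i.e., the audio would be shifted half a second earlier.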
[0009] Certain embodiments disclosed herein include a method for lip synchronization of audiovisual content, comprising: receiving audiovisual content that requires lip sync; detecting all video scene cuts in the received video content of the audiovisual content; detecting all audio scene cuts in the received audio content of the audiovisual content; performing a comparison analysis between the video cuts and the audio cuts to determine a sync error; generating a notification that lip sync is required for the audiovisual content but cannot be performed, upon determination that the sync error is not within correctable parameters; generating a notification that no lip sync is required for the audiovisual content, upon determination that the lip sync error is within correctable parameters and that an offset between the video content and the audio content is below a predetermined threshold value; and performing lip sync error correction to reduce the lip sync error between the video content and the audio content, upon determination that the lip sync error is within correctable parameters and that the offset between the video content and the audio content exceeds the predetermined threshold value.
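The three outcomes of the method can be sketched as a small decision function. This is a simplified illustration covering only the constant-offset case; the function name `classify_lip_sync`, the default values for `e` and `max_spread`, and the use of the population standard deviation as the consistency measure are assumptions made for the example. Only the 60 millisecond default for `d` follows a value mentioned in the definitions.

```python
from statistics import mean, pstdev


def classify_lip_sync(deltas, d=0.060, e=0.500, max_spread=0.040):
    """Classify per-cut audio/video offsets (in seconds).

    Returns "cannot_correct", "no_correction_needed", or "correct".
    """
    if not deltas:
        raise ValueError("at least one offset is required")
    typical = abs(mean(deltas))   # representative offset magnitude
    spread = pstdev(deltas)       # how consistent the offsets are
    # Not within correctable parameters: offset too large or inconsistent.
    if typical > e or spread > max_spread:
        return "cannot_correct"
    # Correctable, but the offset is already below the threshold.
    if typical < d:
        return "no_correction_needed"
    # Correctable, and the offset exceeds the threshold.
    return "correct"
```

A caller would route the first result to an operator notification, the second to a "no correction required" notification, and the third to the correction step.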
BRIEF DESCRIPTION OF THE DRAWINGS
[0010] The foregoing and other objects, features and advantages will become apparent and more readily appreciated from the following detailed description taken in conjunction with the accompanying drawings, in which:
[0011] Figure 1 is a schematic illustration of a system according to an embodiment.
[0012] Figure 2A is a schematic illustration of a first audio and video stream unsynchronized by a time difference according to an embodiment.
[0013] Figure 2B is a schematic illustration of a second audio and video stream unsynchronized by a time difference according to an embodiment.
[0014] Figure 3 is a schematic block diagram of a system for lip sync error correction according to an embodiment.
[0015] Figure 4 is a schematic illustration of a first flowchart for detection and correction of lip sync error according to an embodiment.
[0016] Figure 5 is a schematic illustration of a second flowchart for detection and correction of lip sync error according to an embodiment.
[0017] Figure 6 is a schematic illustration of a third flowchart providing details of the determination of mismatch cost for the second flowchart.
DETAILED DESCRIPTION
[0018] Below, exemplary embodiments will be described in detail with reference to accompanying drawings so as to be easily realized by a person having ordinary knowledge in the art. The exemplary embodiments may be embodied in various forms without being limited to the exemplary embodiments set forth herein. Descriptions of well-known parts are omitted for clarity, and like reference numerals refer to like elements throughout.
[0019] It is important to note that the embodiments disclosed herein are only examples of the many advantageous uses of the innovative teachings herein. In general, statements made in the specification of the present application do not necessarily limit any of the various claims. Moreover, some statements may apply to some inventive features but not to others. In general, unless otherwise indicated, singular elements may be in plural and vice versa with no loss of generality.
[0020] Audiovisual content, whether provided as video clip files, streamed, or broadcast, may present a problem known as a lip sync error, i.e., the motion of the lips of a speaker does not correspond to the sound at the same time. To overcome this problem, the video content provided to the system is segmented according to video scene cuts. Similarly, the audio is segmented at audio scene cuts. An analyzer compares the timing of the various cuts and determines whether a lip sync error has occurred and, if so, whether the system can provide a correction to overcome the problem. When a lip sync error is detected, based on a comparison between the video scene cuts and the audio scene cuts, a correction may be either suggested or automatically applied.
[0021] Reference is now made to Fig. 1 where an exemplary and non-limiting schematic illustration 100 of a synchronized audio and video stream is provided. While a reference herein is made to an audiovisual content stream it should be understood that the application of the invention disclosed herein is broader and applies to such content that is streamed, provided from file or otherwise broadcasted. The video stream 110 has various video scenes Vs1 through Vs7. According to principles of the invention these video scenes are determined based on analysis of neighboring frames, searching, for example and without limitation, for a sudden spike in the difference between the neighboring frames, or according to any of a plurality of prior art methods including, without limitation, those specified herein. Frame content tends to change abruptly from one video scene to another. For example, as a video clip moves from a scene inside a home to a scene on the street a cut, for example cut 111, is determined and another scene begins. That is, in this particular example and without limitation, scene Vs1 is in a home while scene Vs2 is in the street, the cut between the scenes being at 111. Then the scene may move into a car, changing the video frames content abruptly and therefore suggesting a scene cut, indicated for example as cut 112. As a result the subsequent scene Vs3 is a scene happening within a car. A similar process, with obvious adaptations for the different type of media, is performed in order to slice the audio track into segments, looking for abrupt changes in the ambient sound, or according to any of the listed prior art methods. In this exemplary and non-limiting example, the audio stream 120 is perfectly aligned with the video stream 110, that is As3 and As5 are in sync with Vs3 and Vs5 while As4 is in sync with Vs4. A case like this would not require any lip sync correction as no lip sync error is actually shown. The division into the segments Vs1 through Vs7 and corresponding As1 through As7 is integral to the principles of the invention, though ways of such segmentation of video and/or audio are found in the prior art and are outside the scope of the current invention. One of ordinary skill in the art would readily appreciate that even imperfect alignment between the audio and the video may be tolerable by a user if such is below a predetermined threshold. Typically for the industry, a misalignment between audio and video of up to 80 milliseconds is considered acceptable and therefore no lip sync error correction may be needed. The invention is concerned with the novel and inventive use of such segmentation.
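The spike-based detection of video scene cuts mentioned above can be illustrated with a minimal Python sketch. It is only one possible reading and not the disclosed implementation: the per-frame difference scores, the `spike_ratio` and `window` parameters, and the function name `detect_cuts` are all assumptions introduced for illustration.

```python
from statistics import median


def detect_cuts(frame_diffs, fps=25.0, spike_ratio=5.0, window=10):
    """Return cut times (seconds) where the frame difference spikes.

    frame_diffs[i] is a hypothetical scalar difference between frame i and
    frame i + 1 (for example, a mean absolute pixel difference).
    spike_ratio and window are illustrative tuning values, not from the source.
    """
    cuts = []
    for i, diff in enumerate(frame_diffs):
        recent = frame_diffs[max(0, i - window):i]
        if not recent:
            continue  # no history yet to compare against
        baseline = median(recent)
        # Flag a cut when the difference is far above the local baseline.
        if diff > spike_ratio * max(baseline, 1e-9):
            cuts.append((i + 1) / fps)  # the new scene starts at frame i + 1
    return cuts


if __name__ == "__main__":
    diffs = [1.0] * 20
    diffs[12] = 50.0  # abrupt change between frame 12 and frame 13
    print(detect_cuts(diffs))
```

A median baseline is used here so that a single spike does not raise the reference level for the frames that follow it; any of the prior art detectors referenced herein could be substituted.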
[0022] Fig. 2A is an exemplary and non-limiting schematic illustration 200A of a first audio stream 220 and video stream 210 that are unsynchronized by a time difference TD. As can be seen, the time difference between the video stream 210 and the audio stream 220 is constant for the purpose of this illustration. The value of TD may also fluctuate to a certain degree around a threshold value D without departing from the scope of the disclosure herein. Therefore, when a segmentation of the video stream 210 and the audio stream 220, performed according to the principles of the invention, shows a delta value between the audio and the video, and TD is above a predetermined threshold value D, a correction may be either attempted automatically, or a notification may be generated to alert an operator that an adjustment may be necessary. Fig. 2B is an exemplary and non-limiting schematic illustration 200B of a second audio and video stream unsynchronized by a time difference. This illustration however differs from that shown in Fig. 2A. While the same video sequence from Vs1 through Vs7 is shown, the audio stream is different. For Vs3 through Vs5 no audio cut, or segment, is found; rather, a continuous audio segment As3 is detected. Thereafter the TD values for the lip sync error continue. As will be explained herein, a decision may be taken as to the lip sync error correction to apply; for example, if this occurs at a low enough frequency throughout the received audiovisual content it may be assumed that a TD lip sync correction should take place.
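As a hedged illustration of the situations of Fig. 2A and Fig. 2B, the following Python sketch pairs each video cut with its nearest audio cut, skips video cuts that have no nearby audio cut, and compares the mean TD with a threshold. The nearest-neighbor pairing, the `max_gap` parameter, and the function names are assumptions made for this sketch and are not taken from the disclosure.

```python
from statistics import mean


def time_deltas(video_cuts, audio_cuts, max_gap=0.5):
    """Pair each video cut with the nearest audio cut and return TD values.

    TD is taken here as audio time minus video time, in seconds.
    Video cuts with no audio cut within max_gap (the Fig. 2B situation)
    are skipped. max_gap is a hypothetical parameter.
    """
    deltas = []
    for v in video_cuts:
        if not audio_cuts:
            break
        nearest = min(audio_cuts, key=lambda a: abs(a - v))
        if abs(nearest - v) <= max_gap:
            deltas.append(nearest - v)
    return deltas


def needs_correction(deltas, threshold=0.080):
    """True when the mean offset exceeds the threshold (80 ms by default)."""
    return bool(deltas) and abs(mean(deltas)) > threshold


if __name__ == "__main__":
    video = [10.0, 20.0, 30.0, 40.0]
    audio = [10.2, 20.2, 40.2]  # no audio cut near the video cut at 30.0
    tds = time_deltas(video, audio)
    print(tds, needs_correction(tds))
```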
[0023] Reference is now made to Fig. 3 which is an exemplary and non-limiting schematic block diagram of a system 300 for lip sync error correction according to an embodiment. An audiovisual content 302 is provided and the video content 304 is directed to a video cut analyzer 310. The video cut analyzer 310 is enabled to segment the video content 304 into a plurality of video segments which are then provided by the video cut analyzer 310 to a video/audio scene delta analyzer 330 as well as to a lip sync error correction unit 340. The video cut analyzer 310 performs the segment cuts based on, for example but not by way of limitation, segmentation techniques known in the art. The audio content 306 of the audiovisual content 302 is provided to an audio cut analyzer 320. The audio cut analyzer 320 is enabled to segment the audio content 306 into a plurality of audio segments which are then provided by the audio cut analyzer 320 to the video/audio scene delta analyzer 330 as well as to the lip sync error correction unit 340. The audio cut analyzer 320 performs the segment cuts based on, for example but not by way of limitation, detection of changes in ambient noise (or sound) when changing from one scene to another. One out of many prior art solutions for such scene change detection is discussed in Lin et al., “Acoustic Scene Change Detection by Spectro-Temporal Filtering on Spectrogram Using Chirps”. Another scene change detection method is provided by Kyperountas et al., in “Enhanced Eigen-Audioframes for Audiovisual Scene Change Detection”. The video/audio scene delta analyzer (also referred to herein as the delta analyzer) 330 performs an analysis respective of the TD values between the video segments, as cut by the video cut analyzer 310, and the audio segments, as cut by the audio cut analyzer 320. Assuming there are a sufficient number of both audio and video segments, the analyzer may provide several different types of notification on notification signal 335.
The first notification is that no lip sync errors were detected, which would mean that the TD values found are below a predetermined D threshold value, or that the number of cases where the TD values exceed the D threshold value is below another predetermined threshold value K. In one example, but not by way of limitation, the value of D is 60 milliseconds and the value of K is 10%. In such cases no lip sync error correction may be necessary. Both D and K threshold values may be programmable so as to allow for tighter or looser threshold values depending on the desired quality of service with respect to lip sync errors. Another case is where it is impossible to make any kind of lip sync error correction and the system 300 provides a notification on signal 335 of this case. Such a case may happen when the lip sync error is above the D threshold and has an inconsistent value. The inconsistency may be determined as a variation between TD values that is above a predetermined E threshold. In this case a notification may be provided on the notification signal 335 to alert an operator of the system 300 that certain manual intervention may be required as automatic lip sync error correction cannot be performed by the system 300.
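The notifications described in this paragraph can be sketched in Python as follows. This is an illustrative reading only: the inconsistency test is assumed here to be the spread (maximum minus minimum) of the TD values compared against E, the default value of `e` is hypothetical, and the function and label names are not taken from the disclosure.

```python
def classify(deltas, d=0.060, k=0.10, e=0.5):
    """Classify a list of TD values (seconds) against thresholds D, K and E.

    d and k default to the example values given in the text (60 ms, 10%).
    e has no example value in the text; 0.5 s is a placeholder assumption.
    """
    if not deltas:
        raise ValueError("at least one TD value is required")
    exceeding = [td for td in deltas if abs(td) > d]
    # No error: all TD values below D, or too few exceed D to matter.
    if not exceeding or len(exceeding) / len(deltas) < k:
        return "no_lip_sync_error"
    # Assumed inconsistency measure: spread of the TD values above E.
    spread = max(deltas) - min(deltas)
    if spread > e:
        return "cannot_correct"
    return "correctable"


if __name__ == "__main__":
    print(classify([0.01, 0.02, -0.03]))
    print(classify([0.20, 0.21, 0.19]))
    print(classify([0.20, 1.50, -0.90]))
```

Keeping D, K and E as parameters mirrors the statement that the threshold values may be programmable.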
[0024] In between these two cases there are two other cases that may be handled according to the principles of the invention. The first case is when the TD is of a consistent value above D but below a predetermined E error value. The second case is when TD is of a consistently increasing or decreasing value above D but below a predetermined E error value. In both cases the lip sync error is correctable and lip sync error correction takes place. Such error correction is performed by the lip sync error correction unit 340 that receives the video segments from the video cut analyzer 310 and the audio segments from the audio cut analyzer 320 as well as any necessary information regarding the analysis performed by the video/audio scene delta analyzer 330. Hence if the video/audio scene delta analyzer 330 has concluded that the TD value is below the predetermined E threshold value then the correction is possible. A correction factor is used by the lip sync error correction unit 340 to compensate for the TD value. If the distribution around the TD value is small, then correction can be made; however, if the distribution is large, i.e., it is inconsistent, then it is not possible to make a lip sync error correction using this particular solution. However, if the TD value is constant, or has a tendency to either increase or decrease over time within the boundaries of the maximum E threshold, and does so in a linear fashion over time, then the correction is possible using appropriate factor equations. According to one embodiment the factor may change over time if changes in the TD value are relatively infrequent, or, in other words, the distribution is not too wide around the TD value. The lip sync error correction unit 340 provides lip sync corrected audiovisual content 345 thereby overcoming deficiencies that may have occurred in the audiovisual input content 302.
It should therefore be understood that the error correction may include, but is not limited to, linear drift correction and non-linear drift correction.
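For the constant and linearly drifting cases, one way to obtain "appropriate factor equations" is an ordinary least-squares line through the TD values. The Python sketch below is an assumption made for illustration; the disclosure does not state how the factors are fitted, and the names `fit_linear_drift` and `max_residual` are hypothetical.

```python
def fit_linear_drift(times, deltas):
    """Least-squares fit of TD(t) = offset + slope * t.

    times are cut times in seconds, deltas the TD values at those cuts.
    A slope near zero corresponds to the constant-TD case.
    """
    n = len(times)
    if n < 2 or n != len(deltas):
        raise ValueError("need at least two (time, TD) pairs of equal length")
    mean_t = sum(times) / n
    mean_d = sum(deltas) / n
    var_t = sum((t - mean_t) ** 2 for t in times)
    if var_t == 0:
        raise ValueError("cut times must not all be identical")
    cov = sum((t - mean_t) * (d - mean_d) for t, d in zip(times, deltas))
    slope = cov / var_t
    offset = mean_d - slope * mean_t
    return offset, slope


def max_residual(times, deltas, offset, slope):
    """Largest deviation of the TD values from the fitted line."""
    return max(abs(d - (offset + slope * t)) for t, d in zip(times, deltas))


if __name__ == "__main__":
    t = [0.0, 100.0, 200.0, 300.0]
    td = [0.1, 0.2, 0.3, 0.4]  # TD grows by 1 ms per second of content
    off, slope = fit_linear_drift(t, td)
    print(off, slope, max_residual(t, td, off, slope))
```

A small maximum residual suggests the TD values are consistent with a linear drift; a large one corresponds to the inconsistent case that this particular solution cannot correct.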
[0025] Fig. 4 is an exemplary and non-limiting schematic illustration of a flowchart 400 for detection and correction of lip sync error according to an embodiment. In S410 audiovisual content is received. It may be received from a file or as an audiovisual stream. In the latter case it is necessary to collect or otherwise analyze a sufficient number of video segments and audio segments before an analysis according to the invention can take place. Thereafter corrections and updates can take place as new audiovisual content (for example audiovisual content 302) is provided and an updated analysis takes place that takes into account the newly received content. In S420 video scene cuts in the video content (for example video content 304) of the received audiovisual content are determined, using, for example but not by way of limitation, techniques described herein. In S430 audio scene cuts in the audio content (for example audio content 306) of the received audiovisual content are determined, using, for example but not by way of limitation, techniques described herein. In S440 a comparison analysis is performed to check correlations between the video scene cuts and the audio scene cuts to determine matches as well as TD values between video segments and audio segments. It should be understood, as noted with respect to Fig. 2B, that there are cases where there is no one-to-one match between each video segment and each audio segment, and such mismatch, as long as it is infrequent, can be overcome by system 300 by skipping to the next possible match.
In S450 it is checked whether the lip sync error is within correctable parameters of the system 300; for example, but not by way of limitation, the error is not within correctable parameters if TD is above E and is inconsistent, as described herein in more detail. If the error is within correctable parameters, execution continues with S470; otherwise, execution continues with S460 where a notification is provided noting that the system, for example system 300, cannot perform lip sync on the received audiovisual content though a lip sync problem does exist, and thereafter execution terminates. In S470 it is checked if the offset between the audio segments and the video segments is smaller than a predetermined threshold, i.e. TD is smaller than D, and if not execution continues with S490; otherwise, execution continues with S480 where a notification may be generated noting that no lip sync error correction is required. In S490 lip sync error correction is performed so as to compensate for the TD between the video segments and the audio segments, for example using techniques discussed herein. The compensation may involve either of the two cases discussed herein in more detail, i.e., the first case where TD is constant, or thereabout, and the second case where TD continuously increases or decreases over time. Once correction has completed, execution terminates.
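The branching of flowchart 400 from S450 onward can be summarized in a short Python sketch. The `Outcome` enumeration and the `decide` function are hypothetical names used only for illustration, and the correctability check of S450 is taken as an input rather than computed.

```python
from enum import Enum


class Outcome(Enum):
    """Terminal steps of flowchart 400 (labels follow the step numbers)."""
    CANNOT_CORRECT = "S460"
    NO_CORRECTION_REQUIRED = "S480"
    CORRECT = "S490"


def decide(td, is_correctable, d=0.060):
    """Mirror the S450 and S470 checks for a representative TD (seconds).

    is_correctable stands in for the S450 result; d defaults to the 60 ms
    example value of the D threshold.
    """
    if not is_correctable:          # S450: outside correctable parameters
        return Outcome.CANNOT_CORRECT
    if abs(td) < d:                 # S470: offset smaller than D
        return Outcome.NO_CORRECTION_REQUIRED
    return Outcome.CORRECT          # S490: perform the correction


if __name__ == "__main__":
    print(decide(0.200, True), decide(0.010, True), decide(0.200, False))
```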
[0026] Fig. 5 is a schematic illustration of a second flowchart 500 for detection and correction of lip sync error according to an embodiment and Fig. 6 is a schematic illustration of a third flowchart 600 providing details of the determination of mismatch cost for the second flowchart. Essentially the method starts by obtaining a list of audio and video scene cuts (S505), which may be detected using prior-art solutions, or other solutions which are outside of the scope of the current invention. It then generates a collection of start/end audio/video offset sets (S510). Each such set points to a specific scene cut (up to a predetermined value X from the list’s start) as a possible start, for either list, and to another scene cut (up to X scene cuts from the end of the list) as its end, again from either list. These sets cover all the possibilities for start and end cuts, on either list, resulting in X^4 such sets. The method initializes the best found cost to infinity. It thereafter iterates (S520) over each of these possible sets, to determine the A and B factors (S525) for this set, as follows: Af = Vs - As and Bf = (Ve - Vs)/(Ae - As), where Vs is the selected video start time of a specific set; Ve the selected video end time; As the selected audio start time; and Ae the selected audio end time. Thereafter, a new list of corrected audio scene change times is determined as follows: A[i] = (A[i] - As)*Bf + Af + As. The method then determines the cost (S530) for this set of A,B factors. The determination is performed (S530) as follows: setting (S530-10) the cost accumulator to 0, the number of detected mismatches to 0, and the pointers inside the lists for both audio and video to 0 (Pa = Pv = 0). Thereafter looping until both pointers reach the end of their lists, based on the following logic: determining the distance between the pointed-to scene cuts as follows: D = |A[Pa] - V[Pv]|.
If the pointed-to scene cuts are close enough to count as a match (D <= Dm), but not a perfect match (D > Dp), the distance between them is added to the accumulated cost (S530-20) after which both Pa and Pv are incremented (S530-25), unless one has reached the end of its list, in which case it will not be incremented. In the case where the pointed-to scene cuts are close enough to count as a perfect match (D <= Dp), both Pa and Pv are incremented (S530-25), unless one has reached the end of its list, in which case it will not be incremented. In the case where the delta is too big (D > Dm), the mismatch counter is incremented (S530-30), and then the pointer which is pointing to a scene change time that is “further behind” is incremented (S530-40 or S530-45 as the case may be), unless that pointer has reached the end of its list, in which case the other one will be incremented. Once both pointers reach the end of their respective lists, the number of mismatches is evaluated (S530-55). If that value is above a predetermined value then the cost of this set is considered to be infinite (S530-60), and it will not be considered a good option. If the number of mismatches is below the predetermined threshold, or equal thereto, then the resulting cost is the accumulated cost (S530-65), which is compared (S535) to the best accumulated cost thus far. If the cost is lower for this set, its cost is saved (S540) as the best cost, and its A,B factors are saved as the best factors thus far. Once all the sets have been evaluated, the following options exist (S550, S560): a. The best cost is still infinity, which means that no good match was found, and therefore a notification is provided that lip sync cannot be corrected (S555); b. The best cost is not infinity, the best A factor is 0, and the best B factor is 1, in which case a notification is provided that the lip sync appears to be perfect as-is and no correction is necessary (S565); or, c. The best cost is not infinity, but the best factors differ from Af=0, Bf=1, resulting in a notification that the lip sync is not good, but can be corrected by applying these factors to the audio (S570).
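The search of flowcharts 500 and 600 may be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the function names, the default values of `x`, `dm`, `dp` and `max_mismatches`, the requirement that both cut lists be sorted and non-empty, and the skipping of degenerate start/end pairs are all choices made for this sketch.

```python
import math
from itertools import product


def mismatch_cost(audio, video, dm, dp, max_mismatches):
    """Fig. 6 sketch: walk two sorted cut lists and accumulate match cost."""
    if not audio or not video:
        raise ValueError("both cut lists must be non-empty")
    cost, mismatches, pa, pv = 0.0, 0, 0, 0            # S530-10
    last_a, last_v = len(audio) - 1, len(video) - 1
    while True:
        dist = abs(audio[pa] - video[pv])
        if dist <= dm:                                  # a match
            if dist > dp:                               # not a perfect match
                cost += dist                            # S530-20
            step_a = step_v = True                      # S530-25
        else:
            mismatches += 1                             # S530-30
            step_a = audio[pa] < video[pv]              # pointer further behind
            step_v = not step_a
            if step_a and pa == last_a:                 # at end: move the other
                step_a, step_v = False, True
            elif step_v and pv == last_v:
                step_a, step_v = True, False
        if pa == last_a and pv == last_v:
            break                                       # both lists exhausted
        if step_a and pa < last_a:
            pa += 1
        if step_v and pv < last_v:
            pv += 1
    return math.inf if mismatches > max_mismatches else cost   # S530-55..65


def best_factors(audio, video, x=2, dm=0.5, dp=0.02, max_mismatches=2):
    """Fig. 5 sketch: try up to X^4 start/end sets, keep the cheapest (Af, Bf)."""
    best_cost, best = math.inf, None
    for a_s, a_e, v_s, v_e in product(audio[:x], audio[-x:], video[:x], video[-x:]):
        if a_e <= a_s or v_e <= v_s:
            continue                                    # degenerate set (assumption)
        af = v_s - a_s                                  # Af = Vs - As
        bf = (v_e - v_s) / (a_e - a_s)                  # Bf = (Ve - Vs)/(Ae - As)
        corrected = [(a - a_s) * bf + af + a_s for a in audio]
        cost = mismatch_cost(corrected, video, dm, dp, max_mismatches)
        if cost < best_cost:                            # S535, S540
            best_cost, best = cost, (af, bf)
    return best_cost, best


if __name__ == "__main__":
    video_cuts = [10.0, 20.0, 30.0, 40.0, 50.0]
    audio_cuts = [v + 0.3 for v in video_cuts]          # audio lags by 300 ms
    print(best_factors(audio_cuts, video_cuts))
```

A returned pair of `None` with infinite cost corresponds to option a above, factors of Af = 0 and Bf = 1 to option b, and any other finite-cost result to option c.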
[0027] The various embodiments disclosed herein can be implemented as hardware, firmware, software, or any combination thereof. Moreover, the software is preferably implemented as an application program tangibly embodied on a program storage unit or computer readable medium consisting of parts, or of certain devices and/or a combination of devices. The application program may be uploaded to, and executed by, a machine comprising any suitable architecture. Preferably, the machine is implemented on a computer platform having hardware such as one or more central processing units (“CPUs”), a memory, and input/output interfaces. The computer platform may also include an operating system and microinstruction code. The various processes and functions described herein may be either part of the microinstruction code or part of the application program, or any combination thereof, which may be executed by a CPU, whether or not such a computer or processor is explicitly shown. In addition, various other peripheral units may be connected to the computer platform such as an additional data storage unit and a printing unit. Furthermore, a non-transitory computer readable medium is any computer readable medium except for a transitory propagating signal.
[0029] All examples and conditional language recited herein are intended for pedagogical purposes to aid the reader in understanding the principles of the disclosed embodiment and the concepts contributed by the inventor to furthering the art, and are to be construed as being without limitation to such specifically recited examples and conditions. Moreover, all statements herein reciting principles, aspects, and embodiments of the disclosed embodiments, as well as specific examples thereof, are intended to encompass both structural and functional equivalents thereof. Additionally, it is intended that such equivalents include both currently known equivalents as well as equivalents developed in the future, i.e., any elements developed that perform the same function, regardless of structure.
Claims
1. A system for lip synchronization of audiovisual content, comprising:
a video cut analyzer adapted to receive a video portion of the audiovisual content and output video segments at video scene cuts;
an audio cut analyzer adapted to receive an audio portion of the audiovisual content and output audio segments at audio scene cuts;
a video-audio scene delta analyzer adapted to receive the video segments and the audio segments and determine therefrom at least a time delta value between the video segments and the audio segments and determine at least a correction factor; and
a lip sync error correction unit adapted to receive the video segments, the audio segments and the correction factor and output a lip sync corrected audiovisual content, wherein the correction factor is used to reduce the time delta value of the lip sync corrected audiovisual content to below a predetermined threshold value.
2. The system of claim 1, wherein the video cut analyzer determines a video scene change for the video scene cut based on an abrupt difference between neighboring frames of the video portion.
3. The system of claim 1, wherein the video cut analyzer determines a video scene change for the video scene cut based on a change from a frame in a video scene having a first background to a video scene in a second background.
4. The system of claim 1, wherein the audio cut analyzer determines an audio scene change for the audio scene cut based on a change in an ambient sound.
5. The system of claim 1, wherein the audio cut analyzer determines an audio scene change for the audio scene cut based on a change in an ambient noise.
6. The system of claim 1, wherein the audio cut analyzer determines an audio scene change for the audio scene cut by performing a spectro-temporal filtering.
7. The system of claim 1, wherein the lip sync error correction unit provides a notification that lip sync correction cannot be performed upon determination that the lip sync error is not within correctable parameters.
8. The system of claim 1, wherein the lip sync error correction unit provides a notification that lip sync correction is unnecessary as the lip sync error is smaller than a predetermined threshold value between audio and video.
9. The system of claim 1, wherein the lip sync error correction unit performs the lip sync error correction upon determination that the lip sync error is within correctable parameters but above a predetermined threshold value for the offset between audio and video.
10. The system of claim 1, wherein the audiovisual content is at least one of: video clip file, streamed video content, and broadcast video content.
11. The system of claim 1, wherein the error correction unit is further adapted to perform at least one of: a linear drift correction and a non-linear drift correction.
12. A method for lip synchronization of audiovisual content, comprising:
receiving audiovisual content that requires lip sync;
detecting all video scene cuts in the received video content of the audiovisual content;
detecting all audio scene cuts in the received audio content of the audiovisual content;
performing a comparison analysis between video cuts and audio cuts to determine a sync error;
generating a notification that a lip sync is required for the audiovisual content but cannot be performed upon determination that the sync error is not within correctable parameters;
generating a notification that no lip sync is required for the audiovisual content upon determination that the lip sync error is within correctable parameters and that an offset between the video content and the audio content is below a predetermined threshold value; and
performing a lip sync error correction to reduce the lip sync error between the video content and the audio content upon determination that the lip sync error is within correctable parameters and that the offset between the video content and the audio content exceeds the predetermined threshold value.
13. The method of claim 12, wherein a detection of a video scene cut comprises: determining an abrupt difference between neighboring frames of the video content.
14. The method of claim 12, wherein a detection of a video scene cut comprises: determining a change from a frame in a video scene having a first background to a video scene in a second background.
15. The method of claim 12, wherein a detection of an audio scene cut comprises: determining a change for the audio scene cut based on a change in an ambient sound.
16. The method of claim 12, wherein a detection of an audio scene cut comprises: determining a change for the audio scene cut by performing a spectro-temporal filtering.
17. The method of claim 12, wherein performing a lip sync error correction comprises performing at least one of: a linear drift correction and a non-linear drift correction.
Priority Applications (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| EP19861070.1A EP3841758A4 (en) | 2018-09-13 | 2019-09-12 | A system and a computerized method for audio lip synchronization of video content |
| US17/200,450 US20210219012A1 (en) | 2018-09-13 | 2021-03-12 | System and a computerized method for audio lip synchronization of video content |
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US201862730555P | 2018-09-13 | 2018-09-13 | |
| US62/730,555 | 2018-09-13 | | |
Related Child Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US17/200,450 Continuation US20210219012A1 (en) | 2018-09-13 | 2021-03-12 | System and a computerized method for audio lip synchronization of video content |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2020053861A1 (en) | 2020-03-19 |
Family
ID=69778425
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/IL2019/051022 Ceased WO2020053861A1 (en) | 2018-09-13 | 2019-09-12 | A system and a computerized method for audio lip synchronization of video content |
Country Status (3)
| Country | Link |
|---|---|
| US (1) | US20210219012A1 (en) |
| EP (1) | EP3841758A4 (en) |
| WO (1) | WO2020053861A1 (en) |
Cited By (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN111354235A (en) * | 2020-04-24 | 2020-06-30 | 刘纯 | Piano remote teaching system |
| CN111510758A (en) * | 2020-04-24 | 2020-08-07 | 怀化学院 | Synchronization method and system in piano video teaching |
| CN113516985A (en) * | 2021-09-13 | 2021-10-19 | 北京易真学思教育科技有限公司 | Speech recognition method, apparatus and non-volatile computer-readable storage medium |
| EP4024878A1 (en) * | 2020-12-30 | 2022-07-06 | Advanced Digital Broadcast S.A. | A method and a system for testing audio-video synchronization of an audio-video player |
Families Citing this family (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US11228799B2 (en) * | 2019-04-17 | 2022-01-18 | Comcast Cable Communications, Llc | Methods and systems for content synchronization |
| US11871068B1 (en) * | 2019-12-12 | 2024-01-09 | Amazon Technologies, Inc. | Techniques for detecting non-synchronization between audio and video |
| SE545595C2 (en) | 2021-10-15 | 2023-11-07 | Livearena Tech Ab | System and method for producing a shared video stream |
| US12452477B2 (en) * | 2023-06-16 | 2025-10-21 | Disney Enterprises, Inc. | Video and audio synchronization with dynamic frame and sample rates |
| US12500997B2 (en) | 2024-01-31 | 2025-12-16 | Livearena Technologies Ab | Method and device for producing a video stream |
Citations (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20020150126A1 (en) * | 2001-04-11 | 2002-10-17 | Kovacevic Branko D. | System for frame based audio synchronization and method thereof |
| US20100303158A1 (en) * | 2006-06-08 | 2010-12-02 | Thomson Licensing | Method and apparatus for scene change detection |
Family Cites Families (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US7149686B1 (en) * | 2000-06-23 | 2006-12-12 | International Business Machines Corporation | System and method for eliminating synchronization errors in electronic audiovisual transmissions and presentations |
| KR100694060B1 (en) * | 2004-10-12 | 2007-03-12 | 삼성전자주식회사 | Audio video synchronization device and method |
| US20130141643A1 (en) * | 2011-12-06 | 2013-06-06 | Doug Carson & Associates, Inc. | Audio-Video Frame Synchronization in a Multimedia Stream |
2019
- 2019-09-12 WO PCT/IL2019/051022 patent/WO2020053861A1/en not_active Ceased
- 2019-09-12 EP EP19861070.1A patent/EP3841758A4/en not_active Withdrawn
2021
- 2021-03-12 US US17/200,450 patent/US20210219012A1/en not_active Abandoned
Non-Patent Citations (2)
| Title |
|---|
| See also references of EP3841758A4 * |
| SUNDARAM, H. ET AL.: "Determining Computable Scenes in Films and their Structures using Audio-Visual Memory Models", MULTIMEDIA '00: PROCEEDINGS OF THE EIGHTH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, 30 October 2000 (2000-10-30), pages 95 - 104, XP058235864, Retrieved from the Internet <URL:http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.19.514&rep=repl&type=pdf> DOI: 10.1145/354384.354440 * |
Cited By (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN111354235A (en) * | 2020-04-24 | 2020-06-30 | 刘纯 | Piano remote teaching system |
| CN111510758A (en) * | 2020-04-24 | 2020-08-07 | 怀化学院 | Synchronization method and system in piano video teaching |
| EP4024878A1 (en) * | 2020-12-30 | 2022-07-06 | Advanced Digital Broadcast S.A. | A method and a system for testing audio-video synchronization of an audio-video player |
| CN113516985A (en) * | 2021-09-13 | 2021-10-19 | 北京易真学思教育科技有限公司 | Speech recognition method, apparatus and non-volatile computer-readable storage medium |
Also Published As
| Publication number | Publication date |
|---|---|
| EP3841758A1 (en) | 2021-06-30 |
| EP3841758A4 (en) | 2022-06-22 |
| US20210219012A1 (en) | 2021-07-15 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US20210219012A1 (en) | | System and a computerized method for audio lip synchronization of video content |
| US11432037B2 (en) | | Method and system for detecting and responding to changing of media channel |
| US11863821B2 (en) | | Media monitoring using multiple types of signatures |
| US10390109B2 (en) | | System and method for synchronizing metadata with audiovisual content |
| JP5602138B2 (en) | | Feature optimization and reliability prediction for audio and video signature generation and detection |
| US20210204033A1 (en) | | System and computerized method for subtitles synchronization of audiovisual content using the human voice detection for synchronization |
| US20240251124A1 (en) | | Audio Video Synchronization |
| EP2667601B1 (en) | | Method and device for implementing fast channel change |
| EP1706974B1 (en) | | Method, system and receiver for receiving a multi-carrier transmission |
| CN111510758A (en) | | Synchronization method and system in piano video teaching |
| US11722729B2 (en) | | Method and system for use of earlier and/or later single-match as basis to disambiguate channel multi-match with non-matching programs |
| US11695989B2 (en) | | Content-modification system with user experience analysis feature |
| Peng et al. | | Multi-object multimedia presentation synchronization strategy |
| HK1257061B (en) | | Method and system for detecting and responding to changing of media channel |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | EP: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 19861070; Country of ref document: EP; Kind code of ref document: A1 |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | ENP | Entry into the national phase | Ref document number: 2019861070; Country of ref document: EP; Effective date: 20210326 |