WO2022068570A1 - Audio watermark adding and parsing method, device, and medium - Google Patents
Audio watermark adding and parsing method, device, and medium
- Publication number
- WO2022068570A1 (application PCT/CN2021/118202)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- audio
- target frame
- watermark
- terminal
- frame
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
- G10L19/018—Audio watermarking, i.e. embedding inaudible data in the audio signal
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/10—Protecting distributed programs or content, e.g. vending or licensing of copyrighted material ; Digital rights management [DRM]
- G06F21/16—Program or content traceability, e.g. by watermarking
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/21—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being power information
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04M—TELEPHONIC COMMUNICATION
- H04M3/00—Automatic or semi-automatic exchanges
- H04M3/42—Systems providing special services or facilities to subscribers
- H04M3/56—Arrangements for connecting several subscribers to a common circuit, i.e. affording conference facilities
- H04M3/568—Arrangements for connecting several subscribers to a common circuit, i.e. affording conference facilities audio processing specific to telephonic conferencing, e.g. spatial distribution, mixing of participants
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/439—Processing of audio elementary streams
- H04N21/4394—Processing of audio elementary streams involving operations for analysing the audio stream, e.g. detecting features or characteristics in audio streams
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/80—Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
- H04N21/83—Generation or processing of protective or descriptive data associated with content; Content structuring
- H04N21/835—Generation of protective data, e.g. certificates
- H04N21/8358—Generation of protective data, e.g. certificates involving watermark
Definitions
- the present application relates to the field of electronics, and in particular, to a method, device and medium for adding and parsing audio watermarks.
- Watermarking is a commonly used image processing technique. When important images are marked with a watermark, the source of an image can be traced from the watermark as the image spreads, so that would-be disseminators are deterred from photographing and spreading the images. Watermarks therefore provide both deterrence and after-the-fact accountability.
- the main method is to covertly photograph or record at the remote terminal side using a mobile phone, voice recorder, or similar device, and to pass the covertly recorded audio and video files to other people; the files eventually spread on the Internet, causing adverse effects.
- the audio watermark adding methods in the prior art mainly add the audio watermark by processing the audio offline at a later stage, so an online real-time audio stream cannot be effectively processed.
- Embodiments of the present application provide an audio watermark adding and parsing method, device and medium, which are used to solve the problem of real-time addition of audio watermarks.
- a first aspect of the embodiments of the present application provides an audio watermark adding method, including: a playback terminal acquires first audio in real time; the playback terminal embeds an audio watermark in the first audio, where the audio watermark is associated with the playback terminal; the playback terminal plays the first audio embedded with the audio watermark.
- in this embodiment, when audio is played in real time, the playback terminal adds the audio watermark to the audio stream in real time, so that a device that later parses the watermark can identify the playback terminal from it. This makes it easy to trace the source after the first audio has been re-recorded.
- embedding an audio watermark in the first audio by the playback terminal includes: the playback terminal determines, in the first audio, a first target frame that satisfies a first preset condition; after the first target frame, the playback terminal determines a second target frame that satisfies a second preset condition, where the first target frame is used to mark the second target frame; the playback terminal embeds the audio watermark in the second target frame.
- the playback terminal determines the first target frame according to the first preset condition, then determines the second target frame according to the second preset condition, and finally embeds the audio watermark in the second target frame. In this way, the playback terminal can accurately find a suitable position for embedding the audio watermark in the first audio during real-time processing.
- the playback terminal determining, in the first audio, a first target frame that satisfies the first preset condition includes: when the sampling rate of the first audio is greater than or equal to the first threshold, the playback terminal determines an audio frame whose low-frequency maximum value is within the first interval as the first target frame; or, when the sampling rate of the first audio is less than the first threshold, the playback terminal determines an audio frame containing the first characteristic sound as the first target frame.
- when the sampling rate of the first audio is greater than or equal to the first threshold, the playback terminal determines the audio frame whose low-frequency maximum value is within the first interval as the first target frame. When the sampling rate is less than the first threshold, the playback terminal determines the first target frame by detecting the first characteristic sound and does not need to embed a synchronization frame mark. This ensures that a first target frame marking the watermark embedding position can be found regardless of whether the sampling rate of the first audio is above or below the first threshold.
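The branch on sampling rate described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name, the representation of the low-frequency part as a list of magnitudes, and the boolean flag for the characteristic sound are all assumptions made for the example.

```python
from typing import Sequence, Tuple


def is_first_target_frame(
    low_band_values: Sequence[float],
    sample_rate: int,
    first_threshold: int,
    first_interval: Tuple[float, float],
    has_characteristic_sound: bool,
) -> bool:
    """Decide whether a frame qualifies as the first target frame.

    Assumption: `low_band_values` holds the low-frequency magnitudes of the
    frame, and `has_characteristic_sound` is the result of some external
    detector (for example a voice detector).
    """
    if sample_rate >= first_threshold:
        # High sampling rate: test the low-frequency maximum against the interval.
        if not low_band_values:
            return False
        lower, upper = first_interval
        return lower <= max(low_band_values) <= upper
    # Low sampling rate: rely on the characteristic sound instead.
    return has_characteristic_sound
```

The threshold and interval values are left to the caller because the text does not fix them.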
- after the playback terminal takes the audio frame whose low-frequency maximum value is within the first interval as the first target frame, the method further includes: the playback terminal adds a synchronization frame marker to the first target frame.
- when the sampling rate of the first audio is greater than or equal to the first threshold, a synchronization frame marker is added to the first target frame, so that the parsing terminal can later locate the frame quickly from the synchronization frame marker during parsing.
- the playback terminal adding a synchronization frame mark to the first target frame includes: the playback terminal obtains a first sampling point, the first sampling point being a sampling point of the intermediate-frequency part; the playback terminal increases the energy value of the first sampling point, so that the ratio of the energy value of the first sampling point to the energy value of the low-frequency part is greater than or equal to the second threshold.
- the playback terminal determines in real time the target frame in the first audio that meets the first preset condition as the first target frame, and then increases the energy value of the first sampling point in the intermediate-frequency part of that frame, so that the ratio of the energy value of the first sampling point to the energy value of the low-frequency part is greater than or equal to the preset ratio, thereby adding the synchronization frame mark.
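As a rough sketch of the marker step above, the energy of the intermediate-frequency sampling point can be raised just enough to meet the ratio condition. The function names are hypothetical, and the code works on scalar energy values rather than on a real spectrum; the only outside fact used is that energy scales with the square of amplitude.

```python
import math


def boosted_mid_energy(mid_energy: float, low_energy: float, second_threshold: float) -> float:
    """Return the intermediate-frequency energy needed for the sync marker.

    The energy is only ever increased, so that
    boosted / low_energy >= second_threshold.
    """
    required = second_threshold * low_energy
    return max(mid_energy, required)


def amplitude_gain(old_energy: float, new_energy: float) -> float:
    """Amplitude factor that turns `old_energy` into `new_energy`.

    Energy is proportional to amplitude squared, hence the square root.
    """
    if old_energy <= 0:
        raise ValueError("cannot scale a sampling point with no energy")
    return math.sqrt(new_energy / old_energy)
```

A frame whose intermediate-frequency energy already satisfies the ratio is left unchanged, which keeps the audible change as small as possible.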
- the playback terminal determining that the audio frame containing the first characteristic sound is the first target frame includes: when the first characteristic sound is detected and its duration is greater than or equal to a preset time, the playback terminal determines the audio frame containing the first characteristic sound as the first target frame.
- the first characteristic sound may be a human voice; alternatively, while a human voice is present, when a specific sentence is detected, the target frame containing that sentence is determined as the first target frame. This ensures that the subsequent watermark is embedded in target frames in which voice information is recorded.
- the playback terminal determining a second target frame that satisfies the second preset condition after the first target frame includes: the playback terminal determines a target frame whose intermediate-frequency energy value is greater than or equal to a third threshold and less than a fourth threshold as the second target frame.
- the second target frame is a target frame located after the first target frame. Because the audio watermark is added to the first audio in real time, it cannot be guaranteed that every frame after the first target frame is suitable as the second target frame. Each candidate therefore has to be checked, and a target frame is used as the second target frame only when it satisfies the second preset condition.
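The search for the second target frame can be illustrated with a short scan over per-frame intermediate-frequency energies. This is a sketch under assumptions: the per-frame energies are taken as already computed, and the function name and list representation are hypothetical.

```python
from typing import Optional, Sequence


def find_second_target_frame(
    mid_energies: Sequence[float],
    first_target_index: int,
    third_threshold: float,
    fourth_threshold: float,
) -> Optional[int]:
    """Return the index of the first frame after `first_target_index` whose
    intermediate-frequency energy lies in [third_threshold, fourth_threshold),
    or None if no later frame qualifies."""
    for index in range(first_target_index + 1, len(mid_energies)):
        if third_threshold <= mid_energies[index] < fourth_threshold:
            return index
    return None
```

Frames that are too quiet or too loud in the intermediate band are skipped, which matches the point that not every frame after the first target frame is usable.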
- embedding the audio watermark in the second target frame by the playback terminal includes: the playback terminal acquires a first sequence corresponding to the audio watermark, the first sequence including at least one element; the playback terminal obtains at least one second sampling point in the second target frame; the playback terminal embeds at least one element of the first sequence into the at least one second sampling point, where one element of the first sequence corresponds to one second sampling point.
- the playback terminal searches, in time order, for a second target frame that meets the preset condition after the first target frame, and embeds the audio watermark in it.
- the audio watermark is embedded by changing the energy ratio of the energy values of the sampling points in the second target frame across different time-domain and/or frequency-domain parts, which achieves real-time embedding. The embedded watermark is strongly resistant to interference when the audio is re-recorded, and it can propagate through digital channels or over the air.
- the playback terminal adding at least one element of the first sequence to the at least one second sampling point includes: the playback terminal adjusts the energy ratio of the energy values of the second sampling point across different time-domain and/or frequency-domain parts, where the energy ratio of one second sampling point is associated with one element of the first sequence.
- the specific manner in which the playback terminal adjusts the energy ratio of the energy values of the second sampling point across different time-domain and/or frequency-domain parts may be: increasing the energy value of the first half of a first sub-sampling point, so that the ratio of its first-half energy value to its second-half energy value is greater than or equal to the fifth threshold, in which case the first sub-sampling point is recorded as 1, where the first sub-sampling point is one of the at least one second sampling point; and increasing the energy value of the second half of a second sub-sampling point, so that the ratio of its second-half energy value to its first-half energy value is greater than or equal to the fifth threshold, in which case the second sub-sampling point is recorded as 0.
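The half-versus-half energy rule for recording 1 or 0 can be sketched in the time domain as below. This is only an illustration: the function names are hypothetical, a sampling point is modelled as a short list of samples, and the small `margin` above the fifth threshold is an added assumption that keeps the ratio safely over the threshold despite floating-point rounding.

```python
import math
from typing import List, Sequence


def energy(samples: Sequence[float]) -> float:
    """Sum of squared samples."""
    return sum(s * s for s in samples)


def embed_bit(
    segment: Sequence[float],
    bit: int,
    fifth_threshold: float,
    margin: float = 1.05,
) -> List[float]:
    """Boost one half of `segment` so that the energy ratio encodes `bit`.

    bit 1: first-half energy / second-half energy >= fifth_threshold
    bit 0: second-half energy / first-half energy >= fifth_threshold
    """
    if bit not in (0, 1):
        raise ValueError("bit must be 0 or 1")
    half = len(segment) // 2
    first, second = list(segment[:half]), list(segment[half:])
    strong, weak = (first, second) if bit == 1 else (second, first)
    target = fifth_threshold * margin * energy(weak)
    current = energy(strong)
    if current < target:
        if current == 0:
            raise ValueError("cannot boost a silent half")
        gain = math.sqrt(target / current)
        # In-place update so that `first` or `second` reflects the boost.
        strong[:] = [s * gain for s in strong]
    return first + second
```

Only the half that must dominate is amplified; the other half is returned untouched, and a segment that already satisfies the ratio is returned unchanged.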
- the method further includes: the playback terminal increases the energy value of the high-energy part of the second sampling point.
- after the playback terminal acquires the first sequence corresponding to the audio watermark, the method further includes: the playback terminal adds a check digit to the first sequence, where the check digit is used to verify the transmission integrity of the first sequence.
- a second aspect of the embodiments of the present application provides an audio watermark parsing method, including: a parsing terminal obtains first audio, where an audio watermark is embedded in the first audio, the audio watermark is associated with a playback terminal, and the playback terminal is used to embed the audio watermark in the first audio in real time; the parsing terminal parses the audio watermark from the first audio; the parsing terminal determines the playback terminal according to the audio watermark.
- in this embodiment, the parsing terminal can therefore determine, from the audio watermark, the playback terminal that added the audio watermark to the first audio.
- the method further includes: the parsing terminal determines a first target frame that satisfies a first preset condition in the first audio; after the first target frame, the parsing terminal determines a second target frame that satisfies the second preset condition; the parsing terminal parsing the audio watermark from the first audio includes: the parsing terminal parses the audio watermark from the second target frame.
- because the playback terminal adds the audio watermark to the first audio in real time, it cannot be guaranteed that every frame in the first audio meets the watermark embedding conditions. Unlike offline watermark embedding, a real-time scheme therefore cannot embed according to fixed preset rules. Instead, the first target frame and the second target frame are determined according to the first preset condition and the second preset condition, respectively, and the parsing terminal must locate the first target frame and the second target frame by the same conditions when parsing.
- the parsing terminal determining the first target frame in the first audio that satisfies the first preset condition includes: the parsing terminal determines, from the first audio, a target frame that contains the first characteristic sound and in which the duration of the first characteristic sound is greater than or equal to the preset time, and uses it as the first target frame.
- the parsing terminal determines the audio frame whose low-frequency maximum value is within the first interval as the first target frame;
- the parsing terminal determines the first target frame by detecting the first characteristic sound, and does not need to parse the synchronization frame mark.
- the parsing terminal determining the first target frame in the first audio that satisfies the first preset condition includes: the parsing terminal obtains, frame by frame, a first ratio of the energy values of the intermediate-frequency part and the low-frequency part of the first audio; when the parsing terminal obtains an initial target frame whose first ratio is greater than or equal to the second threshold, it slides a window backward over the first audio starting from the initial target frame, to obtain a second ratio of the energy values of the intermediate-frequency part and the low-frequency part in each sliding window; the parsing terminal takes the frame containing the sliding window with the largest second ratio as the first target frame.
- the parsing terminal obtains, frame by frame, the first ratio of the energy values of the intermediate-frequency sampling points to those of the low-frequency sampling points. When a first sampling point whose first ratio is greater than or equal to the second threshold is found, the target frame containing that sampling point is determined as the initial target frame. However, a frame contains 2048 sampling points and the first sampling point covers only part of them, so the first target frame in which the first sampling point actually lies may be offset relative to the initial target frame.
- the frame containing the sliding window with the largest second ratio is therefore taken as the first target frame, to prevent parsing deviation.
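The coarse-then-fine search above can be sketched as follows. The sketch assumes per-sample intermediate-frequency and low-frequency energy sequences are already available, reads "sliding backward" as moving toward later samples, and uses hypothetical names throughout; `search_span` (how far the window slides) is not specified in the text.

```python
from typing import Optional, Sequence


def band_ratio(mid: Sequence[float], low: Sequence[float], start: int, window: int) -> float:
    """Ratio of intermediate-frequency to low-frequency energy in one window."""
    eps = 1e-12  # avoids division by zero on silent windows
    return sum(mid[start:start + window]) / (sum(low[start:start + window]) + eps)


def locate_first_target(
    mid: Sequence[float],
    low: Sequence[float],
    second_threshold: float,
    window: int,
    search_span: int,
) -> Optional[int]:
    """Return the start position of the window with the largest ratio.

    Step 1 scans frame by frame for an initial frame whose ratio reaches
    `second_threshold`. Step 2 slides a window one position at a time from
    that frame and keeps the position with the largest ratio.
    """
    initial = None
    for frame_start in range(0, len(mid) - window + 1, window):
        if band_ratio(mid, low, frame_start, window) >= second_threshold:
            initial = frame_start
            break
    if initial is None:
        return None
    best_pos, best_ratio = initial, -1.0
    last = min(initial + search_span, len(mid) - window)
    for pos in range(initial, last + 1):
        ratio = band_ratio(mid, low, pos, window)
        if ratio > best_ratio:
            best_pos, best_ratio = pos, ratio
    return best_pos
```

In the test data used to check this sketch, the marker straddles the first frame boundary: the coarse scan triggers at position 0, and the sliding step corrects the estimate to position 2, where the marker actually starts.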
- when the first target frame includes a synchronization frame mark, the parsing terminal taking the frame containing the sliding window with the largest second ratio as the first target frame includes: the parsing terminal obtains, from the sliding window with the largest second ratio, the first sampling point with the highest intermediate-frequency energy value; the parsing terminal obtains a third sampling point of preset length before the first sampling point; the parsing terminal determines that the part where the ratio of the energy value of the first sampling point to the energy value of the third sampling point is greater than the seventh threshold is the synchronization frame marker; the parsing terminal determines, according to the synchronization frame marker, that the frame containing the sliding window with the largest second ratio is the first target frame.
- the method described above detects the synchronization frame mark from the first ratio of the energy values of the intermediate-frequency part and the low-frequency part of the first audio. However, the original content of the first audio may itself contain parts in which the ratio of the intermediate-frequency energy value to the low-frequency energy value also exceeds the threshold applied to the first ratio, which would cause a false detection.
- the parsing terminal therefore determines the synchronization frame mark according to the ratio between the first sampling point and the third sampling point of preset length before it, which prevents such false detection.
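One way to picture the false-detection check is to compare the peak intermediate-frequency energy with the stretch of audio just before it. The sketch below is an interpretation, not the patent's procedure: it treats the "third sampling point of preset length" as the mean energy over the preceding `pre_length` positions, and all names are hypothetical.

```python
from typing import Sequence


def is_sync_marker(
    mid_energy: Sequence[float],
    peak_index: int,
    pre_length: int,
    seventh_threshold: float,
) -> bool:
    """Return True if the peak stands out against the audio just before it.

    Natural content with sustained intermediate-frequency energy fails this
    test, because the preceding stretch is nearly as strong as the peak.
    """
    start = max(0, peak_index - pre_length)
    before = mid_energy[start:peak_index]
    if not before:
        return False
    reference = sum(before) / len(before)
    # Multiplication instead of division, so a silent reference is handled.
    return mid_energy[peak_index] > seventh_threshold * reference
```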
- the parsing terminal determining, after the first target frame, a second target frame that satisfies the second preset condition includes: the parsing terminal starts from the first target frame and moves backward in units of frames, and obtains each frame respectively.
- after the parsing terminal detects the first target frame, the first target frame serves as the positioning frame, and the parsing terminal can continue to search for the second target frame located after it; the second target frame is embedded with an audio watermark and satisfies the second preset condition.
- the parsing terminal can quickly find the second target frame after the first target frame according to the second preset condition.
- the parsing terminal parsing the audio watermark from the second target frame includes: the parsing terminal obtains second sampling points from the second target frame, a second sampling point being a sampling point in the second target frame whose energy ratio is greater than or equal to the fifth threshold; the parsing terminal obtains, for each second sampling point, the energy ratio of the energy values of its different time-domain and/or frequency-domain parts; the parsing terminal obtains the first element associated with that energy ratio, the first element being an element of the first sequence recorded by the audio watermark.
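Reading an element back from the energy ratio can be sketched as the mirror image of the embedding rule (first half dominant means 1, second half dominant means 0). The names are hypothetical, and a sampling point is again modelled as a short list of time-domain samples.

```python
from typing import Optional, Sequence


def decode_bit(segment: Sequence[float], fifth_threshold: float) -> Optional[int]:
    """Return 1, 0, or None when neither half dominates by `fifth_threshold`."""
    half = len(segment) // 2
    first_energy = sum(s * s for s in segment[:half])
    second_energy = sum(s * s for s in segment[half:])
    if first_energy > 0 and first_energy >= fifth_threshold * second_energy:
        return 1
    if second_energy > 0 and second_energy >= fifth_threshold * first_energy:
        return 0
    return None  # this segment carries no element
```

Returning None for ambiguous segments lets the caller skip sampling points that do not carry a watermark element.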
- when the first sequence contains a check digit, the method further includes: determining, according to the check digit, whether the first sequence is complete; if so, converting the first sequence to a decimal sequence; if not, ignoring the first sequence.
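The text does not say which check scheme is used, so the sketch below assumes a single even-parity bit appended to a binary first sequence, and it reads "convert to a decimal sequence" as converting the payload bits to a decimal integer. Both choices, and the function names, are assumptions for illustration only.

```python
from typing import List, Optional, Sequence


def add_check_bit(bits: Sequence[int]) -> List[int]:
    """Append an even-parity bit (assumed scheme) to the payload bits."""
    return list(bits) + [sum(bits) % 2]


def verify_and_decode(sequence: Sequence[int]) -> Optional[int]:
    """Return the payload as a decimal integer, or None if the check fails."""
    if len(sequence) < 2:
        return None
    *payload, check = sequence
    if sum(payload) % 2 != check:
        return None  # incomplete or corrupted sequence is ignored
    value = 0
    for bit in payload:
        value = value * 2 + bit
    return value
```

A single parity bit detects any odd number of flipped bits but not an even number; a real system might prefer a stronger check.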
- the method further includes: the parsing terminal adjusts the value of the first length according to the duration of the second target frame, where the longer the duration of the second target frame, the greater the first length.
- the parsing terminal removes the energy value of the first length at the head and tail of the second target frame, respectively.
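A simple way to make the first length grow with the duration of the second target frame is to trim a fixed fraction from each end. The proportional rule and the default fraction below are assumptions; the text only states that a longer frame gets a larger first length.

```python
from typing import List, Sequence


def trim_head_and_tail(samples: Sequence[float], trim_fraction: float = 0.1) -> List[float]:
    """Drop `trim_fraction` of the samples from both the head and the tail."""
    if not 0 <= trim_fraction < 0.5:
        raise ValueError("trim_fraction must be in [0, 0.5)")
    first_length = int(len(samples) * trim_fraction)
    return list(samples[first_length:len(samples) - first_length])
```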
- the method includes a plurality of watermark detection periods, each of which parses out one audio watermark; the method further includes: determining, among the audio watermarks parsed in the plurality of watermark detection periods, the one with the highest repetition rate as the watermark of the first audio.
- the first audio includes a plurality of watermark detection periods, each period includes a first target frame and a second target frame, and the same audio watermark is embedded in each watermark detection period.
- some parsing errors may occur at the parsing terminal, so the audio watermarks parsed in the watermark detection periods are not all the same sequence.
- a wrong audio watermark obtained by erroneous parsing is random and non-repetitive, so the one with the highest repetition rate among the audio watermarks parsed in multiple watermark detection periods can be determined as the correct audio watermark.
- in this way, the correct watermark embedded in the first audio can be parsed accurately, which further guards against parsing errors at the parsing terminal.
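Choosing the watermark with the highest repetition rate amounts to a majority vote over the per-period results. The sketch below uses `collections.Counter` from the Python standard library; the function name and the use of None for a period that produced no valid watermark are assumptions.

```python
from collections import Counter
from typing import Iterable, Optional, Sequence, Tuple


def most_repeated_watermark(
    candidates: Iterable[Optional[Sequence[int]]],
) -> Optional[Tuple[int, ...]]:
    """Return the watermark parsed most often across detection periods."""
    valid = [tuple(c) for c in candidates if c is not None]
    if not valid:
        return None
    return Counter(valid).most_common(1)[0][0]
```

Because erroneous results tend to differ from one another, even a modest number of periods lets the repeated correct sequence stand out.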
- a third aspect of the embodiments of the present application provides a playback terminal, including:
- an execution unit configured to embed an audio watermark in the first audio acquired by the acquiring unit, where the audio watermark is associated with the playback terminal
- a playing unit configured to play the first audio embedded with the audio watermark by the execution unit.
- the execution unit is also used for:
- after the first target frame, determine a second target frame that satisfies the second preset condition, where the first target frame is used to mark the second target frame;
- the audio watermark is embedded in the second target frame.
- the execution unit is also used for:
- the audio frame whose maximum value of the low-frequency part is in the first interval is determined as the first target frame; or,
- when the sampling rate of the first audio is less than the first threshold, it is determined that the audio frame containing the first characteristic sound is the first target frame.
- the execution unit is further configured to:
- a sync frame marker is added to the first target frame.
- the execution unit is also used for:
- the first sampling point is a sampling point of the intermediate frequency part
- the energy value of the first sampling point is increased, so that the ratio of the energy value of the first sampling point to the energy value of the low frequency part is greater than or equal to the second threshold.
- the execution unit is further configured to:
- an audio frame containing the first characteristic sound is determined as the first target frame.
- the execution unit is also used for:
- the target frame whose energy value of the intermediate frequency part is greater than or equal to the third threshold and less than the fourth threshold is the second target frame.
- the execution unit is also used for:
- the first sequence corresponding to the audio watermark includes at least one element
- the execution unit is also used for:
- a fourth aspect of the embodiments of the present application provides a parsing terminal, including:
- an acquisition unit configured to acquire a first audio, an audio watermark is embedded in the first audio, the audio watermark is associated with a playback terminal, and the playback terminal is used to embed the audio watermark in the first audio in real time;
- a parsing unit for parsing the audio watermark from the first audio acquired by the acquiring unit
- An execution unit configured to determine the playback terminal according to the audio watermark parsed by the parsing unit.
- the parsing unit is also used to:
- the parsing unit is further configured to:
- a target frame containing a first characteristic sound and the duration of the first characteristic sound is greater than or equal to a preset time is used as the first target frame.
- the parsing unit is further configured to:
- the first audio is detected by sliding backwards from the initial target frame, so as to obtain the intermediate frequency part and the low frequency part in each sliding window.
- the frame where the sliding window with the largest second ratio is obtained is the first target frame.
- the first target frame includes a synchronization frame mark
- the parsing unit is also used for:
- the frame where the sliding window with the largest second ratio is located is the first target frame.
- the parsing unit is also used to:
- a candidate target frame whose energy ratio of the energy values of different time-domain and/or frequency-domain parts is greater than or equal to the fifth threshold is taken as the second target frame.
- the parsing unit is also used to:
- the first audio includes multiple watermark detection periods, wherein each of the watermark detection periods parses out one audio watermark, and the parsing unit is also used for:
- the one with the highest repetition rate is determined as the watermark of the first audio.
- a fifth aspect of the embodiments of the present application provides an electronic device, the electronic device includes: an interaction device, an input/output (I/O) interface, a processor, and a memory, where program instructions are stored in the memory; the interaction device is used to obtain Operation instructions input by the user; the processor is configured to execute program instructions stored in the memory, and execute the method described in any optional implementation manner of the first aspect or the second aspect.
- a sixth aspect of the embodiments of the present application provides a computer-readable storage medium including instructions that, when executed on a computer device, cause the computer device to perform the method described in any optional implementation of the first aspect or the second aspect.
- FIG. 1 is a schematic diagram of a usage scenario of an audio watermarking method provided by an embodiment of the present application
- FIG. 2 is a schematic diagram of an embodiment of an audio watermarking method provided by an embodiment of the present application.
- FIG. 3 is a schematic diagram of another embodiment of an audio watermarking method provided by an embodiment of the present application.
- FIG. 4 is a schematic diagram of another embodiment of an audio watermarking method provided by an embodiment of the present application.
- FIG. 5a is a schematic diagram of another embodiment of the method for adding audio watermarks provided by an embodiment of the present application.
- FIG. 5b is a schematic diagram of another implementation manner of the audio watermarking method provided by the embodiment of the application.
- FIG. 5c is a schematic diagram of another implementation manner of the audio watermarking method provided by the embodiment of the application.
- FIG. 5d is a schematic diagram of another implementation manner of the audio watermarking method provided by the embodiment of the application.
- FIG. 5e is a schematic diagram of another implementation manner of the audio watermarking method provided by the embodiment of the application.
- FIG. 5f is a schematic diagram of another implementation manner of the audio watermarking method provided by the embodiment of the application.
- FIG. 5g is a schematic diagram of another implementation manner of the audio watermarking method provided by the embodiment of the application.
- FIG. 5h is a schematic diagram of another implementation manner of the audio watermarking method provided by the embodiment of the application.
- FIG. 5i is a schematic diagram of another implementation manner of the audio watermarking method provided by the embodiment of the application.
- FIG. 6 is a schematic diagram of an embodiment of an audio watermark parsing method provided by an embodiment of the present application.
- FIG. 7 is a schematic diagram of another embodiment of an audio watermark parsing method provided by an embodiment of the present application.
- FIG. 8 is a schematic diagram of another embodiment of an audio watermark parsing method provided by an embodiment of the present application.
- FIG. 9a is a schematic diagram of another implementation manner of an audio watermark parsing method provided by an embodiment of the present application.
- FIG. 9b is a schematic diagram of another implementation manner of the audio watermark parsing method provided by the embodiment of the application.
- FIG. 10 is a schematic diagram of another embodiment of an audio watermark parsing method provided by an embodiment of the present application.
- FIG. 11 is a schematic diagram of another embodiment of an audio watermark parsing method provided by an embodiment of the present application.
- FIG. 12a is a schematic diagram of another implementation manner of the audio watermark parsing method provided by an embodiment of the application.
- FIG. 12b is a schematic diagram of another implementation manner of the audio watermark parsing method provided by the embodiment of the application.
- FIG. 13 is a schematic diagram of a usage scenario of an embodiment of the present application.
- FIG. 14 is a schematic diagram of a usage scenario of an embodiment of the present application.
- FIG. 15 is a schematic diagram of an electronic device provided by an embodiment of the application.
- FIG. 16 is a schematic diagram of a playback terminal provided by an embodiment of the application.
- FIG. 17 is a schematic diagram of an analysis terminal provided by an embodiment of the present application.
- Embodiments of the present invention provide an audio watermark adding and parsing method, device and medium, which can solve the problem of real-time addition of audio watermarks.
- Watermarking technology is a commonly used image processing method. By marking important images with watermarks, the source of an image can be traced from its watermark while the image is being disseminated, so that disseminators do not dare to photograph and spread the images casually. This provides both a deterrent effect and a basis for retrospective accountability.
- The main method of leaking is to secretly photograph or record at a remote terminal side using a mobile phone, a voice recorder, or another device, and to transmit the secretly recorded audio and video files to other people, so that the files eventually spread on the Internet and cause adverse effects.
- Existing audio watermarking methods mainly obtain the audio watermark through offline processing of the audio at a later stage, so an online real-time audio stream cannot be effectively processed.
- The embodiment of the present application can be applied to a conference scenario that includes a conference venue A101, a venue B102, a venue C103, and a media center 104 that schedules audio among the above three venues.
- the conference site A101, the conference site B102, and the conference site C103 may be remote conference sites respectively located at different locations.
- When the representative of venue A101 speaks, the recording device of venue A101 captures the speech, and the communication device of venue A101 then sends the real-time audio stream to the media center 104.
- The media center 104 sends the audio stream separately to conference site B102 and conference site C103, and the communication devices of conference site B102 and conference site C103 acquire the audio stream and play the real-time audio from conference site A101 through their external playback devices.
- a remote audio conference between the conference site A101, the conference site B102, and the conference site C103 is realized.
- Audiences at venue A, venue B, and venue C may secretly record the played audio and leak it. Therefore, when tracing the source for accountability, it is necessary to know at which venue the audio was secretly recorded.
- an embodiment of the present application provides an audio watermark adding method, which can solve the problem of audio source traceability by adding a watermark to the audio played in real time.
- an embodiment of the audio watermarking method provided by the present application includes the following steps.
- the playback terminal acquires an audio watermark.
- the playback terminal acquires the audio watermark associated with the playback terminal, and in the subsequent watermark parsing process, the parsing terminal can know through the audio watermark that the audio watermark is embedded in the audio by the playback terminal.
- This audio watermark can be pre-stored locally in the playback terminal, or it can be sent to the playback terminal by another device; for example, a service management center (SMC) sends a respective audio watermark to each of multiple different playback terminals.
- the audio watermark is taken as the number 14 for illustration. After acquiring the audio watermark "14" sent by the SMC, the current playback terminal stores the audio watermark locally, and the number 14 is the audio watermark associated with the current playback terminal.
- the playback terminal acquires the first audio in real time.
- the first audio may be acquired from the outside by the playback terminal.
- The playback terminal may be the playback terminal of venue B in FIG. 1 or the playback terminal of venue C in FIG. 1.
- the media center obtains the first audio in real time.
- the first audio may also be acquired in real time by the playback terminal from its own memory, which is not limited in this embodiment of the present application.
- the playback terminal embeds an audio watermark in the first audio.
- the audio watermark "14" of the playback terminal is associated with the playback terminal, so that the playback terminal can be uniquely determined through the audio watermark.
- the playing terminal embeds the audio watermark into the first audio in real time during the process of playing the first audio.
- the embodiments of the present application further provide a specific working manner of embedding an audio watermark in the first audio.
- the following detailed description is given with reference to the accompanying drawings.
- the method for embedding an audio watermark in the first audio in the audio watermarking method provided by the present application includes the following steps.
- the playback terminal determines, in the first audio, a first target frame that satisfies a first preset condition.
- the first target frame plays a role of marking.
- From the first target frame, the parsing terminal can tell that an audio watermark follows the first target frame, so that the audio watermark can be located quickly.
- Because the first target frame is determined in the first audio in real time, it cannot be guaranteed that every frame in the first audio is suitable as the first target frame. Each candidate frame therefore has to be checked, and only a frame that satisfies the first preset condition is regarded as the first target frame.
- a synchronous frame condition detector can be set in the playback terminal, and the synchronous frame condition detector can be an entity device arranged in the playback terminal, or it can be an operation logic stored in the playback terminal.
- the detector executes the method described in step 301 above, thereby determining the first target frame that satisfies the first preset condition.
- The first target frame satisfies the first preset condition, and the first preset condition differs according to the sampling rate of the first audio.
- When the sampling rate of the first audio is greater than or equal to the first threshold, the playback terminal determines the audio frame whose maximum value of the low-frequency part is within the first interval as the first target frame.
- When the sampling rate of the first audio is smaller than the first threshold, the playback terminal determines the audio frame containing the first characteristic sound as the first target frame.
- Case 1: the playback terminal determines the audio frame whose maximum value of the low-frequency part is within the first interval as the first target frame.
- The playback terminal identifies the first target frame by adding a synchronization frame mark to it. Because the synchronization frame mark is added by changing the energy value of the intermediate frequency part of the first target frame, the low-frequency energy of the original audio of the first target frame can be neither too high nor too low. In practical applications, the low-frequency energy of the original audio is often too low, so the intermediate-frequency energy set relative to it is also too low, and the parsing terminal cannot resolve the synchronization frame mark after recording.
- the audio frame whose maximum value of the low frequency part is within the first interval is determined as the first target frame.
- The first interval can be expressed as (Tlow, Thigh).
- The playback terminal takes the maximum energy value, Value, of the low-frequency part of a frame of the first audio and judges whether it satisfies Tlow < Value < Thigh; if the condition is satisfied, the current frame is determined as the first target frame.
- Tlow is the lower limit of the energy value in the first interval, and Thigh is the upper limit of the energy value in the first interval.
- The specific values of Tlow and Thigh can be set by those skilled in the art according to actual needs. For example, different threshold ranges can be set for different audio recording scenarios such as conference rooms, auditoriums, and open office areas, in particular the threshold below which the energy is considered too low, so that the synchronization frame marker obtains good embedding strength in the first target frame in each scenario.
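- The interval check described above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the function names, the choice of DFT bins 1 to 7 as the "low-frequency part", and the use of squared DFT magnitude as "energy" are all assumptions made for the example.

```python
import cmath
import math


def band_energies(frame, lo_bin, hi_bin):
    """Return |X[k]|^2 for DFT bins lo_bin..hi_bin-1 (naive DFT, fine for a sketch)."""
    n = len(frame)
    energies = []
    for k in range(lo_bin, hi_bin):
        x_k = sum(s * cmath.exp(-2j * math.pi * k * t / n)
                  for t, s in enumerate(frame))
        energies.append(abs(x_k) ** 2)
    return energies


def is_first_target_frame(frame, t_low, t_high, low_band=(1, 8)):
    """Check Tlow < Value < Thigh, where Value is the maximum low-band energy.

    low_band is a hypothetical (start_bin, end_bin) range standing in for the
    low-frequency part; the source does not specify its limits.
    """
    value = max(band_energies(frame, *low_band))
    return t_low < value < t_high
```

- A frame whose low-band peak is too quiet or too loud is rejected, which mirrors the requirement that the low-frequency energy be neither too low nor too high.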
- a synchronization frame marker needs to be added to the target frame, so that the subsequent parsing terminal can determine the first target frame according to the synchronization frame marker.
- the adding of the synchronization frame marker specifically includes the following steps.
- the playback terminal obtains the first sampling point from the first target frame.
- the first sampling point is the sampling point of the intermediate frequency part of the first target frame.
- the playback terminal increases the energy value of the first sampling point, so that the ratio of the energy value of the first sampling point to the energy value of the low frequency part is greater than or equal to the second threshold.
- the first sampling point is the sampling point of the intermediate frequency part.
- As a result, a distinct ratio appears between the energy value of the first sampling point and the energy values of the sampling points in the low frequency part.
- The energy value of the second sampling point 401 of the intermediate frequency part is significantly increased, because the maximum energy value max_E1 of the low frequency part is added in the modified formula for the energy value E(i)' of the intermediate frequency part, so that the energy value of the first sampling point of the intermediate frequency part is significantly raised relative to the energy of the low frequency part.
- An inverse fast Fourier transform (IFFT) is performed.
- The energy value of the intermediate frequency part of the first target frame is modified by the above method, and the inverse fast Fourier transform is then performed on the first target frame, thereby obtaining the time domain signal in which the synchronization frame mark is embedded; this time domain signal is the first target frame in which the synchronization frame mark has been embedded.
- The energy value of the first sampling point of the intermediate frequency part of the first target frame is increased by the methods described in steps 1)-4) above, so that the ratio of the energy value of the first sampling point to the energy value of the low frequency part exceeds the preset range. In the subsequent parsing process, when the parsing terminal finds that the ratio of the energy value of a sampling point in the intermediate frequency part of a target frame to the energy value of the low frequency part is greater than the preset value, it can judge that this target frame is the first target frame marked with the synchronization frame mark. The addition of the synchronization frame mark in the first target frame is thus realized.
- The playback terminal determines in real time the target frame that meets the first preset condition in the first audio as the first target frame, and then increases the energy value of the first sampling point of the intermediate frequency part of the first target frame, so that the ratio of the energy value of the first sampling point to the energy value of the low frequency part is greater than or equal to the preset ratio, thereby realizing the addition of the synchronization frame mark. Therefore, when the sampling rate of the first audio is greater than or equal to the first threshold, the playback terminal determines the audio frame whose maximum value of the low-frequency part is within the first interval as the first target frame.
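- The marker embedding and its detection can be sketched as below. The exact modification formula for E(i)' is not reproduced in this text, so the sketch assumes one simple rule: the chosen mid-frequency bin is raised until its energy equals `ratio * margin * max_E1`, where `max_E1` is the maximum low-band energy. The names `embed_sync_marker`, `has_sync_marker`, `mid_bin`, `low_band`, and `margin` are hypothetical, and a naive DFT replaces the FFT/IFFT for self-containment.

```python
import cmath
import math


def dft(x):
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    n = len(spectrum)
    return [(sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                 for k in range(n)) / n).real
            for t in range(n)]


def embed_sync_marker(frame, mid_bin, low_band, ratio, margin=1.1):
    """Raise one mid-frequency bin (0 < mid_bin < n/2) relative to the low band.

    Assumed rule (not from the source): the target energy of the bin is
    ratio * margin * max_E1; margin keeps the marker above the threshold
    after rounding.
    """
    n = len(frame)
    spectrum = dft(frame)
    max_e1 = max(abs(spectrum[k]) ** 2 for k in range(*low_band))
    target_mag = math.sqrt(ratio * margin * max_e1)
    current = spectrum[mid_bin]
    if abs(current) < target_mag:
        boosted = cmath.rect(target_mag, cmath.phase(current))
        spectrum[mid_bin] = boosted
        spectrum[n - mid_bin] = boosted.conjugate()  # keep the signal real
    return idft(spectrum)  # back to the time domain


def has_sync_marker(frame, mid_bin, low_band, ratio):
    """Parsing side: mid-bin energy / max low-band energy >= ratio."""
    spectrum = dft(frame)
    max_e1 = max(abs(spectrum[k]) ** 2 for k in range(*low_band))
    return abs(spectrum[mid_bin]) ** 2 >= ratio * max_e1
```

- Because only one bin pair is changed and conjugate symmetry is preserved, the returned frame stays a real-valued time-domain signal of the same length.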
- For the case where the sampling rate of the first audio is smaller than the first threshold, the embodiment of the present application provides a second solution.
- Case 2: the playback terminal determines the audio frame containing the first characteristic sound as the first target frame.
- The specific method is as follows.
- the first characteristic sound may be detected by a sound detection method, wherein the method for detecting the characteristic sound may be any method in the prior art, which is not limited in this embodiment of the present application.
- the first characteristic sound may be a human voice.
- After no human voice has been detected for longer than a preset time (for example, 1.5 s), the moment when the human voice is detected again is used as the first target frame.
- The advantage of this is that it suits conference scenarios, where people communicate by voice. To protect the voice from being transcribed, an audio watermark has to be added to the audio in which the voice is recorded. Taking the moment at which the human voice resumes after the preset silence as the first target frame ensures that the subsequent watermark is embedded in audio that carries voice information.
- The first characteristic sound can also be defined more specifically; for example, the target frame in which a specific sentence is located is determined as the first target frame.
- In this case, the playback terminal determines the first target frame by detecting the first characteristic sound, thereby confirming the watermark embedding start position (that is, the characteristic sound).
- The confirmation of the first target frame therefore does not require embedding a synchronization frame marker.
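- The voice-resumption rule of case 2 can be sketched as follows. The source leaves the voice detection method open ("any method in the prior art"), so the sketch substitutes a plain per-frame energy threshold as a stand-in detector; that detector, the function name, and the parameter names are assumptions.

```python
def find_first_target_frames(frame_energies, frame_duration_s,
                             voice_threshold, min_silence_s=1.5):
    """Return indices of frames where voice resumes after a long enough silence.

    frame_energies: one energy value per frame (stand-in for a real voice
    detector). A frame counts as voiced when its energy >= voice_threshold.
    """
    targets = []
    silence_s = 0.0
    for index, energy in enumerate(frame_energies):
        if energy >= voice_threshold:
            # Voice detected again: it is a first target frame only if the
            # preceding silence lasted at least the preset time.
            if silence_s >= min_silence_s:
                targets.append(index)
            silence_s = 0.0
        else:
            silence_s += frame_duration_s
    return targets
```

- With 0.5 s frames, three consecutive silent frames reach the 1.5 s preset, so the next voiced frame becomes a first target frame; shorter pauses are ignored.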
- Case 1 and case 2 can be implemented by a synchronous frame embedder in the playback terminal. The synchronous frame embedder can be a physical device arranged in the playback terminal, or it can be operation logic stored in the playback terminal, which is not limited in this embodiment of the present application.
- the playback terminal determines the first target frame in the first audio, and at this time, the playback terminal needs to perform subsequent steps to embed an audio watermark after the first target frame.
- the playback terminal determines, after the first target frame, a second target frame that satisfies a second preset condition.
- The second target frame is a target frame located after the first target frame. Since the second target frame is determined in real time, it cannot be guaranteed that every frame after the first target frame in the first audio is suitable as the second target frame. Each candidate frame therefore has to be checked, and only a frame that meets the second preset condition is regarded as the second target frame.
- the determination of the first target frame and the second target frame by the playback terminal in the first audio is periodic.
- the playback terminal can only process the first audio in the order of playback time.
- the playback terminal first determines the first target frame according to the first preset condition; and then determines the second target frame that satisfies the second preset condition after the first target frame.
- The first target frame and the second target frame constitute one audio watermark cycle.
- In the next cycle, the playback terminal again first determines the first target frame according to the first preset condition, and then determines the second target frame that satisfies the second preset condition after that first target frame.
- The first audio thus includes a plurality of first target frames, and a second target frame is included between every two adjacent first target frames in the first audio.
- In the process of adding a watermark to the first audio, the playback terminal can also determine the second target frame first and then determine the first target frame; the result is still that a second target frame is included between every two adjacent first target frames in the first audio. Those skilled in the art can therefore choose the order of determining the first target frame and the second target frame according to actual needs, which is not limited in this embodiment of the present application.
- the embodiments of the present application only describe that the second target frame is located after the first target frame.
- the above-mentioned specific implementation manner of determining the second preset condition of the second target frame may be:
- the playback terminal determines that the target frame whose energy value of the intermediate frequency part is greater than or equal to the third threshold and less than the fourth threshold is the second target frame.
- Because the watermark is embedded in the intermediate frequency region of the second target frame, if the energy in the intermediate frequency region is too low, the embedded watermark is easily misread or missed; if the energy in the intermediate frequency region is too high, embedding the watermark produces a popping sound.
- Because watermarks embedded in close succession may interfere with each other, when the second target frame is determined it can further be ensured that a sufficient interval is reserved between the second target frames of different cycles. The interval between these target frames needs to be greater than or equal to the sixth threshold, whose specific size can be set by those skilled in the art according to the actual situation and is not limited in this embodiment of the present application.
- The energy value of the intermediate frequency part of the second target frame is greater than or equal to the third threshold and less than the fourth threshold, where the third threshold is less than the fourth threshold. The specific values of the third threshold and the fourth threshold can be determined by those skilled in the art according to actual needs, which is not limited in this embodiment of the present application.
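- The selection of the second target frame can be sketched as a forward search. The sketch assumes that a per-frame intermediate-frequency energy has already been computed and that the sixth threshold is expressed in frames; the function and parameter names are hypothetical.

```python
def find_second_target_frame(mid_energies, first_index,
                             third_threshold, fourth_threshold,
                             last_second_index=None, min_gap=0):
    """Return the index of the first frame after first_index whose
    mid-frequency energy lies in [third_threshold, fourth_threshold).

    last_second_index / min_gap model the sixth threshold: the new second
    target frame must be at least min_gap frames after the previous one.
    Returns None when no frame qualifies.
    """
    for index in range(first_index + 1, len(mid_energies)):
        if last_second_index is not None and index - last_second_index < min_gap:
            continue  # too close to the previous watermark, skip
        if third_threshold <= mid_energies[index] < fourth_threshold:
            return index
    return None
```

- Frames that are too quiet (risk of misreading) or too loud (risk of popping) are skipped, and the optional gap keeps watermarks of successive cycles apart.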
- the playback terminal embeds an audio watermark in the second target frame.
- the second target frame is a target frame located after the first target frame.
- The second target frame is located between two first target frames, and the audio watermark is embedded in the second target frame, so that in the subsequent watermark parsing process the parsing terminal can determine the position of the second target frame according to the first target frame and thereby find the audio watermark.
- An embodiment of the present application further provides a specific implementation manner in which the playback terminal embeds an audio watermark in the second target frame. For ease of understanding, this is described in detail below with reference to Fig. 5a. As shown in Fig. 5a, embedding the audio watermark includes the following steps.
- the playback terminal acquires the first sequence corresponding to the audio watermark.
- the audio watermark can be presented in the form of a sequence.
- For example, the playback terminal is the playback terminal at venue B, the identifier of the playback terminal is "14", and the first sequence corresponding to the audio watermark is therefore "14".
- Subsequent steps need to embed the sequence 14 as an audio watermark in the first audio, so as to mark the playback terminal associated with the first audio.
- the playback terminal acquires at least one second sampling point from the second target frame.
- For example, one target frame includes 2048 sampling points. The playback terminal obtains at least one second sampling point from the second target frame; the second sampling points are used to embed the elements of the above-mentioned first sequence in the subsequent work process.
- the playback terminal embeds at least one element in the first sequence into at least one second sampling point respectively.
- The playback terminal embeds each element of the first sequence into one second sampling point, so that the second sampling points record the content of the first sequence. The subsequent parsing terminal can then parse the content of the audio watermark by reading the first sequence recorded in the second sampling points.
- The step in which the playback terminal embeds at least one element of the first sequence into at least one second sampling point can be implemented in the following manner.
- the playback terminal adjusts the energy ratio of the energy values of the second sampling points in different time domains and/or different frequency domains.
- the magnitude of the above energy ratio is associated with the numbers in the first sequence, and different energy ratios may correspond to different numbers, so that different numbers in the first sequence are recorded by different energy ratios. Therefore, the content of the first sequence is recorded in the second sampling point by changing the energy ratio, thereby realizing the embedding of the audio watermark.
- the audio watermark that the playback terminal needs to embed is the number "14".
- The audio watermark needs to be converted into binary: converting the number "14" into binary gives the first sequence "1110".
- The first sequence is the content that needs to be embedded in the first audio as an audio watermark.
- The first sequence contains four elements, 1, 1, 1, and 0, and these four elements are respectively embedded in four second sampling points in the second target frame, so as to realize the embedding of the audio watermark.
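- The conversion in this example can be sketched as follows; the function names and the optional fixed `width` are illustrative assumptions rather than part of the described method.

```python
def watermark_to_sequence(watermark, width=0):
    """Convert a non-negative integer watermark into a list of binary elements.

    width optionally left-pads with zeros so every terminal's sequence has
    the same length (an assumption; the source only shows the 4-bit case).
    """
    return [int(ch) for ch in format(watermark, "b").zfill(width)]


def sequence_to_watermark(bits):
    """Inverse conversion used on the parsing side."""
    return int("".join(str(b) for b in bits), 2)


print(watermark_to_sequence(14))            # [1, 1, 1, 0]
print(sequence_to_watermark([1, 1, 1, 0]))  # 14
```

- Each element of the resulting list is then assigned to one second sampling point.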
- the method for implementing time-domain embedding by changing the ratio of energy values of different time-domain parts includes the following steps.
- In the order of the time domain, the energy value of the first half 5061 of the second sampling point is reduced, so that the energy value of the second half 5062 of the second sampling point is significantly higher than the energy value of the first half 5061; the waveform with this energy distribution can be preset as the digit 0.
- Figure 5c is the waveform diagram of the original frame of the second sampling point.
- After a first transform of the original frame, the waveform diagram shown in Figure 5d is obtained; then, from the waveform shown in Figure 5d, the intermediate frequency part is selected and the second DCT transformation is performed on it to obtain the waveform diagram shown in Fig. 5e.
- The energy of the waveform shown in Fig. 5e is processed according to the following formula 1:
- $P(j)' = P(j)/\alpha, \quad 1 \le j \le \mathrm{mid}$
- Here j is the time period in the time domain, P(j) is the total energy of the waveform in period j as shown in Figure 5e, α is a preset coefficient whose specific value can be adjusted according to actual needs, and mid represents the midpoint of the waveform of Figure 5e in the time domain. It can be seen from the above formula that j ranges from 1 to the midpoint, that is, over the first half of Fig. 5e. The total energy value P(j) of the first half is divided by the coefficient α, thereby reducing the energy value of the first half of the second sampling point, and the waveform shown in Figure 5b is obtained.
- In the order of the time domain, the energy value of the second half 5063 of the second sampling point is reduced, so that the energy value of the first half 5064 of the second sampling point is significantly higher than the energy value of the second half 5063; the waveform with this energy distribution can be preset as the digit 1.
- The energy of the waveform shown in FIG. 5e is processed at this time according to the following formula 2:
- $P(j)' = P(j)/\alpha, \quad \mathrm{mid}+1 \le j \le S'-T'$
- Here j is the time period in the time domain, P(j) is the total energy of the waveform in period j as shown in Figure 5e, α is a preset coefficient whose specific value can be adjusted according to actual needs, mid represents the midpoint of the waveform of Figure 5e in the time domain, and S'-T' represents the upper limit of the time domain in Figure 5e.
- The value of j ranges from mid+1 to S'-T', that is, from the midpoint of the time domain to the end of the time domain, which is the second half of Figure 5e. The total energy value P(j) of the second half is divided by the coefficient α, thereby reducing the energy value of the second half of the second sampling point, and the waveform shown in Figure 5f is obtained.
- the embedding of the watermark is realized by changing the ratio of the energy values of different time-domain portions of the second sampling point. Specifically, the energy ratio of the front and rear parts of the second sampling point in the time domain is adjusted, and then the different energy ratios of the front and rear parts are preset to 0 and 1 respectively, thereby realizing binary digital embedding on the second sampling point.
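- The time-domain embedding rule can be sketched on a list of per-period energies P(j). The sketch works directly on energy values and leaves out the transforms of Figures 5c to 5e; the function name and the 0-based indexing are assumptions, and `alpha` stands for the preset coefficient.

```python
def embed_bit_time_domain(period_energies, bit, alpha):
    """Divide the energies of one half of the time axis by alpha.

    bit 0: the first half is reduced, so the second half is clearly higher.
    bit 1: the second half is reduced, so the first half is clearly higher.
    period_energies[j] plays the role of P(j), indexed from 0 here.
    """
    mid = len(period_energies) // 2
    out = list(period_energies)
    if bit == 0:
        attenuated = range(0, mid)          # first half
    else:
        attenuated = range(mid, len(out))   # second half
    for j in attenuated:
        out[j] = out[j] / alpha
    return out


print(embed_bit_time_domain([4.0, 4.0, 4.0, 4.0], 1, 4.0))  # [4.0, 4.0, 1.0, 1.0]
print(embed_bit_time_domain([4.0, 4.0, 4.0, 4.0], 0, 4.0))  # [1.0, 1.0, 4.0, 4.0]
```

- A larger `alpha` widens the gap between the two halves and makes the bit easier to read back, at the cost of a more audible change.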
- the binary number sequence is converted into a decimal number sequence as required, thereby realizing the working process of adding the first number sequence to the second sampling point.
- the parsing of the audio watermark can be realized by directly obtaining the ratio of the energy values of the front and rear parts of the target frame.
- the determination of the energy ratio is based on the midpoint of the target frame in the time domain as a limit, and the ratio of the front and rear parts is obtained and analyzed according to the above preset rules.
- the high-energy part will increase in the time domain.
- For example, when the audio watermark is 1, with the midpoint of the time domain as the boundary, the ratio of the energy value of the first half to the energy value of the second half is greater than or equal to the fifth preset value.
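- The parsing rule can be sketched as below. The source states the condition for the digit 1; treating the mirrored condition as the digit 0 and returning `None` when neither holds are assumptions of this sketch, as are the function and parameter names.

```python
def parse_bit_time_domain(period_energies, fifth_value):
    """Read one bit from the front/back energy ratio of a target frame.

    Returns 1 when front / back >= fifth_value, 0 when back / front >=
    fifth_value (assumed mirror rule), and None when no watermark is evident.
    fifth_value is expected to be greater than 1.
    """
    mid = len(period_energies) // 2
    front = sum(period_energies[:mid])
    back = sum(period_energies[mid:])
    if front == 0 and back == 0:
        return None  # silent frame, nothing to read
    if front >= fifth_value * back:
        return 1
    if back >= fifth_value * front:
        return 0
    return None
```

- Comparing `front >= fifth_value * back` instead of dividing avoids a division by zero when one half is silent.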
- the embodiment of the present application further provides a method for changing the energy ratio according to the frequency domain, so as to overcome the influence of the echo reverberation generated during the transcription process on the energy division in the time domain.
- the energy ratio of the second sampling point in the frequency domain is changed in a similar manner to the above, so as to realize the embedding of the watermark. That is, this embodiment does not use time as a boundary in the time domain, but uses frequency as a boundary in the frequency domain.
- the energy distribution diagram obtained by changing the energy ratios of different frequency domain parts is shown in Fig. 5g.
- Fig. 5b to Fig. 5f above are line graphs of the relationship between the energy value and the time domain, whereas the energy distribution diagram in Fig. 5g shows the distribution of energy in both the time domain and the frequency domain.
- In Fig. 5g, the shaded parts represent high energy parts and the white parts represent low energy parts.
- The second sampling point is delimited by the midpoint of the frequency domain: the energy segment 5065, in which the ratio of the upper half to the lower half is greater than the fifth threshold, represents the number 1, and the energy segment 5066, in which the ratio of the lower half to the upper half is greater than the fifth threshold, represents the number 0. The audio watermark is thereby embedded by changing the energy ratio in the frequency domain, which overcomes the influence of the echo reverberation generated in the transcription process on the energy distribution in the time domain.
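- The frequency-domain variant can be sketched in the same way, with the split made at the midpoint of the frequency axis. The list of band energies (ordered from low to high frequency), the attenuation coefficient `alpha`, and the function names are assumptions for illustration.

```python
def embed_bit_frequency_domain(band_energies, bit, alpha):
    """Attenuate one half of the frequency axis by alpha.

    bit 1: the lower half is reduced, so upper / lower becomes large.
    bit 0: the upper half is reduced, so lower / upper becomes large.
    """
    mid = len(band_energies) // 2
    out = list(band_energies)
    attenuated = range(0, mid) if bit == 1 else range(mid, len(out))
    for k in attenuated:
        out[k] = out[k] / alpha
    return out


def parse_bit_frequency_domain(band_energies, fifth_threshold):
    """Return 1, 0, or None according to the upper/lower energy ratio."""
    mid = len(band_energies) // 2
    lower = sum(band_energies[:mid])
    upper = sum(band_energies[mid:])
    if upper > fifth_threshold * lower:
        return 1
    if lower > fifth_threshold * upper:
        return 0
    return None
```

- Reverberation smears energy along the time axis, so a split along the frequency axis is less affected by it, which is the motivation stated above.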
- The above scheme 1 and scheme 2 can also be combined.
- First, the method of scheme 1 is used to adjust the energy ratio of the front and rear parts of the second sampling point in the time domain, so as to realize the first step of watermark embedding. For example, as shown in Figure 5h, the ratio of the energy value of the first half 5067 of the second sampling point to the energy value of the second half 5068 is greater than the fifth threshold, and the audio watermark is embedded as the number 1.
- Then, in order to prevent the echo reverberation generated during the transcription process from affecting the audio watermark energy in the time domain, as shown in Figure 5h, the method of scheme 2 above is adopted to change the energy ratio of the high frequency part 50681 to the low frequency part 50682, wherein the energy distribution in which the ratio of the upper half to the lower half of the second sampling point is greater than the fifth threshold also represents the number 1.
- In this way, two watermarks, one large and one small, are added to the second sampling point at the same time, and the two watermarks record the same value. The large watermark is the watermark obtained by changing the energy ratio of the front and rear parts of the second sampling point in the time domain, that is, the watermark formed by the ratio of the energy value of the first half 5067 to the energy value of the second half 5068 in Figure 5h. The small watermark is the watermark obtained by changing the energy ratio of the high and low frequency parts of the low energy part of the second sampling point, that is, the watermark formed by the energy ratio of the high frequency part 50681 to the low frequency part 50682 of the second half 5068 in Figure 5h.
- The low-energy part in the audio watermark can be further segmented.
- For example, the energy distribution diagram shown in Figure 5i is used to represent the number 1, where the shaded part represents the area of high energy value and the white part represents the area of low energy value. As shown in Fig. 5i, in the time domain, the first half 51 is the high energy part and the second half 52 is the low energy part.
- The second half 52 is further divided into two parts in the time domain, recorded as the first part 521 and the second part 522, where the first part 521 is the area close to the first half 51 (the high-energy portion) and the second part 522 is the area away from the first half 51 (the high-energy portion).
- The energy ratio of the high-frequency part to the low-frequency part of the second part 522 is changed, because the energy of the first part 521 may correspond to the reverberation generated during the transcription process. The second part 522 embeds the small watermark by changing the ratio of high and low frequency energy: the energy value of the high frequency part 5221 of the second part 522 is higher than the energy value of the low frequency part 5222, which also represents the number 1.
- The large watermark and the small watermark thus represent the same audio watermark.
- the audio watermark shown in Figure 5i above is divided into two parts (the first part 521 and the second part 522) in the time domain for the low-energy part (ie, the second half part 52), and in practice During the working process, those skilled in the art can divide the low-energy part into more parts in the time domain according to actual needs, which is not limited in the embodiment of the present application.
- The playback terminal searches, in time order, for a second target frame that follows the first target frame and meets the preset condition, and embeds the audio watermark in it.
- The audio watermark is embedded by changing the energy ratio between the energy values of different time-domain parts and/or different frequency-domain parts of the sampling points in the third target frame, so that the audio watermark can be embedded in real time,
- and the embedded watermark has strong anti-interference ability while the audio is being transcribed.
- The playback terminal implements binary watermark embedding by changing the energy ratio.
- In practice, those skilled in the art can change the energy ratio to implement other number systems, such as decimal or hexadecimal, according to actual needs, which is not limited in the embodiments of the present application.
- For example, the audio watermark acquired by the playback terminal is the number "14".
- The audio watermark needs to be converted into binary, so the number "14" is converted into binary to obtain the first sequence "1110".
- The first sequence is the content that needs to be embedded in the first audio as the audio watermark.
- A check digit may be added to the first sequence, for example by parity checking. The first sequence "1110" includes three 1s, that is, the number of 1s is odd, so a 1 is appended after the last digit of the first sequence to obtain a new first sequence "11101".
- The final 1 in the first sequence is the check digit.
- A check digit of "1" indicates that, excluding the check digit, the current sequence contains an odd number of 1s.
- Conversely, when the sequence contains an even number of 1s, the check digit is "0".
- Once the check digit of the first sequence has been determined in this way, the parsing terminal can subsequently verify the first sequence according to the check digit,
- which ensures that the first sequence is transmitted accurately.
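The worked example above (the number 14, the sequence "1110" and the check digit "1") can be reproduced with a minimal Python sketch. The function names `to_first_sequence` and `add_check_digit` are illustrative assumptions and do not come from the application.

```python
def to_first_sequence(value: int) -> str:
    """Convert a non-negative integer watermark into its binary digit string."""
    return format(value, "b")


def add_check_digit(bits: str) -> str:
    """Append "1" if bits holds an odd number of 1s, otherwise append "0"."""
    return bits + str(bits.count("1") % 2)


first_sequence = add_check_digit(to_first_sequence(14))
print(first_sequence)  # 11101
```

With this rule the total number of 1s, check digit included, is always even, which is what the parsing side can test.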
- When the audio watermark occupies too many frames, sampling points may be lost, so that the start position of each frame deviates from the expected position and audio watermark detection becomes inaccurate.
- To address this, the first sequence is divided into multiple sub-sequences, a check digit is added to each sub-sequence according to the above method, and the sub-sequences are then reassembled into one long first sequence. When the parsing terminal subsequently parses the audio watermark, it can check the first sequence according to the preset sub-period, thereby ensuring the transmission integrity of the first sequence.
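A minimal sketch of the per-sub-sequence check digit is shown below. The block length of 4 and the name `add_check_digits_per_block` are assumptions made for illustration; the application does not fix the sub-sequence length.

```python
def add_check_digits_per_block(bits: str, block_len: int) -> str:
    """Split bits into sub-sequences and append a parity check digit to each."""
    if block_len <= 0:
        raise ValueError("block_len must be positive")
    blocks = []
    for start in range(0, len(bits), block_len):
        block = bits[start:start + block_len]
        # Check digit is "1" when the sub-sequence holds an odd number of 1s.
        blocks.append(block + str(block.count("1") % 2))
    return "".join(blocks)


print(add_check_digits_per_block("11100101", 4))  # 1110101010
```

Each sub-sequence can then be verified on its own, so one damaged block does not invalidate the whole sequence.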
- For example, the first sequence corresponding to the audio watermark is "11101".
- The playback terminal obtains the first second target frame and, using the method described in step 506 above, embeds the first element of the first sequence, "1", into the first second sampling point in the second target frame, which implements the embedding of the first digit of the first sequence.
- At this point, the first sub-cycle of the watermark embedding cycle is complete.
- Step 506 is then executed cyclically: the second second sampling point is obtained from the second target frame, and the second element of the first sequence, "1", is embedded into it in the same way.
- By analogy, the playback terminal embeds the five digits of the first sequence into five second sampling points of the second target frame through a complete watermark embedding cycle composed of five watermark embedding sub-cycles.
- To ensure that the watermark embedding strength in the current sub-cycle is sufficient, the playback terminal needs to perform an embedding strength detection after each sub-cycle is completed.
- Specifically, the watermark embedding strength is detected by checking whether the energy ratio between the energy values of different time-domain parts and/or different frequency-domain parts of the second sampling point in the current sub-cycle is greater than the fifth threshold.
- For example, the playback terminal needs to determine the energy ratio between the first half and the second half of the current second sampling point in the time domain.
- The playback terminal embeds an audio watermark in the second target frame in the manner described in steps 501 to 503 above. At this point, step 203 is complete.
- The playback terminal plays the first audio embedded with the audio watermark.
- Because the audio watermark is embedded in the first audio in real time, the audio played by the playback terminal still contains the audio watermark in a real-time playback scenario, so that if the first audio is subsequently transcribed, its source can be traced according to the audio watermark.
- In the embodiments of the present application, the playback terminal acquires the first audio in real time; the playback terminal embeds an audio watermark, which is associated with the playback terminal, in the first audio; and the playback terminal plays the first audio embedded with the audio watermark. Therefore, when audio is played in real time, the playback terminal adds the audio watermark to the audio stream in real time, so that a later device parsing the watermark can determine the playback terminal from it, which makes it easy to trace the source after the first audio is transcribed.
- For the first audio to which the audio watermark has been added by the above method, whether it is transcribed through a digital channel or an air channel, the parsing terminal can parse the transcribed first audio to obtain the audio watermark. This makes the first audio traceable: the parsing terminal can determine from the audio watermark which playback terminal added it.
- The audio watermark parsing method provided by the embodiments of the present application is described in detail below with reference to the accompanying drawings.
- An embodiment of the audio watermark parsing method provided by the present application includes the following steps.
- The parsing terminal obtains the first audio.
- The first audio is audio in which the playback terminal has embedded the audio watermark by the above method.
- The initial playback source of the first audio is the playback terminal, that is, the playback terminal embeds the audio watermark in the first audio.
- The first audio may be acquired directly by the parsing terminal or may be obtained through transcription.
- The transcription may take place over either a digital channel or an air channel.
- After acquiring the first audio, the parsing terminal also needs to convert the format and sampling rate of the first audio.
- After the playback device plays the first audio, the recording device may transcribe it in many ways. In particular, recording devices of different brands produce audio files in different formats, and the sampling rate is generally 44.1 kHz, so the audio file format and sampling rate must first be converted into a format and sampling rate that the parsing terminal can handle.
- For example, the parsing terminal can convert the sampling rate of the first audio to 48 kHz.
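The sampling-rate conversion from 44.1 kHz to 48 kHz can be sketched as follows. This is a deliberately naive linear-interpolation resampler written for illustration only; the application does not specify the conversion algorithm, the name `resample_linear` is hypothetical, and a real implementation would more likely use a band-limited (polyphase) resampler.

```python
def resample_linear(samples, src_rate, dst_rate):
    """Resample a list of samples by linear interpolation (illustrative only)."""
    if not samples:
        return []
    n_out = int(len(samples) * dst_rate / src_rate)
    last = len(samples) - 1
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate      # position in the source signal
        left = min(int(pos), last)
        right = min(left + 1, last)
        frac = pos - left
        out.append(samples[left] * (1 - frac) + samples[right] * frac)
    return out


ten_ms_at_44k = [1.0] * 441                # 10 ms of audio at 44.1 kHz
ten_ms_at_48k = resample_linear(ten_ms_at_44k, 44100, 48000)
print(len(ten_ms_at_48k))  # 480
```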
- The parsing terminal parses the audio watermark from the first audio.
- The parsing terminal may parse the first audio in real time or offline,
- which is not limited in the embodiments of the present application.
- The embodiments of the present application mainly describe the offline parsing method, which does not constitute a limitation on this solution.
- In the audio watermark parsing method provided by the present application, parsing the audio watermark from the first audio includes the following steps.
- The parsing terminal determines a first target frame in the first audio that satisfies a first preset condition.
- Because the first target frame satisfies the first preset condition, the parsing terminal can acquire the first target frame in the first audio according to that condition.
- The first audio includes multiple first target frames, and each first target frame corresponds to one watermark parsing cycle, so whenever the parsing terminal determines a first target frame in the first audio that satisfies the first preset condition, it performs the subsequent parsing steps.
- The first preset condition may be: the target frame whose low-frequency part has the maximum value within the first interval is taken as the first target frame. For the specific implementation of the first interval, refer to the description of step 301 above, which is not repeated here.
- The first target frame includes marker information,
- and the parsing terminal needs to further determine the first target frame according to the marker information.
- Depending on the actual situation of the first audio, the marker information is implemented in one of two technical solutions:
- 1. When the sampling rate of the first audio is less than the first threshold, the marker information is a characteristic sound.
- 2. When the sampling rate of the first audio is greater than or equal to the first threshold, the marker information is a synchronization frame marker.
- Solution 1: the marker information is the characteristic sound.
- When the parsing terminal detects that the sampling rate of the first audio is less than the first threshold, it can determine that, in the first audio, the marker information in the first target frame and the second target frame is the characteristic sound.
- The parsing terminal detects the marker information as follows:
- The parsing terminal determines, from the first audio, a target frame that contains the first characteristic sound, where the duration of the first characteristic sound is greater than or equal to the preset time, as the first target frame.
- The first characteristic sound may be detected by a sound detection method; any method in the prior art may be used to detect the characteristic sound, which is not limited in the embodiments of the present application.
- The first characteristic sound can be a human voice. For example, after no human voice has been detected for longer than a preset time, the target frame at the moment when a human voice is detected again is determined as the first target frame.
- The playback terminal and the parsing terminal can also agree on a more detailed implementation of the characteristic sound. For example, during human voice detection, when the parsing terminal detects a specific sentence, the target frame in which that sentence is located is determined to be the first target frame.
- The parsing terminal determines the first target frame by detecting the first characteristic sound, thereby locating the watermark embedding start position (that is, the characteristic sound),
- so the synchronization frame marker does not need to be embedded in the first audio.
- Solution 2: the marker information is a synchronization frame marker.
- When the parsing terminal determines that the sampling rate of the first audio is greater than or equal to the first threshold, it can determine that, in the first audio, the marker information in the first target frame is a synchronization frame marker.
- The parsing terminal parses the synchronization frame marker through the following steps.
- The parsing terminal acquires, frame by frame, the first ratio of the energy values of the intermediate-frequency part and the low-frequency part of the first audio.
- The parsing terminal can determine the synchronization frame marker from the first ratio.
- When the parsing terminal obtains an initial target frame whose first ratio is greater than or equal to the second threshold, it starts from the initial target frame and scans the first audio with a sliding window to obtain, for each sliding window,
- the second ratio of the energy value of the intermediate-frequency part to that of the low-frequency part.
- The parsing terminal obtains, frame by frame, the first ratio of the energy values of the intermediate-frequency sampling points and the low-frequency sampling points of each frame. When it finds a first sampling point whose first ratio is greater than or equal to the second threshold, the target frame containing that first sampling point is the initial target frame. However, a frame contains 2048 sampling points and the first sampling point is only one part of them, so the first target frame in which the first sampling point is actually located may be offset from the initial target frame in which the first sampling point was found.
- The parsing terminal takes the frame of the sliding window with the largest second ratio as the first target frame.
- For example, the initial target frame 801 is the initial sliding window 801 generated by the parsing terminal. The initial target frame 801 and the first target frame 802 intersect, and the first sampling point 803 lies in the intersection. The parsing terminal needs to make the sliding window 801 overlap the first target frame 802 completely in order to determine the position of the first target frame.
- Specifically, the parsing terminal uses the sliding window method to detect the second ratio of the energy values of the intermediate-frequency part and the low-frequency part in each sliding window 801. Because the playback terminal actively boosts the intermediate-frequency part of the first target frame 802, the window in which the second ratio reaches its maximum is the window in which the first target frame is located, and in this way the sliding window 801 is made to overlap the first target frame 802. The synchronization frame marker is thus found by sliding window detection, which effectively prevents the offset problem that arises during the search and improves the accuracy of subsequent watermark detection.
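The sliding-window search described above can be sketched in Python. The sketch assumes that per-sample energies of the intermediate-frequency and low-frequency bands have already been computed (for example by band-pass filtering), that the window advances one sample at a time, and that it only moves forward from the initial target frame. The function name and all parameters are hypothetical.

```python
def locate_first_target_frame(mid_energy, low_energy, start, frame_len, search_span):
    """Return (offset, ratio) of the window with the largest mid/low energy ratio.

    mid_energy and low_energy are assumed per-sample band energies; start is the
    first sample of the initial target frame; search_span is how far to slide.
    """
    best_offset, best_ratio = start, -1.0
    last_start = min(start + search_span, len(mid_energy) - frame_len)
    for offset in range(start, last_start + 1):
        mid = sum(mid_energy[offset:offset + frame_len])
        low = sum(low_energy[offset:offset + frame_len])
        ratio = mid / low if low > 0 else float("inf")
        if ratio > best_ratio:
            best_offset, best_ratio = offset, ratio
    return best_offset, best_ratio


# Toy data: the boosted intermediate-frequency region occupies samples 6..9,
# while the initial target frame was detected starting at sample 4.
low = [1.0] * 20
mid = [0.1] * 20
for i in range(6, 10):
    mid[i] = 10.0
print(locate_first_target_frame(mid, low, start=4, frame_len=4, search_span=4))
```

In the toy data the window aligned with samples 6 to 9 gives the largest ratio, which corresponds to the sliding window coming to overlap the first target frame.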
- The detection method above determines the synchronization frame marker by detecting the first ratio of the energy values of the intermediate-frequency part and the low-frequency part of the first audio.
- However, in the original content of the first audio, that is, content to which no synchronization frame marker has been added,
- the ratio of the energy value of the intermediate-frequency part to that of the low-frequency part may also be greater than the first ratio,
- which may result in false detection of synchronization frame markers.
- Compared with the original content of the first audio, the synchronization frame marker differs in one essential respect: within the synchronization frame marker, the energy value of the first sampling point in the intermediate-frequency part shows a sudden increase relative to the low-frequency part.
- Therefore, after the parsing terminal determines through step 3 above that the target frame where the sliding window is located is the first target frame, it can further determine through the following steps whether the synchronization frame marker in the current target frame is a genuine synchronization frame marker, thereby preventing false detections.
- The parsing terminal obtains, from the sliding window with the largest second ratio, the first sampling point, which has the highest energy value in the intermediate-frequency part.
- The first sampling point is the point with the highest energy value in the current window.
- The parsing terminal acquires a third sampling point located a preset length before the first sampling point.
- The third sampling point precedes the first sampling point in the time domain. The preset length between the third sampling point and the first sampling point can be set by those skilled in the art according to actual needs, or determined by the parsing terminal itself according to parameters such as the sampling rate, which is not limited in the embodiments of the present application.
- The parsing terminal determines that a part in which the ratio of the energy value of the first sampling point to the energy value of the third sampling point is greater than the seventh threshold is a synchronization frame marker.
- For example, refer to FIG. 9a and FIG. 9b, where FIG. 9a shows a first target frame with a synchronization frame marker added and FIG. 9b shows an ordinary target frame without a synchronization frame marker.
- In both figures, the energy ratio of the intermediate-frequency part to the low-frequency part satisfies the preset condition. The energy ratio between the intermediate-frequency and low-frequency parts alone therefore cannot tell which target frame carries the synchronization frame marker, which leads to false detections.
- In FIG. 9a, the first sampling point 901 and the third sampling point 902 are separated by 3 sampling points, and the energy of the first sampling point shows a sudden increase relative to the third sampling point. In FIG. 9b, the energy value of the first sampling point 903 does not change significantly relative to the third sampling point 904. This method therefore identifies the synchronization frame marker accurately and prevents false detections.
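The sudden-increase test can be sketched as below. The sketch assumes a list of intermediate-frequency energy values for the window with the largest second ratio and treats the preset length as an index distance; `is_sync_marker`, `preset_length` and the sample values are illustrative assumptions, not values from the application.

```python
def is_sync_marker(mid_energy, preset_length, seventh_threshold):
    """Check whether the energy peak rises abruptly relative to an earlier point."""
    # First sampling point: the highest intermediate-frequency energy in the window.
    first_idx = max(range(len(mid_energy)), key=mid_energy.__getitem__)
    # Third sampling point: a preset length before the first sampling point.
    third_idx = first_idx - preset_length
    if third_idx < 0:
        return False
    earlier = mid_energy[third_idx]
    if earlier <= 0:
        return mid_energy[first_idx] > 0
    return mid_energy[first_idx] / earlier > seventh_threshold


marked_frame = [1, 1, 1, 1, 1, 20, 18, 15]    # abrupt rise, as in FIG. 9a
plain_frame = [1, 2, 4, 8, 12, 14, 15, 14]    # gradual rise, as in FIG. 9b
print(is_sync_marker(marked_frame, 3, 4.0), is_sync_marker(plain_frame, 3, 4.0))
```

Both toy frames reach a high peak, but only the first one rises by more than the threshold factor within the preset length, which is the property the check relies on.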
- In this way, the parsing terminal can detect, in the first audio, the first target frame determined by the playback terminal.
- The parsing terminal determines, after the first target frame, a second target frame that satisfies a second preset condition.
- After the parsing terminal detects the first target frame, it uses the first target frame as a positioning frame and continues to search for the second target frame located after it. The second target frame is embedded with an audio watermark and satisfies the second preset condition.
- The parsing terminal can therefore quickly find the second target frame after the first target frame according to the second preset condition.
- During offline parsing, the parsing terminal can also start from the first target frame and search forward.
- The embodiments of the present application describe only the case of moving backward from the first target frame to find the second target frame, which does not constitute a limitation on the solution of the embodiments of the present application.
- The parsing terminal may determine the second target frame through the following steps.
- Starting from the first target frame and moving backward frame by frame, the parsing terminal obtains the candidate target frames, namely the frames whose intermediate-frequency energy is greater than or equal to a third threshold and less than a fourth threshold.
- That is, the parsing terminal first identifies target frames whose intermediate-frequency energy is greater than or equal to the third threshold and less than the fourth threshold as candidate target frames that may be the second target frame.
- From the candidate target frames, the parsing terminal takes as the second target frame the target frame in which the energy ratio between the energy values of different time-domain parts and/or different frequency-domain parts is greater than or equal to the fifth threshold.
- That is, according to the specific method by which the playback terminal embeds the audio watermark, the parsing terminal obtains from the candidate target frames the target frame whose energy ratio between different time-domain parts and/or different frequency-domain parts is greater than or equal to the fifth threshold as the second target frame.
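The two-stage selection of the second target frame can be sketched as follows. The sketch assumes the time-domain embedding scheme and assumes that each frame after the first target frame has already been summarized by three energy values; the `FrameEnergy` type, the function name and the threshold values are hypothetical.

```python
from typing import NamedTuple, Optional, Sequence


class FrameEnergy(NamedTuple):
    mid_band: float      # energy of the intermediate-frequency part
    first_half: float    # energy of the first half in the time domain
    second_half: float   # energy of the second half in the time domain


def find_second_target_frame(frames: Sequence[FrameEnergy],
                             third_threshold: float,
                             fourth_threshold: float,
                             fifth_threshold: float) -> Optional[int]:
    """Return the index of the first frame that passes both checks, else None."""
    for index, frame in enumerate(frames):
        # Stage 1: candidate frames by intermediate-frequency energy.
        if not (third_threshold <= frame.mid_band < fourth_threshold):
            continue
        # Stage 2: the larger half must dominate the smaller by the fifth threshold.
        high = max(frame.first_half, frame.second_half)
        low = min(frame.first_half, frame.second_half)
        if high > 0 and high >= fifth_threshold * low:
            return index
    return None


frames_after_first = [
    FrameEnergy(0.5, 1.0, 1.0),   # intermediate-frequency energy too low
    FrameEnergy(2.0, 1.0, 1.1),   # candidate, but halves nearly equal
    FrameEnergy(2.5, 9.0, 1.0),   # candidate with a clear energy ratio
]
print(find_second_target_frame(frames_after_first, 1.0, 3.0, 4.0))  # 2
```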
- Having determined the second target frame, the parsing terminal starts parsing the watermark from it.
- The parsing terminal parses the audio watermark from the second target frame.
- The playback terminal embedded the audio watermark in the second target frame, so once the parsing terminal obtains the second target frame, it can parse the embedded audio watermark from it.
- The parsing terminal parses the audio watermark from the second target frame through the following steps.
- The parsing terminal acquires the second sampling point from the second target frame.
- The second sampling point is a sampling point in the second target frame whose energy ratio is greater than or equal to the fifth threshold.
- The second sampling point obtained by the parsing terminal from the second target frame records the watermark information embedded by the playback terminal.
- The parsing terminal obtains the energy ratios between the energy values of different time-domain parts and/or different frequency-domain parts of the second sampling point.
- When the playback terminal embeds the audio watermark,
- it can change the energy ratio in the second sampling point according to one of three schemes: 1. changing the ratio of the energy values of different time-domain parts; 2. changing the ratio of the energy values of different frequency-domain parts; 3. changing the ratios of the energy values of different time-domain parts and different frequency-domain parts at the same time. The parsing terminal therefore needs to parse with the means corresponding to the watermark embedding method used.
- The three schemes are described in detail below with reference to the accompanying drawings.
- In scheme 1, when embedding the audio watermark, the playback terminal changes the ratio of the energy values of different time-domain parts. Specifically, when the ratio of the energy value of the second half of the second sampling point to that of the first half is greater than the fifth threshold, the corresponding number is 0; when the ratio of the energy value of the first half of the second sampling point to that of the second half is greater than the fifth threshold, the corresponding number is 1.
- The energy distribution diagrams of the second target frame obtained by the parsing terminal are shown in Fig. 12a and Fig. 12b,
- where the dark part represents the high-energy area
- and the white part represents the low-energy area.
- In Fig. 12a, the energy value of the second half 1201 of the second target frame is significantly larger than that of the first half 1202; the parsing terminal calculates that the ratio of the energy value of the second half 1201 of the second sampling point to that of the first half 1202 is greater than the fifth threshold,
- so the parsing terminal determines that the watermark embedded in the second target frame shown in Fig. 12a is 0.
- In Fig. 12b, the energy value of the first half 1203 of the second target frame is significantly larger than that of the second half 1204; the parsing terminal calculates that the ratio of the energy value of the first half 1203 of the second sampling point to that of the second half 1204 is greater than the fifth threshold,
- so the parsing terminal determines that the watermark embedded in the second target frame shown in Fig. 12b is 1.
- The energy ratio is determined by taking the midpoint of the target frame in the time domain as the boundary, obtaining the ratio between the front and rear parts, and interpreting it according to the preset rules above.
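The time-domain decision rule of scheme 1 can be sketched as below. The sketch assumes that the energy value of a part is the sum of its squared sample amplitudes and that the fifth threshold is greater than 1; the function name and sample data are illustrative.

```python
def decode_time_domain_bit(samples, fifth_threshold):
    """Return 1, 0, or None (no clear watermark) from the front/rear energy ratio."""
    midpoint = len(samples) // 2
    first = sum(s * s for s in samples[:midpoint])    # energy of the first half
    second = sum(s * s for s in samples[midpoint:])   # energy of the second half
    # Multiplying instead of dividing avoids division by zero for silent halves.
    if first > fifth_threshold * second:
        return 1
    if second > fifth_threshold * first:
        return 0
    return None


loud_then_quiet = [1.0] * 4 + [0.1] * 4   # like Fig. 12b: first half dominates
quiet_then_loud = [0.1] * 4 + [1.0] * 4   # like Fig. 12a: second half dominates
print(decode_time_domain_bit(loud_then_quiet, 2.0),
      decode_time_domain_bit(quiet_then_loud, 2.0))
```

Returning `None` when neither half dominates leaves room for the caller to discard the frame rather than guess a digit.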
- The high-energy part will increase in the time domain because of echo or reverberation at the venue.
- For example, when the audio watermark is 1,
- the time domain is divided at the midpoint, and the ratio of the energy value of the first half to that of the second half is greater than or equal to the fifth preset value.
- The playback terminal therefore also provides a method of changing the energy ratio in the frequency domain when embedding the watermark, so as to overcome the influence of the echo and reverberation generated during transcription on the energy division in the time domain.
- In scheme 2, when embedding the audio watermark, the playback terminal uses a method similar to the one above and changes the energy ratio of the second sampling point in the frequency domain to embed the watermark.
- The difference from the time-domain energy ratio method is that, in this embodiment, the boundary is a frequency in the frequency domain rather than a time in the time domain.
- The energy distribution diagram of the second target frame obtained is shown in Figure 5g, where the second sampling point is divided at the midpoint of the frequency domain,
- the dark parts represent high-energy areas
- and the white parts represent low-energy areas, so that the parsing terminal can obtain the energy distribution of the second target frame directly from the energy distribution diagram.
- A ratio of the upper half to the lower half greater than the fifth threshold represents the number 1,
- and a ratio of the lower half to the upper half greater than the fifth threshold represents the number 0. The audio watermark is thus embedded by changing the energy ratio in the frequency domain, which overcomes the effect of the echo and reverberation generated during transcription on the energy division.
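The frequency-domain decision rule of scheme 2 can be sketched with a naive discrete Fourier transform. The sketch assumes that the frequency midpoint is the middle of the range from 0 to the Nyquist frequency and that energy is the squared magnitude of each bin; the application does not state either choice, and the function names are hypothetical. The O(n²) transform is used only to keep the example dependency-free.

```python
import cmath
import math


def dft_power(samples):
    """Power of the non-negative-frequency DFT bins (naive O(n^2) transform)."""
    n = len(samples)
    powers = []
    for k in range(n // 2 + 1):
        acc = sum(samples[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        powers.append(abs(acc) ** 2)
    return powers


def decode_frequency_domain_bit(samples, fifth_threshold):
    """Return 1, 0, or None from the upper-half/lower-half band energy ratio."""
    powers = dft_power(samples)
    midpoint = len(powers) // 2
    lower = sum(powers[:midpoint])   # low-frequency half
    upper = sum(powers[midpoint:])   # high-frequency half
    if upper > fifth_threshold * lower:
        return 1
    if lower > fifth_threshold * upper:
        return 0
    return None


n = 32
high_tone = [math.sin(2 * math.pi * 12 * t / n) for t in range(n)]  # bin 12
low_tone = [math.sin(2 * math.pi * 3 * t / n) for t in range(n)]    # bin 3
print(decode_frequency_domain_bit(high_tone, 2.0),
      decode_frequency_domain_bit(low_tone, 2.0))
```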
- In scheme 3, when embedding the audio watermark, the playback terminal combines scheme 1 and scheme 2.
- First, the energy ratio of the front and rear parts of the second sampling point in the time domain is adjusted to perform the first step of
- watermark embedding. For example, as shown in Figure 5h, the ratio of the energy value of the first half of the second sampling point to that of the second half is greater than the fifth threshold, and the embedded audio watermark is the number 1.
- Then, to prevent echo and reverberation from affecting the audio watermark energy in the time domain, as shown in Figure 5h,
- the method of scheme 2 above is used to change the energy ratio of the high-frequency part
- to the low-frequency part.
- In this way, two watermarks, one large and one small, are added to the second sampling point at the same time.
- Both watermarks record the same value: the large watermark is obtained by changing the energy ratio of the front and rear parts of the second sampling point in the time domain, and the small watermark is obtained by changing the energy ratio of the high-frequency and low-frequency parts of the low-energy part of the second sampling point.
- The low-energy part in the audio watermark can be further segmented. As shown in Figure 5i, for the small watermark embedded in the low-energy part of the second sampling point, the low-energy part is divided into two parts in the time domain, recorded as the first part and the second part. The first part is the region close to the high-energy part, and the second part is the region far from the high-energy part.
- The energy of the first part may increase because of the reverberation generated during transcription. The energy ratio of high and low frequencies in the first part is therefore left unchanged, which eliminates the influence of the echo and reverberation generated during transcription on the energy division in the time domain.
- Only in the second part is a small watermark embedded by changing the ratio of high-frequency to low-frequency energy.
- In the audio watermark shown in Figure 5i, the low-energy part is divided into two parts in the time domain for processing. In practice, the low-energy part can be divided into more parts in the time domain, which is not limited in the embodiments of the present application.
- During parsing, the parsing terminal first parses the large watermark.
- For the parsing method of the large watermark, refer to scheme 1 above. The parsing terminal then parses the small watermark within the large watermark.
- For the parsing method of the small watermark, refer to scheme 2 above. When the numbers parsed from the large watermark and the small watermark are consistent, the watermark is determined to have been parsed correctly, and the parsing terminal obtains the watermark embedded in the current second target frame.
- The playback terminal implements binary watermark embedding by changing the energy ratio.
- In practice, those skilled in the art can change the energy ratio to embed
- the watermark in other number systems, such as decimal or hexadecimal, which is not limited in the embodiments of the present application.
- In that case, the parsing terminal parses the watermark in the corresponding way.
- With the above method, the audio watermark has better anti-interference ability: during transcription, the watermark is not easily lost through air recording, and the parsing terminal can accurately parse the audio watermark embedded by the playback terminal, so the audio watermark embedding scheme has better stability and accuracy.
- When parsing the audio watermark from the second target frame, the parsing terminal first parses the first second sampling point in the manner described in step 703 above to obtain the first element of the first sequence, "1", which completes the parsing of the first digit of the first sequence.
- At this point, the first sub-cycle of the watermark parsing cycle is complete.
- Step 703 is then executed in a loop: the second second sampling point is obtained from the second target frame, and the second element of the first sequence, "1", is parsed in the same way, which completes the parsing of the second digit of the first sequence. By analogy, the parsing terminal completes a full watermark parsing cycle composed of four sub-cycles and thereby parses the audio watermark.
- The parsing terminal needs to preprocess the second target frame to prevent interference from the energy change of the next frame, so the energy on both sides of the second target frame is removed.
- The length removed can be determined by those skilled in the art according to actual needs, for example 8 sampling points cut from the front and from the back. It can also be determined by the parsing terminal according to preset logic, for example by adjusting the cut length according to the length of the detection period,
- which is not limited in the embodiments of the present application. The preset length is cut from the front and back of the second target frame and only the middle part is retained, which eliminates the interference caused by the offset.
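The trimming step can be sketched as follows, using the example value of 8 sampling points cut from each side. The function name `trim_frame` is an assumption for illustration.

```python
def trim_frame(samples, cut=8):
    """Drop `cut` samples from the front and back, keeping only the middle part."""
    if cut < 0:
        raise ValueError("cut must be non-negative")
    if len(samples) <= 2 * cut:
        return []                     # nothing would remain after trimming
    return samples[cut:len(samples) - cut]


frame = list(range(32))
print(trim_frame(frame))  # samples 8..23 remain
```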
- the parsing terminal parses the audio watermark from the first audio.
- the parsing terminal verifies the audio watermark.
- the first sequence embedded in the audio watermark contains a check digit
- the parsing terminal verifies the integrity of the first sequence according to the check digit to ensure that the watermark is parsed correctly
- the first sequence is "11101"
- the parity check method is agreed between the playback terminal and the parsing terminal
- the last digit in the first sequence is the check digit
- when the last digit is 1, the first sequence contains an odd number of 1s, not counting the check digit.
- the parsing terminal determines, according to the last digit of the first sequence, whether the number of 1s in the parsed first sequence is odd, and thereby whether the current first sequence has been parsed completely.
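The parity check described above can be sketched in a few lines. The function name is hypothetical, and the rule that a check digit of 0 means an even number of 1s is an assumption; the text only states the case where the check digit is 1.

```python
def parity_ok(sequence: str) -> bool:
    """Return True if the trailing check digit matches the parity of the data digits."""
    if len(sequence) < 2 or set(sequence) - {"0", "1"}:
        return False
    data, check = sequence[:-1], sequence[-1]
    ones_are_odd = data.count("1") % 2 == 1
    # Check digit 1 <=> odd number of 1s in the data digits (0 <=> even is assumed).
    return ones_are_odd == (check == "1")
```

With the example sequence from the text, `parity_ok("11101")` holds because the data digits "1110" contain three 1s.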
- the parsing terminal determines, from the audio watermarks parsed by multiple watermark detection cycles, the one with the highest repetition rate as the watermark of the first audio.
- the first audio includes multiple groups of a first target frame followed by a second target frame, where each first target frame and its second target frame form a watermark detection period, and the same audio watermark is embedded in every watermark detection period.
- some parsing errors may occur in the parsing terminal, resulting in that not all audio watermarks obtained by the watermark detection cycle parsing are the same sequence.
- an erroneously parsed audio watermark is always random and non-repetitive, so the one with the highest repetition rate among the audio watermarks parsed in multiple watermark detection cycles can be determined as the correct audio watermark. This multi-cycle decision, combined with the aforementioned parity check, allows the correct watermark embedded in the first audio to be parsed accurately and further prevents erroneous parsing by the parsing terminal.
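The multi-cycle decision can be illustrated with a simple frequency count. This is a sketch under the assumption that each detection cycle yields its parsed sequence as a string; the function name is hypothetical.

```python
from collections import Counter


def most_repeated(watermarks):
    """Pick the sequence that was parsed most often across detection cycles."""
    if not watermarks:
        return None
    value, _count = Counter(watermarks).most_common(1)[0]
    return value
```

In practice the candidates could first be filtered with the parity check, so that only sequences that pass it take part in the vote.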
- the first sequence may be converted into decimal as required, for example, the first sequence is converted from binary to decimal, and finally the number 14 is obtained.
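As a small worked example of the conversion, assuming the trailing check digit is dropped before converting (which is consistent with "11101" yielding 14, since binary 1110 is 14), a hypothetical helper might look like this:

```python
def sequence_to_decimal(sequence: str, has_check_digit: bool = True) -> int:
    """Convert the data digits of a binary sequence to a decimal integer."""
    payload = sequence[:-1] if has_check_digit else sequence
    return int(payload, 2)
```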
- the verification of the audio watermark in the first audio is finally completed.
- the parsing terminal determines the playing terminal according to the audio watermark.
- because the audio watermark is associated with the playback terminal, the audio watermark parsed from the first audio reveals which playback terminal added it to the first audio.
- the audio content played by the playback terminals of site A, site B, and site C is exactly the same, but each playback terminal embeds a different audio watermark when playing it; by parsing the audio watermark, the parsing terminal can tell which site's playback terminal is playing the first audio, so that the source of the first audio can be traced.
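Tracing then reduces to a lookup from the parsed watermark to the terminal it was assigned to. The registry below is purely illustrative: the watermark values for sites B and C and all names are assumptions, not values from the source.

```python
# Hypothetical registry: one distinct watermark sequence per playback terminal.
SITE_BY_WATERMARK = {
    "11101": "site A",
    "10100": "site B",
    "01001": "site C",
}


def trace_source(watermark, registry=SITE_BY_WATERMARK):
    """Return the site whose playback terminal embedded `watermark`, or None."""
    return registry.get(watermark)
```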
- the audio watermark adding method and the audio watermark parsing method provided by the embodiments of the present application can be used in various usage scenarios.
- the usage scenarios of the methods provided by the embodiments of the present application are described below with reference to the accompanying drawings.
- the architecture in this embodiment is shown in FIG. 13.
- a service management center (SMC) is added, and the following steps are performed.
- the SMC sends an audio watermark to the playback terminal.
- the number of playback terminals may be multiple, and the audio watermark sent by the SMC is in one-to-one correspondence with each playback terminal, and is used to uniquely mark each playback terminal.
- the playback terminal stores the audio watermark locally.
- each playback terminal after acquiring the audio watermark, stores the audio watermark locally, so that the audio watermark is embedded in the audio stream when the real-time audio stream is acquired later.
- the media center sends the first audio to the playback terminal.
- the audio stream of the media center may be generated by a playback terminal of a certain venue and sent to the media center MCU, and then the media center sends the audio stream to the playback terminals of other venues in real time.
- the playback terminal embeds an audio watermark in the first audio in real time.
- the playback terminal embeds the locally stored audio watermark into the first audio in real time by using the audio watermark adding method provided in any one of the foregoing embodiments.
- the playback terminal plays the first audio.
- the first audio played by the playback terminal is embedded with an audio watermark, and the entire watermark embedding process is performed in real time without affecting the live broadcast effect of the first audio.
- according to the audio watermark, the played first audio can be traced back to its source, that is, the playback terminal.
- the parsing terminal obtains the first audio.
- the parsing terminal may acquire the first audio through a digital channel, or may acquire the first audio through an air channel.
- the parsing terminal can perform parsing.
- the parsing terminal parses the audio watermark from the first audio.
- the parsing terminal parses the audio watermark from the first audio by using the audio watermark parsing method provided by any one of the above embodiments. For details, refer to the description above, which is not repeated here.
- the parsing terminal determines the playback terminal according to the audio watermark.
- because the audio watermark is associated with the playback terminal, the audio watermark parsed from the first audio reveals which playback terminal added it to the first audio. Thus, the traceability of the first audio is realized.
- the SMC allocates different site identifiers (that is, audio watermarks) to the playback terminals where site A, site B, and site C are located, respectively.
- the site identifiers are associated with the playback terminals of the sites and are used to uniquely identify the playback terminal of each venue.
- the playback terminals of venue A, venue B, and venue C obtain the live audio stream from the media center MCU, execute the audio watermark adding method provided by the embodiment of the present application, and add their respective venue identifiers as watermarks to the played audio stream in real time, so that an audio watermark is embedded in the audio played by the playback terminal of each venue.
- the parsing terminal can determine, from the venue identifier recorded in the watermark information in the audio stream, in which venue the audio was transcribed by an audience member. Thus, the traceability of the audio watermark is realized.
- the user sends on-demand information to the cloud server through the terminal to order the audio or video content to be watched, and the cloud server sends the on-demand content to the user terminal. Specific steps are as follows.
- the user terminal generates on-demand information according to the on-demand content selected by the user.
- the user selects audio or video content to be on-demand through the interactive interface of the user terminal, and generates on-demand information, where the on-demand information is used to record the audio or video content on demand by the user.
- the user terminal sends the on-demand information to the cloud server.
- the user terminal sends the on-demand information to the cloud server, so that the cloud server knows the content that the user needs to on-demand.
- the cloud server acquires the target content on demand by the user according to the on demand information.
- the cloud server acquires the target content requested by the user from the database according to the requested information of the user.
- the cloud server generates an audio watermark according to the terminal identifier of the user terminal.
- the user terminal is an audio or video playback terminal
- the audio watermark is associated with the user terminal and is used to uniquely identify the user terminal. The audio watermark is thus obtained.
- the cloud server embeds the audio watermark into the target content.
- the target content may be audio or video; if it is video, the audio watermark is embedded in the audio content of the video. The cloud server may embed the audio watermark in the target content by using the audio watermark adding method provided in the embodiments of the present application.
- the step of embedding the watermark can be performed by the cloud server, or the cloud server can send the audio watermark to the user terminal for the user terminal to perform the embedding.
- the cloud server can embed an audio watermark in the audio of the target content in real time while transmitting the target content to the user terminal, thereby improving work efficiency.
- the cloud server sends the watermark content to the user terminal.
- the audio or video content in the watermark content is the content ordered by the user of the user terminal, and an audio watermark has been added to the watermark content.
- the user terminal plays the watermark content.
- the parsing terminal obtains the watermark content.
- the watermark content may be watermark-embedded audio, or a video whose audio is embedded with a watermark; the parsing terminal may obtain the watermark content through a digital channel or through an air channel.
- the parsing terminal can parse the first audio in the watermark content obtained in either of the two ways.
- the parsing terminal parses the audio watermark from the first audio.
- the parsing terminal parses the audio watermark from the first audio by using the audio watermark parsing method provided by any one of the above embodiments. For details, refer to the description above, which is not repeated here.
- the parsing terminal determines the user terminal according to the audio watermark.
- because the audio watermark is associated with the user terminal, the audio watermark parsed from the first audio reveals which user terminal it was added for. Thus, the traceability of the first audio is realized.
- the watermark adding method and the watermark parsing method provided in the embodiments of the present application can be applied to various scenarios that require audio watermarking and parsing.
- the above two scenarios are only examples and do not constitute a limitation on the usage scenarios of the embodiments of the present application.
- the above method may be implemented by one entity device, or jointly implemented by multiple entity devices, or may be a logic function module in one entity device, which is not specifically limited in this embodiment of the present application.
- FIG. 15 is a schematic diagram of the hardware structure of an electronic device provided by an embodiment of the present application; the electronic device may be a playback terminal or a parsing terminal in the embodiment of the present invention, and the electronic device includes at least one processor 1501, a communication line 1502, and a memory 1503 and at least one communication interface 1504.
- the processor 1501 may be a general-purpose central processing unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits used to control the execution of the programs of the solution of the present application.
- Communication line 1502 may include a path to communicate information between the components described above.
- Communication interface 1504 uses any transceiver-like device to communicate with other devices or communication networks, such as Ethernet, a radio access network (RAN), or a wireless local area network (WLAN).
- Memory 1503 may be a read-only memory (ROM) or another type of static storage device that can store static information and instructions, a random access memory (RAM) or another type of dynamic storage device that can store information and instructions, an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, etc.), a magnetic disk storage medium or another magnetic storage device, or any other medium that can carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, without being limited thereto.
- the memory may exist independently and be connected to the processor through communication line 1502. The memory can also be integrated with the processor.
- the memory 1503 is used for storing computer-executed instructions for executing the solution of the present application, and the execution is controlled by the processor 1501 .
- the processor 1501 is configured to execute the computer-executed instructions stored in the memory 1503, thereby implementing the audio watermark adding method or the audio watermark parsing method provided by the embodiments of the present application.
- the computer-executed instructions in the embodiment of the present application may also be referred to as application code, which is not specifically limited in the embodiment of the present application.
- the processor 1501 may include one or more CPUs, such as CPU0 and CPU1 in FIG. 15 .
- the electronic device may include multiple processors, such as the processor 1501 and the processor 1505 in FIG. 15 .
- processors can be a single-core (single-CPU) processor or a multi-core (multi-CPU) processor.
- a processor herein may refer to one or more devices, circuits, and/or processing cores for processing data (eg, computer program instructions).
- the electronic device may further include an output device 1505 and an input device 1506 .
- the output device 1505 is in communication with the processor 1501 and can display information in a variety of ways.
- the output device 1505 may be a liquid crystal display (LCD), a light emitting diode (LED) display device, a cathode ray tube (CRT) display device, a projector, or the like.
- Input device 1506 is in communication with processor 1501 and can receive user input in a variety of ways.
- the input device 1506 may be a mouse, a keyboard, a touch screen device, a sensor device, or the like.
- the above-mentioned electronic device may be a general-purpose device or a special-purpose device.
- the electronic device may be a server, a wireless terminal device, an embedded device, or a device with a similar structure in FIG. 15 .
- the embodiments of the present application do not limit the type of the electronic device.
- the electronic device may be divided into functional units according to the foregoing method examples.
- each functional unit may be divided corresponding to each function, or two or more functions may be integrated into one processing unit.
- the above-mentioned integrated units may be implemented in the form of hardware, or may be implemented in the form of software functional units. It should be noted that the division of units in the embodiments of the present application is illustrative, and is only a logical function division, and other division methods may be used in actual implementation.
- FIG. 16 shows a schematic structural diagram of a playback terminal provided by an embodiment of the present application.
- the playback terminal provided by the embodiment of the present application includes:
- an acquisition unit 1601, configured to acquire the first audio in real time
- An execution unit 1602 configured to embed an audio watermark in the first audio acquired by the acquisition unit 1601, where the audio watermark is associated with the playback terminal;
- the playing unit 1603 is configured to play the first audio embedded with the audio watermark by the executing unit 1602.
- execution unit 1602 is further configured to:
- after the first target frame, determine a second target frame that satisfies the second preset condition, where the first target frame is used to mark the second target frame;
- the audio watermark is embedded in the second target frame.
- execution unit 1602 is further configured to:
- the audio frame whose maximum value of the low-frequency part is in the first interval is determined as the first target frame; or,
- the sampling rate of the first audio is smaller than the first threshold, it is determined that the audio frame containing the first characteristic sound is the first target frame.
- the execution unit 1602 is further configured to:
- a sync frame marker is added to the first target frame.
- execution unit 1602 is further configured to:
- the first sampling point is a sampling point of the intermediate frequency part
- the energy value of the first sampling point is increased, so that the ratio of the energy value of the first sampling point to the energy value of the low frequency part is greater than or equal to the second threshold.
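The synchronization frame marker step can be sketched as computing how much the intermediate-frequency sampling point must be amplified. This is an illustration only: it assumes energy is proportional to the square of amplitude, and the function and parameter names are hypothetical.

```python
import math


def sync_marker_gain(mid_energy, low_energy, second_threshold):
    """Amplitude gain that lifts mid_energy / low_energy to at least second_threshold."""
    if mid_energy <= 0 or low_energy <= 0:
        raise ValueError("energies must be positive")
    ratio = mid_energy / low_energy
    if ratio >= second_threshold:
        return 1.0  # the ratio already satisfies the threshold; leave the point unchanged
    # Energy scales with gain squared, so take the square root of the required energy factor.
    return math.sqrt(second_threshold / ratio)
```

A real implementation would probably add a small safety margin to the gain so that rounding does not leave the ratio just under the threshold.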
- the execution unit 1602 is further configured to:
- an audio frame containing the first characteristic sound is determined as the first target frame.
- execution unit 1602 is further configured to:
- the target frame whose energy value of the intermediate frequency part is greater than or equal to the third threshold and less than the fourth threshold is the second target frame.
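The second-target-frame condition can be sketched as a scan over per-frame energies. The sketch assumes the intermediate-frequency energy of every frame has already been computed into a list; the names are hypothetical.

```python
def find_second_target_frame(mid_energies, first_index, third_threshold, fourth_threshold):
    """Return the index of the first frame after `first_index` whose
    intermediate-frequency energy lies in [third_threshold, fourth_threshold)."""
    for i in range(first_index + 1, len(mid_energies)):
        if third_threshold <= mid_energies[i] < fourth_threshold:
            return i
    return None  # no suitable frame in this stretch of audio
```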
- execution unit 1602 is further configured to:
- the first sequence corresponding to the audio watermark includes at least one element
- execution unit 1602 is further configured to:
- the analysis terminal provided by the embodiment of the present application includes:
- the obtaining unit 1701 is used to obtain the first audio, the audio watermark is embedded in the first audio, and the audio watermark is associated with the playback terminal, and the playback terminal is used to embed the audio watermark in the first audio in real time;
- parsing unit 1702 for parsing the audio watermark from the first audio acquired by the acquiring unit 1701;
- the execution unit 1703 is configured to determine the playback terminal according to the audio watermark parsed by the parsing unit 1702 .
- the parsing unit 1702 is further configured to:
- the parsing unit 1702 is further configured to:
- a target frame containing a first characteristic sound and the duration of the first characteristic sound is greater than or equal to a preset time is used as the first target frame.
- the parsing unit 1702 is further configured to:
- the first audio is detected by sliding backwards from the initial target frame, so as to obtain the intermediate frequency part and the low frequency part in each sliding window.
- the frame where the sliding window with the largest second ratio is obtained is the first target frame.
- the first target frame includes a synchronization frame mark
- the parsing unit 1702 is further configured to:
- the frame where the sliding window with the largest second ratio is located is the first target frame.
- the parsing unit 1702 is further configured to:
- the target frame obtained from the candidate target frame where the energy ratio of the partial energy values in different time domains and/or different frequency domains is greater than or equal to the fifth threshold is the second target frame.
- the parsing unit 1702 is further configured to:
- the first audio includes a plurality of watermark detection periods, wherein each of the watermark detection periods parses out one audio watermark respectively, and the parsing unit 1702 is also used for:
- the one with the highest repetition rate is determined as the watermark of the first audio.
- the functions described in the present invention may be implemented in hardware, software, firmware, or any combination thereof.
- the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium.
- Computer-readable media includes both computer storage media and communication media including any medium that facilitates transfer of a computer program from one place to another.
- a storage medium can be any available medium that can be accessed by a general purpose or special purpose computer.
- the disclosed communication method, relay device, host base station, and computer storage medium may be implemented in other ways.
- the apparatus embodiments described above are only illustrative.
- the division of the units is only a logical function division. In actual implementation, there may be other division methods.
- multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
- the shown or discussed mutual coupling or direct coupling or communication connection may be through some interfaces, indirect coupling or communication connection of devices or units, and may be in electrical, mechanical or other forms.
- the units described as separate components may or may not be physically separated, and components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution in this embodiment.
- each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.
- the above-mentioned integrated units may be implemented in the form of hardware, or may be implemented in the form of software functional units.
- the integrated unit if implemented in the form of a software functional unit and sold or used as an independent product, may be stored in a computer-readable storage medium.
- the technical solutions of the present application, in essence, or the part that contributes to the prior art, or all or part of the technical solutions, can be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present application.
- the aforementioned storage media include various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
Landscapes
- Engineering & Computer Science (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Physics & Mathematics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Health & Medical Sciences (AREA)
- Acoustics & Sound (AREA)
- Computational Linguistics (AREA)
- Human Computer Interaction (AREA)
- Theoretical Computer Science (AREA)
- Computer Security & Cryptography (AREA)
- Software Systems (AREA)
- Technology Law (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computer Hardware Design (AREA)
- Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)
- Editing Of Facsimile Originals (AREA)
Abstract
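An audio watermark adding method, including: a playback terminal acquires first audio in real time (202); the playback terminal embeds an audio watermark in the first audio (203), the audio watermark being associated with the playback terminal; and the playback terminal plays the first audio embedded with the audio watermark (204). An audio watermark parsing method, a device, and a medium are also provided. In a scenario where audio is played in real time, the playback terminal adds the audio watermark to the audio stream in real time, so that a device that later parses the watermark can determine the playback terminal according to the audio watermark, which makes it easier to trace the source after the first audio has been transcribed.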
Description
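This application claims priority to Chinese patent application No. 202011066454.0, filed with the China Patent Office on September 30, 2020 and entitled "Audio watermark adding and parsing method, device, and medium", which is incorporated herein by reference in its entirety.
This application relates to the field of electronics, and in particular to an audio watermark adding and parsing method, a device, and a medium.
Watermarking is a common image processing technique. Marking an important image with a watermark makes it possible to look up the source of the image from the watermark as the image spreads, so that people are deterred from casually photographing and distributing the image, and accountability can be traced.
There is currently a similar need in the audio field. For example, in remote audio and video conferences, conference content may leak, disclosing commercial or personal confidential information and causing adverse effects. The leak is typically made with a device such as a mobile phone or a voice recorder: someone covertly films or records at a remote terminal and passes the recorded audio or video file to others, and the file eventually spreads on the Internet.
Audio watermark adding methods in the prior art mainly obtain the audio watermark by processing the audio offline after the fact. In a live conference scenario, they cannot effectively process an online real-time audio stream.
Therefore, the above problems in the prior art remain to be solved.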
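Summary of the Invention
Embodiments of this application provide an audio watermark adding and parsing method, a device, and a medium, which are used to solve the problem of adding an audio watermark in real time.
In view of this, a first aspect of the embodiments of this application provides an audio watermark adding method, including: a playback terminal acquires first audio in real time; the playback terminal embeds an audio watermark in the first audio, the audio watermark being associated with the playback terminal; and the playback terminal plays the first audio embedded with the audio watermark.
In this embodiment, the playback terminal acquires the first audio in real time, embeds in it an audio watermark associated with the playback terminal, and plays the first audio embedded with the audio watermark. Thus, in a scenario where audio is played in real time, the playback terminal adds the audio watermark to the audio stream in real time, so that a device that later parses the watermark can determine the playback terminal according to the audio watermark, which makes it easier to trace the source after the first audio has been transcribed.
Optionally, the playback terminal embedding the audio watermark in the first audio includes: the playback terminal determines, in the first audio, a first target frame that satisfies a first preset condition; the playback terminal determines, after the first target frame, a second target frame that satisfies a second preset condition, the first target frame being used to mark the second target frame; and the playback terminal embeds the audio watermark in the second target frame.
In this embodiment, while processing the first audio in real time, the playback terminal determines the first target frame according to the first preset condition, then determines the second target frame according to the second preset condition, and finally embeds the audio watermark in the second target frame. In this way, the playback terminal can accurately find a suitable position in the first audio for embedding the audio watermark during real-time processing.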
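Optionally, the playback terminal determining, in the first audio, the first target frame that satisfies the first preset condition includes: when the sampling rate of the first audio is greater than or equal to a first threshold, the playback terminal determines an audio frame whose low-frequency-part maximum value lies within a first interval as the first target frame; or, when the sampling rate of the first audio is less than the first threshold, the playback terminal determines an audio frame containing a first characteristic sound as the first target frame.
In this embodiment, when the sampling rate of the first audio is greater than or equal to the first threshold, the playback terminal determines an audio frame whose low-frequency-part maximum value lies within the first interval as the first target frame; when the sampling rate of the first audio is less than the first threshold, the playback terminal determines the first target frame by identifying the first characteristic sound, and no synchronization frame marker needs to be embedded. This ensures that, whether the sampling rate of the first audio is above or below the first threshold, a first target frame that marks the watermark embedding position can be found in the first audio.
Optionally, when the sampling rate of the first audio is greater than or equal to the first threshold, after the playback terminal takes the audio frame whose low-frequency-part maximum value lies within the first interval as the first target frame, the method further includes: the playback terminal adds a synchronization frame marker to the first target frame.
In this embodiment, when the sampling rate of the first audio is greater than or equal to the first threshold, a synchronization frame marker is added to the first target frame, so that the parsing terminal can later use the marker to quickly locate the first target frame during parsing.
Optionally, the playback terminal adding the synchronization frame marker to the first target frame includes: the playback terminal obtains a first sampling point, the first sampling point being a sampling point of the intermediate-frequency part; and the playback terminal raises the energy value of the first sampling point, so that the ratio of the energy value of the first sampling point to the energy value of the low-frequency part is greater than or equal to a second threshold.
In this embodiment, the playback terminal determines in real time a target frame in the first audio that meets the first preset condition as the first target frame or the second target frame, and then raises the energy value of the first sampling point in the intermediate-frequency part of the first target frame or the second target frame, so that the ratio of the energy value of the first sampling point to the energy value of the low-frequency part is greater than or equal to a preset ratio, thereby adding the synchronization frame marker.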
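Optionally, when the sampling rate of the first audio is less than the first threshold, the playback terminal determining the audio frame containing the first characteristic sound as the first target frame includes: when the first characteristic sound is detected and its duration is greater than or equal to a preset time, the playback terminal determines the audio frame containing the first characteristic sound as the first target frame.
In this embodiment, the first characteristic sound may be a human voice; alternatively, while a human voice is present, when a specific utterance is detected, the target frame containing that utterance is determined as the first target frame. This ensures that the subsequent watermark is embedded in target frames that carry speech information.
Optionally, the playback terminal determining, after the first target frame, the second target frame that satisfies the second preset condition includes: the playback terminal determines a target frame whose intermediate-frequency-part energy value is greater than or equal to a third threshold and less than a fourth threshold as the second target frame.
In this embodiment, the second target frame is a target frame located after the first target frame. Because the embedding in the second target frame is performed on the first audio in real time, it cannot be guaranteed that every frame after the first target frame in the first audio is suitable as a second target frame. The condition of a candidate frame therefore has to be checked, and a target frame is taken as the second target frame only when it satisfies the second preset condition.
Optionally, the playback terminal embedding the audio watermark in the second target frame includes: the playback terminal obtains a first sequence corresponding to the audio watermark, the first sequence including at least one element; the playback terminal obtains at least one second sampling point from the second target frame; and the playback terminal embeds the at least one element of the first sequence into the at least one second sampling point respectively, where one element of the first sequence corresponds to one second sampling point.
In this embodiment, to embed the audio watermark in the first audio in real time, the playback terminal searches in time order, after the first target frame, for a second target frame that meets the preset condition and embeds the audio watermark there. During embedding, it changes the energy ratio between the energy values of a sampling point in the second target frame in different time-domain and/or different frequency-domain parts, thereby embedding the audio watermark in real time. The embedded watermark is highly resistant to interference during audio transcription and can propagate through a digital channel or an air channel.
Optionally, the playback terminal adding the at least one element of the first sequence to the at least one second sampling point respectively includes: the playback terminal adjusts the energy ratio between the energy values of the second sampling point in different time-domain and/or different frequency-domain parts, where the energy ratio of one second sampling point is associated with one element of the first sequence.
In this embodiment, optionally, the playback terminal may adjust the energy ratio between the energy values of the second sampling point in different time-domain and/or different frequency-domain parts as follows: raise the energy value of the first half of a first sub-sampling point, so that the ratio of the energy value of the first half to that of the second half of the first sub-sampling point is greater than or equal to a fifth threshold, and record the first sub-sampling point as 1, where the first sub-sampling point is one of the at least one second sampling point; raise the energy value of the second half of a second sub-sampling point, so that the ratio of the energy value of the second half to that of the first half of the second sub-sampling point is greater than or equal to the fifth threshold, and record the second sub-sampling point as 0.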
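The half-versus-half energy rule above can be sketched in the time domain as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: a "sampling point" is modeled as a short list of amplitude samples, energy is the sum of squares, the fifth threshold is assumed to be greater than 1, and all function names and the 0.1% gain margin are hypothetical.

```python
import math


def energy(samples):
    """Sum of squared amplitudes."""
    return sum(x * x for x in samples)


def embed_bit(segment, bit, fifth_threshold):
    """Boost the first half (bit 1) or second half (bit 0) until its energy is
    at least fifth_threshold times the energy of the other half."""
    half = len(segment) // 2
    first, second = list(segment[:half]), list(segment[half:])
    strong, weak = (first, second) if bit == 1 else (second, first)
    e_strong, e_weak = energy(strong), energy(weak)
    if e_strong <= 0:
        raise ValueError("cannot boost a silent half")
    if e_strong < fifth_threshold * e_weak:
        # Small margin so rounding does not leave the ratio just under the threshold.
        gain = math.sqrt(fifth_threshold * e_weak / e_strong) * 1.001
        strong[:] = [x * gain for x in strong]
    return first + second


def decode_bit(segment, fifth_threshold):
    """Return 1, 0, or None when neither half dominates by the threshold."""
    half = len(segment) // 2
    e_first, e_second = energy(segment[:half]), energy(segment[half:])
    if e_first >= fifth_threshold * e_second:
        return 1
    if e_second >= fifth_threshold * e_first:
        return 0
    return None
```

Returning `None` for an undecided segment lets the caller discard the whole sequence instead of guessing a digit.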
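Optionally, after the playback terminal adds the at least one element of the first sequence to the at least one second sampling point respectively, the method further includes:
when the ratio of the high-energy part to the low-energy part in the second sampling point is less than the fifth threshold, raising the energy value of the high-energy part in the second sampling point.
Optionally, after the playback terminal obtains the first sequence corresponding to the audio watermark, the method further includes:
the playback terminal adds a check digit to the first sequence, the check digit being used to verify the transmission integrity of the first sequence.
A second aspect of the embodiments of this application provides an audio watermark parsing method, including: a parsing terminal acquires first audio, where an audio watermark is embedded in the first audio, the audio watermark is associated with a playback terminal, and the playback terminal is configured to embed the audio watermark in the first audio in real time; the parsing terminal parses the audio watermark from the first audio; and the parsing terminal determines the playback terminal according to the audio watermark.
With the audio watermark parsing method provided by the embodiments of this application, the parsing terminal acquires the first audio, in which the playback terminal has embedded in real time an audio watermark associated with that playback terminal, parses the audio watermark from the first audio, and determines the playback terminal according to the audio watermark. The parsing terminal can thus identify the playback terminal that added the audio watermark to the first audio.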
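Optionally, before the parsing terminal parses the audio watermark from the first audio, the method further includes: the parsing terminal determines a first target frame in the first audio that satisfies the first preset condition; and the parsing terminal determines, after the first target frame, a second target frame that satisfies the second preset condition. The parsing terminal parsing the audio watermark from the first audio includes: the parsing terminal parses the audio watermark from the second target frame.
In this embodiment, because the playback terminal adds the audio watermark to the first audio in real time, it cannot be guaranteed that every frame of the first audio meets the conditions for watermark embedding. A real-time embedding scheme therefore cannot embed according to a preset rule as offline watermark embedding does; instead, the first target frame and the second target frame are determined according to the first preset condition and the second preset condition respectively. Accordingly, the parsing terminal also needs to locate the first target frame and the second target frame under the same conditions when parsing.
Optionally, when the sampling rate of the first audio is less than the first threshold, the parsing terminal determining the first target frame in the first audio that satisfies the first preset condition includes: the parsing terminal determines, from the first audio, a target frame that contains a first characteristic sound whose duration is greater than or equal to a preset time as the first target frame.
In this embodiment, when the sampling rate of the first audio is greater than or equal to the first threshold, the parsing terminal determines an audio frame whose low-frequency-part maximum value lies within the first interval as the first target frame; when the sampling rate of the first audio is less than the first threshold, the parsing terminal determines the first target frame by identifying the first characteristic sound, and no synchronization frame marker needs to be parsed. This ensures that, whether the sampling rate of the first audio is above or below the first threshold, the parsing terminal can find the first target frame that marks the watermark embedding position.
Optionally, when the sampling rate of the first audio is greater than or equal to the first threshold, the parsing terminal determining the first target frame in the first audio that satisfies the first preset condition includes: the parsing terminal obtains, frame by frame, a first ratio of the energy value of the intermediate-frequency part to that of the low-frequency part of the first audio; when the parsing terminal obtains an initial target frame whose first ratio is greater than or equal to the second threshold, it slides a window onward from the initial target frame to examine the first audio, so as to obtain a second ratio of the intermediate-frequency-part energy value to the low-frequency-part energy value within each sliding window; and the parsing terminal takes the frame in which the sliding window with the largest second ratio is located as the first target frame.
In this embodiment, the parsing terminal obtains, frame by frame, the first ratio of the energy value of the intermediate-frequency sampling points to that of the low-frequency sampling points in each frame. When it finds a first sampling point whose first ratio is greater than or equal to the second threshold, it determines the target frame containing that first sampling point as the initial target frame. However, a frame has 2048 sampling points and the first sampling point is only some of them, so once a first sampling point that meets the first ratio has been found, there may be an offset between the first target frame in which the first sampling point actually lies and the initial target frame in which it currently appears. To solve this, the parsing terminal starts from the initial target frame, moves onward with a sliding window, and examines the first audio after the initial target frame, obtaining the second ratio of the intermediate-frequency-part energy value to the low-frequency-part energy value within each sliding window; the frame containing the sliding window with the largest second ratio is taken as the first target frame, which prevents parsing deviation.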
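The sliding-window refinement can be sketched as a search for the window offset with the largest ratio. This is only an illustration: `ratio_of` is a caller-supplied, hypothetical function that returns the intermediate-to-low-frequency energy ratio of a window (the source does not specify how the bands are computed), and `hop` and `max_shift` are assumed parameters. Only the frame length of 2048 sampling points comes from the text.

```python
def locate_first_target_frame(samples, start, ratio_of, frame_len=2048, hop=1, max_shift=None):
    """Slide a frame-sized window onward from `start` and return the offset
    of the window whose ratio is largest (None if no full window fits)."""
    if max_shift is None:
        max_shift = frame_len  # assumed: the true frame starts within one frame length
    best_offset, best_ratio = None, float("-inf")
    last = min(start + max_shift, len(samples) - frame_len)
    for offset in range(start, last + 1, hop):
        ratio = ratio_of(samples[offset:offset + frame_len])
        if ratio > best_ratio:
            best_offset, best_ratio = offset, ratio
    return best_offset
```

A larger `hop` trades alignment precision for fewer ratio evaluations.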
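Optionally, the first target frame includes a synchronization frame marker, and the parsing terminal taking the frame in which the sliding window with the largest second ratio is located as the first target frame includes: the parsing terminal obtains, from the sliding window with the largest second ratio, the first sampling point with the highest intermediate-frequency-part energy value; the parsing terminal obtains a third sampling point located a preset length before the first sampling point; the parsing terminal determines that the part where the ratio of the energy value of the first sampling point to the energy value of the third sampling point is greater than a seventh threshold is the synchronization frame marker; and the parsing terminal determines, according to the synchronization frame marker, that the frame in which the sliding window with the largest second ratio is located is the first target frame.
In this embodiment, the synchronization frame marker detection described above relies on the first ratio of the intermediate-frequency-part energy value to the low-frequency-part energy value of the first audio. In practice, however, the original content of the first audio (that is, the non-watermark content) may also contain places where the ratio of the intermediate-frequency-part energy value to the low-frequency-part energy value exceeds the first ratio, which would cause false detection of the synchronization frame marker. To address this, the parsing terminal determines the synchronization frame marker from the ratio between the first sampling point and the third sampling point a preset length before it, thereby avoiding such false detection.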
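The false-detection guard can be sketched as a comparison between the peak and a reference point a fixed distance earlier. The sketch assumes per-point intermediate-frequency energies are available as a list; the function name and the handling of a zero-energy reference are assumptions.

```python
def is_sync_marker(energies, first_index, preset_length, seventh_threshold):
    """True if the energy at `first_index` exceeds the energy `preset_length`
    points earlier by more than `seventh_threshold` times."""
    third_index = first_index - preset_length
    if third_index < 0:
        return False  # no earlier reference point to compare against
    reference = energies[third_index]
    if reference <= 0:
        # Assumed handling: any positive peak dominates a silent reference.
        return energies[first_index] > 0
    return energies[first_index] / reference > seventh_threshold
```

Content that is uniformly strong in the intermediate band fails this test, because the earlier reference point is then almost as energetic as the peak.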
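Optionally, the parsing terminal determining, after the first target frame, the second target frame that satisfies the second preset condition includes: the parsing terminal moves onward frame by frame from the first target frame and obtains candidate target frames in which the energy of the intermediate-frequency part is greater than or equal to the third threshold and less than the fourth threshold; and the parsing terminal takes, from the candidate target frames, a target frame in which the energy ratio between the energy values of different time-domain and/or different frequency-domain parts is greater than or equal to the fifth threshold as the second target frame.
In this embodiment, once the parsing terminal has detected the first target frame, it uses the first target frame as a positioning frame and continues to look for the second target frame located after it. The second target frame is embedded with the audio watermark and satisfies the second preset condition, so the parsing terminal can quickly find the second target frame after the first target frame according to that condition.
Optionally, the parsing terminal parsing the audio watermark from the second target frame includes: the parsing terminal obtains a second sampling point from the second target frame, the second sampling point being a sampling point in the second target frame whose energy ratio is greater than or equal to the fifth threshold; the parsing terminal obtains the energy ratio between the energy values of different time-domain and/or different frequency-domain parts of each second sampling point; and the parsing terminal obtains a first element associated with the energy ratio, the first element being one element of the first sequence recorded by the audio watermark.
In this embodiment, optionally, the first sequence contains a check digit. In that case, after the digits of the at least one second target frame are assembled into the first sequence in parsing order, the method further includes: determining, according to the check digit, whether the first sequence is complete; if so, converting the first sequence into decimal; if not, ignoring the first sequence.
Optionally, before the parsing terminal parses the audio watermark from the second target frame, the method further includes: the parsing terminal adjusts the value of a first length according to the duration of the second target frame, where the longer the second target frame, the larger the first length; and the parsing terminal removes the energy values over the first length at the head and at the tail of the second target frame respectively.
Optionally, the method includes multiple watermark detection periods, where one audio watermark is parsed in each watermark detection period, and the method further includes: determining, from the audio watermarks parsed in the multiple watermark detection periods, the one with the highest repetition rate as the watermark of the first audio.
In this embodiment, the first audio includes multiple watermark parsing periods, each containing one first target frame and one second target frame, and the same audio watermark is embedded in every watermark detection period. In actual parsing, the parsing terminal may make some parsing errors, so that not all watermark detection periods yield the same sequence. When parsing goes wrong, the erroneous audio watermarks obtained are always random and non-repeating, so the audio watermark with the highest repetition rate among those parsed in the multiple watermark detection periods can be determined to be the correct one. This multi-period decision allows the correct watermark embedded in the first audio to be parsed accurately and further guards against misparsing by the parsing terminal.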
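A third aspect of the embodiments of this application provides a playback terminal, including:
an acquisition unit, configured to acquire first audio in real time;
an execution unit, configured to embed an audio watermark in the first audio acquired by the acquisition unit, the audio watermark being associated with the playback terminal; and
a playing unit, configured to play the first audio in which the execution unit has embedded the audio watermark.
Optionally, the execution unit is further configured to:
determine, in the first audio, a first target frame that satisfies a first preset condition;
determine, after the first target frame, a second target frame that satisfies a second preset condition, the first target frame being used to mark the second target frame; and
embed the audio watermark in the second target frame.
Optionally, the execution unit is further configured to:
when the sampling rate of the first audio is greater than or equal to a first threshold, determine an audio frame whose low-frequency-part maximum value lies within a first interval as the first target frame; or
when the sampling rate of the first audio is less than the first threshold, determine an audio frame containing a first characteristic sound as the first target frame.
Optionally, when the sampling rate of the first audio is greater than or equal to the first threshold, the execution unit is further configured to:
add a synchronization frame marker to the first target frame.
Optionally, the execution unit is further configured to:
obtain a first sampling point, the first sampling point being a sampling point of the intermediate-frequency part; and
raise the energy value of the first sampling point, so that the ratio of the energy value of the first sampling point to the energy value of the low-frequency part is greater than or equal to a second threshold.
Optionally, when the sampling rate of the first audio is less than the first threshold, the execution unit is further configured to:
when the first characteristic sound is detected and its duration is greater than or equal to a preset time, determine the audio frame containing the first characteristic sound as the first target frame.
Optionally, the execution unit is further configured to:
determine a target frame whose intermediate-frequency-part energy value is greater than or equal to a third threshold and less than a fourth threshold as the second target frame.
Optionally, the execution unit is further configured to:
obtain a first sequence corresponding to the audio watermark, the first sequence including at least one element;
obtain at least one second sampling point from the second target frame; and
embed the at least one element of the first sequence into the at least one second sampling point respectively, where one element of the first sequence corresponds to one second sampling point.
Optionally, the execution unit is further configured to:
adjust the energy ratio between the energy values of the second sampling point in different time-domain and/or different frequency-domain parts, where the energy ratio of one second sampling point is associated with one element of the first sequence.
For the beneficial effects of the third aspect of the embodiments of this application, refer to the first aspect above; details are not repeated here.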
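A fourth aspect of the embodiments of this application provides a parsing terminal, including:
an acquisition unit, configured to acquire first audio, where an audio watermark is embedded in the first audio, the audio watermark is associated with a playback terminal, and the playback terminal is configured to embed the audio watermark in the first audio in real time;
a parsing unit, configured to parse the audio watermark from the first audio acquired by the acquisition unit; and
an execution unit, configured to determine the playback terminal according to the audio watermark parsed by the parsing unit.
Optionally, the parsing unit is further configured to:
determine a first target frame in the first audio that satisfies a first preset condition;
determine, after the first target frame, a second target frame that satisfies a second preset condition; and
parse the audio watermark from the second target frame.
Optionally, when the sampling rate of the first audio is less than a first threshold, the parsing unit is further configured to:
determine, from the first audio, a target frame that contains a first characteristic sound whose duration is greater than or equal to a preset time as the first target frame.
Optionally, when the sampling rate of the first audio is greater than or equal to the first threshold, the parsing unit is further configured to:
obtain, frame by frame, a first ratio of the energy value of the intermediate-frequency part to that of the low-frequency part of the first audio;
when an initial target frame whose first ratio is greater than or equal to a second threshold is obtained, slide a window onward from the initial target frame to examine the first audio, so as to obtain a second ratio of the intermediate-frequency-part energy value to the low-frequency-part energy value within each sliding window; and
take the frame in which the sliding window with the largest second ratio is located as the first target frame.
Optionally, the first target frame includes a synchronization frame marker, and the parsing unit is further configured to:
obtain, from the sliding window with the largest second ratio, the first sampling point with the highest intermediate-frequency-part energy value;
obtain a third sampling point located a preset length before the first sampling point;
determine that the part where the ratio of the energy value of the first sampling point to the energy value of the third sampling point is greater than a seventh threshold is the synchronization frame marker; and
determine, according to the synchronization frame marker, that the frame in which the sliding window with the largest second ratio is located is the first target frame.
Optionally, the parsing unit is further configured to:
move onward frame by frame from the first target frame and obtain candidate target frames in which the energy of the intermediate-frequency part is greater than or equal to a third threshold and less than a fourth threshold; and
take, from the candidate target frames, a target frame in which the energy ratio between the energy values of different time-domain and/or different frequency-domain parts is greater than or equal to a fifth threshold as the second target frame.
Optionally, the parsing unit is further configured to:
obtain a second sampling point from the second target frame, the second sampling point being a sampling point in the second target frame whose energy ratio is greater than or equal to the fifth threshold;
obtain the energy ratio between the energy values of different time-domain and/or different frequency-domain parts of each second sampling point; and
obtain a first element associated with the energy ratio, the first element being one element of the first sequence recorded by the audio watermark.
Optionally, the first audio includes multiple watermark detection periods, where one audio watermark is parsed in each watermark detection period, and the parsing unit is further configured to:
determine, from the audio watermarks parsed in the multiple watermark detection periods, the one with the highest repetition rate as the watermark of the first audio.
For the beneficial effects of the fourth aspect of the embodiments of this application, refer to the second aspect above; details are not repeated here.
A fifth aspect of the embodiments of this application provides an electronic device, including an interaction apparatus, an input/output (I/O) interface, a processor, and a memory, where the memory stores program instructions; the interaction apparatus is configured to obtain an operation instruction entered by a user; and the processor is configured to execute the program instructions stored in the memory to perform the method according to any optional implementation of the first aspect or the second aspect above.
A sixth aspect of the embodiments of this application provides a computer-readable storage medium, including instructions that, when run on a computer device, cause the computer device to perform the method according to any optional implementation of the first aspect or the second aspect above.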
图1为本申请实施例所提供的音频水印添加方法的一个使用场景示意图;
图2为本申请实施例所提供的音频水印添加方法的一个实施例的示意图;
图3为本申请实施例所提供的音频水印添加方法的另一个实施例的示意图;
图4为本申请实施例所提供的音频水印添加方法的另一个实施例的示意图;
图5a为本申请实施例所提供的音频水印添加方法的另一个实施例的示意图;
图5b为本申请实施例所提供的音频水印添加方法的另一种实现方式的示意图;
图5c为本申请实施例所提供的音频水印添加方法的另一种实现方式的示意图;
图5d为本申请实施例所提供的音频水印添加方法的另一种实现方式的示意图;
图5e为本申请实施例所提供的音频水印添加方法的另一种实现方式的示意图;
图5f为本申请实施例所提供的音频水印添加方法的另一种实现方式的示意图;
图5g为本申请实施例所提供的音频水印添加方法的另一种实现方式的示意图;
图5h为本申请实施例所提供的音频水印添加方法的另一种实现方式的示意图;
图5i为本申请实施例所提供的音频水印添加方法的另一种实现方式的示意图;
图6为本申请实施例所提供的音频水印解析方法的一个实施例的示意图;
图7为本申请实施例所提供的音频水印解析方法的另一个实施例的示意图;
图8为本申请实施例所提供的音频水印解析方法的另一个实施例的示意图;
图9a为本申请实施例所提供的音频水印解析方法的另一种实现方式的示意图;
图9b为本申请实施例所提供的音频水印解析方法的另一种实现方式的示意图;
图10为本申请实施例所提供的音频水印解析方法的另一个实施例的示意图;
图11为本申请实施例所提供的音频水印解析方法的另一个实施例的示意图;
图12a为本申请实施例所提供的音频水印解析方法的另一种实现方式的示意图;
图12b为本申请实施例所提供的音频水印解析方法的另一种实现方式的示意图;
图13为本申请实施例一种使用场景的示意图;
图14为本申请实施例一种使用场景的示意图;
图15为本申请实施例所提供的一种电子设备的示意图;
图16为本申请实施例所提供的播放终端的示意图;
图17为本申请实施例所提供的解析终端的示意图。
本发明实施例提供一种音频水印添加、解析方法、设备及介质,能够解决音频水印的实时添加问题。
为了使本技术领域的人员更好地理解本申请方案,下面将结合本申请实施例中的附图,对本申请实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例仅仅是本申请一部分的实施例,而不是全部的实施例。基于本申请中的实施例,本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他实施例,都应当属于本申请保护的范围。
本申请的说明书和权利要求书及上述附图中的术语“第一”、“第二”、“第三”“第四”等(如果存在)是用于区别类似的对象,而不必用于描述特定的顺序或先后次序。应该理解这样使用的数据在适当情况下可以互换,以便这里描述的实施例能够以除了在这里图示或描述的内容以外的顺序实施。此外,术语“包括”和“具有”以及他们的任何变形,意图在于覆盖不排他的包含,例如,包含了一系列步骤或单元的过程、方法、系统、产品或设备不必限于清楚地列出的那些步骤或单元,而是可包括没有清楚地列出的或对于这些过程、方法、 产品或设备固有的其它步骤或单元。
水印技术是常用的图片处理方法,通过在重要的图片上标记水印,从而在图片传播的过程中能够根据该水印查询到图片的源头,使得传播者不敢轻易偷拍和传播图片,起到了震慑作用和追溯问责的能力。
当前,在音频领域也有类似的需求,例如在远程音视讯会议中,会出现会议内容外流的情况,导致商业或者个人机密信息泄露,并造成不良影响。其主要手段是通过手机或者录音笔等设备,在某一远程终端侧,进行偷拍或者偷录,并将偷录的音视频文件,外传给其他人员,最终在互联网上传播,造成不良影响。
当前,音频水印添加方法主要是通过后期对音频进行离线处理以得到音频水印,在直播会议的场景下,对于在线的实时音频流,无法进行有效处理。
例如,对本申请实施例所提供方法的应用场景进行说明。
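Referring to FIG. 1, the embodiments of this application may be applied to a conference scenario that includes venue A 101, venue B 102, venue C 103, and a media center 104 that routes audio among the three venues. Venues A 101, B 102, and C 103 may be remote venues at different locations. In operation, for example, a delegate at venue A 101 speaks and the recording device at venue A 101 captures the speech; the communication device at venue A 101 then sends the real-time audio stream to the media center 104, which forwards it to venue B 102 and venue C 103. The communication devices at venues B 102 and C 103 receive the stream and play the real-time audio from venue A 101 through loudspeakers, which realizes a remote audio conference among the three venues.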
在上述工作过程中,会场A、会场B和会场C的听众均有可能偷偷录制所播放的音频并传播泄露,因此在溯源问责时,需要知晓该音频具体为哪一会场的观众偷录。
为了解决上述问题,本申请实施例提供一种音频水印添加方法,能够通过对实时播放的音频添加水印,解决音频的溯源问题。
以下结合附图对本申请实施例所提供的音频水印添加方法进行详细的说明。需要说明的是,上述应用场景只是一种举例,并不构成对本申请使用场景的限定,本申请实施例所提供的方法还可以应用于其他场景中,对此本申请实施例并不进行限定。
请参阅图2,如图2所示,本申请所提供音频水印添加方法的实施例包括以下步骤。
201、播放终端获取音频水印。
本实施例中,播放终端获取与该播放终端相关联的音频水印,在后续水印解析的过程中,解析终端能够通过该音频水印知晓,该音频水印是由该播放终端嵌入到音频中的。
可选地,该音频水印可以是预存在播放终端本地的,也可以是其他设备发送给播放终端的,例如服务管理中心(service management center,SMC)向多个不同的播放终端分别发送各自的音频水印。为便于说明,本实施例中以音频水印为数字14进行举例说明。当前播放终端获取SMC发送的音频水印“14”后,将该音频水印存储在本地,该数字14即为与当前播放终端相关联的音频水印。
202、播放终端实时获取第一音频。
本实施例中,该第一音频可以是播放终端从外部获取的,例如该播放终端可以是图1 中的会场B的播放终端,也可以是图1中会场C的播放终端,该播放终端从媒体中心实时获取第一音频。可选地,在其他的应用场景中,该第一音频也可以是播放终端从自身的内存中实时获取的,对此本申请实施例并不进行限定。
203、播放终端在第一音频中嵌入音频水印。
本实施例中,播放终端的音频水印“14”与播放终端相关联,从而通过该音频水印即可唯一地确定该播放终端。在具体工作过程中,播放终端在播放第一音频的过程中实时地将该音频水印嵌入第一音频中。
可选地,本申请实施例进一步提供一种在第一音频中嵌入音频水印的具体工作方式,为便于理解,以下结合附图进行详细说明。
请参阅图3,如图3所示,本申请所提供音频水印添加方法中在第一音频中嵌入音频水印的方法包括以下步骤。
301、播放终端在第一音频中确定满足第一预设条件的第一目标帧。
本实施例中,第一目标帧起到了标记的作用,在后续对音频水印进行解析的过程中,解析终端根据第一目标帧即可知晓第一目标帧之后存在音频水印,从而可以实现对音频水印的快速定位。
需要说明的是,由于第一目标帧是实时加入到第一音频中的,因此无法保证第一音频中的每一帧都适合作为第一目标帧,从而需要对第一目标帧的条件进行判断,只有当目标帧满足第一预设条件时,才将该目标帧作为第一目标帧。
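Optionally, the playback terminal may be provided with a synchronization frame condition detector, which may be a physical apparatus installed in the playback terminal or run logic stored in the playback terminal. The detector runs the method of step 301 above and thereby determines the first target frame that meets the first preset condition.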
需要进一步说明的是,如上所述,第一目标帧满足第一预设条件,在具体工作过程中,按照采样率大小的不同,第一预设条件并不相同。其中,当第一音频的采样率大于或等于第一阈值时,播放终端将低频部分最大值处于第一区间内的音频帧确定为第一目标帧。当第一音频的采样率小于第一阈值时,播放终端确定包含第一特征声音的音频帧为第一目标帧。为便于理解,以下分别对第一预设条件的两种情况进行详细说明。
一、当第一音频的采样率大于或等于第一阈值时,播放终端将低频部分最大值处于第一区间内的音频帧确定为第一目标帧。
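In this embodiment, when the sampling rate of the first audio is greater than or equal to the first threshold, the playback terminal identifies the first target frame by adding a synchronization frame marker to it. Because the marker is added by changing the energy values of the mid-frequency part of the first target frame, the low-frequency energy of the original audio in that frame must be neither too high nor too low. In practice, the low-frequency energy of the original audio is often too low, so the relative mid-frequency energy is also too low, and the parsing terminal cannot detect the synchronization frame marker after re-recording. Conversely, when the low-frequency energy of the original audio is too high, the mid-frequency energy becomes too high and popping occurs. Therefore, an audio frame whose low-frequency maximum lies within the first interval is determined as the first target frame.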
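The first interval may be expressed as (Tlow, Thigh). When determining the first target frame, the playback terminal takes the maximum energy Value of the low-frequency part of the first audio and checks whether Tlow < Value < Thigh holds; if so, the current frame is determined as the first target frame. Tlow is the lower energy bound of the first interval and Thigh is the upper bound. Their specific values may be set by a person skilled in the art as needed, for example with different threshold ranges for different recording environments such as a conference room, an auditorium, or an open office area. The low-energy threshold matters in particular, because it ensures that the synchronization frame marker achieves a good embedding strength in the first target frame in each environment.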
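The interval test above can be sketched as follows. This is a minimal illustration, assuming the low-frequency energies of a frame have already been computed; the function name and the sample thresholds are hypothetical and not taken from the application.

```python
def is_first_target_candidate(low_band_energies, t_low, t_high):
    """Return True when the peak low-frequency energy lies strictly inside (t_low, t_high)."""
    value = max(low_band_energies)  # "Value" in the text: maximum low-frequency energy
    return t_low < value < t_high
```

A frame whose low-frequency peak is too weak or too strong is skipped, and the terminal simply tests the next frame of the real-time stream.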
进一步地,当播放终端按照上述方式确定第一目标帧后,需要在目标帧中添加同步帧标记,以使得后续的解析终端能够按照该同步帧标记确定第一目标帧。同步帧标记的添加具体包括以下步骤。
1、播放终端从第一目标帧中获取第一采样点。
本实施例中,该第一采样点为第一目标帧的中频部分的采样点。
2、播放终端提升第一采样点的能量值,以使得第一采样点的能量值与低频部分能量值的比值大于或等于第二阈值。
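In this embodiment, the first sampling point is a sampling point in the mid-frequency part. After the playback terminal raises its energy value, the energy of the first sampling point stands out clearly relative to the sampling points of the low-frequency part: the ratio of the first sampling point's energy value to the low-frequency energy value is greater than or equal to the second threshold. When the parsing terminal later obtains the first target frame, it can detect the synchronization frame marker from the ratio between the energy values of the mid-frequency and low-frequency sampling points.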
为便于理解,基于上述步骤1-2所述的思路,以下提供一种同步帧标记加入的更具体的实现方式。
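In this embodiment, as shown in FIG. 4, applying a fast Fourier transform (FFT) to the audio signal of the first target frame yields a line chart of that signal in the frequency and time domains, from which the maximum energy of the low-frequency part can be read.
In this embodiment, as shown in FIG. 4, after the modification the energy value of the mid-frequency first sampling point 401 is markedly higher. Because the formula that modifies the mid-frequency energy value E(i)' includes the low-frequency energy maximum max_E1, the energy of the mid-frequency first sampling point rises markedly relative to the low-frequency energy. Note that taking 8 energy values as above is only a preferred example; a person skilled in the art may choose a different number of energy values as needed, and this is not limited in the embodiments of this application.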
3)、在中频部分取与上述8个能量点对称的点,进行相同的能量值提升操作。
本实施例中,由于傅里叶变换后的能量分布具有对称性,所以在与这8个点对称的部分取8个点,作相同的嵌入操作。
4)、频域信号做快速逆傅里叶变换(IFFT)。
本实施例中,在经过快速傅里叶变换之后,通过上述方式修改了第一目标帧中频部分的能量值,之后再对第一目标帧做快速逆傅里叶变换,从而得到了嵌入同步帧标记的时域信号,该时域信号即为已经嵌入同步帧标记的第一目标帧。
本实施例中,通过上述步骤1)-4)所述的方式,将第一目标帧中频部分第一采样点的能量值进行提升,以使得第一采样点的能量值与低频部分的能量值的比值超过预设的范围。以使得后续解析过程中,当解析终端获取到某一目标帧中频部分有一个采样点的能量值与低频部分能量值的比值大于预设值时,即可判断该目标帧为添加了同步帧标记的第一目标帧。从而实现了第一目标帧中同步帧标记的添加。
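The FFT, boost, mirror, and inverse-FFT sequence described above can be sketched as follows. This is only an illustration under stated assumptions: the exact boosting formula is not reproduced in this text, so the sketch raises each chosen mid-frequency bin until its energy is at least `ratio` times the low-frequency maximum. The names `dft`, `idft`, `embed_sync_marker`, `low_bins`, `mid_bins`, and `ratio` are hypothetical, and a naive transform is used so the example needs only the standard library.

```python
import cmath

def dft(x):
    """Naive O(n^2) discrete Fourier transform, adequate for short demo frames."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spectrum):
    """Inverse transform; returns the real part of each time-domain sample."""
    n = len(spectrum)
    return [(sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n)
                 for k in range(n)) / n).real
            for t in range(n)]

def embed_sync_marker(frame, low_bins, mid_bins, ratio):
    """Boost mid-frequency bins relative to the low-frequency energy maximum."""
    n = len(frame)
    spectrum = dft(frame)
    max_low = max(abs(spectrum[k]) ** 2 for k in low_bins)
    target_mag = (ratio * max_low) ** 0.5  # magnitude whose energy is ratio * max_low
    for k in mid_bins:  # each k is assumed to satisfy 0 < k < n / 2
        mag = abs(spectrum[k])
        direction = spectrum[k] / mag if mag > 0 else complex(1.0)
        boosted = direction * max(mag, target_mag)
        spectrum[k] = boosted
        spectrum[n - k] = boosted.conjugate()  # mirrored bin keeps the signal real
    return idft(spectrum)
```

Writing the conjugate value into the mirrored bin corresponds to step 3) above, where the same operation is applied to the symmetric points of the spectrum.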
本实施例中,播放终端实时地判断第一音频中符合第一预设条件的目标帧作为第一目标帧,之后提升第一目标帧的中频部分第一采样点的能量值,以使得第一采样点的能量值与低频部分能量值的比值大于或等于预设比值,从而实现了同步帧标记的添加。从而在第一音频的采样率大于或等于第一阈值时,播放终端将低频部分最大值处于第一区间内的音频帧确定为第一目标帧。
需要说明的是,当第一音频的采样率小于第一阈值时,由于同步帧标记的方式提升了中频部分的能量值,对于采样率小于预设值的音频而言,同步帧标记的嵌入会影响到人耳收听第一音频的体验,因此不能再用同步帧的方式来定位第一目标帧,为了解决此问题,本申请实施例提供了第二种方案。
二、当第一音频的采样率小于第一阈值时,播放终端确定包含第一特征声音的音频帧为第一目标帧。
播放终端确定包含第一特征声音的音频帧为第一目标帧。具体方法如下。
当检测到第一特征声音,且第一特征声音的持续时间大于或等于预设时间时,播放终端将包含第一特征声音的音频帧确定为第一目标帧。
本实施例中,第一特征声音可以通过声音检测方法进行检测,其中,检测特征声音的方式可以为现有技术中的任意一种方式,本申请实施例对此并不进行限定。具体工作过程中,第一特征声音可以为人声,例如,当检测到没有人声的时间持续大于预设时间后,再次检测到人声的时刻,作为第一目标帧。这样做的好处在于能够适用于会议场景,在会议场景下,人们通过语音进行交流,为了防止语音被转录,因此需要在记录有语音的音频中加入音频水印,通过上述人声检测的方式,当检测到没有人声的时间持续大于预设时间(例如1.5s)后,将再次检测到人声的时刻作为第一目标帧,从而确保后续的水印嵌入能够嵌入到记录有语音信息的音频中。
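The "silence, then voice" rule above can be sketched as follows. The sketch assumes that some voice-activity detector has already produced one boolean per frame; the detector itself, the function name, and the choice of "greater than or equal to" for the silence comparison are assumptions made for illustration.

```python
def find_first_target_frames(voice_flags, min_silence_frames):
    """Return indices of frames where voice resumes after a long enough silence."""
    targets = []
    silent_run = 0  # number of consecutive frames without voice
    for index, has_voice in enumerate(voice_flags):
        if has_voice:
            if silent_run >= min_silence_frames:
                targets.append(index)  # voice re-detected: candidate first target frame
            silent_run = 0
        else:
            silent_run += 1
    return targets
```

With a preset time of 1.5 s, `min_silence_frames` would be that duration divided by the frame duration, rounded as the implementer sees fit.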
可选地,第一特征声音还可以做更细化的判断,例如,在人声状态下,当检测到特定语句时,将该特定语句所在的目标帧确定为第一目标帧。
本实施例中,当第一音频的采样率小于第一阈值时,播放终端通过确定第一特征声音的方式来确定第一目标帧,从而通过特征声音的方式实现了水印嵌入起始位置(即第一目标帧)的确认,不需要再嵌入同步帧标记。从而确保在第一音频中,无论第一音频的采样率大于还是小于第一阈值,均可以找到标记水印嵌入位置的第一目标帧。
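Optionally, the two approaches of case 1 and case 2 may be implemented by a synchronization frame embedder in the playback terminal. The embedder may be a physical apparatus installed in the playback terminal or run logic stored in the playback terminal; this is not limited in the embodiments of this application.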
通过上述方式,播放终端在第一音频中确定了第一目标帧,此时,播放终端需要执行后续步骤,在第一目标帧之后嵌入音频水印。
302、播放终端在第一目标帧之后确定满足第二预设条件的第二目标帧。
本实施例中,第二目标帧是位于第一目标帧之后的目标帧,由于第二目标帧是实时确定的,因此无法保证第一音频中第一目标帧之后的每一帧都适合作为第二目标帧,从而需要对第二目标帧的条件进行判断,只有当目标帧满足第二预设条件时,才将该目标帧作为第二目标帧。
需要说明的是,在实际工作过程中,播放终端在第一音频中对于第一目标帧和第二目标帧的确定是周期性的。首先,由于播放终端需要在第一音频中实时添加音频水印,因此播放终端只能够按照播放时间的先后顺序对第一音频进行处理。在处理过程中,播放终端先按照第一预设条件确定第一目标帧;接着在第一目标帧之后确定满足第二预设条件的第二目标帧。此时,第一目标帧与第二目标帧构成了一个音频水印周期。在下一个音频水印周期中,播放终端依然先按照第一预设条件确定第一目标帧,接着在第一目标帧之后确定满足第二预设条件的第二目标帧。这样处理的结果是,第一音频中包括多个第一目标帧,其中,第一音频中每两个相邻的第一目标帧之间包含一个第二目标帧。
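Therefore, as an optional technical solution, when adding the watermark to the first audio the playback terminal may instead determine the second target frame first and then the first target frame; the result is still that one second target frame lies between every two adjacent first target frames in the first audio. A person skilled in the art may therefore choose the order in which the first and second target frames are determined as needed; this is not limited in the embodiments of this application. For ease of understanding, the embodiments of this application describe only the case in which the second target frame follows the first target frame.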
进一步地,上述确定第二目标帧的第二预设条件的具体实现方式可以为:
播放终端确定中频部分能量值大于或等于第三阈值且小于第四阈值的目标帧为第二目标帧。
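In this embodiment, the watermark is embedded in the mid-frequency region of the second target frame. If the mid-frequency energy is too low, the embedded watermark is prone to being misparsed or missed; if it is too high, embedding the watermark produces popping. Further, because consecutive watermark embeddings may interfere with each other, the playback terminal may also ensure, when determining the second target frame, that the second target frames of different periods are far enough apart: the interval between second target frames must be greater than or equal to a sixth threshold, whose value may be set by a person skilled in the art according to the actual situation; this is not limited in the embodiments of this application.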
需要说明的是,第二目标帧中频部分能量值大于或等于第三阈值且小于第四阈值,该第三阈值小于该第四阈值,第三阈值与第四阈值的具体数值可以由本领域技术人员根据实际需求来确定,对此本申请实施例并不进行限定。
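The second preset condition and the spacing rule can be sketched together as follows. This is a simplified illustration that scans a list of per-frame mid-frequency energies and ignores the per-period pairing with first target frames; the function and parameter names are hypothetical.

```python
def pick_second_target_frames(mid_energies, t3, t4, min_gap):
    """Select frames with t3 <= energy < t4 that are at least min_gap frames apart."""
    chosen = []
    for index, energy in enumerate(mid_energies):
        if not (t3 <= energy < t4):
            continue  # mid-frequency energy too low (hard to detect) or too high (popping)
        if chosen and index - chosen[-1] < min_gap:
            continue  # too close to the previous watermark frame (sixth threshold)
        chosen.append(index)
    return chosen
```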
303、播放终端在第二目标帧中嵌入音频水印。
本实施例中,第二目标帧是位于第一目标帧之后的目标帧,在第一音频的多个水印周期中,第二目标帧位于两个第一目标帧之间,在第二目标帧中嵌入音频水印,以使得后续 水印解析的过程中,解析终端能够根据第一目标帧确定第二目标帧所在的位置,从而找到音频水印。
进一步地,本申请实施例进一步提供一种播放终端在第二目标帧中嵌入音频水印的具体实现方式,为便于理解,以下结合附图5a,对此种情况进行详细说明,如图5a所示,嵌入音频水印的步骤包括。
501、播放终端获取音频水印所对应的第一数列。
本实施例中,可选地,音频水印可以通过数列的形式呈现,例如,播放终端为会场B所在的播放终端,如上述步骤201中的举例,该播放终端的标识为“14”,则该音频水印所对应的第一数列为“14”,后续步骤需要将该数列14作为音频水印嵌入第一音频中,从而在该第一音频中标记与之关联的播放终端。
502、播放终端从第二目标帧中获取至少一个第二采样点。
本实施例中,一个目标帧中包括2048个采样点,播放终端从第二目标帧中获取至少一个第二采样点,该第二采样点用于在后续工作过程中嵌入上述第一数列中的元素。
503、播放终端将第一数列中的至少一个元素分别嵌入至少一个第二采样点中。
本实施例中,播放终端将第一数列中的元素嵌入第二采样点,其中,第一数列中的一个元素嵌入到一个第二采样点中。以使得第二采样点记录了第一数列的内容,后续解析终端通过读取第二采样点中所记录的第一数列,即可解析出音频水印所记载的内容。
需要说明的是,播放终端将第一数列中的至少一个元素分别嵌入至少一个第二采样点的步骤可以通过以下方式来实现。
播放终端调节第二采样点在不同时域和/或不同频域部分能量值的能量比值。
本实施例中,上述能量比值的大小与第一数列中的数字相关联,不同的能量比值可以对应不同的数字,从而通过不同的能量比值记录了第一数列中的不同数字。由此通过改变能量比值的方式在第二采样点中记录了第一数列的内容,实现了音频水印的嵌入。
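For example, the audio watermark to be embedded by the playback terminal is the number "14". Before embedding, the watermark is converted to binary, giving the first sequence "1110", which is the content to be embedded in the first audio as the audio watermark. The first sequence contains the four elements 1, 1, 1, and 0, which are embedded into four second sampling points of the second target frame respectively, thereby embedding the audio watermark.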
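The conversion from the decimal identifier to the first sequence can be sketched as follows; the function name is hypothetical.

```python
def watermark_to_bits(watermark_id):
    """Convert a non-negative integer watermark into its list of binary digits."""
    return [int(ch) for ch in format(watermark_id, "b")]
```

Each element of the returned list is then carried by one second sampling point.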
在具体工作过程中,通过改变第二采样点能量比值的方式,实现第一数列中每个元素在第二采样点中的嵌入。对于第二采样点中能量比值的改变可以采用以下三种不同的方案:一、改变不同时域部分能量值的比值;二、改变不同频域部分能量值的比值;三、同时改变不同时域和不同频域部分能量值的比值。为便于理解,以下结合附图对此三种方式做详细的说明。
一、改变不同时域部分能量值的比值。
本实施例中,通过改变不同时域部分能量值的比值实现时域嵌入的方式包括以下步骤。
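1. Reduce the energy value of the first half of the second sampling point so that the ratio of the energy of the second half to that of the first half is greater than or equal to the fifth threshold.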
本实施例中,如图5b所示,在第二采样点中按照时域的顺序,降低第二采样点前半部 分5061的能量值,以使得第二采样点后半部分5062的能量值显著高于前半部分5061的能量值。此时,可以将这种能量分布的波形预设为数字0。
具体地,可以通过以下方式来实现。
如图5c所示,图5c为第二采样点原始帧的波形图,对图5c所示的原始帧进行第一次离散余弦变换DCT后得到如图5d所示的波形图,之后从图5d所示的波形图中选出中频部分进行第二次DCT变换得到如图5e所示的波形图,此时按照如下公式1对图5e所示波形的能量进行处理。
在上述公式1中,j为时域上的时间段,P(j)为如图5e中波形在时段j内的总能量,λ为预设的系数,其中,λ的具体数值可以根据实际需要进行调整,mid代表图5e中的波形在时域上的中点。从上述公式可知,j的取值范围为1到中点,即图5e中的前半部分,将前半部分的总能量值P(j)除以系数λ,从而降低了第二采样点前半部分的能量值,得到如图5b所示的波形图。
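Formula 1 itself is not reproduced in this text. Based only on the description above (the total energy P(j) of each segment in the front half is divided by λ), it can be written in the following form; the primed symbol for the adjusted energy is an assumed notation.

```latex
P'(j) = \frac{P(j)}{\lambda}, \qquad 1 \le j \le \mathrm{mid}
```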
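2. Reduce the energy value of the second half of the second sampling point so that the ratio of the energy of the first half to that of the second half is greater than or equal to the fifth threshold.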
本实施例中,如图5f所示,在第二采样点中按照时域的顺序,降低第二采样点后半部分5063的能量值,以使得第二采样点前半部分5064的能量值显著高于后半部分5063的能量值。此时,可以将这种能量分布的波形预设为数字1。
具体地,可以通过以下方式来实现。
基于上述图5e所示的波形图,此时按照如下公式2对图5e所示波形的能量进行处理。
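In Formula 2 above, j is a time segment in the time domain, P(j) is the total energy of the waveform in FIG. 5e within segment j, and λ is a preset coefficient whose value may be adjusted as needed; mid is the time-domain midpoint of the waveform in FIG. 5e, and S'-T' is the upper limit of the time domain in FIG. 5e. As the formula shows, j ranges from mid+1 to S'-T', that is, from the time-domain midpoint to the end, which is the back half of FIG. 5e. Dividing the total energy P(j) of the back half by λ lowers the energy of the back half of the second sampling point and yields the waveform shown in FIG. 5f.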
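The formula for the back half is likewise not reproduced in this text. From the description above (j runs from mid+1 to S'-T' and P(j) is divided by λ), it can be written as follows; the primed symbol for the adjusted energy is an assumed notation.

```latex
P'(j) = \frac{P(j)}{\lambda}, \qquad \mathrm{mid} + 1 \le j \le S' - T'
```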
本实施例中,通过改变第二采样点不同时域部分能量值的比值实现水印的嵌入。具体地,调整第二采样点在时域上前后部分的能量比值,之后将前后部分不同的能量比值分别预设为0和1,从而在第二采样点上实现了二进制的数字嵌入,后续可以根据需要将该二进制数列转化为十进制数列,从而实现了将第一数列加入第二采样点的工作过程。在后续解析终端对音频水印进行解析时,直接获取目标帧中前后部分能量值的比值即可实现对音频水印的解析。
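The time-domain scheme can be sketched as follows. This is a deliberate simplification: it scales raw samples directly instead of working on the twice-DCT-transformed mid-frequency signal described above, and the function names are hypothetical. Dividing energy by λ corresponds to dividing amplitude by the square root of λ.

```python
def embed_bit_time_domain(samples, bit, lam):
    """Attenuate one half of the segment so the other half dominates in energy."""
    mid = len(samples) // 2
    scale = lam ** -0.5  # energy / lam  <=>  amplitude / sqrt(lam)
    if bit == 0:
        # Digit 0: weaken the front half so the back half carries more energy.
        return [s * scale for s in samples[:mid]] + list(samples[mid:])
    # Digit 1: weaken the back half so the front half carries more energy.
    return list(samples[:mid]) + [s * scale for s in samples[mid:]]

def half_energies(samples):
    """Return (front_energy, back_energy) split at the time-domain midpoint."""
    mid = len(samples) // 2
    front = sum(s * s for s in samples[:mid])
    back = sum(s * s for s in samples[mid:])
    return front, back
```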
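Note that in the parsing process described above, the energy ratio is determined by taking the time-domain midpoint of the target frame as the boundary, computing the ratio between the front and back parts, and parsing according to the preset rule. In practice, however, when the first audio is re-recorded, echo or reverberation at the site stretches the high-energy part in the time domain. For example, when the watermark digit is 1, the ratio of front-half energy to back-half energy, with the midpoint as the boundary, is greater than or equal to the fifth threshold. Because of echo or reverberation during re-recording, the high-energy region spills past the midpoint, which changes the energy ratio and can prevent the parsing terminal from recovering the watermark from it. To solve this problem, the embodiments of this application further provide a method that changes the energy ratio in the frequency domain, to overcome the effect of re-recording echo and reverberation on the time-domain energy distribution.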
二、改变不同频域部分能量值的比值。
本实施例中,采用与上述类似的方式,改变第二采样点在频域上的的能量比值,从而实现水印的嵌入,具体的实现方式可参阅上述改变不同时域部分能量比值的方法,区别在于,本种实施方式并不是在时域上以时间作为划界,而是在频域上以频率作为划界。
通过改变不同频域部分的能量比值所得到的能量分布图如图5g所示,需要说明的是,前述图5b至图5f所示的图片为能量值与时域对应关系的折线图,而图5g所显示的能量分布图之间显示能量在时域和频域上的分布关系。在图5g的能量分部框中,阴影部分表示高能量部分,白色部分表示低能量部分。在第二采样点中以频域的中点划界,上半部分与下半部分的比值大于第五阈值的能量分部5065表示数字1,下半部分与上半部分的比值大于第五阈值的能量分部5066表示数字0,从而通过改变频域上的能量比值实现了音频水印的嵌入,克服了转录过程中所产生的回声混响对时域上的能量分部造成的影响。
三、同时改变不同时域和不同频域部分能量值的比值。
本实施例中,综合了上述方案一和方案二的方案,首先通过方案一的方法,调整第二采样点在时域上前后部分的能量比值,实现第一步的水印嵌入,例如图5h所示,第二采样点前半部分5067的能量值与后半部分5068的能量值的比值大于第五阈值,嵌入了音频水印为数字1,进一步地,为了防止转录过程中所产生的回声混响在时域上对音频水印能量分部造成的影响,如图5h所示,对于图5h的后半部分(即能量较低的部分),采取上述方案二的方法,改变高频部分50681与低频部分50682的能量比值,其中,第二采样点上半部分与下半部分的比值大于第五阈值的能量分布同样表示数字1。
同理,将上述能量分部的比值对调,即可表示数字0。
方案三所提供的方式,在第二采样点中同时加入了一大一小两个水印。该一大一小两个水印分别记录了相同的数值,其中,大水印为在时域上改变第二采样点前后两部分的能量比值所得到的水印,即图5h中第二采样点前半部分5067的能量值与后半部分5068的能量比值所形成的水印,小水印为改变第二采样点低能量部分高低频的能量比值所得到的水印,即图5h中后半部分5068的高频部分50681与低频部分50682的能量比值所形成的水印。
进一步地,对于如图5h所示的水印,为了防止转录过程中所产生的回声混响对时域上的能量分部造成的影响,还可以对音频水印中的低能量部分进行进一步的分割,如图5i所示的能量分布图用于表示数字1。其中,阴影部分表示高能量值区域,白色部分表示低能量值的区域。如图5i所示,在时域上前半部分51为高能量部分,后半部分52为低能量部分。对于后半部分52嵌入的小水印时,将该后半部分52在时域上进一步分为两部分,分 别记为第一部分521和第二部分522,其中,第一部分521为接近前半部分51(高能量部分)的区域,第二部分522为远离前半部分51(高能量部分)的区域。在嵌入小水印时,仅仅改变第二部分522高频部分与低频部分的能量比值,对于第一部分521,由于靠近高能量部分,可能会由于转录过程中产生的混响导致第一部分521的能量相应升高,因此不再改变第一部分521的高低频能量比值,从而排除了转录过程中所产生的回声混响对时域上的能量分部造成的影响,仅仅在远离高能量部分的第二部分522通过改变高低频能量比值的方式嵌入小水印,其中,第二部分522的高频部分5221的能量值高于低频部分5222的能量值,同样表示数字1。大小两个水印分别表示同样的音频水印。
同理,将图5i中的高能量和低能量的部分对调,即可得到表示数字0的音频水印。
需要说明的是,上述图5i所示的音频水印,对于低能量部分(即后半部分52),在时域上分为两个部分进行处理(第一部分521和第二部分522),在实际工作过程中,本领域技术人员根据实际需要,可以将该低能量部分在时域上分割为更多的部分,对此本申请实施例并不进行限定。
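In this embodiment, to embed the audio watermark in the first audio in real time, the playback terminal searches, in time order, for a second target frame that meets the preset condition after the first target frame and embeds the watermark there. The watermark is embedded by changing the energy ratio between the energy values of different time-domain and/or frequency-domain parts of the sampling points in the second target frame. This achieves real-time embedding, and the embedded watermark is robust to interference during re-recording.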
需要说明的是,上述方案一至方案三的方案中,播放终端通过改变能量比值实现了二进制的水印嵌入,在实际工作过程中,本领域技术人员可根据实际需要,通过改变能量比值实现其他进制的水印嵌入,例如十进制或12进制,对此本申请实施例并不进行限定。
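Note that, so that the completeness of the embedded watermark can be verified during parsing, the playback terminal needs to add a check bit to the first sequence when embedding the audio watermark in the first audio. The parsing terminal can then use the check bit to confirm that the watermark was transmitted completely, which guards against parsing failures caused by signal loss during re-recording of the first audio. For ease of understanding, the way the check bit is added is described in detail below.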
本实施例中,如上所述,播放终端所获取的音频水印为数字“14”,在将音频水印嵌入第一音频时,需要将该音频水印转化为二进制,从而将该数字“14”转化为二进制,得到第一数列:“1110”,该第一数列即为需要作为音频水印嵌入第一音频中的内容。为了确保该第一数列传输的完整性,可以在第一数列中加入校验位,例如,可以采用奇偶校验的方式,例如,第一数列“1110”中包括三个数字1,即奇数个数字“1”,此时,在第一数列的最后一位加入一个数字1,得到新的第一数列“11101”,该第一数列中最后一位数的1即为校验位,该校验位的“1”用于表示当前数列中除校验位外还有奇数个数字“1”。同理,若第一数列中数字“1”的数量为偶数个,则校验位为“0”,从而通过此种方式确定了第一数列的位数,后续解析终端可以根据校验位实现对第一数列的校验,确保第一数列的传输准确。
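The check-bit rule in the example above can be sketched as follows; the function name is hypothetical. The appended bit is 1 when the payload holds an odd number of 1s and 0 otherwise.

```python
def add_parity_bit(bits):
    """Append a check bit that is 1 when the payload contains an odd number of 1s."""
    return list(bits) + [sum(bits) % 2]
```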
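Further, some sampling points of the first audio are lost when it is recorded over the air. If the first sequence in the audio watermark is too long, the watermark occupies too many frames, and during detection the lost sampling points cause the start of each frame to drift from its expected position, making watermark detection inaccurate. To solve this problem, when the length of the first sequence exceeds a threshold, the first sequence is split into multiple sub-sequences, a check bit is added to each sub-sequence in the manner described above, and the sub-sequences are recombined into one long first sequence. When the parsing terminal later parses the audio watermark, it can verify the first sequence sub-period by sub-period according to the preset sub-periods, which ensures that the first sequence is transmitted completely.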
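Splitting a long first sequence and protecting each sub-sequence can be sketched as follows. The chunk length and the function name are illustrative assumptions; the application does not fix a sub-sequence length.

```python
def protect_in_chunks(bits, chunk_len):
    """Split bits into sub-sequences and append one check bit to each of them."""
    protected = []
    for start in range(0, len(bits), chunk_len):
        chunk = list(bits[start:start + chunk_len])
        protected.extend(chunk + [sum(chunk) % 2])  # same rule as a single check bit
    return protected
```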
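In this embodiment, taking the audio watermark "14" as an example, after the check bit is added the first sequence corresponding to the watermark is "11101". The playback terminal obtains the first second target frame and, in the manner described in step 503 above, embeds the first element of the first sequence, "1", into the first second sampling point of the second target frame. This embeds the first digit of the first sequence and completes the first sub-period of the watermark embedding period. Step 503 is then repeated: the second second sampling point is obtained from the second target frame and the second element, "1", is embedded into it in the same way, embedding the second digit. Continuing in this way, the playback terminal embeds the five digits of the first sequence into five second sampling points of the second target frame over a complete watermark embedding period made up of five sub-periods.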
可选地,在上述每个水印嵌入的子周期中,为了保证当前子周期中水印嵌入的强度足够,播放终端在每个子周期完成之后,就需要执行一次嵌入强度的检测。
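Note that the embedding strength is checked as follows: determine whether the energy ratio between the energy values of different time-domain and/or frequency-domain parts of the second sampling point in the current sub-period is greater than the fifth threshold. Taking the first embedding scheme of step 503 above as an example, if the digit embedded in the current second sampling point is 1, the playback terminal checks whether the ratio of front-half energy to back-half energy in the time domain is greater than the fifth threshold. If so, it proceeds to the next step; if not, it re-embeds the watermark with a larger value of λ in Formula 2, which further lowers the back-half energy of the second sampling point and raises the front-to-back energy ratio until the embedding strength meets the requirement.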
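The embed, check, and retry loop can be sketched as follows. It reuses the simplified direct time-domain scaling rather than the DCT-based processing, and the starting λ, the growth step, the round limit, and all names are assumptions for illustration.

```python
def embed_with_strength_check(samples, bit, threshold, lam=2.0, step=2.0, max_rounds=20):
    """Re-embed with a larger lam until the dominant/weak energy ratio exceeds threshold."""
    mid = len(samples) // 2
    for _ in range(max_rounds):
        scale = lam ** -0.5  # energy / lam  <=>  amplitude / sqrt(lam)
        if bit == 1:
            out = list(samples[:mid]) + [s * scale for s in samples[mid:]]
            strong, weak = out[:mid], out[mid:]
        else:
            out = [s * scale for s in samples[:mid]] + list(samples[mid:])
            strong, weak = out[mid:], out[:mid]
        e_strong = sum(s * s for s in strong)
        e_weak = sum(s * s for s in weak)
        if e_strong > threshold * e_weak:  # embedding strength is sufficient
            return out, lam
        lam *= step  # strengthen the attenuation and try again
    raise ValueError("could not reach the required embedding strength")
```

A silent segment never passes the check, so the sketch raises an error instead of looping forever.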
本实施例中,通过上述步骤501至503所述的方式,播放终端在第二目标帧中嵌入了音频水印。至此步骤203完成。
204、播放终端播放嵌有音频水印的第一音频。
本实施例中,播放终端将嵌有音频水印的第一音频播放出来,由于音频水印是实时嵌入第一音频中的,因此在实时播放的场景下,播放终端所播放的音频依然能够带有音频水印,从而在后续被转录的过程中,可以根据该音频水印对该第一音频实现溯源。
综上所述,本申请实施例所提供的音频水印添加方法中,播放终端实时获取第一音频;播放终端在第一音频中嵌入音频水印,音频水印与播放终端相关联;播放终端播放嵌有音频水印的第一音频。从而在实时播放音频的场景下,通过播放终端在音频流中实时地加入音频水印,以使得后期设备在解析水印时能够根据该音频水印确定该播放终端,便于在第一音频被转录后进行溯源。
通过上述音频水印添加方法加入音频水印的第一音频,无论是通过数字信道进行翻录,还是通过空气信道进行翻录,翻录所得到的第一音频中,均可以被解析终端解析出音频水印,从而实现对第一音频的溯源,解析终端可以通过音频水印确定将该音频水印加入到第一音频中的播放终端。为便于理解,以下结合附图,对本申请实施例所提供的音频水印解析方法进行详细说明。
请参阅图6,如图6所示,本申请所提供的音频水印解析方法的实施例包括以下步骤。
601、解析终端获取第一音频。
本实施例中,第一音频为播放终端通过上述方法嵌入了音频水印的音频,需要说明的是,该第一音频的初始播放源为该播放终端,即:是该播放终端将该音频水印嵌入该第一音频中。当播放终端播放了该第一音频后,第一音频直接被解析终端获取,也可以经过转录,该转录可以是通过数字信道进行的传播转录,也可以是通过空气信道进行的转录,对于这两种方式的转录本申请所提供的方法均能够进行解析。
可选地,在获取到第一音频后,解析终端还需要对第一音频进行格式及采样率的转化。
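In this embodiment, after the playback device plays the first audio, it may be re-recorded by many kinds of recording devices. Recorders of different brands produce different audio file formats, and their sampling rate is typically 44.1 kHz, so the file format and sampling rate must first be converted into ones the parsing terminal can process. Preferably, the parsing terminal converts the first audio to a sampling rate of 48 kHz.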
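For the 44.1 kHz to 48 kHz conversion mentioned above, the rational up/down factors can be derived as follows. This is a small sketch with a hypothetical function name; the resampling filter itself is outside its scope.

```python
from fractions import Fraction

def resample_factors(src_rate, dst_rate):
    """Return the (up, down) integer factors of the reduced ratio dst_rate / src_rate."""
    ratio = Fraction(dst_rate, src_rate)
    return ratio.numerator, ratio.denominator
```

The resulting pair could be handed to a polyphase resampler, for example `scipy.signal.resample_poly(x, up, down)`, if SciPy is available in the parsing environment.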
602、解析终端从第一音频中解析音频水印。
本实施例中,解析终端可以对第一音频进行实时解析,也可以进行离线的解析。对此本申请实施例并不进行限定。为便于理解,本申请实施例主要对离线解析的方法进行说明,但并不构成对本方案的限定。
请参阅图7,如图7所示,本申请所提供的音频水印解析方法从第一音频中解析音频水印包括以下步骤。
701、解析终端确定第一音频中满足第一预设条件的第一目标帧。
本实施例中,第一目标帧满足第一预设条件,因此解析终端可以按照第一预设条件来获取第一音频中的第一目标帧。需要说明的是,第一音频中包括多个第一目标帧,每个第一目标帧对应一个水印解析的周期,因此每当解析终端在第一音频中确定一个满足第一预设条件的第一目标帧,则执行一次后续解析步骤。该第一预设条件可以为:将低频部分最大值处于第一区间的目标帧作为第一目标帧,关于第一区间的具体实施方式可参阅上述步骤301的记载,此处不再赘述。
需要说明的是,第一目标帧中包括标记信息,解析终端需要进一步根据标记信息确定第一目标帧,其中,标记信息的实现方式根据第一音频的实际情况包含两种技术方案:一、第一音频的采样率小于第一阈值时,标记信息为特征声音。二、第一音频的采样率大于或等于第一阈值时,标记信息为同步帧标记。以下分别对此两种情况进行详细说明。
一、第一音频的采样率小于第一阈值时,标记信息为特征声音。
本实施例中,当解析终端检测到第一音频的采样率小于第一阈值时,即可判定在第一音频中,第一目标帧与第二目标帧中的标记信息为特征声音,此时,解析终端检测标记信息的具体方法为:
解析终端从第一音频中分别确定包含有第一特征声音,且第一特征声音的持续时间大于或等于预设时间的目标帧作为第一目标帧。
本实施例中,第一特征声音可以通过声音检测方法进行检测,其中,检测特征声音的方式可以为现有技术中的任意一种方式,本申请实施例对此并不进行限定。具体工作过程 中,第一特征声音可以为人声,例如,当检测到没有人声的时间持续大于预设时间后,再次检测到人声的时刻所在的目标帧,确定为第一目标帧。
进一步地,播放终端与解析终端之间针对特征声音还可以约定更加细化的实施方式,例如,在人声检测的状态下,当解析终端检测到特定语句时,才将该特定语句所在的目标帧确定为第一目标帧。
本实施例中,当第一音频的采样率小于第一阈值时,解析终端通过确定第一特征声音的方式来确定第一目标帧,从而通过特征声音的方式实现了水印嵌入起始位置(即第一目标帧)的确认,第一音频中不需要再嵌入同步帧标记。从而确保在第一音频中,无论第一音频的采样率大于还是小于第一阈值,均可以找到标记水印嵌入位置的第一目标帧。
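2. When the sampling rate of the first audio is greater than or equal to the first threshold, the marker information is a synchronization frame marker.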
本实施例中,当解析终端确定第一音频的采样率大于或等于第一阈值时,可判定,在第一音频中,第一目标帧中的标记信息为同步帧标记。解析终端解析同步帧标记的方法具体包括以下步骤。
1、解析终端逐帧获取第一音频中频部分与低频部分能量值的第一比值。
本实施例中,由于播放终端添加同步帧的方法为:提升中频部分第一采样点的能量值,从而提高中频部分与低频部分能量值的第一比值,因此,解析终端可以通过该第一比值来确定同步帧标记。
2、当解析终端获取到第一比值大于或等于第二阈值的初始目标帧时,从初始目标帧开始通过滑窗方式向后逐帧滑动检测第一音频,以获取每个滑动窗口内中频部分与低频部分能量值的第二比值。
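In this embodiment, the parsing terminal obtains, frame by frame, the first ratio between the energy values of the mid-frequency and low-frequency sampling points of each frame. When it finds a first sampling point whose first ratio is greater than or equal to the second threshold, it determines the frame containing that first sampling point as the initial target frame. However, a frame has 2048 sampling points and the first sampling point covers only some of them, so the first target frame that actually contains the first sampling point may be offset from the initial target frame. To solve this problem, the parsing terminal starts from the initial target frame and moves backward with a sliding window, checking the first audio that follows the initial target frame to obtain the second ratio between mid-frequency and low-frequency energy within each sliding window.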
3、解析终端获取第二比值最大的滑动窗口所在帧为第一目标帧。
本实施例中,请参阅图8所示,初始目标帧801即解析终端生成的初始的滑动窗口801,初始目标帧801与第一目标帧802之间存在交集,第一采样点803位于该交集的部分,解析终端需要将滑动窗口801与第一目标帧802之间完全重合,才能确定第一目标帧所在的位置。对此,解析终端的具体工作方式为:通过滑窗方式,检测每个滑动窗口801内中频部分与低频部分能量值的第二比值,其中,由于播放终端主动提升了第一目标帧802中频部分的能量,因此,第二比值达到最大值的窗口,即为第一目标帧所在的窗口,从而通过此种方式,实现了滑动窗口801与第一目标帧802的重合。从而通过滑窗检测的方式实现了对同步帧标记的查找,且有效的防止了查找过程中产生的偏移问题,提升了后续水印检测的精度。
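The sliding-window alignment can be sketched as follows. The callable `ratio_fn` stands in for the computation of the mid-to-low energy ratio of a window, which is assumed to exist elsewhere; the search span and all names are hypothetical.

```python
def locate_first_target_frame(samples, start, frame_len, search_span, ratio_fn):
    """Slide a frame-sized window from start and return the offset with the largest ratio."""
    best_offset, best_ratio = start, float("-inf")
    last = min(start + search_span, len(samples) - frame_len)
    for offset in range(start, last + 1):
        ratio = ratio_fn(samples[offset:offset + frame_len])  # second ratio of this window
        if ratio > best_ratio:
            best_offset, best_ratio = offset, ratio
    return best_offset
```

The returned offset is where the window coincides with the frame in which the playback terminal raised the mid-frequency energy.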
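Note that the synchronization frame marker detection above relies on the first ratio between the mid-frequency and low-frequency energy of the first audio. In practice, the original (non-watermark) content of the first audio may also contain frames whose mid-to-low energy ratio satisfies the first-ratio condition, which would cause false detection of the synchronization frame marker. The essential difference between a synchronization frame marker and the original content is that, in the marker, the energy of the mid-frequency first sampling point rises abruptly relative to the low-frequency part, whereas naturally recorded audio does not show such a sudden jump. Using this property, once the parsing terminal has determined through step 3 above that the frame containing the sliding window is the first target frame, it can perform the following steps to confirm whether the marker in the current frame is a genuine synchronization frame marker, thereby preventing false detection.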
4、解析终端从第二比值最大的滑动窗口中获取中频部分能量值最高的第一采样点。
本实施例中,第一采样点为当前窗口中能量值最高的点。
5、解析终端获取距离第一采样点之前预设长度的第三采样点。
本实施例中,第三采样点在时域上位于第一采样点之前,第三采样点距离第一采样点的预设长度可以由本领域技术人员根据实际需要来设定,也可以由解析终端根据采样率等参数自行确定,对此本申请实施例并不进行限定。
6、解析终端确定第一采样点能量值与第三采样点能量值的比值大于第七阈值的部分为同步帧标记。
本实施例中,请参阅图9a和图9b,其中,图9a为加入了同步帧标记的第一目标帧,图9b为没有加入同步帧标记的普通目标帧。图9a和图9b两幅图中,中频部分与低频部分的能量比值均满足预设条件,因此该种情况下,仅通过中频与低频部分的能量比值,无法判断哪一个是加入了同步帧标记的目标帧,从而导致误检测的发生。对此通过上述步骤4至6所示的方法,假设在图9a中,第一采样点901与第三采样点902之间相隔3个采样点,可以看到第一采样点相对第三采样点的能量发生了突增,而在图9b中,未加入同步帧标记的波形图中,能量值的变化是平滑的,第一采样点903与第三采样点904之间相隔3个采样点的情况下,第一采样点903的能量值相对第三采样点904不会有明显的变化。因此,通过此种方法,能够准确的识别同步帧标记,防止误检测的发生。
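The abrupt-jump test of steps 4 to 6 can be sketched as follows. It assumes a list of mid-frequency energies for the best window; `lookback` is the preset distance to the third sampling point and `t7` the seventh threshold, both hypothetical parameter names.

```python
def is_sync_marker(energies, lookback, t7):
    """Check whether the energy peak jumps sharply relative to an earlier point."""
    peak = max(range(len(energies)), key=energies.__getitem__)  # first sampling point
    if peak < lookback:
        return False  # no third sampling point is available before the peak
    reference = energies[peak - lookback]  # third sampling point
    return energies[peak] > t7 * reference
```

A smooth rise, as in ordinary recorded audio, fails the test even when the overall mid-to-low ratio is high.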
本实施例中,通过上述特征声音的方式和同步帧标记的方式,无论第一音频的采样率大小如何,解析终端均能够在第一音频中检测到播放终端确定的第一目标帧。
702、解析终端在第一目标帧之后确定满足第二预设条件的第二目标帧。
本实施例中,当解析终端检测到第一目标帧后,以第一目标帧为定位帧,即可继续寻找位于第一目标帧之后的第二目标帧,第二目标帧嵌有音频水印,且第二目标帧满足第二预设条件。解析终端可以根据该第二预设条件,在第一目标帧之后快速地找到第二目标帧。
可选地,由于第一音频是多周期水印嵌入的,每个周期中包括一个第一目标帧和一个第二目标帧,因此解析终端在离线解析时,也可以从第一目标帧开始向前获取第二目标帧,为便于理解,本申请实施例仅以从第一目标帧开始向后移动寻找第二目标帧进行说明,但并不构成对本申请实施例方案的限定。
可选地,如图10所示,解析终端具体可以通过以下步骤确定第二目标帧。
1001、解析终端从第一目标帧开始以帧为单位向后移动,分别获取每帧中频部分的能 量大于或等于第三阈值且小于第四阈值的备选目标帧。
本实施例中,由于播放终端在第二目标帧中嵌入水印的位置为中频区域,如果中频区域能量过低,则导致嵌入水印后,容易导致误解析和无法检测到水印;如果中频区域能量过高,则导致嵌入水印后,产生爆音。因此根据这一特性,解析终端首先解析中频部分的能量大于或等于第三阈值且小于第四阈值的目标帧作为可能存在第二目标帧的备选目标帧。
1002、解析终端从备选目标帧中获取不同时域和/或不同频域部分能量值的能量比值大于或等于第五阈值的目标帧为第二目标帧。
本实施例中,由于播放终端通过改变第二目标帧中能量比值的方式来实现音频水印的嵌入,因此,根据播放终端嵌入音频水印的具体方式,解析终端从备选目标帧中获取不同时域和/或不同频域部分能量值的能量比值大于或等于第五阈值的目标帧即为第二目标帧。
本实施例中,通过上述步骤,解析终端确定了第二目标帧,接下来可以开始从第二目标帧中解析水印。
703、解析终端从第二目标帧中解析音频水印。
本实施例中,播放终端将音频水印嵌入在第二目标帧,因此,当解析终端获取到第二目标帧时,即可从该第二目标帧中解析所嵌入的音频水印。
请参阅图11,如图11所示,可选地,解析终端通过以下步骤从第二目标帧中解析音频水印。
1101、解析终端从第二目标帧中获取第二采样点。
本实施例中,第二采样点为第二目标帧中能量比值大于或等于第五阈值的采样点,解析终端从第二目标帧中获取第二采样点,该第二采样点中记录有播放终端嵌入的水印信息。
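1102. The parsing terminal obtains, for each second sampling point, the energy ratio between the energy values of different time-domain and/or frequency-domain parts.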
本实施例中,播放终端在嵌入音频水印时,对于第二采样点中能量比值的改变可以采用以下三种不同的方案:一、改变不同时域部分能量值的比值;二、改变不同频域部分能量值的比值;三、同时改变不同时域和不同频域部分能量值的比值。因此,根据不同的水印嵌入方式,解析终端需要采用相应的手段进行解析。为便于理解,以下结合附图对此三种方式做详细的说明。
一、改变不同时域部分能量值的比值。
本实施例中,播放终端在嵌入音频水印时,通过改变不同时域部分能量值的比值的方式来进行,具体地,第二采样点后半部分的能量值与前半部分的能量值的比值大于第五阈值时,约定该比值所对应的数字为0,第二采样点前半部分的能量值与后半部分的能量值的比值大于第五阈值时,约定该比值所对应的数字为1。
基于上述水印嵌入规则,请参阅图12a和图12b,解析终端所获取的第二目标帧的能量分部图分别为如图12a和图12b所示的两种情况。在能量分布图中,深色部分表示高能量区域,白色部分表示低能量区域,从而通过能量分部图,解析终端可以直观地获取到第二目标帧的能量分部。其中,对于图12a所示的能量分部图,第二目标帧中后半部分1201的能量值显著大于前半部分1202的能量值,且经过解析终端的计算,第二采样点后半部分 1201的能量值与前半部分1202的能量值的比值大于第五阈值时,则此时解析终端将图12a所示的第二目标帧所嵌入的水印判定为0。对于图12b所示的能量分部图,第二目标帧前半部分1203的能量值显著大于后半部分1204的能量值,且经过解析终端的计算,第二采样点前半部分1203的能量值与后半部分1204的能量值的比值大于第五阈值时,则此时解析终端将图12b所示的第二目标帧所嵌入的水印判定为1。
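The decision rule for time-domain parsing can be sketched as follows. It operates on raw samples split at the midpoint and assumes a fifth threshold of at least 1; returning `None` for an undecidable segment is an illustrative choice, not something the application specifies.

```python
def decode_bit_time_domain(samples, t5):
    """Return 1 if the front half dominates, 0 if the back half dominates, else None."""
    mid = len(samples) // 2
    front = sum(s * s for s in samples[:mid])
    back = sum(s * s for s in samples[mid:])
    if front > t5 * back:
        return 1
    if back > t5 * front:
        return 0
    return None  # neither half dominates strongly enough
```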
需要说明的是,在上述解析的过程中,该能量比值的确定,是以目标帧在时域上的中点为界限,获取前后部分的比值后按照上述预设规则解析的。然而,在实际工作过程中,在对第一音频进行翻录时,由于场地的回声或混响等原因,会造成高能量部分在时域上增加。例如,音频水印为1时,时域上以中点为界限,前半部分的能量值与后半部分能量值的比值大于或等于第五预设值。然而,由于转录过程中回声或混响的存在,导致高能量区域在时域上越过了中点的界限,从而影响了能量比值的大小,造成解析终端无法根据能量比值获取到水印。为解决上述问题,播放终端在嵌入水印时进一步提供一种依据频域来改变能量比值的方法,以克服转录过程中所产生的回声混响对时域上的能量分部造成的影响。
二、改变不同频域部分能量值的比值。
本实施例中,播放终端在嵌入音频水印时采用与上述类似的方式,改变第二采样点在频域上的的能量比值,从而实现水印的嵌入,具体的实现方式可参阅上述改变不同时域部分能量比值的方法,区别在于,本种实施方式并不是在时域上以时间作为划界,而是在频域上以频率作为划界。
因此,解析终端在解析音频水印时,所得到第二目标帧的能量分部图如图5g所示,在第二采样点中以频域的中点划界,在能量分布图中,深色部分表示高能量区域,白色部分表示低能量区域,从而通过能量分部图,解析终端可以直观地获取到第二目标帧的能量分部。如图5g所示,上半部分与下半部分的比值大于第五阈值的表示数字1。下半部分与上半部分的比值大于第五阈值的表示数字0,从而通过改变频域上的能量比值实现了音频水印的嵌入,克服了转录过程中所产生的回声混响对时域上的能量分部造成的影响。
三、同时改变不同时域和不同频域部分能量值的比值。
本实施例中,播放终端在嵌入音频水印时综合了上述方案一和方案二的方案,首先通过方案一的方法,调整第二采样点在时域上前后部分的能量比值,实现第一步的水印嵌入,例如图5h所示,第二采样点前半部分的能量值与后半部分能量值的比值大于第五阈值,嵌入了音频水印为数字1,进一步地,为了防止转录过程中所产生的回声混响在时域上对音频水印能量分部造成的影响,如图5h所示,对于图5h的后半部分(即能量较低的部分),采取上述方案二的方法,改变高频部分与低频部分的能量比值。
方案三所提供的方式,在第二采样点中同时加入了一大一小两个水印。该一大一小两个水印分别记录了相同的数值,其中,大水印为在时域上改变第二采样点前后两部分的能量比值所得到的水印,小水印为改变第二采样点低能量部分高低频的能量比值所得到的水印。
进一步地,对于如图5h所示的水印,为了防止转录过程中所产生的回声混响对时域上的能量分部造成的影响,还可以对音频水印中的低能量部分进行进一步的分割,如图5i所 示,对于第二采样点中低能量部分嵌入的小水印时,将该低能量部分在时域上分为两部分,分别记为第一部分和第二部分,其中,第一部分为接近高能量部分的区域,第二部分为远离高能量部分的区域。在嵌入小水印时,仅仅改变第二部分高频部分与低频部分的能量比值,对于第一部分,由于靠近高能量部分,可能会由于转录过程中产生的混响导致第一部分的能量相应升高,因此不再改变第一部分的高低频能量比值,从而排除了转录过程中所产生的回声混响对时域上的能量分部造成的影响,仅仅在远离高能量部分的第二部分通过改变高低频能量比值的方式嵌入小水印。
需要说明的是,上述图5i所示的音频水印,对于低能量部分,在时域上分为两个部分进行处理,在实际工作过程中,本领域技术人员根据实际需要,可以将该低能量部分在时域上分割为更多的部分,对此本申请实施例并不进行限定。
对于播放终端按照方案三所嵌入的音频水印,解析终端在解析时首先解析大水印,大水印的解析方式可参阅上述方案一,之后解析大水印中的小水印,小水印的具体解析方法可参阅上述方案二,大水印与小水印中所解析出的数字一致时,判定水印解析正确,解析终端获取到当前第二目标帧中所嵌入的水印。
需要说明的是,上述方案一至方案三的方案中,播放终端通过改变能量比值的方式实现了二进制的水印嵌入,在实际工作过程中,本领域技术人员可根据实际需要,通过改变能量比值实现其他进制的水印嵌入,例如十进制或12进制,对此本申请实施例并不进行限定。
本实施例中,对于播放终端通过不同方式在第一音频中嵌入的水印,解析终端通过相应的方式进行解析,通过本申请实施例所提供的方式,音频水印具有较好的抗干扰能力,音频水印在转录的过程中不易因为空气录制产生损失,同时解析终端能够准确地解析出播放终端所嵌入的音频水印,使得音频水印的嵌入方案具有较好的稳定性和准确性。
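Note that, taking the audio watermark "14" as an example, the first sequence corresponding to the watermark is "1110". When parsing the watermark from the second target frame, the parsing terminal, in the manner described in step 703 above, first parses the first second sampling point and obtains the first element of the first sequence, "1", which completes the first sub-period of the watermark parsing period. Step 703 is then repeated: the second second sampling point is obtained from the second target frame and the second element, "1", is parsed in the same way. Continuing in this way, the parsing terminal works through a complete watermark parsing period made up of four sub-periods and thereby parses the audio watermark.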
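Optionally, when the parsing terminal determines the first target frame, deviations arise from the offset of the start position and from sampling points lost during recording. These deviations shift the subsequent second target frame accordingly, so that the watermark embedded in the second target frame is affected by the energy of the adjacent frame. The parsing terminal therefore preprocesses the second target frame by removing the energy at both ends. The length removed may be decided by a person skilled in the art as needed, for example trimming 8 sampling points at each end, or by the parsing terminal according to preset logic, for example adjusting the trim length according to the length of the detection period; this is not limited in the embodiments of this application. The resulting second target frame has a preset length trimmed from its front and back, and only the middle part is kept, which removes the interference caused by the deviations.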
本实施例中,通过上述方式,解析终端从第一音频中解析出了音频水印。
1103、解析终端对音频水印进行校验。
本实施例中,作为一种优选的实施方式,音频水印所嵌入的第一数列中包含一个校验位,解析终端根据该校验位对第一数列的完整性进行校验,以确保水印解析的准确性,例如,如上述举例,第一数列为“11101”,播放终端与解析终端之间约定采用奇偶校验的方式,第一数列中的最后一位为校验位,最后一位的数字为1时,代表第一数列中除校验位外包括奇数个数字1。从而解析终端在解析出第一数列后,根据第一数列最后一位的记载,确定所解析出的第一数列中数字1的数量是否为奇数个,即可确定当前第一数列的解析是否完整。
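The verification step can be sketched as follows, mirroring the rule that the last bit is 1 when the remaining bits contain an odd number of 1s; the function name is hypothetical.

```python
def check_parity(bits):
    """Return True when the last bit matches the parity of the bits before it."""
    if len(bits) < 2:
        return False  # no payload to verify
    *payload, parity = bits
    return sum(payload) % 2 == parity
```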
1104、当解析终端完成了对整个第一音频的解析时,解析终端从多个水印检测周期所解析的音频水印中确定重复率最高的一个作为第一音频的水印。
本实施例中,第一音频中包括多组第一目标帧+第二目标帧,其中,每个第一目标帧与第二目标帧组成了一个水印检测周期,每个水印检测周期中均嵌入有相同的音频水印。然而在实际解析的过程中,解析终端可能会出现一些解析错误的情况,导致不是所有水印检测周期解析得到的音频水印都是相同的数列,通过实验观察,当解析出错时,误解析所得到的错误音频水印总是随机且不重复的,因此多个水印检测周期所解析的音频水印中重复率最高的一个,可以确定为正确的音频水印。因此通过这种多周期决策的方式,结合前述奇偶校验的校验方式,准确地解析出第一音频中所嵌入的正确水印,进一步防止解析终端的误解析。
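The multi-period decision can be sketched as a simple majority vote. The sketch assumes each detection period yields either a bit list or `None` when parsing failed; the names are hypothetical.

```python
from collections import Counter

def vote_watermark(decoded_periods):
    """Return the bit sequence that occurs most often across detection periods."""
    valid = [tuple(bits) for bits in decoded_periods if bits is not None]
    if not valid:
        return None
    winner, _count = Counter(valid).most_common(1)[0]
    return list(winner)
```

Because erroneous results tend to differ from one another, the correct sequence only needs to repeat more often than any single wrong one.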
可选地,当解析终端确定第一数列传输完整后,可以根据需要,对第一数列进行进制的转换,例如将第一数列由二进制转化为十进制,最后得到数字14。从而最终完成了对于第一音频中音频水印的校验。
603、解析终端根据音频水印确定播放终端。
本实施例中,由于音频水印与播放终端相关联,因此根据从第一音频中解析的音频水印,即可知晓当前这段第一音频的音频水印是由哪一个播放终端添加的。例如实时会议场景中,会场A、会场B和会场C的播放终端播放的音频内容完全相同,但是各个播放终端在播放相同的音频时各自嵌入的音频水印不同,解析终端通过解析音频水印即可知晓当前第一音频是由哪一个会场的播放终端播放的,从而实现了第一音频的溯源。
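Tracing the parsed sequence back to a terminal can be sketched as a table lookup. The registry contents below are invented for illustration; in the scenario described, the mapping would come from whichever entity assigned the watermarks.

```python
def identify_terminal(bits_with_parity, registry):
    """Drop the check bit, convert the payload to an integer, and look up the terminal."""
    payload = bits_with_parity[:-1]
    watermark_id = int("".join(str(bit) for bit in payload), 2)
    return registry.get(watermark_id)  # None when the watermark is unknown
```

For example, with a registry of `{14: "venue B", 9: "venue C"}`, the sequence "11101" resolves to the terminal registered under 14.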
需要说明的是,本申请实施例所提供的音频水印添加方法和音频水印解析方法可用于各种不同的使用场景中,为便于理解,以下结合附图,对本申请实施例所提供方法的使用场景进行说明。
一、远程会议场景。
本实施例中的架构如图13所示,在如图1所示的架构基础上,多出了一个服务管理中心(service management center,SMC),执行以下步骤。
1301、SMC向播放终端发送音频水印。
本实施例中,播放终端的数量可以为多个,SMC发送的音频水印与各个播放终端一一对应,用于唯一地标记各个播放终端。
1302、播放终端将音频水印存储在本地。
本实施例中,各个播放终端在获取到音频水印后,将音频水印存储在本地,以便后续在获取到实时音频流时将音频水印嵌入音频流。
1303、媒体中心向播放终端发送第一音频。
本实施例中,媒体中心的音频流可以是由某个会场的播放终端生成后发送给媒体中心MCU的,之后媒体中心将该音频流实时发送给各个其他会场的播放终端。
1304、播放终端在第一音频中实时嵌入音频水印。
本实施例中,播放终端通过上述任意一个实施例所提供的音频水印添加方法将存储在本地的音频水印实时嵌入第一音频中。具体可参阅上述记载,此处不再赘述。
1305、播放终端播放第一音频。
本实施例中,播放终端播放的第一音频嵌入有音频水印,整个水印嵌入过程实时进行,不会影响第一音频的直播效果,同时所播放的第一音频可以根据音频水印溯源第一音频的播放终端。
1306、解析终端获取第一音频。
本实施例中,解析终端可以是通过数字信道获取到第一音频,也可以是通过空气信道获取到该第一音频,对于此两种方式所转录得到的第一音频,解析终端均能够进行解析。
1307、解析终端从第一音频中解析音频水印。
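In this embodiment, the parsing terminal parses the audio watermark from the first audio using the audio watermark parsing method provided in any of the foregoing embodiments. For details, see the foregoing description, which is not repeated here.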
1308、解析终端根据音频水印确定播放终端。
本实施例中,由于音频水印与播放终端相关联,因此根据从第一音频中解析的音频水印,即可知晓当前这段第一音频的音频水印是由哪一个播放终端添加的。从而实现了第一音频的溯源。
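In this embodiment, the SMC assigns different venue identifiers (that is, audio watermarks) to the playback terminals of venue A, venue B, and venue C. Each identifier is associated with a venue's playback terminal and uniquely identifies it. After obtaining the live audio stream from the media center (MCU), the playback terminals of venues A, B, and C execute the audio watermark adding method provided in the embodiments of this application, adding their own venue identifier to the played audio stream in real time as a watermark, so that the audio played by each venue's terminal carries an audio watermark. From the venue identifier recorded in the watermark of the re-recorded audio or video, the parsing terminal can determine in which venue the audio was re-recorded, thereby tracing the audio to its source.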
二、云点播场景。
本实施例中,请参阅图14,如图14所示,用户通过终端向云服务器发送点播信息,点播所需要观看的音频或视频内容,云服务根据用户的点播信息向用户所在的终端实时地发送点播内容。具体步骤如下。
1401、用户终端根据用户选择点播的内容生成点播信息。
本实施例中,用户通过用户终端的交互界面选择需要点播的音频或视频内容,并生成点播信息,该点播信息用于记录用户点播的音频或视频内容。
1402、用户终端将点播信息发送给云服务器。
本实施例中,用户终端将点播信息发送给云服务器以使得云服务器知晓用户所需点播的内容。
1403、云服务器根据点播信息获取用户点播的目标内容。
本实施例中,云服务器根据用户的点播信息从数据库中获取用户所点播的目标内容。
1404、云服务器根据用户终端的终端标识生成音频水印。
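In this embodiment, the user terminal is the playback terminal for the audio or video. The audio watermark is associated with the user terminal and uniquely identifies it, which completes acquisition of the audio watermark.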
1405、云服务器将音频水印嵌入目标内容中。
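In this embodiment, the target content may be audio or video; for video, the audio watermark is embedded in the video's audio content. The cloud server may embed the watermark using any of the audio watermark adding methods provided in the embodiments of this application; see the foregoing description for details. Note that in this case the embedding step may be performed by the cloud server, or the cloud server may send the audio watermark to the user terminal and the user terminal performs the embedding; this is not limited in the embodiments of this application. With the method provided in the embodiments of this application, the cloud server can embed the watermark in the audio of the target content in real time while transmitting the target content to the user terminal, which improves efficiency.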
1406、云服务器将水印内容发送给用户终端。
本实施例中,水印内容中的音频或视频内容为用户终端的用户所点播的内容,同时该水印内容中已经添加了音频水印。
1407、用户终端播放水印内容。
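In this embodiment, because the content played by the user terminal already carries the audio watermark, if the user of that terminal re-records the played content, the re-recorded content retains the watermark, and the user terminal that re-recorded it can be traced.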
1408、解析终端获取水印内容。
本实施例中,水印内容可以是嵌入有水印的音频,或者带有音频的视频,该视频中的音频嵌入有水印,解析终端可以是通过数字信道获取到水印内容,也可以是通过空气信道获取到该水印内容,对于此两种方式所转录得到水印内容中的第一音频,解析终端均能够进行解析。
1409、解析终端从第一音频中解析音频水印。
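In this embodiment, the parsing terminal parses the audio watermark from the first audio using the audio watermark parsing method provided in any of the foregoing embodiments. For details, see the foregoing description, which is not repeated here.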
1410、解析终端根据音频水印确定用户终端。
本实施例中,由于音频水印与用户终端相关联,因此根据从第一音频中解析的音频水印,即可知晓当前这段第一音频的音频水印是由哪一个用户终端添加的。从而实现了第一音频的溯源。
综上所述,本申请实施例所提供的水印添加方法和水印解析方法可以应用于各种不同的有音频水印添加和解析需求的场景中,上述两种方式只是一种举例,并不构成对本申请 实施例使用场景的限定。
从硬件结构上来描述,上述方法可以由一个实体设备实现,也可以由多个实体设备共同实现,还可以是一个实体设备内的一个逻辑功能模块,本申请实施例对此不作具体限定。
例如,上述方法可以通过图15中的电子设备来实现。图15为本申请实施例提供的一种电子设备的硬件结构示意图;该电子设备可以是本发明实施例中的播放终端或解析终端,该电子设备包括至少一个处理器1501,通信线路1502,存储器1503以及至少一个通信接口1504。
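The processor 1501 may be a general-purpose central processing unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits for controlling execution of the programs of the solution of this application.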
通信线路1502可包括一通路,在上述组件之间传送信息。
通信接口1504,使用任何收发器一类的装置,用于与其他设备或通信网络通信,如以太网,无线接入网(radio access network,RAN),无线局域网(wireless local area networks,WLAN)等。
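The memory 1503 may be a read-only memory (ROM) or another type of static storage device capable of storing static information and instructions, a random access memory (RAM) or another type of dynamic storage device capable of storing information and instructions, an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other compact-disc storage, optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, and the like), a magnetic disk storage medium or another magnetic storage device, or any other medium that can carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto. The memory may exist independently and be connected to the processor through the communication line 1502, or may be integrated with the processor.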
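The memory 1503 is configured to store computer-executable instructions for executing the solution of this application, and execution is controlled by the processor 1501. The processor 1501 executes the computer-executable instructions stored in the memory 1503 to implement the audio watermark adding method or the audio watermark parsing method provided in the foregoing embodiments of this application.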
可选的,本申请实施例中的计算机执行指令也可以称之为应用程序代码,本申请实施例对此不作具体限定。
在具体实现中,作为一种实施例,处理器1501可以包括一个或多个CPU,例如图15中的CPU0和CPU1。
在具体实现中,作为一种实施例,电子设备可以包括多个处理器,例如图15中的处理器1501和处理器1505。这些处理器中的每一个可以是一个单核(single-CPU)处理器,也可以是一个多核(multi-CPU)处理器。这里的处理器可以指一个或多个设备、电路、和/或用于处理数据(例如计算机程序指令)的处理核。
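In a specific implementation, as an embodiment, the electronic device may further include an output device 1505 and an input device 1506. The output device 1505 communicates with the processor 1501 and can display information in multiple ways; for example, it may be a liquid crystal display (LCD), a light emitting diode (LED) display device, a cathode ray tube (CRT) display device, or a projector. The input device 1506 communicates with the processor 1501 and can receive user input in multiple ways; for example, it may be a mouse, a keyboard, a touchscreen device, or a sensing device.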
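The electronic device described above may be a general-purpose device or a dedicated device. In a specific implementation, the electronic device may be a server, a wireless terminal device, an embedded device, or a device with a structure similar to that in FIG. 15. The embodiments of this application do not limit the type of the electronic device.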
本申请实施例可以根据上述方法示例对电子设备进行功能单元的划分,例如,可以对应各个功能划分各个功能单元,也可以将两个或两个以上的功能集成在一个处理单元中。上述集成的单元既可以采用硬件的形式实现,也可以采用软件功能单元的形式实现。需要说明的是,本申请实施例中对单元的划分是示意性的,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式。
比如,以采用集成的方式划分各个功能单元的情况下,图16示出了本申请实施例所提供的一种播放终端的结构示意图。
如图16所示,本申请实施例所提供的播放终端包括。
获取单元1601,用于实时获取第一音频;
执行单元1602,用于在所述获取单元1601获取的所述第一音频中嵌入音频水印,所述音频水印与所述播放终端相关联;
播放单元1603,用于播放由所述执行单元1602嵌有所述音频水印的所述第一音频。
可选地,该执行单元1602,还用于:
在该第一音频中确定满足第一预设条件的第一目标帧;
在该第一目标帧之后确定满足第二预设条件的第二目标帧,该第一目标帧用于标记该第二目标帧;
在该第二目标帧中嵌入该音频水印。
可选地,该执行单元1602,还用于:
当该第一音频的采样率大于或等于第一阈值时,将低频部分最大值处于第一区间内的音频帧确定为该第一目标帧;或者,
当该第一音频的采样率小于该第一阈值时,确定包含第一特征声音的音频帧为该第一目标帧。
可选地,当该第一音频的采样率大于或等于第一阈值时,该执行单元1602,还用于:
在该第一目标帧中添加同步帧标记。
可选地,该执行单元1602,还用于:
获取第一采样点,该第一采样点为中频部分的采样点;
提升该第一采样点的能量值,以使得该第一采样点的能量值与低频部分能量值的比值大于或等于第二阈值。
Optionally, when the sampling rate of the first audio is less than the first threshold, the execution unit 1602 is further configured to:
当检测到该第一特征声音,且该第一特征声音的持续时间大于或等于预设时间时,将包含该第一特征声音的音频帧确定为该第一目标帧。
可选地,该执行单元1602,还用于:
确定中频部分能量值大于或等于第三阈值且小于第四阈值的目标帧为该第二目标帧。
可选地,该执行单元1602,还用于:
获取该音频水印所对应的第一数列,该第一数列中包括至少一个元素;
obtain at least one second sampling point from the second target frame;
将该第一数列中的至少一个元素分别嵌入该至少一个第二采样点中,其中,该第一数列中的一个元素对应一个第二采样点。
可选地,该执行单元1602,还用于:
调节该第二采样点在不同时域和/或不同频域部分能量值的能量比值,其中,一个该第二采样点的该能量比值与该第一数列中的一个元素相关联。
如图17所示,本申请实施例所提供的解析终端包括。
获取单元1701,用于获取第一音频,该第一音频中嵌有音频水印,该音频水印与播放终端相关联,该播放终端用于将该音频水印实时嵌入该第一音频;
解析单元1702,用于从该获取单元1701获取的该第一音频中解析该音频水印;
执行单元1703,用于根据该解析单元1702解析的该音频水印确定该播放终端。
可选地,该解析单元1702,还用于:
确定该第一音频中满足第一预设条件的第一目标帧;
在该第一目标帧之后确定满足第二预设条件的第二目标帧;
从该第二目标帧中解析该音频水印。
可选地,当该第一音频的采样率小于第一阈值时,该解析单元1702,还用于:
从该第一音频中确定包含有第一特征声音,且该第一特征声音的持续时间大于或等于预设时间的目标帧作为该第一目标帧。
可选地,当该第一音频的采样率大于或等于第一阈值时,该解析单元1702,还用于:
逐帧获取该第一音频中频部分与低频部分能量值的第一比值;
当获取到该第一比值大于或等于第二阈值的初始目标帧时,从该初始目标帧开始通过滑窗方式向后滑动检测该第一音频,以获取每个滑动窗口内中频部分与低频部分能量值的第二比值;
获取该第二比值最大的滑动窗口所在帧为该第一目标帧。
可选地,该第一目标帧中包括同步帧标记,该解析单元1702,还用于:
从该第二比值最大的滑动窗口中获取中频部分能量值最高的第一采样点;
获取距离该第一采样点之前预设长度的第三采样点;
确定该第一采样点能量值与该第三采样点能量值的比值大于第七阈值的部分为该同步帧标记;
根据该同步帧标记确定该第二比值最大的滑动窗口所在帧为该第一目标帧。
可选地,该解析单元1702,还用于:
从该第一目标帧开始以帧为单位向后移动,分别获取每帧中频部分的能量大于或等于第三阈值且小于第四阈值的备选目标帧;
从该备选目标帧中获取不同时域和/或不同频域部分能量值的能量比值大于或等于第 五阈值的目标帧为该第二目标帧。
可选地,该解析单元1702,还用于:
从该第二目标帧中获取第二采样点,该第二采样点为该第二目标帧中该能量比值大于或等于该第五阈值的采样点;
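obtain, for each second sampling point, the energy ratio between the energy values of different time-domain and/or frequency-domain parts;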
获取与该能量比值相关联的第一元素,该第一元素为该音频水印所记录的第一数列中的一个元素。
可选地,该第一音频中包括多个水印检测周期,其中,每个该水印检测周期分别解析出一个该音频水印,该解析单元1702,还用于:
从该多个水印检测周期所解析的音频水印中确定重复率最高的一个作为该第一音频的水印。
本领域技术人员应该可以意识到,在上述一个或多个示例中,本发明所描述的功能可以用硬件、软件、固件或它们的任意组合来实现。当使用软件实现时,可以将这些功能存储在计算机可读介质中或者作为计算机可读介质上的一个或多个指令或代码进行传输。计算机可读介质包括计算机存储介质和通信介质,其中通信介质包括便于从一个地方向另一个地方传送计算机程序的任何介质。存储介质可以是通用或专用计算机能够存取的任何可用介质。
以上所述的具体实施方式,对本发明的目的、技术方案和有益效果进行了进一步详细说明,所应理解的是,以上所述仅为本发明的具体实施方式而已。
所属领域的技术人员可以清楚地了解到,为描述的方便和简洁,上述描述的系统,装置和单元的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。
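In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative: the division into units is only a division by logical function, and other divisions are possible in an actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through interfaces, apparatuses, or units, and may be electrical, mechanical, or in other forms.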
所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。
另外,在本申请各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现,也可以采用软件功能单元的形式实现。
所述集成的单元如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的全部或部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)执行本申请各个实施例所述方法的全部或部分步骤。而前述的存储介质包括:U盘、移动硬盘、只读存储器(英文全称:Read-Only Memory,英文缩写:ROM)、随机存取存储器(英文全称:Random Access Memory,英文缩写:RAM)、磁碟或者光盘等各种可以存储程序代码的介质。
以上所述,以上实施例仅用以说明本申请的技术方案,而非对其限制;尽管参照前述实施例对本申请进行了详细的说明,本领域的普通技术人员应当理解:其依然可以对前述各实施例所记载的技术方案进行修改,或者对其中部分技术特征进行等同替换;而这些修改或者替换,并不使相应技术方案的本质脱离本申请各实施例技术方案的精神和范围。
Claims (21)
- 一种音频水印添加方法,其特征在于,包括:播放终端实时获取第一音频;所述播放终端在所述第一音频中嵌入音频水印,所述音频水印与所述播放终端相关联;所述播放终端播放嵌有所述音频水印的所述第一音频。
- 根据权利要求1所述的方法,其特征在于,所述播放终端在所述第一音频中嵌入音频水印包括:所述播放终端在所述第一音频中确定满足第一预设条件的第一目标帧;所述播放终端在所述第一目标帧之后确定满足第二预设条件的第二目标帧,所述第一目标帧用于标记所述第二目标帧;所述播放终端在所述第二目标帧中嵌入所述音频水印。
- 根据权利要求2所述的方法,其特征在于,所述播放终端在所述第一音频中确定满足第一预设条件的第一目标帧,包括:当所述第一音频的采样率大于或等于第一阈值时,所述播放终端将低频部分最大值处于第一区间内的音频帧确定为所述第一目标帧;或者,当所述第一音频的采样率小于所述第一阈值时,所述播放终端确定包含第一特征声音的音频帧为所述第一目标帧。
- 根据权利要求3所述的方法,其特征在于,当所述第一音频的采样率大于或等于第一阈值时,所述播放终端将低频部分最大值处于第一区间内的音频帧作为所述第一目标帧之后,还包括:所述播放终端在所述第一目标帧中添加同步帧标记。
- 根据权利要求3所述的方法,其特征在于,所述播放终端在所述第一目标帧中添加同步帧标记,包括:所述播放终端获取第一采样点,所述第一采样点为中频部分的采样点;所述播放终端提升所述第一采样点的能量值,以使得所述第一采样点的能量值与低频部分能量值的比值大于或等于第二阈值。
- 根据权利要求3所述的方法,其特征在于,所述当所述第一音频的采样率小于所述第一阈值时,所述播放终端确定包含第一特征声音的音频帧为所述第一目标帧,包括:当检测到所述第一特征声音,且所述第一特征声音的持续时间大于或等于预设时间时,所述播放终端将包含所述第一特征声音的音频帧确定为所述第一目标帧。
- 根据权利要求2至6任一所述的方法,其特征在于,所述播放终端在所述第一目标帧之后确定满足第二预设条件的第二目标帧,包括:所述播放终端确定中频部分能量值大于或等于第三阈值且小于第四阈值的目标帧为所述第二目标帧。
- 根据权利要求2至7任一所述的方法,其特征在于,所述播放终端在所述第三目标帧中嵌入所述音频水印,包括:所述播放终端获取所述音频水印所对应的第一数列,所述第一数列中包括至少一个元 素;所述播放终端从所述第三目标帧中获取至少一个第二采样点;所述播放终端将所述第一数列中的至少一个元素分别嵌入所述至少一个第二采样点中,其中,所述第一数列中的一个元素对应一个第二采样点。
- 根据权利要求8所述的方法,其特征在于,所述播放终端将所述第一数列中的至少一个元素分别加入所述至少一个第二采样点中,包括:所述播放终端调节所述第二采样点在不同时域和/或不同频域部分能量值的能量比值,其中,一个所述第二采样点的所述能量比值与所述第一数列中的一个元素相关联。
- 一种音频水印解析方法,其特征在于,包括:解析终端获取第一音频,所述第一音频中嵌有音频水印,所述音频水印与播放终端相关联,所述播放终端用于将所述音频水印实时嵌入所述第一音频;所述解析终端从所述第一音频中解析所述音频水印;所述解析终端根据所述音频水印确定所述播放终端。
- 根据权利要求10所述的方法,其特征在于,所述解析终端从所述第一音频中解析所述音频水印之前,还包括:所述解析终端确定所述第一音频中满足第一预设条件的第一目标帧;所述解析终端在所述第一目标帧之后确定满足第二预设条件的第二目标帧;所述解析终端从所述第一音频中解析所述音频水印,包括:所述解析终端从所述第二目标帧中解析所述音频水印。
- 根据权利要求11所述的方法,其特征在于,当所述第一音频的采样率小于第一阈值时,所述解析终端确定所述第一音频中满足第一预设条件的第一目标帧,包括:所述解析终端从所述第一音频中确定包含有第一特征声音,且所述第一特征声音的持续时间大于或等于预设时间的目标帧作为所述第一目标帧。
- 根据权利要求11所述的方法,其特征在于,当所述第一音频的采样率大于或等于第一阈值时,所述解析终端确定所述第一音频中满足第一预设条件的第一目标帧;,包括:所述解析终端逐帧获取所述第一音频中频部分与低频部分能量值的第一比值;当所述解析终端获取到所述第一比值大于或等于第二阈值的初始目标帧时,从所述初始目标帧开始通过滑窗方式向后滑动检测所述第一音频,以获取每个滑动窗口内中频部分与低频部分能量值的第二比值;所述解析终端获取所述第二比值最大的滑动窗口所在帧为所述第一目标帧。
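- The method according to claim 13, characterized in that the first target frame comprises a synchronization frame mark, and the obtaining, by the parsing terminal, of the frame in which the sliding window with the largest second ratio is located as the first target frame comprises: obtaining, by the parsing terminal, from the sliding window with the largest second ratio, a first sampling point whose mid-frequency-part energy value is the highest; obtaining, by the parsing terminal, a third sampling point located a preset length before the first sampling point; determining, by the parsing terminal, that a part in which a ratio of the energy value of the first sampling point to an energy value of the third sampling point is greater than a seventh threshold is the synchronization frame mark; and determining, by the parsing terminal, based on the synchronization frame mark, that the frame in which the sliding window with the largest second ratio is located is the first target frame.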
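- The method according to any one of claims 11 to 14, characterized in that the determining, by the parsing terminal, after the first target frame, of a second target frame that meets a second preset condition comprises: moving, by the parsing terminal, frame by frame through the subsequent audio, starting from the first target frame, and obtaining candidate target frames in each of which the mid-frequency-part energy is greater than or equal to a third threshold and less than a fourth threshold; and obtaining, by the parsing terminal, from the candidate target frames, a target frame in which an energy ratio between energy values of different time-domain parts and/or different frequency-domain parts is greater than or equal to a fifth threshold as the second target frame.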
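- The method according to claim 15, characterized in that the parsing, by the parsing terminal, of the audio watermark from the second target frame comprises: obtaining, by the parsing terminal, second sampling points from the second target frame, wherein a second sampling point is a sampling point in the second target frame whose energy ratio is greater than or equal to the fifth threshold; obtaining, by the parsing terminal, for each second sampling point, the energy ratio between energy values of different time-domain parts and/or different frequency-domain parts; and obtaining, by the parsing terminal, a first element associated with the energy ratio, wherein the first element is one element of the first sequence recorded by the audio watermark.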
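- The method according to any one of claims 10 to 16, characterized in that the method comprises a plurality of watermark detection periods, one audio watermark is parsed in each watermark detection period, and the method further comprises: determining, by the parsing terminal, from the audio watermarks parsed in the plurality of watermark detection periods, the one with the highest repetition rate as the watermark of the first audio.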
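- A playback terminal, characterized by comprising: an obtaining unit, configured to obtain first audio in real time; an execution unit, configured to embed an audio watermark into the first audio obtained by the obtaining unit, wherein the audio watermark is associated with the playback terminal; and a playback terminal, configured to play the first audio in which the execution unit has embedded the audio watermark.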
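- A parsing terminal, characterized by comprising: an obtaining unit, configured to obtain first audio, wherein an audio watermark is embedded in the first audio, the audio watermark is associated with a playback terminal, and the playback terminal is configured to embed the audio watermark into the first audio in real time; a parsing unit, configured to parse the audio watermark from the first audio obtained by the obtaining unit; and an execution unit, configured to determine the playback terminal based on the audio watermark parsed by the parsing unit.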
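- An electronic device, characterized in that the electronic device comprises: an interaction apparatus, an input/output (I/O) interface, a processor, and a memory, wherein the memory stores program instructions; the interaction apparatus is configured to obtain an operation instruction entered by a user; and the processor is configured to execute the program instructions stored in the memory, to perform the method according to any one of claims 1 to 9 or 10 to 17.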
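- A computer-readable storage medium, comprising instructions, characterized in that, when the instructions are run on a computer device, the computer device is caused to perform the method according to any one of claims 1 to 9 or 10 to 17.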
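Claims 9, 16 and 17 describe carrying each element of a sequence in an energy ratio and choosing the most frequently decoded watermark across detection periods. The following Python sketch illustrates that idea only and is not the applicant's implementation. It assumes a binary sequence, treats each "second sampling point" as a short segment of time-domain samples, and encodes a bit in the energy ratio between the two halves of that segment. The function names, the `ratio` parameter and its value of 2.0 are all hypothetical; frame selection, synchronization marks and the thresholds named in the claims are left out.

```python
from collections import Counter
import math


def energy(samples):
    """Sum of squared sample values."""
    return sum(s * s for s in samples)


def embed_bit(segment, bit, ratio=2.0):
    """Scale the first half of `segment` so the half-energy ratio encodes `bit`.

    Assumed convention (not from the source):
    bit 1 -> E(first half) / E(second half) = ratio
    bit 0 -> E(first half) / E(second half) = 1 / ratio
    """
    half = len(segment) // 2
    first, second = segment[:half], segment[half:]
    e1, e2 = energy(first), energy(second)
    if e1 == 0 or e2 == 0:
        raise ValueError("both halves of the segment must carry energy")
    target = ratio if bit else 1.0 / ratio
    gain = math.sqrt(target * e2 / e1)
    return [s * gain for s in first] + list(second)


def extract_bit(segment):
    """Recover a bit by comparing the energy of the two halves."""
    half = len(segment) // 2
    return 1 if energy(segment[:half]) > energy(segment[half:]) else 0


def embed_sequence(frame, bits, ratio=2.0):
    """Split `frame` into len(bits) equal segments and embed one bit in each."""
    seg_len = len(frame) // len(bits)
    out = []
    for i, bit in enumerate(bits):
        out.extend(embed_bit(frame[i * seg_len:(i + 1) * seg_len], bit, ratio))
    out.extend(frame[len(bits) * seg_len:])  # leftover samples pass through
    return out


def extract_sequence(frame, n_bits):
    """Read n_bits back from a frame produced by embed_sequence."""
    seg_len = len(frame) // n_bits
    return [extract_bit(frame[i * seg_len:(i + 1) * seg_len])
            for i in range(n_bits)]


def most_frequent_watermark(candidates):
    """Return the watermark decoded most often across detection periods."""
    return Counter(tuple(c) for c in candidates).most_common(1)[0][0]
```

For example, `extract_sequence(embed_sequence(frame, [1, 0, 1, 1]), 4)` should return the embedded bits for any frame whose segment halves all carry energy. A deployed scheme would presumably keep the ratio small enough to stay inaudible and would operate on the mid-frequency band rather than on raw samples.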
Priority Applications (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| EP21874231.0A EP4210049B1 (en) | 2020-09-30 | 2021-09-14 | Audio watermark adding method and device, audio watermark analyzing method and device, and medium |
| US18/192,571 US12518769B2 (en) | 2020-09-30 | 2023-03-29 | Audio watermark addition method, audio watermark parsing method, device, and medium |
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202011066454.0A CN114333859A (zh) | 2020-09-30 | 2020-09-30 | Audio watermark addition method, audio watermark parsing method, device, and medium |
| CN202011066454.0 | 2020-09-30 |
Related Child Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| US18/192,571 Continuation US12518769B2 (en) | 2020-09-30 | 2023-03-29 | Audio watermark addition method, audio watermark parsing method, device, and medium |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2022068570A1 (zh) | 2022-04-07 |
Family
ID=80949653
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/CN2021/118202 Ceased WO2022068570A1 (zh) | Audio watermark addition method, audio watermark parsing method, device, and medium | 2020-09-30 | 2021-09-14 |
Country Status (4)
| Country | Link |
|---|---|
| US (1) | US12518769B2 (zh) |
| EP (1) | EP4210049B1 (zh) |
| CN (1) | CN114333859A (zh) |
| WO (1) | WO2022068570A1 (zh) |
Families Citing this family (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
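| CN115035903B (zh) * | 2022-08-10 | 2022-12-06 | Hangzhou Hikvision Digital Technology Co., Ltd. | Physical voice watermark injection method, voice tracing method, and apparatus |
| CN117037817B (zh) * | 2023-07-11 | 2025-10-21 | Shenzhen University | Deep-learning-based robust speech watermarking method resistant to desynchronization attacks, and terminal |
| US12537803B2 (en) | 2023-09-29 | 2026-01-27 | Bank Of America Corporation | Using tonal bits for secure messaging |
| TWI857825B (zh) * | 2023-10-26 | 2024-10-01 | Taiwan Mobile Co., Ltd. | Streaming audio watermarking system and streaming audio watermarking method |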
Citations (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20030103645A1 (en) * | 1995-05-08 | 2003-06-05 | Levy Kenneth L. | Integrating digital watermarks in multimedia content |
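| CN1462439A (zh) * | 2001-05-08 | 2003-12-17 | Koninklijke Philips Electronics N.V. | Generation and detection of a watermark robust against resampling of an audio signal |
| CN101266794A (zh) * | 2008-03-27 | 2008-09-17 | Shanghai Jiao Tong University | Multiple watermark embedding and extraction method based on echo hiding |
| CN101350198A (zh) * | 2008-08-29 | 2009-01-21 | Xidian University | Bone-conduction-based watermarking method for compressed speech |
| CN101442672A (zh) * | 2007-11-23 | 2009-05-27 | Huawei Technologies Co., Ltd. | Digital watermark processing system, and digital watermark embedding and detection method and apparatus |
| CN106921728A (zh) * | 2016-08-31 | 2017-07-04 | Alibaba Group Holding Limited | Method for locating a user, information push method, and related device |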
Family Cites Families (10)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| KR100611412B1 (ko) * | 2002-10-18 | 2006-08-09 | Myongji University Industry and Academia Cooperation Foundation | Audio watermark embedding and detection method using the masking effect |
| US8099285B2 (en) * | 2007-12-13 | 2012-01-17 | Dts, Inc. | Temporally accurate watermarking system and method of operation |
| EP2083418A1 (en) * | 2008-01-24 | 2009-07-29 | Deutsche Thomson OHG | Method and Apparatus for determining and using the sampling frequency for decoding watermark information embedded in a received signal sampled with an original sampling frequency at encoder side |
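| CN102385862A (zh) * | 2011-09-07 | 2012-03-21 | Wuhan University | Audio digital watermarking method for propagation over an air channel |
| CN105227311B (zh) * | 2014-07-01 | 2020-06-12 | Tencent Technology (Shenzhen) Co., Ltd. | Verification method and system, audio detection method, and processing method |
| US10043527B1 (en) * | 2015-07-17 | 2018-08-07 | Digimarc Corporation | Human auditory system modeling with masking energy adaptation |
| CN105392022B (zh) * | 2015-11-04 | 2019-01-18 | Beijing Fujing Data Service Co., Ltd. | Information interaction method and apparatus based on audio watermarking |
| JP7013093B2 (ja) * | 2018-05-01 | 2022-01-31 | Alpine Electronics, Inc. | Failure detection device, mobile-body-mounted device, and failure detection method |
| CN111199745A (zh) * | 2018-11-20 | 2020-05-26 | Nielsen-CCData Media Data Service Co., Ltd. | Advertisement identification method, device, media platform, terminal, server, and medium |
| CN111292756B (zh) * | 2020-01-19 | 2023-05-26 | Chengdu Potential Artificial Intelligence Technology Co., Ltd. | Compression-resistant inaudible audio watermark embedding and extraction method and system |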
- 2020-09-30: CN, application CN202011066454.0A, publication CN114333859A (zh), status: Pending
- 2021-09-14: EP, application EP21874231.0A, publication EP4210049B1 (en), status: Active
- 2021-09-14: WO, application PCT/CN2021/118202, publication WO2022068570A1 (zh), status: Ceased
- 2023-03-29: US, application US18/192,571, publication US12518769B2 (en), status: Active
Patent Citations (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20030103645A1 (en) * | 1995-05-08 | 2003-06-05 | Levy Kenneth L. | Integrating digital watermarks in multimedia content |
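| CN1462439A (zh) * | 2001-05-08 | 2003-12-17 | Koninklijke Philips Electronics N.V. | Generation and detection of a watermark robust against resampling of an audio signal |
| CN101442672A (zh) * | 2007-11-23 | 2009-05-27 | Huawei Technologies Co., Ltd. | Digital watermark processing system, and digital watermark embedding and detection method and apparatus |
| CN101266794A (zh) * | 2008-03-27 | 2008-09-17 | Shanghai Jiao Tong University | Multiple watermark embedding and extraction method based on echo hiding |
| CN101350198A (zh) * | 2008-08-29 | 2009-01-21 | Xidian University | Bone-conduction-based watermarking method for compressed speech |
| CN106921728A (zh) * | 2016-08-31 | 2017-07-04 | Alibaba Group Holding Limited | Method for locating a user, information push method, and related device |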
Non-Patent Citations (1)
| Title |
|---|
| See also references of EP4210049A4 |
Also Published As
| Publication number | Publication date |
|---|---|
| US20230238008A1 (en) | 2023-07-27 |
| CN114333859A (zh) | 2022-04-12 |
| EP4210049A4 (en) | 2024-02-21 |
| EP4210049B1 (en) | 2025-11-05 |
| US12518769B2 (en) | 2026-01-06 |
| EP4210049A1 (en) | 2023-07-12 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
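| WO2022068570A1 (zh) | | Audio watermark addition method, audio watermark parsing method, device, and medium |
| US11477156B2 (en) | | Watermarking and signal recognition for managing and sharing captured content, metadata discovery and related arrangements |
| CN109819338B (zh) | | Automatic video editing method and apparatus, and portable terminal |
| CN101212648B (zh) | | Method and device for synchronizing a data stream of content with metadata |
| CN103561229B (zh) | | Conference tag generation and application method, apparatus, and system |
| JP2014516189A (ja) | | Method and system for performing a comparison of received data and providing a follow-on service based on the comparison |
| CN107705785A (zh) | | Sound source localization method for a smart speaker, smart speaker, and computer-readable medium |
| US20160065791A1 (en) | | Sound image play method and apparatus |
| CN114071184B (zh) | | Subtitle positioning method, electronic device, and medium |
| WO2015070682A1 (zh) | | Playback control processing method and apparatus for an audio file, and storage medium |
| US12495269B2 (en) | | Method and apparatus for low complexity low bitrate 6DoF HOA rendering |
| US20160196631A1 (en) | | Hybrid Automatic Content Recognition and Watermarking |
| CN115022710B (zh) | | Video processing method, device, and readable storage medium |
| CN107040728A (zh) | | Video timeline generation method and apparatus, and user equipment |
| CN116233411A (zh) | | Audio-video synchronization test method, apparatus, device, and computer storage medium |
| CN116955693A (zh) | | Audio-visual saliency detection method based on audio-visual consistency perception |
| Suzuki et al. | | AnnoTone: Record-time audio watermarking for context-aware video editing |
| CN116193160A (zh) | | Digital watermark embedding method, apparatus, device, and medium |
| CN116980716A (zh) | | Video processing method, apparatus, device, and storage medium |
| CN104079948B (zh) | | Method and apparatus for generating a ringtone file |
| Pan et al. | | Audio-Video Dual-Modality Robust Watermarking |
| JP2026502251A (ja) | | Virtual speaker determination method and related apparatus |
| CN120378682A (zh) | | Media file recording method, media file analysis method, and related apparatus |
| CN116170720A (zh) | | Data transmission method, apparatus, electronic device, and storage medium |
| CN121056700A (zh) | | Latency determination method, apparatus, device, storage medium, and computer program product |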
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 21874231; Country of ref document: EP; Kind code of ref document: A1 |
| | ENP | Entry into the national phase | Ref document number: 2021874231; Country of ref document: EP; Effective date: 20230406 |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | WWG | WIPO information: grant in national office | Ref document number: 2021874231; Country of ref document: EP |