WO2016206273A1 - Method for acquiring an activation tone correction frame number, activation tone detection method, and device - Google Patents

Method for acquiring an activation tone correction frame number, activation tone detection method, and device

Publication number
WO2016206273A1
Authority
WO
WIPO (PCT)
Prior art keywords
frame
parameter
background noise
signal
activation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/CN2015/093889
Other languages
English (en)
French (fr)
Inventor
朱长宝
袁浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
ZTE Corp
Original Assignee
ZTE Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by ZTE Corp
Priority to RU2017145122A (published as RU2684194C1)
Priority to EP25180489.4A (published as EP4641568A3)
Priority to JP2017566850A (published as JP6635440B2)
Priority to CA2990328A (published as CA2990328C)
Priority to US15/577,343 (published as US10522170B2)
Priority to EP15896160.7A (published as EP3316256A4)
Priority to KR1020177036055A (published as KR102042117B1)
Publication of WO2016206273A1
Anticipated expiration
Legal status: Ceased

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/02 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using spectral analysis, e.g. transform vocoders or subband vocoders
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/60 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for measuring the quality of voice signals
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 Detection of presence or absence of voice signals
    • G10L25/81 Detection of presence or absence of voice signals for discriminating voice from music
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 Detection of presence or absence of voice signals
    • G10L25/84 Detection of presence or absence of voice signals for discriminating voice from noise
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/012 Comfort noise or silence coding

Definitions

  • This application relates to, but is not limited to, the field of communications.
  • VAD: Voice Activity Detection
  • AMR: Adaptive Multi-Rate
  • AMR-WB: Adaptive Multi-Rate Wideband
  • The VAD of these encoders does not perform well under all typical background noise conditions. In particular, their VAD efficiency is low for non-stationary noise. For music signals, these VADs sometimes make detection errors, which significantly degrades the quality of the corresponding processing algorithms.
  • The embodiments of the invention provide a method for acquiring an activation sound correction frame number, an activation sound detection method, and a device, so as to solve the problem of low accuracy in activation sound detection (VAD).
  • VAD: activation sound detection
  • An embodiment of the present invention provides a method for acquiring an activation tone correction frame number, where the method includes:
  • the obtaining an activation tone detection decision result of the current frame includes:
  • the frame energy parameter is a weighted superposition value or a direct superposition value of each sub-band signal energy
  • the spectral center of gravity feature parameter is a ratio of a weighted accumulated value of all or part of the subband signal energy to an unweighted accumulated value, or a value obtained by smoothing the ratio;
  • the time domain stability characteristic parameter is the ratio of the variance of the amplitude superposition value to the expectation of the square of the amplitude superposition value, or that ratio multiplied by a coefficient;
  • the spectral flatness characteristic parameter is a ratio of a geometric mean of the predetermined plurality of spectral magnitudes to an arithmetic mean, or the ratio is multiplied by a coefficient;
  • the tonal feature parameter is obtained by calculating the correlation value of the intra-frame spectral difference coefficients of two adjacent frames, or by further smooth-filtering that correlation value.
  • the calculating, according to the tonality flag, the signal to noise ratio parameter, the spectral center of gravity feature parameter, and the frame energy parameter, the activation sound detection decision result includes:
  • the obtaining, according to the activation tone detection decision result of the current frame, the number of background noise update times, and the number of the activation tone holding frames, obtaining the number of activated sound correction frames includes:
  • the activation sound correction frame number is the maximum of a constant and the activation sound retention frame number.
  • the obtaining the activation tone holding frame number includes:
  • the calculating the long-term signal-to-noise ratio and the average full-band signal-to-noise ratio according to the sub-band signal includes:
  • calculating the long-term signal-to-noise ratio as the ratio of the average long-term active tone signal energy to the average long-term background noise energy, both calculated in the previous frame of the current frame; and averaging the full-band signal-to-noise ratios of the plurality of frames closest to the current frame to obtain the average full-band signal-to-noise ratio.
  • the precondition for correcting the current active tone holding frame number is that the activation tone flag indicates that the current frame is an active tone frame.
  • the correcting the number of currently activated tone keeping frames to obtain the number of the activated tone keeping frames includes:
  • the activation tone holding frame number is set equal to the minimum number of consecutive active tone frames minus the number of consecutive speech frames; if the average full-band signal-to-noise ratio is greater than a set threshold value and the number of consecutive speech frames is greater than a set second threshold value, the activation tone holding frame number is set according to the size of the long-term signal-to-noise ratio.
  • the obtaining the number of background noise updates includes:
  • the calculating the number of background noise update times according to the background noise update identifier includes:
  • if the background noise update identifier indicates that the current frame is background noise and the number of background noise updates is less than a set threshold, the number of background noise updates is incremented by one.
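The saturating counter described in the preceding item can be sketched as follows. This is a minimal illustration only; the function name `update_bg_noise_count` and the parameter `max_count` (standing in for the "set threshold") are hypothetical and do not come from the patent text.

```python
def update_bg_noise_count(count, is_background, max_count):
    """Increment the background noise update count, saturating at max_count.

    count         -- number of background noise updates so far
    is_background -- background noise update identifier for the current frame
    max_count     -- the "set threshold" (hypothetical name)
    """
    if is_background and count < max_count:
        return count + 1
    return count


# The counter only grows on background noise frames and never exceeds max_count.
count = 0
for flag in [True, True, False, True, True]:
    count = update_bg_noise_count(count, flag, max_count=3)
```

Capping the counter keeps it usable as a "noise estimate has had time to settle" indicator without letting it grow without bound.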
  • the obtaining the background noise update identifier includes:
  • the frame energy parameter is a weighted superposition value or a direct superposition value of each sub-band signal energy
  • the spectral center of gravity feature parameter is a ratio of a weighted accumulated value of all or part of the subband signal energy to an unweighted accumulated value, or a value obtained by smoothing the ratio;
  • the time domain stability characteristic parameter is the ratio of the variance of the frame energy amplitude to the expectation of the square of the amplitude superposition value, or that ratio multiplied by a coefficient;
  • the spectral flatness parameter is a ratio of a geometric mean of the predetermined plurality of spectral magnitudes to an arithmetic mean, or the ratio is multiplied by a coefficient.
  • the background noise update identifier includes:
  • the time domain stability characteristic parameter is greater than a set threshold value
  • the smoothed filter value of the spectral gravity center feature parameter value is greater than a set threshold value, and the time domain stability feature parameter value is also greater than a set threshold value;
  • the smoothed filtered value of the tonal characteristic parameter or the tonal characteristic parameter is greater than a set threshold value, and the time domain stability characteristic parameter value is greater than a set threshold value;
  • the spectral flatness characteristic parameter of each sub-band, or its smoothed filtered value, is less than its corresponding set threshold value;
  • the value of the frame energy parameter is greater than a set threshold.
  • the embodiment of the invention provides an activation sound detection method, and the method includes:
  • the number of sound-holding frames is calculated by the number of activated sound correction frames
  • the activation sound detection decision result is calculated according to the activation sound correction frame number and the second activation sound detection determination result.
  • the calculating the activation sound detection decision result according to the activation sound correction frame number and the second activation sound detection determination result includes:
  • the activation sound detection determination result is set as an active sound frame, and the number of active tone correction frames is reduced by 1.
  • the obtaining the first activation tone detection decision result includes:
  • the frame energy parameter is a weighted superposition value or a direct superimposed value of each sub-band signal energy
  • the spectral center of gravity feature parameter is a ratio of a weighted accumulated value of all or part of the subband signal energy to an unweighted accumulated value, or a value obtained by smoothing the ratio;
  • the time domain stability characteristic parameter is the ratio of the variance of the amplitude superposition value to the expectation of the square of the amplitude superposition value, or that ratio multiplied by a coefficient;
  • the spectral flatness characteristic parameter is a ratio of a geometric mean of the predetermined plurality of spectral magnitudes to an arithmetic mean, or the ratio is multiplied by a coefficient;
  • the tonal feature parameter is obtained by calculating the correlation value of the intra-frame spectral difference coefficients of two adjacent frames, or by further smooth-filtering that correlation value.
  • the calculating, according to the tonality flag, the signal to noise ratio parameter, the spectral center of gravity feature parameter, and the frame energy parameter, the first activated sound detection decision result includes:
  • Calculating a long-term signal-to-noise ratio by calculating a ratio of an average long-term active tone signal energy calculated by a previous frame of the current frame to an average long-term background noise energy;
  • the obtaining the activation tone holding frame number includes:
  • the calculating the long-term signal-to-noise ratio and the average full-band signal-to-noise ratio according to the sub-band signal includes:
  • Calculating the long-term signal to noise ratio by using a ratio of the average long-term activated sound signal energy calculated by the previous frame of the current frame to the average long-term background noise energy; calculating a plurality of the closest to the current frame The average of the full band signal to noise ratio of the frame results in the average full band signal to noise ratio.
  • the precondition for correcting the current active tone holding frame number is that the activation tone flag indicates that the current frame is an active tone frame.
  • the correcting the current active tone holding frame number includes: if the number of consecutive speech frames is less than a set first threshold value and the long-term signal-to-noise ratio is less than a set threshold value, the active tone holding frame number is set equal to the minimum number of consecutive active tone frames minus the number of consecutive speech frames; if the average full-band signal-to-noise ratio is greater than a set second threshold value and the number of consecutive speech frames is greater than a set threshold value, the active tone holding frame number is set according to the size of the long-term signal-to-noise ratio.
  • the obtaining the number of background noise updates includes:
  • the calculating the number of background noise update times according to the background noise update identifier includes:
  • if the background noise update identifier indicates that the current frame is background noise and the number of background noise updates is less than a set threshold, the number of background noise updates is incremented by one.
  • the obtaining the background noise update identifier includes:
  • the sign noise, the tonal feature parameter, and the frame energy parameter perform background noise detection to obtain the background noise update identifier.
  • the frame energy parameter is a weighted superposition value or a direct superimposed value of each sub-band signal energy
  • the spectral center of gravity feature parameter is a ratio of a weighted accumulated value of all or part of the subband signal energy to an unweighted accumulated value, or a value obtained by smoothing the ratio;
  • the time domain stability characteristic parameter is the ratio of the variance of the frame energy amplitude to the expectation of the square of the amplitude superposition value, or that ratio multiplied by a coefficient;
  • the spectral flatness parameter is a ratio of a geometric mean of the predetermined plurality of spectral magnitudes to an arithmetic mean, or the ratio is multiplied by a coefficient.
  • the background noise update identifier includes:
  • the time domain stability characteristic parameter is greater than a set threshold value
  • the smoothed filter value of the spectral gravity center feature parameter value is greater than a set threshold value, and the time domain stability feature parameter value is also greater than a set threshold value;
  • the smoothed filtered value of the tonal characteristic parameter or the tonal characteristic parameter is greater than a set threshold, and the time domain stability characteristic parameter value is greater than a set threshold;
  • the spectral flatness characteristic parameter of each sub-band, or its smoothed filtered value, is less than its corresponding set threshold value;
  • the value of the frame energy parameter is greater than a set threshold.
  • the calculating the number of the activated sound correction frames according to the first activation sound detection determination result, the background noise update times, and the activation sound retention frame number includes:
  • the number of the activated sound correction frames is a constant and the maximum value of the number of the active sound holding frames.
  • An embodiment of the present invention provides an apparatus for acquiring an activation tone correction frame number, where the apparatus includes:
  • a first acquiring unit configured to: obtain an activation sound detection decision result of the current frame
  • a second obtaining unit configured to: obtain an activation tone holding frame number
  • a third obtaining unit configured to: obtain a background noise update number
  • a fourth acquiring unit configured to: obtain an activation sound correction frame number according to the activation sound detection determination result of the current frame, the background noise update number, and the activation sound retention frame number.
  • An embodiment of the present invention provides an activation tone detecting apparatus, where the apparatus includes:
  • a fifth obtaining unit configured to: obtain a first activated sound detection decision result
  • a sixth obtaining unit configured to: obtain an activation tone holding frame number
  • the seventh obtaining unit is configured to: obtain the number of background noise updates
  • a first calculating unit configured to: calculate an activation sound correction frame number according to the first activation sound detection determination result, the background noise update number, and the activation sound retention frame number;
  • An eighth obtaining unit configured to: obtain a second activated sound detection decision result
  • the second calculating unit is configured to: calculate the activation sound detection decision result according to the activation sound correction frame number and the second activation sound detection determination result.
  • a computer readable storage medium storing computer executable instructions for performing the method of any of the above.
  • An embodiment of the present invention provides a method for acquiring an activation tone correction frame number, an activation tone detection method, and a device. A first activation tone detection decision result, an activation tone hold frame number, and a background noise update number are obtained first; the activation tone correction frame number is then calculated from these three values; a second activation tone detection decision result is obtained; and finally the activation tone detection decision result is calculated from the activation tone correction frame number and the second activation tone detection decision result. In this way the accuracy of VAD detection can be improved.
  • FIG. 1 is a schematic flowchart of a method for detecting an activated tone according to Embodiment 1 of the present invention
  • FIG. 2 is a schematic diagram of a process of obtaining a VAD decision result according to Embodiment 1 of the present invention
  • FIG. 3 is a schematic flowchart of a background noise detecting method according to Embodiment 2 of the present invention.
  • FIG. 4 is a schematic flowchart of a method for correcting a current active tone holding frame number in a VAD decision according to Embodiment 3 of the present invention
  • FIG. 5 is a schematic flowchart of a method for acquiring an activation tone correction frame number according to Embodiment 4 of the present invention.
  • FIG. 6 is a schematic structural diagram of an apparatus for acquiring an activated sound correction frame number according to Embodiment 4 of the present invention.
  • FIG. 7 is a schematic flowchart of a method for detecting an activated tone according to Embodiment 5 of the present invention.
  • FIG. 8 is a schematic structural diagram of an activated sound detecting apparatus according to Embodiment 5 of the present invention.
  • the embodiment of the invention provides an activation sound detection method, as shown in FIG. 1 , the method includes:
  • Step 101 Obtain a subband signal and a spectrum amplitude of a current frame.
  • an audio stream with a frame length of 20 ms and a sampling rate of 32 kHz is taken as an example.
  • the method described herein is equally applicable to other frame lengths and sampling rates.
  • a 40-channel filter bank is used.
  • the input audio signal is $s_{HP}(n)$
  • $L_C$ is 40
  • $w_c$ is a window function
  • the window length is $10 L_C$
  • the sub-band signal is $X(k,l) = X_{CR}(l,k) + i \cdot X_{CI}(l,k)$
  • $X_{CR}$ and $X_{CI}$ are the real and imaginary parts of the sub-band signal.
  • the subband signals are calculated as follows:
  • Time-frequency transform is performed on the filter group sub-band signal, and the spectrum amplitude is calculated.
  • Embodiments of the present invention can be implemented by performing time-frequency transform on all filter bank sub-bands or partial filter bank sub-bands and calculating spectrum amplitudes.
  • the time-frequency transform method in the embodiment of the present invention may be a Discrete Fourier Transform (DFT), a Fast Fourier Transform (FFT), a Discrete Cosine Transform (DCT), or a Discrete Sine Transform (DST).
  • DFT: Discrete Fourier Transform
  • FFT: Fast Fourier Transform
  • DCT: Discrete Cosine Transform
  • DST: Discrete Sine Transform
  • the time-frequency conversion equation is as follows:
  • $X_{DFT\_POW}[k,j] = (\mathrm{Re}(X_{DFT}[k,j]))^2 + (\mathrm{Im}(X_{DFT}[k,j]))^2;\quad 0 \le k < 10,\ 0 \le j < 16$
  • Re and Im respectively denote the real part and the imaginary part of the spectral coefficient $X_{DFT}[k,j]$.
  • $A_{sp}$ is the spectrum amplitude after the time-frequency transform.
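The power computation above (real part squared plus imaginary part squared of each spectral coefficient) can be illustrated with a short sketch. The naive `dft` helper, the function names, and the 16-sample test input are assumptions for illustration; the text allows any of several transforms and does not prescribe this implementation.

```python
import cmath


def dft(x):
    """Naive discrete Fourier transform of a complex sequence (O(n^2))."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * j * t / n) for t in range(n))
        for j in range(n)
    ]


def power_spectrum(coeffs):
    """Return Re^2 + Im^2 for each spectral coefficient."""
    return [c.real ** 2 + c.imag ** 2 for c in coeffs]


# Hypothetical input: 16 time slots of one filter bank sub-band, all equal to 1.
subband = [complex(1.0, 0.0)] * 16
spectrum = power_spectrum(dft(subband))
```

For this constant input all of the power lands in bin 0 and the remaining bins are zero up to rounding error.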
  • Step 102 Calculate a frame energy parameter, a spectral center of gravity characteristic parameter, and a time domain stability characteristic parameter value of the current frame according to the subband signal, and calculate a value of the spectral flatness characteristic parameter and the tonal characteristic parameter according to the spectrum amplitude.
  • the frame energy parameter is a weighted superposition value or a direct superposition value of each sub-band signal energy, wherein:
  • $E_C(t,k) = (X_{CR}(t,k))^2 + (X_{CI}(t,k))^2,\quad 0 \le t \le 15,\ 0 \le k < L_C$.
  • the human ear is relatively insensitive to very low frequency (such as below 100 Hz) and high frequency (such as above 20 kHz) sound.
  • the filter bank sub-bands are arranged in order of frequency from low to high.
  • the sub-bands from the second to the penultimate are the main filter bank sub-bands to which hearing is sensitive. The energies of some or all of these hearing-sensitive sub-bands are accumulated to obtain frame energy parameter 1, which is calculated as follows:
  • E_sb_start is the starting subband index, and its value ranges from [0, 6].
  • E_sb_end is the end subband index, which takes a value greater than 6, less than the total number of subbands.
  • frame energy parameter 2: adding to frame energy parameter 1 the weighted energies of some or all of the filter bank sub-bands not used in calculating frame energy parameter 1 yields frame energy parameter 2, which is calculated as follows:
  • Num_band is the total number of subbands.
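The two frame energy parameters described above can be sketched as follows. This is an illustration under stated assumptions, not the patent's equations: the end index is treated as inclusive, the weights `low_scale` and `high_scale` are hypothetical, and all function names are invented for the example.

```python
def subband_energies(x_cr, x_ci):
    """Sum (X_CR^2 + X_CI^2) over all time slots for each sub-band k."""
    num_slots, num_band = len(x_cr), len(x_cr[0])
    return [
        sum(x_cr[t][k] ** 2 + x_ci[t][k] ** 2 for t in range(num_slots))
        for k in range(num_band)
    ]


def frame_energy_1(e_sb, sb_start, sb_end):
    """Direct sum over the hearing-sensitive sub-bands (sb_end inclusive: an assumption)."""
    return sum(e_sb[sb_start:sb_end + 1])


def frame_energy_2(e_sb, sb_start, sb_end, low_scale, high_scale):
    """Frame energy 1 plus the weighted energy of the sub-bands it left out.

    low_scale and high_scale are hypothetical weights, not values from the text.
    """
    unused_low = sum(e_sb[:sb_start])
    unused_high = sum(e_sb[sb_end + 1:])
    return (frame_energy_1(e_sb, sb_start, sb_end)
            + low_scale * unused_low + high_scale * unused_high)


# 16 time slots x 40 sub-bands of a constant test signal.
x_cr = [[1.0] * 40 for _ in range(16)]
x_ci = [[0.0] * 40 for _ in range(16)]
e_sb = subband_energies(x_cr, x_ci)
e1 = frame_energy_1(e_sb, sb_start=1, sb_end=38)
e2 = frame_energy_2(e_sb, sb_start=1, sb_end=38, low_scale=0.5, high_scale=0.5)
```

Here the start index 1 and end index 38 select the second through penultimate of 40 sub-bands, matching the description of the hearing-sensitive range.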
  • the spectral center of gravity feature parameter is a ratio of a weighted accumulated value of all or a portion of the subband signal energy to an unweighted accumulated value, wherein:
  • the spectral center-of-gravity characteristic parameter is calculated according to the energy of each filter bank sub-band.
  • the spectral center-of-gravity characteristic parameter is the ratio of the weighted sum of the filter bank sub-band energies to the directly added (unweighted) sum of the sub-band energies, or a value obtained by smooth-filtering other spectral center-of-gravity characteristic parameter values.
  • the spectral center of gravity feature parameters can be implemented using the following substeps:
  • the two spectral center-of-gravity characteristic parameter values are calculated, which are the first interval spectral center-of-gravity characteristic parameter and the second interval spectral gravity center characteristic parameter.
  • Delta1 and Delta2 are each a small offset value in the range (0, 1), and k is the index of the spectral center of gravity.
  • $sp\_center[2] = sp\_center_{-1}[2] \cdot spc\_sm\_scale + sp\_center[0] \cdot (1 - spc\_sm\_scale)$
  • spc_sm_scale is the smoothing filter scale factor for the spectral center-of-gravity parameter
  • $sp\_center_{-1}[2]$ denotes the smoothed spectral center-of-gravity characteristic parameter value of the previous frame; its initial value is 1.6.
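A minimal sketch of the spectral center-of-gravity ratio and its first-order smoothing follows. The index weighting `(n + 1)`, the placement of the offsets Delta1 and Delta2 in the numerator and denominator, and the value 0.9 for `spc_sm_scale` are assumptions for illustration; only the smoothing recursion and the initial value 1.6 come from the text.

```python
def spectral_centroid(e_sb, delta1=0.5, delta2=0.5):
    """Weighted sub-band energy sum divided by the unweighted sum.

    The (n + 1) weighting and the offset placement are assumptions.
    """
    weighted = sum((n + 1) * e for n, e in enumerate(e_sb))
    return (weighted + delta1) / (sum(e_sb) + delta2)


def smooth_centroid(prev_smoothed, current, spc_sm_scale):
    """First-order recursive smoothing, as in the sp_center[2] recursion."""
    return prev_smoothed * spc_sm_scale + current * (1.0 - spc_sm_scale)


centroid = spectral_centroid([1.0, 1.0, 1.0, 1.0])
# 1.6 is the stated initial value; 0.9 is a hypothetical scale factor.
smoothed = smooth_centroid(1.6, centroid, spc_sm_scale=0.9)
```

A scale factor close to 1 makes the smoothed value follow the per-frame value slowly, which suppresses frame-to-frame jitter.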
  • the time domain stability characteristic parameter is the ratio of the variance of the amplitude superposition value to the expectation of the square of the amplitude superposition value, or that ratio multiplied by a coefficient, wherein:
  • the time domain stability characteristic parameter is calculated from the latest frame energy parameters of the plurality of frame signals.
  • the time domain stability characteristic parameter is calculated using the frame energy parameters of the latest 40 frames. The calculation steps are:
  • e_offset is an offset value, which ranges from [0, 0.1].
  • $Amp_{t2}(n) = Amp_{t1}(-2n) + Amp_{t1}(-2n-1);\quad 0 \le n < 20$;
  • when n = 0, $Amp_{t1}$ denotes the energy amplitude of the current frame
  • when n < 0, $Amp_{t1}$ denotes the energy amplitude of the n-th frame before the current frame
  • the time domain stability characteristic parameter ltd_stable_rate0 is obtained by calculating the ratio of the variance of the 20 amplitude superposition values closest to the current frame to the average energy. It is calculated as follows:
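The steps above can be sketched as follows. Several details are assumptions made for illustration: the energy amplitude is taken as the square root of the frame energy plus `e_offset`, and the "average energy" is taken as the mean of the squared amplitude superposition values. The pairwise summation of adjacent frames follows the stated recursion.

```python
import math


def time_domain_stability(frame_energies, e_offset=0.05):
    """Variance of 20 pairwise amplitude sums divided by their mean square.

    frame_energies -- energies of the latest 40 frames, newest first
    e_offset       -- small offset in [0, 0.1]; where it is applied is an assumption
    """
    if len(frame_energies) != 40:
        raise ValueError("expected the frame energies of the latest 40 frames")
    amps = [math.sqrt(e) + e_offset for e in frame_energies]
    # Amp_t2(n) = Amp_t1(-2n) + Amp_t1(-2n-1); index 0 is the current frame.
    pairs = [amps[2 * n] + amps[2 * n + 1] for n in range(20)]
    mean = sum(pairs) / 20
    variance = sum((p - mean) ** 2 for p in pairs) / 20
    mean_square = sum(p * p for p in pairs) / 20
    return variance / mean_square if mean_square > 0 else 0.0


steady = time_domain_stability([4.0] * 40)
stepped = time_domain_stability([1.0] * 20 + [9.0] * 20, e_offset=0.0)
```

A stationary signal gives a value near zero, while a signal whose level changes within the 40-frame window gives a larger value, which is why the parameter is useful for separating noise from speech or music onsets.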
  • the spectral flatness characteristic parameter is a ratio of a geometric mean of the predetermined plurality of spectral magnitudes to an arithmetic mean, or the ratio is multiplied by a coefficient.
  • N A is the number of spectral amplitudes.
  • the predetermined plurality of spectrums in the embodiment of the present invention may be a part of the spectrum selected according to the experience of the technician, or may be a part of the spectrum selected according to the actual situation.
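The geometric-to-arithmetic mean ratio described above can be sketched as follows. The function name is hypothetical, strictly positive amplitudes are assumed so that the logarithm is defined, and the optional coefficient mentioned in the text is left out.

```python
import math


def spectral_flatness(amplitudes):
    """Geometric mean divided by arithmetic mean of positive spectral amplitudes."""
    n = len(amplitudes)
    geometric = math.exp(sum(math.log(a) for a in amplitudes) / n)
    arithmetic = sum(amplitudes) / n
    return geometric / arithmetic


flat = spectral_flatness([2.0, 2.0, 2.0, 2.0])   # perfectly flat spectrum
peaky = spectral_flatness([1.0, 4.0])            # uneven spectrum
```

The ratio is 1 for a flat spectrum and falls toward 0 as energy concentrates in a few bins, so noise-like frames score high and tonal frames score low.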
  • the spectrum amplitude is divided into three frequency bands, and the spectral flatness characteristics of the three frequency bands are calculated.
  • the division manner is as follows:
  • the tonal feature parameter is obtained by calculating the correlation value of the intra-frame spectral difference coefficients of two adjacent frames, or by further smooth-filtering that correlation value.
  • the tonal characteristic parameters are calculated according to the spectral amplitude, wherein the tonal characteristic parameters can be calculated according to all spectral amplitudes or partial spectral amplitudes.
  • a part (not less than 8 spectral coefficients) or all spectral amplitudes are compared with adjacent spectral amplitudes, and the value of the differential result less than 0 is set to 0, resulting in a set of non-negative spectral differential coefficients:
  • where the index 0 denotes the current frame; the calculation is as follows:
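The patent's equation is not reproduced in this text, so the following is only a sketch of the idea described: form non-negative spectral difference coefficients within each frame, then correlate the coefficients of the current and previous frames. The direction of the difference (next bin minus current bin) and the use of a normalized correlation are assumptions, and the function names are hypothetical.

```python
import math


def spectral_difference(amplitudes):
    """Difference between adjacent spectral amplitudes, with negatives set to 0."""
    return [
        max(amplitudes[i + 1] - amplitudes[i], 0.0)
        for i in range(len(amplitudes) - 1)
    ]


def tonality_rate(diff_cur, diff_prev):
    """Normalized correlation of two frames' spectral difference coefficients."""
    num = sum(a * b for a, b in zip(diff_cur, diff_prev))
    den = math.sqrt(sum(a * a for a in diff_cur) * sum(b * b for b in diff_prev))
    return num / den if den > 0 else 0.0


cur = spectral_difference([1.0, 3.0, 2.0, 5.0])    # [2.0, 0.0, 3.0]
prev = spectral_difference([3.0, 4.0, 1.0, 1.0])   # [1.0, 0.0, 0.0]
same = tonality_rate(cur, cur)
different = tonality_rate(cur, prev)
```

Spectral peaks that stay in the same place from frame to frame, as in tonal music, give a correlation near 1, while frames with unrelated spectra give a lower value.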
  • Step 103 Calculate a signal to noise ratio parameter of the current frame according to the background noise energy obtained in the previous frame of the current frame, the frame energy parameter of the current frame, and the signal to noise ratio subband energy.
  • the background noise energy of the previous frame of the current frame can be obtained by an existing method.
  • the signal-to-noise ratio sub-band background noise energy uses a default initial value. The signal-to-noise ratio sub-band background noise energy of the previous frame is estimated on the same principle as that of the current frame; for the estimation of the current frame's signal-to-noise ratio sub-band background noise energy, see step 107 of this embodiment.
  • the signal to noise ratio parameter of the current frame can be implemented by using an existing signal to noise ratio calculation method. Optionally, the following method is used:
  • the filter bank subband is re-divided into multiple SNR subbands, and the index is as follows.
  • the energy of each SNR sub-band of the current frame is calculated.
  • the calculation equation is as follows:
  • the sub-band average signal-to-noise ratio SNR1 is calculated from the energy of each SNR subband of the current frame and the background noise energy of each SNR subband of the previous frame.
  • the calculation equation is as follows:
  • $E_{sb2\_bg}$ is the estimated background noise energy of each SNR sub-band of the previous frame of the current frame, and num_band is the number of SNR sub-bands.
  • the background noise energy of the SNR sub-bands of the previous frame is obtained on the same principle as that of the current frame; the process of obtaining the SNR sub-band background noise energy of the current frame is described in step 107.
  • the full-band signal-to-noise ratio SNR2 is calculated according to the estimated full frame background noise energy of the previous frame and the frame energy parameter of the current frame:
  • E t_bg is the estimated total background noise energy of the previous frame
  • the principle of obtaining the full background noise energy of the previous frame is the same as the principle of obtaining the full background noise energy of the current frame, and the full background noise energy of the current frame is obtained.
  • the signal to noise ratio parameters in this embodiment include a subband average signal to noise ratio SNR1 and a full band signal to noise ratio SNR2.
  • the full background noise energy and the background noise energy of each subband are collectively referred to as background noise energy.
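The equations for SNR1 and SNR2 are not reproduced in the text above. The following is a minimal sketch of one plausible reading, assuming both quantities are base-2 logarithmic energy ratios and that SNR1 is a plain average over the SNR subbands. The logarithm base, the `eps` guard, and the function names are assumptions made for illustration and are not taken from the source.

```python
import math

def subband_average_snr(e_sb2, e_sb2_bg, eps=1e-12):
    """SNR1 sketch: mean over SNR subbands of log2(energy / noise energy).

    e_sb2    -- SNR subband energies of the current frame
    e_sb2_bg -- SNR subband background noise energies of the previous frame
    The log2 form and eps guard are assumptions; the source omits the equation.
    """
    num_band = len(e_sb2)
    total = sum(math.log2((e + eps) / (bg + eps)) for e, bg in zip(e_sb2, e_sb2_bg))
    return total / num_band

def full_band_snr(e_t1, e_t_bg, eps=1e-12):
    """SNR2 sketch: log2 ratio of the frame energy E_t1 to the previous
    frame's full-band background noise energy E_t_bg (assumed form)."""
    return math.log2((e_t1 + eps) / (e_t_bg + eps))
```

With subband energies `[4.0, 16.0]` over unit noise energies, the per-subband values under this reading are 2 and 4, so SNR1 is their mean.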
  • Step 104: Calculate the tonality flag of the current frame from the frame energy parameter, the spectral center-of-gravity feature parameter, the time-domain stability feature parameter, the spectral flatness feature parameter, and the tonal feature parameter of the current frame, where:
  • A value of 1 for tonality_frame indicates that the current frame is a tonal frame, and 0 indicates that it is a non-tonal frame;
  • Step 104b: Determine whether the tonal feature parameter or its smoothed value is greater than the corresponding set threshold tonality_decision_thr1 or tonality_decision_thr2; if either condition holds, perform step 104c, otherwise perform step 104d;
  • The value range of tonality_decision_thr1 is [0.5, 0.7].
  • The value range of tonality_rate1 is [0.7, 0.99].
  • Step 104c: If the time-domain stability feature parameter lt_stable_rate0 is smaller than a set threshold lt_stable_decision_thr1, the spectral center-of-gravity feature parameter sp_center[1] is greater than a set threshold spc_decision_thr1, and the spectral flatness feature parameter of each subband is smaller than its corresponding preset threshold, the current frame is determined to be a tonal frame and the tonal frame flag tonality_frame is set to 1; otherwise it is determined to be a non-tonal frame and tonality_frame is set to 0. Then proceed to step 104d.
  • The value range of the threshold lt_stable_decision_thr1 is [0.01, 0.25], and that of spc_decision_thr1 is [1.0, 1.8].
  • The tonality degree feature parameter tonality_degree is updated using the following equation:
  • $\mathrm{tonality\_degree} = \mathrm{tonality\_degree}_{-1} \cdot \mathrm{td\_scale\_A} + \mathrm{td\_scale\_B}$
  • Here tonality_degree_{-1} is the tonality degree feature parameter of the previous frame; its initial value lies in [0, 1].
  • td_scale_A is the attenuation coefficient, and its value range is [0, 1];
  • td_scale_B is the accumulation coefficient, and its value range is [0, 1].
  • the current frame is a tonal signal, otherwise, the current frame is determined to be a non-tonal signal.
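Steps 104b and 104c and the tonality_degree update can be sketched as follows. This is an illustrative sketch only: the default threshold values are arbitrary picks (those for tonality_decision_thr1, lt_stable_decision_thr1 and spc_decision_thr1 lie inside the stated ranges, while the default for tonality_decision_thr2 is purely hypothetical because its range is not stated), the flag is assumed to default to 0 when step 104b fails, and the function names are hypothetical.

```python
def tonal_frame_flag(tonality_rate, tonality_rate_smoothed, lt_stable_rate0,
                     sp_center1, flatness, flatness_thr,
                     tonality_decision_thr1=0.6, tonality_decision_thr2=0.8,
                     lt_stable_decision_thr1=0.1, spc_decision_thr1=1.4):
    """Return tonality_frame (1 = tonal frame, 0 = non-tonal frame)."""
    # Step 104b: either the raw or the smoothed tonal feature must exceed its threshold.
    if not (tonality_rate > tonality_decision_thr1
            or tonality_rate_smoothed > tonality_decision_thr2):
        return 0  # assumed default when step 104c is skipped
    # Step 104c: stability, spectral centre of gravity and per-subband flatness tests.
    tonal = (lt_stable_rate0 < lt_stable_decision_thr1
             and sp_center1 > spc_decision_thr1
             and all(f < t for f, t in zip(flatness, flatness_thr)))
    return 1 if tonal else 0

def update_tonality_degree(prev_degree, td_scale_a, td_scale_b):
    """tonality_degree = tonality_degree_{-1} * td_scale_A + td_scale_B."""
    return prev_degree * td_scale_a + td_scale_b
```

The update is a first-order recursion: td_scale_A decays the previous value while td_scale_B adds a fixed increment.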
  • Step 105: Calculate the VAD decision result from the tonality flag, the SNR parameters, the spectral center-of-gravity feature parameter, and the frame energy parameter, as shown in FIG. 2. The steps are as follows:
  • Step 105a: Calculate the long-term SNR lt_snr as the ratio of the average long-term active sound signal energy to the average long-term background noise energy, both calculated in the previous frame;
  • The calculation and definition of the average long-term active sound signal energy E_fg and the average long-term background noise energy E_bg are given in step 105g.
  • The long-term SNR lt_snr is calculated as follows:
  • Step 105b: Calculate the average of the full-band SNR (SNR2) over the several frames nearest to the current frame to obtain the average full-band SNR, SNR2_lt_ave;
  • SNR2(n) denotes the full-band SNR (SNR2) of the n-th frame before the current frame.
  • F_num is the total number of frames over which the average is calculated; its value range is [8, 64].
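The averaging in step 105b can be sketched as a sliding window over the most recent F_num values of SNR2. This is a minimal sketch under assumptions: the class name is hypothetical, the current frame is assumed to be part of the window, and before F_num frames have been seen the average is taken over the frames available.

```python
from collections import deque

class Snr2Averager:
    """Sliding average of the full-band SNR (SNR2) over the last f_num frames."""

    def __init__(self, f_num=16):  # the text gives [8, 64] as the range of F_num
        self.history = deque(maxlen=f_num)

    def update(self, snr2):
        """Push the SNR2 of the newest frame and return SNR2_lt_ave."""
        self.history.append(snr2)
        return sum(self.history) / len(self.history)
```

A bounded `deque` drops the oldest frame automatically, so each update costs at most F_num additions.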
  • Step 105c: Obtain the decision SNR threshold snr_thr for the VAD decision from the spectral center-of-gravity feature parameter, the long-term SNR lt_snr, the number of consecutive active sound frames continuous_speech_num, and the number of consecutive noise frames continuous_noise_num.
  • The initial value of snr_thr is set within the range [0.1, 2], for example 1.06.
  • snr_thr is first adjusted according to the spectral center-of-gravity feature parameter, as follows: if sp_center[2] is greater than a set threshold spc_vad_dec_thr1, an offset is added to snr_thr (0.05 in this example); otherwise, if sp_center[1] is greater than spc_vad_dec_thr2, an offset is added to snr_thr (0.10 in this example); otherwise, an offset is added to snr_thr (0.40 in this example). The value range of the thresholds spc_vad_dec_thr1 and spc_vad_dec_thr2 is [1.2, 2.5].
  • snr_thr is then adjusted a second time according to the number of consecutive active sound frames continuous_speech_num, the number of consecutive noise frames continuous_noise_num, the average full-band SNR SNR2_lt_ave, and the long-term SNR lt_snr.
  • the offset value is taken as 0.1; otherwise, if continuous_noise_num is greater than a set threshold cpn_vad_dec_thr3, an offset is added to snr_thr (0.2 in this example); otherwise, if continuous_noise_num is greater than a set threshold cpn_vad_dec_thr4, an offset is added to snr_thr (0.1 in this example).
  • The value range of the thresholds cpn_vad_dec_thr1, cpn_vad_dec_thr2, cpn_vad_dec_thr3, and cpn_vad_dec_thr4 is [2, 500], and that of the coefficient lt_tsnr_scale is [0, 2].
  • Finally, snr_thr is adjusted once more to obtain the decision SNR threshold snr_thr of the current frame:
  • $\mathrm{snr\_thr} = \mathrm{snr\_thr} + (\mathrm{lt\_tsnr} - \mathrm{thr\_offset}) \cdot \mathrm{thr\_scale}$
  • Here thr_offset is an offset with a value range of [0.5, 3], and thr_scale is a gain coefficient with a value range of [0.1, 1].
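The first adjustment (by spectral center of gravity) and the final adjustment of snr_thr can be sketched as below. The second adjustment is left out because its leading conditions are not recoverable from the text. The default values of spc_vad_dec_thr1, spc_vad_dec_thr2, thr_offset and thr_scale are arbitrary picks inside the stated ranges, and the function names are hypothetical.

```python
def adjust_snr_thr_by_centroid(snr_thr, sp_center,
                               spc_vad_dec_thr1=1.6, spc_vad_dec_thr2=1.6):
    """First adjustment: add an offset chosen from the spectral centre of gravity.

    sp_center is indexed as in the text (sp_center[1], sp_center[2]).
    """
    if sp_center[2] > spc_vad_dec_thr1:
        return snr_thr + 0.05
    if sp_center[1] > spc_vad_dec_thr2:
        return snr_thr + 0.10
    return snr_thr + 0.40

def final_snr_thr(snr_thr, lt_tsnr, thr_offset=1.5, thr_scale=0.2):
    """Final adjustment: snr_thr + (lt_tsnr - thr_offset) * thr_scale."""
    return snr_thr + (lt_tsnr - thr_offset) * thr_scale
```

In the final step the threshold rises when the long-term SNR exceeds thr_offset and falls when it is below, so the decision is stricter in clean conditions and more permissive in noisy ones.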
  • Step 105d: Calculate an initial VAD decision result from the decision threshold snr_thr and the SNR parameters SNR1 and SNR2 of the current frame.
  • The VAD flag vad_flag indicates whether the current frame is an active sound frame: a value of 1 indicates an active sound frame, and 0 indicates an inactive sound frame.
  • If SNR1 is greater than the decision threshold snr_thr, the current frame is judged to be an active sound frame and vad_flag is set to 1; otherwise, the current frame is judged to be an inactive sound frame and vad_flag is set to 0.
  • If SNR2 is greater than a set threshold snr2_thr, the current frame is determined to be an active sound frame and vad_flag is set to 1.
  • The value range of snr2_thr is [1.2, 5.0].
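A minimal sketch of the initial decision in step 105d follows. It assumes that the subband average SNR1 is compared against snr_thr and that the SNR2 test can only raise the flag to 1; the default snr2_thr is an arbitrary pick inside the stated range, and the function name is hypothetical.

```python
def initial_vad_decision(snr1, snr2, snr_thr, snr2_thr=2.0):
    """Return vad_flag (1 = active sound frame, 0 = inactive sound frame)."""
    # Assumed primary test: subband average SNR against the adaptive threshold.
    vad_flag = 1 if snr1 > snr_thr else 0
    # A sufficiently high full-band SNR also marks the frame as active.
    if snr2 > snr2_thr:
        vad_flag = 1
    return vad_flag
```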
  • Step 105e: Correct the initial VAD decision result according to the tonality flag, the average full-band SNR SNR2_lt_ave, the spectral center of gravity, and the long-term SNR lt_snr.
  • If the tonality flag indicates that the current frame is a tonal signal, that is, tonality_flag is 1, the current frame is determined to be an active sound signal and vad_flag is set to 1.
  • The value range of SNR2_lt_ave_thr1 is [1, 4].
  • The value range of lt_tsnr_tscale is [0.1, 0.6].
  • the current frame is determined to be an active sound frame and vad_flag is set to 1.
  • The value range of SNR2_lt_ave_t_thr2 is [1.0, 2.5].
  • The value range of sp_center_t_thr1 is [2.0, 4.0].
  • The value range of lt_tsnr_t_thr1 is [2.5, 5.0].
  • If SNR2_lt_ave is greater than a set threshold SNR2_lt_ave_t_thr3, the spectral center-of-gravity feature parameter sp_center[2] is greater than a set threshold sp_center_t_thr2, and the long-term SNR lt_snr is less than a set threshold lt_tsnr_t_thr2, the current frame is determined to be an active sound frame and vad_flag is set to 1.
  • The value range of SNR2_lt_ave_t_thr3 is [0.8, 2.0].
  • The value range of sp_center_t_thr2 is [2.0, 4.0].
  • The value range of lt_tsnr_t_thr2 is [2.5, 5.0].
  • If SNR2_lt_ave is greater than a set threshold SNR2_lt_ave_t_thr4, the spectral center-of-gravity feature parameter sp_center[2] is greater than a set threshold sp_center_t_thr3, and the long-term SNR lt_snr is less than a set threshold lt_tsnr_t_thr3, the current frame is determined to be an active sound frame and vad_flag is set to 1.
  • The value range of SNR2_lt_ave_t_thr4 is [0.6, 2.0].
  • The value range of sp_center_t_thr3 is [3.0, 6.0].
  • The value range of lt_tsnr_t_thr3 is [2.5, 5.0].
  • Step 105f: Correct the number of active-sound hold frames according to the decision results of several frames before the current frame, the long-term SNR lt_snr, the average full-band SNR SNR2_lt_ave, the SNR parameters of the current frame, and the VAD decision result of the current frame.
  • The precondition for correcting the current number of active-sound hold frames is that the active sound flag indicates that the current frame is an active sound frame. If this condition is not met, the current number of active-sound hold frames num_speech_hangover is not corrected, and the process proceeds directly to step 105g.
  • When the number of hold frames is corrected: if the number of consecutive speech frames continuous_speech_num is less than a set first threshold continuous_speech_num_thr1 and the long-term SNR lt_snr is less than a set threshold, num_speech_hangover is set equal to the minimum number of consecutive active sound frames minus continuous_speech_num.
  • Otherwise, if the average full-band SNR SNR2_lt_ave is greater than a set threshold and continuous_speech_num is greater than a set second threshold continuous_speech_num_thr2, num_speech_hangover is set according to the value of the long-term SNR lt_tsnr. Otherwise, num_speech_hangover is not corrected. In this embodiment the minimum number of consecutive active sound frames is 8; it may take a value in [6, 20].
  • The first threshold continuous_speech_num_thr1 and the second threshold continuous_speech_num_thr2 may be the same or different.
  • the value of num_speech_hangover is 3; otherwise, if the long-term signal-to-noise ratio lt_snr is greater than 1.6, the value of num_speech_hangover is 4; otherwise, the value of num_speech_hangover is 5.
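The hold-frame correction of step 105f can be sketched as follows. This is an illustration under assumptions: the threshold dictionary and its key names are hypothetical, and `lt_snr_high_thr` stands in for the first tier boundary, whose value is not recoverable from the text. Only the values 3, 4 and 5, the 1.6 boundary, and the minimum of 8 frames come from the source.

```python
def correct_hangover(num_speech_hangover, vad_flag, continuous_speech_num,
                     lt_snr, snr2_lt_ave, thr, min_active_frames=8):
    """Return the corrected num_speech_hangover.

    thr -- dict with hypothetical keys: speech_num_thr1, lt_snr_thr,
           snr2_lt_ave_thr, speech_num_thr2, lt_snr_high_thr.
    """
    # Precondition: only correct when the current frame is an active sound frame.
    if vad_flag != 1:
        return num_speech_hangover
    # Short speech run at low long-term SNR: pad up to the minimum run length.
    if continuous_speech_num < thr["speech_num_thr1"] and lt_snr < thr["lt_snr_thr"]:
        return min_active_frames - continuous_speech_num
    # Long speech run with a good average SNR: choose the hold length by lt_snr.
    if (snr2_lt_ave > thr["snr2_lt_ave_thr"]
            and continuous_speech_num > thr["speech_num_thr2"]):
        if lt_snr > thr["lt_snr_high_thr"]:  # boundary value is an assumption
            return 3
        if lt_snr > 1.6:
            return 4
        return 5
    return num_speech_hangover
```

The hold length shrinks as the long-term SNR grows, since cleaner signals need less padding after speech ends.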
  • Step 105g: Apply the active-sound hold according to the decision result of the current frame and the number of active-sound hold frames num_speech_hangover, to obtain the VAD decision result of the current frame.
  • The method is:
  • If the active sound flag is 0 and num_speech_hangover is greater than 0, the hold is applied: the active sound flag is set to 1 and num_speech_hangover is decremented by 1.
  • The final VAD decision result of the current frame is thus obtained.
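The hold logic of step 105g is small enough to sketch directly. The function name and the tuple return are choices made for this illustration.

```python
def apply_hangover(vad_flag, num_speech_hangover):
    """Return (final vad_flag, updated num_speech_hangover)."""
    # An inactive decision is overridden while hold frames remain.
    if vad_flag == 0 and num_speech_hangover > 0:
        return 1, num_speech_hangover - 1
    return vad_flag, num_speech_hangover
```

Starting with three hold frames, three consecutive inactive decisions are reported as active, and the fourth is passed through as inactive.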
  • The method further includes: calculating the average long-term active sound signal energy E_fg from the initial VAD decision result; the calculated value is used in the VAD decision of the next frame. After step 105g, the method may further include: calculating the average long-term background noise energy E_bg from the VAD decision result of the current frame; the calculated value is used in the VAD decision of the next frame.
  • The average long-term active sound signal energy E_fg is calculated as follows:
  • If the initial VAD decision result indicates that the current frame is an active sound frame, that is, the VAD flag is 1, and E_t1 is greater than a multiple of E_bg (6 times in this embodiment), the average long-term active sound energy accumulated value fg_energy and the accumulated frame number fg_energy_count are updated.
  • The update adds E_t1 to fg_energy to obtain the new fg_energy, and adds 1 to fg_energy_count to obtain the new fg_energy_count.
  • Here fg_max_frame_num takes a set value of 512,
  • and attenu_coef1 takes a value of 0.75.
  • bg_energy_count is the number of accumulated background noise energy frames; it records how many frames of energy the most recent accumulated background noise energy contains.
  • bg_energy is the most recent accumulated background noise energy.
  • the background noise energy accumulated value bg_energy and the background noise energy accumulated frame number bg_energy_count are updated.
  • The update adds E_t1 to bg_energy to obtain the new background noise energy accumulated value bg_energy.
  • bg_energy_count is incremented by one to obtain the new background noise energy accumulated frame number bg_energy_count.
  • If bg_energy_count equals the maximum number of counted frames for the average long-term background noise energy calculation, the accumulated frame number and the accumulated value are both multiplied by the attenuation coefficient attenu_coef2.
  • In this embodiment the maximum number of counted frames for the average long-term background noise energy calculation is 512, and the attenuation coefficient attenu_coef2 is equal to 0.75.
  • The average long-term background noise energy is obtained by dividing the background noise energy accumulated value bg_energy by the accumulated frame number bg_energy_count: $E_{bg} = \mathrm{bg\_energy} / \mathrm{bg\_energy\_count}$
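The accumulate-and-attenuate scheme for the average long-term background noise energy can be sketched as below. The class name is hypothetical, and the decision of when to call `update` (the condition on the VAD result is not fully recoverable from the text) is left to the caller.

```python
class LongTermBackgroundEnergy:
    """Running average of background noise energy with periodic attenuation."""

    def __init__(self, max_frames=512, attenu_coef2=0.75):
        self.max_frames = max_frames
        self.attenu_coef2 = attenu_coef2
        self.bg_energy = 0.0        # accumulated background noise energy
        self.bg_energy_count = 0.0  # number of frames in the accumulator

    def update(self, e_t1):
        """Accumulate the frame energy E_t1 and return E_bg."""
        self.bg_energy += e_t1
        self.bg_energy_count += 1
        # On reaching the maximum count, scale both accumulators so that the
        # average is unchanged but older frames carry less weight afterwards.
        if self.bg_energy_count == self.max_frames:
            self.bg_energy *= self.attenu_coef2
            self.bg_energy_count *= self.attenu_coef2
        return self.bg_energy / self.bg_energy_count
```

Scaling the sum and the count by the same factor leaves the current average untouched while bounding both values, which keeps the estimate responsive to later changes in the noise level.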
  • The first embodiment may further include the following steps:
  • Step 106: Calculate the background noise update identifier from the VAD decision result, the tonal feature parameter, the SNR parameters, the tonality flag, and the time-domain stability feature parameter of the current frame. For the calculation method, refer to the second embodiment described later.
  • Step 107: Obtain the background noise energy of the current frame from the background noise update identifier, the frame energy parameter of the current frame, and the full-band background noise energy of the previous frame; the background noise energy of the current frame is used to calculate the SNR parameters of the next frame.
  • The background noise update identifier determines whether the background noise is updated. If the identifier is 1, the background noise is updated according to the ratio of the full-band background noise energy to the energy of the current frame signal.
  • The background noise energy estimate comprises a subband background noise energy estimate and a full-band background noise energy estimate.
  • $E_{sb2\_bg}(k) = E_{sb2\_bg\_pre}(k) \cdot \alpha_{bg\_e} + E_{sb2\_bg}(k) \cdot (1 - \alpha_{bg\_e}), \quad 0 \le k < \mathrm{num\_sb}$
  • E_sb2_bg_pre(k) denotes the subband background noise energy of the k-th SNR subband of the previous frame.
  • α_bg_e is the background noise update factor, whose value is determined by the full-band background noise energy of the previous frame and the energy parameter of the current frame. The calculation process is as follows:
  • the value is 0.96, otherwise the value is 0.95.
  • If the background noise update identifier of the current frame is 1, the background noise energy accumulated value E_t_sum and the background noise energy accumulated frame number N_Et_counter are updated as follows:
  • $E_{t\_sum} = E_{t\_sum\_-1} + E_{t1}$
  • $N_{Et\_counter} = N_{Et\_counter\_-1} + 1$
  • E_t_sum_-1 is the background noise energy accumulated value of the previous frame.
  • N_Et_counter_-1 is the background noise energy accumulated frame number calculated in the previous frame.
  • The full-band background noise energy is obtained as the ratio of the accumulated value E_t_sum to the accumulated frame number N_Et_counter: $E_{t\_bg} = E_{t\_sum} / N_{Et\_counter}$
  • It is then determined whether N_Et_counter is equal to 64; if so, the accumulated value E_t_sum and the accumulated frame number N_Et_counter are each multiplied by 0.75.
  • If the tonality flag tonality_flag is equal to 1 and the frame energy parameter E_t1 is less than the background noise energy feature parameter E_t_bg multiplied by a gain coefficient gain, then:
  • $E_{t\_sum} = E_{t\_sum} \cdot \mathrm{gain} + \mathrm{delta}$
  • $E_{sb2\_bg}(k) = E_{sb2\_bg}(k) \cdot \mathrm{gain} + \mathrm{delta}$
  • The value range of gain is [0.3, 1].
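The two background noise updates of step 107 can be sketched as follows. The subband update is read here as a first-order recursive smoothing between the previous estimate and the current frame's SNR subband energy; the source prints the same symbol on both sides of that equation, so the second input is an assumption. The function names are hypothetical, and the caller is expected to invoke them only when the background noise update identifier is 1.

```python
def update_subband_noise(e_sb2_bg_pre, e_sb2_cur, alpha_bg_e):
    """Smooth each SNR subband noise estimate toward the current subband energy.

    e_sb2_bg_pre -- previous-frame subband background noise energies
    e_sb2_cur    -- current-frame subband energies (assumed second operand)
    alpha_bg_e   -- update factor (0.96 or 0.95 in the text)
    """
    return [pre * alpha_bg_e + cur * (1.0 - alpha_bg_e)
            for pre, cur in zip(e_sb2_bg_pre, e_sb2_cur)]

def update_full_band_noise(e_t_sum_prev, n_counter_prev, e_t1):
    """Return (E_t_bg, E_t_sum, N_Et_counter) after accumulating E_t1."""
    e_t_sum = e_t_sum_prev + e_t1
    n_counter = n_counter_prev + 1
    e_t_bg = e_t_sum / n_counter
    # At 64 accumulated frames both accumulators are scaled by 0.75.
    if n_counter == 64:
        e_t_sum *= 0.75
        n_counter *= 0.75
    return e_t_bg, e_t_sum, n_counter
```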
  • An embodiment of the present invention further provides a background noise detection method. As shown in FIG. 3, the method includes:
  • Step 201: Obtain the subband signal and the spectral amplitude of the current frame.
  • Step 202: Calculate the frame energy parameter, the spectral center-of-gravity feature parameter, and the time-domain stability feature parameter from the subband signal, and calculate the spectral flatness feature parameter and the tonal feature parameter from the spectral amplitude;
  • The frame energy parameter is a weighted sum or a direct sum of the energies of the subband signals.
  • The spectral center-of-gravity feature parameter is the ratio of a weighted accumulated value to an unweighted accumulated value of the energies of all or some of the subband signals, or a value obtained by smoothing that ratio.
  • The time-domain stability feature parameter is the ratio of the variance of the frame energy amplitude to the expectation of the squared amplitude sum, or that ratio multiplied by a coefficient.
  • The spectral flatness parameter is the ratio of the geometric mean to the arithmetic mean of a predetermined plurality of spectral amplitudes, or that ratio multiplied by a coefficient.
  • Steps 201 and 202 can use the same methods as described above, which are not repeated here.
  • Step 203: Perform background noise detection according to the spectral center-of-gravity feature parameter, the time-domain stability feature parameter, the spectral flatness feature parameter, the tonal feature parameter, and the current frame energy parameter, and determine whether the current frame is background noise.
  • The background noise update identifier is first set to a first preset value; then, if any of the following conditions holds, the current frame is determined not to be a noise signal and the background noise update identifier is set to a second preset value:
  • the time-domain stability feature parameter lt_stable_rate0 is greater than a set threshold;
  • the smoothed value of the spectral center-of-gravity feature parameter is greater than a set threshold, and the time-domain stability feature parameter is also greater than a set threshold;
  • the tonal feature parameter or its smoothed value is greater than a set threshold, and the time-domain stability feature parameter lt_stable_rate0 is greater than its set threshold;
  • the spectral flatness feature parameters of each subband, or their smoothed values, are smaller than their respective set thresholds;
  • the value of the frame energy parameter E_t1 is greater than the set threshold E_thr1.
  • A background noise update identifier background_flag indicates whether the current frame is background noise. By convention, if the current frame is determined to be background noise, background_flag is set to 1 (the first preset value); otherwise background_flag is set to 0 (the second preset value).
  • Whether the current frame is a noise signal is detected from the feature parameters and the current frame energy parameter. If it is not a noise signal, background_flag is set to 0.
  • The value range of the threshold lt_stable_rate_thr1 is [0.8, 1.6];
  • Determine whether the tonal feature parameter is greater than a set threshold tonality_rate_thr1 and whether the time-domain stability feature parameter lt_stable_rate0 is greater than a set threshold lt_stable_rate_thr3; if both conditions are satisfied, the current frame is determined not to be background noise and background_flag is set to 0.
  • The value range of the threshold tonality_rate_thr1 is [0.4, 0.66].
  • The value range of the threshold lt_stable_rate_thr3 is [0.06, 0.3].
  • background_flag is set to 0.
  • The value range of sSMR_thr4, sSMR_thr5, and sSMR_thr6 is [0.80, 0.92].
  • E_thr1 is chosen according to the dynamic range of the frame energy parameter.
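The five "not noise" conditions of step 203 can be sketched as a single predicate. The threshold dictionary and most of its key names are hypothetical (only lt_stable_rate_thr1, tonality_rate_thr1, lt_stable_rate_thr3 and E_thr1 are named in the text), and the flatness test is read here as requiring every subband to fall below its threshold.

```python
def background_update_flag(lt_stable_rate0, sp_center_smoothed, tonality_rate,
                           flatness, e_t1, thr):
    """Return background_flag (1 = background noise, 0 = not noise).

    thr -- dict of thresholds; key names other than those quoted in the
           text are placeholders for this sketch.
    """
    not_noise = (
        lt_stable_rate0 > thr["lt_stable_rate_thr1"]
        or (sp_center_smoothed > thr["sp_center_thr"]
            and lt_stable_rate0 > thr["lt_stable_rate_thr2"])
        or (tonality_rate > thr["tonality_rate_thr1"]
            and lt_stable_rate0 > thr["lt_stable_rate_thr3"])
        or all(f < t for f, t in zip(flatness, thr["flatness_thrs"]))
        or e_t1 > thr["E_thr1"]
    )
    # Start from "noise" (first preset value) and clear the flag if any test fires.
    return 0 if not_noise else 1
```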
  • An embodiment of the present invention further provides a method for correcting the number of active-sound hold frames in the VAD decision. As shown in FIG. 4, the method includes:
  • Step 301: Calculate the long-term SNR lt_snr from the subband signal.
  • The long-term SNR lt_snr is calculated as the ratio of the average long-term active sound signal energy to the average long-term background noise energy, both calculated in the previous frame; lt_snr can be expressed logarithmically.
  • Step 302: Calculate the average full-band SNR SNR2_lt_ave.
  • Step 303: Correct the current number of active-sound hold frames according to the decision results of several frames before the current frame, the long-term SNR lt_snr, the average full-band SNR SNR2_lt_ave, the SNR parameters of the current frame, and the VAD decision result of the current frame.
  • The precondition for correcting the current number of active-sound hold frames is that the active sound flag indicates that the current frame is an active sound frame.
  • When the current number of active-sound hold frames is corrected: if the number of consecutive speech frames is less than a set first threshold (threshold 1) and the long-term SNR lt_snr is less than a set threshold (threshold 2), the current number of active-sound hold frames equals the minimum number of consecutive active sound frames minus the number of consecutive speech frames; otherwise, if the average full-band SNR SNR2_lt_ave is greater than a set threshold (threshold 3) and the number of consecutive speech frames is greater than a set second threshold (threshold 4), the number of active-sound hold frames is set according to the value of the long-term SNR. Otherwise, the current number of active-sound hold frames num_speech_hangover is not corrected.
  • An embodiment of the present invention provides a method for acquiring the number of active-sound correction frames, as shown in the accompanying figure, with the following steps:
  • When the VAD decision result of the current frame is an active sound frame and the number of background noise updates is less than a preset threshold, the number of active-sound correction frames is set to the maximum of a constant (for example, 20) and the number of active-sound hold frames.
  • The method may further include step 405: correcting the VAD decision result according to the VAD decision result and the number of active-sound correction frames, wherein:
  • the current frame is set to be an active sound frame, and the number of active-sound correction frames is decreased by 1.
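The correction-frame logic can be sketched as below. Two points are assumptions rather than statements from the text: the previous value is kept when the update condition does not hold, and the correction in step 405 is applied when the VAD decision is inactive and correction frames remain. The threshold `bg_update_thr` has no value in the source and is left as a required argument; the function names are hypothetical.

```python
def update_correction_frames(correction_frames, vad_flag, bg_update_count,
                             num_speech_hangover, bg_update_thr, constant=20):
    """Return the number of active-sound correction frames for this frame."""
    # Early in operation (few background updates) an active frame arms a long hold.
    if vad_flag == 1 and bg_update_count < bg_update_thr:
        return max(constant, num_speech_hangover)
    return correction_frames  # assumed: unchanged otherwise

def apply_correction(vad_flag, correction_frames):
    """Step 405 sketch: return (corrected vad_flag, remaining correction frames)."""
    # Assumed trigger: an inactive decision while correction frames remain.
    if vad_flag == 0 and correction_frames > 0:
        return 1, correction_frames - 1
    return vad_flag, correction_frames
```

Tying the longer hold to a small background-update count targets the period before the noise estimate has settled, when the SNR-based decision is least reliable.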
  • the embodiment of the present invention further provides an apparatus 60 for acquiring the number of activated sound correction frames.
  • the obtaining apparatus 60 includes:
  • the first obtaining unit 61 is configured to: obtain an activation sound detection decision result of the current frame;
  • the second obtaining unit 62 is configured to: obtain an activation tone holding frame number
  • the third obtaining unit 63 is configured to: obtain the number of background noise updates
  • the fourth obtaining unit 64 is configured to: obtain an activated sound correction frame number according to the activation sound detection determination result of the current frame, the background noise update number, and the activation sound retention frame number.
  • the embodiment of the invention provides an activation sound detection method, as shown in FIG. 7, the steps are as follows:
  • the second activation tone detection decision result vadb_flag is obtained by any existing activation tone detection decision scheme, and the existing activation tone detection decision scheme is not elaborated herein.
  • the number of active-sound correction frames is set to the maximum of 20 and the number of active-sound hold frames.
  • the current frame is set to be the active sound frame, and the number of the activated sound correction frames is decreased by 1.
  • the embodiment of the present invention further provides an active sound detecting device.
  • the detecting device 80 includes:
  • the fifth obtaining unit 81 is configured to: obtain a first activated sound detection decision result
  • the sixth obtaining unit 82 is configured to: obtain an activation tone holding frame number
  • the seventh obtaining unit 83 is configured to: obtain the number of background noise updates
  • the first calculating unit 84 is configured to: calculate the number of the activated sound correction frames according to the first activation sound detection determination result, the background noise update number, and the activation sound retention frame number;
  • the eighth obtaining unit 85 is configured to: obtain a second activated sound detection decision result
  • the second calculating unit 86 is configured to calculate the activation sound detection determination result according to the activation sound correction frame number and the second activation sound detection determination result.
  • VAD Voice over IP
  • The technical solution provided by the embodiments of the present invention overcomes shortcomings of existing VAD algorithms: it improves VAD efficiency under non-stationary noise and also improves the accuracy of music detection.
  • Speech and audio signal processing algorithms that use the technical solution provided by the embodiments of the present invention can therefore achieve better performance.
  • The background noise detection method provided by the embodiments of the invention makes the background noise estimate more accurate and stable, which helps improve the accuracy of VAD detection.
  • The tonal signal detection method provided by the embodiments of the invention improves the accuracy of tonal music detection.
  • The method for correcting the number of active-sound hold frames provided by the embodiments of the present invention allows the VAD algorithm to strike a better balance between performance and efficiency under different noises and SNRs.
  • The method for adjusting the decision SNR threshold in the VAD decision provided by the embodiments of the present invention allows the VAD decision algorithm to achieve better accuracy under different SNRs, and further improves efficiency while maintaining quality.
  • All or some of the steps of the above embodiments may also be implemented using integrated circuits. The steps may be fabricated as individual integrated circuit modules, or several modules or steps may be fabricated as a single integrated circuit module.
  • The devices, function modules, and functional units in the above embodiments can be implemented using general-purpose computing devices; they can be concentrated on a single computing device or distributed over a network of multiple computing devices.
  • When the devices, function modules, and functional units in the above embodiments are implemented as software function modules and sold or used as stand-alone products, they can be stored in a computer-readable storage medium.
  • The above-mentioned computer-readable storage medium may be a read-only memory, a magnetic disk, an optical disk, or the like.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Signal Processing (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Quality & Reliability (AREA)
  • Artificial Intelligence (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)
  • Circuit For Audible Band Transducer (AREA)
  • Measurement Of Mechanical Vibrations Or Ultrasonic Waves (AREA)
  • Telephone Function (AREA)
  • Measurement Of The Respiration, Hearing Ability, Form, And Blood Characteristics Of Living Organisms (AREA)

Abstract

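A method for acquiring the number of active-sound correction frames, and a voice activity detection (VAD) method and apparatus. First, a first VAD decision result and a second VAD decision result are obtained (501), the number of active-sound hold frames is obtained (502), and the number of background noise updates is obtained (503); then the number of active-sound correction frames is calculated from the first VAD decision result, the number of background noise updates, and the number of active-sound hold frames (504); finally, the VAD decision result of the current frame is calculated from the number of active-sound correction frames and the second VAD decision result (505).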

Description

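Method for Acquiring the Number of Active-Sound Correction Frames, and Voice Activity Detection Method and Apparatus
Technical Field
This application relates to, but is not limited to, the field of communications.
Background
In a normal voice call, the user is sometimes speaking and sometimes listening, so inactive-sound periods occur during the call. Normally, the total inactive-speech periods of the two parties exceed 50% of their total speech coding duration. During inactive-sound periods there is only background noise, which usually carries no useful information. Exploiting this fact, speech and audio signal processing uses a voice activity detection (VAD) algorithm to distinguish active sound from inactive sound and processes each with a different method. Many speech coding standards, such as Adaptive Multi-Rate (AMR) and Adaptive Multi-Rate Wideband (AMR-WB), support VAD. In terms of efficiency, however, the VAD of these coders does not perform well under all typical background noises. Under non-stationary noise in particular, the VAD efficiency of these coders is low. For music signals, these VADs sometimes make detection errors, which causes a noticeable quality degradation in the corresponding processing algorithms.
Summary
The following is an overview of the subject matter described in detail herein. This overview is not intended to limit the scope of protection of the claims.
Embodiments of the present invention provide a method for acquiring the number of active-sound correction frames, and a VAD method and apparatus, to solve the problem of low VAD accuracy.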
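An embodiment of the present invention provides a method for acquiring the number of active-sound correction frames, the method comprising:
obtaining a VAD decision result of a current frame;
obtaining the number of active-sound hold frames;
obtaining the number of background noise updates;
acquiring the number of active-sound correction frames according to the VAD decision result of the current frame, the number of background noise updates, and the number of active-sound hold frames.
Optionally, obtaining the VAD decision result of the current frame comprises:
obtaining a subband signal and a spectral amplitude of the current frame;
calculating a frame energy parameter, a spectral center-of-gravity feature parameter, and a time-domain stability feature parameter of the current frame from the subband signal; calculating a spectral flatness feature parameter and a tonal feature parameter from the spectral amplitude;
calculating an SNR parameter of the current frame from the background noise energy obtained using the previous frame of the current frame, the frame energy parameter, and the SNR subband energy;
calculating a tonality flag of the current frame from the frame energy parameter, the spectral center-of-gravity feature parameter, the time-domain stability feature parameter, the spectral flatness feature parameter, and the tonal feature parameter;
calculating the VAD decision result from the tonality flag, the SNR parameter, the spectral center-of-gravity feature parameter, and the frame energy parameter.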
可选地,
所述帧能量参数是每个子带信号能量的加权叠加值或直接叠加值;
所述谱重心特征参数是所有或部分子带信号能量的加权累加值和未加权累加值的比值,或是将所述比值进行平滑滤波得到的值;
所述时域稳定度特征参数是幅值叠加值的方差和幅值叠加值平方的期望的比值,或该比值乘上一个系数;
所述谱平坦度特征参数是预定的多个频谱幅值的几何平均数和算术平均数的比值,或该比值乘上一个系数;
调性特征参数是通过计算前后两帧信号的帧内频谱差分系数的相关值得到,或继续对该相关值进行平滑滤波得到。
可选地,所述根据所述调性标志、所述信噪比参数、所述谱重心特征参数、所述帧能量参数计算得到所述激活音检测判决结果包括:
通过所述当前帧的前一帧计算得到的平均长时激活音信号能量和平均长 时背景噪声能量的比值,计算得到长时信噪比;
计算距离所述当前帧最近的多个帧的全带信噪比的平均值,得到平均全带信噪比;
根据所述谱重心特征参数、所述长时信噪比、连续激活音帧个数和连续噪声帧个数得到激活音检测判决的判决信噪比门限;
根据所述激活音检测的判决门限和所述信噪比参数计算得到初始的激活音检测判决结果;
根据所述调性标志、所述平均全带信噪比、所述谱重心特征参数和所述长时信噪比对所述初始的激活音检测判决结果进行修正,得到所述激活音检测判决结果。
可选地,所述根据所述当前帧的激活音检测判决结果、所述背景噪声更新次数和所述激活音保持帧数获取激活音修正帧数包括:
当所述当前帧的激活音检测判决结果为激活音帧,且所述背景噪声更新次数小于预设门限值时,则所述激活音修正帧数为一个常数和所述激活音保持帧数中的最大值。
可选地,所述获得激活音保持帧数包括:
设置所述激活音保持帧数的初始值。
可选地,所述获得激活音保持帧数包括:
获得所述当前帧的子带信号及频谱幅值;
根据所述子带信号计算得到长时信噪比和平均全带信噪比,根据所述当前帧之前的多个帧的激活音检测的判决结果、长时信噪比、平均全带信噪比、所述当前帧的激活音检测判决结果,对当前激活音保持帧数进行修正获得所述激活音保持帧数。
可选地,所述根据所述子带信号计算得到长时信噪比和平均全带信噪比包括:
通过利用所述当前帧的前一帧计算得到的平均长时激活音信号能量和平均长时背景噪声能量的比值,计算得到所述长时信噪比;计算距离所述当前 帧最近的多个帧的全带信噪比的平均值,得到所述平均全带信噪比。
可选地,对所述当前激活音保持帧数进行修正的前提条件是激活音标志指示所述当前帧为激活音帧。
可选地,所述对当前激活音保持帧数进行修正获得所述激活音保持帧数包括:
获得所述激活音保持帧数时,如果所述连续语音帧数小于一个设定的第一门限值,并且所述长时信噪比小于一个设定的门限值,则所述激活音保持帧数等于最小连续激活音帧数减去所述连续语音帧数;如果所述平均全带信噪比大于一个设定的门限值,并且所述连续语音帧数大于一个设定的第二门限值,则根据所述长时信噪比的大小设置所述激活音保持帧数的值。
可选地,所述获得背景噪声更新次数包括:
获得背景噪声更新标识;
根据所述背景噪声更新标识计算所述背景噪声更新次数。
可选地,所述根据所述背景噪声更新标识计算所述背景噪声更新次数包括:
设置所述背景噪声更新次数初始值。
可选地,所述根据所述背景噪声更新标识计算所述背景噪声更新次数包括:
当所述背景噪声更新标识指示所述当前帧为背景噪声,且所述背景噪声更新次数小于设定的门限值时,将所述背景噪声更新次数加1。
可选地,所述获得背景噪声更新标识包括:
获得所述当前帧的子带信号及频谱幅值;
根据所述子带信号计算得到帧能量参数、谱重心特征参数、时域稳定度特征参数;根据所述频谱幅值计算得到谱平坦度特征参数和调性特征参数;
根据所述谱重心特征参数、所述时域稳定度特征参数、所述谱平坦度特征参数、所述调性特征参数、所述帧能量参数进行背景噪声检测,获得所述背景噪声更新标识。
可选地,
所述帧能量参数是每个子带信号能量的加权叠加值或直接叠加值;
所述谱重心特征参数是所有或部分子带信号能量的加权累加值和未加权累加值的比值,或是将所述比值进行平滑滤波得到的值;
所述时域稳定度特征参数是帧能量幅值的方差和幅值叠加值平方的期望的比值,或该比值乘上一个系数;
所述谱平坦度参数是预定的多个频谱幅值的几何平均数和算术平均数的比值,或该比值乘上一个系数。
可选地,所述根据所述谱重心特征参数、所述时域稳定度特征参数、所述谱平坦度特征参数、所述调性特征参数、所述帧能量参数进行背景噪声检测,获得所述背景噪声更新标识,包括:
设置所述背景噪声更新标识为第一预设值;
如果以下任一条件成立,则判断所述当前帧不是噪声信号,并将所述背景噪声更新标识设置为第二预设值:
所述时域稳定度特征参数大于一个设定的门限值;
所述谱重心特征参数值的平滑滤波值大于一个设定的门限值,且所述时域稳定度特征参数值也大于一个设定的门限值;
所述调性特征参数或所述调性特征参数平滑滤波后的值大于一个设定的门限值,且时域稳定度特征参数值大于设定的门限值;
每个子带的谱平坦度特征参数或所述每个子带的谱平坦度特征参数各自平滑滤波后的值均小于各自对应的设定的门限值;
或,所述帧能量参数的值大于设定的门限值。
本发明实施例提供了一种激活音检测方法,所述方法包括:
获得第一激活音检测判决结果;
获得激活音保持帧数;
获得背景噪声更新次数;
根据所述第一激活音检测判决结果、所述背景噪声更新次数和所述激活 音保持帧数计算激活音修正帧数;
获得第二激活音检测判决结果;
根据所述激活音修正帧数和所述第二激活音检测判决结果计算所述的激活音检测判决结果。
可选地,所述根据所述激活音修正帧数和所述第二激活音检测判决结果计算所述激活音检测判决结果包括:
当所述第二激活音检测判决结果指示所述当前帧为非激活音帧,且所述激活音修正帧数大于0时,将所述激活音检测判决结果设置为激活音帧,且所述激活音修正帧数减1。
可选地,所述获得第一激活音检测判决结果包括:
获得当前帧的子带信号及频谱幅值;
根据所述子带信号计算得到所述当前帧的帧能量参数、谱重心特征参数和时域稳定度特征参数;根据所述频谱幅值计算得到谱平坦度特征参数和调性特征参数;
根据利用所述当前帧的前一帧得到的背景噪声能量、所述帧能量参数及信噪比子带能量计算得到所述当前帧的信噪比参数;
根据所述帧能量参数、所述谱重心特征参数、所述时域稳定度特征参数、所述谱平坦度特征参数、所述调性特征参数计算得到所述当前帧的调性标志;
根据所述调性标志、所述信噪比参数、所述谱重心特征参数、所述帧能量参数计算得到所述第一激活音检测判决结果。
可选地,所述帧能量参数是每个子带信号能量的加权叠加值或直接叠加值;
所述谱重心特征参数是所有或部分子带信号能量的加权累加值和未加权累加值的比值,或是将所述比值进行平滑滤波得到的值;
所述时域稳定度特征参数是幅值叠加值的方差和幅值叠加值平方的期望的比值,或该比值乘上一个系数;
所述谱平坦度特征参数是预定的多个频谱幅值的几何平均数和算术平均数的比值,或该比值乘上一个系数;
调性特征参数是通过计算前后两帧信号的帧内频谱差分系数的相关值得到,或继续对该相关值进行平滑滤波得到。
可选地,所述根据所述调性标志、所述信噪比参数、所述谱重心特征参数、所述帧能量参数计算得到所述第一激活音检测判决结果包括:
通过所述当前帧的前一帧计算得到的平均长时激活音信号能量和平均长时背景噪声能量的比值,计算得到长时信噪比;
计算距离所述当前帧最近的多个帧的全带信噪比的平均值,得到平均全带信噪比;
根据所述谱重心特征参数、所述长时信噪比、连续激活音帧个数和连续噪声帧个数得到激活音检测的判决门限;
根据所述激活音检测的判决门限和所述信噪比参数计算得到初始的激活音检测判决结果;
根据所述调性标志、所述平均全带信噪比、所述谱重心特征参数和所述长时信噪比对所述初始的激活音检测判决结果进行修正,得到所述第一激活音检测判决结果。
可选地,所述获得激活音保持帧数包括:
设置所述激活音保持帧数的初始值。
可选地,所述获得激活音保持帧数包括:
获得当前帧的子带信号及频谱幅值;
根据所述子带信号计算得到长时信噪比和平均全带信噪比,根据所述当前帧之前的多个帧的激活音检测的判决结果、所述长时信噪比、所述平均全带信噪比、所述第一激活音检测判决结果,对当前激活音保持帧数进行修正。
可选地,所述根据所述子带信号计算得到长时信噪比和平均全带信噪比包括:
通过利用所述当前帧的前一帧计算得到的平均长时激活音信号能量和平均长时背景噪声能量的比值,计算得到所述长时信噪比;计算距离所述当前帧最近的多个帧的全带信噪比的平均值,得到所述平均全带信噪比。
可选地,对所述当前激活音保持帧数进行修正的前提条件是激活音标志指示所述当前帧为激活音帧。
可选地,所述对当前激活音保持帧数进行修正包括:如果连续语音帧数小于一个设定的第一门限值,并且所述长时信噪比小于一个设定的门限值,则所述激活音保持帧数等于最小连续激活音帧数减去所述连续语音帧数;如果所述平均全带信噪比大于一个设定的第二门限值,并且所述连续语音帧数大于一个设定的门限值,则根据所述长时信噪比的大小设置所述激活音保持帧数的值。
可选地,所述获得背景噪声更新次数包括:
获得背景噪声更新标识;
根据所述背景噪声更新标识计算所述背景噪声更新次数。
可选地,所述根据所述背景噪声更新标识计算所述背景噪声更新次数包括:
设置所述背景噪声更新次数初始值。
可选地,所述根据所述背景噪声更新标识计算所述背景噪声更新次数包括:
当所述背景噪声更新标识指示所述当前帧为背景噪声时,且所述背景噪声更新次数小于设定的门限值时,将所述背景噪声更新次数加1。
可选地,所述获得背景噪声更新标识包括:
获得当前帧的子带信号及频谱幅值;
根据所述子带信号计算得到的帧能量参数、谱重心特征参数、时域稳定度特征参数的值,根据所述频谱幅值计算得到谱平坦度特征参数和调性特征参数的值;
根据所述谱重心特征参数、所述时域稳定度特征参数、所述谱平坦度特 征参数、所述调性特征参数、所述帧能量参数进行背景噪声检测,获得所述背景噪声更新标识。
可选地,所述帧能量参数是每个子带信号能量的加权叠加值或直接叠加值;
所述谱重心特征参数是所有或部分子带信号能量的加权累加值和未加权累加值的比值,或将是所述比值进行平滑滤波得到的值;
所述时域稳定度特征参数是帧能量幅值的方差和幅值叠加值平方的期望的比值,或该比值乘上一个系数;
所述谱平坦度参数是预定的多个频谱幅值的几何平均数和算术平均数的比值,或该比值乘上一个系数。
可选地,所述根据所述谱重心特征参数、所述时域稳定度特征参数、所述谱平坦度特征参数、所述调性特征参数、所述帧能量参数进行背景噪声检测,获得所述背景噪声更新标识,包括:
设置所述背景噪声更新标识为第一预设值;
如果以下任一条件成立,则判断所述当前帧不是噪声信号,并将所述背景噪声更新标识设置为第二预设值:
所述时域稳定度特征参数大于一个设定的门限值;
所述谱重心特征参数值的平滑滤波值大于一个设定的门限值,且所述时域稳定度特征参数值也大于一个设定的门限值;
所述调性特征参数或所述调性特征参数平滑滤波后的值大于一个设定的门限值,且所述时域稳定度特征参数值大于设定的门限值;
每个子带的谱平坦度特征参数或所述每个子带的谱平坦度特征参数各自平滑滤波后的值均小于各自对应的设定的门限值;
或,所述帧能量参数的值大于设定的门限值。
可选地,所述根据所述第一激活音检测判决结果、所述背景噪声更新次数和所述激活音保持帧数计算激活音修正帧数包括:
当所述第一激活音检测判决结果为激活音帧,且所述背景噪声更新次数 小于预设门限值时,则所述激活音修正帧数为一个常数和所述激活音保持帧数中的最大值。
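An embodiment of the present invention provides an apparatus for acquiring a number of voice activity modification frames, the apparatus including:
a first acquisition unit, configured to obtain the VAD decision result of the current frame;
a second acquisition unit, configured to obtain the number of active-voice hangover frames;
a third acquisition unit, configured to obtain the number of background noise updates;
a fourth acquisition unit, configured to acquire the number of voice activity modification frames according to the VAD decision result of the current frame, the number of background noise updates and the number of active-voice hangover frames.
An embodiment of the present invention provides a voice activity detection apparatus, the apparatus including:
a fifth acquisition unit, configured to obtain a first VAD decision result;
a sixth acquisition unit, configured to obtain the number of active-voice hangover frames;
a seventh acquisition unit, configured to obtain the number of background noise updates;
a first calculation unit, configured to calculate the number of voice activity modification frames according to the first VAD decision result, the number of background noise updates and the number of active-voice hangover frames;
an eighth acquisition unit, configured to obtain a second VAD decision result;
a second calculation unit, configured to calculate the VAD decision result according to the number of voice activity modification frames and the second VAD decision result.
A computer-readable storage medium stores computer-executable instructions for executing any of the methods described above.
Embodiments of the present invention provide a method for acquiring a number of voice activity modification frames, and a voice activity detection method and apparatus. A first VAD decision result, the number of active-voice hangover frames and the number of background noise updates are obtained; the number of voice activity modification frames is then calculated from these three values; a second VAD decision result is obtained; and finally the VAD decision result is calculated from the number of voice activity modification frames and the second VAD decision result. This improves the accuracy of VAD.
Other aspects will become apparent upon reading and understanding the drawings and the detailed description.
Brief Description of the Drawings
FIG. 1 is a schematic flowchart of the voice activity detection method provided by Embodiment 1 of the present invention;
FIG. 2 is a schematic diagram of the process of obtaining the VAD decision result in Embodiment 1 of the present invention;
FIG. 3 is a schematic flowchart of the background noise detection method provided by Embodiment 2 of the present invention;
FIG. 4 is a schematic flowchart of the method for correcting the current number of active-voice hangover frames in the VAD decision, provided by Embodiment 3 of the present invention;
FIG. 5 is a schematic flowchart of the method for acquiring the number of voice activity modification frames provided by Embodiment 4 of the present invention;
FIG. 6 is a schematic structural diagram of the apparatus for acquiring the number of voice activity modification frames provided by Embodiment 4 of the present invention;
FIG. 7 is a schematic flowchart of the voice activity detection method provided by Embodiment 5 of the present invention;
FIG. 8 is a schematic structural diagram of the voice activity detection apparatus provided by Embodiment 5 of the present invention.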
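Embodiments of the Invention
Embodiments of the present invention are described in detail below with reference to the accompanying drawings. It should be noted that, provided there is no conflict, the embodiments of the present application and the features in those embodiments may be combined with one another arbitrarily.
The steps shown in the flowcharts of the drawings may be executed in a computer system, for example as a set of computer-executable instructions. Moreover, although a logical order is shown in the flowcharts, in some cases the steps shown or described may be executed in an order different from the one given here.
Notation: unless otherwise stated, in the following embodiments a superscript [i] denotes the frame index, where [0] denotes the current frame and [-1] denotes the previous frame. For example,
Figure PCTCN2015093889-appb-000001
Figure PCTCN2015093889-appb-000002
denote the smoothed spectra of the current frame and the previous frame, respectively.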
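Embodiment 1
An embodiment of the present invention provides a voice activity detection method. As shown in FIG. 1, the method includes the following steps.
Step 101: Obtain the sub-band signals and the spectral amplitudes of the current frame.
This embodiment is described using an audio stream with a frame length of 20 ms and a sampling rate of 32 kHz as an example. The method described herein applies equally to other frame lengths and sampling rates.
The time-domain signal of the current frame is input to a filter bank and sub-band filtering is performed to obtain the filter-bank sub-band signals.
This embodiment uses a 40-channel filter bank; the method applies equally to filter banks with other numbers of channels. Assume that the input audio signal is $s_{HP}(n)$, that $L_C = 40$ is the number of filter-bank channels, and that $w_C$ is a window function with window length $10 L_C$. The sub-band signal is $X(k,l) = X_{CR}(l,k) + i \cdot X_{CI}(l,k)$, where $X_{CR}$ and $X_{CI}$ are its real and imaginary parts. The sub-band signals are calculated as follows: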
Figure PCTCN2015093889-appb-000003
Figure PCTCN2015093889-appb-000004
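where $l$ is the sub-band time index, $0 \le l \le 15$, and $k$ is the sub-band index, $0 \le k \le L_C$.
A time-frequency transform is applied to the filter-bank sub-band signals, and the spectral amplitudes are calculated.
The embodiment can be implemented by applying the time-frequency transform to all or only some of the filter-bank sub-bands. The time-frequency transform may be a Discrete Fourier Transform (DFT), a Fast Fourier Transform (FFT), a Discrete Cosine Transform (DCT) or a Discrete Sine Transform (DST). This embodiment uses the DFT as an example. The calculation is as follows:
A 16-point DFT is applied to the 16 time samples on each filter-bank sub-band with index 0 to 9, which further improves the spectral resolution; the amplitude at each frequency bin is then calculated to obtain the spectral amplitudes $A_{sp}$.
The time-frequency transform is computed as: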
Figure PCTCN2015093889-appb-000005
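The amplitude at each frequency bin is calculated as follows.
First, the energy of the array $X_{DFT}[k,j]$ at each point is calculated:
$X_{DFT\_POW}[k,j] = (\mathrm{Re}(X_{DFT}[k,j]))^2 + (\mathrm{Im}(X_{DFT}[k,j]))^2$; $0 \le k < 10$, $0 \le j < 16$
where Re and Im denote taking the real part and the imaginary part of the spectral coefficient $X_{DFT}[k,j]$, respectively.
If $k$ is even, the spectral amplitude at each frequency bin is calculated with the following equation: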
Figure PCTCN2015093889-appb-000006
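If $k$ is odd, the spectral amplitude at each frequency bin is calculated with the following equation:
Figure PCTCN2015093889-appb-000007
$A_{sp}$ is the spectral amplitude after the time-frequency transform.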
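Step 102: Calculate the frame energy parameter, the spectral centroid feature parameter and the time-domain stability feature parameter of the current frame from the sub-band signals, and calculate the spectral flatness feature parameter and the tonality feature parameter from the spectral amplitudes.
The frame energy parameter is a weighted or direct sum of the signal energies of the sub-bands, obtained as follows:
a) The energy of each filter-bank sub-band is calculated from the filter-bank sub-band signal X[k,l]: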
Figure PCTCN2015093889-appb-000008
Figure PCTCN2015093889-appb-000009
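where $E_C(t,k) = (X_{CR}(t,k))^2 + (X_{CI}(t,k))^2$, $0 \le t \le 15$, $0 \le k \le L_C$.
b) The energies of some perceptually sensitive filter-bank sub-bands, or of all filter-bank sub-bands, are accumulated to obtain the frame energy parameter.
According to psychoacoustic models, the human ear is relatively insensitive to very low frequencies (for example below 100 Hz) and to high frequencies (for example above 20 kHz). By way of example, in this embodiment the filter-bank sub-bands are ordered from low to high frequency, and the sub-bands from the second to the second-to-last are regarded as the main perceptually sensitive sub-bands. The energies of some or all of these sub-bands are accumulated to obtain frame energy parameter 1: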
Figure PCTCN2015093889-appb-000010
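where e_sb_start is the starting sub-band index, with a value in [0,6], and e_sb_end is the ending sub-band index, with a value greater than 6 and less than the total number of sub-bands.
Frame energy parameter 2 is obtained by adding to frame energy parameter 1 a weighted value of the energies of some or all of the filter-bank sub-bands that were not used in calculating frame energy parameter 1: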
Figure PCTCN2015093889-appb-000011
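where e_scale1 and e_scale2 are weighting scale factors, each with a value in [0,1], and num_band is the total number of sub-bands.
The spectral centroid feature parameter is the ratio of the weighted sum to the unweighted sum of the signal energies of all or some of the sub-bands, obtained as follows:
The spectral centroid feature parameter is calculated from the energy of each filter-bank sub-band, either as the ratio of the weighted sum of the filter-bank sub-band energies to their direct sum, or by smoothing other spectral centroid feature parameter values.
The spectral centroid feature parameter may be obtained with the following sub-steps:
a: The sub-band intervals used for calculating the spectral centroid feature parameter are divided as follows: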
Figure PCTCN2015093889-appb-000012
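b: Using the interval division of sub-step a and the following formula, two spectral centroid feature parameter values are calculated: the first-interval spectral centroid feature parameter and the second-interval spectral centroid feature parameter.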
Figure PCTCN2015093889-appb-000013
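Delta1 and Delta2 are each a small offset with a value in (0,1), and k is the spectral centroid index.
c: The first-interval spectral centroid feature parameter sp_center[0] is smoothed to obtain the smoothed spectral centroid feature parameter value, that is, the smoothed value of the first-interval spectral centroid feature parameter:
$sp\_center[2] = sp\_center_{-1}[2] \cdot spc\_sm\_scale + sp\_center[0] \cdot (1 - spc\_sm\_scale)$
where spc_sm_scale is the smoothing scale factor of the spectral centroid parameter, and $sp\_center_{-1}[2]$ is the smoothed spectral centroid feature parameter value of the previous frame, with an initial value of 1.6.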
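The centroid and its smoothing can be sketched as follows. This is only an illustration: the exact centroid formula is in an equation image that is not reproduced in the text, so the weight `k + 1` per sub-band, the default offsets and the value of `spc_sm_scale` are assumptions. Only the smoothing recursion is taken directly from the equation above.

```python
def spectral_centroid(energies, delta1=0.5, delta2=0.5):
    """Ratio of weighted to unweighted sub-band energy sums.

    The weight (k + 1) and the offsets delta1/delta2 are assumptions;
    the text only states that both offsets lie in (0, 1).
    """
    weighted = sum((k + 1) * e for k, e in enumerate(energies))
    plain = sum(energies)
    return (weighted + delta1) / (plain + delta2)


def smooth_centroid(prev_smoothed, current, spc_sm_scale=0.7):
    """sp_center[2] = sp_center_prev[2] * scale + sp_center[0] * (1 - scale)."""
    return prev_smoothed * spc_sm_scale + current * (1.0 - spc_sm_scale)


# Start from the initial smoothed value 1.6 given in the text.
smoothed = 1.6
for frame_energies in ([1.0, 1.0], [0.2, 3.0]):
    smoothed = smooth_centroid(smoothed, spectral_centroid(frame_energies))
```

A larger `spc_sm_scale` makes the smoothed centroid react more slowly to a single frame, which is the usual trade-off for a one-pole smoother.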
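The time-domain stability feature parameter is the ratio of the variance of the amplitude sums to the expectation of the squared amplitude sums, or that ratio multiplied by a coefficient, obtained as follows:
The time-domain stability feature parameter is calculated from the frame energy parameters of the most recent frames. In this embodiment, the frame energy parameters of the most recent 40 frames are used. The calculation steps are:
First, the energy amplitudes of the most recent 40 frames are calculated: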
Figure PCTCN2015093889-appb-000014
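where e_offset is an offset with a value in [0,0.1].
Next, the energy amplitudes of each pair of adjacent frames, from the current frame back to the 40th preceding frame, are added to obtain 20 amplitude sums:
$Amp_{t2}(n) = Amp_{t1}(-2n) + Amp_{t1}(-2n-1)$; $0 \le n < 20$
where $Amp_{t1}(0)$ is the energy amplitude of the current frame and, for $n < 0$, $Amp_{t1}(n)$ is the energy amplitude of the frame $|n|$ frames before the current frame.
Finally, the time-domain stability feature parameter lt_stable_rate0 is obtained as the ratio of the variance of the 20 most recent amplitude sums to their average energy: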
Figure PCTCN2015093889-appb-000015
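The three steps above can be sketched as follows. The amplitude formula and the final ratio are in equation images not reproduced in the text, so this sketch assumes that the energy amplitude is `sqrt(E + e_offset)` and that the "average energy" is the mean of the squared amplitude sums; `eps` is a hypothetical guard against division by zero.

```python
import math


def time_domain_stability(frame_energies, e_offset=0.0001, eps=1e-12):
    """frame_energies[0] is the current frame, frame_energies[i] is i frames back."""
    if len(frame_energies) < 40:
        raise ValueError("the most recent 40 frame energies are required")
    # Step 1 (assumed form): energy amplitude of each of the 40 frames.
    amp1 = [math.sqrt(e + e_offset) for e in frame_energies[:40]]
    # Step 2: Amp_t2(n) = Amp_t1(-2n) + Amp_t1(-2n-1), 0 <= n < 20.
    amp2 = [amp1[2 * n] + amp1[2 * n + 1] for n in range(20)]
    # Step 3: variance of the sums divided by the mean of their squares.
    mean = sum(amp2) / len(amp2)
    variance = sum((a - mean) ** 2 for a in amp2) / len(amp2)
    mean_square = sum(a * a for a in amp2) / len(amp2)
    return variance / (mean_square + eps)
```

A stationary signal gives a value near 0, while a signal whose level changes within the 40-frame window gives a larger value.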
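The spectral flatness feature parameter is the ratio of the geometric mean to the arithmetic mean of a predetermined set of spectral amplitudes, or that ratio multiplied by a coefficient.
The spectral amplitudes are smoothed to obtain:
Figure PCTCN2015093889-appb-000016
where
Figure PCTCN2015093889-appb-000017
Figure PCTCN2015093889-appb-000018
denote the smoothed spectra of the current frame and of the previous frame, respectively, and $N_A$ is the number of spectral amplitudes.
It should be noted that the predetermined set of spectral amplitudes in the embodiment of the present invention may be a portion of the spectrum selected from the practitioner's experience, or a portion selected according to the actual situation.
In this embodiment the spectral amplitudes are divided into three frequency bands, and the spectral flatness feature is calculated for each of them. The division is given in the following table:
Spectral flatness sub-band division
Figure PCTCN2015093889-appb-000019
Let $N(k) = N_{A\_end}(k) - N_{A\_start}(k)$ denote the number of spectral amplitudes used to calculate the spectral flatness; then $F_{SF}(k)$ is: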
Figure PCTCN2015093889-appb-000020
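Finally, the spectral flatness feature parameter of the current frame is smoothed to obtain the final spectral flatness feature parameter of the current frame:
Figure PCTCN2015093889-appb-000021
where
Figure PCTCN2015093889-appb-000022
Figure PCTCN2015093889-appb-000023
denote the spectral flatness of the current frame and of the previous frame, respectively.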
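As a rough illustration of the flatness measure described above, the following sketch computes the ratio of geometric to arithmetic mean for one band. The band edges in the usage lines are hypothetical (the actual division is in a table image that is not reproduced here), and `eps` is an added guard, not part of the source.

```python
import math


def spectral_flatness(amplitudes, eps=1e-12):
    """Geometric mean divided by arithmetic mean of the given amplitudes."""
    n = len(amplitudes)
    # Work in the log domain so that long products do not underflow.
    log_geometric_mean = sum(math.log(a + eps) for a in amplitudes) / n
    arithmetic_mean = sum(amplitudes) / n
    return math.exp(log_geometric_mean) / (arithmetic_mean + eps)


# Hypothetical band edges (start, end) over a smoothed amplitude spectrum.
spectrum = [1.0 + 0.1 * (i % 3) for i in range(60)]
bands = [(5, 20), (20, 40), (40, 60)]
flatness_per_band = [spectral_flatness(spectrum[a:b]) for a, b in bands]
```

The value is 1 for a perfectly flat band and drops toward 0 as the band becomes more peaked, which is why tonal or voiced frames yield low flatness.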
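The tonality feature parameter is obtained by calculating the correlation of the intra-frame spectral difference coefficients of two consecutive frames, or by further smoothing that correlation value.
The correlation of the intra-frame spectral difference coefficients of two consecutive frames is calculated as follows:
The tonality feature parameter is calculated from the spectral amplitudes; all or only some of the spectral amplitudes may be used.
The calculation steps are as follows:
a: For some (no fewer than 8 spectral coefficients) or all of the spectral amplitudes, the difference with the adjacent spectral amplitude is calculated, and any difference less than 0 is set to 0, giving a set of non-negative spectral difference coefficients: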
Figure PCTCN2015093889-appb-000024
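b: The correlation coefficient between the non-negative spectral difference coefficients of the current frame calculated in step a and those of the previous frame is calculated to obtain the first tonality feature parameter value: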
Figure PCTCN2015093889-appb-000025
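where
Figure PCTCN2015093889-appb-000026
denotes the non-negative spectral difference coefficients of the previous frame.
c: The first tonality feature parameter value is smoothed to obtain the second tonality feature parameter value
Figure PCTCN2015093889-appb-000027
and the third tonality feature parameter value
Figure PCTCN2015093889-appb-000028
where the index 0 denotes the current frame. The equations are as follows:
$F_T(0) = F_{TR}$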
Figure PCTCN2015093889-appb-000029
Figure PCTCN2015093889-appb-000030
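Steps a and b can be sketched as below. The exact correlation formula is in an equation image, so the normalized form used here (dot product divided by the product of the two norms) is an assumption, as is the `eps` guard.

```python
import math


def nonneg_spectral_diff(amplitudes):
    """Step a: adjacent-bin differences with negative values set to 0."""
    return [max(amplitudes[i + 1] - amplitudes[i], 0.0)
            for i in range(len(amplitudes) - 1)]


def tonality_correlation(current_amps, previous_amps, eps=1e-12):
    """Step b (assumed normalization): correlation of the two difference sets."""
    d_cur = nonneg_spectral_diff(current_amps)
    d_prev = nonneg_spectral_diff(previous_amps)
    dot = sum(a * b for a, b in zip(d_cur, d_prev))
    norm = math.sqrt(sum(a * a for a in d_cur) * sum(b * b for b in d_prev))
    return dot / (norm + eps)
```

When the spectral peaks stay at the same bins from one frame to the next, as with a sustained musical tone, the value approaches 1; when the peaks move, it falls toward 0.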
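Step 103: Calculate the signal-to-noise ratio (SNR) parameters of the current frame from the background noise energy obtained from the previous frame, the frame energy parameter of the current frame and the SNR sub-band energies.
The background noise energy of the previous frame may be obtained with an existing method.
If the current frame is the initial frame, the SNR sub-band background noise energies take default initial values. The SNR sub-band background noise energy of the previous frame is estimated on the same principle as that of the current frame; for the estimation for the current frame, see step 107 of this embodiment. The SNR parameters of the current frame may be calculated with an existing SNR calculation method. Optionally, the following method is used:
First, the filter-bank sub-bands are regrouped into a number of SNR sub-bands, with division indices as given in the following table: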
Figure PCTCN2015093889-appb-000031
Next, the energy of each SNR sub-band of the current frame is calculated according to the SNR sub-band division:
Figure PCTCN2015093889-appb-000032
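Then, the sub-band average SNR, SNR1, is calculated from the energy of each SNR sub-band of the current frame and the background noise energy of each SNR sub-band of the previous frame: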
Figure PCTCN2015093889-appb-000033
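where $E_{sb2\_bg}$ is the estimated background noise energy of each SNR sub-band of the previous frame, and num_band is the number of SNR sub-bands. The SNR sub-band background noise energy of the previous frame is obtained on the same principle as that of the current frame; for the latter, see step 107 of Embodiment 1 below.
Finally, the full-band SNR, SNR2, is calculated from the estimated full-band background noise energy of the previous frame and the frame energy parameter of the current frame: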
Figure PCTCN2015093889-appb-000034
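where $E_{t\_bg}$ is the estimated full-band background noise energy of the previous frame. It is obtained on the same principle as the full-band background noise energy of the current frame; for the latter, see step 107 of Embodiment 1 below.
In this embodiment the SNR parameters comprise the sub-band average SNR, SNR1, and the full-band SNR, SNR2. The full-band background noise energy and the background noise energy of each sub-band are collectively referred to as the background noise energy.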
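Step 104: Calculate the tonality flag of the current frame from the frame energy parameter, the spectral centroid feature parameter, the time-domain stability feature parameter, the spectral flatness feature parameter and the tonality feature parameter of the current frame, as follows:
104a: Assume that the current frame is a non-tonal signal, and use a tonal frame flag tonality_frame to indicate whether the current frame is a tonal frame.
In this embodiment, a tonality_frame value of 1 indicates that the current frame is a tonal frame, and 0 indicates that it is a non-tonal frame.
104b: Determine whether the tonality feature parameter or its smoothed value is greater than the corresponding set threshold tonality_decision_thr1 or tonality_decision_thr2. If either condition holds, go to step 104c; otherwise go to step 104d.
Here tonality_decision_thr1 takes a value in [0.5,0.7] and tonality_rate1 takes a value in [0.7,0.99].
104c: If the time-domain stability feature parameter lt_stable_rate0 is less than a set threshold lt_stable_decision_thr1, the spectral centroid feature parameter sp_center[1] is greater than a set threshold spc_decision_thr1, and the spectral flatness feature parameter of every sub-band is less than its corresponding preset threshold, the current frame is determined to be a tonal frame and tonality_frame is set to 1; otherwise the current frame is determined to be a non-tonal frame and tonality_frame is set to 0. Step 104d is then executed.
Here lt_stable_decision_thr1 takes a value in [0.01,0.25] and spc_decision_thr1 takes a value in [1.0,1.8].
104d: Update the tonality degree feature parameter tonality_degree according to the tonal frame flag tonality_frame. The initial value of tonality_degree is set when operation starts and lies in [0,1]. tonality_degree is calculated differently in different cases:
If the current tonal frame flag indicates that the current frame is a tonal frame, tonality_degree is updated with the following equation:
$tonality\_degree = tonality\_degree_{-1} \cdot td\_scale\_A + td\_scale\_B$
where $tonality\_degree_{-1}$ is the tonality degree feature parameter of the previous frame, with an initial value in [0,1]; td_scale_A is an attenuation coefficient with a value in [0,1]; and td_scale_B is an accumulation coefficient with a value in [0,1].
104e: Determine from the updated tonality degree feature parameter tonality_degree whether the current frame is a tonal signal, and set the value of the tonality flag tonality_flag.
If tonality_degree is greater than a set threshold, the current frame is determined to be a tonal signal; otherwise it is determined to be a non-tonal signal.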
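Step 105: Calculate the VAD decision result from the tonality flag, the SNR parameters, the spectral centroid feature parameter and the frame energy parameter. As shown in FIG. 2, the steps are as follows:
Step 105a: Calculate the long-time SNR lt_snr from the ratio of the average long-time active-voice signal energy to the average long-time background noise energy, both calculated at the previous frame.
The calculation and definition of the average long-time active-voice signal energy $E_{fg}$ and the average long-time background noise energy $E_{bg}$ are given in step 105g. The long-time SNR lt_snr is calculated as: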
Figure PCTCN2015093889-appb-000035
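In this equation the long-time SNR lt_snr is expressed logarithmically.
Step 105b: Calculate the average of the full-band SNR, SNR2, over the frames nearest to the current frame to obtain the average full-band SNR, SNR2_lt_ave.
The equation is: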
Figure PCTCN2015093889-appb-000036
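where SNR2(n) is the full-band SNR of the n-th frame before the current frame, and F_num is the total number of frames averaged, with a value in [8,64].
Step 105c: Obtain the decision SNR threshold snr_thr for the VAD decision from the spectral centroid feature parameter, the long-time SNR lt_snr, the number of consecutive active-voice frames continuous_speech_num and the number of consecutive noise frames continuous_noise_num.
The steps are as follows:
First, set the initial value of snr_thr, within [0.1,2]; for example, 1.06.
Second, adjust snr_thr a first time according to the spectral centroid feature parameter: if sp_center[2] is greater than a set threshold spc_vad_dec_thr1, add an offset to snr_thr (0.05 in this example); otherwise, if sp_center[1] is greater than spc_vad_dec_thr2, add an offset (0.10 in this example); otherwise, add an offset (0.40 in this example). The thresholds spc_vad_dec_thr1 and spc_vad_dec_thr2 take values in [1.2,2.5].
Third, adjust snr_thr a second time according to continuous_speech_num, continuous_noise_num, the average full-band SNR SNR2_lt_ave and the long-time SNR lt_snr. If continuous_speech_num is greater than a set threshold cpn_vad_dec_thr1, subtract 0.2 from snr_thr; otherwise, if continuous_noise_num is greater than a set threshold cpn_vad_dec_thr2 and SNR2_lt_ave is greater than an offset plus lt_snr multiplied by the coefficient lt_tsnr_scale, add an offset to snr_thr (0.1 in this example); otherwise, if continuous_noise_num is greater than a set threshold cpn_vad_dec_thr3, add an offset (0.2 in this example); otherwise, if continuous_noise_num is greater than a set threshold cpn_vad_dec_thr4, add an offset (0.1 in this example). The thresholds cpn_vad_dec_thr1, cpn_vad_dec_thr2, cpn_vad_dec_thr3 and cpn_vad_dec_thr4 take values in [2,500], and the coefficient lt_tsnr_scale takes a value in [0,2]. The embodiment can also be implemented by skipping this step and going directly to the last step.
Finally, make a final adjustment to snr_thr according to the value of lt_snr to obtain the decision SNR threshold snr_thr of the current frame.
The correction equation is:
$snr\_thr = snr\_thr + (lt\_snr - thr\_offset) \cdot thr\_scale$
where thr_offset is an offset with a value in [0.5,3], and thr_scale is a gain coefficient with a value in [0.1,1].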
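The threshold adaptation of step 105c can be sketched as below. The offsets 0.05, 0.10, 0.40, 0.2 and 0.1 and the initial value 1.06 come from the text; every default in `ThresholdParams` is a hypothetical value picked from inside the stated ranges, and `snr2_offset` stands in for the unspecified offset added to `lt_snr * lt_tsnr_scale`.

```python
from dataclasses import dataclass


@dataclass
class ThresholdParams:
    # All defaults are illustrative choices within the ranges in the text.
    snr_thr_init: float = 1.06
    spc_vad_dec_thr1: float = 1.6
    spc_vad_dec_thr2: float = 1.4
    cpn_vad_dec_thr1: int = 120
    cpn_vad_dec_thr2: int = 100
    cpn_vad_dec_thr3: int = 60
    cpn_vad_dec_thr4: int = 20
    snr2_offset: float = 1.0  # hypothetical: the text does not give this value
    lt_tsnr_scale: float = 0.5
    thr_offset: float = 1.5
    thr_scale: float = 0.2


def decision_snr_threshold(sp_center, lt_snr, snr2_lt_ave,
                           continuous_speech_num, continuous_noise_num,
                           p=None):
    """sp_center is indexable; indices 1 and 2 are used as in the text."""
    p = p or ThresholdParams()
    snr_thr = p.snr_thr_init
    # First adjustment: spectral centroid.
    if sp_center[2] > p.spc_vad_dec_thr1:
        snr_thr += 0.05
    elif sp_center[1] > p.spc_vad_dec_thr2:
        snr_thr += 0.10
    else:
        snr_thr += 0.40
    # Second adjustment: run lengths of speech and noise frames.
    if continuous_speech_num > p.cpn_vad_dec_thr1:
        snr_thr -= 0.2
    elif (continuous_noise_num > p.cpn_vad_dec_thr2
          and snr2_lt_ave > p.snr2_offset + lt_snr * p.lt_tsnr_scale):
        snr_thr += 0.1
    elif continuous_noise_num > p.cpn_vad_dec_thr3:
        snr_thr += 0.2
    elif continuous_noise_num > p.cpn_vad_dec_thr4:
        snr_thr += 0.1
    # Final adjustment: long-time SNR.
    return snr_thr + (lt_snr - p.thr_offset) * p.thr_scale
```

The effect is that the threshold drops during long runs of speech, making it easier to stay in the active state, and rises during long runs of noise.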
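Step 105d: Calculate the initial VAD decision result from the VAD decision threshold snr_thr and the SNR parameters SNR1 and SNR2 calculated for the current frame.
The calculation is as follows:
If SNR1 is greater than the decision threshold snr_thr, the current frame is determined to be an active-voice frame. The VAD flag vad_flag indicates whether the current frame is an active-voice frame; in this embodiment the value 1 indicates an active-voice frame and 0 an inactive-voice frame. Otherwise, the current frame is determined to be an inactive-voice frame and vad_flag is set to 0.
If SNR2 is greater than a set threshold snr2_thr, the current frame is determined to be an active-voice frame and vad_flag is set to 1. snr2_thr takes a value in [1.2,5.0].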
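Step 105e: Correct the initial VAD decision result according to the tonality flag, the average full-band SNR SNR2_lt_ave, the spectral centroid and the long-time SNR lt_snr.
The steps are as follows:
If the tonality flag indicates that the current frame is a tonal signal, that is, tonality_flag is 1, the current frame is determined to be an active-voice signal and vad_flag is set to 1.
If SNR2_lt_ave is greater than a set threshold SNR2_lt_ave_t_thr1 plus lt_snr multiplied by the coefficient lt_tsnr_tscale, the current frame is determined to be an active-voice frame and vad_flag is set to 1.
In this embodiment SNR2_lt_ave_t_thr1 takes a value in [1,4] and lt_tsnr_tscale takes a value in [0.1,0.6].
If SNR2_lt_ave is greater than a set threshold SNR2_lt_ave_t_thr2, the spectral centroid feature parameter sp_center[2] is greater than a set threshold sp_center_t_thr1, and lt_snr is less than a set threshold lt_tsnr_t_thr1, the current frame is determined to be an active-voice frame and vad_flag is set to 1. SNR2_lt_ave_t_thr2 takes a value in [1.0,2.5], sp_center_t_thr1 in [2.0,4.0] and lt_tsnr_t_thr1 in [2.5,5.0].
If SNR2_lt_ave is greater than a set threshold SNR2_lt_ave_t_thr3, sp_center[2] is greater than a set threshold sp_center_t_thr2, and lt_snr is less than a set threshold lt_tsnr_t_thr2, the current frame is determined to be an active-voice frame and vad_flag is set to 1. SNR2_lt_ave_t_thr3 takes a value in [0.8,2.0], sp_center_t_thr2 in [2.0,4.0] and lt_tsnr_t_thr2 in [2.5,5.0].
If SNR2_lt_ave is greater than a set threshold SNR2_lt_ave_t_thr4, sp_center[2] is greater than a set threshold sp_center_t_thr3, and lt_snr is less than a set threshold lt_tsnr_t_thr3, the current frame is determined to be an active-voice frame and vad_flag is set to 1. SNR2_lt_ave_t_thr4 takes a value in [0.6,2.0], sp_center_t_thr3 in [3.0,6.0] and lt_tsnr_t_thr3 in [2.5,5.0].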
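Step 105f: Correct the number of active-voice hangover frames according to the decision results of several frames preceding the current frame, the long-time SNR lt_snr, the average full-band SNR SNR2_lt_ave, the SNR parameters of the current frame and the VAD decision result of the current frame.
The calculation steps are as follows:
The precondition for correcting the current number of hangover frames is that the active-voice flag indicates that the current frame is an active-voice frame. If this condition is not met, the value of num_speech_hangover is not corrected and the process goes directly to step 105g.
The number of hangover frames is corrected as follows:
If the number of consecutive speech frames continuous_speech_num is less than a set first threshold continuous_speech_num_thr1 and lt_snr is less than a set threshold lt_tsnr_h_thr1, num_speech_hangover is set to the minimum number of consecutive active-voice frames minus continuous_speech_num. Otherwise, if SNR2_lt_ave is greater than a set threshold SNR2_lt_ave_thr1 and continuous_speech_num is greater than a set second threshold continuous_speech_num_thr2, num_speech_hangover is set according to the value of lt_snr. Otherwise, num_speech_hangover is not corrected. In this embodiment the minimum number of consecutive active-voice frames is 8; it may take a value in [6,20]. The first threshold continuous_speech_num_thr1 and the second threshold continuous_speech_num_thr2 may be the same or different.
When num_speech_hangover is set according to lt_snr, the steps are as follows:
If lt_snr is greater than 2.6, num_speech_hangover is 3; otherwise, if lt_snr is greater than 1.6, num_speech_hangover is 4; otherwise, num_speech_hangover is 5.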
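The hangover correction of step 105f can be sketched as follows. The values 8, 2.6, 1.6 and the resulting 3/4/5 come from the text; the four threshold defaults are hypothetical, because the source does not give numeric values for them.

```python
def correct_hangover(num_speech_hangover, vad_flag, continuous_speech_num,
                     lt_snr, snr2_lt_ave, *,
                     continuous_speech_num_thr1=8,   # hypothetical
                     lt_tsnr_h_thr1=4.0,             # hypothetical
                     snr2_lt_ave_thr1=1.0,           # hypothetical
                     continuous_speech_num_thr2=12,  # hypothetical
                     min_continuous_active=8):
    """Return the corrected number of active-voice hangover frames."""
    # Precondition: only correct on frames flagged as active voice.
    if vad_flag != 1:
        return num_speech_hangover
    if (continuous_speech_num < continuous_speech_num_thr1
            and lt_snr < lt_tsnr_h_thr1):
        return min_continuous_active - continuous_speech_num
    if (snr2_lt_ave > snr2_lt_ave_thr1
            and continuous_speech_num > continuous_speech_num_thr2):
        if lt_snr > 2.6:
            return 3
        if lt_snr > 1.6:
            return 4
        return 5
    return num_speech_hangover
```

Short bursts in poor SNR get a hangover that pads them to the minimum run length, while long, clean speech runs get a shorter hangover as the long-time SNR improves.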
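Step 105g: Add the active-voice hangover according to the decision result of the current frame and the number of hangover frames num_speech_hangover to obtain the VAD decision result of the current frame.
The method is as follows:
If the current frame is determined to be inactive voice, that is, the active-voice flag is 0, and num_speech_hangover is greater than 0, the hangover is added: the active-voice flag is set to 1 and num_speech_hangover is decremented by 1.
This gives the final VAD decision result of the current frame.
Optionally, after step 105d the method may further include: calculating the average long-time active-voice signal energy $E_{fg}$ from the initial VAD decision result, for use in the VAD decision of the next frame. After step 105g the method may further include: calculating the average long-time background noise energy $E_{bg}$ from the VAD decision result of the current frame, for use in the VAD decision of the next frame.
The average long-time active-voice signal energy $E_{fg}$ is calculated as follows:
a) If the initial VAD decision result indicates that the current frame is an active-voice frame, that is, the VAD flag is 1, and $E_{t1}$ is greater than a multiple of $E_{bg}$ (6 times in this embodiment), the accumulated average long-time active-voice energy fg_energy and the accumulated frame count fg_energy_count are updated: $E_{t1}$ is added to fg_energy, and fg_energy_count is incremented by 1.
b) To ensure that the average long-time active-voice signal energy reflects the most recent active-voice signal energy, if the accumulated frame count equals a set value fg_max_frame_num, both the accumulated frame count and the accumulated energy are multiplied by an attenuation coefficient attenu_coef1. In this embodiment fg_max_frame_num is 512 and attenu_coef1 is 0.75.
c) The average long-time active-voice signal energy is obtained by dividing fg_energy by the accumulated frame count: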
Figure PCTCN2015093889-appb-000037
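The average long-time background noise energy $E_{bg}$ is calculated as follows:
Let bg_energy_count be the accumulated background noise frame count, which records how many frames of energy are included in the most recent accumulated background noise energy, and let bg_energy be the most recent accumulated background noise energy.
a) If the current frame is determined to be an inactive-voice frame, that is, the VAD flag is 0, and SNR2 is less than 1.0, bg_energy and bg_energy_count are updated: $E_{t1}$ is added to bg_energy, and bg_energy_count is incremented by 1.
b) If bg_energy_count equals the maximum frame count used for calculating the average long-time background noise energy, both the accumulated frame count and the accumulated energy are multiplied by the attenuation coefficient attenu_coef2. In this embodiment the maximum frame count is 512 and attenu_coef2 is 0.75.
c) The average long-time background noise energy is obtained by dividing bg_energy by the accumulated background noise frame count: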
Figure PCTCN2015093889-appb-000038
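Both long-time averages use the same accumulate-and-decay pattern, which can be sketched with one small class. The class and method names are hypothetical; the limit of 512 frames, the decay of 0.75, the factor 6 and the SNR2 bound of 1.0 come from the text. Returning 0.0 before any frame has been accumulated is an assumption.

```python
class LongTimeEnergy:
    """Accumulated energy and frame count with periodic decay."""

    def __init__(self, max_frames=512, decay=0.75):
        self.max_frames = max_frames
        self.decay = decay
        self.energy = 0.0
        self.count = 0.0

    def add(self, frame_energy):
        self.energy += frame_energy
        self.count += 1
        # Decay both accumulators so that recent frames dominate.
        if self.count == self.max_frames:
            self.energy *= self.decay
            self.count *= self.decay

    def average(self):
        return self.energy / self.count if self.count > 0 else 0.0


def update_foreground(fg, initial_vad_flag, et1, e_bg, factor=6.0):
    """Accumulate active-voice energy when the frame is clearly above noise."""
    if initial_vad_flag == 1 and et1 > factor * e_bg:
        fg.add(et1)


def update_background(bg, final_vad_flag, et1, snr2):
    """Accumulate noise energy on inactive frames with low full-band SNR."""
    if final_vad_flag == 0 and snr2 < 1.0:
        bg.add(et1)
```

Because energy and count are scaled by the same factor, the decay leaves the average unchanged at the moment it is applied and only increases the weight of the frames that follow.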
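It should also be noted that Embodiment 1 may further include the following steps:
Step 106: Calculate the background noise update flag from the VAD decision result of the current frame, the tonality feature parameter, the SNR parameters, the tonality flag and the time-domain stability feature parameter. The calculation method is described in Embodiment 2 below.
Step 107: Obtain the background noise energy of the current frame from the background noise update flag, the frame energy parameter of the current frame and the full-band background noise energy of the previous frame. The background noise energy of the current frame is used to calculate the SNR parameters of the next frame.
Whether to update the background noise is determined from the background noise update flag. If the flag is 1, the background noise is updated according to the ratio of the estimated full-band background noise energy to the energy of the current frame signal. Background noise energy estimation includes sub-band background noise energy estimation and full-band background noise energy estimation.
a: The sub-band background noise energy is estimated with the following equation:
$E_{sb2\_bg}(k) = E_{sb2\_bg\_pre}(k) \cdot \alpha_{bg\_e} + E_{sb2\_bg}(k) \cdot (1 - \alpha_{bg\_e})$; $0 \le k < num\_sb$
where num_sb is the number of frequency-domain sub-bands and $E_{sb2\_bg\_pre}(k)$ is the sub-band background noise energy of the k-th SNR sub-band of the previous frame.
$\alpha_{bg\_e}$ is the background noise update factor; its value is determined by the full-band background noise energy of the previous frame and the frame energy parameter of the current frame, as follows:
If the full-band background noise energy $E_{t\_bg}$ of the previous frame is less than the frame energy parameter $E_{t1}$ of the current frame, $\alpha_{bg\_e}$ is 0.96; otherwise it is 0.95.
b: Full-band background noise energy estimation:
If the background noise update flag of the current frame is 1, the accumulated background noise energy $E_{t\_sum}$ and the accumulated background noise frame count $N_{Et\_counter}$ are updated:
$E_{t\_sum} = E_{t\_sum,-1} + E_{t1}$
$N_{Et\_counter} = N_{Et\_counter,-1} + 1$
where $E_{t\_sum,-1}$ is the accumulated background noise energy of the previous frame and $N_{Et\_counter,-1}$ is the accumulated background noise frame count calculated at the previous frame.
c: The full-band background noise energy is obtained as the ratio of the accumulated background noise energy $E_{t\_sum}$ to the accumulated frame count $N_{Et\_counter}$: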
Figure PCTCN2015093889-appb-000039
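Determine whether $N_{Et\_counter}$ equals 64; if so, multiply both the accumulated background noise energy $E_{t\_sum}$ and the accumulated frame count $N_{Et\_counter}$ by 0.75.
d: Adjust the sub-band background noise energies and the accumulated background noise energy according to the tonality flag, the frame energy parameter and the full-band background noise energy, as follows:
If the tonality flag tonality_flag equals 1 and the frame energy parameter $E_{t1}$ is less than the background noise energy feature parameter $E_{t\_bg}$ multiplied by a gain coefficient gain,
then $E_{t\_sum} = E_{t\_sum} \cdot gain + delta$ and $E_{sb2\_bg}(k) = E_{sb2\_bg}(k) \cdot gain + delta$,
where gain takes a value in [0.3,1].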
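Sub-steps b and c of the full-band estimate, together with the choice of the update factor, can be sketched as follows. The class name is hypothetical, the accumulators are assumed to start at 0, and returning 0.0 before the first update is also an assumption; the constants 64, 0.75, 0.96 and 0.95 come from the text.

```python
class FullBandNoiseEstimator:
    """Running full-band background noise energy (sub-steps b and c)."""

    def __init__(self):
        self.e_t_sum = 0.0       # accumulated background noise energy
        self.n_counter = 0.0     # accumulated background noise frame count

    def update(self, background_flag, et1):
        if background_flag == 1:
            self.e_t_sum += et1
            self.n_counter += 1
        e_t_bg = self.e_t_sum / self.n_counter if self.n_counter > 0 else 0.0
        # After 64 accumulated frames, scale both accumulators by 0.75.
        if self.n_counter == 64:
            self.e_t_sum *= 0.75
            self.n_counter *= 0.75
        return e_t_bg


def background_update_factor(e_t_bg_prev, et1):
    """alpha_bg_e: 0.96 if the previous noise estimate is below the frame energy."""
    return 0.96 if e_t_bg_prev < et1 else 0.95
```

The slightly larger factor when the frame is louder than the current estimate makes the sub-band noise estimate rise more slowly than it falls.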
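Embodiment 2
An embodiment of the present invention further provides a background noise detection method. As shown in FIG. 3, the method includes:
Step 201: Obtain the sub-band signals and the spectral amplitudes of the current frame.
Step 202: Calculate the frame energy parameter, the spectral centroid feature parameter and the time-domain stability feature parameter from the sub-band signals, and calculate the spectral flatness feature parameter and the tonality feature parameter from the spectral amplitudes.
The frame energy parameter is a weighted or direct sum of the signal energies of the sub-bands.
The spectral centroid feature parameter is the ratio of the weighted sum to the unweighted sum of the signal energies of all or some of the sub-bands, or the value obtained by smoothing that ratio.
The time-domain stability feature parameter is the ratio of the variance of the frame energy amplitudes to the expectation of the squared amplitude sums, or that ratio multiplied by a coefficient.
The spectral flatness parameter is the ratio of the geometric mean to the arithmetic mean of a predetermined set of spectral amplitudes, or that ratio multiplied by a coefficient.
Steps 201 and 202 may use the same methods as described above, which are not repeated here.
Step 203: Perform background noise detection according to the spectral centroid feature parameter, the time-domain stability feature parameter, the spectral flatness feature parameter, the tonality feature parameter and the frame energy parameter of the current frame, to determine whether the current frame is background noise.
First, the current frame is assumed to be background noise and the background noise update flag is set to a first preset value. Then, if any of the following conditions holds, the current frame is determined not to be a noise signal and the background noise update flag is set to a second preset value:
the time-domain stability feature parameter lt_stable_rate0 is greater than a set threshold;
the smoothed value of the spectral centroid feature parameter is greater than a set threshold, and the time-domain stability feature parameter is also greater than a set threshold;
the tonality feature parameter, or its smoothed value, is greater than a set threshold, and the time-domain stability feature parameter lt_stable_rate0 is greater than its set threshold;
the spectral flatness feature parameter of every sub-band, or the smoothed value of the spectral flatness feature parameter of every sub-band, is less than its corresponding set threshold;
or the frame energy parameter $E_{t1}$ is greater than a set threshold E_thr1.
The current frame is initially assumed to be background noise.
This embodiment uses a background noise update flag background_flag to indicate whether the current frame is background noise. By convention, background_flag is set to 1 (the first preset value) if the current frame is determined to be background noise, and to 0 (the second preset value) otherwise.
Whether the current frame is a noise signal is detected according to the time-domain stability feature parameter, the spectral centroid feature parameter, the spectral flatness feature parameter, the tonality feature parameter and the frame energy parameter of the current frame. If it is not a noise signal, background_flag is set to 0.
The process is as follows:
Determine whether the time-domain stability feature parameter lt_stable_rate0 is greater than a set threshold lt_stable_rate_thr1. If so, the current frame is determined not to be a noise signal and background_flag is set to 0. In this embodiment lt_stable_rate_thr1 takes a value in [0.8,1.6].
Determine whether the smoothed value of the spectral centroid feature parameter is greater than a set threshold sp_center_thr1 and the time-domain stability feature parameter is also greater than a set threshold lt_stable_rate_thr2. If so, the current frame is determined not to be a noise signal and background_flag is set to 0. sp_center_thr1 takes a value in [1.6,4], and lt_stable_rate_thr2 takes a value in (0,0.1].
Determine whether the value of the tonality feature parameter
Figure PCTCN2015093889-appb-000040
is greater than a set threshold tonality_rate_thr1 and whether the time-domain stability feature parameter lt_stable_rate0 is greater than a set threshold lt_stable_rate_thr3. If both conditions hold, the current frame is determined not to be background noise and background_flag is set to 0. tonality_rate_thr1 takes a value in [0.4,0.66], and lt_stable_rate_thr3 takes a value in [0.06,0.3].
Determine whether the spectral flatness feature parameter $F_{SSF}(0)$ is less than a set threshold sSMR_thr1, whether $F_{SSF}(1)$ is less than a set threshold sSMR_thr2, and whether $F_{SSF}(2)$ is less than a set threshold sSMR_thr3. If all of these conditions hold, the current frame is determined not to be background noise and background_flag is set to 0. The thresholds sSMR_thr1, sSMR_thr2 and sSMR_thr3 take values in [0.88,0.98]. Determine whether $F_{SSF}(0)$ is less than a set threshold sSMR_thr4, whether $F_{SSF}(1)$ is less than a set threshold sSMR_thr5, and whether $F_{SSF}(2)$ is less than a set threshold sSMR_thr6. If any of these conditions holds, the current frame is determined not to be background noise and background_flag is set to 0. The thresholds sSMR_thr4, sSMR_thr5 and sSMR_thr6 take values in [0.80,0.92].
Determine whether the frame energy parameter $E_{t1}$ is greater than a set threshold E_thr1. If so, the current frame is determined not to be background noise and background_flag is set to 0. E_thr1 is chosen according to the dynamic range of the frame energy parameter.
If none of the above checks rules the current frame out as background noise, the current frame is background noise.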
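The detection logic of step 203 can be sketched as a chain of rejection tests. All threshold defaults below are hypothetical values chosen inside the ranges given in the text; `e_thr1` in particular is a placeholder, since the text only says it depends on the dynamic range of the frame energy.

```python
from dataclasses import dataclass


@dataclass
class NoiseThresholds:
    # Illustrative values within the ranges stated in the text.
    lt_stable_rate_thr1: float = 1.2
    sp_center_thr1: float = 2.0
    lt_stable_rate_thr2: float = 0.05
    tonality_rate_thr1: float = 0.5
    lt_stable_rate_thr3: float = 0.1
    ssmr_all: tuple = (0.93, 0.93, 0.93)   # sSMR_thr1..3
    ssmr_any: tuple = (0.85, 0.85, 0.85)   # sSMR_thr4..6
    e_thr1: float = 1e8                    # placeholder


def background_flag(lt_stable_rate0, sp_center_smoothed, tonality_rate,
                    flatness, et1, t=None):
    """Return 1 if the frame is treated as background noise, else 0."""
    t = t or NoiseThresholds()
    if lt_stable_rate0 > t.lt_stable_rate_thr1:
        return 0
    if (sp_center_smoothed > t.sp_center_thr1
            and lt_stable_rate0 > t.lt_stable_rate_thr2):
        return 0
    if (tonality_rate > t.tonality_rate_thr1
            and lt_stable_rate0 > t.lt_stable_rate_thr3):
        return 0
    if all(f < thr for f, thr in zip(flatness, t.ssmr_all)):
        return 0
    if any(f < thr for f, thr in zip(flatness, t.ssmr_any)):
        return 0
    if et1 > t.e_thr1:
        return 0
    return 1
```

The flag starts from the assumption "noise" and is only cleared by positive evidence of speech or music, so a frame with ambiguous features is still used to update the noise estimate.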
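Embodiment 3
An embodiment of the present invention further provides a method for correcting the number of active-voice hangover frames in the VAD decision. As shown in FIG. 4, the method includes:
Step 301: Calculate the long-time SNR lt_snr from the sub-band signals.
The long-time SNR lt_snr is calculated from the ratio of the average long-time active-voice signal energy to the average long-time background noise energy, both calculated at the previous frame; lt_snr may be expressed logarithmically.
Step 302: Calculate the average full-band SNR SNR2_lt_ave.
The average of the full-band SNR, SNR2, over the frames nearest to the current frame is calculated to obtain SNR2_lt_ave.
Step 303: Correct the current number of active-voice hangover frames according to the decision results of several frames preceding the current frame, the long-time SNR lt_snr, the average full-band SNR SNR2_lt_ave, the SNR parameters of the current frame and the VAD decision result of the current frame.
It will be understood that the precondition for correcting the current number of hangover frames is that the active-voice flag indicates that the current frame is an active-voice frame.
When the current number of hangover frames is corrected, if the number of consecutive speech frames is less than a set first threshold (threshold 1) and the long-time SNR lt_snr is less than a set threshold (threshold 2), the current number of hangover frames is set to the minimum number of consecutive active-voice frames minus the number of consecutive speech frames. Otherwise, if the average full-band SNR SNR2_lt_ave is greater than a set threshold (threshold 3) and the number of consecutive speech frames is greater than a set second threshold (threshold 4), the number of hangover frames is set according to the value of the long-time SNR. Otherwise, the current number of hangover frames num_speech_hangover is not corrected.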
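Embodiment 4
An embodiment of the present invention provides a method for acquiring the number of voice activity modification frames. As shown in FIG. 5, the steps are as follows:
401: Obtain the VAD decision result of the current frame with the method of Embodiment 1 of the present invention.
402: Obtain the number of active-voice hangover frames with Embodiment 3 of the present invention.
403: Obtain the number of background noise updates update_count, as follows:
403a: Calculate the background noise update flag background_flag with Embodiment 2 of the present invention.
403b: If the background noise update flag indicates background noise and the number of background noise updates is less than 1000, increment the number of background noise updates by 1. The initial value of the number of background noise updates is 0.
404: Acquire the number of voice activity modification frames warm_hang_num according to the VAD decision result of the current frame, the number of background noise updates and the number of active-voice hangover frames.
Specifically, when the VAD decision result of the current frame is an active-voice frame and the number of background noise updates is less than a preset threshold, for example 12, the number of voice activity modification frames is set to the larger of a constant, for example 20, and the number of active-voice hangover frames.
The method may further include 405: Correct the VAD decision result according to the VAD decision result and the number of voice activity modification frames, as follows:
When the VAD decision result indicates that the current frame is an inactive-voice frame and the number of voice activity modification frames is greater than 0, the current frame is set to an active-voice frame and the number of voice activity modification frames is decremented by 1.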
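Steps 403b and 404 can be sketched as two small functions. The limit 1000, the threshold 12 and the constant 20 are the example values from the text; the function names are hypothetical. The text does not say what happens to warm_hang_num when the condition in 404 is not met, so keeping the previous value is an assumption of this sketch.

```python
def update_background_count(update_count, background_flag, limit=1000):
    """Step 403b: count background noise updates, saturating at the limit."""
    if background_flag == 1 and update_count < limit:
        return update_count + 1
    return update_count


def modification_frames(warm_hang_num, vad_flag, update_count,
                        num_speech_hangover, count_thr=12, constant=20):
    """Step 404: number of voice activity modification frames."""
    if vad_flag == 1 and update_count < count_thr:
        return max(constant, num_speech_hangover)
    # Assumption: otherwise the previous value is kept unchanged.
    return warm_hang_num
```

The condition on `update_count` restricts the longer hangover to the start-up phase, before the background noise estimate has been updated often enough to be trusted.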
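Corresponding to the foregoing method for acquiring the number of voice activity modification frames, an embodiment of the present invention further provides an apparatus 60 for acquiring the number of voice activity modification frames. As shown in FIG. 6, the acquiring apparatus 60 includes:
a first acquisition unit 61, configured to obtain the VAD decision result of the current frame;
a second acquisition unit 62, configured to obtain the number of active-voice hangover frames;
a third acquisition unit 63, configured to obtain the number of background noise updates;
a fourth acquisition unit 64, configured to acquire the number of voice activity modification frames according to the VAD decision result of the current frame, the number of background noise updates and the number of active-voice hangover frames.
For the workflow and operating principle of each unit of the acquiring apparatus in this embodiment, see the description in the method embodiments above; they are not repeated here.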
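Embodiment 5
An embodiment of the present invention provides a voice activity detection method. As shown in FIG. 7, the steps are as follows:
501: Obtain a first VAD decision result vada_flag with the method of Embodiment 1 of the present invention; obtain a second VAD decision result vadb_flag.
It should be noted that the second VAD decision result vadb_flag is obtained with any existing voice activity detection decision scheme; existing schemes are not described in detail here.
502: Obtain the number of active-voice hangover frames with Embodiment 3 of the present invention.
503: Obtain the number of background noise updates update_count, as follows:
503a: Calculate the background noise update flag background_flag with Embodiment 2 of the present invention.
503b: If the background noise update flag indicates background noise and the number of background noise updates is less than 1000, increment the number of background noise updates by 1. The initial value of the number of background noise updates is 0.
504: Calculate the number of voice activity modification frames warm_hang_num according to vada_flag, the number of background noise updates and the number of active-voice hangover frames, as follows:
When vada_flag indicates an active-voice frame and the number of background noise updates is less than 12, the number of voice activity modification frames is set to the larger of 20 and the number of active-voice hangover frames.
505: Calculate the VAD decision result according to vadb_flag and the number of voice activity modification frames, as follows:
When vadb_flag indicates that the current frame is an inactive-voice frame and the number of voice activity modification frames is greater than 0, the current frame is set to an active-voice frame and the number of voice activity modification frames is decremented by 1.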
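Step 505 can be sketched as a single function applied once per frame. The function names are hypothetical, and the short loop at the end is only a toy trace that holds warm_hang_num fixed at its starting value instead of recomputing it per frame as step 504 would.

```python
def final_vad_decision(vadb_flag, warm_hang_num):
    """Step 505: return (vad_flag, updated warm_hang_num)."""
    if vadb_flag == 0 and warm_hang_num > 0:
        # Inactive frame inside the modification window: report it as active.
        return 1, warm_hang_num - 1
    return vadb_flag, warm_hang_num


def run_frames(vadb_flags, warm_hang_num):
    """Toy trace over a sequence of second-VAD decisions."""
    decisions = []
    for flag in vadb_flags:
        decision, warm_hang_num = final_vad_decision(flag, warm_hang_num)
        decisions.append(decision)
    return decisions, warm_hang_num


decisions, remaining = run_frames([1, 0, 0, 0], warm_hang_num=2)
```

In this trace the first two inactive frames after the speech frame are kept active and the third is released, which is how the modification frames extend the decision of the second detector.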
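Corresponding to the foregoing voice activity detection method, an embodiment of the present invention further provides a voice activity detection apparatus. As shown in FIG. 8, the detection apparatus 80 includes:
a fifth acquisition unit 81, configured to obtain the first VAD decision result;
a sixth acquisition unit 82, configured to obtain the number of active-voice hangover frames;
a seventh acquisition unit 83, configured to obtain the number of background noise updates;
a first calculation unit 84, configured to calculate the number of voice activity modification frames according to the first VAD decision result, the number of background noise updates and the number of active-voice hangover frames;
an eighth acquisition unit 85, configured to obtain the second VAD decision result;
a second calculation unit 86, configured to calculate the VAD decision result according to the number of voice activity modification frames and the second VAD decision result.
For the workflow and operating principle of each unit of the voice activity detection apparatus in this embodiment, see the description in the method embodiments above; they are not repeated here.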
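Many modern speech coding standards, such as AMR and AMR-WB, support VAD. In terms of efficiency, however, the VAD of these coders does not perform well under all typical background noises. Under non-stationary noise in particular, such as office noise, their VAD efficiency is low. For music signals, these VADs sometimes make detection errors, which causes a noticeable quality degradation in the corresponding processing algorithms.
The technical solution provided by the embodiments of the present invention overcomes the shortcomings of existing VAD algorithms: it improves the efficiency of VAD under non-stationary noise and also improves the accuracy of music detection, so that speech and audio signal processing algorithms using this solution can achieve better performance.
In addition, the background noise detection method provided by the embodiments of the present invention makes the background noise estimate more accurate and stable, which helps improve the accuracy of VAD. The tonal signal detection method provided by the embodiments improves the accuracy of tonal music detection. The method for correcting the number of active-voice hangover frames allows the VAD algorithm to achieve a better balance between performance and efficiency under different noises and SNRs. The method for adjusting the decision SNR threshold in the VAD decision allows the VAD decision algorithm to achieve good accuracy under different SNRs and further improves efficiency while maintaining quality.
A person of ordinary skill in the art will understand that all or some of the steps of the above embodiments may be implemented as a computer program, which may be stored in a computer-readable storage medium and executed on a corresponding hardware platform (such as a system, equipment, an apparatus or a device); when executed, it performs one of the steps of the method embodiments or a combination thereof.
Optionally, all or some of the steps of the above embodiments may also be implemented with integrated circuits; the steps may each be made into a separate integrated circuit module, or several of the modules or steps may be made into a single integrated circuit module.
The apparatuses, functional modules and functional units in the above embodiments may be implemented with general-purpose computing devices; they may be concentrated on a single computing device or distributed over a network of several computing devices.
When the apparatuses, functional modules and functional units in the above embodiments are implemented as software functional modules and sold or used as independent products, they may be stored in a computer-readable storage medium, which may be a read-only memory, a magnetic disk, an optical disc or the like.
Industrial Applicability
The technical solution provided by the embodiments of the present invention overcomes the shortcomings of existing VAD algorithms: it improves the efficiency of VAD under non-stationary noise and also improves the accuracy of music detection, so that speech and audio signal processing algorithms using this solution can achieve better performance. In addition, the background noise detection method provided by the embodiments makes the background noise estimate more accurate and stable, which helps improve the accuracy of VAD. The tonal signal detection method improves the accuracy of tonal music detection. The method for correcting the number of active-voice hangover frames allows the VAD algorithm to achieve a better balance between performance and efficiency under different noises and SNRs. The method for adjusting the decision SNR threshold in the VAD decision allows the VAD decision algorithm to achieve good accuracy under different SNRs and further improves efficiency while maintaining quality.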

Claims (27)

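  1. A method for acquiring a number of voice activity modification frames, the method comprising:
    obtaining the voice activity detection decision result of the current frame;
    obtaining the number of active-voice hangover frames;
    obtaining the number of background noise updates;
    acquiring the number of voice activity modification frames according to the voice activity detection decision result of the current frame, the number of background noise updates and the number of active-voice hangover frames.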
  2. 根据权利要求1所述的方法,其中,所述获得当前帧的激活音检测判决结果包括:
    获得所述当前帧的子带信号及频谱幅值;
    根据所述子带信号计算得到所述当前帧的帧能量参数、谱重心特征参数和时域稳定度特征参数;根据所述频谱幅值计算得到谱平坦度特征参数和调性特征参数;
    根据利用所述当前帧的前一帧得到的背景噪声能量、所述帧能量参数及信噪比子带能量计算得到所述当前帧的信噪比参数;
    根据所述帧能量参数、所述谱重心特征参数、所述时域稳定度特征参数、所述谱平坦度特征参数、所述调性特征参数计算得到所述当前帧的调性标志;
    根据所述调性标志、所述信噪比参数、所述谱重心特征参数、所述帧能量参数计算得到所述激活音检测判决结果。
  3. 根据权利要求2所述的方法,其中,所述根据所述调性标志、所述信噪比参数、所述谱重心特征参数、所述帧能量参数计算得到所述激活音检测判决结果包括:
    通过所述当前帧的前一帧计算得到的平均长时激活音信号能量和平均长时背景噪声能量的比值,计算得到长时信噪比;
    计算距离所述当前帧最近的多个帧的全带信噪比的平均值,得到平均全带信噪比;
    根据所述谱重心特征参数、所述长时信噪比、连续激活音帧个数和连续噪声帧个数得到激活音检测判决的判决信噪比门限;
    根据所述激活音检测的判决门限和所述信噪比参数计算得到初始的激活音检测判决结果;
    根据所述调性标志、所述平均全带信噪比、所述谱重心特征参数和所述长时信噪比对所述初始的激活音检测判决结果进行修正,得到所述激活音检测判决结果。
  4. 根据权利要求1所述的方法,其中,所述根据所述当前帧的激活音检测判决结果、所述背景噪声更新次数和所述激活音保持帧数获取激活音修正帧数包括:
    当所述当前帧的激活音检测判决结果为激活音帧,且所述背景噪声更新次数小于预设门限值时,则所述激活音修正帧数为一个常数和所述激活音保持帧数中的最大值。
  5. 根据权利要求1所述的方法,其中,所述获得激活音保持帧数包括:
    获得所述当前帧的子带信号及频谱幅值;
    根据所述子带信号计算得到长时信噪比和平均全带信噪比,根据所述当前帧之前的多个帧的激活音检测的判决结果、长时信噪比、平均全带信噪比、所述当前帧的激活音检测判决结果,对当前激活音保持帧数进行修正获得所述激活音保持帧数。
  6. 根据权利要求5所述的方法,其中,所述根据所述子带信号计算得到长时信噪比和平均全带信噪比包括:
    通过利用所述当前帧的前一帧计算得到的平均长时激活音信号能量和平均长时背景噪声能量的比值,计算得到所述长时信噪比;计算距离所述当前帧最近的多个帧的全带信噪比的平均值,得到所述平均全带信噪比。
  7. 根据权利要求5所述的方法,其中,对所述当前激活音保持帧数进行修正的前提条件是激活音标志指示所述当前帧为激活音帧。
  8. 根据权利要求5所述的方法,其中,所述对当前激活音保持帧数进 行修正获得所述激活音保持帧数包括:
    如果所述连续语音帧数小于一个设定的第一门限值,并且所述长时信噪比小于一个设定的门限值,则所述激活音保持帧数等于最小连续激活音帧数减去所述连续语音帧数;如果所述平均全带信噪比大于一个设定的门限值,并且所述连续语音帧数大于一个设定的第二门限值,则根据所述长时信噪比的大小设置所述激活音保持帧数的值。
  9. 根据权利要求1所述的方法,其中,所述获得背景噪声更新次数包括:
    获得背景噪声更新标识;
    根据所述背景噪声更新标识计算所述背景噪声更新次数。
  10. 根据权利要求9所述的方法,其中,所述根据所述背景噪声更新标识计算所述背景噪声更新次数包括:
    当所述背景噪声更新标识指示所述当前帧为背景噪声,且所述背景噪声更新次数小于设定的门限值时,将所述背景噪声更新次数加1。
  11. 根据权利要求9所述的方法,其中,所述获得背景噪声更新标识包括:
    获得所述当前帧的子带信号及频谱幅值;
    根据所述子带信号计算得到帧能量参数、谱重心特征参数、时域稳定度特征参数;根据所述频谱幅值计算得到谱平坦度特征参数和调性特征参数;
    根据所述谱重心特征参数、所述时域稳定度特征参数、所述谱平坦度特征参数、所述调性特征参数、所述帧能量参数进行背景噪声检测,获得所述背景噪声更新标识。
  12. 根据权利要求11所述方法,其中,所述根据所述谱重心特征参数、所述时域稳定度特征参数、所述谱平坦度特征参数、所述调性特征参数、所述帧能量参数进行背景噪声检测,获得所述背景噪声更新标识,包括:
    设置所述背景噪声更新标识为第一预设值;
    如果以下任一条件成立,则判断所述当前帧不是噪声信号,并将所述背 景噪声更新标识设置为第二预设值:
    所述时域稳定度特征参数大于一个设定的门限值;
    所述谱重心特征参数值的平滑滤波值大于一个设定的门限值,且所述时域稳定度特征参数值也大于一个设定的门限值;
    所述调性特征参数或所述调性特征参数平滑滤波后的值大于一个设定的门限值,且时域稳定度特征参数值大于设定的门限值;
    每个子带的谱平坦度特征参数或所述每个子带的谱平坦度特征参数各自平滑滤波后的值均小于各自对应的设定的门限值;
    或,所述帧能量参数的值大于设定的门限值。
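  13. A voice activity detection method, the method comprising:
    obtaining a first voice activity detection decision result;
    obtaining the number of active-voice hangover frames;
    obtaining the number of background noise updates;
    calculating the number of voice activity modification frames according to the first voice activity detection decision result, the number of background noise updates and the number of active-voice hangover frames;
    obtaining a second voice activity detection decision result;
    calculating the voice activity detection decision result according to the number of voice activity modification frames and the second voice activity detection decision result.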
  14. 根据权利要求13所述的方法,其中,所述根据所述激活音修正帧数和所述第二激活音检测判决结果计算所述激活音检测判决结果包括:
    当所述第二激活音检测判决结果指示所述当前帧为非激活音帧,且所述激活音修正帧数大于0时,将所述激活音检测判决结果设置为激活音帧,且所述激活音修正帧数减1。
  15. 根据权利要求13所述的方法,其中,所述获得第一激活音检测判决结果包括:
    获得当前帧的子带信号及频谱幅值;
    根据所述子带信号计算得到所述当前帧的帧能量参数、谱重心特征参数 和时域稳定度特征参数;根据所述频谱幅值计算得到谱平坦度特征参数和调性特征参数;
    根据利用所述当前帧的前一帧得到的背景噪声能量、所述帧能量参数及信噪比子带能量计算得到所述当前帧的信噪比参数;
    根据所述帧能量参数、所述谱重心特征参数、所述时域稳定度特征参数、所述谱平坦度特征参数、所述调性特征参数计算得到所述当前帧的调性标志;
    根据所述调性标志、所述信噪比参数、所述谱重心特征参数、所述帧能量参数计算得到所述第一激活音检测判决结果。
  16. 根据权利要求15所述的方法,其中,所述根据所述调性标志、所述信噪比参数、所述谱重心特征参数、所述帧能量参数计算得到所述第一激活音检测判决结果包括:
    通过所述当前帧的前一帧计算得到的平均长时激活音信号能量和平均长时背景噪声能量的比值,计算得到长时信噪比;
    计算距离所述当前帧最近的多个帧的全带信噪比的平均值,得到平均全带信噪比;
    根据所述谱重心特征参数、所述长时信噪比、连续激活音帧个数和连续噪声帧个数得到激活音检测的判决门限;
    根据所述激活音检测的判决门限和所述信噪比参数计算得到初始的激活音检测判决结果;
    根据所述调性标志、所述平均全带信噪比、所述谱重心特征参数和所述长时信噪比对所述初始的激活音检测判决结果进行修正,得到所述第一激活音检测判决结果。
  17. 根据权利要求13所述的方法,其中,所述获得激活音保持帧数包括:
    获得当前帧的子带信号及频谱幅值;
    根据所述子带信号计算得到长时信噪比和平均全带信噪比,根据所述当前帧之前的多个帧的激活音检测的判决结果、长时信噪比、平均全带信噪 比、所述第一激活音检测判决结果,对当前激活音保持帧数进行修正。
  18. 根据权利要求17所述的方法,其中,所述根据所述子带信号计算得到长时信噪比和平均全带信噪比包括:
    通过利用所述当前帧的前一帧计算得到的平均长时激活音信号能量和平均长时背景噪声能量的比值,计算得到所述长时信噪比;计算距离所述当前帧最近的多个帧的全带信噪比的平均值,得到所述平均全带信噪比。
  19. 根据权利要求17所述的方法,其中,所述对当前激活音保持帧数进行修正包括:
    如果连续语音帧数小于一个设定的第一门限值,并且所述长时信噪比小于一个设定的门限值,则所述激活音保持帧数等于最小连续激活音帧数减去所述连续语音帧数;如果所述平均全带信噪比大于一个设定的门限值,并且所述连续语音帧数大于一个设定的第二门限值,则根据所述长时信噪比的大小设置所述激活音保持帧数的值。
  20. 根据权利要求13所述的方法,其中,所述获得背景噪声更新次数包括:
    获得背景噪声更新标识;
    根据所述背景噪声更新标识计算所述背景噪声更新次数。
  21. 根据权利要求20所述的方法,其中,所述根据所述背景噪声更新标识计算所述背景噪声更新次数包括:
    当所述背景噪声更新标识指示所述当前帧为背景噪声时,且所述背景噪声更新次数小于设定的门限值时,将所述背景噪声更新次数加1。
  22. 根据权利要求20所述的方法,其中,所述获得背景噪声更新标识包括:
    获得当前帧的子带信号及频谱幅值;
    根据所述子带信号计算得到的帧能量参数、谱重心特征参数、时域稳定度特征参数的值,根据所述频谱幅值计算得到谱平坦度特征参数和调性特征参数的值;
    根据所述谱重心特征参数、所述时域稳定度特征参数、所述谱平坦度特征参数、所述调性特征参数、所述帧能量参数进行背景噪声检测,获得所述背景噪声更新标识。
  23. 根据权利要求22所述的方法,其中,所述根据所述谱重心特征参数、所述时域稳定度特征参数、所述谱平坦度特征参数、所述调性特征参数、所述帧能量参数进行背景噪声检测,获得所述背景噪声更新标识,具体包括:
    设置所述背景噪声更新标识为第一预设值;
    如果以下任一条件成立,则判断所述当前帧不是噪声信号,并将所述背景噪声更新标识设置为第二预设值:
    所述时域稳定度特征参数大于一个设定的门限值;
    所述谱重心特征参数值的平滑滤波值大于一个设定的门限值,且所述时域稳定度特征参数值也大于一个设定的门限值;
    所述调性特征参数或所述调性特征参数平滑滤波后的值大于一个设定的门限值,且所述时域稳定度特征参数值大于设定的门限值;
    每个子带的谱平坦度特征参数或所述每个子带的谱平坦度特征参数各自平滑滤波后的值均小于各自对应的设定的门限值;
    或,所述帧能量参数的值大于设定的门限值。
  24. 根据权利要求13所述的方法,其中,所述根据所述第一激活音检测判决结果、所述背景噪声更新次数和所述激活音保持帧数计算激活音修正帧数包括:
    当所述第一激活音检测判决结果为激活音帧,且所述背景噪声更新次数小于预设门限值时,则所述激活音修正帧数为一个常数和所述激活音保持帧数中的最大值。
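  25. An apparatus for acquiring a number of voice activity modification frames, the apparatus comprising:
    a first acquisition unit, configured to obtain the voice activity detection decision result of the current frame;
    a second acquisition unit, configured to obtain the number of active-voice hangover frames;
    a third acquisition unit, configured to obtain the number of background noise updates;
    a fourth acquisition unit, configured to acquire the number of voice activity modification frames according to the voice activity detection decision result of the current frame, the number of background noise updates and the number of active-voice hangover frames.
  26. A voice activity detection apparatus, the apparatus comprising:
    a fifth acquisition unit, configured to obtain a first voice activity detection decision result;
    a sixth acquisition unit, configured to obtain the number of active-voice hangover frames;
    a seventh acquisition unit, configured to obtain the number of background noise updates;
    a first calculation unit, configured to calculate the number of voice activity modification frames according to the first voice activity detection decision result, the number of background noise updates and the number of active-voice hangover frames;
    an eighth acquisition unit, configured to obtain a second voice activity detection decision result;
    a second calculation unit, configured to calculate the voice activity detection decision result according to the number of voice activity modification frames and the second voice activity detection decision result.
  27. A computer-readable storage medium storing computer-executable instructions, the computer-executable instructions being used to execute the method of any one of claims 1 to 24.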
PCT/CN2015/093889 2015-06-26 2015-11-05 一种激活音修正帧数的获取方法、激活音检测方法和装置 Ceased WO2016206273A1 (zh)

Priority Applications (7)

Application Number Priority Date Filing Date Title
RU2017145122A RU2684194C1 (ru) 2015-06-26 2015-11-05 Способ получения кадра модификации речевой активности, устройство и способ обнаружения речевой активности
EP25180489.4A EP4641568A3 (en) 2015-06-26 2015-11-05 Voice activity modification frame acquiring method, and voice activity detection method and apparatus
JP2017566850A JP6635440B2 (ja) 2015-06-26 2015-11-05 音声区間補正フレーム数の取得方法、音声区間検出方法及び装置
CA2990328A CA2990328C (en) 2015-06-26 2015-11-05 Voice activity modification frame acquiring method, and voice activity detection method and apparatus
US15/577,343 US10522170B2 (en) 2015-06-26 2015-11-05 Voice activity modification frame acquiring method, and voice activity detection method and apparatus
EP15896160.7A EP3316256A4 (en) 2015-06-26 2015-11-05 Voice activity modification frame acquiring method, and voice activity detection method and apparatus
KR1020177036055A KR102042117B1 (ko) 2015-06-26 2015-11-05 보이스 활성화 수정 프레임 수량의 취득 방법, 보이스 활성화 탐지 방법 및 장치

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201510364255.0A CN106328169B (zh) 2015-06-26 2015-06-26 一种激活音修正帧数的获取方法、激活音检测方法和装置
CN201510364255.0 2015-06-26

Publications (1)

Publication Number Publication Date
WO2016206273A1 true WO2016206273A1 (zh) 2016-12-29

Family

ID=57584376

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2015/093889 Ceased WO2016206273A1 (zh) 2015-06-26 2015-11-05 一种激活音修正帧数的获取方法、激活音检测方法和装置

Country Status (8)

Country Link
US (1) US10522170B2 (zh)
EP (2) EP4641568A3 (zh)
JP (1) JP6635440B2 (zh)
KR (1) KR102042117B1 (zh)
CN (1) CN106328169B (zh)
CA (1) CA2990328C (zh)
RU (1) RU2684194C1 (zh)
WO (1) WO2016206273A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10522170B2 (en) 2015-06-26 2019-12-31 Zte Corporation Voice activity modification frame acquiring method, and voice activity detection method and apparatus
CN112420079A (zh) * 2020-11-18 2021-02-26 青岛海尔科技有限公司 语音端点检测方法和装置、存储介质及电子设备

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105261375B (zh) * 2014-07-18 2018-08-31 中兴通讯股份有限公司 激活音检测的方法及装置
JP6759898B2 (ja) * 2016-09-08 2020-09-23 富士通株式会社 発話区間検出装置、発話区間検出方法及び発話区間検出用コンピュータプログラム
CN107123419A (zh) * 2017-05-18 2017-09-01 北京大生在线科技有限公司 Sphinx语速识别中背景降噪的优化方法
CN108962284B (zh) * 2018-07-04 2021-06-08 科大讯飞股份有限公司 一种语音录制方法及装置
CN111599345B (zh) * 2020-04-03 2023-02-10 厦门快商通科技股份有限公司 语音识别算法评估方法、系统、移动终端及存储介质
US11636872B2 (en) * 2020-05-07 2023-04-25 Netflix, Inc. Techniques for computing perceived audio quality based on a trained multitask learning model
CN112908352B (zh) * 2021-03-01 2024-04-16 百果园技术(新加坡)有限公司 Audio denoising method and apparatus, electronic device, and storage medium
EP4354898A4 (en) * 2021-06-08 2024-10-16 Panasonic Intellectual Property Management Co., Ltd. Ear-mounted device and reproduction method
US20230046530A1 (en) * 2021-08-03 2023-02-16 Bard College Enhanced bird feeders and baths
CN114220446B (zh) * 2021-12-08 2025-07-18 漳州立达信光电子科技有限公司 Adaptive background noise detection method, system, and medium
US12525226B2 (en) * 2023-02-10 2026-01-13 Qualcomm Incorporated Latency reduction for multi-stage speech recognition

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
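CN1473321A (zh) * 2000-09-09 2004-02-04 英特尔公司 Voice activity detector for integrated telecommunications processing
CN101197135A (zh) * 2006-12-05 2008-06-11 华为技术有限公司 Sound signal classification method and apparatus
CN101399039A (zh) * 2007-09-30 2009-04-01 华为技术有限公司 Method and apparatus for determining the category of a non-noise audio signal
CN101841587A (zh) * 2009-03-20 2010-09-22 联芯科技有限公司 Signal tone detection method and apparatus, and noise suppression method for a mobile terminal
CN102687196A (zh) * 2009-10-08 2012-09-19 西班牙电信公司 Method for detecting speech segments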
WO2012146290A1 (en) * 2011-04-28 2012-11-01 Telefonaktiebolaget L M Ericsson (Publ) Frame based audio signal classification
CN103903634A (zh) * 2012-12-25 2014-07-02 中兴通讯股份有限公司 Voice activity detection, and method and apparatus for voice activity detection
CN104424956A (zh) * 2013-08-30 2015-03-18 中兴通讯股份有限公司 Voice activity detection method and apparatus

Family Cites Families (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH05130067A (ja) 1991-10-31 1993-05-25 Nec Corp Variable-threshold voice detector
US6269331B1 (en) * 1996-11-14 2001-07-31 Nokia Mobile Phones Limited Transmission of comfort noise parameters during discontinuous transmission
CN1703736A (zh) * 2002-10-11 2005-11-30 诺基亚有限公司 Method and apparatus for source-controlled variable bit rate wideband speech coding
US7567900B2 (en) 2003-06-11 2009-07-28 Panasonic Corporation Harmonic structure based acoustic speech interval detection method and device
JP4729927B2 (ja) * 2005-01-11 2011-07-20 ソニー株式会社 Voice detection device, automatic imaging device, and voice detection method
EP2276023A3 (en) * 2005-11-30 2011-10-05 Telefonaktiebolaget LM Ericsson (publ) Efficient speech stream conversion
EP1982324B1 (en) 2006-02-10 2014-09-24 Telefonaktiebolaget LM Ericsson (publ) A voice detector and a method for suppressing sub-bands in a voice detector
CN101320559B (zh) * 2007-06-07 2011-05-18 华为技术有限公司 Voice activity detection apparatus and method
GB2450886B (en) * 2007-07-10 2009-12-16 Motorola Inc Voice activity detector and a method of operation
US20120095760A1 (en) * 2008-12-19 2012-04-19 Ojala Pasi S Apparatus, a method and a computer program for coding
CN102044244B (zh) * 2009-10-15 2011-11-16 华为技术有限公司 Signal classification method and apparatus
CN102693720A (zh) * 2009-10-15 2012-09-26 华为技术有限公司 Audio signal detection method and apparatus
EP2816560A1 (en) * 2009-10-19 2014-12-24 Telefonaktiebolaget L M Ericsson (PUBL) Method and background estimator for voice activity detection
CN102741918B (zh) * 2010-12-24 2014-11-19 华为技术有限公司 Method and device for voice activity detection
JP5936377B2 (ja) 2012-02-06 2016-06-22 三菱電機株式会社 Voice section detection device
RU2536343C2 (ru) * 2013-04-15 2014-12-20 Открытое акционерное общество "Концерн "Созвездие" Method for extracting a speech signal in the presence of interference, and device for implementing it
JP6406257B2 (ja) * 2013-08-30 2018-10-17 日本電気株式会社 Signal processing device, signal processing method, and signal processing program
FI125723B (en) * 2014-07-11 2016-01-29 Suunto Oy Portable activity tracking device and associated method
CN106328169B (zh) 2015-06-26 2018-12-11 中兴通讯股份有限公司 Method for acquiring number of voice activity modification frames, and voice activity detection method and apparatus

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP3316256A4 *

Also Published As

Publication number Publication date
KR20180008647A (ko) 2018-01-24
EP4641568A2 (en) 2025-10-29
CN106328169B (zh) 2018-12-11
CA2990328C (en) 2021-09-21
CN106328169A (zh) 2017-01-11
JP6635440B2 (ja) 2020-01-22
EP3316256A4 (en) 2018-08-22
KR102042117B1 (ko) 2019-11-08
RU2684194C1 (ru) 2019-04-04
JP2018523155A (ja) 2018-08-16
EP4641568A3 (en) 2025-11-12
US10522170B2 (en) 2019-12-31
EP3316256A1 (en) 2018-05-02
CA2990328A1 (en) 2016-12-29
US20180158470A1 (en) 2018-06-07

Similar Documents

Publication Publication Date Title
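WO2016206273A1 (zh) Method for acquiring number of voice activity modification frames, and voice activity detection method and apparatus
CN112992188B (zh) Method and apparatus for adjusting the signal-to-noise ratio threshold in voice activity detection (VAD) decisions
CN104424956B9 (zh) Voice activity detection method and apparatus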
US9672841B2 (en) Voice activity detection method and method used for voice activity detection and apparatus thereof
US10339961B2 (en) Voice activity detection method and apparatus
WO2012158157A1 (en) Method for super-wideband noise suppression
CN110390947B (zh) Method, system, device, and storage medium for determining sound source position

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 15896160

Country of ref document: EP

Kind code of ref document: A1

WWE WIPO information: entry into national phase

Ref document number: 15577343

Country of ref document: US

ENP Entry into the national phase

Ref document number: 20177036055

Country of ref document: KR

Kind code of ref document: A

ENP Entry into the national phase

Ref document number: 2990328

Country of ref document: CA

ENP Entry into the national phase

Ref document number: 2017566850

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

WWE WIPO information: entry into national phase

Ref document number: 2017145122

Country of ref document: RU

Ref document number: 2015896160

Country of ref document: EP

WWW WIPO information: withdrawn in national office

Ref document number: 2015896160

Country of ref document: EP