CN109727605A - Method and system for processing sound signals

Info

Publication number: CN109727605A
Application number: CN201811645765.5A
Authority: CN
Inventors: 袁斌
Original assignee: AI Speech Ltd
Current assignee: Sipic Technology Co Ltd
Priority date: 2018-12-29
Filing date: 2018-12-29
Publication date: 2019-05-07
Anticipated expiration: 2038-12-29
Also published as: CN109727605B

Abstract

The present invention discloses a method and system for processing sound signals. In one embodiment, the method includes: obtaining a sound signal to be processed, which includes a target sound signal and an interference sound signal; determining the power spectral density of the interference sound signal, and weighting the sound signal to be processed according to that power spectral density to obtain a spectrum estimate of the target sound signal; determining a masking threshold according to the spectrum estimate; and filtering the sound signal to be processed when the spectral component of the interference sound signal in it is greater than the masking threshold. The method reduces distortion of the sound signal so that it sounds more natural, reduces the computational complexity of the algorithm, and speeds up the convergence of the front-end echo canceller. It also improves robustness under strong background noise and in the presence of near-end speech.

Description

Method and system for processing sound signals

Technical field

The present invention relates to the field of signal processing, and in particular to a method and system for processing sound signals.

Background art

In the prior art, filtering of sound signals can reduce "musical noise", but the sound signal obtained after noise-reduction filtering is to some extent unnatural. When the human ear receives one sound, it may be disturbed and suppressed by another sound; this phenomenon is called the masking effect. The closer two sounds are in pitch or in time, the stronger the masking effect. As a result, the residual noise left after post-filter noise reduction generally loses its original characteristics, which makes the result sound somewhat unnatural.

Summary of the invention

Embodiments of the present invention provide a method and system for processing sound signals, to solve at least one of the above technical problems.

In a first aspect, an embodiment of the present invention provides a method for processing sound signals, comprising: obtaining a sound signal to be processed, the sound signal to be processed including a target sound signal and an interference sound signal; determining the power spectral density of the interference sound signal, and weighting the sound signal to be processed according to the power spectral density to obtain a spectrum estimate of the target sound signal; determining a masking threshold according to the spectrum estimate; and, when it is determined that the spectral component of the interference sound signal in the sound signal to be processed is greater than the masking threshold, filtering the sound signal to be processed.

Optionally, the interference sound signal includes a noise signal and an echo signal.

Optionally, the step of weighting the sound signal to be processed according to the power spectral density to obtain the spectrum estimate of the target sound signal includes:

converting the sound signal to be processed into a frequency-domain signal E(Ω);

determining the a posteriori signal-to-noise ratio PostSNR(Ω) according to the following formula:

PostSNR(Ω)=|E(Ω)|²/(R_bb(Ω)+R_nn(Ω)),

where R_bb(Ω) is the power spectral density of the echo signal and R_nn(Ω) is the power spectral density of the noise signal;

deriving the a priori signal-to-noise ratio PrioriSNR(Ω) according to the following formula:

PrioriSNR(Ω_i)=(1-alpha)*P(PostSNR(Ω_i)-1)+alpha*|S'(Ω_i-1)|²/R_bb(Ω);

where alpha is a smoothing factor, P(x)=(|x|+x)/2, and S'(Ω_i-1) is the spectrum estimate of the previous frame of the sound signal;

calculating a weighting coefficient H_LSA(Ω), and obtaining the spectrum estimate S'(Ω) of the target sound signal:

S'(Ω)=E(Ω)*H_LSA(Ω),

where theta=PostSNR(Ω)*PrioriSNR(Ω)/(PrioriSNR(Ω)+1).
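The signal-to-noise-ratio formulas above can be sketched per frequency bin as follows. This is a minimal illustration rather than the claimed implementation: the function names and the default value of alpha are assumptions, and the a priori SNR is normalized by R_bb(Ω) exactly as the formula is written.

```python
def rectify(x):
    """P(x) = (|x| + x) / 2: keeps positive values, maps negative values to 0."""
    return (abs(x) + x) / 2.0


def posterior_snr(E, R_bb, R_nn):
    """PostSNR = |E|^2 / (R_bb + R_nn) for one frequency bin."""
    return abs(E) ** 2 / (R_bb + R_nn)


def priori_snr(post_snr, S_prev, R_bb, alpha=0.98):
    """Decision-directed a priori SNR for one bin.

    S_prev is the spectrum estimate S' of the previous frame.
    alpha=0.98 is an assumed default, not a value given in the text.
    """
    return (1.0 - alpha) * rectify(post_snr - 1.0) + alpha * abs(S_prev) ** 2 / R_bb


def theta(post_snr, prio_snr):
    """theta = PostSNR * PrioriSNR / (PrioriSNR + 1)."""
    return post_snr * prio_snr / (prio_snr + 1.0)
```

For a whole frame, these functions would be applied to every bin Ω of the frequency-domain signal E(Ω).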

Optionally, the step of filtering the sound signal to be processed when it is determined that the spectral component of the interference sound signal in the sound signal to be processed is greater than the masking threshold includes:

determining the weighting coefficient H(Ω) of the filtering according to the power spectral density of the echo signal and the power spectral density of the noise signal:

H(Ω)=min(1, sqrt(R_TT(Ω)/(R_bb(Ω)+R_nn(Ω)))+(zeta_b*R_bb(Ω)+zeta_n*R_nn(Ω))/(R_bb(Ω)+R_nn(Ω))),

where R_TT(Ω) is the masking threshold, R_bb(Ω) is the power spectral density of the echo signal, R_nn(Ω) is the power spectral density of the noise signal, zeta_b is the echo attenuation coefficient, and zeta_n is the noise attenuation coefficient.

Optionally, the step of determining the masking threshold according to the spectrum estimate includes:

determining, according to the spectrum estimate, the power spectral density B(k) of each critical band of the sound signal to be processed and the spread critical-band spectrum C(k):

C(k)=B(k)*SF(k),

where SF(k)=15.81+7.5*(k+0.474)-17.5*sqrt(1+(k+0.474)²), and bh and bl are respectively the upper and lower limit frequencies of each critical band;

determining a preliminary masking threshold T(k) according to the spread critical-band spectrum C(k) and an offset function O(k):

T(k)=10^{lg(C(k))-(O(k)/10)},

where the offset function O(k)=belta*(14.5+k)+(1-belta)*5.5, and belta is the tone coefficient;

determining the masking threshold R_TT(Ω) according to the preliminary masking threshold T(k) and the absolute threshold of hearing T_abs(k):

R_TT(Ω)=min(T(k), T_abs(k)),

where T_abs(k)=3.64f^-0.8-6.5exp(f-3.3)²+10^-3f⁴.
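The spreading step can be illustrated with a short sketch. It assumes the spreading function has its usual form in dB with the term 7.5*(k+0.474), and that the `*` in C(k)=B(k)*SF(k) denotes a convolution across critical bands with SF converted from dB to linear power; both readings are assumptions, and the function names are hypothetical.

```python
import math


def spreading_db(dz):
    """Spreading function in dB for a band distance dz (in Bark)."""
    return 15.81 + 7.5 * (dz + 0.474) - 17.5 * math.sqrt(1.0 + (dz + 0.474) ** 2)


def spread_bands(B):
    """Spread the critical-band powers B across neighbouring bands.

    C[i] = sum_j B[j] * 10^(SF(i - j) / 10), i.e. a convolution over bands
    with the spreading function converted from dB to linear power.
    """
    n = len(B)
    C = []
    for i in range(n):
        total = 0.0
        for j in range(n):
            total += B[j] * 10.0 ** (spreading_db(i - j) / 10.0)
        C.append(total)
    return C
```

With a single excited band, the spread spectrum peaks at that band and falls off more slowly toward higher bands than toward lower ones, which matches the asymmetry of auditory masking.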

Optionally, the step of obtaining the sound signal to be processed includes:

receiving an initial sound signal;

performing echo cancellation on the initial sound signal to obtain the sound signal to be processed.

Optionally, the sound signal to be processed is a speech signal.

In a second aspect, an embodiment of the present invention provides a system for processing sound signals, comprising: a signal acquisition module, configured to obtain a sound signal to be processed, the sound signal to be processed including a target sound signal and an interference sound signal; a spectrum estimation determining module, configured to determine the power spectral density of the interference sound signal and to weight the sound signal to be processed according to the power spectral density to obtain a spectrum estimate of the target sound signal; a masking threshold determining module, configured to determine a masking threshold according to the spectrum estimate; and a filtering module, configured to filter the sound signal to be processed when it is determined that the spectral component of the interference sound signal in the sound signal to be processed is greater than the masking threshold.

Optionally, the spectrum estimation determining module is further configured to convert the sound signal to be processed into a frequency-domain signal E(Ω), and to determine the a posteriori signal-to-noise ratio PostSNR(Ω) according to the following formula:

PostSNR(Ω)=|E(Ω)|²/(R_bb(Ω)+R_nn(Ω)),

to derive the a priori signal-to-noise ratio PrioriSNR(Ω) according to the following formula:

PrioriSNR(Ω_i)=(1-alpha)*P(PostSNR(Ω_i)-1)+alpha*|S'(Ω_i-1)|²/R_bb(Ω),

and to calculate the weighting coefficient H_LSA(Ω) and obtain the spectrum estimate S'(Ω) of the target sound signal:

S'(Ω)=E(Ω)*H_LSA(Ω),

where theta=PostSNR(Ω)*PrioriSNR(Ω)/(PrioriSNR(Ω)+1).

Optionally, the masking threshold determining module is further configured to determine, according to the spectrum estimate, the power spectral density B(k) of each critical band of the sound signal to be processed and the spread critical-band spectrum C(k):

C(k)=B(k)*SF(k),

T(k)=10^{lg(C(k))-(O(k)/10)},

R_TT(Ω)=min(T(k), T_abs(k)),

where T_abs(k)=3.64f^-0.8-6.5exp(f-3.3)²+10^-3f⁴.

Optionally, the filtering module is further configured to determine the weighting coefficient H(Ω) of the filtering according to the power spectral density of the echo signal and the power spectral density of the noise signal:

H(Ω)=min(1, sqrt(R_TT(Ω)/(R_bb(Ω)+R_nn(Ω)))+(zeta_b*R_bb(Ω)+zeta_n*R_nn(Ω))/(R_bb(Ω)+R_nn(Ω))).

Optionally, the signal acquisition module is further configured to receive an initial sound signal and to perform echo cancellation on the initial sound signal to obtain the sound signal to be processed.

In a third aspect, an embodiment of the present invention provides a storage medium storing one or more programs that include execution instructions. The execution instructions can be read and executed by an electronic device (including but not limited to a computer, a server, or a network device) to perform any of the above methods for processing sound signals.

In a fourth aspect, an electronic device is provided, comprising: at least one processor, and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can perform any of the above methods for processing sound signals.

In a fifth aspect, an embodiment of the present invention also provides a computer program product, comprising a computer program stored on a storage medium, the computer program including program instructions which, when executed by a computer, cause the computer to perform any of the above methods for processing sound signals.

The beneficial effects of the embodiments of the present invention are as follows. Distortion of the sound signal is reduced and the signal sounds more natural. The masking threshold is determined from the calculated power spectral density (PSD) of the interference sound signal, which reduces the computational complexity of the algorithm. The filter-order requirement on the front-end echo cancellation filter is lowered, which speeds up the convergence of the front-end echo canceller. Robustness under strong background noise and in the presence of near-end speech is also improved.

Brief description of the drawings

To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show some embodiments of the present invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a flowchart of one embodiment of the method for processing sound signals of the present invention;

Fig. 2 is a flowchart of another embodiment of the method for processing sound signals of the present invention;

Fig. 3 is a schematic diagram of one embodiment of a system implementing the method for processing sound signals of the present invention;

Fig. 4 is a schematic diagram of one embodiment of the method for processing sound signals of the present invention;

Fig. 5 is a schematic diagram of one embodiment of the system for processing sound signals of the present invention;

Fig. 6 is a schematic structural diagram of one embodiment of the electronic device of the present invention.

Detailed description of the embodiments

To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. The described embodiments are some, but not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the protection scope of the present invention.

It should be noted that, where no conflict arises, the embodiments of the present application and the features in the embodiments may be combined with one another.

The present invention may be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, elements, data structures, and the like that perform particular tasks or implement particular abstract data types. The present invention may also be practiced in distributed computing environments, in which tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules may be located in both local and remote computer storage media, including storage devices.

In the present invention, terms such as "module", "device", and "system" refer to computer-related entities, such as hardware, a combination of hardware and software, software, or software in execution. Specifically, an element may be, but is not limited to, a process running on a processor, a processor, an object, an executable element, a thread of execution, a program, and/or a computer. An application program or script running on a server, or the server itself, may also be an element. One or more elements may reside within a process and/or thread of execution, and an element may be localized on one computer and/or distributed between two or more computers, and may be run from various computer-readable media. Elements may also communicate through local and/or remote processes according to a signal having one or more data packets, for example a signal carrying data from one element that interacts with another element in a local system or a distributed system, and/or interacts with other systems across a network such as the Internet.

Finally, it should be noted that relational terms such as "first" and "second" are used herein only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "include" and "comprise" cover not only the elements listed but also other elements not explicitly listed, as well as elements inherent to the process, method, article, or device. Unless further limited, an element introduced by the phrase "including ..." does not exclude the presence of additional identical elements in the process, method, article, or device that includes the element.

As shown in Fig. 1, an embodiment of the present invention provides a method for processing sound signals, comprising:

Step S11: obtain a sound signal to be processed, the sound signal to be processed including a target sound signal and an interference sound signal.

Step S12: determine the power spectral density of the interference sound signal, and weight the sound signal to be processed according to the power spectral density to obtain a spectrum estimate of the target sound signal. Specifically, after the power spectral density of the interference sound signal is determined, the a posteriori and a priori signal-to-noise ratios are determined, a weighting coefficient is calculated from these signal-to-noise ratios, and the sound signal to be processed is weighted with it to obtain the spectrum estimate of the target sound signal.

Step S13: determine a masking threshold according to the spectrum estimate.

Step S14: when it is determined that the spectral component of the interference sound signal in the sound signal to be processed is greater than the masking threshold, filter the sound signal to be processed.

In this embodiment of the present invention, the masking threshold is calculated as follows:

According to the spectrum estimate, the power spectral density B(k) of each critical band of the sound signal to be processed and the spread critical-band spectrum C(k) are determined:

C(k)=B(k)*SF(k),

T(k)=10^{lg(C(k))-(O(k)/10)},

R_TT(Ω)=min(T(k), T_abs(k)),

where T_abs(k)=3.64f^-0.8-6.5exp(f-3.3)²+10^-3f⁴.
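For orientation, the absolute threshold of hearing can be evaluated numerically as below. The sketch assumes the widely used Terhardt approximation, with f in kHz, the result in dB SPL, and a factor of -0.6 inside the exponential; that factor and the units are assumptions about the intended formula, and the function name is hypothetical.

```python
import math


def absolute_threshold_db(f_khz):
    """Approximate threshold in quiet (dB SPL) at frequency f_khz (kHz).

    Assumed form: 3.64*f^-0.8 - 6.5*exp(-0.6*(f-3.3)^2) + 1e-3*f^4.
    """
    return (3.64 * f_khz ** -0.8
            - 6.5 * math.exp(-0.6 * (f_khz - 3.3) ** 2)
            + 1e-3 * f_khz ** 4)


# Example: thresholds at a few frequencies; the ear is most sensitive near 3.3 kHz.
thresholds = {f: absolute_threshold_db(f) for f in (0.1, 1.0, 3.3, 10.0)}
```

Under this assumed form the curve has its minimum around 3 to 4 kHz and rises steeply toward both very low and very high frequencies.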

In this embodiment of the present invention, the masking threshold is determined from the calculated power spectral density (PSD) of the interference sound signal, which reduces the computational complexity of the algorithm. The filter-order requirement on the front-end echo cancellation filter is lowered, which speeds up the convergence of the front-end echo canceller. Robustness under strong background noise and in the presence of near-end speech is also improved.

As shown in Fig. 2, an embodiment of the present invention provides a method for processing sound signals, comprising:

Step S21: receive an initial sound signal. The initial sound signal may be picked up by a sound pickup device such as a microphone.

Step S22: perform echo cancellation on the initial sound signal with an echo canceller to obtain the sound signal to be processed.

Step S23: determine the power spectral density of the interference sound signal, and weight the sound signal to be processed according to the power spectral density to obtain a spectrum estimate of the target sound signal.

Step S24: determine a masking threshold according to the spectrum estimate.

Step S25: when it is determined that the spectral component of the interference sound signal in the sound signal to be processed is greater than the masking threshold, filter the sound signal to be processed.

In this embodiment, preliminary echo cancellation is performed on the initial signal as soon as it is received, which improves the accuracy of the sound signal processing.

If the sound signal to be processed contains a noise signal and an echo signal, then weighting the sound signal to be processed according to the power spectral density to obtain the spectrum estimate of the target sound signal proceeds as follows:

The sound signal to be processed is converted into a frequency-domain signal E(Ω);

The a posteriori signal-to-noise ratio PostSNR(Ω) is determined according to the following formula:

PostSNR(Ω)=|E(Ω)|²/(R_bb(Ω)+R_nn(Ω)),

where R_bb(Ω) is the power spectral density of the echo signal and R_nn(Ω) is the power spectral density of the noise signal;

The a priori signal-to-noise ratio PrioriSNR(Ω) is derived according to the following formula:

PrioriSNR(Ω_i)=(1-alpha)*P(PostSNR(Ω_i)-1)+alpha*|S'(Ω_i-1)|²/R_bb(Ω);

The weighting coefficient H_LSA(Ω) is then calculated, and the spectrum estimate S'(Ω) of the target sound signal is obtained:

S'(Ω)=E(Ω)*H_LSA(Ω),

where theta=PostSNR(Ω)*PrioriSNR(Ω)/(PrioriSNR(Ω)+1).

The step of filtering the sound signal to be processed when it is determined that the spectral component of the interference sound signal in the sound signal to be processed is greater than the masking threshold includes determining the weighting coefficient H(Ω) of the filtering:

H(Ω)=min(1, sqrt(R_TT(Ω)/(R_bb(Ω)+R_nn(Ω)))+(zeta_b*R_bb(Ω)+zeta_n*R_nn(Ω))/(R_bb(Ω)+R_nn(Ω))),

where R_bb(Ω) is the power spectral density of the echo signal, R_nn(Ω) is the power spectral density of the noise signal, zeta_b is the echo attenuation coefficient, and zeta_n is the noise attenuation coefficient.

This embodiment of the present invention preserves the original characteristics of the background noise, makes the residual echo sound more like noise, and reduces speech distortion, so that the sound is more natural. It also lowers the filter-order requirement on the front-end echo cancellation filter, which speeds up the convergence of the front-end echo canceller and reduces the computational complexity of the echo canceller. Robustness under strong background noise and in the presence of near-end speech is also improved.

As shown in Fig. 3, in the system implementing the method for processing sound signals of the present invention, the far-end microphone transmits a sound signal, which is played through the loudspeaker and forms the initial echo signal d(k). The near-end microphone picks up the sound signal y(k), which includes the clean speech signal s(k) (the target sound signal), the noise signal n(k), and the initial echo signal d(k) fed back from the loudspeaker through the loudspeaker-room-microphone (LRM) path. The echo canceller C first performs echo cancellation on the sound signal y(k) picked up by the near-end microphone, and the filter H then performs further filtering.

As shown in Fig. 4, an embodiment of the present invention provides a method for processing sound signals, comprising:

The near-end microphone picks up the sound signal y(k), which includes the clean speech signal s(k), the noise signal n(k), and the initial echo signal d(k) fed back from the loudspeaker through the LRM path. In this embodiment, the clean speech signal s(k) is the target signal.

The echo canceller performs echo cancellation on the sound signal y(k) picked up by the near-end microphone to obtain the echo-cancelled sound signal e(k). The interference sound signal contained in e(k) consists of the noise signal and the residual echo signal.

The noise PSD R_nn(Ω) and the residual echo PSD R_bb(Ω) are estimated by a statistical or autocorrelation method.

The post-filter weights the echo-cancelled near-end microphone signal to obtain a preliminary spectrum estimate S'(Ω) of the clean speech signal. The process includes:

A) Calculate the a posteriori signal-to-noise ratio:

PostSNR(Ω)=|E(Ω)|²/(R_bb(Ω)+R_nn(Ω))

B) Derive the a priori signal-to-noise ratio using the decision-directed method:

PrioriSNR(Ω_i)=(1-alpha)*P(PostSNR(Ω_i)-1)+alpha*|S'(Ω_i-1)|²/R_bb(Ω)

where alpha is a smoothing factor, P(x)=(|x|+x)/2, and S'(Ω_i-1) is the preliminary estimate of the speech signal in the previous frame.

C) Define theta=PostSNR(Ω)*PrioriSNR(Ω)/(PrioriSNR(Ω)+1), then calculate the weighting coefficient H_LSA(Ω).

D) Apply the weighting to obtain the preliminary estimate of the speech signal: S'(Ω)=E(Ω)*H_LSA(Ω)
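Steps C) and D) can be sketched as follows, assuming H_LSA(Ω) is the standard MMSE log-spectral-amplitude gain, H = ξ/(1+ξ)·exp(0.5·E1(theta)), where ξ is the a priori SNR and E1 is the exponential integral. That closed form is an assumption and is not taken from the text; the helper names are hypothetical, and E1 is evaluated here with a textbook series and continued fraction so that the sketch needs only the standard library.

```python
import math

EULER_GAMMA = 0.5772156649015329


def exp1(x):
    """Exponential integral E1(x) = integral of exp(-t)/t from x to infinity, x > 0."""
    if x <= 0.0:
        raise ValueError("x must be positive")
    if x <= 1.0:
        # Power series: E1(x) = -gamma - ln(x) - sum_{k>=1} (-x)^k / (k * k!)
        total = -EULER_GAMMA - math.log(x)
        term = 1.0
        for k in range(1, 40):
            term *= -x / k          # term is now (-x)^k / k!
            total -= term / k
        return total
    # Continued fraction evaluated with the modified Lentz method.
    b = x + 1.0
    c = 1e300
    d = 1.0 / b
    h = d
    for i in range(1, 200):
        a = -float(i * i)
        b += 2.0
        d = 1.0 / (a * d + b)
        c = b + a / c
        delta = c * d
        h *= delta
        if abs(delta - 1.0) < 1e-14:
            break
    return h * math.exp(-x)


def lsa_gain(prio_snr, theta):
    """Assumed LSA gain: xi / (1 + xi) * exp(0.5 * E1(theta))."""
    theta = max(theta, 1e-10)       # avoid the singularity of E1 at 0
    return prio_snr / (1.0 + prio_snr) * math.exp(0.5 * exp1(theta))


def estimate_spectrum(E, prio_snr, theta):
    """Step D): S' = E * H_LSA for one frequency bin."""
    return E * lsa_gain(prio_snr, theta)
```

For large theta the gain approaches the Wiener-like factor ξ/(1+ξ), while for small theta the exponential term raises the gain, which is the usual behaviour of this estimator.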

Then, the masking threshold R_TT(Ω) is estimated from the preliminary speech spectrum estimate S'(Ω). The process includes:

A) Perform critical-band analysis on the signal. According to place theory, the human ear can be regarded as a bank of discrete band-pass filters, and one critical band is called one Bark.

The power spectral density B(k) of each critical band is then determined,

where bh and bl are respectively the upper and lower limit frequencies of each critical band, and k depends on the sampling rate.
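One common way to obtain the critical-band power is to sum the squared magnitudes of the spectrum estimate over the bins of each band. The sketch below assumes that definition, B(k) = Σ |S'(Ω)|² for Ω from bl to bh; the band edges are given as hypothetical FFT bin indices rather than frequencies.

```python
def band_power(S, band_edges):
    """Power per critical band.

    S          : sequence of complex spectrum values S'(Omega), one per FFT bin
    band_edges : list of (bl, bh) bin-index pairs, both ends inclusive
    """
    return [sum(abs(S[i]) ** 2 for i in range(bl, bh + 1))
            for bl, bh in band_edges]


# Example with four bins grouped into two bands.
B = band_power([1, 2j, 3, 4], [(0, 1), (2, 3)])
```

In practice the band edges would follow a Bark-scale table, and the number of bands that fit below the Nyquist frequency depends on the sampling rate.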

B) Calculate the spreading function SF(k):

SF(k)=15.81+7.5*(k+0.474)-17.5*sqrt(1+(k+0.474)²)

Because the critical bands influence one another, the spread critical-band spectrum can be expressed as C(k)=B(k)*SF(k).

C) Calculate the masking threshold R_TT(Ω) that masks the noise and the residual echo.

Two kinds of masking threshold exist: the threshold for a pure tone masking noise and residual echo, which is C(k)-(14.5+k) dB, and the threshold for noise and residual echo masking a pure tone, which is C(k)-5.5 dB.

It is therefore necessary to determine whether the signal resembles a pure tone or noise and residual echo, for which the spectral flatness measure SFM is defined:

SFM=10*lg(G/A)

where G and A are respectively the geometric mean and the arithmetic mean of the power spectral density.

The tone coefficient is defined as belta=min(SFM/SFM_max, 1).

The offset function O(k) of the masking energy of each band is calculated from belta:

O(k)=belta*(14.5+k)+(1-belta)*5.5

The masking threshold is then: T(k)=10^{lg(C(k))-(O(k)/10)}

The threshold calculated from the spread spectrum is converted back to the Bark domain

and compared with the absolute threshold of human hearing: if the calculated masking threshold is lower than the absolute threshold of hearing, the value of the absolute threshold of hearing is used. The absolute threshold of hearing T_abs(k) is defined as:

T_abs(k)=3.64f^-0.8-6.5exp(f-3.3)²+10^-3f⁴

The final masking threshold is therefore R_TT(Ω)=min(T(k), T_abs(k)).
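The tonality estimate and the offset can be illustrated with the sketch below. The value of SFM_max is not stated here; the sketch assumes -60 dB, a common choice that corresponds to a nearly pure tone, and it assumes band indices k start at 1. Function names are hypothetical.

```python
import math

SFM_MAX_DB = -60.0  # assumed reference value; not given in the text


def tone_coefficient(power_spectrum):
    """belta = min(SFM / SFM_max, 1), with SFM = 10*lg(G/A).

    All power values must be strictly positive.
    """
    n = len(power_spectrum)
    arithmetic = sum(power_spectrum) / n
    geometric = math.exp(sum(math.log(p) for p in power_spectrum) / n)
    sfm_db = 10.0 * math.log10(geometric / arithmetic)
    return min(sfm_db / SFM_MAX_DB, 1.0)


def preliminary_threshold(C, belta):
    """T(k) = 10^(lg(C(k)) - O(k)/10) with O(k) = belta*(14.5+k) + (1-belta)*5.5."""
    T = []
    for k, c in enumerate(C, start=1):
        offset = belta * (14.5 + k) + (1.0 - belta) * 5.5
        T.append(10.0 ** (math.log10(c) - offset / 10.0))
    return T
```

A flat, noise-like spectrum gives belta near 0 and a constant 5.5 dB offset, while a spectrum dominated by one strong component gives belta equal to 1 and an offset of 14.5+k dB.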

Further, psychoacoustic weighted filtering is applied to the frequency-domain microphone signal E(Ω) obtained after echo cancellation. The time-domain digital signal can be converted into a frequency-domain signal with an FFT (fast Fourier transform). It is then determined whether the noise spectral components in E(Ω) are below the masking threshold: if so, they are retained without processing; otherwise, the corresponding noise spectral components are attenuated according to conventional MMSE-LSA.
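The per-bin decision can be sketched as below. The sketch assumes that the "noise spectral component" of a bin is represented by the interference PSD R_bb(Ω)+R_nn(Ω) and that the attenuation gains are supplied by the caller; both are assumptions, and the function name is hypothetical.

```python
def psychoacoustic_gate(E, R_bb, R_nn, R_TT, gains):
    """Leave masked bins untouched and attenuate the others.

    E     : frequency-domain signal, one value per bin
    R_bb  : residual echo PSD per bin
    R_nn  : noise PSD per bin
    R_TT  : masking threshold per bin
    gains : attenuation gain per bin, used only where interference is audible
    """
    out = []
    for e, rb, rn, rt, g in zip(E, R_bb, R_nn, R_TT, gains):
        interference = rb + rn
        if interference <= rt:
            out.append(e)        # masked by the speech: keep the bin as it is
        else:
            out.append(g * e)    # audible interference: attenuate
    return out
```

Bins whose interference lies below the threshold pass through unchanged, which is what preserves the character of the background noise.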

The psychoacoustic weighting filter coefficients are derived as follows:

The design goal of the psychoacoustic adaptive weighting filter is to minimize the distortion of the near-end speech signal while the sum of the residual echo distortion and the noise distortion equals the masking threshold. The optimal psychoacoustic weighting filter coefficient H(Ω) therefore satisfies:

[zeta_b-H(Ω)]²*R_bb(Ω)+[zeta_n-H(Ω)]²*R_nn(Ω)=R_TT(Ω)

where zeta_b is the residual echo attenuation coefficient, usually chosen so that 20lg(zeta_b)=-35;

zeta_n is the noise attenuation coefficient, usually chosen so that 20lg(zeta_n)=-15.

Since 0≤H(Ω)≤1, solving the above quadratic equation for H(Ω) and taking the positive root gives:

H(Ω)=min(1, [zeta_b*R_bb(Ω)+zeta_n*R_nn(Ω)+sqrt([R_bb(Ω)+R_nn(Ω)]*R_TT(Ω)-[zeta_b-zeta_n]²*R_bb(Ω)*R_nn(Ω))]/(R_bb(Ω)+R_nn(Ω)))

Since zeta_b and zeta_n are much smaller than 1, and R_TT(Ω) is usually not too small relative to R_bb(Ω) and R_nn(Ω), the term [zeta_b-zeta_n]²*R_bb(Ω)*R_nn(Ω) under the square root can be neglected and the above formula simplifies to:

H(Ω)=min(1, sqrt(R_TT(Ω)/(R_bb(Ω)+R_nn(Ω)))+(zeta_b*R_bb(Ω)+zeta_n*R_nn(Ω))/(R_bb(Ω)+R_nn(Ω)))
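As a numerical check of this derivation, the sketch below solves the quadratic condition directly and compares the result with the simplified form. Expanding the condition gives a cross term (zeta_b-zeta_n)²·R_bb·R_nn under the square root; the function names and the example PSD values are hypothetical.

```python
import math


def gain_exact(R_bb, R_nn, R_TT, zeta_b, zeta_n):
    """Positive root of (zeta_b-H)^2*R_bb + (zeta_n-H)^2*R_nn = R_TT, capped at 1."""
    total = R_bb + R_nn
    disc = total * R_TT - (zeta_b - zeta_n) ** 2 * R_bb * R_nn
    root = math.sqrt(max(disc, 0.0))   # guard against a slightly negative value
    return min(1.0, (zeta_b * R_bb + zeta_n * R_nn + root) / total)


def gain_simplified(R_bb, R_nn, R_TT, zeta_b, zeta_n):
    """Simplified form that neglects the cross term under the square root."""
    total = R_bb + R_nn
    return min(1.0, math.sqrt(R_TT / total)
               + (zeta_b * R_bb + zeta_n * R_nn) / total)


# Attenuation coefficients from 20*lg(zeta_b) = -35 and 20*lg(zeta_n) = -15.
zeta_b = 10.0 ** (-35.0 / 20.0)
zeta_n = 10.0 ** (-15.0 / 20.0)

# Hypothetical PSD values for one frequency bin.
H = gain_exact(1.0, 2.0, 0.3, zeta_b, zeta_n)
H_approx = gain_simplified(1.0, 2.0, 0.3, zeta_b, zeta_n)
residual = (zeta_b - H) ** 2 * 1.0 + (zeta_n - H) ** 2 * 2.0 - 0.3
```

For these example values the exact root satisfies the design condition to within rounding error, and the simplified gain differs from it by about 0.01.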

In this embodiment of the present invention, because the psychoacoustic post-filter also lowers the filter-order requirement on the front-end adaptive echo cancellation filter, the convergence of the echo canceller can be accelerated, the computational complexity of the algorithm is reduced, and robustness under strong background noise and in the presence of near-end speech is improved.

In addition, residual echo cancellation is merged into the psychoacoustic weighting post-filter, and the residual echo is used to adaptively update the filter weighting coefficients, which further cancels acoustic echo. Noise spectral components and residual echo components below the masking threshold cannot be heard because of the masking effect of the human ear, so they do not need to be attenuated. Only the noise spectral components and residual echo components that are not masked by the speech signal need to be attenuated with the conventional adaptive post-filtering method. This preserves the original characteristics of the background noise well, makes the residual echo sound more like noise, reduces speech distortion, and makes the result sound more natural.

It should be noted that, for simplicity of description, each of the above method embodiments is expressed as a series of combined actions. Those skilled in the art should understand, however, that the present invention is not limited by the described order of actions, because according to the present invention some steps may be performed in another order or simultaneously. Those skilled in the art should also understand that the embodiments described in this specification are preferred embodiments, and that the actions and modules involved are not necessarily required by the present invention. The descriptions of the embodiments each have their own emphasis; for parts not described in detail in one embodiment, reference may be made to the related descriptions of the other embodiments.

As shown in Fig. 5, an embodiment of the present invention also provides a system 500 for processing sound signals, comprising:

a signal acquisition module 510, configured to obtain a sound signal to be processed, the sound signal to be processed including a target sound signal and an interference sound signal;

a spectrum estimation determining module 520, configured to determine the power spectral density of the interference sound signal, and to weight the sound signal to be processed according to the power spectral density to obtain a spectrum estimate of the target sound signal;

a masking threshold determining module 530, configured to determine a masking threshold according to the spectrum estimate;

a filtering module 540, configured to filter the sound signal to be processed when it is determined that the spectral component of the interference sound signal in the sound signal to be processed is greater than the masking threshold.

Further, the interference sound signal includes a noise signal and an echo signal.

The spectrum estimation determining module is further configured to convert the sound signal to be processed into a frequency-domain signal E(Ω), and to determine the a posteriori signal-to-noise ratio PostSNR(Ω) according to the following formula:

PostSNR(Ω)=|E(Ω)|²/(R_bb(Ω)+R_nn(Ω)),

to derive the a priori signal-to-noise ratio PrioriSNR(Ω) according to the following formula:

PrioriSNR(Ω_i)=(1-alpha)*P(PostSNR(Ω_i)-1)+alpha*|S'(Ω_i-1)|²/R_bb(Ω),

and to calculate the weighting coefficient H_LSA(Ω) and obtain the spectrum estimate S'(Ω) of the target sound signal:

S'(Ω)=E(Ω)*H_LSA(Ω),

where theta=PostSNR(Ω)*PrioriSNR(Ω)/(PrioriSNR(Ω)+1).

The masking threshold determining module is further configured to determine, according to the spectrum estimate, the power spectral density B(k) of each critical band of the sound signal to be processed and the spread critical-band spectrum C(k):

C(k)=B(k)*SF(k),

T(k)=10^{lg(C(k))-(O(k)/10)},

R_TT(Ω)=min(T(k), T_abs(k)),

where T_abs(k)=3.64f^-0.8-6.5exp(f-3.3)²+10^-3f⁴.

The filtering module is further configured to determine the weighting coefficient H(Ω) of the filtering according to the power spectral density of the echo signal and the power spectral density of the noise signal:

H(Ω)=min(1, sqrt(R_TT(Ω)/(R_bb(Ω)+R_nn(Ω)))+(zeta_b*R_bb(Ω)+zeta_n*R_nn(Ω))/(R_bb(Ω)+R_nn(Ω))).

The signal acquisition module is further configured to receive an initial sound signal and to perform echo cancellation on the initial sound signal to obtain the sound signal to be processed.

In some embodiments, an embodiment of the present invention provides a non-volatile computer-readable storage medium storing one or more programs that include execution instructions. The execution instructions can be read and executed by an electronic device (including but not limited to a computer, a server, or a network device) to perform any of the above methods for processing sound signals.

In some embodiments, an embodiment of the present invention also provides a computer program product, comprising a computer program stored on a non-volatile computer-readable storage medium, the computer program including program instructions which, when executed by a computer, cause the computer to perform any of the above methods for processing sound signals.

In some embodiments, an embodiment of the present invention also provides an electronic device, comprising: at least one processor, and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can perform the method for processing sound signals.

In some embodiments, an embodiment of the present invention also provides a storage medium on which a computer program is stored, wherein the program, when executed by a processor, performs the method for processing sound signals.

The system for processing sound signals of the above embodiments can be used to perform the method for processing sound signals of the embodiments of the present invention, and accordingly achieves the technical effects of that method, which are not repeated here. In the embodiments of the present invention, the related functional modules may be implemented by a hardware processor.

Fig. 6 is a schematic diagram of the hardware structure of an electronic device for performing the method for processing a sound signal according to another embodiment of the present application. As shown in Fig. 6, the device includes:

one or more processors 610 and a memory 620; Fig. 6 takes one processor 610 as an example.

The device for performing the method for processing a sound signal may further include an input device 630 and an output device 640.

The processor 610, the memory 620, the input device 630, and the output device 640 may be connected by a bus or by other means; Fig. 6 takes a bus connection as an example.

As a non-volatile computer-readable storage medium, the memory 620 can store non-volatile software programs, non-volatile computer-executable programs, and modules, such as the program instructions/modules corresponding to the method for processing a sound signal in the embodiments of the present application. By running the non-volatile software programs, instructions, and modules stored in the memory 620, the processor 610 executes the various functional applications and data processing of the server, that is, implements the method for processing a sound signal of the above method embodiments.

The memory 620 may include a program storage area and a data storage area. The program storage area can store an operating system and an application program required by at least one function; the data storage area can store data created through use of the apparatus for processing a sound signal, and the like. In addition, the memory 620 may include high-speed random access memory and may also include non-volatile memory, for example at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some embodiments, the memory 620 optionally includes memory located remotely from the processor 610, and such remote memory can be connected to the apparatus for processing a sound signal via a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The input device 630 can receive input numeric or character information and generate signals related to user settings and function control of the apparatus for processing a sound signal. The output device 640 may include a display device such as a display screen.

The one or more modules are stored in the memory 620 and, when executed by the one or more processors 610, perform the method for processing a sound signal of any of the above method embodiments.

The above product can perform the method provided by the embodiments of the present application and has the functional modules and beneficial effects corresponding to performing the method. For technical details not described in detail in this embodiment, refer to the method provided by the embodiments of the present application.

The electronic device of the embodiments of the present application exists in a variety of forms, including but not limited to:

(1) Mobile communication devices: these devices are characterized by mobile communication functions, and their main purpose is to provide voice and data communication. Such terminals include smartphones (such as the iPhone), multimedia phones, feature phones, and low-end phones.

(2) Ultra-mobile personal computer devices: these devices belong to the category of personal computers, have computing and processing functions, and generally also support mobile Internet access. Such terminals include PDA, MID, and UMPC devices, such as the iPad.

(3) Portable entertainment devices: these devices can display and play multimedia content. They include audio and video players (such as the iPod), handheld devices, e-book readers, smart toys, and portable in-vehicle navigation devices.

(4) Servers: devices that provide computing services. A server comprises a processor, hard disk, memory, system bus, and so on. Its architecture is similar to that of a general-purpose computer, but because it must provide highly reliable services, it has higher requirements for processing capability, stability, reliability, security, scalability, and manageability.

(5) Other electronic devices with data interaction functions.

The apparatus embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the solution of this embodiment.

From the above description of the embodiments, those skilled in the art can clearly understand that each embodiment can be implemented by software plus a general-purpose hardware platform, and of course also by hardware. Based on this understanding, the above technical solutions, in essence or in the part that contributes to the related art, can be embodied in the form of a software product. The computer software product may be stored in a computer-readable storage medium, such as a ROM/RAM, a magnetic disk, or an optical disc, and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform the methods described in each embodiment or in certain parts of an embodiment.

Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions described in the foregoing embodiments or make equivalent replacements of some of their technical features, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims

1. A method for processing a sound signal, characterized by comprising:

obtaining a sound signal to be processed, the sound signal to be processed including a target sound signal and an interference sound signal;

determining the power spectral density of the interference sound signal, and weighting the sound signal to be processed according to the power spectral density to obtain a spectrum estimate of the target sound signal;

determining a masking threshold according to the spectrum estimate; and

filtering the sound signal to be processed when it is determined that a spectral component of the interference sound signal in the sound signal to be processed is greater than the masking threshold.
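The final step of claim 1 can be illustrated with a minimal sketch, assuming the comparison is made independently for each frequency bin (the claim does not state this). The function name `filter_audible_interference` and its parameters are hypothetical, and `gain` stands in for whatever filter weighting is chosen.

```python
def filter_audible_interference(spectrum, interference_psd, masking_threshold, gain):
    """Attenuate only the bins whose interference exceeds the masking threshold.

    All four arguments are equal-length sequences indexed by frequency bin
    (an assumption of this sketch, not something the claim specifies).
    """
    filtered = []
    for e, r, t, h in zip(spectrum, interference_psd, masking_threshold, gain):
        # Interference above the mask is taken to be audible, so the bin is filtered;
        # masked interference is left alone to avoid distorting the target sound.
        filtered.append(e * h if r > t else e)
    return filtered
```

Under this reading, bins whose interference stays below the masking threshold pass through unchanged, which is consistent with the stated aim of reducing distortion.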

2. The method according to claim 1, characterized in that the interference sound signal includes a noise signal and an echo signal.

3. The method according to claim 2, characterized in that the step of weighting the sound signal to be processed according to the power spectral density to obtain the spectrum estimate of the target sound signal comprises:

determining the a posteriori signal-to-noise ratio PostSNR(Ω) according to the following formula:

PostSNR(Ω) = |E(Ω)|² / (R_bb(Ω) + R_nn(Ω)),

where R_bb(Ω) is the power spectral density of the echo signal and R_nn(Ω) is the power spectral density of the noise signal;

deriving the a priori signal-to-noise ratio PrioriSNR(Ω) according to the following formula:

PrioriSNR(Ω_i) = (1 - alpha) * P(PostSNR(Ω_i) - 1) + alpha * |S'(Ω_{i-1})|² / R_bb(Ω);

where alpha is a smoothing factor, P(x) = (|x| + x)/2, and S'(Ω_{i-1}) is the spectrum estimate of the previous frame of the sound signal;

S'(Ω) = E(Ω) * H_LSA(Ω),

where theta = PostSNR(Ω) * PrioriSNR(Ω) / (PrioriSNR(Ω) + 1).
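The signal-to-noise ratio formulas of claim 3 can be sketched for a single frequency bin as follows. This is only an illustration: the function names are hypothetical, the default `alpha=0.98` is an assumed value that the text does not give, and the denominator of the second term uses R_bb alone, exactly as the formula is printed. The gain H_LSA is not computed here because its expression is not given in the claim.

```python
def half_wave(x):
    """P(x) = (|x| + x) / 2, which equals max(x, 0)."""
    return (abs(x) + x) / 2


def post_snr(e_power, r_bb, r_nn):
    """PostSNR = |E|^2 / (R_bb + R_nn); e_power is |E|^2 for one bin."""
    return e_power / (r_bb + r_nn)


def priori_snr(post, prev_s_power, r_bb, alpha=0.98):
    """PrioriSNR = (1 - alpha) * P(PostSNR - 1) + alpha * |S'_prev|^2 / R_bb.

    prev_s_power is |S'|^2 of the previous frame; alpha=0.98 is an assumed default.
    """
    return (1 - alpha) * half_wave(post - 1) + alpha * prev_s_power / r_bb


def theta(post, priori):
    """theta = PostSNR * PrioriSNR / (PrioriSNR + 1)."""
    return post * priori / (priori + 1)
```

The caller is expected to pass strictly positive power spectral densities, since the sketch does not guard against division by zero.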

4. The method according to claim 2, characterized in that the step of filtering the sound signal to be processed when it is determined that a spectral component of the interference sound signal in the sound signal to be processed is greater than the masking threshold comprises:

determining the weighting coefficient H(Ω) of the filtering according to the power spectral density of the echo signal and the power spectral density of the noise signal:

H(Ω) = min(1, sqrt(R_TT(Ω) / (R_bb(Ω) + R_nn(Ω))) + (zeta_b * R_bb(Ω) + zeta_n * R_nn(Ω)) / (R_bb(Ω) + R_nn(Ω))),

where R_bb(Ω) is the power spectral density of the echo signal, R_nn(Ω) is the power spectral density of the noise signal, zeta_b is the echo attenuation coefficient, and zeta_n is the noise reduction coefficient.
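The weighting coefficient of claim 4 can be evaluated per frequency bin with a short sketch like the one below. The function name `weighting_coefficient` is hypothetical, and returning 1.0 when both interference densities are zero is an assumption added here to avoid division by zero; the claim does not address that case.

```python
import math


def weighting_coefficient(r_tt, r_bb, r_nn, zeta_b, zeta_n):
    """H = min(1, sqrt(R_TT / (R_bb + R_nn)) + (zeta_b*R_bb + zeta_n*R_nn) / (R_bb + R_nn))."""
    total = r_bb + r_nn
    if total <= 0.0:
        # Assumed behaviour: with no estimated interference, pass the bin unchanged.
        return 1.0
    masked_part = math.sqrt(r_tt / total)
    residual_part = (zeta_b * r_bb + zeta_n * r_nn) / total
    return min(1.0, masked_part + residual_part)
```

With the formula read this way, a high masking threshold relative to the interference drives H toward 1, so bins in which the interference is already masked are attenuated very little.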

5. The method according to claim 1, characterized in that the step of determining the masking threshold according to the spectrum estimate comprises:

determining, according to the spectrum estimate, the power spectral density B(k) of the critical bands of the sound signal to be processed and the spread critical-band spectrum C(k):

C(k) = B(k) * SF(k),

where SF(k) = 15.81 + 7.5*k + 0.474 - 17.5*sqrt(1 + (k + 0.474)²), and bh and bl are respectively the upper and lower bound frequencies of each critical band;

T(k) = 10^{lg(C(k))-(O(k)/10)},

R_TT(Ω) = min(T(k), T_abs(k)),

where T_abs(k) = 3.64f^-0.8 - 6.5exp(f-3.3)² + 10^-3f⁴.
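The threshold formulas of claim 5 can be sketched as follows, taking the spread spectrum C(k) as an input. Several points are assumptions rather than statements of the text: O(k) is treated as an offset in dB, the function names are hypothetical, and `absolute_threshold_db` uses the widely cited textbook approximation of the threshold in quiet (f in kHz, result in dB SPL, with a factor of -0.6 inside the exponential), which is not identical to the expression as printed above.

```python
import math


def masking_threshold(c_k, o_k_db):
    """T(k) = 10^(lg(C(k)) - O(k)/10), with lg the base-10 logarithm."""
    return 10.0 ** (math.log10(c_k) - o_k_db / 10.0)


def absolute_threshold_db(f_khz):
    """Textbook threshold in quiet, in dB SPL, for f in kHz (assumed form)."""
    return (3.64 * f_khz ** -0.8
            - 6.5 * math.exp(-0.6 * (f_khz - 3.3) ** 2)
            + 1e-3 * f_khz ** 4)


def combined_threshold(t_k, t_abs_k):
    """R_TT = min(T(k), T_abs(k)), following the claim as printed."""
    return min(t_k, t_abs_k)
```

Because `masking_threshold` returns a linear power value while `absolute_threshold_db` returns decibels, a caller would need to bring both to the same unit before combining them; how the patent does this is not stated in the text.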

6. The method according to claim 1, characterized in that the step of obtaining the sound signal to be processed comprises:

receiving an initial sound signal;

7. The method according to claim 1, characterized in that the sound signal to be processed is a speech signal.

8. A system for processing a sound signal, characterized by comprising:

a signal acquisition module, configured to obtain a sound signal to be processed, the sound signal to be processed including a target sound signal and an interference sound signal;

a spectrum estimation determining module, configured to determine the power spectral density of the interference sound signal and to weight the sound signal to be processed according to the power spectral density to obtain a spectrum estimate of the target sound signal;

a masking threshold determining module, configured to determine a masking threshold according to the spectrum estimate; and

a filtering module, configured to filter the sound signal to be processed when it is determined that a spectral component of the interference sound signal in the sound signal to be processed is greater than the masking threshold.

9. The system according to claim 8, characterized in that the interference sound signal includes a noise signal and an echo signal.

10. The system according to claim 8, characterized in that the spectrum estimation determining module is further configured to convert the sound signal to be processed into a frequency-domain signal E(Ω), and to determine the a posteriori signal-to-noise ratio PostSNR(Ω) according to the following formula:

PostSNR(Ω) = |E(Ω)|² / (R_bb(Ω) + R_nn(Ω)),

the a priori signal-to-noise ratio PrioriSNR(Ω) being derived according to the following formula:

S'(Ω) = E(Ω) * H_LSA(Ω),

where theta = PostSNR(Ω) * PrioriSNR(Ω) / (PrioriSNR(Ω) + 1).

11. The system according to claim 8, characterized in that the masking threshold determining module is further configured to determine, according to the spectrum estimate, the power spectral density B(k) of the critical bands of the sound signal to be processed and the spread critical-band spectrum C(k):

C(k) = B(k) * SF(k),

T(k) = 10^{lg(C(k))-(O(k)/10)},

R_TT(Ω) = min(T(k), T_abs(k)),

where T_abs(k) = 3.64f^-0.8 - 6.5exp(f-3.3)² + 10^-3f⁴.

12. The system according to claim 8, characterized in that the filtering module is further configured to determine the weighting coefficient H(Ω) of the filtering according to the power spectral density of the echo signal and the power spectral density of the noise signal:

H(Ω) = min(1, sqrt(R_TT(Ω) / (R_bb(Ω) + R_nn(Ω))) + (zeta_b * R_bb(Ω) + zeta_n * R_nn(Ω)) / (R_bb(Ω) + R_nn(Ω))),

13. The system according to claim 8, characterized in that the signal acquisition module is further configured to receive an initial sound signal, and to perform echo cancellation on the initial sound signal to obtain the sound signal to be processed.

14. An electronic device comprising at least one processor and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions, when executed by the at least one processor, enable the at least one processor to perform the steps of the method of any one of claims 1-7.

15. A storage medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the steps of the method of any one of claims 1-7.