CN109448751B - A Binaural Speech Enhancement Method Based on Deep Learning - Google Patents


Info

Publication number: CN109448751B
Application number: CN201811646317.7A
Authority: CN (China)
Prior art keywords: complex, channel, speech, domain signal, target
Legal status: Active
Other versions: CN109448751A (Chinese)
Inventors: 李军锋, 孙兴伟, 夏日升, 颜永红
Current assignee: Institute of Acoustics CAS
Original assignee: Institute of Acoustics CAS
Application filed by Institute of Acoustics CAS; priority to CN201811646317.7A, granted as CN109448751B

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208: Noise filtering
    • G10L21/0216: Noise filtering characterised by the method used for estimating noise
    • G10L21/0232: Processing in the frequency domain
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04S: STEREOPHONIC SYSTEMS
    • H04S2420/00: Techniques used in stereophonic systems covered by H04S but not provided for in its groups
    • H04S2420/01: Enhancing the perception of the sound image or of the spatial distribution using head related transfer functions [HRTF's] or equivalents thereof, e.g. interaural time difference [ITD] or interaural level difference [ILD]

Landscapes

  • Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Stereophonic System (AREA)

Abstract



The invention discloses a binaural speech enhancement method based on deep learning. The left- and right-channel noisy speech signals, each containing the target speech signal to be enhanced, are processed to obtain left- and right-channel frequency-domain signals, whose magnitudes are combined into a single-channel complex feature. The frequency-domain signal of each channel and the corresponding theoretical value of the target frequency-domain signal are used to compute the ideal complex mask of the target speech for that channel, and the two masks are combined into a theoretical single-channel complex mask of the target speech. Together with the single-channel complex feature, this theoretical mask is used to train a complex feedforward neural network, yielding a binaural speech enhancement model. The single-channel complex mask estimate output by the model is then applied to the left- and right-channel noisy speech signals to obtain left- and right-channel frequency-domain signals, from which the corresponding target speech time-domain signals are finally recovered. The method suppresses noise interference while preserving the spatial information of the target sound source, and makes full use of the generalization ability of deep neural networks to achieve binaural speech enhancement.


Description

Binaural speech enhancement method based on deep learning
Technical Field
The invention relates to the technical field of speech enhancement, in particular to a binaural speech enhancement method based on deep learning.
Background
Speech enhancement technology removes background noise and directional interference from speech signals, improving speech quality and intelligibility and thereby yielding better performance in speech recognition and human listening. In enhancement methods with single-channel output, background noise can be suppressed by exploiting the different time-frequency characteristics of speech and noise in the single-channel input, while directional noise can be removed more effectively by exploiting the spatial information of the target speech and the interference in multi-channel input. In binaural hearing, the human auditory system improves speech comprehension by using the difference in spatial information between the target and the interference in the two-channel signal, and localizes the target sound source from its spatial information. However, most traditional speech enhancement methods with two-channel output consider only interference removal: they apply no special processing to the spatial information of the target speech, and they suppress non-stationary noise poorly.
Disclosure of Invention
The invention aims to overcome the above defects in the prior art.
In order to achieve the aim, the invention discloses a binaural speech enhancement method based on deep learning, which comprises the following steps:
performing framing, windowing, and Fourier transform on the left-channel noisy speech signal and the right-channel noisy speech signal respectively to obtain a left-channel noisy speech frequency-domain signal and a right-channel noisy speech frequency-domain signal, wherein the left-channel noisy speech signal contains the left-channel target speech signal to be enhanced and the right-channel noisy speech signal contains the right-channel target speech signal to be enhanced;
combining the magnitudes of the left-channel and right-channel noisy speech frequency-domain signals to obtain a single-channel complex feature;
computing a left-channel target speech ideal complex mask from the left-channel noisy speech frequency-domain signal and the theoretical value of the left-channel target speech frequency-domain signal, and computing a right-channel target speech ideal complex mask from the right-channel noisy speech frequency-domain signal and the theoretical value of the right-channel target speech frequency-domain signal;
combining the left-channel target speech ideal complex mask and the right-channel target speech ideal complex mask to form a theoretical single-channel complex mask of the target speech;
training a complex feedforward neural network with the single-channel complex feature and the theoretical single-channel complex mask of the target speech to obtain a binaural speech enhancement model;
feeding the single-channel complex feature to the binaural speech enhancement model, which outputs a single-channel complex mask estimate of the target speech; enhancing the left-channel and right-channel noisy speech frequency-domain signals with this estimate to obtain left-channel and right-channel target speech frequency-domain signal estimates;
and performing an inverse Fourier transform on the left-channel and right-channel target speech frequency-domain signal estimates respectively to obtain the left-channel and right-channel target speech time-domain signals.
Preferably, the framing, windowing, and Fourier transform of the left-channel and right-channel noisy speech signals are performed as follows:
the left-channel and right-channel noisy speech signals are each divided into frames of 1024 sampling points, zero-padded to 1024 points when a frame is too short; each frame is then windowed with a Hamming window; finally, a Fourier transform is applied to each frame.
Preferably, the single-channel complex feature is X_C = |X_L| + j|X_R|, where j is the imaginary unit, |X_L| is the magnitude of the left-channel noisy speech frequency-domain signal, and |X_R| is the magnitude of the right-channel noisy speech frequency-domain signal.
Preferably, the left-channel target speech ideal complex mask is:

M_L = [(X_L^r·S_L^r + X_L^i·S_L^i) + j·(X_L^r·S_L^i - X_L^i·S_L^r)] / [(X_L^r)^2 + (X_L^i)^2]

where j is the imaginary unit, X_L (a complex number) is the left-channel noisy speech frequency-domain signal, S_L (a complex number) is the theoretical value of the left-channel target speech frequency-domain signal, and the superscripts r and i denote the real and imaginary parts of a complex number;
preferably, the ideal complex masking of the right channel target speech is:
Figure BDA0001932133400000032
wherein j is complex imaginary unit, XRIs a complex number, is a right channel noisy speech frequency domain signal, SRThe expression r and i is the complex number, the theoretical value of the target speech frequency domain signal of the right channel, and the real part and the imaginary part of the complex number are taken.
Preferably, the theoretical single-channel complex mask of the target speech is M_C = M_L + j·M_R, where j is the imaginary unit, M_L is the left-channel target speech ideal complex mask, and M_R is the right-channel target speech ideal complex mask.
Preferably, the step of training the complex feedforward neural network to obtain the binaural speech enhancement model by using the single-channel complex feature and the target speech single-channel complex masking theoretical value, specifically,
the complex feedforward neural network is a fully-connected neural network with 4 layers, and each layer in the network has 1024 hidden-layer complex nodes. The activation function of each neuron uses a linear modification unit and acts on the real part and imaginary part of the complex number node, respectively, with the expression f (x) max (0, x).
The single-channel complex feature is expanded with preceding and following frames to obtain a single-channel complex expansion feature, which serves as the input of the complex feedforward neural network; the network outputs a single-channel complex mask estimate of the target speech. The theoretical single-channel complex mask of the target speech is used as the training target, and iterative training progressively reduces the mean square error between the estimated and theoretical masks.
Preferably, the single-channel complex mask estimate is M_C′ = M_L′ + j·M_R′, where j is the imaginary unit, M_L′ is the estimate of the left-channel target speech ideal complex mask, and M_R′ is the estimate of the right-channel target speech ideal complex mask.
Preferably, the left-channel target speech frequency-domain signal estimate is X_L′ = M_L′ · X_L, where M_L′ is the estimate of the left-channel target speech ideal complex mask and X_L is the left-channel noisy speech frequency-domain signal;
preferably, the right channel target speech frequency domain signal estimated value X'R=M′R*XRWherein M isR' estimation value, X, of ideal complex masking of target speech of right channelRAnd the right channel is a voice frequency domain signal with noise.
The invention has the following advantages: a single-channel complex mask is constructed from the ideal complex masks of the left and right channels and is estimated by a complex feedforward neural network, so that the two channels are processed jointly and the spatial information of the target sound source is preserved while noise interference is suppressed. By including sufficiently many noise types and source directions in the training data, the generalization ability of the deep neural network is fully exploited, the robustness of the model is improved, and binaural speech enhancement is achieved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without creative efforts.
Fig. 1 is a flowchart of a binaural speech enhancement method based on deep learning.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Fig. 1 is a flowchart of a binaural speech enhancement method based on deep learning. As shown in fig. 1, includes:
step S101: and respectively performing framing, windowing and Fourier transformation on the voice signal with noise of the left channel and the voice signal with noise of the right channel to obtain a voice frequency domain signal with noise of the left channel and a voice frequency domain signal with noise of the right channel.
The left-channel noisy speech signal comprises a left-channel target speech signal to be enhanced, and the right-channel noisy speech signal comprises a right-channel target speech signal to be enhanced.
In a specific embodiment, the left-channel and right-channel noisy speech signals are each divided into frames of 1024 sampling points, zero-padded to 1024 points when a frame is too short; each frame is then windowed with a Hamming window; finally, a Fourier transform is applied to each frame, yielding the left-channel and right-channel noisy speech frequency-domain signals.
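The framing, windowing, and transform steps above can be sketched as follows. This is an illustrative reconstruction, not code from the patent: the 1024-point frame length, zero padding, and Hamming window follow the embodiment, while the 50% hop size is an assumption the patent does not state.

```python
import numpy as np

def stft_frames(x, frame_len=1024, hop=512):
    """Frame a 1-D signal, apply a Hamming window, and FFT each frame.

    frame_len = 1024 samples per the embodiment; the hop of 512
    (50% overlap) is an assumption -- the patent does not specify it.
    """
    # Zero-pad the tail so the last frame reaches frame_len points.
    n_frames = max(1, int(np.ceil((len(x) - frame_len) / hop)) + 1)
    padded = np.zeros(hop * (n_frames - 1) + frame_len)
    padded[:len(x)] = x

    win = np.hamming(frame_len)
    frames = np.stack([padded[i * hop:i * hop + frame_len] * win
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)   # one complex spectrum per frame

# Example: a short synthetic noisy signal standing in for one channel
rng = np.random.default_rng(0)
x_left = rng.standard_normal(3000)
X_L = stft_frames(x_left)
print(X_L.shape)  # (5, 513)
```

The 513 frequency bins per frame come from the one-sided FFT of a 1024-point real frame.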
Step S102: combining the magnitudes of the left-channel and right-channel noisy speech frequency-domain signals to obtain a single-channel complex feature.
Specifically, the single-channel complex feature is X_C = |X_L| + j|X_R|, where j is the imaginary unit, |X_L| is the magnitude of the left-channel noisy speech frequency-domain signal, and |X_R| is the magnitude of the right-channel noisy speech frequency-domain signal.
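A toy illustration of how the two magnitude spectra can be packed into one complex-valued feature; the one-frame spectra below are arbitrary placeholder values, not data from the patent.

```python
import numpy as np

# Placeholder one-frame noisy spectra for the two channels
X_L = np.array([1 + 1j, 0.5 - 0.5j, 2 + 0j])   # left-channel noisy spectrum
X_R = np.array([0 + 2j, 1 + 0j, 1 - 1j])       # right-channel noisy spectrum

# X_C = |X_L| + j|X_R|: the real part carries the left-channel magnitude,
# the imaginary part carries the right-channel magnitude.
X_C = np.abs(X_L) + 1j * np.abs(X_R)
```

Packing both magnitudes into one complex array lets a single complex-valued network consume both channels at once.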
Step S103: computing a left-channel target speech ideal complex mask from the left-channel noisy speech frequency-domain signal and the theoretical value of the left-channel target speech frequency-domain signal, and computing a right-channel target speech ideal complex mask from the right-channel noisy speech frequency-domain signal and the theoretical value of the right-channel target speech frequency-domain signal.
Specifically, the left-channel target speech ideal complex mask is:

M_L = [(X_L^r·S_L^r + X_L^i·S_L^i) + j·(X_L^r·S_L^i - X_L^i·S_L^r)] / [(X_L^r)^2 + (X_L^i)^2]

where j is the imaginary unit, X_L (a complex number) is the left-channel noisy speech frequency-domain signal, S_L (a complex number) is the theoretical value of the left-channel target speech frequency-domain signal, and the superscripts r and i denote the real and imaginary parts of a complex number.
The right-channel target speech ideal complex mask is:

M_R = [(X_R^r·S_R^r + X_R^i·S_R^i) + j·(X_R^r·S_R^i - X_R^i·S_R^r)] / [(X_R^r)^2 + (X_R^i)^2]

where j is the imaginary unit, X_R (a complex number) is the right-channel noisy speech frequency-domain signal, S_R (a complex number) is the theoretical value of the right-channel target speech frequency-domain signal, and the superscripts r and i denote the real and imaginary parts of a complex number.
Step S104: combining the left-channel target speech ideal complex mask and the right-channel target speech ideal complex mask to form a theoretical single-channel complex mask of the target speech.
Specifically, the theoretical single-channel complex mask of the target speech is M_C = M_L + j·M_R, where j is the imaginary unit, M_L is the left-channel target speech ideal complex mask, and M_R is the right-channel target speech ideal complex mask.
Step S105: training the complex feedforward neural network with the single-channel complex feature and the theoretical single-channel complex mask of the target speech to obtain a binaural speech enhancement model.
In one embodiment, the complex feedforward neural network is a 4-layer fully connected neural network with 1024 complex hidden nodes per layer. The activation function of each neuron is a rectified linear unit, f(x) = max(0, x), applied separately to the real part and the imaginary part of each complex node.
The single-channel complex feature is expanded with preceding and following frames to obtain a single-channel complex expansion feature, which serves as the input of the complex feedforward neural network; the network outputs a single-channel complex mask estimate of the target speech. The theoretical single-channel complex mask of the target speech is used as the training target, and iterative training progressively reduces the mean square error between the estimated and theoretical masks.
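A rough, untrained sketch of such a network: complex-valued dense layers with the ReLU applied separately to real and imaginary parts, as the embodiment describes. The four 1024-node hidden layers follow the patent; the 513-bin input/output width, the random initialization, and the output-layer activation are placeholders of this sketch, not details from the patent.

```python
import numpy as np

rng = np.random.default_rng(0)

def complex_relu(z):
    # f(x) = max(0, x), applied separately to real and imaginary parts
    return np.maximum(z.real, 0) + 1j * np.maximum(z.imag, 0)

class ComplexDense:
    """One fully connected complex-valued layer (random init, no training)."""
    def __init__(self, n_in, n_out):
        scale = 1.0 / np.sqrt(n_in)
        self.W = scale * (rng.standard_normal((n_in, n_out))
                          + 1j * rng.standard_normal((n_in, n_out)))
        self.b = np.zeros(n_out, dtype=complex)

    def __call__(self, z):
        return complex_relu(z @ self.W + self.b)

# Four 1024-node hidden layers (per the embodiment) plus an output layer;
# the 513-wide input/output is an assumed single-frame spectrum size.
layers = ([ComplexDense(513, 1024)]
          + [ComplexDense(1024, 1024) for _ in range(3)]
          + [ComplexDense(1024, 513)])

z = rng.standard_normal(513) + 1j * rng.standard_normal(513)
for layer in layers:
    z = layer(z)
print(z.shape)  # (513,)
```

A real training setup would minimize the mean square error between the network output and the theoretical mask M_C, which this forward-pass sketch omits.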
Step S106: feeding the single-channel complex feature to the binaural speech enhancement model, which outputs a single-channel complex mask estimate of the target speech; the left-channel and right-channel noisy speech frequency-domain signals are enhanced with this estimate to obtain left-channel and right-channel target speech frequency-domain signal estimates.
Specifically, the single-channel complex mask estimate is M_C′ = M_L′ + j·M_R′, where j is the imaginary unit, M_L′ is the estimate of the left-channel target speech ideal complex mask, and M_R′ is the estimate of the right-channel target speech ideal complex mask.
The left-channel target speech frequency-domain signal estimate is X_L′ = M_L′ · X_L, where M_L′ is the estimate of the left-channel target speech ideal complex mask and X_L is the left-channel noisy speech frequency-domain signal.
The right-channel target speech frequency-domain signal estimate is X_R′ = M_R′ · X_R, where M_R′ is the estimate of the right-channel target speech ideal complex mask and X_R is the right-channel noisy speech frequency-domain signal.
Step S107: performing an inverse Fourier transform on the left-channel and right-channel target speech frequency-domain signal estimates respectively to obtain the left-channel and right-channel target speech time-domain signals.
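The synthesis step can be sketched as an inverse FFT per frame followed by overlap-add. The 50% hop and the omission of synthesis-window normalization are simplifying assumptions of this sketch (the patent specifies only the inverse Fourier transform), so the round trip below only checks shapes, not exact reconstruction.

```python
import numpy as np

def istft_frames(spec, frame_len=1024, hop=512):
    """Inverse-FFT each frame and overlap-add back to a time signal.

    Counterpart to the analysis step; the hop size and the lack of
    synthesis-window normalization are simplifying assumptions.
    """
    frames = np.fft.irfft(spec, n=frame_len, axis=1)
    out = np.zeros(hop * (len(frames) - 1) + frame_len)
    for i, frame in enumerate(frames):
        out[i * hop:i * hop + frame_len] += frame
    return out

# Round trip on white noise: analyze, apply no mask, resynthesize
rng = np.random.default_rng(1)
x = rng.standard_normal(2048)
win = np.hamming(1024)
frames = np.stack([x[i * 512:i * 512 + 1024] * win for i in range(3)])
y = istft_frames(np.fft.rfft(frames, axis=1))
print(y.shape)  # (2048,)
```

In the full method the estimated masks M_L′ and M_R′ would be applied to the left- and right-channel spectra before this synthesis step.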
The invention provides a binaural speech enhancement method based on deep learning that constructs a single-channel complex mask from the ideal complex masks of the left and right channels and estimates it with a complex feedforward neural network, thereby processing the two channels jointly and preserving the spatial information of the target sound source while suppressing noise interference. By including sufficiently many noise types and source directions in the training data, the generalization ability of the deep neural network is fully exploited, the robustness of the model is improved, and binaural speech enhancement is achieved.
The above embodiments are provided to further explain the objects, technical solutions and advantages of the present invention in detail, it should be understood that the above embodiments are merely exemplary embodiments of the present invention and are not intended to limit the scope of the present invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (8)

1. A binaural speech enhancement method based on deep learning, comprising the steps of:
performing framing, windowing, and Fourier transform on a left-channel noisy speech signal and a right-channel noisy speech signal respectively to obtain a left-channel noisy speech frequency-domain signal and a right-channel noisy speech frequency-domain signal, wherein the left-channel noisy speech signal contains a left-channel target speech signal to be enhanced and the right-channel noisy speech signal contains a right-channel target speech signal to be enhanced;
combining the magnitudes of the left-channel and right-channel noisy speech frequency-domain signals to obtain a single-channel complex feature;
computing a left-channel target speech ideal complex mask from the left-channel noisy speech frequency-domain signal and the theoretical value of the left-channel target speech frequency-domain signal, and computing a right-channel target speech ideal complex mask from the right-channel noisy speech frequency-domain signal and the theoretical value of the right-channel target speech frequency-domain signal;
combining the left-channel target speech ideal complex mask and the right-channel target speech ideal complex mask to form a theoretical single-channel complex mask of the target speech;
training a complex feedforward neural network with the single-channel complex feature and the theoretical single-channel complex mask of the target speech to obtain a binaural speech enhancement model;
feeding the single-channel complex feature to the binaural speech enhancement model to output a single-channel complex mask estimate of the target speech, and enhancing the left-channel and right-channel noisy speech frequency-domain signals with this estimate to obtain left-channel and right-channel target speech frequency-domain signal estimates;
and performing an inverse Fourier transform on the left-channel and right-channel target speech frequency-domain signal estimates respectively to obtain the left-channel and right-channel target speech time-domain signals.
2. The method according to claim 1, wherein the framing, windowing, and Fourier transform of the left-channel and right-channel noisy speech signals are performed as follows: each signal is divided into frames of 1024 sampling points, zero-padded to 1024 points when a frame is too short; each frame is windowed with a Hamming window; and a Fourier transform is applied to each frame.
3. The method according to claim 1, wherein the single-channel complex feature is:
X_C = |X_L| + j|X_R|
where j is the imaginary unit, |X_L| is the magnitude of the left-channel noisy speech frequency-domain signal, and |X_R| is the magnitude of the right-channel noisy speech frequency-domain signal.
4. The method according to claim 1, wherein the left-channel target speech ideal complex mask is:
M_L = [(X_L^r·S_L^r + X_L^i·S_L^i) + j·(X_L^r·S_L^i - X_L^i·S_L^r)] / [(X_L^r)^2 + (X_L^i)^2]
where j is the imaginary unit, X_L (a complex number) is the left-channel noisy speech frequency-domain signal, S_L (a complex number) is the theoretical value of the left-channel target speech frequency-domain signal, and r and i denote the real and imaginary parts of a complex number;
and the right-channel target speech ideal complex mask is:
M_R = [(X_R^r·S_R^r + X_R^i·S_R^i) + j·(X_R^r·S_R^i - X_R^i·S_R^r)] / [(X_R^r)^2 + (X_R^i)^2]
where X_R (a complex number) is the right-channel noisy speech frequency-domain signal and S_R (a complex number) is the theoretical value of the right-channel target speech frequency-domain signal.
5. The method according to claim 1 or 4, wherein the theoretical single-channel complex mask of the target speech is:
M_C = M_L + j·M_R
where j is the imaginary unit, M_L is the left-channel target speech ideal complex mask, and M_R is the right-channel target speech ideal complex mask.
6. The method according to claim 1, wherein training the complex feedforward neural network with the single-channel complex feature and the theoretical single-channel complex mask comprises:
the complex feedforward neural network is a 4-layer fully connected neural network with 1024 complex hidden nodes per layer; the activation function of each neuron is a rectified linear unit, f(x) = max(0, x), applied separately to the real part and the imaginary part of each complex node;
the single-channel complex feature is expanded with preceding and following frames to obtain a single-channel complex expansion feature, which is input to the complex feedforward neural network; the network outputs a single-channel complex mask estimate of the target speech, the theoretical single-channel complex mask of the target speech is used as the training target, and iterative training progressively reduces the mean square error between the estimated and theoretical masks.
7. The method according to claim 1, wherein the single-channel complex mask estimate is:
M_C′ = M_L′ + j·M_R′
where j is the imaginary unit, M_L′ is the estimate of the left-channel target speech ideal complex mask, and M_R′ is the estimate of the right-channel target speech ideal complex mask.
8. The method according to claim 1 or 7, wherein the left-channel target speech frequency-domain signal estimate is:
X_L′ = M_L′ · X_L
where M_L′ is the estimate of the left-channel target speech ideal complex mask and X_L is the left-channel noisy speech frequency-domain signal;
and the right-channel target speech frequency-domain signal estimate is:
X_R′ = M_R′ · X_R
where M_R′ is the estimate of the right-channel target speech ideal complex mask and X_R is the right-channel noisy speech frequency-domain signal.
CN201811646317.7A 2018-12-29 2018-12-29 A Binaural Speech Enhancement Method Based on Deep Learning Active CN109448751B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811646317.7A CN109448751B (en) 2018-12-29 2018-12-29 A Binaural Speech Enhancement Method Based on Deep Learning


Publications (2)

Publication Number | Publication Date
CN109448751A (en) | 2019-03-08
CN109448751B (en) | 2021-03-23

Family

ID=65540255

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811646317.7A Active CN109448751B (en) 2018-12-29 2018-12-29 A Binaural Speech Enhancement Method Based on Deep Learning

Country Status (1)

Country Link
CN (1) CN109448751B (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110739002B (en) * 2019-10-16 2022-02-22 Sun Yat-sen University Complex-domain speech enhancement method, system and medium based on a generative adversarial network
CN111239686B (en) * 2020-02-18 2021-12-21 Institute of Acoustics, Chinese Academy of Sciences Dual-channel sound source localization method based on deep learning
CN111681646A (en) * 2020-07-17 2020-09-18 成都三零凯天通信实业有限公司 General-scene Mandarin Chinese speech recognition method with an end-to-end architecture
CN114333811A (en) * 2020-09-30 2022-04-12 China Mobile Communications Research Institute Voice recognition method, system and device
CN114694672B (en) * 2020-12-30 2025-10-21 Alibaba Group Holding Limited Speech enhancement method, apparatus and device
CN113129918B (en) * 2021-04-15 2022-05-03 Zhejiang University Speech dereverberation method combining beamforming and a deep complex U-Net network
CN115862649B (en) * 2021-09-24 2025-07-22 Beijing Zitiao Network Technology Co., Ltd. Audio noise reduction method, apparatus, device and storage medium
CN113921027B (en) * 2021-12-14 2022-04-29 Beijing Tsingmicro Intelligent Technology Co., Ltd. Speech enhancement method and apparatus based on spatial features, and electronic device
CN115512714B (en) * 2022-03-22 2025-09-12 DingTalk (China) Information Technology Co., Ltd. Speech enhancement method, apparatus and device
CN114566189B (en) * 2022-04-28 2022-10-04 Zhejiang Lab Speech emotion recognition method and system based on three-dimensional deep feature fusion
CN114999510B (en) * 2022-04-29 2025-07-11 University of Science and Technology of China Single-channel speech enhancement method based on a complex convolutional recurrent neural network with masking effect

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102157156B (en) * 2011-03-21 2012-10-10 Tsinghua University Single-channel speech enhancement method and system
US9881631B2 (en) * 2014-10-21 2018-01-30 Mitsubishi Electric Research Laboratories, Inc. Method for enhancing audio signal using phase information
CN107845389B (en) * 2017-12-21 2020-07-17 Beijing University of Technology Speech enhancement method based on multi-resolution auditory cepstral coefficients and deep convolutional neural networks
CN108564963B (en) * 2018-04-23 2019-10-18 Baidu Online Network Technology (Beijing) Co., Ltd. Method and apparatus for enhancing voice


Similar Documents

Publication Publication Date Title
CN109448751B (en) A Binaural Speech Enhancement Method Based on Deep Learning
CN110970053B (en) A multi-channel and speaker-independent speech separation method based on deep clustering
CN109584903B (en) Multi-user voice separation method based on deep learning
CN105869651B Dual-channel beamforming speech enhancement method based on noise mixing coherence
US9681246B2 (en) Bionic hearing headset
CN102157156B (en) Single-channel voice enhancement method and system
CN111081267B (en) Multi-channel far-field speech enhancement method
CN100524465C (en) A method and device for noise elimination
CN110728989B (en) A Binaural Speech Separation Method Based on Long Short-Term Memory Network LSTM
CN111292759A (en) A method and system for stereo echo cancellation based on neural network
CN106373589B (en) An iterative structure-based binaural hybrid speech separation method
KR20180069879A (en) Globally Optimized Least Squares Post Filtering for Voice Enhancement
CN113362846B (en) A Speech Enhancement Method Based on Generalized Sidelobe Cancellation Structure
CN108986832B (en) Method and device for binaural speech de-reverberation based on speech occurrence probability and consistency
CN101964934A Dual-microphone micro-array speech beamforming method
WO2022105690A1 (en) Earphone and noise reduction method
CN106297817A Speech enhancement method based on binaural information
WO2022032608A1 (en) Audio noise reduction method and device
Li et al. Multichannel online dereverberation based on spectral magnitude inverse filtering
Wolff et al. A generalized view on microphone array postfilters
CN110858485B (en) Voice enhancement method, device, equipment and storage medium
CN114038475A (en) Single-channel speech enhancement system based on speech spectrum compensation
Aroudi et al. Cognitive-driven convolutional beamforming using EEG-based auditory attention decoding
CN115188391A (en) A method and device for voice enhancement with far-field dual microphones
CN104394498B (en) A three-channel holographic sound field playback method and sound field collecting device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant