WO2024017110A1 - Speech noise reduction method, model training method, apparatus, device, medium and product - Google Patents
- Publication number
- WO2024017110A1 PCT/CN2023/106951 CN2023106951W
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- audio frame
- activity detection
- detection result
- noise reduction
- sample
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L21/0216—Noise filtering characterised by the method used for estimating noise
- G10L21/0224—Processing in the time domain
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L21/0216—Noise filtering characterised by the method used for estimating noise
- G10L21/0232—Processing in the frequency domain
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/78—Detection of presence or absence of voice signals
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L21/0216—Noise filtering characterised by the method used for estimating noise
Definitions
- This application relates to the field of audio processing technology, for example to a speech noise reduction method, a model training method, an apparatus, a device, a medium and a product.
- The speech collected by the microphone of a terminal device usually contains a certain amount of noise.
- A speech noise reduction algorithm can suppress the noise carried in the speech, thereby improving its intelligibility and quality.
- Speech noise reduction solutions can be roughly divided into two categories: traditional noise reduction solutions and artificial intelligence (AI) noise reduction solutions.
- Traditional noise reduction solutions use signal processing to achieve speech noise reduction. They cannot eliminate non-stationary noise, that is, their ability to suppress sudden noise is weak. AI noise reduction solutions handle both stationary and non-stationary noise well, but they are data-driven and depend heavily on training samples. If a scenario was not considered during model training (such as a very low signal-to-noise ratio), encountering it in actual use may result in unpredictable signal output or even a system crash.
- The embodiments of this application provide speech noise reduction methods, model training methods, devices, equipment, media and products, which can effectively combine traditional noise reduction solutions and AI noise reduction solutions to improve the speech noise reduction effect.
- A speech noise reduction method, the method including:
- the model activity detection result corresponding to the previous audio frame and the algorithm activity detection result corresponding to the current audio frame are fused to obtain the target activity detection result corresponding to the current audio frame, wherein the model activity detection result is output by a preset speech noise reduction network model;
- the initial noise reduction audio frame is input to the preset speech noise reduction network model to output the target noise reduction audio frame and the model activity detection result corresponding to the current audio frame.
- A model training method, including:
- the sample model activity detection result corresponding to the previous sample audio frame and the sample algorithm activity detection result corresponding to the current sample audio frame are fused to obtain the target sample activity detection result corresponding to the current sample audio frame, wherein the sample model activity detection result is output by the speech noise reduction network model;
- a first loss relationship is determined based on the target sample noise-reduced audio frame and the clean audio frame, a second loss relationship is determined based on the sample model activity detection result and the activity detection label, and the speech noise reduction network model is trained based on the first loss relationship and the second loss relationship.
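The two loss relationships described above can be combined into a single training objective. The following is a minimal sketch only: the source does not specify the form of either loss, so mean squared error for the first loss, binary cross-entropy for the second, and the `vad_weight` balancing factor are all assumptions made for illustration.

```python
import numpy as np

def denoise_loss(denoised, clean):
    """First loss (assumed form): mean squared error between the
    network's denoised frame and the clean reference frame."""
    return float(np.mean((denoised - clean) ** 2))

def vad_loss(vad_prob, vad_label, eps=1e-7):
    """Second loss (assumed form): binary cross-entropy between the
    predicted speech probabilities and the 0/1 activity labels."""
    p = np.clip(vad_prob, eps, 1.0 - eps)  # avoid log(0)
    return float(-np.mean(vad_label * np.log(p)
                          + (1.0 - vad_label) * np.log(1.0 - p)))

def total_loss(denoised, clean, vad_prob, vad_label, vad_weight=0.5):
    """Weighted sum of both losses; vad_weight is a hypothetical knob."""
    return denoise_loss(denoised, clean) + vad_weight * vad_loss(vad_prob, vad_label)
```

In a real training loop the same two terms would be computed inside an automatic-differentiation framework so that gradients flow to both branches of the network.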
- A speech noise reduction device, the device including:
- the voice activity detection module is configured to use a preset voice activity detection algorithm to detect the current audio frame to be processed, and obtain the corresponding algorithm activity detection results;
- the detection result fusion module is configured to fuse the model activity detection result corresponding to the previous audio frame and the algorithm activity detection result corresponding to the current audio frame to obtain the target activity detection result corresponding to the current audio frame, wherein,
- the above model activity detection results are output by the preset speech noise reduction network model;
- a noise reduction processing module configured to perform noise estimation and noise elimination on the current audio frame based on the target activity detection result to obtain an initial noise reduction audio frame
- the model input module is configured to input the initial noise reduction audio frame to the preset speech noise reduction network model to output the target noise reduction audio frame and the model activity detection result corresponding to the current audio frame.
- a model training device including:
- the voice detection module is configured to use a preset voice activity detection algorithm to detect the current sample audio frame to be processed, and obtain the corresponding sample algorithm activity detection result, wherein the current sample audio frame is associated with an activity detection label and a clean audio frame;
- a fusion module configured to perform fusion processing on the sample model activity detection result corresponding to the previous sample audio frame and the sample algorithm activity detection result corresponding to the current sample audio frame, to obtain the target sample activity detection result corresponding to the current sample audio frame, wherein the sample model activity detection result is output by the speech noise reduction network model;
- a noise elimination module configured to perform noise estimation and noise elimination on the current sample audio frame based on the target sample activity detection result to obtain an initial noise reduction sample audio frame;
- a network model input module configured to input the initial noise reduction sample audio frame to the speech noise reduction network model to output the target sample noise reduction audio frame and the sample model activity detection result corresponding to the current sample audio frame;
- a network model training module configured to determine a first loss relationship based on the target sample noise-reduced audio frame and the clean audio frame, determine a second loss relationship based on the sample model activity detection result and the activity detection label, and train the speech noise reduction network model based on the first loss relationship and the second loss relationship.
- An electronic device, including:
- a memory storing a computer program that can be executed by at least one processor, the computer program being executed by the at least one processor so that the at least one processor can execute the speech noise reduction method and/or the model training method described in any embodiment of the present application.
- A computer-readable storage medium storing a computer program which, when executed by a processor, implements the speech noise reduction method and/or the model training method described in any embodiment of the present application.
- A computer program product including a computer program that, when executed by a processor, implements the speech noise reduction method and/or the model training method described in any embodiment of the present application.
- The speech noise reduction solution provided in the embodiments of this application uses a preset voice activity detection algorithm to detect the current audio frame to be processed and obtains the corresponding algorithm activity detection result.
- The model activity detection result corresponding to the previous audio frame, which is output by the preset speech noise reduction network model, is fused with the algorithm activity detection result corresponding to the current audio frame to obtain the target activity detection result corresponding to the current audio frame.
- Noise estimation and noise elimination are performed on the current audio frame based on the target activity detection result to obtain the initial noise-reduced audio frame.
- The initial noise-reduced audio frame is input to the preset speech noise reduction network model to output the target noise-reduced audio frame and the model activity detection result corresponding to the current audio frame.
- Because the preset speech noise reduction network model can output model activity detection results, the model activity detection result of the previous audio frame can be combined with the algorithm activity detection result obtained by the traditional speech noise reduction algorithm, so that the traditional noise reduction algorithm obtains more activity detection information and determines the voice activity detection result more reasonably and accurately.
- Noise estimation and noise elimination based on this result protect the speech better and eliminate more noise, yielding a traditional noise reduction result with a higher signal-to-noise ratio. Using that result as the input of the preset speech noise reduction network model produces better noise-reduced audio frames and reduces the likelihood that the model has to process severely degraded data.
- The traditional noise reduction algorithm and the AI noise reduction method reinforce each other and handle many kinds of noise well, which improves the speech noise reduction effect and the stability and robustness of the overall solution.
- Figure 1 is a schematic flow chart of a speech noise reduction method provided by an embodiment of the present application.
- Figure 2 is a schematic flow chart of yet another speech noise reduction method provided by an embodiment of the present application.
- Figure 3 is a schematic diagram of the inference flow of a speech noise reduction method provided by an embodiment of the present application.
- Figure 4 is a schematic flow chart of a model training method provided by an embodiment of the present application.
- Figure 5 is a schematic diagram of the training process of a model training method provided by an embodiment of the present application.
- Figure 6 is a structural block diagram of a speech noise reduction device provided by an embodiment of the present application.
- Figure 7 is a structural block diagram of a model training device provided by an embodiment of the present application.
- Figure 8 is a structural block diagram of an electronic device provided by an embodiment of the present application.
- Figure 1 is a schematic flowchart of a speech noise reduction method provided by an embodiment of the present application.
- This embodiment is applicable to speech noise reduction, for example in scenarios such as voice calls, live audio and video broadcasts, and multi-person conferences.
- The method can be executed by a speech noise reduction device, which can be implemented in hardware and/or software.
- The speech noise reduction device can be configured in electronic equipment such as speech noise reduction equipment.
- The electronic device may be a mobile device such as a mobile phone, a smart watch, a tablet computer, or a personal digital assistant; it may also be a desktop computer or another device.
- the method includes:
- Step 101: Use the preset voice activity detection algorithm to detect the current audio frame to be processed, and obtain the corresponding algorithm activity detection result.
- the current audio frame to be processed can be understood as the audio frame that currently needs to be processed for voice noise reduction, and the current audio frame can be included in an audio file or audio stream.
- the current audio frame may be an original audio frame in an audio file or audio stream, or an audio frame obtained by preprocessing the original audio frame.
- the entire speech noise reduction solution can be understood as a speech noise reduction system, and the current audio frame can be understood as an input signal of the speech noise reduction system.
- the speech noise reduction solution can include traditional speech noise reduction algorithms and AI speech noise reduction models.
- The traditional speech noise reduction algorithm can be, for example, the Adaptive Noise Suppression (ANS) algorithm in Web Real-Time Communication (WebRTC), a linear filtering method, a spectral subtraction method, a statistical model algorithm or a subspace algorithm.
- Traditional speech noise reduction algorithms mainly include three parts: Voice Activity Detection (VAD) estimation, noise estimation and noise elimination.
- Voice activity detection, also known as voice endpoint detection or voice boundary detection, can identify long periods of silence in the sound signal stream.
- the preset voice activity detection algorithm in the embodiment of the present application can be a voice activity detection algorithm in any traditional voice noise reduction algorithm.
- The preset speech noise reduction network model in this application can be an AI speech noise reduction model, for example the RNNoise model or the Dual-Signal Transformation LSTM Network for Real-Time Noise Suppression (DTLN) noise reduction model.
- The preset speech noise reduction network model includes two branches: one branch outputs the denoised speech (the noise reduction branch), and the other branch outputs the voice activity detection result (the detection branch).
- For AI speech noise reduction models that already include a detection branch, the original model structure can be kept; for AI speech noise reduction models that do not include a detection branch, a detection branch can be added on top of the backbone network, and the network structure of the detection branch may include, for example, convolutional layers and/or fully connected layers.
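To make the two-branch structure concrete, here is a toy forward pass in NumPy. It is a sketch under stated assumptions, not the architecture of RNNoise or DTLN: the shared backbone is a single fully connected layer, both branches are single fully connected layers, the weights are random stand-ins for trained parameters, and `N_BINS`, `HIDDEN` and `two_branch_forward` are hypothetical names.

```python
import numpy as np

rng = np.random.default_rng(0)
N_BINS = 256   # assumed number of frequency points per frame
HIDDEN = 64    # assumed width of the shared backbone

# Random weights stand in for trained parameters (illustration only).
W_backbone = rng.standard_normal((N_BINS, HIDDEN)) * 0.1
W_mask = rng.standard_normal((HIDDEN, N_BINS)) * 0.1
W_vad = rng.standard_normal((HIDDEN, N_BINS)) * 0.1

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def two_branch_forward(magnitude):
    """Run one frame of magnitudes through a shared backbone and two heads."""
    hidden = np.tanh(magnitude @ W_backbone)   # shared backbone features
    mask = sigmoid(hidden @ W_mask)            # noise reduction branch: gain per bin
    denoised = mask * magnitude
    vad_prob = sigmoid(hidden @ W_vad)         # detection branch: speech probability per bin
    return denoised, vad_prob
```

Both heads read the same backbone features, which is what allows a detection branch to be attached to an existing model without changing its noise reduction path.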
- RNNoise is a noise reduction solution that combines audio feature extraction with a deep neural network.
- The detection results obtained with the preset voice activity detection algorithm are recorded as algorithm activity detection results, and the activity detection results output by the preset speech noise reduction network model are recorded as model activity detection results.
- Step 102: Fuse the model activity detection result corresponding to the previous audio frame and the algorithm activity detection result corresponding to the current audio frame to obtain the target activity detection result corresponding to the current audio frame, wherein the model activity detection result is output by the preset speech noise reduction network model.
- the previous audio frame can be understood as the latest audio frame before the current audio frame, that is, the previous audio frame is located before the current audio frame and the two frame numbers are adjacent.
- When processing the previous audio frame, the preset speech noise reduction network model outputs the noise-reduced audio frame and the model activity detection result corresponding to that frame, and the model activity detection result can be cached for use in the noise reduction processing of the current audio frame.
- The model activity detection result corresponding to the previous audio frame and the algorithm activity detection result corresponding to the current audio frame are combined to determine the activity detection result used by the traditional speech noise reduction algorithm, that is, the target activity detection result.
- In this way the traditional noise reduction algorithm obtains more VAD information and therefore a more accurate noise estimate, which protects speech better, eliminates noise more accurately, and improves the output signal-to-noise ratio (SNR) of the traditional noise reduction algorithm.
- Step 103: Perform noise estimation and noise elimination on the current audio frame based on the target activity detection result to obtain an initial noise-reduced audio frame.
- The noise estimation algorithm and noise elimination algorithm of the traditional speech noise reduction algorithm can be used to process the current audio frame, and the processed audio frame is recorded as the initial noise-reduced audio frame.
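As an illustration of how a speech probability can steer noise estimation and noise elimination, the sketch below uses two generic textbook techniques: a recursive noise power average whose smoothing factor is raised where speech is likely, and a Wiener-style spectral gain. These are assumptions chosen for illustration; the source does not prescribe them, and `alpha`, `gain_floor` and both function names are hypothetical.

```python
import numpy as np

def update_noise_psd(noise_psd, frame_psd, speech_prob, alpha=0.9):
    """Noise estimation: recursive average of the noise power per bin.

    Where speech_prob is near 1 the effective smoothing factor approaches 1,
    so the estimate is frozen and speech energy does not leak into it.
    """
    alpha_eff = alpha + (1.0 - alpha) * speech_prob
    return alpha_eff * noise_psd + (1.0 - alpha_eff) * frame_psd

def suppress_noise(spectrum, noise_psd, gain_floor=0.1):
    """Noise elimination: apply a Wiener-style gain 1 - N / |Y|^2 per bin,
    limited to [gain_floor, 1] to avoid over-suppression."""
    frame_psd = np.abs(spectrum) ** 2
    gain = 1.0 - noise_psd / np.maximum(frame_psd, 1e-12)
    gain = np.clip(gain, gain_floor, 1.0)
    return gain * spectrum
```

With a more reliable speech probability, fewer speech frames are mistaken for noise, which is the mechanism by which fusion can raise the output SNR of the traditional stage.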
- Step 104: Input the initial noise reduction audio frame to the preset speech noise reduction network model to output the target noise reduction audio frame and the model activity detection result corresponding to the current audio frame.
- The initial noise-reduced audio frame can be used directly as the input of the preset speech noise reduction network model, or it can first be converted according to the characteristics of the model, for example into a signal of a preset dimension.
- The preset dimension can be, for example, the frequency domain, the time domain, or another domain.
- In the speech noise reduction method provided by this embodiment, a preset voice activity detection algorithm detects the current audio frame to be processed to obtain the corresponding algorithm activity detection result.
- The model activity detection result corresponding to the previous audio frame, which is output by the preset speech noise reduction network model, is fused with the algorithm activity detection result corresponding to the current audio frame to obtain the target activity detection result corresponding to the current audio frame.
- Noise estimation and noise elimination are performed on the current audio frame based on the target activity detection result to obtain an initial noise-reduced audio frame, which is input to the preset speech noise reduction network model to output the target noise-reduced audio frame and the model activity detection result corresponding to the current audio frame.
- Because the preset speech noise reduction network model can output model activity detection results, the model activity detection result of the previous audio frame can be combined with the algorithm activity detection result obtained by the traditional speech noise reduction algorithm, so that the traditional noise reduction algorithm obtains more activity detection information and determines the voice activity detection result more reasonably and accurately.
- Noise estimation and noise elimination based on this result protect the speech better and eliminate more noise, yielding a traditional noise reduction result with a higher signal-to-noise ratio. Using that result as the input of the preset speech noise reduction network model produces better noise-reduced audio frames and reduces the likelihood that the model has to process severely degraded data.
- The traditional noise reduction algorithm and the AI noise reduction method reinforce each other, handle many kinds of noise well, and improve the overall stability and robustness of the solution.
- voice activity detection can be at the frame level or at the frequency point level, and the detection results can be represented by one or more probability values.
- the algorithm activity detection result includes a first probability value corresponding to the presence of speech in the audio frame
- the model activity detection result includes a second probability value corresponding to the existence of speech in the audio frame.
- Fusing the model activity detection result corresponding to the previous audio frame and the algorithm activity detection result corresponding to the current audio frame to obtain the target activity detection result corresponding to the current audio frame includes: using a preset calculation method to combine the second probability value in the model activity detection result corresponding to the previous audio frame with the first probability value in the algorithm activity detection result corresponding to the current audio frame to obtain a third probability value, and determining the target activity detection result corresponding to the current audio frame according to the third probability value. With this arrangement, the target activity detection result can be determined accurately for frame-level voice activity detection.
- The first probability value represents the probability that an audio frame contains speech, as obtained by detecting that frame with the preset voice activity detection algorithm. The audio frame here can be any audio frame, such as the current audio frame or the previous audio frame, and the first probability values of different audio frames can differ.
- The second probability value represents the probability, output by the preset speech noise reduction network model, that an audio frame contains speech. The audio frame here can also be any audio frame, and the second probability values of different audio frames can differ.
- The first probability value in the algorithm activity detection result corresponding to the current audio frame (denoted A) represents the probability that the current audio frame contains speech, obtained by detecting the current audio frame with the preset voice activity detection algorithm, and can be recorded as Pa.
- The second probability value in the model activity detection result corresponding to the previous audio frame (denoted B) represents the probability that the previous audio frame contains speech, as predicted by the preset speech noise reduction network model when performing speech noise reduction on the previous audio frame, and can be recorded as Pb.
- The third probability value can be used as the target activity detection result corresponding to the current audio frame.
- The preset calculation method is one of taking the maximum value, taking the minimum value, calculating the average, calculating the sum, calculating the weighted sum, and calculating the weighted average.
- For example, when the maximum is taken, Pc = max(Pa, Pb), where Pc denotes the third probability value.
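The frame-level fusion of Pa and Pb can be sketched as a small function. It covers a subset of the calculation methods listed above; the function name, the `method` strings and the `weight` parameter are hypothetical, and the weighted variant assumes the weight applies to Pa with the remainder going to Pb.

```python
def fuse_frame_vad(p_algorithm, p_model_prev, method="max", weight=0.5):
    """Fuse Pa (algorithm, current frame) with Pb (model, previous frame)
    into the third probability value Pc."""
    if method == "max":
        return max(p_algorithm, p_model_prev)
    if method == "min":
        return min(p_algorithm, p_model_prev)
    if method == "mean":
        return 0.5 * (p_algorithm + p_model_prev)
    if method == "weighted":
        # Assumed convention: weight on Pa, (1 - weight) on Pb.
        return weight * p_algorithm + (1.0 - weight) * p_model_prev
    raise ValueError(f"unknown fusion method: {method}")
```

Taking the maximum is the most speech-protective choice, since a frame counts as speech if either detector says so; taking the minimum suppresses more aggressively.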
- The algorithm activity detection result includes, for each of a preset number of frequency points in the corresponding audio frame, a fourth probability value that speech is present at that frequency point; and the model activity detection result includes, for each of the preset number of frequency points in the corresponding audio frame, a fifth probability value that speech is present at that frequency point.
- In this case, fusing the model activity detection result corresponding to the previous audio frame and the algorithm activity detection result corresponding to the current audio frame to obtain the target activity detection result corresponding to the current audio frame includes: for each of the preset number of frequency points, using a preset calculation method to combine the fifth probability value of that frequency point in the model activity detection result corresponding to the previous audio frame with the fourth probability value of the same frequency point in the algorithm activity detection result corresponding to the current audio frame to obtain a sixth probability value; and determining the target activity detection result corresponding to the current audio frame according to the preset number of sixth probability values.
- the preset number (denoted as n) can be set according to actual needs, for example, it can be determined according to the number of points used in the fast Fourier transform in the preprocessing stage, for example, n is 256.
- The fourth probability values corresponding to the current audio frame (denoted A) represent the probability that each of the preset number of frequency points in the current audio frame contains speech, obtained by detecting the current audio frame with the preset voice activity detection algorithm, and can be recorded as PA[n].
- PA[n] can be understood as a vector of n elements. Each element takes a value between 0 and 1 and represents the probability that the corresponding frequency point contains speech.
- The fifth probability values corresponding to the previous audio frame (denoted B) represent the probability that each of the preset number of frequency points in the previous audio frame contains speech, as predicted by the preset speech noise reduction network model when performing speech noise reduction on the previous audio frame, and can be recorded as PB[n]. PA[n] and PB[n] are combined using the preset calculation method to obtain the preset number of sixth probability values, which can be recorded as PC[n]. For example, the vector of sixth probability values may be used as the target activity detection result corresponding to the current audio frame.
- The preset calculation method is one of taking the maximum value, taking the minimum value, calculating the average, calculating the sum, calculating the weighted sum, and calculating the weighted average.
- For example, when the maximum is taken, PC[n] = max(PA[n], PB[n]), evaluated element by element.
- That is, for the first frequency point, the larger of the corresponding fourth and fifth probability values becomes the sixth probability value of the first frequency point in the current audio frame, and the remaining frequency points are handled in the same way.
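The element-wise fusion of PA[n] and PB[n] maps directly onto array operations. This is a minimal sketch assuming both vectors are NumPy arrays of equal length; the function name and the two supported `method` strings are hypothetical.

```python
import numpy as np

def fuse_bin_vad(pa, pb, method="max"):
    """Fuse per-frequency-point probabilities PA[n] (algorithm, current frame)
    and PB[n] (model, previous frame) into PC[n]."""
    pa = np.asarray(pa, dtype=float)
    pb = np.asarray(pb, dtype=float)
    if pa.shape != pb.shape:
        raise ValueError("PA and PB must cover the same frequency points")
    if method == "max":
        return np.maximum(pa, pb)   # element-wise maximum
    if method == "mean":
        return 0.5 * (pa + pb)      # element-wise average
    raise ValueError(f"unknown fusion method: {method}")
```

For instance, `fuse_bin_vad([0.1, 0.9, 0.5], [0.4, 0.2, 0.5])` keeps the larger value at each of the three frequency points.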
- Inputting the initial noise-reduction audio frame to the preset speech noise reduction network model includes: performing feature extraction of a preset feature dimension on the initial noise-reduction audio frame to obtain a target input signal; and inputting the target input signal to the preset speech noise reduction network model, or inputting both the target input signal and the initial noise-reduction audio frame to the preset speech noise reduction network model.
- In this way, feature extraction can be targeted, and the prediction accuracy and precision of the preset speech noise reduction network model can be improved.
- The preset feature dimensions include explicit feature dimensions, for example fundamental frequency (pitch) features, per-channel energy normalization (PCEN) features, or Mel Frequency Cepstrum Coefficient (MFCC) features.
- the preset feature dimensions can be determined based on the network structure or characteristics of the preset speech noise reduction network model.
- Figure 2 is a schematic flow chart of another speech noise reduction method provided by an embodiment of the present application. This method is refined on the basis of the optional embodiments above.
- Figure 3 is a schematic diagram of the inference flow of a speech noise reduction method provided by an embodiment of the present application. The solution of this embodiment can be understood by reading Figure 2 and Figure 3 together. As shown in Figure 2, the method may include:
- Step 201: Obtain the original audio frame and preprocess it to obtain the current audio frame to be processed.
- the original audio frame is included in an audio file or audio stream, for example, it may be an audio stream in a voice call scenario.
- the call audio needs to be noise reduced.
- Preprocessing can include processing such as framing, windowing, and Fourier transform.
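The preprocessing chain of framing, windowing and Fourier transform can be sketched as follows. The frame length, hop size and Hann window are assumptions chosen for illustration; the source does not fix them. Note that a real FFT of 512 samples yields 257 frequency points, so a system working with 256 points would need its own convention for the extra bin.

```python
import numpy as np

def preprocess(signal, frame_len=512, hop=256):
    """Split a 1-D signal into overlapping frames, apply a Hann window,
    and return the complex spectrum of each frame (one row per frame)."""
    signal = np.asarray(signal, dtype=float)
    if len(signal) < frame_len:
        raise ValueError("signal is shorter than one frame")
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Framing: each row is one frame of frame_len samples.
    frames = np.stack([signal[i * hop: i * hop + frame_len]
                       for i in range(n_frames)])
    # Windowing followed by a real-input FFT along each frame.
    return np.fft.rfft(frames * window, axis=1)
```

Each row of the result plays the role of one preprocessed audio frame handed to the voice activity detection step.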
- the preprocessed noisy speech frame is the current audio frame to be processed, which is used as the input signal of the preset traditional noise reduction algorithm (recorded as S0).
- Step 202 Use the preset speech activity detection algorithm in the preset traditional noise reduction algorithm to detect the current audio frame to be processed, and obtain the corresponding algorithm activity detection result.
- the preset traditional noise reduction algorithm may be the ANS algorithm.
- S0 is detected. Assuming frequency-point-level detection, the speech presence probabilities Pf[256] of 256 frequency points can be obtained, that is, the algorithm activity detection result corresponding to S0.
- Step 203 Determine whether the current audio frame has a previous audio frame. If so, perform step 204; otherwise, perform step 206.
- If the current audio frame has no previous audio frame, step 206 is executed, in which noise estimation and noise elimination are performed based on the algorithm activity detection result corresponding to the current audio frame.
- Step 204 Obtain the model activity detection result corresponding to the previous audio frame, and fuse the obtained model activity detection result and the algorithm activity detection result corresponding to the current audio frame to obtain the target activity detection result corresponding to the current audio frame.
- the model activity detection result corresponding to the previous audio frame is output by a preset speech noise reduction network model based on artificial intelligence, which can be the speech presence probability PF [256] of 256 frequency points in the previous audio frame, which can be used
- Step 205 Based on the target activity detection result, use the preset traditional noise reduction algorithm to perform noise estimation and noise elimination on the current audio frame to obtain an initial noise reduction audio frame, and execute step 207.
- the preset traditional noise reduction algorithm implements noise estimation and noise elimination according to P [256], and obtains the speech signal S1 that has undergone traditional noise reduction processing, that is, the initial noise reduction audio frame.
- Step 206 Based on the algorithm activity detection result corresponding to the current audio frame, use the preset traditional noise reduction algorithm to perform noise estimation and noise elimination on the current audio frame to obtain an initial noise reduction audio frame.
- the preset traditional noise reduction algorithm implements noise estimation and noise elimination according to Pf [256], and obtains the speech signal S1 that has undergone traditional noise reduction processing, that is, the initial noise reduction audio frame.
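Steps 205 and 206 both drive noise estimation and noise elimination with a per-frequency-point speech presence probability (P[256] or Pf[256]). The source does not describe the internals of the preset traditional noise reduction algorithm, so the sketch below substitutes a generic technique: probability-controlled recursive averaging of the noise power, followed by a spectral-subtraction-style gain with a floor. The smoothing factor `alpha`, the `gain_floor`, and all function names are assumptions.

```python
def update_noise_psd(noise_psd, spectrum, speech_prob, alpha=0.9):
    """Update the noise power estimate; bins that likely hold speech change less."""
    updated = []
    for n, y, p in zip(noise_psd, spectrum, speech_prob):
        a = alpha + (1.0 - alpha) * p  # p = 1 freezes the estimate for that bin
        updated.append(a * n + (1.0 - a) * abs(y) ** 2)
    return updated


def suppress_noise(spectrum, noise_psd, gain_floor=0.1):
    """Apply a per-bin gain derived from the estimated noise power."""
    out = []
    for y, n in zip(spectrum, noise_psd):
        power = abs(y) ** 2
        gain = max(gain_floor, 1.0 - n / power) if power > 0.0 else gain_floor
        out.append(gain * y)
    return out


def traditional_denoise(spectrum, noise_psd, speech_prob):
    """Return the initial noise reduction frame S1 and the updated noise estimate."""
    noise_psd = update_noise_psd(noise_psd, spectrum, speech_prob)
    return suppress_noise(spectrum, noise_psd), noise_psd
```

The point of the sketch is the role of the probability vector: a more accurate speech presence probability keeps speech energy out of the noise estimate, which is why feeding the fused result into this stage can protect speech better.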
- Step 207 Extract features of preset feature dimensions on the initial noise-reduced speech to obtain the target input signal.
- S1 serves as the input signal of the preset speech noise reduction network model, which can be a signal in the frequency domain, time domain or other dimensional domain.
- before the preset speech noise reduction network model, there may be an explicit feature extraction step, such as pitch frequency feature extraction; the extracted feature information is recorded as the target input signal S2.
- Step 208 Input the target input signal and/or the initial noise reduction audio frame to the preset speech noise reduction network model to output the target noise reduction audio frame and the model activity detection result corresponding to the current audio frame.
- S1 or S2 can be used as the model input, or both S1 and S2 can be used as the model input, and input into the preset speech noise reduction network model for inference calculation to obtain the output signal.
- the output signal contains two parts. The first part is the final denoised speech output S3 of the speech denoising method, and the second part is the VAD output PF [256] of the model, which is used by the traditional speech denoising algorithm when processing the next audio frame.
- Step 209 Determine whether there is an original audio frame to be processed. If so, return to step 201; otherwise, end the process.
- the process can return to step 201 to continue the denoising process.
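The control flow of steps 201 to 209 can be summarized in a short sketch. The four callables (`algorithm_vad`, `fuse`, `traditional_denoise`, `model`) are hypothetical stand-ins for the preset voice activity detection algorithm, the fusion rule, the preset traditional noise reduction algorithm, and the preset speech noise reduction network model; their signatures are assumptions made for this illustration.

```python
def denoise_stream(frames, algorithm_vad, fuse, traditional_denoise, model):
    """Run the frame loop, feeding the model VAD of frame t-1 into frame t."""
    prev_model_vad = None  # PF of the previous frame; None for the first frame
    outputs = []
    for s0 in frames:                            # step 201: next frame S0
        pf = algorithm_vad(s0)                   # step 202: algorithm result Pf
        if prev_model_vad is not None:           # step 203: previous frame exists?
            p = fuse(prev_model_vad, pf)         # step 204: fused target result P
        else:
            p = pf                               # step 206 path: use Pf directly
        s1 = traditional_denoise(s0, p)          # step 205 / 206: initial frame S1
        s3, prev_model_vad = model(s1)           # steps 207-208: S3 and new PF
        outputs.append(s3)
    return outputs                               # step 209: no frames left
```

Feature extraction (step 207) is assumed to happen inside the `model` callable here, which keeps the loop focused on the feedback path from the model to the traditional algorithm.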
- the speech noise reduction method provided by the embodiments of this application uses a preset speech noise reduction network model based on artificial intelligence to provide information feedback to the traditional noise reduction algorithm, so that the traditional noise reduction algorithm can obtain more VAD information.
- Both the VAD estimation in traditional noise reduction and the AI noise reduction operate at the frequency-point level, which yields more accurate noise estimation, so that traditional noise reduction algorithms can better protect speech, eliminate more noise, improve the output signal-to-noise ratio of traditional noise reduction, and achieve high
- the input of the preset speech denoising network model can be enriched, which reduces the possibility of the preset speech denoising network model processing bad data and at the same time improves the speech noise reduction performance of the model.
- Figure 4 is a schematic flowchart of a model training method provided by an embodiment of the present application.
- Figure 5 is a schematic diagram of the training process of a model training method provided by an embodiment of the present application.
- the embodiment of the present application can be understood in conjunction with Figures 4 and 5.
- This embodiment can be applied to training a speech noise reduction network model based on artificial intelligence.
- the model can be applied to various scenarios such as voice calls, audio and video live broadcasts, and multi-person conferences.
- the method can be executed by a model training device, which can be implemented in the form of hardware and/or software, and which can be configured in electronic equipment such as model training equipment.
- the electronic device may be a mobile device such as a mobile phone, a smart watch, a tablet computer, or a personal digital assistant; it may also be a desktop computer or another device.
- the speech noise reduction network model trained using the embodiments of this application can be applied to the speech noise reduction method provided by any embodiment of this application.
- the method includes:
- Step 401 Use the preset voice activity detection algorithm to detect the current sample audio frame to obtain the corresponding sample algorithm activity detection result, wherein the current sample audio frame is associated with an activity detection tag and a pure audio frame.
- a pure (clean) speech data set and a noise data set can be mixed into noisy speech data according to a preset mixing rule.
- the preset mixing rule can be set based on, for example, the signal-to-noise ratio or the room impulse response (RIR).
- the mixed noisy speech data set and the pure speech data set are used as a training set for the model.
- the current sample audio frame can be an audio frame in the training set.
- the current sample audio frame can carry an activity detection label, which can be added through manual annotation.
- taking the frame level as an example, if the audio frame contains speech, the label can be 1, and if it does not contain speech, the label can be 0; taking the frequency-point level as an example, the label can be a vector containing a preset number of elements, and the value of each element is 1 or 0: if the corresponding frequency point contains speech, the value is 1, and if the corresponding frequency point does not contain speech, the value is 0.
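The construction of noisy training data from a pure speech data set and a noise data set, as described for step 401, can be illustrated for the signal-to-noise-ratio case. This is a minimal sketch: the function name `mix_at_snr` and the choice of scaling the noise (rather than the speech) are assumptions, and mixing based on a room impulse response is not covered.

```python
import math


def mix_at_snr(clean, noise, snr_db):
    """Scale the noise so that clean/noise power equals the target SNR, then add."""
    if len(clean) != len(noise):
        raise ValueError("clean and noise segments must have the same length")
    clean_power = sum(x * x for x in clean) / len(clean)
    noise_power = sum(x * x for x in noise) / len(noise)
    if clean_power == 0.0 or noise_power == 0.0:
        raise ValueError("both segments must have non-zero power")
    scale = math.sqrt(clean_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return [c + scale * n for c, n in zip(clean, noise)]
```

Each noisy segment produced this way stays paired with its clean source segment, which is what later allows the denoised output to be compared against the pure audio frame.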
- Step 402 Perform fusion processing on the sample model activity detection result corresponding to the previous sample audio frame and the sample algorithm activity detection result corresponding to the current sample audio frame to obtain the target sample activity detection result corresponding to the current sample audio frame, wherein the sample model activity detection result is output by the speech noise reduction network model.
- the activity detection result fusion process in this step can be similar to the fusion process in the speech noise reduction method provided by the embodiment of the present application.
- it can be frequency-point-level fusion or frame-level fusion, etc., and a similar preset calculation method can also be used to fuse the corresponding probability values. For specific details, please refer to the relevant content above, which will not be repeated here.
- Step 403 Perform noise estimation and noise elimination on the current sample audio frame based on the target sample activity detection result to obtain an initial denoised sample audio frame.
- Step 404 Input the initial noise reduction sample audio frame to the speech noise reduction network model to output the target sample noise reduction audio frame and the sample model activity detection result corresponding to the current sample audio frame.
- Step 405 Determine a first loss relationship based on the target sample noise-reduced audio frame and the pure audio frame, determine a second loss relationship based on the sample model activity detection result and the activity detection label, and train the speech noise reduction network model based on the first loss relationship and the second loss relationship.
- the loss relationship can be used to characterize the difference between two types of data, which can be represented by a loss value. For example, it can be calculated using a loss function.
- the first loss relationship is used to characterize the difference between the target sample noise-reduced audio frame and the pure audio frame, and the second loss relationship is used to characterize the difference between the sample model activity detection result and the activity detection label.
- the function types of the first loss function used to calculate the first loss relationship and of the second loss function used to calculate the second loss relationship can be set according to actual needs.
- the target loss relationship may be calculated based on the first loss relationship and the second loss relationship, and the calculation method may be, for example, weighted summation.
- the speech noise reduction network model is trained according to the target loss relationship.
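The weighted summation of the two loss relationships can be sketched as follows. The source leaves the loss function types open, so mean squared error for the first loss and binary cross-entropy for the second loss are assumptions made for this example, as are the weights `w_denoise` and `w_vad` and the function names.

```python
import math


def mse_loss(pred, target):
    """First loss: difference between the denoised frame and the pure frame."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def bce_loss(probs, labels, eps=1e-7):
    """Second loss: difference between model VAD probabilities and 0/1 labels."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(probs)


def target_loss(denoised, clean, model_vad, vad_labels, w_denoise=1.0, w_vad=0.5):
    """Target loss as a weighted sum of the first and second losses."""
    first = mse_loss(denoised, clean)
    second = bce_loss(model_vad, vad_labels)
    return w_denoise * first + w_vad * second
```

Training both outputs under a single target loss is what lets the activity detection output improve together with the denoised output, since the model is optimized for both at once.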
- the weight parameters in the speech noise reduction network model can be continuously optimized using training methods such as backpropagation, with the goal of minimizing the target loss value, until the preset training cutoff condition is met.
- the training cutoff condition can be set according to actual needs, for example, it can be set based on the number of iterations, the degree of convergence of the loss value, or the accuracy of the model.
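A training cutoff condition that combines an iteration limit with a convergence check on the loss value might look like the sketch below. The parameters `max_iterations`, `patience`, and `min_delta`, and the specific convergence rule, are assumptions for illustration; the source only states that the condition can be based on the number of iterations, the degree of convergence of the loss value, or the accuracy of the model.

```python
def should_stop(loss_history, max_iterations=1000, patience=3, min_delta=1e-4):
    """Stop at the iteration limit, or when recent losses stop improving."""
    if len(loss_history) >= max_iterations:
        return True
    if len(loss_history) <= patience:
        return False  # not enough history to judge convergence
    recent_best = min(loss_history[-patience:])
    earlier_best = min(loss_history[:-patience])
    # Converged if the last `patience` iterations improved by less than min_delta.
    return earlier_best - recent_best < min_delta
```

A training loop would append the target loss value after each iteration and call `should_stop` on the accumulated history.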
- the model training method provided by the embodiment of the present application treats the traditional noise reduction algorithm and the speech noise reduction network model as a whole during the training process, which can avoid the risk of data mismatch that arises when a traditional noise reduction algorithm is cascaded with a separately trained speech noise reduction network model. The model obtained after training can be used for speech noise reduction and has better noise reduction capabilities for various noises, improving the noise reduction effect.
- the sample algorithm activity detection result includes a first sample probability value corresponding to the presence of speech in the sample audio frame
- the sample model activity detection result includes a second sample probability value corresponding to the presence of speech in the sample audio frame
- fusing the sample model activity detection result corresponding to the previous sample audio frame and the sample algorithm activity detection result corresponding to the current sample audio frame to obtain the target sample activity detection result corresponding to the current sample audio frame includes: using a preset calculation method to calculate the second sample probability value in the sample model activity detection result corresponding to the previous sample audio frame and the first sample probability value in the sample algorithm activity detection result corresponding to the current sample audio frame to obtain a third sample probability value, and determining the target sample activity detection result corresponding to the current sample audio frame according to the third sample probability value.
- the sample algorithm activity detection result includes a fourth sample probability value that speech is present at each frequency point among the preset number of frequency points in the corresponding audio frame;
- the sample model activity detection result includes a fifth sample probability value that speech is present at each frequency point among the preset number of frequency points in the corresponding audio frame.
- fusing the sample model activity detection result corresponding to the previous sample audio frame and the sample algorithm activity detection result corresponding to the current sample audio frame to obtain the target sample activity detection result corresponding to the current sample audio frame includes: for each frequency point in the preset number of frequency points, using a preset calculation method to calculate the fifth sample probability value of a single frequency point in the sample model activity detection result corresponding to the previous sample audio frame and the fourth sample probability value of the same frequency point in the sample algorithm activity detection result corresponding to the current sample audio frame to obtain a sixth sample probability value; and determining the target sample activity detection result corresponding to the current sample audio frame according to the preset number of sixth sample probability values.
- inputting the initial noise reduction sample audio frame to the speech noise reduction network model includes: performing feature extraction of preset feature dimensions on the initial noise reduction sample audio frame to obtain a target input signal;
- the target input signal is input to the speech noise reduction network model, or the target input signal and the initial noise reduction sample audio frame are input to the speech noise reduction network model.
- Figure 6 is a structural block diagram of a voice noise reduction device provided by an embodiment of the present application.
- the device can be implemented by software and/or hardware, and can generally be integrated in electronic equipment such as voice noise reduction equipment. It can perform voice noise reduction by executing a voice noise reduction method.
- the device includes:
- the voice activity detection module 601 is configured to use a preset voice activity detection algorithm to detect the current audio frame to be processed, and obtain the corresponding algorithm activity detection result;
- the detection result fusion module 602 is configured to fuse the model activity detection result corresponding to the previous audio frame and the algorithm activity detection result corresponding to the current audio frame to obtain the target activity detection result corresponding to the current audio frame, wherein,
- the model activity detection result is output by a preset speech noise reduction network model;
- the noise reduction processing module 603 is configured to perform noise estimation and noise elimination on the current audio frame based on the target activity detection result to obtain an initial noise reduction audio frame;
- the model input module 604 is configured to input the initial noise reduction audio frame to the preset speech noise reduction network model to output the target noise reduction audio frame and the model activity detection result corresponding to the current audio frame.
- the voice noise reduction device uses a preset voice activity detection algorithm to detect the current audio frame to be processed and obtains the corresponding algorithm activity detection result; fuses the model activity detection result corresponding to the previous audio frame with the algorithm activity detection result corresponding to the current audio frame to obtain the target activity detection result corresponding to the current audio frame, where the model activity detection result is output by the preset speech noise reduction network model; performs noise estimation and noise elimination on the current audio frame based on the target activity detection result to obtain the initial noise-reduction audio frame; and inputs the initial noise-reduction audio frame to the preset speech noise reduction network model to output the target noise-reduction audio frame and the model activity detection result corresponding to the current audio frame.
- because the preset speech noise reduction network model can output model activity detection results, the model activity detection result of the previous audio frame can be combined with the algorithm activity detection result obtained by the traditional speech noise reduction algorithm, so that the traditional noise reduction algorithm obtains more activity detection information and determines the voice activity detection result more reasonably and accurately.
- on this basis, noise estimation and noise elimination can better protect the speech and eliminate more noise, yielding a traditional noise reduction result with a higher signal-to-noise ratio; using that result as the input of the preset speech noise reduction network model yields better noise-reduced audio frames and reduces the likelihood that the preset speech noise reduction network model has to process harsh data.
- Traditional noise reduction algorithms and AI noise reduction methods promote each other, have better noise reduction capabilities for various noises, and improve the overall stability and robustness of the solution.
- the algorithm activity detection result includes a first probability value corresponding to the presence of speech in the audio frame
- the model activity detection result includes a second probability value corresponding to the existence of speech in the audio frame
- the detection result fusion module 602 is configured to fuse the model activity detection result and the algorithm activity detection result in the following manner to obtain the target activity detection result corresponding to the current audio frame:
- the algorithm activity detection result includes the fourth probability value of the existence of speech in each of the preset number of frequency points in the corresponding audio frame;
- the model activity detection result includes the fifth probability value that speech is present at each frequency point among the preset number of frequency points in the corresponding audio frame;
- the detection result fusion module 602 is also configured to fuse the model activity detection result and the algorithm activity detection result in the following manner to obtain the target activity detection result corresponding to the current audio frame:
- for each frequency point in the preset number of frequency points, a preset calculation method is used to calculate the fifth probability value of a single frequency point in the model activity detection result corresponding to the previous audio frame and the fourth probability value of the same frequency point in the algorithm activity detection result corresponding to the current audio frame to obtain a sixth probability value; the target activity detection result corresponding to the current audio frame is determined based on the preset number of sixth probability values.
- the preset calculation method is one of taking the maximum value, taking the minimum value, calculating the average, calculating the sum, calculating the weighted sum, and calculating the weighted average.
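The frequency-point-level fusion with the listed preset calculation methods can be sketched as a single function. The function name, the method keys, and the two weight parameters are assumptions for this illustration; the source names the six calculation methods but does not give weights or say how values above 1 from a sum should be handled, so no clipping is applied here.

```python
def fuse_probabilities(model_probs, algorithm_probs, method="max",
                       model_weight=0.5, algorithm_weight=0.5):
    """Fuse per-frequency-point speech presence probabilities of two detectors."""
    if len(model_probs) != len(algorithm_probs):
        raise ValueError("both results must cover the same frequency points")
    wm, wa = model_weight, algorithm_weight
    rules = {
        "max": lambda m, a: max(m, a),
        "min": lambda m, a: min(m, a),
        "average": lambda m, a: (m + a) / 2.0,
        "sum": lambda m, a: m + a,                      # may exceed 1
        "weighted_sum": lambda m, a: wm * m + wa * a,   # may exceed 1
        "weighted_average": lambda m, a: (wm * m + wa * a) / (wm + wa),
    }
    if method not in rules:
        raise ValueError("unknown calculation method: " + method)
    rule = rules[method]
    return [rule(m, a) for m, a in zip(model_probs, algorithm_probs)]
```

Here `model_probs` plays the role of the fifth probability values from the previous audio frame and `algorithm_probs` the fourth probability values from the current audio frame; the returned list corresponds to the sixth probability values. Frame-level fusion is the special case of one-element lists.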
- model input module includes:
- a feature extraction unit configured to extract features of a preset feature dimension from the initial noise-reduced speech to obtain a target input signal
- a signal input unit configured to input the target input signal to the preset speech noise reduction network model, or to input the target input signal and the initial noise reduction audio frame to the preset speech noise reduction network model, to output the target noise reduction audio frame and the model activity detection result corresponding to the current audio frame.
- Figure 7 is a structural block diagram of a model training device provided by an embodiment of the present application.
- the device can be implemented by software and/or hardware, and can generally be integrated in electronic equipment such as model training equipment. It can perform model training by executing a model training method.
- the device includes:
- the voice detection module 701 is configured to use a preset voice activity detection algorithm to detect the current sample audio frame to be processed, and obtain the corresponding sample algorithm activity detection result, wherein the current sample audio frame is associated with an activity detection tag and a clean audio frame;
- the fusion module 702 is configured to perform fusion processing on the sample model activity detection result corresponding to the previous sample audio frame and the sample algorithm activity detection result corresponding to the current sample audio frame, to obtain the target sample activity detection result corresponding to the current sample audio frame, wherein the sample model activity detection result is output by the speech noise reduction network model;
- the noise elimination module 703 is configured to perform noise estimation and noise elimination on the current sample audio frame based on the target sample activity detection result to obtain an initial denoised sample audio frame;
- the network model input module 704 is configured to input the initial noise reduction sample audio frame to the speech noise reduction network model to output the target sample noise reduction audio frame and the sample model activity detection result corresponding to the current sample audio frame;
- the network model training module 705 is configured to determine a first loss relationship based on the target sample noise-reduced audio frame and the clean audio frame, and determine a second loss relationship based on the sample model activity detection result and the activity detection label, and The speech noise reduction network model is trained based on the first loss relationship and the second loss relationship.
- the model training device treats the traditional noise reduction algorithm and the speech noise reduction network model as a whole during the training process, which can avoid the risk of data mismatch that arises when a traditional noise reduction algorithm is cascaded with a separately trained speech noise reduction network model. The model obtained after training can be used for speech noise reduction and has better noise reduction capabilities for various noises, improving the noise reduction effect.
- FIG. 8 is a structural block diagram of an electronic device provided by an embodiment of the present application.
- the electronic device 800 includes a processor 801, and a memory 802 communicatively connected to the processor 801.
- the memory 802 stores a computer program that can be executed by the processor 801, and the computer program is executed by the processor 801 so that the processor 801 can execute the speech noise reduction method and/or model training method described in any embodiment of the present application.
- the number of processors may be one or more. In FIG. 8 , one processor is taken as an example.
- Embodiments of the present application also provide a computer-readable storage medium.
- the computer-readable storage medium stores a computer program.
- the computer program is used to enable the processor, when executing it, to implement the speech noise reduction method and/or model training method described in any embodiment of the present application.
- Embodiments of the present application also provide a computer program product.
- the computer program product includes a computer program. When executed by a processor, the computer program implements the speech noise reduction method and/or model training method as provided in the embodiments of the present application.
- the speech noise reduction device, model training device, electronic equipment, storage media and products provided in the above embodiments can execute the speech noise reduction method or model training method provided by the corresponding embodiments of the present application, and have the corresponding functional modules and beneficial effects for executing the method.
- the speech noise reduction method or model training method provided by any embodiment of this application.
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Signal Processing (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Computational Linguistics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Quality & Reliability (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Telephone Function (AREA)
Claims (11)
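- A speech noise reduction method, comprising: detecting a current audio frame to be processed by using a preset voice activity detection algorithm to obtain a corresponding algorithm activity detection result; performing fusion processing on a model activity detection result corresponding to a previous audio frame and the algorithm activity detection result corresponding to the current audio frame to obtain a target activity detection result corresponding to the current audio frame, wherein the model activity detection result is output by a preset speech noise reduction network model; performing noise estimation and noise elimination on the current audio frame based on the target activity detection result to obtain an initial noise reduction audio frame; and inputting the initial noise reduction audio frame to the preset speech noise reduction network model to output a target noise reduction audio frame and a model activity detection result corresponding to the current audio frame.
- The method according to claim 1, wherein the algorithm activity detection result includes a first probability value that speech is present in the corresponding audio frame, and the model activity detection result includes a second probability value that speech is present in the corresponding audio frame; and the performing fusion processing on the model activity detection result corresponding to the previous audio frame and the algorithm activity detection result corresponding to the current audio frame to obtain the target activity detection result corresponding to the current audio frame comprises: calculating, by using a preset calculation method, the second probability value in the model activity detection result corresponding to the previous audio frame and the first probability value in the algorithm activity detection result corresponding to the current audio frame to obtain a third probability value, and determining the target activity detection result corresponding to the current audio frame according to the third probability value.
- The method according to claim 1, wherein the algorithm activity detection result includes a fourth probability value that speech is present at each frequency point among a preset number of frequency points in the corresponding audio frame; the model activity detection result includes a fifth probability value that speech is present at each frequency point among the preset number of frequency points in the corresponding audio frame; and the performing fusion processing on the model activity detection result corresponding to the previous audio frame and the algorithm activity detection result corresponding to the current audio frame to obtain the target activity detection result corresponding to the current audio frame comprises: for each frequency point among the preset number of frequency points, calculating, by using a preset calculation method, the fifth probability value of a single frequency point in the model activity detection result corresponding to the previous audio frame and the fourth probability value of the corresponding single frequency point in the algorithm activity detection result corresponding to the current audio frame to obtain a sixth probability value; and determining the target activity detection result corresponding to the current audio frame according to the preset number of sixth probability values.
- The method according to claim 2 or 3, wherein the preset calculation method is one of taking the maximum value, taking the minimum value, calculating the average, calculating the sum, calculating the weighted sum, and calculating the weighted average.
- The method according to claim 1, wherein the inputting the initial noise reduction audio frame to the preset speech noise reduction network model comprises: performing feature extraction of a preset feature dimension on the initial noise reduction audio frame to obtain a target input signal; and inputting the target input signal to the preset speech noise reduction network model, or inputting the target input signal and the initial noise reduction audio frame to the preset speech noise reduction network model.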
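- A model training method, comprising: detecting a current sample audio frame by using a preset voice activity detection algorithm to obtain a corresponding sample algorithm activity detection result, wherein the current sample audio frame is associated with an activity detection label and a pure audio frame; performing fusion processing on a sample model activity detection result corresponding to a previous sample audio frame and the sample algorithm activity detection result corresponding to the current sample audio frame to obtain a target sample activity detection result corresponding to the current sample audio frame, wherein the sample model activity detection result is output by a speech noise reduction network model; performing noise estimation and noise elimination on the current sample audio frame based on the target sample activity detection result to obtain an initial noise reduction sample audio frame; inputting the initial noise reduction sample audio frame to the speech noise reduction network model to output a target sample noise reduction audio frame and a sample model activity detection result corresponding to the current sample audio frame; and determining a first loss relationship according to the target sample noise reduction audio frame and the pure audio frame, determining a second loss relationship according to the sample model activity detection result and the activity detection label, and training the speech noise reduction network model based on the first loss relationship and the second loss relationship.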
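- A speech noise reduction device, comprising: a voice activity detection module, configured to detect a current audio frame to be processed by using a preset voice activity detection algorithm to obtain a corresponding algorithm activity detection result; a detection result fusion module, configured to perform fusion processing on a model activity detection result corresponding to a previous audio frame and the algorithm activity detection result corresponding to the current audio frame to obtain a target activity detection result corresponding to the current audio frame, wherein the model activity detection result is output by a preset speech noise reduction network model; a noise reduction processing module, configured to perform noise estimation and noise elimination on the current audio frame based on the target activity detection result to obtain an initial noise reduction audio frame; and a model input module, configured to input the initial noise reduction audio frame to the preset speech noise reduction network model to output a target noise reduction audio frame and a model activity detection result corresponding to the current audio frame.
- A model training device, comprising: a voice detection module, configured to detect a current sample audio frame to be processed by using a preset voice activity detection algorithm to obtain a corresponding sample algorithm activity detection result, wherein the current sample audio frame is associated with an activity detection label and a clean audio frame; a fusion module, configured to perform fusion processing on a sample model activity detection result corresponding to a previous sample audio frame and the sample algorithm activity detection result corresponding to the current sample audio frame to obtain a target sample activity detection result corresponding to the current sample audio frame, wherein the sample model activity detection result is output by a speech noise reduction network model; a noise elimination module, configured to perform noise estimation and noise elimination on the current sample audio frame based on the target sample activity detection result to obtain an initial noise reduction sample audio frame; a network model input module, configured to input the initial noise reduction sample audio frame to the speech noise reduction network model to output a target sample noise reduction audio frame and a sample model activity detection result corresponding to the current sample audio frame; and a network model training module, configured to determine a first loss relationship according to the target sample noise reduction audio frame and the clean audio frame, determine a second loss relationship according to the sample model activity detection result and the activity detection label, and train the speech noise reduction network model based on the first loss relationship and the second loss relationship.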
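- An electronic device, comprising: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores a computer program executable by the at least one processor, and the computer program is executed by the at least one processor so that the at least one processor can execute the speech noise reduction method according to any one of claims 1-5 and/or the model training method according to claim 6.
- A computer-readable storage medium, storing a computer program, wherein the computer program is used to cause a processor, when executing it, to implement the speech noise reduction method according to any one of claims 1-5 and/or the model training method according to claim 6.
- A computer program product, comprising a computer program, wherein the computer program, when executed by a processor, implements the speech noise reduction method according to any one of claims 1-5 and/or the model training method according to claim 6.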
Priority Applications (3)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| EP23842175.4A EP4535352A4 (en) | 2022-07-21 | 2023-07-12 | VOICE NOISE REDUCTION METHOD, MODEL TRAINING METHOD, APPARATUS, DEVICE, SUPPORT AND PRODUCT |
| US18/880,052 US20250166650A1 (en) | 2022-07-21 | 2023-07-12 | Method for reducing voice noise, method for training model, and device |
| JP2025503141A JP2025523704A (ja) | 2022-07-21 | 2023-07-12 | Speech noise reduction method, model training method, apparatus, device, and medium |
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202210864010.4A CN115273880B (zh) | 2022-07-21 | 2022-07-21 | Speech noise reduction method, model training method, apparatus, device, medium and product |
| CN202210864010.4 | 2022-07-21 |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2024017110A1 (zh) | 2024-01-25 |
Family
ID=83767239
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/CN2023/106951 Ceased WO2024017110A1 (zh) | Speech noise reduction method, model training method, apparatus, device, medium and product | 2022-07-21 | 2023-07-12 |
Country Status (5)
| Country | Link |
|---|---|
| US (1) | US20250166650A1 (zh) |
| EP (1) | EP4535352A4 (zh) |
| JP (1) | JP2025523704A (zh) |
| CN (1) | CN115273880B (zh) |
| WO (1) | WO2024017110A1 (zh) |
Families Citing this family (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
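| CN115273880B (zh) * | 2022-07-21 | 2025-10-03 | 百果园技术(新加坡)有限公司 | Speech noise reduction method, model training method, apparatus, device, medium and product |
| CN116469402B (zh) * | 2023-04-23 | 2026-04-24 | 百果园技术(新加坡)有限公司 | Audio noise reduction method, apparatus, device, storage medium and product |
| CN120089160B (zh) * | 2025-04-27 | 2025-08-01 | 苏州大学 | Non-destructive pipeline risk level detection method based on audio processing |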
Citations (7)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2017218386A1 (en) * | 2016-06-13 | 2017-12-21 | Med-El Elektromedizinische Geraete Gmbh | Recursive noise power estimation with noise model adaptation |
| CN108428456A (zh) * | 2018-03-29 | 2018-08-21 | 浙江凯池电子科技有限公司 | Speech noise reduction algorithm |
| US20200286501A1 (en) * | 2017-10-12 | 2020-09-10 | Huawei Technologies Co., Ltd. | Apparatus and a method for signal enhancement |
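| CN114255778A (zh) * | 2021-12-21 | 2022-03-29 | 广州欢城文化传媒有限公司 | Audio stream noise reduction method, apparatus, device and storage medium |
| CN114495969A (zh) * | 2022-01-20 | 2022-05-13 | 南京烽火天地通信科技有限公司 | Speech recognition method incorporating speech enhancement |
| CN114596870A (zh) * | 2022-03-07 | 2022-06-07 | 广州博冠信息科技有限公司 | Real-time audio processing method and apparatus, computer storage medium, and electronic device |
| CN115273880A (zh) * | 2022-07-21 | 2022-11-01 | 百果园技术(新加坡)有限公司 | Speech noise reduction method, model training method, apparatus, device, medium and product |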
Family Cites Families (17)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US6415253B1 (en) * | 1998-02-20 | 2002-07-02 | Meta-C Corporation | Method and apparatus for enhancing noise-corrupted speech |
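| JP2003316380A (ja) * | 2002-04-19 | 2003-11-07 | Sony Corp | Noise reduction system for processing at a stage before signal processing of sound including conversation |
| KR20140031790A (ko) * | 2012-09-05 | 2014-03-13 | 삼성전자주식회사 | Method and apparatus for robust speech segment detection in a noisy environment |
| CN103700375B (zh) * | 2013-12-28 | 2016-06-15 | 珠海全志科技股份有限公司 | Speech noise reduction method and apparatus thereof |
| CN104810024A (zh) * | 2014-01-28 | 2015-07-29 | 上海力声特医学科技有限公司 | Dual-microphone speech noise reduction processing method and system |
| CN104064196B (zh) * | 2014-06-20 | 2017-08-01 | 哈尔滨工业大学深圳研究生院 | Method for improving speech recognition accuracy based on speech front-end noise elimination |
| TWI759591B (zh) * | 2019-04-01 | 2022-04-01 | 威聯通科技股份有限公司 | Speech enhancement method and system |
| CN111508513B (zh) * | 2020-03-30 | 2024-04-09 | 广州酷狗计算机科技有限公司 | Audio processing method and apparatus, and computer storage medium |
| CN111554314B (zh) * | 2020-05-15 | 2024-08-16 | 腾讯科技(深圳)有限公司 | Noise detection method, apparatus, terminal and storage medium |
| CN113744732B (zh) * | 2020-05-28 | 2024-11-05 | 阿里巴巴集团控股有限公司 | Device wake-up related method and apparatus, and story machine |
| CN112435683B (zh) * | 2020-07-30 | 2023-12-01 | 珠海市杰理科技股份有限公司 | Adaptive noise estimation and speech noise reduction method based on a T-S fuzzy neural network |
| CN114333884B (zh) * | 2020-09-30 | 2024-05-03 | 北京君正集成电路股份有限公司 | Speech noise reduction method based on a microphone array combined with a wake-up word |
| CN113284517B (zh) * | 2021-02-03 | 2022-04-01 | 珠海市杰理科技股份有限公司 | Voice endpoint detection method, circuit, audio processing chip and audio device |
| CN112908352B (zh) * | 2021-03-01 | 2024-04-16 | 百果园技术(新加坡)有限公司 | Audio denoising method and apparatus, electronic device and storage medium |
| CN113744725B (zh) * | 2021-08-19 | 2024-07-05 | 清华大学苏州汽车研究院(相城) | Training method for a voice endpoint detection model, and speech noise reduction method |
| CN113744752A (zh) * | 2021-08-30 | 2021-12-03 | 西安声必捷信息科技有限公司 | Speech processing method and apparatus |
| CN114464168B (zh) * | 2022-03-07 | 2025-01-28 | 云知声智能科技股份有限公司 | Training method for a speech processing model, and noise reduction method and apparatus for speech data |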

2022
- 2022-07-21: CN application CN202210864010.4A, published as CN115273880B (zh), status: Active

2023
- 2023-07-12: WO application PCT/CN2023/106951, published as WO2024017110A1 (zh), status: Ceased
- 2023-07-12: JP application JP2025503141A, published as JP2025523704A (ja), status: Pending
- 2023-07-12: EP application EP23842175.4A, published as EP4535352A4 (en), status: Pending
- 2023-07-12: US application US18/880,052, published as US20250166650A1 (en), status: Pending

Patent Citations (7)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2017218386A1 (en) * | 2016-06-13 | 2017-12-21 | Med-El Elektromedizinische Geraete Gmbh | Recursive noise power estimation with noise model adaptation |
| US20200286501A1 (en) * | 2017-10-12 | 2020-09-10 | Huawei Technologies Co., Ltd. | Apparatus and a method for signal enhancement |
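| CN108428456A (zh) * | 2018-03-29 | 2018-08-21 | 浙江凯池电子科技有限公司 | Speech noise reduction algorithm |
| CN114255778A (zh) * | 2021-12-21 | 2022-03-29 | 广州欢城文化传媒有限公司 | Audio stream noise reduction method, apparatus, device, and storage medium |
| CN114495969A (zh) * | 2022-01-20 | 2022-05-13 | 南京烽火天地通信科技有限公司 | Speech recognition method incorporating speech enhancement |
| CN114596870A (zh) * | 2022-03-07 | 2022-06-07 | 广州博冠信息科技有限公司 | Real-time audio processing method and apparatus, computer storage medium, and electronic device |
| CN115273880A (zh) * | 2022-07-21 | 2022-11-01 | 百果园技术(新加坡)有限公司 | Speech noise reduction method, model training method, apparatus, device, medium, and product |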
Non-Patent Citations (1)
| Title |
|---|
| See also references of EP4535352A4 * |
Also Published As
| Publication number | Publication date |
|---|---|
| EP4535352A4 (en) | 2026-03-25 |
| CN115273880A (zh) | 2022-11-01 |
| JP2025523704A (ja) | 2025-07-23 |
| CN115273880B (zh) | 2025-10-03 |
| US20250166650A1 (en) | 2025-05-22 |
| EP4535352A1 (en) | 2025-04-09 |
Similar Documents
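| Publication | Publication Date | Title |
|---|---|---|
| Li et al. | | Two heads are better than one: A two-stage complex spectral mapping approach for monaural speech enhancement |
| Lin et al. | | Speech enhancement using multi-stage self-attentive temporal convolutional networks |
| CN106486131B (zh) | | Speech denoising method and apparatus |
| WO2024017110A1 (zh) | | Speech noise reduction method, model training method, apparatus, device, medium, and product |
| CN112004177B (zh) | | Howling detection method, microphone volume adjustment method, and storage medium |
| CN111785288B (zh) | | Speech enhancement method, apparatus, device, and storage medium |
| US20200396329A1 (en) | | Acoustic echo cancellation based sub band domain active speaker detection for audio and video conferencing applications |
| WO2022141868A1 (zh) | | Method, apparatus, terminal, and storage medium for extracting speech features |
| CN112949708A (zh) | | Emotion recognition method, apparatus, computer device, and storage medium |
| CN118899005B (zh) | | Audio signal processing method, apparatus, computer device, and storage medium |
| CN112602150A (zh) | | Noise estimation method, noise estimation apparatus, speech processing chip, and electronic device |
| EP4456064A1 (en) | | Audio data processing method and apparatus, device, storage medium, and program product |
| EP4553827B1 (en) | | Intent recognition method, apparatus, storage medium and computer device |
| CN112309417A (zh) | | Audio signal processing method, apparatus, system, and readable medium for wind noise suppression |
| CN115083440B (zh) | | Audio signal noise reduction method, electronic device, and storage medium |
| Hidayat et al. | | A modified MFCC for improved wavelet-based denoising on robust speech recognition |
| CN111800552A (zh) | | Audio output processing method, apparatus, system, and electronic device |
| CN112331187B (zh) | | Multi-task speech recognition model training method and multi-task speech recognition method |
| CN114827833B (zh) | | Howling suppression method, apparatus, chip, and electronic device |
| CN111276132A (zh) | | Speech processing method, electronic device, and computer-readable storage medium |
| CN115440240A (zh) | | Training method for speech noise reduction, speech noise reduction system, and speech noise reduction method |
| CN115910095B (zh) | | Speech enhancement method, apparatus, computer device, and storage medium |
| CN110838307A (zh) | | Voice message processing method and apparatus |
| CN114049887B (zh) | | Real-time voice activity detection method and system for audio and video conferencing |
| Agrawal et al. | | Performance analysis of speech enhancement using spectral gating with U-Net |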
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 23842175; Country of ref document: EP; Kind code of ref document: A1 |
| | WWE | Wipo information: entry into national phase | Ref document number: 18880052; Country of ref document: US |
| | WWE | Wipo information: entry into national phase | Ref document number: 2023842175; Country of ref document: EP |
| | ENP | Entry into the national phase | Ref document number: 2023842175; Country of ref document: EP; Effective date: 20250102 |
| | WWE | Wipo information: entry into national phase | Ref document number: 2025503141; Country of ref document: JP |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | WWE | Wipo information: entry into national phase | Ref document number: 11202500054U; Country of ref document: SG |
| | WWP | Wipo information: published in national office | Ref document number: 11202500054U; Country of ref document: SG |
| | WWP | Wipo information: published in national office | Ref document number: 2023842175; Country of ref document: EP |
| | WWP | Wipo information: published in national office | Ref document number: 18880052; Country of ref document: US |