CN111192599A - Method and device for noise reduction - Google Patents

Method and device for noise reduction

Info

Publication number
CN111192599A
Authority
CN
China
Prior art keywords
noise reduction
model
voice
terminal device
voice information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811352262.9A
Other languages
Chinese (zh)
Other versions
CN111192599B (en)
Inventor
宋钦梅
方华
袁其政
屈跃强
程宝平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Communications Group Co Ltd
China Mobile Hangzhou Information Technology Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
China Mobile Hangzhou Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd, China Mobile Hangzhou Information Technology Co Ltd
Priority to CN201811352262.9A
Publication of CN111192599A
Application granted
Publication of CN111192599B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L21/0224 Processing in the time domain
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L21/0232 Processing in the frequency domain
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M1/00 Substation equipment, e.g. for use by subscribers
    • H04M1/02 Constructional features of telephone sets
    • H04M1/19 Arrangements of transmitters, receivers, or complete sets to prevent eavesdropping, to attenuate local noise or to prevent undesired transmission; Mouthpieces or receivers specially adapted therefor
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M3/00 Automatic or semi-automatic exchanges
    • H04M3/18 Automatic or semi-automatic exchanges with means for reducing interference or noise; with means for reducing effects due to line faults with means for protecting lines

Landscapes

  • Engineering & Computer Science (AREA)
  • Signal Processing (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Telephone Function (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The embodiment of the application discloses a noise reduction method and a noise reduction device. In the method, a server trains the time domain characteristic value and the target value of training data using a deep learning training model constructed based on the frequency domain, and sends the obtained model parameters to a terminal device; after receiving the model parameters, the terminal device updates the parameters of a first voice noise reduction model and uses the updated first voice noise reduction model to perform noise reduction processing on voice information input by a user. Because the parameters of the first voice noise reduction model in the terminal device are updated with model parameters obtained by training the deep learning training model, the voice information obtained by the noise reduction processing can be more accurate, which improves the user experience.

Description

Noise reduction method and device
Technical Field
The present application relates to the field of communications technologies, and in particular, to a noise reduction method and apparatus.
Background
In real life, voice information sent by a user usually contains noise, such as wind sound, car sound, and machine operation sound in the environment. During the conversation process of the user using the voice device, the noises may affect the conversation quality of the user, so that the user experience is not good. For example, a user a and a user B communicate with each other through a terminal device (e.g., a mobile phone), and if noise included in voice information sent by the user a through the mobile phone a is large, the user B may not normally obtain the voice information of the user a through the mobile phone B, for example, the obtained voice information is not clear enough, or the voice information sent by the user a is not obtained.
Therefore, a noise reduction method is needed to solve the technical problem of low communication quality between users due to the existence of noise.
Disclosure of Invention
The embodiment of the application provides a noise reduction method and a noise reduction device, which are used for solving the technical problem of low call quality among users due to the existence of noise.
The noise reduction method provided by the embodiment of the application comprises the following steps:
the method comprises the steps that a server obtains training data, wherein the training data comprise first voice information collected in a first environment and second voice information collected in a second environment, noise in the first environment is smaller than or equal to a preset threshold, and noise in the second environment is larger than the preset threshold;
the server determines a time domain characteristic value and a target value of the training data according to the training data; the time domain characteristic value of the training data comprises one or more of a noise threshold value, a long-time energy value, a short-time energy value and a noise envelope tracking value; the target value of the training data comprises a voice activity detection value of the first voice information and/or a full-band signal-to-noise ratio of the second voice information;
the server trains the time domain characteristic value and the target value based on a deep learning training model constructed by a frequency domain to obtain a model parameter, and sends the model parameter to the terminal equipment; and the model parameters are used for the terminal equipment to perform noise reduction processing on the voice information input by the user.
The noise reduction method provided by the embodiment of the application comprises the following steps:
the terminal equipment receives the model parameters sent by the server;
the terminal equipment updates the parameters of the first voice noise reduction model in the terminal equipment according to the model parameters to obtain an updated first voice noise reduction model;
and after receiving the voice information input by the user, the terminal equipment uses the updated first voice noise reduction model to perform noise reduction processing on the voice information.
Optionally, after the terminal device receives the model parameter sent by the server, the method further includes:
the terminal equipment updates a preset mark into a first indicated value;
after the terminal device obtains the updated first speech noise reduction model, the method further includes:
and the terminal equipment updates the preset mark into a second indication value.
Optionally, after receiving the voice information input by the user and before performing noise reduction processing on the voice information by using the updated first voice noise reduction model, the method further includes:
and the terminal equipment determines that the preset mark is the second indication value.
Optionally, the method further comprises:
after the terminal equipment receives voice information input by a user, if the preset mark is determined to be the first indicated value, a second voice noise reduction model in the terminal equipment is used for carrying out noise reduction processing on the voice information; the second voice noise reduction model is a standby model of the first voice noise reduction model.
An embodiment of the present application provides a server, including:
an acquisition module, configured to acquire training data, wherein the training data comprises first voice information collected in a first environment and second voice information collected in a second environment, the noise in the first environment is less than or equal to a preset threshold, and the noise in the second environment is greater than the preset threshold;
the determining module is used for determining a time domain characteristic value and a target value of the training data according to the training data, wherein the time domain characteristic value of the training data comprises one or more of a noise threshold value, a long-term energy value, a short-term energy value and a noise envelope tracking value; the target value of the training data comprises a voice activity detection value of the first voice information and/or a full-band signal-to-noise ratio of the second voice information;
and the processing module is used for training the time domain characteristic value and the target value based on a deep learning training model constructed in a frequency domain to obtain a model parameter, and sending the model parameter to the terminal equipment so that the terminal equipment can use the model parameter to perform noise reduction processing on the voice information input by the user.
The embodiment of the present application provides a terminal device, which includes:
the receiving and sending module is used for receiving the model parameters sent by the server;
the updating module is used for updating the parameters of the first voice noise reduction model in the terminal equipment according to the model parameters to obtain an updated first voice noise reduction model;
and the noise reduction module is used for performing noise reduction processing on the voice information by using the updated first voice noise reduction model after receiving the voice information input by the user.
Optionally, after the transceiver module receives the model parameters sent by the server, the update module is further configured to: update the preset mark into a first indicated value; and after the updated first voice noise reduction model is obtained, update the preset mark into a second indication value.
Optionally, the noise reduction module is further configured to:
and determining the preset mark as the second indication value.
Optionally, the noise reduction module is further configured to:
after receiving voice information input by a user, if the preset mark is determined to be the first indicated value, performing noise reduction processing on the voice information by using a second voice noise reduction model in the terminal equipment; the second voice noise reduction model is a standby model of the first voice noise reduction model.
In the above embodiments of the application, the server takes the first voice information collected in the first environment (noise less than or equal to the preset threshold) and the second voice information collected in the second environment (noise greater than the preset threshold) as training data. After determining the time domain characteristic value and the target value of the training data, the server trains the time domain characteristic value and the target value based on the deep learning training model constructed in the frequency domain to obtain the model parameters, and sends the model parameters to the terminal device. After receiving the model parameters sent by the server, the terminal device can update the parameters of the first voice noise reduction model in the terminal device and perform noise reduction processing on the voice information input by the user by using the updated first voice noise reduction model. Because the time domain characteristic value (which reflects time domain characteristics) and the target value of the training data are trained with a deep learning training model constructed based on the frequency domain (which reflects frequency domain characteristics), the time domain and frequency domain characteristics of the voice information can be combined during model training, which improves the training performance of the deep learning training model and accelerates its training. Moreover, the training data can include voice information from various environments, and the deep learning training model can be trained multiple times with different training data, so that more accurate model parameters can be obtained. Furthermore, the parameters of the first voice noise reduction model in the terminal device are updated with the model parameters obtained by training the deep learning training model, so that the terminal device can perform noise reduction processing on the voice information input by the user with those model parameters; the voice information obtained by the noise reduction processing can therefore be more accurate, which improves the user experience. In addition, training to obtain the model parameters and noise reduction by the terminal device on the user's voice information can proceed in parallel, which can improve the processing speed and efficiency of voice noise reduction.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without inventive effort.
Fig. 1 is a schematic diagram of a possible system architecture provided by an embodiment of the present application;
Fig. 2 is a schematic flowchart corresponding to a noise reduction method according to an embodiment of the present application;
Fig. 3 is a schematic structural diagram of a server provided in an embodiment of the present application;
Fig. 4 is a schematic structural diagram of a terminal device provided in an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application clearer, the present application will be described in further detail with reference to the accompanying drawings, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Fig. 1 is a possible system architecture provided in this embodiment, where the system architecture may include a network device 100, and one or more terminal devices (e.g., a terminal device 101 and a terminal device 102 illustrated in fig. 1), and the terminal device 101 and the terminal device 102 may communicate through the network device 100. The network device 100 may be a base station, such as an evolved base station (e.g., evolved NodeB) in a Long Term Evolution (LTE) communication system. The terminal device may be a mobile phone (mobile phone), a tablet computer (Pad), and the like, and is not limited specifically.
In this embodiment of the application, a user a and a user B may communicate through a terminal device, where the user a uses the terminal device 101, and the user B uses the terminal device 102, and one possible scenario is that the user a inputs voice information to the terminal device 101, the terminal device 101 transmits the voice information input by the user a to the terminal device 102 through the network device 100, and the terminal device 102 provides the received voice information to the user B after receiving the voice information.
In order to improve the call quality, one possible implementation manner (for convenience of description, implementation manner a) is to perform noise reduction processing on the voice information sent by the user by using a preset noise signal, that is, to filter the preset noise signal in the voice information sent by the user, so as to obtain the filtered voice information. By adopting the voice noise reduction method, the preset noise signals in the voice information sent by the user can be filtered. However, the preset noise signal used in this method is fixed, that is, the terminal device may use the preset noise signal to perform noise reduction processing on the voice information sent by all subsequent users without updating the model parameters in the terminal device. In some cases, the speech information obtained by performing noise reduction processing on the speech information of the user in various different environments by using the preset noise signal may be inaccurate.
Based on this, the embodiment of the present application provides a voice noise reduction method, so as to solve the technical problem that the quality of a call between users is low due to the existence of noise.
Fig. 2 is a schematic flowchart corresponding to a noise reduction method provided in an embodiment of the present application, where the method includes:
Step 201, the server acquires training data and determines a time domain characteristic value and a target value of the training data.
Here, the training data may include first voice information collected in a first environment and second voice information collected in a second environment. The noise in the first environment may be less than or equal to a preset threshold, that is, the noise in the first environment is small; the noise in the second environment may be greater than the preset threshold, that is, the noise in the second environment is large. The preset threshold may be set by a person skilled in the art according to experience, or may be determined through one or more experiments, which is not limited in the embodiment of the present application.
In the embodiment of the application, the training data can be acquired in various manners. In a possible implementation manner, a tester can carry a voice acquisition device to collect the first voice information and the second voice information in the first environment and the second environment respectively. For example, taking a preset threshold of -30 dB, if the application scenario is a certain room on a preset floor, the voice information uttered by the tester at a first position in the room where the noise is less than or equal to -30 dB (or where there is no noise) and captured by the voice acquisition device may be used as the first voice information, and the same voice information uttered by the tester at a second position in the room where the noise is greater than -30 dB and captured by the voice acquisition device may be used as the second voice information. For example, if the tester says "test the noise of the indoor environment" to the voice acquisition device at a position A away from the window in the room (for convenience of description, voice 1), and says "test the noise of the indoor environment" to the voice acquisition device at a position B near the window (for convenience of description, voice 2), the first voice information may be voice 1, and the second voice information may be voice 2.
In the embodiment of the present application, the voice information may generally have time domain characteristics and frequency domain characteristics as a signal. Specifically, the time domain features of the speech information may be described by a magnitude dimension of the speech information and a time dimension of the speech information, and may be expressed, for example, as a function of a change in magnitude over time; through the time domain characteristics of the voice information, the information such as the energy value of the voice information in a certain time period and the amplitude value of the voice information at a certain time point can be obtained. The frequency domain features of the speech information may be described by an amplitude dimension of the speech information and a frequency dimension of the speech information, and may be represented, for example, as a function of amplitude versus frequency. The time domain features and the frequency domain features of the speech information may be transformed, and in one example, the time-varying speech information may be decomposed into a plurality of speech information having different frequencies by fourier transform.
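As an illustration of the time-to-frequency transformation mentioned above, the following minimal sketch computes a discrete Fourier transform with the Python standard library only. The function name `dft_magnitudes` and the test signal are hypothetical and are not taken from the application; a practical system would more likely use an FFT library.

```python
import cmath
import math


def dft_magnitudes(samples):
    """Return the normalized magnitude of each DFT bin of a real sequence."""
    n = len(samples)
    magnitudes = []
    for k in range(n):
        # Correlate the signal with a complex sinusoid of frequency index k.
        acc = sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n)
                  for t in range(n))
        magnitudes.append(abs(acc) / n)
    return magnitudes


# Hypothetical time-domain signal: 8 samples of a cosine completing 2 cycles.
signal = [math.cos(2 * math.pi * 2 * t / 8) for t in range(8)]
spectrum = dft_magnitudes(signal)
# The energy concentrates in bin 2 and its mirror bin 6.
```

The same amplitude-versus-time samples are thus re-expressed as amplitude versus frequency, which is the relationship between the two kinds of features described above.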
In this embodiment of the application, the server may obtain training data acquired by the voice acquisition device, and determine a time domain feature value of the training data, where the time domain feature value of the training data may be used to indicate a feature of the training data in a time domain, and in a possible implementation, the time domain feature value of the training data may include one or more of a noise threshold, a long-term energy value, a short-term energy value, and a noise envelope tracking value. It is understood that the time domain feature value of the training data may further include other information, which is not limited in this embodiment of the present application.
In one example, an initial noise threshold (e.g., -40 dB) may be preset, and voice information with an amplitude less than -40 dB in the training data may be identified as noise, while voice information with an amplitude greater than -40 dB may be identified as the voice of the user. In a specific implementation, different environments may have different initial noise thresholds; for example, the initial noise threshold corresponding to an office environment may be -40 dB, and the initial noise threshold corresponding to a mall may be -30 dB. In the embodiment of the present application, the initial noise threshold corresponding to each environment may be set empirically by a person skilled in the art, or may be determined experimentally by a person skilled in the art, which is not limited here.
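The threshold comparison described above might be sketched as follows. The table name, the function name, and the handling of a level exactly equal to the threshold (treated here as voice) are assumptions made for illustration; the description only covers the "less than" and "greater than" cases.

```python
# Initial noise thresholds in dB, using the two example environments above.
INITIAL_NOISE_THRESHOLD_DB = {"office": -40.0, "mall": -30.0}


def classify_level(level_db, environment):
    """Label a measured level as 'noise' or 'voice' for the given environment.

    Assumption: a level exactly equal to the threshold is treated as voice.
    """
    threshold = INITIAL_NOISE_THRESHOLD_DB[environment]
    return "noise" if level_db < threshold else "voice"
```

For instance, a level of -35 dB would count as voice in the office example but as noise in the mall example, because the two environments use different thresholds.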
The long-time energy value and the short-time energy value may be used to indicate energy information of the voice information within a preset time period. The following describes an implementation process of obtaining the long-term energy value and the short-term energy value of the first voice message by taking the first voice message as an example, and the process of obtaining the long-term energy value and the short-term energy value of the second voice message may be processed by referring to the first voice message.
In the embodiment of the application, a framing operation can be performed on the first voice information in advance according to the total duration of the first voice information, so that multi-frame voice information is obtained. The durations of any two frames of voice information in the multi-frame voice information may be the same or may be different. In a specific implementation, there may be multiple ways of dividing the first voice information into multiple frames of voice information. In a possible implementation, a standard frame duration (e.g., 10 ms) may be preset, so that the first voice information is divided into multiple frames of voice information according to the standard frame duration. In an example (for convenience of description, referred to as example one), the total duration of the first voice information is 205 ms, so 21 frames of voice information are obtained after division according to the standard frame duration, where the 1st to 20th frames each have the standard frame duration of 10 ms, and the 21st frame has a duration of 5 ms.
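The framing in example one can be reproduced with a short sketch. The function name `frame_durations` is hypothetical; the sketch only computes frame durations and assumes the remainder, if any, forms a shorter final frame as in the example.

```python
def frame_durations(total_ms, frame_ms=10):
    """Split a total duration into standard frames plus a shorter final frame."""
    full_frames, remainder = divmod(total_ms, frame_ms)
    durations = [frame_ms] * full_frames
    if remainder:
        # The leftover audio becomes a final frame shorter than the standard.
        durations.append(remainder)
    return durations


durations = frame_durations(205)
# 20 frames of 10 ms followed by one frame of 5 ms, 21 frames in total.
```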
Further, the long-term energy value may be set as the average value of the total energy of A frames of voice information (e.g., A is 3), and the short-term energy value may be set as the energy value of the latest B frames of voice information (e.g., B is 1). Based on the multi-frame voice information and the duration of each frame in example one, the long-term energy value may be an average of the total energy of voice information of 30 ms duration (or of 25 ms duration if the 21st frame is included); accordingly, the short-term energy value may be the energy value of the 21st frame of voice information. In the embodiments of the present application, the values of A and B may be adjusted by those skilled in the art according to actual situations, and are not limited here.
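A minimal sketch of the two energy values follows. It assumes that the energy of a frame is the sum of its squared samples and that the long-term value is averaged over the most recent A frames; neither detail is specified in the description, and the function names are hypothetical.

```python
def frame_energy(frame):
    """Energy of one frame, assumed here to be the sum of squared samples."""
    return sum(x * x for x in frame)


def long_term_energy(frames, a=3):
    """Average energy over the most recent `a` frames (assumed window)."""
    recent = frames[-a:]
    return sum(frame_energy(f) for f in recent) / len(recent)


def short_term_energy(frames, b=1):
    """Total energy of the latest `b` frames."""
    return sum(frame_energy(f) for f in frames[-b:])


# Hypothetical frames of two samples each, with energies 2, 4, 9 and 1.
frames = [[1.0, 1.0], [2.0, 0.0], [0.0, 3.0], [1.0, 0.0]]
```

With A = 3 and B = 1 on these frames, the long-term value averages the last three frame energies and the short-term value is the energy of the final frame alone.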
The noise envelope tracking value may be used to estimate the amplitude of the noise. Generally, noise may have a wider temporal characteristic than speech, and therefore, an estimate of the noise amplitude may be obtained by tracking the minimum amplitude corresponding to each frame of speech information in the second speech information. For example, the minimum value corresponding to each frame of speech information in the second speech information may be extracted in advance, a noise envelope tracking map may be drawn accordingly, and the estimated noise information (e.g., a noise envelope value corresponding to the long-term energy value, a noise envelope value corresponding to the short-term energy value, etc.) may be obtained through calculation according to the noise envelope tracking map and a preset index (e.g., a fast-decreasing and slow-increasing principle).
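One possible reading of the fast-decreasing and slow-increasing principle is an asymmetric smoother over the per-frame minimum amplitudes, sketched below. The smoothing rates `rise` and `fall` and the function name are assumptions for illustration, not values from the application.

```python
def track_noise_envelope(frame_minima, rise=0.05, fall=0.5):
    """Track a noise envelope that falls quickly and rises slowly.

    frame_minima: minimum amplitude of each frame, in time order.
    rise, fall: assumed smoothing rates (fall > rise).
    """
    envelope = []
    estimate = frame_minima[0]
    for minimum in frame_minima:
        # Follow decreases quickly, but increases only slowly, so that
        # brief loud speech does not inflate the noise estimate.
        rate = fall if minimum < estimate else rise
        estimate += rate * (minimum - estimate)
        envelope.append(estimate)
    return envelope


# Hypothetical per-frame minima: quiet, quiet, a loud frame, then quieter.
envelope = track_noise_envelope([1.0, 1.0, 3.0, 0.5])
```

In this example the loud third frame raises the estimate only slightly, while the quieter fourth frame pulls it down much faster.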
It should be noted that, in the embodiment of the present application, the size of the noise threshold may also be adjusted according to the noise envelope tracking value. In one example, an increment (e.g., 2 dB) may be preset, and if the calculated noise envelope tracking value is greater than the initial noise threshold (e.g., -40 dB), the noise threshold may be sequentially increased by the preset increment. For example, if the first calculated noise envelope tracking value is -35 dB, the noise threshold may be adjusted to -33 dB, and the noise envelope tracking value may be calculated based on the -33 dB noise threshold.
In this embodiment, the server may further determine a target value of the training data, where the target value of the training data may include a voice activity detection value of the first voice information and/or a full-band signal-to-noise ratio of the second voice information. In this case, the voice activity detection value of the first voice information may be used to indicate whether voice or noise is detected, and in one example, a first indication value (1) and a second indication value (0) may be preset, for example, "1" may be used to indicate that voice is currently detected, "0" may be used to indicate that noise is currently detected, or "0" may be used to indicate that voice is currently detected, "1" may be used to indicate that noise is currently detected, which is not limited in this embodiment of the application. The full-band snr of the second speech information can be used to indicate a correspondence between speech and noise, and in one example, the full-band snr of the second speech information can be a ratio of an average power of the second speech information to an average power of noise corresponding to the second speech information.
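The full-band signal-to-noise ratio described above, a ratio of average powers, can be sketched as follows. Average power is assumed to be the mean of squared samples, the decibel conversion is an added convenience not mentioned in the description, and all names and sample values are hypothetical.

```python
import math


def average_power(samples):
    """Mean of squared samples (assumed definition of average power)."""
    return sum(x * x for x in samples) / len(samples)


def full_band_snr(second_voice, noise):
    """Ratio of the average power of the second voice information to the
    average power of its corresponding noise."""
    return average_power(second_voice) / average_power(noise)


def to_db(power_ratio):
    """Express a power ratio in decibels."""
    return 10.0 * math.log10(power_ratio)


# Hypothetical samples: the noisy speech has four times the noise power.
ratio = full_band_snr([2.0, -2.0, 2.0, -2.0], [1.0, -1.0, 1.0, -1.0])
```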
Step 202, the server trains the time domain characteristic value and the target value based on the deep learning training model constructed by the frequency domain to obtain model parameters.
Here, the deep learning training model may be constructed in various ways, and in one possible implementation, the deep learning training model may be constructed based on Keras. Specifically, Keras is a highly modular neural network library written in Python that runs on top of a backend such as Theano or TensorFlow, and Keras may support both a Graphics Processing Unit (GPU) and a Central Processing Unit (CPU).
In an embodiment of the application, the deep learning training model may include a voice activity detection module, a noise spectrum estimation module, and a spectral subtraction module. The voice activity detection module may detect the first voice information and the second voice information, and distinguish between voice and noise according to activity indicators (e.g., amplitude range, etc.) of the detected first voice information and the detected second voice information. The noise spectrum estimation module may be configured to perform calculations on the first voice information and the second voice information, and may estimate the spectral characteristics of the noise according to the calculation results. The spectral subtraction module may be configured to determine a gain value according to the calculation results obtained by the voice activity detection module and the noise spectrum estimation module, where the gain value may be used to suppress noise in the voice information.
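To make the role of the gain value concrete, the sketch below uses the classic textbook power spectral subtraction rule, in which each frequency bin is scaled by a gain derived from the estimated noise power. This is a substituted, well-known technique shown for illustration only; the application does not state that its module uses this formula, and the function name and the `floor` parameter are assumptions. The voice activity detection step is omitted.

```python
import math


def spectral_subtraction_gains(noisy_power, noise_power, floor=0.01):
    """Per-bin amplitude gains from classic power spectral subtraction.

    noisy_power: power of the noisy speech in each frequency bin.
    noise_power: estimated noise power in each frequency bin.
    floor: assumed lower bound on the power gain, limiting over-suppression.
    """
    gains = []
    for noisy, noise in zip(noisy_power, noise_power):
        if noisy <= 0.0:
            # No signal energy in this bin: apply the minimum gain.
            gains.append(math.sqrt(floor))
            continue
        power_gain = max(1.0 - noise / noisy, floor)
        gains.append(math.sqrt(power_gain))
    return gains


# Hypothetical bins: mostly speech, mostly noise, and an empty bin.
gains = spectral_subtraction_gains([4.0, 1.0, 0.0], [1.0, 2.0, 1.0])
```

Bins dominated by speech keep a gain close to 1, while bins dominated by noise are attenuated down to the floor.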
In specific implementation, the server may use the time domain feature value of the training data as input information of the deep learning training model, and may use the target value of the training data as output information of the deep learning training model, so as to control the deep learning training model to perform model training according to the input information and the output information, thereby obtaining model parameters. In one example, the server may obtain the first model parameter by inputting the time domain feature value into the voice activity detection module; further, the server can obtain a second model parameter by inputting the first model parameter into the noise spectrum estimation module, and meanwhile, the server can obtain a third model parameter by inputting the first model parameter into the spectrum subtraction module; and finally, the server inputs the first model parameter, the second model parameter and the third model parameter into the spectrum subtraction module together to obtain the model parameters of the training data after deep learning training model training. It should be noted that, in the embodiment of the present application, each module in the deep learning training model may be a functional module constructed by Keras, that is, the voice activity detection module, the noise spectrum estimation module, and the spectrum subtraction module are introduced only for describing a process of determining the model parameters, and in a specific implementation, other modules may be further included, which is not limited in particular. Moreover, the names of the functional modules may also be other possible names, and are not limited specifically.
Step 203, the server sends the model parameters to the terminal device.
Here, the server may send the model parameters to the terminal device by communicating with the terminal device, where the server may communicate with the terminal device in various ways, and in one example, the server may communicate with the terminal device in a wireless way; in another example, the server may also communicate with the terminal device through a wired line (e.g., an optical fiber, a network cable, etc.), which is not limited in this embodiment.
Step 204, the terminal device receives the model parameters sent by the server and uses them to update the parameters of the first voice noise reduction model in the terminal device.
In this embodiment of the application, a first voice noise reduction model and a second voice noise reduction model may be preset in the terminal device, and the second voice noise reduction model may be a standby model of the first voice noise reduction model. In the initial state (for example, when the terminal device leaves the factory, or after the terminal device is initialized), the parameters of the first voice noise reduction model and the second voice noise reduction model may be the parameters of the Keras-based deep learning training model before model training is performed.
In a specific implementation, a preset flag may be set in the terminal device to indicate whether the first voice noise reduction model in the terminal device is being updated. In one example, before the terminal device receives the model parameters sent by the server and updates the parameters of the first voice noise reduction model in the terminal device by using the model parameters, the terminal device may update the preset flag to a first indication value, which indicates that the first voice noise reduction model in the terminal device is being updated. Further, after obtaining the updated first voice noise reduction model, the terminal device may update the preset flag to a second indication value, which indicates either that the first voice noise reduction model in the terminal device is not being updated or that its update has completed. In the embodiment of the present application, the preset flag may be represented by one bit; for example, the first indication value may be "0" and the second indication value may be "1", or the first indication value may be "1" and the second indication value may be "0", which is not specifically limited.
In this embodiment of the application, if the terminal device detects that the first speech noise reduction model cannot perform noise reduction processing due to some reasons (for example, some hardware of the terminal device is damaged or an update algorithm of the first speech noise reduction model is wrong), the terminal device may also update the preset flag to the first indication value.
It should be noted that, in this embodiment of the application, after obtaining the updated first voice noise reduction model, the terminal device may bring the second voice noise reduction model in line with the first voice noise reduction model, for example by updating the parameters of the second voice noise reduction model to the parameters of the first voice noise reduction model.
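The update sequence on the terminal device (set the flag, update the first model, clear the flag, then synchronize the standby model) can be sketched as follows. This is a minimal sketch under assumptions: the class name, the dictionary representation of parameters, the empty-parameter failure check, and the encoding of the indication values as 0 and 1 are all hypothetical.

```python
FLAG_FIRST_VALUE = 0   # hypothetical encoding: first model is being updated or unusable
FLAG_SECOND_VALUE = 1  # hypothetical encoding: first model is not being updated

class TerminalModels:
    """Holds the parameters of the first model and of its standby (second) model."""

    def __init__(self, initial_params):
        self.first_params = dict(initial_params)
        self.second_params = dict(initial_params)
        self.flag = FLAG_SECOND_VALUE

    def apply_update(self, new_params):
        """Apply server parameters to the first model; return True on success."""
        self.flag = FLAG_FIRST_VALUE           # mark the first model as being updated
        if not new_params:                     # hypothetical failure check
            return False                       # flag stays at the first value
        self.first_params = dict(new_params)   # update the first model
        self.flag = FLAG_SECOND_VALUE          # update complete
        self.second_params = dict(self.first_params)  # sync the standby model
        return True
```

Because the standby model is synchronized only after the first model has been updated successfully, a failed update leaves the standby model with the last known good parameters.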
Step 205, after receiving voice information input by the user, the terminal device checks the preset flag: if the preset flag is determined to be the first indication value, step 206a may be executed; if the preset flag is determined to be the second indication value, step 206b may be executed.
Step 206a, the terminal device performs noise reduction processing on the voice information by using the second voice noise reduction model.
Here, if the terminal device determines that the preset flag is the first indication value, this indicates that the first voice noise reduction model in the terminal device is being updated, or that the first voice noise reduction model cannot perform noise reduction processing; in this case, the terminal device may control the second voice noise reduction model in the terminal device to perform noise reduction processing on the voice information input by the user.
Specifically, the parameters of the second speech noise reduction model may be model parameters received by the terminal device and obtained by performing model training last time by the server (or may be model parameters of a conventional speech noise reduction model in an initial state), and therefore, the terminal device may perform noise reduction processing on speech information input by the user by using the model parameters obtained last time (or the model parameters of the conventional speech noise reduction model).
Step 206b, the terminal device performs noise reduction processing on the voice information by using the updated first voice noise reduction model.
Here, the terminal device determines that the preset flag is the second indication value, which indicates that the first speech noise reduction model in the terminal device is updated, and at this time, the terminal device may control the updated first speech noise reduction model to perform noise reduction processing on speech information input by a user. That is, the terminal device may perform noise reduction processing on the speech information input by the user by using the latest model parameters obtained by model training performed by the server.
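The branch between the standby model and the updated first model can be expressed as a small dispatch function. The sketch below is illustrative only: models are represented as plain callables, and the function names, the default flag encoding, and the toy scaling "models" are assumptions.

```python
def select_model(flag, first_model, second_model, first_indication_value=0):
    """Return the standby model while the flag holds the first indication value."""
    if flag == first_indication_value:
        return second_model
    return first_model

def reduce_noise(voice, flag, first_model, second_model):
    """Denoise the voice samples with whichever model the flag selects."""
    model = select_model(flag, first_model, second_model)
    return model(voice)

# Toy stand-ins for the two noise reduction models (hypothetical behavior).
def toy_first_model(voice):
    return [0.5 * x for x in voice]

def toy_second_model(voice):
    return [0.25 * x for x in voice]
```

Keeping the selection in one function means the flag is read once per utterance, so a model swap that happens mid-utterance does not mix outputs from two parameter sets.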
In the embodiment of the application, the time domain characteristic value (which can embody the time domain characteristic) and the target value of the training data are obtained, and the time domain characteristic value and the target value of the training data are trained by adopting the deep learning training model (which can embody the frequency domain characteristic) constructed based on the frequency domain, so that the time domain characteristic and the frequency domain characteristic of the voice information can be combined in the process of training the model, the training performance of the deep learning training model is improved, and the training speed of the deep learning training model is accelerated; moreover, the training data can comprise voice information under various environments, and different training data can be adopted for training the deep learning training model for multiple times, so that more accurate model parameters can be obtained; furthermore, parameters of the first voice noise reduction model in the terminal equipment are updated through model parameters obtained through deep learning training model training, so that the noise reduction processing can be performed on voice information input by a user through the model parameters obtained through deep learning training model training by the terminal equipment, the voice information obtained through the noise reduction processing can be more accurate, and the experience of the user is improved. In addition, the process of training data by adopting the deep learning training model to obtain model parameters and the noise reduction process of the terminal equipment on the voice information of the user can be processed in parallel, so that the processing speed and the processing efficiency of voice noise reduction can be improved.
For the above method flow, an embodiment of the present application further provides a noise reduction apparatus, and specific contents of the apparatus may be implemented with reference to the above method.
Fig. 3 is a schematic structural diagram of a server according to an embodiment of the present application, including:
an obtaining module 301, configured to obtain training data, where the training data includes first voice information collected in a first environment and second voice information collected in a second environment, noise in the first environment is less than or equal to a preset threshold, and noise in the second environment is greater than the preset threshold;
a determining module 302, configured to determine, according to the training data, a time-domain feature value and a target value of the training data, where the time-domain feature value of the training data includes one or more of a noise threshold, a long-time energy value, a short-time energy value, and a noise envelope tracking value; the target value of the training data comprises a voice activity detection value of the first voice information and/or a full-band signal-to-noise ratio of the second voice information;
a processing module 303, configured to train the time-domain feature value and the target value based on a deep learning training model constructed in the frequency domain to obtain model parameters, and to send the model parameters to a terminal device, so that the terminal device performs noise reduction processing on voice information input by a user by using the model parameters.
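The application names the time-domain feature values but does not define how they are computed. The following sketch shows one plausible way to derive them per frame; every formula here is an assumption made for illustration: short-time energy as mean squared amplitude, long-time energy as an exponential moving average, the noise envelope as a tracker that follows drops immediately and rises slowly, and the noise threshold as a fixed multiple of that envelope.

```python
def short_time_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(x * x for x in frame) / len(frame)

def time_domain_features(frames, long_alpha=0.95, rise=1.05, threshold_factor=2.0):
    """Compute one feature dictionary per frame (all definitions are assumptions)."""
    features = []
    long_energy = None
    envelope = None
    for frame in frames:
        ste = short_time_energy(frame)
        # Long-time energy: exponential moving average of short-time energy.
        if long_energy is None:
            long_energy = ste
        else:
            long_energy = long_alpha * long_energy + (1.0 - long_alpha) * ste
        # Noise envelope: follow decreases at once, allow only slow increases.
        if envelope is None or ste < envelope:
            envelope = ste
        else:
            envelope = min(envelope * rise, ste)
        features.append({
            "short_time_energy": ste,
            "long_time_energy": long_energy,
            "noise_envelope": envelope,
            "noise_threshold": threshold_factor * envelope,
        })
    return features
```

With these definitions, a sudden loud frame raises the short-time energy well above the noise threshold, while the slowly rising envelope keeps the threshold anchored to the background level.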
Fig. 4 is a schematic structural diagram of a terminal device provided in an embodiment of the present application, where the terminal device includes:
a transceiver module 401, configured to receive a model parameter sent by a server;
an updating module 402, configured to update parameters of a first speech noise reduction model in the terminal device according to the model parameters, to obtain an updated first speech noise reduction model;
and a noise reduction module 403, configured to, after receiving the voice information input by the user, perform noise reduction processing on the voice information by using the updated first voice noise reduction model.
Optionally, after the transceiver module 401 receives the model parameters sent by the server, the updating module 402 is further configured to: update a preset flag to a first indication value; and, after the updated first voice noise reduction model is obtained, update the preset flag to a second indication value.
Optionally, the noise reduction module 403 is further configured to:
determine that the preset flag is the second indication value.
Optionally, the noise reduction module 403 is further configured to:
after receiving voice information input by a user, if the preset flag is determined to be the first indication value, perform noise reduction processing on the voice information by using a second voice noise reduction model in the terminal device; the second voice noise reduction model is a standby model of the first voice noise reduction model.
From the above, it can be seen that: in the above embodiment of the application, the server takes the first voice information collected in the first environment (noise less than or equal to the preset threshold) and the second voice information collected in the second environment (noise greater than the preset threshold) as training data, and after determining the time-domain feature value and the target value of the training data, trains the time-domain feature value and the target value based on the deep learning training model constructed in the frequency domain to obtain the model parameters, and sends the model parameters to the terminal device, so that the terminal device can update the parameters of the first voice noise reduction model in the terminal device after receiving the model parameters sent by the server, and can perform noise reduction processing on the voice information input by the user by using the updated first voice noise reduction model. In the embodiment of the application, the time-domain feature value (which can embody the time-domain characteristics) and the target value of the training data are obtained, and the time-domain feature value and the target value of the training data are trained by adopting the deep learning training model (which can embody the frequency-domain characteristics) constructed based on the frequency domain, so that the time-domain characteristics and the frequency-domain characteristics of the voice information can be combined in the process of training the model, the training performance of the deep learning training model is improved, and the training speed of the deep learning training model is accelerated; moreover, the training data can comprise voice information under various environments, and different training data can be adopted for training the deep learning training model multiple times, so that more accurate model parameters can be obtained; furthermore, the parameters of the first voice noise reduction model in the terminal device are updated with the model parameters obtained through deep learning training model training, so that the terminal device can perform noise reduction processing on the voice information input by the user with those model parameters, the voice information obtained through the noise reduction processing can be more accurate, and the experience of the user is improved. In addition, the process of training on the training data with the deep learning training model to obtain model parameters and the noise reduction process of the terminal device on the voice information of the user can be performed in parallel, so that the processing speed and the processing efficiency of voice noise reduction can be improved.
It will be apparent to those skilled in the art that embodiments of the present application may be provided as a method or as a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While the preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all alterations and modifications as fall within the scope of the application.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.

Claims (10)

1. A noise reduction method, characterized in that the method comprises:
a server acquires training data, the training data comprising first voice information collected in a first environment and second voice information collected in a second environment, wherein noise in the first environment is less than or equal to a preset threshold and noise in the second environment is greater than the preset threshold;
the server determines, according to the training data, a time-domain feature value and a target value of the training data; the time-domain feature value of the training data comprises one or more of a noise threshold, a long-time energy value, a short-time energy value, and a noise envelope tracking value; the target value of the training data comprises a voice activity detection value of the first voice information and/or a full-band signal-to-noise ratio of the second voice information;
the server trains the time-domain feature value and the target value based on a deep learning training model constructed in the frequency domain to obtain model parameters, and sends the model parameters to a terminal device, the model parameters being used by the terminal device to perform noise reduction processing on voice information input by a user.
2. A noise reduction method, characterized in that the method comprises:
a terminal device receives model parameters sent by a server;
the terminal device updates, according to the model parameters, parameters of a first voice noise reduction model in the terminal device to obtain an updated first voice noise reduction model;
after receiving voice information input by a user, the terminal device performs noise reduction processing on the voice information by using the updated first voice noise reduction model.
3. The method according to claim 2, characterized in that, after the terminal device receives the model parameters sent by the server, the method further comprises:
the terminal device updates a preset flag to a first indication value;
and after the terminal device obtains the updated first voice noise reduction model, the method further comprises:
the terminal device updates the preset flag to a second indication value.
4. The method according to claim 3, characterized in that, after the terminal device receives the voice information input by the user and before it uses the updated first voice noise reduction model to update the voice information, the method further comprises:
the terminal device determines that the preset flag is the second indication value.
5. The method according to claim 3, characterized in that the method further comprises:
after receiving the voice information input by the user, if the terminal device determines that the preset flag is the first indication value, the terminal device performs noise reduction processing on the voice information by using a second voice noise reduction model in the terminal device; the second voice noise reduction model is a standby model of the first voice noise reduction model.
6. A server, characterized in that the server comprises:
an obtaining module, configured to obtain training data, the training data comprising first voice information collected in a first environment and second voice information collected in a second environment, wherein noise in the first environment is less than or equal to a preset threshold and noise in the second environment is greater than the preset threshold;
a determining module, configured to determine, according to the training data, a time-domain feature value and a target value of the training data, wherein the time-domain feature value of the training data comprises one or more of a noise threshold, a long-time energy value, a short-time energy value, and a noise envelope tracking value; the target value of the training data comprises a voice activity detection value of the first voice information and/or a full-band signal-to-noise ratio of the second voice information;
a processing module, configured to train the time-domain feature value and the target value based on a deep learning training model constructed in the frequency domain to obtain model parameters, and to send the model parameters to a terminal device, so that the terminal device performs noise reduction processing on voice information input by a user by using the model parameters.
7. A terminal device, characterized in that the terminal device comprises:
a transceiver module, configured to receive model parameters sent by a server;
an updating module, configured to update, according to the model parameters, parameters of a first voice noise reduction model in the terminal device to obtain an updated first voice noise reduction model;
a noise reduction module, configured to, after receiving voice information input by a user, perform noise reduction processing on the voice information by using the updated first voice noise reduction model.
8. The terminal device according to claim 7, characterized in that, after the transceiver module receives the model parameters sent by the server, the updating module is further configured to: update a preset flag to a first indication value; and, after the updated first voice noise reduction model is obtained, update the preset flag to a second indication value.
9. The terminal device according to claim 8, characterized in that the noise reduction module is further configured to:
determine that the preset flag is the second indication value.
10. The terminal device according to claim 9, characterized in that the noise reduction module is further configured to:
after receiving the voice information input by the user, if the preset flag is determined to be the first indication value, perform noise reduction processing on the voice information by using a second voice noise reduction model in the terminal device; the second voice noise reduction model is a standby model of the first voice noise reduction model.
CN201811352262.9A 2018-11-14 2018-11-14 Noise reduction method and device Active CN111192599B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811352262.9A CN111192599B (en) 2018-11-14 2018-11-14 Noise reduction method and device

Publications (2)

Publication Number Publication Date
CN111192599A (en) 2020-05-22
CN111192599B (en) 2022-11-22

Family

ID=70708941

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811352262.9A Active CN111192599B (en) 2018-11-14 2018-11-14 Noise reduction method and device

Country Status (1)

Country Link
CN (1) CN111192599B (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111933105A (en) * 2020-09-17 2020-11-13 南昌工程学院 Automobile noise control device and control method thereof
CN112565997A (en) * 2020-12-04 2021-03-26 可孚医疗科技股份有限公司 Adaptive noise reduction method and device for hearing aid, hearing aid and storage medium
CN112580823A (en) * 2020-12-17 2021-03-30 北京嘀嘀无限科技发展有限公司 Data processing method and device, readable storage medium and electronic equipment
CN112634932A (en) * 2021-03-09 2021-04-09 南京涵书韵信息科技有限公司 Audio signal processing method and device, server and related equipment
CN113421577A (en) * 2021-05-10 2021-09-21 北京达佳互联信息技术有限公司 Video dubbing method and device, electronic equipment and storage medium
CN113840034A (en) * 2021-11-29 2021-12-24 荣耀终端有限公司 Sound signal processing method and terminal device
WO2022026948A1 (en) 2020-07-31 2022-02-03 Dolby Laboratories Licensing Corporation Noise reduction using machine learning
CN116941185A (en) * 2021-07-09 2023-10-24 Oppo广东移动通信有限公司 Noise reduction method based on transfer learning, terminal equipment, network equipment and storage medium

Citations (5)

Publication number Priority date Publication date Assignee Title
CN105023580A (en) * 2015-06-25 2015-11-04 中国人民解放军理工大学 Unsupervised noise estimation and speech enhancement method based on separable deep automatic encoding technology
CN106024002A (en) * 2015-02-11 2016-10-12 恩智浦有限公司 Time zero convergence single microphone noise reduction
US20180033449A1 (en) * 2016-08-01 2018-02-01 Apple Inc. System and method for performing speech enhancement using a neural network-based combined symbol
CN107845389A (en) * 2017-12-21 2018-03-27 北京工业大学 A kind of sound enhancement method based on multiresolution sense of hearing cepstrum coefficient and depth convolutional neural networks
CN108346433A (en) * 2017-12-28 2018-07-31 北京搜狗科技发展有限公司 A kind of audio-frequency processing method, device, equipment and readable storage medium storing program for executing

Cited By (13)

Publication number Priority date Publication date Assignee Title
EP4383256A2 (en) 2020-07-31 2024-06-12 Dolby Laboratories Licensing Corporation Noise reduction using machine learning
CN116057626A (en) * 2020-07-31 2023-05-02 杜比实验室特许公司 Noise Reduction Using Machine Learning
WO2022026948A1 (en) 2020-07-31 2022-02-03 Dolby Laboratories Licensing Corporation Noise reduction using machine learning
CN111933105A (en) * 2020-09-17 2020-11-13 南昌工程学院 Automobile noise control device and control method thereof
CN111933105B (en) * 2020-09-17 2024-03-29 南昌工程学院 Automobile noise control device and control method thereof
CN112565997B (en) * 2020-12-04 2022-03-22 可孚医疗科技股份有限公司 Adaptive noise reduction method and device for hearing aid, hearing aid and storage medium
CN112565997A (en) * 2020-12-04 2021-03-26 可孚医疗科技股份有限公司 Adaptive noise reduction method and device for hearing aid, hearing aid and storage medium
CN112580823A (en) * 2020-12-17 2021-03-30 北京嘀嘀无限科技发展有限公司 Data processing method and device, readable storage medium and electronic equipment
CN112634932A (en) * 2021-03-09 2021-04-09 南京涵书韵信息科技有限公司 Audio signal processing method and device, server and related equipment
CN112634932B (en) * 2021-03-09 2021-06-22 赣州柏朗科技有限公司 Audio signal processing method and device, server and related equipment
CN113421577A (en) * 2021-05-10 2021-09-21 北京达佳互联信息技术有限公司 Video dubbing method and device, electronic equipment and storage medium
CN116941185A (en) * 2021-07-09 2023-10-24 Oppo广东移动通信有限公司 Noise reduction method based on transfer learning, terminal equipment, network equipment and storage medium
CN113840034A (en) * 2021-11-29 2021-12-24 荣耀终端有限公司 Sound signal processing method and terminal device

Also Published As

Publication number Publication date
CN111192599B (en) 2022-11-22

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant