CN111192599A - Method and device for noise reduction

Info

Publication number: CN111192599A
Application number: CN201811352262.9A
Authority: CN
Inventors: 宋钦梅; 方华; 袁其政; 屈跃强; 程宝平
Original assignee: China Mobile Communications Group Co Ltd; China Mobile Hangzhou Information Technology Co Ltd
Current assignee: China Mobile Communications Group Co Ltd; China Mobile Hangzhou Information Technology Co Ltd
Priority date: 2018-11-14
Filing date: 2018-11-14
Publication date: 2020-05-22
Anticipated expiration: 2038-11-14
Also published as: CN111192599B

Abstract

The embodiments of the application disclose a noise reduction method and a noise reduction device. In the method, a server trains a deep learning training model constructed based on the frequency domain using a time domain characteristic value and a target value of training data, and sends the resulting model parameters to a terminal device. After receiving the model parameters, the terminal device updates the parameters of a first voice noise reduction model and uses the updated first voice noise reduction model to perform noise reduction processing on voice information input by a user. Because the parameters of the first voice noise reduction model in the terminal device are updated with model parameters obtained by training the deep learning training model, the voice information obtained after noise reduction processing can be more accurate, which improves the user experience.

Description

Noise reduction method and device

Technical Field

The present application relates to the field of communications technologies, and in particular, to a noise reduction method and apparatus.

Background

In real life, voice information sent by a user usually contains noise, such as wind, traffic, and machine operation sounds in the environment. When the user talks through a voice device, this noise may degrade the call quality and therefore the user experience. For example, user A and user B communicate with each other through terminal devices (e.g., mobile phones). If the voice information sent by user A through mobile phone A contains a large amount of noise, user B may not be able to obtain user A's voice information normally through mobile phone B; for example, the obtained voice information may not be clear enough, or the voice information sent by user A may not be obtained at all.

Therefore, a noise reduction method is needed to solve the technical problem of low communication quality between users due to the existence of noise.

Disclosure of Invention

The embodiment of the application provides a noise reduction method and a noise reduction device, which are used for solving the technical problem of low call quality among users due to the existence of noise.

The noise reduction method provided by the embodiment of the application comprises the following steps:

a server obtains training data, wherein the training data comprises first voice information collected in a first environment and second voice information collected in a second environment, noise in the first environment is less than or equal to a preset threshold, and noise in the second environment is greater than the preset threshold;

the server determines a time domain characteristic value and a target value of the training data according to the training data; the time domain characteristic value of the training data comprises one or more of a noise threshold value, a long-term energy value, a short-term energy value and a noise envelope tracking value; the target value of the training data comprises a voice activity detection value of the first voice information and/or a full-band signal-to-noise ratio of the second voice information;

the server trains on the time domain characteristic value and the target value using a deep learning training model constructed based on the frequency domain to obtain model parameters, and sends the model parameters to the terminal equipment; the model parameters are used by the terminal equipment to perform noise reduction processing on the voice information input by the user;

the terminal equipment receives the model parameters sent by the server;

the terminal equipment updates the parameters of the first voice noise reduction model in the terminal equipment according to the model parameters to obtain an updated first voice noise reduction model;

and after receiving the voice information input by the user, the terminal equipment uses the updated first voice noise reduction model to perform noise reduction processing on the voice information.

Optionally, after the terminal device receives the model parameter sent by the server, the method further includes:

the terminal equipment updates a preset mark into a first indicated value;

after the terminal device obtains the updated first speech noise reduction model, the method further includes:

and the terminal equipment updates the preset mark into a second indication value.

Optionally, after the terminal device receives the voice information input by the user and before it performs noise reduction processing on the voice information by using the updated first voice noise reduction model, the method further includes:

and the terminal equipment determines that the preset mark is the second indication value.

Optionally, the method further comprises:

after the terminal equipment receives voice information input by a user, if the preset mark is determined to be the first indicated value, a second voice noise reduction model in the terminal equipment is used for carrying out noise reduction processing on the voice information; the second voice noise reduction model is a standby model of the first voice noise reduction model.

An embodiment of the present application provides a server, including:

an acquisition module, configured to acquire training data, wherein the training data comprises first voice information collected in a first environment and second voice information collected in a second environment, the noise in the first environment is less than or equal to a preset threshold, and the noise in the second environment is greater than the preset threshold;

the determining module is used for determining a time domain characteristic value and a target value of the training data according to the training data, wherein the time domain characteristic value of the training data comprises one or more of a noise threshold value, a long-term energy value, a short-term energy value and a noise envelope tracking value; the target value of the training data comprises a voice activity detection value of the first voice information and/or a full-band signal-to-noise ratio of the second voice information;

and the processing module is used for training the time domain characteristic value and the target value based on a deep learning training model constructed in a frequency domain to obtain a model parameter, and sending the model parameter to the terminal equipment so that the terminal equipment can use the model parameter to perform noise reduction processing on the voice information input by the user.

The embodiment of the present application provides a terminal device, which includes:

the receiving and sending module is used for receiving the model parameters sent by the server;

the updating module is used for updating the parameters of the first voice noise reduction model in the terminal equipment according to the model parameters to obtain an updated first voice noise reduction model;

and the noise reduction module is used for performing noise reduction processing on the voice information by using the updated first voice noise reduction model after receiving the voice information input by the user.

Optionally, the update module is further configured to: after the transceiver module receives the model parameters sent by the server, update a preset mark to a first indicated value; and after the updated first voice noise reduction model is obtained, update the preset mark to a second indication value.

Optionally, the noise reduction module is further configured to:

and determining the preset mark as the second indication value.

Optionally, the noise reduction module is further configured to:

after receiving voice information input by a user, if the preset mark is determined to be the first indicated value, performing noise reduction processing on the voice information by using a second voice noise reduction model in the terminal equipment; the second voice noise reduction model is a standby model of the first voice noise reduction model.

In the above embodiments of the application, the server takes first voice information collected in the first environment (noise less than or equal to the preset threshold) and second voice information collected in the second environment (noise greater than the preset threshold) as training data. After determining the time domain characteristic value and the target value of the training data, the server trains on them using the deep learning training model constructed based on the frequency domain to obtain model parameters, and sends the model parameters to the terminal device. After receiving the model parameters sent by the server, the terminal device can update the parameters of the first voice noise reduction model in the terminal device and use the updated first voice noise reduction model to perform noise reduction processing on voice information input by the user.

Because the time domain characteristic value (which reflects time domain characteristics) and the target value of the training data are trained with a deep learning training model constructed based on the frequency domain (which reflects frequency domain characteristics), the time domain and frequency domain characteristics of the voice information can be combined during model training, which improves the training performance of the deep learning training model and speeds up its training. Moreover, the training data can include voice information from various environments, and the deep learning training model can be trained multiple times with different training data, so that more accurate model parameters can be obtained.

Furthermore, because the parameters of the first voice noise reduction model in the terminal device are updated with the model parameters obtained by training the deep learning training model, the voice information obtained after noise reduction processing can be more accurate, which improves the user experience. In addition, the process of training the deep learning training model to obtain model parameters and the process of the terminal device performing noise reduction on the user's voice information can be carried out in parallel, which can improve the processing speed and efficiency of voice noise reduction.

Drawings

In order to illustrate the technical solutions in the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings in the following description show only some embodiments of the present application, and those skilled in the art can obtain other drawings based on these drawings without inventive effort.

Fig. 1 is a schematic diagram of a possible system architecture provided by an embodiment of the present application;

Fig. 2 is a schematic flowchart corresponding to a noise reduction method according to an embodiment of the present application;

Fig. 3 is a schematic structural diagram of a server provided in an embodiment of the present application;

Fig. 4 is a schematic structural diagram of a terminal device provided in an embodiment of the present application.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application clearer, the present application will be described in further detail with reference to the accompanying drawings, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

Fig. 1 shows a possible system architecture provided in this embodiment. The system architecture may include a network device 100 and one or more terminal devices (e.g., the terminal device 101 and the terminal device 102 illustrated in Fig. 1), and the terminal device 101 and the terminal device 102 may communicate through the network device 100. The network device 100 may be a base station, such as an evolved base station (evolved NodeB) in a Long Term Evolution (LTE) communication system. The terminal device may be a mobile phone, a tablet computer (Pad), or the like, which is not specifically limited.

In this embodiment of the application, user A and user B may communicate through terminal devices, where user A uses the terminal device 101 and user B uses the terminal device 102. In one possible scenario, user A inputs voice information to the terminal device 101, the terminal device 101 transmits the voice information input by user A to the terminal device 102 through the network device 100, and the terminal device 102 provides the received voice information to user B.

To improve call quality, one possible implementation (referred to as implementation A for convenience of description) performs noise reduction processing on the voice information sent by the user by using a preset noise signal, that is, it filters the preset noise signal out of the voice information sent by the user to obtain filtered voice information. However, the preset noise signal used in this approach is fixed: the terminal device applies the same preset noise signal to all subsequent voice information sent by users, without updating the model parameters in the terminal device. As a result, when users are in various different environments, the voice information obtained after noise reduction with the preset noise signal may be inaccurate.

Based on this, the embodiment of the present application provides a voice noise reduction method, so as to solve the technical problem that the quality of a call between users is low due to the existence of noise.

Fig. 2 is a schematic flowchart corresponding to a noise reduction method provided in an embodiment of the present application, where the method includes:

Step 201, the server acquires training data and determines a time domain characteristic value and a target value of the training data.

Here, the training data may include first voice information collected in a first environment and second voice information collected in a second environment. The noise in the first environment may be less than or equal to a preset threshold, that is, the noise in the first environment is small; the noise in the second environment may be greater than the preset threshold, that is, the noise in the second environment is large. The preset threshold may be set by a person skilled in the art according to experience, or may be determined through one or more experiments, which is not limited in the embodiment of the present application.

In the embodiment of the application, the training data can be acquired in various manners. In one possible implementation, a tester can carry a voice acquisition device to collect the first voice information and the second voice information in the first environment and the second environment respectively. For example, taking a preset threshold of -30 dB, if the application scenario is a certain room on a preset floor, the voice information uttered by the tester at a first position in the room where the noise is less than or equal to -30 dB (or where there is no noise), as acquired by the voice acquisition device, may be used as the first voice information, and the same voice information uttered by the tester at a second position in the room where the noise is greater than -30 dB may be used as the second voice information. For example, if the tester says "test the noise of the indoor environment" to the voice acquisition device at a position A away from the window in the room (voice 1 for short), and says "test the noise of the indoor environment" to the voice acquisition device at a position B near the window (voice 2 for short), the first voice information may be voice 1 and the second voice information may be voice 2.

In the embodiment of the present application, the voice information may generally have time domain characteristics and frequency domain characteristics as a signal. Specifically, the time domain features of the speech information may be described by a magnitude dimension of the speech information and a time dimension of the speech information, and may be expressed, for example, as a function of a change in magnitude over time; through the time domain characteristics of the voice information, the information such as the energy value of the voice information in a certain time period and the amplitude value of the voice information at a certain time point can be obtained. The frequency domain features of the speech information may be described by an amplitude dimension of the speech information and a frequency dimension of the speech information, and may be represented, for example, as a function of amplitude versus frequency. The time domain features and the frequency domain features of the speech information may be transformed, and in one example, the time-varying speech information may be decomposed into a plurality of speech information having different frequencies by fourier transform.
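The time-to-frequency conversion mentioned above can be illustrated with a minimal sketch. It uses a naive discrete Fourier transform written with the Python standard library only; the function names, the 640 Hz sampling rate and the two test tones are assumptions chosen for illustration and do not come from the application.

```python
import cmath
import math


def dft_magnitudes(samples):
    """Magnitude of each DFT bin from 0 up to the Nyquist bin (naive O(N^2) DFT)."""
    n = len(samples)
    magnitudes = []
    for k in range(n // 2 + 1):
        acc = 0j
        for t, x in enumerate(samples):
            acc += x * cmath.exp(-2j * cmath.pi * k * t / n)
        magnitudes.append(abs(acc))
    return magnitudes


def dominant_frequency(samples, sample_rate_hz):
    """Frequency in Hz of the strongest non-DC component."""
    magnitudes = dft_magnitudes(samples)
    k = max(range(1, len(magnitudes)), key=lambda i: magnitudes[i])
    return k * sample_rate_hz / len(samples)


# Hypothetical time-varying signal: a 50 Hz tone plus a weaker 120 Hz tone.
SAMPLE_RATE_HZ = 640
signal = [
    math.sin(2 * math.pi * 50 * t / SAMPLE_RATE_HZ)
    + 0.5 * math.sin(2 * math.pi * 120 * t / SAMPLE_RATE_HZ)
    for t in range(64)
]
print(dominant_frequency(signal, SAMPLE_RATE_HZ))
```

With 64 samples at 640 Hz each bin spans 10 Hz, so the two tones fall on bins 5 and 12, and the time-domain signal is seen as two components of different frequency.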

In this embodiment of the application, the server may obtain training data acquired by the voice acquisition device, and determine a time domain feature value of the training data, where the time domain feature value of the training data may be used to indicate a feature of the training data in a time domain, and in a possible implementation, the time domain feature value of the training data may include one or more of a noise threshold, a long-term energy value, a short-term energy value, and a noise envelope tracking value. It is understood that the time domain feature value of the training data may further include other information, which is not limited in this embodiment of the present application.

In one example, an initial noise threshold (e.g., -40 dB) may be preset; voice information in the training data with an amplitude less than -40 dB may be identified as noise, and voice information with an amplitude greater than -40 dB may be identified as the voice of the user. In a specific implementation, different environments may have different initial noise thresholds; for example, the initial noise threshold corresponding to an office environment may be -40 dB, and the initial noise threshold corresponding to a mall may be -30 dB. In the embodiment of the present application, the initial noise threshold corresponding to each environment may be set empirically or determined experimentally by a person skilled in the art, which is not specifically limited.
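The threshold comparison above can be sketched as follows. The application does not say how the amplitude in dB is measured, so this sketch assumes an RMS level relative to a full-scale amplitude of 1.0, and it treats a level exactly equal to the threshold as noise; the table and function names are hypothetical.

```python
import math

# Per-environment initial thresholds in dB, using the example values from the text.
INITIAL_NOISE_THRESHOLD_DB = {"office": -40.0, "mall": -30.0}


def level_db(samples):
    """RMS level in dB relative to full scale 1.0 (an assumed reference level)."""
    if not samples:
        raise ValueError("samples must not be empty")
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20.0 * math.log10(rms) if rms > 0 else float("-inf")


def is_voice(samples, environment="office"):
    """True when the level is above the initial noise threshold of the environment."""
    return level_db(samples) > INITIAL_NOISE_THRESHOLD_DB[environment]
```

For instance, a segment at about -34 dB would count as voice under the office threshold but as noise under the mall threshold.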

The long-time energy value and the short-time energy value may be used to indicate energy information of the voice information within a preset time period. The following describes an implementation process of obtaining the long-term energy value and the short-term energy value of the first voice message by taking the first voice message as an example, and the process of obtaining the long-term energy value and the short-term energy value of the second voice message may be processed by referring to the first voice message.

In the embodiment of the application, the first voice information may be divided into frames in advance according to its total duration, yielding multiple frames of voice information. The durations of any two frames may be the same or different. In a specific implementation, the first voice information may be divided in multiple ways; in one possible implementation, a standard frame duration (e.g., 10 ms) may be preset, and the first voice information is divided according to the standard frame duration. In an example (referred to as example one for convenience of description), the total duration of the first voice information is 205 ms, so 21 frames are obtained after division according to the standard frame duration: the 1st to 20th frames each last 10 ms, the same as the standard frame duration, and the 21st frame lasts 5 ms.
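Example one can be reproduced with a short framing sketch. The 8000 Hz sampling rate and the function name are assumptions made for illustration; only the 10 ms standard frame duration and the 205 ms total duration come from the text.

```python
def split_into_frames(samples, sample_rate_hz, frame_ms=10):
    """Split samples into consecutive frames of frame_ms; the last frame may be shorter."""
    frame_len = sample_rate_hz * frame_ms // 1000
    if frame_len <= 0:
        raise ValueError("frame duration is too short for this sample rate")
    return [samples[i:i + frame_len] for i in range(0, len(samples), frame_len)]


# 205 ms of audio at an assumed 8000 Hz sampling rate is 1640 samples.
SAMPLE_RATE_HZ = 8000
first_voice = [0.0] * (SAMPLE_RATE_HZ * 205 // 1000)
frames = split_into_frames(first_voice, SAMPLE_RATE_HZ)
print(len(frames), len(frames[0]), len(frames[-1]))
```

Under these assumptions the split yields 21 frames, the first 20 holding 80 samples (10 ms) each and the last holding 40 samples (5 ms).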

Further, the long-term energy value may be set as the average of the total energy of the voice information of A frames (e.g., A = 3), and the short-term energy value may be set as the energy value of the voice information of the latest B frames (e.g., B = 1). Based on the frames and frame durations in example one, the long-term energy value may be the average of the total energy of 30 ms of voice information (or of 25 ms of voice information if the 21st frame is included); accordingly, the short-term energy value may be the energy value of the 21st frame. In the embodiments of the present application, the values of A and B may be adjusted by those skilled in the art according to actual situations, and are not limited thereto.
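A minimal sketch of the two energy values follows. The application does not define how frame energy is computed, so the sum of squared samples is assumed here, and the long-term value is taken over the latest A frames; both choices and all names are assumptions.

```python
def frame_energy(frame):
    """Energy of one frame, assumed here to be the sum of squared samples."""
    return sum(s * s for s in frame)


def long_term_energy(frames, a=3):
    """Average of the frame energies over the latest `a` frames."""
    if not frames:
        raise ValueError("frames must not be empty")
    recent = frames[-a:]
    return sum(frame_energy(f) for f in recent) / len(recent)


def short_term_energy(frames, b=1):
    """Total energy of the latest `b` frames (the most recent frame when b is 1)."""
    if not frames:
        raise ValueError("frames must not be empty")
    return sum(frame_energy(f) for f in frames[-b:])
```

With A = 3 and B = 1, the long-term value smooths over three frames while the short-term value follows only the newest frame, so a sudden change shows up in the short-term value first.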

The noise envelope tracking value may be used to estimate the amplitude of the noise. Generally, noise may have a wider temporal characteristic than speech, so an estimate of the noise amplitude may be obtained by tracking the minimum amplitude of each frame of voice information in the second voice information. For example, the minimum value of each frame in the second voice information may be extracted in advance, a noise envelope tracking map may be drawn accordingly, and the estimated noise information (e.g., a noise envelope value corresponding to the long-term energy value, a noise envelope value corresponding to the short-term energy value, etc.) may be calculated from the noise envelope tracking map and a preset rule (e.g., a fast-decreasing and slow-increasing principle).
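One way to read the fast-decreasing and slow-increasing principle is sketched below: the envelope drops immediately to a new, lower frame minimum and rises only by a small fraction toward a higher one. The update rule, the `rise` factor of 0.1 and the function names are assumptions; the application does not give the exact calculation.

```python
def frame_minima(frames):
    """Minimum absolute amplitude of each frame."""
    return [min(abs(s) for s in frame) for frame in frames]


def track_noise_envelope(minima, rise=0.1):
    """Follow the per-frame minima: fall at once, rise slowly (assumed smoothing rule)."""
    envelope = []
    current = None
    for m in minima:
        if current is None or m < current:
            current = m  # fast decrease: adopt the lower minimum immediately
        else:
            current += rise * (m - current)  # slow increase toward the higher minimum
        envelope.append(current)
    return envelope
```

For frame minima of 0.5, 0.1 and 0.3, this sketch produces an envelope of 0.5, 0.1 and roughly 0.12, so a brief loud passage barely raises the noise estimate.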

It should be noted that, in the embodiment of the present application, the noise threshold may also be adjusted according to the noise envelope tracking value. In one example, an increment (e.g., 2 dB) may be preset, and if the calculated noise envelope tracking value is greater than the initial noise threshold (e.g., -40 dB), the noise threshold may be increased successively by the preset increment. For example, if the first calculated noise envelope tracking value is -35 dB, the noise threshold may be adjusted to -33 dB, and the noise envelope tracking value may then be calculated based on the -33 dB noise threshold.

In this embodiment, the server may further determine a target value of the training data, where the target value may include a voice activity detection value of the first voice information and/or a full-band signal-to-noise ratio (SNR) of the second voice information. The voice activity detection value of the first voice information may be used to indicate whether voice or noise is detected. In one example, a first indication value (1) and a second indication value (0) may be preset: "1" may indicate that voice is currently detected and "0" that noise is currently detected, or the reverse, which is not limited in this embodiment of the application. The full-band SNR of the second voice information may be used to indicate the relative strength of voice and noise; in one example, it may be the ratio of the average power of the second voice information to the average power of the noise corresponding to the second voice information.
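The full-band SNR described above can be computed as a plain power ratio, as in the following sketch. It assumes that the noise corresponding to the second voice information is available as a separate sample sequence, which the application does not specify; the function names are hypothetical, and the conversion to dB is added only for convenience.

```python
import math


def average_power(samples):
    """Mean of the squared samples."""
    if not samples:
        raise ValueError("samples must not be empty")
    return sum(s * s for s in samples) / len(samples)


def full_band_snr(voice_samples, noise_samples):
    """Ratio of the average power of the voice information to that of its noise."""
    noise_power = average_power(noise_samples)
    if noise_power == 0:
        raise ValueError("noise power is zero, the ratio is undefined")
    return average_power(voice_samples) / noise_power


def to_db(power_ratio):
    """Express a power ratio in decibels."""
    return 10.0 * math.log10(power_ratio)
```

A voice segment with average power 4 and noise with average power 1 gives a ratio of 4, or about 6 dB.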

Step 202, the server trains the time domain characteristic value and the target value based on the deep learning training model constructed by the frequency domain to obtain model parameters.

Here, the deep learning training model may be constructed in various ways; in one possible implementation, it may be constructed based on Keras. Specifically, Keras is a highly modular neural network library written in Python that runs on top of a backend such as Theano, and it supports both graphics processing units (GPUs) and central processing units (CPUs).

In an embodiment of the application, the deep learning training model may include a voice activity detection module, a noise spectrum estimation module, and a spectral subtraction module. The voice activity detection module may detect the first voice information and the second voice information, and distinguish between voice and noise according to activity indicators (e.g., amplitude range) of the detected first voice information and second voice information. The noise spectrum estimation module may perform calculations on the first voice information and the second voice information and estimate the spectral characteristics of the noise from the results. The spectral subtraction module may determine a gain value according to the results obtained by the voice activity detection module and the noise spectrum estimation module, where the gain value may be used to suppress noise in the voice information.
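The application does not state how the spectral subtraction module derives its gain value. As a rough illustration only, the sketch below uses the textbook power spectral subtraction rule for a single frequency bin: the estimated noise power is subtracted from the noisy power, the result is floored, and the square root gives an amplitude gain. The function name and the floor of 0.01 are assumptions.

```python
import math


def spectral_subtraction_gain(noisy_power, noise_power, power_floor=0.01):
    """Amplitude gain for one frequency bin under textbook power spectral subtraction.

    The power gain 1 - noise/noisy is clamped to `power_floor` so that the bin is
    attenuated rather than zeroed when the noise estimate exceeds the noisy power.
    """
    if noisy_power <= 0:
        return math.sqrt(power_floor)
    power_gain = max(1.0 - noise_power / noisy_power, power_floor)
    return math.sqrt(power_gain)
```

A bin with noisy power 4 and estimated noise power 1 keeps about 87% of its amplitude, while a bin dominated by noise is reduced to the floor.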

In specific implementation, the server may use the time domain feature value of the training data as input information of the deep learning training model, and may use the target value of the training data as output information of the deep learning training model, so as to control the deep learning training model to perform model training according to the input information and the output information, thereby obtaining model parameters. In one example, the server may obtain the first model parameter by inputting the time domain feature value into the voice activity detection module; further, the server can obtain a second model parameter by inputting the first model parameter into the noise spectrum estimation module, and meanwhile, the server can obtain a third model parameter by inputting the first model parameter into the spectrum subtraction module; and finally, the server inputs the first model parameter, the second model parameter and the third model parameter into the spectrum subtraction module together to obtain the model parameters of the training data after deep learning training model training. It should be noted that, in the embodiment of the present application, each module in the deep learning training model may be a functional module constructed by Keras, that is, the voice activity detection module, the noise spectrum estimation module, and the spectrum subtraction module are introduced only for describing a process of determining the model parameters, and in a specific implementation, other modules may be further included, which is not limited in particular. Moreover, the names of the functional modules may also be other possible names, and are not limited specifically.
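The idea of using time domain feature values as inputs and target values as outputs to obtain model parameters can be illustrated without any deep learning library. The sketch below is a deliberately simplified stand-in, not the Keras-based model of the application: it trains a single logistic unit by batch gradient descent to map two hypothetical features (for example, short-term and long-term energy) to a voice activity target of 0 or 1. All names, the learning rate and the data are assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_voice_probability(params, features):
    """Probability that a feature vector corresponds to voice."""
    z = sum(w * x for w, x in zip(params["weights"], features)) + params["bias"]
    return sigmoid(z)


def train_logistic_unit(feature_rows, targets, learning_rate=1.0, epochs=2000):
    """Fit weights and bias by batch gradient descent on the cross-entropy loss."""
    params = {"weights": [0.0] * len(feature_rows[0]), "bias": 0.0}
    n = len(feature_rows)
    for _ in range(epochs):
        grad_w = [0.0] * len(params["weights"])
        grad_b = 0.0
        for row, target in zip(feature_rows, targets):
            error = predict_voice_probability(params, row) - target
            for j, x in enumerate(row):
                grad_w[j] += error * x
            grad_b += error
        params["weights"] = [
            w - learning_rate * g / n for w, g in zip(params["weights"], grad_w)
        ]
        params["bias"] -= learning_rate * grad_b / n
    return params


# Hypothetical training data: low-energy rows are noise (0), high-energy rows are voice (1).
feature_rows = [(0.1, 0.2), (0.2, 0.1), (0.8, 0.9), (0.9, 0.8)]
targets = [0, 0, 1, 1]
model_parameters = train_logistic_unit(feature_rows, targets)
```

The returned dictionary plays the role of the model parameters that the server would then send to the terminal device.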

Step 203, sending the model parameters to the terminal device.

Here, the server may send the model parameters to the terminal device by communicating with the terminal device, where the server may communicate with the terminal device in various ways, and in one example, the server may communicate with the terminal device in a wireless way; in another example, the server may also communicate with the terminal device through a wired line (e.g., an optical fiber, a network cable, etc.), which is not limited in this embodiment.

Step 204, the terminal equipment receives the model parameters sent by the server, and updates the parameters of the first voice noise reduction model in the terminal equipment by using the model parameters.

In this embodiment of the application, a first voice noise reduction model and a second voice noise reduction model may be preset in the terminal device, and the second voice noise reduction model may be a standby model of the first voice noise reduction model. The parameters of the first voice noise reduction model and the second voice noise reduction model in the initial state (for example, when the terminal device leaves the factory, or after the terminal device is initialized) may be the parameters of the Keras-based deep learning training model before model training is performed.

In a specific implementation, a preset flag may be set in the terminal device, and the preset flag may be used to indicate whether the first speech noise reduction model in the terminal device is being updated. In one example, after the terminal device receives the model parameters sent by the server and before it updates the parameters of the first voice noise reduction model with them, the terminal device may update the preset flag to a first indication value, which indicates that the first voice noise reduction model in the terminal device is being updated. Further, after obtaining the updated first voice noise reduction model, the terminal device may update the preset flag to a second indication value, which indicates that the first voice noise reduction model in the terminal device is not being updated, or that its update is complete. In the embodiment of the present application, the preset flag may be represented by one bit; for example, the first indication value may be "0" and the second indication value may be "1", or the first indication value may be "1" and the second indication value may be "0", which is not particularly limited.

In this embodiment of the application, if the terminal device detects that the first voice noise reduction model cannot perform noise reduction processing for some reason (for example, hardware of the terminal device is damaged, or the update of the first voice noise reduction model fails), the terminal device may also set the preset flag to the first indication value.

It should be noted that, in this embodiment of the application, after obtaining the updated first voice noise reduction model, the terminal device may synchronize the second voice noise reduction model with the first voice noise reduction model, for example, by updating the parameters of the second voice noise reduction model to the parameters of the first voice noise reduction model.

Step 205, after receiving the voice information input by the user, the terminal device checks the preset flag: if the preset flag is determined to be the first indication value, step 206a may be executed; if the preset flag is determined to be the second indication value, step 206b may be executed.

Step 206a, the terminal device performs noise reduction processing on the voice information by using the second voice noise reduction model.

Here, if the terminal device determines that the preset flag is the first indication value, this indicates that the first voice noise reduction model in the terminal device is being updated or cannot perform noise reduction processing. In this case, the terminal device may control the second voice noise reduction model in the terminal device to perform noise reduction processing on the voice information input by the user.

Specifically, the parameters of the second voice noise reduction model may be the model parameters that the terminal device received from the server's previous round of model training (or, in the initial state, the model parameters of a conventional voice noise reduction model). The terminal device may therefore perform noise reduction processing on the voice information input by the user by using the previously obtained model parameters (or the model parameters of the conventional voice noise reduction model).

Step 206b, the terminal device performs noise reduction processing on the voice information by using the updated first voice noise reduction model.

Here, if the terminal device determines that the preset flag is the second indication value, this indicates that the first voice noise reduction model in the terminal device has been updated. In this case, the terminal device may control the updated first voice noise reduction model to perform noise reduction processing on the voice information input by the user. That is, the terminal device may perform noise reduction processing on the voice information input by the user by using the latest model parameters obtained from the model training performed by the server.
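The flag handling in steps 204 to 206b can be sketched as follows. This is a minimal illustration, not the implementation of the application: the class names `NoiseReductionModel` and `Terminal`, the single-gain "model", the one-bit encoding (0 for the first indication value, 1 for the second), and the choice of the second indication value as the initial flag are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Sequence

# Assumed one-bit encoding of the preset flag (the text allows either mapping).
UPDATING = 0  # first indication value: first model is being updated or unusable
READY = 1     # second indication value: first model is not being updated


@dataclass
class NoiseReductionModel:
    """Hypothetical stand-in for a voice noise reduction model."""
    params: List[float]

    def denoise(self, frame: Sequence[float]) -> List[float]:
        # Placeholder processing: scale every sample by a single gain parameter.
        gain = self.params[0]
        return [gain * x for x in frame]


class Terminal:
    """Terminal device holding a first model and a standby second model."""

    def __init__(self, initial_params: Sequence[float]) -> None:
        self.first = NoiseReductionModel(list(initial_params))
        self.second = NoiseReductionModel(list(initial_params))
        self.flag = READY  # assumption: no update is in progress at start-up

    def on_model_params(self, params: Sequence[float]) -> None:
        """Step 204: update the first model, guarded by the preset flag."""
        self.flag = UPDATING                # set before the update starts
        self.first.params = list(params)    # update the first model
        self.flag = READY                   # update finished
        self.second.params = list(params)   # synchronize the standby model

    def denoise(self, frame: Sequence[float]) -> List[float]:
        """Steps 205, 206a and 206b: pick the model according to the flag."""
        model = self.second if self.flag == UPDATING else self.first
        return model.denoise(frame)
```

In this sketch the standby model is synchronized only after the first model has been updated, so during the next update it still holds the parameters from the previous round of training, which matches the description of step 206a.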

In the embodiment of the application, the time domain characteristic value and the target value of the training data are obtained and are trained by using the deep learning training model constructed based on the frequency domain. Because the characteristic values reflect time domain characteristics and the model reflects frequency domain characteristics, the time domain and frequency domain characteristics of the voice information are combined during training, which improves the training performance of the deep learning training model and accelerates its training. Moreover, the training data may include voice information collected in various environments, and the deep learning training model may be trained multiple times with different training data, so that more accurate model parameters can be obtained. Furthermore, the parameters of the first voice noise reduction model in the terminal device are updated with the model parameters obtained by training the deep learning training model, so that the terminal device can use these model parameters to perform noise reduction processing on the voice information input by the user; the voice information obtained by the noise reduction processing can therefore be more accurate, which improves the user experience. In addition, the training of the deep learning training model to obtain model parameters and the noise reduction processing of the user's voice information by the terminal device can be performed in parallel, which can improve the processing speed and efficiency of voice noise reduction.

For the above method flow, an embodiment of the present application further provides a noise reduction apparatus, and specific contents of the apparatus may be implemented with reference to the above method.

Fig. 3 is a schematic structural diagram of a server according to an embodiment of the present application, where the server includes:

an obtaining module 301, configured to obtain training data, where the training data includes first voice information collected in a first environment and second voice information collected in a second environment, noise in the first environment is less than or equal to a preset threshold, and noise in the second environment is greater than the preset threshold;

a determining module 302, configured to determine, according to the training data, a time-domain feature value and a target value of the training data, where the time-domain feature value of the training data includes one or more of a noise threshold, a long-time energy value, a short-time energy value, and a noise envelope tracking value; the target value of the training data comprises a voice activity detection value of the first voice information and/or a full-band signal-to-noise ratio of the second voice information;

a processing module 303, configured to train the time domain feature value and the target value by using a deep learning training model constructed based on the frequency domain to obtain model parameters, and send the model parameters to a terminal device, so that the terminal device performs noise reduction processing on voice information input by a user by using the model parameters.
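As an illustration of the kind of time domain feature values the determining module 302 works with, the following sketch computes a short-time energy per frame and a smoothed long-time energy. The definitions used here (mean squared amplitude per frame, and exponential smoothing with a factor `alpha`) are common signal processing conventions assumed for the example; this section does not state the exact formulas of the application, and the function names are hypothetical.

```python
from typing import List, Sequence


def frame_signal(samples: Sequence[float], frame_len: int, hop: int) -> List[List[float]]:
    """Split a signal into frames of frame_len samples, advancing by hop samples."""
    last_start = len(samples) - frame_len
    return [list(samples[i:i + frame_len]) for i in range(0, last_start + 1, hop)]


def short_time_energy(frame: Sequence[float]) -> float:
    """Assumed definition: mean squared amplitude of one frame."""
    return sum(x * x for x in frame) / len(frame)


def long_time_energy(energies: Sequence[float], alpha: float = 0.9) -> List[float]:
    """Assumed definition: exponentially smoothed short-time energy.

    The smoother is initialized with the first short-time energy value.
    """
    smoothed: List[float] = []
    state = energies[0] if energies else 0.0
    for e in energies:
        state = alpha * state + (1.0 - alpha) * e
        smoothed.append(state)
    return smoothed
```

For example, `[short_time_energy(f) for f in frame_signal(samples, 160, 80)]` would give one energy value per 160-sample frame with 50% overlap; the frame sizes are illustrative only.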

Fig. 4 is a schematic structural diagram of a terminal device provided in an embodiment of the present application, where the terminal device includes:

a transceiver module 401, configured to receive a model parameter sent by a server;

an updating module 402, configured to update parameters of a first speech noise reduction model in the terminal device according to the model parameters, to obtain an updated first speech noise reduction model;

and a noise reduction module 403, configured to, after receiving the voice information input by the user, perform noise reduction processing on the voice information by using the updated first voice noise reduction model.

Optionally, after the transceiver module 401 receives the model parameters sent by the server, the updating module 402 is further configured to: update the preset flag to a first indication value, and, after the updated first voice noise reduction model is obtained, update the preset flag to a second indication value.

Optionally, the noise reduction module 403 is further configured to:

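before performing noise reduction processing on the voice information by using the updated first voice noise reduction model, determine that the preset flag is the second indication value.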

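Optionally, the noise reduction module 403 is further configured to:

after receiving the voice information input by the user, if the preset flag is determined to be the first indication value, perform noise reduction processing on the voice information by using a second voice noise reduction model in the terminal device, where the second voice noise reduction model is a standby model of the first voice noise reduction model.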

From the above, it can be seen that: in the above embodiment of the application, the server takes the first voice information collected in the first environment (noise less than or equal to the preset threshold) and the second voice information collected in the second environment (noise greater than the preset threshold) as training data. After determining the time domain characteristic value and the target value of the training data, the server trains them by using the deep learning training model constructed based on the frequency domain to obtain the model parameters, and sends the model parameters to the terminal device. After receiving the model parameters sent by the server, the terminal device can update the parameters of the first voice noise reduction model in the terminal device and perform noise reduction processing on the voice information input by the user by using the updated first voice noise reduction model. Because the characteristic values reflect time domain characteristics and the model reflects frequency domain characteristics, the time domain and frequency domain characteristics of the voice information are combined during training, which improves the training performance of the deep learning training model and accelerates its training. Moreover, the training data may include voice information collected in various environments, and the deep learning training model may be trained multiple times with different training data, so that more accurate model parameters can be obtained. Furthermore, because the parameters of the first voice noise reduction model in the terminal device are updated with the model parameters obtained by training the deep learning training model, the voice information obtained by the noise reduction processing can be more accurate, which improves the user experience. In addition, the training of the deep learning training model to obtain model parameters and the noise reduction processing of the user's voice information by the terminal device can be performed in parallel, which can improve the processing speed and efficiency of voice noise reduction.

It will be apparent to those skilled in the art that embodiments of the present application may be provided as a method or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

While the preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all alterations and modifications as fall within the scope of the application.

It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.

Claims

1. A noise reduction method, characterized in that the method comprises:

The server acquires training data, where the training data includes first voice information collected in a first environment and second voice information collected in a second environment, the noise in the first environment is less than or equal to a preset threshold, and the noise in the second environment is greater than the preset threshold;

The server determines, according to the training data, a time-domain characteristic value and a target value of the training data; the time-domain characteristic value of the training data includes one or more of a noise threshold, a long-term energy value, a short-term energy value, and a noise envelope tracking value; the target value of the training data includes the voice activity detection value of the first voice information and/or the full-band signal-to-noise ratio of the second voice information;

The server trains the time domain feature value and the target value based on the deep learning training model constructed in the frequency domain, obtains model parameters, and sends the model parameters to the terminal device, where the model parameters are used by the terminal device to perform noise reduction processing on the voice information input by the user.

2. A noise reduction method, characterized in that the method comprises:

The terminal device receives the model parameters sent by the server;

The terminal device updates the parameters of the first voice noise reduction model in the terminal device according to the model parameters, to obtain an updated first voice noise reduction model;

After receiving the voice information input by the user, the terminal device uses the updated first voice noise reduction model to perform noise reduction processing on the voice information.

3. The method according to claim 2, wherein after the terminal device receives the model parameters sent by the server, the method further comprises:

The terminal device updates the preset flag to the first indication value;

After the terminal device obtains the updated first voice noise reduction model, the method further includes:

The terminal device updates the preset flag to a second indication value.

4. The method according to claim 3, wherein, after the terminal device receives the voice information input by the user and before the terminal device uses the updated first voice noise reduction model to perform noise reduction processing on the voice information, the method further comprises:

The terminal device determines that the preset flag is the second indication value.

5. The method according to claim 3, wherein the method further comprises:

After the terminal device receives the voice information input by the user, if it is determined that the preset flag is the first indication value, the second voice noise reduction model in the terminal device is used to perform noise reduction processing on the voice information; the second voice noise reduction model is a backup model of the first voice noise reduction model.

6. A server, characterized in that the server comprises:

an acquisition module, configured to acquire training data, where the training data includes first voice information collected in a first environment and second voice information collected in a second environment, the noise in the first environment is less than or equal to a preset threshold, and the noise in the second environment is greater than the preset threshold;

a determination module, configured to determine, according to the training data, a time-domain characteristic value and a target value of the training data, where the time-domain characteristic value of the training data includes one or more of a noise threshold, a long-term energy value, a short-term energy value, and a noise envelope tracking value; the target value of the training data includes the voice activity detection value of the first voice information and/or the full-band signal-to-noise ratio of the second voice information;

a processing module, configured to train the time domain feature value and the target value based on the deep learning training model constructed in the frequency domain, obtain model parameters, and send the model parameters to the terminal device, so that the terminal device uses the model parameters to perform noise reduction processing on the voice information input by the user.

7. A terminal device, wherein the terminal device comprises:

The transceiver module is used to receive the model parameters sent by the server;

an update module, configured to update the parameters of the first voice noise reduction model in the terminal device according to the model parameters to obtain an updated first voice noise reduction model;

A noise reduction module, configured to perform noise reduction processing on the voice information using the updated first voice noise reduction model after receiving the voice information input by the user.

8. The terminal device according to claim 7, wherein after the transceiver module receives the model parameters sent by the server, the update module is further configured to: update the preset flag to the first indication value, and, after the updated first voice noise reduction model is obtained, update the preset flag to a second indication value.

9. The terminal device according to claim 8, wherein the noise reduction module is further configured to:

determine that the preset flag is the second indication value.

10. The terminal device according to claim 9, wherein the noise reduction module is further configured to:

after receiving the voice information input by the user, if it is determined that the preset flag is the first indication value, use the second voice noise reduction model in the terminal device to perform noise reduction processing on the voice information; the second voice noise reduction model is a backup model of the first voice noise reduction model.