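WO2024139730A1 - Audio data processing method, apparatus, device, computer-readable storage medium, and computer program product - Google Patents
Audio data processing method, apparatus, device, computer-readable storage medium, and computer program product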
- Publication number
- WO2024139730A1 (PCT/CN2023/129766)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- audio data
- noise reduction
- noise
- data
- target
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0272—Voice signal separating
- G10L21/028—Voice signal separating using properties of sound source
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L21/0216—Noise filtering characterised by the method used for estimating noise
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L21/0216—Noise filtering characterised by the method used for estimating noise
- G10L21/0232—Processing in the frequency domain
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L21/0264—Noise filtering characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/18—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/60—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for measuring the quality of voice signals
Definitions
- the embodiments of the present application provide an audio data processing method, apparatus, device, computer-readable storage medium, and computer program product, which can avoid loss of valid audio data during noise reduction, thereby improving the quality of audio data.
- the present application provides an audio data processing device, including:
- an acquisition module, used for acquiring original noisy audio data to be processed and target scene parameters associated with the original noisy audio data
- a target noise reduction intensity parameter for denoising the original noisy audio data is adaptively determined through a target scene parameter associated with the original noisy audio data, and based on the target noise reduction intensity parameter, the noise content in the original noisy audio data is quantitatively reduced. That is, the target scene parameter here reflects at least one of the application scene and the acquisition scene of the original noisy audio data, and the target noise reduction intensity parameter reflects the strength of noise suppression in the original noisy audio data.
- the noise content in the original noisy audio data is quantitatively reduced and a certain degree of residual noise is accepted. Because the noise data and audio data in the original noisy audio data are not completely separated in an attempt to suppress the noise entirely, loss of effective audio data during noise reduction is avoided, which improves the quality of the audio data and the flexibility of noise processing.
- FIG. 1 is a schematic diagram of an audio data processing system provided by the present application.
- FIG. 2 is a schematic diagram of an interactive scenario of an audio data processing method provided by the present application.
- FIG. 3 is a flow chart of an audio data processing method provided by the present application.
- FIG. 5 is a schematic diagram of the structure of a target noise reduction processing model provided by the present application.
- FIG. 6 is a schematic diagram of PESQ scores of noisy audio data under different noise reduction intensity parameters provided by the present application.
- FIG. 7 is a schematic diagram of SI-SNR scores of noisy audio data under different noise reduction intensity parameters provided by the present application.
- FIG. 8 is a schematic diagram of the structure of an audio data processing device provided in an embodiment of the present application.
- FIG. 9 is a schematic diagram of the structure of a computer device provided in an embodiment of the present application.
- AIaaS (AI as a Service): artificial intelligence cloud services
- the AIaaS platform splits several common AI services and provides them as independent or packaged services in the cloud.
- This service model is similar to opening an AI-themed mall: all developers can access one or more artificial intelligence services provided by the platform through an API, and some senior developers can also use the AI framework and AI infrastructure provided by the platform to deploy and operate their own cloud artificial intelligence services.
- the artificial intelligence cloud service includes a target noise reduction processing model for performing noise reduction processing on noisy audio data.
- the computer device can call the target noise reduction processing model in the artificial intelligence cloud service through the API interface, and input the original noisy audio data and the target noise reduction intensity parameter into the target noise reduction processing model.
- the target noise reduction processing model is used to perform noise reduction processing on the original noisy audio data based on the target noise reduction intensity parameter, so as to quantitatively reduce the noise content in the original noisy audio data, avoid the loss of effective audio data during noise reduction, improve the quality of audio data, and make the noise reduction processing of audio data more intelligent.
- different computer devices can call the target noise reduction processing model, so that multiple computer devices share the same model. This improves the utilization rate of the target noise reduction processing model, and because no computer device needs to train its own model separately, it reduces the computing resource overhead of the computer devices.
- the audio data processing system for implementing the present application is first introduced.
- the audio data processing system includes a server 10 and a terminal cluster.
- the terminal cluster may include one or more terminals, and the number of terminals is not limited here.
- the terminal cluster may include terminal 1, terminal 2, ..., terminal n; it can be understood that terminal 1, terminal 2, terminal 3, ..., terminal n can all be connected to the server 10 through a network connection, so that each terminal can exchange data with the server 10 through a network connection.
- the target application here can refer to an application with a voice communication function; for example, the target application may be an independent application, a web application, a mini program in a host application, etc.
- Any terminal in the terminal cluster can be used as a sending terminal or a receiving terminal.
- the sending terminal can refer to a terminal that generates original noisy audio data and sends the original noisy audio data
- the receiving terminal can refer to a terminal that receives the original noisy audio data.
- for example, when user 1 corresponding to terminal 1 is communicating by voice with user 2 corresponding to terminal 2 and user 1 needs to send audio data to user 2, terminal 1 can be called a sending terminal and terminal 2 can be called a receiving terminal; similarly, when user 2 needs to send audio data to user 1, terminal 2 can be called a sending terminal and terminal 1 can be called a receiving terminal.
- the server 10 refers to a device that provides backend services for a target application in a terminal.
- the server can be used to perform noise reduction processing on the original noisy audio data sent by the sending terminal, and forward the noise-reduced audio data to the receiving terminal.
- the server 10 can be used to forward the original noisy audio data sent by the sending terminal to the receiving terminal, and the receiving terminal performs noise reduction processing on the original noisy audio data to obtain the noise-reduced audio data.
- the server can be used to receive the noise-reduced audio data sent by the sending terminal and forward it to the receiving terminal; that is, the noise-reduced audio data is obtained by the sending terminal performing noise reduction processing on the original noisy audio data.
- the original noisy audio data in the embodiments of the present application may refer to audio data collected by the microphone of the sending terminal, that is, the original noisy audio data refers to audio data that has not been subjected to noise reduction processing, and usually the original noisy audio data includes audio data and noise data.
- Audio data may refer to data useful to the user, such as the audio data may refer to voice data in the process of user voice communication, or the audio data may refer to music works recorded by the user, etc.; the audio data may be collected from the sounds emitted by people, animals, robots, etc.
- the noise data here may refer to data that is meaningless to the user, such as the noise data may refer to environmental noise. For example, in the process of user voice communication, all audio data except the voice data of the two parties to the call are noise data.
- the server can be an independent physical server, or a server cluster or distributed system composed of at least two physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, content delivery networks (CDNs), and big data and artificial intelligence platforms.
- the terminal can refer to a vehicle-mounted terminal, a smart phone, a tablet computer, a laptop computer, a desktop computer, a smart speaker, a speaker with a screen, a smart watch, etc., but is not limited thereto.
- Each terminal and server can be directly or indirectly connected via wired or wireless communication.
- the number of terminals and servers can be one or at least two, and this application does not limit this.
- the audio data processing system of Figure 1 can be applied to voice communication scenarios, live broadcast scenarios, audio and video recording scenarios, etc.
- the following takes the audio data processing system of Figure 1 applied to the voice communication scenario shown in Figure 2 as an example for explanation.
- the terminal 20a in Figure 2 can be any terminal in the terminal cluster in Figure 1
- the terminal 21a in Figure 2 can be any terminal in the terminal cluster in Figure 1 except the terminal 20a
- the server 22a in Figure 2 can be the server 10 in Figure 1.
- the terminal 20a can collect the speech process of the user 1 to obtain the original noisy audio data 1; the original noisy audio data 1 includes the speech content of the user 1 (i.e., the voice data 1) and the noise data 1, and the noise data 1 reflects the environmental noise during the speech process of the user 1, such as the howling emitted by the terminal 20a, or the speech content of other people, etc. After the terminal 20a collects the original noisy audio data 1, the original noisy audio data 1 can be sent to the server 22a.
- after receiving the original noisy audio data 1, the server 22a can obtain the target scene parameter 1 of the original noisy audio data 1; the target scene parameter 1 can be used to reflect at least one of the collection scene and application scene of the original noisy audio data 1, and the application scene of the original noisy audio data 1 is taken as an example for explanation.
- the target scenario parameter 1 reflects that the application scenario of the original noisy audio data 1 is a voice communication scenario.
- the server 22a can query the noise reduction intensity parameter corresponding to the application scenario of the original noisy audio data 1 according to the correspondence between the application scenario and the noise reduction intensity parameter, and determine the queried noise reduction intensity parameter as the target noise reduction intensity parameter 1 corresponding to the original noisy audio data 1.
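The query described above can be pictured as a simple table lookup. The following is a minimal sketch only: the scenario names, the dB values, and the fallback value are illustrative assumptions and do not come from the application.

```python
# Hypothetical correspondence between application scenarios and noise
# reduction intensity parameters (in dB). Names and values are illustrative
# assumptions, not figures from the application.
SCENARIO_TO_INTENSITY_DB = {
    "voice_communication": 10.0,
    "live_broadcast": 6.0,
    "music_playback": 3.0,
}

# Assumed fallback used when a scenario has no entry in the table.
DEFAULT_INTENSITY_DB = 5.0


def query_target_intensity(application_scenario: str) -> float:
    """Return the noise reduction intensity parameter for a scenario."""
    return SCENARIO_TO_INTENSITY_DB.get(application_scenario, DEFAULT_INTENSITY_DB)


target_intensity_1 = query_target_intensity("voice_communication")
```

A table keeps the scenario-to-intensity policy separate from the noise reduction model itself, so the policy can be tuned without retraining.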
- terminal 21a can collect the speech process of user 2 to obtain original noisy audio data 2; the original noisy audio data 2 includes the speech content of user 2 (i.e., voice data 2) and noise data 2, and the noise data 2 reflects the environmental noise during the speech process of user 2, such as the howling emitted by terminal 21a, or the speech content of other people, etc.
- after terminal 21a collects the original noisy audio data 2, it can send the original noisy audio data 2 to server 22a.
- after server 22a receives the original noisy audio data 2, it can obtain the target scene parameter 2 of the original noisy audio data 2; the target scene parameter 2 can be used to reflect at least one of the collection scene and application scene of the original noisy audio data 2, and the application scene of the original noisy audio data 2 is used as an example for explanation.
- the target scenario parameter 2 reflects that the application scenario of the original noisy audio data 2 is a voice communication scenario.
- the server 22a can query the noise reduction intensity parameter corresponding to the application scenario of the original noisy audio data 2 according to the correspondence between the application scenario and the noise reduction intensity parameter, and determine the queried noise reduction intensity parameter as the target noise reduction intensity parameter 2 corresponding to the original noisy audio data 2.
- the target noise reduction intensity parameter 2 reflects the intensity of the noise reduction processing required for the noise data in the original noisy audio data 2. Therefore, the server 22a can perform noise reduction processing on the original noisy audio data 2 according to the target noise reduction intensity parameter 2 to obtain the target enhanced audio data 2, and send the target enhanced audio data 2 to the terminal 20a. Some noise data is retained in the target enhanced audio data 2, because completely separating the audio data and noise data of the original noisy audio data 2 would damage the audio data in the original noisy audio data 2.
- after the terminal 20a receives the target enhanced audio data 2, the user 1 can perceive the environment in which the user 2 is located based on the target enhanced audio data 2, so that the voice communication process feels more real and full.
- FIG3 is a flowchart of an audio data processing method provided in an embodiment of the present application.
- the method can be executed by any terminal in the terminal cluster in FIG1, or by the server in FIG1.
- the devices used to execute the audio data processing method in the embodiment of the present application can be collectively referred to as computer devices.
- the method may include the following steps:
- the computer device may obtain the location information of the acquisition device of the original noisy audio data, determine the location information of the acquisition device as the location information of the recording environment of the original noisy audio data, and determine the location information of the recording environment as the target scene parameter of the original noisy audio data.
- the target scene parameter can be used to determine the acquisition scene of the original noisy audio data. For example, if the recording environment is determined to be a park based on the location information of the recording environment, it indicates that the acquisition scene of the original noisy audio data is outdoors or in an open place; if the recording environment is determined to be an office building based on the location information of the recording environment, it indicates that the acquisition scene of the original noisy audio data is indoors, in a private place, etc.
- the computer device may obtain a program identifier corresponding to a recording application of the original noisy audio data, and determine the program identifier of the recording application as a target scene parameter of the original noisy audio data;
- the recording application may include but is not limited to: a voice call application, a conference application, a music playback application, etc.
- the program identifier may be a program name, number, etc.
- the target scene parameter may be used to determine the application scene of the original noisy audio data, for example, if the program identifier of the recording application indicates that the recording application is a voice call application, it indicates that the application scene of the original noisy audio data is a voice call scene; if the program identifier for the recording application indicates that the recording application is a conference application, it indicates that the application scene of the original noisy audio data is a conference application scene.
- the target scene parameters may include one or more of the environmental parameters of the recording environment of the original noisy audio data, the location information of the recording environment, and the program identifier corresponding to the recording application.
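As a rough illustration of how a target scene parameter could be turned into an acquisition scene and an application scene, the sketch below uses two lookup tables. The class name, field names, table keys, and scene labels are all hypothetical; only the park/office building and voice call/conference examples come from the text.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class TargetSceneParameter:
    """Hypothetical container for the scene parameters named in the text."""
    location_type: Optional[str] = None  # e.g. "park", "office_building"
    program_id: Optional[str] = None     # identifier of the recording application


# Assumed mappings that mirror the examples in the text.
LOCATION_TO_ACQUISITION_SCENE = {
    "park": "outdoor",
    "office_building": "indoor",
}
PROGRAM_TO_APPLICATION_SCENE = {
    "voice_call_app": "voice_call",
    "conference_app": "conference",
}


def resolve_scenes(param: TargetSceneParameter) -> Tuple[Optional[str], Optional[str]]:
    """Return (acquisition_scene, application_scene); None when unknown."""
    acquisition = LOCATION_TO_ACQUISITION_SCENE.get(param.location_type)
    application = PROGRAM_TO_APPLICATION_SCENE.get(param.program_id)
    return acquisition, application


scenes = resolve_scenes(TargetSceneParameter("park", "conference_app"))
```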
- the computer device may determine the collection scene associated with the original noisy audio data based on the location information of the device that collected the original noisy audio data, and the collection scene includes indoors, outdoors, private places or open places, etc.
- the target scene parameter is used to determine the application scene of the original noisy audio data
- the computer device may determine the application scene of the original noisy audio data based on the usage indication information of the owner of the original noisy audio data; the usage indication information is used to indicate the application scene of the original noisy audio data, and the application scene may include voice communication, live broadcast, and music playback scenes, etc.
- the computer device may determine the collection scene associated with the original noisy audio data based on the location information of the device that collected the original noisy audio data (the collection scene may include indoors, outdoors, private places or open places, etc.), and determine the application scene of the original noisy audio data based on the usage indication information of the owner of the original noisy audio data.
- S102 Determine a target noise reduction intensity parameter for performing noise reduction processing on the original noisy audio data according to the target scene parameter.
- the computer device can determine the target noise reduction intensity parameter for denoising the original noisy audio data according to the target scene parameter; the target noise reduction intensity parameter is used to indicate the amount of data (i.e., content) corresponding to the noise data to be removed in the original noisy audio data, that is, the target noise reduction intensity parameter is used to indicate the strength of noise reduction for the noise data in the original noisy audio data.
- for example, the target noise reduction intensity parameter indicates that the intensity (i.e., power) of the noise data in the original noisy audio data is reduced by 5 dB, and the intensity of the noise data remaining after noise reduction (i.e., in the target enhanced audio data) is 1 dB.
- the target noise reduction intensity parameter is 5 dB; the original signal-to-noise ratio of the original noisy audio data is the ratio between the power of the audio data in the original noisy audio data and the power of the noise data in the original noisy audio data.
- the larger the target noise reduction intensity parameter, the greater the noise reduction intensity for the original noisy audio data, and the more noise data is removed from the original noisy audio data; the smaller the target noise reduction intensity parameter, the smaller the noise reduction intensity for the original noisy audio data, and the less noise data is removed from the original noisy audio data.
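The relationship between the intensity parameter and the signal-to-noise ratio can be checked with a little decibel arithmetic. This is a sketch under the assumption that the parameter is a power reduction in dB applied to the noise only; the function names and the sample power values are illustrative.

```python
import math


def snr_db(speech_power: float, noise_power: float) -> float:
    """Signal-to-noise ratio in dB: ratio of speech power to noise power."""
    return 10.0 * math.log10(speech_power / noise_power)


def attenuate_noise_power(noise_power: float, intensity_db: float) -> float:
    """Noise power left after reducing the noise by intensity_db decibels."""
    return noise_power * 10.0 ** (-intensity_db / 10.0)


# Illustrative values: speech power 100, noise power 10 (original SNR 10 dB).
original_snr = snr_db(100.0, 10.0)
residual_noise = attenuate_noise_power(10.0, 5.0)
enhanced_snr = snr_db(100.0, residual_noise)  # original SNR plus 5 dB
```

Under this assumption the enhanced signal-to-noise ratio is simply the original ratio plus the intensity parameter, and a larger parameter always leaves less residual noise.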
- Method three: if the target scene parameter includes a program identifier corresponding to the recording application, together with environmental parameters of the recording environment of the original noisy audio data and/or location information of the recording environment, the computer device can determine the application scenario of the original noisy audio data according to the program identifier corresponding to the recording application, and determine the acquisition scenario of the original noisy audio data according to at least one of the environmental parameters of the recording environment and the location information of the recording environment; that is, the target scene parameter reflects both the acquisition scenario and the application scenario of the original noisy audio data.
- the computer device can obtain the quality requirement level of the audio data in the application scenario, and determine the first noise reduction intensity parameter for performing noise reduction processing on the original noisy audio data according to the quality requirement level.
- for example, if a noise change characteristic of the noise data of the first noise type in the sample noisy audio data indicates that the intensity of the noise data of the first noise type varies within the range of [5, 10] dB, the computer device can determine the noise reduction intensity parameter corresponding to the noise data of the first noise type according to [5, 10] dB, such as determining 7.5 dB as the noise reduction intensity parameter corresponding to the noise data of the first noise type.
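The 7.5 dB figure in this example is the midpoint of the [5, 10] dB range. Taking the midpoint is only one possible rule and is assumed here for illustration; the function name is hypothetical.

```python
def intensity_from_range(low_db: float, high_db: float) -> float:
    """Pick the midpoint of an observed noise intensity range, in dB."""
    if low_db > high_db:
        raise ValueError("low_db must not exceed high_db")
    return (low_db + high_db) / 2.0


first_noise_type_intensity = intensity_from_range(5.0, 10.0)
```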
- the above step S205 includes: the computer device can obtain the error function of the initial noise reduction processing model, and substitute the predicted speech enhancement data and the annotated speech enhancement data into the error function to obtain the noise reduction processing error of the initial noise reduction processing model.
- the error function of the initial noise reduction processing model can be a mean square error function or a cross entropy function, etc.
- the noise reduction processing error is used to measure the noise reduction processing accuracy of the initial noise reduction processing model, that is, the larger the noise reduction processing error, the lower the noise reduction processing accuracy of the initial noise reduction processing model; the smaller the noise reduction processing error, the higher the noise reduction processing accuracy of the initial noise reduction processing model.
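For the mean square error case, the noise reduction processing error can be sketched as follows. The function name and the flat-list representation of the predicted and annotated speech enhancement data are assumptions made for illustration.

```python
from typing import Sequence


def mean_square_error(predicted: Sequence[float], annotated: Sequence[float]) -> float:
    """Mean of squared differences between predicted and annotated samples."""
    if len(predicted) != len(annotated) or len(predicted) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum((p - a) ** 2 for p, a in zip(predicted, annotated))
    return total / len(predicted)


# A perfect prediction gives zero error; larger deviations give larger error.
perfect_error = mean_square_error([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
poor_error = mean_square_error([0.0, 0.0], [1.0, 3.0])
```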
- step S206 in the embodiment of the present application can refer to the above explanation of step S101
- step S207 in the embodiment of the present application can refer to the above explanation of step S102.
- the computer device may parse the frequency domain signal of the original noisy audio data through the speech analysis network of the target noise reduction processing model to obtain the cosine transform mask of the original noisy audio data; the cosine transform mask reflects the proportion of audio data within the original noisy audio data.
- the speech generation network of the target noise reduction processing model may be used to generate the target enhanced audio data according to the cosine transform mask of the original noisy audio data, the frequency domain signal of the original noisy audio data, and the target noise reduction intensity parameter.
- the noise data and the speech data can be distinguished by the difference in the spectrum template of the original noisy audio data, that is, the computer device can extract key speech features based on the difference in the spectrum template of the original noisy audio data.
- the first speech feature extraction mode, the second speech feature extraction mode and the third speech feature extraction mode in the embodiment of the present application are methods of extracting key speech features from different angles.
- the first speech feature extraction mode, the second speech feature extraction mode and the third speech feature extraction mode are each one of the above-mentioned extraction modes: the extraction mode based on frequency distribution features, the extraction mode based on spectrum flatness, and the extraction mode based on spectrum template differences.
- the first speech feature extraction mode, the second speech feature extraction mode and the third speech feature extraction mode may all be different, or at least two of the extraction modes may be the same.
- the target noise reduction processing model includes a feature extraction network 501, a speech analysis network 502, and a speech generation network 503.
- the feature extraction network is used to convert the original noisy audio data from the time domain into the frequency domain to obtain the frequency domain signal of the original noisy audio data.
- the feature extraction network first resamples the original noisy audio data x_n, so that original noisy audio data of any sampling rate is resampled to 48 kHz. After the resampling is completed, the resampled original noisy audio data is subjected to framing and windowing.
- the resampled original noisy audio data can be divided into multiple noisy audio data segments according to a frame length of 1024 and a frame shift of 512, and each of the noisy audio data segments is modulated using a Hamming window.
- a discrete cosine transform (DCT) operation is performed on the modulated noisy audio data segments to obtain the frequency domain signal X_k of the original noisy audio data.
- DCT: discrete cosine transform
- SDCT: short-time discrete cosine transform
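The framing, windowing, and DCT steps above can be sketched in plain Python. The frame length of 1024 and frame shift of 512 come from the text; the unnormalized DCT-II variant, the symmetric Hamming window coefficients, and dropping a trailing partial frame are assumptions, since the text does not specify them. Resampling to 48 kHz is omitted.

```python
import math
from typing import List, Sequence


def hamming_window(length: int) -> List[float]:
    """Symmetric Hamming window of the given length (length > 1)."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (length - 1))
            for n in range(length)]


def split_frames(samples: Sequence[float], frame_length: int = 1024,
                 frame_shift: int = 512) -> List[Sequence[float]]:
    """Split samples into overlapping frames; a trailing partial frame is dropped."""
    last_start = len(samples) - frame_length
    return [samples[start:start + frame_length]
            for start in range(0, last_start + 1, frame_shift)]


def dct_ii(frame: Sequence[float]) -> List[float]:
    """Unnormalized DCT-II, written as a direct O(N^2) sum for clarity."""
    size = len(frame)
    return [sum(x * math.cos(math.pi * (n + 0.5) * k / size)
                for n, x in enumerate(frame))
            for k in range(size)]


def short_time_dct(samples: Sequence[float], frame_length: int = 1024,
                   frame_shift: int = 512) -> List[List[float]]:
    """Window each frame with a Hamming window, then apply the DCT."""
    window = hamming_window(frame_length)
    return [dct_ii([x * w for x, w in zip(frame, window)])
            for frame in split_frames(samples, frame_length, frame_shift)]


# Small demo with a short frame so the direct DCT stays fast.
demo_spectrum = short_time_dct([1.0] * 16, frame_length=8, frame_shift=4)
```

With a frame shift equal to half the frame length, consecutive frames overlap by 50%. A production implementation would normally use an FFT-based DCT rather than the direct sum shown here.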
- the speech analysis network 502 is used to obtain the cosine transform mask of the original noisy audio data.
- the speech analysis network can be a deep learning network module.
- the stride of the two-dimensional convolution is (2,1), which halves the number of frequency-domain bins layer by layer while keeping the number of time-domain frames unchanged, reducing the dimension and the amount of calculation.
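The halving effect of a (2,1) stride follows from the standard convolution output-size formula. In the sketch below only the stride comes from the text; the kernel size (3,1), the padding (1,0), and the 100-frame input are assumptions chosen so that the frequency axis is halved exactly.

```python
from typing import Tuple


def conv2d_output_shape(freq_bins: int, time_frames: int,
                        kernel: Tuple[int, int] = (3, 1),
                        stride: Tuple[int, int] = (2, 1),
                        padding: Tuple[int, int] = (1, 0)) -> Tuple[int, int]:
    """Output (frequency, time) size of a 2D convolution without dilation."""
    out_freq = (freq_bins + 2 * padding[0] - kernel[0]) // stride[0] + 1
    out_time = (time_frames + 2 * padding[1] - kernel[1]) // stride[1] + 1
    return out_freq, out_time


# Three stacked convolutions on a 1024-bin, 100-frame input.
shape = (1024, 100)
shapes = []
for _ in range(3):
    shape = conv2d_output_shape(*shape)
    shapes.append(shape)
```

Under these assumptions the frequency axis shrinks from 1024 to 512, 256, and 128 bins, while the time axis stays at 100 frames.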
- the coding layer 5021 includes three two-dimensional convolutions as an example, namely, two-dimensional convolution 1, two-dimensional convolution 2, and two-dimensional convolution 3.
- Two-dimensional convolution 1, two-dimensional convolution 2, and two-dimensional convolution 3 extract the first key speech feature, the second key speech feature, and the third key speech feature of the original noisy audio data, respectively.
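To make the effect of the (2,1) stride concrete, the following sketch tracks the (frequency, time) shape through three encoder convolutions using the standard convolution output-size formula. The 3x3 kernel and padding of 1 are assumptions chosen for illustration, since the text only specifies the stride, and `encoder_shapes` is a hypothetical helper name.

```python
def conv_out_size(size, kernel, stride, padding):
    """Standard convolution output-size formula along one axis."""
    return (size + 2 * padding - kernel) // stride + 1

def encoder_shapes(freq_bins, time_frames, num_layers=3,
                   kernel=(3, 3), stride=(2, 1), padding=(1, 1)):
    """Return the (frequency, time) shape before and after each encoder layer."""
    shapes = [(freq_bins, time_frames)]
    for _ in range(num_layers):
        f, t = shapes[-1]
        shapes.append((conv_out_size(f, kernel[0], stride[0], padding[0]),
                       conv_out_size(t, kernel[1], stride[1], padding[1])))
    return shapes

# 1024 DCT coefficients per frame (frame length 1024) and 100 frames.
shapes = encoder_shapes(freq_bins=1024, time_frames=100)
```

Under these assumptions the frequency axis shrinks from 1024 to 512, 256 and 128 while the 100 time frames are preserved, which is the dimensionality reduction the text describes.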
- the decoding layer 5023 is mainly composed of DecTConv2d blocks built around a transposed two-dimensional convolution (ConvTranspose2d).
- the workflow of the deep learning network module is as follows: the encoding layer receives the frequency domain signal of the original noisy audio data from the feature extraction network and extracts high-dimensional features (i.e., the first key speech feature, the second key speech feature, and the third key speech feature) layer by layer through two-dimensional convolution. The output of each convolution is passed to the corresponding transposed two-dimensional convolution through a skip connection.
- the RNNs receive the third key speech feature output by the last layer, two-dimensional convolution 3, extract and analyze temporal information, and pass their output to the decoding layer.
- the decoding layer receives the outputs of the RNNs and the encoding layer and, after layer-by-layer upsampling, finally obtains the cosine transform mask.
- the above-mentioned generating target enhanced audio data based on the enhanced signal-to-noise ratio and the frequency domain signal includes: a computer device can perform noise reduction processing on the frequency domain signal of the original noisy audio data according to the enhanced signal-to-noise ratio and the cosine transform mask of the original noisy audio data to obtain frequency domain enhanced audio data, transform the frequency domain enhanced audio data to obtain time domain enhanced audio data, and determine the time domain enhanced audio data as the target enhanced audio data.
- k in formula (3) is the kth sampling point of the original noisy audio data, and k is a positive integer greater than 1.
- the original signal-to-noise ratio of the original noisy audio data can be expressed by the following formula (4):
- the frequency domain signal of the audio data in the frequency domain enhanced audio data is:
- the frequency domain signal of the noise data in the frequency domain enhanced audio data is:
- the frequency domain signal of the frequency domain enhanced audio data can be expressed by the following formula (5):
- the enhanced signal-to-noise ratio of the original noisy audio data after denoising can be expressed by the following formula (6):
- the frequency domain signal Xk of the original noisy audio data is equal to the frequency domain signal of the audio data in the frequency domain enhanced audio data.
- the relationship between can be expressed by the following formula (7):
- the computer device performs time domain transformation on the above formula (9) to obtain the target enhanced audio data.
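The numbered formulas are not reproduced in this text, so the following is only a sketch of one way the quantitative control could work, under two stated assumptions: the mask splits each frequency-domain coefficient X_k into a speech estimate `mask * X_k` and a noise estimate `(1 - mask) * X_k`, and the enhanced signal-to-noise ratio equals the original one plus the noise reduction intensity parameter in dB. The function `controlled_denoise` and its arguments are hypothetical names.

```python
import math

def power(values):
    """Sum of squared coefficients."""
    return sum(v * v for v in values)

def snr_db(speech, noise):
    """Signal-to-noise ratio in decibels."""
    return 10.0 * math.log10(power(speech) / power(noise))

def controlled_denoise(X, mask, delta_snr_db):
    """Attenuate the estimated noise part of X by delta_snr_db decibels.

    Assumption: speech estimate = mask * X, noise estimate = (1 - mask) * X.
    """
    alpha = 10.0 ** (-delta_snr_db / 20.0)  # amplitude gain applied to the noise
    speech_est = [m * x for m, x in zip(mask, X)]
    noise_est = [(1.0 - m) * x for m, x in zip(mask, X)]
    enhanced = [y + alpha * d for y, d in zip(speech_est, noise_est)]
    return enhanced, speech_est, noise_est, alpha

X = [4.0, -2.0, 1.0, 3.0]        # toy frequency-domain coefficients
mask = [0.75, 0.5, 0.5, 0.9]     # toy mask values in [0, 1]
enhanced, Y, D, alpha = controlled_denoise(X, mask, delta_snr_db=20.0)
# Improvement in SNR between the residual noise and the original noise estimate.
gain_db = snr_db(Y, [alpha * d for d in D]) - snr_db(Y, D)
```

With a parameter of 0 dB the input passes through unchanged, and larger values leave less residual noise without ever removing the speech estimate, which matches the idea of accepting some noise residue rather than forcing complete separation.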
- the noise processing strength of the algorithm on the original noisy audio data is quantitatively controlled.
- the target noise reduction intensity parameter can be flexibly configured, which improves the adaptability of the present application to different scenarios and improves the generalization of the present application.
- the present application can cover most voice data application scenarios and actual needs, reducing the difficulty of algorithm development and system complexity.
- pure speech is not used as the target enhanced speech, but the speech signal (i.e., sample audio data) and the noise signal (i.e., sample noise data) are mixed according to a certain signal-to-noise ratio (sample noise reduction intensity parameter) to obtain the target enhanced speech (i.e., labeled speech enhancement data), which to a certain extent avoids the speech loss problem and noise residual discontinuity problem that are prone to occur in conventional speech enhancement and noise reduction algorithms.
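The construction of the training target described above can be sketched as follows. This is an illustrative assumption about the mixing rule, not the patent's exact procedure: the noise is first scaled to a chosen input signal-to-noise ratio to form the noisy sample, then attenuated by the sample noise reduction intensity parameter (in dB) and added back to the speech to form the label. All function and variable names are hypothetical.

```python
import math

def scale_noise_to_snr(speech, noise, snr_db):
    """Scale the noise so that speech power / noise power matches snr_db."""
    p_speech = sum(s * s for s in speech)
    p_noise = sum(n * n for n in noise)
    gain = math.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return [gain * n for n in noise]

def make_training_pair(speech, noise, input_snr_db, delta_snr_db):
    """Return (noisy input, labeled target) for one training example."""
    scaled = scale_noise_to_snr(speech, noise, input_snr_db)
    noisy = [s + d for s, d in zip(speech, scaled)]
    alpha = 10.0 ** (-delta_snr_db / 20.0)   # residual noise amplitude gain
    label = [s + alpha * d for s, d in zip(speech, scaled)]
    return noisy, label

def measured_snr_db(speech, mixture):
    """SNR of a mixture, treating (mixture - speech) as the noise."""
    residual = [m - s for m, s in zip(mixture, speech)]
    return 10.0 * math.log10(sum(s * s for s in speech) /
                             sum(r * r for r in residual))

speech = [1.0, -1.0, 2.0, 0.5]
noise = [0.3, 0.2, -0.1, 0.4]
noisy, label = make_training_pair(speech, noise, input_snr_db=5.0, delta_snr_db=10.0)
noisy_snr = measured_snr_db(speech, noisy)
label_snr = measured_snr_db(speech, label)
```

In this toy example the noisy input sits at 5 dB and the label at 15 dB, so the model is asked to raise the signal-to-noise ratio by the requested amount rather than to recover clean speech.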
- a batch of test data, i.e., noisy audio data
- the noise reduction intensity parameter is set to 5 dB, 10 dB, 20 dB, and 40 dB respectively.
- the perceptual evaluation of speech quality (PESQ) and scale-invariant signal-to-noise ratio (SI-SNR) are selected as reference indicators for noise reduction effect.
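For reference, SI-SNR can be computed directly from a reference signal and an estimate, as in the sketch below, which follows the commonly used definition (zero-mean both signals, project the estimate onto the reference, and compare the projected energy with the residual energy). The function name `si_snr_db` is illustrative. PESQ is defined by ITU-T Recommendation P.862 and is not reproduced here.

```python
import math

def si_snr_db(estimate, reference):
    """Scale-invariant signal-to-noise ratio in decibels."""
    def zero_mean(values):
        mean = sum(values) / len(values)
        return [v - mean for v in values]

    est = zero_mean(estimate)
    ref = zero_mean(reference)
    # Project the estimate onto the reference to get the target component.
    scale = sum(e * r for e, r in zip(est, ref)) / sum(r * r for r in ref)
    target = [scale * r for r in ref]
    error = [e - t for e, t in zip(est, target)]
    return 10.0 * math.log10(sum(t * t for t in target) /
                             sum(x * x for x in error))

reference = [1.0, -2.0, 3.0, -1.0, 0.5]
estimate = [r + n for r, n in zip(reference, [0.1, -0.1, 0.05, 0.2, -0.1])]
score = si_snr_db(estimate, reference)
score_scaled = si_snr_db([3.0 * e for e in estimate], reference)
```

Rescaling the estimate leaves the score unchanged, which is why the metric is called scale-invariant.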
- FIG. 6 shows the PESQ scores of noisy audio data under different noise reduction intensity parameters.
- the horizontal axis of FIG. 6 represents the original signal-to-noise ratio of the noisy audio data
- the vertical axis represents the PESQ score of the noisy audio data after noise reduction according to the noise reduction intensity parameter.
- Each original signal-to-noise ratio corresponds to 5 rectangles.
- the length of the first rectangle from left to right represents the PESQ score of the noisy audio data without noise reduction
- the lengths of the second to fifth rectangles represent the PESQ scores of the noisy audio data after noise reduction according to the noise reduction intensity parameters of 5dB, 10dB, 20dB, and 40dB, respectively.
- the PESQ score of the noisy audio data after noise reduction according to the noise reduction intensity parameter is higher than the PESQ score of the noisy audio data without noise reduction, which is particularly obvious when the original signal-to-noise ratio of the noisy audio data is greater than 4dB.
- the larger the noise reduction intensity parameter, the higher the PESQ score of the noisy audio data after being processed according to the noise reduction intensity parameter; the smaller the noise reduction intensity parameter, the lower the PESQ score of the noisy audio data after being processed according to the noise reduction intensity parameter.
- FIG. 7 shows the SI-SNR scores of noisy audio data under different noise reduction intensity parameters.
- the horizontal axis of FIG. 7 represents the original signal-to-noise ratio of the noisy audio data
- the vertical axis represents the SI-SNR score of the noisy audio data after noise reduction processing according to the noise reduction intensity parameter.
- Each original signal-to-noise ratio corresponds to 5 rectangles.
- the length of the first rectangle from left to right represents the SI-SNR score of the noisy audio data without noise reduction processing
- the lengths of the second to fifth rectangles represent the SI-SNR scores of the noisy audio data after noise reduction processing according to the noise reduction intensity parameters of 5dB, 10dB, 20dB, and 40dB, respectively.
- the SI-SNR score of the noisy audio data after processing according to the noise reduction intensity parameter is higher than the SI-SNR score of the noisy audio data without noise reduction processing.
- This situation is particularly obvious when the original signal-to-noise ratio of the noisy audio data is greater than 4dB.
- the larger the noise reduction intensity parameter, the higher the SI-SNR score of the noisy audio data after processing according to the noise reduction intensity parameter; the smaller the noise reduction intensity parameter, the lower the SI-SNR score of the noisy audio data after being processed according to the noise reduction intensity parameter.
- a target noise reduction intensity parameter for denoising the original noisy audio data is adaptively determined through a target scene parameter associated with the original noisy audio data, and based on the target noise reduction intensity parameter, the noise content in the original noisy audio data is quantitatively reduced. That is, the target scene parameter here reflects at least one of the application scene and the acquisition scene of the original noisy audio data, and the target noise reduction intensity parameter reflects the strength of noise suppression in the original noisy audio data.
- the noise content in the original noisy audio data is quantitatively reduced and a certain degree of residual noise is accepted. Because the noise data and audio data in the original noisy audio data need not be completely separated and the noise need not be completely suppressed, loss of effective audio data during noise reduction is avoided, the quality of the audio data is improved, and noise processing becomes more flexible.
- FIG 8 is a schematic diagram of the structure of an audio data processing device provided in an embodiment of the present application.
- the above-mentioned audio data processing device can be a computer program (including program code) running in a network device, for example, the audio data processing device is an application software; the device can be used to execute the corresponding steps in the method provided in an embodiment of the present application.
- the audio data processing device may include:
- the determination module 802 includes an acquisition unit 81a and a determination unit 82a; the acquisition unit 81a is configured to acquire the quality requirement level of the audio data in the application scenario if the target scenario parameter is used to determine the application scenario of the original noisy audio data; the determination unit 82a is configured to determine the target noise reduction intensity parameter for denoising the original noisy audio data based on the quality requirement level.
- the acquisition unit 81a is configured to acquire the historical noise data within a historical time period in the acquisition scene if the target scene parameter is used to determine the acquisition scene of the original noisy audio data; the determination unit 82a is configured to determine the target noise reduction intensity parameter for denoising the original noisy audio data based on the historical noise data.
- the determination unit 82a determines the target noise reduction intensity parameters for denoising the original noisy audio data based on the historical noise data, including: determining from the historical noise data, the noise type and noise change characteristics corresponding to the noise data of the acquisition scene within the historical time period; and determining the target noise reduction intensity parameters for denoising the original noisy audio data based on the noise type and the noise change characteristics.
- the noise data of the acquisition scene within the historical time period corresponds to M noise types
- the determination unit 82a determines the target noise reduction intensity parameters for denoising the original noisy audio data according to the noise type and the noise change characteristics, including: determining M candidate noise reduction intensity parameters for denoising the original noisy audio data based on the noise change characteristics corresponding to the M noise types respectively; determining the M candidate noise reduction intensity parameters as the target noise reduction intensity parameters; or performing mean calculation on the M candidate noise reduction intensity parameters to obtain the target noise reduction intensity parameter.
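The two options for combining the M candidates can be sketched as below. The lookup table that maps a noise type and its change characteristic to a candidate value is entirely hypothetical (the text does not say how each candidate is derived), and the names `CANDIDATE_TABLE` and `target_strength` are illustrative.

```python
from statistics import mean

# Hypothetical lookup: (noise type, change characteristic) -> candidate strength in dB.
CANDIDATE_TABLE = {
    ("traffic", "stationary"): 20.0,
    ("babble", "fluctuating"): 10.0,
    ("keyboard", "transient"): 5.0,
}

def candidate_strengths(observed):
    """One candidate noise reduction intensity parameter per observed noise type."""
    return [CANDIDATE_TABLE[item] for item in observed]

def target_strength(observed, combine="mean"):
    """Either average the M candidates or return all of them unchanged."""
    candidates = candidate_strengths(observed)
    if combine == "mean":
        return mean(candidates)
    return candidates

observed = [("traffic", "stationary"), ("babble", "fluctuating")]
averaged = target_strength(observed)                 # single averaged parameter
all_candidates = target_strength(observed, "all")    # keep every candidate
```

Averaging yields one parameter for the whole recording, while keeping all M candidates allows a different strength to be applied per noise type.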
- the processing module 803 includes an extraction unit 83a, a parsing unit 84a and a generation unit 85a;
- the extraction unit 83a is configured to extract the frequency domain signal of the original noisy audio data through the feature extraction network of the target noise reduction processing model;
- the parsing unit 84a is configured to parse the frequency domain signal of the original noisy audio data through the speech parsing network of the target noise reduction processing model to obtain the cosine transform mask of the original noisy audio data;
- the cosine transform mask is used to reflect the proportion of audio data in the original noisy audio data;
- the generation unit 85a is configured to generate, through the speech generation network of the target noise reduction processing model, the target enhanced audio data according to the cosine transform mask of the original noisy audio data, the frequency domain signal of the original noisy audio data and the target noise reduction intensity parameter.
- the parsing unit 84a parses the first key speech feature, the second key speech feature and the third key speech feature to obtain a cosine transform mask of the original noisy audio data, including: parsing the third key speech feature through the timing parsing layer in the speech parsing network to obtain the timing information of the original noisy audio data; parsing through the decoding layer in the speech parsing network according to the timing information, the first key speech feature, the second key speech feature and the third key speech feature to obtain a cosine transform mask of the original noisy audio data.
- the training module 805 adjusts the model parameters of the initial denoising processing model according to the denoising processing error and the stability to obtain the target denoising processing model, including: determining the convergence state of the initial denoising processing model according to the denoising processing error; if the convergence state of the initial denoising processing model is a non-converged state, or the stability is less than a stability threshold, adjusting the model parameters of the initial denoising processing model according to the denoising processing error; and, once the convergence state of the adjusted initial denoising processing model is a converged state and the corresponding stability is greater than or equal to the stability threshold, determining the adjusted initial denoising processing model as the target denoising processing model.
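The stopping rule described above combines two conditions, which the following sketch expresses as a predicate. How convergence and stability are measured is not specified in this text, so the plateau test on recent errors (`tolerance`, `patience`) and the scalar `stability` score are assumptions made purely for illustration.

```python
def is_converged(errors, tolerance=1e-3, patience=3):
    """Assumed convergence test: the last `patience` error changes are all below tolerance."""
    if len(errors) <= patience:
        return False
    recent = errors[-(patience + 1):]
    return all(abs(a - b) < tolerance for a, b in zip(recent, recent[1:]))

def training_finished(errors, stability, stability_threshold):
    """Stop only when the model has converged AND the residual noise is stable enough."""
    return is_converged(errors) and stability >= stability_threshold

history = [1.0, 0.5, 0.2001, 0.2000, 0.2000, 0.2000]
done = training_finished(history, stability=0.9, stability_threshold=0.8)
keep_training = not training_finished(history, stability=0.5, stability_threshold=0.8)
```

Parameter adjustment continues while either condition fails, so a model whose error has plateaued is still updated if the noise in its output remains unstable.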
- the generation module 804 generates annotated speech enhancement data according to the sample noise reduction intensity parameter, the sample audio data and the sample noise data, including: performing noise reduction processing on the sample noise data according to the sample noise reduction intensity parameter to obtain processed sample noise data; combining the processed sample noise data with the sample audio data to obtain annotated speech enhancement data.
- the modules in the audio data processing device shown in FIG. 8 can be combined, individually or all together, into one or several units, or one or more of the units can be further divided into at least two functionally smaller sub-units; either way the same operations are achieved without affecting the technical effects of the embodiments of the present application.
- the above modules are divided based on logical functions.
- the functions of one module can also be implemented by at least two units, or the functions of at least two modules can be implemented by one unit.
- the audio data processing device may also include other units.
- these functions can also be implemented with the assistance of other units, and can be implemented by the collaboration of at least two units.
- a target noise reduction intensity parameter for noise reduction processing of the original noisy audio data is adaptively determined through a target scene parameter associated with the original noisy audio data, and based on the target noise reduction intensity parameter, the noise content in the original noisy audio data is quantitatively reduced. That is, the target scene parameter here reflects at least one of the application scene and the acquisition scene of the original noisy audio data, and the target noise reduction intensity parameter reflects the strength of noise suppression in the original noisy audio data.
- the processor 1001 may be configured to call a computer application stored in the memory 1005.
- the method includes: parsing the third key speech feature through the timing parsing layer in the speech parsing network to obtain the timing information of the original noisy audio data; parsing according to the timing information, the first key speech feature, the second key speech feature and the third key speech feature through the decoding layer in the speech parsing network to obtain the cosine transform mask of the original noisy audio data.
- the processor 1001 can be used to call a computer application stored in the memory 1005 to achieve: obtaining sample audio data and sample noise data, and generating sample noisy audio data based on the sample audio data and the sample noise data; obtaining sample noise reduction intensity parameters for performing noise reduction processing on the sample noisy audio data; generating labeled speech enhancement data based on the sample noise reduction intensity parameters, the sample audio data and the sample noise data; performing noise reduction processing on the sample noisy audio data based on the sample noise reduction intensity parameters through an initial noise reduction processing model to obtain predicted speech enhancement data; optimizing and training the initial noise reduction processing model based on the predicted speech enhancement data and the labeled speech enhancement data to obtain the target noise reduction processing model.
- the processor 1001 can be used to call a computer application stored in the memory 1005 to optimize the training of the initial noise reduction processing model according to the predicted speech enhancement data and the annotated speech enhancement data to obtain the target noise reduction processing model, including: determining the noise reduction processing error of the initial noise reduction processing model according to the predicted speech enhancement data and the annotated speech enhancement data; determining the stability of the noise data contained in the predicted speech enhancement data according to the predicted speech enhancement data; adjusting the model parameters of the initial noise reduction processing model according to the noise reduction processing error and the stability to obtain the target noise reduction processing model.
- the processor 1001 can be used to call a computer application stored in the memory 1005 to adjust the model parameters of the initial denoising processing model according to the denoising processing error and the stability to obtain the target denoising processing model, including: determining the convergence state of the initial denoising processing model according to the denoising processing error; if the convergence state of the initial denoising processing model is an unconverged state, or the stability is less than a stability threshold, adjusting the model parameters of the initial denoising processing model according to the denoising processing error; until the convergence state of the adjusted initial denoising processing model is a converged state, and the corresponding stability is greater than or equal to the stability threshold, the adjusted initial denoising processing model is determined as the target denoising processing model.
- the noise content in the original noisy audio data is quantitatively reduced and a certain degree of residual noise is accepted. Because the noise data and audio data in the original noisy audio data need not be completely separated and the noise need not be completely suppressed, the loss of effective audio data during noise reduction is avoided, the quality of the audio data is improved, and noise processing becomes more flexible.
- the computer device described in the embodiments of the present application can execute the description of the above-mentioned audio data processing method in the corresponding embodiments above, and can also execute the description of the above-mentioned audio data processing device in the corresponding embodiments above.
- the embodiment of the present application also provides a computer-readable storage medium
- the above-mentioned computer-readable storage medium stores a computer program executed by the audio data processing device mentioned above, and the above-mentioned computer program includes program instructions.
- when the above-mentioned processor executes the above-mentioned program instructions, it can perform the audio data processing method described in the corresponding embodiments above.
- the description of the beneficial effects of using the same method will not be repeated.
- for technical details not disclosed in the computer-readable storage medium embodiment of this application, please refer to the description of the method embodiment of this application.
- the above program instructions may be deployed on a computer device for execution, or deployed on at least two computer devices at one location for execution, or executed on at least two computer devices distributed at at least two locations and interconnected through a communication network.
- At least two computer devices distributed at at least two locations and interconnected through a communication network may constitute a blockchain network.
- the above-mentioned computer-readable storage medium can be an internal storage unit of the audio data processing device provided in any of the above-mentioned embodiments or of the above-mentioned computer device, such as the hard disk or memory of the computer device.
- the computer-readable storage medium can also be an external storage device of the computer device, such as a plug-in hard disk equipped on the computer device, a smart media card (SMC), a secure digital (SD) card, a flash card, etc.
- the computer-readable storage medium can also include both the internal storage unit of the computer device and an external storage device.
- the computer-readable storage medium is used to store the computer program and other programs and data required by the computer device.
- the computer-readable storage medium can also be used to temporarily store data that has been output or is to be output.
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- Signal Processing (AREA)
- Health & Medical Sciences (AREA)
- Human Computer Interaction (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Quality & Reliability (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Circuit For Audible Band Transducer (AREA)
- Measurement Of Mechanical Vibrations Or Ultrasonic Waves (AREA)
Abstract
Description
$x_n = s_n + d_n \quad (1)$;
$X_k = Y_k + D_k \quad (3)$;
Claims (18)
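- An audio data processing method, applied to a computer device, comprising: acquiring original noisy audio data to be processed and a target scene parameter associated with the original noisy audio data; determining, according to the target scene parameter, a target noise reduction intensity parameter for performing noise reduction processing on the original noisy audio data; and performing noise reduction processing on the original noisy audio data according to the target noise reduction intensity parameter to obtain target enhanced audio data.
- The method according to claim 1, wherein determining, according to the target scene parameter, the target noise reduction intensity parameter for performing noise reduction processing on the original noisy audio data comprises: if the target scene parameter is used to determine an application scene of the original noisy audio data, acquiring a quality requirement level for audio data in the application scene; and determining, according to the quality requirement level, the target noise reduction intensity parameter for performing noise reduction processing on the original noisy audio data.
- The method according to claim 1, wherein determining, according to the target scene parameter, the target noise reduction intensity parameter for performing noise reduction processing on the original noisy audio data comprises: if the target scene parameter is used to determine an acquisition scene of the original noisy audio data, acquiring historical noise data within a historical time period in the acquisition scene; and determining, according to the historical noise data, the target noise reduction intensity parameter for performing noise reduction processing on the original noisy audio data.
- The method according to claim 3, wherein determining, according to the historical noise data, the target noise reduction intensity parameter for performing noise reduction processing on the original noisy audio data comprises: determining, from the historical noise data, a noise type and noise change characteristics corresponding to the noise data of the acquisition scene within the historical time period; and determining, according to the noise type and the noise change characteristics, the target noise reduction intensity parameter for performing noise reduction processing on the original noisy audio data.
- The method according to claim 4, wherein the noise data of the acquisition scene within the historical time period corresponds to M noise types, and determining, according to the noise type and the noise change characteristics, the target noise reduction intensity parameter for performing noise reduction processing on the original noisy audio data comprises: determining, based on the noise change characteristics respectively corresponding to the M noise types, M candidate noise reduction intensity parameters for performing noise reduction processing on the original noisy audio data; and determining the M candidate noise reduction intensity parameters as the target noise reduction intensity parameters; or performing mean calculation on the M candidate noise reduction intensity parameters to obtain the target noise reduction intensity parameter.
- The method according to claim 1, wherein performing noise reduction processing on the original noisy audio data according to the target noise reduction intensity parameter to obtain the target enhanced audio data comprises: acquiring a target noise reduction processing model, the target noise reduction processing model comprising a feature extraction network, a speech parsing network and a speech generation network; extracting a frequency domain signal of the original noisy audio data through the feature extraction network; parsing the frequency domain signal of the original noisy audio data through the speech parsing network to obtain a cosine transform mask of the original noisy audio data, the cosine transform mask being used to reflect the proportion of audio data in the original noisy audio data; and generating, through the speech generation network, the target enhanced audio data according to the cosine transform mask of the original noisy audio data, the frequency domain signal of the original noisy audio data and the target noise reduction intensity parameter.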
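- The method according to claim 6, wherein parsing the frequency domain signal of the original noisy audio data through the speech parsing network to obtain the cosine transform mask of the original noisy audio data comprises: performing, through an encoding layer in the speech parsing network, speech feature extraction on the frequency domain signal of the original noisy audio data according to a first speech feature extraction mode to obtain a first key speech feature; performing speech feature extraction on the first key speech feature according to a second speech feature extraction mode to obtain a second key speech feature; performing speech feature extraction on the first key speech feature and the second key speech feature according to a third speech feature extraction mode to obtain a third key speech feature; and parsing the first key speech feature, the second key speech feature and the third key speech feature to obtain the cosine transform mask of the original noisy audio data.
- The method according to claim 7, wherein parsing the first key speech feature, the second key speech feature and the third key speech feature to obtain the cosine transform mask of the original noisy audio data comprises: parsing the third key speech feature through a timing parsing layer in the speech parsing network to obtain timing information of the original noisy audio data; and parsing, through a decoding layer in the speech parsing network, according to the timing information, the first key speech feature, the second key speech feature and the third key speech feature to obtain the cosine transform mask of the original noisy audio data.
- The method according to claim 6, wherein generating, through the speech generation network, the target enhanced audio data according to the cosine transform mask of the original noisy audio data, the frequency domain signal of the original noisy audio data and the target noise reduction intensity parameter comprises: determining, through the speech generation network, an original signal-to-noise ratio of the original noisy audio data according to the frequency domain signal of the original noisy audio data; generating an enhanced signal-to-noise ratio of the original noisy audio data after noise reduction according to the original signal-to-noise ratio and the target noise reduction intensity parameter; and generating the target enhanced audio data according to the enhanced signal-to-noise ratio, the cosine transform mask of the original noisy audio data and the frequency domain signal of the original noisy audio data.
- The method according to claim 7, wherein generating the target enhanced audio data according to the enhanced signal-to-noise ratio, the cosine transform mask of the original noisy audio data and the frequency domain signal of the original noisy audio data comprises: performing noise reduction processing on the frequency domain signal of the original noisy audio data according to the enhanced signal-to-noise ratio and the cosine transform mask of the original noisy audio data to obtain frequency domain enhanced audio data; and transforming the frequency domain enhanced audio data to obtain time domain enhanced audio data, and determining the time domain enhanced audio data as the target enhanced audio data.
- The method according to claim 6, wherein the method further comprises: acquiring sample audio data and sample noise data, and generating sample noisy audio data according to the sample audio data and the sample noise data; acquiring a sample noise reduction intensity parameter for performing noise reduction processing on the sample noisy audio data; generating labeled speech enhancement data according to the sample noise reduction intensity parameter, the sample audio data and the sample noise data; performing, through an initial noise reduction processing model and based on the sample noise reduction intensity parameter, noise reduction processing on the sample noisy audio data to obtain predicted speech enhancement data; and optimizing and training the initial noise reduction processing model according to the predicted speech enhancement data and the labeled speech enhancement data to obtain the target noise reduction processing model.
- The method according to claim 11, wherein optimizing and training the initial noise reduction processing model according to the predicted speech enhancement data and the labeled speech enhancement data to obtain the target noise reduction processing model comprises: determining a noise reduction processing error of the initial noise reduction processing model according to the predicted speech enhancement data and the labeled speech enhancement data; determining, according to the predicted speech enhancement data, a stability of the noise data contained in the predicted speech enhancement data; and adjusting model parameters of the initial noise reduction processing model according to the noise reduction processing error and the stability to obtain the target noise reduction processing model.
- The method according to claim 12, wherein adjusting the model parameters of the initial noise reduction processing model according to the noise reduction processing error and the stability to obtain the target noise reduction processing model comprises: determining a convergence state of the initial noise reduction processing model according to the noise reduction processing error; if the convergence state of the initial noise reduction processing model is an unconverged state, or the stability is less than a stability threshold, adjusting the model parameters of the initial noise reduction processing model according to the noise reduction processing error; and, once the convergence state of the adjusted initial noise reduction processing model is a converged state and the corresponding stability is greater than or equal to the stability threshold, determining the adjusted initial noise reduction processing model as the target noise reduction processing model.
- The method according to claim 11, wherein generating the labeled speech enhancement data according to the sample noise reduction intensity parameter, the sample audio data and the sample noise data comprises: performing noise reduction processing on the sample noise data according to the sample noise reduction intensity parameter to obtain processed sample noise data; and combining the processed sample noise data with the sample audio data to obtain the labeled speech enhancement data.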
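- An audio data processing apparatus, comprising: an acquisition module configured to acquire original noisy audio data to be processed and a target scene parameter associated with the original noisy audio data; a determination module configured to determine, according to the target scene parameter, a target noise reduction intensity parameter for performing noise reduction processing on the original noisy audio data; and a processing module configured to perform noise reduction processing on the original noisy audio data according to the target noise reduction intensity parameter to obtain target enhanced audio data.
- A computer device, comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method according to any one of claims 1 to 14 when executing the computer program.
- A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the steps of the audio data processing method according to any one of claims 1 to 14.
- A computer program product, comprising a computer program, wherein the computer program, when executed by a processor, implements the steps of the audio data processing method according to any one of claims 1 to 14.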
Priority Applications (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| EP23909663.9A EP4560627A4 (en) | 2022-12-30 | 2023-11-03 | METHOD AND APPARATUS FOR PROCESSING AUDIO DATA, AND DEVICE, COMPUTER-READABLE STORAGE MEDIA AND COMPUTER PROGRAM PRODUCT |
| US18/908,353 US20250029627A1 (en) | 2022-12-30 | 2024-10-07 | Method and apparatus for processing audio data, device, and computer-readable storage medium |
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202211725937.6 | 2022-12-30 | ||
| CN202211725937.6A CN118280377A (zh) | 2022-12-30 | 2022-12-30 | Audio data processing method, apparatus, device and storage medium |
Related Child Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| US18/908,353 Continuation US20250029627A1 (en) | 2022-12-30 | 2024-10-07 | Method and apparatus for processing audio data, device, and computer-readable storage medium |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2024139730A1 (zh) | 2024-07-04 |
Family
ID=91643243
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/CN2023/129766 Ceased WO2024139730A1 (zh) | 2022-12-30 | 2023-11-03 | Audio data processing method, apparatus, device, computer-readable storage medium and computer program product |
Country Status (4)
| Country | Link |
|---|---|
| US (1) | US20250029627A1 (zh) |
| EP (1) | EP4560627A4 (zh) |
| CN (1) | CN118280377A (zh) |
| WO (1) | WO2024139730A1 (zh) |
Families Citing this family (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
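| CN119155583A (zh) * | 2024-08-13 | 2024-12-17 | 江西瑞声电子有限公司 | Adaptive noise reduction method for earphones, earphone and storage medium |
| CN119559940A (zh) * | 2024-11-26 | 2025-03-04 | 北京航空航天大学 | End-to-end speech recognition method for air traffic control instructions under high-noise conditions |
| CN119479670A (zh) * | 2024-12-04 | 2025-02-18 | 歌尔股份有限公司 | Speech enhancement model training method, speech enhancement method, device, medium and product |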
Citations (7)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN110197670A (zh) * | 2019-06-04 | 2019-09-03 | 大众问问(北京)信息科技有限公司 | Audio noise reduction method and apparatus, and electronic device |
| CN111785288A (zh) * | 2020-06-30 | 2020-10-16 | 北京嘀嘀无限科技发展有限公司 | Speech enhancement method, apparatus, device and storage medium |
| US20210074282A1 (en) * | 2019-09-11 | 2021-03-11 | Massachusetts Institute Of Technology | Systems and methods for improving model-based speech enhancement with neural networks |
| CN113362845A (zh) * | 2021-05-28 | 2021-09-07 | 阿波罗智联(北京)科技有限公司 | Sound data noise reduction method, apparatus, device, storage medium and program product |
| CN113395539A (zh) * | 2020-03-13 | 2021-09-14 | 北京字节跳动网络技术有限公司 | Audio noise reduction method and apparatus, computer-readable medium and electronic device |
| CN113539283A (zh) * | 2020-12-03 | 2021-10-22 | 腾讯科技(深圳)有限公司 | Artificial-intelligence-based audio processing method and apparatus, electronic device and storage medium |
| DE102021203815A1 (de) * | 2021-04-16 | 2022-10-20 | Robert Bosch Gesellschaft mit beschränkter Haftung | Sound processing device, system and method |
Family Cites Families (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20220092389A1 (en) * | 2020-09-21 | 2022-03-24 | Aondevices, Inc. | Low power multi-stage selectable neural network suppression |
| WO2022182356A1 (en) * | 2021-02-26 | 2022-09-01 | Hewlett-Packard Development Company, L.P. | Noise suppression controls |
-
2022
- 2022-12-30 CN CN202211725937.6A patent/CN118280377A/zh active Pending
-
2023
- 2023-11-03 EP EP23909663.9A patent/EP4560627A4/en active Pending
- 2023-11-03 WO PCT/CN2023/129766 patent/WO2024139730A1/zh not_active Ceased
-
2024
- 2024-10-07 US US18/908,353 patent/US20250029627A1/en active Pending
Patent Citations (7)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN110197670A (zh) * | 2019-06-04 | 2019-09-03 | 大众问问(北京)信息科技有限公司 | Audio noise reduction method and apparatus, and electronic device |
| US20210074282A1 (en) * | 2019-09-11 | 2021-03-11 | Massachusetts Institute Of Technology | Systems and methods for improving model-based speech enhancement with neural networks |
| CN113395539A (zh) * | 2020-03-13 | 2021-09-14 | 北京字节跳动网络技术有限公司 | Audio noise reduction method and apparatus, computer-readable medium and electronic device |
| CN111785288A (zh) * | 2020-06-30 | 2020-10-16 | 北京嘀嘀无限科技发展有限公司 | Speech enhancement method, apparatus, device and storage medium |
| CN113539283A (zh) * | 2020-12-03 | 2021-10-22 | 腾讯科技(深圳)有限公司 | Artificial-intelligence-based audio processing method and apparatus, electronic device and storage medium |
| DE102021203815A1 (de) * | 2021-04-16 | 2022-10-20 | Robert Bosch Gesellschaft mit beschränkter Haftung | Sound processing device, system and method |
| CN113362845A (zh) * | 2021-05-28 | 2021-09-07 | 阿波罗智联(北京)科技有限公司 | Sound data noise reduction method, apparatus, device, storage medium and program product |
Non-Patent Citations (1)
| Title |
|---|
| See also references of EP4560627A4 * |
Also Published As
| Publication number | Publication date |
|---|---|
| EP4560627A4 (en) | 2025-11-19 |
| CN118280377A (zh) | 2024-07-02 |
| US20250029627A1 (en) | 2025-01-23 |
| EP4560627A1 (en) | 2025-05-28 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
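| WO2024139730A1 (zh) | Audio data processing method, apparatus, device, computer-readable storage medium and computer program product | |
| CN113539283B (zh) | Artificial-intelligence-based audio processing method and apparatus, electronic device and storage medium | |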
| US12586599B2 (en) | Audio signal processing method and apparatus, electronic device, and storage medium with machine learning and for microphone mute state features in a multi person voice call | |
| US10218856B2 (en) | Voice signal processing method, related apparatus, and system | |
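| CN115101082B (zh) | Speech enhancement method, apparatus, device, storage medium and program product | |
| JP2016502139A (ja) | System, computer-readable storage medium and method for restoring a compressed audio signal | |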
| EP4243019B1 (en) | Voice processing method, smart terminal and storage medium | |
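| WO2024027295A1 (zh) | Training and enhancement methods and apparatuses for a speech enhancement model, electronic device, storage medium and program product | |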
| US11924367B1 (en) | Joint noise and echo suppression for two-way audio communication enhancement | |
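| CN112151055B (zh) | Audio processing method and apparatus | |
| CN110956976A (zh) | Echo cancellation method, apparatus, device and readable storage medium | |
| CN110782907B (zh) | Speech signal transmission method, apparatus, device and readable storage medium | |
| CN107578783A (zh) | Audio noise reduction method and system for live audio/video streaming, memory and electronic device | |
| WO2022156336A1 (zh) | Audio data processing method, apparatus, device, storage medium and program product | |
| CN114333892A (zh) | Speech processing method and apparatus, electronic device and readable medium | |
| CN115083440B (zh) | Audio signal noise reduction method, electronic device and storage medium | |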
| US11521637B1 (en) | Ratio mask post-filtering for audio enhancement | |
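| CN110364188A (zh) | Audio playback method and apparatus, and computer-readable storage medium | |
| CN114760389B (zh) | Voice call method and apparatus, computer storage medium and electronic device | |
| CN113113046B (zh) | Performance testing method and apparatus for audio processing, storage medium and electronic device | |
| CN118250486A (zh) | Video stutter detection method and apparatus, terminal and storage medium | |
| CN114093373A (zh) | Audio data transmission method and apparatus, electronic device and storage medium | |
| CN114724572B (zh) | Method and apparatus for determining echo delay | |
| CN117153178B (zh) | Audio signal processing method and apparatus, electronic device and storage medium | |
| CN117727334B (zh) | Audio processing method, apparatus, device and readable storage medium |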
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 23909663; Country of ref document: EP; Kind code of ref document: A1 |
| | WWE | Wipo information: entry into national phase | Ref document number: 2023909663; Country of ref document: EP. Ref document number: 23909663; Country of ref document: EP |
| | ENP | Entry into the national phase | Ref document number: 2023909663; Country of ref document: EP; Effective date: 20250219 |
| | WWP | Wipo information: published in national office | Ref document number: 2023909663; Country of ref document: EP |
| | NENP | Non-entry into the national phase | Ref country code: DE |