WO2021057239A1 - Voice data processing method, apparatus, electronic device, and readable storage medium - Google Patents
- Publication number
- WO2021057239A1 (application PCT/CN2020/105034)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- voice
- enhancement
- data
- processing
- speech
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/21—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being power information
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
Definitions
- This application relates to the field of Internet technology. Specifically, this application relates to a voice data processing method, device, electronic equipment, and computer-readable storage medium.
- Speech enhancement is, in essence, speech noise reduction.
- The speech collected by a microphone usually contains various kinds of noise.
- The main purpose of speech enhancement is to recover noise-free speech from noisy speech.
- Through speech enhancement, various interference signals can be effectively suppressed and the target speech signal enhanced, which not only improves speech intelligibility and voice quality but also helps improve speech recognition.
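- An embodiment of the present application provides a method for processing voice data.
- The method is executed by a server and includes:
- when first voice data sent by a sender is received, acquiring the corresponding voice enhancement parameters;
- performing voice enhancement processing on the first voice data based on the acquired voice enhancement parameters to obtain first voice enhancement data, and determining first voice enhancement parameters based on the first voice data;
- sending the first voice enhancement data to the receiver, and updating the acquired voice enhancement parameters by using the first voice enhancement parameters to obtain updated voice enhancement parameters, so that when second voice data sent by the sender is received, voice enhancement processing is performed on the second voice data based on the updated voice enhancement parameters.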
- An embodiment of the present application provides a device for processing voice data, which includes:
- a receiving module, configured to receive the first voice data sent by the sender;
- an acquisition module, configured to acquire the corresponding voice enhancement parameters;
- a processing module, configured to perform voice enhancement processing on the first voice data based on the acquired voice enhancement parameters to obtain first voice enhancement data, and to determine the first voice enhancement parameters based on the first voice data;
- an update module, configured to update the acquired voice enhancement parameters by using the first voice enhancement parameters to obtain updated voice enhancement parameters, so that when second voice data sent by the sender is received, voice enhancement processing is performed on the second voice data based on the updated voice enhancement parameters;
- a sending module, configured to send the first voice enhancement data to the receiver.
- An embodiment of the present application also provides an electronic device, which includes:
- a bus, used to connect the processor and the memory;
- a memory, used to store operation instructions;
- a processor, configured to call the operation instructions, which cause the processor to perform the operations corresponding to the voice data processing method described in this application.
- An embodiment of the present application also provides a computer-readable storage medium on which a computer program is stored; when the program is executed by a processor, the voice data processing method described in this application is implemented.
- FIG. 1A is a system architecture diagram to which a voice data processing method provided by an embodiment of the application is applicable;
- FIG. 1B is a schematic flowchart of a method for processing voice data according to an embodiment of this application
- Figure 2 is a schematic diagram of the structure of the LSTM model in this application.
- Figure 3 is a schematic diagram of the logical steps of speech feature extraction in this application.
- FIG. 4 is a schematic structural diagram of a voice data processing device provided by another embodiment of this application.
- FIG. 5 is a schematic structural diagram of an electronic device for processing voice data according to another embodiment of this application.
- Artificial intelligence (AI) uses digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results.
- Artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence.
- Artificial intelligence is to study the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
- Artificial intelligence technology is a comprehensive discipline, covering a wide range of fields, including both hardware-level technology and software-level technology.
- Basic artificial intelligence technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, and mechatronics.
- Artificial intelligence software technology mainly includes computer vision technology, speech processing technology, natural language processing technology, and machine learning or deep learning.
- The key technologies of speech technology include automatic speech recognition (ASR), speech synthesis (TTS), and voiceprint recognition.
- Enabling computers to listen, see, speak, and feel is the future direction of human-computer interaction, and voice has become one of the most promising human-computer interaction methods.
- In related technologies, when the speaker changes, it is necessary to obtain the noise reduction model corresponding to the changed speaker and use that model to perform noise reduction on the speaker's speech data. As a result, a noise reduction model must be stored for each speaker, and the storage demand is high.
- the embodiments of the present application provide a voice data processing method, device, electronic device, and computer-readable storage medium, aiming to solve the above technical problems in related technologies.
- FIG. 1A is a system architecture diagram to which the voice processing method provided in an embodiment of the present application is applicable.
- The system architecture includes a server 11, a network 12, and terminal devices 13 and 14, where the server 11 establishes a connection with terminal device 13 and terminal device 14 through the network 12.
- The server 11 is a background server that processes the voice data it receives from the sender.
- The server 11, together with terminal device 13 and terminal device 14, provides services for users.
- For example, the server 11 processes the voice data sent by terminal device 13 (or terminal device 14) corresponding to the sender, and then sends the resulting voice enhancement data to terminal device 14 (or terminal device 13) corresponding to the receiver so that it can be provided to the user. The server 11 may be a single server or a cluster composed of multiple servers.
- The network 12 may include a wired network and a wireless network. As shown in Figure 1A, on the access network side, terminal device 13 and terminal device 14 can be connected to the network 12 in a wireless or wired manner; on the core network side, the server 11 is generally connected to the network 12 in a wired manner. Of course, the server 11 may also be connected to the network 12 in a wireless manner.
- Terminal device 13 and terminal device 14 may be smart devices with data calculation and processing functions; for example, they can play the processed voice enhancement data provided by the server.
- Terminal device 13 and terminal device 14 include, but are not limited to, a smartphone (with a communication module installed), a handheld computer, a tablet computer, and the like.
- An operating system is installed on each of terminal device 13 and terminal device 14, including but not limited to the Android operating system, the Symbian operating system, the Windows Mobile operating system, and the Apple iPhone OS operating system.
- an embodiment of the present application provides a method for processing voice data, and the processing method is executed by the server 11 in FIG. 1A. As shown in Figure 1B, the method includes:
- Step S101: When the first voice data sent by the sender is received, the corresponding voice enhancement parameters are obtained.
- Specifically, the pre-stored voice enhancement parameters corresponding to the sender are looked up; if no voice enhancement parameters corresponding to the sender are found, the preset voice enhancement parameters are obtained instead.
- The sender is the party that sends the voice data.
- For example, terminal device 13 can be the sender and the content of user A's speech can be the first voice data, which is transmitted to the server through the network.
- After the server receives the first voice data, it can obtain the corresponding voice enhancement parameters and then perform voice enhancement processing on the first voice data.
- The server can run an LSTM (Long Short-Term Memory) model, which can be used to perform voice enhancement processing on voice data.
- Step S102: Perform voice enhancement processing on the first voice data based on the acquired voice enhancement parameters to obtain first voice enhancement data, and determine the first voice enhancement parameters based on the first voice data.
- If the voice enhancement parameter corresponding to the sender is not obtained, voice enhancement processing is performed on the first voice data based on the preset voice enhancement parameter to obtain the first voice enhancement data; if the voice enhancement parameter corresponding to the sender is obtained, voice enhancement processing is performed on the first voice data based on the voice enhancement parameter corresponding to the sender to obtain the first voice enhancement data.
- When the voice enhancement parameter corresponding to the sender is not obtained, performing voice enhancement processing on the first voice data based on the preset voice enhancement parameter and determining the first voice enhancement parameter based on the first voice data includes: performing feature sequence processing on the first voice data through the trained voice enhancement model to obtain a first voice feature sequence, the voice enhancement model being set with the preset voice enhancement parameters; performing batch calculation on the first voice feature sequence using the preset voice enhancement parameters to obtain the processed first voice feature sequence and the first voice enhancement parameters; and performing inverse feature transformation on the processed first voice feature sequence to obtain the first voice enhancement data.
- When the voice enhancement parameter corresponding to the sender is obtained, performing voice enhancement processing on the first voice data based on that parameter and determining the first voice enhancement parameter based on the first voice data includes: performing feature sequence processing on the first voice data through the trained voice enhancement model to obtain a second voice feature sequence; performing batch calculation on the second voice feature sequence using the voice enhancement parameter corresponding to the sender to obtain the processed second voice feature sequence and the second voice enhancement parameter; and performing inverse feature transformation on the processed second voice feature sequence to obtain the processed second voice enhancement data, which is used as the first voice enhancement data.
- Step S103: Send the first voice enhancement data to the receiver, and use the first voice enhancement parameter to update the acquired voice enhancement parameter to obtain the updated voice enhancement parameter, so that when second voice data sent by the sender is received, voice enhancement processing is performed on the second voice data based on the updated voice enhancement parameter.
- If the voice enhancement parameter corresponding to the sender was not obtained, the acquired preset voice enhancement parameter is updated based on the first voice enhancement parameter to obtain the updated voice enhancement parameter, and the first voice enhancement parameter is used as the voice enhancement parameter corresponding to the sender.
- If the voice enhancement parameter corresponding to the sender was obtained, the first voice enhancement parameter is used to update the voice enhancement parameter corresponding to the sender to obtain the updated voice enhancement parameter.
- In other words, if the storage container holds no voice enhancement parameter for the sender, the first voice enhancement parameter is stored in the storage container as the voice enhancement parameter corresponding to the sender; if a voice enhancement parameter corresponding to the sender has already been saved in the storage container, the saved parameter is replaced with the first voice enhancement parameter.
- The server sends the first voice enhancement data obtained through the voice enhancement processing to the receiver, and the receiver only needs to play the first voice enhancement data after receiving it.
- The trained speech enhancement model is generated in the following manner: acquiring first voice sample data containing noise, and performing voice feature extraction on the first voice sample data to obtain a first voice feature sequence; acquiring second voice sample data that does not contain noise, and performing voice feature extraction on the second voice sample data to obtain a second voice feature sequence; using the first voice feature sequence to train a preset voice enhancement model, obtaining the first voice feature sequence output by the voice enhancement model under training, and calculating the similarity between that output and the second voice feature sequence, until the similarity between the first voice feature sequence output by the voice enhancement model and the second voice feature sequence exceeds a preset similarity threshold, at which point the trained voice enhancement model is obtained.
- The method for extracting the voice feature sequence includes: performing voice framing and windowing on the voice sample data to obtain at least two voice frames of the voice sample data; performing a fast Fourier transform on each voice frame to obtain the discrete power spectrum corresponding to each voice frame; and performing a logarithmic calculation on each discrete power spectrum to obtain the logarithmic power spectrum corresponding to each voice frame, the logarithmic power spectra being used as the voice feature sequence of the voice sample data.
- In the embodiment of the present application, when the first voice data sent by the sender is received, the corresponding voice enhancement parameters are acquired; voice enhancement processing is then performed on the first voice data based on the acquired voice enhancement parameters to obtain the first voice enhancement data, and a first voice enhancement parameter is determined based on the first voice data; the first voice enhancement parameter is then used to update the acquired voice enhancement parameter to obtain the updated voice enhancement parameter, so that when second voice data sent by the sender is received, voice enhancement processing is performed on the second voice data based on the updated voice enhancement parameter; and the first voice enhancement data is sent to the receiver.
- In this way, the server can perform voice enhancement processing on the voice data of the sender based on the voice enhancement parameters corresponding to that sender.
- Because the voice enhancement parameters corresponding to different senders differ, the enhancement effects obtained for different senders also differ. Speech enhancement therefore remains targeted to each sender without requiring multiple models: only the speech enhancement parameters need to be stored, not multiple models, so the storage requirement is low.
- the embodiment of the present application describes in detail a method for processing voice data as shown in FIG. 1B.
- Step S101: When the first voice data sent by the sender is received, the corresponding voice enhancement parameters are obtained.
- The sender is the party that sends the voice data.
- For example, terminal device 13 can be the sender and the content of user A's speech can be the first voice data, which is transmitted to the server through the network. After the server receives the first voice data, it can obtain the corresponding voice enhancement parameters and then perform voice enhancement processing on the first voice data.
- The server can run an LSTM (Long Short-Term Memory) model, which can be used to perform voice enhancement processing on voice data.
- Speech enhancement is, in essence, speech noise reduction.
- The speech collected by a microphone usually contains various kinds of noise.
- The main purpose of speech enhancement is to recover noise-free speech from noisy speech.
- Through speech enhancement, various interference signals can be effectively suppressed and the target speech signal enhanced, which not only improves speech intelligibility and speech quality but also helps improve speech recognition.
- The basic structure of the LSTM model is shown in Figure 2. It includes a front-end LSTM layer, a batch processing layer, and a back-end LSTM layer, where X is each frame of voice in the voice data and t is a time window.
- A frame of speech is a short segment of the speech signal.
- A voice signal is non-stationary on the macro level but stationary on the micro level: it has short-term stationarity (the voice signal can be considered approximately unchanged within 10 to 30 ms).
- This property can be used to divide the voice signal into short segments for processing.
- Each short segment is called a frame.
- For example, if a segment of speech is 1 s long and the length of a frame is 10 ms, the segment includes 100 frames.
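The frame arithmetic above (100 frames of 10 ms each cover 1 s of speech) can be checked with a short sketch. The function name `count_frames` and the assumption of non-overlapping frames are illustrative only and do not come from the application.

```python
def count_frames(duration_ms: int, frame_ms: int = 10) -> int:
    """Number of whole, non-overlapping frames that fit in a speech segment."""
    if frame_ms <= 0:
        raise ValueError("frame_ms must be positive")
    return duration_ms // frame_ms


# A 1-second segment cut into 10 ms frames.
n_frames = count_frames(1000, 10)
print(n_frames)
```

In practice frames usually overlap, in which case the count depends on the hop size rather than the frame length alone.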
- When the LSTM model processes voice data, the front-end LSTM layer, the batch processing layer, and the back-end LSTM layer simultaneously calculate voice frames in different time windows.
- The batch processing layer is used to calculate the voice enhancement parameters corresponding to the voice data, such as the mean and the variance.
- Terminal device 13 and terminal device 14 may also have the following characteristics:
- The device has a central processing unit, a memory, an input component, and an output component; that is, the device is often a microcomputer device with communication functions.
- The device can also have a variety of input methods, such as a keyboard, mouse, touch screen, microphone, and camera, and the input can be adjusted as needed.
- The device often has a variety of output methods, such as a receiver and a display screen, which can also be adjusted as needed.
- In terms of the software system, the device must have an operating system, such as Windows Mobile, Symbian, Palm, Android, or iOS. These operating systems are becoming more and more open, and personalized applications based on these open platforms are emerging one after another, such as address books, calendars, notepads, calculators, and various games, which largely satisfy users' individual needs.
- The device has flexible access methods and high-bandwidth communication performance, and can automatically adjust the selected communication method according to the selected service and the environment in which it is located, which is convenient for users.
- The device can support GSM (Global System for Mobile Communication), WCDMA (Wideband Code Division Multiple Access), CDMA2000 (Code Division Multiple Access), TD-SCDMA (Time Division-Synchronous Code Division Multiple Access), Wi-Fi (Wireless Fidelity), WiMAX (Worldwide Interoperability for Microwave Access), and so on, so that it adapts to networks of multiple standards and supports not only voice services but also multiple wireless data services.
- The device places greater emphasis on humanization, personalization, and multi-functionality.
- The device has moved from an "equipment-centric" model to a "human-centric" model, integrating embedded computing, control technology, artificial intelligence technology, and biometric authentication technology, which embodies a people-oriented approach.
- The device can be adjusted according to individual needs and become more personalized.
- The device itself integrates a great deal of software and hardware, and its functions are becoming more and more powerful.
- Acquiring the corresponding voice enhancement parameters includes: obtaining the pre-stored voice enhancement parameters corresponding to the sender; and, if no voice enhancement parameters corresponding to the sender are obtained, obtaining the preset voice enhancement parameters.
- After receiving the first voice data, the server may use the trained LSTM model to perform voice enhancement processing on the first voice data.
- The trained LSTM model is a general model with preset voice enhancement parameters, that is, the voice enhancement parameters in the trained LSTM model.
- The trained LSTM model can perform voice enhancement processing on any user's voice data.
- In order to provide targeted voice enhancement for different users, the trained LSTM model can be trained using a user's voice data to obtain that user's voice enhancement parameters. In this way, when voice enhancement processing is performed on the user's voice data, the user's own voice enhancement parameters can be used.
- For example, the voice data of user A is used to train the trained LSTM model to obtain the voice enhancement parameters of user A.
- When voice enhancement processing is subsequently performed on user A's voice data, the trained LSTM model can use user A's voice enhancement parameters.
- Therefore, when the server receives a user's first voice data, it may first try to obtain that user's voice enhancement parameters.
- The voice enhancement parameters corresponding to each user may be stored in the storage container of the server or in the storage container of another device, which is not limited in the embodiment of the present application.
- If the server does not obtain the user's voice enhancement parameters, the server is receiving the user's voice data for the first time, and it is sufficient to obtain the preset voice enhancement parameters.
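The lookup with fallback described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the application's implementation: `EnhancementParams`, `ParamStore`, `acquire_params`, and the values in `PRESET_PARAMS` are hypothetical names and numbers, the storage container is modelled as an in-memory dictionary, and the parameters are reduced to a single scalar mean and variance (a real model would hold one value per feature dimension).

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class EnhancementParams:
    """Mean and variance used by the batch processing layer (scalars for brevity)."""
    mean: float
    variance: float


# Hypothetical preset parameters of the general (trained) model.
PRESET_PARAMS = EnhancementParams(mean=0.0, variance=1.0)


class ParamStore:
    """In-memory stand-in for the storage container of per-sender parameters."""

    def __init__(self) -> None:
        self._by_sender: Dict[str, EnhancementParams] = {}

    def lookup(self, sender_id: str) -> Optional[EnhancementParams]:
        return self._by_sender.get(sender_id)

    def save(self, sender_id: str, params: EnhancementParams) -> None:
        self._by_sender[sender_id] = params


def acquire_params(store: ParamStore, sender_id: str) -> EnhancementParams:
    """Return the sender's stored parameters, or the preset ones on first contact."""
    stored = store.lookup(sender_id)
    return stored if stored is not None else PRESET_PARAMS


store = ParamStore()
first = acquire_params(store, "user_a")    # nothing stored yet: preset parameters
store.save("user_a", EnhancementParams(mean=-3.2, variance=4.5))
second = acquire_params(store, "user_a")   # now the sender-specific parameters
```

Only a small parameter record is kept per sender, which is what keeps the storage requirement low compared with keeping one model per speaker.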
- Step S102: Perform voice enhancement processing on the first voice data based on the acquired voice enhancement parameters to obtain first voice enhancement data, and determine the first voice enhancement parameters based on the first voice data.
- If the voice enhancement parameter corresponding to the sender is not obtained, voice enhancement processing is performed on the first voice data based on the preset voice enhancement parameter; if the voice enhancement parameter corresponding to the sender is obtained, voice enhancement processing is performed on the first voice data based on the voice enhancement parameter corresponding to the sender.
- If the voice enhancement parameter corresponding to the sender is not obtained, the step of performing voice enhancement processing on the first voice data based on the acquired voice enhancement parameter to obtain the first voice enhancement data, and determining a first voice enhancement parameter based on the first voice data, includes:
- The first voice data can be input into the trained LSTM model, which performs feature sequence processing on the first voice data to obtain the first voice feature sequence corresponding to the first voice data, where the first voice feature sequence includes at least two voice features.
- The first voice feature sequence is then batch-calculated using the preset voice enhancement parameters to obtain the processed first voice feature sequence, and inverse feature transformation is performed on the processed first voice feature sequence to obtain the first voice enhancement data. That is, the trained LSTM model (the general model) is used to perform voice enhancement processing on the first voice data.
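- The batch calculation can adopt the following formula (1) and formula (2):
- Formula (1): $\hat{x}_i = \dfrac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$
- Formula (2): $y_i = \gamma \hat{x}_i + \beta$
- $\mu_B$ is the mean value in the speech enhancement parameters;
- $\sigma_B^2$ is the variance in the speech enhancement parameters;
- $x_i$ is the input speech feature;
- $y_i$ is the output speech feature after speech enhancement;
- $\epsilon$, $\gamma$, and $\beta$ are variable parameters.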
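The batch calculation can be sketched in NumPy. This sketch assumes that formulas (1) and (2) take the standard batch-normalization form, which matches the symbols defined above (a mean, a variance, an input feature, an output feature, and three variable parameters). The function name `batch_calculate` and the default values of `gamma`, `beta`, and `eps` are assumptions.

```python
import numpy as np


def batch_calculate(x, mean, variance, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize features with the given mean/variance, then scale and shift.

    Assumed form: y_i = gamma * (x_i - mean) / sqrt(variance + eps) + beta.
    """
    x = np.asarray(x, dtype=np.float64)
    x_hat = (x - mean) / np.sqrt(variance + eps)
    return gamma * x_hat + beta


# Three frames with two feature bins each (made-up values).
features = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
mean = features.mean(axis=0)
variance = features.var(axis=0)
enhanced = batch_calculate(features, mean, variance)
```

Passing the mean and variance in as arguments, rather than computing them inside the function, mirrors the idea that the same model can be run with either the preset parameters or a sender's stored parameters.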
- The first voice data is also used to train the trained LSTM model to obtain the first voice enhancement parameter, that is, the voice enhancement parameter corresponding to the sender, which is then stored.
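- The following formula (3) and formula (4) can be used to train the trained LSTM model:
- Formula (3): $\mu_B = \dfrac{1}{m}\sum_{i=1}^{m} x_i$
- Formula (4): $\sigma_B^2 = \dfrac{1}{m}\sum_{i=1}^{m} (x_i - \mu_B)^2$
- $\mu_B$ is the mean value in the speech enhancement parameters;
- $\sigma_B^2$ is the variance in the speech enhancement parameters;
- $x_i$ is the input speech feature;
- $m$ is the number of speech features.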
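Determining the mean and variance from a sender's speech features can be sketched as below. The sketch assumes formulas (3) and (4) are the usual per-dimension mean and population variance over the m features; `estimate_params` is a hypothetical name.

```python
import numpy as np


def estimate_params(features):
    """Per-bin mean and population variance over the m feature vectors.

    Assumed form: mean = (1/m) * sum(x_i); variance = (1/m) * sum((x_i - mean)**2).
    """
    x = np.asarray(features, dtype=np.float64)
    m = x.shape[0]
    mean = x.sum(axis=0) / m
    variance = ((x - mean) ** 2).sum(axis=0) / m
    return mean, variance


# Three feature vectors with two bins each (made-up values).
feats = np.array([[1.0, 10.0], [3.0, 10.0], [5.0, 10.0]])
mu, var = estimate_params(feats)
```

The result is the pair of statistics that would be stored as the sender's voice enhancement parameters.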
- Performing voice enhancement processing on the first voice data based on the acquired voice enhancement parameters and determining the first voice enhancement parameters based on the first voice data may be executed sequentially or in parallel. The order can be adjusted according to actual needs, and the embodiment of the present application does not limit the execution order.
- If the voice enhancement parameter corresponding to the sender is obtained, the step of performing voice enhancement processing on the first voice data based on the obtained voice enhancement parameter to obtain the first voice enhancement data, and determining a first voice enhancement parameter based on the first voice data, includes:
- The first voice data can be input into the trained LSTM model, which performs feature sequence processing on the first voice data to obtain the second voice feature sequence corresponding to the first voice data, where the second voice feature sequence includes at least two voice features.
- The second voice feature sequence is then batch-calculated using the voice enhancement parameters corresponding to the sender to obtain the processed second voice feature sequence, and inverse feature transformation is performed on the processed second voice feature sequence to obtain the second voice enhancement data. That is, the voice enhancement parameters in the trained LSTM model are replaced with the voice enhancement parameters corresponding to the sender, and the updated LSTM model is then used to perform voice enhancement processing on the sender's voice data.
- The batch calculation can also adopt formula (1) and formula (2), which will not be repeated here.
- Formula (3) and formula (4) can also be used for training the updated LSTM model, which will not be repeated here.
- The trained speech enhancement model is generated in the following manner:
- First voice sample data containing noise is obtained, and voice feature extraction is performed on it to obtain the first voice feature a; second voice sample data that does not contain noise is obtained, and voice feature extraction is performed on it to obtain the second voice feature b.
- Voice feature a is then input into the original LSTM model, and voice feature b is used as the training target to perform one-way training on the original LSTM model, that is, the LSTM model is adjusted in one direction.
- The similarity calculation can use similarity measures such as the cosine of the angle or the Pearson correlation coefficient, or distance measures such as the Euclidean distance or the Manhattan distance. Other calculation methods can of course also be used; the specific calculation method can be set according to actual needs and is not limited in the embodiment of the present application.
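The four measures named above can be written in a few lines of NumPy. This is a sketch for illustration: the function names, the example vectors, and the threshold value are assumptions, and feature sequences are simply flattened to one-dimensional vectors before comparison.

```python
import numpy as np


def cosine_similarity(a, b):
    a, b = np.ravel(a), np.ravel(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def pearson_correlation(a, b):
    a, b = np.ravel(a), np.ravel(b)
    return float(np.corrcoef(a, b)[0, 1])


def euclidean_distance(a, b):
    return float(np.linalg.norm(np.ravel(a) - np.ravel(b)))


def manhattan_distance(a, b):
    return float(np.abs(np.ravel(a) - np.ravel(b)).sum())


# Model output versus clean target features (made-up values).
predicted = np.array([1.0, 2.0, 3.0])
target = np.array([2.0, 4.0, 6.0])

SIMILARITY_THRESHOLD = 0.95  # hypothetical preset similarity threshold
training_done = cosine_similarity(predicted, target) > SIMILARITY_THRESHOLD
```

Note that cosine similarity and Pearson correlation ignore overall scale, so two sequences that differ only by a constant factor count as identical; the distance measures do not, and for them a smaller value means a closer match.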
- The manner of speech feature extraction includes:
- The voice sample data is a voice signal.
- The voice signal is a time-domain signal.
- The processor cannot directly process the time-domain signal. Therefore, voice framing and windowing are performed on the voice sample data to obtain at least two voice frames of the voice sample data, so that the time-domain signal can be converted into a frequency-domain signal that the processor can process, as shown in Figure 3. An FFT (Fast Fourier Transform) is then performed on each voice frame to obtain the discrete power spectrum corresponding to that frame, and a logarithm is taken of each discrete power spectrum to obtain the logarithmic power spectrum of each frame, which is the voice feature corresponding to that frame. The collection of all voice features is the voice feature sequence corresponding to the voice sample data. Conversely, inverse feature transformation converts a voice feature sequence in the frequency domain back into a voice signal in the time domain.
- the feature extraction method for the first voice sample data is the same as the feature extraction method for the second voice sample data. Therefore, for convenience of description, the embodiment of the present application refers to the first voice sample data and the second voice sample data collectively as voice sample data.
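The extraction steps described above (framing, windowing, transform to a power spectrum, logarithm) can be sketched as follows. This is an illustrative sketch under stated assumptions: the frame length, hop size, Hamming window, and the small `eps` added before the logarithm are choices made here, not values given in the application, and a naive DFT stands in for the FFT so that the example needs only the standard library.

```python
import cmath
import math

def frame_signal(samples, frame_len, hop):
    """Split samples into overlapping frames of frame_len, advancing by hop."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]

def hamming(frame_len):
    """Hamming window coefficients (an assumed window choice)."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (frame_len - 1))
            for n in range(frame_len)]

def power_spectrum(frame):
    """Discrete power spectrum via a naive O(N^2) DFT; a real system would use an FFT."""
    n = len(frame)
    bins = []
    for k in range(n // 2 + 1):
        acc = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        bins.append(abs(acc) ** 2 / n)
    return bins

def log_power_features(samples, frame_len=64, hop=32, eps=1e-10):
    """Return one log power spectrum per frame: the voice feature sequence."""
    window = hamming(frame_len)
    features = []
    for frame in frame_signal(samples, frame_len, hop):
        windowed = [s * w for s, w in zip(frame, window)]
        features.append([math.log(p + eps) for p in power_spectrum(windowed)])
    return features

# Toy input: a sinusoid with a period of 8 samples, so energy sits in bin 64 / 8 = 8.
signal = [math.sin(2 * math.pi * n / 8) for n in range(256)]
features = log_power_features(signal)
print(len(features), len(features[0]))
```

Because the same function is applied to both noisy and noise-free samples, the resulting feature sequences are directly comparable during training.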
- Step S103: Send the first voice enhancement data to the receiver, and use the first voice enhancement parameter to update the acquired voice enhancement parameter to obtain the updated voice enhancement parameter, so that when the second voice data sent by the sender is received, voice enhancement processing is performed on the second voice data based on the updated voice enhancement parameter.
- the first speech enhancement parameter is used to update the acquired speech enhancement parameter to obtain the updated speech enhancement parameter. In this way, no adaptive training is required.
- the first voice enhancement parameter can be used as the voice enhancement parameter corresponding to the sender and stored in the storage container; if a voice enhancement parameter corresponding to the sender has already been saved in the storage container, the saved voice enhancement parameter can be replaced with the first voice enhancement parameter.
- when the second voice data sent by the sender is received, voice enhancement processing can be performed on it based on the first voice enhancement parameter, that is, the updated voice enhancement parameter.
- the server can continuously train the trained LSTM model in one direction based on the latest voice data sent by the sender, thereby continuously updating the voice enhancement parameters corresponding to the sender, so that the parameters match the sender better and better and the voice enhancement effect for the sender keeps improving.
- the server sends the first voice enhancement data obtained through the voice enhancement processing to the receiver, and the receiver only needs to play the first voice enhancement data after receiving the first voice enhancement data.
- the server can update the voice enhancement parameters and send the voice enhancement data either sequentially or in parallel. In actual applications, the order can be set according to actual needs, which is not limited in the embodiments of this application.
- the trained LSTM model is running in the server with general voice enhancement parameters, and neither the storage container nor any other storage container in the server holds voice enhancement parameters for user A.
- the terminal device corresponding to user A sends the first sentence to the server.
- after receiving the first sentence of user A, the server searches for the voice enhancement parameters corresponding to user A. Because there are no voice enhancement parameters for user A in the storage container or other storage containers in the server, the voice enhancement parameters of user A cannot be obtained.
- the server therefore obtains the general voice enhancement parameters of the trained LSTM model, uses them to perform voice enhancement processing on the first sentence to obtain the enhanced first sentence, and sends the enhanced first sentence to the terminal devices corresponding to user B and user C. At the same time, it uses the first sentence to perform one-way training on the trained LSTM model, and obtains and stores the first voice enhancement parameter of user A.
- after user A completes the second sentence, the terminal device sends the second sentence to the server.
- after receiving the second sentence of user A, the server searches for the voice enhancement parameters corresponding to user A. The search succeeds this time, and user A's stored voice enhancement parameter is retrieved.
- the enhanced second sentence is obtained and sent to the terminal devices corresponding to user B and user C.
- the second sentence is also used to perform one-way training on the updated LSTM model to obtain the second voice enhancement parameter of user A, and the first voice enhancement parameter is replaced with the second voice enhancement parameter.
- the voice enhancement process for subsequent sentences proceeds by analogy and is not repeated here.
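The lookup, fallback, and replace cycle in the user A walkthrough can be sketched as below. This is a hedged illustration only: this section does not say what form the voice enhancement parameters take, so the sketch assumes they are a mean and variance used to normalize the input, and it replaces the LSTM model and its one-way training with plain statistics. `EnhancementParams`, `GENERAL_PARAMS`, `EnhancementServer`, and `handle` are hypothetical names.

```python
from dataclasses import dataclass

@dataclass
class EnhancementParams:
    # Assumed parameter form (normalization statistics); not stated in this section.
    mean: float
    variance: float

# Stand-in for the general (preset) parameters of the trained model.
GENERAL_PARAMS = EnhancementParams(mean=0.0, variance=1.0)

class EnhancementServer:
    def __init__(self):
        # The "storage container": sender id -> that sender's parameters.
        self.store = {}

    def handle(self, sender_id, voice_data):
        # Look up the sender's parameters; fall back to the general ones.
        params = self.store.get(sender_id, GENERAL_PARAMS)
        # Enhance with the acquired parameters (toy normalization, not an LSTM).
        scale = (params.variance + 1e-8) ** 0.5
        enhanced = [(x - params.mean) / scale for x in voice_data]
        # Determine new parameters from this utterance.
        mean = sum(voice_data) / len(voice_data)
        variance = sum((x - mean) ** 2 for x in voice_data) / len(voice_data)
        # Store them, replacing any parameters saved earlier for this sender.
        self.store[sender_id] = EnhancementParams(mean, variance)
        # The enhanced data would then be sent on to the receivers.
        return enhanced

server = EnhancementServer()
first = server.handle("user_a", [1.0, 3.0])   # no stored parameters: general ones are used
second = server.handle("user_a", [2.0, 4.0])  # parameters from the first sentence are used
print(first, second, server.store["user_a"])
```

Only one small parameter record is kept per sender while a single shared model serves everyone, which is the storage advantage the text describes.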
- when the server receives the first voice data sent by the sender, it acquires the corresponding voice enhancement parameters, performs voice enhancement processing on the first voice data based on the acquired voice enhancement parameters to obtain the first voice enhancement data, and determines a first voice enhancement parameter based on the first voice data. It then uses the first voice enhancement parameter to update the acquired voice enhancement parameters to obtain the updated voice enhancement parameters, so that when the second voice data sent by the sender is received, voice enhancement processing is performed on the second voice data based on the updated voice enhancement parameters, and it sends the first voice enhancement data to the receiver.
- the server can thus perform voice enhancement processing on the sender's voice data based on the voice enhancement parameters corresponding to that sender.
- because the voice enhancement parameters corresponding to different senders are different, the voice enhancement effects obtained for different senders are also different. Voice enhancement is therefore targeted to each sender without requiring multiple models: only the voice enhancement parameters need to be stored rather than multiple models, so the storage requirement is low.
- the server can continue to train the trained LSTM model in one direction based on the latest voice data sent by the sender, continuously updating the voice enhancement parameters corresponding to the sender, so that the parameters match the sender better and better and the voice enhancement effect for the sender keeps improving.
- it is enough to train the voice enhancement parameters; it is not necessary to train the entire trained LSTM model or a whole layer in the model, which reduces the cost and increases the speed of training.
- FIG. 4 is a schematic structural diagram of a voice data processing apparatus provided by another embodiment of this application. As shown in FIG. 4, the apparatus in this embodiment may include:
- the receiving module 401 is configured to receive the first voice data sent by the sender;
- the obtaining module 402 is configured to obtain the corresponding voice enhancement parameters;
- the processing module 403 is configured to perform voice enhancement processing on the first voice data based on the acquired voice enhancement parameters to obtain first voice enhancement data, and to determine the first voice enhancement parameter based on the first voice data;
- the update module 404 is configured to use the first voice enhancement parameter to update the acquired voice enhancement parameter to obtain the updated voice enhancement parameter, so that when the second voice data sent by the sender is received, voice enhancement processing is performed on the second voice data based on the updated voice enhancement parameter;
- the sending module 405 is configured to send the first voice enhancement data to the receiver.
- the acquisition module is specifically configured to:
- the update module is further configured to update the obtained preset voice enhancement parameter based on the first voice enhancement parameter to obtain the updated voice enhancement parameter, and to use the first voice enhancement parameter as the voice enhancement parameter corresponding to the sender.
- the update module is further configured to use the first voice enhancement parameter to update the voice enhancement parameter corresponding to the sender to obtain the updated voice enhancement parameter.
- the processing module is further configured to perform voice enhancement processing on the first voice data based on the preset voice enhancement parameter to obtain the first voice enhancement data.
- the processing module includes: a feature sequence processing sub-module, a batch processing calculation sub-module, and a feature inverse transformation processing sub-module;
- the feature sequence processing sub-module is used to perform feature sequence processing on the first voice data through the trained voice enhancement model to obtain the first voice feature sequence,
- the speech enhancement model is set with the preset speech enhancement parameters;
- a batch calculation sub-module configured to perform batch calculation on the first voice feature sequence by using the preset voice enhancement parameters to obtain the processed first voice feature sequence and the first voice enhancement parameters;
- the feature inverse transformation processing sub-module is configured to perform feature inverse transformation processing on the processed first speech feature sequence to obtain the first speech enhancement data.
- the processing module is further configured to perform voice enhancement processing on the first voice data based on the voice enhancement parameter corresponding to the sender to obtain the first voice enhancement data.
- the processing module includes: a feature sequence processing sub-module, a batch processing calculation sub-module, and a feature inverse transformation processing sub-module;
- the feature sequence processing submodule is also used to perform feature sequence processing on the first voice data through the trained voice enhancement model to obtain a second voice feature sequence;
- the batch calculation sub-module is further configured to perform batch calculation on the second voice feature sequence by using the voice enhancement parameter to obtain the processed second voice feature sequence and the second voice enhancement parameter;
- the feature inverse transformation processing sub-module is further configured to perform feature inverse transformation processing on the processed second voice feature sequence to obtain processed second voice enhancement data, and to use the processed second voice enhancement data as the first voice enhancement data.
- the trained speech enhancement model is generated in the following manner:
- the manner of extracting the speech feature sequence includes:
- the voice data processing device of this embodiment can execute the voice data processing method shown in the first embodiment of the present application, and the implementation principles are similar, and will not be repeated here.
- when the server receives the first voice data sent by the sender, it acquires the corresponding voice enhancement parameters, performs voice enhancement processing on the first voice data based on the acquired voice enhancement parameters to obtain the first voice enhancement data, and determines a first voice enhancement parameter based on the first voice data. It then uses the first voice enhancement parameter to update the acquired voice enhancement parameters to obtain the updated voice enhancement parameters, so that when the second voice data sent by the sender is received, voice enhancement processing is performed on the second voice data based on the updated voice enhancement parameters, and it sends the first voice enhancement data to the receiver.
- the server can thus perform voice enhancement processing on the sender's voice data based on the voice enhancement parameters corresponding to that sender.
- because the voice enhancement parameters corresponding to different senders are different, the voice enhancement effects obtained for different senders are also different. Voice enhancement is therefore targeted to each sender without requiring multiple models: only the voice enhancement parameters need to be stored rather than multiple models, so the storage requirement is low.
- the server can continue to train the trained LSTM model in one direction based on the latest voice data sent by the sender, continuously updating the voice enhancement parameters corresponding to the sender, so that the parameters match the sender better and better and the voice enhancement effect for the sender keeps improving.
- it is enough to train the voice enhancement parameters; it is not necessary to train the entire trained LSTM model or a whole layer in the model, which reduces the cost and increases the speed of training.
- an electronic device includes a memory and a processor; at least one program, stored in the memory and executed by the processor, implements the following: when the first voice data sent by the sender is received, the corresponding voice enhancement parameters are acquired; voice enhancement processing is then performed on the first voice data based on the acquired voice enhancement parameters to obtain the first voice enhancement data, and a first voice enhancement parameter is determined based on the first voice data; the first voice enhancement parameter is then used to update the acquired voice enhancement parameters to obtain the updated voice enhancement parameters, so that when the second voice data sent by the sender is received, voice enhancement processing is performed on the second voice data based on the updated voice enhancement parameters; and the first voice enhancement data is sent to the receiver.
- the server can thus perform voice enhancement processing on the sender's voice data based on the voice enhancement parameters corresponding to that sender. Because the voice enhancement parameters corresponding to different senders are different, the voice enhancement effects obtained for different senders are also different. Voice enhancement is therefore targeted to each sender without requiring multiple models: only the voice enhancement parameters need to be stored rather than multiple models, so the storage requirement is low.
- the server can continue to train the trained LSTM model in one direction based on the latest voice data sent by the sender, continuously updating the voice enhancement parameters corresponding to the sender, so that the parameters match the sender better and better and the voice enhancement effect for the sender keeps improving.
- it is enough to train the voice enhancement parameters; it is not necessary to train the entire trained LSTM model or a whole layer in the model, which reduces the cost and increases the speed of training.
- an electronic device is provided.
- the electronic device 5000 shown in FIG. 5 includes a processor 5001 and a memory 5003. Among them, the processor 5001 and the memory 5003 are connected, for example, through a bus 5002.
- the electronic device 5000 may further include a transceiver 5004. It should be noted that in actual applications, the transceiver 5004 is not limited to one, and the structure of the electronic device 5000 does not constitute a limitation to the embodiment of the present application.
- the processor 5001 may be a CPU, a general-purpose processor, DSP, ASIC, FPGA, or other programmable logic devices, transistor logic devices, hardware components, or any combination thereof. It can implement or execute various exemplary logical blocks, modules, and circuits described in conjunction with the disclosure of this application.
- the processor 5001 may also be a combination that implements computing functions, for example, including one or more microprocessor combinations, DSP and microprocessor combinations, and so on.
- the bus 5002 may include a path for transferring information between the above-mentioned components.
- the bus 5002 may be a PCI bus or an EISA bus.
- the bus 5002 can be divided into an address bus, a data bus, a control bus, and the like. For ease of presentation, only one thick line is used to represent in FIG. 5, but it does not mean that there is only one bus or one type of bus.
- the memory 5003 can be ROM or another type of static storage device that can store static information and instructions, RAM or another type of dynamic storage device that can store information and instructions, or it can be EEPROM, CD-ROM or other optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, etc.), magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited to these.
- the memory 5003 is used to store application program codes for executing the solutions of the present application, and the processor 5001 controls the execution.
- the processor 5001 is configured to execute the application program code stored in the memory 5003 to implement the content shown in any of the foregoing method embodiments.
- electronic equipment includes but is not limited to: mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), and vehicle terminals (such as vehicle navigation terminals), and fixed terminals such as digital TVs and desktop computers.
- Another embodiment of the present application provides a computer-readable storage medium having a computer program stored on the computer-readable storage medium, and when it runs on a computer, the computer can execute the corresponding content in the foregoing method embodiment.
- when the first voice data sent by the sender is received, the corresponding voice enhancement parameters are acquired; voice enhancement processing is then performed on the first voice data based on the acquired voice enhancement parameters to obtain the first voice enhancement data, and a first voice enhancement parameter is determined based on the first voice data; the first voice enhancement parameter is then used to update the acquired voice enhancement parameters to obtain the updated voice enhancement parameters, so that when the second voice data sent by the sender is received, voice enhancement processing is performed on the second voice data based on the updated voice enhancement parameters; and the first voice enhancement data is sent to the receiver.
- the server can thus perform voice enhancement processing on the sender's voice data based on the voice enhancement parameters corresponding to that sender. Because the voice enhancement parameters corresponding to different senders are different, the voice enhancement effects obtained for different senders are also different. Voice enhancement is therefore targeted to each sender without requiring multiple models: only the voice enhancement parameters need to be stored rather than multiple models, so the storage requirement is low.
- the server can continue to train the trained LSTM model in one direction based on the latest voice data sent by the sender, continuously updating the voice enhancement parameters corresponding to the sender, so that the parameters match the sender better and better and the voice enhancement effect for the sender keeps improving.
- it is enough to train the voice enhancement parameters; it is not necessary to train the entire trained LSTM model or a whole layer in the model, which reduces the cost and increases the speed of training.
Landscapes
- Engineering & Computer Science (AREA)
- Multimedia (AREA)
- Acoustics & Sound (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Signal Processing (AREA)
- Quality & Reliability (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Telephonic Communication Services (AREA)
- Machine Translation (AREA)
Claims (20)
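- A method for processing voice data, performed by a server, comprising: receiving first voice data sent by a sender, and acquiring corresponding voice enhancement parameters; performing voice enhancement processing on the first voice data based on the acquired voice enhancement parameters to obtain first voice enhancement data, and determining a first voice enhancement parameter based on the first voice data; and sending the first voice enhancement data to a receiver, and updating the acquired voice enhancement parameters with the first voice enhancement parameter to obtain updated voice enhancement parameters, so that when second voice data sent by the sender is received, voice enhancement processing is performed on the second voice data based on the updated voice enhancement parameters.
- The method for processing voice data according to claim 1, wherein acquiring the corresponding voice enhancement parameters comprises: acquiring a pre-stored voice enhancement parameter corresponding to the sender; and if no voice enhancement parameter corresponding to the sender is acquired, acquiring a preset voice enhancement parameter.
- The method for processing voice data according to claim 2, wherein, if no voice enhancement parameter corresponding to the sender is acquired, updating the acquired voice enhancement parameters with the first voice enhancement parameter to obtain the updated voice enhancement parameters comprises: updating the acquired preset voice enhancement parameter based on the first voice enhancement parameter to obtain the updated voice enhancement parameters, and using the first voice enhancement parameter as the voice enhancement parameter corresponding to the sender.
- The method for processing voice data according to claim 2, wherein, if a voice enhancement parameter corresponding to the sender is acquired, updating the acquired voice enhancement parameters with the first voice enhancement parameter to obtain the updated voice enhancement parameters comprises: updating the voice enhancement parameter corresponding to the sender with the first voice enhancement parameter to obtain the updated voice enhancement parameters.
- The method for processing voice data according to claim 2, wherein, if no voice enhancement parameter corresponding to the sender is acquired, performing voice enhancement processing on the first voice data based on the acquired voice enhancement parameters to obtain the first voice enhancement data comprises: performing voice enhancement processing on the first voice data based on the preset voice enhancement parameter to obtain the first voice enhancement data.
- The method for processing voice data according to claim 5, wherein, if no voice enhancement parameter corresponding to the sender is acquired, performing voice enhancement processing on the first voice data based on the preset voice enhancement parameter to obtain the first voice enhancement data, and determining the first voice enhancement parameter based on the first voice data, comprise: performing feature sequence processing on the first voice data through a trained voice enhancement model to obtain a first voice feature sequence, the voice enhancement model being set with the preset voice enhancement parameter; performing batch processing calculation on the first voice feature sequence using the preset voice enhancement parameter to obtain a processed first voice feature sequence and the first voice enhancement parameter; and performing feature inverse transformation processing on the processed first voice feature sequence to obtain the first voice enhancement data.
- The method for processing voice data according to claim 2, wherein, if a voice enhancement parameter corresponding to the sender is acquired, performing voice enhancement processing on the first voice data based on the acquired voice enhancement parameters to obtain the first voice enhancement data comprises: performing voice enhancement processing on the first voice data based on the voice enhancement parameter corresponding to the sender to obtain the first voice enhancement data.
- The method for processing voice data according to claim 7, wherein, if a voice enhancement parameter corresponding to the sender is acquired, performing voice enhancement processing on the first voice data based on the voice enhancement parameter corresponding to the sender to obtain the first voice enhancement data, and determining the first voice enhancement parameter based on the first voice data, comprise: performing feature sequence processing on the first voice data through a trained voice enhancement model to obtain a second voice feature sequence; performing batch processing calculation on the second voice feature sequence using the voice enhancement parameter corresponding to the sender to obtain a processed second voice feature sequence and a second voice enhancement parameter; and performing feature inverse transformation processing on the processed second voice feature sequence to obtain processed second voice enhancement data, and using the processed second voice enhancement data as the first voice enhancement data.
- The method for processing voice data according to claim 6 or 8, wherein the trained voice enhancement model is generated by: acquiring first voice sample data containing noise, and performing voice feature extraction on the first voice sample data to obtain a first voice feature sequence; acquiring second voice sample data not containing noise, and performing voice feature extraction on the second voice sample data to obtain a second voice feature sequence; and training a preset voice enhancement model with the first voice feature sequence to obtain a first voice feature sequence output by the trained voice enhancement model, and calculating the similarity between the first voice feature sequence obtained by training the voice enhancement model and the second voice feature sequence, until that similarity exceeds a preset similarity threshold, thereby obtaining the trained voice enhancement model.
- The method for processing voice data according to claim 9, wherein the manner of voice feature sequence extraction comprises: performing voice framing and windowing processing on voice sample data to obtain at least two voice frames of the voice sample data; performing a fast Fourier transform on each voice frame to obtain the discrete power spectrum corresponding to each voice frame; and performing logarithmic calculation on each discrete power spectrum to obtain the logarithmic power spectrum corresponding to each voice frame, and using the logarithmic power spectra as the voice feature sequence of the voice sample data.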
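- An apparatus for processing voice data, comprising: a receiving module, configured to receive first voice data sent by a sender; an acquisition module, configured to acquire corresponding voice enhancement parameters; a processing module, configured to perform voice enhancement processing on the first voice data based on the acquired voice enhancement parameters to obtain first voice enhancement data, and to determine a first voice enhancement parameter based on the first voice data; an update module, configured to update the acquired voice enhancement parameters with the first voice enhancement parameter to obtain updated voice enhancement parameters, so that when second voice data sent by the sender is received, voice enhancement processing is performed on the second voice data based on the updated voice enhancement parameters; and a sending module, configured to send the first voice enhancement data to a receiver.
- The apparatus according to claim 11, wherein the acquisition module is further configured to acquire a pre-stored voice enhancement parameter corresponding to the sender, and, if no voice enhancement parameter corresponding to the sender is acquired, to acquire a preset voice enhancement parameter.
- The apparatus according to claim 12, wherein, if no voice enhancement parameter corresponding to the sender is acquired, the update module is further configured to update the acquired preset voice enhancement parameter based on the first voice enhancement parameter to obtain the updated voice enhancement parameters, and to use the first voice enhancement parameter as the voice enhancement parameter corresponding to the sender.
- The apparatus according to claim 12, wherein, if a voice enhancement parameter corresponding to the sender is acquired, the update module is further configured to update the voice enhancement parameter corresponding to the sender with the first voice enhancement parameter to obtain the updated voice enhancement parameters.
- The apparatus according to claim 12, wherein, if no voice enhancement parameter corresponding to the sender is acquired, the processing module is further configured to perform voice enhancement processing on the first voice data based on the preset voice enhancement parameter to obtain the first voice enhancement data.
- The apparatus according to claim 15, wherein the processing module comprises a feature sequence processing sub-module, a batch processing calculation sub-module, and a feature inverse transformation processing sub-module; if no voice enhancement parameter corresponding to the sender is acquired, the feature sequence processing sub-module is configured to perform feature sequence processing on the first voice data through a trained voice enhancement model to obtain a first voice feature sequence, the voice enhancement model being set with the preset voice enhancement parameter; the batch processing calculation sub-module is configured to perform batch processing calculation on the first voice feature sequence using the preset voice enhancement parameter to obtain a processed first voice feature sequence and the first voice enhancement parameter; and the feature inverse transformation processing sub-module is configured to perform feature inverse transformation processing on the processed first voice feature sequence to obtain the first voice enhancement data.
- The apparatus according to claim 12, wherein, if a voice enhancement parameter corresponding to the sender is acquired, the processing module is further configured to perform voice enhancement processing on the first voice data based on the voice enhancement parameter corresponding to the sender to obtain the first voice enhancement data.
- The apparatus according to claim 17, wherein the processing module comprises a feature sequence processing sub-module, a batch processing calculation sub-module, and a feature inverse transformation processing sub-module; if a voice enhancement parameter corresponding to the sender is acquired, the feature sequence processing sub-module is configured to perform feature sequence processing on the first voice data through a trained voice enhancement model to obtain a second voice feature sequence; the batch processing calculation sub-module is configured to perform batch processing calculation on the second voice feature sequence using the voice enhancement parameter corresponding to the sender to obtain a processed second voice feature sequence and a second voice enhancement parameter; and the feature inverse transformation processing sub-module is configured to perform feature inverse transformation processing on the processed second voice feature sequence to obtain processed second voice enhancement data, and to use the processed second voice enhancement data as the first voice enhancement data.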
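- An electronic device, comprising: a processor, a memory, and a bus; the bus being configured to connect the processor and the memory; the memory being configured to store operation instructions; and the processor being configured to execute, by invoking the operation instructions, the method for processing voice data according to any one of claims 1 to 10.
- A computer-readable storage medium, the computer storage medium being configured to store computer instructions which, when run on a computer, enable the computer to execute the method for processing voice data according to any one of claims 1 to 10.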
Priority Applications (3)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| JP2021558880A JP7301154B2 (ja) | 2019-09-23 | 2020-07-28 | 音声データの処理方法並びにその、装置、電子機器及びコンピュータプログラム |
| EP20868291.4A EP3920183B1 (en) | 2019-09-23 | 2020-07-28 | Speech data processing method and apparatus, electronic device and readable storage medium |
| US17/447,536 US12039987B2 (en) | 2019-09-23 | 2021-09-13 | Speech data processing method and apparatus, electronic device, and readable storage medium |
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201910900060.1 | 2019-09-23 | | |
| CN201910900060.1A CN110648680B (zh) | 2019-09-23 | 2019-09-23 | 语音数据的处理方法、装置、电子设备及可读存储介质 |
Related Child Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US17/447,536 Continuation US12039987B2 (en) | 2019-09-23 | 2021-09-13 | Speech data processing method and apparatus, electronic device, and readable storage medium |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2021057239A1 (zh) | 2021-04-01 |
Family
ID=69011077
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/CN2020/105034 Ceased WO2021057239A1 (zh) | 2019-09-23 | 2020-07-28 | 语音数据的处理方法、装置、电子设备及可读存储介质 |
Country Status (5)
| Country | Link |
|---|---|
| US (1) | US12039987B2 (zh) |
| EP (1) | EP3920183B1 (zh) |
| JP (1) | JP7301154B2 (zh) |
| CN (1) | CN110648680B (zh) |
| WO (1) | WO2021057239A1 (zh) |
Families Citing this family (8)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN110648680B (zh) * | 2019-09-23 | 2024-05-14 | 腾讯科技(深圳)有限公司 | 语音数据的处理方法、装置、电子设备及可读存储介质 |
| CN112820307B (zh) * | 2020-02-19 | 2023-12-15 | 腾讯科技(深圳)有限公司 | 语音消息处理方法、装置、设备及介质 |
| CN112151052B (zh) * | 2020-10-26 | 2024-06-25 | 平安科技(深圳)有限公司 | 语音增强方法、装置、计算机设备及存储介质 |
| CN112562704B (zh) * | 2020-11-17 | 2023-08-18 | 中国人民解放军陆军工程大学 | 基于blstm的分频拓谱抗噪语音转换方法 |
| CN114495904B (zh) * | 2022-04-13 | 2022-09-23 | 阿里巴巴(中国)有限公司 | 语音识别方法以及装置 |
| CN114999508B (zh) * | 2022-07-29 | 2022-11-08 | 之江实验室 | 一种利用多源辅助信息的通用语音增强方法和装置 |
| CN116434764A (zh) * | 2023-02-01 | 2023-07-14 | 深圳大学 | 一种基于神经网络的语音增强方法、装置、设备及介质 |
| JP7621577B1 (ja) * | 2023-05-25 | 2025-01-24 | 三菱電機株式会社 | 学習装置、学習方法、及び学習プログラム |
Citations (11)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN102800322A (zh) * | 2011-05-27 | 2012-11-28 | 中国科学院声学研究所 | 一种噪声功率谱估计与语音活动性检测方法 |
| US9058820B1 (en) * | 2013-05-21 | 2015-06-16 | The Intellisis Corporation | Identifying speech portions of a sound model using various statistics thereof |
| CN104952448A (zh) * | 2015-05-04 | 2015-09-30 | 张爱英 | 一种双向长短时记忆递归神经网络的特征增强方法及系统 |
| US9208794B1 (en) * | 2013-08-07 | 2015-12-08 | The Intellisis Corporation | Providing sound models of an input signal using continuous and/or linear fitting |
| CN108615533A (zh) * | 2018-03-28 | 2018-10-02 | 天津大学 | 一种基于深度学习的高性能语音增强方法 |
| CN108877823A (zh) * | 2018-07-27 | 2018-11-23 | 三星电子(中国)研发中心 | 语音增强方法和装置 |
| CN109102823A (zh) * | 2018-09-05 | 2018-12-28 | 河海大学 | 一种基于子带谱熵的语音增强方法 |
| CN109273021A (zh) * | 2018-08-09 | 2019-01-25 | 厦门亿联网络技术股份有限公司 | 一种基于rnn的实时会议降噪方法及装置 |
| CN109427340A (zh) * | 2017-08-22 | 2019-03-05 | 杭州海康威视数字技术股份有限公司 | 一种语音增强方法、装置及电子设备 |
| CN109979478A (zh) * | 2019-04-08 | 2019-07-05 | 网易(杭州)网络有限公司 | 语音降噪方法及装置、存储介质及电子设备 |
| CN110648680A (zh) * | 2019-09-23 | 2020-01-03 | 腾讯科技(深圳)有限公司 | 语音数据的处理方法、装置、电子设备及可读存储介质 |
Family Cites Families (22)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JP2007116585A (ja) | 2005-10-24 | 2007-05-10 | Matsushita Electric Ind Co Ltd | ノイズキャンセル装置およびノイズキャンセル方法 |
| JP5188300B2 (ja) * | 2008-07-14 | 2013-04-24 | 日本電信電話株式会社 | 基本周波数軌跡モデルパラメータ抽出装置、基本周波数軌跡モデルパラメータ抽出方法、プログラム及び記録媒体 |
| US8234111B2 (en) * | 2010-06-14 | 2012-07-31 | Google Inc. | Speech and noise models for speech recognition |
| JP5870476B2 (ja) * | 2010-08-04 | 2016-03-01 | 富士通株式会社 | 雑音推定装置、雑音推定方法および雑音推定プログラム |
| PT2866228T (pt) * | 2011-02-14 | 2016-08-31 | Fraunhofer Ges Forschung | Descodificador de áudio que compreende um estimador de ruído de fundo |
| CN103650040B (zh) * | 2011-05-16 | 2017-08-25 | 谷歌公司 | 使用多特征建模分析语音/噪声可能性的噪声抑制方法和装置 |
| JP5916054B2 (ja) * | 2011-06-22 | 2016-05-11 | クラリオン株式会社 | 音声データ中継装置、端末装置、音声データ中継方法、および音声認識システム |
| JP2015004959A (ja) * | 2013-05-22 | 2015-01-08 | ヤマハ株式会社 | 音響処理装置 |
| GB2519117A (en) * | 2013-10-10 | 2015-04-15 | Nokia Corp | Speech processing |
| GB2520048B (en) * | 2013-11-07 | 2018-07-11 | Toshiba Res Europe Limited | Speech processing system |
| CN104318927A (zh) * | 2014-11-04 | 2015-01-28 | 东莞市北斗时空通信科技有限公司 | 一种抗噪声的低速率语音编码方法及解码方法 |
| JP2016109933A (ja) | 2014-12-08 | 2016-06-20 | パナソニック インテレクチュアル プロパティ コーポレーション オブ アメリカPanasonic Intellectual Property Corporation of America | 音声認識方法ならびに音声認識システムおよびそれに含まれる音声入力装置 |
| CN105355199B (zh) * | 2015-10-20 | 2019-03-12 | 河海大学 | 一种基于gmm噪声估计的模型组合语音识别方法 |
| CN106971741B (zh) * | 2016-01-14 | 2020-12-01 | 芋头科技(杭州)有限公司 | 实时将语音进行分离的语音降噪的方法及系统 |
| CN106340304B (zh) * | 2016-09-23 | 2019-09-06 | 桂林航天工业学院 | 一种适用于非平稳噪声环境下的在线语音增强方法 |
| CN106615533A (zh) | 2016-09-29 | 2017-05-10 | 芜湖市三山区绿色食品产业协会 | 一种诺尼果香饯的制作方法 |
| CN106898348B (zh) * | 2016-12-29 | 2020-02-07 | 北京小鸟听听科技有限公司 | 一种出声设备的去混响控制方法和装置 |
| US10978091B2 (en) * | 2018-03-19 | 2021-04-13 | Academia Sinica | System and methods for suppression by selecting wavelets for feature compression in distributed speech recognition |
| US10811000B2 (en) * | 2018-04-13 | 2020-10-20 | Mitsubishi Electric Research Laboratories, Inc. | Methods and systems for recognizing simultaneous speech by multiple speakers |
| CN110176245A (zh) * | 2019-05-29 | 2019-08-27 | 贾一焜 | 一种语音降噪系统 |
| KR102260216B1 (ko) * | 2019-07-29 | 2021-06-03 | 엘지전자 주식회사 | 지능적 음성 인식 방법, 음성 인식 장치, 지능형 컴퓨팅 디바이스 및 서버 |
| CN110648681B (zh) * | 2019-09-26 | 2024-02-09 | 腾讯科技(深圳)有限公司 | 语音增强的方法、装置、电子设备及计算机可读存储介质 |
- 2019-09-23: CN application CN201910900060.1A, patent CN110648680B (zh), Active
- 2020-07-28: EP application EP20868291.4A, patent EP3920183B1 (en), Active
- 2020-07-28: JP application JP2021558880A, patent JP7301154B2 (ja), Active
- 2020-07-28: WO application PCT/CN2020/105034, publication WO2021057239A1 (zh), not active (Ceased)
- 2021-09-13: US application US17/447,536, patent US12039987B2 (en), Active
Non-Patent Citations (1)
| Title |
|---|
| See also references of EP3920183A4 |
Also Published As
| Publication number | Publication date |
|---|---|
| US20220013133A1 (en) | 2022-01-13 |
| EP3920183A4 (en) | 2022-06-08 |
| CN110648680B (zh) | 2024-05-14 |
| CN110648680A (zh) | 2020-01-03 |
| JP7301154B2 (ja) | 2023-06-30 |
| EP3920183B1 (en) | 2025-06-25 |
| JP2022527527A (ja) | 2022-06-02 |
| US12039987B2 (en) | 2024-07-16 |
| EP3920183A1 (en) | 2021-12-08 |
Similar Documents
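| Publication | Publication Date | Title |
|---|---|---|
| WO2021057239A1 (zh) | | Speech data processing method and apparatus, electronic device, and readable storage medium |
| US11355097B2 (en) | | Sample-efficient adaptive text-to-speech |
| US8996372B1 (en) | | Using adaptation data with cloud-based speech recognition |
| US20230335148A1 (en) | | Speech Separation Method, Electronic Device, Chip, and Computer-Readable Storage Medium |
| WO2021128880A1 (zh) | | Speech recognition method and apparatus, and apparatus for speech recognition |
| US12230275B2 (en) | | Speech instruction recognition method, electronic device, and non-transient computer readable storage medium |
| CN109410973B (zh) | | Voice changing processing method and apparatus, and computer-readable storage medium |
| CN106165015B (zh) | | Apparatus and method for facilitating watermarking-based echo management |
| EP4254408A1 (en) | | Speech processing method and apparatus, and apparatus for processing speech |
| US11776563B2 (en) | | Textual echo cancellation |
| CN114898762A (zh) | | Target-speaker-based real-time speech noise reduction method and apparatus, and electronic device |
| CN110827808A (zh) | | Speech recognition method and apparatus, electronic device, and computer-readable storage medium |
| CN113223553B (zh) | | Method, apparatus, and medium for separating speech signals |
| CN110246502A (zh) | | Speech noise reduction method and apparatus, and terminal device |
| CN112750469B (zh) | | Method for detecting music in speech, method for optimizing voice communication, and corresponding apparatuses |
| CN107437412B (zh) | | Acoustic model processing method, speech synthesis method, apparatus, and related device |
| CN114783455B (zh) | | Method, apparatus, electronic device, and computer-readable medium for speech noise reduction |
| CN110580910B (zh) | | Audio processing method, apparatus, device, and readable storage medium |
| CN119864010A (zh) | | Non-autoregressive speech synthesis method and apparatus, computer device, and storage medium |
| CN111667842A (zh) | | Audio signal processing method and apparatus |
| CN111968630B (zh) | | Information processing method and apparatus, and electronic device |
| HK40013080B (zh) | | Speech data processing method and apparatus, electronic device, and readable storage medium |
| HK40013080A (zh) | | Speech data processing method and apparatus, electronic device, and readable storage medium |
| CN114242053A (zh) | | Voice control method and apparatus, and storage medium |
| CN116153291A (zh) | | Speech recognition method and device |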
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | EP: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 20868291; Country of ref document: EP; Kind code of ref document: A1 |
| | ENP | Entry into the national phase | Ref document number: 2020868291; Country of ref document: EP; Effective date: 20210901 |
| | ENP | Entry into the national phase | Ref document number: 2021558880; Country of ref document: JP; Kind code of ref document: A |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | WWG | WIPO information: grant in national office | Ref document number: 2020868291; Country of ref document: EP |