WO2022198820A1 - Voice processing method and apparatus, and device for voice processing - Google Patents
Voice processing method and apparatus, and device for voice processing
- Publication number
- WO2022198820A1 (PCT/CN2021/102566)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- signal
- processing
- microphones
- frame
- voice
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L21/0216—Noise filtering characterised by the method used for estimating noise
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/20—Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/78—Detection of presence or absence of voice signals
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L21/0216—Noise filtering characterised by the method used for estimating noise
- G10L2021/02161—Number of inputs available containing the signal or the noise to be suppressed
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L21/0216—Noise filtering characterised by the method used for estimating noise
- G10L2021/02161—Number of inputs available containing the signal or the noise to be suppressed
- G10L2021/02165—Two microphones, one receiving mainly the noise signal and the other one mainly the speech signal
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L21/0216—Noise filtering characterised by the method used for estimating noise
- G10L2021/02161—Number of inputs available containing the signal or the noise to be suppressed
- G10L2021/02166—Microphone arrays; Beamforming
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0272—Voice signal separating
Definitions
- The present application relates to the technical field of intelligent control, and in particular to a voice processing method, a voice processing apparatus, and a device for voice processing.
- Smart devices can convert a user's speech into text through speech recognition and then interpret the user's instructions by analyzing the text.
- In a quiet environment, the smart device can recognize the user's voice accurately.
- In practice, however, the user's environment is complex and changeable, and noise or interference often degrades recognition: the speech reaching the recognizer is too noisy, and recognition accuracy suffers.
- Embodiments of the present application provide a voice processing method, a voice processing apparatus, and a device for voice processing, which can improve the accuracy of speech recognition performed by smart devices.
- An embodiment of the present application discloses a voice processing method applied to a terminal device provided with at least two microphones. The method includes:
- summing the signals received by the at least two microphones to obtain a first signal, and performing difference processing on the signals received by the at least two microphones to obtain a second signal;
- performing blind separation processing on the first signal and the second signal to obtain a speech signal and a noise signal; and
- performing adaptive noise cancellation processing on the speech signal, based on the noise signal, to obtain a target speech signal.
- An embodiment of the present application discloses a voice processing apparatus applied to a terminal device.
- The terminal device is provided with at least two microphones, and the apparatus includes:
- a coarse separation module configured to sum the signals received by the at least two microphones to obtain a first signal, and to perform difference processing on the signals received by the at least two microphones to obtain a second signal;
- a blind separation processing module configured to perform blind separation processing on the first signal and the second signal to obtain a speech signal and a noise signal; and
- an adaptive noise cancellation processing module configured to perform adaptive noise cancellation processing on the speech signal, based on the noise signal, to obtain a target speech signal.
- An embodiment of the present application discloses a device for voice processing applied to a terminal device provided with at least two microphones. The device includes a memory and one or more programs, where the one or more programs are stored in the memory, are configured to be executed by one or more processors, and include instructions for:
- summing the signals received by the at least two microphones to obtain a first signal, and performing difference processing on the signals received by the at least two microphones to obtain a second signal;
- performing blind separation processing on the first signal and the second signal to obtain a speech signal and a noise signal; and
- performing adaptive noise cancellation processing on the speech signal, based on the noise signal, to obtain a target speech signal.
- An embodiment of the present application discloses a machine-readable medium storing instructions that, when executed by one or more processors, cause an apparatus to execute one or more of the aforementioned voice processing methods.
- The voice processing method of the embodiment of the present application can be applied to a terminal device provided with at least two microphones.
- Two or more microphones of the terminal device can form a differential array that coarsely separates the speech signal from the noise signal. Specifically, summing the signals received by at least two microphones forms a beam in front of the speaker that mainly receives the speaker's voice and suppresses, to a certain extent, noise from behind the speaker, yielding a signal dominated by speech (the first signal).
- Performing difference processing on the signals received by at least two microphones forms a beam behind the speaker that mainly receives noise or interference from behind the speaker, yielding a signal dominated by noise (the second signal).
- The first and second signals obtained by coarse separation are then further separated using blind separation, which yields a more accurate speech signal and noise signal.
- Finally, adaptive noise cancellation is applied to the speech signal based on the noise signal, yielding the denoised target speech signal.
- The embodiments of the present application thus combine differential microphone array technology with blind separation and adaptive noise cancellation to perform coarse separation, further separation, and adaptive noise cancellation on the signals received by at least two microphones. The resulting speech signal and noise signal are more precise, which in turn improves the efficiency and accuracy of removing noise or interference from the speech signal.
- In addition, because the coarse separation uses differential microphone array technology, it is not sensitive to the direction of the noise or interference. This improves the robustness of the denoising and the denoising effect, and thereby the speech recognition accuracy of the terminal device in complex, changeable, and noisy environments.
- FIG. 1 is a flowchart of the steps of an embodiment of a speech processing method of the present application.
- FIG. 2 is a schematic flowchart of performing difference processing on the signals of three microphones according to the present application.
- FIG. 3 is a schematic diagram of a signal inflow of an adaptive noise cancellation processing module of the present application.
- FIG. 4 is a structural block diagram of an embodiment of a speech processing apparatus of the present application.
- FIG. 5 is a block diagram of a device 800 for speech processing according to the present application.
- FIG. 6 is a schematic structural diagram of a server in some embodiments of the present application.
- Referring to FIG. 1, a flowchart of the steps of an embodiment of a voice processing method of the present application is shown. The method is applied to a terminal device provided with at least two microphones and may specifically include the following steps:
- Step 101: sum the signals received by the at least two microphones to obtain a first signal, and perform difference processing on the signals received by the at least two microphones to obtain a second signal;
- Step 102: perform blind separation processing on the first signal and the second signal to obtain a speech signal and a noise signal;
- Step 103: based on the noise signal, perform adaptive noise cancellation processing on the speech signal to obtain a target speech signal.
- The voice processing method provided in the embodiment of the present application can be applied to a terminal device.
- The terminal device has at least two microphones and can be used to collect sound signals. Examples include air conditioners, refrigerators, rice cookers, and water heaters; business smart terminals (video phones, conference desktop smart terminals, etc.); wearable devices (smart watches, smart glasses, etc.); financial smart terminals; and smartphones, tablet PCs, personal digital assistants (PDAs), in-vehicle devices, and computers.
- For ease of description, the embodiments of the present application take an earphone as the example terminal device; the earphone has at least two microphones.
- The speech processing method of the embodiment of the present application uses three stages of separation. The first-stage separation module performs summation and difference processing on the signals received by at least two microphones to obtain a first channel signal and a second channel signal, achieving coarse separation of the speech and noise signals.
- The second-stage separation module uses blind separation to further separate the first channel signal and the second channel signal extracted by the first-stage separation module, obtaining a speech signal and a noise signal.
- The third-stage separation module performs adaptive noise cancellation on the separated speech signal, based on the noise signal separated by the second-stage separation module, to obtain the final target speech signal.
- The embodiment of the present application first performs a preliminary extraction on the signals received by the at least two microphones of the terminal device. Specifically, the signals received by the at least two microphones are summed to obtain a first channel signal, and difference processing is performed on the signals received by the at least two microphones to obtain a second channel signal. The first channel signal is dominated by the speaker's voice, and the second channel signal is dominated by noise.
- Taking an earphone with two microphones as an example, one of the two microphones is usually close to the speaker's mouth and is called the first microphone; the other is farther from the speaker's mouth and is called the second microphone.
- For summation, the signal received by the first microphone is added to the signal received by the second microphone.
- This forms a beam in front of the speaker (the end-fire direction) that mainly receives the speaker's voice, while noise from behind the speaker is suppressed to a certain extent. Summing the signals received by the two microphones therefore yields a signal dominated by speech (the first channel signal).
- For difference processing, the signal received by the second microphone is subtracted from the signal received by the first microphone. This forms a beam behind the speaker that mainly receives noise or interference from behind the speaker. Difference processing of the signals received by the two microphones therefore yields a signal dominated by noise (the second channel signal).
- In this way, a signal dominated by the speaker's voice (the first channel signal) and a signal dominated by noise (the second channel signal) are obtained.
- Note that the first channel signal, although dominated by the speaker's voice, still contains some noise, and the second channel signal, although dominated by noise, still contains some speech.
- When the terminal device has more than two microphones, the first channel signal is extracted in the same way as with two microphones, while the extraction of the second channel signal differs slightly.
- Before the summation and difference processing, the method may further include aligning the phases of the signals received by the at least two microphones.
- In that case, step 101 includes summing the phase-aligned signals to obtain the first channel signal and performing difference processing on the phase-aligned signals to obtain the second channel signal.
- That is, phase alignment is first performed on the signals received by the at least two microphones to obtain the phase-aligned signals.
- The phase-aligned signals are then summed to obtain the first channel signal and differenced to obtain the second channel signal.
- This improves the accuracy with which the first and second channel signals are extracted, and thus the noise reduction applied to the speech signal.
- The extracted channel signals can be used as auxiliary estimates for subsequent further denoising to improve the final denoising effect.
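The text does not say how the phases are aligned. As a minimal sketch under that caveat, one common approach is to estimate an integer-sample delay by cross-correlation and shift one signal accordingly. The function names `estimate_lag` and `align_to`, the `max_lag` search bound, and the list-of-floats representation are all assumptions made for illustration, not part of the source.

```python
def estimate_lag(ref, sig, max_lag):
    """Return the integer lag (in samples) at which sig best matches ref.

    A positive lag means sig is delayed relative to ref.
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        # Cross-correlation of ref[i] with sig[i + lag] over the valid overlap.
        score = 0.0
        for i, r in enumerate(ref):
            j = i + lag
            if 0 <= j < len(sig):
                score += r * sig[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def align_to(ref, sig, max_lag):
    """Shift sig by the estimated lag so it lines up with ref (zero-padded)."""
    lag = estimate_lag(ref, sig, max_lag)
    return [sig[i + lag] if 0 <= i + lag < len(sig) else 0.0
            for i in range(len(sig))]
```

Integer-sample alignment is coarse; closely spaced microphones may need fractional-delay or frequency-domain alignment, which this sketch does not attempt.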
- Summing the signals received by two microphones is the same process as summing the signals received by more than two microphones.
- Difference processing is described below separately for two microphones and for more than two microphones.
- When the terminal device is provided with two microphones, performing difference processing on the signals received by the at least two microphones in step 101 to obtain a second channel signal includes:
- Step S11: determining a first microphone and a second microphone among the two microphones;
- Step S12: subtracting each frame of signal received by the first microphone from the corresponding frame of signal received by the second microphone to obtain the second channel signal.
- First, the first microphone and the second microphone are determined among the two microphones.
- The first microphone is the one of the two microphones that is close to the speaker's mouth.
- The second microphone is the one of the two microphones that is far from the speaker's mouth.
- The two microphones are in a straight line.
- A phase alignment operation is first performed on the signal received by the first microphone and the signal received by the second microphone to obtain the phase-aligned signals of the two microphones. The phase-aligned signals of the two microphones are then summed to obtain the first channel signal; the summation also suppresses white noise.
- Difference processing is then performed on the phase-aligned signals of the two microphones. Specifically, the signal received by the first microphone is subtracted from the signal received by the second microphone to obtain the second channel signal.
- The signal received by each microphone of the terminal device is processed in units of frames, so that it can be processed in real time, which improves both the timeliness and the accuracy of processing. Specifically, by subtracting each frame of signal received by the second microphone from each frame of signal received by the first microphone, the second channel signal can be obtained.
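The two-microphone summation and difference steps can be illustrated with a short sketch. It is a minimal illustration rather than the patent's implementation: the function name `coarse_separate` and the list-of-floats frame representation are assumptions, the frames are assumed to be already phase-aligned, and the difference is taken as first microphone minus second microphone (the text states the subtraction in both orders; the order only flips the sign of the second channel).

```python
def coarse_separate(frame_mic1, frame_mic2):
    """Return (first_channel, second_channel) for one pair of aligned frames.

    first_channel  = mic1 + mic2  (speech-dominated sum)
    second_channel = mic1 - mic2  (noise-dominated difference; sign is an assumption)
    """
    if len(frame_mic1) != len(frame_mic2):
        raise ValueError("frames must have the same length")
    first_channel = [a + b for a, b in zip(frame_mic1, frame_mic2)]
    second_channel = [a - b for a, b in zip(frame_mic1, frame_mic2)]
    return first_channel, second_channel
```

A component that reaches both microphones identically after alignment doubles in the sum and cancels in the difference, which is why the sum favors the aligned source and the difference favors everything else.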
- When the terminal device is provided with n microphones, where n is greater than 2,
- performing difference processing on the signals received by the at least two microphones in step 101 to obtain the second channel signal includes:
- Step S21: subtracting the current frame signal received by the i-th microphone from the current frame signal received by the (i-1)-th microphone to obtain n-1 channels of frame signals, where the value of i is 1 to n;
- Step S22: performing adaptive filtering on each of the n-1 channels of frame signals with the reference signal y(n) to obtain n-1 processed channels of frame signals;
- Step S23: summing the n-1 processed channels of frame signals to obtain the second-channel frame signal output for the current frame;
- Step S24: after all frame signals received by the n microphones have been processed, obtaining the second channel signal.
- That is, when the terminal device has more than two microphones, the following is done for each frame.
- The current frame signal received by the i-th microphone is subtracted from the current frame signal received by the (i-1)-th microphone to obtain n-1 channels of frame signals, where the value of i ranges from 1 to n; the n-1 channels of frame signals are each adaptively filtered with the reference signal y(n) to obtain n-1 processed channels of frame signals; and the n-1 processed channels are summed to obtain the second-channel frame signal output for the current frame.
- The reference signal is y(n) = yc(n) - N(n), where
- yc(n) is the sum of the signals of the previous frame received by the n microphones, and
- N(n) is the first frame output of the previous frame.
- In other words, the embodiment of the present application uses the processing result of the previous frame to calculate the reference signal y(n) of the current frame, and uses y(n) to update the adaptive filter.
- For the first frame, an initial reference signal y(n) can be set.
- When the second frame is processed, the adaptive filter can be updated with y(n) calculated from the first frame.
- When the third frame is processed, the adaptive filter can be updated with y(n) calculated from the second frame, and so on, until the last frame has been processed and the complete second channel signal is obtained.
- The embodiment of the present application does not limit the type of adaptive filter used in difference processing; for example, it may be a normalized least mean square (NLMS) adaptive filter.
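For readers unfamiliar with NLMS, the following is a generic textbook NLMS filter in sample-by-sample form. It is a sketch only: the function name `nlms_filter`, the parameters `order`, `mu`, and `eps`, and the time-domain formulation are assumptions, and the source does not specify how its adaptive filter is wired to the difference signals and the reference y(n).

```python
def nlms_filter(x, d, order=2, mu=0.5, eps=1e-8):
    """Adapt an FIR filter so that its output for input x tracks d.

    Returns (final_weights, errors), where errors[n] = d[n] - w^T taps(n).
    """
    w = [0.0] * order
    taps = [0.0] * order          # most recent input sample first
    errors = []
    for xn, dn in zip(x, d):
        taps = [xn] + taps[:-1]
        y = sum(wi * ti for wi, ti in zip(w, taps))
        e = dn - y
        # Normalising by the tap energy makes the step size input-level independent.
        norm = eps + sum(t * t for t in taps)
        w = [wi + mu * e * ti / norm for wi, ti in zip(w, taps)]
        errors.append(e)
    return w, errors
```

The normalization is what distinguishes NLMS from plain LMS: the effective step size no longer depends on how loud the input is, which matters for microphone signals whose level varies widely.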
- Referring to FIG. 2, a schematic flowchart of performing difference processing on the signals of three microphones according to an embodiment of the present application is shown. As shown in FIG. 2, the three microphones are microphone 1, microphone 2, and microphone 3.
- First, difference processing is performed on the signals of the three microphones.
- Specifically, the signal of microphone 1 is subtracted from the signal of microphone 2 to obtain signal a, and the signal of microphone 2 is subtracted from the signal of microphone 3 to obtain signal b.
- Signal a and signal b are each adaptively filtered with the reference signal y(n) to obtain signal a' and signal b', and the second channel signal is obtained by adding signal a' and signal b'.
- For example, the following operations can be performed on the first frame: first perform adaptive filtering (the adaptive filter is in its initial state for the first calculation) and calculate N(n); then perform the summation to obtain yc(n), and subtract N(n) from yc(n) to obtain y(n); then update the adaptive filter using the calculated y(n). This completes the difference processing of the first frame and yields the second-channel frame signal output for the first frame. The following frames are then processed in turn in the same way.
- When the second frame is processed, the processing result of the first frame can be referred to.
- When the third frame is processed, the processing result of the second frame can be referred to, and so on.
- Once all frames have been processed, the complete second channel signal is obtained.
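Two of the per-frame operations in the three-microphone walkthrough are simple enough to sketch directly: the adjacent-microphone differences (signals a and b) and the reference signal y(n) = yc(n) - N(n). The function names, the list-of-lists frame layout, and the argument `prev_output` (standing in for N(n)) are assumptions for illustration; the adaptive filtering step between them is omitted because the source does not detail its wiring.

```python
def adjacent_differences(frames):
    """frames[k] is the current frame from microphone k+1.

    Returns n-1 difference frames, following the FIG. 2 description:
    microphone 2 minus microphone 1, microphone 3 minus microphone 2, ...
    """
    return [[b - a for a, b in zip(frames[k - 1], frames[k])]
            for k in range(1, len(frames))]


def reference_signal(prev_frames, prev_output):
    """Compute y = yc - N from the previous frame.

    yc is the sample-wise sum over all microphones of the previous frame,
    and prev_output plays the role of N(n) (an assumed interpretation).
    """
    yc = [sum(samples) for samples in zip(*prev_frames)]
    return [c - n for c, n in zip(yc, prev_output)]
```

For three microphones, `adjacent_differences` returns exactly two frames, corresponding to signals a and b before adaptive filtering.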
- Through step 101, the embodiment of the present application achieves coarse separation of the speech signal and the noise signal.
- Next, the first channel signal and the second channel signal can be subjected to blind separation processing to further separate the speech signal and the noise signal and obtain a more accurate speech signal and noise signal.
- Blind separation refers to the technique of separating each source signal (such as the speaker's speech signal and the noise signal) from the collected mixed signal when the source signals cannot be accurately known. Because the microphones in earphones usually have a small aperture and are few in number, the sound signal collected in a relatively noisy environment contains a large amount of noise, resulting in poor speech quality.
- The embodiment of the present application therefore performs blind separation processing on the extracted first channel signal and second channel signal. Blind separation of the first channel signal further reduces the noise in it and yields a speech signal that contains less noise; blind separation of the second channel signal further reduces the speech in it and yields a noise signal that contains less speech. This provides a basis for the subsequent noise reduction processing.
- performing blind separation processing on the first signal and the second signal in step 102 to obtain a speech signal and a noise signal including:
- ICA Independent Component Correlation Algorithm, Independent Component Analysis
- The second-stage separation module in the embodiment of the present application uses IVA (Independent Vector Analysis) to blindly separate the first channel signal to obtain a speech signal, and to blindly separate the second channel signal to obtain a noise signal.
- IVA blind separation is not sensitive to the direction of the noise and still separates robustly when the noise is in front of the speaker, which further improves speech noise reduction.
- Other blind separation algorithms, such as PCA (Principal Component Analysis), may also be used.
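IVA is too involved for a short example, so the sketch below illustrates only the PCA alternative mentioned in the text, for exactly two channels. It centers both channels and rotates them so that the outputs are uncorrelated, with the higher-variance component first. The function name `pca_rotate` and the closed-form two-channel treatment are assumptions; PCA only decorrelates and is a much weaker criterion than the statistical independence that IVA or ICA pursue.

```python
import math


def pca_rotate(ch1, ch2):
    """Rotate two equal-length channels onto their principal axes.

    Returns (u, v): zero-mean, mutually uncorrelated, with var(u) >= var(v).
    """
    n = len(ch1)
    m1, m2 = sum(ch1) / n, sum(ch2) / n
    a = [s - m1 for s in ch1]
    b = [s - m2 for s in ch2]
    # Entries of the 2x2 covariance matrix.
    c11 = sum(s * s for s in a) / n
    c22 = sum(s * s for s in b) / n
    c12 = sum(p * q for p, q in zip(a, b)) / n
    # Rotation angle that diagonalises the covariance matrix.
    theta = 0.5 * math.atan2(2.0 * c12, c11 - c22)
    c, s = math.cos(theta), math.sin(theta)
    u = [c * p + s * q for p, q in zip(a, b)]
    v = [-s * p + c * q for p, q in zip(a, b)]
    return u, v
```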
- In step 103, performing adaptive noise cancellation processing on the speech signal based on the noise signal to obtain a target speech signal includes:
- using an adaptive filtering algorithm based on recursive least squares (RLS) to perform adaptive noise cancellation processing on the speech signal and obtain the target speech signal.
- The adaptive noise cancellation processing in the embodiment of the present application adopts RLS (Recursive Least Squares), which converges rapidly.
- The RLS adaptive filtering algorithm is expressed in terms of the following quantities:
- n: the frame number
- W: the adaptive filter coefficient vector
- G: the gain vector
- X: the noise signal output by blind separation
- d: the speech signal output by blind separation
- s(n): the final output target speech signal
- The forgetting factor λ can be chosen as a constant such as 0.99.
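The patent's own RLS equations are not reproduced in this text, so the following sketch uses the standard exponentially weighted RLS recursion instead, with variable names chosen to mirror the quantities listed above (W, G, X, d, s). The function name `rls_cancel`, the parameters `order` and `delta`, the inverse-correlation matrix `P`, and the sample-by-sample time-domain formulation are assumptions; the source indexes by frame and may operate differently.

```python
def rls_cancel(d, x, order=4, lam=0.99, delta=100.0):
    """Adaptive noise cancellation with textbook RLS.

    d: primary input (speech plus residual noise)
    x: noise reference
    Returns s, where s[n] = d[n] - W^T X(n) is the cleaned output.
    """
    W = [0.0] * order
    # P estimates the inverse input correlation matrix; start at delta * I.
    P = [[delta if i == j else 0.0 for j in range(order)] for i in range(order)]
    X = [0.0] * order                      # most recent noise sample first
    s = []
    for dn, xn in zip(d, x):
        X = [xn] + X[:-1]
        PX = [sum(P[i][j] * X[j] for j in range(order)) for i in range(order)]
        denom = lam + sum(X[i] * PX[i] for i in range(order))
        G = [v / denom for v in PX]        # gain vector
        sn = dn - sum(W[i] * X[i] for i in range(order))
        W = [W[i] + G[i] * sn for i in range(order)]
        # P <- (P - G X^T P) / lam; P is symmetric, so X^T P equals PX^T.
        P = [[(P[i][j] - G[i] * PX[j]) / lam for j in range(order)]
             for i in range(order)]
        s.append(sn)
    return s
```

Each update costs on the order of `order` squared operations because of the matrix `P`, which is the source of the computational load on small devices.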
- In practice, the RLS adaptive filtering algorithm requires a large amount of computation, which places a heavy load on terminal devices with limited computing capability, such as earphones.
- So that the method can be applied to terminal devices with different computing capabilities, the embodiment of the present application introduces a voice activity detection module to reduce the computation of the RLS adaptive filtering algorithm.
- The method may further include:
- Step S31: performing voice activity detection on each frame of the speech signal;
- Step S32: setting a voice signal flag bit for each frame signal whose voice activity detection result is a voice signal.
- Performing adaptive noise cancellation processing on the speech signal in step 103 then includes: performing adaptive noise cancellation processing on the frame signals of the speech signal that have the voice signal flag bit.
- The purpose of voice activity detection (VAD) is to detect whether the current signal contains speech, that is, to classify the input signal and distinguish speech from various background noise signals.
- In practice, not every frame of the sound signal received by the microphone includes the speaker's voice. Performing adaptive noise cancellation on every frame would add computational cost and reduce the efficiency of speech processing. The embodiment of the present application therefore performs voice activity detection on the speech signal obtained by blind separation to detect whether the current frame contains speech, and performs adaptive noise cancellation only on frames that contain speech, which reduces computational cost and improves the efficiency of speech processing.
- Specifically, before the speech signal and noise signal separated by the second-stage separation module are input into the adaptive noise cancellation module (the third-stage separation module), the speech signal separated by the second-stage separation module is first input into the voice activity detection module.
- The voice activity detection module detects, frame by frame, whether each frame of the input speech signal contains speech, and sets the voice signal flag bit for the frames that do, according to the detection result.
- The voice activity detection result of each frame signal is then passed to the adaptive noise cancellation module, which decides whether to perform adaptive noise cancellation according to whether the result contains the voice signal flag bit.
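The text does not say which voice activity detection method is used. Purely as an illustration of per-frame flagging, the sketch below uses a mean-energy threshold, which is one of the simplest possible detectors. The function names `frame_energy` and `flag_voice_frames` and the `threshold` parameter are hypothetical, and a practical detector would be considerably more elaborate.

```python
def frame_energy(frame):
    """Mean squared amplitude of one frame (0.0 for an empty frame)."""
    if not frame:
        return 0.0
    return sum(v * v for v in frame) / len(frame)


def flag_voice_frames(frames, threshold):
    """Return one boolean flag per frame: True where the frame is judged to
    contain speech (here, simply where its energy exceeds the threshold)."""
    return [frame_energy(f) > threshold for f in frames]
```

The resulting list of booleans plays the role of the voice signal flag bits passed on to the adaptive noise cancellation stage.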
- Optionally, the voice activity detection module may be left inactive for a preset initial period (such as the first 20 s).
- During this period the adaptive filter is updated continuously. After the preset period (20 s), whether adaptive noise cancellation is performed depends on the voice signal flag bit. This saves processing time and power, and allows the adaptive filter coefficients to be updated more accurately, improving the robustness of the algorithm.
- Further, a sliding window strategy can be adopted. The sliding window stores the voice signal flag bits of a preset number of past frames (such as 5 to 10 frames) together with the flag bit of the current frame. Adaptive noise cancellation is performed, and the adaptive filter coefficients are updated, only when every frame signal in the sliding window has the voice signal flag bit, that is, only when all frames in the window contain speech.
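The gating logic described above (an initial period with no gating, then a window of recent flags that must all indicate speech) can be sketched as a small class. The class name `VadGate` and the parameters `history_frames` and `warmup_frames` are assumptions; in particular, expressing the 20 s period as a frame count and requiring the window to be full before gating passes are choices made for this illustration.

```python
from collections import deque


class VadGate:
    """Decide, frame by frame, whether to run adaptive noise cancellation."""

    def __init__(self, history_frames=5, warmup_frames=0):
        # The window holds the past `history_frames` flags plus the current one.
        self.flags = deque(maxlen=history_frames + 1)
        self.warmup_frames = warmup_frames
        self.seen = 0

    def should_process(self, has_voice):
        self.flags.append(bool(has_voice))
        self.seen += 1
        if self.seen <= self.warmup_frames:
            return True        # initial period: always process and update
        # Afterwards: process only if the window is full and every flag is set.
        return len(self.flags) == self.flags.maxlen and all(self.flags)
```

A caller would invoke `should_process` once per frame with that frame's voice flag and run the cancellation and coefficient update only when it returns `True`.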
- The embodiments of the present application can quickly eliminate noise or interference in speech and, compared with existing algorithms, are not sensitive to the direction of the noise or interference, so the denoising performance is more robust.
- taking two microphones first perform a phase alignment operation on the signal received by the first microphone and the signal received by the second microphone to obtain the phase-aligned signals of the two microphones.
- the phase-aligned signals of the two microphones are summed to obtain the first signal, and the signal received by the second microphone is subtracted from the signal received by the first microphone to obtain the second signal.
- the signals received by the two microphones include the voice signal of girl A and the voice signal of boy B, wherein the voice signal of boy B is the voice signal of the target speaker to be extracted.
- the first-level separation module Through the processing of the first-level separation module, a first-channel signal dominated by boys' voice signals and a second-channel signal dominated by girls' voice signals are obtained.
- the girl's speech signal may be processed as a noise signal relative to the boy's speech signal.
- the first channel signal dominated by the boy's voice signal and the second channel signal dominated by the girl's voice signal are input into the blind separation processing module for blind separation processing.
- the girl's voice signal in the first signal is further reduced to obtain a voice signal;
- the boy's voice signal in the second signal is further reduced to obtain a noise signal.
- the speech signal and the noise signal output by the blind separation processing module are input into the adaptive noise cancellation processing module, and the speech signal output by the blind separation processing module is input into the voice activation detection module for voice activation detection.
- the speech activation detection result of the frame signal is input to the adaptive noise cancellation processing module.
- the adaptive noise cancellation processing module decides whether to perform adaptive noise cancellation processing on the current frame according to whether the voice activation detection result output by the voice activation detection module includes the voice signal flag bit.
- FIG. 3 is a schematic diagram of the signal inflow of the adaptive noise cancellation processing module.
- the speech signal and noise signal output by the blind separation processing module and the speech activation detection result of each frame signal output by the speech activation detection module are used as the input of FIG. 4, and the target speech signal is finally output.
- a differential array can be formed by using two or more microphones of the terminal device to achieve rough separation of speech signals and noise signals. Specifically, summing the signals received by at least two microphones forms a beam in front of the speaker that mainly receives the speaker's voice and suppresses, to a certain extent, the noise to the side of and behind the speaker, so that a speech-dominated signal (the first channel signal) can be obtained.
- taking the difference of the signals received by at least two microphones forms a beam to the side of and behind the speaker that mainly receives noise or interference from that direction, so that a noise-dominated signal (the second channel signal) can be obtained.
- the first channel signal and the second channel signal obtained by the rough separation are further separated based on the blind separation technology, so that more accurate speech signals and noise signals can be obtained.
- adaptive noise cancellation processing is performed based on the speech signal and the noise signal obtained by blind separation, and a target speech signal with the noise removed can be obtained.
- the embodiments of the present application use the differential microphone array technology, combined with the blind separation technology and the adaptive noise cancellation technology, to perform coarse separation, further separation, and adaptive noise cancellation on the signals received by at least two microphones.
- the separated speech signal and noise signal are therefore more precise, which in turn can improve the efficiency and accuracy of removing noise or interference from the speech signal.
- compared with existing noise reduction algorithms, the embodiment of the present application uses the differential microphone array technology to roughly separate the signals received by at least two microphones, so that the rough separation process is insensitive to the direction of noise or interference. This improves the robustness of the denoising performance, optimizes the speech denoising effect, and further improves the speech recognition accuracy of terminal equipment in complex and changeable environments with heavy noise or interference.
- FIG. 4 shows a structural block diagram of an embodiment of a voice processing apparatus of the present application.
- the apparatus can be applied to terminal equipment, and the terminal equipment is provided with at least two microphones.
- the apparatus may include:
- the coarse separation module 401 is used for summing the signals received by the at least two microphones to obtain a first channel signal, and performing difference processing on the signals received by the at least two microphones to obtain a second channel signal;
- a blind separation processing module 402 configured to perform blind separation processing on the first signal and the second signal to obtain a speech signal and a noise signal;
- the adaptive noise cancellation processing module 403 is configured to perform adaptive noise cancellation processing on the voice signal based on the noise signal to obtain a target voice signal.
- the device further includes:
- a phase alignment module configured to phase-align the signals received by the at least two microphones;
- the coarse separation module is specifically configured to sum the phase-aligned signals received by the at least two microphones to obtain the first channel signal, and to perform difference processing on the phase-aligned signals received by the at least two microphones to obtain the second channel signal.
- the terminal device is provided with two microphones
- the blind separation processing module includes:
- a determination submodule for determining a first microphone and a second microphone among the two microphones
- the first subtraction sub-module is used for subtracting each frame of the signal received by the first microphone from the corresponding frame of the signal received by the second microphone to obtain the second channel signal.
- the terminal device is provided with n microphones, where n is greater than 2, and the blind separation processing module includes:
- the second subtraction sub-module is used to subtract the current frame signal received by the (i-1)-th microphone from the current frame signal received by the i-th microphone to obtain n-1 frame signals, where i takes values from 1 to n;
- a summation submodule for summing the processed n-1 frame signals to obtain the second channel frame signal output for the current frame;
- the iterative completion sub-module is used to obtain the second channel signal after all frame signals received by the n microphones are processed.
- the blind separation processing module is specifically configured to perform blind separation processing on each frame of the first channel signal using an independent vector analysis blind separation algorithm to obtain a speech signal, and to perform blind separation processing on each frame of the second channel signal using an independent vector analysis blind separation algorithm to obtain a noise signal.
- the device further includes:
- a voice activation detection module for performing voice activation detection on every frame signal in the speech signal, and setting the voice signal flag bit for each frame signal whose voice activation detection result is a voice signal;
- the adaptive noise cancellation processing module is specifically configured to perform adaptive noise cancellation processing on the frame signal with the voice signal flag bit in the speech signal.
- the adaptive noise cancellation processing module is specifically configured to use the noise signal as a reference signal and the voice signal as a target signal, and to perform adaptive noise cancellation processing on the voice signal based on an RLS adaptive filtering algorithm to obtain the target speech signal.
- the embodiments of the present application use two or more microphones of a terminal device to form a differential array. Based on the blind separation technology, combined with the differential microphone array technology and adaptive filtering technology, noise or interference in speech can be quickly eliminated. Compared with existing algorithms , the embodiment of the present application is not sensitive to the direction of noise or interference, the denoising performance is more robust, the speech denoising effect is optimized, and the speech recognition accuracy of the terminal device is improved in the case of complex and changeable environment, noise or interference.
- An embodiment of the present application provides an apparatus for speech processing, which is applied to a terminal device provided with at least two microphones. The apparatus includes a memory and one or more programs, where the one or more programs are stored in the memory and configured to be executed by one or more processors, and include instructions for: summing the signals received by the at least two microphones to obtain the first channel signal, and performing difference processing on the signals received by the at least two microphones to obtain the second channel signal; performing blind separation processing on the first channel signal and the second channel signal to obtain a voice signal and a noise signal; and performing adaptive noise cancellation processing on the voice signal based on the noise signal to obtain a target voice signal.
- FIG. 5 is a block diagram of an apparatus 800 for speech processing according to an exemplary embodiment.
- apparatus 800 may be a mobile phone, computer, digital broadcast terminal, messaging device, game console, tablet device, medical device, fitness device, personal digital assistant, and the like.
- the apparatus 800 may include one or more of the following components: a processing component 802, a memory 804, a power supply component 806, a multimedia component 808, an audio component 810, an input/output (I/O) interface 812, a sensor component 814, and communication component 816.
- the processing component 802 generally controls the overall operation of the device 800, such as operations associated with display, phone calls, data communications, camera operations, and recording operations.
- the processing component 802 may include one or more processors 820 to execute instructions to perform all or part of the steps of the methods described above. Additionally, processing component 802 may include one or more modules that facilitate interaction between processing component 802 and other components. For example, processing component 802 may include a multimedia module to facilitate interaction between multimedia component 808 and processing component 802.
- Memory 804 is configured to store various types of data to support operation at device 800 . Examples of such data include instructions for any application or method operating on device 800, contact data, phonebook data, messages, pictures, videos, and the like. Memory 804 may be implemented by any type of volatile or nonvolatile storage device or combination thereof, such as static random access memory (SRAM), electrically erasable programmable read only memory (EEPROM), erasable Programmable Read Only Memory (EPROM), Programmable Read Only Memory (PROM), Read Only Memory (ROM), Magnetic Memory, Flash Memory, Magnetic or Optical Disk.
- Power supply assembly 806 provides power to the various components of device 800 .
- Power components 806 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power to device 800 .
- Multimedia component 808 includes a screen that provides an output interface between the device 800 and the user.
- the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from a user.
- the touch panel includes one or more touch sensors to sense touch, swipe, and gestures on the touch panel. The touch sensor may not only sense the boundaries of a touch or swipe action, but also detect the duration and pressure associated with the touch or swipe action.
- multimedia component 808 includes a front-facing camera and/or a rear-facing camera. When the device 800 is in an operation mode, such as a shooting mode or a video mode, the front camera and/or the rear camera may receive external multimedia data.
- Each of the front and rear cameras can be a fixed optical lens system or have focal length and optical zoom capability.
- Audio component 810 is configured to output and/or input audio signals.
- audio component 810 includes a microphone (MIC) that is configured to receive external audio signals when device 800 is in operating modes, such as call mode, recording mode, and voice information processing mode. The received audio signal may be further stored in memory 804 or transmitted via communication component 816 .
- audio component 810 also includes a speaker for outputting audio signals.
- the I/O interface 812 provides an interface between the processing component 802 and a peripheral interface module, which may be a keyboard, a click wheel, a button, or the like. These buttons may include, but are not limited to: home button, volume buttons, start button, and lock button.
- Sensor assembly 814 includes one or more sensors for providing status assessment of various aspects of device 800 .
- the sensor assembly 814 can detect the on/off state of the device 800 and the relative positioning of components, such as the display and keypad of the device 800; the sensor assembly 814 can also detect a change in the position of the device 800 or of a component of the device 800, the presence or absence of user contact with the device 800, the orientation or acceleration/deceleration of the device 800, and a temperature change of the device 800.
- Sensor assembly 814 may include a proximity sensor configured to detect the presence of nearby objects in the absence of any physical contact.
- Sensor assembly 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications.
- the sensor assembly 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
- Communication component 816 is configured to facilitate wired or wireless communication between apparatus 800 and other devices.
- Device 800 may access wireless networks based on communication standards, such as WiFi, 2G or 3G, or a combination thereof.
- the communication component 816 receives broadcast signals or broadcast related information from an external broadcast management system via a broadcast channel.
- the communication component 816 also includes a near field communication (NFC) module to facilitate short-range communication.
- the NFC module may be implemented based on radio frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
- apparatus 800 may be implemented by one or more application specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components, for performing the above method.
- a non-transitory computer-readable storage medium including instructions is also provided, such as the memory 804 including instructions, which are executable by the processor 820 of the apparatus 800 to perform the method described above.
- the non-transitory computer-readable storage medium may be ROM, random access memory (RAM), CD-ROM, magnetic tape, floppy disk, optical data storage device, and the like.
- FIG. 6 is a schematic structural diagram of a server in some embodiments of the present application.
- the server 1900 may vary widely depending on configuration or performance, and may include one or more central processing units (CPUs) 1922 (e.g., one or more processors), memory 1932, and one or more storage media 1930 (e.g., one or more mass storage devices) that store applications 1942 or data 1944.
- the memory 1932 and the storage medium 1930 may be short-term storage or persistent storage.
- the program stored in the storage medium 1930 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations on the server.
- the central processing unit 1922 may be configured to communicate with the storage medium 1930 to execute a series of instruction operations in the storage medium 1930 on the server 1900 .
- Server 1900 may also include one or more power supplies 1926, one or more wired or wireless network interfaces 1950, one or more input/output interfaces 1958, one or more keyboards 1956, and/or, one or more operating systems 1941 , such as Windows ServerTM, Mac OS XTM, UnixTM, LinuxTM, FreeBSDTM and so on.
- a non-transitory computer-readable storage medium: when the instructions in the storage medium are executed by a processor of an apparatus (server or terminal), the apparatus is enabled to execute the voice processing method shown in FIG. 1.
- a non-transitory computer-readable storage medium: when the instructions in the storage medium are executed by a processor of an apparatus (server or terminal), the apparatus is enabled to execute a voice processing method, the method comprising: summing the signals received by the at least two microphones to obtain a first channel signal, and performing difference processing on the signals received by the at least two microphones to obtain a second channel signal; performing blind separation processing on the first channel signal and the second channel signal to obtain a voice signal and a noise signal; and performing adaptive noise cancellation processing on the voice signal based on the noise signal to obtain a target voice signal.
- These computer program instructions may be provided to the processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal equipment to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing terminal equipment create means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
- These computer program instructions may also be stored in a computer readable memory capable of directing a computer or other programmable data processing terminal equipment to operate in a particular manner, such that the instructions stored in the computer readable memory produce an article of manufacture comprising instruction means, the instruction means implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Quality & Reliability (AREA)
- Soundproofing, Sound Blocking, And Sound Damping (AREA)
- Circuit For Audible Band Transducer (AREA)
Abstract
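Embodiments of the present application provide a speech processing method, a speech processing apparatus, and an apparatus for speech processing, applied to a terminal device provided with at least two microphones. The method includes: summing the signals received by the at least two microphones to obtain a first channel signal, and performing difference processing on the signals received by the at least two microphones to obtain a second channel signal; performing blind separation processing on the first channel signal and the second channel signal to obtain a speech signal and a noise signal; and performing adaptive noise cancellation processing on the speech signal based on the noise signal to obtain a target speech signal. The embodiments can improve speech denoising and thereby raise the speech recognition accuracy of the terminal device in complex, changing environments with heavy noise or interference.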
Description
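This application claims priority to Chinese patent application No. 202110303349.2, filed with the China Patent Office on March 22, 2021 and entitled "A speech processing method, apparatus, and apparatus for speech processing", the entire contents of which are incorporated herein by reference.
This application relates to the field of intelligent control technology, and in particular to a speech processing method, a speech processing apparatus, and an apparatus for speech processing.
As speech recognition technology matures, more and more smart devices, such as smart speakers and smart TVs, are appearing on the market. Based on speech recognition technology, these devices offer users a more convenient way to interact.
Through speech recognition, a smart device converts the user's spoken voice into text and then analyzes the text to understand the user's instruction. In a quiet or high signal-to-noise-ratio environment, a smart device can usually recognize the user's speech accurately. In practice, however, the user's environment is complex and changeable, and noise or interference often lowers recognition accuracy: the speech to be recognized is too noisy, and the recognition result suffers.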
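Summary of the Invention
Embodiments of the present application provide a speech processing method, a speech processing apparatus, and an apparatus for speech processing, which can improve the accuracy of speech recognition by smart devices.
To solve the above problem, an embodiment of the present application discloses a speech processing method applied to a terminal device provided with at least two microphones, the method including:
summing the signals received by the at least two microphones to obtain a first channel signal, and performing difference processing on the signals received by the at least two microphones to obtain a second channel signal;
performing blind separation processing on the first channel signal and the second channel signal to obtain a speech signal and a noise signal;
performing adaptive noise cancellation processing on the speech signal based on the noise signal to obtain a target speech signal.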
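In another aspect, an embodiment of the present application discloses a speech processing apparatus applied to a terminal device provided with at least two microphones, the apparatus including:
a coarse separation module, configured to sum the signals received by the at least two microphones to obtain a first channel signal, and to perform difference processing on the signals received by the at least two microphones to obtain a second channel signal;
a blind separation processing module, configured to perform blind separation processing on the first channel signal and the second channel signal to obtain a speech signal and a noise signal;
an adaptive noise cancellation processing module, configured to perform adaptive noise cancellation processing on the speech signal based on the noise signal to obtain a target speech signal.
In yet another aspect, an embodiment of the present application discloses an apparatus for speech processing, applied to a terminal device provided with at least two microphones. The apparatus includes a memory and one or more programs, where the one or more programs are stored in the memory and configured to be executed by one or more processors, and the one or more programs contain instructions for:
summing the signals received by the at least two microphones to obtain a first channel signal, and performing difference processing on the signals received by the at least two microphones to obtain a second channel signal;
performing blind separation processing on the first channel signal and the second channel signal to obtain a speech signal and a noise signal;
performing adaptive noise cancellation processing on the speech signal based on the noise signal to obtain a target speech signal.
In a further aspect, an embodiment of the present application discloses a machine-readable medium having instructions stored thereon which, when executed by one or more processors, cause an apparatus to perform the speech processing method described in one or more of the foregoing.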
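The embodiments of the present application have the following advantages:
The speech processing method of the embodiments can be applied to a terminal device provided with at least two microphones. First, two or more microphones of the terminal device form a differential array that coarsely separates the speech signal from the noise signal. Specifically, summing the signals received by the at least two microphones forms a beam in front of the speaker that mainly receives the speaker's speech and partly suppresses noise to the side of and behind the speaker, yielding a speech-dominated signal (the first channel signal). Taking the difference of the signals received by the at least two microphones forms a beam to the side of and behind the speaker that mainly receives noise or interference from that direction, yielding a noise-dominated signal (the second channel signal). Next, blind separation is applied to the first and second channel signals obtained by coarse separation, giving a more accurate speech signal and noise signal. Finally, adaptive noise cancellation is performed based on the speech signal and noise signal obtained by blind separation, giving a target speech signal with the noise removed. By combining differential microphone array technology with blind separation and adaptive noise cancellation, the embodiments apply three stages of processing (coarse separation, further separation, and adaptive noise cancellation) to the signals received by the at least two microphones, so that the separated speech and noise signals are more accurate, which in turn improves the efficiency and accuracy of removing noise or interference from the speech signal. In addition, compared with existing noise reduction algorithms, using differential microphone array technology for the coarse separation makes that stage insensitive to the direction of the noise or interference, which improves the robustness of the denoising performance and the denoising result, and thereby raises the speech recognition accuracy of the terminal device in complex, changing environments with heavy noise or interference.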
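The above is only an overview of the technical solution of the present application. So that the technical means of the application can be understood more clearly and implemented according to the description, and so that the above and other objects, features, and advantages of the application are easier to understand, specific embodiments of the application are set out below.
To describe the technical solutions of the embodiments of the present application or of the prior art more clearly, the drawings needed in the description are briefly introduced below. The drawings described below are only some embodiments of the present application; a person of ordinary skill in the art can obtain other drawings from them without creative effort.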
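FIG. 1 is a flowchart of the steps of an embodiment of a speech processing method of the present application;
FIG. 2 is a schematic flowchart of difference processing on the signals of three microphones according to the present application;
FIG. 3 is a schematic diagram of the signal inflow of an adaptive noise cancellation processing module of the present application;
FIG. 4 is a structural block diagram of an embodiment of a speech processing apparatus of the present application;
FIG. 5 is a block diagram of an apparatus 800 for speech processing of the present application;
FIG. 6 is a schematic structural diagram of a server in some embodiments of the present application.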
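The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings. The described embodiments are some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on these embodiments without creative effort fall within the scope of protection of the present application.
FIG. 1 shows a flowchart of the steps of an embodiment of a speech processing method of the present application. The method is applied to a terminal device provided with at least two microphones and may specifically include the following steps:
Step 101: sum the signals received by the at least two microphones to obtain a first channel signal, and perform difference processing on the signals received by the at least two microphones to obtain a second channel signal;
Step 102: perform blind separation processing on the first channel signal and the second channel signal to obtain a speech signal and a noise signal;
Step 103: perform adaptive noise cancellation processing on the speech signal based on the noise signal to obtain a target speech signal.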
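The speech processing method provided by the embodiments can be applied to a terminal device that has at least two microphones for collecting sound signals. Such terminal devices include, but are not limited to: earphones, voice recorders, smart home terminals (including air conditioners, refrigerators, rice cookers, water heaters, and the like), smart business terminals (including videophones, conference desktop smart terminals, and the like), wearable devices (including smart watches, smart glasses, and the like), smart financial terminals, as well as smartphones, tablet computers, personal digital assistants (PDAs), in-vehicle devices, computers, and the like.
For ease of description, the embodiments take an earphone with at least two microphones as the example terminal device.
The speech processing method of the embodiments includes three stages of separation modules. The first-stage separation module sums and differences the signals received by the at least two microphones to obtain the first channel signal and the second channel signal, which coarsely separates the speech signal from the noise signal. The second-stage separation module applies blind separation to the first and second channel signals extracted by the first stage to obtain a speech signal and a noise signal. The third-stage separation module uses the noise signal obtained by the second stage to perform adaptive noise cancellation on the separated speech signal, producing the final target speech signal.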
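The embodiments first perform a preliminary extraction on the signals received by the at least two microphones of the terminal device. Specifically, the signals received by the at least two microphones are summed to obtain the first channel signal, and differenced to obtain the second channel signal. The first channel signal is dominated by the speaker's speech, and the second channel signal is dominated by noise.
Taking two microphones as an example, one of an earphone's two microphones is usually close to the speaker's mouth. The embodiments call the microphone closer to the speaker's mouth the first microphone and the other one the second microphone. The signal received by the first microphone is added to the signal received by the second microphone, which forms a beam in front of the speaker (the end-fire direction) that mainly receives the speaker's speech and partly suppresses noise to the side of and behind the speaker. Summing the signals received by the two microphones therefore yields a speech-dominated signal (the first channel signal).
The signal received by the first microphone is subtracted from the signal received by the second microphone, which forms a beam behind the speaker that mainly receives noise or interference from behind the speaker. Differencing the signals received by the two microphones therefore yields a noise-dominated signal (the second channel signal).
The preliminary extraction of step 101 yields one signal dominated by the speaker's speech (the first channel signal) and one signal dominated by noise (the second channel signal). Extracting these two signals amounts to a coarse separation of the speech signal and the noise signal: the first channel signal is dominated by the speaker's speech but still contains some noise, and the second channel signal is dominated by noise but still contains some speech.
Note that when there are more than two microphones, the first channel signal is extracted in the same way as with two microphones, while the second channel signal is extracted in a slightly different way.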
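In an optional embodiment of the present application, the method may further include: phase-aligning the signals received by the at least two microphones;
In that case, summing the signals received by the at least two microphones to obtain the first channel signal and differencing them to obtain the second channel signal, as described in step 101, may specifically include:
summing the phase-aligned signals received by the at least two microphones to obtain the first channel signal, and performing difference processing on the phase-aligned signals received by the at least two microphones to obtain the second channel signal.
In practice, because the microphones of a terminal device are at different positions, the signals they receive have time differences; that is, the phases of the received signals are not aligned. Summing or differencing the signals of several microphones directly can reduce white noise to some extent, but differencing without phase alignment may leak the target speech signal into the noise-dominated channel, which degrades the accuracy of the second channel signal and therefore the final noise reduction. For this reason, before the summation and difference processing, the embodiment phase-aligns the signals received by the at least two microphones, then sums the phase-aligned signals to obtain the first channel signal and differences the phase-aligned signals to obtain the second channel signal. This improves the accuracy with which the first and second channel signals are extracted and hence the noise reduction achieved on the speech signal.
By summing and differencing the signals received by the at least two microphones, the embodiment does not need to estimate the time differences between the signals received by different microphones, which simplifies the noise reduction procedure. The resulting first and second channel signals serve as auxiliary estimates for the subsequent denoising stages and improve the final noise reduction.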
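In the coarse separation stage of step 101, summing the signals of two microphones works the same way as summing the signals of more than two microphones. The following describes separately how the difference processing is performed for two microphones and for more than two microphones.
In an optional embodiment of the present application, the terminal device is provided with two microphones, and performing difference processing on the signals received by the at least two microphones to obtain the second channel signal in step 101 includes:
Step S11: determine a first microphone and a second microphone among the two microphones;
Step S12: subtract each frame of the signal received by the first microphone from the corresponding frame of the signal received by the second microphone to obtain the second channel signal.
When the terminal device has two microphones, a first microphone and a second microphone are determined among them. The first microphone is the one closer to the speaker's mouth, and the second microphone is the one farther from the speaker's mouth. The two microphones lie on a straight line.
In a specific implementation, optionally, the signals received by the first and second microphones are first phase-aligned. The phase-aligned signals of the two microphones are then summed to obtain the first channel signal, which suppresses white noise. The phase-aligned signals are also differenced: specifically, the signal received by the first microphone is subtracted from the signal received by the second microphone to obtain the second channel signal.
Further, the embodiments process the signal received by each microphone of the terminal device frame by frame, so that the signals are processed in real time, improving both timeliness and accuracy. Specifically, each frame of the signal received by the first microphone is subtracted from the corresponding frame of the signal received by the second microphone to obtain the second channel signal.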
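The two-microphone coarse separation described above can be sketched for a single frame as follows. This is a minimal illustration, assuming the two frames are already phase-aligned and have equal length; the function name `coarse_separate` and the list-of-floats frame representation are hypothetical and do not come from the application.

```python
from typing import List, Sequence, Tuple


def coarse_separate(
    mic1_frame: Sequence[float], mic2_frame: Sequence[float]
) -> Tuple[List[float], List[float]]:
    """Return (first_channel, second_channel) for one frame.

    mic1_frame: frame from the first microphone (closer to the mouth).
    mic2_frame: frame from the second microphone (farther from the mouth).
    Both frames are assumed to be phase-aligned already.
    """
    if len(mic1_frame) != len(mic2_frame):
        raise ValueError("frames must have the same length")
    # Sum of the two microphones: the speech-dominated first channel.
    first_channel = [a + b for a, b in zip(mic1_frame, mic2_frame)]
    # Second microphone minus first microphone: the noise-dominated second channel.
    second_channel = [b - a for a, b in zip(mic1_frame, mic2_frame)]
    return first_channel, second_channel
```

Calling the function once per incoming frame matches the frame-by-frame processing described in the text; the order of subtraction (second minus first) follows step S12.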
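In an optional embodiment of the present application, the terminal device is provided with n microphones, where n is greater than 2, and performing difference processing on the signals received by the at least two microphones to obtain the second channel signal in step 101 includes:
Step S21: subtract the current frame signal received by the (i-1)-th microphone from the current frame signal received by the i-th microphone to obtain n-1 frame signals, where i takes values from 1 to n;
Step S22: apply adaptive filtering to each of the n-1 frame signals with the reference signal y(n) to obtain n-1 processed frame signals, where y(n)=yc(n)-N(n), yc(n) is the sum of the previous-frame signals received by the n microphones, and N(n) is the second channel frame signal output for the previous frame;
Step S23: sum the n-1 processed frame signals to obtain the second channel frame signal output for the current frame;
Step S24: after all frame signals received by the n microphones have been processed, obtain the second channel signal.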
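Step S21 can be sketched as follows. This is a hedged illustration only: it covers the pairwise subtraction and leaves out the adaptive filtering of step S22, whose exact form the text does not spell out. The function name `pairwise_differences` is hypothetical, and the sketch iterates i from 2 to n so that every microphone has a predecessor, which is what produces exactly n-1 difference signals.

```python
from typing import List, Sequence


def pairwise_differences(mic_frames: Sequence[Sequence[float]]) -> List[List[float]]:
    """Subtract each microphone's current frame from the next microphone's frame.

    mic_frames[k] is the current (phase-aligned) frame of microphone k+1.
    Returns n-1 difference frames for n microphones.
    """
    if len(mic_frames) < 2:
        raise ValueError("at least two microphones are required")
    return [
        [cur - prev for prev, cur in zip(mic_frames[i - 1], mic_frames[i])]
        for i in range(1, len(mic_frames))
    ]
```

For three microphones this returns the two signals that the three-microphone example calls signal a (microphone 2 minus microphone 1) and signal b (microphone 3 minus microphone 2).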
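When the terminal device has more than two microphones, optionally, the signals received by all microphones are first phase-aligned. The phase-aligned signals of all microphones are then summed to obtain the first channel signal, which suppresses white noise. The phase-aligned signals are also differenced. Specifically, the following operations are performed on each frame of the phase-aligned microphone signals: subtract the current frame signal received by the (i-1)-th microphone from the current frame signal received by the i-th microphone to obtain n-1 frame signals, where i takes values from 1 to n; apply adaptive filtering to each of the n-1 frame signals with the reference signal y(n) to obtain n-1 processed frame signals; and sum the n-1 processed frame signals to obtain the second channel frame signal output for the current frame. Here the reference signal is y(n)=yc(n)-N(n), where yc(n) is the sum of the previous-frame signals received by the n microphones and N(n) is the second channel frame signal output for the previous frame. The embodiments use the processing result of the previous frame to compute the reference signal y(n) for the current frame, and use y(n) to update the adaptive filter.
Note that when the first frame is processed there is no previous-frame result yet, so an initial reference signal y(n) can be set. After the first frame has been processed, the y(n) computed from the first frame is used to update the adaptive filter when the second frame is processed. Likewise, the y(n) computed from the second frame is used when the third frame is processed, and so on, until the last frame has been processed and the complete second channel signal is obtained.
The embodiments place no restriction on the type of adaptive filter used in the difference processing; for example, it may be an NLMS (Normalized Least Mean Square) adaptive filter.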
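Taking three microphones as an example, the following describes how the embodiments perform difference processing on the signals received by three microphones of the terminal device. FIG. 2 shows a schematic flowchart of difference processing on the signals of three microphones according to an embodiment. As shown in FIG. 2, the three microphones are microphone 1, microphone 2, and microphone 3.
First, the signals received by microphone 1, microphone 2, and microphone 3 are phase-aligned. The phase-aligned signals of the three microphones are then added to obtain the first channel signal, which suppresses white noise, and the phase-aligned signals are also differenced. The difference processing specifically includes: subtracting the signal of microphone 1 from the signal of microphone 2 to obtain signal a; subtracting the signal of microphone 2 from the signal of microphone 3 to obtain signal b; applying adaptive filtering to signal a and signal b with the reference signal y(n) to obtain signal a' and signal b'; and adding signal a' and signal b' to obtain the second channel signal.
As shown in FIG. 2, in a specific embodiment the following operations can be performed on the first frame: first apply the adaptive filtering (the adaptive filter has an initial state for the first computation) to compute N(n); then compute the sum to obtain yc(n), and subtract N(n) from yc(n) to obtain y(n); then use the computed y(n) to update the adaptive filter. This completes the difference processing of the first frame and gives the second channel frame signal output for the first frame. Subsequent frames are processed in turn by the same steps: the second frame can refer to the result of the first frame, the third frame can refer to the result of the second frame, and so on, until the last frame has been processed and the complete second channel signal is obtained.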
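By extracting the first and second channel signals, the embodiments achieve a coarse separation of the speech signal and the noise signal. After the two channel signals have been extracted, blind separation processing can be applied to them to separate the speech signal and the noise signal further and obtain more accurate speech and noise signals.
Blind separation refers to techniques that separate the individual source signals (such as the speaker's speech signal and the noise signal) from an observed mixture when the source signals cannot be known exactly. Because the microphones in an earphone usually have a small aperture and are few in number, the sound captured in a noisy environment contains a large amount of noise, and the quality of the speech signal is poor. To improve it, the embodiments apply blind separation to the extracted first and second channel signals respectively. Blind separation of the first channel signal further reduces the noise in it and yields a speech signal that contains less noise; blind separation of the second channel signal further reduces the speech in it and yields a noise signal that contains less speech, which provides the basis for the subsequent noise reduction.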
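In an optional embodiment of the present application, performing blind separation processing on the first channel signal and the second channel signal to obtain a speech signal and a noise signal in step 102 includes:
performing blind separation processing on each frame of the first channel signal using an independent vector analysis blind separation algorithm to obtain the speech signal, and performing blind separation processing on each frame of the second channel signal using an independent vector analysis blind separation algorithm to obtain the noise signal.
ICA (Independent Component Analysis) assumes that the components of the source signals are statistically independent of one another and have no temporal structure. Under a chosen separation criterion, the neural network weights are adjusted by feedback so that the dependence between the components of the transformed signal is minimized, that is, the outputs are as independent as possible. The goal of ICA is to maximize the statistical independence of the components of the observed signal through a linear transformation. If the source signals are statistically independent, they can be separated with ICA. An unavoidable problem of ICA, however, is that the separated signals can be mixed up because their ordering is inconsistent. The embodiments therefore use the IVA (Independent Vector Analysis) blind separation algorithm. IVA is an extension of ICA that takes into account the correlation between frequency components belonging to the same source and separates all frequency bins of each frame jointly, which effectively avoids the permutation ambiguity problem.
The second-stage separation module of the embodiments uses IVA to blindly separate the first channel signal into the speech signal and the second channel signal into the noise signal. IVA blind separation is not sensitive to the direction of the noise and still separates robustly when the noise is in front of the speaker, which further improves speech noise reduction.
Note that the embodiments place no restriction on the type of blind separation algorithm used; for example, a blind separation algorithm based on PCA (Principal Component Analysis) may also be used.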
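In an optional embodiment of the present application, performing adaptive noise cancellation processing on the speech signal based on the noise signal to obtain a target speech signal in step 103 includes:
using the noise signal as the reference signal and the speech signal as the target signal, and performing adaptive noise cancellation processing on the speech signal with an adaptive filtering algorithm based on recursive least squares (RLS) to obtain the target speech signal.
The adaptive noise cancellation of the embodiments uses RLS (Recursive Least Squares), an algorithm that converges quickly.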
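Specifically, the RLS adaptive filtering algorithm is as follows:
1. Initialization:
$$P(0) = \delta^{-1} I, \qquad W(0) = 0$$
where $\delta$ is a small positive constant and $I$ is the identity matrix.
2. For $n = 1, 2, \ldots, N$, compute:
$$e(n) = d(n) - W^{T}(n-1)\,X(n) \tag{2}$$
$$W(n) = W(n-1) + G(n)\,e(n) \tag{3}$$
$$P(n) = \lambda^{-1} P(n-1) - \lambda^{-1} G(n)\,X^{T}(n)\,P(n-1) \tag{4}$$
$$s(n) = d(n) - W^{T}(n)\,X(n) \tag{5}$$
Here $n$ is the frame index, $W$ is the adaptive filter coefficient vector, and $G$ is the gain vector. $X$ is the noise signal output by blind separation, and $d$ is the speech signal output by blind separation. $s(n)$ is the final output target speech signal. The forgetting factor $\lambda$ can be set to a constant such as 0.99.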
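The recursion in equations (2) to (5) can be sketched in plain Python as follows. Several points are assumptions rather than the application's own choices: equation (1), which would define the gain G(n), does not appear in the text, so the sketch uses the standard RLS gain G(n) = P(n-1)X(n) / (λ + X^T(n)P(n-1)X(n)); the recursion runs per sample rather than per frame; X(n) is built from the latest `taps` samples of the noise reference; and the function name `rls_noise_cancel` and the default values of `taps` and `delta` are hypothetical.

```python
from typing import List, Sequence


def rls_noise_cancel(
    d: Sequence[float],
    x: Sequence[float],
    taps: int = 4,
    lam: float = 0.99,
    delta: float = 0.01,
) -> List[float]:
    """Remove from d the part that is linearly predictable from the reference x.

    d: speech signal from blind separation (still contains residual noise).
    x: noise signal from blind separation (the reference).
    """
    w = [0.0] * taps  # W(0) = 0
    # P(0) = (1 / delta) * I
    p = [[(1.0 / delta) if i == j else 0.0 for j in range(taps)] for i in range(taps)]
    out: List[float] = []
    for n in range(len(d)):
        # X(n): the most recent `taps` reference samples, newest first, zero-padded.
        xv = [x[n - k] if n - k >= 0 else 0.0 for k in range(taps)]
        # P(n-1) X(n)
        px = [sum(p[i][j] * xv[j] for j in range(taps)) for i in range(taps)]
        denom = lam + sum(xv[i] * px[i] for i in range(taps))
        g = [v / denom for v in px]  # assumed standard RLS gain (equation (1) is absent)
        e = d[n] - sum(w[i] * xv[i] for i in range(taps))  # equation (2)
        w = [w[i] + g[i] * e for i in range(taps)]  # equation (3)
        # X^T(n) P(n-1), then equation (4)
        xp = [sum(xv[i] * p[i][j] for i in range(taps)) for j in range(taps)]
        p = [[(p[i][j] - g[i] * xp[j]) / lam for j in range(taps)] for i in range(taps)]
        out.append(d[n] - sum(w[i] * xv[i] for i in range(taps)))  # equation (5)
    return out
```

Each iteration costs on the order of `taps` squared multiplications because of the P update, which is the computational load that motivates gating the filter with voice activity detection.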
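However, the RLS adaptive filtering algorithm is computationally expensive, which is a heavy burden for terminal devices with limited computing power such as earphones. To reduce the computation required by the adaptive filtering, so that the speech processing method of the embodiments suits terminal devices of different computing power, the embodiments introduce a voice activity detection module that lowers the computational load of the RLS adaptive filtering algorithm.
In an optional embodiment of the present application, after the blind separation processing of the first channel signal and the second channel signal in step 102 yields the speech signal and the noise signal, the method may further include:
Step S31: perform voice activity detection on each frame of the speech signal;
Step S32: set a voice signal flag bit on each frame whose voice activity detection result is a voice signal;
Performing adaptive noise cancellation processing on the speech signal in step 103 then includes: performing adaptive noise cancellation processing on the frames of the speech signal that carry the voice signal flag bit.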
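Voice activity detection (VAD) aims to detect whether the current speech signal contains a voice signal, that is, to examine the input signal and distinguish voice from the various background noise signals.
In a specific implementation, not every frame of the sound signal received by the microphones contains the speaker's voice. Performing adaptive noise cancellation on every frame would add unnecessary computation and reduce the efficiency of speech processing. The embodiments therefore perform voice activity detection on the speech signal obtained by blind separation to detect whether the current frame contains a voice signal, and perform adaptive noise cancellation only on frames that do, which reduces computation and improves efficiency.
In the embodiments, before the speech signal and noise signal separated by the second-stage separation module are input to the adaptive noise cancellation module (the third-stage separation module), the speech signal separated by the second stage is first input to the voice activity detection module. That module checks, frame by frame, whether each frame of the input speech signal contains a voice signal, sets the voice signal flag bit on frames whose detection result is a voice signal, and passes the detection result of each frame to the adaptive noise cancellation module. The adaptive noise cancellation module decides whether to perform adaptive noise cancellation according to whether the detection result includes the voice signal flag bit.
The voice activity detection module can use an algorithm based on the time-domain energy of speech. A threshold, threshold, is set and the energy of the current frame is computed. For example, if the current frame x has N samples, n=1,2,3,...,N, the energy of the current frame is energy=sum(x[n]*x[n]), that is, the energies of all samples in the frame are added together. If energy>threshold, the current frame is determined to contain a voice signal and the voice signal flag bit can be set on it; otherwise the current frame is determined not to contain a voice signal and the flag bit is not set.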
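The energy-based decision described above can be sketched as follows. This is a minimal illustration; the function name `has_voice` is hypothetical, and how the threshold is chosen is left open here because the text does not specify it.

```python
from typing import Sequence


def has_voice(frame: Sequence[float], threshold: float) -> bool:
    """Return True if the frame's time-domain energy exceeds the threshold."""
    # energy = sum of x[n] * x[n] over all samples in the frame
    energy = sum(sample * sample for sample in frame)
    return energy > threshold
```

The returned boolean plays the role of the voice signal flag bit: a frame for which it is True would be passed on to adaptive noise cancellation.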
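In practice, updating the adaptive filter coefficients requires some time to converge. During a preset period before flag-gated adaptive noise cancellation begins (for example, the first 20 s), the voice activity detection module can therefore stay inactive, and the adaptive filter is updated continuously during this period. After the preset period (20 s), whether adaptive noise cancellation is performed depends on the voice signal flag bit. This saves processing time and power, and also lets the adaptive filter coefficients be updated more accurately, improving the robustness of the algorithm.
Optionally, a sliding-window strategy can be used. The sliding window stores the voice signal flag bits of a preset number of past frames (for example, 5 to 10 frames) together with the flag bit of the current frame. Adaptive noise cancellation is performed, and the adaptive filter coefficients are updated, only when every frame in the sliding window carries the voice signal flag bit, that is, only when every frame in the window contains a voice signal.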
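The sliding-window strategy can be sketched with a bounded queue of flag bits. This is an illustrative sketch under stated assumptions: the class name `VoiceFlagWindow` and its `push` method are hypothetical, and the sketch assumes that the filter stays inactive until the window has been filled once, which the text does not state explicitly.

```python
from collections import deque


class VoiceFlagWindow:
    """Keeps the voice flag bits of the last `history_frames` frames plus the current one."""

    def __init__(self, history_frames: int = 5) -> None:
        # Window length = past frames + current frame; old flags drop out automatically.
        self._flags = deque(maxlen=history_frames + 1)

    def push(self, flag: bool) -> bool:
        """Record the current frame's flag; return True if noise cancellation should run."""
        self._flags.append(flag)
        window_full = len(self._flags) == self._flags.maxlen
        return window_full and all(self._flags)
```

With `history_frames=2`, a single unflagged frame blocks adaptive noise cancellation for that frame and the two frames after it, so filter updates only happen inside sustained voiced segments.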
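Based on blind separation, combined with differential microphone array technology and adaptive filtering, the embodiments can quickly remove noise or interference from speech. Compared with existing algorithms, the embodiments are insensitive to the direction of the noise or interference, and their denoising performance is more robust.
In one example with two microphones, the signals received by the first and second microphones are first phase-aligned. The phase-aligned signals of the two microphones are summed to obtain the first channel signal, and the signal received by the first microphone is subtracted from the signal received by the second microphone to obtain the second channel signal. Suppose the signals received by the two microphones contain the speech of girl A and the speech of boy B, where the speech of boy B is the target speaker's speech to be extracted. The first-stage separation module produces a first channel signal dominated by the boy's speech and a second channel signal dominated by the girl's speech. In this example, the girl's speech can be treated as a noise signal relative to the boy's speech.
Then the first channel signal dominated by the boy's speech and the second channel signal dominated by the girl's speech are input to the blind separation processing module. After blind separation, the girl's speech in the first channel signal is further reduced, giving the speech signal, and the boy's speech in the second channel signal is further reduced, giving the noise signal.
Next, the speech signal and noise signal output by the blind separation processing module are input to the adaptive noise cancellation processing module, and the speech signal output by the blind separation processing module is also input to the voice activity detection module. The voice activity detection module passes the detection result of each frame to the adaptive noise cancellation processing module, which decides whether to perform adaptive noise cancellation on the current frame according to whether the detection result includes the voice signal flag bit.
FIG. 3 is a schematic diagram of the signal inflow of the adaptive noise cancellation processing module. The speech signal and noise signal output by the blind separation processing module and the per-frame voice activity detection results output by the voice activity detection module serve as the input of FIG. 4, and the target speech signal is finally output.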
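In summary, the speech processing method of the embodiments can be applied to a terminal device provided with at least two microphones. First, two or more microphones of the terminal device form a differential array that coarsely separates the speech signal from the noise signal. Specifically, summing the signals received by the at least two microphones forms a beam in front of the speaker that mainly receives the speaker's speech and partly suppresses noise to the side of and behind the speaker, yielding a speech-dominated signal (the first channel signal). Taking the difference of the signals received by the at least two microphones forms a beam to the side of and behind the speaker that mainly receives noise or interference from that direction, yielding a noise-dominated signal (the second channel signal). Next, blind separation is applied to the first and second channel signals obtained by coarse separation, giving a more accurate speech signal and noise signal. Finally, adaptive noise cancellation is performed based on the speech signal and noise signal obtained by blind separation, giving a target speech signal with the noise removed. By combining differential microphone array technology with blind separation and adaptive noise cancellation, the embodiments apply three stages of processing (coarse separation, further separation, and adaptive noise cancellation) to the signals received by the at least two microphones, so that the separated speech and noise signals are more accurate, which in turn improves the efficiency and accuracy of removing noise or interference from the speech signal. In addition, compared with existing noise reduction algorithms, using differential microphone array technology for the coarse separation makes that stage insensitive to the direction of the noise or interference, which improves the robustness of the denoising performance and the denoising result, and thereby raises the speech recognition accuracy of the terminal device in complex, changing environments with heavy noise or interference.
Note that, for simplicity of description, the method embodiments are presented as a series of combined actions, but those skilled in the art should understand that the embodiments are not limited by the described order of actions, because some steps may be performed in another order or simultaneously. Those skilled in the art should also understand that the embodiments described in the specification are preferred embodiments, and the actions involved are not necessarily required by the embodiments of the present application.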
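FIG. 4 shows a structural block diagram of an embodiment of a speech processing apparatus of the present application. The apparatus can be applied to a terminal device provided with at least two microphones, and may include:
a coarse separation module 401, configured to sum the signals received by the at least two microphones to obtain a first channel signal, and to perform difference processing on the signals received by the at least two microphones to obtain a second channel signal;
a blind separation processing module 402, configured to perform blind separation processing on the first channel signal and the second channel signal to obtain a speech signal and a noise signal;
an adaptive noise cancellation processing module 403, configured to perform adaptive noise cancellation processing on the speech signal based on the noise signal to obtain a target speech signal.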
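Optionally, the apparatus further includes:
a phase alignment module, configured to phase-align the signals received by the at least two microphones;
the coarse separation module is specifically configured to sum the phase-aligned signals received by the at least two microphones to obtain the first channel signal, and to perform difference processing on the phase-aligned signals received by the at least two microphones to obtain the second channel signal.
Optionally, the terminal device is provided with two microphones, and the blind separation processing module includes:
a determination submodule, configured to determine a first microphone and a second microphone among the two microphones;
a first subtraction submodule, configured to subtract each frame of the signal received by the first microphone from the corresponding frame of the signal received by the second microphone to obtain the second channel signal.
Optionally, the terminal device is provided with n microphones, where n is greater than 2, and the blind separation processing module includes:
a second subtraction submodule, configured to subtract the current frame signal received by the (i-1)-th microphone from the current frame signal received by the i-th microphone to obtain n-1 frame signals, where i takes values from 1 to n;
an adaptive filtering submodule, configured to apply adaptive filtering to each of the n-1 frame signals with the reference signal y(n) to obtain n-1 processed frame signals, where y(n)=yc(n)-N(n), yc(n) is the sum of the previous-frame signals received by the n microphones, and N(n) is the second channel frame signal output for the previous frame;
a summation submodule, configured to sum the n-1 processed frame signals to obtain the second channel frame signal output for the current frame;
an iteration completion submodule, configured to obtain the second channel signal after all frame signals received by the n microphones have been processed.
Optionally, the blind separation processing module is specifically configured to perform blind separation processing on each frame of the first channel signal using an independent vector analysis blind separation algorithm to obtain the speech signal, and to perform blind separation processing on each frame of the second channel signal using an independent vector analysis blind separation algorithm to obtain the noise signal.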
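Optionally, the apparatus further includes:
a voice activity detection module, configured to perform voice activity detection on each frame of the speech signal, and to set a voice signal flag bit on each frame whose voice activity detection result is a voice signal;
the adaptive noise cancellation processing module is specifically configured to perform adaptive noise cancellation processing on the frames of the speech signal that carry the voice signal flag bit.
Optionally, the adaptive noise cancellation processing module is specifically configured to use the noise signal as the reference signal and the speech signal as the target signal, and to perform adaptive noise cancellation processing on the speech signal with an RLS-based adaptive filtering algorithm to obtain the target speech signal.
The embodiments form a differential array from two or more microphones of the terminal device. Based on blind separation, combined with differential microphone array technology and adaptive filtering, they can quickly remove noise or interference from speech. Compared with existing algorithms, the embodiments are insensitive to the direction of the noise or interference, their denoising performance is more robust, and the denoising result is improved, which raises the speech recognition accuracy of the terminal device in complex, changing environments with noise or interference.
Because the apparatus embodiments are basically similar to the method embodiments, they are described relatively briefly; for related details, refer to the description of the method embodiments.
The embodiments in this specification are described progressively. Each embodiment focuses on its differences from the others, and the parts that are the same or similar can be cross-referenced between embodiments.
For the apparatus in the above embodiments, the specific way in which each module performs its operations has been described in detail in the method embodiments and is not elaborated here.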
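An embodiment of the present application provides an apparatus for speech processing, applied to a terminal device provided with at least two microphones. The apparatus includes a memory and one or more programs, where the one or more programs are stored in the memory and configured to be executed by one or more processors, and the one or more programs contain instructions for: summing the signals received by the at least two microphones to obtain a first channel signal, and performing difference processing on the signals received by the at least two microphones to obtain a second channel signal; performing blind separation processing on the first channel signal and the second channel signal to obtain a speech signal and a noise signal; and performing adaptive noise cancellation processing on the speech signal based on the noise signal to obtain a target speech signal.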
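FIG. 5 is a block diagram of an apparatus 800 for speech processing according to an exemplary embodiment. For example, the apparatus 800 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device, a personal digital assistant, or the like.
Referring to FIG. 5, the apparatus 800 may include one or more of the following components: a processing component 802, a memory 804, a power component 806, a multimedia component 808, an audio component 810, an input/output (I/O) interface 812, a sensor component 814, and a communication component 816.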
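The processing component 802 generally controls the overall operation of the apparatus 800, such as operations associated with display, telephone calls, data communication, camera operation, and recording. The processing component 802 may include one or more processors 820 to execute instructions so as to complete all or part of the steps of the method described above. In addition, the processing component 802 may include one or more modules that facilitate interaction between the processing component 802 and other components. For example, the processing component 802 may include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.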
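The memory 804 is configured to store various types of data to support operation of the apparatus 800. Examples of such data include instructions for any application or method operating on the apparatus 800, contact data, phonebook data, messages, pictures, videos, and so on. The memory 804 may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk, or an optical disc.
The power component 806 supplies power to the various components of the apparatus 800. The power component 806 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the apparatus 800.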
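The multimedia component 808 includes a screen that provides an output interface between the apparatus 800 and the user. In some embodiments, the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensors may sense not only the boundary of a touch or swipe action but also the duration and pressure associated with the touch or swipe operation. In some embodiments, the multimedia component 808 includes a front camera and/or a rear camera. When the apparatus 800 is in an operating mode, such as a shooting mode or a video mode, the front camera and/or the rear camera can receive external multimedia data. Each front camera and rear camera may be a fixed optical lens system or may have focal length and optical zoom capability.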
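The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a microphone (MIC) that is configured to receive external audio signals when the apparatus 800 is in an operating mode, such as a call mode, a recording mode, or a voice information processing mode. The received audio signals may be further stored in the memory 804 or transmitted via the communication component 816. In some embodiments, the audio component 810 further includes a speaker for outputting audio signals.
The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be a keyboard, a click wheel, buttons, and the like. The buttons may include, but are not limited to, a home button, volume buttons, a start button, and a lock button.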
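The sensor component 814 includes one or more sensors for providing status assessments of various aspects of the apparatus 800. For example, the sensor component 814 can detect the on/off state of the apparatus 800 and the relative positioning of components, such as the display and keypad of the apparatus 800. The sensor component 814 can also detect a change in position of the apparatus 800 or of one of its components, the presence or absence of user contact with the apparatus 800, the orientation or acceleration/deceleration of the apparatus 800, and changes in the temperature of the apparatus 800. The sensor component 814 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor component 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor component 814 may further include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.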
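The communication component 816 is configured to facilitate wired or wireless communication between the apparatus 800 and other devices. The apparatus 800 can access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof. In one exemplary embodiment, the communication component 816 receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, the communication component 816 further includes a near field communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on radio frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.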
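In an exemplary embodiment, the apparatus 800 may be implemented by one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components, for performing the method described above.
In an exemplary embodiment, a non-transitory computer-readable storage medium including instructions is also provided, for example the memory 804 including instructions, which are executable by the processor 820 of the apparatus 800 to complete the method described above. For example, the non-transitory computer-readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.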
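FIG. 6 is a schematic structural diagram of a server in some embodiments of the present application. The server 1900 may vary considerably depending on configuration or performance, and may include one or more central processing units (CPUs) 1922 (for example, one or more processors), a memory 1932, and one or more storage media 1930 (for example, one or more mass storage devices) storing application programs 1942 or data 1944. The memory 1932 and the storage medium 1930 may provide transient or persistent storage. The program stored in the storage medium 1930 may include one or more modules (not shown in the figure), each of which may include a series of instruction operations for the server. Further, the central processing unit 1922 may be configured to communicate with the storage medium 1930 and to execute, on the server 1900, the series of instruction operations in the storage medium 1930.
The server 1900 may also include one or more power supplies 1926, one or more wired or wireless network interfaces 1950, one or more input/output interfaces 1958, one or more keyboards 1956, and/or one or more operating systems 1941, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and so on.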
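A non-transitory computer-readable storage medium, where instructions in the storage medium, when executed by a processor of an apparatus (a server or a terminal), enable the apparatus to perform the speech processing method shown in FIG. 1.
A non-transitory computer-readable storage medium, where instructions in the storage medium, when executed by a processor of an apparatus (a server or a terminal), enable the apparatus to perform a speech processing method, the method including: summing the signals received by the at least two microphones to obtain a first-path signal, and taking the difference of the signals received by the at least two microphones to obtain a second-path signal; performing blind separation processing on the first-path signal and the second-path signal to obtain a speech signal and a noise signal; and performing adaptive noise cancellation processing on the speech signal based on the noise signal to obtain a target speech signal.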
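The embodiments of the present application are described with reference to flowcharts and/or block diagrams of methods, terminal devices (systems), and computer program products according to the embodiments of the present application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing terminal device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing terminal device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing terminal device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means, the instruction means implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing terminal device, such that a series of operational steps are performed on the computer or other programmable terminal device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable terminal device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.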
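Other embodiments of the present application will readily occur to those skilled in the art upon consideration of the specification and practice of the invention disclosed herein. The present application is intended to cover any variations, uses, or adaptations of the present application that follow its general principles and include common knowledge or customary technical means in the art not disclosed in this disclosure. The specification and embodiments are to be regarded as exemplary only, with the true scope and spirit of the present application indicated by the following claims.
It should be understood that the present application is not limited to the precise structures described above and shown in the drawings, and that various modifications and changes may be made without departing from its scope. The scope of the present application is limited only by the appended claims.
The above are merely preferred embodiments of the present application and are not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present application shall fall within its scope of protection.
The speech processing method, the speech processing apparatus, and the apparatus for speech processing provided by the present application have been described in detail above. Specific examples are used herein to explain the principles and implementations of the present application, and the description of the above embodiments is intended only to help in understanding the method of the present application and its core idea. At the same time, those of ordinary skill in the art may, in accordance with the idea of the present application, make changes to the specific implementations and the scope of application. In summary, the content of this specification should not be construed as limiting the present application.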
Claims (15)
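- A speech processing method, characterized in that it is applied to a terminal device provided with at least two microphones, the method comprising: summing the signals received by the at least two microphones to obtain a first-path signal, and taking the difference of the signals received by the at least two microphones to obtain a second-path signal; performing blind separation processing on the first-path signal and the second-path signal to obtain a speech signal and a noise signal; and performing adaptive noise cancellation processing on the speech signal based on the noise signal to obtain a target speech signal.
- The method according to claim 1, characterized in that the method further comprises: phase-aligning the signals received by the at least two microphones; wherein summing the signals received by the at least two microphones to obtain the first-path signal, and taking the difference of the signals received by the at least two microphones to obtain the second-path signal, comprises: summing the phase-aligned signals received by the at least two microphones to obtain the first-path signal, and taking the difference of the phase-aligned signals received by the at least two microphones to obtain the second-path signal.
- The method according to claim 1, characterized in that the terminal device is provided with two microphones, and taking the difference of the signals received by the at least two microphones to obtain the second-path signal comprises: determining a first microphone and a second microphone among the two microphones; and subtracting each frame signal received by the first microphone from each frame signal received by the second microphone to obtain the second-path signal.
- The method according to claim 1, characterized in that the terminal device is provided with n microphones, n being greater than 2, and taking the difference of the signals received by the at least two microphones to obtain the second-path signal comprises: subtracting the current-frame signal received by the (i-1)-th microphone from the current-frame signal received by the i-th microphone to obtain n-1 frame signals, where i ranges from 1 to n; adaptively filtering each of the n-1 frame signals with a reference signal y(n) to obtain n-1 processed frame signals, where y(n) = yc(n) - N(n), yc(n) is the sum of the previous-frame signals received by the n microphones, and N(n) is the second-path frame signal output for the previous frame; summing the n-1 processed frame signals to obtain the second-path frame signal output for the current frame; and obtaining the second-path signal after all frame signals received by the n microphones have been processed.
- The method according to claim 1, characterized in that performing blind separation processing on the first-path signal and the second-path signal to obtain the speech signal and the noise signal comprises: performing blind separation processing on each frame of the first-path signal using an independent vector analysis blind separation algorithm to obtain the speech signal, and performing blind separation processing on each frame of the second-path signal using the independent vector analysis blind separation algorithm to obtain the noise signal.
- The method according to claim 1, characterized in that, after performing blind separation processing on the first-path signal and the second-path signal to obtain the speech signal and the noise signal, the method further comprises: performing voice activity detection on each frame of the speech signal; and setting a voice-signal flag on each frame that the voice activity detection identifies as a voice signal; wherein performing adaptive noise cancellation processing on the speech signal comprises: performing adaptive noise cancellation processing on the frames of the speech signal that carry the voice-signal flag.
- The method according to claim 1, characterized in that performing adaptive noise cancellation processing on the speech signal based on the noise signal to obtain the target speech signal comprises: taking the noise signal as the reference signal and the speech signal as the target signal, and performing adaptive noise cancellation processing on the speech signal using an adaptive filtering algorithm based on recursive least squares (RLS) to obtain the target speech signal.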
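- A speech processing apparatus, characterized in that it is applied to a terminal device provided with at least two microphones, the apparatus comprising: a coarse separation module, configured to sum the signals received by the at least two microphones to obtain a first-path signal, and to take the difference of the signals received by the at least two microphones to obtain a second-path signal; a blind separation processing module, configured to perform blind separation processing on the first-path signal and the second-path signal to obtain a speech signal and a noise signal; and an adaptive noise cancellation processing module, configured to perform adaptive noise cancellation processing on the speech signal based on the noise signal to obtain a target speech signal.
- The apparatus according to claim 8, characterized in that the apparatus further comprises: a phase alignment module, configured to phase-align the signals received by the at least two microphones; wherein the coarse separation module is specifically configured to sum the phase-aligned signals received by the at least two microphones to obtain the first-path signal, and to take the difference of the phase-aligned signals received by the at least two microphones to obtain the second-path signal.
- The apparatus according to claim 8, characterized in that the terminal device is provided with two microphones, and the blind separation processing module comprises: a determining submodule, configured to determine a first microphone and a second microphone among the two microphones; and a first subtraction submodule, configured to subtract each frame signal received by the first microphone from each frame signal received by the second microphone to obtain the second-path signal.
- The apparatus according to claim 8, characterized in that the terminal device is provided with n microphones, n being greater than 2, and the blind separation processing module comprises: a second subtraction submodule, configured to subtract the current-frame signal received by the (i-1)-th microphone from the current-frame signal received by the i-th microphone to obtain n-1 frame signals, where i ranges from 1 to n; an adaptive filtering submodule, configured to adaptively filter each of the n-1 frame signals with a reference signal y(n) to obtain n-1 processed frame signals, where y(n) = yc(n) - N(n), yc(n) is the sum of the previous-frame signals received by the n microphones, and N(n) is the second-path frame signal output for the previous frame; a summation submodule, configured to sum the n-1 processed frame signals to obtain the second-path frame signal output for the current frame; and an iteration completion submodule, configured to obtain the second-path signal after all frame signals received by the n microphones have been processed.
- The apparatus according to claim 8, characterized in that the blind separation processing module is specifically configured to perform blind separation processing on each frame of the first-path signal using an independent vector analysis blind separation algorithm to obtain the speech signal, and to perform blind separation processing on each frame of the second-path signal using the independent vector analysis blind separation algorithm to obtain the noise signal.
- The apparatus according to claim 8, characterized in that the apparatus further comprises: a voice activity detection module, configured to perform voice activity detection on each frame of the speech signal, and to set a voice-signal flag on each frame that the voice activity detection identifies as a voice signal; wherein the adaptive noise cancellation processing module is specifically configured to perform adaptive noise cancellation processing on the frames of the speech signal that carry the voice-signal flag.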
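- An apparatus for speech processing, characterized in that it is applied to a terminal device provided with at least two microphones, the apparatus comprising a memory and one or more programs, where the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs containing instructions for performing the speech processing method according to any one of claims 1 to 7.
- A machine-readable medium having instructions stored thereon which, when executed by one or more processors, cause an apparatus to perform the speech processing method according to any one of claims 1 to 7.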
Priority Applications (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| EP21932436.5A EP4310841A4 (en) | 2021-03-22 | 2021-06-25 | SPEECH PROCESSING METHOD AND APPARATUS, AND SPEECH PROCESSING APPARATUS |
| US18/116,768 US12431153B2 (en) | 2021-03-22 | 2023-03-02 | Speech processing method and apparatus and apparatus for speech processing |
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202110303349.2A CN113077808B (zh) | 2021-03-22 | 2021-03-22 | Speech processing method and apparatus and apparatus for speech processing |
| CN202110303349.2 | 2021-03-22 |
Related Child Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US18/116,768 Continuation US12431153B2 (en) | 2021-03-22 | 2023-03-02 | Speech processing method and apparatus and apparatus for speech processing |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2022198820A1 (zh) | 2022-09-29 |
Family
ID=76613940
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/CN2021/102566 Ceased WO2022198820A1 (zh) | 2021-03-22 | 2021-06-25 | Speech processing method and apparatus and apparatus for speech processing |
Country Status (4)
| Country | Link |
|---|---|
| US (1) | US12431153B2 (zh) |
| EP (1) | EP4310841A4 (zh) |
| CN (1) | CN113077808B (zh) |
| WO (1) | WO2022198820A1 (zh) |
Families Citing this family (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
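| CN115831155B (zh) * | 2021-09-16 | 2026-01-30 | 腾讯科技(深圳)有限公司 | Audio signal processing method and apparatus, electronic device, and storage medium |
| CN117174078A (zh) * | 2022-07-20 | 2023-12-05 | 深圳Tcl新技术有限公司 | Speech signal processing method, apparatus, device, and computer-readable storage medium |
| CN115798503B (zh) * | 2022-09-27 | 2026-02-03 | 上海富瀚微电子股份有限公司 | Directional sound pickup method and apparatus, and electronic device |
| CN119785817A (zh) * | 2025-01-02 | 2025-04-08 | 科大讯飞股份有限公司 | Speech separation method, apparatus, storage medium, and device |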
Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20050080616A1 (en) * | 2001-07-19 | 2005-04-14 | Johahn Leung | Recording a three dimensional auditory scene and reproducing it for the individual listener |
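| CN102592607A (zh) * | 2012-03-30 | 2012-07-18 | 北京交通大学 | Voice conversion system and method using blind speech separation |
| CN104810024A (zh) * | 2014-01-28 | 2015-07-29 | 上海力声特医学科技有限公司 | Dual-channel microphone speech noise reduction processing method and system |
| CN106504763A (zh) * | 2015-12-22 | 2017-03-15 | 电子科技大学 | Microphone array multi-target speech enhancement method based on blind source separation and spectral subtraction |
| CN110085247A (zh) * | 2019-05-06 | 2019-08-02 | 上海互问信息科技有限公司 | Dual-microphone noise reduction method for complex noise environments |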
Family Cites Families (19)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US6917688B2 (en) * | 2002-09-11 | 2005-07-12 | Nanyang Technological University | Adaptive noise cancelling microphone system |
| JP4496379B2 (ja) * | 2003-09-17 | 2010-07-07 | 財団法人北九州産業学術推進機構 | Method for restoring target speech based on the shape of the amplitude frequency distribution of split spectrum series |
| US7464029B2 (en) * | 2005-07-22 | 2008-12-09 | Qualcomm Incorporated | Robust separation of speech signals in a noisy environment |
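| CN1809105B (zh) * | 2006-01-13 | 2010-05-12 | 北京中星微电子有限公司 | Dual-microphone speech enhancement method and system suitable for small mobile communication devices |
| KR101184394B1 (ko) * | 2006-05-10 | 2012-09-20 | 에이펫(주) | Noise signal separation method using a window-separated orthogonal model |
| CN100524465C (zh) * | 2006-11-24 | 2009-08-05 | 北京中星微电子有限公司 | Noise cancellation apparatus and method |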
| US8175871B2 (en) * | 2007-09-28 | 2012-05-08 | Qualcomm Incorporated | Apparatus and method of noise and echo reduction in multiple microphone audio systems |
| US8175291B2 (en) * | 2007-12-19 | 2012-05-08 | Qualcomm Incorporated | Systems, methods, and apparatus for multi-microphone based speech enhancement |
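| JP2009153053A (ja) * | 2007-12-21 | 2009-07-09 | Nec Corp | Speech estimation method and mobile terminal using the same |
| CN101192411B (zh) * | 2007-12-27 | 2010-06-02 | 北京中星微电子有限公司 | Noise cancellation method and noise cancellation system for large-spacing microphone arrays |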
| US8223988B2 (en) * | 2008-01-29 | 2012-07-17 | Qualcomm Incorporated | Enhanced blind source separation algorithm for highly correlated mixtures |
| US8391507B2 (en) * | 2008-08-22 | 2013-03-05 | Qualcomm Incorporated | Systems, methods, and apparatus for detection of uncorrelated component |
| CN102074246B (zh) * | 2011-01-05 | 2012-12-19 | 瑞声声学科技(深圳)有限公司 | Dual-microphone-based speech enhancement apparatus and method |
| DE102012217522A1 (de) * | 2012-09-27 | 2014-03-27 | Rheinmetall Defence Electronics Gmbh | Method for suppressing periodic components in received signals that may contain transient signals, in particular for locating projectile and muzzle blasts |
| US10609475B2 (en) * | 2014-12-05 | 2020-03-31 | Stages Llc | Active noise control and customized audio system |
| CN106157960A (zh) * | 2015-04-14 | 2016-11-23 | 杜比实验室特许公司 | Adaptive arithmetic coding and decoding of audio content |
| US11348595B2 (en) * | 2017-01-04 | 2022-05-31 | Blackberry Limited | Voice interface and vocal entertainment system |
| JP7498560B2 (ja) * | 2019-01-07 | 2024-06-12 | シナプティクス インコーポレイテッド | System and method |
| CN110164468B (zh) * | 2019-04-25 | 2022-01-28 | 上海大学 | Dual-microphone-based speech enhancement method and apparatus |
- 2021-03-22: CN application CN202110303349.2A (patent CN113077808B, zh), status: Active
- 2021-06-25: WO application PCT/CN2021/102566 (patent WO2022198820A1, zh), status: Ceased
- 2021-06-25: EP application EP21932436.5A (patent EP4310841A4, en), status: Pending
- 2023-03-02: US application US18/116,768 (patent US12431153B2, en), status: Active
Non-Patent Citations (1)
| Title |
|---|
| See also references of EP4310841A4 * |
Also Published As
| Publication number | Publication date |
|---|---|
| CN113077808B (zh) | 2024-04-26 |
| CN113077808A (zh) | 2021-07-06 |
| EP4310841A4 (en) | 2024-07-24 |
| EP4310841A1 (en) | 2024-01-24 |
| US20230206937A1 (en) | 2023-06-29 |
| US12431153B2 (en) | 2025-09-30 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 21932436; Country of ref document: EP; Kind code of ref document: A1 |
| | WWE | Wipo information: entry into national phase | Ref document number: 2021932436; Country of ref document: EP |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | ENP | Entry into the national phase | Ref document number: 2021932436; Country of ref document: EP; Effective date: 20231020 |