WO2021103710A1 - Live audio processing method, device, electronic equipment, and storage medium - Google Patents

Live audio processing method, device, electronic equipment, and storage medium

Info

Publication number
WO2021103710A1
Authority
WO
WIPO (PCT)
Prior art keywords
audio signal
guest
processed
audio
mixed
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/CN2020/111873
Other languages
English (en)
French (fr)
Inventor
张晨
邢文浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd
Priority to EP20891582.7A (EP4068284A4)
Publication of WO2021103710A1
Priority to US17/743,879 (US20220270638A1)

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 Detection of presence or absence of voice signals
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/003 Changing voice quality, e.g. pitch or formants
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/93 Discriminating between voiced and unvoiced parts of speech signals
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L65/00 Network arrangements, protocols or services for supporting real-time applications in data packet communication
    • H04L65/40 Support for services or applications
    • H04L65/4061 Push-to services, e.g. push-to-talk or push-to-video
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L65/00 Network arrangements, protocols or services for supporting real-time applications in data packet communication
    • H04L65/60 Network streaming of media packets
    • H04L65/61 Network streaming of media packets for supporting one-way streaming services, e.g. Internet radio
    • H04L65/612 Network streaming of media packets for supporting one-way streaming services, e.g. Internet radio for unicast
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L65/00 Network arrangements, protocols or services for supporting real-time applications in data packet communication
    • H04L65/60 Network streaming of media packets
    • H04L65/75 Media network packet handling
    • H04L65/762 Media network packet handling at the source
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M3/00 Automatic or semi-automatic exchanges
    • H04M3/002 Applications of echo suppressors or cancellers in telephonic connections
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M9/00 Arrangements for interconnection not involving centralised switching
    • H04M9/08 Two-way loud-speaking telephone systems with means for conditioning the signal, e.g. for suppressing echoes for one or both directions of traffic
    • H04M9/082 Two-way loud-speaking telephone systems with means for conditioning the signal, e.g. for suppressing echoes for one or both directions of traffic using echo cancellers
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L2021/02082 Noise filtering the noise being echo, reverberation of the speech
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 Detection of presence or absence of voice signals
    • G10L2025/783 Detection of presence or absence of voice signals based on threshold decision
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/21 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being power information
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04B TRANSMISSION
    • H04B3/00 Line transmission systems
    • H04B3/02 Details
    • H04B3/20 Reducing echo effects or singing; Opening or closing transmitting path; Conditioning for transmission in one direction or the other
    • H04B3/23 Reducing echo effects or singing; Opening or closing transmitting path; Conditioning for transmission in one direction or the other using a replica of transmitted signal in the time domain, e.g. echo cancellers
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M3/00 Automatic or semi-automatic exchanges
    • H04M3/42 Systems providing special services or facilities to subscribers
    • H04M3/56 Arrangements for connecting several subscribers to a common circuit, i.e. affording conference facilities
    • H04M3/568 Arrangements for connecting several subscribers to a common circuit, i.e. affording conference facilities audio processing specific to telephonic conferencing, e.g. spatial distribution, mixing of participants

Definitions

  • This application relates to the field of audio processing technology, and in particular to a live audio processing method, device, electronic equipment, and storage medium.
  • A live broadcast companion is an auxiliary tool for live broadcast platforms and live broadcast software. As the variety of live broadcast platforms and software has grown, a variety of live broadcast companions has also appeared.
  • A live broadcast companion can assist a live broadcast well: it can provide desktop sound effects, screen capture, picture quality adjustment, picture-in-picture, high-definition large screen, a massive song library, intelligent special effects, audio and video recording, and other functions, making live broadcasting easy and smooth.
  • Adding a mic-link function to the live broadcast companion enables a mic-link between the host and guests, so that the host's voice signal is pushed to the mic-linked guest.
  • The background music also needs to be pushed to the mic-linked guest side.
  • When the host uses a microphone to collect the host's voice signal and the background music, the microphone also picks up the mic-linked guest's voice signal played through the external speaker, so the mic-linked guest would hear their own voice. Therefore, echo cancellation must be performed during the push process on the guest's voice signal picked up by the host-side microphone.
  • The inventors realized that traditional echo cancellation methods tend to over-cancel the host's voice signal, so that the host's voice quality heard by the mic-linked guest is poor.
  • The present application provides a live audio processing method, device, electronic equipment, and storage medium, so as to at least solve the problem in the related technology of poor sound quality of the host's voice heard at the mic-linked guest side.
  • The technical solutions of the embodiments of this application are as follows:
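  • A live audio processing method is provided, which is applied to a host terminal and includes:
  • obtaining a first audio signal formed by mixing a guest audio signal and a background audio signal of the host terminal;
  • performing echo cancellation on the guest audio signal in the first audio signal to obtain a processed first audio signal;
  • detecting the voice activity state of the guest terminal according to the guest audio signal, the first audio signal, and the processed first audio signal;
  • performing echo cancellation on the first audio signal in a mixed audio signal according to the different voice activity states and the first audio signal to obtain a processed mixed audio signal, where the mixed audio signal is a signal composed of the first audio signal collected by the host microphone and the host audio signal;
  • synthesizing the processed first audio signal and the processed mixed audio signal and pushing the result to the guest terminal.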
  • A live audio processing device is provided, including:
  • a first audio signal acquisition module, configured to acquire the first audio signal formed by mixing the guest audio signal and the background audio signal of the host terminal;
  • a first echo cancellation module, configured to perform echo cancellation on the guest audio signal in the first audio signal to obtain the processed first audio signal;
  • a voice activity state detection module, configured to detect the voice activity state of the guest terminal based on the guest audio signal, the first audio signal, and the processed first audio signal;
  • a second echo cancellation module, configured to perform echo cancellation on the first audio signal in the mixed audio signal according to the different voice activity states and the first audio signal to obtain a processed mixed audio signal, where the mixed audio signal is a signal composed of the first audio signal collected by the host microphone and the host audio signal;
  • a second audio signal synthesis module, configured to synthesize the processed first audio signal and the processed mixed audio signal and push the result to the guest terminal.
  • An electronic device is provided, including:
  • a processor; and a memory for storing processor-executable instructions; wherein the processor is configured to execute the instructions to implement the steps of the above method.
  • A storage medium is provided; when the instructions in the storage medium are executed by the processor of an electronic device, the electronic device can execute the steps of the above method.
  • A computer program product is provided which, when executed on a data processing device, is adapted to execute a program that performs the steps of the above method.
  • This application uses two echo cancellation paths working together and adjusts the echo processing of the sound signal collected by the host microphone according to the voice activity state of the guest terminal, so that the host audio signal at the host terminal is not over-processed. This protects the host audio signal and improves the sound quality of the host's voice heard by the guest.
  • FIG. 1 is an application environment diagram of a live audio processing method in an embodiment.
  • FIG. 2 is a schematic flow chart of a live audio processing method in an embodiment.
  • FIG. 3 is a schematic diagram of the process of judging the voice activity state of the guest terminal in an embodiment.
  • FIG. 4 is a schematic flow chart of the echo cancellation method for the sound signal at the host terminal when the guest terminal is in a voice state in an embodiment.
  • FIG. 5 is a schematic flow chart of a live audio processing method in an embodiment.
  • FIG. 6 is a structural block diagram of a live audio processing device in an embodiment.
  • FIG. 7 is an internal structure diagram of an electronic device in an embodiment.
  • The live audio processing method provided in the embodiments of the present application can be applied to the application environment shown in FIG. 1.
  • The application environment includes the host terminal 110, the server 120, and the guest terminal 130.
  • The host terminal 110 communicates with the server 120 through the network, and the guest terminal 130 communicates with the server 120 through the network.
  • Applications or plug-ins such as the live broadcast companion may be installed on the host terminal 110 in advance, so that the host terminal 110 can carry out entertainment or game live broadcasts through these applications or plug-ins.
  • The application or plug-in installed on the host terminal 110 can adjust the echo cancellation method applied to the sound signal collected by the microphone at the host terminal 110 according to the real-time voice activity state of the guest terminal 130, so that the audio signal of the host terminal 110 can be changed.
  • The host terminal 110 mixes the acquired guest audio signal and the background audio signal of the host terminal to form the first audio signal.
  • The host terminal 110 performs echo cancellation on the guest audio signal in the first audio signal to obtain the processed first audio signal.
  • Based on the guest audio signal, the first audio signal, and the processed first audio signal, the voice activity state of the guest terminal is detected.
  • According to the voice activity state and the first audio signal, echo cancellation is performed on the first audio signal in the mixed audio signal, and a processed mixed audio signal is obtained.
  • The host terminal 110 synthesizes the processed first audio signal and the processed mixed audio signal and pushes the result to the guest terminal 130.
  • The host terminal 110 and the guest terminal 130 can be, but are not limited to, various personal computers, laptops, smart phones, tablets, and portable wearable devices.
  • The server 120 can be implemented as an independent server or as a server cluster composed of multiple servers.
  • In one embodiment, a live audio processing method is provided. Taking the method applied to the host terminal 110 in FIG. 1 as an example, the method includes the following steps:
  • Step 202: Obtain a first audio signal formed by mixing the guest audio signal and the background audio signal of the host terminal.
  • The guest audio signal may be a guest voice signal.
  • The background audio signal of the host terminal may be background music played locally at the host terminal, such as game music or karaoke music. Specifically, after receiving the guest audio signal and the locally played background audio signal, the host terminal may mix the two to form the first audio signal.
  • Step 204: Perform echo cancellation on the guest audio signal in the first audio signal to obtain a processed first audio signal.
  • Echo cancellation can be performed on the first audio signal to eliminate the guest audio signal in it, leaving the background audio signal.
  • The echo cancellation on the first audio signal can be performed by means of acoustic echo cancellation.
  • Step 206: Detect the voice activity state of the guest terminal based on the guest audio signal, the first audio signal, and the processed first audio signal.
  • Voice activity detection (VAD) of the guest terminal refers to detecting whether voice is currently present at the guest terminal, for example, whether the mic-linked guest is speaking. If the guest is currently speaking, the voice activity state can be regarded as a voice state; if the guest is not currently speaking, it can be regarded as a mute state.
  • The voice activity state can be detected through threshold discrimination algorithms, model matching algorithms, and the like. Taking the threshold discrimination algorithm as an example, the voice activity state of the guest terminal can be judged by detecting the audio energy of the received guest audio frames over a certain length of time.
  • In addition, the energy of the first audio frame before echo cancellation over a certain period of time (that is, the audio synthesized from the guest audio signal and the background audio signal of the host terminal) and the energy of the first audio frame after echo cancellation (that is, the background audio signal obtained after echo cancellation) can also be measured to determine the voice activity state of the guest terminal, which can improve the accuracy of the judgment.
  • Step 208: Perform echo cancellation on the first audio signal in the mixed audio signal according to the different voice activity states and the first audio signal to obtain a processed mixed audio signal.
  • The mixed audio signal is a signal composed of the first audio signal collected by the host microphone and the host audio signal.
  • The echo in the sound signal collected by the microphone at the host terminal is mainly generated by the first audio signal. If the echo of the background audio signal in the first audio signal is not completely eliminated, it can be masked by the internally mixed background audio signal. The echo of the guest audio signal in the first audio signal is therefore the echo that needs to be completely eliminated, and the mixed audio signal collected by the microphone can be subjected to different degrees of echo cancellation according to the voice activity state of the guest terminal.
  • When the voice activity state of the guest terminal is detected as a mute state, a milder echo cancellation method can be used on the mixed audio signal to eliminate the first audio signal in it and obtain the host audio signal; when the voice activity state of the guest terminal is detected as a voice state, a stronger echo cancellation method can be used on the mixed audio signal in order to completely eliminate the echo of the guest audio signal.
  • Step 210: The processed first audio signal and the processed mixed audio signal are synthesized and pushed to the guest terminal.
  • The obtained background audio signal and host audio signal can be mixed and pushed to the guest terminal.
  • The above live audio processing method adjusts, according to the voice activity state of the guest terminal, the echo cancellation method applied to the mixed audio signal composed of the first audio signal collected by the host microphone and the host audio signal, and uses the selected method to cancel the echo of the first audio signal in the mixed audio signal. In this way the host audio signal at the host terminal is not over-processed, which protects the host audio signal and improves the audio quality of the host's voice heard at the guest terminal.
  • In one embodiment, detecting the voice activity state of the guest terminal according to the guest audio signal, the first audio signal, and the processed first audio signal includes the following steps:
  • Step 302: According to the guest audio signal, the first audio signal, and the processed first audio signal, respectively calculate the guest audio energy, the first audio energy, and the processed first audio energy.
  • A threshold discrimination algorithm can be used to detect the voice activity state of the guest terminal. Specifically, the following formula can be used to measure the guest audio energy, the first audio energy, and the processed first audio energy (i.e., the background audio energy obtained after echo cancellation) of an audio frame:
  • where L represents the length of the audio frame, which may be, but is not limited to, 20 ms, and S represents the audio signal.
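  • A minimal sketch of a per-frame energy measure over frames of length L. The mean-square definition used here is an assumption (one common choice), as are the 16 kHz sample rate and the helper names `frame_energy` and `split_frames`; none of them are taken from the application.

```python
def frame_energy(samples):
    """Mean-square energy of one audio frame (assumed definition)."""
    if not samples:
        return 0.0
    return sum(s * s for s in samples) / len(samples)


def split_frames(signal, sample_rate=16000, frame_ms=20):
    """Split a signal into consecutive, non-overlapping frames of frame_ms.

    sample_rate is a hypothetical value; trailing samples that do not
    fill a whole frame are dropped.
    """
    frame_len = sample_rate * frame_ms // 1000  # L, expressed in samples
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, frame_len)]
```

  • The same function would be applied to the guest audio frame, the first audio frame, and the processed first audio frame to obtain the three energies compared in the following steps.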
  • Step 304: When it is determined that the guest audio energy is less than the first threshold, and the ratio of the processed first audio energy to the first audio energy is greater than the second threshold, detect that the voice activity state is a mute state.
  • Suppose that, for the measured nth audio frame, the guest audio energy is E1, the first audio energy is Ein, the processed first audio energy is Eout, the first threshold is Th1, and the second threshold is Th2. If it is judged that E1 < Th1, it can be preliminarily considered that the guest terminal is in a mute state at this time. Further, if the ratio of the processed first audio energy Eout to the first audio energy Ein satisfies Eout/Ein > Th2, it can be considered that the proportion of the guest audio signal in the first audio signal is very small, that is, the host terminal has received very little guest audio signal. Therefore, it can be judged that the guest terminal is in a mute state at this time.
  • Step 306: When it is determined that the guest audio energy is greater than the first threshold, or the ratio of the processed first audio energy to the first audio energy is less than the second threshold, detect that the voice activity state is a voice state.
  • If it is judged that E1 > Th1, it can be considered that the guest terminal is in a voice state at this time. Likewise, if the ratio of the processed first audio energy Eout to the first audio energy Ein satisfies Eout/Ein < Th2, it can be considered that the guest audio signal accounts for a relatively large share of the first audio signal, that is, the host terminal has received more guest audio signal. Therefore, it can be determined that the guest terminal is in a voice state at this time.
  • The first threshold Th1 may be, but is not limited to, 0.001, and the second threshold Th2 may be, but is not limited to, 0.9.
  • By jointly using the guest audio energy, the first audio energy, and the processed first audio energy, the accuracy of the detection of the voice activity state can be improved.
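  • The threshold decision of Steps 304 and 306 can be sketched as follows. The default thresholds are the example values 0.001 and 0.9 from the text; the function name, the handling of `e_in == 0`, and the treatment of exact equality with a threshold (classified as voice here) are assumptions, since the text does not specify them.

```python
MUTE, VOICE = "mute", "voice"


def detect_voice_activity(e1, e_in, e_out, th1=0.001, th2=0.9):
    """Classify the guest terminal as MUTE or VOICE for one frame.

    e1    : guest audio energy (E1)
    e_in  : first audio energy before echo cancellation (Ein)
    e_out : processed first audio energy after echo cancellation (Eout)
    """
    # Assumption: with no input energy nothing was removed, so use ratio 1.0.
    ratio = e_out / e_in if e_in > 0 else 1.0
    # Mute requires BOTH conditions; anything else is treated as voice.
    if e1 < th1 and ratio > th2:
        return MUTE
    return VOICE
```

  • Because the mute state requires both conditions, a frame is classified as voice as soon as either the guest energy is high or echo cancellation removed a large share of the first audio energy, which errs on the side of stronger echo removal.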
  • In one embodiment, performing echo cancellation on the first audio signal in the mixed audio signal according to different voice activity states and the first audio signal to obtain the processed mixed audio signal includes: when the voice activity state is detected as a mute state, using the first audio signal as a reference signal, performing adaptive filtering on the mixed audio signal to filter out the first audio signal in the mixed audio signal.
  • When the guest terminal is in a mute state, an adaptive filter can be used to perform relatively light echo cancellation on the mixed audio signal.
  • The first audio signal is used as the reference signal, and the estimated value of the echo signal collected by the microphone is obtained through linear superposition. By subtracting the estimated value of the echo signal from the mixed audio signal collected by the microphone, echo cancellation of the mixed audio signal can be implemented to obtain the host audio signal.
  • The adaptive filtering method is adopted; this method alone cannot completely eliminate the echo of the guest audio signal.
  • A mild non-linear process (NLP)
  • In this way, the audio signal of the host terminal can be protected, thereby improving the audio quality of the host's voice heard at the guest terminal.
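  • A minimal sketch of adaptive filtering with a reference signal: the echo estimate is a linear superposition of recent reference samples, and it is subtracted from the microphone signal. The text does not name a specific adaptive algorithm; normalized LMS (NLMS) is swapped in here as a common choice, and the function name, tap count, and step size are hypothetical.

```python
import random


def nlms_echo_cancel(mic, ref, taps=8, mu=0.5, eps=1e-8):
    """Return (err, echo): mic minus the estimated echo, and the estimate."""
    w = [0.0] * taps   # adaptive filter coefficients
    x = [0.0] * taps   # most recent reference samples, newest first
    err, echo = [], []
    for d, r in zip(mic, ref):
        x = [r] + x[:-1]
        y = sum(wi * xi for wi, xi in zip(w, x))  # linear superposition
        e = d - y                                 # subtract echo estimate
        norm = sum(xi * xi for xi in x) + eps
        w = [wi + mu * e * xi / norm for wi, xi in zip(w, x)]  # NLMS update
        err.append(e)
        echo.append(y)
    return err, echo


# Demo with a synthetic two-tap echo path and no host speech, so the
# residual should decay toward zero once the filter has converged.
random.seed(0)
ref = [random.uniform(-1.0, 1.0) for _ in range(2000)]
mic = [0.6 * ref[n] + (0.3 * ref[n - 1] if n > 0 else 0.0)
       for n in range(len(ref))]
err, echo = nlms_echo_cancel(mic, ref)
residual = sum(e * e for e in err[-200:]) / 200
```

  • In the method described above, `ref` would be the first audio signal and `mic` the mixed audio signal, so `err` would approximate the host audio signal plus any residual echo.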
  • In one embodiment, performing echo cancellation on the first audio signal in the mixed audio signal to obtain the processed mixed audio signal includes:
  • Step 402: When it is detected that the voice activity state is the voice state, use the first audio signal as a reference signal to perform adaptive filtering on the mixed audio signal to obtain a filtered mixed audio signal.
  • The first audio signal can be used as a reference signal, and the estimated value of the echo signal collected by the microphone can be obtained by means of adaptive filtering and linear superposition. The estimated value of the echo signal is subtracted from the mixed audio signal collected by the microphone, so that the mixed audio signal is filtered.
  • Step 404: Perform non-linear processing on the filtered mixed audio signal to eliminate residual echo signals in the filtered mixed audio signal.
  • The residual echo signal can be eliminated by performing non-linear processing on the filtered mixed audio signal.
  • The input of the non-linear processing includes two signals: one is the residual echo signal after linear processing by adaptive filtering, which can be denoted err, and the other is the echo signal estimated by adaptive filtering, which can be denoted echo.
  • err and echo are transformed into the frequency-domain signals Err and Echo by the fast Fourier transform (FFT), that is, Err = FFT(err) and Echo = FFT(echo). Then the signal-to-noise ratio Snr(k) of the amplitude spectra of Err and Echo can be calculated for each frequency point k.
  • If a certain frequency point k has a lower Snr(k), it can be considered that the input is mainly a residual echo signal, and Err(k) is weighted with a low gain; if a certain frequency point k has a higher Snr(k), it can be considered that the input is mainly the audio signal of the host, and Err(k) is weighted with a high gain.
  • The weighted Err' is transformed back into the time domain through the inverse Fourier transform, and the output signal err' has the residual echo further removed.
  • In this way, the interference of the echo of the guest audio signal can be completely eliminated.
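  • The non-linear processing step can be sketched as a per-frequency gain applied to Err. Several details are assumptions not given in the text: Snr(k) is taken as |Err(k)| / |Echo(k)|, the gain is a simple two-level choice, and the threshold and gain values are hypothetical. A naive DFT is used only to keep the example dependent on the standard library; a real implementation would use an FFT.

```python
import cmath
import math


def dft(x):
    """Naive discrete Fourier transform (O(n^2), for illustration only)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(X):
    """Inverse DFT, returning the real part of each time-domain sample."""
    n = len(X)
    return [(sum(X[k] * cmath.exp(2j * cmath.pi * k * t / n)
                 for k in range(n)) / n).real
            for t in range(n)]


def nlp_suppress(err, echo, low_gain=0.1, high_gain=1.0,
                 snr_threshold=1.0, eps=1e-12):
    """Weight each frequency point of err by a gain chosen from Snr(k)."""
    Err, Echo = dft(err), dft(echo)
    weighted = []
    for E, C in zip(Err, Echo):
        snr = abs(E) / (abs(C) + eps)        # assumed definition of Snr(k)
        gain = high_gain if snr > snr_threshold else low_gain
        weighted.append(gain * E)            # Err'(k)
    return idft(weighted)                    # err'


# Demo: host tone at frequency point 3, residual echo at frequency point 7.
n = 32
host = [math.sin(2 * math.pi * 3 * t / n) for t in range(n)]
echo_est = [math.sin(2 * math.pi * 7 * t / n) for t in range(n)]
err = [h + 0.1 * c for h, c in zip(host, echo_est)]
cleaned = nlp_suppress(err, echo_est)
```

  • In the demo the host tone passes at full gain while the residual echo component is attenuated by `low_gain`. A smoother gain curve would reduce audible artifacts, at the cost of more tuning.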
  • In one embodiment, performing echo cancellation on the guest audio signal in the first audio signal to obtain the processed first audio signal includes: taking the guest audio signal as a reference signal and performing adaptive filtering on the first audio signal to obtain the processed first audio signal.
  • An adaptive filter may be used to perform echo cancellation on the first audio signal received by the host player.
  • With the guest audio signal as the reference signal, the estimated value of the echo signal can be obtained by linear superposition. By subtracting the estimated value of the echo signal from the acquired first audio signal, echo cancellation can be implemented on the first audio signal, thereby separating out the background audio signal.
  • In one embodiment, after performing echo cancellation on the first audio signal in the mixed audio signal according to different voice activity states and the first audio signal to obtain the processed mixed audio signal, the method further includes: synthesizing the first audio signal and the processed mixed audio signal and pushing the result to the audience terminal.
  • The live broadcast scene also includes the audience side.
  • The processed mixed audio signal (i.e., the host audio signal obtained by echo cancellation) and the first audio signal (i.e., the guest audio signal and the background audio signal of the host terminal) can be synthesized and pushed to the audience terminal.
  • This not only enables the audience to hear the host audio signal, the guest audio signal, and the background audio signal of the host terminal at the same time, but also improves the sound quality of the audio heard by the audience.
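  • The two output mixes described above (one for the guest terminal, one for the audience) can be sketched as follows. The text only says the signals are synthesized; sample-wise summation with clipping, and the function names, are assumptions made for illustration.

```python
def mix(a, b, limit=1.0):
    """Sum two equal-length frames sample by sample, clipped to [-limit, limit]."""
    return [max(-limit, min(limit, x + y)) for x, y in zip(a, b)]


def build_outputs(first_audio, processed_first_audio, processed_mixed_audio):
    """Return (to_guest, to_audience) for one frame.

    first_audio           : guest audio + background audio
    processed_first_audio : background audio (guest audio cancelled)
    processed_mixed_audio : host audio (echo of first_audio cancelled)
    """
    # The guest should not hear their own voice: background + host only.
    to_guest = mix(processed_first_audio, processed_mixed_audio)
    # The audience hears everything: guest + background + host.
    to_audience = mix(first_audio, processed_mixed_audio)
    return to_guest, to_audience
```

  • Keeping the two mixes separate is what lets the guest's own voice be excluded from the stream returned to the guest while still reaching the audience.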
  • Referring to FIG. 5, a specific embodiment is used to illustrate the live audio processing method, which includes the following steps:
  • Step 501: Acquire the guest audio signal.
  • Step 502: Obtain the background audio signal played by the host player.
  • Step 503: Mix the obtained guest audio signal and background audio signal to form a first audio signal.
  • Step 504: Use an external speaker to play the first audio signal.
  • Step 505: Use a microphone to collect the first audio signal and the host audio signal to obtain a mixed audio signal.
  • Step 506: Perform echo cancellation on the guest audio signal in the first audio signal to obtain the processed first audio signal, that is, the background audio signal.
  • With the guest audio signal as the reference signal, adaptive filtering is performed on the first audio signal to obtain the processed first audio signal.
  • Step 507: Detect the voice activity state of the guest terminal. According to the different voice activity states, adjust the echo cancellation method used for the mixed audio signal composed of the first audio signal collected by the microphone and the host audio signal.
  • The voice activity state of the guest terminal can be detected based on the guest audio energy, the first audio energy, and the processed first audio energy.
  • When it is judged that the guest audio energy is less than the first threshold and the ratio of the processed first audio energy to the first audio energy is greater than the second threshold, the voice activity state is detected as a mute state; when it is judged that the guest audio energy is greater than the first threshold, or the ratio of the processed first audio energy to the first audio energy is less than the second threshold, the voice activity state is detected as a voice state.
  • Step 508: Perform echo cancellation on the first audio signal in the mixed audio signal to obtain a processed mixed audio signal.
  • When the detected voice activity state is the mute state, the first audio signal is used as a reference signal to perform adaptive filtering on the mixed audio signal to filter out the first audio signal in the mixed audio signal.
  • When the detected voice activity state is the voice state, the first audio signal is used as a reference signal to perform adaptive filtering on the mixed audio signal, and non-linear processing is then performed on the filtered mixed audio signal to eliminate the residual echo signal.
  • Step 509: Synthesize the processed first audio signal and the processed mixed audio signal and push the result to the guest terminal.
  • Step 510: Synthesize the first audio signal and the processed mixed audio signal and push the result to the audience.
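  • The state-dependent branch of Step 508 can be sketched as a small dispatcher. The function name and the idea of passing the adaptive filter and the non-linear processor in as callables are assumptions made to keep the example self-contained; the stand-in callables in the usage lines are placeholders, not real echo cancellers.

```python
def cancel_mixed_echo(mixed, first_audio, state, adaptive_filter, nlp):
    """Apply lighter or stronger echo cancellation depending on the state.

    adaptive_filter(mixed, ref) -> (err, echo)
    nlp(err, echo)              -> err with residual echo suppressed
    """
    err, echo = adaptive_filter(mixed, first_audio)
    if state == "voice":
        # Guest is speaking: also remove the residual echo.
        return nlp(err, echo)
    # Guest is silent: adaptive filtering only, to protect the host audio.
    return err


# Usage with trivial stand-in callables (placeholders for illustration).
stub_filter = lambda mixed, ref: ([m - r for m, r in zip(mixed, ref)], list(ref))
stub_nlp = lambda err, echo: [0.5 * e for e in err]
mute_out = cancel_mixed_echo([1.0, 2.0], [0.5, 0.5], "mute", stub_filter, stub_nlp)
voice_out = cancel_mixed_echo([1.0, 2.0], [0.5, 0.5], "voice", stub_filter, stub_nlp)
```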
  • In one embodiment, a live audio processing device 600 is provided, including: a first audio signal acquisition module 601, a first echo cancellation module 602, a voice activity state detection module 603, a second echo cancellation module 604, and a second audio signal synthesis module 605.
  • The first audio signal acquisition module 601 is configured to acquire the first audio signal formed by mixing the guest audio signal and the background audio signal of the host terminal;
  • the first echo cancellation module 602 is configured to perform echo cancellation on the guest audio signal in the first audio signal to obtain a processed first audio signal;
  • the voice activity state detection module 603 is configured to detect the voice activity state of the guest terminal based on the guest audio signal, the first audio signal, and the processed first audio signal;
  • the second echo cancellation module 604 is configured to perform echo cancellation on the first audio signal in the mixed audio signal according to different voice activity states and the first audio signal to obtain a processed mixed audio signal;
  • the second audio signal synthesis module 605 is configured to synthesize the processed first audio signal and the processed mixed audio signal and push the result to the guest terminal.
  • The voice activity state detection module 603 is also configured to calculate the guest audio energy, the first audio energy, and the processed first audio energy from the guest audio signal, the first audio signal, and the processed first audio signal, respectively; when it is determined that the guest audio energy is less than the first threshold and the ratio of the processed first audio energy to the first audio energy is greater than the second threshold, to detect that the voice activity state is the mute state; and when it is determined that the guest audio energy is greater than the first threshold, or the ratio of the processed first audio energy to the first audio energy is less than the second threshold, to detect that the voice activity state is the voice state.
  • The second echo cancellation module 604 is configured to, when detecting that the voice activity state is the mute state, perform adaptive filtering on the mixed audio signal using the first audio signal as a reference signal to filter out the first audio signal in the mixed audio signal.
  • The second echo cancellation module 604 is configured to, when the voice activity state is the voice state, perform adaptive filtering on the mixed audio signal using the first audio signal as a reference signal to obtain the filtered mixed audio signal, and to perform non-linear processing on the filtered mixed audio signal to eliminate the residual echo signal in the filtered mixed audio signal.
  • The first echo cancellation module 602 is configured to perform adaptive filtering on the first audio signal using the guest audio signal as a reference signal to obtain a processed first audio signal.
  • The live audio processing device 600 further includes a third audio signal synthesis module configured to synthesize the first audio signal and the processed mixed audio signal and push the result to the audience.
  • an electronic device is provided.
  • the electronic device may be a terminal, and its internal structure diagram may be as shown in FIG. 7.
  • the electronic device includes a processor, a memory, a network interface, a display screen, and an input device connected through a system bus.
  • the processor of the electronic device is used to provide calculation and control capabilities.
  • the memory of the electronic device includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium stores an operating system and instructions.
  • the internal memory provides an environment for the operation of the operating system and instructions in the non-volatile storage medium.
  • the network interface of the electronic device is used to communicate with an external terminal through a network connection.
  • when executed by the processor, the instructions implement a live audio processing method.
  • the display screen of the electronic device can be a liquid crystal display screen or an electronic ink display screen.
  • the input device of the electronic device can be a touch layer covering the display screen, a button, trackball, or touchpad provided on the housing of the electronic device, or an external keyboard, touchpad, or mouse.
  • an electronic device including a memory and a processor, the memory storing instructions executable by the processor, and the processor implementing the following steps when executing the instructions:
  • acquiring the first audio signal formed by mixing the guest audio signal and the background audio signal of the host end; performing echo cancellation on the guest audio signal in the first audio signal to obtain the processed first audio signal; detecting the voice activity state of the guest end according to the guest audio signal, the first audio signal, and the processed first audio signal; performing echo cancellation on the first audio signal in the mixed audio signal according to the different voice activity states and the first audio signal to obtain the processed mixed audio signal; synthesizing the processed first audio signal and the processed mixed audio signal and pushing the result to the guest end.
  • the processor further implements the following steps when executing instructions:
  • calculating, from the guest audio signal, the first audio signal, and the processed first audio signal, the guest audio energy, the first audio energy, and the processed first audio energy, respectively; when the guest audio energy is less than the first threshold and the ratio of the processed first audio energy to the first audio energy is greater than the second threshold, detecting the voice activity state as the mute state; when the guest audio energy is greater than the first threshold, or the ratio of the processed first audio energy to the first audio energy is less than the second threshold, detecting the voice activity state as the voice state.
  • the processor further implements the following steps when executing instructions:
  • when the voice activity state is detected as the mute state, using the first audio signal as a reference signal to perform adaptive filtering on the mixed audio signal, filtering out the first audio signal in the mixed audio signal.
  • the processor further implements the following steps when executing instructions:
  • when the voice activity state is detected as the voice state, using the first audio signal as the reference signal to perform adaptive filtering on the mixed audio signal to obtain the filtered mixed audio signal; performing non-linear processing on the filtered mixed audio signal to eliminate the residual echo signal in it.
  • the processor further implements the following steps when executing instructions:
  • using the guest audio signal as a reference signal, performing adaptive filtering on the first audio signal to obtain the processed first audio signal.
  • the processor further implements the following steps when executing instructions:
  • synthesizing the first audio signal and the processed mixed audio signal and pushing the result to the audience end.
  • a storage medium on which processor-executable instructions are stored, the instructions implementing the following steps when executed by the processor:
  • acquiring the first audio signal formed by mixing the guest audio signal and the background audio signal of the host end; performing echo cancellation on the guest audio signal in the first audio signal to obtain the processed first audio signal; detecting the voice activity state of the guest end according to the guest audio signal, the first audio signal, and the processed first audio signal; performing echo cancellation on the first audio signal in the mixed audio signal according to the different voice activity states and the first audio signal to obtain the processed mixed audio signal; synthesizing the processed first audio signal and the processed mixed audio signal and pushing the result to the guest end.
  • calculating, from the guest audio signal, the first audio signal, and the processed first audio signal, the guest audio energy, the first audio energy, and the processed first audio energy, respectively; when the guest audio energy is less than the first threshold and the ratio of the processed first audio energy to the first audio energy is greater than the second threshold, detecting the voice activity state as the mute state; when the guest audio energy is greater than the first threshold, or the ratio of the processed first audio energy to the first audio energy is less than the second threshold, detecting the voice activity state as the voice state.
  • when the voice activity state is detected as the mute state, using the first audio signal as a reference signal to perform adaptive filtering on the mixed audio signal, filtering out the first audio signal in the mixed audio signal.
  • when the voice activity state is detected as the voice state, using the first audio signal as the reference signal to perform adaptive filtering on the mixed audio signal to obtain the filtered mixed audio signal; performing non-linear processing on the filtered mixed audio signal to eliminate the residual echo signal in it.
  • using the guest audio signal as a reference signal, performing adaptive filtering on the first audio signal to obtain the processed first audio signal.
  • synthesizing the first audio signal and the processed mixed audio signal and pushing the result to the audience end.
  • a computer program product which, when executed on a data processing device, is adapted to execute a program initialized with the following method steps:
  • acquiring the first audio signal formed by mixing the guest audio signal and the background audio signal of the host end; performing echo cancellation on the guest audio signal in the first audio signal to obtain the processed first audio signal; detecting the voice activity state of the guest end according to the guest audio signal, the first audio signal, and the processed first audio signal; performing echo cancellation on the first audio signal in the mixed audio signal according to the different voice activity states and the first audio signal to obtain the processed mixed audio signal; synthesizing the processed first audio signal and the processed mixed audio signal and pushing the result to the guest end.
  • any reference to memory, storage, database, or other media used in the embodiments of the present application may include non-volatile and/or volatile memory.
  • Non-volatile memory can include read-only memory (ROM, Read-Only Memory), programmable ROM (PROM, Programmable Read-Only Memory), electrically programmable ROM (EPROM, Electrically Programmable Read-Only Memory), electrically erasable programmable ROM (EEPROM, Electrically Erasable Programmable Read-Only Memory), or flash memory. Volatile memory may include random access memory (RAM, Random Access Memory) or external cache memory.
  • By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM, Static Random Access Memory), dynamic RAM (DRAM, Dynamic Random Access Memory), synchronous DRAM (SDRAM, Synchronous Dynamic Random Access Memory), double data rate SDRAM (DDRSDRAM, Double Data Rate Synchronous Dynamic Random Access Memory), enhanced SDRAM (ESDRAM, Enhanced Synchronous Dynamic Random Access Memory), Synchlink DRAM (SLDRAM, Sync Link Dynamic Random Access Memory), Rambus direct RAM (DRDRAM, Direct Rambus Dynamic Random Access Memory), direct Rambus dynamic RAM (DRDRAM, Direct Rambus Dynamic Random Access Memory), and Rambus dynamic RAM (RDRAM, Rambus Dynamic Random Access Memory).

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Acoustics & Sound (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Circuit For Audible Band Transducer (AREA)
  • Telephone Function (AREA)

Abstract

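A live audio processing method and device, an electronic device, and a storage medium. The method is applied to a host end and includes: performing echo cancellation on a first audio signal formed by mixing a guest audio signal with a background audio signal of the host end (S204); detecting the voice activity state of the guest end according to the guest audio signal, the first audio signal, and the processed first audio signal (S206); performing echo cancellation on a mixed audio signal according to the different voice activity states and the first audio signal (S208); and synthesizing the echo-cancelled first audio signal and mixed audio signal and pushing the result to the guest end (S210). With this method, two echo cancellation paths work together, and the echo processing applied to the sound signal collected by the host-end microphone is adjusted according to the voice activity state of the guest end, so that the host audio signal is not over-processed and the quality of the host's voice as heard at the guest end is improved.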

Description

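Live audio processing method, device, electronic device, and storage medium
Cross-reference to related applications
This application claims priority to Chinese patent application No. 201911191671.X, filed with the China Patent Office on November 28, 2019 and entitled "Live audio processing method, device, electronic device, and storage medium", the entire contents of which are incorporated herein by reference.
Technical field
This application relates to the field of audio processing technology, and in particular to a live audio processing method, device, electronic device, and storage medium.
Background
A live-streaming companion is an auxiliary tool for live-streaming platforms and live-streaming software. As the variety of platforms and software has grown, a range of companions has appeared alongside them. A companion can assist a live stream by providing functions such as desktop sound effects, screen capture, picture quality adjustment, picture-in-picture, high-definition large-screen display, a large song library, intelligent special effects, and audio and video recording, making live streaming easy and smooth.
Adding a co-hosting (mic-linking) function to a live-streaming companion lets the host connect with guests, so that the host-end sound signal is pushed to the connected guest end. In some scenarios, if the host end plays background music, the background music must also be pushed to the guest end. When the host end uses a microphone to collect the host's voice and the background music, the microphone also picks up the guest's voice played through the loudspeaker, so that the guest hears his or her own voice. Echo cancellation must therefore be applied, during pushing, to the guest voice signal picked up by the host-end microphone. The inventors realized that conventional echo cancellation tends to over-cancel the host voice signal, so that the host's voice as heard at the guest end has poor quality.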
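Summary of the invention
This application provides a live audio processing method, device, electronic device, and storage medium, to at least solve the problem in the related art that the host's voice as heard at the connected guest end has poor quality. The technical solutions of the embodiments of this application are as follows:
According to a first aspect of the embodiments of this application, a live audio processing method applied to a host end is provided, including:
acquiring a first audio signal formed by mixing a guest audio signal with a background audio signal of the host end;
performing echo cancellation on the guest audio signal in the first audio signal to obtain a processed first audio signal;
detecting the voice activity state of the guest end according to the guest audio signal, the first audio signal, and the processed first audio signal;
performing, according to the different voice activity states and the first audio signal, echo cancellation on the first audio signal in a mixed audio signal to obtain a processed mixed audio signal, the mixed audio signal being the signal composed of the first audio signal and the host audio signal as collected by the host-end microphone;
synthesizing the processed first audio signal and the processed mixed audio signal and pushing the result to the guest end.
According to a second aspect of the embodiments of this application, a live audio processing device is provided, including:
a first audio signal acquisition module, configured to acquire the first audio signal formed by mixing the guest audio signal with the background audio signal of the host end;
a first echo cancellation module, configured to perform echo cancellation on the guest audio signal in the first audio signal to obtain the processed first audio signal;
a voice activity state detection module, configured to detect the voice activity state of the guest end according to the guest audio signal, the first audio signal, and the processed first audio signal;
a second echo cancellation module, configured to perform, according to the different voice activity states and the first audio signal, echo cancellation on the first audio signal in the mixed audio signal to obtain the processed mixed audio signal, the mixed audio signal being the signal composed of the first audio signal and the host audio signal as collected by the host-end microphone;
a second audio signal synthesis module, configured to synthesize the processed first audio signal and the processed mixed audio signal and push the result to the guest end.
According to a third aspect of the embodiments of this application, an electronic device is provided, including:
a processor; and a memory for storing processor-executable instructions; wherein the processor is configured to execute the instructions to implement the steps of the above method.
According to a fourth aspect of the embodiments of this application, a storage medium is provided; when instructions in the storage medium are executed by a processor of an electronic device, the electronic device is enabled to perform the steps of the above method.
According to a fifth aspect of the embodiments of this application, a computer program product is provided which, when executed on a data processing device, is adapted to execute a program initialized with the above method steps.
By having two echo cancellation paths work together, this application adjusts the echo processing applied to the sound signal collected by the host-end microphone according to the different voice activity states of the guest end, so that the host audio signal at the host end is not over-processed. This protects the host audio signal and improves the quality of the host's voice as heard at the guest end.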
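Brief description of the drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with this application and, together with the specification, serve to explain the principles of the embodiments of this application; they do not unduly limit the embodiments of this application.
FIG. 1 is a diagram of the application environment of the live audio processing method in one embodiment;
FIG. 2 is a schematic flowchart of the live audio processing method in one embodiment;
FIG. 3 is a schematic flowchart of determining the voice activity state of the guest end in one embodiment;
FIG. 4 is a schematic flowchart of the echo cancellation applied to the host-end sound signal when the guest end is in the voice state, in one embodiment;
FIG. 5 is a schematic flowchart of the live audio processing method in one embodiment;
FIG. 6 is a structural block diagram of the live audio processing device in one embodiment;
FIG. 7 is an internal structure diagram of the electronic device in one embodiment.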
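Detailed description
To help those of ordinary skill in the art better understand the technical solutions of the embodiments of this application, the technical solutions are described clearly and completely below with reference to the accompanying drawings.
The live audio processing method provided by the embodiments of this application can be applied in the application environment shown in FIG. 1. The environment includes a host end 110, a server 120, and a guest end 130. The host end 110 communicates with the server 120 over a network, and the guest end 130 communicates with the server 120 over a network. Applications or plug-ins such as a live-streaming companion may be installed on the host end 110 in advance, so that the host end 110 can use them for entertainment or game live streaming. During a live stream, the application or plug-in installed on the host end 110 can adjust, according to the real-time voice activity state of the guest end 130, the way echo cancellation is applied to the sound signal collected by the microphone of the host end 110, so that the audio signal of the host end 110 is not over-cancelled, which protects the sound quality of the host end 110. Specifically, the host end 110 mixes the acquired guest audio signal with the background audio signal of the host end to form the first audio signal. The host end 110 performs echo cancellation on the guest audio signal in the first audio signal to obtain the processed first audio signal. The voice activity state of the guest end is then detected according to the guest audio signal, the first audio signal, and the processed first audio signal. According to the different voice activity states and the first audio signal, echo cancellation is performed on the first audio signal in the mixed audio signal to obtain the processed mixed audio signal. The host end 110 synthesizes the processed first audio signal and the processed mixed audio signal and pushes the result to the guest end 130. The host end 110 and the guest end 130 may be, but are not limited to, personal computers, notebook computers, smartphones, tablet computers, and portable wearable devices; the server 120 may be implemented as a standalone server or as a cluster of multiple servers.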
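In one embodiment, as shown in FIG. 2, a live audio processing method is provided. Taking its application to the host end 110 in FIG. 1 as an example, the method includes the following steps:
Step 202: acquire the first audio signal formed by mixing the guest audio signal with the background audio signal of the host end.
The guest audio signal may be the guest's voice signal. The background audio signal of the host end may be background music played locally at the host end, such as game music or co-hosted karaoke music. Specifically, after receiving the guest audio signal and the locally played background audio signal, the host end may mix the two to form the first audio signal.
Step 204: perform echo cancellation on the guest audio signal in the first audio signal to obtain the processed first audio signal.
Specifically, because the background audio signal obtained through the player cannot be pushed directly to the guest end, after the first audio signal is obtained, echo cancellation may be performed on it to remove the guest audio signal and obtain the background audio signal. In the embodiments of this application, acoustic echo cancellation may be used for this purpose.
Step 206: detect the voice activity state of the guest end according to the guest audio signal, the first audio signal, and the processed first audio signal.
Voice activity detection (VAD) of the guest end may refer to detecting whether there is currently speech at the guest end, for example whether the connected guest is talking. If the guest is currently talking, the voice activity state is regarded as the voice state; if not, it is regarded as the mute state. Specifically, the voice activity state may be detected by threshold-based algorithms, model-matching algorithms, and the like. Taking a threshold-based algorithm as an example, the voice activity state of the guest end may be determined from the audio energy of received guest audio frames of a certain duration. In addition, the energy of the first audio frame before echo cancellation (the audio formed by combining the guest audio signal and the host-end background audio signal) and the energy of the first audio frame after echo cancellation (the background audio signal obtained by echo cancellation), over a certain duration, may also be examined to determine the voice activity state of the guest end, which improves the accuracy of the determination.
Step 208: according to the different voice activity states and the first audio signal, perform echo cancellation on the first audio signal in the mixed audio signal to obtain the processed mixed audio signal.
The mixed audio signal is the signal composed of the first audio signal and the host audio signal as collected by the host-end microphone.
Specifically, the echo in the sound signal collected by the host-end microphone is mainly produced by the first audio signal. If the echo of the background audio signal in the first audio signal is not cancelled completely, it can be masked by the internally mixed background audio signal, so the echo of the guest audio signal in the first audio signal is the echo that chiefly needs to be cancelled completely. The mixed audio signal collected by the microphone can therefore be echo-cancelled to different degrees according to the voice activity state of the guest end. When the guest end is detected as not speaking (the mute state), a lighter form of echo cancellation may be applied to the mixed audio signal to remove the first audio signal and obtain the host audio signal; when the guest end is detected as speaking (the voice state), a stronger form of echo cancellation may be applied to the mixed audio signal in order to cancel the guest audio signal echo completely.
Step 210: synthesize the processed first audio signal and the processed mixed audio signal and push the result to the guest end.
Specifically, after the background audio signal has been obtained by echo cancellation of the first audio signal, and the host audio signal has been obtained by echo cancellation of the mixed audio signal collected by the host-end microphone, the background audio signal and the host audio signal may be mixed and pushed to the guest end.
In the above live audio processing method, the way echo cancellation is applied to the mixed audio signal (composed of the first audio signal and the host audio signal as collected by the host-end microphone) is adjusted according to the different voice activity states of the guest end, and that way is used to cancel the first audio signal in the mixed audio signal. The host signal at the host end is therefore not over-processed, which protects the host audio signal and improves the quality of the host's voice as heard at the guest end.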
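In one embodiment, as shown in FIG. 3, detecting the voice activity state of the guest end according to the guest audio signal, the first audio signal, and the processed first audio signal includes the following steps:
Step 302: calculate, from the guest audio signal, the first audio signal, and the processed first audio signal, the guest audio energy, the first audio energy, and the processed first audio energy, respectively.
In the embodiments of this application, a threshold-based algorithm may be used to detect the voice activity state of the guest end. Specifically, a formula may be used to measure, for one audio frame, the guest audio energy, the first audio energy, and the processed first audio energy (that is, the background audio energy obtained after echo cancellation).
[Formula not reproduced in this text.] The formula gives the energy of the n-th audio frame; L denotes the length of the audio frame, which may be, but is not limited to, 20 ms; and S denotes the audio signal.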
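The frame-energy measurement in step 302 can be sketched as follows. This is a minimal sketch, assuming the common sum-of-squares definition of frame energy; the original formula may differ, for example by normalizing by the frame length. The function name `frame_energy` and the 16 kHz sample rate are hypothetical and do not come from the source.

```python
def frame_energy(signal, n, frame_len):
    """Energy of the n-th frame of `signal`.

    Assumed definition: sum of squared samples over the frame
    (the source's exact formula is not available).
    """
    start = n * frame_len
    frame = signal[start:start + frame_len]
    return sum(s * s for s in frame)


# A 20 ms frame (the example length given in the text) at an assumed
# 16 kHz sample rate corresponds to 320 samples.
SAMPLE_RATE = 16000                      # hypothetical sample rate
FRAME_LEN = SAMPLE_RATE * 20 // 1000     # samples per 20 ms frame
```

The same function would be applied to the guest audio signal, the first audio signal, and the processed first audio signal to obtain the three energies compared in the following steps.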
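Step 304: when the guest audio energy is less than the first threshold and the ratio of the processed first audio energy to the first audio energy is greater than the second threshold, the voice activity state is detected as the mute state.
Specifically, suppose that for the n-th audio frame the measured guest audio energy is E1, the first audio energy is Ein, the processed first audio energy is Eout, the first threshold is Th1, and the second threshold is Th2. If E1 < Th1, the guest end can be regarded as being in the mute state at this moment. Further, if the ratio of the processed first audio energy Eout to the first audio energy Ein satisfies Eout/Ein > Th2, the guest audio signal can be regarded as making up only a small share of the first audio signal; that is, the host end has received very little guest audio signal. The guest end can therefore be judged to be in the mute state.
Step 306: when the guest audio energy is greater than the first threshold, or the ratio of the processed first audio energy to the first audio energy is less than the second threshold, the voice activity state is detected as the voice state.
Specifically, if E1 > Th1, the guest end can be regarded as being in the voice state at this moment. Further, if the ratio Eout/Ein < Th2, the guest audio signal can be regarded as making up a large share of the first audio signal; that is, the host end has received a substantial amount of guest audio signal. The guest end can therefore be judged to be in the voice state. In the embodiments of this application, the first threshold Th1 may be, but is not limited to, 0.001, and Th2 may be, but is not limited to, 0.9.
In the embodiments of this application, determining the voice activity state of the guest end from the guest audio energy together with the audio energy received at the host playback end before and after echo cancellation improves the accuracy of voice activity state detection.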
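The decision rule of steps 304 and 306 can be sketched as a small function. The thresholds are the example values given in the text; the handling of exact equality and of a zero-energy input frame is not specified in the source and is an assumption here, as is the name `detect_guest_state`.

```python
TH1 = 0.001  # first threshold (example value from the text)
TH2 = 0.9    # second threshold (example value from the text)


def detect_guest_state(e_guest, e_in, e_out, th1=TH1, th2=TH2):
    """Return "mute" or "voice" for one frame.

    e_guest: guest audio energy (E1)
    e_in:    first audio energy before echo cancellation (Ein)
    e_out:   processed first audio energy after echo cancellation (Eout)
    """
    # Assumption: with no input energy there is nothing to cancel,
    # so the ratio Eout/Ein is treated as 1.
    ratio = e_out / e_in if e_in > 0.0 else 1.0
    if e_guest < th1 and ratio > th2:
        return "mute"
    # Assumption: boundary cases (exact equality) fall into the voice state,
    # which selects the stronger echo cancellation.
    return "voice"
```

Defaulting ambiguous frames to the voice state is a conservative choice: it risks slightly more processing of the host signal rather than leaving guest echo in the output.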
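In one embodiment, performing echo cancellation on the first audio signal in the mixed audio signal according to the different voice activity states and the first audio signal to obtain the processed mixed audio signal includes: when the voice activity state is detected as the mute state, performing adaptive filtering on the mixed audio signal with the first audio signal as the reference signal, filtering out the first audio signal in the mixed audio signal.
Specifically, if the guest end is detected as being in the mute state, the mixed audio signal collected by the host-end microphone can be regarded as containing no guest audio echo, or very little, so an adaptive filter can be used to apply a lighter echo cancellation to the mixed audio signal. With the first audio signal as the reference signal, an estimate of the echo signal collected by the microphone is obtained by linear superposition. Subtracting this echo estimate from the mixed audio signal collected by the microphone cancels the echo in the mixed audio signal and yields the host audio signal. Further, if the mixed audio signal collected at the host end contains a small amount of guest audio echo, the echo estimate obtained by linear superposition deviates from the guest audio signal actually collected by the microphone, so adaptive filtering alone cannot remove the guest audio echo completely. In this case, a light non-linear process (NLP) may be applied to the filtered mixed audio signal, which both removes the guest audio echo completely and protects the sound quality of the host end. In the embodiments of this application, when the guest end is in the mute state, applying lightweight echo cancellation to the sound signal collected by the microphone protects the host-end audio signal and thereby improves the quality of the host's voice as heard at the guest end.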
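The adaptive filtering described above (estimate the echo as a linear superposition of reference samples, then subtract it from the microphone signal) can be sketched with a normalized LMS (NLMS) filter. The source does not name a specific adaptive algorithm, so NLMS is an assumption, as are the function name `nlms_cancel`, the tap count, the step size, and the synthetic two-tap echo path used in the demonstration.

```python
import random


def nlms_cancel(mic, ref, taps=8, mu=0.5, eps=1e-8):
    """Cancel the echo of `ref` contained in `mic` with an NLMS filter.

    Returns (err, echo_est): the echo-cancelled signal and the echo estimate.
    """
    w = [0.0] * taps      # filter weights
    buf = [0.0] * taps    # most recent reference samples, newest first
    err, echo_est = [], []
    for d, x in zip(mic, ref):
        buf = [x] + buf[:-1]
        # Echo estimate: linear superposition of recent reference samples.
        y = sum(wi * xi for wi, xi in zip(w, buf))
        e = d - y  # subtract the estimate from the microphone sample
        norm = sum(xi * xi for xi in buf) + eps
        w = [wi + mu * e * xi / norm for wi, xi in zip(w, buf)]
        echo_est.append(y)
        err.append(e)
    return err, echo_est


# Demonstration on synthetic data: the "microphone" hears only a two-tap
# echo of the reference, so the residual should decay toward zero.
rng = random.Random(0)
ref = [rng.uniform(-1.0, 1.0) for _ in range(4000)]
mic = [0.6 * ref[i] + (0.3 * ref[i - 1] if i else 0.0) for i in range(len(ref))]
err, _ = nlms_cancel(mic, ref)
```

In the method described here, `ref` would be the first audio signal and `mic` the mixed audio signal; for the first echo cancellation path, `ref` would be the guest audio signal and `mic` the first audio signal.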
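In one embodiment, as shown in FIG. 4, performing echo cancellation on the first audio signal in the mixed audio signal according to the different voice activity states and the first audio signal to obtain the processed mixed audio signal includes:
Step 402: when the voice activity state is detected as the voice state, perform adaptive filtering on the mixed audio signal with the first audio signal as the reference signal to obtain the filtered mixed audio signal.
Specifically, if the guest end is detected as being in the voice state, the mixed audio signal collected by the host-end microphone can be regarded as containing a strong guest audio echo, so a stronger echo cancellation may be applied to the mixed audio signal. First, with the first audio signal as the reference signal, an estimate of the echo signal collected by the microphone is obtained by the linear superposition of adaptive filtering. This echo estimate is subtracted from the mixed audio signal collected by the microphone, filtering the mixed audio signal.
Step 404: perform non-linear processing on the filtered mixed audio signal to eliminate the residual echo signal in it.
Specifically, because the echo estimate obtained by linear superposition deviates from the guest audio signal actually collected by the microphone, adaptive filtering alone cannot remove the guest audio echo completely, and residual echo remains. The residual echo can be removed by further applying non-linear processing to the filtered mixed audio signal. The non-linear processing takes two input signals: one is the residual echo signal left after the linear adaptive filtering, denoted err, and the other is the echo signal estimated by the adaptive filter, denoted echo. Both are transformed to the frequency domain by a fast Fourier transform (FFT), that is, Err = FFT(err) and Echo = FFT(echo). The signal-to-noise ratio Snr(k) of the magnitude spectra of Err and Echo can then be calculated. If Snr(k) at a frequency bin k is low, the input at that bin can be regarded as mainly residual echo, and Err(k) is weighted with a low gain; if Snr(k) at a frequency bin k is high, the input at that bin can be regarded as mainly the host-end audio signal, and Err(k) is weighted with a high gain. Finally, the weighted result Err' is transformed back to the time domain by the inverse Fourier transform, that is, err' = IFFT(Err'), and the residual echo is further removed from the output signal err'.
In the embodiments of this application, when the guest end is in the voice state, applying a stronger echo cancellation to the sound signal collected by the microphone completely removes the interference of the guest audio echo.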
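The non-linear processing of step 404 can be sketched as a per-bin spectral gain. This is a sketch under stated assumptions: Snr(k) is taken to be the magnitude ratio |Err(k)| / |Echo(k)|, and the gain mapping snr / (1 + snr) is a hypothetical choice that gives a low gain at low SNR and a gain near 1 at high SNR; the source specifies neither. A naive DFT from the standard library stands in for the FFT so that the block is self-contained.

```python
import cmath


def dft(x):
    """Naive discrete Fourier transform (O(N^2)); stands in for an FFT."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spec):
    """Inverse DFT returning the real part (inputs here are real signals)."""
    n = len(spec)
    return [sum(spec[k] * cmath.exp(2j * cmath.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(n)]


def nlp_suppress(err, echo, eps=1e-12):
    """Weight each frequency bin of `err` by a gain derived from Snr(k).

    err:  signal after linear adaptive filtering (contains residual echo)
    echo: echo estimate produced by the adaptive filter
    """
    Err, Echo = dft(err), dft(echo)
    weighted = []
    for err_k, echo_k in zip(Err, Echo):
        snr = abs(err_k) / (abs(echo_k) + eps)  # assumed definition of Snr(k)
        gain = snr / (1.0 + snr)                # assumed gain mapping
        weighted.append(gain * err_k)
    return idft(weighted)
```

With a zero echo estimate the gain is close to 1 and the frame passes through almost unchanged; when the echo estimate dominates a bin, that bin is strongly attenuated. A lighter variant for the mute state could, for instance, raise the minimum gain, but that is likewise an assumption.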
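In one embodiment, performing echo cancellation on the guest audio signal in the first audio signal to obtain the processed first audio signal includes: performing adaptive filtering on the first audio signal with the guest audio signal as the reference signal to obtain the processed first audio signal.
Specifically, an adaptive filter may be used to perform echo cancellation on the first audio signal received by the host-end player. With the guest audio signal as the reference signal, an estimate of the echo signal can be obtained by linear superposition. Subtracting this echo estimate from the acquired first audio signal cancels the echo in the first audio signal and thereby separates out the background audio signal.
In one embodiment, after performing echo cancellation on the first audio signal in the mixed audio signal according to the different voice activity states and the first audio signal to obtain the processed mixed audio signal, the method further includes: synthesizing the first audio signal and the processed mixed audio signal and pushing the result to the audience end.
Specifically, the live-streaming scenario also includes an audience end. The processed mixed audio signal (the host audio signal obtained by echo cancellation) and the first audio signal (the guest audio signal and the host-end background audio signal) may be mixed to obtain the audio signal pushed to the audience end. This not only lets the audience hear the host audio signal, the guest audio signal, and the host-end background audio signal at the same time, but also improves the sound quality heard by the audience.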
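In one embodiment, as shown in FIG. 5, the live audio processing method is illustrated by a specific example that includes the following steps:
Step 501: acquire the guest audio signal.
Step 502: acquire the background audio signal played by the host-end player.
Step 503: mix the acquired guest audio signal and background audio signal to form the first audio signal.
Step 504: play the first audio signal through the loudspeaker.
Step 505: collect the first audio signal and the host audio signal with the microphone to obtain the mixed audio signal.
Step 506: perform echo cancellation on the guest audio signal in the first audio signal to obtain the processed first audio signal, that is, the background audio signal.
Specifically, adaptive filtering is performed on the first audio signal with the guest audio signal as the reference signal to obtain the processed first audio signal.
Step 507: detect the voice activity state of the guest end. According to the different voice activity states, adjust the way echo cancellation is applied to the mixed audio signal composed of the first audio signal and the host audio signal as collected by the microphone.
Specifically, the voice activity state of the guest end may be detected from the guest audio energy, the first audio energy, and the processed first audio energy. When the guest audio energy is less than the first threshold and the ratio of the processed first audio energy to the first audio energy is greater than the second threshold, the voice activity state is detected as the mute state; when the guest audio energy is greater than the first threshold, or the ratio of the processed first audio energy to the first audio energy is less than the second threshold, the voice activity state is detected as the voice state.
Step 508: perform echo cancellation on the first audio signal in the mixed audio signal to obtain the processed mixed audio signal.
Specifically, when the voice activity state is detected as the mute state, adaptive filtering is performed on the mixed audio signal with the first audio signal as the reference signal, filtering out the first audio signal in the mixed audio signal. When the voice activity state is detected as the voice state, adaptive filtering is performed on the mixed audio signal with the first audio signal as the reference signal to obtain the filtered mixed audio signal, and non-linear processing is then performed on the filtered mixed audio signal to eliminate the residual echo signal in it.
Step 509: synthesize the processed first audio signal and the processed mixed audio signal and push the result to the guest end.
Step 510: synthesize the first audio signal and the processed mixed audio signal and push the result to the audience end.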
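The data flow of steps 503 and 506 to 510 can be sketched as one per-frame function. This is an illustrative skeleton only: the three processing stages are passed in as callables, mixing is assumed to be a sample-wise sum, and all names (`mix`, `process_frame`, and the parameter names) are hypothetical rather than taken from the source.

```python
def mix(a, b):
    """Sample-wise sum of two equal-length frames (assumed mixing rule)."""
    return [x + y for x, y in zip(a, b)]


def process_frame(guest, background, mic,
                  cancel_first, detect_state, cancel_mixed):
    """Route one frame through the two echo cancellation paths.

    guest, background: frames acquired in steps 501 and 502
    mic:               mixed audio frame collected by the microphone (step 505)
    cancel_first(first, guest)              -> processed first audio signal
    detect_state(guest, first, processed)   -> "mute" or "voice"
    cancel_mixed(mic, first, state)         -> processed mixed audio signal
    """
    first = mix(guest, background)                        # step 503
    processed_first = cancel_first(first, guest)          # step 506
    state = detect_state(guest, first, processed_first)   # step 507
    processed_mixed = cancel_mixed(mic, first, state)     # step 508
    to_guest = mix(processed_first, processed_mixed)      # step 509
    to_audience = mix(first, processed_mixed)             # step 510
    return to_guest, to_audience
```

The guest receives the background plus the host voice without his or her own voice, while the audience additionally receives the guest voice through the unprocessed first audio signal.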
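In one embodiment, as shown in FIG. 6, a live audio processing device 600 is provided, including: a first audio signal acquisition module 601, a first echo cancellation module 602, a voice activity state detection module 603, a second echo cancellation module 604, and a second audio signal synthesis module 605, wherein:
the first audio signal acquisition module 601 is configured to acquire the first audio signal formed by mixing the guest audio signal with the background audio signal of the host end;
the first echo cancellation module 602 is configured to perform echo cancellation on the guest audio signal in the first audio signal to obtain the processed first audio signal;
the voice activity state detection module 603 is configured to detect the voice activity state of the guest end according to the guest audio signal, the first audio signal, and the processed first audio signal;
the second echo cancellation module 604 is configured to perform, according to the different voice activity states and the first audio signal, echo cancellation on the first audio signal in the mixed audio signal to obtain the processed mixed audio signal;
the second audio signal synthesis module 605 is configured to synthesize the processed first audio signal and the processed mixed audio signal and push the result to the guest end.
In one embodiment, the voice activity state detection module 603 is further configured to calculate, from the guest audio signal, the first audio signal, and the processed first audio signal, the guest audio energy, the first audio energy, and the processed first audio energy, respectively; when the guest audio energy is less than the first threshold and the ratio of the processed first audio energy to the first audio energy is greater than the second threshold, the voice activity state is detected as the mute state; when the guest audio energy is greater than the first threshold, or the ratio of the processed first audio energy to the first audio energy is less than the second threshold, the voice activity state is detected as the voice state.
In one embodiment, the second echo cancellation module 604 is configured to, when the voice activity state is detected as the mute state, perform adaptive filtering on the mixed audio signal with the first audio signal as the reference signal, filtering out the first audio signal in the mixed audio signal.
In one embodiment, the second echo cancellation module 604 is configured to, when the voice activity state is detected as the voice state, perform adaptive filtering on the mixed audio signal with the first audio signal as the reference signal to obtain the filtered mixed audio signal, and perform non-linear processing on the filtered mixed audio signal to eliminate the residual echo signal in it.
In one embodiment, the first echo cancellation module 602 is configured to perform adaptive filtering on the first audio signal with the guest audio signal as the reference signal to obtain the processed first audio signal.
In one embodiment, the live audio processing device 600 further includes a third audio signal synthesis module, configured to synthesize the first audio signal and the processed mixed audio signal and push the result to the audience end.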
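In one embodiment, an electronic device is provided. The electronic device may be a terminal, and its internal structure may be as shown in FIG. 7. The electronic device includes a processor, a memory, a network interface, a display screen, and an input device connected through a system bus. The processor of the electronic device provides computing and control capabilities. The memory of the electronic device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and instructions. The internal memory provides an environment for running the operating system and the instructions in the non-volatile storage medium. The network interface of the electronic device is used to communicate with external terminals over a network connection. When executed by the processor, the instructions implement a live audio processing method. The display screen of the electronic device may be a liquid crystal display or an electronic ink display. The input device of the electronic device may be a touch layer covering the display screen, a button, trackball, or touchpad provided on the housing of the electronic device, or an external keyboard, touchpad, or mouse.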
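In one embodiment, an electronic device is provided, including a memory and a processor. The memory stores processor-executable instructions, and the processor implements the following steps when executing the instructions:
acquiring the first audio signal formed by mixing the guest audio signal with the background audio signal of the host end; performing echo cancellation on the guest audio signal in the first audio signal to obtain the processed first audio signal; detecting the voice activity state of the guest end according to the guest audio signal, the first audio signal, and the processed first audio signal; performing, according to the different voice activity states and the first audio signal, echo cancellation on the first audio signal in the mixed audio signal to obtain the processed mixed audio signal; synthesizing the processed first audio signal and the processed mixed audio signal and pushing the result to the guest end.
In one embodiment, the processor further implements the following steps when executing the instructions:
calculating, from the guest audio signal, the first audio signal, and the processed first audio signal, the guest audio energy, the first audio energy, and the processed first audio energy, respectively; when the guest audio energy is less than the first threshold and the ratio of the processed first audio energy to the first audio energy is greater than the second threshold, detecting the voice activity state as the mute state; when the guest audio energy is greater than the first threshold, or the ratio of the processed first audio energy to the first audio energy is less than the second threshold, detecting the voice activity state as the voice state.
In one embodiment, the processor further implements the following steps when executing the instructions:
when the voice activity state is detected as the mute state, performing adaptive filtering on the mixed audio signal with the first audio signal as the reference signal, filtering out the first audio signal in the mixed audio signal.
In one embodiment, the processor further implements the following steps when executing the instructions:
when the voice activity state is detected as the voice state, performing adaptive filtering on the mixed audio signal with the first audio signal as the reference signal to obtain the filtered mixed audio signal; performing non-linear processing on the filtered mixed audio signal to eliminate the residual echo signal in it.
In one embodiment, the processor further implements the following steps when executing the instructions:
performing adaptive filtering on the first audio signal with the guest audio signal as the reference signal to obtain the processed first audio signal.
In one embodiment, the processor further implements the following steps when executing the instructions:
synthesizing the first audio signal and the processed mixed audio signal and pushing the result to the audience end.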
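In one embodiment, a storage medium is provided on which processor-executable instructions are stored. When executed by a processor, the instructions implement the following steps:
acquiring the first audio signal formed by mixing the guest audio signal with the background audio signal of the host end; performing echo cancellation on the guest audio signal in the first audio signal to obtain the processed first audio signal; detecting the voice activity state of the guest end according to the guest audio signal, the first audio signal, and the processed first audio signal; performing, according to the different voice activity states and the first audio signal, echo cancellation on the first audio signal in the mixed audio signal to obtain the processed mixed audio signal; synthesizing the processed first audio signal and the processed mixed audio signal and pushing the result to the guest end.
In one embodiment, the instructions further implement the following steps when executed by the processor:
calculating, from the guest audio signal, the first audio signal, and the processed first audio signal, the guest audio energy, the first audio energy, and the processed first audio energy, respectively; when the guest audio energy is less than the first threshold and the ratio of the processed first audio energy to the first audio energy is greater than the second threshold, detecting the voice activity state as the mute state; when the guest audio energy is greater than the first threshold, or the ratio of the processed first audio energy to the first audio energy is less than the second threshold, detecting the voice activity state as the voice state.
In one embodiment, the instructions further implement the following steps when executed by the processor:
when the voice activity state is detected as the mute state, performing adaptive filtering on the mixed audio signal with the first audio signal as the reference signal, filtering out the first audio signal in the mixed audio signal.
In one embodiment, the instructions further implement the following steps when executed by the processor:
when the voice activity state is detected as the voice state, performing adaptive filtering on the mixed audio signal with the first audio signal as the reference signal to obtain the filtered mixed audio signal; performing non-linear processing on the filtered mixed audio signal to eliminate the residual echo signal in it.
In one embodiment, the instructions further implement the following steps when executed by the processor:
performing adaptive filtering on the first audio signal with the guest audio signal as the reference signal to obtain the processed first audio signal.
In one embodiment, the instructions further implement the following steps when executed by the processor:
synthesizing the first audio signal and the processed mixed audio signal and pushing the result to the audience end.
In one embodiment, a computer program product is also provided which, when executed on a data processing device, is adapted to execute a program initialized with the following method steps:
acquiring the first audio signal formed by mixing the guest audio signal with the background audio signal of the host end; performing echo cancellation on the guest audio signal in the first audio signal to obtain the processed first audio signal; detecting the voice activity state of the guest end according to the guest audio signal, the first audio signal, and the processed first audio signal; performing, according to the different voice activity states and the first audio signal, echo cancellation on the first audio signal in the mixed audio signal to obtain the processed mixed audio signal; synthesizing the processed first audio signal and the processed mixed audio signal and pushing the result to the guest end.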
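Those of ordinary skill in the art will understand that all or part of the processes in the methods of the above embodiments can be carried out by instructions, which may be stored in a non-volatile computer-readable storage medium and which, when executed, may include the processes of the embodiments of the above methods. Any reference to memory, storage, a database, or other media used in the embodiments provided in this application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM, Read-Only Memory), programmable ROM (PROM, Programmable Read-Only Memory), electrically programmable ROM (EPROM, Electrically Programmable Read-Only Memory), electrically erasable programmable ROM (EEPROM, Electrically Erasable Programmable Read-Only Memory), or flash memory. Volatile memory may include random access memory (RAM, Random Access Memory) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM, Static Random Access Memory), dynamic RAM (DRAM, Dynamic Random Access Memory), synchronous DRAM (SDRAM, Synchronous Dynamic Random Access Memory), double data rate SDRAM (DDRSDRAM, Double Data Rate Synchronous Dynamic Random Access Memory), enhanced SDRAM (ESDRAM, Enhanced Synchronous Dynamic Random Access Memory), Synchlink DRAM (SLDRAM, Sync Link Dynamic Random Access Memory), Rambus direct RAM (DRDRAM, Direct Rambus Dynamic Random Access Memory), direct Rambus dynamic RAM (DRDRAM, Direct Rambus Dynamic Random Access Memory), and Rambus dynamic RAM (RDRAM, Rambus Dynamic Random Access Memory).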

Claims (19)

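  1. A live audio processing method, applied to a host end, the method comprising:
    acquiring a first audio signal formed by mixing a guest audio signal with a background audio signal of the host end;
    performing, according to the guest audio signal, echo cancellation on the guest audio signal in the first audio signal to obtain a processed first audio signal;
    detecting a voice activity state of a guest end according to the guest audio signal, the first audio signal, and the processed first audio signal;
    performing, according to the different voice activity states and the first audio signal, echo cancellation on the first audio signal in a mixed audio signal to obtain a processed mixed audio signal, the mixed audio signal being a signal composed of the first audio signal and a host audio signal as collected by a microphone of the host end;
    synthesizing the processed first audio signal and the processed mixed audio signal and pushing the result to the guest end.
  2. The method according to claim 1, wherein detecting the voice activity state of the guest end according to the guest audio signal, the first audio signal, and the processed first audio signal comprises:
    calculating, from the guest audio signal, the first audio signal, and the processed first audio signal, a guest audio energy, a first audio energy, and a processed first audio energy, respectively;
    when the guest audio energy is less than a first threshold and the ratio of the processed first audio energy to the first audio energy is greater than a second threshold, detecting the voice activity state as a mute state;
    when the guest audio energy is greater than the first threshold, or the ratio of the processed first audio energy to the first audio energy is less than the second threshold, detecting the voice activity state as a voice state.
  3. The method according to claim 2, wherein performing, according to the different voice activity states and the first audio signal, echo cancellation on the first audio signal in the mixed audio signal to obtain the processed mixed audio signal comprises:
    when the voice activity state is detected as the mute state, performing adaptive filtering on the mixed audio signal with the first audio signal as a reference signal, filtering out the first audio signal in the mixed audio signal.
  4. The method according to claim 2, wherein performing, according to the different voice activity states and the first audio signal, echo cancellation on the first audio signal in the mixed audio signal to obtain the processed mixed audio signal comprises:
    when the voice activity state is detected as the voice state, performing adaptive filtering on the mixed audio signal with the first audio signal as a reference signal to obtain a filtered mixed audio signal;
    performing non-linear processing on the filtered mixed audio signal to eliminate a residual echo signal in the filtered mixed audio signal.
  5. The method according to claim 1, wherein performing echo cancellation on the guest audio signal in the first audio signal to obtain the processed first audio signal comprises:
    performing adaptive filtering on the first audio signal with the guest audio signal as a reference signal to obtain the processed first audio signal.
  6. The method according to claim 1, further comprising, after performing, according to the different voice activity states and the first audio signal, echo cancellation on the first audio signal in the mixed audio signal to obtain the processed mixed audio signal:
    synthesizing the first audio signal and the processed mixed audio signal and pushing the result to an audience end.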
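  7. A live audio processing device, applied to a host end, the device comprising:
    a first audio signal acquisition module, configured to acquire a first audio signal formed by mixing a guest audio signal with a background audio signal of the host end;
    a first echo cancellation module, configured to perform, according to the guest audio signal, echo cancellation on the guest audio signal in the first audio signal to obtain a processed first audio signal;
    a voice activity state detection module, configured to detect a voice activity state of a guest end according to the guest audio signal, the first audio signal, and the processed first audio signal;
    a second echo cancellation module, configured to perform, according to the different voice activity states and the first audio signal, echo cancellation on the first audio signal in a mixed audio signal to obtain a processed mixed audio signal, the mixed audio signal being a signal composed of the first audio signal and a host audio signal as collected by a microphone of the host end;
    a second audio signal synthesis module, configured to synthesize the processed first audio signal and the processed mixed audio signal and push the result to the guest end.
  8. The device according to claim 7, wherein the voice activity state detection module is further configured to:
    calculate, from the guest audio signal, the first audio signal, and the processed first audio signal, a guest audio energy, a first audio energy, and a processed first audio energy, respectively;
    when the guest audio energy is less than a first threshold and the ratio of the processed first audio energy to the first audio energy is greater than a second threshold, detect the voice activity state as a mute state;
    when the guest audio energy is greater than the first threshold, or the ratio of the processed first audio energy to the first audio energy is less than the second threshold, detect the voice activity state as a voice state.
  9. The device according to claim 8, wherein the voice activity state detection module is further configured to:
    when the voice activity state is detected as the mute state, perform adaptive filtering on the mixed audio signal with the first audio signal as a reference signal, filtering out the first audio signal in the mixed audio signal.
  10. The device according to claim 8, wherein the voice activity state detection module is further configured to:
    when the voice activity state is detected as the voice state, perform adaptive filtering on the mixed audio signal with the first audio signal as a reference signal to obtain a filtered mixed audio signal;
    perform non-linear processing on the filtered mixed audio signal to eliminate a residual echo signal in the filtered mixed audio signal.
  11. The device according to claim 7, wherein the first echo cancellation module is further configured to:
    perform adaptive filtering on the first audio signal with the guest audio signal as a reference signal to obtain the processed first audio signal.
  12. The device according to claim 7, further comprising a third audio signal synthesis module, configured to:
    synthesize the first audio signal and the processed mixed audio signal and push the result to an audience end.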
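  13. An electronic device, comprising a memory and a processor, wherein:
    the memory is configured to store instructions executable by the processor;
    the processor is configured to execute the instructions to implement the following steps:
    acquiring a first audio signal formed by mixing a guest audio signal with a background audio signal of the host end;
    performing, according to the guest audio signal, echo cancellation on the guest audio signal in the first audio signal to obtain a processed first audio signal;
    detecting a voice activity state of a guest end according to the guest audio signal, the first audio signal, and the processed first audio signal;
    performing, according to the different voice activity states and the first audio signal, echo cancellation on the first audio signal in a mixed audio signal to obtain a processed mixed audio signal, the mixed audio signal being a signal composed of the first audio signal and a host audio signal as collected by a microphone of the host end;
    synthesizing the processed first audio signal and the processed mixed audio signal and pushing the result to the guest end.
  14. The device according to claim 13, wherein detecting the voice activity state of the guest end according to the guest audio signal, the first audio signal, and the processed first audio signal comprises:
    calculating, from the guest audio signal, the first audio signal, and the processed first audio signal, a guest audio energy, a first audio energy, and a processed first audio energy, respectively;
    when the guest audio energy is less than a first threshold and the ratio of the processed first audio energy to the first audio energy is greater than a second threshold, detecting the voice activity state as a mute state;
    when the guest audio energy is greater than the first threshold, or the ratio of the processed first audio energy to the first audio energy is less than the second threshold, detecting the voice activity state as a voice state.
  15. The device according to claim 14, wherein performing, according to the different voice activity states and the first audio signal, echo cancellation on the first audio signal in the mixed audio signal to obtain the processed mixed audio signal comprises:
    when the voice activity state is detected as the mute state, performing adaptive filtering on the mixed audio signal with the first audio signal as a reference signal, filtering out the first audio signal in the mixed audio signal.
  16. The device according to claim 14, wherein performing, according to the different voice activity states and the first audio signal, echo cancellation on the first audio signal in the mixed audio signal to obtain the processed mixed audio signal comprises:
    when the voice activity state is detected as the voice state, performing adaptive filtering on the mixed audio signal with the first audio signal as a reference signal to obtain a filtered mixed audio signal;
    performing non-linear processing on the filtered mixed audio signal to eliminate a residual echo signal in the filtered mixed audio signal.
  17. The device according to claim 13, wherein performing echo cancellation on the guest audio signal in the first audio signal to obtain the processed first audio signal comprises:
    performing adaptive filtering on the first audio signal with the guest audio signal as a reference signal to obtain the processed first audio signal.
  18. The device according to claim 13, further comprising, after performing, according to the different voice activity states and the first audio signal, echo cancellation on the first audio signal in the mixed audio signal to obtain the processed mixed audio signal:
    synthesizing the first audio signal and the processed mixed audio signal and pushing the result to an audience end.
  19. A computer-readable storage medium carrying a computer instruction program which, when executed by a processor, implements the steps of the method according to any one of claims 1 to 6.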
PCT/CN2020/111873 2019-11-28 2020-08-27 直播音频处理方法、装置、电子设备和存储介质 Ceased WO2021103710A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP20891582.7A EP4068284A4 (en) 2019-11-28 2020-08-27 METHOD AND DEVICE FOR PROCESSING LIVE BROADCASTING AUDIO, AND ELECTRONIC DEVICE AND STORAGE MEDIUM
US17/743,879 US20220270638A1 (en) 2019-11-28 2022-05-13 Method and apparatus for processing live stream audio, and electronic device and storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201911191671.X 2019-11-28
CN201911191671.XA CN110956969B (zh) 2019-11-28 2019-11-28 直播音频处理方法、装置、电子设备和存储介质

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/743,879 Continuation US20220270638A1 (en) 2019-11-28 2022-05-13 Method and apparatus for processing live stream audio, and electronic device and storage medium

Publications (1)

Publication Number Publication Date
WO2021103710A1 true WO2021103710A1 (zh) 2021-06-03

Family

ID=69978826

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/111873 Ceased WO2021103710A1 (zh) 2019-11-28 2020-08-27 直播音频处理方法、装置、电子设备和存储介质

Country Status (4)

Country Link
US (1) US20220270638A1 (zh)
EP (1) EP4068284A4 (zh)
CN (1) CN110956969B (zh)
WO (1) WO2021103710A1 (zh)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110956969B (zh) * 2019-11-28 2022-06-10 北京达佳互联信息技术有限公司 直播音频处理方法、装置、电子设备和存储介质
CN111510738B (zh) * 2020-04-26 2023-08-11 北京字节跳动网络技术有限公司 一种直播中音频的传输方法及装置
CN111583952B (zh) * 2020-05-19 2024-05-07 北京达佳互联信息技术有限公司 音频处理方法、装置、电子设备及存储介质
CN114697742A (zh) * 2020-12-25 2022-07-01 华为技术有限公司 一种视频录制方法及电子设备
CN113225574B (zh) * 2021-04-28 2023-01-20 北京达佳互联信息技术有限公司 信号处理方法及装置
US11621016B2 (en) * 2021-07-31 2023-04-04 Zoom Video Communications, Inc. Intelligent noise suppression for audio signals within a communication platform
US20240054044A1 (en) * 2022-08-15 2024-02-15 Smule, Inc. Remediating characteristics of content captured by a recording application on a user device
KR102516391B1 (ko) * 2022-09-02 2023-04-03 주식회사 액션파워 음성 구간 길이를 고려하여 오디오에서 음성 구간을 검출하는 방법

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2000047697A (ja) * 1998-07-30 2000-02-18 Nec Eng Ltd ノイズキャンセラ
CN101609667A (zh) * 2009-07-22 2009-12-23 福州瑞芯微电子有限公司 Pmp播放器中实现卡拉ok功能的方法
US20110144984A1 (en) * 2006-05-11 2011-06-16 Alon Konchitsky Voice coder with two microphone system and strategic microphone placement to deter obstruction for a digital communication device
CN106531177A (zh) * 2016-12-07 2017-03-22 腾讯科技(深圳)有限公司 一种音频处理的方法、移动终端以及系统
CN109005419A (zh) * 2018-09-05 2018-12-14 北京优酷科技有限公司 一种语音信息的处理方法及客户端
CN109767777A (zh) * 2019-01-31 2019-05-17 迅雷计算机(深圳)有限公司 一种直播软件的混音方法
CN110138650A (zh) * 2019-05-14 2019-08-16 北京达佳互联信息技术有限公司 即时通讯的音质优化方法、装置及设备
CN110956969A (zh) * 2019-11-28 2020-04-03 北京达佳互联信息技术有限公司 直播音频处理方法、装置、电子设备和存储介质

Family Cites Families (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6148078A (en) * 1998-01-09 2000-11-14 Ericsson Inc. Methods and apparatus for controlling echo suppression in communications systems
US7319748B2 (en) * 2003-01-08 2008-01-15 Nxp B.V. Device and method for suppressing echo in telephones
NZ553385A (en) * 2004-08-09 2010-06-25 Nielsen Co Us Llc Methods and apparatus to monitor audio/visual content from various sources
RS49875B (sr) * 2006-10-04 2008-08-07 Micronasnit, Sistem i postupak za slobodnu govornu komunikaciju pomoću mikrofonskog niza
CN101562669B (zh) * 2009-03-11 2012-10-03 上海朗谷电子科技有限公司 自适应全双工全频段回声消除的方法
US8582754B2 (en) * 2011-03-21 2013-11-12 Broadcom Corporation Method and system for echo cancellation in presence of streamed audio
US9924252B2 (en) * 2013-03-13 2018-03-20 Polycom, Inc. Loudspeaker arrangement with on-screen voice positioning for telepresence system
US9083782B2 (en) * 2013-05-08 2015-07-14 Blackberry Limited Dual beamform audio echo reduction
CN106297816B (zh) * 2015-05-20 2019-12-13 广州质音通讯技术有限公司 一种回声消除的非线性处理方法和装置及电子设备
CN107124661B (zh) * 2017-04-07 2020-05-19 广州市百果园网络科技有限公司 直播频道中的通信方法、装置及系统
CN107886965B (zh) * 2017-11-28 2021-04-20 游密科技(深圳)有限公司 游戏背景音的回声消除方法
CN107799123B (zh) * 2017-12-14 2021-07-23 南京地平线机器人技术有限公司 控制回声消除器的方法和具有回声消除功能的装置
US10986437B1 (en) * 2018-06-21 2021-04-20 Amazon Technologies, Inc. Multi-plane microphone array
CN111415653B (zh) * 2018-12-18 2023-08-01 百度在线网络技术(北京)有限公司 用于识别语音的方法和装置

Also Published As

Publication number Publication date
EP4068284A4 (en) 2022-12-28
EP4068284A1 (en) 2022-10-05
CN110956969B (zh) 2022-06-10
CN110956969A (zh) 2020-04-03
US20220270638A1 (en) 2022-08-25

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20891582

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 2020891582

Country of ref document: EP

Effective date: 20220628