WO2022022139A1 - Multi-sound-area-based voice detection method, related device, and storage medium - Google Patents

Multi-sound-area-based voice detection method, related device, and storage medium

Info

Publication number
WO2022022139A1
WO2022022139A1 PCT/CN2021/100472 CN2021100472W WO2022022139A1 WO 2022022139 A1 WO2022022139 A1 WO 2022022139A1 CN 2021100472 W CN2021100472 W CN 2021100472W WO 2022022139 A1 WO2022022139 A1 WO 2022022139A1
Authority
WO
WIPO (PCT)
Prior art keywords
sound
target detection
voice
output signal
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/CN2021/100472
Other languages
English (en)
French (fr)
Inventor
郑脊萌
陈联武
黎韦伟
段志毅
于蒙
苏丹
姜开宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to EP21850172.4A (patent EP4123646B1)
Publication of WO2022022139A1
Priority to US17/944,067 (patent US12051441B2)
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 Detection of presence or absence of voice signals
    • G10L25/84 Detection of presence or absence of voice signals for discriminating voice from noise
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/20 Analysis of motion
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/18 Speech classification or search using natural language modelling
    • G10L15/1815 Semantic context, e.g. disambiguation of the recognition hypotheses based on word meaning
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/22 Interactive procedures; Man-machine interfaces
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272 Voice signal separating
    • G10L21/028 Voice signal separating using properties of sound source
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/21 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being power information
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/30 Subject of image; Context of image processing
    • G06T2207/30196 Human being; Person
    • G06T2207/30201 Face
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L2021/02087 Noise filtering the noise being separate speech, e.g. cocktail party
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L2021/02161 Number of inputs available containing the signal or the noise to be suppressed
    • G10L2021/02166 Microphone arrays; Beamforming

Definitions

  • the present application relates to the field of artificial intelligence, in particular to speech detection technology.
  • VAD: voice activity detection
  • the pre-processing system generally adopts the method of azimuth angle estimation combined with signal strength estimation, or the method of azimuth angle estimation combined with spatial spectrum estimation.
  • the speaker with the strongest signal energy (i.e., the signal energy arriving at the microphone array) and that speaker's azimuth angle are estimated and used as the main speaker and its azimuth angle.
  • the main speaker may be farther from the microphone array than the interfering speaker.
  • although the volume of the main speaker may be greater than that of the interfering speaker, the propagation loss of its speech signal in space is greater, so the signal strength reaching the microphone array may be smaller, resulting in a poorer effect in subsequent speech processing.
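The reasoning above can be illustrated with a small numeric sketch. It assumes free-field spherical spreading (level drops by 20·log10 of the distance ratio), which is a simplification that ignores room reflections; the function name, source levels, and distances are hypothetical values chosen only for illustration and do not come from the application.

```python
import math


def level_at_array_db(source_level_db: float, distance_m: float,
                      ref_distance_m: float = 1.0) -> float:
    """Level reaching the microphone array under free-field spherical spreading.

    Assumption: the source level is specified at ref_distance_m and the level
    falls by 20*log10(distance / ref_distance), i.e. about 6 dB per doubling.
    """
    if distance_m <= 0 or ref_distance_m <= 0:
        raise ValueError("distances must be positive")
    return source_level_db - 20.0 * math.log10(distance_m / ref_distance_m)


# Hypothetical scene: the main speaker is louder but farther away.
main_speaker = level_at_array_db(source_level_db=70.0, distance_m=4.0)
interferer = level_at_array_db(source_level_db=65.0, distance_m=1.0)

# A strongest-energy rule would pick the interferer here, not the main speaker.
strongest = "interferer" if interferer > main_speaker else "main speaker"
```

Under these assumed numbers the louder main speaker arrives roughly 7 dB weaker than the interferer, which is the failure case of selecting the main speaker purely by signal energy.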
  • Embodiments of the present application provide a multi-sound-area-based voice detection method, a related device, and a storage medium.
  • the present application provides a multi-sound-area-based voice detection method.
  • the voice detection method is executed by computer equipment, including:
  • the sound area information includes sound area identification, pointing angle and user information
  • the sound area identification is used to identify the sound area
  • the pointing angle is used to indicate the central angle of the sound area
  • the user information is used to indicate whether a user is present in the sound area
  • N is an integer greater than 1;
  • a control signal corresponding to the target detection sound area is generated, wherein the control signal is used to perform suppression processing or retention processing on the voice input signal, and the control signal has a one-to-one correspondence with the sound area;
  • the voice input signal corresponding to the target detection sound region is processed to obtain the voice output signal corresponding to the target detection sound region, wherein the control signal, the voice input signal and the voice output signal have one-to-one correspondence;
  • the speech detection result of the target detection sound area is generated.
  • a voice detection device, deployed on computer equipment, including:
  • the acquisition module is used to obtain the sound region information corresponding to each sound region in the N sound regions, wherein the sound region information includes a sound region identification, a pointing angle, and user information; the sound region identification is used to identify the sound region, and the pointing angle is used to indicate the central angle of the sound region.
  • the user information is used to indicate whether a user is present in the sound region, and N is an integer greater than 1;
  • the generation module is used to use each sound zone as a target detection sound zone and generate a control signal corresponding to the target detection sound zone according to the sound zone information corresponding to the target detection sound zone, wherein the control signal is used to perform suppression processing or retention processing on the voice input signal, and the control signal has a one-to-one correspondence with the sound zone;
  • the processing module is used to process the voice input signal corresponding to the target detection sound area by using the control signal corresponding to the target detection sound area, to obtain the voice output signal corresponding to the target detection sound area, wherein the control signal, the voice input signal, and the voice output signal have a one-to-one correspondence;
  • the generation module is further configured to generate the speech detection result of the target detection sound region according to the speech output signal corresponding to the target detection sound region.
  • a computer device including: a memory, a transceiver, a processor, and a bus system;
  • the memory is used to store the program
  • the processor is configured to execute the program in the memory, and the processor is configured to execute the methods described in the above aspects according to the instructions in the program code;
  • the bus system is used to connect the memory and the processor so that the memory and the processor can communicate.
  • Another aspect of the present application provides a computer-readable storage medium having instructions stored in the computer-readable storage medium, which, when executed on a computer, cause the computer to perform the methods described in the above aspects.
  • Another aspect of the present application provides a computer program product or computer program, the computer program product or computer program comprising computer instructions stored in a computer-readable storage medium.
  • the processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the methods provided by various optional implementations in the above aspects.
  • FIG. 1 is a schematic diagram of an environment based on a multi-user conference scenario in an embodiment of the present application.
  • FIG. 2 is a schematic diagram of an embodiment of a speech detection system in an embodiment of the present application.
  • FIG. 3 is a schematic diagram of an embodiment of a multi-sound-area-based speech detection method in an embodiment of the present application.
  • FIG. 4 is a schematic diagram of a multi-sound-area division manner in an embodiment of the present application.
  • FIG. 5 is a schematic structural diagram of a multi-channel sound pickup system in an embodiment of the present application.
  • FIG. 6 is another schematic structural diagram of a multi-channel sound pickup system in an embodiment of the present application.
  • FIG. 7 is a schematic diagram of an interface for implementing a call based on the multi-sound-area voice detection method in an embodiment of the present application.
  • FIG. 8 is another schematic structural diagram of a multi-channel sound pickup system in an embodiment of the present application.
  • FIG. 9 is a schematic diagram of an interface for realizing a dialogue response based on the multi-sound-area voice detection method in an embodiment of the present application.
  • FIG. 10 is another schematic structural diagram of a multi-channel sound pickup system in an embodiment of the present application.
  • FIG. 11 is a schematic diagram of an interface for implementing text recording based on the multi-sound-area speech detection method in an embodiment of the present application.
  • FIG. 12 is a schematic diagram of an embodiment of a speech detection apparatus in an embodiment of the present application.
  • FIG. 13 is a schematic structural diagram of a computer device in an embodiment of the present application.
  • the embodiments of the present application provide a multi-sound-area-based voice detection method, a related device, and a storage medium, which can retain or suppress voice signals in different directions through control signals in a multi-sound-source scenario, so that the voice of each user can be separated and enhanced in real time, thereby improving the accuracy of voice detection and the effect of voice processing.
  • the speech detection method based on the multi-sound zone can perform speech recognition and semantic recognition in the case of multiple users speaking at the same time, and then decide which user to respond to.
  • in the far-field recognition scenario, it is common for multiple people to speak.
  • the multi-sound-area-based speech detection method provided by the present application can solve the problem of signal interference in the above scenarios. For example, in the wake-up-free scenario of smart speaker products, multiple users in the surrounding environment often speak at the same time. In this case, the method provided in this application is used to first determine which user should be responded to; the content and intent of that user's voice are then recognized, and the smart speaker product determines whether to respond to the user's voice command according to the recognition result.
  • FIG. 1 is a schematic diagram of an environment based on a multi-user conference scenario in an embodiment of the present application.
  • the conference system can include a screen, a camera, and a microphone array.
  • the microphone array is used to collect the voices of 6 users, and the camera is used to capture real-time images of the 6 users, related information, etc.
  • FIG. 2 is a schematic diagram of an embodiment of the voice detection system in the embodiment of the present application, as shown in the figure.
  • the voice detection system includes a server and a terminal device.
  • the server involved in this application may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, Content Delivery Network (CDN), and big data and artificial intelligence platforms.
  • the terminal device may be a smart TV, a smart speaker, a smart phone, a tablet computer, a notebook computer, a palmtop computer, a personal computer, etc., but is not limited thereto.
  • communication between the terminal device and the server can be performed through a wireless network, a wired network or a removable storage medium.
  • the wireless network uses standard communication technologies and/or protocols.
  • the wireless network is usually the Internet, but can also be any network, including but not limited to Bluetooth, a Local Area Network (LAN), a Metropolitan Area Network (MAN), a Wide Area Network (WAN), a mobile network, a private network, a virtual private network, or any combination thereof.
  • custom or dedicated data communication techniques may be used in place of or in addition to the data communication techniques described above.
  • the removable storage medium may be a Universal Serial Bus (USB) flash drive, a portable hard disk, or another removable storage medium, which is not limited in this application.
  • the voice signal and other sounds in the environment can be picked up by the microphone array equipped on the terminal device, and the microphone array then transmits the collected digital signals to the voice signal pre-processing module, which performs target speech extraction, enhancement, VAD, speaker detection, main speaker detection, and so on.
  • the specific processing content is flexibly determined according to the scene and functional requirements.
  • the voice signal enhanced by the preprocessing module can be sent to the server, and the enhanced voice signal can be processed through the voice recognition module or voice call module deployed in the server.
  • An embodiment of the multi-sound-area-based voice detection method in the embodiment of the present application includes:
  • the sound area information includes sound area identification, pointing angle and user information
  • the sound area identification is used to identify the sound area
  • the pointing angle is used to indicate the central angle of the sound area
  • the user information is used to indicate whether a user is present in the sound area
  • N is an integer greater than 1.
  • the space within the visible range can be divided into N sound regions.
  • the space within the viewing range may also be non-uniformly divided, which is not limited here.
  • Each sound area corresponds to a sound source. If there are two or more users in a certain sound area, these users can be treated as the same person. Therefore, in actual partitioning, the sound areas can be divided finely enough.
  • i represents the i-th sound area
  • θ_i represents the pointing angle corresponding to the i-th sound area
  • λ_i represents the user information corresponding to the i-th sound area
  • the user information is used to indicate whether a user is present in the sound area; for example, if it is detected that there is no user in the i-th sound area, λ_i can be set to -1; if it is detected that there is a user in the i-th sound area, λ_i can be set to 1.
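As a minimal sketch of the sound area information described above, the following assumes a uniform division of a 360-degree space into N sound areas, with the pointing angle taken as the central angle of each area. The class name `SoundAreaInfo`, the function `divide_uniformly`, and the field names are hypothetical; the application also allows non-uniform division, which this sketch does not cover.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class SoundAreaInfo:
    area_id: int           # sound area identification (index i, starting at 1)
    pointing_angle: float  # central angle of the sound area, in degrees
    user_info: int         # 1 if a user is detected, -1 if no user is detected


def divide_uniformly(n: int, user_present: Sequence[bool]) -> List[SoundAreaInfo]:
    """Build sound area information for n equal-width areas covering 360 degrees."""
    if n <= 1:
        raise ValueError("N must be an integer greater than 1")
    if len(user_present) != n:
        raise ValueError("need one presence flag per sound area")
    width = 360.0 / n
    return [
        SoundAreaInfo(
            area_id=i,
            pointing_angle=(i - 0.5) * width,  # centre of the i-th sector
            user_info=1 if user_present[i - 1] else -1,
        )
        for i in range(1, n + 1)
    ]


areas = divide_uniformly(4, [True, False, False, True])
```

With four areas the assumed central angles are 45, 135, 225, and 315 degrees; a finer division simply uses a larger N.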
  • the methods provided in the embodiments of the present application may be executed by computer equipment, and specifically may be executed by a voice detection apparatus deployed on the computer equipment.
  • the computer equipment may be a terminal device or a server; that is, the voice detection apparatus may be deployed on a terminal device or on a server.
  • the voice detection device can also be deployed in a voice detection system, that is, the voice detection device can implement the method provided in this application based on a multi-channel sound pickup system.
  • step 102: each sound zone is used as the target detection sound zone, and the control signal corresponding to the target detection sound zone is generated according to the sound zone information corresponding to the target detection sound zone, wherein the control signal is used to perform suppression processing or retention processing on the voice input signal, and the control signal has a one-to-one correspondence with the sound zone.
  • each sound area can be used as a target detection sound area, and the control signal corresponding to the target detection sound area is generated according to the sound area information corresponding to that area, so that a control signal is generated for each sound area.
  • the control signal can suppress or retain the voice input signal obtained through the microphone array. If there is no user in the i-th sound area, the voice input signal in this sound area is noise (not a normal human voice), so the control signal generated for this sound area can perform suppression processing on the voice input signal. If it is detected that a user exists in the i-th sound area and the voice input signal in this sound area is a normal human voice, the control signal generated for this sound area can retain the voice input signal.
  • whether there is a user in a sound area can be detected by using computer vision (CV) technology, or estimated by using a spatial spectrum estimation method.
  • the control signal corresponding to the target detection sound area is used to process the speech input signal corresponding to the target detection sound area to obtain the speech output signal corresponding to the target detection sound area, wherein the control signal, the speech input signal, and the speech output signal have a one-to-one correspondence.
  • each sound area is still used as the target detection sound area, and the control signal corresponding to the target detection sound area is used to process the speech input signal corresponding to that area to obtain the corresponding speech output signal; that is, the control signal corresponding to each sound area is used to perform suppression processing or retention processing on the speech input signal in that sound area, thereby outputting the speech output signal corresponding to each sound area. For example, if there is no user in the i-th sound area, the control signal of the i-th sound area may be "0", that is, the speech input signal corresponding to the sound area is suppressed.
  • if there is a user in the i-th sound area, the control signal of the i-th sound area can be "1", that is, the speech input signal corresponding to the sound area is retained.
  • the speech input signal corresponding to the sound area is extracted, separated and enhanced.
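The one-to-one relationship between control signal, speech input signal, and speech output signal can be sketched as a simple per-area gate. This is only an illustration: it assumes the input has already been split into one sample sequence per sound area, and it models suppression as multiplication by 0 and retention as multiplication by 1. The function names are hypothetical, and a real signal separator would apply the control signal inside a beamforming or separation stage rather than as a plain gate.

```python
from typing import List, Sequence


def control_signal(user_present: bool) -> int:
    """Return 1 (retain) if a user is present in the sound area, else 0 (suppress)."""
    return 1 if user_present else 0


def process_areas(inputs: Sequence[Sequence[float]],
                  user_present: Sequence[bool]) -> List[List[float]]:
    """Apply each sound area's control signal to that area's speech input signal."""
    if len(inputs) != len(user_present):
        raise ValueError("need exactly one control signal per sound area")
    outputs = []
    for samples, present in zip(inputs, user_present):
        gate = control_signal(present)
        outputs.append([gate * s for s in samples])
    return outputs


# Two sound areas: a user is present in the first, none in the second.
speech_outputs = process_areas([[0.1, -0.2], [0.3, 0.4]], [True, False])
```

Because every sound area has its own control signal, the areas can be processed independently, which is what allows several directions to be handled in parallel.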
  • the voice detection device can also perform post-processing on the voice output signal corresponding to each sound area; that is, each sound area is used as the target detection sound area, and the speech detection result of the target detection sound area is generated according to the speech output signal corresponding to that area.
  • cross-channel post-processing and noise reduction post-processing are performed on the speech output signal corresponding to the target detection sound area, the post-processed speech output signal is then detected to generate the speech detection result of each sound area, and it is then determined whether to respond to speech originating in that sound area.
  • the voice detection device can detect whether each sound zone meets the human voice matching condition.
  • assuming that the i-th sound area meets the human voice matching condition, the speech detection result of the i-th sound area can be "a user exists in the i-th sound area". Assuming that the i-th sound area does not meet the human voice matching condition, the speech detection result of the i-th sound area is "no user in the i-th sound area".
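The text above does not define the human voice matching condition, so the sketch below substitutes a deliberately simple stand-in: the mean power of the sound area's speech output signal exceeding a threshold. Both the power-based condition and the threshold value are assumptions made for illustration, and the function names are hypothetical.

```python
from typing import Sequence


def mean_power(samples: Sequence[float]) -> float:
    """Average squared amplitude of a signal (0.0 for an empty signal)."""
    if not samples:
        return 0.0
    return sum(s * s for s in samples) / len(samples)


def detection_result(area_id: int, output_signal: Sequence[float],
                     power_threshold: float = 1e-3) -> str:
    """Map a sound area's speech output signal to a speech detection result.

    Assumption: the human voice matching condition is approximated by
    mean power > power_threshold; this is a placeholder condition.
    """
    if mean_power(output_signal) > power_threshold:
        return f"a user exists in sound area {area_id}"
    return f"no user in sound area {area_id}"


result_1 = detection_result(1, [0.5, -0.5])  # retained speech
result_2 = detection_result(2, [0.0, 0.0])   # suppressed area
```

A suppressed sound area yields an all-zero output and therefore fails the stand-in condition, while a retained area with speech energy passes it.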
  • FIG. 5 is a schematic structural diagram of the multi-channel sound pickup system in the embodiment of the present application.
  • the microphone array equipped on a terminal device can pick up the audio signal corresponding to each sound area, and the audio signal includes the voice input signal and the noise signal.
  • the control signal corresponding to each sound area is generated by the signal separator, and the control signal of each sound area is used to suppress or retain the voice input signal at the corresponding pointing angle; cross-channel post-processing is then performed on each voice output signal to obtain the target speech output signal corresponding to each sound area.
  • the speech detection result is determined, that is, the speech detection result of each sound region in the N sound regions is obtained.
  • a method for detecting speech based on multiple sound areas is provided.
  • the sound area information corresponding to each sound area in the N sound areas is obtained, and the sound area information includes a sound area identifier, a pointing angle, and user information. Each sound area can therefore be used as the target detection sound area, the control signal corresponding to the target detection sound area is generated according to the sound area information corresponding to the target detection sound area, and the control signal is then used to process the corresponding speech input signal.
  • the speech signals from different directions are processed in parallel based on multiple sound areas, and in a multi-sound-source scenario the control signals can be used to retain or suppress the speech signals in different directions, so that the voice of each user can be separated and enhanced in real time, thereby improving the accuracy of voice detection, which helps improve the effect of subsequent voice processing.
  • acquiring the sound area information corresponding to each sound area in the N sound areas may include the following steps:
  • Each sound zone is used as the target detection sound zone respectively, and the user information corresponding to the target detection sound zone is determined according to the user detection result corresponding to the target detection sound zone;
  • the sound region information corresponding to the target detection sound region is generated according to the user information corresponding to the target detection sound region, the lip motion information corresponding to the target detection sound region, the sound region identification corresponding to the target detection sound region, and the pointing angle corresponding to the target detection sound region.
  • a method for acquiring sound area information based on CV technology is introduced.
  • the CV technology can be implemented by using a neural network.
  • a corresponding camera needs to be configured to capture user images.
  • a wide-angle camera can be used for coverage; for a 360-degree space, 2 to 3 wide-angle cameras can be stitched together for full coverage.
  • each person in the space can be detected and numbered, and related information can also be provided, such as the user's identity information, face azimuth, lip motion information, face orientation, and face distance, etc.
  • Each sound region in the N sound regions is detected, and a user detection result corresponding to each sound region is obtained respectively.
  • the user detection result includes the user's identity information and lip motion information as an example for description, but this should not be construed as a limitation of this application.
  • the user detection result includes user information and lip motion information, wherein the user information includes whether there is a user and, if there is a user, whether the user's identity information can be extracted. For example, there is a user in the second sound area, and after identification it is determined that the user is "Xiao Li", with the corresponding ID "01011". For another example, there is no user in the fifth sound area, so no identification is required.
  • the lip motion information indicates whether the user's lips are moving. Generally, when a person speaks, the lips move. Therefore, whether the user is speaking can be further determined based on the lip motion information.
  • θ_i represents the pointing angle of the i-th sound area
  • λ_i represents the user information of the i-th sound area
  • L_i represents the lip motion information of the i-th sound area.
  • a method for acquiring sound area information based on CV technology is provided.
  • more sound area information can be detected by using CV technology, which is equivalent to "seeing" the relevant information of the users in each sound area, such as whether there is a user, the user information of that user, and whether the user has lip movement. This integrates and uses multi-modal information, and the information in the visual dimension can further improve the accuracy of speech detection. It also provides a feasible path for subsequent processing in video-related solutions.
  • the user information corresponding to the target detection sound region is determined according to the user detection result corresponding to the target detection sound region, including the following steps:
  • the first identity identifier is determined as user information
  • the second identity identifier is determined as user information
  • the third identity identifier is determined as user information
  • the lip motion information corresponding to the target detection sound region is determined according to the user detection result corresponding to the target detection sound region, which specifically includes the following steps:
  • the first motion identifier is determined as lip motion information
  • the second motion identifier is determined as lip motion information
  • the third motion identifier is determined as lip motion information.
  • Any sound region can be used as the target detection sound region. Assuming that the sound region is the i-th sound region, the user detection result of the i-th sound region shows whether a user is present in the i-th sound region and, if so, whether the user's identity information can be obtained.
  • The user information corresponding to the i-th sound region is denoted λ_i, that is, the user information in the direction with the pointing angle θ_i.
  • If a user is present in the direction of the pointing angle θ_i and the user's identity can be determined, λ_i is the first identity identifier of that user, such as "5". If no user is present in the direction of the pointing angle θ_i, λ_i can be set to a special value, that is, the second identity identifier, such as "-1".
  • If a user is present in the direction of the pointing angle θ_i but the user's identity cannot be determined, λ_i can be set to another special value, that is, the third identity identifier, such as "0", thereby informing the subsequent processing module that a user is present in this direction although the identity is unknown.
  • Any sound region can be used as the target detection sound region. Assuming that the sound region is the i-th sound region, the user detection result of the i-th sound region shows whether a user is present in the i-th sound region and, if so, whether the user's lips are moving.
  • The camera is generally a wide-angle camera.
  • A CV algorithm is used to detect all people and faces within the viewing angle. The partial image of each face can also be extracted, and a CV algorithm can be used to detect whether the lips on that face are moving.
  • The lip motion information corresponding to the i-th sound region is denoted L_i, that is, the lip motion information in the direction of the pointing angle θ_i. If a user is present in the direction of the pointing angle θ_i and it can be determined that the user's lips are moving, L_i can be set to the first motion identifier, such as "0". If a user is present in the direction of the pointing angle θ_i but the user has no lip motion, L_i can be set to the second motion identifier, such as "1". If no user is present in the direction of the pointing angle θ_i, L_i can be set to a special value, that is, the third motion identifier, such as "-1".
  • In this embodiment, a specific method for extracting lip motion information and user information based on CV technology is provided.
  • In this way, the user information and lip motion information can be analyzed from multiple aspects, which increases the feasibility of recognition as much as possible.
  • The information contained in each sound region is analyzed in multiple dimensions, which improves the operability of the solution.
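  • The encoding described above can be sketched in a few lines of Python. This is only an illustration: the `UserDetection` structure and the function name are hypothetical, the numeric values are the example values quoted in the text, and reading the third identity identifier as "user present but not identified" is an assumption drawn from the surrounding description.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Example values quoted in the text; a real system may choose others.
NO_USER_ID = -1        # second identity identifier
UNKNOWN_USER_ID = 0    # third identity identifier (assumed: user present, identity unknown)
LIPS_MOVING = 0        # first motion identifier
LIPS_STILL = 1         # second motion identifier
LIPS_NO_USER = -1      # third motion identifier


@dataclass
class UserDetection:
    """Hypothetical per-region output of the CV stage."""
    user_present: bool
    user_id: Optional[int] = None   # None when the identity is not recognized
    lips_moving: bool = False


def encode_region(det: UserDetection) -> Tuple[int, int]:
    """Return (user information, lip motion information) for one sound region."""
    if not det.user_present:
        return NO_USER_ID, LIPS_NO_USER
    user_info = det.user_id if det.user_id is not None else UNKNOWN_USER_ID
    lip_info = LIPS_MOVING if det.lips_moving else LIPS_STILL
    return user_info, lip_info
```

  • For instance, a recognized user "5" whose lips are moving would be encoded as the pair (5, 0) under these example values.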
  • The control signal corresponding to the target detection sound region is generated according to the sound region information corresponding to the target detection sound region, which specifically includes the following steps:
  • if the user information indicates that no user is present in the target detection sound region, a first control signal is generated, wherein the first control signal belongs to the control signal and is used to suppress the voice input signal;
  • if the user information indicates that a user is present in the target detection sound region, a second control signal is generated, wherein the second control signal belongs to the control signal and is used to retain the voice input signal.
  • In this embodiment, a method for generating a control signal without adopting CV technology is introduced.
  • When CV technology is not adopted, the user's identity cannot be recognized and the user's lip motion information cannot be obtained.
  • Any sound region can be used as the target detection sound region. Assume that the sound region is the i-th sound region, that is, the sound region information of the i-th sound region is {(i, θ_i, λ_i)}, where the user information λ_i indicates either that no user is present in the direction of the pointing angle θ_i or that a user is present in that direction. If necessary, the user's identity information can be further determined by voiceprint recognition, which is not described in detail here.
  • In the process of generating the control signal, if it is detected that no user is present in the i-th sound region, the signal separator can learn and suppress all signals at the pointing angle θ_i; that is, the signal separator generates the first control signal, which is used to suppress all signals at the pointing angle θ_i. If it is detected that a user is present in the i-th sound region, the signal separator can learn and retain the signal at the pointing angle θ_i; that is, the signal separator generates the second control signal, which is used to retain the signal at the pointing angle θ_i.
  • In this embodiment, a method for generating a control signal without using CV technology is provided.
  • In this way, the control signal can be generated using audio data only.
  • On the one hand, the flexibility of the solution is increased; on the other hand, the control signal can be generated from a smaller amount of information, which saves computing resources, improves the efficiency of control signal generation, and saves power for the device.
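  • As a minimal sketch of the audio-only rule above, the control decision reduces to a lookup on the user information of each region. The `Control` enum and function names are hypothetical, and "-1" is the example "no user" value from the text.

```python
from enum import Enum
from typing import Dict, Iterable, Tuple


class Control(Enum):
    SUPPRESS = 1   # first control signal: suppress the voice input signal
    RETAIN = 2     # second control signal: retain the voice input signal


NO_USER_ID = -1    # example second identity identifier from the text


def control_without_cv(user_info: int) -> Control:
    """Suppress a region with no user, retain a region with a user."""
    return Control.SUPPRESS if user_info == NO_USER_ID else Control.RETAIN


def controls_for_regions(regions: Iterable[Tuple[int, float, int]]) -> Dict[int, Control]:
    """regions holds (region index, pointing angle in degrees, user information)."""
    return {i: control_without_cv(user_info) for i, _angle, user_info in regions}
```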
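  • The control signal corresponding to the target detection sound region is generated according to the sound region information corresponding to the target detection sound region, which specifically includes the following steps:
  • if the user information indicates that no user is present in the target detection sound region, a first control signal is generated, wherein the first control signal belongs to the control signal and is used to suppress the voice input signal;
  • if the user information indicates that a user is present in the target detection sound region and the lip motion information indicates that the user has no lip motion, the first control signal is generated;
  • if the user information indicates that a user is present in the target detection sound region and the lip motion information indicates that the user has lip motion, a second control signal is generated, wherein the second control signal belongs to the control signal and is used to retain the voice input signal;
  • if the user information indicates that a user is present in the target detection sound region and the lip motion of the user cannot be determined, the first control signal or the second control signal is generated according to the original audio signal.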
  • In this embodiment, a method for generating a control signal when CV technology is adopted is introduced.
  • When CV technology is adopted, the user's identity can be recognized and the user's lip motion information can be obtained.
  • Whether a user is present in the current sound region can be estimated using CV technology alone, or determined jointly using CV technology and spatial spectrum estimation; in either case, the sound region information of the N sound regions is obtained.
  • Any sound region can be used as the target detection sound region. Assume that the sound region is the i-th sound region, that is, the sound region information of the i-th sound region is {(i, θ_i, λ_i, L_i)}, where the user information λ_i may be the first identity identifier, the second identity identifier, or the third identity identifier, and the lip motion information L_i may be the first motion identifier, the second motion identifier, or the third motion identifier.
  • In the process of generating the control signal, if it is detected that no user is present in the i-th sound region, the signal separator can learn and suppress all signals at the pointing angle θ_i; that is, the signal separator generates the first control signal, which is used to suppress all signals at the pointing angle θ_i. If it is detected that a user is present in the i-th sound region, it is necessary to further determine whether the user has lip motion.
  • If a user is present in the i-th sound region but has no lip motion, the signal separator can learn and suppress all signals at the pointing angle θ_i; that is, the signal separator generates the first control signal, which is used to suppress all signals at the pointing angle θ_i.
  • If a user is present in the i-th sound region and has lip motion, the signal separator can learn and retain the signal at the pointing angle θ_i; that is, the signal separator generates the second control signal, which is used to retain the signal at the pointing angle θ_i.
  • If a user is present in the i-th sound region but the lip motion cannot be determined, the decision depends on the original audio signal: either the signal separator learns and retains the signal at the pointing angle θ_i, that is, it generates the second control signal, or it learns and suppresses all signals at the pointing angle θ_i, that is, it generates the first control signal.
  • In this embodiment, a method for generating a control signal when CV technology is adopted is provided.
  • In this way, both audio data and image data serve as the basis for generating the control signal. On the one hand, the flexibility of the solution is increased.
  • On the other hand, a control signal generated from more information is more accurate, which improves the accuracy of voice detection.
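  • The CV-assisted decision can be sketched as a small decision function. This is an illustration under stated assumptions rather than the method of this application: the names are hypothetical, `None` is used here to represent lip motion that could not be determined, and `audio_active` stands for whatever audio-only evidence (for example, a spatial spectrum estimate) the system derives from the original audio signal.

```python
from enum import Enum
from typing import Optional


class Control(Enum):
    SUPPRESS = 1   # first control signal
    RETAIN = 2     # second control signal


NO_USER_ID = -1    # example second identity identifier from the text
LIPS_MOVING = 0    # example first motion identifier from the text
LIPS_STILL = 1     # example second motion identifier from the text


def control_with_cv(user_info: int,
                    lip_motion: Optional[int],
                    audio_active: bool = False) -> Control:
    """Decide suppress/retain for one sound region from CV and audio cues."""
    if user_info == NO_USER_ID:
        return Control.SUPPRESS        # nobody in this direction
    if lip_motion == LIPS_MOVING:
        return Control.RETAIN          # user present and speaking
    if lip_motion == LIPS_STILL:
        return Control.SUPPRESS        # user present but silent
    # Assumption: lip motion unavailable, so fall back to the audio-only cue.
    return Control.RETAIN if audio_active else Control.SUPPRESS
```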
  • The control signal corresponding to the target detection sound region is generated according to the sound region information corresponding to the target detection sound region, which specifically includes the following step:
  • a preset algorithm is used to generate the control signal corresponding to the target detection sound region, wherein the preset algorithm is an adaptive beamforming algorithm, a blind source separation algorithm, or a speech separation algorithm based on deep learning.
  • The voice input signal corresponding to the target detection sound region is processed to obtain the voice output signal corresponding to the target detection sound region, which specifically includes the following steps:
  • if the preset algorithm is the adaptive beamforming algorithm, the adaptive beamforming algorithm is used to process the voice input signal corresponding to the target detection sound region according to the control signal, to obtain the voice output signal corresponding to the target detection sound region;
  • if the preset algorithm is the blind source separation algorithm, the blind source separation algorithm is used to process the voice input signal corresponding to the target detection sound region according to the control signal, to obtain the voice output signal corresponding to the target detection sound region;
  • if the preset algorithm is the speech separation algorithm based on deep learning, that algorithm is used to process the voice input signal corresponding to the target detection sound region according to the control signal, to obtain the voice output signal corresponding to the target detection sound region.
  • In this embodiment, a method for implementing signal separation based on a control signal is introduced.
  • The preset algorithm used to generate the control signal is the same as the algorithm used for signal separation in practical applications.
  • This application provides three preset algorithms: an adaptive beamforming algorithm, a blind source separation algorithm, and a speech separation algorithm based on deep learning. Signal separation is described below for each of these three preset algorithms.
  • Adaptive beamforming, also known as adaptive spatial filtering, performs spatial filtering by weighting each array element, with the weighting factors adjusted adaptively, to enhance useful signals and suppress interference. Under ideal conditions, adaptive beamforming can effectively suppress interference while retaining the desired signal, thereby maximizing the output signal-to-interference-plus-noise ratio of the array.
  • Blind source separation (BSS) estimates the source signals from the observed mixed signals alone, without knowing the source signals or the mixing parameters.
  • Independent component analysis (ICA) is a technique developed to solve the blind signal separation problem. Most blind signal separation uses independent component analysis: following the principle of statistical independence, an optimization algorithm decomposes the received mixed signal into several independent components, which serve as approximate estimates of the source signals.
  • Speech separation based on deep learning learns the characteristics of speech, speakers, and noise from training data in order to separate speech.
  • The models that may be used include multilayer perceptrons, deep neural networks (DNN), convolutional neural networks (CNN), long short-term memory (LSTM) networks, and generative adversarial networks (GAN), which are not limited here.
  • When a GAN is used for speech enhancement, the generator is usually built entirely from convolutional layers in order to reduce the number of training parameters and shorten the training time. The discriminator tells the generator how authentic the generated data is and helps the generator fine-tune toward generating clean sound.
  • In this embodiment, a method for realizing signal separation based on a control signal is provided.
  • If the adaptive beamforming algorithm is used to generate the control signal, the adaptive beamforming algorithm is also used when the signal is separated.
  • If the blind source separation algorithm is used to generate the control signal, the blind source separation algorithm is also used when the signal is separated.
  • If the speech separation algorithm based on deep learning is used to generate the control signal, that algorithm is also used when the signal is separated. In this way, the control signal and the signal separation work together better, which achieves a better separation effect and improves the accuracy of voice detection.
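  • To make the "weighting each array element" idea concrete, the following sketch computes the weights of one textbook adaptive beamformer, the minimum variance distortionless response (MVDR) beamformer. MVDR is not named in this application; it is chosen here only as a familiar example. The uniform linear array geometry, the function names, and the diagonal loading term are all assumptions.

```python
import numpy as np


def steering_vector(theta_deg, num_mics, spacing_m, freq_hz, c=343.0):
    """Far-field steering vector of a uniform linear array (assumed geometry)."""
    theta = np.deg2rad(theta_deg)
    delays = np.arange(num_mics) * spacing_m * np.cos(theta) / c
    return np.exp(-2j * np.pi * freq_hz * delays)


def mvdr_weights(cov, steer, diag_load=1e-6):
    """MVDR weights w = R^-1 d / (d^H R^-1 d) for one frequency bin.

    cov   : (M, M) spatial covariance matrix of the microphone signals
    steer : (M,) steering vector toward the pointing angle to retain
    """
    m = cov.shape[0]
    # Small diagonal loading keeps the solve well conditioned.
    loaded = cov + diag_load * np.trace(cov).real / m * np.eye(m)
    rinv_d = np.linalg.solve(loaded, steer)
    return rinv_d / (steer.conj() @ rinv_d)
```

  • The beamformer output for one bin is then `w.conj() @ x`, where `x` is the vector of microphone spectra. The weights pass the look direction with unit gain while attenuating directions that dominate the covariance matrix.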
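  • The voice detection result of the target detection sound region is generated according to the voice output signal corresponding to the target detection sound region, which specifically includes the following steps:
  • according to the voice output signal corresponding to the target detection sound region, determine the signal power corresponding to the target detection sound region, wherein the signal power is the signal power of the voice output signal at a time-frequency point;
  • according to the signal power corresponding to the target detection sound region, determine the estimated signal-to-noise ratio corresponding to the target detection sound region;
  • according to the estimated signal-to-noise ratio corresponding to the target detection sound region, determine the output signal weighted value corresponding to the target detection sound region, wherein the output signal weighted value is the weight applied to the voice output signal at the time-frequency point;
  • according to the output signal weighted value and the voice output signal corresponding to the target detection sound region, determine the target voice output signal corresponding to the target detection sound region;
  • according to the target voice output signal corresponding to the target detection sound region, determine the voice detection result corresponding to the target detection sound region.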
  • In this embodiment, a method for cross-channel post-processing of the voice output signal is introduced. The voice output signal after signal separation is not always clean, so if the voice output signal corresponding to each pointing angle has a relatively high signal-to-noise ratio, cross-channel post-processing can be performed. The signal-to-noise ratio can be considered high when it exceeds -5 dB. This threshold can be adjusted according to the actual situation; "-5 dB" is only an illustration and should not be construed as a limitation of this application.
  • One implementation of cross-channel post-processing is as follows: first determine the signal power corresponding to the target detection sound region according to the voice output signal corresponding to the target detection sound region, then calculate the estimated signal-to-noise ratio corresponding to the target detection sound region, then determine the output signal weighted value corresponding to the target detection sound region, and finally determine the target voice output signal corresponding to the target detection sound region according to the output signal weighted value and the voice output signal.
  • Based on the target voice output signal, the voice detection result corresponding to the target detection sound region is determined.
  • For ease of description, the following takes any one of the N sound regions as an example. The target voice output signals of the other sound regions are determined in a similar way, which is not repeated here.
  • Any sound region can be used as the target detection sound region. Assume that the sound region is the i-th sound region and that its pointing angle is θ_i. For each time-frequency point (t, f) of the pointing angle θ_i, the estimated signal-to-noise ratio of the i-th sound region is calculated from the following quantities:
  • μ_i(t, f) represents the estimated signal-to-noise ratio of the i-th sound region;
  • P_i(t, f) represents the signal power of the voice output signal in the direction of the pointing angle θ_i at the time-frequency point (t, f);
  • N represents the number of sound regions (equivalently, the number of pointing angles);
  • j represents the j-th sound region (equivalently, the j-th pointing angle);
  • i represents the i-th sound region (equivalently, the i-th pointing angle);
  • t represents time;
  • f represents frequency.
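  • The formula that these symbol definitions accompany did not survive in this text. One form consistent with the listed symbols, offered here as an assumption rather than as the exact expression of this application, writes μ_i for the estimated signal-to-noise ratio and compares the power in region i with the strongest competing region:

```latex
\mu_i(t,f) \;=\; \frac{P_i(t,f)}{\displaystyle\max_{\substack{j=1,\dots,N \\ j\neq i}} P_j(t,f)}
```

  • Under this reading, a time-frequency point dominated by region i yields a large μ_i(t, f), while a point dominated by leakage from another region yields a value below 1.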
  • The Wiener filtering formula can be used to calculate the output signal weighted value of the i-th sound region:
  • g_i(t, f) represents the output signal weighted value of the i-th sound region, that is, the weight applied to the voice output signal in the direction of the pointing angle θ_i at the time-frequency point (t, f).
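  • The Wiener formula itself is missing from this text. The standard Wiener gain expressed through a signal-to-noise ratio, written here with μ_i for the estimated signal-to-noise ratio and assumed to be the intended form, is:

```latex
g_i(t,f) \;=\; \frac{\mu_i(t,f)}{\mu_i(t,f) + 1}
```

  • This gain lies between 0 and 1 and approaches 1 as the estimated signal-to-noise ratio grows, so dominant time-frequency points pass nearly unchanged while weak ones are attenuated.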
  • Based on the output signal weighted value and the voice output signal of the i-th sound region, the target voice output signal of the i-th sound region is calculated:
  • y_i(t, f) represents the target voice output signal of the i-th sound region, that is, the output of the cross-channel post-processing algorithm in the direction of the pointing angle θ_i.
  • x_i(t, f) represents the voice output signal of the i-th sound region, that is, the voice output signal in the direction of the pointing angle θ_i.
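  • The expression combining these quantities is also missing here. Since the target voice output signal is computed from the output signal weighted value and the voice output signal, the natural reading, stated as an assumption, is a per-point product:

```latex
y_i(t,f) \;=\; g_i(t,f)\, x_i(t,f)
```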
  • The target voice output signal y_i(t, f) in this embodiment is a voice output signal that has not undergone noise reduction processing.
  • In this embodiment, a method for performing cross-channel post-processing on a voice output signal is provided.
  • In this way, cross-channel post-processing separates the voice signals better. Especially when the signal-to-noise ratio is high enough, it improves the purity of the voice signal and thereby further improves the quality of the output signal.
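  • The whole cross-channel post-processing chain can be sketched with NumPy as follows. This is a sketch under stated assumptions: the signal power is taken as the squared magnitude of the separated spectrum, the signal-to-noise ratio is estimated against the strongest competing region, and the function name and the small `eps` guard are hypothetical.

```python
import numpy as np


def cross_channel_postprocess(x, eps=1e-12):
    """Apply a Wiener-style gain to each separated channel.

    x : array of shape (N, T, F) holding the voice output spectrum
        of N >= 2 sound regions at every time-frequency point.
    Returns an array of the same shape with the weighted outputs.
    """
    if x.shape[0] < 2:
        raise ValueError("cross-channel post-processing needs at least two regions")
    power = np.abs(x) ** 2                       # assumed signal power per point
    y = np.empty_like(x)
    for i in range(x.shape[0]):
        # Power of the strongest competing region at each time-frequency point.
        competing = np.delete(power, i, axis=0).max(axis=0)
        snr = power[i] / (competing + eps)       # estimated signal-to-noise ratio
        gain = snr / (snr + 1.0)                 # Wiener-style weight in [0, 1)
        y[i] = gain * x[i]
    return y
```

  • A point where one region is ten times stronger in amplitude than the other keeps about 99% of its amplitude in the dominant channel and about 1% in the other, which is the leakage suppression the text describes.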
  • Determining the target voice output signal corresponding to the target detection sound region specifically includes the following step:
  • noise reduction processing is performed on the to-be-processed voice output signal corresponding to the target detection sound region to obtain the target voice output signal corresponding to the target detection sound region.
  • In this embodiment, a method for performing noise reduction processing on the to-be-processed voice output signal is introduced.
  • For ease of description, the following takes any one of the N sound regions as an example. The target voice output signals of the other sound regions are determined in a similar way, which is not repeated here. Any sound region can be used as the target detection sound region. Assume that the sound region is the i-th sound region and that its pointing angle is θ_i. As described in the above embodiments, the to-be-processed voice output signal of the i-th sound region is calculated from the output signal weighted value of the i-th sound region and the voice output signal of the i-th sound region.
  • y′_i(t, f) represents the to-be-processed voice output signal of the i-th sound region, that is, the output of the cross-channel post-processing algorithm in the direction of the pointing angle θ_i.
  • x_i(t, f) represents the voice output signal of the i-th sound region, that is, the voice output signal in the direction of the pointing angle θ_i.
  • Unlike in the previous embodiment, the to-be-processed voice output signal y′_i(t, f) in this embodiment is a voice output signal that has not undergone noise reduction processing, while the target voice output signal y_i(t, f) in this embodiment is the voice output signal after noise reduction.
  • Noise reduction processing is performed on the to-be-processed voice output signal y′_i(t, f) to obtain the target voice output signal y_i(t, f) corresponding to each sound region.
  • One feasible filtering method is to use a least mean square (LMS) adaptive filter for noise reduction processing.
  • The LMS adaptive filter uses the filter parameters obtained at the previous moment to automatically adjust the current filter parameters, adapting to the unknown or randomly varying statistical properties of the signal and noise in order to achieve optimal filtering.
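  • The update rule described above can be illustrated with a basic LMS noise canceller. This is a generic textbook sketch, not the filter of this application: it assumes that a reference signal correlated with the noise is available, and the function name, tap count, and step size are hypothetical.

```python
import numpy as np


def lms_noise_canceller(primary, reference, num_taps=8, step=0.01):
    """Basic LMS adaptive noise canceller.

    primary   : 1-D array, desired signal plus noise
    reference : 1-D array, noise reference correlated with the noise in primary
    Returns (cleaned signal, final filter weights).
    """
    primary = np.asarray(primary, dtype=float)
    reference = np.asarray(reference, dtype=float)
    weights = np.zeros(num_taps)
    cleaned = np.zeros(len(primary))
    for n in range(num_taps - 1, len(primary)):
        # Most recent reference samples, newest first.
        window = reference[n - num_taps + 1:n + 1][::-1]
        noise_estimate = weights @ window
        error = primary[n] - noise_estimate   # error doubles as the cleaned sample
        # Weights from the previous step are nudged along the error gradient.
        weights = weights + step * error * window
        cleaned[n] = error
    return cleaned, weights
```

  • The step size trades convergence speed against stability; it has to stay small relative to the reference power multiplied by the number of taps.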
  • Another feasible filtering method is to use an LMS adaptive notch filter for noise reduction processing.
  • The adaptive notch filter is suitable for monochromatic interference, such as single-frequency sine wave noise. Ideally, the notch has arbitrarily narrow shoulders, so that the response returns to the flat region immediately on either side of the notch.
  • Another feasible filtering method is to use basic spectral subtraction for noise reduction processing.
  • Because the perception of the to-be-processed voice output signal is largely insensitive to phase, the phase information from before spectral subtraction is reused in the signal after spectral subtraction.
  • The target voice output signal after spectral subtraction can then be obtained with the inverse fast Fourier transform (IFFT).
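  • Basic spectral subtraction on a single frame can be sketched as follows. The sketch assumes that a noise magnitude spectrum has already been estimated (for example, from frames without speech); the function name and the `floor` parameter are hypothetical. As described above, the phase of the noisy frame is reused unchanged.

```python
import numpy as np


def spectral_subtract(frame, noise_mag, floor=0.0):
    """Subtract an estimated noise magnitude spectrum from one frame.

    frame     : 1-D real array of time-domain samples
    noise_mag : noise magnitude per rfft bin, length len(frame) // 2 + 1
    floor     : fraction of the noisy magnitude kept as a lower bound
    """
    spectrum = np.fft.rfft(frame)
    magnitude = np.abs(spectrum)
    phase = np.angle(spectrum)                 # phase before subtraction is kept
    cleaned_mag = np.maximum(magnitude - noise_mag, floor * magnitude)
    return np.fft.irfft(cleaned_mag * np.exp(1j * phase), n=len(frame))
```

  • In a streaming system this would run on overlapping windowed frames that are recombined by overlap-add; that framing is omitted here for brevity.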
  • Another feasible filtering method is to use Wiener filtering for noise reduction processing. The above examples are only feasible solutions. In practical applications, other noise reduction methods may also be used, which are not limited here.
  • In this embodiment, a method for noise reduction processing of the to-be-processed voice output signal is provided.
  • In this way, noise, interfering human voices, and residual echo can be further suppressed, which improves the quality of the target voice output signal and helps increase the accuracy of voice detection.
  • The voice detection result of the target detection sound region is generated according to the voice output signal corresponding to the target detection sound region, which specifically includes the following steps:
  • if the target voice output signal corresponding to the target detection sound region satisfies the human voice matching condition, a first voice detection result is generated, wherein the first voice detection result belongs to the voice detection result and indicates that the target voice output signal is a human voice signal;
  • if the target voice output signal corresponding to the target detection sound region does not satisfy the human voice matching condition, a second voice detection result is generated, wherein the second voice detection result belongs to the voice detection result and indicates that the target voice output signal is a noise signal.
  • In this embodiment, a method of performing voice detection on each sound region is introduced.
  • In the voice detection process, it is necessary to judge whether the voice output signal of each sound region satisfies the human voice matching condition.
  • The "target voice output signal" in this embodiment is obtained from the voice output signal after cross-channel post-processing and noise reduction post-processing. If the voice output signal has undergone neither cross-channel post-processing nor noise reduction post-processing, voice detection is performed on the "voice output signal". If the voice output signal has undergone only cross-channel post-processing without noise reduction post-processing, voice detection can be performed on the "to-be-processed voice output signal".
  • This application takes the "target voice output signal" as an example for description, but this should not be construed as a limitation on this application.
  • For ease of description, the following takes any one of the N sound regions as an example. The voice detection results of the other sound regions are determined in a similar way, which is not repeated here. Any sound region can be used as the target detection sound region. In the detection process, at least one of the target voice output signal, the lip motion information, the user information, and the voiceprint can be used to determine whether a sound region satisfies the human voice matching condition, as described in the following examples.
  • Scenario 2: If the received target voice output signal is very weak or does not resemble a human voice, it can be determined that the user in the direction of the pointing angle corresponding to the sound region is not speaking, so the human voice matching condition is not satisfied.
  • Scenario 3: If the received target voice output signal is a human voice but matches the voiceprint of the given user information very poorly (for example, the matching score is less than 0.5), it can be determined that the user in the direction of the pointing angle corresponding to the sound region is not speaking and that the target voice output signal is a noise signal leaking into the current channel from human voices in other directions, so the human voice matching condition is not satisfied.
  • If the received target voice output signal is a human voice, but the lip motion information indicates that the user's lips have not moved and the voiceprint matching degree is not high, it can likewise be determined that the user in the direction of the pointing angle corresponding to the sound region is not speaking and that the target voice output signal is a noise signal leaking into the current channel from human voices in other directions, so the human voice matching condition is not satisfied.
  • The corresponding voiceprint can be obtained from the database (assuming that the user has registered with the user information), and the voiceprint is used to determine whether the target voice output signal in the current channel matches the user's voiceprint. If the match succeeds, the human voice matching condition is satisfied. If not, the target voice output signal is determined to be a noise signal leaking into the current channel from human voices in other directions, that is, the human voice matching condition is not satisfied.
  • If the human voice matching condition is satisfied, a first voice detection result is generated, which means that the target voice output signal is a normal human voice signal.
  • If the human voice matching condition is not satisfied, a second voice detection result is generated, which means that the target voice output signal is a noise signal.
  • In this embodiment, a method for performing voice detection on each sound region is provided.
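  • The scenarios above can be combined into a small rule-based check. This is a hypothetical sketch: only the voiceprint score of 0.5 comes from the text, while the energy threshold, the voice-likeness score, the "confident" voiceprint level, and the handling of missing cues are assumptions made for illustration.

```python
from typing import Optional

HUMAN_VOICE = "first voice detection result"    # human voice signal
NOISE = "second voice detection result"         # noise signal


def detect_voice(energy_db: float,
                 voice_likeness: float,
                 lips_moving: Optional[bool] = None,
                 voiceprint_score: Optional[float] = None,
                 min_energy_db: float = -50.0,
                 min_voice_likeness: float = 0.5,
                 min_voiceprint: float = 0.5,
                 confident_voiceprint: float = 0.8) -> str:
    """Return the detection result for one sound region."""
    # Very weak signal, or a signal that does not resemble a human voice.
    if energy_db < min_energy_db or voice_likeness < min_voice_likeness:
        return NOISE
    # Human voice that matches the expected user's voiceprint very poorly.
    if voiceprint_score is not None and voiceprint_score < min_voiceprint:
        return NOISE
    # Lips not moving and no confident voiceprint match (missing score treated
    # as "not high", which is an assumption): likely leakage from elsewhere.
    if lips_moving is False and (
            voiceprint_score is None or voiceprint_score < confident_voiceprint):
        return NOISE
    return HUMAN_VOICE
```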
  • After the voice detection result of the target detection sound region is generated according to the voice output signal corresponding to the target detection sound region, the method may further include the following steps:
  • if the voice detection results corresponding to M sound regions are the first voice detection result, the target sound region is determined from the M sound regions, wherein the first voice detection result indicates that the voice output signal is a human voice signal, the M sound regions belong to the N sound regions, and M is an integer greater than or equal to 1 and less than or equal to N;
  • In this embodiment, a method of making a call based on the voice detection result is introduced. As described in the above embodiments, after the voice detection result corresponding to each of the N sound regions is obtained, the sound regions corresponding to the first voice detection result are selected.
  • This is because in a call scenario, human voice needs to be transmitted and noise suppressed in order to improve call quality, and the first voice detection result indicates that the voice output signal of the sound region is a human voice signal.
  • The "voice output signal" in this embodiment can also be a "to-be-processed voice output signal" or a "target voice output signal", which can be selected flexibly during processing. This is only an illustration and should not be construed as a limitation of this application.
  • Assume that the voice detection results of M of the N sound regions are the first voice detection result. The main speaker can then be determined based on the voice output signal (or the target voice output signal, or the to-be-processed voice output signal) corresponding to each of the M sound regions, wherein each of the M sound regions is called a "target sound region".
  • FIG. 6 is another schematic diagram of the structure of the multi-channel sound pickup system in the embodiment of this application. As shown in the figure, the microphone array on the terminal device can pick up the audio signal corresponding to each sound region.
  • The audio signal includes the voice input signal and the noise signal.
  • The control signal corresponding to each sound region is generated by the signal separator and is used to suppress or retain the voice input signal of each pointing angle, so as to obtain the voice output signal corresponding to each sound region. Based on the sound region information and the voice output signal of each sound region, the voice detection result of each sound region is determined.
  • The main speaker determination module determines the main speaker in real time according to the voice output signals of the M sound regions and the sound region information.
  • For example, the signal strength of each speaker and the distance from the speaker to the microphone array (which can be provided by a wide-angle camera or a multi-camera array) can be used to estimate the speaker's original volume (that is, the volume at the mouth), and the main speaker is then determined according to the original volume.
  • As another example, the main speaker can be determined according to the face orientation of each speaker (in a video conference scenario, a user whose face is facing the camera is more likely to be the main speaker).
  • The main speaker decision, including the main speaker's location and identity, is output to the mixer to meet the call requirements.
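  • The volume-based choice of main speaker can be sketched as follows. The text only says that signal strength and distance are used to estimate the volume at the mouth; the inverse-distance model below (received amplitude falls off in proportion to distance) is an assumption, and the function names are hypothetical.

```python
from typing import Dict, Optional, Tuple


def estimate_source_level(received_rms: float, distance_m: float) -> float:
    """Estimate the level at the mouth, assuming amplitude decays as 1/distance."""
    return received_rms * distance_m


def pick_main_speaker(candidates: Dict[int, Tuple[float, float]]) -> Optional[int]:
    """candidates maps a region index to (received RMS level, distance in metres).

    Returns the region index with the highest estimated source level,
    or None when there are no candidates.
    """
    if not candidates:
        return None
    return max(candidates, key=lambda i: estimate_source_level(*candidates[i]))
```

  • With this model, a speaker picked up at level 0.15 from 2 m away outranks one picked up at level 0.2 from 1 m away, because the estimated level at the mouth is higher.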
  • The mixer merges the N continuous audio streams into one or more output audio streams to meet the call requirements.
  • In one implementation, if the main speaker is determined to be in the direction of the pointing angle θ_1, the output single-channel audio is equal to the voice output signal of the first input, and the input data of the other channels is discarded directly.
  • In another implementation, if the main speakers are determined to be in the directions of the pointing angles θ_1 and θ_4, the output audio is equal to the voice output signal of the first input and the voice output signal of the fourth input, and the input data of the other channels is discarded directly.
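  • A minimal mixer along these lines keeps the streams of the main speakers and drops the rest. The function name and the optional `downmix` flag are hypothetical; whether the kept streams are delivered as separate channels or summed into one is a design choice the text leaves open.

```python
import numpy as np


def mix(streams, main_regions, downmix=False):
    """Keep only the streams of the main speakers.

    streams      : dict mapping region index -> 1-D array (equal lengths)
    main_regions : non-empty list of region indices chosen as main speakers
    downmix      : if True, sum the kept streams into a single channel
    Returns an array of shape (len(main_regions), T), or (T,) when downmixed.
    """
    if not main_regions:
        raise ValueError("at least one main speaker region is required")
    kept = np.stack([np.asarray(streams[i], dtype=float) for i in main_regions])
    return kept.sum(axis=0) if downmix else kept
```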
  • The voice output signal can also undergo cross-channel post-processing and noise reduction post-processing to obtain the target voice output signal corresponding to each sound region. The voice detection result of each sound region is then determined based on the sound region information and the target voice output signal.
  • FIG. 7 is a schematic diagram of an interface for implementing a call based on the multi-sound-region voice detection method in the embodiment of this application.
  • The main speaker can be determined using the technical solution provided in this application, and the main speaker's voice is transmitted to user A while other speakers and noise are suppressed, so that user A hears clearer speech.
  • In this embodiment, a method for making a call based on the voice detection result is provided.
  • In this way, the voice of each user can be separated and enhanced in real time in a multi-user scenario.
  • In a call scenario, a high-quality call is thus realized through multi-user parallel separation and enhancement followed by mixing.
  • After the voice detection result of the target detection sound region is generated according to the voice output signal corresponding to the target detection sound region, the method may further include the following steps:
  • if the voice detection results corresponding to M sound regions are the first voice detection result, the target sound region is determined from the M sound regions, wherein the first voice detection result indicates that the voice output signal is a human voice signal, the M sound regions belong to the N sound regions, and M is an integer greater than or equal to 1 and less than or equal to N;
  • semantic recognition is performed on the voice output signal corresponding to the target sound region to obtain a semantic recognition result;
  • dialogue response information is generated according to the semantic recognition result.
  • a method of feeding back dialogue response information based on the voice detection result is provided.
  • the sound zones corresponding to the first voice detection result are selected from the N sound zones. This is because, in an intelligent dialogue scenario, human voice needs to be transmitted and noise suppressed in order to improve the accuracy of the dialogue.
  • the first voice detection result indicates that the voice output signal of the sound zone is a human voice signal.
  • the "voice output signal" in this embodiment can also be a "to-be-processed voice output signal" or a "target voice output signal", which can be selected flexibly in the specific processing flow; this is only an illustration and should not be construed as a limitation of this application.
  • the voice detection results of M sound regions among the N sound regions are the first voice detection result; based on this, the main speaker may be further determined according to the voice output signal (or the target voice output signal, or the to-be-processed voice output signal) corresponding to each of the M sound regions.
  • each of the M sound regions is called a "target sound region".
  • FIG. 8 is another schematic diagram of the structure of the multi-channel sound pickup system in the embodiment of the application. As shown in the figure, the microphone array equipped on the terminal device can pick up the audio signal corresponding to each sound zone.
  • the audio signal includes the voice input signal and the noise signal.
  • the control signal corresponding to each sound zone is generated by the signal separator, and the control signal of each sound zone is used to suppress or retain the voice input signal of each pointing angle, so as to obtain the voice output signal corresponding to each sound zone. Based on the sound zone information and the voice output signal of each sound zone, a voice detection result for each sound zone is determined.
  • NLP processing is performed on the speech output signal corresponding to each target sound region in the M sound regions to obtain the intention of the speaker in each target sound region, that is, the semantic recognition result.
  • the main speaker determination module determines the main speaker in real time according to the voice output signals of the M sound regions and the sound region information.
  • the signal strength of each speaker and the distance from the speaker to the microphone array (which can be provided by a wide-angle camera or a multi-camera array) are used to measure the speaker's original volume (that is, the volume at the mouth), and the main speaker is then determined according to the original volume.
  • the main speaker can also be determined according to the semantic recognition result of each speaker and the face orientation; for example, in a video conference scenario, the user whose face is facing the camera is more likely to be the main speaker.
  • the judgment result of the main speaker includes the main speaker's position and identity, and serves as the basis for generating the dialogue response information, so that the reply corresponds to the main speaker's intention.
  • cross-channel post-processing and noise reduction post-processing can also be performed on the speech output signal to obtain the target speech output signal corresponding to each sound zone.
  • Based on the sound zone information and the target speech output signal, the speech detection result for each sound zone is determined.
  • FIG. 9 is a schematic diagram of an interface for realizing a dialogue response based on the multi-sound-zone speech detection method in an embodiment of the application.
  • the main speaker can be determined by using the technical solution provided in this application, and according to the judgment result of the main speaker and the semantic recognition result, a reply can be made to the main speaker's utterance "Xiao Teng, what day of the week is today?", that is, dialogue response information such as "Hi, it's Friday" is generated.
  • a method for feeding back dialogue response information based on a voice detection result is provided.
  • the voice of each user can be separated and enhanced in real time in a multi-user scenario, so that, in an intelligent dialogue scenario,
  • the main speaker is determined from the voice detection results and semantic recognition results, and the multi-user parallel separation and enhancement processing followed by post-mixing improves the voice quality, so that the dialogue response information can be fed back according to the semantic recognition results.
  • Voice is filtered.
  • after generating the voice detection result of the target detection sound region according to the voice output signal corresponding to the target detection sound region, the method may further include the following steps:
  • if the voice detection results corresponding to M sound regions are the first voice detection result, the target sound region is determined from the M sound regions according to the voice output signal corresponding to each of the M sound regions, wherein the first voice detection result indicates that the voice output signal is a human voice signal, the M sound regions belong to the N sound regions, and M is an integer greater than or equal to 1 and less than or equal to N;
  • text record information is generated, wherein the text record information includes at least one of translation text and conference record text.
  • a method for generating text record information based on the voice detection result is introduced.
  • the sound zones corresponding to the first voice detection result are selected from the N sound zones. This is because, in a translation or recording scenario, human voice needs to be transmitted and noise suppressed in order to improve the accuracy of the translation or record.
  • the first voice detection result indicates that the voice output signal of the sound zone is a human voice signal.
  • the "voice output signal" in this embodiment can also be a "to-be-processed voice output signal" or a "target voice output signal", which can be selected flexibly in the specific processing flow; this is only an illustration and should not be construed as a limitation of this application.
  • the voice detection results of M sound regions among the N sound regions are the first voice detection result; based on this, the main speaker may be further determined according to the voice output signal (or the target voice output signal, or the to-be-processed voice output signal) corresponding to each of the M sound regions.
  • each of the M sound regions is called a "target sound region".
  • FIG. 10 is another schematic diagram of the structure of the multi-channel sound pickup system in the embodiment of the application. As shown in the figure, the microphone array equipped on the terminal device can pick up the audio signal corresponding to each sound zone.
  • the audio signal includes the voice input signal and the noise signal.
  • the control signal corresponding to each sound zone is generated by the signal separator, and the control signal of each sound zone is used to suppress or retain the voice input signal of each pointing angle, so as to obtain the voice output signal corresponding to each sound zone. Based on the sound zone information and the voice output signal of each sound zone, a voice detection result for each sound zone is determined.
  • the voice output signal corresponding to each target sound zone in the M sound zones is segmented, that is, the endpoint positions of each voice output signal are determined, thereby obtaining the audio data to be recognized.
  • the audio data carries user information, and the user information may specifically be a user ID. Both the to-be-recognized audio data and user information can be used for subsequent speech recognition tasks. Therefore, the ASR technology is used to process the audio data to be recognized corresponding to each target sound region in the M sound regions, that is, the speech content of the speaker in each target sound region is obtained, that is, the speech recognition result is obtained.
  • the main speaker determination module determines the main speaker in real time according to the voice output signals of the M sound regions and the sound region information.
  • the signal strength of each speaker and the distance from the speaker to the microphone array (which can be provided by a wide-angle camera or a multi-camera array) are used to measure the speaker's original volume (that is, the volume at the mouth), and the main speaker is then determined according to the original volume.
  • the main speaker can also be determined according to the voice recognition result of each speaker and the face orientation; for example, in a video conference scenario, the user whose face is facing the camera is more likely to be the main speaker.
  • the judgment result of the main speaker includes its position and identity, which is used as the generation basis of the text record information to display the dialogue response information, and the text record information includes at least one of translation text and conference record text.
  • the ASR technology can use the rule method or the machine learning model method to send the segmented audio data to be recognized together with the voiceprint to the ASR module in the cloud, usually the voiceprint identification or voiceprint model parameters are sent to the ASR module.
  • the ASR module can use the voiceprint information to further improve its recognition rate.
  • FIG. 11 is a schematic diagram of an interface for implementing text recording based on the multi-sound-zone voice detection method in the embodiment of the present application.
  • the main speaker can be determined by using the technical solution provided in this application, and according to the judgment result of the main speaker and the speech recognition result, a paragraph of words spoken by the main speaker can be translated in real time.
  • the main speaker is user A.
  • a method for generating text record information based on a voice detection result is provided.
  • the voice of each user can be separated and enhanced in real time in a multi-user scenario.
  • the voice detection results and speech recognition results make it possible to accurately distinguish the start and end time points of each speaker and to recognize each speaker's speech individually, yielding more accurate speech recognition, which benefits subsequent semantic understanding, translation, and the like.
  • the voice quality is improved, thereby helping to increase the accuracy of the text record information.
  • FIG. 12 is a schematic diagram of an embodiment of the voice detection device in the embodiment of the present application.
  • the voice detection device 20 includes:
  • the acquisition module 201 is used for acquiring the sound area information corresponding to each sound area in the N sound areas, wherein the sound area information includes a sound area identification, a pointing angle and user information, and the sound area identification is used to identify the sound area, The pointing angle is used to indicate the central angle of the sound area, the user information is used to indicate the user retention in the sound area, and N is an integer greater than 1;
  • the generation module 202 is used for taking each sound zone as the target detection sound zone respectively, and generating the control signal corresponding to the target detection sound zone according to the sound zone information corresponding to the target detection sound zone, wherein the control signal is used to suppress or retain the voice input signal, and the control signal has a one-to-one correspondence with the sound area;
  • the processing module 203 is used for using the control signal corresponding to the target detection sound area to process the voice input signal corresponding to the target detection sound area to obtain the voice output signal corresponding to the target detection sound area, wherein the control signal, the voice input signal and the voice output signal have a one-to-one correspondence;
  • the generating module 202 is further configured to generate the speech detection result of the target detection sound region according to the speech output signal corresponding to the target detection sound region.
  • the acquisition module 201 is specifically configured to detect each sound region in the N sound regions, and obtain a user detection result corresponding to each sound region respectively;
  • Each sound zone is used as the target detection sound zone respectively, and the user information corresponding to the target detection sound zone is determined according to the user detection result corresponding to the target detection sound zone;
  • the sound zone information corresponding to the target detection sound region is generated according to the user information corresponding to the target detection sound region, the lip motion information corresponding to the target detection sound region, the sound region identification corresponding to the target detection sound region, and the pointing angle corresponding to the target detection sound region.
  • the acquiring module 201 is specifically configured to determine the first identity identifier as user information if the user detection result corresponding to the target detection sound zone is that there is an identifiable user in the target detection sound zone;
  • the second identity identifier is determined as user information
  • the third identity identifier is determined as user information
  • the acquisition module 201 is specifically configured to determine the first motion identification as lip motion information if the user detection result corresponding to the target detection sound area is that there is a user with lip motion in the target detection sound area;
  • the second motion identifier is determined as lip motion information
  • the third motion identifier is determined as lip motion information.
  • the generating module 202 is specifically configured to generate a first control signal if the user information corresponding to the target detection sound area indicates that there is no user in the target detection sound area, wherein the first control signal belongs to the control signal and is used to suppress the voice input signal;
  • if the user information corresponding to the target detection sound area indicates that there is a user in the target detection sound area, a second control signal is generated, wherein the second control signal belongs to the control signal and is used to retain the voice input signal.
  • the generating module 202 is specifically configured to generate a first control signal if the user information corresponding to the target detection sound area indicates that there is no user in the target detection sound area, wherein the first control signal belongs to the control signal and is used to suppress the voice input signal;
  • the first control signal is generated
  • a second control signal is generated, wherein the second control signal belongs to the control signal and is used to retain the voice input signal;
  • the first control signal or the second control signal is generated according to the original audio signal.
  • the generating module 202 is specifically configured to use a preset algorithm to generate the control signal corresponding to the target detection sound zone according to the sound zone information corresponding to the target detection sound zone, wherein the preset algorithm is an adaptive beamforming algorithm, a blind source separation algorithm, or a speech separation algorithm based on deep learning;
  • the processing module 203 is specifically configured to, if the preset algorithm is the adaptive beamforming algorithm, use the adaptive beamforming algorithm to process the voice input signal corresponding to the target detection sound region according to the control signal corresponding to the target detection sound region, to obtain the voice output signal corresponding to the target detection sound region;
  • if the preset algorithm is the blind source separation algorithm, the blind source separation algorithm is used to process the voice input signal corresponding to the target detection sound region according to the control signal corresponding to the target detection sound region, to obtain the voice output signal corresponding to the target detection sound region;
  • if the preset algorithm is the speech separation algorithm based on deep learning, the speech separation algorithm based on deep learning is used to process the voice input signal corresponding to the target detection sound region according to the control signal corresponding to the target detection sound region, to obtain the voice output signal corresponding to the target detection sound region.
  • the generation module 202 is specifically used to determine the signal power corresponding to the target detection sound region according to the corresponding voice output signal of the target detection sound region, wherein the signal power is the signal power of the voice output signal at the time-frequency point;
  • according to the estimated signal-to-noise ratio corresponding to the target detection sound region, the output signal weighted value corresponding to the target detection sound region is determined, wherein the output signal weighted value is the weighted result of the voice output signal at the time-frequency point;
  • the speech detection result corresponding to the target detection sound area is determined.
  • the generating module 202 is specifically used to determine the to-be-processed voice output signal corresponding to the target detection sound region according to the output signal weighting value corresponding to the target detection sound region and the corresponding voice output signal of the target detection sound region;
  • Noise reduction processing is performed on the to-be-processed speech output signal corresponding to the target detection sound region to obtain a target speech output signal corresponding to the target detection sound region.
  • the generating module 202 is specifically configured to generate a first voice detection result if the target voice output signal corresponding to the target detection sound area satisfies the human voice matching condition, wherein the first voice detection result belongs to the voice detection result and indicates that the target voice output signal is a human voice signal;
  • a second voice detection result is generated, wherein the second voice detection result belongs to the voice detection result and indicates that the target voice output signal is a noise signal.
  • the speech detection apparatus 20 further includes a determination module 204 and a transmission module 205;
  • the determination module 204 is used for, after the generation module 202 generates the voice detection result of the target detection sound region according to the voice output signal corresponding to the target detection sound region, if the voice detection results corresponding to M sound regions are the first voice detection result, determining the target sound region from the M sound regions according to the voice output signal corresponding to each of the M sound regions, wherein the first voice detection result indicates that the voice output signal is a human voice signal, the M sound regions belong to the N sound regions, and M is an integer greater than or equal to 1 and less than or equal to N;
  • the transmission module 205 is used for transmitting the voice output signal corresponding to the target sound area to the calling party.
  • the speech detection apparatus 20 further includes a determination module 204 and a recognition module 206;
  • the determination module 204 is used for, after the generation module 202 generates the voice detection result of the target detection sound region according to the voice output signal corresponding to the target detection sound region, if the voice detection results corresponding to M sound regions are the first voice detection result, determining the target sound region from the M sound regions according to the voice output signal corresponding to each of the M sound regions, wherein the first voice detection result indicates that the voice output signal is a human voice signal, the M sound regions belong to the N sound regions, and M is an integer greater than or equal to 1 and less than or equal to N;
  • the recognition module 206 is used to perform semantic recognition on the speech output signal corresponding to the target sound region to obtain a semantic recognition result
  • the generating module 202 is further configured to generate dialogue response information according to the semantic recognition result.
  • the speech detection apparatus 20 further includes a determination module 204 and a recognition module 206;
  • the determination module 204 is used for, after the generation module 202 generates the voice detection result of the target detection sound region according to the voice output signal corresponding to the target detection sound region, if the voice detection results corresponding to M sound regions are the first voice detection result, determining the target sound region from the M sound regions according to the voice output signal corresponding to each of the M sound regions, wherein the first voice detection result indicates that the voice output signal is a human voice signal, the M sound regions belong to the N sound regions, and M is an integer greater than or equal to 1 and less than or equal to N;
  • the processing module 203 is used to perform segmentation processing on the voice output signal corresponding to the target sound area to obtain audio data to be recognized;
  • the recognition module 206 is used to perform speech recognition on the to-be-recognized audio data corresponding to the target sound area, and obtain a speech recognition result;
  • the generating module 202 is further configured to generate text record information according to the speech recognition result corresponding to the target sound area, wherein the text record information includes at least one of translation text and conference record text.
  • FIG. 13 is a schematic structural diagram of a computer device 30 according to an embodiment of the present application.
  • Computer device 30 may include input device 310 , output device 320 , processor 330 , and memory 340 .
  • the output device in this embodiment of the present application may be a display device.
  • Memory 340 may include read-only memory and random access memory, and provides instructions and data to processor 330 .
  • a portion of the memory 340 may also include non-volatile random access memory (Non-Volatile Random Access Memory, NVRAM).
  • the memory 340 stores the following elements, executable modules or data structures, or a subset thereof, or an extended set thereof:
  • Operation instructions: include various operation instructions used to implement various operations.
  • Operating system: includes various system programs for implementing various basic services and handling hardware-based tasks.
  • the processor 330 is configured to:
  • the sound area information includes sound area identification, pointing angle and user information
  • the sound area identification is used to identify the sound area
  • the pointing angle is used to indicate the central angle of the sound area
  • the user information is used to indicate the user retention in the sound area
  • N is an integer greater than 1;
  • a control signal corresponding to the target detection sound area is generated, wherein the control signal is used to suppress or retain the voice input signal, and the control signal has a one-to-one correspondence with the sound area;
  • the voice input signal corresponding to the target detection sound region is processed to obtain the voice output signal corresponding to the target detection sound region, wherein the control signal, the voice input signal and the voice output signal have one-to-one correspondence;
  • the speech detection result of the target detection sound area is generated.
  • the processor 330 controls the operation of the computer device 30, and the processor 330 may also be referred to as a central processing unit (Central Processing Unit, CPU).
  • Memory 340 may include read-only memory and random access memory, and provides instructions and data to processor 330 .
  • a portion of memory 340 may also include NVRAM.
  • various components of the computer device 30 are coupled together through a bus system 350, where the bus system 350 may include a power bus, a control bus, a status signal bus, and the like in addition to a data bus.
  • the various buses are labeled as bus system 350 in the figure.
  • the methods disclosed in the above embodiments of the present application may be applied to the processor 330 or implemented by the processor 330 .
  • the processor 330 may be an integrated circuit chip with signal processing capability. In the implementation process, each step of the above-mentioned method may be completed by an integrated logic circuit of hardware in the processor 330 or an instruction in the form of software.
  • the above-mentioned processor 330 can be a general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
  • a general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
  • the steps of the method disclosed in conjunction with the embodiments of the present application may be directly embodied as executed by a hardware decoding processor, or executed by a combination of hardware and software modules in the decoding processor.
  • the software modules may be located in random access memory, flash memory, read-only memory, programmable read-only memory or electrically erasable programmable memory, registers and other storage media mature in the art.
  • the storage medium is located in the memory 340, and the processor 330 reads the information in the memory 340, and completes the steps of the above method in combination with its hardware.
  • FIG. 13 can be understood by referring to the related description and effect of the method part of FIG. 3 , and details are not repeated here.
  • Embodiments of the present application further provide a computer-readable storage medium, where a computer program is stored in the computer-readable storage medium, and when it runs on a computer, causes the computer to execute the method described in the foregoing embodiments.
  • Embodiments of the present application also provide a computer program product including a program, which, when run on a computer, causes the computer to execute the method described in the foregoing embodiments.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Quality & Reliability (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Circuit For Audible Band Transducer (AREA)
  • Telephonic Communication Services (AREA)

Abstract

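A multi-sound-zone-based voice detection method, a related apparatus, and a storage medium, applied to the field of artificial intelligence. The method includes: acquiring sound zone information corresponding to each of N sound zones; taking each sound zone in turn as a target detection sound zone, and generating a control signal corresponding to the target detection sound zone according to the sound zone information corresponding to the target detection sound zone; processing the voice input signal corresponding to the target detection sound zone by using the control signal corresponding to the target detection sound zone, to obtain a voice output signal corresponding to the target detection sound zone; and generating a voice detection result of the target detection sound zone according to the voice output signal corresponding to the target detection sound zone (104). Because voice signals from different directions are processed in parallel across multiple sound zones, in a multi-source scenario the voice signals in different directions can be retained or suppressed by the control signals, so that the voice of each target detection user can be separated and enhanced in real time, thereby improving the accuracy of voice detection.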

Description

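Multi-sound-zone-based voice detection method, related apparatus, and storage medium
This application claims priority to Chinese Patent Application No. 202010732649.8, entitled "Multi-sound-zone-based voice detection method, related apparatus, and storage medium" and filed with the China Patent Office on July 27, 2020, which is incorporated herein by reference in its entirety.
Technical Field
This application relates to the field of artificial intelligence, and in particular to voice detection technology.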
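Background
With the wide application of far-field voice in daily life, performing voice activity detection (VAD), separation, enhancement, recognition, and calling for each possible sound source in a multi-source (or multi-user) scenario has become a bottleneck for many intelligent voice products in improving their voice interaction performance.
In a conventional solution, a mono pre-processing system based on a main-speaker detection algorithm is designed. Such a pre-processing system generally combines azimuth estimation with signal strength estimation, or azimuth estimation with spatial spectrum estimation, to estimate the speaker with the strongest signal energy (that is, the signal energy arriving at the microphone array) and that speaker's azimuth, and takes them as the main speaker and the main speaker's azimuth.
However, when there are multiple speakers in the environment, judging the main speaker solely by the arriving signal strength is flawed. The main speaker may be farther from the microphone array than an interfering speaker. Although the main speaker's volume may be greater than that of the interfering speaker, the main speaker's voice signal suffers a larger propagation loss in space, so the signal strength arriving at the microphone array may actually be lower, which degrades subsequent voice processing.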
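The propagation-loss argument above can be illustrated with a small worked example. This is only a sketch under an assumption that is not stated in the source: free-field spherical spreading, where the level drops by 20·log10(d) dB relative to the level at 1 m. The function names and the numeric levels are hypothetical and chosen purely for illustration.

```python
import math


def received_level_db(level_at_1m_db: float, distance_m: float) -> float:
    """Level arriving at the array, assuming free-field spherical spreading."""
    return level_at_1m_db - 20.0 * math.log10(distance_m)


def source_level_db(received_db: float, distance_m: float) -> float:
    """Invert the same model: estimate the level at 1 m from the received level."""
    return received_db + 20.0 * math.log10(distance_m)


# Hypothetical scene: the main speaker is louder but sits farther away.
main_rx = received_level_db(level_at_1m_db=70.0, distance_m=4.0)
interferer_rx = received_level_db(level_at_1m_db=65.0, distance_m=1.0)

# Ranking by arrival strength picks the interferer ...
loudest_at_array = "interferer" if interferer_rx > main_rx else "main"
# ... while compensating for distance recovers the louder source.
main_src = source_level_db(main_rx, 4.0)
interferer_src = source_level_db(interferer_rx, 1.0)
loudest_at_source = "main" if main_src > interferer_src else "interferer"
```

Under this model the interferer arrives stronger at the array even though the main speaker is louder at the source, which is the failure mode described above. The embodiments elsewhere in this document describe estimating the volume at the mouth from the signal strength and the speaker-to-array distance; the inverse function here is one simple way such a compensation could look.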
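Summary
Embodiments of this application provide a multi-sound-zone-based voice detection method, a related apparatus, and a storage medium. One aspect of this application provides a multi-sound-zone-based voice detection method, performed by a computer device and including:
acquiring sound zone information corresponding to each of N sound zones, where the sound zone information includes a sound zone identifier, a pointing angle, and user information; the sound zone identifier identifies the sound zone, the pointing angle indicates the central angle of the sound zone, the user information indicates the user presence status in the sound zone, and N is an integer greater than 1;
taking each sound zone in turn as a target detection sound zone, and generating a control signal corresponding to the target detection sound zone according to the sound zone information corresponding to the target detection sound zone, where the control signal is used to suppress or retain a voice input signal, and control signals are in one-to-one correspondence with sound zones;
processing the voice input signal corresponding to the target detection sound zone by using the control signal corresponding to the target detection sound zone, to obtain a voice output signal corresponding to the target detection sound zone, where the control signal, the voice input signal, and the voice output signal are in one-to-one correspondence; and
generating a voice detection result of the target detection sound zone according to the voice output signal corresponding to the target detection sound zone.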
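Another aspect of this application provides a voice detection apparatus, deployed on a computer device and including:
an acquisition module, configured to acquire sound zone information corresponding to each of N sound zones, where the sound zone information includes a sound zone identifier, a pointing angle, and user information; the sound zone identifier identifies the sound zone, the pointing angle indicates the central angle of the sound zone, the user information indicates the user presence status in the sound zone, and N is an integer greater than 1;
a generation module, configured to take each sound zone in turn as a target detection sound zone, and generate a control signal corresponding to the target detection sound zone according to the sound zone information corresponding to the target detection sound zone, where the control signal is used to suppress or retain a voice input signal, and control signals are in one-to-one correspondence with sound zones;
a processing module, configured to process the voice input signal corresponding to the target detection sound zone by using the control signal corresponding to the target detection sound zone, to obtain a voice output signal corresponding to the target detection sound zone, where the control signal, the voice input signal, and the voice output signal are in one-to-one correspondence;
the generation module being further configured to generate a voice detection result of the target detection sound zone according to the voice output signal corresponding to the target detection sound zone.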
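Another aspect of this application provides a computer device, including a memory, a transceiver, a processor, and a bus system;
where the memory is configured to store a program;
the processor is configured to execute the program in the memory, and to perform the methods of the foregoing aspects according to instructions in the program code; and
the bus system is configured to connect the memory and the processor so that the memory and the processor can communicate.
Another aspect of this application provides a computer-readable storage medium storing instructions that, when run on a computer, cause the computer to perform the methods of the foregoing aspects.
Another aspect of this application provides a computer program product or computer program including computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, causing the computer device to perform the methods provided in the various optional implementations of the foregoing aspects.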
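Brief Description of the Drawings
FIG. 1 is a schematic diagram of an environment based on a multi-user conference scenario in an embodiment of this application;
FIG. 2 is a schematic diagram of an embodiment of the voice detection system in an embodiment of this application;
FIG. 3 is a schematic diagram of an embodiment of the multi-sound-zone-based voice detection method in an embodiment of this application;
FIG. 4 is a schematic diagram of a multi-sound-zone division in an embodiment of this application;
FIG. 5 is a schematic architecture diagram of the multi-channel sound pickup system in an embodiment of this application;
FIG. 6 is another schematic architecture diagram of the multi-channel sound pickup system in an embodiment of this application;
FIG. 7 is a schematic diagram of an interface for implementing a call based on the multi-sound-zone voice detection method in an embodiment of this application;
FIG. 8 is another schematic architecture diagram of the multi-channel sound pickup system in an embodiment of this application;
FIG. 9 is a schematic diagram of an interface for implementing a dialogue response based on the multi-sound-zone voice detection method in an embodiment of this application;
FIG. 10 is another schematic architecture diagram of the multi-channel sound pickup system in an embodiment of this application;
FIG. 11 is a schematic diagram of an interface for implementing text recording based on the multi-sound-zone voice detection method in an embodiment of this application;
FIG. 12 is a schematic diagram of an embodiment of the voice detection apparatus in an embodiment of this application;
FIG. 13 is a schematic structural diagram of the computer device in an embodiment of this application.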
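Detailed Description
Embodiments of this application provide a multi-sound-zone-based voice detection method, a related apparatus, and a storage medium. In a multi-source scenario, voice signals in different directions can be retained or suppressed by control signals, so that each user's voice can be separated and enhanced in real time, thereby improving the accuracy of voice detection and the effect of subsequent voice processing.
It should be understood that the multi-sound-zone-based voice detection method provided in this application can perform speech recognition and semantic recognition when multiple users speak at the same time, and then decide which user to respond to. In far-field recognition scenarios, several people often speak at once; for example, in a conference room, in a car, or in a room with smart home devices, multiple users may speak simultaneously, which causes multi-source signals to interfere with detection. The method provided in this application can solve the signal interference problem in these scenarios. For example, in the wake-up-free scenario of a smart speaker product, multiple users in the surrounding environment often speak at the same time. In that case, the method provided in this application first determines which user should be responded to, and then recognizes the content and intent of that user's speech; the smart speaker product decides, according to the recognition result, whether to respond to the user's voice command.
For ease of understanding, the voice detection method provided in this application is introduced below with a specific scenario. Referring to FIG. 1, FIG. 1 is a schematic diagram of an environment based on a multi-user conference scenario in an embodiment of this application. As shown in the figure, in a far-field voice conference scenario, there may be several participants in the conference room at the same time, such as user 1, user 2, user 3, user 4, user 5, and user 6. The conference system may include a screen, a camera, and a microphone array, where the microphone array collects the voices of the six users, the camera captures real-time pictures of the six users, and the screen can display images of the six users as well as conference-related information. For a call application, the main speaker in the conference scenario (usually one or two main speakers) needs to be determined in real time, and the main speaker's voice is then enhanced and transmitted to the far end of the call. For the conference transcription function, it is also necessary to determine in real time whether each user is speaking, so as to separate and enhance the speaker's voice and transmit it to an automatic speech recognition (ASR) service module in the cloud, which recognizes the speech content.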
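The multi-sound-zone-based voice detection method provided in this application is applied to the voice detection system shown in FIG. 2. Referring to FIG. 2, FIG. 2 is a schematic diagram of an embodiment of the voice detection system in an embodiment of this application. As shown in the figure, the voice detection system includes a server and a terminal device. The server involved in this application may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery network (CDN), and big data and artificial intelligence platforms. The terminal device may be a smart TV, a smart speaker, a smartphone, a tablet computer, a notebook computer, a palmtop computer, a personal computer, or the like, but is not limited thereto.
In the voice detection system, the terminal device and the server can communicate through a wireless network, a wired network, or a removable storage medium. The wireless network uses standard communication technologies and/or protocols. The wireless network is usually the Internet, but may be any network, including but not limited to Bluetooth, a local area network (LAN), a metropolitan area network (MAN), a wide area network (WAN), a mobile network, a private network, a virtual private network, or any combination thereof. In some embodiments, custom or dedicated data communication technologies may be used in place of or in addition to the data communication technologies described above. The removable storage medium may be a universal serial bus (USB) flash drive, a portable hard disk, or another removable storage medium, which is not limited in this application. Although only five types of terminal devices are shown in FIG. 2, it should be understood that the example in FIG. 2 is only for understanding the solution and should not be construed as a limitation of this application.
Based on the voice detection system shown in FIG. 2, the microphone array equipped on the terminal device picks up voice signals and other sounds in the environment, and then transmits the collected digital signals to a pre-processing module for voice signals. The pre-processing module performs target voice extraction, enhancement, VAD detection, speaker detection, main speaker detection, and so on; the specific processing is determined flexibly according to the scenario and functional requirements. The voice signal enhanced by the pre-processing module can be sent to the server, where a speech recognition module, a voice call module, or the like deployed on the server processes the enhanced voice signal.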
结合上述介绍,下面将对本申请中基于多音区的语音检测方法进行介绍,请参阅图3,本申请实施例中基于多音区的语音检测方法的一个实施例包括:
101、获取N个音区内每个音区所对应的音区信息,其中,音区信息包括音区标识、指向角度以及用户信息,音区标识用于标识音区,指向角度用于指示音区的中心角度,用户信息用于指示音区内的用户存留情况,N为大于1的整数。
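In this embodiment, the space within the visual range is first divided into N sound zones. For ease of description, refer to FIG. 4, a schematic diagram of one way of dividing multiple sound zones in an embodiment of this application. As shown, suppose a 360-degree space is evenly divided into 12 sound zones of 30 degrees each, with the center angle of the i-th zone denoted θ_i, i = 1, ..., N; for example, θ_1 = 15 degrees, θ_2 = 45 degrees, θ_3 = 75 degrees, and so on. Note that FIG. 4 is only an example. In practice N is an integer greater than or equal to 2, for example 12, 24, or 36, and the number of zones depends on the available computing budget. The space within the visual range may also be divided non-uniformly, which is not limited here. Each sound zone corresponds to one sound source. If two or more users are present in one zone, they may be treated as the same person, so in practice the zones can be made sufficiently fine.
After the zones are divided, the speech detection apparatus obtains the sound zone information of each zone, which includes a zone identifier, a pointing angle, and user information. For example, the information of the 1st zone can be written {(i, θ_i, λ_i)} with i = 1, that of the 2nd zone {(i, θ_i, λ_i)} with i = 2, and so on. Here i denotes the i-th zone, θ_i the pointing angle of the i-th zone, and λ_i the user information of the i-th zone. The user information indicates whether a user is present in the zone: if no user is detected in the i-th zone, λ_i may be set to -1; if a user is detected in the i-th zone, λ_i may be set to 1.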
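The zone division and the {(i, θ_i, λ_i)} record described above can be sketched as follows. This is a minimal illustration assuming a uniform split of 360 degrees; the names `ZoneInfo` and `build_zones` are hypothetical and do not come from the application.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class ZoneInfo:
    zone_id: int      # i, 1-based zone identifier
    angle_deg: float  # theta_i, center (pointing) angle of the zone
    user_flag: int    # lambda_i: 1 = user present, -1 = no user


def build_zones(n: int, occupied: Set[int]) -> List[ZoneInfo]:
    """Split 360 degrees evenly into n zones and tag which ones hold a user."""
    if n < 2:
        raise ValueError("N must be an integer greater than 1")
    width = 360.0 / n
    return [
        ZoneInfo(i, (i - 0.5) * width, 1 if i in occupied else -1)
        for i in range(1, n + 1)
    ]


# Hypothetical example: 12 zones, users detected in zones 2 and 5.
zones = build_zones(12, occupied={2, 5})
```

With N = 12 this reproduces the center angles given in the text (15, 45, 75 degrees, and so on). A non-uniform division would replace the `(i - 0.5) * width` expression with a lookup table.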
需要说明的是,本申请实施例提供的方法可以由计算机设备执行,具体可以由计算机设备上部署的语音检测装置执行,计算机设备可以是终端设备,也可以是服务器,即语音检测装置可部署于终端设备,也可以部署于服务器。当然,语音检测装置还可以部署于语音检测系统,即语音检测装置可基于多声道拾音系统实现本申请提供的方法。
102、将每个音区分别作为目标检测音区,根据目标检测音区所对应的音区信息,生成目标检测音区所对应的控制信号,其中,控制信号用于对语音输入信号进行抑制处理或保留处理,控制信号与音区具有一一对应的关系。
本实施例中,在语音检测装置获取到N个音区中每个音区所对应的音区信息之后,可以将每个音区分别作为目标检测音区,根据目标检测音区所对应的音区信息,生成目标检测音区所对应的控制信号,从而分别生成每个音区所对应的控制信号,控制信号能够对通过麦克风阵列获取到的语音输入信号进行抑制处理或保留处理。假设检测到第i个音区内不存在用户,即表示该音区上的语音输入信号属于噪声(非正常人声),因此,针对于该音区生成的控制信号可以对语音输入信号进行抑制处理。假设检测到第i个音区内存在用户且该音区上的语音输入信号属于正常人声,那么针对于该音区生成的控制信号可以对语音输入信号进行保留处理。
Note that whether a user is present in a sound zone may be detected using computer vision (CV) technology, or estimated by means of spatial spectrum estimation.
103、采用目标检测音区所对应的控制信号,对目标检测音区所对应的语音输入信号进行处理,得到目标检测音区所对应的语音输出信号,其中,控制信号、语音输入信号以及语音输出信号具有一一对应的关系。
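In this embodiment, after obtaining the control signal of each of the N sound zones, the speech detection apparatus again takes each zone in turn as the target detection zone and uses that zone's control signal to process the corresponding speech input signal, obtaining the zone's speech output signal. In other words, the control signal of each zone suppresses or retains the speech input signal of that zone, and one speech output signal is produced per zone. For example, if no user is present in the i-th zone, its control signal may be “0”, meaning the corresponding speech input signal is suppressed. If a user who is speaking normally is present in the i-th zone, its control signal may be “1”, meaning the corresponding speech input signal is retained; that signal may further be extracted, separated, and enhanced.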
104、根据目标检测音区所对应的语音输出信号,生成目标检测音区的语音检测结果。
本实施例中,为了提升语音输出信号的质量,语音检测装置还可以对每个音区所对应的语音输出信号进行后处理,即可以将每个音区分别作为目标检测音区,根据目标检测音区所对应的语音输出信号,生成目标检测音区的语音检测结果。例如,对目标检测音区所对应的语音输出信号进行跨声道后处理以及降噪后处理等,最后对经过后处理的语音输出信号进行检测,最终生成每个音区的语音检测结果,进而确定是否响应来源于该音区的语音。在一些情况下,语音检测装置可检测每个音区是否符合人声匹配条件,假设第i个音区符合人声匹配条件,则第i个音区的语音检测结果可以为“第i个音区存在用户”。又假设第i个音区不符合人声匹配条件,则第i个音区的语音检测结果为“第i个音区无用户”。
本申请可基于多声道拾音系统实现语音检测,请参阅图5,图5为本申请实施例中多声道拾音系统的一个架构示意图,如图所示,在终端设备上装备的麦克风阵列可拾取每个音区所对应的音频信号,音频信号包括语音输入信号以及噪声信号。由信号分离器生成每个音区所对应的控制信号,采用各个音区对应的控制信号分别对对应的指向角度的语音输入信号进行抑制或保留处理,再分别对每个语音输出信号进行跨声道后处理和降噪后处理,由此得到每个音区对应的目标语音输出信号。最后,基于每个音区的音区信息和目标语音输出信号,确定语音检测结果,即得到N个音区中每个音区的语音检测结果。
本申请实施例中,提供了一种基于多音区的语音检测方法,首先获取N个音区内每个音区所对应的音区信息,该音区信息包括音区标识、指向角度以及用户信息,于是可以将每个音区分别作为目标检测音区,根据目标检测音区所对应的音区信息,生成目标检测音区所对应的控制信号,然后采用目标检测音区所对应的控制信号,对目标检测音区所对应的语音输入信号进行处理,得到目标检测音区所对应的语音输出信号,最后根据目标检测音区所对应的语音输出信号,生成目标检测音区的语音检测结果,即得到每个音区所对应的语音检测结果,便于根据语音检测结果确定是否响应于该音区对应的用户。采用上述方式,基于多个音区并行处理来自不同方向的语音信号,进而在多声源的场景下,可通过控制信号对不同方向上的语音信号进行保留或者抑制,从而能够实时分离和增强每个用户的语音,由此提升语音检测的准确度,有利于提升后续的语音处理效果。
在上述图3对应的实施例的基础上,本申请实施例提供的一个可选实施例中,获取N个音区内每个音区所对应的音区信息,可以包括如下步骤:
对N个音区内每个音区进行检测,得到每个音区分别对应的用户检测结果;
将每个音区分别作为目标检测音区,根据目标检测音区所对应的用户检测结果确定目标检测音区所对应的用户信息;
根据目标检测音区所对应的用户检测结果确定目标检测音区所对应的唇部运动信息;
获取目标检测音区所对应的音区标识以及目标检测音区所对应的指向角度;
根据目标检测音区所对应的用户信息、目标检测音区所对应的唇部运动信息、目标检测音区所对应的音区标识以及目标检测音区所对应的指向角度,生成目标检测音区所对应的音区信息。
本实施例中,介绍了一种基于CV技术获取音区信息的方式,CV技术可以采用神经网络来实现,通常情况下,还需要配置对应的摄像头来捕捉用户画面,该摄像头可以是采用1个广角摄像头覆盖,对于360度的空间而言,可以采用2到3个广角摄像头以拼接的方式进行全面覆盖。利用CV技术可以检测空间内的每个人,并对其进行编号,还可以提供相关信息,例如,用户的身份信息、人脸方位角、唇部运动信息、脸部朝向以及人脸距离等,针对N个音区中的每个音区进行检测,分别得到每个音区所对应的用户检测结果。本申请以用户检测结果包括用户的身份信息以及唇部运动信息为例进行说明,然而这不应理解为对本申请的限定。
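The user detection result includes user information and lip motion information. The user information covers whether a user is present and, if so, whether the user's identity can be extracted. For example, a user is present in the 2nd zone and is recognized as “小李” with the identity identifier “01011”; no user is present in the 5th zone, so no recognition is needed there. The lip motion information indicates whether the user's lips are moving. Since lips normally move during speech, lip motion information helps determine whether the user is speaking. Combined with the pre-divided zones, the zone identifier and pointing angle of each zone are known, so the sound zone information {(i, θ_i, λ_i, L_i)}, i = 1, ..., N, can be generated for each zone. Here i denotes the i-th zone, θ_i the pointing angle of the i-th zone, λ_i the user information of the i-th zone, and L_i the lip motion information of the i-th zone.
Secondly, this embodiment provides a way of obtaining sound zone information based on CV technology. CV technology can detect more information about each zone; it effectively “sees” what is happening in the zone, such as whether a user is present, who the user is, and whether the user's lips are moving. This integrates multimodal information, the visual dimension further improves the accuracy of speech detection, and it also provides a feasible basis for subsequent video-related processing.
On the basis of the embodiment corresponding to FIG. 3, in another optional embodiment provided by this application, determining the user information of the target detection zone from the user detection result of the target detection zone specifically includes the following steps: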
若目标检测音区所对应的用户检测结果为目标检测音区内存在可识别用户,则将第一身份标识确定为用户信息;
若目标检测音区所对应的用户检测结果为目标检测音区内不存在用户,则将第二身份标识确定为用户信息;
若目标检测音区所对应的用户检测结果为目标检测音区内存在未知用户,则将第三身份标识确定为用户信息;
根据目标检测音区所对应的用户检测结果确定目标检测音区所对应的唇部运动信息,具体包括如下步骤:
若目标检测音区所对应的用户检测结果为目标检测音区内存在具有唇部运动的用户,则将第一运动标识确定为唇部运动信息;
若目标检测音区所对应的用户检测结果为目标检测音区内存在用户,且用户不具有唇部运动,则将第二运动标识确定为唇部运动信息;
若目标检测音区所对应的用户检测结果为目标检测音区内不存在用户,则将第三运动标识确定为唇部运动信息。
本实施例中,介绍了一种基于CV技术提取唇部运动信息和用户信息的具体方式,由于用户信息和唇部运动信息需要结合实际情况进行确定,因此,需要对每个音区内的用户信息和唇部运动信息进行检测,下面将进行详细介绍。
一、针对用户信息的识别方式;
为了便于说明,本申请将以N个音区中的任意一个音区为例进行介绍,其他音区采用类似地方式确定用户信息,故此处不做赘述,任意一个音区可以作为目标检测音区,假设该音区为第i个音区,基于第i个音区的用户检测结果可确定第i个音区内是否存在用户,以及存在用户的情况下,是否能够得到该用户的身份信息。第i个音区所对应的用户信息表示为λ i,即表示为在指向角度为θ i的方向上的用户信息。如果指向角度为θ i的方向上存在用户,且能够确定该用户的身份信息,即表示能够确定该用户的姓名和身份标识,则λ i为该用户的第一身份标识,如“5”。如果指向角度为θ i的方向上没有用户,则λ i可设置为特殊值,即设置为第二身份标识,如“-1”。如果不具有人脸识别的功能,即无法确定该用户的身份信息,则可以将λ i设置为另一个特殊值,即第三身份标识,如“0”,由此告知后续的处理模块该方向虽然有用户,但是身份未知,如有必要,可以通过声纹识别的方式进一步识别该用户的身份信息。
二、针对唇部运动信息的识别方式;
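For ease of description, any one of the N sound zones is taken as an example; the other zones determine lip motion information in a similar way, which is not repeated here. Any zone may serve as the target detection zone. Suppose it is the i-th zone. From the user detection result of the i-th zone it can be determined whether a user is present and, if so, whether that user's lips are moving. The camera is generally a fixed wide-angle camera. A CV algorithm detects all people and faces within the field of view, crops out the local face images, and detects whether the lips on each face are moving. The lip motion information of the i-th zone is denoted L_i, that is, the lip motion information in the direction of pointing angle θ_i. If a user is present in the direction of θ_i and the user is determined to have lip motion, L_i may be set to the first motion identifier, e.g. “0”. If a user is present in the direction of θ_i but has no lip motion, L_i may be set to the second motion identifier, e.g. “1”. If no user is present in the direction of θ_i, L_i may be set to a special value, namely the third motion identifier, e.g. “-1”.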
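The two identifier schemes above (λ_i for identity, L_i for lip motion) can be summarized in a small sketch. The numeric values follow the examples in the text (“-1”, “0”, “1”); the function and constant names are hypothetical and introduced only for illustration.

```python
from typing import Optional

NO_USER_ID = -1       # second identity identifier: the zone is empty
UNKNOWN_USER_ID = 0   # third identity identifier: user present, identity unknown

LIP_MOVING = 0        # first motion identifier: user present, lips moving
LIP_STILL = 1         # second motion identifier: user present, lips not moving
LIP_NO_USER = -1      # third motion identifier: no user in the zone


def user_info(present: bool, identity: Optional[int]) -> int:
    """Return lambda_i for one zone."""
    if not present:
        return NO_USER_ID
    # A recognized user keeps their own (first) identity identifier, e.g. 5.
    return identity if identity is not None else UNKNOWN_USER_ID


def lip_info(present: bool, lips_moving: bool) -> int:
    """Return L_i for one zone."""
    if not present:
        return LIP_NO_USER
    return LIP_MOVING if lips_moving else LIP_STILL
```

Keeping the “no user” value identical (-1) in both schemes mirrors the examples in the text, but nothing in the source requires the two to coincide.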
再次,本申请实施例中,提供了一种基于CV技术提取唇部运动信息和用户信息的具体方式,采用上述方式,能够从多个方面分析用户的用户信息以及唇部运动信息,尽可能增加识别的可行性,在多个维度上对每个音区所包括的信息进行分析,从而提升方案的可操作性。
在上述图3对应的实施例的基础上,本申请实施例提供的另一个可选实施例中,根据目标检测音区所对应的音区信息,生成目标检测音区所对应的控制信号,具体包括如下步骤:
若目标检测音区所对应的用户信息用于指示目标检测音区内不存在用户,则生成第一控制信号,其中,第一控制信号属于控制信号,第一控制信号用于对语音输入信号进行抑制处理;
若目标检测音区所对应的用户信息用于指示目标检测音区内存在用户,则生成第二控制信号,其中,第二控制信号属于控制信号,第二控制信号用于对语音输入信号进行保留处理。
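This embodiment describes a way of generating the control signal without CV technology. Without CV, the user's identity cannot be recognized and the user's lip motion information cannot be obtained. In this case, spatial spectrum estimation can be used to estimate whether a user is present in the current zone, which yields the sound zone information of the N zones, written {(i, θ_i, λ_i)}, i = 1, ..., N.
For ease of description, any one of the N sound zones is taken as an example; the other zones generate their control signals in a similar way, which is not repeated here. Any zone may serve as the target detection zone. Suppose it is the i-th zone, whose sound zone information is {(i, θ_i, λ_i)}. The user information λ_i indicates either that no user is present in the direction of pointing angle θ_i or that a user is present in that direction; if necessary, the user's identity can be further determined by voiceprint recognition, which is not detailed here. When generating the control signal, if no user is detected in the i-th zone, the signal separator learns and suppresses all signals at pointing angle θ_i; that is, the signal separator generates the first control signal, which is used to suppress all signals at pointing angle θ_i. If a user is detected in the i-th zone, the signal separator learns and retains the signal at pointing angle θ_i; that is, the signal separator generates the second control signal, which is used to retain the signal at pointing angle θ_i.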
其次,本申请实施例中,提供了一种不采用CV技术的情况下生成控制信号的方式,采用上述方式,能够仅利用音频数据生成控制信号,一方面增加方案的灵活性,另一方面基于较少的信息量也可以生成控制信号,从而节省了运算资源,有利于提升控制信号的生成效率,对于设备而言还可以节省电量。
在上述图3对应的实施例的基础上,本申请实施例提供的另一个可选实施例中,根据目标检测音区所对应的音区信息,生成目标检测音区所对应的控制信号,具体包括如下步骤:
若目标检测音区所对应的用户信息用于指示目标检测音区内不存在用户,则生成第一控制信号,其中,第一控制信号属于控制信号,第一控制信号用于对语音输入信号进行抑制处理;
若目标检测音区所对应的用户信息用于指示目标检测音区内存在用户,且用户不具有唇部运动,则生成第一控制信号;
若目标检测音区所对应的用户信息用于指示目标检测音区内存在用户,且用户具有唇部运动,则生成第二控制信号,其中,第二控制信号属于控制信号,第二控制信号用于对语音输入信号进行保留处理;
若目标检测音区所对应的用户信息用于指示目标检测音区内存在用户,且未知用户的唇部运动情况,则根据原始音频信号生成第一控制信号或第二控制信号。
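This embodiment describes a way of generating the control signal with CV technology. With CV, the user's identity can be recognized and the user's lip motion information obtained. In this case, whether a user is present in the current zone may be estimated by CV alone or jointly by CV and spatial spectrum estimation, which yields the sound zone information of the N zones, written {(i, θ_i, λ_i, L_i)}, i = 1, ..., N.
For ease of description, any one of the N sound zones is taken as an example; the other zones generate their control signals in a similar way, which is not repeated here. Any zone may serve as the target detection zone. Suppose it is the i-th zone, whose sound zone information is {(i, θ_i, λ_i, L_i)}. The user information λ_i may be the first, second, or third identity identifier, and the lip motion information L_i may be the first, second, or third motion identifier. Specifically, when generating the control signal, if no user is detected in the i-th zone, the signal separator learns and suppresses all signals at pointing angle θ_i; that is, the signal separator generates the first control signal, which is used to suppress all signals at pointing angle θ_i. If a user is detected in the i-th zone, it must further be determined whether that user has lip motion.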
如果检测到第i个音区内存在用户,但该用户不具有唇部运动,则可以通过信号分离器学习并抑制在指向角度θ i上的所有信号,即通过信号分离器生成第一控制信号,利用第一控制信号对指向角度θ i上的所有信号进行抑制处理。
如果检测到第i个音区内存在用户,且该用户具有唇部运动,则可以通过信号分离器学习并保留在指向角度θ i上的信号,即通过信号分离器生成第二控制信号,利用第二控制信号对指向角度θ i上的信号进行保留处理。
如果检测到第i个音区内存在用户,但是可能由于人脸不清晰,或者人头偏转角度较大,导致摄像头无法清楚地拍摄到嘴唇部分等原因,导致无法确定该用户的唇部运动情况,于是需要对指向角度θ i上输入的原始音频信号进行空间谱估计或方位估计等,粗略地判断该用户是否在发声,如果确定该用户正在发声,则可以通过信号分离器学习并保留在指向角度θ i上的信号,即通过信号分离器生成第二控制信号,利用第二控制信号对指向角度θ i上的信号进行保留处理。如果确定该用户未发声,则可以通过信号分离器学习并抑制在指向角度θ i上的所有信号,即通过信号分离器生成第一控制信号,利用第一控制信号对指向角度θ i上的所有信号进行抑制处理。
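The four branches above can be written as one decision function. This is a sketch under stated assumptions: the “0”/“1” control values follow the examples given for step 103, `lip_flag=None` is a hypothetical encoding for “lip state unknown” (the text assigns no value to that case), and `audio_suggests_speech` stands in for the outcome of the spatial spectrum or direction estimate on the original audio signal.

```python
from typing import Optional

SUPPRESS = 0  # first control signal: suppress everything at angle theta_i
RETAIN = 1    # second control signal: retain the signal at angle theta_i


def control_signal(user_flag: int,
                   lip_flag: Optional[int],
                   audio_suggests_speech: bool = False) -> int:
    """user_flag: -1 = no user, any other value = a user is present.
    lip_flag: 0 = lips moving, 1 = lips still, None = lip state unknown."""
    if user_flag == -1:
        return SUPPRESS          # empty zone
    if lip_flag == 1:
        return SUPPRESS          # user present but not speaking
    if lip_flag == 0:
        return RETAIN            # user present and lips moving
    # Lip state unknown (face unclear or turned away):
    # fall back to the audio-only estimate.
    return RETAIN if audio_suggests_speech else SUPPRESS
```

In the CV-free embodiment only the first test applies: any zone with a user is retained, which corresponds to calling the function with `lip_flag=0`.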
再次,本申请实施例中,提供了一种采用CV技术的情况下生成控制信号的方式,采用上述方式,同时利用了音频数据和图像数据作为生成控制信号的依据,一方面增加方案的灵活性,另一方面基于更多的信息量生成的控制信号会更加准确,从而提升语音检测的准确度。
在上述图3对应的实施例的基础上,本申请实施例提供的另一个可选实施例中,根据目标检测音区所对应的音区信息,生成目标检测音区所对应的控制信号,具体包括如下步骤:
根据目标检测音区所对应的音区信息,采用预设算法生成目标检测音区所对应的控制信号,其中,预设算法为自适应波束形成算法、盲源分离算法或基于深度学习的语音分离算法;
采用目标检测音区所对应的控制信号,对目标检测音区所对应的语音输入信号进行处理,得到目标检测音区所对应的语音输出信号,具体包括如下步骤:
若预设算法为自适应波束形成算法,则根据目标检测音区所对应的控制信号,采用自适应波束形成算法对目标检测音区所对应的语音输入信号进行处理,得到目标检测音区所对应的语音输出信号;
若预设算法为盲源分离算法,则根据目标检测音区所对应的控制信号,采用盲源分离算法对目标检测音区所对应的语音输入信号进行处理,得到目标检测音区所对应的语音输出信号;
若预设算法为基于深度学习的语音分离算法,则根据目标检测音区所对应的控制信号,采用基于深度学习的语音分离算法对目标检测音区所对应的语音输入信号进行处理,得到目标检测音区所对应的语音输出信号。
本实施例中,介绍了一种基于控制信号实现信号分离的方式,在生成控制信号时所采用的预设算法,与在实际应用中,对信号进行分离所使用的算法一致,本申请提供了三种预设算法,分别为自适应波束形成算法、盲源分离算法或基于深度学习的语音分离算法。下面将结合这三种预设算法对信号分离进行介绍。
一、自适应波束形成算法;
自适应波束形成又称自适应空域滤波,可通过对各阵元加权进行空域滤波,以达到增强有用信号,并抑制干扰的目的,此外,还可以根据信号环境的变化,来改变各阵元的加权因子。在理想的条件下,自适应波束形成技术可以有效地抑制干扰而保留期望的信号,从而使阵列的输出信号干扰噪声比达到最大。
二、盲源分离算法;
盲源分离(Blind Source Separation,BSS)的含义是在不知道源信号及信号混合参数的情况下,仅根据观测到的混合信号估计源信号。独立分量分析(Independent Component Analysis,ICA)是为了解决盲信号分离问题而逐渐发展起来的一种新技术。盲信号分离大部分都采用独立分量分析的方法,即将接收到的混合信号按照统计独立的原则通过优化算法分解为若干独立分量,这些独立分量作为源信号的一种近似估计。
三、基于深度学习的语音分离算法;
基于深度学习的语音分离,主要是用基于深度学习的方法,从训练数据中学习语音、说话人和噪音的特征,从而实现语音分离的目标。具体可以使用多层感知机、深度神经网络(Deep Neural Network,DNN)、卷积神经网络(Convolutional Neural Networks,CNN)、长短时记忆(Long Short-Term Memory,LSTM)网络以及生成式对抗网络(Generative Adversarial Networks,GAN)等,此处不做限定。
其中,采用GAN进行语音增强时,模型中通常会把生成器设置为全部是卷积层,为了减少训练参数从而缩短训练时间;判别器负责向生成器提供生成数据的真伪信息,帮助生成器向着“生成干净声音”的方向微调。
其次,本申请实施例中,提供了一种基于控制信号实现信号分离的方式,采用上述方式,若使用自适应波束形成算法生成控制信号,那么在信号分离的时候也使用自适应波束形成算法,若使用盲源分离算法生成控制信号,那么在信号分离的时候也使用盲源分离算法,若使用基于深度学习的语音分离算法生成控制信号,那么在信号分离的时候也使用基于深度学习的语音分离算法。从而使得控制信号能够更好地配合协调信号的分离,达到更好的信号分离效果,进而提升语音检测的准确度。
在上述图3对应的实施例的基础上,本申请实施例提供的另一个可选实施例中,根据目标检测音区所对应的语音输出信号,生成目标检测音区的语音检测结果,具体包括如下步骤:
根据目标检测音区所对应的语音输出信号,确定目标检测音区所对应的信号功率,其中,信号功率为语音输出信号在时频点上的信号功率;
根据目标检测音区所对应的信号功率,确定目标检测音区所对应的估计信噪比;
根据目标检测音区所对应的估计信噪比,确定目标检测音区所对应的输出信号加权值,其中,输出信号加权值为语音输出信号在时频点上的加权结果;
根据目标检测音区所对应的输出信号加权值以及目标检测音区所对应的语音输出信号,确定目标检测音区所对应的目标语音输出信号;
根据目标检测音区所对应的目标语音输出信号,确定目标检测音区所对应的语音检测结果。
本实施例中,介绍了一种对语音输出信号进行跨声道后处理的方式,由于经过信号分离的语音输出信号并非总是洁净的,如果每个指向角度对应的语音输出信号都具有较高的信噪比,那么可以进行跨声道的后处理。需要说明的是,在语音输出信号的信噪比达到-5分贝以上的情况下,可认为信噪比较高,然而,还可以根据实际情况调整信噪比临界值,“-5分贝”仅为一个示意,不应理解为对本申请的限定。
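Taking each sound zone in turn as the target detection zone, one implementation of cross-channel post-processing is as follows: first determine the signal power of the target detection zone from its speech output signal, then compute the estimated signal-to-noise ratio of the target detection zone, then determine its output signal weight, and finally determine its target speech output signal from the output signal weight and the speech output signal; the speech detection result of the target detection zone is determined from this target speech output signal. For ease of description, any one of the N sound zones is taken as an example below; the other zones determine their target speech output signals in a similar way, which is not repeated here. Any zone may serve as the target detection zone. Suppose it is the i-th zone with pointing angle θ_i. For each time-frequency point (t, f) at pointing angle θ_i, the estimated signal-to-noise ratio of the i-th zone is computed as follows: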
Figure PCTCN2021100472-appb-000001
其中,μ i(t,f)表示第i个音区的估计信噪比,P i(t,f)表示指向角度为θ i方向的语音输出信号在时频点(t,f)上的信号功率,N表示N个音区(也可作为N个指向角度),j表示第j个音区(也可作为第j个指向角度),i表示第i个音区(也可作为第i个指向角度),t表示时间,f表示频率。
接下来,可采用维纳滤波的公式计算第i个音区的输出信号加权值:
Figure PCTCN2021100472-appb-000002
其中,g i(t,f)表示第i个音区的输出信号加权值,即产生对指向角度为θ i方向的语音输出信号在时频点(t,f)的加权。
最后,基于第i个音区的输出信号加权值以及第i个音区的语音输出信号,计算第i个音区的目标语音输出信号:
y i(t,f)=x i(t,f)*g i(t,f);
其中,y i(t,f)表示第i个音区的目标语音输出信号,即跨声道后处理算法在指向角度为θ i方向的目标语音输出信号。x i(t,f)表示第i个音区的语音输出信号,即指向角度为θ i方向的语音输出信号。可以理解的是,本实施例中的目标语音输出信号y i(t,f)是未经过降噪处理的语音输出信号。
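The cross-channel post-processing chain (power, estimated SNR, weight, weighted output) can be sketched for a single time-frequency point as below. The two formula images of the original did not survive extraction, so both formulas here are assumptions rather than the application's own: the SNR is taken as the zone's power over the strongest competing zone, and the weight uses the textbook Wiener form μ / (μ + 1). Only the last step, y_i = x_i * g_i, is stated in the text. All function names are hypothetical.

```python
from typing import List


def cross_channel_gains(powers: List[float], eps: float = 1e-12) -> List[float]:
    """powers[i] is P_i(t, f) at one time-frequency point, one entry per zone.

    Requires at least two zones (N > 1), as in the method itself.
    """
    gains = []
    for i, p_i in enumerate(powers):
        # ASSUMPTION: the interference for zone i is the strongest other zone.
        interference = max(p for j, p in enumerate(powers) if j != i)
        mu = p_i / (interference + eps)   # estimated SNR mu_i(t, f)
        gains.append(mu / (mu + 1.0))     # ASSUMPTION: Wiener-style weight g_i(t, f)
    return gains


def apply_gains(x: List[complex], gains: List[float]) -> List[complex]:
    """y_i(t, f) = x_i(t, f) * g_i(t, f), as given in the text."""
    return [x_i * g_i for x_i, g_i in zip(x, gains)]


# Hypothetical spectra of three zones at one time-frequency point.
x = [2.0 + 0j, 1.0 + 0j, 0.5 + 0j]
powers = [abs(v) ** 2 for v in x]
g = cross_channel_gains(powers)
y = apply_gains(x, g)
```

Under these assumptions the dominant zone keeps most of its energy while weaker zones, which are more likely to contain leakage from the dominant one, are attenuated. A sum over the other zones instead of the maximum would be an equally plausible reading of the symbol list (N, j, i).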
其次,本申请实施例中,提供了一种对语音输出信号进行跨声道后处理的方式,采用上述方式,考虑到不同音区之间的关联关系,可通过跨声道后处理的方式更好地分离出语音信号,尤其在信噪比足够高的情况下,能够提升语音信号的纯净度,从而进一步提高输出信号质量。
在上述图3对应的实施例的基础上,本申请实施例提供的另一个可选实施例中,根据目标检测音区所对应的输出信号加权值以及目标检测音区所对应的语音输出信号,确定目标检测音区所对应的目标语音输出信号,具体包括如下步骤:
根据目标检测音区所对应的输出信号加权值以及目标检测音区所对应的语音输出信号,确定目标检测音区所对应的待处理语音输出信号;
对目标检测音区所对应的待处理语音输出信号进行降噪处理,得到目标检测音区所对应的目标语音输出信号。
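This embodiment describes a way of applying noise reduction to the to-be-processed speech output signal. For ease of description, any one of the N sound zones is taken as an example below; the other zones determine their target speech output signals in a similar way, which is not repeated here. Any zone may serve as the target detection zone. Suppose it is the i-th zone with pointing angle θ_i. In the preceding embodiment, the target speech output signal of the i-th zone is computed from the output signal weight and the speech output signal of the i-th zone. If noise reduction is required, however, what is computed from the output signal weight and the speech output signal of the i-th zone is instead the to-be-processed speech output signal of the i-th zone, as follows: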
y′ i(t,f)=x i(t,f)*g i(t,f);
其中,y′ i(t,f)表示第i个音区的待处理语音输出信号,即跨声道后处理算法在指向角度为θ i方向的待处理语音输出信号。x i(t,f)表示第i个音区的语音输出信号,即指向角度为θ i方向的语音输出信号。可以理解的是,与前述实施例不同,本实施例中的待处理语音输出信号y′ i(t,f)是未经过降噪处理的语音输出信号,而本实施例中的目标语音输出信号y i(t,f)是经过降噪处理的语音输出信号。
基于此,再对待处理语音输出信号y′ i(t,f)进行降噪处理,得到每个音区所对应的目标语音输出信号y i(t,f)。
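Note that one feasible filtering approach is noise reduction with a least mean square (LMS) adaptive filter, which uses the filter parameters obtained at the previous instant to adjust the current filter parameters automatically, adapting to unknown or randomly varying statistics of the signal and noise so as to achieve optimal filtering. Another feasible approach is an LMS adaptive notch filter. The adaptive notch filter suits monochromatic interference such as single-frequency sinusoidal noise; ideally the notch has arbitrarily narrow shoulders so that the response returns to the flat region immediately. Another feasible approach is basic spectral subtraction. Because the to-be-processed speech output signal is insensitive to phase, the phase before subtraction is reused for the signal after subtraction; once the subtracted magnitude is obtained, it is combined with the phase angle and the inverse fast Fourier transform (IFFT) yields the target speech output signal. Another feasible approach is Wiener filtering. These are only feasible examples; other noise reduction methods may be used in practice, which is not limited here.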
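Of the options listed, the LMS adaptive filter is the simplest to illustrate. The sketch below is a generic LMS noise canceller, not the application's implementation: it assumes a separate noise reference channel is available, and the tap count, step size, and synthetic test signal are all hypothetical.

```python
import random
from typing import List, Tuple


def lms_cancel(primary: List[float], reference: List[float],
               taps: int = 4, step: float = 0.05) -> Tuple[List[float], List[float]]:
    """Subtract the part of `primary` that is linearly predictable from `reference`."""
    weights = [0.0] * taps
    cleaned = []
    for n in range(len(primary)):
        # Most recent `taps` reference samples, newest first, zero-padded at the start.
        window = [reference[n - k] if n >= k else 0.0 for k in range(taps)]
        noise_estimate = sum(w * u for w, u in zip(weights, window))
        error = primary[n] - noise_estimate  # cleaned output sample
        # LMS update: the previous weights are nudged along the instantaneous gradient.
        weights = [w + step * error * u for w, u in zip(weights, window)]
        cleaned.append(error)
    return cleaned, weights


# Synthetic check: the primary channel contains only scaled reference noise,
# so the cleaned output should decay toward zero as the filter adapts.
rng = random.Random(0)
reference = [rng.choice((-1.0, 1.0)) for _ in range(2000)]
primary = [0.5 * r for r in reference]
cleaned, weights = lms_cancel(primary, reference)
```

The weight update line is the “uses the filter parameters obtained at the previous instant” step from the text. The step size must stay small relative to the reference power for the recursion to remain stable.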
再次,本申请实施例中,提供了一种对待处理语音输出信号进行降噪处理的方式,采用上述方式,能够进一步抑制噪声、干扰人声以及残留回声等,由此能够更好地提升目标语音输出信号的质量,有利于增加语音检测的准确度。
在上述图3对应的实施例的基础上,本申请实施例提供的另一个可选实施例中,根据目标检测音区所对应的语音输出信号,生成目标检测音区的语音检测结果,具体包括如下步骤:
若目标检测音区所对应的目标语音输出信号满足人声匹配条件,则生成第一语音检测结果,其中,第一语音检测结果属于语音检测结果,第一语音检测结果表示目标语音输出信号为人声信号;
若目标检测音区所对应的目标语音输出信号不满足人声匹配条件,则生成第二语音检测结果,其中,第二语音检测结果属于语音检测结果,第二语音检测结果表示目标语音输出信号为噪声信号。
本实施例中,介绍了一种对每个音区进行语音检测的方式,在语音检测过程中,需要判断每个音区的语音输出信号是否满足人声匹配条件,需要说明的是,本实施例中的“目标语音输出信号”为语音输出信号经过跨声道后处理以及降噪后处理得到的。如果语音输出信号未经过跨声道后处理和降噪后处理,则对“语音输出信号”进行语音检测。如果语音输出信号仅经过跨声道后处理,而未经过降噪后处理,则可以对“待处理语音输出信号”进行语音检测。本申请以“目标语音输出信号”为例进行说明,但这不应理解为对本申请的限定。
下面将介绍如何基于目标语音输出信号判断是否满足人声匹配条件,为了便于说明,下面将以N个音区中的任意一个音区为例进行介绍,其他音区也采用类似方式确定语音检测结果,此处不做赘述,任意一个音区可以作为目标检测音区。在检测过程中,可利用目标语音输出信号、唇部运动信息、用户信息以及声纹中的至少一种来判定某个音区是否满足人声匹配条件,下面将结合几个示例进行说明。
情形一、如果并未收到目标语音输出信号,即表示用户并未说话,则确定不满足人声匹配条件。
情形二、如果收到的目标语音输出信号非常微弱或不像人声,则可以判定此时在该音区对应的指向角度方向上,用户并没有说话,因此确定不满足人声匹配条件。
情形三、如果收到的目标语音输出信号是人声,但是与给定用户信息的声纹极端不匹配(例如,匹配分值小于0.5),则可以判定此时在该音区对应的指向角度方向上,用户并没有说话,该目标语音输出信号为其它方向的人声泄漏至在本声道中的噪声信号,因此确定不满足人声匹配条件。
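Case 4: If the received target speech output signal is human voice, but the lip motion information shows that the user's lips are not moving and the voiceprint match is not high, it can likewise be determined that the user in the direction of that zone's pointing angle is not speaking. The target speech output signal is then voice from another direction leaking into this channel as noise, so the human voice matching condition is not met.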
其中,基于用户信息可以从数据库中得到其对应的声纹(假设该用户已使用用户信息进行注册),根据声纹可以判断当前该通道中的目标语音输出信号是否与该用户的声纹匹配,如果匹配成功,则确定满足人声匹配条件,如果不匹配,则判断该目标语音输出信号为其它方向的人声泄漏至在本声道中的噪声信号,即不满足人声匹配条件。
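Note that the four cases above are only illustrative; in practice other decision rules may be set flexibly as needed, which is not limited here. If the target speech output signal is determined to meet the human voice matching condition, the first speech detection result is generated, indicating that the target speech output signal is a normal human voice signal. Otherwise, if the target speech output signal does not meet the human voice matching condition, the second speech detection result is generated, indicating that the target speech output signal is a noise signal.
Secondly, this embodiment provides a way of performing speech detection for each sound zone. Each zone is checked separately against the human voice matching condition. Even if a user is present in a zone, the condition is considered unmet when the user is not speaking, is speaking too quietly, or has an identity that does not match the preset identity. To improve the accuracy of speech detection, whether the speech output signal of a zone meets the human voice matching condition can therefore be judged along several dimensions, which increases the feasibility and operability of the solution.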
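The four illustrative cases can be combined into a single predicate. In this sketch only the 0.5 voiceprint rejection threshold comes from the text; `min_energy`, `accept_score`, and all parameter names are hypothetical placeholders that a real system would tune.

```python
from typing import Optional


def matches_human_voice(energy: float,
                        is_voice_like: bool,
                        voiceprint_score: Optional[float],
                        lips_moving: Optional[bool],
                        min_energy: float = 1e-4,
                        reject_score: float = 0.5,
                        accept_score: float = 0.8) -> bool:
    """Return True for the first speech detection result, False for the second."""
    # Cases 1 and 2: nothing received, signal too weak, or not voice-like.
    if energy < min_energy or not is_voice_like:
        return False
    # Case 3: voice, but the voiceprint is an extreme mismatch for the zone's user.
    if voiceprint_score is not None and voiceprint_score < reject_score:
        return False
    # Case 4: voice, lips not moving, and the voiceprint match is not high.
    if lips_moving is False and (voiceprint_score is None
                                 or voiceprint_score < accept_score):
        return False
    return True
```

Passing `None` for `voiceprint_score` or `lips_moving` models a deployment without voiceprint enrollment or without a camera, in which case the corresponding checks are skipped.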
在上述图3对应的实施例的基础上,本申请实施例提供的另一个可选实施例中,根据目标检测音区所对应的语音输出信号,生成目标检测音区的语音检测结果之后,还可以包括如下步骤:
若存在M个音区所对应的语音检测结果均为第一语音检测结果,则根据M个音区中每个音区所对应的语音输出信号,从M个音区中确定目标音区,其中,第一语音检测结果表示语音输出信号为人声信号,M个音区属于N个音区,M为大于或等于1,且小于或等于N的整数;
将目标音区所对应的语音输出信号传输至通话方。
本实施例中,介绍了一种基于语音检测结果进行通话的方式,基于上述实施例可知,在得到N个音区中每个音区所对应的语音检测结果之后,从中选择第一语音检测结果所对应的音区,这是因为在通话场景中,为了提升通话质量,需要传递人声并且抑制噪声,第一语音检测结果表示该音区的语音输出信号为人声信号。可以理解的是,本实施例中的“语音输出信号”也可以是“待处理语音输出信号”或者“目标语音输出信号”,在具体的处理过程中可灵活选择,此处仅为一个示意,不应理解为对本申请的限定。
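Suppose M of the N sound zones have the first speech detection result. The main speaker can then be determined from the speech output signal (or the target speech output signal, or the to-be-processed speech output signal) of each of those M zones; each of the M zones is referred to as a “target zone”. For ease of description, refer to FIG. 6, another architecture diagram of the multi-channel sound pickup system in an embodiment of this application. As shown, the microphone array on the terminal device picks up the audio signal of each sound zone, and the audio signal includes a speech input signal and a noise signal. The signal separator generates the control signal of each zone, and the control signal of each zone is used to suppress or retain the speech input signal at the corresponding pointing angle, yielding the speech output signal of each zone. The speech detection result of each zone is determined from the zone's sound zone information and speech output signal.
The main speaker determination module determines the main speaker in real time from the speech output signals and sound zone information of the M zones. For example, when the latency requirement on the decision is strict, the module can directly use the signal strength of each speaker received over a short period, together with that speaker's distance from the microphone array (which can be provided by a wide-angle camera or a multi-camera array), to estimate the speaker's original volume (the volume at the mouth), and decide the main speaker from the original volume. As another example, when the latency requirement is relaxed, the main speaker can be determined from each speaker's face orientation (for instance, in a video conference the user facing the camera is more likely to be the main speaker). The decision result includes the main speaker's direction and identity and is output to the mixer for the call. Based on the decision, the mixer merges the N continuous audio streams into one or more output audio streams to meet the needs of the call. In one implementation, if the main speaker is determined to be in the direction of pointing angle θ_1, the single-channel output audio equals the speech output signal of input channel 1, and the input data of the other channels is discarded. In another implementation, if main speakers are determined to be in the directions of pointing angles θ_1 and θ_4, the output audio equals the speech output signals of input channels 1 and 4, and the input data of the other channels is discarded.
Note that, as FIG. 6 shows, cross-channel post-processing and noise-reduction post-processing may also be applied to the speech output signals to obtain the target speech output signal of each zone; the speech detection result of each zone is then determined from its sound zone information and target speech output signal.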
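The low-latency branch of the main speaker decision and the mixer step can be sketched as follows. Two assumptions are made that the text does not state: received level is taken to fall off as 1/distance, so the original volume is estimated as level times distance, and the mixer sums the selected channels sample by sample. All names are hypothetical.

```python
from typing import Dict, List


def pick_main_speaker(levels: Dict[int, float], distances: Dict[int, float]) -> int:
    """levels[z]: received signal level of zone z; distances[z]: metres to the array.

    ASSUMPTION: free-field 1/distance decay, so level * distance approximates
    the volume at the speaker's mouth.
    """
    return max(levels, key=lambda z: levels[z] * distances[z])


def mix(streams: Dict[int, List[float]], main_zones: List[int]) -> List[float]:
    """Keep only the main speakers' channels and discard the rest.

    ASSUMPTION: selected channels are summed into a single output stream.
    """
    length = len(next(iter(streams.values())))
    return [sum(streams[z][n] for z in main_zones) for n in range(length)]


# Hypothetical example: zone 4 is quieter at the array but twice as far away.
main = pick_main_speaker({1: 0.2, 4: 0.15}, {1: 1.0, 4: 2.0})
out = mix({1: [1.0, 2.0], 4: [0.5, 0.5], 7: [9.0, 9.0]}, [1, 4])
```

The relaxed-latency branch would add further inputs, such as face orientation, to the ranking key instead of relying on volume alone.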
请参阅图7,图7为本申请实施例中基于多音区语音检测方法实现通话的一个界面示意图,如图所示,以应用于通话场景为例,我方参与人有多位用户,因此,可采用本申请提供的技术方案确定主说话人,并将该主说话人的语音传递给用户甲,而其他说话人或者噪声可被抑制,从而使得用户甲能够听到更清晰的语音。
本申请实施例中,提供了一种基于语音检测结果进行通话的方式,采用上述方式,能够在多用户的场景下,实时分离以及增强每一个用户的语音,使得在通话场景下能够根据语音检测结果,并基于多用户并行分离增强处理以及后期混合的处理的流程,实现高质量的通话。
在上述图3对应的实施例的基础上,本申请实施例提供的另一个可选实施例中,根据目标检测音区所对应的语音输出信号,生成目标检测音区的语音检测结果之后,还可以包括如下步骤:
若存在M个音区所对应的语音检测结果均为第一语音检测结果,则根据M个音区中每个音区所对应的语音输出信号,从M个音区中确定目标音区,其中,第一语音检测结果表示语音输出信号为人声信号,M个音区属于N个音区,M为大于或等于1,且小于或等于N的整数;
对目标音区所对应的语音输出信号进行语义识别,得到语义识别结果;
根据语义识别结果,生成对话响应信息。
本实施例中,提供了一种基于语音检测结果反馈对话响应信息的方式,基于上述实施例可知,在得到N个音区中每个音区所对应的语音检测结果之后,从中选择第一语音检测结果所对应的音区,这是因为在智能对话场景中,为了提升智能对话的准确度,需要传递人声并且抑制噪声,第一语音检测结果表示该音区的语音输出信号为人声信号。可以理解的是,本实施例中的“语音输出信号”也可以是“待处理语音输出信号”或者“目标语音输出信号”,在具体的处理过程中可灵活选择,此处仅为一个示意,不应理解为对本申请的限定。
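Suppose M of the N sound zones have the first speech detection result. The main speaker can then be determined from the speech output signal (or the target speech output signal, or the to-be-processed speech output signal) of each of those M zones; each of the M zones is referred to as a “target zone”. For ease of description, refer to FIG. 8, another architecture diagram of the multi-channel sound pickup system in an embodiment of this application. As shown, the microphone array on the terminal device picks up the audio signal of each sound zone, and the audio signal includes a speech input signal and a noise signal. The signal separator generates the control signal of each zone, and the control signal of each zone is used to suppress or retain the speech input signal at the corresponding pointing angle, yielding the speech output signal of each zone. The speech detection result of each zone is determined from the zone's sound zone information and speech output signal.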
接下来,对M个音区中每个目标音区所对应的语音输出信号进行NLP处理,即获取每个目标音区内说话人的意图,即得到语义识别结果。
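The main speaker determination module determines the main speaker in real time from the speech output signals and sound zone information of the M zones. For example, when the latency requirement on the decision is strict, the module can directly use the signal strength of each speaker received over a short period, together with that speaker's distance from the microphone array (which can be provided by a wide-angle camera or a multi-camera array), to estimate the speaker's original volume (the volume at the mouth), and decide the main speaker from the original volume. As another example, when the latency requirement is relaxed, the main speaker can be determined from each speaker's semantic recognition result and face orientation (for instance, in a video conference the user facing the camera is more likely to be the main speaker). The decision result includes the main speaker's direction and identity and serves as the basis for generating the dialogue response information, so that the response matching the main speaker's intent is returned.
Note that, as FIG. 8 shows, cross-channel post-processing and noise-reduction post-processing may also be applied to the speech output signals to obtain the target speech output signal of each zone; the speech detection result of each zone is then determined from its sound zone information and target speech output signal.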
请参阅图9,图9为本申请实施例中基于多音区语音检测方法实现对话响应的一个界面示意图,如图所示,以应用于智能对话为例,假设我方有多位说话人,可采用本申请提供的技术方案确定主说话人,并根据主说话人的判定结果以及语义识别结果,可对主说话人说出的“小腾,今天是星期几呢?”进行回复,即生成对话响应信息,例如“Hi,今天是星期五”。
在实际应用中,还可以应用于智能客服和人机对话等场景,可实现对场景中每一个说话人的同步、实时以及独立的语义解析,还可以实现对每一个说话人进行手动屏蔽或开启等功能,还可以对每一个说话人进行自动屏蔽或开启等功能,此处不再详述。
本申请实施例中,提供了一种基于语音检测结果反馈对话响应信息的方式,采用上述方式,能够在多用户的场景下,实时分离以及增强每一个用户的语音,使得在智能对话下能够根据语音检测结果以及语义识别结果确定主说话人,并基于多用户并行分离增强处理以及后期混合的处理的流程,提升语音质量,从而实现能够根据语义识别结果单独反馈对话响应信息,对非交互目的的语音进行过滤。
在上述图3对应的实施例的基础上,本申请实施例提供的另一个可选实施例中,根据目标检测音区所对应的语音输出信号,生成目标检测音区的语音检测结果之后,还可以包括如下步骤:
若存在M个音区所对应的语音检测结果均为第一语音检测结果,则根据M个音区中每个音区所对应的语音输出信号,从M个音区中确定目标音区,其中,第一语音检测结果表示语音输出信号为人声信号,M个音区属于N个音区,M为大于或等于1,且小于或等于N的整数;
对目标音区所对应的语音输出信号进行切分处理,得到待识别音频数据;
对目标音区所对应的待识别音频数据进行语音识别,得到语音识别结果;
根据目标音区所对应的语音识别结果,生成文本记录信息,其中,文本记录信息包括翻译文本以及会议记录文本中的至少一种。
本实施例中,介绍了一种基于语音检测结果生成文本记录信息的方式,基于上述实施例可知,在得到N个音区中每个音区所对应的语音检测结果之后,从中选择第一语音检测结果所对应的音区,这是因为在翻译或记录的场景中,为了提升翻译或记录的准确度,需要传递人声并且抑制噪声,第一语音检测结果表示该音区的语音输出信号为人声信号。可以理解的是,本实施例中的“语音输出信号”也可以是“待处理语音输出信号”或者“目标语音输出信号”,在具体的处理过程中可灵活选择,此处仅为一个示意,不应理解为对本申请的限定。
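Suppose M of the N sound zones have the first speech detection result. The main speaker can then be determined from the speech output signal (or the target speech output signal, or the to-be-processed speech output signal) of each of those M zones; each of the M zones is referred to as a “target zone”. For ease of description, refer to FIG. 10, another architecture diagram of the multi-channel sound pickup system in an embodiment of this application. As shown, the microphone array on the terminal device picks up the audio signal of each sound zone, and the audio signal includes a speech input signal and a noise signal. The signal separator generates the control signal of each zone, and the control signal of each zone is used to suppress or retain the speech input signal at the corresponding pointing angle, yielding the speech output signal of each zone. The speech detection result of each zone is determined from the zone's sound zone information and speech output signal.
Next, the speech output signal of each target zone among the M zones is segmented, that is, the start and end points of each speech output signal are determined, which yields the audio data to be recognized. Each piece of audio data to be recognized carries user information, which may specifically be a user identifier. Both the audio data and the user information can be used in the subsequent speech recognition task. The audio data to be recognized of each target zone is then processed with ASR technology to obtain what the speaker in each target zone said, that is, the speech recognition result.
The main speaker determination module determines the main speaker in real time from the speech output signals and sound zone information of the M zones. For example, when the latency requirement on the decision is strict, the module can directly use the signal strength of each speaker received over a short period, together with that speaker's distance from the microphone array (which can be provided by a wide-angle camera or a multi-camera array), to estimate the speaker's original volume (the volume at the mouth), and decide the main speaker from the original volume. As another example, when the latency requirement is relaxed, the main speaker can be determined from each speaker's speech recognition result and face orientation (for instance, in a video conference the user facing the camera is more likely to be the main speaker). The decision result includes the main speaker's direction and identity and serves as the basis for generating the text record information, which is then displayed; the text record information includes at least one of translated text and meeting minutes text.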
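The segmentation step, which locates where each utterance starts and ends before the audio is handed to ASR, can be sketched with a simple frame-energy rule. The application does not say how endpoints are found, so the energy threshold and the hangover length below are assumptions, and the function name is hypothetical.

```python
from typing import List, Tuple


def segment(frame_energies: List[float],
            threshold: float = 0.01,
            hangover: int = 2) -> List[Tuple[int, int]]:
    """Return (start, end) frame index pairs, with `end` exclusive.

    A segment stays open across up to `hangover` consecutive quiet frames,
    so short pauses inside an utterance do not split it.
    """
    segments = []
    start = None
    silent = 0
    for idx, energy in enumerate(frame_energies):
        if energy >= threshold:
            if start is None:
                start = idx
            silent = 0
        elif start is not None:
            silent += 1
            if silent > hangover:
                # Close the segment right after its last active frame.
                segments.append((start, idx - silent + 1))
                start, silent = None, 0
    if start is not None:
        segments.append((start, len(frame_energies) - silent))
    return segments
```

Each returned pair would be cut from the zone's speech output signal and tagged with the zone's user information before being sent on for recognition.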
可以理解的是,ASR技术可采用规则方式或机器学习的模型方式,将切分后的待识别音频数据和声纹一起送给云端的ASR模块,通常是将声纹标识或者声纹模型参数送给云端的ASR模块,ASR模块可以利用该声纹信息进一步提升其识别率。
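Note that, as FIG. 10 shows, cross-channel post-processing and noise-reduction post-processing may also be applied to the speech output signals to obtain the target speech output signal of each zone; the speech detection result of each zone is then determined from its sound zone information and target speech output signal. In that case, speech segmentation is applied to the target speech output signal of each target zone.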
请参阅图11,图11为本申请实施例中基于多音区语音检测方法实现文本记录的一个界面示意图,如图所示,以应用于同声翻译场景为例,假设我方有多位说话人,可采用本申请提供的技术方案确定主说话人,并根据主说话人的判定结果以及语音识别结果,可对主说话人说出的一段话进行实时翻译,例如,主说话人是用户A,用户A说出“本次会议的主要内容是将让大家能够更好的了解今年的工作目标,提升工作效率”,此时,可实时展示文本记录信息,如“The main content of this meeting is to let everyone have a better understanding of this year's work objectives and improve work efficiency”。
在实际应用中,还可以应用于翻译、会议记录以及会议助手等场景,可实现对场景中每一个说话人的同步、实时以及独立的语音识别(例如,进行完整的会议转录),还可以实现对每一个说话人进行手动屏蔽或开启等功能,还可以对每一个说话人进行自动屏蔽或开启等功能,此处不再详述。
本申请实施例中,提供了一种基于语音检测结果生成文本记录信息的方式,采用上述方式,能够在多用户的场景下,实时分离以及增强每一个用户的语音,使得在智能对话下能够根据语音检测结果以及语音识别结果准确地分辨出每一个说话人各自的起止时间点,并单独对每个说话人的语音进行识别,得到更准确的语音识别性能,可用于后续的语义理解性能以及翻译性能等。并且基于多用户并行分离增强处理以及后期混合的处理的流程,提升语音质量,从而有利于增加文本记录信息的准确度。
下面对本申请中的语音检测装置进行详细描述,请参阅图12,图12为本申请实施例中语音检测装置的一个实施例示意图,语音检测装置20包括:
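an obtaining module 201, configured to obtain the sound zone information of each of N sound zones, where the sound zone information includes a zone identifier, a pointing angle, and user information, the zone identifier identifies the sound zone, the pointing angle indicates the center angle of the sound zone, the user information indicates whether a user is present in the sound zone, and N is an integer greater than 1;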
生成模块202,用于将每个音区分别作为目标检测音区,根据目标检测音区所对应的音区信息,生成目标检测音区所对应的控制信号,其中,控制信号用于对语音输入信号进行抑制处理或保留处理,控制信号与音区具有一一对应的关系;
处理模块203,用于采用目标检测音区所对应的控制信号,对目标检测音区所对应的语音输入信号进行处理,得到目标检测音区所对应的语音输出信号,其中,控制信号、语音输入信号以及语音输出信号具有一一对应的关系;
生成模块202,还用于根据目标检测音区所对应的语音输出信号,生成目标检测音区的语音检测结果。
在上述图12所对应的实施例的基础上,本申请实施例提供的语音检测装置20的另一实施例中,
获取模块201,具体用于对N个音区内每个音区进行检测,得到每个音区分别对应的用户检测结果;
将每个音区分别作为目标检测音区,根据目标检测音区所对应的用户检测结果确定目标检测音区所对应的用户信息;
根据目标检测音区所对应的用户检测结果确定目标检测音区所对应的唇部运动信息;
获取目标检测音区所对应的音区标识以及目标检测音区所对应的指向角度;
根据目标检测音区所对应的用户信息、目标检测音区所对应的唇部运动信息、目标检测音区所对应的音区标识以及目标检测音区所对应的指向角度,生成目标检测音区所对应的音区信息。
在上述图12所对应的实施例的基础上,本申请实施例提供的语音检测装置20的另一实施例中,
获取模块201,具体用于若目标检测音区所对应的用户检测结果为目标检测音区内存在可识别用户,则将第一身份标识确定为用户信息;
若目标检测音区所对应的用户检测结果为目标检测音区内不存在用户,则将第二身份标识确定为用户信息;
若目标检测音区所对应的用户检测结果为目标检测音区内存在未知用户,则将第三身份标识确定为用户信息;
获取模块201,具体用于若目标检测音区所对应的用户检测结果为目标检测音区内存在具有唇部运动的用户,则将第一运动标识确定为唇部运动信息;
若目标检测音区所对应的用户检测结果为目标检测音区内存在用户,且用户不具有唇部运动,则将第二运动标识确定为唇部运动信息;
若目标检测音区所对应的用户检测结果为目标检测音区内不存在用户,则将第三运动标识确定为唇部运动信息。
在上述图12所对应的实施例的基础上,本申请实施例提供的语音检测装置20的另一实施例中,
生成模块202,具体用于若目标检测音区所对应的用户信息用于指示目标检测音区内不存在用户,则生成第一控制信号,其中,第一控制信号属于控制信号,第一控制信号用于对语音输入信号进行抑制处理;
若目标检测音区所对应的用户信息用于指示目标检测音区内存在用户,则生成第二控制信号,其中,第二控制信号属于控制信号,第二控制信号用于对语音输入信号进行保留处理。
在上述图12所对应的实施例的基础上,本申请实施例提供的语音检测装置20的另一实施例中,
生成模块202,具体用于若目标检测音区所对应的用户信息用于指示目标检测音区内不存在用户,则生成第一控制信号,其中,第一控制信号属于控制信号,第一控制信号用于对语音输入信号进行抑制处理;
若目标检测音区所对应的用户信息用于指示目标检测音区内存在用户,且用户不具有唇部运动,则生成第一控制信号;
若目标检测音区所对应的用户信息用于指示目标检测音区内存在用户,且用户具有唇部运动,则生成第二控制信号,其中,第二控制信号属于控制信号,第二控制信号用于对语音输入信号进行保留处理;
若目标检测音区所对应的用户信息用于指示目标检测音区内存在用户,且未知用户的唇部运动情况,则根据原始音频信号生成第一控制信号或第二控制信号。
在上述图12所对应的实施例的基础上,本申请实施例提供的语音检测装置20的另一实施例中,
生成模块202,具体用于根据目标检测音区所对应的音区信息,采用预设算法生成目标检测音区所对应的控制信号,其中,预设算法为自适应波束形成算法、盲源分离算法或基于深度学习的语音分离算法;
处理模块203,具体用于若预设算法为自适应波束形成算法,则根据目标检测音区所对应的控制信号,采用自适应波束形成算法对目标检测音区所对应的语音输入信号进行处理,得到目标检测音区所对应的语音输出信号;
若预设算法为盲源分离算法,则根据目标检测音区所对应的控制信号,采用盲源分离算法对目标检测音区所对应的语音输入信号进行处理,得到目标检测音区所对应的语音输出信号;
若预设算法为基于深度学习的语音分离算法,则根据目标检测音区所对应的控制信号,采用基于深度学习的语音分离算法对目标检测音区所对应的语音输入信号进行处理,得到目标检测音区所对应的语音输出信号。
在上述图12所对应的实施例的基础上,本申请实施例提供的语音检测装置20的另一实施例中,
生成模块202,具体用于根据目标检测音区所对应的语音输出信号,确定目标检测音区所对应的信号功率,其中,信号功率为语音输出信号在时频点上的信号功率;
根据目标检测音区所对应的信号功率,确定目标检测音区所对应的估计信噪比;
根据目标检测音区所对应的估计信噪比,确定目标检测音区所对应的输出信号加权值,其中,输出信号加权值为语音输出信号在时频点上的加权结果;
根据目标检测音区所对应的输出信号加权值以及目标检测音区所对应的语音输出信号,确定目标检测音区所对应的目标语音输出信号;
根据目标检测音区所对应的目标语音输出信号,确定目标检测音区所对应的语音检测结果。
在上述图12所对应的实施例的基础上,本申请实施例提供的语音检测装置20的另一实施例中,
生成模块202,具体用于根据目标检测音区所对应的输出信号加权值以及目标检测音区所对应的语音输出信号,确定目标检测音区所对应的待处理语音输出信号;
对目标检测音区所对应的待处理语音输出信号进行降噪处理,得到目标检测音区所对应的目标语音输出信号。
在上述图12所对应的实施例的基础上,本申请实施例提供的语音检测装置20的另一实施例中,
生成模块202,具体用于若目标检测音区所对应的目标语音输出信号满足人声匹配条件,则生成第一语音检测结果,其中,第一语音检测结果属于语音检测结果,第一语音检测结果表示目标语音输出信号为人声信号;
若目标检测音区所对应的目标语音输出信号不满足人声匹配条件,则生成第二语音检测结果,其中,第二语音检测结果属于语音检测结果,第二语音检测结果表示目标语音输出信号为噪声信号。
在上述图12所对应的实施例的基础上,本申请实施例提供的语音检测装置20的另一实施例中,语音检测装置20还包括确定模块204以及传输模块205;
确定模块204,用于在生成模块202根据目标检测音区所对应的语音输出信号,生成目标检测音区的语音检测结果之后,若存在M个音区所对应的语音检测结果均为第一语音检测结果,则根据M个音区中每个音区所对应的语音输出信号,从M个音区中确定目标音区,其中,第一语音检测结果表示语音输出信号为人声信号,M个音区属于N个音区,M为大于或等于1,且小于或等于N的整数;
传输模块205,用于将目标音区所对应的语音输出信号传输至通话方。
在上述图12所对应的实施例的基础上,本申请实施例提供的语音检测装置20的另一实施例中,语音检测装置20还包括确定模块204以及识别模块206;
确定模块204,用于在生成模块202根据目标检测音区所对应的语音输出信号,生成目标检测音区的语音检测结果之后,若存在M个音区所对应的语音检测结果均为第一语音检测结果,则根据M个音区中每个音区所对应的语音输出信号,从M个音区中确定目标音区,其中,第一语音检测结果表示语音输出信号为人声信号,M个音区属于N个音区,M为大于或等于1,且小于或等于N的整数;
识别模块206,用于对目标音区所对应的语音输出信号进行语义识别,得到语义识别结果;
生成模块202,还用于根据语义识别结果,生成对话响应信息。
在上述图12所对应的实施例的基础上,本申请实施例提供的语音检测装置20的另一实施例中,语音检测装置20还包括确定模块204以及识别模块206;
确定模块204,用于在生成模块202根据目标检测音区所对应的语音输出信号,生成目标检测音区的语音检测结果之后,若存在M个音区所对应的语音检测结果均为第一语音检测结果,则根据M个音区中每个音区所对应的语音输出信号,从M个音区中确定目标音区,其中,第一语音检测结果表示语音输出信号为人声信号,M个音区属于N个音区,M为大于或等于1,且小于或等于N的整数;
处理模块203,用于对目标音区所对应的语音输出信号进行切分处理,得到待识别音频数据;
识别模块206,用于对目标音区所对应的待识别音频数据进行语音识别,得到语音识别结果;
生成模块202,还用于根据目标音区所对应的语音识别结果,生成文本记录信息,其中,文本记录信息包括翻译文本以及会议记录文本中的至少一种。
图13是本申请实施例计算机设备30的结构示意图。计算机设备30可包括输入设备310、输出设备320、处理器330和存储器340。本申请实施例中的输出设备可以是显示设备。存储器340可以包括只读存储器和随机存取存储器,并向处理器330提供指令和数据。存储器340的一部分还可以包括非易失性随机存取存储器(Non-Volatile Random Access Memory,NVRAM)。
存储器340存储了如下的元素,可执行模块或者数据结构,或者它们的子集,或者它们的扩展集:
操作指令:包括各种操作指令,用于实现各种操作。
操作系统:包括各种系统程序,用于实现各种基础业务以及处理基于硬件的任务。
本申请实施例中处理器330用于:
获取N个音区内每个音区所对应的音区信息,其中,音区信息包括音区标识、指向角度以及用户信息,音区标识用于标识音区,指向角度用于指示音区的中心角度,用户信息用于指示音区内的用户存留情况,N为大于1的整数;
将每个音区分别作为目标检测音区,根据目标检测音区所对应的音区信息,生成目标检测音区所对应的控制信号,其中,控制信号用于对语音输入信号进行抑制处理或保留处理,控制信号与音区具有一一对应的关系;
采用目标检测音区所对应的控制信号,对目标检测音区所对应的语音输入信号进行处理,得到目标检测音区所对应的语音输出信号,其中,控制信号、语音输入信号以及语音输出信号具有一一对应的关系;
根据目标检测音区所对应的语音输出信号,生成目标检测音区的语音检测结果。
处理器330控制计算机设备30的操作,处理器330还可以称为中央处理单元(Central Processing Unit,CPU)。存储器340可以包括只读存储器和随机存取存储器,并向处理器330提供指令和数据。存储器340的一部分还可以包括NVRAM。具体的应用中,计算机设备30的各个组件通过总线系统350耦合在一起,其中总线系统350除包括数据总线之外,还可以包括电源总线、控制总线和状态信号总线等。但是为了清楚说明起见,在图中将各种总线都标为总线系统350。
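The methods disclosed in the embodiments of this application may be applied to, or implemented by, the processor 330. The processor 330 may be an integrated circuit chip with signal processing capability. During implementation, the steps of the above methods may be completed by integrated logic circuits in hardware in the processor 330 or by instructions in the form of software. The processor 330 may be a general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and can implement or execute the methods, steps, and logical block diagrams disclosed in the embodiments of this application. The general-purpose processor may be a microprocessor or any conventional processor. The steps of the methods disclosed in the embodiments of this application may be directly performed by a hardware decoding processor, or by a combination of hardware and software modules in a decoding processor. The software module may reside in a storage medium that is mature in the art, such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or a register. The storage medium is located in the memory 340; the processor 330 reads the information in the memory 340 and completes the steps of the above methods in combination with its hardware. For the description of FIG. 13, refer to the description and effects of the method part of FIG. 3; details are not repeated here.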
本申请实施例中还提供一种计算机可读存储介质,该计算机可读存储介质中存储有计算机程序,当其在计算机上运行时,使得计算机执行如前述实施例所描述的方法。
本申请实施例中还提供一种包括程序的计算机程序产品,当其在计算机上运行时,使得计算机执行如前述实施例所描述的方法。
所属领域的技术人员可以清楚地了解到,为描述的方便和简洁,上述描述的系统,装置和单元的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。
以上所述,以上实施例仅用以说明本申请的技术方案,而非对其限制;尽管参照前述实施例对本申请进行了详细的说明,本领域的普通技术人员应当理解:其依然可以对前述各实施例所记载的技术方案进行修改,或者对其中部分技术特征进行等同替换;而这些修改或者替换,并不使相应技术方案的本质脱离本申请各实施例技术方案的精神和范围。

Claims (16)

  1. 一种基于多音区的语音检测方法,所述语音检测方法由计算机设备执行,包括:
    获取N个音区内每个音区所对应的音区信息,其中,所述音区信息包括音区标识、指向角度以及用户信息,所述音区标识用于标识音区,所述指向角度用于指示音区的中心角度,所述用户信息用于指示音区内的用户存留情况,所述N为大于1的整数;
    将每个音区分别作为目标检测音区,根据所述目标检测音区所对应的音区信息,生成所述目标检测音区所对应的控制信号,其中,所述控制信号用于对语音输入信号进行抑制处理或保留处理,所述控制信号与所述音区具有一一对应的关系;
    采用所述目标检测音区所对应的控制信号,对所述目标检测音区所对应的语音输入信号进行处理,得到所述目标检测音区所对应的语音输出信号,其中,所述控制信号、所述语音输入信号以及所述语音输出信号具有一一对应的关系;
    根据所述目标检测音区所对应的语音输出信号,生成所述目标检测音区的语音检测结果。
  2. 根据权利要求1所述的语音检测方法,所述获取N个音区内每个音区所对应的音区信息,包括:
    对所述N个音区内所述每个音区进行检测,得到所述每个音区分别对应的用户检测结果;
    将每个音区分别作为目标检测音区,根据所述目标检测音区所对应的用户检测结果确定所述目标检测音区所对应的用户信息;
    根据所述目标检测音区所对应的用户检测结果确定所述目标检测音区所对应的唇部运动信息;
    获取所述目标检测音区所对应的音区标识以及所述目标检测音区所对应的指向角度;
    根据所述目标检测音区所对应的用户信息、所述目标检测音区所对应的唇部运动信息、所述目标检测音区所对应的音区标识以及所述目标检测音区所对应的指向角度,生成所述目标检测音区所对应的音区信息。
  3. 根据权利要求2所述的语音检测方法,所述根据所述目标检测音区所对应的用户检测结果确定所述目标检测音区所对应的用户信息,包括:
    若所述目标检测音区所对应的用户检测结果为所述目标检测音区内存在可识别用户,则将第一身份标识确定为用户信息;
    若所述目标检测音区所对应的用户检测结果为所述目标检测音区内不存在用户,则将第二身份标识确定为用户信息;
    若所述目标检测音区所对应的用户检测结果为所述目标检测音区内存在未知用户,则将第三身份标识确定为用户信息;
    所述根据所述目标检测音区所对应的用户检测结果确定所述目标检测音区所对应的唇部运动信息,包括:
    若所述目标检测音区所对应的用户检测结果为所述目标检测音区内存在具有唇部运动的用户,则将所述第一运动标识确定为唇部运动信息;
    若所述目标检测音区所对应的用户检测结果为所述目标检测音区内存在用户,且所述用户不具有唇部运动,则将所述第二运动标识确定为唇部运动信息;
    若所述目标检测音区所对应的用户检测结果为所述目标检测音区内不存在用户,则将所述第三运动标识确定为唇部运动信息。
  4. 根据权利要求1所述的语音检测方法,所述根据所述目标检测音区所对应的音区信息,生成所述目标检测音区所对应的控制信号,包括:
    若所述目标检测音区所对应的用户信息用于指示所述目标检测音区内不存在用户,则生成第一控制信号,其中,所述第一控制信号属于所述控制信号,所述第一控制信号用于对语音输入信号进行抑制处理;
    若所述目标检测音区所对应的用户信息用于指示所述目标检测音区内存在用户,则生成第二控制信号,其中,所述第二控制信号属于所述控制信号,所述第二控制信号用于对语音输入信号进行保留处理。
  5. 根据权利要求2所述的语音检测方法,所述根据所述目标检测音区所对应的音区信息,生成所述目标检测音区所对应的控制信号,包括:
    若所述目标检测音区所对应的用户信息用于指示所述目标检测音区内不存在用户,则生成第一控制信号,其中,所述第一控制信号属于所述控制信号,所述第一控制信号用于对语音输入信号进行抑制处理;
    若所述目标检测音区所对应的用户信息用于指示所述目标检测音区内存在用户,且所述用户不具有唇部运动,则生成所述第一控制信号;
    若所述目标检测音区所对应的用户信息用于指示所述目标检测音区内存在用户,且所述用户具有唇部运动,则生成第二控制信号,其中,所述第二控制信号属于所述控制信号,所述第二控制信号用于对语音输入信号进行保留处理;
    若所述目标检测音区所对应的用户信息用于指示所述目标检测音区内存在用户,且未知所述用户的唇部运动情况,则根据原始音频信号生成所述第一控制信号或所述第二控制信号。
  6. 根据权利要求1所述的语音检测方法,所述根据所述目标检测音区所对应的音区信息,生成所述目标检测音区所对应的控制信号,包括:
    根据所述目标检测音区所对应的音区信息,采用预设算法生成所述目标检测音区所对应的控制信号,其中,所述预设算法为自适应波束形成算法、盲源分离算法或基于深度学习的语音分离算法;
    所述采用所述目标检测音区所对应的控制信号,对所述目标检测音区所对应的语音输入信号进行处理,得到所述目标检测音区所对应的语音输出信号,包括:
    若所述预设算法为所述自适应波束形成算法,则根据所述目标检测音区所对应的控制信号,采用所述自适应波束形成算法对所述目标检测音区所对应的语音输入信号进行处理,得到所述目标检测音区所对应的语音输出信号;
    若所述预设算法为所述盲源分离算法,则根据所述目标检测音区所对应的控制信号,采用所述盲源分离算法对所述目标检测音区所对应的语音输入信号进行处理,得到所述目标检测音区所对应的语音输出信号;
    若所述预设算法为所述基于深度学习的语音分离算法,则根据所述目标检测音区所对应的控制信号,采用所述基于深度学习的语音分离算法对所述目标检测音区所对应的语音输入信号进行处理,得到所述目标检测音区所对应的语音输出信号。
  7. 根据权利要求1所述的语音检测方法,所述根据所述目标检测音区所对应的语音输出信号,生成所述目标检测音区的语音检测结果,包括:
    根据所述目标检测音区所对应的语音输出信号,确定所述目标检测音区所对应的信号功率,其中,所述信号功率为所述语音输出信号在时频点上的信号功率;
    根据所述目标检测音区所对应的信号功率,确定所述目标检测音区所对应的估计信噪比;
    根据所述目标检测音区所对应的估计信噪比,确定所述目标检测音区所对应的输出信号加权值,其中,所述输出信号加权值为所述语音输出信号在时频点上的加权结果;
    根据所述目标检测音区所对应的输出信号加权值以及所述目标检测音区所对应的语音输出信号,确定所述目标检测音区所对应的目标语音输出信号;
    根据所述目标检测音区所对应的目标语音输出信号,确定所述目标检测音区所对应的语音检测结果。
  8. 根据权利要求7所述的语音检测方法,所述根据所述目标检测音区所对应的输出信号加权值以及所述目标检测音区所对应的语音输出信号,确定所述目标检测音区所对应的目标语音输出信号,包括:
    所述根据所述目标检测音区所对应的输出信号加权值以及所述目标检测音区所对应的语音输出信号,确定所述目标检测音区所对应的待处理语音输出信号;
    对所述目标检测音区所对应的待处理语音输出信号进行降噪处理,得到所述目标检测音区所对应的目标语音输出信号。
  9. 根据权利要求8所述的语音检测方法,所述根据所述目标检测音区所对应的语音输出信号,生成所述目标检测音区的语音检测结果,包括:
    若所述目标检测音区所对应的目标语音输出信号满足人声匹配条件,则生成第一语音检测结果,其中,所述第一语音检测结果属于所述语音检测结果,所述第一语音检测结果表示所述目标语音输出信号为人声信号;
    若所述目标检测音区所对应的目标语音输出信号不满足人声匹配条件,则生成第二语音检测结果,其中,所述第二语音检测结果属于所述语音检测结果,所述第二语音检测结果表示所述目标语音输出信号为噪声信号。
  10. 根据权利要求1至9中任一项所述的语音检测方法,所述根据所述目标检测音区所对应的语音输出信号,生成所述目标检测音区的语音检测结果之后,所述方法还包括:
    若存在M个音区所对应的语音检测结果均为第一语音检测结果,则根据所述M个音区中每个音区所对应的语音输出信号,从所述M个音区中确定目标音区,其中,所述第一语音检测结果表示所述语音输出信号为人声信号,所述M个音区属于所述N个音区,所述M为大于或等于1,且小于或等于所述N的整数;
    将所述目标音区所对应的语音输出信号传输至通话方。
  11. 根据权利要求1至9中任一项所述的语音检测方法,所述根据所述目标检测音区所对应的语音输出信号,生成所述目标检测音区的语音检测结果之后,所述方法还包括:
    若存在M个音区所对应的语音检测结果均为第一语音检测结果,则根据所述M个音区中每个音区所对应的语音输出信号,从所述M个音区中确定目标音区,其中,所述第一语音检测结果表示所述语音输出信号为人声信号,所述M个音区属于所述N个音区,所述M为大于或等于1,且小于或等于所述N的整数;
    对所述目标音区所对应的语音输出信号进行语义识别,得到语义识别结果;
    根据所述语义识别结果,生成对话响应信息。
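  12. The speech detection method according to any one of claims 1 to 9, wherein after the generating of the speech detection result of the target detection zone according to the speech output signal corresponding to the target detection zone, the method further comprises: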
    若存在M个音区所对应的语音检测结果均为第一语音检测结果,则根据所述M个音区中每个音区所对应的语音输出信号,从所述M个音区中确定目标音区,其中,所述第一语音检测结果表示所述语音输出信号为人声信号,所述M个音区属于所述N个音区,所述M为大于或等于1,且小于或等于所述N的整数;
    对所述目标音区所对应的语音输出信号进行切分处理,得到待识别音频数据;
    对所述目标音区所对应的待识别音频数据进行语音识别,得到语音识别结果;
    根据所述目标音区所对应的语音识别结果,生成文本记录信息,其中,所述文本记录信息包括翻译文本以及会议记录文本中的至少一种。
  13. 一种语音检测装置,所述语音检测装置部署在计算机设备上,包括:
    获取模块,用于获取N个音区内每个音区所对应的音区信息,其中,所述音区信息包括音区标识、指向角度以及用户信息,所述音区标识用于标识音区,所述指向角度用于指示音区的中心角度,所述用户信息用于指示音区内的用户存留情况,所述N为大于1的整数;
    生成模块,用于将每个音区分别作为目标检测音区,根据所述目标检测音区所对应的音区信息,生成所述目标检测音区所对应的控制信号,其中,所述控制信号用于对语音输入信号进行抑制处理或保留处理,所述控制信号与所述音区具有一一对应的关系;
    处理模块,用于采用所述目标检测音区所对应的控制信号,对所述目标检测音区所对应的语音输入信号进行处理,得到所述目标检测音区所对应的语音输出信号,其中,所述控制信号、所述语音输入信号以及所述语音输出信号具有一一对应的关系;
    所述生成模块,还用于根据所述目标检测音区所对应的语音输出信号,生成所述目标检测音区的语音检测结果。
  14. 一种计算机设备,包括:存储器、收发器、处理器以及总线系统;
    其中,所述存储器用于存储程序;
    所述处理器用于执行所述存储器中的程序,所述处理器用于根据所述程序代码中的指令执行权利要求1至12中任一项所述的方法;
    所述总线系统用于连接所述存储器以及所述处理器,以使所述存储器以及所述处理器进行通信。
  15. 一种计算机可读存储介质,包括指令,当其在计算机上运行时,使得计算机执行如权利要求1至12中任一项所述的语音检测方法。
  16. 一种计算机程序产品,当所述计算机程序产品被执行时,用于执行权利要求1至12中任一所述的方法。
PCT/CN2021/100472 2020-07-27 2021-06-17 一种基于多音区的语音检测方法、相关装置及存储介质 Ceased WO2022022139A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP21850172.4A EP4123646B1 (en) 2020-07-27 2021-06-17 Voice detection method based on multiple sound regions, related device, and storage medium
US17/944,067 US12051441B2 (en) 2020-07-27 2022-09-13 Multi-register-based speech detection method and related apparatus, and storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010732649.8A CN111833899B (zh) 2020-07-27 2020-07-27 一种基于多音区的语音检测方法、相关装置及存储介质
CN202010732649.8 2020-07-27

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/944,067 Continuation US12051441B2 (en) 2020-07-27 2022-09-13 Multi-register-based speech detection method and related apparatus, and storage medium

Publications (1)

Publication Number Publication Date
WO2022022139A1 true WO2022022139A1 (zh) 2022-02-03

Family

ID=72926418

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/100472 Ceased WO2022022139A1 (zh) 2020-07-27 2021-06-17 一种基于多音区的语音检测方法、相关装置及存储介质

Country Status (4)

Country Link
US (1) US12051441B2 (zh)
EP (1) EP4123646B1 (zh)
CN (1) CN111833899B (zh)
WO (1) WO2022022139A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE202022101069U1 (de) 2022-02-24 2022-03-23 Pankaj Agarwal Intelligentes Geräuscherkennungssystem auf der Grundlage der Verarbeitung mehrerer Geräusche durch künstliche Intelligenz
EP4478351A4 (en) * 2022-04-22 2025-03-12 Huawei Technologies Co., Ltd. VOICE INTERACTION METHOD AND ELECTRONIC DEVICE AND STORAGE MEDIUM

Families Citing this family (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
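CN111432160A (zh) * 2020-04-01 2020-07-17 广州市百果园信息技术有限公司 Method, apparatus, server, and storage medium for implementing a multi-party call
CN111833899B (zh) 2020-07-27 2022-07-26 腾讯科技(深圳)有限公司 Voice detection method based on multiple sound regions, related device, and storage medium
CN112397065A (zh) 2020-11-04 2021-02-23 深圳地平线机器人科技有限公司 Voice interaction method and apparatus, computer-readable storage medium, and electronic device
CN113012700B (zh) * 2021-01-29 2023-12-26 深圳壹秘科技有限公司 Voice signal processing method, apparatus, and system, and computer-readable storage medium
DE102021103310B4 (de) * 2021-02-12 2024-01-04 Dr. Ing. H.C. F. Porsche Aktiengesellschaft Method and device for improving speech intelligibility in a room
CN113241068A (zh) * 2021-03-26 2021-08-10 青岛海尔科技有限公司 Voice signal response method and apparatus, storage medium, and electronic apparatus
CN113270095B (zh) * 2021-04-26 2022-04-08 镁佳(北京)科技有限公司 Voice processing method and apparatus, storage medium, and electronic device
CN115641869A (zh) * 2021-07-19 2023-01-24 上海擎感智能科技有限公司 Method and system for in-vehicle speaker separation
CN113823273B (zh) * 2021-07-23 2024-02-13 腾讯科技(深圳)有限公司 Audio signal processing method and apparatus, electronic device, and storage medium
CN113611308B (zh) * 2021-09-08 2024-05-07 杭州海康威视数字技术股份有限公司 Speech recognition method, apparatus, system, server, and storage medium
TWI902903B (zh) * 2021-09-17 2025-11-01 信驊科技股份有限公司 Video content providing method and video content providing device
CN114116771A (zh) * 2021-11-29 2022-03-01 如果科技有限公司 Voice-control data analysis method and apparatus, terminal device, and storage medium
CN114333863B (zh) * 2021-12-16 2025-07-18 中国科学技术大学 Speech enhancement method and apparatus, electronic device, and computer-readable storage medium
CN114495923A (zh) * 2021-12-28 2022-05-13 北京百度网讯科技有限公司 Method and apparatus for implementing an intelligent control system, electronic device, and storage medium
CN118800228A (zh) * 2023-04-13 2024-10-18 华为技术有限公司 Data processing method and related device
CN116741161A (zh) * 2023-06-09 2023-09-12 广州小鹏汽车科技有限公司 Voice processing method and apparatus, terminal device, and storage medium
CN120148513B (zh) * 2025-03-25 2026-04-10 平安科技(深圳)有限公司 Method, apparatus, device, and medium for detecting proxy-answering behavior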

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1116961A2 (en) * 2000-01-13 2001-07-18 Nokia Mobile Phones Ltd. Method and system for tracking human speakers
US20050111674A1 (en) * 2003-11-20 2005-05-26 Acer Inc. Sound pickup method and system with sound source tracking
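CN1813284A (zh) * 2003-06-17 2006-08-02 索尼爱立信移动通讯股份有限公司 Apparatus and method for voice activity detection
CN101297587A (zh) * 2006-04-21 2008-10-29 雅马哈株式会社 Sound pickup device and voice conference apparatus
CN108370470A (zh) * 2015-12-04 2018-08-03 森海塞尔电子股份有限及两合公司 Conference system with a microphone array system and method for voice acquisition in a conference system
CN110797051A (zh) * 2019-10-28 2020-02-14 星络智能科技有限公司 Wake-up threshold setting method and apparatus, smart speaker, and storage medium
CN111833899A (zh) * 2020-07-27 2020-10-27 腾讯科技(深圳)有限公司 Voice detection method based on multiple sound regions, related device, and storage medium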

Family Cites Families (34)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100194953B1 (ko) * 1996-11-21 1999-06-15 정선종 Frame-by-frame pitch detection method in voiced sound sections
US8073148B2 (en) * 2005-07-11 2011-12-06 Samsung Electronics Co., Ltd. Sound processing apparatus and method
TWI543635B (zh) * 2013-12-18 2016-07-21 jing-feng Liu Speech Acquisition Method of Hearing Aid System and Hearing Aid System
EP3639263B1 (en) * 2017-06-13 2025-04-30 Sandeep Kumar Chintala NOISE SUPPRESSION IN VOICE COMMUNICATION SYSTEMS
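CN108305615B (zh) * 2017-10-23 2020-06-16 腾讯科技(深圳)有限公司 Object recognition method and device thereof, storage medium, and terminal
CN207718803U (zh) * 2017-12-06 2018-08-10 广州宝镜智能科技有限公司 Multi-source voice discrimination and recognition system
CN107910006A (zh) * 2017-12-06 2018-04-13 广州宝镜智能科技有限公司 Speech recognition method and apparatus, and multi-source voice discrimination and recognition system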
WO2019195799A1 (en) * 2018-04-05 2019-10-10 Synaptics Incorporated Context-aware control for smart devices
US20190341053A1 (en) * 2018-05-06 2019-11-07 Microsoft Technology Licensing, Llc Multi-modal speech attribution among n speakers
US11152006B2 (en) * 2018-05-07 2021-10-19 Microsoft Technology Licensing, Llc Voice identification enrollment
US10694285B2 (en) * 2018-06-25 2020-06-23 Biamp Systems, LLC Microphone array with automated adaptive beam tracking
US10540960B1 (en) * 2018-09-05 2020-01-21 International Business Machines Corporation Intelligent command filtering using cones of authentication in an internet of things (IoT) computing environment
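JP7158976B2 (ja) * 2018-09-28 2022-10-24 沖電気工業株式会社 Sound pickup device, sound pickup program, and sound pickup method
CN113658588A (zh) * 2018-09-29 2021-11-16 百度在线网络技术(北京)有限公司 Multi-sound-region speech recognition method, apparatus, and storage medium
CN110491403B (zh) * 2018-11-30 2022-03-04 腾讯科技(深圳)有限公司 Audio signal processing method, apparatus, medium, and audio interaction device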
US11404073B1 (en) * 2018-12-13 2022-08-02 Amazon Technologies, Inc. Methods for detecting double-talk
US10937441B1 (en) * 2019-01-04 2021-03-02 Amazon Technologies, Inc. Beam level based adaptive target selection
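CN110160633B (zh) * 2019-04-30 2021-10-08 百度在线网络技术(北京)有限公司 Method and apparatus for detecting audio isolation among multiple sound regions
CN110310633B (zh) * 2019-05-23 2022-05-20 阿波罗智联(北京)科技有限公司 Multi-sound-region speech recognition method, terminal device, and storage medium
CN110211585A (zh) * 2019-06-05 2019-09-06 广州小鹏汽车科技有限公司 In-vehicle entertainment interaction method and apparatus, vehicle, and machine-readable medium
CN110459234B (zh) * 2019-08-15 2022-03-22 思必驰科技股份有限公司 Speech recognition method and system for in-vehicle use
CN110475180A (zh) 2019-08-23 2019-11-19 科大讯飞(苏州)科技有限公司 In-vehicle multi-sound-region audio processing system and method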
US11514928B2 (en) * 2019-09-09 2022-11-29 Apple Inc. Spatially informed audio signal processing for user speech
US11107492B1 (en) * 2019-09-18 2021-08-31 Amazon Technologies, Inc. Omni-directional speech separation
US11341988B1 (en) * 2019-09-23 2022-05-24 Apple Inc. Hybrid learning-based and statistical processing techniques for voice activity detection
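CN110648663A (zh) * 2019-09-26 2020-01-03 科大讯飞(苏州)科技有限公司 In-vehicle audio management method, apparatus, and device, automobile, and readable storage medium
CN111223497B (zh) * 2020-01-06 2022-04-19 思必驰科技股份有限公司 Nearby wake-up method and apparatus for a terminal, computing device, and storage medium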
US11348253B2 (en) * 2020-01-09 2022-05-31 Alibaba Group Holding Limited Single-channel and multi-channel source separation enhanced by lip motion
CN111341313A (zh) * 2020-03-04 2020-06-26 北京声智科技有限公司 In-vehicle multi-sound-region sound source detection method, apparatus, and system
US11250869B2 (en) * 2020-04-16 2022-02-15 Lg Electronics Inc. Audio zoom based on speaker detection using lip reading
US11114108B1 (en) * 2020-05-11 2021-09-07 Cirrus Logic, Inc. Acoustic source classification using hyperset of fused voice biometric and spatial features
US11264017B2 (en) * 2020-06-12 2022-03-01 Synaptics Incorporated Robust speaker localization in presence of strong noise interference systems and methods
US11423906B2 (en) * 2020-07-10 2022-08-23 Tencent America LLC Multi-tap minimum variance distortionless response beamformer with neural networks for target speech separation
CN114360527B (zh) * 2021-12-30 2023-09-26 亿咖通(湖北)技术有限公司 In-vehicle voice interaction method, apparatus, device, and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP4123646A4

Also Published As

Publication number Publication date
EP4123646A4 (en) 2023-04-05
US12051441B2 (en) 2024-07-30
US20230013740A1 (en) 2023-01-19
EP4123646B1 (en) 2025-02-19
CN111833899A (zh) 2020-10-27
CN111833899B (zh) 2022-07-26
EP4123646A1 (en) 2023-01-25

Similar Documents

Publication Publication Date Title
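WO2022022139A1 (zh) Voice detection method based on multiple sound regions, related device, and storage medium
JP7536789B2 (ja) Customized output to optimize for user preferences in a distributed system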
US11996114B2 (en) End-to-end time-domain multitask learning for ML-based speech enhancement
US8606249B1 (en) Methods and systems for enhancing audio quality during teleconferencing
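CN115482830B (zh) Speech enhancement method and related device
CN111489760A (zh) Voice signal dereverberation processing method and apparatus, computer device, and storage medium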
US12387738B2 (en) Distributed teleconferencing using personalized enhancement models
US20240096343A1 (en) Voice quality enhancement method and related device
US11875800B2 (en) Talker prediction method, talker prediction device, and communication system
CN102160335B (zh) Conversation detection in an ambient telephony system
US20230352040A1 (en) Audio source feature separation and target audio source generation
US20140278399A1 (en) Speech fragment detection for management of interaction in a remote conference
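CN114461842B (zh) Method, apparatus, device, and storage medium for generating dissuasion scripts
HK40029471A (zh) Voice detection method based on multiple sound regions, related device, and storage medium
HK40029471B (zh) Voice detection method based on multiple sound regions, related device, and storage medium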
NO347277B1 (en) Method and system for speech detection and speech enhancement
US12581038B2 (en) Audio processing in video conferencing system using multimodal features
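TWI801085B (zh) Noise reduction method for intelligent network communication
CN112788278B (zh) Video stream generation method, apparatus, device, and storage medium
CN121393429A (zh) Multimodal voice interaction method and apparatus, smart device, and readable storage medium
CN121393460A (zh) Voice noise reduction method and apparatus for multi-person scenarios, electronic device, and storage medium
HK40027472A (zh) Voice signal dereverberation processing method and apparatus, computer device, and storage medium
HK40069959B (zh) Audio processing method, apparatus, device, and medium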
Nakatani et al. AI Hears Your Voice as if It Were Right Next to You-Audio Processing Framework for Separating Distant Sounds with Close-microphone Quality NTT Communication Science Laboratories

Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application. Ref document number: 21850172; Country of ref document: EP; Kind code of ref document: A1
ENP Entry into the national phase. Ref document number: 2021850172; Country of ref document: EP; Effective date: 20221017
NENP Non-entry into the national phase. Ref country code: DE