WO2022068608A1 - Signal processing method and electronic device - Google Patents


Info

Publication number
WO2022068608A1
Authority
WO
WIPO (PCT)
Prior art keywords
sound source
electronic device
user
audio signal
target sound
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/CN2021/118948
Other languages
English (en)
French (fr)
Inventor
鲍光照
陈礼文
黄磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd
Priority to US18/247,212 (US20230386494A1)
Priority to EP21874269.0A (EP4207186B1)
Publication of WO2022068608A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272 Voice signal separating
    • G10L21/028 Voice signal separating using properties of sound source
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/57 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for processing of video signals
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N23/00 Cameras or camera modules comprising electronic image sensors; Control thereof
    • H04N23/60 Control of cameras or camera modules
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/24 Speech recognition using non-acoustical features
    • G10L15/25 Speech recognition using non-acoustical features using position of the lips, movement of the lips or face analysis
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L2021/02161 Number of inputs available containing the signal or the noise to be suppressed
    • G10L2021/02166 Microphone arrays; Beamforming

Definitions

  • the embodiments of the present application relate to the field of acoustics, and more particularly, to a signal processing method and electronic device.
  • smart devices such as smart TVs, smart speakers, and smart lights can be used for far-field sound pickup. For example, when a user says “turn off the lights” from 5 meters away, the smart device picks up and recognizes the voice command and turns the lights off.
  • the commonly used far-field sound pickup technology uses a microphone array to pick up audio signals, and uses beamforming technology and echo cancellation algorithms to suppress environmental noise and echoes to obtain clearer audio signals.
  • however, the actual environment contains various noises and interference, such as the noise of cooking and washing dishes in the kitchen, the sound of TV programs, and the interfering speech of family chats. In addition, some homes have empty rooms or walls finished with materials that have large acoustic reflection coefficients, which produces strong reverberation and makes the sound muddy. All these unfavorable factors greatly reduce the intelligibility of the sound picked up by the microphone array, resulting in a significant drop in the speech recognition rate.
  • Embodiments of the present application provide a signal processing method and electronic device, which determine the target sound source direction of a user who is interacting with the electronic device through an audio signal and a video obtained by a camera. Further, based on the video of the user's lips in the direction of the target sound source obtained by the camera and a preset speech enhancement model, speech enhancement processing is performed on the picked-up audio signal to obtain or restore a relatively clear audio signal, which can greatly improve the efficiency of speech recognition.
  • a signal processing method which is characterized in that it is applied to an electronic device, the electronic device includes a microphone array and a camera, and the method includes:
  • a third audio signal is obtained through a speech enhancement model, where the speech enhancement model includes a correspondence between pronunciation and lip shape.
  • the sound source direction information includes at least one sound source direction, and the at least one sound source direction includes a target sound source direction.
  • the user direction information includes some directions related to the user, illustratively including at least one type of direction related to the user.
  • the target sound source direction is the direction in which the target user who is performing voice interaction with the electronic device is located, that is, the source direction of the sound emitted by the target user.
  • lip shapes during the user's speech are recorded in the user's lip video, and the lip shapes and pronunciations have a corresponding relationship, that is, one lip shape can correspond to one or more pronunciations.
  • the lips are in a static state.
  • the lip video of the user in the direction of the target sound source can actually be understood as the lip video of the target user.
  • the purpose of the speech enhancement model is to perform pickup enhancement processing on the audio signal: it enhances the audio signal in the direction of the target sound source and suppresses or eliminates audio signals generated in other directions, such as loudspeaker sound or background noise, so as to obtain or restore a clearer audio signal.
  • The speech enhancement model of the embodiment of the present application integrates audio and video information, and integrates the correspondence between pronunciation and lip shape, that is, one or more pronunciations may correspond to one lip shape.
  • the camera is a rotatable camera. After determining the direction of the target sound source, the camera can be rotated to the direction of the target sound source to capture a video of the user's lips in the direction of the target sound source.
  • the first video is obtained by the camera, and the direction of the target sound source is determined in combination with the first audio signal obtained by the microphone array. This can greatly improve the estimation accuracy of the target sound source direction, and avoids the situation in which, when only the audio signal is used, a false sound source generated by strong reflected sound interferes with the determination of the target sound source direction. Then, using the video of the user's lips in the direction of the target sound source obtained by the camera and the preset speech enhancement model, speech enhancement is performed on the second audio signal obtained by the microphone array. Because the correspondence between pronunciation and lip shape is integrated in the speech enhancement model, a relatively clean third audio signal can be recovered by combining the user's lip video with the model, which can effectively improve the efficiency of speech recognition.
  • the electronic device further includes a directional microphone
  • the method further includes:
  • a third audio signal is obtained through a speech enhancement model, including:
  • the third audio signal is obtained through the speech enhancement model according to the second audio signal, the fourth audio signal and the video of the user's lips.
  • the directional microphone can be fixed on the camera. In this way, after the direction of the target sound source is determined, the directional microphone is driven to rotate as the camera rotates, and finally faces the direction of the target sound source, where it picks up the fourth audio signal.
  • a fourth audio signal in the direction of the target sound source is obtained through the directional microphone.
  • directional pickup suppresses the echo of the display screen itself to a certain extent, and further suppresses the echo residue that remains after echo cancellation. Therefore, the embodiment of the present application uses the fourth audio signal obtained by the directional microphone in the direction of the target sound source together with the second audio signal obtained by the microphone array. Using the two audio signals as audio input can greatly improve the effect of sound pickup enhancement and thus the speech recognition efficiency.
  • the user direction information includes at least one of the following types of directions:
  • a first type of direction, including the direction in which at least one moving lip is located;
  • a second type of direction, including the direction in which at least one user is located;
  • a third type of direction, including the direction in which at least one user who is looking at the electronic device is located.
  • in the signal processing method of the embodiment of the present application, when the first type of direction is used to determine the target sound source direction, the first video makes it possible to detect effectively whether a person's lips are moving in the picture, that is, whether someone is talking. For example, scenes in which other people are talking, and scenes that interfere with the user's speech, can be excluded to a certain extent.
  • when the second type of direction is used to determine the target sound source direction, the first video is used to detect the users who appear in the picture, which can effectively eliminate interference signals from non-user sources; for example, interference signals from loudspeakers can be excluded.
  • when the third type of direction is used to determine the target sound source direction, the first video is used to detect whether a user in the picture is looking at the electronic device. In general, especially for an electronic device with a display screen, a user who intends to interact with the electronic device will in most cases face it while issuing a voice command, so that the electronic device can receive the voice command well and the user can quickly see whether the electronic device is executing the command or obtain feedback from it. For example, if the user issues a voice command asking about the weather, the user needs to look at the displayed weather conditions.
  • the sound source direction information includes at least one sound source direction
  • the determining the target sound source direction according to the sound source direction information and the user direction information includes:
  • the target sound source direction is determined from the at least one direction.
  • the signal processing method of the embodiment of the present application can simplify the calculation by combining with at least one sound source direction and at least one type of direction to determine the target sound source direction.
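  • The combination of sound source directions and user directions described above can be sketched as follows. This is only an illustration under stated assumptions, not the method claimed in the application: the function name `candidate_directions`, the representation of a direction as an azimuth in degrees, and the 10° matching tolerance are all hypothetical.

```python
def candidate_directions(sound_dirs, user_dirs, tolerance_deg=10.0):
    """Keep sound source directions that lie near at least one user direction.

    sound_dirs: azimuths (degrees) estimated from the microphone array.
    user_dirs: azimuths (degrees) of users detected in the camera video.
    tolerance_deg: hypothetical matching tolerance, not taken from the source.
    """
    matched = []
    for s in sound_dirs:
        # A sound direction survives only if some user is seen close to it.
        if any(abs(s - u) <= tolerance_deg for u in user_dirs):
            matched.append(s)
    return matched


# A reflection at -50 degrees has no user nearby, so it is discarded.
print(candidate_directions([-50.0, 22.0], [20.0, 65.0]))
```

  • Filtering the audio estimates by what the camera sees is what lets a false source created by a strong reflection drop out before any further scoring is done.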
  • the determining the target sound source direction from the at least one direction includes:
  • the target sound source direction is determined from the at least one direction according to at least one of the following parameters;
  • the at least one parameter includes:
  • the preset time period is the time period between the current time and the historical time
  • the included angle between each of the directions and the direction perpendicular to the display screen of the electronic device.
  • this direction is the most likely direction of the target sound source. Ideally, this direction is essentially the direction of the target sound source.
  • the preset angle range corresponding to each direction includes not only the angle corresponding to that direction but also the angles near that angle.
  • This parameter can be understood as whether the electronic device has successfully performed voice interaction with the user in the vicinity of an angle corresponding to a certain direction within a preset period of time.
  • the parameter "included angle between each direction and the direction perpendicular to the display screen of the electronic device" is more suitable for electronic devices with a display screen. This parameter can be understood as whether the user is near a specific direction defined for a preset usage scenario of the electronic device.
  • the target sound source direction is determined from at least one direction by using the above at least one parameter, which can further improve the estimation accuracy of the target sound source direction and thus the speech recognition efficiency.
  • determining the target sound source direction from the at least one direction according to at least one of the following parameters includes:
  • the direction corresponding to the confidence level with the largest numerical value in the at least one direction is determined as the target sound source direction.
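  • Selecting the direction with the largest confidence can be sketched as below. The text does not state how a confidence value is computed from the parameters, so the scoring here is purely an assumption: it uses only the two parameters that are legible above (recent successful interaction within a preset angle range, and the angle to the display normal), and the weights, the 15° range, and the function names are hypothetical.

```python
def confidence(direction_deg, recent_interaction_dirs,
               angle_range_deg=15.0, w_history=0.5, w_frontal=0.5):
    """Hypothetical confidence score for one candidate direction.

    direction_deg is measured from the display normal (0 = straight ahead).
    recent_interaction_dirs holds the directions of successful voice
    interactions that fall inside the preset time period.
    The weights and the 15 degree range are illustrative assumptions.
    """
    interacted = any(abs(direction_deg - d) <= angle_range_deg
                     for d in recent_interaction_dirs)
    # 1.0 when facing the screen head-on, falling to 0.0 at 90 degrees.
    frontal = 1.0 - min(abs(direction_deg), 90.0) / 90.0
    return w_history * (1.0 if interacted else 0.0) + w_frontal * frontal


def pick_target_direction(candidates, recent_interaction_dirs):
    """Return the candidate direction with the largest confidence value."""
    return max(candidates,
               key=lambda d: confidence(d, recent_interaction_dirs))


# A recent interaction near 55 degrees outweighs the more frontal candidate.
print(pick_target_direction([-40.0, 10.0, 60.0], [55.0]))
```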
  • the obtaining the second audio signal through the microphone array includes:
  • the second audio signal is obtained in the direction of the target sound source based on a beamforming technique.
  • the signal processing method of the embodiment of the present application obtains the second audio signal in the direction of the target sound source through the beamforming technology, which enhances the sound pickup effect and effectively reduces the influence of interference signals in other directions on the efficiency of speech recognition.
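  • The text does not say which beamforming technique is used. As a minimal sketch, the example below uses delay-and-sum, one of the simplest beamformers, for a uniform linear array with a far-field source. The function names, the whole-sample delay rounding, and the array geometry are assumptions made for illustration.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, approximate value in air at room temperature


def steering_delays(num_mics, spacing_m, angle_deg, sample_rate):
    """Per-microphone arrival delays, in whole samples, for a far-field
    source at angle_deg from broadside of a uniform linear array."""
    delays = []
    for m in range(num_mics):
        tau = m * spacing_m * math.sin(math.radians(angle_deg)) / SPEED_OF_SOUND
        delays.append(round(tau * sample_rate))
    shift = min(delays)  # make all delays non-negative
    return [d - shift for d in delays]


def delay_and_sum(channels, delays):
    """Advance each channel by its delay and average the aligned samples."""
    length = min(len(ch) - d for ch, d in zip(channels, delays))
    out = []
    for n in range(length):
        total = sum(ch[n + d] for ch, d in zip(channels, delays))
        out.append(total / len(channels))
    return out


# An impulse that reaches three microphones one sample apart.
mics = [[1, 0, 0, 0, 0], [0, 1, 0, 0, 0], [0, 0, 1, 0, 0]]
print(delay_and_sum(mics, steering_delays(3, 0.05, 30.0, 16000)))
```

  • When the steering delays match the true arrival delays, the channels add coherently; sound from other directions is misaligned and partly averages out, which is the sense in which the pickup is enhanced in the target direction.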
  • the first audio signal is a wake-up signal.
  • an electronic device including a microphone array, a camera and a processor, the processor being used for:
  • a third audio signal is obtained through a speech enhancement model, where the speech enhancement model includes a correspondence between pronunciation and lip shape.
  • the electronic device further includes a directional microphone
  • the processor is further configured to:
  • the processor is specifically used for:
  • the third audio signal is obtained through the speech enhancement model according to the second audio signal, the fourth audio signal and the video of the user's lips.
  • the directional microphone is fixedly connected to the camera.
  • the user direction information includes at least one of the following types of directions:
  • a first type of direction, including the direction in which at least one moving lip is located;
  • a second type of direction, including the direction in which at least one user is located;
  • a third type of direction, including the direction in which at least one user who is looking at the electronic device is located.
  • the sound source direction information includes at least one sound source direction
  • the processor is specifically used for:
  • the target sound source direction is determined from the at least one direction.
  • the processor is specifically configured to:
  • the target sound source direction is determined from the at least one direction according to at least one of the following parameters;
  • the at least one parameter includes:
  • the preset time period is the time period between the current time and the historical time
  • the included angle between each of the directions and the direction perpendicular to the display screen of the electronic device.
  • the processor is specifically configured to:
  • the direction corresponding to the confidence level with the largest numerical value in the at least one direction is determined as the target sound source direction.
  • the processor is specifically configured to:
  • the second audio signal is obtained in the direction of the target sound source based on a beamforming technique.
  • the first audio signal is a wake-up signal.
  • the electronic device is a smart TV.
  • a chip including a processor configured to call instructions from a memory and execute them, so that an electronic device on which the chip is installed executes the method described in the first aspect.
  • a computer storage medium comprising: a processor coupled to a memory for storing a program or an instruction that, when executed by the processor, causes the apparatus to perform the method described in the first aspect above.
  • the present application provides a computer program product that, when the computer program product is run on an electronic device, causes the electronic device to perform the method according to any one of the first aspects.
  • FIG. 1 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.
  • FIG. 2 is a schematic structural diagram of an electronic device provided by another embodiment of the present application.
  • FIG. 3 is a schematic scene diagram of a video captured by a camera according to an embodiment of the present application.
  • FIG. 4 is an exemplary block diagram of an electronic device provided by an embodiment of the present application.
  • FIG. 5 is a schematic scene diagram provided by an embodiment of the present application.
  • FIG. 6 is a schematic flowchart of a signal processing method provided by an embodiment of the present application.
  • FIG. 7 is a schematic flowchart of a signal processing method provided by another embodiment of the present application.
  • FIG. 8 is a schematic flowchart of a method for an electronic device to determine the direction of a target sound source provided by another embodiment of the present application.
  • FIG. 9 is a schematic flowchart of a signal processing method provided by another embodiment of the present application.
  • the signal processing method provided by the embodiment of the present application determines the direction (referred to as the direction of the target sound source) of the user who is performing voice interaction with the electronic device (referred to as the target user) through an audio signal and a video obtained by a camera. Furthermore, based on the video of the user's lips in that direction obtained by the camera and the preset speech enhancement model, speech enhancement processing is performed on the picked-up audio signal to obtain or restore a relatively clear audio signal, which can greatly improve the efficiency of speech recognition.
  • the target user is a person who is interacting with the electronic device by voice, and the target user is sending a voice command to the electronic device to perform a certain action.
  • the target user can also be understood as the actual speaker.
  • the target sound source direction is the direction where the target user is located, that is, the source direction of the sound emitted by the target user. Due to the influence of various interference signals in the environment, the electronic device may pick up audio signals in multiple sound source directions. Therefore, the direction where the target user is located is defined as the target sound source direction.
  • the user's lip video records the shape of the lips (referred to as lip shape) during the user's speech. The lip shape has a corresponding relationship with pronunciation, that is, one lip shape can correspond to one or more pronunciations. For example, "wo", "me" and "grip" represent three different pronunciations but correspond to one lip shape.
  • the lip video of the user in the direction of the target sound source may actually be understood as the lip video of the target user.
  • the purpose of the speech enhancement model is to perform pickup enhancement processing on the audio signal: it enhances the audio signal in the direction of the target sound source and suppresses or eliminates audio signals generated in other directions, such as loudspeaker sound or background noise, so as to obtain or restore a clearer audio signal.
  • The speech enhancement model of the embodiment of the present application integrates audio and video information, and integrates the correspondence between pronunciation and lip shape; one or more pronunciations may correspond to one lip shape.
  • the audio signal and the user's lip video are used as the input of the speech enhancement model. Based on the correspondence between pronunciation and lip shape and the input lip video, the speech enhancement model performs speech enhancement processing on the audio signal, and a clearer audio signal is obtained for speech recognition.
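  • The internals of the speech enhancement model are not described here. Purely to illustrate its input and output contract (audio frames plus per-frame lip information in, processed audio frames out), the toy below replaces the model with a much simpler lip-activity gate: frames in which the lips are static are attenuated. The function name, the lip-opening feature, the threshold, and the attenuation factor are all hypothetical, and a real audio-visual model would do far more than gating.

```python
def lip_activity_gate(audio_frames, lip_openings,
                      motion_threshold=0.02, attenuation=0.1):
    """Toy stand-in for an audio-visual enhancement model.

    audio_frames: list of frames, each a list of samples.
    lip_openings: one lip-opening measurement per frame (hypothetical
        feature, e.g. mouth height normalised by face height).
    Frames where the lips barely move are attenuated; others pass through.
    """
    enhanced = []
    previous = lip_openings[0]
    for frame, opening in zip(audio_frames, lip_openings):
        moving = abs(opening - previous) >= motion_threshold
        gain = 1.0 if moving else attenuation
        enhanced.append([gain * x for x in frame])
        previous = opening
    return enhanced


frames = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
print(lip_activity_gate(frames, [0.10, 0.10, 0.20]))
```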
  • the speech enhancement module may perform noise reduction processing, echo cancellation residual processing, de-reverberation processing, etc. on the audio signal.
  • the signal processing method in the embodiment of the present application can be applied to any electronic device capable of recognizing speech.
  • the electronic device may be a voice-controlled device such as a smart TV (also referred to as a smart screen).
  • the electronic device may be a voice communication device such as a mobile phone or a computer.
  • a smart TV is taken as an example to describe the electronic device according to the embodiment of the present application.
  • the electronic device 10 includes a housing 110, a display screen 120, a microphone array 130, and a camera 140. The display screen 120, the microphone array 130 and the camera 140 are installed in the housing 110.
  • the display screen 120 is used to display images, videos, and the like.
  • the display screen 120 includes a display panel.
  • the display panel can be a liquid crystal display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode (AMOLED), a flexible light-emitting diode (FLED), a Mini-LED, a Micro-LED, a Micro-OLED, quantum dot light-emitting diodes (QLED), and so on.
  • the microphone array 130 is used to pick up audio signals, and includes a plurality of microphones, which can pick up audio signals in multiple directions.
  • the microphones in the microphone array 130 may be omnidirectional microphones, directional microphones, or a combination of omnidirectional microphones and directional microphones, which is not limited in the implementation of this application.
  • the omnidirectional microphone can pick up audio signals from all directions: no matter where the speaker is, sound from every direction is picked up with the same sensitivity.
  • a directional microphone can only pick up audio signals in a specific direction.
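  • The difference between the two microphone types can be illustrated with the standard first-order polar pattern, gain(θ) = α + (1 − α)·cos θ. This textbook model is an assumption brought in for illustration; the text does not specify the pattern of the microphones, and the function name is hypothetical.

```python
import math


def first_order_gain(angle_deg, alpha):
    """Sensitivity of a first-order microphone at angle_deg off its axis.

    alpha = 1.0 gives an omnidirectional pattern, alpha = 0.5 a cardioid.
    """
    return alpha + (1.0 - alpha) * math.cos(math.radians(angle_deg))


for angle in (0, 90, 180):
    # Omnidirectional gain stays constant; cardioid gain falls off to the rear.
    print(angle, first_order_gain(angle, 1.0), first_order_gain(angle, 0.5))
```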
  • the microphone array 130 may be disposed at any position of the housing 110, which is not limited in the embodiment of the present application.
  • the microphone array 130 is arranged in the housing 110 and is located in an area on one side of the display screen 120. The sound outlet of the microphone array 130 is arranged on the front of the housing 110, and the orientation of the sound outlet is the same as the orientation of the display screen 120. The front of the housing 110 can be understood as the side on which the display screen 120 is located, or as the side of the housing 110 that faces the user under normal use conditions.
  • the microphone array 130 can be arranged in the area of the housing 110 on any side of the display screen 120. Assuming that the microphone array 130 shown in FIG. 1 is arranged in the area of the housing 110 at the top side of the display screen 120, the microphone array 130 may also be provided in areas of the housing 110 on other sides of the display screen 120 (e.g., the left, right, or bottom side).
  • the microphone array 130 may be disposed in the housing 110 and located in the area on the top side of the display screen 120, and the sound outlet of the microphone array 130 may be disposed on the top surface of the housing 110 (not shown in the figure). The top surface of the housing 110 is connected to the front surface of the housing 110, and the orientation of the sound outlet is perpendicular to the orientation of the display screen 120.
  • the microphone array 130 may also be arranged on the rear side of the display screen 120, and the sound outlet of the microphone array 130 is arranged on the display screen 120 (not shown in the figure).
  • the microphone array 130 may also be disposed on the rear side of the display screen 120, and the sound outlet holes of the microphone array 130 may be disposed on the front side of the housing 110.
  • the microphone array 130 may be arranged in a linear structure as shown in FIG. 1, or may be arranged in other structures, which is not limited in the embodiments of the present application.
  • the microphone array 130 may be arranged in a circular configuration, a rectangular configuration, or the like.
  • Camera 140 is used to capture still images or video.
  • the lens forms an optical image of the object and projects it onto the photosensitive element.
  • the photosensitive element may be a charge coupled device (CCD) or a complementary metal-oxide-semiconductor (CMOS) phototransistor.
  • the photosensitive element converts the optical signal into an electrical signal, and then transmits the electrical signal to the image signal processor (ISP), which converts it into a digital image signal.
  • the ISP outputs the digital image signal to the digital signal processor (DSP) for processing.
  • the DSP converts the digital image signal into an image signal in a standard format such as RGB or YUV.
  • the electronic device 10 may include 1 or N cameras 140, where N is a positive integer greater than 1.
  • the camera 140 can be rotated within a preset angle range to shoot a video within a certain angle range, and the video can be used to determine the direction of the target sound source. After the electronic device 10 determines the direction of the target sound source, the camera 140 can be rotated to face that direction, so that the target user is displayed in the middle of the picture as much as possible. This makes it easier to shoot the video in the direction of the target sound source and to obtain the user's lip video, which the speech enhancement model can use to process the input audio signal and output a clearer audio signal for speech recognition.
  • the camera 140 is disposed on the top surface of the housing 110 and protrudes from the top surface, so as to better realize the rotation of the camera 140 .
  • the camera 140 may be located above the microphone array 130 .
  • the camera 140 may be disposed on the front of the housing 110 and in an area on the top side of the display screen 120 .
  • the camera 140 may rotate within a preset angle range, and the preset angle range may be any range of angles.
  • the rotatable angle range of the camera 140 is less than or equal to 180°. For example, the angle range may be 120°, in which case the camera 140 can be rotated within an angle range of 120° in front of the display screen 120 of the smart TV.
  • taking the shooting field of view of the camera 140 into account, basically all pictures within a range of 180° in front of the smart TV can be captured.
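  • Turning the camera toward the target sound source within its rotatable range can be sketched as a simple clamp. The function name and the assumption that the range is symmetric about the direction the display faces are hypothetical; only the 120° example value comes from the text.

```python
def camera_pan_angle(target_deg, rotatable_range_deg=120.0):
    """Clamp the desired pan angle to the camera's rotatable range.

    Angles are measured from the direction the display faces (0 degrees);
    the range is assumed to be symmetric about that direction.
    """
    half = rotatable_range_deg / 2.0
    return max(-half, min(half, target_deg))


print(camera_pan_angle(35.0))   # inside the range, used as-is
print(camera_pan_angle(80.0))   # outside the range, limited to the edge
```

  • A target slightly outside the rotatable range can still appear in the picture, because the camera's field of view extends beyond the pan limit.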
  • the electronic device 10 further includes a directional microphone 150 , which can be rotated to pick up audio signals in a specific direction. After the electronic device 10 determines the target sound source direction, the directional microphone 150 can be rotated to the target sound source direction to perform directional pickup in the target sound source direction.
  • the directional microphone 150 can pick up the sound in the direction of the target sound source without distortion and can suppress interference and reverberation to a certain extent. Because the directional microphone 150 picks up sound toward the front, it also suppresses echo well. Therefore, in the embodiment of the present application, the audio signal obtained by the directional microphone 150 and the audio signal obtained by the microphone array 130 can be used as the audio input of the speech enhancement model, and a clearer audio signal can be obtained or restored.
  • the directional microphone 150 can be provided on the camera 140; for example, the directional microphone 150 is fixedly connected to the camera 140.
  • in this way, when the camera 140 rotates to the target sound source direction, the directional microphone 150 also rotates to the target sound source direction, which is simple and convenient to implement.
  • the electronic device 10 further includes a processor (not shown in the figure), and the display screen 120, the microphone array 130, the camera 140 and the directional microphone 150 are all connected to the processor for inputting the signals collected by the various components to the processor, for further processing.
  • the processor runs the instruction to implement the signal processing method of the embodiment of the present application, so as to obtain a relatively clear audio signal sent by the user, and after performing speech recognition on the audio signal, it can control the corresponding component to execute the instruction corresponding to the audio signal.
  • the structure of the electronic device 10 described above by taking the smart TV as an example is only a schematic illustration, and the electronic device 10 may have more or fewer components.
  • the electronic device 10 may include a microphone array 130 , a camera 140 , and optionally, the electronic device 10 may further include a directional microphone 150 , but the electronic device 10 may not include the display screen 120 .
  • the electronic device 10 may include the directional microphone 150 and the camera 140, but the electronic device 10 does not include the microphone array 130.
  • the audio signal picked up by the directional microphone 150 and the video shot by the camera 140 may be used to determine the direction of the target sound source. The camera 140 then captures a video in the direction of the target sound source and the directional microphone 150 picks up the audio signal in that direction, so that a clearer audio signal is recovered through the speech enhancement model from the video and the audio in the direction of the target sound source.
  • the directional microphone 150 may rotate all the time to collect the audio signal.
  • the electronic device 10 may include other components besides the microphone array 130, the camera 140 and the directional microphone 150.
  • the electronic device 10 may be a mobile phone or a computer.
  • FIG. 4 is an exemplary block diagram of an electronic device 10 provided by an embodiment of the present application.
  • the electronic device 10 may include the display screen 120, the microphone array 130, the directional microphone 150, and the camera 140 shown in FIG. 3.
  • the electronic device 10 may further include one or more of the following components: a processor 160, a wireless communication module 171, an audio module 172, a speaker 173, a touch sensor 174, a key 175, and an internal memory 176.
  • the wireless communication module 171 can provide wireless communication solutions applied on the electronic device 10, including wireless local area network (WLAN) (such as a wireless fidelity (Wi-Fi) network), Bluetooth (BT), global navigation satellite system (GNSS), frequency modulation (FM), near field communication (NFC), and infrared (IR) technology.
  • the wireless communication module 171 may be one or more devices integrating at least one communication processing module.
  • the wireless communication module 171 receives electromagnetic waves via the antenna, frequency modulates and filters the electromagnetic wave signals, and sends the processed signals to the processor.
  • the wireless communication module 171 can also receive the signal to be sent from the processor, perform frequency modulation on it, amplify it, and convert it into electromagnetic waves for radiation through the antenna.
  • the audio module 172 is used for converting digital audio information into analog audio signal output, and also for converting analog audio input into digital audio signal. Audio module 172 may also be used to encode and decode audio signals. In some embodiments, the audio module 172 may be provided in the processor, or some functional modules of the audio module 172 may be provided in the processor 160 .
  • the speaker 173, also called a "loudspeaker", is used to convert audio electrical signals into sound signals.
  • the electronic device 10 can play the sound of music or video through the speaker 173.
  • the speaker 173 can also be used for hands-free calls.
  • the touch sensor 174 is also referred to as a "touch panel”.
  • the touch sensor 174 may be disposed on the display screen 120, and the touch sensor 174 and the display screen 120 together form a touchscreen.
  • the touch sensor 174 is used to detect a touch operation on or near it.
  • the touch sensor 174 may communicate the detected touch operation to the processor 160 to determine the type of touch event.
  • Visual output related to touch operations may be provided through display screen 120 .
  • the touch sensor 174 may also be disposed on the surface of the electronic device 10 at a location different from that of the display screen 120.
  • the keys 175 include a power-on key, a volume key, and the like. The keys 175 may be mechanical keys or touch keys.
  • the electronic device 10 may receive key 175 inputs and generate key signal inputs related to user settings and function control of the electronic device 10 .
  • Internal memory 176 is used to store computer executable program code, which includes instructions.
  • the processor 160 executes various functional applications and data processing of the electronic device 10 by executing the instructions stored in the internal memory.
  • the internal memory 176 may include a program storage area and a data storage area.
  • the program storage area can store an operating system, an application program required for at least one function (such as a sound playback function, an image playback function, etc.), and the like.
  • the data storage area can store data (such as audio data, a phone book, etc.) created during the use of the electronic device 10, and the like.
  • the internal memory 176 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, universal flash storage (UFS), and the like.
  • FIG. 5 is a schematic scene diagram provided by an embodiment of the present application.
  • the target user is watching TV and says "Xiaoyi Xiaoyi, I want to watch a variety show" to the smart TV; the smart TV receives and recognizes this instruction and switches to a variety show.
  • an angle may be used to represent a certain direction.
  • a reference direction may be defined, and the angle between a certain direction and the reference direction may be used to represent that direction.
  • the reference direction may be arbitrary, and the embodiment of the present application does not make any limitation.
  • for example, the direction extending toward the left side along the length direction (for example, the x direction) of the electronic device 10 (the direction indicated by the arrow corresponding to 0° in the figure) can be recorded as the reference direction, and the angle corresponding to the reference direction is 0°.
  • when the target user is facing the electronic device 10, the included angle between the target sound source direction where the target user is located and the reference direction is 90°.
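  • the angle convention above can be illustrated with a minimal sketch. It assumes a hypothetical 2D coordinate frame with the electronic device at the origin, the reference direction along +x, and the front of the device along +y; the function name `direction_angle_deg` is illustrative and does not come from the application.

```python
import math


def direction_angle_deg(x: float, y: float) -> float:
    """Angle in degrees between the reference direction (+x, assumed) and the vector (x, y)."""
    return math.degrees(math.atan2(y, x)) % 360.0


# A user standing directly in front of the device (assumed to be the +y axis).
facing_user_angle = direction_angle_deg(0.0, 2.0)
```

  • under these assumptions, a user directly in front of the device maps to an angle of about 90°, which matches the example in which the target user faces the electronic device 10.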
  • the electronic device 10 includes a microphone array 130, a camera 140 and a processing unit 160.
  • the processing unit 160 may include a target sound source direction determination module 161 and a speech enhancement module 162.
  • the electronic device 10 further includes a directional microphone 150 .
  • FIG. 6 is a schematic flowchart of a signal processing method provided by an embodiment of the present application. Referring to FIG. 6 , the general process of the embodiment of the present application is as follows:
  • the target user starts to send a voice command to the electronic device, and the microphone array 130 picks up the first audio signal.
  • the camera 140 shoots a video to obtain a first video.
  • the processing unit 160 performs sound source localization on the first audio signal to obtain sound source direction information including at least one sound source direction, and the processing unit 160 processes the first video to obtain user direction information. This step may be performed by the target sound source direction determination module 161 in the processing unit 160.
  • the processing unit 160 determines the target sound source direction where the target user is located according to the sound source direction information and the user direction information, and this step can be performed by the target sound source direction determination module 161 in the processing unit 160 .
  • the processing unit 160 controls the camera 140 to rotate to the direction of the target sound source, and the camera 140 shoots a video in the direction of the target sound source to obtain a video of the user's lips in the direction of the target sound source.
  • the microphone array 130 continues to pick up the second audio signal, where the second audio signal is a signal that actually needs speech recognition.
  • the processing unit 160 may also control the directional microphone 150 to rotate to the target sound source direction, and the directional microphone 150 picks up the fourth audio signal in the target sound source direction.
  • the processing unit 160 controls the camera 140 and the directional microphone 150 to rotate together to the direction of the target sound source.
  • taking the second audio signal and the video of the user's lips in the direction of the target sound source as input, the processing unit 160 performs speech enhancement processing on the second audio signal through the speech enhancement model and obtains an enhanced, relatively clear third audio signal. This step may be performed by the speech enhancement module 162 in the processing unit 160.
  • the processing unit 160 uses the speech enhancement model to perform speech enhancement processing on the second audio signal and the fourth audio signal to obtain the third audio signal.
  • steps S210 and S220 may be performed simultaneously
  • steps S250 and S260 may be performed simultaneously
  • steps S250, S260 and S270 may be performed simultaneously
  • step S250 may be performed before step S260, or may be performed after step S260.
  • the first video obtained by the camera is combined with the first audio signal obtained by the microphone array to determine the direction of the target sound source where the target user who is performing voice interaction with the electronic device is located. This can greatly improve the estimation accuracy of the target sound source direction, and it avoids the situation in which, when the direction is determined from the audio signal alone, a false sound source produced by strong reflected sound interferes with the determination.
  • in addition, the video of the user's lips in the target sound source direction obtained by the camera is used to perform speech enhancement on the second audio signal obtained through the microphone array. Since the correspondence between pronunciation and lip shape is integrated in the speech enhancement model, a relatively clean third audio signal can be recovered by combining the user's lip video with the speech enhancement model, which can effectively improve speech recognition efficiency.
  • the directional microphone has a certain inhibitory effect on reverberation, interference other than the direction of the target sound source, and the echo of the display screen itself, and has a further inhibitory effect on the residual echo after echo cancellation.
  • in the embodiment of this application, the fourth audio signal picked up by the directional microphone in the direction of the target sound source is combined with the second audio signal obtained by the microphone array, and these two audio signals are used as audio input, which can greatly improve the effect of sound pickup enhancement and improve the efficiency of speech recognition.
  • FIG. 7 is a schematic flowchart of a signal processing method 300 provided by another embodiment of the present application, and the method may be executed by the processing unit 160 of the electronic device 10 .
  • step S310 sound source localization is performed on the first audio signal obtained through the microphone array 130 to obtain sound source direction information, where the sound source direction information includes at least one sound source direction.
  • the at least one sound source direction includes a target sound source direction.
  • the user issues a voice command to the electronic device, and the microphone array 130 picks up the audio signal.
  • This step can be used for sound source localization, and the first audio signal can be a small part of the content of the voice command issued by the user, and the small part of the content basically does not affect the subsequent content for speech recognition.
  • the first audio signal may be a wake-up signal.
  • for example, the voice command issued by the user is "Xiaoyi Xiaoyi, I want to watch a variety show".
  • the first audio signal can be one or more words of "Xiaoyi Xiaoyi", or one or more occurrences of "Xiaoyi", which can be understood as a wake-up signal.
  • when the electronic device detects "Xiaoyi", it can determine that the user may need the electronic device to execute a voice command; sound source localization is then performed on the audio signal picked up by the microphone array 130, and subsequent audio signals continue to be picked up.
  • the first audio signal may be the first few words in the voice command.
  • the microphone array 130 will detect the audio signal.
  • the voice command issued by the user is "I want to watch a variety show", and the first audio signal may be "I”.
  • sound source localization is performed on the first audio signal picked up by the microphone array 130 in order to determine the direction of the target sound source where the target user is located, that is, the direction of the sound source that actually issues the voice command. However, the microphone array 130 picks up audio signals from all directions. Because of the various interfering sounds in the environment, the sound source direction finally obtained may not be accurate, and at least one sound source direction will be obtained. The at least one sound source direction includes the direction of the target sound source and may also include the directions of interfering signals. For example, the target user sends a voice command to the electronic device while a speaker is playing music and other users (referred to as interfering users) are speaking. Assuming that all three sounds can be picked up by the microphone array 130, 3, 2, or 1 sound source directions may be determined, but the result is not accurate. Therefore, the target sound source direction needs to be further determined in combination with the video.
  • the technology for sound source localization in this embodiment of the present application may be a steered (controllable) beamforming technology based on maximum output power, a high-resolution spectral estimation technology, or a sound source localization technology based on time-delay estimation (TDE), which is not limited in the embodiments of the present application.
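  • as an illustration of the time-delay estimation idea only, the following minimal sketch estimates the delay between two microphone signals by brute-force cross-correlation and converts it to an angle with a far-field model. The two-microphone setup, the far-field assumption, the speed-of-sound constant, and all function names are assumptions made for this sketch; the application does not specify how the microphone array 130 is processed.

```python
import math

SPEED_OF_SOUND_M_S = 343.0  # approximate speed of sound in air at room temperature


def estimate_delay_samples(sig_a, sig_b, max_lag):
    """Return the lag (in samples) by which sig_b trails sig_a, chosen to maximize cross-correlation."""
    n = min(len(sig_a), len(sig_b))
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i in range(n):
            j = i + lag
            if 0 <= j < n:
                score += sig_a[i] * sig_b[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def delay_to_angle_deg(lag_samples, sample_rate_hz, mic_spacing_m):
    """Far-field model: path difference = spacing * cos(angle).

    The angle is measured from the axis pointing from microphone b to microphone a.
    """
    path_diff_m = lag_samples / sample_rate_hz * SPEED_OF_SOUND_M_S
    cos_theta = max(-1.0, min(1.0, path_diff_m / mic_spacing_m))
    return math.degrees(math.acos(cos_theta))


# Toy signals: the same impulse arrives 2 samples later at microphone b.
sig_a = [0.0] * 16
sig_b = [0.0] * 16
sig_a[5] = 1.0
sig_b[7] = 1.0
lag = estimate_delay_samples(sig_a, sig_b, max_lag=4)
angle = delay_to_angle_deg(lag, sample_rate_hz=16000, mic_spacing_m=0.1)
```

  • a zero delay corresponds to a source on the perpendicular bisector of the two microphones (90° in this sketch). With strong reflections, the correlation peak can point at a false sound source, which is the failure mode that the video information is meant to correct.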
  • a certain direction in this embodiment of the present application can be represented by an angle.
  • step S320 the first video obtained by the camera 140 is processed to obtain user direction information.
  • the user direction information includes some directions related to the user, for example, the user direction information includes at least one type of direction related to the user.
  • when the user sends a voice command to the electronic device, the camera 140 can shoot a video, and the electronic device can determine the direction of the target sound source based on the obtained first video.
  • a voice command issued by the user may be used as a trigger condition for the camera 140 to shoot a video, and the electronic device detects the voice command issued by the user and controls the camera 140 to start shooting a video. Since the camera 140 in the embodiments of the present application can be rotated, in some examples, the camera 140 can be rotated while shooting videos, so as to obtain pictures with more angular ranges.
  • the camera 140 can shoot video all the time, and use the video for a period of time after the electronic device receives the voice command from the user as the first video to determine the direction of the target sound source.
  • the electronic device processes the first video captured by the camera 140 and detects user-related content in the first video to obtain the user direction information, where the user direction information includes at least one type of direction related to the user.
  • the users involved in the user direction information not only include the target users who are interacting with the electronic device, but also other users, as long as they are the users detected in the first video.
  • users other than the target user can be understood as interfering users.
  • the user direction information includes at least one type of direction related to the user, each type of direction including at least one direction.
  • the at least one type of direction includes at least one of the following:
  • a first type of direction, where the first type of direction includes the direction in which at least one active (moving) lip is located;
  • a second type of direction, where the second type of direction includes the direction in which at least one user is located;
  • a third type of direction, where the third type of direction includes the direction in which at least one user who is looking at the electronic device is located.
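  • one possible way to hold the three types of directions in a program is sketched below. The class name, field names, and example angles are hypothetical and are chosen only for illustration; directions are stored as angles in degrees relative to the reference direction.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class UserDirectionInfo:
    """Hypothetical container for the user direction information."""
    active_lip_deg: List[float] = field(default_factory=list)         # first type
    user_deg: List[float] = field(default_factory=list)               # second type
    looking_at_device_deg: List[float] = field(default_factory=list)  # third type

    def by_type(self) -> Dict[str, List[float]]:
        """Return only the direction types that contain at least one direction."""
        all_types = {
            "active_lip": self.active_lip_deg,
            "user": self.user_deg,
            "looking_at_device": self.looking_at_device_deg,
        }
        return {name: dirs for name, dirs in all_types.items() if dirs}


# Illustrative values: two users detected, one of them with moving lips.
info = UserDirectionInfo(active_lip_deg=[93.0], user_deg=[63.0, 95.0])
present_types = info.by_type()
```

  • leaving a list empty models the case in which the user direction information includes only one or two of the three types of directions.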
  • the first video is used to detect whether someone's lips are moving, that is, detecting whether someone is speaking, which can effectively exclude the scene where a person is speaking in the video.
  • scenes in which an interfering user is speaking can also be excluded to a certain extent. For example, the target user is watching TV and gives voice commands to the TV, while User 1 is also talking but is doing housework with his head down and is not facing the TV. In most cases the lip movement of User 1 then cannot be detected through the first video, and only the lips of the target user can be detected moving, so User 1 is an interfering user and can be effectively excluded.
  • the first category of directions includes the target sound source direction.
  • interference signals from other non-users can be effectively excluded, for example, interference signals from speakers can be excluded.
  • the multiple users can be detected in the first video, and the directions where the multiple users are located can be obtained.
  • the second type of direction includes the direction of the target sound source.
  • a user usually looks at the electronic device when issuing a voice command, so that the electronic device can receive the voice command well and the user can quickly know whether the electronic device is executing the command or can get some feedback from the electronic device. For example, when the user issues a voice command asking about the weather, the user needs to look at the weather conditions displayed on the electronic device. Therefore, by detecting users who are looking at the electronic device, scenes in which an interfering user is speaking can be effectively excluded.
  • for example, the target user is watching TV and gives a voice command to the TV, while User 1 speaks to the target user but does not look at the TV. In most cases the first video then does not show User 1 looking at the electronic device, and only the target user is detected looking at the electronic device, so User 1 is an interfering user and can be effectively excluded.
  • the third category of directions includes the direction of the target sound source.
  • the user direction information may include one type or two types or three types of directions among the above three types of directions, which is not limited in any embodiment of the present application.
  • the more types of directions included in the user direction information the more favorable it is to improve the accuracy of determining the direction of the target sound source.
  • the user direction information may also include other directions related to the user, which is not limited in this embodiment of the present application.
  • for example, the user direction information may include other directions related to user behavior.
  • step S330 the target sound source direction is determined according to the sound source direction information and the user direction information.
  • the direction of the target sound source is the direction of the target user who is performing voice interaction with the electronic device.
  • the sound source direction in the sound source direction information can be regarded as one type of direction, and is combined with at least one type of direction related to the user to jointly determine the target sound source direction.
  • FIG. 8 is a schematic flowchart of a method 230 for an electronic device to determine the direction of a target sound source provided by another embodiment of the present application.
  • the electronic device may determine the direction of the target sound source in the following manner:
  • step S33 at least one sound source direction in the sound source direction information and at least one type of direction in the user direction information are combined and processed to obtain at least one combined direction;
  • step S332 the target sound source direction is determined from the at least one direction.
  • the following takes the sound source direction and the above three types of directions as examples, and first describes the manner of obtaining the combined at least one direction.
  • when the angles corresponding to multiple directions differ from one another by no more than a threshold, the multiple directions can be considered to be the same direction, and one direction can be determined based on the multiple directions.
  • the finally determined direction may be any one of the multiple directions, or may be obtained by taking the average value of the multiple directions, which is not limited in the embodiments of the present application.
  • the threshold can be reasonably designed based on the actual application scenario; for example, the threshold may be 5°.
  • for example, the sound source direction information includes 4 sound source directions, and the corresponding angles are 30°, 60°, 95°, and 120°, respectively.
  • the first type of direction includes one direction, and the corresponding angle is 93°.
  • the second type of direction includes two directions, and the corresponding angles are 63° and 95°, respectively.
  • the third type of direction includes one direction, and the corresponding angle is 95°.
  • the angles obtained by the merge processing are 30°, 61.5°, 94.5°, and 120°; that is, the merged directions include 4 directions.
  • the direction of the target sound source is one of the four directions. In fact, the direction corresponding to 94.5° is the direction of the target sound source.
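  • the merge processing in this example can be sketched as follows. This is a minimal illustration that assumes a 5° threshold, groups sorted angles whose distance from the first angle of the current group is at most the threshold, and averages each group; the function name `merge_directions` and the grouping rule are assumptions, and wrap-around near 0°/360° is not handled.

```python
from typing import Iterable, List


def merge_directions(angle_groups: Iterable[Iterable[float]],
                     threshold_deg: float = 5.0) -> List[float]:
    """Merge nearby angles from all direction types and return one averaged angle per group."""
    angles = sorted(a for group in angle_groups for a in group)
    clusters: List[List[float]] = []
    for angle in angles:
        # Join the current cluster if the angle is close enough to the cluster's first angle.
        if clusters and angle - clusters[-1][0] <= threshold_deg:
            clusters[-1].append(angle)
        else:
            clusters.append([angle])
    return [sum(c) / len(c) for c in clusters]


merged = merge_directions([
    [30, 60, 95, 120],  # sound source directions
    [93],               # first type: active lips
    [63, 95],           # second type: users
    [95],               # third type: users looking at the device
])
```

  • with the angles of the example, 60° and 63° average to 61.5°, and 93°, 95°, 95°, and 95° average to 94.5°, giving the merged directions 30°, 61.5°, 94.5°, and 120°.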
  • the target user basically faces the electronic device and performs voice interaction with the electronic device.
  • after obtaining the combined at least one direction, the electronic device determines the target sound source direction from the at least one direction.
  • some parameters may be set based on the specific scene of far-field sound pickup, and the target sound source direction may be determined based on these parameters.
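  • the electronic device may determine the target sound source direction from the at least one direction according to at least one of the following parameters:
  • the sum of the frequencies with which each direction is detected in the sound source direction and in the at least one type of direction;
  • whether the electronic device has successfully performed voice interaction with the user within a preset time period and a preset angle range corresponding to each direction, where the preset time period is the time period between the current time and a historical time, and the preset angle range includes the angle corresponding to each direction;
  • the angle between each direction and the direction perpendicular to the display screen of the electronic device.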
  • each parameter is described below by taking the case in which the at least one type of direction includes the above three types of directions as an example, using the four sound source directions and the angles corresponding to the three types of directions given above.
  • the sound source direction information includes 4 sound source directions, and the corresponding angles are 30°, 60°, 95°, and 120°.
  • the first type of direction includes 1 direction, and the corresponding angle is 93°.
  • the second type of direction includes 2 directions, and the corresponding angles are 63° and 95°, respectively.
  • the third type of direction includes 1 direction, and the corresponding angle is 95°.
  • the angles corresponding to the four directions obtained by the merge processing are 30°, 61.5°, 94.5°, and 120°.
  • first parameter: the sum of the frequencies (numbers of detections) with which each direction is detected in the sound source direction and in the at least one type of direction.
  • the frequencies with which 30° is detected in the sound source direction, the first type of direction, the second type of direction, and the third type of direction are 1, 0, 0, and 0, and the sum of the frequencies is 1. For 61.5°, the frequencies are 1, 0, 1, and 0, and the sum is 2. For 94.5°, the frequencies are 1, 1, 1, and 1, and the sum is 4. For 120°, the frequencies are 1, 0, 0, and 0, and the sum is 1.
  • it can be seen that 94.5° has the highest sum of frequencies detected in the sound source direction and the at least one type of direction.
  • this direction is basically the direction of the target sound source.
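  • the frequency sums in this example can be reproduced with the following minimal sketch. It assumes that a detection counts toward a merged direction when it lies within 5° of it; the function name `frequency_sums` and the dictionary keys are hypothetical.

```python
from typing import Dict, List, Sequence


def frequency_sums(merged_deg: Sequence[float],
                   detections: Dict[str, List[float]],
                   threshold_deg: float = 5.0) -> Dict[float, int]:
    """For each merged direction, count the detections of all types that fall near it."""
    sums: Dict[float, int] = {}
    for direction in merged_deg:
        sums[direction] = sum(
            1
            for angles in detections.values()
            for angle in angles
            if abs(angle - direction) <= threshold_deg
        )
    return sums


detections = {
    "sound_source": [30, 60, 95, 120],
    "active_lip": [93],
    "user": [63, 95],
    "looking_at_device": [95],
}
sums = frequency_sums([30, 61.5, 94.5, 120], detections)
best_direction = max(sums, key=sums.get)
```

  • the sketch yields the sums 1, 2, 4, and 1 for 30°, 61.5°, 94.5°, and 120°, so 94.5° is selected, in agreement with the example.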
  • second parameter: whether the electronic device has successfully performed voice interaction with the user within a preset time period and a preset angle range corresponding to each direction, where the preset time period is the time period between the current time and a historical time, and the preset angle range includes the angle corresponding to each direction.
  • the preset angle range corresponding to each direction may include not only the angle corresponding to the direction, but also angles near that angle. For example, if the angle corresponding to a certain direction is 30°, the preset angle range may be 25° to 35°. It should be understood that the smaller the preset angle range is, the more accurate the target sound source direction determined by using this parameter is.
  • the preset time period is the time period between the current time and the historical time, and the historical time is the time before the current time.
  • the duration of the preset time period should not be set too long, which is conducive to more accurate determination of the target sound source direction.
  • the duration of the preset time period can be set to 1 minute, 5 minutes, 10 minutes, and so on. Assuming that the current time is 10:30 and the duration of the preset time period is 10 minutes, the historical time is 10:20, and the preset time period is the period between 10:20 and 10:30.
  • the second parameter can be understood as whether the electronic device has successfully performed voice interaction with the user in the vicinity of the angle corresponding to a certain direction within the preset time period.
  • the user is likely to keep using the electronic device for a certain period of time. Especially for an electronic device with a display screen, such as a smart TV, the user basically does not change position frequently while watching TV. Therefore, if the electronic device and the user have successfully performed voice interaction within the preset time period and the preset angle range corresponding to a certain direction, the direction is more likely to be the direction of the target sound source; otherwise, the direction is less likely to be the direction of the target sound source. Further, the more times the electronic device and the user have successfully performed voice interaction, the more likely the direction is to be the direction of the target sound source, and vice versa.
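  • a minimal sketch of the second parameter is given below. It assumes that the electronic device keeps a log of successful voice interactions as (time, angle) pairs, that the preset time period is 10 minutes, and that the preset angle range is ±5° around each direction; the log format, the dates, and the function name `count_recent_successes` are hypothetical.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def count_recent_successes(direction_deg: float,
                           history: List[Tuple[datetime, float]],
                           now: datetime,
                           period: timedelta = timedelta(minutes=10),
                           half_width_deg: float = 5.0) -> int:
    """Count successful voice interactions inside the preset time period and angle range."""
    earliest = now - period  # the "historical time"
    return sum(
        1
        for when, angle in history
        if earliest <= when <= now and abs(angle - direction_deg) <= half_width_deg
    )


now = datetime(2024, 1, 1, 10, 30)
history = [
    (datetime(2024, 1, 1, 10, 22), 93.0),  # inside 10:20-10:30, near 94.5 degrees
    (datetime(2024, 1, 1, 10, 5), 95.0),   # before 10:20, so ignored
    (datetime(2024, 1, 1, 10, 25), 31.0),  # inside 10:20-10:30, near 30 degrees
]
successes_near_94_5 = count_recent_successes(94.5, history, now)
successes_near_120 = count_recent_successes(120.0, history, now)
```

  • a count greater than zero answers the yes/no question of the second parameter, and a larger count can be read as stronger evidence for that direction.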
  • third parameter: the angle between each direction and the direction perpendicular to the display screen of the electronic device.
  • the third parameter is more suitable for an electronic device with a display screen, and the direction perpendicular to the display screen of the electronic device can be understood as the thickness direction of the electronic device.
  • in other words, the third parameter can be understood as whether the user is in the vicinity of a specific direction defined for a preset usage scenario of the electronic device.
  • the electronic device may determine the direction of the target sound source based on one, two, or three of the above parameters, which is not limited in the embodiment of the present application. The cases are described below.
  • the at least one parameter includes the first parameter, ie, the at least one parameter includes the sum of the detected frequencies for each direction in the direction of the sound source and in at least one type of direction.
  • the direction in which the sum of the detected frequencies in the sound source direction and at least one type of direction is the largest may be used as the target sound source direction.
  • the at least one parameter includes the second parameter, that is, the at least one parameter includes: within a preset time period and a preset angle range corresponding to each direction, whether the electronic device has successfully performed voice interaction with the user.
  • the direction corresponding to the angle at which the electronic device and the user have successfully performed voice interaction within the preset time period and the preset angle range is determined as the target sound source direction.
  • the at least one parameter includes a third parameter, that is, the at least one parameter includes an angle between each direction and a direction perpendicular to the display screen of the electronic device.
  • the direction with the smallest included angle with the direction perpendicular to the display screen of the electronic device may be determined as the target sound source direction.
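  • selection by the third parameter can be sketched as follows. The sketch assumes that, in the angle convention used above, the direction perpendicular to the display screen corresponds to 90° (the angle of a user facing the electronic device); the function name `closest_to_screen_normal` is hypothetical.

```python
from typing import Sequence


def closest_to_screen_normal(directions_deg: Sequence[float],
                             normal_deg: float = 90.0) -> float:
    """Return the direction with the smallest included angle to the screen normal."""
    return min(directions_deg, key=lambda d: abs(d - normal_deg))


target = closest_to_screen_normal([30, 61.5, 94.5, 120])
```

  • for the merged directions of the example, the included angles with the screen normal are 60°, 28.5°, 4.5°, and 30°, so 94.5° is selected.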
  • At least one parameter includes any two or three parameters.
  • a candidate sound source direction may be obtained for each parameter based on the principle in the corresponding example above, and the candidate sound source direction with the highest repetition rate may be used as the target sound source direction.
  • the at least one parameter includes a first parameter and a second parameter. For the first parameter, the direction with the largest total number of detections among the sound source directions and the at least one type of direction is used as one candidate sound source direction; assume that this candidate is 94.5°. For the second parameter, the direction corresponding to the angle at which the electronic device and the user have successfully performed voice interaction within the preset time period and the preset angle range is used as another candidate sound source direction; assume that this candidate is also 94.5°. The target sound source direction obtained from the two candidates is then 94.5°.
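The "highest repetition rate" rule above can be sketched as a simple vote over the per-parameter candidates. This is only an illustration: the function name `pick_by_repetition` is hypothetical, and candidates are compared by exact value, whereas a real device would presumably compare angles within some tolerance.

```python
from collections import Counter

def pick_by_repetition(candidates):
    """Return the candidate direction (in degrees) that occurs most often.

    Ties are resolved in favour of the candidate seen first, which is how
    Counter.most_common orders equal counts.
    """
    counts = Counter(candidates)
    direction, _ = counts.most_common(1)[0]
    return direction

# Candidates from the example: both parameters point at 94.5 degrees.
candidates = [94.5, 94.5]
target = pick_by_repetition(candidates)
```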
  • the electronic device may determine the confidence of each direction according to at least one parameter, and determine the direction corresponding to the confidence with the largest numerical value in the at least one direction as the direction of the target sound source.
  • the confidence of each direction can also be called the reliability of that direction. It indicates the probability that the direction is the target sound source direction: the greater the confidence, the greater the probability that the corresponding direction is the target sound source direction.
  • the method of determining the direction of the target sound source by the confidence is described. It should be understood that in the embodiment where at least one parameter includes one or two parameters, the method of determining the direction of the target sound source by using the confidence is similar to the embodiment with three parameters, and reference may be made to the following description, which will not be repeated hereafter.
  • a weighting value may be configured for each parameter according to the priority of the three parameters, the confidence of each direction may be calculated from the scores that the direction obtains for each parameter, and the target sound source direction may be determined based on that confidence.
  • the direction corresponding to the confidence degree with the largest numerical value in at least one direction may be determined as the target sound source direction
  • the priorities of the three parameters are in descending order: the priority of the first parameter>the priority of the second parameter>the priority of the third parameter, correspondingly, the first parameter
  • the weighted value of the first parameter is 0.5
  • the weighted value of the second parameter is 0.3
  • the weighted value of the third parameter is 0.2.
  • the score for each direction being detected once is 10 points.
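  • if the electronic device and the user have successfully performed voice interaction within the preset time period and within the preset angle range corresponding to a certain direction, the score for this direction is also 10 points.
  • if the included angle between a certain direction and the direction perpendicular to the display screen of the electronic device is within a threshold, the score for this direction is also 10 points; for example, the threshold is 10°.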
  • the four sound source directions have corresponding angles of 30°, 60°, 95°, and 120°
  • the first type of direction includes one direction
  • the corresponding angle is 93°
  • the second type of direction includes two directions.
  • the corresponding angles are 63° and 95°, respectively.
  • the third type of direction includes one direction, and the corresponding angle is 95°.
  • the angles corresponding to the four directions obtained by the combined processing are: 30°, 61.5°, 94.5°, and 120°.
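The merged angles in this example are consistent with averaging detections that lie close together: 61.5° is the mean of 60° and 63°, and 94.5° is the mean of 95°, 93°, 95° and 95°. The sketch below reproduces that arithmetic under assumptions that are not in the source: the function name `merge_directions`, the 5° grouping tolerance, and the use of a plain arithmetic mean are all illustrative choices.

```python
def merge_directions(angles, tolerance_deg=5.0):
    """Group nearby angles and return (mean_angle, detection_count) per group.

    An angle joins the current group when it lies within tolerance_deg of
    that group's running mean (an assumed rule, chosen for illustration).
    """
    groups = []
    for angle in sorted(angles):
        if groups and angle - sum(groups[-1]) / len(groups[-1]) <= tolerance_deg:
            groups[-1].append(angle)
        else:
            groups.append([angle])
    return [(sum(g) / len(g), len(g)) for g in groups]

sound_source = [30, 60, 95, 120]   # from sound source localization
first_type = [93]                  # directions of active lips
second_type = [63, 95]             # directions of detected users
third_type = [95]                  # directions of users looking at the device

merged = merge_directions(sound_source + first_type + second_type + third_type)
```

The detection count returned for each merged direction is the quantity the first parameter scores.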
  • a direction that is detected only among the sound source directions (30° and 120° in this example) obtains one score of 10 points for the first parameter.
  • the direction corresponding to 61.5° is detected among the sound source directions and the second type of direction, so it obtains two scores of 10 points, that is, 20 points, for the first parameter.
  • if the electronic device has successfully performed voice interaction once in the direction corresponding to 61.5°, that direction obtains one score of 10 points for the second parameter.
  • the direction corresponding to 94.5° is detected among the sound source directions and all three types of directions, so it obtains four scores of 10 points, that is, 40 points, for the first parameter.
  • if the electronic device has successfully performed voice interaction once in the direction corresponding to 94.5°, that direction obtains one score of 10 points for the second parameter.
  • the confidence value of 94.5° is the highest, then the direction corresponding to 94.5° is determined as the direction of the target sound source.
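The scoring in this example can be worked through as a weighted sum. The sketch below is a reading of the example rather than a formula stated in the source: combining the three scores as a weighted sum, treating 90° as the direction perpendicular to the display screen, and the interaction flags for 30° and 120° are all assumptions made for illustration.

```python
WEIGHTS = {"detections": 0.5, "interaction": 0.3, "screen_angle": 0.2}
POINTS = 10                  # points per detection / per satisfied condition
SCREEN_NORMAL_DEG = 90.0     # assumed angle of the screen's perpendicular
ANGLE_THRESHOLD_DEG = 10.0   # threshold from the example

def confidence(angle, detection_count, interacted):
    """Weighted confidence of one merged direction (assumed weighted sum)."""
    detection_score = POINTS * detection_count
    interaction_score = POINTS if interacted else 0
    near_normal = abs(angle - SCREEN_NORMAL_DEG) < ANGLE_THRESHOLD_DEG
    angle_score = POINTS if near_normal else 0
    return (WEIGHTS["detections"] * detection_score
            + WEIGHTS["interaction"] * interaction_score
            + WEIGHTS["screen_angle"] * angle_score)

# (angle, number of detections, successful voice interaction nearby)
directions = [
    (30.0, 1, False),   # interaction flag assumed
    (61.5, 2, True),
    (94.5, 4, True),
    (120.0, 1, False),  # interaction flag assumed
]
scores = {angle: confidence(angle, n, ok) for angle, n, ok in directions}
target = max(scores, key=scores.get)
```

Under these assumptions 94.5° obtains the largest confidence, which matches the direction selected in the example.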
  • step S340 a video of the user's lips in the direction of the target sound source is obtained through the camera 140.
  • after determining the direction of the target sound source, the electronic device rotates the camera 140 to that direction, and the camera 140 shoots a video in the direction of the target sound source; the video includes a video of the lips of the target user in that direction.
  • step S350 a second audio signal is obtained through the microphone array 130.
  • the second audio signal is a signal indicative of the actual voice command.
  • the voice instruction issued by the target user is "Xiaoyi Xiaoyi, I want to watch a variety show"
  • the second audio signal can be used to indicate the voice instruction "I want to watch a variety show”.
  • the second audio signal is acquired in the direction of the target sound source through the microphone array 130 based on the beamforming technology.
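The source does not state which beamformer is used. As a stand-in, the sketch below shows delay-and-sum beamforming for a linear array, steered to a given direction. The function names, the 343 m/s speed of sound, the far-field plane-wave model, and the convention that the angle is measured from the array axis (so 90° is straight ahead) are assumptions made for illustration.

```python
import math

SPEED_OF_SOUND_M_S = 343.0  # approximate speed of sound in air (assumption)

def steering_delays(mic_x_m, direction_deg):
    """Delays (seconds) that time-align a far-field source at direction_deg.

    mic_x_m holds microphone positions along the array axis; the angle is
    measured from that axis, so 90 degrees is straight in front of the array.
    """
    cos_theta = math.cos(math.radians(direction_deg))
    # Relative arrival time at each microphone: microphones nearer the source
    # (larger x * cos(theta)) receive the wavefront earlier.
    arrival = [-x * cos_theta / SPEED_OF_SOUND_M_S for x in mic_x_m]
    latest = max(arrival)
    # Delay every channel until the latest arrival so all channels line up.
    return [latest - t for t in arrival]

def delay_and_sum(channels, delays_s, sample_rate_hz):
    """Shift each channel by its delay (rounded to whole samples) and average."""
    shifts = [round(d * sample_rate_hz) for d in delays_s]
    length = len(channels[0])
    out = [0.0] * length
    for samples, shift in zip(channels, shifts):
        for n in range(shift, length):
            out[n] += samples[n - shift]
    return [value / len(channels) for value in out]

# Steering delays for a three-microphone array aimed at 94.5 degrees.
delays = steering_delays([0.0, 0.05, 0.10], 94.5)
```

Signals arriving from the steered direction add coherently after the shifts, while signals from other directions are misaligned and partially cancel, which is the enhancement effect the text attributes to beamforming.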
  • step S360 according to the second audio signal and the video of the user's lips, a third audio signal is obtained through a speech enhancement model, where the speech enhancement model includes correspondences between multiple pronunciations and multiple lip shapes.
  • the purpose of the speech enhancement model is to perform pickup enhancement processing on the audio signal, enhance the audio signal in the direction of the target sound source, and suppress or eliminate the audio signal in other directions, so as to obtain or restore a clearer audio signal.
  • the speech enhancement model fuses audio and video information and integrates the correspondence between pronunciation and lip shape, that is, one or more pronunciations correspond to one lip shape; the second audio signal is used as the audio input, and the video of the user's lips in the direction of the target sound source is used as the video input.
  • the speech enhancement model can enhance the audio signal based on the correspondence between pronunciation and lip shape and the input video of the user's lips, and obtain or restore a relatively clear third audio signal for speech recognition.
  • the audio signals are processed by the audio and video information, so that a relatively clean audio signal can be obtained, which greatly improves the sound pickup enhancement effect.
  • the speech enhancement module may perform noise reduction processing, echo cancellation residual processing, de-reverberation processing, etc. on the second audio signal.
  • FIG. 9 is a schematic flowchart of a signal processing method 400 provided by another embodiment of the present application, and the method may be executed by the processing unit 160 of the electronic device 10 .
  • step S410 sound source localization is performed on the first audio signal obtained through the microphone array 130 to obtain sound source direction information, where the sound source direction information includes at least one sound source direction.
  • the at least one sound source direction includes a target sound source direction.
  • step S410 For the specific description of step S410, reference may be made to the relevant description of step S310 above.
  • step S420 the first video obtained by the camera 140 is processed to obtain user direction information, where the user direction information includes at least one type of direction related to the user.
  • step S420 For the specific description of step S420, reference may be made to the relevant description of step S320 above.
  • step S430 the target sound source direction is determined according to the sound source direction information and the user direction information, where the target sound source direction is the direction of the target user who is performing voice interaction with the electronic device.
  • step S430 For the specific description of step S430, reference may be made to the relevant description of step S330 above.
  • step S440 a video of the user's lips in the direction of the target sound source is obtained through the camera 140.
  • step S440 For the specific description of step S440, reference may be made to the relevant description of step S340 above.
  • step S450 a second audio signal is obtained through the microphone array 130.
  • step S450 For the specific description of step S450, reference may be made to the relevant description of step S350 above.
  • step S460 a fourth audio signal in the direction of the target sound source is obtained through the directional microphone 150.
  • the electronic device may control the directional microphone 150 to rotate to the direction of the target sound source, and pick up the fourth audio signal in the direction of the target sound source.
  • the electronic device can control the camera 140 and the directional microphone 150 to rotate together to the direction of the target sound source.
  • step S470 a third audio signal is obtained through a speech enhancement model according to the second audio signal, the fourth audio signal and the video of the user's lips.
  • the second audio signal picked up by the microphone array 130 and the fourth audio signal picked up by the directional microphone 150 are used as the audio input of the speech enhancement model, and the video of the user's lips is used as the video input.
  • the input audio signal is processed to obtain a relatively clear third audio signal.
  • because the directional microphone 150 can pick up the sound in the direction of the target sound source without distortion, it can suppress interference and reverberation to a certain extent; in addition, the directional microphone 150 picks up sound facing forward, which also suppresses echo well.
  • the size of the sequence number of each process does not imply the order of execution; the execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiments of the present application.
  • the embodiment of the present application also provides an electronic device, which may be the electronic device shown in FIG. 4 , the electronic device includes a microphone array 130, a rotatable camera 140 and a processor 160, and the processor 160 is used for:
  • a third audio signal is obtained through a speech enhancement model, where the speech enhancement model includes a correspondence between pronunciation and lip shape.
  • the electronic device further includes a directional microphone 150
  • the processor 160 is further configured to:
  • the processor 160 is specifically used for:
  • the third audio signal is obtained through the speech enhancement module according to the second audio signal, the fourth audio signal and the video of the user's lips.
  • the directional microphone 150 is fixedly connected to the camera 140 .
  • the user direction information includes at least one of the following types of directions:
  • the first class of directions including the direction in which the at least one active lip is located;
  • the second type of direction includes the direction in which at least one user is located;
  • a third type of direction includes a direction in which at least one user who is looking at the electronic device is located.
  • the sound source direction information includes at least one sound source direction
  • the processor 160 is specifically configured to:
  • the target sound source direction is determined from the at least one direction.
  • processor 160 is specifically configured to:
  • the target sound source direction is determined from the at least one direction according to at least one of the following parameters;
  • the at least one parameter includes:
  • the preset time period is the time period between the current time and the historical time
  • the included angle between each of the directions and a direction perpendicular to the display screen of the electronic device.
  • processor 160 is specifically configured to:
  • the direction corresponding to the confidence level with the largest numerical value in the at least one direction is determined as the target sound source direction.
  • processor 160 is specifically configured to:
  • the second audio signal is obtained in the direction of the target sound source based on a beamforming technique.
  • the first audio signal is a wake-up signal.
  • the electronic device is a smart TV.
  • connection and “fixed connection” should be interpreted in a broad sense.
  • specific meanings of the above various terms in the embodiments of the present application can be understood according to specific situations.
  • "connection" can be any of various connection methods such as fixed connection, rotational connection, flexible connection, movable connection, integral molding, and electrical connection; it can be a direct connection or an indirect connection through an intermediate medium, or it can be a connection within two elements or an interaction relationship between two elements.
  • "fixed connection" can mean that one element is directly or indirectly fixedly connected to another element; the fixed connection can include mechanical connection, welding, bonding, and so on, where the mechanical connection can include riveting, bolting, screw connection, key-pin connection, snap connection, lock connection, plug connection, and so on, and bonding can include adhesive bonding and solvent bonding.
  • first and second are only used for descriptive purposes, and cannot be understood as indicating or implying relative importance or implying the number of indicated technical features.
  • Features delimited with “first” and “second” may expressly or implicitly include one or more of that feature.
  • At least one refers to one or more, and "a plurality” refers to two or more.
  • At least part of an element means part or all of an element.
  • And/or which describes the association relationship of the associated objects, means that there can be three kinds of relationships, for example, A and/or B, it can mean that A exists alone, A and B exist at the same time, and B exists alone, where A, B can be singular or plural.
  • the character “/” generally indicates that the associated objects are an "or" relationship.
  • the disclosed system, apparatus and method may be implemented in other manners.
  • the apparatus embodiments described above are only illustrative.
  • the division of the units is only a logical function division. In actual implementation, there may be other division methods.
  • multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the shown or discussed mutual coupling or direct coupling or communication connection may be through some interfaces, indirect coupling or communication connection of devices or units, and may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution in this embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.
  • the functions, if implemented in the form of software functional units and sold or used as independent products, may be stored in a computer-readable storage medium.
  • the technical solution of the present application in essence, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product.
  • the computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods described in the various embodiments of the present application.
  • the aforementioned storage medium includes: a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, an optical disc, and other media that can store program code.

Landscapes

  • Engineering & Computer Science (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Quality & Reliability (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Circuit For Audible Band Transducer (AREA)
  • Studio Devices (AREA)

Abstract

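A signal processing method and an electronic device (10). A first video is obtained through a camera (140) (S220) and combined with a first audio signal obtained through a microphone array (130) (S210) to determine the target sound source direction in which a target user who is performing voice interaction with the electronic device (10) is located (S240), which can greatly improve the estimation accuracy of the target sound source direction. In addition, a video of the user's lips in the target sound source direction obtained through the camera (140) (S250) and a preset speech enhancement model are used to perform speech enhancement processing on a second audio signal obtained through the microphone array (130) (S260). Because the speech enhancement model integrates the correspondence between pronunciation and lip shape, combining the video of the user's lips with the speech enhancement model can recover a relatively clean third audio signal, which ultimately improves speech recognition efficiency effectively.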

Description

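Signal processing method and electronic device
This application claims priority to Chinese Patent Application No. 202011065346.1, filed with the China Patent Office on September 30, 2020 and entitled "Signal processing method and electronic device", which is incorporated herein by reference in its entirety.
Technical Field
Embodiments of this application relate to the field of acoustics, and more specifically, to a signal processing method and an electronic device.
Background
Currently, smart devices such as smart TVs, smart speakers, and smart lamps can all perform far-field sound pickup. For example, a user says "turn off the light" from 5 meters away; the smart device picks up and recognizes the speech, and controls the lamp to perform the corresponding action of turning off.
A commonly used far-field sound pickup technology picks up audio signals with a microphone array and relies on beamforming and echo cancellation algorithms to suppress ambient noise and echo, so as to obtain a relatively clear audio signal. However, a real environment may contain various kinds of noise and interference, such as cooking and dishwashing noise from the kitchen, noise from TV programs, and interference from family members chatting. In addition, in some homes the rooms are sparsely furnished or the walls are finished with materials that have a large sound reflection coefficient, which leads to strong reverberation and muddy sound. All these adverse factors greatly reduce the clarity of the sound picked up by the microphone array, and therefore greatly reduce the speech recognition rate.
Therefore, a technology that can greatly improve speech recognition efficiency is needed.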
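Summary
Embodiments of this application provide a signal processing method and an electronic device. An audio signal and a video obtained through a camera are used to determine the target sound source direction in which a user who is performing voice interaction with the electronic device is located. Then, based on a video of the user's lips in the target sound source direction obtained through the camera and a preset speech enhancement model, speech enhancement processing is performed on the picked-up audio signal to obtain or recover a relatively clear audio signal, which can greatly improve speech recognition efficiency.
According to a first aspect, a signal processing method is provided, applied to an electronic device, where the electronic device includes a microphone array and a camera, and the method includes:
performing sound source localization on a first audio signal obtained through the microphone array, to obtain sound source direction information;
processing a first video obtained through the camera, to obtain user direction information;
determining a target sound source direction based on the sound source direction information and the user direction information;
obtaining, through the camera, a video of the user's lips in the target sound source direction;
obtaining a second audio signal through the microphone array; and
obtaining a third audio signal through a speech enhancement model based on the second audio signal and the video of the user's lips, where the speech enhancement model includes a correspondence between pronunciation and lip shape.
The sound source direction information includes at least one sound source direction, and the at least one sound source direction includes the target sound source direction. The user direction information includes some directions related to users; for example, it includes at least one type of direction related to users. The target sound source direction is the direction in which the target user who is performing voice interaction with the electronic device is located, that is, the direction from which the sound made by the target user comes.
The video of the user's lips records multiple lip shapes made while the user speaks. Lip shapes correspond to pronunciations, that is, one lip shape may correspond to one or more pronunciations. When the user is not speaking, the lips are still. The video of the user's lips in the target sound source direction can in practice also be understood as the video of the target user's lips.
The purpose of the speech enhancement model is to perform pickup enhancement processing on the audio signal: to enhance the audio signal in the target sound source direction and to suppress or eliminate audio signals from other directions, including those produced by other speakers or background noise, so as to obtain or recover a relatively clear audio signal. The speech enhancement model in the embodiments of this application fuses audio and video information and integrates the correspondence between pronunciation and lip shape, that is, one or more pronunciations may correspond to one lip shape.
For example, the camera is a rotatable camera. After the target sound source direction is determined, the camera may be rotated to the target sound source direction to shoot the video of the user's lips in that direction.
In the signal processing method of the embodiments of this application, the first video obtained through the camera is combined with the first audio signal obtained through the microphone array to determine the target sound source direction. This can greatly improve the estimation accuracy of the target sound source direction and avoids the situation in which, when the direction is determined from the audio signal alone, false sound sources produced by strong reflections interfere with the determination. In addition, the video of the user's lips in the target sound source direction obtained through the camera and the preset speech enhancement model are used to perform speech enhancement processing on the second audio signal obtained through the microphone array. Because the speech enhancement model integrates the correspondence between pronunciation and lip shape, combining the video of the user's lips with the speech enhancement model can recover a relatively clean third audio signal, which ultimately improves speech recognition efficiency effectively.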
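With reference to the first aspect, in some implementations of the first aspect, the electronic device further includes a directional microphone, and the method further includes:
obtaining, through the directional microphone, a fourth audio signal in the target sound source direction; and
the obtaining a third audio signal through a speech enhancement model based on the second audio signal and the video of the user's lips in the target sound source direction includes:
obtaining the third audio signal through the speech enhancement module based on the second audio signal, the fourth audio signal, and the video of the user's lips.
In some embodiments, the directional microphone may be fixed to the camera. In this way, after the target sound source direction is determined, rotating the camera also rotates the directional microphone until both face the target sound source direction; the camera shoots the video of the user's lips in the target sound source direction, and the directional microphone picks up the fourth audio signal in the target sound source direction.
In the signal processing method of the embodiments of this application, after the target sound source direction is determined, the fourth audio signal in the target sound source direction is obtained through the directional microphone. The directional microphone suppresses, to a certain extent, reverberation, interference from outside the target sound source direction, and the echo of the display screen itself, and it further suppresses the residual echo left after echo cancellation. Therefore, the embodiments of this application use the fourth audio signal obtained by the directional microphone in the target sound source direction together with the second audio signal obtained by the microphone array as the audio input, which can greatly improve the pickup enhancement effect and thereby improve speech recognition efficiency.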
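With reference to the first aspect, in some implementations of the first aspect, the user direction information includes at least one of the following types of directions:
a first type of direction, where the first type of direction includes the direction in which at least one active lip is located;
a second type of direction, where the second type of direction includes the direction in which at least one user is located; and
a third type of direction, where the third type of direction includes the direction in which at least one user who is looking at the electronic device is located.
In the signal processing method of the embodiments of this application, when the first type of direction is used to determine the target sound source direction, the first video is used to detect whether anyone's lips in the picture are moving, that is, whether anyone is speaking. This can effectively rule out scenarios such as a person speaking in a video that is being played and, for an electronic device with a display screen, can also rule out to some extent scenarios in which an interfering user is speaking. When the second type of direction is used, users appearing in the picture are detected from the first video, which can effectively rule out interference signals that do not come from users, for example, interference from a speaker. When the third type of direction is used, the first video is used to detect whether any user in the picture is looking at the electronic device. In general, and especially for an electronic device with a display screen, a user who intends to interact with the device will in most cases face the device when issuing a voice instruction, so that the device can receive the instruction well and the user can learn more quickly whether the device has executed the instruction or can get feedback from the device. For example, when a user issues a voice instruction asking about the weather, the user needs to look at the weather displayed on the device.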
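With reference to the first aspect, in some implementations of the first aspect, the sound source direction information includes at least one sound source direction, and
the determining a target sound source direction based on the sound source direction information and the user direction information includes:
merging the at least one sound source direction and the at least one type of direction, to obtain at least one merged direction; and
determining the target sound source direction from the at least one direction.
In the signal processing method of the embodiments of this application, merging the at least one sound source direction and the at least one type of direction before determining the target sound source direction simplifies the computation.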
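With reference to the first aspect, in some implementations of the first aspect, the determining the target sound source direction from the at least one direction includes:
determining the target sound source direction from the at least one direction based on at least one of the following parameters;
where the at least one parameter includes:
the total number of times each of the at least one direction is detected among the sound source directions and the at least one type of direction;
whether the electronic device has successfully performed voice interaction with a user within a preset time period and within a preset angle range corresponding to each direction, where the preset time period is the period between the current time and a historical time; and
the included angle between each direction and the direction perpendicular to the display screen of the electronic device.
For the parameter "the total number of times each direction is detected among the sound source directions and the at least one type of direction", it can be understood that the more times a direction is detected in total, the more likely it is to be the target sound source direction. Ideally, that direction is essentially the target sound source direction.
For the parameter "whether the electronic device has successfully performed voice interaction with a user within a preset time period and within a preset angle range corresponding to each direction", the preset angle range corresponding to a direction includes not only the angle of that direction but also angles near it. This parameter can be understood as whether the electronic device has successfully performed voice interaction with a user near the angle of a given direction within the preset time period.
The parameter "the included angle between each direction and the direction perpendicular to the display screen of the electronic device" is better suited to an electronic device with a display screen. It can be understood as whether the user is near a specific direction that is defined for the preset usage scenario of the electronic device.
In the signal processing method of the embodiments of this application, different parameters are set for specific scenarios, and the target sound source direction is determined from the at least one direction by using the at least one parameter. For a specific electronic device (for example, a smart TV), this can further improve the estimation accuracy of the target sound source direction and thereby improve speech recognition efficiency.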
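With reference to the first aspect, in some implementations of the first aspect, the determining the target sound source direction from the at least one direction based on at least one of the following parameters includes:
determining the confidence of each direction based on the at least one parameter; and
determining, as the target sound source direction, the direction corresponding to the confidence with the largest value among the at least one direction.
With reference to the first aspect, in some implementations of the first aspect, the obtaining a second audio signal through the microphone array includes:
obtaining the second audio signal in the target sound source direction through the microphone array based on a beamforming technology.
In the signal processing method of the embodiments of this application, the second audio signal is obtained in the target sound source direction by using a beamforming technology, which enhances the pickup effect and effectively reduces the impact of interference signals from other directions on speech recognition efficiency.
With reference to the first aspect, in some implementations of the first aspect, the first audio signal is a wake-up signal.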
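According to a second aspect, an electronic device is provided, including a microphone array, a camera, and a processor, where the processor is configured to:
perform sound source localization on a first audio signal obtained through the microphone array, to obtain sound source direction information;
process a first video obtained through the camera, to obtain user direction information;
determine a target sound source direction based on the sound source direction information and the user direction information;
obtain, through the camera, a video of the user's lips in the target sound source direction;
obtain a second audio signal through the microphone array; and
obtain a third audio signal through a speech enhancement model based on the second audio signal and the video of the user's lips, where the speech enhancement model includes a correspondence between pronunciation and lip shape.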
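With reference to the second aspect, in some implementations of the second aspect, the electronic device further includes a directional microphone, and the processor is further configured to:
obtain, through the directional microphone, a fourth audio signal in the target sound source direction; and
the processor is specifically configured to:
obtain the third audio signal through the speech enhancement module based on the second audio signal, the fourth audio signal, and the video of the user's lips.
With reference to the second aspect, in some implementations of the second aspect, the directional microphone is fixedly connected to the camera.
With reference to the second aspect, in some implementations of the second aspect, the user direction information includes at least one of the following types of directions:
a first type of direction, where the first type of direction includes the direction in which at least one active lip is located;
a second type of direction, where the second type of direction includes the direction in which at least one user is located; and
a third type of direction, where the third type of direction includes the direction in which at least one user who is looking at the electronic device is located.
With reference to the second aspect, in some implementations of the second aspect, the sound source direction information includes at least one sound source direction, and
the processor is specifically configured to:
merge the at least one sound source direction and the at least one type of direction, to obtain at least one merged direction; and
determine the target sound source direction from the at least one direction.
With reference to the second aspect, in some implementations of the second aspect, the processor is specifically configured to:
determine the target sound source direction from the at least one direction based on at least one of the following parameters;
where the at least one parameter includes:
the total number of times each of the at least one direction is detected among the sound source directions and the at least one type of direction;
whether the electronic device has successfully performed voice interaction with a user within a preset time period and within a preset angle range corresponding to each direction, where the preset time period is the period between the current time and a historical time; and
the included angle between each direction and the direction perpendicular to the display screen of the electronic device.
With reference to the second aspect, in some implementations of the second aspect, the processor is specifically configured to:
determine the confidence of each direction based on the at least one parameter; and
determine, as the target sound source direction, the direction corresponding to the confidence with the largest value among the at least one direction.
With reference to the second aspect, in some implementations of the second aspect, the processor is specifically configured to:
obtain the second audio signal in the target sound source direction through the microphone array based on a beamforming technology.
With reference to the second aspect, in some implementations of the second aspect, the first audio signal is a wake-up signal.
With reference to the second aspect, in some implementations of the second aspect, the electronic device is a smart TV.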
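According to a third aspect, a chip is provided, including a processor configured to invoke, from a memory, and run instructions stored in the memory, so that an electronic device on which the chip is installed performs the method according to the first aspect.
According to a fourth aspect, a computer storage medium is provided, including a processor, where the processor is coupled to a memory, the memory is configured to store a program or instructions, and when the program or instructions are executed by the processor, the apparatus is caused to perform the method according to the first aspect.
According to a fifth aspect, this application provides a computer program product, where when the computer program product runs on an electronic device, the electronic device is caused to perform the method according to any implementation of the first aspect.
It can be understood that the electronic device, chip, computer storage medium, and computer program product provided above are all configured to perform the corresponding method provided above. Therefore, for the beneficial effects they can achieve, refer to the beneficial effects of the corresponding method provided above; details are not repeated here.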
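Brief Description of Drawings
FIG. 1 is a schematic structural diagram of an electronic device according to an embodiment of this application.
FIG. 2 is a schematic structural diagram of an electronic device according to another embodiment of this application.
FIG. 3 is a schematic scenario diagram of a camera shooting a video according to an embodiment of this application.
FIG. 4 is an example block diagram of an electronic device according to an embodiment of this application.
FIG. 5 is a schematic scenario diagram according to an embodiment of this application.
FIG. 6 is a schematic flowchart of a signal processing method according to an embodiment of this application.
FIG. 7 is a schematic flowchart of a signal processing method according to another embodiment of this application.
FIG. 8 is a schematic flowchart of a method for determining a target sound source direction by an electronic device according to another embodiment of this application.
FIG. 9 is a schematic flowchart of a signal processing method according to another embodiment of this application.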
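Description of Embodiments
The technical solutions in this application are described below with reference to the accompanying drawings.
In the signal processing method provided in the embodiments of this application, an audio signal and a video obtained through a camera are used to determine the direction (denoted as the target sound source direction) in which a user who is performing voice interaction with the electronic device (denoted as the target user) is located. Then, based on a video of the user's lips in that direction obtained through the camera and a preset speech enhancement model, speech enhancement processing is performed on the picked-up audio signal to obtain or recover a relatively clear audio signal, which can greatly improve speech recognition efficiency.
For ease of description, the embodiments of this application define some terms, which are introduced below.
Target user: the person who is performing voice interaction with the electronic device. The target user is issuing a voice instruction to the electronic device to perform an action. The target user can also be understood as the person who is actually speaking.
Target sound source direction: the direction in which the target user is located, that is, the direction from which the sound made by the target user comes. Because of various interference signals in the environment, the electronic device may pick up audio signals from multiple sound source directions; therefore, the direction in which the target user is located is defined as the target sound source direction.
Video of the user's lips: this video records the shapes of the lips (denoted as lip shapes) while the user speaks. When the user speaks, the lips make various lip-shape movements, and the lip video may record multiple lip shapes. Lip shapes correspond to pronunciations, that is, one lip shape may correspond to one or more pronunciations. For example, the Chinese syllables "窝" (wō), "我" (wǒ), and "握" (wò) are three different pronunciations but correspond to one lip shape. When the user is not speaking, the lips are still. In the embodiments of this application, the video of the user's lips in the target sound source direction can in practice also be understood as the video of the target user's lips.
The purpose of the speech enhancement model is to perform pickup enhancement processing on the audio signal: to enhance the audio signal in the target sound source direction and to suppress or eliminate audio signals from other directions, including those produced by other speakers or background noise, so as to obtain or recover a relatively clear audio signal. The speech enhancement model in the embodiments of this application fuses audio and video information and integrates the correspondence between pronunciation and lip shape, where one or more pronunciations may correspond to one lip shape. In the embodiments of this application, the audio signal and the video of the user's lips are used as inputs to the speech enhancement model. Based on the correspondence between pronunciation and lip shape and the input video of the user's lips, the model performs speech enhancement processing on the audio signal to obtain a relatively clear audio signal for speech recognition.
For example, the speech enhancement module may perform noise reduction, residual echo cancellation, dereverberation, and the like on the audio signal.
The signal processing method in the embodiments of this application can be applied to any electronic device capable of speech recognition. In one example, the electronic device may be a voice control device such as a smart TV (also called a smart screen). In another example, the electronic device may be a voice call device such as a mobile phone or a computer.
The electronic device in the embodiments of this application is first described below with reference to FIG. 1 to FIG. 3, using a smart TV as an example.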
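Referring to FIG. 1, the electronic device 10 includes a housing 110, a display screen 120, a microphone array 130, and a camera 140. The display screen 120, the microphone array 130, and the camera 140 are installed in the housing 110.
The display screen 120 is configured to display images, videos, and the like. The display screen 120 includes a display panel. The display panel may use a liquid crystal display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode (AMOLED), a flexible light-emitting diode (FLED), Mini-LED, Micro-LED, Micro-OLED, quantum dot light-emitting diodes (QLED), or the like.
The microphone array 130 is configured to pick up audio signals. It includes multiple microphones and can pick up audio signals from multiple directions. For example, the microphones in the microphone array 130 may be omnidirectional microphones, directional microphones, or a combination of the two, which is not limited in the embodiments of this application. An omnidirectional microphone can pick up audio signals from all directions: wherever the speaker is, sound from every direction is picked up with the same sensitivity. A directional microphone can pick up audio signals only from a specific direction.
The microphone array 130 may be disposed at any position on the housing 110, which is not limited in the embodiments of this application.
In one example, as shown in FIG. 1, the microphone array 130 is disposed in the housing 110 in a region on one side of the display screen 120, and the sound holes of the microphone array 130 are provided on the front face of the housing 110 and face the same way as the display screen 120. The front face of the housing 110 can be understood as the face oriented the same way as the display screen 120, or as the face of the housing 110 that faces the user during normal use. The microphone array 130 may be disposed in the housing 110 in a region on any side of the display screen 120: assuming that the microphone array 130 shown in FIG. 1 is disposed in the region on the top side of the display screen 120, the microphone array 130 may also be disposed in a region on another side (for example, the left, right, or bottom side) of the display screen 120.
In another example, the microphone array 130 may be disposed in the housing 110 in the region on the top side of the display screen 120, and the sound holes of the microphone array 130 are provided on the top face of the housing 110 (not shown in the figure). The top face of the housing 110 adjoins the front face of the housing 110, and the sound holes face perpendicular to the orientation of the display screen 120.
In another example, the microphone array 130 may also be disposed behind the display screen 120, with the sound holes of the microphone array 130 provided in the display screen 120 (not shown in the figure).
In another example, the microphone array 130 may also be disposed behind the display screen 120, with the sound holes of the microphone array 130 provided on the front face of the housing 110.
The microphone array 130 may be arranged in a linear structure as shown in FIG. 1 or in another structure, which is not limited in the embodiments of this application. For example, the microphone array 130 may be arranged in a circular or rectangular structure.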
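The camera 140 is configured to capture still images or videos. An object is projected through the lens as an optical image onto a photosensitive element. The photosensitive element may be a charge-coupled device (CCD) or a complementary metal-oxide-semiconductor (CMOS) phototransistor. The photosensitive element converts the optical signal into an electrical signal and then passes the electrical signal to the ISP, which converts it into a digital image signal. The ISP outputs the digital image signal to the DSP for processing. The DSP converts the digital image signal into an image signal in a standard format such as RGB or YUV. In some embodiments, the electronic device 10 may include 1 or N cameras 140, where N is a positive integer greater than 1.
In the embodiments of this application, the camera 140 can rotate within a preset angle range to shoot video over a certain angle range, and this video can be used to determine the target sound source direction. In addition, after the electronic device 10 determines the target sound source direction, the camera 140 can rotate to that direction so that it directly faces the target sound source direction and the target user appears as close to the center of the picture as possible. This allows better video to be shot in the target sound source direction; the resulting video of the user's lips can be used to process the audio signal input to the speech enhancement model so that a relatively clear audio signal is output for speech recognition.
In one example, referring to FIG. 1, the camera 140 is disposed on the top face of the housing 110 and protrudes from the top face, which makes it easier for the camera 140 to rotate. In an embodiment in which the microphone array 130 is located in the region on the top side of the display screen 120, the camera 140 may be located above the microphone array 130.
In another example, referring to FIG. 2, the camera 140 may be disposed on the front face of the housing 110 in the region on the top side of the display screen 120.
The camera 140 can rotate within a preset angle range, which may be any range of angles. Referring to FIG. 3, in an embodiment in which the electronic device 10 is a smart TV, the angle range over which the camera 140 can rotate is less than or equal to 180°. For example, the angle range may be 120°: the camera 140 can rotate within a 120° range in front of the display screen 120 and, together with the field of view of the camera 140, can capture essentially everything within the 180° range in front of the smart TV.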
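In some embodiments, referring to FIG. 1 and FIG. 2, the electronic device 10 further includes a directional microphone 150. The directional microphone 150 is rotatable so that it can pick up audio signals in a specific direction. After the electronic device 10 determines the target sound source direction, the directional microphone 150 can rotate to that direction and perform directional sound pickup there.
Because the directional microphone 150 can pick up sound in the target sound source direction without distortion, it suppresses interference and reverberation to a certain extent; in addition, the directional microphone 150 picks up sound facing forward, which also suppresses echo well. Therefore, in the embodiments of this application, the audio signal obtained through the directional microphone 150 and the audio signal obtained through the microphone array 130 can be used together as the audio input of the speech enhancement model, so that a clearer audio signal can be obtained or recovered.
With reference to the embodiment in which the camera 140 can rotate to the target sound source direction to shoot video facing that direction, in one example, still referring to FIG. 1 and FIG. 2, the directional microphone 150 may be disposed on the camera 140. For example, the directional microphone 150 is fixedly connected to the camera 140, so that when the camera 140 rotates to the target sound source direction, the directional microphone 150 rotates to that direction with it, which is simple and convenient to implement. The electronic device 10 further includes a processor (not shown in the figure). The display screen 120, the microphone array 130, the camera 140, and the directional microphone 150 are all connected to the processor so that the signals collected by each component are input to the processor for further processing. The processor runs instructions to implement the signal processing method of the embodiments of this application and obtain a relatively clear audio signal of the user's speech; after speech recognition is performed on this audio signal, the corresponding components can be controlled to execute the instruction that the audio signal corresponds to.
It should be understood that the structure of the electronic device 10 described above using a smart TV as an example is merely illustrative, and the electronic device 10 may have more or fewer components.
In some embodiments, the electronic device 10 may include the microphone array 130 and the camera 140 and, optionally, the directional microphone 150, but may not include the display screen 120.
In some other embodiments, the electronic device 10 may include the directional microphone 150 and the camera 140 but not the microphone array 130. In this embodiment, the audio signal picked up by the directional microphone 150 and the video shot by the camera 140 may be used to determine the target sound source direction; then the camera 140 shoots video in the target sound source direction and the directional microphone 150 picks up the audio signal in the target sound source direction, so that a relatively clear audio signal is recovered by using the video in the target sound source direction and the speech enhancement model. For example, before the target sound source direction is determined, the directional microphone 150 may rotate continuously while collecting audio signals.
In some other embodiments, in addition to the microphone array 130, the camera 140, and the directional microphone 150, the electronic device 10 may include other components; for example, the electronic device 10 may be a device such as a mobile phone or a computer.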
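FIG. 4 is an example block diagram of the electronic device 10 according to an embodiment of this application. The electronic device 10 may include the display screen 120, the microphone array 130, the directional microphone 150, and the camera 140 shown in FIG. 3 above. For example, the electronic device 10 may further include one or more of the following components: a processor 160, a wireless communication module 171, an audio module 172, a speaker 173, a touch sensor 174, buttons 175, and an internal memory 176.
The wireless communication module 171 may provide wireless communication solutions applied to the electronic device 10, including wireless local area networks (WLAN) (such as wireless fidelity (Wi-Fi) networks), Bluetooth (BT), global navigation satellite system (GNSS), frequency modulation (FM), near field communication (NFC), and infrared (IR) technology. The wireless communication module 171 may be one or more devices that integrate at least one communication processing module. The wireless communication module 171 receives electromagnetic waves through an antenna, performs frequency modulation and filtering on the electromagnetic wave signal, and sends the processed signal to the processor. The wireless communication module 171 may also receive a signal to be sent from the processor, perform frequency modulation and amplification on it, and radiate it as electromagnetic waves through the antenna.
The audio module 172 is configured to convert digital audio information into an analog audio signal for output, and also to convert an analog audio input into a digital audio signal. The audio module 172 may also be configured to encode and decode audio signals. In some embodiments, the audio module 172 may be disposed in the processor, or some functional modules of the audio module 172 may be disposed in the processor 160.
The speaker 173, also called a "loudspeaker", is configured to convert an audio electrical signal into a sound signal. The electronic device 10 can play music or the sound in a video through the speaker 173. In an embodiment in which the electronic device 10 is a mobile phone, the speaker 173 can also be used for hands-free calls.
The touch sensor 174 is also called a "touch panel". The touch sensor 174 may be disposed on the display screen 120, and the touch sensor 174 and the display screen 120 together form a touchscreen. The touch sensor 174 is configured to detect a touch operation performed on or near it. The touch sensor 174 may pass the detected touch operation to the processor 160 to determine the type of touch event. Visual output related to the touch operation may be provided through the display screen 120. In some other embodiments, the touch sensor 174 may also be disposed on the surface of the electronic device 10 at a position different from that of the display screen 120.
The buttons 175 include a power button, volume buttons, and the like. The buttons 175 may be mechanical buttons or touch buttons. The electronic device 10 may receive input from the buttons 175 and generate key signal input related to user settings and function control of the electronic device 10.
The internal memory 176 is configured to store computer-executable program code, where the executable program code includes instructions. The processor 160 runs the instructions stored in the internal memory to perform the various functional applications and data processing of the electronic device 10. The internal memory 176 may include a program storage area and a data storage area. The program storage area may store the operating system and the applications required for at least one function (such as a sound playback function or an image playback function). The data storage area may store data created during use of the electronic device 10 (such as audio data and a phone book). In addition, the internal memory 176 may include high-speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or universal flash storage (UFS).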
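FIG. 5 is a schematic scenario diagram according to an embodiment of this application. Still using a smart TV as an example and referring to FIG. 5, the target user is watching TV and says to the smart TV, "Xiaoyi Xiaoyi, I want to watch a variety show". The smart TV receives and recognizes the instruction and switches to a variety show.
In the embodiments of this application, for ease of description, a direction may be represented by an angle: a reference direction is defined, and a given direction is represented by the included angle between it and the reference direction. It should be understood that the reference direction may be arbitrary, which is not limited in the embodiments of this application.
Taking FIG. 5 as an example, the direction that extends toward the left along the length direction (for example, the x direction) of the electronic device 10 (the direction indicated by the arrow at 0° in the figure) may be used as the reference direction, and the angle corresponding to the reference direction is 0°. The target user directly faces the electronic device 10, and the included angle between the target sound source direction in which the target user is located and the reference direction is 90°.
The signal processing method of the embodiments of this application is described below with reference to FIG. 6 to FIG. 9. The method may be performed by the electronic device 10. The electronic device 10 includes the microphone array 130, the camera 140, and a processing unit 160. For example, the processing unit 160 may include a target sound source direction determination module 161 and a speech enhancement module 162. Optionally, the electronic device 10 further includes the directional microphone 150.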
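FIG. 6 is a schematic flowchart of a signal processing method according to an embodiment of this application. Referring to FIG. 6, the general process of this embodiment is as follows:
S210: The target user starts to issue a voice instruction to the electronic device, and the microphone array 130 picks up a first audio signal.
S220: The camera 140 shoots video to obtain a first video.
S230: The processing unit 160 performs sound source localization on the first audio signal to obtain sound source direction information that includes at least one sound source direction, and the processing unit 160 processes the first video to obtain user direction information. This step may be performed by the target sound source direction determination module 161 in the processing unit 160.
S240: The processing unit 160 determines, based on the sound source direction information and the user direction information, the target sound source direction in which the target user is located. This step may be performed by the target sound source direction determination module 161 in the processing unit 160.
S250: The processing unit 160 controls the camera 140 to rotate to the target sound source direction, and the camera 140 shoots video in the target sound source direction to obtain a video of the user's lips in that direction.
S260: The microphone array 130 continues to pick up a second audio signal, which is the signal on which speech recognition actually needs to be performed.
S270: In an embodiment in which the electronic device includes the directional microphone 150, the processing unit 160 may further control the directional microphone 150 to rotate to the target sound source direction, and the directional microphone 150 picks up a fourth audio signal in the target sound source direction.
In an embodiment in which the directional microphone 150 is disposed on the camera 140, the processing unit 160 controls the camera 140 and the directional microphone 150 to rotate together to the target sound source direction.
S280: With the second audio signal and the video of the user's lips in the target sound source direction as inputs, the processing unit 160 performs speech enhancement processing on the second audio signal through the speech enhancement model to obtain an enhanced, relatively clear third audio signal. This step may be performed by the speech enhancement module 162 in the processing unit 160.
In an embodiment in which the electronic device includes the directional microphone 150, in S280 the second audio signal, the fourth audio signal, and the video of the user's lips in the target sound source direction are used as inputs, and the processing unit 160 performs speech enhancement processing on the second audio signal and the fourth audio signal through the speech enhancement model to obtain the third audio signal.
It should be understood that, in the various embodiments of the method 200 of this application, the sequence numbers of the foregoing processes do not imply an order of execution. The execution order of the processes should be determined by their functions and internal logic and should not constitute any limitation on the implementation of the embodiments of this application. For example, step S210 and step S220 may be performed at the same time; step S250 and step S260 may be performed at the same time; and steps S250, S260, and S270 may be performed at the same time. As another example, step S250 may be performed before or after step S260.
In the signal processing method of the embodiments of this application, the first video obtained through the camera is combined with the first audio signal obtained through the microphone array to determine the target sound source direction in which the target user who is performing voice interaction with the electronic device is located. This can greatly improve the estimation accuracy of the target sound source direction and avoids the situation in which, when the direction is determined from the audio signal alone, false sound sources produced by strong reflections interfere with the determination. In addition, the video of the user's lips in the target sound source direction obtained through the camera and the preset speech enhancement model are used to perform speech enhancement processing on the second audio signal obtained through the microphone array. Because the speech enhancement model integrates the correspondence between pronunciation and lip shape, combining the video of the user's lips with the speech enhancement model can recover a relatively clean third audio signal, which ultimately improves speech recognition efficiency effectively.
In addition, the directional microphone suppresses, to a certain extent, reverberation, interference from outside the target sound source direction, and the echo of the display screen itself, and it further suppresses the residual echo left after echo cancellation. The embodiments of this application use the fourth audio signal picked up by the directional microphone in the target sound source direction together with the second audio signal obtained by the microphone array as the audio input, which can greatly improve the pickup enhancement effect and thereby improve speech recognition efficiency.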
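FIG. 7 is a schematic flowchart of a signal processing method 300 according to another embodiment of this application. The method may be performed by the processing unit 160 of the electronic device 10.
In step S310, sound source localization is performed on the first audio signal obtained by the microphone array 130 to obtain sound source direction information, which includes at least one sound source direction. The at least one sound source direction includes the target sound source direction.
When the user issues a voice instruction to the electronic device, the microphone array 130 picks up the audio signal. This step is used for sound source localization. The first audio signal may be a very small part of the voice instruction issued by the user, small enough that it has essentially no effect on the subsequent content used for speech recognition.
For example, the first audio signal may be a wake-up signal. If the voice instruction issued by the user is "Xiaoyi Xiaoyi, I want to watch a variety show", the first audio signal may be one or more words of "Xiaoyi", or several repetitions of "Xiaoyi", any of which can be understood as the wake-up signal. When the electronic device detects "Xiaoyi", it can determine that the user probably wants it to execute a voice instruction; the microphone array 130 then performs sound source localization and continues to pick up the subsequent audio signal.
In a voice instruction without a wake-up signal, the first audio signal may be the first few words of the instruction. In general, as soon as the user utters one or two words, the microphone array 130 detects an audio signal. For example, if the voice instruction issued by the user is "I want to watch a variety show", the first audio signal may be "I".
The microphone array 130 performs sound source localization on the first audio signal in order to determine the target sound source direction in which the target user is located, that is, the direction of the sound source that actually issued the voice instruction. However, the microphone array 130 picks up audio signals from all directions, and because of various interfering sounds in the environment, the sound source direction that is finally determined is not necessarily accurate. At least one sound source direction is obtained; it includes the target sound source direction and may also include directions of interfering signals. For example, suppose the target user issues a voice instruction to the electronic device while a loudspeaker is playing music and another user (referred to as an interfering user) is talking, and all three sounds can be picked up by the microphone array 130. The microphone array 130 may then determine 3, 2, or 1 sound source directions, and the result is not accurate. The video therefore needs to be used to further determine the target sound source direction.
For example, the sound source localization technique in the embodiments of this application may be steerable beamforming based on maximum output power, a technique based on high-resolution spectral estimation, or a sound source localization technique based on time-delay estimation (TDE). This is not limited in the embodiments of this application.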
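The time-delay-estimation option named above can be illustrated with a minimal sketch for a single pair of microphones. The source does not specify any algorithm details, so everything here is an assumption made for illustration: the function names `estimate_delay` and `delay_to_angle`, the brute-force cross-correlation, the far-field model, and the speed of sound of 343 m/s. The angle is measured from the axis pointing from microphone B to microphone A, so a source directly in front of the pair maps to 90°.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s; assumed room-temperature value, not from the source


def estimate_delay(sig_a, sig_b, max_lag):
    """Return the lag (in samples) of sig_b relative to sig_a that
    maximises the cross-correlation, searched over [-max_lag, max_lag]."""
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i, a in enumerate(sig_a):
            j = i + lag
            if 0 <= j < len(sig_b):
                score += a * sig_b[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def delay_to_angle(lag_samples, sample_rate, mic_distance):
    """Convert a delay into an arrival angle in degrees (far-field model).

    cos(theta) = c * tau / d, clamped to [-1, 1] to absorb estimation noise.
    """
    tau = lag_samples / sample_rate
    cos_theta = max(-1.0, min(1.0, SPEED_OF_SOUND * tau / mic_distance))
    return math.degrees(math.acos(cos_theta))


# Hypothetical usage: a pulse reaches microphone B three samples after A.
mic_a = [0.0] * 32
mic_b = [0.0] * 32
mic_a[10] = 1.0
mic_b[13] = 1.0
lag = estimate_delay(mic_a, mic_b, max_lag=5)
angle = delay_to_angle(lag, sample_rate=16000, mic_distance=0.1)
```

A practical system would more likely use a frequency-domain method over several microphone pairs; the sketch only shows how a measured delay maps to a direction angle such as the θ values used in this description.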
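As described above, a direction in the embodiments of this application may be represented by an angle. Here, a sound source direction is represented by an angle θ, and the angle corresponding to any one of the at least one sound source direction may be denoted θ_i, i = 1, 2, …, I, where I is the number of sound source directions included in the at least one sound source direction.
In step S320, the first video obtained by the camera 140 is processed to obtain user direction information. The user direction information includes directions related to users; for example, it includes at least one type of direction related to users.
In some embodiments, when the user issues a voice instruction to the electronic device, the camera 140 captures video, and the electronic device uses the resulting first video to determine the target sound source direction.
For example, the voice instruction issued by the user may serve as the trigger for the camera 140 to capture video: when the electronic device detects the voice instruction, it controls the camera 140 to start capturing. Because the camera 140 in the embodiments of this application is rotatable, in some examples the camera 140 may capture video while rotating so as to cover a wider range of angles.
In other embodiments, the camera 140 may capture video continuously while the electronic device is operating, and the video captured during a period after the electronic device receives the voice instruction is used as the first video for determining the target sound source direction.
The electronic device processes the first video captured by the camera 140 and detects user-related content in it to obtain the user direction information, which includes at least one type of direction related to users. In this way, combining the user direction information with the sound source direction information can effectively exclude interfering signals that are not produced by users, for example, interfering signals produced by a loudspeaker.
It should be understood that the users referred to in the user direction information include not only the target user who is interacting with the electronic device by voice but also other users; any user detected in the first video counts. Relative to the target user, however, the other users can be understood as interfering users.
The user direction information includes at least one type of direction related to users, and each type of direction includes at least one direction.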
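In some embodiments, the at least one type of direction includes at least one of the following:
a first type of direction, which includes the direction of at least one lip that is in an active state;
a second type of direction, which includes the direction in which at least one user is located;
a third type of direction, which includes the direction in which at least one user who is gazing at the electronic device is located.
For the first type of direction, the first video is used to detect whether any person's lips in the picture are moving, that is, whether anyone is speaking. This can effectively exclude scenarios such as a person speaking in a video, and for an electronic device 10 with a display it can also, to some extent, exclude scenarios in which an interfering user is speaking. For example, the target user is watching television and issues a voice instruction toward it, while user 1 is also talking but is looking down doing housework and is not facing the television. In most cases the first video will not show user 1's lips moving and will show only the target user's lips moving, so user 1, an interfering user, can be effectively excluded.
If several users (including the target user) in the environment are speaking within the field of view of the camera 140, the lips of several users may be detected as moving, and the directions of several active lips are obtained. Under normal circumstances, the first type of direction includes the target sound source direction.
Here, the first type of direction is represented by an angle γ, and the angle corresponding to any direction of the first type may be denoted γ_l, l = 1, 2, …, L, where L is the number of directions included in the first type of direction.
For the second type of direction, the first video is used to detect the users who appear in the picture. This can effectively exclude interfering signals that are not emitted by users, for example, interfering signals emitted by a loudspeaker.
If there are several users (including the target user) in the environment, several users may be detected in the first video, and the directions in which they are located are obtained. It should be understood that, under normal circumstances, the second type of direction includes the target sound source direction.
For ease of distinction, the second type of direction may be represented by an angle α, and the angle corresponding to any direction of the second type may be denoted α_j, j = 1, 2, …, J, where J is the number of directions included in the second type of direction.
For the third type of direction, the first video is used to detect whether any user in the picture is gazing at the electronic device. In general, and especially for an electronic device with a display, a user who intends to interact with the device will in most cases face the device when issuing a voice instruction, so that the device can receive the instruction well and the user can quickly learn whether the device has executed the instruction or can obtain feedback from the device. For example, when the user issues a voice instruction asking about the weather, the user needs to look at the weather shown on the device. Detecting users who are gazing at the electronic device can therefore effectively exclude scenarios in which an interfering user is speaking. For example, the target user is watching television and issues a voice instruction toward it, while user 1 is talking to the target user but is not gazing at the television. In most cases the first video will not show user 1 gazing at the electronic device and will show only the target user doing so, so user 1, an interfering user, can be effectively excluded.
If there are several users (including the target user) in the environment, several users may be detected in the first video as gazing at the electronic device, and the directions of those users are obtained. Under normal circumstances, the third type of direction includes the target sound source direction.
For ease of distinction, the third type of direction may be represented by an angle β, and the angle corresponding to any direction of the third type may be denoted β_k, k = 1, 2, …, K, where K is the number of directions included in the third type of direction.
It should be understood that the user direction information may include one, two, or all three of the foregoing types of direction; this is not limited in the embodiments of this application. The more types of direction the user direction information includes, the more it helps to improve the accuracy of determining the target sound source direction.
It should also be understood that, in addition to the foregoing three types of direction, the user direction information may include other directions related to users; this is not limited in the embodiments of this application. For example, the user direction information may include other directions related to user behavior.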
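In step S330, the target sound source direction is determined based on the sound source direction information and the user direction information.
The target sound source direction is the direction in which the target user who is interacting with the electronic device by voice is located.
It can be understood that the sound source directions in the sound source direction information can be regarded as one type of direction, which is combined with the at least one type of direction related to users to jointly determine the target sound source direction.
FIG. 8 is a schematic flowchart of a method 230 by which an electronic device determines the target sound source direction according to another embodiment of this application.
In some embodiments, referring to FIG. 8, the electronic device may determine the target sound source direction as follows:
In step S331, the at least one sound source direction in the sound source direction information and the at least one type of direction in the user direction information are merged to obtain at least one merged direction.
In step S332, the target sound source direction is determined from the at least one merged direction.
For ease of description, the following takes the sound source directions and the foregoing three types of direction as an example and first explains how the at least one merged direction is obtained.
During merging, to simplify computation, if the deviation between the angles corresponding to several directions is less than a threshold, a single direction may be determined from those directions; logically they can be regarded as the same direction. The single direction may be any one of those directions or may be obtained by averaging them; this is not limited in the embodiments of this application. The threshold may be chosen according to the actual application scenario; for example, the threshold may be 5°.
Assume that the sound source direction information includes 4 sound source directions with angles of 30°, 60°, 95°, and 120°; the first type of direction includes 1 direction with an angle of 93°; the second type of direction includes 2 directions with angles of 63° and 95°; and the third type of direction includes 1 direction with an angle of 95°.
Listing the angles of all the directions in ascending order gives 30°, 60°, 63°, 93°, 95°, 95°, 95°, 120°. Here 60° and 63° are close to each other, and 93° and the three 95° values are close or identical. Taking averaging as the example, with each group of close angles replaced by its mean, the merged angles are 30°, 61.5°, 94.5°, and 120°. That is, the merged set, a fourth type of direction, includes 4 directions, and the target sound source direction is one of them. In fact, the direction corresponding to 94.5° is the target sound source direction: the target user is essentially facing the electronic device and interacting with it by voice.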
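The merging step can be sketched as follows, under the assumption that directions are grouped by walking the sorted angles and starting a new group whenever the gap to the previous angle reaches the threshold. The function name `merge_directions` and this particular grouping rule are illustrative choices, not taken from the source; the source only states that directions whose deviation is below a threshold may be treated as one and averaged.

```python
def merge_directions(angles, threshold=5.0):
    """Group sorted angles whose gap to the previous angle is below
    `threshold` (degrees) and replace each group with its mean."""
    groups = []
    for angle in sorted(angles):
        if groups and angle - groups[-1][-1] < threshold:
            groups[-1].append(angle)  # close to the previous angle: same direction
        else:
            groups.append([angle])    # start a new direction
    return [sum(group) / len(group) for group in groups]


# Angles from the worked example in the text.
sound_sources = [30, 60, 95, 120]
type_1 = [93]        # active lips
type_2 = [63, 95]    # users present
type_3 = [95]        # users gazing at the device

merged = merge_directions(sound_sources + type_1 + type_2 + type_3)
print(merged)  # [30.0, 61.5, 94.5, 120.0]
```

One consequence of comparing each angle with its immediate predecessor is that a long chain of closely spaced angles would collapse into a single group; comparing against the first angle of the group instead would bound the group width, if that matters in a given deployment.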
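After obtaining the at least one merged direction, the electronic device determines the target sound source direction from the at least one merged direction.
In the embodiments of this application, some parameters may be set according to the specific far-field sound pickup scenario, and the target sound source direction is determined based on these parameters.
In some embodiments, in step S332, the electronic device may determine the target sound source direction from the at least one direction according to at least one of the following parameters:
the sum of the frequencies with which each direction is detected in the sound source directions and in the at least one type of direction;
whether the electronic device has successfully carried out a voice interaction with a user within a preset period and within a preset angle range corresponding to each direction, where the preset period is the period between the current time and a historical time, and the preset angle range includes the angle corresponding to that direction;
the angle between each direction and the direction perpendicular to the display of the electronic device.
Each parameter is explained below, taking the case in which the at least one type of direction includes the foregoing three types, and using the angles of the foregoing four sound source directions and three types of direction.
The 4 sound source directions have angles of 30°, 60°, 95°, and 120°; the first type of direction includes 1 direction with an angle of 93°; the second type of direction includes 2 directions with angles of 63° and 95°; the third type of direction includes 1 direction with an angle of 95°; and the 4 merged directions have angles of 30°, 61.5°, 94.5°, and 120°.
First parameter: the sum of the frequencies with which each direction is detected in the sound source directions and in the at least one type of direction, that is, the number of times the direction is detected.
The frequencies with which 30° is detected in the sound source directions, the first type, the second type, and the third type of direction are 1, 0, 0, 0, for a sum of 1. For 61.5° they are 1, 0, 1, 0, for a sum of 2. For 94.5° they are 1, 1, 1, 1, for a sum of 4. For 120° they are 1, 0, 0, 0, for a sum of 1. It can be seen that 94.5° has the largest sum of detection frequencies across the sound source directions and the at least one type of direction.
It can be understood that the larger the sum of the frequencies with which a direction is detected, the more likely that direction is to be the target sound source direction. Ideally, the direction with the largest sum is essentially the target sound source direction.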
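The first parameter can be reproduced with a short sketch. It assumes that a merged direction counts as "detected" in a list when that list contains an angle within the same 5° threshold used for merging; the function name `detection_counts` and this matching rule are illustrative assumptions rather than details from the source.

```python
def detection_counts(merged, direction_lists, threshold=5.0):
    """For each merged direction, count how many of the direction lists
    contain an angle within `threshold` degrees of it."""
    return {
        m: sum(
            1
            for angles in direction_lists
            if any(abs(a - m) < threshold for a in angles)
        )
        for m in merged
    }


merged = [30.0, 61.5, 94.5, 120.0]
direction_lists = [
    [30, 60, 95, 120],  # sound source directions
    [93],               # first type: active lips
    [63, 95],           # second type: users present
    [95],               # third type: users gazing at the device
]

counts = detection_counts(merged, direction_lists)
print(counts)  # {30.0: 1, 61.5: 2, 94.5: 4, 120.0: 1}

# Candidate under the first parameter: the direction detected most often.
candidate = max(counts, key=counts.get)
```

With the example angles, `candidate` is 94.5, matching the direction identified in the text.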
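Second parameter: whether the electronic device has successfully carried out a voice interaction with a user within a preset period and within the preset angle range corresponding to each direction, where the preset period is the period between the current time and a historical time, and the preset angle range includes the angle corresponding to that direction.
The preset angle range corresponding to a direction includes not only the angle of that direction but also nearby angles. For example, if a direction corresponds to 30°, the preset angle range may be 25° to 35°. It should be understood that the smaller the preset angle range, the more precise the target sound source direction determined with this parameter.
The preset period is the period between the current time and a historical time, where the historical time is a time before the current time. The preset period should generally not be set too long, which helps to determine the target sound source direction more precisely. For example, the length of the preset period may be set to 1 minute, 5 minutes, or 10 minutes. Assuming the current time is 10:30 and the length of the preset period is 10 minutes, the historical time is 10:20 and the preset period is the period from 10:20 to 10:30.
In other words, the second parameter can be understood as whether the electronic device has successfully carried out a voice interaction with a user near the angle corresponding to a direction within the preset period.
In real scenarios, a user is likely to keep using the electronic device for a certain period. Especially for an electronic device with a display, such as a smart television, a user watching television does not move around frequently. Therefore, if the electronic device has successfully carried out a voice interaction with a user within the preset period and within the preset angle range corresponding to a direction, that direction is more likely to be the target sound source direction; otherwise, it is less likely. Further, the more often the electronic device has successfully interacted with a user by voice there, the more likely that direction is to be the target sound source direction, and vice versa.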
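A check for the second parameter might look like the sketch below. The history format (a list of timestamp and angle pairs for successful interactions), the function name `interacted_recently`, and the default values are hypothetical; only the 10-minute period and the ±5° range echo the examples in the text.

```python
from datetime import datetime, timedelta


def interacted_recently(direction, history, now,
                        period=timedelta(minutes=10), half_range=5.0):
    """Return True if a successful voice interaction was recorded within
    `period` before `now` and within `half_range` degrees of `direction`.

    `history` is an assumed log of (timestamp, angle) pairs.
    """
    earliest = now - period  # the "historical time" in the text
    return any(
        earliest <= when <= now and abs(angle - direction) <= half_range
        for when, angle in history
    )


# Hypothetical log: one interaction 5 minutes ago near 93 degrees,
# and one 25 minutes ago near 31 degrees.
now = datetime(2021, 9, 17, 10, 30)
history = [
    (datetime(2021, 9, 17, 10, 25), 93.0),
    (datetime(2021, 9, 17, 10, 5), 31.0),
]

recent_front = interacted_recently(94.5, history, now)  # inside period and range
recent_side = interacted_recently(30.0, history, now)   # right angle, but too old
```

Counting the matching entries instead of testing for any match would capture the text's further remark that more frequent successful interactions make a direction more likely.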
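Third parameter: the angle between each direction and the direction perpendicular to the display of the electronic device.
The third parameter is better suited to electronic devices with a display. The direction perpendicular to the display of the electronic device can be understood as the thickness direction of the electronic device.
In real scenarios, a user watching video faces the electronic device from in front of the device (or its display) for a better viewing experience. Therefore, the smaller the angle between a direction and the direction perpendicular to the display, the more likely it is that a user is facing the device and watching video, and such a user is quite likely to issue a voice instruction. That direction is therefore more likely to be the target sound source direction, and vice versa. In other words, the closer a direction is to the direction perpendicular to the display, the more likely it is to be the target sound source direction.
Put another way, the third parameter can be understood as whether the user is located near a particular direction that is defined for a preset usage scenario of the electronic device.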
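The third parameter reduces to a simple comparison once the screen normal is expressed in the same angular convention as the directions. The sketch below assumes the normal corresponds to 90°, which is what the later worked example implies when it computes 94.5° − 90° = 4.5°; the function names are illustrative.

```python
def angle_to_normal(direction, normal=90.0):
    """Angle in degrees between a direction and the screen normal.

    Assumes both are expressed in the same planar angle convention,
    with the normal at 90 degrees.
    """
    return abs(direction - normal)


def closest_to_normal(directions, normal=90.0):
    """Return the direction with the smallest angle to the screen normal."""
    return min(directions, key=lambda d: angle_to_normal(d, normal))


merged = [30.0, 61.5, 94.5, 120.0]
candidate = closest_to_normal(merged)
print(candidate, angle_to_normal(candidate))  # 94.5 4.5
```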
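It should be understood that the electronic device may determine the target sound source direction based on one, two, or all three of the foregoing parameters; this is not limited in the embodiments of this application. Details are given below.
In some embodiments, the at least one parameter includes the first parameter, that is, the sum of the frequencies with which each direction is detected in the sound source directions and in the at least one type of direction. For example, as a rule, the direction with the largest sum of detection frequencies across the sound source directions and the at least one type of direction may be taken as the target sound source direction.
In other embodiments, the at least one parameter includes the second parameter, that is, whether the electronic device has successfully carried out a voice interaction with a user within the preset period and within the preset angle range corresponding to each direction. For example, as a rule, the direction corresponding to an angle at which the electronic device has successfully interacted with a user by voice within the preset period and the preset angle range is determined as the target sound source direction.
In other embodiments, the at least one parameter includes the third parameter, that is, the angle between each direction and the direction perpendicular to the display of the electronic device. For example, as a rule, the direction with the smallest angle to the direction perpendicular to the display may be determined as the target sound source direction.
In other embodiments, the at least one parameter includes any two or all three of the parameters. For example, for each parameter, a candidate sound source direction may be obtained based on the rule in the corresponding example above, and the candidate that occurs most often is taken as the target sound source direction.
For example, the at least one parameter includes the first parameter and the second parameter. For the first parameter, the direction with the largest sum of detection frequencies across the sound source directions and the at least one type of direction is taken as one candidate sound source direction; assume this candidate is 94.5°. For the second parameter, the direction corresponding to an angle at which the electronic device has successfully interacted with a user by voice within the preset period and the preset angle range is taken as another candidate; assume this candidate is also 94.5°. The target sound source direction obtained from these two candidates is then 94.5°.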
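Choosing the most repeated candidate can be sketched with a counter. The function name `vote` is illustrative, and the source does not say how ties are resolved; in this sketch the candidate that appears first wins a tie.

```python
from collections import Counter


def vote(candidates):
    """Return the candidate direction that occurs most often.

    Ties are resolved in favour of the candidate seen first, which is
    an assumption; the source does not define tie handling.
    """
    return Counter(candidates).most_common(1)[0][0]


# One candidate per parameter, as in the two-parameter example in the text.
target = vote([94.5, 94.5])
print(target)  # 94.5
```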
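In other embodiments, the electronic device may determine a confidence for each direction according to the at least one parameter, and determine the direction with the largest confidence value among the at least one direction as the target sound source direction. The confidence of a direction, which may also be called its reliability, indicates the probability that the direction is the target sound source direction; the larger the confidence, the more likely the corresponding direction is to be the target sound source direction.
The following takes the case in which the at least one parameter includes all three parameters as an example to explain how the target sound source direction is determined by confidence. It should be understood that, in embodiments in which the at least one parameter includes one or two parameters, the target sound source direction is determined by confidence in a similar way; refer to the description below, which is not repeated later.
For example, a weight may be configured for each parameter according to the priorities of the three parameters, and the target sound source direction is determined by computing a confidence for each direction from the scores it obtains under each parameter. For example, the direction with the largest confidence value among the at least one direction may be determined as the target sound source direction.
For example, the priorities of the three parameters, from high to low, are: first parameter > second parameter > third parameter. Correspondingly, the weight of the first parameter > the weight of the second parameter > the weight of the third parameter.
The following continues with the angles of the 4 sound source directions and three types of direction described above, and with the angles of the 4 merged directions (30°, 61.5°, 94.5°, 120°), to explain how the target sound source direction is determined from the confidence of each direction.
Assume that the weight of the first parameter is 0.5, the weight of the second parameter is 0.3, and the weight of the third parameter is 0.2. For the first parameter, a merged direction scores 10 points each time it is detected in the sound source directions or in one of the three types of direction. For the second parameter, if the electronic device has successfully interacted with a user by voice within the preset period and within the preset angle range corresponding to a direction, that direction also scores 10 points. For the third parameter, if the angle between a direction and the direction perpendicular to the display is less than a threshold, for example 10°, that direction also scores 10 points.
As before, the 4 sound source directions have angles of 30°, 60°, 95°, and 120°; the first type of direction includes 1 direction with an angle of 93°; the second type of direction includes 2 directions with angles of 63° and 95°; the third type of direction includes 1 direction with an angle of 95°; and the 4 merged directions have angles of 30°, 61.5°, 94.5°, and 120°.
For 30°: under the first parameter, it is detected only in the sound source directions and scores one 10; under the second and third parameters, the conditions are not met and the score is 0. The confidence is 10*0.5 = 5.
For 61.5°: under the first parameter, it is detected in the sound source directions and in the second type of direction and scores two 10s, that is, 20; under the second parameter, assume that one successful voice interaction with the electronic device has taken place in the direction corresponding to 61.5°, which scores one 10; under the third parameter, the condition is not met and the score is 0. The confidence is 20*0.5 + 10*0.3 = 13.
For 94.5°: under the first parameter, it is detected in the sound source directions and in all three types of direction and scores four 10s, that is, 40; under the second parameter, assume that one successful voice interaction with the electronic device has taken place in the direction corresponding to 94.5°, which scores one 10; under the third parameter, 94.5° - 90° = 4.5°, which is less than 10°, so the condition is met and it scores one 10. The confidence is therefore 40*0.5 + 10*0.3 + 10*0.2 = 25.
For 120°: under the first parameter, it is detected only in the sound source directions and scores one 10; under the second and third parameters, the conditions are not met and the score is 0. The confidence is 10*0.5 = 5.
In summary, 94.5° has the highest confidence value, so the direction corresponding to 94.5° is determined as the target sound source direction.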
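The weighted scoring in this worked example can be checked with a few lines of code. The weights (0.5, 0.3, 0.2) and the 10-point scores come from the text; the function name `confidence` and the per-direction input tuples (detection count, whether a recent interaction occurred, whether the direction is within 10° of the screen normal) are an illustrative encoding of the example's assumptions.

```python
WEIGHTS = (0.5, 0.3, 0.2)  # first, second, third parameter (from the example)
POINTS = 10                # score awarded per satisfied condition


def confidence(detections, interacted, near_normal):
    """Weighted sum of the three per-parameter scores for one direction."""
    scores = (
        POINTS * detections,              # 10 points per detection
        POINTS if interacted else 0,      # recent successful interaction
        POINTS if near_normal else 0,     # within the angle threshold of the normal
    )
    return sum(w * s for w, s in zip(WEIGHTS, scores))


# (detection count, interacted recently, near screen normal) per merged direction.
inputs = {
    30.0: (1, False, False),
    61.5: (2, True, False),
    94.5: (4, True, True),
    120.0: (1, False, False),
}

confidences = {angle: confidence(*args) for angle, args in inputs.items()}
target = max(confidences, key=confidences.get)
```

The resulting confidences are 5, 13, 25, and 5 (up to floating-point rounding), and `target` is 94.5, in agreement with the text.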
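In step S340, a lip video of the user in the target sound source direction is obtained by the camera 140.
After the target sound source direction is determined, the electronic device rotates the camera 140 to the target sound source direction, and the camera 140 captures video in that direction. The video includes the lip video of the target user in the target sound source direction.
In step S350, a second audio signal is obtained by the microphone array 130.
It should be understood that the second audio signal is the signal that indicates the actual voice command. For example, assuming that the voice instruction issued by the target user over the whole process is "Xiaoyi Xiaoyi, I want to watch a variety show", the second audio signal may indicate the voice instruction "I want to watch a variety show".
To improve the sound pickup effect, in some embodiments the second audio signal is obtained by the microphone array 130 in the target sound source direction based on beamforming.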
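The source does not say which beamforming technique is used. As one common possibility, a delay-and-sum beamformer aligns the channels for the target direction and averages them; the sketch below assumes integer per-channel delays (in samples) have already been derived from the target sound source direction and the array geometry, and the function name `delay_and_sum` is illustrative.

```python
def delay_and_sum(channels, delays):
    """Align each channel by its arrival delay (in samples) and average.

    `channels` is a list of equal-length sample lists; `delays[k]` is how
    many samples later the target signal reaches channel k than the
    reference channel. Samples shifted past the end are treated as zero.
    """
    n = len(channels[0])
    out = [0.0] * n
    for samples, delay in zip(channels, delays):
        for i in range(n):
            j = i + delay  # read ahead to undo the arrival delay
            if 0 <= j < n:
                out[i] += samples[j]
    return [value / len(channels) for value in out]


# Hypothetical two-microphone capture: the pulse reaches mic 1 one sample late.
mic_0 = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
mic_1 = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0]

steered = delay_and_sum([mic_0, mic_1], delays=[0, 1])    # aligned: pulse adds up
unsteered = delay_and_sum([mic_0, mic_1], delays=[0, 0])  # misaligned: pulse smeared
```

Signals arriving from the steered direction add coherently, while sounds from other directions are misaligned and partially cancel, which is the pickup improvement this step aims for.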
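In step S360, a third audio signal is obtained by using a speech enhancement model based on the second audio signal and the user lip video, where the speech enhancement model includes correspondences between multiple pronunciations and multiple lip shapes.
The purpose of the speech enhancement model is to perform sound pickup enhancement on the audio signal: to enhance the audio signal from the target sound source direction and suppress or eliminate audio signals from other directions, so as to obtain or recover a clearer audio signal. The speech enhancement model fuses audio and video information and integrates the correspondence between pronunciations and lip shapes, that is, one or more pronunciations correspond to one lip shape. The second audio signal serves as the audio input, and the lip information in the target sound source direction serves as the video input. Based on the correspondence between pronunciations and lip shapes and on the input user lip video, the speech enhancement model enhances the audio signal and obtains or recovers a clearer third audio signal for speech recognition. Compared with processing the audio signal based on audio information alone, the speech enhancement model in the embodiments of this application processes the audio signal using both audio and video information, which yields a cleaner audio signal and greatly improves the sound pickup enhancement effect.
For example, the speech enhancement module may perform noise reduction, residual echo removal, dereverberation, and the like on the second audio signal.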
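FIG. 9 is a schematic flowchart of a signal processing method 400 according to another embodiment of this application. The method may be performed by the processing unit 160 of the electronic device 10.
In step S410, sound source localization is performed on the first audio signal obtained by the microphone array 130 to obtain sound source direction information, which includes at least one sound source direction. The at least one sound source direction includes the target sound source direction.
For details of step S410, refer to the foregoing description of step S310.
In step S420, the first video obtained by the camera 140 is processed to obtain user direction information, which includes at least one type of direction related to users.
For details of step S420, refer to the foregoing description of step S320.
In step S430, the target sound source direction is determined based on the sound source direction information and the user direction information. The target sound source direction is the direction in which the target user who is interacting with the electronic device by voice is located.
For details of step S430, refer to the foregoing description of step S330.
In step S440, a lip video of the user in the target sound source direction is obtained by the camera 140.
For details of step S440, refer to the foregoing description of step S340.
In step S450, a second audio signal is obtained by the microphone array 130.
For details of step S450, refer to the foregoing description of step S350.
In step S460, a fourth audio signal in the target sound source direction is obtained by the directional microphone 150.
After the electronic device determines the target sound source direction, it may control the directional microphone 150 to rotate to the target sound source direction and pick up the fourth audio signal in that direction.
In embodiments in which the directional microphone 150 is mounted on the camera 140, the electronic device may control the camera 140 and the directional microphone 150 to rotate together to the target sound source direction.
In step S470, a third audio signal is obtained by using the speech enhancement model based on the second audio signal, the fourth audio signal, and the user lip video.
In this step, the second audio signal picked up by the microphone array 130 and the fourth audio signal picked up by the directional microphone 150 are used as the audio input of the speech enhancement model, and the user lip video is used as the video input. The speech enhancement module processes the input audio signals to obtain a clearer third audio signal.
The directional microphone 150 can pick up sound from the target sound source direction without distortion and suppresses interference and reverberation to some extent; in addition, because the directional microphone 150 picks up sound toward the front, it also suppresses echo well. Therefore, using the fourth audio signal obtained by the directional microphone 150 together with the second audio signal obtained by the microphone array 130 as the audio input of the speech enhancement model makes it possible to obtain or recover an even clearer third audio signal.
It should be understood that, as with the method 200 above, in the various embodiments of the methods 300 and 400 the sequence numbers of the processes do not imply an order of execution. The order of execution of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation of the embodiments of this application.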
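An embodiment of this application further provides an electronic device, which may be the electronic device shown in FIG. 4. The electronic device includes a microphone array 130, a rotatable camera 140, and a processor 160, where the processor 160 is configured to:
perform sound source localization on a first audio signal obtained by the microphone array 130 to obtain sound source direction information;
process a first video obtained by the camera 140 to obtain user direction information;
determine a target sound source direction based on the sound source direction information and the user direction information;
obtain, by the camera 140, a user lip video in the target sound source direction;
obtain a second audio signal by the microphone array 130;
obtain a third audio signal by using a speech enhancement model based on the second audio signal and the user lip video, where the speech enhancement model includes a correspondence between pronunciations and lip shapes.
Optionally, the electronic device further includes a directional microphone 150, and the processor 160 is further configured to:
obtain, by the directional microphone 150, a fourth audio signal in the target sound source direction; and
the processor 160 is specifically configured to:
obtain the third audio signal by using the speech enhancement model based on the second audio signal, the fourth audio signal, and the user lip video.
Optionally, the directional microphone 150 is fixedly connected to the camera 140.
Optionally, the user direction information includes at least one of the following types of direction:
a first type of direction, which includes the direction of at least one lip that is in an active state;
a second type of direction, which includes the direction in which at least one user is located;
a third type of direction, which includes the direction in which at least one user who is gazing at the electronic device is located.
Optionally, the sound source direction information includes at least one sound source direction, and the processor 160 is specifically configured to:
merge the at least one sound source direction and the at least one type of direction to obtain at least one merged direction;
determine the target sound source direction from the at least one direction.
Optionally, the processor 160 is specifically configured to:
determine the target sound source direction from the at least one direction according to at least one of the following parameters;
where the at least one parameter includes:
the sum of the frequencies with which each of the at least one direction is detected in the sound source directions and in the at least one type of direction;
whether the electronic device has successfully carried out a voice interaction with a user within a preset period and within a preset angle range corresponding to each direction, where the preset period is the period between the current time and a historical time;
the angle between each direction and the direction perpendicular to the display of the electronic device.
Optionally, the processor 160 is specifically configured to:
determine a confidence for each direction according to the at least one parameter;
determine the direction with the largest confidence value among the at least one direction as the target sound source direction.
Optionally, the processor 160 is specifically configured to:
obtain the second audio signal in the target sound source direction by the microphone array 130 based on beamforming.
Optionally, the first audio signal is a wake-up signal.
Optionally, the electronic device is a smart television.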
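It should be understood that, in the embodiments of this application, unless otherwise expressly specified and limited, terms such as "connection" and "fixed connection" should be understood broadly. A person of ordinary skill in the art can understand the specific meanings of these terms in the embodiments of this application according to the specific situation.
For example, a "connection" may be any of various connection modes such as a fixed connection, a rotatable connection, a flexible connection, a movable connection, integral forming, or an electrical connection; it may be a direct connection or an indirect connection through an intermediate medium; or it may be internal communication between two elements or an interaction relationship between two elements.
For example, a "fixed connection" may mean that one element is directly or indirectly fixed to another element. A fixed connection may include mechanical connection, welding, bonding, and the like, where mechanical connection may include riveting, bolting, threaded connection, key and pin connection, snap-fit connection, latch connection, plug-in connection, and the like, and bonding may include adhesive bonding, solvent bonding, and the like.
It should also be understood that "parallel" or "perpendicular" as described in the embodiments of this application may be understood as "approximately parallel" or "approximately perpendicular".
It should also be understood that the orientations or positional relationships indicated by terms such as "length", "width", "thickness", "up", "down", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", and "outer" are based on the orientations or positional relationships shown in the accompanying drawings. They are used only to facilitate and simplify the description of this application, and do not indicate or imply that the apparatus or element referred to must have a specific orientation or be constructed and operated in a specific orientation; they therefore cannot be understood as a limitation on this application.
It should be noted that the terms "first" and "second" are used for description only and cannot be understood as indicating or implying relative importance or implicitly specifying the number of the indicated technical features. A feature qualified by "first" or "second" may explicitly or implicitly include one or more such features.
In the embodiments of this application, "at least one" means one or more, and "a plurality of" means two or more. "At least part of an element" means part or all of the element. "And/or" describes an association between associated objects and indicates that three relationships are possible; for example, "A and/or B" may indicate that A exists alone, that both A and B exist, or that B exists alone, where A and B may be singular or plural. The character "/" generally indicates an "or" relationship between the associated objects before and after it.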
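A person of ordinary skill in the art may be aware that the units and algorithm steps of the examples described with reference to the embodiments disclosed herein can be implemented by electronic hardware or by a combination of computer software and electronic hardware. Whether these functions are performed by hardware or software depends on the particular application and design constraints of the technical solution. A skilled person may use different methods to implement the described functions for each particular application, but such implementation should not be considered to go beyond the scope of this application.
A person skilled in the art can clearly understand that, for convenience and brevity of description, for the specific working processes of the systems, apparatuses, and units described above, reference may be made to the corresponding processes in the foregoing method embodiments; details are not repeated here.
In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative. The division into units is merely a division by logical function, and other divisions are possible in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, apparatuses, or units, and may be electrical, mechanical, or in other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of an embodiment.
In addition, the functional units in the embodiments of this application may be integrated into one processing unit, each unit may exist physically alone, or two or more units may be integrated into one unit.
If the functions are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application in essence, or the part that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of this application. The storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
Parts that are the same or similar among the embodiments of this application may be referred to one another. In the embodiments of this application and in the implementations and implementation methods within each embodiment, unless otherwise specified or logically conflicting, the terms and/or descriptions are consistent and may be cited across different embodiments and across the implementations within each embodiment. The technical features in different embodiments, and in the implementations within each embodiment, may be combined according to their internal logical relationships to form new embodiments or implementations. The implementations of this application described above do not limit the protection scope of this application.
The foregoing descriptions are merely specific implementations of this application, but the protection scope of this application is not limited thereto. Any variation or replacement that a person skilled in the art could readily conceive of within the technical scope disclosed in this application shall fall within the protection scope of this application. The protection scope of this application shall therefore be subject to the protection scope of the claims.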

Claims (20)

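  1. A signal processing method, applied to an electronic device, wherein the electronic device comprises a microphone array and a camera, and the method comprises:
    performing sound source localization on a first audio signal obtained by the microphone array to obtain sound source direction information;
    processing a first video obtained by the camera to obtain user direction information;
    determining a target sound source direction based on the sound source direction information and the user direction information;
    obtaining, by the camera, a user lip video in the target sound source direction;
    obtaining a second audio signal by the microphone array;
    obtaining a third audio signal by using a speech enhancement model based on the second audio signal and the user lip video, wherein the speech enhancement model comprises a correspondence between pronunciations and lip shapes.
  2. The method according to claim 1, wherein the electronic device further comprises a directional microphone, and the method further comprises:
    obtaining, by the directional microphone, a fourth audio signal in the target sound source direction; and
    the obtaining a third audio signal by using a speech enhancement model based on the second audio signal and the user lip video in the target sound source direction comprises:
    obtaining the third audio signal by using the speech enhancement model based on the second audio signal, the fourth audio signal, and the user lip video.
  3. The method according to claim 1 or 2, wherein the user direction information comprises at least one of the following types of direction:
    a first type of direction, comprising the direction of at least one lip that is in an active state;
    a second type of direction, comprising the direction in which at least one user is located;
    a third type of direction, comprising the direction in which at least one user who is gazing at the electronic device is located.
  4. The method according to claim 3, wherein the sound source direction information comprises at least one sound source direction, and
    the determining a target sound source direction based on the sound source direction information and the user direction information comprises:
    merging the at least one sound source direction and the at least one type of direction to obtain at least one merged direction;
    determining the target sound source direction from the at least one direction.
  5. The method according to claim 4, wherein the determining the target sound source direction from the at least one direction comprises:
    determining the target sound source direction from the at least one direction according to at least one of the following parameters;
    wherein the at least one parameter comprises:
    the sum of the frequencies with which each of the at least one direction is detected in the sound source directions and in the at least one type of direction;
    whether the electronic device has successfully carried out a voice interaction with a user within a preset period and within a preset angle range corresponding to each direction, wherein the preset period is the period between the current time and a historical time;
    the angle between each direction and the direction perpendicular to the display of the electronic device.
  6. The method according to claim 5, wherein the determining the target sound source direction from the at least one direction according to at least one of the following parameters comprises:
    determining a confidence for each direction according to the at least one parameter;
    determining the direction with the largest confidence value among the at least one direction as the target sound source direction.
  7. The method according to any one of claims 1 to 6, wherein the obtaining a second audio signal by the microphone array comprises:
    obtaining the second audio signal in the target sound source direction by the microphone array based on beamforming.
  8. The method according to any one of claims 1 to 7, wherein the first audio signal is a wake-up signal.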
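  9. An electronic device, comprising a microphone array, a camera, and a processor, wherein the processor is configured to:
    perform sound source localization on a first audio signal obtained by the microphone array to obtain sound source direction information;
    process a first video obtained by the camera to obtain user direction information;
    determine a target sound source direction based on the sound source direction information and the user direction information;
    obtain, by the camera, a user lip video in the target sound source direction;
    obtain a second audio signal by the microphone array;
    obtain a third audio signal by using a speech enhancement model based on the second audio signal and the user lip video, wherein the speech enhancement model comprises a correspondence between pronunciations and lip shapes.
  10. The electronic device according to claim 9, wherein the electronic device further comprises a directional microphone, and the processor is further configured to:
    obtain, by the directional microphone, a fourth audio signal in the target sound source direction; and
    the processor is specifically configured to:
    obtain the third audio signal by using the speech enhancement model based on the second audio signal, the fourth audio signal, and the user lip video.
  11. The electronic device according to claim 10, wherein the directional microphone is fixedly connected to the camera.
  12. The electronic device according to any one of claims 9 to 11, wherein the user direction information comprises at least one of the following types of direction:
    a first type of direction, comprising the direction of at least one lip that is in an active state;
    a second type of direction, comprising the direction in which at least one user is located;
    a third type of direction, comprising the direction in which at least one user who is gazing at the electronic device is located.
  13. The electronic device according to claim 12, wherein the sound source direction information comprises at least one sound source direction, and
    the processor is specifically configured to:
    merge the at least one sound source direction and the at least one type of direction to obtain at least one merged direction;
    determine the target sound source direction from the at least one direction.
  14. The electronic device according to claim 13, wherein the processor is specifically configured to:
    determine the target sound source direction from the at least one direction according to at least one of the following parameters;
    wherein the at least one parameter comprises:
    the sum of the frequencies with which each of the at least one direction is detected in the sound source directions and in the at least one type of direction;
    whether the electronic device has successfully carried out a voice interaction with a user within a preset period and within a preset angle range corresponding to each direction, wherein the preset period is the period between the current time and a historical time;
    the angle between each direction and the direction perpendicular to the display of the electronic device.
  15. The electronic device according to claim 14, wherein the processor is specifically configured to:
    determine a confidence for each direction according to the at least one parameter;
    determine the direction with the largest confidence value among the at least one direction as the target sound source direction.
  16. The electronic device according to any one of claims 9 to 15, wherein the processor is specifically configured to:
    obtain the second audio signal in the target sound source direction by the microphone array based on beamforming.
  17. The electronic device according to any one of claims 9 to 16, wherein the first audio signal is a wake-up signal.
  18. The electronic device according to any one of claims 9 to 17, wherein the electronic device is a smart television.
  19. A computer storage medium, comprising: a processor, wherein the processor is coupled to a memory, the memory is configured to store a program or instructions, and when the program or instructions are executed by the processor, the apparatus is caused to perform the method according to any one of claims 1 to 8.
  20. A computer program product comprising instructions, wherein when the computer program product runs on an electronic device, the electronic device is caused to perform the method according to any one of claims 1 to 8.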
PCT/CN2021/118948 2020-09-30 2021-09-17 信号处理的方法和电子设备 Ceased WO2022068608A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
US18/247,212 US20230386494A1 (en) 2020-09-30 2021-09-17 Signal processing method and electronic device
EP21874269.0A EP4207186B1 (en) 2020-09-30 2021-09-17 Signal processing method and electronic device

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011065346.1 2020-09-30
CN202011065346.1A CN114333831B (zh) 2020-09-30 2020-09-30 信号处理的方法和电子设备

Publications (1)

Publication Number Publication Date
WO2022068608A1 true WO2022068608A1 (zh) 2022-04-07

Family

ID=80949550

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/118948 Ceased WO2022068608A1 (zh) 2020-09-30 2021-09-17 信号处理的方法和电子设备

Country Status (4)

Country Link
US (1) US20230386494A1 (zh)
EP (1) EP4207186B1 (zh)
CN (1) CN114333831B (zh)
WO (1) WO2022068608A1 (zh)


Also Published As

Publication number Publication date
CN114333831B (zh) 2026-01-02
US20230386494A1 (en) 2023-11-30
EP4207186A4 (en) 2024-01-24
CN114333831A (zh) 2022-04-12
EP4207186A1 (en) 2023-07-05
EP4207186B1 (en) 2025-07-30

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21874269

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 18247212

Country of ref document: US

ENP Entry into the national phase

Ref document number: 2021874269

Country of ref document: EP

Effective date: 20230331

NENP Non-entry into the national phase

Ref country code: DE

WWG Wipo information: grant in national office

Ref document number: 2021874269

Country of ref document: EP