WO2022068608A1 - Signal processing method and electronic device - Google Patents
- Publication number
- WO2022068608A1 (application PCT/CN2021/118948)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- sound source
- electronic device
- user
- audio signal
- target sound
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0272—Voice signal separating
- G10L21/028—Voice signal separating using properties of sound source
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L21/0216—Noise filtering characterised by the method used for estimating noise
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/57—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for processing of video signals
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N23/00—Cameras or camera modules comprising electronic image sensors; Control thereof
- H04N23/60—Control of cameras or camera modules
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/24—Speech recognition using non-acoustical features
- G10L15/25—Speech recognition using non-acoustical features using position of the lips, movement of the lips or face analysis
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L21/0216—Noise filtering characterised by the method used for estimating noise
- G10L2021/02161—Number of inputs available containing the signal or the noise to be suppressed
- G10L2021/02166—Microphone arrays; Beamforming
Definitions
- the embodiments of the present application relate to the field of acoustics, and more particularly, to a signal processing method and electronic device.
- smart devices such as smart TVs, smart speakers, and smart lights can be used for remote pickup. For example, when a user says "turn off the lights" at a distance of 5 meters, the smart device picks up and recognizes the voice and controls the lights to turn off.
- the commonly used far-field sound pickup technology uses a microphone array to pick up audio signals, and uses beamforming technology and echo cancellation algorithms to suppress environmental noise and echoes to obtain clearer audio signals.
- the actual environment contains various noises and interferences, such as the noise of cooking and washing dishes in the kitchen, the sound of TV programs, and the interference of family conversations. Some homes also have empty rooms or walls decorated with materials that have large acoustic reflection coefficients, resulting in strong reverberation that easily muddies the sound. All these unfavorable factors greatly reduce the intelligibility of the sound picked up by the microphone array, resulting in a significant drop in the speech recognition rate.
- embodiments of the present application provide a signal processing method and electronic device, which determine the target sound source direction of a user who is interacting with the electronic device through an audio signal and a video obtained by a camera. Further, based on the video of the user's lips in the direction of the target sound source obtained by the camera and a preset voice enhancement model, voice enhancement processing is performed on the picked-up audio signal to obtain or restore a relatively clear audio signal, which can greatly improve the efficiency of voice recognition.
- a signal processing method which is characterized in that it is applied to an electronic device, the electronic device includes a microphone array and a camera, and the method includes:
- a third audio signal is obtained through a speech enhancement model, where the speech enhancement model includes a correspondence between pronunciation and lip shape.
- the sound source direction information includes at least one sound source direction, and the at least one sound source direction includes a target sound source direction.
- the user direction information includes some directions related to the user, illustratively including at least one type of direction related to the user.
- the target sound source direction is the direction in which the target user who is performing voice interaction with the electronic device is located, that is, the source direction of the sound emitted by the target user.
- lip shapes during the user's speech are recorded in the user's lip video, and the lip shapes and pronunciations have a corresponding relationship, that is, one lip shape can correspond to one or more pronunciations.
- the lips are in a static state.
- the lip video of the user in the direction of the target sound source can actually be understood as the lip video of the target user.
- the purpose of the speech enhancement model is to perform pickup enhancement processing on the audio signal, enhance the audio signal in the direction of the target sound source, and suppress or eliminate the audio signal generated in other directions, including the speaker or background noise, so as to obtain or restore a clearer audio signal.
- the speech enhancement model of the embodiment of the present application integrates audio and video information and integrates the correspondence between pronunciation and lip shape, that is, one or more pronunciations may correspond to one lip shape.
- the camera is a rotatable camera. After determining the direction of the target sound source, the camera can be rotated to the direction of the target sound source to capture a video of the user's lips in the direction of the target sound source.
- the first video is obtained by the camera, and the direction of the target sound source is determined in combination with the first audio signal obtained by the microphone array. This can greatly improve the estimation accuracy of the target sound source direction and avoids the situation in which, when only the audio signal is used, a false sound source generated by strong reflected sound interferes with the determination of the target sound source direction.
- in addition, voice enhancement is performed on the second audio signal obtained by the microphone array, using the video of the user's lips in the target sound source direction obtained by the camera and the preset speech enhancement model. Since the correspondence between pronunciation and lip shape is integrated in the speech enhancement model, a relatively clean third audio signal can be recovered by combining the user's lip video with the model, which can effectively improve the efficiency of speech recognition.
- the electronic device further includes a directional microphone
- the method further includes:
- a third audio signal is obtained through a speech enhancement model, including:
- the third audio signal is obtained through the speech enhancement module according to the second audio signal, the fourth audio signal and the video of the user's lips.
- the directional microphone can be fixed on the camera. In this way, after the direction of the target sound source is determined, the directional microphone is driven to rotate as the camera rotates and finally points in the direction of the target sound source, so that it can obtain the fourth audio signal in the target sound source direction.
- a fourth audio signal in the direction of the target sound source is obtained through the directional microphone.
- forward pickup by the directional microphone has a certain suppressing effect on the echo of the display screen itself and further suppresses the echo residue left after echo cancellation. Therefore, the embodiment of the present application uses the fourth audio signal obtained by the directional microphone in the direction of the target sound source together with the second audio signal obtained by the microphone array. Using both audio signals as audio input can greatly improve the effect of sound pickup enhancement and thereby improve speech recognition efficiency.
- the user direction information includes at least one of the following types of directions:
- the first class of directions including the direction in which the at least one active lip is located;
- the second type of direction includes the direction in which at least one user is located;
- a third type of direction includes a direction in which at least one user who is looking at the electronic device is located.
- in the signal processing method of the embodiment of the present application, when the target sound source direction is determined using the first type of direction, the first video can be used to detect effectively whether a person's lips are moving in the picture, that is, whether someone is talking. For example, the scenario of people talking in a video can be excluded to a certain extent, and scenarios that interfere with the user's speech can also be excluded to a certain extent.
- when the target sound source direction is determined using the second type of direction, the first video is used to detect users appearing in the picture, which can effectively eliminate interference signals from non-user sources; for example, interference signals from loudspeakers can be excluded.
- when the target sound source direction is determined using the third type of direction, the first video is used to detect whether a user in the picture is looking at the electronic device. In general, especially for an electronic device with a display screen, a user who intends to interact with the electronic device will in most cases issue the voice command toward the electronic device, so that the electronic device receives the command well and the user can quickly see whether the electronic device is executing the command or obtain feedback from it. For example, if the user asks about the weather by voice, the user needs to look at the displayed weather conditions.
- the sound source direction information includes at least one sound source direction
- the determining the target sound source direction according to the sound source direction information and the user direction information includes:
- the target sound source direction is determined from the at least one direction.
- the signal processing method of the embodiment of the present application can simplify the calculation by combining with at least one sound source direction and at least one type of direction to determine the target sound source direction.
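One way to picture the combination of sound source directions and user directions is a simple angular match. The following is a minimal sketch, not the method claimed in the application: it assumes that every direction is expressed as an angle in degrees, and the function names and the `tolerance_deg` threshold are hypothetical.

```python
def angular_difference(a_deg, b_deg):
    """Smallest absolute difference between two angles, in degrees."""
    diff = abs(a_deg - b_deg) % 360.0
    return min(diff, 360.0 - diff)


def candidate_directions(sound_source_dirs, user_dirs, tolerance_deg=10.0):
    """Keep the sound source directions that lie close to some user direction.

    sound_source_dirs: angles estimated from the microphone array audio.
    user_dirs: angles estimated from the camera video (for example active lips,
        detected users, or users looking at the device).
    tolerance_deg: hypothetical matching threshold, not taken from the source.
    """
    return [
        s for s in sound_source_dirs
        if any(angular_difference(s, u) <= tolerance_deg for u in user_dirs)
    ]


# A reflection at 95 degrees has no matching user and is dropped.
print(candidate_directions([30.0, 95.0, 150.0], [32.0, 148.0]))
```

Only directions confirmed by both modalities remain, so later steps need to rank a smaller set of candidates.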
- the determining the target sound source direction from the at least one direction includes:
- the target sound source direction is determined from the at least one direction according to at least one of the following parameters;
- the at least one parameter includes:
- the preset time period is the time period between the current time and the historical time
- the included angle between each of the directions and a direction perpendicular to the display screen of the electronic device.
- the direction is most likely the direction of the target sound source. Ideally, this direction is essentially the direction of the target sound source.
- the preset angle range corresponding to each direction includes not only the angle corresponding to that direction but also the angles near that angle.
- This parameter can be understood as whether the electronic device has successfully performed voice interaction with the user in the vicinity of an angle corresponding to a certain direction within a preset period of time.
- the included angle between each direction and the direction perpendicular to the display screen of the electronic device is more suitable for electronic devices with a display screen. This parameter can be understood as whether the user is near a specific direction defined for a preset usage scenario of the electronic device.
- the target sound source direction is determined from at least one direction by using the above at least one parameter, which can further effectively improve the estimation accuracy of the target sound source direction and thereby improve speech recognition efficiency.
- determining the target sound source direction from the at least one direction according to at least one of the following parameters includes:
- the direction corresponding to the confidence level with the largest numerical value in the at least one direction is determined as the target sound source direction.
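The selection by largest confidence can be sketched as follows. This is an illustration under stated assumptions: the source does not say how the confidence is computed, so the `Candidate` fields, the weights, and the linear scoring formula are all hypothetical. Only the final step, picking the direction with the largest confidence value, comes from the text.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    angle_deg: float          # angle from the display normal; 0 means straight ahead
    recent_interaction: bool  # successful voice interaction near this angle recently


def confidence(c, w_history=0.5, w_frontal=0.5):
    """Hypothetical confidence score in [0, 1]; the weights are assumptions."""
    frontal = max(0.0, 1.0 - abs(c.angle_deg) / 90.0)
    return w_history * float(c.recent_interaction) + w_frontal * frontal


def pick_target_direction(candidates):
    """Return the angle of the candidate with the largest confidence value."""
    return max(candidates, key=confidence).angle_deg


candidates = [
    Candidate(angle_deg=-60.0, recent_interaction=False),
    Candidate(angle_deg=10.0, recent_interaction=True),
    Candidate(angle_deg=0.0, recent_interaction=False),
]
print(pick_target_direction(candidates))
```

A weighted sum is used here only because it is easy to read; any scoring that yields one comparable value per direction would fit the same selection step.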
- the obtaining the second audio signal through the microphone array includes:
- the second audio signal is obtained in the direction of the target sound source based on a beamforming technique.
- the signal processing method of the embodiment of the present application obtains the second audio signal in the direction of the target sound source through the beamforming technology, which enhances the sound pickup effect and effectively reduces the influence of interference signals in other directions on the efficiency of speech recognition.
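The source does not name a specific beamforming algorithm. As an illustration only, the sketch below uses delay-and-sum beamforming, one of the simplest techniques, on a linear array. The far-field plane-wave model, the angle convention (measured from broadside, positive toward the positive array axis), the microphone positions, and the speed of sound are all assumptions.

```python
import numpy as np


def delay_and_sum(signals, mic_positions_m, angle_deg, fs, c=343.0):
    """Steer a linear microphone array toward angle_deg and average the channels.

    signals: array of shape (num_mics, num_samples).
    mic_positions_m: microphone coordinates along the array axis, in meters.
    angle_deg: steering angle measured from broadside.
    fs: sampling rate in Hz; c: assumed speed of sound in m/s.
    """
    num_samples = signals.shape[1]
    # A plane wave from angle_deg reaches a microphone at position x earlier
    # than the origin by x * sin(angle) / c seconds.
    advance = mic_positions_m * np.sin(np.deg2rad(angle_deg)) / c
    freqs = np.fft.rfftfreq(num_samples, d=1.0 / fs)
    spectra = np.fft.rfft(signals, axis=1)
    # Delay every channel by its advance so that all channels line up.
    aligned = spectra * np.exp(-2j * np.pi * freqs[None, :] * advance[:, None])
    return np.fft.irfft(aligned.mean(axis=0), n=num_samples)


# Simulate a 1 kHz tone arriving from 30 degrees at a 4-microphone array.
fs = 16000
t = np.arange(1600) / fs
positions = np.array([0.0, 0.05, 0.10, 0.15])
advance = positions * np.sin(np.deg2rad(30.0)) / 343.0
mics = np.stack([np.sin(2 * np.pi * 1000.0 * (t + a)) for a in advance])

on_target = delay_and_sum(mics, positions, 30.0, fs)
off_target = delay_and_sum(mics, positions, -30.0, fs)
print(np.mean(on_target ** 2) > np.mean(off_target ** 2))
```

Steering toward the true direction adds the channels coherently, while steering elsewhere partially cancels them, which is the sense in which signals from other directions are reduced.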
- the first audio signal is a wake-up signal.
- an electronic device including a microphone array, a camera and a processor, the processor being used for:
- a third audio signal is obtained through a speech enhancement model, where the speech enhancement model includes a correspondence between pronunciation and lip shape.
- the electronic device further includes a directional microphone
- the processor is further configured to:
- the processor is specifically used for:
- the third audio signal is obtained through the speech enhancement module according to the second audio signal, the fourth audio signal and the video of the user's lips.
- the directional microphone is fixedly connected to the camera.
- the user direction information includes at least one of the following types of directions:
- the first class of directions including the direction in which the at least one active lip is located;
- the second type of direction includes the direction in which at least one user is located;
- a third type of direction includes a direction in which at least one user who is looking at the electronic device is located.
- the sound source direction information includes at least one sound source direction
- the processor is specifically used for:
- the target sound source direction is determined from the at least one direction.
- the processor is specifically configured to:
- the target sound source direction is determined from the at least one direction according to at least one of the following parameters;
- the at least one parameter includes:
- the preset time period is the time period between the current time and the historical time
- the included angle between each of the directions and a direction perpendicular to the display screen of the electronic device.
- the processor is specifically configured to:
- the direction corresponding to the confidence level with the largest numerical value in the at least one direction is determined as the target sound source direction.
- the processor is specifically configured to:
- the second audio signal is obtained in the direction of the target sound source based on a beamforming technique.
- the first audio signal is a wake-up signal.
- the electronic device is a smart TV.
- a chip including a processor for calling and executing instructions stored in the memory from a memory, so that an electronic device on which the chip is installed executes the method described in the first aspect.
- a computer storage medium comprising: a processor coupled to a memory for storing a program or an instruction that, when executed by the processor, causes the apparatus to perform the method described in the first aspect above.
- the present application provides a computer program product that, when the computer program product is run on an electronic device, causes the electronic device to perform the method according to any one of the first aspects.
- FIG. 1 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.
- FIG. 2 is a schematic structural diagram of an electronic device provided by another embodiment of the present application.
- FIG. 3 is a schematic scene diagram of a video captured by a camera according to an embodiment of the present application.
- FIG. 4 is an exemplary block diagram of an electronic device provided by an embodiment of the present application.
- FIG. 5 is a schematic scene diagram provided by an embodiment of the present application.
- FIG. 6 is a schematic flowchart of a signal processing method provided by an embodiment of the present application.
- FIG. 7 is a schematic flowchart of a signal processing method provided by another embodiment of the present application.
- FIG. 8 is a schematic flowchart of a method for an electronic device to determine the direction of a target sound source provided by another embodiment of the present application.
- FIG. 9 is a schematic flowchart of a signal processing method provided by another embodiment of the present application.
- the signal processing method provided by the embodiment of the present application determines the direction of the user who is performing voice interaction with the electronic device (the user is referred to as the target user, and the direction as the target sound source direction) through an audio signal and a video obtained by a camera. Furthermore, based on the video of the user's lips in that direction obtained by the camera and the preset voice enhancement model, voice enhancement processing is performed on the picked-up audio signal to obtain or restore a relatively clear audio signal, which can greatly improve the efficiency of voice recognition.
- the target user is a person who is interacting with the electronic device by voice, and the target user is sending a voice command to the electronic device to perform a certain action.
- the target user can also be understood as the actual speaker.
- the target sound source direction is the direction where the target user is located, that is, the source direction of the sound emitted by the target user. Due to the influence of various interference signals in the environment, the electronic device may pick up audio signals in multiple sound source directions. Therefore, the direction where the target user is located is defined as the target sound source direction.
- the user's lip video records the lip shape (referred to as lip shape) during the user's speech.
- the lip shape has a corresponding relationship with pronunciation, that is, one lip shape can correspond to one or more pronunciations. For example, "wo", "me" and "grip" represent three different pronunciations, but all three correspond to one lip shape.
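The one-to-many relationship between lip shape and pronunciation can be illustrated with a small lookup table. The table below is a hypothetical example using English consonants and vowels that share a visible mouth shape; it is not the mapping used in the application, and the function names are assumptions.

```python
# Hypothetical table: several pronunciations share one visible lip shape.
PRONUNCIATION_TO_LIP_SHAPE = {
    "p": "closed", "b": "closed", "m": "closed",
    "f": "lip-on-teeth", "v": "lip-on-teeth",
    "o": "rounded", "u": "rounded",
}


def pronunciations_for(lip_shape):
    """All pronunciations that are compatible with one lip shape."""
    return [p for p, s in PRONUNCIATION_TO_LIP_SHAPE.items() if s == lip_shape]


def consistent_candidates(candidates, observed_lip_shape):
    """Drop acoustic candidates that contradict the lip shape seen in the video."""
    return [
        p for p in candidates
        if PRONUNCIATION_TO_LIP_SHAPE.get(p) == observed_lip_shape
    ]


print(pronunciations_for("closed"))
print(consistent_candidates(["b", "v", "u"], "lip-on-teeth"))
```

The lip shape alone does not identify the pronunciation, but it narrows the options, which is why it is useful as a second input alongside noisy audio.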
- the lip video of the user in the direction of the target sound source may actually be understood as the lip video of the target user.
- the purpose of the speech enhancement model is to perform pickup enhancement processing on the audio signal, enhance the audio signal in the direction of the target sound source, and suppress or eliminate the audio signal generated in other directions, including the speaker or background noise, so as to obtain or restore a clearer audio signal.
- the speech enhancement model of the embodiment of the present application integrates audio and video information and integrates the correspondence between pronunciation and lip shape, where one or more pronunciations may correspond to one lip shape.
- the audio signal and the user's lip video are used as the input of the speech enhancement model, and the speech enhancement model can perform speech enhancement processing on the audio signal based on the correspondence between pronunciation and lip shape and the input user's lip video, so that a clearer audio signal is obtained for speech recognition.
- the speech enhancement module may perform noise reduction processing, echo cancellation residual processing, de-reverberation processing, etc. on the audio signal.
- the signal processing method in the embodiment of the present application can be applied to any electronic device capable of recognizing speech.
- the electronic device may be a voice-controlled device such as a smart TV (also referred to as a smart screen).
- the electronic device may be a voice communication device such as a mobile phone or a computer.
- a smart TV is taken as an example to describe the electronic device according to the embodiment of the present application.
- the electronic device 10 includes a housing 110 , a display screen 120 , a microphone array 130 , and a camera 140 , and the display screen 120 , the microphone array 130 and the camera 140 are installed in the housing 110 .
- the display screen 120 is used to display images, videos, and the like.
- the display screen 120 includes a display panel.
- the display panel can be a liquid crystal display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode (AMOLED), a flexible light-emitting diode (FLED), Mini LED, Micro LED, Micro OLED, quantum dot light-emitting diodes (QLED), and so on.
- the microphone array 130 is used to pick up audio signals, and includes a plurality of microphones, which can pick up audio signals in multiple directions.
- the microphones in the microphone array 130 may be omnidirectional microphones, directional microphones, or a combination of omnidirectional microphones and directional microphones, which is not limited in the implementation of this application.
- the omnidirectional microphone can pick up the audio signal in all directions, no matter where the speaker is, the sound in all directions will be picked up with the same sensitivity.
- a directional microphone can only pick up audio signals in a specific direction.
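The difference between the two microphone types can be expressed as a gain that depends on the arrival angle. The sketch below assumes a cardioid pattern for the directional microphone; the source does not specify the pattern, so this choice and the function names are illustrative only.

```python
import math


def omnidirectional_gain(angle_deg):
    """An omnidirectional microphone picks up every direction with the same sensitivity."""
    return 1.0


def cardioid_gain(angle_deg):
    """Cardioid pattern: full sensitivity at 0 degrees, a null at 180 degrees."""
    return 0.5 * (1.0 + math.cos(math.radians(angle_deg)))


for angle in (0, 90, 180):
    print(angle, omnidirectional_gain(angle), round(cardioid_gain(angle), 3))
```

A real directional microphone attenuates off-axis sound instead of rejecting it completely, which is consistent with the later statement that it suppresses interference "to a certain extent".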
- the microphone array 130 may be disposed at any position of the housing 110, which is not limited in the embodiment of the present application.
- the microphone array 130 is arranged in the housing 110 and is located in an area on one side of the display screen 120. The sound outlet of the microphone array 130 is arranged on the front of the housing 110, and the orientation of the sound outlet is the same as the orientation of the display screen 120. The front of the housing 110 can be understood as the side on which the display screen 120 is located, or as the side of the housing 110 that faces the user under normal use conditions.
- the microphone array 130 can be arranged in the area of the housing 110 on any side of the display screen 120. The microphone array 130 shown in FIG. 1 is arranged in the area of the housing 110 at the top side of the display screen 120, but the microphone array 130 may also be provided in areas of the housing 110 on other sides of the display screen 120 (e.g., the left, right, or bottom side).
- the microphone array 130 may be disposed in the casing 110 and located in the area on the top side of the display screen 120, and the sound outlet of the microphone array 130 may be disposed on the top surface of the casing 110 (not shown in the figure),
- the top surface of the housing 110 is connected to the front surface of the housing 110 , and the orientation of the sound outlet is perpendicular to the orientation of the display screen 120 .
- the microphone array 130 may also be arranged on the rear side of the display screen 120, and the sound outlet of the microphone array 130 is arranged on the display screen 120 (not shown in the figure).
- the microphone array 130 may also be disposed on the rear side of the display screen 120 , and the sound outlet holes of the microphone array 130 may be disposed on the front side of the housing 110 .
- the microphone array 130 may be arranged in a linear structure as shown in FIG. 1, or may be arranged in other structures, which is not limited in the embodiments of the present application.
- the microphone array 130 may be arranged in a circular configuration, a rectangular configuration, or the like.
- Camera 140 is used to capture still images or video.
- the object is projected through the lens to generate an optical image onto the photosensitive element.
- the photosensitive element may be a charge coupled device (CCD) or a complementary metal-oxide-semiconductor (CMOS) phototransistor.
- the photosensitive element converts the optical signal into an electrical signal, and then transmits the electrical signal to the ISP to convert it into a digital image signal.
- the ISP outputs the digital image signal to the DSP for processing.
- DSP converts digital image signals into standard RGB, YUV and other formats of image signals.
- the electronic device 10 may include 1 or N cameras 140, where N is a positive integer greater than 1.
- the camera 140 can be rotated within a preset angle range to shoot a video within a certain angle range, and the video can be used to determine the direction of the target sound source. After the electronic device 10 determines the direction of the target sound source, the camera 140 can be rotated to face that direction, so that the target user is displayed in the middle of the picture as far as possible. This makes it easier to shoot the video in the direction of the target sound source and obtain the user's lip video.
- the user's lip video can be used to process the audio signal input to the speech enhancement model, so as to output a clearer audio signal for speech recognition.
- the camera 140 is disposed on the top surface of the housing 110 and protrudes from the top surface, so as to better realize the rotation of the camera 140 .
- the camera 140 may be located above the microphone array 130 .
- the camera 140 may be disposed on the front of the housing 110 and in an area on the top side of the display screen 120 .
- the camera 140 may rotate within a preset angle range, and the preset angle range may be any range of angles.
- the rotatable angle range of the camera 140 is less than or equal to 180°. For example, the angle range may be 120°, in which case the camera 140 can be rotated within an angle range of 120° in front of the display screen 120 of the smart TV.
- taking the shooting field of view of the camera 140 into account, essentially the whole scene within a range of 180° in front of the smart TV can be captured.
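Because the camera can only rotate within a preset range, a target direction outside that range has to be limited to the nearest reachable angle. The sketch below assumes that the range is symmetric about the direction perpendicular to the display screen; the function name and this symmetry are assumptions, and only the 120° example value comes from the text.

```python
def clamp_pan_angle(target_deg, range_deg=120.0):
    """Limit a requested pan angle to the rotatable range of the camera.

    target_deg: desired angle relative to the display normal (0 is straight ahead).
    range_deg: total rotatable range, assumed to be centered on the display normal.
    """
    half = range_deg / 2.0
    return max(-half, min(half, target_deg))


print(clamp_pan_angle(75.0))   # limited to the edge of the range
print(clamp_pan_angle(-20.0))  # already reachable, returned unchanged
```

When the request is clamped, the target user can still appear in the picture as long as the remaining offset is within the camera's field of view.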
- the electronic device 10 further includes a directional microphone 150 , which can be rotated to pick up audio signals in a specific direction. After the electronic device 10 determines the target sound source direction, the directional microphone 150 can be rotated to the target sound source direction to perform directional pickup in the target sound source direction.
- the directional microphone 150 can pick up the sound in the direction of the target sound source without distortion and can suppress interference and reverberation to a certain extent. Because the directional microphone 150 picks up sound toward the front, it also suppresses echo well. Therefore, in the embodiment of the present application, the audio signal obtained by the directional microphone 150 and the audio signal obtained by the microphone array 130 can be used as the audio input of the speech enhancement model, and a clearer audio signal can be obtained or restored.
- the directional microphone 150 can be provided on the camera 140; for example, the directional microphone 150 is fixedly connected to the camera 140. In this way, when the camera 140 rotates to the target sound source direction, the directional microphone 150 also rotates to the target sound source direction, which is simple and convenient to implement.
- the electronic device 10 further includes a processor (not shown in the figure), and the display screen 120, the microphone array 130, the camera 140 and the directional microphone 150 are all connected to the processor for inputting the signals collected by the various components to the processor, for further processing.
- The processor runs instructions to implement the signal processing method of the embodiment of the present application, so as to obtain a relatively clear audio signal of the user's speech. After speech recognition is performed on that audio signal, the processor can control the corresponding component to execute the instruction corresponding to the audio signal.
- The structure of the electronic device 10 described above, taking the smart TV as an example, is only a schematic illustration, and the electronic device 10 may have more or fewer components.
- the electronic device 10 may include a microphone array 130 , a camera 140 , and optionally, the electronic device 10 may further include a directional microphone 150 , but the electronic device 10 may not include the display screen 120 .
- the electronic device 10 may include the directional microphone 150 and the camera 140, but the electronic device 10 does not include the microphone array 130.
- In this case, the target sound source direction may be determined from the audio signal picked up by the directional microphone 150 and the video shot by the camera 140. The camera 140 then shoots video in the target sound source direction and the directional microphone 150 picks up the audio signal in that direction, so that a clearer audio signal is recovered from the video and the audio in the target sound source direction by means of the speech enhancement model.
- the directional microphone 150 may rotate all the time to collect the audio signal.
- the electronic device 10 may include other components besides the microphone array 130, the camera 140 and the directional microphone 150.
- the electronic device 10 may be a mobile phone or a computer.
- FIG. 4 is an exemplary block diagram of an electronic device 10 provided by an embodiment of the present application.
- the electronic device 10 may include the display screen 120, the microphone array 130, the directional microphone 150, and the camera 140 shown in FIG. 3.
- The electronic device 10 may further include one or more of the following components: a processor 160, a wireless communication module 171, an audio module 172, a speaker 173, a touch sensor 174, keys 175 and an internal memory 176.
- The wireless communication module 171 can provide wireless communication solutions applied on the electronic device 10, including wireless local area network (WLAN) (such as a wireless fidelity (Wi-Fi) network), Bluetooth (BT), global navigation satellite system (GNSS), frequency modulation (FM), near field communication (NFC), and infrared (IR) technology.
- the wireless communication module 171 may be one or more devices integrating at least one communication processing module.
- the wireless communication module 171 receives electromagnetic waves via the antenna, frequency modulates and filters the electromagnetic wave signals, and sends the processed signals to the processor.
- the wireless communication module 171 can also receive the signal to be sent from the processor, perform frequency modulation on it, amplify it, and convert it into electromagnetic waves for radiation through the antenna.
- the audio module 172 is used for converting digital audio information into analog audio signal output, and also for converting analog audio input into digital audio signal. Audio module 172 may also be used to encode and decode audio signals. In some embodiments, the audio module 172 may be provided in the processor, or some functional modules of the audio module 172 may be provided in the processor 160 .
- The speaker 173, also called a "loudspeaker", is used to convert audio electrical signals into sound signals.
- The electronic device 10 can play music, or the sound of a video, through the speaker 173.
- The speaker 173 can also be used for hands-free calls.
- The touch sensor 174 is also referred to as a "touch panel".
- The touch sensor 174 may be disposed on the display screen 120, in which case the touch sensor 174 and the display screen 120 together form a touchscreen.
- the touch sensor 174 is used to detect a touch operation on or near it.
- the touch sensor 174 may communicate the detected touch operation to the processor 160 to determine the type of touch event.
- Visual output related to touch operations may be provided through display screen 120 .
- The touch sensor 174 may also be disposed on the surface of the electronic device 10 at a location different from that of the display screen 120.
- The keys 175 include a power key, a volume key, and the like. The keys 175 may be mechanical keys or touch keys.
- the electronic device 10 may receive key 175 inputs and generate key signal inputs related to user settings and function control of the electronic device 10 .
- Internal memory 176 is used to store computer executable program code, which includes instructions.
- The processor 160 executes various functional applications and data processing of the electronic device 10 by executing the instructions stored in the internal memory 176.
- The internal memory 176 may include a program storage area and a data storage area.
- The program storage area can store an operating system, an application program required for at least one function (such as a sound playback function or an image playback function), and the like.
- The data storage area can store data created during the use of the electronic device 10 (such as audio data or a phone book), and the like.
- the internal memory 176 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, universal flash storage (UFS), and the like.
- FIG. 5 is a schematic scene diagram provided by an embodiment of the present application.
- The target user is watching TV and says "Xiaoyi Xiaoyi, I want to watch a variety show" to the smart TV. The smart TV receives and recognizes this instruction and switches to the variety show.
- An angle may be used to represent a direction: a reference direction is defined, and the angle between a given direction and the reference direction represents that direction.
- the reference direction may be arbitrary, and the embodiment of the present application does not make any limitation.
- For example, the direction extending to the left along the length direction (for example, the x direction) of the electronic device 10, that is, the direction indicated by the arrow at 0° in the figure, can be taken as the reference direction, and the angle corresponding to the reference direction is 0°.
- If the target user is directly facing the electronic device 10, the included angle between the target sound source direction where the target user is located and the reference direction is 90°.
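- The angle convention above can be illustrated with a small sketch. It assumes a hypothetical 2-D coordinate system in which the reference direction is a vector along the device's length axis and the user's position is known; the function name `direction_angle` and the coordinates are illustrative only and do not come from the application.

```python
import math

def direction_angle(reference, direction):
    """Angle in degrees, in [0, 360), measured counter-clockwise from
    the reference vector to the direction vector."""
    rx, ry = reference
    dx, dy = direction
    dot = rx * dx + ry * dy      # proportional to cos(angle)
    cross = rx * dy - ry * dx    # proportional to sin(angle)
    return math.degrees(math.atan2(cross, dot)) % 360.0

# Hypothetical layout: the reference direction runs along the x axis and
# the user sits straight in front of the device, along the y axis.
reference = (1.0, 0.0)
user_in_front = (0.0, 2.5)
print(direction_angle(reference, user_in_front))  # approximately 90
```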
- the electronic device 10 includes a microphone array 130, a camera 140 and a processing unit 160.
- the processing unit 160 may include a target sound source direction determination module 161 and a speech enhancement module 162.
- the electronic device 10 further includes a directional microphone 150 .
- FIG. 6 is a schematic flowchart of a signal processing method provided by an embodiment of the present application. Referring to FIG. 6 , the general process of the embodiment of the present application is as follows:
- the target user starts to send a voice command to the electronic device, and the microphone array 130 picks up the first audio signal.
- the camera 140 shoots a video to obtain a first video.
- The processing unit 160 performs sound source localization on the first audio signal to obtain sound source direction information including at least one sound source direction, and processes the first video to obtain user direction information. This step may be performed by the target sound source direction determination module 161 in the processing unit 160.
- the processing unit 160 determines the target sound source direction where the target user is located according to the sound source direction information and the user direction information, and this step can be performed by the target sound source direction determination module 161 in the processing unit 160 .
- the processing unit 160 controls the camera 140 to rotate to the direction of the target sound source, and the camera 140 shoots a video in the direction of the target sound source to obtain a video of the user's lips in the direction of the target sound source.
- the microphone array 130 continues to pick up the second audio signal, where the second audio signal is a signal that actually needs speech recognition.
- the processing unit 160 may also control the directional microphone 150 to rotate to the target sound source direction, and the directional microphone 150 picks up the fourth audio signal in the target sound source direction.
- the processing unit 160 controls the camera 140 and the directional microphone 150 to rotate together to the direction of the target sound source.
- Taking the second audio signal and the video of the user's lips in the target sound source direction as input, the processing unit 160 performs speech enhancement processing on the second audio signal through the speech enhancement model and obtains an enhanced, relatively clear third audio signal. This step may be performed by the speech enhancement module 162 in the processing unit 160.
- When the fourth audio signal is available, the processing unit 160 uses the speech enhancement model to perform speech enhancement processing on the second audio signal and the fourth audio signal to obtain the third audio signal.
- Steps S210 and S220 may be performed simultaneously
- steps S250 and S260 may be performed simultaneously
- steps S250, S260 and S270 may be performed simultaneously
- step S250 may be performed before step S260, or may be performed after step S260.
- In the embodiment of the present application, the first video obtained by the camera is combined with the first audio signal obtained by the microphone array to determine the target sound source direction where the target user who is performing voice interaction with the electronic device is located. This can greatly improve the estimation accuracy of the target sound source direction, and it avoids a problem that arises when the direction is determined from the audio signal alone, namely that false sound sources produced by strong reflected sound interfere with the determination.
- In addition, the video of the user's lips in the target sound source direction obtained by the camera is used, together with the speech enhancement model, to perform speech enhancement on the second audio signal obtained by the microphone array. Because the correspondence between pronunciation and lip shape is built into the speech enhancement model, a relatively clean third audio signal can be recovered by combining the user's lip video with the model, and the speech recognition efficiency can ultimately be effectively improved.
- The directional microphone suppresses, to a certain extent, reverberation, interference from directions other than the target sound source direction, and the echo of the display screen itself, and it further suppresses the residual echo left after echo cancellation.
- Therefore, the embodiment of the present application combines the fourth audio signal, picked up by the directional microphone in the target sound source direction, with the second audio signal obtained by the microphone array. Using both audio signals as audio input can greatly improve the sound pickup enhancement and the efficiency of speech recognition.
- FIG. 7 is a schematic flowchart of a signal processing method 300 provided by another embodiment of the present application, and the method may be executed by the processing unit 160 of the electronic device 10 .
- In step S310, sound source localization is performed on the first audio signal obtained through the microphone array 130 to obtain sound source direction information, where the sound source direction information includes at least one sound source direction.
- the at least one sound source direction includes a target sound source direction.
- The user issues a voice command to the electronic device, and the microphone array 130 picks up the audio signal.
- The first audio signal is used for sound source localization. It can be a small part of the voice command issued by the user, small enough that it essentially does not affect the speech recognition of the subsequent content.
- the first audio signal may be a wake-up signal.
- For example, if the voice command issued by the user is "Xiaoyi Xiaoyi, I want to watch a variety show", the first audio signal can be one or more characters of "Xiaoyi", or one or more occurrences of "Xiaoyi".
- One or more characters of "Xiaoyi", or one or more occurrences of "Xiaoyi", can be understood as a wake-up signal.
- When the electronic device detects "Xiaoyi", it can determine that the user may need the electronic device to execute a voice command; sound source localization is then performed via the microphone array 130, and subsequent audio signals continue to be picked up.
- the first audio signal may be the first few words in the voice command.
- the microphone array 130 will detect the audio signal.
- the voice command issued by the user is "I want to watch a variety show", and the first audio signal may be "I”.
- The purpose of performing sound source localization on the first audio signal from the microphone array 130 is to determine the target sound source direction where the target user is located, that is, the direction of the sound source that actually issued the voice command. However, the microphone array 130 picks up audio signals from all directions. Because of the various interfering sounds in the environment, the resulting sound source direction may not be accurate, and at least one sound source direction is obtained; the at least one sound source direction includes the target sound source direction and may also include the directions of interfering signals. For example, suppose the target user issues a voice command to the electronic device while the speaker is playing music and another user (referred to as an interfering user) is speaking. If all three sounds are picked up by the microphone array 130, then 3, 2 or 1 sound source directions may be determined from the microphone array 130, but the result is not accurate. Therefore, the target sound source direction needs to be further determined in combination with the video.
- The sound source localization technology in this embodiment of the present application may be a steerable beamforming technology based on maximum output power, a high-resolution spectral estimation technology, or a sound source localization technology based on time-delay estimation (TDE), which is not limited in the embodiment of the present application.
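- As an illustration of the time-delay estimation family only, the following sketch estimates the delay between two microphones by brute-force cross-correlation and converts it into an arrival angle under a far-field assumption. It is not the localization method of the application: the two-microphone layout, the speed of sound, the sample rate, the spacing and the synthetic signal are all assumptions chosen for the example.

```python
import math
import random

SPEED_OF_SOUND = 343.0  # m/s, assumed room-temperature value

def estimate_delay(sig_a, sig_b, max_lag):
    """Return the lag (in samples) that maximizes the cross-correlation
    sum(sig_a[i] * sig_b[i + lag]); a positive lag means sig_b is late."""
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i in range(len(sig_a)):
            j = i + lag
            if 0 <= j < len(sig_b):
                score += sig_a[i] * sig_b[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

def delay_to_angle(lag, sample_rate, mic_spacing):
    """Far-field model: delay = spacing * cos(angle) / c, with the angle
    measured from the axis pointing from microphone b to microphone a."""
    cos_theta = (lag / sample_rate) * SPEED_OF_SOUND / mic_spacing
    cos_theta = max(-1.0, min(1.0, cos_theta))  # clamp numerical overshoot
    return math.degrees(math.acos(cos_theta))

# Synthetic test signal: microphone b hears the same noise 3 samples later.
rng = random.Random(0)
mic_a = [rng.uniform(-1.0, 1.0) for _ in range(256)]
mic_b = [0.0] * 3 + mic_a[:-3]

lag = estimate_delay(mic_a, mic_b, max_lag=10)
angle = delay_to_angle(lag, sample_rate=16000, mic_spacing=0.1)
print(lag, round(angle, 1))
```

- With reflections or several simultaneous talkers the correlation can show several peaks, which is one way to picture why audio alone may yield more than one candidate direction.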
- a certain direction in this embodiment of the present application can be represented by an angle.
- In step S320, the first video obtained by the camera 140 is processed to obtain user direction information.
- the user direction information includes some directions related to the user, for example, the user direction information includes at least one type of direction related to the user.
- When the user sends a voice command to the electronic device, the camera 140 can shoot a video, and the electronic device can determine the target sound source direction based on the first video obtained.
- A voice command issued by the user may be used as the trigger condition for the camera 140 to shoot a video: the electronic device detects the voice command issued by the user and controls the camera 140 to start shooting. Since the camera 140 in the embodiments of the present application can be rotated, in some examples the camera 140 can rotate while shooting, so as to obtain pictures covering a wider angular range.
- the camera 140 can shoot video all the time, and use the video for a period of time after the electronic device receives the voice command from the user as the first video to determine the direction of the target sound source.
- The electronic device processes the first video captured by the camera 140 and detects user-related content in the first video to obtain the user direction information, which includes at least one type of direction related to the user.
- The users involved in the user direction information include not only the target user who is interacting with the electronic device but also other users, as long as they are detected in the first video.
- Users other than the target user can be understood as interfering users.
- the user direction information includes at least one type of direction related to the user, each type of direction including at least one direction.
- The at least one type of direction includes at least one of the following:
- a first type of direction, which includes the direction in which at least one moving lip is located;
- a second type of direction, which includes the direction in which at least one user is located;
- a third type of direction, which includes the direction in which at least one user who is looking at the electronic device is located.
- the first video is used to detect whether someone's lips are moving, that is, detecting whether someone is speaking, which can effectively exclude the scene where a person is speaking in the video.
- In addition, scenes in which an interfering user is speaking can also be excluded to a certain extent. For example, suppose the target user is watching TV and gives a voice command to the TV while User 1 is also talking but is doing housework with head down and is not facing the TV. In most cases the first video will not show User 1's lips moving; only the target user's lips are detected moving, so User 1, an interfering user, can be effectively excluded.
- the first category of directions includes the target sound source direction.
- interference signals from other non-users can be effectively excluded, for example, interference signals from speakers can be excluded.
- the multiple users can be detected in the first video, and the directions where the multiple users are located can be obtained.
- the second type of direction includes the direction of the target sound source.
- In general, the user looks at the electronic device when issuing a voice command to it, so that the electronic device can receive the voice command well and the user can quickly see whether the electronic device is executing the command or obtain feedback from it. For example, if the user issues a voice command asking about the weather, the user needs to look at the weather conditions displayed on the electronic device. Therefore, by detecting users who are looking at the electronic device, scenes in which an interfering user is speaking can be effectively excluded.
- For example, suppose the target user is watching TV and gives a voice command to the TV while User 1 is speaking to the target user without looking at the TV. In most cases the first video will not show User 1 looking at the electronic device; only the target user is detected looking at the electronic device, so User 1, an interfering user, can be effectively excluded.
- the third category of directions includes the direction of the target sound source.
- the user direction information may include one type or two types or three types of directions among the above three types of directions, which is not limited in any embodiment of the present application.
- the more types of directions included in the user direction information the more favorable it is to improve the accuracy of determining the direction of the target sound source.
- the user direction information may also include other directions related to the user, which is not limited in this embodiment of the present application.
- For example, the user direction information may include other directions related to user behavior.
- In step S330, the target sound source direction is determined according to the sound source direction information and the user direction information.
- the direction of the target sound source is the direction of the target user who is performing voice interaction with the electronic device.
- the sound source direction in the sound source direction information can be regarded as one type of direction, and is combined with at least one type of direction related to the user to jointly determine the target sound source direction.
- FIG. 8 is a schematic flowchart of a method 230 for an electronic device to determine the direction of a target sound source provided by another embodiment of the present application.
- the electronic device may determine the direction of the target sound source in the following manner:
- In step S331, at least one sound source direction in the sound source direction information and at least one type of direction in the user direction information are merged to obtain at least one merged direction;
- In step S332, the target sound source direction is determined from the at least one merged direction.
- the following takes the sound source direction and the above three types of directions as examples, and first describes the manner of obtaining the combined at least one direction.
- If the differences between multiple directions are smaller than a threshold, one direction can be determined from those directions.
- In that case the multiple directions can be regarded as the same direction, and the one direction finally determined may be any one of the multiple directions or may be obtained by taking their average value, which is not limited in the embodiment of the present application.
- the threshold can be reasonably designed based on actual application scenarios, and for example, the threshold can be 5°.
- the sound source direction information includes 4 sound source directions, and the corresponding angles are 30°, 60°, 95°, and 120° respectively.
- the first type of direction includes one direction, and the corresponding angle is 93°.
- the second type of direction includes two directions, and the corresponding angles are 63° and 95°, respectively.
- the third type of direction includes one direction, and the corresponding angle is 95°.
- With a threshold of 5° and averaging, the angles obtained by the merge processing are 30°, 61.5°, 94.5° and 120°; that is, the merged result includes 4 directions.
- the direction of the target sound source is one of the four directions. In fact, the direction corresponding to 94.5° is the direction of the target sound source.
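- The merge in this example can be reproduced with a short sketch. It assumes the averaging option and a 5° threshold, and it groups angles by their distance to the first angle of the current group; the grouping rule and the function name `merge_directions` are illustrative choices, since the text allows either any member or the average to represent a group.

```python
def merge_directions(angles, threshold=5.0):
    """Group sorted angles whose distance to the first angle of the
    current group is below the threshold; return each group's mean."""
    merged = []
    group = []
    for angle in sorted(angles):
        if group and angle - group[0] >= threshold:
            merged.append(sum(group) / len(group))
            group = []
        group.append(angle)
    if group:
        merged.append(sum(group) / len(group))
    return merged

sound_sources = [30, 60, 95, 120]   # from sound source localization
first_type = [93]                   # moving lips
second_type = [63, 95]              # users present
third_type = [95]                   # users looking at the device

all_angles = sound_sources + first_type + second_type + third_type
print(merge_directions(all_angles))  # [30.0, 61.5, 94.5, 120.0]
```

- The sketch treats angles as plain numbers and does not handle wrap-around at 0°/360°, which is adequate for a camera range of at most 180°.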
- the target user basically faces the electronic device and performs voice interaction with the electronic device.
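- After obtaining the at least one merged direction, the electronic device determines the target sound source direction from the at least one direction.
- Some parameters may be set based on the specific scene of far-field sound pickup, and the target sound source direction may be determined based on these parameters.
- The electronic device may determine the target sound source direction from the at least one direction according to at least one of the following parameters:
- the total number of times each direction is detected among the sound source directions and the at least one type of direction;
- whether the electronic device has successfully performed voice interaction with the user within a preset time period and within the preset angle range corresponding to each direction, where the preset time period is the period between the current time and a historical time, and the preset angle range includes the angle corresponding to each direction;
- the angle between each direction and the direction perpendicular to the display screen of the electronic device.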
- Each parameter is described below, taking as an example the case where the at least one type of direction includes the above three types of directions, with the angles given above: the four sound source directions correspond to 30°, 60°, 95° and 120°; the first type of direction includes one direction, at 93°; the second type of direction includes two directions, at 63° and 95°; the third type of direction includes one direction, at 95°; and the four directions obtained by the merge processing correspond to 30°, 61.5°, 94.5° and 120°.
- First parameter: for each direction, the total number of times it is detected among the sound source directions and the at least one type of direction.
- For example, the direction at 30° is detected 1, 0, 0 and 0 times in the sound source directions, the first type of direction, the second type of direction and the third type of direction, respectively, for a total of 1; the direction at 61.5° is detected 1, 0, 1 and 0 times, for a total of 2; the direction at 94.5° is detected 1, 1, 1 and 1 times, for a total of 4; and the direction at 120° is detected 1, 0, 0 and 0 times, for a total of 1. The direction at 94.5° therefore has the highest total number of detections, and this direction is essentially the target sound source direction.
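- The tally for the first parameter can be sketched as follows. The sketch assumes that a merged direction counts as detected in a set of directions when that set contains an angle within the 5° merge threshold, and that each set contributes at most one detection, which matches the 1/0 entries of the example; the function name `detection_counts` is illustrative.

```python
def detection_counts(merged, direction_sets, threshold=5.0):
    """For each merged direction, count in how many direction sets
    some angle lies within the threshold of it."""
    counts = {}
    for m in merged:
        counts[m] = sum(
            1
            for angles in direction_sets
            if any(abs(a - m) < threshold for a in angles)
        )
    return counts

merged = [30.0, 61.5, 94.5, 120.0]
direction_sets = [
    [30, 60, 95, 120],  # sound source directions
    [93],               # first type: moving lips
    [63, 95],           # second type: users present
    [95],               # third type: users looking at the device
]

counts = detection_counts(merged, direction_sets)
print(counts)                       # {30.0: 1, 61.5: 2, 94.5: 4, 120.0: 1}
print(max(counts, key=counts.get))  # 94.5
```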
- Second parameter: whether the electronic device has successfully performed voice interaction with the user within the preset time period and within the preset angle range corresponding to each direction, where the preset time period is the period between the current time and a historical time, and the preset angle range includes the angle corresponding to each direction.
- The preset angle range corresponding to each direction may include not only the angle corresponding to that direction but also nearby angles. For example, if the angle corresponding to a direction is 30°, the preset angle range may be 25° to 35°. It should be understood that the smaller the preset angle range, the more accurate the target sound source direction determined using this parameter.
- the preset time period is the time period between the current time and the historical time, and the historical time is the time before the current time.
- the duration of the preset time period should not be set too long, which is conducive to more accurate determination of the target sound source direction.
- For example, the duration of the preset time period can be set to 1 minute, 5 minutes, 10 minutes, and so on. Assuming the current time is 10:30 and the duration is 10 minutes, the historical time is 10:20 and the preset time period is the period between 10:20 and 10:30.
- In other words, the second parameter can be understood as whether the electronic device has successfully performed voice interaction with the user in the vicinity of the angle corresponding to a given direction within the preset time period.
- A user is likely to keep using the electronic device for some time. This is especially true for an electronic device with a display screen, such as a smart TV, because the user generally does not change position frequently while watching TV. Therefore, if the electronic device and the user have successfully performed voice interaction within the preset time period and within the preset angle range corresponding to a direction, that direction is more likely to be the target sound source direction; otherwise, it is less likely. Further, the more often the electronic device and the user have successfully interacted by voice there, the more likely the direction is to be the target sound source direction, and vice versa.
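- A minimal sketch of the second parameter follows, assuming the device keeps a log of successful voice interactions as (timestamp, angle) pairs. The log format, the function name `had_recent_interaction`, the 10-minute period, the ±5° range and the log entries are all hypothetical.

```python
from datetime import datetime, timedelta

def had_recent_interaction(angle, history, now,
                           period=timedelta(minutes=10), half_range=5.0):
    """True if a successful interaction was logged within the preset
    period before `now` and within +/- half_range degrees of `angle`."""
    start = now - period
    return any(
        start <= ts <= now and abs(a - angle) <= half_range
        for ts, a in history
    )

now = datetime(2021, 9, 17, 10, 30)
# Hypothetical log of successful interactions: (time, angle in degrees).
history = [
    (datetime(2021, 9, 17, 10, 25), 92.0),  # inside the 10:20-10:30 window
    (datetime(2021, 9, 17, 9, 50), 31.0),   # too old to count
]

for direction in [30.0, 61.5, 94.5, 120.0]:
    print(direction, had_recent_interaction(direction, history, now))
```

- With this log only the direction at 94.5° qualifies, because the interaction near 30° happened before the preset time period began.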
- Third parameter: the angle between each direction and the direction perpendicular to the display screen of the electronic device.
- the third parameter is more suitable for an electronic device with a display screen, and the direction perpendicular to the display screen of the electronic device can be understood as the thickness direction of the electronic device.
- In other words, the third parameter can be understood as whether the user is near a specific direction defined for the preset usage scenario of the electronic device.
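- The third parameter can be sketched as below. The sketch assumes that the direction perpendicular to the display screen corresponds to 90° in the reference system used earlier, where a user directly facing the device is at 90°; that value and the function names are assumptions, not stated definitions.

```python
SCREEN_NORMAL_DEG = 90.0  # assumed angle of the screen's perpendicular

def angle_to_normal(direction, normal=SCREEN_NORMAL_DEG):
    """Absolute angular offset between a direction and the screen normal."""
    return abs(direction - normal)

def closest_to_normal(directions, normal=SCREEN_NORMAL_DEG):
    """Direction with the smallest included angle to the screen normal."""
    return min(directions, key=lambda d: angle_to_normal(d, normal))

merged = [30.0, 61.5, 94.5, 120.0]
print({d: angle_to_normal(d) for d in merged})
# {30.0: 60.0, 61.5: 28.5, 94.5: 4.5, 120.0: 30.0}
print(closest_to_normal(merged))  # 94.5
```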
- the electronic device may determine the direction of the target sound source based on one, two or three of the above parameters, which is not limited in the embodiment of the present application, which will be described below.
- The at least one parameter includes the first parameter, i.e., the total number of times each direction is detected among the sound source directions and the at least one type of direction.
- The direction with the largest total number of detections among the sound source directions and the at least one type of direction may be used as the target sound source direction.
- the at least one parameter includes the second parameter, that is, the at least one parameter includes: within a preset time period and a preset angle range corresponding to each direction, whether the electronic device has successfully performed voice interaction with the user.
- the direction corresponding to the angle at which the electronic device and the user have successfully performed voice interaction within the preset time period and the preset angle range is determined as the target sound source direction.
- the at least one parameter includes a third parameter, that is, the at least one parameter includes an angle between each direction and a direction perpendicular to the display screen of the electronic device.
- the direction with the smallest included angle with the direction perpendicular to the display screen of the electronic device may be determined as the target sound source direction.
- The at least one parameter may include any two or all three of the parameters.
- For each parameter, a candidate sound source direction may be obtained following the principle of the corresponding example above, and the direction that occurs most often among the candidates is used as the target sound source direction.
- For example, suppose the at least one parameter includes the first parameter and the second parameter. For the first parameter, the direction with the largest total number of detections among the sound source directions and the at least one type of direction is taken as one candidate sound source direction, say 94.5°. For the second parameter, the direction corresponding to the angle at which the electronic device and the user have successfully performed voice interaction within the preset time period and the preset angle range is taken as another candidate sound source direction, say also 94.5°. The target sound source direction obtained from the two candidates is then 94.5°.
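- Choosing the most frequently proposed candidate can be sketched with a simple vote. The text does not say how ties are resolved; in this sketch a tie goes to the candidate that was proposed first, which is purely an assumption, as is the function name `pick_by_vote`.

```python
from collections import Counter

def pick_by_vote(candidates):
    """Return the candidate direction proposed most often; on a tie the
    earliest-proposed candidate wins (an assumption of this sketch)."""
    if not candidates:
        raise ValueError("no candidate directions")
    return Counter(candidates).most_common(1)[0][0]

# One candidate per parameter, as in the example above.
print(pick_by_vote([94.5, 94.5]))        # 94.5
print(pick_by_vote([94.5, 61.5, 94.5]))  # 94.5
```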
- the electronic device may determine the confidence of each direction according to at least one parameter, and determine the direction corresponding to the confidence with the largest numerical value in the at least one direction as the direction of the target sound source.
- The confidence of each direction can also be called the reliability of each direction. It indicates the probability that the direction is the target sound source direction: the greater the confidence, the greater the probability that the corresponding direction is the target sound source direction.
- Below, the method of determining the target sound source direction by confidence is described for the case where the at least one parameter includes three parameters. It should be understood that, in embodiments where the at least one parameter includes one or two parameters, the method is similar to the embodiment with three parameters; reference may be made to the following description, which is not repeated.
- A weighting value may be configured for each parameter according to the priority of the three parameters, the confidence of each direction may be calculated from its score under each parameter, and the target sound source direction may be determined based on those confidences.
- the direction corresponding to the confidence degree with the largest numerical value in at least one direction may be determined as the target sound source direction
- For example, the priorities of the three parameters, in descending order, are: first parameter > second parameter > third parameter. Correspondingly, the weights of the parameters are as follows.
- The weight of the first parameter is 0.5.
- The weight of the second parameter is 0.3.
- The weight of the third parameter is 0.2.
- For the first parameter, each time a direction is detected it scores 10 points.
- For the second parameter, if the electronic device and the user have successfully performed voice interaction within the preset time period and within the preset angle range corresponding to a direction, that direction also scores 10 points.
- For the third parameter, if the included angle between a direction and the direction perpendicular to the display screen is within a threshold, that direction also scores 10 points; for example, the threshold is 10°.
- For example, there are four sound source directions, with corresponding angles of 30°, 60°, 95°, and 120°.
- The first type of direction includes one direction, and the corresponding angle is 93°.
- The second type of direction includes two directions, and the corresponding angles are 63° and 95°, respectively.
- The third type of direction includes one direction, and the corresponding angle is 95°.
- The angles corresponding to the four directions obtained by the merging processing are: 30°, 61.5°, 94.5°, and 120°.
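The four merged angles quoted above can be reproduced with a short sketch. The grouping rule used here (sort the angles, group those within 5° of the first angle in a group, then average each group) is an assumption chosen because it yields the stated result; the function name `merge_directions` and the 5° tolerance are hypothetical.

```python
def merge_directions(angles_deg, tolerance_deg=5.0):
    """Group nearby angles and replace each group with its mean.

    tolerance_deg is an assumed grouping width, not a value from the source.
    """
    groups = []
    for angle in sorted(angles_deg):
        if groups and angle - groups[-1][0] <= tolerance_deg:
            groups[-1].append(angle)   # close to the current group's first angle
        else:
            groups.append([angle])     # start a new group
    return [sum(group) / len(group) for group in groups]


sound_source_dirs = [30.0, 60.0, 95.0, 120.0]
active_lip_dirs = [93.0]        # first type of direction
user_dirs = [63.0, 95.0]        # second type of direction
gazing_user_dirs = [95.0]       # third type of direction

merged = merge_directions(
    sound_source_dirs + active_lip_dirs + user_dirs + gazing_user_dirs
)
print(merged)  # [30.0, 61.5, 94.5, 120.0]
```

Here 60° and 63° average to 61.5°, and 93°, 95°, 95°, and 95° average to 94.5°, which matches the merged angles in the example.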
- For the first parameter, a direction that is detected only among the sound source directions (such as 30°) obtains one score of 10 points.
- For the first parameter, a direction that is detected among the sound source directions and the second type of direction (61.5° in this example) obtains two scores of 10 points, that is, 20 points.
- For the second parameter, if the electronic device has successfully performed voice interaction once in the direction corresponding to 61.5°, that direction obtains one score of 10 points.
- For the first parameter, a direction that is detected among the sound source directions and all three types of direction (94.5° in this example) obtains four scores of 10 points, that is, 40 points.
- For the second parameter, if the electronic device has successfully performed voice interaction once in the direction corresponding to 94.5°, that direction obtains one score of 10 points.
- Because the confidence of 94.5° is the highest, the direction corresponding to 94.5° is determined as the direction of the target sound source.
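The worked example can be checked numerically with a minimal sketch. Several details are assumptions because the text does not state them: the confidence is taken to be the weighted sum of the three per-parameter scores, the screen normal is placed at 90°, "within the threshold" is read as strictly less than 10°, and the 30° and 120° directions are assumed to have one detection and no prior successful interaction. All names are hypothetical.

```python
WEIGHTS = {"detections": 0.5, "interaction": 0.3, "facing": 0.2}
POINTS = 10                      # points per detection / interaction / facing hit
SCREEN_NORMAL_DEG = 90.0         # assumed direction perpendicular to the screen
FACING_THRESHOLD_DEG = 10.0      # threshold quoted in the example


def confidence(angle_deg, detections, interacted):
    """Weighted sum of the three per-parameter scores (assumed formula)."""
    detection_score = POINTS * detections
    interaction_score = POINTS if interacted else 0
    facing = abs(angle_deg - SCREEN_NORMAL_DEG) < FACING_THRESHOLD_DEG
    facing_score = POINTS if facing else 0
    return (WEIGHTS["detections"] * detection_score
            + WEIGHTS["interaction"] * interaction_score
            + WEIGHTS["facing"] * facing_score)


# angle -> (number of detections, successful interaction in the preset period)
observations = {
    30.0: (1, False),    # assumed: sound source direction only
    61.5: (2, True),
    94.5: (4, True),
    120.0: (1, False),   # assumed: sound source direction only
}

confidences = {
    angle: confidence(angle, count, interacted)
    for angle, (count, interacted) in observations.items()
}
target = max(confidences, key=confidences.get)
print(target)  # 94.5
```

Under these assumptions the confidences come out at roughly 5, 13, 25, and 5 for 30°, 61.5°, 94.5°, and 120°, so 94.5° is selected, in agreement with the example.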
- In step S340, a video of the user's lips in the direction of the target sound source is obtained through the camera 140.
- After determining the direction of the target sound source, the electronic device rotates the camera 140 to that direction, and the camera 140 shoots a video there; the video includes a video of the lips of the target user in the direction of the target sound source.
- In step S350, a second audio signal is obtained through the microphone array 130.
- the second audio signal is a signal indicative of the actual voice command.
- For example, if the voice instruction issued by the target user is "Xiaoyi Xiaoyi, I want to watch a variety show", the second audio signal can be used to indicate the voice instruction "I want to watch a variety show".
- the second audio signal is acquired in the direction of the target sound source through the microphone array 130 based on the beamforming technology.
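The text names beamforming without saying which beamformer is used. As one common possibility, the sketch below shows a delay-and-sum beamformer for a uniform linear array. Everything specific here is an assumption: the array geometry, the far-field model, the speed of sound, whole-sample delays, and the function names `arrival_lags` and `delay_and_sum`.

```python
import math

SPEED_OF_SOUND_M_S = 343.0  # assumed speed of sound in air


def arrival_lags(num_mics, spacing_m, angle_deg, sample_rate_hz):
    """Arrival lag of each microphone in whole samples, shifted so the minimum is 0.

    Assumes a uniform linear array and a far-field source whose direction is
    measured from the array axis (90 degrees is broadside).
    """
    cos_theta = math.cos(math.radians(angle_deg))
    raw = [
        round(-m * spacing_m * cos_theta / SPEED_OF_SOUND_M_S * sample_rate_hz)
        for m in range(num_mics)
    ]
    earliest = min(raw)
    return [lag - earliest for lag in raw]


def delay_and_sum(channels, lags):
    """Advance each channel by its lag and average the aligned samples."""
    length = len(channels[0])
    output = []
    for i in range(length):
        total = 0.0
        for samples, lag in zip(channels, lags):
            j = i + lag
            if j < length:          # samples past the end count as silence
                total += samples[j]
        output.append(total / len(channels))
    return output


# Synthetic check: four channels carrying the same tone with known lags.
source = [math.sin(0.3 * n) for n in range(64)]
demo_lags = [3, 2, 1, 0]
demo_channels = [
    [source[n - lag] if n - lag >= 0 else 0.0 for n in range(64)]
    for lag in demo_lags
]
steered = delay_and_sum(demo_channels, demo_lags)
```

When the lags match the true arrival lags, the channels add coherently and `steered` reproduces `source` (apart from the last few samples), whereas sound from other directions is averaged out of phase and attenuated.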
- In step S360, a third audio signal is obtained through a speech enhancement model according to the second audio signal and the video of the user's lips, where the speech enhancement model includes correspondences between multiple pronunciations and multiple lip shapes.
- the purpose of the speech enhancement model is to perform pickup enhancement processing on the audio signal, enhance the audio signal in the direction of the target sound source, and suppress or eliminate the audio signal in other directions, so as to obtain or restore a clearer audio signal.
- The speech enhancement model fuses audio and video information and incorporates the correspondence between pronunciation and lip shape, that is, one or more pronunciations correspond to one lip shape. The second audio signal is used as the audio input, and the lip information in the direction of the target sound source is used as the video input.
- The speech enhancement model can enhance the audio signal based on the correspondence between pronunciation and lip shape and on the input video of the user's lips, and obtain or restore a relatively clear third audio signal for speech recognition.
- Because the audio signal is processed using both audio and video information, a relatively clean audio signal can be obtained, which greatly improves the sound pickup enhancement effect.
- the speech enhancement module may perform noise reduction processing, echo cancellation residual processing, de-reverberation processing, etc. on the second audio signal.
- FIG. 9 is a schematic flowchart of a signal processing method 400 provided by another embodiment of the present application, and the method may be executed by the processing unit 160 of the electronic device 10.
- In step S410, sound source localization is performed on the first audio signal obtained through the microphone array 130 to obtain sound source direction information, where the sound source direction information includes at least one sound source direction.
- The at least one sound source direction includes a target sound source direction.
- For a specific description of step S410, refer to the description of step S310 above.
- In step S420, the first video obtained by the camera 140 is processed to obtain user direction information, where the user direction information includes at least one type of direction related to the user.
- For a specific description of step S420, refer to the description of step S320 above.
- In step S430, the target sound source direction is determined according to the sound source direction information and the user direction information, where the target sound source direction is the direction of the target user who is performing voice interaction with the electronic device.
- For a specific description of step S430, refer to the description of step S330 above.
- In step S440, a video of the user's lips in the direction of the target sound source is obtained through the camera 140.
- For a specific description of step S440, refer to the description of step S340 above.
- In step S450, a second audio signal is obtained through the microphone array 130.
- For a specific description of step S450, refer to the description of step S350 above.
- In step S460, a fourth audio signal in the direction of the target sound source is obtained through the directional microphone 150.
- The electronic device may control the directional microphone 150 to rotate to the direction of the target sound source and pick up the fourth audio signal in that direction.
- The electronic device can control the camera 140 and the directional microphone 150 to rotate together to the direction of the target sound source.
- In step S470, a third audio signal is obtained through a speech enhancement model according to the second audio signal, the fourth audio signal, and the video of the user's lips.
- The second audio signal picked up by the microphone array 130 and the fourth audio signal picked up by the directional microphone 150 are used as the audio input of the speech enhancement model, and the video of the user's lips is used as the video input.
- The input audio signals are processed to obtain a relatively clear third audio signal.
- Because the directional microphone 150 can pick up sound in the direction of the target sound source without distortion, it suppresses interference and reverberation to a certain extent; and because it picks up sound in the forward direction, it also suppresses echo well.
- The sequence numbers of the processes do not indicate the order of execution. The execution order of each process should be determined by its function and internal logic, and the sequence numbers should not constitute any limitation on the implementation of the embodiments of the present application.
- The embodiments of the present application also provide an electronic device, which may be the electronic device shown in FIG. 4. The electronic device includes a microphone array 130, a rotatable camera 140, and a processor 160, and the processor 160 is configured to:
- a third audio signal is obtained through a speech enhancement model, where the speech enhancement model includes a correspondence between pronunciation and lip shape.
- the electronic device further includes a directional microphone 150
- the processor 160 is further configured to:
- the processor 160 is specifically used for:
- the third audio signal is obtained through the speech enhancement module according to the second audio signal, the fourth audio signal and the video of the user's lips.
- the directional microphone 150 is fixedly connected to the camera 140.
- the user direction information includes at least one of the following types of directions:
- a first type of direction, which includes the direction in which at least one active lip is located;
- a second type of direction, which includes the direction in which at least one user is located;
- a third type of direction, which includes the direction in which at least one user who is looking at the electronic device is located.
- the sound source direction information includes at least one sound source direction
- the processor 160 is specifically configured to:
- the target sound source direction is determined from the at least one direction.
- the processor 160 is specifically configured to:
- the target sound source direction is determined from the at least one direction according to at least one of the following parameters;
- the at least one parameter includes:
- the preset time period is the time period between the current time and a historical time;
- the included angle between each of the directions and the direction perpendicular to the display screen of the electronic device.
- the processor 160 is specifically configured to:
- the direction corresponding to the confidence with the largest value among the at least one direction is determined as the target sound source direction.
- the processor 160 is specifically configured to:
- the second audio signal is obtained in the direction of the target sound source based on a beamforming technique.
- the first audio signal is a wake-up signal.
- the electronic device is a smart TV.
- The terms “connection” and “fixed connection” should be interpreted in a broad sense.
- The specific meanings of the above terms in the embodiments of the present application can be understood according to specific situations.
- A “connection” can be any of various connection methods, such as a fixed connection, rotational connection, flexible connection, movable connection, integral molding, or electrical connection; it can be a direct connection or an indirect connection through an intermediate medium; and it can be internal communication between two elements or an interaction relationship between two elements.
- In a “fixed connection”, one element can be directly or indirectly fixedly connected to another element. The fixed connection can include mechanical connection, welding, bonding, and so on, where the mechanical connection can include riveting, bolting, screw connection, key-pin connection, snap connection, lock connection, plug connection, and so on, and bonding can include adhesive bonding and solvent bonding.
- The terms “first” and “second” are used only for descriptive purposes and cannot be understood as indicating or implying relative importance or implying the number of indicated technical features.
- Features defined by “first” and “second” may expressly or implicitly include one or more of that feature.
- “At least one” refers to one or more, and “a plurality” refers to two or more.
- “At least part of an element” means part or all of the element.
- “And/or” describes the association relationship of associated objects and indicates that three relationships are possible. For example, “A and/or B” can mean that A exists alone, that A and B exist at the same time, or that B exists alone, where A and B can be singular or plural.
- The character “/” generally indicates that the associated objects are in an “or” relationship.
- the disclosed system, apparatus and method may be implemented in other manners.
- the apparatus embodiments described above are only illustrative.
- the division of the units is only a logical function division. In actual implementation, there may be other division methods.
- For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
- The mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be in electrical, mechanical, or other forms.
- the units described as separate components may or may not be physically separated, and components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution in this embodiment.
- each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.
- the functions, if implemented in the form of software functional units and sold or used as independent products, may be stored in a computer-readable storage medium.
- The technical solution of the present application in essence, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product.
- The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods described in the various embodiments of the present application.
- The aforementioned storage medium includes media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
Landscapes
- Engineering & Computer Science (AREA)
- Signal Processing (AREA)
- Physics & Mathematics (AREA)
- Multimedia (AREA)
- Health & Medical Sciences (AREA)
- Human Computer Interaction (AREA)
- Computational Linguistics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Acoustics & Sound (AREA)
- Quality & Reliability (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Bioinformatics & Computational Biology (AREA)
- General Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Oral & Maxillofacial Surgery (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- General Engineering & Computer Science (AREA)
- Circuit For Audible Band Transducer (AREA)
- Studio Devices (AREA)
Abstract
Description
Claims (20)
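- A signal processing method, characterized in that the method is applied to an electronic device comprising a microphone array and a camera, and the method comprises: performing sound source localization on a first audio signal obtained through the microphone array to obtain sound source direction information; processing a first video obtained through the camera to obtain user direction information; determining a target sound source direction according to the sound source direction information and the user direction information; obtaining, through the camera, a video of the user's lips in the target sound source direction; obtaining a second audio signal through the microphone array; and obtaining a third audio signal through a speech enhancement model according to the second audio signal and the video of the user's lips, wherein the speech enhancement model comprises a correspondence between pronunciation and lip shape.
- The method according to claim 1, characterized in that the electronic device further comprises a directional microphone, and the method further comprises: obtaining, through the directional microphone, a fourth audio signal in the target sound source direction; and the obtaining a third audio signal through a speech enhancement model according to the second audio signal and the video of the user's lips in the target sound source direction comprises: obtaining the third audio signal through the speech enhancement module according to the second audio signal, the fourth audio signal, and the video of the user's lips.
- The method according to claim 1 or 2, characterized in that the user direction information comprises at least one of the following types of direction: a first type of direction, comprising a direction in which at least one lip in an active state is located; a second type of direction, comprising a direction in which at least one user is located; and a third type of direction, comprising a direction in which at least one user who is looking at the electronic device is located.
- The method according to claim 3, characterized in that the sound source direction information comprises at least one sound source direction, and the determining a target sound source direction according to the sound source direction information and the user direction information comprises: merging the at least one sound source direction and the at least one type of direction to obtain at least one merged direction; and determining the target sound source direction from the at least one direction.
- The method according to claim 4, characterized in that the determining the target sound source direction from the at least one direction comprises: determining the target sound source direction from the at least one direction according to at least one of the following parameters, wherein the at least one parameter comprises: a sum of the frequencies with which each of the at least one direction is detected among the sound source directions and the at least one type of direction; whether the electronic device has successfully performed voice interaction with a user within a preset time period and within a preset angle range corresponding to each direction, the preset time period being a time period between the current time and a historical time; and an included angle between each direction and a direction perpendicular to the display screen of the electronic device.
- The method according to claim 5, characterized in that the determining the target sound source direction from the at least one direction according to at least one of the following parameters comprises: determining a confidence of each direction according to the at least one parameter; and determining, as the target sound source direction, the direction corresponding to the confidence with the largest value among the at least one direction.
- The method according to any one of claims 1 to 6, characterized in that the obtaining a second audio signal through the microphone array comprises: obtaining, through the microphone array, the second audio signal in the target sound source direction based on a beamforming technique.
- The method according to any one of claims 1 to 7, characterized in that the first audio signal is a wake-up signal.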
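- An electronic device, characterized by comprising a microphone array, a camera, and a processor, the processor being configured to: perform sound source localization on a first audio signal obtained through the microphone array to obtain sound source direction information; process a first video obtained through the camera to obtain user direction information; determine a target sound source direction according to the sound source direction information and the user direction information; obtain, through the camera, a video of the user's lips in the target sound source direction; obtain a second audio signal through the microphone array; and obtain a third audio signal through a speech enhancement model according to the second audio signal and the video of the user's lips, wherein the speech enhancement model comprises a correspondence between pronunciation and lip shape.
- The electronic device according to claim 9, characterized in that the electronic device further comprises a directional microphone, and the processor is further configured to: obtain, through the directional microphone, a fourth audio signal in the target sound source direction; and the processor is specifically configured to: obtain the third audio signal through the speech enhancement module according to the second audio signal, the fourth audio signal, and the video of the user's lips.
- The electronic device according to claim 10, characterized in that the directional microphone is fixedly connected to the camera.
- The electronic device according to any one of claims 9 to 11, characterized in that the user direction information comprises at least one of the following types of direction: a first type of direction, comprising a direction in which at least one lip in an active state is located; a second type of direction, comprising a direction in which at least one user is located; and a third type of direction, comprising a direction in which at least one user who is looking at the electronic device is located.
- The electronic device according to claim 12, characterized in that the sound source direction information comprises at least one sound source direction, and the processor is specifically configured to: merge the at least one sound source direction and the at least one type of direction to obtain at least one merged direction; and determine the target sound source direction from the at least one direction.
- The electronic device according to claim 13, characterized in that the processor is specifically configured to: determine the target sound source direction from the at least one direction according to at least one of the following parameters, wherein the at least one parameter comprises: a sum of the frequencies with which each of the at least one direction is detected among the sound source directions and the at least one type of direction; whether the electronic device has successfully performed voice interaction with a user within a preset time period and within a preset angle range corresponding to each direction, the preset time period being a time period between the current time and a historical time; and an included angle between each direction and a direction perpendicular to the display screen of the electronic device.
- The electronic device according to claim 14, characterized in that the processor is specifically configured to: determine a confidence of each direction according to the at least one parameter; and determine, as the target sound source direction, the direction corresponding to the confidence with the largest value among the at least one direction.
- The electronic device according to any one of claims 9 to 15, characterized in that the processor is specifically configured to: obtain, through the microphone array, the second audio signal in the target sound source direction based on a beamforming technique.
- The electronic device according to any one of claims 9 to 16, characterized in that the first audio signal is a wake-up signal.
- The electronic device according to any one of claims 9 to 17, characterized in that the electronic device is a smart TV.
- A computer storage medium, characterized by comprising: a processor, the processor being coupled to a memory, the memory being configured to store a program or instructions that, when executed by the processor, cause the apparatus to perform the method according to any one of claims 1 to 8.
- A computer program product containing instructions, characterized in that, when the computer program product runs on an electronic device, the electronic device is caused to perform the method according to any one of claims 1 to 8.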
Priority Applications (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US18/247,212 US20230386494A1 (en) | 2020-09-30 | 2021-09-17 | Signal processing method and electronic device |
| EP21874269.0A EP4207186B1 (en) | 2020-09-30 | 2021-09-17 | Signal processing method and electronic device |
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202011065346.1 | 2020-09-30 | ||
| CN202011065346.1A CN114333831B (zh) | 2020-09-30 | 2020-09-30 | Signal processing method and electronic device |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2022068608A1 (zh) | 2022-04-07 |
Family
ID=80949550
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/CN2021/118948 Ceased WO2022068608A1 (zh) | Signal processing method and electronic device | 2020-09-30 | 2021-09-17 |
Country Status (4)
| Country | Link |
|---|---|
| US (1) | US20230386494A1 (zh) |
| EP (1) | EP4207186B1 (zh) |
| CN (1) | CN114333831B (zh) |
| WO (1) | WO2022068608A1 (zh) |
Cited By (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN115174959A (zh) * | 2022-06-21 | 2022-10-11 | 咪咕文化科技有限公司 | Video 3D sound effect setting method and device |
Families Citing this family (8)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
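| CN115910038A (zh) * | 2022-09-27 | 2023-04-04 | 北京地平线机器人技术研发有限公司 | Voice signal extraction method and apparatus, readable storage medium, and electronic device |
| CN119493074A (zh) * | 2023-08-15 | 2025-02-21 | 华为技术有限公司 | Orientation determination method and related device |
| CN119673153A (zh) * | 2023-09-12 | 2025-03-21 | 荣耀终端股份有限公司 | Voice interaction method and related device |
| CN118447866B (zh) * | 2023-09-13 | 2025-03-07 | 荣耀终端股份有限公司 | Audio processing method and electronic device |
| CN118072744B (zh) * | 2024-04-18 | 2024-07-23 | 深圳市万屏时代科技有限公司 | Voiceprint-based language recognition method and device |
| CN118865995B (zh) * | 2024-09-04 | 2025-08-12 | 美的集团(上海)有限公司 | Multi-channel speech noise reduction method and system, electronic device, and storage medium |
| CN118900374B (zh) * | 2024-09-18 | 2025-10-28 | 深圳市万屏时代科技有限公司 | Sound pickup assembly, and control method and control device thereof |
| CN120452418A (zh) * | 2025-04-23 | 2025-08-08 | 杭州灵伴科技有限公司 | Translation system, smart glasses, computer-readable medium, and computer program product |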
Citations (7)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
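| CN106328156A (zh) * | 2016-08-22 | 2017-01-11 | 华南理工大学 | Microphone array speech enhancement system and method fusing audio and video information |
| CN106679651A (zh) * | 2017-02-08 | 2017-05-17 | 北京地平线信息技术有限公司 | Sound source localization method, apparatus, and electronic device |
| WO2017129239A1 (en) * | 2016-01-27 | 2017-08-03 | Nokia Technologies Oy | System and apparatus for tracking moving audio sources |
| CN110082723A (zh) * | 2019-05-16 | 2019-08-02 | 浙江大华技术股份有限公司 | Sound source localization method, apparatus, device, and storage medium |
| CN110389597A (zh) * | 2018-04-17 | 2019-10-29 | 北京京东尚科信息技术有限公司 | Camera adjustment method, apparatus, and system based on sound source localization |
| CN110691196A (zh) * | 2019-10-30 | 2020-01-14 | 歌尔股份有限公司 | Sound source localization method for an audio device, and audio device |
| CN110858488A (zh) * | 2018-08-24 | 2020-03-03 | 阿里巴巴集团控股有限公司 | Voice activity detection method, apparatus, device, and storage medium |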
Family Cites Families (10)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| KR100499124B1 (ko) * | 2002-03-27 | 2005-07-04 | 삼성전자주식회사 | Orthogonal circular microphone array system and method for detecting the three-dimensional direction of a sound source using the same |
| WO2013091677A1 (en) * | 2011-12-20 | 2013-06-27 | Squarehead Technology As | Speech recognition method and system |
| CN107146614B (zh) * | 2017-04-10 | 2020-11-06 | 北京猎户星空科技有限公司 | Voice signal processing method, apparatus, and electronic device |
| CN107993671A (zh) * | 2017-12-04 | 2018-05-04 | 南京地平线机器人技术有限公司 | Sound processing method, apparatus, and electronic device |
| CN108346427A (zh) * | 2018-02-05 | 2018-07-31 | 广东小天才科技有限公司 | Speech recognition method, apparatus, device, and storage medium |
| CN111326152A (zh) * | 2018-12-17 | 2020-06-23 | 南京人工智能高等研究院有限公司 | Voice control method and device |
| CN110503957A (zh) * | 2019-08-30 | 2019-11-26 | 上海依图信息技术有限公司 | Speech recognition method and device based on image denoising |
| CN110570862A (zh) * | 2019-10-09 | 2019-12-13 | 三星电子(中国)研发中心 | Speech recognition method and intelligent speech engine device |
| CN111028842B (zh) * | 2019-12-10 | 2021-05-11 | 上海芯翌智能科技有限公司 | Method and device for triggering a voice interaction response |
| CN111048113B (zh) * | 2019-12-18 | 2023-07-28 | 腾讯科技(深圳)有限公司 | Sound direction localization processing method, apparatus, system, computer device, and storage medium |
2020
- 2020-09-30 CN CN202011065346.1A patent/CN114333831B/zh Active
2021
- 2021-09-17 EP EP21874269.0A patent/EP4207186B1/en Active
- 2021-09-17 US US18/247,212 patent/US20230386494A1/en Pending
- 2021-09-17 WO PCT/CN2021/118948 patent/WO2022068608A1/zh Ceased
Patent Citations (7)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
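| WO2017129239A1 (en) * | 2016-01-27 | 2017-08-03 | Nokia Technologies Oy | System and apparatus for tracking moving audio sources |
| CN106328156A (zh) * | 2016-08-22 | 2017-01-11 | 华南理工大学 | Microphone array speech enhancement system and method fusing audio and video information |
| CN106679651A (zh) * | 2017-02-08 | 2017-05-17 | 北京地平线信息技术有限公司 | Sound source localization method, apparatus, and electronic device |
| CN110389597A (zh) * | 2018-04-17 | 2019-10-29 | 北京京东尚科信息技术有限公司 | Camera adjustment method, apparatus, and system based on sound source localization |
| CN110858488A (zh) * | 2018-08-24 | 2020-03-03 | 阿里巴巴集团控股有限公司 | Voice activity detection method, apparatus, device, and storage medium |
| CN110082723A (zh) * | 2019-05-16 | 2019-08-02 | 浙江大华技术股份有限公司 | Sound source localization method, apparatus, device, and storage medium |
| CN110691196A (zh) * | 2019-10-30 | 2020-01-14 | 歌尔股份有限公司 | Sound source localization method for an audio device, and audio device |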
Non-Patent Citations (1)
| Title |
|---|
| See also references of EP4207186A4 |
Cited By (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN115174959A (zh) * | 2022-06-21 | 2022-10-11 | 咪咕文化科技有限公司 | Video 3D sound effect setting method and device |
| CN115174959B (zh) * | 2022-06-21 | 2024-01-30 | 咪咕文化科技有限公司 | Video 3D sound effect setting method and device |
Also Published As
| Publication number | Publication date |
|---|---|
| CN114333831B (zh) | 2026-01-02 |
| US20230386494A1 (en) | 2023-11-30 |
| EP4207186A4 (en) | 2024-01-24 |
| CN114333831A (zh) | 2022-04-12 |
| EP4207186A1 (en) | 2023-07-05 |
| EP4207186B1 (en) | 2025-07-30 |
Similar Documents
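| Publication | Publication Date | Title |
|---|---|---|
| WO2022068608A1 (zh) | | Signal processing method and electronic device |
| US11705135B2 (en) | | Detection of liveness |
| US11023755B2 (en) | | Detection of liveness |
| US11624800B1 (en) | | Beam rejection in multi-beam microphone systems |
| US11017252B2 (en) | | Detection of liveness |
| US10993025B1 (en) | | Attenuating undesired audio at an audio canceling device |
| CN110493690B (zh) | | Sound collection method and device |
| CN115831155B (zh) | | Audio signal processing method and apparatus, electronic device, and storage medium |
| CN113132863B (zh) | | Stereo sound pickup method and apparatus, terminal device, and computer-readable storage medium |
| US20160094910A1 (en) | | Directional audio capture |
| CN108766457B (zh) | | Audio signal processing method and apparatus, electronic device, and storage medium |
| CN107749925B (zh) | | Audio playback method and device |
| CN107210824A (zh) | | Environment switching for microphones |
| CN104349040A (zh) | | Camera base for use in a video conferencing system and method thereof |
| US20240422503A1 (en) | | Rendering based on loudspeaker orientation |
| CN110572600A (zh) | | Video recording processing method and electronic device |
| JPWO2020021861A1 (ja) | | Information processing device, information processing system, information processing method, and information processing program |
| CN115981173A (zh) | | Device control method, terminal device, and storage medium |
| CN113676593A (zh) | | Video recording method and apparatus, electronic device, and storage medium |
| CN119603627A (zh) | | Estimating user location in a system including smart audio devices |
| CN115035187B (zh) | | Sound source direction determination method and apparatus, terminal, storage medium, and product |
| CN110392334B (zh) | | Adaptive processing method, apparatus, and medium for microphone array audio signals |
| JP2024545571A (ja) | | Distributed audio device ducking |
| JP2022147989A (ja) | | Utterance control device, utterance control method, and utterance control program |
| US20230105785A1 (en) | | Video content providing method and video content providing device |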
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 21874269; Country of ref document: EP; Kind code of ref document: A1 |
| | WWE | Wipo information: entry into national phase | Ref document number: 18247212; Country of ref document: US |
| | ENP | Entry into the national phase | Ref document number: 2021874269; Country of ref document: EP; Effective date: 20230331 |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | WWG | Wipo information: grant in national office | Ref document number: 2021874269; Country of ref document: EP |