WO2024179425A1 - Voice interaction method and related device - Google Patents
Voice interaction method and related device
- Publication number
- WO2024179425A1 (PCT/CN2024/078662)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- audio signal
- user
- electronic device
- event
- raising
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/01—Input arrangements or combined input and output arrangements for interaction between user and computer
- G06F3/011—Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/01—Input arrangements or combined input and output arrangements for interaction between user and computer
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/01—Input arrangements or combined input and output arrangements for interaction between user and computer
- G06F3/017—Gesture based interaction, e.g. based on a set of recognized hand gestures
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/16—Sound input; Sound output
- G06F3/167—Audio in a user interface, e.g. using voice commands for navigating, audio feedback
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/78—Detection of presence or absence of voice signals
- G10L25/84—Detection of presence or absence of voice signals for discriminating voice from noise
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L2015/226—Procedures used during a speech recognition process, e.g. man-machine dialogue using non-speech characteristics
- G10L2015/227—Procedures used during a speech recognition process, e.g. man-machine dialogue using non-speech characteristics of the speaker; Human-factor methodology
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L2021/02082—Noise filtering the noise being echo, reverberation of the speech
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D30/00—Reducing energy consumption in communication networks
- Y02D30/70—Reducing energy consumption in communication networks in wireless communication networks
Definitions
- the present application relates to the field of artificial intelligence, and in particular to a voice interaction method and related equipment.
- There are currently two main ways of interacting with voice assistants.
- In the first, the user speaks a specific wake-up word (such as "Xiaoyi Xiaoyi"), and the smart terminal starts a voice conversation after recognizing the voice signal.
- This method raises privacy concerns in public places, and the interaction is relatively lengthy.
- In the second, the user performs a specific action, such as a pronounced wrist raise or a long press of a physical key (such as a power button or a sports shortcut key).
- The key operation requires more force and introduces delay, the restart/shutdown interface may be called up accidentally, and the interaction process is not concise enough.
- the present application provides a voice interaction method and related equipment, which can improve the sensitivity of detecting interactive operations.
- the present application provides a voice interaction method, applied to a first device, the method comprising:
- When the first device detects the first event, it obtains inertial measurement unit (IMU) data and illuminance data of the first device; determines whether the user has performed a first preset action based on the IMU data and illuminance data of the first device; if it is determined that the user has performed the first preset action, the first device activates the microphone of the first device and obtains a first audio signal collected by the microphone; determines whether the type of the first audio signal is an approaching human voice based on the first audio signal; if it is determined that the type of the first audio signal is an approaching human voice, the first device activates the voice assistant of the first device.
- the first preset action is a wrist raising action or a hand raising action.
- the event of the user approaching the first device can be detected more accurately through IMU data and illumination data.
- the solution of the present application is more sensitive and can achieve detection when the user naturally raises his wrist or hand.
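The two-stage gating described above can be sketched as follows. This is a minimal illustration, not the application's implementation: the names `voice_wake_pipeline`, `WakeDecision`, and the three callables are hypothetical, and the classifiers themselves are passed in as stubs because the text does not specify them.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class WakeDecision:
    microphone_started: bool
    assistant_started: bool


def voice_wake_pipeline(
    imu: Sequence[float],
    illuminance: Sequence[float],
    is_preset_action: Callable[[Sequence[float], Sequence[float]], bool],
    record_audio: Callable[[], Sequence[float]],
    is_approaching_voice: Callable[[Sequence[float]], bool],
) -> WakeDecision:
    """Run the gate once, after the first event has been detected."""
    # Stage 1: motion and light gate. The microphone stays off unless
    # the first preset action (wrist raise or hand raise) is detected.
    if not is_preset_action(imu, illuminance):
        return WakeDecision(microphone_started=False, assistant_started=False)

    # Stage 2: the microphone is started and the first audio signal is
    # classified. Only an approaching human voice starts the assistant.
    audio = record_audio()
    if not is_approaching_voice(audio):
        return WakeDecision(microphone_started=True, assistant_started=False)
    return WakeDecision(microphone_started=True, assistant_started=True)
```

Because the audio callable is only invoked after the first stage passes, a rejected motion never touches the microphone in this sketch.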
- the first event is a wrist-raising hardware interrupt event, a hand-raising hardware interrupt event of the first device, a key-pressing screen-lighting event of the first device, a hand-raising screen-lighting event of the first device, or a wrist-raising screen-lighting event of the first device.
- the method of the present application further includes:
- starting a voice assistant of the first device includes:
- the voice assistant of the first device is activated.
- the second preset action is a wrist-raising and holding action or a hand-raising and holding action.
- After starting the microphone of the first device, the first device continuously obtains the IMU data and illumination data it collects and determines, based on that data, whether the user has maintained the wrist-raising action or the hand-raising action. The voice assistant of the first device is started only when it is determined that the user has maintained the wrist-raising action or the hand-raising action and the type of the first audio signal is an approaching human voice, which helps reduce the false wake-up rate and interference.
- the method of the present application further includes:
- the first device turns off the microphone of the first device.
- the first device turns off the microphone, which can reduce the power consumption of the first device and also reduce the false wake-up rate of the voice assistant.
- the duration of collecting the IMU data and illumination data used to determine whether the user has performed a first preset action is a first preset duration
- the duration of collecting the IMU data and illumination data used to determine whether the user has performed a second preset action is a second preset duration, wherein the first preset duration is greater than the second preset duration
- By setting the second preset duration to be shorter than the first preset duration, it is not necessary to collect IMU data and illumination data over a longer period before determining whether the user has performed the second preset action, so that this determination can be made quickly.
- the IMU data and illuminance data used to determine whether the user has performed the second preset action can be interpolated respectively so that the equivalent time of the IMU data and illuminance data used to determine whether the user has performed the second preset action is consistent with the collection time of the IMU data and illuminance data used to determine whether the user has performed the first preset action, and then the same prediction model can be used to predict whether the user has performed the first preset action and the second preset action.
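The interpolation step above can be illustrated with a short sketch. It assumes plain linear interpolation onto evenly spaced points; the text does not say which interpolation method is used, and the function name `resample_linear` is hypothetical.

```python
from typing import List, Sequence


def resample_linear(samples: Sequence[float], target_len: int) -> List[float]:
    """Linearly interpolate `samples` onto `target_len` evenly spaced points."""
    if target_len < 1:
        raise ValueError("target_len must be positive")
    if not samples:
        raise ValueError("samples must not be empty")
    if len(samples) == 1 or target_len == 1:
        return [float(samples[0])] * target_len

    # Distance, in source-sample units, between two consecutive output points.
    step = (len(samples) - 1) / (target_len - 1)
    out: List[float] = []
    for i in range(target_len):
        pos = i * step
        lo = min(int(pos), len(samples) - 2)  # left neighbour index
        frac = pos - lo                       # weight of the right neighbour
        out.append(samples[lo] * (1.0 - frac) + samples[lo + 1] * frac)
    return out


# A short second-action window stretched to the length of the first-action window.
short_window = [0.0, 1.0, 2.0]
print(resample_linear(short_window, 5))  # [0.0, 0.5, 1.0, 1.5, 2.0]
```

Each IMU axis and the illuminance series would be resampled separately in this way, so that both windows present the same number of samples to the shared prediction model.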
- the first event is a wrist-raising hardware interrupt event, a hand-raising hardware interrupt event of the first device, a hand-raising screen-lighting event of the first device, or a wrist-raising screen-lighting event of the first device, and obtaining IMU data and illumination data of the first device includes:
- Start the ambient light sensor of the first device to collect illumination data of the environment of the first device while continuing to collect IMU data; obtain the cached IMU data related to the first event; and set the illumination data related to the first event to zero.
- the equivalent duration of the illuminance data can be made consistent with the collection duration of the IMU data.
- the data from the complete wrist-raising stage or hand-raising stage can be used to determine whether the user has performed the first preset action, thereby making the judgment result more accurate and improving the wake-up success rate and reducing the false wake-up rate in terms of user experience.
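The zero-filling described above can be sketched as below. The sketch assumes, purely for illustration, that the IMU and the ambient light sensor produce one sample per time step at the same rate; the function name `align_illuminance` is hypothetical.

```python
from typing import List, Sequence, Tuple


def align_illuminance(
    cached_imu: Sequence[float],
    live_imu: Sequence[float],
    live_lux: Sequence[float],
) -> Tuple[List[float], List[float]]:
    """Return IMU and illuminance series of equal length.

    The ambient light sensor is only started once the first event fires,
    so no illuminance exists for the cached IMU span; that span is zero.
    """
    if len(live_imu) != len(live_lux):
        raise ValueError("live IMU and illuminance must have the same length")
    imu = list(cached_imu) + list(live_imu)
    lux = [0.0] * len(cached_imu) + list(live_lux)
    return imu, lux


imu, lux = align_illuminance([1.0, 2.0], [3.0, 4.0, 5.0], [10.0, 11.0, 12.0])
print(lux)  # [0.0, 0.0, 10.0, 11.0, 12.0]
```

With both series covering the same span, the cached samples from the start of the wrist raise remain usable by the action classifier.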
- the duration of the first audio signal is less than or equal to 0.5 s, and determining whether the type of the first audio signal is an approaching human voice according to the first audio signal is implemented by a digital signal processor (DSP) of the first device.
- the purpose is to enable the text corresponding to the first audio signal to be echoed in real time when the voice assistant is subsequently started to recognize the first voice signal.
- the first audio signal is an audio signal that has not been processed by automatic gain control (AGC), noise reduction, dereverberation or compression, so the information indicating an approaching human voice is not lost.
- the sound type of the first audio signal is determined by using the original audio signal that has not been processed by AGC, noise reduction, dereverberation or compression, which can improve the accuracy of the sound type of the determined first audio signal.
- the first audio signal is collected by multiple microphones of the first device.
- There are wind noise signals in the audio signals collected by different microphones.
- the wind noise in the audio signal collected by one microphone can be used to reduce the wind noise in the audio signal collected by another microphone, and then the audio signal after wind noise processing can be used to predict the sound type, so as to obtain a more accurate prediction result.
- the method of the present application further includes:
- the first device determines the task to be executed based on the audio signal collected by the microphone of the first device; if the task to be executed is a sensitive task, the first device obtains the identity information of the user; after determining that the user is a target user based on the identity information of the user, the first device executes the task.
- the above-mentioned user is a user of the first device, and the target user is an owner of the first device.
- the above method can prevent a person who is not the first device owner from performing security-sensitive voice tasks through the first device, thereby ensuring the information security of the device of the first device owner.
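The owner check for sensitive tasks can be sketched as follows. Everything named here is hypothetical: the task names in `SENSITIVE_TASKS`, the `run_task` function, and the idea of comparing an identity string to an owner ID stand in for whatever identification the device actually uses, which the text does not specify.

```python
from typing import Callable, Set

# Hypothetical examples of security-sensitive tasks.
SENSITIVE_TASKS: Set[str] = {"make_payment", "unlock_door"}


def run_task(
    task: str,
    get_identity: Callable[[], str],
    owner_id: str,
    execute: Callable[[str], None],
) -> bool:
    """Execute `task`; sensitive tasks run only for the device owner.

    Returns True if the task was executed, False if it was refused.
    """
    if task in SENSITIVE_TASKS:
        # Identity information is obtained only when the task requires it.
        if get_identity() != owner_id:
            return False
    execute(task)
    return True
```

Non-sensitive tasks skip the identity lookup entirely in this sketch, so ordinary requests incur no extra step.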
- the method of the present application further includes:
- After starting the voice assistant, the first device displays the text corresponding to the collected audio signal in real time.
- an electronic device including:
- an acquisition unit configured to acquire IMU data and illumination data of the electronic device when a first event is detected
- a determination unit configured to determine whether the user has performed a first preset action according to IMU data and illumination data of the electronic device
- an activation unit configured to activate a microphone of the electronic device if it is determined that the user has performed the first preset action
- an acquisition unit configured to acquire a first audio signal collected by the microphone
- the determination unit is further used to determine whether the type of the first audio signal is an approaching human voice according to the first audio signal
- the activation unit is further used to activate the voice assistant of the electronic device if it is determined that the type of the first audio signal is an approaching human voice.
- the first event is a wrist-raising hardware interrupt event, a hand-raising hardware interrupt event of the electronic device, a key-pressing screen-lighting event of the electronic device, a hand-raising screen-lighting event of the electronic device, or a wrist-raising screen-lighting event of the electronic device.
- the acquisition unit is further configured to continuously acquire the IMU data and illumination data collected by the electronic device after the activation unit activates the microphone of the electronic device.
- the determination unit is further used to determine whether the user has performed a second preset action based on the IMU data and illumination data collected by the electronic device;
- the activation unit is specifically used for:
- the voice assistant of the electronic device is activated.
- the activation unit is further configured to:
- the microphone of the electronic device is turned off.
- the duration is a first preset duration
- the duration for collecting the IMU data and illumination data used to determine whether the user has performed the second preset action is a second preset duration, wherein the first preset duration is greater than the second preset duration
- the first event is a wrist-raising hardware interrupt event, a hand-raising hardware interrupt event of an electronic device, a hand-raising screen-lighting event of an electronic device, or a wrist-raising screen-lighting event of an electronic device.
- the acquisition unit is specifically used for:
- Start the ambient light sensor of the electronic device to collect illumination data of the environment of the electronic device while continuing to collect IMU data; obtain the cached IMU data related to the first event; and set the illumination data related to the first event to zero.
- a duration of the first audio signal is less than or equal to 0.5 s, and determining whether the type of the first audio signal is an approaching human voice based on the first audio signal is implemented by a digital signal processor (DSP) of the electronic device.
- the first audio signal is collected by multiple microphones of the electronic device.
- the determination unit is further configured to determine the task to be performed according to an audio signal collected by a microphone of the electronic device;
- the acquisition unit is further used to acquire the identity information of the user if the task to be executed is a sensitive task;
- Electronic equipment also includes:
- the execution unit is used to determine that the user is a target user based on the user's identity information and execute the task to be processed.
- the electronic device further includes:
- the display unit is used to display the text corresponding to the collected audio signal in real time after the voice assistant is started.
- the present application provides another voice interaction method, applied to a first device, the method comprising:
- When the first device detects the first event, the first device obtains the IMU data of the first device; determines whether the user has performed a first preset action based on the IMU data of the first device; if it is determined that the user has performed the first preset action, the first device activates the microphone of the first device and obtains a first audio signal collected by the microphone; determines whether the type of the first audio signal is an approaching human voice based on the first audio signal; if it is determined that the type of the first audio signal is an approaching human voice, the first device activates the voice assistant of the first device.
- the first event is a wrist raising event of the first device, a hand raising event of the first device, or a wrist turning event of the first device; and the wrist raising event of the first device, the hand raising event of the first device, and the wrist turning event of the first device are all obtained through the same event prediction model.
- the above event prediction model can be applied not only to the detection of wrist raising events, wrist turning events and hand raising events, but also to the detection of events in other applications, such as raise-wrist-to-speak and raise-wrist-to-light-screen applications, which are not limited here. Therefore, the event prediction model can be regarded as a public capability of the first device.
- By using this public capability to detect the first event, the first device does not need a neural network model trained specifically for detecting the first event, which reduces the workload and the computing power consumption of the first device. After the first event is detected, the first device further determines whether the user has performed the first preset action.
- If so, the microphone of the first device is activated to obtain the first audio signal collected by the microphone; whether the type of the first audio signal is an approaching human voice is determined according to the first audio signal; and if it is, the first device activates the voice assistant of the first device. This method helps reduce the probability of the voice assistant of the first device being falsely activated.
- the first device determines whether the user has performed a first preset action according to the IMU data of the first device, including:
- the first device calculates the posture information of the first device based on the IMU data of the first device; obtains the acceleration information of the first device from the IMU data of the first device; if the posture information of the first device is within a preset posture range, the acceleration information of the first device is within a preset acceleration range, and the first event duration is within a preset duration range, the first device determines that the user has performed a first preset action.
- This method is helpful to improve the accuracy of determining whether the user has performed the first preset action.
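The three range checks above can be sketched as below. All threshold values are hypothetical placeholders. The sketch also assumes that the posture is summarized as the tilt of the device's z axis away from the measured gravity direction, and that the acceleration information is the magnitude of the accelerometer vector; the text does not define either quantity this precisely.

```python
import math
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Thresholds:
    # Hypothetical preset ranges, chosen only for illustration.
    tilt_deg: Tuple[float, float] = (0.0, 45.0)
    accel_ms2: Tuple[float, float] = (8.0, 12.0)
    duration_s: Tuple[float, float] = (0.2, 1.5)


def accel_norm(ax: float, ay: float, az: float) -> float:
    """Magnitude of the acceleration vector in m/s^2."""
    return math.sqrt(ax * ax + ay * ay + az * az)


def tilt_from_accel(ax: float, ay: float, az: float) -> float:
    """Angle in degrees between the device z axis and the measured acceleration."""
    norm = accel_norm(ax, ay, az)
    if norm == 0.0:
        raise ValueError("zero acceleration vector has no direction")
    cos_tilt = max(-1.0, min(1.0, az / norm))  # clamp against rounding error
    return math.degrees(math.acos(cos_tilt))


def in_range(value: float, bounds: Tuple[float, float]) -> bool:
    return bounds[0] <= value <= bounds[1]


def is_first_preset_action(
    accel: Tuple[float, float, float],
    event_duration_s: float,
    th: Thresholds = Thresholds(),
) -> bool:
    """True only if posture, acceleration and event duration are all in range."""
    ax, ay, az = accel
    return (
        in_range(tilt_from_accel(ax, ay, az), th.tilt_deg)
        and in_range(accel_norm(ax, ay, az), th.accel_ms2)
        and in_range(event_duration_s, th.duration_s)
    )
```

For example, a device lying face up and nearly at rest, `(0.0, 0.0, 9.81)`, with a 0.5 s event passes all three checks under these placeholder ranges, while the same reading with the screen on edge, `(9.81, 0.0, 0.0)`, fails the posture check.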
- the method of the present application further includes:
- starting a voice assistant of the first device includes:
- the voice assistant of the first device is activated.
- the second preset action is a wrist-raising and holding action or a hand-raising and holding action.
- After starting the microphone of the first device, the first device continuously obtains the IMU data and illumination data it collects and determines, based on that data, whether the user has maintained the wrist-raising action or the hand-raising action. The voice assistant of the first device is started only when it is determined that the user has maintained the wrist-raising action or the hand-raising action and the type of the first audio signal is an approaching human voice, which helps reduce the false wake-up rate and interference.
- the method of the present application further includes:
- the first device turns off the microphone of the first device.
- the first device turns off the microphone, which can reduce the power consumption of the first device and also reduce the false wake-up rate of the voice assistant.
- the duration of the first audio signal is less than or equal to 0.5 s, and determining whether the type of the first audio signal is an approaching human voice according to the first audio signal is implemented by a digital signal processor (DSP) of the first device.
- the text corresponding to the first audio signal can be echoed in real time when the voice assistant is subsequently started to recognize the first voice signal.
- the first audio signal is an audio signal that has not been processed by AGC, noise reduction, dereverberation or compression, so the information indicating an approaching human voice is not lost.
- the sound type of the first audio signal is determined by using the original audio signal that has not been processed by AGC, noise reduction, dereverberation or compression, which can improve the accuracy of the sound type of the determined first audio signal.
- the first audio signal is collected by multiple microphones of the first device.
- There are wind noise signals in the audio signals collected by different microphones.
- the wind noise in the audio signal collected by one microphone can be used to reduce the wind noise in the audio signal collected by another microphone, and then the audio signal after wind noise processing can be used to predict the sound type, so as to obtain a more accurate prediction result.
- the method of the present application further includes:
- the first device determines the task to be executed based on the audio signal collected by the microphone of the first device; if the task to be executed is a sensitive task, the first device obtains the identity information of the user; after determining that the user is a target user based on the identity information of the user, the first device executes the task.
- the above-mentioned user is a user of the first device, and the target user is an owner of the first device.
- the above method can prevent a person who is not the first device owner from performing security-sensitive voice tasks through the first device, thereby ensuring the information security of the device of the first device owner.
- the method of the present application further includes:
- After starting the voice assistant, the first device displays the text corresponding to the collected audio signal in real time.
- the present application provides another electronic device, including:
- An acquisition unit configured to acquire IMU data of the first device when a first event is detected
- a determination unit configured to determine whether the user has performed a first preset action according to the IMU data of the first device
- the activation unit is used to activate the microphone of the first device if it is determined that the user has performed the first preset action, and the acquisition unit is further used to acquire the first audio signal collected by the microphone;
- the determination unit is further used to determine whether the type of the first audio signal is an approaching human voice according to the first audio signal
- the activation unit is further used to activate the voice assistant of the first device if it is determined that the type of the first audio signal is an approaching human voice.
- the first event is a wrist raising event of the first device, a hand raising event of the first device, or a wrist turning event of the first device; and the wrist raising event of the first device, the hand raising event of the first device, and the wrist turning event of the first device are all obtained through the same event prediction model.
- in determining whether the user has performed the first preset action according to the IMU data of the first device, the determination unit is specifically configured as follows:
- the first device calculates the posture information of the first device based on the IMU data of the first device; obtains the acceleration information of the first device from the IMU data of the first device; if the posture information of the first device is within a preset posture range, the acceleration information of the first device is within a preset acceleration range, and the first event duration is within a preset duration range, the first device determines that the user has performed a first preset action.
- the acquisition unit is further configured to continuously acquire the IMU data and illumination data collected by the electronic device after the activation unit activates the microphone of the electronic device.
- the determination unit is further used to determine whether the user has performed a second preset action based on the IMU data and illumination data collected by the electronic device;
- the activation unit is specifically used for:
- the voice assistant of the electronic device is activated.
- the duration of the first audio signal is less than or equal to 0.5 s, and determining whether the type of the first audio signal is an approaching human voice according to the first audio signal is implemented by a DSP of the electronic device.
- the first audio signal is collected by multiple microphones of the electronic device.
- the determination unit is further configured to determine the task to be performed according to an audio signal collected by a microphone of the electronic device;
- the acquisition unit is further used to acquire the identity information of the user if the task to be executed is a sensitive task;
- Electronic equipment also includes:
- the execution unit is used to determine that the user is a target user based on the user's identity information and execute the task to be processed.
- the electronic device further includes:
- the display unit is used to display the text corresponding to the collected audio signal in real time after the voice assistant is started.
- the present application provides another electronic device, including a processor and a memory.
- the memory is used to store program code.
- the processor is used to call the program code stored in the memory to execute the method provided by the first aspect, the third aspect, any possible implementation of the first aspect, or any possible implementation of the third aspect.
- the present application provides a computer storage medium comprising computer instructions, which, when executed on an electronic device, enables the electronic device to execute a method as provided in any possible implementation of the first aspect.
- the present application provides a computer program product, which, when executed on a computer, enables the computer to execute a method as provided in the first aspect, the third aspect, any possible implementation of the first aspect, or any possible implementation of the third aspect.
- the electronic device described in the second aspect, the electronic device described in the third aspect, the computer storage medium described in the fourth aspect, or the computer program product described in the fifth aspect provided above are all used to execute the method provided in the first aspect, the third aspect, any possible implementation of the first aspect, or the method provided in any possible implementation of the third aspect. Therefore, the beneficial effects that can be achieved can refer to the beneficial effects in the corresponding method, which will not be repeated here.
- FIG. 1a is a schematic diagram of a scenario of a voice interaction method provided in an embodiment of the present application.
- FIG. 1b is a schematic diagram of another scenario of a voice interaction method provided in an embodiment of the present application.
- FIG. 1c is a schematic diagram of another scenario of a voice interaction method provided in an embodiment of the present application.
- FIG. 1d is a schematic diagram of another scenario of a voice interaction method provided in an embodiment of the present application.
- FIG. 2 is a flow chart of a voice interaction method provided in an embodiment of the present application.
- FIG. 3 is a schematic diagram showing changes in illumination data and IMU data during wrist lifting.
- FIG. 4 is a schematic diagram of the audio signal strength collected by the wrist-worn device within the deflection angle range provided by an embodiment of the present application.
- FIG. 5 is a schematic diagram of the strength of audio signals collected by the main microphone and the auxiliary microphone provided in an embodiment of the present application.
- FIG. 6 is a schematic diagram of a preset area provided in an embodiment of the present application.
- FIG. 7 is a schematic diagram of lip height and lip width provided in an embodiment of the present application.
- FIG. 8 is a schematic diagram of the structure of an electronic device provided in an embodiment of the present application.
- FIG. 8a is a schematic diagram of the structure of another electronic device provided in an embodiment of the present application.
- FIG. 9 is a schematic diagram of the structure of another electronic device provided in an embodiment of the present application.
- Multiple means two or more.
- "And/or" describes the relationship between related objects and indicates that three relationships are possible. For example, A and/or B can mean: A exists alone, A and B exist at the same time, or B exists alone. The character "/" generally indicates that the related objects are in an "or" relationship.
- Intelligent personal assistant is a personal virtual assistant driven by artificial intelligence, which can also be called intelligent assistant, voice assistant, etc. It allows users to easily operate various functions of smart terminals (such as smartphones, smart watches, tablets, laptops, AI speakers, smart large screens and smart cockpits) or conduct Internet searches through voice interaction, such as setting alarms on smartphones, reading emails using text-to-speech technology, playing and searching music, and sending text messages.
- Specific intelligent personal assistants include Huawei's Xiaoyi (the corresponding overseas version is Celia), Apple's Siri, Amazon's Alexa, Microsoft's Cortana and Google's Google Assistant.
- A wrist wearable device (wrist-worn device) is a smart device worn on the user's wrist. Depending on whether it has a screen, the size of the screen, the computing power of the application processor, and the degree of intelligence (such as whether it can download and install apps on its own and whether it supports voice assistants), wrist wearable devices can be divided into smart watches, sports watches, and smart bracelets.
- IMU: inertial measurement unit.
- IMU is a device that measures the angular velocity and acceleration of a rigid object in three-dimensional space. It is generally used to identify and track the posture and movement of a device, such as the tilt angle of the device relative to the horizontal plane, the orientation of the device (i.e. the deflection angle from the geomagnetic north pole), the cumulative rotation angle of the device, the relative motion speed and displacement, etc.
- IMU wrist-raise hardware interrupt: the low-power primary wrist-raise detection implemented inside the IMU generates a hardware interrupt once a wrist-raise event is detected.
- The ambient light sensor senses the surrounding light conditions, collects environmental illumination data, and informs the processing chip so that the screen brightness can be adjusted automatically and the power consumption of the smart terminal reduced.
- FIG. 1a is a schematic diagram of an application scenario of a voice interaction method provided in an embodiment of the present application. As shown in Figure 1a:
- When the screen of the wrist-worn device is off, the user directly and naturally moves the wrist close to the lips.
- The user speaks the voice task.
- The microphone of the wrist-worn device collects the user's audio signal.
- The voice assistant of the wrist-worn device recognizes the voice task based on the audio signal and responds to the voice task.
- FIG. 1b is a schematic diagram of an application scenario of another voice interaction method provided in an embodiment of the present application. As shown in Figure 1b:
- When the screen of the wrist-worn device is on, for example while the user is reading information on the screen or operating the wrist-worn device, the user raises his wrist so that his lips face the screen of the wrist-worn device and are close to it. The user then speaks a voice task, such as "Today's weather", and the wrist-worn device collects the user's audio signal.
- the voice assistant of the wrist-worn device recognizes the voice task based on the audio signal and responds to the voice task, for example, the wrist-worn device outputs "Today's weather is fine".
- FIG. 1c is a schematic diagram of an application scenario of another voice interaction method provided in an embodiment of the present application. As shown in Figure 1c:
- When the phone screen is off, the user presses the power button to turn on the screen, picks up the phone from the desktop, holds the bottom of the phone close to the mouth, and then speaks the voice task.
- the phone collects the user's audio signal, and the phone's voice assistant recognizes the voice task based on the audio signal and responds to the voice task.
- FIG. 1d is a schematic diagram of an application scenario of a voice interaction method provided in an embodiment of the present application. As shown in Figure 1d:
- the user's face is facing the large-screen device, either directly facing the large screen, or tilted to the left at a preset angle, or tilted to the right at a preset angle.
- the user then speaks a voice task, such as "Today's weather."
- the wrist-worn device collects the user's audio signal, recognizes the voice task based on the audio signal, and responds to the voice task. For example, the wrist-worn device outputs "Today's weather is fine."
- The voice interaction method of the present application can be applied not only to the above four scenarios but also to other scenarios, which are not limited here.
- Figure 2 is a flow chart of a voice interaction method provided in an embodiment of the present application. As shown in Figure 2, the method includes:
- the first device may be a wrist-worn device of the user or a user terminal device
- the user terminal device may be a smart phone, a tablet computer, etc.
- the first event is a wrist-raising hardware interrupt event of the first device, a hand-raising hardware interrupt event of the first device, a key-pressing screen-lighting event of the first device, a hand-raising screen-lighting event of the first device, or a wrist-raising screen-lighting event of the first device.
- The wrist-raising hardware interrupt event and the hand-raising hardware interrupt event of the first device are generated by the low-power primary wrist-raising detection implemented inside the IMU.
- When a small degree of wrist-raising, wrist-turning, or hand-raising behavior is detected (for example, based on the acceleration data collected by the IMU of the first device, the behavior is determined to be a wrist-raising, wrist-turning, or hand-raising behavior if the acceleration exceeds 30% of the gravitational acceleration (9.8 m/s²)),
- the IMU of the first device generates a hand-raising hardware interrupt or a wrist-raising hardware interrupt.
- This interrupt event can be subscribed to by other software modules to trigger the start-up of other software.
- the wrist-raising screen-lighting software is started to perform another wrist-raising or wrist-turning behavior detection to achieve an accurate and reliable wrist-raising screen-lighting function (for example, to achieve the highest possible success rate and the lowest possible false touch rate).
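- The coarse detection and the interrupt subscription described above can be sketched as follows. This is only an illustration: the text does not say whether the 30% threshold applies to the raw or the gravity-compensated acceleration, so the sketch assumes gravity has already been removed, and `InterruptBus` is a hypothetical stand-in for whatever subscription mechanism the device actually provides.

```python
import math

G = 9.8                  # gravitational acceleration in m/s^2, as given in the text
THRESHOLD_RATIO = 0.30   # 30% of g, the example value given in the text


def primary_raise_detected(linear_accel, ratio=THRESHOLD_RATIO):
    """Return True if any sample's magnitude exceeds ratio * g.

    linear_accel: iterable of (x, y, z) tuples in m/s^2. Assumption of this
    sketch: gravity has already been subtracted from the samples.
    """
    limit = ratio * G
    return any(math.sqrt(x * x + y * y + z * z) > limit
               for x, y, z in linear_accel)


class InterruptBus:
    """Hypothetical model of an interrupt that other modules subscribe to."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, callback):
        self._subscribers.append(callback)

    def feed(self, linear_accel):
        # Notify every subscriber once when the coarse check passes.
        if primary_raise_detected(linear_accel):
            for callback in self._subscribers:
                callback("wrist_raise_interrupt")
            return True
        return False
```

- In this sketch, the screen-lighting software would be one subscriber, and it would then run its own, more accurate detection on the buffered IMU data.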
- the key-press screen-on event of the first device refers to the key-press screen-on event generated when the first device detects the user's operation on the key of the first device (such as the power button, volume + button, volume - button). This event can be subscribed by other software modules to trigger other software to start running, such as rendering software starting interface rendering calculations, and then submitting the generated picture information to the screen for display. It should be understood that the key-press screen-on event of the first device is generated when the first device determines that the conditions for lighting up the screen of the first device are met, and at this time the first device has not yet controlled the lighting of the screen of the first device.
- the wrist-raising event or hand-raising event of the first device to light up the screen refers to the wrist-raising event or hand-raising event detected by the first device in which the user attempts to light up the screen.
- the wrist-raising event or hand-raising event can be determined by the first device based on the IMU data collected by the IMU. It should be understood that the wrist-raising event or hand-raising event to light up the screen of the first device is generated when the first device determines that the conditions for lighting up the screen of the first device have been met.
- Here, the first device detecting the wrist-raising event or hand-raising event in which the user tries to light up the screen means that the wrist-raising or hand-raising screen-lighting software of the first device detects that event.
- The first event may have been triggered accidentally by the user, or the user may have caused the first device to detect the first event for some other purpose.
- the IMU data of the wrist-worn device will also change, which may trigger a wrist-raising hardware interrupt event or a wrist-raising screen-lighting event, but the user does not want to raise his wrist to interact with the wrist-worn device by voice.
- the IMU data of the mobile phone will change, which may trigger a hand-raising interrupt event or a hand-raising screen-lighting event.
- the user may turn on the screen of the smartphone through the power button of the smartphone, which will trigger the hand-raising screen-lighting event of the mobile phone, but the user does not want to raise his hand to interact with the phone by voice.
- the first device obtains the IMU data and the illuminance data of the first device, and determines whether the user has performed the first preset action through the IMU data and the illuminance data of the first device.
- When the first event is a button-pressing screen-lighting event of the first device, the first device obtaining the IMU data and (ambient) illumination data of the first device includes: the first device starts the ambient light sensor to collect illumination data and starts the IMU to collect IMU data, and obtains the collected illumination data and IMU data, wherein the collection time of the illumination data is the same as the collection time of the IMU data.
- the first device obtains IMU data and illumination data of the first device, including:
- the ambient light sensor of the first device collects the (ambient) illumination data of the first device while the IMU continues to collect IMU data; the first device obtains the cached IMU data related to the first event and sets the illumination data related to the first event to zero (because the ambient light sensor had not been started during the first event).
- Setting this data to zero is reasonable because the illumination in the lower space is generally lower.
- the detection of the wrist-raising hardware interrupt event of the first device, the hand-raising hardware interrupt event of the first device, the wrist-raising screen-lighting event of the first device, or the hand-raising screen-lighting event of the first device utilizes the IMU data of the first device, that is, before the wrist-raising hardware interrupt event of the first device, the hand-raising hardware interrupt event of the first device, the wrist-raising screen-lighting event of the first device, or the hand-raising screen-lighting event of the first device is detected, the IMU of the first device has been started and the IMU data has been collected.
- the IMU data used to determine whether the user has performed the first preset action can be regarded as two parts, one part is collected before the first event is detected, and this part of data can be called IMU data related to the first event, and the other part is collected when the first event is detected.
- Together, the two parts form the IMU data of one complete wrist-raising or hand-raising action, which allows a more accurate determination of whether the user has performed the first preset action.
- Because the collection time of the illuminance data used to determine whether the user has performed the first preset action needs to be equal to the collection time of the IMU data, the illuminance data is zero-padded before the illuminance data and the IMU data are used for that determination; the data added by zero-padding is the illuminance data related to the first event.
- The equivalent duration of the zero-padded illuminance data is equal to the collection time of the IMU data.
- the collection time of the IMU data is the first preset time.
- A sensor collects a certain amount of data after working for a period of time.
- The amount of data can therefore be characterized by a duration.
- The equivalent duration here refers to the size of the data amount.
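- The zero-padding step can be illustrated with a minimal sketch. It assumes, for simplicity, that the IMU and the ambient light sensor produce the same number of samples per unit time, so that "equal duration" means "equal sample count"; the function name `zero_pad_illuminance` is hypothetical.

```python
import numpy as np


def zero_pad_illuminance(lux_after_event, n_imu_samples):
    """Front-pad illuminance samples with zeros to match the IMU sample count.

    The zeros stand in for the period before the first event, when the
    ambient light sensor had not yet been started.
    """
    lux = np.asarray(lux_after_event, dtype=float)
    missing = n_imu_samples - lux.size
    if missing < 0:
        raise ValueError("more illuminance samples than IMU samples")
    return np.concatenate([np.zeros(missing), lux])
```

- For example, three illuminance samples padded to five IMU samples become two leading zeros followed by the three measured values.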
- the first device determines whether the user has performed a first preset action based on the IMU data and illumination data of the first device.
- the first preset action may be a wrist raising action or a hand raising action.
- the wrist-worn device moves with the user's wrist, producing specific posture movements (for example, the inclination angle of the dial relative to the horizontal plane changes so that the dial faces the user's lips) and position movements (for example, the device rises from the lower space to the upper space and its distance from the vertical plane of the user's body decreases); the corresponding change in acceleration is shown in the curve on the right of Figure 3.
- the illuminance value collected by the ambient light sensor of the handheld device will also tend to rise.
- the handheld device will also produce specific posture movements (such as the inclination angle of the mobile phone screen to the horizontal plane becomes close to zero degrees, so that the microphone at the bottom of the mobile phone is facing the user's lips) and position movements.
- the right figure of Figure 3 shows the acceleration of the wrist-worn device in the x-axis, y-axis and z-axis directions in the local coordinate system when the user raises his wrist to bring the wrist-worn device close to his lips.
- the origin of the local coordinate system is the center of gravity of the wrist-worn device, and the x-axis points to the surface of the wrist-worn device.
- the y-axis points to the 3 o'clock direction of the dial, and the z-axis points to the top of the dial.
- the first device will produce specific posture movements and position movements when approaching the user's lips. Therefore, it is possible to determine whether the user has performed a wrist-raising action or a hand-raising action based on the illumination data and IMU data of the wrist-worn device.
- a prediction model is pre-trained to predict whether the user has performed a wrist-raising action or a hand-raising action.
- the prediction model can be implemented based on a decision tree or a convolutional neural network.
- the same prediction model can be used for prediction, or different prediction models can be used for prediction.
- the training data includes acceleration data and illuminance data collected for a fixed time (e.g., 0.5s) and their derivative data, such as the maximum value, mean, skewness, kurtosis, histogram, sliding average sequence generated according to a certain window size, and amplitude value sequence of the acceleration vector of the acceleration data.
- the training data is collected for different groups of users, including the elderly and the youth, the male and the female, and/or the left-hand wearing/holding group and the right-hand wearing/holding group.
- the illuminance data in the training data may also include data collected in different environments, including an outdoor sunny environment, an indoor normal lighting environment, and a dark light environment.
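- The derived features listed above can be computed along the following lines. This is a sketch under assumptions: the window size, the number of histogram bins, and the choice to compute the statistics on the magnitude of the acceleration vector are illustrative and are not specified in the text.

```python
import numpy as np


def window_features(accel, window=5, bins=8):
    """Derive simple statistics from one fixed-length acceleration window.

    accel: array-like of shape (n, 3) with x, y, z acceleration samples.
    """
    accel = np.asarray(accel, dtype=float)
    magnitude = np.linalg.norm(accel, axis=1)   # amplitude sequence of the vector
    mean = magnitude.mean()
    std = magnitude.std()
    centered = magnitude - mean
    if std > 0:
        skewness = float(np.mean(centered ** 3) / std ** 3)
        kurtosis = float(np.mean(centered ** 4) / std ** 4)  # non-excess form
    else:
        skewness, kurtosis = 0.0, 0.0           # constant signal
    histogram, _ = np.histogram(magnitude, bins=bins)
    # Sliding average over `window` consecutive samples.
    sliding = np.convolve(magnitude, np.ones(window) / window, mode="valid")
    return {
        "max": float(magnitude.max()),
        "mean": float(mean),
        "skewness": skewness,
        "kurtosis": kurtosis,
        "histogram": histogram,
        "sliding_average": sliding,
        "magnitude": magnitude,
    }
```

- The same function could be applied to the illuminance sequence by passing it as a single-column array.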
- the first device inputs the IMU data and the illumination data into the first prediction model for prediction, and a first prediction result can be obtained, for example, the first prediction result is a wrist-raising action.
- the first prediction model can be a four-classification network, and the first device inputs the IMU data and the illumination data into the first prediction model for prediction.
- the first prediction model outputs four probabilities, namely, the probability of wrist-raising action, the probability of wrist-dropping action, the probability of wrist-raising and holding action, and the probability of free hand position; the first prediction result is the action corresponding to the maximum probability, such as wrist-raising action.
- the free hand position here refers to actions other than wrist-raising action, wrist-dropping action, and wrist-raising and holding action.
- the first prediction model can also be a two-classification network, and the first device inputs the IMU data and the illumination data into the first prediction model for prediction.
- the first prediction model outputs two probabilities, namely, the probability of wrist-raising action and the probability of other actions, and the first prediction result is the action corresponding to the maximum probability; the other actions here refer to actions other than wrist-raising action.
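- Selecting the prediction result from the model output amounts to taking the class with the maximum probability. A minimal sketch follows; the label strings are illustrative names for the four classes listed above, not identifiers from the application.

```python
WRIST_LABELS = (
    "wrist_raise",
    "wrist_drop",
    "wrist_raise_and_hold",
    "free_hand_position",
)


def pick_action(probabilities, labels=WRIST_LABELS):
    """Return the (label, probability) pair with the maximum probability."""
    if len(probabilities) != len(labels):
        raise ValueError("one probability is required per label")
    best = max(range(len(labels)), key=lambda i: probabilities[i])
    return labels[best], probabilities[best]
```

- The same helper covers the two-classification case by passing two labels, for example `("wrist_raise", "other")`.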
- the wrist-raising and holding action means that after the user raises his wrist, the wrist maintains a certain posture for a period of time (for example, the dial is facing oneself, the pitch angle with the horizontal plane is within plus or minus 15 degrees, and the roll angle is 30 to 60 degrees) and the illumination remains stable (for example, the multiple of the maximum/minimum illumination value is less than 2), and there is no drop.
- If the IMU data and illumination data collected after the user raises his wrist are within the preset range, the first device determines that the user has performed the wrist-raising and holding action.
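- The range check for the wrist-raising and holding action can be sketched with the example values given above (pitch within plus or minus 15 degrees, roll between 30 and 60 degrees, maximum/minimum illumination ratio below 2). The function name and the way pitch and roll are supplied are assumptions; the "no drop" condition is not modeled here.

```python
def is_raise_and_hold(pitch_deg, roll_deg, lux,
                      pitch_limit=15.0, roll_range=(30.0, 60.0),
                      max_lux_ratio=2.0):
    """Check whether posture and illumination stayed within the preset range.

    pitch_deg, roll_deg, lux: non-empty sequences sampled after the wrist raise.
    """
    pitch_ok = all(abs(p) <= pitch_limit for p in pitch_deg)
    roll_ok = all(roll_range[0] <= r <= roll_range[1] for r in roll_deg)
    low, high = min(lux), max(lux)
    # Constant illumination counts as stable; otherwise compare max/min.
    lux_ok = high == low or (low > 0 and high / low < max_lux_ratio)
    return pitch_ok and roll_ok and lux_ok
```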
- the first device inputs the IMU data and the illumination data into the second prediction model for prediction
- a second prediction result can be obtained, for example, the second prediction result is a hand-raising action.
- the second prediction model can be a four-classification network, and the first device inputs the IMU data and the illumination data into the second prediction model for prediction.
- the second prediction model outputs four probabilities, namely, the probability of the hand-raising action, the probability of the arm-dropping action, the probability of the hand-raising and holding action, and the probability of the free hand position.
- the second prediction result is the action corresponding to the maximum probability, such as the hand-raising action.
- the free hand position here refers to the action other than the hand-raising action, the arm-dropping action, and the hand-raising and holding action.
- the second prediction model can also be a two-classification network, and the first device inputs the IMU data and the illumination data into the second prediction model for prediction.
- the second prediction model outputs two probabilities, namely, the probability of the hand-raising action and the probability of other actions.
- the second prediction result is the action corresponding to the maximum probability; the other actions here refer to actions other than the hand-raising action.
- the training data of the first prediction model and the second prediction model can be found in the above-mentioned related descriptions and will not be described again here.
- the first device activates a microphone of the first device and obtains a first audio signal collected by the microphone.
- the first audio signal is collected by multiple microphones of the first device, that is, if it is determined that the user has performed the first preset action, the first device activates multiple microphones of the first device, such as a main microphone and a secondary microphone.
- the audio signal collected by a single microphone has wind noise, and the accuracy of predicting the sound type of the audio signal based on the audio signal collected by a single microphone is low.
- both the audio signals collected by the main microphone and the audio signals collected by the auxiliary microphone may have wind noise, but the wind noise intensities in the audio signals collected by the two are different.
- the first device can use the wind noise in one of the audio signals to process the wind noise in the other audio signal.
- the wind noise in the audio signal collected by the auxiliary microphone can be used to reduce the wind noise in the audio signal collected by the main microphone, and then the audio signal after the wind noise processing can be used to predict the sound type, so as to obtain a more accurate prediction result.
- After the microphone of the first device collects the first audio signal, the first audio signal is transmitted to the DSP of the first device.
- the DSP implements basic voice activity detection (VAD) in hardware; when the energy of the first audio signal exceeds a preset energy threshold, the DSP of the first device generates a hardware VAD interrupt, triggering the software algorithm in the DSP to run, that is, to determine whether the type of the first audio signal is an approaching human voice.
- VAD: voice activity detection.
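- The energy-threshold trigger can be modeled in software as shown below. In the application the basic VAD is implemented in DSP hardware, so this is only a behavioral sketch; the mean-square energy measure, the threshold value, and the `classify` callback are assumptions.

```python
import numpy as np


def energy_vad(frame, threshold):
    """Return True when the mean-square energy of the frame exceeds the threshold."""
    frame = np.asarray(frame, dtype=float)
    return float(np.mean(frame ** 2)) > threshold


def on_audio(frame, threshold, classify):
    """Run the (more expensive) sound-type classifier only when VAD fires.

    classify: hypothetical callable standing in for the software algorithm
    that decides whether the signal is an approaching human voice.
    """
    if not energy_vad(frame, threshold):
        return None
    return classify(frame)
```

- Gating the classifier this way keeps the heavier computation idle while the input is quiet, which matches the low-power intent of the hardware VAD interrupt.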
- the first device determines, according to the first audio signal, whether the type of the first audio signal is an approaching human voice.
- the first device inputs the first audio signal into the third prediction model for prediction, and obtains a third prediction result.
- the third prediction result is an approaching human voice.
- the third prediction model can be a four-classification network.
- the first device inputs the first audio signal into the third prediction model for prediction.
- the third prediction model outputs four probabilities, namely, the probability of an approaching human voice, the probability of a nearby human voice, the probability of a distant human voice, and the probability of a non-human voice.
- the third prediction result is the sound type corresponding to the maximum probability, such as an approaching human voice.
- the third prediction model can also be a two-classification network.
- the first device inputs the first audio signal into the third prediction model for prediction.
- the third prediction model outputs two probabilities, namely, the probability of an approaching human voice and the probability of a non-approaching human voice.
- the third prediction result is the sound type corresponding to the maximum probability.
- the first device extracts features of the first audio signal through the third prediction model to obtain audio features of the first audio signal.
- the audio features include at least one of energy features, time domain features, frequency domain features, music theory features, and perception features.
- The strategy for extracting audio features is set according to the device scenario and the user scenario, specifically the scenario in which the user raises his wrist and brings the wrist-worn device close to his mouth to speak a voice task. It takes into account the high- and low-frequency distribution characteristics of the human voice after it leaves the mouth, that is, how the energy of the audio signal attenuates with the deflection angle relative to the direction directly facing the lips. As shown in Figure 4, the energy attenuation is largest behind the user's head, and the higher the frequency, the greater the attenuation. When the lips are close to the wrist-worn device, for example 3-5 cm from its screen, the deflection angle of the microphone of the wrist-worn device is 15-55 degrees.
- In addition, when the lips are close to the screen, specific reflection and reverberation characteristics are produced, and the microphone of the wrist-worn device is located at the edge of the wrist-worn device (for example, at 11 o'clock).
- HLBR: high-low band ratio.
- FFT: fast Fourier transform.
- SRMR: speech-to-reverberation modulation energy ratio.
- the main microphone of the handheld device can collect strong low-frequency breath wind noise; and for sports wind noise or environmental wind noise, the main microphone and the auxiliary microphone can collect strong low-frequency wind noise.
- the specific preset strategy can be the HLBR features of the main and auxiliary microphones, such as the high-low frequency energy ratio bounded by 200Hz (linear frequency or Mel scale frequency), 4 coefficients of the third-order polynomial fitting of the 128-point FFT, 2 coefficients of the linear regression fitting, or dual-channel MFCCs.
- the first device extracts the audio feature set value of the collected audio signal according to the preset strategy, specifically: calculates the collected audio data according to the preset strategy to generate specific data of each audio feature.
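- Two of the listed features can be sketched as follows: the high-low band energy ratio with a 200 Hz boundary on a linear frequency scale, and the polynomial fits of a 128-point FFT. The text does not state which quantity the polynomials are fitted to; the sketch assumes the log-power spectrum over a normalized frequency axis, and the function names are hypothetical.

```python
import numpy as np


def hlbr(signal, sample_rate, split_hz=200.0):
    """High-low band energy ratio: energy at or above split_hz over energy below it."""
    signal = np.asarray(signal, dtype=float)
    power = np.abs(np.fft.rfft(signal)) ** 2
    freqs = np.fft.rfftfreq(signal.size, d=1.0 / sample_rate)
    low = power[freqs < split_hz].sum()
    high = power[freqs >= split_hz].sum()
    return float(high / (low + 1e-12))   # small constant avoids division by zero


def spectral_fit_coeffs(signal, n_fft=128):
    """Fit a cubic (4 coefficients) and a line (2 coefficients) to the spectrum.

    Only the first n_fft samples are used, because rfft truncates longer input.
    """
    signal = np.asarray(signal, dtype=float)
    magnitude = np.abs(np.fft.rfft(signal, n=n_fft))
    log_power = 10.0 * np.log10(magnitude ** 2 + 1e-12)
    x = np.linspace(0.0, 1.0, log_power.size)
    cubic = np.polyfit(x, log_power, 3)
    linear = np.polyfit(x, log_power, 1)
    return cubic, linear
```

- For a dual-microphone device, the same functions would be applied to the main and auxiliary channels separately, and the results concatenated into one feature set.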
- the calculation method of MFCCs value is:
- the collected audio signal is divided into frames, generally 25ms as one frame, and the frame overlap is 10ms; the periodogram method is used to estimate the power spectrum of each frame of audio signal; the estimated power spectrum is filtered by 85 Mel filters, and the energy in each filter is calculated; the logarithm of the energy in each filter is taken to obtain 85 logarithmic results; the 85 logarithmic results are subjected to discrete cosine transform (DCT), and finally the 2nd to 85th coefficients of DCT are retained, the first coefficient is removed, and finally 84 coefficients are obtained, which are 84 Mel-frequency cepstrum coefficients.
- DCT: discrete cosine transform.
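- The MFCC steps described above can be sketched with NumPy alone. Several details are assumptions of this sketch rather than statements of the application: a 16 kHz sample rate, a 512-point FFT, triangular Mel filters spanning 0 Hz to the Nyquist frequency, an unnormalized DCT-II, and a literal reading of "frame overlap is 10 ms" (hop of 15 ms).

```python
import numpy as np


def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters whose edges are evenly spaced on the Mel scale."""
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)
    edges = mel_to_hz(np.linspace(0.0, hz_to_mel(sample_rate / 2.0), n_filters + 2))
    bank = np.zeros((n_filters, freqs.size))
    for m in range(n_filters):
        left, center, right = edges[m], edges[m + 1], edges[m + 2]
        rising = (freqs - left) / (center - left)
        falling = (right - freqs) / (right - center)
        bank[m] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank


def mfcc(signal, sample_rate=16000, frame_ms=25.0, overlap_ms=10.0,
         n_filters=85, n_fft=512):
    """Return an array of shape (n_frames, n_filters - 1) of cepstral coefficients."""
    signal = np.asarray(signal, dtype=float)
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    hop = frame_len - int(round(sample_rate * overlap_ms / 1000.0))
    n_frames = 1 + (signal.size - frame_len) // hop
    if n_frames < 1:
        raise ValueError("signal is shorter than one frame")
    index = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[index]
    # Periodogram estimate of the power spectrum of each frame.
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2 / frame_len
    energies = power @ mel_filterbank(n_filters, n_fft, sample_rate).T
    log_energies = np.log(energies + 1e-10)      # small constant avoids log(0)
    n = np.arange(n_filters)
    k = np.arange(n_filters)[:, None]
    dct_matrix = np.cos(np.pi * k * (2 * n + 1) / (2 * n_filters))
    coeffs = log_energies @ dct_matrix.T
    return coeffs[:, 1:]                          # drop the first coefficient
```

- With 85 filters, dropping the first DCT coefficient leaves the 84 coefficients per frame described in the text.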
- the third prediction model can be obtained according to the following method:
- a certain number of audio signals emitted by the user when the lips are close to the wrist-worn device or handheld device, as well as audio signals corresponding to human voices near or far away from the wrist-worn device or handheld device, are obtained, and then the audio feature set values of the collected audio signals are extracted according to the above-mentioned preset strategy, and then the audio feature set values are used as training samples to train the machine learning model or convolutional neural network to obtain a third prediction model.
- the machine learning model can be a decision tree, a random forest algorithm, Xgboost or AdaBoost.
- the duration of the first audio signal is less than or equal to 0.5 s.
- the duration of the first audio signal may be 0.25 s, 0.3 s, 0.4 s, or 0.5 s.
- By setting the duration of the first audio signal to be less than or equal to 0.5 s, the text corresponding to the first audio signal can be echoed in real time when the voice assistant is subsequently started to recognize the first audio signal.
- determining whether the type of the first audio signal is an approaching human voice according to the first audio signal is implemented in the DSP of the first device.
- the first audio signal is an audio signal that has not been processed by AGC, noise reduction, dereverberation or compression, and there is no loss of information about the approaching human voice.
- the sound type of the first audio signal is determined by using the original audio signal that has not been processed by AGC, noise reduction, dereverberation or compression, which can improve the accuracy of the determined sound type of the first audio signal.
- the audio signal corresponding to a complete speech task includes not only the first audio signal, but also the audio signal after the first audio signal.
- the microphone collects the audio signal periodically according to a collection time of less than or equal to 0.5s.
- the microphone of the first device continuously collects the audio signal, extracts the audio feature set value of the collected audio signal, and determines the type of the collected audio signal based on the audio feature set value of the collected audio signal at a preset time interval.
- the preset time interval may be 0.1s, 0.2s or other time.
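- Continuous collection with classification at a preset interval can be sketched as a sliding window. The class name, the buffer handling, and the `classify` callback are assumptions for illustration; the 0.5 s window and 0.1 s interval are the example values from the text.

```python
from collections import deque


class SlidingClassifier:
    """Classify the most recent window of audio every `interval_s` seconds."""

    def __init__(self, sample_rate, classify, window_s=0.5, interval_s=0.1):
        self.window = int(round(sample_rate * window_s))
        self.step = int(round(sample_rate * interval_s))
        self.buffer = deque(maxlen=self.window)   # keeps only the latest window
        self.since_last = 0
        self.classify = classify                  # hypothetical sound-type model

    def push(self, samples):
        """Feed new samples; return the results of any classifications triggered."""
        results = []
        for sample in samples:
            self.buffer.append(sample)
            self.since_last += 1
            if len(self.buffer) == self.window and self.since_last >= self.step:
                results.append(self.classify(list(self.buffer)))
                self.since_last = 0
        return results
```

- The first classification happens once a full window is available; after that, one classification is produced per interval on the latest window.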
- the first device controls the IMU and ambient light sensor of the first device to continue to collect IMU data and illuminance data.
- the collection duration is a second preset duration.
- the first device determines whether the user has performed the second preset action based on the IMU data and illuminance data that continue to be collected; if it is determined that the user has performed the second preset action and the type of the first audio signal is an approaching human voice, the first device starts the voice assistant. If it is determined that the user has not performed the second preset action, the first device turns off the microphone of the first device.
- the second preset action may be a wrist-raising and holding action or a hand-raising and holding action.
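- The combined decision can be written as a small function. The two outcomes stated in the text are starting the voice assistant and turning off the microphone; the text does not say what happens when the second preset action is detected but the sound type is not an approaching human voice, so the sketch returns a placeholder for that case. All string constants are illustrative.

```python
def decide(second_action_detected, sound_type):
    """Map the two detection results to the next step of the first device."""
    if not second_action_detected:
        return "turn_off_microphone"
    if sound_type == "approaching_human_voice":
        return "start_voice_assistant"
    # Not specified in the text; left as a placeholder in this sketch.
    return "no_action_specified"
```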
- the first device activates the voice assistant.
- the third preset time may be 0.3s.
- When determining whether the user has maintained the wrist-raising action, the first device inputs the IMU data and illumination data that continue to be collected into the fourth prediction model for prediction, and a fourth prediction result can be obtained; for example, the fourth prediction result is the wrist-raising and holding action.
- the fourth prediction model can be a four-classification network, and the first device will continue to input the collected IMU data and illumination data into the fourth prediction model for prediction.
- the fourth prediction model outputs four probabilities, namely, the probability of the wrist-raising action, the probability of the wrist-dropping action, the probability of the wrist-raising and holding action, and the probability of the free hand position; the fourth prediction result is the action corresponding to the maximum probability, such as the wrist-raising and holding action.
- the free hand position here refers to the action other than the wrist-raising action, the wrist-dropping action, and the wrist-raising and holding action.
- the fourth prediction model can also be a two-classification network, and the first device will continue to input the collected IMU data and illumination data into the fourth prediction model for prediction.
- the fourth prediction model outputs two probabilities, namely, the probability of the wrist-raising and holding action and the probability of other actions.
- the fourth prediction result is the action corresponding to the maximum probability; the other actions here refer to the actions other than the wrist-raising and holding action.
- When determining whether the user has maintained the hand-raising action, the first device inputs the IMU data and illumination data that continue to be collected into the fifth prediction model for prediction, and a fifth prediction result can be obtained; for example, the fifth prediction result is the hand-raising and holding action.
- the fifth prediction model can be a four-classification network, the first device inputs the IMU data and illumination data into the fifth prediction model for prediction, and the fifth prediction model outputs four probabilities, namely, the probability of the hand-raising action, the probability of the arm-dropping action, the probability of the hand-raising and holding action, and the probability of the free hand position; the fifth prediction result is the action corresponding to the maximum probability, such as the hand-raising and holding action.
- the free hand position here refers to the action other than the hand-raising action, the arm-dropping action, and the hand-raising and holding action.
- the fifth prediction model can also be a two-classification network, the first device will continue to input the collected IMU data and illumination data into the fifth prediction model for prediction, the fifth prediction model outputs two probabilities, namely, the probability of the hand-raising and holding action and the probability of other actions, and the fifth prediction result is the action corresponding to the maximum probability; the other actions here refer to the actions other than the hand-raising and holding action.
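- the selection rule described above (the prediction result is the action with the maximum probability) can be sketched as follows. This is a minimal illustration, not the trained model itself: the label strings and the probability values are hypothetical placeholders standing in for the model's output.

```python
from typing import Dict


def predict_action(probabilities: Dict[str, float]) -> str:
    """Return the action label whose probability is the largest."""
    if not probabilities:
        raise ValueError("probabilities must not be empty")
    return max(probabilities, key=probabilities.get)


# Hypothetical output of a four-classification network.
four_class = {
    "hand_raising": 0.10,
    "arm_dropping": 0.05,
    "hand_raising_and_holding": 0.80,
    "free_hand_position": 0.05,
}
# Hypothetical output of a two-classification network.
two_class = {"hand_raising_and_holding": 0.30, "other": 0.70}

fifth_result_four = predict_action(four_class)
fifth_result_two = predict_action(two_class)
```

The same helper applies unchanged to the fourth prediction model, since both models differ only in the labels they emit.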
- the fourth prediction model and the first prediction model can be the same prediction model.
- the collection time of the IMU data and illuminance data used to determine whether the user has performed the second preset action needs to be equal to the collection time of the IMU data and illuminance data used to determine whether the user has performed the first preset action, that is, the second preset time is required to be equal to the first preset time.
- the first device interpolates the IMU data and illuminance data that continue to be collected, respectively, so that the equivalent time of the IMU data and illuminance data used to determine whether the user has performed the second preset action is equal to the collection time of the IMU data and illuminance data used to determine whether the user has performed the first preset action, and then the same prediction model can be used for prediction.
- the fifth prediction model and the second prediction model can be the same prediction model.
- the fifth prediction model and the second prediction model are the same prediction model, and the second preset time length is less than the first preset time length, before using the continuously collected IMU data and illuminance data to determine whether the user has performed the second preset action, the first device interpolates the continuously collected IMU data and illuminance data respectively, so that the equivalent time length of the IMU data and illuminance data used to determine whether the user has performed the second preset action is equal to the collection time length of the IMU data and illuminance data used to determine whether the user has performed the first preset action, and the same prediction model can be used for prediction.
- the sensor can collect a certain amount of data after working for a period of time.
- the amount of data can be characterized by the duration.
- the equivalent duration here is to characterize the amount of data.
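- the interpolation step can be illustrated with a short sketch. The source does not specify the interpolation method, so linear interpolation is assumed here, and the function name `resample_linear` and the sample values are hypothetical. The idea is that a window collected over the shorter second preset duration is stretched to the number of samples the model expects for the first preset duration.

```python
from typing import List


def resample_linear(samples: List[float], target_len: int) -> List[float]:
    """Linearly interpolate a 1-D sample sequence to target_len points."""
    if not samples or target_len < 1:
        raise ValueError("need at least one sample and a positive target length")
    if len(samples) == 1 or target_len == 1:
        return [samples[0]] * target_len
    # Map each output index onto a fractional position in the input.
    scale = (len(samples) - 1) / (target_len - 1)
    out = []
    for i in range(target_len):
        pos = i * scale
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1.0 - frac) + samples[hi] * frac)
    return out


# A short window (3 samples) stretched to the length of a longer window (5 samples).
short_window = [0.0, 1.0, 2.0]
stretched = resample_linear(short_window, 5)
```

Each IMU axis and the illuminance channel would be resampled separately in this way before being passed to the shared prediction model.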
- the first device determines the task to be executed based on the audio signal collected by the microphone of the first device, and identifies the type of the task to be processed; if the task to be processed is a sensitive task, in order to ensure security, it is necessary to obtain the user's identity information, such as the user's fingerprint information, voiceprint information, facial image information, etc.
- the task to be processed is executed.
- the sensitive task may be a transfer task, a private folder opening task, etc.
- the target user is the owner of the first device.
- the current user of the first device says to the first device "transfer 50 yuan to xx", and the microphone of the first device collects the corresponding audio signal, and determines the pending task based on the audio signal: transfer 50 yuan to xx.
- the first device determines that the pending task is a sensitive task, obtains the voiceprint information of the current user of the first device, and matches the voiceprint information of the current user with the voiceprint information of the owner of the first device pre-stored in the first device. If the match is successful, it is determined that the user of the first device is the owner of the first device, and the first device executes the pending task: transfer 50 yuan to xx. If the match fails, it means that the user of the first device is not the owner of the first device, and the first device does not execute the pending task.
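- the gating logic of this example can be sketched as follows. It is an illustration under stated assumptions: the task type names, the `run_task` helper, and the string-equality `voiceprint_matches` placeholder are all hypothetical; real voiceprint matching compares biometric features rather than strings.

```python
from typing import Callable

# Hypothetical task types treated as sensitive.
SENSITIVE_TASK_TYPES = {"transfer", "open_private_folder"}


def voiceprint_matches(current: str, enrolled: str) -> bool:
    """Placeholder matcher; a real system compares voiceprint features."""
    return current == enrolled


def run_task(task_type: str,
             is_owner: Callable[[], bool],
             execute: Callable[[], str]) -> str:
    """Execute a task, requiring an identity check only for sensitive tasks."""
    if task_type in SENSITIVE_TASK_TYPES and not is_owner():
        return "rejected: identity check failed"
    return execute()


result_owner = run_task(
    "transfer",
    lambda: voiceprint_matches("vp-owner", "vp-owner"),
    lambda: "transferred 50 yuan to xx",
)
result_guest = run_task(
    "transfer",
    lambda: voiceprint_matches("vp-guest", "vp-owner"),
    lambda: "transferred 50 yuan to xx",
)
```

Non-sensitive tasks skip the identity check entirely, so the matcher is only invoked when it is needed.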
- the first device determines whether identity authentication is required before executing the task; if identity authentication is required, the first device sends a prompt message to the user, such as the prompt message "Please unlock the screen by face or fingerprint first".
- the first device obtains the user's identity information, such as face information or fingerprint information; after the identity authentication is passed (for example, face unlocking is successful or the first device is in the screen unlocking state), the first device executes the task corresponding to the first audio signal.
- the first device obtains a first confidence and a second confidence, wherein the first confidence is used to characterize the duration of the user's wrist-raising action or hand-raising action; the second confidence is used to characterize the degree to which the first audio signal is an approaching human voice; the first confidence and the second confidence are weighted and summed according to a preset weight to obtain the confidence of starting the voice assistant. If the confidence of starting the voice assistant is greater than the preset confidence, the voice assistant of the first device is started.
- the first device may obtain the first confidence level and the second confidence level in the following manner:
- the first device calculates the duration of the user's wrist-raising gesture close to the mouth according to the acquired IMU data, or calculates the duration of the user's hand-raising gesture close to the mouth according to the acquired IMU data, that is, calculates the duration of the user's wrist-raising action or hand-raising action according to the acquired IMU data; and then calculates the first confidence based on the obtained duration; wherein, the longer the obtained duration, the greater the first confidence.
- the value interval of the first confidence is [0.6,1.0], that is, the first confidence corresponding to the lowest value of the obtained duration is 0.6, and the first confidence corresponding to the highest value of the obtained duration is 1.0; based on this, the linear relationship between the obtained duration and the corresponding first confidence can be determined.
- the first device can calculate and determine the first confidence corresponding to the obtained duration based on the linear relationship.
- the first device uses the probability of the approaching human voice output by the third prediction model as the second confidence level.
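- the confidence calculation above can be sketched as follows. Only the interval [0.6, 1.0] and the weighted sum come from the description; the duration bounds, the weights, and the threshold used in the example are hypothetical values chosen for illustration.

```python
def first_confidence(duration_s: float,
                     min_duration_s: float,
                     max_duration_s: float) -> float:
    """Map a gesture duration linearly onto the interval [0.6, 1.0]."""
    if max_duration_s <= min_duration_s:
        raise ValueError("max_duration_s must exceed min_duration_s")
    clamped = min(max(duration_s, min_duration_s), max_duration_s)
    ratio = (clamped - min_duration_s) / (max_duration_s - min_duration_s)
    return 0.6 + 0.4 * ratio


def should_start_assistant(first_conf: float, second_conf: float,
                           w1: float, w2: float, threshold: float) -> bool:
    """Weighted sum of both confidences compared with a preset threshold."""
    return w1 * first_conf + w2 * second_conf > threshold


# Hypothetical numbers: durations between 0.2 s and 1.0 s, equal weights.
c1 = first_confidence(0.6, 0.2, 1.0)
c2 = 0.9  # stands in for the approaching-human-voice probability
start = should_start_assistant(c1, c2, 0.5, 0.5, 0.7)
```

Clamping the duration keeps the first confidence inside its stated interval even when the measured duration falls outside the expected bounds.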
- after starting the voice assistant, the first device identifies the voice content corresponding to the first audio signal based on automatic speech recognition (ASR) technology and displays that content in real time. For example, if the voice content corresponding to the first audio signal is "How will the weather be tomorrow?", “tomorrow” is displayed as soon as “tomorrow” is identified and “weather” is displayed as soon as “weather” is identified, instead of displaying the text only after the complete voice content is identified.
- the first device sends the first audio signal to the server; the server identifies the voice content corresponding to the first audio signal based on ASR technology and filters out content that is not spoken to the voice assistant, such as "I'm going to play ball after I get off work today". That is, the voice assistant of the first device does not provide feedback on content that is not spoken to it, and the voice assistant terminates the voice conversation. In this way, interference to the user can be reduced.
- some keywords or sentences can be set in advance; when the first device determines that the voice content corresponding to the first audio signal contains the set keywords or sentences, it determines that the voice content corresponding to the first audio signal is the content that needs to be filtered out, and the voice assistant does not need to provide feedback on the voice content corresponding to the first audio signal.
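- the keyword-based filtering can be sketched as a simple substring check. This is only an illustration: the phrase list and the function name `should_ignore` are hypothetical, and case-insensitive substring matching is an assumed matching rule that the description does not specify.

```python
from typing import Iterable


def should_ignore(transcript: str, filter_phrases: Iterable[str]) -> bool:
    """Return True if the transcript contains any preset keyword or sentence."""
    text = transcript.lower()
    return any(phrase.lower() in text for phrase in filter_phrases)


# Hypothetical preset phrase list.
preset_phrases = ["play ball", "get off work"]

ignored = should_ignore(
    "I'm going to play ball after I get off work today", preset_phrases
)
answered = not should_ignore("How will the weather be tomorrow?", preset_phrases)
```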
- the first device is a wrist-worn device as shown in FIG. 1a and FIG. 1b , or a handheld device as shown in FIG. 1c .
- the event of the user approaching the first device can be detected more accurately through IMU data and illumination data, instead of requiring the user to significantly raise his wrist/hand to detect the event of the user approaching the device when only using acceleration data.
- the solution of the present application is more sensitive: detection succeeds when the wrist or hand is raised naturally, which provides users with a natural voice interaction experience; because the movement amplitude is small, user privacy in public places is improved and the operation is easy.
- the duration of the first audio signal is limited to 0.5 s or less so that the text corresponding to the first audio signal can be echoed in real time when the voice assistant is subsequently started to recognize the first audio signal.
- the first audio signal is an audio signal that has not been processed by AGC, noise reduction, reverberation or compression, and there is no loss of information close to the human voice.
- the information in the first audio signal can be maximized, and the accuracy of the sound type of the determined first audio signal can be improved.
- when the task corresponding to the audio signal is a sensitive task, identity authentication is required before the task is executed. In this way, users other than the owner of the first device can be prevented from performing security-sensitive voice tasks through the first device, thereby ensuring the information security of the owner's device.
- the first device collects IMU data of the first device in real time. It should be understood that the IMU data here includes acceleration, angular velocity and other data. The first device determines whether to trigger the first event based on the IMU data, that is, whether the first device detects the first event. If the first device determines that the first event is triggered based on the IMU data, it means that the first device detects the first event; if the first device determines that the first event is not triggered based on the IMU data, it means that the first device does not detect the first event.
- the first event is a wrist raising event of the first device, a hand raising event of the first device, or a wrist turning event of the first device. It should be understood that the first event can also be an event other than the wrist raising event of the first device, the hand raising event of the first device, and the wrist turning event of the first device.
- the hand raising event of the first device refers to the user raising his hand to place the first device in a viewable or operable position and posture. It should be understood that the first device here is a handheld device.
- the wrist turning event of the first device refers to the user turning the wrist directly inward or first turning the wrist outward and then turning the wrist inward to place the first device in a viewable or operable position and posture. It should be understood that the first device here is a wrist-worn device.
- the wrist-raising event of the first device refers to the user raising his wrist to place the first device in a viewable or operable position and posture. It should be understood that the first device here is a wrist-worn device.
- the position and posture of the first device can be determined by the IMU data of the first device, and the first device can determine whether to trigger the first event based on the IMU data of the first device.
- the first device can use an event prediction model to predict whether the first device triggers the first event. Specifically, the first device inputs the collected IMU data into the event prediction model for processing, and the event prediction model outputs four probability values. The four probability values are respectively used to characterize the probability of triggering a wrist raising event of the first device, the probability of triggering a hand raising event of the first device, the probability of triggering a wrist turning event of the first device, and the probability of not triggering the above events. Among them, the result corresponding to the maximum probability is the final output result of the event prediction model.
- the above event prediction model can be applied not only to the detection of wrist raising events, wrist turning events and hand raising events, but also to the detection of events in other applications, such as events of wrist raising to talk applications and wrist raising to light up applications, without limitation. Therefore, the event prediction model can be regarded as a public capability of the first device.
- the first device does not need to separately train a neural network model specifically for detecting the first event, thereby reducing the workload of the first device and the computing power consumption of the first device.
- since the first device detects the first event through the public capability of the first device, detecting the first event does not by itself indicate that the user has performed the first preset action. Therefore, the first device needs to further determine whether the user has performed the first preset action.
- the first device calculates the posture information of the first device based on the IMU data of the first device; obtains the acceleration information of the first device from the IMU data of the first device; if the posture information of the first device is within a preset posture range, the acceleration information of the first device is within a preset acceleration range, and the first event duration is within a preset duration range, it is determined that the user has performed a first preset action.
- the first event duration refers to the duration that the first device maintains the first device in a viewable or operable position and posture after the first event is detected.
- the preset posture range, preset acceleration range and preset duration range are all obtained based on historical experience values.
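- the three-condition check described above can be sketched as follows. The description leaves the form of the posture information open, so a single pitch angle is assumed here; the `Range` type, the function name, and all numeric bounds are hypothetical stand-ins for the empirically obtained preset ranges.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Range:
    low: float
    high: float

    def contains(self, value: float) -> bool:
        return self.low <= value <= self.high


def performed_first_preset_action(pitch_deg: float,
                                  accel_magnitude: float,
                                  event_duration_s: float,
                                  posture_range: Range,
                                  accel_range: Range,
                                  duration_range: Range) -> bool:
    """All three quantities must fall inside their preset ranges."""
    return (posture_range.contains(pitch_deg)
            and accel_range.contains(accel_magnitude)
            and duration_range.contains(event_duration_s))


# Hypothetical preset ranges.
POSTURE = Range(30.0, 90.0)   # degrees
ACCEL = Range(0.0, 2.0)       # m/s^2
DURATION = Range(0.3, 5.0)    # seconds

held = performed_first_preset_action(45.0, 1.0, 0.5, POSTURE, ACCEL, DURATION)
too_flat = performed_first_preset_action(10.0, 1.0, 0.5, POSTURE, ACCEL, DURATION)
```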
- the operations performed by the first device after determining whether the user has performed the first preset action can be found in the relevant descriptions of S203-S205 and will not be described again here.
- the second device obtains the second audio signal of the user and determines the position information of the user relative to the second device based on the second audio signal; the second device then determines, based on that position information, whether the user is located within a preset range; if the user is determined to be within the preset range, the second device determines that the user is detected to be close to the second device and starts the approaching human voice detection.
- if the second device confirms, through the approaching human voice detection, that the type of the third audio signal is an approaching human voice, the second device starts the voice assistant.
- the acquisition time of the third audio signal is after the acquisition time of the second audio signal.
- the second device is a large-screen device.
- all the microphones of the large-screen device can receive the user's audio data.
- the second device processes the audio signals collected by the multiple microphones based on the time difference of arrival (TDOA) technology of the audio to obtain the user's position information relative to the second device.
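- as a simplified illustration of the TDOA idea, the sketch below estimates the direction of a sound source from the arrival-time difference between two microphones under a far-field assumption. This is not the localization procedure of the present application, which uses multiple microphones and is not detailed here; the microphone spacing, the function name, and the two-microphone geometry are assumptions.

```python
import math

SPEED_OF_SOUND_M_S = 343.0  # approximate speed of sound in air at room temperature


def bearing_from_tdoa(tdoa_s: float, mic_spacing_m: float) -> float:
    """Far-field bearing in degrees (0 = straight ahead) from a two-microphone TDOA."""
    if mic_spacing_m <= 0:
        raise ValueError("mic_spacing_m must be positive")
    ratio = SPEED_OF_SOUND_M_S * tdoa_s / mic_spacing_m
    # Measurement noise can push the ratio slightly outside [-1, 1].
    ratio = max(-1.0, min(1.0, ratio))
    return math.degrees(math.asin(ratio))


# Hypothetical geometry: microphones 0.2 m apart, path difference of 0.1 m.
angle = bearing_from_tdoa(0.1 / SPEED_OF_SOUND_M_S, 0.2)
```

With more than two microphones, several such pairwise estimates can be combined to obtain a position rather than only a direction.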
- the second device determines whether the user is within a preset range based on the user's position information relative to the second device.
- the preset area is shown in Figure 6.
- the preset area is an inverted trapezoidal area 5 meters in front of the large-screen device, and the left and right angles of the lower part of the trapezoid are both 120°.
- the second device determines to start the approaching human voice detection.
- after the second device starts the approaching human voice detection, the second device continues to detect whether the user stays close to the second device. Specifically, the second device continues to obtain the user's audio signal and determines whether the user is still in the preset area based on the obtained audio signal; if it is determined that the user is not in the preset area based on the obtained audio signal, the second device terminates the approaching human voice detection.
- the second device uses an audio signal with a shorter acquisition time to determine whether the user is in a preset area.
- the shorter acquisition time means that the acquisition time is not greater than a preset acquisition time, and the preset acquisition time may be 0.1s, 0.2s, 0.5s or other time lengths.
- the approaching human voice detection and the device approaching detection can be performed simultaneously.
- the device approaching detection is to determine whether the user is within a preset range based on the acquired audio signal.
- the specific process of approaching human voice detection includes:
- the second device obtains a video stream when the user speaks a voice task to the second device and simultaneously obtains the corresponding audio signal, which can be called a third audio signal; the second device determines audio feature information based on the video stream and the third audio signal, and the audio feature information includes a correlation coefficient between speech and lip movement, and/or a time delay between speech, lip movement and video; the second device determines the type of the third audio signal based on the audio feature information; if the determined type of the third audio signal is an approaching human voice, the second device activates the voice assistant.
- the correlation coefficient between speech and lip movement is the correlation coefficient between the lip height and lip width of the user in the video stream and the amplitude of the third audio signal, such as the Pearson correlation coefficient, and the value range is [0,1], 1 indicates strong correlation, and 0 indicates no correlation.
- L1 indicates lip width
- L2 , L3 , L4 , L5 , L6 , L7 , and L8 indicate lip height. Since the speed of speech propagation is slower than the speed of light, there is a time delay between the arrival time of the speech signal and the arrival time of the video signal.
- the maximum value of the mutual correlation coefficient between the lip height, lip width and the audio amplitude of the third audio signal can be selected as the representation of the correlation between speech and lip movement, that is, the maximum value of the mutual correlation coefficient between the lip height, lip width and the audio amplitude of the third audio signal can be taken as the correlation coefficient between speech and lip movement, and the length of the right shift on the time axis can be used as the time delay between speech, lip movement and video.
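- the search for the maximum cross-correlation and its time shift can be sketched as follows. For simplicity a single lip-opening sequence is assumed in place of separate lip height and lip width series, and both sequences are assumed to be sampled at the same frame rate; the function names and sample values are hypothetical.

```python
import math
from typing import List, Tuple


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def best_lag(lip: List[float], audio: List[float], max_lag: int) -> Tuple[int, float]:
    """Try audio delays of 0..max_lag frames; return (lag, correlation) at the maximum."""
    best = (0, -1.0)
    for lag in range(max_lag + 1):
        a = lip[:len(lip) - lag]
        b = audio[lag:]
        n = min(len(a), len(b))
        if n < 2:
            break
        r = pearson(a[:n], b[:n])
        if r > best[1]:
            best = (lag, r)
    return best


# Hypothetical data: the audio amplitude follows the lip opening two frames later.
lip_opening = [0.0, 1.0, 3.0, 1.0, 0.0, 2.0, 5.0, 2.0, 0.0, 0.0]
audio_amplitude = [0.0, 0.0] + lip_opening[:8]

lag_frames, correlation = best_lag(lip_opening, audio_amplitude, 4)
```

The returned lag, multiplied by the frame period, gives the time delay, and the returned correlation serves as the speech and lip movement correlation coefficient.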
- the second device determines the type of the third audio signal based on the audio feature information, and the second device may input the audio feature information into the sixth prediction model for processing to obtain the type of the third audio signal.
- the sixth prediction model may be a four-classification network, and the second device inputs the audio feature information into the sixth prediction model for prediction.
- the sixth prediction model outputs four probabilities, which are the probability of an approaching human voice, the probability of a nearby human voice, the probability of a distant human voice, and the probability of a non-human voice; the type of the third audio signal is the sound type corresponding to the maximum probability, such as an approaching human voice.
- the sixth prediction model may also be a two-classification network, and the second device inputs the audio feature information into the sixth prediction model for prediction.
- the sixth prediction model outputs two probabilities, which are the probability of an approaching human voice and the probability of a non-approaching human voice.
- the type of the third audio signal is the sound type corresponding to the maximum probability.
- the second device will obtain the sixth prediction model, which may be the sixth prediction model trained by other devices and obtained from other devices, or the sixth prediction model trained by the second device itself.
- the process of training and obtaining the sixth prediction model specifically includes:
- the machine learning model can be a decision tree, a random forest algorithm, Xgboost or AdaBoost.
- when it is determined that the type of the third audio signal is an approaching human voice and the user has stayed close to the second device for a preset time, such as 0.3 seconds, the voice assistant is activated. In this way, accidental activation of the voice assistant by the user can be avoided.
- the second device determines the task to be executed according to the third audio signal and identifies the type of the task to be processed; if the task to be processed is a sensitive task, in order to ensure security, it is necessary to obtain the user's identity information, such as the user's fingerprint information, voiceprint information, facial image information, etc.
- the task to be processed is executed.
- after starting the voice assistant, the second device identifies the voice content corresponding to the third audio signal based on ASR technology and displays that content in real time. For example, if the voice content corresponding to the third audio signal is "What will the weather be like tomorrow", “tomorrow” is displayed as soon as “tomorrow” is identified and “weather” is displayed as soon as “weather” is identified, instead of displaying the text only after the complete voice content is identified.
- the second device sends the third audio signal to the server; the server identifies the voice content corresponding to the third audio signal based on ASR technology and filters out content that is not spoken to the voice assistant, such as "I'm going to play ball after I get off work today". That is, the voice assistant of the second device does not provide feedback on content that is not spoken to it, and the voice assistant terminates the voice conversation. In this way, interference to the user can be reduced.
- some keywords or sentences can be set in advance; when the second device determines that the voice content corresponding to the third audio signal contains the set keywords or sentences, it determines that the voice content corresponding to the third audio signal is the content that needs to be filtered out, and the voice assistant does not need to provide feedback on the voice content corresponding to the third audio signal.
- the user approaching the device is detected by using multiple microphones of the large-screen device to detect the location of the sound source, rather than using the wake-up word to trigger the microphone to pick up the user's voice task, providing users with a natural voice interaction (approach and speak) experience with easy operation.
- by obtaining lip movement information through the camera of the large-screen device and then calculating the correlation between voice and lip movement, human voices near the device that are not facing the large screen, as well as non-human voices, can be intercepted, which prevents such interference from falsely waking up the voice assistant.
- when the task corresponding to the audio signal is a sensitive task, identity authentication is required before the task is executed. This method can prevent users other than the owner of the second device from performing security-sensitive voice tasks through the second device, thereby ensuring the information security of the owner's device.
- the electronic device 800 includes:
- An acquisition unit 801 is used to acquire IMU data and illumination data of the electronic device when a first event is detected;
- a determination unit 802 configured to determine whether the user has performed a first preset action according to the IMU data and the illumination data of the electronic device;
- the starting unit 803 is used to start the microphone of the electronic device if it is determined that the user has performed the first preset action.
- the acquiring unit is also used to acquire the first audio signal collected by the microphone;
- the determining unit 802 is further configured to determine, based on the first audio signal, whether the type of the first audio signal is an approaching human voice;
- the activation unit 803 is further configured to activate a voice assistant of the electronic device if it is determined that the type of the first audio signal is an approaching human voice.
- the first event is a wrist-raising hardware interrupt event, a hand-raising hardware interrupt event of the electronic device, a key-pressing screen-lighting event of the electronic device, a hand-raising screen-lighting event of the electronic device, or a wrist-raising screen-lighting event of the electronic device.
- the acquisition unit 801 is further configured to continuously acquire the IMU data and illumination data collected by the electronic device after the start unit 803 starts the microphone of the electronic device.
- the determination unit 802 is further used to determine whether the user has performed a second preset action according to the IMU data and the illumination data collected by the electronic device;
- the starting unit 803 is specifically used to:
- the voice assistant of the electronic device is activated.
- the starting unit 803 is further configured to:
- the microphone of the electronic device is turned off.
- the duration of collecting the IMU data and illumination data used to determine whether the user has performed a first preset action is a first preset duration
- the duration of collecting the IMU data and illumination data used to determine whether the user has performed a second preset action is a second preset duration, wherein the first preset duration is greater than the second preset duration
- the first event is a wrist-raising hardware interrupt event, a hand-raising hardware interrupt event of an electronic device, a hand-raising screen-lighting event of an electronic device, or a wrist-raising screen-lighting event of an electronic device.
- the acquisition unit 801 is specifically used to:
- start the ambient light sensor of the electronic device to collect the illumination data of the environment of the electronic device and continue to collect IMU data; obtain the cached IMU data related to the first event; and set the illumination data related to the first event to zero.
- the duration of the first audio signal is less than or equal to 0.5 s, and determining whether the type of the first audio signal is an approaching human voice based on the first audio signal is implemented by a digital signal processor (DSP) of the electronic device.
- the first audio signal is collected by multiple microphones of the electronic device.
- the determination unit 802 is further configured to determine the task to be performed according to the audio signal collected by the microphone of the electronic device;
- the acquisition unit 801 is further used to acquire the identity information of the user if the task to be executed is a sensitive task;
- the electronic device 800 further includes:
- the execution unit 804 is used to determine that the user is a target user based on the user's identity information and execute the task to be processed.
- the electronic device 800 further includes:
- the display unit 805 is used to display the text corresponding to the collected audio signal in real time after the voice assistant is started.
- the specific functional implementation method of the electronic device can refer to the description of the above-mentioned voice interaction method, such as the acquisition unit 801 is used to execute the relevant content of S201, the determination unit 802 is used to execute the relevant content of S202 and S204, the startup unit 803, the execution unit 804 and the display unit are used to execute the relevant content of S203 and S205, which will not be repeated here.
- each unit or module in the electronic device can be separately or completely merged into one or several other units or modules, or one or more of the units or modules can be further divided into multiple functionally smaller units or modules; either arrangement achieves the same operation without affecting the realization of the technical effect of the embodiment of the present invention.
- the above-mentioned units or modules are divided based on logical functions. In practical applications, the function of a unit (or module) can also be implemented by multiple units (or modules), or the functions of multiple units (or modules) are implemented by one unit (or module).
- the electronic device 800a includes:
- An acquisition unit 801a is used to acquire IMU data of the first device when a first event is detected;
- a determination unit 802a configured to determine whether the user has performed a first preset action according to the IMU data of the first device;
- the activation unit 803a is used to activate the microphone of the first device if it is determined that the user has performed the first preset action, and the acquisition unit is further used to acquire the first audio signal collected by the microphone;
- the determining unit 802a is further configured to determine, according to the first audio signal, whether the type of the first audio signal is an approaching human voice;
- the starting unit 803a is further configured to start the voice assistant of the first device if it is determined that the type of the first audio signal is an approaching human voice.
- the first event is a wrist raising event of the first device, a hand raising event of the first device, or a wrist turning event of the first device; and the wrist raising event of the first device, the hand raising event of the first device, and the wrist turning event of the first device are all obtained through the same event prediction model.
- the determining unit 802a is further configured to include:
- the first device calculates the posture information of the first device based on the IMU data of the first device; obtains the acceleration information of the first device from the IMU data of the first device; if the posture information of the first device is within a preset posture range, the acceleration information of the first device is within a preset acceleration range, and the first event duration is within a preset duration range, the first device determines that the user has performed a first preset action.
- the acquisition unit 801a is further configured to continuously acquire the IMU data and illumination data collected by the electronic device after the activation unit activates the microphone of the electronic device;
- the determination unit 802a is further used to determine whether the user has performed a second preset action according to the IMU data and the illumination data collected by the electronic device;
- the starting unit 803a is specifically used to:
- the voice assistant of the electronic device is activated.
- the duration of the first audio signal is less than or equal to 0.5 s, and determining whether the type of the first audio signal is an approaching human voice according to the first audio signal is implemented by a DSP of the electronic device.
- the first audio signal is collected by multiple microphones of the electronic device.
- the determining unit 802a is further configured to determine the task to be executed according to the audio signal collected by the microphone of the electronic device;
- the acquisition unit 801a is further used to acquire the identity information of the user if the task to be executed is a sensitive task;
- the electronic device 800a further includes:
- the execution unit 804a is configured to execute the task to be executed when it is determined, based on the user's identity information, that the user is a target user.
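The sensitive-task check can be sketched as below. All names here (`SENSITIVE_TASKS`, `handle_task`, the task strings, and the callback parameters) are hypothetical, and executing a non-sensitive task without an identity check is an assumption inferred from the text, which only requires identity information for sensitive tasks.

```python
from typing import Callable

# Hypothetical examples of sensitive tasks; the application does not list any.
SENSITIVE_TASKS = {"make_payment", "unlock_door"}


def handle_task(
    task: str,
    identify_user: Callable[[], str],
    target_user: str,
    execute: Callable[[str], None],
) -> bool:
    """Execute `task`, asking for the user's identity first if the task is sensitive.

    Returns True if the task was executed and False if it was refused.
    """
    if task in SENSITIVE_TASKS:
        # Identity information is only acquired for sensitive tasks.
        if identify_user() != target_user:
            return False
    execute(task)
    return True
```

For example, `handle_task("make_payment", lambda: "guest", "owner", print)` returns `False` without calling `print`, while the same call with a non-sensitive task name executes it directly.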
- the electronic device 800a further includes:
- the display unit 805a is used to display the text corresponding to the collected audio signal in real time after the voice assistant is started.
- each unit or module in the electronic device can be separately or wholly merged into one or several other units or modules, or one or more of the units or modules can be further divided into multiple functionally smaller units or modules; either arrangement achieves the same operation without affecting the technical effects of the embodiments of the present invention.
- the above-mentioned units or modules are divided based on logical functions. In actual applications, the functions of a unit (or module) can also be implemented by multiple units (or modules), or the functions of multiple units (or modules) can be implemented by one unit (or module).
- Figure 9 is a schematic diagram of the structure of an electronic device 900 provided in an embodiment of the present invention.
- the electronic device 900 shown in Figure 9 includes a memory 901, a processor 902, a communication interface 903, a display screen 905, and a bus 904.
- the memory 901, the processor 902, the communication interface 903, and the display screen 905 are connected to each other through the bus 904.
- Memory 901 can be a read-only memory (ROM), a static storage device, a dynamic storage device or a random access memory (RAM).
- the memory 901 can store programs. When the program stored in the memory 901 is executed by the processor 902, the processor 902 and the communication interface 903 are used to execute the various steps of the voice interaction method of the embodiment of the present application.
- the processor 902 can adopt a general-purpose central processing unit (CPU), a microprocessor, an application specific integrated circuit (ASIC), a graphics processing unit (GPU) or one or more integrated circuits to execute relevant programs to implement the functions that need to be performed by the units in the electronic device 900 of the embodiment of the present application, or to execute the voice interaction method of the method embodiment of the present application.
- the processor 902 may also be an integrated circuit chip with signal processing capability. In an implementation process, each step of the voice interaction method of the present application may be completed by an integrated logic circuit of hardware in the processor 902 or by instructions in the form of software. The processor 902 can also be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and can implement or execute the methods, steps, and logic block diagrams disclosed in the embodiments of the present application.
- the general-purpose processor can be a microprocessor or the processor can also be any conventional processor, etc.
- the steps of the method disclosed in the embodiments of the present application can be directly embodied as being executed by a hardware decoding processor, or executed by a combination of hardware and software modules in the decoding processor.
- the software module can be located in a random access memory, a flash memory, a read-only memory, a programmable read-only memory or an electrically erasable programmable memory, a register, and other mature storage media in the art.
- the storage medium is located in the memory 901; the processor 902 reads the information in the memory 901 and, in combination with its hardware, completes the functions required to be performed by the units included in the electronic device of the embodiment of the present application, or executes the voice interaction method of the method embodiment of the present application.
- the communication interface 903 uses a transceiver apparatus, such as but not limited to a transceiver, to implement communication between the electronic device 900 and other devices or a communication network. For example, data can be acquired through the communication interface 903.
- the bus 904 may include a path for transmitting information between various components of the electronic device 900 (e.g., the memory 901, the processor 902, and the communication interface 903).
- the display screen 905 is used to display the text corresponding to the collected audio signal after the voice assistant is started.
- the text of the voice assistant's reply information to the collected audio signal can also be displayed.
- the display screen 905 can be an LCD screen, an LED screen, an OLED screen, and of course it can also be other display screens, which are not limited here.
- It should be noted that although the electronic device 900 shown in FIG. 9 only shows a memory, a processor, and a communication interface, those skilled in the art should understand that, in a specific implementation, the electronic device 900 also includes other devices necessary for normal operation, such as a display. Those skilled in the art should also understand that, according to specific needs, the electronic device 900 may include hardware devices for implementing other additional functions. In addition, those skilled in the art should understand that the electronic device 900 may include only the devices necessary for implementing the embodiments of the present application, and does not necessarily include all the devices shown in FIG. 9.
- An embodiment of the present application also provides a chip, which includes a processor and a data interface.
- the processor reads instructions stored in a memory through the data interface to implement the voice interaction method.
- the chip may further include a memory, in which instructions are stored, and the processor is used to execute the instructions stored in the memory.
- the processor is used to execute the voice interaction method.
- An embodiment of the present application also provides a computer-readable storage medium, which stores instructions.
- when the instructions are run on a computer or a processor, the computer or the processor executes one or more steps in any of the above methods.
- the embodiment of the present application further provides a computer program product including instructions.
- when the computer program product is run on a computer or a processor, the computer or the processor executes one or more steps in any of the above methods.
- Computer-readable media may include computer-readable storage media, which corresponds to tangible media, such as data storage media, or includes any media (e.g., based on a communication protocol) that facilitates the transfer of a computer program from one place to another.
- computer-readable media may generally correspond to (1) non-transitory tangible computer-readable storage media, or (2) communication media, such as signals or carrier waves.
- Data storage media may be any available media that can be accessed by one or more computers or one or more processors to retrieve instructions, codes, and/or data structures for implementing the techniques described in this application.
- a computer program product may include computer-readable media.
- such computer-readable storage media may include RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage, flash memory, or any other medium that can be used to store the desired program code in the form of instructions or data structures and can be accessed by a computer.
- any connection is properly referred to as a computer-readable medium.
- for example, when instructions are transmitted over a coaxial cable, fiber optic cable, twisted pair, digital subscriber line (DSL), or wireless technologies such as infrared, radio and microwave, the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio and microwave are included in the definition of media.
- disks and discs include compact discs (CDs), laser discs, optical discs, digital versatile discs (DVDs), and Blu-ray discs, where disks typically reproduce data magnetically, while discs reproduce data optically using lasers. Combinations of the above should also be included within the scope of computer-readable media.
- the instructions may be executed by one or more processors, such as one or more digital signal processors (DSPs), general purpose microprocessors, application specific integrated circuits (ASICs), field programmable gate arrays (FPGAs), or other equivalent integrated or discrete logic circuits.
- processors may refer to any of the aforementioned structures or any other structure suitable for implementing the techniques described herein.
- the functions described by the various illustrative logical blocks, modules, and steps described herein may be provided in a processor configured for encoding and decoding.
- the technology may be fully implemented in one or more circuits or logic elements.
- the techniques of the present application may be implemented in a variety of devices or apparatuses, including a wireless handset, an integrated circuit (IC), or a set of ICs (e.g., a chipset).
- Various components, modules, or units are described in the present application to emphasize functional aspects of the devices for performing the disclosed techniques, but do not necessarily need to be implemented by different hardware units.
- the various units may be combined in a coded hardware unit in conjunction with appropriate software and/or firmware, or provided by interoperating hardware units (including one or more processors as described above).
- “A/B” can represent A or B, where A and B can be singular or plural.
- “multiple” refers to two or more.
- “at least one of the following” or similar expressions refers to any combination of these items, including any combination of single items or plural items.
- for example, at least one of a, b, or c can represent: a, b, c, a-b, a-c, b-c, or a-b-c, where a, b, c can be single or multiple.
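The enumeration above is simply the set of non-empty combinations of the listed items. As a small illustration (the function name `at_least_one_of` is made up for this example), the seven cases can be generated with the standard library:

```python
from itertools import combinations


def at_least_one_of(items):
    """List every non-empty combination of `items`, joined with '-'."""
    return [
        "-".join(combo)
        for size in range(1, len(items) + 1)
        for combo in combinations(items, size)
    ]


print(at_least_one_of(["a", "b", "c"]))
# prints ['a', 'b', 'c', 'a-b', 'a-c', 'b-c', 'a-b-c']
```

For n items this yields 2^n - 1 combinations, which is 7 for a, b, and c.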
- the words “first”, “second”, etc. are used to distinguish the same items or similar items with substantially the same functions and effects. Those skilled in the art can understand that the words “first”, “second”, etc. do not limit the quantity and execution order, and the words “first”, “second”, etc. do not limit them to be necessarily different. Meanwhile, in the embodiments of the present application, words such as “exemplary” or “for example” are used to indicate examples, illustrations or descriptions. Any embodiment or design described as “exemplary” or “for example” in the embodiments of the present application should not be interpreted as being more preferred or more advantageous than other embodiments or designs. Specifically, the use of words such as “exemplary” or “for example” is intended to present related concepts in a concrete manner for ease of understanding.
- the disclosed systems, devices and methods can be implemented in other ways.
- the division of the unit is only a logical function division, and there may be other division methods in actual implementation, for example, multiple units or components can be combined or integrated into another system, or some features can be ignored or not executed.
- the mutual coupling, direct coupling, or communication connection shown or discussed may be implemented through some interfaces; the indirect coupling or communication connection between devices or units may be electrical, mechanical, or in other forms.
- the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place or distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
- the computer program product includes one or more computer instructions.
- the computer can be a general-purpose computer, a special-purpose computer, a computer network, or other programmable device.
- the computer instructions can be stored in a computer-readable storage medium or transmitted through the computer-readable storage medium.
- the computer instructions can be transmitted from one website, computer, server or data center to another website, computer, server or data center by wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave, etc.) means.
- the computer-readable storage medium can be any available medium that a computer can access, or a data storage device, such as a server or data center, that integrates one or more available media.
- the available medium can be a read-only memory (ROM), or a random access memory (RAM), or a magnetic medium, such as a floppy disk, a hard disk, a tape, a magnetic disk, or an optical medium, such as a digital versatile disc (DVD), or a semiconductor medium, such as a solid state disk (SSD), etc.
Claims (38)
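- A voice interaction method, applied to a first device, wherein the method comprises: when a first event is detected, acquiring IMU data and illumination data of the first device; determining, according to the IMU data and the illumination data of the first device, whether a user has performed a first preset action; if it is determined that the user has performed the first preset action, starting a microphone of the first device and acquiring a first audio signal collected by the microphone; determining, according to the first audio signal, whether the type of the first audio signal is an approaching human voice; and if it is determined that the type of the first audio signal is an approaching human voice, starting a voice assistant of the first device.
- The method according to claim 1, wherein the first event is a wrist raising hardware interrupt event, a hand raising hardware interrupt event of the first device, a key-press screen-on event of the first device, a hand raising screen-on event of the first device, or a wrist raising screen-on event of the first device.
- The method according to claim 1 or 2, wherein the method further comprises: after the microphone of the first device is started, continuously acquiring IMU data and illumination data collected by the first device, and determining, according to the IMU data and the illumination data collected by the first device, whether the user has performed a second preset action; and the starting the voice assistant of the first device if it is determined that the type of the first audio signal is an approaching human voice comprises: if it is determined that the user has performed the second preset action and that the type of the first audio signal is an approaching human voice, starting the voice assistant of the first device.
- The method according to claim 3, wherein the method further comprises: if it is determined that the user has not performed the second preset action, turning off the microphone of the first device.
- The method according to claim 3 or 4, wherein the collection duration of the IMU data and illumination data used to determine whether the user has performed the first preset action is a first preset duration, and the collection duration of the IMU data and illumination data used to determine whether the user has performed the second preset action is a second preset duration, wherein the first preset duration is greater than the second preset duration.
- The method according to claim 2, wherein the first event is a wrist raising hardware interrupt event, a hand raising hardware interrupt event of the first device, a hand raising screen-on event of the first device, or a wrist raising screen-on event of the first device, and the acquiring IMU data and illumination data of the first device comprises: starting an ambient light sensor of the first device, collecting illumination data of the environment of the first device and continuing to collect IMU data, acquiring cached IMU data related to the first event, and zero-padding the illumination data related to the first event.
- The method according to any one of claims 1 to 6, wherein the duration of the first audio signal is less than or equal to 0.5 s, and the determining, according to the first audio signal, whether the type of the first audio signal is an approaching human voice is implemented by a digital signal processor (DSP) of the first device.
- The method according to any one of claims 1 to 7, wherein the first audio signal is collected by multiple microphones of the first device.
- The method according to any one of claims 1 to 8, wherein the method further comprises: determining a task to be executed according to an audio signal collected by the microphone of the first device; if the task to be executed is a sensitive task, acquiring identity information of the user; and when it is determined, based on the identity information of the user, that the user is a target user, executing the task to be processed.
- The method according to any one of claims 1 to 9, wherein the method further comprises: after the voice assistant is started, displaying, by the first device in real time, text corresponding to the collected audio signal.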
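- A voice interaction method, applied to a first device, wherein the method comprises: when a first event is detected, acquiring IMU data of the first device, wherein the first event is a wrist raising event of the first device, a hand raising event of the first device, or a wrist turning event of the first device, and the wrist raising event of the first device, the hand raising event of the first device, and the wrist turning event of the first device are all obtained through the same event prediction model; determining, according to the IMU data of the first device, whether a user has performed a first preset action; if it is determined that the user has performed the first preset action, starting a microphone of the first device and acquiring a first audio signal collected by the microphone; determining, according to the first audio signal, whether the type of the first audio signal is an approaching human voice; and if it is determined that the type of the first audio signal is an approaching human voice, starting a voice assistant of the first device.
- The method according to claim 11, wherein the determining, according to the IMU data of the first device, whether a user has performed a first preset action comprises: calculating posture information of the first device according to the IMU data of the first device; obtaining acceleration information of the first device from the IMU data of the first device; and if the posture information of the first device is within a preset posture range, the acceleration information of the first device is within a preset acceleration range, and the duration of the first event is within a preset duration range, determining that the user has performed the first preset action.
- The method according to claim 11 or 12, wherein the method further comprises: after the microphone of the first device is started, continuously acquiring IMU data and illumination data collected by the first device, and determining, according to the IMU data and the illumination data collected by the first device, whether the user has performed a second preset action; and the starting the voice assistant of the first device if it is determined that the type of the first audio signal is an approaching human voice comprises: if it is determined that the user has performed the second preset action and that the type of the first audio signal is an approaching human voice, starting the voice assistant of the first device.
- The method according to claim 13, wherein the method further comprises: if it is determined that the user has not performed the second preset action, turning off the microphone of the first device.
- The method according to any one of claims 11 to 14, wherein the duration of the first audio signal is less than or equal to 0.5 s, and the determining, according to the first audio signal, whether the type of the first audio signal is an approaching human voice is implemented by a digital signal processor (DSP) of the first device.
- The method according to any one of claims 11 to 15, wherein the first audio signal is collected by multiple microphones of the first device.
- The method according to any one of claims 11 to 16, wherein the method further comprises: determining a task to be executed according to an audio signal collected by the microphone of the first device; if the task to be executed is a sensitive task, acquiring identity information of the user; and when it is determined, based on the identity information of the user, that the user is a target user, executing the task to be processed.
- The method according to any one of claims 11 to 17, wherein the method further comprises: after the voice assistant is started, displaying, by the first device in real time, text corresponding to the collected audio signal.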
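- An electronic device, comprising: an acquisition unit, configured to acquire IMU data and illumination data of the electronic device when a first event is detected; a determining unit, configured to determine, according to the IMU data and the illumination data of the electronic device, whether a user has performed a first preset action; and a starting unit, configured to start a microphone of the electronic device if it is determined that the user has performed the first preset action; wherein the acquisition unit is further configured to acquire a first audio signal collected by the microphone; the determining unit is further configured to determine, according to the first audio signal, whether the type of the first audio signal is an approaching human voice; and the starting unit is further configured to start a voice assistant of the electronic device if it is determined that the type of the first audio signal is an approaching human voice.
- The electronic device according to claim 19, wherein the first event is a wrist raising hardware interrupt event, a hand raising hardware interrupt event of the electronic device, a key-press screen-on event of the electronic device, a hand raising screen-on event of the electronic device, or a wrist raising screen-on event of the electronic device.
- The electronic device according to claim 19 or 20, wherein the acquisition unit is further configured to continuously acquire IMU data and illumination data collected by the electronic device after the starting unit starts the microphone of the electronic device; the determining unit is further configured to determine, according to the IMU data and the illumination data collected by the electronic device, whether the user has performed a second preset action; and in the aspect of starting the voice assistant of the electronic device if it is determined that the type of the first audio signal is an approaching human voice, the starting unit is specifically configured to: if it is determined that the user has performed the second preset action and that the type of the first audio signal is an approaching human voice, start the voice assistant of the electronic device.
- The electronic device according to claim 21, wherein the starting unit is further configured to: if it is determined that the user has not performed the second preset action, turn off the microphone of the electronic device.
- The electronic device according to claim 21 or 22, wherein the collection duration of the IMU data and illumination data used to determine whether the user has performed the first preset action is a first preset duration, and the collection duration of the IMU data and illumination data used to determine whether the user has performed the second preset action is a second preset duration, wherein the first preset duration is greater than the second preset duration.
- The electronic device according to claim 20, wherein the first event is a wrist raising hardware interrupt event, a hand raising hardware interrupt event of the electronic device, a hand raising screen-on event of the electronic device, or a wrist raising screen-on event of the electronic device, and in the aspect of acquiring the IMU data and illumination data of the electronic device, the acquisition unit is specifically configured to: start an ambient light sensor of the electronic device, collect illumination data of the environment of the electronic device and continue to collect IMU data, acquire cached IMU data related to the first event, and zero-pad the illumination data related to the first event.
- The electronic device according to any one of claims 19 to 24, wherein the duration of the first audio signal is less than or equal to 0.5 s, and the determining, according to the first audio signal, whether the type of the first audio signal is an approaching human voice is implemented by a digital signal processor (DSP) of the electronic device.
- The electronic device according to any one of claims 19 to 25, wherein the first audio signal is collected by multiple microphones of the electronic device.
- The electronic device according to any one of claims 19 to 26, wherein the determining unit is further configured to determine a task to be executed according to an audio signal collected by the microphone of the electronic device; the acquisition unit is further configured to acquire identity information of the user if the task to be executed is a sensitive task; and the electronic device further comprises: an execution unit, configured to execute the task to be processed when it is determined, based on the identity information of the user, that the user is a target user.
- The electronic device according to any one of claims 19 to 27, wherein the electronic device further comprises: a display unit, configured to display, in real time, text corresponding to the collected audio signal after the voice assistant is started.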
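- An electronic device, comprising: an acquisition unit, configured to acquire IMU data of the electronic device when a first event is detected, wherein the first event is a wrist raising event of the first device, a hand raising event of the first device, or a wrist turning event of the first device, and the wrist raising event of the first device, the hand raising event of the first device, and the wrist turning event of the first device are all obtained through the same event prediction model; a determining unit, configured to determine, according to the IMU data of the electronic device, whether a user has performed a first preset action; and a starting unit, configured to start a microphone of the electronic device and acquire a first audio signal collected by the microphone if it is determined that the user has performed the first preset action; wherein the determining unit is further configured to determine, according to the first audio signal, whether the type of the first audio signal is an approaching human voice; and the starting unit is further configured to start a voice assistant of the electronic device if it is determined that the type of the first audio signal is an approaching human voice.
- The electronic device according to claim 29, wherein in the aspect of determining, according to the IMU data of the electronic device, whether the user has performed the first preset action, the determining unit is specifically configured to: calculate posture information of the electronic device according to the IMU data of the electronic device; obtain acceleration information of the electronic device from the IMU data of the electronic device; and if the posture information of the electronic device is within a preset posture range, the acceleration information of the electronic device is within a preset acceleration range, and the duration of the first event is within a preset duration range, determine that the user has performed the first preset action.
- The electronic device according to claim 29 or 30, wherein the acquisition unit is further configured to continuously acquire IMU data and illumination data collected by the electronic device after the starting unit starts the microphone of the electronic device; the determining unit is further configured to determine, according to the IMU data and the illumination data collected by the electronic device, whether the user has performed a second preset action; and in the aspect of starting the voice assistant of the electronic device if it is determined that the type of the first audio signal is an approaching human voice, the starting unit is specifically configured to: if it is determined that the user has performed the second preset action and that the type of the first audio signal is an approaching human voice, start the voice assistant of the electronic device.
- The electronic device according to claim 31, wherein the starting unit is further configured to: if it is determined that the user has not performed the second preset action, turn off the microphone of the electronic device.
- The electronic device according to any one of claims 29 to 32, wherein the duration of the first audio signal is less than or equal to 0.5 s, and the determining, according to the first audio signal, whether the type of the first audio signal is an approaching human voice is implemented by a digital signal processor (DSP) of the electronic device.
- The electronic device according to any one of claims 29 to 33, wherein the first audio signal is collected by multiple microphones of the electronic device.
- The electronic device according to any one of claims 29 to 34, wherein the determining unit is further configured to determine a task to be executed according to an audio signal collected by the microphone of the electronic device; the acquisition unit is further configured to acquire identity information of the user if the task to be executed is a sensitive task; and the electronic device further comprises: an execution unit, configured to execute the task to be processed when it is determined, based on the identity information of the user, that the user is a target user.
- The electronic device according to any one of claims 29 to 35, wherein the electronic device further comprises: a display unit, configured to display, in real time, text corresponding to the collected audio signal after the voice assistant is started.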
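- An electronic device, comprising a processor and a memory, wherein the memory is configured to store program code, and the processor is configured to execute the program code to implement the method according to any one of claims 1 to 18.
- A computer-readable storage medium, wherein the computer-readable storage medium stores a computer program, and when the computer program is executed by a processor, the method according to any one of claims 1 to 18 is implemented.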
Priority Applications (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| EP24763107.0A EP4632733A4 (en) | 2023-02-28 | 2024-02-27 | VOICE INTERACTION METHOD AND ASSOCIATED DEVICE |
| US19/310,862 US20250378833A1 (en) | 2023-02-28 | 2025-08-26 | Speech interaction method and related device |
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202310224268.2 | 2023-02-28 | ||
| CN202310224268.2A CN116229953A (zh) | 2023-02-28 | 2023-02-28 | Voice interaction method and related device |
Related Child Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US19/310,862 Continuation US20250378833A1 (en) | 2023-02-28 | 2025-08-26 | Speech interaction method and related device |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2024179425A1 (zh) | 2024-09-06 |
Family
ID=86580372
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/CN2024/078662 Ceased WO2024179425A1 (zh) | 2023-02-28 | 2024-02-27 | Voice interaction method and related device |
Country Status (4)
| Country | Link |
|---|---|
| US (1) | US20250378833A1 (zh) |
| EP (1) | EP4632733A4 (zh) |
| CN (1) | CN116229953A (zh) |
| WO (1) | WO2024179425A1 (zh) |
Cited By (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN120954111A (zh) * | 2025-10-17 | 2025-11-14 | 小芒电子商务有限责任公司 | Real-person live streaming recognition method and related apparatus |
Families Citing this family (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN116229953A (zh) * | 2023-02-28 | 2023-06-06 | 华为技术有限公司 | Voice interaction method and related device |
| CN121905167A (zh) * | 2023-10-31 | 2026-04-21 | 华为技术有限公司 | Voice assistant interaction method and electronic device |
| CN120390205A (zh) * | 2024-01-26 | 2025-07-29 | 华为技术有限公司 | Sensing method and apparatus |
| CN119376683B (zh) * | 2024-12-28 | 2025-06-20 | 荣耀终端股份有限公司 | Voice interaction method and electronic device |
Citations (8)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
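| CN111651041A (zh) * | 2020-05-27 | 2020-09-11 | 上海龙旗科技股份有限公司 | Lift-to-wake method and system for a mobile device |
| CN113377206A (zh) * | 2021-07-05 | 2021-09-10 | 安徽淘云科技股份有限公司 | Dictionary pen lift-to-wake method, apparatus, and device |
| CN113724699A (zh) * | 2021-09-18 | 2021-11-30 | 优奈柯恩(北京)科技有限公司 | Device wake-up recognition model training method, and device wake-up control method and apparatus |
| CN114283798A (zh) * | 2021-07-15 | 2022-04-05 | 海信视像科技股份有限公司 | Sound pickup method for a handheld device, and handheld device |
| CN114341779A (zh) * | 2019-09-04 | 2022-04-12 | 脸谱科技有限责任公司 | Systems, methods, and interfaces for performing inputs based on neuromuscular control |
| WO2022228056A1 (zh) * | 2021-04-30 | 2022-11-03 | 华为技术有限公司 | Human-computer interaction method and device |
| CN115588435A (zh) * | 2022-11-08 | 2023-01-10 | 荣耀终端有限公司 | Voice wake-up method and electronic device |
| CN116229953A (zh) * | 2023-02-28 | 2023-06-06 | 华为技术有限公司 | Voice interaction method and related device |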
Family Cites Families (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN104657105B (zh) * | 2015-01-30 | 2016-10-26 | 腾讯科技(深圳)有限公司 | Method and apparatus for enabling a voice input function of a terminal |
| DE112019000018B4 (de) * | 2018-05-07 | 2025-04-30 | Apple Inc. | Raise to speak |
- 2023-02-28: CN application CN202310224268.2A, patent CN116229953A (zh), active, Pending
- 2024-02-27: EP application EP24763107.0A, patent EP4632733A4 (en), active, Pending
- 2024-02-27: WO application PCT/CN2024/078662, patent WO2024179425A1 (zh), not active, Ceased
- 2025-08-26: US application US19/310,862, patent US20250378833A1 (en), active, Pending
Non-Patent Citations (1)
| Title |
|---|
| See also references of EP4632733A1 |
Also Published As
| Publication number | Publication date |
|---|---|
| EP4632733A1 (en) | 2025-10-15 |
| EP4632733A4 (en) | 2026-04-01 |
| CN116229953A (zh) | 2023-06-06 |
| US20250378833A1 (en) | 2025-12-11 |
Similar Documents
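| Publication | Title | Publication Date |
|---|---|---|
| WO2024179425A1 (zh) | Voice interaction method and related device | |
| CN108615526B (zh) | Method, apparatus, terminal, and storage medium for detecting keywords in a voice signal | |
| CN112634911B (zh) | Human-machine dialogue method, electronic device, and computer-readable storage medium | |
| CN110070863A (zh) | Voice control method and apparatus | |
| EP3274988A1 (en) | Controlling electronic device based on direction of speech | |
| CN112634895A (zh) | Wake-up-free voice interaction method and apparatus | |
| CN108806684B (zh) | Position prompting method and apparatus, storage medium, and electronic device | |
| CN114125143B (zh) | Voice interaction method and electronic device | |
| CN112286364A (zh) | Human-computer interaction method and apparatus | |
| CN114220420A (zh) | Multimodal voice wake-up method and apparatus, and computer-readable storage medium | |
| CN114765026A (zh) | Voice control method, apparatus, and system | |
| CN113160802B (zh) | Voice processing method, apparatus, device, and storage medium | |
| CN114360546B (zh) | Electronic device and wake-up method thereof | |
| CN117133282B (zh) | Voice interaction method and electronic device | |
| WO2020102943A1 (zh) | Method and apparatus for generating a gesture recognition model, storage medium, and electronic device | |
| CN115344111A (zh) | Gesture interaction method, system, and apparatus | |
| CN109064720B (zh) | Position prompting method and apparatus, storage medium, and electronic device | |
| WO2021103449A1 (zh) | Interaction method, mobile terminal, and readable storage medium | |
| CN111681654A (zh) | Voice control method and apparatus, electronic device, and storage medium | |
| WO2023006033A1 (zh) | Voice interaction method, electronic device, and medium | |
| CN114299935A (zh) | Wake-up word recognition method and apparatus, terminal, and storage medium | |
| CN115331672B (zh) | Device control method and apparatus, electronic device, and storage medium | |
| WO2024055831A1 (zh) | Voice interaction method and apparatus, and terminal | |
| CN110148401B (zh) | Speech recognition method and apparatus, computer device, and storage medium | |
| CN115705851A (zh) | Endpoint detection method and related device | |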
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 24763107; Country of ref document: EP; Kind code of ref document: A1 |
| | WWE | Wipo information: entry into national phase | Ref document number: 2024763107; Country of ref document: EP |
| | ENP | Entry into the national phase | Ref document number: 2024763107; Country of ref document: EP; Effective date: 20250711 |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | WWP | Wipo information: published in national office | Ref document number: 2024763107; Country of ref document: EP |