WO2024179425A1 - Voice interaction method and related device - Google Patents

Info

Publication number
WO2024179425A1
Authority
WO
WIPO (PCT)
Prior art keywords
audio signal
user
electronic device
event
raising
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/CN2024/078662
Other languages
English (en)
French (fr)
Inventor
龙水平
许强
吴文昊
曾宇浩
李琛贺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd
Priority to EP24763107.0A (EP4632733A4)
Publication of WO2024179425A1 (zh)
Priority to US19/310,862 (US20250378833A1)
Anticipated expiration
Legal status: Ceased

Classifications

    • G10L15/08 Speech classification or search
    • G06F3/011 Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G06F3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/017 Gesture based interaction, e.g. based on a set of recognized hand gestures
    • G06F3/167 Audio in a user interface, e.g. using voice commands for navigating, audio feedback
    • G10L15/16 Speech classification or search using artificial neural networks
    • G10L21/0208 Noise filtering
    • G10L25/84 Detection of presence or absence of voice signals for discriminating voice from noise
    • G10L2015/227 Procedures used during a speech recognition process, e.g. man-machine dialogue using non-speech characteristics of the speaker; Human-factor methodology
    • G10L2021/02082 Noise filtering the noise being echo, reverberation of the speech
    • Y02D30/70 Reducing energy consumption in communication networks in wireless communication networks

Definitions

  • the present application relates to the field of artificial intelligence, and in particular to a voice interaction method and related equipment.
  • There are currently two main ways of interacting with voice assistants.
  • In the first, the user speaks a specific wake-up word (such as "Xiaoyi Xiaoyi"), and the smart terminal initiates a voice conversation after recognizing the voice signal.
  • This method has privacy issues in public places, and the interaction is relatively lengthy.
  • In the second, the user performs a specific action, such as raising the wrist with a pronounced motion or long-pressing a physical key (such as a power button or a sports shortcut key).
  • Pressing a key requires more force and introduces delay, the restart/shutdown interface may be called up accidentally, and the interaction process is not concise enough.
  • the present application provides a voice interaction method and related equipment, which can improve the sensitivity of detecting interactive operations.
  • the present application provides a voice interaction method, applied to a first device, the method comprising:
  • When the first device detects a first event, it obtains inertial measurement unit (IMU) data and illuminance data of the first device; determines whether the user has performed a first preset action based on the IMU data and illuminance data of the first device; if it is determined that the user has performed the first preset action, the first device activates the microphone of the first device and obtains a first audio signal collected by the microphone; determines whether the type of the first audio signal is an approaching human voice based on the first audio signal; if it is determined that the type of the first audio signal is an approaching human voice, the first device activates the voice assistant of the first device.
  • the first preset action is a wrist raising action or a hand raising action.
  • the event of the user approaching the first device can be detected more accurately through IMU data and illumination data.
  • the solution of the present application is more sensitive and can achieve detection when the user naturally raises his wrist or hand.
  • the first event is a wrist-raising hardware interrupt event, a hand-raising hardware interrupt event of the first device, a key-pressing screen-lighting event of the first device, a hand-raising screen-lighting event of the first device, or a wrist-raising screen-lighting event of the first device.
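The staged gating described above can be sketched as follows. This is only an illustrative outline, not the implementation in the application: the function name `handle_first_event`, the `WakeDecision` type, and the three callables passed in are hypothetical placeholders for the action model, the microphone driver, and the sound-type classifier.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class WakeDecision:
    microphone_started: bool
    assistant_started: bool


def handle_first_event(
    imu: Sequence[float],
    illuminance: Sequence[float],
    is_first_preset_action: Callable[[Sequence[float], Sequence[float]], bool],
    record_audio: Callable[[], Sequence[float]],
    is_near_human_voice: Callable[[Sequence[float]], bool],
) -> WakeDecision:
    """Run the two-stage gate once a first event has been detected.

    All callables are hypothetical stand-ins supplied by the caller.
    """
    # Stage 1: motion and light check. The microphone stays off if it fails.
    if not is_first_preset_action(imu, illuminance):
        return WakeDecision(microphone_started=False, assistant_started=False)
    # Stage 2: audio is captured and classified only after stage 1 passes.
    audio = record_audio()
    return WakeDecision(
        microphone_started=True,
        assistant_started=is_near_human_voice(audio),
    )
```

Keeping the microphone behind the motion check means audio is never captured for events that do not look like a wrist raise or hand raise.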
  • the method of the present application further includes:
  • starting a voice assistant of the first device includes:
  • the voice assistant of the first device is activated.
  • the second preset action is a wrist-raising and holding action or a hand-raising and holding action.
  • After starting the microphone of the first device, the first device continuously obtains the IMU data and illumination data it collects and determines from them whether the user has held the wrist-raising action or the hand-raising action. The voice assistant of the first device is started only when it is determined that the user has held the action and the type of the first audio signal is a nearby human voice, which helps reduce the false wake-up rate and interference.
  • the method of the present application further includes:
  • the first device turns off the microphone of the first device.
  • the first device turns off the microphone, which can reduce the power consumption of the first device and also reduce the false wake-up rate of the voice assistant.
  • the duration of collecting the IMU data and illumination data used to determine whether the user has performed a first preset action is a first preset duration
  • the duration of collecting the IMU data and illumination data used to determine whether the user has performed a second preset action is a second preset duration, wherein the first preset duration is greater than the second preset duration
  • By setting the second preset duration to be shorter than the first preset duration, there is no need to collect a long window of IMU data and illumination data before determining whether the user has performed the second preset action, so the second preset action can be detected quickly.
  • The IMU data and illuminance data used to determine whether the user has performed the second preset action can each be interpolated so that their equivalent duration matches the collection duration of the IMU data and illuminance data used to determine whether the user has performed the first preset action. The same prediction model can then be used to predict both the first preset action and the second preset action.
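As a minimal sketch of the interpolation step, the shorter window can be stretched to the sample count of the longer one. Linear interpolation is an assumption here (the text does not name the interpolation method), and `stretch_to_length` is a hypothetical helper name.

```python
from typing import List, Sequence


def stretch_to_length(samples: Sequence[float], target_len: int) -> List[float]:
    """Linearly interpolate a series to target_len evenly spaced points."""
    if len(samples) < 2 or target_len < 2:
        raise ValueError("need at least two input samples and two output points")
    # Map output index i onto a fractional position in the input series.
    scale = (len(samples) - 1) / (target_len - 1)
    out = []
    for i in range(target_len):
        pos = i * scale
        lo = min(int(pos), len(samples) - 2)  # left neighbour, clamped at the end
        frac = pos - lo
        out.append(samples[lo] * (1.0 - frac) + samples[lo + 1] * frac)
    return out
```

For example, a three-sample window `[0, 1, 2]` stretched to five points gives `[0.0, 0.5, 1.0, 1.5, 2.0]`, so a model trained on five-sample windows can consume it unchanged.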
  • the first event is a wrist-raising hardware interrupt event, a hand-raising hardware interrupt event of the first device, a hand-raising screen-lighting event of the first device, or a wrist-raising screen-lighting event of the first device, and obtaining IMU data and illumination data of the first device includes:
  • Start the ambient light sensor of the first device, collect illumination data of the environment of the first device while continuing to collect IMU data, obtain the cached IMU data related to the first event, and set the illumination data related to the first event to zero.
  • the equivalent duration of the illuminance data can be made consistent with the collection duration of the IMU data.
  • the data from the complete wrist-raising stage or hand-raising stage can be used to determine whether the user has performed the first preset action, thereby making the judgment result more accurate and improving the wake-up success rate and reducing the false wake-up rate in terms of user experience.
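The alignment of cached IMU data with zero-filled illuminance data can be sketched as below. It assumes, purely for illustration, that both sensors are sampled at the same rate so that one IMU sample corresponds to one illuminance sample; `align_windows` is a hypothetical helper name.

```python
from typing import List, Sequence, Tuple


def align_windows(
    cached_imu: Sequence[float],
    live_imu: Sequence[float],
    live_illuminance: Sequence[float],
) -> Tuple[List[float], List[float]]:
    """Build equal-length IMU and illuminance windows.

    The ambient light sensor was off while the cached IMU samples were
    recorded, so that part of the illuminance window is filled with zeros.
    """
    imu = list(cached_imu) + list(live_imu)
    illuminance = [0.0] * len(cached_imu) + list(live_illuminance)
    return imu, illuminance
```

With this padding the IMU window covers the whole raising motion, including the part before the first event fired, while the illuminance window keeps the same length.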
  • the duration of the first audio signal is less than or equal to 0.5 s, and determining whether the type of the first audio signal is an approaching human voice according to the first audio signal is implemented by a digital signal processor (DSP) of the first device.
  • The purpose is to enable the text corresponding to the first audio signal to be echoed in real time when the voice assistant is subsequently started to recognize the first audio signal.
  • the first audio signal is an audio signal that has not been processed by automatic gain control (AGC), noise reduction, dereverberation or compression, so the information about the approaching human voice is not lost.
  • Determining the sound type from the original audio signal, which has not been processed by AGC, noise reduction, dereverberation or compression, improves the accuracy of the determined sound type of the first audio signal.
  • the first audio signal is collected by multiple microphones of the first device.
  • There are wind noise signals in the audio signals collected by different microphones.
  • the wind noise in the audio signal collected by one microphone can be used to reduce the wind noise in the audio signal collected by another microphone, and then the audio signal after wind noise processing can be used to predict the sound type, so as to obtain a more accurate prediction result.
  • the method of the present application further includes:
  • the first device determines the task to be executed based on the audio signal collected by the microphone of the first device; if the task to be executed is a sensitive task, the first device obtains the identity information of the user; after determining that the user is a target user based on the identity information of the user, the first device executes the task to be processed.
  • the above-mentioned user is a user of the first device, and the target user is an owner of the first device.
  • The above method prevents anyone other than the owner of the first device from performing security-sensitive voice tasks through the first device, thereby protecting the information security of the owner's device.
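The sensitive-task check can be sketched as follows. The task names, the `handle_task` function, and the identity callback are hypothetical; the text does not say how identity information is obtained, and running non-sensitive tasks without a check is an assumption of this sketch.

```python
from typing import Callable, FrozenSet

# Hypothetical examples of tasks treated as sensitive.
SENSITIVE_TASKS: FrozenSet[str] = frozenset({"pay", "unlock_door"})


def handle_task(
    task: str,
    get_identity: Callable[[], str],
    owner_id: str,
    execute: Callable[[str], None],
    sensitive_tasks: FrozenSet[str] = SENSITIVE_TASKS,
) -> bool:
    """Execute a voice task, verifying identity first if it is sensitive.

    Returns True if the task was executed, False if it was refused.
    """
    if task in sensitive_tasks:
        # Identity is requested only for sensitive tasks.
        if get_identity() != owner_id:
            return False
    execute(task)
    return True
```

Passing `get_identity` as a callable keeps the sketch independent of whichever identification mechanism the device actually uses.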
  • the method of the present application further includes:
  • After starting the voice assistant, the first device displays the text corresponding to the collected audio signal in real time.
  • an electronic device including:
  • an acquisition unit configured to acquire IMU data and illumination data of the electronic device when a first event is detected
  • a determination unit configured to determine whether the user has performed a first preset action according to IMU data and illumination data of the electronic device
  • an activation unit configured to activate a microphone of the electronic device if it is determined that the user has performed the first preset action
  • an acquisition unit configured to acquire a first audio signal collected by the microphone
  • the determination unit is further used to determine whether the type of the first audio signal is an approaching human voice according to the first audio signal
  • the activation unit is further used to activate the voice assistant of the electronic device if it is determined that the type of the first audio signal is an approaching human voice.
  • the first event is a wrist-raising hardware interrupt event, a hand-raising hardware interrupt event of the electronic device, a key-pressing screen-lighting event of the electronic device, a hand-raising screen-lighting event of the electronic device, or a wrist-raising screen-lighting event of the electronic device.
  • the acquisition unit is further configured to continuously acquire the IMU data and illumination data collected by the electronic device after the activation unit activates the microphone of the electronic device.
  • the determination unit is further used to determine whether the user has performed a second preset action based on the IMU data and illumination data collected by the electronic device;
  • the activation unit is specifically used for:
  • the voice assistant of the electronic device is activated.
  • the activation unit is further configured to:
  • the microphone of the electronic device is turned off.
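  • the duration for collecting the IMU data and illumination data used to determine whether the user has performed the first preset action is a first preset duration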
  • the duration for collecting the IMU data and illumination data used to determine whether the user has performed the second preset action is a second preset duration, wherein the first preset duration is greater than the second preset duration
  • the first event is a wrist-raising hardware interrupt event, a hand-raising hardware interrupt event of an electronic device, a hand-raising screen-lighting event of an electronic device, or a wrist-raising screen-lighting event of an electronic device.
  • the acquisition unit is specifically used for:
  • Start the ambient light sensor of the electronic device, collect illumination data of the environment of the electronic device while continuing to collect IMU data, obtain the cached IMU data related to the first event, and set the illumination data related to the first event to zero.
  • a duration of the first audio signal is less than or equal to 0.5 s, and determining whether the type of the first audio signal is an approaching human voice based on the first audio signal is implemented by a digital signal processor (DSP) of the electronic device.
  • the first audio signal is collected by multiple microphones of the electronic device.
  • the determination unit is further configured to determine the task to be performed according to an audio signal collected by a microphone of the electronic device;
  • the acquisition unit is further used to acquire the identity information of the user if the task to be executed is a sensitive task;
  • the electronic device further includes:
  • an execution unit, used to execute the task to be executed after determining, based on the user's identity information, that the user is a target user.
  • the electronic device further includes:
  • the display unit is used to display the text corresponding to the collected audio signal in real time after the voice assistant is started.
  • the present application provides another voice interaction method, applied to a first device, the method comprising:
  • When the first device detects a first event, the first device obtains the IMU data of the first device; determines whether the user has performed a first preset action based on the IMU data of the first device; if it is determined that the user has performed the first preset action, the first device activates the microphone of the first device and obtains a first audio signal collected by the microphone; determines whether the type of the first audio signal is an approaching human voice based on the first audio signal; if it is determined that the type of the first audio signal is an approaching human voice, the first device activates the voice assistant of the first device.
  • the first event is a wrist raising event of the first device, a hand raising event of the first device, or a wrist turning event of the first device; and the wrist raising event of the first device, the hand raising event of the first device, and the wrist turning event of the first device are all obtained through the same event prediction model.
  • The above event prediction model can be applied not only to the detection of wrist raising events, wrist turning events and hand raising events, but also to the detection of events in other applications, such as raise-wrist-to-speak and raise-wrist-to-light-up applications, which is not limited here. The event prediction model can therefore be regarded as a public capability of the first device.
  • Because the first event is detected using this public capability of the first device, there is no need to train a neural network model specifically for detecting the first event, which reduces the workload and the computing power consumption of the first device. After the first event is detected, it is further determined whether the user has performed the first preset action.
  • If so, the microphone of the first device is activated to obtain the first audio signal collected by the microphone, and it is determined from the first audio signal whether its type is a nearby human voice. If the type of the first audio signal is a nearby human voice, the first device activates the voice assistant of the first device. This method helps reduce the probability of the voice assistant of the first device being falsely activated.
  • the first device determines whether the user has performed a first preset action according to the IMU data of the first device, including:
  • the first device calculates the posture information of the first device based on the IMU data of the first device; obtains the acceleration information of the first device from the IMU data of the first device; if the posture information of the first device is within a preset posture range, the acceleration information of the first device is within a preset acceleration range, and the first event duration is within a preset duration range, the first device determines that the user has performed a first preset action.
  • This method is helpful to improve the accuracy of determining whether the user has performed the first preset action.
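The three-way range check on posture, acceleration and event duration might look like the sketch below. The tilt formula (pitch and roll from the gravity direction in a single accelerometer sample) is a common technique swapped in for illustration, not necessarily how the application computes posture, and every threshold value is a hypothetical placeholder.

```python
import math
from typing import Tuple

Range = Tuple[float, float]


def tilt_degrees(ax: float, ay: float, az: float) -> Tuple[float, float]:
    """Estimate pitch and roll in degrees from one accelerometer sample."""
    pitch = math.degrees(math.atan2(-ax, math.hypot(ay, az)))
    roll = math.degrees(math.atan2(ay, az))
    return pitch, roll


def in_range(value: float, bounds: Range) -> bool:
    return bounds[0] <= value <= bounds[1]


def is_first_preset_action(
    accel: Tuple[float, float, float],
    event_duration_s: float,
    pitch_range: Range = (-30.0, 30.0),    # hypothetical posture range
    roll_range: Range = (-30.0, 30.0),     # hypothetical posture range
    accel_range: Range = (8.0, 12.0),      # hypothetical, in m/s^2
    duration_range: Range = (0.2, 1.5),    # hypothetical, in seconds
) -> bool:
    """Return True only if posture, acceleration and duration are all in range."""
    ax, ay, az = accel
    pitch, roll = tilt_degrees(ax, ay, az)
    magnitude = math.sqrt(ax * ax + ay * ay + az * az)
    return (
        in_range(pitch, pitch_range)
        and in_range(roll, roll_range)
        and in_range(magnitude, accel_range)
        and in_range(event_duration_s, duration_range)
    )
```

Requiring all three conditions at once is what lets a plain threshold check reject motions that match the posture but are too slow, too fast, or too violent.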
  • the method of the present application further includes:
  • starting a voice assistant of the first device includes:
  • the voice assistant of the first device is activated.
  • the second preset action is a wrist-raising and holding action or a hand-raising and holding action.
  • After starting the microphone of the first device, the first device continuously obtains the IMU data and illumination data it collects and determines from them whether the user has held the wrist-raising action or the hand-raising action. The voice assistant of the first device is started only when it is determined that the user has held the action and the type of the first audio signal is a nearby human voice, which helps reduce the false wake-up rate and interference.
  • the method of the present application further includes:
  • the first device turns off the microphone of the first device.
  • the first device turns off the microphone, which can reduce the power consumption of the first device and also reduce the false wake-up rate of the voice assistant.
  • the duration of the first audio signal is less than or equal to 0.5 s, and determining whether the type of the first audio signal is an approaching human voice according to the first audio signal is implemented by a digital signal processor (DSP) of the first device.
  • the text corresponding to the first audio signal can be echoed in real time when the voice assistant is subsequently started to recognize the first audio signal.
  • the first audio signal is an audio signal that has not been processed by AGC, noise reduction, dereverberation or compression, so the information about the approaching human voice is not lost.
  • the sound type of the first audio signal is determined by using the original audio signal that has not been processed by AGC, noise reduction, dereverberation or compression, which can improve the accuracy of the sound type of the determined first audio signal.
  • the first audio signal is collected by multiple microphones of the first device.
  • There are wind noise signals in the audio signals collected by different microphones.
  • the wind noise in the audio signal collected by one microphone can be used to reduce the wind noise in the audio signal collected by another microphone, and then the audio signal after wind noise processing can be used to predict the sound type, so as to obtain a more accurate prediction result.
  • the method of the present application further includes:
  • the first device determines the task to be executed based on the audio signal collected by the microphone of the first device; if the task to be executed is a sensitive task, the first device obtains the identity information of the user; after determining that the user is a target user based on the identity information of the user, the first device executes the task to be processed.
  • the above-mentioned user is a user of the first device, and the target user is an owner of the first device.
  • the above method can prevent a person who is not the first device owner from performing security-sensitive voice tasks through the first device, thereby ensuring the information security of the device of the first device owner.
  • the method of the present application further includes:
  • After starting the voice assistant, the first device displays the text corresponding to the collected audio signal in real time.
  • the present application provides another electronic device, including:
  • An acquisition unit configured to acquire IMU data of the first device when a first event is detected
  • a determination unit configured to determine whether the user has performed a first preset action according to the IMU data of the first device
  • the activation unit is used to activate the microphone of the first device if it is determined that the user has performed the first preset action, and the acquisition unit is further used to acquire the first audio signal collected by the microphone;
  • the determination unit is further used to determine whether the type of the first audio signal is an approaching human voice according to the first audio signal
  • the activation unit is further used to activate the voice assistant of the first device if it is determined that the type of the first audio signal is an approaching human voice.
  • the first event is a wrist raising event of the first device, a hand raising event of the first device, or a wrist turning event of the first device; and the wrist raising event of the first device, the hand raising event of the first device, and the wrist turning event of the first device are all obtained through the same event prediction model.
  • in determining whether the user has performed the first preset action according to the IMU data of the first device, the determination unit is specifically configured to:
  • calculate the posture information of the first device based on the IMU data of the first device; obtain the acceleration information of the first device from the IMU data of the first device; and determine that the user has performed the first preset action if the posture information of the first device is within a preset posture range, the acceleration information of the first device is within a preset acceleration range, and the duration of the first event is within a preset duration range.
  • the acquisition unit is further configured to continuously acquire the IMU data and illumination data collected by the electronic device after the activation unit activates the microphone of the electronic device.
  • the determination unit is further used to determine whether the user has performed a second preset action based on the IMU data and illumination data collected by the electronic device;
  • the activation unit is specifically used for:
  • the voice assistant of the electronic device is activated.
  • the duration of the first audio signal is less than or equal to 0.5 s, and determining whether the type of the first audio signal is an approaching human voice according to the first audio signal is implemented by a DSP of the electronic device.
  • the first audio signal is collected by multiple microphones of the electronic device.
  • the determination unit is further configured to determine the task to be performed according to an audio signal collected by a microphone of the electronic device;
  • the acquisition unit is further used to acquire the identity information of the user if the task to be executed is a sensitive task;
  • the electronic device further includes:
  • an execution unit, used to execute the task to be executed after determining, based on the user's identity information, that the user is a target user.
  • the electronic device further includes:
  • the display unit is used to display the text corresponding to the collected audio signal in real time after the voice assistant is started.
  • the present application provides another electronic device, including a processor and a memory.
  • the memory is used to store program code.
  • the processor is used to call the program code stored in the memory to execute the method provided by the first aspect, the third aspect, any possible implementation of the first aspect, or any possible implementation of the third aspect.
  • the present application provides a computer storage medium comprising computer instructions, which, when executed on an electronic device, enables the electronic device to execute a method as provided in any possible implementation of the first aspect.
  • the present application provides a computer program product, which, when executed on a computer, enables the computer to execute a method as provided in the first aspect, the third aspect, any possible implementation of the first aspect, or any possible implementation of the third aspect.
  • the electronic device described in the second aspect, the electronic device described in the third aspect, the computer storage medium described in the fourth aspect, or the computer program product described in the fifth aspect provided above are all used to execute the method provided in the first aspect, the third aspect, any possible implementation of the first aspect, or the method provided in any possible implementation of the third aspect. Therefore, the beneficial effects that can be achieved can refer to the beneficial effects in the corresponding method, which will not be repeated here.
  • FIG. 1a is a schematic diagram of a scenario of a voice interaction method provided in an embodiment of the present application.
  • FIG. 1b is a schematic diagram of another scenario of a voice interaction method provided in an embodiment of the present application.
  • FIG. 1c is a schematic diagram of another scenario of a voice interaction method provided in an embodiment of the present application.
  • FIG. 1d is a schematic diagram of another scenario of a voice interaction method provided in an embodiment of the present application.
  • FIG. 2 is a flow chart of a voice interaction method provided in an embodiment of the present application.
  • FIG. 3 is a schematic diagram showing changes in illumination data and IMU data during wrist raising.
  • FIG. 4 is a schematic diagram of the audio signal strength collected by the wrist-worn device within the deflection angle range provided by an embodiment of the present application.
  • FIG. 5 is a schematic diagram of the strength of audio signals collected by the main microphone and the auxiliary microphone provided in an embodiment of the present application.
  • FIG. 6 is a schematic diagram of a preset area provided in an embodiment of the present application.
  • FIG. 7 is a schematic diagram of lip height and lip width provided in an embodiment of the present application.
  • FIG. 8 is a schematic diagram of the structure of an electronic device provided in an embodiment of the present application.
  • FIG. 8a is a schematic diagram of the structure of another electronic device provided in an embodiment of the present application.
  • FIG. 9 is a schematic diagram of the structure of another electronic device provided in an embodiment of the present application.
  • Multiple means two or more.
  • “And/or” describes the relationship between related objects and indicates that three relationships are possible. For example, A and/or B can mean: A exists alone, A and B exist at the same time, or B exists alone. The character “/” generally indicates that the related objects are in an “or” relationship.
  • Intelligent personal assistant is a personal virtual assistant driven by artificial intelligence, which can also be called intelligent assistant, voice assistant, etc. It allows users to easily operate various functions of smart terminals (such as smartphones, smart watches, tablets, laptops, AI speakers, smart large screens and smart cockpits) or conduct Internet searches through voice interaction, such as setting alarms on smartphones, reading emails using text-to-speech technology, playing and searching music, and sending text messages.
  • Specific intelligent personal assistants include Huawei's Xiaoyi (the corresponding overseas version is Celia), Apple's Siri, Amazon's Alexa, Microsoft's Cortana and Google's Google Assistant.
  • A wrist-worn device is a smart device worn on the user's wrist. Depending on whether it has a screen, the size of the screen, the computing power of the application processor, and the degree of intelligence (such as whether it can download and install apps on its own, and whether it supports voice assistants), wrist-worn devices can be divided into smart watches, sports watches, and smart bracelets.
  • IMU Inertial measurement unit
  • IMU is a device that measures the angular velocity and acceleration of a rigid object in three-dimensional space. It is generally used to identify and track the posture and movement of a device, such as the tilt angle of the device relative to the horizontal plane, the orientation of the device (i.e. the deflection angle from the geomagnetic north pole), the cumulative rotation angle of the device, the relative motion speed and displacement, etc.
  • IMU wrist-raise hardware interrupt: the low-power primary wrist-raise detection implemented inside the IMU device generates a hardware interrupt once a wrist-raise event is detected.
  • the ambient light sensor can sense the surrounding light conditions, collect environmental illumination data, and inform the processing chip to automatically adjust the screen brightness and reduce the power consumption of the smart terminal.
  • FIG. 1a is a schematic diagram of an application scenario of a voice interaction method provided in an embodiment of the present application. As shown in Figure 1a:
  • When the screen of the wrist-worn device is off, the user directly and naturally moves the wrist close to the lips.
  • the user speaks out the voice task
  • the microphone of the wrist-worn device collects the user's audio signal
  • the voice assistant of the wrist-worn device recognizes the voice task based on the audio signal and responds to the voice task.
  • FIG. 1b is a schematic diagram of an application scenario of another voice interaction method provided in an embodiment of the present application. As shown in Figure 1b:
  • When the screen of the wrist-worn device is on, for example while the user is reading information on the screen or operating the wrist-worn device, the user raises the wrist so that the lips face the screen of the wrist-worn device and are close to it. The user then speaks a voice task, such as "Today's weather", and the wrist-worn device collects the user's audio signal.
  • the voice assistant of the wrist-worn device recognizes the voice task based on the audio signal and responds to the voice task, for example, the wrist-worn device outputs "Today's weather is fine".
  • FIG. 1c is a schematic diagram of an application scenario of another voice interaction method provided in an embodiment of the present application. As shown in Figure 1c:
  • When the phone screen is off, the user presses the power button to turn on the screen, picks up the phone from the desktop, holds the bottom of the phone close to the mouth, and then speaks the voice task.
  • the phone collects the user's audio signal, and the phone's voice assistant recognizes the voice task based on the audio signal and responds to the voice task.
  • FIG. 1d is a schematic diagram of an application scenario of a voice interaction method provided in an embodiment of the present application. As shown in Figure 1d:
  • the user's face is facing the large-screen device, either directly facing the large screen, or tilted to the left at a preset angle, or tilted to the right at a preset angle.
  • the user then speaks out a voice task, such as "Today's weather.”
  • The large-screen device collects the user's audio signal, recognizes the voice task based on the audio signal, and responds to the voice task. For example, the large-screen device outputs "Today's weather is fine."
  • The voice interaction method of the present application can be applied not only to the above four scenarios but also to other scenarios, which are not limited here.
  • Figure 2 is a flow chart of a voice interaction method provided in an embodiment of the present application. As shown in Figure 2, the method includes:
  • the first device may be a wrist-worn device of the user or a user terminal device
  • the user terminal device may be a smart phone, a tablet computer, etc.
  • the first event is a wrist-raising hardware interrupt event of the first device, a hand-raising hardware interrupt event of the first device, a key-pressing screen-lighting event of the first device, a hand-raising screen-lighting event of the first device, or a wrist-raising screen-lighting event of the first device.
  • The wrist-raising hardware interrupt event and the hand-raising hardware interrupt event of the first device rely on the low-power primary wrist-raising detection implemented inside the IMU: when a small degree of wrist-raising or wrist-turning behavior is detected (for example, based on the acceleration data collected by the IMU of the first device, if the acceleration exceeds 30% of the gravitational acceleration (9.8 m/s²), the behavior is determined to be a wrist-raising, wrist-turning, or hand-raising behavior), the IMU of the first device generates a hand-raising hardware interrupt or a wrist-raising hardware interrupt.
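  • For illustration only, the 30% threshold described above may be sketched as follows. The sketch assumes that the compared quantity is the magnitude of the gravity-removed acceleration vector; the text does not say which quantity is compared, and the function name `is_primary_raise` and the sample values are hypothetical.

```python
import math

G = 9.8                 # gravitational acceleration in m/s^2, as quoted in the text
THRESHOLD_RATIO = 0.30  # 30% of g, as quoted in the text


def is_primary_raise(samples, g=G, ratio=THRESHOLD_RATIO):
    """Return True if any sample's acceleration magnitude exceeds ratio * g.

    `samples` is an iterable of (ax, ay, az) tuples in m/s^2. Treating them
    as gravity-removed values is an assumption of this sketch.
    """
    limit = ratio * g
    return any(math.sqrt(ax * ax + ay * ay + az * az) > limit
               for ax, ay, az in samples)


# Hypothetical sample windows: a resting wrist and a wrist being raised.
still = [(0.1, 0.0, 0.2), (0.0, 0.1, 0.1)]
raising = [(0.5, 1.0, 2.0), (1.5, 2.0, 2.5)]
print(is_primary_raise(still), is_primary_raise(raising))
```

  • A check this cheap is the kind of test that can run continuously at low power; the more accurate decision is left to the later detection stages.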
  • This interrupt event can be subscribed to by other software modules to trigger the start-up of other software.
  • the wrist-raising screen-lighting software is started to perform another wrist-raising or wrist-turning behavior detection to achieve an accurate and reliable wrist-raising screen-lighting function (for example, to achieve the highest possible success rate and the lowest possible false touch rate).
  • the key-press screen-on event of the first device refers to the key-press screen-on event generated when the first device detects the user's operation on the key of the first device (such as the power button, volume + button, volume - button). This event can be subscribed by other software modules to trigger other software to start running, such as rendering software starting interface rendering calculations, and then submitting the generated picture information to the screen for display. It should be understood that the key-press screen-on event of the first device is generated when the first device determines that the conditions for lighting up the screen of the first device are met, and at this time the first device has not yet controlled the lighting of the screen of the first device.
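  • The subscription relationship described above, in which other software modules subscribe to an event and are started when it is generated, may be sketched as a small publish/subscribe registry. The class `EventBus`, the event names, and the callbacks are hypothetical and are not taken from the present application.

```python
from collections import defaultdict


class EventBus:
    """Tiny publish/subscribe registry (hypothetical, for illustration)."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, event_name, callback):
        """Register a callback to be run when event_name is published."""
        self._subscribers[event_name].append(callback)

    def publish(self, event_name):
        """Run every subscriber in registration order and collect the results."""
        return [callback(event_name) for callback in self._subscribers[event_name]]


bus = EventBus()
bus.subscribe("wrist_raise_interrupt",
              lambda e: f"screen-lighting check started by {e}")
bus.subscribe("key_press_screen_on",
              lambda e: f"rendering started by {e}")
results = bus.publish("key_press_screen_on")
print(results)
```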
  • the wrist-raising event or hand-raising event of the first device to light up the screen refers to the wrist-raising event or hand-raising event detected by the first device in which the user attempts to light up the screen.
  • the wrist-raising event or hand-raising event can be determined by the first device based on the IMU data collected by the IMU. It should be understood that the wrist-raising event or hand-raising event to light up the screen of the first device is generated when the first device determines that the conditions for lighting up the screen of the first device have been met.
  • Here, the first device detecting the wrist-raising event or hand-raising event in which the user attempts to light up the screen means that the wrist-raising or hand-raising screen-lighting software of the first device detects that event.
  • The first event may have been triggered accidentally by the user, or the user may have caused the first device to detect the first event for other purposes.
  • the IMU data of the wrist-worn device will also change, which may trigger a wrist-raising hardware interrupt event or a wrist-raising screen-lighting event, but the user does not want to raise his wrist to interact with the wrist-worn device by voice.
  • the IMU data of the mobile phone will change, which may trigger a hand-raising interrupt event or a hand-raising screen-lighting event.
  • The user may turn on the screen of the smartphone through the power button, which will trigger the key-press screen-lighting event of the mobile phone, but the user does not intend to raise the hand to interact with the phone by voice.
  • the first device obtains the IMU data and the illuminance data of the first device, and determines whether the user has performed the first preset action through the IMU data and the illuminance data of the first device.
  • When the first event is a key-press screen-lighting event of the first device, the first device obtains the IMU data and (ambient) illumination data of the first device as follows: the first device starts the ambient light sensor to collect illumination data, starts the IMU to collect IMU data, and obtains the collected illumination data and IMU data, wherein the collection duration of the illumination data is the same as that of the IMU data.
  • the first device obtains IMU data and illumination data of the first device, including:
  • Starting the ambient light sensor of the first device to collect the (ambient) illumination data of the first device while continuing to collect IMU data, obtaining the cached IMU data related to the first event, and setting the illumination data related to the first event to zero (because the ambient light sensor had not been started during the first event).
  • the reason for setting to zero is that the illumination in the lower space is generally lower.
  • the detection of the wrist-raising hardware interrupt event of the first device, the hand-raising hardware interrupt event of the first device, the wrist-raising screen-lighting event of the first device, or the hand-raising screen-lighting event of the first device utilizes the IMU data of the first device, that is, before the wrist-raising hardware interrupt event of the first device, the hand-raising hardware interrupt event of the first device, the wrist-raising screen-lighting event of the first device, or the hand-raising screen-lighting event of the first device is detected, the IMU of the first device has been started and the IMU data has been collected.
  • the IMU data used to determine whether the user has performed the first preset action can be regarded as two parts, one part is collected before the first event is detected, and this part of data can be called IMU data related to the first event, and the other part is collected when the first event is detected.
  • The two parts together form the complete IMU data of one wrist-raising or hand-raising action. Using the complete IMU data makes it possible to determine more accurately whether the user has performed the first preset action.
  • Because the collection duration of the illuminance data used to determine whether the user has performed the first preset action needs to equal the collection duration of the IMU data, the illuminance data is zero-padded before the illuminance data and the IMU data are used for that determination. The data obtained by zero-padding is the illuminance data related to the first event.
  • The equivalent duration of the illuminance data after zero-padding is equal to the collection duration of the IMU data.
  • the collection time of the IMU data is the first preset time.
  • The sensor can collect a certain amount of data after working for a period of time.
  • the amount of data can be characterized by the duration.
  • the equivalent duration here is the size of the data amount.
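  • A minimal sketch of the zero-padding described above follows. It assumes that the IMU and the ambient light sensor are sampled at the same rate, so that equal sample counts correspond to equal equivalent durations; the function name `pad_illuminance` and the sample values are hypothetical.

```python
def pad_illuminance(illuminance, imu_sample_count):
    """Left-pad illuminance samples with zeros to match the IMU sample count.

    The zeros stand in for the period before the first event, when the
    ambient light sensor had not yet been started.
    """
    missing = imu_sample_count - len(illuminance)
    if missing < 0:
        raise ValueError("illuminance series is longer than the IMU series")
    return [0.0] * missing + list(illuminance)


imu_samples = 10                             # cached plus newly collected IMU samples
lux_after_event = [12.0, 35.0, 80.0, 150.0]  # collected only after the first event
padded = pad_illuminance(lux_after_event, imu_samples)
print(padded)
```

  • Padding at the front keeps the two series aligned in time, so that each IMU sample collected before the first event is paired with a zero illuminance value.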
  • the first device determines whether the user has performed a first preset action based on the IMU data and illumination data of the first device.
  • the first preset action may be a wrist raising action or a hand raising action.
  • the wrist-worn device moves with the user's wrist, generating specific posture movements (such as the inclination angle of the dial with the horizontal plane changes to face the user's lips) and position movements (such as rising from the lower space to the upper space, and the distance from the vertical plane of the user's body is reduced), and the corresponding change in acceleration is shown in the curve on the right of Figure 3.
  • the illuminance value collected by the ambient light sensor of the handheld device will also tend to rise.
  • the handheld device will also produce specific posture movements (such as the inclination angle of the mobile phone screen to the horizontal plane becomes close to zero degrees, so that the microphone at the bottom of the mobile phone is facing the user's lips) and position movements.
  • the right figure of Figure 3 shows the acceleration of the wrist-worn device in the x-axis, y-axis and z-axis directions in the local coordinate system when the user raises his wrist to bring the wrist-worn device close to his lips.
  • the origin of the local coordinate system is the center of gravity of the wrist-worn device, and the x-axis points to the surface of the wrist-worn device.
  • the y-axis points to the 3 o'clock direction of the dial, the z-axis points to the top of the dial.
  • the first device will produce specific posture movements and position movements when approaching the user's lips. Therefore, it is possible to determine whether the user has performed a wrist-raising action or a hand-raising action based on the illumination data and IMU data of the wrist-worn device.
  • a prediction model is pre-trained to predict whether the user has performed a wrist-raising action or a hand-raising action.
  • the prediction model can be implemented based on a decision tree or a convolutional neural network.
  • the same prediction model can be used for prediction, or different prediction models can be used for prediction.
  • the training data includes acceleration data and illuminance data collected for a fixed time (e.g., 0.5s) and their derivative data, such as the maximum value, mean, skewness, kurtosis, histogram, sliding average sequence generated according to a certain window size, and amplitude value sequence of the acceleration vector of the acceleration data.
  • the training data is collected for different groups of users, including the elderly and the youth, the male and the female, and/or the left-hand wearing/holding group and the right-hand wearing/holding group.
  • the illuminance data in the training data may also include data collected in different environments, including an outdoor sunny environment, an indoor normal lighting environment, and a dark light environment.
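  • Some of the derivative data listed above (maximum value, mean, skewness, kurtosis, histogram, sliding average sequence, and amplitude sequence of the acceleration vector) may be computed as in the following sketch. The window size, the number of histogram bins, and the use of population (biased) moment estimators are assumptions, since the text does not specify them.

```python
import math
from statistics import fmean, pstdev


def derived_features(acc, window=5, bins=4):
    """Compute summary features from a list of (x, y, z) acceleration samples."""
    # Amplitude sequence of the acceleration vector.
    mag = [math.sqrt(x * x + y * y + z * z) for x, y, z in acc]
    mean = fmean(mag)
    std = pstdev(mag)
    if std > 0:
        skew = fmean([((m - mean) / std) ** 3 for m in mag])
        kurt = fmean([((m - mean) / std) ** 4 for m in mag])
    else:
        skew = kurt = 0.0  # constant signal: report 0 instead of dividing by zero
    lo, hi = min(mag), max(mag)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 when all values are equal
    hist = [0] * bins
    for m in mag:
        hist[min(int((m - lo) / width), bins - 1)] += 1
    # Sliding average sequence generated with the given window size.
    moving_avg = [fmean(mag[i:i + window]) for i in range(len(mag) - window + 1)]
    return {"max": hi, "mean": mean, "skewness": skew, "kurtosis": kurt,
            "histogram": hist, "moving_average": moving_avg, "magnitude": mag}


features = derived_features([(0, 0, 1), (0, 0, 2), (0, 0, 3)], window=2)
print(features["max"], features["mean"], features["moving_average"])
```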
  • the first device inputs the IMU data and the illumination data into the first prediction model for prediction, and a first prediction result can be obtained, for example, the first prediction result is a wrist-raising action.
  • the first prediction model can be a four-classification network, and the first device inputs the IMU data and the illumination data into the first prediction model for prediction.
  • the first prediction model outputs four probabilities, namely, the probability of wrist-raising action, the probability of wrist-dropping action, the probability of wrist-raising and holding action, and the probability of free hand position; the first prediction result is the action corresponding to the maximum probability, such as wrist-raising action.
  • the free hand position here refers to actions other than wrist-raising action, wrist-dropping action, and wrist-raising and holding action.
  • the first prediction model can also be a two-classification network, and the first device inputs the IMU data and the illumination data into the first prediction model for prediction.
  • the first prediction model outputs two probabilities, namely, the probability of wrist-raising action and the probability of other actions, and the first prediction result is the action corresponding to the maximum probability; the other actions here refer to actions other than wrist-raising action.
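  • Taking the action with the maximum probability as the prediction result, as described above for both the four-classification and the two-classification network, may be sketched as follows. The label strings and the probability values are hypothetical, and the model itself is not shown.

```python
WRIST_LABELS = ("wrist_raise", "wrist_drop", "wrist_raise_and_hold", "free_hand")


def prediction_result(probabilities, labels=WRIST_LABELS):
    """Return the label whose probability is largest."""
    if len(probabilities) != len(labels):
        raise ValueError("one probability is required per label")
    best = max(range(len(labels)), key=lambda i: probabilities[i])
    return labels[best]


# Four-classification output (hypothetical probabilities).
four_class = prediction_result([0.70, 0.10, 0.15, 0.05])
# Two-classification output: wrist raise versus any other action.
two_class = prediction_result([0.2, 0.8], labels=("wrist_raise", "other"))
print(four_class, two_class)
```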
  • the wrist-raising and holding action means that after the user raises his wrist, the wrist maintains a certain posture for a period of time (for example, the dial is facing oneself, the pitch angle with the horizontal plane is within plus or minus 15 degrees, and the roll angle is 30 to 60 degrees) and the illumination remains stable (for example, the multiple of the maximum/minimum illumination value is less than 2), and there is no drop.
  • If the IMU data and illumination data collected after the user raises the wrist are within the preset range, the first device determines that the user has performed the wrist-raising and holding action.
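  • Using the example ranges given above (pitch angle within plus or minus 15 degrees, roll angle of 30 to 60 degrees, and a maximum/minimum illumination multiple below 2), a rule-based check of the wrist-raising and holding action may be sketched as follows. Deriving pitch and roll from the IMU data is not shown, and the handling of zero illumination is an assumption of this sketch.

```python
def is_raise_and_hold(pitch_deg, roll_deg, lux):
    """Check the example preset ranges over one observation window.

    pitch_deg and roll_deg are per-sample angles in degrees; lux holds the
    illuminance samples collected over the same window.
    """
    pitch_ok = all(-15.0 <= p <= 15.0 for p in pitch_deg)
    roll_ok = all(30.0 <= r <= 60.0 for r in roll_deg)
    lo, hi = min(lux), max(lux)
    # Stable illumination: max/min multiple below 2. A fully dark window is
    # treated as stable, and a zero minimum otherwise as unstable (assumption).
    lux_ok = hi == 0 or (lo > 0 and hi / lo < 2.0)
    return pitch_ok and roll_ok and lux_ok


held = is_raise_and_hold([5, -3, 10], [40, 45, 50], [100, 120, 150])
dropped = is_raise_and_hold([5, -3, 40], [40, 45, 50], [100, 120, 150])
print(held, dropped)
```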
  • the first device inputs the IMU data and the illumination data into the second prediction model for prediction
  • a second prediction result can be obtained, for example, the second prediction result is a hand-raising action.
  • the second prediction model can be a four-classification network, and the first device inputs the IMU data and the illumination data into the second prediction model for prediction.
  • the second prediction model outputs four probabilities, namely, the probability of the hand-raising action, the probability of the arm-dropping action, the probability of the hand-raising and holding action, and the probability of the free hand position.
  • the second prediction result is the action corresponding to the maximum probability, such as the hand-raising action.
  • the free hand position here refers to the action other than the hand-raising action, the arm-dropping action, and the hand-raising and holding action.
  • the second prediction model can also be a two-classification network, and the first device inputs the IMU data and the illumination data into the second prediction model for prediction.
  • the second prediction model outputs two probabilities, namely, the probability of the hand-raising action and the probability of other actions.
  • the second prediction result is the action corresponding to the maximum probability; the other actions here refer to actions other than the hand-raising action.
  • the training data of the first prediction model and the second prediction model can be found in the above-mentioned related descriptions and will not be described again here.
  • the first device activates a microphone of the first device and obtains a first audio signal collected by the microphone.
  • the first audio signal is collected by multiple microphones of the first device, that is, if it is determined that the user has performed the first preset action, the first device activates multiple microphones of the first device, such as a main microphone and a secondary microphone.
  • the audio signal collected by a single microphone has wind noise, and the accuracy of predicting the sound type of the audio signal based on the audio signal collected by a single microphone is low.
  • both the audio signals collected by the main microphone and the audio signals collected by the auxiliary microphone may have wind noise, but the wind noise intensities in the audio signals collected by the two are different.
  • the first device can use the wind noise in one of the audio signals to process the wind noise in the other audio signal.
  • the wind noise in the audio signal collected by the auxiliary microphone can be used to reduce the wind noise in the audio signal collected by the main microphone, and then the audio signal after the wind noise processing can be used to predict the sound type, so as to obtain a more accurate prediction result.
  • After the microphone of the first device collects the first audio signal, the first audio signal is transmitted to the DSP of the first device.
  • the DSP implements basic voice activity detection (VAD) in hardware; when the energy of the first audio signal exceeds a preset energy threshold, the DSP of the first device generates a hardware VAD interrupt, triggering the software algorithm in the DSP to run, that is, to determine whether the type of the first audio signal is an approaching human voice.
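  • The energy comparison behind the hardware VAD interrupt described above may be sketched in software as follows. The text places this check in DSP hardware; the sketch only illustrates the comparison, and the use of mean squared amplitude as the energy measure and the threshold value are assumptions.

```python
def frame_energy(samples):
    """Mean squared amplitude of one audio frame."""
    return sum(s * s for s in samples) / len(samples)


def vad_triggered(samples, energy_threshold):
    """Return True when the frame energy exceeds the preset energy threshold."""
    return frame_energy(samples) > energy_threshold


ENERGY_THRESHOLD = 0.01          # hypothetical preset energy threshold
silence = [0.001] * 160          # near-silent frame
speech_like = [0.5, -0.5] * 80   # loud frame
print(vad_triggered(silence, ENERGY_THRESHOLD),
      vad_triggered(speech_like, ENERGY_THRESHOLD))
```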
  • the first device determines, according to the first audio signal, whether the type of the first audio signal is an approaching human voice.
  • the first device inputs the first audio signal into the third prediction model for prediction, and obtains a third prediction result.
  • the third prediction result is an approaching human voice.
  • the third prediction model can be a four-classification network.
  • the first device inputs the first audio signal into the third prediction model for prediction.
  • the third prediction model outputs four probabilities, namely, the probability of an approaching human voice, the probability of a nearby human voice, the probability of a distant human voice, and the probability of a non-human voice.
  • the third prediction result is the sound type corresponding to the maximum probability, such as an approaching human voice.
  • the third prediction model can also be a two-classification network.
  • the first device inputs the first audio signal into the third prediction model for prediction.
  • the third prediction model outputs two probabilities, namely, the probability of an approaching human voice and the probability of a non-approaching human voice.
  • the third prediction result is the sound type corresponding to the maximum probability.
  • the first device extracts features of the first audio signal through the third prediction model to obtain audio features of the first audio signal.
  • the audio features include at least one of energy features, time domain features, frequency domain features, music theory features, and perception features.
  • The strategy for extracting audio features is set based on the device scenario and the user scenario, specifically the scenario in which the user raises the wrist and brings the wrist-worn device close to the mouth to speak a voice task. The strategy takes into account the high- and low-frequency distribution characteristics of the human voice after it leaves the mouth, that is, how the energy of the audio signal attenuates as the direction deflects away from directly in front of the lips. As shown in Figure 4, the energy attenuation of the audio signal is largest behind the user's head, and the higher the frequency, the greater the attenuation. When the lips are close to the wrist-worn device, for example 3-5 cm from the screen of the wrist-worn device, the deflection angle of the microphone of the wrist-worn device is 15-55 degrees.
  • the lips are close to the screen to produce specific reflection reverberation characteristics, and the microphone of the wrist-worn device is located at the edge of the wrist-worn device (for example, at 11 o'clock).
  • HLBR: high-low band ratio.
  • FFT: fast Fourier transform.
  • SRMR: speech-to-reverberation modulation energy ratio.
  • the main microphone of the handheld device can collect strong low-frequency breath wind noise; and for sports wind noise or environmental wind noise, the main microphone and the auxiliary microphone can collect strong low-frequency wind noise.
  • the specific preset strategy can be the HLBR features of the main and auxiliary microphones, such as the high-low frequency energy ratio bounded by 200Hz (linear frequency or Mel scale frequency), 4 coefficients of the third-order polynomial fitting of the 128-point FFT, 2 coefficients of the linear regression fitting, or dual-channel MFCCs.
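  • Of the features listed above, the high-low frequency energy ratio bounded by 200 Hz may be sketched as follows for a single microphone channel on a linear frequency scale. The sketch uses a naive discrete Fourier transform for self-containment (an FFT would normally be used), and the sampling rate, frame length, and test tones are hypothetical.

```python
import cmath
import math


def power_spectrum(frame, sample_rate):
    """Return (frequencies, power) for bins 0..N/2 using a naive DFT."""
    n = len(frame)
    freqs, power = [], []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(frame))
        freqs.append(k * sample_rate / n)
        power.append(abs(acc) ** 2)
    return freqs, power


def hlbr(frame, sample_rate, split_hz=200.0):
    """Ratio of spectral energy at or above split_hz to the energy below it."""
    freqs, power = power_spectrum(frame, sample_rate)
    low = sum(p for f, p in zip(freqs, power) if f < split_hz)
    high = sum(p for f, p in zip(freqs, power) if f >= split_hz)
    return high / low if low > 0 else math.inf


RATE = 8000  # hypothetical sampling rate in Hz
high_tone = [math.sin(2 * math.pi * 1000.0 * t / RATE) for t in range(256)]
low_tone = [math.sin(2 * math.pi * 62.5 * t / RATE) for t in range(256)]
print(hlbr(high_tone, RATE) > 1.0, hlbr(low_tone, RATE) < 1.0)
```

  • Strong low-frequency wind noise of the kind described above would push this ratio down, which is one reason such a ratio can help separate an approaching human voice from wind.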
  • the first device extracts the audio feature set value of the collected audio signal according to the preset strategy, specifically: calculates the collected audio data according to the preset strategy to generate specific data of each audio feature.
  • the calculation method of MFCCs value is:
  • the collected audio signal is divided into frames, generally 25ms as one frame, and the frame overlap is 10ms; the periodogram method is used to estimate the power spectrum of each frame of audio signal; the estimated power spectrum is filtered by 85 Mel filters, and the energy in each filter is calculated; the logarithm of the energy in each filter is taken to obtain 85 logarithmic results; the 85 logarithmic results are subjected to discrete cosine transform (DCT), and finally the 2nd to 85th coefficients of DCT are retained, the first coefficient is removed, and finally 84 coefficients are obtained, which are 84 Mel-frequency cepstrum coefficients.
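  • The MFCC calculation steps above may be sketched as follows. The sketch reads "frame overlap is 10 ms" literally, so the hop between 25 ms frames is 15 ms; it also assumes a 16 kHz sampling rate, triangular Mel filters spanning 0 Hz to the Nyquist frequency, a natural logarithm, and an unnormalized DCT-II, none of which are fixed by the text. A naive DFT is used only for self-containment.

```python
import cmath
import math


def hz_to_mel(hz):
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def periodogram(frame):
    """Power spectrum estimate |DFT|^2 / N for bins 0..N/2 (naive DFT)."""
    n = len(frame)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame))) ** 2 / n
            for k in range(n // 2 + 1)]


def mel_filterbank(n_filters, n_bins, sample_rate):
    """Triangular filters evenly spaced on the Mel scale from 0 Hz to Nyquist."""
    top = hz_to_mel(sample_rate / 2.0)
    edges = [mel_to_hz(top * i / (n_filters + 1)) for i in range(n_filters + 2)]
    bin_hz = [(sample_rate / 2.0) * b / (n_bins - 1) for b in range(n_bins)]
    bank = []
    for m in range(1, n_filters + 1):
        lo, mid, hi = edges[m - 1], edges[m], edges[m + 1]
        bank.append([max(0.0, min((f - lo) / (mid - lo), (hi - f) / (hi - mid)))
                     for f in bin_hz])
    return bank


def mfcc(signal, sample_rate, frame_ms=25, overlap_ms=10, n_filters=85):
    """Return one list of (n_filters - 1) coefficients per frame."""
    frame_len = sample_rate * frame_ms // 1000
    hop = frame_len - sample_rate * overlap_ms // 1000
    bank = mel_filterbank(n_filters, frame_len // 2 + 1, sample_rate)
    result = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        power = periodogram(signal[start:start + frame_len])
        # Log energy in each Mel filter; the small offset avoids log(0).
        log_e = [math.log(sum(w * p for w, p in zip(filt, power)) + 1e-12)
                 for filt in bank]
        # DCT-II of the log energies.
        coeffs = [sum(e * math.cos(math.pi * k * (i + 0.5) / n_filters)
                      for i, e in enumerate(log_e))
                  for k in range(n_filters)]
        result.append(coeffs[1:])  # drop the first coefficient, keep the rest
    return result


# Hypothetical 50 ms test tone sampled at 16 kHz.
demo = [math.sin(2 * math.pi * 440.0 * t / 16000) for t in range(800)]
features = mfcc(demo, 16000)
print(len(features), len(features[0]))
```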
  • the third prediction model can be obtained according to the following method:
  • a certain number of audio signals emitted by the user when the lips are close to the wrist-worn device or handheld device, as well as audio signals corresponding to human voices near or far away from the wrist-worn device or handheld device, are obtained, and then the audio feature set values of the collected audio signals are extracted according to the above-mentioned preset strategy, and then the audio feature set values are used as training samples to train the machine learning model or convolutional neural network to obtain a third prediction model.
  • the machine learning model can be a decision tree, a random forest algorithm, Xgboost or AdaBoost.
  • the duration of the first audio signal is less than or equal to 0.5 s.
  • the duration of the first audio signal may be 0.25 s, 0.3 s, 0.4 s, or 0.5 s.
  • By setting the duration of the first audio signal to be less than or equal to 0.5 s, the text corresponding to the first audio signal can be echoed in real time when the voice assistant is subsequently started to recognize the first audio signal.
  • determining whether the type of the first audio signal is an approaching human voice according to the first audio signal is implemented in the DSP of the first device.
  • the first audio signal is an audio signal that has not been processed by AGC, noise reduction, dereverberation or compression, and there is no loss of information about the approaching human voice.
  • the sound type of the first audio signal is determined by using the original audio signal that has not been processed by AGC, noise reduction, dereverberation or compression, which can improve the accuracy of the determined sound type of the first audio signal.
  • the audio signal corresponding to a complete speech task includes not only the first audio signal, but also the audio signal after the first audio signal.
  • the microphone collects the audio signal periodically according to a collection time of less than or equal to 0.5s.
  • the microphone of the first device continuously collects the audio signal, extracts the audio feature set value of the collected audio signal, and determines the type of the collected audio signal based on the audio feature set value of the collected audio signal at a preset time interval.
  • the preset time interval may be 0.1s, 0.2s or other time.
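  • One possible reading of the continuous collection described above is that windows of at most 0.5 s are evaluated at the preset time interval, for example every 0.1 s. The following sketch lists the resulting analysis windows in milliseconds; combining a 0.5 s window with a 0.1 s interval in this way is an assumption, and the function name `analysis_windows` is hypothetical.

```python
def analysis_windows(total_ms, window_ms=500, interval_ms=100):
    """Return (start, end) times in ms of the windows evaluated in a recording."""
    return [(start, start + window_ms)
            for start in range(0, total_ms - window_ms + 1, interval_ms)]


windows = analysis_windows(1000)
print(windows)
```

  • Under this reading, one second of audio yields six overlapping windows, each of which would be passed to feature extraction and sound-type prediction.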
  • the first device controls the IMU and ambient light sensor of the first device to continue to collect IMU data and illuminance data.
  • the collection duration is a second preset duration.
  • the first device determines whether the user has performed the second preset action based on the IMU data and illuminance data that continue to be collected; if it is determined that the user has performed the second preset action and the type of the first audio signal is an approaching human voice, the first device starts the voice assistant. If it is determined that the user has not performed the second preset action, the first device turns off the microphone of the first device.
  • the second preset action may be a wrist-raising and holding action or a hand-raising and holding action.
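  • The decision described above may be sketched as follows. The returned strings are hypothetical labels, and the behavior when the second preset action is performed but the audio is not an approaching human voice is not stated at this point in the text, so the "keep_listening" branch is an assumption.

```python
def next_step(second_action_done, audio_type):
    """Map the two detection results to the device's next step."""
    if not second_action_done:
        return "turn_off_microphone"
    if audio_type == "approaching_human_voice":
        return "start_voice_assistant"
    # Not specified in the text at this point; assumed for illustration.
    return "keep_listening"


print(next_step(True, "approaching_human_voice"))
print(next_step(False, "approaching_human_voice"))
```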
  • the first device activates the voice assistant.
  • the third preset time may be 0.3s.
  • When determining whether the user has maintained the wrist-raising action, the first device inputs the IMU data and illumination data that continue to be collected into the fourth prediction model for prediction and obtains a fourth prediction result; for example, the fourth prediction result is the wrist-raising and holding action.
  • the fourth prediction model can be a four-classification network, and the first device will continue to input the collected IMU data and illumination data into the fourth prediction model for prediction.
  • the fourth prediction model outputs four probabilities, namely, the probability of the wrist-raising action, the probability of the wrist-dropping action, the probability of the wrist-raising and holding action, and the probability of the free hand position; the fourth prediction result is the action corresponding to the maximum probability, such as the wrist-raising and holding action.
  • the free hand position here refers to the action other than the wrist-raising action, the wrist-dropping action, and the wrist-raising and holding action.
  • the fourth prediction model can also be a two-classification network, and the first device will continue to input the collected IMU data and illumination data into the fourth prediction model for prediction.
  • the fourth prediction model outputs two probabilities, namely, the probability of the wrist-raising and holding action and the probability of other actions.
  • the fourth prediction result is the action corresponding to the maximum probability; the other actions here refer to the actions other than the wrist-raising and holding action.
  • when determining whether the user has maintained the hand-raising action, the first device inputs the continuously collected IMU data and illumination data into the fifth prediction model to obtain a fifth prediction result, for example, the hand-raising and holding action.
  • the fifth prediction model can be a four-classification network, the first device inputs the IMU data and illumination data into the fifth prediction model for prediction, and the fifth prediction model outputs four probabilities, namely, the probability of the hand-raising action, the probability of the arm-dropping action, the probability of the hand-raising and holding action, and the probability of the free hand position; the fifth prediction result is the action corresponding to the maximum probability, such as the hand-raising and holding action.
  • the free hand position here refers to the action other than the hand-raising action, the arm-dropping action, and the hand-raising and holding action.
  • the fifth prediction model can also be a two-classification network, the first device will continue to input the collected IMU data and illumination data into the fifth prediction model for prediction, the fifth prediction model outputs two probabilities, namely, the probability of the hand-raising and holding action and the probability of other actions, and the fifth prediction result is the action corresponding to the maximum probability; the other actions here refer to the actions other than the hand-raising and holding action.
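The selection rule described for the fourth and fifth prediction models (take the action with the maximum probability) can be sketched as follows. This is a minimal illustration only: the label strings and the function names `pick_action` and `is_hold_action` are hypothetical and do not come from the source, and the probabilities are assumed to be supplied by whichever four-classification or two-classification network is used.

```python
from typing import Dict

# Hypothetical label names for illustration; the source does not name them.
FOUR_CLASS_LABELS = ("wrist_raise", "wrist_drop", "wrist_raise_hold", "free_hand")
TWO_CLASS_LABELS = ("wrist_raise_hold", "other")


def pick_action(probabilities: Dict[str, float]) -> str:
    """Return the label whose predicted probability is the largest."""
    if not probabilities:
        raise ValueError("probabilities must not be empty")
    return max(probabilities, key=probabilities.get)


def is_hold_action(probabilities: Dict[str, float],
                   hold_label: str = "wrist_raise_hold") -> bool:
    """True when the most probable action is the raise-and-hold action."""
    return pick_action(probabilities) == hold_label
```

The same helper works for both network shapes, because only the set of labels changes between the four-classification and two-classification cases.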
  • the fourth prediction model and the first prediction model can be the same prediction model.
  • the collection time of the IMU data and illuminance data used to determine whether the user has performed the second preset action needs to be equal to the collection time of the IMU data and illuminance data used to determine whether the user has performed the first preset action, that is, the second preset time is required to be equal to the first preset time.
  • the first device interpolates the IMU data and illuminance data that continue to be collected, respectively, so that the equivalent time of the IMU data and illuminance data used to determine whether the user has performed the second preset action is equal to the collection time of the IMU data and illuminance data used to determine whether the user has performed the first preset action, and then the same prediction model can be used for prediction.
  • the fifth prediction model and the second prediction model can be the same prediction model.
  • when the fifth prediction model and the second prediction model are the same prediction model and the second preset duration is less than the first preset duration, then before using the continuously collected IMU data and illuminance data to determine whether the user has performed the second preset action, the first device interpolates the continuously collected IMU data and illuminance data respectively, so that the equivalent duration of the IMU data and illuminance data used to determine whether the user has performed the second preset action is equal to the collection duration of the IMU data and illuminance data used to determine whether the user has performed the first preset action; the same prediction model can then be used for prediction.
  • the sensor can collect a certain amount of data after working for a period of time.
  • the amount of data can be characterized by the duration.
  • the equivalent duration here is to characterize the amount of data.
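The interpolation step can be illustrated with a simple linear resampler that stretches a shorter sensor window to the sample count expected by the shared prediction model. This is a sketch under assumptions: the source does not state which interpolation method is used, and the function name `resample_linear` and the sample counts in the usage note are hypothetical.

```python
from typing import List, Sequence


def resample_linear(samples: Sequence[float], target_len: int) -> List[float]:
    """Linearly interpolate `samples` onto `target_len` evenly spaced points.

    The first and last samples are preserved, so the window covers the same
    time span but carries the amount of data of a longer window.
    """
    if target_len < 1:
        raise ValueError("target_len must be at least 1")
    n = len(samples)
    if n == 0:
        raise ValueError("samples must not be empty")
    if n == 1 or target_len == 1:
        return [float(samples[0])] * target_len
    step = (n - 1) / (target_len - 1)
    out = []
    for i in range(target_len):
        pos = i * step
        lo = min(int(pos), n - 2)   # index of the left neighbour
        frac = pos - lo             # fractional distance to the right neighbour
        out.append(samples[lo] * (1.0 - frac) + samples[lo + 1] * frac)
    return out
```

For example, if one IMU axis yields 30 samples during the second preset duration and the model was trained on 50-sample windows (both numbers are assumed for illustration), `resample_linear(axis_samples, 50)` produces a window of the equivalent duration. Each IMU axis and the illuminance channel would be resampled separately.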
  • the first device determines the task to be executed based on the audio signal collected by the microphone of the first device, and identifies the type of the task to be processed; if the task to be processed is a sensitive task, in order to ensure security, it is necessary to obtain the user's identity information, such as the user's fingerprint information, voiceprint information, facial image information, etc.
  • the task to be processed is executed.
  • the sensitive task may be a transfer task, a private folder opening task, etc.
  • the target user is the owner of the first device.
  • the current user of the first device says to the first device "transfer 50 yuan to xx", and the microphone of the first device collects the corresponding audio signal, and determines the pending task based on the audio signal: transfer 50 yuan to xx.
  • the first device determines that the pending task is a sensitive task, obtains the voiceprint information of the current user of the first device, and matches the voiceprint information of the current user with the voiceprint information of the owner of the first device pre-stored in the first device. If the match is successful, it is determined that the user of the first device is the owner of the first device, and the first device executes the pending task: transfer 50 yuan to xx. If the match fails, it means that the user of the first device is not the owner of the first device, and the first device does not execute the pending task.
  • the first device determines whether identity authentication is required before executing the task; if identity authentication is required, the first device sends a prompt message to the user, such as the prompt message "Please unlock the screen by face or fingerprint first".
  • the first device obtains the user's identity information, such as face information or fingerprint information; after the identity authentication is passed (for example, face unlocking is successful or the first device is in the screen unlocking state), the first device executes the task corresponding to the first audio signal.
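The sensitive-task check can be sketched as a gate in front of task execution. The example below is an assumption-laden illustration: the source does not say how voiceprints are compared, so cosine similarity between fixed-length voiceprint vectors, the threshold of 0.8, the task identifiers, and the names `cosine_similarity` and `may_execute` are all hypothetical.

```python
import math
from typing import Sequence

# Hypothetical identifiers for the sensitive tasks mentioned in the text.
SENSITIVE_TASKS = {"transfer", "open_private_folder"}


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def may_execute(task_type: str,
                user_voiceprint: Sequence[float],
                owner_voiceprint: Sequence[float],
                threshold: float = 0.8) -> bool:
    """Non-sensitive tasks always run; sensitive tasks need an owner match."""
    if task_type not in SENSITIVE_TASKS:
        return True
    return cosine_similarity(user_voiceprint, owner_voiceprint) >= threshold
```

With this gate, a request such as "transfer 50 yuan to xx" is executed only when the current speaker's voiceprint matches the stored owner voiceprint, while a weather query is executed without the check.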
  • the first device obtains a first confidence and a second confidence, wherein the first confidence is used to characterize the duration of the user's wrist-raising action or hand-raising action; the second confidence is used to characterize the degree to which the first audio signal is an approaching human voice; the first confidence and the second confidence are weighted and summed according to a preset weight to obtain the confidence of starting the voice assistant. If the confidence of starting the voice assistant is greater than the preset confidence, the voice assistant of the first device is started.
  • the first device may obtain the first confidence level and the second confidence level in the following manner:
  • the first device calculates the duration of the user's wrist-raising gesture close to the mouth according to the acquired IMU data, or calculates the duration of the user's hand-raising gesture close to the mouth according to the acquired IMU data, that is, calculates the duration of the user's wrist-raising action or hand-raising action according to the acquired IMU data; and then calculates the first confidence based on the obtained duration; wherein, the longer the obtained duration, the greater the first confidence.
  • the value interval of the first confidence is [0.6,1.0], that is, the first confidence corresponding to the lowest value of the obtained duration is 0.6, and the first confidence corresponding to the highest value of the obtained duration is 1.0; based on this, the linear relationship between the obtained duration and the corresponding first confidence can be determined.
  • the first device can calculate and determine the first confidence corresponding to the obtained duration based on the linear relationship.
  • the first device uses the probability of the approaching human voice output by the third prediction model as the second confidence level.
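The confidence fusion described above can be written out as a short sketch. The interval [0.6, 1.0] for the first confidence and the weighted sum come from the text; the minimum and maximum durations, the equal weights, the threshold of 0.8, and the function names are assumptions chosen only for illustration.

```python
def duration_to_confidence(duration_s: float,
                           min_duration_s: float,
                           max_duration_s: float,
                           low: float = 0.6,
                           high: float = 1.0) -> float:
    """Map a hold duration linearly onto [low, high], clamping out-of-range values."""
    if max_duration_s <= min_duration_s:
        raise ValueError("max_duration_s must exceed min_duration_s")
    clamped = min(max(duration_s, min_duration_s), max_duration_s)
    ratio = (clamped - min_duration_s) / (max_duration_s - min_duration_s)
    return low + ratio * (high - low)


def should_start_assistant(first_confidence: float,
                           second_confidence: float,
                           weight_first: float = 0.5,
                           weight_second: float = 0.5,
                           threshold: float = 0.8) -> bool:
    """Weighted sum of the two confidences compared against a preset threshold."""
    fused = weight_first * first_confidence + weight_second * second_confidence
    return fused > threshold
```

Here `second_confidence` would be the approaching-human-voice probability output by the third prediction model, and `first_confidence` the value returned by `duration_to_confidence` for the measured raise duration.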
  • after starting the voice assistant, the first device identifies the voice content corresponding to the first audio signal based on automatic speech recognition (ASR) technology, and displays the voice content corresponding to the first audio signal in real time. For example, if the voice content corresponding to the first audio signal is "How will the weather be tomorrow?", "tomorrow" is displayed as soon as "tomorrow" is identified and "weather" is displayed as soon as "weather" is identified, instead of displaying the text only after the complete voice content has been identified.
  • the first device sends the first audio signal to the server, and the server identifies the voice content corresponding to the first audio signal based on ASR technology and filters the voice content, filtering out content that is not spoken to the voice assistant, such as "I'm going to play ball after I get off work today"; that is, the voice assistant of the first device does not provide feedback on content that is not spoken to the voice assistant, and the voice assistant terminates the voice conversation. In this way, interference to the user can be reduced.
  • some keywords or sentences can be set in advance; when the first device determines that the voice content corresponding to the first audio signal contains the set keywords or sentences, it determines that the voice content corresponding to the first audio signal is the content that needs to be filtered out, and the voice assistant does not need to provide feedback on the voice content corresponding to the first audio signal.
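The preset keyword or sentence filter can be sketched as a case-insensitive substring check on the recognized text. This is only an illustration: the source does not specify the matching rule, and the function name `should_ignore` and the example phrases are hypothetical.

```python
from typing import Iterable


def should_ignore(transcript: str, filter_phrases: Iterable[str]) -> bool:
    """True when the recognized text contains any preset keyword or sentence."""
    text = transcript.casefold()
    return any(phrase.casefold() in text for phrase in filter_phrases)
```

When `should_ignore` returns `True`, the voice assistant would give no feedback on the recognized content.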
  • the first device is a wrist-worn device as shown in FIG. 1a and FIG. 1b , or a handheld device as shown in FIG. 1c .
  • because both IMU data and illumination data are used, the event of the user approaching the first device can be detected more accurately; by contrast, when only acceleration data is used, the user must raise the wrist or hand with a pronounced motion before the approach can be detected.
  • the solution of the present application is more sensitive: detection succeeds when the wrist or hand is raised naturally, which provides users with a natural voice interaction experience; the movement amplitude is small, which improves user privacy in public places; and the operation is easy.
  • limiting the duration of the first audio signal to at most 0.5 s allows the text corresponding to the first audio signal to be echoed in real time when the voice assistant is subsequently started to recognize the first audio signal.
  • the first audio signal is an audio signal that has not been processed by AGC, noise reduction, reverberation or compression, so the information about the approaching human voice is not lost.
  • the information in the first audio signal can be maximized, and the accuracy of the sound type of the determined first audio signal can be improved.
  • when the task corresponding to the audio signal is a sensitive task, identity authentication is required before the task is executed. In this way, users other than the owner of the first device are prevented from performing security-sensitive voice tasks through the first device, which protects the information security of the owner's device.
  • the first device collects IMU data of the first device in real time. It should be understood that the IMU data here includes acceleration, angular velocity and other data. The first device determines whether to trigger the first event based on the IMU data, that is, whether the first device detects the first event. If the first device determines that the first event is triggered based on the IMU data, it means that the first device detects the first event; if the first device determines that the first event is not triggered based on the IMU data, it means that the first device does not detect the first event.
  • the first event is a wrist raising event of the first device, a hand raising event of the first device, or a wrist turning event of the first device. It should be understood that the first event can also be an event other than the wrist raising event of the first device, the hand raising event of the first device, and the wrist turning event of the first device.
  • the hand raising event of the first device refers to the user raising his hand to place the first device in a viewable or operable position and posture. It should be understood that the first device here is a handheld device.
  • the wrist turning event of the first device refers to the user turning the wrist directly inward or first turning the wrist outward and then turning the wrist inward to place the first device in a viewable or operable position and posture. It should be understood that the first device here is a wrist-worn device.
  • the wrist-raising event of the first device refers to the user raising his wrist to place the first device in a viewable or operable position and posture. It should be understood that the first device here is a wrist-worn device.
  • the position and posture of the first device can be determined by the IMU data of the first device, and the first device can determine whether to trigger the first event based on the IMU data of the first device.
  • the first device can use an event prediction model to predict whether the first device triggers the first event. Specifically, the first device inputs the collected IMU data into the event prediction model for processing, and the event prediction model outputs four probability values. The four probability values are respectively used to characterize the probability of triggering a wrist raising event of the first device, the probability of triggering a hand raising event of the first device, the probability of triggering a wrist turning event of the first device, and the probability of not triggering the above events. Among them, the result corresponding to the maximum probability is the final output result of the event prediction model.
  • the above event prediction model can be applied not only to the detection of wrist raising events, wrist turning events and hand raising events, but also to the detection of events in other applications, such as events of wrist raising to talk applications and wrist raising to light up applications, without limitation. Therefore, the event prediction model can be regarded as a public capability of the first device.
  • the first device does not need to separately train a neural network model specifically for detecting the first event, thereby reducing the workload of the first device and the computing power consumption of the first device.
  • because the first device detects the first event through this public capability, detecting the first event does not by itself indicate that the user has performed the first preset action. Therefore, the first device needs to further determine whether the user has performed the first preset action.
  • the first device calculates the posture information of the first device based on the IMU data of the first device; obtains the acceleration information of the first device from the IMU data of the first device; if the posture information of the first device is within a preset posture range, the acceleration information of the first device is within a preset acceleration range, and the first event duration is within a preset duration range, it is determined that the user has performed a first preset action.
  • the first event duration refers to the duration that the first device maintains the first device in a viewable or operable position and posture after the first event is detected.
  • the preset posture range, preset acceleration range and preset duration range are all obtained based on historical experience values.
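The three-range check for the first preset action can be sketched as follows. The example makes several assumptions that are not in the source: posture is reduced to a single pitch angle estimated from the accelerometer's gravity reading (the axis convention is assumed), acceleration is summarized by one magnitude value, and the names `Range`, `pitch_from_accel` and `performed_first_preset_action` are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Range:
    """Closed interval standing in for a preset range from historical experience."""
    low: float
    high: float

    def contains(self, value: float) -> bool:
        return self.low <= value <= self.high


def pitch_from_accel(ax: float, ay: float, az: float) -> float:
    """Pitch angle in degrees from a gravity reading (assumed axis convention)."""
    return math.degrees(math.atan2(-ax, math.hypot(ay, az)))


def performed_first_preset_action(pitch_deg: float,
                                  accel_magnitude: float,
                                  event_duration_s: float,
                                  posture_range: Range,
                                  accel_range: Range,
                                  duration_range: Range) -> bool:
    """All three quantities must fall inside their preset ranges."""
    return (posture_range.contains(pitch_deg)
            and accel_range.contains(accel_magnitude)
            and duration_range.contains(event_duration_s))
```

In practice the ranges would be filled in with the empirically obtained values mentioned above; the numbers used in any test of this sketch are placeholders.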
  • the operations performed by the first device after determining whether the user has performed the first preset action can be found in the relevant descriptions of S203-S205 and will not be described again here.
  • the second device obtains a second audio signal of the user and determines the position information of the user relative to the second device based on the second audio signal; if the position information indicates that the user is within a preset range, the second device determines that the user has been detected approaching the second device and starts the approaching human voice detection.
  • if the second device confirms, through the approaching human voice detection, that the type of the third audio signal is an approaching human voice, the second device starts the voice assistant.
  • the acquisition time of the third audio signal is after the acquisition time of the second audio signal.
  • the second device is a large-screen device.
  • all the microphones of the large-screen device can receive the user's audio data.
  • the second device processes the audio signals collected by the multiple microphones based on the time difference of arrival (TDOA) technology of the audio to obtain the user's position information relative to the second device.
  • the second device determines whether the user is within a preset range based on the user's position information relative to the second device.
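The TDOA idea can be illustrated for a single microphone pair: estimate the inter-microphone delay as the lag that maximizes the cross-correlation, then convert it to a bearing under a far-field assumption. This is a simplified sketch, not the method of the source: a full position estimate would need more microphones and geometry that the text does not give, and the function names, the brute-force correlation, and the speed of sound of 343 m/s are assumptions.

```python
import math
from typing import Sequence


def estimate_delay_samples(ref: Sequence[float],
                           other: Sequence[float],
                           max_lag: int) -> int:
    """Lag (in samples) at which `other` best matches `ref`.

    A positive result means `other` is a delayed copy of `ref`.
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i, value in enumerate(ref):
            j = i + lag
            if 0 <= j < len(other):
                score += value * other[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def bearing_from_tdoa(delay_s: float,
                      mic_spacing_m: float,
                      speed_of_sound: float = 343.0) -> float:
    """Far-field bearing in degrees from broadside: sin(theta) = c * tau / d."""
    s = speed_of_sound * delay_s / mic_spacing_m
    s = max(-1.0, min(1.0, s))  # guard against rounding outside [-1, 1]
    return math.degrees(math.asin(s))
```

A delay in samples is converted to seconds by dividing by the sampling rate before calling `bearing_from_tdoa`. Bearings from several microphone pairs could then be combined to decide whether the user is inside the preset range.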
  • the preset area is shown in Figure 6.
  • the preset area is an inverted trapezoidal area 5 meters in front of the large-screen device, and the left and right angles of the lower part of the trapezoid are both 120°.
  • the second device determines to start the approaching human voice detection.
  • after the second device starts the approaching human voice detection, it continues to detect whether the user stays close to the second device. Specifically, the second device continues to obtain the user's audio signal and determines whether the user is still in the preset area based on the obtained audio signal; if it is determined that the user is not in the preset area, the second device terminates the approaching human voice detection.
  • the second device uses an audio signal with a shorter acquisition time to determine whether the user is in a preset area.
  • the shorter acquisition time means that the acquisition time is not greater than a preset acquisition time, and the preset acquisition time may be 0.1s, 0.2s, 0.5s or other time lengths.
  • the approaching human voice detection and the device approaching detection can be performed simultaneously.
  • the device approaching detection is to determine whether the user is within a preset range based on the acquired audio signal.
  • the specific process of approaching human voice detection includes:
  • the second device obtains a video stream while the user speaks a voice task to the second device, and simultaneously obtains the corresponding audio signal, which can be called a third audio signal; the second device determines audio feature information based on the video stream and the third audio signal, and the audio feature information includes a correlation coefficient between speech and lip movement, and/or a time delay between speech, lip movement and video; the second device determines the type of the third audio signal based on the audio feature information; if the determined type of the third audio signal is an approaching human voice, the second device activates the voice assistant.
  • the correlation coefficient between speech and lip movement is the correlation coefficient between the lip height and lip width of the user in the video stream and the amplitude of the third audio signal, such as the Pearson correlation coefficient; its value range is [0,1], where 1 indicates strong correlation and 0 indicates no correlation.
  • L1 indicates lip width
  • L2, L3, L4, L5, L6, L7, and L8 indicate lip height. Since sound propagates more slowly than light, there is a time delay between the arrival time of the speech signal and the arrival time of the video signal.
  • the maximum value of the cross-correlation coefficient between the lip height and lip width and the audio amplitude of the third audio signal can be taken as the correlation coefficient between speech and lip movement, and the length of the corresponding right shift on the time axis can be used as the time delay between speech, lip movement and video.
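The lag search described above can be sketched with a Pearson coefficient evaluated at successive right shifts of the audio amplitude relative to one lip feature. The sketch assumes that lip measurements and audio amplitudes have already been aligned to the same frame rate, treats a single lip feature at a time, and starts the search from 0 so that negative correlations are reported as 0, matching the [0,1] range stated in the text. The function names are hypothetical.

```python
import math
from typing import Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient (0.0 when either input is constant)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("inputs must have equal length of at least 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0.0 or var_y == 0.0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def best_lag_correlation(lip: Sequence[float],
                         audio_amplitude: Sequence[float],
                         max_lag: int) -> Tuple[float, int]:
    """Return (maximum correlation, lag in frames) over right shifts 0..max_lag."""
    best_r, best_lag = 0.0, 0
    for lag in range(max_lag + 1):
        shifted = audio_amplitude[lag:]          # audio arrives `lag` frames late
        n = min(len(lip), len(shifted))
        if n < 2:
            break
        r = pearson(lip[:n], shifted[:n])
        if r > best_r:
            best_r, best_lag = r, lag
    return best_r, best_lag
```

The returned lag, divided by the frame rate, gives the time delay feature; the returned coefficient gives the speech and lip movement correlation feature.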
  • the second device determines the type of the third audio signal based on the audio feature information, and the second device may input the audio feature information into the sixth prediction model for processing to obtain the type of the third audio signal.
  • the sixth prediction model may be a four-classification network, and the second device inputs the audio feature information into the sixth prediction model for prediction.
  • the sixth prediction model outputs four probabilities, which are the probability of an approaching human voice, the probability of a nearby human voice, the probability of a distant human voice, and the probability of a non-human voice; the type of the third audio signal is the sound type corresponding to the maximum probability, such as an approaching human voice.
  • the sixth prediction model may also be a two-classification network, and the second device inputs the audio feature information into the sixth prediction model for prediction.
  • the sixth prediction model outputs two probabilities, which are the probability of an approaching human voice and the probability of a non-approaching human voice.
  • the type of the third audio signal is the sound type corresponding to the maximum probability.
  • the second device will obtain the sixth prediction model, which may be the sixth prediction model trained by other devices and obtained from other devices, or the sixth prediction model trained by the second device itself.
  • the process of training and obtaining the sixth prediction model specifically includes:
  • the machine learning model can be a decision tree, a random forest algorithm, Xgboost or AdaBoost.
  • when it is determined that the type of the third audio signal is an approaching human voice and the user has approached the second device and stayed there for a preset time, such as 0.3 seconds, the voice assistant is activated. This prevents the user from accidentally triggering the voice assistant.
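The hold-time condition can be sketched as a check that the combined condition (approaching human voice detected and user within the preset range) stays true continuously for the preset time. The 0.3 s value comes from the text; the input format of timestamped boolean observations and the function name `held_for` are assumptions.

```python
from typing import Iterable, Tuple


def held_for(observations: Iterable[Tuple[float, bool]],
             required_s: float = 0.3) -> bool:
    """True once the condition has stayed true for `required_s` seconds.

    `observations` is a time-ordered sequence of (timestamp_s, condition) pairs.
    Any false observation resets the timer.
    """
    start = None
    for timestamp, condition in observations:
        if condition:
            if start is None:
                start = timestamp
            if timestamp - start >= required_s:
                return True
        else:
            start = None
    return False
```

A brief pass in front of the device therefore does not start the voice assistant, because the condition is interrupted before the preset time elapses.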
  • the second device determines the task to be executed according to the third audio signal and identifies the type of the task to be processed; if the task to be processed is a sensitive task, in order to ensure security, it is necessary to obtain the user's identity information, such as the user's fingerprint information, voiceprint information, facial image information, etc.
  • the task to be processed is executed.
  • after starting the voice assistant, the second device identifies the voice content corresponding to the third audio signal based on ASR technology, and displays the voice content corresponding to the third audio signal in real time. For example, if the voice content corresponding to the third audio signal is "What will the weather be like tomorrow", "tomorrow" is displayed as soon as "tomorrow" is identified and "weather" is displayed as soon as "weather" is identified, instead of displaying the text only after the complete voice content has been identified.
  • the second device sends the third audio signal to the server, and the server identifies the voice content corresponding to the third audio signal based on ASR technology and filters the voice content, filtering out content that is not spoken to the voice assistant, such as "I'm going to play ball after I get off work today"; that is, the voice assistant of the second device does not provide feedback on content that is not spoken to the voice assistant, and the voice assistant terminates the voice conversation. In this way, interference to the user can be reduced.
  • some keywords or sentences can be set in advance; when the second device determines that the voice content corresponding to the third audio signal contains the set keywords or sentences, it determines that the voice content corresponding to the third audio signal is the content that needs to be filtered out, and the voice assistant does not need to provide feedback on the voice content corresponding to the third audio signal.
  • the user approaching the device is detected by using multiple microphones of the large-screen device to detect the location of the sound source, rather than using the wake-up word to trigger the microphone to pick up the user's voice task, providing users with a natural voice interaction (approach and speak) experience with easy operation.
  • by obtaining lip movement information from the camera of the large-screen device and then calculating the correlation between voice and lip movement, human voices and non-human sounds near the device that are not directed at the large screen can be intercepted, which suppresses false wake-ups of the voice assistant caused by interference.
  • when the task corresponding to the audio signal is a sensitive task, identity authentication is required before the task is executed. This prevents users other than the owner of the second device from performing security-sensitive voice tasks through the second device, which protects the information security of the owner's device.
  • the electronic device 800 includes:
  • An acquisition unit 801 is used to acquire IMU data and illumination data of the electronic device when a first event is detected;
  • a determination unit 802 configured to determine whether the user has performed a first preset action according to the IMU data and the illumination data of the electronic device;
  • the starting unit 803 is used to start the microphone of the electronic device if it is determined that the user has performed the first preset action.
  • the acquiring unit is also used to acquire the first audio signal collected by the microphone;
  • the determining unit 802 is further configured to determine, based on the first audio signal, whether the type of the first audio signal is an approaching human voice;
  • the activation unit 803 is further configured to activate a voice assistant of the electronic device if it is determined that the type of the first audio signal is an approaching human voice.
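The cooperation of the acquisition, determination and starting units can be summarized as a small control-flow sketch. The callables stand in for the units' behavior and are purely hypothetical; the sketch only shows the order of decisions described above (first preset action, then microphone, then audio type, then voice assistant), not an actual implementation of the electronic device 800.

```python
from typing import Callable


def run_wakeup_flow(first_action_done: Callable[[], bool],
                    start_microphone: Callable[[], None],
                    classify_audio: Callable[[], str],
                    start_assistant: Callable[[], None]) -> str:
    """Return the state reached: 'idle', 'microphone_on' or 'assistant_started'."""
    # Determination unit: did the user perform the first preset action?
    if not first_action_done():
        return "idle"
    # Starting unit: turn on the microphone so the first audio signal can be collected.
    start_microphone()
    # Determination unit: is the first audio signal an approaching human voice?
    if classify_audio() != "approaching_human_voice":
        return "microphone_on"
    # Starting unit: start the voice assistant.
    start_assistant()
    return "assistant_started"
```

The optional second preset action check and the step of turning the microphone off again are omitted here to keep the sketch short.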
  • the first event is a wrist-raising hardware interrupt event, a hand-raising hardware interrupt event of the electronic device, a key-pressing screen-lighting event of the electronic device, a hand-raising screen-lighting event of the electronic device, or a wrist-raising screen-lighting event of the electronic device.
  • the acquisition unit 801 is further configured to continuously acquire the IMU data and illumination data collected by the electronic device after the start unit 803 starts the microphone of the electronic device.
  • the determination unit 802 is further used to determine whether the user has performed a second preset action according to the IMU data and the illumination data collected by the electronic device;
  • the starting unit 803 is specifically used to:
  • the voice assistant of the electronic device is activated.
  • the starting unit 803 is further configured to:
  • the microphone of the electronic device is turned off.
  • the duration of collecting the IMU data and illumination data used to determine whether the user has performed a first preset action is a first preset duration
  • the duration of collecting the IMU data and illumination data used to determine whether the user has performed a second preset action is a second preset duration, wherein the first preset duration is greater than the second preset duration
  • the first event is a wrist-raising hardware interrupt event, a hand-raising hardware interrupt event of an electronic device, a hand-raising screen-lighting event of an electronic device, or a wrist-raising screen-lighting event of an electronic device.
  • the acquisition unit 801 is specifically used to:
  • Start the ambient light sensor of the electronic device, collect the illumination data of the environment of the electronic device and continue to collect IMU data, obtain the cached IMU data related to the first event, and set the illumination data related to the first event to zero.
  • the duration of the first audio signal is less than or equal to 0.5 s, and determining whether the type of the first audio signal is an approaching human voice based on the first audio signal is implemented by a digital signal processor (DSP) of the electronic device.
  • the first audio signal is collected by multiple microphones of the electronic device.
  • the determination unit 802 is further configured to determine the task to be performed according to the audio signal collected by the microphone of the electronic device;
  • the acquisition unit 801 is further used to acquire the identity information of the user if the task to be executed is a sensitive task;
  • the electronic device 800 further includes:
  • the execution unit 804 is used to determine that the user is a target user based on the user's identity information and execute the task to be processed.
  • the electronic device 800 further includes:
  • the display unit 805 is used to display the text corresponding to the collected audio signal in real time after the voice assistant is started.
  • the specific functional implementation method of the electronic device can refer to the description of the above-mentioned voice interaction method, such as the acquisition unit 801 is used to execute the relevant content of S201, the determination unit 802 is used to execute the relevant content of S202 and S204, the startup unit 803, the execution unit 804 and the display unit are used to execute the relevant content of S203 and S205, which will not be repeated here.
  • Each unit or module in the electronic device can be merged, separately or entirely, into one or several other units or modules, or one or more of the units or modules can be further divided into multiple smaller units or modules; either way, the same operations are achieved without affecting the technical effect of the embodiments of the present invention.
  • the above-mentioned units or modules are divided based on logical functions. In practical applications, the function of a unit (or module) can also be implemented by multiple units (or modules), or the functions of multiple units (or modules) are implemented by one unit (or module).
  • the electronic device 800a includes:
  • An acquisition unit 801a is used to acquire IMU data of the first device when a first event is detected;
  • a determination unit 802a configured to determine whether the user has performed a first preset action according to the IMU data of the first device
  • the activation unit 803a is used to activate the microphone of the first device if it is determined that the user has performed the first preset action, and the acquisition unit is further used to acquire the first audio signal collected by the microphone;
  • the determining unit 802a is further configured to determine, according to the first audio signal, whether the type of the first audio signal is an approaching human voice;
  • the starting unit 803a is further configured to start the voice assistant of the first device if it is determined that the type of the first audio signal is an approaching human voice.
  • the first event is a wrist raising event of the first device, a hand raising event of the first device, or a wrist turning event of the first device; and the wrist raising event of the first device, the hand raising event of the first device, and the wrist turning event of the first device are all obtained through the same event prediction model.
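  • in determining whether the user has performed the first preset action according to the IMU data of the first device, the determining unit 802a is specifically configured as follows: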
  • the first device calculates the posture information of the first device based on the IMU data of the first device; obtains the acceleration information of the first device from the IMU data of the first device; if the posture information of the first device is within a preset posture range, the acceleration information of the first device is within a preset acceleration range, and the first event duration is within a preset duration range, the first device determines that the user has performed a first preset action.
  • the acquisition unit 801a is further configured to continuously acquire the IMU data and illumination data collected by the electronic device after the activation unit activates the microphone of the electronic device;
  • the determination unit 802a is further used to determine whether the user has performed a second preset action according to the IMU data and the illumination data collected by the electronic device;
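  • in starting the voice assistant of the electronic device if it is determined that the type of the first audio signal is an approaching human voice, the starting unit 803a is specifically used to:
  • start the voice assistant of the electronic device if it is determined that the user has performed the second preset action and that the type of the first audio signal is an approaching human voice.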
  • the duration of the first audio signal is less than or equal to 0.5 s, and determining whether the type of the first audio signal is an approaching human voice according to the first audio signal is implemented by a DSP of the electronic device.
  • the first audio signal is collected by multiple microphones of the electronic device.
  • the determining unit 802a is further configured to determine the task to be executed according to the audio signal collected by the microphone of the electronic device;
  • the acquisition unit 801a is further used to acquire the identity information of the user if the task to be executed is a sensitive task;
  • the electronic device 800a further includes:
  • the execution unit 804a is used to determine that the user is a target user based on the user's identity information and execute the task to be processed.
  • the electronic device 800a further includes:
  • the display unit 805a is used to display the text corresponding to the collected audio signal in real time after the voice assistant is started.
  • The units or modules in the electronic device may be combined, individually or entirely, into one or several other units or modules, or one or more of them may be further split into multiple functionally smaller units or modules. Either arrangement achieves the same operation without affecting the technical effects of the embodiments of the present invention.
  • the above-mentioned units or modules are divided based on logical functions. In actual applications, the functions of a unit (or module) can also be implemented by multiple units (or modules), or the functions of multiple units (or modules) can be implemented by one unit (or module).
  • Figure 9 is a schematic diagram of the structure of an electronic device 900 provided in an embodiment of the present invention.
  • the electronic device 900 shown in Figure 9 includes a memory 901, a processor 902, a communication interface 903, a display screen 905, and a bus 904.
  • the memory 901, the processor 902, the communication interface 903, and the display screen 905 are connected to each other through the bus 904.
  • Memory 901 can be a read-only memory (ROM), a static storage device, a dynamic storage device or a random access memory (RAM).
  • the memory 901 can store programs. When the program stored in the memory 901 is executed by the processor 902, the processor 902 and the communication interface 903 are used to execute the various steps of the voice interaction method of the embodiment of the present application.
  • the processor 902 can adopt a general-purpose central processing unit (CPU), a microprocessor, an application specific integrated circuit (ASIC), a graphics processing unit (GPU) or one or more integrated circuits to execute relevant programs to implement the functions that need to be performed by the units in the electronic device 900 of the embodiment of the present application, or to execute the voice interaction method of the method embodiment of the present application.
  • the processor 902 may also be an integrated circuit chip capable of processing signals. During implementation, each step of the voice interaction method of the present application may be completed by an integrated logic circuit of hardware in the processor 902 or by instructions in the form of software. The processor 902 may also be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. Such a processor can implement or execute the methods, steps, and logic block diagrams disclosed in the embodiments of the present application.
  • the general-purpose processor can be a microprocessor or the processor can also be any conventional processor, etc.
  • the steps of the method disclosed in the embodiments of the present application can be performed directly by a hardware decoding processor, or by a combination of hardware and software modules in a decoding processor.
  • the software module can be located in a random access memory, a flash memory, a read-only memory, a programmable read-only memory or an electrically erasable programmable memory, a register, and other mature storage media in the art.
  • the storage medium is located in the memory 901, and the processor 902 reads the information in the memory 901 and, in combination with its hardware, completes the functions required to be performed by the units included in the electronic device of the embodiment of the present application, or executes the voice interaction method of the method embodiment of the present application.
  • the communication interface 903 uses a transceiver apparatus such as, but not limited to, a transceiver to implement communication between the electronic device 900 and other devices or a communication network. For example, data can be acquired through the communication interface 903.
  • the bus 904 may include a path for transmitting information between various components of the electronic device 900 (eg, the memory 901 , the processor 902 , and the communication interface 903 ).
  • the display screen 905 is used to display the text corresponding to the collected audio signal after the voice assistant is started.
  • the text of the voice assistant's reply information to the collected audio signal can also be displayed.
  • the display screen 905 can be an LCD screen, an LED screen, an OLED screen, and of course it can also be other display screens, which are not limited here.
  • Although the electronic device 900 shown in Figure 9 only shows a memory, a processor, and a communication interface, those skilled in the art should understand that, in a specific implementation, the electronic device 900 also includes other devices necessary for normal operation, such as a display. Those skilled in the art should also understand that, according to specific needs, the electronic device 900 may include hardware devices for implementing other additional functions. In addition, those skilled in the art should understand that the electronic device 900 may include only the devices necessary for implementing the embodiments of the present application, and does not necessarily include all the devices shown in Figure 9.
  • An embodiment of the present application also provides a chip, which includes a processor and a data interface.
  • the processor reads instructions stored in a memory through the data interface to implement the voice interaction method.
  • the chip may further include a memory, in which instructions are stored, and the processor is used to execute the instructions stored in the memory.
  • the processor is used to execute the voice interaction method.
  • An embodiment of the present application also provides a computer-readable storage medium, which stores instructions.
  • When the instructions are run on a computer or a processor, the computer or the processor executes one or more steps in any of the above methods.
  • the embodiment of the present application further provides a computer program product including instructions.
  • When the computer program product is run on a computer or a processor, the computer or the processor executes one or more steps in any of the above methods.
  • Computer-readable media may include computer-readable storage media, which corresponds to tangible media, such as data storage media, or includes any media (e.g., based on a communication protocol) that facilitates the transfer of a computer program from one place to another.
  • computer-readable media may generally correspond to (1) non-transitory tangible computer-readable storage media, or (2) communication media, such as signals or carrier waves.
  • Data storage media may be any available media that can be accessed by one or more computers or one or more processors to retrieve instructions, codes, and/or data structures for implementing the techniques described in this application.
  • a computer program product may include computer-readable media.
  • such computer-readable storage media may include RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage, flash memory, or any other medium that can be used to store the desired program code in the form of instructions or data structures and can be accessed by a computer.
  • any connection is properly referred to as a computer-readable medium.
  • For example, if instructions are transmitted from a website, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, digital subscriber line (DSL), or wireless technologies such as infrared, radio and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio and microwave are included in the definition of medium.
  • disks and discs include compact discs (CDs), laser discs, optical discs, digital versatile discs (DVDs), and Blu-ray discs, where disks typically reproduce data magnetically, while discs reproduce data optically using lasers. Combinations of the above should also be included within the scope of computer-readable media.
  • the instructions may be executed by one or more processors, such as one or more digital signal processors (DSPs), general purpose microprocessors, application specific integrated circuits (ASICs), field programmable gate arrays (FPGAs), or other equivalent integrated or discrete logic circuits.
  • processors may refer to any of the aforementioned structures or any other structure suitable for implementing the techniques described herein.
  • the functions described by the various illustrative logical blocks, modules, and steps described herein may be provided in a processor configured for encoding and decoding.
  • the technology may be fully implemented in one or more circuits or logic elements.
  • the techniques of the present application may be implemented in a variety of devices or apparatuses, including a wireless handset, an integrated circuit (IC), or a set of ICs (e.g., a chipset).
  • Various components, modules, or units are described in the present application to emphasize functional aspects of the devices for performing the disclosed techniques, but do not necessarily need to be implemented by different hardware units.
  • the various units may be combined in a coded hardware unit in conjunction with appropriate software and/or firmware, or provided by interoperating hardware units (including one or more processors as described above).
  • "A/B" can represent A or B, where A and B can each be singular or plural.
  • "Multiple" refers to two or more.
  • "At least one of the following" or similar expressions refer to any combination of these items, including any combination of single items or plural items.
  • For example, "at least one of a, b, or c" can represent: a, b, c, a-b, a-c, b-c, or a-b-c, where each of a, b, and c can be single or multiple.
  • the words “first”, “second”, etc. are used to distinguish the same items or similar items with substantially the same functions and effects. Those skilled in the art can understand that the words “first”, “second”, etc. do not limit the quantity and execution order, and the words “first”, “second”, etc. do not limit them to be necessarily different. Meanwhile, in the embodiments of the present application, words such as “exemplary” or “for example” are used to indicate examples, illustrations or descriptions. Any embodiment or design described as “exemplary” or “for example” in the embodiments of the present application should not be interpreted as being more preferred or more advantageous than other embodiments or designs. Specifically, the use of words such as “exemplary” or “for example” is intended to present related concepts in a concrete manner for ease of understanding.
  • the disclosed systems, devices and methods can be implemented in other ways.
  • the division of the unit is only a logical function division, and there may be other division methods in actual implementation, for example, multiple units or components can be combined or integrated into another system, or some features can be ignored or not executed.
  • the mutual coupling, direct coupling, or communication connection shown or discussed can be through some interfaces, indirect coupling or communication connection of devices or units, which can be electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place or distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • the computer program product includes one or more computer instructions.
  • the computer can be a general-purpose computer, a special-purpose computer, a computer network, or other programmable device.
  • the computer instructions can be stored in a computer-readable storage medium or transmitted through the computer-readable storage medium.
  • the computer instructions can be transmitted from one website, computer, server or data center to another website, computer, server or data center by wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave, etc.) means.
  • the computer-readable storage medium can be any available medium that a computer can access or a data storage device such as a server or data center that includes one or more available media integrated.
  • the available medium can be a read-only memory (ROM), or a random access memory (RAM), or a magnetic medium, such as a floppy disk, a hard disk, a tape, a magnetic disk, or an optical medium, such as a digital versatile disc (DVD), or a semiconductor medium, such as a solid state disk (SSD), etc.


Abstract

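A voice interaction method and related device, relating to the field of artificial intelligence. Upon detecting a first event, a first device acquires IMU data and illumination data of the first device (S201); determines, according to the IMU data and illumination data of the first device, whether the user has performed a first preset action (S202); if it is determined that the user has performed the first preset action, starts the microphone of the first device and acquires a first audio signal collected by the microphone (S203); determines, according to the first audio signal, whether the type of the first audio signal is an approaching human voice (S204); and, if so, starts the voice assistant of the first device (S205). Combining IMU data with illumination data detects the event of the user approaching the first device more accurately. By contrast, when only acceleration data is used, the user must raise the wrist or hand markedly before the approach can be detected. The solution is therefore more sensitive and can detect a natural wrist raise or hand raise.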

Description

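Voice interaction method and related device
This application claims priority to Chinese Patent Application No. 202310224268.2, filed with the China National Intellectual Property Administration on February 28, 2023 and entitled "Voice interaction method and related device", which is incorporated herein by reference in its entirety.
Technical Field
This application relates to the field of artificial intelligence, and in particular to a voice interaction method and a related device.
Background
There are currently two main ways of interacting with a voice assistant. In the first, the user speaks a specific wake word (for example, "小艺小艺"), and the smart terminal starts a voice session after recognizing that speech signal. This raises privacy concerns in public places and makes the interaction lengthy. In the second, the user performs a specific action, such as a pronounced wrist raise or a long press on a physical key (for example, the power key or a sports shortcut key). Key presses require considerable force and introduce delay, and a careless press may bring up the restart/power-off screen, so the interaction is not concise enough.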
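Summary
This application provides a voice interaction method and a related device that can improve the sensitivity with which an interaction operation is detected.
According to a first aspect, this application provides a voice interaction method applied to a first device, the method including:
Upon detecting a first event, the first device acquires inertial measurement unit (IMU) data and illumination data of the first device; determines, according to the IMU data and illumination data of the first device, whether the user has performed a first preset action; if it is determined that the user has performed the first preset action, the first device starts the microphone of the first device and acquires a first audio signal collected by the microphone; determines, according to the first audio signal, whether the type of the first audio signal is an approaching human voice; and if it is determined that the type of the first audio signal is an approaching human voice, the first device starts the voice assistant of the first device.
The first preset action is a wrist-raising action or a hand-raising action.
Combining IMU data with illumination data detects the event of the user approaching the first device more accurately. By contrast, when only acceleration data is used, the user must raise the wrist or hand markedly before the approach can be detected. The solution of this application is more sensitive and can detect a natural wrist raise or hand raise.
In a possible implementation, the first event is a wrist-raising hardware interrupt event of the first device, a hand-raising hardware interrupt event of the first device, a key-press screen-on event of the first device, a hand-raising screen-on event of the first device, or a wrist-raising screen-on event of the first device.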
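In a possible implementation, the method of this application further includes:
after the microphone of the first device is started, continuously acquiring the IMU data and illumination data collected by the first device, and determining, according to that IMU data and illumination data, whether the user has performed a second preset action;
in this case, starting the voice assistant of the first device if it is determined that the type of the first audio signal is an approaching human voice includes:
starting the voice assistant of the first device if it is determined that the user has performed the second preset action and that the type of the first audio signal is an approaching human voice.
The second preset action is a wrist-raise-and-hold action or a hand-raise-and-hold action.
After the microphone of the first device is started, the first device keeps acquiring IMU data and illumination data and uses them to determine whether the user has held the wrist raise or the hand raise. The voice assistant of the first device is started only when the user has held the wrist raise or hand raise and the type of the first audio signal is an approaching human voice, which helps reduce the false wake-up rate and interference.
In a possible implementation, the method of this application further includes:
if it is determined that the user has not performed the second preset action, the first device turns off the microphone of the first device.
If the user has not performed the second preset action, that is, the user has not held the wrist raise or the hand raise, the user does not actually intend to wake the voice assistant. Turning off the microphone in this case lowers the power consumption of the first device and also lowers the false wake-up rate of the voice assistant.
In a possible implementation, the IMU data and illumination data used to determine whether the user has performed the first preset action are collected over a first preset duration, and the IMU data and illumination data used to determine whether the user has performed the second preset action are collected over a second preset duration, where the first preset duration is greater than the second preset duration.
Because the second preset duration is shorter than the first preset duration, a long stretch of IMU data and illumination data does not have to be collected before determining whether the user has performed the second preset action, so that determination can be made quickly.
When the collection duration of the data used for the second preset action is shorter than that of the data used for the first preset action, the IMU data and the illumination data used for the second preset action can each be interpolated so that their equivalent duration matches the collection duration of the data used for the first preset action. The same prediction model can then be used to predict both the first preset action and the second preset action.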
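The interpolation step can be illustrated with a minimal sketch. The application does not say which interpolation is used; linear interpolation is assumed here, and the function name `resample_linear` and the sample counts are hypothetical.

```python
def resample_linear(samples, target_len):
    """Stretch `samples` onto `target_len` evenly spaced points.

    Assumption: linear interpolation between neighbouring samples; the
    source only states that the shorter window is interpolated.
    """
    if target_len < 2 or len(samples) < 2:
        raise ValueError("need at least two input and two output points")
    scale = (len(samples) - 1) / (target_len - 1)
    out = []
    for i in range(target_len):
        pos = i * scale                      # fractional index into samples
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)   # clamp at the last sample
        frac = pos - lo
        out.append(samples[lo] * (1.0 - frac) + samples[hi] * frac)
    return out


# Hypothetical sizes: a 5-sample "hold" window stretched to the 9 samples
# that the shared prediction model expects.
short_window = [0.0, 1.0, 2.0, 3.0, 4.0]
stretched = resample_linear(short_window, 9)
```

Each IMU axis and the illumination series would be resampled separately in this way before being passed to the shared model.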
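In a possible implementation, the first event is a wrist-raising hardware interrupt event of the first device, a hand-raising hardware interrupt event of the first device, a hand-raising screen-on event of the first device, or a wrist-raising screen-on event of the first device, and acquiring the IMU data and illumination data of the first device includes:
starting the ambient light sensor of the first device, collecting illumination data of the environment of the first device while continuing to collect IMU data, obtaining the buffered IMU data related to the first event, and zero-padding the illumination data related to the first event.
Zero-padding the illumination data makes its equivalent duration equal to the collection duration of the IMU data, so that data covering the complete wrist-raising or hand-raising phase can be used when determining whether the user has performed the first preset action. This makes the determination more accurate and, in terms of user experience, raises the wake-up success rate and lowers the false wake-up rate.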
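As a hedged illustration of the zero-padding step, the sketch below assumes the IMU and the ambient light sensor are sampled at the same rate, so that "equal equivalent duration" reduces to "equal sample count". The function name and the values are hypothetical.

```python
def zero_pad_illumination(lux_after_event, imu_sample_count):
    """Prepend zeros for the period before the light sensor was started.

    Assumption: both sensors share one sampling rate, so aligning the
    durations means aligning the number of samples.
    """
    missing = imu_sample_count - len(lux_after_event)
    if missing < 0:
        raise ValueError("illumination series is longer than the IMU series")
    return [0.0] * missing + list(lux_after_event)


# The IMU buffered 5 samples in total; the light sensor only produced the
# last 2 because it was started when the first event was detected.
padded = zero_pad_illumination([120.0, 80.0], 5)
```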
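In a possible implementation, the duration of the first audio signal is less than or equal to 0.5 s, and determining, according to the first audio signal, whether the type of the first audio signal is an approaching human voice is implemented by a digital signal processor (DSP) of the first device.
The duration of the first audio signal is set to 0.5 s or less so that, when the voice assistant is subsequently started to recognize the first audio signal, the text corresponding to the first audio signal can be echoed in real time. In addition, the first audio signal has not undergone automatic gain control (AGC), noise reduction, dereverberation, or compression, so no approaching-voice information has been lost. Using this raw audio signal to determine the sound type of the first audio signal improves the accuracy of the determination.
In a possible implementation, the first audio signal is collected by multiple microphones of the first device.
In wind-noise scenarios, the audio signals collected by different microphones contain wind noise. The wind noise in the audio signal collected by one microphone can be used to reduce the wind noise in the audio signal collected by another microphone, and the sound type is then predicted from the wind-noise-reduced audio signal, which yields a more accurate prediction.
In a possible implementation, the method of this application further includes:
The first device determines a task to be executed according to the audio signal collected by the microphone of the first device; if the task to be executed is a sensitive task, the first device acquires identity information of the user; and when the user is determined to be a target user based on the identity information, the first device executes the task.
The user is the person using the first device, and the target user is the owner of the first device.
This prevents anyone other than the owner of the first device from executing security-sensitive voice tasks through the first device, and so protects the information security of the owner's device.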
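The gating logic for sensitive tasks can be sketched as follows. The task names, the `SENSITIVE_TASKS` set, and the `verify_owner` callback are all hypothetical; the application does not specify how identity information is obtained or checked.

```python
# Hypothetical set of tasks that require the device owner.
SENSITIVE_TASKS = {"make_payment", "unlock_door"}


def handle_task(task, verify_owner):
    """Run `task`, checking identity first only when the task is sensitive.

    `verify_owner` is a caller-supplied function (an assumption of this
    sketch) that returns True when the current user is the target user.
    """
    if task in SENSITIVE_TASKS and not verify_owner():
        return "rejected"
    return "executed"


result_owner = handle_task("make_payment", lambda: True)
result_guest = handle_task("make_payment", lambda: False)
result_plain = handle_task("check_weather", lambda: False)
```

Identity is only requested for sensitive tasks, so ordinary requests such as a weather query are not slowed down by verification.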
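In a possible implementation, the method of this application further includes:
after the voice assistant is started, the first device displays, in real time, the text corresponding to the collected audio signal.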
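According to a second aspect, this application provides an electronic device, including:
an acquisition unit, configured to acquire IMU data and illumination data of the electronic device when a first event is detected;
a determination unit, configured to determine, according to the IMU data and illumination data of the electronic device, whether the user has performed a first preset action;
a starting unit, configured to start the microphone of the electronic device if it is determined that the user has performed the first preset action, where the acquisition unit is further configured to acquire a first audio signal collected by the microphone;
the determination unit is further configured to determine, according to the first audio signal, whether the type of the first audio signal is an approaching human voice; and
the starting unit is further configured to start the voice assistant of the electronic device if it is determined that the type of the first audio signal is an approaching human voice.
In a possible implementation, the first event is a wrist-raising hardware interrupt event of the electronic device, a hand-raising hardware interrupt event of the electronic device, a key-press screen-on event of the electronic device, a hand-raising screen-on event of the electronic device, or a wrist-raising screen-on event of the electronic device.
In a possible implementation, the acquisition unit is further configured to continuously acquire the IMU data and illumination data collected by the electronic device after the starting unit starts the microphone of the electronic device;
the determination unit is further configured to determine, according to the IMU data and illumination data collected by the electronic device, whether the user has performed a second preset action; and
in starting the voice assistant of the electronic device if it is determined that the type of the first audio signal is an approaching human voice, the starting unit is specifically configured to:
start the voice assistant of the electronic device if it is determined that the user has performed the second preset action and that the type of the first audio signal is an approaching human voice.
In a possible implementation, the starting unit is further configured to:
turn off the microphone of the electronic device if it is determined that the user has not performed the second preset action.
In a possible implementation, the IMU data and illumination data used to determine whether the user has performed the first preset action are collected over a first preset duration, and the IMU data and illumination data used to determine whether the user has performed the second preset action are collected over a second preset duration, where the first preset duration is greater than the second preset duration.
In a possible implementation, the first event is a wrist-raising hardware interrupt event of the electronic device, a hand-raising hardware interrupt event of the electronic device, a hand-raising screen-on event of the electronic device, or a wrist-raising screen-on event of the electronic device, and in acquiring the IMU data and illumination data of the electronic device, the acquisition unit is specifically configured to:
start the ambient light sensor of the electronic device, collect illumination data of the environment of the electronic device while continuing to collect IMU data, obtain the buffered IMU data related to the first event, and zero-pad the illumination data related to the first event.
In a possible implementation, the duration of the first audio signal is less than or equal to 0.5 s, and determining, according to the first audio signal, whether the type of the first audio signal is an approaching human voice is implemented by a digital signal processor (DSP) of the electronic device.
In a possible implementation, the first audio signal is collected by multiple microphones of the electronic device.
In a possible implementation, the determination unit is further configured to determine a task to be executed according to the audio signal collected by the microphone of the electronic device;
the acquisition unit is further configured to acquire identity information of the user if the task to be executed is a sensitive task; and
the electronic device further includes:
an execution unit, configured to execute the task when the user is determined to be a target user based on the identity information of the user.
In a possible implementation, the electronic device further includes:
a display unit, configured to display, in real time, the text corresponding to the collected audio signal after the voice assistant is started.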
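According to a third aspect, this application provides another voice interaction method applied to a first device, the method including:
Upon detecting a first event, the first device acquires IMU data of the first device; determines, according to the IMU data of the first device, whether the user has performed a first preset action; if it is determined that the user has performed the first preset action, the first device starts the microphone of the first device and acquires a first audio signal collected by the microphone; determines, according to the first audio signal, whether the type of the first audio signal is an approaching human voice; and if it is determined that the type of the first audio signal is an approaching human voice, the first device starts the voice assistant of the first device.
The first event is a wrist-raising event of the first device, a hand-raising event of the first device, or a wrist-turning event of the first device, and all three events are obtained through the same event prediction model.
The event prediction model can be used not only to detect wrist-raising, wrist-turning, and hand-raising events, but also to detect events for other applications, such as events for a raise-to-speak application or a raise-to-wake-screen application; this is not limited here. The event prediction model can therefore be regarded as a common capability of the first device. By using this common capability to detect the first event, the first device does not need to train a separate neural network model dedicated to detecting the first event, which reduces the workload and the computing power consumption of the first device. In addition, after the first event is detected, the first device further checks whether the user has performed the first preset action. Only if so does it start the microphone, acquire the first audio signal collected by the microphone, and determine whether the type of the first audio signal is an approaching human voice; and only if the type is an approaching human voice does it start the voice assistant. This helps reduce the probability that the voice assistant of the first device is started unintentionally.
In a possible implementation, the first device determining, according to the IMU data of the first device, whether the user has performed the first preset action includes:
The first device calculates posture information of the first device from the IMU data of the first device and obtains acceleration information of the first device from the IMU data. If the posture information is within a preset posture range, the acceleration information is within a preset acceleration range, and the duration of the first event is within a preset duration range, the first device determines that the user has performed the first preset action.
This helps improve the accuracy of determining whether the user has performed the first preset action.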
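The three-way check (posture range, acceleration range, event duration range) can be sketched as below. This is an illustration under stated assumptions, not the application's implementation: posture is summarized by pitch and roll derived from the last accelerometer sample, "acceleration information" is taken to be the peak acceleration magnitude, and every range value is hypothetical.

```python
import math


def tilt_angles_deg(ax, ay, az):
    """Pitch and roll (degrees) from one accelerometer sample (gravity direction)."""
    pitch = math.degrees(math.atan2(-ax, math.hypot(ay, az)))
    roll = math.degrees(math.atan2(ay, az))
    return pitch, roll


def _within(value, bounds):
    low, high = bounds
    return low <= value <= high


def performed_first_action(accel_samples, event_duration_s,
                           pitch_range=(-15.0, 15.0),    # hypothetical
                           roll_range=(30.0, 60.0),      # hypothetical
                           accel_range=(10.5, 20.0),     # m/s^2, hypothetical
                           duration_range=(0.2, 1.5)):   # seconds, hypothetical
    """Return True only if posture, acceleration and duration are all in range."""
    pitch, roll = tilt_angles_deg(*accel_samples[-1])    # final posture
    peak = max(math.sqrt(x * x + y * y + z * z) for x, y, z in accel_samples)
    return (_within(pitch, pitch_range) and _within(roll, roll_range)
            and _within(peak, accel_range)
            and _within(event_duration_s, duration_range))


samples = [(0.0, 0.0, 9.8), (3.0, 6.0, 10.0), (0.0, 6.9, 6.9)]
raised = performed_first_action(samples, 0.6)
too_slow = performed_first_action(samples, 3.0)
```

All three conditions must hold together, which is what makes the check stricter than the coarse first event on its own.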
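In a possible implementation, the method of this application further includes:
after the microphone of the first device is started, continuously acquiring the IMU data and illumination data collected by the first device, and determining, according to that IMU data and illumination data, whether the user has performed a second preset action;
in this case, starting the voice assistant of the first device if it is determined that the type of the first audio signal is an approaching human voice includes:
starting the voice assistant of the first device if it is determined that the user has performed the second preset action and that the type of the first audio signal is an approaching human voice.
The second preset action is a wrist-raise-and-hold action or a hand-raise-and-hold action.
After the microphone of the first device is started, the first device keeps acquiring IMU data and illumination data and uses them to determine whether the user has held the wrist raise or the hand raise. The voice assistant of the first device is started only when the user has held the wrist raise or hand raise and the type of the first audio signal is an approaching human voice, which helps reduce the false wake-up rate and interference.
In a possible implementation, the method of this application further includes:
if it is determined that the user has not performed the second preset action, the first device turns off the microphone of the first device.
If the user has not performed the second preset action, that is, the user has not held the wrist raise or the hand raise, the user does not actually intend to wake the voice assistant. Turning off the microphone in this case lowers the power consumption of the first device and also lowers the false wake-up rate of the voice assistant.
In a possible implementation, the duration of the first audio signal is less than or equal to 0.5 s, and determining, according to the first audio signal, whether the type of the first audio signal is an approaching human voice is implemented by a digital signal processor (DSP) of the first device.
The duration of the first audio signal is set to 0.5 s or less so that, when the voice assistant is subsequently started to recognize the first audio signal, the text corresponding to the first audio signal can be echoed in real time. In addition, the first audio signal has not undergone AGC, noise reduction, dereverberation, or compression, so no approaching-voice information has been lost. Using this raw audio signal to determine the sound type of the first audio signal improves the accuracy of the determination.
In a possible implementation, the first audio signal is collected by multiple microphones of the first device.
In wind-noise scenarios, the audio signals collected by different microphones contain wind noise. The wind noise in the audio signal collected by one microphone can be used to reduce the wind noise in the audio signal collected by another microphone, and the sound type is then predicted from the wind-noise-reduced audio signal, which yields a more accurate prediction.
In a possible implementation, the method of this application further includes:
The first device determines a task to be executed according to the audio signal collected by the microphone of the first device; if the task to be executed is a sensitive task, the first device acquires identity information of the user; and when the user is determined to be a target user based on the identity information, the first device executes the task.
The user is the person using the first device, and the target user is the owner of the first device.
This prevents anyone other than the owner of the first device from executing security-sensitive voice tasks through the first device, and so protects the information security of the owner's device.
In a possible implementation, the method of this application further includes:
after the voice assistant is started, the first device displays, in real time, the text corresponding to the collected audio signal.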
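According to a fourth aspect, this application provides another electronic device, including:
an acquisition unit, configured to acquire IMU data of the first device when a first event is detected;
a determination unit, configured to determine, according to the IMU data of the first device, whether the user has performed a first preset action;
a starting unit, configured to start the microphone of the first device if it is determined that the user has performed the first preset action, where the acquisition unit is further configured to acquire a first audio signal collected by the microphone;
the determination unit is further configured to determine, according to the first audio signal, whether the type of the first audio signal is an approaching human voice; and
the starting unit is further configured to start the voice assistant of the first device if it is determined that the type of the first audio signal is an approaching human voice.
The first event is a wrist-raising event of the first device, a hand-raising event of the first device, or a wrist-turning event of the first device, and all three events are obtained through the same event prediction model.
In a possible implementation, in determining, according to the IMU data of the first device, whether the user has performed the first preset action, the determination unit is specifically configured to:
calculate posture information of the first device from the IMU data of the first device; obtain acceleration information of the first device from the IMU data of the first device; and determine that the user has performed the first preset action if the posture information is within a preset posture range, the acceleration information is within a preset acceleration range, and the duration of the first event is within a preset duration range.
In a possible implementation, the acquisition unit is further configured to continuously acquire the IMU data and illumination data collected by the electronic device after the starting unit starts the microphone of the electronic device;
the determination unit is further configured to determine, according to the IMU data and illumination data collected by the electronic device, whether the user has performed a second preset action; and
in starting the voice assistant of the electronic device if it is determined that the type of the first audio signal is an approaching human voice, the starting unit is specifically configured to:
start the voice assistant of the electronic device if it is determined that the user has performed the second preset action and that the type of the first audio signal is an approaching human voice.
In a possible implementation, the duration of the first audio signal is less than or equal to 0.5 s, and determining, according to the first audio signal, whether the type of the first audio signal is an approaching human voice is implemented by a DSP of the electronic device.
In a possible implementation, the first audio signal is collected by multiple microphones of the electronic device.
In a possible implementation, the determination unit is further configured to determine a task to be executed according to the audio signal collected by the microphone of the electronic device;
the acquisition unit is further configured to acquire identity information of the user if the task to be executed is a sensitive task; and
the electronic device further includes:
an execution unit, configured to execute the task when the user is determined to be a target user based on the identity information of the user.
In a possible implementation, the electronic device further includes:
a display unit, configured to display, in real time, the text corresponding to the collected audio signal after the voice assistant is started.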
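According to a fifth aspect, this application provides another electronic device, including a processor and a memory. The memory is configured to store program code. The processor is configured to call the program code stored in the memory to perform the method provided in the first aspect, the third aspect, any possible implementation of the first aspect, or any possible implementation of the third aspect.
According to a sixth aspect, this application provides a computer storage medium including computer instructions that, when run on an electronic device, cause the electronic device to perform the method provided in any possible implementation of the first aspect.
According to a seventh aspect, this application provides a computer program product that, when run on a computer, causes the computer to perform the method provided in the first aspect, the third aspect, any possible implementation of the first aspect, or any possible implementation of the third aspect.
It can be understood that the electronic devices of the second, fourth, and fifth aspects, the computer storage medium of the sixth aspect, and the computer program product of the seventh aspect are all used to perform the method provided in the first aspect, the third aspect, any possible implementation of the first aspect, or any possible implementation of the third aspect. For the beneficial effects they can achieve, refer to the beneficial effects of the corresponding method; details are not repeated here.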
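Brief Description of the Drawings
Figure 1a is a schematic diagram of a scenario of a voice interaction method provided in an embodiment of this application;
Figure 1b is a schematic diagram of a scenario of another voice interaction method provided in an embodiment of this application;
Figure 1c is a schematic diagram of a scenario of another voice interaction method provided in an embodiment of this application;
Figure 1d is a schematic diagram of a scenario of another voice interaction method provided in an embodiment of this application;
Figure 2 is a schematic flowchart of a voice interaction method provided in an embodiment of this application;
Figure 3 is a schematic diagram of how illumination data and IMU data change during a wrist raise;
Figure 4 is a schematic diagram of the audio signal strength collected by a wrist-worn device within a range of deflection angles, provided in an embodiment of this application;
Figure 5 is a schematic diagram of the audio signal strength collected by the main microphone and the secondary microphone, provided in an embodiment of this application;
Figure 6 is a schematic diagram of a preset area provided in an embodiment of this application;
Figure 7 is a schematic diagram of lip height and lip width provided in an embodiment of this application;
Figure 8 is a schematic structural diagram of an electronic device provided in an embodiment of this application;
Figure 8a is a schematic structural diagram of another electronic device provided in an embodiment of this application;
Figure 9 is a schematic structural diagram of another electronic device provided in an embodiment of this application.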
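Detailed Description
The terms "first", "second", "third", "fourth" and the like in the specification, claims, and drawings of this application are used to distinguish different objects, not to describe a particular order. In addition, the terms "include" and "have" and any variants thereof are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or device that includes a series of steps or units is not limited to the listed steps or units, but optionally further includes steps or units that are not listed, or optionally further includes other steps or units inherent to the process, method, product, or device.
Reference herein to an "embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of this application. The appearances of this phrase in various places in the specification do not necessarily all refer to the same embodiment, nor to separate or alternative embodiments that are mutually exclusive of other embodiments. Those skilled in the art understand, explicitly and implicitly, that the embodiments described herein may be combined with other embodiments.
"Multiple" means two or more. "And/or" describes an association between associated objects and indicates that three relationships may exist; for example, "A and/or B" may mean: A alone, both A and B, or B alone. The character "/" generally indicates an "or" relationship between the associated objects before and after it.
The terms used in this application are explained first.
Intelligent personal assistant (IPA): a personal virtual assistant driven by artificial intelligence, also called an intelligent assistant or voice assistant. It lets the user easily operate the functions of a smart terminal (such as a smartphone, smart watch, tablet, laptop, AI speaker, smart large screen, or smart cockpit) or search the Internet through voice interaction, for example setting an alarm on a smartphone, reading e-mail aloud using text-to-speech, playing and searching for music, and sending text messages. Specific intelligent personal assistants include Huawei's 小艺 (whose overseas version is Celia), Apple's Siri, Amazon's Alexa, Microsoft's Cortana, and Google's Google Assistant.
Wrist wearable device: a smart device worn on the user's wrist. Depending on whether it has a screen and on the screen size, the computing power of its application processor, and its degree of intelligence (for example, whether it can download and install apps on its own and whether it supports a voice assistant), it is generally classified as a smart watch, a sports watch, or a smart band.
Inertial measurement unit (IMU): a device that measures the angular velocity and acceleration of a rigid object in three-dimensional space. It is generally used to identify and track the posture and motion of a device, such as the tilt angle of the device relative to the horizontal plane, the heading of the device (that is, its deflection angle from magnetic north), and the accumulated rotation angle, relative velocity, and displacement of the device.
IMU wrist-raising hardware interrupt: low-power, preliminary wrist-raise detection implemented inside the IMU component; a hardware interrupt is generated as soon as a wrist-raising event is detected.
Ambient light sensor (ALS): a sensor that senses ambient light, collects illumination data of the environment, and informs the processing chip so that screen brightness is adjusted automatically, reducing the power consumption of the smart terminal.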
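Embodiments of this application are described below with reference to the drawings.
Refer to Figure 1a, which is a schematic diagram of an application scenario of a voice interaction method provided in an embodiment of this application. As shown in Figure 1a:
With the screen of the wrist-worn device off, the user simply and naturally moves the wrist close to the lips. The user speaks a voice task, the microphone of the wrist-worn device collects the user's audio signal, and the voice assistant of the wrist-worn device recognizes the voice task from the audio signal and responds to it.
Refer to Figure 1b, which is a schematic diagram of an application scenario of another voice interaction method provided in an embodiment of this application. As shown in Figure 1b:
With the screen of the wrist-worn device on, for example while the user is reading information on the screen or operating the device, the user raises the wrist so that the lips face the screen of the wrist-worn device and are close to it, and then speaks a voice task, such as "today's weather". The wrist-worn device collects the user's audio signal, and its voice assistant recognizes the voice task from the audio signal and responds to it, for example by outputting "It is sunny today".
Refer to Figure 1c, which is a schematic diagram of an application scenario of another voice interaction method provided in an embodiment of this application. As shown in Figure 1c:
With the phone screen off, the user presses the power key to light up the screen, picks the phone up from the desk, brings the bottom of the phone close to the mouth, and then speaks a voice task. The phone collects the user's audio signal, and its voice assistant recognizes the voice task from the audio signal and responds to it.
Refer to Figure 1d, which is a schematic diagram of an application scenario of another voice interaction method provided in an embodiment of this application. As shown in Figure 1d:
The user faces a large-screen device, either directly or turned left or right by a preset angle, and then speaks a voice task, such as "today's weather". The wrist-worn device collects the user's audio signal, recognizes the voice task from the audio signal, and responds to it, for example by outputting "It is sunny today".
It should be understood that the voice interaction method of this application is not limited to the four scenarios above and can also be applied to other scenarios; this is not limited here.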
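Refer to Figure 2, which is a schematic flowchart of a voice interaction method provided in an embodiment of this application. As shown in Figure 2, the method includes:
S201. Upon detecting a first event, the first device acquires IMU data and illumination data of the first device.
It should be understood that the first device may be the user's wrist-worn device or a user terminal device, and the user terminal device may be a smartphone, a tablet, or the like.
The first event is a wrist-raising hardware interrupt event of the first device, a hand-raising hardware interrupt event of the first device, a key-press screen-on event of the first device, a hand-raising screen-on event of the first device, or a wrist-raising screen-on event of the first device.
The wrist-raising and hand-raising hardware interrupt events of the first device refer to low-power, preliminary wrist-raise detection implemented inside the IMU component. Once a slight wrist raise or wrist turn is detected (for example, based on the acceleration data collected by the IMU of the first device, a wrist raise, wrist turn, or hand raise is declared if the acceleration exceeds 30% of the gravitational acceleration (9.8 m/s^2)), the IMU of the first device generates a hand-raising or wrist-raising hardware interrupt. Other software modules can subscribe to this interrupt event so that it triggers them to start running. For example, it can start the raise-to-wake-screen software, which performs another wrist-raise or wrist-turn detection in order to implement an accurate and reliable raise-to-wake-screen function (for example, with as high a success rate and as low a false-trigger rate as possible).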
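The coarse hardware check can be illustrated with a short sketch. The source wording on the 30% threshold is ambiguous; the sketch assumes one reading, namely that the measured acceleration magnitude deviates from gravity by more than 30% of g. The function name and sample values are hypothetical.

```python
import math

G = 9.8  # gravitational acceleration in m/s^2, as given in the text


def coarse_raise_interrupt(sample, ratio=0.3):
    """Return True if the sample would trigger the coarse raise interrupt.

    Assumption: the trigger compares |magnitude - g| with ratio * g. A device
    at rest reads about 1 g and therefore does not trigger.
    """
    x, y, z = sample
    magnitude = math.sqrt(x * x + y * y + z * z)
    return abs(magnitude - G) > ratio * G


at_rest = coarse_raise_interrupt((0.0, 0.0, 9.8))
lifting = coarse_raise_interrupt((0.0, 0.0, 13.5))
```

Because this check is deliberately loose, the later steps (first preset action, hold detection, sound-type prediction) are what keep the false wake-up rate low.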
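The key-press screen-on event of the first device is generated when the first device detects the user operating a key of the first device (such as the power key, the volume-up key, or the volume-down key). Other software modules can subscribe to this event so that it triggers them to start running; for example, the rendering software starts the interface rendering computation and then submits the resulting frame to the screen for display. It should be understood that a key-press screen-on event means that the first device has determined that the condition for lighting up its screen is met; at that moment the first device has not yet lit up the screen.
The wrist-raising screen-on event or hand-raising screen-on event of the first device means that the first device has detected a wrist-raising or hand-raising event by which the user intends to light up the screen. The wrist-raising or hand-raising event may be determined by the first device based on the IMU data collected by the IMU. It should be understood that a wrist-raising or hand-raising screen-on event means that the first device has determined that the condition for lighting up its screen is met; at that moment the first device has not yet lit up the screen. It should also be understood that this detection is performed by the raise-to-wake-screen software of the first device.
It should be noted that the first event may be triggered by mistake, or by the user for another purpose. For example, when the user swings an arm, the IMU data of the wrist-worn device changes and may trigger a wrist-raising hardware interrupt event or a wrist-raising screen-on event, although the user does not intend to raise the wrist for voice interaction with the wrist-worn device. As another example, when the user runs while holding a phone, the IMU data of the phone changes and may trigger a hand-raising interrupt event or a hand-raising screen-on event. As yet another example, a user who wants to use an application on a smartphone may light up the screen with the power key, which triggers the phone's key-press screen-on event, although the user does not intend to raise the hand for voice interaction with the phone.
To avoid such false triggers, when the first event is detected, the first device acquires its IMU data and illumination data and uses them to determine whether the user has performed the first preset action.
In one example, when the first event is a key-press screen-on event of the first device, the first device acquiring its IMU data and (ambient) illumination data includes: the first device starts the ambient light sensor to collect illumination data and starts the IMU to collect IMU data, and obtains the collected illumination data and IMU data, where the collection duration of the illumination data is the same as that of the IMU data.
It should be understood that, before IMU data is collected, the IMU is started if it is not already running; if the accelerometer in the IMU is running but the gyroscope is not, the gyroscope of the IMU is started before angle data needs to be collected. Before illumination data is collected, the ambient light sensor is started if it is not already running.
In another example, when the first event is a wrist-raising hardware interrupt event, a hand-raising hardware interrupt event, a wrist-raising screen-on event, or a hand-raising screen-on event of the first device, the first device acquiring its IMU data and illumination data includes:
starting the ambient light sensor of the first device, collecting (ambient) illumination data of the first device while continuing to collect IMU data, obtaining the buffered IMU data related to the first event, and zero-padding the illumination data related to the first event (because the ambient light sensor had not yet been started during the first event). Zero is used because the illuminance in the lower space is generally low.
It should be understood that the wrist-raising hardware interrupt event, hand-raising hardware interrupt event, wrist-raising screen-on event, and hand-raising screen-on event of the first device are detected using the IMU data of the first device. In other words, before any of these events is detected, the IMU of the first device is already running and has collected IMU data. The IMU data used to determine whether the user has performed the first preset action can therefore be regarded as two parts: one part collected before the first event was detected, which may be called the IMU data related to the first event, and another part collected from the moment the first event was detected. Together the two parts form the IMU data of one complete wrist raise or hand raise, and using the complete data allows a more accurate determination of whether the user has performed the first preset action.
Because the collection duration of the illumination data used to determine whether the user has performed the first preset action must equal that of the IMU data, the illumination data is zero-padded before the two are used for the determination. The padded values are the illumination data related to the first event. After zero-padding, the equivalent duration of the illumination data equals the collection duration of the IMU data, which is the first preset duration.
It should be understood that a sensor collects a certain amount of data after working for a period of time, and the amount of data can be characterized by a duration; the equivalent duration here characterizes the amount of data.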
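S202. The first device determines, according to the IMU data and illumination data of the first device, whether the user has performed the first preset action.
The first preset action may be a wrist-raising action or a hand-raising action.
It should be understood, with reference to Figures 1a and 1b, that while the user raises the wrist to bring the wrist-worn device close to the lips, the light in the upper space is stronger than in the lower space, and when the user's lips approach and face the screen of the wrist-worn device, the light entering the ambient light sensor below the screen is blocked. The illumination data collected by the ambient light sensor therefore changes as shown by the curve in the left part of Figure 3: the illuminance first rises, then drops markedly, and settles at a low value. During the same movement, the wrist-worn device moves with the user's wrist and undergoes a characteristic change in posture (for example, the tilt of the watch face relative to the horizontal plane changes so that it faces the user's lips) and in position (for example, it rises from the lower space to the upper space, and its distance from the vertical plane of the user's body shrinks); the corresponding change in acceleration is shown by the curves in the right part of Figure 3. Similarly, for a handheld device (such as a phone), while the user picks up the device and raises the hand to bring it close to the lips, the illuminance collected by its ambient light sensor also tends to rise. However, once the handheld device is close to the lips, the light entering its ambient light sensor is not blocked, so the illuminance does not drop noticeably. The handheld device also moves with the user's hand and undergoes a characteristic change in posture (for example, the tilt of the phone screen relative to the horizontal plane approaches zero degrees so that the microphone at the bottom of the phone faces the user's lips) and in position.
The right part of Figure 3 shows the acceleration of the wrist-worn device along the x, y, and z axes of the device coordinate system while the user raises the wrist to bring the device close to the lips. The origin of the device coordinate system is the center of gravity of the wrist-worn device; the x axis points toward the 3 o'clock position of the watch face, the y axis toward the 12 o'clock position, and the z axis straight up out of the watch face.
As described above, the first device undergoes characteristic posture and position changes while approaching the user's lips, so the illumination data and IMU data of the device can be used to determine whether the user has performed the wrist-raising or hand-raising action.
In a possible implementation, a prediction model is trained in advance to predict whether the user has performed the wrist-raising or hand-raising action. The prediction model may be based on a decision tree or a convolutional neural network. Optionally, wrist-worn devices and handheld devices may use the same prediction model or different prediction models.
The training data includes acceleration data and illumination data collected over a fixed duration (for example, 0.5 s) and data derived from them (for example, the maximum, mean, skewness, kurtosis, and histogram of the acceleration data, a moving-average sequence generated with a given window size, and the magnitude sequence of the acceleration vector). Optionally, the training data is collected from different groups of users, including elderly and adolescent users, male and female users, and/or users who wear or hold the device with the left hand and with the right hand. The illumination data in the training data may also include data collected in different environments, including sunny outdoor environments, normally lit indoor environments, and dim environments.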
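Several of the derived features listed above can be computed as in the following sketch. It is illustrative only: the function names are hypothetical, population (not sample) moments are assumed for skewness and kurtosis, and the histogram feature is omitted.

```python
import math


def magnitude_series(samples):
    """Magnitude of each (x, y, z) acceleration vector."""
    return [math.sqrt(x * x + y * y + z * z) for x, y, z in samples]


def moving_average(values, window):
    """Moving-average sequence for a given window size."""
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


def summary_features(values):
    """Maximum, mean, skewness and kurtosis (population moments assumed)."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    if std == 0.0:
        skew = kurt = 0.0  # constant signal: moments are undefined, use 0
    else:
        skew = sum(((v - mean) / std) ** 3 for v in values) / n
        kurt = sum(((v - mean) / std) ** 4 for v in values) / n
    return {"max": max(values), "mean": mean,
            "skewness": skew, "kurtosis": kurt}


features = summary_features([1.0, 2.0, 3.0, 4.0, 5.0])
smoothed = moving_average([1.0, 2.0, 3.0, 4.0, 5.0], 2)
mags = magnitude_series([(3.0, 4.0, 0.0)])
```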
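In one example, the first device inputs the IMU data and illumination data into a first prediction model to obtain a first prediction result, for example that the action is a wrist-raising action. The first prediction model may be a four-class network: given the IMU data and illumination data, it outputs four probabilities, namely the probability of a wrist-raising action, of a wrist-lowering action, of a wrist-raise-and-hold action, and of a free hand position, and the first prediction result is the action with the highest probability, for example the wrist-raising action. Here, a free hand position is any action other than the wrist-raising, wrist-lowering, and wrist-raise-and-hold actions. The first prediction model may instead be a two-class network: given the IMU data and illumination data, it outputs two probabilities, namely the probability of a wrist-raising action and of any other action, and the first prediction result is the action with the highest probability. Here, other actions are all actions other than the wrist-raising action.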
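Selecting the prediction result from the four output probabilities amounts to an argmax, sketched below. The label strings and the probability values are hypothetical; the model itself is not shown.

```python
# Hypothetical label order for the four-class output described above.
LABELS = ("raise_wrist", "lower_wrist", "hold_raised_wrist", "free_hand")


def pick_action(probabilities):
    """Return the label whose probability is highest."""
    if len(probabilities) != len(LABELS):
        raise ValueError("expected one probability per label")
    best = max(range(len(LABELS)), key=lambda i: probabilities[i])
    return LABELS[best]


prediction = pick_action([0.70, 0.10, 0.15, 0.05])
```

The two-class variant works the same way with two labels.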
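It should be understood that a wrist-raise-and-hold action means that, after raising the wrist, the user keeps the wrist in a certain posture for a period of time (for example, the watch face toward the user, with a pitch angle within ±15 degrees relative to the horizontal plane and a roll angle of 30 to 60 degrees), the illuminance remains reasonably stable (for example, the ratio of the maximum to the minimum illuminance is less than 2), and the wrist is not lowered. For the first device, if the IMU data and illumination data collected after the user raises the wrist stay within preset ranges, it is determined that the user has performed the wrist-raise-and-hold action.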
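Using the example ranges given above (pitch within ±15 degrees, roll between 30 and 60 degrees, illuminance max/min ratio below 2), a rule-based hold check could look like the sketch below. This is an illustration rather than the application's model-based prediction; the handling of zero illuminance is an assumption.

```python
def is_holding(pitch_deg, roll_deg, lux_window):
    """Check the example hold conditions over one observation window."""
    posture_ok = (all(-15.0 <= p <= 15.0 for p in pitch_deg)
                  and all(30.0 <= r <= 60.0 for r in roll_deg))
    lowest, highest = min(lux_window), max(lux_window)
    if lowest <= 0.0:
        # Assumption: with a zero reading the ratio is undefined, so the
        # window counts as stable only if every reading is identical.
        stable = highest == lowest
    else:
        stable = highest / lowest < 2.0
    return posture_ok and stable


held = is_holding([5.0, -3.0, 8.0], [40.0, 45.0, 50.0], [30.0, 35.0, 40.0])
light_jump = is_holding([5.0, -3.0, 8.0], [40.0, 45.0, 50.0], [30.0, 35.0, 90.0])
```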
在另一个示例中,第一设备将IMU数据和照度数据输入到第二预测模型中进行预测,可得到第二预测结果,比如第二预测结果为抬手动作。第二预测模型可以是四分类网络,第一设备将IMU数据和照度数据输入到第二预测模型中进行预测,第二预测模型输出四个概率,分别为抬手动作的概率、落臂动作的概率、抬手保持动作的概率和自由手位的概率。第二预测结果为最大概率对应的动作,比如抬手动作。这里的自由手位是除抬手动作、落臂动作和抬手保持动作之外的动作。第二预测模型也可为两分类网络,第一设备将IMU数据和照度数据输入到第二预测模型中进行预测,第二预测模型输出两个概率,分别为抬手动作的概率和其他动作的概率,第二预测结果为最大概率对应的动作;这里的其他动作是指除抬手动作之外的动作。
其中,上述第一预测模型和第二预测模型的训练数据可以参见前述相关描述,在此不再叙述。
S203、若确定用户做了第一预设动作,第一设备启动第一设备的麦克风,并获取麦克风采集的第一音频信号。
其中,第一音频信号是通过第一设备的多个麦克风采集的,也就是说,若确定了用户做了第一预设动作,第一设备启动第一设备的多个麦克风,比如包括主麦克风和副麦克风。单个麦克风采集的音频信号存在风噪,并且基于单个麦克风采集的音频信号来预测音频信号的声音类型,准确率低,通过第一设备的多个麦克风采集第一音频信号,再利用多个麦克风采集的音频信号来预测音频信号的声音类型,可以得到准确率高的预测结果。
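With multiple microphones, the audio signals captured by the primary microphone and by the secondary microphone may both contain wind noise, but at different intensities. The first device can use the wind noise in one signal to reduce the wind noise in the other, for example using the wind noise in the secondary-microphone signal to perform wind-noise reduction on the primary-microphone signal, and then predict the sound type from the wind-noise-reduced signal, which yields a more accurate prediction.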
第一设备的麦克风采集到第一音频信号后,第一音频信号会被传输至第一设备的DSP。DSP在硬件实现基础的语音激活检测(voice activity detection,VAD);当第一音频信号的能量超过预设能量阈值时,第一设备的DSP产生硬件VAD中断,触发DSP中的软件算法运行,也就是确定第一音频信号的类型是否为靠近的人声。
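The text places the basic VAD in DSP hardware; the following is only a software illustration of the same energy-threshold idea. Using mean-square energy as the energy measure is an assumption, and both function names are hypothetical.

```python
def frame_energy(samples):
    """Mean-square energy of one block of audio samples."""
    return sum(s * s for s in samples) / len(samples)


def vad_triggered(samples, energy_threshold):
    """True when the block energy exceeds the preset energy threshold."""
    return frame_energy(samples) > energy_threshold
```

In the described flow, a `True` result corresponds to the hardware VAD interrupt that starts the near-voice classification.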
S204、第一设备根据第一音频信号确定第一音频信号的类型是否为靠近的人声。
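In one example, the first device inputs the first audio signal into a third prediction model to obtain a third prediction result, for example a near voice. The third prediction model may be a four-class network: given the first audio signal, it outputs four probabilities, namely the probability of a near voice, of a nearby voice, of a distant voice, and of a non-voice sound; the third prediction result is the sound type with the highest probability, for example the near voice. The third prediction model may instead be a two-class network that outputs two probabilities, the probability of a near voice and the probability of a sound that is not a near voice, with the third prediction result being the sound type with the highest probability.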
具体的,第一设备通过第三预测模型对第一音频信号进行特征提取,得到第一音频信号的音频特征。可选的,音频特征包括能量特征、时域特征、频域特征、乐理特征和感知特征中的至少一项。
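The feature-extraction strategy is set according to the device scenario and the user scenario. Consider the scenario in which the user raises the wrist, brings the wrist-worn device close to the mouth and speaks a voice task. Three properties are relevant: (1) the high/low-frequency distribution of the voice after it leaves the mouth, namely that audio energy is attenuated in directions deflected away from the direction the lips face (as shown in Figure 4, attenuation is greatest behind the user's head, and the higher the frequency, the greater the attenuation), while with the lips close to the device, for example 3-5 cm from its screen, the deflection angle of the device's microphone is 15-55 degrees; (2) lips close to the screen produce a characteristic reflection and reverberation signature; and (3) the microphone sits at the edge of the wrist-worn device (for example, at the 11 o'clock position). Because people generally lower their voice when speaking at close range, the preset strategy may comprise: high low band ratio (HLBR) features, for example the ratio of high-band to low-band energy split at 1 kHz (on a linear or mel frequency scale), and the 4 coefficients of a third-order polynomial fit and the 2 coefficients of a linear-regression fit to a 128-point fast Fourier transform (FFT); reverberation features (for example, the ratio of the largest autocorrelation peak to the mean of the other peaks within ±10 ms, the ratio of the largest autocorrelation peak to the mean of the next 9 largest peaks, the standard deviation and area under the curve of the autocorrelation, the standard deviation and area under the curve of the absolute first derivative of the autocorrelation, and the speech-to-reverberation modulation energy ratio (SRMR)); and whisper features (for example, pitch and timbre). Where the DSP's runtime memory and computing resources permit, the preset strategy may also include features obtained by rescaling the quantized features in a way inspired by human auditory perception, for example 84 Mel-frequency cepstral coefficients (MFCCs).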
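For the scenario in which the user raises the hand, brings the handheld device close to the mouth and speaks a voice task, there is a pronounced plosive ("popping") effect when the handheld device is 3-5 cm from the user's mouth. As shown in Figure 5, in this case the primary microphone of the handheld device picks up strong low-frequency breath wind noise; for motion-induced or ambient wind noise, both the primary and the secondary microphone pick up strong low-frequency wind noise. For such cases, the preset strategy may be the HLBR features of the primary and secondary microphones, for example the ratio of high-band to low-band energy split at 200 Hz (on a linear or mel frequency scale), the 4 coefficients of a third-order polynomial fit and the 2 coefficients of a linear-regression fit to a 128-point FFT, or dual-channel MFCCs.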
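The high/low band energy ratio can be illustrated with a direct DFT over one short frame. This is a sketch under assumptions: the function names are hypothetical, the split is applied on a linear frequency scale, no window function is used, and a small `eps` guards against division by zero. The split frequency would be 1 kHz for the wrist-worn case and 200 Hz for the handheld case described in the text.

```python
import cmath
import math


def power_spectrum(frame):
    """One-sided periodogram of a frame, computed with a direct DFT."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        spectrum.append(abs(acc) ** 2 / n)
    return spectrum


def high_low_band_ratio(frame, sample_rate, split_hz=1000.0, eps=1e-12):
    """Energy at or above split_hz divided by energy below split_hz."""
    n = len(frame)
    low = high = 0.0
    for k, power in enumerate(power_spectrum(frame)):
        if k * sample_rate / n < split_hz:
            low += power
        else:
            high += power
    return high / (low + eps)
```

A direct DFT is used here only to keep the example dependency-free; on a DSP the 128-point FFT mentioned in the text would take its place.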
第一设备根据预设策略提取采集的音频信号的音频特征集数值,具体为:按照预设策略对采集的音频数据进行计算,生成各个音频特征的具体数据。比如MFCCs数值的计算方法:
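Split the captured audio signal into frames, typically 25 ms per frame with a frame overlap of 10 ms; estimate the power spectrum of each frame using the periodogram method; filter the estimated power spectrum with 85 mel filters and compute the energy in each filter; take the logarithm of each filter energy to obtain 85 log values; apply the discrete cosine transform (DCT) to the 85 log values; finally, keep DCT coefficients 2 to 85 and discard the 1st, leaving 84 coefficients, which are the 84 Mel-frequency cepstral coefficients.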
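The MFCC steps above can be sketched end to end with the standard library. Several choices here are assumptions not stated in the source: the common 2595·log10(1 + f/700) mel formula, triangular filters evaluated at the DFT bin frequencies, an unnormalized DCT-II, a small floor before the logarithm, no windowing or pre-emphasis, and an even frame length. All function names are hypothetical. The 10 ms figure is treated as overlap, as written in the text; if it is meant as the frame shift, `split_frames` would need a hop of 10 ms instead.

```python
import cmath
import math


def hz_to_mel(hz):
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def split_frames(signal, sample_rate, frame_ms=25.0, overlap_ms=10.0):
    """Cut the signal into frames of frame_ms that overlap by overlap_ms."""
    size = int(sample_rate * frame_ms / 1000)
    hop = size - int(sample_rate * overlap_ms / 1000)
    return [signal[i:i + size] for i in range(0, len(signal) - size + 1, hop)]


def periodogram(frame):
    """Periodogram power-spectrum estimate of one frame (direct DFT)."""
    n = len(frame)
    return [abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))) ** 2 / n
            for k in range(n // 2 + 1)]


def mel_filterbank(num_filters, num_bins, sample_rate):
    """Triangular mel filters sampled at the DFT bin frequencies."""
    top = hz_to_mel(sample_rate / 2)
    edges = [mel_to_hz(top * i / (num_filters + 1)) for i in range(num_filters + 2)]
    bin_hz = [k * (sample_rate / 2) / (num_bins - 1) for k in range(num_bins)]
    bank = []
    for m in range(1, num_filters + 1):
        lo, mid, hi = edges[m - 1], edges[m], edges[m + 1]
        bank.append([max(0.0, min((f - lo) / (mid - lo), (hi - f) / (hi - mid)))
                     for f in bin_hz])
    return bank


def dct2(values):
    """Unnormalized DCT-II."""
    n = len(values)
    return [sum(v * math.cos(math.pi * k * (i + 0.5) / n) for i, v in enumerate(values))
            for k in range(n)]


def mfcc_frame(frame, sample_rate, num_filters=85, floor=1e-10):
    """84 MFCCs for one frame: 85 log mel energies, DCT, first coefficient dropped."""
    power = periodogram(frame)
    bank = mel_filterbank(num_filters, len(power), sample_rate)
    log_energy = [math.log(max(floor, sum(w * p for w, p in zip(row, power))))
                  for row in bank]
    return dct2(log_energy)[1:]
```

With short frames, some of the narrow low-frequency filters may cover no DFT bin at all; the floor keeps their log energy finite in that case.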
在一个可能的实施例中,第三预测模型可以按照如下方法得到:
获取腕戴设备或者手持设备采集的一定数量的嘴唇靠近腕戴设备或者手持设备时用户发出的音频信号,及腕戴设备或者手持设备附近或者远处的人声对应的音频信号,然后按照上述预设策略提取采集的音频信号的音频特征集数值,再将音频特征集数值作为训练样本,对机器学习模型或者卷积神经网络进行训练,以得到第三预测模型。
其中,机器学习模型可以为决策树、随机森林算法、Xgboost或者AdaBoost。
其中,第一音频信号的时长小于或者等于0.5s。比如第一音频信号的时长可以为0.25s、0.3s、0.4s和0.5s。
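The duration of the first audio signal is set to 0.5 s or less so that, when the voice assistant is subsequently started to recognize the first audio signal, the text corresponding to the first audio signal can be echoed on screen in real time.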
其中,根据第一音频信号确定第一音频信号的类型是否为靠近的人声是在第一设备的DSP实现的。并且第一音频信号是未经过AGC、降噪、去混响或者压缩处理的音频信号,不存在靠近人声信息的丢失,使用未经过AGC、降噪、去混响或者压缩处理的原始音频信号来确定第一音频信号的声音类型,可以提高确定的第一音频信号的声音类型的精度。
应理解,一个完整语音任务对应的音频信号不仅仅包括第一音频信号,还包括第一音频信号之后的音频信号。麦克风采集音频信号是按照小于或者等于0.5s的采集时长周期性采集的。
S205、若确定第一音频信号的类型为靠近的人声,启动第一设备的语音助手。
应理解,在确定第一音频信号的类型后,第一设备的麦克风持续采集音频信号,并提取采集的音频信号的音频特征集数值,按照预设时间间隔基于采集的音频信号的音频特征集数值确定采集的音频信号的类型。其中,预设时间间隔可以为0.1s、0.2s或者其他时间。
在一个可行的实现方式中,第一设备在开启第一设备的麦克风后,控制第一设备的IMU和环境光传感器继续采集IMU数据和照度数据。可选的,采集时长为第二预设时长。第一设备根据继续采集的IMU数据和照度数据确定用户是否执行了第二预设动作;若确定用户执行了第二预设动作,且第一音频信号的类型为靠近的人声,则第一设备启动语音助手。若确定用户未执行第二预设动作,则第一设备关闭第一设备的麦克风。
其中,第二预设动作可以为抬腕保持动作或者抬手保持动作。在一个示例中,若用户的抬腕动作或者抬手动作保持了第三预设时长,第一设备启动语音助手。可选的,第三预设时长可以为0.3s。
在一个示例中,在判断用户是否保持了抬腕动作时,第一设备将继续采集的IMU数据和照度数据输入到第四预测模型中进行预测,可以得到第四预测结果,比如,第四预测结果为抬腕保持动作。第四预测模型可以为四分类网络,第一设备将继续采集的IMU数据和照度数据输入到第四预测模型中进行预测,第四预测模型输出四个概率,分别为抬腕动作的概率、落腕动作的概率、抬腕保持动作的概率及自由手位的概率;第四预测结果为最大概率对应的动作,比如抬腕保持动作。其中,这里的自由手位是指除了抬腕动作、落腕动作和抬腕保持动作之外的动作。第四预测模型也可为两分类网络,第一设备将继续采集的IMU数据和照度数据输入到第四预测模型中进行预测,第四预测模型输出两个概率,分别为抬腕保持动作的概率和其他动作的概率,第四预测结果为最大概率对应的动作;这里的其他动作是指除了抬腕保持动作之外的动作。
在另一个示例中,在判断用户是否保持了抬手动作时,第一设备将继续采集的IMU数据和照度数据输入到第五预测模型中进行预测,可以得到第五预测结果,比如,第五预测结果为抬手保持动作。第五预测模型可以为四分类网络,第一设备将IMU数据和照度数据输入到第五预测模型中进行预测,第五预测模型输出四个概率,分别为抬手动作的概率、落臂动作的概率、抬手保持动作的概率及自由手位的概率;第五预测结果为最大概率对应的动作,比如抬手保持动作。其中,这里的自由手位是指除了抬手动作、落臂动作和抬手保持动作之外的动作。第五预测模型也可为两分类网络,第一设备将继续采集的IMU数据和照度数据输入到第五预测模型中进行预测,第五预测模型输出两个概率,分别为抬手保持动作的概率和其他动作的概率,第五预测结果为最大概率对应的动作;这里的其他动作是指除了抬手保持动作之外的动作。
在此需要指出的是,第四预测模型与第一预测模型可以为同一个预测模型。当第四预测模型与第一预测模型为同一个预测模型时,用于确定用户是否执行了第二预设动作的IMU数据和照度数据的采集时长需要与用于确定用户是否执行了第一预设动作的IMU数据和照度数据的采集时长相等,也即是需要第二预设时长与第一预设时长相等。当第二预设时长小于第一预设时长时,在使用继续采集的IMU数据和照度数据确定用户是否执行了第二预设动作之前,第一设备分别对继续采集的IMU数据和照度数据进行插值处理,以使得用于确定用户是否执行了第二预设动作的IMU数据和照度数据的等效时长与用于确定用户是否执行了第一预设动作的IMU数据和照度数据的采集时长相等,进而可以使用同一个预测模型进行预测。
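Similarly, the fifth prediction model and the second prediction model may be the same prediction model. When they are the same model and the second preset duration is shorter than the first preset duration, then before using the further-collected IMU data and illuminance data to determine whether the user has performed the second preset action, the first device interpolates the further-collected IMU data and illuminance data separately, so that the equivalent duration of the IMU data and illuminance data used to determine whether the user has performed the second preset action equals the collection duration of the IMU data and illuminance data used to determine whether the user has performed the first preset action; the same prediction model can then be used for both predictions.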
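The interpolation step that stretches a shorter capture to the model's expected input length can be illustrated with linear interpolation. The source does not name the interpolation method, so linear interpolation and the function name are assumptions.

```python
def resample_linear(samples, target_len):
    """Linearly interpolate a sequence to target_len points, keeping both endpoints."""
    if target_len < 2 or len(samples) < 2:
        raise ValueError("need at least two input and two output points")
    scale = (len(samples) - 1) / (target_len - 1)
    out = []
    for i in range(target_len):
        pos = i * scale
        left = min(int(pos), len(samples) - 2)  # keep left + 1 inside the sequence
        frac = pos - left
        out.append(samples[left] * (1 - frac) + samples[left + 1] * frac)
    return out
```

For example, if the second preset duration yields 30 samples per channel and the model was trained on 50 samples per channel, each IMU axis and the illuminance series would be passed through `resample_linear(channel, 50)` separately.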
应理解,传感器在工作一段时间后可以采集一定量的数据,数据量的多少可以通过时长来表征,这里的等效时长就是表征数据量大小。
需要说明的是,利用预测模型预测用户是否执行了第一预设动作或者第二预设动作的过程只是一个示例,不是对本申请的限定,当然可以通过其他方式预测用户是否执行了第一预设动作或者第二预设动作。
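After the voice assistant is started, the first device determines the task to be executed from the audio signal captured by its microphone and identifies the type of that task. If the task to be executed is a sensitive task, the user's identity information, such as fingerprint, voiceprint or facial-image information, must be obtained for security. The task is executed when the obtained identity information shows that the user is the target user.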
其中,敏感任务可以为转账任务、打开隐私文件夹任务等。目标用户为第一设备的所有者。
在一个示例中,第一设备的当前使用者对第一设备说“向xx转账50元”,第一设备的麦克风采集对应的音频信号,根据该音频信号确定待处理任务:向xx转账,金额为50元。第一设备确定该待处理任务为敏感任务,获取第一设备的当前使用者的声纹信息,将当前使用者的声纹信息与第一设备中预存储的第一设备的所有者的声纹信息进行匹配,若匹配成功,则确定第一设备的使用者为第一设备的所有者,第一设备执行待处理任务:向xx转账,金额为50元。若匹配失败,则表示第一设备的使用者不是第一设备的所有者,第一设备不执行待处理任务。
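The gating logic in this example can be sketched as follows. The source does not say how voiceprints are represented or matched; representing them as embedding vectors compared by cosine similarity, the 0.8 threshold, the task-type strings and the function names are all assumptions made for illustration.

```python
import math

SENSITIVE_TASKS = {"transfer_money", "open_private_folder"}


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def may_execute(task_type, speaker_print, owner_print, threshold=0.8):
    """Non-sensitive tasks always run; sensitive ones require a voiceprint match."""
    if task_type not in SENSITIVE_TASKS:
        return True
    return cosine_similarity(speaker_print, owner_print) >= threshold
```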
在一个可能的实施例中,在第一设备的语音助手识别出第一音频信号对应的任务后,第一设备确定执行该任务之前是否需要进行身份认证;如需要进行身份认证,第一设备向用户发出提示信息,比如提示信息为“请先通过人脸或者指纹解锁屏幕”。第一设备获取使用者的身份信息,比如人脸信息或者指纹信息;在身份认证通过(比如人脸解锁已成功或者第一设备已处于屏幕解锁状态)后,第一设备执行第一音频信号对应的任务。
在一个可能的实施例中,第一设备获取第一置信度和第二置信度,其中第一置信度用于表征用户保持抬腕动作或者抬手动作的时长;第二置信度用于表征第一音频信号为靠近的人声的程度;根据预设权重对第一置信度和第二置信度进行加权求和,以得到启动语音助手的置信度。若启动语音助手的置信度大于预设置信度,则启动第一设备的语音助手。
可选的,第一设备可以如下方式得到第一置信度和第二置信度:
第一设备根据获取的IMU数据计算用户保持抬腕靠近嘴边姿态的时长,或者根据获取的IMU数据计算用户保持抬手靠近嘴边姿态的时长,也即是根据获取的IMU数据计算用户保持抬腕动作或者抬手动作的时长;再基于得到的时长计算得到第一置信度;其中,得到的时长越长,第一置信度越大。可选的,第一置信度的取值区间为[0.6,1.0],也即是说得到的时长的最低值对应的第一置信度为0.6,得到的时长的最高值对应的第一置信度为1.0;基于此可以确定得到的时长与对应的第一置信度之间的线性关系,在得到时长后,第一设备可以基于该线性关系计算确定得到时长对应的第一置信度。
对于第二置信度,在前述步骤S204中,第一设备将第三预测模型输出的靠近的人声的概率作为第二置信度。
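The two confidences and their weighted sum can be sketched as below. The [0.6, 1.0] interval and the linear mapping come from the text; the minimum and maximum hold durations, the equal weights, the 0.75 threshold and the function names are placeholder assumptions.

```python
def first_confidence(hold_s, min_s, max_s, low=0.6, high=1.0):
    """Map the hold duration linearly onto [low, high], clamping out-of-range values."""
    clamped = min(max(hold_s, min_s), max_s)
    return low + (high - low) * (clamped - min_s) / (max_s - min_s)


def should_start_assistant(hold_s, near_voice_prob, min_s=0.3, max_s=1.0,
                           w_motion=0.5, w_audio=0.5, threshold=0.75):
    """Weighted sum of both confidences compared against the preset confidence."""
    score = w_motion * first_confidence(hold_s, min_s, max_s) + w_audio * near_voice_prob
    return score > threshold, score
```

Here `near_voice_prob` stands for the near-voice probability output by the third prediction model, which the text uses directly as the second confidence.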
在一个可能的实施例中,在启动语音助手后,第一设备基于自动语音识别(automatic speech recognition,ASR)技术识别第一音频信号对应的语音内容,并实时显示第一音频信号对应的语音内容。比如第一音频信号对应的语音内容为“明天天气怎么样”,在识别出“明天”,就显示“明天”;在识别出“天气”,就显示“天气”,而不是在识别到完整的语音内容后再显示。可选的,第一设备将第一音频信号发送至服务器,服务器基于ASR技术识别第一音频信号对应的语音内容,并对语音内容进行过滤,过滤一些不是对语音助手讲的内容,比如“今天下班打球啊”,也就是第一设备的语音助手对不是对语音助手讲的内容不进行反馈,语音助手终止语音会话。采用这种方式,可以减少用户的干扰。
对于过滤规则,可以预先设定一些关键词或者句子;第一设备确定第一音频信号对应的语音内容包含设定的关键词或者句子时,则确定第一音频信号对应的语音内容是需要过滤掉的内容,语音助手不需要对第一音频信号对应的语音内容作出反馈。
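The keyword-based filter rule reduces to a substring check against a preset list. A minimal sketch, with a hypothetical function name and an illustrative phrase list:

```python
def should_ignore(transcript, blocked_phrases):
    """True when the recognized text contains any preset keyword or sentence."""
    text = transcript.casefold()
    return any(phrase.casefold() in text for phrase in blocked_phrases)
```

When `should_ignore` returns `True`, the assistant gives no feedback for that utterance and ends the voice session, as described above.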
应理解,第一设备为如图1a和图1b中所示的腕戴设备,或者为如图1c中所示的手持设备。
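It can be seen that using IMU data together with illuminance data detects the event of the user approaching the first device more accurately. When only acceleration data is used, the user has to raise the wrist or hand markedly before the event is detected; the solution of this application is more sensitive and works with a natural wrist or hand raise, giving the user a natural voice-interaction experience. Because the movement is small, user privacy in public places is improved and the operation is effortless. The duration of the first audio signal is set to 0.5 s or less so that, when the voice assistant is subsequently started to recognize the first audio signal, the corresponding text can be echoed in real time. In addition, the first audio signal has not undergone AGC, noise reduction, dereverberation or compression, so no near-voice information is lost; the information in the first audio signal is used to the fullest, which improves the accuracy of the determined sound type. When the task corresponding to the audio signal is a sensitive task, identity authentication is required before the task is executed; this prevents anyone other than the owner of the first device from executing security-sensitive voice tasks through it and protects the information security of the owner's device.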
在另一个具体的实施例中,第一设备实时采集第一设备的IMU数据。应理解,这里的IMU数据包括加速度、角速度等数据。第一设备根据IMU数据确定是否触发第一事件,也即是第一设备是否检测到第一事件,若第一设备根据IMU数据确定触发第一事件,则表示第一设备检测到第一事件;若第一设备根据IMU数据确定未触发第一事件,则表示第一设备未检测到第一事件。
其中,第一事件为第一设备的抬腕事件、第一设备的抬手事件或第一设备的转腕事件。应理解,第一事件当然还可以为除了第一设备的抬腕事件、第一设备的抬手事件和第一设备的转腕事件之外的事件。
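A hand-raise event of the first device means that the user raises the hand to place the first device in a position and posture in which it can be viewed or operated. It should be understood that the first device here is a handheld device.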
第一设备的转腕事件是指用户直接向内转腕或者先向外转腕再向内转腕将第一设备置于可观看或者可操作的位置和姿态。应理解,这里的第一设备为腕戴设备。
第一设备的抬腕事件是指用户抬腕将第一设备置于可观看或者可操作的位置和姿态。应理解,这里的第一设备为腕戴设备。
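The position and posture of the first device can be determined from its IMU data, so the first device can determine from its IMU data whether the first event has been triggered. In a possible embodiment, the first device uses an event prediction model to predict whether the first event has been triggered. Specifically, the first device inputs the collected IMU data into the event prediction model, which outputs four probability values. These represent, respectively, the probability that a wrist-raise event of the first device was triggered, the probability that a hand-raise event was triggered, the probability that a wrist-turn event was triggered, and the probability that none of these events was triggered. The result with the highest probability is the final output of the event prediction model.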
在此需要指出的是,上述事件预测模型不仅可以应用于抬腕事件的检测、转腕事件的检测和抬手事件的检测,还可以应用于其他应用里的事件的检测,比如抬腕即说应用的事件、抬腕亮屏应用的事件,在此不做限定。因此事件预测模型可以看成第一设备的公共能力。
通过使用第一设备的公共能力来检测第一事件,对于第一设备来说,不需要再单独训练一个专门用于检测第一事件的神经网络模型,减少了第一设备的工作量,降低了第一设备的算力消耗。
由于第一设备是通过第一设备的公共能力检测第一事件的,在通过第一设备的公共能力检测到第一事件不能表示用户做了第一预设动作,因此第一设备需要进一步判断用户是否做了第一预设动作。
具体的,第一设备根据第一设备的IMU数据计算得到第一设备的姿态信息;从第一设备的IMU数据中获取第一设备的加速度信息;若第一设备的姿态信息处于预设姿态范围、第一设备的加速度信息处于预设加速度范围内,且第一事件时长处于预设时长范围内,则确定用户做了第一预设动作。
其中,第一事件时长是指检测到第一事件后,第一设备保持将第一设备置于可观看或者可操作的位置和姿态的时长。预设姿态范围、预设加速度范围和预设时长范围均是基于历史经验值得到的。
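The three-condition check for the first preset action can be sketched as a set of range tests. The source only says the ranges come from historical experience; the concrete numbers in the usage example, the use of a single pitch angle as the posture value, and all names are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Range:
    low: float
    high: float

    def contains(self, value):
        return self.low <= value <= self.high


def did_first_preset_action(pitch_deg, accel_magnitude, event_duration_s,
                            pose_range, accel_range, duration_range):
    """All three conditions must hold: posture, acceleration and event duration."""
    return (pose_range.contains(pitch_deg)
            and accel_range.contains(accel_magnitude)
            and duration_range.contains(event_duration_s))
```

A call might look like `did_first_preset_action(10.0, 9.8, 0.4, Range(-15, 15), Range(9.0, 11.0), Range(0.3, 2.0))`, with the ranges replaced by empirically tuned values.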
在判断用户是否做了第一预设动作之后第一设备所做的操作可参见S203-S205的相关描述,在此不再描述。
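In another specific embodiment, the second device obtains a second audio signal of the user and determines, from the second audio signal, the position of the user relative to the second device. The second device then determines from this position information whether the user is within a preset region; if so, the second device determines that the user has been detected approaching it and starts near-voice detection. When near-voice detection confirms that the type of a third audio signal is a near voice, the second device starts the voice assistant. The third audio signal is captured after the second audio signal.
In one example, the second device is a large-screen device. When the user speaks near the large-screen device, several of its microphones (for example, three or more) all receive the user's audio. The second device processes the audio signals captured by these microphones using the time difference of arrival (TDOA) technique to obtain the user's position relative to the second device, and determines from this position whether the user is within the preset region. By way of example, the preset region is shown in Figure 6: an inverted-trapezoid region extending 5 m in front of the large-screen device, whose lower left and lower right angles are both 120°. When the user is determined to be within the preset region, the second device starts near-voice detection.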
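The TDOA idea can be illustrated for a single microphone pair: estimate the delay by cross-correlation, then convert it to a bearing. This is a simplified sketch under assumptions: a far-field source, two microphones only (the text uses three or more to obtain a position rather than a bearing), a speed of sound of 343 m/s, and hypothetical function names.

```python
import math


def best_lag(ref, other, max_lag):
    """Lag (in samples) at which `other` best matches `ref`; positive means `other` is later."""
    def corr(lag):
        return sum(ref[i] * other[i + lag]
                   for i in range(len(ref)) if 0 <= i + lag < len(other))
    return max(range(-max_lag, max_lag + 1), key=corr)


def arrival_angle_deg(lag_samples, sample_rate, mic_spacing_m, speed_of_sound=343.0):
    """Far-field bearing relative to the broadside of the microphone pair."""
    ratio = lag_samples / sample_rate * speed_of_sound / mic_spacing_m
    return math.degrees(math.asin(max(-1.0, min(1.0, ratio))))
```

Combining the bearings or delays from several microphone pairs would give the position that is then tested against the preset region.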
在一个可能的实施例中,第二设备在启动靠近人声检测后,第二设备持续检测用户保持靠近第二设备的行为,具体是第二设备持续获取用户的音频信号,并根据获取的音频信号确定用户是否仍处于预设区域内;若根据获取的音频信号确定用户未处于预设区域内,第二设备则终止靠近人声检测。
可选的,为了提高语音任务的响应速度,第二设备使用采集时长较短的音频信号确定用户是否处于预设区域内。其中,采集时长较短是指采集时长不大于预设采集时长,预设采集时长可以为0.1s、0.2s、0.5s或者其他时长。
可选的,靠近人声检测和设备靠近检测可以同步进行。其中,设备靠近检测就是基于获取的音频信号确定用户是否处于预设范围内。
在一个可能的实施例中,靠近人声检测的具体过程包括:
第二设备获取用户对第二设备讲出语音任务时的视频流,并同时获取用户对第二设备讲出语音任务时对应的音频信号,该音频信号可以称为第三音频信号;第二设备根据视频流和第三音频信号确定音频特征信息,该音频特征信息包括语音与唇动的关联系数,和/或语音唇动视频间时延;第二设备根据音频特征信息确定第三音频信号的类型;若确定的第三音频信号的类型为靠近的人声时,第二设备启动语音助手。
其中,语音与唇动的关联系数为视频流中用户的唇高、唇宽分别与第三音频信号的幅度的互相关系数,比如皮尔逊相关系数,取值范围为[0,1],1表示强相关,0表示不相关。如图7所示,L1表示唇宽,L2、L3、L4、L5、L6、L7和L8表示唇高。由于语音传播速度慢于光速,语音信号到达时间相比于视频信号的到达时间存在一个时延,因此在计算唇高、唇宽分别与第三音频信号的音频幅度的互相关系数时,需要在时间轴上对第三音频信号进行右移,以达到第三音频信号与视频的同步,当达到同步时,可以选择唇高、唇宽分别与第三音频信号的音频幅度的互相关系数的最大值作为语音与唇动关联性的表征,也就是取唇高、唇宽分别与第三音频信号的音频幅度的互相关系数的最大值作为语音与唇动的关联系数,而在时间轴上右移的长度可作为语音唇动视频间时延。
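The association coefficient and the delay can be sketched as a search over time shifts that maximizes the Pearson correlation between a lip measurement series and the audio amplitude envelope. Assumptions in this sketch: both series are sampled at the video frame rate, the shift pairs each lip sample with a later audio sample to compensate for the audio delay, negative correlations are ignored so that the result stays in [0, 1], and all names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length series (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def best_alignment(lip_series, audio_envelope, max_shift):
    """Return (max correlation, shift) over shifts 0..max_shift of the audio."""
    best = (0.0, 0)
    for shift in range(max_shift + 1):
        n = min(len(lip_series), len(audio_envelope) - shift)
        if n < 2:
            break
        r = pearson(lip_series[:n], audio_envelope[shift:shift + n])
        if r > best[0]:
            best = (r, shift)
    return best


def lip_voice_association(lip_height, lip_width, audio_envelope, max_shift):
    """Larger of the height and width correlations, with its shift as the delay."""
    return max(best_alignment(lip_height, audio_envelope, max_shift),
               best_alignment(lip_width, audio_envelope, max_shift))
```

The returned pair corresponds to the two audio features named in the text: the first element is the voice/lip association coefficient and the second is the voice/lip-video delay in samples.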
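In a possible embodiment, the second device determines the type of the third audio signal from the audio feature information by inputting the audio feature information into a sixth prediction model, which yields the type of the third audio signal. The sixth prediction model may be a four-class network: given the audio feature information, it outputs four probabilities, namely the probability of a near voice, of a nearby voice, of a distant voice, and of a non-voice sound; the type of the third audio signal is the sound type with the highest probability, for example the near voice. The sixth prediction model may instead be a two-class network that outputs two probabilities, the probability of a near voice and the probability of a sound that is not a near voice, with the type of the third audio signal being the sound type with the highest probability.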
应理解,在使用第六预测模型之前,第二设备会获取第六预测模型,可以是其他设备训练好第六预测模型,从其他设备中获取的,也可以是第二设备自己训练得到的。
在一个实例中,训练得到第六预测模型的过程,具体包括:
获取第二设备采集的一定数量的用户靠近朝向第二设备的屏幕发出的语音任务时的音频信号,和非朝向第二设备的屏幕的人声音频信号/非人声音频信号,以及对应的视频数据;然后按照上述方式获取对应的音频特征信息;然后将得到的音频特征信息作为训练样本,对机器学习模型或者卷积神经网络进行训练,以得到第六预测模型。
其中,机器学习模型可以为决策树、随机森林算法、Xgboost或者AdaBoost。
在一个可能的实施例中,在确定第三音频信号的类型为靠近的人声,且用户靠近第二设备并保持了预设时长,比如0.3s时,启动语音助手。通过这种方式,可以避免用户误触发启动语音助手。
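After the voice assistant is started, the second device determines the task to be executed from the third audio signal and identifies the type of that task. If the task to be executed is a sensitive task, the user's identity information, such as fingerprint, voiceprint or facial-image information, must be obtained for security. The task is executed when the obtained identity information shows that the user is the target user.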
在一个可能的实施例中,在启动语音助手后,第二设备基于ASR技术识别第三音频信号对应的语音内容,并实时显示第三音频信号对应的语音内容。比如第三音频信号对应的语音内容为“明天天气怎么样”,在识别出“明天”,就显示“明天”;在识别出“天气”,就显示“天气”,而不是在识别到完整的语音内容后再显示。可选的,第二设备将第三音频信号发送至服务器,服务器基于ASR技术识别第三音频信号对应的语音内容,并对语音内容进行过滤,过滤一些不是对语音助手讲的内容,比如“今天下班打球啊”,也就是第二设备的语音助手对不是对语音助手讲的内容不进行反馈,语音助手终止语音会话。采用这种方式,可以减少用户的干扰。
对于过滤规则,可以预先设定一些关键词或者句子;第二设备确定第三音频信号对应的语音内容包含设定的关键词或者句子时,则确定第三音频信号对应的语音内容是需要过滤掉的内容,语音助手不需要对第三音频信号对应的语音内容作出反馈。
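It can be seen that user-approach detection is achieved by using the multiple microphones of the large-screen device to locate the sound source, rather than by using a wake word to trigger the microphone to pick up the user's voice task. This gives the user a natural voice-interaction experience (approach, face the device and speak) that is effortless to use. Obtaining lip-movement information from the camera of the large-screen device and computing the voice/lip-movement association makes it possible to block voices near the device that are not directed at the large screen, as well as non-voice sounds, which reduces the disturbance caused by falsely waking the voice assistant. When the task corresponding to the audio signal is a sensitive task, identity authentication is required before the task is executed; this prevents anyone other than the owner of the second device from executing security-sensitive voice tasks through it and protects the information security of the owner's device.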
参照图8所示,为本申请实施例提供的一种电子设备的结构示意图。如图8所示,该电子设备800包括:
获取单元801,用于在检测到第一事件时,获取电子设备的IMU数据和照度数据;
确定单元802,用于根据电子设备的IMU数据和照度数据确定用户是否做了第一预设动作;
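a starting unit 803, configured to start the microphone of the electronic device if it is determined that the user has performed the first preset action; the obtaining unit 801 is further configured to obtain the first audio signal captured by the microphone;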
确定单元802,还用于根据第一音频信号确定第一音频信号的类型是否为靠近的人声;
启动单元803,还用于若确定第一音频信号的类型为靠近的人声,启动电子设备的语音助手。
在一个可能的实现方式中,第一事件为抬腕硬件中断事件、所述电子设备的抬手硬件中断事件、电子设备的按键亮屏事件、所述电子设备的抬手亮屏事件或者所述电子设备的抬腕亮屏事件。
在一个可能的实现方式中,获取单元801,还用于在启动单元803启动电子设备的麦克风之后,持续获取电子设备采集的IMU数据和照度数据,
确定单元802,还用于根据电子设备采集的IMU数据和照度数据确定用户是否执行了第二预设动作;
在若确定第一音频信号的类型为靠近的人声,启动电子设备的语音助手的方面,启动单元803具体用于:
若确定用户执行了第二预设动作,且确定第一音频信号的类型为靠近的人声,启动电子设备的语音助手。
在一个可能的实现方式中,启动单元803还用于:
若确定用户未执行第二预设动作,则关闭电子设备的麦克风。
在一个可能的实现方式中,确定用户是否执行了第一预设动作所使用的IMU数据和照度数据的采集时长为第一预设时长,确定用户是否执行了第二预设动作所使用的IMU数据和照度数据的采集时长为第二预设时长,其中,第一预设时长大于第二预设时长。
在一个可能的实现方式中,第一事件为抬腕硬件中断事件、电子设备的抬手硬件中断事件、电子设备的抬手亮屏事件或者电子设备的抬腕亮屏事件,在获取电子设备的IMU数据和照度数据的方面,获取单元801具体用于:
启动电子设备的环境光传感器,采集电子设备的环境的照度数据和继续采集IMU数据,并获取与第一事件相关的缓存的IMU数据,并对与第一事件相关的照度数据置零补齐。
在一个可能的实现方式中,第一音频信号的时长小于或者等于0.5s,且根据第一音频信号确定第一音频信号的类型是否为靠近的人声是电子设备的数字信号处理器(digital signal processor,DSP)实现的。
在一个可能的实现方式中,第一音频信号是通过电子设备的多个麦克风采集得到的。
在一个可能的实现方式中,确定单元802,还用于根据电子设备的麦克风采集的音频信号确定待执行任务;
获取单元801,还用于若待执行任务为敏感任务,获取用户的身份信息;
电子设备800还包括:
执行单元804,用于在基于用户的身份信息确定用户为目标用户,执行待处理任务。
在一个可能的实现方式中,电子设备800还包括:
显示单元805,用于在启动语音助手后,实时显示采集的音频信号对应的文本。
值得指出的是,其中,电子设备的具体功能实现方式可以参见上述语音交互方法的描述,比如获取单元801用于执行S201的相关内容,确定单元802用于执行S202和S204的相关内容、启动单元803、执行单元804和显示单元用于执行S203和S205的相关内容,这里不再进行赘述。电子设备中的各个单元或模块可以分别或全部合并为一个或若干个另外的单元或模块来构成,或者其中的某个(些)单元或模块还可以再拆分为功能上更小的多个单元或模块来构成,这可以实现同样的操作,而不影响本发明的实施例的技术效果的实现。上述单元或模块是基于逻辑功能划分的,在实际应用中,一个单元(或模块)的功能也可以由多个单元(或模块)来实现,或者多个单元(或模块)的功能由一个单元(或模块)实现。
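Figure 8a is a schematic structural diagram of an electronic device provided by an embodiment of this application. As shown in Figure 8a, the electronic device 800a includes: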
获取单元801a,用于在检测到第一事件时,获取第一设备的IMU数据;
确定单元802a,用于根据第一设备的IMU数据确定用户是否做了第一预设动作;
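a starting unit 803a, configured to start the microphone of the first device if it is determined that the user has performed the first preset action; the obtaining unit 801a is further configured to obtain the first audio signal captured by the microphone;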
确定单元802a,还用于根据第一音频信号确定第一音频信号的类型是否为靠近的人声;
启动单元803a,还用于若确定第一音频信号的类型为靠近的人声,启动所述第一设备的语音助手。
其中,第一事件为第一设备的抬腕事件、第一设备的抬手事件或者第一设备的转腕事件;且第一设备的抬腕事件、第一设备的抬手事件和第一设备的转腕事件均是通过同一事件预测模型得到的。
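In a possible implementation, in terms of determining from the IMU data of the first device whether the user has performed the first preset action, the determining unit 802a is specifically configured to perform the following: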
第一设备根据第一设备的IMU数据计算得到第一设备的姿态信息;从第一设备的IMU数据中获取第一设备的加速度信息;若第一设备的姿态信息处于预设姿态范围、第一设备的加速度信息处于预设加速度范围内,且第一事件时长处于预设时长范围内,则第一设备确定用户做了第一预设动作。
在一个可能的实现方式中,获取单元801a,还用于在启动单元启动电子设备的麦克风之后,持续获取电子设备采集的IMU数据和照度数据;
确定单元802a,还用于根据电子设备采集的IMU数据和照度数据确定用户是否执行了第二预设动作;
在若确定第一音频信号的类型为靠近的人声,启动电子设备的语音助手的方面,启动单元803a具体用于:
若确定用户执行了第二预设动作,且确定第一音频信号的类型为靠近的人声,启动电子设备的语音助手。
在一个可能的实现方式中,第一音频信号的时长小于或者等于0.5s,且根据第一音频信号确定第一音频信号的类型是否为靠近的人声是电子设备的DSP实现的。
在一个可能的实现方式中,第一音频信号是通过电子设备的多个麦克风采集得到的。
在一个可能的实现方式中,确定单元802a,还用于根据电子设备的麦克风采集的音频信号确定待执行任务;
获取单元801a,还用于若待执行任务为敏感任务,获取用户的身份信息;
电子设备800a还包括:
执行单元804a,用于在基于用户的身份信息确定用户为目标用户,执行待处理任务。
在一个可能的实现方式中,电子设备800a还包括:
显示单元805a,用于在启动语音助手后,实时显示采集的音频信号对应的文本。
值得指出的是,其中,电子设备的具体功能实现方式可以参见上述语音交互方法的描述。电子设备中的各个单元或模块可以分别或全部合并为一个或若干个另外的单元或模块来构成,或者其中的某个(些)单元或模块还可以再拆分为功能上更小的多个单元或模块来构成,这可以实现同样的操作,而不影响本发明的实施例的技术效果的实现。上述单元或模块是基于逻辑功能划分的,在实际应用中,一个单元(或模块)的功能也可以由多个单元(或模块)来实现,或者多个单元(或模块)的功能由一个单元(或模块)实现。
基于上述方法实施例以及装置实施例的描述,请参见图9,本发明实施例还提供的一种电子设备900的结构示意图。图9所示的电子设备900(该电子设备900具体可以是一种计算机设备)包括存储器901、处理器902、通信接口903、显示屏905以及总线904。其中,存储器901、处理器902、通信接口903、显示屏905通过总线904实现彼此之间的通信连接。
存储器901可以是只读存储器(Read Only Memory,ROM),静态存储设备,动态存储设备或者随机存取存储器(Random Access Memory,RAM)。
存储器901可以存储程序,当存储器901中存储的程序被处理器902执行时,处理器902和通信接口903用于执行本申请实施例的语音交互方法的各个步骤。
处理器902可以采用通用的中央处理器(Central Processing Unit,CPU),微处理器,应用专用集成电路(Application Specific Integrated Circuit,ASIC),图形处理器(graphics processing unit,GPU)或者一个或多个集成电路,用于执行相关程序,以实现本申请实施例的电子设备900中的单元所需执行的功能,或者执行本申请方法实施例的语音交互方法。
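The processor 902 may also be an integrated circuit chip with signal-processing capability. In implementation, the steps of the voice interaction method of this application may be completed by integrated logic circuits of hardware in the processor 902 or by instructions in the form of software. The processor 902 may also be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and may implement or execute the methods, steps and logical block diagrams disclosed in the embodiments of this application. The general-purpose processor may be a microprocessor or any conventional processor. The steps of the methods disclosed in the embodiments of this application may be executed directly by a hardware decoding processor, or by a combination of hardware and software modules in a decoding processor. The software module may reside in a storage medium that is well established in the art, such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or a register. The storage medium is located in the memory 901; the processor 902 reads the information in the memory 901 and, in combination with its hardware, performs the functions required of the units included in the electronic device of the embodiments of this application, or performs the voice interaction method of the method embodiments of this application.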
通信接口903使用例如但不限于收发器一类的收发装置,来实现电子设备900与其他设备或通信网络之间的通信。例如,可以通过通信接口903获取数据。
总线904可包括在电子设备900各个部件(例如,存储器901、处理器902、通信接口903)之间传送信息的通路。
其中,显示屏905用于在启动语音助手后,显示采集到的音频信号对应的文本。当然还可以显示语音助手针对采集到的音频信号的回复信息的文本。需要指出的是,显示屏905可以为LCD屏、LED屏、OLED屏,当然还可以为其他显示屏,在此不做限定。
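It should be noted that although the electronic device 900 in Figure 9 shows only a memory, a processor, a communication interface, a display screen and a bus, a person skilled in the art will understand that, in a specific implementation, the electronic device 900 also includes other components necessary for normal operation. A person skilled in the art will also understand that, depending on specific needs, the electronic device 900 may include hardware components that implement other additional functions. Furthermore, a person skilled in the art will understand that the electronic device 900 may include only the components necessary to implement the embodiments of this application, and need not include all the components shown in Figure 9.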
本申请实施例还提供了一种芯片,所述芯片包括处理器与数据接口,所述处理器通过所述数据接口读取存储器上存储的指令,以实现所述的语音交互方法。
可选地,作为一种实现方式,所述芯片还可以包括存储器,所述存储器中存储有指令,所述处理器用于执行所述存储器上存储的指令,当所述指令被执行时,所述处理器用于执行所述的语音交互方法。
本申请实施例还提供了一种计算机可读存储介质,该计算机可读存储介质中存储有指令,当其在计算机或处理器上运行时,使得计算机或处理器执行上述任一个方法中的一个或多个步骤。
本申请实施例还提供了一种包含指令的计算机程序产品。当该计算机程序产品在计算机或处理器上运行时,使得计算机或处理器执行上述任一个方法中的一个或多个步骤。
本领域技术人员能够领会,结合本文公开描述的各种说明性逻辑框、模块和算法步骤所描述的功能可以硬件、软件、固件或其任何组合来实施。如果以软件来实施,那么各种说明性逻辑框、模块、和步骤描述的功能可作为一或多个指令或代码在计算机可读媒体上存储或传输,且由基于硬件的处理单元执行。计算机可读媒体可包含计算机可读存储媒体,其对应于有形媒体,例如数据存储媒体,或包括任何促进将计算机程序从一处传送到另一处的媒体(例如,基于通信协议)的通信媒体。以此方式,计算机可读媒体大体上可对应于(1)非暂时性的有形计算机可读存储媒体,或(2)通信媒体,例如信号或载波。数据存储媒体可为可由一或多个计算机或一或多个处理器存取以检索用于实施本申请中描述的技术的指令、代码和/或数据结构的任何可用媒体。计算机程序产品可包含计算机可读媒体。
作为实例而非限制,此类计算机可读存储媒体可包括RAM、ROM、EEPROM、CD-ROM或其它光盘存储装置、磁盘存储装置或其它磁性存储装置、快闪存储器或可用来存储指令或数据结构的形式的所要程序代码并且可由计算机存取的任何其它媒体。并且,任何连接被恰当地称作计算机可读媒体。举例来说,如果使用同轴缆线、光纤缆线、双绞线、数字订户线(DSL)或例如红外线、无线电和微波等无线技术从网站、服务器或其它远程源传输指令,那么同轴缆线、光纤缆线、双绞线、DSL或例如红外线、无线电和微波等无线技术包含在媒体的定义中。但是,应理解,所述计算机可读存储媒体和数据存储媒体并不包括连接、载波、信号或其它暂时媒体,而是实际上针对于非暂时性有形存储媒体。如本文中所使用,磁盘和光盘包含压缩光盘(CD)、激光光盘、光学光盘、数字多功能光盘(DVD)和蓝光光盘,其中磁盘通常以磁性方式再现数据,而光盘利用激光以光学方式再现数据。以上各项的组合也应包含在计算机可读媒体的范围内。
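The instructions may be executed by one or more processors, such as one or more digital signal processors (DSPs), general-purpose microprocessors, application-specific integrated circuits (ASICs), field-programmable logic arrays (FPGAs), or other equivalent integrated or discrete logic circuits. Accordingly, the term "processor" as used herein may refer to any of the foregoing structures or to any other structure suitable for implementing the techniques described herein. In addition, in some aspects, the functions described by the illustrative logical blocks, modules and steps herein may be provided within dedicated hardware and/or software modules configured for encoding and decoding, or incorporated in a combined codec. The techniques may also be implemented entirely in one or more circuits or logic elements.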
本申请的技术可在各种各样的装置或设备中实施,包含无线手持机、集成电路(IC)或一组IC(例如,芯片组)。本申请中描述各种组件、模块或单元是为了强调用于执行所揭示的技术的装置的功能方面,但未必需要由不同硬件单元实现。实际上,如上文所描述,各种单元可结合合适的软件和/或固件组合在编码硬件单元中,或者通过互操作硬件单元(包含如上文所描述的一或多个处理器)来提供。
所属领域的技术人员可以清楚地了解到,为描述的方便和简洁,上述描述的系统、装置和单元的具体工作过程,可以参考前述方法实施例中的对应步骤过程的具体描述,在此不再赘述。
应理解,在本申请的描述中,除非另有说明,“/”表示前后关联的对象是一种“或”的关系,例如,A/B可以表示A或B;其中A,B可以是单数或者复数。并且,在本申请的描述中,除非另有说明,“多个”是指两个或多于两个。“以下至少一项(个)”或其类似表达,是指的这些项中的任意组合,包括单项(个)或复数项(个)的任意组合。例如,a,b,或c中的至少一项(个),可以表示:a,b,c,a-b,a-c,b-c,或a-b-c,其中a,b,c可以是单个,也可以是多个。另外,为了便于清楚描述本申请实施例的技术方案,在本申请的实施例中,采用了“第一”、“第二”等字样对功能和作用基本相同的相同项或相似项进行区分。本领域技术人员可以理解“第一”、“第二”等字样并不对数量和执行次序进行限定,并且“第一”、“第二”等字样也并不限定一定不同。同时,在本申请实施例中,“示例性的”或者“例如”等词用于表示作例子、例证或说明。本申请实施例中被描述为“示例性的”或者“例如”的任何实施例或设计方案不应被解释为比其它实施例或设计方案更优选或更具优势。确切而言,使用“示例性的”或者“例如”等词旨在以具体方式呈现相关概念,便于理解。
在本申请所提供的几个实施例中,应该理解到,所揭露的系统、装置和方法,可以通过其它的方式实现。例如,该单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如,多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。所显示或讨论的相互之间的耦合、或直接耦合、或通信连接可以是通过一些接口,装置或单元的间接耦合或通信连接,可以是电性,机械或其它的形式。
作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。
在上述实施例中,可以全部或部分地通过软件、硬件、固件或者其任意组合来实现。当使用软件实现时,可以全部或部分地以计算机程序产品的形式实现。该计算机程序产品包括一个或多个计算机指令。在计算机上加载和执行该计算机程序指令时,全部或部分地产生按照本申请实施例的流程或功能。该计算机可以是通用计算机、专用计算机、计算机网络、或者其他可编程装置。该计算机指令可以存储在计算机可读存储介质中,或者通过该计算机可读存储介质进行传输。该计算机指令可以从一个网站站点、计算机、服务器或数据中心通过有线(例如同轴电缆、光纤、数字用户线(digital subscriber line,DSL))或无线(例如红外、无线、微波等)方式向另一个网站站点、计算机、服务器或数据中心进行传输。该计算机可读存储介质可以是计算机能够存取的任何可用介质或者是包含一个或多个可用介质集成的服务器、数据中心等数据存储设备。该可用介质可以是只读存储器(read-only memory,ROM),或随机存取存储器(random access memory,RAM),或磁性介质,例如,软盘、硬盘、磁带、磁碟、或光介质,例如,数字通用光盘(digital versatile disc,DVD)、或者半导体介质,例如,固态硬盘(solid state disk,SSD)等。
以上所述,仅为本申请实施例的具体实施方式,但本申请实施例的保护范围并不局限于此,任何在本申请实施例揭露的技术范围内的变化或替换,都应涵盖在本申请实施例的保护范围之内。因此,本申请实施例的保护范围应以所述权利要求的保护范围为准。

Claims (38)

  1. 一种语音交互方法,应用于第一设备,其特征在于,所述方法包括:
    在检测到第一事件时,获取所述第一设备的IMU数据和照度数据;
    根据所述第一设备的IMU数据和照度数据确定用户是否做了第一预设动作;
    若确定所述用户做了所述第一预设动作,启动第一设备的麦克风,获取所述麦克风采集的第一音频信号;
    根据所述第一音频信号确定所述第一音频信号的类型是否为靠近的人声;
    若确定所述第一音频信号的类型为靠近的人声,启动所述第一设备的语音助手。
  2. 根据权利要求1所述的方法,其特征在于,所述第一事件为抬腕硬件中断事件、所述第一设备的抬手硬件中断事件、所述第一设备的按键亮屏事件、所述第一设备的抬手亮屏事件或者所述第一设备的抬腕亮屏事件。
  3. 根据权利要求1或2所述的方法,其特征在于,所述方法还包括:
    在启动所述第一设备的麦克风之后,持续获取第一设备采集的IMU数据和照度数据,
    根据所述第一设备采集的IMU数据和照度数据确定所述用户是否执行了第二预设动作;
    所述若确定所述第一音频信号的类型为靠近的人声,启动所述第一设备的语音助手,包括:
    若确定所述用户执行了所述第二预设动作,且确定所述第一音频信号的类型为靠近的人声,启动所述第一设备的语音助手。
  4. 根据权利要求3所述的方法,其特征在于,所述方法还包括:
    若确定所述用户未执行所述第二预设动作,则关闭所述第一设备的麦克风。
  5. 根据权利要求3或4所述的方法,其特征在于,
    确定用户是否执行了所述第一预设动作所使用的IMU数据和照度数据的采集时长为第一预设时长,
    确定用户是否执行了所述第二预设动作所使用的IMU数据和照度数据的采集时长为第二预设时长,其中,所述第一预设时长大于所述第二预设时长。
  6. 根据权利要求2所述的方法,其特征在于,所述第一事件为抬腕硬件中断事件、所述第一设备的抬手硬件中断事件、所述第一设备的抬手亮屏事件或者所述第一设备的抬腕亮屏事件,所述获取所述第一设备的IMU数据和照度数据,包括:
    启动所述第一设备的环境光传感器,采集所述第一设备的环境的照度数据和继续采集IMU数据,并获取与所述第一事件相关的缓存的IMU数据,并对与所述第一事件相关的照度数据置零补齐。
  7. 根据权利要求1-6任一项所述的方法,其特征在于,所述第一音频信号的时长小于或者等于0.5s,且根据所述第一音频信号确定所述第一音频信号的类型是否为靠近的人声是所述第一设备的数字信号处理器DSP实现的。
  8. 根据权利要求1-7任一项所述的方法,其特征在于,所述第一音频信号是通过所述第一设备的多个麦克风采集得到的。
  9. 根据权利要求1-8任一项所述的方法,其特征在于,所述方法还包括:
    根据所述第一设备的麦克风采集的音频信号确定待执行任务;
    若所述待执行任务为敏感任务,获取所述用户的身份信息;
    在基于所述用户的身份信息确定所述用户为目标用户,执行所述待处理任务。
  10. 根据权利要求1-9任一项所述的方法,其特征在于,所述方法还包括:
    在开启语音助手后,所述第一设备实时显示采集的音频信号对应的文本。
  11. 一种语音交互方法,应用于第一设备,其特征在于,所述方法包括:
    在检测到第一事件时,获取所述第一设备的IMU数据;其中,所述第一事件为所述第一设备的抬腕事件、所述第一设备的抬手事件或者所述第一设备的转腕事件;且所述第一设备的抬腕事件、所述第一设备的抬手事件和所述第一设备的转腕事件均是通过同一事件预测模型得到的;
    根据所述第一设备的IMU数据确定用户是否做了第一预设动作;
    若确定所述用户做了所述第一预设动作,启动第一设备的麦克风,获取所述麦克风采集的第一音频信号;
    根据所述第一音频信号确定所述第一音频信号的类型是否为靠近的人声;
    若确定所述第一音频信号的类型为靠近的人声,启动所述第一设备的语音助手。
  12. 根据权利要求11所述的方法,其特征在于,所述根据所述第一设备的IMU数据确定用户是否做了第一预设动作,包括:
    根据所述第一设备的IMU数据计算得到所述第一设备的姿态信息;
    从所述第一设备的IMU数据中获取所述第一设备的加速度信息;
    若所述第一设备的姿态信息处于预设姿态范围、所述第一设备的加速度信息处于预设加速度范围内,且所述第一事件时长处于预设时长范围内,则确定所述用户做了所述第一预设动作。
  13. 根据权利要求11或12所述的方法,其特征在于,所述方法还包括:
    在启动所述第一设备的麦克风之后,持续获取第一设备采集的IMU数据和照度数据,
    根据所述第一设备采集的IMU数据和照度数据确定所述用户是否执行了第二预设动作;
    所述若确定所述第一音频信号的类型为靠近的人声,启动所述第一设备的语音助手,包括:
    若确定所述用户执行了所述第二预设动作,且确定所述第一音频信号的类型为靠近的人声,启动所述第一设备的语音助手。
  14. 根据权利要求13所述的方法,其特征在于,所述方法还包括:
    若确定所述用户未执行所述第二预设动作,则关闭所述第一设备的麦克风。
  15. 根据权利要求11-14任一项所述的方法,其特征在于,所述第一音频信号的时长小于或者等于0.5s,且根据所述第一音频信号确定所述第一音频信号的类型是否为靠近的人声是所述第一设备的数字信号处理器DSP实现的。
  16. 根据权利要求11-15任一项所述的方法,其特征在于,所述第一音频信号是通过所述第一设备的多个麦克风采集得到的。
  17. 根据权利要求11-16任一项所述的方法,其特征在于,所述方法还包括:
    根据所述第一设备的麦克风采集的音频信号确定待执行任务;
    若所述待执行任务为敏感任务,获取所述用户的身份信息;
    在基于所述用户的身份信息确定所述用户为目标用户,执行所述待处理任务。
  18. 根据权利要求11-17任一项所述的方法,其特征在于,所述方法还包括:
    在开启语音助手后,第一设备实时显示采集的音频信号对应的文本。
  19. 一种电子设备,其特征在于,包括:
    获取单元,用于在检测到第一事件时,获取所述电子设备的IMU数据和照度数据;
    确定单元,用于根据所述电子设备的IMU数据和照度数据确定用户是否做了第一预设动作;
    启动单元,用于若确定所述用户做了所述第一预设动作,启动电子设备的麦克风,所述获取单元,还用于获取所述麦克风采集的第一音频信号;
    所述确定单元,还用于根据所述第一音频信号确定所述第一音频信号的类型是否为靠近的人声;
    所述启动单元,还用于若确定所述第一音频信号的类型为靠近的人声,启动所述电子设备的语音助手。
  20. 根据权利要求19所述的电子设备,其特征在于,所述第一事件为抬腕硬件中断事件、所述电子设备的抬手硬件中断事件、所述电子设备的按键亮屏事件、所述电子设备的抬手亮屏事件或者所述电子设备的抬腕亮屏事件。
  21. 根据权利要求19或20所述的电子设备,其特征在于,
    所述获取单元,还用于在所述启动单元启动所述电子设备的麦克风之后,持续获取电子设备采集的IMU数据和照度数据,
    所述确定单元,还用于根据所述电子设备采集的IMU数据和照度数据确定所述用户是否执行了第二预设动作;
    在所述若确定所述第一音频信号的类型为靠近的人声,启动所述电子设备的语音助手的方面,所述启动单元具体用于:
    若确定所述用户执行了所述第二预设动作,且确定所述第一音频信号的类型为靠近的人声,启动所述电子设备的语音助手。
  22. 根据权利要求21所述的电子设备,其特征在于,所述启动单元还用于:
    若确定所述用户未执行所述第二预设动作,则关闭所述电子设备的麦克风。
  23. 根据权利要求21或22所述的电子设备,其特征在于,
    确定用户是否执行了所述第一预设动作所使用的IMU数据和照度数据的采集时长为第一预设时长,
    确定用户是否执行了所述第二预设动作所使用的IMU数据和照度数据的采集时长为第二预设时长,
    其中,所述第一预设时长大于所述第二预设时长。
  24. 根据权利要求20所述的电子设备,其特征在于,所述第一事件为抬腕硬件中断事件、所述电子设备的抬手硬件中断事件、所述电子设备的抬手亮屏事件或者所述电子设备的抬腕亮屏事件,在所述获取所述电子设备的IMU数据和照度数据的方面,所述获取单元具体用于:
    启动所述电子设备的环境光传感器,采集所述电子设备的环境的照度数据和继续采集IMU数据,并获取与所述第一事件相关的缓存的IMU数据,并对与所述第一事件相关的照度数据置零补齐。
  25. 根据权利要求19-24任一项所述的电子设备,其特征在于,所述第一音频信号的时长小于或者等于0.5s,且根据所述第一音频信号确定所述第一音频信号的类型是否为靠近的人声是所述电子设备的数字信号处理器DSP实现的。
  26. 根据权利要求19-25任一项所述的电子设备,其特征在于,所述第一音频信号是通过所述电子设备的多个麦克风采集得到的。
  27. 根据权利要求19-26任一项所述的电子设备,其特征在于,
    所述确定单元,还用于根据所述电子设备的麦克风采集的音频信号确定待执行任务;
    所述获取单元,还用于若所述待执行任务为敏感任务,获取所述用户的身份信息;
    所述电子设备还包括:
    执行单元,用于在基于所述用户的身份信息确定所述用户为目标用户,执行所述待处理任务。
  28. 根据权利要求19-27任一项所述的电子设备,其特征在于,所述电子设备还包括:
    显示单元,用于在开启语音助手后,实时显示采集的音频信号对应的文本。
  29. 一种电子设备,其特征在于,包括:
    获取单元,用于在检测到第一事件时,获取所述电子设备的IMU数据;其中,所述第一事件为所述第一设备的抬腕事件、所述第一设备的抬手事件或者所述第一设备的转腕事件;且所述第一设备的抬腕事件、所述第一设备的抬手事件和所述第一设备的转腕事件均是通过同一事件预测模型得到的;
    确定单元,用于根据所述电子设备的IMU数据确定用户是否做了第一预设动作;
    启动单元,用于若确定所述用户做了所述第一预设动作,启动电子设备的麦克风,获取所述麦克风采集的第一音频信号;
    所述确定单元,还用于根据所述第一音频信号确定所述第一音频信号的类型是否为靠近的人声;
    所述启动单元,还用于若确定所述第一音频信号的类型为靠近的人声,启动所述电子设备的语音助手。
  30. 根据权利要求29所述的电子设备,其特征在于,在所述根据所述电子设备的IMU数据确定用户是否做了第一预设动作的方面,所述确定单元具体用于:
    根据所述电子设备的IMU数据计算得到所述电子设备的姿态信息;
    从所述电子设备的IMU数据中获取所述电子设备的加速度信息;
    若所述电子设备的姿态信息处于预设姿态范围、所述电子设备的加速度信息处于预设加速度范围内,且所述第一事件时长处于预设时长范围内,则确定所述用户做了所述第一预设动作。
  31. 根据权利要求29或30所述的电子设备,其特征在于,
    所述获取单元,还用于在所述启动单元启动所述电子设备的麦克风之后,持续获取电子设备采集的IMU数据和照度数据,
    所述确定单元,还用于根据所述电子设备采集的IMU数据和照度数据确定所述用户是否执行了第二预设动作;
    在所述若确定所述第一音频信号的类型为靠近的人声,启动所述电子设备的语音助手的方面,所述启动单元具体用于:
    若确定所述用户执行了所述第二预设动作,且确定所述第一音频信号的类型为靠近的人声,启动所述电子设备的语音助手。
  32. 根据权利要求31所述的电子设备,其特征在于,所述启动单元还用于:
    若确定所述用户未执行所述第二预设动作,则关闭所述电子设备的麦克风。
  33. 根据权利要求29-32任一项所述的电子设备,其特征在于,所述第一音频信号的时长小于或者等于0.5s,且根据所述第一音频信号确定所述第一音频信号的类型是否为靠近的人声是所述电子设备的数字信号处理器DSP实现的。
  34. 根据权利要求29-33任一项所述的电子设备,其特征在于,所述第一音频信号是通过所述电子设备的多个麦克风采集得到的。
  35. 根据权利要求29-34任一项所述的电子设备,其特征在于,
    所述确定单元,还用于根据所述电子设备的麦克风采集的音频信号确定待执行任务;
    所述获取单元,还用于若所述待执行任务为敏感任务,获取所述用户的身份信息;
    所述电子设备还包括:
    执行单元,用于在基于所述用户的身份信息确定所述用户为目标用户,执行所述待处理任务。
  36. 根据权利要求29-35任一项所述的电子设备,其特征在于,所述电子设备还包括:
    显示单元,用于在开启语音助手后,实时显示采集的音频信号对应的文本。
  37. 一种电子设备,其特征在于,包括处理器和存储器,其中,所述存储器用于存储程序代码,所述处理器用于执行所述程序代码,以实现权利要求1至18任一项所述的方法。
  38. 一种计算机可读存储介质,其特征在于,所述计算机可读存储介质存储有计算机程序,所述计算机程序被处理器执行时,实现权利要求1至18任一项所述的方法。
PCT/CN2024/078662 2023-02-28 2024-02-27 语音交互方法及相关设备 Ceased WO2024179425A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP24763107.0A EP4632733A4 (en) 2023-02-28 2024-02-27 VOICE INTERACTION METHOD AND ASSOCIATED DEVICE
US19/310,862 US20250378833A1 (en) 2023-02-28 2025-08-26 Speech interaction method and related device

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202310224268.2 2023-02-28
CN202310224268.2A CN116229953A (zh) 2023-02-28 2023-02-28 语音交互方法及相关设备

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US19/310,862 Continuation US20250378833A1 (en) 2023-02-28 2025-08-26 Speech interaction method and related device

Publications (1)

Publication Number Publication Date
WO2024179425A1 true WO2024179425A1 (zh) 2024-09-06

Family

ID=86580372

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2024/078662 Ceased WO2024179425A1 (zh) 2023-02-28 2024-02-27 语音交互方法及相关设备

Country Status (4)

Country Link
US (1) US20250378833A1 (zh)
EP (1) EP4632733A4 (zh)
CN (1) CN116229953A (zh)
WO (1) WO2024179425A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN120954111A (zh) * 2025-10-17 2025-11-14 小芒电子商务有限责任公司 一种真人直播识别方法及相关装置

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116229953A (zh) * 2023-02-28 2023-06-06 华为技术有限公司 语音交互方法及相关设备
CN121905167A (zh) * 2023-10-31 2026-04-21 华为技术有限公司 语音助手交互的方法和电子设备
CN120390205A (zh) * 2024-01-26 2025-07-29 华为技术有限公司 感知方法及装置
CN119376683B (zh) * 2024-12-28 2025-06-20 荣耀终端股份有限公司 语音交互的方法和电子设备

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111651041A (zh) * 2020-05-27 2020-09-11 上海龙旗科技股份有限公司 移动设备的抬起唤醒方法及系统
CN113377206A (zh) * 2021-07-05 2021-09-10 安徽淘云科技股份有限公司 词典笔抬起唤醒方法、装置和设备
CN113724699A (zh) * 2021-09-18 2021-11-30 优奈柯恩(北京)科技有限公司 设备唤醒识别模型训练方法、设备唤醒控制方法及装置
CN114283798A (zh) * 2021-07-15 2022-04-05 海信视像科技股份有限公司 手持设备的收音方法及手持设备
CN114341779A (zh) * 2019-09-04 2022-04-12 脸谱科技有限责任公司 用于基于神经肌肉控制执行输入的系统、方法和界面
WO2022228056A1 (zh) * 2021-04-30 2022-11-03 华为技术有限公司 一种人机交互方法及设备
CN115588435A (zh) * 2022-11-08 2023-01-10 荣耀终端有限公司 语音唤醒方法及电子设备
CN116229953A (zh) * 2023-02-28 2023-06-06 华为技术有限公司 语音交互方法及相关设备

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104657105B (zh) * 2015-01-30 2016-10-26 腾讯科技(深圳)有限公司 一种开启终端的语音输入功能的方法和装置
DE112019000018B4 (de) * 2018-05-07 2025-04-30 Apple Inc. Anheben, um zu sprechen

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP4632733A1

Also Published As

Publication number Publication date
EP4632733A1 (en) 2025-10-15
EP4632733A4 (en) 2026-04-01
CN116229953A (zh) 2023-06-06
US20250378833A1 (en) 2025-12-11

Legal Events

Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application Ref document number: 24763107; Country of ref document: EP; Kind code of ref document: A1
WWE Wipo information: entry into national phase Ref document number: 2024763107; Country of ref document: EP
ENP Entry into the national phase Ref document number: 2024763107; Country of ref document: EP; Effective date: 20250711
NENP Non-entry into the national phase Ref country code: DE
WWP Wipo information: published in national office Ref document number: 2024763107; Country of ref document: EP